I am creating a couple of large database tables with at least hundreds of millions of observations each, and growing. Some tables are at minute resolution, some at millisecond resolution. Timestamps are not necessarily unique.
Should I create separate year, month, or date and time columns, or is a single datetime column enough? At what size would you partition the tables?
The raw data is in CSV.
Currently I'm aiming for Postgres and DuckDB. Does TimescaleDB make a significant difference?
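For context, here is a minimal sketch of the single-datetime-column layout being asked about, using DuckDB's Python API; the file name, column names, and types are placeholder assumptions, not a recommendation.

```python
import duckdb

# Minimal sketch of the single-timestamp-column layout; "ticks.csv" and its
# column names (ts, value) are placeholder assumptions, not a recommendation.
con = duckdb.connect("observations.duckdb")
con.execute("""
    CREATE OR REPLACE TABLE observations AS
    SELECT CAST(ts AS TIMESTAMP) AS ts,  -- one timestamp column, duplicates allowed
           value
    FROM read_csv_auto('ticks.csv')
""")

# Year/month never need their own columns; they can be derived at query time.
print(con.execute("""
    SELECT date_trunc('month', ts) AS month, count(*) AS n
    FROM observations
    GROUP BY 1
    ORDER BY 1
""").fetchall())
```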
Hy (a Lisp built on top of Python, similar to how Clojure is built on top of the JVM) released v1 recently. I couldn't resist playing with it and found it worked so nicely. Thanks to all the maintainers for creating a great language!
Hey there Data Engineers. Want to stop putting out fires and start preventing them? Then it might be time to "shift left." By tackling quality, governance, and security from the get-go, you'll save time, money, and headaches.
If you want to learn more, follow the paywall-bypassed link to my latest article. I hope some of you find this useful!
Master Apache Iceberg with this comprehensive guide by Dremio. Get expert insights on how to optimize big data management with open table formats.
Book Preface:
Welcome to Apache Iceberg: The Definitive Guide! We’re delighted you have embarked on this learning journey with us. In this preface, we provide an overview of this book, why we wrote it, and how you can make the most of it.
About This Book
In these pages, you’ll learn what Apache Iceberg is, why it exists, how it works, and how to harness its power. Designed for data engineers, architects, scientists, and analysts working with large datasets across various use cases from BI dashboards to AI/ML, this book explores the core concepts, inner workings, and practical applications of Apache Iceberg. By the time you reach the end, you will have grasped the essentials and possess the practical knowledge to implement Apache Iceberg effectively in your data projects. Whether you are a newcomer or an experienced practitioner, Apache Iceberg: The Definitive Guide will be your trusted companion on this enlightening journey into Apache Iceberg.
We respond to Pinecone's recent blog post comparing Postgres and Pinecone. We show that Postgres can outperform Pinecone in the same benchmarks Pinecone covered in their article.
July 18, 2024
Narek Galstyan writes:
We were naturally curious when we saw Pinecone's blog post comparing Postgres and Pinecone.
In their post on Postgres, Pinecone recognizes that Postgres is easy to start with as a vector database, since most developers are familiar with it. However, they argue that Postgres falls short in terms of quality. They describe issues with index size predictability, index creation resource intensity, metadata filtering performance, and cost.
This post is our response to Pinecone's blog post. With just 20 lines of additional code, Postgres with the pgvector or Lantern extension outperforms Pinecone in the same benchmarks, reaching 90% recall (compared to Pinecone's 60%) at under 200 ms p95 latency.
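For readers who want a feel for what that setup looks like, below is a minimal pgvector sketch in Python; the DSN, table name, vector dimension, and tuning values are illustrative assumptions rather than the exact benchmark configuration from the post.

```python
import psycopg2

# Placeholder DSN, table, and dimension; index and search parameters are
# illustrative, not the exact benchmark configuration.
conn = psycopg2.connect("dbname=vectors user=postgres")
conn.autocommit = True
cur = conn.cursor()

cur.execute("CREATE EXTENSION IF NOT EXISTS vector")
cur.execute("""
    CREATE TABLE IF NOT EXISTS items (
        id bigserial PRIMARY KEY,
        embedding vector(768)
    )
""")

# HNSW index: m and ef_construction trade build time and memory for recall.
cur.execute("""
    CREATE INDEX IF NOT EXISTS items_embedding_idx
    ON items USING hnsw (embedding vector_cosine_ops)
    WITH (m = 16, ef_construction = 64)
""")

# Raising ef_search at query time is the usual knob for pushing recall higher.
cur.execute("SET hnsw.ef_search = 100")
query_vec = "[" + ",".join(["0.1"] * 768) + "]"
cur.execute(
    "SELECT id FROM items ORDER BY embedding <=> %s::vector LIMIT 10",
    (query_vec,),
)
print(cur.fetchall())
```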
Many in the data space are now aware of Iceberg and its powerful features that bring database-like functionality to files stored in the likes of S3 or GCS. But Iceberg is just one piece of the puzzle when it comes to transforming files in a data lake into a Lakehouse capable of analytical and ML workloads. Along with Iceberg, which is primarily a table format, a query engine is also required to run queries against the tables and schemas managed by Iceberg. In this post we explore some of the query engines available to those looking to build a data stack around Iceberg: Snowflake, Spark, Trino, and DuckDB.
...
DuckDB + Iceberg Example
We will be loading 12 months of NYC yellow cab trip data (April 2023 - April 2024) into Iceberg tables and demonstrating how DuckDB can query these tables.
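The gist of the DuckDB side looks something like the sketch below; the Iceberg table path is a placeholder (reading from S3 also needs the httpfs extension and credentials), and only the column names follow the standard NYC yellow-cab schema.

```python
import duckdb

con = duckdb.connect()
for stmt in ("INSTALL iceberg", "LOAD iceberg", "INSTALL httpfs", "LOAD httpfs"):
    con.execute(stmt)

# Placeholder location; point this at wherever the Iceberg table actually lives.
trips = "s3://my-lakehouse/nyc_taxi/yellow_trips"

con.sql(f"""
    SELECT date_trunc('month', tpep_pickup_datetime) AS month,
           count(*)                                  AS trips
    FROM iceberg_scan('{trips}', allow_moved_paths = true)
    GROUP BY 1
    ORDER BY 1
""").show()
```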
Summary: This blog post is a detailed story about how I ported a popular data quality framework, AWS Deequ, to Spark Connect. Deequ is a very cool, reliable, and scalable framework that lets you compute a wide range of metrics, checks, and anomaly detection suites on your data using an Apache Spark cluster. But ...
Let me share my post with a detailed step-by-step guide on how an existing Spark Scala library can be adapted to work with the recently introduced Spark Connect. As an example, I chose a popular open source data quality tool, AWS Deequ. I wrote all the necessary protobuf messages and a Spark Connect plugin, tested it from PySpark Connect 3.5.1, and it works. Of course, all the code is public in Git.
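For anyone who hasn't used Spark Connect from Python yet, the client side is just a remote SparkSession, roughly like the sketch below; the server address is a placeholder, and the Deequ-specific plugin calls from the post are omitted since they live in the linked repository.

```python
from pyspark.sql import SparkSession

# Spark Connect client: the session talks to a remote Spark Connect server
# instead of starting a local JVM. "sc://localhost:15002" is a placeholder.
spark = SparkSession.builder.remote("sc://localhost:15002").getOrCreate()

df = spark.range(1_000).withColumnRenamed("id", "record_id")
print(df.count())

# The Deequ checks themselves go through the custom Spark Connect plugin and
# its protobuf messages, which are specific to the linked repository.
```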
Time and again I see the same questions asked: "Why should I use dbt?" or "I don't understand what value dbt offers". So I thought I'd put together an article that touches on some of the benefits, as well as a step-by-step walkthrough of setting up a new project (using DuckDB as the database), complete with an associated GitHub repo for you to take a look at.
Having used dbt since early 2018, and with my partner being a dbt trainer, I hope that this article is useful for some of you. The link bypasses the paywall.
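If you prefer driving dbt from Python rather than the CLI, a minimal sketch using the programmatic dbtRunner API (dbt-core 1.5+) looks like this; it assumes a project already configured for the dbt-duckdb adapter, and the selector is just an example.

```python
from dbt.cli.main import dbtRunner, dbtRunnerResult

# Programmatic invocation (dbt-core 1.5+). Assumes the working directory holds
# a dbt project whose profile targets the dbt-duckdb adapter; the selector
# "staging" is just an example.
dbt = dbtRunner()
res: dbtRunnerResult = dbt.invoke(["run", "--select", "staging"])

if not res.success:
    raise SystemExit(f"dbt run failed: {res.exception}")

for r in res.result:
    print(f"{r.node.name}: {r.status}")
```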
If you're a Data Engineer, before long you'll be asked to build a real-time pipeline.
In my latest article, I build a real-time pipeline using Kafka, Polars and Delta tables to demonstrate how these can work together. Everything is available to try yourself in the associated GitHub repo. So if you're curious, take a moment to check out this technical post.
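To give a rough idea of the shape of such a pipeline, here is a small Python sketch; the broker address, topic name, message format, and Delta table path are all assumptions, not the article's actual implementation.

```python
import json

import polars as pl
from confluent_kafka import Consumer

# Rough sketch only: the broker, topic, JSON payloads, and local Delta path
# are assumptions, not the article's actual setup.
consumer = Consumer({
    "bootstrap.servers": "localhost:9092",
    "group.id": "polars-delta-demo",
    "auto.offset.reset": "earliest",
})
consumer.subscribe(["events"])

batch = []
try:
    while True:
        msg = consumer.poll(timeout=1.0)
        if msg is None or msg.error():
            continue
        batch.append(json.loads(msg.value()))
        if len(batch) >= 1_000:  # micro-batch: flush every 1k messages
            pl.DataFrame(batch).write_delta("./events_delta", mode="append")
            batch.clear()
finally:
    consumer.close()
```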
As a Data Engineer in a fast-paced company, you know that Entity Relationship Diagrams (ERDs) are essential for documenting and communicating database structures. However, traditional no-code tools…
How often do you build and edit Entity Relationship Diagrams? If the answer is ‘more often than I’d like’, and you’re fed up with tweaking your diagrams, take <5 minutes to read my latest article on building your diagrams with code. Track their changes in GitHub, have them build as part of your CI/CD pipeline, and even drop them into your dbt docs if you like.
This is a ‘friends and family’ link, so it’ll bypass the usual Medium paywall.
I’m not affiliated with the tool I’ve chosen in any way; I just like how it works.
Marking a major investment in Meta’s AI future, we are announcing two 24k GPU clusters. We are sharing details on the hardware, network, storage, design, performance, and software that help us extr…
Building data abstractions with streaming at Yelp
Mar 8, 2024 | Hakampreet Singh Pandher writes:
Yelp relies heavily on streaming to synchronize enormous volumes of data in real time. This is facilitated by Yelp’s underlying data pipeline infrastructure, which manages the real-time flow of millions of messages originating from a plethora of services. This blog post covers how we leverage Yelp’s extensive streaming infrastructure to build robust data abstractions for our offline and streaming data consumers. We will use Yelp’s Business Properties ecosystem (explained in the upcoming sections) as an example.
The buzz of ‘Big Data’ has passed. Terabytes of data are the new normal, and efficiently managing and processing data is more critical than ever. Companies across industries strive to harness the…
I’ve written a series of Medium articles on creating a data pipeline from scratch, using Polars and Delta tables. The first (linked) is an overview with links to the GitHub repository and to each of the deeper-dive articles. I then go into the next level of detail, walking through each component.
The articles are paywalled (it took time to build and document), but the link provided is the ‘family & friends’ link which bypasses the paywall for the Lemmy community.
I am looking for some advice to help me out at my job. Apologies if this is the wrong place to ask.
So, basically, my boss is a complete technophobe, and all of our data is stored across multiple Excel files in Dropbox. I'm looking for a way to move that into a centralized database. I know my way around a computer, but writing code is not something I have ever been able to grasp well.
The main issue with our situation is that our workers are all completely remote, and no, I don't mean working from a home office in the suburbs. They use small laptops with no data connection and go out gathering data every day from a variety of locations, sometimes without even cell coverage.
We need up to 20 people entering data all day long and then updating a centralized database at the end of the day, when they get back home and have an internet connection. It will generally all be new entries; no one will need to update old entries.
A demonstration of how Terraform can be used to manage Snowflake infrastructure - nydasco/snowflake-terraform-demo
A few years ago, if you'd mentioned Infrastructure-as-Code (IaC) to me, I would've given you a puzzled look. However, I'm now on the bandwagon. And to help others understand how it can benefit them, I've pulled together a simple GitHub repo that showcases how Terraform can be used with Snowflake to manage users, roles, warehouses, and databases.
The README hopefully gives anyone who wants to give it a go the ability to step through and see the results. I'm sharing this in the hope that it is useful to some of you.
In big data processing and analytics, choosing the right tool is paramount for efficiently extracting meaningful insights from vast datasets. Two popular frameworks that have gained significant traction in the industry are Apache Spark and Presto. Both are designed to handle large-scale data processing efficiently, yet they have distinct features and use cases. As organizations grapple with the complexities of handling massive volumes of data, a comprehensive understanding of Spark and Presto’s nuances and distinctive features becomes essential. In this article, we will compare Spark vs Presto, exploring their performance and scalability, data processing capabilities, ecosystem, integration, and use cases and applications.