I am creating a couple of large database tables with at least hundreds of millions of observations each, and growing. Some tables are at minute resolution, some at millisecond resolution. Timestamps are not necessarily unique.
Should I create separate year, month, or date and time columns, or is a single datetime column enough? At what size would you partition the tables?
The raw data is in CSV.
Currently I'm aiming for Postgres and DuckDB. Does TimescaleDB make a significant difference?
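For context, here is a minimal sketch of the single-datetime-column layout being asked about, using DuckDB's Python API; the file name, column names, and types are placeholder assumptions, not a recommendation.

```python
import duckdb

# Minimal sketch of the single-timestamp-column layout; "ticks.csv" and its
# column names (ts, value) are placeholder assumptions, not a recommendation.
con = duckdb.connect("observations.duckdb")
con.execute("""
    CREATE OR REPLACE TABLE observations AS
    SELECT CAST(ts AS TIMESTAMP) AS ts,  -- one timestamp column, duplicates allowed
           value
    FROM read_csv_auto('ticks.csv')
""")

# Year/month never need their own columns; they can be derived at query time.
print(con.execute("""
    SELECT date_trunc('month', ts) AS month, count(*) AS n
    FROM observations
    GROUP BY 1
    ORDER BY 1
""").fetchall())
```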
Hy (a Lisp built on top of Python, similar to how Clojure is built on top of the JVM) released v1 recently. I couldn't resist playing with it and found it worked so nicely. Thanks to all the maintainers for creating a great language!
Hey there Data Engineers. Want to stop putting out fires and start preventing them? Then it might be time to "shift left." By tackling quality, governance, and security from the get-go, you'll save time, money, and headaches.
If you want to learn more, follow the paywall-bypassed link to my latest article. I hope some of you find this useful!
Master Apache Iceberg with this comprehensive guide by Dremio. Get expert insights on how to optimize big data management with open table formats.
Book Preface:
Welcome to Apache Iceberg: The Definitive Guide! We’re delighted you have embarked on this learning journey with us. In this preface, we provide an overview of this book, why we wrote it, and how you can make the most of it.
About This Book
In these pages, you’ll learn what Apache Iceberg is, why it exists, how it works, and how to harness its power. Designed for data engineers, architects, scientists, and analysts working with large datasets across various use cases from BI dashboards to AI/ML, this book explores the core concepts, inner workings, and practical applications of Apache Iceberg. By the time you reach the end, you will have grasped the essentials and possess the practical knowledge to implement Apache Iceberg effectively in your data projects. Whether you are a newcomer or an experienced practitioner, Apache Iceberg: The Definitive Guide will be your trusted companion on this enlightening journey into Apache Iceberg.
We respond to Pinecone's recent blog post comparing Postgres and Pinecone. We show that Postgres can outperform Pinecone in the same benchmarks Pinecone covered in their article.
July 18, 2024
Narek Galstyan writes:
We were naturally curious when we saw Pinecone's blog post comparing Postgres and Pinecone.
In their post on Postgres, Pinecone recognizes that Postgres is easy to start with as a vector database, since most developers are familiar with it. However, they argue that Postgres falls short in terms of quality. They describe issues with index size predictability, index creation resource intensity, metadata filtering performance, and cost.
This post is our response to Pinecone's blog post. With just 20 lines of additional code, Postgres with the pgvector or Lantern extension outperforms Pinecone in the same benchmarks, reaching 90% recall (compared to Pinecone's 60%) at under 200 ms p95 latency.
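For readers who want a feel for what that setup looks like, below is a minimal pgvector sketch in Python; the DSN, table name, vector dimension, and tuning values are illustrative assumptions rather than the exact benchmark configuration from the post.

```python
import psycopg2

# Placeholder DSN, table, and dimension; index and search parameters are
# illustrative, not the exact benchmark configuration.
conn = psycopg2.connect("dbname=vectors user=postgres")
conn.autocommit = True
cur = conn.cursor()

cur.execute("CREATE EXTENSION IF NOT EXISTS vector")
cur.execute("""
    CREATE TABLE IF NOT EXISTS items (
        id bigserial PRIMARY KEY,
        embedding vector(768)
    )
""")

# HNSW index: m and ef_construction trade build time and memory for recall.
cur.execute("""
    CREATE INDEX IF NOT EXISTS items_embedding_idx
    ON items USING hnsw (embedding vector_cosine_ops)
    WITH (m = 16, ef_construction = 64)
""")

# Raising ef_search at query time is the usual knob for pushing recall higher.
cur.execute("SET hnsw.ef_search = 100")
query_vec = "[" + ",".join(["0.1"] * 768) + "]"
cur.execute(
    "SELECT id FROM items ORDER BY embedding <=> %s::vector LIMIT 10",
    (query_vec,),
)
print(cur.fetchall())
```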
Many in the data space are now aware of Iceberg and its powerful features that bring database-like functionality to files stored in the likes of S3 or GCS. But Iceberg is just one piece of the puzzle when it comes to transforming files in a data lake into a Lakehouse capable of analytical and ML workloads. Along with Iceberg, which is primarily a table format, a query engine is also required to run queries against the tables and schemas managed by Iceberg. In this post we explore some of the query engines available to those looking to build a data stack around Iceberg: Snowflake, Spark, Trino, and DuckDB.
...
DuckDB + Iceberg Example
We will be loading 12 months of NYC yellow cab trip data (April 2023 - April 2024) into Iceberg tables and demonstrating how DuckDB can query these tables.
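The gist of the DuckDB side looks something like the sketch below; the Iceberg table path is a placeholder (reading from S3 also needs the httpfs extension and credentials), and only the column names follow the standard NYC yellow-cab schema.

```python
import duckdb

con = duckdb.connect()
for stmt in ("INSTALL iceberg", "LOAD iceberg", "INSTALL httpfs", "LOAD httpfs"):
    con.execute(stmt)

# Placeholder location; point this at wherever the Iceberg table actually lives.
trips = "s3://my-lakehouse/nyc_taxi/yellow_trips"

con.sql(f"""
    SELECT date_trunc('month', tpep_pickup_datetime) AS month,
           count(*)                                  AS trips
    FROM iceberg_scan('{trips}', allow_moved_paths = true)
    GROUP BY 1
    ORDER BY 1
""").show()
```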
Summary: This blog post is a detailed story about how I ported a popular data quality framework, AWS Deequ, to Spark Connect. Deequ is a very cool, reliable, and scalable framework that lets you compute a wide range of metrics, checks, and anomaly detection suites on your data using an Apache Spark cluster. But ...
Let me share my post with a detailed step-by-step guide on how an existing Spark Scala library can be adapted to work with the recently introduced Spark Connect. As an example, I chose a popular open source data quality tool, AWS Deequ. I wrote all the necessary protobuf messages and a Spark Connect plugin, tested it from PySpark Connect 3.5.1, and it works. Of course, all the code is public in Git.
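For anyone who hasn't used Spark Connect from Python yet, the client side is just a remote SparkSession, roughly like the sketch below; the server address is a placeholder, and the Deequ-specific plugin calls from the post are omitted since they live in the linked repository.

```python
from pyspark.sql import SparkSession

# Spark Connect client: the session talks to a remote Spark Connect server
# instead of starting a local JVM. "sc://localhost:15002" is a placeholder.
spark = SparkSession.builder.remote("sc://localhost:15002").getOrCreate()

df = spark.range(1_000).withColumnRenamed("id", "record_id")
print(df.count())

# The Deequ checks themselves go through the custom Spark Connect plugin and
# its protobuf messages, which are specific to the linked repository.
```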
Time and again I see the same questions asked: "Why should I use dbt?" or "I don't understand what value dbt offers". So I thought I'd put together an article that touches on some of the benefits, as well as a step-by-step walkthrough of setting up a new project (using DuckDB as the database), complete with an associated GitHub repo for you to take a look at.
Having used dbt since early 2018, and with my partner being a dbt trainer, I hope that this article is useful for some of you. The link bypasses the paywall.
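If you prefer driving dbt from Python rather than the CLI, a minimal sketch using the programmatic dbtRunner API (dbt-core 1.5+) looks like this; it assumes a project already configured for the dbt-duckdb adapter, and the selector is just an example.

```python
from dbt.cli.main import dbtRunner, dbtRunnerResult

# Programmatic invocation (dbt-core 1.5+). Assumes the working directory holds
# a dbt project whose profile targets the dbt-duckdb adapter; the selector
# "staging" is just an example.
dbt = dbtRunner()
res: dbtRunnerResult = dbt.invoke(["run", "--select", "staging"])

if not res.success:
    raise SystemExit(f"dbt run failed: {res.exception}")

for r in res.result:
    print(f"{r.node.name}: {r.status}")
```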
If you're a Data Engineer, before long you'll be asked to build a real-time pipeline.
In my latest article, I build a real-time pipeline using Kafka, Polars and Delta tables to demonstrate how these can work together. Everything is available to try yourself in the associated GitHub repo. So if you're curious, take a moment to check out this technical post.
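To give a rough idea of the shape of such a pipeline, here is a small Python sketch; the broker address, topic name, message format, and Delta table path are all assumptions, not the article's actual implementation.

```python
import json

import polars as pl
from confluent_kafka import Consumer

# Rough sketch only: the broker, topic, JSON payloads, and local Delta path
# are assumptions, not the article's actual setup.
consumer = Consumer({
    "bootstrap.servers": "localhost:9092",
    "group.id": "polars-delta-demo",
    "auto.offset.reset": "earliest",
})
consumer.subscribe(["events"])

batch = []
try:
    while True:
        msg = consumer.poll(timeout=1.0)
        if msg is None or msg.error():
            continue
        batch.append(json.loads(msg.value()))
        if len(batch) >= 1_000:  # micro-batch: flush every 1k messages
            pl.DataFrame(batch).write_delta("./events_delta", mode="append")
            batch.clear()
finally:
    consumer.close()
```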
As a Data Engineer in a fast-paced company, you know that Entity Relationship Diagrams (ERDs) are essential for documenting and communicating database structures. However, traditional no-code tools…
How often do you build and edit Entity Relationship Diagrams? If the answer is ‘more often than I’d like’, and you’re fed up with tweaking your diagrams, take <5 minutes to read my latest article on building your diagrams with code. Track their changes in GitHub, have them build as part of your CI/CD pipeline, and even drop them into your dbt docs if you like.
This is a ‘friends and family’ link, so it’ll bypass the usual Medium paywall.
I’m not affiliated with the tool I’ve chosen in any way; I just like how it works.
Marking a major investment in Meta’s AI future, we are announcing two 24k GPU clusters. We are sharing details on the hardware, network, storage, design, performance, and software that help us extr…
Building data abstractions with streaming at Yelp
Mar 8, 2024 | Hakampreet Singh Pandher writes:
Yelp relies heavily on streaming to synchronize enormous volumes of data in real time. This is facilitated by Yelp’s underlying data pipeline infrastructure, which manages the real-time flow of millions of messages originating from a plethora of services. This blog post covers how we leverage Yelp’s extensive streaming infrastructure to build robust data abstractions for our offline and streaming data consumers. We will use Yelp’s Business Properties ecosystem (explained in the upcoming sections) as an example.
The buzz of ‘Big Data’ has passed. Terabytes of data are the new normal, and efficiently managing and processing data is more critical than ever. Companies across industries strive to harness the…
I’ve written a series of Medium articles on creating a data pipeline from scratch, using Polars and Delta tables. The first (linked) is an overview with links to the GitHub repository and to each of the deeper-dive articles. I then go into the next level of detail, walking through each component.
The articles are paywalled (it took time to build and document), but the link provided is the ‘family & friends’ link which bypasses the paywall for the Lemmy community.
I am looking for some advice to help me out at my job. Apologies if this is the wrong place to ask.
So, basically, my boss is a complete technophobe, and all of our data is stored across multiple Excel files in Dropbox. I'm looking for a way to move that into a centralized database. I know my way around a computer, but writing code is not something I have ever been able to grasp well.
The main issue with our situation is that our workers are all completely remote, and no, I don't mean working from a home office in the suburbs. They use small laptops with no data connection and go out gathering data every day from a variety of locations, sometimes without even cell coverage.
We need up to 20 people entering data all day long and then updating a centralized database at the end of the day, when they get back home and have an internet connection. It will generally all be new entries; no one will need to update old entries.
A demonstration of how Terraform can be used to manage Snowflake infrastructure - nydasco/snowflake-terraform-demo
A few years ago, if you'd mentioned Infrastructure-as-Code (IaC) to me, I would've given you a puzzled look. However, I'm now on the bandwagon. And to help others understand how it can benefit them, I've pulled together a simple GitHub repo that showcases how Terraform can be used with Snowflake to manage users, roles, warehouses, and databases.
The README hopefully gives anyone who wants to give it a go the ability to step through and see the results. I'm sharing this in the hope that it is useful to some of you.
In big data processing and analytics, choosing the right tool is paramount for efficiently extracting meaningful insights from vast datasets. Two popular frameworks that have gained significant traction in the industry are Apache Spark and Presto. Both are designed to handle large-scale data processing efficiently, yet they have distinct features and use cases. As organizations grapple with the complexities of handling massive volumes of data, a comprehensive understanding of Spark and Presto’s nuances and distinctive features becomes essential. In this article, we will compare Spark vs Presto, exploring their performance and scalability, data processing capabilities, ecosystem, integration, and use cases and applications.