Data processing

Bigslice - System for fast, large-scale, serverless data processing using Go.

Reflow - Language and runtime for distributed, incremental data processing in the cloud.

Self-managing serverless computing with Bigmachine (2019)

Bigslice: a cluster computing system for Go (2019)

When your data doesn’t fit in memory: the basic techniques (2019) (HN)

Differential Dataflow - Implementation of differential dataflow using timely dataflow in Rust. (Book) (HN)

The Log: What every software engineer should know about real-time data's unifying abstraction (2013)

Luna - Data processing and visualization environment built on a principle that people need an immediate connection to what they are building.

Guide To The Data Lake — Modern Batch Data Warehousing (2020)

Plumbing At Scale (2020) - Event Sourcing and Stream Processing Pipelines at Grab.

Differential Dataflow! But at what COST? (2017) (HN)

Timely Dataflow and Total Order (2020)

Nuclio - High-Performance Serverless event and data processing platform.

Apache Spark - Unified analytics engine for large-scale data processing. (PySpark) (PySpark Style Guide) (Article) (Web) (Spark Learning Guide)

Spark: The Definitive Guide Book (2018) (Code)

Batch - Event replay platform. Version control for data passing through your messaging systems. (HN)

A log/event processing pipeline you can't have (2019) (HN)

mm-ADT - Multi-Model Abstract Data Type. Distributed virtual machine capable of integrating a diverse collection of data processing technologies. (Code)

Data Preprocessing in Machine Learning (2020)

lakeFS - Open source layer that delivers resilience and manageability to object-storage based data lakes. (Web)

Baker - High performance, composable and extendable data-processing pipeline for the big data era.

Cylon - Fast, scalable distributed memory data parallel library for processing structured data. (Web)

cuGraph - GPU Graph Analytics.

Opaque - Secure Apache Spark SQL.

Apache Beam - Unified programming model for Batch and Streaming. (Web)

Stitch - Simple, extensible ETL built for data teams.

Databricks - Unified Data Analytics. (GitHub) (CLI) (Reflecting on Four Years at Databricks (2021))

AugMix - Simple Data Processing Method to Improve Robustness and Uncertainty.

Snapflow - Framework for building end-to-end functional data pipelines from modular components.

Workflow Description Language (WDL) - Way to specify data processing workflows with a human-readable and writeable syntax.

Cloudfuse - Open source serverless data solutions. Future of data pipelines. (GitHub)

Create your own data stream for Kafka with Python and Faker (2021)

Hindsight - C-based data processing infrastructure based on the Lua sandbox project.

Reverse ETL — A Primer (2021)

I wrote one of the fastest DataFrame libraries (2021)

Build your own “data lake” for reporting purposes in a multi-services environment (2021)

Feature Stores: The Data Side of ML Pipelines (2021)

Flowgger - Fast, simple and lightweight data collector written in Rust.

Popsink - Real-time data platform you don't have to build.

Flyte - Structured programming and distributed processing platform that enables highly concurrent, scalable and maintainable workflows for Machine Learning and Data Processing. (Web) (GitHub) (Python SDK) (CLI)

Winterfell - Distributed STARK prover.

Python to Distributed Python to Airflow task in ~5 lines of code

DataFusion - Extensible query execution framework, written in Rust, that uses Apache Arrow as its in-memory format.

Delta Lake - Reliable Data Lakes at Scale. (GitHub)

Delta Sharing - Open Protocol for Secure Data Sharing. (Article) (Tweet)

Dataform - Manage data pipelines in BigQuery.

Legate Pandas - Aspiring Drop-In Replacement for Pandas at Scale.

datablocks - Flow based data processing editor. (HN)

Reproducible data processing pipelines (2021)

datasketch - Probabilistic data structures that can process and search very large amounts of data super fast, with little loss of accuracy.

Tuplex - Parallel big data processing framework that runs data science pipelines written in Python at the speed of compiled code. (Web)

file.d - Blazing fast tool for building data pipelines: read, process and output events.

Datafuse - Modern Real-Time Data Processing in Rust. (Code) (HN)

MapReduce is making a comeback (2021) (HN)

SciPipe - Robust, flexible and resource-efficient pipelines using Go and the command line. (Docs)

The Future Is Big Graphs: A Community View on Graph Processing Systems (2021) (HN)

What Is the Data Lakehouse Pattern? (HN)

Apache Hadoop - Open-source software for reliable, scalable, distributed computing. (Is Hadoop Dead?) (Code)

go-stash - High performance, free and open source server-side data processing pipeline that ingests data from Kafka, processes it, and then sends it to ElasticSearch.

pypely - Make your data processing easy - build pipelines in a functional manner.

An opinionated map of incremental and streaming systems (2021)

Crossjoin - Joins together your data from anywhere.

Ceramic Network - Decentralized, open source platform for creating, hosting, and sharing streams of data. (TS Code) (GitHub) (Doc)

Graphite-Web - Highly scalable real-time graphing system. (Docs)

vega - Faster implementation of Apache Spark from scratch in Rust.

Memgraph - Build modern, graph-based applications on top of your streaming data in minutes. (Web)

Apache Parquet - Columnar storage format that supports nested data. (Code)

Data Pipelines Pocket Reference Book (2021) (Code)

miniwdl - Workflow Description Language developer tools & local runner.

Rain - Framework for large distributed pipelines.

Apache SeaTunnel - Distributed, high-performance data integration platform for the synchronization and transformation of massive data (offline & real-time). (Code)

Databend - Open Source Serverless Data Warehouse for Everyone. (Web)

Pydra - Simple dataflow engine with scalable semantics.

Bytewax - Open source Python framework for building highly scalable dataflows.

Atomic Data - Modular specification for sharing, modifying and modeling graph data. (Code) (Rust Code)

Apache Arrow Flight SQL: Accelerating Database Access (2022) (HN)

Grist - Modern relational spreadsheet. Open core alternative to Airtable and Google Sheets. (HN)

Data Engineering Practice Problems

Dagster: Rebundling the Data Platform (2022)

cq - Clojure Command-line Data Processor for JSON, YAML, EDN, XML and more.

utt - Universal text transformer.

Loggie - Lightweight, high-performance, cloud-native agent and aggregator based on Go.

ter - CLI to run text expressions and perform basic text operations such as filtering, ignoring and replacing on the command line.

csv-diff - Python CLI tool and library for diffing CSV and JSON files.

pqrs - Command line tool for inspecting Parquet files.

Kestra - Infinitely scalable open source orchestration & scheduling platform. (Code) (HN)

TiFlash - Analytical engine for TiDB.

Streamify - Data pipeline with Kafka, Spark Streaming, dbt, Docker, Airflow, Terraform, GCP and much more.

DTL - Language and JavaScript lib to transform and manipulate data. (HN)

Hawk - Haskell text processor for the command-line.

Alternatives to pandas library

Zed - Tooling for super-structured data: a new and easier way to manipulate data. (Web)

Fast Analysis with DuckDB + PyArrow (2022) - Trying out some new speedy tools for data analysis.

Why isn’t there a decent file format for tabular data? (2022) (HN)

Data Engineering Wiki (Code)

csv-clean - Command line tool to clean up malformed CSV files.

rq - Tool for doing record analysis and transformation.

Data Integration Guide: Techniques, Technologies, and Tools (2022)

Mito - Excel-like interface for Pandas dataframes in Jupyter notebook. (HN)

Tornado - Complex Event Processor that receives reports of events from data sources such as monitoring, email, and telegram, and matches them against pre-configured rules.

Meet Dash-AB — The Statistics Engine of Experimentation at DoorDash (2022)

dataPipe - Data processing and data analytics library for JavaScript.

gosquito - Pluggable tool for data gathering, data processing and data transmitting to various destinations.

DLT - Enables simple Python-native data pipelining for data professionals.

PipeRider - Toolkit for detecting data issues across pipelines that works with CI systems for continuous data quality assessment.

airflint - Enforce Best Practices for all your Airflow DAGs.

Scaling our Spreadsheet Engine from Thousands to Billions of Cells (2022) (HN) (Lobsters)

qv - Simple CLI to quickly view your data. Powered by DataFusion.

Airflow's Problem (2022) (HN)
