@daefresh
Created July 11, 2022 11:40
Top 100+ data engineering repos in GitHub for 2022
Repo Name Stars GitHub URL Project URL Project Description
airbyte 7176 https://github.com/airbytehq/airbyte https://airbyte.com Airbyte is an open-source EL(T) platform that helps you replicate your data in your warehouses, lakes, and databases.
amundsen 3389 https://github.com/amundsen-io/amundsen https://www.amundsen.io/amundsen/ Amundsen is a metadata driven application for improving the productivity of data analysts, data scientists, and engineers when interacting with data.
arangodb 12377 https://github.com/arangodb/arangodb https://www.arangodb.com 🥑 ArangoDB is a native multi-model database with flexible data models for documents, graphs, and key-values. Build high performance applications using a convenient SQL-like query language or JavaScript extensions.
arctic 2729 https://github.com/man-group/arctic https://arctic.readthedocs.io/en/latest/ High performance datastore for time series and tick data
arrow-datafusion 2173 https://github.com/apache/arrow-datafusion https://arrow.apache.org/datafusion Apache Arrow DataFusion SQL Query Engine
aws-data-wrangler 2914 https://github.com/awslabs/aws-data-wrangler https://aws-data-wrangler.readthedocs.io Pandas on AWS - Easy integration with Athena, Glue, Redshift, Timestream, Neptune, OpenSearch, QuickSight, Chime, CloudWatchLogs, DynamoDB, EMR, SecretManager, PostgreSQL, MySQL, SQLServer, and S3 (Parquet, CSV, JSON, and EXCEL).
benthos 4529 https://github.com/benthosdev/benthos https://www.benthos.dev Fancy stream processing made operationally mundane
cayley 14227 https://github.com/cayleygraph/cayley https://cayley.io An open-source graph database
ClickHouse 24377 https://github.com/ClickHouse/ClickHouse https://clickhouse.com ClickHouse® is a free analytics DBMS for big data
cockroach 25018 https://github.com/cockroachdb/cockroach https://www.cockroachlabs.com CockroachDB - the open source cloud-native distributed SQL database.
cog 2569 https://github.com/replicate/cog Containers for machine learning
composer 2216 https://github.com/mosaicml/composer http://docs.mosaicml.com train neural networks up to 7x faster
crate 3434 https://github.com/crate/crate https://crate.io/products/cratedb/ CrateDB is a distributed SQL database that makes it simple to store and analyze massive amounts of machine data in real-time.
cudf 4839 https://github.com/rapidsai/cudf http://rapids.ai cuDF - GPU DataFrame Library
dagster 4934 https://github.com/dagster-io/dagster https://dagster.io An orchestration platform for the development, production, and observation of data assets.
dash 16751 https://github.com/plotly/dash https://plotly.com/dash Analytical Web Apps for Python, R, Julia, and Jupyter. No JavaScript Required.
databend 4181 https://github.com/datafuselabs/databend https://databend.rs A modern Elasticity and Performance cloud data warehouse, activate your object storage for real-time analytics. Cloud at https://app.databend.com/
DataFrames.jl 1397 https://github.com/JuliaData/DataFrames.jl https://dataframes.juliadata.org/stable/ In-memory tabular data in Julia
datahub 5724 https://github.com/datahub-project/datahub https://datahubproject.io The Metadata Platform for the Modern Data Stack
dbt-core 5109 https://github.com/dbt-labs/dbt-core https://getdbt.com dbt enables data analysts and engineers to transform their data using the same practices that software engineers use to build applications.
debezium 7004 https://github.com/debezium/debezium https://debezium.io Change data capture for a variety of databases. Please log issues at https://issues.redhat.com/browse/DBZ.
dgraph 18178 https://github.com/dgraph-io/dgraph https://dgraph.io Native GraphQL Database with graph backend
diesel 8640 https://github.com/diesel-rs/diesel https://diesel.rs A safe, extensible ORM and Query Builder for Rust
dolt 12310 https://github.com/dolthub/dolt Dolt – It's Git for Data
dremio-oss 1086 https://github.com/dremio/dremio-oss https://www.dremio.com Dremio - the missing link in modern data
duckdb 5426 https://github.com/duckdb/duckdb http://www.duckdb.org DuckDB is an in-process SQL OLAP Database Management System
dvc 9954 https://github.com/iterative/dvc https://dvc.org 🦉Data Version Control Git for Data & Models ML Experiments Management
edgedb 8003 https://github.com/edgedb/edgedb https://edgedb.com A next-generation graph-relational database.
elementary 611 https://github.com/elementary-data/elementary https://docs.elementary-data.com Open-source data observability for analytics engineers
faker 14400 https://github.com/joke2k/faker http://faker.rtfd.org Faker is a Python package that generates fake data for you.
feast 3335 https://github.com/feast-dev/feast https://feast.dev Feature Store for Machine Learning
feathr 748 https://github.com/linkedin/feathr https://engineering.linkedin.com/blog/2022/open-sourcing-feathr---linkedin-s-feature-store-for-productive-m Feathr – An Enterprise-Grade High Performance Feature Store
featureform 909 https://github.com/featureform/featureform https://www.featureform.com The Virtual Feature Store. Turn your existing data infrastructure into a feature store.
FerretDB 4516 https://github.com/FerretDB/FerretDB https://www.ferretdb.io A truly Open Source MongoDB alternative
flyte 2458 https://github.com/flyteorg/flyte https://flyte.org Kubernetes-native workflow automation platform for complex, mission-critical data and ML processes at scale. It has been battle-tested at Lyft, Spotify, Freenome, and others and is truly open-source.
flyway 6602 https://github.com/flyway/flyway https://flywaydb.org Flyway by Redgate • Database Migrations Made Easy.
grafana 49742 https://github.com/grafana/grafana https://grafana.com The open and composable observability and data visualization platform. Visualize metrics, logs, and traces from multiple sources like Prometheus, Loki, Elasticsearch, InfluxDB, Postgres, and many more.
great_expectations 6801 https://github.com/great-expectations/great_expectations https://docs.greatexpectations.io/ Always know what to expect from your data.
horovod 12557 https://github.com/horovod/horovod http://horovod.ai Distributed training framework for TensorFlow, Keras, PyTorch, and Apache MXNet.
hudi 3279 https://github.com/apache/hudi https://hudi.apache.org/ Upserts, Deletes And Incremental Processing on Big Data.
ibis 1875 https://github.com/ibis-project/ibis http://ibis-project.org Expressive analytics in Python at any scale.
ignite 4196 https://github.com/apache/ignite https://ignite.apache.org/ Apache Ignite
immudb 7658 https://github.com/codenotary/immudb https://www.codenotary.com/technologies/immudb immudb - immutable database based on zero trust, SQL and Key-Value, tamperproof, data change history
ivy 3005 https://github.com/unifyai/ivy https://lets-unify.ai The Unified Machine Learning Framework
janusgraph 4496 https://github.com/JanusGraph/janusgraph https://janusgraph.org JanusGraph: an open-source distributed graph database
jgrapht 2146 https://github.com/jgrapht/jgrapht http://www.jgrapht.org Master repository for the JGraphT project
keras 55556 https://github.com/keras-team/keras http://keras.io/ Deep Learning for humans
ksql 5058 https://github.com/confluentinc/ksql https://ksqldb.io The database purpose-built for stream processing applications.
lakeFS 2647 https://github.com/treeverse/lakeFS https://lakefs.io Git-like capabilities for your object storage
lightdash 1211 https://github.com/lightdash/lightdash https://lightdash.com An open source alternative to Looker built using dbt. Made for analysts ❤️
lightning 19229 https://github.com/Lightning-AI/lightning https://lightning.ai Build high-performance AI models with PyTorch Lightning (organized PyTorch). Deploy models with Lightning Apps (organized Python to build end-to-end ML systems).
liquibase 3289 https://github.com/liquibase/liquibase https://www.liquibase.org Main Liquibase Source
ludwig 8411 https://github.com/ludwig-ai/ludwig http://ludwig.ai Data-centric declarative deep learning framework
marquez 1106 https://github.com/MarquezProject/marquez https://marquezproject.ai Collect, aggregate, and visualize a data ecosystem's metadata
mars 2449 https://github.com/mars-project/mars https://docs.pymars.org Mars is a tensor-based unified framework for large-scale data computation which scales numpy, pandas, scikit-learn, and Python functions.
materialize 4177 https://github.com/MaterializeInc/materialize https://materialize.com The Fastest Way to Build the Fastest Data Products. Build data-intensive applications and services in SQL — without pipelines or caches — using materialized views that are always up-to-date.
matplotlib 15731 https://github.com/matplotlib/matplotlib https://matplotlib.org/stable matplotlib: plotting with Python
mediapipe 17821 https://github.com/google/mediapipe https://mediapipe.dev Cross-platform, customizable ML solutions for live and streaming media.
metabase 29005 https://github.com/metabase/metabase https://metabase.com The simplest, fastest way to get business intelligence and analytics to everyone in your company :yum:
metaflow 5745 https://github.com/Netflix/metaflow https://metaflow.org :rocket: Build and manage real-life data science projects with ease!
metarank 1457 https://github.com/metarank/metarank https://metarank.ai A low code Machine Learning service that personalizes articles, listings, search results, and recommendations to boost user engagement. A friendly Learn-to-Rank engine
metricflow 614 https://github.com/transform-data/metricflow https://transform.co/metricflow MetricFlow allows you to define, build, and maintain metrics in code.
milvus 11183 https://github.com/milvus-io/milvus https://milvus.io Vector database for scalable similarity search and AI applications.
mindsdb 8199 https://github.com/mindsdb/mindsdb http://mindsdb.com In-Database Machine Learning
modin 7556 https://github.com/modin-project/modin http://modin.readthedocs.io Modin: Scale your Pandas workflows by changing a single line of code
nebula 7600 https://github.com/vesoft-inc/nebula https://nebula-graph.io A distributed, fast open-source graph database featuring horizontal scalability and high availability
neo4j 10180 https://github.com/neo4j/neo4j http://neo4j.com Graphs for Everyone
neon 3452 https://github.com/neondatabase/neon https://neon.tech The serverless open source alternative to AWS Aurora Postgres.
netron 19210 https://github.com/lutzroeder/netron https://netron.app Visualizer for neural network, deep learning, and machine learning models
networkx 10931 https://github.com/networkx/networkx https://networkx.org Network Analysis in Python
nocodb 28607 https://github.com/nocodb/nocodb https://docs.nocodb.com 🔥 🔥 🔥 Open Source Airtable Alternative - turns any MySQL, Postgres, SQLite into a Spreadsheet with REST APIs.
oceanbase 4408 https://github.com/oceanbase/oceanbase https://open.oceanbase.com OceanBase is an enterprise distributed relational database with high availability, high performance, horizontal scalability, and compatibility with SQL standards.
OnlineStats.jl 700 https://github.com/joshday/OnlineStats.jl https://joshday.github.io/OnlineStats.jl/latest/ ⚡ Single-pass algorithms for statistics
onnx 12811 https://github.com/onnx/onnx https://onnx.ai/ Open standard for machine learning interoperability
opacus 1172 https://github.com/pytorch/opacus https://opacus.ai Training PyTorch models with differential privacy
OpenMetadata 1090 https://github.com/open-metadata/OpenMetadata https://open-metadata.org Open Standard for Metadata. A Single place to Discover, Collaborate, and Get your data right.
orientdb 4470 https://github.com/orientechnologies/orientdb http://orientdb.org OrientDB is the most versatile DBMS supporting Graph, Document, Reactive, Full-Text, and Geospatial models in one Multi-Model product. OrientDB can run distributed (Multi-Master), supports SQL, ACID Transactions, Full-Text indexing, and Reactive Queries.
pandas-profiling 9173 https://github.com/ydataai/pandas-profiling https://pandas-profiling.ydata.ai Create HTML profiling reports from pandas DataFrame objects
pandera 1517 https://github.com/pandera-dev/pandera https://pandera.readthedocs.io A light-weight, flexible, and expressive data validation library for dataframes
ploomber 2526 https://github.com/ploomber/ploomber https://ploomber.io The fastest ⚡️ way to build data pipelines. Develop iteratively, deploy anywhere. ☁️
pointblank 646 https://github.com/rich-iannone/pointblank https://rich-iannone.github.io/pointblank Data quality assessment and metadata reporting for data frames and database tables
polars 6595 https://github.com/pola-rs/polars https://pola.rs/ Fast multi-threaded DataFrame library in Rust, Python, and Node.js
polyaxon 3112 https://github.com/polyaxon/polyaxon https://polyaxon.com MLOps Tools For Managing & Orchestrating The Machine Learning LifeCycle
prefect 9489 https://github.com/PrefectHQ/prefect https://prefect.io The easiest way to automate your data
prisma 23900 https://github.com/prisma/prisma https://www.prisma.io Next-generation ORM for Node.js & TypeScript. Supports PostgreSQL, MySQL, MariaDB, SQL Server, SQLite, MongoDB, and CockroachDB
pycaret 5884 https://github.com/pycaret/pycaret https://www.pycaret.org An open-source low-code machine learning library in Python
pyro 7506 https://github.com/pyro-ppl/pyro http://pyro.ai Deep universal probabilistic programming with Python and PyTorch
qlib 8810 https://github.com/microsoft/qlib https://qlib.readthedocs.io/en/latest/ Qlib is an AI-oriented quantitative investment platform which aims to realize the potential, empower the research, and create the value of AI technologies in quantitative investment. With Qlib, you can easily try your ideas to create better Quant investment strategies. An increasing number of SOTA Quant research works/papers are released in Qlib.
questdb 8820 https://github.com/questdb/questdb https://questdb.io An open source SQL database designed to process time series data faster
ray 21122 https://github.com/ray-project/ray https://ray.io An open source framework that provides a simple, universal API for building distributed applications. Ray is packaged with RLlib, a scalable reinforcement learning library, and Tune, a scalable hyperparameter tuning library.
re-data 1153 https://github.com/re-data/re-data https://getre.io re_data - fix data issues before your users & CEO would discover them 😊
RedisGraph 1666 https://github.com/RedisGraph/RedisGraph https://redis.io/docs/stack/graph/ A graph database as a Redis module
risingwave 2808 https://github.com/singularity-data/risingwave https://www.risingwave.dev RisingWave: the next-generation streaming database in the cloud.
rudder-server 3150 https://github.com/rudderlabs/rudder-server https://www.rudderstack.com Privacy and Security focused Segment-alternative in Golang and React
scikit-learn 50605 https://github.com/scikit-learn/scikit-learn https://scikit-learn.org scikit-learn: machine learning in Python
sea-orm 2226 https://github.com/SeaQL/sea-orm https://www.sea-ql.org/SeaORM/ 🐚 An async & dynamic ORM for Rust
snowplow 6124 https://github.com/snowplow/snowplow http://snowplowanalytics.com The enterprise-grade behavioral data engine (web, mobile, server-side, webhooks), running cloud-natively on AWS and GCP
soda-core 905 https://github.com/sodadata/soda-core https://docs.soda.io/soda-core/overview.html Data reliability tools for SQL- and Spark-accessible data
spaCy 23699 https://github.com/explosion/spaCy https://spacy.io 💫 Industrial-strength Natural Language Processing (NLP) in Python
spiceai 739 https://github.com/spiceai/spiceai https://docs.spiceai.org Build apps that learn and adapt. Time series AI for developers.
spicedb 2285 https://github.com/authzed/spicedb https://docs.authzed.com Open source permissions database inspired by Google Zanzibar
streamlit 19740 https://github.com/streamlit/streamlit https://streamlit.io Streamlit — The fastest way to build data apps in Python
stumpy 2310 https://github.com/TDAmeritrade/stumpy https://stumpy.readthedocs.io/en/latest/ STUMPY is a powerful and scalable Python library for modern time series analysis
superset 46844 https://github.com/apache/superset https://superset.apache.org/ Apache Superset is a Data Visualization and Data Exploration Platform
terminusdb 1854 https://github.com/terminusdb/terminusdb https://terminusdb.com TerminusDB is a distributed database with a collaboration model
tidb 31721 https://github.com/pingcap/tidb https://pingcap.com TiDB is an open-source, cloud-native, distributed, MySQL-Compatible database for elastic scale and real-time analytics. Try free: https://tidbcloud.com/free-trial
TileDB 1359 https://github.com/TileDB-Inc/TileDB https://tiledb.com The Universal Storage Engine
timescaledb 13290 https://github.com/timescale/timescaledb https://www.timescale.com/ An open-source time-series SQL database optimized for fast ingest and complex queries. Packaged as a PostgreSQL extension.
transformers 66291 https://github.com/huggingface/transformers https://huggingface.co/transformers 🤗 Transformers: State-of-the-art Machine Learning for Pytorch, TensorFlow, and JAX.
trino 5658 https://github.com/trinodb/trino https://trino.io Official repository of Trino, the distributed SQL query engine for big data, formerly known as PrestoSQL (https://trino.io)
typedb 3127 https://github.com/vaticle/typedb https://vaticle.com TypeDB: a strongly-typed database
vaex 7136 https://github.com/vaexio/vaex https://vaex.io Out-of-Core hybrid Apache Arrow/NumPy DataFrame for Python, ML, visualization and exploration of big tabular data at a billion rows per second 🚀
vespa 3974 https://github.com/vespa-engine/vespa https://vespa.ai The open big data serving engine. https://vespa.ai
whale 693 https://github.com/hyperqueryhq/whale https://docs.whale.cx 🐳 The stupidly simple CLI workspace for your data warehouse.
yolov5 28113 https://github.com/ultralytics/yolov5 https://ultralytics.com YOLOv5 🚀 in PyTorch > ONNX > CoreML > TFLite
yugabyte-db 6599 https://github.com/yugabyte/yugabyte-db https://www.yugabyte.com The high-performance distributed SQL database for global internet-scale apps.
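
The rows above follow the column order given in the header (Repo Name, Stars, GitHub URL, Project URL, Project Description), and a few rows (cog and dolt, for example) have no Project URL. As a rough sketch only, the Python below shows one way such a row could be split back into fields. The names `Repo`, `parse_row`, and `SAMPLE_ROWS` are hypothetical and not part of the gist, and the parser assumes that fields are separated by single spaces and that a description never begins with a URL.

```python
from typing import NamedTuple, Optional


class Repo(NamedTuple):
    name: str
    stars: int
    github_url: str
    project_url: Optional[str]  # some rows (e.g. cog, dolt) have none
    description: str


def parse_row(line: str) -> Repo:
    """Split one flattened table row into fields (hypothetical helper)."""
    # The first three fields never contain spaces; the rest is kept whole.
    name, stars, github_url, rest = line.split(maxsplit=3)
    first, _, remainder = rest.partition(" ")
    # Assumption: a leading URL in the remainder is the Project URL column.
    if first.startswith(("http://", "https://")):
        project_url, description = first, remainder
    else:
        project_url, description = None, rest
    return Repo(name, int(stars), github_url, project_url, description.strip())


# Three rows copied from the table above.
SAMPLE_ROWS = [
    "cog 2569 https://github.com/replicate/cog Containers for machine learning",
    "duckdb 5426 https://github.com/duckdb/duckdb http://www.duckdb.org "
    "DuckDB is an in-process SQL OLAP Database Management System",
    "keras 55556 https://github.com/keras-team/keras http://keras.io/ "
    "Deep Learning for humans",
]

# Sort by star count, highest first.
repos = sorted((parse_row(r) for r in SAMPLE_ROWS), key=lambda r: r.stars, reverse=True)
for repo in repos:
    print(f"{repo.name:10} {repo.stars:>6}  {repo.project_url or '(no project URL)'}")
```

Star counts are a snapshot from when the gist was created, so any ranking built this way reflects July 2022 rather than current numbers.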