Created July 11, 2022 11:40
Top 100+ data engineering repos in GitHub for 2022
Repo Name | Stars | GitHub URL | Project URL | Project Description | |
---|---|---|---|---|---|
airbyte | 7176 | https://github.com/airbytehq/airbyte | https://airbyte.com | Airbyte is an open-source EL(T) platform that helps you replicate your data in your warehouses, lakes, and databases. | |
amundsen | 3389 | https://github.com/amundsen-io/amundsen | https://www.amundsen.io/amundsen/ | Amundsen is a metadata driven application for improving the productivity of data analysts, data scientists, and engineers when interacting with data. | |
arangodb | 12377 | https://github.com/arangodb/arangodb | https://www.arangodb.com | 🥑 ArangoDB is a native multi-model database with flexible data models for documents, graphs, and key-values. Build high performance applications using a convenient SQL-like query language or JavaScript extensions. | |
arctic | 2729 | https://github.com/man-group/arctic | https://arctic.readthedocs.io/en/latest/ | High performance datastore for time series and tick data | |
arrow-datafusion | 2173 | https://github.com/apache/arrow-datafusion | https://arrow.apache.org/datafusion | Apache Arrow DataFusion SQL Query Engine | |
aws-data-wrangler | 2914 | https://github.com/awslabs/aws-data-wrangler | https://aws-data-wrangler.readthedocs.io | Pandas on AWS - Easy integration with Athena, Glue, Redshift, Timestream, Neptune, OpenSearch, QuickSight, Chime, CloudWatchLogs, DynamoDB, EMR, SecretManager, PostgreSQL, MySQL, SQLServer, and S3 (Parquet, CSV, JSON, and EXCEL). | |
benthos | 4529 | https://github.com/benthosdev/benthos | https://www.benthos.dev | Fancy stream processing made operationally mundane | |
cayley | 14227 | https://github.com/cayleygraph/cayley | https://cayley.io | An open-source graph database | |
ClickHouse | 24377 | https://github.com/ClickHouse/ClickHouse | https://clickhouse.com | ClickHouse® is a free analytics DBMS for big data | |
cockroach | 25018 | https://github.com/cockroachdb/cockroach | https://www.cockroachlabs.com | CockroachDB - the open source cloud-native distributed SQL database. | |
cog | 2569 | https://github.com/replicate/cog | | Containers for machine learning | |
composer | 2216 | https://github.com/mosaicml/composer | http://docs.mosaicml.com | train neural networks up to 7x faster | |
crate | 3434 | https://github.com/crate/crate | https://crate.io/products/cratedb/ | CrateDB is a distributed SQL database that makes it simple to store and analyze massive amounts of machine data in real-time. | |
cudf | 4839 | https://github.com/rapidsai/cudf | http://rapids.ai | cuDF - GPU DataFrame Library | |
dagster | 4934 | https://github.com/dagster-io/dagster | https://dagster.io | An orchestration platform for the development, production, and observation of data assets. | |
dash | 16751 | https://github.com/plotly/dash | https://plotly.com/dash | Analytical Web Apps for Python, R, Julia, and Jupyter. No JavaScript Required. | |
databend | 4181 | https://github.com/datafuselabs/databend | https://databend.rs | A modern Elasticity and Performance cloud data warehouse, activate your object storage for real-time analytics. Cloud at https://app.databend.com/ | |
DataFrames.jl | 1397 | https://github.com/JuliaData/DataFrames.jl | https://dataframes.juliadata.org/stable/ | In-memory tabular data in Julia | |
datahub | 5724 | https://github.com/datahub-project/datahub | https://datahubproject.io | The Metadata Platform for the Modern Data Stack | |
dbt-core | 5109 | https://github.com/dbt-labs/dbt-core | https://getdbt.com | dbt enables data analysts and engineers to transform their data using the same practices that software engineers use to build applications. | |
debezium | 7004 | https://github.com/debezium/debezium | https://debezium.io | Change data capture for a variety of databases. Please log issues at https://issues.redhat.com/browse/DBZ. | |
dgraph | 18178 | https://github.com/dgraph-io/dgraph | https://dgraph.io | Native GraphQL Database with graph backend | |
diesel | 8640 | https://github.com/diesel-rs/diesel | https://diesel.rs | A safe, extensible ORM and Query Builder for Rust | |
dolt | 12310 | https://github.com/dolthub/dolt | | Dolt – It's Git for Data | |
dremio-oss | 1086 | https://github.com/dremio/dremio-oss | https://www.dremio.com | Dremio - the missing link in modern data | |
duckdb | 5426 | https://github.com/duckdb/duckdb | http://www.duckdb.org | DuckDB is an in-process SQL OLAP Database Management System | |
dvc | 9954 | https://github.com/iterative/dvc | https://dvc.org | 🦉Data Version Control, Git for Data & Models, ML Experiments Management | |
edgedb | 8003 | https://github.com/edgedb/edgedb | https://edgedb.com | A next-generation graph-relational database. | |
elementary | 611 | https://github.com/elementary-data/elementary | https://docs.elementary-data.com | Open-source data observability for analytics engineers | |
faker | 14400 | https://github.com/joke2k/faker | http://faker.rtfd.org | Faker is a Python package that generates fake data for you. | |
feast | 3335 | https://github.com/feast-dev/feast | https://feast.dev | Feature Store for Machine Learning | |
feathr | 748 | https://github.com/linkedin/feathr | https://engineering.linkedin.com/blog/2022/open-sourcing-feathr---linkedin-s-feature-store-for-productive-m | Feathr – An Enterprise-Grade High Performance Feature Store | |
featureform | 909 | https://github.com/featureform/featureform | https://www.featureform.com | The Virtual Feature Store. Turn your existing data infrastructure into a feature store. | |
FerretDB | 4516 | https://github.com/FerretDB/FerretDB | https://www.ferretdb.io | A truly Open Source MongoDB alternative | |
flyte | 2458 | https://github.com/flyteorg/flyte | https://flyte.org | Kubernetes-native workflow automation platform for complex, mission-critical data and ML processes at scale. It has been battle-tested at Lyft, Spotify, Freenome, and others and is truly open-source. | |
flyway | 6602 | https://github.com/flyway/flyway | https://flywaydb.org | Flyway by Redgate • Database Migrations Made Easy. | |
grafana | 49742 | https://github.com/grafana/grafana | https://grafana.com | The open and composable observability and data visualization platform. Visualize metrics, logs, and traces from multiple sources like Prometheus, Loki, Elasticsearch, InfluxDB, Postgres, and many more. | |
great_expectations | 6801 | https://github.com/great-expectations/great_expectations | https://docs.greatexpectations.io/ | Always know what to expect from your data. | |
horovod | 12557 | https://github.com/horovod/horovod | http://horovod.ai | Distributed training framework for TensorFlow, Keras, PyTorch, and Apache MXNet. | |
hudi | 3279 | https://github.com/apache/hudi | https://hudi.apache.org/ | Upserts Deletes And Incremental Processing on Big Data. | |
ibis | 1875 | https://github.com/ibis-project/ibis | http://ibis-project.org | Expressive analytics in Python at any scale. | |
ignite | 4196 | https://github.com/apache/ignite | https://ignite.apache.org/ | Apache Ignite | |
immudb | 7658 | https://github.com/codenotary/immudb | https://www.codenotary.com/technologies/immudb | immudb - immutable database based on zero trust, SQL and Key-Value, tamperproof, data change history | |
ivy | 3005 | https://github.com/unifyai/ivy | https://lets-unify.ai | The Unified Machine Learning Framework | |
janusgraph | 4496 | https://github.com/JanusGraph/janusgraph | https://janusgraph.org | JanusGraph: an open-source distributed graph database | |
jgrapht | 2146 | https://github.com/jgrapht/jgrapht | http://www.jgrapht.org | Master repository for the JGraphT project | |
keras | 55556 | https://github.com/keras-team/keras | http://keras.io/ | Deep Learning for humans | |
ksql | 5058 | https://github.com/confluentinc/ksql | https://ksqldb.io | The database purpose-built for stream processing applications. | |
lakeFS | 2647 | https://github.com/treeverse/lakeFS | https://lakefs.io | Git-like capabilities for your object storage | |
lightdash | 1211 | https://github.com/lightdash/lightdash | https://lightdash.com | An open source alternative to Looker built using dbt. Made for analysts ❤️ | |
lightning | 19229 | https://github.com/Lightning-AI/lightning | https://lightning.ai | Build high-performance AI models with PyTorch Lightning (organized PyTorch). Deploy models with Lightning Apps (organized Python to build end-to-end ML systems). | |
liquibase | 3289 | https://github.com/liquibase/liquibase | https://www.liquibase.org | Main Liquibase Source | |
ludwig | 8411 | https://github.com/ludwig-ai/ludwig | http://ludwig.ai | Data-centric declarative deep learning framework | |
marquez | 1106 | https://github.com/MarquezProject/marquez | https://marquezproject.ai | Collect, aggregate, and visualize a data ecosystem's metadata | |
mars | 2449 | https://github.com/mars-project/mars | https://docs.pymars.org | Mars is a tensor-based unified framework for large-scale data computation which scales numpy, pandas, scikit-learn, and Python functions. | |
materialize | 4177 | https://github.com/MaterializeInc/materialize | https://materialize.com | The Fastest Way to Build the Fastest Data Products. Build data-intensive applications and services in SQL — without pipelines or caches — using materialized views that are always up-to-date. | |
matplotlib | 15731 | https://github.com/matplotlib/matplotlib | https://matplotlib.org/stable | matplotlib: plotting with Python | |
mediapipe | 17821 | https://github.com/google/mediapipe | https://mediapipe.dev | Cross-platform, customizable ML solutions for live and streaming media. | |
metabase | 29005 | https://github.com/metabase/metabase | https://metabase.com | The simplest, fastest way to get business intelligence and analytics to everyone in your company :yum: | |
metaflow | 5745 | https://github.com/Netflix/metaflow | https://metaflow.org | :rocket: Build and manage real-life data science projects with ease! | |
metarank | 1457 | https://github.com/metarank/metarank | https://metarank.ai | A low code Machine Learning service that personalizes articles, listings, search results, recommendations to boost user engagement. A friendly Learn-to-Rank engine | |
metricflow | 614 | https://github.com/transform-data/metricflow | https://transform.co/metricflow | MetricFlow allows you to define, build, and maintain metrics in code. | |
milvus | 11183 | https://github.com/milvus-io/milvus | https://milvus.io | Vector database for scalable similarity search and AI applications. | |
mindsdb | 8199 | https://github.com/mindsdb/mindsdb | http://mindsdb.com | In-Database Machine Learning | |
modin | 7556 | https://github.com/modin-project/modin | http://modin.readthedocs.io | Modin: Scale your Pandas workflows by changing a single line of code | |
nebula | 7600 | https://github.com/vesoft-inc/nebula | https://nebula-graph.io | A distributed, fast, open-source graph database featuring horizontal scalability and high availability | |
neo4j | 10180 | https://github.com/neo4j/neo4j | http://neo4j.com | Graphs for Everyone | |
neon | 3452 | https://github.com/neondatabase/neon | https://neon.tech | The serverless open source alternative to AWS Aurora Postgres. | |
netron | 19210 | https://github.com/lutzroeder/netron | https://netron.app | Visualizer for neural network, deep learning, and machine learning models | |
networkx | 10931 | https://github.com/networkx/networkx | https://networkx.org | Network Analysis in Python | |
nocodb | 28607 | https://github.com/nocodb/nocodb | https://docs.nocodb.com | 🔥 🔥 🔥 Open Source Airtable Alternative - turns any MySQL, Postgres, SQLite into a Spreadsheet with REST APIs. | |
oceanbase | 4408 | https://github.com/oceanbase/oceanbase | https://open.oceanbase.com | OceanBase is an enterprise distributed relational database with high availability, high performance, horizontal scalability, and compatibility with SQL standards. | |
OnlineStats.jl | 700 | https://github.com/joshday/OnlineStats.jl | https://joshday.github.io/OnlineStats.jl/latest/ | ⚡ Single-pass algorithms for statistics | |
onnx | 12811 | https://github.com/onnx/onnx | https://onnx.ai/ | Open standard for machine learning interoperability | |
opacus | 1172 | https://github.com/pytorch/opacus | https://opacus.ai | Training PyTorch models with differential privacy | |
OpenMetadata | 1090 | https://github.com/open-metadata/OpenMetadata | https://open-metadata.org | Open Standard for Metadata. A Single place to Discover, Collaborate, and Get your data right. | |
orientdb | 4470 | https://github.com/orientechnologies/orientdb | http://orientdb.org | OrientDB is the most versatile DBMS supporting Graph, Document, Reactive, Full-Text, and Geospatial models in one Multi-Model product. OrientDB can run distributed (Multi-Master), supports SQL, ACID Transactions, Full-Text indexing, and Reactive Queries. | |
pandas-profiling | 9173 | https://github.com/ydataai/pandas-profiling | https://pandas-profiling.ydata.ai | Create HTML profiling reports from pandas DataFrame objects | |
pandera | 1517 | https://github.com/pandera-dev/pandera | https://pandera.readthedocs.io | A light-weight, flexible, and expressive data validation library for dataframes | |
ploomber | 2526 | https://github.com/ploomber/ploomber | https://ploomber.io | The fastest ⚡️ way to build data pipelines. Develop iteratively, deploy anywhere. ☁️ | |
pointblank | 646 | https://github.com/rich-iannone/pointblank | https://rich-iannone.github.io/pointblank | Data quality assessment and metadata reporting for data frames and database tables | |
polars | 6595 | https://github.com/pola-rs/polars | https://pola.rs/ | Fast multi-threaded DataFrame library in Rust, Python, Node.js | |
polyaxon | 3112 | https://github.com/polyaxon/polyaxon | https://polyaxon.com | MLOps Tools For Managing & Orchestrating The Machine Learning LifeCycle | |
prefect | 9489 | https://github.com/PrefectHQ/prefect | https://prefect.io | The easiest way to automate your data | |
prisma | 23900 | https://github.com/prisma/prisma | https://www.prisma.io | Next-generation ORM for Node.js & TypeScript: PostgreSQL, MySQL, MariaDB, SQL Server, SQLite, MongoDB, and CockroachDB | |
pycaret | 5884 | https://github.com/pycaret/pycaret | https://www.pycaret.org | An open-source low-code machine learning library in Python | |
pyro | 7506 | https://github.com/pyro-ppl/pyro | http://pyro.ai | Deep universal probabilistic programming with Python and PyTorch | |
qlib | 8810 | https://github.com/microsoft/qlib | https://qlib.readthedocs.io/en/latest/ | Qlib is an AI-oriented quantitative investment platform which aims to realize the potential, empower the research, and create the value of AI technologies in quantitative investment. With Qlib, you can easily try your ideas to create better Quant investment strategies. An increasing number of SOTA Quant research works/papers are released in Qlib. | |
questdb | 8820 | https://github.com/questdb/questdb | https://questdb.io | An open source SQL database designed to process time series data faster | |
ray | 21122 | https://github.com/ray-project/ray | https://ray.io | An open source framework that provides a simple, universal API for building distributed applications. Ray is packaged with RLlib, a scalable reinforcement learning library, and Tune, a scalable hyperparameter tuning library. | |
re-data | 1153 | https://github.com/re-data/re-data | https://getre.io | re_data - fix data issues before your users & CEO would discover them 😊 | |
RedisGraph | 1666 | https://github.com/RedisGraph/RedisGraph | https://redis.io/docs/stack/graph/ | A graph database as a Redis module | |
risingwave | 2808 | https://github.com/singularity-data/risingwave | https://www.risingwave.dev | RisingWave: the next-generation streaming database in the cloud. | |
rudder-server | 3150 | https://github.com/rudderlabs/rudder-server | https://www.rudderstack.com | Privacy and Security focused Segment-alternative in Golang and React | |
scikit-learn | 50605 | https://github.com/scikit-learn/scikit-learn | https://scikit-learn.org | scikit-learn: machine learning in Python | |
sea-orm | 2226 | https://github.com/SeaQL/sea-orm | https://www.sea-ql.org/SeaORM/ | 🐚 An async & dynamic ORM for Rust | |
snowplow | 6124 | https://github.com/snowplow/snowplow | http://snowplowanalytics.com | The enterprise-grade behavioral data engine (web, mobile, server-side, webhooks), running cloud-natively on AWS and GCP | |
soda-core | 905 | https://github.com/sodadata/soda-core | https://docs.soda.io/soda-core/overview.html | Data reliability tools for SQL- and Spark-accessible data | |
spaCy | 23699 | https://github.com/explosion/spaCy | https://spacy.io | 💫 Industrial-strength Natural Language Processing (NLP) in Python | |
spiceai | 739 | https://github.com/spiceai/spiceai | https://docs.spiceai.org | Build apps that learn and adapt. Time series AI for developers. | |
spicedb | 2285 | https://github.com/authzed/spicedb | https://docs.authzed.com | Open source permissions database inspired by Google Zanzibar | |
streamlit | 19740 | https://github.com/streamlit/streamlit | https://streamlit.io | Streamlit — The fastest way to build data apps in Python | |
stumpy | 2310 | https://github.com/TDAmeritrade/stumpy | https://stumpy.readthedocs.io/en/latest/ | STUMPY is a powerful and scalable Python library for modern time series analysis | |
superset | 46844 | https://github.com/apache/superset | https://superset.apache.org/ | Apache Superset is a Data Visualization and Data Exploration Platform | |
terminusdb | 1854 | https://github.com/terminusdb/terminusdb | https://terminusdb.com | TerminusDB is a distributed database with a collaboration model | |
tidb | 31721 | https://github.com/pingcap/tidb | https://pingcap.com | TiDB is an open-source, cloud-native, distributed, MySQL-Compatible database for elastic scale and real-time analytics. Try free: https://tidbcloud.com/free-trial | |
TileDB | 1359 | https://github.com/TileDB-Inc/TileDB | https://tiledb.com | The Universal Storage Engine | |
timescaledb | 13290 | https://github.com/timescale/timescaledb | https://www.timescale.com/ | An open-source time-series SQL database optimized for fast ingest and complex queries. Packaged as a PostgreSQL extension. | |
transformers | 66291 | https://github.com/huggingface/transformers | https://huggingface.co/transformers | 🤗 Transformers: State-of-the-art Machine Learning for Pytorch, TensorFlow, and JAX. | |
trino | 5658 | https://github.com/trinodb/trino | https://trino.io | Official repository of Trino, the distributed SQL query engine for big data, formerly known as PrestoSQL (https://trino.io) | |
typedb | 3127 | https://github.com/vaticle/typedb | https://vaticle.com | TypeDB: a strongly-typed database | |
vaex | 7136 | https://github.com/vaexio/vaex | https://vaex.io | Out-of-Core hybrid Apache Arrow/NumPy DataFrame for Python, ML, visualization and exploration of big tabular data at a billion rows per second 🚀 | |
vespa | 3974 | https://github.com/vespa-engine/vespa | https://vespa.ai | The open big data serving engine. https://vespa.ai | |
whale | 693 | https://github.com/hyperqueryhq/whale | https://docs.whale.cx | 🐳 The stupidly simple CLI workspace for your data warehouse. | |
yolov5 | 28113 | https://github.com/ultralytics/yolov5 | https://ultralytics.com | YOLOv5 🚀 in PyTorch > ONNX > CoreML > TFLite | |
yugabyte-db | 6599 | https://github.com/yugabyte/yugabyte-db | https://www.yugabyte.com | The high-performance distributed SQL database for global internet-scale apps. | |