daefresh/top_100_open_data_tooling.csv

## top_100_open_data_tooling.csv

          
| Repo Name | Stars | GitHub URL | Project URL | Project Description |
| --- | --- | --- | --- | --- |
| airbyte | 7176 | https://github.com/airbytehq/airbyte | https://airbyte.com | Airbyte is an open-source EL(T) platform that helps you replicate your data in your warehouses lakes and databases. |
| amundsen | 3389 | https://github.com/amundsen-io/amundsen | https://www.amundsen.io/amundsen/ | Amundsen is a metadata driven application for improving the productivity of data analysts data scientists and engineers when interacting with data. |
| arangodb | 12377 | https://github.com/arangodb/arangodb | https://www.arangodb.com | 🥑 ArangoDB is a native multi-model database with flexible data models for documents graphs and key-values. Build high performance applications using a convenient SQL-like query language or JavaScript extensions. |
| arctic | 2729 | https://github.com/man-group/arctic | https://arctic.readthedocs.io/en/latest/ | High performance datastore for time series and tick data |
| arrow-datafusion | 2173 | https://github.com/apache/arrow-datafusion | https://arrow.apache.org/datafusion | Apache Arrow DataFusion SQL Query Engine |
| aws-data-wrangler | 2914 | https://github.com/awslabs/aws-data-wrangler | https://aws-data-wrangler.readthedocs.io | Pandas on AWS - Easy integration with Athena Glue Redshift Timestream Neptune OpenSearch QuickSight Chime CloudWatchLogs DynamoDB EMR SecretManager PostgreSQL MySQL SQLServer and S3 (Parquet CSV JSON and EXCEL). |
| benthos | 4529 | https://github.com/benthosdev/benthos | https://www.benthos.dev | Fancy stream processing made operationally mundane |
| cayley | 14227 | https://github.com/cayleygraph/cayley | https://cayley.io | An open-source graph database |
| ClickHouse | 24377 | https://github.com/ClickHouse/ClickHouse | https://clickhouse.com | ClickHouse® is a free analytics DBMS for big data |
| cockroach | 25018 | https://github.com/cockroachdb/cockroach | https://www.cockroachlabs.com | CockroachDB - the open source cloud-native distributed SQL database. |
| cog | 2569 | https://github.com/replicate/cog | | Containers for machine learning |
| composer | 2216 | https://github.com/mosaicml/composer | http://docs.mosaicml.com | train neural networks up to 7x faster |
| crate | 3434 | https://github.com/crate/crate | https://crate.io/products/cratedb/ | CrateDB is a distributed SQL database that makes it simple to store and analyze massive amounts of machine data in real-time. |
| cudf | 4839 | https://github.com/rapidsai/cudf | http://rapids.ai | cuDF - GPU DataFrame Library |
| dagster | 4934 | https://github.com/dagster-io/dagster | https://dagster.io | An orchestration platform for the development production and observation of data assets. |
| dash | 16751 | https://github.com/plotly/dash | https://plotly.com/dash | Analytical Web Apps for Python R Julia and Jupyter. No JavaScript Required. |
| databend | 4181 | https://github.com/datafuselabs/databend | https://databend.rs | A modern Elasticity and Performance cloud data warehouse activate your object storage for real-time analytics. Cloud at https://app.databend.com/ |
| DataFrames.jl | 1397 | https://github.com/JuliaData/DataFrames.jl | https://dataframes.juliadata.org/stable/ | In-memory tabular data in Julia |
| datahub | 5724 | https://github.com/datahub-project/datahub | https://datahubproject.io | The Metadata Platform for the Modern Data Stack |
| dbt-core | 5109 | https://github.com/dbt-labs/dbt-core | https://getdbt.com | dbt enables data analysts and engineers to transform their data using the same practices that software engineers use to build applications. |
| debezium | 7004 | https://github.com/debezium/debezium | https://debezium.io | Change data capture for a variety of databases. Please log issues at https://issues.redhat.com/browse/DBZ. |
| dgraph | 18178 | https://github.com/dgraph-io/dgraph | https://dgraph.io | Native GraphQL Database with graph backend |
| diesel | 8640 | https://github.com/diesel-rs/diesel | https://diesel.rs | A safe extensible ORM and Query Builder for Rust |
| dolt | 12310 | https://github.com/dolthub/dolt | | Dolt – It's Git for Data |
| dremio-oss | 1086 | https://github.com/dremio/dremio-oss | https://www.dremio.com | Dremio - the missing link in modern data |
| duckdb | 5426 | https://github.com/duckdb/duckdb | http://www.duckdb.org | DuckDB is an in-process SQL OLAP Database Management System |
| dvc | 9954 | https://github.com/iterative/dvc | https://dvc.org | 🦉Data Version Control Git for Data & Models ML Experiments Management |
| edgedb | 8003 | https://github.com/edgedb/edgedb | https://edgedb.com | A next-generation graph-relational database. |
| elementary | 611 | https://github.com/elementary-data/elementary | https://docs.elementary-data.com | Open-source data observability for analytics engineers |
| faker | 14400 | https://github.com/joke2k/faker | http://faker.rtfd.org | Faker is a Python package that generates fake data for you. |
| feast | 3335 | https://github.com/feast-dev/feast | https://feast.dev | Feature Store for Machine Learning |
| feathr | 748 | https://github.com/linkedin/feathr | https://engineering.linkedin.com/blog/2022/open-sourcing-feathr---linkedin-s-feature-store-for-productive-m | Feathr – An Enterprise-Grade High Performance Feature Store |
| featureform | 909 | https://github.com/featureform/featureform | https://www.featureform.com | The Virtual Feature Store. Turn your existing data infrastructure into a feature store. |
| FerretDB | 4516 | https://github.com/FerretDB/FerretDB | https://www.ferretdb.io | A truly Open Source MongoDB alternative |
| flyte | 2458 | https://github.com/flyteorg/flyte | https://flyte.org | Kubernetes-native workflow automation platform for complex mission-critical data and ML processes at scale. It has been battle-tested at Lyft Spotify Freenome and others and is truly open-source. |
| flyway | 6602 | https://github.com/flyway/flyway | https://flywaydb.org | Flyway by Redgate • Database Migrations Made Easy. |
| grafana | 49742 | https://github.com/grafana/grafana | https://grafana.com | The open and composable observability and data visualization platform. Visualize metrics logs and traces from multiple sources like Prometheus Loki Elasticsearch InfluxDB Postgres and many more. |
| great_expectations | 6801 | https://github.com/great-expectations/great_expectations | https://docs.greatexpectations.io/ | Always know what to expect from your data. |
| horovod | 12557 | https://github.com/horovod/horovod | http://horovod.ai | Distributed training framework for TensorFlow Keras PyTorch and Apache MXNet. |
| hudi | 3279 | https://github.com/apache/hudi | https://hudi.apache.org/ | Upserts Deletes And Incremental Processing on Big Data. |
| ibis | 1875 | https://github.com/ibis-project/ibis | http://ibis-project.org | Expressive analytics in Python at any scale. |
| ignite | 4196 | https://github.com/apache/ignite | https://ignite.apache.org/ | Apache Ignite |
| immudb | 7658 | https://github.com/codenotary/immudb | https://www.codenotary.com/technologies/immudb | immudb - immutable database based on zero trust SQL and Key-Value tamperproof data change history |
| ivy | 3005 | https://github.com/unifyai/ivy | https://lets-unify.ai | The Unified Machine Learning Framework |
| janusgraph | 4496 | https://github.com/JanusGraph/janusgraph | https://janusgraph.org | JanusGraph: an open-source distributed graph database |
| jgrapht | 2146 | https://github.com/jgrapht/jgrapht | http://www.jgrapht.org | Master repository for the JGraphT project |
| keras | 55556 | https://github.com/keras-team/keras | http://keras.io/ | Deep Learning for humans |
| ksql | 5058 | https://github.com/confluentinc/ksql | https://ksqldb.io | The database purpose-built for stream processing applications. |
| lakeFS | 2647 | https://github.com/treeverse/lakeFS | https://lakefs.io | Git-like capabilities for your object storage |
| lightdash | 1211 | https://github.com/lightdash/lightdash | https://lightdash.com | An open source alternative to Looker built using dbt. Made for analysts ❤️ |
| lightning | 19229 | https://github.com/Lightning-AI/lightning | https://lightning.ai | Build high-performance AI models with PyTorch Lightning (organized PyTorch). Deploy models with Lightning Apps (organized Python to build end-to-end ML systems). |
| liquibase | 3289 | https://github.com/liquibase/liquibase | https://www.liquibase.org | Main Liquibase Source |
| ludwig | 8411 | https://github.com/ludwig-ai/ludwig | http://ludwig.ai | Data-centric declarative deep learning framework |
| marquez | 1106 | https://github.com/MarquezProject/marquez | https://marquezproject.ai | Collect aggregate and visualize a data ecosystem's metadata |
| mars | 2449 | https://github.com/mars-project/mars | https://docs.pymars.org | Mars is a tensor-based unified framework for large-scale data computation which scales numpy pandas scikit-learn and Python functions. |
| materialize | 4177 | https://github.com/MaterializeInc/materialize | https://materialize.com | The Fastest Way to Build the Fastest Data Products. Build data-intensive applications and services in SQL — without pipelines or caches — using materialized views that are always up-to-date. |
| matplotlib | 15731 | https://github.com/matplotlib/matplotlib | https://matplotlib.org/stable | matplotlib: plotting with Python |
| mediapipe | 17821 | https://github.com/google/mediapipe | https://mediapipe.dev | Cross-platform customizable ML solutions for live and streaming media. |
| metabase | 29005 | https://github.com/metabase/metabase | https://metabase.com | The simplest fastest way to get business intelligence and analytics to everyone in your company :yum: |
| metaflow | 5745 | https://github.com/Netflix/metaflow | https://metaflow.org | :rocket: Build and manage real-life data science projects with ease! |
| metarank | 1457 | https://github.com/metarank/metarank | https://metarank.ai | A low code Machine Learning service that personalizes articles listings search results recommendations to boost user engagement. A friendly Learn-to-Rank engine |
| metricflow | 614 | https://github.com/transform-data/metricflow | https://transform.co/metricflow | MetricFlow allows you to define build and maintain metrics in code. |
| milvus | 11183 | https://github.com/milvus-io/milvus | https://milvus.io | Vector database for scalable similarity search and AI applications. |
| mindsdb | 8199 | https://github.com/mindsdb/mindsdb | http://mindsdb.com | In-Database Machine Learning |
| modin | 7556 | https://github.com/modin-project/modin | http://modin.readthedocs.io | Modin: Scale your Pandas workflows by changing a single line of code |
| nebula | 7600 | https://github.com/vesoft-inc/nebula | https://nebula-graph.io | A distributed fast open-source graph database featuring horizontal scalability and high availability |
| neo4j | 10180 | https://github.com/neo4j/neo4j | http://neo4j.com | Graphs for Everyone |
| neon | 3452 | https://github.com/neondatabase/neon | https://neon.tech | The serverless open source alternative to AWS Aurora Postgres. |
| netron | 19210 | https://github.com/lutzroeder/netron | https://netron.app | Visualizer for neural network deep learning and machine learning models |
| networkx | 10931 | https://github.com/networkx/networkx | https://networkx.org | Network Analysis in Python |
| nocodb | 28607 | https://github.com/nocodb/nocodb | https://docs.nocodb.com | 🔥 🔥 🔥 Open Source Airtable Alternative - turns any MySQL Postgres SQLite into a Spreadsheet with REST APIs. |
| oceanbase | 4408 | https://github.com/oceanbase/oceanbase | https://open.oceanbase.com | OceanBase is an enterprise distributed relational database with high availability high performance horizontal scalability and compatibility with SQL standards. |
| OnlineStats.jl | 700 | https://github.com/joshday/OnlineStats.jl | https://joshday.github.io/OnlineStats.jl/latest/ | ⚡ Single-pass algorithms for statistics |
| onnx | 12811 | https://github.com/onnx/onnx | https://onnx.ai/ | Open standard for machine learning interoperability |
| opacus | 1172 | https://github.com/pytorch/opacus | https://opacus.ai | Training PyTorch models with differential privacy |
| OpenMetadata | 1090 | https://github.com/open-metadata/OpenMetadata | https://open-metadata.org | Open Standard for Metadata. A Single place to Discover Collaborate and Get your data right. |
| orientdb | 4470 | https://github.com/orientechnologies/orientdb | http://orientdb.org | OrientDB is the most versatile DBMS supporting Graph Document Reactive Full-Text and Geospatial models in one Multi-Model product. OrientDB can run distributed (Multi-Master) supports SQL ACID Transactions Full-Text indexing and Reactive Queries. |
| pandas-profiling | 9173 | https://github.com/ydataai/pandas-profiling | https://pandas-profiling.ydata.ai | Create HTML profiling reports from pandas DataFrame objects |
| pandera | 1517 | https://github.com/pandera-dev/pandera | https://pandera.readthedocs.io | A light-weight flexible and expressive data validation library for dataframes |
| ploomber | 2526 | https://github.com/ploomber/ploomber | https://ploomber.io | The fastest ⚡️ way to build data pipelines. Develop iteratively deploy anywhere. ☁️ |
| pointblank | 646 | https://github.com/rich-iannone/pointblank | https://rich-iannone.github.io/pointblank | Data quality assessment and metadata reporting for data frames and database tables |
| polars | 6595 | https://github.com/pola-rs/polars | https://pola.rs/ | Fast multi-threaded DataFrame library in Rust Python Node.js |
| polyaxon | 3112 | https://github.com/polyaxon/polyaxon | https://polyaxon.com | MLOps Tools For Managing & Orchestrating The Machine Learning LifeCycle |
| prefect | 9489 | https://github.com/PrefectHQ/prefect | https://prefect.io | The easiest way to automate your data |
| prisma | 23900 | https://github.com/prisma/prisma | https://www.prisma.io | Next-generation ORM for Node.js & TypeScript PostgreSQL MySQL MariaDB SQL Server SQLite MongoDB and CockroachDB |
| pycaret | 5884 | https://github.com/pycaret/pycaret | https://www.pycaret.org | An open-source low-code machine learning library in Python |
| pyro | 7506 | https://github.com/pyro-ppl/pyro | http://pyro.ai | Deep universal probabilistic programming with Python and PyTorch |
| qlib | 8810 | https://github.com/microsoft/qlib | https://qlib.readthedocs.io/en/latest/ | Qlib is an AI-oriented quantitative investment platform which aims to realize the potential empower the research and create the value of AI technologies in quantitative investment. With Qlib you can easily try your ideas to create better Quant investment strategies. An increasing number of SOTA Quant research works/papers are released in Qlib. |
| questdb | 8820 | https://github.com/questdb/questdb | https://questdb.io | An open source SQL database designed to process time series data faster |
| ray | 21122 | https://github.com/ray-project/ray | https://ray.io | An open source framework that provides a simple universal API for building distributed applications. Ray is packaged with RLlib a scalable reinforcement learning library and Tune a scalable hyperparameter tuning library. |
| re-data | 1153 | https://github.com/re-data/re-data | https://getre.io | re_data - fix data issues before your users & CEO would discover them 😊 |
| RedisGraph | 1666 | https://github.com/RedisGraph/RedisGraph | https://redis.io/docs/stack/graph/ | A graph database as a Redis module |
| risingwave | 2808 | https://github.com/singularity-data/risingwave | https://www.risingwave.dev | RisingWave: the next-generation streaming database in the cloud. |
| rudder-server | 3150 | https://github.com/rudderlabs/rudder-server | https://www.rudderstack.com | Privacy and Security focused Segment-alternative in Golang and React |
| scikit-learn | 50605 | https://github.com/scikit-learn/scikit-learn | https://scikit-learn.org | scikit-learn: machine learning in Python |
| sea-orm | 2226 | https://github.com/SeaQL/sea-orm | https://www.sea-ql.org/SeaORM/ | 🐚 An async & dynamic ORM for Rust |
| snowplow | 6124 | https://github.com/snowplow/snowplow | http://snowplowanalytics.com | The enterprise-grade behavioral data engine (web mobile server-side webhooks) running cloud-natively on AWS and GCP |
| soda-core | 905 | https://github.com/sodadata/soda-core | https://docs.soda.io/soda-core/overview.html | Data reliability tools for SQL- and Spark-accessible data |
| spaCy | 23699 | https://github.com/explosion/spaCy | https://spacy.io | 💫 Industrial-strength Natural Language Processing (NLP) in Python |
| spiceai | 739 | https://github.com/spiceai/spiceai | https://docs.spiceai.org | Build apps that learn and adapt. Time series AI for developers. |
| spicedb | 2285 | https://github.com/authzed/spicedb | https://docs.authzed.com | Open source permissions database inspired by Google Zanzibar |
| streamlit | 19740 | https://github.com/streamlit/streamlit | https://streamlit.io | Streamlit — The fastest way to build data apps in Python |
| stumpy | 2310 | https://github.com/TDAmeritrade/stumpy | https://stumpy.readthedocs.io/en/latest/ | STUMPY is a powerful and scalable Python library for modern time series analysis |
| superset | 46844 | https://github.com/apache/superset | https://superset.apache.org/ | Apache Superset is a Data Visualization and Data Exploration Platform |
| terminusdb | 1854 | https://github.com/terminusdb/terminusdb | https://terminusdb.com | TerminusDB is a distributed database with a collaboration model |
| tidb | 31721 | https://github.com/pingcap/tidb | https://pingcap.com | TiDB is an open-source cloud-native distributed MySQL-Compatible database for elastic scale and real-time analytics. Try free: https://tidbcloud.com/free-trial |
| TileDB | 1359 | https://github.com/TileDB-Inc/TileDB | https://tiledb.com | The Universal Storage Engine |
| timescaledb | 13290 | https://github.com/timescale/timescaledb | https://www.timescale.com/ | An open-source time-series SQL database optimized for fast ingest and complex queries. Packaged as a PostgreSQL extension. |
| transformers | 66291 | https://github.com/huggingface/transformers | https://huggingface.co/transformers | 🤗 Transformers: State-of-the-art Machine Learning for Pytorch TensorFlow and JAX. |
| trino | 5658 | https://github.com/trinodb/trino | https://trino.io | Official repository of Trino the distributed SQL query engine for big data formerly known as PrestoSQL (https://trino.io) |
| typedb | 3127 | https://github.com/vaticle/typedb | https://vaticle.com | TypeDB: a strongly-typed database |
| vaex | 7136 | https://github.com/vaexio/vaex | https://vaex.io | Out-of-Core hybrid Apache Arrow/NumPy DataFrame for Python ML visualization and exploration of big tabular data at a billion rows per second 🚀 |
| vespa | 3974 | https://github.com/vespa-engine/vespa | https://vespa.ai | The open big data serving engine. https://vespa.ai |
| whale | 693 | https://github.com/hyperqueryhq/whale | https://docs.whale.cx | 🐳 The stupidly simple CLI workspace for your data warehouse. |
| yolov5 | 28113 | https://github.com/ultralytics/yolov5 | https://ultralytics.com | YOLOv5 🚀 in PyTorch > ONNX > CoreML > TFLite |
| yugabyte-db | 6599 | https://github.com/yugabyte/yugabyte-db | https://www.yugabyte.com | The high-performance distributed SQL database for global internet-scale apps. |
hudi	3279	https://github.com/apache/hudi	https://hudi.apache.org/	Upserts Deletes And Incremental Processing on Big Data.
ibis	1875	https://github.com/ibis-project/ibis	http://ibis-project.org	Expressive analytics in Python at any scale.
ignite	4196	https://github.com/apache/ignite	https://ignite.apache.org/	Apache Ignite
immudb	7658	https://github.com/codenotary/immudb	https://www.codenotary.com/technologies/immudb	immudb - immutable database based on zero trust SQL and Key-Value tamperproof data change history
ivy	3005	https://github.com/unifyai/ivy	https://lets-unify.ai	The Unified Machine Learning Framework
janusgraph	4496	https://github.com/JanusGraph/janusgraph	https://janusgraph.org	JanusGraph: an open-source distributed graph database
jgrapht	2146	https://github.com/jgrapht/jgrapht	http://www.jgrapht.org	Master repository for the JGraphT project
keras	55556	https://github.com/keras-team/keras	http://keras.io/	Deep Learning for humans
ksql	5058	https://github.com/confluentinc/ksql	https://ksqldb.io	The database purpose-built for stream processing applications.
lakeFS	2647	https://github.com/treeverse/lakeFS	https://lakefs.io	Git-like capabilities for your object storage
lightdash	1211	https://github.com/lightdash/lightdash	https://lightdash.com	An open source alternative to Looker built using dbt. Made for analysts ❤️
lightning	19229	https://github.com/Lightning-AI/lightning	https://lightning.ai	Build high-performance AI models with PyTorch Lightning (organized PyTorch). Deploy models with Lightning Apps (organized Python to build end-to-end ML systems).
liquibase	3289	https://github.com/liquibase/liquibase	https://www.liquibase.org	Main Liquibase Source
ludwig	8411	https://github.com/ludwig-ai/ludwig	http://ludwig.ai	Data-centric declarative deep learning framework
marquez	1106	https://github.com/MarquezProject/marquez	https://marquezproject.ai	Collect aggregate and visualize a data ecosystem's metadata
mars	2449	https://github.com/mars-project/mars	https://docs.pymars.org	Mars is a tensor-based unified framework for large-scale data computation which scales numpy pandas scikit-learn and Python functions.
materialize	4177	https://github.com/MaterializeInc/materialize	https://materialize.com	The Fastest Way to Build the Fastest Data Products. Build data-intensive applications and services in SQL — without pipelines or caches — using materialized views that are always up-to-date.
matplotlib	15731	https://github.com/matplotlib/matplotlib	https://matplotlib.org/stable	matplotlib: plotting with Python
mediapipe	17821	https://github.com/google/mediapipe	https://mediapipe.dev	Cross-platform customizable ML solutions for live and streaming media.
metabase	29005	https://github.com/metabase/metabase	https://metabase.com	The simplest fastest way to get business intelligence and analytics to everyone in your company 😋
metaflow	5745	https://github.com/Netflix/metaflow	https://metaflow.org	🚀 Build and manage real-life data science projects with ease!
metarank	1457	https://github.com/metarank/metarank	https://metarank.ai	A low code Machine Learning service that personalizes articles listings search results recommendations to boost user engagement. A friendly Learn-to-Rank engine
metricflow	614	https://github.com/transform-data/metricflow	https://transform.co/metricflow	MetricFlow allows you to define build and maintain metrics in code.
milvus	11183	https://github.com/milvus-io/milvus	https://milvus.io	Vector database for scalable similarity search and AI applications.
mindsdb	8199	https://github.com/mindsdb/mindsdb	http://mindsdb.com	In-Database Machine Learning
modin	7556	https://github.com/modin-project/modin	http://modin.readthedocs.io	Modin: Scale your Pandas workflows by changing a single line of code
nebula	7600	https://github.com/vesoft-inc/nebula	https://nebula-graph.io	A distributed fast open-source graph database featuring horizontal scalability and high availability
neo4j	10180	https://github.com/neo4j/neo4j	http://neo4j.com	Graphs for Everyone
neon	3452	https://github.com/neondatabase/neon	https://neon.tech	The serverless open source alternative to AWS Aurora Postgres.
netron	19210	https://github.com/lutzroeder/netron	https://netron.app	Visualizer for neural network deep learning and machine learning models
networkx	10931	https://github.com/networkx/networkx	https://networkx.org	Network Analysis in Python
nocodb	28607	https://github.com/nocodb/nocodb	https://docs.nocodb.com	🔥 🔥 🔥 Open Source Airtable Alternative - turns any MySQL Postgres SQLite into a Spreadsheet with REST APIs.
oceanbase	4408	https://github.com/oceanbase/oceanbase	https://open.oceanbase.com	OceanBase is an enterprise distributed relational database with high availability high performance horizontal scalability and compatibility with SQL standards.
OnlineStats.jl	700	https://github.com/joshday/OnlineStats.jl	https://joshday.github.io/OnlineStats.jl/latest/	⚡ Single-pass algorithms for statistics
onnx	12811	https://github.com/onnx/onnx	https://onnx.ai/	Open standard for machine learning interoperability
opacus	1172	https://github.com/pytorch/opacus	https://opacus.ai	Training PyTorch models with differential privacy
OpenMetadata	1090	https://github.com/open-metadata/OpenMetadata	https://open-metadata.org	Open Standard for Metadata. A Single place to Discover Collaborate and Get your data right.
orientdb	4470	https://github.com/orientechnologies/orientdb	http://orientdb.org	OrientDB is the most versatile DBMS supporting Graph Document Reactive Full-Text and Geospatial models in one Multi-Model product. OrientDB can run distributed (Multi-Master) supports SQL ACID Transactions Full-Text indexing and Reactive Queries.
pandas-profiling	9173	https://github.com/ydataai/pandas-profiling	https://pandas-profiling.ydata.ai	Create HTML profiling reports from pandas DataFrame objects
pandera	1517	https://github.com/pandera-dev/pandera	https://pandera.readthedocs.io	A light-weight flexible and expressive data validation library for dataframes
ploomber	2526	https://github.com/ploomber/ploomber	https://ploomber.io	The fastest ⚡️ way to build data pipelines. Develop iteratively deploy anywhere. ☁️
pointblank	646	https://github.com/rich-iannone/pointblank	https://rich-iannone.github.io/pointblank	Data quality assessment and metadata reporting for data frames and database tables
polars	6595	https://github.com/pola-rs/polars	https://pola.rs/	Fast multi-threaded DataFrame library in Rust Python Node.js
polyaxon	3112	https://github.com/polyaxon/polyaxon	https://polyaxon.com	MLOps Tools For Managing & Orchestrating The Machine Learning LifeCycle
prefect	9489	https://github.com/PrefectHQ/prefect	https://prefect.io	The easiest way to automate your data
prisma	23900	https://github.com/prisma/prisma	https://www.prisma.io	Next-generation ORM for Node.js & TypeScript PostgreSQL MySQL MariaDB SQL Server SQLite MongoDB and CockroachDB
pycaret	5884	https://github.com/pycaret/pycaret	https://www.pycaret.org	An open-source low-code machine learning library in Python
pyro	7506	https://github.com/pyro-ppl/pyro	http://pyro.ai	Deep universal probabilistic programming with Python and PyTorch
qlib	8810	https://github.com/microsoft/qlib	https://qlib.readthedocs.io/en/latest/	Qlib is an AI-oriented quantitative investment platform which aims to realize the potential empower the research and create the value of AI technologies in quantitative investment. With Qlib you can easily try your ideas to create better Quant investment strategies. An increasing number of SOTA Quant research works/papers are released in Qlib.
questdb	8820	https://github.com/questdb/questdb	https://questdb.io	An open source SQL database designed to process time series data faster
ray	21122	https://github.com/ray-project/ray	https://ray.io	An open source framework that provides a simple universal API for building distributed applications. Ray is packaged with RLlib a scalable reinforcement learning library and Tune a scalable hyperparameter tuning library.
re-data	1153	https://github.com/re-data/re-data	https://getre.io	re_data - fix data issues before your users & CEO would discover them 😊
RedisGraph	1666	https://github.com/RedisGraph/RedisGraph	https://redis.io/docs/stack/graph/	A graph database as a Redis module
risingwave	2808	https://github.com/singularity-data/risingwave	https://www.risingwave.dev	RisingWave: the next-generation streaming database in the cloud.
rudder-server	3150	https://github.com/rudderlabs/rudder-server	https://www.rudderstack.com	Privacy and Security focused Segment-alternative in Golang and React
scikit-learn	50605	https://github.com/scikit-learn/scikit-learn	https://scikit-learn.org	scikit-learn: machine learning in Python
sea-orm	2226	https://github.com/SeaQL/sea-orm	https://www.sea-ql.org/SeaORM/	🐚 An async & dynamic ORM for Rust
snowplow	6124	https://github.com/snowplow/snowplow	http://snowplowanalytics.com	The enterprise-grade behavioral data engine (web mobile server-side webhooks) running cloud-natively on AWS and GCP
soda-core	905	https://github.com/sodadata/soda-core	https://docs.soda.io/soda-core/overview.html	Data reliability tools for SQL- and Spark-accessible data
spaCy	23699	https://github.com/explosion/spaCy	https://spacy.io	💫 Industrial-strength Natural Language Processing (NLP) in Python
spiceai	739	https://github.com/spiceai/spiceai	https://docs.spiceai.org	Build apps that learn and adapt. Time series AI for developers.
spicedb	2285	https://github.com/authzed/spicedb	https://docs.authzed.com	Open source permissions database inspired by Google Zanzibar
streamlit	19740	https://github.com/streamlit/streamlit	https://streamlit.io	Streamlit — The fastest way to build data apps in Python
stumpy	2310	https://github.com/TDAmeritrade/stumpy	https://stumpy.readthedocs.io/en/latest/	STUMPY is a powerful and scalable Python library for modern time series analysis
superset	46844	https://github.com/apache/superset	https://superset.apache.org/	Apache Superset is a Data Visualization and Data Exploration Platform
terminusdb	1854	https://github.com/terminusdb/terminusdb	https://terminusdb.com	TerminusDB is a distributed database with a collaboration model
tidb	31721	https://github.com/pingcap/tidb	https://pingcap.com	TiDB is an open-source cloud-native distributed MySQL-Compatible database for elastic scale and real-time analytics. Try free: https://tidbcloud.com/free-trial
TileDB	1359	https://github.com/TileDB-Inc/TileDB	https://tiledb.com	The Universal Storage Engine
timescaledb	13290	https://github.com/timescale/timescaledb	https://www.timescale.com/	An open-source time-series SQL database optimized for fast ingest and complex queries. Packaged as a PostgreSQL extension.
transformers	66291	https://github.com/huggingface/transformers	https://huggingface.co/transformers	🤗 Transformers: State-of-the-art Machine Learning for Pytorch TensorFlow and JAX.
trino	5658	https://github.com/trinodb/trino	https://trino.io	Official repository of Trino the distributed SQL query engine for big data formerly known as PrestoSQL (https://trino.io)
typedb	3127	https://github.com/vaticle/typedb	https://vaticle.com	TypeDB: a strongly-typed database
vaex	7136	https://github.com/vaexio/vaex	https://vaex.io	Out-of-Core hybrid Apache Arrow/NumPy DataFrame for Python ML visualization and exploration of big tabular data at a billion rows per second 🚀
vespa	3974	https://github.com/vespa-engine/vespa	https://vespa.ai	The open big data serving engine. https://vespa.ai
whale	693	https://github.com/hyperqueryhq/whale	https://docs.whale.cx	🐳 The stupidly simple CLI workspace for your data warehouse.
yolov5	28113	https://github.com/ultralytics/yolov5	https://ultralytics.com	YOLOv5 🚀 in PyTorch > ONNX > CoreML > TFLite
yugabyte-db	6599	https://github.com/yugabyte/yugabyte-db	https://www.yugabyte.com	The high-performance distributed SQL database for global internet-scale apps.