| Category | Tool | Key Strengths | Use Case |
|---|---|---|---|
| Data Integration | Airbyte | Extensive connector library, user-friendly interface | Seamless data migration across platforms |
| | Apache Hop | Visual workflows, flexibility | Complex data orchestration tasks |
| | Meltano | CLI & version control for ELT | Customizable data pipelines |
| Data Query Engines | Dremio | High performance, ease of use | Complex data environments |
| | Presto/Trino | Large community, vendor-neutral | Interactive analytics, diverse sources |
| | Apache Drill | Schema-free querying | Flexible, niche use cases |
| | Apache Pinot | Real-time OLAP, low-latency analytics | Real-time analytics on event data |
| Data Processing Platforms | Spark | Batch processing, large ecosystem | General-purpose data tasks |
| | Flink | Real-time streaming, low-latency | Streaming analytics |
| Data Storage Formats | Parquet | Columnar, analytical queries | Big data analytics |
| | Avro | Row-based, schema evolution | Serialization, messaging |
| | Delta Lake | Transactions, time travel | Enhanced data lake management |
| Table Storage Formats | Delta Lake | Spark integration, ACID transactions | Spark-heavy environments |
| | Iceberg | Vendor-neutral, broad compatibility | Multi-platform setups |
| | Hudi | Real-time ingestion, CDC support | Near-real-time updates |
| ELT Processing | dbt | Open-source, works with many warehouses | Broad data warehouse support |
| | Dataform | Optimized for BigQuery, GCP integration | Google Cloud users |
| Workflow Orchestration | Airflow | Mature, extensive integrations | Stable, large-scale pipelines |
| | Dagster | Modern, asset-based, flexible | Dynamic, complex workflows |
| Data Observability | Datafold | Data diffing, column-level lineage | Data quality monitoring, anomaly detection |
| Data Visualization | Grafana | Interactive dashboards, extensive plugin ecosystem | Monitoring, real-time data visualization |
| Data Storage Systems | CrateDB | Distributed SQL, real-time analytics | IoT data management, time-series data |
| | Alluxio | Virtual distributed file system, memory-speed data access | Data orchestration between storage systems |
| In-Memory Data Stores | Valkey | High-performance, versatile data structures | Caching, real-time analytics |

Last active: March 23, 2025 03:39
Modern Data Engineering Stack using Open Source Technologies