@ankurkumarz
Last active March 23, 2025 03:39
Modern Data Engineering Stack using Open Source Technologies

| Category | Tool | Key Strengths | Use Case |
| --- | --- | --- | --- |
| Data Integration | Airbyte | Extensive connector library, user-friendly interface | Seamless data migration across platforms |
| | Apache Hop | Visual workflows, flexibility | Complex data orchestration tasks |
| | Meltano | CLI & version control for ELT | Customizable data pipelines |
| Data Query Engines | Dremio | High performance, ease of use | Complex data environments |
| | Presto/Trino | Large community, vendor-neutral | Interactive analytics, diverse sources |
| | Apache Drill | Schema-free querying | Flexible, niche use cases |
| | Apache Pinot | Real-time OLAP, low-latency analytics | Real-time analytics on event data |
| Data Processing Platforms | Spark | Batch processing, large ecosystem | General-purpose data tasks |
| | Flink | Real-time streaming, low-latency | Streaming analytics |
| Data Storage Formats | Parquet | Columnar, analytical queries | Big data analytics |
| | Avro | Row-based, schema evolution | Serialization, messaging |
| | Delta Lake | Transactions, time travel | Enhanced data lake management |
| Table Storage Formats | Delta Lake | Spark integration, ACID transactions | Spark-heavy environments |
| | Iceberg | Vendor-neutral, broad compatibility | Multi-platform setups |
| | Hudi | Real-time ingestion, CDC support | Near-real-time updates |
| ELT Processing | dbt | Open-source, versatile warehouses | Broad data warehouse support |
| | Dataform | Optimized for BigQuery, GCP integration | Google Cloud users |
| Workflow Orchestration | Airflow | Mature, extensive integrations | Stable, large-scale pipelines |
| | Dagster | Modern, asset-based, flexible | Dynamic, complex workflows |
| Data Observability | Datafold | Data diffing, column-level lineage | Data quality monitoring, anomaly detection |
| Data Visualization | Grafana | Interactive dashboards, extensive plugin ecosystem | Monitoring, real-time data visualization |
| Data Storage Systems | CrateDB | Distributed SQL, real-time analytics | IoT data management, time-series data |
| | Alluxio | Virtual distributed file system, memory-speed data access | Data orchestration between storage systems |
| In-Memory Data Stores | Valkey | High-performance, versatile data structures | Caching, real-time analytics |
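
The table lists Parquet as columnar and Avro as row-based. As a rough illustration of why that distinction matters for analytical queries, here is a minimal sketch in plain Python. It does not use the Parquet or Avro formats or any of their libraries. The sample records and the `to_columnar` helper are hypothetical, made up only for this example.

```python
# Illustration only: plain Python structures, not the Parquet or Avro formats.
# The records and field names below are invented for this sketch.
records = [
    {"user_id": 1, "country": "DE", "amount": 12.5},
    {"user_id": 2, "country": "US", "amount": 7.0},
    {"user_id": 3, "country": "DE", "amount": 3.5},
]


def to_columnar(rows):
    """Pivot a list of row dicts into one list of values per column."""
    columns = {}
    for row in rows:
        for name, value in row.items():
            columns.setdefault(name, []).append(value)
    return columns


columns = to_columnar(records)

# Row layout: an aggregate over one field still walks every full record.
total_from_rows = sum(row["amount"] for row in records)

# Column layout: the same aggregate reads a single list and skips the rest.
total_from_columns = sum(columns["amount"])
```

Both totals are the same. The difference is how much data each layout has to touch to compute them, which is roughly why a columnar layout suits analytics and a row layout suits record-at-a-time serialization and messaging.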