Data Engineering Fundamentals

Chapter 1: Data Engineering Described


Chapter 2: The Data Engineering Lifecycle


Batch versus streaming

Virtually all data we deal with is inherently streaming. Data is nearly always produced and updated continually at its source. Batch ingestion is simply a specialized and convenient way of processing this stream in large chunks—for example, handling a full day’s worth of data in a single batch.

Streaming ingestion allows us to provide data to downstream systems—whether other applications, databases, or analytics systems—in a continuous, real-time fashion. Here, real-time (or near real-time) means that the data is available to a downstream system a short time after it is produced (e.g., less than one second later). The latency required to qualify as real-time varies by domain and requirements.

Batch data is ingested either on a predetermined time interval or as data reaches a preset size threshold. Batch ingestion is a one-way door: once data is broken into batches, the latency for downstream consumers is inherently constrained. Because of limitations of legacy systems, batch was for a long time the default way to ingest data. Batch processing remains an extremely popular way to ingest data for downstream consumption, particularly in analytics and ML.
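As a concrete (if simplified) illustration of those two triggers, here is a minimal Python sketch of batch ingestion that flushes a buffer when it reaches a size threshold or when a time interval elapses; the thresholds and the `write_batch` sink are illustrative assumptions, not from the book:

```python
import time

BATCH_SIZE = 1000          # flush after this many records...
BATCH_INTERVAL_S = 60.0    # ...or after this many seconds

def ingest(stream, write_batch):
    """Chunk an incoming record stream into batches by size or time."""
    buffer, last_flush = [], time.monotonic()
    for record in stream:
        buffer.append(record)
        if len(buffer) >= BATCH_SIZE or time.monotonic() - last_flush >= BATCH_INTERVAL_S:
            write_batch(buffer)              # hand the chunk to downstream storage
            buffer, last_flush = [], time.monotonic()
    if buffer:
        write_batch(buffer)                  # flush any remainder at end of stream
```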

However, the separation of storage and compute in many systems and the ubiquity of event-streaming and processing platforms make the continuous processing of data streams much more accessible and increasingly popular. The choice largely depends on the use case and expectations for data timeliness.

Key considerations for batch versus stream ingestion

Should you go streaming-first? Despite the attractiveness of a streaming-first approach, there are many trade-offs to understand and think about. The following are some questions to ask yourself when determining whether streaming ingestion is an appropriate choice over batch ingestion:

  • If I ingest the data in real time, can downstream storage systems handle the rate of data flow?
  • Do I need millisecond real-time data ingestion? Or would a micro-batch approach work, accumulating and ingesting data, say, every minute?
  • What are my use cases for streaming ingestion? What specific benefits do I realize by implementing streaming? If I get data in real time, what actions can I take on that data that would be an improvement upon batch?
  • Will my streaming-first approach cost more in terms of time, money, maintenance, downtime, and opportunity cost than simply doing batch?
  • Are my streaming pipeline and system reliable and redundant if infrastructure fails?
  • What tools are most appropriate for the use case? Should I use a managed service (Amazon Kinesis, Google Cloud Pub/Sub, Google Cloud Dataflow) or stand up my own instances of Kafka, Flink, Spark, Pulsar, etc.? If I do the latter, who will manage it? What are the costs and trade-offs?
  • If I’m deploying an ML model, what benefits do I have with online predictions and possibly continuous training?
  • Am I getting data from a live production instance? If so, what’s the impact of my ingestion process on this source system?

As you can see, streaming-first might seem like a good idea, but it’s not always straightforward; extra costs and complexities inherently occur. Many great ingestion frameworks do handle both batch and micro-batch ingestion styles. We think batch is an excellent approach for many common use cases, such as model training and weekly reporting. Adopt true real-time streaming only after identifying a business use case that justifies the trade-offs against using batch.


Data Management

Data management has quite a few facets, including the following:

• Data governance, including discoverability and accountability

• Data modeling and design

• Data lineage

• Storage and operations

• Data integration and interoperability

• Data lifecycle management

• Data systems for advanced analytics and ML

• Ethics and privacy

Data governance

Think of the typical example of data governance being done poorly. A business analyst gets a request for a report but doesn’t know what data to use to answer the question. The recipient of the report also questions the validity of the data. The integrity of the analyst—and of all data in the company’s systems—is called into question. The company is confused about its performance, making business planning impossible.

Discoverability

Key components of data discoverability include metadata management and master data management.

Metadata

Metadata falls into two major categories: autogenerated and human-generated. We’re seeing a proliferation of data catalogs, data-lineage tracking systems, and metadata management tools.

• Business metadata

• Technical metadata

• Operational metadata

• Reference metadata

Business metadata relates to the way data is used in the business, including business and data definitions, data rules and logic, how and where data is used, and the data owner(s).

Technical metadata describes the data created and used by systems across the data engineering lifecycle. It includes the data model and schema, data lineage, field mappings, and pipeline workflows. A data engineer uses technical metadata to create, connect, and monitor various systems across the data engineering lifecycle.

• Pipeline metadata (often produced in orchestration systems)

• Data lineage

• Schema

Orchestration is a central hub that coordinates workflow across various systems. Pipeline metadata captured in orchestration systems provides details of the workflow schedule, system and data dependencies, configurations, connection details, and much more.

Data-lineage metadata tracks the origin and changes to data, and its dependencies, over time.

Schema metadata describes the structure of data stored in a system such as a database, a data warehouse, a data lake, or a filesystem; it is one of the key differentiators across different storage systems. Object stores, for example, don’t manage schema metadata; instead, this must be managed in a metastore. On the other hand, cloud data warehouses manage schema metadata internally.

Operational metadata describes the operational results of various systems and includes statistics about processes, job IDs, application runtime logs, data used in a process, and error logs. A data engineer uses operational metadata to determine whether a process succeeded or failed and the data involved in the process.

Reference metadata is data used to classify other data. This is also referred to as lookup data. Standard examples of reference data are internal codes, geographic codes, units of measurement, and internal calendar standards.

Data accountability

Data accountability means assigning an individual to govern a portion of data. The responsible person then coordinates the governance activities of other stakeholders. An individual may be accountable for managing a customer ID across many systems.

Data quality (Can I trust this data?)

A data engineer ensures data quality across the entire data engineering lifecycle.

According to Data Governance: The Definitive Guide, data quality is defined by three main characteristics:

Accuracy

Is the collected data factually correct? Are there duplicate values? Are the numeric values accurate?

Completeness

Are the records complete? Do all required fields contain valid values?

Timeliness

Are records available in a timely fashion?
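To make these three characteristics actionable, here is a minimal pandas sketch of quality checks; the column names, thresholds, and the assumption that `ingested_at` is a timezone-aware UTC timestamp column are illustrative:

```python
import pandas as pd

def quality_report(df: pd.DataFrame) -> dict:
    """Return simple accuracy, completeness, and timeliness counts."""
    now = pd.Timestamp.now(tz="UTC")
    return {
        # Accuracy: duplicate records and implausible numeric values
        "duplicate_rows": int(df.duplicated().sum()),
        "negative_amounts": int((df["amount"] < 0).sum()),
        # Completeness: required fields must contain valid values
        "missing_customer_id": int(df["customer_id"].isna().sum()),
        # Timeliness: records older than 24 hours at check time
        "stale_records": int((now - df["ingested_at"] > pd.Timedelta("24h")).sum()),
    }
```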

Data modeling and design

To derive business insights from data, through business analytics and data science, the data must be in a usable form. The process for converting data into a usable form is known as data modeling and design.

Data lineage

As data moves through its lifecycle, how do you know what system affected the data or what the data is composed of as it gets passed around and transformed? Data lineage describes the recording of an audit trail of data through its lifecycle, tracking both the systems that process the data and the upstream data it depends on.

For example, if a user would like their data deleted from your systems, having lineage for that data lets you know where that data is stored and its dependencies.
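As a minimal sketch of what recording lineage can look like in practice, the snippet below appends a lineage record each time a job writes a dataset, so a deletion request can later be traced to every downstream copy; the JSONL log and record fields are illustrative assumptions, not a standard format:

```python
import json
import time

LINEAGE_LOG = "lineage.jsonl"

def record_lineage(dataset, produced_by, inputs):
    """Append one lineage entry: what was written, by what, from what."""
    entry = {
        "dataset": dataset,          # e.g., a warehouse table or object path
        "produced_by": produced_by,  # the job or system that wrote it
        "inputs": inputs,            # upstream datasets it depends on
        "recorded_at": time.time(),
    }
    with open(LINEAGE_LOG, "a") as f:
        f.write(json.dumps(entry) + "\n")

record_lineage("warehouse.users_clean", "clean_users_job",
               ["s3://lake/raw/users"])
```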

Data integration and interoperability

Data integration and interoperability is the process of integrating data across tools and processes. For example, a data pipeline might pull data from the Salesforce API, store it to Amazon S3, call the Snowflake API to load it into a table, call the API again to run a query, and then export the results to S3 where Spark can consume them. As pipelines like this multiply, you quickly stumble into the need for orchestration.
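A hedged sketch of that flow is shown below. The Salesforce extraction step is a hypothetical placeholder, and the bucket, stage, table names, and credentials are all illustrative; treat this as the shape of the pipeline, not a production implementation:

```python
import boto3
import snowflake.connector

# Hypothetical stand-in for records pulled from the Salesforce API.
payload = b'[{"id": 1, "region": "EMEA", "amount": 100}]'

# 1. Land raw data in S3.
s3 = boto3.client("s3")
s3.put_object(Bucket="my-raw-bucket", Key="sales/today.json", Body=payload)

# 2. Load into Snowflake, run a query, and export results back to S3
#    (assumes stages pointing at the buckets have been created).
conn = snowflake.connector.connect(user="...", password="...", account="...")
cur = conn.cursor()
cur.execute("COPY INTO sales_raw FROM @raw_stage/sales/today.json")
cur.execute(
    "COPY INTO @export_stage/sales_summary FROM "
    "(SELECT region, SUM(amount) FROM sales_raw GROUP BY region)"
)
```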

Data lifecycle management

The advent of data lakes encouraged organizations to ignore data archival and destruction. Why discard data when you can simply add more storage ad infinitum?

First, data is increasingly stored in the cloud.

Second, privacy and data retention laws such as the GDPR and the CCPA require data engineers to actively manage data destruction to respect users’ “right to be forgotten.”

Ethics and privacy

Data engineers need to ensure that datasets mask personally identifiable information (PII) and other sensitive information. They must also ensure that data assets are compliant with a growing number of data regulations, such as GDPR and CCPA.
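One minimal sketch of masking PII before data lands in analytics systems: hash direct identifiers and drop fields with no analytic value. The field names and salt handling are illustrative; a real deployment needs proper key management:

```python
import hashlib

SALT = b"rotate-me"  # illustrative; store and rotate securely in practice

def mask_pii(record: dict) -> dict:
    """Return a copy of the record safe for analytics use."""
    masked = dict(record)
    # Hash the identifier so joins still work without exposing the raw value.
    masked["email"] = hashlib.sha256(SALT + record["email"].encode()).hexdigest()
    masked.pop("ssn", None)  # drop sensitive fields outright
    return masked

print(mask_pii({"email": "jane@example.com", "ssn": "123-45-6789", "amount": 42}))
```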

DataOps

DataOps maps the best practices of Agile methodology, DevOps, and statistical process control (SPC) to data. Whereas DevOps aims to improve the release and quality of software products, DataOps does the same thing for data products.

First and foremost, DataOps is a set of cultural habits.

We suggest starting with observability and monitoring to get a window into the performance of a system, then adding automation and incident response.

DataOps has three core technical elements: automation, monitoring and observability, and incident response.


Automation

Automation enables reliability and consistency in the DataOps process and allows data engineers to quickly deploy new product features and improvements to existing workflows. DataOps automation resembles regular CI/CD, with the added dimension of checking for data quality, data/model drift, metadata integrity, and more.

Consider a team running its pipelines as cron jobs. If the cron jobs are hosted on a cloud instance, the instance may have an operational problem, causing the jobs to stop running unexpectedly. As the spacing between jobs becomes tighter, a job will eventually run long, causing a subsequent job to fail or produce stale data. Engineers may not be aware of job failures until they hear from analysts that their reports are out-of-date. At that point, the team typically adopts an orchestration framework, perhaps Airflow (GCP Cloud Composer is built on Apache Airflow) or Dagster.

The team migrates its cron jobs to Airflow jobs. Now dependencies are checked before jobs run, and more transformation jobs can be packed into a given window of time because each job can start as soon as its upstream data is ready rather than at a fixed, predetermined time.
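A minimal sketch of what those migrated jobs might look like as an Airflow DAG; the DAG id, schedule, and task callables are illustrative:

```python
from datetime import datetime

from airflow import DAG
from airflow.operators.python import PythonOperator

def extract():
    pass  # pull data from the source system

def transform():
    pass  # clean and model the data

def load():
    pass  # write to the warehouse

with DAG(dag_id="daily_sales", start_date=datetime(2024, 1, 1),
         schedule_interval="@daily", catchup=False) as dag:
    t_extract = PythonOperator(task_id="extract", python_callable=extract)
    t_transform = PythonOperator(task_id="transform", python_callable=transform)
    t_load = PythonOperator(task_id="load", python_callable=load)

    # Each task starts as soon as its upstream dependency succeeds.
    t_extract >> t_transform >> t_load
```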

The data engineering team still has room for operational improvements. A data scientist eventually deploys a broken DAG, bringing down the Airflow web server, so the team stops allowing manual DAG deployments and adopts automated DAG deployment: DAGs are tested before deployment, and monitoring processes ensure that the new DAGs start running properly.

The team also blocks the deployment of new Python dependencies until installation is validated.

Observability and monitoring

Observability, monitoring, logging, alerting, and tracing are all critical to getting ahead of any problems along the data engineering lifecycle. We recommend you incorporate SPC to understand whether events being monitored are out of line and which incidents are worth responding to.
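As one minimal illustration of SPC applied to pipeline monitoring, the sketch below flags a run whose duration falls outside three standard deviations of recent history; the runtimes and threshold are illustrative:

```python
import statistics

def out_of_control(history, latest, sigmas=3.0):
    """Return True if the latest value is outside the control limits."""
    mean = statistics.mean(history)
    sd = statistics.stdev(history)
    return abs(latest - mean) > sigmas * sd

recent_runtimes = [312, 298, 305, 321, 290, 310]  # seconds
print(out_of_control(recent_runtimes, latest=505))  # True: worth investigating
```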

Incident response

Incident response is about using the automation and observability capabilities mentioned previously to rapidly identify root causes of an incident and resolve it as reliably and quickly as possible.

As Werner Vogels, CTO of Amazon Web Services, is famous for saying, “Everything breaks all the time.”

Orchestration

An orchestration engine builds in metadata on job dependencies, generally in the form of a directed acyclic graph (DAG). The DAG can be run once or scheduled to run at a fixed interval: daily, weekly, every hour, every five minutes, etc. Orchestrators can also run new jobs any time they are deployed.

Orchestration systems also build job history capabilities, visualization, and alerting. For example, a monthly reporting job might check that an ETL job has been completed for the full month before starting.
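A minimal sketch of that gating pattern using an Airflow sensor; the DAG and task ids are illustrative, and a real setup would also need to align the execution dates of the two DAGs:

```python
from datetime import datetime

from airflow import DAG
from airflow.operators.python import PythonOperator
from airflow.sensors.external_task import ExternalTaskSensor

def build_report():
    pass  # placeholder for report logic

with DAG(dag_id="monthly_report", start_date=datetime(2024, 1, 1),
         schedule_interval="@monthly", catchup=False) as dag:
    # Wait for the upstream ETL DAG's load task before reporting.
    wait_for_etl = ExternalTaskSensor(
        task_id="wait_for_etl",
        external_dag_id="daily_sales",   # hypothetical upstream DAG
        external_task_id="load",
        timeout=6 * 60 * 60)             # give up after six hours
    report = PythonOperator(task_id="build_report", python_callable=build_report)

    wait_for_etl >> report
```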

Apache Oozie was extremely popular in the 2010s, but it was designed to work within a Hadoop cluster and was difficult to use in a more heterogeneous environment. Facebook developed Dataswarm for internal use in the late 2000s; this inspired popular tools such as Airflow, introduced by Airbnb in 2014 (GCP Cloud Composer is built on Apache Airflow). Other orchestration tools include Luigi and Conductor, as well as Prefect and Dagster, which aim to improve the portability and testability of DAGs. Argo is an orchestration engine built around Kubernetes primitives, and Metaflow is an open source project out of Netflix that aims to improve data science orchestration. Orchestration is strictly a batch concept; for streaming pipelines, platforms such as Pulsar aim to dramatically reduce the engineering and operational burden.

Streaming

OpenFaaS, AWS Lambda, and Google Cloud Functions for handling individual events.

Dedicated stream processors (Spark, Beam, Flink, or Pulsar) for analyzing streams to support reporting and real-time actions.

Pipelines as code

Pipelines as code is the core concept of present-day orchestration systems, which touch every stage of the data engineering lifecycle. Data engineers use code (typically Python) to declare data tasks and dependencies among them. The orchestration engine interprets these instructions to run steps using available resources.

Chapter 3: Designing Good Data Architecture


For instance, how will you move 10 TB of data every hour from a source database to your data lake? In short, operational architecture describes what needs to be done, and technical architecture details how it will happen.


Principles of Good Data Architecture

Never shoot for the best architecture, but rather the least worst architecture. — Mark Richards and Neal Ford

  1. Choose common components wisely.
  2. Plan for failure.
  3. Architect for scalability.
  4. Architecture is leadership.
  5. Always be architecting.
  6. Build loosely coupled systems.
  7. Make reversible decisions.
  8. Prioritize security.
  9. Embrace FinOps.

Domain and Services

A domain is the real-world subject area for which you’re architecting. A service is a set of functionality whose goal is to accomplish a task.


Considerations for data architecture

For example, data pipelines might consume data from many sources ingested into a central data warehouse. The central data warehouse is inherently monolithic. A move toward a microservices equivalent with a data warehouse is to decouple the workflow with domain-specific data pipelines connecting to corresponding domain-specific data warehouses. For example, the sales data pipeline connects to the sales-specific data warehouse, and the inventory and product domains follow a similar pattern.

One approach is for a single team to be responsible for gathering data from all domains and reconciling it for consumption across the organization (this is common in traditional data warehousing). Another approach is the data mesh.

Examples and Types of Data Architecture

1. Data Warehouse

A data warehouse is a central data hub used for reporting and analysis. Data in a data warehouse is typically highly formatted and structured for analytics use cases. It’s among the oldest and most well-established data architectures. In 1989, Bill Inmon originated the notion of the data warehouse. The organizational data warehouse architecture organizes data associated with certain business team structures and processes. The technical data warehouse architecture reflects the technical nature of the data warehouse, such as massively parallel processing (MPP).

The organizational data warehouse architecture has two main characteristics:

  1. Separates online analytical processing (OLAP) from production databases (online transaction processing) - This separation is critical as businesses grow. Moving data into a separate physical system directs load away from production systems and improves analytics performance.
  2. Centralizes and organizes data - Traditionally, a data warehouse pulls data from application systems by using ETL. The extract phase pulls data from source systems. The transformation phase cleans and standardizes data, organizing and imposing business logic in a highly modeled form. (Chapter 8 covers transformations and data models.) The load phase pushes data into the data warehouse target database system. Data is loaded into multiple data marts that serve the analytical needs of specific lines of business and departments.

ELT is also popular in a streaming arrangement, as events are streamed from a CDC process, stored in a staging area, and then subsequently transformed within the data warehouse. A second version of ELT was popularized during big data growth in the Hadoop ecosystem. This is transform-on-read ELT.

image

Examples of cloud data warehouses include Amazon Redshift, Google BigQuery, and Snowflake (a third-party offering).

2. Data marts

A data mart is a more refined subset of a warehouse designed to serve analytics and reporting, focused on a single suborganization, department, or line of business; every department has its own data mart, specific to its needs. This is in contrast to the full data warehouse that serves the broader organization or business.

Data marts exist for two reasons. First, a data mart makes data more easily accessible to analysts and report developers. Second, data marts provide an additional stage of transformation beyond that provided by the initial ETL or ELT pipelines.


3. Data Lake

Instead of imposing tight structural limitations on data, why not simply dump all of your data—structured and unstructured—into a central location? Data lake 1.0 started with HDFS (in 2006). You can pick your favorite data-processing technology for the task at hand — MapReduce, Spark, Ray, Presto, Hive, etc.

Headaches soon followed: GDPR required targeted deletion of user records, joins were a huge headache, and companies had to spend big on talent to write MapReduce jobs.

Later frameworks such as Pig and Hive somewhat improved the situation.

Data Lakehouse

Databricks, Snowflake, GCP BigLake

Databricks introduced the notion of a data lakehouse. The lakehouse incorporates the controls, data management, and data structures found in a data warehouse while still housing data in object storage and supporting a variety of query and transformation engines. In particular, the data lakehouse supports atomicity, consistency, isolation, and durability (ACID) transactions, a big departure from the original data lake, where you simply pour in data and never update or delete it. The term data lakehouse suggests a convergence between data lakes and data warehouses. Cloud data warehouses separate compute from storage, support petabyte-scale queries, store a variety of unstructured data and semistructured objects, and integrate with advanced processing technologies such as Spark or Beam.

image

Lambda Architecture

In practice, Lambda architecture produces error-prone systems, with code and data that are extremely difficult to reconcile.


The distinction between row-based and columnar storage databases addresses specific needs for managing and querying large datasets. Let’s explore this with a simple example:

Row-based Database

What it is: In a row-based database, data is stored in rows, meaning each row holds one record with all its fields.

Example Use Case: Consider a transactional system like an order processing system where every transaction involves creating or updating records. Here, efficiency in writing operations is crucial.

Strengths:

  • Optimized for writing operations as the entire record can be written in a single operation.
  • Well-suited for OLTP systems where transactions are frequently updated and accessed as a whole.

Limitations:

  • Not ideal for analytical queries that only require a subset of columns as the entire row must be read.

Columnar Database

What it is: In a columnar database, data is stored in columns, so each column’s data is stored together.

Example Use Case: Consider an analytical report that needs to process millions of records but only a few columns. For example, calculating the average age of users from a large dataset.

Strengths:

  • Highly efficient for read-intensive operations, particularly analytics, as it can quickly scan and aggregate data from only the necessary columns.
  • Offers better compression which reduces the storage requirements and I/O operations.

Limitations:

  • Not optimized for frequent writing or updating of records as it could involve re-writing large parts of columns.

Why They Exist

Row-based databases exist primarily to handle high transaction volumes efficiently, where the entire record is typically required in each operation. They excel in environments where data integrity and quick, complex queries over individual records are critical.

Columnar databases are designed to solve the efficiency problems in analytics and reporting use cases encountered with row-based databases. They allow for better performance and lower costs when the query only involves a subset of columns, which is common in data analysis.
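The column-pruning advantage is easy to demonstrate with a columnar file format such as Parquet; in the sketch below (file path illustrative), only the one column the query needs is read from disk:

```python
import pyarrow.parquet as pq

# Scans just the `age` column; the other columns are never read from disk.
table = pq.read_table("users.parquet", columns=["age"])
print(table["age"].to_pandas().mean())
```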

Mapping to AWS and GCP Services

In AWS, Amazon RDS and Amazon DynamoDB offer traditional row-based storage, while Amazon Redshift is a good example of columnar storage designed for large-scale data warehousing and analytics.

In GCP, Cloud SQL and Cloud Spanner provide row-based storage solutions, with Cloud Spanner offering horizontal scalability and strong consistency across global databases. BigQuery, on the other hand, uses a columnar storage mechanism that makes it ideal for large-scale data analysis.

This understanding of the differences between these types of databases can be crucial for developing efficient data handling strategies in your role as a Customer Engineer at GCP.

Presto, Hive, and Ray are three distinct systems, each designed to solve specific challenges in data processing and computing environments. Here’s a breakdown of each, including their development history, primary purposes, and the platforms they work on:

1. Presto

  • What it is: Presto is a high-performance, distributed SQL query engine designed for interactive analytic queries against data sources of all sizes ranging from gigabytes to petabytes.
  • Development: Presto was originally developed by Facebook to allow querying enormous amounts of data stored in several systems including HDFS and Apache Hive. The project was open-sourced in 2013 and is now governed by the Presto Foundation under the Linux Foundation.
  • Platform: Presto works across multiple data sources like HDFS, S3, Cassandra, relational databases, and proprietary data stores. It is designed to work in a cluster environment where it can scale out across multiple machines and is commonly used in conjunction with Hadoop.

2. Hive

  • What it is: Apache Hive is a data warehousing and SQL-like query language (HiveQL) that facilitates reading, writing, and managing large datasets residing in distributed storage using SQL. It is built on top of Apache Hadoop for providing data summarization, query, and analysis.
  • Development: Hive was developed by Facebook and later donated to the Apache Software Foundation for further development. It was primarily created to make querying and analyzing easy on Hadoop, particularly for those familiar with SQL.
  • Platform: Hive runs on top of Apache Hadoop, utilizing HDFS for storage and can integrate with other data storage systems like Apache HBase. Hive queries are converted into MapReduce, Tez, or Spark jobs.

3. Ray

  • What it is: Ray is an open-source framework that provides a simple, universal API for building distributed applications. It is designed for scaling machine learning and Python workflows.
  • Development: Ray was developed by the RISELab at UC Berkeley in order to address the needs of high-performance distributed computing and ease of programming. The project aims to improve efficiency and accessibility for developers needing to scale complex systems.
  • Platform: Ray can run on any POSIX-compliant system (like Linux or macOS) and can scale across a cluster of machines. It is often used with deep learning frameworks like TensorFlow, PyTorch, and Keras, as well as data processing frameworks like Apache Spark.

Purpose and Platforms

Each of these systems serves different purposes:

  • Presto is optimized for interactive queries where fast analytic capabilities are required over big data.
  • Hive is more suited for batch processing applications and jobs where data warehousing and SQL-like capabilities are needed over huge datasets.
  • Ray is focused on providing a straightforward platform for building and running distributed applications, particularly in machine learning and other intensive computational tasks.

They operate on various platforms mainly within big data ecosystems, with Ray extending into machine learning frameworks, showing their versatility and broad applications in modern data and computing operations.

Apache Kafka and Apache Storm are both powerful tools used for handling real-time data streams, but they serve different purposes and excel in different aspects of stream processing. Here’s a detailed comparison, including their creation, primary use cases, and inherent differences.

Apache Kafka

  • What it is: Apache Kafka is a distributed event streaming platform capable of handling trillions of events a day. It is designed for high-throughput, fault-tolerant, publish-subscribe messaging system and used to build real-time streaming data pipelines and applications.
  • Development: Kafka was originally developed at LinkedIn and open-sourced in early 2011; it later became an Apache Software Foundation project. It was created to handle LinkedIn's increasing data ingestion and processing needs.
  • Primary Use Cases:
    • Data Integration: Kafka is often used as a central hub for data streams, integrating data from multiple sources to multiple destinations.
    • Real-Time Analytics: Kafka is used for real-time analytics where there is a need to process data as it arrives.
  • Platform: Kafka runs as a cluster on one or more servers that can span multiple datacenters. The Kafka cluster stores streams of records in categories called topics.
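A minimal sketch of Kafka's publish-subscribe model using the kafka-python client; the broker address, topic, and consumer group are illustrative:

```python
from kafka import KafkaProducer, KafkaConsumer

# Publish a record to a topic; records with the same key land in the same partition.
producer = KafkaProducer(bootstrap_servers="localhost:9092")
producer.send("clicks", key=b"user-42", value=b'{"page": "/home"}')
producer.flush()

# Subscribe and read; consumers in the same group share the topic's partitions.
consumer = KafkaConsumer(
    "clicks",
    bootstrap_servers="localhost:9092",
    group_id="analytics",
    auto_offset_reset="earliest")
for record in consumer:
    print(record.partition, record.offset, record.value)
```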

Apache Storm

  • What it is: Apache Storm is a distributed real-time computation system for processing large streams of data. Storm is designed for real-time analytics, online machine learning, continuous computation, distributed RPC, ETL, and more.
  • Development: Storm was originally created by Nathan Marz and team at BackType and was open-sourced after being acquired by Twitter in 2011. It became an Apache project in 2014.
  • Primary Use Cases:
    • Real-Time Analytics: Storm processes data in real-time, making it suitable for applications that require immediate insights.
    • Continuous Computation: It can run algorithms that need to update results as data streams through, like updating machine learning models.
  • Platform: Storm runs on its own cluster coordinated via Apache ZooKeeper (it can also be deployed on YARN) and can be integrated with the Hadoop ecosystem, including HDFS and HBase for storage, allowing it to manage large-scale data processing.

Key Differences

  • Processing Model:
    • Kafka is fundamentally a messaging system, designed to be a durable and scalable platform for handling streaming data. It uses a publish-subscribe model where data is written to a log and processed in order or partition.
    • Storm provides computation capabilities, often described as the "Hadoop of real-time", and processes data as it flows through the system using "spouts" and "bolts" to transform or aggregate the stream.
  • Fault Tolerance:
    • Kafka uses replication to ensure that data is not lost if a server fails. It ensures data durability through its log storage mechanism.
    • Storm ensures message processing at least once or exactly once with its built-in fault tolerance and message replay capabilities.
  • Scalability:
    • Kafka can handle very high throughputs with minimal latency, making it highly scalable.
    • Storm can also scale by adding more nodes to the cluster, and it parallelizes data processing across multiple nodes.

Conclusion

In summary, Kafka is optimized for data ingestion and dissemination within a durable, scalable architecture, making it ideal for building data pipelines and streaming applications. Storm, on the other hand, excels in complex transformations and real-time data processing tasks. Their uses are often complementary, where Kafka is used for data transport and buffering, and Storm is used for processing the data Kafka streams. Together, they form a robust framework for handling, processing, and analyzing real-time data streams.

Apache Storm, Apache Samza, and Apache Spark Streaming are all distributed stream processing frameworks, but each has its own unique attributes, strengths, and use cases. Let's dive into their differences, origins, and when one might be chosen over the others.

Apache Storm

  • What it is: Apache Storm is a real-time computation system that allows you to process large streams of data fast. It is robust and fault-tolerant with the capability to process messages as they arrive.
  • Development: Created by Nathan Marz at BackType and later acquired by Twitter, Storm was open-sourced in 2011 and became part of the Apache Software Foundation in 2014.
  • Use Cases: Storm is well-suited for scenarios requiring real-time analytics, continuous computation, and quick processing of information as it arrives. It's ideal when you need to guarantee that each message will be processed, even in the event of machine failures.
  • Strengths: High processing speed; guarantees at-least-once data processing (exactly once with additional configuration); excels in real-time computation.
  • Platform: Integrates easily with the broader Hadoop ecosystem.

Apache Samza

  • What it is: Apache Samza is a stream processing framework built on top of Apache Kafka for large-scale message processing. It is also designed to be fault-tolerant, durable, and scalable.
  • Development: Developed at LinkedIn and donated to the Apache Software Foundation in 2013.
  • Use Cases: Samza is particularly useful for stateful processing of streams, where each message might be related to previous messages. It's good for applications that require a combination of streaming and batch processing.
  • Strengths: Offers robust stateful processing capabilities, local state storage with fault-tolerance mechanisms, and tight integration with Apache Kafka.
  • Platform: Often used with Apache Kafka for messaging, and can integrate with Hadoop YARN for resource management.

Apache Spark Streaming

  • What it is: Apache Spark Streaming is an extension of the core Apache Spark API that enables scalable, high-throughput, fault-tolerant stream processing of live data streams.
  • Development: Developed at UC Berkeley's AMPLab, and later donated to the Apache Software Foundation.
  • Use Cases: Spark Streaming is used for complex transformations and aggregations of live data. It is particularly strong when you need to perform complex analytics that may involve joining streams or batch processing alongside stream processing.
  • Strengths: Integrates flawlessly with other components of the Apache Spark ecosystem for complex analytics, supports windowing capabilities, and can process data in micro-batches.
  • Platform: Runs on Apache Spark, making it easy to integrate with batch and interactive queries, machine learning algorithms, and graph processing.

Key Differences

  • Processing Model:

    • Storm processes each message individually in real-time without the concept of micro-batching.
    • Samza processes messages in a more Kafka-centric way, leveraging Kafka's partitions for fault tolerance and scalability.
    • Spark Streaming processes data in micro-batches, which may introduce a slight delay but allows for more comprehensive processing capabilities.
  • Fault Tolerance:

    • Storm replays messages from the source on failure.
    • Samza stores intermediate message states in Kafka to recover from failures.
    • Spark Streaming achieves fault tolerance through lineage information to replay lost data.
  • Ease of Use:

    • Storm can be more complex to set up and manage compared to Spark Streaming.
    • Samza provides a simpler API but is tightly coupled with Kafka.
    • Spark Streaming benefits from the ease of use inherent in the broader Spark ecosystem.

When to Use What

  • Use Storm if you require a system that processes data in real-time with strong guarantees on data processing.
  • Use Samza if your architecture is heavily based on Kafka and you need excellent support for stateful processing within a stream.
  • Use Spark Streaming if you need advanced analytics capabilities beyond simple aggregation, and the ability to integrate seamlessly with batch and interactive queries, or if you are already using Spark for other data processing tasks.

Each of these frameworks has been developed to meet specific needs within the landscape of big data processing, offering unique features and optimizations based on different operational requirements and scenarios.

Including Apache Flink into the comparison with Apache Storm, Apache Samza, and Apache Spark Streaming provides a fuller picture of the distributed stream processing frameworks available today. Each framework has distinct features and optimal use cases.

Apache Flink

  • What it is: Apache Flink is a framework and distributed processing engine for stateful computations over unbounded and bounded data streams. Flink has been designed to run in all common cluster environments, perform computations at in-memory speed, and at any scale.
  • Development: Originating from the Stratosphere research project at the Technical University of Berlin, Flink was brought into the Apache Software Foundation in 2014.
  • Use Cases: Flink excels in scenarios requiring accurate handling of time and state in complex sessions, windows, and patterns in streams. It is particularly effective for applications such as real-time analytics, event-driven applications, and complex event processing.
  • Strengths: Provides true streaming (not micro-batch) with a robust mechanism for state management and event time processing. It supports exactly-once semantics for state consistency.
  • Platform: Flink runs independently of Hadoop but can integrate with Hadoop’s storage systems and resource management.

Revised Comparison with Apache Storm, Apache Samza, and Apache Spark Streaming

Processing Model

  • Storm processes messages individually in real-time, ideal for low-latency processing.
  • Samza uses a Kafka-centric processing model, processing messages as they come in a fault-tolerant manner.
  • Spark Streaming processes data in micro-batches, which may introduce slight latency but supports complex transformations.
  • Flink processes true streaming data, handling event time processing superbly, and can also process in batch mode, offering a hybrid model.

Fault Tolerance

  • Storm provides fault tolerance by replaying messages from the source or maintaining the message state externally.
  • Samza ensures fault tolerance by storing states in Kafka, which also serves as the log for message replay.
  • Spark Streaming uses the lineage of RDDs (Resilient Distributed Datasets) to recover lost data, thus ensuring fault tolerance.
  • Flink maintains a consistently checkpointed state across distributed processes and can recover from checkpoints in case of failures.

Use Cases and Strengths

  • Storm is best for scenarios where the processing of each message needs to be guaranteed and done with minimal latency.
  • Samza is optimized for systems deeply integrated with Kafka, particularly effective in environments where maintaining state consistency is important.
  • Spark Streaming is ideal for batch-intensive workflows where complex analytics are performed on streams in conjunction with batch and interactive processes.
  • Flink excels in accurate time-window processing and state management across large scale distributed environments, suitable for highly accurate real-time applications and analytics.

Integration and Scalability

  • Storm, Samza, and Flink can all scale horizontally, but their integration with other systems and ease of use can vary.
    • Storm can integrate well with any queueing system but is often seen with Kafka.
    • Samza is tightly integrated with Kafka and benefits from Kafka’s scalability.
    • Spark Streaming integrates well within the Spark ecosystem, leveraging Spark’s core capabilities.
    • Flink offers strong APIs for both batch and stream processing, making it versatile in a variety of operational environments.

Choosing the Right Framework

  • Choose Storm if you need reliable, real-time processing with a mature system.
  • Opt for Samza if your infrastructure is Kafka-centric and requires robust streaming capabilities with state management.
  • Select Spark Streaming if you need rich analytics capabilities that integrate seamlessly with batch and interactive processing.
  • Go with Flink for use cases demanding accurate handling of time, stateful computations across streams and batches, and high-throughput real-time analytics.

Each framework is designed with particular strengths and architectural decisions tailored to different processing needs and operational environments, making the choice highly dependent on specific project requirements.

Apache Storm and Apache Flink both cater to the needs of real-time data processing but are built with different architectural principles and features. Understanding their strengths and limitations will help you decide which is best suited for specific scenarios. Let’s dive deeper into what each lacks, their supported languages, platforms, and when to use one over the other.

Apache Storm

Strengths:

  • Real-time processing: Storm excels in processing unbounded data streams in real-time, ensuring that each message is processed as fast as it arrives.
  • Scalability: It scales out seamlessly, handling very high throughputs with low latency.
  • Guaranteed data processing: Storm provides strong guarantees on data processing, with each message processed at least once, or exactly once with proper configuration.

Limitations:

  • Complex state management: Managing state within Storm can be complex and typically requires external systems like Apache ZooKeeper for state consistency across the cluster.
  • No native support for batch processing: Storm is purely stream-focused, which means it does not have built-in support for batch processing or a unified API for handling both stream and batch processing.

Supported Languages: Primarily supports Java, with APIs available for other languages like Python and Clojure via external libraries.

Use Cases:

  • Real-time analytics where low response times are critical.
  • Scenarios requiring the processing of each message individually with very low latency.
  • Systems where processing reliability and fault tolerance are crucial.

Apache Flink

Strengths:

  • True streaming model: Flink processes data as true streams (not micro-batches), which allows for very accurate and consistent results in stream processing.
  • Unified API: Offers a single API for handling both batch and real-time streaming data, making it easier to write, maintain, and scale applications that need to handle both types of data processing.
  • Advanced windowing and state management: Flink provides sophisticated mechanisms for time handling and state management, making complex event processing simpler and more robust.

Limitations:

  • Higher complexity: Flink’s rich feature set and capabilities come at the cost of a steeper learning curve compared to simpler streaming solutions like Storm.
  • Resource consumption: Given its robust capabilities, Flink can be more demanding in terms of system resources when not properly tuned.

Supported Languages: Java and Scala are first-class citizens, with APIs also available for Python.

Use Cases:

  • Applications requiring accurate handling of event times, especially useful in scenarios where out-of-order events are common.
  • Use cases that require seamless integration of streaming and batch processing.
  • Complex event processing where stateful computations are needed.

Decision Factors

  • Complexity vs. Simplicity: Choose Storm if you need a straightforward, reliable solution for real-time processing without the need for handling complex state or time. Opt for Flink if you require strong state management, sophisticated time handling, and a unified approach to batch and streaming data.
  • Project Requirements: If your project involves complex analytics that integrate real-time and batch processing, Flink is likely the better choice. If you are focusing exclusively on real-time processing and need mature, stable technology with a proven track record, Storm could be more appropriate.
  • Developer Experience: Consider the experience of your team. Java and Scala programmers might find Flink more engaging due to its extensive features and Scala-friendly APIs. If your team has expertise in Java and desires simplicity, then Storm’s Java-centric approach could be more suitable.

In conclusion, the choice between Storm and Flink often comes down to the specific requirements of the data processing tasks, the need for handling complex states or time, and the integration of streaming with batch processing. Flink, while more resource-intensive and complex, offers a broader set of capabilities, making it suitable for more advanced use cases that benefit from its rich feature set. Storm, on the other hand, offers simplicity and effectiveness in scenarios that require basic real-time processing without the overhead of managing complex states or time.

Apache Flink and Apache Spark Streaming are both powerful platforms for processing data streams, but they cater to slightly different needs and use cases due to their architectural differences and operational capabilities. Here's a concise yet comprehensive comparison to help you understand when to use one over the other, highlighting their strengths, limitations, supported languages, and platforms.

Apache Flink

Core Features:

  • True Streaming: Processes data continuously, without the need for micro-batching, leading to lower latency and more precise control over time and state.
  • Unified API: Supports batch and real-time data processing within the same framework, simplifying development and operational efforts.

Strengths:

  • Excellent for applications requiring accurate, consistent handling of state and time, especially in complex event processing.
  • Provides robust support for event-time processing and windowing.

Limitations:

  • Can be more complex to deploy and manage compared to Spark due to its pure streaming nature and the requirements for state and time management.
  • Might be resource-intensive for smaller-scale applications.

Supported Languages: Java, Scala, Python.

Platform: Operates independently of Hadoop but can integrate with Hadoop components for storage (like HDFS).

Apache Spark Streaming

Core Features:

  • Micro-Batch Processing: Processes data in small batches, allowing it to leverage the fast batch processing capabilities of Spark’s core engine.
  • Extensive Ecosystem: Benefits from integration with other Spark components like Spark SQL, MLLib for machine learning, and GraphX for graph processing.

Strengths:

  • Better suited for batch-intensive workflows where complex transformations are common.
  • Easier to integrate with big data technologies and existing Spark batch jobs.

Limitations:

  • Higher latency compared to Flink due to micro-batching.
  • Less effective at handling complex event-time processing and state management in streaming compared to Flink.

Supported Languages: Primarily Java, Scala, and Python.

Platform: Built on the Spark ecosystem, commonly deployed on Hadoop but can also run standalone or on other cloud platforms.

When to Use Which?

  • Use Flink:

    • When precise control over state and real-time processing is crucial.
    • For applications that require robust handling of event time to manage out-of-order events or late data.
    • When you need a unified approach to handle both real-time stream processing and batch processing without sacrificing the performance of either.
  • Use Spark Streaming:

    • When your project is already using Spark for batch processing and you want to leverage the same technology for stream processing.
    • If you need strong integration with complex analytical components like SQL, machine learning, or graph processing within the same application.
    • When micro-batch processing is sufficient for your latency requirements and you benefit from a simpler programming model that aligns closely with batch processing.

In summary, Flink offers lower latency processing and more sophisticated mechanisms for managing state and handling time in streaming data, making it ideal for more complex, stateful, real-time applications. Spark Streaming, on the other hand, is optimal for environments where batch processing is prevalent, and there is a need for deep integration with other data processing tasks, offering a more holistic and unified approach to processing large-scale data.

Your understanding is mostly correct, but there are some nuances and historical reasons for the development of Apache Beam that can help clarify why Google developed it and how it complements, rather than replaces, technologies like Apache Flink.

Spark and Flink Processing Capabilities

  • Apache Spark: Initially, Spark Streaming processed data in micro-batches, which allowed it to leverage the same fast computational engine used for batch processing. However, with the introduction of Structured Streaming in Spark 2.x, it's now capable of treating a stream as a continuous table to which data is being appended, which allows for a more "streaming-like" experience, though the underlying execution model is still fundamentally micro-batch.

  • Apache Flink: Flink was designed from the ground up to handle true streaming, processing each event individually as it arrives, with excellent support for event time and state management. Flink also supports batch processing, making it a true hybrid processor for handling both streams and batches effectively within the same framework.

Reasons for Apache Beam's Development

Despite Flink’s capabilities, Google developed Apache Beam for several reasons:

  1. Abstraction Over Processing Engines: Beam provides a layer of abstraction that allows developers to write generalized data processing pipelines that are agnostic of the underlying execution engine. This means the same code can run on Apache Flink, Apache Spark, Google Cloud Dataflow (which was a primary motivator for Google), and any other Beam-compatible processors. This decouples the pipeline development from the execution technology, giving flexibility to change the execution technology as needed without rewriting the data processing pipelines.

  2. Model Consistency Across Platforms: Beam aims to provide a consistent model for writing both batch and streaming data pipelines. This helps in reducing the cognitive load for developers who need to switch between different paradigms and ensures that there is no need to learn the intricacies and APIs of multiple frameworks.

  3. Portability: Beam pipelines can run on any supported execution environment without modification. This portability is key for cloud environments and scenarios where operational agility is necessary. For example, an organization might develop pipelines on a local Flink setup but choose to run them in production on Google Cloud Dataflow.

  4. Advanced Windowing and Time Handling: Beam provides advanced windowing capabilities that are consistent across different processing backends. It supports complex event processing patterns out of the box, such as handling late data, windowing by event time, and providing custom trigger strategies.
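The abstraction described in point 1 above is visible in even a trivial Beam pipeline: the code is runner-agnostic, and only the `runner` option changes between local execution and, say, Dataflow. The options shown are illustrative:

```python
import apache_beam as beam
from apache_beam.options.pipeline_options import PipelineOptions

# Swap "DirectRunner" for "DataflowRunner", "FlinkRunner", etc.
# without touching the pipeline code itself.
options = PipelineOptions(runner="DirectRunner")

with beam.Pipeline(options=options) as p:
    (p
     | beam.Create(["a", "b", "a"])
     | beam.combiners.Count.PerElement()
     | beam.Map(print))
```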

Conclusion

While Flink can indeed handle both true streaming and batch processing within its native API, Beam's existence is justified by its focus on providing a higher-level abstraction for pipeline creation that is consistent regardless of the backend. It’s not about replacing Flink or Spark but rather about providing a way to unify pipeline development across different technologies, which is particularly valuable in a diverse ecosystem or cloud-native environments where operational flexibility and efficiency are critical.

Google's creation of Beam and its emphasis on using it with Cloud Dataflow also aligns with its strategy to enhance cloud offerings, enabling customers to build and migrate data processing jobs easily within the Google Cloud environment. Beam helps in abstracting the complexities of specific processing frameworks, making cloud transitions smoother and more versatile for developers.

Apache Beam is designed to handle complex scenarios in stream processing, including the management of late data. Late data refers to data that arrives after the window it logically belongs to has been considered "complete" or after the window's intended processing time. Handling late data is crucial for accurate and reliable streaming analytics, especially in environments with out-of-order events or varying network latencies.

Handling Late Data in Apache Beam

Apache Beam provides several mechanisms to manage and accommodate late data effectively:

  1. Windowing:

    • Beam supports various window types, such as fixed windows, sliding windows, and session windows. Each window can define its strategy for handling late data.
    • Developers can specify a window's allowed lateness, which is the period after the end of the window during which late data is still considered for processing.
  2. Watermarks:

    • A watermark is a notion of time that provides a measure of when a system believes all data up to a certain point in event time has been received.
    • Beam uses watermarks to manage when a window can be considered complete. If data arrives after the watermark has passed the end of the window but before the window's allowed lateness has expired, the data is still processed as part of that window.
  3. Triggers:

    • Triggers in Beam determine when to materialize the results of a window, allowing for flexibility in how window results are emitted. Triggers can fire based on the progress of time (event time or processing time), data accumulation, or other custom conditions.
    • Event-time triggers allow windows to emit results at specific points in event time, which can be useful for handling late data.
    • Processing-time triggers fire based on when data is processed, not when it was generated.
    • Data-driven triggers fire based on properties of the data itself, such as the number of elements in a window.
    • Composite triggers combine multiple triggers in various ways (e.g., AfterWatermark with an early and late firing trigger).
  4. Accumulating vs. Discarding Mode:

    • When a trigger fires, windowed computations can be in either accumulating or discarding mode.
    • Accumulating mode means that each trigger firing includes all data accumulated so far. This mode is useful when dealing with late data because it allows each trigger firing to update previously emitted results with any new or late data.
    • Discarding mode resets the window's contents after each trigger firing. This mode is typically used when only the most recent data (since the last firing) matters.
  5. Side Outputs and Late Data Collection:

    • In scenarios where late data arrives after the allowed lateness period has expired, it's still possible to handle this data using side outputs.
    • Developers can define side outputs in their pipeline to collect and process late data separately from the main output stream. This ensures that late data does not disrupt the already computed results but can still be analyzed or stored for auditing or other purposes.

These features make Apache Beam highly adaptable for real-time data processing tasks, especially where correctness and completeness of data are critical despite the challenges of late-arriving or out-of-order events. The flexibility to define custom windowing strategies, sophisticated triggering mechanisms, and handling late data beyond the close of windows allows developers to build robust streaming applications tailored to their specific needs.
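A minimal sketch of these pieces together in the Beam Python SDK: fixed event-time windows, allowed lateness, a trigger that re-fires for late elements, and accumulating mode. The Pub/Sub topic is an illustrative placeholder, and streaming pipeline options are omitted for brevity:

```python
import apache_beam as beam
from apache_beam.transforms import window
from apache_beam.transforms.trigger import (
    AccumulationMode, AfterCount, AfterWatermark)
from apache_beam.utils.timestamp import Duration

with beam.Pipeline() as p:
    (p
     | "Read" >> beam.io.ReadFromPubSub(topic="projects/my-proj/topics/events")
     | "Window" >> beam.WindowInto(
         window.FixedWindows(60),                     # 1-minute event-time windows
         trigger=AfterWatermark(late=AfterCount(1)),  # re-fire for each late element
         allowed_lateness=Duration(seconds=600),      # accept data up to 10 min late
         accumulation_mode=AccumulationMode.ACCUMULATING)
     | "Count" >> beam.combiners.Count.PerElement())
```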

Handling late-arriving data is a critical aspect of stream processing, and various frameworks provide different levels of support for this issue. Here’s how some of the popular frameworks like Apache Flink, Apache Spark Streaming, Apache Storm, and Apache Samza manage late data, compared to Apache Beam:

Apache Flink

  • Watermarks and Late Data Handling: Flink handles late data using watermarks, which are a sort of timestamp indicating that all data up to a certain point has been received. Flink allows you to define how to handle events that arrive after their respective window has been considered complete but before a user-defined allowed lateness period ends.
  • Triggers and Evictors: Flink provides triggers that determine when a window is considered ready to be processed, and evictors that can remove elements from a window after a trigger fires. This can be useful for handling late-arriving data by adjusting the contents of the window before the final computation.
  • Side Outputs for Late Data: Like Beam, Flink also supports side outputs for processing late data, allowing events that arrive after the allowed lateness period to be redirected to a side output stream for special handling.

Apache Spark Streaming

  • Watermark Support: Spark Streaming introduced watermarking capabilities with its Structured Streaming model. It allows developers to specify how old data can be before it is considered too late and thus not included in window calculations.
  • Handling Late Data: Spark’s approach to late data involves specifying a watermark to delay processing until sufficient data has arrived, reducing the chance of having to deal with late data. If late data is received after this point, it may not be included in the results.
  • No Built-In Side Outputs: Unlike Beam and Flink, Spark does not have a built-in mechanism for routing late data to side outputs. Instead, it primarily relies on watermarks to manage late data.
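A minimal Structured Streaming sketch of that watermark behavior; the built-in rate source stands in for a real stream, and the window sizes are illustrative:

```python
from pyspark.sql import SparkSession
from pyspark.sql.functions import col, window

spark = SparkSession.builder.appName("late-data-demo").getOrCreate()

# The rate source generates rows with a `timestamp` column.
events = spark.readStream.format("rate").option("rowsPerSecond", 10).load()

counts = (events
          .withWatermark("timestamp", "10 minutes")        # data >10 min late is dropped
          .groupBy(window(col("timestamp"), "5 minutes"))  # 5-minute event-time windows
          .count())

query = counts.writeStream.outputMode("update").format("console").start()
query.awaitTermination()
```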

Apache Storm

  • Basic Late Data Handling: Storm provides basic support for late data through its windowing mechanism, but it is less sophisticated compared to Beam or Flink. It lacks built-in support for event-time processing and watermarking, which are crucial for effectively managing late data in a fine-grained manner.
  • Trident Extensions: The Trident abstraction over Storm introduces some capabilities for stateful processing and could be used to implement custom handling of late data, although it's not as straightforward or robust as in Beam or Flink.

Apache Samza

  • Event Time and Watermarking: Samza has added support for event-time processing and watermarking, which helps in handling late data by allowing the system to wait until it believes all data up to a certain time has arrived before processing.
  • Configurable Application Logic: Samza does not provide as rich a set of tools for dealing with late data out of the box as Beam or Flink. Instead, handling late data typically involves custom application logic tailored to the specifics of the job.

Conclusion

  • Apache Beam offers the most comprehensive and flexible tools for dealing with late data, including advanced windowing, triggering, and side outputs.
  • Apache Flink provides robust support for late data with similar features but within a more traditional stream processing environment.
  • Apache Spark Streaming handles late data using watermarks but lacks the more advanced triggering and side output capabilities of Beam and Flink.
  • Apache Storm and Samza offer basic support and rely more on the developer to implement custom logic for handling late data effectively.

Each framework has its strengths and trade-offs, and the choice of which to use often depends on the specific requirements of your data processing needs, such as the nature of the data streams, the complexity of the processing logic, and the required latency of the results.
