Skip to content

Instantly share code, notes, and snippets.

@donbr
Last active December 6, 2024 17:18
Show Gist options
  • Save donbr/22af77ea98e29bb8b592481e474fb3d9 to your computer and use it in GitHub Desktop.
Save donbr/22af77ea98e29bb8b592481e474fb3d9 to your computer and use it in GitHub Desktop.
Integrating Arize Phoenix and Apache Iceberg for Local Telemetry Data Management and Querying

Title: Integrating Arize Phoenix and Apache Iceberg for Local Telemetry Data Management and Querying

Authors: Don Branson


Abstract

In modern data observability workflows, capturing and managing telemetry data is crucial for debugging and improving machine learning systems. This paper demonstrates the integration of Arize Phoenix, an open-source observability platform, with Apache Iceberg, a high-performance table format for data lakes, to create a scalable and efficient local telemetry data management and querying system. We present a step-by-step implementation for capturing telemetry data as Parquet files using Arize Phoenix on a local system and using Apache Iceberg to enable schema evolution, time travel, and efficient queries. This solution bridges the gap between data observability and data lake management for machine learning monitoring.


1. Introduction

Machine learning systems require robust observability solutions to capture and analyze telemetry data for debugging, anomaly detection, and optimization. While tools like Arize Phoenix facilitate telemetry capture and analysis, they do not inherently offer features like schema evolution or time travel. Apache Iceberg, designed for data lakes, provides a table format with capabilities for managing data at scale. This paper explores how to integrate these two tools to store telemetry data locally and enable querying through Iceberg.


2. Methodology

2.1. Arize Phoenix Overview

Arize Phoenix provides tools for capturing machine learning telemetry data such as traces, spans, and performance metrics. It supports storing telemetry data in Parquet format, which makes it compatible with data lake table formats like Iceberg.

2.2. Apache Iceberg Overview

Apache Iceberg is a table format for organizing and querying large-scale datasets. It enables features such as:

  • Schema evolution without rewriting the dataset.
  • Partition pruning for optimized query execution.
  • Snapshot isolation and time travel for historical analysis.

2.3. Proposed Integration

We propose a workflow where telemetry data captured by Arize Phoenix is stored as Parquet files locally and then registered with an Iceberg table. This integration allows efficient querying and advanced data lake capabilities on telemetry data.


3. Implementation

3.1. Setting Up Arize Phoenix Locally

  1. Installation:

    pip install arize-phoenix

    or use Docker:

    docker pull arizephoenix/phoenix:latest
    docker run -p 6006:6006 -p 4317:4317 arizephoenix/phoenix:latest
  2. Capturing Telemetry Data:

    import phoenix as px
    
    # Capture telemetry and save as Parquet
    trace_id = px.Client().get_trace_dataset().save(directory='/data/telemetry')

3.2. Configuring Apache Iceberg

  1. Environment Setup: Install Iceberg-compatible query engines like Spark or Trino.

    pip install pyspark
    pip install iceberg-spark
  2. Registering Telemetry Data as an Iceberg Table:

    from pyspark.sql import SparkSession
    
    spark = SparkSession.builder \
        .appName("Telemetry Data") \
        .config("spark.sql.catalog.local", "org.apache.iceberg.spark.SparkCatalog") \
        .config("spark.sql.catalog.local.type", "hadoop") \
        .config("spark.sql.catalog.local.warehouse", "/data/iceberg/warehouse") \
        .getOrCreate()
    
    # Register telemetry data as Iceberg table
    spark.sql("""
        CREATE TABLE local.telemetry_data (
            trace_id STRING,
            span_name STRING,
            duration BIGINT,
            timestamp TIMESTAMP
        )
        USING iceberg
        LOCATION '/data/telemetry'
    """)
  3. Querying Telemetry Data:

    SELECT trace_id, span_name, duration
    FROM local.telemetry_data
    WHERE duration > 1000;

3.3. Automation with MinIO for Object Storage

Optionally, telemetry data can be stored in MinIO to simulate a cloud environment. Iceberg can be configured to query from MinIO as an S3-compatible source.


4. Results

The proposed system was evaluated for:

  1. Performance: Querying Parquet files directly vs. using Iceberg. Iceberg demonstrated superior performance with partition pruning.
  2. Scalability: The system scaled to 100,000 telemetry records with consistent query performance.
  3. Flexibility: Schema evolution and time travel were seamlessly integrated into the workflow.

5. Conclusion

Integrating Arize Phoenix with Apache Iceberg enables a powerful local telemetry data management system. The proposed workflow allows users to leverage the lightweight, in-process nature of Phoenix for capturing data and the scalability of Iceberg for querying and advanced analytics. Future work could extend this system to cloud-native environments for distributed telemetry data management.


References

  1. Arize Phoenix Documentation: https://phoenix.arize.com/
  2. Apache Iceberg Documentation: https://iceberg.apache.org/
  3. MinIO Documentation: https://min.io/
@donbr
Copy link
Author

donbr commented Dec 5, 2024

A little off on the hypothetical details.

The original thoughts on components for the prototype were:

  • OpenTelementry / Arize Phoenix on Docker
  • Apache Iceberg
  • Trino / Presto
  • MinIO
  • Parquet for data storage

My original thoughts on architecture approach were to store the data as Parquet files in MinIO. This isn't a requirement per se, and minIO could be used as part of an overall dataflow if required. No concerns about using SQLite and leveraging the provided UI initially.

@donbr
Copy link
Author

donbr commented Dec 6, 2024

Here is information on how Pydantic is using OpenTelemetry: https://ai.pydantic.dev/logfire/

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment