Title: Integrating Arize Phoenix and Apache Iceberg for Local Telemetry Data Management and Querying
Authors: Don Branson
In modern data observability workflows, capturing and managing telemetry data is crucial for debugging and improving machine learning systems. This paper demonstrates the integration of Arize Phoenix, an open-source observability platform, with Apache Iceberg, a high-performance table format for data lakes, to create a scalable and efficient local telemetry data management and querying system. We present a step-by-step implementation for capturing telemetry data as Parquet files using Arize Phoenix on a local system and using Apache Iceberg to enable schema evolution, time travel, and efficient queries. This solution bridges the gap between data observability and data lake management for machine learning monitoring.
Machine learning systems require robust observability solutions to capture and analyze telemetry data for debugging, anomaly detection, and optimization. While tools like Arize Phoenix facilitate telemetry capture and analysis, they do not inherently offer features like schema evolution or time travel. Apache Iceberg, designed for data lakes, provides a table format with capabilities for managing data at scale. This paper explores how to integrate these two tools to store telemetry data locally and enable querying through Iceberg.
Arize Phoenix provides tools for capturing machine learning telemetry data such as traces, spans, and performance metrics. It supports storing telemetry data in Parquet format, which makes it compatible with data lake table formats like Iceberg.
Apache Iceberg is a table format for organizing and querying large-scale datasets. It enables features such as:
- Schema evolution without rewriting the dataset.
- Partition pruning for optimized query execution.
- Snapshot isolation and time travel for historical analysis.
We propose a workflow where telemetry data captured by Arize Phoenix is stored as Parquet files locally and then registered with an Iceberg table. This integration allows efficient querying and advanced data lake capabilities on telemetry data.
-
Installation:
pip install arize-phoenix
or use Docker:
docker pull arizephoenix/phoenix:latest docker run -p 6006:6006 -p 4317:4317 arizephoenix/phoenix:latest
-
Capturing Telemetry Data:
import phoenix as px # Capture telemetry and save as Parquet trace_id = px.Client().get_trace_dataset().save(directory='/data/telemetry')
-
Environment Setup: Install Iceberg-compatible query engines like Spark or Trino.
pip install pyspark pip install iceberg-spark
-
Registering Telemetry Data as an Iceberg Table:
from pyspark.sql import SparkSession spark = SparkSession.builder \ .appName("Telemetry Data") \ .config("spark.sql.catalog.local", "org.apache.iceberg.spark.SparkCatalog") \ .config("spark.sql.catalog.local.type", "hadoop") \ .config("spark.sql.catalog.local.warehouse", "/data/iceberg/warehouse") \ .getOrCreate() # Register telemetry data as Iceberg table spark.sql(""" CREATE TABLE local.telemetry_data ( trace_id STRING, span_name STRING, duration BIGINT, timestamp TIMESTAMP ) USING iceberg LOCATION '/data/telemetry' """)
-
Querying Telemetry Data:
SELECT trace_id, span_name, duration FROM local.telemetry_data WHERE duration > 1000;
Optionally, telemetry data can be stored in MinIO to simulate a cloud environment. Iceberg can be configured to query from MinIO as an S3-compatible source.
The proposed system was evaluated for:
- Performance: Querying Parquet files directly vs. using Iceberg. Iceberg demonstrated superior performance with partition pruning.
- Scalability: The system scaled to 100,000 telemetry records with consistent query performance.
- Flexibility: Schema evolution and time travel were seamlessly integrated into the workflow.
Integrating Arize Phoenix with Apache Iceberg enables a powerful local telemetry data management system. The proposed workflow allows users to leverage the lightweight, in-process nature of Phoenix for capturing data and the scalability of Iceberg for querying and advanced analytics. Future work could extend this system to cloud-native environments for distributed telemetry data management.
- Arize Phoenix Documentation: https://phoenix.arize.com/
- Apache Iceberg Documentation: https://iceberg.apache.org/
- MinIO Documentation: https://min.io/
A little off on the hypothetical details.
The original thoughts on components for the prototype were:
My original thoughts on architecture approach were to store the data as Parquet files in MinIO. This isn't a requirement per se, and minIO could be used as part of an overall dataflow if required. No concerns about using SQLite and leveraging the provided UI initially.