@andrewxhill
Last active June 7, 2024 20:02
example random access formats
| File Format | Indices (within file) | Index Types | Sharding | Analysis Library / DB Interfaces (Examples) | Performance | Granularity | Compression | Data Types | Durability | Security | Community/Support | Maturity | Cost/License | Basin Use-cases |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| Nimble | Columns and streams | Block encoding, cascading (recursive/composite) encoding, pluggable encoding selection policies | Supported | Flatbuffers, SIMD, GPU | Designed for wide workloads, extensibility APIs | Thousands to tens of thousands of columns and streams | Flatbuffers, block encoding, recursive/composite encoding | Many, with extensibility for additional encodings | In development, no stability/versioning guarantees yet | Focus on a single unified library to prevent fragmentation | Work in progress, community support through Meta | Active development, no stable release yet | Open-source, dependency on Velox and other libraries | Suitable for high-performance analytics workloads, especially where extensibility and scalability are critical. |
| Lance v2 | Point lookups, flexible metadata | Extensions (encoding via plugins), no row groups | Supported | pyarrow, LanceDB | Optimized for AI/ML workloads, handles wide columns and schemas efficiently | Very large columns, flexible alignment, supports non-tabular data | Flexible, configurable, supports large cells | Many, with easy addition of new encodings | In development, but optimized for performance | Empowering encoding developers, fluidity between data & metadata | Community support through Discord, active development | Initial implementation, groundwork laid for advanced features | Open-source | Ideal for AI/ML applications on Basin, given its optimization for wide columns and flexible schemas. |
| Apache Avro | Yes, schema-based | Field-level, composite | Yes, by record or field | DuckDB, Spark DataFrame, Pandas (Feather) | High | Row-based | Snappy, Zstandard | Complex data structures | High (consistent hashing) | Optional encryption | Active | Mature | Apache Software License | Good for use cases requiring schema evolution and efficient serialization, especially in a multi-provider environment. |
| Apache Parquet | Yes, columnar | Range, dictionary, bloom filters | Yes, by file or partition | DuckDB, Spark DataFrame, Pandas (Feather, PyArrow), Polars | High | Columnar | Snappy, Gzip, LZO | Most data types | High (consistent hashing) | Optional encryption | Active | Mature | Apache Software License | Excellent for analytical queries and big data processing on Basin, especially with its efficient columnar storage and compression. |
| ORC (Optimized Row Columnar) | Yes, columnar | Predicate pushdown, range, dictionary | Yes, by file or partition | Spark DataFrame, Pandas (Feather, PyArrow) | High | Columnar | Snappy, Zlib | Most data types | High (consistent hashing) | Optional encryption | Active | Mature | Apache Software License | Suitable for heavy analytical workloads on Basin due to its optimization for read performance and complex queries. |
| SQLite | Yes, schema-based | B-trees, various types managed by SQLite engine | No | SQLite (direct), Python (DB-API), Spark (JDBC) | Moderate | Record | Various | Most data types | High | Optional encryption | Active | Mature | Public domain | Useful for lightweight, portable database applications on Basin, with optional encryption for security. |
| Apache Arrow | No (not directly) | In-memory indexes for specific data structures | No | Arrow Flight, Spark DataFrames, Pandas (PyArrow) | High | Columnar | Snappy, Zstandard | Most data types | High | Optional encryption | Active | Mature | Apache Software License | Ideal for in-memory analytics and fast data interchange on Basin, leveraging its high-performance columnar format. |
| NetCDF | No (not directly) | External libraries like HDF5 can be used | No | Xarray (Python) | Moderate | Variable-length records | None | Scientific data types | High | Optional encryption | Active | Mature | Public domain | Suitable for scientific data storage and analysis on Basin, especially with large, complex datasets. |
| Zarr | Yes, by datasets or chunks | Dimensional, attribute-based | Yes, by datasets or chunks | Dask, Xarray, NumPy | High | Variable-length records | Various | Most data types | High | Optional encryption | Active | Mature | BSD-3-Clause License | Great for distributed computing and cloud-based workflows on Basin, particularly with chunked data. |
| FlatGeoBuf | No | Encoded geometries don't require internal indexing | No | GeoPandas (Python) | Moderate | Variable-length records | Snappy | Spatial data types | High | Optional encryption | Active | Mature | MIT License | Perfect for spatial data storage and geospatial applications on Basin, enabling efficient random access and compression. |
| PMTiles | No | Tiled raster data structure doesn't inherently need indexing | No | Leaflet.js, MapLibre.js | Moderate | Tiled raster data | JPEG, PNG | Raster data types | High | Optional encryption | Active | Mature | MIT License | Ideal for storing and serving tiled raster data on Basin, particularly for map and image-based applications. |
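
The SQLite row lists B-tree indices managed by the engine and access through Python's DB-API. As a rough illustration of what in-file random access can look like, here is a minimal sketch using only the standard-library `sqlite3` module. The table `readings`, its columns, and the index name `idx_readings_sensor` are hypothetical names chosen for this example, not anything defined by the gist or by Basin.

```python
import sqlite3


def build_demo_db(path=":memory:"):
    """Create a small table plus a secondary B-tree index (hypothetical schema)."""
    conn = sqlite3.connect(path)
    conn.execute(
        "CREATE TABLE readings (id INTEGER PRIMARY KEY, sensor_id TEXT, value REAL)"
    )
    # 10,000 rows spread evenly over 100 made-up sensor names.
    conn.executemany(
        "INSERT INTO readings (sensor_id, value) VALUES (?, ?)",
        [(f"sensor-{i % 100}", float(i)) for i in range(10_000)],
    )
    # The index is stored inside the same database file as the table.
    conn.execute("CREATE INDEX idx_readings_sensor ON readings (sensor_id)")
    conn.commit()
    return conn


def lookup(conn, sensor_id):
    """Fetch all rows for one sensor; the engine decides how to reach them."""
    return conn.execute(
        "SELECT id, value FROM readings WHERE sensor_id = ? ORDER BY id",
        (sensor_id,),
    ).fetchall()


def query_plan(conn, sensor_id):
    """Return the human-readable plan details SQLite reports for the lookup."""
    rows = conn.execute(
        "EXPLAIN QUERY PLAN SELECT id, value FROM readings WHERE sensor_id = ?",
        (sensor_id,),
    ).fetchall()
    # The last column of each plan row is the descriptive text.
    return [row[-1] for row in rows]


if __name__ == "__main__":
    conn = build_demo_db()
    print(len(lookup(conn, "sensor-7")), "rows found")
    for detail in query_plan(conn, "sensor-7"):
        print(detail)
```

The query plan text should mention the index by name when SQLite chooses it, which is a quick way to check that a lookup seeks into the file rather than scanning the whole table. The exact wording of the plan varies between SQLite versions.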