File Format | Indices (within file) | Index Types | Sharding | Analysis Library / DB Interfaces (Examples) | Performance | Granularity | Compression | Data Types | Durability | Security | Community/Support | Maturity | Cost/License | Basin Use-cases |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
Nimble | Columns and streams | Block encoding, cascading (recursive/composite) encoding, pluggable encoding selection policies | Supported | FlatBuffers, SIMD, GPU | Designed for wide workloads, extensibility APIs | Thousands to tens of thousands of columns and streams | FlatBuffers, block encoding, recursive/composite encoding | Many, with extensibility for additional encodings | In development, no stability/versioning guarantees yet | Focus on a single unified library to prevent fragmentation | Work in progress, community support through Meta | Active development, no stable release yet | Open-source, dependency on Velox and other libraries | Suitable for high-performance analytics workloads, especially where extensibility and scalability are critical. |
Lance v2 | Point lookups, flexible metadata | Extensions (encoding via plugins), no row groups | Supported | pyarrow, LanceDB | Optimized for AI/ML workloads, handles wide columns and schemas efficiently | Very large columns, flexible alignment, supports non-tabular data | Flexible, configurable, supports large cells | Many, with easy addition of new encodings | In development, but optimized for performance | Empowering encoding developers, fluidity between data & metadata | Community support through Discord, active development | Initial implementation, groundwork laid for advanced features | Open-source | Ideal for AI/ML applications on Basin, given its optimization for wide columns and flexible schemas. |
Apache Avro | No (sync markers between blocks only) | None built in | Yes, by block (container files are splittable at sync markers) | DuckDB, Spark DataFrame, Pandas (fastavro) | High | Row-based | Deflate, Snappy, Zstandard | Complex data structures | High | Optional encryption | Active | Mature | Apache License 2.0 | Good for use cases requiring schema evolution and efficient serialization, especially in a multi-provider environment. |
Apache Parquet | Yes, per row group and page (footer metadata) | Min/max statistics, page index, dictionary, bloom filters | Yes, by file, partition, or row group | DuckDB, Spark DataFrame, Pandas (PyArrow, fastparquet), Polars | High | Columnar | Snappy, Gzip, LZO, Brotli, LZ4, Zstandard | Most data types | High | Optional encryption (Parquet modular encryption) | Active | Mature | Apache License 2.0 | Excellent for analytical queries and big data processing on Basin, especially with its efficient columnar storage and compression. |
ORC (Optimized Row Columnar) | Yes, columnar | Min/max statistics (file, stripe, row group), bloom filters, dictionary | Yes, by file or partition | Spark DataFrame, Pandas (PyArrow) | High | Columnar | Snappy, Zlib, LZO, LZ4, Zstandard | Most data types | High | Optional encryption | Active | Mature | Apache License 2.0 | Suitable for heavy analytical workloads on Basin due to its optimization for read performance and complex queries. |
SQLite | Yes, schema-based | B-trees (tables and indexes), managed by the SQLite engine | No | SQLite (direct), Python (DB-API), Spark (JDBC) | Moderate | Record | None built in (available through extensions) | Most data types | High | Optional encryption (through extensions such as SQLCipher) | Active | Mature | Public domain | Useful for lightweight, portable database applications on Basin, with optional encryption for security. |
Apache Arrow | No (not directly) | In-memory indexes for specific data structures | No | Arrow Flight, Spark DataFrames, Pandas (PyArrow) | High | Columnar | LZ4, Zstandard (IPC/Feather files) | Most data types | High | Optional encryption | Active | Mature | Apache License 2.0 | Ideal for in-memory analytics and fast data interchange on Basin, leveraging its high-performance columnar format. |
NetCDF | No (not directly) | None of its own; NetCDF-4 files are stored as HDF5, which indexes chunks internally | No | Xarray (Python) | Moderate | Variable-length records | Zlib/deflate (NetCDF-4 only) | Scientific data types | High | Optional encryption | Active | Mature | BSD-3-Clause License | Suitable for scientific data storage and analysis on Basin, especially with large, complex datasets. |
Zarr | Yes, by datasets or chunks | Dimensional, attribute-based | Yes, by datasets or chunks | Dask, Xarray, NumPy | High | Chunked N-dimensional arrays | Various (pluggable codecs such as Blosc, Zstandard, Gzip) | Most data types | High | Optional encryption | Active | Mature | MIT License | Great for distributed computing and cloud-based workflows on Basin, particularly with chunked data. |
FlatGeobuf | Yes (optional) | Packed Hilbert R-tree spatial index | No | GeoPandas (Python) | Moderate | Variable-length records | None built in | Spatial data types | High | Optional encryption | Active | Mature | BSD-2-Clause License | Perfect for spatial data storage and geospatial applications on Basin, enabling efficient random access through its spatial index. |
PMTiles | Yes | Tile directories keyed by Hilbert-curve tile ID | No | Leaflet, MapLibre GL JS | Moderate | Tiles (raster or vector) | Gzip, Brotli, Zstandard for tile data; raster tiles stay JPEG, PNG, or WebP | Raster and vector tile types | High | Optional encryption | Active | Mature | BSD-3-Clause License | Ideal for storing and serving tiled raster or vector data on Basin, particularly for map and image-based applications. |
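
As a rough illustration of what an in-file index buys for random access, the sketch below uses the SQLite row from the table (B-tree indexes, Python DB-API). The table name `readings`, the index name `idx_readings_sensor`, and the sample data are hypothetical and chosen only for this example; the exact wording of the query plan varies between SQLite versions.

```python
import sqlite3

# In-memory database; table, index, and data are hypothetical examples.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE readings (id INTEGER PRIMARY KEY, sensor TEXT, value REAL)"
)
conn.executemany(
    "INSERT INTO readings (sensor, value) VALUES (?, ?)",
    [(f"sensor-{i % 100}", float(i)) for i in range(10_000)],
)

# A secondary B-tree index stored inside the same database file.
conn.execute("CREATE INDEX idx_readings_sensor ON readings (sensor)")


def query_plan(sql, params=()):
    """Return the planner's description of how SQLite will run `sql`."""
    rows = conn.execute("EXPLAIN QUERY PLAN " + sql, params).fetchall()
    # The human-readable detail is the last column of each plan row.
    return " ".join(row[-1] for row in rows)


lookup = "SELECT value FROM readings WHERE sensor = ?"
plan = query_plan(lookup, ("sensor-42",))
print(plan)  # expected to mention idx_readings_sensor rather than a full scan

matches = conn.execute(
    "SELECT COUNT(*) FROM readings WHERE sensor = ?", ("sensor-42",)
).fetchone()[0]
print(matches)
```

The same question (can a reader jump to the matching records without scanning the whole file?) is what the "Indices" and "Index Types" columns try to capture for the other formats.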
Last active
August 27, 2024 16:59
example random access formats