@baggiponte
Created April 16, 2023 17:16
Pandas vs Polars read gzipped JSONL
import gzip
from pathlib import Path
import pandas as pd
import polars as pl
destination = Path("./generated.jsonl.gz")
if not destination.exists():
    destination.touch()
content = b'{"Name":"`pandas`","in-memory/distributed":"in-memory","Apache Arrow":"no","Backends":"","Notes":""}\n{"Name":"`polars`","in-memory/distributed":"in-memory","Apache Arrow":"yes","Backends":"","Notes":""}\n{"Name":"`vaex`","in-memory/distributed":"in-memory","Apache Arrow":"no","Backends":"","Notes":""}\n{"Name":"`duckdb`","in-memory/distributed":"in-memory","Apache Arrow":"no","Backends":"","Notes":"SQL"}\n{"Name":"`apache-spark`","in-memory/distributed":"distributed","Apache Arrow":"no","Backends":"`pandas`-like API","Notes":""}\n{"Name":"`cuPy`/`cuDf`/`RAPIDS`","in-memory/distributed":"distributed","Apache Arrow":"","Backends":"GPU support","Notes":"streaming support"}\n{"Name":"`dask`","in-memory/distributed":"distributed","Apache Arrow":"no","Backends":"","Notes":""}\n{"Name":"`mars`","in-memory/distributed":"distributed","Apache Arrow":"no","Backends":"`ray`, `kubernetes`, `hadoop`","Notes":"`pandas`-like API"}\n{"Name":"`xarray`","in-memory/distributed":"wrapper","Apache Arrow":"no","Backends":"`numpy`, `pandas`, `dask`","Notes":""}\n{"Name":"`fugue`","in-memory/distributed":"wrapper","Apache Arrow":"no","Backends":"`spark`, `dask`, `ray`","Notes":"sql/`pandas`/python support"}\n{"Name":"`modin`","in-memory/distributed":"wrapper","Apache Arrow":"no","Backends":"`ray`, `dask`, `unidist` (?!)","Notes":"`pandas`-like API"}\n{"Name":"`ibis`","in-memory/distributed":"wrapper","Apache Arrow":"yes","Backends":"","Notes":"sql focus"}'
with gzip.open(destination, 'wb') as file:
    file.write(content)
pd.read_json(destination, lines=True)  # does not fail: pandas infers gzip compression from the ".gz" suffix (compression="infer" is the default)
pl.read_ndjson(destination)  # fails with RuntimeError: BindingsError: "External error at line 0: stream did not contain valid UTF-8"
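
The "stream did not contain valid UTF-8" message is consistent with the compressed bytes being parsed as text without being decompressed first. The following is a minimal stdlib-only sketch of that reading, not a statement about how polars works internally. The two-record `raw` sample and the helper `is_valid_utf8` are hypothetical, introduced only for illustration.

```python
import gzip
import json

# Two abridged records in the same JSONL shape as the data above (illustrative sample).
raw = (
    b'{"Name":"`pandas`","Apache Arrow":"no"}\n'
    b'{"Name":"`polars`","Apache Arrow":"yes"}\n'
)
compressed = gzip.compress(raw)


def is_valid_utf8(data: bytes) -> bool:
    """Hypothetical helper: report whether the bytes decode cleanly as UTF-8."""
    try:
        data.decode("utf-8")
    except UnicodeDecodeError:
        return False
    return True


# A gzip stream starts with the magic bytes 0x1f 0x8b, and a lone 0x8b
# is a stray continuation byte, so the compressed stream is not valid UTF-8.
compressed_is_text = is_valid_utf8(compressed)

# Decompressing first yields plain newline-delimited JSON again.
decompressed = gzip.decompress(compressed)
records = [json.loads(line) for line in decompressed.splitlines() if line]
```

Under the same assumption, a possible workaround is to decompress the file with `gzip` and hand the resulting bytes (for example wrapped in `io.BytesIO`) to `pl.read_ndjson`. Whether that is needed depends on the installed polars version.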