@l1x
Last active March 2, 2024 05:16
Merging Parquet files with Python
import os
import pyarrow.parquet as pq

#
# Warning!!!
# Suffers from the same problem as the parquet-tools merge function.
#
# parquet-tools merge:
# Merges multiple Parquet files into one. The command doesn't merge row groups,
# just places one after the other. When used to merge many small files, the
# resulting file will still contain small row groups, which usually leads to bad
# query performance.

def combine_parquet_files(input_folder, target_path):
    try:
        # Read only .parquet files, in a stable (sorted) order.
        files = []
        for file_name in sorted(os.listdir(input_folder)):
            if file_name.endswith('.parquet'):
                files.append(pq.read_table(os.path.join(input_folder, file_name)))
        # All inputs are assumed to share the schema of the first file.
        with pq.ParquetWriter(target_path,
                              files[0].schema,
                              version='2.6',  # '2.0' is deprecated in newer pyarrow releases
                              compression='gzip',
                              use_dictionary=True,
                              data_page_size=2097152,  # 2 MiB
                              write_statistics=True) as writer:
            # Each write_table call starts at least one new row group,
            # so small input files still produce small row groups.
            for f in files:
                writer.write_table(f)
    except Exception as e:
        print(e)

combine_parquet_files('data', 'combined.parquet')
@mmore500

PSA: If you're looking for a pre-packaged CLI to grab and go, give joinem a try. It is available via PyPI: python3 -m pip install joinem.

joinem provides a CLI for fast, flexible concatenation of tabular data using polars.
I/O is lazily streamed in order to give good performance when working with numerous, large files.

Example Usage

Pass input files via stdin and output file as an argument.

ls -1 path/to/*.parquet | python3 -m joinem out.parquet

You can add the --progress flag to get a progress bar.

No-install Containerized Interface

If you are working in an HPC environment, joinem can also be conveniently used via Singularity/Apptainer.

ls -1 *.pqt | singularity run docker://ghcr.io/mmore500/joinem out.pqt

Further Information

joinem is also compatible with CSV, JSON, and feather file types.
See the project's README for more usage examples and a full command-line interface API listing.

Disclosure: I am the library author of joinem.
