Last active
March 2, 2024 05:16
Merging Parquet files with Python
import os
import pyarrow.parquet as pq

#
# Warning!!!
# Suffers from the same problem as the parquet-tools merge function
#
# parquet-tools merge:
# Merges multiple Parquet files into one. The command doesn't merge row groups,
# just places one after the other. When used to merge many small files, the
# resulting file will still contain small row groups, which usually leads to bad
# query performance.

def combine_parquet_files(input_folder, target_path):
    try:
        # Read every file in the folder into memory as a pyarrow Table
        files = []
        for file_name in os.listdir(input_folder):
            files.append(pq.read_table(os.path.join(input_folder, file_name)))
        # All inputs are assumed to share the schema of the first file
        with pq.ParquetWriter(target_path,
                files[0].schema,
                version='2.6',  # was '2.0'; recent pyarrow accepts only '1.0', '2.4', '2.6'
                compression='gzip',
                use_dictionary=True,
                data_page_size=2097152,  # 2MB
                write_statistics=True) as writer:
            # Each write_table call appends the table as its own row group(s)
            for f in files:
                writer.write_table(f)
    except Exception as e:
        print(e)

combine_parquet_files('data', 'combined.parquet')
PSA: If you're looking for a pre-packaged CLI to grab and go, give joinem a try. It is available via PyPI:
python3 -m pip install joinem
joinem provides a CLI for fast, flexible concatenation of tabular data using polars.
I/O is lazily streamed in order to give good performance when working with numerous, large files.
Example Usage
Pass the input files via stdin and the output file as an argument.
You can add the --progress flag to get a progress bar.
No-install Containerized Interface
If you are working in an HPC environment, joinem can also be conveniently used via singularity/apptainer.
Further Information
joinem is also compatible with CSV, JSON, and feather file types.
See the project's README for more usage examples and a full command-line interface API listing.
Disclosure: I am the author of joinem.