@xujiboy
Last active December 4, 2018 22:29
A method to query the unique values of a partition (name) from a `ParquetDataset`, using `pieces`.
import re

import pyarrow.parquet as pq


def query_unique_value(dataset: pq.ParquetDataset,
                       partition: str
                       ) -> set:
    '''Query the unique values of a given partition name from a given `ParquetDataset`.

    Parameters
    ----------
    dataset : pyarrow.parquet.ParquetDataset
        The ParquetDataset for the query.
    partition : str
        The name of the partition to query on.

    Returns
    -------
    set
        The unique values of the partition, as strings.
    '''
    unique_values = set()
    # Match a Hive-style `<partition>=<value>/` directory component in each piece path.
    # `re.escape` keeps regex metacharacters in the partition name literal.
    pattern = re.compile(f'.*/{re.escape(partition)}=([^/]*)/')
    for p in dataset.pieces:
        match = pattern.match(p.path)
        # Skip pieces whose path lacks this partition instead of
        # raising AttributeError on `None.group(1)`.
        if match is not None:
            unique_values.add(match.group(1))
    return unique_values
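
The path-matching step in `query_unique_value` can be tried in isolation, without a dataset on disk. The sketch below is only an illustration: the paths are made up, `unique_partition_values` is a hypothetical helper that takes plain path strings in place of `dataset.pieces`, and the `re.escape` call and `None` check are defensive additions rather than something the idea depends on.

```python
import re

# Hypothetical piece paths, laid out the way a Hive-partitioned
# dataset stores its files (these are not from a real dataset).
paths = [
    '/data/sales/year=2017/month=11/part-0.parquet',
    '/data/sales/year=2017/month=12/part-0.parquet',
    '/data/sales/year=2018/month=01/part-0.parquet',
]


def unique_partition_values(paths, partition):
    '''Collect the distinct `<partition>=<value>` values found in `paths`.'''
    # Same pattern shape as in `query_unique_value`: capture everything
    # between `<partition>=` and the next `/`.
    pattern = re.compile(f'.*/{re.escape(partition)}=([^/]*)/')
    values = set()
    for path in paths:
        match = pattern.match(path)
        # Paths that do not contain the partition are skipped.
        if match is not None:
            values.add(match.group(1))
    return values


years = unique_partition_values(paths, 'year')
months = unique_partition_values(paths, 'month')
missing = unique_partition_values(paths, 'day')
```

Here `years` holds `'2017'` and `'2018'`, `months` holds `'11'`, `'12'` and `'01'`, and `missing` is empty. The values come back as strings exactly as they appear in the path (note the leading zero in `'01'`), so any conversion to integers or dates is left to the caller.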