Last active
December 4, 2018 22:29
A method to query the unique values of a partition (name) from a `ParquetDataset`, using `pieces`.
import re

import pyarrow.parquet as pq


def query_unique_value(dataset: pq.ParquetDataset,
                       partition: str
                       ) -> set:
    ''' Query the unique values of a given partition name from a given `ParquetDataset`; returns a set.

    Parameters
    ----------
    dataset : pyarrow.parquet.ParquetDataset
        The ParquetDataset for the query.
    partition : str
        The name of the partition to query on.

    Returns
    -------
    set
        The unique values of the partition.
    '''
    unique_values = set()
    # Hive-style partition directories look like ".../<partition>=<value>/...".
    # Escape the name so regex metacharacters in it are matched literally.
    pattern = re.compile(f'.*/{re.escape(partition)}=([^/]*)/')
    for p in dataset.pieces:
        match = pattern.match(p.path)
        # Skip pieces whose path does not contain the requested partition.
        if match:
            unique_values.add(match.group(1))
    return unique_values
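
The path-matching step above can be tried without a real dataset. The following is a minimal sketch, not part of the original gist: the helper name `partition_values_from_paths` and the example paths are hypothetical, and it assumes Hive-style `name=value` directories separated by `/`. Unlike the function above, it uses `search` and therefore picks the first occurrence of the partition name in each path.

```python
import re
from typing import Iterable, Set


def partition_values_from_paths(paths: Iterable[str], partition: str) -> Set[str]:
    """Collect the distinct values of one Hive-style partition key from file paths.

    Hypothetical helper: works on plain path strings instead of dataset pieces.
    """
    # Match "<partition>=<value>/" at the start of the path or after a "/".
    pattern = re.compile(rf'(?:^|/){re.escape(partition)}=([^/]*)/')
    values = set()
    for path in paths:
        match = pattern.search(path)
        # Paths without this partition key are ignored.
        if match:
            values.add(match.group(1))
    return values


# Made-up paths standing in for the `path` attribute of each piece.
example_paths = [
    'data/year=2017/month=12/part-0.parquet',
    'data/year=2018/month=01/part-0.parquet',
    'data/year=2018/month=02/part-1.parquet',
]

years = partition_values_from_paths(example_paths, 'year')
months = partition_values_from_paths(example_paths, 'month')
```

Values come back as strings exactly as they appear in the directory names, so `'01'` and `'1'` would count as different values; cast them afterwards if numeric comparison is needed.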