@aborruso
Created March 7, 2024 22:26
# Inspired by the great Simon: https://til.simonwillison.net/duckdb/remote-parquet
# fetch the list of file URLs
curl -X GET "https://huggingface.co/api/datasets/aborruso/open_cup_complessivo/parquet/default/train"
# query the parquet files as if they were a single resource, with the URLs hard-coded
duckdb -c "select CODICE_NATURA_INTERVENTO,NATURA_INTERVENTO,count(*) conteggio FROM read_parquet([
'https://huggingface.co/api/datasets/aborruso/open_cup_complessivo/parquet/default/train/0.parquet',
'https://huggingface.co/api/datasets/aborruso/open_cup_complessivo/parquet/default/train/1.parquet',
'https://huggingface.co/api/datasets/aborruso/open_cup_complessivo/parquet/default/train/2.parquet',
'https://huggingface.co/api/datasets/aborruso/open_cup_complessivo/parquet/default/train/3.parquet',
'https://huggingface.co/api/datasets/aborruso/open_cup_complessivo/parquet/default/train/4.parquet',
'https://huggingface.co/api/datasets/aborruso/open_cup_complessivo/parquet/default/train/5.parquet',
'https://huggingface.co/api/datasets/aborruso/open_cup_complessivo/parquet/default/train/6.parquet',
'https://huggingface.co/api/datasets/aborruso/open_cup_complessivo/parquet/default/train/7.parquet',
'https://huggingface.co/api/datasets/aborruso/open_cup_complessivo/parquet/default/train/8.parquet',
'https://huggingface.co/api/datasets/aborruso/open_cup_complessivo/parquet/default/train/9.parquet',
'https://huggingface.co/api/datasets/aborruso/open_cup_complessivo/parquet/default/train/10.parquet',
'https://huggingface.co/api/datasets/aborruso/open_cup_complessivo/parquet/default/train/11.parquet',
'https://huggingface.co/api/datasets/aborruso/open_cup_complessivo/parquet/default/train/12.parquet',
'https://huggingface.co/api/datasets/aborruso/open_cup_complessivo/parquet/default/train/13.parquet',
'https://huggingface.co/api/datasets/aborruso/open_cup_complessivo/parquet/default/train/14.parquet',
'https://huggingface.co/api/datasets/aborruso/open_cup_complessivo/parquet/default/train/15.parquet',
'https://huggingface.co/api/datasets/aborruso/open_cup_complessivo/parquet/default/train/16.parquet',
'https://huggingface.co/api/datasets/aborruso/open_cup_complessivo/parquet/default/train/17.parquet',
'https://huggingface.co/api/datasets/aborruso/open_cup_complessivo/parquet/default/train/18.parquet',
'https://huggingface.co/api/datasets/aborruso/open_cup_complessivo/parquet/default/train/19.parquet'
]) group by ALL order by conteggio DESC"
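# Alternative to hard-coding: build the list literal from the API response.
# A minimal sketch, assuming the endpoint returns a flat JSON array of plain
# URL strings (no escaped characters); json_array_to_sql_list and url_list are
# hypothetical names, not part of the original gist.
json_array_to_sql_list() {
  # Swap JSON double quotes for SQL single quotes: ["a","b"] -> ['a','b']
  tr '"' "'"
}
# Possible use (needs network access and the duckdb CLI):
# url_list=$(curl -s "https://huggingface.co/api/datasets/aborruso/open_cup_complessivo/parquet/default/train" | json_array_to_sql_list)
# duckdb -c "select CODICE_NATURA_INTERVENTO,NATURA_INTERVENTO,count(*) conteggio FROM read_parquet(${url_list}) group by ALL order by conteggio DESC"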
# query the parquet files as if they were a single resource, without writing out the URL list,
# but generating it with a lambda function
duckdb -c "
SELECT
  CODICE_NATURA_INTERVENTO,
  NATURA_INTERVENTO,
  COUNT(*) AS conteggio
FROM read_parquet(
  list_transform(
    generate_series(0, 19), -- Generate the integers from 0 to 19.
    n -> 'https://huggingface.co/api/datasets/aborruso/open_cup_complessivo/parquet/default/train/' ||
      n || '.parquet' -- Build the URL of each Parquet file
  )
)
GROUP BY
  CODICE_NATURA_INTERVENTO,
  NATURA_INTERVENTO
ORDER BY
  conteggio DESC; -- Sort by count, descending.
"
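# The upper bound 19 passed to generate_series is hard-coded. A minimal sketch
# for deriving it from the API response instead, assuming the response lists one
# quoted URL ending in .parquet per file; count_parquet_urls and last_index are
# hypothetical names, not part of the original gist.
count_parquet_urls() {
  # Print how many quoted strings ending in .parquet appear on stdin
  grep -o '\.parquet"' | wc -l | tr -d ' '
}
# Possible use (needs network access):
# last_index=$(( $(curl -s "https://huggingface.co/api/datasets/aborruso/open_cup_complessivo/parquet/default/train" | count_parquet_urls) - 1 ))
# then write generate_series(0, ${last_index}) in place of generate_series(0, 19)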