@dchaplinsky
Created April 6, 2023 14:30
Small bash script that downloads 1.6 TB of extracted structured data from the Common Crawl (the Web Data Commons December 2022 release) and finds pages where HowTo/FAQ structured data is available.
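
The dumps are gzipped N-Quads, one statement per line, with the URL of the page the triple was extracted from as the fourth term. A line the grep below matches should look roughly like this (the example.com URLs here are made up):

<https://example.com/faq#main> <http://www.w3.org/1999/02/22-rdf-syntax-ns#type> <http://schema.org/FAQPage> <https://example.com/faq> .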
#!/bin/bash
# You will need GNU parallel and pv (`apt-get install parallel pv`) to make it run
# download the list of dump file URLs
curl http://webdatacommons.org/structureddata/2022-12/files/file.list > urls.txt
# create output file
touch output.txt
# stream each dump through zcat and grep for FAQPage/HowTo triples, downloading
# 4 files at a time; pv meters the URL list going in and the matches coming out
cat urls.txt | pv -cN Input | parallel -j 4 "curl -s {} | zcat | grep -e '<http://schema.org/FAQPage>' -e '<http://schema.org/HowTo>'" | pv -cN Output > output.txt
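
A possible follow-up step, sketched under the assumption that output.txt holds well-formed N-Quads with no whitespace inside terms: pull the page URL (the second-to-last field, just before the closing dot) and deduplicate.

# sketch: strip the angle brackets from the graph term and keep unique page URLs
awk '{ gsub(/[<>]/, "", $(NF-1)); print $(NF-1) }' output.txt | sort -u > pages.txt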