@napsternxg
Last active June 10, 2023 13:29
Food.com sitemap
mkdir food.com
cd food.com
wget https://www.food.com/sitemap.xml
# Dry run: list the per-section sitemap URLs found in the sitemap index.
for url in $(grep "<loc>https://www.food.com/sitemap-" sitemap.xml | sed -n 's:.*<loc>\(.*\)</loc>.*:\1:p');
do echo "Download: $url";
done
# Download each per-section sitemap listed in the index.
for url in $(grep "<loc>https://www.food.com/sitemap-" sitemap.xml | sed -n 's:.*<loc>\(.*\)</loc>.*:\1:p');
do wget "$url";
done
# for i in $(seq 1 24); do wget "https://www.food.com/sitemap-${i}.xml.gz"; done
# Count the recipe URLs across all downloaded sitemaps.
cat sitemap-*.xml.gz | zcat | grep "<loc>https://www.food.com/recipe/" | wc -l # 298131
# Save the recipe URLs, one per line.
cat sitemap-*.xml.gz | zcat | grep "<loc>https://www.food.com/recipe/" | sed -n 's:.*<loc>\(.*\)</loc>.*:\1:p' > urls_with_recipes.txt
# Recipe names
cat sitemap-*.xml.gz | zcat | sed -n 's|.*<image:title>\(.*\)</image:title>|\1|p' > recipe_names.txt
wc -l recipe_names.txt # 168870
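The same grep-then-sed pipeline for pulling `<loc>` values appears several times above, so it could be wrapped in a small function. The following is only a sketch: `extract_locs` is a hypothetical helper name that is not part of the original commands, the sample input is made up for illustration, and it assumes one `<loc>` element per line, as the commands above already do.

```shell
# Hypothetical helper (not in the original gist): read sitemap XML on stdin
# and print the contents of every <loc> element whose URL starts with the
# prefix given as the first argument.
# Assumes each <loc>...</loc> element sits on its own line.
extract_locs() {
  grep -F "<loc>$1" | sed -n 's:.*<loc>\(.*\)</loc>.*:\1:p'
}

# Made-up sample input, for illustration only.
sample='<url><loc>https://www.food.com/recipe/example-1</loc></url>
<url><loc>https://www.food.com/about</loc></url>
<url><loc>https://www.food.com/recipe/example-2</loc></url>'

# Keep only the URLs under /recipe/.
recipe_urls=$(printf '%s\n' "$sample" | extract_locs "https://www.food.com/recipe/")
printf '%s\n' "$recipe_urls"
```

With the downloaded files, the equivalent call would be `cat sitemap-*.xml.gz | zcat | extract_locs "https://www.food.com/recipe/" > urls_with_recipes.txt`.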