@uriesk
uriesk / analyse_file_size_usage_wikipedia_top1m_zim.md
Last active February 3, 2023 15:18
wikipedia top1m webm files

Media URLs scraped from the wikipedia_en_top1m mwoffliner log file with:

grep -F '.webm' 28c070f7906bf9674d93ad36_mwoffliner.log | grep -vF 'api.php' | grep Downloading | sed -e 's/.* \[//' -e 's/\].*//' | grep -vF '.jpg' | sort -u > videofiles.txt

for video, and

grep -F '.ogg' 28c070f7906bf9674d93ad36_mwoffliner.log | grep -vF 'api.php' | grep Downloading | sed -e 's/.* \[//' -e 's/\].*//' | grep -vF '.jpg' | grep -vF '.png' | grep -v maps | grep -vF 'load.php' | sort -u > audiofiles.txt

for audio
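
The two pipelines differ only in the extension and the exclusion list, so they could be folded into one helper. The following is a minimal sketch, not the commands actually run: `extract_media_urls` is a hypothetical name, and it assumes it is acceptable to apply the audio exclusions (`.png`, `maps`, `load.php`) to the video list as well.

```shell
#!/bin/sh
# Sketch: print the unique URLs with a given extension that the log reports
# as "Downloading". extract_media_urls is a hypothetical helper, not part of
# mwoffliner.
extract_media_urls() {
    ext="$1"      # literal extension to look for, e.g. ".webm" or ".ogg"
    logfile="$2"  # path to the mwoffliner log file

    # -F treats the patterns as fixed strings, so "." is not a wildcard.
    grep -F -- "$ext" "$logfile" \
        | grep -vF 'api.php' \
        | grep -F 'Downloading' \
        | sed -e 's/.* \[//' -e 's/\].*//' \
        | grep -vF -e '.jpg' -e '.png' -e 'maps' -e 'load.php' \
        | sort -u
}
```

With the log file from above, `extract_media_urls .webm 28c070f7906bf9674d93ad36_mwoffliner.log > videofiles.txt` would produce the video list, and the same call with `.ogg` redirected to `audiofiles.txt` the audio list.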