@ZubairLK
Last active December 16, 2015 07:13
Extracting subject titles from sabaq.pk site map
# sabaq.pk has a site-map page that links to every video.
# This script extracts the titles of all videos for one subject, for example Punjab board 9th-class biology.
# A few variables are needed first.
subject=punjab-biology-9th
board="punjab board"
# Select the lines for the subject, then keep only the video links. sed rewrites vsg to sid and strips "amp;" and the <loc> tags to produce the correct URL of each video.
grep -- "$subject" sabaq.xml | grep video | sed -e 's/vsg/sid/' -e 's/amp;//' -e 's/<loc>//' -e 's/<\/loc>//' > "$subject.txt"
# Download the HTML page behind every URL in the list
wget -i "$subject.txt"
# grep the html files for the title and remove the redundant characters
cat video-page.php\?sid\=* | grep -i "$board" | sed 's/<br><h1>//' | sed 's/<\/h1><br>/\t/' | sed 's/<div class=//' | sed "s/'b-crum'>//" | sed 's/ > / /' | sed 's/ > / /' | sed 's/ > / /' | sed 's/ > / /' | sed 's/<\/div><br>//' | sed 's/ <div class="lnk-video">//' > full_list.txt
# split into title and subject/chapter prefix files.
cut -f1 full_list.txt > title.txt
cut -f2 full_list.txt > prefix.txt
# Combine in prefix_title form, replace spaces with _ and append .mp4.
# The tab becomes a space so that tr -s leaves exactly one _ between prefix and title.
paste prefix.txt title.txt | sed 's/\t/ /' | tr -s ' ' '_' | sed 's/$/.mp4/' > "${subject}_titles.txt"
cat "${subject}_titles.txt"
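
The two text-processing steps above can be tried without touching the network. The following is a minimal sketch, not part of the original script: the function names clean_sitemap_urls and make_filenames are hypothetical, and the sample sitemap line and title line are made up for illustration (the host example.invalid and the query parameters are assumptions, not real sabaq.pk data).

```shell
#!/usr/bin/env bash
# Sketch only: function names and sample inputs are hypothetical.

# Turn sitemap <loc> lines for one subject into plain video URLs.
# Reads sitemap XML on stdin; $1 is the subject slug.
clean_sitemap_urls() {
    grep -- "$1" | grep video |
        sed -e 's/vsg/sid/' -e 's/amp;//' -e 's/<loc>//' -e 's/<\/loc>//'
}

# Turn "title<TAB>prefix" lines (the layout of full_list.txt) into
# prefix_title.mp4 names. The fields are joined with a space, which
# tr then turns into a single underscore.
make_filenames() {
    awk -F'\t' '{ print $2 " " $1 }' | tr -s ' ' '_' | sed 's/$/.mp4/'
}

# Made-up sitemap line; the real URL layout may differ.
printf '%s\n' '<loc>http://example.invalid/video-page.php?vsg=punjab-biology-9th-1&amp;lang=en</loc>' |
    clean_sitemap_urls punjab-biology-9th
# prints: http://example.invalid/video-page.php?sid=punjab-biology-9th-1&lang=en

# Made-up title and breadcrumb prefix.
printf 'Cell Structure\tPunjab Board Biology 9th\n' | make_filenames
# prints: Punjab_Board_Biology_9th_Cell_Structure.mp4
```

Wrapping each step in a function that reads stdin makes it possible to check the sed and tr expressions on one line before running them over the whole site map.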