Created — forked from macias/CHANGELOG

Coursera Getter [download video lecture]

c_get.php
PHP
<?php
 
/* ===================== COURSERA GETTER ============================
tags: [coursera video download] [coursera lecture download]
CHANGELOG:
---------
2012-07-05 * initial release of Coursera preview getter
 
2012-06-14 * added a basic check that the number of arguments is sufficient
 
2012-06-11 * UTF-8 in filenames is supported
(another module for PHP is required -- mbstring)
* replaces slash and backslash with underscore
 
2012-06-07 * changed this info, added another way of getting cookies
 
2012-06-06, 2 * extensions casing reverted -- they matter again
* directories are named according to Lectures sections
* handles multiple files for given file type
* files are counted within each directory, not within entire course
2012-06-06 * extensions can be given lower/upper-case, they do not matter
2012-06-05, 2 * creates weekly subdirectories and puts the files in there
2012-06-05 * initial release
 
WHAT IT DOES:
------------
 
* it parses given course Lectures page
* it extracts all the desired content (links for videos, slides, etc)
* it uses consistent naming of the files
* it replaces colon with period (hello Windows users)
* it finally creates a bunch of wget commands ready to execute
* it ignores already existing files, so it is safe to rerun the wget script just to get missing files
(note this might not be true if you update this script, because of a possible change in the naming convention)
 
WHAT YOU NEED:
-------------
1. proper shell (Windows users -- of course I recommend switching to Linux entirely, but as a workaround Cygwin should be fine -- I don't know whether the tools I mention below work there)
2. wget (in openSUSE `sudo zypper in wget`)
3. php5 (in openSUSE `sudo zypper in php5`)
4. php5-openssl (in openSUSE `sudo zypper in php5-openssl`)
5. php5-mbstring (in openSUSE `sudo zypper in php5-mbstring`)
6. and an adventurous soul -- in Firefox, go to Edit/Preferences/Privacy/Remove Individual Cookie (don't freak out!) and search for "coursera". Several items should appear -- look for the key named "session" for the site you would like to download (for example "nlp"). Copy the value (content) of that key. Close the preferences window (do **NOT** delete anything!) -- I will be grateful for info if there is an easier way
 
Ok, so now you know the address of the site, the session, and the files you would like to download.
 
Jan de Vos sent another way of getting cookies (step 6):
 
* find the cookies directory -- in case of Linux it will be something like this `~/.mozilla/firefox/88xw1k8g.default/`
* run sqlite3 -- `sqlite3 cookies.sqlite`
* run SQL query -- `select path,value from moz_cookies where baseDomain = 'coursera.org' and name='session';`
You will get the session codes for all courses you are enrolled in.
USAGE:
-----
 
php c_get.php "link_to_lectures_page" "file types" "session code" > wget_script_name.sh
sh wget_script_name.sh
 
Example (this is one line):
 
php c_get.php "https://class.coursera.org/crypto-preview/lecture/index" "MP4 PDF" "HERE&IS%MY&SESSION^VALUE@WHICH*OF!COURSE*I_WONT*TELL9YOU" > wgetter.sh
 
The command above creates a wget script that downloads videos (MP4) and slides (PDF). Now execute it:
 
sh wgetter.sh
 
Please note the file type casing (MP4 vs. mp4) must match the casing of the title (tooltip) of the given category of files
-- check the Lectures page to find it out.
SECURITY NOTE:
-------------
 
Do NOT share your session code with anyone, and this means -- do NOT share the wget script with anyone as well.
================================================================== */
function get_page($url,$session)
{
$http = array('method'=>'GET',
'header'=> 'Cookie: session='.$session.';');
$context = stream_context_create(array('http'=> $http));
 
$content = file_get_contents($url,false,$context);
if ($content===FALSE)
return NULL;
 
return $content;
}
 
function get_dom($content)
{
$dom = new DOMDocument();
$errors_mode = libxml_use_internal_errors(TRUE);
$content = mb_convert_encoding($content, 'HTML-ENTITIES', "UTF-8");
$dom->loadHTML($content);
libxml_clear_errors();
libxml_use_internal_errors($errors_mode);
$dom->preserveWhiteSpace = false;
return $dom;
}
 
function fix_filename($s)
{
return strtr(trim($s),':"/\\','.\'__');
}
 
function print_wget($content,$session,$extensions)
{
$dom = get_dom($content);
$xpath = new DOMXPath($dom);
 
$group_count = 0;
$group_list = $xpath->query('//a[contains(@class,"list_header_link")]');
foreach ($group_list as $group)
{
$item_count = 0;
++$group_count;
$dir = $xpath->query('./h3',$group)->item(0)->nodeValue;
$dir = str_pad($group_count,2,'0',STR_PAD_LEFT).'. '.fix_filename($dir);
echo 'mkdir "'.$dir.'"'."\n";
$node_list = $xpath->query('.//li[contains(@class,"item_row")]',$group->nextSibling);
foreach ($node_list as $node)
{
++$item_count;
$title = $xpath->query('.//a[@class="lecture-link"]/text()',$node)->item(0)->nodeValue;
foreach ($extensions as $ext)
{
$links = $xpath->query('.//a[contains(@title,"'.$ext.'")]',$node);
foreach ($links as $link)
{
$suffix = '';
if ($links->length>1)
$suffix = '.'.$link->attributes->getNamedItem('title')->nodeValue;
$link = $link->attributes->getNamedItem('href')->nodeValue;
 
echo 'wget -nc --no-cookies --header "Cookie: session='.$session.'" "'.$link.'" -O "'.$dir.'/'.str_pad($item_count,3,'0',STR_PAD_LEFT).'. '.fix_filename($title).$suffix.'.'.strtolower($ext).'"'."\n";
}
}
}
}
}
 
if ($argc!=4)
{
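// report on stderr, so the message does not end up in the redirected wget script
file_put_contents('php://stderr', "Error: you should input three arguments, the usage is:\n");
file_put_contents('php://stderr', "\"LECTURES_URL\" \"FILE_TYPES\" \"SESSION_CODE\"\n");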
}
else
{
$url = $argv[1];
$extensions = explode(' ',$argv[2]);
$session = $argv[3];
 
$content = get_page($url,$session);
if ($content!==NULL)
print_wget($content,$session,$extensions);
}
?>
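
Note on quoting: print_wget in c_get.php wraps the session value, the link, and the file name in double quotes, and inside double quotes a shell still expands characters such as `$` and backticks. A minimal sketch of one alternative, assuming PHP's built-in escapeshellarg suits the shell that will run the script. The helper name wget_line and the sample values are hypothetical and not part of c_get.php.

```php
<?php
// Hypothetical helper (not part of c_get.php): builds one wget command
// and passes every variable part through escapeshellarg(), so the shell
// receives each of them as a single literal argument.
function wget_line($session, $link, $target)
{
    return 'wget -nc --no-cookies'
        .' --header '.escapeshellarg('Cookie: session='.$session)
        .' '.escapeshellarg($link)
        .' -O '.escapeshellarg($target);
}

// Sample values are made up for illustration only.
echo wget_line('abc$def', 'https://example.org/video.mp4', '01. Intro/001. Welcome.mp4')."\n";
```

The exact quoting that escapeshellarg produces depends on the platform, so the generated script should be run on the same kind of system that created it.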
c_preview.php
PHP
<?php
 
/* ===================== COURSERA PREVIEW GETTER =======================
tags: [coursera video download] [coursera lecture download]
 
CHANGELOG:
---------
2012-07-05 * initial release
 
WHAT IT DOES:
------------
 
* it is the counterpart of Coursera getter, but this one works only for course previews
-- the ones with an embedded video player, and nothing else
 
WHAT YOU NEED:
-------------
1. proper shell (Windows users -- of course I recommend switching to Linux entirely, but as a workaround Cygwin should be fine -- I don't know whether the tools I mention below work there)
2. wget (in openSUSE `sudo zypper in wget`)
3. php5 (in openSUSE `sudo zypper in php5`)
4. php5-mbstring (in openSUSE `sudo zypper in php5-mbstring`)
 
USAGE:
-----
 
php c_preview.php "link_to_preview_page" "video_file_type" > wget_script_name.sh
sh wget_script_name.sh
 
Example (this is one line):
 
php c_preview.php "https://class.coursera.org/crypto-preview/lecture/index" "mp4" > wgetter.sh
 
The command above creates a wget script that downloads videos (MP4). Now execute it:
 
sh wgetter.sh
 
Please note the file type is not guaranteed to exist on the server
(so far "webm" and "mp4" are supported by Coursera).
================================================================== */
function get_page_xpath($url,$session = null)
{
$http = array('method'=>'GET');
if ($session!==NULL)
$http['header'] = 'Cookie: session='.$session.';';
$context = stream_context_create(array('http'=> $http));
 
$content = file_get_contents($url,false,$context);
if ($content===FALSE)
return NULL;
 
$dom = get_dom($content);
$xpath = new DOMXPath($dom);
return $xpath;
}
 
function get_dom($content)
{
$dom = new DOMDocument();
$errors_mode = libxml_use_internal_errors(TRUE);
$content = mb_convert_encoding($content, 'HTML-ENTITIES', "UTF-8");
$dom->loadHTML($content);
libxml_clear_errors();
libxml_use_internal_errors($errors_mode);
$dom->preserveWhiteSpace = false;
return $dom;
}
 
function fix_filename($s)
{
return strtr(trim($s),':"/\\','.\'__');
}
 
function print_wget($xpath,$ext)
{
$group_count = 0;
$group_list = $xpath->query('//a[contains(@class,"list_header_link")]');
foreach ($group_list as $group)
{
$item_count = 0;
++$group_count;
$dir = $xpath->query('./h3',$group)->item(0)->nodeValue;
$dir = str_pad($group_count,2,'0',STR_PAD_LEFT).'. '.fix_filename($dir);
echo 'mkdir "'.$dir.'"'."\n";
$node_list = $xpath->query('.//li[contains(@class,"item_row")]',$group->nextSibling);
foreach ($node_list as $node)
{
++$item_count;
$row = $xpath->query('.//a[@class="lecture-link"]',$node)->item(0);
 
$title = $row->firstChild->nodeValue; // retrieving text() via firstChild --> buggy point
 
$link = trim($row->attributes->getNamedItem('href')->nodeValue);
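$preview = get_page_xpath($link);
if ($preview===NULL) // the page could not be fetched, skip this lecture
{
file_put_contents('php://stderr', "Cannot fetch preview page for '$title'\n");
continue;
}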
$video_list = $preview->query('//video[@id="QL_video_element_first"]/source[@type="video/'.$ext.'"]');
if ($video_list->length==0)
{
file_put_contents('php://stderr', "Filetype $ext not found for '$title'\n");
continue;
}
 
$video = $video_list->item(0);
$vid_src = $video->attributes->getNamedItem('src')->nodeValue;
echo 'wget -nc "'.$vid_src.'" -O "'.$dir.'/'.str_pad($item_count,3,'0',STR_PAD_LEFT).'. '.fix_filename($title).'.'.strtolower($ext).'"'."\n";
 
}
}
}
 
if ($argc!=3)
{
file_put_contents('php://stderr', "Error: you should input two arguments, the usage is:\n");
file_put_contents('php://stderr', "\"PREVIEW_URL\" \"FILE_TYPE\"\n");
}
else
{
$url = $argv[1];
$filetype = $argv[2];
 
$xpath = get_page_xpath($url);
if ($xpath!==NULL)
print_wget($xpath,$filetype);
}
?>
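
The lookup that print_wget in c_preview.php runs on every preview page can be tried without any network access. A minimal sketch, assuming a made-up HTML fragment in place of a real preview page (the real markup may differ). The helper name find_video_src and the example.org URLs are hypothetical.

```php
<?php
// Made-up HTML standing in for a preview page. The <source> elements have
// explicit closing tags, so the parser does not have to guess where each ends.
$html = '<html><body><video id="QL_video_element_first">'
      . '<source type="video/webm" src="https://example.org/lecture.webm"></source>'
      . '<source type="video/mp4" src="https://example.org/lecture.mp4"></source>'
      . '</video></body></html>';

// Hypothetical helper: returns the src of the first matching <source>,
// or NULL when the requested file type is not offered.
function find_video_src($html, $ext)
{
    $dom = new DOMDocument();
    // silence libxml warnings about tags it may not recognize
    $errors_mode = libxml_use_internal_errors(TRUE);
    $dom->loadHTML($html);
    libxml_clear_errors();
    libxml_use_internal_errors($errors_mode);

    $xpath = new DOMXPath($dom);
    $list = $xpath->query('//video[@id="QL_video_element_first"]/source[@type="video/'.$ext.'"]');
    if ($list->length==0)
        return NULL;
    return $list->item(0)->getAttribute('src');
}

var_dump(find_video_src($html, 'mp4')); // the mp4 address from the fragment
var_dump(find_video_src($html, 'ogg')); // no such source in the fragment
```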

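Both scripts pass section and lecture titles through fix_filename before using them as directory and file names. The snippet below copies that function and applies it to a made-up title, to show the character mapping (colon to period, double quote to apostrophe, slash and backslash to underscore).

```php
<?php
// Copied from the scripts above: trims the string and maps
// ':' => '.', '"' => "'", '/' => '_', '\' => '_'.
function fix_filename($s)
{
    return strtr(trim($s),':"/\\','.\'__');
}

// The title is made up for illustration.
echo fix_filename(' Lecture 2: "Stream ciphers" 1/2 ')."\n";
// prints: Lecture 2. 'Stream ciphers' 1_2
```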