@daxadax
Created October 30, 2015 15:42
disclaimer: the URL used is just an example. I picked the first one from /r/opendirectories that worked well enough.
Not sure if this post is ok, but I figured I'd share some tips for mass-downloading things from directories.
I've used this for dirs with lots and lots of, say, mp3 files. It's easily customized and, better yet, fast!
If the server in question uses Apache (which thankfully most do), the first step is usually to retrieve a textual representation of the URLs and store it in a local file. That way the index doesn't have to be re-downloaded every time, which hugely improves speed.
I found lynx does the best job at this, since it has a built-in feature for turning the HTML tree into a textual representation:
lynx -dump http://ls.df.vc/pictures/ > listing.txt
results in this (for example):
Index of /pictures/
Name Last Modified Size Type
[1]Parent Directory/ - Directory
[2]avatar/ 2009-Oct-10 21:35:43 - Directory
[3]duckie/ 2009-Jul-18 03:02:57 - Directory
[ ... many more lines skipped ...]
lighttpd/1.4.35
References
1. http://ls.df.vc/
2. http://ls.df.vc/pictures/avatar/
3. http://ls.df.vc/pictures/duckie/
4. http://ls.df.vc/pictures/dump/
5. http://ls.df.vc/pictures/gif/
6. http://ls.df.vc/pictures/hayka/
7. http://ls.df.vc/pictures/mock_the_war/
8. http://ls.df.vc/pictures/record_store_gats/
9. http://ls.df.vc/pictures/404_gf_not_found.jpg
10. http://ls.df.vc/pictures/8bit_wedding.jpg
11. http://ls.df.vc/pictures/9_deadly_words_user_by_a_women.jpg
[... snibbedy snib ...]
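Since the whole point of the local listing is to not re-fetch the index over and over, I usually guard the dump so it only runs when listing.txt doesn't exist yet (nothing fancy, just a convenience):
# only dump the index if there's no local copy yet
[[ -f listing.txt ]] || lynx -dump http://ls.df.vc/pictures/ > listing.txt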
The stuff we want begins after 'References'. We don't have to parse this manually, thanks to regular expressions. Here's a simple script that greps out the URLs and feeds the whole mess to wget, while also checking that we're not pointlessly re-downloading files we already have:
url="http://ls.df.vc/"
# we only want files ending in '.jpg' or '.png' for now
grep 'http://.*\.(jpg|png)' listing.txt -oh | while read url; do
filename="$(basename "$url" | urldecode)"
if [[ ! -f "$filename" ]]; then
wget -c "$url"
fi
done
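And since I mentioned mp3 dirs at the start: the same loop works for any file type, you just swap the extensions in the pattern (the ones below are arbitrary examples):
# same loop as above, only with audio extensions instead of images
grep -Eoh 'http://[^[:space:]]+\.(mp3|ogg|flac)' listing.txt | while read -r url; do
    filename="$(basename "$url" | urldecode)"
    [[ -f "$filename" ]] || wget -c "$url"
done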
The urldecode script is nothing to write home about, although it's been working flawlessly so far (just put it in your ~/bin dir):
#!/usr/bin/env perl
use URI::Encode;

my $uri = URI::Encode->new({ encode_reserved => 0 });

while (<>) {
    print $uri->decode($_);
}
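A quick sanity check that the decoding works, assuming you saved it as ~/bin/urldecode and ~/bin is on your PATH:
chmod +x ~/bin/urldecode
echo 'some%20file%20%28remastered%29.mp3' | urldecode
# prints: some file (remastered).mp3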
However...
While the bash version works fine, it's really slow. Especially for large directories, a good percentage of the waiting is literally just bash doing its thing. This is mostly because bash has to fork/exec all the time, and fork is a rather heavy system call.
We can massively improve speed by using a different (better, maybe) language. I'm using Ruby, although you can do the same in Python, Perl, Lisp, even PHP if you feel that way. I don't judge.
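If you're curious how heavy fork actually is, spawning a few thousand do-nothing subshells gives a rough feel for it (numbers will obviously vary per machine):
# ten thousand no-op subshells -- every single one is a fork
time for i in $(seq 1 10000); do ( : ); done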
So here's the Ruby script that does the same thing:
url = "http://ls.df.vc/"
File.open("listing.txt", "r") do |fh|
listing = fh.read
listing.scan(/\d\d?\d?\d?\d?\d?. (http:\/\/.*.(jpg|png))/).each do |m|
furl = m.shift
base = File.basename(furl)
filename = URI.unescape(base)
if not File.file?(filename) then
system("wget", "-c", furl)
end
end
end
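Save it as, say, grab.rb (the name is arbitrary), cd into the directory the files should end up in, and run it:
cd ~/pictures/dump    # or wherever you want the files
ruby grab.rb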
By the way, you don't need to run Linux, Unix, et cetera for these scripts to work: for the bash variant, MinGW will work just fine. For the Ruby script I strongly suggest installing Cygwin -- it's easy, doesn't invade your %PATH%, and you get a proper POSIX environment on Windows. What more could you possibly ask for?
Also, I hope this post isn't messing with the subreddit rules. Just trying to help. If you have any questions, just go ahead! I'll answer them (if I can).