@wenming
Created August 13, 2012 01:24
downloading millions of files from blob storage fast
Raw notes for downloading 6.6+ million files from blob storage within hours using a few simple tools on a single machine.
1. Get a list of file URLs from blob storage. A few lines of C# code will do.
// In App.config:
<configuration>
  <appSettings>
    <add key="StorageConnectionString"
         value="DefaultEndpointsProtocol=https;AccountName=storagename;AccountKey=yourkey" />
  </appSettings>
</configuration>
// At the top of Program.cs (assumes the Windows Azure storage client library of the time):
using System;
using Microsoft.WindowsAzure;               // CloudStorageAccount, CloudConfigurationManager
using Microsoft.WindowsAzure.StorageClient; // CloudBlobClient, CloudBlobContainer, BlobRequestOptions
// In your main function put:
CloudStorageAccount storageAccount = CloudStorageAccount.Parse(
    CloudConfigurationManager.GetSetting("StorageConnectionString")); // stored in App.config
CloudBlobClient blobClient = storageAccount.CreateCloudBlobClient();
// Retrieve a reference to the container that holds the files.
CloudBlobContainer container = blobClient.GetContainerReference("yourblob");
// Flat listing enumerates every blob in the container, not just the top level.
foreach (var blobItem in container.ListBlobs(
    new BlobRequestOptions() { UseFlatBlobListing = true }))
{
    Console.WriteLine(blobItem.Uri);
}
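Run the listing program and redirect its output to a file to get the URL list on disk. The names here (ListBlobs.exe, urls.txt) are just placeholders for whatever your build output and list file are actually called:
ListBlobs.exe > urls.txt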
2. Open the file in vi/vim. The URL list is about 1.7 GB and Unicode-encoded; on Windows you may want to install Cygwin to get vim and the other tools below.
3. In vim, run :set ff=unix and :set fenc=utf-8 to fix the line endings and encoding.
4. Save the file with :wq. (No GUI editor I tried could edit a 1.7 GB file easily. The power of native code!) A non-interactive equivalent is sketched below.
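If you would rather not open a 1.7 GB file interactively, the same conversion can be scripted with vim's -c flag (assuming the list is saved as urls.txt):
vim -c 'set ff=unix fenc=utf-8' -c 'wq' urls.txt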
5. Split the file (which contains the URLs) into chunks of 100,000 lines each: split -l 100000 file. The chunks show up as xaa, xab, xac, and so on.
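A quick sanity check that split produced the expected number of chunks (6.6 million URLs at 100,000 per chunk is roughly 67 files):
ls x* | wc -l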
6. Ask wget to pull the URLs. The command looks like: wget -x -nv -i ./xaa, where -x recreates the directory structure from each URL, -nv suppresses most output, and -i reads URLs from one of the split files.
7. Now run this in parallel over all 67 split files (given we have 6.6 million files) using GNU parallel. Cygwin users need to download and build it: ./configure; make; make install (it installs into /usr/local/bin).
8. bash> ls x* | parallel -j 67 wget -x -nv -i ./{} spawns 67 jobs and keeps one wget process running through its 100k URLs (no need to re-resolve DNS for every file, etc.). Optionally change -j N to set the number of concurrent jobs.
9. You can also use a plain bash loop: for file in x*; do wget -x -nv -i "$file"; done. Note that this runs the chunks one after another; a backgrounded variant that runs them in parallel is sketched below.
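If GNU parallel is not available, a rough bash-only equivalent backgrounds one wget per chunk and waits for all of them, though unlike parallel -j it has no cap on how many run at once:
for file in x*; do wget -x -nv -i "$file" & done
wait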
If you download naively with standard C# code, one blob at a time, it will take an estimated 300,000 minutes (5,000 hours), even from within the storage data center itself.
Effectively you are using GNU parallel to start 67 wget processes that download 6.6+ million files while preserving their directory structure. That is FAST, since wget is native code. Best part? It is cross-platform and requires almost no programming.
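Putting it all together, the whole pipeline looks roughly like this (file names are placeholders, and the -j value should match the number of chunks):
ListBlobs.exe > urls.txt
vim -c 'set ff=unix fenc=utf-8' -c 'wq' urls.txt
split -l 100000 urls.txt
ls x* | parallel -j 67 wget -x -nv -i ./{}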
@ole-tange

How about avoiding split:
cat file | parallel wget -x -nv
