Created August 13, 2012
downloading millions of files from blob storage fast
Raw notes for downloading 6.6+ million files from blob storage within hours, using a few simple tools on a single machine.
1. Get a list of files from blob storage. A few lines of C# code will do:
In App.config:

<configuration>
  <appSettings>
    <add key="StorageConnectionString"
         value="DefaultEndpointsProtocol=https;AccountName=storagename;AccountKey=yourkey" />
  </appSettings>
</configuration>

In your Main function:

CloudStorageAccount storageAccount = CloudStorageAccount.Parse(
    CloudConfigurationManager.GetSetting("StorageConnectionString")); // stored in App.config
CloudBlobClient blobClient = storageAccount.CreateCloudBlobClient();
// Retrieve a reference to the container, then list every blob in it
// (flat listing, so virtual "directories" are expanded into full paths)
CloudBlobContainer container = blobClient.GetContainerReference("yourblob");
foreach (var blobItem in container.ListBlobs(
    new BlobRequestOptions() { UseFlatBlobListing = true }))
{
    Console.WriteLine(blobItem.Uri);
}
2. Open the resulting file, which is 1.7 GB (Unicode), in vim; on Windows you may want to install Cygwin.
3. :set ff=unix and then :set fenc=utf-8 (convert to Unix line endings and UTF-8).
4. Save the file. (No GUI editor I had could edit a 1.7 GB file easily. The power of native code!)
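Steps 2 to 4 can also be scripted non-interactively, which avoids opening a 1.7 GB file in any editor. A sketch, assuming the raw listing came out as UTF-16 with CRLF line endings; "urls-utf16.txt" is a hypothetical name, and the sample data here stands in for the real listing:

```shell
# Hypothetical sample: two URLs saved as UTF-16 with CRLF, like the raw listing.
printf 'https://example.com/a\r\nhttps://example.com/b\r\n' \
  | iconv -f UTF-8 -t UTF-16 > urls-utf16.txt

# The actual conversion: UTF-16 -> UTF-8, then strip the carriage returns.
iconv -f UTF-16 -t UTF-8 urls-utf16.txt | tr -d '\r' > file
```

The result is the same plain UTF-8, Unix-line-ending file that the vim steps produce.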
5. Split the file (which contains the URLs) into chunks of 100,000 lines each: "split -l 100000 file". The chunks show up as xaa, xab, xac, and so on.
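To see the split mechanics on a small scale, here is a self-contained sketch with a throwaway 250,000-line URL list (the file name and URLs are placeholders, not the real listing):

```shell
# Generate a hypothetical 250,000-line URL list, then split it
# into 100,000-line chunks exactly as in step 5.
seq 1 250000 \
  | sed 's|^|https://storagename.blob.core.windows.net/yourblob/file|' \
  > urls-demo.txt
split -l 100000 urls-demo.txt   # produces xaa (100k), xab (100k), xac (50k)
wc -l < xaa                     # 100000
```

250,000 lines yield three chunks; the real 6.6M-line listing yields 67.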
6. Ask wget to pull the URLs: wget -x (create directories from each URL) -nv (be quiet most of the time) -i ./xaa (one of the split files).
7. Now run this in parallel on all 67 chunks (given 6.6 million files at 100,000 lines each) using GNU parallel. Cygwin users need to download and build it: ./configure; make; make install (it lands in /usr/local/bin).
8. bash> ls x* | parallel -j 67 wget -x -nv -i ./{}
This spawns 67 processes and keeps one wget running over its 100k URLs each (no need to re-look-up DNS per file, etc.). Optionally change -j N to set the number of concurrent jobs.
9. You can also simply use a bash loop: for file in x*; do wget -x -nv -i "$file"; done (note this processes the chunks one after another, so it loses the parallelism of step 8).
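If GNU parallel is unavailable and the sequential loop in step 9 is too slow, plain bash job control gets most of the way there. A sketch, with a stub download function standing in for wget so it runs anywhere; the chunk names are placeholders:

```shell
# Launch one background job per chunk, then wait for all of them to finish.
# 'download' is a stub standing in for: wget -x -nv -i "$f"
download() { echo "fetched $1" >> fetch.log; }

for f in chunk-a chunk-b chunk-c; do   # with real chunks: for f in x*
  download "$f" &
done
wait   # blocks until every background job has exited
```

Unlike parallel -j N, this launches everything at once with no concurrency cap, which is fine for 67 jobs.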
If you download naively using standard C# code, it will take an estimated 300,000 minutes (5,000 hours), even from within the storage data center.
Effectively you are using GNU parallel to start 67 wget processes that download 6.6+ million files while preserving their directory structure. That is FAST, since wget is native code. Best part? It is cross-platform, and there is not much programming involved.
How about avoiding split:
cat file | parallel wget -x -nv
(GNU parallel appends each input line as the final argument, so this runs one wget per URL. It skips the split step, at the cost of spawning a process and doing a DNS lookup for every file instead of once per 100k chunk.)