@wenming
Created August 13, 2012 01:24
downloading millions of files from blob storage fast
Raw notes for downloading 6.6+ million files from blob storage within hours using a few simple tools on a single machine.
1. Get a list of file URLs from blob storage. A few lines of C# code will do.
// In App.config:
<configuration>
  <appSettings>
    <add key="StorageConnectionString"
         value="DefaultEndpointsProtocol=https;AccountName=storagename;AccountKey=yourkey" />
  </appSettings>
</configuration>
// At the top of Program.cs (assumes the Windows Azure storage client library of the time):
using System;
using Microsoft.WindowsAzure;               // CloudStorageAccount, CloudConfigurationManager
using Microsoft.WindowsAzure.StorageClient; // CloudBlobClient, CloudBlobContainer, BlobRequestOptions
// In your main function put:
CloudStorageAccount storageAccount = CloudStorageAccount.Parse(
    CloudConfigurationManager.GetSetting("StorageConnectionString")); // stored in App.config
CloudBlobClient blobClient = storageAccount.CreateCloudBlobClient();
// Retrieve a reference to the container that holds the files.
CloudBlobContainer container = blobClient.GetContainerReference("yourblob");
// Flat listing enumerates every blob in the container, not just the top level.
foreach (var blobItem in container.ListBlobs(
    new BlobRequestOptions() { UseFlatBlobListing = true }))
{
    Console.WriteLine(blobItem.Uri);
}
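Run the listing program and redirect its output to a file to get the URL list on disk. The names here (ListBlobs.exe, urls.txt) are just placeholders for whatever your build output and list file are actually called:
ListBlobs.exe > urls.txt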
2. Open the file in vi/vim. The URL list is about 1.7 GB and Unicode-encoded; on Windows you may want to install Cygwin to get vim and the other tools below.
3. In vim, run :set ff=unix and :set fenc=utf-8 to fix the line endings and encoding.
4. Save the file with :wq. (No GUI editor I tried could edit a 1.7 GB file easily. The power of native code!) A non-interactive equivalent is sketched below.
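If you would rather not open a 1.7 GB file interactively, the same conversion can be scripted with vim's -c flag (assuming the list is saved as urls.txt):
vim -c 'set ff=unix fenc=utf-8' -c 'wq' urls.txt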
5. Split the file (which contains the URLs) into chunks of 100,000 lines each: split -l 100000 file. The chunks show up as xaa, xab, xac, and so on.
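A quick sanity check that split produced the expected number of chunks (6.6 million URLs at 100,000 per chunk is roughly 67 files):
ls x* | wc -l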
6. Ask wget to pull the URLs. The command looks like: wget -x -nv -i ./xaa, where -x recreates the directory structure from each URL, -nv suppresses most output, and -i reads URLs from one of the split files.
7. Now run this in parallel over all 67 split files (given we have 6.6 million files) using GNU parallel. Cygwin users need to download and build it: ./configure; make; make install (it installs into /usr/local/bin).
8. bash> ls x* | parallel -j 67 wget -x -nv -i ./{} spawns 67 jobs and keeps one wget process running through its 100k URLs (no need to re-resolve DNS for every file, etc.). Optionally change -j N to set the number of concurrent jobs.
9. You can also use a plain bash loop: for file in x*; do wget -x -nv -i "$file"; done. Note that this runs the chunks one after another; a backgrounded variant that runs them in parallel is sketched below.
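If GNU parallel is not available, a rough bash-only equivalent backgrounds one wget per chunk and waits for all of them, though unlike parallel -j it has no cap on how many run at once:
for file in x*; do wget -x -nv -i "$file" & done
wait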
If you download naively with standard C# code, one blob at a time, it will take an estimated 300,000 minutes (5,000 hours), even from within the storage data center itself.
Effectively you are using GNU parallel to start 67 wget processes that download 6.6+ million files while preserving their directory structure. That is FAST, since wget is native code. Best part? It is cross-platform and requires almost no programming.
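Putting it all together, the whole pipeline looks roughly like this (file names are placeholders, and the -j value should match the number of chunks):
ListBlobs.exe > urls.txt
vim -c 'set ff=unix fenc=utf-8' -c 'wq' urls.txt
split -l 100000 urls.txt
ls x* | parallel -j 67 wget -x -nv -i ./{}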
@ole-tange

How about avoiding split:
cat file | parallel wget -x -nv
