using System;
using System.Diagnostics;
using System.IO;
using System.IO.Compression;
using System.Threading.Tasks;
using Microsoft.Azure.Storage;
using Microsoft.Azure.Storage.Blob;
using Microsoft.Azure.WebJobs;
using Microsoft.Extensions.Logging;

namespace FileUnzipFunction
{
    public class FileUnzipBlobTrigger
    {
        private static readonly CloudBlobClient _blobClient;

        static FileUnzipBlobTrigger()
        {
            // Create the blob client once and reuse it across invocations.
            string storageConnectionString = Environment.GetEnvironmentVariable("AzureWebJobsStorage");
            CloudStorageAccount storageAccount = CloudStorageAccount.Parse(storageConnectionString);
            _blobClient = storageAccount.CreateCloudBlobClient();
        }

        [FunctionName("FileUnzipBlobTrigger")]
        public static async Task Run(
            [BlobTrigger("zipped/{name}", Connection = "AzureWebJobsStorage")] Stream inputBlob,
            string name,
            ILogger log)
        {
            log.LogInformation($"Processing blob: {name}");
            LogCurrentMemoryConsumption(log);

            var unzipContainer = _blobClient.GetContainerReference("unzipped");

            // ZipArchive reads entries on demand from the blob stream, so the whole
            // archive is not loaded into memory.
            using (ZipArchive archive = new ZipArchive(inputBlob, ZipArchiveMode.Read))
            {
                foreach (ZipArchiveEntry entry in archive.Entries)
                {
                    var blob = unzipContainer.GetBlockBlobReference(entry.FullName);
                    using (var stream = entry.Open())
                    {
                        log.LogInformation($"Unzipping {entry.FullName}");
                        LogCurrentMemoryConsumption(log);
                        await blob.UploadFromStreamAsync(stream);
                    }
                }
            }
        }

        public static void LogCurrentMemoryConsumption(ILogger log)
        {
            var process = Process.GetCurrentProcess();
            // Convert byte counts to megabytes to match the units in the log message.
            var physicalMemoryUsage = process.WorkingSet64 / 1024.0 / 1024.0;
            var virtualMemorySize = process.VirtualMemorySize64 / 1024.0 / 1024.0;
            var pagedMemorySize = process.PagedMemorySize64 / 1024.0 / 1024.0;
            log.LogInformation($"Memory Usage: WorkingSet64={physicalMemoryUsage} MB, VirtualMemorySize64={virtualMemorySize} MB, PagedMemorySize64={pagedMemorySize} MB");
        }
    }
}

using System;
using System.Diagnostics;
using System.IO;
using System.Threading.Tasks;
using ICSharpCode.SharpZipLib.Core;
using ICSharpCode.SharpZipLib.Zip;
using Microsoft.Azure.Storage;
using Microsoft.Azure.Storage.Blob;
using Microsoft.Azure.WebJobs;
using Microsoft.Extensions.Logging;

namespace FileUnzipFunction
{
    public class FileUnzipBlobTrigger
    {
        private static readonly CloudBlobClient _blobClient;

        static FileUnzipBlobTrigger()
        {
            // Create the blob client once and reuse it across invocations.
            string storageConnectionString = Environment.GetEnvironmentVariable("AzureWebJobsStorage");
            CloudStorageAccount storageAccount = CloudStorageAccount.Parse(storageConnectionString);
            _blobClient = storageAccount.CreateCloudBlobClient();
        }

        [FunctionName("FileUnzipBlobTrigger")]
        public static async Task Run(
            [BlobTrigger("zipped/{name}", Connection = "AzureWebJobsStorage")] Stream inputBlob,
            string name,
            ILogger log)
        {
            log.LogInformation($"Processing blob: {name}");
            LogCurrentMemoryConsumption(log);

            var unzipContainer = _blobClient.GetContainerReference("unzipped");

            using (var zipInputStream = new ZipInputStream(inputBlob))
            {
                ZipEntry entry;
                while ((entry = zipInputStream.GetNextEntry()) != null)
                {
                    if (!entry.IsFile) continue; // Ignore directories

                    var blob = unzipContainer.GetBlockBlobReference(entry.Name);
                    log.LogInformation($"Unzipping {entry.Name}");
                    LogCurrentMemoryConsumption(log);

                    // Copy the entry's data from the zip input stream to the blob output stream.
                    using (var blobStream = await blob.OpenWriteAsync())
                    {
                        StreamUtils.Copy(zipInputStream, blobStream, new byte[4096]);
                    }

                    LogCurrentMemoryConsumption(log);
                }
            }
        }

        public static void LogCurrentMemoryConsumption(ILogger log)
        {
            var process = Process.GetCurrentProcess();
            // Convert byte counts to megabytes to match the units in the log message.
            var physicalMemoryUsage = process.WorkingSet64 / 1024.0 / 1024.0;
            var virtualMemorySize = process.VirtualMemorySize64 / 1024.0 / 1024.0;
            var pagedMemorySize = process.PagedMemorySize64 / 1024.0 / 1024.0;
            log.LogInformation($"Memory Usage: WorkingSet64={physicalMemoryUsage} MB, VirtualMemorySize64={virtualMemorySize} MB, PagedMemorySize64={pagedMemorySize} MB");
        }
    }
}
Ricky-G commented May 21, 2023

The two code samples are quite similar in their basic approach: both handle the data in a streaming manner, which lets them process large zip files without loading the whole archive into memory.

However, there are some differences in the details of how they handle the streaming, which may have implications for their performance and resource usage:

  • The first code sample uses the ZipArchive class from System.IO.Compression in the .NET base class library, which provides a high-level, user-friendly interface for working with zip files. The second code sample uses the ZipInputStream class from the SharpZipLib library, which provides a lower-level, more flexible interface.
  • In the first code sample, ZipArchive takes care of reading from the blob stream and unzipping the data: it exposes an Open method on each entry, which returns a stream you read the unzipped data from. In the second code sample, you manually read from the ZipInputStream and write to the blob stream with StreamUtils.Copy.
  • The second code sample controls the buffer size explicitly, passing new byte[4096] when copying data from the zip input stream to the blob output stream. The first code sample relies on the default buffer size used by UploadFromStreamAsync.
Memory-wise both are similar (i.e. neither downloads the entire zip file into memory), but the first script takes around 20 minutes to process a 1 GB zip file (10 * 100 MB files), whereas the second script takes about 10 minutes for the same file. The difference mainly comes down to the custom buffer size and the optimizations in the SharpZipLib library; a sketch of that tuning follows below.
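
For illustration, here is a minimal sketch of that tuning applied to the second script's inner loop. The helper name CopyEntryToBlobAsync, the 64 KB copy buffer, and the 4 MB StreamWriteSizeInBytes value are assumptions chosen to show the knobs, not measured optimums:

private static async Task CopyEntryToBlobAsync(ZipInputStream zipInputStream, CloudBlockBlob blob)
{
    // Assumption: write the blob in 4 MB blocks; the legacy storage SDK exposes this
    // on CloudBlockBlob as StreamWriteSizeInBytes.
    blob.StreamWriteSizeInBytes = 4 * 1024 * 1024;

    using (var blobStream = await blob.OpenWriteAsync())
    {
        // StreamUtils.Copy accepts any caller-supplied buffer; a 64 KB buffer makes
        // fewer read/write calls per entry than the 4 KB buffer above (illustrative value).
        StreamUtils.Copy(zipInputStream, blobStream, new byte[64 * 1024]);
    }
}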

The first script has the benefit of not importing any third-party library, but it cannot run on an Azure Consumption plan: at the time of writing, the Consumption plan has a maximum runtime of 10 minutes.
The second script can potentially run on a Consumption plan, but at the cost of taking a dependency on a 3rd-party library.
