Pseudocode for Blob Listing/Counting Algorithm

Blob Storage provides two APIs that this approach depends on.

  • For an Azure Storage account, you can list all the containers it includes (the List Containers operation).
  • For any Azure Storage container, you can list all the blobs in the container (the List Blobs operation).
    • The key to this API is the “prefix” and “delimiter” parameters. They let you break up a single flat list of all the objects in the container into logical folders and process each folder independently.
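
To make the prefix/delimiter behaviour concrete, here is a minimal in-memory sketch. It does not call Azure at all: `list_level` is a hypothetical helper that only imitates how a listing with a prefix and a “/” delimiter splits a flat set of blob names into the blobs directly under the prefix and the virtual folders one level down. The sample blob names are invented for illustration.

```python
def list_level(blob_names, prefix="", delimiter="/"):
    """Imitate a prefix/delimiter listing over a flat list of blob names.

    Returns (blobs, folders): blobs that sit directly under `prefix`, and
    the distinct virtual folders (sub-prefixes) one level below it.
    """
    blobs, folders = [], set()
    for name in blob_names:
        if not name.startswith(prefix):
            continue  # outside the folder being listed
        rest = name[len(prefix):]
        head, sep, _ = rest.partition(delimiter)
        if sep:
            # The delimiter appears after the prefix, so the name belongs
            # to a sub-folder; report only the folder, once.
            folders.add(prefix + head + delimiter)
        else:
            blobs.append(name)
    return sorted(blobs), sorted(folders)


# Invented sample data: a flat namespace that merely looks hierarchical.
names = ["readme.txt", "logs/2024/a.log", "logs/2024/b.log", "logs/c.log", "img/x.png"]

top_blobs, top_folders = list_level(names)              # no prefix: top level
log_blobs, log_folders = list_level(names, "logs/")     # one folder down
```

Each call returns only one level, which is what allows every folder to become its own independent unit of work.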

Other requirements

  • You need a distributed queue to keep track of which accounts, containers, and folders still need to be listed. We typically use Storage Queues for this, since they provide the features needed at a good cost.
  • You need a way to track what you have counted. Many tools can do this, but whatever you choose must be able to handle the load from the distributed listers and support whatever aggregation you need to do on the data. Redis, Table Storage, and SQL are all good options.
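
As one possible shape for the queue messages and the tracking data, the sketch below uses a small JSON-serializable message type and an in-process counter. The names `ScanMessage`, `record_blob`, and the field layout are assumptions made for illustration; the original note does not prescribe a message format, and a real deployment would write to a shared store (Redis, Table Storage, SQL) rather than a local `Counter`.

```python
import json
from collections import Counter
from dataclasses import asdict, dataclass


@dataclass(frozen=True)
class ScanMessage:
    """Hypothetical queue message: one unit of listing work."""
    kind: str             # "account", "container", or "folder"
    account: str
    container: str = ""   # empty for "account" messages
    prefix: str = ""      # only set for "folder" messages

    def to_json(self) -> str:
        return json.dumps(asdict(self))

    @staticmethod
    def from_json(text: str) -> "ScanMessage":
        return ScanMessage(**json.loads(text))


def record_blob(totals: Counter, account: str, container: str, size: int) -> None:
    """Stand-in for the tracking system: count blobs and bytes per container."""
    totals[(account, container, "count")] += 1
    totals[(account, container, "bytes")] += size


# Round-trip a message the way a worker would when reading it from the queue.
msg = ScanMessage(kind="folder", account="myaccount", container="photos", prefix="2024/")
decoded = ScanMessage.from_json(msg.to_json())

totals = Counter()
record_blob(totals, "myaccount", "photos", 1024)
record_blob(totals, "myaccount", "photos", 2048)
```

Keeping the message small and self-describing means any worker instance can pick it up without shared state.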

Process

  • Create an application that pumps messages from the queue. Run many instances of it in parallel.
    • If the application sees an “account” message, it should call the list containers API and list all the containers in the account, putting a message for each container in the queue.
    • If the application sees a “container” message, it should call the list blobs API with no prefix and a “/” delimiter. This will return all the top-level files and folders in the container.
      • Update the tracking system with the info of all the files.
      • For each folder put a “folder” message in the queue
    • If the application sees a “folder” message, it should call the list blobs API with the prefix of the current folder and a “/” delimiter. This will return all the files and folders directly inside that folder.
      • Update the tracking system with the info of all the files.
      • For each folder put a “folder” message in the queue
  • Put a message in the queue with the name of the account that needs to be scanned. This triggers the process to start.
  • Keep an eye on the transactions-per-second numbers for both the storage account you are scanning and the account that hosts the queue. Add or remove instances of your application so the scan completes as fast as you need while the transactions-per-second numbers stay below the limits.
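
The process above can be sketched end to end with in-memory stand-ins. Everything here is an assumption made for illustration: `FAKE_ACCOUNT` replaces the storage account, a `deque` replaces the distributed queue, a `Counter` replaces the tracking system, and a single loop replaces the many parallel worker instances. Only the message flow (account, then container, then folder) follows the steps described.

```python
from collections import Counter, deque

# Hypothetical stand-in for one storage account: container -> {blob name: size}.
FAKE_ACCOUNT = {
    "photos": {"a.jpg": 10, "2024/b.jpg": 20, "2024/raw/c.dng": 30},
    "logs": {"app/x.log": 5},
}


def list_level(container, prefix, delimiter="/"):
    """Imitate a list-blobs call with a prefix and delimiter (one level only)."""
    blobs, folders = [], set()
    for name, size in FAKE_ACCOUNT[container].items():
        if not name.startswith(prefix):
            continue
        head, sep, _ = name[len(prefix):].partition(delimiter)
        if sep:
            folders.add(prefix + head + delimiter)
        else:
            blobs.append((name, size))
    return blobs, sorted(folders)


def handle(message, queue, totals):
    """Process one message the way a single worker instance would."""
    if message["kind"] == "account":
        # Fan out: one "container" message per container in the account.
        for container in FAKE_ACCOUNT:
            queue.append({"kind": "container", "container": container})
        return
    # "container" messages list with no prefix; "folder" messages use their own.
    container = message["container"]
    blobs, folders = list_level(container, message.get("prefix", ""))
    for _name, size in blobs:
        totals[(container, "count")] += 1
        totals[(container, "bytes")] += size
    for folder in folders:
        queue.append({"kind": "folder", "container": container, "prefix": folder})


def run_scan():
    queue = deque([{"kind": "account"}])  # the seed message starts the scan
    totals = Counter()
    while queue:
        handle(queue.popleft(), queue, totals)
    return totals


totals = run_scan()
```

Because each message is handled independently and only appends new messages, the same `handle` logic could in principle run in many instances against a shared queue, which is where the throughput monitoring described above comes in.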