@AldairCoronel
Created August 31, 2020 06:33

Google Summer of Code 2020

Project Description

Title: Implement an Azure blobstore filesystem for Python SDK
JIRA Issue: BEAM-6807

Product: Apache Beam
Organization: The Apache Software Foundation

Student: Aldair Coronel Ruiz (aldaircr@ciencias.unam.mx)
Mentors: Pablo Estrada (pabloem@google.com) and Ismaël Mejía (iemejia@gmail.com)

Initial Goal

The Apache Beam Python SDK already supported Google Cloud Storage and Amazon Web Services S3. This project aimed to add support for Azure Blob Storage as a filesystem in the Python SDK.

What Has Been Achieved?

BEAM-6807 | PR #12492 In this pull request I managed to implement the following:

  • blobstoragefilesystem.py Filesystem interface methods.
  • blobstoragefilesystem_test.py Filesystem tests using mocks.
  • blobstorageio.py A client wrapper built on the Python SDK provided by Azure Blob Storage. The wrapper adds the operations that the client did not support directly but that the filesystem interface methods required.
  • blobstorageio_test.py Tests for the parse_azfs_path method, which extracts the storage account, container, and blob from a path of the form azfs://storage-account/container/blob.
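
To make the path format concrete, here is a minimal sketch of how a path such as azfs://storage-account/container/blob could be split into its three parts. It is not the code from the pull request: the function name parse_azfs_path_sketch, the blob_optional parameter, and the regular expression are assumptions made for illustration, and the sketch does not validate Azure naming rules.

```python
import re

# Hypothetical pattern: scheme, then storage account, container, and blob.
# The blob part may itself contain slashes (virtual directories).
_AZFS_PATTERN = re.compile(r'^azfs://([^/]+)/([^/]+)/(.*)$')


def parse_azfs_path_sketch(azfs_path, blob_optional=False):
    """Return (storage_account, container, blob) for an azfs:// path.

    Illustrative only; the real parse_azfs_path may differ in behavior.
    """
    match = _AZFS_PATTERN.match(azfs_path)
    if match is None:
        raise ValueError(
            'Path must look like azfs://<storage-account>/<container>/<blob>: '
            + repr(azfs_path))
    storage_account, container, blob = match.groups()
    # An empty blob name is only accepted when the caller allows it,
    # for example when the path refers to a whole container.
    if not blob and not blob_optional:
        raise ValueError('Path is missing a blob name: ' + repr(azfs_path))
    return storage_account, container, blob
```

A call such as parse_azfs_path_sketch('azfs://storage-account/container/blob') yields the three components as a tuple, while a path with another scheme or a missing container raises ValueError.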

PR #12734 In this pull request I managed to implement the following:

  • blobstoragefilesystem.py The filesystem now receives pipeline options:
    • --azfs_connection_string Connection string used for Azure/Azurite authentication.
    • --use_local_azurite If set, Azurite will be used.
  • pipeline_options.py AzureFileSystemOptions class to obtain pipeline options.
  • sdks/java/io/azure/build.gradle Added 4 tasks:
    • createAzuriteContainer
    • StartAzuriteContainer
    • StopAzuriteContainer
    • AzureLocalIT Runs and stops Azurite.
  • blobstorageio.py Now supports Azurite.
  • blobstorageio_test.py A wide variety of unit tests that run against Azurite.
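
The two pipeline options listed above can be pictured as ordinary command-line flags. The sketch below uses the standard argparse module as a stand-in and is not the AzureFileSystemOptions class from the pull request; the helper names build_azfs_parser and parse_azfs_options, the defaults, and the help strings are assumptions. Only the two flag names come from the list above.

```python
import argparse


def build_azfs_parser():
    """Build a parser holding only the two Azure-related flags (hypothetical)."""
    parser = argparse.ArgumentParser(add_help=False)
    parser.add_argument(
        '--azfs_connection_string',
        default=None,
        help='Connection string used for Azure/Azurite authentication.')
    parser.add_argument(
        '--use_local_azurite',
        default=False,
        action='store_true',
        help='If set, Azurite will be used.')
    return parser


def parse_azfs_options(argv):
    """Parse the Azure flags and ignore any other pipeline arguments."""
    # parse_known_args leaves unrelated flags (runner, project, ...) untouched.
    known, _unknown = build_azfs_parser().parse_known_args(argv)
    return known
```

For example, parse_azfs_options(['--use_local_azurite', '--azfs_connection_string', '<connection-string>']) returns a namespace whose use_local_azurite attribute is True and whose azfs_connection_string attribute holds the placeholder string.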

What Is Left To Do?

BEAM-10836 Implement a Gradle task that runs the Azure Blob Storage tests against Azurite.
BEAM-10815 Improve authentication story.

Final Thoughts

I am very happy with the experience and the results. These last 3 months have been full of new knowledge and have given me a better idea of what open source is. I thank my mentor Pablo for helping me with everything I needed.

Regarding the status of the project, I hope it will be useful for people who want to use Apache Beam with Azure Blob Storage. I will stay close by to make any needed improvements.

Thank you all :)

@saif007s
I am new here. Where should I start?