{
"cells": [
{
"cell_type": "markdown",
"id": "153743b1-e04a-4cf9-a6e5-1f67ecd68a2b",
"metadata": {},
"source": [
"# Accessing data from Azure containers\n",
"\n",
"> (c) Benoit Hamelin, 2021-2022\n",
"\n",
"It is getting ever more convenient for organizations to share data through cloud storage. We are committed here to Azure cloud technologies. As such, one of the ideal devices for raw data storage are _Azure blob containers_, also called *Azure data containers*, or jocularly, *Azure buckets*. For whoever is familiar with AWS _buckets_, these are similar. They are sets of data _blobs_ (in Azurespeak), which one can understand as files. Data containers are <a name=\"notfs\"></a>not hierarchical file systems: they are effectively flat. However, one can use slash characters (`/`) in the name of blobs: from that, any set of blobs with a common slash-ending prefix can be understood as part of a common \"directory.\" So for most intents and purposes, blobs are indistinguishable from file systems.\n",
"\n",
"Data containers are storage systems that are accessible through the web, using a HTTPS endpoint and a set of REST verbs and APIs. Fortunately, we have a set of nifty tools at hand that wrap these complexities under abstractions that are easy to manipulate. This notebook introduces two such tools for handling data across Azure containers. The first is a command-line tool named **Rclone** that handles synchronization of data directories between regular file systems and various cloud storage systems. The second is the Python package **fsspec**, which abstracts away storage systems when doing data science in Python programs (including Jupyter notebooks)."
]
},
{
"cell_type": "markdown",
"id": "19dabc20-d8fc-4eac-aadc-825165754d8a",
"metadata": {},
"source": [
"---\n",
"\n",
"## Authorization for Azure data containers: storage account, access keys and SAS tokens\n",
"\n",
"Contrary to many other Azure resources, data containers are accessible through web-visible endpoints, regardless of restrictions that Azure subscription administrators put on resource visibility. This makes them highly useful for sharing data beyond organizational boundaries, but puts the onus on data owners to manage who is authorized to access their data. While the topic of authorization can be taken to identity and access management (IAM) in Azure, we will concern ourselves here with ad hoc mechanisms that data owners can wield to grant access authorization to Azure-stored data, down to designated blobs.\n",
"\n",
"As mentioned above, data containers are sets of blobs; one step over them, a set of data containers are associated to a structure called a _storage account_. Visually:\n",
"\n",
"- Storage account\n",
" - Container\n",
" - Container\n",
" - Container...\n",
" - Blob\n",
" - Blob\n",
" - Blob...\n",
"- Storage account...\n",
"\n",
"Storage accounts have a web-visible name, and are associated to a pair of _access keys_. Anybody who knows the storage account name and detains one of the two keys has full access to the containers under the storage account. Thus, sharing these keys are an effective way for a data owner to share this ownership; the flipside is that this grants full power to other custodians of any of the two keys. Any of these two keys can be changed at any time by Azure users who can access the storage account through Azure IAM, so this grants a modicum of control on the storage account on Azure users that are not extended to key custodians.\n",
"\n",
"In addition to authorization by sharing the access keys of a storage account, one can use these keys to emit _shared access signatures_, more commonly known as _SAS tokens_. Such tokens encode a set of access permissions, including specific container or blob identifier, data access privileges and an authorization period, and signs this data cryptographically using one of the account's access keys. These tokens can then be shared as is, or embedded into data access URLs, denoted as _SAS URLs_.\n",
"\n",
"Authorization by SAS tokens/URLs provides compelling advantages over sharing account keys. The first is granularity: instead of providing a blanket authorization over a storage account, a SAS token targets a specific data container or blob, and grants for this object a set of specific permissions (read, write, add or delete blobs, etc.). The second advantage is that the access is limited to a certain time period: once the expiry date of the SAS token is reached, the authorization it granted is rescinded, and access becomes blocked. The third advantage is that the authorization is tied to one of the account's access keys. If that key were to be revoked by the data owner, all SAS tokens provisioned with this key become rescinded.\n",
"\n",
"As a matter of practice, key sharing is rare, and only done between users that share data ownership responsibilities. It is rather more common that highly privileged access to a storage account is authorized through Azure IAM features, which also provide authentication. To provide access to data resources on an ad hoc basis, particularly for a determined period or to users outside of the organizations (and thus not covered by Azure IAM), provisioning and giving SAS tokens is nearly always preferable."
]
},
{
"cell_type": "markdown",
"id": "78a97d87-223f-45a2-8ab7-6fd6234c8dd1",
"metadata": {},
"source": [
"### A note on container and blob *paths*\n",
"\n",
"As mentioned, Azure data containers are [not hierarchical](#not-fs) file systems: they contain a flat set of objects whose names merely _mimic_ hierarchy. Certain tools such as Rclone (discussed [below](#rclone)) take this convention further than others, yet the unsuspecting user will want to keep that in mind if tools are not behaving as expected. Whenever tools conflate container and blob identifiers with paths, such paths adhere to the expected structure\n",
"`<container name>/<blob name>`, with the name of the blob being possibly composed of slashes as well. This kind of confusion does not happen much when authorization is granted through shared access keys, because one's visibility on the storage account contents is then complete. However, when access is authorized to say, a blob named `a/b/c` in a data container named `x`, one must still specify full path `x/a/b/c`. For example, let's pretend we have shell programs `blob-read` and `blob-list`: the former echoes the content of the blob to standard output, and the latter lists the subset of blobs in a data container that have a given name prefix. I get a SAS token that enables reading and listing `x/a/b/c`, so from my shell, I can run\n",
"\n",
"```\n",
"$ blob-read my-token x/a/b/c > local-file\n",
"```\n",
"\n",
"and expect to store the blob contents in `local-file`. However, if someone attempts to look up their file through\n",
"\n",
"```\n",
"$ blob-list my-token x/a/b\n",
"```\n",
"\n",
"the command will fail and report that access is forbidden. Yet,\n",
"\n",
"```\n",
"$ blob-list my-token x/a/b/c\n",
"```\n",
"\n",
"will succeed."
]
},
{
"cell_type": "markdown",
"id": "04788c3f-ff04-46ca-83a0-da2c4e27b717",
"metadata": {},
"source": [
"### Take away\n",
"\n",
"When somebody tells you that data will be put in an Azure blob container, they either give you a storage account name, container name and SAS token, like this:\n",
"\n",
"```\n",
"name-of-storage-account\n",
"name-of-container\n",
"sp=racwl&st=ISO8601-TIMESTAMP&se=ISO8601-TIMESTAMP&spr=https&sv=SOME-DATE&sr=c&sig=URL-ENCODED-CRYPTO-SIGNATURE\n",
"```\n",
"\n",
"or a SAS URL like this:\n",
"\n",
"```\n",
"https://name-of-storage-account.blob.core.windows.net/name-of-container?sas-token\n",
"```"
]
},
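{
"cell_type": "markdown",
"id": "c3a51b0e-7d2f-4a4e-9b1c-5e8f2d4a6b90",
"metadata": {},
"source": [
"If you are handed the SAS URL form but a tool needs the pieces separately (as the Python tooling later in this notebook does), the URL can be taken apart with the standard library. A minimal sketch, on the placeholder URL above:\n",
"\n",
"```\n",
"from urllib.parse import urlsplit\n",
"\n",
"sas_url = \"https://name-of-storage-account.blob.core.windows.net/name-of-container?sas-token\"\n",
"parts = urlsplit(sas_url)\n",
"account_name = parts.netloc.split(\".\")[0]  # name-of-storage-account\n",
"container = parts.path.lstrip(\"/\")         # name-of-container\n",
"sas_token = parts.query                     # the SAS token proper\n",
"```"
]
},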
{
"cell_type": "markdown",
"id": "5210d2bd-6816-4050-a1db-4f0605a81d8d",
"metadata": {},
"source": [
"---\n",
"\n",
"## <a name=\"rclone\"></a>Rclone: the swiss-army knife of data access\n",
"\n",
"There exists a large array of data storage services, and keeping track of distinct APIs for all of them is tedious and tiresome. Enter [Rclone](https://rclone.org/), a multiplatform command-line tool (also sporting a GUI for those so inclined) for listing, copying, synchronizing and otherwise accessing data across all services known to Human.[\\*](#footnote1) The following offers a quick crash-course on the setup and usage of Rclone to get a local copy of a dataset, or some part of it.\n",
"\n",
"### Installation\n",
"\n",
"Standalone executables of Rclone are [available](https://rclone.org/downloads/) for all major platforms: one does not need administrative privilege to grab a copy and run it from their own account. In addition, those of us imbued with such privileges can also use Apt, Yum, etc. to install it more conveniently. Finally, for Conda users, Rclone can be installed within one's environment with the usual `conda install rclone`."
]
},
{
"cell_type": "markdown",
"id": "fd54269e-36be-4a40-8435-089cfbeb76f0",
"metadata": {},
"source": [
"### Interactive configuration\n",
"\n",
"Out of the box, Rclone can be used to perform local file operations and transfers. We are, however, much more interested in using it to access data stored in remote services. Such remote services, which Rclone shortens to _remotes_, must be configured using Rclone's interactive configuration tool. This tool is invoked with command\n",
"\n",
"```\n",
"$ rclone config\n",
"```\n",
"\n",
"Here is an example of an interactive run of `rclone config` to set up a remote associated to a SAS URL. The command prompt is denoted as `$`: lines that don't start with this prompt are Rclone output. Whenever Rclone itself prompts for interactive input, the prompt finishes with character `>`; the string that follows is the answer provided from the keyboard. We will assume that I have been given a SAS URL enabled to read and list a data container named `cupboard` on a storage account named `kitchen`.\n",
"\n",
"```\n",
"$ rclone config\n",
"2021/08/23 17:27:40 NOTICE: Config file \"/home/<user name>/.config/rclone/rclone.conf\" not found - using defaults\n",
"No remotes found - make a new one\n",
"```\n",
"\n",
"This is shown the first time we run Rclone.\n",
"\n",
"```\n",
"n) New remote\n",
"s) Set configuration password\n",
"q) Quit config\n",
"n/s/q> n\n",
"```\n",
"\n",
"I hereby ask to set up a new remote.\n",
"\n",
"```\n",
"name> cupboard-in-kitchen\n",
"```\n",
"\n",
"All remotes are given a unique name, so we can indicate services at the source or destination of files or operations. I chose here the name `cupboard-in-kitchen`. It could have been anything: it contains the monikers `kitchen` and `cupboard` just because they made sense, not because they are mandatory at this stage.\n",
"\n",
"```\n",
"Type of storage to configure.\n",
"Enter a string value. Press Enter for the default (\"\").\n",
"Choose a number from below, or type in your own value\n",
" 1 / 1Fichier\n",
" \\ \"fichier\"\n",
" 2 / Alias for an existing remote\n",
" \\ \"alias\"\n",
" 3 / Amazon Drive\n",
" \\ \"amazon cloud drive\"\n",
" \n",
" ... so many things! ...\n",
"\n",
"22 / Microsoft Azure Blob Storage\n",
" \\ \"azureblob\"\n",
"\n",
" ...\n",
"36 / http Connection\n",
" \\ \"http\"\n",
"37 / premiumize.me\n",
" \\ \"premiumizeme\"\n",
"38 / seafile\n",
" \\ \"seafile\"\n",
"Storage> 22\n",
"```\n",
"\n",
"Next is a long list of storage service options. We go for Microsoft Azure Blob Storage, even if you are given a SAS token that targets a full data container, or even get shared a key to a storage account. Remark that the number of this option may be different when you run `rclone config`!\n",
"\n",
"```\n",
"** See help for azureblob backend at: https://rclone.org/azureblob/ **\n",
"\n",
"Storage Account Name (leave blank to use SAS URL or Emulator)\n",
"Enter a string value. Press Enter for the default (\"\").\n",
"account> \n",
"```\n",
"\n",
"Normally we would type in `kitchen`, but the name of the storage account is embedded in the SAS URL. It's thus safe to leave it blank and carry on by typing **[Enter]**.\n",
"\n",
"```\n",
"Path to file containing credentials for use with a service principal.\n",
"\n",
"Leave blank normally. Needed only if you want to use a service principal instead of interactive login.\n",
"\n",
" $ az ad sp create-for-rbac --name \"<name>\" \\\n",
" --role \"Storage Blob Data Owner\" \\\n",
" --scopes \"/subscriptions/<subscription>/resourceGroups/<resource-group>/providers/Microsoft.Storage/storageAccounts/<storage-account>/blobServices/default/containers/<container>\" \\\n",
" > azure-principal.json\n",
"\n",
"See [\"Create an Azure service principal\"](https://docs.microsoft.com/en-us/cli/azure/create-an-azure-service-principal-azure-cli) and [\"Assign an Azure role for access to blob data\"](https://docs.microsoft.com/en-us/azure/storage/common/storage-auth-aad-rbac-cli) pages for more details.\n",
"\n",
"Enter a string value. Press Enter for the default (\"\").\n",
"service_principal_file> \n",
"Storage Account Key (leave blank to use SAS URL or Emulator)\n",
"Enter a string value. Press Enter for the default (\"\").\n",
"key> \n",
"```\n",
"\n",
"A bunch of questions about authorization mechanisms we don't care about. We have a SAS URL, let's focus on that; other fields, we leave blank and carry on.\n",
"\n",
"```\n",
"SAS URL for container level access only\n",
"(leave blank if using account/key or Emulator)\n",
"Enter a string value. Press Enter for the default (\"\").\n",
"sas_url> https://kitchen.blob.core.windows.net/cupboard?sp=racwdl&st=START-DATE&se=EXPIRY-DATE&spr=https&sv=DATE&sr=PRIVILEGES&sig=SIGNATURE\n",
"```\n",
"\n",
"This is where I pasted the SAS URL I was given. You can see in upper case the various fields whose values would be different with your own SAS URL.\n",
"\n",
"```\n",
"Use a managed service identity to authenticate (only works in Azure)\n",
"\n",
"When true, use a [managed service identity](https://docs.microsoft.com/en-us/azure/active-directory/managed-identities-azure-resources/)\n",
"to authenticate to Azure Storage instead of a SAS token or account key.\n",
"\n",
"If the VM(SS) on which this program is running has a system-assigned identity, it will\n",
"be used by default. If the resource has no system-assigned but exactly one user-assigned identity,\n",
"the user-assigned identity will be used by default. If the resource has multiple user-assigned\n",
"identities, the identity to use must be explicitly specified using exactly one of the msi_object_id,\n",
"msi_client_id, or msi_mi_res_id parameters.\n",
"Enter a boolean value (true or false). Press Enter for the default (\"false\").\n",
"use_msi> \n",
"Uses local storage emulator if provided as 'true' (leave blank if using real azure storage endpoint)\n",
"Enter a boolean value (true or false). Press Enter for the default (\"false\").\n",
"use_emulator> \n",
"```\n",
"\n",
"More questions about features we can ignore and leave blank.\n",
"\n",
"```\n",
"Edit advanced config? (y/n)\n",
"y) Yes\n",
"n) No (default)\n",
"y/n> n\n",
"```\n",
"\n",
"Nope, no need for anything advanced.\n",
"\n",
"```\n",
"Remote config\n",
"--------------------\n",
"[cupboard-in-kitchen]\n",
"type = azureblob\n",
"sas_url = https://kitchen.blob.core.windows.net/cupboard?sp=racwdl&st=START-DATE&se=EXPIRY-DATE&spr=https&sv=DATE&sr=PRIVILEGES&sig=SIGNATURE\n",
"--------------------\n",
"y) Yes this is OK (default)\n",
"e) Edit this remote\n",
"d) Delete this remote\n",
"y/e/d> y\n",
"```\n",
"\n",
"All good, accept the configuration with `y`.\n",
"\n",
"```\n",
"Current remotes:\n",
"\n",
"Name Type\n",
"==== ====\n",
"cupboard-in-kitchen azureblob\n",
"\n",
"e) Edit existing remote\n",
"n) New remote\n",
"d) Delete remote\n",
"r) Rename remote\n",
"c) Copy remote\n",
"s) Set configuration password\n",
"q) Quit config\n",
"e/n/d/r/c/s/q> q\n",
"```\n",
"\n",
"And we're done with configuration."
]
},
{
"cell_type": "markdown",
"id": "e0c09225-a054-4523-94f7-628bec38e81e",
"metadata": {},
"source": [
"### Listing contents of a data resource\n",
"\n",
"Rclone, like many modern command-line tools, actually packs in multiple subcommands, all implemented under an umbrella of common features. The first subcommand of interest is `ls`, to \"recursively\" list the contents of a data resource. If the resource is a blob, we get an echo of only that blob; for a data container, we get the full contents. For the **cupboard-in-kitchen** example of last section:\n",
"\n",
"```\n",
"$ rclone ls cupboard-in-kitchen:cupboard\n",
"1045 pasta\n",
"56 meatball/beef\n",
"567 meatball/breadcrumbs\n",
"879 tomato\n",
"1034 gravy/sugar\n",
"4194 gravy/stock\n",
"23 gravy/cornstarch\n",
"```\n",
"\n",
"We get the set of blobs, preceded with their respective size in bytes. Remark the argument to `rclone ls`, formed as `remote:container/blob` -- the container name is present at the root of the path even if it's part of the SAS URL. We can make this listing more precise by listing all blobs under the `gravy` \"directory:\"\n",
"\n",
"```\n",
"$ rclone ls cupboard-in-kitchen:cupboard/gravy\n",
"1034 gravy/sugar\n",
"4194 gravy/stock\n",
"23 gravy/cornstarch\n",
"```"
]
},
{
"cell_type": "markdown",
"id": "40afcf1f-4589-4c01-aaab-cf2cec8ef4ae",
"metadata": {},
"source": [
"### Copying remote contents\n",
"\n",
"Actual file transfers at last! Done simply with `rclone copy`.\n",
"\n",
"```\n",
"$ rclone copy -P cupboard-in-kitchen:cupboard/ ./my-cupboard/\n",
"```\n",
"\n",
"The `-P` option shows progress on the copy operation. It is overkill in the case of the can and cupboard, but one often use Rclone to run large transfers over flaky networks. It's not a bad habit to have. Speaking of flaky networks, should the connection between Azure and one's computer be interrupted, Rclone is smart enough to avoid transferring blobs/files already completely present. So one simply repeats the last Rclone command to resume the transfer.\n",
"\n",
"[Documentation](https://rclone.org/commands/rclone_copy/) on the `copy` subcommand indicates that Rclone behaves like `rsync` when slashes are appended on both the source and destination directories. This means that, even if directory `my-cupboard` exists when `rclone copy` is run, the contents of remote container `cupboard` are put under `my-cupboard`, not under `my-cupboard/cupboard`. This is typically the desired behavior.\n",
"\n",
"When one deals with containers, `rclone copy` can select a subset of blobs by indicating their \"directory:\"\n",
"\n",
"```\n",
"$ rclone copy -P cupboard-in-kitchen:cupboard/gravy the-gravy\n",
"```\n",
"\n",
"Further selection can be effected using the `--filter` option. Please refer to the [documentation](https://rclone.org/filtering/) for details."
]
},
{
"cell_type": "markdown",
"id": "260b651a-2873-471a-b98b-aa63033bd5e8",
"metadata": {},
"source": [
"Do remark that many datasets that are shared through Azure data containers can be onerously large: it might not be convenient to download them whole. It may make sense, then to copy only an excerpt, hone methods on that, and then run a well-tuned computation on the data directly over HTTPS streaming.\n",
"\n",
"Also, check with data owners regarding data access and storage policy. Once authorization is rescinded, data owners may be on the hook for their collaborators to destroy local data copies and caches. Once data is copied locally, data owners lose direct control over the datasets, and expect collaborators to comply with policies on their honor. Should data be leaked away by rogue collaborators, CSE may perceive data sharing as an operational and reputational risk. Collaboration based on data sharing would then be terminated, at the detriment of a budding research community."
]
},
{
"cell_type": "markdown",
"id": "7f1d0b93-abb1-4522-87fc-f97118d0188e",
"metadata": {},
"source": [
"---\n",
"\n",
"## Accessing data on Azure with tools from the [PyData](https://pydata.org/) ecosystem\n",
"\n",
"PyData is an educational program managed by [NumFOCUS](https://numfocus.org/), a nonprofit aiming at improving science and knowledge through open source tooling. The _PyData ecosystem_ is a collection of data science and scientific computing packages based on the Python platform. It includes such staples as [Numpy](https://numpy.org/), [SciPy](https://scipy.org/) and [Pandas](https://pandas.pydata.org/). This software collection relies on package `fsspec` for service-agnostic data access, and the `fsspec` \"plug-in\" for Azure storage is module `adlfs`.\n",
"\n",
"As it installed, the `adlfs` package augments the internal capacities of tools such as Pandas, enabling them to read from a larger set of URLs. To fetch data from a blob, one uses a URL of the form (using the `kitchen` account and `cupboard` container):\n",
"\n",
"```\n",
"abfs://CONTAINER/BLOB\n",
"```\n",
"\n",
"Picking up on the example of the `kitchen` storage account and `cupboard` container, one would read the `gravy/sugar` blob with this URL:\n",
"\n",
"```\n",
"abfs://cupboard/gravy/sugar\n",
"```\n",
"\n",
"The storage account and SAS token are passed on using a supplemental parameter named `storage_options`, defined as a dictionary:\n",
"\n",
"```\n",
"storage_options = {\n",
" \"account_name\": \"kitchen\",\n",
" \"sas_token\": \"<INSERT TOKEN HERE>\"\n",
"}\n",
"```\n",
"\n",
"This is where a package called `python-dotenv` comes in handy. Do the following if you have control of the Python environment from which this kernel runs."
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "9577d165-9498-4916-94c9-e96eec9964ec",
"metadata": {},
"outputs": [],
"source": [
"%pip install fsspec python-dotenv pandas adlfs"
]
},
{
"cell_type": "markdown",
"id": "0c78a2cc-16d7-41be-a532-1e431daa342d",
"metadata": {},
"source": [
"> Note that package `adlfs` is a sort of \"plug-in\" package to fsspec to enable it to speak Azure Blob Storage."
]
},
{
"cell_type": "markdown",
"id": "ef799fa4-5156-42e4-a30d-5d5f78f1bca0",
"metadata": {},
"source": [
"You will put the storage account name and SAS token into a file named `.env` in the base directory where your kernel is running. Edit and run the following cell to do this:"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "cd8d03dd-aaab-4f30-bdfc-52d8a0f0e7c1",
"metadata": {},
"outputs": [],
"source": [
"%%writefile .env\n",
"AZURE_STORAGE_ACCOUNT_NAME=\"replace with storage account name\"\n",
"AZURE_STORAGE_SAS_TOKEN=\"replace with SAS token\""
]
},
{
"cell_type": "markdown",
"id": "5f69f181-0c3f-4514-81d8-072376c37128",
"metadata": {},
"source": [
"Now, thanks to package `python-dotenv`, you can have easy-peasy loading of your storage credentials in any Jupyter notebook. Remark that the package also provides similar normal Python routines for loading the credentials in scripts."
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "2a2aa277-0e4b-4903-8834-cf7dc2e8e023",
"metadata": {},
"outputs": [],
"source": [
"%load_ext dotenv\n",
"%dotenv"
]
},
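{
"cell_type": "markdown",
"id": "8d41f6a2-3c9b-47e5-b0d8-1f2a6c7e9d34",
"metadata": {},
"source": [
"The magics above cover notebooks. In a plain Python script, the equivalent is a minimal sketch along these lines (assuming the same `.env` file sits in the working directory; building `storage_options` explicitly is optional, since the environment variables alone suffice, as we see below):\n",
"\n",
"```\n",
"import os\n",
"\n",
"from dotenv import load_dotenv\n",
"\n",
"load_dotenv()  # Reads .env and exports its entries as environment variables.\n",
"storage_options = {\n",
"    \"account_name\": os.environ[\"AZURE_STORAGE_ACCOUNT_NAME\"],\n",
"    \"sas_token\": os.environ[\"AZURE_STORAGE_SAS_TOKEN\"],\n",
"}\n",
"```"
]
},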
{
"cell_type": "markdown",
"id": "337784c0-dffc-414c-9211-6e68afc90152",
"metadata": {},
"source": [
"And that spares us any specification of storage option forever! It also moves credential storage _out_ of notebooks and into an ad hoc configuration file, which is a critical best practice."
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "351be1bd-33cc-4b5e-b55c-3130da75a595",
"metadata": {},
"outputs": [],
"source": [
"import pandas as pd\n",
"\n",
"CONTAINER = \"replace with the name of your Azure blob container\"\n",
"BLOB = \"replace with the path to a CSV file -- er, blob -- in the container\"\n",
"df = pd.read_csv(f\"abfs://{CONTAINER}/{BLOB}\")\n",
"df"
]
},
{
"cell_type": "markdown",
"id": "68638d60-594b-48ef-ae02-2d7d88749a7f",
"metadata": {},
"source": [
"So things are easy when one wants to simply access blobs they know the names of. How can one instead _list_ the contents of a container? We need some lower-level tools for that.\n",
"\n",
"The PyData community has come together to develop a large set of virtual file systems, brought under a common interface. This lives under the `fsspec` package. This package is what Pandas uses above to translate this strange `abfs://` URL to queries against Azure storage services. We will grab here the `filesystem` routine to get us an object to pry into our container. "
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "ac24b84d-fde4-4007-b8e7-279a37e13d74",
"metadata": {},
"outputs": [],
"source": [
"from fsspec import filesystem\n",
"fs = filesystem(\"abfs\") # Credentials fished out of environment variables, thanks to dotenv!"
]
},
{
"cell_type": "markdown",
"id": "6164862c-675e-4ab6-9b5d-4daf5db1c889",
"metadata": {},
"source": [
"If you look at `help(fs)`, you will see this `fs` object offers many methods that are reminiscent of familiar shell commands: `cat`, `ls`... The latter is the one to list the contents of our container. Reminder: the container's name is `optc`."
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "e7680e5e-376a-4e5b-848e-b520df9ee7cf",
"metadata": {},
"outputs": [],
"source": [
"fs.ls(CONTAINER)"
]
},
{
"cell_type": "markdown",
"id": "b8fa55c8-8369-449f-9d62-710365fc497c",
"metadata": {},
"source": [
"And onwards:"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "755787bb-1508-408e-9ed4-4c71969111d3",
"metadata": {},
"outputs": [],
"source": [
"fs.ls(f\"{CONTAINER}/{BLOB}\")"
]
},
{
"cell_type": "markdown",
"id": "4b997eed-5d70-4674-be9d-d0f7ca542e39",
"metadata": {},
"source": [
"The reader is invited to look up the other methods of `fs` and to try them out. Notably, methods `fs.get` and `fs.put` are nifty for, respectively, pulling data from and pushing data to a blob. However, mass data transfers and synchronizations are much easier using **Rclone**, and actual data-scientific computations are typically done using Pandas (or [Dask](https://dask.org/), or [Ray](https://www.ray.io/), or [Jax](https://github.com/google/jax), or whatever you like) as above. "
]
},
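{
"cell_type": "markdown",
"id": "f0b2c4d6-1a3e-4c5b-8d7f-9e0a1b2c3d4e",
"metadata": {},
"source": [
"Still, for a quick one-off transfer from Python, here is a minimal sketch of `fs.get` and `fs.put`; the local and remote paths are placeholders, and the upload assumes the SAS token grants write permission:\n",
"\n",
"```\n",
"# Download one blob to a local file...\n",
"fs.get(f\"{CONTAINER}/{BLOB}\", \"local-copy.csv\")\n",
"# ...then upload a local file as a new blob (needs write permission).\n",
"fs.put(\"local-copy.csv\", f\"{CONTAINER}/uploads/local-copy.csv\")\n",
"```"
]
},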
{
"cell_type": "markdown",
"id": "b2625246-001e-4d31-ae37-11c2861f2f36",
"metadata": {},
"source": [
"---\n",
"\n",
"# Footnotes\n",
"\n",
"1. <a name=\"footnote1\"></a>Rather, known to Rclone. This is a comfortably large set."
]
}
],
"metadata": {
"kernelspec": {
"display_name": "Python [conda env:.conda-papercrane]",
"language": "python",
"name": "conda-env-.conda-papercrane-py"
},
"language_info": {
"codemirror_mode": {
"name": "ipython",
"version": 3
},
"file_extension": ".py",
"mimetype": "text/x-python",
"name": "python",
"nbconvert_exporter": "python",
"pygments_lexer": "ipython3",
"version": "3.9.6"
}
},
"nbformat": 4,
"nbformat_minor": 5
}