@reevejd
Last active October 29, 2020 18:50
{
"cells": [
{
"cell_type": "markdown",
"metadata": {},
"source": [
"# Read and Write CSV Files in Python Directly From the Cloud"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Every data scientist I know spends a lot of time handling data that originates in CSV files. You can quickly end up with a mess of CSV files located in your Documents, Downloads, Desktop, and other random folders on your hard drive.\n",
"\n",
"I greatly simplified my workflow the moment I started organizing all my CSV files in my Cloud account. Now I always know where my files are and I can read them directly from the Cloud using JupyterLab (the new Jupyter UI) or my Python scripts.\n",
"\n",
"This notebook will teach you how to read your CSV files hosted on the Cloud in Python as well as how to write files to that same Cloud account.\n",
"\n",
"I'll use IBM Cloud Object Storage, an affordable, reliable, and secure Cloud storage solution. (Since I work at IBM, I'll also let you in on a secret of [how to get 10 Terabytes for a whole year](https://cocl.us/cos_with_python_blog), entirely for free.)\n",
"\n",
"This notebook will help you get started with IBM Cloud Object Storage and make the most of this offer. It is composed of three parts: \n",
"\n",
"##### 1. How to use IBM Cloud Object Storage to store your files; \n",
" \n",
"##### 2. Reading CSV files in Python from Object Storage; \n",
" \n",
"##### 3. Writing CSV files to Object Storage (also in Python of course). "
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## What is Object Storage and Why Should You Use It?"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"The \"Storage\" part of object storage is pretty straightforward, but what exactly is an object and why would you want to store one? An object is basically any conceivable data. It could be a text file, a song, or a picture. For the purposes of this tutorial, our objects will all be CSV files.\n",
"\n",
"Unlike a typical filesystem (like the one used by the device you’re reading this article on) where files are grouped in hierarchies of directories/folders, object storage has a flat structure. All objects are stored in groups called buckets. This structure allows for better performance, massive scalability, and cost-effectiveness.\n",
"\n",
"By the end of this notebook, you will know how to store your files on IBM Cloud Object Storage and easily access them using Python."
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Provisioning an Object Storage Instance on IBM Cloud"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"<a href=\"https://console.bluemix.net/registration/?target=/cloud-object-storage/&cm_mmc=Email_Events-_-Developer_Community-_-WW_WW-_-CognitiveClass+Blog+Object+Storage&cm_mmca1=000026UJ&cm_mmca2=10006555&cm_mmca3=M12345678&cvosrc=email.Events.M12345678&cvo_campaign=000026UJ\">Sign up or log in with your IBM Cloud account here</a> (it's free) to begin provisioning your Object Storage instance. Feel free to use the Lite plan, which is free and allows you to store up to 25 GB per month. You can customize the Service Name if you wish, or just leave it as the default. You can also leave the resource group to the default. Resource groups are useful to organize your resources on IBM Cloud, particularly when you have many of them running. When you're ready, click the **Create** button to finish provisioning your Object Storage instance."
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"<img src=\"https://ibm.box.com/shared/static/1537a5huyv2flubq7lq2xni4vg16lbsc.png\"></img>"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Working with Buckets"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Since you just created the instance, you'll now be presented with options to create a bucket. You can always find your Object Storage instance by selecting it from your [IBM Cloud Dashboard](https://console.bluemix.net/dashboard/apps)."
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"There's a limit of 100 buckets per Object Storage instance, but each bucket can hold billions of objects. In practice, how many buckets you need will be dictated by your availability and resilience needs.\n",
"\n",
"For the purposes of this tutorial, a single bucket will do just fine."
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### Creating your First Bucket"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Click the Create Bucket button and you'll be shown a window like the one below, where you can customize some details of your Bucket. All these options may seem overwhelming at the moment, but don't worry, we'll explain them in a moment. They are part of what makes this service so customizable, should you have the need later on."
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"<img src=\"https://ibm.box.com/shared/static/b5z2fes4pgt9gv7qmbgmujf5b0dyjywd.png\" width=\"500px\"></img>"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"If you don't care about the nuances of bucket configuration, you can type in any unique name you like and press the Create button, leaving all other options to their defaults. You can then skip to the **Putting Objects in Buckets** section below. If you would like to learn about what these options mean, read on."
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### Configuring your Bucket"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"#### Resiliency Options\n",
"\n",
"<table>\n",
"<tbody>\n",
"\n",
"<tr>\n",
" <td><h5>**Resiliency Option**</h5></td>\n",
" <td><h5>**Description**</h5></td>\n",
" <td><h5>**Characteristics**</h5></td>\n",
"\n",
"</tr>\n",
"<tr>\n",
" <td>Cross Region</td>\n",
" <td>Your data is stored across three geographic regions within your selected location</td>\n",
" <td>High availability and *very* high durability</td>\n",
" \n",
"</tr>\n",
"<tr>\n",
" <td>Regional</td>\n",
" <td>Your data is stored across three different data centers within a single geographic region</td>\n",
" <td>High availability and durability, very low latency for regional users</td>\n",
"\n",
"</tr>\n",
"\n",
"<tr>\n",
" <td>Single Data Center</td>\n",
" <td>Your data is stored across multiple devices within a single data center</td>\n",
" <td>Data locality</td>\n",
"\n",
"</tr> \n",
"</tbody>\n",
"</table>"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"#### Storage Class Options\n",
"\n",
"<table>\n",
"<tbody>\n",
"<tr>\n",
" <td><h5>**Frequency of Data Access**</h5></td>\n",
" <td><h5>**IBM Cloud Object Storage Class**</h5></td>\n",
"</tr>\n",
"<tr>\n",
" <td>Continual</td>\n",
" <td>Standard</td>\n",
"</tr>\n",
"<tr>\n",
" <td>Weekly or monthly</td>\n",
" <td>Vault</td>\n",
"</tr>\n",
"<tr>\n",
" <td>Less than once a month</td>\n",
" <td>Cold Vault</td>\n",
"</tr>\n",
"<tr>\n",
" <td>Unpredictable</td>\n",
" <td>Flex</td>\n",
"</tr>\n",
"</tbody>\n",
"</table>"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Feel free to experiment with different configurations, but I recommend choosing \"Standard\" for your storage class for this tutorial's purposes. Any resilience option will do."
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"After you've created your bucket, store the name of the bucket into the Python variable below (replace `cc-tutorial` with the name of your bucket)"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"bucket_name = 'cc-tutorial'"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### Creating Service Credentials"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"To access your IBM Cloud Object Storage instance from anywhere other than the web interface, you will need to create credentials. Click the **New credential** button under the **Service credentials** section to get started."
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"<img src=\"https://ibm.box.com/shared/static/yp1c9sdn9zzc6nxlmcoho9x0g32g6w51.png\"></img>"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"In the next window, select `Manager` as your role, and add `{\"HMAC\":true}` to the `Add Inline Configuration Parameters (Optional)` field. You can leave all other fields as their defaults and click the **Add** button to continue."
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"<img src=\"https://ibm.box.com/shared/static/e6sorcitftio6fs3npvrd2kbspet1vhd.png\"></img>"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"You'll now be able to click on **View credentials** to obtain the JSON object containing the credentials you just created. You'll want to store everything you see in a `credentials` variable like the one below (obviously, replace the placeholder values with your own). Take special note of your `access_key_id` and `secret_access_key` which you will need for the **Cyberduck** section below."
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"**Note: Be careful not to share this notebook after adding your credentials!**"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"credentials = {\n",
" \"apikey\": \"your-api-key\",\n",
" \"cos_hmac_keys\": {\n",
" \"access_key_id\": \"your-access-key-here\",\n",
" \"secret_access_key\": \"your-secret-access-key-here\"\n",
" },\n",
" \"endpoints\": \"your-endpoints\",\n",
" \"iam_apikey_description\": \"your-iam_apikey_description\",\n",
" \"iam_apikey_name\": \"your-iam_apikey_name\",\n",
" \"iam_role_crn\": \"your-iam_apikey_name\",\n",
" \"iam_serviceid_crn\": \"your-iam_serviceid_crn\",\n",
" \"resource_instance_id\": \"your-resource_instance_id\"\n",
"}"
]
},
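{
"cell_type": "markdown",
"metadata": {},
"source": [
"One way to avoid pasting secrets into the notebook at all is to keep the JSON from **View credentials** in a separate file and load it at runtime. The cell below is a minimal sketch of that idea rather than part of the original workflow: the `COS_CREDENTIALS_FILE` environment variable, the `cos_credentials.json` default filename, and the `load_credentials` helper are all hypothetical names chosen for illustration."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"import json\n",
"import os\n",
"\n",
"def load_credentials(path=None):\n",
"    # Hypothetical helper: reads the service credentials JSON from a file\n",
"    # kept outside the notebook. The environment variable name and the\n",
"    # default filename are assumptions made for this example.\n",
"    path = path or os.environ.get('COS_CREDENTIALS_FILE', 'cos_credentials.json')\n",
"    with open(path) as f:\n",
"        return json.load(f)\n",
"\n",
"# Uncomment once the file exists, instead of filling in the dictionary above:\n",
"# credentials = load_credentials()"
]
},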
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### Putting Objects in Buckets"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"There are many ways to add objects to your bucket, but we'll start by taking a look at two simple ways: the IBM Cloud web interface and Cyberduck."
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### IBM Cloud Web Interface"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"You can add a CSV file of your choice to your newly created bucket through the web interface by either clicking the **Add objects** button, or dragging and dropping your CSV file into the IBM Cloud window.\n",
"\n",
"If you don't have an interesting CSV file handy, I recommend downloading FiveThirtyEight's 2018 World Cup predictions."
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"<img src=\"https://ibm.box.com/shared/static/jupdp7herwx2kd4wnpy8bt8tj0zxzglc.png\"></img>"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### Cyberduck"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"<a href=\"https://cyberduck.io/\">Cyberduck</a> is a free cloud storage browser for Mac OS and Windows. It allows you to easily manage all of the files in all of your object storage instances. After <a href=\"https://cyberduck.io/\">downloading</a>, installing, and starting Cyberduck, create a new bookmark by pressing <kbd>⌘</kbd>+<kbd>Shift</kbd>+<kbd>B</kbd> on Mac OS or <kbd>Ctrl</kbd>+<kbd>Shift</kbd>+<kbd>B</kbd> on Windows. A window will pop up with some bookmark configuration options. Select the <em>Amazon S3</em> option from the dropdown and fill in the form as follows:"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"<ul>\n",
" \t<li><strong>Nickname</strong>: enter anything you like.</li>\n",
" \t<li><strong>Server</strong>: enter your service endpoint. You can choose any public endpoint <a href=\"https://console.bluemix.net/docs/services/cloud-object-storage/basics/endpoints.html#us-cross-region-endpoints\">here.</a> For your convenience, I recommend one of these:\n",
"<ul>\n",
" \t<li><span aria-labelledby=\"value\"><span class=\"objectBox objectBox-string\"><code>s3-api.us-geo.objectstorage.softlayer.net</code> (If you live in the Americas)\n",
"</span></span></li>\n",
" \t<li><span aria-labelledby=\"value\"><span class=\"objectBox objectBox-string\"><code>s3.eu-geo.objectstorage.softlayer.net</code> (if you live in Europe)\n",
"</span></span></li>\n",
" \t<li><span aria-labelledby=\"value\"><span class=\"objectBox objectBox-string\"><code>s3.ap-geo.objectstorage.softlayer.net</code> (if you live in Asia)</span></span></li>\n",
"</ul>\n",
"</li>\n",
" \t<li><strong>Access Key ID</strong>: enter the <code>access_key_id</code> you created above in the <strong>Creating Service Credentials</strong> section.</li>\n",
"</ul>"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"<img src=\"https://ibm.box.com/shared/static/vdxij53wfjaohjr1spa6dg4fravivoqx.png\"></img>"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Close the window and double-click on your newly created bookmark. You will be asked to log in. Enter the <code>secret_access_key_id</code> you created above in the <strong>Creating Service Credentials</strong> section and click Login."
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"<img src=\"https://ibm.box.com/shared/static/p30zr57pjb1epttr8he50wguft7hz0wu.png\"></img>"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"You should now see a file browser pane with the bucket you created in the <strong>Working with Buckets</strong> section. If you added a file in the previous step, you should also be able to expand your bucket to view the file. Using the action dropdown or the context menu (right-click on Windows, control-click on Mac OS)."
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"<img src=\"https://ibm.box.com/shared/static/6r0etr4fo17lsrps4dmevxhkhcg0hgj9.png\"></img>"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"You can add files to your buckets by dragging and dropping them onto this window."
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Whether you use the IBM Cloud web interface or Cyberduck, assign the name of the CSV file you upload to the variable `filename` below so that you can easily refer to it later."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"filename = 'wc_forecasts.csv'"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Reading CSV files from Object Storage with Cyberduck"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Once you have successfully accessed an object storage instance in Cyberduck using the above steps, you can download files by double-clicking them in Cyberduck's file browser. You can also generate links to your files by selecting the *Open/Copy Link URL* option."
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"<img src=\"https://ibm.box.com/shared/static/wip5uqv3k0g8to0mwdxk1zdehr9dkajz.png\"></img>"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"By default your files are not publicly accessible, so selecting a URL that is not pre-signed will not allow the file to be downloaded. Pre-signed URLs *do* allow for files to be downloaded, but the link will eventually expire. If you want a permanently available public link to one of your files, you can select the *Info* option for that file and add `READ` permissions for `Everyone` under the permissions section."
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"<img src=\"https://ibm.box.com/shared/static/4otn70xhh7ahtu322gyaejsf1q6cxsw1.png\"></img>"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"After changing this setting you can share the URL (without pre-signing) and anyone with the link will be able to download it, either by opening the link in their web browser, or by using a tool like <code>wget</code> from within your Jupyter notebook, e.g:"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"!wget https://s3-api.us-geo.objectstorage.softlayer.net/cc-tutorial/wc_forecasts.csv"
]
},
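{
"cell_type": "markdown",
"metadata": {},
"source": [
"If all you need is the data rather than a local copy, pandas can read a CSV straight from a URL. The sketch below assumes the object has been made publicly readable as described above, and it reuses the example endpoint, bucket, and file names from the `wget` cell, so substitute your own."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"import pandas as pd\n",
"\n",
"# Example URL in the form https://<endpoint>/<bucket>/<object key>\n",
"public_url = 'https://s3-api.us-geo.objectstorage.softlayer.net/cc-tutorial/wc_forecasts.csv'\n",
"\n",
"# read_csv accepts a URL as well as a local path\n",
"df = pd.read_csv(public_url)\n",
"df.head()"
]
},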
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Reading CSV files from Object Storage using Python"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"The recommended way to access IBM Cloud Object Storage with Python is to use the `ibm_boto3` library, which we'll import below."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# if you don't already have the ibm_boto3 library installed you can install it by \n",
"# uncommenting the below line before running this cell\n",
"!pip install ibm-cos-sdk\n",
"\n",
"import ibm_boto3"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"The primary way to interact with IBM Cloud Object Storage through `ibm_boto3` is by using an `ibm_boto3.resource` object. This resource-based interface abstracts away the low-level REST interface between you and your Object Storage instance.\n",
"\n",
"Run the cell below to create a `resource` Python object using the IBM Cloud Object Storage credentials you filled in above."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"from ibm_botocore.client import Config\n",
"auth_endpoint = 'https://iam.bluemix.net/oidc/token'\n",
"service_endpoint = 'https://s3-api.us-geo.objectstorage.softlayer.net'\n",
"\n",
"\n",
"resource = ibm_boto3.resource('s3',\n",
" ibm_api_key_id=credentials['apikey'],\n",
" ibm_service_instance_id=credentials['resource_instance_id'],\n",
" ibm_auth_endpoint=auth_endpoint,\n",
" config=Config(signature_version='oauth'),\n",
" endpoint_url=service_endpoint)"
]
},
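{
"cell_type": "markdown",
"metadata": {},
"source": [
"A quick way to check that the `resource` object works is to list what is in your bucket. This is a small sketch that assumes `ibm_boto3` mirrors the `boto3` resource interface here: a `Bucket` with an `objects.all()` collection whose items expose `key` and `size`."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# Print the key and size (in bytes) of every object in the bucket\n",
"for obj in resource.Bucket(bucket_name).objects.all():\n",
"    print(obj.key, obj.size)"
]
},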
{
"cell_type": "markdown",
"metadata": {},
"source": [
"After creating a `resource` object, we can easily access any of our Cloud objects by specifying a bucket name and a key (in our case the key is a filename) to our `resource.Object` method and calling the `get` method on the result. In order to get the object into a useful format, we'll do some processing to turn it into a pandas dataframe."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"import pandas as pd\n",
"import io\n",
"\n",
"obj = resource.Object(bucket_name=bucket_name, key=filename).get()\n",
"obj_bytes = obj['Body'].read() # .read() returns a byte string\n",
"obj_bytes_stream = io.BytesIO(obj_bytes) # that we convert to a stream\n",
"pd.read_csv(obj_bytes_stream) # and eventually a pandas dataframe\n",
"\n",
"# or a bit more succinctly:\n",
"obj = resource.Object(bucket_name=bucket_name, key=filename).get()\n",
"pd.read_csv(io.BytesIO(obj['Body'].read()))"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"We'll make this into a function so we can easily use it later:"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"def get_dataframe_from_obj_storage(bucket_name, key):\n",
" obj = resource.Object(bucket_name=bucket_name, key=key).get()\n",
" return pd.read_csv(io.BytesIO(obj['Body'].read()))"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### Adding files to IBM Cloud Object Storage with Python"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"IBM Cloud Object Storage's web interface makes it easy to add new objects to your buckets, but at some point you will probably want to handle creating objects through Python programmatically. The `put_object` method allows you to do this. In order to use it you will need:\n",
"\n",
"1. the name of the bucket you want to add the object to\n",
"2. a unique name (Key) for the new object\n",
"3. a [bytes-like object](https://docs.python.org/3.3/glossary.html#term-bytes-like-object), which you can get from: \n",
"\n",
" - `urllib`'s `request.urlopen(...).read()` method, e.g. \n",
" `urllib.request.urlopen('https://example.com/my-csv-file.csv').read()`\n",
" - Python's built-in `open` method in binary mode, e.g. \n",
" `open('myfile.csv', 'rb')`\n"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"To demonstrate, let's add another CSV file to our bucket. This time we'll use [FiveThirtyEight's airline safety dataset](https://raw.githubusercontent.com/fivethirtyeight/data/master/airline-safety/airline-safety.csv)."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"import urllib.request\n",
"\n",
"csv_url = 'https://raw.githubusercontent.com/fivethirtyeight/data/master/airline-safety/airline-safety.csv'\n",
"csv_name = 'airline-safety'\n",
"\n",
"resource.Bucket(name=bucket_name).put_object(Key=csv_name, Body=urllib.request.urlopen(csv_url).read())"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"You can now easily access your newly created object using the function we defined above in the **Reading from Object Storage using Python** section."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"get_dataframe_from_obj_storage(bucket_name, csv_name)"
]
},
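{
"cell_type": "markdown",
"metadata": {},
"source": [
"The same `put_object` call can store a pandas dataframe as a CSV file. The cell below is a sketch under a few assumptions: `put_dataframe_to_obj_storage` is a hypothetical helper name, `airline-safety-sample.csv` is an example key, the dataframe is serialized in memory as UTF-8, and the index is left out of the CSV."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"def put_dataframe_to_obj_storage(df, bucket_name, key):\n",
"    # to_csv() with no path returns the CSV as a string; encode it to bytes for upload\n",
"    csv_bytes = df.to_csv(index=False).encode('utf-8')\n",
"    return resource.Bucket(name=bucket_name).put_object(Key=key, Body=csv_bytes)\n",
"\n",
"# Example: read a file, keep the first ten rows, and write the result back\n",
"df = get_dataframe_from_obj_storage(bucket_name, csv_name)\n",
"put_dataframe_to_obj_storage(df.head(10), bucket_name, 'airline-safety-sample.csv')"
]
},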
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### Conclusion"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"You now know how to read from and write to IBM Cloud Object Storage using Python! Well done. The ability to pragmatically read and write files to the Cloud will be quite handy when working from scripts and Jupyter notebooks.\n",
"\n",
"If you build applications or do data science, we also have a great offer for you. You can apply to become an IBM Partner at no cost to you and receive 10 Terabytes of space to play and build applications with.\n",
"\n",
"Make sure that you apply with a business email (even your own domain name if you are a freelancer) as free email accounts like Gmail, Hotmail, and Yahoo are automatically rejected.\n",
"\n",
"[Become an IBM Partner and get 10TB of IBM Cloud Object Storage for free!](https://cocl.us/cos_with_python_blog)"
]
}
],
"metadata": {
"kernelspec": {
"display_name": "Python 3",
"language": "python",
"name": "python3"
},
"language_info": {
"codemirror_mode": {
"name": "ipython",
"version": 3
},
"file_extension": ".py",
"mimetype": "text/x-python",
"name": "python",
"nbconvert_exporter": "python",
"pygments_lexer": "ipython3",
"version": "3.6.4"
}
},
"nbformat": 4,
"nbformat_minor": 2
}