Skip to content

Instantly share code, notes, and snippets.

@aparrish
Last active July 18, 2023 20:00
Show Gist options
  • Save aparrish/f8e7eab47542678a39a39dddbca4ec2f to your computer and use it in GitHub Desktop.
Save aparrish/f8e7eab47542678a39a39dddbca4ec2f to your computer and use it in GitHub Desktop.
Quick intro to CSVs. Code examples released under CC0 https://creativecommons.org/choose/zero/, other text released under CC BY 4.0 https://creativecommons.org/licenses/by/4.0/
Display the source blob
Display the rendered blob
Raw
{
"cells": [
{
"cell_type": "markdown",
"metadata": {},
"source": [
"# Working with CSV files\n",
"\n",
"By [Allison Parrish](http://www.decontextualize.com/)\n",
"\n",
"Humans have represented data in the form of tables (i.e., organized in rows and columns) for [thousands of years](http://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.565.2159&rep=rep1&type=pdf). In a contemporary computational context, the tool we use for working with tabular data is called a [spreadsheet](https://en.wikipedia.org/wiki/Spreadsheet). Starting with VisiCalc and Lotus-1-2-3 in the late 1970s and early 1980s, spreadsheet software has consistently been among the best-selling software on personal computers. Today, most computer users are familiar with spreadsheet software like Excel or Google Sheets and many use them daily in their day-to-day work, regardless of their familiarity with math, statistics or computer programming.\n",
"\n",
"It makes sense to want to be able to take work that we do in Python and import it into our spreadsheet software, or take data from spreadsheets and work with it in Python. But there's no one obvious way of representing tabular data in computer-readable format, and spreadsheet software from different vendors use different formats internally that aren't necessarily interoperable. (If you save a spreadsheet in Excel, you might not be able to open that spreadsheet in, say, Apple's Numbers software.) What we need is a common, easy-to-use way to format tabular data so we can move it between tools without a lot of trouble.\n",
"\n",
"## Comma-separated values\n",
"\n",
"\"CSV,\" short for \"comma-separated values,\" is just such a format. A file in CSV format represents tabular data as a series of lines in a plain text file, in which values for each column of the table are separated by commas. Take a look at [this table in Google Sheets](https://docs.google.com/spreadsheets/d/16JhXTxkKmKmsCpAAy0e54Ky8zHspRq6bgaUv8bnqfX8/pubhtml?gid=0&single=true) of the top ten lakes by area in the United States ([source data here](https://en.wikipedia.org/wiki/List_of_largest_lakes_of_the_United_States_by_area)). In CSV format, it looks like this:\n",
"\n",
" Name,States/Provinces,Area (sq mi)\n",
" Lake Superior,Michigan-Minnesota-Wisconsin-Ontario,31700\n",
" Lake Michigan,Illinois-Indiana-Michigan-Wisconsin,22300\n",
" Lake Huron,Michigan-Ontario,22300\n",
" Lake Erie,Michigan-New York-Ohio-Ontario-Pennsylvania,9910\n",
" Lake Ontario,New York-Ontario,7340\n",
" Great Salt Lake (salt),Utah,2117\n",
" Lake of the Woods,Manitoba-Minnesota-Ontario,1679\n",
" Iliamna Lake,Alaska,1014\n",
" Lake Oahe (man-made),North Dakota-South Dakota,685\n",
" Lake Okeechobee,Florida,662\n",
" \n",
"This may look incomprehensible at first, but look carefully and you can see the structure. Each row is on a separate line, and each line has the contents of every cell, separated by commas. (The first line has the names of the columns themselves. This row is called the *header*, and may or may not be present in a given CSV file.)\n",
"\n",
"Sometimes the character that separates the cells on each row is something other than a comma. A few alternative you're likely to see are the pipe character (`|`), a colon (`:`) or a tab. But even if a different delimiter other than a comma is used, such files are still called `CSV` files. (Although sometimes files with tab-separated values are called `TSV` files.)\n",
"\n",
"### Exporting CSV files\n",
"\n",
"Most spreadsheet software packages support CSV files in one form or another. Google Sheets, for example, lets you export a spreadsheet as CSV using the \"Download As...\" item in the File menu. In Excel, CSV is one of the supported formats in the \"Save As...\" dialog box. It's important to remember, however, that a CSV file isn't a perfect replica of the spreadsheet that you're exporting! When you save a spreadsheet as CSV, you lose all formatting (like fonts, colors, cell sizes). You also lose any charts or images you may have added to the spreadsheet, along with formulas, etc.\n",
"\n",
"The most important (and vexing) thing to remember about CSV files is that the values in the table *don't have types*. A CSV file doesn't distinguish between numbers, currency amounts, dates, etc. That means that when you're reading a CSV file into Python, or importing a CSV file from Python into Excel, you'll need to find a way to recover the data types for each column. (We'll see an example of this below.)\n",
"\n",
"## From Counter to spreadsheet\n",
"\n",
"Our first task is going to be writing a Python program that writes out a CSV file, which we can then import into a spreadsheet program to draw a pretty graph. Here's a program that uses [TextBlob](https://textblob.readthedocs.io/en/dev/) to count how many times each part of speech occurs in a text. (Make sure that you have a copy of [this file](http://rwet.decontextualize.com/texts/genesis.txt) in the same directory as your Python script.) It takes this data and then produces a CSV file with each part of speech and its count.\n",
"\n",
"(If you're not familiar with the `Counter` object, [return to this tutorial for a refresher](https://gist.github.com/aparrish/4b096b95bfbd636733b7b9f2636b8cf4). If you're not familiar with TextBlob, [I wrote a tutorial about it here](http://rwet.decontextualize.com/book/textblob/).)"
]
},
{
"cell_type": "code",
"execution_count": 3,
"metadata": {
"collapsed": false
},
"outputs": [],
"source": [
"import csv\n",
"from collections import Counter\n",
"from textblob import TextBlob\n",
"\n",
"# read in the contents of genesis.txt\n",
"text = open(\"genesis.txt\").read()\n",
"blob = TextBlob(text)\n",
"\n",
"all_pos = list()\n",
"for word, pos in blob.tags:\n",
" all_pos.append(pos)\n",
" \n",
"pos_count = Counter(all_pos)\n",
"\n",
"# open the file \"genesis_pos.csv\" for writing...\n",
"with open(\"genesis_pos.csv\", \"w\") as csvfile:\n",
" # create a csv \"writer\" object\n",
" writer = csv.writer(csvfile)\n",
" # write the header row\n",
" writer.writerow([\"part of speech\", \"count\"])\n",
" # write out each pair as a line in the CSV\n",
" for item, count in pos_count.most_common():\n",
" writer.writerow([item, count])"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"The `with` clause may look new to you! We won't go over the technical details here ([though you can read a good introduction here](http://effbot.org/zone/python-with-statement.htm)). Essentially, the line starting with the `with` keyword creates a new file object with the filename `genesis_pos.csv` and assigns it to a variable called `csvfile`. \n",
"\n",
"In the indented section beneath, we create a special kind of Python value called a \"CSV writer\" (from the [`csv` module](https://docs.python.org/2/library/csv.html), passing it the file object as a parameter. The writer object has one method of importance to us, `.writerow()`, which takes a list of values. The writer object takes the values passed to `.writerow()`, formats them by converting them to strings and inserting commas between them, and writes them to the file opened in the initial line of the `with` clause. The first call to `.writerow()` writes the *header row*, which our spreadsheet software will interpret as column names. Inside a `for` loop, we visit each key/value pair from the `pos_count` Counter object and write it to the file.\n",
"\n",
"The resulting file (`genesis_pos.csv`) looks like this:\n",
"\n",
"<pre>\n",
"part of speech,count\n",
"NN,158\n",
"DT,124\n",
"CC,97\n",
"IN,97\n",
"VBD,46\n",
"VB,42\n",
"PRP,38\n",
"NNP,38\n",
"JJ,25\n",
"VBN,24\n",
"NNS,24\n",
"RB,18\n",
"VBG,14\n",
"PRP$,13\n",
"TO,11\n",
"WDT,5\n",
"VBZ,5\n",
"EX,5\n",
"VBP,4\n",
"CD,2\n",
"WP$,2\n",
"MD,2\n",
"JJR,2\n",
"WRB,1\n",
"</pre>\n",
"\n",
"You can now import this CSV file into the spreadsheet program of your choice! [I made this spreadsheet and graph](https://docs.google.com/spreadsheets/d/1HtEnMb0JMFWCSo-nD-53Mv94gdXM4lX9LcB4VkiWNIU/edit?usp=sharing) by loading the CSV into Google Sheets."
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## From spreadsheet to Python\n",
"\n",
"t's true that your average spreadsheet program is good enough for basic number crunching tasks, bringing your spreadsheet data into Python opens up a world of computational possibilities! We're going to use [this data about dog names in NYC](https://project.wnyc.org/dogs-of-nyc/), which I've compiled in a Google Sheets table [here](https://docs.google.com/spreadsheets/d/1SmxsgSAcNqYahUXa9-XpecTg-fhy_Ko_-RMD3oT2Ukw/edit?usp=sharing).\n",
"\n",
"To export this data as CSV, select `File > Download as... > Comma-separated values`. Copy it into the directory where your Python scripts are running. (I renamed my copy to `dogs-of-nyc.csv`, just to make it easier to type.) The first few lines of the file look like this:\n",
" \n",
" dog_name,gender,breed,birth,dominant_color,secondary_color,third_color,spayed_or_neutered,guard_or_trained,borough,zip_code\n",
" Buddy,M,Afghan Hound,Jan-00,BRINDLE,BLACK,n/a,Yes,No,Manhattan,10003\n",
" Nicole,F,Afghan Hound,Jul-00,BLACK,n/a,n/a,Yes,No,Manhattan,10021\n",
" Abby,F,Afghan Hound,Nov-00,BLACK,TAN,n/a,Yes,No,Manhattan,10034\n",
" Chloe,F,Afghan Hound,1/2/2017,WHITE,BLOND,n/a,Yes,No,Manhattan,10024\n",
" Jazzle,F,Afghan Hound,10/2/2017,BLOND,WHITE,BLACK,Yes,No,Manhattan,10022\n",
" Trouble,M,Afghan Hound,1/3/2017,BLOND,WHITE,BLACK,Yes,No,Bronx,10472\n",
" Grace,F,Afghan Hound,6/3/2017,CREAM,n/a,n/a,Yes,No,Manhattan,10021\n",
" Sisu,M,Afghan Hound,10/4/2017,BLACK,WHITE,GRAY,No,No,Manhattan,10023\n",
" \n",
"Again, if you're not used to reading CSV files, this looks like a pretty big mess at first. But there's a method to the mess! The top row is a header row that describes the data in the table's columns. The rest of the rows are data, with each cell from the row in a separate column.\n",
"\n",
"Here's a program that reads in this data and then prints out five random dog names:"
]
},
{
"cell_type": "code",
"execution_count": 11,
"metadata": {
"collapsed": false
},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"Happy, Chiclet, Tucker, Lucky, Charlie\n"
]
}
],
"source": [
"import csv\n",
"import random\n",
"\n",
"all_names = list()\n",
"\n",
"for row in csv.DictReader(open(\"dogs-of-nyc.csv\")):\n",
" all_names.append(row['dog_name'])\n",
" \n",
"print \", \".join(random.sample(all_names, 5))"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"This program is a little tricky! Here's how it works: The call to `csv.DictReader(open(\"dogs-of-nyc.csv\"))` returns a list-like object of *rows* from the CSV with the given filename. Inside the loop, the variable `row` is a dictionary that has a key for every column named in the header row of the CSV file. (So, if there's a column in the header row called `borough`, you can access the value for that column in the current row using the expression `row['borough']`.) This program simply loops through every row in the table and makes a list of every name, then prints out five random names at the end."
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"As another quick example, here's a program that prints out the most common dog colors:"
]
},
{
"cell_type": "code",
"execution_count": 12,
"metadata": {
"collapsed": false
},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"BLACK 23578\n",
"WHITE 18621\n",
"BROWN 9181\n",
"TAN 8942\n",
"BLOND 4241\n",
"GRAY 2777\n",
"BRINDLE 2627\n",
"RUST 2263\n",
"BLUE 1378\n",
"FAWN 1367\n",
"RED 1338\n",
"CREAM 1186\n",
"GOLD 1015\n",
"n/a 771\n",
"ORANGE 705\n",
"CHOCOLATE 538\n",
"APRICOT 468\n",
"SILVER 353\n",
"BLUE MERLE 103\n",
"CHARCOAL 90\n"
]
}
],
"source": [
"import csv\n",
"from collections import Counter\n",
"\n",
"all_colors = list()\n",
"\n",
"for row in csv.DictReader(open(\"dogs-of-nyc.csv\")):\n",
" all_colors.append(row['dominant_color'])\n",
" \n",
"color_count = Counter(all_colors)\n",
"\n",
"for item, count in color_count.most_common():\n",
" print item, count"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"And here's a program that generates a few random dogs and prints a little sentence about each:"
]
},
{
"cell_type": "code",
"execution_count": 33,
"metadata": {
"collapsed": false
},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"Hi! My name is Hootie. My coat is brown and I'm a loyal pup. I live in Queens. Arf!\n",
"Hi! My name is Pepe. My coat is white and I'm a friendly Bichon Frise. I live in Brooklyn. Woof!\n",
"Hi! My name is KIBBE. My coat is brown and I'm a clever Tibetan Terrier. I live in Brooklyn. Bow-wow!\n",
"Hi! My name is GRINYA. My coat is brown and I'm a clumsy Maltese. I live in the Bronx. Woof!\n",
"Hi! My name is Duke. My coat is white and I'm a happy Beagle. I live in Manhattan. Ruff-ruff!\n",
"Hi! My name is Sunny. My coat is brown and I'm a fluffy Maltese. I live in Brooklyn. Yip!\n",
"Hi! My name is rootie. My coat is brown and I'm a friendly American Eskimo Dog. I live in Brooklyn. Ruff-ruff!\n",
"Hi! My name is Mufasa. My coat is silver and I'm a fluffy pup. I live in the Bronx. Bow-wow!\n",
"Hi! My name is SOPHY. My coat is tan and I'm a loyal dog. I live in Queens. Bow-wow!\n",
"Hi! My name is Bumble. My coat is blond and I'm a kinda smelly pup. I live in Staten Island. Ruff-ruff!\n",
"Hi! My name is Herschel. My coat is white and I'm a friendly Cocker Spaniel. I live in Queens. Woof!\n",
"Hi! My name is Luna. My coat is black and I'm a friendly Chihuahua. I live in Brooklyn. Yip!\n",
"Hi! My name is Rocky. My coat is white and I'm a loyal Jack Russell Terrier. I live in Brooklyn. Ruff-ruff!\n",
"Hi! My name is Daisy. My coat is cream and I'm a clever Shih Tzu. I live in Brooklyn. Ruff-ruff!\n",
"Hi! My name is Foxy. My coat is gray and I'm a clever doggy. I live in Manhattan. Arf!\n",
"Hi! My name is Spunky. My coat is gray and I'm a clever Bichon Frise. I live in Brooklyn. Ruff-ruff!\n",
"Hi! My name is Zeke. My coat is red and I'm a playful Coonhound, Treeing Walker. I live in Manhattan. Arf!\n",
"Hi! My name is Jasper. My coat is blond and I'm a kinda smelly Shih Tzu. I live in Manhattan. Ruff-ruff!\n",
"Hi! My name is Brownie. My coat is tan and I'm a friendly dog. I live in Brooklyn. Arf!\n",
"Hi! My name is Coco. My coat is black and I'm a loyal doggy. I live in Manhattan. Yip!\n",
"Hi! My name is Denim. My coat is white and I'm a loyal Labrador Retriever Crossbreed. I live in Manhattan. Yip!\n",
"Hi! My name is Artemis. My coat is white and I'm a playful Bloodhound. I live in Brooklyn. Ruff-ruff!\n",
"Hi! My name is Beethoven. My coat is tan and I'm a kinda smelly Poodle, Standard. I live in the Bronx. Yip!\n",
"Hi! My name is Ziggy. My coat is blond and I'm a clever Poodle, Miniature. I live in Brooklyn. Woof!\n",
"Hi! My name is Pebbles. My coat is white and I'm a happy Yorkshire Terrier. I live in Staten Island. Arf!\n",
"Hi! My name is Globis. My coat is brown and I'm a loyal Beagle Crossbreed. I live in Brooklyn. Bow-wow!\n",
"Hi! My name is Maxi. My coat is brown and I'm a clever pup. I live in Brooklyn. Yip!\n",
"Hi! My name is Leo. My coat is tan and I'm a clever doggo. I live in Manhattan. Bow-wow!\n",
"Hi! My name is Rudy. My coat is gray and I'm a kinda smelly Maltese. I live in Manhattan. Ruff-ruff!\n",
"Hi! My name is Gizmo. My coat is rust and I'm a loyal Brittany Spaniel. I live in Manhattan. Ruff-ruff!\n",
"Hi! My name is Prada. My coat is brown and I'm a kinda smelly doggy. I live in Staten Island. Bow-wow!\n",
"Hi! My name is Princess. My coat is white and I'm a loyal doggy. I live in Brooklyn. Bow-wow!\n",
"Hi! My name is Hershey. My coat is fawn and I'm a clever Jack Russell Terrier. I live in Manhattan. Ruff-ruff!\n",
"Hi! My name is Rizzo. My coat is brown and I'm a loyal American Pit Bull Mix / Pit Bull Mix. I live in Staten Island. Ruff-ruff!\n",
"Hi! My name is Roxy. My coat is white and I'm a clever American Pit Bull Terrier/Pit Bull. I live in Queens. Yip!\n",
"Hi! My name is Einstein. My coat is gray and I'm a clever Collie Crossbreed. I live in Staten Island. Woof!\n",
"Hi! My name is Oliver. My coat is black and I'm a playful doggo. I live in Manhattan. Yip!\n",
"Hi! My name is Gizmo. My coat is fawn and I'm a clever German Shepherd Crossbreed. I live in Queens. Ruff-ruff!\n",
"Hi! My name is Teddy. My coat is brown and I'm a clever doggy. I live in Brooklyn. Woof!\n",
"Hi! My name is Mya. My coat is black and I'm a clumsy Labrador Retriever Crossbreed. I live in Manhattan. Woof!\n",
"Hi! My name is Lucy. My coat is brown and I'm a clever Greyhound. I live in Queens. Yip!\n",
"Hi! My name is DEBOW. My coat is black and I'm a loyal doggy. I live in Manhattan. Arf!\n",
"Hi! My name is Chaco. My coat is red and I'm a loyal doggo. I live in Brooklyn. Bow-wow!\n",
"Hi! My name is Lexi. My coat is brown and I'm a fluffy Brittany Spaniel. I live in Staten Island. Yip!\n",
"Hi! My name is Priscilla. My coat is brown and I'm a clumsy Pug. I live in Brooklyn. Yip!\n",
"Hi! My name is Grace. My coat is black and I'm a playful Yorkshire Terrier. I live in Manhattan. Arf!\n",
"Hi! My name is Barney. My coat is black and I'm a kinda smelly Yorkshire Terrier. I live in Manhattan. Bow-wow!\n",
"Hi! My name is Marcus. My coat is brown and I'm a clumsy doggy. I live in the Bronx. Arf!\n",
"Hi! My name is Nickols. My coat is tan and I'm a happy Bull Dog, English. I live in Manhattan. Arf!\n",
"Hi! My name is Bentley. My coat is white and I'm a clever Boxer. I live in the Bronx. Woof!\n"
]
}
],
"source": [
"import csv\n",
"\n",
"all_names = list()\n",
"all_colors = list()\n",
"all_breeds = list()\n",
"all_boroughs = list()\n",
"\n",
"for row in csv.DictReader(open(\"dogs-of-nyc.csv\")):\n",
" if row['dog_name'] != 'n/a':\n",
" all_names.append(row['dog_name'])\n",
" all_colors.append(row['dominant_color'].lower())\n",
" if row['secondary_color'] != 'n/a':\n",
" all_colors.append(row['secondary_color'].lower())\n",
" if row['third_color'] != 'n/a':\n",
" all_colors.append(row['third_color'].lower())\n",
" all_breeds.append(row['breed'].title())\n",
" all_boroughs.append(row['borough'].title())\n",
" \n",
"for i in range(50):\n",
" name = random.choice(all_names)\n",
" color = random.choice(all_colors)\n",
" breed = random.choice(all_breeds)\n",
" borough = random.choice(all_boroughs)\n",
" if borough == \"Bronx\":\n",
" borough = \"the Bronx\"\n",
" if breed == \"Mixed/Other\":\n",
" breed = random.choice([\"pup\", \"puppy\", \"dog\", \"doggo\", \"doggy\"])\n",
" adj = random.choice([\"happy\", \"playful\", \"loyal\", \"fluffy\", \"friendly\", \"clever\", \"clumsy\", \"kinda smelly\"])\n",
" breedtxt = \"I'm a \" + adj + \" \" + breed\n",
" bark = random.choice([\"Woof\", \"Arf\", \"Bow-wow\", \"Yip\", \"Ruff-ruff\"])\n",
" print \"Hi! My name is \" + name + \". My coat is \" + color + \" and \" + breedtxt + \". I live in \" + borough + \". \" + bark + \"!\""
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Further reading and resources\n",
"\n",
"* [Chapter 14](https://automatetheboringstuff.com/chapter14/) from [Automate the Boring Stuff with Python](https://automatetheboringstuff.com/) is a great overview of the basics of working with CSV files in Python.\n",
"* Hungry for more data? You're in luck! There's *tons* of data out there in CSV format just waiting for you to play around with it. Try [this list of awesome public data sets](https://github.com/caesar0301/awesome-public-datasets) or [the structured archive](https://docs.google.com/spreadsheets/d/1wZhPLMCHKJvwOkP4juclhjFgqIY8fQFMemwKL2c64vk/edit) from the [Data Is Plural newsletter](https://tinyletter.com/data-is-plural).\n",
"* CSV files are so useful and popular that there's a whole [conference named after them](https://csvconf.com/)!"
]
}
],
"metadata": {
"kernelspec": {
"display_name": "Python 2",
"language": "python",
"name": "python2"
},
"language_info": {
"codemirror_mode": {
"name": "ipython",
"version": 2
},
"file_extension": ".py",
"mimetype": "text/x-python",
"name": "python",
"nbconvert_exporter": "python",
"pygments_lexer": "ipython2",
"version": "2.7.6"
}
},
"nbformat": 4,
"nbformat_minor": 0
}
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment