@asaini
Last active December 21, 2015 06:28
{
"metadata": {
"name": "readWrite"
},
"nbformat": 3,
"nbformat_minor": 0,
"worksheets": [
{
"cells": [
{
"cell_type": "heading",
"level": 1,
"metadata": {},
"source": "Reading from Files and Writing to Files"
},
{
"cell_type": "markdown",
"metadata": {},
"source": "<p>Reading from and writing to files is something that every developer does day to day. In this tutorial we will use Python to read input data from a file, perform some operations on that data, and finally write the results to another file. Input files can be .txt files, .csv files, binary data files, and so on. A simple text file can store numbers, words, sentences, etc. Here is an example of a text file: Herman Melville's <a href=\"http://www.gutenberg.org/cache/epub/2701/pg2701.txt\">Moby Dick</a>.</p>"
},
{
"cell_type": "markdown",
"metadata": {},
"source": "<h2>Reading from Files</h2>"
},
{
"cell_type": "markdown",
"metadata": {},
"source": "Let's download the text file from <a href=\"http://www.gutenberg.org/ebooks/2701\">Project Gutenberg</a>, where it is freely available. Right-click the link that says <b>Plain Text UTF-8</b> and save the file in a folder of your choice. By default the file is saved as <em>pg2701.txt</em>."
},
{
"cell_type": "markdown",
"metadata": {},
"source": "Now let's start up our Python Interpreter...<br>"
},
{
"cell_type": "raw",
"metadata": {},
"source": "$ python"
},
{
"cell_type": "markdown",
"metadata": {},
"source": "<br>To open our text file, we need the full path to where the file is stored. In my case, I've stored the text file in the folder <em>/home/user/absaini/scripted/</em>. \nWe will use Python's <a href=\"http://docs.python.org/2/tutorial/inputoutput.html#reading-and-writing-files\">open()</a> function to tell Python where the file is located."
},
{
"cell_type": "code",
"collapsed": false,
"input": "text_file = open('/home/user/absaini/scripted/pg2701.txt','r')",
"language": "python",
"metadata": {},
"outputs": [],
"prompt_number": 1
},
{
"cell_type": "markdown",
"metadata": {},
"source": "In the above line, the first argument tells the Python interpreter where our file is located, and the second argument, <b>r</b>, specifies the <em>mode</em> in which to open it. <em>Read Mode</em> means that we are only going to read the contents of this text file, not write anything to it. (Later we will use <em>Write Mode</em>, which allows us to write to a file.)<br><br>To check which <em>mode</em> the file has been opened in, simply type:"
},
{
"cell_type": "code",
"collapsed": false,
"input": "print text_file",
"language": "python",
"metadata": {},
"outputs": [
{
"output_type": "stream",
"stream": "stdout",
"text": "<open file '/home/user/absaini/scripted/pg2701.txt', mode 'r' at 0x2ed9c90>\n"
}
],
"prompt_number": 2
},
{
"cell_type": "markdown",
"metadata": {},
"source": "Using the <code><a href=\"http://docs.python.org/2/tutorial/inputoutput.html#methods-of-file-objects\">read()</a></code> method we can read the entire contents of our text file and store the result in a variable called, say, <code>book_text</code>. The contents of the text file are stored as a sequence of characters, which is called a <em>string</em>."
},
{
"cell_type": "code",
"collapsed": false,
"input": "book_text = text_file.read()",
"language": "python",
"metadata": {},
"outputs": [],
"prompt_number": 4
},
{
"cell_type": "markdown",
"metadata": {},
"source": "<h3>What's inside the file?</h3>"
},
{
"cell_type": "markdown",
"metadata": {},
"source": "We want to explore what's inside this text file and run a few simple operations on Moby Dick's text. Let's start by looking at the first few characters of the text we just read. We print the first 500 characters in the text file. "
},
{
"cell_type": "code",
"collapsed": false,
"input": "print book_text[:500]",
"language": "python",
"metadata": {},
"outputs": [
{
"output_type": "stream",
"stream": "stdout",
"text": "\ufeffThe Project Gutenberg EBook of Moby Dick; or The Whale, by Herman Melville\r\n\r\nThis eBook is for the use of anyone anywhere at no cost and with\r\nalmost no restrictions whatsoever. You may copy it, give it away or\r\nre-use it under the terms of the Project Gutenberg License included\r\nwith this eBook or online at www.gutenberg.org\r\n\r\n\r\nTitle: Moby Dick; or The Whale\r\n\r\nAuthor: Herman Melville\r\n\r\nLast Updated: January 3, 2009\r\nPosting Date: December 25, 2008 [EBook #2701]\r\nRelease Date: June, 200\n"
}
],
"prompt_number": 5
},
{
"cell_type": "markdown",
"metadata": {},
"source": "Those are indeed the first 500 characters of 'pg2701.txt'. In the above command we used Python's <em>slicing</em> syntax to get the first 500 characters of the text. Next, let's count how many characters the file contains. This count covers everything in the file: letters, spaces, punctuation, line breaks and the Project Gutenberg header. To do this, we use Python's built-in <a href=\"http://docs.python.org/2/library/functions.html#len\">len()</a> function and print the result. "
},
{
"cell_type": "code",
"collapsed": false,
"input": "print len(book_text)",
"language": "python",
"metadata": {},
"outputs": [
{
"output_type": "stream",
"stream": "stdout",
"text": "1257260\n"
}
],
"prompt_number": 6
},
{
"cell_type": "markdown",
"metadata": {},
"source": "That was quite easy. Next, let's find out how many words there are in the entire book. To do this, we count each run of characters between spaces as one word. For example, in the above text we consider <em>The</em>, <em>Project</em> and <em>Gutenberg</em> to be words, separated by 'space' characters. \n\nHow can we calculate the number of words in a sentence? Let us take a line of characters, say, <em>This eBook is for the use of anyone anywhere at no cost and with\nalmost no restrictions whatsoever</em>. We will store this line as a string in a variable called <code>line</code>."
},
{
"cell_type": "code",
"collapsed": false,
"input": "line = 'This eBook is for the use of anyone anywhere at no cost and with almost no restrictions whatsoever'",
"language": "python",
"metadata": {},
"outputs": [],
"prompt_number": 7
},
{
"cell_type": "markdown",
"metadata": {},
"source": "In Python, a simple way of finding out the 'words' in a string is by using the <code><a href=\"http://docs.python.org/2/library/stdtypes.html#str.split\">split()</a></code> function. If we call <code>split()</code> on our <code>line</code> variable, we have"
},
{
"cell_type": "code",
"collapsed": false,
"input": "words_in_line = line.split()\nprint words_in_line",
"language": "python",
"metadata": {},
"outputs": [
{
"output_type": "stream",
"stream": "stdout",
"text": "['This', 'eBook', 'is', 'for', 'the', 'use', 'of', 'anyone', 'anywhere', 'at', 'no', 'cost', 'and', 'with', 'almost', 'no', 'restrictions', 'whatsoever']\n"
}
],
"prompt_number": 8
},
{
"cell_type": "markdown",
"metadata": {},
"source": "Called without arguments, the <code>split()</code> method splits our line on whitespace (spaces, tabs and line breaks), treating consecutive whitespace characters as a single separator. We can now use the same approach to find the words in Moby Dick, by calling <code>split()</code> on our <code>book_text</code> variable, which contains the entire text of the book."
},
{
"cell_type": "code",
"collapsed": false,
"input": "words = book_text.split()",
"language": "python",
"metadata": {},
"outputs": [],
"prompt_number": 9
},
{
"cell_type": "markdown",
"metadata": {},
"source": "Let's print the first 10 words in this list. We specify the slice in the same way as we did when printing the first 500 characters above. <br>\n(*Ignore the extra characters at the start of the first word; they are the UTF-8 byte order mark stored at the beginning of the file.*)"
},
{
"cell_type": "code",
"collapsed": false,
"input": "print words[:10]",
"language": "python",
"metadata": {},
"outputs": [
{
"output_type": "stream",
"stream": "stdout",
"text": "['\\xef\\xbb\\xbfThe', 'Project', 'Gutenberg', 'EBook', 'of', 'Moby', 'Dick;', 'or', 'The', 'Whale,']\n"
}
],
"prompt_number": 10
},
{
"cell_type": "code",
"collapsed": false,
"input": "print len(words)",
"language": "python",
"metadata": {},
"outputs": [
{
"output_type": "stream",
"stream": "stdout",
"text": "215133\n"
}
],
"prompt_number": 11
},
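{
"cell_type": "markdown",
"metadata": {},
"source": "The extra bytes at the start of the first word printed above are the UTF-8 byte order mark. As a minimal sketch of one way to avoid them, the file can be read through <code>io.open()</code> with the <code>utf-8-sig</code> encoding, which decodes the text and drops a leading byte order mark if one is present. The names <code>decoded_file</code> and <code>decoded_text</code> are introduced here only for illustration and are not used in the rest of the tutorial."
},
{
"cell_type": "code",
"collapsed": false,
"input": "import io\n\n# 'utf-8-sig' decodes UTF-8 and drops a leading byte order mark if present\ndecoded_file = io.open('/home/user/absaini/scripted/pg2701.txt', 'r', encoding='utf-8-sig')\ndecoded_text = decoded_file.read()\ndecoded_file.close()\n\n# The first word should no longer carry the extra leading bytes\nprint(decoded_text.split()[:3])",
"language": "python",
"metadata": {},
"outputs": []
},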
{
"cell_type": "markdown",
"metadata": {},
"source": "Let us now compute the Average Number of Characters in a Word,<br><br>\n<b>Average Number of Characters in a Word = (Total Number of Characters)/(Total Number of words)</b><br><br>\nNotice that both operands are integers, so Python 2 performs integer division and discards the fractional part of the result. Even without the fractional part, our answer is pretty close to the <a href=\"http://www.wolframalpha.com/input/?i=average+english+word+length\">Average Word Length for English</a>."
},
{
"cell_type": "code",
"collapsed": false,
"input": "avg_num_characters = len(book_text)/len(words)\nprint avg_num_characters",
"language": "python",
"metadata": {},
"outputs": [
{
"output_type": "stream",
"stream": "stdout",
"text": "5\n"
}
],
"prompt_number": 14
},
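{
"cell_type": "markdown",
"metadata": {},
"source": "The integer result above hides the fractional part of the average. A minimal sketch of one way to keep it, assuming the Python 2 interpreter used in this tutorial, is to convert one operand to a float before dividing. The name <code>avg_num_characters_exact</code> is introduced here only for illustration."
},
{
"cell_type": "code",
"collapsed": false,
"input": "# Converting one operand to float makes / keep the fractional part\navg_num_characters_exact = float(len(book_text)) / len(words)\nprint(avg_num_characters_exact)",
"language": "python",
"metadata": {},
"outputs": []
},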
{
"cell_type": "markdown",
"metadata": {},
"source": "Similarly, we can compute the Number of Sentences and the Average Sentence Length. Earlier, we used the <code>split()</code> method to find the words in a line. We can use the same method to find the sentences in a paragraph. This time, instead of splitting on whitespace, we split on the full stop character '.'<br>\nIn Python, we can also store long, multi-line strings by enclosing the text between triple quotes '''"
},
{
"cell_type": "code",
"collapsed": false,
"input": "paragraph = '''The pale Usher--threadbare in coat, heart, body, and brain; I see him now. He was ever dusting his old lexicons and \ngrammars, with a queer handkerchief, mockingly embellished with all the gay flags of all the known nations of the world. He loved to \ndust his old grammars; it somehow mildly reminded him of his mortality.'''",
"language": "python",
"metadata": {},
"outputs": [],
"prompt_number": 15
},
{
"cell_type": "code",
"collapsed": false,
"input": "paragraph_sentences = paragraph.split('.')\nprint paragraph_sentences",
"language": "python",
"metadata": {},
"outputs": [
{
"output_type": "stream",
"stream": "stdout",
"text": "['The pale Usher--threadbare in coat, heart, body, and brain; I see him now', ' He was ever dusting his old lexicons and \\ngrammars, with a queer handkerchief, mockingly embellished with all the gay flags of all the known nations of the world', ' He loved to \\ndust his old grammars; it somehow mildly reminded him of his mortality', '']\n"
}
],
"prompt_number": 16
},
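{
"cell_type": "markdown",
"metadata": {},
"source": "Notice that the list printed above ends with an empty string, because the paragraph ends with a full stop, and that the later sentences start with a space. As a rough sketch, a list comprehension can strip the surrounding whitespace and drop the empty pieces, which leaves the three sentences of the paragraph. The name <code>clean_sentences</code> is introduced here only for illustration; the rest of the tutorial keeps using the unfiltered list."
},
{
"cell_type": "code",
"collapsed": false,
"input": "# Keep only the non-empty pieces, with surrounding whitespace removed\nclean_sentences = [s.strip() for s in paragraph_sentences if s.strip()]\nprint(len(clean_sentences))",
"language": "python",
"metadata": {},
"outputs": []
},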
{
"cell_type": "markdown",
"metadata": {},
"source": "We can now split the text of Moby Dick, which is stored in the variable <code>book_text</code>, into sentences in the same way."
},
{
"cell_type": "code",
"collapsed": false,
"input": "sentences = book_text.split('.')",
"language": "python",
"metadata": {},
"outputs": [],
"prompt_number": 17
},
{
"cell_type": "markdown",
"metadata": {},
"source": "<b>Average Number of Words in a Sentence = (Total Number of Words)/(Total Number of Sentences)</b>"
},
{
"cell_type": "code",
"collapsed": false,
"input": "avg_num_words = len(words)/len(sentences)\nprint avg_num_words",
"language": "python",
"metadata": {},
"outputs": [
{
"output_type": "stream",
"stream": "stdout",
"text": "27\n"
}
],
"prompt_number": 18
},
{
"cell_type": "markdown",
"metadata": {},
"source": "Let us now write our results to a file."
},
{
"cell_type": "markdown",
"metadata": {},
"source": "<h2>Writing to Files</h2>"
},
{
"cell_type": "markdown",
"metadata": {},
"source": "Similar to <em>reading</em> from a file, <em>writing</em> to a file requires us to specify the full path of the file to which we are writing. We will use the <em>Write Mode</em>, which is indicated by the letter <b>w</b>. Note that opening an existing file in write mode erases its previous contents. We will store our results in a file suitably called <em>results.txt</em>."
},
{
"cell_type": "code",
"collapsed": false,
"input": "results_file = open('/home/user/absaini/scripted/results.txt','w')",
"language": "python",
"metadata": {},
"outputs": [],
"prompt_number": 19
},
{
"cell_type": "markdown",
"metadata": {},
"source": "Using the <code><a href=\"http://docs.python.org/2/tutorial/inputoutput.html#methods-of-file-objects\">write()</a></code> method, we can write a string to our file. <br>\nThe <code>write()</code> method does not add line breaks for us, so to start a new line in our output we have to include the newline character <em>\\n</em> explicitly."
},
{
"cell_type": "code",
"collapsed": false,
"input": "results_file.write('Results of Analysis of Moby Dick\\n\\n')\nresults_file.write('Average Number of Characters in a Word\\n')",
"language": "python",
"metadata": {},
"outputs": [],
"prompt_number": 20
},
{
"cell_type": "markdown",
"metadata": {},
"source": "<code>avg_num_characters</code> and <code>avg_num_words</code> are numbers, not strings, so we need to convert them to strings before we can write them to our file. We can do so with Python's built-in <code><a href=\"http://docs.python.org/2/library/functions.html#str\">str()</a></code> function. "
},
{
"cell_type": "code",
"collapsed": false,
"input": "results_file.write(str(avg_num_characters))\nresults_file.write('\\nAverage Number of Words in a Sentence\\n')\nresults_file.write(str(avg_num_words))",
"language": "python",
"metadata": {},
"outputs": [],
"prompt_number": 21
},
{
"cell_type": "markdown",
"metadata": {},
"source": "<strong>Closing Open Files</strong>"
},
{
"cell_type": "markdown",
"metadata": {},
"source": "Once we are done reading from or writing to our files, we tell Python to <em>close</em> them. This is done with the <code><a href=\"http://docs.python.org/2/tutorial/inputoutput.html#methods-of-file-objects\">close()</a></code> method. Closing a file that was opened for writing also makes sure that any text still held in memory is written to disk."
},
{
"cell_type": "code",
"collapsed": false,
"input": "text_file.close()\nresults_file.close()",
"language": "python",
"metadata": {},
"outputs": [],
"prompt_number": 22
},
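{
"cell_type": "markdown",
"metadata": {},
"source": "As an alternative sketch, Python's <code>with</code> statement closes a file automatically when the indented block ends, even if an error occurs inside it, so no explicit <code>close()</code> call is needed. The cell that follows re-reads the same input file used earlier in this way and then checks the file object's <code>closed</code> attribute."
},
{
"cell_type": "code",
"collapsed": false,
"input": "# The with statement closes the file automatically when the block ends\nwith open('/home/user/absaini/scripted/pg2701.txt', 'r') as text_file:\n    book_text = text_file.read()\n\n# closed is True once the with block has finished\nprint(text_file.closed)",
"language": "python",
"metadata": {},
"outputs": []
},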
{
"cell_type": "markdown",
"metadata": {},
"source": "<h2>Putting it Together</h2>"
},
{
"cell_type": "markdown",
"metadata": {},
"source": "We can put all the above commands into a single file with a <em>.py</em> extension. Let's call our file <b>readWrite.py</b>"
},
{
"cell_type": "code",
"collapsed": false,
"input": "# Open the input file\ntext_file = open('/home/user/absaini/scripted/pg2701.txt','r')\n\n# Read the contents of the file\nbook_text = text_file.read()\nprint len(book_text)\n\n# Words\nwords = book_text.split()\nprint len(words)\navg_num_characters = len(book_text)/len(words)\n\n# Sentences\nsentences = book_text.split('.')\navg_num_words = len(words)/len(sentences)\n\n# Open the results file\nresults_file = open('/home/user/absaini/scripted/results.txt','w')\n\n# Write Results to File\nresults_file.write('Results of Analysis of Moby Dick\\n\\n')\nresults_file.write('Average Number of Characters in a Word\\n')\nresults_file.write(str(avg_num_characters))\nresults_file.write('\\nAverage Number of Words in a Sentence\\n')\nresults_file.write(str(avg_num_words))\n\n# Close Files\ntext_file.close()\nresults_file.close()",
"language": "python",
"metadata": {},
"outputs": [
{
"output_type": "stream",
"stream": "stdout",
"text": "1257260\n215133\n"
}
],
"prompt_number": 24
},
{
"cell_type": "markdown",
"metadata": {},
"source": "To run the above Python program, execute the following command in the shell:<br>"
},
{
"cell_type": "raw",
"metadata": {},
"source": "$ python readWrite.py"
},
{
"cell_type": "markdown",
"metadata": {},
"source": ""
},
{
"cell_type": "markdown",
"metadata": {},
"source": "<h2>YOUR TURN!</h2>"
},
{
"cell_type": "markdown",
"metadata": {},
"source": "Select a book of your choice from <a href=\"http://www.gutenberg.org/ebooks/2701\">Project Gutenberg</a>. Download the book as a text file and read it as shown above. Compute the following statistics and write them to a file called <b>myResults.txt</b>:<br><br>\n1. Total number of characters \n2. Total number of words \n3. Total number of sentences \n4. Average Number of Characters in a Word \n5. Average Number of Words in a Sentence"
},
{
"cell_type": "markdown",
"metadata": {},
"source": "<h3>Bonus Points if you can...</h3>"
},
{
"cell_type": "markdown",
"metadata": {},
"source": "We all love a challenge, so here's one for the Pythonistas out there. Compute the frequency with which each letter occurs in the entire book and write the count for each letter to the results file. The format for each line will be <em><b>letter , count</b></em>. For example:<br><br>\n<em>a , 10101</em><br>\n<em>b , 9888</em><br>\n<em>c , 7888</em><br>\n...<br>\n...<br>\n...<br>"
},
{
"cell_type": "markdown",
"metadata": {},
"source": "Submit your script file (.py file) as well as the <b>myResults.txt</b> file for evaluation."
},
{
"cell_type": "markdown",
"metadata": {},
"source": "<br>"
}
],
"metadata": {}
}
]
}