#extractTargetedData.py
#Short Python programme, written by Adam Crymble - a.crymble@herts.ac.uk
#for the Digital Histories Workshop module at the University of Hertfordshire
#I am assuming that you have already read and completed the 'Python Introduction and Installation' tutorial
#from the Programming Historian: http://programminghistorian.org/lessons/introduction-and-installation
#September 2014.
#note that lines beginning with the # symbol are comments for you the student.
#The computer will ignore these.
#read through them all carefully before you try running the programme.
#Afterwards, if you find these comments distracting, feel free to delete them.
#if there are any words you don't know, try putting them in a search box along with "Python"
#Firstly, we are going to 'import' a 'library' called 're' (regular expression)
#This will be used by the programme to find patterns rather than exact matches (e.g., any 4 digits in a row, instead of exactly "1234")
#we have to tell the programme to import this before we use it. Usually you do this right away.
import re
#---
#Next we want to have the programme open up the .txt file where we've stored our data.
#In a typical programme, you put data in --> the programme does something to it --> the programme spits out a result.
#this is our data input
#this particular programme requires you to have a .txt file containing 1 entry per line
#the easiest way to do that is to open your spreadsheet and copy a whole column into a blank file in Komodo Edit
#you will have to call the file "dataToExtractFrom.txt" because that's what this particular programme expects.
#you will also have to save it in the same directory (folder) as you are storing this programme.
#and if you get a warning about 'character encoding', just click on the 'Force' button. We're not going to worry about that.
file = open('dataToExtractFrom.txt', 'r')
#---
#The next 2 lines of code define the pattern we are interested in extracting
#you can change these to extract different patterns
#the first is the pattern beginning. The second marks the end of the pattern.
#In the example, we want to find anything beginning with "matric" and ending with 4 digits.
#take a look at THE DATASET and you'll see that is quite a common pattern
#the value in the 'patternEnd' example is called a 'regular expression', and may look a bit odd to you.
patternBeginning = 'matric'
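#\d means 'any digit' and {4} means 'exactly 4 in a row'. The r before the quotes tells Python to leave the backslash alone.
patternEnd = re.compile(r"\d{4}")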
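#---
#A short aside (added as a sketch; it is not part of the original programme, so feel free to delete it).
#It shows what a regular expression like the one above does when it searches some text.
#The sample text and the names starting with 'sample' are made up for illustration; they are not taken from THE DATASET.
#Nothing is printed here, so the output of the programme is unchanged.
import re
samplePattern = re.compile(r"\d{4}")
#search() looks through the text for the first run of 4 digits
sampleMatch = samplePattern.search('matric. 1652, aged 17')
#group() is the text that matched (here '1652'); end() is the position just after it (here 12)
sampleFound = sampleMatch.group()
samplePosition = sampleMatch.end()
#if there are no 4 digits in a row, search() gives back None instead of a match
sampleMissing = samplePattern.search('matric. date unknown')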
#---
#Next we are going to 'do something' to our input data.
#In this case, we will 'loop' through each line in our input data and look for the pattern.
#If we find the beginning of the pattern, we will attempt to extract it and print it out
#If we find the beginning of the pattern but not the end, we will print out 'END NOT FOUND' - this will 'flag' failed attempts for us to check manually.
#If we do not find the beginning of the pattern we will print a blank line.
#finally, we will close the text file containing our input data, since we're finished with it.
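#note: the brackets after 'print' mean these lines work in both Python 2 and Python 3.
for line in file:
    if patternBeginning in line:
        #keep only the part of the line from the beginning of the pattern onwards
        line = line[line.find(patternBeginning):]
        try:
            #cut the line off just after the first 4 digits
            line = line[:patternEnd.search(line).end()]
            print(line)
        except AttributeError:
            #search() found no 4 digits, so there is no end to cut at
            print('END NOT FOUND')
    else:
        print('')
file.close()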
#The results of your programme should be visible in the 'Command Output' window below.
#You can now paste these back into your spreadsheet as a new column of value-added, extracted data.
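#---
#An optional variation (a sketch added for illustration; it is not part of the original programme).
#It wraps the same extraction steps in a 'function' so you can try them on one piece of text at a time.
#The names 'extractPattern' and 'sampleResult' and the sample text are made up for this example.
#Nothing is printed here, so it will not add anything to the 'Command Output' window.
import re
def extractPattern(line, beginning, end):
    #if the beginning of the pattern is missing, give back an empty string
    if beginning not in line:
        return ''
    #cut away everything before the beginning of the pattern
    line = line[line.find(beginning):]
    #look for the end of the pattern; search() gives back None if it is not there
    match = end.search(line)
    if match is None:
        return 'END NOT FOUND'
    #keep everything up to and including the end of the pattern
    return line[:match.end()]
#for example, this stores 'matric. 1652' in the variable 'sampleResult'
sampleResult = extractPattern('born in London; matric. 1652, aged 17', 'matric', re.compile(r"\d{4}"))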