Last active
August 29, 2015 14:06
extractTargetedData.py
#Short Python programme, written by Adam Crymble - a.crymble@herts.ac.uk
#for the Digital Histories Workshop module at the University of Hertfordshire
#I am assuming that you have already read and completed the 'Python Introduction and Installation' tutorial
#from the Programming Historian: http://programminghistorian.org/lessons/introduction-and-installation
#September 2014.

#note that lines beginning with the # symbol are comments for you the student.
#The computer will ignore these.
#read through them all carefully before you try running the programme.
#Afterwards, if you find these comments distracting, feel free to delete them.
#if there are any words you don't know, try putting them in a search box along with "Python"

#Firstly, we are going to 'import' a 'library' called 're' (regular expression)
#This will be used by the programme to find patterns rather than exact matches (e.g. 4 digits in a row, instead of "1234")
#we have to tell the programme to import this before we use it. Usually you do this right at the top of the programme.

import re

#---

#Next we want to have the programme open up the .txt file where we've stored our data.
#In a typical programme, you put data in --> the programme does something to it --> the programme spits out a result.
#this is our data input
#this particular programme requires you to have a .txt file containing 1 entry per line
#the easiest way to do that is to open your spreadsheet and copy a whole column into a blank file in Komodo Edit
#you will have to call the file "dataToExtractFrom.txt" because that's what this particular programme expects.
#you will also have to save it in the same directory (folder) where you are storing this programme.
#and if you get a warning about 'character encoding', just click on the 'Force' button. We're not going to worry about that.

file = open('dataToExtractFrom.txt', 'r')

#---

#The next 2 lines of code define the pattern we are interested in extracting
#you can change these to extract different patterns
#the first is the pattern beginning. The second marks the end of the pattern.
#In the example, we want to extract everything from "matric" up to and including the next 4 digits in a row.
#take a look at THE DATASET and you'll see that is quite a common pattern
#the value in the 'patternEnd' example is called a 'regular expression', and may look a bit odd to you.
#the r in front of the quotation marks tells Python to leave the backslash in \d alone.

patternBeginning = 'matric'
patternEnd = re.compile(r"\d{4}")

#---

#Next we are going to 'do something' to our input data.
#In this case, we will 'loop' through each line in our input data and look for the pattern.
#If we find the beginning of the pattern, we will attempt to extract it and print it out
#If we find the beginning of the pattern but not the end, we will print out 'END NOT FOUND' - this will 'flag' failed attempts for us to check manually.
#If we do not find the beginning of the pattern we will print a blank line.
#finally, we will close the text file containing our input data, since we're finished with it.

for line in file:
    if patternBeginning in line:
        #throw away everything before the pattern beginning
        line = line[line.find(patternBeginning):]
        try:
            #throw away everything after the first 4 digits in a row
            line = line[:patternEnd.search(line).end()]
            print(line)
        except AttributeError:
            #search() gives back None when there are no 4 digits, so .end() fails
            print('END NOT FOUND')
    else:
        print('')

file.close()

#The results of your programme should be visible in the 'Command Output' window below.
#You can now paste these back into your spreadsheet as a new column of value-added, extracted data.
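
#If you would like to try the extraction step without a data file, the sketch below wraps the same logic in a function.
#This is only one possible way to arrange it, not part of the original programme.
#The function name 'extract_between' and the three sample lines are invented for illustration; they are not taken from THE DATASET.
#It assumes Python 3.

```python
import re


def extract_between(line, pattern_beginning, pattern_end):
    """Return the text from pattern_beginning up to the end of the first
    pattern_end match, mirroring the three outcomes of the loop above."""
    start = line.find(pattern_beginning)
    if start == -1:
        # beginning not found: the programme prints a blank line
        return ''
    remainder = line[start:]
    match = pattern_end.search(remainder)
    if match is None:
        # beginning found but no end: flag it for manual checking
        return 'END NOT FOUND'
    return remainder[:match.end()]


# Hypothetical sample lines, made up for this example
sample_lines = [
    'John Smith, matric. Oxford 1712, aged 17',
    'Jane Doe, matric. date unknown',
    'no relevant entry here',
]

four_digits = re.compile(r'\d{4}')
results = [extract_between(entry, 'matric', four_digits) for entry in sample_lines]

for result in results:
    print(result)
```

#With these made-up lines the first entry is extracted, the second is flagged as 'END NOT FOUND', and the third gives a blank line.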