acrymble/extractTargetedData.py

## extractTargetedData.py
#Short Python programme, written by Adam Crymble - a.crymble@herts.ac.uk
#for the Digital Histories Workshop module at the University of Hertfordshire
#I am assuming that you have already read and completed the 'Python Introduction and Installation' tutorial
#from the Programming Historian: http://programminghistorian.org/lessons/introduction-and-installation
#September 2014.

#note that lines beginning with the # symbol are comments for you the student.
#The computer will ignore these.
#read through them all carefully before you try running the programme.
#Afterwards, if you find these comments distracting, feel free to delete them.
#if there are any words you don't know, try putting them in a search box along with "Python"


#Firstly, we are going to 'import' a 'library' called 're' (regular expression)
#This will be used by the programme to find patterns rather than exact matches (eg, 4 digits in a row, instead of "1234")
#we have to tell the programme to import this before we use it. Usually you do this right away.

import re


#---
#Next we want to have the programme open up the .txt file where we've stored our data.
#In a typical programme, you put data in --> the programme does something to it --> the programme spits out a result.
#this is our data input
#this particular programme requires you to have a .txt file containing 1 entry per line
#the easiest way to do that is to open your spreadsheet and copy a whole column into a blank file in Komodo Edit
#you will have to call the file "dataToExtractFrom.txt" because that's what this particular programme expects.
#you will also have to save it in the same directory (folder) as you are storing this programme.
#and if you get a warning about 'character encoding', just click on the 'Force' button. We're not going to worry about that.

file = open('dataToExtractFrom.txt', 'r')
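

#---
#An optional aside, not part of the original programme: the line above opens the file and leaves it open
#until we close it ourselves at the very end. Another common way is Python's 'with' statement,
#which closes the file for you as soon as the indented block finishes.
#The function name 'readEntries' is made up for this sketch. Defining it prints nothing,
#so it will not change your results.

def readEntries(filename):
    #open the file, read every line, and strip the newline character from the end of each one
    with open(filename, 'r') as dataFile:
        return [entry.rstrip('\n') for entry in dataFile]

#for example, readEntries('dataToExtractFrom.txt') would give you a list with one entry per line of the file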


#---
#The next 2 lines of code define the pattern we are interested in extracting
#you can change these to extract different patterns
#the first is the pattern beginning. The second marks the end of the pattern.
#In the example, we want to extract everything from the word "matric" up to and including the next 4 digits in a row.
#take a look at THE DATASET and you'll see that is quite a common pattern
#the value in the 'patternEnd' example is called a 'regular expression', and may look a bit odd to you.

patternBeginning = 'matric'
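#the r before the quotation marks tells Python to leave the backslash alone, so that \d reaches the 're' library unchanged
patternEnd = re.compile(r"\d{4}")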
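

#---
#An optional illustration, not part of the original programme: if the regular expression looks odd,
#this small sketch shows what it matches. \d means 'any digit' and {4} means 'exactly 4 of them in a row'.
#The function name 'findFourDigits' and the sample text below are made up for this example.
#Defining a function prints nothing, so this will not change your results.

import re  #already imported at the top; repeated here so the sketch also works on its own

def findFourDigits(text):
    #search() looks for the first place in the text where the pattern fits
    match = re.search(r"\d{4}", text)
    if match is None:
        #nothing in the text fits the pattern
        return None
    #group() hands back the piece of text that matched
    return match.group()

#for example, findFourDigits('matric. 13 Oct 1685') would give you '1685' (the '13' is skipped: only 2 digits)
#and findFourDigits('no year here') would give you None
#be aware that a longer number also matches: findFourDigits('12345') would give you '1234'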


#---
#Next we are going to 'do something' to our input data.
#In this case, we will 'loop' through each line in our input data and look for the pattern.
#If we find the beginning of the pattern, we will attempt to extract everything up to the end of the pattern and print it out.
#If we find the beginning of the pattern but not the end, we will print out 'END NOT FOUND' - this will 'flag' failed attempts for us to check manually.
#If we do not find the beginning of the pattern we will print a blank line, so that every line of input still gets exactly one line of output.
#finally, we will close the text file containing our input data, since we're finished with it.
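for line in file:
    if patternBeginning in line:
        #throw away everything before the beginning of the pattern
        line = line[line.find(patternBeginning):]

        try:
            #keep everything up to and including the 4 digits
            line = line[:patternEnd.search(line).end()]
            print(line)
        except AttributeError:
            #search() gives back None when it finds no 4 digits, and None has no .end()
            print('END NOT FOUND')
    else:
        print('')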

file.close()

#The results of your programme should be visible in the 'Command Output' window below.
#You can now paste these back into your spreadsheet as a new column of value-added, extracted data.
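

#---
#An optional extra, not part of the original programme: if you want to experiment with different patterns,
#the same steps as the loop above can be wrapped up in a function and tried on one piece of text at a time.
#This is only a sketch. The function name 'extractBetween' and the sample text are made up for this example.

import re  #already imported at the top; repeated here so the sketch also works on its own

def extractBetween(text, beginning, endPattern):
    #no beginning found: give back an empty result, like the blank line above
    if beginning not in text:
        return ''
    #throw away everything before the beginning of the pattern
    text = text[text.find(beginning):]
    #look for the end of the pattern in what is left
    match = endPattern.search(text)
    if match is None:
        return 'END NOT FOUND'
    #keep everything up to and including the end of the pattern
    return text[:match.end()]

#for example, extractBetween('Smith, John; matric. Oxford 1685; d. 1720', 'matric', re.compile(r"\d{4}"))
#would give you 'matric. Oxford 1685'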