@MetaJoker
Created May 1, 2017 00:21
Answer to question #43711418 - MBasith created by metajoker - https://repl.it/H8r1/3
Apples: 5 items in stock
Pears: 10 items in stock
Bananas: 15 items in stock
Watermelon: 20 items in stock
Pears: 25 items in stock
Oranges: 30 items in stock
Apples: 0 items in stock
Pears: 0 items in stock
Bananas: 0 items in stock
Watermelon: 0 items in stock
Pears: 1 items in stock
Oranges: 0 items in stock
def _get_key(string, delim):
    #Split key out of string
    key=string.split(delim)[0].strip()
    return key

def _clean_string(string, charToReplace):
    #Remove garbage from string
    for character in charToReplace:
        string=string.replace(character,'')
    #Strip leading and trailing whitespace
    string=string.strip()
    return string

def get_matching_key_values(file_1, file_2, delim, charToReplace):
    #Open the files to be compared
    with open(file_1, 'r') as a, open(file_2, 'r') as b:
        #Create an object to hold our matches
        matches=[]
        #Iterate over file 'a' and extract the keys, one-at-a-time
        for lineA in a:
            keyA=_get_key(lineA, delim)
            #Iterate over file 'b' and extract the keys, one-at-a-time
            for lineB in b:
                keyB=_get_key(lineB, delim)
                #Compare the keys. You might need upper, but I usually prefer
                #to compare all uppercase to all uppercase
                if keyA.upper()==keyB.upper():
                    cleanedOutput=(_clean_string(lineA, charToReplace), _clean_string(lineB, charToReplace))
                    matches.append(cleanedOutput)
            #Reset file 'b' pointer to start of file and try again
            b.seek(0)
    #Return our final list of matches
    #--NOTE: this method CAN return an empty 'matches' object!
    return matches

if __name__=="__main__":
    def format_output (output):
        return '\n'.join(map(str, output))

    #Test of fn against provided file_1 and file_2
    print("###############################################\nTest case #1")
    print(format_output(get_matching_key_values('./file_1.txt', './file_2.txt', ':', ['\n', '\r'])))
    print("###############################################")
    print('\n')

    #Test of fn against provided file_1 and created file_3
    print("###############################################\nTest case #2")
    print(format_output(get_matching_key_values('./file_1.txt', './file_3.txt', ':', ['\n', '\r'])))
    print("###############################################")
@MetaJoker

Original post (http://stackoverflow.com/a/43712831/6476525):

I would strongly advise against storing data in 1GB text files rather than in a database or a standard data storage format. If your data were more complex, I'd suggest CSV or some other delimited format at minimum. If you can split the data and store it in much smaller chunks, consider a structured format like XML, HTML, or JSON. These are far more organized and make navigating and extracting data easy, which is what you're trying to do here (locating matching keys and returning their values).
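To make the database suggestion a bit more concrete, here is a rough sketch using the sqlite3 module from the standard library. It is only an illustration under assumptions, not something from the original question: the function names (load_file, matching_rows), the table names (file_1, file_2), and the two-column key/value layout are all made up for the example, and it assumes every useful line looks like "key: value".

```python
import sqlite3

def load_file(conn, table, path, delim=':'):
    """Load 'key<delim>value' lines from path into a two-column table.

    'table' is a hypothetical, trusted name; it is not user input.
    """
    conn.execute(f'CREATE TABLE {table} (key TEXT, value TEXT)')
    with open(path, 'r') as f:
        # Split each line once on the delimiter; skip lines without it
        rows = (
            tuple(part.strip() for part in line.split(delim, 1))
            for line in f
            if delim in line
        )
        conn.executemany(f'INSERT INTO {table} VALUES (?, ?)', rows)
    # Case-insensitive index so the join does not have to scan the table
    conn.execute(f'CREATE INDEX idx_{table}_key ON {table} (key COLLATE NOCASE)')

def matching_rows(conn):
    """Return (key, value_from_file_1, value_from_file_2) for matching keys."""
    query = '''
        SELECT a.key, a.value, b.value
        FROM file_1 AS a
        JOIN file_2 AS b ON a.key = b.key COLLATE NOCASE
    '''
    return conn.execute(query).fetchall()
```

You would call load_file once per input (for example with a connection from sqlite3.connect('data.db')), then matching_rows to get the pairs. The row order of the join is not guaranteed, so sort the result if order matters.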

That said, you could use the "readline" method, covered in section 7.2.1 of the Python 3 tutorial, to read each file one line at a time instead of loading it into memory all at once: https://docs.python.org/3/tutorial/inputoutput.html#reading-and-writing-file
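A minimal sketch of the "readline" approach, just to show the shape of it. The helper name iter_lines_with_readline is made up for this example:

```python
def iter_lines_with_readline(path):
    """Yield the lines of a text file one at a time using readline()."""
    with open(path, 'r') as f:
        while True:
            line = f.readline()
            # readline() returns an empty string only at end of file;
            # a blank line in the file comes back as '\n'
            if line == '':
                break
            yield line
```

Only one line is held in memory at a time, so the size of the file should not matter much here.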

Or, you could just iterate over the file:

def _get_key(string, delim):

    #Split key out of string
    key=string.split(delim)[0].strip()
    return key

def _clean_string(string, charToReplace):

    #Remove garbage from string
    for character in charToReplace:
        string=string.replace(character,'')

    #Strip leading and trailing whitespace
    string=string.strip()
    return string

def get_matching_key_values(file_1, file_2, delim, charToReplace):

    #Open the files to be compared
    with open(file_1, 'r') as a, open(file_2, 'r') as b:

        #Create an object to hold our matches
        matches=[]

        #Iterate over file 'a' and extract the keys, one-at-a-time
        for lineA in a:
            keyA=_get_key(lineA, delim)

            #Iterate over file 'b' and extract the keys, one-at-a-time
            for lineB in b:
                keyB=_get_key(lineB, delim)

                #Compare the keys. You might need upper, but I usually prefer
                #to compare all uppercase to all uppercase
                if keyA.upper()==keyB.upper():
                    cleanedOutput=(_clean_string(lineA, charToReplace),
                                   _clean_string(lineB, charToReplace))

                    #Append the match to the 'matches' list
                    matches.append(cleanedOutput)

            #Reset file 'b' pointer to start of file and try again
            b.seek(0)

    #Return our final list of matches
    #--NOTE: this method CAN return an empty 'matches' object!
    return matches

This is not really the best or most efficient way to go about this:

  1. ALL matches are saved to a list object in memory
  2. There is no handling of duplicates, so a key that repeats in both
    files produces one match for every pairing
  3. No speed optimization
  4. Iteration over file 'b' occurs 'n' times, where 'n' is the number of
    lines in file 'a'. Ideally, you would only iterate over each file once.

Even using only base Python, I'm sure there is a better way to go about it.
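One possible way to address points 1 and 4, sketched with base Python only: index file 'b' by key in a dictionary, then stream file 'a' once and look each key up. This is a different technique from the nested loop above, not a drop-in replacement. The function name iter_matching_key_values is made up, and the big assumption is that the lines of the second file fit in memory. Unlike the original, it also skips blank lines and only strips surrounding whitespace.

```python
from collections import defaultdict

def iter_matching_key_values(file_1, file_2, delim=':'):
    """Yield (line_from_file_1, line_from_file_2) pairs whose keys match.

    Assumption: the contents of file_2 fit in memory.
    """
    # Pass 1: read file_2 once and group its lines by upper-cased key
    index = defaultdict(list)
    with open(file_2, 'r') as b:
        for line_b in b:
            line_b = line_b.strip()
            if not line_b:
                continue  # skip blank lines
            key_b = line_b.split(delim)[0].strip().upper()
            index[key_b].append(line_b)

    # Pass 2: read file_1 once and look each key up in the index
    with open(file_1, 'r') as a:
        for line_a in a:
            line_a = line_a.strip()
            if not line_a:
                continue
            key_a = line_a.split(delim)[0].strip().upper()
            # Yield matches one at a time instead of building a list
            for line_b in index.get(key_a, ()):
                yield (line_a, line_b)
```

Because it is a generator, you can loop over the results and write them out as they arrive, or wrap the call in list() if you really want everything at once. Duplicate keys still produce one pair per combination, same as before.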

For the Gist: https://gist.github.com/MetaJoker/a63f8596d1084b0868e1bdb5bdfb5f16

I think the Gist also has a link to the repl.it I used to write and test the code if you want a copy to play with in your browser.
