@salma71
Created February 8, 2021 13:57
The idea here is to process chunks of data asynchronously by pushing them into a multiprocessing pool. Each process in the pool works on one chunk and returns its result. Note that it is important to create the Pool inside the __main__ block: on platforms that spawn fresh processes (such as Windows), each worker re-imports the main module, and the guard keeps the workers from recursively creating pools of their own.
import multiprocessing as mp

import pandas as pd

LARGE_FILE = "D:\\my_large_file.txt"
CHUNKSIZE = 100000  # process 100,000 rows at a time

def process_frame(df):
    # process a single chunk; here we just count its rows
    return len(df)

if __name__ == '__main__':
    reader = pd.read_table(LARGE_FILE, chunksize=CHUNKSIZE)
    pool = mp.Pool(4)  # use 4 worker processes

    # submit each chunk to the pool without blocking
    funclist = []
    for df in reader:
        f = pool.apply_async(process_frame, [df])
        funclist.append(f)

    # collect the results as the workers finish
    result = 0
    for f in funclist:
        result += f.get(timeout=10)  # wait up to 10 seconds per chunk

    pool.close()
    pool.join()

    print("There are %d rows of data" % result)
import pandas as pd

LARGE_FILE = "D:\\my_large_file.txt"
CHUNKSIZE = 100000  # process 100,000 rows at a time

def process_frame(df):
    # process a single chunk; here we just count its rows
    return len(df)

if __name__ == '__main__':
    reader = pd.read_table(LARGE_FILE, chunksize=CHUNKSIZE)

    # process each chunk in turn
    result = 0
    for df in reader:
        result += process_frame(df)

    print("There are %d rows of data" % result)