
@philshem
Last active August 26, 2020 15:07
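import pandas as pd
# Data source: https://burntsushi.net/stuff/worldcitiespop.csv
# low_memory=False parses the file in one pass instead of in internal chunks
df = pd.read_csv('worldcitiespop.csv', low_memory=False)
# keep only the rows whose Country column is "us"
df = df.query('Country == "us"')
print(len(df))
# prints 141989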
@philshem
Author

philshem commented Aug 26, 2020

Python 3.8.5
Pandas 1.1.1

The gist above takes less than 5 seconds on a 2020 MacBook (i5).

Slightly optimized code, which reads the file in chunks of 10,000 rows and filters each chunk, still takes more than 4 seconds:

import pandas as pd
#url = 'https://burntsushi.net/stuff/worldcitiespop.csv'
url = 'worldcitiespop.csv'
# chunksize makes read_csv return an iterator of DataFrames
iter_csv = pd.read_csv(url, iterator=True, chunksize=10000)
# filter each chunk first, then concatenate only the matching rows
df = pd.concat([chunk.query('Country == "us"') for chunk in iter_csv])
print(len(df))

Raising the chunk size to 100,000 rows and passing low_memory=False brings the runtime close to 3 seconds:

import pandas as pd
#url = 'https://burntsushi.net/stuff/worldcitiespop.csv'
url = 'worldcitiespop.csv'
# larger chunks mean fewer iterations and fewer intermediate DataFrames
iter_csv = pd.read_csv(url, iterator=True, chunksize=100000, low_memory=False)
df = pd.concat([chunk.query('Country == "us"') for chunk in iter_csv])
print(len(df))
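
The timings above were reported without showing how they were taken. One way to reproduce them might look like the sketch below. It is not the method used for the numbers in this comment. The helper names `count_rows_for_country` and `time_call` are hypothetical, and the sketch counts matching rows per chunk instead of concatenating them into a DataFrame, so its runtime will not match the figures quoted above exactly.

```python
import os
import time

import pandas as pd


def count_rows_for_country(path, country="us", chunksize=100_000):
    """Count rows whose Country column equals `country`, reading in chunks.

    Hypothetical helper: unlike the snippets above, it only keeps a running
    count and never builds the filtered DataFrame.
    """
    total = 0
    for chunk in pd.read_csv(path, chunksize=chunksize, low_memory=False):
        total += int((chunk["Country"] == country).sum())
    return total


def time_call(func, *args, **kwargs):
    """Run func(*args, **kwargs) and return (result, elapsed_seconds)."""
    start = time.perf_counter()
    result = func(*args, **kwargs)
    return result, time.perf_counter() - start


if __name__ == "__main__":
    csv_path = "worldcitiespop.csv"  # assumed to be downloaded locally
    if os.path.exists(csv_path):
        for size in (10_000, 100_000):
            n, seconds = time_call(
                count_rows_for_country, csv_path, "us", chunksize=size
            )
            print(f"chunksize={size}: {n} rows in {seconds:.2f} s")
```

A single run includes disk cache effects, so repeating each measurement a few times and comparing the fastest runs gives a fairer comparison between chunk sizes.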
