Skip to content

Instantly share code, notes, and snippets.

@misho-kr
Last active March 9, 2021 09:06
Show Gist options
  • Star 0 You must be signed in to star a gist
  • Fork 0 You must be signed in to fork a gist
  • Save misho-kr/aebcca0e53893176798dbb456e9a9a20 to your computer and use it in GitHub Desktop.
Save misho-kr/aebcca0e53893176798dbb456e9a9a20 to your computer and use it in GitHub Desktop.
Summary of "Merging DataFrames with pandas" course on Datacamp (https://gist.github.com/misho-kr/873ddcc2fc89f1c96414de9e0a58e0fe)

The data you need is not in a single file. It may be spread across a number of text files, spreadsheets, or databases. Import the data you’re interested in as a collection of DataFrames and combine them to answer your central questions. This course is all about the act of combining — or merging — DataFrames. Organize, reshape, and aggregate multiple datasets to answer your specific questions.

Lead by Team Anaconda, Data Science Training

Preparing data

Different techniques to import multiple files into DataFrames. Share information between DataFrames using their indexes. How indexes work is essential to merging DataFrames.

Perform database-style operations to combine DataFrames. Appending and concatenating DataFrames while working with a variety of real-world datasets.

  • Tools for pandas data import
    • pd.read_csv(), pd.read_excel(), pd.read_json(), pd.read_html()
  • Reindexing DataFrames
    • indices: many index labels within Index data structures
    • indexes: many pandas Index data structures
  • Arithmetic with Series & DataFrames
from glob import glob
filenames = glob('sales*.csv')
dataframes = [pd.read_csv(f) for f in filenames]

w_mean = pd.read_csv('quarterly_mean_temp.csv', index_col='Month')
print(w_mean.index)
print(type(w_mean.index))

ordered = ['Jan', 'Apr', 'Jul', 'Oct']
w_mean2 = w_mean.reindex(ordered)
w_mean2.sort_index()

w_max.reindex(w_mean3.index).dropna()

weather = pd.read_csv('pittsburgh2013.csv', index_col='Date', parse_dates=True)
weather.loc['2013-07-01':'2013-07-07', 'PrecipitationIn'] * 2.54
// error
week1_range / week1_mean
// right way
week1_range.divide(week1_mean, axis='rows')

bronze = pd.read_csv('bronze_top5.csv', index_col=0)
silver = pd.read_csv('silver_top5.csv', index_col=0)

bronze + silver
bronze.add(silver)
bronze.add(silver, fill_value=0).add(gold, fill_value=0)

Concatenating data

  • Appending and concatenating Series
    • append() stacks rows of s2 below s1
    • Method for Series and DataFrames
    • concar() can stack row-wise or column-wise
    • pandas module function
    • May need to reset the index after appending
result1 = pd.concat([s1, s2, s3])
result2 = s1.append(s2).append(s3)

new_east = northeast.append(south).reset_index(drop=True)
new_east = pd.concat([northeast, south], ignore_index=True)

pd.concat([population, unemployment], axis=1)

pd.concat([rain2013, rain2014], axis=0)
rain1314 = pd.concat([rain2013, rain2014], keys=[2013, 2014], axis=0)
print(rain1314.loc[2014])

rain1314 = pd.concat([rain2013, rain2014], axis='columns')
rain1314 = pd.concat([rain2013, rain2014], keys=[2013, 2014], axis='columns')
  • Concatenating rows with axis='rows'
    • Keys and MultiIndexes
  • Concatenating columns with axis='columns'
    • Using a multi-index on columns
A = np.arange(8).reshape(2,4) + 0.1
B = np.arange(6).reshape(2,3) + 0.2
C = np.arange(12).reshape(3,4) + 0.3

np.hstack([B, A])
np.concatenate([B, A], axis=1)

np.vstack([A, C])
np.concatenate([A, C], axis=0)

// errors
np.concatenate([A, B], axis=0) # incompatible columns
np.concatenate([A, C], axis=1) # incompatible rows
  • Outer join
    • Union of index sets (all labels, no repetition)
    • Missing fields filled with NaN
  • Inner join
    • Intersection of index sets (only common labels)
pd.concat([population, unemployment], axis=1, join='inner')
pd.concat([population, unemployment], axis=1, join='outer')

Merging data

Techniques for merging with left joins, right joins, inner joins, and outer joins. Ordered merging is useful to merge DataFrames with columns that have natural orderings, like date-time columns.

  • Merding DataFrames
    • Using suffixes
    • Specifying columns to merge
  • Joining DataFrames
    • Inner join, Left and right join
    • Outer join
  • Which should you use?
    • df1.append(df2): stacking vertically
    • pd.concat([df1, df2]): stacking many horizontally or vertically, simple inner/outer joins on Indexes
    • df1.join(df2): inner/outer/le!/right joins on Indexes
    • pd.merge([df1, df2]): many joins on multiple columns
  • Ordered merges
pd.merge(population, cities)
pd.merge(bronze, gold, on=['NOC', 'Country'], suffixes=['_bronze', '_gold'])
pd.merge(counties, cities, left_on='CITY NAME', right_on='City')

pd.merge(bronze, gold, on=['NOC', 'Country'], suffixes=['_bronze', '_gold'], how='inner')
pd.merge(bronze, gold, on=['NOC', 'Country'], suffixes=['_bronze', '_gold'], how='left')
pd.merge(bronze, gold, on=['NOC', 'Country'], suffixes=['_bronze', '_gold'], how='outer')

population.join(unemployment)
population.join(unemployment, how='right')
population.join(unemployment, how='inner')
population.join(unemployment, how='outer')

pd.merge(hardware, software, how='outer').sort_values('Date')
pd.merge_ordered(hardware, software)
pd.merge_ordered(hardware, software, on=['Date', 'Company'], suffixes=['_hardware', '_software'])
pd.merge_ordered(stocks, gdp, on='Date', fill_method='ffill')

Case Study - Summer Olympics

An in-depth case study using Olympic medal data

  • Reminder: loading & merging files
  • Constructing a pivot table
  • Reshaping and plo!ing
  • Graphical summary
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment