Skip to content

Instantly share code, notes, and snippets.

@vikas-git
Last active September 9, 2019 05:05
Show Gist options
  • Save vikas-git/67816c40e2299af2faa8f94361e3b53b to your computer and use it in GitHub Desktop.
Save vikas-git/67816c40e2299af2faa8f94361e3b53b to your computer and use it in GitHub Desktop.
In this gist trying to explain how to use sql queries on dataframes.
Source Blog: https://medium.com/jbennetcodes/how-to-rewrite-your-sql-queries-in-pandas-and-more-149d341fc53e?source=search_post---------0
-> Sql queries vs pandas queries
* select Query
-> select * from airports;
-> airports.head()
* select query with limit
-> select * from airports limit 3
-> airports.head(3)
* Get number of rows and columns
-> airports.shape // return (55536, 18)
* select certain column with where clause
-> select id from airports where ident='KLAX'
-> airports[airports.ident=='KLAX'].id
* select with multiple conditions
-> select * from airports where iso_region = 'US-CA' and type = 'seaplane_base'
-> airports[(airports.iso_region == 'US-CA') & (airports.type == 'seaplane_base')]
-> select ident, name, municipality from airports where iso_region = 'US-CA' and type = 'large_airport'
-> airports[(airports.iso_region == 'US-CA') & (airports.type == 'large_airport')][['ident', 'name', 'municipality']]
* ORDER BY
-> select * from airport_freq where airport_ident = 'KLAX' order by type
-> airport_freq[airport_freq.airport_ident == 'KLAX'].sort_values('type')
-> select * from airport_freq where airport_ident = 'KLAX' order by type desc
-> airport_freq[airport_freq.airport_ident == 'KLAX'].sort_values('type', ascending=False)
* IN...NOT IN
-> select * from airports where type in ('heliport', 'balloonport')
-> airports[airports.type.isin(['heliport', 'balloonport'])]
-> select * from airports where type not in ('heliport', 'balloonport')
-> airports[~airports.type.isin(['heliport', 'balloonport'])]
* Groupby, count, ORDER BY
-> select iso_country, type, count(*) from airports group by iso_country, type order by iso_country, type
-> airports.groupby(['iso_country', 'type']).size()
-> select iso_country, type, count(*) from airports group by iso_country, type order by iso_country, count(*) desc
-> airports.groupby(['iso_country', 'type']).size().to_frame('size').reset_index().sort_values(['iso_country', 'size'], ascending=[True, False])
-> select iso_country, type, count(*) from airports group by iso_country, type order by iso_country, type
-> airports.groupby(['iso_country', 'type']).size()
-> select iso_country, type, count(*) from airports group by iso_country, type order by iso_country, count(*) desc
-> airports.groupby(['iso_country', 'type']).size().to_frame('size').reset_index().sort_values(['iso_country', 'size'], ascending=[True, False])
* HAVING
-> select type, count(*) from airports where iso_country = 'US' group by type having count(*) > 1000 order by count(*) desc
-> airports[airports.iso_country == 'US'].groupby('type').filter(lambda g: len(g) > 1000).groupby('type').size().sort_values(ascending=False)
* TOP N Records
-> airports.head() // by default it returns 5
-> airports.head(3)
-> select iso_country from by_country order by size desc limit 10
-> by_country.nlargest(10, columns='airport_count')
-> select iso_country from by_country order by size desc limit 10 offset 10
-> by_country.nlargest(20, columns='airport_count').tail(10)
* Aggregate function (MIN, MAX, MEAN, MEDIAN)
-> select max(length_ft), min(length_ft), mean(length_ft), median(length_ft) from runways
-> runways.agg({'length_ft': ['min', 'max', 'mean', 'median']})
OR
-> df.T
* Join
Short note : Use .merge() to join Pandas dataframes. You need to provide which columns to join
on (left_on and right_on), and join type: inner (default), left (corresponds to LEFT OUTER in SQL),
right (RIGHT OUTER), or outer (FULL OUTER).
->select airport_ident, type, description, frequency_mhz from airport_freq join airports on
airport_freq.airport_ref = airports.id where airports.ident = 'KLAX'
-> airport_freq.merge(airports[airports.ident == 'KLAX'][['id']], left_on='airport_ref',
right_on='id', how='inner')[['airport_ident', 'type', 'description', 'frequency_mhz']]
* Union All and Union
-> select name, municipality from airports where ident = 'KLAX'
union all
select name, municipality from airports where ident = 'KLGB'
-> pd.concat([airports[airports.ident == 'KLAX'][['name', 'municipality']],
airports[airports.ident == 'KLGB'][['name', 'municipality']]])
^Note : (To deduplicate things (equivalent of UNION), you’d also have to add .drop_duplicates().)
* Insert query
-> create table heroes (id integer, name text);
-> insert into heroes values (1, 'Harry Potter');
-> insert into heroes values (2, 'Ron Weasley');
-> insert into heroes values (3, 'Hermione Granger');
-> df1 = pd.DataFrame({'id': [1, 2], 'name': ['Harry Potter', 'Ron Weasley']})
-> df2 = pd.DataFrame({'id': [3], 'name': ['Hermione Granger']})
-> pd.concat([df1, df2]).reset_index(drop=True)
* Update
-> update airports set home_link = 'http://www.lawa.org/welcomelax.aspx' where ident == 'KLAX'
-> airports.loc[airports['ident'] == 'KLAX', 'home_link'] = 'http://www.lawa.org/welcomelax.aspx'
* Delete
-> delete from lax_freq where type = 'MISC'
-> lax_freq = lax_freq[lax_freq.type != 'MISC']
-> lax_freq.drop(lax_freq[lax_freq.type == 'MISC'].index)
* or added a new calculated column:
-> df['total_cost'] = df['price'] * df['quantity']
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment