vikas-git/dataframe_queries.txt

## dataframe_queries.txt
Source Blog: https://medium.com/jbennetcodes/how-to-rewrite-your-sql-queries-in-pandas-and-more-149d341fc53e?source=search_post---------0

-> Sql queries vs pandas queries

* select Query
    -> select * from airports;
    -> airports.head()

* select query with limit
    -> select * from airports limit 3
    -> airports.head(3)

* Get number of rows and columns
    ->  airports.shape // return (55536, 18)

* select certain column with where clause
    -> select id from airports where ident='KLAX'
    -> airports[airports.ident=='KLAX'].id

* select with multiple conditions
    -> select * from airports where iso_region = 'US-CA' and type = 'seaplane_base'
    -> airports[(airports.iso_region == 'US-CA') & (airports.type == 'seaplane_base')]

    -> select ident, name, municipality from airports where iso_region = 'US-CA' and type = 'large_airport'
    -> airports[(airports.iso_region == 'US-CA') & (airports.type == 'large_airport')][['ident', 'name', 'municipality']]

* ORDER BY
    -> select * from airport_freq where airport_ident = 'KLAX' order by type
    -> airport_freq[airport_freq.airport_ident == 'KLAX'].sort_values('type')


    -> select * from airport_freq where airport_ident = 'KLAX' order by type desc
    -> airport_freq[airport_freq.airport_ident == 'KLAX'].sort_values('type', ascending=False)

* IN...NOT IN
    -> select * from airports where type in ('heliport', 'balloonport')
    -> airports[airports.type.isin(['heliport', 'balloonport'])]

    -> select * from airports where type not in ('heliport', 'balloonport')
    -> airports[~airports.type.isin(['heliport', 'balloonport'])]

* Groupby, count, ORDER BY
    -> select iso_country, type, count(*) from airports group by iso_country, type order by iso_country, type
    -> airports.groupby(['iso_country', 'type']).size()

    -> select iso_country, type, count(*) from airports group by iso_country, type order by iso_country, count(*) desc
    -> airports.groupby(['iso_country', 'type']).size().to_frame('size').reset_index().sort_values(['iso_country', 'size'], ascending=[True, False])

    -> select iso_country, type, count(*) from airports group by iso_country, type order by iso_country, type
    -> airports.groupby(['iso_country', 'type']).size()

    -> select iso_country, type, count(*) from airports group by iso_country, type order by iso_country, count(*) desc
    -> airports.groupby(['iso_country', 'type']).size().to_frame('size').reset_index().sort_values(['iso_country', 'size'], ascending=[True, False])

* HAVING
    -> select type, count(*) from airports where iso_country = 'US' group by type having count(*) > 1000 order by count(*) desc
    -> airports[airports.iso_country == 'US'].groupby('type').filter(lambda g: len(g) > 1000).groupby('type').size().sort_values(ascending=False)

* TOP N Records
    -> airports.head() // by default it returns 5
    -> airports.head(3)

    -> select iso_country from by_country order by size desc limit 10
    -> by_country.nlargest(10, columns='airport_count')

    -> select iso_country from by_country order by size desc limit 10 offset 10
    -> by_country.nlargest(20, columns='airport_count').tail(10)

* Aggregate function (MIN, MAX, MEAN, MEDIAN)
    -> select max(length_ft), min(length_ft), mean(length_ft), median(length_ft) from runways
    -> runways.agg({'length_ft': ['min', 'max', 'mean', 'median']})
                                OR
    -> df.T

* Join
    Short note : Use .merge() to join Pandas dataframes. You need to provide which columns to join
    on (left_on and right_on), and join type: inner (default), left (corresponds to LEFT OUTER in SQL),
    right (RIGHT OUTER), or outer (FULL OUTER).

    ->select airport_ident, type, description, frequency_mhz from airport_freq join airports on
    airport_freq.airport_ref = airports.id where airports.ident = 'KLAX'
    -> airport_freq.merge(airports[airports.ident == 'KLAX'][['id']], left_on='airport_ref',
    right_on='id', how='inner')[['airport_ident', 'type', 'description', 'frequency_mhz']]

* Union All and Union
    -> select name, municipality from airports where ident = 'KLAX'
    union all
    select name, municipality from airports where ident = 'KLGB'
    -> pd.concat([airports[airports.ident == 'KLAX'][['name', 'municipality']],
                airports[airports.ident == 'KLGB'][['name', 'municipality']]])

    ^Note : (To deduplicate things (equivalent of UNION), you’d also have to add .drop_duplicates().)

* Insert query
    -> create table heroes (id integer, name text);
    -> insert into heroes values (1, 'Harry Potter');
    -> insert into heroes values (2, 'Ron Weasley');
    -> insert into heroes values (3, 'Hermione Granger');

    -> df1 = pd.DataFrame({'id': [1, 2], 'name': ['Harry Potter', 'Ron Weasley']})
    -> df2 = pd.DataFrame({'id': [3], 'name': ['Hermione Granger']})
    -> pd.concat([df1, df2]).reset_index(drop=True)

* Update
    -> update airports set home_link = 'http://www.lawa.org/welcomelax.aspx' where ident == 'KLAX'
    -> airports.loc[airports['ident'] == 'KLAX', 'home_link'] = 'http://www.lawa.org/welcomelax.aspx'

* Delete
    -> delete from lax_freq where type = 'MISC'
    -> lax_freq = lax_freq[lax_freq.type != 'MISC']
    -> lax_freq.drop(lax_freq[lax_freq.type == 'MISC'].index)

* or added a new calculated column:
    -> df['total_cost'] = df['price'] * df['quantity']
	Source Blog: https://medium.com/jbennetcodes/how-to-rewrite-your-sql-queries-in-pandas-and-more-149d341fc53e?source=search_post---------0

	-> Sql queries vs pandas queries

	* select Query
	-> select * from airports;
	-> airports.head()

	* select query with limit
	-> select * from airports limit 3
	-> airports.head(3)

	* Get number of rows and columns
	-> airports.shape // return (55536, 18)

	* select certain column with where clause
	-> select id from airports where ident='KLAX'
	-> airports[airports.ident=='KLAX'].id

	* select with multiple conditions
	-> select * from airports where iso_region = 'US-CA' and type = 'seaplane_base'
	-> airports[(airports.iso_region == 'US-CA') & (airports.type == 'seaplane_base')]

	-> select ident, name, municipality from airports where iso_region = 'US-CA' and type = 'large_airport'
	-> airports[(airports.iso_region == 'US-CA') & (airports.type == 'large_airport')][['ident', 'name', 'municipality']]

	* ORDER BY
	-> select * from airport_freq where airport_ident = 'KLAX' order by type
	-> airport_freq[airport_freq.airport_ident == 'KLAX'].sort_values('type')


	-> select * from airport_freq where airport_ident = 'KLAX' order by type desc
	-> airport_freq[airport_freq.airport_ident == 'KLAX'].sort_values('type', ascending=False)

	* IN...NOT IN
	-> select * from airports where type in ('heliport', 'balloonport')
	-> airports[airports.type.isin(['heliport', 'balloonport'])]

	-> select * from airports where type not in ('heliport', 'balloonport')
	-> airports[~airports.type.isin(['heliport', 'balloonport'])]

	* Groupby, count, ORDER BY
	-> select iso_country, type, count(*) from airports group by iso_country, type order by iso_country, type
	-> airports.groupby(['iso_country', 'type']).size()

	-> select iso_country, type, count() from airports group by iso_country, type order by iso_country, count() desc
	-> airports.groupby(['iso_country', 'type']).size().to_frame('size').reset_index().sort_values(['iso_country', 'size'], ascending=[True, False])

	-> select iso_country, type, count(*) from airports group by iso_country, type order by iso_country, type
	-> airports.groupby(['iso_country', 'type']).size()

	-> select iso_country, type, count() from airports group by iso_country, type order by iso_country, count() desc
	-> airports.groupby(['iso_country', 'type']).size().to_frame('size').reset_index().sort_values(['iso_country', 'size'], ascending=[True, False])

	* HAVING
	-> select type, count() from airports where iso_country = 'US' group by type having count() > 1000 order by count(*) desc
	-> airports[airports.iso_country == 'US'].groupby('type').filter(lambda g: len(g) > 1000).groupby('type').size().sort_values(ascending=False)

	* TOP N Records
	-> airports.head() // by default it returns 5
	-> airports.head(3)

	-> select iso_country from by_country order by size desc limit 10
	-> by_country.nlargest(10, columns='airport_count')

	-> select iso_country from by_country order by size desc limit 10 offset 10
	-> by_country.nlargest(20, columns='airport_count').tail(10)

	* Aggregate function (MIN, MAX, MEAN, MEDIAN)
	-> select max(length_ft), min(length_ft), mean(length_ft), median(length_ft) from runways
	-> runways.agg({'length_ft': ['min', 'max', 'mean', 'median']})
	OR
	-> df.T

	* Join
	Short note : Use .merge() to join Pandas dataframes. You need to provide which columns to join
	on (left_on and right_on), and join type: inner (default), left (corresponds to LEFT OUTER in SQL),
	right (RIGHT OUTER), or outer (FULL OUTER).

	->select airport_ident, type, description, frequency_mhz from airport_freq join airports on
	airport_freq.airport_ref = airports.id where airports.ident = 'KLAX'
	-> airport_freq.merge(airports[airports.ident == 'KLAX'][['id']], left_on='airport_ref',
	right_on='id', how='inner')[['airport_ident', 'type', 'description', 'frequency_mhz']]

	* Union All and Union
	-> select name, municipality from airports where ident = 'KLAX'
	union all
	select name, municipality from airports where ident = 'KLGB'
	-> pd.concat([airports[airports.ident == 'KLAX'][['name', 'municipality']],
	airports[airports.ident == 'KLGB'][['name', 'municipality']]])

	^Note : (To deduplicate things (equivalent of UNION), you’d also have to add .drop_duplicates().)

	* Insert query
	-> create table heroes (id integer, name text);
	-> insert into heroes values (1, 'Harry Potter');
	-> insert into heroes values (2, 'Ron Weasley');
	-> insert into heroes values (3, 'Hermione Granger');

	-> df1 = pd.DataFrame({'id': [1, 2], 'name': ['Harry Potter', 'Ron Weasley']})
	-> df2 = pd.DataFrame({'id': [3], 'name': ['Hermione Granger']})
	-> pd.concat([df1, df2]).reset_index(drop=True)

	* Update
	-> update airports set home_link = 'http://www.lawa.org/welcomelax.aspx' where ident == 'KLAX'
	-> airports.loc[airports['ident'] == 'KLAX', 'home_link'] = 'http://www.lawa.org/welcomelax.aspx'

	* Delete
	-> delete from lax_freq where type = 'MISC'
	-> lax_freq = lax_freq[lax_freq.type != 'MISC']
	-> lax_freq.drop(lax_freq[lax_freq.type == 'MISC'].index)

	* or added a new calculated column:
	-> df['total_cost'] = df['price'] * df['quantity']