@simonbyrne
Created October 8, 2015 12:47

Data

Appending row

SQL

INSERT INTO table (col1, col2, col3)
VALUES ('aa', 'bb', 'cc');

dplyr

bind_rows(table, data.frame(col1="aa",col2="bb",col3="cc"))

data.table

rbind(table, data.table(col1="aa",col2="bb",col3="cc"))

pandas

table.append({'col1': 'aa', 'col2': 'bb', 'col3': 'cc'}, ignore_index=True)  # returns a copy
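As a quick runnable check (sample values assumed; recent pandas versions removed `DataFrame.append`, so `pd.concat` with a one-row frame is used here):

```python
import pandas as pd

# existing table with three text columns (assumed names and values)
table = pd.DataFrame({'col1': ['x'], 'col2': ['y'], 'col3': ['z']})

# append one row by concatenating a single-row frame
row = pd.DataFrame([{'col1': 'aa', 'col2': 'bb', 'col3': 'cc'}])
table = pd.concat([table, row], ignore_index=True)
print(table.shape)  # → (2, 3)
```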

Stack vertically

SQL

SELECT * FROM table1
UNION ALL
SELECT * FROM table2;

dplyr

bind_rows(table1, table2)

data.table

rbind(table1,table2)

pandas

pandas.concat([table1, table2]) # or
table1.append(table2)

graphlab

table1.append(table2)
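A minimal pandas sketch of the vertical stack (toy data assumed), matching SQL's UNION ALL in that duplicates are kept:

```python
import pandas as pd

table1 = pd.DataFrame({'a': [1, 2]})
table2 = pd.DataFrame({'a': [3]})

# UNION ALL equivalent: stack rows, renumbering the index
stacked = pd.concat([table1, table2], ignore_index=True)
print(stacked['a'].tolist())  # → [1, 2, 3]
```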

Sorting

Sort table by col1 ascending, col2 descending

SQL

SELECT *
FROM table
ORDER BY col1, col2 DESC

dplyr

table %>% arrange(col1,desc(col2))

data.table

table[order(col1, -col2)]  # or
setorder(table, col1, -col2) # operates in-place

pandas

table.sort_values(['col1', 'col2'], ascending=[True, False])

graphlab

table.sort([('col1', True), ('col2', False)])
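A runnable pandas sketch of the mixed-direction sort (toy values assumed; `sort_values` is the current pandas API):

```python
import pandas as pd

table = pd.DataFrame({'col1': [2, 1, 1], 'col2': [5, 3, 9]})

# col1 ascending, col2 descending within ties of col1
out = table.sort_values(['col1', 'col2'], ascending=[True, False])
print(out['col2'].tolist())  # → [9, 3, 5]
```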

Filter rows

SQL

SELECT *
FROM table
WHERE col1 = 'aa' AND col2 = 'bb'

dplyr

table %>% filter(col1 == 'aa', col2 == 'bb')

data.table

table[col1=='aa' & col2=='bb']

pandas/graphlab

table[(table['col1'] == 'aa') & (table['col2'] == 'bb')]
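The same filter on toy data, runnable (note the parentheses around each comparison, which `&`'s precedence requires):

```python
import pandas as pd

table = pd.DataFrame({'col1': ['aa', 'aa', 'cc'],
                      'col2': ['bb', 'xx', 'bb']})

# combine boolean masks element-wise with &
out = table[(table['col1'] == 'aa') & (table['col2'] == 'bb')]
print(len(out))  # → 1
```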

Create a new column as a function of others

SQL

SELECT *, col1 - col2 AS col3
FROM table

dplyr

table %>% mutate(col3 = col1 - col2)

data.table

table[,col3:=col1-col2]

pandas/graphlab

table['col3'] = table['col1'] - table['col2']
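A runnable pandas sketch of the derived column (toy values assumed):

```python
import pandas as pd

table = pd.DataFrame({'col1': [10, 20], 'col2': [1, 2]})

# vectorised arithmetic: col3 = col1 - col2, as in the SQL version
table['col3'] = table['col1'] - table['col2']
print(table['col3'].tolist())  # → [9, 18]
```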

Reshape: columns -> rows

aka stack/melt

SQL

No standard way (apart from abuse of UNION ALL, but this will perform multiple scans of the table).

Postgres can use unnest:

SELECT id,
    unnest(array['col1', 'col2', 'col3']) AS colname,
    unnest(array[col1, col2, col3]) AS value
FROM table;

dplyr

with tidyr

table %>% gather(colname,value,col1:col3)

data.table

melt(table, "id", c("col1","col2","col3"), "colname", "value")

pandas

pandas.melt(table, id_vars=['id'], value_vars=['col1', 'col2', 'col3'],
    var_name='colname', value_name='value')

table.stack() may also work for simple cases.

graphlab

stack operates on a single dict (or list) column, so this first needs to be created:

table.pack_columns(column_prefix='col',dtype=dict,
    new_column_name='colpack').stack('colpack',new_column_name=['colname', 'value'])
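For reference, the wide-to-long reshape on toy data in pandas (two value columns assumed for brevity):

```python
import pandas as pd

table = pd.DataFrame({'id': [1, 2],
                      'col1': [10, 30], 'col2': [20, 40]})

# wide -> long: one output row per (id, original column)
long = pd.melt(table, id_vars=['id'], value_vars=['col1', 'col2'],
               var_name='colname', value_name='value')
print(long.shape)  # → (4, 3)
```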

Reshape: rows -> columns

aka unstack/cast

SQL

No standard method, though it can be emulated with many joins.

Postgres offers crosstab, but the number and types of the result columns need to be known in advance.

SELECT *
FROM crosstab3('
    select id, colname, value
    from table
    order by 1,2
    ');

dplyr

with tidyr

table %>% spread(colname, value)

data.table

dcast.data.table(table, id ~ colname, value.var="value")

pandas

If (id, colname) pairs are unique, pivot reshapes directly without aggregation:

table.pivot(index='id', columns='colname', values='value')

If pairs can repeat, pivot_table with an aggregation function is needed:

table.pivot_table(values='value', index='id', columns='colname', aggfunc='sum')

table.set_index(['id', 'colname'])['value'].unstack() is an equivalent alternative to pivot.

graphlab

table.unstack(['colname','value'],new_column_name='colpack').unpack('colpack',column_name_prefix="")
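A runnable pandas sketch of the long-to-wide reshape (toy data assumed; `pivot` suffices because each (id, colname) pair occurs once):

```python
import pandas as pd

long = pd.DataFrame({'id': [1, 1, 2, 2],
                     'colname': ['col1', 'col2', 'col1', 'col2'],
                     'value': [10, 20, 30, 40]})

# long -> wide, no aggregation; raises if (id, colname) pairs repeat
wide = long.pivot(index='id', columns='colname', values='value')
print(wide.loc[2, 'col2'])  # → 40
```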

top n by group

SQL

Not standard, but can be done using Postgres window functions. A window-function alias cannot be referenced in WHERE, so a subquery is needed:

SELECT *
FROM (
    SELECT *, rank() OVER (PARTITION BY grp ORDER BY col DESC) AS pos
    FROM table
) ranked
WHERE pos <= 5;

dplyr

table %>% group_by(grp) %>% top_n(5, col)

(the result remains grouped; add ungroup() at the end if later steps should not be)

data.table

table[,.SD[order(-col)[1:5]],by=grp]

pandas

table.sort_values(['grp', 'col'], ascending=[True, False]).groupby('grp').head(5)

graphlab

No builtin. The following is based on this how-to:

table = table.sort(['grp', 'col'], ascending=False)
table = table.add_row_number('id')
grp_min = table.groupby('grp', gl.aggregate.MIN('id'))  # smallest row number per group
table = table.join(grp_min, 'grp', how='left')
table['rank'] = table['id'] - table['Min of id']
top_k = table[table['rank'] < 5]  # ranks 0-4, i.e. top 5 per group
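For comparison, the sort-then-head approach on toy data in pandas (group sizes and n=2 assumed for brevity):

```python
import pandas as pd

table = pd.DataFrame({'grp': ['a'] * 4 + ['b'] * 3,
                      'col': [5, 1, 9, 7, 2, 8, 6]})

# sort descending within each group, then keep the first n rows per group
top = (table.sort_values(['grp', 'col'], ascending=[True, False])
            .groupby('grp').head(2))
print(top['col'].tolist())  # → [9, 7, 8, 6]
```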

cumulative sum by group

SQL

Not standard, but can be done using Postgres windowed functions

SELECT grp, col, value,
    sum(value) OVER (PARTITION BY grp ORDER BY col) AS cum_value
FROM table
ORDER BY grp, col;

dplyr

table %>% group_by(grp) %>% mutate(cum_value=cumsum(value))

data.table

table[, cum_value := cumsum(value), by=grp]

pandas

table['cum_value'] = table.groupby('grp')['value'].cumsum()
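A runnable pandas sketch of the grouped running total (toy data assumed):

```python
import pandas as pd

table = pd.DataFrame({'grp': ['a', 'a', 'b', 'b'],
                      'value': [1, 2, 3, 4]})

# cumulative sum restarts within each group
table['cum_value'] = table.groupby('grp')['value'].cumsum()
print(table['cum_value'].tolist())  # → [1, 3, 3, 7]
```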

graphlab

No builtin; it may be possible to use the numpy integration described at http://forum.dato.com/discussion/comment/3566#Comment_3566
