Imports & Conventions:
import numpy as np
import pandas as pd
# df is a Pandas DataFrame
# series is a Pandas Series
- Keep only numerical columns (`exclude=['object']` would also keep boolean, datetime, and categorical columns) using
`df2 = df.select_dtypes(include='number')`
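A minimal sketch of how the two selectors differ, assuming a hypothetical toy DataFrame (the column names `age`, `fare`, `name`, and `survived` are made up for illustration):

```python
import pandas as pd

# Hypothetical toy frame with one column per common dtype.
df = pd.DataFrame({
    "age": [22, 35],                                   # integer
    "fare": [7.25, 71.28],                             # float
    "name": pd.Series(["Ann", "Bob"], dtype=object),   # object (strings)
    "survived": [True, False],                         # bool
})

# Keeps integer and float columns only.
numeric_only = df.select_dtypes(include="number")

# Drops object columns, but the bool column survives.
no_objects = df.select_dtypes(exclude=["object"])

print(list(numeric_only.columns))
print(list(no_objects.columns))
```

Which selector fits depends on whether boolean columns should count as numerical for the task at hand.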
- `df.explode(['column_to_explode'])`
transforms data of the form
Col1 | Col2 | column_to_explode |
---|---|---|
A | BBc | ['l1', 'l2'] |
Zxv | dfafa | ['l3'] |
into:
Col1 | Col2 | column_to_explode |
---|---|---|
A | BBc | l1 |
A | BBc | l2 |
Zxv | dfafa | l3 |
Note: in this example, each cell of `column_to_explode` initially held a list.
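The explode transformation can be reproduced with a small sketch that rebuilds the example table by hand. The `reset_index` step is an optional addition, not part of the original tip:

```python
import pandas as pd

# Rebuild the example table; each cell of column_to_explode holds a list.
df = pd.DataFrame({
    "Col1": ["A", "Zxv"],
    "Col2": ["BBc", "dfafa"],
    "column_to_explode": [["l1", "l2"], ["l3"]],
})

# One output row per list element; the other columns are repeated.
exploded = df.explode("column_to_explode")
print(exploded)

# The original index labels are repeated (0, 0, 1), so reset them
# if a clean 0..n-1 index is needed afterwards.
flat = exploded.reset_index(drop=True)
print(flat)
```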
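- Difference between two DataFrames:
`df1.compare(df2)`
(shows only the rows and columns where `df1` and `df2` differ, with the `df1` value under `self` and the `df2` value under `other`; both DataFrames must have identical row and column labels)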
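As a rough illustration of what `compare` returns, here is a sketch with two hypothetical DataFrames (the columns `a` and `b` and their values are invented for the example) that share the same labels and differ in two cells:

```python
import pandas as pd

# Two hypothetical frames with identical row and column labels.
df1 = pd.DataFrame({"a": [1, 2, 3], "b": ["x", "y", "z"]})
df2 = pd.DataFrame({"a": [1, 20, 3], "b": ["x", "y", "q"]})

# Only differing rows and columns are kept. Columns become a two-level
# index: (column name, "self") for df1 and (column name, "other") for df2.
# Cells that are equal in both frames appear as NaN.
diff = df1.compare(df2)
print(diff)
```

Row 0 is identical in both frames, so it does not appear in the result.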
- `pd.get_dummies(df)`
converts categorical data (i.e., columns whose values come from a fixed set of choices, e.g. male and female) into multiple dummy/indicator columns, one for each distinct value in the column, each holding `True` or `False` (0 or 1 in pandas versions before 2.0). Columns that are already numerical are left unchanged. This is called one-hot encoding. For example, this data:
Pclass | Sex |
---|---|
1 | male |
2 | female |
3 | female |
1 | male |
2 | male |
is transformed into:
Pclass | Sex_female | Sex_male |
---|---|---|
1 | False | True |
2 | True | False |
3 | True | False |
1 | False | True |
2 | False | True |
Notice that the numerical column `Pclass` is unchanged, while two new columns are created from the categorical column `Sex` (one for each unique value in the column).
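The one-hot encoding example can be reproduced with the sketch below, which rebuilds the `Pclass`/`Sex` table by hand. The `dtype=int` variant is an optional extra, not part of the original tip:

```python
import pandas as pd

# Rebuild the example table.
df = pd.DataFrame({
    "Pclass": [1, 2, 3, 1, 2],
    "Sex": ["male", "female", "female", "male", "male"],
})

# Pclass is numerical and passes through untouched; Sex is replaced by
# one indicator column per distinct value, named <column>_<value> and
# emitted in sorted order of the values.
encoded = pd.get_dummies(df)
print(encoded)

# Optional: request 0/1 integers instead of booleans.
encoded_int = pd.get_dummies(df, dtype=int)
print(encoded_int)
```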