df.info()
- gives a column-by-column summary of dtypes and the number of non-null entries
df.sample(n)
- returns a random sample of n rows from the df. Good for spotting data-quality problems. Default is n=1.
df.head(n)
- returns the first n rows of the dataframe
df.tail(n)
- returns the last n rows of the dataframe
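A minimal sketch tying the four quick-look methods together; the toy df is hypothetical:

    import pandas as pd

    df = pd.DataFrame({"a": [1, 2, None, 4], "b": ["w", "x", "y", "z"]})
    df.info()             # dtypes and non-null counts, printed directly
    print(df.sample(2))   # 2 random rows
    print(df.head(2))     # first 2 rows
    print(df.tail(2))     # last 2 rows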
df.describe(percentiles=[.25, .5, .75], include=None, exclude=None, datetime_is_numeric=False)
- percentiles: defaults to those shown here, but you can supply any you wish.
- include: defaults to numeric cols only; can be include='all' to show all cols, or a list of the dtypes you wish to show.
- exclude: works in a complementary fashion.
- datetime_is_numeric: if True, treats datetime cols as numeric and includes them in the otherwise default call.
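A short sketch of the percentiles/include options, assuming a toy mixed-dtype df:

    import pandas as pd

    df = pd.DataFrame({"num": [1, 2, 3, 4], "cat": ["a", "a", "b", "b"]})
    print(df.describe())                      # numeric cols only, default percentiles
    print(df.describe(include="all"))         # numeric and object cols together
    print(df.describe(percentiles=[.1, .9]))  # custom percentiles (the median is always shown)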
df.col.value_counts(normalize=False, sort=True, ascending=False, bins=None, dropna=True)
- returns the values in the column and the number of times each appears. Descending sort is the default.
- normalize=True returns relative frequencies rather than counts.
- bins takes an integer and only works with numeric data.
- dropna=False if you want to see the effect of NaNs in your col.
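A quick sketch of the main options on a hypothetical Series:

    import numpy as np
    import pandas as pd

    s = pd.Series(["a", "a", "b", np.nan])
    print(s.value_counts())                # counts, descending, NaN dropped
    print(s.value_counts(normalize=True))  # relative frequencies instead of counts
    print(s.value_counts(dropna=False))    # NaN shows up as its own row

    nums = pd.Series([1, 2, 2, 9, 10])
    print(nums.value_counts(bins=2))       # bins numeric data into 2 equal-width intervals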
df_new = df.copy()
- produces a copy of df under a different name. Essential before cleaning a df so that you don't lose the original.
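A tiny sketch (hypothetical names) showing that edits to the copy leave the original untouched:

    import pandas as pd

    df = pd.DataFrame({"a": [1, 2, 3]})
    df_new = df.copy()              # deep copy by default, i.e. copy(deep=True)
    df_new["a"] = df_new["a"] * 10
    print(df["a"].tolist())         # [1, 2, 3] -- original unchanged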
df1 = pd.concat([df1, df2], ignore_index=True)
- stacks one df on top of another, matching cols by name. You can assign the result to a new name or, as here, overwrite one of the original dfs.
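A minimal sketch with two toy dfs:

    import pandas as pd

    df1 = pd.DataFrame({"a": [1, 2], "b": [3, 4]})
    df2 = pd.DataFrame({"a": [5, 6], "b": [7, 8]})
    df1 = pd.concat([df1, df2], ignore_index=True)  # index runs 0..3 instead of repeating 0..1
    print(df1)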
df.col1.corr(df.col2)
- returns the correlation between two columns (Pearson by default; see the method parameter for others)
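Sketch with a hypothetical, perfectly correlated pair of columns:

    import pandas as pd

    df = pd.DataFrame({"col1": [1, 2, 3, 4], "col2": [2, 4, 6, 8]})
    print(df.col1.corr(df.col2))  # Pearson by default -> 1.0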
df.isna().sum()
- returns a Series indexed by column with a count of the nulls in each col
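For example, on a toy df with scattered NaNs:

    import numpy as np
    import pandas as pd

    df = pd.DataFrame({"a": [1, np.nan, 3], "b": [np.nan, np.nan, "x"]})
    print(df.isna().sum())  # a: 1, b: 2 -- null count per column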
Ordering Categories in cols: df["a"] = pd.Categorical(df["a"], categories=["Cat1", "Cat2", "Cat3", "Cat4"], ordered=True)
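Once ordered, sorting and comparisons follow the declared category order rather than alphabetical order. A sketch using the same hypothetical labels:

    import pandas as pd

    df = pd.DataFrame({"a": ["Cat3", "Cat1", "Cat2"]})
    df["a"] = pd.Categorical(df["a"], categories=["Cat1", "Cat2", "Cat3", "Cat4"], ordered=True)
    print(df.sort_values("a"))  # sorts by the declared category order
    print(df["a"].min())        # ordered categoricals support min/max -> Cat1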