df.info()
- gives a column-by-column summary of dtypes and the number of non-null entries
df.sample(n)
- returns a random sample of n rows from the df. Good for spotting data-quality problems. Default is n=1.
df.head(n)
- returns the first n rows of the dataframe
df.tail(n)
- returns the last n rows of the dataframe
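A minimal sketch tying the four quick-look methods together; the toy df is hypothetical:

    import pandas as pd

    df = pd.DataFrame({"a": [1, 2, None, 4], "b": ["w", "x", "y", "z"]})
    df.info()             # dtypes and non-null counts, printed directly
    print(df.sample(2))   # 2 random rows
    print(df.head(2))     # first 2 rows
    print(df.tail(2))     # last 2 rows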
df.describe(percentiles=[.25, .5, .75], include=None, exclude=None, datetime_is_numeric=False)
- percentiles: defaults to those shown here, but you can supply any you wish.
- include: defaults to numeric cols only; can be include='all' to show all cols, or a list of the dtypes you wish to show.
- exclude: works in a complementary fashion.
- datetime_is_numeric: if True, treats datetime cols as numeric and includes them in the otherwise default call.
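A short sketch of the percentiles/include options, assuming a toy mixed-dtype df:

    import pandas as pd

    df = pd.DataFrame({"num": [1, 2, 3, 4], "cat": ["a", "a", "b", "b"]})
    print(df.describe())                      # numeric cols only, default percentiles
    print(df.describe(include="all"))         # numeric and object cols together
    print(df.describe(percentiles=[.1, .9]))  # custom percentiles (the median is always shown)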
df.col.value_counts(normalize=False, sort=True, ascending=False, bins=None, dropna=True)
- returns the values in the column and the number of times each appears. Descending sort is the default.
- normalize=True returns relative frequencies rather than counts.
- bins takes an integer and only works with numeric data.
- dropna=False if you want to see the effect of NaNs in your col.
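A quick sketch of the main options on a hypothetical Series:

    import numpy as np
    import pandas as pd

    s = pd.Series(["a", "a", "b", np.nan])
    print(s.value_counts())                # counts, descending, NaN dropped
    print(s.value_counts(normalize=True))  # relative frequencies instead of counts
    print(s.value_counts(dropna=False))    # NaN shows up as its own row

    nums = pd.Series([1, 2, 2, 9, 10])
    print(nums.value_counts(bins=2))       # bins numeric data into 2 equal-width intervals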
df_new = df.copy()
- produces a copy of df under a different name. Essential before cleaning a df so that you don't lose the original.
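A tiny sketch (hypothetical names) showing that edits to the copy leave the original untouched:

    import pandas as pd

    df = pd.DataFrame({"a": [1, 2, 3]})
    df_new = df.copy()              # deep copy by default, i.e. copy(deep=True)
    df_new["a"] = df_new["a"] * 10
    print(df["a"].tolist())         # [1, 2, 3] -- original unchanged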
df1 = pd.concat([df1, df2], ignore_index=True)
- stacks one df on top of another, matching cols by name. You can assign the result to a new name or, as here, overwrite one of the original dfs.
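A minimal sketch with two toy dfs:

    import pandas as pd

    df1 = pd.DataFrame({"a": [1, 2], "b": [3, 4]})
    df2 = pd.DataFrame({"a": [5, 6], "b": [7, 8]})
    df1 = pd.concat([df1, df2], ignore_index=True)  # index runs 0..3 instead of repeating 0..1
    print(df1)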
df.col1.corr(df.col2)
- returns the correlation between two columns (Pearson by default; see the method parameter for others)
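Sketch with a hypothetical, perfectly correlated pair of columns:

    import pandas as pd

    df = pd.DataFrame({"col1": [1, 2, 3, 4], "col2": [2, 4, 6, 8]})
    print(df.col1.corr(df.col2))  # Pearson by default -> 1.0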
df.isna().sum()
- returns a Series indexed by column with a count of the nulls in each col
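For example, on a toy df with scattered NaNs:

    import numpy as np
    import pandas as pd

    df = pd.DataFrame({"a": [1, np.nan, 3], "b": [np.nan, np.nan, "x"]})
    print(df.isna().sum())  # a: 1, b: 2 -- null count per column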
Ordering Categories in cols: df["a"] = pd.Categorical(df["a"], categories=["Cat1", "Cat2", "Cat3", "Cat4"], ordered=True)
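Once ordered, sorting and comparisons follow the declared category order rather than alphabetical order. A sketch using the same hypothetical labels:

    import pandas as pd

    df = pd.DataFrame({"a": ["Cat3", "Cat1", "Cat2"]})
    df["a"] = pd.Categorical(df["a"], categories=["Cat1", "Cat2", "Cat3", "Cat4"], ordered=True)
    print(df.sort_values("a"))  # sorts by the declared category order
    print(df["a"].min())        # ordered categoricals support min/max -> Cat1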