misho-kr/Data Manipulation with pandas.md

## Data Manipulation with pandas.md

      
    Data Manipulation with pandas

pandas is the world's most popular Python library, used for everything from data manipulation to data analysis. Learn how to manipulate DataFrames, as you extract, filter, and transform real-world datasets for analysis. Using real-world data, including Walmart sales figures and global temperature time series, you’ll learn how to import, clean, calculate statistics, and create visualizations—using pandas!
Led by Maggie Matsui, Data Scientist at DataCamp
Transforming Data

Inspect DataFrames and perform fundamental manipulations, including sorting rows, subsetting, and adding new columns

Exploring DataFrames with .head(), .tail(), .info(), .describe() and .shape
Viewing components with .values, .columns and .index
There should be one -- and preferably only one -- obvious way to do it
Sorting, subsetting columns and rows, adding new columns

> dogs.sort_values("weight_kg")
> dogs.sort_values(["weight_kg", "height_cm"], ascending=[True, False])
> dogs[["breed", "height_cm"]]
> dogs[dogs["height_cm"] > 50]
> dogs[dogs["color"].isin(["Black", "Brown"])]
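
The topic list above mentions adding new columns, which the one-liners do not show. The following is a minimal sketch of inspecting, extending, and filtering a DataFrame. The sample rows and the derived columns `height_m` and `bmi` are made up for illustration and are not taken from the course data.

```python
import pandas as pd

# Hypothetical sample data; column names mirror the dogs examples above.
dogs = pd.DataFrame({
    "name": ["Bella", "Charlie", "Lucy", "Cooper"],
    "breed": ["Labrador", "Poodle", "Chow Chow", "Schnauzer"],
    "color": ["Brown", "Black", "Brown", "Grey"],
    "height_cm": [56, 43, 46, 49],
    "weight_kg": [25, 23, 22, 17],
})

# Inspect the structure before changing anything.
print(dogs.shape)    # (rows, columns) tuple, an attribute rather than a method
print(dogs.columns)  # column labels
dogs.info()          # dtypes and non-null counts

# Add new columns by assigning to a new column label.
dogs["height_m"] = dogs["height_cm"] / 100
dogs["bmi"] = dogs["weight_kg"] / dogs["height_m"] ** 2

# Combine boolean conditions with & and keep each one in its own variable.
is_black_or_brown = dogs["color"].isin(["Black", "Brown"])
is_tall = dogs["height_cm"] > 45
tall_dark_dogs = dogs[is_black_or_brown & is_tall].sort_values(
    "weight_kg", ascending=False
)
print(tall_dark_dogs[["name", "color", "height_cm", "weight_kg"]])
```

Note that `.isin()` on its own returns a boolean Series; it only subsets rows once it is passed back into `dogs[...]`.
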
Aggregating Data

Calculate summary statistics on DataFrame columns, and master grouped summary statistics and pivot tables

Summarizing with:

mean(), median(), mode(), min(), max(), sum(), var(), std(), quantile()
cumsum(), cummin(), cummax(), cumprod()


Counting

drop_duplicates(), value_counts()


Grouped summary statistics with groupby()
Pivot tables

They are just DataFrames with sorted indexes
Filling missing values
Summing


> dogs["date_of_birth"].min()
> dogs[["weight_kg", "height_cm"]].agg(np.min)
>
> vet_visits.drop_duplicates(subset="name")
> vet_visits.drop_duplicates(subset=["name", "breed"]).value_counts(sort=True, normalize=True)
> 
> dogs.groupby(["color", "breed"])["weight_kg"].agg([min, max, sum]).mean()
> dogs.pivot_table(values="weight_kg", index="color", aggfunc=[np.mean, np.median])
> 
> dogs.groupby(["color", "breed"])["weight_kg"].mean()
> dogs.pivot_table(values="weight_kg", index="color", columns="breed", fill_value=0, margins=True)
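
To make the difference between grouped summaries and pivot tables concrete, here is a small sketch on made-up data (the rows are hypothetical). It assumes `fill_value` and `margins` are passed to `pivot_table()`, since `groupby()` does not accept them.

```python
import pandas as pd

# Hypothetical sample data for illustration only.
dogs = pd.DataFrame({
    "name": ["Bella", "Charlie", "Lucy", "Cooper", "Max"],
    "breed": ["Labrador", "Poodle", "Labrador", "Schnauzer", "Poodle"],
    "color": ["Brown", "Black", "Black", "Grey", "Black"],
    "weight_kg": [25, 23, 27, 17, 21],
})

# Cumulative statistics return one value per row.
running_total = dogs["weight_kg"].cumsum()  # 25, 48, 75, 92, 113

# Counting: proportions per category instead of raw counts.
breed_share = dogs["breed"].value_counts(normalize=True)

# Several summaries at once; string names avoid passing functions around.
weight_stats = dogs.groupby("color")["weight_kg"].agg(["min", "max", "sum"])

# Grouped mean: the result is a Series with a two-level index.
mean_by_group = dogs.groupby(["color", "breed"])["weight_kg"].mean()

# The same means as a pivot table: colors as rows, breeds as columns.
# fill_value replaces missing combinations, margins adds "All" totals.
mean_pivot = dogs.pivot_table(
    values="weight_kg",
    index="color",
    columns="breed",
    fill_value=0,
    margins=True,
)
print(mean_pivot)
```

The "All" row and column hold the mean over the underlying rows, not the mean of the displayed cell means, so they are unaffected by the zeros that `fill_value` puts in.
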
Slicing and indexing

Indexes are supercharged row and column names. Learn how they can be combined with slicing for powerful DataFrame subsetting.

Explicit Indexes: .columns and .index

Setting a column as index
Removing, dropping and sorting an index


Multi-level indexes a.k.a. hierarchical indexes
Indexes make subsetting simpler

Index values are just data
Indexes violate "tidy data" principles
You need to learn two syntaxes


Slicing and subsetting with .loc and .iloc

Sort the index before you slice
Slicing columns and slicing twice
Slicing by dates
Subsetting by row/column number


Working with pivot tables

They are just DataFrames with sorted indexes
Yet they are a special case, since every column contains the same data type
The axis argument
Calculating summary stats across columns


> dogs_ind = dogs.set_index("name")
> dogs_ind.reset_index()
> dogs_ind.reset_index(drop=True)
> dogs_ind3 = dogs.set_index(["breed", "color"])
> dogs_ind3.loc[["Labrador", "Chihuahua"]]
> dogs_ind3.loc[[("Labrador", "Brown"), ("Chihuahua", "Tan")]]
> dogs_ind3.sort_index(level=["color", "breed"], ascending=[True, False])
> 
> dogs_srt = dogs.set_index(["breed", "color"]).sort_index()
> dogs_srt.loc["Chow Chow":"Poodle"]
> dogs_srt.loc[("Labrador", "Brown"):("Schnauzer", "Grey")]
> dogs_srt.loc[:, "name":"height_cm"]
> dogs_srt.loc[("Labrador", "Brown"):("Schnauzer", "Grey"), "name":"height_cm"]
> dogs.set_index("date_of_birth").sort_index().loc["2014-08-25":"2016-09-16"]
> print(dogs.iloc[2:5, 1:4])
> 
> dogs_height_by_breed_vs_color = dog_pack.pivot_table("height_cm", index="breed", columns="color")
> dogs_height_by_breed_vs_color.loc["Chow Chow":"Poodle"]
> dogs_height_by_breed_vs_color.mean(axis="index")
> dogs_height_by_breed_vs_color.mean(axis="columns")
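
The slicing rules above are easier to see on a tiny dataset. This is a sketch on hypothetical rows, assuming a `date_of_birth` column is available to serve as a date index; it shows why the index is sorted first and that label slices with `.loc` include the end point, while `.iloc` slices do not.

```python
import pandas as pd

# Hypothetical sample data for illustration only.
dogs = pd.DataFrame({
    "name": ["Bella", "Charlie", "Lucy", "Cooper", "Max"],
    "breed": ["Labrador", "Poodle", "Chow Chow", "Schnauzer", "Labrador"],
    "color": ["Brown", "Black", "Brown", "Grey", "Black"],
    "height_cm": [56, 43, 46, 49, 59],
    "date_of_birth": pd.to_datetime(
        ["2013-07-01", "2016-09-16", "2014-08-25", "2011-12-11", "2017-01-20"]
    ),
})

# Sort the index before slicing so that label ranges are well defined.
dogs_srt = dogs.set_index(["breed", "color"]).sort_index()

# Outer-level slice: "Poodle" is included, unlike a Python list slice.
outer_slice = dogs_srt.loc["Chow Chow":"Poodle"]

# Inner levels need tuples; rows and columns can be sliced in one call.
tuple_slice = dogs_srt.loc[
    ("Labrador", "Brown"):("Schnauzer", "Grey"), "name":"height_cm"
]

# Slicing by dates works once the dates are the (sorted) index.
dogs_by_date = dogs.set_index("date_of_birth").sort_index()
born_in_range = dogs_by_date.loc["2014-08-25":"2016-09-16"]

# Position-based slicing excludes the end: rows 2-4, columns 1-3.
by_position = dogs.iloc[2:5, 1:4]

# The axis argument: "index" summarizes down each column,
# "columns" summarizes across each row.
height_pivot = dogs.pivot_table(values="height_cm", index="breed", columns="color")
mean_per_color = height_pivot.mean(axis="index")
mean_per_breed = height_pivot.mean(axis="columns")
print(mean_per_color)
print(mean_per_breed)
```
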
Creating and Visualizing DataFrames

Visualize the contents of your DataFrames, handle missing data values, and import data from and export data to CSV files

Plots

Histograms, Bar plots, Line plots, Scatter plots
Layering plots, legend, grid, transparency ...


Missing values

Detecting, counting, removing, replacing


Creating DataFrames

From a list of dictionaries
From a dictionary of lists


Reading and writing CSVs

> dog_pack["height_cm"].hist(bins=20, alpha=0.7)
> avg_weight_by_breed = dog_pack.groupby("breed")["weight_kg"].mean()
> avg_weight_by_breed.plot(kind="bar")
> sully.plot(x="date", y="weight_kg", kind="line", rot=45)
> dog_pack.plot(x="height_cm", y="weight_kg", kind="scatter")
> 
> dogs.isna().any()
> dogs.isna().sum()
> dogs.isna().sum().plot(kind="bar")
> dogs.dropna()
> dogs.fillna(0)
>
> new_dogs = pd.read_csv("new_dogs.csv")
> new_dogs.to_csv("new_dogs_with_bmi.csv")
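
The two construction styles and the missing-value helpers listed above can be sketched together as follows. The rows are invented for illustration, and an in-memory `io.StringIO` buffer stands in for a CSV file so the example does not touch the disk; with a real file you would pass a path instead.

```python
import io

import numpy as np
import pandas as pd

# From a list of dictionaries: built row by row.
rows = [
    {"name": "Ginger", "breed": "Dachshund", "height_cm": 22, "weight_kg": 10.0},
    {"name": "Scout", "breed": "Dalmatian", "height_cm": 59, "weight_kg": np.nan},
]
new_dogs = pd.DataFrame(rows)

# From a dictionary of lists: built column by column.
columns = {
    "name": ["Ginger", "Scout"],
    "breed": ["Dachshund", "Dalmatian"],
    "height_cm": [22, 59],
    "weight_kg": [10.0, np.nan],
}
same_dogs = pd.DataFrame(columns)

# Detecting and counting missing values per column.
missing_per_column = new_dogs.isna().sum()

# Removing rows with any missing value.
complete_rows = new_dogs.dropna()

# Replacing missing values, here with the column mean (NaN is skipped).
filled = new_dogs.fillna({"weight_kg": new_dogs["weight_kg"].mean()})

# Round trip through CSV text; index=False avoids writing the row index
# as an extra unnamed column.
buffer = io.StringIO()
filled.to_csv(buffer, index=False)
buffer.seek(0)
round_trip = pd.read_csv(buffer)
print(round_trip)
```

Without `index=False`, `to_csv()` also writes the row index, which then comes back from `read_csv()` as an extra unnamed first column.
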