Summary of "pandas Foundations" course on DataCamp (https://gist.github.com/misho-kr/873ddcc2fc89f1c96414de9e0a58e0fe)

pandas DataFrames are the most widely used in-memory representation of complex data collections within Python. Whether in finance, a scientific field, or data science, familiarity with pandas is essential. This course teaches you to work with real-world datasets containing both string and numeric data, often structured around time series. You will learn powerful analysis, selection, and visualization techniques in this course.

Led by Team Anaconda, Data Science Consultant at Lander Analytics

Data ingestion & inspection

Use pandas to import and inspect a variety of datasets, ranging from population data obtained from the World Bank to monthly stock data obtained via Yahoo Finance. Build DataFrames from scratch and become familiar with the intrinsic data visualization capabilities of pandas.

  • Pandas DataFrames
    • Indexes and Columns
    • Slicing, head() and tail()
    • Broadcasting -- assigning a scalar value to a column slice broadcasts that value to every row
  • Pandas Series
  • Building DataFrames
    • CSV file
      • Omitting the header, setting column names, na_values -- see the sketch after the code below
      • Parse dates
    • Python dictionary
      • Broadcasting with a dictionary
import pandas as pd

# Read a CSV file, using its first column as the DataFrame index
users = pd.read_csv('datasets/users.csv', index_col=0)

# Build a DataFrame from lists of labels and columns via a dict
list_labels = ['city', 'signups']                # column names (example values)
list_cols = [['Austin', 'Dallas'], [7, 12]]      # column data (example values)
zipped = list(zip(list_labels, list_cols))
data = dict(zipped)
users = pd.DataFrame(data)
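A minimal sketch of the read_csv options and dictionary broadcasting listed above; the file name, column names, and values are placeholders:

# Read a headerless CSV: supply column names, mark sentinel values as NaN,
# and parse date columns (placeholder file and column names)
df = pd.read_csv('datasets/messy.csv', header=None,
                 names=['date', 'city', 'signups'],
                 na_values={'signups': ['-1']},
                 parse_dates=['date'])

# Broadcasting with a dictionary: the scalar 'TX' is repeated for every row
cities = ['Austin', 'Dallas', 'Houston']
data = {'city': cities, 'state': 'TX'}
tx = pd.DataFrame(data)

# Broadcasting a scalar to a column slice assigns it to each selected row
tx.loc[:, 'country'] = 'US'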
  • Inspecting with df.info()
  • Using dates as index
  • Trimming redundant columns
# Use the parsed date column as the index, and rename the index
sunspots.index = sunspots['year_month_day']
sunspots.index.name = 'date'

# Keep only the columns of interest
cols = ['sunspots', 'definite']
sunspots = sunspots[cols]

# Inspect rows 10-19 by position
sunspots.iloc[10:20, :]
  • Writing files with df.to_csv() and df.to_excel() -- see the sketch after this list
  • Plotting arrays, series and data frames
import pandas as pd
import matplotlib.pyplot as plt

aapl = pd.read_csv('aapl.csv', index_col='date', parse_dates=True)

plt.plot(aapl['close'].values)   # NumPy array: integer positions on the x-axis
plt.plot(aapl['close'])          # pandas Series: dates from the index on the x-axis
aapl['close'].plot()             # Series .plot(): dates plus axis labels
plt.plot(aapl)                   # plot every column of the DataFrame against the index
aapl.plot()                      # DataFrame .plot(): one line per column, with legend

plt.savefig('aapl.png')          # write the figure to disk
plt.show()
  • Customizing the plots -- colors, style, labels, legends, ticks, scales
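A minimal sketch, continuing the aapl example above, of writing data back out with to_csv()/to_excel() and of the customization options just listed; the output file names are placeholders, and to_excel() needs an Excel writer (e.g. openpyxl) installed.

# Write the DataFrame back to disk (placeholder file names)
aapl.to_csv('aapl_out.csv')
aapl.to_excel('aapl_out.xlsx')

# Customize the plot: color, line style, legend, log scale, labels
aapl['close'].plot(color='b', style='.-', legend=True, logy=True)
plt.xlabel('Date')
plt.ylabel('Closing price (USD)')
plt.title('AAPL closing price')
plt.show()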

Exploratory data analysis

Explore data visually and quantitatively. Exploratory data analysis (EDA) is a crucial component of any data science project. pandas has powerful methods that help with statistical and visual EDA.

  • Visual
    • Plots: scatter, line, box, histogram -- see the sketch after the code below
    • Histogram options: bins, range, normalized-to-one, cumulative
    • CDF: Cumulative Distribution Function
# Normalized cumulative histogram = empirical CDF
iris.plot(y='sepal_length', kind='hist', bins=30,
          range=(4, 8), cumulative=True, density=True)   # 'density' replaces the removed 'normed'
plt.xlabel('sepal length (cm)')
plt.title('Cumulative distribution function (CDF)')
plt.show()
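A minimal sketch of the other plot kinds listed above (scatter and box), assuming the same iris DataFrame and its usual sepal/petal columns:

# Scatter plot of two numeric columns
iris.plot(x='sepal_length', y='sepal_width', kind='scatter')
plt.xlabel('sepal length (cm)')
plt.ylabel('sepal width (cm)')
plt.show()

# Box plots of selected columns
iris.plot(y=['petal_length', 'petal_width'], kind='box')
plt.show()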
  • Statistical (see the summary-statistics sketch at the end of this section):
    • describe, count, average, standard deviation
    • ranges, inter-quartile range
    • percentiles: 25, 50, 75
    • unique
indices = iris['species'] == 'setosa'
setosa = iris.loc[indices,:] # extract new DataFrame
  • Computing errors
import numpy as np

describe_all = iris.describe()
describe_setosa = setosa.describe()   # setosa extracted above

# Percentage difference of setosa statistics relative to the full dataset
error_setosa = 100 * np.abs(describe_setosa - describe_all)
error_setosa = error_setosa / describe_setosa

print(error_setosa)
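A minimal sketch of the summary-statistics methods listed above (describe(), count/mean/std, percentiles and IQR, unique), again on the iris DataFrame:

# Summary statistics: count, mean, std, min, quartiles, max
print(iris.describe())

# Individual statistics and percentiles
print(iris['sepal_length'].count(), iris['sepal_length'].mean(), iris['sepal_length'].std())
print(iris['sepal_length'].quantile([0.25, 0.5, 0.75]))   # 25th/50th/75th percentiles
iqr = iris['sepal_length'].quantile(0.75) - iris['sepal_length'].quantile(0.25)

# Unique values of a categorical column
print(iris['species'].unique())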

Time series in pandas

Manipulate and visualize time series data using pandas: upsampling, downsampling, and interpolation. Use method chaining to efficiently filter data and perform time series analyses. From stock prices to flight timings, time series data can be found in a wide variety of domains.

  • Using pandas to read datetime objects, specify parse_dates=True
  • Partial datetime string selection
  • Convert strings to datetime with pd.to_datetime()
  • Reindex and fill missing values
sales.loc['2015-02-19 11:00:00', 'Company']   # exact timestamp
sales.loc['February 5, 2015']                 # whole day
sales.loc['2015-Feb-5']                       # whole day, alternate format
sales.loc['2015-2']                           # whole month
sales.loc['2015']                             # whole year
sales.loc['2015-2-16':'2015-2-20']            # date range

evening_2_11 = pd.to_datetime(['2015-2-11 20:00', '2015-2-11 21:00',
                               '2015-2-11 22:00', '2015-2-11 23:00'])
sales.reindex(evening_2_11, method='ffill')   # fill missing rows from prior values
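The sales snippets in this chapter assume a DataFrame with a DatetimeIndex and Company/Product/Units columns; a minimal, hypothetical construction (made-up values) so they can be run:

import pandas as pd

sales = pd.DataFrame({
    'Date': pd.to_datetime(['2015-02-05 09:00', '2015-02-11 21:00',
                            '2015-02-16 12:30', '2015-02-19 11:00']),
    'Company': ['Hooli', 'Initech', 'Acme', 'Mediacore'],
    'Product': ['Software', 'Hardware', 'Software', 'Service'],
    'Units': [3, 7, 2, 5],
})

# Index by date, keeping 'Date' as a column for the .str/.dt examples below
sales.index = sales['Date']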
  • Resampling time series data
    • Statistical methods over different time intervals
    • Method chaining: mean(), sum(), count(), etc.
    • Downsampling to reduce datetime rows to slower frequency
    • Upsampling to increase datetime rows to faster frequency
daily_mean = sales.resample('D').mean(numeric_only=True)   # downsample: daily averages of numeric columns
sales.loc[:, 'Units'].resample('2W').sum()                 # downsample: two-week totals
two_days = sales.loc['2015-2-16':'2015-2-19', 'Units']     # a short slice to upsample
two_days.resample('4H').ffill()                            # upsample to 4-hour bins, forward-filling values
  • String methods
  • Datetime methods
  • Set and convert timezone
  • Interpolate missing data with interpolate() -- see the sketch after the code below
sales['Company'].str.upper()              # vectorized string method
sales['Product'].str.contains('ware')     # Boolean: substring match

sales['Date'].dt.hour                     # extract the hour component
central = sales['Date'].dt.tz_localize('US/Central')   # attach a timezone
central.dt.tz_convert('US/Eastern')       # convert to another timezone
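A minimal sketch of interpolate() from the bullet above, on a small hourly series with gaps (made-up values):

import numpy as np
import pandas as pd

ts = pd.Series([10.0, np.nan, np.nan, 16.0],
               index=pd.date_range('2015-02-11 20:00', periods=4, freq='h'))
print(ts.interpolate(method='linear'))   # NaNs filled linearly: 10, 12, 14, 16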
  • Time series visualization
sp500 = pd.read_csv('sp500.csv', parse_dates=True, index_col='Date')
sp500.loc['2012-4-1':'2012-4-7', 'Close'].plot(title='S&P 500')
plt.ylabel('Closing Price (US Dollars)')
plt.show()

sp500['Close'].plot(kind='area', title='S&P 500')
plt.ylabel('Closing Price (US Dollars)')
plt.show()

sp500.loc['2012', ['Close','Volume']].plot(subplots=True)
plt.show()

Case Study - Sunlight in Austin

Working with real-world weather and climate data, you will use pandas to manipulate the data into a usable form for analysis and systematically explore it using the techniques you’ve learned. A brief sketch of that workflow follows the dataset list below.

  1. Climate normals of Austin, TX from 1981-2010
  2. Weather data of Austin, TX from 2011
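A minimal, hypothetical sketch of that workflow (the file name, column names, and formats are placeholders, not the course's actual files): read the raw 2011 measurements, build a datetime index, clean a numeric column, and downsample for comparison against the climate normals.

import pandas as pd

# Placeholder file and column names for the 2011 Austin weather data
df = pd.read_csv('weather_2011.csv', header=None,
                 names=['date', 'hour', 'dry_bulb_faren', 'dew_point_faren'])

# Combine date and hour into a DatetimeIndex (assumes dates like 20110101 and hours 0-23)
date_string = df['date'].astype(str) + ' ' + df['hour'].astype(str)
df.index = pd.to_datetime(date_string, format='%Y%m%d %H')

# Coerce a messy numeric column to floats; non-numeric entries become NaN
df['dry_bulb_faren'] = pd.to_numeric(df['dry_bulb_faren'], errors='coerce')

# Daily mean temperature, ready to compare against the 1981-2010 climate normals
daily_temp_2011 = df['dry_bulb_faren'].resample('D').mean()
print(daily_temp_2011.head())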