Up to my ears in regression modeling

sachinsdate

sachinsdate / automobiles_for_statistical_sampling.csv
Created December 8, 2024 04:12
Autos dataset used by statistical_sampling.py. Data source: https://archive.ics.uci.edu/dataset/10/automobile
symboling,normalized_losses,make,fuel_type,aspiration,num_of_doors,body_style,drive_wheels,engine_location,wheel_base,length,width,height,curb_weight,engine_type,num_of_cylinders,engine_size,fuel_system,bore,stroke,compression_ratio,horsepower,peak_rpm,city_mpg,highway_mpg,price
3,,alfa-romero,gas,std,two,convertible,rwd,front,88.6,168.8,64.1,48.8,2548,dohc,4,130,mpfi,3.47,2.68,9,111,5000,21,27,13495
3,,alfa-romero,gas,std,two,convertible,rwd,front,88.6,168.8,64.1,48.8,2548,dohc,4,130,mpfi,3.47,2.68,9,111,5000,21,27,16500
1,,alfa-romero,gas,std,two,hatchback,rwd,front,94.5,171.2,65.5,52.4,2823,ohcv,6,152,mpfi,2.68,3.47,9,154,5000,19,26,16500
2,164,audi,gas,std,four,sedan,fwd,front,99.8,176.6,66.2,54.3,2337,ohc,4,109,mpfi,3.19,3.4,10,102,5500,24,30,13950
2,164,audi,gas,std,four,sedan,4wd,front,99.4,176.6,66.4,54.3,2824,ohc,5,136,mpfi,3.19,3.4,8,115,5500,18,22,17450
2,,audi,gas,std,two,sedan,fwd,front,99.8,177.3,66.3,53.1,2507,ohc,5,136,mpfi,3.19,3.4,8.5,110,5500,19,25,15250
1,158,audi,gas,std,four,sedan,fwd,fr
sachinsdate / statistical_sampling.py
Last active December 8, 2024 04:19
Experiments with simple random sampling, systematic sampling, and stratified random sampling. Download automobiles data from https://gist.github.com/sachinsdate/254be0cf9a631ffd943c0746e103cd85
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
import statsmodels.api as sm
import scipy.stats as stats
from sklearn.model_selection import train_test_split
###############################################################################
# Read the autos dataset into a DataFrame and carve out the subset of interest
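The three sampling schemes named in the gist description can be sketched as follows. This is a minimal, dependency-free illustration, not the gist's own code (the preview above is truncated before any sampling logic appears). The function names, the toy `rows` list with made-up prices, and the choice of body style as the stratification variable are all assumptions made for the example.

```python
import random
from collections import defaultdict

def simple_random_sample(rows, n, rng):
    """Draw n rows without replacement; every subset of size n is equally likely."""
    return rng.sample(rows, n)

def systematic_sample(rows, n, rng):
    """Pick a random start, then take every k-th row, with k = len(rows) // n."""
    k = len(rows) // n
    start = rng.randrange(k)
    return [rows[start + i * k] for i in range(n)]

def stratified_sample(rows, key, frac, rng):
    """Sample the same fraction from each stratum defined by key(row)."""
    strata = defaultdict(list)
    for row in rows:
        strata[key(row)].append(row)
    sample = []
    for members in strata.values():
        size = max(1, round(len(members) * frac))  # at least one row per stratum
        sample.extend(rng.sample(members, size))
    return sample

rng = random.Random(0)

# Hypothetical stand-in for the autos data: (body_style, price) pairs.
rows = (
    [("sedan", 10000 + 100 * i) for i in range(60)]
    + [("hatchback", 8000 + 100 * i) for i in range(30)]
    + [("convertible", 20000 + 100 * i) for i in range(10)]
)

srs = simple_random_sample(rows, 20, rng)
sys_s = systematic_sample(rows, 20, rng)
strat = stratified_sample(rows, key=lambda r: r[0], frac=0.2, rng=rng)
```

With a 20% fraction, the stratified sample keeps the 60/30/10 body-style mix of the toy data (12, 6, and 2 rows), whereas a simple random sample of the same size only matches that mix on average.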
sachinsdate / m_and_m_sample_weights.csv
Created September 12, 2024 02:51
Weight in grams of samples of 15 peanut M&Ms each. 60 samples.
Sample_ID Weight_In_GMS
1 36.03
2 33.51
3 34.55
4 34.23
5 35.73
6 31.45
7 35.03
8 36.2
9 36.96
No transaction_date house_age distance_to_the_nearest_mrt_station number_of_convenience_stores latitude longitude house_price_of_unit_area
1 2012.917 32 84.87882 10 24.98298 121.54024 37.9
2 2012.917 19.5 306.5947 9 24.98034 121.53951 42.2
3 2013.583 13.3 561.9845 5 24.98746 121.54391 47.3
4 2013.500 13.3 561.9845 5 24.98746 121.54391 54.8
5 2012.833 5 390.5684 5 24.97937 121.54245 43.1
6 2012.667 7.1 2175.03 3 24.96305 121.51254 32.1
7 2012.667 34.5 623.4731 7 24.97933 121.53642 40.3
8 2013.417 20.3 287.6025 6 24.98042 121.54228 46.7
9 2013.500 31.7 5512.038 1 24.95095 121.48458 18.8
sachinsdate / estimator_bias.py
Last active September 13, 2024 10:40
An illustration of estimation bias. CSV files referenced in the code are available as gists.
import math
import pandas as pd
import numpy as np
from patsy import dmatrices
import statsmodels.api as sm
import scipy.stats
from sklearn.model_selection import train_test_split
import matplotlib.pyplot as plt
from matplotlib.ticker import StrMethodFormatter
import seaborn as sns
import math
import pandas as pd
from patsy import dmatrices
import numpy as np
import scipy.stats
import statsmodels.api as sm
from sklearn.model_selection import train_test_split
import matplotlib.pyplot as plt
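As a small illustration of what estimation bias means, the sketch below uses a textbook case, which may or may not be the estimator studied in the gist: the sample variance computed with a divisor of n underestimates the true variance on average, while the divisor n - 1 does not. The function name `average_estimates`, the normal population, and the sample size are assumptions chosen for the example.

```python
import random
import statistics

def average_estimates(n, trials, rng, sigma=2.0):
    """Average two variance estimators over many normal samples of size n."""
    biased_sum = 0.0
    unbiased_sum = 0.0
    for _ in range(trials):
        sample = [rng.gauss(0.0, sigma) for _ in range(n)]
        biased_sum += statistics.pvariance(sample)   # divides by n
        unbiased_sum += statistics.variance(sample)  # divides by n - 1
    return biased_sum / trials, unbiased_sum / trials

rng = random.Random(42)
biased_avg, unbiased_avg = average_estimates(n=5, trials=20000, rng=rng)
true_var = 2.0 ** 2

print(f"true variance          : {true_var}")
print(f"divide-by-n average    : {biased_avg:.3f}")
print(f"divide-by-(n-1) average: {unbiased_avg:.3f}")
```

With n = 5, the divide-by-n average should settle near (n - 1)/n = 4/5 of the true variance, while the divide-by-(n - 1) average should settle near the true variance itself.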
sachinsdate / automobiles.data.csv
Created July 31, 2024 10:24
Automobile dataset from UCI ML Repository (http://archive.ics.uci.edu/dataset/10/automobile). 205 vehicles. 25 features. Added column names. Removed all '?'. Made available under CC BY 4.0
symboling,normalized_losses,make,fuel_type,aspiration,num_of_doors,body_style,drive_wheels,engine_location,wheel_base,length,width,height,curb_weight,engine_type,num_of_cylinders,engine_size,fuel_system,bore,stroke,compression_ratio,horsepower,peak_rpm,city_mpg,highway_mpg,price
3,,alfa-romero,gas,std,two,convertible,rwd,front,88.6,168.8,64.1,48.8,2548,dohc,4,130,mpfi,3.47,2.68,9,111,5000,21,27,13495
3,,alfa-romero,gas,std,two,convertible,rwd,front,88.6,168.8,64.1,48.8,2548,dohc,4,130,mpfi,3.47,2.68,9,111,5000,21,27,16500
1,,alfa-romero,gas,std,two,hatchback,rwd,front,94.5,171.2,65.5,52.4,2823,ohcv,6,152,mpfi,2.68,3.47,9,154,5000,19,26,16500
2,164,audi,gas,std,four,sedan,fwd,front,99.8,176.6,66.2,54.3,2337,ohc,4,109,mpfi,3.19,3.4,10,102,5500,24,30,13950
2,164,audi,gas,std,four,sedan,4wd,front,99.4,176.6,66.4,54.3,2824,ohc,5,136,mpfi,3.19,3.4,8,115,5500,18,22,17450
2,,audi,gas,std,two,sedan,fwd,front,99.8,177.3,66.3,53.1,2507,ohc,5,136,mpfi,3.19,3.4,8.5,110,5500,19,25,15250
1,158,audi,gas,std,four,sedan,fwd,fr
sachinsdate / boston_monthly_tmax_1998_2019.csv
Created June 26, 2024 10:27
Monthly average maximum temperature in Boston, MA
Date Monthly Average Maximum
1/15/1998 39.71
2/15/1998 40.97
3/15/1998 48.75
4/15/1998 56.74
5/15/1998 68.75
6/15/1998 72
7/15/1998 82.62
8/15/1998 80.2
9/15/1998 74.44
sachinsdate / southern_oscillations_standardized_long_may24.csv
Created June 26, 2024 10:25
The El Niño Southern Oscillation (ENSO) Index. Data source: NOAA
Date Y_t
1951-01-01 1.5
1951-02-01 0.9
1951-03-01 -0.1
1951-04-01 -0.3
1951-05-01 -0.7
1951-06-01 0.2
1951-07-01 -1.0
1951-08-01 -0.2
1951-09-01 -1.1
sachinsdate / pacf.py
Created June 21, 2024 11:55
Partial Auto-Correlation
import pandas as pd
import numpy as np
from matplotlib import pyplot as plt
import matplotlib.dates as mdates
from statsmodels.graphics.tsaplots import plot_pacf
from statsmodels.graphics.tsaplots import plot_acf
from statsmodels.tsa.stattools import pacf
from statsmodels.tsa.stattools import acf
import statsmodels.api as sm
from patsy import dmatrices
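The partial autocorrelation at lag k is the correlation between y_t and y_{t-k} that remains after the linear effect of the intermediate lags has been removed. The sketch below computes it from sample autocorrelations with the Durbin-Levinson recursion, in pure Python, on a simulated AR(1) series. It is an illustration under stated assumptions, not the gist's code: the function names and the simulated series are hypothetical, and in practice the `pacf` function from `statsmodels.tsa.stattools` imported above would be the usual choice (its estimates can differ slightly depending on the estimation method selected).

```python
import random

def sample_acf(y, max_lag):
    """Sample autocorrelations r_0..r_max_lag (r_0 is always 1)."""
    n = len(y)
    mean = sum(y) / n
    dev = [v - mean for v in y]
    c0 = sum(d * d for d in dev)
    return [
        sum(dev[t] * dev[t - k] for t in range(k, n)) / c0
        for k in range(max_lag + 1)
    ]

def pacf_durbin_levinson(y, max_lag):
    """Partial autocorrelations at lags 0..max_lag via Durbin-Levinson."""
    r = sample_acf(y, max_lag)
    pacf_vals = [1.0]
    phi_prev = []  # coefficients of the fitted AR(k-1) model
    for k in range(1, max_lag + 1):
        num = r[k] - sum(phi_prev[j] * r[k - 1 - j] for j in range(k - 1))
        den = 1.0 - sum(phi_prev[j] * r[j + 1] for j in range(k - 1))
        phi_kk = num / den  # last AR(k) coefficient = PACF at lag k
        phi_curr = [
            phi_prev[j] - phi_kk * phi_prev[k - 2 - j] for j in range(k - 1)
        ]
        phi_curr.append(phi_kk)
        pacf_vals.append(phi_kk)
        phi_prev = phi_curr
    return pacf_vals

# Simulated AR(1) series y_t = 0.7 * y_{t-1} + e_t (hypothetical test data).
rng = random.Random(1)
y = [0.0]
for _ in range(4999):
    y.append(0.7 * y[-1] + rng.gauss(0.0, 1.0))

p = pacf_durbin_levinson(y, max_lag=5)
print([round(v, 3) for v in p])
```

For an AR(1) process, the PACF should be close to the AR coefficient at lag 1 and close to zero at every higher lag, which is the cutoff pattern a PACF plot is typically used to detect.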