Aishwarya aish-anand

# train a random forest classifier and look at feature importances
# (X_train, X_test, y_train, y_test are assumed to exist from the earlier split)
import pandas as pd
from sklearn.ensemble import RandomForestClassifier

rf = RandomForestClassifier()
rf.fit(X_train, y_train)
print(rf.score(X_test, y_test))

# rank features by importance
feature_importances = pd.DataFrame(rf.feature_importances_,
                                   index=X_train.columns,
                                   columns=['importance']).sort_values('importance', ascending=False)
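A natural follow-on (not part of the original snippet; the cut-off of 30 features is purely illustrative) is to keep only the highest-ranked features for later models:

# illustrative only: restrict train/test to the top-ranked features
top_features = feature_importances.head(30).index.tolist()
X_train_top = X_train[top_features]
X_test_top = X_test[top_features]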
# create and train a linear model with built-in alpha search - ridge here,
# lasso sketched just below
from sklearn import linear_model

ridge_model = linear_model.RidgeCV(alphas=[1e-4, 1e-3, 1e-2, 1e-1, 0.5, 0.9, 1, 10])
ridge_model.fit(X_train, y_train)
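The original comment mentions lasso alongside ridge, but only the ridge fit is shown; a minimal LassoCV counterpart, assuming the same X_train/y_train and reusing the same (illustrative) alpha grid, might look like this:

from sklearn import linear_model

lasso_model = linear_model.LassoCV(alphas=[1e-4, 1e-3, 1e-2, 1e-1, 0.5, 0.9, 1, 10], cv=5)
lasso_model.fit(X_train, y_train)
# features whose coefficients are driven to exactly zero are candidates to drop
zeroed_out = [c for c, w in zip(X_train.columns, lasso_model.coef_) if w == 0]
print("Features zeroed out by lasso: ", zeroed_out)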
# RFE - feature ranking with recursive feature elimination; n_features_to_select
# can be increased or decreased, 80 here is just an example
from sklearn.ensemble import RandomForestRegressor
from sklearn.feature_selection import RFE

hyperparam_rfe = {"step": 10, "n_features_to_select": 80}
hyperparam_rfr = {"n_estimators": 20, "max_depth": 4}
estimator = RandomForestRegressor(random_state=42, n_jobs=-1, **hyperparam_rfr)
selector = RFE(estimator, **hyperparam_rfe)
selector.fit(X_train, y_train)
print("Support: ", selector.support_)
print("Ranking: ", selector.ranking_)
selected_cols = [d for d, s in zip(list(X_train.columns), selector.support_) if s]
print("Selected Features: ", selected_cols)
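To carry the RFE result forward, the selected columns can be used to reduce the feature matrices; a short sketch under the same assumptions (X_test from the earlier split):

# keep only the RFE-selected features for downstream models
X_train_rfe = X_train[selected_cols]
X_test_rfe = X_test[selected_cols]
# equivalently: X_train_rfe = selector.transform(X_train)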
# creating a new experiment in Verta (a verta Client is assumed to exist as `client`)
expt_lv = client.set_experiment("low_variance")

from sklearn.feature_selection import VarianceThreshold

# keep a list of all the original df columns
all_columns = X_train.columns
# instantiate VarianceThreshold object
vt = VarianceThreshold(threshold=0.2)
# fit vt to the training data
vt.fit(X_train)
# get the list of columns that pass the variance threshold
selected_cols = [d for d, s in zip(list(X_train.columns), vt.get_support()) if s]
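As with RFE, the surviving columns can then be used to slim down the frames. One caveat worth noting: VarianceThreshold works on raw variances, so unscaled features with large ranges pass the cut more easily. A short sketch under the same assumptions:

# restrict train/test to the columns that passed the variance filter
X_train_vt = X_train[selected_cols]
X_test_vt = X_test[selected_cols]
print("Dropped columns: ", sorted(set(all_columns) - set(selected_cols)))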
# creating a new experiment in Verta
expt = client.set_experiment("f_regr")

from sklearn.feature_selection import SelectKBest, f_regression

# for regression use f_regression as the scoring function
selector_f_reg = SelectKBest(f_regression, k=20).fit(X_train, y_train)
selected_cols = [d for d, s in zip(list(X_train.columns), selector_f_reg.get_support()) if s]
print("K best columns: ", selected_cols)
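SelectKBest also exposes per-feature F scores after fitting, which can help sanity-check the choice of k; a minimal look, assuming the fitted selector above:

import pandas as pd

f_scores = pd.DataFrame({"feature": X_train.columns, "f_score": selector_f_reg.scores_})
print(f_scores.sort_values("f_score", ascending=False).head(20))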
@aish-anand
aish-anand / remove_corr_features.py
Created September 2, 2019 23:34
Removing highly correlated features
# correlation matrix of the training features
import numpy as np

cor = X_train.corr()
cor.loc[:, :] = np.tril(cor, k=-1)  # keep only the strict lower triangle so each pair appears once
cor_stack = cor.stack()
print("Column pairs with |corr| greater than 0.7 - ")
print(cor_stack[(cor_stack > 0.70) | (cor_stack < -0.70)])
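The gist's description is about removing these features, while the code above only prints them; a minimal sketch of the removal step, assuming the same X_train/X_test frames and dropping one member (here, the row label) of each highly correlated pair - which member to keep is a judgment call the original does not specify:

# drop one column from each highly correlated pair
high_corr = cor_stack[(cor_stack > 0.70) | (cor_stack < -0.70)]
to_drop = {row for row, col in high_corr.index}
X_train_uncorr = X_train.drop(columns=sorted(to_drop))
X_test_uncorr = X_test.drop(columns=sorted(to_drop))
print("Dropped due to high correlation: ", sorted(to_drop))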
@aish-anand
aish-anand / preprocess.py
Last active September 3, 2019 00:02
Cleaning and Preprocessing
# reading the dataset
import pandas as pd

loan_df = pd.read_csv("../input/loan.csv")

# looking at missing values
# percentage of null values in each of the columns
missing_val = loan_df.isna().sum() / len(loan_df) * 100
print("Columns with more than 60% missing values - ")
print(missing_val[missing_val > 60].sort_values(ascending=False))

# since we cannot sensibly fill in these columns, drop every column with 35% or
# more nulls (a stricter cut than the 60% printed above); this removed roughly 50 columns
loan_df = loan_df.loc[:, loan_df.isnull().mean() < .35]
print(loan_df.shape)
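The gist stops at dropping sparse columns; the columns that remain can still contain some nulls. A typical next step, sketched here as an assumption rather than taken from the original, is simple imputation:

# illustrative only: median-impute numeric columns, mode-impute everything else
for col in loan_df.columns[loan_df.isnull().any()]:
    if pd.api.types.is_numeric_dtype(loan_df[col]):
        loan_df[col] = loan_df[col].fillna(loan_df[col].median())
    else:
        loan_df[col] = loan_df[col].fillna(loan_df[col].mode()[0])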
@aish-anand
aish-anand / download_glue_data.py
Last active July 16, 2019 20:07 — forked from W4ngatang/download_glue_data.py
Script for downloading data of the GLUE benchmark (gluebenchmark.com)
''' Script for downloading all GLUE data.
Note: for legal reasons, we are unable to host MRPC.
You can either use the version hosted by the SentEval team, which is already tokenized,
or you can download the original data from (https://download.microsoft.com/download/D/4/6/D46FF87A-F6B9-4252-AA8B-3604ED519838/MSRParaphraseCorpus.msi) and extract the data from it manually.
For Windows users, you can run the .msi file. For Mac and Linux users, consider an external library such as 'cabextract' (see below for an example).
You should then rename and place specific files in a folder (see below for an example).
mkdir MRPC
cabextract MSRParaphraseCorpus.msi -d MRPC
@aish-anand
aish-anand / README.md
Created May 3, 2018 20:33
COMPSCI 590V - Homework 6

Data

A Song of Ice and Fire has been talked to death. Between the TV adaptation (a ratings dreadnought, dragging in its wake a froth of thinkpieces), the source text (sprinkled with enigmas to entice even the most discerning of crackpots), and the book releases (rarer than the Olympics), a wealth of criticism has accumulated for any fan of the series to explore. With this visualization, you can dig a little deeper into the TV adaptation!

Visualization 1 - Scatterplot

The scatterplot focuses on the show's massive viewership and fan following. With millions of dollars spent on each episode, it shows the votes and the rating that each episode of the show has received so far. The scatterplot supports both hovering, which shows the episode name, and brushing over the data points in the chart. Brushing populates a table with the relevant information about the selected episodes.

Visualization 2 - Word Cloud

For Season 7 specifically, a word cloud is created to show the most frequent words in each episode.

airport  arr_cancelled  arr_delay  carrier_delay  weather_delay  nas_delay  security_delay  late_aircraft_delay
ATL      6396           4184220    1533479        213781         880086     4380            1552494
ORD      4549           3987149    1136612        174348         1129859    3664            1542666
SFO      3519           3825836    670030         133557         1930595    3846            1087808
LAX      2446           3094970    832602         115079         1078174    5812            1063303
EWR      3746           2819605    466541         90950          1612665    1301            648148
DFW      2627           2357531    842968         107792         465076     4420            937275
DEN      1828           2342731    743717         101288         476244     2645            1018837
BOS      3480           2191339    501258         87391          755466     1651            845573
JFK      2925           2129525    452055         64350          938491     3228            671401
@aish-anand
aish-anand / README.md
Last active April 20, 2018 16:27
COMPSCI 590V - Homework 5

Dataset

The visualization used 3 separate files -

  • airports - contains all the airports in the continental United States, with IATA codes, latitude, and longitude (3,375 rows). Not all airports are plotted, as the visualization gets too crowded.
  • flights - contains flights to/from those airports, again listed by IATA code (5,367 rows).
  • airport_delays - the 25 busiest airports in the US, with delay and cancellation details. The cancellations/delays are split by cause: weather, security, carrier delay, and so on. This file has been condensed from the summary stats. The U.S. Department of Transportation's (DOT) Bureau of Transportation Statistics (BTS) tracks the on-time performance of domestic flights operated by large air carriers. Summary information on the number of on-time, delayed, canceled and diverted flights appears in DOT's monthly Air Travel Consumer Report, published about 30 days after the month's end, as well as in summary tables posted on the website. BTS began collecting details on the causes of flight delays