Aishwarya aish-anand

# train a random forest classifier and look at feature importances
# (X_train, X_test, y_train, y_test are assumed to exist from the earlier split)
import pandas as pd
from sklearn.ensemble import RandomForestClassifier

rf = RandomForestClassifier()
rf.fit(X_train, y_train)
print(rf.score(X_test, y_test))

# rank features by importance
feature_importances = pd.DataFrame(rf.feature_importances_,
                                   index=X_train.columns,
                                   columns=['importance']).sort_values('importance', ascending=False)
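A natural follow-on (not part of the original snippet; the cut-off of 30 features is purely illustrative) is to keep only the highest-ranked features for later models:

# illustrative only: restrict train/test to the top-ranked features
top_features = feature_importances.head(30).index.tolist()
X_train_top = X_train[top_features]
X_test_top = X_test[top_features]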
# create and train a linear model with built-in alpha search - ridge here,
# lasso sketched just below
from sklearn import linear_model

ridge_model = linear_model.RidgeCV(alphas=[1e-4, 1e-3, 1e-2, 1e-1, 0.5, 0.9, 1, 10])
ridge_model.fit(X_train, y_train)
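The original comment mentions lasso alongside ridge, but only the ridge fit is shown; a minimal LassoCV counterpart, assuming the same X_train/y_train and reusing the same (illustrative) alpha grid, might look like this:

from sklearn import linear_model

lasso_model = linear_model.LassoCV(alphas=[1e-4, 1e-3, 1e-2, 1e-1, 0.5, 0.9, 1, 10], cv=5)
lasso_model.fit(X_train, y_train)
# features whose coefficients are driven to exactly zero are candidates to drop
zeroed_out = [c for c, w in zip(X_train.columns, lasso_model.coef_) if w == 0]
print("Features zeroed out by lasso: ", zeroed_out)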
# RFE - feature ranking with recursive feature elimination; n_features_to_select
# can be increased or decreased, 80 here is just an example
from sklearn.ensemble import RandomForestRegressor
from sklearn.feature_selection import RFE

hyperparam_rfe = {"step": 10, "n_features_to_select": 80}
hyperparam_rfr = {"n_estimators": 20, "max_depth": 4}
estimator = RandomForestRegressor(random_state=42, n_jobs=-1, **hyperparam_rfr)
selector = RFE(estimator, **hyperparam_rfe)
selector.fit(X_train, y_train)
print("Support: ", selector.support_)
print("Ranking: ", selector.ranking_)
selected_cols = [d for d, s in zip(list(X_train.columns), selector.support_) if s]
print("Selected Features: ", selected_cols)
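To carry the RFE result forward, the selected columns can be used to reduce the feature matrices; a short sketch under the same assumptions (X_test from the earlier split):

# keep only the RFE-selected features for downstream models
X_train_rfe = X_train[selected_cols]
X_test_rfe = X_test[selected_cols]
# equivalently: X_train_rfe = selector.transform(X_train)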
# creating a new experiment in Verta (a verta Client is assumed to exist as `client`)
expt_lv = client.set_experiment("low_variance")

from sklearn.feature_selection import VarianceThreshold

# keep a list of all the original df columns
all_columns = X_train.columns
# instantiate VarianceThreshold object
vt = VarianceThreshold(threshold=0.2)
# fit vt to the training data
vt.fit(X_train)
# get the list of columns that pass the variance threshold
selected_cols = [d for d, s in zip(list(X_train.columns), vt.get_support()) if s]
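As with RFE, the surviving columns can then be used to slim down the frames. One caveat worth noting: VarianceThreshold works on raw variances, so unscaled features with large ranges pass the cut more easily. A short sketch under the same assumptions:

# restrict train/test to the columns that passed the variance filter
X_train_vt = X_train[selected_cols]
X_test_vt = X_test[selected_cols]
print("Dropped columns: ", sorted(set(all_columns) - set(selected_cols)))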
# creating a new experiment in Verta
expt = client.set_experiment("f_regr")

from sklearn.feature_selection import SelectKBest, f_regression

# for regression use f_regression as the scoring function
selector_f_reg = SelectKBest(f_regression, k=20).fit(X_train, y_train)
selected_cols = [d for d, s in zip(list(X_train.columns), selector_f_reg.get_support()) if s]
print("K best columns: ", selected_cols)
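SelectKBest also exposes per-feature F scores after fitting, which can help sanity-check the choice of k; a minimal look, assuming the fitted selector above:

import pandas as pd

f_scores = pd.DataFrame({"feature": X_train.columns, "f_score": selector_f_reg.scores_})
print(f_scores.sort_values("f_score", ascending=False).head(20))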
@aish-anand
aish-anand / remove_corr_features.py
Created September 2, 2019 23:34
Removing highly correlated features
# correlation matrix of the training features
import numpy as np

cor = X_train.corr()
cor.loc[:, :] = np.tril(cor, k=-1)  # keep only the strict lower triangle so each pair appears once
cor_stack = cor.stack()
print("Column pairs with |corr| greater than 0.7 - ")
print(cor_stack[(cor_stack > 0.70) | (cor_stack < -0.70)])
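The gist's description is about removing these features, while the code above only prints them; a minimal sketch of the removal step, assuming the same X_train/X_test frames and dropping one member (here, the row label) of each highly correlated pair - which member to keep is a judgment call the original does not specify:

# drop one column from each highly correlated pair
high_corr = cor_stack[(cor_stack > 0.70) | (cor_stack < -0.70)]
to_drop = {row for row, col in high_corr.index}
X_train_uncorr = X_train.drop(columns=sorted(to_drop))
X_test_uncorr = X_test.drop(columns=sorted(to_drop))
print("Dropped due to high correlation: ", sorted(to_drop))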
@aish-anand
aish-anand / preprocess.py
Last active September 3, 2019 00:02
Cleaning and Preprocessing
# reading the dataset
import pandas as pd

loan_df = pd.read_csv("../input/loan.csv")

# looking at missing values
# percentage of null values in each of the columns
missing_val = loan_df.isna().sum() / len(loan_df) * 100
print("Columns with more than 60% missing values - ")
print(missing_val[missing_val > 60].sort_values(ascending=False))

# since we cannot sensibly fill in these columns, drop every column with 35% or
# more nulls (a stricter cut than the 60% printed above); this removed roughly 50 columns
loan_df = loan_df.loc[:, loan_df.isnull().mean() < .35]
print(loan_df.shape)
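The gist stops at dropping sparse columns; the columns that remain can still contain some nulls. A typical next step, sketched here as an assumption rather than taken from the original, is simple imputation:

# illustrative only: median-impute numeric columns, mode-impute everything else
for col in loan_df.columns[loan_df.isnull().any()]:
    if pd.api.types.is_numeric_dtype(loan_df[col]):
        loan_df[col] = loan_df[col].fillna(loan_df[col].median())
    else:
        loan_df[col] = loan_df[col].fillna(loan_df[col].mode()[0])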
@aish-anand
aish-anand / download_glue_data.py
Last active July 16, 2019 20:07 — forked from W4ngatang/download_glue_data.py
Script for downloading data of the GLUE benchmark (gluebenchmark.com)
''' Script for downloading all GLUE data.
Note: for legal reasons, we are unable to host MRPC.
You can either use the version hosted by the SentEval team, which is already tokenized,
or you can download the original data from (https://download.microsoft.com/download/D/4/6/D46FF87A-F6B9-4252-AA8B-3604ED519838/MSRParaphraseCorpus.msi) and extract the data from it manually.
For Windows users, you can run the .msi file. For Mac and Linux users, consider an external library such as 'cabextract' (see below for an example).
You should then rename and place specific files in a folder (see below for an example).
mkdir MRPC
cabextract MSRParaphraseCorpus.msi -d MRPC
@aish-anand
aish-anand / README.md
Created May 3, 2018 20:33
COMPSCI 590V - Homework 6

Data

A Song of Ice and Fire has been talked to death. Between the TV adaptation (a ratings dreadnought, dragging in its wake a froth of thinkpieces), the source text (sprinkled with enigmas to entice even the most discerning of crackpots), and the book releases (rarer than the Olympics), a wealth of criticism has accumulated for any fan of the series to explore. With this visualization, you can dig a little deeper into the TV adaptation!

Visualization 1 - Scatterplot

The scatterplot focuses on the show's massive viewership and fan following. With millions of dollars spent on each episode, it shows the votes and the rating that each episode of the show has received so far. The scatterplot supports both hovering, which shows the episode name, and brushing over the data points in the chart. Brushing populates a table with the relevant information about the selected episodes.

Visualization 2 - Word Cloud

For Season 7 specifically, a word cloud is created to show the most frequent words in each episode.

airport  arr_cancelled  arr_delay  carrier_delay  weather_delay  nas_delay  security_delay  late_aircraft_delay
ATL      6396           4184220    1533479        213781         880086     4380            1552494
ORD      4549           3987149    1136612        174348         1129859    3664            1542666
SFO      3519           3825836    670030         133557         1930595    3846            1087808
LAX      2446           3094970    832602         115079         1078174    5812            1063303
EWR      3746           2819605    466541         90950          1612665    1301            648148
DFW      2627           2357531    842968         107792         465076     4420            937275
DEN      1828           2342731    743717         101288         476244     2645            1018837
BOS      3480           2191339    501258         87391          755466     1651            845573
JFK      2925           2129525    452055         64350          938491     3228            671401
@aish-anand
aish-anand / README.md
Last active April 20, 2018 16:27
COMPSCI 590V - Homework 5

Dataset

The visualization used 3 separate files -

  • airports - contains all the airports in the continental United States, with IATA codes, latitude, and longitude (3,375 rows). Not all airports are plotted, as the visualization gets too crowded.
  • flights - contains flights to/from those airports, again listed by IATA code (5,367 rows).
  • airport_delays - the 25 busiest airports in the US, with delay and cancellation details. The cancellations/delays are split by cause: weather, security, carrier delay, and so on. This file has been condensed from the summary stats. The U.S. Department of Transportation's (DOT) Bureau of Transportation Statistics (BTS) tracks the on-time performance of domestic flights operated by large air carriers. Summary information on the number of on-time, delayed, canceled and diverted flights appears in DOT's monthly Air Travel Consumer Report, published about 30 days after the month's end, as well as in summary tables posted on the website. BTS began collecting details on the causes of flight delays