Aishwarya aish-anand

## index.html
<!DOCTYPE html>
<html>
<head>
  <title>Air Quality Index</title>
  <meta charset="utf-8">
  <meta name="viewport" content="width=device-width">
  <link rel="stylesheet" href="https://fonts.googleapis.com/css?family=Merriweather" type="text/css">
<style type="text/css">
body {
  margin: 0;

## README.md

      
aish-anand / README.md (5 files)
Last active February 19, 2018 01:10
COMPSCI 590V - Homework 2

Scatterplot

Q1. Scatterplots are also known as scatter graphs, point graphs, X-Y plots, scatter charts, or scattergrams.
Scatterplots use a collection of points placed using Cartesian coordinates to display values from two variables. By displaying one variable on each axis, you can detect whether a relationship or correlation exists between the two variables.
Various types of correlation can be interpreted from the patterns displayed on scatterplots: positive (values increase together), negative (one value decreases as the other increases), null (no correlation), linear, exponential, and U-shaped. The strength of the correlation can be judged by how closely the points are packed together.
Scatterplots show visually the strength and direction of the relationship between the variables, whether outliers exist, and how the data is dispersed.
Lines or curves are fitted within the graph to aid in
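
As a rough illustration of the correlation types listed above, the sketch below computes the Pearson correlation coefficient for three small made-up series. The helper name `pearson_r` and all data values are assumptions for illustration, not part of the homework. Pearson's r only measures linear association, which is why the U-shaped series scores zero even though the pattern is obvious on a scatterplot.

```python
import math


def pearson_r(xs, ys):
    """Return the Pearson correlation coefficient of two equal-length sequences."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two sequences of equal length >= 2")
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    # Sum of products of deviations (unnormalized covariance).
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for a constant sequence")
    return cov / math.sqrt(var_x * var_y)


# Made-up illustrative data (hypothetical, not from any dataset in this gist).
x = [1, 2, 3, 4, 5]
rising = [2, 4, 6, 8, 10]    # positive: values increase together
falling = [10, 8, 6, 4, 2]   # negative: one decreases as the other increases
u_shaped = [4, 1, 0, 1, 4]   # U-shaped: related, but not linearly

r_rising = pearson_r(x, rising)
r_falling = pearson_r(x, falling)
r_u = pearson_r(x, u_shaped)
print(r_rising, r_falling, r_u)
```

A coefficient near +1 or -1 corresponds to tightly packed points along a rising or falling line, while a value near 0 means there is no linear trend, which is not the same as no relationship.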

  
## README.md

      
aish-anand / README.md (7 files)
Created March 18, 2018 16:25
COMPSCI 590V - Homework 3

Dataset

An air quality index (AQI) is a number used by government agencies to communicate to the public how polluted the air currently is or how polluted it is forecast to become. As the AQI increases, an increasingly large percentage of the population is likely to experience increasingly severe adverse health effects.
The United States Environmental Protection Agency (EPA) has developed an Air Quality Index that is used to report air quality. The map shows the 50 metropolitan areas in the US with the highest carbon monoxide levels. The drop-down also lets the user see other measures, such as AQI and the maximum CO value in 2016. The original data records this CO measurement daily over the span of one year. For the purposes of the assignment, however, it has been condensed to the top 100 metros, and average values are taken. The original dataset has also been attached below.
Link to dataset: https://aqs.epa.gov/aqsweb/airdata/download_files.html
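
The condensing step described above (daily readings averaged per metro, then cut to the top metros) could be sketched roughly as follows. This is a minimal illustration, not the preprocessing actually used for the assignment: the function name `top_metros_by_average`, the `(metro, value)` row layout, and the sample values are all assumptions.

```python
from collections import defaultdict
from statistics import mean


def top_metros_by_average(daily_rows, n):
    """Average the daily values per metro and return the n highest averages.

    daily_rows is an iterable of (metro_name, daily_value) pairs
    (a hypothetical layout; the real EPA files have many more columns).
    """
    readings = defaultdict(list)
    for metro, value in daily_rows:
        readings[metro].append(value)
    averages = {metro: mean(values) for metro, values in readings.items()}
    # Sort by average value, highest first, and keep the first n entries.
    ranked = sorted(averages.items(), key=lambda item: item[1], reverse=True)
    return ranked[:n]


# Made-up daily readings for three hypothetical metros.
daily_rows = [
    ("Metro A", 0.4), ("Metro A", 0.6),
    ("Metro B", 1.0), ("Metro B", 1.2),
    ("Metro C", 0.2), ("Metro C", 0.2),
]

top_two = top_metros_by_average(daily_rows, 2)
print(top_two)
```

Calling the function with `n=100` on a full year of daily rows would give one averaged value per metro for the 100 highest metros, which matches the shape of the condensed data described above.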
Projection

The Albers equal-area conic pr

  
## Ozone14_to_17.csv

          
| County Code | County | Ozone2014 | Ozone2015 | Ozone2016 | Ozone2017 |
|---|---|---|---|---|---|
| 1003 | Baldwin County, AL | 0.07 | 0.063 | 0.062 | 0.064 |
| 1027 | Clay County, AL | 0 | 0 | 0 | 0 |
| 1033 | Colbert County, AL | 0.059 | 0.057 | 0.061 | 0.056 |
| 1049 | DeKalb County, AL | 0.062 | 0.065 | 0.064 | 0.058 |
| 1051 | Elmore County, AL | 0.06 | 0.061 | 0.064 | 0.055 |
| 1055 | Etowah County, AL | 0.059 | 0.06 | 0.059 | 0.061 |
| 1069 | Houston County, AL | 0.059 | 0.061 | 0.07 | 0.055 |
| 1073 | Jefferson County, AL | 0.065 | 0.073 | 0.066 | 0.064 |
| 1089 | Madison County, AL | 0.064 | 0.063 | 0.063 | 0.063 |

## README.md

      
aish-anand / README.md (7 files)
Last active April 20, 2018 16:27
COMPSCI 590V - Homework 5

Dataset

The visualization used 3 separate files:

- airports: contains all the airports in the continental United States, with IATA codes, latitude, and longitude (3375 rows). Not all airports are plotted, as the visualization gets too crowded.
- flights: contains flights to/from airports, again listed by IATA codes (5367 rows).
- airport_delays: the 25 busiest airports in the US, with delay and cancellation details. The cancellations/delays are split by cause: weather, security, carrier delay, and so on. This has been condensed from the summary stats.

The U.S. Department of Transportation's (DOT) Bureau of Transportation Statistics (BTS) tracks the on-time performance of domestic flights operated by large air carriers. Summary information on the number of on-time, delayed, canceled and diverted flights appears in DOT's monthly Air Travel Consumer Report, published about 30 days after the month's end, as well as in summary tables posted on the website. BTS began collecting details on the causes of flight delays


## airport_delays_v2.csv

          
| airport | arr_cancelled | arr_delay | carrier_delay | weather_delay | nas_delay | security_delay | late_aircraft_delay |
|---|---|---|---|---|---|---|---|
| ATL | 6396 | 4184220 | 1533479 | 213781 | 880086 | 4380 | 1552494 |
| ORD | 4549 | 3987149 | 1136612 | 174348 | 1129859 | 3664 | 1542666 |
| SFO | 3519 | 3825836 | 670030 | 133557 | 1930595 | 3846 | 1087808 |
| LAX | 2446 | 3094970 | 832602 | 115079 | 1078174 | 5812 | 1063303 |
| EWR | 3746 | 2819605 | 466541 | 90950 | 1612665 | 1301 | 648148 |
| DFW | 2627 | 2357531 | 842968 | 107792 | 465076 | 4420 | 937275 |
| DEN | 1828 | 2342731 | 743717 | 101288 | 476244 | 2645 | 1018837 |
| BOS | 3480 | 2191339 | 501258 | 87391 | 755466 | 1651 | 845573 |
| JFK | 2925 | 2129525 | 452055 | 64350 | 938491 | 3228 | 671401 |
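
Since the delays are split by cause, one way to read this file is as each cause's share of an airport's total arrival delay. The sketch below is only an illustration using the ATL and ORD rows shown above; the helper name `cause_shares` and the dictionary layout are assumptions, and the units of the delay columns are not stated in the preview. For these two rows, the five cause columns add up exactly to `arr_delay`.

```python
# Column names as they appear in airport_delays_v2.csv.
CAUSES = [
    "carrier_delay",
    "weather_delay",
    "nas_delay",
    "security_delay",
    "late_aircraft_delay",
]

# Two rows copied from the preview above (hypothetical in-memory layout).
rows = {
    "ATL": {"arr_delay": 4184220, "carrier_delay": 1533479, "weather_delay": 213781,
            "nas_delay": 880086, "security_delay": 4380, "late_aircraft_delay": 1552494},
    "ORD": {"arr_delay": 3987149, "carrier_delay": 1136612, "weather_delay": 174348,
            "nas_delay": 1129859, "security_delay": 3664, "late_aircraft_delay": 1542666},
}


def cause_shares(row):
    """Return each cause's fraction of the summed delay for one airport row."""
    total = sum(row[cause] for cause in CAUSES)
    return {cause: row[cause] / total for cause in CAUSES}


atl_shares = cause_shares(rows["ATL"])
ord_shares = cause_shares(rows["ORD"])

for airport, shares in (("ATL", atl_shares), ("ORD", ord_shares)):
    largest = max(shares, key=shares.get)
    print(airport, largest, round(shares[largest], 3))
```

Shares like these are convenient for a stacked bar or pie view per airport, because they are comparable across airports with very different total delays.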

## README.md

      
aish-anand / README.md (8 files)
Created May 3, 2018 20:33
COMPSCI 590V - Homework 6

Data

A Song of Ice and Fire has been talked to death. Between the TV adaptation (a ratings dreadnought, dragging in its wake a froth of thinkpieces), the source text (sprinkled with enigmas to entice even the most discerning of crackpots), and the book releases (rarer than the Olympics), a wealth of criticism has accumulated for any fan of the series to explore. With this visualization, you can look a little more into the TV adaptation!
Visualization 1 - Scatterplot

The scatterplot focuses on the massive viewership and fan following that the show has. With millions of dollars spent on each episode, it shows the votes and ratings that each episode of the show has received so far.
The scatterplot supports both hovering, which shows the episode name, and brushing over the data points in the chart. Brushing populates a table with the relevant information about the selected episodes.
Visualization 2 - Word Cloud

For Season 7 specifically a word cloud is created to see the most frequent words in each o

  
## download_glue_data.py
''' Script for downloading all GLUE data.

Note: for legal reasons, we are unable to host MRPC.
You can either use the version hosted by the SentEval team, which is already tokenized,
or you can download the original data from (https://download.microsoft.com/download/D/4/6/D46FF87A-F6B9-4252-AA8B-3604ED519838/MSRParaphraseCorpus.msi) and extract the data from it manually.
For Windows users, you can run the .msi file. For Mac and Linux users, consider an external library such as 'cabextract' (see below for an example).
You should then rename and place specific files in a folder (see below for an example).

mkdir MRPC
cabextract MSRParaphraseCorpus.msi -d MRPC

## preprocess.py
import pandas as pd

# reading the dataset
loan_df = pd.read_csv("../input/loan.csv")
# looking at missing values
# percentage of null values in each of the columns
missing_val = loan_df.isna().sum() / len(loan_df) * 100
print("Columns with more than 60% missing values - ")
print(missing_val[missing_val > 60].sort_values(ascending=False))
# since we cannot fill in values for these columns we're going to drop them;
# note the drop threshold is stricter than the 60% report above: only columns
# with less than 35% missing values are kept (got rid of ~50 cols with that threshold)
loan_df = loan_df.loc[:, loan_df.isnull().mean() < .35]
print(loan_df.shape)

## remove_corr_features.py
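import numpy as np

# correlation matrix (X_train is assumed to be a pandas DataFrame defined earlier)
cor = X_train.corr()
# keep only the entries strictly below the main diagonal, so each column pair
# appears once and the self-correlations on the diagonal are zeroed out
cor.loc[:, :] = np.tril(cor, k=-1)
cor_stack = cor.stack()
print("Column pairs with absolute corr. greater than 0.7 - ")
print(cor_stack[(cor_stack > 0.70) | (cor_stack < -0.70)])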