Aishwarya aish-anand

@aish-anand
aish-anand / index.html
Last active February 5, 2018 01:40
COMPSCI 590V - Homework 1 - Simple D3 Visualization
<!DOCTYPE html>
<html>
<head>
<title>Air Quality Index</title>
<meta charset="utf-8">
<meta name="viewport" content="width=device-width">
<link rel="stylesheet" href="https://fonts.googleapis.com/css?family=Merriweather" type="text/css">
<style type="text/css">
body {
margin: 0;
@aish-anand
aish-anand / README.md
Last active February 19, 2018 01:10
COMPSCI 590V - Homework 2

Scatterplot

Q1. Scatterplots are also known as scatter graphs, point graphs, X-Y plots, scatter charts, or scattergrams. They use a collection of points placed with Cartesian coordinates to display values from two variables. By assigning one variable to each axis, you can detect whether a relationship or correlation exists between the two. The patterns in a scatterplot can indicate several types of correlation: positive (values increase together), negative (one value decreases as the other increases), null (no correlation), linear, exponential, and U-shaped. The strength of the correlation can be judged by how closely the points are packed together. A scatterplot therefore shows the strength and direction of the relationship between the variables, how the data is dispersed, and whether outliers exist. Lines or curves are fitted within the graph to aid in

@aish-anand
aish-anand / README.md
Created March 18, 2018 16:25
COMPSCI 590V - Homework 3

Dataset

An air quality index (AQI) is a number used by government agencies to communicate to the public how polluted the air currently is or how polluted it is forecast to become. As the AQI increases, an increasingly large percentage of the population is likely to experience increasingly severe adverse health effects. The United States Environmental Protection Agency (EPA) has developed an Air Quality Index that is used to report air quality. The map shows the 50 metropolitan areas in the US with the highest carbon monoxide levels. The drop-down also lets the user view other measures, such as AQI and the maximum CO value in 2016. The original data records the CO measurement daily over the span of one year. For the purposes of the assignment, however, it has been condensed to the top 100 metros, with average values taken. The original dataset is also attached below. Link to dataset: https://aqs.epa.gov/aqsweb/airdata/download_files.html
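The condensing step described above (daily readings reduced to per-metro averages, then cut to the top metros) could be sketched roughly as follows. This is a minimal illustration, not the author's actual script: the column names `metro`, `co_ppm`, and `aqi`, the helper `condense`, and the toy values are all assumptions, and the real EPA files use different headers.

```python
import pandas as pd

# Hypothetical daily records; the real EPA download uses different column names.
daily = pd.DataFrame({
    "metro": ["A", "A", "B", "B", "C", "C"],
    "co_ppm": [0.4, 0.6, 0.2, 0.2, 0.9, 0.7],
    "aqi": [5, 7, 2, 2, 10, 8],
})

def condense(daily_df, top_n):
    """Average the daily values per metro and keep the top_n metros by mean CO."""
    summary = daily_df.groupby("metro").agg(
        mean_co=("co_ppm", "mean"),
        max_co=("co_ppm", "max"),
        mean_aqi=("aqi", "mean"),
    )
    return summary.nlargest(top_n, "mean_co")

top = condense(daily, top_n=2)
print(top)
```

Keeping the mean and maximum side by side in one summary table would let a drop-down switch between measures without re-reading the daily data.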

Projection

The Albers equal-area conic pr

@aish-anand
aish-anand / Ozone14_to_17.csv
Created April 6, 2018 17:11
COMPSCI 590V - Homework 4
County Code,County,Ozone2014,Ozone2015,Ozone2016,Ozone2017
1003,"Baldwin County, AL",0.07,0.063,0.062,0.064
1027,"Clay County, AL",0,0,0,0
1033,"Colbert County, AL",0.059,0.057,0.061,0.056
1049,"DeKalb County, AL",0.062,0.065,0.064,0.058
1051,"Elmore County, AL",0.06,0.061,0.064,0.055
1055,"Etowah County, AL",0.059,0.06,0.059,0.061
1069,"Houston County, AL",0.059,0.061,0.07,0.055
1073,"Jefferson County, AL",0.065,0.073,0.066,0.064
1089,"Madison County, AL",0.064,0.063,0.063,0.063
@aish-anand
aish-anand / README.md
Last active April 20, 2018 16:27
COMPSCI 590V - Homework 5

Dataset

The visualization uses three separate files:

  • airports - contains all the airports in the continental United States, with IATA codes, latitude and longitude (3375 rows). Not all airports are plotted, as the visualization gets too crowded.
  • flights - contains flights to and from airports, again listed by IATA codes (5367 rows).
  • airport_delays - delay and cancellation details for the 25 busiest airports in the US. The cancellations and delays are split by cause: weather, security, carrier delay and so on. This has been condensed from the summary stats. The U.S. Department of Transportation's (DOT) Bureau of Transportation Statistics (BTS) tracks the on-time performance of domestic flights operated by large air carriers. Summary information on the number of on-time, delayed, canceled and diverted flights appears in DOT's monthly Air Travel Consumer Report, published about 30 days after the month's end, as well as in summary tables posted on the website. BTS began collecting details on the causes of flight delays
airport,arr_cancelled,arr_delay,carrier_delay,weather_delay,nas_delay,security_delay,late_aircraft_delay
ATL,6396,4184220,1533479,213781,880086,4380,1552494
ORD,4549,3987149,1136612,174348,1129859,3664,1542666
SFO,3519,3825836,670030,133557,1930595,3846,1087808
LAX,2446,3094970,832602,115079,1078174,5812,1063303
EWR,3746,2819605,466541,90950,1612665,1301,648148
DFW,2627,2357531,842968,107792,465076,4420,937275
DEN,1828,2342731,743717,101288,476244,2645,1018837
BOS,3480,2191339,501258,87391,755466,1651,845573
JFK,2925,2129525,452055,64350,938491,3228,671401
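Since the delays are split by cause, one way to read the rows above is as shares of the total. The sketch below copies three rows from the table and assumes, as the figures suggest, that `arr_delay` is the sum of the five cause columns. The helper name `cause_shares` is hypothetical, and the unit (presumably minutes) is not stated in the source.

```python
# The five cause columns from the table header.
CAUSES = ["carrier_delay", "weather_delay", "nas_delay",
          "security_delay", "late_aircraft_delay"]

# Three rows copied from the table above.
rows = {
    "ATL": {"arr_delay": 4184220, "carrier_delay": 1533479, "weather_delay": 213781,
            "nas_delay": 880086, "security_delay": 4380, "late_aircraft_delay": 1552494},
    "ORD": {"arr_delay": 3987149, "carrier_delay": 1136612, "weather_delay": 174348,
            "nas_delay": 1129859, "security_delay": 3664, "late_aircraft_delay": 1542666},
    "SFO": {"arr_delay": 3825836, "carrier_delay": 670030, "weather_delay": 133557,
            "nas_delay": 1930595, "security_delay": 3846, "late_aircraft_delay": 1087808},
}

def cause_shares(row):
    """Return each cause's fraction of the summed cause delays for one airport."""
    total = sum(row[cause] for cause in CAUSES)
    return {cause: row[cause] / total for cause in CAUSES}

shares = {airport: cause_shares(row) for airport, row in rows.items()}
for airport, by_cause in shares.items():
    largest = max(by_cause, key=by_cause.get)
    print(airport, largest, round(by_cause[largest], 3))
```

Normalizing per airport makes the busiest airports comparable with the rest, which raw totals would otherwise obscure.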
@aish-anand
aish-anand / README.md
Created May 3, 2018 20:33
COMPSCI 590V - Homework 6

Data

A Song of Ice and Fire has been talked to death. Between the TV adaptation (a ratings dreadnought, dragging in its wake a froth of thinkpieces), the source text (sprinkled with enigmas to entice even the most discerning of crackpots), and the book releases (rarer than the Olympics), a wealth of criticism has accumulated for any fan of the series to explore. With this visualization, you can look a little more into the TV adaptation!

Visualization 1 - Scatterplot

The scatterplot focuses on the massive viewership and fan following that the show has. With millions of dollars spent on each episode, it shows the votes and ratings that each episode of the show has received so far. The scatterplot supports both hovering, which shows the episode name, and brushing over the data points in the chart. Brushing populates a table with the relevant information about the selected episodes.

Visualization 2 - Word Cloud

For Season 7 specifically a word cloud is created to see the most frequent words in each o

@aish-anand
aish-anand / download_glue_data.py
Last active July 16, 2019 20:07 — forked from W4ngatang/download_glue_data.py
Script for downloading data of the GLUE benchmark (gluebenchmark.com)
''' Script for downloading all GLUE data.
Note: for legal reasons, we are unable to host MRPC.
You can either use the version hosted by the SentEval team, which is already tokenized,
or you can download the original data from (https://download.microsoft.com/download/D/4/6/D46FF87A-F6B9-4252-AA8B-3604ED519838/MSRParaphraseCorpus.msi) and extract the data from it manually.
For Windows users, you can run the .msi file. For Mac and Linux users, consider an external library such as 'cabextract' (see below for an example).
You should then rename and place specific files in a folder (see below for an example).
mkdir MRPC
cabextract MSRParaphraseCorpus.msi -d MRPC
@aish-anand
aish-anand / preprocess.py
Last active September 3, 2019 00:02
Cleaning and Preprocessing
import pandas as pd

# Read the dataset
loan_df = pd.read_csv("../input/loan.csv")

# Percentage of null values in each column
missing_val = loan_df.isna().sum() / len(loan_df) * 100
print("Columns with more than 60% missing values - ")
print(missing_val[missing_val > 60].sort_values(ascending=False))

# These columns cannot be filled in reliably, so drop them. Note that the cutoff
# applied here is 35% missing, which is stricter than the 60% reported above;
# got rid of ~50 columns with that threshold.
loan_df = loan_df.loc[:, loan_df.isna().mean() < .35]
print(loan_df.shape)
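How much the cutoff matters is easy to check on a toy frame. The following is a small sketch rather than part of the original gist: the helper `drop_sparse_columns` and the toy columns are made up for illustration, and they show how a 60% and a 35% missing-value cutoff keep different columns.

```python
import numpy as np
import pandas as pd

def drop_sparse_columns(df, max_missing_frac):
    """Keep only the columns whose fraction of missing values is below the cutoff."""
    missing_frac = df.isna().mean()
    return df.loc[:, missing_frac < max_missing_frac]

# Toy data with 0%, 25%, 50% and 75% missing values per column.
toy = pd.DataFrame({
    "full": [1, 2, 3, 4],
    "quarter_missing": [1, np.nan, 3, 4],
    "half_missing": [np.nan, np.nan, 3, 4],
    "mostly_missing": [np.nan, np.nan, np.nan, 4],
})

at_60 = drop_sparse_columns(toy, 0.60)
at_35 = drop_sparse_columns(toy, 0.35)
print(list(at_60.columns))
print(list(at_35.columns))
```

Passing the cutoff as a parameter keeps the reported threshold and the applied threshold from drifting apart.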
@aish-anand
aish-anand / remove_corr_features.py
Created September 2, 2019 23:34
Removing highly correlated features
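import numpy as np

# Correlation matrix of the training features (X_train is assumed to be a DataFrame defined earlier)
cor = X_train.corr()
# Zero out the diagonal and upper triangle so each pair of columns appears only once
cor.loc[:, :] = np.tril(cor, k=-1)
cor_stack = cor.stack()
print("Column pairs with absolute correlation greater than 0.7 - ")
print(cor_stack[(cor_stack > 0.70) | (cor_stack < -0.70)])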
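The snippet above only lists the correlated pairs. One possible way to go on and remove one feature from each pair is sketched below, on made-up data. The helper `correlated_columns_to_drop` and the rule of keeping the earlier column of each pair are assumptions for illustration, not something the gist itself specifies.

```python
import numpy as np
import pandas as pd

def correlated_columns_to_drop(features, threshold=0.70):
    """Flag each column whose absolute correlation with an earlier column exceeds the threshold."""
    corr = features.corr().abs()
    # Boolean mask of the strictly lower triangle: row i is paired only with earlier columns j < i.
    mask = np.tril(np.ones(corr.shape, dtype=bool), k=-1)
    lower = corr.where(mask)  # entries outside the mask become NaN and never pass the comparison
    return [col for col in lower.index if (lower.loc[col] > threshold).any()]

# Toy features: two columns are exact linear functions of x, one is only weakly related.
x = np.array([1, 2, 3, 4, 5, 6], dtype=float)
toy = pd.DataFrame({
    "x": x,
    "x_scaled": 2 * x + 1,
    "x_neg": -x,
    "z": [1, -1, 1, -1, 1, -1],
})

to_drop = correlated_columns_to_drop(toy)
reduced = toy.drop(columns=to_drop)
print(to_drop)
print(list(reduced.columns))
```

Taking the absolute value first covers strong negative correlations as well, matching the two-sided check in the original print statement.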