Q1. Scatter plots are also known as scatter graphs, point graphs, X-Y plots, scatter charts or scattergrams. A scatter plot uses a collection of points placed using Cartesian coordinates to display values from two variables. By assigning one variable to each axis, you can detect whether a relationship or correlation exists between the two variables. Various types of correlation can be read from the patterns in a scatter plot: positive (values increase together), negative (one value decreases as the other increases), null (no correlation), linear, exponential and U-shaped. The strength of the correlation is indicated by how closely the points are packed together. A scatter plot therefore shows the strength and direction of the relationship between the variables, how the data is dispersed, and whether outliers exist. Lines or curves are fitted within the graph to aid in
<!DOCTYPE html>
<html>
<head>
<title>Air Quality Index</title>
<meta charset="utf-8">
<meta name="viewport" content="width=device-width">
<link rel="stylesheet" href="https://fonts.googleapis.com/css?family=Merriweather" type="text/css">
<style type="text/css">
body {
margin: 0;
An air quality index (AQI) is a number used by government agencies to communicate to the public how polluted the air currently is or how polluted it is forecast to become. As the AQI increases, an increasingly large percentage of the population is likely to experience increasingly severe adverse health effects. The United States Environmental Protection Agency (EPA) has developed an Air Quality Index that is used to report air quality. The map shows the 50 metropolitan areas in the US with the highest carbon monoxide (CO) levels. The drop-down also lets the user see other measures, such as the AQI and the maximum CO value in 2016. The original data contains CO measurements taken every day over one year. For the purposes of the assignment, however, this has been condensed to the top 100 metros, and average values are used. The original dataset is available at the link below. Link to dataset: https://aqs.epa.gov/aqsweb/airdata/download_files.html
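The condensing step described above (daily CO readings reduced to one average per metro, then only the highest-ranked metros kept) could look roughly like this in pandas. This is a minimal sketch, not the author's actual preprocessing: the column names `metro` and `co`, the helper `top_metros_by_mean` and the sample values are hypothetical, and the real EPA files use different headers.

```python
import pandas as pd


def top_metros_by_mean(daily: pd.DataFrame, n: int) -> pd.DataFrame:
    """Average the daily readings per metro and keep the n highest averages."""
    summary = (
        daily.groupby("metro")["co"]
        .agg(mean_co="mean", max_co="max")  # yearly mean and yearly maximum
        .sort_values("mean_co", ascending=False)
        .head(n)
    )
    return summary.reset_index()


# Made-up daily readings for three hypothetical metros.
daily = pd.DataFrame(
    {
        "metro": ["A", "A", "B", "B", "C", "C"],
        "co": [0.2, 0.4, 0.9, 0.7, 0.1, 0.1],
    }
)

top2 = top_metros_by_mean(daily, n=2)
print(top2)
```

Keeping both the mean and the maximum per metro matches the measures the drop-down is described as offering.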
The Albers equal-area conic pr
County Code | County | Ozone2014 | Ozone2015 | Ozone2016 | Ozone2017
---|---|---|---|---|---
1003 | Baldwin County, AL | 0.07 | 0.063 | 0.062 | 0.064
1027 | Clay County, AL | 0 | 0 | 0 | 0
1033 | Colbert County, AL | 0.059 | 0.057 | 0.061 | 0.056
1049 | DeKalb County, AL | 0.062 | 0.065 | 0.064 | 0.058
1051 | Elmore County, AL | 0.06 | 0.061 | 0.064 | 0.055
1055 | Etowah County, AL | 0.059 | 0.06 | 0.059 | 0.061
1069 | Houston County, AL | 0.059 | 0.061 | 0.07 | 0.055
1073 | Jefferson County, AL | 0.065 | 0.073 | 0.066 | 0.064
1089 | Madison County, AL | 0.064 | 0.063 | 0.063 | 0.063
The visualization uses three separate files -
- airports - contains all the airports in the continental United States, with IATA codes, latitude and longitude (3375 rows). Not all airports are plotted, as the visualization gets too crowded.
- flights - contains flights to/from airports, again listed by IATA codes (5367 rows).
- airport_delays - delay and cancellation details for the 25 busiest airports in the US. The cancellations and delays are split by cause: weather, security, carrier delay and so on. This file has been condensed from the summary statistics. The U.S. Department of Transportation's (DOT) Bureau of Transportation Statistics (BTS) tracks the on-time performance of domestic flights operated by large air carriers. Summary information on the number of on-time, delayed, canceled and diverted flights appears in DOT's monthly Air Travel Consumer Report, published about 30 days after the month's end, as well as in summary tables posted on the website. BTS began collecting details on the causes of flight delays
airport | arr_cancelled | arr_delay | carrier_delay | weather_delay | nas_delay | security_delay | late_aircraft_delay
---|---|---|---|---|---|---|---
ATL | 6396 | 4184220 | 1533479 | 213781 | 880086 | 4380 | 1552494
ORD | 4549 | 3987149 | 1136612 | 174348 | 1129859 | 3664 | 1542666
SFO | 3519 | 3825836 | 670030 | 133557 | 1930595 | 3846 | 1087808
LAX | 2446 | 3094970 | 832602 | 115079 | 1078174 | 5812 | 1063303
EWR | 3746 | 2819605 | 466541 | 90950 | 1612665 | 1301 | 648148
DFW | 2627 | 2357531 | 842968 | 107792 | 465076 | 4420 | 937275
DEN | 1828 | 2342731 | 743717 | 101288 | 476244 | 2645 | 1018837
BOS | 3480 | 2191339 | 501258 | 87391 | 755466 | 1651 | 845573
JFK | 2925 | 2129525 | 452055 | 64350 | 938491 | 3228 | 671401
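In the rows shown above, the five per-cause columns add up to `arr_delay`, so each cause can be expressed as a share of an airport's total arrival delay. The sketch below does this for the first two rows of the table. It is an illustration only, not code from the visualization, and the variable names `CAUSES`, `delays` and `shares` are chosen here for the example.

```python
import pandas as pd

CAUSES = [
    "carrier_delay",
    "weather_delay",
    "nas_delay",
    "security_delay",
    "late_aircraft_delay",
]

# First two rows of the airport_delays table above.
delays = pd.DataFrame(
    {
        "airport": ["ATL", "ORD"],
        "arr_delay": [4184220, 3987149],
        "carrier_delay": [1533479, 1136612],
        "weather_delay": [213781, 174348],
        "nas_delay": [880086, 1129859],
        "security_delay": [4380, 3664],
        "late_aircraft_delay": [1552494, 1542666],
    }
).set_index("airport")

# Fraction of each airport's total arrival delay attributed to each cause.
shares = delays[CAUSES].div(delays["arr_delay"], axis=0)
print(shares.round(3))
```

Shares make airports of different sizes comparable, which the raw totals do not.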
A Song of Ice and Fire has been talked to death. Between the TV adaptation (a ratings dreadnought, dragging in its wake a froth of thinkpieces), the source text (sprinkled with enigmas to entice even the most discerning of crackpots), and the book releases (rarer than Olympics), a wealth of criticism has accumulated for any fan of the series to explore. With this visualization, you can look a little more into the TV adaptation!
The scatterplot focuses on the massive viewership and fan following that the show has. With millions of dollars spent on each episode, it shows the votes and the ratings that each episode of the show has received so far. The scatter plot supports both hovering, which shows the episode name, and brushing over the data points in the chart. Brushing populates a table with the relevant information about the selected episodes.
For Season 7 specifically a word cloud is created to see the most frequent words in each o
''' Script for downloading all GLUE data.
Note: for legal reasons, we are unable to host MRPC.
You can either use the version hosted by the SentEval team, which is already tokenized,
or you can download the original data from (https://download.microsoft.com/download/D/4/6/D46FF87A-F6B9-4252-AA8B-3604ED519838/MSRParaphraseCorpus.msi) and extract the data from it manually.
For Windows users, you can run the .msi file. For Mac and Linux users, consider an external library such as 'cabextract' (see below for an example).
You should then rename and place specific files in a folder (see below for an example).
mkdir MRPC
cabextract MSRParaphraseCorpus.msi -d MRPC
import pandas as pd

# reading the dataset
loan_df = pd.read_csv("../input/loan.csv")

# percentage of null values in each of the columns
missing_val = loan_df.isna().sum() / len(loan_df) * 100
print("Columns with more than 60% missing values - ")
print(missing_val[missing_val > 60].sort_values(ascending=False))

# these columns cannot be filled in sensibly, so drop them; note that the cut-off
# applied here is 35% missing, which is stricter than the 60% reported above;
# got rid of ~50 cols with that threshold
loan_df = loan_df.loc[:, loan_df.isna().mean() < .35]
print(loan_df.shape)
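The snippet above reports columns at one missing-value cut-off (60%) and drops them at another (35%). One way to keep the two in step is to route both through a single threshold, as in the minimal sketch below. The helper `drop_sparse_columns` and the toy frame are hypothetical and are not part of the original notebook.

```python
import numpy as np
import pandas as pd


def drop_sparse_columns(df: pd.DataFrame, max_missing: float) -> pd.DataFrame:
    """Keep only columns whose fraction of missing values is below max_missing."""
    missing_frac = df.isna().mean()  # per-column fraction of NaN values
    return df.loc[:, missing_frac < max_missing]


# Toy data: 0%, 50% and 75% missing values respectively.
toy = pd.DataFrame(
    {
        "full": [1, 2, 3, 4],
        "half": [1.0, np.nan, 3.0, np.nan],
        "mostly_empty": [np.nan, np.nan, np.nan, 4.0],
    }
)

kept_loose = drop_sparse_columns(toy, max_missing=0.6)
kept_strict = drop_sparse_columns(toy, max_missing=0.35)
print(list(kept_loose.columns))
print(list(kept_strict.columns))
```

Comparing the two results on the toy frame shows how much the choice of cut-off changes which columns survive.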
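import numpy as np

# correlation matrix of the training features
# (X_train is assumed to be a pandas DataFrame defined elsewhere)
cor = X_train.corr()
# keep only the entries strictly below the main diagonal so each pair appears once
cor.loc[:, :] = np.tril(cor, k=-1)
cor_stack = cor.stack()
print("Column pairs with |corr| greater than 0.7 - ")
print(cor_stack[(cor_stack > 0.70) | (cor_stack < -0.70)])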
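To see what the lower-triangle trick above produces, here is a self-contained sketch on a toy frame. It is a variant rather than the original code: it masks the matrix with `DataFrame.where` instead of overwriting it, and the helper `correlated_pairs` and the column names `a`, `b`, `c` are made up for the example.

```python
import numpy as np
import pandas as pd


def correlated_pairs(df: pd.DataFrame, threshold: float) -> pd.Series:
    """Return each column pair once if its |Pearson r| exceeds threshold."""
    cor = df.corr()
    # True only strictly below the main diagonal: no self-pairs, no mirrored duplicates.
    mask = np.tril(np.ones(cor.shape, dtype=bool), k=-1)
    pairs = cor.where(mask).stack()
    # Masked-out entries are NaN (or already dropped by stack) and never pass this comparison.
    return pairs[pairs.abs() > threshold]


toy = pd.DataFrame(
    {
        "a": [1, 2, 3, 4, 5],
        "b": [2, 4, 6, 8, 10],   # exactly 2 * a, so perfectly correlated with a
        "c": [1, -1, 0, -1, 1],  # constructed to be uncorrelated with a and b
    }
)

high = correlated_pairs(toy, threshold=0.7)
print(high)
```

The result is indexed by (row, column) pairs from the lower triangle, so the `a`/`b` pair is reported once, under `("b", "a")`.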