This project is slightly different than others you have encountered thus far on Codecademy. Instead of a step-by-step tutorial, this project contains a series of open-ended requirements which describe the project you’ll be building. There are many possible ways to correctly fulfill all of these requirements, and you should expect to use the internet, Codecademy, and other resources when you encounter a problem that you cannot easily solve.
-
Roller coasters are thrilling amusement park rides designed to make you squeal and scream! They take you up high, drop you to the ground quickly, and sometimes even spin you upside down before returning to a stop. Today you will be taking control back from the roller coasters and visualizing data covering international roller coaster rankings and roller coaster statistics.
Roller coasters are often split into two main categories based on their construction material: wood or steel. Rankings for the best wood and steel roller coasters from the 2013 to 2018 Golden Ticket Awards are provided in 'Golden_Ticket_Award_Winners_Wood.csv' and 'Golden_Ticket_Award_Winners_Steel.csv', respectively. Load each csv into a DataFrame and inspect it to gain familiarity with the data.
import matplotlib
import matplotlib.pyplot as plt
import pandas as pd
import seaborn as sns
import statsmodels.api as sm
import statsmodels.formula.api as smf
import statsmodels.graphics as smgraph
from matplotlib.ticker import MaxNLocator
from statsmodels.graphics.regressionplots import *
matplotlib.rcdefaults()
plt.rcParams["figure.dpi"] = 140
# load rankings data here:
wood = pd.read_csv("Golden_Ticket_Award_Winners_Wood.csv")
steel = pd.read_csv("Golden_Ticket_Award_Winners_Steel.csv")
# write function to plot rankings over time for 1 roller coaster here:
wood.sample(5)
index | Rank | Name | Park | Location | Supplier | Year Built | Points | Year of Rank |
---|---|---|---|---|---|---|---|---|
29 | 10 | Lightning Racer | Hersheypark | Hershey, Pa. | GCII | 2000 | 421 | 2015 |
175 | 46 | Megafobia | Oakwood | Pembrookshire, Wales | Custom Coasters | 1996 | 84 | 2018 |
100 | 21 | Rampage | Alabama Splash Adventure | Bessemer, Ala. | Custom Coasters | 1998 | 218 | 2017 |
92 | 13 | Goliath | Six Flags Great America | Gurnee, Ill. | Rocky Mountain | 2014 | 269 | 2017 |
1 | 2 | El Toro | Six Flags Great Adventure | Jackson, N.J. | Intamin | 2006 | 1302 | 2013 |
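Beyond sample(), a few other calls help when inspecting a freshly loaded ranking table. The sketch below is illustrative only: it rebuilds three of the rows shown above from an inline string so that it runs without the CSV files, and the variable name rankings is made up for the example.

```python
import io

import pandas as pd

# Three rows copied from the sample output above, inlined so the
# example does not depend on the CSV files being present.
csv_text = """Rank,Name,Park,Location,Supplier,Year Built,Points,Year of Rank
10,Lightning Racer,Hersheypark,"Hershey, Pa.",GCII,2000,421,2015
13,Goliath,Six Flags Great America,"Gurnee, Ill.",Rocky Mountain,2014,269,2017
2,El Toro,Six Flags Great Adventure,"Jackson, N.J.",Intamin,2006,1302,2013
"""
rankings = pd.read_csv(io.StringIO(csv_text))

print(rankings.shape)                      # (rows, columns)
print(rankings.dtypes)                     # type of each column
print(rankings["Year of Rank"].unique())   # which award years are present
print(rankings.isna().sum())               # missing values per column
```

On the real data, the same calls can be made on wood and steel directly.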
-
Write a function that will plot the ranking of a given roller coaster over time as a line. Your function should take a roller coaster’s name and a ranking DataFrame as arguments. Make sure to include informative labels that describe your visualization.
Call your function with "El Toro" as the roller coaster name and the wood ranking DataFrame. What issue do you notice? Update your function with an additional argument to alleviate the problem, and retest your function.
def rank_year(name, park, df):
    """
    Plot the ranking of one roller coaster over time.

    `park` disambiguates coasters that share a name; `df` is a ranking DataFrame.
    """
    # Filter on both name and park so that only one coaster is selected
    coaster = df[(df["Name"] == name) & (df["Park"] == park)]
    plt.plot(coaster["Year of Rank"], coaster["Rank"])
    plt.ylabel("Rank")
    plt.xlabel("Year")
    plt.legend([name])
    plt.title(f"{name}: {park}")
    plt.yticks(range(1, coaster.Rank.max() + 1))
    plt.show()


rank_year("El Toro", "Six Flags Great Adventure", wood)
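The issue the prompt hints at is that a coaster name alone may not identify a single ride, which is why the function above also filters on the park. A minimal sketch of how to check for such collisions, using a made-up toy table (the "Hypothetical Park" row is invented for illustration and is not taken from the dataset):

```python
import pandas as pd

# Toy data: the "Hypothetical Park" row is invented to show a name collision.
toy = pd.DataFrame(
    {
        "Name": ["El Toro", "El Toro", "Goliath"],
        "Park": [
            "Six Flags Great Adventure",
            "Hypothetical Park",
            "Six Flags Great America",
        ],
        "Rank": [2, 40, 13],
        "Year of Rank": [2013, 2013, 2017],
    }
)

# Count distinct parks per coaster name; more than one means the name is ambiguous.
parks_per_name = toy.groupby("Name")["Park"].nunique()
ambiguous = parks_per_name[parks_per_name > 1].index.tolist()
print(ambiguous)
```

Running the same groupby on the wood DataFrame would list the names that need a park to be told apart.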
-
Write a function that will plot the ranking of two given roller coasters over time as lines. Your function should take both roller coasters’ names and a ranking DataFrame as arguments. Make sure to include informative labels that describe your visualization.
Call your function with "El Toro" as one roller coaster name, "Boulder Dash" as the other roller coaster name, and the wood ranking DataFrame. What issue do you notice? Update your function with two additional arguments to alleviate the problem, and retest your function.
def rank_year2(name1, name2, park1, park2, df):
    """
    Time-series plot of the rankings of two roller coasters.
    """
    # Filter on name and park so that each selection is a single coaster
    coaster1 = df[(df["Name"] == name1) & (df["Park"] == park1)]
    coaster2 = df[(df["Name"] == name2) & (df["Park"] == park2)]
    ax = plt.subplot()
    plt.plot(coaster1["Year of Rank"], coaster1["Rank"])
    plt.plot(coaster2["Year of Rank"], coaster2["Rank"])
    plt.ylabel("Rank")
    plt.xlabel("Year")
    plt.legend([name1, name2])
    plt.title("Ranking of Rollercoasters")
    ax.yaxis.set_major_locator(MaxNLocator(integer=True))
    plt.show()


rank_year2("El Toro", "Boulder Dash", "Six Flags Great Adventure", "Lake Compounce", wood)
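Because rank 1 is the best result, ranking plots are often easier to read with the y-axis inverted so that the top coaster sits at the top of the figure. A small sketch of that option, using invented rank values and the non-interactive Agg backend so that it runs without a display. It is an alternative presentation, not what the functions above do.

```python
import matplotlib

matplotlib.use("Agg")  # non-interactive backend; no window is opened
import matplotlib.pyplot as plt
from matplotlib.ticker import MaxNLocator

# Invented example ranks, for illustration only.
years = [2013, 2014, 2015]
ranks = [2, 1, 3]

fig, ax = plt.subplots()
ax.plot(years, ranks, marker="o", label="Example coaster")
ax.invert_yaxis()  # rank 1 now appears at the top of the plot
ax.xaxis.set_major_locator(MaxNLocator(integer=True))  # whole-number years
ax.yaxis.set_major_locator(MaxNLocator(integer=True))  # whole-number ranks
ax.set_xlabel("Year")
ax.set_ylabel("Rank")
ax.legend()
```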
-
Write a function that will plot the ranking of the top n ranked roller coasters over time as lines. Your function should take a number n and a ranking DataFrame as arguments. Make sure to include informative labels that describe your visualization.
For example, if n == 5, your function should plot a line for each roller coaster that has a rank of 5 or lower.
Call your function with a value for n and either the wood ranking or steel ranking DataFrame.
def top_ranked(n, df):
    """
    Returns a plot of top ranked rollercoasters, where `n` is the lowest rank.
    """
    n_df = df.query("Rank <= @n")
    n_df = n_df.dropna()
    fig, ax = plt.subplots(figsize=(10, 10))
    for coaster in set(n_df.Name):
        coaster_rankings = n_df.query("Name == @coaster")
        ax.plot(coaster_rankings["Year of Rank"], coaster_rankings.Rank, label=coaster)
    ax.yaxis.set_major_locator(MaxNLocator(integer=True))
    plt.title(f"Top {n} Ranked Rollercoasters")
    plt.xlabel("Year")
    plt.ylabel("Ranking")
    plt.legend()
    return plt.show()


top_ranked(5, wood)
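An alternative to looping over names is to pivot the filtered rankings into a wide table with one column per coaster. The sketch below only builds that table from invented toy ranks, and it assumes each name appears at most once per year; calling wide.plot() on the result would draw one line per coaster. A coaster that drops out of the top n in some year shows up as NaN, which leaves a gap in its line.

```python
import pandas as pd

# Invented toy rankings, for illustration only.
toy = pd.DataFrame(
    {
        "Name": ["A", "B", "C", "A", "B", "C"],
        "Rank": [1, 2, 3, 2, 1, 4],
        "Year of Rank": [2013, 2013, 2013, 2014, 2014, 2014],
    }
)

n = 3
top_n = toy[toy["Rank"] <= n]

# Rows: years, columns: coaster names, values: rank in that year.
wide = top_n.pivot(index="Year of Rank", columns="Name", values="Rank")
print(wide)
```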
-
Now that you’ve visualized rankings over time, let’s dive into the actual statistics of roller coasters themselves. Captain Coaster is a popular site for recording roller coaster information. Data on all roller coasters documented on Captain Coaster has been accessed through its API and stored in roller_coasters.csv. Load the data from the csv into a DataFrame and inspect it to gain familiarity with the data. Open the hint for more information about each column of the dataset.
captain_coaster = pd.read_csv("roller_coasters.csv")
captain_coaster.sample(5)
index | name | material_type | seating_type | speed | height | length | num_inversions | manufacturer | park | status |
---|---|---|---|---|---|---|---|---|---|---|
1999 | Big Apple | na | Sit Down | NaN | NaN | NaN | NaN | D.P.V. Rides | The Milky Way Adventure Park | status.operating |
202 | Infernal Toboggan | Steel | Sit Down | NaN | 11.0 | 335.0 | 0.0 | S.D.C. | Foire | status.operating |
70 | Blue Tornado | Steel | Inverted | 80.0 | 33.0 | 765.0 | 5.0 | Vekoma | Gardaland | status.operating |
953 | Pandemonium | Steel | Spinning | 50.0 | 16.0 | 412.0 | 0.0 | Gerstlauer | Six Flags Fiesta Texas | status.operating |
2320 | Happy Angel | na | Inverted | 87.0 | NaN | NaN | 6.0 | Golden Horse | Heilongjiang Wanda Theme Park | status.operating |
-
Write a function that plots a histogram of any numeric column of the roller coaster DataFrame. Your function should take a DataFrame and a column name for which a histogram should be constructed as arguments. Make sure to include informative labels that describe your visualization.
Call your function with the roller coaster DataFrame and one of the column names.
captain_coaster.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 2802 entries, 0 to 2801
Data columns (total 10 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 name 2799 non-null object
1 material_type 2802 non-null object
2 seating_type 2802 non-null object
3 speed 1478 non-null float64
4 height 1667 non-null float64
5 length 1675 non-null float64
6 num_inversions 2405 non-null float64
7 manufacturer 2802 non-null object
8 park 2802 non-null object
9 status 2802 non-null object
dtypes: float64(4), object(6)
memory usage: 219.0+ KB
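def histogram(df, column):
    """
    Plot a histogram of one numeric column of a DataFrame.
    """
    # Drop missing values in this column only, so that rows which merely
    # lack other fields are still counted
    values = df[column].dropna()
    plt.hist(values)
    plt.legend([column])
    plt.xlabel(column)
    plt.ylabel("Number of roller coasters")
    plt.title(f"Distribution of {column}")
    plt.show()


histogram(captain_coaster, "speed")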
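One thing to watch with dropna(): called on the whole DataFrame, it discards every row that has a missing value in any column, not just the one being plotted. In the info() output above, speed has 1478 non-null values, while the regressions later in this notebook, which run on captain_coaster.dropna(), report only 1279 observations. A toy sketch of the difference (values invented):

```python
import numpy as np
import pandas as pd

# Invented toy data with missing values scattered across columns.
toy = pd.DataFrame(
    {
        "speed": [80.0, 50.0, np.nan, 87.0],
        "height": [33.0, np.nan, 11.0, np.nan],
    }
)

rows_all = len(toy.dropna())                    # only fully complete rows
rows_speed = len(toy.dropna(subset=["speed"]))  # rows where speed is present
print(rows_all, rows_speed)
```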
-
Write a function that creates a bar chart showing the number of inversions for each roller coaster at an amusement park. Your function should take the roller coaster DataFrame and an amusement park name as arguments. Make sure to include informative labels that describe your visualization.
Call your function with the roller coaster DataFrame and an amusement park name.
test = "Walibi Belgium"
captain_coaster.query("park == @test & num_inversions > 0").sort_values(
    "num_inversions", ascending=False
)
index | name | material_type | seating_type | speed | height | length | num_inversions | manufacturer | park | status |
---|---|---|---|---|---|---|---|---|---|---|
44 | Vampire | Steel | Inverted | 80.0 | 33.0 | 689.0 | 5.0 | Vekoma | Walibi Belgium | status.operating |
11 | Cobra | Steel | Sit Down | 76.0 | 36.0 | 285.0 | 3.0 | Vekoma | Walibi Belgium | status.operating |
957 | Tornado | Steel | Sit Down | 64.0 | 23.0 | 725.0 | 2.0 | Vekoma | Walibi Belgium | status.closed.definitely |
39 | Psyké underground | Steel | Sit Down | 85.0 | 42.0 | 260.0 | 1.0 | Schwarzkopf | Walibi Belgium | status.operating |
def barchart_inversions(df, park):
    """
    Bar chart plotting number of inversions given DataFrame and park name.
    """
    df = df.dropna()
    result = df.query("park == @park & num_inversions > 0").sort_values(
        "num_inversions", ascending=False
    )
    plt.figure(figsize=(10, 7.5))
    ax = plt.subplot()
    plt.bar(result["name"], result.num_inversions)
    # Set tick positions before tick labels so the labels line up with the bars
    ax.set_xticks(range(len(result["name"])))
    ax.set_xticklabels(result["name"])
    plt.xticks(rotation=30)
    plt.ylabel("Number of inversions")
    plt.legend([park])
    plt.title(f"Number of Inversions per Rollercoaster: {park}")
    plt.tight_layout()
    plt.show()


barchart_inversions(captain_coaster, "Walibi Belgium")
-
Write a function that creates a pie chart that compares the number of operating roller coasters ('status.operating') to the number of closed roller coasters ('status.closed.definitely'). Your function should take the roller coaster DataFrame as an argument. Make sure to include informative labels that describe your visualization.
Call your function with the roller coaster DataFrame.
# Remove the "status." prefix (the dot is escaped so that it matches literally)
captain_coaster.status = captain_coaster.status.replace(r"^status\.", "", regex=True)
# Replace the `.` after "closed" with a space
captain_coaster.status = captain_coaster.status.replace(
    r"closed\.", "closed ", regex=True
)
captain_coaster.status.value_counts(normalize=True)
operating 0.775161
closed definitely 0.156674
announced 0.014989
construction 0.014632
unknown 0.012134
closed temporarily 0.008922
relocated 0.007852
retracked 0.005710
rumored 0.003926
Name: status, dtype: float64
def pie_operation(df):
    """
    Pie chart comparing operating and definitely closed roller coasters.
    """
    statuses = ["operating", "closed definitely"]
    # Reindex so the counts follow the same order as the labels below
    counts = df.status.value_counts().reindex(statuses, fill_value=0)
    plt.pie(counts, autopct="%0.1f%%", labels=["Operating", "Closed"])
    plt.title("Rollercoasters: Operating vs Closed")
    plt.axis("equal")
    plt.show()


pie_operation(captain_coaster)
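value_counts() returns counts sorted from most to least frequent, so hard-coded labels only line up with the wedges if that order is known in advance. A sketch of taking the labels from the counts themselves, on invented toy statuses:

```python
import pandas as pd

# Invented toy statuses, for illustration only.
status = pd.Series(
    ["closed definitely", "operating", "closed definitely", "closed definitely"]
)

counts = status.value_counts()

# Use the index of the counts as labels so each wedge is named correctly,
# whichever category happens to be larger.
labels = counts.index.tolist()
sizes = counts.tolist()
print(dict(zip(labels, sizes)))
```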
-
.scatter() is another useful function in matplotlib that you might not have seen before. .scatter() produces a scatter plot, which is similar to .plot() in that it plots points on a figure. .scatter(), however, does not connect the points with a line. This allows you to analyze the relationship between two variables. See the matplotlib documentation for .scatter() for details.
Write a function that creates a scatter plot of two numeric columns of the roller coaster DataFrame. Your function should take the roller coaster DataFrame and two column names as arguments. Make sure to include informative labels that describe your visualization.
Call your function with the roller coaster DataFrame and two column names.
captain_coaster.describe()
statistic | speed | height | length | num_inversions |
---|---|---|---|---|
count | 1478.000000 | 1667.000000 | 1675.000000 | 2405.000000 |
mean | 70.102842 | 26.725855 | 606.147463 | 0.809563 |
std | 28.338394 | 35.010166 | 393.840496 | 1.652254 |
min | 0.000000 | 0.000000 | -1.000000 | 0.000000 |
25% | 47.000000 | 13.000000 | 335.000000 | 0.000000 |
50% | 72.000000 | 23.000000 | 500.000000 | 0.000000 |
75% | 88.000000 | 35.000000 | 839.000000 | 1.000000 |
max | 240.000000 | 902.000000 | 2920.000000 | 14.000000 |
def coaster_scatter(df, column_x, column_y):
    """
    Plots relationship between two variables and returns the fitted trendline.
    """
    import numpy as np

    df = df.dropna()
    # Exclude implausible heights (describe() shows a maximum of 902)
    df = df.query("height < 140")
    plt.figure()
    plt.scatter(df[column_x], df[column_y], alpha=0.1)
    plt.xlabel(column_x)
    plt.ylabel(column_y)
    plt.title(f"Rollercoaster: Relationship {column_x} vs {column_y}")
    # Fit a straight line (degree-1 polynomial) and draw it over the points
    trend = np.polyfit(df[column_x], df[column_y], 1)
    trendline = np.poly1d(trend)
    plt.plot(df[column_x], trendline(df[column_x]), "r--")
    return trendline


coaster_scatter(captain_coaster, "height", "speed")
poly1d([ 1.43975908, 31.93321946])
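The returned poly1d holds the fitted slope and intercept, so the output above reads as roughly speed = 1.44 * height + 31.93. A minimal sketch of how np.polyfit and np.poly1d fit together, using invented points that lie exactly on a known line so the recovered coefficients can be checked:

```python
import numpy as np

# Invented points lying exactly on y = 2x + 1.
x = np.array([0.0, 1.0, 2.0, 3.0])
y = 2.0 * x + 1.0

coeffs = np.polyfit(x, y, 1)  # degree 1 -> [slope, intercept]
line = np.poly1d(coeffs)      # callable polynomial built from the coefficients

slope, intercept = coeffs
print(slope, intercept)
print(line(10.0))             # evaluate the fitted line at x = 10
```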
# Correlation heatmap of the numeric columns
sns.heatmap(
    captain_coaster.dropna().select_dtypes("number").corr(),
    cmap="seismic",
    annot=True,
    vmin=-1,
    vmax=1,
)
<matplotlib.axes._subplots.AxesSubplot at 0x7f8c555e6d10>
df = captain_coaster.dropna()
regression = smf.ols("speed ~ length", data=df).fit()
print(regression.summary())
OLS Regression Results
==============================================================================
Dep. Variable: speed R-squared: 0.446
Model: OLS Adj. R-squared: 0.445
Method: Least Squares F-statistic: 1027.
Date: Tue, 21 Jul 2020 Prob (F-statistic): 7.72e-166
Time: 06:31:08 Log-Likelihood: -5704.5
No. Observations: 1279 AIC: 1.141e+04
Df Residuals: 1277 BIC: 1.142e+04
Df Model: 1
Covariance Type: nonrobust
==============================================================================
coef std err t P>|t| [0.025 0.975]
------------------------------------------------------------------------------
Intercept 41.2253 1.114 37.022 0.000 39.041 43.410
length 0.0473 0.001 32.046 0.000 0.044 0.050
==============================================================================
Omnibus: 278.245 Durbin-Watson: 1.832
Prob(Omnibus): 0.000 Jarque-Bera (JB): 939.152
Skew: 1.046 Prob(JB): 1.16e-204
Kurtosis: 6.639 Cond. No. 1.43e+03
==============================================================================
Warnings:
[1] Standard Errors assume that the covariance matrix of the errors is correctly specified.
[2] The condition number is large, 1.43e+03. This might indicate that there are
strong multicollinearity or other numerical problems.
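Reading the coefficient table as an equation gives predicted speed = 41.2253 + 0.0473 * length. In a simple regression with a single predictor, R-squared is also the square of the correlation between the two variables. A quick arithmetic check using only the numbers printed above (the length of 500 is an arbitrary example value):

```python
import math

# Coefficients and R-squared copied from the summary above.
intercept = 41.2253
slope = 0.0473
r_squared = 0.446

length = 500  # arbitrary example length
predicted_speed = intercept + slope * length

# The slope is positive, so the correlation is the positive square root.
implied_correlation = math.sqrt(r_squared)

print(round(predicted_speed, 2))
print(round(implied_correlation, 3))
```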
Here we use the regression results to test for outliers. By default, outlier_test() applies a Bonferroni correction (method="bonf"). We only display the rows whose Bonferroni-adjusted p-value, the third column, is less than 0.05.
test = regression.outlier_test()
print("Bad Data Points")
test[test["bonf(p)"] < 0.05]
Bad Data Points
index | student_resid | unadj_p | bonf(p) |
---|---|---|---|
138 | 5.795402 | 8.578726e-09 | 0.000011 |
143 | 5.329034 | 1.166309e-07 | 0.000149 |
160 | 4.865897 | 1.281502e-06 | 0.001639 |
246 | 4.875219 | 1.223498e-06 | 0.001565 |
1397 | 5.041705 | 5.279044e-07 | 0.000675 |
1751 | 4.921659 | 9.702364e-07 | 0.001241 |
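The bonf(p) column is the unadjusted p-value multiplied by the number of tests (capped at 1), which is the Bonferroni correction. Taking the number of tests to be the 1279 observations reported in the regression summary, the first two rows of the table can be reproduced by hand:

```python
n_obs = 1279  # "No. Observations" from the regression summary

# Unadjusted p-values copied from the first two rows of the table above.
unadj_p = [8.578726e-09, 1.166309e-07]

# Bonferroni: multiply each p-value by the number of tests, cap at 1.
bonf_p = [min(1.0, p * n_obs) for p in unadj_p]
print([round(p, 6) for p in bonf_p])
```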
figure = smgraph.regressionplots.plot_fit(regression, 1)
line = smgraph.regressionplots.abline_plot(model_results=regression, ax=figure.axes[0])
fig, ax = plt.subplots(figsize=(12, 8))
fig = sm.graphics.influence_plot(regression, alpha=0.05, ax=ax, criterion="cooks")
Y = df.speed
X = df.length
X = sm.add_constant(X)
model = sm.OLS(Y, X)
results = model.fit()
print(results.summary2())
Results: Ordinary least squares
===================================================================
Model: OLS Adj. R-squared: 0.445
Dependent Variable: speed AIC: 11413.0164
Date: 2020-07-21 06:31 BIC: 11423.3240
No. Observations: 1279 Log-Likelihood: -5704.5
Df Model: 1 F-statistic: 1027.
Df Residuals: 1277 Prob (F-statistic): 7.72e-166
R-squared: 0.446 Scale: 438.76
---------------------------------------------------------------------
Coef. Std.Err. t P>|t| [0.025 0.975]
---------------------------------------------------------------------
const 41.2253 1.1135 37.0220 0.0000 39.0408 43.4099
length 0.0473 0.0015 32.0460 0.0000 0.0444 0.0502
-------------------------------------------------------------------
Omnibus: 278.245 Durbin-Watson: 1.832
Prob(Omnibus): 0.000 Jarque-Bera (JB): 939.152
Skew: 1.046 Prob(JB): 0.000
Kurtosis: 6.639 Condition No.: 1433
===================================================================
* The condition number is large (1e+03). This might indicate
strong multicollinearity or other numerical problems.
coaster_scatter(captain_coaster, "length", "speed")
poly1d([ 0.04729325, 41.27655592])
df.length.std()
396.61095477849716
# Keep rows whose length lies strictly within 2 standard deviations of the mean
length_mean = df.length.mean()
length_std = df.length.std()
within_2_std = (df.length > length_mean - 2 * length_std) & (
    df.length < length_mean + 2 * length_std
)
filtered_df = df[within_2_std]
print("After removing outliers...")
coaster_scatter(filtered_df, "length", "speed")
After removing outliers...
poly1d([ 0.05015783, 39.79244512])
regression = smf.ols("speed ~ length", data=filtered_df).fit()
print(regression.summary())
OLS Regression Results
==============================================================================
Dep. Variable: speed R-squared: 0.394
Model: OLS Adj. R-squared: 0.394
Method: Least Squares F-statistic: 794.6
Date: Tue, 21 Jul 2020 Prob (F-statistic): 4.28e-135
Time: 06:31:14 Log-Likelihood: -5440.5
No. Observations: 1224 AIC: 1.089e+04
Df Residuals: 1222 BIC: 1.090e+04
Df Model: 1
Covariance Type: nonrobust
==============================================================================
coef std err t P>|t| [0.025 0.975]
------------------------------------------------------------------------------
Intercept 39.7370 1.211 32.813 0.000 37.361 42.113
length 0.0502 0.002 28.190 0.000 0.047 0.054
==============================================================================
Omnibus: 287.517 Durbin-Watson: 1.856
Prob(Omnibus): 0.000 Jarque-Bera (JB): 886.712
Skew: 1.160 Prob(JB): 2.84e-193
Kurtosis: 6.464 Cond. No. 1.40e+03
==============================================================================
Warnings:
[1] Standard Errors assume that the covariance matrix of the errors is correctly specified.
[2] The condition number is large, 1.4e+03. This might indicate that there are
strong multicollinearity or other numerical problems.
# b_status: 1 if the coaster is operating, 0 otherwise
# assign() returns a new DataFrame, which avoids pandas' SettingWithCopyWarning
filtered_df = filtered_df.assign(
    b_status=(filtered_df["status"] == "operating").astype(int)
)
filtered_df = filtered_df.query('material_type != "na"')
OLS2 = smf.ols(
    formula="speed ~ material_type + seating_type + height + length + num_inversions + b_status - 1",
    data=filtered_df,
).fit()
print(OLS2.summary())
OLS Regression Results
==============================================================================
Dep. Variable: speed R-squared: 0.499
Model: OLS Adj. R-squared: 0.491
Method: Least Squares F-statistic: 62.20
Date: Tue, 21 Jul 2020 Prob (F-statistic): 1.67e-154
Time: 06:31:15 Log-Likelihood: -4971.9
No. Observations: 1145 AIC: 9982.
Df Residuals: 1126 BIC: 1.008e+04
Df Model: 18
Covariance Type: nonrobust
=================================================================================================
coef std err t P>|t| [0.025 0.975]
-------------------------------------------------------------------------------------------------
material_type[Hybrid] 44.8927 7.503 5.984 0.000 30.172 59.613
material_type[Steel] 44.6257 5.075 8.793 0.000 34.668 54.583
material_type[Wooden] 44.7924 5.616 7.975 0.000 33.773 55.812
seating_type[T.Bobsleigh] -17.6204 7.925 -2.223 0.026 -33.170 -2.071
seating_type[T.Floorless] -5.0932 6.511 -0.782 0.434 -17.868 7.681
seating_type[T.Flying] -16.7451 6.403 -2.615 0.009 -29.309 -4.181
seating_type[T.Inverted] -11.9431 5.360 -2.228 0.026 -22.459 -1.427
seating_type[T.Motorbike] -0.9354 7.677 -0.122 0.903 -15.999 14.128
seating_type[T.Pipeline] -3.6534 11.905 -0.307 0.759 -27.013 19.706
seating_type[T.Sit Down] -6.5955 4.905 -1.345 0.179 -16.219 3.028
seating_type[T.Spinning] -11.2156 5.534 -2.027 0.043 -22.074 -0.357
seating_type[T.Stand Up] -7.2007 6.863 -1.049 0.294 -20.666 6.265
seating_type[T.Suspended] -11.8223 6.300 -1.877 0.061 -24.183 0.539
seating_type[T.Water Coaster] 0.8739 6.983 0.125 0.900 -12.827 14.575
seating_type[T.Wing] 5.0534 7.745 0.652 0.514 -10.142 20.249
height 0.1226 0.014 8.641 0.000 0.095 0.150
length 0.0421 0.002 20.452 0.000 0.038 0.046
num_inversions 3.0655 0.370 8.281 0.000 2.339 3.792
b_status -0.0278 1.425 -0.020 0.984 -2.823 2.767
==============================================================================
Omnibus: 284.789 Durbin-Watson: 1.890
Prob(Omnibus): 0.000 Jarque-Bera (JB): 2419.285
Skew: 0.898 Prob(JB): 0.00
Kurtosis: 9.891 Cond. No. 2.42e+04
==============================================================================
Warnings:
[1] Standard Errors assume that the covariance matrix of the errors is correctly specified.
[2] The condition number is large, 2.42e+04. This might indicate that there are
strong multicollinearity or other numerical problems.
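The - 1 at the end of the formula removes the intercept. That is why all three material_type levels get their own coefficient, while seating_type is reported relative to a reference level that does not appear in the table (the T. prefix marks this treatment coding). pandas.get_dummies can illustrate the two encodings on an invented toy column. This is an analogy for what the formula machinery does, not the code statsmodels runs internally.

```python
import pandas as pd

# Invented toy column, for illustration only.
material = pd.Series(["Steel", "Wooden", "Hybrid", "Steel"], name="material_type")

# One indicator column per level: comparable to a categorical in a no-intercept model.
full = pd.get_dummies(material)

# First level dropped and used as the reference: comparable to treatment coding.
reference_coded = pd.get_dummies(material, drop_first=True)

print(list(full.columns))
print(list(reference_coded.columns))
```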
filtered_df.sort_values(by="speed", ascending=False)
index | name | material_type | seating_type | speed | height | length | num_inversions | manufacturer | park | status | b_status |
---|---|---|---|---|---|---|---|---|---|---|---|
138 | Kingda Ka | Steel | Sit Down | 206.0 | 139.0 | 950.0 | 0.0 | Intamin | Six Flags Great Adventure | operating | 1 |
143 | Top Thrill Dragster | Steel | Sit Down | 192.0 | 128.0 | 853.0 | 0.0 | Intamin | Cedar Point | operating | 1 |
1751 | Red Force | Steel | Sit Down | 185.0 | 112.0 | 880.0 | 0.0 | Intamin | Ferrari Land | operating | 1 |
140 | Do-Dodonpa | Steel | Sit Down | 172.0 | 52.0 | 1189.0 | 0.0 | S&S | Fuji-Q Highland | operating | 1 |
246 | Tower of Terror II | Steel | Sit Down | 160.0 | 115.0 | 372.0 | 0.0 | Intamin | Dreamworld | operating | 1 |
... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
2505 | Unnamed Hyper Coaster | Steel | Sit Down | 0.0 | 0.0 | 0.0 | 0.0 | B&M | Hot Go Dreamworld | construction | 0 |
1835 | Dragonfire | Steel | Sit Down | 0.0 | 60.0 | 0.0 | 0.0 | Premier Rides | Adventure Island | operating | 1 |
2516 | Sons of Anarchy & Weyland Yutani | Steel | Sit Down | 0.0 | 0.0 | 0.0 | 0.0 | na | 20th Century Fox World | announced | 0 |
2517 | Wings Over Rio | Steel | Sit Down | 0.0 | 0.0 | 0.0 | 0.0 | na | 20th Century Fox World | announced | 0 |
2515 | Alien vs Predator | Steel | Sit Down | 0.0 | 0.0 | 0.0 | 0.0 | na | 20th Century Fox World | announced | 0 |
1145 rows × 11 columns
-
Part of the fun of data analysis and visualization is digging into the data you have and answering questions that come to your mind.
Some questions you might want to answer with the datasets provided include:
- What roller coaster seating type is most popular? And do different seating types result in higher/faster/longer roller coasters?
- Do roller coaster manufacturers have any specialties (do they focus on speed, height, seating type, or inversions)?
- Do amusement parks have any specialties?
What visualizations can you create that answer these questions, and any others that come to you? Share the questions you ask and the accompanying visualizations you create on the Codecademy forums.
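As a starting point for the first question, a groupby can count coasters per seating type and average a numeric column within each group. The sketch below uses four rows copied from the sample and Walibi Belgium outputs earlier in this notebook; on the real data, captain_coaster would take the place of the toy frame.

```python
import pandas as pd

# Four rows copied from outputs shown earlier in this notebook.
toy = pd.DataFrame(
    {
        "name": ["Blue Tornado", "Pandemonium", "Vampire", "Cobra"],
        "seating_type": ["Inverted", "Spinning", "Inverted", "Sit Down"],
        "speed": [80.0, 50.0, 80.0, 76.0],
    }
)

# How many coasters use each seating type, most common first.
popularity = toy["seating_type"].value_counts()

# Average speed within each seating type.
mean_speed = toy.groupby("seating_type")["speed"].mean()

print(popularity)
print(mean_speed)
```

Either result can be passed straight to a bar chart, for example with popularity.plot(kind="bar").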