@danielcs88
Last active July 21, 2020 11:17

Roller Coaster

Overview

This project is slightly different than others you have encountered thus far on Codecademy. Instead of a step-by-step tutorial, this project contains a series of open-ended requirements which describe the project you’ll be building. There are many possible ways to correctly fulfill all of these requirements, and you should expect to use the internet, Codecademy, and other resources when you encounter a problem that you cannot easily solve.

Download

Instructions

  1. Roller coasters are thrilling amusement park rides designed to make you squeal and scream! They take you up high, drop you to the ground quickly, and sometimes even spin you upside down before returning to a stop. Today you will be taking control back from the roller coasters and visualizing data covering international roller coaster rankings and roller coaster statistics.

    Roller coasters are often split into two main categories based on their construction material: wood or steel. Rankings for the best wood and steel roller coasters from the 2013 to 2018 Golden Ticket Awards are provided in 'Golden_Ticket_Award_Winners_Wood.csv' and 'Golden_Ticket_Award_Winners_Steel.csv', respectively. Load each csv into a DataFrame and inspect it to gain familiarity with the data.

import matplotlib
import matplotlib.pyplot as plt
import pandas as pd
import seaborn as sns
import statsmodels.api as sm
import statsmodels.formula.api as smf
import statsmodels.graphics as smgraph
from matplotlib.ticker import MaxNLocator
from statsmodels.graphics.regressionplots import *

matplotlib.rcdefaults()

plt.rcParams["figure.dpi"] = 140

# load rankings data here:

wood = pd.read_csv("Golden_Ticket_Award_Winners_Wood.csv")
steel = pd.read_csv("Golden_Ticket_Award_Winners_Steel.csv")

# inspect a random sample of the wood rankings:

wood.sample(5)
      Rank  Name             Park                       Location              Supplier         Year Built  Points  Year of Rank
29      10  Lightning Racer  Hersheypark                Hershey, Pa.          GCII                   2000     421          2015
175     46  Megafobia        Oakwood                    Pembrookshire, Wales  Custom Coasters        1996      84          2018
100     21  Rampage          Alabama Splash Adventure   Bessemer, Ala.        Custom Coasters        1998     218          2017
92      13  Goliath          Six Flags Great America    Gurnee, Ill.          Rocky Mountain         2014     269          2017
1        2  El Toro          Six Flags Great Adventure  Jackson, N.J.         Intamin                2006    1302          2013
  1. Write a function that will plot the ranking of a given roller coaster over time as a line. Your function should take a roller coaster’s name and a ranking DataFrame as arguments. Make sure to include informative labels that describe your visualization.

    Call your function with "El Toro" as the roller coaster name and the wood ranking DataFrame. What issue do you notice? Update your function with an additional argument to alleviate the problem, and retest your function.

def rank_year(name, park, df):
    """
    Plot the ranking of one roller coaster over time.

    `park` identifies which coaster is meant when a name is not unique;
    `df` is a rankings DataFrame such as `wood`.
    """

    coaster = df[(df["Name"] == name) & (df["Park"] == park)]
    plt.plot(coaster["Year of Rank"], coaster["Rank"])
    plt.ylabel("Rank")
    plt.xlabel("Year")
    plt.legend([name])
    plt.title(f"{name}: {park}")
    # Ranks are whole numbers, so show one tick per rank
    plt.yticks(range(1, coaster.Rank.max() + 1))
    plt.show()


rank_year("El Toro", "Six Flags Great Adventure", wood)

png
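One way to see the issue the prompt hints at is to count how many distinct parks each coaster name appears at. The sketch below is only an illustration: the rows and the park names "Park A", "Park B", and "Park C" are made up, and only the column names follow the rankings CSV.

```python
import pandas as pd

# Hypothetical rows that mimic the columns of the rankings CSV.
rankings = pd.DataFrame(
    {
        "Name": ["El Toro", "El Toro", "El Toro", "Boulder Dash"],
        "Park": ["Park A", "Park B", "Park A", "Park C"],
        "Rank": [1, 20, 2, 3],
        "Year of Rank": [2013, 2013, 2014, 2013],
    }
)

# Number of distinct parks per coaster name; more than one means
# the name alone does not identify a single coaster.
parks_per_name = rankings.groupby("Name")["Park"].nunique()
ambiguous = parks_per_name[parks_per_name > 1].index.tolist()
print(ambiguous)
```

Any name that shows up in `ambiguous` needs the park (or some other column) to pick out a single coaster, which is what the extra function argument is for.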

  1. Write a function that will plot the ranking of two given roller coasters over time as lines. Your function should take both roller coasters’ names and a ranking DataFrame as arguments. Make sure to include informative labels that describe your visualization.

    Call your function with "El Toro" as one roller coaster name, "Boulder Dash" as the other roller coaster name, and the wood ranking DataFrame. What issue do you notice? Update your function with two additional arguments to alleviate the problem, and retest your function.

def rank_year2(name1, name2, park1, park2, df):
    """
    Plot the rankings of two roller coasters over time.

    Each park argument identifies which coaster is meant when a name is
    not unique; `df` is a rankings DataFrame such as `wood`.
    """

    coaster1 = df[(df["Name"] == name1) & (df["Park"] == park1)]
    coaster2 = df[(df["Name"] == name2) & (df["Park"] == park2)]
    ax = plt.subplot()
    ax.plot(coaster1["Year of Rank"], coaster1["Rank"], label=f"{name1} ({park1})")
    ax.plot(coaster2["Year of Rank"], coaster2["Rank"], label=f"{name2} ({park2})")
    ax.set_ylabel("Rank")
    ax.set_xlabel("Year")
    ax.legend()
    ax.set_title("Ranking of Rollercoasters")
    # Ranks are whole numbers, so force integer ticks on the y-axis
    ax.yaxis.set_major_locator(MaxNLocator(integer=True))
    plt.show()


rank_year2("El Toro", "Boulder Dash", "Six Flags Great Adventure", "Lake Compounce", wood)

png

  1. Write a function that will plot the ranking of the top n ranked roller coasters over time as lines. Your function should take a number n and a ranking DataFrame as arguments. Make sure to include informative labels that describe your visualization.

    For example, if n == 5, your function should plot a line for each roller coaster that has a rank of 5 or lower.

    Call your function with a value for n and either the wood ranking or steel ranking DataFrame.

def top_ranked(n, df):
    """
    Returns a plot of top ranked rollercoasters, where `n` is the lowest rank.
    """

    n_df = df.query("Rank <= @n")

    n_df = n_df.dropna()

    fig, ax = plt.subplots(figsize=(10, 10))

    # Group by name and park so coasters that share a name get separate lines
    for (name, park), rows in n_df.groupby(["Name", "Park"]):
        ax.plot(rows["Year of Rank"], rows.Rank, label=f"{name} ({park})")

    ax.yaxis.set_major_locator(MaxNLocator(integer=True))
    plt.title(f"Top {n} Ranked Rollercoasters")
    plt.xlabel("Year")
    plt.ylabel("Ranking")
    plt.legend()

    return plt.show()


top_ranked(5, wood)

png
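Because rank 1 is the best position, a ranking plot can be easier to read with the y-axis inverted so that better ranks sit higher. This is a presentation choice rather than something the project asks for. The sketch below uses invented coasters and ranks, and the `Agg` backend line is an assumption so that it runs without a display.

```python
import matplotlib

matplotlib.use("Agg")  # assumption: render off-screen; not needed in a notebook
import matplotlib.pyplot as plt
from matplotlib.ticker import MaxNLocator

# Invented rankings for two hypothetical coasters.
years = [2013, 2014, 2015]
ranks = {"Coaster A": [1, 2, 1], "Coaster B": [3, 1, 2]}

fig, ax = plt.subplots()
for name, values in ranks.items():
    ax.plot(years, values, label=name)

ax.invert_yaxis()  # rank 1 now appears at the top of the plot
ax.xaxis.set_major_locator(MaxNLocator(integer=True))
ax.yaxis.set_major_locator(MaxNLocator(integer=True))
ax.set_xlabel("Year")
ax.set_ylabel("Rank")
ax.set_title("Rankings with rank 1 at the top")
ax.legend()
```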

  1. Now that you’ve visualized rankings over time, let’s dive into the actual statistics of roller coasters themselves. Captain Coaster is a popular site for recording roller coaster information. Data on all roller coasters documented on Captain Coaster has been accessed through its API and stored in roller_coasters.csv. Load the data from the csv into a DataFrame and inspect it to gain familiarity with the data.

    Open the hint for more information about each column of the dataset.

captain_coaster = pd.read_csv("roller_coasters.csv")

captain_coaster.sample(5)
      name               material_type  seating_type  speed  height  length  num_inversions  manufacturer  park                           status
1999  Big Apple          na             Sit Down        NaN     NaN     NaN             NaN  D.P.V. Rides  The Milky Way Adventure Park   status.operating
202   Infernal Toboggan  Steel          Sit Down        NaN    11.0   335.0             0.0  S.D.C.        Foire                          status.operating
70    Blue Tornado       Steel          Inverted       80.0    33.0   765.0             5.0  Vekoma        Gardaland                      status.operating
953   Pandemonium        Steel          Spinning       50.0    16.0   412.0             0.0  Gerstlauer    Six Flags Fiesta Texas         status.operating
2320  Happy Angel        na             Inverted       87.0     NaN     NaN             6.0  Golden Horse  Heilongjiang Wanda Theme Park  status.operating
  1. Write a function that plots a histogram of any numeric column of the roller coaster DataFrame. Your function should take a DataFrame and a column name for which a histogram should be constructed as arguments. Make sure to include informative labels that describe your visualization.

    Call your function with the roller coaster DataFrame and one of the column names.

captain_coaster.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 2802 entries, 0 to 2801
Data columns (total 10 columns):
 #   Column          Non-Null Count  Dtype  
---  ------          --------------  -----  
 0   name            2799 non-null   object 
 1   material_type   2802 non-null   object 
 2   seating_type    2802 non-null   object 
 3   speed           1478 non-null   float64
 4   height          1667 non-null   float64
 5   length          1675 non-null   float64
 6   num_inversions  2405 non-null   float64
 7   manufacturer    2802 non-null   object 
 8   park            2802 non-null   object 
 9   status          2802 non-null   object 
dtypes: float64(4), object(6)
memory usage: 219.0+ KB
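def histogram(df, column):
    """
    Plot a histogram of one numeric column of a DataFrame.
    """

    # Drop missing values in this column only, so rows that lack
    # some other column still count
    values = df[column].dropna()

    plt.hist(values)
    plt.title(f"Distribution of {column}")
    plt.xlabel(column)
    plt.ylabel("Number of roller coasters")
    plt.show()


histogram(captain_coaster, "speed")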

png
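How missing values are dropped changes how many coasters end up in a plot. In the `info()` output above, `speed` has 1478 non-null values, while the regressions later in the notebook, which run on `captain_coaster.dropna()`, report 1279 observations. The sketch below shows the same effect on a tiny made-up frame; the values are hypothetical.

```python
import numpy as np
import pandas as pd

# Hypothetical data: only the first row has both values present.
coasters = pd.DataFrame(
    {
        "speed": [80.0, 50.0, np.nan, 87.0],
        "height": [33.0, np.nan, 11.0, np.nan],
    }
)

# Rows with no missing value in ANY column.
rows_all_columns = len(coasters.dropna())
# Rows with a value in the one column being plotted.
rows_speed_only = len(coasters["speed"].dropna())

print(rows_all_columns, rows_speed_only)
```

For a single-column histogram, dropping missing values in that column alone keeps every coaster that actually has a value to plot.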

  1. Write a function that creates a bar chart showing the number of inversions for each roller coaster at an amusement park. Your function should take the roller coaster DataFrame and an amusement park name as arguments. Make sure to include informative labels that describe your visualization.

    Call your function with the roller coaster DataFrame and an amusement park name.

test = "Walibi Belgium"

captain_coaster.query("park == @test & num_inversions > 0").sort_values(
    "num_inversions", ascending=False
)
      name               material_type  seating_type  speed  height  length  num_inversions  manufacturer  park            status
44    Vampire            Steel          Inverted       80.0    33.0   689.0             5.0  Vekoma        Walibi Belgium  status.operating
11    Cobra              Steel          Sit Down       76.0    36.0   285.0             3.0  Vekoma        Walibi Belgium  status.operating
957   Tornado            Steel          Sit Down       64.0    23.0   725.0             2.0  Vekoma        Walibi Belgium  status.closed.definitely
39    Psyké underground  Steel          Sit Down       85.0    42.0   260.0             1.0  Schwarzkopf   Walibi Belgium  status.operating
def barchart_inversions(df, park):
    """
    Bar chart of the number of inversions for each roller coaster at `park`.

    Coasters with zero inversions are left out of the chart.
    """

    # Only the name and inversion count are needed, so drop rows
    # that are missing either of those two columns
    df = df.dropna(subset=["name", "num_inversions"])

    result = df.query("park == @park & num_inversions > 0").sort_values(
        "num_inversions", ascending=False
    )

    plt.figure(figsize=(10, 7.5))
    plt.bar(result["name"], result["num_inversions"])
    plt.xticks(rotation=30)
    plt.xlabel("Roller coaster")
    plt.ylabel("Number of inversions")
    plt.title(f"Number of Inversions per Rollercoaster: {park}")
    plt.tight_layout()
    plt.show()


barchart_inversions(captain_coaster, "Walibi Belgium")

png

  1. Write a function that creates a pie chart that compares the number of operating roller coasters ('status.operating') to the number of closed roller coasters ('status.closed.definitely'). Your function should take the roller coaster DataFrame as an argument. Make sure to include informative labels that describe your visualization.

    Call your function with the roller coaster DataFrame.

# Remove the "status." prefix (the dot is escaped so it matches literally)
captain_coaster.status = captain_coaster.status.str.replace(
    r"^status\.", "", regex=True
)
# Replace the `.` after "closed" with a space
captain_coaster.status = captain_coaster.status.str.replace(
    r"closed\.", "closed ", regex=True
)
captain_coaster.status.value_counts(normalize=True)
operating             0.775161
closed definitely     0.156674
announced             0.014989
construction          0.014632
unknown               0.012134
closed temporarily    0.008922
relocated             0.007852
retracked             0.005710
rumored               0.003926
Name: status, dtype: float64
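Because the clean-up treats its patterns as regular expressions, it is worth remembering that an unescaped `.` matches any character, not just a literal dot. A minimal sketch with Python's `re` module; the string `"statusXoperating"` is an invented input used only to show the difference.

```python
import re

# Unescaped: "." matches any character, so "statusX" is stripped too.
loose = re.sub("status.", "", "statusXoperating")

# Escaped and anchored: only a literal "status." prefix is removed.
strict_hit = re.sub(r"^status\.", "", "status.closed.definitely")
strict_miss = re.sub(r"^status\.", "", "statusXoperating")

print(loose, strict_hit, strict_miss)
```

On the status values shown above both forms give the same result; the escaped pattern simply states the intent more precisely.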
def pie_operation(df):
    """
    Pie chart comparing operating roller coasters with definitely closed ones.

    Expects the cleaned status values "operating" and "closed definitely".
    """

    # Select the two counts by label so each wedge matches its label
    counts = df.status.value_counts().loc[["operating", "closed definitely"]]
    plt.pie(counts, autopct="%0.1f%%", labels=["Operating", "Closed"])
    plt.title("Rollercoasters: Operating vs Closed")
    plt.axis("equal")
    plt.show()


pie_operation(captain_coaster)

png

  1. .scatter() is another useful function in matplotlib that you might not have seen before. .scatter() produces a scatter plot, which is similar to .plot() in that it plots points on a figure. .scatter(), however, does not connect the points with a line. This allows you to analyze the relationship between two variables. See the matplotlib documentation for .scatter() for details.

    Write a function that creates a scatter plot of two numeric columns of the roller coaster DataFrame. Your function should take the roller coaster DataFrame and two column names as arguments. Make sure to include informative labels that describe your visualization.

    Call your function with the roller coaster DataFrame and two column names.

captain_coaster.describe()
             speed       height       length  num_inversions
count  1478.000000  1667.000000  1675.000000     2405.000000
mean     70.102842    26.725855   606.147463        0.809563
std      28.338394    35.010166   393.840496        1.652254
min       0.000000     0.000000    -1.000000        0.000000
25%      47.000000    13.000000   335.000000        0.000000
50%      72.000000    23.000000   500.000000        0.000000
75%      88.000000    35.000000   839.000000        1.000000
max     240.000000   902.000000  2920.000000       14.000000
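def coaster_scatter(df, column_x, column_y):
    """
    Scatter plot of two numeric columns with a least-squares trend line.

    Returns the fitted line as a numpy poly1d (slope first, then intercept).
    """

    import numpy as np

    # Drop rows with any missing value, then exclude rows with a height of
    # 140 or more, presumably to remove implausible values (describe()
    # reports a maximum height of 902)
    df = df.dropna()
    df = df.query("height < 140")

    plt.figure()
    plt.scatter(df[column_x], df[column_y], alpha=0.1)
    plt.xlabel(column_x)
    plt.ylabel(column_y)
    plt.title(f"Rollercoaster: Relationship {column_x} vs {column_y}")

    # Fit a degree-1 polynomial (a straight line) and draw it dashed in red
    trend = np.polyfit(df[column_x], df[column_y], 1)
    trendline = np.poly1d(trend)
    plt.plot(df[column_x], trendline(df[column_x]), "r--")

    return trendline


coaster_scatter(captain_coaster, "height", "speed")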
poly1d([ 1.43975908, 31.93321946])

png
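The `poly1d` output above lists the slope first and the intercept second, so the height trend line reads roughly as speed ≈ 1.44 × height + 31.93. A minimal sketch of how `np.polyfit` with degree 1 recovers a line, using made-up points that lie exactly on y = 2x + 1:

```python
import numpy as np

# Synthetic points on the line y = 2x + 1 (not roller coaster data).
x = np.array([0.0, 1.0, 2.0, 3.0, 4.0])
y = 2.0 * x + 1.0

# polyfit returns coefficients from the highest degree down:
# for degree 1 that is (slope, intercept).
slope, intercept = np.polyfit(x, y, 1)

# poly1d wraps the coefficients in a callable polynomial.
trendline = np.poly1d([slope, intercept])
print(slope, intercept, trendline(10.0))
```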

# Correlation heatmap of the numeric columns

sns.heatmap(
    captain_coaster.dropna().select_dtypes("number").corr(),
    cmap="seismic",
    annot=True,
    vmin=-1,
    vmax=1,
)

png

df = captain_coaster.dropna()
regression = smf.ols("speed ~ length", data=df).fit()
print(regression.summary())
                            OLS Regression Results                            
==============================================================================
Dep. Variable:                  speed   R-squared:                       0.446
Model:                            OLS   Adj. R-squared:                  0.445
Method:                 Least Squares   F-statistic:                     1027.
Date:                Tue, 21 Jul 2020   Prob (F-statistic):          7.72e-166
Time:                        06:31:08   Log-Likelihood:                -5704.5
No. Observations:                1279   AIC:                         1.141e+04
Df Residuals:                    1277   BIC:                         1.142e+04
Df Model:                           1                                         
Covariance Type:            nonrobust                                         
==============================================================================
                 coef    std err          t      P>|t|      [0.025      0.975]
------------------------------------------------------------------------------
Intercept     41.2253      1.114     37.022      0.000      39.041      43.410
length         0.0473      0.001     32.046      0.000       0.044       0.050
==============================================================================
Omnibus:                      278.245   Durbin-Watson:                   1.832
Prob(Omnibus):                  0.000   Jarque-Bera (JB):              939.152
Skew:                           1.046   Prob(JB):                    1.16e-204
Kurtosis:                       6.639   Cond. No.                     1.43e+03
==============================================================================

Warnings:
[1] Standard Errors assume that the covariance matrix of the errors is correctly specified.
[2] The condition number is large, 1.43e+03. This might indicate that there are
strong multicollinearity or other numerical problems.
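
Read as an equation, the fitted model above is speed ≈ 41.2253 + 0.0473 × length. A small sketch of turning those two reported coefficients into a fitted value; the helper name `predicted_speed` and the example length of 1000 are illustrative choices, not part of the notebook.

```python
# Coefficients copied from the OLS summary above.
INTERCEPT = 41.2253
SLOPE_LENGTH = 0.0473


def predicted_speed(length):
    """Fitted speed for a coaster of the given length (hypothetical helper)."""
    return INTERCEPT + SLOPE_LENGTH * length


# Each additional 100 units of length adds about 4.73 to the fitted speed.
print(predicted_speed(1000))
```

With an R-squared of 0.446, length alone accounts for a bit under half of the variation in speed, so individual coasters can sit well away from this line.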

Test for Outliers

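Here we're using our regression results to test for outliers. By default, `outlier_test()` applies a Bonferroni correction (`method="bonf"`) and returns, for each observation, the studentized residual, the unadjusted p-value, and the Bonferroni-adjusted p-value. We're only printing the rows where the third column, `bonf(p)`, is less than 0.05.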

test = regression.outlier_test()
print("Bad Data Points")
test[test["bonf(p)"] < 0.05]
Bad Data Points
      student_resid       unadj_p   bonf(p)
138        5.795402  8.578726e-09  0.000011
143        5.329034  1.166309e-07  0.000149
160        4.865897  1.281502e-06  0.001639
246        4.875219  1.223498e-06  0.001565
1397       5.041705  5.279044e-07  0.000675
1751       4.921659  9.702364e-07  0.001241
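
The Bonferroni adjustment itself is simple arithmetic: each unadjusted p-value is multiplied by the number of tests and capped at 1. As a check, the sketch below redoes that calculation with the `unadj_p` values printed above, taking the number of tests to be the 1279 observations reported in the regression summary. The results agree with the `bonf(p)` column to six decimal places.

```python
# Number of observations in the regression (from the OLS summary).
n_obs = 1279

# Unadjusted p-values copied from the outlier table, keyed by row index.
unadj_p = {
    138: 8.578726e-09,
    143: 1.166309e-07,
    160: 1.281502e-06,
    246: 1.223498e-06,
    1397: 5.279044e-07,
    1751: 9.702364e-07,
}

# Bonferroni: multiply by the number of tests, never exceeding 1.
bonf_p = {idx: min(1.0, p * n_obs) for idx, p in unadj_p.items()}

for idx, p in bonf_p.items():
    print(idx, round(p, 6))
```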
figure = smgraph.regressionplots.plot_fit(regression, 1)
line = smgraph.regressionplots.abline_plot(model_results=regression, ax=figure.axes[0])


fig, ax = plt.subplots(figsize=(12, 8))
fig = sm.graphics.influence_plot(regression, alpha=0.05, ax=ax, criterion="cooks")

png

png

Y = df.speed
X = df.length
X = sm.add_constant(X)

model = sm.OLS(Y, X)
results = model.fit()
print(results.summary2())
                  Results: Ordinary least squares
===================================================================
Model:              OLS              Adj. R-squared:     0.445     
Dependent Variable: speed            AIC:                11413.0164
Date:               2020-07-21 06:31 BIC:                11423.3240
No. Observations:   1279             Log-Likelihood:     -5704.5   
Df Model:           1                F-statistic:        1027.     
Df Residuals:       1277             Prob (F-statistic): 7.72e-166 
R-squared:          0.446            Scale:              438.76    
---------------------------------------------------------------------
             Coef.    Std.Err.      t      P>|t|     [0.025    0.975]
---------------------------------------------------------------------
const       41.2253     1.1135   37.0220   0.0000   39.0408   43.4099
length       0.0473     0.0015   32.0460   0.0000    0.0444    0.0502
-------------------------------------------------------------------
Omnibus:              278.245       Durbin-Watson:          1.832  
Prob(Omnibus):        0.000         Jarque-Bera (JB):       939.152
Skew:                 1.046         Prob(JB):               0.000  
Kurtosis:             6.639         Condition No.:          1433   
===================================================================
* The condition number is large (1e+03). This might indicate
strong multicollinearity or other numerical problems.
coaster_scatter(captain_coaster, "length", "speed")
poly1d([ 0.04729325, 41.27655592])

png

df.length.std()
396.61095477849716
# Keep rows whose length lies strictly within 2 standard deviations of the mean
mean, std = df.length.mean(), df.length.std()
filtered_df = df[(df.length > mean - 2 * std) & (df.length < mean + 2 * std)]
print("After removing outliers...")
coaster_scatter(filtered_df, "length", "speed")
After removing outliers...





poly1d([ 0.05015783, 39.79244512])

png
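The filter above keeps only coasters whose length is within two standard deviations of the mean; in the notebook this reduces the regression sample from 1279 to 1224 observations. A minimal sketch of the same rule on a made-up series with one obviously extreme value:

```python
import pandas as pd

# Hypothetical lengths: six similar values and one extreme one.
lengths = pd.Series([300.0, 320.0, 310.0, 305.0, 315.0, 295.0, 2900.0])

mean, std = lengths.mean(), lengths.std()

# Keep values strictly inside (mean - 2*std, mean + 2*std).
kept = lengths[(lengths > mean - 2 * std) & (lengths < mean + 2 * std)]
print(len(lengths), len(kept))
```

Note that the mean and standard deviation are themselves pulled toward the extreme value, so this rule is a rough screen rather than a formal outlier test.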

regression = smf.ols("speed ~ length", data=filtered_df).fit()
print(regression.summary())
                            OLS Regression Results                            
==============================================================================
Dep. Variable:                  speed   R-squared:                       0.394
Model:                            OLS   Adj. R-squared:                  0.394
Method:                 Least Squares   F-statistic:                     794.6
Date:                Tue, 21 Jul 2020   Prob (F-statistic):          4.28e-135
Time:                        06:31:14   Log-Likelihood:                -5440.5
No. Observations:                1224   AIC:                         1.089e+04
Df Residuals:                    1222   BIC:                         1.090e+04
Df Model:                           1                                         
Covariance Type:            nonrobust                                         
==============================================================================
                 coef    std err          t      P>|t|      [0.025      0.975]
------------------------------------------------------------------------------
Intercept     39.7370      1.211     32.813      0.000      37.361      42.113
length         0.0502      0.002     28.190      0.000       0.047       0.054
==============================================================================
Omnibus:                      287.517   Durbin-Watson:                   1.856
Prob(Omnibus):                  0.000   Jarque-Bera (JB):              886.712
Skew:                           1.160   Prob(JB):                    2.84e-193
Kurtosis:                       6.464   Cond. No.                     1.40e+03
==============================================================================

Warnings:
[1] Standard Errors assume that the covariance matrix of the errors is correctly specified.
[2] The condition number is large, 1.4e+03. This might indicate that there are
strong multicollinearity or other numerical problems.
# b_status -> 1 if the coaster is operating, else 0.
# Work on an explicit copy so the assignment does not trigger
# pandas' SettingWithCopyWarning.
filtered_df = filtered_df.copy()
filtered_df["b_status"] = (filtered_df["status"] == "operating").astype(int)
filtered_df = filtered_df.query('material_type != "na"')


OLS2 = smf.ols(
    formula="speed ~ material_type + seating_type + height + length + num_inversions + b_status - 1",
    data=filtered_df,
).fit()
print(OLS2.summary())
                            OLS Regression Results                            
==============================================================================
Dep. Variable:                  speed   R-squared:                       0.499
Model:                            OLS   Adj. R-squared:                  0.491
Method:                 Least Squares   F-statistic:                     62.20
Date:                Tue, 21 Jul 2020   Prob (F-statistic):          1.67e-154
Time:                        06:31:15   Log-Likelihood:                -4971.9
No. Observations:                1145   AIC:                             9982.
Df Residuals:                    1126   BIC:                         1.008e+04
Df Model:                          18                                         
Covariance Type:            nonrobust                                         
=================================================================================================
                                    coef    std err          t      P>|t|      [0.025      0.975]
-------------------------------------------------------------------------------------------------
material_type[Hybrid]            44.8927      7.503      5.984      0.000      30.172      59.613
material_type[Steel]             44.6257      5.075      8.793      0.000      34.668      54.583
material_type[Wooden]            44.7924      5.616      7.975      0.000      33.773      55.812
seating_type[T.Bobsleigh]       -17.6204      7.925     -2.223      0.026     -33.170      -2.071
seating_type[T.Floorless]        -5.0932      6.511     -0.782      0.434     -17.868       7.681
seating_type[T.Flying]          -16.7451      6.403     -2.615      0.009     -29.309      -4.181
seating_type[T.Inverted]        -11.9431      5.360     -2.228      0.026     -22.459      -1.427
seating_type[T.Motorbike]        -0.9354      7.677     -0.122      0.903     -15.999      14.128
seating_type[T.Pipeline]         -3.6534     11.905     -0.307      0.759     -27.013      19.706
seating_type[T.Sit Down]         -6.5955      4.905     -1.345      0.179     -16.219       3.028
seating_type[T.Spinning]        -11.2156      5.534     -2.027      0.043     -22.074      -0.357
seating_type[T.Stand Up]         -7.2007      6.863     -1.049      0.294     -20.666       6.265
seating_type[T.Suspended]       -11.8223      6.300     -1.877      0.061     -24.183       0.539
seating_type[T.Water Coaster]     0.8739      6.983      0.125      0.900     -12.827      14.575
seating_type[T.Wing]              5.0534      7.745      0.652      0.514     -10.142      20.249
height                            0.1226      0.014      8.641      0.000       0.095       0.150
length                            0.0421      0.002     20.452      0.000       0.038       0.046
num_inversions                    3.0655      0.370      8.281      0.000       2.339       3.792
b_status                         -0.0278      1.425     -0.020      0.984      -2.823       2.767
==============================================================================
Omnibus:                      284.789   Durbin-Watson:                   1.890
Prob(Omnibus):                  0.000   Jarque-Bera (JB):             2419.285
Skew:                           0.898   Prob(JB):                         0.00
Kurtosis:                       9.891   Cond. No.                     2.42e+04
==============================================================================

Warnings:
[1] Standard Errors assume that the covariance matrix of the errors is correctly specified.
[2] The condition number is large, 2.42e+04. This might indicate that there are
strong multicollinearity or other numerical problems.
filtered_df.sort_values(by="speed", ascending=False)
      name                              material_type  seating_type  speed  height  length  num_inversions  manufacturer   park                       status        b_status
138   Kingda Ka                         Steel          Sit Down      206.0   139.0   950.0             0.0  Intamin        Six Flags Great Adventure  operating            1
143   Top Thrill Dragster               Steel          Sit Down      192.0   128.0   853.0             0.0  Intamin        Cedar Point                operating            1
1751  Red Force                         Steel          Sit Down      185.0   112.0   880.0             0.0  Intamin        Ferrari Land               operating            1
140   Do-Dodonpa                        Steel          Sit Down      172.0    52.0  1189.0             0.0  S&S            Fuji-Q Highland            operating            1
246   Tower of Terror II                Steel          Sit Down      160.0   115.0   372.0             0.0  Intamin        Dreamworld                 operating            1
...   ...                               ...            ...             ...     ...     ...             ...  ...            ...                        ...                ...
2505  Unnamed Hyper Coaster             Steel          Sit Down        0.0     0.0     0.0             0.0  B&M            Hot Go Dreamworld          construction         0
1835  Dragonfire                        Steel          Sit Down        0.0    60.0     0.0             0.0  Premier Rides  Adventure Island           operating            1
2516  Sons of Anarchy & Weyland Yutani  Steel          Sit Down        0.0     0.0     0.0             0.0  na             20th Century Fox World     announced            0
2517  Wings Over Rio                    Steel          Sit Down        0.0     0.0     0.0             0.0  na             20th Century Fox World     announced            0
2515  Alien vs Predator                 Steel          Sit Down        0.0     0.0     0.0             0.0  na             20th Century Fox World     announced            0

1145 rows × 11 columns

  1. Part of the fun of data analysis and visualization is digging into the data you have and answering questions that come to your mind.

    Some questions you might want to answer with the datasets provided include:

    • What roller coaster seating type is most popular? And do different seating types result in higher/faster/longer roller coasters?
    • Do roller coaster manufacturers have any specialties (do they focus on speed, height, seating type, or inversions)?
    • Do amusement parks have any specialties?

    What visualizations can you create that answer these questions, and any others that come to you? Share the questions you ask and the accompanying visualizations you create on the Codecademy forums.
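
As a starting point for the first question, the popularity of each seating type and its average speed can be computed with `value_counts()` and `groupby()`. The sketch below runs on a small invented frame that only borrows the `seating_type` and `speed` column names from roller_coasters.csv; swapping in the real DataFrame is left to the reader.

```python
import pandas as pd

# Hypothetical rows using two column names from roller_coasters.csv.
coasters = pd.DataFrame(
    {
        "seating_type": [
            "Sit Down", "Sit Down", "Inverted",
            "Spinning", "Sit Down", "Inverted",
        ],
        "speed": [80.0, 60.0, 90.0, 50.0, 70.0, 100.0],
    }
)

# How many coasters use each seating type, most common first.
counts = coasters["seating_type"].value_counts()
most_popular = counts.idxmax()

# Average speed per seating type, fastest first.
mean_speed = (
    coasters.groupby("seating_type")["speed"].mean().sort_values(ascending=False)
)

print(most_popular)
print(mean_speed)
```

Either result can be passed straight to a bar chart, for example with `counts.plot(kind="bar")`.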
