danielcs88/roller_coaster.md

## roller_coaster.md

      
            
          
    Roller Coaster

Overview

This project is slightly different than others you have encountered thus far on
Codecademy. Instead of a step-by-step tutorial, this project contains a series
of open-ended requirements which describe the project you’ll be building. There
are many possible ways to correctly fulfill all of these requirements, and you
should expect to use the internet, Codecademy, and other resources when you
encounter a problem that you cannot easily solve.
Instructions


Roller coasters are thrilling amusement park rides designed to make you
squeal and scream! They take you up high, drop you to the ground quickly, and
sometimes even spin you upside down before returning to a stop. Today you
will be taking control back from the roller coasters and visualizing data
covering international roller coaster rankings and roller coaster statistics.
Roller coasters are often split into two main categories based on their
construction material: wood or steel. Rankings for the best wood and
steel roller coasters from the 2013 to 2018 Golden Ticket
Awards are provided in
'Golden_Ticket_Award_Winners_Wood.csv' and
'Golden_Ticket_Award_Winners_Steel.csv', respectively. Load each csv into a
DataFrame and inspect it to gain familiarity with the data.
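A first inspection could look like the sketch below. To stay runnable without the CSV files, it parses a few rows copied from the wood rankings sample shown in this notebook out of an in-memory string; with the real data you would pass the file name to pd.read_csv instead. The variable names (csv_text, sample, rows_per_year) are illustrative assumptions.

```python
import io

import pandas as pd

# A few rows copied from the wood rankings sample in this notebook
# (not the full file).
csv_text = """Rank,Name,Park,Location,Supplier,Year Built,Points,Year of Rank
10,Lightning Racer,Hersheypark,"Hershey, Pa.",GCII,2000,421,2015
21,Rampage,Alabama Splash Adventure,"Bessemer, Ala.",Custom Coasters,1998,218,2017
13,Goliath,Six Flags Great America,"Gurnee, Ill.",Rocky Mountain,2014,269,2017
2,El Toro,Six Flags Great Adventure,"Jackson, N.J.",Intamin,2006,1302,2013
"""

sample = pd.read_csv(io.StringIO(csv_text))

print(sample.shape)         # number of rows and columns
print(sample.dtypes)        # which columns were parsed as numbers
print(sample.isna().sum())  # missing values per column

# How many ranking rows there are for each award year.
rows_per_year = sample["Year of Rank"].value_counts().sort_index()
print(rows_per_year)
```

Checking dtypes and missing values early shows which columns can be plotted directly and which need cleaning first.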


import matplotlib
import matplotlib.pyplot as plt
import pandas as pd

import seaborn as sns
import statsmodels.api as sm
import statsmodels.formula.api as smf
import statsmodels.graphics as smgraph
from matplotlib.ticker import MaxNLocator
from statsmodels.graphics.regressionplots import *

matplotlib.rcdefaults()

plt.rcParams["figure.dpi"] = 140

# load rankings data here:

wood = pd.read_csv("Golden_Ticket_Award_Winners_Wood.csv")
steel = pd.read_csv("Golden_Ticket_Award_Winners_Steel.csv")

# inspect a random sample of the wood rankings:

wood.sample(5)


|     | Rank | Name | Park | Location | Supplier | Year Built | Points | Year of Rank |
| --- | --- | --- | --- | --- | --- | --- | --- | --- |
| 29 | 10 | Lightning Racer | Hersheypark | Hershey, Pa. | GCII | 2000 | 421 | 2015 |
| 175 | 46 | Megafobia | Oakwood | Pembrookshire, Wales | Custom Coasters | 1996 | 84 | 2018 |
| 100 | 21 | Rampage | Alabama Splash Adventure | Bessemer, Ala. | Custom Coasters | 1998 | 218 | 2017 |
| 92 | 13 | Goliath | Six Flags Great America | Gurnee, Ill. | Rocky Mountain | 2014 | 269 | 2017 |
| 1 | 2 | El Toro | Six Flags Great Adventure | Jackson, N.J. | Intamin | 2006 | 1302 | 2013 |

Write a function that will plot the ranking of a given roller coaster over
time as a line. Your function should take a roller coaster’s name and a
ranking DataFrame as arguments. Make sure to include informative labels that
describe your visualization.
Call your function with "El Toro" as the roller coaster name and the wood
ranking DataFrame. What issue do you notice? Update your function with an
additional argument to alleviate the problem, and retest your function.
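One way to see the issue is to count how many parks share each coaster name. The sketch below uses a small hand-made DataFrame rather than the real rankings: the "Hypothetical Park" row and all rank values are assumptions made up to reproduce the ambiguity, and only the column names match the real data.

```python
import pandas as pd

# Toy rankings: the "Hypothetical Park" row and the rank values are made up
# for illustration; only the column names match the real data.
rankings = pd.DataFrame(
    {
        "Name": ["El Toro", "El Toro", "El Toro", "Boulder Dash"],
        "Park": [
            "Six Flags Great Adventure",
            "Hypothetical Park",
            "Six Flags Great Adventure",
            "Lake Compounce",
        ],
        "Rank": [2, 40, 1, 3],
        "Year of Rank": [2013, 2013, 2014, 2013],
    }
)

# Names that appear at more than one park are ambiguous on their own.
parks_per_name = rankings.groupby("Name")["Park"].nunique()
ambiguous = parks_per_name[parks_per_name > 1].index.tolist()
print(ambiguous)

# Filtering by name alone mixes two different coasters into one line...
by_name = rankings[rankings["Name"] == "El Toro"]
# ...while filtering by name and park isolates a single coaster.
by_name_and_park = rankings[
    (rankings["Name"] == "El Toro")
    & (rankings["Park"] == "Six Flags Great Adventure")
]
print(len(by_name), len(by_name_and_park))
```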


def rank_year(name, park, df):
    """
    Plot the yearly rank of one roller coaster, identified by name and park.
    """

    # Filter on both name and park: different parks can share a coaster name.
    coaster = df[(df["Name"] == name) & (df["Park"] == park)]
    plt.plot(coaster["Year of Rank"], coaster["Rank"])
    plt.ylabel("Rank")
    plt.xlabel("Year")
    plt.legend([name])
    plt.title(f"{name}: {park}")
    plt.yticks(range(1, coaster["Rank"].max() + 1))
    plt.show()


rank_year("El Toro", "Six Flags Great Adventure", wood)


Write a function that will plot the ranking of two given roller coasters over
time as lines. Your function should take both roller coasters’ names and a
ranking DataFrame as arguments. Make sure to include informative labels that
describe your visualization.
Call your function with "El Toro" as one roller coaster name, "Boulder Dash"
as the other roller coaster name, and the wood ranking DataFrame. What
issue do you notice? Update your function with two additional arguments to
alleviate the problem, and retest your function.


def rank_year2(name1, name2, park1, park2, df):
    """
    Plot the yearly ranks of two roller coasters, each identified by name and park.
    """

    coaster1 = df[(df["Name"] == name1) & (df["Park"] == park1)]
    coaster2 = df[(df["Name"] == name2) & (df["Park"] == park2)]
    ax = plt.subplot()
    plt.plot(coaster1["Year of Rank"], coaster1["Rank"])
    plt.plot(coaster2["Year of Rank"], coaster2["Rank"])
    plt.ylabel("Rank")
    plt.xlabel("Year")
    plt.legend([name1, name2])
    plt.title("Ranking of Rollercoasters")
    # Ranks are whole numbers, so force integer tick marks on the y-axis.
    ax.yaxis.set_major_locator(MaxNLocator(integer=True))
    plt.show()


rank_year2(
    "El Toro", "Boulder Dash", "Six Flags Great Adventure", "Lake Compounce", wood
)


Write a function that will plot the ranking of the top n ranked roller
coasters over time as lines. Your function should take a number n and a
ranking DataFrame as arguments. Make sure to include informative labels that
describe your visualization.
For example, if n == 5, your function should plot a line for each roller
coaster that has a rank of 5 or lower.
Call your function with a value for n and either the wood ranking or steel
ranking DataFrame.
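An alternative to looping over coasters is to reshape the filtered rankings into a wide table with one column per coaster, which pandas can then plot directly. The sketch below is one possible approach on made-up data: the coaster names, parks, and ranks are placeholders, and top_n_wide is a hypothetical helper name.

```python
import pandas as pd

# Made-up rankings for three placeholder coasters over two years.
rankings = pd.DataFrame(
    {
        "Name": ["A", "B", "C", "A", "B", "C"],
        "Park": ["P1", "P2", "P3", "P1", "P2", "P3"],
        "Rank": [1, 2, 3, 2, 1, 6],
        "Year of Rank": [2013, 2013, 2013, 2014, 2014, 2014],
    }
)


def top_n_wide(df, n):
    """Return a year-by-coaster table of ranks, keeping only ranks <= n."""
    top = df[df["Rank"] <= n]
    # One row per year, one column per (name, park) pair; years in which a
    # coaster fell outside the top n show up as NaN.
    return top.pivot_table(
        index="Year of Rank", columns=["Name", "Park"], values="Rank"
    )


wide = top_n_wide(rankings, 3)
print(wide)
```

Calling wide.plot(marker="o") on the result would draw one line per coaster, with a gap wherever a coaster dropped out of the top n.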


def top_ranked(n, df):
    """
    Plot the rank over time of every roller coaster ranked `n` or better.
    """

    n_df = df.query("Rank <= @n").dropna()

    fig, ax = plt.subplots(figsize=(10, 10))

    # Group by name and park so that same-named coasters get separate lines.
    for (name, park), coaster_rankings in n_df.groupby(["Name", "Park"]):
        ax.plot(
            coaster_rankings["Year of Rank"],
            coaster_rankings["Rank"],
            label=f"{name} ({park})",
        )

    ax.yaxis.set_major_locator(MaxNLocator(integer=True))
    plt.title(f"Top {n} Ranked Rollercoasters")
    plt.xlabel("Year")
    plt.ylabel("Ranking")
    plt.legend()
    plt.show()


top_ranked(5, wood)


Now that you’ve visualized rankings over time, let’s dive into the actual
statistics of roller coasters themselves. Captain
Coaster is a popular site for recording
roller coaster information. Data on all roller coasters documented on Captain
Coaster has been accessed through its API and stored in
roller_coasters.csv. Load the data from the csv into a DataFrame and
inspect it to gain familiarity with the data.
Open the hint for more information about each column of the dataset.


captain_coaster = pd.read_csv("roller_coasters.csv")

captain_coaster.sample(5)


|     | name | material_type | seating_type | speed | height | length | num_inversions | manufacturer | park | status |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| 1999 | Big Apple | na | Sit Down | NaN | NaN | NaN | NaN | D.P.V. Rides | The Milky Way Adventure Park | status.operating |
| 202 | Infernal Toboggan | Steel | Sit Down | NaN | 11.0 | 335.0 | 0.0 | S.D.C. | Foire | status.operating |
| 70 | Blue Tornado | Steel | Inverted | 80.0 | 33.0 | 765.0 | 5.0 | Vekoma | Gardaland | status.operating |
| 953 | Pandemonium | Steel | Spinning | 50.0 | 16.0 | 412.0 | 0.0 | Gerstlauer | Six Flags Fiesta Texas | status.operating |
| 2320 | Happy Angel | na | Inverted | 87.0 | NaN | NaN | 6.0 | Golden Horse | Heilongjiang Wanda Theme Park | status.operating |

Write a function that plots a histogram of any numeric column of the roller
coaster DataFrame. Your function should take a DataFrame and a column name
for which a histogram should be constructed as arguments. Make sure to
include informative labels that describe your visualization.
Call your function with the roller coaster DataFrame and one of the column
names.


captain_coaster.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 2802 entries, 0 to 2801
Data columns (total 10 columns):
 #   Column          Non-Null Count  Dtype  
---  ------          --------------  -----  
 0   name            2799 non-null   object 
 1   material_type   2802 non-null   object 
 2   seating_type    2802 non-null   object 
 3   speed           1478 non-null   float64
 4   height          1667 non-null   float64
 5   length          1675 non-null   float64
 6   num_inversions  2405 non-null   float64
 7   manufacturer    2802 non-null   object 
 8   park            2802 non-null   object 
 9   status          2802 non-null   object 
dtypes: float64(4), object(6)
memory usage: 219.0+ KB
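The non-null counts above differ from column to column, so how missing values are dropped changes how much data a histogram sees. A minimal sketch of the difference, using the speed and height values of four coasters from the sample above (the variable names are illustrative):

```python
import numpy as np
import pandas as pd

# Speed and height of Blue Tornado, Pandemonium, Infernal Toboggan and
# Happy Angel, copied from the sample above.
toy = pd.DataFrame(
    {
        "speed": [80.0, 50.0, np.nan, 87.0],
        "height": [33.0, 16.0, 11.0, np.nan],
    }
)

# Dropping every row with any missing value also discards valid speeds.
rows_complete = toy.dropna()
# Dropping missing values in one column keeps all of its valid entries.
speed_only = toy["speed"].dropna()

print(len(rows_complete), len(speed_only))
```

On the full dataset the same effect is larger: speed has 1478 non-null values, while fewer rows are complete in every column.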

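def histogram(df, column):
    """
    Plot a histogram of one numeric column of a DataFrame.
    """

    # Drop missing values in this column only, so rows that lack other
    # measurements still count.
    values = df[column].dropna()

    plt.hist(values)
    plt.title(f"Distribution of {column}")
    plt.xlabel(column)
    plt.ylabel("Number of roller coasters")
    plt.show()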


histogram(captain_coaster, "speed")


Write a function that creates a bar chart showing the number of inversions
for each roller coaster at an amusement park. Your function should take the
roller coaster DataFrame and an amusement park name as arguments. Make sure
to include informative labels that describe your visualization.
Call your function with the roller coaster DataFrame and an amusement park name.


test = "Walibi Belgium"

captain_coaster.query("park == @test & num_inversions > 0").sort_values(
    "num_inversions", ascending=False
)


|     | name | material_type | seating_type | speed | height | length | num_inversions | manufacturer | park | status |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| 44 | Vampire | Steel | Inverted | 80.0 | 33.0 | 689.0 | 5.0 | Vekoma | Walibi Belgium | status.operating |
| 11 | Cobra | Steel | Sit Down | 76.0 | 36.0 | 285.0 | 3.0 | Vekoma | Walibi Belgium | status.operating |
| 957 | Tornado | Steel | Sit Down | 64.0 | 23.0 | 725.0 | 2.0 | Vekoma | Walibi Belgium | status.closed.definitely |
| 39 | Psyké underground | Steel | Sit Down | 85.0 | 42.0 | 260.0 | 1.0 | Schwarzkopf | Walibi Belgium | status.operating |

def barchart_inversions(df, park):
    """
    Bar chart of the number of inversions per roller coaster at one park.
    """

    # Keep coasters at this park with at least one inversion; the comparison
    # also excludes rows where num_inversions is missing.
    result = df.query("park == @park & num_inversions > 0").sort_values(
        "num_inversions", ascending=False
    )

    plt.figure(figsize=(10, 7.5))
    plt.bar(result["name"], result["num_inversions"])
    plt.xticks(rotation=30)
    plt.xlabel("Roller coaster")
    plt.ylabel("Number of inversions")
    plt.title(f"Number of Inversions per Rollercoaster: {park}")
    plt.tight_layout()
    plt.show()


barchart_inversions(captain_coaster, "Walibi Belgium")


Write a function that creates a pie chart that compares the number of
operating roller coasters ('status.operating') to the number of closed
roller coasters ('status.closed.definitely'). Your function should take the
roller coaster DataFrame as an argument. Make sure to include informative
labels that describe your visualization.
Call your function with the roller coaster DataFrame.
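Before drawing the pie, it helps to compute the two counts in a fixed order so that each wedge label is guaranteed to match its value. The sketch below does this on a made-up status column; the counts are illustrative, and only the two status strings come from the task description.

```python
import pandas as pd

# Made-up status column: five operating, two closed, one announced.
status = pd.Series(
    ["status.operating"] * 5
    + ["status.closed.definitely"] * 2
    + ["status.announced"]
)

wanted = ["status.operating", "status.closed.definitely"]

# Count only the two statuses of interest and fix their order explicitly,
# so labels passed to a pie chart line up with the right wedges.
counts = status[status.isin(wanted)].value_counts().reindex(wanted)
shares = counts / counts.sum()

print(counts)
print(shares.round(3))
```

Passing counts and a label list in the same order as wanted to plt.pie would then produce the two-wedge chart.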


# Remove the "status." prefix (literal match, so "." is not a regex wildcard)
captain_coaster["status"] = captain_coaster["status"].str.replace(
    "status.", "", regex=False
)
# Replace the "." after "closed" with a space
captain_coaster["status"] = captain_coaster["status"].str.replace(
    "closed.", "closed ", regex=False
)
captain_coaster.status.value_counts(normalize=True)
operating             0.775161
closed definitely     0.156674
announced             0.014989
construction          0.014632
unknown               0.012134
closed temporarily    0.008922
relocated             0.007852
retracked             0.005710
rumored               0.003926
Name: status, dtype: float64

def pie_operation(df):
    """
    Pie chart comparing operating and definitely closed roller coasters.
    """

    criteria = df.query("status == 'operating' | status == 'closed definitely'")
    counts = criteria["status"].value_counts()
    # Derive the labels from the counts so they always match their wedges.
    labels = counts.index.str.replace(" definitely", "", regex=False).str.capitalize()
    plt.pie(counts, autopct="%0.1f%%", labels=labels)
    plt.title("Rollercoasters: Operating vs Closed")
    plt.axis("equal")
    plt.show()


pie_operation(captain_coaster)


.scatter() is another useful function in matplotlib that you might not have
seen before. .scatter() produces a scatter plot, which is similar to
.plot() in that it plots points on a figure. .scatter(), however, does
not connect the points with a line. This allows you to analyze the
relationship between two variables. See the matplotlib documentation for
.scatter() for details.
Write a function that creates a scatter plot of two numeric columns of the
roller coaster DataFrame. Your function should take the roller coaster
DataFrame and two column names as arguments. Make sure to include informative
labels that describe your visualization.
Call your function with the roller coaster DataFrame and two column names.


captain_coaster.describe()


|       | speed | height | length | num_inversions |
| --- | --- | --- | --- | --- |
| count | 1478.000000 | 1667.000000 | 1675.000000 | 2405.000000 |
| mean | 70.102842 | 26.725855 | 606.147463 | 0.809563 |
| std | 28.338394 | 35.010166 | 393.840496 | 1.652254 |
| min | 0.000000 | 0.000000 | -1.000000 | 0.000000 |
| 25% | 47.000000 | 13.000000 | 335.000000 | 0.000000 |
| 50% | 72.000000 | 23.000000 | 500.000000 | 0.000000 |
| 75% | 88.000000 | 35.000000 | 839.000000 | 1.000000 |
| max | 240.000000 | 902.000000 | 2920.000000 | 14.000000 |

import numpy as np


def coaster_scatter(df, column_x, column_y):
    """
    Scatter plot of two numeric columns with a fitted linear trend line.

    Returns the trend line as a numpy poly1d object.
    """

    # Keep only complete rows, and exclude extreme heights (140 and above),
    # some of which look like data-entry errors (the maximum is 902).
    df = df.dropna()
    df = df.query("height < 140")

    plt.figure()
    plt.scatter(df[column_x], df[column_y], alpha=0.1)
    plt.xlabel(column_x)
    plt.ylabel(column_y)
    plt.title(f"Rollercoaster: Relationship {column_x} vs {column_y}")

    # Fit a degree-1 polynomial (straight line) and overlay it in red.
    trend = np.polyfit(df[column_x], df[column_y], 1)
    trendline = np.poly1d(trend)
    plt.plot(df[column_x], trendline(df[column_x]), "r--")

    return trendline


coaster_scatter(captain_coaster, "height", "speed")
poly1d([ 1.43975908, 31.93321946])
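The returned poly1d object holds the slope and intercept of the fitted line, in that order, and can be called like a function. As a small worked example, the sketch below rebuilds the trend line from the coefficients printed above and evaluates it at a height of 50, a value chosen only for illustration.

```python
import numpy as np

# Coefficients copied from the poly1d output above: slope, then intercept.
trendline = np.poly1d([1.43975908, 31.93321946])
slope, intercept = trendline.coeffs

# Fitted speed at a height of 50: slope * 50 + intercept.
predicted_speed = trendline(50)
print(round(predicted_speed, 2))  # 103.92
```

Each additional unit of height is therefore associated with about 1.44 more units of speed in this fit.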


# Correlation heatmap

sns.heatmap(
    # Restrict to numeric columns so .corr() works regardless of pandas version.
    captain_coaster.dropna().select_dtypes("number").corr(),
    cmap="seismic",
    annot=True,
    vmin=-1,
    vmax=1,
)


df = captain_coaster.dropna()
regression = smf.ols("speed ~ length", data=df).fit()
print(regression.summary())
                            OLS Regression Results                            
==============================================================================
Dep. Variable:                  speed   R-squared:                       0.446
Model:                            OLS   Adj. R-squared:                  0.445
Method:                 Least Squares   F-statistic:                     1027.
Date:                Tue, 21 Jul 2020   Prob (F-statistic):          7.72e-166
Time:                        06:31:08   Log-Likelihood:                -5704.5
No. Observations:                1279   AIC:                         1.141e+04
Df Residuals:                    1277   BIC:                         1.142e+04
Df Model:                           1                                         
Covariance Type:            nonrobust                                         
==============================================================================
                 coef    std err          t      P>|t|      [0.025      0.975]
------------------------------------------------------------------------------
Intercept     41.2253      1.114     37.022      0.000      39.041      43.410
length         0.0473      0.001     32.046      0.000       0.044       0.050
==============================================================================
Omnibus:                      278.245   Durbin-Watson:                   1.832
Prob(Omnibus):                  0.000   Jarque-Bera (JB):              939.152
Skew:                           1.046   Prob(JB):                    1.16e-204
Kurtosis:                       6.639   Cond. No.                     1.43e+03
==============================================================================

Warnings:
[1] Standard Errors assume that the covariance matrix of the errors is correctly specified.
[2] The condition number is large, 1.43e+03. This might indicate that there are
strong multicollinearity or other numerical problems.
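For a regression with a single predictor, R-squared equals the squared Pearson correlation between the two variables, so the R-squared of 0.446 reported above corresponds to a correlation of roughly 0.67 between length and speed. The sketch below checks that identity with NumPy on synthetic data; the data-generating numbers are arbitrary assumptions and are not meant to reproduce the notebook's fit.

```python
import numpy as np

# Synthetic data (arbitrary assumptions, not the roller coaster dataset).
rng = np.random.default_rng(0)
length = rng.uniform(100, 1500, size=200)
speed = 40 + 0.05 * length + rng.normal(0, 15, size=200)

# Least-squares line, as in a one-predictor OLS.
slope, intercept = np.polyfit(length, speed, 1)
fitted = intercept + slope * length

# R-squared: share of the variance in speed explained by the fitted line.
ss_res = np.sum((speed - fitted) ** 2)
ss_tot = np.sum((speed - speed.mean()) ** 2)
r_squared = 1 - ss_res / ss_tot

# Pearson correlation between predictor and response.
pearson_r = np.corrcoef(length, speed)[0, 1]

print(round(r_squared, 4), round(pearson_r**2, 4))
```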


Test for Outliers

Here we use the regression results to test for outliers. By default, the statsmodels outlier_test method applies a Bonferroni correction to the p-values. We only display observations whose Bonferroni-adjusted p-value (the bonf(p) column) is below 0.05.
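A Bonferroni correction multiplies each unadjusted p-value by the number of tests, capped at 1. The sketch below applies that rule to three unadjusted p-values copied from this notebook's outlier-test output (rows 138, 143 and 160), using the 1279 observations reported in the regression summary. That this matches the bonf(p) values statsmodels prints is an observation from these numbers, not a statement about the library's internals.

```python
# Number of observations, from "No. Observations" in the regression summary.
n_obs = 1279

# Unadjusted p-values copied from the outlier-test output (index: unadj_p).
unadj_p = {138: 8.578726e-09, 143: 1.166309e-07, 160: 1.281502e-06}

# Bonferroni adjustment: multiply by the number of tests, cap at 1.
bonf_p = {idx: min(p * n_obs, 1.0) for idx, p in unadj_p.items()}

# Observations that remain significant at the 5% level after adjustment.
flagged = [idx for idx, p in bonf_p.items() if p < 0.05]

for idx, p in bonf_p.items():
    print(idx, f"{p:.6f}")
```

The adjusted values come out as 0.000011, 0.000149 and 0.001639, in line with the bonf(p) column of the output.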
test = regression.outlier_test()
print("Bad Data Points")
test[test["bonf(p)"] < 0.05]
Bad Data Points


|     | student_resid | unadj_p | bonf(p) |
| --- | --- | --- | --- |
| 138 | 5.795402 | 8.578726e-09 | 0.000011 |
| 143 | 5.329034 | 1.166309e-07 | 0.000149 |
| 160 | 4.865897 | 1.281502e-06 | 0.001639 |
| 246 | 4.875219 | 1.223498e-06 | 0.001565 |
| 1397 | 5.041705 | 5.279044e-07 | 0.000675 |
| 1751 | 4.921659 | 9.702364e-07 | 0.001241 |

figure = smgraph.regressionplots.plot_fit(regression, 1)
line = smgraph.regressionplots.abline_plot(model_results=regression, ax=figure.axes[0])


fig, ax = plt.subplots(figsize=(12, 8))
fig = sm.graphics.influence_plot(regression, alpha=0.05, ax=ax, criterion="cooks")


Y = df.speed
X = df.length
X = sm.add_constant(X)

model = sm.OLS(Y, X)
results = model.fit()
print(results.summary2())
                  Results: Ordinary least squares
===================================================================
Model:              OLS              Adj. R-squared:     0.445     
Dependent Variable: speed            AIC:                11413.0164
Date:               2020-07-21 06:31 BIC:                11423.3240
No. Observations:   1279             Log-Likelihood:     -5704.5   
Df Model:           1                F-statistic:        1027.     
Df Residuals:       1277             Prob (F-statistic): 7.72e-166 
R-squared:          0.446            Scale:              438.76    
---------------------------------------------------------------------
             Coef.    Std.Err.      t      P>|t|     [0.025    0.975]
---------------------------------------------------------------------
const       41.2253     1.1135   37.0220   0.0000   39.0408   43.4099
length       0.0473     0.0015   32.0460   0.0000    0.0444    0.0502
-------------------------------------------------------------------
Omnibus:              278.245       Durbin-Watson:          1.832  
Prob(Omnibus):        0.000         Jarque-Bera (JB):       939.152
Skew:                 1.046         Prob(JB):               0.000  
Kurtosis:             6.639         Condition No.:          1433   
===================================================================
* The condition number is large (1e+03). This might indicate
strong multicollinearity or other numerical problems.

coaster_scatter(captain_coaster, "length", "speed")
poly1d([ 0.04729325, 41.27655592])


df.length.std()
396.61095477849716

# Keep rows whose length lies strictly within two standard deviations of the mean
within_2_std = (df.length - df.length.mean()).abs() < 2 * df.length.std()
filtered_df = df[within_2_std]
print("After removing outliers...")
coaster_scatter(filtered_df, "length", "speed")
After removing outliers...


poly1d([ 0.05015783, 39.79244512])
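The two-standard-deviation rule used above can be written as a single boolean mask. The sketch below applies it to a short, illustrative list of lengths with one extreme value, to show which rows the rule keeps; the values are assumptions picked for the example rather than a real subset of the data.

```python
import pandas as pd

# Illustrative lengths with one extreme value (assumed for this example).
length = pd.Series([260.0, 285.0, 335.0, 412.0, 500.0, 689.0, 725.0, 765.0, 2920.0])

mean, std = length.mean(), length.std()

# True where a value lies strictly within two standard deviations of the mean.
mask = (length - mean).abs() < 2 * std
kept = length[mask]

print(f"mean={mean:.1f}, std={std:.1f}")
print(kept.tolist())
```

Because the extreme value itself inflates both the mean and the standard deviation, this rule is fairly lenient; it removes only the most distant points.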


regression = smf.ols("speed ~ length", data=filtered_df).fit()
print(regression.summary())
                            OLS Regression Results                            
==============================================================================
Dep. Variable:                  speed   R-squared:                       0.394
Model:                            OLS   Adj. R-squared:                  0.394
Method:                 Least Squares   F-statistic:                     794.6
Date:                Tue, 21 Jul 2020   Prob (F-statistic):          4.28e-135
Time:                        06:31:14   Log-Likelihood:                -5440.5
No. Observations:                1224   AIC:                         1.089e+04
Df Residuals:                    1222   BIC:                         1.090e+04
Df Model:                           1                                         
Covariance Type:            nonrobust                                         
==============================================================================
                 coef    std err          t      P>|t|      [0.025      0.975]
------------------------------------------------------------------------------
Intercept     39.7370      1.211     32.813      0.000      37.361      42.113
length         0.0502      0.002     28.190      0.000       0.047       0.054
==============================================================================
Omnibus:                      287.517   Durbin-Watson:                   1.856
Prob(Omnibus):                  0.000   Jarque-Bera (JB):              886.712
Skew:                           1.160   Prob(JB):                    2.84e-193
Kurtosis:                       6.464   Cond. No.                     1.40e+03
==============================================================================

Warnings:
[1] Standard Errors assume that the covariance matrix of the errors is correctly specified.
[2] The condition number is large, 1.4e+03. This might indicate that there are
strong multicollinearity or other numerical problems.

# b_status: 1 if the coaster is operating, 0 otherwise.
# assign() returns a new DataFrame, which avoids pandas' SettingWithCopyWarning.
filtered_df = filtered_df.assign(
    b_status=(filtered_df["status"] == "operating").astype(int)
)

filtered_df = filtered_df.query('material_type != "na"')


OLS2 = smf.ols(
    formula="speed ~ material_type + seating_type + height + length + num_inversions + b_status - 1",
    data=filtered_df,
).fit()
print(OLS2.summary())
                            OLS Regression Results                            
==============================================================================
Dep. Variable:                  speed   R-squared:                       0.499
Model:                            OLS   Adj. R-squared:                  0.491
Method:                 Least Squares   F-statistic:                     62.20
Date:                Tue, 21 Jul 2020   Prob (F-statistic):          1.67e-154
Time:                        06:31:15   Log-Likelihood:                -4971.9
No. Observations:                1145   AIC:                             9982.
Df Residuals:                    1126   BIC:                         1.008e+04
Df Model:                          18                                         
Covariance Type:            nonrobust                                         
=================================================================================================
                                    coef    std err          t      P>|t|      [0.025      0.975]
-------------------------------------------------------------------------------------------------
material_type[Hybrid]            44.8927      7.503      5.984      0.000      30.172      59.613
material_type[Steel]             44.6257      5.075      8.793      0.000      34.668      54.583
material_type[Wooden]            44.7924      5.616      7.975      0.000      33.773      55.812
seating_type[T.Bobsleigh]       -17.6204      7.925     -2.223      0.026     -33.170      -2.071
seating_type[T.Floorless]        -5.0932      6.511     -0.782      0.434     -17.868       7.681
seating_type[T.Flying]          -16.7451      6.403     -2.615      0.009     -29.309      -4.181
seating_type[T.Inverted]        -11.9431      5.360     -2.228      0.026     -22.459      -1.427
seating_type[T.Motorbike]        -0.9354      7.677     -0.122      0.903     -15.999      14.128
seating_type[T.Pipeline]         -3.6534     11.905     -0.307      0.759     -27.013      19.706
seating_type[T.Sit Down]         -6.5955      4.905     -1.345      0.179     -16.219       3.028
seating_type[T.Spinning]        -11.2156      5.534     -2.027      0.043     -22.074      -0.357
seating_type[T.Stand Up]         -7.2007      6.863     -1.049      0.294     -20.666       6.265
seating_type[T.Suspended]       -11.8223      6.300     -1.877      0.061     -24.183       0.539
seating_type[T.Water Coaster]     0.8739      6.983      0.125      0.900     -12.827      14.575
seating_type[T.Wing]              5.0534      7.745      0.652      0.514     -10.142      20.249
height                            0.1226      0.014      8.641      0.000       0.095       0.150
length                            0.0421      0.002     20.452      0.000       0.038       0.046
num_inversions                    3.0655      0.370      8.281      0.000       2.339       3.792
b_status                         -0.0278      1.425     -0.020      0.984      -2.823       2.767
==============================================================================
Omnibus:                      284.789   Durbin-Watson:                   1.890
Prob(Omnibus):                  0.000   Jarque-Bera (JB):             2419.285
Skew:                           0.898   Prob(JB):                         0.00
Kurtosis:                       9.891   Cond. No.                     2.42e+04
==============================================================================

Warnings:
[1] Standard Errors assume that the covariance matrix of the errors is correctly specified.
[2] The condition number is large, 2.42e+04. This might indicate that there are
strong multicollinearity or other numerical problems.
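Because the formula ends in "- 1", the model has no separate intercept: each material type gets its own baseline, and each seating type coefficient is an offset relative to an omitted reference seating type. As a worked example, the sketch below adds up the rounded coefficients from the table above for a steel, inverted, operating coaster with the height, length and inversion count listed for Blue Tornado in the earlier sample. The result is only approximate, since the printed coefficients are rounded.

```python
# Rounded coefficients copied from the regression table above.
coef = {
    "material_type[Steel]": 44.6257,
    "seating_type[T.Inverted]": -11.9431,
    "height": 0.1226,
    "length": 0.0421,
    "num_inversions": 3.0655,
    "b_status": -0.0278,
}

# Blue Tornado's values from the sample: height 33, length 765, 5 inversions,
# operating (b_status = 1).
height, length, num_inversions, b_status = 33.0, 765.0, 5.0, 1

predicted_speed = (
    coef["material_type[Steel]"]
    + coef["seating_type[T.Inverted]"]
    + coef["height"] * height
    + coef["length"] * length
    + coef["num_inversions"] * num_inversions
    + coef["b_status"] * b_status
)
print(round(predicted_speed, 2))  # 84.23
```

The sample lists Blue Tornado's recorded speed as 80.0, so the fitted value is about 4 units too high for this coaster.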

filtered_df.sort_values(by="speed", ascending=False)


|     | name | material_type | seating_type | speed | height | length | num_inversions | manufacturer | park | status | b_status |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| 138 | Kingda Ka | Steel | Sit Down | 206.0 | 139.0 | 950.0 | 0.0 | Intamin | Six Flags Great Adventure | operating | 1 |
| 143 | Top Thrill Dragster | Steel | Sit Down | 192.0 | 128.0 | 853.0 | 0.0 | Intamin | Cedar Point | operating | 1 |
| 1751 | Red Force | Steel | Sit Down | 185.0 | 112.0 | 880.0 | 0.0 | Intamin | Ferrari Land | operating | 1 |
| 140 | Do-Dodonpa | Steel | Sit Down | 172.0 | 52.0 | 1189.0 | 0.0 | S&S | Fuji-Q Highland | operating | 1 |
| 246 | Tower of Terror II | Steel | Sit Down | 160.0 | 115.0 | 372.0 | 0.0 | Intamin | Dreamworld | operating | 1 |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 2505 | Unnamed Hyper Coaster | Steel | Sit Down | 0.0 | 0.0 | 0.0 | 0.0 | B&M | Hot Go Dreamworld | construction | 0 |
| 1835 | Dragonfire | Steel | Sit Down | 0.0 | 60.0 | 0.0 | 0.0 | Premier Rides | Adventure Island | operating | 1 |
| 2516 | Sons of Anarchy & Weyland Yutani | Steel | Sit Down | 0.0 | 0.0 | 0.0 | 0.0 | na | 20th Century Fox World | announced | 0 |
| 2517 | Wings Over Rio | Steel | Sit Down | 0.0 | 0.0 | 0.0 | 0.0 | na | 20th Century Fox World | announced | 0 |
| 2515 | Alien vs Predator | Steel | Sit Down | 0.0 | 0.0 | 0.0 | 0.0 | na | 20th Century Fox World | announced | 0 |

1145 rows × 11 columns
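
The bottom of the sorted output lists coasters with a speed of 0.0, several of them only announced or under construction. That suggests 0.0 stands in for a missing value rather than a measured speed. A minimal sketch, assuming that interpretation holds, replaces zeros with `NaN` before sorting. The `sample` frame is a hypothetical stand-in built from a few rows of the output above, not the full `filtered_df`.

```python
import numpy as np
import pandas as pd

# Hypothetical stand-in for filtered_df, using rows shown in the output above.
sample = pd.DataFrame(
    {
        "name": ["Kingda Ka", "Red Force", "Dragonfire", "Wings Over Rio"],
        "speed": [206.0, 185.0, 0.0, 0.0],
        "height": [139.0, 112.0, 60.0, 0.0],
        "status": ["operating", "operating", "operating", "announced"],
    }
)

# Assumption: a value of 0 means "not recorded", not a real measurement.
cleaned = sample.copy()
cleaned[["speed", "height"]] = cleaned[["speed", "height"]].replace(0, np.nan)

# sort_values puts NaN rows last by default, so measured speeds lead the ranking.
ranked = cleaned.sort_values(by="speed", ascending=False)
print(ranked)
```

With zeros converted to `NaN`, summary statistics such as the mean speed also stop being pulled down by placeholder rows.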


Part of the fun of data analysis and visualization is digging into the data
you have and answering questions that come to your mind.
Some questions you might want to answer with the datasets provided include:

- What roller coaster seating type is most popular? And do different seating
  types result in higher/faster/longer roller coasters?
- Do roller coaster manufacturers have any specialties (do they focus on
  speed, height, seating type, or inversions)?
- Do amusement parks have any specialties?

What visualizations can you create that answer these questions, and any
others that come to you? Share the questions you ask and the accompanying
visualizations you create on the Codecademy forums.
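
One possible starting point for the seating-type and manufacturer questions is a pair of grouped summaries. The sketch below is an assumption about how to approach them, not the project's official solution. The `coasters` frame is a hypothetical stand-in assembled from a handful of rows shown in this notebook's sample outputs, and the real analysis would run on the full roller coaster DataFrame.

```python
import pandas as pd

# Hypothetical stand-in: a few rows copied from this notebook's sample outputs.
coasters = pd.DataFrame(
    {
        "name": ["Blue Tornado", "Pandemonium", "Vampire",
                 "Cobra", "Tornado", "Psyké underground"],
        "seating_type": ["Inverted", "Spinning", "Inverted",
                         "Sit Down", "Sit Down", "Sit Down"],
        "speed": [80.0, 50.0, 80.0, 76.0, 64.0, 85.0],
        "height": [33.0, 16.0, 33.0, 36.0, 23.0, 42.0],
        "length": [765.0, 412.0, 689.0, 285.0, 725.0, 260.0],
        "manufacturer": ["Vekoma", "Gerstlauer", "Vekoma",
                         "Vekoma", "Vekoma", "Schwarzkopf"],
    }
)

metrics = ["speed", "height", "length"]

# How popular is each seating type? (number of coasters per type)
seating_counts = coasters["seating_type"].value_counts()

# Typical speed/height/length per seating type; the median resists outliers.
seating_profile = coasters.groupby("seating_type")[metrics].median()

# The same summary per manufacturer hints at any specialties.
manufacturer_profile = coasters.groupby("manufacturer")[metrics].median()

print(seating_counts)
print(seating_profile)
print(manufacturer_profile)
```

Each summary can be turned into a chart with pandas' built-in plotting, for example `seating_counts.plot(kind="bar")`. Grouping by `park` instead of `manufacturer` gives the equivalent view for the amusement park question.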

| | Rank | Name | Park | Location | Supplier | Year Built | Points | Year of Rank |
| --- | --- | --- | --- | --- | --- | --- | --- | --- |
| 29 | 10 | Lightning Racer | Hersheypark | Hershey, Pa. | GCII | 2000 | 421 | 2015 |
| 175 | 46 | Megafobia | Oakwood | Pembrookshire, Wales | Custom Coasters | 1996 | 84 | 2018 |
| 100 | 21 | Rampage | Alabama Splash Adventure | Bessemer, Ala. | Custom Coasters | 1998 | 218 | 2017 |
| 92 | 13 | Goliath | Six Flags Great America | Gurnee, Ill. | Rocky Mountain | 2014 | 269 | 2017 |
| 1 | 2 | El Toro | Six Flags Great Adventure | Jackson, N.J. | Intamin | 2006 | 1302 | 2013 |

| | name | material_type | seating_type | speed | height | length | num_inversions | manufacturer | park | status |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| 1999 | Big Apple | na | Sit Down | NaN | NaN | NaN | NaN | D.P.V. Rides | The Milky Way Adventure Park | status.operating |
| 202 | Infernal Toboggan | Steel | Sit Down | NaN | 11.0 | 335.0 | 0.0 | S.D.C. | Foire | status.operating |
| 70 | Blue Tornado | Steel | Inverted | 80.0 | 33.0 | 765.0 | 5.0 | Vekoma | Gardaland | status.operating |
| 953 | Pandemonium | Steel | Spinning | 50.0 | 16.0 | 412.0 | 0.0 | Gerstlauer | Six Flags Fiesta Texas | status.operating |
| 2320 | Happy Angel | na | Inverted | 87.0 | NaN | NaN | 6.0 | Golden Horse | Heilongjiang Wanda Theme Park | status.operating |

| | name | material_type | seating_type | speed | height | length | num_inversions | manufacturer | park | status |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| 44 | Vampire | Steel | Inverted | 80.0 | 33.0 | 689.0 | 5.0 | Vekoma | Walibi Belgium | status.operating |
| 11 | Cobra | Steel | Sit Down | 76.0 | 36.0 | 285.0 | 3.0 | Vekoma | Walibi Belgium | status.operating |
| 957 | Tornado | Steel | Sit Down | 64.0 | 23.0 | 725.0 | 2.0 | Vekoma | Walibi Belgium | status.closed.definitely |
| 39 | Psyké underground | Steel | Sit Down | 85.0 | 42.0 | 260.0 | 1.0 | Schwarzkopf | Walibi Belgium | status.operating |

| | speed | height | length | num_inversions |
| --- | --- | --- | --- | --- |
| count | 1478.000000 | 1667.000000 | 1675.000000 | 2405.000000 |
| mean | 70.102842 | 26.725855 | 606.147463 | 0.809563 |
| std | 28.338394 | 35.010166 | 393.840496 | 1.652254 |
| min | 0.000000 | 0.000000 | -1.000000 | 0.000000 |
| 25% | 47.000000 | 13.000000 | 335.000000 | 0.000000 |
| 50% | 72.000000 | 23.000000 | 500.000000 | 0.000000 |
| 75% | 88.000000 | 35.000000 | 839.000000 | 1.000000 |
| max | 240.000000 | 902.000000 | 2920.000000 | 14.000000 |

| | student_resid | unadj_p | bonf(p) |
| --- | --- | --- | --- |
| 138 | 5.795402 | 8.578726e-09 | 0.000011 |
| 143 | 5.329034 | 1.166309e-07 | 0.000149 |
| 160 | 4.865897 | 1.281502e-06 | 0.001639 |
| 246 | 4.875219 | 1.223498e-06 | 0.001565 |
| 1397 | 5.041705 | 5.279044e-07 | 0.000675 |
| 1751 | 4.921659 | 9.702364e-07 | 0.001241 |

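
The `bonf(p)` column in the outlier table above is the Bonferroni-adjusted p-value: the unadjusted p-value multiplied by the number of tests and capped at 1. The short check below reproduces that arithmetic. The observation count `n_obs = 1279` is an assumption, back-calculated from the ratio of the two columns rather than stated anywhere in the notebook.

```python
# Unadjusted p-values copied from the outlier table above, keyed by row index.
unadj_p = {
    138: 8.578726e-09,
    143: 1.166309e-07,
    160: 1.281502e-06,
    246: 1.223498e-06,
    1397: 5.279044e-07,
    1751: 9.702364e-07,
}

# Assumption: number of observations (one test per observation), inferred
# from the table itself; the notebook does not state it.
n_obs = 1279

# Bonferroni correction: multiply by the number of tests, cap at 1.
bonf_p = {idx: min(1.0, p * n_obs) for idx, p in unadj_p.items()}

for idx, p in bonf_p.items():
    print(f"{idx}: {p:.6f}")
```

All six adjusted values stay far below 0.05, so these rows remain flagged as outliers even after the correction.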