@wyattowalsh
Last active May 18, 2021 18:20
Project READMEs

This dataset is updated daily and contains data on all games, all teams, and all players within the NBA, including:

  • 60,000+ games (every game since the first NBA season in 1946-47), including, for games in which the statistics were recorded:
    • Box scores, game summaries, officials, inactive players, linescores, last face-off stats, season series info, and game video availability
  • 30 teams, with information including:
    • General team details (stadium, head coach, general manager, social media links, etc.) and franchise history (name changes, location changes, etc.)
  • 4,500+ players, with:
    • Basic draft data, prior affiliations, career statistics, and anatomical data (height & weight)
  • and more, with plans for expansion!
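Since the dataset ships as a single SQLite file, it can be queried directly from Python. The table and column names below (`game`, `team_name_home`, `pts_home`) are illustrative assumptions rather than a confirmed schema (check the real one with `SELECT name FROM sqlite_master`); this sketch builds a tiny in-memory stand-in so it runs anywhere:

```python
import sqlite3

# Hypothetical stand-in for the dataset's SQLite file; schema names are
# assumptions for illustration, not the dataset's documented schema.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE game (season_id TEXT, team_name_home TEXT, pts_home INTEGER)"
)
conn.executemany(
    "INSERT INTO game VALUES (?, ?, ?)",
    [("22022", "Boston Celtics", 112), ("22022", "Boston Celtics", 98)],
)

# Average home points per team -- the kind of query you might run
# against the real dataset file instead of ":memory:".
rows = conn.execute(
    "SELECT team_name_home, AVG(pts_home) FROM game GROUP BY team_name_home"
).fetchall()
print(rows)  # [('Boston Celtics', 105.0)]
```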

Notes

  • Data sourced from stats.nba.com via the nba_api, updated via a fully automated data pipeline hosted between Kaggle & GitHub🧑‍💻
  • Please use this thread to report any bugs 🪳 and this thread to share any suggestions for improvement 👍
  • Feel free to discuss your projects here 🎤
  • See more about the pipeline architecture here 🧑‍🏭
  • See some related resources here 📚

Update Status

Update Kaggle Basketball Dataset - Daily · Update Kaggle Basketball Dataset - Monthly

Built with: Kaggle · GitHub Actions · Python · SQLite · Google Colab

Mixed Integer Linear Programming for Fair Division Problems

The goal of this project is to find optimally fair allocations of divisible and non-divisible goods for a group of people under three different definitions of envy-freeness, with certain assumptions. Mixed integer linear programming (MILP) formulations are created in AMPL and solved using CPLEX, generating datasets of the minimal approximate envy value and solver elapsed time for different combinations of numbers of people and goods. Interactive 3D visualizations of this dataset are created in Python, and analysis of the results is conducted. The project has two main outcomes: paper.pdf, a full, compiled research paper, and report_nb.ipynb, which hosts the results datasets and visualizations. Click below to load the project notebook in your browser using the Binder service, or continue reading for more information on the project.

Interact with the project notebook in your web browser using the Binder service


Contents:

Explanation of Repository Contents

| Name | Type | Description |
| --- | --- | --- |
| data | Folder | Contains input and output data from the project as well as .LP and .MPS files for all problem instances. The input and output directories have subdirectories pertaining to the specific problem of interest. |
| environment.yml | File | An Anaconda virtual environment replication file that ensures consistent versions of software packages. |
| report_nb.ipynb | File | A Jupyter Python notebook that contains the results of solving the generated examples, along with two-dimensional and three-dimensional visualizations that help provide a better understanding of the results. |
| src | Folder | Contains all source code for solving the examples: the commands used to perform actions like normalization, a file to create all of the synthetic data, and a '.mod' and '.run' AMPL file for each subtype of problem. Assuming the necessary data files have been generated, the '.run' file for each sub-question can be run from the AMPL console; all of the examples will be solved and the output files written to the data folder. |
| visualizations | Folder | A collection of the different visualizations created in the Jupyter notebook as .png files, sorted by type into sub-folders. |

Project Summary

Fair division problems are a significant class of problems with considerable multidisciplinary involvement, ranging from social science to computer science. Many variants of envy-freeness currently exist, applied to a multitude of scenarios and solved through assorted methodologies. To guide the work in this project, three particular definitions of envy-freeness are analyzed for one situation: envy-freeness, envy-freeness up to one item, and envy-freeness with the inclusion of a divisible subsidy in the form of a cash amount. We apply these definitions to the case where items are indivisible and valuations are both additive and normalized.
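To make the first two definitions concrete, here is a small sketch for additive valuations (the helper names are my own, not the project's AMPL formulations): `value[i][j]` is person i's value for item j, and `alloc[i]` is the set of items given to person i.

```python
def envy(value, alloc, i, k):
    """How much person i envies person k's bundle (0 if none)."""
    mine = sum(value[i][j] for j in alloc[i])
    theirs = sum(value[i][j] for j in alloc[k])
    return max(0, theirs - mine)

def is_envy_free(value, alloc):
    n = len(alloc)
    return all(envy(value, alloc, i, k) == 0 for i in range(n) for k in range(n))

def is_ef1(value, alloc):
    """Envy-free up to one item: envy vanishes after dropping some
    single item from the envied bundle."""
    n = len(alloc)
    for i in range(n):
        for k in range(n):
            if i == k or envy(value, alloc, i, k) == 0:
                continue
            mine = sum(value[i][j] for j in alloc[i])
            theirs = sum(value[i][j] for j in alloc[k])
            if not any(theirs - value[i][g] <= mine for g in alloc[k]):
                return False
    return True

# Two people, three items: person 0 gets items 0 and 1, person 1 gets item 2.
value = [[2, 2, 5], [3, 3, 3]]
alloc = [{0, 1}, {2}]
print(is_envy_free(value, alloc), is_ef1(value, alloc))  # False True
```

This allocation is not envy-free (person 0 values the other bundle at 5 versus their own 4), but it is envy-free up to one item, which is exactly the gap the second definition tolerates.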

These three definitions were modeled in the AMPL programming language and then solved using the IBM CPLEX solver for two simple examples and a collection of generated data for different combinations of number of people and number of items to be allocated.

The results for the two simple examples serve to validate the accuracy of the formulations, and the results for the collection of generated data allow for analysis of the complexity of these problem types. Furthermore, strategies are devised and implemented to reduce the runtime of the envy-freeness instances, including: upper-bounding the objective function, initializing CPLEX with a feasible starting solution, combining both of these, and finally tuning various CPLEX parameters.

Instructions for Usage

environment.yml can be found in the repository's root directory for your version of interest and can be used to install the necessary project dependencies. If you are able to successfully configure your computing environment, launch Jupyter Notebook from your command prompt and navigate to report_nb.ipynb. If not, refer to the sections below to install the necessary system tools and package dependencies. The following sections may be cross-platform compatible in several places; however, they are geared towards macOS.

Do you have the Conda system installed?

Open a command prompt (i.e. Terminal) and run: conda info.

This should display related information pertaining to your system's installation of Conda. If this is the case, you should be able to skip to the section regarding virtual environment creation (updating to the latest version of Conda could prove helpful though: conda update conda).

If this resulted in an error, then install Conda with the following section.

Install Conda

There are a few options here. For a general full installation, check out the Anaconda Download Page. However, the author strongly recommends Miniconda, since it retains the necessary functionality while keeping resource use low; see the Comparison with Anaconda and the Miniconda Download Page.

Windows users: please refer to the above links to install some variation of Conda. Once installed, proceed to the instructions for creating and configuring virtual environments [found here](#Configure-Local-Environment).

macOS or Linux users: it is recommended to use the Homebrew system to simplify the Miniconda installation process. Usage of Homebrew is explained next.

Do you have Homebrew Installed?

In your command prompt (i.e. Terminal) use a statement such as: brew help

If this errored, move on to the next section.

If this returned output (e.g. examples of usage) then you have Homebrew installed and can proceed to install conda found here.

Install Homebrew

In your command prompt, call: /bin/bash -c "$(curl -fsSL https://raw.githubusercontent.com/Homebrew/install/HEAD/install.sh)"

Install Miniconda with Homebrew

In your command prompt, call: brew install --cask miniconda

When in doubt, calling brew doctor might help 💊

A Few Possibly Useful Conda Commands

All environment related commands can be found here: Anaconda Environment Commands

Here are a few of the most used ones though:

List all environments (current environment as marked by the *): conda env list

Create a new environment: conda create --name myenv

Activate an environment: conda activate myenv

Deactivate an environment and go back to system base: conda deactivate

List all installed packages for current environment: conda list

Configure Local Environment

Using the command prompt, navigate to the local project repository directory -- on macOS, I recommend typing cd in Terminal and then dragging the project folder from Finder into Terminal.

In your command prompt, call: conda env create -f environment.yml. This will create a new Conda virtual environment with the name: explorations-in-envy-free-allocations.

Activate the new environment by using: conda activate explorations-in-envy-free-allocations

Access Project

After having activated your environment, use jupyter notebook to launch a Jupyter session in your browser.

Within the Jupyter Home page, navigate and click on report_nb.ipynb in the list of files. This will launch a local kernel running the project notebook in a new tab.


Conceptualizing Higher Education Institutions:
An Agent-Based Modelling Approach


Link to cloud hosted simulation experiment data analysis and modelling notebook: Binder

Link to project paper: Conceptualizing Higher Education Institutions Paper

Link to datasets: Experiment 1 and All Experiments


Explanation of Repository Contents

.
├── README.md   This file
├── paper.pdf   Project Write-Up
├── environment.yml   Conda environment configuration file (used to load project dependencies)
├── nb.ipynb   Jupyter Notebook used for data analysis and modelling (hosted at the above Binder link)
├── .gitignore   Git file used to ignore non-repo local files
└── src   Directory containing custom scripts
    ├── __init__.py
    ├── agent.py   Agent class definition (agent instantiation and opinion variation)
    ├── data_functions.py   Helpful functions to manipulate data
    ├── data_operations.py   Main data file used to produce data (utilizes Apache Spark)
    ├── data_processing.py   Short script to fix time data writing issue in simulation
    ├── environment.py   Environment class definition (establishes agents, holds data, increments time, conducts group negotiations)
    ├── main.py   Script to run collection of experiments
    ├── model.py   Model class definition (sets environment, generates collection of experiment parameters, conducts experiments)
    ├── utilities.py   Helpful functions used throughout simulation
    └── visualization.md   Mermaid markdown snippet dump for flowcharts


Instructions for Usage

environment.yml can be found in the repository's root directory and can be used to install the necessary project dependencies. If you are able to successfully configure your computing environment, launch Jupyter Notebook from your command prompt and navigate to nb.ipynb. If not, refer to the sections below to install the necessary system tools and package dependencies. The following sections may be cross-platform compatible in several places; however, they are geared towards macOS¹.

Do you have the Conda system installed?

Open a command prompt (i.e. Terminal) and run: conda info.

This should display related information pertaining to your system's installation of Conda. If this is the case, you should be able to skip to the section regarding virtual environment creation (updating to the latest version of Conda could prove helpful though: conda update conda).

If this resulted in an error, then install Conda with the following section.

Install Conda

There are a few options here. For a general full installation, check out the Anaconda Download Page. However, the author strongly recommends Miniconda, since it retains the necessary functionality while keeping resource use low; see the Comparison with Anaconda and the Miniconda Download Page.

Windows users: please refer to the above links to install some variation of Conda. Once installed, proceed to the instructions for creating and configuring virtual environments [found here](#Configure-Local-Environment).

macOS or Linux users: it is recommended to use the Homebrew system to simplify the Miniconda installation process. Usage of Homebrew is explained next.

Do you have Homebrew Installed?

In your command prompt (i.e. Terminal) use a statement such as: brew help

If this errored, move on to the next section.

If this returned output (e.g. examples of usage) then you have Homebrew installed and can proceed to install conda found here.

Install Homebrew

In your command prompt, call: /bin/bash -c "$(curl -fsSL https://raw.githubusercontent.com/Homebrew/install/HEAD/install.sh)"

Install Miniconda with Homebrew

In your command prompt, call: brew install --cask miniconda

When in doubt, calling brew doctor might help 💊

A Few Possibly Useful Conda Commands

All environment related commands can be found here: Anaconda Environment Commands

Here are a few of the most used ones though:

List all environments (current environment as marked by the *): conda env list

Create a new environment: conda create --name myenv

Activate an environment: conda activate myenv

Deactivate an environment and go back to system base: conda deactivate

List all installed packages for current environment: conda list

Configure Local Environment

Using the command prompt, navigate to the local project repository directory -- on macOS, I recommend typing cd in Terminal and then dragging the project folder from Finder into Terminal.

In your command prompt, call: conda env create -f environment.yml. This will create a new Conda virtual environment with the name: higher-education-simulation.

Activate the new environment by using: conda activate higher-education-simulation

Access Project

After having activated your environment, use jupyter notebook to launch a Jupyter session in your browser.

Within the Jupyter Home page, navigate and click on nb.ipynb in the list of files. This will launch a local kernel running the project notebook in a new tab.
















1: This project was created on macOS version 11.0.1 (Big Sur) using Conda version 4.9.2, and Python 3.8 (please reach out to me if you need further system specifications).

Machine Learning for NBA Game Attendance Prediction

This project seeks to provide a tool to accurately predict the attendance of NBA games in order to better inform the business decisions of different stakeholders across the organization. Predicting game attendance is crucial to making optimized managerial decisions, such as planning necessary staffing or procuring the proper level of supplies (janitorial, food services, etc). The project is currently in its second version, version_2. In version 1, an entire machine learning pipeline was established across a host of modules, ranging from web scraping for data collection to neural-network regression modeling for prediction. These efforts resulted in a high-accuracy model with a mean absolute error of around 800 attendees. However, improvements in data sources and modeling paradigms are being sought in a few ways in the upcoming version. Click the link below to view the version 1.0 analysis and modeling notebook, or continue reading for more about the project.

Interact with the project notebook in your web browser using the Binder service


Contents


Explanation of Repository Contents

  • data contains both raw and processed data. There are game, search popularity, and stadium wiki raw datasets. These three datasets are processed and compiled resulting in the file dataset.csv within the processed directory. However, numerous other datasets can be found here which are the accumulation of different feature selection and data sampling strategies for use in modeling.
  • features contains results derived from statistical testing and principal components analysis across the datasets
  • models contains datasets of the error results across all the models applied as well as tuning parameter values
  • src is where all the project source code can be found. A host of modules and functions for data web scraping, feature selection, visualization, modeling, and Jupyter configuration are here.
  • version_2 is where all files related to the second iteration of this project can be found. Its structure generally mirrors that of repository root directory with sub-directories for data, source code, etc.
  • visualizations holds .png images of the visualizations created on the datasets
  • nb.ipynb is the associated data analysis and modeling notebook (this notebook can also be found and interacted with via the Binder link found above).
  • r_modeling.ipynb is an R notebook used for further data modeling with more exotic models.
  • environment.yml and requirements.txt are environment setup files used to properly configure an environment and load the necessary dependencies for the project (a further explanation of how to use these can be found at the bottom)

Version 2.0

Development Roadmap

The goal of this version is to create another implementation of this machine learning pipeline leveraging knowledge gained from the first version to improve overall predictive accuracy and utilize new tools and modeling techniques.

To avoid potential data-cleanliness problems with scraping basketball-reference.com as in the first version, stats.nba.com will be queried through an open-source API for sport-related data, enabling more seasons and features to be gathered. Furthermore, a wider range of data sources will be considered, taking into account factors such as regional socioeconomics, weather, etc. New pre-processing scripts will combine and clean the data from these different sources to make a dataset apt for modeling. Core modeling assumptions leveraged in the first version, such as data distribution, will be re-evaluated. Furthermore, a new portfolio of modeling techniques based on more current research will be applied. A few models to be included are linear regression with the Huber loss, a long short-term memory neural network, and ensemble methodologies.
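As a small illustration of one of the planned techniques, here is the Huber loss in pure Python (the threshold delta=1.0 is my choice, not a project parameter). It is quadratic for small residuals and linear for large ones, so attendance outliers pull the fit less than under squared error:

```python
def huber_loss(residual, delta=1.0):
    """Huber loss of a single residual; delta is the changeover point
    between quadratic and linear behavior (an illustrative default)."""
    a = abs(residual)
    if a <= delta:
        return 0.5 * a * a          # quadratic near zero
    return delta * (a - 0.5 * delta)  # linear in the tails

print(huber_loss(0.5), huber_loss(3.0))  # 0.125 2.5
```

Note how the loss at residual 3.0 is 2.5 rather than the 4.5 that squared error (halved) would give, which is the robustness property motivating its use here.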

In future versions, a full Kubernetes cluster of the pipeline deployed via distributed cloud-computing resources would be a wonderful addition. This would allow for automated model updates, fully parallelized modeling (as every model can be containerized), and prediction delivery.

Progress Updates

  • Game data, and especially attendance data, was successfully retrieved for all seasons since 1946 using nba_api. This is awesome, as version 1.0 only included seasons since 1999. The package was discovered on GitHub and leveraged to query the numerous stats.nba.com endpoints.
  • Functions to query different types of datasets, as well as functions to combine and clean the results, have been created for league team data, game overview data for all seasons, and game box score summary data for all games.

Version 1.0

Project Summary

As briefly discussed in the introduction, the project's aim is to create an NBA game attendance prediction tool in order to improve the business decisions of NBA stadium managers. These managers have to make dynamic decisions reacting to fluctuating demand in a constrained, complex environment. Staff scheduling, food services, and entertainment are just a few of these decision areas. Game attendance predictions can be used as a tool to gain insight into customer demand and help better inform these managers' decisions. Operating expenses can be reduced when waste is minimized, and properly assessing demand helps ensure fewer overages. Assessing demand can further impact the stadium's bottom line by helping ensure there are proper supply levels to meet customer demand.

Game attendance prediction can serve to underlie many of the tools and processes found across the different facets of the organization. As an example, vendors can use attendance predictions, along with their own demand metrics and analytics, to better assess how many soft goods to purchase. Similar foundational relationships can be found for most vendors, as well as for janitorial supplies, process timing, and facility operations.

The flowchart below details the five different stages within the pipeline architecture used here.

(Flowchart: the five stages of the pipeline architecture)

  • A: Stadium data (e.g. location) is scraped from Wikipedia using the pandas library. Game and sport data is scraped from basketball-reference.com using the Beautiful Soup framework and the requests library. The pytrends Google Trends API is used to gather search popularity data.
  • B:
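As a rough sketch of the stage-A table extraction: the project uses pandas and Beautiful Soup, but the same idea can be shown with only the standard library's html.parser. The HTML snippet below is a made-up stand-in for a scraped Wikipedia stadium table:

```python
from html.parser import HTMLParser

# Hypothetical stand-in for a fetched Wikipedia page fragment.
html = """<table>
<tr><th>Stadium</th><th>Capacity</th></tr>
<tr><td>Chase Center</td><td>18064</td></tr>
</table>"""

class TableParser(HTMLParser):
    """Collects each <tr> of a table as a list of cell strings."""
    def __init__(self):
        super().__init__()
        self.rows, self._row, self._in_cell = [], [], False

    def handle_starttag(self, tag, attrs):
        if tag == "tr":
            self._row = []
        elif tag in ("td", "th"):
            self._in_cell = True

    def handle_endtag(self, tag):
        if tag == "tr":
            self.rows.append(self._row)
        elif tag in ("td", "th"):
            self._in_cell = False

    def handle_data(self, data):
        if self._in_cell:
            self._row.append(data.strip())

p = TableParser()
p.feed(html)
print(p.rows)  # [['Stadium', 'Capacity'], ['Chase Center', '18064']]
```

In practice pandas.read_html or Beautiful Soup does this in one call; the sketch just shows what the scraping stage is structurally doing.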

Results and Discussion

Installation Instructions

environment.yml can be found in the repository's root directory for your version of interest and can be used to install the necessary project dependencies. If you are able to successfully configure your computing environment, launch Jupyter Notebook from your command prompt and navigate to nb.ipynb. If not, refer to the sections below to install the necessary system tools and package dependencies. The following sections may be cross-platform compatible in several places; however, they are geared towards macOS¹.

Do you have the Conda system installed?

Open a command prompt (i.e. Terminal) and run: conda info.

This should display related information pertaining to your system's installation of Conda. If this is the case, you should be able to skip to the section regarding virtual environment creation (updating to the latest version of Conda could prove helpful though: conda update conda).

If this resulted in an error, then install Conda with the following section.

Install Conda

There are a few options here. For a general full installation, check out the Anaconda Download Page. However, the author strongly recommends Miniconda, since it retains the necessary functionality while keeping resource use low; see the Comparison with Anaconda and the Miniconda Download Page.

Windows users: please refer to the above links to install some variation of Conda. Once installed, proceed to the instructions for creating and configuring virtual environments [found here](#Configure-Local-Environment).

macOS or Linux users: it is recommended to use the Homebrew system to simplify the Miniconda installation process. Usage of Homebrew is explained next.

Do you have Homebrew Installed?

In your command prompt (i.e. Terminal) use a statement such as: brew help

If this errored, move on to the next section.

If this returned output (e.g. examples of usage) then you have Homebrew installed and can proceed to install conda found here.

Install Homebrew

In your command prompt, call: /bin/bash -c "$(curl -fsSL https://raw.githubusercontent.com/Homebrew/install/HEAD/install.sh)"

Install Miniconda with Homebrew

In your command prompt, call: brew install --cask miniconda

When in doubt, calling brew doctor might help 💊

A Few Possibly Useful Conda Commands

All environment related commands can be found here: Anaconda Environment Commands

Here are a few of the most used ones though:

List all environments (current environment as marked by the *): conda env list

Create a new environment: conda create --name myenv

Activate an environment: conda activate myenv

Deactivate an environment and go back to system base: conda deactivate

List all installed packages for current environment: conda list

Configure Local Environment

Using the command prompt, navigate to the local project repository directory -- on macOS, I recommend typing cd in Terminal and then dragging the project folder from Finder into Terminal.

In your command prompt, call: conda env create -f environment.yml. This will create a new Conda virtual environment with the name: NBA-attendance-prediction.

Activate the new environment by using: conda activate NBA-attendance-prediction

Access Project

After having activated your environment, use jupyter notebook to launch a Jupyter session in your browser.

Within the Jupyter Home page, navigate and click on nb.ipynb in the list of files. This will launch a local kernel running the project notebook in a new tab.
















1: This project was created on macOS version 11.0.1 (Big Sur) using Conda version 4.9.2, and Python 3.8 (please reach out to me if you need further system specifications).

Regularized Linear Regression Deep Dive:
Application to Wine Quality Regression Dataset

This project consists of a deep dive on multiple linear regression (OLS) and its regularized variants (Ridge, the Lasso, and the Elastic Net), as well as Python implementations of exploratory data analysis, K-Fold cross-validation, and modeling functions applied to regression on a wine quality dataset. The examination applies optimization theory to either derive the model estimator (for OLS and Ridge) or derive the update rule for Pathwise Coordinate Descent (the discrete optimization algorithm chosen and implemented to solve the Lasso and the Elastic Net). These derivations have accompanying Python implementations, which are leveraged to predict wine quality ratings within a supervised learning context.
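For intuition, the coordinate-descent update used for the Lasso can be sketched in a few lines of pure Python (variable names are mine; the project's actual implementation may differ): each pass updates one coefficient at a time via the soft-thresholding operator.

```python
def soft_threshold(rho, lam):
    """Soft-thresholding operator: shrinks rho toward zero by lam."""
    if rho < -lam:
        return rho + lam
    if rho > lam:
        return rho - lam
    return 0.0

def lasso_cd(X, y, lam, n_iter=100):
    """Minimal coordinate descent for the Lasso (unstandardized sketch)."""
    n, p = len(X), len(X[0])
    beta = [0.0] * p
    for _ in range(n_iter):
        for j in range(p):
            # Partial residual: leave feature j's contribution out.
            r = [y[i] - sum(X[i][k] * beta[k] for k in range(p) if k != j)
                 for i in range(n)]
            rho = sum(X[i][j] * r[i] for i in range(n))
            z = sum(X[i][j] ** 2 for i in range(n))
            beta[j] = soft_threshold(rho, lam) / z
    return beta

X = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
y = [1.0, 0.1, 1.1]
beta = lasso_cd(X, y, lam=0.5)
print(beta)  # → [0.8, 0.0]
```

Note that the second coefficient is driven exactly to zero, the sparsity-inducing behavior that distinguishes the Lasso from Ridge.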

A three-part series of blog posts on this topic was published in Towards Data Science.
Read them here:

Interact with the project notebook in your web browser using the Binder service

Explanation of Repository Contents · Technical Overview · Installation Instructions

Explanation of Repository Contents

  • data/ - contains the project's wine quality dataset
  • src/ - holds all the project source code
  • nb.ipynb - project notebook
  • environment.yml - Conda virtual environment reproduction file

Technical Overview

The entirety of this project is written in Python (version 3.8), with a majority of functions depending on NumPy and several on pandas. Matplotlib and Seaborn are used for visualization. There are also a few standard-library dependencies, such as the time and math modules.

Implementations can be found for train-test data splitting, variance inflation factor calculation, K-Fold cross-validation, ordinary least squares (OLS), Ridge, the Lasso, and the Elastic Net as well as several other functions used to produce the notebook.
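As one example of the kind of helper implemented here, a minimal K-Fold splitter might look like the following (my own sketch, not the repository's code); each fold serves once as the validation set while the rest form the training set:

```python
def k_fold_splits(n, k):
    """Return k (train_indices, val_indices) pairs covering range(n).

    The first n % k folds get one extra sample so sizes differ by at most 1.
    """
    sizes = [n // k + (1 if i < n % k else 0) for i in range(k)]
    splits, start = [], 0
    for size in sizes:
        val = list(range(start, start + size))
        train = list(range(0, start)) + list(range(start + size, n))
        splits.append((train, val))
        start += size
    return splits

splits = k_fold_splits(7, 3)
for train, val in splits:
    print(val)  # [0, 1, 2] then [3, 4] then [5, 6]
```

Cross-validated model selection then just loops over these pairs, fitting on each train split and scoring on the held-out val split.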


Installation Instructions

environment.yml can be found in the repository's root directory for your version of interest and can be used to install the necessary project dependencies. If you are able to successfully configure your computing environment, launch Jupyter Notebook from your command prompt and navigate to nb.ipynb. If not, refer to the sections below to install the necessary system tools and package dependencies. The following sections may be cross-platform compatible in several places; however, they are geared towards macOS¹.

Do you have the Conda system installed?

Open a command prompt (i.e. Terminal) and run: conda info.

This should display related information pertaining to your system's installation of Conda. If this is the case, you should be able to skip to the section regarding virtual environment creation (updating to the latest version of Conda could prove helpful though: conda update conda).

If this resulted in an error, then install Conda with the following section.

Install Conda

There are a few options here. For a general full installation, check out the Anaconda Download Page. However, the author strongly recommends Miniconda, since it retains the necessary functionality while keeping resource use low; see the Comparison with Anaconda and the Miniconda Download Page.

Windows users: please refer to the above links to install some variation of Conda. Once installed, proceed to the instructions for creating and configuring virtual environments [found here](#Configure-Local-Environment).

macOS or Linux users: it is recommended to use the Homebrew system to simplify the Miniconda installation process. Usage of Homebrew is explained next.

Do you have Homebrew Installed?

In your command prompt (i.e. Terminal) use a statement such as: brew help

If this errored, move on to the next section.

If this returned output (e.g. examples of usage) then you have Homebrew installed and can proceed to install conda found here.

Install Homebrew

In your command prompt, call: /bin/bash -c "$(curl -fsSL https://raw.githubusercontent.com/Homebrew/install/HEAD/install.sh)"

Install Miniconda with Homebrew

In your command prompt, call: brew install --cask miniconda

When in doubt, calling brew doctor might help 💊

A Few Possibly Useful Conda Commands

All environment related commands can be found here: Anaconda Environment Commands

Here are a few of the most used ones though:

List all environments (current environment as marked by the *): conda env list

Create a new environment: conda create --name myenv

Activate an environment: conda activate myenv

Deactivate an environment and go back to system base: conda deactivate

List all installed packages for current environment: conda list

Configure Local Environment

Using the command prompt, navigate to the local project repository directory -- on macOS, I recommend typing cd in Terminal and then dragging the project folder from Finder into Terminal.

In your command prompt, call: conda env create -f environment.yml. This will create a new Conda virtual environment with the name: regularized-regression-from-scratch.

Activate the new environment by using: conda activate regularized-regression-from-scratch

Access Project

After having activated your environment, use jupyter notebook to launch a Jupyter session in your browser.

Within the Jupyter Home page, navigate and click on nb.ipynb in the list of files. This will launch a local kernel running the project notebook in a new tab.
















1: This project was created on macOS version 11.0.1 (Big Sur) using Conda version 4.9.2, and Python 3.8 (please reach out to me if you need further system specifications).

See here for the different sources utilized to synthesize this project.
