# Facial Recognition and Combating Sampling Bias
## Executive Summary
Data (candidate sources):

- Potentially: https://www.nist.gov/srd/nist-special-database-18 (the National Institute of Standards and Technology mugshot dataset)
- Potentially: https://lionbridge.ai/datasets/5-million-faces-top-15-free-image-datasets-for-facial-recognition/
- Potentially: http://robotics.csie.ncku.edu.tw/Databases/FaceDetect_PoseEstimate.htm
- Other "faces in the wild" collections
## Literature Review
[Starting From Here](http://simplymathematics.xyz/recommendation-engine/Algorithmic_Bias.html)
- Datasets:
- Sampling Techniques:
- Models:
## Research Methods
### Verification
I will use the datasets above to replicate prior research, identify which models carry the most bias, and investigate techniques for mitigating that bias. I will compare RMSE, explained variance, and false positive rates, where each applies. In general, I want to answer three questions (see the metric sketch after the list):
1. With a strictly randomized data set, which facial recognition models carry the most racial bias?
2. Does the bias persist when we stratify the data or apply other bootstrapping techniques?
3. Can we reproduce the effect in data released by standards organizations (such as the NIST dataset)?
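As a first pass at the metrics, the sketch below computes a per-group false positive rate on toy data. The data frame and its columns (`group`, `actual`, `predicted`) are hypothetical placeholders for the real verification outputs, not part of any dataset named above.

```r
# Minimal sketch: false positive rate broken out by demographic group.
# All column names and the toy data are hypothetical placeholders.
set.seed(42)
results <- data.frame(
  group     = sample(c("A", "B", "C"), 1000, replace = TRUE),
  actual    = sample(c(TRUE, FALSE), 1000, replace = TRUE),  # true match?
  predicted = sample(c(TRUE, FALSE), 1000, replace = TRUE)   # model's call
)

# FPR = FP / (FP + TN): the mean prediction among the true negatives.
fpr_by_group <- aggregate(
  predicted ~ group,
  data = subset(results, actual == FALSE),
  FUN  = mean
)
names(fpr_by_group)[2] <- "false_positive_rate"
print(fpr_by_group)
```

A gap in `false_positive_rate` across groups is exactly the bias signal question 1 is after.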
This section will be done in R for its extensive package suite, simple IDE, and fast iteration. To iterate quickly, I will use small datasets (or smaller samples of large datasets).
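For question 2, the stratification step could look like the sketch below: draw an equal number of rows from each demographic stratum so that no group dominates the sample. The `faces` frame and its `group` column are assumptions standing in for a real image index.

```r
# Hedged sketch of stratified resampling: equal rows per demographic group.
# 'faces' and 'group' are hypothetical stand-ins for a real image index.
set.seed(42)
faces <- data.frame(
  id    = 1:1000,
  group = sample(c("A", "B", "C"), 1000, replace = TRUE,
                 prob = c(0.7, 0.2, 0.1))  # deliberately imbalanced
)

n_per_group <- 100
strata   <- split(faces, faces$group)
balanced <- do.call(rbind, lapply(strata, function(s) {
  # Oversample with replacement only when a stratum is too small.
  s[sample(nrow(s), n_per_group, replace = nrow(s) < n_per_group), ]
}))
table(balanced$group)  # every group now contributes exactly n_per_group rows
```

Comparing the bias metrics on `balanced` versus the raw `faces` sample would answer question 2 directly.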
### Validation
Each of the above questions is falsifiable and testable. The underlying assumption is well documented, but it must still be shown for the datasets in question. To verify this at a larger scale, I would compare the standards data to a much larger collection of images from the internet; datasets exist on the order of millions of images. Using demographic data and controlled, randomized trials, I can verify the conclusions from the three questions (built on smaller datasets I can model locally), but real models are built in the cloud with giant hyperparameter grids, TensorFlow libraries, and Python scripting.

I would like to spend the latter half of the project validating the first set of findings on much larger sets of data, and to leave the tooling and scope vague until the first section is done. However, my career goals are much more aligned with this part of the project. For that reason, I intend to use CUDA libraries on a GPU for the matrix calculations (embedded design) or to build a Spark cluster in the cloud (processors-are-cheap "cloud" design) for the easily parallelized functions such as hyperparameter tuning, bootstrapping, sampling, and subset analysis. The results and time commitment from the first section will define the scope and scale of the second.
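As a local stand-in for the eventual GPU or Spark design, the sketch below parallelizes a bootstrap with base R's `parallel` package; the statistic, replicate count, and core count are all assumptions for illustration.

```r
# Sketch of an easily parallelized bootstrap using base R's parallel package.
# A Spark or CUDA backend would replace mclapply() at scale.
library(parallel)

set.seed(42)
results <- data.frame(  # hypothetical verification output, as above
  actual    = sample(c(TRUE, FALSE), 1000, replace = TRUE),
  predicted = sample(c(TRUE, FALSE), 1000, replace = TRUE)
)

# One bootstrap replicate: resample rows, recompute the false positive rate.
boot_fpr <- function(i, data) {
  resampled <- data[sample(nrow(data), replace = TRUE), ]
  mean(resampled$predicted[!resampled$actual])
}

# mc.cores is ignored on Windows (the replicates run serially there).
fpr_boot <- unlist(mclapply(1:500, boot_fpr, data = results, mc.cores = 4))
quantile(fpr_boot, c(0.025, 0.975))  # bootstrap 95% interval for the FPR
```

The same pattern extends to hyperparameter grids: each grid cell is an independent task, so only the parallel map needs to change between the laptop and the cluster.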
## Goal
Produce a portfolio-ready piece with a web application (preferably static), R tooling for analysis and document generation, and either a Python- or C++-based data back end. All of this can be done from inside RStudio on an Ubuntu Docker container, which will itself be distributed with the project for easy replicability.