alisha/output_stats.md

## output_stats.md

      
Goals

Specifically, we want to:

Report the percentage of students who receive jobs
Make a histogram for all applicants' travel times
Make a histogram of travel times for applicants who want to stay close to home
Report the percentage of applicants whose job matches their interests
Report the percentage of applicants whose job matches their interests if they wanted that more than proximity

General

Need to import: pandas, numpy, matplotlib.pyplot

Formatting Data

The most straightforward way to read the data is to load it from a CSV file using pandas.read_csv().
The file should contain at least the following information:

Applicant ID
If the applicant prefers closer jobs
Matching job ID
Time/distance between applicant and job
If job matches applicant's interests (having the exact interests isn't necessary here, we just need to know if they match)
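
As a minimal sketch of that layout, the snippet below reads a small CSV with pandas.read_csv(). The column names (applicant_id, prefers_close, job_id, travel_time, matches_interest) and the sample rows are made up for illustration; the real file may use different names. An in-memory string stands in for the file so the example is self-contained.

```python
import io

import pandas as pd

# Hypothetical CSV contents; the column names and values are assumptions.
# Applicant 3 has no job, so the job-related fields are left empty.
CSV_TEXT = """applicant_id,prefers_close,job_id,travel_time,matches_interest
1,True,101,12.5,True
2,False,102,40.0,True
3,True,,,
4,False,103,25.0,False
"""

# read_csv accepts a file path or any file-like object.
df = pd.read_csv(io.StringIO(CSV_TEXT))

print(df.columns.tolist())
print(df)
```

With a real file, the call would simply take the path instead of the StringIO object.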

If we want to calculate the percentage of applicants who have jobs, then all applicants should be included here. However,
that can also be calculated in another part of the program (i.e. when the algorithm is matching jobs to applicants), and
this can just be used for reporting on the applicants who do have jobs. It doesn't really matter, as long as there's a
standard way to filter between applicants who have jobs and applicants who don't (most likely putting 0 or null as the job
ID).
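
A rough sketch of that filter, assuming a hypothetical job_id column where applicants without a job have either a missing value or 0. Both conventions are checked here because the text leaves the choice open.

```python
import pandas as pd

# Hypothetical data: job_id is missing (NaN) for applicants without a job.
df = pd.DataFrame({
    "applicant_id": [1, 2, 3, 4],
    "job_id": [101, 102, None, 103],
})

# Treat both null and 0 as "no job", since either convention might be used.
has_job = df["job_id"].notna() & (df["job_id"] != 0)

# The mean of a boolean Series is the fraction of True values.
percent_with_jobs = 100 * has_job.mean()

# Only the matched applicants, for the rest of the reporting.
matched = df[has_job]

print(f"{percent_with_jobs:.1f}% of applicants have jobs")
```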

Getting Data

Need to have data as a DataFrame (data with labeled columns).
As mentioned earlier, it would be very easy to import the data from a CSV file using pandas.read_csv().
See the pandas documentation for DataFrames for details.
Use boolean indexing to select some data from the DataFrame. This can be useful to select only applicants who want to be
close to home, or only applicants who would rather have jobs that match their interests.
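
The boolean indexing described here might look like the sketch below. The column names are hypothetical, and it assumes that an applicant who does not prefer closer jobs is one who cares more about matching interests.

```python
import pandas as pd

# Hypothetical matched applicants; column names and values are assumptions.
df = pd.DataFrame({
    "applicant_id": [1, 2, 3, 4],
    "prefers_close": [True, False, True, False],
    "travel_time": [12.5, 40.0, 18.0, 25.0],
    "matches_interest": [True, True, False, False],
})

# Boolean indexing: a True/False mask keeps only the rows where it is True.
close_to_home = df[df["prefers_close"]]
interest_first = df[~df["prefers_close"]]

# Percentage of interest-first applicants whose job matches their interests.
percent_match = 100 * interest_first["matches_interest"].mean()

print(close_to_home)
print(f"{percent_match:.1f}% of interest-first applicants got a matching job")
```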

Analyzing/Visualizing Data

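Use df.describe() to get basic info like count, mean, standard deviation, min, max, and quartiles (the 50% quartile is
the median). The mode isn't included for numeric columns; use df.mode() for that.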
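
For a quick look at what describe() returns, here is a small sketch on a single made-up column of travel times. The name travel_time and the values are assumptions.

```python
import pandas as pd

# Hypothetical travel times for four matched applicants.
travel_times = pd.Series([12.5, 40.0, 18.0, 25.0], name="travel_time")

# For numeric data, describe() reports count, mean, std, min,
# the 25%/50%/75% quartiles, and max.
summary = travel_times.describe()
print(summary)

# The median is the 50% entry of the summary.
median = summary["50%"]

# mode() is a separate call and returns every most-frequent value.
mode = travel_times.mode()
```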
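
Generating Histograms

import matplotlib.pyplot as plt

# df is the DataFrame; df.plot() creates its own figure and returns the axes,
# so no separate figure() call is needed
ax = df.plot(kind='hist')

# to save the graph to foo.png or foo.pdf
# (save before calling show(); saving afterwards can produce an empty image)
ax.figure.savefig('foo.png')
ax.figure.savefig('foo.pdf')

# to open a window with the graph
plt.show()
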
The documentation for histograms has info on how to customize colors, the number of bins, etc.
For this project, it's best to only plot one column of the DataFrame at a time; otherwise every numeric column is drawn
on the same axes and the bars overlap.
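
One way to plot a single column, sketched under the assumption that the DataFrame has hypothetical prefers_close and travel_time columns. It uses the non-interactive Agg backend and writes to the system temp directory so it can run without a display; both choices are only for the sake of the example.

```python
import os
import tempfile

import matplotlib
matplotlib.use("Agg")  # non-interactive backend, no window needed
import matplotlib.pyplot as plt
import pandas as pd

# Hypothetical data; column names and values are assumptions.
df = pd.DataFrame({
    "prefers_close": [True, False, True, False, True],
    "travel_time": [12.5, 40.0, 18.0, 25.0, 9.0],
})

# Select one column (a Series) so only one set of bars is drawn.
close_times = df.loc[df["prefers_close"], "travel_time"]

fig, ax = plt.subplots()
close_times.plot(kind="hist", bins=5, ax=ax)
ax.set_xlabel("Travel time")
ax.set_ylabel("Number of applicants")
ax.set_title("Travel times (applicants who prefer close jobs)")

# Save the figure, then release it.
out_path = os.path.join(tempfile.gettempdir(), "close_travel_times.png")
fig.savefig(out_path)
plt.close(fig)
```

Dropping the filter and plotting df["travel_time"] directly would give the histogram for all applicants.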

Running This

If we want to execute this file from the Ruby file, then we need a way to launch a subprocess in the Ruby file. Assuming
that nothing else needs to be done in the Ruby file after launching this Python file, then we can use
exec("python fileName.py"), which replaces the Ruby process with the Python one and never returns (see the Ruby
documentation for Kernel#exec). If we need to return to the Ruby file, then something like system or backticks is needed
instead; this link has more information (and a handy flowchart!) on what to choose.