Specifically, we want to:
- Report the percentage of students who receive jobs
- Make a histogram for all applicants' travel times
- Make a histogram of travel times for applicants who want to stay close to home
- Report the percentage of applicants whose job matches their interests
- Report the percentage of applicants whose job matches their interests if they wanted that more than proximity
Need to import: pandas, numpy, matplotlib.pyplot
The most straightforward way to read the data is to load it from a CSV file using pandas.read_csv().
The CSV file should contain at least the following information:
- Applicant ID
- Whether the applicant prefers closer jobs
- Matching job ID
- Time/distance between applicant and job
- Whether the job matches the applicant's interests (having the exact interests isn't necessary here, we just need to know if they match)
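A minimal sketch of loading such a file. The column names (applicant_id, prefers_close, job_id, travel_time, interest_match) are assumptions made for illustration, and the inline CSV text stands in for a real file path.

```python
import io

import pandas as pd

# Hypothetical sample data; in practice, pass a file path to pd.read_csv().
# The column names are assumptions for this sketch, not a fixed format.
SAMPLE_CSV = """applicant_id,prefers_close,job_id,travel_time,interest_match
1,True,101,12.5,True
2,False,102,40.0,True
3,True,,,
4,False,103,25.0,False
"""


def load_matches(source):
    """Read the matching results into a DataFrame with labeled columns."""
    return pd.read_csv(source)


df = load_matches(io.StringIO(SAMPLE_CSV))
print(df)
```

Applicant 3 has no job in this sample, so its job_id, travel_time and interest_match fields are left empty and pandas reads them as missing values.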
If we want to calculate the percentage of applicants who have jobs here, then the file must include every applicant, matched or not. Alternatively, that figure can be calculated in another part of the program (e.g. while the algorithm is matching jobs to applicants), in which case this file only needs the applicants who do have jobs. Either approach works, as long as there's a standard way to tell applicants who have jobs from applicants who don't (most likely putting 0 or null as the job ID).
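One way this filter could look, assuming a hypothetical job_id column in which unmatched applicants have either 0 or a missing value:

```python
import pandas as pd

# Hypothetical data: job_id is 0 or missing (NaN) for applicants without a job.
df = pd.DataFrame({
    "applicant_id": [1, 2, 3, 4, 5],
    "job_id": [101, 0, 102, None, 103],
})


def percent_with_jobs(frame):
    """Percentage of rows whose job_id is neither missing nor 0."""
    has_job = frame["job_id"].notna() & (frame["job_id"] != 0)
    # The mean of a boolean Series is the fraction of True values.
    return 100.0 * has_job.mean()


pct = percent_with_jobs(df)
print(f"{pct:.1f}% of applicants have jobs")
```

Checking both conventions in one mask means the reporting code does not need to know which of the two the matching step used.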
The data needs to be in a DataFrame (a table with labeled columns).
As mentioned earlier, the easiest way to get one is to import the data from a CSV file using pandas.read_csv().
Documentation for DataFrames is here
Use boolean indexing to select some of the rows from the DataFrame. This is useful for selecting only the applicants who want to be close to home, or only the applicants who would rather have jobs that match their interests.
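A short sketch of boolean indexing, using hypothetical column names and assuming that an applicant who does not prefer proximity prefers an interest match instead:

```python
import pandas as pd

# Hypothetical data; the column names are assumptions for this sketch.
df = pd.DataFrame({
    "applicant_id": [1, 2, 3, 4],
    "prefers_close": [True, False, True, False],
    "travel_time": [12.5, 40.0, 8.0, 25.0],
    "interest_match": [True, True, False, False],
})

# A boolean Series used as an index keeps only the rows where it is True.
close_to_home = df[df["prefers_close"]]

# ~ negates the mask: applicants who value interests over proximity.
prefers_interests = df[~df["prefers_close"]]

# Percentage of interest-first applicants whose job matches their interests.
pct_match = 100.0 * prefers_interests["interest_match"].mean()
print(close_to_home)
print(pct_match)
```

The close_to_home selection is what the second histogram would be drawn from, and pct_match corresponds to the last percentage in the list of goals.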
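Use df.describe() to get basic info for numeric columns: count, mean, standard deviation, min, max, and quartiles (the 50% row is the median). It does not report the mode; use df.mode() for that.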
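For illustration, here is how the summary statistics could be pulled out for a single column of made-up travel times:

```python
import pandas as pd

# Made-up travel times, for illustration only.
travel = pd.Series([10.0, 20.0, 20.0, 30.0, 45.0], name="travel_time")

# describe() returns a Series indexed by
# count, mean, std, min, 25%, 50%, 75%, max.
summary = travel.describe()
median = summary["50%"]

# The mode is not part of describe(); mode() returns a Series because
# several values can tie for most frequent.
modes = travel.mode()

print(summary)
print("median:", median)
print("mode(s):", list(modes))
```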
import matplotlib.pyplot as plt

# df is the DataFrame; pandas creates the figure itself, so no plt.figure() call is needed
df.plot(kind='hist')
# to save the graph to foo.png or foo.pdf (do this before show(), otherwise the saved file may be blank)
plt.savefig('foo.png')
plt.savefig('foo.pdf')
# to open a window with the graph
plt.show()
Here is the documentation for histograms, including info on how to customize colors, the number of bins, etc.
For this project, it's best to only plot one column of the DataFrame at a time, because plotting the whole DataFrame draws every numeric column on the same axes and we don't want the bars piled on top of each other.
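A sketch of plotting a single column, assuming a hypothetical travel_time column. The helper name plot_travel_hist, the axis labels and the output path are illustrative, and the Agg backend is selected only so the sketch runs without a display.

```python
import os
import tempfile

import matplotlib
matplotlib.use("Agg")  # non-interactive backend; remove to use plt.show()
import matplotlib.pyplot as plt
import pandas as pd

# Hypothetical data; the column names are assumptions for this sketch.
df = pd.DataFrame({
    "prefers_close": [True, False, True, False, True],
    "travel_time": [12.5, 40.0, 8.0, 25.0, 15.0],
})


def plot_travel_hist(times, path, bins=5):
    """Draw a histogram of one Series of travel times and save it to path."""
    fig, ax = plt.subplots()
    times.plot(kind="hist", bins=bins, ax=ax)
    ax.set_xlabel("Travel time")
    ax.set_ylabel("Number of applicants")
    fig.savefig(path)
    plt.close(fig)  # free the figure once it has been written
    return path


out = plot_travel_hist(
    df["travel_time"],
    os.path.join(tempfile.gettempdir(), "travel_times.png"),
)
print("saved", out)
```

Passing df.loc[df["prefers_close"], "travel_time"] instead of the full column would produce the histogram for applicants who want to stay close to home.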
If we want to execute this file from the Ruby file, then we need a way to launch a subprocess from Ruby. Assuming that nothing else needs to be done in the Ruby file after launching this Python file, we can use exec("python fileName.py") (see documentation here). Note that exec replaces the running Ruby process, so control never returns to the Ruby file. If we need to return to the Ruby file, then this link has more information (and a handy flowchart!) on what to choose.