@alisha
Last active February 16, 2016 22:25

Goals

Specifically, we want to:

  • Report the percentage of applicants who receive jobs
  • Make a histogram for all applicants' travel times
  • Make a histogram of travel times for applicants who want to stay close to home
  • Report the percentage of applicants whose job matches their interests
  • Report the percentage of applicants whose job matches their interests if they wanted that more than proximity

General

Need to import: pandas, numpy, matplotlib.pyplot

Formatting Data

The most straightforward way to read the data is to load it from a CSV file using pandas.read_csv().

This should contain the following information, at least:

  • Applicant ID
  • If the applicant prefers closer jobs
  • Matching job ID
  • Time/distance between applicant and job
  • If job matches applicant's interests (having the exact interests isn't necessary here, we just need to know if they match)
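A minimal sketch of what loading such a file could look like. The column names (applicant_id, prefers_close, job_id, travel_minutes, interest_match) and the sample rows are made up for illustration, and the CSV is read from an in-memory string so the example runs on its own; in practice a file path would be passed to pandas.read_csv() instead.

```python
import io

import pandas as pd

# Hypothetical sample data; the real file and its column names may differ.
# Applicant 3 has no job, so the job-related fields are left empty.
SAMPLE_CSV = """applicant_id,prefers_close,job_id,travel_minutes,interest_match
1,True,101,15,True
2,False,102,40,True
3,True,,,
4,False,103,25,False
"""


def load_matches(source):
    """Read the matching results into a DataFrame with labeled columns."""
    return pd.read_csv(source)


# io.StringIO stands in for a path such as 'matches.csv' (hypothetical name).
df = load_matches(io.StringIO(SAMPLE_CSV))
print(df)
```

Empty fields come back as NaN, which is one way to represent "no job" for an applicant.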

If we want to calculate the percentage of applicants who have jobs, then all applicants should be included here. However, that can also be calculated in another part of the program (i.e. when the algorithm is matching jobs to applicants), and this can just be used for reporting on the applicants who do have jobs. It doesn't really matter, as long as there's a standard way to filter between applicants who have jobs and applicants who don't (most likely putting 0 or null as the job ID).
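If all applicants are included, the percentage with jobs could be computed along these lines. This is only a sketch: the job_id column name and the convention that 0 or a missing value means "no job" are assumptions taken from the note above, not a fixed format.

```python
import numpy as np
import pandas as pd

# Hypothetical data: NaN or 0 in job_id marks an applicant without a job.
df = pd.DataFrame({
    "applicant_id": [1, 2, 3, 4, 5],
    "job_id": [101, 102, np.nan, 0, 103],
})


def has_job(frame):
    """Return a boolean Series that is True where a job was assigned."""
    return frame["job_id"].notna() & (frame["job_id"] != 0)


matched = has_job(df)

# The mean of a boolean Series is the fraction of True values.
percent_with_jobs = 100 * matched.mean()
print(f"{percent_with_jobs:.1f}% of applicants received jobs")
```

The same has_job() mask can be reused to restrict later reports to applicants who actually have jobs.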

Getting Data

Need to have data as a DataFrame (data with labeled columns)

As mentioned earlier, it would be very easy to import the data from a CSV file using pandas.read_csv()

See the pandas documentation for DataFrames.

Use boolean indexing to select a subset of rows from the DataFrame. This is useful for selecting only the applicants who want to be close to home, or only the applicants who would rather have jobs that match their interests.
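A rough sketch of boolean indexing for the two groups. The column names are hypothetical, and it assumes that an applicant who does not prefer closer jobs is one who cares more about matching interests; the real data may encode that preference differently.

```python
import pandas as pd

# Hypothetical data for applicants who all have jobs.
df = pd.DataFrame({
    "applicant_id": [1, 2, 3, 4],
    "prefers_close": [True, False, True, False],
    "travel_minutes": [15, 40, 20, 25],
    "interest_match": [True, True, False, True],
})

# A boolean Series inside the brackets keeps only the rows where it is True.
close_to_home = df[df["prefers_close"]]

# ~ negates the mask (assumption: not preferring proximity means
# preferring a job that matches interests).
prefers_interest = df[~df["prefers_close"]]

# Percentage of jobs matching interests, overall and within that group.
pct_match_all = 100 * df["interest_match"].mean()
pct_match_interest_group = 100 * prefers_interest["interest_match"].mean()

print(close_to_home["travel_minutes"].tolist())
print(pct_match_all, pct_match_interest_group)
```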

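Analyzing/visualizing Data

Use df.describe() to get basic info for numeric columns: count, mean, standard deviation, min, max, and quartiles (the 50% row is the median). It does not report the mode; use df.mode() for that.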
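For illustration, here is describe() on a small made-up column of travel times (the travel_minutes name and the values are hypothetical), with the median and mode fetched separately:

```python
import pandas as pd

# Hypothetical travel times in minutes.
travel = pd.Series([10, 15, 15, 20, 40], name="travel_minutes")

# describe() returns a Series indexed by statistic name:
# count, mean, std, min, 25%, 50%, 75%, max.
summary = travel.describe()
print(summary)

# The median is also available directly; mode() returns a Series
# because several values can tie for most frequent.
median_minutes = travel.median()
mode_minutes = travel.mode().tolist()
print(median_minutes, mode_minutes)
```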

Generating Histograms

import matplotlib.pyplot as plt

# df is the DataFrame; df.plot() creates its own figure
df.plot(kind='hist')

# to save the graph to foo.png or foo.pdf
# (do this before show(), otherwise the saved file may be blank)
plt.savefig('foo.png')
plt.savefig('foo.pdf')

# to open a window with the graph
plt.show()

The pandas documentation for histograms includes info on how to customize colors, the number of bins, etc.

For this project, it's best to only plot one column of the DataFrame at a time because we don't want the bars to be stacked.
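One possible way to plot a single column at a time, sketched with made-up data. The column names, the save_histogram() helper, and the output file names are all hypothetical; the Agg backend is chosen only so the example runs without a display, and files are written to a temporary directory.

```python
import os
import tempfile

import matplotlib
matplotlib.use("Agg")  # non-interactive backend; no window is opened
import matplotlib.pyplot as plt
import pandas as pd

# Hypothetical data for applicants who have jobs.
df = pd.DataFrame({
    "prefers_close": [True, False, True, False, True],
    "travel_minutes": [15, 40, 20, 25, 10],
})


def save_histogram(series, path, bins=5):
    """Plot one column as a histogram and save it to the given path."""
    fig, ax = plt.subplots()
    series.plot(kind="hist", bins=bins, ax=ax)
    ax.set_xlabel("Travel time (minutes)")
    ax.set_ylabel("Number of applicants")
    fig.savefig(path)
    plt.close(fig)  # free the figure once it has been written
    return path


out_dir = tempfile.mkdtemp()

# Histogram for all applicants.
all_path = save_histogram(
    df["travel_minutes"], os.path.join(out_dir, "all_applicants.png"))

# Histogram for applicants who want to stay close to home.
close_path = save_histogram(
    df.loc[df["prefers_close"], "travel_minutes"],
    os.path.join(out_dir, "close_to_home.png"))

print(all_path, close_path)
```

Passing a single Series rather than the whole DataFrame keeps each histogram to one set of bars.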

Running This

If we want to execute this file from the Ruby file, then we need a way to launch a subprocess in the Ruby file. Assuming that nothing else needs to be done in the Ruby file after launching this Python file, we can use exec("python fileName.py") (see the Ruby documentation for Kernel#exec). Note that exec replaces the running Ruby process, so control never comes back to the Ruby file. If we need to return to the Ruby file, Ruby has other options such as system or backticks; the linked guide has more information (and a handy flowchart!) on what to choose.