Project Details

Tell me and I forget. Teach me and I may remember. Involve me and I learn - Benjamin Franklin

Can neural networks do a better job in the hiring process? While it is obvious that they will save time, allowing recruiters to reallocate their time to other tasks and examine only promising candidates, will they truly gain an intuitive understanding of human language and get rid of the prejudices that can arise during selection by hand?

On the Approach of this Project

This project will use a keywordless approach to producing a selection procedure for candidates so that the model doesn't overfit on keywords. "Most recruiters focus on keywords and it's almost impossible to guarantee a fair process of candidate selection (Singh, 2016)" - Page 4 of the proposal paper.

The scope of this project, and of the data that will later be considered for training its neural network, is entry-level/graduate positions, with more emphasis on "...broad abilities such as general cognitive ability" as opposed to previous job experience or specific job-performance criteria. - Pages 5, 6

Supervised learning is the obvious approach, as we need to show the machine what good resumes/CVs look like as well as what bad ones look like. To ensure fairness, sensitive information, or information that may make the machine learn the wrong trend in the data, will be stripped out. This includes names, sex, age, religion, marital status and numeric data. Doing so ensures that there is no inclination towards ethnicity, age, religion or marital status, and that the machine doesn't learn the wrong trend, as is the case with numeric data. Numbers written in string form will, however, be preserved. No keywords will be looked for, as we don't want the model to overfit on specific keywords but rather to capture the underlying trend in the textual data, inferring meaning from how the language is used.
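
As a rough illustration of the last point, a hypothetical preprocessing step (the function name and regular expression below are my own sketch, not part of the proposal) could strip digit-based figures while leaving spelled-out numbers untouched:

import re

def strip_numeric_data(text):
    # Remove digit-based figures such as "2016" or "4.5"; spelled-out
    # numbers like "three" are left untouched.
    return re.sub(r"\d+(\.\d+)?%?", "", text)

strip_numeric_data("Graduated in 2016 with three awards and a 4.5 GPA.")
# 'Graduated in  with three awards and a  GPA.'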

How the Data is Acquired and Labeled

With supervision and guidance from a recruiting company, human recruiters who have access to the kind of data we need and who understand the problem will provide us with training data for our model. Based on previous hiring, the resumes are placed into two folders, one labeled '1' and the other '0'. The resumes of candidates who were shortlisted are placed in the '1' folder and the resumes of those who didn't make the shortlist are placed in the '0' folder. Again, care is taken to ensure that there is no conceivable bias, intended or unintended, in this labeling process. Furthermore, during preprocessing, to keep the bearers of the resumes anonymous and to prevent the model from learning biases and inclining towards one group of the population, all sensitive information is removed before the preprocessed data is stored in a database for access later during training.
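
A minimal sketch of how such a folder layout might be read into (text, label) pairs; the root directory name and helper function are hypothetical, though the '1'/'0' sub-folders follow the convention described above:

from pathlib import Path

def load_labeled_resumes(root="resumes"):
    # Expects two sub-folders: resumes/1 (shortlisted) and resumes/0 (not shortlisted).
    data = []
    for label in ("0", "1"):
        for path in Path(root, label).glob("*.txt"):
            data.append((path.read_text(encoding="utf-8"), int(label)))
    return data

# dataset = load_labeled_resumes()  ->  e.g. [("...resume text...", 0), ...]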

Natural Language Processing

For a more comprehensive yet gentle introduction to these concepts, check out this notebook written by Elvis Saravia, which can be opened in Google Colab.

Tokenization

After preprocessing, which has been covered in the last two paragraphs, one of the most common first steps is to tokenize the structured text. Structured text is text with a predictable format that needs little to no preprocessing and can easily be understood by the machine, as well as stored, retrieved and manipulated in a database. The corpus is the entire collection of the structured text. Tokenization is the splitting of the text into its component words, which are referred to as tokens. Each token bears a token identifier, a number, such that any time that word occurs in the text, it is referred to by that number by the machine.


A simple tokenizer could look like this:

text = "The cat sat on the mat. The cat in the hat."
token_nums = []
token_words = []
for w in text.split(" "):
    word = w.lower()
    if word in token_words:
        # Reuse the number already assigned to this word.
        token_nums.append(token_words.index(word) + 1)
    else:
        # Begin from 1 if no token number has been assigned yet; otherwise
        # take the highest number so far and add 1. Using the length of the
        # list instead would skip numbers, since the list also holds the
        # repeated entries for words we have already seen.
        if token_nums:
            token_nums.append(max(token_nums) + 1)
        else:
            token_nums.append(1)
    token_words.append(word)

list(zip(token_nums, token_words))

Output:

[(1, 'the'),
 (2, 'cat'),
 (3, 'sat'),
 (4, 'on'),
 (1, 'the'),
 (5, 'mat.'),
 (1, 'the'),
 (2, 'cat'),
 (6, 'in'),
 (1, 'the'),
 (7, 'hat.')]

Note how repeating words all share the same token number. So there are 7 unique tokens in the text, not 11, which is the total number of words.

Good as this may seem, our simple tokenizer does not split punctuation marks from words. This could be a problem, as we may not want punctuation marks in the final data. Also, hyphenated words will be returned as one word, and if there is more than one space in the text, empty strings will appear among the tokens. For these cases, we can use any of the handful of NLP libraries out there, like NLTK or spaCy, which give us flexibility in how we choose to tokenize, so that place names made up of two or three words, for instance "Trinidad and Tobago", can be treated as a single token.
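
For instance, a quick sketch with NLTK's word_tokenize (assuming the 'punkt' tokenizer data has been downloaded; newer NLTK versions may also require 'punkt_tab') splits the punctuation out for us:

import nltk
nltk.download("punkt", quiet=True)  # one-time download of the tokenizer model
from nltk.tokenize import word_tokenize

word_tokenize("The cat sat on the mat. The cat in the hat.")
# ['The', 'cat', 'sat', 'on', 'the', 'mat', '.', 'The', 'cat', 'in', 'the', 'hat', '.']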

Lemmatization

Lemmatization also involves splitting a text into its component parts but, unlike tokenization, gets to the base meaning of the language text. The split parts are called lemmas. Personal pronouns such as he, she and I are marked as pronouns and regarded similarly. 'Being' verbs like am, are and is are replaced with 'be'. Pluralized words like dogs and churches get reduced to their singular forms. This reduces the text to a more standard form, otherwise known as normalizing the text. This distills the information to such a point that accuracy may even improve slightly in natural language models, depending on the usage of course. Lemmatizers can spot verbs, nouns, pronouns and even punctuation, and there is even flexibility in what exceptions one might want to allow.
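
A minimal lemmatization sketch with spaCy, assuming the small English model has been installed with "python -m spacy download en_core_web_sm"; the example sentence is my own:

import spacy

nlp = spacy.load("en_core_web_sm")
doc = nlp("The dogs are running and she is singing in the churches.")

for token in doc:
    print(token.text, "->", token.lemma_, "(" + token.pos_ + ")")
# e.g. dogs -> dog (NOUN), are -> be (AUX), churches -> church (NOUN)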

Stemming

Stemming is closely related to lemmatization and involves chopping off the suffixes of words that can be extended. Words like beginning, practicing and argue will be stemmed to begin, practic and argu, as each has different possible extended forms, e.g. begins, practice and argument.
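
A quick sketch of this using NLTK's PorterStemmer (one of several available stemmers; others may produce slightly different stems):

from nltk.stem import PorterStemmer

stemmer = PorterStemmer()
[stemmer.stem(w) for w in ["beginning", "practicing", "argue"]]
# ['begin', 'practic', 'argu']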

Sequencing

Not to be confused with other usages of 'sequence', such as sequence models, a sequence is a string of tokens (the numbers, not the text) that is passed to the machine to learn patterns from. This is more of a text-processing concept in TensorFlow as far as I know. For the example above, the sequence would be [1, 2, 3, 4, 1, 5, 1, 2, 6, 1, 7].
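
A hedged sketch of the same idea with Keras' (now legacy) Tokenizer; note that it assigns indices by word frequency rather than by order of first appearance, so the exact numbers are not guaranteed to match the hand-rolled example above:

from tensorflow.keras.preprocessing.text import Tokenizer

sentences = ["The cat sat on the mat. The cat in the hat."]
tokenizer = Tokenizer()            # strips punctuation and lowercases by default
tokenizer.fit_on_texts(sentences)

print(tokenizer.word_index)                     # e.g. {'the': 1, 'cat': 2, 'sat': 3, ...}
print(tokenizer.texts_to_sequences(sentences))  # e.g. [[1, 2, 3, 4, 1, 5, 1, 2, 6, 1, 7]]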
