
An automated Hybrid Resume NER based on Rule-Based model, Machine Learning Model, and Transformer model


JennyTan5522/NLP-Resume-Parsing


About:

Managing and organising resumes is challenging: traditional resume management is time-consuming and requires manually sorting, organising, and assessing large numbers of resumes. Named Entity Recognition (NER) is a technique used in Document Management Systems (DMS) to automatically extract and categorise significant information from documents. It identifies and categorises entities within a document, such as persons, organisations, and locations. By automating this process, NER can increase the efficiency and accuracy of document management while letting users retrieve critical information quickly.

In this study, we aim to remove the need for manual resume screening by proposing a hybrid Automated Resume Named Entity Recognition (NER) system that extracts resume data automatically.

Objectives:

  • Develop an effective Automated Resume NER system for processing resumes.
  • Evaluate the performance of the Traditional Baseline Model (rule-based), the Machine Learning-Based Model and the Transformer-Based Model on resume NER.
  • Build a CV Recommendation Model using Latent Dirichlet Allocation (LDA) topic modelling on resume keywords and similarity scores of topic distributions.
  • Hybridise the Traditional Baseline Model, the Machine Learning-Based Model and the Transformer-Based Model for named entity extraction from resume documents.
  • Automate the summarisation of candidate data and provide functionality such as searching and ranking based on scores.

Dataset:

  1. Around 200 resumes from Kaggle: rule-based and machine learning training and testing data
  2. Around 220 annotated resumes from Kaggle: BERT training and testing data (JSON format)
  3. Around 200 resumes from Kaggle: spaCy training and testing data (JSON format)
  4. An additional 5 randomly selected resumes, used to test the different models

Data Preprocessing:

Based on our research, not every NER pipeline cleans resume text before extraction. Therefore, in this study, we evaluate the performance of resume NER both before and after preprocessing.
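The README does not list the preprocessing steps, so the following is only a minimal sketch that assumes a typical cleaning pass (lowercasing, punctuation removal, whitespace collapsing). The `preprocess` function and `EMAIL_RE` pattern are hypothetical names, not taken from this repository. It illustrates how the same extraction pattern can behave differently on raw and cleaned text:

```python
import re
import string

# Hypothetical email pattern, used only for this illustration.
EMAIL_RE = re.compile(r"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}")


def preprocess(text: str) -> str:
    """Assumed cleaning: lowercase, drop punctuation, collapse whitespace."""
    text = text.lower()
    text = text.translate(str.maketrans("", "", string.punctuation))
    return re.sub(r"\s+", " ", text).strip()


raw = "Jane Doe | jane.doe@example.com | Kuala Lumpur"
cleaned = preprocess(raw)

# Run the same pattern on both versions of the text.
emails_before = EMAIL_RE.findall(raw)
emails_after = EMAIL_RE.findall(cleaned)

print(emails_before)
print(emails_after)
```

Under this assumed cleaning, the `@` and `.` characters are stripped, so the pattern no longer finds the address in the cleaned text.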

Flow Chart:

image


EDA

  1. WordCloud

    image

  2. N-Grams

    image


Rule-based Model (Established patterns or rules to identify)

  1. Identify the candidate's name
  2. Identify phone numbers
  3. Identify email addresses
  4. Identify qualifications
  5. Identify graduation years
  6. Identify locations
  7. Identify the candidate's job skills (check whether any keywords match the skill_set.txt corpus)
  8. Identify university names
  9. Identify company names
  10. Identify the candidate's designations or working experience (check whether any keywords match job titles in the job-title.txt corpus)
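Several of the rules above can be sketched with regular expressions and keyword lookups. This is a minimal illustration, not the repository's implementation: the patterns, function names, and the in-memory keyword lists (standing in for skill_set.txt and job-title.txt) are all assumptions. Names, locations, universities, and companies usually need more than a single pattern and are left out.

```python
import re

# Hypothetical patterns; real resumes will need broader rules.
EMAIL_RE = re.compile(r"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}")
PHONE_RE = re.compile(r"\+?\d[\d ().-]{7,}\d")
YEAR_RE = re.compile(r"\b(?:19|20)\d{2}\b")


def load_keywords(lines):
    """Normalise keyword lines, e.g. the lines of a corpus file."""
    return {line.strip().lower() for line in lines if line.strip()}


def match_keywords(text, keywords):
    """Return the keywords that occur in the text as whole terms."""
    lowered = text.lower()
    found = []
    for kw in sorted(keywords):
        if re.search(r"(?<!\w)" + re.escape(kw) + r"(?!\w)", lowered):
            found.append(kw)
    return found


def extract_entities(text, skills, job_titles):
    return {
        "email": EMAIL_RE.findall(text),
        "phone": PHONE_RE.findall(text),
        "graduation_year": YEAR_RE.findall(text),
        "skills": match_keywords(text, skills),
        "designation": match_keywords(text, job_titles),
    }


# Stand-ins for the skill and job-title corpora.
skills = load_keywords(["Python", "SQL", "Machine Learning"])
job_titles = load_keywords(["Data Analyst", "Software Engineer"])

resume = (
    "John Smith\n"
    "Email: john.smith@example.com  Phone: +60 12-345 6789\n"
    "Data Analyst at Acme Corp. Skills: Python, SQL.\n"
    "BSc Computer Science, graduated 2019."
)
entities = extract_entities(resume, skills, job_titles)
print(entities)
```

In practice the keyword sets would be read from the corpus files, for example by passing an open file object to `load_keywords`.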

Machine learning-based named entity recognition labelling

image


spaCy Model: Pre-trained on a large dataset of text documents


Results

  1. Rule-based

    image

  2. Machine Learning + Rule-based

    image

  3. spaCy

    image

    image

    image

spaCy Prediction (Testing)

image

Proposed Hybrid Model (Rule-based + spaCy)

  • The hybrid model combines rule-based, machine learning, and transformer NER, so that the strengths of each approach cover the weaknesses of the others.

image
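The README does not spell out how the individual model outputs are combined. One simple possibility, shown here purely as an assumption, is to take the union of the entities each model finds per label and drop case-insensitive duplicates. The function `merge_entities` and the sample dictionaries are hypothetical.

```python
def merge_entities(*sources):
    """Union entity values per label, keeping the first spelling seen."""
    merged = {}
    seen = {}
    for source in sources:
        for label, values in source.items():
            bucket = merged.setdefault(label, [])
            keys = seen.setdefault(label, set())
            for value in values:
                # Compare case-insensitively with whitespace normalised.
                key = " ".join(value.lower().split())
                if key and key not in keys:
                    keys.add(key)
                    bucket.append(value.strip())
    return merged


# Hypothetical outputs from two extractors for the same resume.
rule_based = {"email": ["jane@example.com"], "skills": ["Python", "SQL"]}
model_based = {
    "name": ["Jane Doe"],
    "skills": ["python", "Tableau"],
    "company": ["Acme Corp"],
}

hybrid = merge_entities(rule_based, model_based)
print(hybrid)
```

A union of this kind keeps anything either extractor finds, which tends to favour recall. A priority scheme that trusts one extractor per label would be another reasonable design.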

Sample Results: image

image

image

Preprocessing results (before and after): image


CV Recommendation Model

  • A TF-IDF vectoriser finds the important terms in each document, and cosine similarity is then computed between the job description and each CV.

  • Purpose: Lets the HR department compare the top N candidates by their similarity to the job description.

    image

    image
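The TF-IDF and cosine similarity ranking described above can be sketched in plain Python. This is an illustrative version only: the tokeniser, the smoothed IDF formula, and the names `tfidf_vectors`, `cosine`, and `rank_cvs` are assumptions, and the repository may well use a library vectoriser instead.

```python
import math
import re
from collections import Counter


def tokenize(text):
    # Keep "+" and "#" so terms such as "c++" and "c#" survive.
    return re.findall(r"[a-z0-9+#]+", text.lower())


def tfidf_vectors(docs):
    """Build sparse TF-IDF vectors (dicts) with a smoothed IDF."""
    tokenized = [tokenize(d) for d in docs]
    n_docs = len(tokenized)
    df = Counter(term for tokens in tokenized for term in set(tokens))
    idf = {t: math.log((1 + n_docs) / (1 + c)) + 1 for t, c in df.items()}
    return [
        {t: count * idf[t] for t, count in Counter(tokens).items()}
        for tokens in tokenized
    ]


def cosine(a, b):
    dot = sum(w * b.get(t, 0.0) for t, w in a.items())
    norm_a = math.sqrt(sum(w * w for w in a.values()))
    norm_b = math.sqrt(sum(w * w for w in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def rank_cvs(job_description, cvs, top_n=3):
    """Return the top_n (cv_name, similarity) pairs, best first."""
    names = list(cvs)
    vectors = tfidf_vectors([job_description] + [cvs[n] for n in names])
    jd_vec = vectors[0]
    scored = [(n, cosine(jd_vec, v)) for n, v in zip(names, vectors[1:])]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)[:top_n]


# Made-up job description and CV snippets.
job = "Data analyst with Python and SQL experience"
cvs = {
    "cv_a": "Python developer, SQL reporting, data analyst internship",
    "cv_b": "Graphic designer skilled in Photoshop and Illustrator",
    "cv_c": "Java backend engineer with SQL experience",
}
ranking = rank_cvs(job, cvs, top_n=2)
for name, score in ranking:
    print(name, round(score, 3))
```

The CV that shares the most distinctive terms with the job description is ranked first. Scores fall between 0 and 1 because all TF-IDF weights are non-negative.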


Topic Modelling

  • The Latent Dirichlet Allocation (LDA) technique is used to identify the main topics present in a collection of resumes.

  • Purpose: Quickly identify suitable candidates based on the skills and experience listed in their resumes.

  • Users can search for a keyword that appears in a topic, and the relevant resumes are returned.

    image

    Search candidates' resume based on keywords

    image
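The keyword search over topics can be sketched as follows, assuming an LDA model has already produced the top keywords per topic and a topic distribution per resume. The data structures, the `min_weight` threshold, and the function `search_by_keyword` are hypothetical and only illustrate the lookup step, not the topic model itself.

```python
def search_by_keyword(keyword, topic_keywords, resume_topics, min_weight=0.3):
    """Return (resume_id, weight) pairs for topics containing the keyword."""
    keyword = keyword.lower()
    matching = {tid for tid, words in topic_keywords.items() if keyword in words}
    hits = []
    for resume_id, distribution in resume_topics.items():
        # Total probability mass the resume assigns to the matching topics.
        weight = sum(distribution.get(tid, 0.0) for tid in matching)
        if weight >= min_weight:
            hits.append((resume_id, round(weight, 3)))
    return sorted(hits, key=lambda hit: hit[1], reverse=True)


# Assumed LDA output: top keywords per topic and per-resume distributions.
topic_keywords = {
    0: {"python", "sql", "analytics"},
    1: {"marketing", "sales", "branding"},
}
resume_topics = {
    "resume_01": {0: 0.85, 1: 0.15},
    "resume_02": {0: 0.20, 1: 0.80},
    "resume_03": {0: 0.55, 1: 0.45},
}

results = search_by_keyword("Python", topic_keywords, resume_topics)
print(results)
```

Resumes are returned in descending order of how strongly they belong to the topics that contain the keyword. A keyword that appears in no topic returns an empty list.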


NER Functionalities - Search, Summary and Ranking

  1. Search candidate and retrieve the Resume Summary Report

image

  2. Skills Ranking Score

image

  3. Search candidate

image


Results and Findings:

  1. Hybrid entity extraction performed better than any single model. It was the most reliable at assigning values to their respective entity labels. The hybrid model can extract resume named entities from the tested resumes, such as name, email, phone number, skills, designation, company and more. The proposed hybrid strategy improves the identification of resume named entities, achieving a precision of 87.62% and a recall of 96.91%.

  2. We found that the rule-based model performed worse on preprocessed text than on the original text. This is likely because the preprocessing step removed important information.
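For reference, precision and recall over extracted entities can be computed as in the sketch below. The README does not say how matches were counted, so this assumes exact matching on (label, value) pairs. The function name and the toy data are illustrative and are not the project's evaluation set.

```python
def precision_recall(predicted, gold):
    """Exact-match precision and recall over (label, value) pairs."""
    predicted, gold = set(predicted), set(gold)
    true_positives = len(predicted & gold)
    precision = true_positives / len(predicted) if predicted else 0.0
    recall = true_positives / len(gold) if gold else 0.0
    return precision, recall


# Toy annotations for one resume.
gold = {
    ("email", "jane@example.com"),
    ("skills", "python"),
    ("skills", "sql"),
    ("company", "acme corp"),
}
# Toy predictions: every gold entity plus one spurious company.
predicted = gold | {("company", "kuala lumpur")}

precision, recall = precision_recall(predicted, gold)
print(precision, recall)
```

In this toy case every gold entity is found, so recall is perfect, while the one spurious prediction lowers precision.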

Future Work and Limitation:

  • The lack of a standard resume structure makes it difficult to extract named entities automatically.
