About:
Managing and organizing resumes is challenging: traditional resume management is time-consuming and requires manually sorting, organizing, and assessing large numbers of resumes. Named Entity Recognition (NER) is a technique used in Document Management Systems (DMS) to automatically extract and categorise significant information from documents. It identifies and classifies entities in a document, such as persons, organisations, and locations. Automating this step can increase the efficiency and accuracy of document management and lets users retrieve critical information quickly.
In this study, we aim to remove the need for manual resume screening by proposing a hybrid Automated Resume Named Entity Recognition (NER) system that extracts resume data automatically.
Objectives:
- Develop an effective Automated Resume NER system for processing resumes.
- Evaluate the performance of the traditional baseline (rule-based) model, the machine-learning-based model, and the transformer-based model on resume NER.
- Build a CV recommendation model that applies Latent Dirichlet Allocation (LDA) topic modelling to resume keywords and compares the similarity scores of the resulting topic distributions.
- Combine the rule-based, machine-learning-based, and transformer-based models into a hybrid approach for extracting named entities from resume documents.
- Automatically summarise candidate data and provide functions such as search and score-based ranking.
Dataset:
- Around 200 Resumes from Kaggle: Rule-based and Machine Learning training and testing data
- Around 220 Annotated Resumes from Kaggle: BERT training and testing data (JSON format)
- Around 200 Resumes from Kaggle: spaCy training and testing data (JSON format)
- An additional 5 resumes, randomly selected, used to test the different models
Data Preprocessing:
Our research indicates that resumes are not always cleaned before NER is applied. In this study, we therefore evaluate resume NER performance both before and after preprocessing.
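To illustrate why the before/after comparison matters, here is a minimal sketch. The `preprocess` function is a hypothetical cleaning step (lowercasing and punctuation removal), not necessarily the one used in this project, and the email pattern is a simplified assumption.

```python
import re
import string

# Simplified email pattern, assumed here for illustration only.
EMAIL_RE = re.compile(r"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}")

def preprocess(text):
    """Hypothetical cleaning: lowercase, replace punctuation with spaces, collapse whitespace."""
    text = text.lower()
    table = str.maketrans(string.punctuation, " " * len(string.punctuation))
    return re.sub(r"\s+", " ", text.translate(table)).strip()

def find_emails(text):
    """Return every substring that matches the email pattern."""
    return EMAIL_RE.findall(text)

raw = "Jane Doe | jane.doe@example.com | Kuala Lumpur"
print(find_emails(raw))              # ['jane.doe@example.com']
print(find_emails(preprocess(raw)))  # []
```

In this sketch the cleaning step removes the "@" and "." characters that the pattern depends on, so the same rule finds nothing in the cleaned text.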
Flow Chart:
EDA
Rule-based Model (uses established patterns or rules to identify entities)
- Identify the candidate's name
- Identify phone numbers
- Identify email addresses
- Identify qualifications
- Identify graduation years
- Identify locations
- Identify the candidate's job skills (check whether any keywords match the skill_set.txt corpus)
- Identify university names
- Identify company names
- Identify the candidate's designations or working experience (check whether any keywords match job titles in the job-title.txt corpus)
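A few of the rules above can be sketched with regular expressions and keyword lookup. This is a sketch under stated assumptions, not the project's actual rules: the patterns are simplified, `extract_entities` is a hypothetical helper, and the `SKILLS` set stands in for the contents of skill_set.txt.

```python
import re

EMAIL_RE = re.compile(r"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}")
# Loose phone pattern: optional "+", a digit, at least six digits or
# separators (space, parentheses, dot, hyphen), then a final digit.
PHONE_RE = re.compile(r"\+?\d[\d ().-]{6,}\d")
# Four-digit years starting with 19 or 20.
YEAR_RE = re.compile(r"\b(?:19|20)\d{2}\b")

def extract_entities(text, skills):
    """Apply simple pattern and keyword rules to raw resume text."""
    lowered = text.lower()
    # Match each skill as a whole word or phrase, case-insensitively.
    found_skills = sorted(
        s for s in skills
        if re.search(r"(?<!\w)" + re.escape(s.lower()) + r"(?!\w)", lowered)
    )
    return {
        "email": EMAIL_RE.findall(text),
        "phone": PHONE_RE.findall(text),
        "years": YEAR_RE.findall(text),
        "skills": found_skills,
    }

SKILLS = {"python", "sql", "machine learning"}  # stand-in for skill_set.txt
resume = (
    "John Tan\nEmail: john.tan@example.com\nPhone: +60 12-345 6789\n"
    "BSc Computer Science, 2019\nSkills: Python, SQL, Machine Learning"
)
print(extract_entities(resume, SKILLS))
```

Names, locations, universities, and companies are harder to capture with fixed patterns, which is one reason to pair rules like these with a trained model.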
Machine-learning-based named entity recognition labelling
spaCy Model: pre-trained on a large dataset of text documents
Results
spaCy Prediction (Testing)
Proposed Hybrid Model (Rule-based + spaCy)
- The hybrid model combines rule-based, machine learning, and transformer NER, so that the strengths of each approach cover the weaknesses of the others.
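One way to combine the outputs of two extractors is sketched below. The merge policy is an assumption made for illustration (this document does not describe how the models are combined), and `merge_entities`, the label names, and the sample predictions are all hypothetical.

```python
def merge_entities(rule_based, model_based, prefer_rules=("EMAIL", "PHONE")):
    """Combine two {label: [values]} predictions into one.

    Assumed policy (not taken from the project): for labels listed in
    prefer_rules, keep the rule-based values when there are any; otherwise
    take the union of both sources, preserving first-seen order.
    """
    merged = {}
    for label in sorted(set(rule_based) | set(model_based)):
        rules = list(rule_based.get(label, []))
        model = list(model_based.get(label, []))
        if label in prefer_rules and rules:
            merged[label] = rules
        else:
            # dict.fromkeys removes duplicates while keeping insertion order.
            merged[label] = list(dict.fromkeys(rules + model))
    return merged

rule_out = {"EMAIL": ["a.lee@example.com"], "SKILLS": ["python"]}
model_out = {"EMAIL": ["a.lee@example"], "NAME": ["Amy Lee"], "SKILLS": ["python", "nlp"]}
print(merge_entities(rule_out, model_out))
# {'EMAIL': ['a.lee@example.com'], 'NAME': ['Amy Lee'], 'SKILLS': ['python', 'nlp']}
```

Here the rules win for strictly patterned fields, while the model fills in labels such as names that the rules do not cover.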
Preprocessing Results (before and after)
CV Recommendation Model
- A TF-IDF vectoriser identifies the important terms in each document, and cosine similarity is then computed between the job description and each CV.
- Purpose: allows the HR department to check how closely the Top N candidates match the job description.
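The ranking idea can be sketched in plain Python. This is a hand-rolled TF-IDF with a smoothed IDF term, written for illustration instead of a library vectoriser. The function names, the tokeniser, and the sample texts are assumptions, not the project's code.

```python
import math
import re
from collections import Counter

def tokenize(text):
    """Lowercase the text and split it into simple alphanumeric tokens."""
    return re.findall(r"[a-z0-9+#]+", text.lower())

def tfidf_vectors(docs):
    """Return one {term: weight} dict per document, using smoothed IDF."""
    tokenized = [tokenize(d) for d in docs]
    n = len(tokenized)
    # Document frequency: in how many documents each term appears.
    df = Counter(term for toks in tokenized for term in set(toks))
    idf = {t: math.log((1 + n) / (1 + c)) + 1 for t, c in df.items()}
    return [{t: tf * idf[t] for t, tf in Counter(toks).items()} for toks in tokenized]

def cosine(a, b):
    """Cosine similarity between two sparse {term: weight} vectors."""
    dot = sum(w * b.get(t, 0.0) for t, w in a.items())
    norm = math.sqrt(sum(w * w for w in a.values())) * math.sqrt(sum(w * w for w in b.values()))
    return dot / norm if norm else 0.0

def rank_cvs(job_description, cvs, top_n=3):
    """Score every CV against the job description and return the top_n."""
    vectors = tfidf_vectors([job_description] + list(cvs.values()))
    job_vec, cv_vecs = vectors[0], vectors[1:]
    scores = {name: cosine(job_vec, vec) for name, vec in zip(cvs, cv_vecs)}
    return sorted(scores.items(), key=lambda kv: kv[1], reverse=True)[:top_n]

job = "Data scientist with Python, SQL and machine learning experience"
cvs = {
    "cv_a": "Python developer, machine learning and SQL projects",
    "cv_b": "Graphic designer skilled in Photoshop and branding",
}
for name, score in rank_cvs(job, cvs):
    print(name, round(score, 3))
```

In this toy example `cv_a` ranks first because it shares several weighted terms with the job description, while `cv_b` shares only a common word.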
Topic Modelling
- The Latent Dirichlet Allocation (LDA) technique is used to identify the main topics present in a collection of resumes.
- Purpose: quickly identify suitable candidates based on the skills and experience listed in their resumes.
- Users can search for a keyword appearing in any topic, and the relevant resumes are returned.
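The keyword search over topics could work roughly as follows. This sketch assumes an LDA model has already been fitted and supplies two things: the top words per topic and a topic distribution per resume. The `search_by_keyword` function, the `min_weight` threshold, and all values shown are illustrative assumptions rather than model output.

```python
def search_by_keyword(keyword, topic_words, doc_topics, min_weight=0.3):
    """Return (resume_id, weight) pairs for topics whose top words contain the keyword."""
    keyword = keyword.lower()
    # Topics whose top-word list includes the keyword.
    hit_topics = [t for t, words in topic_words.items() if keyword in words]
    results = []
    for doc_id, dist in doc_topics.items():
        # Total probability mass this resume places on the matching topics.
        weight = sum(dist[t] for t in hit_topics)
        if hit_topics and weight >= min_weight:
            results.append((doc_id, weight))
    return sorted(results, key=lambda r: r[1], reverse=True)

# Illustrative values only; a fitted LDA model would supply these.
topic_words = {0: ["python", "sql", "analytics"], 1: ["design", "photoshop", "branding"]}
doc_topics = {
    "resume_01": [0.85, 0.15],
    "resume_02": [0.10, 0.90],
    "resume_03": [0.40, 0.60],
}
print(search_by_keyword("python", topic_words, doc_topics))
# [('resume_01', 0.85), ('resume_03', 0.4)]
```

The threshold controls how strongly a resume must be associated with a matching topic before it is returned.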
Search candidates' resumes based on keywords
NER Functionalities - Search, Summary and Ranking
- Search candidate and retrieve the Resume Summary Report
- Skills Ranking Score
- Search candidate
Results and Findings:
- Hybrid entity extraction performed better than any single model and was the most reliable at assigning values to the correct entity labels. The hybrid model can extract resume named entities such as name, email, phone number, skills, designation, and company, among others, from the tested resumes. The proposed hybrid strategy identifies resume named entities with a precision of 87.62% and a recall of 96.91%.
- We found that the rule-based model performed worse on preprocessed text than on the original text. This is likely because the preprocessing step removed important information.
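For reference, precision and recall for entity extraction can be computed as in the sketch below. It assumes exact-match scoring over (label, value) pairs, which may differ from the matching criterion used in this study, and the sample entities are made up for illustration.

```python
def precision_recall(predicted, gold):
    """Compute precision and recall over sets of (label, value) pairs."""
    predicted, gold = set(predicted), set(gold)
    true_pos = len(predicted & gold)
    # Precision: share of predicted entities that are correct.
    precision = true_pos / len(predicted) if predicted else 0.0
    # Recall: share of gold entities that were found.
    recall = true_pos / len(gold) if gold else 0.0
    return precision, recall

gold = {("NAME", "Amy Lee"), ("EMAIL", "a.lee@example.com"), ("SKILL", "python")}
pred = {
    ("NAME", "Amy Lee"),
    ("EMAIL", "a.lee@example.com"),
    ("SKILL", "python"),
    ("SKILL", "lee"),  # a spurious extraction lowers precision but not recall
}
p, r = precision_recall(pred, gold)
print(f"precision={p:.2f} recall={r:.2f}")  # precision=0.75 recall=1.00
```

A recall higher than precision, as in the reported results, means the system finds most true entities but also emits some extra ones.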
Future Work and Limitations:
- The lack of a standard resume structure makes it difficult to extract named entities automatically.