GitHub - JennyTan5522/NLP-Resume-Parsing: An automated Hybrid Resume NER based on Rule-Based model, Machine Learning Model, and Transformer model

About:

Managing and organizing resumes is challenging because traditional resume management is time-consuming and requires manually sorting, organizing, and assessing large numbers of resumes. Named Entity Recognition (NER) is a technique used in Document Management Systems (DMS) to automatically extract and categorise significant information from documents. NER identifies and categorises entities inside a document, such as persons, organisations, and locations. By automating this process, NER can increase the efficiency and accuracy of document management while letting users quickly obtain critical information.

In this study, we aim to solve the problem of manually screening resumes by proposing a hybrid automated resume Named Entity Recognition (NER) system that automates the extraction of resume data.

Objectives:

Develop an effective automated resume NER system for processing resumes.
Evaluate the performance of the traditional baseline model (rule-based), the machine-learning-based model, and the transformer-based model on resume NER.
Build a CV recommendation model using Latent Dirichlet Allocation (LDA) topic modelling on resume keywords and similarity scores of topic distributions.
Hybridise the traditional baseline model, the machine-learning-based model, and the transformer-based model for named entity recognition on resume documents.
Provide automated summarization of candidates' data, with functionalities such as searching and ranking based on scores.

Dataset:

Around 200 resumes from Kaggle: rule-based and machine learning training and testing data
Around 220 annotated resumes from Kaggle: BERT training and testing data (JSON format)
Around 200 resumes from Kaggle: spaCy training and testing data (JSON format)
An additional 5 randomly selected resumes, used to test the different models

Data Preprocessing:

Based on our research, resumes are not always cleaned before NER is applied. Therefore, in this study, we evaluate the performance of resume NER both before and after preprocessing.
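To make the before/after comparison concrete, here is a minimal sketch of one possible preprocessing step. The specific steps (lowercasing, punctuation stripping, stop-word removal), the tiny stop-word list, and the sample text are all assumptions for illustration; the README does not state which cleaning steps the project applies.

```python
import re
import string

# Hypothetical stop-word list; a real pipeline would use a fuller one.
STOP_WORDS = {"a", "an", "and", "at", "in", "of", "the", "to", "with"}

# Simple email pattern, used only to compare raw and cleaned text.
EMAIL_RE = re.compile(r"[\w.+-]+@[\w-]+(?:\.[\w-]+)+")


def preprocess(text: str) -> str:
    """Lowercase, strip punctuation, and drop stop words (assumed steps)."""
    lowered = text.lower()
    no_punct = lowered.translate(str.maketrans("", "", string.punctuation))
    tokens = [tok for tok in no_punct.split() if tok not in STOP_WORDS]
    return " ".join(tokens)


# Made-up sample text, not taken from the dataset.
raw = "Contact: Jane Doe, jane.doe@example.com, based in Kuala Lumpur."
cleaned = preprocess(raw)

print(cleaned)
print(EMAIL_RE.findall(raw))      # the address is found in the raw text
print(EMAIL_RE.findall(cleaned))  # nothing matches once '@' and '.' are gone
```

Running the same extractor on `raw` and `cleaned` is one way to measure how much a given cleaning step helps or hurts a pattern-based model.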

Flow Chart:

EDA

WordCloud
N-Grams
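
The N-gram part of the EDA can be sketched with the standard library alone. This is an illustrative sketch, not the project's notebook code: the tokenizer, the function names, and the two sample sentences are assumptions.

```python
import re
from collections import Counter


def tokenize(text: str) -> list[str]:
    """Lowercase and split into simple word tokens (keeps '+' and '#')."""
    return re.findall(r"[a-z0-9+#]+", text.lower())


def ngram_counts(texts, n: int) -> Counter:
    """Count n-grams (as tuples of tokens) across a collection of texts."""
    counts = Counter()
    for text in texts:
        tokens = tokenize(text)
        # zip over n shifted views of the token list yields consecutive n-grams
        counts.update(zip(*(tokens[i:] for i in range(n))))
    return counts


# Made-up sample texts standing in for resume content.
resumes = [
    "Machine learning engineer with Python and machine learning experience",
    "Data analyst skilled in Python and SQL",
]
bigrams = ngram_counts(resumes, 2)
print(bigrams.most_common(3))
```

The resulting counts can feed a bar chart of the most frequent unigrams, bigrams, or trigrams.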

Rule-based Model (established patterns or rules used to identify entities)

Identify candidate's name
Identify phone number
Identify email address
Identify qualifications
Identify graduation years
Identify locations
Identify candidate's job skills (check whether any keywords match the skill_set.txt corpus)
Identify university names
Identify company names
Identify candidate's designations or working experience (check whether any keywords match job titles in the job-title.txt corpus)
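
A few of the rules above can be sketched with regular expressions and keyword lookup. This is a minimal sketch under stated assumptions, not the repository's implementation: the patterns, function names, and sample resume are hypothetical, and the keyword lists stand in for the contents of skill_set.txt and job-title.txt, which are not shown here. Names, locations, universities, and companies are left out because they usually need gazetteers or a trained model.

```python
import re

# Illustrative patterns; real resumes need broader, locale-aware rules.
EMAIL_RE = re.compile(r"[\w.+-]+@[\w-]+(?:\.[\w-]+)+")
PHONE_RE = re.compile(r"\+?\d[\d ().-]{7,}\d")
# Matches any four-digit year from 1900 to 2099, not only graduation years.
YEAR_RE = re.compile(r"\b(?:19|20)\d{2}\b")


def load_keywords(lines):
    """Normalize one keyword per line, as a corpus file might store them."""
    return {line.strip().lower() for line in lines if line.strip()}


def match_keywords(text, keywords):
    """Return the keywords that appear in the text as whole terms."""
    lowered = text.lower()
    found = []
    for kw in sorted(keywords):
        # Lookarounds instead of \b so terms like 'c++' still match cleanly.
        if re.search(r"(?<!\w)" + re.escape(kw) + r"(?!\w)", lowered):
            found.append(kw)
    return found


def extract_entities(text, skills, job_titles):
    """Apply each rule and collect the results per entity label."""
    return {
        "email": EMAIL_RE.findall(text),
        "phone": PHONE_RE.findall(text),
        "graduation_years": YEAR_RE.findall(text),
        "skills": match_keywords(text, skills),
        "designation": match_keywords(text, job_titles),
    }


# Hypothetical stand-ins for the two corpus files and a made-up resume.
skills = load_keywords(["Python", "SQL", "C++"])
titles = load_keywords(["Data Analyst", "Software Engineer"])
resume = (
    "Jane Doe | jane.doe@example.com | +60 12-345 6789\n"
    "Data Analyst, graduated 2019. Skills: Python, SQL"
)
entities = extract_entities(resume, skills, titles)
print(entities)
```

In practice the keyword sets would be loaded with something like `load_keywords(open("skill_set.txt"))`, assuming one keyword per line.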

Machine-learning-based named entity recognition labeling

spaCy Model: pre-trained on a large dataset of text documents

Results

Rule-based
Machine Learning + Rule-based
spaCy

spaCy Prediction (Testing)

Proposed Hybrid Model (Rule-based + spaCy)

The hybrid model combines rule-based, machine learning, and transformer NER. It draws on the strengths of each approach so that each one covers the weaknesses of the others.
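
One simple way to combine extractor outputs is a per-label preference with fallback, sketched below. The merge policy (rules first for pattern-like labels, model first otherwise), the label names, and the sample outputs are assumptions for illustration; the README does not describe how the project merges its models.

```python
# Assumption: pattern-like labels are more reliable from rules,
# context-dependent labels are more reliable from a trained model.
RULE_FIRST_LABELS = {"email", "phone"}


def merge_entities(rule_based, model_based, rule_first=RULE_FIRST_LABELS):
    """Merge two {label: [values]} dicts, falling back when one side is empty."""
    merged = {}
    for label in sorted(set(rule_based) | set(model_based)):
        rule_vals = rule_based.get(label, [])
        model_vals = model_based.get(label, [])
        if label in rule_first:
            primary, secondary = rule_vals, model_vals
        else:
            primary, secondary = model_vals, rule_vals
        # Use the preferred source, or the other one if it found nothing.
        merged[label] = list(primary) if primary else list(secondary)
    return merged


# Made-up outputs standing in for the rule-based and model-based extractors.
rule_out = {"email": ["jane.doe@example.com"], "name": [], "skills": ["python"]}
model_out = {"email": [], "name": ["Jane Doe"], "skills": ["Python", "SQL"]}

hybrid = merge_entities(rule_out, model_out)
print(hybrid)
```

Other policies, such as taking the union of both outputs or voting across three models, fit the same function shape.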

Sample Results:

Preprocessing Results before and after

CV Recommendation Model

A TF-IDF vectoriser finds the important terms in each document, and cosine similarity is then computed between the job description and each CV.
Purpose: Lets the HR department check how similar the top N candidates are to the job description.
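
The TF-IDF and cosine similarity ranking can be sketched with the standard library. This is a self-contained illustration rather than the project's code, which may well rely on a library vectoriser; the function names, the smoothed IDF formula, and the sample texts are assumptions.

```python
import math
import re
from collections import Counter


def tokenize(text):
    return re.findall(r"[a-z0-9+#]+", text.lower())


def tfidf_vectors(docs):
    """Return one sparse {term: weight} TF-IDF vector per document."""
    tokenized = [tokenize(d) for d in docs]
    n_docs = len(tokenized)
    doc_freq = Counter(term for toks in tokenized for term in set(toks))
    # Smoothed IDF so terms present in every document keep a nonzero weight.
    idf = {t: math.log((1 + n_docs) / (1 + df)) + 1 for t, df in doc_freq.items()}
    return [{t: c * idf[t] for t, c in Counter(toks).items()} for toks in tokenized]


def cosine(a, b):
    """Cosine similarity between two sparse vectors."""
    dot = sum(w * b.get(t, 0.0) for t, w in a.items())
    norm_a = math.sqrt(sum(w * w for w in a.values()))
    norm_b = math.sqrt(sum(w * w for w in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def rank_cvs(job_description, cvs, top_n=3):
    """Score each CV against the job description and return the top N."""
    vectors = tfidf_vectors([job_description] + list(cvs.values()))
    job_vec, cv_vecs = vectors[0], vectors[1:]
    scored = [(name, cosine(job_vec, vec)) for name, vec in zip(cvs, cv_vecs)]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)[:top_n]


# Made-up job description and CV texts.
job = "Python developer with SQL and machine learning experience"
cvs = {
    "cv_a": "Python and SQL developer, machine learning projects",
    "cv_b": "Graphic designer experienced in branding and illustration",
}
ranking = rank_cvs(job, cvs, top_n=2)
for name, score in ranking:
    print(f"{name}: {score:.3f}")
```

Because the job description is vectorised together with the CVs, all documents share one vocabulary and one set of IDF weights.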

Topic Modelling

The Latent Dirichlet Allocation (LDA) technique is used to identify the main topics present in a collection of resumes.
Purpose: Quickly identify suitable candidates based on the skills and experience listed in their resumes.
Users can search for a keyword appearing in any topic, and the relevant resumes are returned.

Search candidates' resumes based on keywords
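
The keyword search over topics might look like the sketch below. It assumes an LDA model has already been fitted and that its output is available in two plain structures: the top terms per topic and a topic distribution per resume. Those structures, the `min_weight` threshold, and all sample values are hypothetical.

```python
def search_by_keyword(keyword, topic_keywords, resume_topics, min_weight=0.3):
    """Return (resume_id, weight) pairs for topics that contain the keyword.

    topic_keywords: {topic_id: [top terms]} from an already fitted model.
    resume_topics:  {resume_id: [weight per topic]}, each summing to about 1.
    """
    keyword = keyword.lower()
    matching = [tid for tid, terms in topic_keywords.items() if keyword in terms]
    if not matching:
        return []
    hits = []
    for resume_id, dist in resume_topics.items():
        # Total probability mass the resume assigns to the matching topics.
        weight = sum(dist[tid] for tid in matching)
        if weight >= min_weight:
            hits.append((resume_id, weight))
    return sorted(hits, key=lambda pair: pair[1], reverse=True)


# Hypothetical model output, written by hand for illustration.
topic_keywords = {
    0: ["python", "sql", "machine", "learning"],
    1: ["design", "photoshop", "branding"],
}
resume_topics = {
    "resume_01": [0.9, 0.1],
    "resume_02": [0.2, 0.8],
    "resume_03": [0.5, 0.5],
}

results = search_by_keyword("Python", topic_keywords, resume_topics)
print(results)
```

Resumes are returned in order of how strongly they belong to the topics that mention the keyword.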

NER Functionalities - Search, Summary and Ranking

Search candidate and retrieve the Resume Summary Report

Skills Ranking Score

Search candidate

Results and Findings:

Hybrid entity extraction performed better than any single model. It was the most effective at assigning values to their respective entity labels. The hybrid model can extract named entities from the tested resumes, such as name, email, phone number, skills, designation, company and more. The proposed hybrid strategy achieved a precision of 87.62% and a recall of 96.91% in identifying resume named entities.
We found that the rule-based model performed worse on preprocessed text than on the original text. This is likely because the preprocessing step removed important information.
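
For reference, precision and recall for entity extraction can be computed as in the sketch below. It assumes exact matching of (label, value) pairs against a gold annotation; the README does not say how the reported figures were computed, and the sample sets are made up.

```python
def precision_recall_f1(predicted, gold):
    """Compute precision, recall, and F1 over sets of (label, value) pairs."""
    true_pos = len(predicted & gold)
    precision = true_pos / len(predicted) if predicted else 0.0
    recall = true_pos / len(gold) if gold else 0.0
    total = precision + recall
    f1 = 2 * precision * recall / total if total else 0.0
    return precision, recall, f1


# Made-up annotation and prediction for a single resume.
gold = {
    ("email", "jane.doe@example.com"),
    ("name", "Jane Doe"),
    ("skills", "python"),
}
predicted = gold | {("skills", "excel")}  # one spurious extra entity

p, r, f1 = precision_recall_f1(predicted, gold)
print(f"precision={p:.2f} recall={r:.2f} f1={f1:.2f}")
```

Applying the same F1 formula to the reported precision and recall gives roughly 92%. That figure is derived here and is not stated by the authors.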

Future Work and Limitations:

The lack of a standard resume structure makes it difficult to extract named entities automatically.

Folders and files:

0. Information Extraction
1. Rule-Based Model _ 2. Machine-Learning Model
3. BERT
4. Spacy _ 5. Hybrid (Rule + ML +Spacy)
5. NER summary
README.md