Octograd 2020 - Resume Filtering System

Prateek Gupta — Thu, 21 May 2020 07:31:51 +0000

A smart resume filtering system which shows the best matching resumes according to a given job description.

Link to Code

prateekguptaiiitk / Resume_Filtering

A resume filtering system based on natural language processing

Resume Filtering Using Machine Learning

Resume filtering on the basis of job descriptions (JDs). This was a summer internship project with Skybits Technologies Pvt. Ltd.

Introduction

The main feature of the project is that it searches the entire resume database to select and display the resumes that best fit the provided job description (JD). In its current form, this is achieved by assigning a score to each CV by comparing it against the corresponding job description. This reduces the candidate window to a fraction of the original number of applicants. Resumes in the final window can then be checked manually for further analysis. The project uses machine learning and natural language processing techniques to automate the process.

Directory Structure

├── Data
│   ├── CVs
│   ├── collectCV.py
│   └── jd.csv
├── Model
│   ├── Model_Training.ipynb
│   ├── Sentence_Extraction.ipynb
│   ├── paragraph_extraction_from_posts.ipynb
│   ├── sample_bitcoin.stackexchange_paras.txt
│   ├── sample_bitcoin.stackexchange_sentences.txt

…


Project Introduction

In its current form, the system assigns a score to each CV by comparing it against the corresponding job description. This reduces the candidate window to a fraction of the original number of applicants. Resumes in the final window can then be checked manually for further analysis.

Overview

The project involved three main steps:
A Word2Vec model was trained on the StackOverflow data dump.
Sections such as Education and Experience were extracted from the CVs.
Finally, the CVs were scored against each available job description.
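
The section-extraction and scoring steps above could be sketched roughly as follows. This is only an illustration, not the project's actual code: the post does not say how sections were detected or how scores were computed, so the heading list, the averaging of word vectors, and the use of cosine similarity are all assumptions. The tiny `TOY_VECTORS` table is a made-up stand-in for the trained Word2Vec model, and every function name here is hypothetical.

```python
import math
import re

# Made-up 2-d vectors standing in for a trained Word2Vec model (assumption).
TOY_VECTORS = {
    "python": (0.9, 0.1),
    "java": (0.8, 0.2),
    "developer": (0.7, 0.3),
    "sales": (0.1, 0.9),
    "marketing": (0.2, 0.8),
}

# Hypothetical section headings; real resumes would need a longer list.
SECTION_HEADINGS = ("education", "experience", "skills")


def split_sections(cv_text):
    """Group non-empty resume lines under the most recent recognised heading."""
    sections = {}
    current = "other"
    for line in cv_text.splitlines():
        key = line.strip().lower().rstrip(":")
        if key in SECTION_HEADINGS:
            current = key
        elif line.strip():
            sections.setdefault(current, []).append(line.strip())
    return {name: " ".join(lines) for name, lines in sections.items()}


def embed(text):
    """Average the vectors of known words; return None if no word is known."""
    words = [w for w in re.findall(r"[a-z0-9+#]+", text.lower()) if w in TOY_VECTORS]
    if not words:
        return None
    dims = len(next(iter(TOY_VECTORS.values())))
    return tuple(sum(TOY_VECTORS[w][i] for w in words) / len(words) for i in range(dims))


def cosine(a, b):
    """Cosine similarity of two equal-length vectors (0.0 for a zero vector)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def score_cv(cv_text, jd_text):
    """Score a CV as the best similarity between any of its sections and the JD."""
    jd_vec = embed(jd_text)
    if jd_vec is None:
        return 0.0
    scores = []
    for text in split_sections(cv_text).values():
        vec = embed(text)
        if vec is not None:
            scores.append(cosine(vec, jd_vec))
    return max(scores, default=0.0)


def rank_cvs(cvs, jd_text):
    """Return (name, score) pairs sorted from highest to lowest score."""
    scored = [(name, score_cv(text, jd_text)) for name, text in cvs.items()]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)
```

Calling `rank_cvs({"alice": "Experience:\nPython developer", "bob": "Experience\nSales and marketing"}, "Python developer")` would place the first resume ahead of the second. With a real model, the toy table would be replaced by lookups into the trained word vectors.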

Data Collection

Mainly three datasets were required:

StackExchange Network Posts

This dataset was required to train the word2vec model. Fortunately, the StackExchange network publishes its data dumps in XML format under a Creative Commons license. A download link for the dataset (44 GB) is available on the Internet Archive.
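
Because the dump is large, reading it as a stream rather than loading it whole is one reasonable approach. The sketch below assumes the usual layout of a StackExchange `Posts.xml` file, where each post is a `row` element with an HTML `Body` attribute; the post itself does not describe the file layout, so treat that as an assumption. The function name `iter_post_bodies` and the sample data are hypothetical.

```python
import html
import io
import re
import xml.etree.ElementTree as ET

# Naive HTML tag matcher; good enough for a sketch, not a full HTML parser.
TAG_RE = re.compile(r"<[^>]+>")


def iter_post_bodies(source):
    """Yield plain-text post bodies from a Posts.xml-style file, one row at a time."""
    for _event, elem in ET.iterparse(source, events=("end",)):
        if elem.tag == "row":
            body = elem.get("Body", "")
            # Replace tags with spaces, decode HTML entities, collapse whitespace.
            text = html.unescape(TAG_RE.sub(" ", body))
            text = " ".join(text.split())
            if text:
                yield text
        # Drop the element's attributes and children once it has been handled.
        elem.clear()


# Tiny in-memory sample standing in for the real dump file (made-up content).
SAMPLE = io.BytesIO(
    b'<?xml version="1.0" encoding="utf-8"?>'
    b'<posts>'
    b'<row Id="1" Body="&lt;p&gt;How are blocks mined?&lt;/p&gt;" />'
    b'<row Id="2" Body="" />'
    b'</posts>'
)

for text in iter_post_bodies(SAMPLE):
    print(text)
```

The resulting text could then be split into sentences and tokens before being passed to a Word2Vec trainer; for the real dump, a file path would be passed in place of the in-memory sample.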

Resume Dataset

This dataset was required to test the trained word2vec model. The best-matching resumes are selected from this set. The resumes were downloaded from indeed.com.

Job Description Dataset

This dataset was also required to test the trained word2vec model. These job descriptions are the basis of resume filtering. A Kaggle dataset containing job descriptions for several job openings was used.

Resources Used

spaCy Documentation: https://spacy.io/
spaCy GitHub Issue Page: https://github.com/explosion/spaCy/issues
Gensim Word2Vec Documentation: http://radimrehurek.com/gensim/models/word2vec.html
Gensim Word2Vec GitHub repository: link
Google Word2Vec: https://code.google.com/archive/p/word2vec/
GitHub Repository for Doc2Vec Illustration: https://github.com/linanqiu/word2vec-sentiments

Additional Thoughts

This project was a great learning experience. My learning doesn't stop here; I will be creating and contributing more in the future. Although there is definitely room for improvement, the result is satisfactory enough for the first iteration of the project.

Thank you octograd2020! Cheers🍻

Forem: Prateek Gupta