Considering the advancement of NLP, we wondered whether a machine could play a role in resume screening.

What it does

It takes candidates' resumes as input, extracts important information from the text, and returns the k most promising candidates using machine learning and NLP.

How I built it

The program first sorts resumes into several types based on given indicators, using K-means clustering. We cluster them into three groups: talented candidates, general cases, and candidates who do not meet the mandatory requirements.

The next step is to set up a specific grading scheme for each cluster based on candidates' skills, experience, etc. This is achieved with NLP by importing a language database containing keyword lists that indicate fine-grained signals of elementary skills or experience; from these we gather information to use as features in the loss function.

Finally, the problem reduces to a basic machine learning task: finding the optimal candidates, with the information from each resume serving as training data. We apply logistic regression to select the top k candidates (k is given as the number of accepted candidates) within an error delta, then return them.
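The pipeline above can be sketched end to end in a few lines. This is a minimal illustration only: the keyword list and its weights are invented, the resumes are toy examples, and a weighted keyword score stands in for the trained logistic regression model.

```python
# Hypothetical keyword weights; in the real project these come from
# the imported language database of skill/experience keywords.
KEYWORDS = {"python": 2.0, "java": 1.5, "leadership": 1.0, "internship": 1.0}

def extract_features(text):
    """Count occurrences of each keyword in a resume's text."""
    words = text.lower().split()
    return {kw: words.count(kw) for kw in KEYWORDS}

def score(features):
    """Weighted sum of keyword counts; a stand-in for the learned model."""
    return sum(KEYWORDS[kw] * n for kw, n in features.items())

def top_k(resumes, k):
    """Return the k highest-scoring resumes."""
    ranked = sorted(resumes, key=lambda r: score(extract_features(r)), reverse=True)
    return ranked[:k]

resumes = [
    "Experienced Python developer with leadership internship",
    "Java engineer",
    "No relevant skills listed",
]
print(top_k(resumes, 2))
```

In the real system the clustering step would run before scoring, so that each cluster can use its own grading scheme.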

Challenges I ran into

~Extracting reliable information from different types of resumes.

~Setting up a balanced grading scheme for different candidates.

~Transferring textual information into a dataset.

~Operating fairly by using optimal algorithms to balance the original data and the newly generated data.
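One way to approach the third challenge (turning text into a dataset) is a simple bag-of-words vectorizer. The vocabulary below is a hypothetical stand-in for the keyword lists used in the project.

```python
def vectorize(texts, vocabulary):
    """Map each text to a fixed-length count vector over the vocabulary."""
    vectors = []
    for text in texts:
        words = text.lower().split()
        vectors.append([words.count(term) for term in vocabulary])
    return vectors

vocab = ["python", "sql", "internship"]
rows = vectorize(["Python and SQL internship", "python python"], vocab)
print(rows)  # → [[1, 1, 1], [2, 0, 0]]
```

Each row is then a numeric data point that standard learning algorithms can consume.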

Accomplishments that I'm proud of

There are two main difficulties we overcame in this project: extracting the most useful data from PDF files and building a prediction function from that data; and, given that our newly generated data carry a stable error that cannot easily be handled, clustering resumes before extracting data.

It is a real accomplishment that we can extract text from PDF documents and generate data from the resulting text files. Although the newly generated data still need optimization, they cover the most common notions and keywords in the major software areas, so our training algorithm produces only small errors that can largely be ignored.

Clustering is of vital importance in this project because we consider prediction error from two sides: one from the text, and one from the numbers listed in the resume. The numbers alone do not differentiate candidates, so the model we construct plays the crucial role in the final results. For different types of candidates, the feature weights should not be equal; clustering is therefore unavoidable if we want to reduce error.

What I learned

Preprocessing the data, or even preprocessing the information before we generate data, can greatly reduce redundant work, and hopefully improve the algorithm and reduce its error.
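A typical preprocessing step before feature extraction looks like the sketch below: lowercase the text, strip punctuation, and drop stopwords. The stopword list here is a short illustrative sample, not the full list a real system would use.

```python
import re

STOPWORDS = {"the", "a", "an", "and", "of"}  # illustrative sample only

def preprocess(text):
    """Lowercase, strip punctuation, and drop stopwords."""
    tokens = re.findall(r"[a-z0-9]+", text.lower())
    return [t for t in tokens if t not in STOPWORDS]

print(preprocess("Lead developer of the Payments team (2019)"))
# → ['lead', 'developer', 'payments', 'team', '2019']
```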

What's next for Clarify resumes using machine learning

The main aspect that needs improvement is generating more reliable data from the information extracted from the PDFs: enabling the computer to trace it, adding more indicators and keywords, and updating the weights or functions that relate the data to the given grading scheme. The computer has to process language information the way it processes ordinary numeric information, which requires better language models and algorithms. In addition, the preprocessing of resumes can be improved by adding a specific splitting indicator rather than selecting cluster means at random. This improvement may only reduce error by a limited amount; however, that limited error can be amplified if a cluster is matched to the wrong grading scheme. So the clustering algorithm is also of vital importance and must be improved.
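The idea of replacing randomly selected cluster means with deliberate split points resembles farthest-point (k-means++-style) seeding. A one-dimensional sketch, with the candidate scores invented for illustration:

```python
def farthest_point_seeds(points, k):
    """Pick initial cluster means by repeatedly taking the point
    farthest from the seeds chosen so far (a greedy k-means++-like
    heuristic), instead of choosing means at random."""
    seeds = [min(points)]  # deterministic start instead of a random pick
    while len(seeds) < k:
        seeds.append(max(points, key=lambda p: min(abs(p - s) for s in seeds)))
    return seeds

scores = [1.0, 1.2, 5.0, 5.1, 9.8, 10.0]
print(farthest_point_seeds(scores, 3))  # → [1.0, 10.0, 5.1]
```

Spreading the initial means across the score range makes it less likely that a whole group of candidates ends up matched to the wrong grading scheme.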
