"10.8 million and counting: Take a look at how many jobs Covid-19 has wiped out."
The above article is one of many. Covid-19 has impacted our lives greatly, and it has hit the income of many people even harder, as if things were not already difficult. As an engineering student currently in my 2nd year, I will be sitting for internships soon, so it would be great to know which skills are trending in the tech industry, to boost my chances of getting a good internship and eventually a good job. In this project, I predict salaries based on the skills, company names, requirements and company ratings posted on Indeed (a popular job search site), which I scraped using the BeautifulSoup library.
Project is created with:
- Jupyter notebook version: 6.1.3
- Spyder version: 4.2.1
- Python version: 3.8
First, there are a lot of missing values, especially in the target variable.
Let's look at the salary distribution.
Clearly, the salary distribution is not uniform: most annual salaries are below Rs.1000000, with a few high salaries.
Let's look at the income categories the salaries fall into.
Most annual incomes are in the range of 1 to 5 lpa (lakhs per annum).
As observed from the income category distribution and the avg_annual_sal distribution, the salary distribution is heavily skewed: most people are paid near the average, which is pretty low, and only a few people get really high salaries.
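One quick way to check this kind of skew is to bucket salaries into income categories and compare the mean with the median. The sketch below uses only the standard library; the salary values, the bucket boundaries and the `income_category` helper are illustrative assumptions, not the project's scraped data or its actual binning.

```python
from collections import Counter
from statistics import mean, median

# Hypothetical annual salaries in rupees (illustrative, not the scraped data)
salaries = [180_000, 240_000, 300_000, 350_000, 420_000, 480_000, 900_000, 5_000_000]


def income_category(annual_salary):
    """Map an annual salary in rupees to a coarse lakhs-per-annum (lpa) bucket."""
    lpa = annual_salary / 100_000  # 1 lakh = 100,000 rupees
    if lpa < 1:
        return "below 1 lpa"
    if lpa < 5:
        return "1 to 5 lpa"
    if lpa < 10:
        return "5 to 10 lpa"
    return "10 lpa and above"


category_counts = Counter(income_category(s) for s in salaries)
print(category_counts.most_common())

# In a right-skewed distribution a few large values pull the mean above the median
print("mean:", mean(salaries), "median:", median(salaries))
```

When the mean sits well above the median, as it does here, a handful of very high salaries is stretching the right tail.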
Let's look at the correlation between some of these variables.
Some variables are noticeably correlated with average salary, while others are only weakly related to it.
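For reference, the correlation between two numeric columns can be computed as a Pearson coefficient. This is a minimal standard-library sketch; the `ratings` and `avg_annual_sal` values are made up for illustration, and the project itself may well have used a library routine instead.

```python
from math import sqrt


def pearson_r(xs, ys):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / sqrt(var_x * var_y)


# Hypothetical company ratings and average annual salaries (illustrative only)
ratings = [3.1, 3.5, 3.8, 4.0, 4.4]
avg_annual_sal = [250_000, 300_000, 280_000, 420_000, 500_000]

r = pearson_r(ratings, avg_annual_sal)
print(f"correlation between rating and salary: {r:.2f}")
```

Values near +1 or -1 indicate a strong linear relationship, while values near 0 indicate a weak one.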
Let's look at average salary vs ratings.
Higher rated companies generally pay more, with a few exceptions. (Most of the higher rated companies have not stated the salary on offer beforehand, which could be one reason.)
Let's look at average salary vs experience.
Companies tend to pay more to more experienced employees.
We have talked about how various factors relate to annual salary.
Let's now look at the skills most often mentioned in the requirements section by recruiting companies.
From the word cloud we can see some of the trending skills in the software industry.
Looks like most jobs are for front end, the most popular framework is .NET, and the most asked-for programming languages are Python, PHP and Java.
Plot of how frequently each skill occurs in the requirements column.
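The skill frequencies behind such a plot can be obtained by counting how many requirement texts mention each skill. The sketch below is one possible approach, not necessarily the one used in the notebook; the `SKILLS` list, the sample `requirements` strings and the `count_skills` helper are hypothetical.

```python
import re
from collections import Counter

# Hypothetical skill vocabulary and requirement texts (illustrative only)
SKILLS = ["python", "java", "php", ".net", "javascript", "sql"]

requirements = [
    "Experience with Python and SQL; knowledge of .NET is a plus",
    "Java, JavaScript and PHP developer needed",
    "Strong Python skills, some Java",
]


def count_skills(texts, skills):
    """Count the number of texts in which each skill appears as a whole token."""
    counts = Counter()
    for text in texts:
        lowered = text.lower()
        for skill in skills:
            # Lookarounds are used instead of \b so that ".net" can match and
            # "java" is not counted when it only appears inside "javascript".
            pattern = r"(?<![a-z0-9])" + re.escape(skill) + r"(?![a-z0-9])"
            if re.search(pattern, lowered):
                counts[skill] += 1
    return counts


skill_counts = count_skills(requirements, SKILLS)
print(skill_counts.most_common())
```

The resulting `Counter` can be passed straight to a bar plot or a word cloud generator.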
Now let's look at average salary with respect to job role.
(i) Most salaries are below Rs.50000
(ii) The highest salary offered is Rs.5285450, by Jobsrefer
(iii) One company even pays an annual salary of just Rs.6500!
Let's look at the states with the most job openings at the time the data was collected.
Most job openings are in Delhi, followed by Karnataka.
Let's now look at the top 10 companies offering the highest salaries with respect to seniority.
As expected, companies offer high salaries to senior employees.
Looks like most of the missing job_titles for the above companies are probably senior roles.
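I first started with basic regularized regression models like Lasso. Note that the data has outliers, and while Lasso's L1 penalty shrinks uninformative coefficients, it still minimizes squared error, so it is not by itself robust to outliers in the target.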
I also used more powerful models such as Random Forest, ExtraTrees, Gradient Boosted Trees and XGBoost, since the problem is complex but the available data is small (784 training and 100 test examples).
I also created a blender of the best models, to squeeze a bit more performance out of them.
For stacking, RandomForest, XGBRegressor, ExtraTrees, GradientBoosting and VotingRegressor were used, as they performed the best.
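The stacking idea described above can be sketched with scikit-learn's `StackingRegressor`. This is a minimal illustration under stated assumptions, not the notebook's actual configuration: it runs on synthetic data from `make_regression`, leaves out XGBRegressor and VotingRegressor to avoid extra dependencies, and the `Ridge` final estimator and all hyperparameters are arbitrary choices.

```python
from sklearn.datasets import make_regression
from sklearn.ensemble import (
    ExtraTreesRegressor,
    GradientBoostingRegressor,
    RandomForestRegressor,
    StackingRegressor,
)
from sklearn.linear_model import Ridge
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import train_test_split

# Synthetic stand-in for the scraped job data (assumption, not the real dataset)
X, y = make_regression(n_samples=300, n_features=10, noise=10.0, random_state=42)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

# Base models whose out-of-fold predictions become the blender's inputs
base_models = [
    ("rf", RandomForestRegressor(n_estimators=50, random_state=42)),
    ("et", ExtraTreesRegressor(n_estimators=50, random_state=42)),
    ("gb", GradientBoostingRegressor(random_state=42)),
]

# The final estimator (the "blender") learns how to combine the base predictions
stack = StackingRegressor(estimators=base_models, final_estimator=Ridge(), cv=5)
stack.fit(X_train, y_train)

test_predictions = stack.predict(X_test)
test_mse = mean_squared_error(y_test, test_predictions)
print(f"stacking test MSE: {test_mse:.2f}")
```

Using cross-validated predictions (`cv=5`) to train the blender keeps it from simply rewarding whichever base model overfits the training set most.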
To run the notebook, unzip the all_models zip into the folder all_trained_models.
As the dataset was quite small, RandomForest was used to generate feature importances, to get an idea of how useful our variables are in predicting the target values.
The following plot shows the top 10 most useful features according to RandomForest.
The features are quite weakly related to the target values.
Let's now look at the performance of the various models (complexity increases down the list).
| Sno. | Model | Mean Squared Error |
|------|-------|--------------------|
Note: The MSE of every model is computed on the test set.
For the Stacking Ensemble:
The 95% confidence interval for our predictions: [130003.6634, 302895.4814]
The R2 score of the Stacking Ensemble model's predictions is 0.7342, so the model explains about three-quarters of the observed variation, which is great. The model could give much better predictions if fed more data.
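The source does not say how the interval above was obtained. One common approach, shown here purely as an assumption, is to build a confidence interval around the mean squared test error and take square roots, which gives an interval for the RMSE in rupees. The sketch uses a normal approximation from the standard library; the `rmse_confidence_interval` helper and the sample values are hypothetical.

```python
from math import sqrt
from statistics import NormalDist, mean, stdev


def rmse_confidence_interval(y_true, y_pred, confidence=0.95):
    """Approximate confidence interval for the RMSE (normal approximation)."""
    squared_errors = [(t - p) ** 2 for t, p in zip(y_true, y_pred)]
    n = len(squared_errors)
    center = mean(squared_errors)
    std_err = stdev(squared_errors) / sqrt(n)  # standard error of the mean
    z = NormalDist().inv_cdf(0.5 + confidence / 2)  # about 1.96 for 95%
    lower = max(center - z * std_err, 0.0)  # a squared error cannot be negative
    upper = center + z * std_err
    return sqrt(lower), sqrt(upper)


# Hypothetical true and predicted annual salaries (illustrative only)
y_true = [300_000, 450_000, 250_000, 800_000, 520_000, 610_000]
y_pred = [340_000, 400_000, 310_000, 650_000, 500_000, 700_000]

low, high = rmse_confidence_interval(y_true, y_pred)
rmse = sqrt(mean((t - p) ** 2 for t, p in zip(y_true, y_pred)))
print(f"RMSE: {rmse:.0f}, 95% interval: [{low:.0f}, {high:.0f}]")
```

With a test set as small as 100 examples, a Student's t interval would be slightly wider than this normal approximation.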
Also, our tuned RandomForest performs slightly better than our ensemble here:
The 95% confidence interval: [113283.578, 288145.465]
R2 score: 0.75
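For readers unfamiliar with the two metrics reported here, the following sketch shows how MSE and the R2 score relate. The `mse` and `r_squared` helpers and the sample values are illustrative and written from the standard definitions, not taken from the notebook.

```python
def mse(y_true, y_pred):
    """Mean squared error between true and predicted values."""
    return sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true)


def r_squared(y_true, y_pred):
    """Share of the variance in y_true explained by the predictions."""
    mean_true = sum(y_true) / len(y_true)
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))  # residual sum of squares
    ss_tot = sum((t - mean_true) ** 2 for t in y_true)  # total sum of squares
    return 1 - ss_res / ss_tot


# Hypothetical true and predicted annual salaries (illustrative only)
y_true = [300_000, 450_000, 250_000, 800_000, 520_000, 610_000]
y_pred = [340_000, 400_000, 310_000, 650_000, 500_000, 700_000]

print(f"MSE: {mse(y_true, y_pred):.0f}")
print(f"R2: {r_squared(y_true, y_pred):.3f}")
```

An R2 of 1 means perfect predictions, while an R2 of 0 means the model does no better than always predicting the mean salary.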