Required Skills to Become a Data Scientist

Musili Adebayo
Published in The Startup
6 min read · Feb 2, 2021

A web scraping study of Monsterindia.com data science job posts using Selenium.

I carried out this research to enlighten aspiring data scientists. It was inspired by @shareefshaik1375's KDnuggets post.

As the world becomes more data-driven, the demand for data-driven solutions continues to increase, hence the growing transition into data science fields. Aspiring data scientists often find themselves asking the following questions.

  1. What years of experience are employers looking for?
  2. What job roles are in demand among employers?
  3. What are the required skills to become a data scientist?

I found myself asking the same questions when I decided to pursue a career as a data scientist. The best advice an anonymous individual gave me on a faceless forum was that “if I want to get my dream job I should always look at the job specifications and build myself up to the point I can tick them off”.

All of this led to this educational web scraping study of over 450 data scientist job posts on Monsterindia.com (now foundit.in), a popular job portal in India.

We are going to extract the job roles, companies, years of experience, locations and skills from this job portal.

So, let’s begin:

First, we will scrape monsterindia.com (now foundit.in) for all data scientist job posts with Selenium.

# loading the important libraries

from time import sleep
from random import randint
from selenium import webdriver
from selenium.webdriver.chrome.service import Service
from selenium.webdriver.common.by import By
from selenium.common.exceptions import NoSuchElementException
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
# Jupyter notebook magic; remove this line when running as a plain script
%matplotlib inline

# raw string, so the backslashes in the Windows path are not read as escapes
PATH = r"C:\Program Files (x86)\chromedriver.exe"
# Selenium 4 style; on Selenium 3 use webdriver.Chrome(PATH) instead
driver = webdriver.Chrome(service=Service(PATH))

# loading progress bar
from tqdm import tqdm

# initiating data storage
roles = []
company = []
experience = []
location = []
skills = []

# we have 25 jobs listed on a page
# iterating through 20 webpages
pages = np.arange(1, 501, 25)
for page in tqdm(pages):
    url = "https://www.monsterindia.com/srp/results?start=" + str(page) + "&sort=1&limit=25&query=data%20scientist&searchId=0a018204-fd48-42d4-ac6f-938605fe4106"
    driver.get(url)
    joblist = driver.find_elements(By.CSS_SELECTOR, ".card-apply-content")
    # random pause so we do not hammer the site with requests
    sleep(randint(2, 10))
    # print(page)
    # looping through each listed job
    for job in joblist:
        job_role = job.find_element(By.CSS_SELECTOR, ".medium").text
        comp_name = job.find_element(By.CSS_SELECTOR, ".company-name").text
        years_of_exp = job.find_element(By.CSS_SELECTOR, ".exp.col-xxs-12.col-sm-3.text-ellipsis").text
        loc = job.find_element(By.CSS_SELECTOR, ".col-xxs-12.col-sm-5.text-ellipsis").text

        # Skill elements may or may not be present in a job post.
        # To avoid a NoSuchElementException we catch it and store an empty
        # string, so a post never inherits the previous post's skills.
        try:
            skill = job.find_element(By.CSS_SELECTOR, "p.descrip-skills").text
        except NoSuchElementException:
            skill = ""

        roles.append(job_role)
        company.append(comp_name)
        experience.append(years_of_exp)
        location.append(loc)
        skills.append(skill)

# close the browser once scraping is done
driver.quit()

# storing in a pandas dataframe
ds_jobs = pd.DataFrame({
    "roles": roles,
    "companies": company,
    "experience": experience,
    "locations": location,
    "skills": skills})

# carrying out a little preprocessing
ds_jobs["skills"] = ds_jobs["skills"].str.replace("Skills :", " ", regex=False)

Let’s check our saved data

Looking at the tail of our data

Let’s begin some data preprocessing:

Checking for missing data:
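A minimal sketch of this check, assuming the scraped data sits in a pandas data frame shaped like ds_jobs. The three-row sample_jobs frame and its company names are made up for illustration and are not the scraped data.

```python
import pandas as pd

# Hypothetical three-row sample standing in for the scraped ds_jobs frame.
sample_jobs = pd.DataFrame({
    "roles": ["Data Scientist", "Senior Data Scientist", "Data Scientist"],
    "companies": ["Company A", "Company B", "Company C"],
    "experience": ["2-5 Years", "5-8 Years", "0-2 Years"],
    "locations": ["Bengaluru", "Mumbai, Pune", "Hyderabad"],
    "skills": ["python, sql", "machine learning, nlp", ""],
})

# Count missing (NaN/None) values in each column.
missing_per_column = sample_jobs.isnull().sum()
print(missing_per_column)

# Empty strings are not counted as missing, so check them separately.
empty_skills = (sample_jobs["skills"].str.strip() == "").sum()
print("Empty skills entries:", empty_skills)
```

The second check is worth keeping in mind because a post scraped without a skills element would show up as an empty string rather than as a missing value.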

We can see that there is no missing data in our data frame.

Checking for duplicates in our data

A total of 16 duplicated rows were found. I was a little skeptical about removing them, since it is common for companies to reuse the same job specification when advertising a data scientist role. But I prefer to play it safe, so we are dropping the duplicates.

Let’s drop our duplicated values:
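One way to count and drop duplicated rows with pandas is sketched here. The sample_jobs frame is a made-up stand-in for ds_jobs with one deliberately repeated row.

```python
import pandas as pd

# Hypothetical sample in which row 2 is an exact copy of row 0.
sample_jobs = pd.DataFrame({
    "roles": ["Data Scientist", "Senior Data Scientist", "Data Scientist"],
    "companies": ["Company A", "Company B", "Company A"],
    "experience": ["2-5 Years", "5-8 Years", "2-5 Years"],
    "locations": ["Bengaluru", "Mumbai", "Bengaluru"],
    "skills": ["python, sql", "machine learning", "python, sql"],
})

# duplicated() marks every repeat of an earlier row as True.
n_duplicates = sample_jobs.duplicated().sum()
print("Duplicated rows:", n_duplicates)

# Keep the first occurrence of each row and rebuild a clean 0..n-1 index.
deduped_jobs = sample_jobs.drop_duplicates().reset_index(drop=True)
print(deduped_jobs.shape)
```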

Moving forward, we want to split the locations and skills into tokens, because each entry can hold more than one value. We also convert all text in our data frame to lowercase to avoid redundancy.
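A rough sketch of the splitting and lowercasing step follows. It assumes that multiple locations and skills are separated by commas, which may not match the portal's actual formatting. The helper split_and_lower and the sample rows are hypothetical.

```python
import pandas as pd

# Hypothetical sample; the comma separator is an assumption.
sample_jobs = pd.DataFrame({
    "roles": ["Data Scientist", "Senior Data Scientist"],
    "locations": ["Bengaluru, Mumbai", "Hyderabad"],
    "skills": ["Python, Machine Learning , SQL", "NLP,python"],
})

def split_and_lower(cell, sep=","):
    """Lowercase a delimited string and split it into clean tokens."""
    return [token.strip() for token in cell.lower().split(sep) if token.strip()]

tokenized_jobs = sample_jobs.copy()
# Single-valued text columns only need lowercasing.
tokenized_jobs["roles"] = tokenized_jobs["roles"].str.lower()
# Multi-valued columns become lists of lowercase tokens.
for column in ["locations", "skills"]:
    tokenized_jobs[column] = tokenized_jobs[column].apply(split_and_lower)

print(tokenized_jobs)
```

Lowercasing before counting means that "Python" and "python" end up as the same skill.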

Exploratory Data Analysis

Focusing on our research, let’s begin to answer our research questions.

  1. What years of experience are employers looking for?

Required Years of Experience

  • Unspecified years: we can assume that employers are open to data scientists at every level of experience (fresh graduates, mid-level and experienced), as long as you can show your competencies and demonstrate your skills on the job. That said, the highest number of years of experience requested is 11 years.
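To turn the experience text into numbers, something along these lines could work. The formats "2-5 Years" and "Not Specified" are assumptions about how the portal writes this field, and parse_experience is a hypothetical helper.

```python
import re

def parse_experience(text):
    """Return (min_years, max_years) from text like '2-5 Years', else (None, None)."""
    numbers = [int(n) for n in re.findall(r"\d+", text)]
    if not numbers:
        return (None, None)
    return (min(numbers), max(numbers))

# Hypothetical examples of the experience field.
sample_experience = ["2-5 Years", "Not Specified", "8-11 Years", "3 Years"]
parsed = [parse_experience(text) for text in sample_experience]

# Posts that give no number at all count as unspecified.
unspecified = sum(1 for low, high in parsed if low is None)
# Highest upper bound among the posts that do give a number.
highest = max(high for low, high in parsed if high is not None)

print(parsed)
print("Unspecified:", unspecified, "Highest:", highest)
```

On the real data the same helper could be applied to the experience column with ds_jobs["experience"].apply(parse_experience).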

  2. What job roles are in demand among employers?

The Top 10 Data Science Roles in Demand

  • Employers are primarily looking to fill the data scientist position.
  • At the same time, the role of senior data scientist is also getting a lot of buzz. I believe this role is better suited to an experienced data scientist.
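A ranking like this can be produced with value_counts. The sketch below runs on a handful of made-up role titles rather than the scraped roles column.

```python
import pandas as pd

# Hypothetical role titles with inconsistent case and spacing.
sample_roles = pd.Series([
    "Data Scientist", "data scientist", "Senior Data Scientist",
    "Data Scientist ", "Senior Data Scientist", "Machine Learning Engineer",
])

# Normalise the text, count each title, and keep the ten most frequent.
top_roles = sample_roles.str.lower().str.strip().value_counts().head(10)
print(top_roles)
```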

Finally, to the main reason we carried out our research exercise.

  3. What are the required skills to become a data scientist (based on the expectations of employers)?

First, we are going to unstack the skills column and put it in a data frame
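A possible way to flatten the skills into one row per skill is sketched below. It uses pandas explode, which is a swapped-in technique and not necessarily the unstacking approach used in the original notebook. The sample lists are hypothetical and assume the skills were already split into lowercase tokens.

```python
import pandas as pd

# Hypothetical tokenized skills, one list per job post.
sample_skills = pd.Series([
    ["python", "machine learning", "sql"],
    ["python", "deep learning"],
    ["machine learning", "python"],
])

# explode() gives every skill its own row; value_counts() then tallies them.
skill_counts = (
    sample_skills.explode()
    .value_counts()
    .rename_axis("skill")
    .reset_index(name="job_posts")
)
print(skill_counts)
```

The resulting data frame has one row per distinct skill, sorted from the most to the least requested.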

3.1 The Must-Have Skills

Must Have Skills

  • It is important for an aspiring data scientist to have a deep knowledge and understanding of machine learning and data analysis. It is equally important to understand deep learning, natural language processing (NLP), big data and statistics. These are the core abilities and skills an aspiring data scientist should possess.

3.2 Programming Languages in Demand

Top programming languages for a data science job role

  • Most employers prefer the Python programming language for a data scientist job role. Python has become so popular because of its general-purpose nature, ease of use and short learning curve. Besides, many aspiring data scientists pick it up easily and master it within a short time frame.
  • Structured Query Language (SQL) is the mainstream language for communicating with databases.
  • R and Java are also sought after because of their use in statistical computing and data analysis.
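A category view such as this one can be built by keeping only the skill tokens that belong to a chosen set. The sketch below is an assumption about how that filtering might look. The LANGUAGES set and the sample skill lists are made up for illustration.

```python
from collections import Counter

# Hypothetical tokenized skills, one list per job post.
sample_skill_lists = [
    ["python", "machine learning", "sql"],
    ["python", "r", "statistics"],
    ["java", "python", "spark"],
]

# Category of interest; swap in BI tools, frameworks or cloud providers as needed.
LANGUAGES = {"python", "sql", "r", "java"}

# Count only the tokens that exactly match a name in the category.
language_counts = Counter(
    skill
    for skills in sample_skill_lists
    for skill in skills
    if skill in LANGUAGES
)
print(language_counts.most_common())
```

Matching whole tokens instead of substrings matters here, because a short name like "r" would otherwise match inside words such as "spark".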

3.3 Business Intelligence Tools

The Business intelligence tools for a data science job role.

  • Every company wants to derive insights from its data. Tableau and Power BI are what I call “the new kids on the block”. They help companies derive actionable insights from their data and create interactive dashboards and reports that can lead to positive business decisions and results.

3.4 Top Deep Learning Frameworks

The Top Deep Learning Framework

  • TensorFlow has become the most preferred deep learning framework because of its vast libraries for computing machine learning models and algorithms.

3.5 Popular Cloud Services

Popular Cloud Service Provider

  • Most organizations prefer Amazon Web Services (AWS) as their cloud computing service provider because of its reliability and its popular pay-as-you-go metered pricing.

3.6 Top Big Data Software Provider

The Popular Big Data Technology

  • Most companies rely on Spark and Hadoop to extract, analyze and process massive datasets that traditional data processing software can’t handle.

You can find the link to the full research on my Git

Thank you for reading.

Remember to subscribe and follow me for more insightful articles and you can reach me via my socials below if you want to discuss further.
LinkedIn: Musili Adebayo
Twitter: Musili_Adebayo

P.S. Enjoy this article? Support my work directly.

Musili Adebayo

I am quite the storyteller once a good idea pops into my head. I enjoy talking about the modern data stack and how to leverage it.