Introduction
Crawling the Web
About a year ago, I partnered with a web developer to build a webpage for our group at work. I am not a web developer, so I concentrated on the automated web testing, which was quite fun! The testing was based on a Python-based Selenium web crawler. I scheduled it to run daily: it would launch the page, crawl every corner of it looking for failures, and generate a report. It also let us check whether the page was accessible, and it would even send automated emails when it was not. This was quite an experience. I later modified the tool to collect data from the internet for various projects, one of which is collecting data from a job page.
Let me provide a quick tutorial on running Selenium with Python3. You can find the code in my repository or copy it below.
browser="Chrome"
#browser="FF"
url="https://www.indeed.com/"
if browser=="FF":
from selenium.webdriver.firefox.firefox_binary import FirefoxBinary
driver = webdriver.Firefox( executable_path='C:\Python3_scheduled\geckodriver.exe') # get gecko from https://github.com/mozilla/geckodriver/releases
if browser=="Chrome":
from selenium import webdriver
driver = webdriver.Chrome('C:\Python3_scheduled\chromedriver.exe') # get chrome driver from https://chromedriver.chromium.org/downloads
driver.get(url)
driver.implicitly_wait(10)
What this code does is very simple: it launches an instance of Firefox or Chrome, depending on the selection at the top of the script. Note that you will need to download the driver files (see the links in the code comments) and save them to the folder given by the path in the code so that Python can launch the browser. Depending on the site you are working on, you may need to log in with your credentials. For example, in the case of LinkedIn, the credentials can be submitted to the correct fields with the following code.
import time
driver.find_element_by_name('session_key').send_keys(uname); time.sleep(1)      # username field
driver.find_element_by_name('session_password').send_keys(pwd); time.sleep(1)   # password field
driver.find_elements_by_class_name('login__form_action_container')[0].click()   # sign-in button
time.sleep(3)
It is very important to note that it is never a good idea to include the credentials in plain text in the code. I always keep my credentials encrypted elsewhere and load them from the script. After logging in, one navigates to a URL that includes the search parameters, which returns the list of jobs. It is then a somewhat tedious coding exercise to locate the fields of interest and compile the data. Unfortunately, it is a moving target: as the page design changes, the field identifiers may change and break the code. This means one needs to stay on top of the code so that it keeps working; I needed to update it every other month on average.
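For illustration, here is a minimal sketch of loading credentials that were encrypted earlier with the cryptography package's Fernet recipe; the file names and key handling below are placeholders, not my actual setup.

# Minimal sketch (assumed setup): the username and password were encrypted
# earlier with the same Fernet key and stored outside the script.
from cryptography.fernet import Fernet
with open(r'C:\Python3_scheduled\secret.key', 'rb') as f:
    fernet = Fernet(f.read())
with open(r'C:\Python3_scheduled\credentials.enc', 'rb') as f:
    uname_enc, pwd_enc = f.read().splitlines()
uname = fernet.decrypt(uname_enc).decode()
pwd = fernet.decrypt(pwd_enc).decode()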
I have been persistent in maintaining the code over the last year and have kept it scheduled to run daily. I now have data on 722 JPL job posts, which I will slice and dice below. Let us peek at the (almost) raw data in Tab. 1 below, which is interactive and searchable. The data is illustrated in Fig. 1 on the right, where each circle represents a job listing (hover over the circles to see more).
Three ways to slice
Just to get an overall feel for the data, it is illustrative to group it using three different attributes:
- Function,
- Level,
- Type.
This will still be a high level review of the data, and it will only provide a feel for the content. I will later create a filter to narrow down the data to what I am mostly interested in. We will also extract some useful statistics. Let’s dive in and slice and dice the data.
Functions
The job posts include a field related to the functional group. In the original data there are 39 individual functions, which is a bit too granular for my purposes. I re-mapped them to broader titles.
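The remapping itself boils down to a lookup table. Below is an illustrative Python sketch using pandas; the dictionary entries and sample titles are made up and are only a small slice of the full 39-to-9 mapping.

# Illustrative sketch: collapse granular function titles into broader groups.
# The mapping entries and sample titles are placeholders, not the real table.
import pandas as pd
function_map = {
    'Planetary Science':    'Engineering/Science',
    'Systems Engineering':  'Engineering/Science',
    'Cybersecurity':        'Information Technology',
    'Data Systems':         'Information Technology',
    'Contracts Management': 'Business Administration',
}
jobs = pd.DataFrame({'thefunction': ['Planetary Science', 'Data Systems', 'Cybersecurity']})
jobs['function_group'] = jobs['thefunction'].map(function_map).fillna('Other')
print(jobs['function_group'].value_counts())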
The remapping yields 9 functional groups, as in Tab. 3 and Fig. 2.
It is not surprising to see that the top two function categories are Information Technology (205 jobs) and Engineering/Science (201 jobs). The Engineering/Science jobs are more relevant for my background, but before applying any filters, let’s slice the data in a different way.
Levels
The data can be put into 6 levels, as in Fig. 3a and Fig. 3b.
The job listings are dominated by the Entry level (231 instances).
Types
Finally, we can take a look at the job types: there are 4 types, as in Fig. 4a and 4b. Most of the jobs are Full-Time, with 538 instances.
Keywords
The word cloud on the right shows the words sized by their frequency of appearance in the job postings.
One may be inclined to omit the obvious words, such as “JPL”; however, that would be a mistake. The frequency of the word “JPL” serves as a useful baseline. For example, we see that the word “data” appears almost as frequently as “JPL,” which underscores the importance of data, and presumably data science, to JPL.
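As a sketch of how such frequencies can be tallied (the actual cloud is built from the full set of scraped descriptions; the example texts below are placeholders):

# Sketch: tally word frequencies across job descriptions.
# The two strings below stand in for the scraped posting texts.
import re
from collections import Counter
descriptions = [
    "JPL is seeking a data scientist with Python experience.",
    "Join JPL to build data pipelines for mission operations.",
]
counts = Counter()
for text in descriptions:
    counts.update(re.findall(r'[a-z]+', text.lower()))
print(counts['jpl'], counts['data'])  # compare "data" against the "JPL" baseline
print(counts.most_common(10))         # the most frequent words overall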
Languages
The languages are roughly ordered by the amount of experience I have with them. For the languages/tools listed after ‘Knime,’ my experience is limited; however, I will decide whether to invest time in learning them if they show up frequently. Figures 6a and 6b show the ones from my list that appear frequently in the job posts.
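For illustration, counting how many postings mention each language on a skill list can be done along these lines; the skill list and posting texts below are placeholders, and the whole-word matching is there so that, for example, ‘R’ is not counted inside ‘Regression.’

# Sketch: count how many postings mention each language on a skill list.
# The skill list and posting texts are placeholders for the real data.
import re
languages = ['Python', 'C++', 'Java', 'MATLAB', 'R']
postings = [
    "Experience with Python and C++ required.",
    "Proficiency in MATLAB or Python preferred.",
]
hits = {lang: 0 for lang in languages}
for text in postings:
    for lang in languages:
        # match whole tokens only, so 'R' does not match inside words like 'Regression'
        if re.search(r'(?<![\w+])' + re.escape(lang) + r'(?![\w+])', text):
            hits[lang] += 1
print(hits)  # {'Python': 2, 'C++': 1, 'Java': 0, 'MATLAB': 1, 'R': 0}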
It is clear that Python dominates the languages JPL looks for.
To be continued
This is a post in progress. I have a lot to add: What other qualifications do they look for the most? How about ML-related tools? Do I meet these requirements?
I am also building an ML-based algorithm that will tell me whether a job post is a good fit for me by matching my skills with the requirements.
Stay tuned, and stay safe.