This project involves web scraping Glassdoor reviews for data science job postings to analyze sentiment and explore how review ratings correlate with textual sentiment. The goal is to understand how employees describe their experiences and whether sentiment aligns with the number of stars given.
📖 Read the full article on medium.com.
Since Glassdoor lacks a public API, I used Selenium and BeautifulSoup to extract job listings and reviews. However, scraping was limited: after a small number of requests, Glassdoor started returning a 403 Forbidden error.
Example Scraping Function:
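from urllib.error import HTTPError
from urllib.request import Request, urlopen

from bs4 import BeautifulSoup


def get_page(url, headers):
    """Fetch a page and return it parsed by BeautifulSoup, or None on an HTTP error."""
    try:
        req = Request(url, headers=headers)
        page = urlopen(req)
        return BeautifulSoup(page, "html.parser")
    except HTTPError as e:
        # A 403 Forbidden from Glassdoor ends up here.
        print(f"Error opening page: {e}")
        return None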
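Because the 403 responses appeared after only a few requests, one option is to slow down and retry with growing delays. The sketch below is an assumption rather than part of the original scraper: `fetch_with_backoff`, its parameters, and the choice to retry on 403 and 429 are all hypothetical, and there is no guarantee that waiting is enough to get past Glassdoor's blocking.

```python
import time
from urllib.error import HTTPError
from urllib.request import Request, urlopen


def fetch_with_backoff(url, headers, opener=urlopen, retries=3,
                       base_delay=2.0, sleep=time.sleep):
    """Return the raw response body, retrying on HTTP 403/429 with growing delays.

    `opener` and `sleep` are injectable (hypothetical parameters) so the
    retry logic can be exercised without touching the network.
    """
    for attempt in range(retries + 1):
        try:
            with opener(Request(url, headers=headers)) as response:
                return response.read()
        except HTTPError as e:
            # Re-raise other errors, or give up once the retry budget is spent.
            if e.code not in (403, 429) or attempt == retries:
                raise
            # Wait base_delay, then 2x, then 4x, ... before the next attempt.
            sleep(base_delay * 2 ** attempt)
```

The returned bytes can be passed straight to `BeautifulSoup(body, "html.parser")`, so this could stand in for the `urlopen` call in `get_page`.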
Before running sentiment analysis, I had to preprocess the review text. For NLP projects I am usually a big fan of the NLTK library, but this time I wanted to try out TextBlob.
Example of cleaning reviews:
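from textblob import TextBlob

# Add a clean_review column: TextBlob splits each review into word tokens (dropping punctuation), which are re-joined with spaces
df = df.assign(clean_review=df.reviews.map(lambda x: " ".join(TextBlob(str(x)).words)))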
Once the data is cleaned, I can finally move on to the fun part: visualizations! I created a word cloud from the most frequent words in the reviews.
Top words included good, work, people, great, benefits, culture, balance, pay, management, life, reflecting common themes in workplace discussions.
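The ranking behind a word cloud is essentially a word-frequency count. Here is a minimal sketch of that count using only the standard library; the `top_words` helper, the tiny stop-word set, and the sample reviews are all made up for illustration and are not the code or data used in the project.

```python
from collections import Counter

# Hypothetical, deliberately tiny stop-word list; a real run would use a fuller one.
STOP_WORDS = {"the", "a", "an", "and", "to", "of", "is", "in", "for", "with"}


def top_words(reviews, n=10):
    """Return the n most frequent non-stop-words across an iterable of reviews."""
    counts = Counter(
        word
        for review in reviews
        for word in str(review).lower().split()
        if word not in STOP_WORDS
    )
    return counts.most_common(n)


# Invented reviews, for illustration only.
sample_reviews = [
    "great people and good benefits",
    "good pay good culture",
    "great work life balance",
]
print(top_words(sample_reviews, n=2))  # [('good', 3), ('great', 2)]
```

On the real data, the input would be the `clean_review` column, for example `top_words(df["clean_review"])`.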
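TextBlob measures two things: polarity, from -1 (most negative) to 1 (most positive), and subjectivity, from 0 (objective) to 1 (subjective).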
Here is an example:
sample_review = df['clean_review'].iloc[0]
TextBlob(sample_review).sentiment
Observations:
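VADER provides four scores: neg, neu, and pos (the proportions of the text that read as negative, neutral, and positive) and compound (a normalized overall score from -1 to 1).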
Here is an example:
sid = SentimentIntensityAnalyzer()
sid.polarity_scores(sample_review)
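To compare VADER's output with star ratings, it can help to turn the `compound` value into a coarse label. The sketch below uses the commonly cited cutoffs of 0.05 and -0.05; the `label_from_compound` helper is hypothetical, and whether the project applied these thresholds is an assumption.

```python
def label_from_compound(compound, threshold=0.05):
    """Map a VADER compound score in [-1, 1] to a coarse sentiment label."""
    if compound >= threshold:
        return "positive"
    if compound <= -threshold:
        return "negative"
    return "neutral"
```

For example, `label_from_compound(sid.polarity_scores(sample_review)["compound"])` would label the sample review above.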
While VADER’s compound scores showed a clearer relationship with ratings, inconsistencies still emerged.
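One way to quantify both the relationship and the inconsistencies is to compute a Pearson correlation between ratings and compound scores, then list the reviews where the two disagree. This is only a sketch under assumptions: the helper names, the cutoffs of 4 and 2 stars, and the sample numbers are invented for illustration and are not results from the project.

```python
from math import sqrt


def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length numeric sequences."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / sqrt(var_x * var_y)


def mismatches(ratings, compounds, high=4, low=2):
    """Indices where stars and text disagree.

    A mismatch is a high rating with negative text, or a low rating
    with positive text.
    """
    return [
        i
        for i, (stars, score) in enumerate(zip(ratings, compounds))
        if (stars >= high and score < 0) or (stars <= low and score > 0)
    ]


# Invented values, for illustration only.
ratings = [5, 4, 1, 2, 5]
compounds = [0.8, -0.4, -0.7, 0.3, 0.6]

print(pearson(ratings, compounds))
print(mismatches(ratings, compounds))  # [1, 3]
```

Reading the flagged reviews by hand is usually the quickest way to see whether the mismatch comes from the rating or from the sentiment model.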
While sentiment analysis provides valuable insights, review ratings don’t always align with textual sentiment. This raises questions about how employees rate companies and whether numerical ratings alone reflect job satisfaction.