Week 3 content of Steam Reviewer Analyzer
Understand sentiment scoring
Build a sentiment word dictionary
Calculate basic sentiment scores
Code for this week: https://drive.google.com/file/d/1b8azmiBPdYNMpUQVgzgRtENVDfuUG7yO/view?usp=sharing
Welcome to Week 3! Now that we can preprocess text and analyze word frequencies, we're ready to dive into sentiment analysis.
Sentiment analysis answers the question: Is this review positive, negative, or neutral?
This week, we'll build a simple but effective sentiment analyzer from scratch. You'll understand exactly how sentiment scoring works before we use more advanced tools.
Sentiment analysis (also called opinion mining) is the process of determining whether text expresses a positive, negative, or neutral opinion.
Product reviews: "This phone is amazing!" → Positive
Movie reviews: "Terrible acting and boring plot." → Negative
Social media: "The weather is okay today." → Neutral
Create lists of positive and negative words
Count how many positive/negative words appear in the text
Calculate a sentiment score based on the counts
Classify the text as positive, negative, or neutral
The foundation of sentiment analysis is having lists of words that express positive or negative opinions.
# Positive words
positive_words = [
    'good', 'great', 'excellent', 'amazing', 'wonderful',
    'fantastic', 'awesome', 'love', 'best', 'perfect',
    'beautiful', 'brilliant', 'outstanding', 'superb', 'enjoyable'
]
# Negative words
negative_words = [
    'bad', 'terrible', 'awful', 'horrible', 'worst',
    'hate', 'disappointing', 'poor', 'waste', 'boring',
    'annoying', 'frustrating', 'ugly', 'useless', 'pathetic'
]
print("Positive words:", len(positive_words))
print("Negative words:", len(negative_words))
Why these words?
They clearly express positive or negative opinions
They're common in reviews
They're unambiguous in meaning
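One quick check on hand-built lists is to confirm that no word appears in both of them. The sketch below assumes the words are stored as Python sets rather than lists (the lesson itself uses lists); the names `positive_set`, `negative_set`, and `overlap` are illustrative only.

```python
# Sketch: the same word lists stored as sets (an assumption; the lesson uses lists).
positive_set = {
    'good', 'great', 'excellent', 'amazing', 'wonderful',
    'fantastic', 'awesome', 'love', 'best', 'perfect',
    'beautiful', 'brilliant', 'outstanding', 'superb', 'enjoyable',
}
negative_set = {
    'bad', 'terrible', 'awful', 'horrible', 'worst',
    'hate', 'disappointing', 'poor', 'waste', 'boring',
    'annoying', 'frustrating', 'ugly', 'useless', 'pathetic',
}

# A word in both sets would count as +1 and -1 and cancel itself out.
overlap = positive_set & negative_set
print("Words in both lists:", sorted(overlap))

# Membership tests read the same way as with lists.
print('great' in positive_set)
print('great' in negative_set)
```

Sets also ignore duplicates, which helps once the lists start growing.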
Now let's count how many positive and negative words appear in a review.
from nltk.tokenize import word_tokenize
positive_words = ['good', 'great', 'excellent', 'amazing', 'love']
negative_words = ['bad', 'terrible', 'awful', 'hate', 'worst']
review = "This game is great! I love the graphics. The story is amazing."
# Tokenize and lowercase
tokens = word_tokenize(review.lower())
# Count positive and negative words
positive_count = 0
negative_count = 0
for token in tokens:
    if token in positive_words:
        positive_count += 1
    if token in negative_words:
        negative_count += 1
print("Review:", review)
print("Positive words found:", positive_count)
print("Negative words found:", negative_count)
Output:
Review: This game is great! I love the graphics. The story is amazing.
Positive words found: 3
Negative words found: 0
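When a count looks wrong, it helps to see which words were actually matched. The following is a minimal sketch using `collections.Counter`; to keep it runnable without NLTK data, it swaps `word_tokenize` for a simple regular-expression tokenizer, which is an assumption and not the lesson's method.

```python
import re
from collections import Counter

positive_words = ['good', 'great', 'excellent', 'amazing', 'love']
negative_words = ['bad', 'terrible', 'awful', 'hate', 'worst']

review = "This game is great! I love the graphics. The story is amazing."

# Stand-in tokenizer (assumption): lowercase runs of letters and apostrophes.
tokens = re.findall(r"[a-z']+", review.lower())

# Count each matched word separately instead of keeping only a total.
positive_hits = Counter(t for t in tokens if t in positive_words)
negative_hits = Counter(t for t in tokens if t in negative_words)

print("Positive matches:", dict(positive_hits))
print("Negative matches:", dict(negative_hits))
```

The totals used earlier are still available as `sum(positive_hits.values())` and `sum(negative_hits.values())`.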
We can calculate a simple sentiment score by subtracting negative counts from positive counts.
positive_count = 3
negative_count = 0
# Calculate sentiment score
sentiment_score = positive_count - negative_count
print("Sentiment score:", sentiment_score)
# Classify the sentiment
if sentiment_score > 0:
    classification = "Positive"
elif sentiment_score < 0:
    classification = "Negative"
else:
    classification = "Neutral"
print("Classification:", classification)
Output:
Sentiment score: 3
Classification: Positive
Positive number = More positive words than negative → Positive sentiment
Negative number = More negative words than positive → Negative sentiment
Zero = Equal positive and negative (or none) → Neutral sentiment
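The raw score grows with review length: a long review with 6 positive and 4 negative words scores +2, the same as a short review with only 2 positive words. One possible variation, not part of this week's analyzer, is to divide the score by the number of sentiment words found. The helper name `normalized_score` is hypothetical.

```python
def normalized_score(positive_count, negative_count):
    """Hypothetical helper: scale the raw score to the range -1.0 to 1.0."""
    total = positive_count + negative_count
    if total == 0:
        return 0.0  # no sentiment words found, so treat the text as neutral
    return (positive_count - negative_count) / total

print(normalized_score(3, 0))  # all matches positive
print(normalized_score(6, 4))  # mixed review
print(normalized_score(0, 0))  # no sentiment words at all
```

The sign still gives the classification; the size shows how one-sided the matched words were.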
Let's combine everything into one reusable function.
from nltk.tokenize import word_tokenize
def analyze_sentiment(text):
    """Analyze sentiment of text using word counting"""
    # Define sentiment word lists
    positive_words = [
        'good', 'great', 'excellent', 'amazing', 'wonderful',
        'fantastic', 'awesome', 'love', 'best', 'perfect',
        'beautiful', 'brilliant', 'outstanding', 'superb', 'enjoyable'
    ]
    negative_words = [
        'bad', 'terrible', 'awful', 'horrible', 'worst',
        'hate', 'disappointing', 'poor', 'waste', 'boring',
        'annoying', 'frustrating', 'ugly', 'useless', 'pathetic'
    ]
    # Tokenize and lowercase
    tokens = word_tokenize(text.lower())
    # Count sentiment words
    positive_count = sum(1 for token in tokens if token in positive_words)
    negative_count = sum(1 for token in tokens if token in negative_words)
    # Calculate score
    sentiment_score = positive_count - negative_count
    # Classify
    if sentiment_score > 0:
        classification = "Positive"
    elif sentiment_score < 0:
        classification = "Negative"
    else:
        classification = "Neutral"
    return {
        'score': sentiment_score,
        'classification': classification,
        'positive_words': positive_count,
        'negative_words': negative_count
    }
# Test the function
review1 = "This game is amazing! I love it."
review2 = "Terrible game. Waste of money."
review3 = "The game is okay."
print("Review 1:", review1)
print("Analysis:", analyze_sentiment(review1))
print()
print("Review 2:", review2)
print("Analysis:", analyze_sentiment(review2))
print()
print("Review 3:", review3)
print("Analysis:", analyze_sentiment(review3))
Output:
Review 1: This game is amazing! I love it.
Analysis: {'score': 2, 'classification': 'Positive', 'positive_words': 2, 'negative_words': 0}
Review 2: Terrible game. Waste of money.
Analysis: {'score': -2, 'classification': 'Negative', 'positive_words': 0, 'negative_words': 2}
Review 3: The game is okay.
Analysis: {'score': 0, 'classification': 'Neutral', 'positive_words': 0, 'negative_words': 0}
Let's test our sentiment analyzer on real movie reviews from NLTK.
from nltk.corpus import movie_reviews
import random
# Get a random positive review
pos_fileid = random.choice(movie_reviews.fileids('pos'))
pos_text = movie_reviews.raw(pos_fileid)
# Get a random negative review
neg_fileid = random.choice(movie_reviews.fileids('neg'))
neg_text = movie_reviews.raw(neg_fileid)
# Analyze both
print("=== POSITIVE REVIEW ===")
print("First 200 characters:", pos_text[:200])
print("Our analysis:", analyze_sentiment(pos_text))
print("Actual label: Positive")
print()
print("=== NEGATIVE REVIEW ===")
print("First 200 characters:", neg_text[:200])
print("Our analysis:", analyze_sentiment(neg_text))
print("Actual label: Negative")
from nltk.corpus import movie_reviews
# Test on first 100 positive reviews
correct = 0
total = 0
for fileid in movie_reviews.fileids('pos')[:100]:
    text = movie_reviews.raw(fileid)
    result = analyze_sentiment(text)
    if result['classification'] == 'Positive':
        correct += 1
    total += 1
accuracy_positive = (correct / total) * 100
print(f"Accuracy on positive reviews: {accuracy_positive:.1f}%")
# Test on first 100 negative reviews
correct = 0
total = 0
for fileid in movie_reviews.fileids('neg')[:100]:
    text = movie_reviews.raw(fileid)
    result = analyze_sentiment(text)
    if result['classification'] == 'Negative':
        correct += 1
    total += 1
accuracy_negative = (correct / total) * 100
print(f"Accuracy on negative reviews: {accuracy_negative:.1f}%")
overall_accuracy = (accuracy_positive + accuracy_negative) / 2
print(f"Overall accuracy: {overall_accuracy:.1f}%")
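The two loops above repeat the same steps, so they could be folded into one helper. The sketch below is one way to do that. The `accuracy` helper, the compact `classify` stand-in, and the sample reviews are all made up for illustration; the sample reviews are not taken from the NLTK corpus, and a regular expression replaces `word_tokenize` so the block runs without NLTK.

```python
import re

POSITIVE = {'good', 'great', 'excellent', 'amazing', 'love'}
NEGATIVE = {'bad', 'terrible', 'awful', 'hate', 'worst'}

def classify(text):
    """Compact stand-in for analyze_sentiment(text)['classification']."""
    tokens = re.findall(r"[a-z']+", text.lower())
    score = sum(t in POSITIVE for t in tokens) - sum(t in NEGATIVE for t in tokens)
    if score > 0:
        return "Positive"
    if score < 0:
        return "Negative"
    return "Neutral"

def accuracy(texts, expected_label, classify_fn):
    """Hypothetical helper: percentage of texts classified as expected_label."""
    if not texts:
        return 0.0  # avoid dividing by zero on an empty list
    correct = sum(1 for text in texts if classify_fn(text) == expected_label)
    return correct / len(texts) * 100

# Made-up sample reviews for illustration only.
sample_positive = ["I love this game.", "Great story, amazing music.", "It was fine."]
result = accuracy(sample_positive, "Positive", classify)
print(f"Accuracy on sample positives: {result:.1f}%")
```

With the corpus loaded, the same helper could be called with the raw text of `movie_reviews.fileids('pos')[:100]` and a small function that returns `analyze_sentiment(text)['classification']`.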
Our basic word list might miss some words. Let's expand it with gaming-specific terms.
# Gaming-specific positive words
gaming_positive = [
    'addictive', 'immersive', 'engaging', 'polished', 'smooth',
    'fun', 'exciting', 'thrilling', 'impressive', 'stunning'
]
# Gaming-specific negative words
gaming_negative = [
    'buggy', 'glitchy', 'broken', 'laggy', 'repetitive',
    'clunky', 'unfinished', 'unplayable', 'crashes', 'boring'
]
# Combine with original lists
all_positive = positive_words + gaming_positive
all_negative = negative_words + gaming_negative
print("Total positive words:", len(all_positive))
print("Total negative words:", len(all_negative))
Why add domain-specific words?
Different domains use different vocabulary
Game reviews mention "buggy" and "laggy" - movie reviews don't
More relevant words = better accuracy
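Note that `analyze_sentiment` as written defines its word lists inside the function, so the combined lists are not used by it automatically. One way to try them out is a variant that takes the lists as arguments. This is a sketch under assumptions: `score_with_lists` is a hypothetical name, the sample review is invented, the lists are shortened, and a regular expression stands in for `word_tokenize`.

```python
import re

def score_with_lists(text, positive, negative):
    """Hypothetical variant of the scorer that takes the word lists as arguments."""
    tokens = re.findall(r"[a-z']+", text.lower())  # stand-in for word_tokenize
    positive_count = sum(1 for t in tokens if t in positive)
    negative_count = sum(1 for t in tokens if t in negative)
    return positive_count - negative_count

# Shortened versions of the lesson's lists, to keep the example small.
base_positive = ['good', 'great', 'excellent', 'amazing', 'love']
base_negative = ['bad', 'terrible', 'awful', 'hate', 'worst']
gaming_positive = ['addictive', 'immersive', 'engaging', 'polished', 'smooth']
gaming_negative = ['buggy', 'glitchy', 'broken', 'laggy', 'repetitive']

sample = "The port is laggy and glitchy, a broken release."  # invented review

base_score = score_with_lists(sample, base_positive, base_negative)
expanded_score = score_with_lists(
    sample, base_positive + gaming_positive, base_negative + gaming_negative
)
print("Base lists only:", base_score)
print("With gaming words:", expanded_score)
```

The base lists see nothing in this review and return 0 (neutral), while the expanded lists catch three negative gaming terms and return -3.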
Add 5 more positive words and 5 more negative words to the sentiment lists
Test your expanded analyzer on these reviews:
"The gameplay is smooth and the graphics are stunning!"
"Buggy mess. The game crashes constantly."
"It's okay, nothing special."
Calculate the sentiment score for each
# Your expanded word lists
my_positive_words = ['good', 'great', 'excellent'] # Add 5 more
my_negative_words = ['bad', 'terrible', 'awful'] # Add 5 more
# Test reviews
test1 = "The gameplay is smooth and the graphics are stunning!"
test2 = "Buggy mess. The game crashes constantly."
test3 = "It's okay, nothing special."
# Your task: Analyze each review with your expanded word lists
Sentiment analysis classifies text as positive, negative, or neutral
Word counting is a simple but effective approach
Sentiment dictionaries contain lists of positive and negative words
Domain-specific words improve accuracy for specific topics
Even simple methods can achieve reasonable accuracy (50-70%)
Our basic sentiment analyzer has some limitations:
Doesn't understand context: "not good" is counted as positive
Ignores word importance: "amazing" and "good" count the same
Misses sarcasm: "Oh great, another bug" seems positive
Limited vocabulary: Only knows words in our lists
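The first and third limitations are easy to reproduce. The sketch below uses a compact scorer (`simple_score` is a hypothetical name, and a regular expression stands in for `word_tokenize`) to show that both example phrases come out with a positive score.

```python
import re

positive_words = ['good', 'great', 'excellent', 'amazing', 'love']
negative_words = ['bad', 'terrible', 'awful', 'hate', 'worst']

def simple_score(text):
    """Positive matches minus negative matches, ignoring all context."""
    tokens = re.findall(r"[a-z']+", text.lower())
    positives = sum(1 for t in tokens if t in positive_words)
    negatives = sum(1 for t in tokens if t in negative_words)
    return positives - negatives

# "not" and the sarcastic tone are invisible to a word counter.
for text in ["This game is not good.", "Oh great, another bug."]:
    print(text, "->", simple_score(text))
```

Both lines print a score of 1, so the analyzer would label each of these complaints as Positive.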
Next week, we'll address these issues by using VADER sentiment analysis - a more sophisticated tool that handles negations, emphasis, and context!
In Week 4, we'll learn:
VADER sentiment analysis (advanced tool)
Understanding compound scores
Handling negations ("not good")
Dealing with emphasis ("AMAZING!!!")
Comparing our basic analyzer with VADER
You've built your first sentiment analyzer from scratch! Now you understand the fundamentals.