Steam Review Analyzer - Week 2
This week you will:
Analyze word frequencies
Calculate text statistics
Explore the movie reviews dataset
Last week, we learned how to clean and preprocess text data. Now we'll learn how to analyze that cleaned text to extract meaningful insights. This week focuses on understanding what the text contains before we move into sentiment analysis.
Think of it like being a detective - we need to examine the evidence (words) before we can draw conclusions (sentiment).
Find the full code for this week here: https://drive.google.com/file/d/1hRTZc2JcNnmqTcNmfKCUixSlAObdHXIE/view?usp=sharing
Here's what we'll cover:
Word Frequency Analysis - Which words appear most often?
Text Statistics - How long are reviews? How many unique words?
Working with the Movie Reviews Dataset - Analyzing real data
Comparing Positive vs Negative Reviews - Finding patterns
Word frequency tells us which words appear most often in our text. This helps us understand the main topics and themes.
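Note: the examples below assume the NLTK data from Week 1 is already downloaded. If it isn't, a one-time setup like this should cover everything used this week:
import nltk
nltk.download('punkt')          # tokenizer models used by word_tokenize
# nltk.download('punkt_tab')    # also needed on some newer NLTK versions
nltk.download('stopwords')      # English stop word list
nltk.download('movie_reviews')  # review corpus used later in this lesson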
from collections import Counter
from nltk.tokenize import word_tokenize
import string
from nltk.corpus import stopwords
# Sample review text
review = "This game is amazing! The graphics are amazing and the gameplay is amazing too. I love the amazing story."
# Preprocess the text
tokens = word_tokenize(review.lower())
tokens = [token for token in tokens if token not in string.punctuation]
stop_words = set(stopwords.words('english'))
tokens = [token for token in tokens if token not in stop_words]
# Count word frequencies
word_freq = Counter(tokens)
print("Word frequencies:", word_freq)
print("Most common words:", word_freq.most_common(3))
Output:
Word frequencies: Counter({'amazing': 4, 'game': 1, 'graphics': 1, 'gameplay': 1, 'love': 1, 'story': 1})
Most common words: [('amazing', 4), ('game', 1), ('graphics', 1)]
The Counter object from Python's collections module makes frequency analysis easy:
It counts how many times each word appears
most_common(n) returns the top n most frequent words
Results are sorted by frequency automatically (see the short demo below)
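Counter behaves like a dictionary, and a couple of extra conveniences are worth knowing. A minimal demo with made-up tokens:
from collections import Counter
freq = Counter(['fun', 'fun', 'boring', 'fun'])
print(freq['fun'])         # 3
print(freq['epic'])        # 0 - unseen words return 0 instead of raising KeyError
print(freq.most_common())  # with no argument, returns every word, most frequent first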
Text statistics help us understand the structure and characteristics of our reviews.
from nltk.tokenize import word_tokenize
review1 = "Great game! Love it."
review2 = "This game has amazing graphics, incredible gameplay, and a fantastic story that keeps you engaged for hours."
# Tokenize both reviews, keeping only word tokens
tokens1 = [t for t in word_tokenize(review1) if t.isalpha()]
tokens2 = [t for t in word_tokenize(review2) if t.isalpha()]
print("Review 1 length:", len(tokens1), "tokens")
print("Review 2 length:", len(tokens2), "tokens")
print("Review 1:", review1)
print("Review 2:", review2)
Output:
Review 1 length: 4 tokens
Review 2 length: 17 tokens
Review 1: Great game! Love it.
Review 2: This game has amazing graphics, incredible gameplay, and a fantastic story that keeps you engaged for hours.
review = "The game is good. The graphics are good. The gameplay is good too."
tokens = word_tokenize(review.lower())
# Remove punctuation
tokens = [token for token in tokens if token.isalpha()]
# Count total and unique words
total_words = len(tokens)
unique_words = len(set(tokens))
vocabulary_richness = unique_words / total_words
print("Total words:", total_words)
print("Unique words:", unique_words)
print("Vocabulary richness:", round(vocabulary_richness, 2))
Output:
Total words: 13
Unique words: 8
Vocabulary richness: 0.62
What this means:
Higher vocabulary richness = more diverse word usage
Lower vocabulary richness = repetitive language
Range: close to 0.0 (the same word repeated over and over) up to 1.0 (every word unique) - see the quick check below
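A quick sanity check on the two extremes, using made-up token lists:
# Highly repetitive text: richness near 0
repetitive = ['good'] * 10
print(len(set(repetitive)) / len(repetitive))  # 0.1 - one unique word out of ten
# Every word unique: richness exactly 1.0
diverse = ['great', 'fun', 'polished', 'immersive', 'challenging']
print(len(set(diverse)) / len(diverse))        # 1.0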
Another simple statistic is average word length:
tokens = ['amazing', 'good', 'incredible', 'ok']
avg_length = sum(len(word) for word in tokens) / len(tokens)
print("Average word length:", round(avg_length, 2), "characters")
Output:
Average word length: 5.75 characters
Now let's apply these techniques to real data from NLTK's movie reviews corpus.
from nltk.corpus import movie_reviews
import random
# Get all positive and negative review IDs
positive_reviews = movie_reviews.fileids('pos')
negative_reviews = movie_reviews.fileids('neg')
print("Total positive reviews:", len(positive_reviews))
print("Total negative reviews:", len(negative_reviews))
# Get a random positive review
random_pos_id = random.choice(positive_reviews)
pos_review_text = movie_reviews.raw(random_pos_id)
print("\nSample positive review (first 200 chars):")
print(pos_review_text[:200])
Output:
Total positive reviews: 1000
Total negative reviews: 1000
Sample positive review (first 200 chars):
(varies by run, since the review is chosen at random)
Now let's run a complete frequency analysis on a single review:
from nltk.corpus import movie_reviews
from nltk.tokenize import word_tokenize
from collections import Counter
import string
from nltk.corpus import stopwords
# Get one review
review_id = movie_reviews.fileids('pos')[0]
review_text = movie_reviews.raw(review_id)
# Preprocess
tokens = word_tokenize(review_text.lower())
tokens = [t for t in tokens if t not in string.punctuation]
stop_words = set(stopwords.words('english'))
filtered_tokens = [t for t in tokens if t not in stop_words]
# Statistics
print("Review ID:", review_id)
print("Total tokens:", len(tokens))
print("After removing stop words:", len(filtered_tokens))
print("Unique words:", len(set(filtered_tokens)))
print("\nTop 5 most common words:")
word_freq = Counter(filtered_tokens)
for word, count in word_freq.most_common(5):
    print(f"  {word}: {count}")
Let's find patterns that might help with sentiment analysis later.
from nltk.corpus import movie_reviews
from nltk.tokenize import word_tokenize
import string
# Analyze positive reviews
pos_lengths = []
for fileid in movie_reviews.fileids('pos')[:100]:  # First 100 reviews
    tokens = word_tokenize(movie_reviews.raw(fileid))
    tokens = [t for t in tokens if t not in string.punctuation]
    pos_lengths.append(len(tokens))
# Analyze negative reviews
neg_lengths = []
for fileid in movie_reviews.fileids('neg')[:100]:  # First 100 reviews
    tokens = word_tokenize(movie_reviews.raw(fileid))
    tokens = [t for t in tokens if t not in string.punctuation]
    neg_lengths.append(len(tokens))
# Calculate averages
avg_pos_length = sum(pos_lengths) / len(pos_lengths)
avg_neg_length = sum(neg_lengths) / len(neg_lengths)
print("Average positive review length:", round(avg_pos_length, 2), "words")
print("Average negative review length:", round(avg_neg_length, 2), "words")
from nltk.corpus import movie_reviews
from nltk.tokenize import word_tokenize
from collections import Counter
import string
from nltk.corpus import stopwords
def get_top_words(category, num_reviews=100):
    """Get most common words for a category"""
    all_tokens = []
    stop_words = set(stopwords.words('english'))
    for fileid in movie_reviews.fileids(category)[:num_reviews]:
        tokens = word_tokenize(movie_reviews.raw(fileid).lower())
        tokens = [t for t in tokens if t not in string.punctuation]
        tokens = [t for t in tokens if t not in stop_words]
        all_tokens.extend(tokens)
    return Counter(all_tokens).most_common(10)
# Get top words for each category
print("Top 10 words in POSITIVE reviews:")
for word, count in get_top_words('pos'):
    print(f"  {word}: {count}")
print("\nTop 10 words in NEGATIVE reviews:")
for word, count in get_top_words('neg'):
    print(f"  {word}: {count}")
Let's combine everything into one reusable function:
from nltk.corpus import movie_reviews
from nltk.tokenize import word_tokenize
from collections import Counter
import string
from nltk.corpus import stopwords
def analyze_review(review_text):
    """Complete analysis of a single review"""
    # Preprocess
    tokens = word_tokenize(review_text.lower())
    tokens_no_punct = [t for t in tokens if t not in string.punctuation]
    stop_words = set(stopwords.words('english'))
    filtered_tokens = [t for t in tokens_no_punct if t not in stop_words]
    # Calculate statistics
    total_words = len(tokens_no_punct)
    unique_words = len(set(filtered_tokens))
    # Richness compares unique filtered words to total filtered words,
    # so both counts come from the same token list
    vocab_richness = unique_words / len(filtered_tokens) if filtered_tokens else 0
    avg_word_length = sum(len(w) for w in filtered_tokens) / len(filtered_tokens) if filtered_tokens else 0
    # Get top words
    word_freq = Counter(filtered_tokens)
    top_words = word_freq.most_common(5)
    # Return analysis
    return {
        'total_words': total_words,
        'unique_words': unique_words,
        'vocabulary_richness': round(vocab_richness, 2),
        'avg_word_length': round(avg_word_length, 2),
        'top_words': top_words
    }
# Test the function
sample = "This game is absolutely amazing! The graphics are stunning and the gameplay is fantastic."
analysis = analyze_review(sample)
print("Review Analysis:")
for key, value in analysis.items():
    print(f"  {key}: {value}")
Practice exercise: analyze a positive and a negative review from the dataset and compare:
Which one is longer?
Which has more unique words?
What are the top 3 words in each?
from nltk.corpus import movie_reviews
# Get one positive and one negative review
pos_review = movie_reviews.raw(movie_reviews.fileids('pos')[0])
neg_review = movie_reviews.raw(movie_reviews.fileids('neg')[0])
# Your task: Use analyze_review() function on both and compare the results
pos_analysis = analyze_review(pos_review)
neg_analysis = analyze_review(neg_review)
print("Positive Review Analysis:", pos_analysis)
print("\nNegative Review Analysis:", neg_analysis)
Key takeaways:
Word frequency reveals the most important topics in text
Text statistics help us understand review patterns
Comparing categories shows us what makes positive/negative reviews different
These patterns will be crucial for sentiment analysis in later weeks
In Week 3, we'll finally start building our sentiment analyzer! We'll learn:
Introduction to sentiment scoring
Building a simple positive/negative word dictionary
Calculating sentiment scores for reviews
Understanding how sentiment analysis works
You're now ready to move from analysis to classification!