Steam Review Analyzer - Week 2
This week you will:
Analyze word frequencies
Calculate text statistics
Explore the movie reviews dataset
Last week, we learned how to clean and preprocess text data. Now we'll learn how to analyze that cleaned text to extract meaningful insights. This week focuses on understanding what the text contains before we move into sentiment analysis.
Think of it like being a detective - we need to examine the evidence (words) before we can draw conclusions (sentiment).
Find the full code for this week here: https://drive.google.com/file/d/1hRTZc2JcNnmqTcNmfKCUixSlAObdHXIE/view?usp=sharing
Here's what we'll cover:
Word Frequency Analysis - Which words appear most often?
Text Statistics - How long are reviews? How many unique words?
Working with the Movie Reviews Dataset - Analyzing real data
Comparing Positive vs Negative Reviews - Finding patterns
Word frequency tells us which words appear most often in our text. This helps us understand the main topics and themes.
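Note: the examples below assume the NLTK data from Week 1 is already downloaded. If it isn't, a one-time setup like this should cover everything used this week:
import nltk
nltk.download('punkt')          # tokenizer models used by word_tokenize
# nltk.download('punkt_tab')    # also needed on some newer NLTK versions
nltk.download('stopwords')      # English stop word list
nltk.download('movie_reviews')  # review corpus used later in this lesson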
from collections import Counter
from nltk.tokenize import word_tokenize
import string
from nltk.corpus import stopwords
# Sample review text
review = "This game is amazing! The graphics are amazing and the gameplay is amazing too. I love the amazing story."
# Preprocess the text
tokens = word_tokenize(review.lower())
tokens = [token for token in tokens if token not in string.punctuation]
stop_words = set(stopwords.words('english'))
tokens = [token for token in tokens if token not in stop_words]
# Count word frequencies
word_freq = Counter(tokens)
print("Word frequencies:", word_freq)
print("Most common words:", word_freq.most_common(3))
Output:
Word frequencies: Counter({'amazing': 4, 'game': 1, 'graphics': 1, 'gameplay': 1, 'love': 1, 'story': 1})
Most common words: [('amazing', 4), ('game', 1), ('graphics', 1)]
The Counter object from Python's collections module makes frequency analysis easy:
It counts how many times each word appears
most_common(n) returns the top n most frequent words
Results are sorted by frequency automatically (see the short demo below)
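Counter behaves like a dictionary, and a couple of extra conveniences are worth knowing. A minimal demo with made-up tokens:
from collections import Counter
freq = Counter(['fun', 'fun', 'boring', 'fun'])
print(freq['fun'])         # 3
print(freq['epic'])        # 0 - unseen words return 0 instead of raising KeyError
print(freq.most_common())  # with no argument, returns every word, most frequent first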
Text statistics help us understand the structure and characteristics of our reviews.
from nltk.tokenize import word_tokenize
review1 = "Great game! Love it."
review2 = "This game has amazing graphics, incredible gameplay, and a fantastic story that keeps you engaged for hours."
# Tokenize both reviews, keeping only word tokens
tokens1 = [t for t in word_tokenize(review1) if t.isalpha()]
tokens2 = [t for t in word_tokenize(review2) if t.isalpha()]
print("Review 1 length:", len(tokens1), "tokens")
print("Review 2 length:", len(tokens2), "tokens")
print("Review 1:", review1)
print("Review 2:", review2)
Output:
Review 1 length: 4 tokens
Review 2 length: 17 tokens
Review 1: Great game! Love it.
Review 2: This game has amazing graphics, incredible gameplay, and a fantastic story that keeps you engaged for hours.
review = "The game is good. The graphics are good. The gameplay is good too."
tokens = word_tokenize(review.lower())
# Remove punctuation
tokens = [token for token in tokens if token.isalpha()]
# Count total and unique words
total_words = len(tokens)
unique_words = len(set(tokens))
vocabulary_richness = unique_words / total_words
print("Total words:", total_words)
print("Unique words:", unique_words)
print("Vocabulary richness:", round(vocabulary_richness, 2))
Output:
Total words: 13
Unique words: 8
Vocabulary richness: 0.62
What this means:
Higher vocabulary richness = more diverse word usage
Lower vocabulary richness = repetitive language
Range: close to 0.0 (the same word repeated over and over) up to 1.0 (every word unique) - see the quick check below
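A quick sanity check on the two extremes, using made-up token lists:
# Highly repetitive text: richness near 0
repetitive = ['good'] * 10
print(len(set(repetitive)) / len(repetitive))  # 0.1 - one unique word out of ten
# Every word unique: richness exactly 1.0
diverse = ['great', 'fun', 'polished', 'immersive', 'challenging']
print(len(set(diverse)) / len(diverse))        # 1.0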
Another simple statistic is average word length:
tokens = ['amazing', 'good', 'incredible', 'ok']
avg_length = sum(len(word) for word in tokens) / len(tokens)
print("Average word length:", round(avg_length, 2), "characters")
Output:
Average word length: 5.75 characters
Now let's apply these techniques to real data from NLTK's movie reviews corpus.
from nltk.corpus import movie_reviews
import random
# Get all positive and negative review IDs
positive_reviews = movie_reviews.fileids('pos')
negative_reviews = movie_reviews.fileids('neg')
print("Total positive reviews:", len(positive_reviews))
print("Total negative reviews:", len(negative_reviews))
# Get a random positive review
random_pos_id = random.choice(positive_reviews)
pos_review_text = movie_reviews.raw(random_pos_id)
print("\nSample positive review (first 200 chars):")
print(pos_review_text[:200])
Output:
Total positive reviews: 1000
Total negative reviews: 1000
Sample positive review (first 200 chars):
(varies by run, since the review is chosen at random)
Now let's run a complete frequency analysis on a single review:
from nltk.corpus import movie_reviews
from nltk.tokenize import word_tokenize
from collections import Counter
import string
from nltk.corpus import stopwords
# Get one review
review_id = movie_reviews.fileids('pos')[0]
review_text = movie_reviews.raw(review_id)
# Preprocess
tokens = word_tokenize(review_text.lower())
tokens = [t for t in tokens if t not in string.punctuation]
stop_words = set(stopwords.words('english'))
filtered_tokens = [t for t in tokens if t not in stop_words]
# Statistics
print("Review ID:", review_id)
print("Total tokens:", len(tokens))
print("After removing stop words:", len(filtered_tokens))
print("Unique words:", len(set(filtered_tokens)))
print("\nTop 5 most common words:")
word_freq = Counter(filtered_tokens)
for word, count in word_freq.most_common(5):
    print(f"  {word}: {count}")
Let's find patterns that might help with sentiment analysis later.
from nltk.corpus import movie_reviews
from nltk.tokenize import word_tokenize
import string
# Analyze positive reviews
pos_lengths = []
for fileid in movie_reviews.fileids('pos')[:100]:  # First 100 reviews
    tokens = word_tokenize(movie_reviews.raw(fileid))
    tokens = [t for t in tokens if t not in string.punctuation]
    pos_lengths.append(len(tokens))
# Analyze negative reviews
neg_lengths = []
for fileid in movie_reviews.fileids('neg')[:100]:  # First 100 reviews
    tokens = word_tokenize(movie_reviews.raw(fileid))
    tokens = [t for t in tokens if t not in string.punctuation]
    neg_lengths.append(len(tokens))
# Calculate averages
avg_pos_length = sum(pos_lengths) / len(pos_lengths)
avg_neg_length = sum(neg_lengths) / len(neg_lengths)
print("Average positive review length:", round(avg_pos_length, 2), "words")
print("Average negative review length:", round(avg_neg_length, 2), "words")
from nltk.corpus import movie_reviews
from nltk.tokenize import word_tokenize
from collections import Counter
import string
from nltk.corpus import stopwords
def get_top_words(category, num_reviews=100):
    """Get most common words for a category"""
    all_tokens = []
    stop_words = set(stopwords.words('english'))
    for fileid in movie_reviews.fileids(category)[:num_reviews]:
        tokens = word_tokenize(movie_reviews.raw(fileid).lower())
        tokens = [t for t in tokens if t not in string.punctuation]
        tokens = [t for t in tokens if t not in stop_words]
        all_tokens.extend(tokens)
    return Counter(all_tokens).most_common(10)
# Get top words for each category
print("Top 10 words in POSITIVE reviews:")
for word, count in get_top_words('pos'):
    print(f"  {word}: {count}")
print("\nTop 10 words in NEGATIVE reviews:")
for word, count in get_top_words('neg'):
    print(f"  {word}: {count}")
Let's combine everything into one reusable function:
from nltk.corpus import movie_reviews
from nltk.tokenize import word_tokenize
from collections import Counter
import string
from nltk.corpus import stopwords
def analyze_review(review_text):
    """Complete analysis of a single review"""
    # Preprocess
    tokens = word_tokenize(review_text.lower())
    tokens_no_punct = [t for t in tokens if t not in string.punctuation]
    stop_words = set(stopwords.words('english'))
    filtered_tokens = [t for t in tokens_no_punct if t not in stop_words]
    # Calculate statistics
    total_words = len(tokens_no_punct)
    unique_words = len(set(filtered_tokens))
    # Richness compares unique filtered words to total filtered words,
    # so both counts come from the same token list
    vocab_richness = unique_words / len(filtered_tokens) if filtered_tokens else 0
    avg_word_length = sum(len(w) for w in filtered_tokens) / len(filtered_tokens) if filtered_tokens else 0
    # Get top words
    word_freq = Counter(filtered_tokens)
    top_words = word_freq.most_common(5)
    # Return analysis
    return {
        'total_words': total_words,
        'unique_words': unique_words,
        'vocabulary_richness': round(vocab_richness, 2),
        'avg_word_length': round(avg_word_length, 2),
        'top_words': top_words
    }
# Test the function
sample = "This game is absolutely amazing! The graphics are stunning and the gameplay is fantastic."
analysis = analyze_review(sample)
print("Review Analysis:")
for key, value in analysis.items():
    print(f"  {key}: {value}")
Practice exercise: analyze a positive and a negative review from the dataset and compare:
Which one is longer?
Which has more unique words?
What are the top 3 words in each?
from nltk.corpus import movie_reviews
# Get one positive and one negative review
pos_review = movie_reviews.raw(movie_reviews.fileids('pos')[0])
neg_review = movie_reviews.raw(movie_reviews.fileids('neg')[0])
# Your task: Use analyze_review() function on both and compare the results
pos_analysis = analyze_review(pos_review)
neg_analysis = analyze_review(neg_review)
print("Positive Review Analysis:", pos_analysis)
print("\nNegative Review Analysis:", neg_analysis)
Key takeaways:
Word frequency reveals the most important topics in text
Text statistics help us understand review patterns
Comparing categories shows us what makes positive/negative reviews different
These patterns will be crucial for sentiment analysis in later weeks
In Week 3, we'll finally start building our sentiment analyzer! We'll learn:
Introduction to sentiment scoring
Building a simple positive/negative word dictionary
Calculating sentiment scores for reviews
Understanding how sentiment analysis works
You're now ready to move from analysis to classification!