Steam Review Analyzer Week 2


    By AI Club on 9/29/2025

    Week 2: Text Analysis & Statistics

    Week 2 Goals

    • Analyze word frequencies

    • Calculate text statistics

    • Explore the movie reviews dataset

    Introduction

    Last week, we learned how to clean and preprocess text data. Now we'll learn how to analyze that cleaned text to extract meaningful insights. This week focuses on understanding what the text contains before we move into sentiment analysis.

    Think of it like being a detective - we need to examine the evidence (words) before we can draw conclusions (sentiment).

    Find the full code for this week here: https://drive.google.com/file/d/1hRTZc2JcNnmqTcNmfKCUixSlAObdHXIE/view?usp=sharing

    What We'll Learn

    1. Word Frequency Analysis - Which words appear most often?

    2. Text Statistics - How long are reviews? How many unique words?

    3. Working with the Movie Reviews Dataset - Analyzing real data

    4. Comparing Positive vs Negative Reviews - Finding patterns

    1. Word Frequency Analysis

    Word frequency tells us which words appear most often in our text. This helps us understand the main topics and themes.

    Counting Words

    from collections import Counter

    from nltk.tokenize import word_tokenize

    import string

    from nltk.corpus import stopwords

    # Sample review text

    review = "This game is amazing! The graphics are amazing and the gameplay is amazing too. I love the amazing story."

    # Preprocess the text

    tokens = word_tokenize(review.lower())

    tokens = [token for token in tokens if token not in string.punctuation]

    stop_words = set(stopwords.words('english'))

    tokens = [token for token in tokens if token not in stop_words]

    # Count word frequencies

    word_freq = Counter(tokens)

    print("Word frequencies:", word_freq)

    print("Most common words:", word_freq.most_common(3))

    Output:

    Word frequencies: Counter({'amazing': 4, 'game': 1, 'graphics': 1, 'gameplay': 1, 'love': 1, 'story': 1})

    Most common words: [('amazing', 4), ('game', 1), ('graphics', 1)]

    Understanding the Counter Object

    The Counter object from Python's collections module makes frequency analysis easy:

    • It counts how many times each word appears

    • most_common(n) returns the top n most frequent words

    • Results are sorted by frequency automatically
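The bullets above are easy to verify with a quick stand-alone example (the sample words here are made up for illustration):

```python
from collections import Counter

# Counter acts like a dict mapping item -> count
freq = Counter(["fun", "fun", "boring", "fun", "great"])

print(freq["fun"])          # 3
print(freq["missing"])      # 0 -- absent keys return 0 instead of raising KeyError
print(freq.most_common(2))  # top 2 entries, sorted by count
```

One design note: because missing keys return 0, you can safely look up any word without checking membership first.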

    2. Basic Text Statistics

    Text statistics help us understand the structure and characteristics of our reviews.

    Review Length Analysis

    from nltk.tokenize import word_tokenize

    review1 = "Great game! Love it."

    review2 = "This game has amazing graphics, incredible gameplay, and a fantastic story that keeps you engaged for hours."

    # Tokenize both reviews

    tokens1 = word_tokenize(review1)

    tokens2 = word_tokenize(review2)

    print("Review 1 length:", len(tokens1), "tokens")

    print("Review 2 length:", len(tokens2), "tokens")

    print("Review 1:", review1)

    print("Review 2:", review2)

Output:

Review 1 length: 6 tokens

Review 2 length: 20 tokens

Note that word_tokenize treats punctuation marks as tokens, so these counts include the "!", ",", and "." characters along with the words.

    Unique Word Count (Vocabulary Richness)

    review = "The game is good. The graphics are good. The gameplay is good too."

    tokens = word_tokenize(review.lower())

    # Remove punctuation

    tokens = [token for token in tokens if token.isalpha()]

    # Count total and unique words

    total_words = len(tokens)

    unique_words = len(set(tokens))

    vocabulary_richness = unique_words / total_words

    print("Total words:", total_words)

    print("Unique words:", unique_words)

    print("Vocabulary richness:", round(vocabulary_richness, 2))

Output:

Total words: 13

Unique words: 8

Vocabulary richness: 0.62

    What this means:

    • Higher vocabulary richness = more diverse word usage

    • Lower vocabulary richness = repetitive language

    • Range: 0.0 (all same words) to 1.0 (all unique words)

    Average Word Length

    tokens = ['amazing', 'good', 'incredible', 'ok']

    avg_length = sum(len(word) for word in tokens) / len(tokens)

    print("Average word length:", round(avg_length, 2), "characters")

    Output:

Average word length: 5.75 characters

    3. Working with the Movie Reviews Dataset

    Now let's apply these techniques to real data from NLTK's movie reviews corpus.

    Loading Movie Reviews

    from nltk.corpus import movie_reviews

    import random

    # Get all positive and negative review IDs

    positive_reviews = movie_reviews.fileids('pos')

    negative_reviews = movie_reviews.fileids('neg')

    print("Total positive reviews:", len(positive_reviews))

    print("Total negative reviews:", len(negative_reviews))

    # Get a random positive review

    random_pos_id = random.choice(positive_reviews)

    pos_review_text = movie_reviews.raw(random_pos_id)

    print("\nSample positive review (first 200 chars):")

    print(pos_review_text[:200])

    Output:

    Total positive reviews: 1000

    Total negative reviews: 1000

    Analyzing a Single Review

    from nltk.corpus import movie_reviews

    from nltk.tokenize import word_tokenize

    from collections import Counter

    import string

    from nltk.corpus import stopwords

    # Get one review

    review_id = movie_reviews.fileids('pos')[0]

    review_text = movie_reviews.raw(review_id)

    # Preprocess

    tokens = word_tokenize(review_text.lower())

    tokens = [t for t in tokens if t not in string.punctuation]

    stop_words = set(stopwords.words('english'))

    filtered_tokens = [t for t in tokens if t not in stop_words]

    # Statistics

    print("Review ID:", review_id)

    print("Total tokens:", len(tokens))

    print("After removing stop words:", len(filtered_tokens))

    print("Unique words:", len(set(filtered_tokens)))

    print("\nTop 5 most common words:")

    word_freq = Counter(filtered_tokens)

    for word, count in word_freq.most_common(5):

    print(f"  {word}: {count}")

    4. Comparing Positive vs Negative Reviews

    Let's find patterns that might help with sentiment analysis later.

    Average Review Length by Sentiment

    from nltk.corpus import movie_reviews

    from nltk.tokenize import word_tokenize

    import string

    # Analyze positive reviews

    pos_lengths = []

for fileid in movie_reviews.fileids('pos')[:100]:  # First 100 reviews

        tokens = word_tokenize(movie_reviews.raw(fileid))

        tokens = [t for t in tokens if t not in string.punctuation]

        pos_lengths.append(len(tokens))

    # Analyze negative reviews

    neg_lengths = []

for fileid in movie_reviews.fileids('neg')[:100]:  # First 100 reviews

        tokens = word_tokenize(movie_reviews.raw(fileid))

        tokens = [t for t in tokens if t not in string.punctuation]

        neg_lengths.append(len(tokens))

    # Calculate averages

    avg_pos_length = sum(pos_lengths) / len(pos_lengths)

    avg_neg_length = sum(neg_lengths) / len(neg_lengths)

    print("Average positive review length:", round(avg_pos_length, 2), "words")

    print("Average negative review length:", round(avg_neg_length, 2), "words")
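Averages alone can hide a lot: two groups with the same mean can have very different spreads. Here is a small sketch using Python's built-in statistics module, with made-up length lists rather than the real corpus counts:

```python
import statistics

# Hypothetical token counts for two groups of reviews (illustration only)
pos_lengths = [620, 710, 455, 980, 530]
neg_lengths = [740, 510, 865, 600, 925]

for name, lengths in [("positive", pos_lengths), ("negative", neg_lengths)]:
    # Report center and spread, not just the mean
    print(f"{name}: mean={statistics.mean(lengths):.1f}, "
          f"median={statistics.median(lengths)}, "
          f"stdev={statistics.stdev(lengths):.1f}")
```

If the median differs noticeably from the mean, a few very long or very short reviews are skewing the average.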

    Most Common Words in Positive vs Negative Reviews

    from nltk.corpus import movie_reviews

    from nltk.tokenize import word_tokenize

    from collections import Counter

    import string

    from nltk.corpus import stopwords

def get_top_words(category, num_reviews=100):

        """Get most common words for a category"""

        all_tokens = []

        stop_words = set(stopwords.words('english'))

        for fileid in movie_reviews.fileids(category)[:num_reviews]:

            tokens = word_tokenize(movie_reviews.raw(fileid).lower())

            tokens = [t for t in tokens if t not in string.punctuation]

            tokens = [t for t in tokens if t not in stop_words]

            all_tokens.extend(tokens)

        return Counter(all_tokens).most_common(10)

    # Get top words for each category

    print("Top 10 words in POSITIVE reviews:")

    for word, count in get_top_words('pos'):

    print(f"  {word}: {count}")

    print("\nTop 10 words in NEGATIVE reviews:")

    for word, count in get_top_words('neg'):

    print(f"  {word}: {count}")
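One thing you will likely notice: both top-10 lists are dominated by shared words like "film" and "movie". Counter supports subtraction, which keeps only positive differences and so surfaces words that are distinctively more frequent in one category. A sketch with hypothetical counts:

```python
from collections import Counter

# Hypothetical word counts for each category (illustration only)
pos_counts = Counter({"film": 50, "great": 30, "love": 20, "bad": 2})
neg_counts = Counter({"film": 48, "bad": 35, "boring": 18, "great": 5})

# Subtraction drops zero and negative differences automatically
distinctly_pos = pos_counts - neg_counts
distinctly_neg = neg_counts - pos_counts

print(distinctly_pos.most_common(3))  # [('great', 25), ('love', 20), ('film', 2)]
print(distinctly_neg.most_common(2))  # [('bad', 33), ('boring', 18)]
```

You could apply the same trick to the real Counters built inside get_top_words to find words that lean positive or negative.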

    5. Building a Complete Analysis Function

    Let's combine everything into one reusable function:

    from nltk.corpus import movie_reviews

    from nltk.tokenize import word_tokenize

    from collections import Counter

    import string

    from nltk.corpus import stopwords

def analyze_review(review_text):

        """Complete analysis of a single review"""

        # Preprocess

        tokens = word_tokenize(review_text.lower())

        tokens_no_punct = [t for t in tokens if t not in string.punctuation]

        stop_words = set(stopwords.words('english'))

        filtered_tokens = [t for t in tokens_no_punct if t not in stop_words]

        # Calculate statistics

        total_words = len(tokens_no_punct)

        unique_words = len(set(filtered_tokens))

        vocab_richness = unique_words / total_words if total_words > 0 else 0

        avg_word_length = sum(len(w) for w in filtered_tokens) / len(filtered_tokens) if filtered_tokens else 0

        # Get top words

        word_freq = Counter(filtered_tokens)

        top_words = word_freq.most_common(5)

        # Return analysis

        return {

            'total_words': total_words,

            'unique_words': unique_words,

            'vocabulary_richness': round(vocab_richness, 2),

            'avg_word_length': round(avg_word_length, 2),

            'top_words': top_words

        }

    # Test the function

    sample = "This game is absolutely amazing! The graphics are stunning and the gameplay is fantastic."

    analysis = analyze_review(sample)

    print("Review Analysis:")

    for key, value in analysis.items():

    print(f"  {key}: {value}")

    Practice Exercise

    Analyze both a positive and negative review from the dataset and compare:

    1. Which one is longer?

    2. Which has more unique words?

    3. What are the top 3 words in each?

    from nltk.corpus import movie_reviews

    # Get one positive and one negative review

    pos_review = movie_reviews.raw(movie_reviews.fileids('pos')[0])

    neg_review = movie_reviews.raw(movie_reviews.fileids('neg')[0])

    # Your task: Use analyze_review() function on both and compare the results

    pos_analysis = analyze_review(pos_review)

    neg_analysis = analyze_review(neg_review)

    print("Positive Review Analysis:", pos_analysis)

    print("\nNegative Review Analysis:", neg_analysis)

    Key Takeaways

    • Word frequency reveals the most important topics in text

    • Text statistics help us understand review patterns

    • Comparing categories shows us what makes positive/negative reviews different

    • These patterns will be crucial for sentiment analysis in later weeks

    Next Week Preview

    In Week 3, we'll finally start building our sentiment analyzer! We'll learn:

    • Introduction to sentiment scoring

    • Building a simple positive/negative word dictionary

    • Calculating sentiment scores for reviews

    • Understanding how sentiment analysis works

    You're now ready to move from analysis to classification!
