Steam Review Analyzer Week 1

    Week 1 content of the Steam Review Analyzer project

    By AI Club on 9/22/2025

    Week 1: NLP Fundamentals for Sentiment Analysis

    Week 1 Goals

    • Master text preprocessing

    • Set up NLTK environment

    • Build preprocessing pipeline

    Project Overview

    Over the next several weeks, we'll be building a Chrome Extension that automatically performs sentiment analysis on Steam game reviews. When you visit a Steam game page, our extension will:

    • Extract game reviews from the page

    • Analyze the sentiment of each review (positive, negative, neutral)

    • Provide an overall sentiment score for the game

    • Display insights about what players think

    However, before we dive into the Chrome extension development, we need to understand the Natural Language Processing (NLP) techniques that power sentiment analysis. We'll use Python to learn these fundamentals in a controlled environment, then later apply these concepts to our JavaScript-based Chrome extension.

    Why Start with Python?

    Learning NLP concepts in Python first offers several advantages:

    • Clear focus: No web development complexity to distract from NLP learning

    • Rich ecosystem: Python has excellent NLP libraries like NLTK

    • Interactive learning: Easy to experiment and see immediate results

    • Foundation building: Concepts learned here transfer directly to JavaScript

    Introduction to NLTK

    NLTK (Natural Language Toolkit) is Python's premier library for natural language processing. It provides:

    • Text processing utilities

    • Linguistic data and corpora

    • Classification and tokenization tools

    • Sentiment analysis capabilities

    Installing NLTK and Required Data

    First, let's install NLTK and download the datasets we'll need:

    import nltk

    # Download required NLTK data packages

    nltk.download('movie_reviews')

    nltk.download('vader_lexicon')

    nltk.download('punkt')

    nltk.download('stopwords')

    What each download does:

    • movie_reviews: 2,000 movie reviews labeled as positive/negative

    • vader_lexicon: Sentiment analysis tool

    • punkt: Pre-trained tokenizer models used by word_tokenize (newer NLTK versions may also require punkt_tab)

    • stopwords: Lists of common words (like "the" and "is") that carry little meaning on their own

    Core NLP Preprocessing Techniques

    Before we can analyze sentiment, we need to preprocess our text data. Raw text is messy and inconsistent; preprocessing cleans it up for analysis.

    1. Tokenization

    Tokenization breaks text into individual words or tokens. It's the first step in most NLP tasks.

    from nltk.tokenize import word_tokenize

    # Example text

    text = "This game is absolutely amazing! I love the graphics and gameplay."

    # Tokenize the text

    tokens = word_tokenize(text)

    print("Original text:", text)

    print("Tokens:", tokens)

    Output:

    Original text: This game is absolutely amazing! I love the graphics and gameplay.

    Tokens: ['This', 'game', 'is', 'absolutely', 'amazing', '!', 'I', 'love', 'the', 'graphics', 'and', 'gameplay', '.']

    Why tokenization matters:

    • Separates words from punctuation

    • Handles contractions and special characters

    • Creates a list we can analyze programmatically

    2. Lowercasing

    Lowercasing normalizes text by converting all characters to lowercase. This ensures "Game", "game", and "GAME" are treated as the same word.

    # Convert tokens to lowercase

    text = "The GRAPHICS are Amazing and the gameplay is EXCELLENT!"

    tokens = word_tokenize(text)

    # Before lowercasing

    print("Original tokens:", tokens)

    # After lowercasing

    lowercase_tokens = [token.lower() for token in tokens]

    print("Lowercase tokens:", lowercase_tokens)

    Output:

    Original tokens: ['The', 'GRAPHICS', 'are', 'Amazing', 'and', 'the', 'gameplay', 'is', 'EXCELLENT', '!']

    Lowercase tokens: ['the', 'graphics', 'are', 'amazing', 'and', 'the', 'gameplay', 'is', 'excellent', '!']

    3. Stop Words Removal

    Stop words are common words like "the", "is", "and" that appear frequently but don't carry sentiment. Removing them helps focus on meaningful words.

    from nltk.corpus import stopwords

    from nltk.tokenize import word_tokenize

    # Get English stop words

    stop_words = set(stopwords.words('english'))

    print("Sample stop words:", list(stop_words)[:10])

    # Example text

    text = "The game is really good and I think the graphics are amazing"

    tokens = word_tokenize(text.lower())

    # Remove stop words

    filtered_tokens = [token for token in tokens if token not in stop_words]

    print("Original tokens:", tokens)

    print("After removing stop words:", filtered_tokens)

    Output:

    Sample stop words: ['i', 'me', 'my', 'myself', 'we', 'our', 'ours', 'ourselves', 'you', "you're"]

    Original tokens: ['the', 'game', 'is', 'really', 'good', 'and', 'i', 'think', 'the', 'graphics', 'are', 'amazing']

    After removing stop words: ['game', 'really', 'good', 'think', 'graphics', 'amazing']

    4. Punctuation Removal

    Punctuation removal eliminates punctuation marks, which on their own don't contribute to sentiment analysis.

    import string

    from nltk.tokenize import word_tokenize

    text = "This game is amazing!!! The graphics are top-notch, and the story is incredible."

    tokens = word_tokenize(text.lower())

    # Remove punctuation

    no_punct_tokens = [token for token in tokens if token not in string.punctuation]

    print("Original tokens:", tokens)

    print("After removing punctuation:", no_punct_tokens)

    print("Punctuation characters:", string.punctuation)

    Output:

    Original tokens: ['this', 'game', 'is', 'amazing', '!', '!', '!', 'the', 'graphics', 'are', 'top-notch', ',', 'and', 'the', 'story', 'is', 'incredible', '.']

    After removing punctuation: ['this', 'game', 'is', 'amazing', 'the', 'graphics', 'are', 'top-notch', 'and', 'the', 'story', 'is', 'incredible']

    Punctuation characters: !"#$%&'()*+,-./:;<=>?@[\]^_`{|}~

    Putting It All Together

    Let's combine all preprocessing steps into a single function:

    import nltk

    from nltk.tokenize import word_tokenize

    from nltk.corpus import stopwords

    import string

    def preprocess_text(text):
        """Complete text preprocessing pipeline"""
        # Step 1: Tokenize
        tokens = word_tokenize(text)

        # Step 2: Convert to lowercase
        tokens = [token.lower() for token in tokens]

        # Step 3: Remove punctuation
        tokens = [token for token in tokens if token not in string.punctuation]

        # Step 4: Remove stop words
        stop_words = set(stopwords.words('english'))
        tokens = [token for token in tokens if token not in stop_words]

        return tokens

    # Test the complete pipeline

    review_text = "This game is absolutely incredible! The graphics are stunning and the gameplay is super engaging. I highly recommend it to everyone!"

    original_tokens = word_tokenize(review_text)

    processed_tokens = preprocess_text(review_text)

    print("Original text:", review_text)

    print("Original tokens:", original_tokens)

    print("Processed tokens:", processed_tokens)

    Output:

    Original text: This game is absolutely incredible! The graphics are stunning and the gameplay is super engaging. I highly recommend it to everyone!

    Original tokens: ['This', 'game', 'is', 'absolutely', 'incredible', '!', 'The', 'graphics', 'are', 'stunning', 'and', 'the', 'gameplay', 'is', 'super', 'engaging', '.', 'I', 'highly', 'recommend', 'it', 'to', 'everyone', '!']

    Processed tokens: ['game', 'absolutely', 'incredible', 'graphics', 'stunning', 'gameplay', 'super', 'engaging', 'highly', 'recommend', 'everyone']

    Practice Exercise

    Try preprocessing this game review using the techniques we learned:

    # Practice with this review

    practice_review = "OMG!!! This game is SO BAD. The controls are terrible, the story makes no sense, and I wasted my money. Don't buy this game!"

    # Your task: Apply all preprocessing steps and see what meaningful words remain

    Code for Everything We Did

    Here is the link to the code for week 1: https://drive.google.com/file/d/11iZ3uge-vTyyOSasjz0fXk8jX6OEM-vS/view?usp=sharing

    Next Week Preview

    In Week 2, we'll learn how to:

    • Count word frequencies in our processed text

    • Calculate basic text statistics

    • Analyze patterns in the movie reviews dataset

    • Prepare for sentiment analysis techniques

    These preprocessing techniques are the foundation of all NLP work. Master them now, and sentiment analysis will make much more sense later!
