Week 1 content of the Steam Review Analyzer project
Master text preprocessing
Set up the NLTK environment
Build a preprocessing pipeline
Over the next several weeks, we'll be building a Chrome extension that automatically performs sentiment analysis on Steam game reviews. When you visit a Steam game page, our extension will:
Extract game reviews from the page
Analyze the sentiment of each review (positive, negative, neutral)
Provide an overall sentiment score for the game
Display insights about what players think
However, before we dive into the Chrome extension development, we need to understand the Natural Language Processing (NLP) techniques that power sentiment analysis. We'll use Python to learn these fundamentals in a controlled environment, then later apply these concepts to our JavaScript-based Chrome extension.
Learning NLP concepts in Python first offers several advantages:
Clear focus: No web development complexity to distract from NLP learning
Rich ecosystem: Python has excellent NLP libraries like NLTK
Interactive learning: Easy to experiment and see immediate results
Foundation building: Concepts learned here transfer directly to JavaScript
NLTK (Natural Language Toolkit) is Python's premier library for natural language processing. It provides:
Text processing utilities
Linguistic data and corpora
Classification and tokenization tools
Sentiment analysis capabilities
First, install NLTK if you haven't already (pip install nltk), then download the data packages we'll need:
import nltk
# Download required NLTK data packages
nltk.download('movie_reviews')
nltk.download('vader_lexicon')
nltk.download('punkt')
nltk.download('stopwords')
What each download does:
movie_reviews: 2,000 movie reviews labeled as positive or negative (1,000 of each; we verify this below)
vader_lexicon: The lexicon behind the VADER sentiment analyzer we'll use later
punkt: Pre-trained models for the Punkt tokenizer, used by word_tokenize
stopwords: Lists of common words (like "the" and "is") that carry little meaning on their own
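If you want to confirm the downloads worked, here's a quick sanity check (a minimal sketch; the exact stop-word count varies by NLTK version):
from nltk.corpus import movie_reviews, stopwords
# The movie_reviews corpus ships 2,000 labeled reviews
print(len(movie_reviews.fileids()))  # 2000
print(movie_reviews.categories())  # ['neg', 'pos']
# The English stop-word list has roughly 180 entries
print(len(stopwords.words('english')))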
Before we can analyze sentiment, we need to preprocess our text data. Raw text is messy and inconsistent; preprocessing cleans it up for analysis.
Tokenization breaks text into individual words or tokens. It's the first step in most NLP tasks.
from nltk.tokenize import word_tokenize
# Example text
text = "This game is absolutely amazing! I love the graphics and gameplay."
# Tokenize the text
tokens = word_tokenize(text)
print("Original text:", text)
print("Tokens:", tokens)
Output:
Original text: This game is absolutely amazing! I love the graphics and gameplay.
Tokens: ['This', 'game', 'is', 'absolutely', 'amazing', '!', 'I', 'love', 'the', 'graphics', 'and', 'gameplay', '.']
Why tokenization matters:
Separates words from punctuation
Handles contractions and special characters (see the example just below)
Creates a list we can analyze programmatically
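For instance, NLTK's default tokenizer splits contractions into meaningful pieces, which matters later when we match tokens against word lists. A quick illustration:
from nltk.tokenize import word_tokenize
# Contractions are split into their parts: "Don't" -> "Do" + "n't"
print(word_tokenize("Don't buy this game, it isn't fun."))
# ['Do', "n't", 'buy', 'this', 'game', ',', 'it', 'is', "n't", 'fun', '.']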
Lowercasing normalizes text by converting all characters to lowercase. This ensures "Game", "game", and "GAME" are treated as the same word.
# Convert tokens to lowercase
text = "The GRAPHICS are Amazing and the gameplay is EXCELLENT!"
tokens = word_tokenize(text)
# Before lowercasing
print("Original tokens:", tokens)
# After lowercasing
lowercase_tokens = [token.lower() for token in tokens]
print("Lowercase tokens:", lowercase_tokens)
Output:
Original tokens: ['The', 'GRAPHICS', 'are', 'Amazing', 'and', 'the', 'gameplay', 'is', 'EXCELLENT', '!']
Lowercase tokens: ['the', 'graphics', 'are', 'amazing', 'and', 'the', 'gameplay', 'is', 'excellent', '!']
Stop words are common words like "the", "is", "and" that appear frequently but don't carry sentiment. Removing them helps focus on meaningful words.
from nltk.corpus import stopwords
from nltk.tokenize import word_tokenize
# Get English stop words (a set makes membership checks fast)
stop_words = set(stopwords.words('english'))
print("Sample stop words:", stopwords.words('english')[:10])
# Example text
text = "The game is really good and I think the graphics are amazing"
tokens = word_tokenize(text.lower())
# Remove stop words
filtered_tokens = [token for token in tokens if token not in stop_words]
print("Original tokens:", tokens)
print("After removing stop words:", filtered_tokens)
Output:
Sample stop words: ['i', 'me', 'my', 'myself', 'we', 'our', 'ours', 'ourselves', 'you', "you're"]
Original tokens: ['the', 'game', 'is', 'really', 'good', 'and', 'i', 'think', 'the', 'graphics', 'are', 'amazing']
After removing stop words: ['game', 'really', 'good', 'think', 'graphics', 'amazing']
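One caveat worth knowing for sentiment work: NLTK's English stop-word list includes negations such as "not" and "no", and dropping them can flip a review's apparent meaning. A small demonstration, reusing the stop_words set from above:
# "not" is a stop word, so the negation disappears
text = "The game is not good"
tokens = [t for t in word_tokenize(text.lower()) if t not in stop_words]
print(tokens)  # ['game', 'good'] - the negation is gone
This is one reason sentiment tools like VADER are usually run on raw text rather than on stop-word-filtered tokens.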
Punctuation removal eliminates punctuation tokens that would otherwise clutter word-level analysis. (Keep in mind that some sentiment tools, including VADER, actually use punctuation like "!" as a signal, so whether to remove it depends on the downstream task.)
import string
from nltk.tokenize import word_tokenize
text = "This game is amazing!!! The graphics are top-notch, and the story is incredible."
tokens = word_tokenize(text.lower())
# Remove punctuation
no_punct_tokens = [token for token in tokens if token not in string.punctuation]
print("Original tokens:", tokens)
print("After removing punctuation:", no_punct_tokens)
print("Punctuation characters:", string.punctuation)
Output:
Original tokens: ['this', 'game', 'is', 'amazing', '!', '!', '!', 'the', 'graphics', 'are', 'top-notch', ',', 'and', 'the', 'story', 'is', 'incredible', '.']
After removing punctuation: ['this', 'game', 'is', 'amazing', 'the', 'graphics', 'are', 'top-notch', 'and', 'the', 'story', 'is', 'incredible']
Punctuation characters: !"#$%&'()*+,-./:;<=>?@[\]^_`{|}~
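Notice that 'top-notch' survives: the filter compares whole tokens against string.punctuation, so tokens that merely contain punctuation are kept. If you ever need to strip punctuation characters from inside tokens as well, str.translate is one common approach (a minimal sketch; we won't need it for this project):
# Remove punctuation characters from within each token
translator = str.maketrans('', '', string.punctuation)
cleaned = [token.translate(translator) for token in no_punct_tokens]
cleaned = [token for token in cleaned if token]  # drop any now-empty tokens
print(cleaned)  # 'top-notch' becomes 'topnotch'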
Let's combine all preprocessing steps into a single function:
import nltk
from nltk.tokenize import word_tokenize
from nltk.corpus import stopwords
import string
def preprocess_text(text):
"""Complete text preprocessing pipeline"""
# Step 1: Tokenize
tokens = word_tokenize(text)
# Step 2: Convert to lowercase
tokens = [token.lower() for token in tokens]
# Step 3: Remove punctuation
tokens = [token for token in tokens if token not in string.punctuation]
# Step 4: Remove stop words
stop_words = set(stopwords.words('english'))
tokens = [token for token in tokens if token not in stop_words]
return tokens
# Test the complete pipeline
review_text = "This game is absolutely incredible! The graphics are stunning and the gameplay is super engaging. I highly recommend it to everyone!"
original_tokens = word_tokenize(review_text)
processed_tokens = preprocess_text(review_text)
print("Original text:", review_text)
print("Original tokens:", original_tokens)
print("Processed tokens:", processed_tokens)
Output:
Original text: This game is absolutely incredible! The graphics are stunning and the gameplay is super engaging. I highly recommend it to everyone!
Original tokens: ['This', 'game', 'is', 'absolutely', 'incredible', '!', 'The', 'graphics', 'are', 'stunning', 'and', 'the', 'gameplay', 'is', 'super', 'engaging', '.', 'I', 'highly', 'recommend', 'it', 'to', 'everyone', '!']
Processed tokens: ['game', 'absolutely', 'incredible', 'graphics', 'stunning', 'gameplay', 'super', 'engaging', 'highly', 'recommend', 'everyone']
Try preprocessing this game review using the techniques we learned:
# Practice with this review
practice_review = "OMG!!! This game is SO BAD. The controls are terrible, the story makes no sense, and I wasted my money. Don't buy this game!"
# Your task: Apply all preprocessing steps and see what meaningful words remain
Here is the link to the code for week 1: https://drive.google.com/file/d/11iZ3uge-vTyyOSasjz0fXk8jX6OEM-vS/view?usp=sharing
In Week 2, we'll learn how to:
Count word frequencies in our processed text (a small teaser follows this list)
Calculate basic text statistics
Analyze patterns in the movie reviews dataset
Prepare for sentiment analysis techniques
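As that teaser, NLTK's FreqDist makes word counting straightforward once text is preprocessed. A quick sketch using the preprocess_text function we built above:
from nltk import FreqDist
tokens = preprocess_text("Great game, great graphics, great value. Great!")
print(FreqDist(tokens).most_common(1))  # [('great', 4)]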
These preprocessing techniques are the foundation of all NLP work. Master them now, and sentiment analysis will make much more sense later!