Week 1 content of the Steam Review Analyzer project
Master text preprocessing
Set up the NLTK environment
Build a preprocessing pipeline
Over the next several weeks, we'll be building a Chrome extension that automatically performs sentiment analysis on Steam game reviews. When you visit a Steam game page, our extension will:
Extract game reviews from the page
Analyze the sentiment of each review (positive, negative, neutral)
Provide an overall sentiment score for the game
Display insights about what players think
However, before we dive into the Chrome extension development, we need to understand the Natural Language Processing (NLP) techniques that power sentiment analysis. We'll use Python to learn these fundamentals in a controlled environment, then later apply these concepts to our JavaScript-based Chrome extension.
Learning NLP concepts in Python first offers several advantages:
Clear focus: No web development complexity to distract from NLP learning
Rich ecosystem: Python has excellent NLP libraries like NLTK
Interactive learning: Easy to experiment and see immediate results
Foundation building: Concepts learned here transfer directly to JavaScript
NLTK (Natural Language Toolkit) is Python's premier library for natural language processing. It provides:
Text processing utilities
Linguistic data and corpora
Classification and tokenization tools
Sentiment analysis capabilities
First, install NLTK if you haven't already (pip install nltk), then download the data packages we'll need:
import nltk
# Download required NLTK data packages
nltk.download('movie_reviews')
nltk.download('vader_lexicon')
nltk.download('punkt')
nltk.download('stopwords')
What each download does:
movie_reviews: 2,000 movie reviews labeled as positive or negative (1,000 of each; we verify this below)
vader_lexicon: The lexicon behind the VADER sentiment analyzer we'll use later
punkt: Pre-trained models for the Punkt tokenizer, used by word_tokenize
stopwords: Lists of common words (like "the" and "is") that carry little meaning on their own
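If you want to confirm the downloads worked, here's a quick sanity check (a minimal sketch; the exact stop-word count varies by NLTK version):
from nltk.corpus import movie_reviews, stopwords
# The movie_reviews corpus ships 2,000 labeled reviews
print(len(movie_reviews.fileids()))  # 2000
print(movie_reviews.categories())  # ['neg', 'pos']
# The English stop-word list has roughly 180 entries
print(len(stopwords.words('english')))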
Before we can analyze sentiment, we need to preprocess our text data. Raw text is messy and inconsistent; preprocessing cleans it up for analysis.
Tokenization breaks text into individual words or tokens. It's the first step in most NLP tasks.
from nltk.tokenize import word_tokenize
# Example text
text = "This game is absolutely amazing! I love the graphics and gameplay."
# Tokenize the text
tokens = word_tokenize(text)
print("Original text:", text)
print("Tokens:", tokens)
Output:
Original text: This game is absolutely amazing! I love the graphics and gameplay.
Tokens: ['This', 'game', 'is', 'absolutely', 'amazing', '!', 'I', 'love', 'the', 'graphics', 'and', 'gameplay', '.']
Why tokenization matters:
Separates words from punctuation
Handles contractions and special characters (see the example just below)
Creates a list we can analyze programmatically
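For instance, NLTK's default tokenizer splits contractions into meaningful pieces, which matters later when we match tokens against word lists. A quick illustration:
from nltk.tokenize import word_tokenize
# Contractions are split into their parts: "Don't" -> "Do" + "n't"
print(word_tokenize("Don't buy this game, it isn't fun."))
# ['Do', "n't", 'buy', 'this', 'game', ',', 'it', 'is', "n't", 'fun', '.']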
Lowercasing normalizes text by converting all characters to lowercase. This ensures "Game", "game", and "GAME" are treated as the same word.
# Convert tokens to lowercase
text = "The GRAPHICS are Amazing and the gameplay is EXCELLENT!"
tokens = word_tokenize(text)
# Before lowercasing
print("Original tokens:", tokens)
# After lowercasing
lowercase_tokens = [token.lower() for token in tokens]
print("Lowercase tokens:", lowercase_tokens)
Output:
Original tokens: ['The', 'GRAPHICS', 'are', 'Amazing', 'and', 'the', 'gameplay', 'is', 'EXCELLENT', '!']
Lowercase tokens: ['the', 'graphics', 'are', 'amazing', 'and', 'the', 'gameplay', 'is', 'excellent', '!']
Stop words are common words like "the", "is", "and" that appear frequently but don't carry sentiment. Removing them helps focus on meaningful words.
from nltk.corpus import stopwords
from nltk.tokenize import word_tokenize
# Get English stop words (a set makes membership checks fast)
stop_words = set(stopwords.words('english'))
print("Sample stop words:", stopwords.words('english')[:10])
# Example text
text = "The game is really good and I think the graphics are amazing"
tokens = word_tokenize(text.lower())
# Remove stop words
filtered_tokens = [token for token in tokens if token not in stop_words]
print("Original tokens:", tokens)
print("After removing stop words:", filtered_tokens)
Output:
Sample stop words: ['i', 'me', 'my', 'myself', 'we', 'our', 'ours', 'ourselves', 'you', "you're"]
Original tokens: ['the', 'game', 'is', 'really', 'good', 'and', 'i', 'think', 'the', 'graphics', 'are', 'amazing']
After removing stop words: ['game', 'really', 'good', 'think', 'graphics', 'amazing']
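One caveat worth knowing for sentiment work: NLTK's English stop-word list includes negations such as "not" and "no", and dropping them can flip a review's apparent meaning. A small demonstration, reusing the stop_words set from above:
# "not" is a stop word, so the negation disappears
text = "The game is not good"
tokens = [t for t in word_tokenize(text.lower()) if t not in stop_words]
print(tokens)  # ['game', 'good'] - the negation is gone
This is one reason sentiment tools like VADER are usually run on raw text rather than on stop-word-filtered tokens.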
Punctuation removal eliminates punctuation tokens that would otherwise clutter word-level analysis. (Keep in mind that some sentiment tools, including VADER, actually use punctuation like "!" as a signal, so whether to remove it depends on the downstream task.)
import string
from nltk.tokenize import word_tokenize
text = "This game is amazing!!! The graphics are top-notch, and the story is incredible."
tokens = word_tokenize(text.lower())
# Remove punctuation
no_punct_tokens = [token for token in tokens if token not in string.punctuation]
print("Original tokens:", tokens)
print("After removing punctuation:", no_punct_tokens)
print("Punctuation characters:", string.punctuation)
Output:
Original tokens: ['this', 'game', 'is', 'amazing', '!', '!', '!', 'the', 'graphics', 'are', 'top-notch', ',', 'and', 'the', 'story', 'is', 'incredible', '.']
After removing punctuation: ['this', 'game', 'is', 'amazing', 'the', 'graphics', 'are', 'top-notch', 'and', 'the', 'story', 'is', 'incredible']
Punctuation characters: !"#$%&'()*+,-./:;<=>?@[\]^_`{|}~
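Notice that 'top-notch' survives: the filter compares whole tokens against string.punctuation, so tokens that merely contain punctuation are kept. If you ever need to strip punctuation characters from inside tokens as well, str.translate is one common approach (a minimal sketch; we won't need it for this project):
# Remove punctuation characters from within each token
translator = str.maketrans('', '', string.punctuation)
cleaned = [token.translate(translator) for token in no_punct_tokens]
cleaned = [token for token in cleaned if token]  # drop any now-empty tokens
print(cleaned)  # 'top-notch' becomes 'topnotch'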
Let's combine all preprocessing steps into a single function:
import nltk
from nltk.tokenize import word_tokenize
from nltk.corpus import stopwords
import string
def preprocess_text(text):
"""Complete text preprocessing pipeline"""
# Step 1: Tokenize
tokens = word_tokenize(text)
# Step 2: Convert to lowercase
tokens = [token.lower() for token in tokens]
# Step 3: Remove punctuation
tokens = [token for token in tokens if token not in string.punctuation]
# Step 4: Remove stop words
stop_words = set(stopwords.words('english'))
tokens = [token for token in tokens if token not in stop_words]
return tokens
# Test the complete pipeline
review_text = "This game is absolutely incredible! The graphics are stunning and the gameplay is super engaging. I highly recommend it to everyone!"
original_tokens = word_tokenize(review_text)
processed_tokens = preprocess_text(review_text)
print("Original text:", review_text)
print("Original tokens:", original_tokens)
print("Processed tokens:", processed_tokens)
Output:
Original text: This game is absolutely incredible! The graphics are stunning and the gameplay is super engaging. I highly recommend it to everyone!
Original tokens: ['This', 'game', 'is', 'absolutely', 'incredible', '!', 'The', 'graphics', 'are', 'stunning', 'and', 'the', 'gameplay', 'is', 'super', 'engaging', '.', 'I', 'highly', 'recommend', 'it', 'to', 'everyone', '!']
Processed tokens: ['game', 'absolutely', 'incredible', 'graphics', 'stunning', 'gameplay', 'super', 'engaging', 'highly', 'recommend', 'everyone']
Try preprocessing this game review using the techniques we learned:
# Practice with this review
practice_review = "OMG!!! This game is SO BAD. The controls are terrible, the story makes no sense, and I wasted my money. Don't buy this game!"
# Your task: Apply all preprocessing steps and see what meaningful words remain
Here is the link to the code for week 1: https://drive.google.com/file/d/11iZ3uge-vTyyOSasjz0fXk8jX6OEM-vS/view?usp=sharing
In Week 2, we'll learn how to:
Count word frequencies in our processed text (a small teaser follows this list)
Calculate basic text statistics
Analyze patterns in the movie reviews dataset
Prepare for sentiment analysis techniques
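As that teaser, NLTK's FreqDist makes word counting straightforward once text is preprocessed. A quick sketch using the preprocess_text function we built above:
from nltk import FreqDist
tokens = preprocess_text("Great game, great graphics, great value. Great!")
print(FreqDist(tokens).most_common(1))  # [('great', 4)]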
These preprocessing techniques are the foundation of all NLP work. Master them now, and sentiment analysis will make much more sense later!