Week 2 content for the Amazon Review Analyzer project
Welcome back to the Amazon Review Analyzer project! Last week, you set up your environment, downloaded the dataset, and performed some basic Exploratory Data Analysis (EDA). This week, we'll dive deeper into cleaning the text data and extracting features that will help our model learn to distinguish between real and computer-generated reviews. Don't worry if some of these steps sound intimidating: text preprocessing and feature engineering are some of the most fun and creative parts of machine learning!
Before diving in, I want to touch on what Git is and how to set it up, in case this was a struggle in Week 1. If you already have a repo set up and know how to use Git, you may skip ahead to step 1.
Git is what is called a "version control" tool: it lets us save and keep a record of our code as it evolves over time. GitHub is a website that hosts a copy of this record online. Git also makes it possible for multiple people to work on the same code base at the same time without overwriting each other's work.
0.1 Installing Git
Here's how to install Git on your system:
Windows:
Go to https://git-scm.com/download/win and download the installer
Run the installer, accepting the default options
Make sure "Git from the Command Line and also from 3rd-party software" is selected
Complete the installation and open Command Prompt
Type git --version to verify the installation
macOS:
Open Terminal
Type xcode-select --install and follow the prompts (Or if you use Homebrew: brew install git)
Type git --version to verify
After installation, configure your identity. Run the following commands in your command line:
git config --global user.name "Your Name"
git config --global user.email "your.email@example.com"
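You can confirm both values were saved by running:
git config --global --list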
0.2 Setting up GitHub and Authentication
Create a Personal Access Token (PAT):
Click your profile picture → Settings → Developer settings → Personal access tokens → Tokens (classic)
Click "Generate new token (classic)"
Give it a name
Select scopes: at minimum, check "repo" and "workflow"
Copy the token immediately and store it somewhere safe, such as a local text file; you won't be able to see it again!
Configure Git to remember your credentials:
git config --global credential.helper store
Then the first time you push to GitHub (more on that later), enter your username and use your PAT as the password. Git will remember it for future uses.
0.3 Create a repository for your project
On GitHub, create a new repository and give it Public visibility. (You may have done this already in Week 1.)
Go to that repo and click the "Code" tab, then copy the link within the "Quick setup" box
In the IDE of your choice (I recommend VS Code, since that's what was used to build this project), clone the repo by pasting the link you copied in the previous step. In VS Code, you can do this by clicking the "Clone Git Repository" button and pasting the link into the input field. Then choose a sensible folder on your computer for the project to be cloned into.
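If you prefer the terminal, cloning looks like this (the URL below is just a placeholder for the one you copied):
git clone https://github.com/your-username/your-repo.git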
In the root of your new directory, create a .gitignore file, which tells our version control software (Git) what not to track or share publicly. We'll eventually add plenty of files to the directory that we either don't want to share or don't need to track; as we build the project, we'll list their paths here.
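If you want a starting point, a few common Python entries look like this (just a suggestion; you'll add more as the project grows):
# Python cache and build artifacts
__pycache__/
*.pyc
# Virtual environment folder, if you create one
.venv/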
0.4 Push your changes
Every week after completing the content, you should push your changes from your local copy to your remote repo on GitHub. This ensures the code on your remote is always in a working state, even while you are making changes locally. There are GUI tools for this, such as GitKraken, but here is how to do it from the terminal:
git add .
git commit -m "Your commit message"
git push origin main
In addition, here is the link to the git docs that I highly recommend exploring: https://git-scm.com/docs
Raw Amazon reviews often contain noise such as HTML tags, links, special characters, and inconsistent casing. Machine learning models usually don't perform well on raw text, so we need to clean and standardize it first. Let's create a script that cleans the reviews before we feed them into a model.
1.1 Create a new file
In your project root, create a new folder called "src" (for source), and within that folder create a file called "preprocess.py". We want this file to live in the src folder because it will be used in several other places throughout our program.
1.2 Add preprocessing function
Copy the following starter code into preprocess.py:
import re
re is Python's regular expression (regex) library. Regex is used in many programs to parse text and match patterns for replacing, editing, and more.
def preprocess_text(text):
    text = text.lower()  # Convert text to lowercase
    # Remove HTML tags
    text = re.sub(r"<.*?>", "", text)
    # Remove links from text
    text = re.sub(r"http\S+|www\S+", "", text)
    return text.strip()  # Strip remaining whitespace around text
This code will take in text, convert it to lowercase, and remove items like HTML tags and links that could be in a review and mess with our model.
Test this out by calling the function with the string "ML is <fun>" and printing the result. You can run this file from the terminal by typing cd src from the project root and then running it with Python, as we've done before. The printed text should be "ml is".
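One simple way to do that is a small test block at the bottom of preprocess.py (just a sketch; feel free to delete it afterward):
if __name__ == "__main__":
    # Quick sanity check: HTML tags should be stripped and the casing lowered
    print(preprocess_text("ML is <fun>"))  # should print: ml is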
Now that we have clean text, let's create some features that represent our data numerically so our XGBoost model can interpret it. Remember the features you uncovered while doing EDA in Week 1? We'll create similar features that capture useful patterns like review length, punctuation count, or even part-of-speech counts. These will give our model better clues when deciding whether a review looks fake or real.
2.1 Add basic feature columns to your dataframe
In a new file called "feature_extraction.py", in src, follow the instructions below to add basic features to your dataframe (a reference sketch follows the list). Some of these columns may already exist after performing EDA in Week 1:
Create a function named extract_features that takes in a dataframe (df).
Add a second optional argument called include_pos (default = False).
Add a new column/category to your df called char_length.
Hint: You can use apply(len) on the "cleaned_text" column to count how many characters are in each review. "cleaned_text" will be added to our df later, but you can be sure it will exist.
Add a new column called "word_count". Once again, use the "cleaned_text" column to find the word count.
Add a new column called "punctuation_ct"
You’ll need to loop through each character in the text and count how many belong to string.punctuation
Add a new column called "is_extreme_star".
Use df["rating"].isin([1.0, 5.0]) to check if a rating is either 1.0 or 5.0 (considered "extreme").
End the function by returning the modified dataframe with all new features added.
Note: Ignore the fact that we don't actually have a "cleaned_text" column on our dataframe yet; that will be added soon.
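To keep you on track, here's one way the steps above could come together. Treat it as a reference sketch rather than the official solution; the column names follow the instructions above, and the include_pos branch gets filled in during section 2.3:
import string  # More imports will join this in section 2.3


def extract_features(df, include_pos=False):
    # Character length of each cleaned review
    df["char_length"] = df["cleaned_text"].apply(len)
    # Word count: split on whitespace and count the pieces
    df["word_count"] = df["cleaned_text"].apply(lambda text: len(text.split()))
    # Punctuation count: loop over the characters and tally the ones in string.punctuation
    df["punctuation_ct"] = df["cleaned_text"].apply(
        lambda text: sum(1 for ch in text if ch in string.punctuation)
    )
    # Flag ratings of 1.0 or 5.0 as "extreme"
    df["is_extreme_star"] = df["rating"].isin([1.0, 5.0])
    # In section 2.3 you'll add:
    # if include_pos:
    #     df = add_pos_features(df)
    return df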
2.2 Brainstorm more complicated features
Before continuing on, think about some features or patterns the data may display that will help our model in classifying real and fake reviews. For example, the count of each part of speech could be a valuable feature if computer-generated reviews use more adjectives than human ones on average.
2.3 Implement part-of-speech tagging as a feature
We're going to add feature columns for part-of-speech (POS) counts to give your model even more context about the text it's receiving.
First, add the libraries necessary to make this happen at the top of your feature extraction file:
import pandas as pd
import string
import spacy # Tokenize text so we can count each POS
from collections import Counter # To keep a dictionary of the count of each POS
nlp = spacy.load("en_core_web_sm", disable=["parser", "ner"])
Note: NLP stands for natural language processing. The nlp variable holds spaCy's small English model, with the parser and named-entity recognizer disabled since we only need POS tags.
Next, we're going to add the following functions and calls to them at the end of this file:
POS_WHITELIST = {"VERB", "NOUN", "ADV"}
def pos_counts(text):
    doc = nlp(text)  # tokenizes the text
    # Count the number of each POS in the tokenized text, but only for the whitelisted POS
    return Counter(
        token.pos_ for token in doc if token.pos_ in POS_WHITELIST
    )

def add_pos_features(df):
    pos_data = df["cleaned_text"].apply(pos_counts)
    pos_df = pd.DataFrame(list(pos_data)).fillna(0)  # fill null counts with 0
    pos_df.index = df.index  # align rows with the original df (dataframe)
    return pd.concat([df, pos_df], axis=1)

# Add this right before returning the dataframe in extract_features
if include_pos:
    df = add_pos_features(df)
You have now added POS counts for verbs, nouns, and adverbs to your dataframe.
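If you'd like to sanity-check the POS helpers before wiring everything together, a quick throwaway test like this works (the sample sentence is just an example, and it requires the en_core_web_sm download covered in the note near the end of this section):
# Temporary check: add to the bottom of feature_extraction.py and run it directly
sample = pd.DataFrame({"cleaned_text": ["the quick dog runs very fast"]})
print(add_pos_features(sample))
# Expect NOUN, VERB, and ADV count columns to appear next to cleaned_text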
To make sure your data actually gets preprocessed and its features extracted, create a new file in the root of your project that performs these operations and produces a ready-to-train dataset. Start it with the following imports:
import pandas as pd
from pathlib import Path # To manipulate file paths easier as objects rather than just strings
import sys
sys.path.append(str(Path(__file__).resolve().parent / "src"))
Follow these steps for the rest (a full reference sketch appears after the note below):
You’ll need to bring in the text preprocessing function and the feature extraction function you just wrote. This will look similar to importing Python libraries
Create a variable that points to the CSV file containing the reviews (e.g., "fake reviews dataset.csv"). This will need to be a call to Path, similar to the one above that creates a system path to "src".
Read the CSV into a Pandas dataframe.
Add a new column called "cleaned_text" and apply the preprocess_text function to clean each review.
Decide whether to include part-of-speech tagging (there's little reason not to) and store that choice in a boolean variable.
Pass your dataframe into extract_features(df, include_pos) and overwrite df with the result.
Pick a file path for the new dataset (e.g., "processed-dataset.csv").
Hint: Again, use Path(__file__).resolve().parent / "processed-dataset.csv". Store this in a variable
Save your modified dataframe to CSV with df.to_csv(..., index=False).
Note: before running this file, run python -m spacy download en_core_web_sm (or py -m spacy download en_core_web_sm on Windows) from the root of your project. This downloads the spaCy language model needed for the POS counts.
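If you get stuck wiring this file together, here's a sketch of what it could look like. The filename (I'll call it build_dataset.py) and the TEXT_COLUMN value are assumptions on my part; match them to your own files and to the raw review text column you saw during Week 1's EDA:
import sys
from pathlib import Path  # Treat file paths as objects rather than plain strings

import pandas as pd

# Make the src folder importable
sys.path.append(str(Path(__file__).resolve().parent / "src"))

from preprocess import preprocess_text
from feature_extraction import extract_features

# Path to the raw reviews CSV
DATA_PATH = Path(__file__).resolve().parent / "fake reviews dataset.csv"

# Name of the raw review text column (adjust to whatever your dataset uses)
TEXT_COLUMN = "text"

# Read the CSV into a dataframe
df = pd.read_csv(DATA_PATH)

# Clean every review and store the result in a new column
df["cleaned_text"] = df[TEXT_COLUMN].apply(preprocess_text)

# Toggle part-of-speech features
include_pos = True

# Extract all features
df = extract_features(df, include_pos)

# Save the processed dataset next to this script
OUTPUT_PATH = Path(__file__).resolve().parent / "processed-dataset.csv"
df.to_csv(OUTPUT_PATH, index=False)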
Make sure to run this file before moving on. It may take a few minutes to finish running because you're processing a lot of data. You can open up the "processed-dataset" CSV in VS Code to double-check that the new columns exist.
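If the CSV is too large to open comfortably, a quicker check (just a sketch, run from the project root) is to print the column names with pandas:
import pandas as pd
print(pd.read_csv("processed-dataset.csv").columns.tolist())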
If you want to go above and beyond, try adding even more features to your model. For example, you could add sentiment analysis scores for each review (I recommend the nltk library for this) and include more parts of speech in the POS tagging. There are endless possibilities, and I encourage you to add more features because it will help with your model's performance.
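As a sketch of the sentiment idea (one option among many, not a requirement), NLTK's VADER analyzer can turn each cleaned review into a single score you add as another column; you'd need to pip install nltk first:
import nltk
from nltk.sentiment import SentimentIntensityAnalyzer

nltk.download("vader_lexicon")  # one-time download of VADER's word scores
sia = SentimentIntensityAnalyzer()

# The compound score ranges from -1 (very negative) to +1 (very positive)
df["sentiment"] = df["cleaned_text"].apply(
    lambda text: sia.polarity_scores(text)["compound"]
)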
By the end of this week, you should have:
A preprocessing script that standardizes and cleans raw review text
Several new features extracted from the dataset (length, word counts, punctuation count, etc.)
A saved dataset (processed-dataset.csv) that will be used for model training in upcoming weeks
We'll then do some initial training to see your model's starting performance metrics and look at feature importance to see which features are actually being used.
Great progress so far! You and your dataset are getting smarter every step of the way.
Here's a video on feature engineering that I found useful while performing this part of the project: https://www.youtube.com/watch?v=ft77eXtn30Q&list=PLXhX6b6y_bWTegYvt-ed5SKTmQUtzwOn4&index=4