
Week 5 content for the Amazon Review Analyzer project
Welcome to Week 5 of the Amazon Review Analyzer project! You've built, trained, and optimized a powerful XGBoost model. Now it's time to make it accessible by building a user interface (UI). This week, we'll use Streamlit to create an interactive application where anyone can input a review and instantly see if your model thinks it's AI-generated or human-written. We chose Streamlit because it provides a simple, Python-based framework for building interactive interfaces to visualize data and interact with machine learning models. Since our project is already implemented in Python, Streamlit allows for seamless integration without requiring additional web development frameworks.
First, Streamlit should already be installed in your venv; if it is not, activate your venv and install it:
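pip install streamlit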
Create a new directory called ‘webapp’ in the root of your project.
Create ‘webapp/utils/constants.py’ and add this category mapping dictionary. This maps user-friendly category names to the ones used in your dataset:
CATEGORY_MAPPING = {
    "Unknown": "unknown",
    "Books": "Books",
    "Clothing, Shoes, and Jewelry": "Clothing_Shoes_and_Jewelry",
    "Electronics": "Electronics",
    "Home and Kitchen": "Home_and_Kitchen",
    "Kindle Store": "Kindle_Store",
    "Movies and TV": "Movies_and_TV",
    "Pet Supplies": "Pet_Supplies",
    "Sports and Outdoors": "Sports_and_Outdoors",
    "Tools and Home Improvement": "Tools_and_Home_Improvement",
    "Toys and Games": "Toys_and_Games",
}
Important: Verify that these category names match the values in the category column your model was trained on! Check your saved feature names (‘feature_names.json’) to confirm. In case you are not able to import the category dict into your main file, create an ‘__init__.py’ file in your utils folder and paste in the following code:
from .constants import CATEGORY_MAPPING
__all__ = ["CATEGORY_MAPPING"]
Create ‘webapp/app.py’ and start with these imports:
import streamlit as st
import pandas as pd
import torch
import sys
from pathlib import Path
import joblib
import string
from nltk.sentiment import SentimentIntensityAnalyzer
import nltk
import json
import spacy
from collections import Counter
from utils.constants import CATEGORY_MAPPING # You shouldn’t have to change this unless you placed constants elsewhere
Then, add the src directory to sys.path so we can import our preprocess_text function (we have done this before, so look back at an earlier week if needed).
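One way to do this, assuming ‘app.py’ lives in ‘webapp/’ and your preprocessing code lives in ‘src/’ at the project root (adjust the folder and module names to your own layout):

# Make the project's src/ directory importable (assumes webapp/ and src/ are siblings)
sys.path.append(str(Path(__file__).resolve().parent.parent / "src"))
from preprocessing import preprocess_text  # hypothetical module name; import from whichever file defines preprocess_text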
Streamlit's ‘@st.cache_resource’ decorator ensures models are only loaded once, making your app much faster. Streamlit reruns your entire script every time a user interacts with the page, so it is important that your models are not reloaded on every rerun. Add these functions:
@st.cache_resource
def get_nlp_models():
    # Download the VADER lexicon if it is not already present (only needed if you included sentiment analysis as a feature)
    try:
        nltk.data.find("sentiment/vader_lexicon.zip")
    except LookupError:
        nltk.download("vader_lexicon", quiet=True)
    # Load the spaCy model (disable unused components for speed); this is used to tokenize reviews
    nlp = spacy.load("en_core_web_sm", disable=["parser", "ner"])
    analyzer = SentimentIntensityAnalyzer()
    return nlp, analyzer
@st.cache_resource
def get_xgb_model():
    # 1. Create a Path object to your saved model (i.e. model_name.pkl)
    # 2. Load the model using joblib
    return {"best_model": best_model}
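If you get stuck, here is one possible shape for this function, assuming your Week 4 model was saved with joblib as ‘best_model.pkl’ under ‘scripts/xgb_model/’ (your filename and path may differ):

@st.cache_resource
def get_xgb_model():
    # Assumed location of the saved model; point this at your own .pkl file
    model_path = Path(__file__).resolve().parent.parent / "scripts" / "xgb_model" / "best_model.pkl"
    if not model_path.exists():
        return None  # main() checks for None and shows an error message
    best_model = joblib.load(model_path)
    return {"best_model": best_model}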
Now we need to recreate the same feature extraction process you used during training. This is critical because your features must match exactly!
2.3.1 POS (Part-of-Speech) Features
Based on your EDA, you should have identified which POS tags are important. First, create a constant called “POS_WHITELIST”. This should be a set of the POS tags you used as features.
Now, follow these steps to make functions to extract POS features:
Use the same ‘pos_counts’ function from our feature extraction script but add nlp as a parameter in order to tokenize the review
Do the same with ‘add_pos_features’, but again add nlp as a parameter
Note: You could probably import these functions from our src script, but it is easier to copy them in order to separate model logic from our web app directory. These are the kinds of choices that software devs have to make every day!
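In case it helps, here is a rough sketch of what the two helpers could look like, assuming ‘pos_counts’ tallies whitelisted POS tags in a review and ‘add_pos_features’ adds one count column per tag (mirror your own implementations from the feature extraction week rather than copying this verbatim):

POS_WHITELIST = {"NOUN", "VERB", "ADJ", "ADV"}  # assumed tags; use the ones from your EDA

def pos_counts(text, nlp):
    # Tokenize with spaCy and count only the whitelisted POS tags
    doc = nlp(text)
    return Counter(token.pos_ for token in doc if token.pos_ in POS_WHITELIST)

def add_pos_features(df, nlp):
    # Add one column per whitelisted POS tag, defaulting to 0 when a tag does not appear
    for tag in POS_WHITELIST:
        df[tag] = df["cleaned_text"].apply(lambda text: pos_counts(text, nlp).get(tag, 0))
    return df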
2.3.2 Main Feature Extraction Function
This function should extract all the features your model was trained on and should look similar to the function we wrote in a previous week for feature extraction:
def extract_features(text, rating=5.0, include_pos=False):
    cleaned_text = preprocess_text(text)
    nlp, analyzer = get_nlp_models()
    df = pd.DataFrame(
        [
            {
                "rating": rating,
                "char_length": len(cleaned_text),
                "word_count": len(cleaned_text.split()),
                "punctuation_ct": sum(
                    1 for c in cleaned_text if c in string.punctuation
                ),
                "is_extreme_star": rating in [1.0, 5.0],
                "sentiment_score": analyzer.polarity_scores(cleaned_text)["compound"],
            }
        ]
    )
    # Keep the cleaned text around so the POS functions can tokenize it
    df["cleaned_text"] = cleaned_text
    if include_pos:
        df = add_pos_features(df, nlp)
    return df
Your model expects features in a specific order with specific column names. This function ensures everything is aligned:
def prepare_features_for_prediction(text, category="unknown", rating=5.0):
    # TODO: Decide if you used POS features in your final model
    include_pos = True
    # TODO: Call extract_features in order to create a df

    # Load the feature names your model expects
    # TODO: Update the path to your saved feature_names.json from Week 4
    with open("../scripts/xgb_model/feature_names.json", "r") as f:
        feature_data = json.load(f)

    # TODO: Initialize all category value columns to 0
    #       (the last 10 features in feature_data are likely your category columns)
    # TODO: Set the appropriate category column to 1 based on the input category
    # TODO: Ensure all expected features are present in the dataframe: loop through each feature in feature_data and check if it is in df.columns
    #       If a feature is missing, add it with a value of 0.0
    # TODO: Return the features in the exact order the model expects
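For reference, here is one way the finished function could come together, assuming ‘feature_names.json’ holds a flat list of column names and your one-hot category columns are named like "category_Books" (check your own file and adjust):

def prepare_features_for_prediction(text, category="unknown", rating=5.0):
    include_pos = True  # set to False if your final model did not use POS features
    df = extract_features(text, rating=rating, include_pos=include_pos)

    # Load the list of feature names the model was trained on (assumed path)
    with open("../scripts/xgb_model/feature_names.json", "r") as f:
        feature_data = json.load(f)

    # Assumed naming convention for the one-hot category columns
    category_cols = [name for name in feature_data if name.startswith("category_")]
    for col in category_cols:
        df[col] = 0
    if f"category_{category}" in category_cols:
        df[f"category_{category}"] = 1

    # Fill any remaining missing features with a neutral default, then reorder
    for name in feature_data:
        if name not in df.columns:
            df[name] = 0.0
    return df[feature_data]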
Now create the function that actually makes predictions:
def xgb_predict(text, model, category="unknown", rating=5.0):
    # TODO: Prepare features using prepare_features_for_prediction

    # Make the prediction
    prediction = model.predict(features)[0]
    probabilities = model.predict_proba(features)[0]
    confidence = probabilities[prediction]

    # Convert the prediction to a label (0 = AI/CG, 1 = Human/OR)
    label = "Human" if prediction == 1 else "AI"
    return label, confidence, probabilities.tolist()
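The remaining TODO is a single line placed before the predict calls; something like this, assuming the function signature sketched above:

features = prepare_features_for_prediction(text, category=category, rating=rating)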
Now for the fun part: building the actual user interface! Before you get started, I highly recommend watching the following videos and following along before you jump into the steps for creating your own UI:
https://www.youtube.com/watch?v=-IM3531b1XU&list=PLXhX6b6y_bWTegYvt-ed5SKTmQUtzwOn4&index=21
https://www.youtube.com/watch?v=QetpwPnEpgA&list=PLXhX6b6y_bWTegYvt-ed5SKTmQUtzwOn4&index=2
https://www.youtube.com/watch?v=CSv2TBA9_2E&list=PLXhX6b6y_bWTegYvt-ed5SKTmQUtzwOn4&index=13
The following steps to build your UI will be mostly pseudocode because it would be quite boring if everyone's UI looked the same. In addition, the videos above and the Streamlit documentation should be of great guidance. I also recommend sketching out what you would like your UI to look like before you even start programming it.
Now build out the main function:
def main():
    # Configure the page
    st.set_page_config(
        page_title="Amazon Review Analyzer",
        page_icon="🤖"
    )

    # TODO: Title and description

    # Load the model with a loading spinner
    with st.spinner("Loading XGBoost model..."):
        model_dict = get_xgb_model()
    if model_dict is None:
        st.error("Failed to load model. Please check if model files exist.")
        return
    st.success("XGBoost model loaded successfully!")

    # TODO: Create the input section (see 3.2 below)
    # TODO: Create the results section (see 3.3 below)


if __name__ == "__main__":
    main()
Add this code inside ‘main’ to create the input interface:
# TODO: Create two columns for layout
with col1:
    st.write("**Enter Review Text:**")
    # TODO: Create a text area for review input (st.text_area())

    # Optional inputs for better predictions
    col1_input, col2_input = st.columns(2)
    with col1_input:
        # TODO: Create a select box for product category
        # Hint: Use the keys from CATEGORY_MAPPING
    with col2_input:
        # TODO: Create a number input for rating (1-5)

    # Analyze button
    analyze_button = st.button("Analyze Review", type="primary")
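If you want a concrete starting point, one possible arrangement of these widgets looks like this (the labels, keys, and defaults are just suggestions, not the required design):

col1, col2 = st.columns(2)
with col1:
    st.write("**Enter Review Text:**")
    input_review = st.text_area("Review text", height=200, key="review_text")

    # Optional inputs for better predictions
    col1_input, col2_input = st.columns(2)
    with col1_input:
        category = st.selectbox("Product category", list(CATEGORY_MAPPING.keys()))
    with col2_input:
        rating = st.number_input("Star rating", min_value=1.0, max_value=5.0, value=5.0, step=1.0)

    # Analyze button
    analyze_button = st.button("Analyze Review", type="primary")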
Add this code to display results in the second column:
with col2:
    st.write("**Analysis Results:**")
    # Only run the analysis if the button is clicked and there's input
    if analyze_button and input_review.strip():
        with st.spinner("Analyzing with XGBoost model..."):
            try:
                # Map the user-friendly category to the dataset category
                dataset_category = CATEGORY_MAPPING[category]
                # TODO: Make a prediction with xgb_predict()
                # TODO: Display the prediction with appropriate styling
                # Hint: You could use st.error() for AI and st.success() for Human
                # TODO: Display the confidence score

                # Feature Analysis expander (check out 3.4 BONUS below!)
                with st.expander("Feature Analysis"):
                    # TODO: Extract and display the features used for the prediction
                    # This helps users understand what the model is "seeing"
                    pass
            except Exception as e:
                st.error(f"Error during analysis: {str(e)}")
                st.exception(e)
    elif analyze_button and not input_review.strip():
        st.warning("Please enter a review to analyze!")
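As one example of how the prediction TODOs inside the spinner might be filled in (the variable names assume the input sketch shown earlier; adapt them to your own widgets):

label, confidence, probabilities = xgb_predict(
    input_review, model_dict["best_model"], category=dataset_category, rating=rating
)
if label == "Human":
    st.success(f"Prediction: {label}-written review")
else:
    st.error(f"Prediction: {label}-generated review")
st.metric("Confidence", f"{confidence:.1%}")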
To make your app more educational, show users what features were extracted. This is also helpful for you to make sure that your model is seeing the correct features:
# Inside the expander in the results section:
with st.expander("Feature Analysis"):
    include_pos = True  # Match your model's feature set
    # TODO: Get the features using extract_features
    st.write("**Extracted Features:**")
    # Display the features in two columns:
    #   Column 1: basic features (char_length, word_count, etc.)
    #   Column 2: POS features (VERB, NOUN, etc.)
    col1_feat, col2_feat = st.columns(2)
    with col1_feat:
        # TODO: Display the basic features
        pass
    with col2_feat:
        # TODO: Display the POS features
        pass
Navigate to the webapp directory and run:
streamlit run app.py
Your browser should automatically open to ‘http://localhost:8501’ where you can interact with your app!
Make sure to test:
Review text input works
Category selection works
Rating input works
Analyze button makes predictions
Results display correctly
Feature analysis shows correct values (optional)
Error handling works (try submitting empty text)
Different review lengths work
Both AI and Human predictions work
Try testing with:
A review you know is human-written (maybe from the original dataset)
An obviously AI-generated review (use ChatGPT to generate one)
Edge cases: very short reviews, very long reviews, extreme ratings
Want to make your app even better? Try these:
# Add this after the title
st.write("**Try these examples:**")
example_human = "This product exceeded my expectations! The quality is outstanding."
example_ai = "This product is good. It works well. I recommend it to others."
# Note: pre-filling via session state assumes your review text area was created with key="review_text"
if st.button("Load Human Example"):
    st.session_state.review_text = example_human
if st.button("Load AI Example"):
    st.session_state.review_text = example_ai
# TODO: Add a sidebar with model info
# TODO: Load and display model metadata from selection_metadata.json
st.metric("Test AUC Score", f"{metadata['test_auc_best']:.4f}")
st.metric("Features Used", metadata['num_original_features'])
# TODO: Loop through metadata['best_params'] to print out the best hyperparameters
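Here is a sketch of what that sidebar could look like, assuming ‘selection_metadata.json’ from Week 4 contains the keys used below and lives next to your model files (adjust the path as needed):

with st.sidebar:
    st.header("Model Info")
    # Assumed path; point this at your own selection_metadata.json
    with open("../scripts/xgb_model/selection_metadata.json", "r") as f:
        metadata = json.load(f)
    st.metric("Test AUC Score", f"{metadata['test_auc_best']:.4f}")
    st.metric("Features Used", metadata['num_original_features'])
    st.write("**Best hyperparameters:**")
    for param, value in metadata["best_params"].items():
        st.write(f"- {param}: {value}")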
Allow users to upload a CSV of reviews:
st.write("**Or upload multiple reviews:**")
uploaded_file = st.file_uploader("Choose a CSV file", type="csv")
if uploaded_file is not None:
    df = pd.read_csv(uploaded_file)
    # TODO: Process each review and display the results in a table
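One way to finish that TODO, assuming the uploaded CSV has a "text" column with one review per row (rename the column to match your file):

results = []
for review in df["text"]:  # assumes a "text" column in the uploaded CSV
    label, confidence, _ = xgb_predict(review, model_dict["best_model"])
    results.append({"review": review, "prediction": label, "confidence": f"{confidence:.1%}"})
st.dataframe(pd.DataFrame(results))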
Show which features contributed most to the prediction:
# Use SHAP values for feature importance (requires: pip install shap)
import shap

explainer = shap.TreeExplainer(model_dict["best_model"])
shap_values = explainer.shap_values(features)
# Render the force plot for the single review as a matplotlib figure so Streamlit can display it
fig = shap.force_plot(explainer.expected_value, shap_values[0, :], features.iloc[0, :], matplotlib=True, show=False)
st.pyplot(fig)
By the end of this week, you should have:
A fully functional Streamlit web application
Real-time review classification
User-friendly interface with inputs and results
Feature visualization capabilities
Error handling and validation
Testing with multiple review types
Great job! You've built a complete machine learning project from data cleaning through model inference. You are very close to a portfolio-worthy project that demonstrates some key data analytics and machine learning principles.
Coming up next:
Touch on another approach to solving the issue of fake Amazon reviews: BERT
Deploy your model and Streamlit app so others can use it!