Amazon Review Analyzer Week 5

    Week 5 content for the Amazon Review Analyzer project

    By AI Club on 10/27/2025

    Week 5: Streamlit UI

    Welcome to Week 5 of the Amazon Review Analyzer project! You've built, trained, and optimized a powerful XGBoost model. Now it's time to make it accessible by building a user interface (UI). This week, we'll use Streamlit to create an interactive application where anyone can input a review and instantly see if your model thinks it's AI-generated or human-written. We chose Streamlit because it provides a simple, Python-based framework for building interactive interfaces to visualize data and interact with machine learning models. Since our project is already implemented in Python, Streamlit allows for seamless integration without requiring additional web development frameworks.

    1. Setup

    1.1 Install Streamlit

    Streamlit should already be installed in your venv. If it is not, activate your venv and install it:
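
    pip install streamlit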

    1.2 Create the Web App Directory Structure

    Create a new directory called ‘webapp’ in the root of your project.
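
    By the end of the setup steps, your layout should look roughly like this (a sketch; folder names like ‘src’ and ‘scripts’ come from earlier weeks and may differ in your project):

    your-project/
    ├── src/
    ├── scripts/
    └── webapp/
        ├── app.py
        └── utils/
            ├── __init__.py
            └── constants.py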

    1.3 Create the Constants File

    Create ‘webapp/utils/constants.py’ and add this category mapping dictionary. This maps user-friendly category names to the ones used in your dataset:

    CATEGORY_MAPPING = {

        "Unknown": "unknown",

        "Books": "Books",

        "Clothing, Shoes, and Jewelry": "Clothing_Shoes_and_Jewelry",

        "Electronics": "Electronics",

        "Home and Kitchen": "Home_and_Kitchen",

        "Kindle Store": "Kindle_Store",

        "Movies and TV": "Movies_and_TV",

        "Pet Supplies": "Pet_Supplies",

        "Sports and Outdoors": "Sports_and_Outdoors",

        "Tools and Home Improvement": "Tools_and_Home_Improvement",

        "Toys and Games": "Toys_and_Games",

    }

    Important: Verify these category names match the values in the category column your model was trained on! Check your saved feature names (‘feature_names.json’) to confirm. If you are not able to import the category dict into your main file, create an ‘__init__.py’ file in your utils folder and paste in the following code:

    from .constants import CATEGORY_MAPPING

    __all__ = ["CATEGORY_MAPPING"]

    2. Building the Streamlit App

    2.1 Create the Main App File

    Create ‘webapp/app.py’ and start with these imports:

    import streamlit as st

    import pandas as pd

    import torch

    import sys

    from pathlib import Path

    import joblib

    import string

    from nltk.sentiment import SentimentIntensityAnalyzer

    import nltk

    import json

    import spacy

    from collections import Counter

    from utils.constants import CATEGORY_MAPPING # You shouldn’t have to change this unless you placed constants elsewhere

    Then, add the src directory to the path so we can import our preprocess_text function (we have done this before, so look back at your earlier scripts).
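
    As a reference, here is a minimal sketch of that path setup, assuming ‘src’ sits one level above ‘webapp’ and that your preprocessing lives in a module like ‘src/preprocessing.py’ (the module name is an assumption; use your own):

    # Make the src directory importable from the webapp
    sys.path.insert(0, str(Path(__file__).resolve().parent.parent / "src"))

    from preprocessing import preprocess_text  # module name is an assumption; match your project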

    2.2 Create Cached Resource Loaders

    Streamlit's ‘@st.cache_resource’ decorator ensures models are only loaded once. Streamlit reruns your entire script on every user interaction, so without caching, your models would be reloaded on every click, making the app much slower. Add these functions:

    @st.cache_resource

    def get_nlp_models():

        # Download VADER lexicon if not already present and only if you included sentiment analysis as a feature

        try:

            nltk.data.find("sentiment/vader_lexicon.zip")

        except LookupError:

            nltk.download("vader_lexicon", quiet=True)

        

        # Load spaCy model (disable unused components for speed). This will be used to tokenize our reviews

        nlp = spacy.load("en_core_web_sm", disable=["parser", "ner"])

        analyzer = SentimentIntensityAnalyzer()

        

        return nlp, analyzer


    @st.cache_resource

    def get_xgb_model():

        # 1. Create a Path object to your saved model (i.e. model_name.pkl)

        # 2. Load the model using joblib

        return {"best_model": best_model}
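
    For reference, here is one possible implementation, assuming your Week 4 model was saved as ‘scripts/xgb_model/best_model.pkl’ (the file name and path are assumptions; use your own). Returning None on failure lets main() show a friendly error instead of crashing:

    @st.cache_resource
    def get_xgb_model():
        try:
            # Path is relative to the webapp directory; adjust to your layout
            model_path = Path("../scripts/xgb_model/best_model.pkl")
            best_model = joblib.load(model_path)
            return {"best_model": best_model}
        except Exception:
            return None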

    2.3 Feature Extraction Functions

    Now we need to recreate the same feature extraction process you used during training. This is critical because your features must match exactly!

    2.3.1 POS (Part-of-Speech) Features

    Based on your EDA, you should have identified which POS tags are important. First, create a constant called “POS_WHITELIST”. This should be a set of the POS tags you used as features.

    Now, follow these steps to make functions to extract POS features:

    1. Use the same ‘pos_counts’ function from our feature extraction script but add nlp as a parameter in order to tokenize the review

    2. Do the same with ‘add_pos_features’, but again add nlp as a parameter

    Note: You could probably import these functions from our src script, but copying them keeps the web app self-contained and separate from the model training code. These are the kinds of choices that software devs have to make every day! A rough sketch of both functions appears below.
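
    Here is what they could look like, assuming a whitelist of four common coarse tags (these are example tags only; yours should match whatever your EDA selected):

    POS_WHITELIST = {"NOUN", "VERB", "ADJ", "ADV"}  # example tags; use your own

    def pos_counts(text, nlp):
        # Count whitelisted coarse POS tags in the tokenized review
        doc = nlp(text)
        return Counter(token.pos_ for token in doc if token.pos_ in POS_WHITELIST)

    def add_pos_features(df, nlp):
        # Add one column per whitelisted POS tag, defaulting to 0
        counts = df["cleaned_text"].apply(lambda t: pos_counts(t, nlp))
        for tag in POS_WHITELIST:
            df[tag] = counts.apply(lambda c: c.get(tag, 0))
        return df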

    2.3.2 Main Feature Extraction Function

    This function should extract all the features your model was trained on and should look similar to the function we wrote in a previous week for feature extraction:

    def extract_features(text, rating=5.0, include_pos=False):

        cleaned_text = preprocess_text(text)

        nlp, analyzer = get_nlp_models()

        df = pd.DataFrame(

            [

                {

                    "rating": rating,

                    "char_length": len(cleaned_text),

                    "word_count": len(cleaned_text.split()),

                    "punctuation_ct": sum(

                        1 for c in cleaned_text if c in string.punctuation

                    ),

                    "is_extreme_star": rating in [1.0, 5.0],

                    "sentiment_score": analyzer.polarity_scores(cleaned_text)["compound"],

                }

            ]

        )

        df["cleaned_text"] = cleaned_text  # Text should already be cleaned, but let’s just make sure

        if include_pos:

            df = add_pos_features(df, nlp)


        return df

    2.4 Prepare Features for Prediction

    Your model expects features in a specific order with specific column names. This function ensures everything is aligned:

    def prepare_features_for_prediction(text, category="unknown", rating=5.0):

        # TODO: Decide if you used POS features in your final model

        include_pos = True  

        # TODO: call extract_features in order to create a df


        # Load the feature names your model expects

        # TODO: Update path to your saved feature_names.json from Week 4

        with open("../scripts/xgb_model/feature_names.json", "r") as f:

            feature_data = json.load(f)

        

        # TODO: Initialize all category value columns to 0

        # The last 10 features in feature_data are likely your category columns

        # TODO: Set the appropriate category column to 1 based on the input category

        

        # TODO: Ensure all expected features are present in the dataframe. Loop through each feature in feature_data and check if it is in df.columns

        # If a feature is missing, add it with a value of 0.0    

        # Return features in the exact order the model expects
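
    If you get stuck, here is one possible completion. It assumes feature_names.json stores a flat list of column names and that your category columns were one-hot encoded with names like "category_Books" (both are assumptions; check your own saved files):

    def prepare_features_for_prediction(text, category="unknown", rating=5.0):
        include_pos = True  # set to match your final model
        df = extract_features(text, rating=rating, include_pos=include_pos)

        with open("../scripts/xgb_model/feature_names.json", "r") as f:
            feature_data = json.load(f)

        # Zero out every category column, then flip on the matching one
        category_cols = feature_data[-10:]  # assumes the last 10 features are categories
        for col in category_cols:
            df[col] = 0
        target_col = f"category_{category}"  # naming convention is an assumption
        if target_col in df.columns:
            df[target_col] = 1

        # Fill in any expected features we didn't extract
        for feature in feature_data:
            if feature not in df.columns:
                df[feature] = 0.0

        # Return features in the exact order the model expects
        return df[feature_data]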

    2.5 Prediction Function

    Now create the function that actually makes predictions:

    def xgb_predict(text, model, category="unknown", rating=5.0):

        # Prepare features using prepare_features_for_prediction
        features = prepare_features_for_prediction(text, category=category, rating=rating)

        

       # Make prediction

        prediction = model.predict(features)[0]

        probabilities = model.predict_proba(features)[0]

        confidence = probabilities[prediction]


        # Convert prediction to label (0=AI/CG, 1=Human/OR)

        label = "Human" if prediction == 1 else "AI"

        

        return label, confidence, probabilities.tolist()

    3. Building the User Interface

    3.0 Recommended Videos

    Now for the fun part: building the actual user interface! Before you get started, I highly recommend watching the following videos and following along before building your own UI:

    https://www.youtube.com/watch?v=-IM3531b1XU&list=PLXhX6b6y_bWTegYvt-ed5SKTmQUtzwOn4&index=21 

    https://www.youtube.com/watch?v=QetpwPnEpgA&list=PLXhX6b6y_bWTegYvt-ed5SKTmQUtzwOn4&index=2 

    https://www.youtube.com/watch?v=CSv2TBA9_2E&list=PLXhX6b6y_bWTegYvt-ed5SKTmQUtzwOn4&index=13 

    3.1 Create the Main Function

    The following steps to build your UI will be mostly pseudocode, because it would be quite boring if everyone’s UI looked the same. In addition, the videos above and the Streamlit documentation should provide great guidance. I recommend sketching out what you would like your UI to look like before you even start programming it. For example:

    [Image: example hand-drawn sketch of the app layout]

    Now build out the main function:

    def main():

        # Configure the page

        st.set_page_config(

            page_title="Amazon Review Analyzer",

            page_icon="🤖"

        )

        

        # TODO: Title and description

        

        # Load model with a loading spinner

        with st.spinner("Loading XGBoost model..."):

            model_dict = get_xgb_model()

        

        if model_dict is None:

            st.error("Failed to load model. Please check if model files exist.")

            return

        

        st.success("XGBoost model loaded successfully!")

        

        # TODO: Create the input section (see 3.2 below)

        

        # TODO: Create the results section (see 3.3 below)



    if __name__ == "__main__":

        main()

    3.2 Input Section

    Add this code inside ‘main’ to create the input interface:

       # TODO: Create two columns for layout

        

        with col1:

            st.write("**Enter Review Text:**")

            # TODO: Create a text area for review input (st.text_area())

            

            # Optional inputs for better predictions

            col1_input, col2_input = st.columns(2)

            

            with col1_input:

                # TODO: Create a select box for product category

                # Hint: Use the keys from CATEGORY_MAPPING

                       

            with col2_input:

                # TODO: Create a number input for rating (1-5)

            

            # Analyze button

            analyze_button = st.button("Analyze Review", type="primary")
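
    If you want a concrete reference, here is one way to fill in those TODOs (the widget labels and the "review_text" key are my own choices, not requirements):

        col1, col2 = st.columns(2)

        with col1:
            st.write("**Enter Review Text:**")
            input_review = st.text_area("Review text", height=200, key="review_text")

            col1_input, col2_input = st.columns(2)

            with col1_input:
                category = st.selectbox("Product Category", list(CATEGORY_MAPPING.keys()))

            with col2_input:
                rating = st.number_input("Star Rating", min_value=1.0, max_value=5.0, value=5.0, step=1.0)

            analyze_button = st.button("Analyze Review", type="primary")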

    3.3 Results Section

    Add this code to display results in the second column:

       with col2:

            st.write("**Analysis Results:**")

            

            # Only run analysis if button is clicked and there's input

            if analyze_button and input_review.strip():

                with st.spinner("Analyzing with XGBoost model..."):

                    try:

                        # Map user-friendly category to dataset category

                        dataset_category = CATEGORY_MAPPING[category]

                        

                        # TODO: Make prediction with xgb_predict()

                        

                        # TODO: Display the prediction with appropriate styling

                        # Hint: You could use st.error() for AI and st.success() for Human

                        # TODO: Display confidence score

                        

                        # Feature Analysis Expander (check out 3.4 BONUS below!)

                        with st.expander("Feature Analysis"):

                            # TODO: Extract and display the features used for prediction

                            # This helps users understand what the model is "seeing"

                            pass

                        

                    except Exception as e:

                        st.error(f"Error during analysis: {str(e)}")

                        st.exception(e)

            

            elif analyze_button and not input_review.strip():

                st.warning("Please enter a review to analyze!")
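
    For the prediction-display TODOs, here is one minimal approach (variable names follow the input-section sketch above); this code goes inside the try block:

    label, confidence, probabilities = xgb_predict(
        input_review, model_dict["best_model"], category=dataset_category, rating=rating
    )

    if label == "Human":
        st.success(f"Prediction: {label}")
    else:
        st.error(f"Prediction: {label}")
    st.metric("Confidence", f"{confidence:.1%}")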

    3.4 BONUS: Feature Analysis Display

    To make your app more educational, show users what features were extracted. This is also helpful for you to make sure that your model is seeing the correct features:

       # Inside the expander in the results section:

        with st.expander("Feature Analysis"):

            include_pos = True  # Match your model's feature set

            # TODO: get the features using extract_features

            st.write("**Extracted Features:**")

            

            # TODO: Display features in two columns

            # Column 1: Basic features (char_length, word_count, etc.)

            # Column 2: POS features (VERB, NOUN, etc.)

        

            with col1_feat:

                # TODO: Display basic features

                pass

            

            with col2_feat:

                # TODO: Display POS features

                pass
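
    One minimal way to fill in these display TODOs, assuming the feature names from extract_features and a POS_WHITELIST like the sketch in 2.3.1:

        features_df = extract_features(input_review, rating=rating, include_pos=include_pos)
        col1_feat, col2_feat = st.columns(2)

        with col1_feat:
            for name in ["char_length", "word_count", "punctuation_ct", "sentiment_score"]:
                st.write(f"{name}: {features_df[name].iloc[0]}")

        with col2_feat:
            for tag in sorted(POS_WHITELIST):
                st.write(f"{tag}: {features_df[tag].iloc[0]}")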

    4. Running Your App

    4.1 Test Your App Locally

    Navigate to the webapp directory and run:

    streamlit run app.py

    Your browser should automatically open to ‘http://localhost:8501’ where you can interact with your app!

    4.2 Testing Checklist

    Make sure to test:

    • Review text input works

    • Category selection works

    • Rating input works

    • Analyze button makes predictions

    • Results display correctly

    • Feature analysis shows correct values (optional)

    • Error handling works (try submitting empty text)

    • Different review lengths work

    • Both AI and Human predictions work

    4.3 Test with Known Examples

    Try testing with:

    1. A review you know is human-written (maybe from the original dataset)

    2. An obviously AI-generated review (use ChatGPT to generate one)

    3. Edge cases: very short reviews, very long reviews, extreme ratings

    5. BONUS Part 2

    Want to make your app even better? Try these:

    5.1 Add Example Reviews

    # Add this after the title

    st.write("**Try these examples:**")

    example_human = "This product exceeded my expectations! The quality is outstanding."

    example_ai = "This product is good. It works well. I recommend it to others."
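
    # Note: this assumes the review st.text_area was created with key="review_text";
    # st.session_state can only set a widget's value before that widget is instantiated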


    if st.button("Load Human Example"):

        st.session_state.review_text = example_human

    if st.button("Load AI Example"):

        st.session_state.review_text = example_ai

    5.2 Add Model Information

    # TODO: Add a sidebar with model info

        

        # TODO: Load and display model metadata from selection_metadata.json

            

        st.metric("Test AUC Score", f"{metadata['test_auc_best']:.4f}")

        st.metric("Features Used", metadata['num_original_features'])

        

    # TODO: loop through metadata['best_params'] to print out best params
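
    A possible sidebar sketch, assuming selection_metadata.json from Week 4 lives next to your model files and contains the keys shown above (the path is an assumption; use your own):

    with st.sidebar:
        st.header("Model Information")

        with open("../scripts/xgb_model/selection_metadata.json", "r") as f:
            metadata = json.load(f)

        st.metric("Test AUC Score", f"{metadata['test_auc_best']:.4f}")
        st.metric("Features Used", metadata['num_original_features'])

        st.write("**Best Hyperparameters:**")
        for param, value in metadata["best_params"].items():
            st.write(f"{param}: {value}")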

    5.3 Batch Analysis

    Allow users to upload a CSV of reviews:

    st.write("**Or upload multiple reviews:**")

    uploaded_file = st.file_uploader("Choose a CSV file", type="csv")


    if uploaded_file is not None:

        df = pd.read_csv(uploaded_file)

        # TODO: Process each review and display results in a table

    5.4 Explanation Features

    Show which features contributed most to the prediction:

    # Use SHAP values for feature importance
    import shap
    import matplotlib.pyplot as plt

    explainer = shap.TreeExplainer(model_dict["best_model"])
    shap_values = explainer.shap_values(features)

    # Render as a static matplotlib figure; the default interactive force plot
    # is JavaScript-based and won't display via st.pyplot
    shap.force_plot(
        explainer.expected_value, shap_values[0], features.iloc[0],
        matplotlib=True, show=False
    )
    st.pyplot(plt.gcf())

    Wrapping Up

    By the end of this week, you should have:

    • A fully functional Streamlit web application

    • Real-time review classification

    • User-friendly interface with inputs and results

    • Feature visualization capabilities

    • Error handling and validation

    • Testing with multiple review types

    Great job! You've built a complete machine learning project from data cleaning through model inference. You are very close to a portfolio-worthy project that demonstrates some key data analytics and machine learning principles.

    Next Steps

    • Touch on another approach to solving the issue of fake Amazon reviews… BERT

    • Deploy your model and Streamlit app so others can use it!
