Week 3 content for the Amazon Review Analyzer project
Hello again and welcome back to the Amazon Review Analyzer project! Last week, you created robust text preprocessing and feature extraction pipelines for the Amazon reviews dataset. This week, we'll take the exciting step of training a machine learning model, quite possibly your first, to classify reviews as real or computer-generated. I'm assuming you chose this project because training your own model sounded fun, so let's dive right in!
Before we jump into coding, let's understand what we're about to do. Machine learning models learn patterns from data by being "trained" on examples. We'll split our processed dataset into two parts:
Training set: The model learns from these examples (80% of our data)
Test set: We use this to evaluate how well our model performs on unseen data (20% of our data)
This approach helps us understand if our model can generalize to new reviews it hasn't seen before, which is crucial for real-world performance.
In your project root, create a new file called “train_model.py”. This will be our main training script that brings together everything we've built so far.
Copy the following imports at the top of your “train_model.py” file:
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.metrics import classification_report, roc_auc_score, confusion_matrix
from xgboost import XGBClassifier
from pathlib import Path
import joblib
import json
These libraries will help us load data, split it for training, train our XGBoost model, and evaluate its performance.
Add code to load the feature-rich dataset you created last week. This is the same kind of loading step we did at the start of last week with the original dataset.
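For reference, a minimal version of that loading step might look like the sketch below. The filename reviews_with_features.csv is just a placeholder, so swap in whatever name and format you actually used when you saved last week's output:
# Load the feature-rich dataset from last week
# NOTE: "reviews_with_features.csv" is a placeholder filename -- use your own
df = pd.read_csv("reviews_with_features.csv")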
Use the print statements below to confirm that you loaded the dataset correctly:
print(f"Dataset loaded with {len(df)} reviews")
print(f"Columns in dataset: {list(df.columns)}")
To give some background, XGBoost is a library that provides a fast, efficient implementation of gradient-boosted decision trees (for the sake of project scope, I'll leave it up to you to look up how gradient boosting works). XGBoost models work with numerical data, so we need to prepare our features and convert our text labels to numbers:
# Prepare features (X) by removing text columns and the label column
X = pd.get_dummies(
    df.drop(columns=["label", "text_", "cleaned_text"]),
    columns=["category"]
)
# Convert labels to numbers: OR (Original/real Reviews) = 1, CG (Computer-Generated) = 0
label_map = {"OR": 1, "CG": 0}
y = df["label"].map(label_map)
print(f"Features shape: {X.shape}")
print(f"Labels distribution:")
print(y.value_counts())
The pd.get_dummies() function converts categorical variables (like product categories) into numerical columns that our model can understand.
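If you are curious what that conversion actually does, here is a tiny example with made-up categories (it reuses the pandas import from the top of your script; the exact output column names depend on the values in your data, and newer pandas versions show True/False instead of 1/0):
# Toy example of one-hot encoding with pd.get_dummies
toy = pd.DataFrame({"category": ["Books", "Electronics", "Books"], "rating": [5, 3, 4]})
print(pd.get_dummies(toy, columns=["category"]))
# The "category" column is replaced by category_Books and category_Electronics,
# with a 1/True marking which category each row belonged to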
Follow this quick and helpful video to create your training and testing datasets. Make the testing size around 20% of the data.
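If you want something to check your work against, a minimal split using the X and y variables from the previous step could look like this:
# Split into training (80%) and testing (20%) sets
# (adding stratify=y is an optional extra that keeps the real/fake balance similar in both sets)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)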
print(f"Training set size: {len(X_train)} reviews")
print(f"Test set size: {len(X_test)} reviews")
Setting random_state=42 ensures that you get the same split every time you run the script, making your results reproducible. If you were to change it to another number, like 9, the training and testing split would be different.
Now for the exciting part - training your model! Add this code:
# Create XGBoost classifier
model = XGBClassifier(
    n_estimators=100,
    max_depth=4,
    use_label_encoder=False,
    eval_metric="logloss",
    random_state=42
)
print("Training the model...")
# Train the model
model.fit(X_train, y_train)
print("Model training completed!")
Let's break down these parameters:
n_estimators=100: Builds 100 decision trees. A decision tree is a sort of flowchart of questions about the features, where each path of answers leads to a prediction
max_depth=4: Limits how deep each decision tree can grow (prevents overfitting). The end of each branch, called a leaf node, holds the final decision for that branch. Overfitting happens when the model fits the training set too closely and, as a result, cannot generalize to new data it hasn't seen
eval_metric="logloss": Uses logarithmic loss to measure performance during training
Add this code to see how well your model performs (I will explain what each metric means in the next step):
# Make predictions on the test set
y_pred = model.predict(X_test)
y_prob = model.predict_proba(X_test)[:, 1] # Get probabilities for the positive class
# Print detailed classification report
print("Classification Report:")
print(classification_report(y_test, y_pred))
# Print confusion matrix
print("\nConfusion Matrix:")
print(confusion_matrix(y_test, y_pred))
# Calculate and print AUC score
auc_score = roc_auc_score(y_test, y_prob)
print(f"\nAUC Score: {auc_score:.4f}")
The classification report will show you:
Precision: Of all reviews your model predicted as fake (computer-generated), how many were actually fake?
Recall: Of all actual fake reviews, how many did your model correctly identify?
F1-score: A single number that balances precision and recall (technically, their harmonic mean). It is only high when both precision and recall are high
AUC Score: A measure of how well your model separates real reviews from computer-generated ones, from 0 to 1, where higher is better (0.5 is no better than random guessing)
The above metrics are all measures from 0 to 1, 1.0 being a perfect score (which would not be a realistic score for most ML cases).
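To make these metrics concrete, here is a small worked example with made-up counts for the fake-review (CG) class, not your actual results:
# Hypothetical counts for the computer-generated (CG) class:
#   90 fake reviews correctly flagged, 10 fake reviews missed,
#   20 real reviews wrongly flagged as fake
precision = 90 / (90 + 20)   # about 0.82: of the reviews flagged as fake, ~82% really were fake
recall = 90 / (90 + 10)      # 0.90: of all the fake reviews, 90% were caught
f1 = 2 * precision * recall / (precision + recall)  # about 0.86
print(precision, recall, f1)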
Add this code to save your trained model:
# Create directory for saving the model
model_dir = Path("./model")
model_dir.mkdir(parents=True, exist_ok=True)
# Save the trained model
joblib.dump(model, model_dir / "review_classifier.pkl")
# Save feature names for future use
feature_names = X.columns.tolist()
with open(model_dir / "feature_names.json", "w") as f:
    json.dump(feature_names, f)
# Save model metadata
model_metadata = {
    "test_auc_score": float(auc_score),
    "num_features": len(feature_names),
    "label_mapping": label_map,
    "training_samples": len(X_train),
    "test_samples": len(X_test)
}
with open(model_dir / "model_metadata.json", "w") as f:
    json.dump(model_metadata, f, indent=2)
print(f"\nModel saved successfully in '{model_dir}' directory!")
Now run your complete training script:
python train_model.py
The script should take a few minutes to run and will output the training progress and final performance metrics.
After running your script, take some time to analyze the results. This is a great habit because it can help you identify what changes your model might need:
What's your AUC score? An AUC above 0.7 is generally considered good, above 0.8 is very good.
Look at the confusion matrix:
How many real reviews did your model correctly identify?
How many fake reviews did it catch?
Where is it making mistakes? Is there an imbalance of true positives and negatives?
Note: with our label mapping (CG = 0, OR = 1), scikit-learn's confusion matrix puts the actual labels on the rows and the predicted labels on the columns. That means the top-left cell counts computer-generated reviews correctly identified as fake, and the bottom-right cell counts real reviews correctly identified as real.
Check precision vs. recall:
Is your model better at identifying real reviews or fake ones?
Would you prefer high precision (fewer false positives) or high recall (catching more fake reviews)?
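If you decide you care more about one of these than the other, one optional experiment (a technique not covered in the steps above) is to change the decision threshold instead of using the default 0.5 that model.predict() applies:
# Optional: use a custom threshold on the predicted probabilities
# y_prob holds the probability of the "real" class (OR = 1), so a higher threshold
# means more reviews get labeled computer-generated -- usually higher recall for fakes,
# at the cost of more real reviews being wrongly flagged
threshold = 0.7  # arbitrary example value, try a few
y_pred_custom = (y_prob >= threshold).astype(int)
print(classification_report(y_test, y_pred_custom))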
After analyzing the results of your first training run, I recommend experimenting with some of the parameters, like n_estimators, and retraining your model. You could also remove features or add new ones to see how your results change.
By the end of this week, you should have:
Successfully trained your first XGBoost model on the Amazon reviews dataset!
Learned how to evaluate model performance using various metrics and used this insight to assess your current feature set
Saved your trained model for future use
Next week, we'll dive deeper into model optimization topics such as grid search (to find the best hyperparameters) and feature importance (to understand which features are most helpful for classification). We'll also learn techniques to improve your model's performance even further.
Congratulations on training your first machine learning model! You've taken a major step towards effectively accomplishing our goal of detecting fake Amazon reviews.