Week 3 content for the Amazon Review Analyzer project
Hello again and welcome back to the Amazon Review Analyzer project! Last week, you created robust text preprocessing and feature extraction pipelines for the Amazon reviews dataset. This week, we'll take the exciting step of training a machine learning model, quite possibly your first, to classify reviews as real or computer-generated. I'm assuming you chose this project because training your own model sounded fun, so let's dive right in!
Before we jump into coding, let's understand what we're about to do. Machine learning models learn patterns from data by being "trained" on examples. We'll split our processed dataset into two parts:
Training set: The model learns from these examples (80% of our data)
Test set: We use this to evaluate how well our model performs on unseen data (20% of our data)
This approach helps us understand if our model can generalize to new reviews it hasn't seen before, which is crucial for real-world performance.
In your project root, create a new file called “train_model.py”. This will be our main training script that brings together everything we've built so far.
Copy the following imports at the top of your “train_model.py” file:
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.metrics import classification_report, roc_auc_score, confusion_matrix
from xgboost import XGBClassifier
from pathlib import Path
import joblib
import json
These libraries will help us load data, split it for training, train our XGBoost model, and evaluate its performance.
Add code to load the feature-rich dataset you created last week. This is the same kind of loading step we did at the start of last week with the original dataset.
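For reference, a minimal version of that loading step might look like the sketch below. The filename reviews_with_features.csv is just a placeholder, so swap in whatever name and format you actually used when you saved last week's output:
# Load the feature-rich dataset from last week
# NOTE: "reviews_with_features.csv" is a placeholder filename -- use your own
df = pd.read_csv("reviews_with_features.csv")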
Use the print statements below to confirm that you loaded the dataset correctly:
print(f"Dataset loaded with {len(df)} reviews")
print(f"Columns in dataset: {list(df.columns)}")
To give some background, XGBoost is a library that provides a fast, efficient implementation of gradient-boosted decision trees (for the sake of project scope, I'll leave it up to you to look up how gradient boosting works). XGBoost models work with numerical data, so we need to prepare our features and convert our text labels to numbers:
# Prepare features (X) by removing text columns and the label column
X = pd.get_dummies(
    df.drop(columns=["label", "text_", "cleaned_text"]),
    columns=["category"]
)
# Convert labels to numbers: OR (Original/real Reviews) = 1, CG (Computer-Generated) = 0
label_map = {"OR": 1, "CG": 0}
y = df["label"].map(label_map)
print(f"Features shape: {X.shape}")
print(f"Labels distribution:")
print(y.value_counts())
The pd.get_dummies() function converts categorical variables (like product categories) into numerical columns that our model can understand.
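If you are curious what that conversion actually does, here is a tiny example with made-up categories (it reuses the pandas import from the top of your script; the exact output column names depend on the values in your data, and newer pandas versions show True/False instead of 1/0):
# Toy example of one-hot encoding with pd.get_dummies
toy = pd.DataFrame({"category": ["Books", "Electronics", "Books"], "rating": [5, 3, 4]})
print(pd.get_dummies(toy, columns=["category"]))
# The "category" column is replaced by category_Books and category_Electronics,
# with a 1/True marking which category each row belonged to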
Follow this quick and helpful video to create your training and testing datasets. Make the testing size around 20% of the data.
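If you want something to check your work against, a minimal split using the X and y variables from the previous step could look like this:
# Split into training (80%) and testing (20%) sets
# (adding stratify=y is an optional extra that keeps the real/fake balance similar in both sets)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)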
print(f"Training set size: {len(X_train)} reviews")
print(f"Test set size: {len(X_test)} reviews")
Setting random_state=42 ensures that you get the same split every time you run the script, making your results reproducible. If you were to change it to another number, like 9, the training and testing split would be different.
Now for the exciting part - training your model! Add this code:
# Create XGBoost classifier
model = XGBClassifier(
    n_estimators=100,
    max_depth=4,
    use_label_encoder=False,
    eval_metric="logloss",
    random_state=42
)
print("Training the model...")
# Train the model
model.fit(X_train, y_train)
print("Model training completed!")
Let's break down these parameters:
n_estimators=100: Builds 100 decision trees. A decision tree is a sort of flowchart of questions about the features, where each path of answers leads to a prediction
max_depth=4: Limits how deep each decision tree can grow (prevents overfitting). The end of each branch, called a leaf node, holds the final decision for that branch. Overfitting happens when the model fits the training set too closely and, as a result, cannot generalize to new data it hasn't seen
eval_metric="logloss": Uses logarithmic loss to measure performance during training
Add this code to see how well your model performs (I will explain what each metric means in the next step):
# Make predictions on the test set
y_pred = model.predict(X_test)
y_prob = model.predict_proba(X_test)[:, 1] # Get probabilities for the positive class
# Print detailed classification report
print("Classification Report:")
print(classification_report(y_test, y_pred))
# Print confusion matrix
print("\nConfusion Matrix:")
print(confusion_matrix(y_test, y_pred))
# Calculate and print AUC score
auc_score = roc_auc_score(y_test, y_prob)
print(f"\nAUC Score: {auc_score:.4f}")
The classification report will show you:
Precision: Of all reviews your model predicted as fake (computer-generated), how many were actually fake?
Recall: Of all actual fake reviews, how many did your model correctly identify?
F1-score: A single number that balances precision and recall (technically, their harmonic mean). It is only high when both precision and recall are high
AUC Score: A measure of how well your model separates real reviews from computer-generated ones, from 0 to 1, where higher is better (0.5 is no better than random guessing)
The above metrics are all measures from 0 to 1, 1.0 being a perfect score (which would not be a realistic score for most ML cases).
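To make these metrics concrete, here is a small worked example with made-up counts for the fake-review (CG) class, not your actual results:
# Hypothetical counts for the computer-generated (CG) class:
#   90 fake reviews correctly flagged, 10 fake reviews missed,
#   20 real reviews wrongly flagged as fake
precision = 90 / (90 + 20)   # about 0.82: of the reviews flagged as fake, ~82% really were fake
recall = 90 / (90 + 10)      # 0.90: of all the fake reviews, 90% were caught
f1 = 2 * precision * recall / (precision + recall)  # about 0.86
print(precision, recall, f1)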
Add this code to save your trained model:
# Create directory for saving the model
model_dir = Path("./model")
model_dir.mkdir(parents=True, exist_ok=True)
# Save the trained model
joblib.dump(model, model_dir / "review_classifier.pkl")
# Save feature names for future use
feature_names = X.columns.tolist()
with open(model_dir / "feature_names.json", "w") as f:
    json.dump(feature_names, f)
# Save model metadata
model_metadata = {
    "test_auc_score": float(auc_score),
    "num_features": len(feature_names),
    "label_mapping": label_map,
    "training_samples": len(X_train),
    "test_samples": len(X_test)
}
with open(model_dir / "model_metadata.json", "w") as f:
    json.dump(model_metadata, f, indent=2)
print(f"\nModel saved successfully in '{model_dir}' directory!")
Now run your complete training script:
python train_model.py
The script should take a few minutes to run and will output the training progress and final performance metrics.
After running your script, take some time to analyze the results. This is a great habit because it can help you identify what changes your model might need:
What's your AUC score? An AUC above 0.7 is generally considered good, above 0.8 is very good.
Look at the confusion matrix:
How many real reviews did your model correctly identify?
How many fake reviews did it catch?
Where is it making mistakes? Is there an imbalance of true positives and negatives?
Note: with our label mapping (CG = 0, OR = 1), scikit-learn's confusion matrix puts the actual labels on the rows and the predicted labels on the columns. That means the top-left cell counts computer-generated reviews correctly identified as fake, and the bottom-right cell counts real reviews correctly identified as real.
Check precision vs. recall:
Is your model better at identifying real reviews or fake ones?
Would you prefer high precision (fewer false positives) or high recall (catching more fake reviews)?
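If you decide you care more about one of these than the other, one optional experiment (a technique not covered in the steps above) is to change the decision threshold instead of using the default 0.5 that model.predict() applies:
# Optional: use a custom threshold on the predicted probabilities
# y_prob holds the probability of the "real" class (OR = 1), so a higher threshold
# means more reviews get labeled computer-generated -- usually higher recall for fakes,
# at the cost of more real reviews being wrongly flagged
threshold = 0.7  # arbitrary example value, try a few
y_pred_custom = (y_prob >= threshold).astype(int)
print(classification_report(y_test, y_pred_custom))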
After analyzing the results of your first training run, I recommend experimenting with some of the parameters, like n_estimators, and retraining your model. You could also remove features or add new ones to see how your results change.
By the end of this week, you should have:
Successfully trained your first XGBoost model on the Amazon reviews dataset!
Learned how to evaluate model performance using various metrics and used this insight to assess your current feature set
Saved your trained model for future use
Next week, we'll dive deeper into model optimization topics such as grid search (to find the best hyperparameters) and feature importance (to understand which features are most helpful for classification). We'll also learn techniques to improve your model's performance even further.
Congratulations on training your first machine learning model! You've taken a major step towards effectively accomplishing our goal of detecting fake Amazon reviews.