Week 4 content for the Amazon Review Analyzer project
Welcome to Week 4 of the Amazon Review Analyzer project! By now, you've cleaned your data, extracted meaningful features, and trained your first XGBoost model. This week, we're going to take things to the next level by fine-tuning your model to squeeze out every bit of performance we can get. We'll explore which features actually matter, find the optimal hyperparameters, and compare our improvements against a baseline. This is where your model transforms from "pretty good" to "actually impressive!"
Note: a lot of code is provided for this week, but the real difficulty is getting your model to behave the way you want. Make sure you understand each line of code so you can adjust it when your results don't match your expectations.
Before we start tweaking hyperparameters, let's figure out which features are actually helping our model make decisions. Not all features are created equal. Some might be super useful while others are just adding noise.
Create a new file at the same location as “train_model.py” called “tune_model.py”. Add these imports to the top of the new file, along with the imports from “train_model.py” (this script also relies on pandas, joblib, json, Path from pathlib, and XGBClassifier, so make sure those carry over). Each package/library will be explained as it is used below:
from sklearn.model_selection import GridSearchCV
from sklearn.metrics import classification_report, roc_auc_score, confusion_matrix
from sklearn.feature_selection import SelectFromModel
from xgboost import plot_importance
from matplotlib import pyplot
import numpy as np
Because we have already trained a baseline model and do not want to retrain it every time we run this script, we are going to load the model we saved in Week 3.
model_dir = Path("./model")
baseline_model_path = model_dir / "review_classifier.pkl"
model = joblib.load(baseline_model_path)
Add this code to see which features your baseline model thinks are most important. X is the DataFrame of features after removing the text columns and the label column. You will need to define X (and your train/test split) using the same code from Week 3 before adding in these lines; a quick reminder sketch of that setup follows, then the feature-importance code.
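Here is a minimal sketch of that Week 3 setup, just as a reminder. The file path, column names, and split settings below are assumptions, so reuse your actual Week 3 code rather than copying this verbatim.
from sklearn.model_selection import train_test_split

# Reminder sketch only -- the path and column names are assumptions from Week 3
df = pd.read_csv("data/review_features.csv")      # hypothetical features file
X = df.drop(columns=["review_text", "label"])     # drop the text and label columns (names assumed)
y = df["label"]

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42, stratify=y  # use the same split settings as Week 3
)
With X and the split in place, add these lines: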
importances = model.feature_importances_
feature_importance_df = pd.DataFrame({
    'feature': X.columns,
    'importance': importances
}).sort_values('importance', ascending=False)
print(feature_importance_df)
plot_importance(model)
pyplot.show()
This will print out all of your features and show you a nice bar chart. Run this file and take a moment to look at these results. Determine if the important features are what you expected. Sometimes surprising features end up being really useful! We will come back to this plot soon.
Hint: If you just want to see the top 10 features, add .head(10) to the DataFrame in the print statement, e.g., print(feature_importance_df.head(10)).
Now comes the fun part: finding the best hyperparameters for your model. Hyperparameters are settings that control how your model learns, and finding the right combination can make a huge difference in performance. Keep in mind your baseline AUC score from last week because we will be trying to beat it!
Add this dictionary of parameters to test:
parameters = {
    "n_estimators": [50, 100, 200, 500],
    "learning_rate": [0.1, 0.3, 0.6, 1.0],
    "max_depth": [3, 6, 10],
    "reg_alpha": [0.0, 0.1, 0.5, 1.0],
    "reg_lambda": [0.1, 0.5, 1.0, 1.5],
}
Here's what each hyperparameter does:
n_estimators: The number of decision trees (boosting rounds) the model builds. A decision tree is a sort of flowchart of choices the model makes about the features, with each path leading to a prediction
learning_rate: How much each tree contributes (lower = more conservative)
max_depth: Limits how deep each decision tree can be (prevents overfitting). The end of each branch, or leaf node, is the final decision for that path. Overfitting happens when the model fits the training set too closely and, as a result, cannot generalize to new data that it sees
reg_alpha: L1 regularization (helps prevent overfitting)
reg_lambda: L2 regularization (also helps prevent overfitting)
Add the following code to systematically test all parameter combinations:
xgb_model = XGBClassifier(
    use_label_encoder=False,
    eval_metric="logloss",
    random_state=42,
)
grid_search = GridSearchCV(
    estimator=xgb_model,
    param_grid=parameters,
    scoring="roc_auc",
    cv=5,  # 5-fold cross-validation
    n_jobs=-1,  # Use all CPU cores
    verbose=1,
    return_train_score=True,
)
grid_search.fit(X_train, y_train)
Cross-validation, or cv, can be thought of as a more thorough train-test split. It works by splitting the dataset into several parts, or folds (5 in our case). The model is trained on most folds and tested on the remaining one, repeating the process so every fold gets a turn as the test set. This helps check the model’s consistency and catches overfitting that a single split might miss.
GridSearchCV docs: https://scikit-learn.org/stable/modules/generated/sklearn.model_selection.GridSearchCV.html
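If you want to see cross-validation on its own before kicking off the full grid search, here is a quick, optional sketch using the xgb_model defined above (nothing here is required for the project):
from sklearn.model_selection import cross_val_score

# Score the untuned model across 5 folds of the training data (one AUC per fold)
cv_scores = cross_val_score(xgb_model, X_train, y_train, cv=5, scoring="roc_auc")
print("Fold AUC scores:", cv_scores)
print("Mean AUC:", cv_scores.mean())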
Warning: This will take a while to run! I would wait to run it until you have finished adding the print statements for sections 2.3 & 2.4, or else grid search will be rerun every time. Grid search tests every combination of parameters, so with our parameter grid it is testing 4 × 4 × 3 × 4 × 4 = 768 combinations, each with 5-fold cross-validation (nearly 4,000 model fits). Go grab a coffee, work on homework, or take a walk. The verbose=1 setting will show you progress as it runs.
Once grid search finishes, let's see what it found:
best_params = grid_search.best_params_
best_model = grid_search.best_estimator_
best_score = grid_search.best_score_
print("\n" + "=" * 50)
print("GRID SEARCH RESULTS")
print("=" * 50)
print(f"Best parameters: {best_params}")
print(f"Best cross-validation AUC score: {best_score:.4f}")
print("=" * 50)
best_pred = best_model.predict(X_test)
best_prob = best_model.predict_proba(X_test)[:, 1]
print("\nBest Model Performance on Test Set:")
print("Classification Report:\n", classification_report(y_test, best_pred))
print("Confusion Matrix:\n", confusion_matrix(y_test, best_pred))
print(f"Test AUC Score: {roc_auc_score(y_test, best_prob):.4f}")
Let's see how much we improved:
baseline_pred = model.predict(X_test)
baseline_prob = model.predict_proba(X_test)[:, 1]
print("\n" + "=" * 50)
print("BASELINE vs BEST MODEL COMPARISON")
print("=" * 50)
print(f"Baseline AUC: {roc_auc_score(y_test, baseline_prob):.4f}")
print(f"Best Model AUC: {roc_auc_score(y_test, best_prob):.4f}")
Hopefully, you'll see a nice improvement! Even a 1-2% increase in AUC can be significant in machine learning.
Hint: After successfully running and evaluating grid search, keep track of the best parameters and set them manually when we retrain our model with feature selection. The definition of the best parameters will look similar to the dictionary you created in 2.1 (a sketch follows). You should also comment out the grid search code so it doesn’t run again. You could also choose to save/load the best_model like we did with the baseline.
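For example, hard-coding the winning parameters might look like this. The values below are placeholders, not recommendations; use whatever grid search reported for your run.
# Placeholder values -- replace with the best parameters grid search printed for your data
best_params = {
    "n_estimators": 200,
    "learning_rate": 0.1,
    "max_depth": 6,
    "reg_alpha": 0.1,
    "reg_lambda": 1.0,
}

# Optional: persist the grid-search winner so you never have to rerun the search
# joblib.dump(best_model, model_dir / "best_model.pkl")
# best_model = joblib.load(model_dir / "best_model.pkl")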
Now that we have a tuned set of hyperparameters, let's see if we can make the model even better (and faster) by removing low-importance features.
After taking a peek at the feature importance plot from the earlier step, choose a threshold value; any features with importance below that threshold will be dropped. For example, you could start with 0.020 as your threshold and then test a few different values. Add this code to select only the most important features (the retraining step comes next):
thresh = 0.020 # Only keep features with importance >= 0.020
selection = SelectFromModel(model, threshold=thresh, prefit=True)
select_X_train = selection.transform(X_train)
Then, follow the previous training code to complete these steps (a sketch is provided after this list):
Create an XGBClassifier object like in 2.2, but unpack best_params into it (i.e., pass **best_params alongside the other arguments). best_params should be defined after running grid search, or hard-coded as described in the hint above.
Train the model on the selected features (select_X_train) with the same y_train
select_X_test = selection.transform(X_test)
Use the transformed test features from the line above to get predictions and probabilities on the test data
Print the classification report, confusion matrix, and AUC score
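Putting those steps together, a sketch might look like the following. The variable names other than selection_model (which the save code below expects) are just illustrative, and it assumes best_params, selection, select_X_train, and the train/test split are already defined.
# Retrain on only the selected features, reusing the tuned hyperparameters
selection_model = XGBClassifier(
    **best_params,
    use_label_encoder=False,
    eval_metric="logloss",
    random_state=42,
)
selection_model.fit(select_X_train, y_train)

# Apply the same feature selection to the test set, then evaluate
select_X_test = selection.transform(X_test)
select_pred = selection_model.predict(select_X_test)
select_prob = selection_model.predict_proba(select_X_test)[:, 1]

print("Classification Report:\n", classification_report(y_test, select_pred))
print("Confusion Matrix:\n", confusion_matrix(y_test, select_pred))
print(f"Selected-Features AUC: {roc_auc_score(y_test, select_prob):.4f}")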
This creates a leaner model that only uses your most important features. Sometimes removing noisy features actually improves performance!
Try adjusting the thresh value (e.g., 0.01, 0.05, 0.1) and see how it affects:
The number of features selected
Model performance
Training speed
Find the sweet spot between model simplicity and performance.
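If you want to compare several thresholds in one go, a small loop like this optional sketch can help (it assumes best_params and the variables from above are already defined):
# Optional: compare a few thresholds in a single pass
for thresh in [0.01, 0.02, 0.05, 0.1]:
    selection = SelectFromModel(model, threshold=thresh, prefit=True)
    select_X_train = selection.transform(X_train)
    select_X_test = selection.transform(X_test)

    candidate = XGBClassifier(**best_params, eval_metric="logloss", random_state=42)
    candidate.fit(select_X_train, y_train)

    prob = candidate.predict_proba(select_X_test)[:, 1]
    print(f"thresh={thresh:.2f}, features kept={select_X_train.shape[1]}, "
          f"test AUC={roc_auc_score(y_test, prob):.4f}")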
Now that you have a newly trained model, let's save it so you can use it later without retraining. We will just add in and modify some of the code from “train_model.py” that saves your baseline model:
# Save the fine-tuned model
joblib.dump(selection_model, model_dir / "selection_model.pkl")

feature_names = X.columns.tolist()

# Save model metadata
model_metadata = {
    "best_params": best_params,
    "best_cv_score": float(best_score),  # may need to remove if grid search is commented out
    "test_auc_best": float(roc_auc_score(y_test, best_prob)),  # may need to remove if grid search is commented out
    "num_original_features": len(feature_names),
}

with open(model_dir / "selection_metadata.json", "w") as f:
    json.dump(model_metadata, f, indent=2)
Important: Add model/ to your “.gitignore” file since these files can be large and are generated outputs. In other words, this directory should not be part of your repo.
Want to go even further? Here are some advanced techniques to try:
Experiment with optimizing for different metrics in GridSearchCV:
"accuracy": Overall correctness
"f1": Balance of precision and recall
"precision": Minimize false positives
"recall": Minimize false negatives
Just change the scoring parameter in GridSearchCV to test different optimization targets.
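For example, to optimize for F1 instead of AUC, the only change from the earlier grid search is the scoring argument (a sketch; everything else stays the same):
grid_search_f1 = GridSearchCV(
    estimator=xgb_model,
    param_grid=parameters,
    scoring="f1",   # optimize for F1 instead of roc_auc
    cv=5,
    n_jobs=-1,
    verbose=1,
)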
Here is a quick article that describes the difference between each of these scores with an example: https://www.labelf.ai/blog/what-is-accuracy-precision-recall-and-f1-score
I know this was said previously, but continue to try out different thresholds to see how they affect your model. You could also examine whether any features have too much importance. For example, if a feature like num_spaces has an importance score that is double every other feature’s, I would consider removing that column before retraining your model.
By the end of this week, you should have:
An understanding of which features matter most to your model
An optimized model with tuned hyperparameters (there’s always room for improvement)
A saved model ready for deployment or more fine-tuning
Performance metrics showing clear improvement
Looking ahead, we'll build a simple web interface where users can input review text and see whether our model thinks it's real or fake. We may also fine-tune our models further!
Great work this week. You've officially built a production-quality machine learning model. The jump from "it works" to "it really works well" is huge, and you just made it!
3 Methods for Hyperparameter Tuning with XGBoost: https://www.youtube.com/watch?v=9Ee4PDaqpUs&list=PLXhX6b6y_bWTegYvt-ed5SKTmQUtzwOn4&index=13