Week 4 content for the Amazon Review Analyzer project
Welcome to Week 4 of the Amazon Review Analyzer project! By now, you've cleaned your data, extracted meaningful features, and trained your first XGBoost model. This week, we're going to take things to the next level by fine-tuning your model to squeeze out every bit of performance we can get. We'll explore which features actually matter, find the optimal hyperparameters, and compare our improvements against a baseline. This is where your model transforms from "pretty good" to "actually impressive!"
Note: a lot of code is provided for this week, but the real difficulty is getting your model to behave the way you want. Make sure you understand each line of code so you can adjust it when your results don't match your expectations.
Before we start tweaking hyperparameters, let's figure out which features are actually helping our model make decisions. Not all features are created equal. Some might be super useful while others are just adding noise.
Create a new file at the same location as “train_model.py” called “tune_model.py”. Add these imports to the top of the new file, along with the imports from “train_model.py” (this script also relies on pandas, joblib, json, Path from pathlib, and XGBClassifier, so make sure those carry over). Each package/library will be explained as it is used below:
from sklearn.model_selection import GridSearchCV
from sklearn.metrics import classification_report, roc_auc_score, confusion_matrix
from sklearn.feature_selection import SelectFromModel
from xgboost import plot_importance
from matplotlib import pyplot
import numpy as np
Because we have already trained a baseline model and do not want to retrain it every time we run this script, we are going to load the model we saved in Week 3.
model_dir = Path("./model")
baseline_model_path = model_dir / "review_classifier.pkl"
model = joblib.load(baseline_model_path)
Add this code to see which features your baseline model thinks are most important. X is the DataFrame of features after removing the text columns and the label column. You will need to define X (and your train/test split) using the same code from Week 3 before adding in these lines; a quick reminder sketch of that setup follows, then the feature-importance code.
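Here is a minimal sketch of that Week 3 setup, just as a reminder. The file path, column names, and split settings below are assumptions, so reuse your actual Week 3 code rather than copying this verbatim.
from sklearn.model_selection import train_test_split

# Reminder sketch only -- the path and column names are assumptions from Week 3
df = pd.read_csv("data/review_features.csv")      # hypothetical features file
X = df.drop(columns=["review_text", "label"])     # drop the text and label columns (names assumed)
y = df["label"]

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42, stratify=y  # use the same split settings as Week 3
)
With X and the split in place, add these lines: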
importances = model.feature_importances_
feature_importance_df = pd.DataFrame({
    'feature': X.columns,
    'importance': importances
}).sort_values('importance', ascending=False)
print(feature_importance_df)
plot_importance(model)
pyplot.show()
This will print out all of your features and show you a nice bar chart. Run this file and take a moment to look at these results. Determine if the important features are what you expected. Sometimes surprising features end up being really useful! We will come back to this plot soon.
Hint: If you just want to see the top 10 features, add .head(10) to the DataFrame in the print statement, e.g., print(feature_importance_df.head(10)).
Now comes the fun part: finding the best hyperparameters for your model. Hyperparameters are settings that control how your model learns, and finding the right combination can make a huge difference in performance. Keep in mind your baseline AUC score from last week because we will be trying to beat it!
Add this dictionary of parameters to test:
parameters = {
    "n_estimators": [50, 100, 200, 500],
    "learning_rate": [0.1, 0.3, 0.6, 1.0],
    "max_depth": [3, 6, 10],
    "reg_alpha": [0.0, 0.1, 0.5, 1.0],
    "reg_lambda": [0.1, 0.5, 1.0, 1.5],
}
Here's what each hyperparameter does:
n_estimators: The number of decision trees (boosting rounds) the model builds. A decision tree is a sort of flowchart of choices the model makes about the features, with each path leading to a prediction
learning_rate: How much each tree contributes (lower = more conservative)
max_depth: Limits how deep each decision tree can be (prevents overfitting). The end of each branch, or leaf node, is the final decision for that path. Overfitting happens when the model fits the training set too closely and, as a result, cannot generalize to new data that it sees
reg_alpha: L1 regularization (helps prevent overfitting)
reg_lambda: L2 regularization (also helps prevent overfitting)
Add the following code to systematically test all parameter combinations:
xgb_model = XGBClassifier(
    use_label_encoder=False,
    eval_metric="logloss",
    random_state=42,
)
grid_search = GridSearchCV(
    estimator=xgb_model,
    param_grid=parameters,
    scoring="roc_auc",
    cv=5,  # 5-fold cross-validation
    n_jobs=-1,  # Use all CPU cores
    verbose=1,
    return_train_score=True,
)
grid_search.fit(X_train, y_train)
Cross-validation, or cv, can be thought of as a more thorough train-test split. It works by splitting the dataset into several parts, or folds (5 in our case). The model is trained on most folds and tested on the remaining one, repeating the process so every fold gets a turn as the test set. This helps check the model’s consistency and catches overfitting that a single split might miss.
GridSearchCV docs: https://scikit-learn.org/stable/modules/generated/sklearn.model_selection.GridSearchCV.html
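If you want to see cross-validation on its own before kicking off the full grid search, here is a quick, optional sketch using the xgb_model defined above (nothing here is required for the project):
from sklearn.model_selection import cross_val_score

# Score the untuned model across 5 folds of the training data (one AUC per fold)
cv_scores = cross_val_score(xgb_model, X_train, y_train, cv=5, scoring="roc_auc")
print("Fold AUC scores:", cv_scores)
print("Mean AUC:", cv_scores.mean())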
Warning: This will take a while to run! I would wait to run it until you have finished adding the print statements for sections 2.3 & 2.4, or else grid search will be rerun every time. Grid search tests every combination of parameters, so with our parameter grid it is testing 4 × 4 × 3 × 4 × 4 = 768 combinations, each with 5-fold cross-validation (nearly 4,000 model fits). Go grab a coffee, work on homework, or take a walk. The verbose=1 setting will show you progress as it runs.
Once grid search finishes, let's see what it found:
best_params = grid_search.best_params_
best_model = grid_search.best_estimator_
best_score = grid_search.best_score_
print("\n" + "=" * 50)
print("GRID SEARCH RESULTS")
print("=" * 50)
print(f"Best parameters: {best_params}")
print(f"Best cross-validation AUC score: {best_score:.4f}")
print("=" * 50)
best_pred = best_model.predict(X_test)
best_prob = best_model.predict_proba(X_test)[:, 1]
print("\nBest Model Performance on Test Set:")
print("Classification Report:\n", classification_report(y_test, best_pred))
print("Confusion Matrix:\n", confusion_matrix(y_test, best_pred))
print(f"Test AUC Score: {roc_auc_score(y_test, best_prob):.4f}")
Let's see how much we improved:
baseline_pred = model.predict(X_test)
baseline_prob = model.predict_proba(X_test)[:, 1]
print("\n" + "=" * 50)
print("BASELINE vs BEST MODEL COMPARISON")
print("=" * 50)
print(f"Baseline AUC: {roc_auc_score(y_test, baseline_prob):.4f}")
print(f"Best Model AUC: {roc_auc_score(y_test, best_prob):.4f}")
Hopefully, you'll see a nice improvement! Even a 1-2% increase in AUC can be significant in machine learning.
Hint: After successfully running and evaluating grid search, keep track of the best parameters and set them manually when we retrain our model with feature selection. The definition of the best parameters will look similar to the dictionary you created in 2.1 (a sketch follows). You should also comment out the grid search code so it doesn’t run again. You could also choose to save/load the best_model like we did with the baseline.
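For example, hard-coding the winning parameters might look like this. The values below are placeholders, not recommendations; use whatever grid search reported for your run.
# Placeholder values -- replace with the best parameters grid search printed for your data
best_params = {
    "n_estimators": 200,
    "learning_rate": 0.1,
    "max_depth": 6,
    "reg_alpha": 0.1,
    "reg_lambda": 1.0,
}

# Optional: persist the grid-search winner so you never have to rerun the search
# joblib.dump(best_model, model_dir / "best_model.pkl")
# best_model = joblib.load(model_dir / "best_model.pkl")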
Now that we have a tuned set of hyperparameters, let's see if we can make the model even better (and faster) by removing low-importance features.
After taking a peek at the feature importance plot from the earlier step, choose a threshold value; any features with importance below that threshold will be dropped. For example, you could start with 0.020 as your threshold and then test a few different values. Add this code to select only the most important features (the retraining step comes next):
thresh = 0.020 # Only keep features with importance >= 0.020
selection = SelectFromModel(model, threshold=thresh, prefit=True)
select_X_train = selection.transform(X_train)
Then, follow the previous training code to complete these steps (a sketch is provided after this list):
Create an XGBClassifier object like in 2.2, but unpack best_params into it (i.e., pass **best_params alongside the other arguments). best_params should be defined after running grid search, or hard-coded as described in the hint above.
Train the model on the selected features (select_X_train) with the same y_train
select_X_test = selection.transform(X_test)
Use the transformed test features from the line above to get predictions and probabilities on the test data
Print the classification report, confusion matrix, and AUC score
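Putting those steps together, a sketch might look like the following. The variable names other than selection_model (which the save code below expects) are just illustrative, and it assumes best_params, selection, select_X_train, and the train/test split are already defined.
# Retrain on only the selected features, reusing the tuned hyperparameters
selection_model = XGBClassifier(
    **best_params,
    use_label_encoder=False,
    eval_metric="logloss",
    random_state=42,
)
selection_model.fit(select_X_train, y_train)

# Apply the same feature selection to the test set, then evaluate
select_X_test = selection.transform(X_test)
select_pred = selection_model.predict(select_X_test)
select_prob = selection_model.predict_proba(select_X_test)[:, 1]

print("Classification Report:\n", classification_report(y_test, select_pred))
print("Confusion Matrix:\n", confusion_matrix(y_test, select_pred))
print(f"Selected-Features AUC: {roc_auc_score(y_test, select_prob):.4f}")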
This creates a leaner model that only uses your most important features. Sometimes removing noisy features actually improves performance!
Try adjusting the thresh value (e.g., 0.01, 0.05, 0.1) and see how it affects:
The number of features selected
Model performance
Training speed
Find the sweet spot between model simplicity and performance.
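If you want to compare several thresholds in one go, a small loop like this optional sketch can help (it assumes best_params and the variables from above are already defined):
# Optional: compare a few thresholds in a single pass
for thresh in [0.01, 0.02, 0.05, 0.1]:
    selection = SelectFromModel(model, threshold=thresh, prefit=True)
    select_X_train = selection.transform(X_train)
    select_X_test = selection.transform(X_test)

    candidate = XGBClassifier(**best_params, eval_metric="logloss", random_state=42)
    candidate.fit(select_X_train, y_train)

    prob = candidate.predict_proba(select_X_test)[:, 1]
    print(f"thresh={thresh:.2f}, features kept={select_X_train.shape[1]}, "
          f"test AUC={roc_auc_score(y_test, prob):.4f}")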
Now that you have a newly trained model, let's save it so you can use it later without retraining. We will just add in and modify some of the code from “train_model.py” that saves your baseline model:
# Save the fine-tuned model
joblib.dump(selection_model, model_dir / "selection_model.pkl")

feature_names = X.columns.tolist()

# Save model metadata
model_metadata = {
    "best_params": best_params,
    "best_cv_score": float(best_score),  # may need to remove if grid search is commented out
    "test_auc_best": float(roc_auc_score(y_test, best_prob)),  # may need to remove if grid search is commented out
    "num_original_features": len(feature_names),
}

with open(model_dir / "selection_metadata.json", "w") as f:
    json.dump(model_metadata, f, indent=2)
Important: Add model/ to your “.gitignore” file since these files can be large and are generated outputs. In other words, this directory should not be part of your repo.
Want to go even further? Here are some advanced techniques to try:
Experiment with optimizing for different metrics in GridSearchCV:
"accuracy": Overall correctness
"f1": Balance of precision and recall
"precision": Minimize false positives
"recall": Minimize false negatives
Just change the scoring parameter in GridSearchCV to test different optimization targets.
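For example, to optimize for F1 instead of AUC, the only change from the earlier grid search is the scoring argument (a sketch; everything else stays the same):
grid_search_f1 = GridSearchCV(
    estimator=xgb_model,
    param_grid=parameters,
    scoring="f1",   # optimize for F1 instead of roc_auc
    cv=5,
    n_jobs=-1,
    verbose=1,
)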
Here is a quick article that describes the difference between each of these scores with an example: https://www.labelf.ai/blog/what-is-accuracy-precision-recall-and-f1-score
I know this was said previously, but continue to try out different thresholds to see how they affect your model. You could also examine whether any features have too much importance. For example, if a feature like num_spaces has an importance score that is double every other feature’s, I would consider removing that column before retraining your model.
By the end of this week, you should have:
An understanding of which features matter most to your model
An optimized model with tuned hyperparameters (there’s always room for improvement)
A saved model ready for deployment or more fine-tuning
Performance metrics showing clear improvement
Looking ahead, we'll build a simple web interface where users can input review text and see whether our model thinks it's real or fake. We may also fine-tune our models further!
Great work this week. You've officially built a production-quality machine learning model. The jump from "it works" to "it really works well" is huge, and you just made it!
3 Methods for Hyperparameter Tuning with XGBoost: https://www.youtube.com/watch?v=9Ee4PDaqpUs&list=PLXhX6b6y_bWTegYvt-ed5SKTmQUtzwOn4&index=13