Amazon Review Analyzer Week 4

    Week 4 content for the Amazon Review Analyzer project

    By AI Club on 10/13/2025

    Week 4: Model Fine-Tuning and Optimization

    Welcome to Week 4 of the Amazon Review Analyzer project! By now, you've cleaned your data, extracted meaningful features, and trained your first XGBoost model. This week, we're going to take things to the next level by fine-tuning your model to squeeze out every bit of performance we can get. We'll explore which features actually matter, find the optimal hyperparameters, and compare our improvements against a baseline. This is where your model transforms from "pretty good" to "actually impressive!"

    Note: a lot of code is provided this week, but the real challenge is getting your model to perform the way you want. Make sure you understand what each line of code is doing so you can tweak and debug it with confidence.

    1. Understanding Feature Importance

    Before we start tweaking hyperparameters, let's figure out which features are actually helping our model make decisions. Not all features are created equal. Some might be super useful while others are just adding noise.

    1.0 Import the necessary libraries

    Create a new file at the same location as “train_model.py” called “tune_model.py”. Add these imports to the top of the new file, along with the imports from “train_model.py”. Each package/library is explained as it is used below:

    from sklearn.model_selection import GridSearchCV

    from sklearn.metrics import classification_report, roc_auc_score, confusion_matrix

    from sklearn.feature_selection import SelectFromModel

    from xgboost import plot_importance

    from matplotlib import pyplot

    import numpy as np

    1.0.5 Load the baseline model

    Because we have already trained a baseline model and do not want to retrain it every time we run this script, we are going to load the model we saved in Week 3.

    model_dir = Path("./model")

    baseline_model_path = model_dir / "review_classifier.pkl"

    model = joblib.load(baseline_model_path)

    1.1 Visualize feature importance

    Add this code to see which features your baseline model thinks are most important. X is the DataFrame of features after removing the text columns and the label column; you will need to define it (along with the train/test split used later in this file) using the same code from Week 3.
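    If you need a refresher, a minimal sketch of that Week 3 setup is shown below. The file path and column names (review_text, label) are placeholders, so substitute whatever your Week 3 code actually uses:

    from sklearn.model_selection import train_test_split  # likely already in your train_model.py imports

    # Placeholder sketch -- reuse your actual Week 3 code here.
    df = pd.read_csv("data/reviews_with_features.csv")  # hypothetical path

    X = df.drop(columns=["review_text", "label"])  # placeholder text + label column names
    y = df["label"]

    X_train, X_test, y_train, y_test = train_test_split(
        X, y, test_size=0.2, random_state=42, stratify=y
    )

    With X (and the train/test split) defined, add these lines to inspect feature importance: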

    importances = model.feature_importances_

    feature_importance_df = pd.DataFrame({

        'feature': X.columns,

        'importance': importances

    }).sort_values('importance', ascending=False)


    print(feature_importance_df)

    plot_importance(model)

    pyplot.show()

    This will print out all of your features and show you a nice bar chart. Run this file and take a moment to look at these results. Determine if the important features are what you expected. Sometimes surprising features end up being really useful! We will come back to this plot soon.

    Hint: If you just want to see the top 10 features, then you could add .head(10) to the print statement.

    2. Hyperparameter Tuning with Grid Search

    Now comes the fun part: finding the best hyperparameters for your model. Hyperparameters are settings that control how your model learns, and finding the right combination can make a huge difference in performance. Keep in mind your baseline AUC score from last week because we will be trying to beat it!

    2.1 Set up your parameter grid

    Add this dictionary of parameters to test:

    parameters = {

        "n_estimators": [50, 100, 200, 500],

        "learning_rate": [0.1, 0.3, 0.6, 1.0],

        "max_depth": [3, 6, 10],

        "reg_alpha": [0.0, 0.1, 0.5, 1.0],

        "reg_lambda": [0.1, 0.5, 1.0, 1.5],

    }

    Here's what each hyperparameter does:

    • n_estimators: The number of decision trees the model builds. A decision tree is essentially a flowchart of feature-based decisions and their outcomes that the model learns from the data

    • learning_rate: How much each tree contributes (lower = more conservative)

    • max_depth: Limits how deep each decision tree can grow (helps prevent overfitting). The end of each branch, called a leaf node, holds that branch's final decision. Overfitting happens when the model fits the training set too closely and, as a result, cannot generalize to new data

    • reg_alpha: L1 regularization (helps prevent overfitting)

    • reg_lambda: L2 regularization (also helps prevent overfitting)

    2.2 Run grid search with cross-validation

    Add the following code to systematically test all parameter combinations:

    xgb_model = XGBClassifier(

        use_label_encoder=False,

        eval_metric="logloss",

        random_state=42,

    )


    grid_search = GridSearchCV(

        estimator=xgb_model,

        param_grid=parameters,

        scoring="roc_auc",

        cv=5,  # 5-fold cross-validation

        n_jobs=-1,  # Use all CPU cores

        verbose=1,

        return_train_score=True,

    )


    grid_search.fit(X_train, y_train)

    Cross-validation, or CV, can be thought of as a more thorough train-test split. It works by splitting the dataset into several parts, or folds (5 in our case). The model is trained on all but one fold and tested on the remaining one, repeating the process so every fold gets used for testing exactly once. This gives a more reliable estimate of performance and helps catch overfitting.
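    To see cross-validation on its own (separately from grid search), a quick sketch using scikit-learn's cross_val_score on the baseline model could look like this:

    from sklearn.model_selection import cross_val_score

    # Each of the 5 folds is held out once while the model is refit on the other four.
    cv_scores = cross_val_score(model, X_train, y_train, cv=5, scoring="roc_auc")

    print("Per-fold AUC:", np.round(cv_scores, 4))
    print(f"Mean AUC: {cv_scores.mean():.4f} (std {cv_scores.std():.4f})")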

    GridSearchCV docs: https://scikit-learn.org/stable/modules/generated/sklearn.model_selection.GridSearchCV.html 

    Warning: This will take a while to run! I would wait to run it until you have finished adding the print statements for sections 2.3 & 2.4, or else grid search will be rerun every time. Grid search tests every combination of parameters: with our grid that is 4 × 4 × 3 × 4 × 4 = 768 combinations, each evaluated with 5-fold cross-validation, so 3,840 model fits. Go grab a coffee, work on homework, or take a walk. The verbose=1 setting will show you progress as it runs.

    2.3 Evaluate the best model

    Once grid search finishes, let's see what it found:

    best_params = grid_search.best_params_

    best_model = grid_search.best_estimator_

    best_score = grid_search.best_score_


    print("\n" + "=" * 50)

    print("GRID SEARCH RESULTS")

    print("=" * 50)

    print(f"Best parameters: {best_params}")

    print(f"Best cross-validation AUC score: {best_score:.4f}")

    print("=" * 50)


    best_pred = best_model.predict(X_test)

    best_prob = best_model.predict_proba(X_test)[:, 1]


    print("\nBest Model Performance on Test Set:")

    print("Classification Report:\n", classification_report(y_test, best_pred))

    print("Confusion Matrix:\n", confusion_matrix(y_test, best_pred))

    print(f"Test AUC Score: {roc_auc_score(y_test, best_prob):.4f}")

    2.4 Compare baseline vs. optimized model

    Let's see how much we improved:

    baseline_pred = model.predict(X_test)

    baseline_prob = model.predict_proba(X_test)[:, 1]

    print("\n" + "=" * 50)

    print("BASELINE vs BEST MODEL COMPARISON")

    print("=" * 50)

    print(f"Baseline AUC: {roc_auc_score(y_test, baseline_prob):.4f}")

    print(f"Best Model AUC: {roc_auc_score(y_test, best_prob):.4f}")

    Hopefully, you'll see a nice improvement! Even a 1-2% increase in AUC can be significant in machine learning.

    Hint: after successfully running and evaluating grid search, keep track of the best parameters and set them manually when we retrain the model on the selected features in Section 3. The definition will look similar to the dictionary you created in 2.1. You should also comment out the grid search code so it doesn’t run again. You could also choose to save/load best_model like we did with the baseline.
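    For example, once grid search has run, you could hard-code the winners like this (the values below are placeholders; use whatever your grid search actually reports):

    # Placeholder values -- replace with the best_params printed by your grid search.
    best_params = {
        "n_estimators": 200,
        "learning_rate": 0.1,
        "max_depth": 6,
        "reg_alpha": 0.1,
        "reg_lambda": 1.0,
    }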

    3. Feature Selection with Importance Thresholding

    Now that we have our baseline model, let's see if we can make it even better (and faster) by removing low-importance features.

    3.1 Select important features

    After taking a peek at the feature importance plot from earlier, choose a threshold value: any feature with importance below that threshold will be ignored. For example, you could start with 0.020 as your threshold and test out a few different values. Add this code to retrain using only the most important features:

    thresh = 0.020  # Only keep features with importance >= 0.020

    selection = SelectFromModel(model, threshold=thresh, prefit=True)

    select_X_train = selection.transform(X_train)

    Then, follow the previous training code to (a combined sketch is shown below):

    1. Create an XGBClassifier object like in 2.2, but also pass in the tuned hyperparameters by unpacking best_params (e.g., XGBClassifier(**best_params, ...)). best_params should be defined after running grid search.

    2. Train the model on select_X_train with the same y_train

    3. select_X_test = selection.transform(X_test)

    4. Use select_X_test (from the step above) to get the predictions and probabilities on the test data

    5. Print the classification report, confusion matrix, and AUC score

    This creates a leaner model that only uses your most important features. Sometimes removing noisy features actually improves performance!
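    Putting those steps together, a minimal sketch might look like this (assuming best_params, thresh, selection, and select_X_train are already defined as above):

    # Sketch only -- adjust names and values to match your own file.
    selection_model = XGBClassifier(
        **best_params,               # tuned hyperparameters from Section 2
        use_label_encoder=False,
        eval_metric="logloss",
        random_state=42,
    )
    selection_model.fit(select_X_train, y_train)

    # Transform the test set with the same selector, then evaluate.
    select_X_test = selection.transform(X_test)
    selection_pred = selection_model.predict(select_X_test)
    selection_prob = selection_model.predict_proba(select_X_test)[:, 1]

    print(f"Selected {select_X_train.shape[1]} of {X_train.shape[1]} features (thresh={thresh})")
    print("Classification Report:\n", classification_report(y_test, selection_pred))
    print("Confusion Matrix:\n", confusion_matrix(y_test, selection_pred))
    print(f"Selected-Features AUC: {roc_auc_score(y_test, selection_prob):.4f}")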

    3.2 Experiment with the threshold

    Try adjusting the thresh value (e.g., 0.01, 0.05, 0.1) and see how it affects:

    • The number of features selected

    • Model performance

    • Training speed

    Find the sweet spot between model simplicity and performance.
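    One rough way to run this experiment (again assuming best_params and the variables from earlier in this file are defined) is a quick loop:

    import time

    # Sketch: compare feature count, AUC, and fit time across a few candidate thresholds.
    for candidate_thresh in [0.01, 0.02, 0.05, 0.1]:
        candidate_selection = SelectFromModel(model, threshold=candidate_thresh, prefit=True)
        cand_X_train = candidate_selection.transform(X_train)
        cand_X_test = candidate_selection.transform(X_test)

        candidate_model = XGBClassifier(**best_params, eval_metric="logloss", random_state=42)
        start = time.time()
        candidate_model.fit(cand_X_train, y_train)
        fit_seconds = time.time() - start

        auc = roc_auc_score(y_test, candidate_model.predict_proba(cand_X_test)[:, 1])
        print(f"thresh={candidate_thresh:.2f}  features={cand_X_train.shape[1]}  "
              f"AUC={auc:.4f}  fit_time={fit_seconds:.1f}s")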

    4. Save Your Model

    Now that you have a newly trained model, let's save it so you can use it later without retraining. We will just add in and modify some of the code from “train_model.py” that saves your baseline model:

    # Save the fine-tuned model

    joblib.dump(selection_model, model_dir / "selection_model.pkl")

    feature_names = X.columns.tolist()


    # Save model metadata

    model_metadata = {

        "best_params": best_params,

        "best_cv_score": float(best_score), # may need to remove if grid search is commented out

        "test_auc_best": float(roc_auc_score(y_test, best_prob)), # may need to remove if grid search is commented out

        "num_original_features": len(feature_names),

    }


    with open(model_dir / "selection_metadata.json", "w") as f:

        json.dump(model_metadata, f, indent=2)
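    As a quick usage sketch, loading these artifacts back later (for example, in next week's web interface) might look like this, assuming the same model_dir:

    selection_model = joblib.load(model_dir / "selection_model.pkl")

    with open(model_dir / "selection_metadata.json") as f:
        model_metadata = json.load(f)

    print(model_metadata["best_params"])

    Keep in mind that any new data will need the same feature engineering and feature selection applied before you can call predict on the loaded model.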

    Important: Add model/ to your “.gitignore” file since these files can be large and are generated outputs. In other words, this directory should not be part of your repo.

    5. BONUS: Advanced Tuning Techniques

    Want to go even further? Here are some advanced techniques to try:

    5.1 Try Different Metrics

    Experiment with optimizing for different metrics in GridSearchCV:

    • "accuracy": Overall correctness

    • "f1": Balance of precision and recall

    • "precision": Minimize false positives

    • "recall": Minimize false negatives

    Just change the scoring parameter in GridSearchCV to test different optimization targets.
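    For example, a search optimized for F1 instead of AUC would only differ in the scoring argument (a sketch, reusing xgb_model and parameters from 2.2):

    grid_search_f1 = GridSearchCV(
        estimator=xgb_model,
        param_grid=parameters,
        scoring="f1",   # or "accuracy", "precision", "recall"
        cv=5,
        n_jobs=-1,
        verbose=1,
    )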

    Here is a quick article that describes the difference between each of these scores with an example: https://www.labelf.ai/blog/what-is-accuracy-precision-recall-and-f1-score 

    5.2 Continue to try different thresholds

    I know this was said previously, but keep trying different thresholds to see how they affect your model. You could also check whether any feature has too much importance. For example, if a feature like num_spaces has an importance score that is double every other score, I would consider removing that column before retraining your model.
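    As a sketch, dropping one overly dominant column before recreating your splits could look like this (num_spaces is just the example feature named above; use whichever column dominates your own plot):

    X = X.drop(columns=["num_spaces"])  # hypothetical dominant column
    # ...then recreate X_train/X_test with train_test_split and retrain as before.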

    Wrapping Up

    By the end of this week, you should have:

    • Understanding of which features matter most to your model

    • An optimized model with tuned hyperparameters (there’s always room for improvement)

    • A saved model ready for deployment or more fine-tuning

    • Performance metrics showing clear improvement

    Next Week:

    We'll build a simple web interface where users can input review text and see if our model thinks it's real or fake. We may also fine-tune our models further!

    Great work this week. You've officially built a production-quality machine learning model. The jump from "it works" to "it really works well" is huge, and you just made it!

    Extra Resources:

    3 Methods for Hyperparameter Tuning with XGBoost: https://www.youtube.com/watch?v=9Ee4PDaqpUs&list=PLXhX6b6y_bWTegYvt-ed5SKTmQUtzwOn4&index=13
