Amazon Review Analyzer Week 6

    Week 6 content for the Amazon Review Analyzer project

    By AI Club on 11/3/2025

    Week 6: BERT Alternative Model with LoRA Fine-Tuning

    Welcome to this week's alternative solution approach! While you've been working with XGBoost (a tree-based model), we're going to spend this week exploring a completely different approach: BERT (Bidirectional Encoder Representations from Transformers). As you have learned, XGBoost must be trained on numerical features, while BERT is a powerful deep learning model that understands language in context. We'll also learn about LoRA (Low-Rank Adaptation), a technique that lets us fine-tune large models efficiently on limited hardware.

    Why BERT as an Alternative?

    XGBoost works with hand-crafted features (word counts, sentiment scores, etc.) while BERT learns patterns directly from raw text. Think of it this way:

    • XGBoost: You tell it what to look for (features)

    • BERT: It figures out what matters by understanding language itself

    Both approaches have their place! XGBoost is faster and more interpretable, while BERT can capture complex language patterns. You can see why BERT might come in handy when we are dealing with review text.

    1. Understanding BERT and LoRA

    1.1 What is BERT?

    BERT is a transformer-based model pre-trained on massive amounts of text. It understands context by looking at words from both directions (hence "Bidirectional"). When we fine-tune BERT, we're teaching it to apply its language understanding to our specific task: detecting AI-generated reviews.
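    If you'd like to see that bidirectional context at work before any fine-tuning, here is a small optional demo using the transformers fill-mask pipeline (the example sentence is made up, and the pretrained model downloads on first run):

    # Optional demo: pretrained BERT predicting a masked word using context
    # from both sides of the blank.
    from transformers import pipeline

    fill_mask = pipeline("fill-mask", model="bert-base-uncased")
    for pred in fill_mask("This product arrived late and the quality was [MASK]."):
        print(f"{pred['token_str']:>10}  (score: {pred['score']:.3f})")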

    The Challenge: BERT has 110 million parameters. Training all of them requires:

    • Expensive GPUs

    • Long training times

    • Lots of memory

    We will be using LoRA to help with this challenge!

    1.2 What is LoRA?

    LoRA (Low-Rank Adaptation) is a clever technique that:

    • Freezes the original BERT weights (no updates to those 110M parameters)

    • Adds small "adapter" layers (only a few million parameters)

    • Only trains these tiny adapters

    Result: You get 95%+ of full fine-tuning performance while training 100x fewer parameters! This means you can train on a CPU or small GPU.
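    To make the savings concrete, here is a rough back-of-the-envelope sketch (assuming rank r = 8 and adapters on each layer's query and value projections, which are common defaults rather than values from this lesson) of how few parameters LoRA adds to bert-base-uncased:

    # Rough LoRA parameter-count estimate for bert-base-uncased.
    # Assumptions (illustrative): rank r = 8, adapters on the query and
    # value projections of all 12 encoder layers.
    hidden_size = 768          # bert-base hidden dimension
    num_layers = 12            # encoder layers in bert-base
    r = 8                      # LoRA rank
    targeted_matrices = 2      # query and value projections per layer

    # Each adapted weight matrix gets two low-rank factors:
    # A (r x hidden) and B (hidden x r) -> r * (hidden + hidden) parameters.
    lora_params = num_layers * targeted_matrices * r * (hidden_size + hidden_size)
    base_params = 110_000_000  # approximate size of bert-base

    print(f"LoRA adapter parameters: {lora_params:,}")                  # ~295k
    print(f"Fraction of full model:  {lora_params / base_params:.3%}")  # ~0.27%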

    2. Setting Up Your BERT Training Script

    2.1 Create the Training File

    Create a new file called ‘train_bert.py’ in your src/ directory. We'll build this step by step.

    2.2 Import Required Libraries

    import pandas as pd

    from sklearn.model_selection import train_test_split

    from sklearn.metrics import (

        accuracy_score,

        precision_recall_fscore_support,

        classification_report,

    )

    from pathlib import Path

    from transformers import (

        BertTokenizerFast,

        BertForSequenceClassification,

        Trainer,

        TrainingArguments,

        EarlyStoppingCallback,

    )

    from datasets import Dataset

    import torch

    Key libraries:

    • transformers: HuggingFace library for BERT and training utilities

    • datasets: Efficient data handling for training

    • torch: PyTorch, the deep learning framework BERT uses

    2.3 Check Available Hardware

    Add this code to see what hardware you have available:

    # Check if GPU is available (if not, we'll use CPU)

    device = "cuda" if torch.cuda.is_available() else "cpu"

    print(f"Using device: {device}")

    if torch.cuda.is_available():

        for i in range(torch.cuda.device_count()):

            gpu_name = torch.cuda.get_device_name(i)

            gpu_memory = torch.cuda.get_device_properties(i).total_memory / 1024**3

            print(f"GPU {i}: {gpu_name} ({gpu_memory:.1f} GB)")

    else:

        print("No GPU detected. Training will use CPU")

    No GPU? Use Google Colab!

    If you don't have a GPU, Google Colab offers free GPU access that can speed up training 10-20x. Here is the link: colab.research.google.com. Instructions for Colab are out of scope for this lesson, but contact me if you'd like to learn more.

    3. Load and Prepare Your Data

    3.1 Load the Dataset

    data_path = Path(__file__).resolve().parent.parent / "data" / "processed-dataset.csv" # if your processed dataset is in a data directory

    df = pd.read_csv(data_path, encoding="latin-1")

    df = df[["cleaned_text", "label"]].dropna()

    # Convert labels to numeric (BERT expects 0 and 1)

    df["label"] = df["label"].map({"OR": 1, "CG": 0})

    3.2 Create Train/Test Split

    # 80% train, 20% test

    train_df, test_df = train_test_split(

        df, test_size=0.2, random_state=42, stratify=df["label"]

    )

    # Convert to HuggingFace Dataset format

    train_ds = Dataset.from_pandas(train_df)

    test_ds = Dataset.from_pandas(test_df)

    print(f"Train size: {len(train_ds)}")

    print(f"Test size: {len(test_ds)}")

    Why this split?

    • Train (80%): Model learns from this data

    • Test (20%): Final, unseen data to evaluate true performance

    Note: We're using a simpler 80/20 split instead of a full train/validation/test split. In this setup the test set is passed to the HuggingFace Trainer as its eval_dataset (see Section 6.3), so it doubles as the validation data during training. For a stricter setup you could hold out a separate validation set, as discussed in the troubleshooting section.

    4. Tokenization: Converting Text to Numbers

    BERT doesn't understand text directly. We need to convert it to numbers (tokens) that it can process.

    4.1 Load the Tokenizer

    tokenizer = BertTokenizerFast.from_pretrained("bert-base-uncased")

    This loads BERT's vocabulary (30,000+ words and subwords).

    4.2 Create Tokenization Function

    def tokenize(batch):

        return tokenizer(

            batch["cleaned_text"], 

            padding="max_length",    # Pad shorter reviews

            truncation=True,         # Cut off longer reviews

            max_length=512           # BERT's max length

        )

    # Apply tokenization to all datasets

    train_ds = train_ds.map(tokenize, batched=True)

    test_ds = test_ds.map(tokenize, batched=True)

    What's happening:

    • Each review is split into tokens (words/subwords)

    • Tokens are converted to IDs BERT understands

    • All reviews are padded/truncated to exactly 512 tokens
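    If you want to see what the tokenizer actually produces, here is a small optional check (the review text is just a made-up example):

    # Optional: inspect the tokenizer output for a single example review.
    sample = "This product exceeded my expectations, highly recommend!"

    tokens = tokenizer.tokenize(sample)
    encoded = tokenizer(sample, padding="max_length", truncation=True, max_length=512)

    print(tokens[:10])                # first few word/subword tokens
    print(encoded["input_ids"][:10])  # their numeric IDs (starts with [CLS] = 101)
    print(len(encoded["input_ids"]))  # always 512 after padding/truncation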

    5. Setting Up the BERT Model

    5.1 Load Pre-trained BERT

    model = BertForSequenceClassification.from_pretrained(

        "bert-base-uncased",

        num_labels=2,                          # Binary classification

        hidden_dropout_prob=0.2,               # Dropout for regularization

        attention_probs_dropout_prob=0.2,

    )

    # Move to CPU (if no GPU available)

    model = model.to(device)

    5.2 Freeze Lower Layers (Optional but Recommended)

    # Freeze first 6 layers, only train last 6 + classifier

    for name, param in model.bert.named_parameters():

        if any(f"encoder.layer.{i}." in name for i in range(6)):

            param.requires_grad = False

    Why freeze layers?

    • Lower layers learn general language patterns (grammar, syntax)

    • Upper layers learn task-specific patterns

    • Freezing lower layers speeds up training and prevents overfitting
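    To sanity-check the freeze, you can count trainable vs. total parameters right after the loop above (a quick optional check):

    # Optional sanity check: how many parameters are still trainable?
    trainable = sum(p.numel() for p in model.parameters() if p.requires_grad)
    total = sum(p.numel() for p in model.parameters())
    print(f"Trainable parameters: {trainable:,} / {total:,} ({trainable / total:.1%})")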

    6. Training Configuration

    6.1 Define Evaluation Metrics

    def compute_metrics(eval_pred):

        logits, labels = eval_pred

        preds = logits.argmax(axis=1)

        precision, recall, f1, _ = precision_recall_fscore_support(

            labels, preds, average="binary"

        )

        acc = accuracy_score(labels, preds)

        return {

            "accuracy": acc,

            "f1": f1,

            "precision": precision,

            "recall": recall,

        }

    6.2 Set Training Arguments

    training_args = TrainingArguments(

        output_dir="./bert_model",

        per_device_train_batch_size=8,         # Process 8 reviews at once

        per_device_eval_batch_size=8,

        num_train_epochs=3,                    # Train for 3 full passes

        learning_rate=1e-5,                    # Low learning rate for fine-tuning

        eval_strategy="epoch",                 # Evaluate after each epoch (each pass-through of the data)

        save_strategy="epoch",

        logging_steps=50,

        load_best_model_at_end=True,           # Keep best model, not last

        metric_for_best_model="eval_f1",       # Optimize for F1 score

        greater_is_better=True,

        warmup_steps=300,                      # Gradual learning rate increase

        weight_decay=0.01,                     # Regularization

        report_to="none",                      # Don't log to external services

        save_total_limit=2,                    # Only keep 2 best checkpoints

        dataloader_drop_last=True,

        remove_unused_columns=True,

    )

    Key parameters to understand:

    • learning_rate: How much to update weights each step (lower = more careful)

    • warmup_steps: Gradually increase learning rate (prevents early instability)

    • weight_decay: Penalizes large weights (prevents overfitting)
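    The Trainer builds the warmup schedule for you from warmup_steps, but if you're curious what warmup actually does to the learning rate, here is a small standalone sketch using transformers' linear warmup schedule (the dummy parameter and the 3000-step horizon are just for illustration):

    # Optional: watch the learning rate ramp up over the warmup steps,
    # then decay linearly. Uses a throwaway parameter purely for illustration.
    from transformers import get_linear_schedule_with_warmup
    import torch

    dummy = torch.nn.Parameter(torch.zeros(1))
    optimizer = torch.optim.AdamW([dummy], lr=1e-5)
    scheduler = get_linear_schedule_with_warmup(
        optimizer, num_warmup_steps=300, num_training_steps=3000
    )

    for step in range(3000):
        if step % 500 == 0:
            print(f"step {step:4d}: lr = {scheduler.get_last_lr()[0]:.2e}")
        optimizer.step()   # in real training this happens once per batch
        scheduler.step()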

    6.3 Create Trainer

    trainer = Trainer(

        model=model,

        args=training_args,

        train_dataset=train_ds,

        eval_dataset=test_ds,

        compute_metrics=compute_metrics,

        callbacks=[

            EarlyStoppingCallback(early_stopping_patience=2)

        ],

    )

    Early Stopping: If validation performance doesn't improve for 2 epochs, stop training (prevents overfitting).

    7. Train the Model

    7.1 Start Training

    print("Starting training...")

    trainer.train()

    What to expect:

    • With GPU: 15-30 minutes

    • With CPU: could be much longer…

    You'll see progress bars and metrics printed each epoch. Keep an eye on the validation F1 score; you want it to increase from epoch to epoch!
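    After training finishes, you can pull the per-epoch evaluation metrics back out of trainer.state.log_history (the Trainer's built-in log) to see the trend in one place:

    # Optional: print the validation F1 recorded at each evaluation.
    for entry in trainer.state.log_history:
        if "eval_f1" in entry:
            print(f"epoch {entry['epoch']:.1f}: eval_f1 = {entry['eval_f1']:.4f}")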

    7.2 Evaluate Performance

    # Evaluate on train and test datasets

    train_metrics = trainer.evaluate(train_ds)

    print("\nTraining set evaluation:")

    print(f"Accuracy: {train_metrics['eval_accuracy']:.4f}")

    print(f"F1: {train_metrics['eval_f1']:.4f}")

    test_metrics = trainer.evaluate(test_ds)

    print("\nTest set evaluation:")

    print(f"Accuracy: {test_metrics['eval_accuracy']:.4f}")

    print(f"F1: {test_metrics['eval_f1']:.4f}")

    7.3 Detailed Classification Report

    predictions = trainer.predict(test_ds)

    preds = predictions.predictions.argmax(axis=1)

    labels = predictions.label_ids

    print("\n" + "=" * 50)

    print("CLASSIFICATION REPORT")

    print("=" * 50)

    print(classification_report(labels, preds, target_names=["CG", "OR"]))

    8. Save Your Model

    # Save the model and tokenizer

    trainer.save_model("./bert_model_final")

    tokenizer.save_pretrained("./bert_model_final")

    print("Model saved successfully!")

    # Clear GPU memory (if using GPU)

    if torch.cuda.is_available():

        torch.cuda.empty_cache()

    Add to .gitignore:

    bert_model_final/ — models usually take up a lot of space and shouldn’t be pushed to your remote repo.

    9. BONUS: LoRA Fine-Tuning for Efficient Training

    NOTE: the video in the description below walks through example code in a Jupyter Notebook and fine-tunes with a GPU, so our workflow may look slightly different (e.g., importing libraries at the start instead of as they are used).

    Now we'll implement LoRA (Low-Rank Adaptation) to make training efficient enough for CPU or smaller GPUs. This is the recommended approach if you’re trying to train on a CPU. Watch the following 15-minute video and follow the example code along with the given hints to implement PEFT (Parameter-Efficient Fine-Tuning) with a LoRA function.

    9.0 Why LoRA?

    Traditional fine-tuning updates all 110M parameters in BERT. LoRA:

    • Freezes the original BERT weights

    • Adds tiny "adapter" layers (~0.5-1% of parameters)

    • Trains only these adapters

    • Result: 10-20x faster training, works on CPU, same performance!
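    To give you a rough picture of what the hints below add up to, here is a minimal sketch of wrapping the classification model with LoRA adapters via the peft library. The rank, alpha, and dropout values are illustrative; use the config from the video (with task_type changed to SEQ_CLS) for your actual run.

    # Minimal LoRA wrapping sketch (illustrative values; see the hints in 9.1).
    from transformers import AutoModelForSequenceClassification
    from peft import LoraConfig, get_peft_model, TaskType

    base_model = AutoModelForSequenceClassification.from_pretrained(
        "bert-base-uncased", num_labels=2
    )

    lora_config = LoraConfig(
        task_type=TaskType.SEQ_CLS,  # sequence classification
        r=8,                         # adapter rank (illustrative)
        lora_alpha=16,               # scaling factor (illustrative)
        lora_dropout=0.1,            # dropout on the adapter layers (illustrative)
    )

    peft_model = get_peft_model(base_model, lora_config)
    peft_model.print_trainable_parameters()  # only a tiny fraction of 110M is trainable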

    9.1 HINTS: Differences in video example code

    Base Model & Task Type

    model = AutoModelForSequenceClassification.from_pretrained("bert-base-uncased", ...) # TODO: only load_in_8bit if you’re using a GPU

    task_type="SEQ_CLS"  # Classification

    model.classifier = CastOutputToFloat(model.classifier)

    Data Processing

    # Load preprocessed dataset into a dataframe (df)

    df["label"] = df["label"].map({"OR": 1, "CG": 0})  # Binary labels

    # Train/test split

    train_df, test_df = train_test_split(df, test_size=0.2, random_state=42)

    # Explicit tokenization with max_length

    def tokenize_function(examples):

        return tokenizer(examples["cleaned_text"], 

                        truncation=True, padding=True, max_length=512)

    # Uses DataCollatorWithPadding (for classification)

    data_collator=DataCollatorWithPadding(tokenizer)

    Training Configuration

    max_steps=600,              # Longer training

    learning_rate=2e-5,         # Lower learning rate (10x smaller)

    warmup_steps=100,

    logging_steps=10,           # Less frequent logging

    eval_steps=50,              # Evaluates periodically

    save_steps=50,

    metric_for_best_model="f1", # Optimizes for F1

    # Custom metrics function

    def compute_metrics(eval_pred):

        # Computes accuracy, F1, precision, recall

    Model Preparation

    # Explicitly prepares model for quantized training

    model = prepare_model_for_kbit_training(model) # Only need this if you’re training with a GPU

    config = LoraConfig(...) # TODO: should be the same config as the video except for the task_type (SEQ_CLS)

    model = get_peft_model(model, config)

    Evaluation & Metrics

    # TODO: assign these in the Trainer() params

    eval_dataset=test_data,

    compute_metrics=compute_metrics,

    # Detailed final evaluation

    final_results = trainer.evaluate()

    print(f"Accuracy: {final_results['eval_accuracy']:.4f}")

    print(f"F1 Score: {final_results['eval_f1']:.4f}")

    # etc.

    Pushing to HuggingFace Hub

    You can push your model to the HF Hub or save it locally; it just depends on how you want to load it into your Streamlit app. Follow the code from our XGBoost implementation to save locally.
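    For reference, a short sketch of the two options (the output directory and repo name are placeholders, and pushing requires you to be logged in via huggingface-cli login):

    # Option 1: save the adapter weights and tokenizer locally.
    model.save_pretrained("./bert_lora_model")
    tokenizer.save_pretrained("./bert_lora_model")

    # Option 2: push to the HuggingFace Hub (requires `huggingface-cli login`).
    model.push_to_hub("your-username/bert-review-lora")
    tokenizer.push_to_hub("your-username/bert-review-lora")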

    10. Using Your Model in Streamlit

    Now let's integrate your trained BERT model into the Streamlit app!

    10.1 Update Your Streamlit App

    Create a copy of your ‘main.py’ (e.g., ‘main_bert.py’) so you can add BERT predictions without messing up your XGBoost app.

    Add these new imports at the top:

    from transformers import BertTokenizerFast, BertForSequenceClassification

    import torch

    Add this function to load your BERT model:

    @st.cache_resource

    def load_bert_model():

        """Load the trained BERT model"""

        model_path = "../scripts/bert_model_final"  # Adjust path as needed

        

        model = BertForSequenceClassification.from_pretrained(model_path)

        tokenizer = BertTokenizerFast.from_pretrained(model_path)

        

        # Move to CPU and set to evaluation mode

        model = model.to("cpu")

        model.eval()

        

        return model, tokenizer

    Add this prediction function:

    def predict_with_bert(text, model, tokenizer):

        """Make prediction using BERT model"""

        # Preprocess the text (reuse your existing function)

        processed_text = preprocess_text(text)

        

        # Tokenize

        inputs = tokenizer(

            processed_text,

            padding="max_length",

            truncation=True,

            max_length=512,

            return_tensors="pt",

        )

        

        # Make prediction

        with torch.no_grad():

            outputs = model(**inputs)

            logits = outputs.logits

        

        # Convert to probabilities

        probs = torch.nn.functional.softmax(logits, dim=1)

        prediction = torch.argmax(logits, dim=1).item()

        confidence = probs[0][prediction].item()

        

        # Map to labels (0=AI/CG, 1=Human/OR)

        label = "AI" if prediction == 0 else "Human"

        

        return label, confidence, probs[0].tolist()

    In your main() function, replace the model loading section:

    # Replace get_xgb_model() with:

    with st.spinner("Loading BERT model..."):

        bert_model, tokenizer = load_bert_model()

    st.success("✅ BERT model loaded successfully!")

    In your analyze button section, replace the xgb_predict call:

    # Replace this:

    # label, confidence, probabilities = xgb_predict(...)

    # With this:

    label, confidence, probabilities = predict_with_bert(

        input_review, bert_model, tokenizer

    )

    Note: Remove the category and rating inputs since BERT doesn't use those features. It only needs the text!

    10.2 Loading a LoRA Model (If You Used LoRA)

    If you completed the bonus LoRA training (Section 9), you'll need to load the model differently.

    Add this import:

    from peft import PeftModel, PeftConfig

    import torch.nn as nn

    Replace load_bert_model() with this LoRA version:

    @st.cache_resource

    def load_lora_model():

        """Load LoRA fine-tuned model"""

        model_path = "../scripts/bert_lora_model"  # Adjust path as needed

        

        # Load base model first

        config = PeftConfig.from_pretrained(model_path)

        model = BertForSequenceClassification.from_pretrained(

            config.base_model_name_or_path,

            num_labels=2,

            return_dict=True,

            torch_dtype=torch.float32,

        )

        

        tokenizer = BertTokenizerFast.from_pretrained(config.base_model_name_or_path)

        

        # Apply the same modifications as during training

        for param in model.parameters():

            param.requires_grad = False

            if param.ndim == 1:

                param.data = param.data.to(torch.float32)

        

        model.gradient_checkpointing_enable()

        model.enable_input_require_grads()

        

        # Cast classifier head to float32 for stability

        class CastOutputToFloat(nn.Sequential):

            def forward(self, x):

                return super().forward(x).to(torch.float32)

        

        model.classifier = CastOutputToFloat(model.classifier)

        

        # Load LoRA adapters

        model = PeftModel.from_pretrained(model, model_path)

        

        # Move to CPU

        model = model.to("cpu")

        model.eval()

        

        return model, tokenizer

    Then use load_lora_model() instead of load_bert_model() in your main function.

    Why the extra steps for LoRA?

    • Gradient checkpointing and float32 casting ensure the model loads correctly

    • These match the setup used during training

    • Without these, you might get errors or incorrect predictions

    10.3 Test Your App

    Run streamlit run main_bert.py (with your venv activated, of course).

    You should now have two separate apps: one for XGBoost and one for BERT!

    11. Troubleshooting Common Issues

    Issue 1: Training Too Slow on CPU

    Solutions:

    • Use LoRA (90% faster)

    • Reduce dataset size temporarily for testing

    • Train on Google Colab (free GPU)

    Issue 2: Overfitting Detected

    Solutions:

    • Add a validation set alongside your train and test sets. This lets you track your model's performance while it is being trained, and the validation metrics help you spot overfitting: train and validation scores should stay close. See the sketch after this list.

    • Increase dropout: hidden_dropout_prob=0.3

    • Reduce training epochs

    • Freeze more layers
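    Here is a minimal sketch of one way to carve out that validation set (splitting the 20% hold-out in half into validation and test is just one reasonable choice):

    # One possible train/validation/test split: 80% / 10% / 10%.
    train_df, temp_df = train_test_split(
        df, test_size=0.2, random_state=42, stratify=df["label"]
    )
    val_df, test_df = train_test_split(
        temp_df, test_size=0.5, random_state=42, stratify=temp_df["label"]
    )

    # Tokenize val_ds the same way as train_ds/test_ds, then pass it to the
    # Trainer as eval_dataset so training-time evaluation no longer touches
    # the final test set.
    val_ds = Dataset.from_pandas(val_df).map(tokenize, batched=True)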

    12. Wrapping Up

    By the end of this week, you should have:

    • A trained BERT model saved in ‘bert_model_final/’

    • Training metrics showing train/test performance

    • Understanding of how BERT differs from XGBoost

    • (Bonus) LoRA implementation for efficient training

    • Updated Streamlit app with BERT predictions

    13. Next Steps

    For Upcoming Weeks:

    • Deploy your best model (XGBoost or BERT)

    • Deploy your Streamlit app

    • Consider ensemble methods (combining both models!)

    • More model improvements…

    Extra Resources

    Great work diving into deep learning! You've now experienced both traditional ML (XGBoost) and modern deep learning (BERT). Understanding when to use each approach is a valuable skill in production ML systems.
