
Week 6 content for the Amazon Review Analyzer project
Welcome to this week's alternative solution approach! While you've been working with XGBoost (a tree-based model), we're now going to take this week to explore a completely different approach using BERT (Bidirectional Encoder Representations from Transformers). As you have learned, XGBoost must be trained on numerical data, while BERT is a powerful deep learning model that understands language context. We'll also learn about LoRA (Low-Rank Adaptation), a technique that lets us fine-tune large models efficiently on limited hardware.
XGBoost works with hand-crafted features (word counts, sentiment scores, etc.) while BERT learns patterns directly from raw text. Think of it this way:
XGBoost: You tell it what to look for (features)
BERT: It figures out what matters by understanding language itself
Both approaches have their place! XGBoost is faster and more interpretable, while BERT can capture complex language patterns. You can see why BERT might come in handy when we are dealing with review text.
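To make that contrast concrete, here is a tiny illustrative sketch. The review text, feature names, and values below are made up for illustration, not the project's actual feature set:

review = "Best purchase ever!!! This product changed my life, five stars!!!"

# XGBoost: you engineer a row of numeric features and hand it to the model
xgb_features = {
    "word_count": 10,            # hypothetical feature
    "exclamation_count": 6,      # hypothetical feature
    "sentiment_score": 0.94,     # hypothetical feature
}

# BERT: the raw text itself is the input; the model learns its own features
bert_input = review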
BERT is a transformer-based model pre-trained on massive amounts of text. It understands context by looking at words from both directions (hence "Bidirectional"). When we fine-tune BERT, we're teaching it to apply its language understanding to our specific task: detecting AI-generated reviews.
The Challenge: BERT has 110 million parameters. Training all of them requires:
Expensive GPUs
Long training times
Lots of memory
We will be using LoRA to help with this challenge!
LoRA (Low-Rank Adaptation) is a clever technique that:
Freezes the original BERT weights (no updates to those 110M parameters)
Adds small "adapter" layers (only a few million parameters)
Only trains these tiny adapters
Result: You get 95%+ of full fine-tuning performance while training 100x fewer parameters! This means you can train on a CPU or small GPU.
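Here is a rough sketch of the core idea. The shapes are illustrative (one 768x768 weight matrix, rank 8), not the exact setup the peft library applies to every BERT layer: instead of updating a large weight matrix W, LoRA learns two small matrices A and B whose product is added to W.

import torch

d, r = 768, 8

W = torch.randn(d, d)              # original weight, frozen (589,824 params)
A = torch.randn(r, d) * 0.01       # small trainable adapter (6,144 params)
B = torch.zeros(d, r)              # small trainable adapter (6,144 params)

# The forward pass uses W + B @ A; W itself never receives gradient updates.
delta_W = B @ A
W_effective = W + delta_W

trainable = A.numel() + B.numel()
print(f"Trainable: {trainable:,} of {W.numel():,} params "
      f"({trainable / W.numel():.1%})")

Because only A and B are trained, the optimizer state and gradients stay tiny, which is what makes CPU or small-GPU training practical.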
Create a new file called ‘train_bert.py’ in your src/ directory. We'll build this step by step.
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.metrics import (
accuracy_score,
precision_recall_fscore_support,
classification_report,
)
from pathlib import Path
from transformers import (
BertTokenizerFast,
BertForSequenceClassification,
Trainer,
TrainingArguments,
EarlyStoppingCallback,
)
from datasets import Dataset
import torch
Key libraries:
transformers: HuggingFace library for BERT and training utilities
datasets: Efficient data handling for training
torch: PyTorch, the deep learning framework BERT uses
Add this code to see what hardware you have available:
# Check if GPU is available (if not, we'll use CPU)
device = "cuda" if torch.cuda.is_available() else "cpu"
print(f"Using device: {device}")
if torch.cuda.is_available():
    for i in range(torch.cuda.device_count()):
        gpu_name = torch.cuda.get_device_name(i)
        gpu_memory = torch.cuda.get_device_properties(i).total_memory / 1024**3
        print(f"GPU {i}: {gpu_name} ({gpu_memory:.1f} GB)")
else:
    print("No GPU detected. Training will use CPU")
No GPU? Use Google Colab!
If you don't have a GPU, Google Colab offers free GPU access that can speed up training 10-20x. Here is the link: colab.research.google.com. Instructions for Colab are out of scope for this lesson, but contact me if you'd like to learn more.
data_path = Path(__file__).resolve().parent.parent / "data" / "processed-dataset.csv" # if your processed dataset is in a data directory
df = pd.read_csv(data_path, encoding="latin-1")
df = df[["cleaned_text", "label"]].dropna()
# Convert labels to numeric (BERT expects 0 and 1)
df["label"] = df["label"].map({"OR": 1, "CG": 0})
# 80% train, 20% test
train_df, test_df = train_test_split(
df, test_size=0.2, random_state=42, stratify=df["label"]
)
# Convert to HuggingFace Dataset format
train_ds = Dataset.from_pandas(train_df)
test_ds = Dataset.from_pandas(test_df)
print(f"Train size: {len(train_ds)}")
print(f"Test size: {len(test_ds)}")
Why this split?
Train (80%): Model learns from this data
Test (20%): Final, unseen data to evaluate true performance
Note: We're using a simpler 80/20 split instead of a full train/validation/test split. In this setup, the HuggingFace Trainer evaluates on the test set after each epoch, so the test set doubles as a validation set. Adding a dedicated validation split is one of the improvements suggested at the end of this lesson.
BERT doesn't understand text directly. We need to convert it to numbers (tokens) that it can process.
tokenizer = BertTokenizerFast.from_pretrained("bert-base-uncased")
This loads BERT's vocabulary (30,000+ words and subwords).
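You can check the vocabulary size and special tokens yourself with a quick snippet:

print(tokenizer.vocab_size)                                           # 30,522 for bert-base-uncased
print(tokenizer.cls_token, tokenizer.sep_token, tokenizer.pad_token)  # [CLS] [SEP] [PAD]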
def tokenize(batch):
    return tokenizer(
        batch["cleaned_text"],
        padding="max_length",  # Pad shorter reviews
        truncation=True,       # Cut off longer reviews
        max_length=512         # BERT's max length
    )
# Apply tokenization to all datasets
train_ds = train_ds.map(tokenize, batched=True)
test_ds = test_ds.map(tokenize, batched=True)
What's happening:
Each review is split into tokens (words/subwords)
Tokens are converted to IDs BERT understands
All reviews are padded/truncated to exactly 512 tokens
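If you're curious what this looks like for a single review, here is a quick sketch. The sample text is made up; the exact token IDs come from BERT's vocabulary:

sample = "this phone arrived quickly and works great"
encoded = tokenizer(sample, padding="max_length", truncation=True, max_length=512)

print(len(encoded["input_ids"]))                                   # 512 after padding
print(tokenizer.convert_ids_to_tokens(encoded["input_ids"][:8]))   # starts with [CLS]
print(encoded["attention_mask"][:8])                               # 1 = real token, 0 = padding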
model = BertForSequenceClassification.from_pretrained(
"bert-base-uncased",
num_labels=2, # Binary classification
hidden_dropout_prob=0.2, # Dropout for regularization
attention_probs_dropout_prob=0.2,
)
# Move the model to the device we detected earlier (GPU if available, else CPU)
model = model.to(device)

# Freeze first 6 layers, only train last 6 + classifier
for name, param in model.bert.named_parameters():
    if any(f"encoder.layer.{i}." in name for i in range(6)):
        param.requires_grad = False
Why freeze layers?
Lower layers learn general language patterns (grammar, syntax)
Upper layers learn task-specific patterns
Freezing lower layers speeds up training and prevents overfitting
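You can verify how much of the model is still trainable after the freeze. The exact counts depend on the transformers version, so treat the numbers as approximate:

# Count trainable vs. total parameters after freezing layers 0-5
trainable = sum(p.numel() for p in model.parameters() if p.requires_grad)
total = sum(p.numel() for p in model.parameters())
print(f"Trainable parameters: {trainable:,} / {total:,} ({trainable / total:.1%})")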
def compute_metrics(eval_pred):
    logits, labels = eval_pred
    preds = logits.argmax(axis=1)
    precision, recall, f1, _ = precision_recall_fscore_support(
        labels, preds, average="binary"
    )
    acc = accuracy_score(labels, preds)
    return {
        "accuracy": acc,
        "f1": f1,
        "precision": precision,
        "recall": recall,
    }
training_args = TrainingArguments(
output_dir="./bert_model",
per_device_train_batch_size=8, # Process 8 reviews at once
per_device_eval_batch_size=8,
num_train_epochs=3, # Train for 3 full passes
learning_rate=1e-5, # Low learning rate for fine-tuning
eval_strategy="epoch", # Evaluate after each epoch (each pass-through of the data)
save_strategy="epoch",
logging_steps=50,
load_best_model_at_end=True, # Keep best model, not last
metric_for_best_model="eval_f1", # Optimize for F1 score
greater_is_better=True,
warmup_steps=300, # Gradual learning rate increase
weight_decay=0.01, # Regularization
report_to=None, # Don't log to external services
save_total_limit=2, # Only keep 2 best checkpoints
dataloader_drop_last=True,
remove_unused_columns=True,
)
Key parameters to understand:
learning_rate: How much to update weights each step (lower = more careful)
warmup_steps: Gradually increase learning rate (prevents early instability)
weight_decay: Penalizes large weights (prevents overfitting)
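To build intuition for warmup_steps, here is a small standalone sketch of the linear warmup-then-decay schedule the Trainer uses by default. The total_steps value is a made-up number purely for illustration:

def linear_schedule_lr(step, base_lr=1e-5, warmup_steps=300, total_steps=3000):
    """Learning rate at a given step: ramp up, then decay linearly to zero."""
    if step < warmup_steps:
        return base_lr * step / warmup_steps
    return base_lr * max(0.0, (total_steps - step) / (total_steps - warmup_steps))

for step in (0, 150, 300, 1500, 3000):
    print(f"step {step:>4}: lr = {linear_schedule_lr(step):.2e}")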
trainer = Trainer(
model=model,
args=training_args,
train_dataset=train_ds,
eval_dataset=test_ds,
compute_metrics=compute_metrics,
callbacks=[
EarlyStoppingCallback(early_stopping_patience=2)
],
)
Early Stopping: If validation performance doesn't improve for 2 epochs, stop training (prevents overfitting).
print("Starting training...")
trainer.train()
What to expect:
With GPU: 15-30 minutes
With CPU: could be much longer…
You'll see progress bars and metrics printed each epoch. Watch the validation F1 score because you want it to increase!
# Evaluate on train and test datasets
train_metrics = trainer.evaluate(train_ds)
print("\nTraining set evaluation:")
print(f"Accuracy: {train_metrics['eval_accuracy']:.4f}")
print(f"F1: {train_metrics['eval_f1']:.4f}")
test_metrics = trainer.evaluate(test_ds)
print("\nTest set evaluation:")
print(f"Accuracy: {test_metrics['eval_accuracy']:.4f}")
print(f"F1: {test_metrics['eval_f1']:.4f}")
predictions = trainer.predict(test_ds)
preds = predictions.predictions.argmax(axis=1)
labels = predictions.label_ids
print("\n" + "=" * 50)
print("CLASSIFICATION REPORT")
print("=" * 50)
print(classification_report(labels, preds, target_names=["CG", "OR"]))
# Save the model and tokenizer
trainer.save_model("./bert_model_final")
tokenizer.save_pretrained("./bert_model_final")
print("Model saved successfully!")
# Clear GPU memory (if using GPU)
if torch.cuda.is_available():
    torch.cuda.empty_cache()
Add to .gitignore:
bert_model_final/ (saved models usually take up a lot of space and shouldn't be pushed to your remote repo).
NOTE: the video in the description below walks through example code in a Jupyter Notebook and fine-tunes with a GPU, so our workflow may look slightly different (e.g., importing libraries at the start instead of as they are used).
Now we'll implement LoRA (Low-Rank Adaptation) to make training efficient enough for CPU or smaller GPUs. This is the recommended approach if you’re trying to train on a CPU. Watch the following 15-minute video and follow the example code along with the given hints to implement PEFT (Parameter-Efficient Fine-Tuning) with a LoRA function.
Traditional fine-tuning updates all 110M parameters in BERT. LoRA:
Freezes the original BERT weights
Adds tiny "adapter" layers (~0.5-1% of parameters)
Trains only these adapters
Result: 10-20x faster training, works on CPU, same performance!
Base Model & Task Type
model = AutoModelForSequenceClassification.from_pretrained("bert-base-uncased", ...) # TODO: only load_in_8bit if you’re using a GPU
task_type="SEQ_CLS" # Classification
model.classifier = CastOutputToFloat(model.classifier)
Data Processing
# Load preprocessed dataset into a dataframe (df)
df["label"] = df["label"].map({"OR": 1, "CG": 0}) # Binary labels
# Train/test split
train_df, test_df = train_test_split(df, test_size=0.2, random_state=42)
# Explicit tokenization with max_length
def tokenize_function(examples):
    return tokenizer(examples["cleaned_text"],
                     truncation=True, padding=True, max_length=512)
# Uses DataCollatorWithPadding (for classification)
data_collator=DataCollatorWithPadding(tokenizer)
Training Configuration
max_steps=600, # Longer training
learning_rate=2e-5, # Lower learning rate (10x smaller)
warmup_steps=100,
logging_steps=10, # Less frequent logging
eval_steps=50, # Evaluates periodically
save_steps=50,
metric_for_best_model="f1", # Optimizes for F1
# Custom metrics function
def compute_metrics(eval_pred):
    # Computes accuracy, F1, precision, recall
    ...
Model Preparation
# Explicitly prepares model for quantized training
model = prepare_model_for_kbit_training(model) # Only need this if you’re training with a GPU
config = LoraConfig(...) # TODO: should be the same config as the video except for the task_type (SEQ_CLS)
model = get_peft_model(model, config)
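After wrapping the model with get_peft_model, you can confirm the parameter savings yourself. A quick check (the output format shown is approximate):

# Verify how few parameters LoRA actually trains
model.print_trainable_parameters()
# Expect roughly: trainable params under ~1M, all params ~110M, trainable% under 1

# You can also confirm the base BERT weights are frozen:
frozen = sum(p.numel() for p in model.parameters() if not p.requires_grad)
total = sum(p.numel() for p in model.parameters())
print(f"Frozen: {frozen:,} / {total:,} parameters")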
Evaluation & Metrics
# TODO: assign these in the Trainer() params
eval_dataset=test_data,
compute_metrics=compute_metrics,
# Detailed final evaluation
final_results = trainer.evaluate()
print(f"Accuracy: {final_results['eval_accuracy']:.4f}")
print(f"F1 Score: {final_results['eval_f1']:.4f}")
# etc.
Pushing to HuggingFace Hub
You can push your model to the HF Hub or save it locally; it just depends on how you want to load it into your Streamlit app. Follow the code from our XGBoost implementation to save locally.
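A minimal sketch of both options. The output directory and the repo name "your-username/review-bert-lora" are placeholders, not values from this project:

# Option 1: save locally (mirrors the XGBoost approach)
trainer.save_model("./bert_lora_model")
tokenizer.save_pretrained("./bert_lora_model")

# Option 2: push to the HuggingFace Hub (run `huggingface-cli login` first)
# model.push_to_hub("your-username/review-bert-lora")       # placeholder repo name
# tokenizer.push_to_hub("your-username/review-bert-lora")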
Now let's integrate your trained BERT model into the Streamlit app!
Create a copy of your ‘main.py’ to include BERT predictions without messing up your XGBoost app.
Add these new imports at the top:
from transformers import BertTokenizerFast, BertForSequenceClassification
import torch
Add this function to load your BERT model:
@st.cache_resource
def load_bert_model():
    """Load the trained BERT model"""
    model_path = "../scripts/bert_model_final"  # Adjust path as needed
    model = BertForSequenceClassification.from_pretrained(model_path)
    tokenizer = BertTokenizerFast.from_pretrained(model_path)

    # Move to CPU and set to evaluation mode
    model = model.to("cpu")
    model.eval()
    return model, tokenizer
Add this prediction function:
def predict_with_bert(text, model, tokenizer):
    """Make prediction using BERT model"""
    # Preprocess the text (reuse your existing function)
    processed_text = preprocess_text(text)

    # Tokenize
    inputs = tokenizer(
        processed_text,
        padding="max_length",
        truncation=True,
        max_length=512,
        return_tensors="pt",
    )

    # Make prediction
    with torch.no_grad():
        outputs = model(**inputs)
        logits = outputs.logits

    # Convert to probabilities
    probs = torch.nn.functional.softmax(logits, dim=1)
    prediction = torch.argmax(logits, dim=1).item()
    confidence = probs[0][prediction].item()

    # Map to labels (0=AI/CG, 1=Human/OR)
    label = "AI" if prediction == 0 else "Human"
    return label, confidence, probs[0].tolist()
In your main() function, replace the model loading section:
# Replace get_xgb_model() with:
with st.spinner("Loading BERT model..."):
    bert_model, tokenizer = load_bert_model()
st.success("✅ BERT model loaded successfully!")
In your analyze button section, replace the xgb_predict call:
# Replace this:
# label, confidence, probabilities = xgb_predict(...)
# With this:
label, confidence, probabilities = predict_with_bert(
input_review, bert_model, tokenizer
)
Note: Remove the category and rating inputs since BERT doesn't use those features. It only needs the text!
If you completed the bonus LoRA training (Section 9), you'll need to load the model differently.
Add these imports:
from peft import PeftModel, PeftConfig
import torch.nn as nn
Replace load_bert_model() with this LoRA version:
@st.cache_resource
def load_lora_model():
    """Load LoRA fine-tuned model"""
    model_path = "../scripts/bert_lora_model"  # Adjust path as needed

    # Load base model first
    config = PeftConfig.from_pretrained(model_path)
    model = BertForSequenceClassification.from_pretrained(
        config.base_model_name_or_path,
        num_labels=2,
        return_dict=True,
        torch_dtype=torch.float32,
    )
    tokenizer = BertTokenizerFast.from_pretrained(config.base_model_name_or_path)

    # Apply the same modifications as during training
    for param in model.parameters():
        param.requires_grad = False
        if param.ndim == 1:
            param.data = param.data.to(torch.float32)

    model.gradient_checkpointing_enable()
    model.enable_input_require_grads()

    # Cast classifier head to float32 for stability
    class CastOutputToFloat(nn.Sequential):
        def forward(self, x):
            return super().forward(x).to(torch.float32)

    model.classifier = CastOutputToFloat(model.classifier)

    # Load LoRA adapters
    model = PeftModel.from_pretrained(model, model_path)

    # Move to CPU and set to evaluation mode
    model = model.to("cpu")
    model.eval()
    return model, tokenizer
Then use load_lora_model() instead of load_bert_model() in your main function.
Why the extra steps for LoRA?
Gradient checkpointing and float32 casting ensure the model loads correctly
Here is a video on Exploding Gradients to help understand this concept a bit more
These match the setup used during training
Without these, you might get errors or incorrect predictions
Run streamlit run main_bert.py (with your venv activated, of course).
You should now have two separate apps: one for XGBoost and one for BERT!
If training is taking too long, here are some solutions:
Use LoRA (90% faster)
Reduce dataset size temporarily for testing
Train on Google Colab (free GPU)
If your model is overfitting (train scores much higher than test scores), here are some solutions:
Add a validation set alongside your train and test sets. This lets you track your model's performance as it trains, and the validation metrics help you spot overfitting: train and validation scores should stay close (see the sketch after this list).
Increase dropout: hidden_dropout_prob=0.3
Reduce training epochs
Freeze more layers
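If you want to add that validation set, here is a minimal sketch of a 70/15/15 split. The percentages are just an example, and it assumes the same df and label column used earlier:

from sklearn.model_selection import train_test_split

# First carve off 30% of the data, then split that 30% in half
train_df, temp_df = train_test_split(
    df, test_size=0.30, random_state=42, stratify=df["label"]
)
val_df, test_df = train_test_split(
    temp_df, test_size=0.50, random_state=42, stratify=temp_df["label"]
)
# Pass the validation set to the Trainer as eval_dataset, and keep test_df
# for a single final evaluation after training.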
By the end of this week, you should have:
A trained BERT model saved in ‘bert_model_final/’
Training metrics showing train/test performance
Understanding of how BERT differs from XGBoost
(Bonus) LoRA implementation for efficient training
Updated Streamlit app with BERT predictions
For Upcoming Weeks:
Deploy your best model (XGBoost or BERT)
Deploy your Streamlit app
Consider ensemble methods (combining both models!)
More model improvements…
Great work diving into deep learning! You've now experienced both traditional ML (XGBoost) and modern deep learning (BERT). Understanding when to use each approach is a valuable skill in production ML systems.