
Week 6 content for the Amazon Review Analyzer project
Welcome to this week's alternative solution approach! While you've been working with XGBoost (a tree-based model), we're now going to take this week to explore a completely different approach using BERT (Bidirectional Encoder Representations from Transformers). As you have learned, XGBoost must be trained on numerical data, while BERT is a powerful deep learning model that understands language context. We'll also learn about LoRA (Low-Rank Adaptation), a technique that lets us fine-tune large models efficiently on limited hardware.
XGBoost works with hand-crafted features (word counts, sentiment scores, etc.) while BERT learns patterns directly from raw text. Think of it this way:
XGBoost: You tell it what to look for (features)
BERT: It figures out what matters by understanding language itself
Both approaches have their place! XGBoost is faster and more interpretable, while BERT can capture complex language patterns. You can see why BERT might come in handy when we are dealing with review text.
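To make that contrast concrete, here is a tiny illustrative sketch. The review text, feature names, and values below are made up for illustration, not the project's actual feature set:

review = "Best purchase ever!!! This product changed my life, five stars!!!"

# XGBoost: you engineer a row of numeric features and hand it to the model
xgb_features = {
    "word_count": 10,            # hypothetical feature
    "exclamation_count": 6,      # hypothetical feature
    "sentiment_score": 0.94,     # hypothetical feature
}

# BERT: the raw text itself is the input; the model learns its own features
bert_input = review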
BERT is a transformer-based model pre-trained on massive amounts of text. It understands context by looking at words from both directions (hence "Bidirectional"). When we fine-tune BERT, we're teaching it to apply its language understanding to our specific task: detecting AI-generated reviews.
The Challenge: BERT has 110 million parameters. Training all of them requires:
Expensive GPUs
Long training times
Lots of memory
We will be using LoRA to help with this challenge!
LoRA (Low-Rank Adaptation) is a clever technique that:
Freezes the original BERT weights (no updates to those 110M parameters)
Adds small "adapter" layers (only a few million parameters)
Only trains these tiny adapters
Result: You get 95%+ of full fine-tuning performance while training 100x fewer parameters! This means you can train on a CPU or small GPU.
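Here is a rough sketch of the core idea. The shapes are illustrative (one 768x768 weight matrix, rank 8), not the exact setup the peft library applies to every BERT layer: instead of updating a large weight matrix W, LoRA learns two small matrices A and B whose product is added to W.

import torch

d, r = 768, 8

W = torch.randn(d, d)              # original weight, frozen (589,824 params)
A = torch.randn(r, d) * 0.01       # small trainable adapter (6,144 params)
B = torch.zeros(d, r)              # small trainable adapter (6,144 params)

# The forward pass uses W + B @ A; W itself never receives gradient updates.
delta_W = B @ A
W_effective = W + delta_W

trainable = A.numel() + B.numel()
print(f"Trainable: {trainable:,} of {W.numel():,} params "
      f"({trainable / W.numel():.1%})")

Because only A and B are trained, the optimizer state and gradients stay tiny, which is what makes CPU or small-GPU training practical.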
Create a new file called ‘train_bert.py’ in your src/ directory. We'll build this step by step.
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.metrics import (
accuracy_score,
precision_recall_fscore_support,
classification_report,
)
from pathlib import Path
from transformers import (
BertTokenizerFast,
BertForSequenceClassification,
Trainer,
TrainingArguments,
EarlyStoppingCallback,
)
from datasets import Dataset
import torch
Key libraries:
transformers: HuggingFace library for BERT and training utilities
datasets: Efficient data handling for training
torch: PyTorch, the deep learning framework BERT uses
Add this code to see what hardware you have available:
# Check if GPU is available (if not, we'll use CPU)
device = "cuda" if torch.cuda.is_available() else "cpu"
print(f"Using device: {device}")
if torch.cuda.is_available():
    for i in range(torch.cuda.device_count()):
        gpu_name = torch.cuda.get_device_name(i)
        gpu_memory = torch.cuda.get_device_properties(i).total_memory / 1024**3
        print(f"GPU {i}: {gpu_name} ({gpu_memory:.1f} GB)")
else:
    print("No GPU detected. Training will use CPU")
No GPU? Use Google Colab!
If you don't have a GPU, Google Colab offers free GPU access that can speed up training 10-20x. Here is the link: colab.research.google.com. Instructions for Colab are out of scope for this lesson, but contact me if you'd like to learn more.
data_path = Path(__file__).resolve().parent.parent / "data" / "processed-dataset.csv" # if your processed dataset is in a data directory
df = pd.read_csv(data_path, encoding="latin-1")
df = df[["cleaned_text", "label"]].dropna()
# Convert labels to numeric (BERT expects 0 and 1)
df["label"] = df["label"].map({"OR": 1, "CG": 0})
# 80% train, 20% test
train_df, test_df = train_test_split(
df, test_size=0.2, random_state=42, stratify=df["label"]
)
# Convert to HuggingFace Dataset format
train_ds = Dataset.from_pandas(train_df)
test_ds = Dataset.from_pandas(test_df)
print(f"Train size: {len(train_ds)}")
print(f"Test size: {len(test_ds)}")
Why this split?
Train (80%): Model learns from this data
Test (20%): Final, unseen data to evaluate true performance
Note: We're using a simpler 80/20 split instead of a full train/validation/test split. In this setup, the HuggingFace Trainer evaluates on the test set after each epoch, so the test set doubles as a validation set. Adding a dedicated validation split is one of the improvements suggested at the end of this lesson.
BERT doesn't understand text directly. We need to convert it to numbers (tokens) that it can process.
tokenizer = BertTokenizerFast.from_pretrained("bert-base-uncased")
This loads BERT's vocabulary (30,000+ words and subwords).
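You can check the vocabulary size and special tokens yourself with a quick snippet:

print(tokenizer.vocab_size)                                           # 30,522 for bert-base-uncased
print(tokenizer.cls_token, tokenizer.sep_token, tokenizer.pad_token)  # [CLS] [SEP] [PAD]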
def tokenize(batch):
    return tokenizer(
        batch["cleaned_text"],
        padding="max_length",  # Pad shorter reviews
        truncation=True,       # Cut off longer reviews
        max_length=512         # BERT's max length
    )
# Apply tokenization to all datasets
train_ds = train_ds.map(tokenize, batched=True)
test_ds = test_ds.map(tokenize, batched=True)
What's happening:
Each review is split into tokens (words/subwords)
Tokens are converted to IDs BERT understands
All reviews are padded/truncated to exactly 512 tokens
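If you're curious what this looks like for a single review, here is a quick sketch. The sample text is made up; the exact token IDs come from BERT's vocabulary:

sample = "this phone arrived quickly and works great"
encoded = tokenizer(sample, padding="max_length", truncation=True, max_length=512)

print(len(encoded["input_ids"]))                                   # 512 after padding
print(tokenizer.convert_ids_to_tokens(encoded["input_ids"][:8]))   # starts with [CLS]
print(encoded["attention_mask"][:8])                               # 1 = real token, 0 = padding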
model = BertForSequenceClassification.from_pretrained(
"bert-base-uncased",
num_labels=2, # Binary classification
hidden_dropout_prob=0.2, # Dropout for regularization
attention_probs_dropout_prob=0.2,
)
# Move the model to the device we detected earlier (GPU if available, else CPU)
model = model.to(device)

# Freeze first 6 layers, only train last 6 + classifier
for name, param in model.bert.named_parameters():
    if any(f"encoder.layer.{i}." in name for i in range(6)):
        param.requires_grad = False
Why freeze layers?
Lower layers learn general language patterns (grammar, syntax)
Upper layers learn task-specific patterns
Freezing lower layers speeds up training and prevents overfitting
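You can verify how much of the model is still trainable after the freeze. The exact counts depend on the transformers version, so treat the numbers as approximate:

# Count trainable vs. total parameters after freezing layers 0-5
trainable = sum(p.numel() for p in model.parameters() if p.requires_grad)
total = sum(p.numel() for p in model.parameters())
print(f"Trainable parameters: {trainable:,} / {total:,} ({trainable / total:.1%})")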
def compute_metrics(eval_pred):
    logits, labels = eval_pred
    preds = logits.argmax(axis=1)
    precision, recall, f1, _ = precision_recall_fscore_support(
        labels, preds, average="binary"
    )
    acc = accuracy_score(labels, preds)
    return {
        "accuracy": acc,
        "f1": f1,
        "precision": precision,
        "recall": recall,
    }
training_args = TrainingArguments(
output_dir="./bert_model",
per_device_train_batch_size=8, # Process 8 reviews at once
per_device_eval_batch_size=8,
num_train_epochs=3, # Train for 3 full passes
learning_rate=1e-5, # Low learning rate for fine-tuning
eval_strategy="epoch", # Evaluate after each epoch (each pass-through of the data)
save_strategy="epoch",
logging_steps=50,
load_best_model_at_end=True, # Keep best model, not last
metric_for_best_model="eval_f1", # Optimize for F1 score
greater_is_better=True,
warmup_steps=300, # Gradual learning rate increase
weight_decay=0.01, # Regularization
report_to=None, # Don't log to external services
save_total_limit=2, # Only keep 2 best checkpoints
dataloader_drop_last=True,
remove_unused_columns=True,
)
Key parameters to understand:
learning_rate: How much to update weights each step (lower = more careful)
warmup_steps: Gradually increase learning rate (prevents early instability)
weight_decay: Penalizes large weights (prevents overfitting)
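To build intuition for warmup_steps, here is a small standalone sketch of the linear warmup-then-decay schedule the Trainer uses by default. The total_steps value is a made-up number purely for illustration:

def linear_schedule_lr(step, base_lr=1e-5, warmup_steps=300, total_steps=3000):
    """Learning rate at a given step: ramp up, then decay linearly to zero."""
    if step < warmup_steps:
        return base_lr * step / warmup_steps
    return base_lr * max(0.0, (total_steps - step) / (total_steps - warmup_steps))

for step in (0, 150, 300, 1500, 3000):
    print(f"step {step:>4}: lr = {linear_schedule_lr(step):.2e}")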
trainer = Trainer(
model=model,
args=training_args,
train_dataset=train_ds,
eval_dataset=test_ds,
compute_metrics=compute_metrics,
callbacks=[
EarlyStoppingCallback(early_stopping_patience=2)
],
)
Early Stopping: If validation performance doesn't improve for 2 epochs, stop training (prevents overfitting).
print("Starting training...")
trainer.train()
What to expect:
With GPU: 15-30 minutes
With CPU: could be much longer…
You'll see progress bars and metrics printed each epoch. Watch the validation F1 score because you want it to increase!
# Evaluate on train and test datasets
train_metrics = trainer.evaluate(train_ds)
print("\nTraining set evaluation:")
print(f"Accuracy: {train_metrics['eval_accuracy']:.4f}")
print(f"F1: {train_metrics['eval_f1']:.4f}")
test_metrics = trainer.evaluate(test_ds)
print("\nTest set evaluation:")
print(f"Accuracy: {test_metrics['eval_accuracy']:.4f}")
print(f"F1: {test_metrics['eval_f1']:.4f}")
predictions = trainer.predict(test_ds)
preds = predictions.predictions.argmax(axis=1)
labels = predictions.label_ids
print("\n" + "=" * 50)
print("CLASSIFICATION REPORT")
print("=" * 50)
print(classification_report(labels, preds, target_names=["CG", "OR"]))
# Save the model and tokenizer
trainer.save_model("./bert_model_final")
tokenizer.save_pretrained("./bert_model_final")
print("Model saved successfully!")
# Clear GPU memory (if using GPU)
if torch.cuda.is_available():
    torch.cuda.empty_cache()
Add to .gitignore:
bert_model_final/ (saved models usually take up a lot of space and shouldn't be pushed to your remote repo).
NOTE: the video in the description below walks through example code in a Jupyter Notebook and fine-tunes with a GPU, so our workflow may look slightly different (e.g., importing libraries at the start instead of as they are used).
Now we'll implement LoRA (Low-Rank Adaptation) to make training efficient enough for CPU or smaller GPUs. This is the recommended approach if you’re trying to train on a CPU. Watch the following 15-minute video and follow the example code along with the given hints to implement PEFT (Parameter-Efficient Fine-Tuning) with a LoRA function.
Traditional fine-tuning updates all 110M parameters in BERT. LoRA:
Freezes the original BERT weights
Adds tiny "adapter" layers (~0.5-1% of parameters)
Trains only these adapters
Result: 10-20x faster training, works on CPU, same performance!
Base Model & Task Type
model = AutoModelForSequenceClassification.from_pretrained("bert-base-uncased", ...) # TODO: only load_in_8bit if you’re using a GPU
task_type="SEQ_CLS" # Classification
model.classifier = CastOutputToFloat(model.classifier)
Data Processing
# Load preprocessed dataset into a dataframe (df)
df["label"] = df["label"].map({"OR": 1, "CG": 0}) # Binary labels
# Train/test split
train_df, test_df = train_test_split(df, test_size=0.2, random_state=42)
# Explicit tokenization with max_length
def tokenize_function(examples):
    return tokenizer(examples["cleaned_text"],
                     truncation=True, padding=True, max_length=512)
# Uses DataCollatorWithPadding (for classification)
data_collator=DataCollatorWithPadding(tokenizer)
Training Configuration
max_steps=600, # Longer training
learning_rate=2e-5, # Lower learning rate (10x smaller)
warmup_steps=100,
logging_steps=10, # Less frequent logging
eval_steps=50, # Evaluates periodically
save_steps=50,
metric_for_best_model="f1", # Optimizes for F1
# Custom metrics function
def compute_metrics(eval_pred):
    # Computes accuracy, F1, precision, recall
    ...
Model Preparation
# Explicitly prepares model for quantized training
model = prepare_model_for_kbit_training(model) # Only need this if you’re training with a GPU
config = LoraConfig(...) # TODO: should be the same config as the video except for the task_type (SEQ_CLS)
model = get_peft_model(model, config)
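After wrapping the model with get_peft_model, you can confirm the parameter savings yourself. A quick check (the output format shown is approximate):

# Verify how few parameters LoRA actually trains
model.print_trainable_parameters()
# Expect roughly: trainable params under ~1M, all params ~110M, trainable% under 1

# You can also confirm the base BERT weights are frozen:
frozen = sum(p.numel() for p in model.parameters() if not p.requires_grad)
total = sum(p.numel() for p in model.parameters())
print(f"Frozen: {frozen:,} / {total:,} parameters")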
Evaluation & Metrics
# TODO: assign these in the Trainer() params
eval_dataset=test_data,
compute_metrics=compute_metrics,
# Detailed final evaluation
final_results = trainer.evaluate()
print(f"Accuracy: {final_results['eval_accuracy']:.4f}")
print(f"F1 Score: {final_results['eval_f1']:.4f}")
# etc.
Pushing to HuggingFace Hub
You can push your model to the HF Hub or save it locally; it just depends on how you want to load it into your Streamlit app. Follow the code from our XGBoost implementation to save locally.
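A minimal sketch of both options. The output directory and the repo name "your-username/review-bert-lora" are placeholders, not values from this project:

# Option 1: save locally (mirrors the XGBoost approach)
trainer.save_model("./bert_lora_model")
tokenizer.save_pretrained("./bert_lora_model")

# Option 2: push to the HuggingFace Hub (run `huggingface-cli login` first)
# model.push_to_hub("your-username/review-bert-lora")       # placeholder repo name
# tokenizer.push_to_hub("your-username/review-bert-lora")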
Now let's integrate your trained BERT model into the Streamlit app!
Create a copy of your ‘main.py’ to include BERT predictions without messing up your XGBoost app.
Add these new imports at the top:
from transformers import BertTokenizerFast, BertForSequenceClassification
import torch
Add this function to load your BERT model:
@st.cache_resource
def load_bert_model():
    """Load the trained BERT model"""
    model_path = "../scripts/bert_model_final"  # Adjust path as needed
    model = BertForSequenceClassification.from_pretrained(model_path)
    tokenizer = BertTokenizerFast.from_pretrained(model_path)

    # Move to CPU and set to evaluation mode
    model = model.to("cpu")
    model.eval()
    return model, tokenizer
Add this prediction function:
def predict_with_bert(text, model, tokenizer):
    """Make prediction using BERT model"""
    # Preprocess the text (reuse your existing function)
    processed_text = preprocess_text(text)

    # Tokenize
    inputs = tokenizer(
        processed_text,
        padding="max_length",
        truncation=True,
        max_length=512,
        return_tensors="pt",
    )

    # Make prediction
    with torch.no_grad():
        outputs = model(**inputs)
        logits = outputs.logits

    # Convert to probabilities
    probs = torch.nn.functional.softmax(logits, dim=1)
    prediction = torch.argmax(logits, dim=1).item()
    confidence = probs[0][prediction].item()

    # Map to labels (0=AI/CG, 1=Human/OR)
    label = "AI" if prediction == 0 else "Human"
    return label, confidence, probs[0].tolist()
In your main() function, replace the model loading section:
# Replace get_xgb_model() with:
with st.spinner("Loading BERT model..."):
    bert_model, tokenizer = load_bert_model()
st.success("✅ BERT model loaded successfully!")
In your analyze button section, replace the xgb_predict call:
# Replace this:
# label, confidence, probabilities = xgb_predict(...)
# With this:
label, confidence, probabilities = predict_with_bert(
input_review, bert_model, tokenizer
)
Note: Remove the category and rating inputs since BERT doesn't use those features. It only needs the text!
If you completed the bonus LoRA training (Section 9), you'll need to load the model differently.
Add these imports:
from peft import PeftModel, PeftConfig
import torch.nn as nn
Replace load_bert_model() with this LoRA version:
@st.cache_resource
def load_lora_model():
    """Load LoRA fine-tuned model"""
    model_path = "../scripts/bert_lora_model"  # Adjust path as needed

    # Load base model first
    config = PeftConfig.from_pretrained(model_path)
    model = BertForSequenceClassification.from_pretrained(
        config.base_model_name_or_path,
        num_labels=2,
        return_dict=True,
        torch_dtype=torch.float32,
    )
    tokenizer = BertTokenizerFast.from_pretrained(config.base_model_name_or_path)

    # Apply the same modifications as during training
    for param in model.parameters():
        param.requires_grad = False
        if param.ndim == 1:
            param.data = param.data.to(torch.float32)

    model.gradient_checkpointing_enable()
    model.enable_input_require_grads()

    # Cast classifier head to float32 for stability
    class CastOutputToFloat(nn.Sequential):
        def forward(self, x):
            return super().forward(x).to(torch.float32)

    model.classifier = CastOutputToFloat(model.classifier)

    # Load LoRA adapters
    model = PeftModel.from_pretrained(model, model_path)

    # Move to CPU and set to evaluation mode
    model = model.to("cpu")
    model.eval()
    return model, tokenizer
Then use load_lora_model() instead of load_bert_model() in your main function.
Why the extra steps for LoRA?
Gradient checkpointing and float32 casting ensure the model loads correctly
Here is a video on Exploding Gradients to help understand this concept a bit more
These match the setup used during training
Without these, you might get errors or incorrect predictions
Run streamlit run main_bert.py (with your venv activated, of course).
You should now have two separate apps: one for XGBoost and one for BERT!
If training is taking too long, here are some solutions:
Use LoRA (90% faster)
Reduce dataset size temporarily for testing
Train on Google Colab (free GPU)
If your model is overfitting (train scores much higher than test scores), here are some solutions:
Add a validation set alongside your train and test sets. This lets you track your model's performance as it trains, and the validation metrics help you spot overfitting: train and validation scores should stay close (see the sketch after this list).
Increase dropout: hidden_dropout_prob=0.3
Reduce training epochs
Freeze more layers
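If you want to add that validation set, here is a minimal sketch of a 70/15/15 split. The percentages are just an example, and it assumes the same df and label column used earlier:

from sklearn.model_selection import train_test_split

# First carve off 30% of the data, then split that 30% in half
train_df, temp_df = train_test_split(
    df, test_size=0.30, random_state=42, stratify=df["label"]
)
val_df, test_df = train_test_split(
    temp_df, test_size=0.50, random_state=42, stratify=temp_df["label"]
)
# Pass the validation set to the Trainer as eval_dataset, and keep test_df
# for a single final evaluation after training.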
By the end of this week, you should have:
A trained BERT model saved in ‘bert_model_final/’
Training metrics showing train/test performance
Understanding of how BERT differs from XGBoost
(Bonus) LoRA implementation for efficient training
Updated Streamlit app with BERT predictions
For Upcoming Weeks:
Deploy your best model (XGBoost or BERT)
Deploy your Streamlit app
Consider ensemble methods (combining both models!)
More model improvements…
Great work diving into deep learning! You've now experienced both traditional ML (XGBoost) and modern deep learning (BERT). Understanding when to use each approach is a valuable skill in production ML systems.