Amazon Review Analyzer Week 1


    Week 1 content for the Amazon Review Analyzer project

    By AI Club on 9/18/2025

    Week 1: Getting Started With Your Amazon Review Analyzer

    Welcome to the Amazon Review Analyzer project! We're so glad you've decided to join us. This week, we're setting up everything you need to start training your own model that classifies whether an Amazon review is computer-generated or written by a human. It's a known problem that many Amazon reviews are fake or AI-generated, and we're building this model to help detect them. Don't worry if you're new to programming; we'll go through each step together this first week. As the project progresses and your skills grow, however, less and less code will be provided.

    1. Setting Up Your Development Environment

    Now, let's set up your computer for a Python project. We will be using Python, Streamlit, XGBoost, LangChain, and more. Don't worry if you're unfamiliar with these tools because each will be introduced in detail throughout the following weeks.

    1.1 Installing Python

    For Windows:

    1. Go to https://www.python.org/downloads/ 

    2. Download Python 3.11 or later for your operating system.

    3. Run the installer. On Windows, make sure to check "Add Python to PATH."

    4. Verify the installation:

      • Open a command prompt or terminal.

      • Type python --version and press "Enter".

      • You should see the Python version number.

    For Mac: Python is usually installed by default. Open a terminal and type python3 --version to confirm that it is installed and to check the version; if it is older than 3.11, install a newer version from the downloads page above.
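    If you'd like to double-check from inside Python itself, here is a tiny optional script (it works the same on Windows and Mac; run it with python or python3):

    import sys

    # Print the interpreter's version and warn if it is older than 3.11
    print(sys.version)
    if sys.version_info < (3, 11):
        print("This project expects Python 3.11 or later.")
    else:
        print("Python version looks good!")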

    1.2 Setting up your repo

    1. In GitHub, make a new repository. Give it Public visibility.

    2. Go to that repo and click the "Code" tab, then copy the link within the "Quick setup" box

    3. In the IDE of your choice (I would recommend VS Code because that is what was used to make this project), clone the repo by pasting the link you copied in the previous step. If you're using VS Code, you can do this by clicking the "Clone Git Repository" button and pasting the link in the input field. Then choose a folder on your computer for the project to be cloned into.

    4. Create a ".gitignore" file at the root of your project. The root is the folder you just cloned into, which will be something like C:\...\...\Documents\GitHubProjects\review-analyzer depending on where you saved your project and what you named it. Then, simply add venv/ to the file. This sets us up for the next steps by keeping your virtual environment files out of the repo, because package and environment files should never be part of source control with git.

    1.3 Setting up your virtual environment

    Virtual environments help keep your projects organized. They allow you to install libraries and packages specific to this project without affecting your system's Python installation.

    1. Open a command prompt or terminal in the VS Code window for your project.

    2. Create a virtual environment:

      • On Windows: python -m venv venv at the root of your project

      • On Mac: python3 -m venv venv

    3. Activate the virtual environment:

      • On Windows: venv\Scripts\activate

      • On Mac: source venv/bin/activate
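    To confirm the virtual environment is actually active, you can run this small standard-library check with python (or python3 on Mac) while the environment is activated:

    import sys

    # When a virtual environment is active, sys.prefix points at the venv folder
    # rather than at the system-wide Python installation.
    print("Python is running from:", sys.prefix)
    print("Inside a virtual environment:", sys.prefix != sys.base_prefix)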

    1.4 Installing required libraries

    Now that your virtual environment is active, let's install the necessary libraries.

    1. Ensure you're in your project root directory with the virtual environment activated.

    2. Download this requirements.txt file provided, which has a list of necessary packages for the project.

    3. Drag this file into the root of your project. Install the libraries using pip: pip install -r requirements.txt (this works on both Windows and Mac)

    4. Verify the installations: pip list

    You should see libraries like xgboost and accelerate listed.
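    As an extra sanity check, you can also confirm that Python can actually import a few of the installed packages. This is just an optional sketch and assumes the provided requirements.txt includes pandas, xgboost, and accelerate; if an import fails, the error message will name the missing package:

    # Quick import test for a few packages from requirements.txt
    import pandas
    import xgboost
    import accelerate

    print("pandas:", pandas.__version__)
    print("xgboost:", xgboost.__version__)
    print("accelerate:", accelerate.__version__)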

    2. Download the Dataset and Start Performing Exploratory Data Analysis

    2.1 Downloading the dataset

    This is the dataset that we will be using to train our model. Download it and drag it into the root of your project. You can then open the .csv file and examine the data to make sure it looks right. The column labeled "label" has values of either "CG" (Computer-Generated) or "OR" (Original/human review). The labels will be required for training and testing your model later.
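    If you want a quick programmatic look before the EDA section, a short pandas snippet like the one below (using the same file name you will see in section 2.2) counts how many reviews carry each label. Treat it as an optional preview:

    import pandas as pd

    # Load the dataset from the project root and count the two label values.
    # You should see only "CG" (computer-generated) and "OR" (original) rows.
    df = pd.read_csv("fake reviews dataset.csv")
    print(df["label"].value_counts())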

    2.2 Exploratory Data Analysis (EDA)

    EDA is one of the first steps that Data Analysts perform when starting a project with new data. We won't make you perform this whole process on your own, though, because it can be quite daunting with thousands of records of new data. However, we will teach you a little more about this process:

    1. Watch at least the first 12 minutes of this video (I recommend watching the whole thing if you have time) as a tutorial on EDA

    2. Create a new file named eda_starter.py. Copy the following code snippets into it and save/run them step by step with py ./eda_starter.py on Windows or python3 eda_starter.py on Mac. Make sure to take a look at the output after each run.

    import pandas as pd

    # Load your dataset (make sure the path is correct)

    df = pd.read_csv("fake reviews dataset.csv")

    • Pandas is a library used for data manipulation and analysis

    # Quick look at the first rows

    print(df.head())

    # Count missing values in each column

    print(df.isnull().sum())

    # Character length of each review

    df["char_length"] = df["text_"].apply(len)

    # Word count of each review

    df["word_count"] = df["text_"].str.split().apply(len)

      • The two lines above add two new columns to our data: char_length and word_count

    import seaborn as sns # To visualize data

    import matplotlib.pyplot as plt # To create and display plots

    sns.boxplot(x="char_length", y="label", data=df)

    plt.title("Character Length of Reviews by Label")

    plt.xlabel("Character length")

    plt.ylabel("Review label")

    plt.show()

    3. BONUS: Come up with some other plots and features to create for the dataset

    For example, you could create a plot comparing how many fake vs. real reviews there are for each rating (1.0, 2.0, …). You could also simply change the x-axis of the boxplot above to be word_count instead of char_length and examine the difference.
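    Here is one possible sketch of both ideas. It assumes the star ratings live in a column named rating (open the CSV header to confirm the exact name) and it recomputes word_count so the snippet can run on its own:

    import pandas as pd
    import seaborn as sns
    import matplotlib.pyplot as plt

    df = pd.read_csv("fake reviews dataset.csv")
    df["word_count"] = df["text_"].str.split().apply(len)

    # How many CG vs. OR reviews exist at each star rating
    # (assumes the column is named "rating")
    sns.countplot(x="rating", hue="label", data=df)
    plt.title("Number of Reviews by Rating and Label")
    plt.show()

    # Same boxplot as before, but with word_count on the x-axis
    sns.boxplot(x="word_count", y="label", data=df)
    plt.title("Word Count of Reviews by Label")
    plt.xlabel("Word count")
    plt.ylabel("Review label")
    plt.show()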

    Note: The features being added now may not be the same ones we use to train the model. Right now we are just exploring these features. You will see next week how we add the finalized features to our processed dataset to perform model training.

    Wrapping Up

    By the end of this week, you should have:

    1. Understood what kind of model we are aiming to build.

    2. Set up a Python development environment with a virtual environment.

    3. Downloaded and explored a dataset of computer-generated and human-written Amazon reviews using EDA principles.

    Next week:

    We will begin engineering more features and categories that we want our model to use to classify reviews. Hint: another good feature for the purpose of this project would be something like review length.

    Great job getting everything set up!

