A sentence transformer is a powerful deep learning model that transforms sentences into high-quality, fixed-length embeddings that capture their semantic meaning. These embeddings are useful for various natural language processing (NLP) tasks such as text classification, clustering, semantic search, and information retrieval.
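For example, the following minimal sketch (using the publicly available paraphrase-MiniLM-L6-v2 checkpoint, which we also use later in this post, and hypothetical input sentences) shows how sentences are encoded into fixed-length vectors:
from sentence_transformers import SentenceTransformer
# Minimal sketch: encode two sentences into fixed-length embeddings
model = SentenceTransformer('paraphrase-MiniLM-L6-v2')
embeddings = model.encode([
    'Wooden building blocks set for toddlers',
    'Stainless steel charcoal grill for outdoor barbecue',
])
print(embeddings.shape)  # (2, 384): one 384-dimensional vector per sentence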
In this post, we show you how to fine-tune a sentence transformer specifically for classifying Amazon products into their product categories (such as toys or sporting goods). We introduce two different sentence transformers, paraphrase-MiniLM-L6-v2 and Amazon's proprietary large language model (LLM) M5_ASIN_SMALL_V2.0, and compare their results. M5 LLMs are BERT-based LLMs fine-tuned on Amazon's internal product catalog data using product titles, bullet points, descriptions, and more. They are currently used for use cases such as automated product classification and similar product recommendations. Our hypothesis is that M5_ASIN_SMALL_V2.0, because it is fine-tuned on Amazon product data, will perform better for the Amazon product categorization use case. We test this hypothesis in the experiment described in this post.
Solution overview
This post demonstrates how to fine-tune a sentence transformer with Amazon product data and how to use the resulting sentence transformer to improve classification accuracy of product categories using an XGBoost decision tree. For this demonstration, we use a public Amazon product dataset called Amazon Product Dataset 2020 from Kaggle. This dataset contains the following attributes and fields:
- Domain name – Amazon.com
- Date range – January 1, 2020, to January 31, 2020
- File extension – CSV
- Available fields – Unique ID, Product Name, Brand Name, Asin, Category, UPC Ean Code, List Price, Selling Price, Quantity, Model Number, About the Product, Product Specifications, Technical Details, Shipping Weight, Product Dimensions, Image, Variation, SKU, Product URL, stock, product details, dimensions, color, ingredients, usage instructions, Amazon seller, size and quantity variations, product description
- Label field – Category
Prerequisites
Before you begin, install the following packages. You can do this in an Amazon SageMaker notebook or your local Jupyter notebook by running the following commands:
!pip install sentencepiece --quiet
!pip install sentence_transformers --quiet
!pip install xgboost --quiet
!pip install scikit-learn --quiet
Data preprocessing
The first step in fine-tuning a sentence transformer is to preprocess the Amazon product data so the sentence transformer can consume the data and fine-tune effectively. This involves normalizing the text data, defining the product's main category by extracting the first category from the Category field, and selecting the most important fields from the dataset that contribute to classifying the product's main category accurately. Use the following code for preprocessing:
import pandas as pd
from sklearn.preprocessing import LabelEncoder
data = pd.read_csv('marketing_sample_for_amazon_com-ecommerce__20200101_20200131__10k_data.csv')
data.columns = data.columns.str.lower().str.replace(' ', '_')
data['main_category'] = data['category'].str.split("|").str[0]
data["all_text"] = data.apply(
    lambda r: " ".join(
        [
            str(r["product_name"]) if pd.notnull(r["product_name"]) else "",
            str(r["about_product"]) if pd.notnull(r["about_product"]) else "",
            str(r["product_specification"]) if pd.notnull(r["product_specification"]) else "",
            str(r["technical_details"]) if pd.notnull(r["technical_details"]) else ""
        ]
    ),
    axis=1
)
label_encoder = LabelEncoder()
labels_transform = label_encoder.fit_transform(data['main_category'])
data['label'] = labels_transform
data[['all_text', 'label']]
The following screenshot is an example of what the dataset looks like after preprocessing.
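To verify the preprocessing, you can inspect the processed columns with a short check like the following (column names follow the preprocessing code above):
print(data[['all_text', 'label']].head())
print(data['main_category'].value_counts())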
Fine-tune the sentence transformer paraphrase-MiniLM-L6-v2
The first sentence transformer we fine-tune is paraphrase-MiniLM-L6-v2. It uses the popular BERT model as its underlying architecture to transform product description text into a 384-dimensional dense vector embedding, which is then consumed by the XGBoost classifier for product category classification. Use the following code to fine-tune paraphrase-MiniLM-L6-v2 using the preprocessed Amazon product data:
from sentence_transformers import SentenceTransformer
model_name="paraphrase-MiniLM-L6-v2"
model = SentenceTransformer(model_name)
The first step is to define a classification head that represents the 24 product categories into which an Amazon product can be classified. This classification head is used to train the sentence transformer specifically to transform the product descriptions more effectively according to the 24 product categories. The idea is that product descriptions within the same category should be transformed into vector embeddings that are closer in distance to each other than product descriptions belonging to different categories.
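To illustrate this idea, the following minimal sketch (with hypothetical product descriptions) compares cosine similarities of stock embeddings; fine-tuning aims to push same-category descriptions even closer together:
from sentence_transformers import SentenceTransformer, util
model = SentenceTransformer('paraphrase-MiniLM-L6-v2')
toy_a = model.encode("Wooden building blocks set for toddlers", convert_to_tensor=True)
toy_b = model.encode("Colorful stacking blocks toy for kids", convert_to_tensor=True)
grill = model.encode("Stainless steel charcoal grill for outdoor barbecue", convert_to_tensor=True)
# The same-category pair should score higher than the cross-category pair
print("toy vs. toy:  ", util.cos_sim(toy_a, toy_b).item())
print("toy vs. grill:", util.cos_sim(toy_a, grill).item())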
The following code is the first part of fine-tuning the sentence transformer:
import torch.nn as nn
# Define the classification head
class ClassificationHead(nn.Module):
    def __init__(self, embedding_dim, num_classes):
        super(ClassificationHead, self).__init__()
        self.linear = nn.Linear(embedding_dim, num_classes)

    def forward(self, features):
        x = features['sentence_embedding']
        x = self.linear(x)
        return x
# Define the number of classes for a classification task.
num_classes = 24
print('class number:', num_classes)
classification_head = ClassificationHead(model.get_sentence_embedding_dimension(), num_classes)
# Combine the SentenceTransformer model and the classification head
class SentenceTransformerWithHead(nn.Module):
    def __init__(self, transformer, head):
        super(SentenceTransformerWithHead, self).__init__()
        self.transformer = transformer
        self.head = head

    def forward(self, input):
        features = self.transformer(input)
        logits = self.head(features)
        return logits

model_with_head = SentenceTransformerWithHead(model, classification_head)
Next, set the fine-tuning parameters. In this post, we train for five epochs, optimize for cross-entropy loss, and use the AdamW optimization method. We chose five epochs because, after testing different values, we observed that the loss was minimized at epoch 5, making it the optimal number of training iterations for achieving the best classification results.
The following code is the second part of fine-tuning the sentence transformer:
import os
os.environ["TORCH_USE_CUDA_DSA"] = "1"
os.environ["CUDA_LAUNCH_BLOCKING"] = "1"
from sentence_transformers import SentenceTransformer, InputExample, LoggingHandler
import torch
from torch.utils.data import DataLoader
from transformers import AdamW, get_linear_schedule_with_warmup
train_sentences = data['all_text']
train_labels = data['label']
# training parameters
num_epochs = 5
batch_size = 2
learning_rate = 2e-5
# Convert the dataset into a list of InputExample objects
train_examples = [InputExample(texts=[s], label=l) for s, l in zip(train_sentences, train_labels)]
# Custom collate_fn to convert InputExample objects into tensors
def collate_fn(batch):
    texts = [example.texts[0] for example in batch]
    labels = torch.tensor([example.label for example in batch])
    return texts, labels
train_dataloader = DataLoader(train_examples, shuffle=True, batch_size=batch_size, collate_fn=collate_fn)
# Define the loss function, optimizer, and learning rate scheduler.
criterion = nn.CrossEntropyLoss()
optimizer = AdamW(model_with_head.parameters(), lr=learning_rate)
total_steps = len(train_dataloader) * num_epochs
scheduler = get_linear_schedule_with_warmup(optimizer, num_warmup_steps=0, num_training_steps=total_steps)
# Training loop
loss_list = []
for epoch in range(num_epochs):
    model_with_head.train()
    for step, (texts, labels) in enumerate(train_dataloader):
        labels = labels.to(model.device)
        optimizer.zero_grad()
        # Encode text and pass through the classification head
        inputs = model.tokenize(texts)
        input_ids = inputs['input_ids'].to(model.device)
        input_attention_mask = inputs['attention_mask'].to(model.device)
        inputs_final = {'input_ids': input_ids, 'attention_mask': input_attention_mask}
        # Move model_with_head to the same device
        model_with_head = model_with_head.to(model.device)
        logits = model_with_head(inputs_final)
        loss = criterion(logits, labels)
        loss.backward()
        optimizer.step()
        scheduler.step()
        if step % 100 == 0:
            print(f"Epoch {epoch}, Step {step}, Loss: {loss.item()}")
    print(f'Epoch {epoch+1}/{num_epochs}, Loss: {loss.item()}')
    model_save_path = f'./intermediate-output/epoch-{epoch}'
    model.save(model_save_path)
    loss_list.append(loss.item())
# Save the final model
model_final_save_path = "st_ft_epoch_5"
model.save(model_final_save_path)
To observe whether the resulting fine-tuned sentence transformer improves the accuracy of product category classification, use it as a text embedder in the XGBoost classifier in the next step.
XGBoost classification
XGBoost (Extreme Gradient Boosting) is a machine learning technique used for classification tasks. It is an implementation of the gradient boosting framework designed to be efficient, flexible, and portable. In this post, we have XGBoost consume the product description text embedding output of the sentence transformer and observe its product category classification accuracy. Use the following code to use the standard paraphrase-MiniLM-L6-v2 sentence transformer, before it is fine-tuned, to categorize Amazon products into their respective categories:
from sklearn.model_selection import train_test_split
import xgboost as xgb
from sklearn.metrics import accuracy_score
model = SentenceTransformer('paraphrase-MiniLM-L6-v2')
data['text_embedding'] = data['all_text'].apply(lambda x: model.encode(str(x)))
text_embeddings = pd.DataFrame(data['text_embedding'].tolist(), index=data.index, dtype=float)
# Convert numeric columns stored as strings to floats
numeric_columns = ['selling_price', 'shipping_weight', 'product_dimensions']  # Add more columns as needed
for col in numeric_columns:
    data[col] = pd.to_numeric(data[col], errors="coerce")
# Convert categorical columns to category type
categorical_columns = ['model_number', 'is_amazon_seller']  # Add more columns as needed
for col in categorical_columns:
    data[col] = data[col].astype('category')
X_0 = data[['selling_price', 'model_number', 'is_amazon_seller']]
X = pd.concat([X_0, text_embeddings], axis=1)
label_encoder = LabelEncoder()
data['main_category_encoded'] = label_encoder.fit_transform(data['main_category'])
y = data['main_category_encoded']
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
# Re-encode the labels to ensure they are consecutive integers starting from 0
unique_labels = sorted(set(y_train) | set(y_test))
label_mapping = {label: idx for idx, label in enumerate(unique_labels)}
y_train = y_train.map(label_mapping)
y_test = y_test.map(label_mapping)
# Enable categorical support for XGBoost
dtrain = xgb.DMatrix(X_train, label=y_train, enable_categorical=True)
dtest = xgb.DMatrix(X_test, label=y_test, enable_categorical=True)
param = {
'max_depth': 6,
'eta': 0.3,
'objective': 'multi:softmax',
'num_class': len(label_mapping),
'eval_metric': 'mlogloss'
}
num_round = 100
bst = xgb.train(param, dtrain, num_round)
# Evaluate the model
y_pred = bst.predict(dtest)
accuracy = accuracy_score(y_test, y_pred)
print(f'Accuracy: {accuracy:.2f}')
Accuracy: 0.78
We observe an accuracy of 78% when using the stock paraphrase-MiniLM-L6-v2 sentence transformer. To observe the results of the fine-tuned paraphrase-MiniLM-L6-v2 sentence transformer, update the beginning of the code as follows; all other code remains the same.
model = SentenceTransformer('st_ft_epoch_5')
data['text_embedding_finetuned'] = data['all_text'].apply(lambda x: model.encode(str(x)))
text_embeddings = pd.DataFrame(data['text_embedding_finetuned'].tolist(), index=data.index, dtype=float)
X_pa_finetuned = pd.concat([X_0, text_embeddings], axis=1)
X_train, X_test, y_train, y_test = train_test_split(X_pa_finetuned, y, test_size=0.2, random_state=42)
# Re-encode the labels to ensure they are consecutive integers starting from 0
unique_labels = sorted(set(y_train) | set(y_test))
label_mapping = {label: idx for idx, label in enumerate(unique_labels)}
y_train = y_train.map(label_mapping)
y_test = y_test.map(label_mapping)
# Build and train the XGBoost model
# Enable categorical support for XGBoost
dtrain = xgb.DMatrix(X_train, label=y_train, enable_categorical=True)
dtest = xgb.DMatrix(X_test, label=y_test, enable_categorical=True)
param = {
'max_depth': 6,
'eta': 0.3,
'objective': 'multi:softmax',
'num_class': len(label_mapping),
'eval_metric': 'mlogloss'
}
num_round = 100
bst = xgb.train(param, dtrain, num_round)
y_pred = bst.predict(dtest)
accuracy = accuracy_score(y_test, y_pred)
print(f'Accuracy: {accuracy:.2f}')
# Optionally, convert the predicted labels back to the original category labels
inverse_label_mapping = {idx: label for label, idx in label_mapping.items()}
y_pred_labels = pd.Series(y_pred).map(inverse_label_mapping)
Accuracy: 0.94
With the fine-tuned paraphrase-MiniLM-L6-v2 sentence transformer, we observe an accuracy of 94%, an increase of 16% from the baseline accuracy of 78%. From this observation, we conclude that fine-tuning paraphrase-MiniLM-L6-v2 is effective for classifying Amazon product data into product categories.
Fine-tune the sentence transformer M5_ASIN_SMALL_V20
Next, we build a sentence transformer from the BERT-based model M5_ASIN_SMALL_V2.0. It is a 40-million-parameter BERT-based model trained by M5, an internal team at Amazon that specializes in fine-tuning LLMs with Amazon product data. It is distilled from a larger teacher model (approximately 5 billion parameters), which was pre-trained on a large amount of unlabeled ASIN data and then pre-fine-tuned on a set of Amazon supervised learning tasks (multi-task pre-fine-tuning). It is a multi-task, multilingual, multi-locale, and multimodal BERT-based encoder-only model trained on text and structured data inputs. The details of its neural network architecture are as follows (an illustrative configuration sketch follows the list).
Model backbone:
- Hidden size: 384
- Number of hidden layers: 24
- Number of attention heads: 16
- Intermediate size: 1536
- Vocabulary size: 256,035
- Number of backbone parameters: 42,587,904
- Number of word embedding parameters (bert.embedding.*): 98,517,504
- Total number of parameters: 141,259,023
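The following is a hypothetical configuration sketch that mirrors the backbone dimensions listed above; the actual M5_ASIN_SMALL_V2.0 checkpoint is Amazon-internal and is not reproduced by this config:
from transformers import BertConfig
# Illustrative only: a BERT encoder config with the backbone dimensions listed above
m5_like_config = BertConfig(
    hidden_size=384,
    num_hidden_layers=24,
    num_attention_heads=16,
    intermediate_size=1536,
    vocab_size=256035,
)
print(m5_like_config)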
Because M5_ASIN_SMALL_V2.0 was pre-trained specifically on Amazon product data, we hypothesize that building a sentence transformer from it will improve product category classification accuracy. We perform the following steps to build a sentence transformer from M5_ASIN_SMALL_V2.0, fine-tune it, and input it into an XGBoost classifier to observe its impact on accuracy:
1. Load the pre-trained M5 model to use as the base encoder.
2. Use the M5 model within the SentenceTransformer framework to create a sentence transformer.
3. Add a pooling layer to create fixed-size sentence embeddings from the variable-length output of the BERT model.
4. Combine the M5 model and the pooling layer into a single model.
5. Fine-tune the model on a relevant dataset.
See the code below for steps 1-3.
from sentence_transformers import models
from transformers import AutoTokenizer
# Step 1: Load Pre-trained M5 Model
model_path="M5_ASIN_SMALL_V20" # or your custom model path
transformer_model = models.Transformer(model_path)
tokenizer = AutoTokenizer.from_pretrained(model_path)
# Step 2: Define Pooling Layer
pooling_model = models.Pooling(transformer_model.get_word_embedding_dimension(),
pooling_mode_mean_tokens=True)
# Step 3: Create SentenceTransformer Model
model_mean_m5_base = SentenceTransformer(modules=[transformer_model, pooling_model])
The rest of the code remains the same as for fine-tuning paraphrase-MiniLM-L6-v2, except that we use the fine-tuned M5 sentence transformer to create the text embeddings for the dataset:
loaded_model = SentenceTransformer('m5_ft_epoch_5_mean')
data['text_embedding_m5'] = data['all_text'].apply(lambda x: loaded_model.encode(str(x)))
Results
We observe results similar to paraphrase-MiniLM-L6-v2 when checking accuracy before fine-tuning: an accuracy of 78% is observed for M5_ASIN_SMALL_V2.0. However, the fine-tuned M5_ASIN_SMALL_V2.0 sentence transformer performs better than the fine-tuned paraphrase-MiniLM-L6-v2, with an accuracy of 98% compared to 94%. We fine-tuned both sentence transformers for 5 epochs because our experiments showed this was the optimal number to minimize loss. The following graph summarizes our observations of accuracy improvement from 5 epochs of fine-tuning in a single comparison graph.
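As a textual stand-in for that graph, the following sketch plots the accuracies reported in this post (78% and 94% for paraphrase-MiniLM-L6-v2, 78% and 98% for M5_ASIN_SMALL_V2.0); it assumes matplotlib is available, which is not part of the prerequisites listed earlier:
import matplotlib.pyplot as plt
# Accuracies as reported in this post
labels = ['paraphrase-MiniLM-L6-v2', 'M5_ASIN_SMALL_V2.0']
base_accuracy = [0.78, 0.78]        # stock sentence transformer
finetuned_accuracy = [0.94, 0.98]   # after 5 epochs of fine-tuning
x = range(len(labels))
plt.bar([i - 0.2 for i in x], base_accuracy, width=0.4, label='Before fine-tuning')
plt.bar([i + 0.2 for i in x], finetuned_accuracy, width=0.4, label='After fine-tuning')
plt.xticks(list(x), labels)
plt.ylabel('Product category classification accuracy')
plt.legend()
plt.show()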
Clean up
We recommend using GPU instances (such as ml.g5.4xlarge or ml.g4dn.16xlarge) to fine-tune the sentence transformers. Be sure to clean up your resources to avoid incurring additional costs.
If you are using SageMaker notebook instances, see Clean up Amazon SageMaker notebook instance resources. If you use Amazon SageMaker Studio, see Delete or Stop Running Instances, Applications, and Spaces in Studio.
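If you prefer to clean up a notebook instance programmatically, a minimal sketch using boto3 follows (the instance name is hypothetical):
import boto3
sm = boto3.client('sagemaker')
# Stop the notebook instance used for this walkthrough (name is hypothetical)
sm.stop_notebook_instance(NotebookInstanceName='sentence-transformer-finetuning')
# Once the instance has stopped, it can be deleted:
# sm.delete_notebook_instance(NotebookInstanceName='sentence-transformer-finetuning')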
Conclusion
In this post, we explored sentence transformers and how to use them effectively for text classification tasks. We dove deep into the paraphrase-MiniLM-L6-v2 sentence transformer, demonstrated how to use a BERT-based model such as M5_ASIN_SMALL_V2.0 to create a sentence transformer, showed how to fine-tune sentence transformers, and showed how fine-tuning sentence transformers affects classification accuracy.
Fine-tuning sentence transformers has proven to be highly effective for classifying product descriptions into categories, significantly improving prediction accuracy. As a next step, we encourage you to try out different sentence transformers from Hugging Face.
Finally, if you want to explore M5, note that M5 is proprietary to Amazon and, at the time of this publication, can only be accessed as an Amazon partner or customer. If you are an Amazon partner or customer interested in using M5, connect with your Amazon representative, who will guide you through M5's offerings and how to use them for your use case.
About the authors
Kara Jan He is a data scientist with AWS Professional Services in the San Francisco Bay Area and has extensive experience in AI/ML. She specializes in leveraging cloud computing, machine learning, and generative AI to help clients address complex business challenges across a variety of industries. Kara is passionate about innovation and continuous learning.
Farshad Harirchi is a Principal Data Scientist at AWS Professional Services. He helps clients in a variety of industries, from retail to industrial and financial services, with the design and development of generative AI and machine learning solutions. Farshad brings extensive experience across the machine learning and MLOps stack. Outside of work, he enjoys traveling, outdoor sports, and exploring board games.
James Poquise is a data scientist with AWS Professional Services based in Orange County, California. He holds a bachelor's degree in computer science from the University of California, Irvine and has several years of experience working in the data domain in various roles. Currently, he works on implementing and deploying scalable ML solutions to achieve business outcomes for AWS clients.