Email Classification with Hugging Face Transformers

Transformer models have revolutionized NLP, but are you applying them to your daily workflow? In this guide, I'll show you how to leverage Hugging Face's powerful models to build an email classifier that learns the difference between important messages and digital noise—with just a few lines of code.

In today’s digital age, we’re constantly bombarded with emails. For businesses and individuals alike, sorting through hundreds of incoming messages is time-consuming and inefficient. This is where Natural Language Processing (NLP) comes to the rescue. In this post, I’ll walk you through building a practical email classification system with state-of-the-art transformer models from Hugging Face, one that automatically sorts incoming messages into meaningful categories.

Project Overview

We’ll build a system that automatically categorizes emails into different classes such as:

  • Promotional
  • Personal
  • Updates/Notifications
  • Spam
  • Inquiries

By the end of this tutorial, you’ll have a working email classifier that you can integrate into your workflow or customize for more specific categories.

Prerequisites

  • Basic knowledge of Python
  • Understanding of NLP concepts
  • Familiarity with PyTorch (helpful but not required)
  • A Google Colab account or a local Python environment

Part 1: Setting Up Our Environment

Let’s start by installing the necessary libraries:

Python
# Install required packages
# (torch and accelerate are needed for training; Colab typically ships with them preinstalled)
!pip install transformers datasets pandas numpy scikit-learn torch accelerate
  

Now, let’s import our dependencies:

Python
import pandas as pd
import numpy as np
import torch
from transformers import AutoTokenizer, AutoModelForSequenceClassification
from transformers import Trainer, TrainingArguments
from datasets import Dataset
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score, precision_recall_fscore_support
  

Part 2: Preparing the Dataset

For this tutorial, we’ll use a synthetic email dataset. In a real-world scenario, you might want to use your own labeled emails or an open-source dataset.

Python
# Load data from a public email classification dataset
# You can download the Enron Email Dataset from: https://www.kaggle.com/datasets/wcukierski/enron-email-dataset
# Or use the Lingspam dataset: https://www.kaggle.com/datasets/mandygu/lingspam-dataset

# For this tutorial, we'll use a simplified version with sample data
# In practice, you'd load your dataset like this:
# df = pd.read_csv('path_to_your_dataset.csv')

emails = [
    {"text": "Congratulations! You've won a free vacation to Hawaii. Click here to claim now!", "label": 3},  # Spam
    {"text": "Hi Sarah, can we meet for coffee this weekend? Let me know what works for you.", "label": 1},  # Personal
    {"text": "Your monthly subscription has been renewed. Your next billing date is June 15.", "label": 2},  # Update
    {"text": "FLASH SALE: 50% off all items for the next 24 hours only!", "label": 0},  # Promotional
    {"text": "I'm interested in your services. Could you please provide more information about pricing?", "label": 4},  # Inquiry
    # Add more examples...
]

# Convert to DataFrame
df = pd.DataFrame(emails)

# Display the first few rows
print(df.head())
  

Now, let’s split our data into training and validation sets:

Python
# Split into training and validation sets
# (with a real, larger dataset, consider stratify=df["label"] to preserve class balance;
# our five-example toy set is too small for that and yields a single validation email)
train_df, val_df = train_test_split(df, test_size=0.2, random_state=42)

# Convert to Hugging Face Datasets (preserve_index=False drops the stray pandas index column)
train_dataset = Dataset.from_pandas(train_df, preserve_index=False)
val_dataset = Dataset.from_pandas(val_df, preserve_index=False)
  

Part 3: Fine-tuning a Pre-trained Model

We’ll use a pre-trained BERT model from Hugging Face and fine-tune it for our classification task:

Python
# Define label mapping
id2label = {
    0: "Promotional",
    1: "Personal",
    2: "Update/Notification",
    3: "Spam",
    4: "Inquiry"
}
label2id = {v: k for k, v in id2label.items()}

# Load tokenizer and model
model_name = "distilbert-base-uncased"  # A smaller version of BERT, faster to train
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForSequenceClassification.from_pretrained(
    model_name, 
    num_labels=5,
    id2label=id2label,
    label2id=label2id
)

# Tokenize the data
def tokenize_function(examples):
    return tokenizer(examples["text"], padding="max_length", truncation=True, max_length=128)

tokenized_train = train_dataset.map(tokenize_function, batched=True)
tokenized_val = val_dataset.map(tokenize_function, batched=True)
  

Next, we’ll set up our training arguments and train the model:

Python
# Define metrics computation function
def compute_metrics(pred):
    labels = pred.label_ids
    preds = pred.predictions.argmax(-1)
    precision, recall, f1, _ = precision_recall_fscore_support(labels, preds, average='weighted')
    acc = accuracy_score(labels, preds)
    return {
        'accuracy': acc,
        'f1': f1,
        'precision': precision,
        'recall': recall
    }

# Define training arguments
training_args = TrainingArguments(
    output_dir="./results",
    learning_rate=2e-5,
    per_device_train_batch_size=16,
    per_device_eval_batch_size=16,
    num_train_epochs=3,
    weight_decay=0.01,
    evaluation_strategy="epoch",  # note: renamed to eval_strategy in newer transformers releases
    save_strategy="epoch",
    load_best_model_at_end=True,
    push_to_hub=False,
)

# Initialize trainer
trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=tokenized_train,
    eval_dataset=tokenized_val,
    compute_metrics=compute_metrics,
)

# Train the model
trainer.train()
  

Part 4: Evaluating Our Model

Let’s evaluate our model on the validation set:

Python
# Evaluate the model
eval_results = trainer.evaluate()
print(f"Evaluation results: {eval_results}")
  
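
Aggregate metrics can hide per-class weaknesses, so it’s worth checking where the model confuses categories. Here’s a minimal sketch using trainer.predict and scikit-learn’s confusion_matrix:

Python
from sklearn.metrics import confusion_matrix

# Run inference over the validation set
preds_output = trainer.predict(tokenized_val)
preds = preds_output.predictions.argmax(-1)

# Rows are true labels, columns are predicted labels (in label id order)
cm = confusion_matrix(preds_output.label_ids, preds, labels=list(id2label.keys()))
print(cm)
  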

Part 5: Making Predictions on New Emails

Now we can use our trained model to classify new, unseen emails:

Python
# Function to classify new emails
def classify_email(email_text):
    # Put the model in inference mode and move inputs to its device (CPU or GPU)
    model.eval()
    inputs = tokenizer(email_text, return_tensors="pt", padding=True, truncation=True, max_length=128)
    inputs = {k: v.to(model.device) for k, v in inputs.items()}
    with torch.no_grad():
        outputs = model(**inputs)

    # Get the predicted class and the full probability distribution
    predicted_class = torch.argmax(outputs.logits, dim=1).item()
    return id2label[predicted_class], outputs.logits.softmax(dim=1)[0]

# Test with new emails
test_emails = [
    "Don't miss our biggest sale of the year this weekend!",
    "Can you send me the report by tomorrow? Thanks!",
    "Your account password was recently changed. If this wasn't you, please contact support.",
    "You've been selected to receive a free iPhone. Click here to claim now!",
    "I saw your website and I'm interested in your consulting services. Do you have time for a call next week?"
]

for email in test_emails:
    label, confidence = classify_email(email)
    print(f"Email: {email[:50]}...")
    print(f"Predicted class: {label}")
    print(f"Confidence: {confidence.max().item():.4f}")
    print("-" * 50)
  

Part 6: Saving and Loading Our Model

To deploy our model for future use:

Python
# Save the model and tokenizer
model_path = "./email_classifier_model"
model.save_pretrained(model_path)
tokenizer.save_pretrained(model_path)

# Later, to load the model
loaded_tokenizer = AutoTokenizer.from_pretrained(model_path)
loaded_model = AutoModelForSequenceClassification.from_pretrained(model_path)
  
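
Alternatively, here’s a minimal sketch using Hugging Face’s pipeline API, which wraps tokenization, inference, and softmax in a single call when loading from the saved directory:

Python
from transformers import pipeline

# Loads both the model and tokenizer from the saved directory
classifier = pipeline("text-classification", model=model_path)
print(classifier("FLASH SALE: 50% off all items for the next 24 hours only!"))
# e.g. [{'label': 'Promotional', 'score': 0.97}]  (the exact score will vary)
  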

Part 7: Improving the Model

Here are some ways to enhance our email classifier:

  1. Use a larger dataset: The more diverse examples our model sees, the better it will generalize.
  2. Data augmentation: Create variations of existing emails to improve robustness.
  3. Try different pre-trained models: Experiment with models like RoBERTa, XLNet, or BART, which may perform better for certain email types (see the sketch after this list).
  4. Hyperparameter tuning: Adjust learning rates, batch sizes, and other parameters to optimize performance.
  5. Add more classes: Expand beyond our initial categories for more granular classification.
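
Because we used the Auto classes, swapping in a different backbone is essentially a one-line change. Here’s a minimal sketch assuming roberta-base as the replacement; the rest of the fine-tuning code stays the same:

Python
# Any sequence-classification-capable checkpoint works with the Auto classes
model_name = "roberta-base"  # or "xlnet-base-cased", "facebook/bart-base", ...
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForSequenceClassification.from_pretrained(
    model_name,
    num_labels=5,
    id2label=id2label,
    label2id=label2id,
)
  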

Part 8: Integration Ideas

Once your classifier is working well, you can:

  • Build a simple API around it using Flask or FastAPI (see the sketch after this list)
  • Integrate it with email clients using their APIs
  • Create a browser extension that categorizes emails in real-time
  • Set up automated filters and rules based on classification results
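
For example, here’s a minimal FastAPI sketch. The endpoint name and request schema are illustrative, and it assumes the classify_email function and loaded model from Part 5:

Python
# app.py -- a minimal serving sketch, assuming classify_email from Part 5
from fastapi import FastAPI
from pydantic import BaseModel

app = FastAPI()

class EmailRequest(BaseModel):
    text: str

@app.post("/classify")
def classify(request: EmailRequest):
    label, probs = classify_email(request.text)
    return {"label": label, "confidence": probs.max().item()}

# Run with: uvicorn app:app --reload
  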

Conclusion

In this tutorial, we’ve built a powerful email classification system using Hugging Face’s transformer models. This approach leverages pre-trained language models that understand the nuances of natural language, making our classifier much more accurate than traditional methods based on keywords or simple rules.

The applications are numerous – from helping individuals manage their inboxes more efficiently to enabling businesses to automatically route customer inquiries to the right department. As NLP technology continues to advance, these tools will become even more powerful and accessible.

I hope this tutorial helps you implement your own email classification system. Feel free to expand upon this foundation to meet your specific needs!

Resources for Further Learning

  • Hugging Face Transformers documentation: https://huggingface.co/docs/transformers
  • Hugging Face Datasets documentation: https://huggingface.co/docs/datasets
  • The Hugging Face NLP course: https://huggingface.co/learn/nlp-course

Happy coding!