T5-Base, T5-Large, and BART — The Battle of the Summarization Models
Project Overview
This project explores the text summarization transformer models T5-Base, T5-Large, and BART-Large-CNN and compares their summarization results. Building on the Reddit r/NYCApartment dataset pulled with the PRAW API, the focus is on evaluating and comparing these models when summarizing user posts and comments. By experimenting with the pipeline API, adjusting generation parameters, and analyzing different model sizes, we gain insight into the strengths and weaknesses of each approach for summarization tasks.
✽ The full article can be found on my Medium page.
✽ The corresponding code can be found on GitHub.
Table of Contents
- Data Visualization & Analysis
- Pipelines for Summarization
  a. T5-BASE Model
  b. T5-LARGE Model
  c. BART-LARGE-CNN Model
- More Parameter Control
- Exploring the num_beams Parameter
- Summarizing the Full Dataset
1. Data Visualization & Analysis
Before diving into the summarization models, I first analyzed the Reddit dataset, which contains posts and comments from the r/NYCApartment subreddit. I used histograms to examine the distribution of post sentiment and looked more closely at the top 25% most engaging posts. This helped inform how summarization could be applied effectively.
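As a rough sketch of that exploration (the file name and the `sentiment` and `num_comments` columns are assumed placeholders, not the project's actual schema):

```python
import pandas as pd
import matplotlib.pyplot as plt

# Hypothetical file and column names -- adjust to the actual Reddit export
df = pd.read_csv("nyc_apartment_posts.csv")

# Histogram of post sentiment scores
df["sentiment"].plot(kind="hist", bins=30, title="Post Sentiment Distribution")
plt.xlabel("Sentiment score")
plt.show()

# Top 25% most engaging posts, measured here by comment count
threshold = df["num_comments"].quantile(0.75)
top_posts = df[df["num_comments"] >= threshold]
print(f"{len(top_posts)} posts fall in the top 25% by engagement")
```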
2. Pipelines for Summarization
For text summarization, I utilized Hugging Face’s pipeline API, which simplifies the process of using transformer models. Below is an example of how the summarization pipeline is used:
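The snippet below is a minimal sketch of that pattern using the `t5-base` checkpoint; the sample text is an invented, Reddit-style post rather than an actual row from the dataset.

```python
from transformers import pipeline

# Build a summarization pipeline backed by the T5-Base checkpoint
summarizer = pipeline("summarization", model="t5-base")

text = (
    "Looking for a one-bedroom in Astoria under $2,200. Most listings I find are "
    "either far from the subway or come with steep broker fees. Any advice on "
    "neighborhoods or on the best time of year to search?"
)

# max_length / min_length bound the length of the generated summary (in tokens)
summary = summarizer(text, max_length=60, min_length=10, do_sample=False)
print(summary[0]["summary_text"])
```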
Each model has its unique characteristics and produces different summary results. Below is a comparison of the models tested.
a. T5-BASE Model
* Model size: 220M parameters
* Strengths: Balanced performance and efficiency
* Weaknesses: Less fluent and less coherent than larger models
* Best for: General text summarization, Q&A, and text generation
b. T5-LARGE Model
* Model size: 770M parameters
* Strengths: More fluent and detailed summaries
* Weaknesses: Requires more computational power
* Best for: Complex summarization tasks, longer documents
c. BART-LARGE-CNN Model
* Model size: 406M parameters
* Strengths: High-quality abstractive summaries
* Weaknesses: Requires fine-tuning for domain-specific texts
* Best for: News summarization, content condensation
Here's a quick code snippet on how to use BART-LARGE-CNN:
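The sketch below swaps in the `facebook/bart-large-cnn` checkpoint; the post text is invented for illustration.

```python
from transformers import pipeline

# Summarization pipeline backed by the BART-Large-CNN checkpoint
summarizer = pipeline("summarization", model="facebook/bart-large-cnn")

post = (
    "My landlord is raising the rent by 15% and blames rising building costs. "
    "The lease renewal is due next week and I can't decide whether to negotiate, "
    "sign as is, or start apartment hunting again in this market."
)

summary = summarizer(post, max_length=60, min_length=15, do_sample=False)
print(summary[0]["summary_text"])
```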
T5 vs. BART Comparison:
| Feature | T5 | BART |
|---|---|---|
| Model Type | Encoder-Decoder | Encoder-Decoder |
| Pretraining | Text-to-Text | Denoising Autoencoder |
| Output Style | Keeps key phrases | Rephrases more |
| Performance | Works well for structured text | Handles noisy text better |
| Ideal Use Case | Summarizing clean, factual text | Summarizing complex or opinion-heavy content |
3. More Parameter Control
While the `pipeline` method is quick and convenient, manually loading the model and tokenizer lets us customize the summarization more precisely. The `generate` function exposes additional control over how summaries are decoded.
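A minimal sketch of that approach, using the BART checkpoint (the input text is again invented):

```python
from transformers import AutoTokenizer, AutoModelForSeq2SeqLM

# Load the tokenizer and model explicitly instead of going through pipeline()
tokenizer = AutoTokenizer.from_pretrained("facebook/bart-large-cnn")
model = AutoModelForSeq2SeqLM.from_pretrained("facebook/bart-large-cnn")

text = "Moving to NYC next month and torn between Brooklyn and Queens for a first apartment..."

# Tokenize the input, truncating to the model's maximum input length
inputs = tokenizer(text, return_tensors="pt", truncation=True, max_length=1024)

# generate() exposes decoding parameters such as beam width and length penalties
summary_ids = model.generate(
    inputs["input_ids"],
    max_length=60,
    min_length=10,
    num_beams=4,
    length_penalty=2.0,
    early_stopping=True,
)

print(tokenizer.decode(summary_ids[0], skip_special_tokens=True))
```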
4. Exploring the num_beams Parameter
`num_beams` controls how many candidate sequences are considered during beam search decoding. Setting `num_beams=1` is equivalent to greedy search, which is the fastest option but does not always produce the most coherent result. Higher values improve quality at the cost of computation time.
| num_beams | Summary Quality | Computational Time |
|---|---|---|
| 1 (greedy) | Lowest | Fastest |
| 4 | Moderate | Medium |
| 8 | High | Slower |
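As a small self-contained sketch of this trade-off (the input text is invented for illustration):

```python
from transformers import AutoTokenizer, AutoModelForSeq2SeqLM

tokenizer = AutoTokenizer.from_pretrained("facebook/bart-large-cnn")
model = AutoModelForSeq2SeqLM.from_pretrained("facebook/bart-large-cnn")

text = "Subletting my studio for the summer -- how do I screen tenants without a broker?"
inputs = tokenizer(text, return_tensors="pt", truncation=True, max_length=1024)

# Compare summaries produced with different beam widths
for beams in (1, 4, 8):
    summary_ids = model.generate(inputs["input_ids"], max_length=60, min_length=10, num_beams=beams)
    print(f"num_beams={beams}:", tokenizer.decode(summary_ids[0], skip_special_tokens=True))
```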
5. Summarizing the Full Dataset
Once the individual summarization tests were complete, I applied the summarization pipeline to the entire dataset to generate summaries for all Reddit comments. This step is important for handling large datasets efficiently.
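A minimal sketch of how that might look with pandas; the file name and the `comment_body` column are assumed placeholders rather than the project's actual schema.

```python
import pandas as pd
from transformers import pipeline

# Hypothetical file and column names -- adjust to the actual dataset
df = pd.read_csv("nyc_apartment_comments.csv")

summarizer = pipeline("summarization", model="facebook/bart-large-cnn")

def summarize(text: str) -> str:
    # Truncate very long comments so they fit within the model's input limit
    result = summarizer(text[:3000], max_length=60, min_length=10, do_sample=False)
    return result[0]["summary_text"]

df["summary"] = df["comment_body"].astype(str).apply(summarize)
df.to_csv("nyc_apartment_comments_summarized.csv", index=False)
```

Passing a list of comments to the pipeline in one call, rather than summarizing row by row, lets it batch inputs and is usually faster on a GPU.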
Final Summary of Model Performance
| Model | Size | Strengths | Weaknesses |
|---|---|---|---|
| T5-BASE | 220M | Good performance & efficient | Less coherent summaries |
| T5-LARGE | 770M | More detailed output | Higher computational cost |
| BART-LARGE-CNN | 406M | Higher-quality summaries | Requires fine-tuning for domain-specific tasks |
This project demonstrates how different transformer models can be leveraged for text summarization, showcasing various methods for customizing summarization quality based on the specific needs of a dataset.