
T5-Base, T5-Large, and BART — The Battle of the Summarization Models

Project Overview

This project explores the text summarization transformer models T5-Base, T5-Large, and BART-Large-CNN and compares their summarization results. Working with a dataset of posts and comments pulled from the r/NYCApartment subreddit using the PRAW API, the goal is to evaluate and compare these models when summarizing user comments and posts. By experimenting with the pipeline API, tuning generation parameters, and analyzing different model sizes, we gain insight into the strengths and weaknesses of each approach to summarization.

✽ The full article can be found on my page on medium.com.
✽ The corresponding code can be found on GitHub.

Table of Contents

  1. Data Visualization & Analysis
  2. Pipelines for Summarization
    a. T5-BASE Model
    b. T5-LARGE Model
    c. BART-LARGE-CNN Model
  3. More Parameter Control
  4. Exploring the num_beams Parameter
  5. Summarizing the Full Dataset

1. Data Visualization & Analysis

Before diving into the summarization models, I first analyzed the Reddit dataset, which contains posts and comments from the r/NYCApartment subreddit. I plotted histograms of post sentiment distributions and examined the top 25% most engaging posts, which helped inform how summarization could be applied effectively.
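As a minimal sketch of that step (the CSV filename and the sentiment and score column names are illustrative assumptions, not the dataset's actual schema):

import pandas as pd
import matplotlib.pyplot as plt

# Hypothetical schema: 'sentiment' holds a per-post sentiment score and
# 'score' holds upvotes; the real dataset's columns may differ
df = pd.read_csv("nyc_apartment_posts.csv")

# Histogram of post sentiment distribution
df["sentiment"].hist(bins=30)
plt.xlabel("Sentiment score")
plt.ylabel("Number of posts")
plt.title("r/NYCApartment post sentiment distribution")
plt.show()

# Top 25% most engaging posts by upvote score
threshold = df["score"].quantile(0.75)
top_posts = df[df["score"] >= threshold]
print(f"{len(top_posts)} posts fall in the top engagement quartile")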

2. Pipelines for Summarization

For text summarization, I utilized Hugging Face’s pipeline API, which simplifies the process of using transformer models. Below is an example of how the summarization pipeline is used:

from transformers import pipeline

# Summarize just one comment using t5-base
summarizer = pipeline("summarization", model="t5-base", tokenizer="t5-base")
summary = summarizer(sample_comment, min_length=5, max_length=200,
                     do_sample=False)

Each model has its unique characteristics and produces different summary results. Below is a comparison of the models tested.

a. T5-BASE Model

* Model size: 220M parameters
* Strengths: Balanced performance and efficiency
* Weaknesses: Less fluent and less coherent than larger models
* Best for: General text summarization, Q&A, and text generation

b. T5-LARGE Model

* Model size: 770M parameters
* Strengths: More fluent and detailed summaries
* Weaknesses: Requires more computational power
* Best for: Complex summarization tasks, longer documents
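Upgrading to the larger checkpoint only requires changing the model name in the pipeline call; this sketch reuses the sample_comment variable from the snippet above:

# Same pipeline call with the larger checkpoint; expect slower inference
# but more fluent output (the pipeline adds T5's "summarize: " prefix itself)
summarizer_large = pipeline("summarization", model="t5-large", tokenizer="t5-large")
summary = summarizer_large(sample_comment, min_length=5, max_length=200,
                           do_sample=False)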

c. BART-LARGE-CNN Model

* Model size: 406M parameters
* Strengths: High-quality abstractive summaries
* Weaknesses: Requires fine-tuning for domain-specific texts
* Best for: News summarization, content condensation

Here's a quick code snippet showing how to use BART-LARGE-CNN:

from transformers import pipeline, BartTokenizer

# Load summarization pipeline
summarizer = pipeline("summarization", model="facebook/bart-large-cnn")

# Load the tokenizer separately to enable manual truncation
tokenizer = BartTokenizer.from_pretrained("facebook/bart-large-cnn")

# Truncate the comment to BART's 1024-token input limit before summarizing
inputs = tokenizer(sample_comment, truncation=True, max_length=1024)
truncated_comment = tokenizer.decode(inputs["input_ids"], skip_special_tokens=True)

summary = summarizer(truncated_comment, min_length=30, max_length=100,
                     do_sample=False)

T5 vs. BART Comparison:

| Feature | T5 | BART |
| --- | --- | --- |
| Model Type | Encoder-Decoder | Encoder-Decoder |
| Pretraining | Text-to-Text | Denoising Autoencoder |
| Output Style | Keeps key phrases | Rephrases more |
| Performance | Works well for structured text | Handles noisy text better |
| Ideal Use Case | Summarizing clean, factual text | Summarizing complex or opinion-heavy content |
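One straightforward way to see these differences is to run the same comment through both model families and compare the outputs side by side (reusing sample_comment and the pipeline import from earlier):

# Summarize the same comment with T5 and BART for a side-by-side comparison
for name in ["t5-base", "facebook/bart-large-cnn"]:
    pipe = pipeline("summarization", model=name)
    result = pipe(sample_comment, min_length=30, max_length=100, do_sample=False)
    print(f"{name}: {result[0]['summary_text']}\n")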

3. More Parameter Control

While the pipeline method is easy and convenient, manually loading the model lets us customize the summarization results more precisely. The generate function provides additional control over decoding.

from transformers import BartForConditionalGeneration, BartTokenizer
import torch

# Load the model and tokenizer directly rather than through a pipeline
model = BartForConditionalGeneration.from_pretrained("facebook/bart-large-cnn")
tokenizer = BartTokenizer.from_pretrained("facebook/bart-large-cnn")

# Tokenize the input, truncating to BART's 1024-token limit
inputs = tokenizer(text, return_tensors="pt", truncation=True, max_length=1024)

# Generate and decode the summary (no_grad avoids gradient tracking at inference)
with torch.no_grad():
    summary_ids = model.generate(**inputs, max_length=60, min_length=30,
                                 do_sample=False)
summary_text = tokenizer.decode(summary_ids[0], skip_special_tokens=True)
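generate() accepts more decoding controls than the two used above; the sketch below shows a few commonly used ones, with values that are illustrative rather than tuned for this dataset:

# Additional generate() knobs (illustrative values):
# - num_beams: beam search width (explored in the next section)
# - length_penalty: >1.0 nudges output longer, <1.0 nudges it shorter
# - no_repeat_ngram_size: blocks repeated n-grams in the summary
# - early_stopping: end beam search once all beams emit end-of-sequence
summary_ids = model.generate(**inputs, max_length=60, min_length=30,
                             num_beams=4, length_penalty=1.2,
                             no_repeat_ngram_size=3, early_stopping=True)
print(tokenizer.decode(summary_ids[0], skip_special_tokens=True))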

4. Exploring the num_beams Parameter

num_beams controls the number of candidate sequences kept during beam search decoding. Setting num_beams=1 is equivalent to greedy search, which is fastest but does not always produce the most coherent result. Higher values improve quality but increase computation time.

| num_beams | Summary Quality | Computational Time |
| --- | --- | --- |
| 1 (greedy) | Lowest | Fastest |
| 4 | Moderate | Medium |
| 8 | High | Slower |
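To see the trade-off directly, we can vary num_beams on the same input and time each run; this sketch reuses the model, tokenizer, and inputs from the previous section:

import time

# Compare summary quality and runtime across beam widths
for beams in [1, 4, 8]:
    start = time.time()
    ids = model.generate(**inputs, max_length=60, min_length=30,
                         num_beams=beams, do_sample=False)
    print(f"num_beams={beams} ({time.time() - start:.1f}s): "
          f"{tokenizer.decode(ids[0], skip_special_tokens=True)}")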

5. Summarizing the Full Dataset

Once the individual summarization tests were complete, I applied the summarization pipeline to the entire dataset to generate a summary for every Reddit comment. At this scale, processing the data efficiently starts to matter.

import pandas as pd

# Apply summarization to each post's comments
df['summary'] = df['comments'].apply(lambda x: summarizer(
        x, max_length=60, min_length=30, do_sample=False)[0]['summary_text'])
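For larger datasets, one pipeline call per row becomes slow. The pipeline also accepts a list of texts along with a batch_size argument, which is typically faster on a GPU; a sketch, assuming the comments column holds plain strings:

# Batch the whole column through the pipeline instead of one call per row;
# truncation=True guards against comments beyond the model's input limit
results = summarizer(df['comments'].tolist(), max_length=60, min_length=30,
                     do_sample=False, truncation=True, batch_size=8)
df['summary'] = [r['summary_text'] for r in results]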

Final Summary of Model Performance

| Model | Size | Strengths | Weaknesses |
| --- | --- | --- | --- |
| T5-BASE | 220M | Good performance & efficient | Less coherent summaries |
| T5-LARGE | 770M | More detailed output | Higher computational cost |
| BART-LARGE-CNN | 406M | Higher-quality summaries | Requires fine-tuning for domain-specific tasks |

This project demonstrates how different transformer models can be leveraged for text summarization, showcasing various methods for customizing summarization quality based on the specific needs of a dataset.

