
A zero-shot LLM matched trained classifiers. The disagreements were more interesting than the scores.

MailTask.ai team · 12 min read
TL;DR: We ran a zero-shot LLM against the Parakweet EmailIntentDataSet (3,551 labeled email sentences) and matched the best published classifier at 77.9% F1. But investigating the 751 disagreements revealed something more interesting: roughly 15% of the dataset's ground truth labels appear to be wrong. Human annotators only agreed 85% of the time, and an independent AI judge sided with our system on over half the disputed cases. The labels, not the model, may be the bottleneck.

The dataset

Identifying action items in email is a well-studied NLP problem. The Parakweet EmailIntentDataSet provides 3,551 real workplace email sentences, each labeled by human annotators as containing an actionable “intent” or not. It has been the standard benchmark in this space since its release, with published classifiers achieving F1 scores between 71.2% and 77.8%.

A key detail from the original research: human annotators only agree about 85% of the time on whether a sentence contains a task. That means roughly 1 in 7 labels may be wrong in the ground truth itself.

We wanted to test whether a modern general-purpose LLM, with no task-specific training, could match these specialized systems. We ran a zero-shot structured extraction prompt against the full dataset, with no modifications or tuning for the benchmark.

The experiment

We fed each sentence to the system as a simulated single-message email thread and checked whether it extracted any actions.

  • Model: google/gemini-2.5-flash-lite via OpenRouter
  • Approach: Zero-shot structured output. The prompt asks the model to extract TODO items with fields for task text, due date, priority, sentiment, and context. No fine-tuning, no few-shot examples. The production prompt was used as-is.
  • Temperature: 0 (deterministic)
  • Classification rule: If the model extracted ≥1 action, predict “intent”. Otherwise, “non-intent”.
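
The classification rule reduces to a one-line mapping over the extractor's structured output. A minimal sketch, assuming the parsed JSON carries an illustrative `todos` list (field name is hypothetical; the real schema also includes due date, priority, sentiment, and context per item):

```python
def classify(extraction_result: dict) -> str:
    """Map the structured extraction output to the benchmark's binary label.

    `extraction_result` is assumed to be the parsed JSON returned by the
    model; "todos" is an illustrative field name for the list of extracted
    action items.
    """
    return "intent" if len(extraction_result.get("todos", [])) >= 1 else "non-intent"
```

Framing the benchmark as "did the extractor find anything?" means no benchmark-specific threshold or classification head was tuned; the binary label falls directly out of the production output.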

Exclusions

We excluded 106 sentences that matched marketing/automated email patterns (e.g. “Shop the latest styles,” “Unsubscribe,” “Free shipping”). The original dataset labels some of these as “intent,” but they are promotional CTAs, not personal tasks. The filter uses keyword-based regex patterns. All exclusions are documented and reproducible.
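
A sketch of what such a filter looks like, with hypothetical patterns for illustration (the actual documented pattern list may differ):

```python
import re

# Illustrative patterns only; the real filter uses a documented,
# deterministic list of keyword regexes for promotional/automated email.
MARKETING_PATTERNS = [
    re.compile(p, re.IGNORECASE)
    for p in (
        r"\bunsubscribe\b",
        r"\bfree shipping\b",
        r"\bshop (the|our) latest\b",
    )
]

def is_marketing(sentence: str) -> bool:
    """True if the sentence looks like a promotional CTA, not a personal task."""
    return any(p.search(sentence) for p in MARKETING_PATTERNS)
```

Because the filter is keyword-based and deterministic, anyone re-running the benchmark with the same pattern list will exclude exactly the same 106 sentences.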

Results

Using the original dataset labels as ground truth, the system achieved a 77.9% F1 score, matching the best published specialized classifier (77.8%).

System                                     | Precision | Recall | F1
SVM + n-grams (baseline)                   | 71.9%     | 78.5%  | 71.2%
SVM + additional features (best published) | 77.3%     | 78.3%  | 77.8%
Zero-shot LLM (this study)                 | 77.3%     | 78.5%  | 77.9%
[Bar chart: academic baseline (SVM + n-grams) 71.2% · academic best (SVM + features) 77.8% · this study (zero-shot LLM) 77.9% · this study (adjusted) 89.3%]

Dashed bar indicates adjusted score (see dataset quality analysis below).

Confusion matrix

Of 3,445 sentences (after filtering), the system produced 389 false positives and 362 false negatives:

                  | Predicted intent | Predicted non-intent
Actual intent     | 850 (24.7%)      | 362 (10.5%)
Actual non-intent | 389 (11.3%)      | 1,844 (53.5%)
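
The headline metrics derive from counts like these in the standard way. A small helper for the positive ("intent") class; note the published figures come from the full evaluation pipeline, so this only shows the formulas, not a digit-for-digit reproduction:

```python
def prf1(tp: int, fp: int, fn: int) -> tuple[float, float, float]:
    """Precision, recall, and F1 for the positive ("intent") class."""
    precision = tp / (tp + fp)          # of predicted intents, how many were real
    recall = tp / (tp + fn)             # of real intents, how many were found
    f1 = 2 * precision * recall / (precision + recall)  # harmonic mean
    return precision, recall, f1
```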

The disagreements tell a more interesting story

Manually reviewing a sample of the 751 disagreements, we noticed recurring patterns:

Ambiguous requests (false positives)

Sentences where reasonable people would disagree about whether they contain a task:

Let me know if you have any questions.

Dataset: non-intent · System: intent

Polite sign-off or genuine request? Depends on context that a single sentence cannot provide.

I look forward to hearing from you.

Dataset: non-intent · System: intent

The system interpreted this as a request to respond. Arguable either way.

Context-dependent sentences (false negatives)

Sentences that require surrounding context to classify correctly:

See attached.

Dataset: intent · System: non-intent

Is this a task (review the attachment)? Without the attachment or thread context, it is ambiguous.

Boilerplate labeled as intent (false negatives)

Some sentences labeled “intent” in the dataset are automated boilerplate or social media posts, not personal tasks:

Click here to start a new romance with someone living near you.

Dataset: intent · System: non-intent

Dating spam. The dataset labeled this as an action item.

Exposed! The Secret Exposed By Internet Millionaire Exposed.

Dataset: intent · System: non-intent

Spam subject line. Not a personal task for the recipient.

These patterns suggested that a meaningful share of the “errors” might be noise in the ground truth rather than actual system mistakes. We decided to investigate this systematically.

Quantifying the label noise

A caveat upfront: using one AI to evaluate another has obvious circularity concerns. We present this analysis not as a definitive correction, but as evidence that the original labels have meaningful noise, something the dataset authors themselves noted with their ~85% inter-annotator agreement figure.

We sent all 751 disputed cases (where our system and the dataset disagreed) to an independent judge model (Claude Sonnet) for a second opinion. The judge reviewed each sentence in isolation and classified it as “intent” or “non-intent” using a strict prompt that leans toward “non-intent” on borderline cases.

The judge prompt

For transparency, here is the exact prompt used for the judge. It was not tuned to favor our system:

CLASSIFY AS "intent" (task/action for the recipient) if the sentence:
- Makes a direct request or ask of the recipient
- Proposes or confirms a meeting, call, or schedule
- Asks a question that requires a response
- Shares a document or attachment to review
- Invites the recipient to an event
- Gives instructions or directions
- Contains a deadline or reminder

CLASSIFY AS "non-intent" (no task) if the sentence:
- Is purely informational with no action implied
- Describes what a third party will do
- Is marketing, spam, or automated boilerplate
- Is a statement of opinion, fact, or status
- Is the sender describing their own plans
- Is a generic CTA from a company/service
- Is a rhetorical question or joke

Be strict: if it is borderline, lean toward "non-intent".
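
Applying the rubric at scale is a simple loop over the disputed cases. A sketch, with a hypothetical `call_model` callable standing in for whatever chat-API wrapper is in use, and an abbreviated rubric string in place of the full prompt shown above:

```python
def judge_sentence(sentence: str, call_model) -> str:
    """Get the judge model's verdict for one disputed sentence.

    `call_model` is a hypothetical stand-in: it takes a prompt string and
    returns the model's text reply. In practice `rubric` would be the
    full classification prompt shown above.
    """
    rubric = 'Classify the sentence as "intent" or "non-intent" per the rubric.'
    prompt = f'{rubric}\n\nSentence: "{sentence}"\nReply with exactly one word.'
    reply = call_model(prompt).strip().lower()
    # Check "non-intent" first: the substring "intent" appears in both labels.
    return "non-intent" if "non-intent" in reply else "intent"
```

Each sentence is judged in isolation, mirroring how the original annotators and our system saw the data.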

Judge results

Of 389 false positives (sentences our system flagged as tasks, but the dataset said were not), the judge agreed with our system on 182 (47%), meaning those were arguably real tasks the dataset had mislabeled.

Of 362 false negatives (tasks according to the dataset that our system missed), the judge agreed with our system on 209 (58%), classifying them as not real tasks (boilerplate, spam, social media posts).

                      | Judge sided with system | Judge sided with dataset
False positives (389) | 182 (47%)               | 207 (53%)
False negatives (362) | 209 (58%)               | 153 (42%)

If we treat the judge labels as the corrected ground truth (which we acknowledge is imperfect), the adjusted metrics are:

Metric    | Raw   | Adjusted
Precision | 77.3% | 87.9%
Recall    | 78.5% | 90.8%
F1        | 77.9% | 89.3%
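
The bookkeeping behind an adjustment like this: an overturned false positive was arguably a real task (it becomes a true positive), and an overturned false negative was never a task (it becomes a true negative). A simplified sketch; the article's adjusted figures come from the full pipeline and this version is not guaranteed to reproduce them exactly:

```python
def adjust_counts(tp: int, fp: int, fn: int, tn: int,
                  fp_overturned: int, fn_overturned: int) -> tuple[int, int, int, int]:
    """Recompute confusion-matrix counts after a judge overturns some labels."""
    return (tp + fp_overturned,   # overturned FPs were real tasks -> TP
            fp - fp_overturned,
            fn - fn_overturned,
            tn + fn_overturned)   # overturned FNs were never tasks -> TN
```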

We present this as an upper bound estimate, not a definitive score. The true performance likely falls somewhere between the raw 77.9% and the adjusted 89.3%. The main takeaway is that dataset label noise accounts for a significant portion of the measured error rate, and this is consistent with the ~15% disagreement rate reported by the original annotators.

Discussion

If roughly 15% of a benchmark's labels are noisy, measured F1 scores have a hard ceiling well below 100%. A system scoring 78% on noisy labels might actually be performing at 89% on the “true” task. This applies to any classifier evaluated on this dataset, not just ours.
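
The arithmetic behind that ceiling can be sketched directly. This uses plain accuracy rather than F1 for simplicity and assumes label errors are independent of the classifier's errors:

```python
def apparent_accuracy(true_accuracy: float, label_noise: float) -> float:
    """Accuracy as measured against noisy labels.

    The classifier scores "correct" when it is right and the label is
    right, or when it is wrong and the label is wrong in the same way
    (assumed independent noise).
    """
    return true_accuracy * (1 - label_noise) + (1 - true_accuracy) * label_noise
```

Under these assumptions, even a perfect classifier measures only 85% against labels with 15% noise, and one that is right 89% of the time measures about 77%, roughly the gap between the raw and adjusted scores reported here.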

This is probably not an outlier. Inter-annotator agreement rates of 80-90% are common across NLP datasets. The Parakweet authors were unusually transparent in reporting theirs. Many datasets do not report this figure at all. The implication: published state-of-the-art numbers on many benchmarks may be closer to the noise ceiling than they appear.

Using a second AI to audit labels has obvious circularity risks, but it scales in a way that fresh human annotation does not. A more rigorous follow-up would be hiring new annotators to re-label the disputed cases and comparing all three labels (original human, LLM system, and judge). We did not do this due to cost, but it would be the natural next step.

An unexpected takeaway: zero-shot LLMs may be useful not just as classifiers, but as tools for auditing the quality of labeled datasets. If an LLM consistently disagrees with labels in patterns that a third judge also finds suspicious, that is a signal worth investigating, regardless of what you think about the LLM's classification ability.

Limitations

  • The test emails are from 2001. The dataset predates Slack notifications, Zoom links, and GitHub updates. Performance on modern workplace email may differ.
  • Individual sentences, not full threads. Each sentence was tested in isolation. Processing full email threads would provide more context for classification.
  • The prompt was refined through real-world use. While the system was not trained on this dataset, the prompt has been improved over months of processing real inboxes. This is not a pure out-of-the-box evaluation of the base model.
  • 106 sentences were excluded. We excluded marketing/automated content using keyword-based regex patterns. This is a judgment call, but the exclusion criteria are documented and deterministic.
  • The adjusted score uses an AI judge. Claude Sonnet reviewed disputed cases in isolation. This introduces its own biases and is not a substitute for fresh human annotation.
  • We only tested the “is this a task?” question. The system also extracts due dates, priority, and sentiment, but the dataset does not include ground truth for those fields, so we cannot measure their accuracy here.