ToMMeR – Efficient Entity Mention Detection from Large Language Models

Abstract: Identifying which text spans refer to entities – mention detection – is both foundational for information extraction and a known performance bottleneck. We introduce ToMMeR, a lightweight model (<300K parameters) probing mention detection capabilities from early LLM layers. Across 13 NER benchmarks, ToMMeR achieves 93% recall zero-shot, with over 90% precision when using an LLM as a judge, showing that ToMMeR rarely produces spurious predictions despite its high recall. Cross-model analysis reveals that diverse architectures (14M-15B parameters) converge on similar mention boundaries (DICE >75%), confirming that mention detection emerges naturally from language modeling. When extended with span classification heads, ToMMeR achieves near-SOTA NER performance (80-87% F1 on standard benchmarks). Our work provides evidence that structured entity representations exist in early transformer layers and can be efficiently recovered with minimal parameters.

What ToMMeR Does

➡️ Try it yourself! Colab

ToMMeR is a lightweight probing model that extracts emergent mention detection capabilities from the early-layer representations of any LLM backbone. Trained to generalize from LLM-annotated data, ToMMeR achieves high zero-shot recall across a set of 13 NER benchmarks.
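
To make the parameter budget concrete, here is a minimal sketch of what a sub-300K-parameter probe over frozen early-layer hidden states could look like. The dimensions and the per-token scoring head are illustrative assumptions, not ToMMeR's actual architecture:

```python
import torch
import torch.nn as nn

class TinySpanProbe(nn.Module):
    """Illustrative probe head over frozen LLM hidden states.

    Dimensions are assumptions chosen to land under ~300K parameters;
    this is not ToMMeR's actual architecture.
    """

    def __init__(self, hidden_dim: int = 2048, probe_dim: int = 64):
        super().__init__()
        self.proj = nn.Linear(hidden_dim, probe_dim)   # project LLM states down
        self.attn = nn.MultiheadAttention(probe_dim, num_heads=1, batch_first=True)
        self.score = nn.Linear(probe_dim, 1)           # per-token mention score

    def forward(self, hidden_states: torch.Tensor) -> torch.Tensor:
        # hidden_states: [batch, seq_len, hidden_dim] from an early LLM layer
        x = self.proj(hidden_states)
        x, _ = self.attn(x, x, x)                      # learned self-attention
        return self.score(x).squeeze(-1)               # [batch, seq_len]

probe = TinySpanProbe()
print(sum(p.numel() for p in probe.parameters()))      # ~148K, well under 300K
```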

Here is an example output of a ToMMeR model trained on representations from layer 6 of Llama3.2-1B. You can try it on Colab or download it directly from 🤗 Hugging Face. We trained many more models across other layers and backbones; see the related 🤗 collection.

> [Large Language Models]GOLD,PRED are awesome: while trained on [language modeling]GOLD,PRED, they exhibit [emergent abilities]PRED that make them suitable for a wide range of [tasks]PRED, including [mention detection]GOLD,PRED.

GOLD: ground truth entity mentions from LLM-labeled data
PRED: entity mentions predicted by ToMMeR
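
To reproduce this kind of output outside the Colab, the backbone side is standard: extract layer 6 hidden states with 🤗 transformers and feed them to the downloaded probe. Below is a minimal sketch assuming the transformers library; the probe-loading call itself is left as a comment, since the exact API is shown in the Colab notebook.

```python
import torch
from transformers import AutoModel, AutoTokenizer

# Backbone matching the example above; gated on the Hub, so access approval
# and `huggingface-cli login` may be required.
model_name = "meta-llama/Llama-3.2-1B"
tokenizer = AutoTokenizer.from_pretrained(model_name)
backbone = AutoModel.from_pretrained(model_name, output_hidden_states=True)
backbone.eval()

text = "Large Language Models are awesome."
inputs = tokenizer(text, return_tensors="pt")
with torch.no_grad():
    outputs = backbone(**inputs)

# hidden_states[0] is the embedding output, so index 6 is layer 6's output.
layer6 = outputs.hidden_states[6]  # [1, seq_len, 2048]

# A ToMMeR checkpoint from the 🤗 collection would be applied to `layer6`
# here to produce per-span mention probabilities; the exact loading call is
# shown in the Colab notebook.
```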

ToMMeR Span Probabilities and Attention Scores

ToMMeR bases its predictions primarily on a learned self-attention layer. We leverage circuitsvis to visualize the attention scores alongside the span scores.
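
Here is a minimal sketch of how such a visualization can be produced with circuitsvis in a notebook. The tokens and attention matrix below are placeholders, and how ToMMeR exposes its attention scores is an assumption.

```python
import torch
import circuitsvis as cv

# Placeholder inputs: in practice, `tokens` comes from the backbone tokenizer
# and `attn` from ToMMeR's learned self-attention layer (an assumption about
# how the scores are exposed).
tokens = ["Large", "Language", "Models", "are", "awesome", "."]
attn = torch.rand(len(tokens), len(tokens)).softmax(dim=-1)  # [dest, src]

# circuitsvis expects [num_heads, dest_tokens, src_tokens]; ToMMeR's single
# learned attention layer maps to one "head" here.
html = cv.attention.attention_patterns(tokens=tokens, attention=attn.unsqueeze(0))
html  # renders inline in a notebook; use str(html) to get the raw HTML elsewhere
```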