ToMMeR – Efficient Entity Mention Detection from Large Language Models

Abstract: Identifying which text spans refer to entities – mention detection – is both foundational for information extraction and a known performance bottleneck. We introduce ToMMeR, a lightweight model (<300K parameters) probing mention detection capabilities from early LLM layers. Across 13 NER benchmarks, ToMMeR achieves 93% recall zero-shot, with over 90% precision when using an LLM as a judge, showing that ToMMeR rarely produces spurious predictions despite its high recall. Cross-model analysis reveals that diverse architectures (14M-15B parameters) converge on similar mention boundaries (DICE >75%), confirming that mention detection emerges naturally from language modeling. When extended with span classification heads, ToMMeR achieves near-SOTA NER performance (80-87% F1 on standard benchmarks). Our work provides evidence that structured entity representations exist in early transformer layers and can be efficiently recovered with minimal parameters.

What ToMMeR Does

➡️ Try it yourself! Colab

ToMMeR is a lightweight probing model that extracts emergent mention detection capabilities from the early-layer representations of any LLM backbone. Trained to generalize from LLM-annotated data, ToMMeR achieves high zero-shot recall across a set of 13 NER benchmarks.
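
To make the parameter budget concrete, here is a minimal sketch of what a sub-300K-parameter probe over frozen early-layer hidden states could look like. The dimensions and the per-token scoring head are illustrative assumptions, not ToMMeR's actual architecture:

```python
import torch
import torch.nn as nn

class TinySpanProbe(nn.Module):
    """Illustrative probe head over frozen LLM hidden states.

    Dimensions are assumptions chosen to land under ~300K parameters;
    this is not ToMMeR's actual architecture.
    """

    def __init__(self, hidden_dim: int = 2048, probe_dim: int = 64):
        super().__init__()
        self.proj = nn.Linear(hidden_dim, probe_dim)   # project LLM states down
        self.attn = nn.MultiheadAttention(probe_dim, num_heads=1, batch_first=True)
        self.score = nn.Linear(probe_dim, 1)           # per-token mention score

    def forward(self, hidden_states: torch.Tensor) -> torch.Tensor:
        # hidden_states: [batch, seq_len, hidden_dim] from an early LLM layer
        x = self.proj(hidden_states)
        x, _ = self.attn(x, x, x)                      # learned self-attention
        return self.score(x).squeeze(-1)               # [batch, seq_len]

probe = TinySpanProbe()
print(sum(p.numel() for p in probe.parameters()))      # ~148K, well under 300K
```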

Here is an example output of a ToMMeR model trained on representations from layer 6 of Llama3.2-1B. You can try it on Colab or download it directly from 🤗 Hugging Face. We trained many more models across other layers and backbones; see the related 🤗 collection.

> [Large Language Models]GOLD,PRED are awesome: while trained on [language modeling]GOLD,PRED, they exhibit [emergent abilities]PRED that make them suitable for a wide range of [tasks]PRED, including [mention detection]GOLD,PRED.

GOLD: ground truth entity mentions from LLM-labeled data
PRED: entity mentions predicted by ToMMeR
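
To reproduce this kind of output outside the Colab, the backbone side is standard: extract layer 6 hidden states with 🤗 transformers and feed them to the downloaded probe. Below is a minimal sketch assuming the transformers library; the probe-loading call itself is left as a comment, since the exact API is shown in the Colab notebook.

```python
import torch
from transformers import AutoModel, AutoTokenizer

# Backbone matching the example above; gated on the Hub, so access approval
# and `huggingface-cli login` may be required.
model_name = "meta-llama/Llama-3.2-1B"
tokenizer = AutoTokenizer.from_pretrained(model_name)
backbone = AutoModel.from_pretrained(model_name, output_hidden_states=True)
backbone.eval()

text = "Large Language Models are awesome."
inputs = tokenizer(text, return_tensors="pt")
with torch.no_grad():
    outputs = backbone(**inputs)

# hidden_states[0] is the embedding output, so index 6 is layer 6's output.
layer6 = outputs.hidden_states[6]  # [1, seq_len, 2048]

# A ToMMeR checkpoint from the 🤗 collection would be applied to `layer6`
# here to produce per-span mention probabilities; the exact loading call is
# shown in the Colab notebook.
```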

ToMMeR Span Probabilities and Attention Scores

ToMMeR bases its predictions primarily on a learned self-attention layer. We leverage circuitsvis to visualize the attention scores alongside the span scores.
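
Here is a minimal sketch of how such a visualization can be produced with circuitsvis in a notebook. The tokens and attention matrix below are placeholders, and how ToMMeR exposes its attention scores is an assumption.

```python
import torch
import circuitsvis as cv

# Placeholder inputs: in practice, `tokens` comes from the backbone tokenizer
# and `attn` from ToMMeR's learned self-attention layer (an assumption about
# how the scores are exposed).
tokens = ["Large", "Language", "Models", "are", "awesome", "."]
attn = torch.rand(len(tokens), len(tokens)).softmax(dim=-1)  # [dest, src]

# circuitsvis expects [num_heads, dest_tokens, src_tokens]; ToMMeR's single
# learned attention layer maps to one "head" here.
html = cv.attention.attention_patterns(tokens=tokens, attention=attn.unsqueeze(0))
html  # renders inline in a notebook; use str(html) to get the raw HTML elsewhere
```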