ToMMeR – Efficient Entity Mention Detection from Large Language Models
Abstract: Identifying which text spans refer to entities – mention detection – is both foundational for information extraction and a known performance bottleneck. We introduce ToMMeR, a lightweight model (<300K parameters) probing mention detection capabilities from early LLM layers. Across 13 NER benchmarks, ToMMeR achieves 93% recall zero-shot, with over 90% precision using an LLM as a judge, showing that ToMMeR rarely produces spurious predictions despite high recall. Cross-model analysis reveals that diverse architectures (14M-15B parameters) converge on similar mention boundaries (DICE > 75%), confirming that mention detection emerges naturally from language modeling. When extended with span classification heads, ToMMeR achieves near-SOTA NER performance (80-87% F1 on standard benchmarks). Our work provides evidence that structured entity representations exist in early transformer layers and can be efficiently recovered with minimal parameters.
What ToMMeR Does
ToMMeR is a lightweight probing model that extracts emergent mention detection capabilities from the early-layer representations of any LLM backbone. Trained to generalize from LLM-annotated data, ToMMeR achieves high zero-shot recall across a set of 13 NER benchmarks.
Here is an example output of a ToMMeR model trained on representations from layer 6 of Llama3.2-1B. You can try it on Colab or download it directly from 🤗 Hugging Face. We trained many more models across other layers and backbones; see the related 🤗 collection.
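As a rough illustration of the probing setup, the sketch below extracts layer-6 hidden states from Llama3.2-1B with 🤗 transformers and hands them to a probe head. The `ToMMeR.from_pretrained` loader and repo id are hypothetical placeholders; the actual loading code is documented in the Colab notebook and the Hugging Face collection.

```python
# Minimal sketch: extract layer-6 hidden states from Llama3.2-1B and score
# candidate spans with a ToMMeR probe. The probe loader and repo id below are
# hypothetical placeholders, not the released API.
import torch
from transformers import AutoModel, AutoTokenizer

backbone_id = "meta-llama/Llama-3.2-1B"
tokenizer = AutoTokenizer.from_pretrained(backbone_id)
backbone = AutoModel.from_pretrained(backbone_id, output_hidden_states=True)

text = "Barack Obama visited Paris in 2009."
inputs = tokenizer(text, return_tensors="pt")

with torch.no_grad():
    outputs = backbone(**inputs)

# hidden_states[0] is the embedding output, so index 6 is the layer-6 output.
layer6 = outputs.hidden_states[6]  # shape: (1, seq_len, hidden_dim)

# Hypothetical lightweight probe head (<300K parameters) that maps layer-6
# representations to a score for each candidate span (start, end).
# tommer = ToMMeR.from_pretrained("tommer-llama3.2-1b-layer6")  # placeholder id
# span_scores = tommer(layer6)  # e.g. (seq_len, seq_len) mention probabilities
```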
ToMMeR Span Probabilities and Attention Scores
ToMMeR bases its predictions primarily on a learned self-attention layer; we leverage circuitsvis to visualize the attention scores alongside the span scores.
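As a minimal sketch of the visualization step, the snippet below renders an attention-pattern widget with circuitsvis. The attention tensor here is random placeholder data standing in for the scores produced by ToMMeR's learned self-attention layer.

```python
# Minimal sketch: visualize attention scores with circuitsvis in a notebook.
import torch
import circuitsvis as cv

tokens = ["Barack", "Obama", "visited", "Paris", "in", "2009", "."]
num_heads, seq_len = 1, len(tokens)

# Placeholder attention weights with shape (num_heads, seq_len, seq_len),
# normalized over the source dimension as attention scores would be.
attention = torch.softmax(torch.randn(num_heads, seq_len, seq_len), dim=-1)

# Renders an interactive attention-pattern widget over the token sequence.
cv.attention.attention_patterns(tokens=tokens, attention=attention)
```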
