Measure
Is recognition actually increasing?
Screenshots are not data. Single prompts are not insights. AI visibility is a probability problem — and it needs to be measured like one.
- Why screenshots don’t work and never will
- The mental model shift: from rank tracker to pollster
- What mention rate is and why it’s the right metric
- The N×k sampling framework — and how many samples you actually need
- What you can and can’t do with the number
We’re measuring AI visibility the same way we measure keyword rankings. That assumption is breaking everything.
Here’s what most teams are doing right now: they ask ChatGPT about their brand, see whether their name shows up, and either panic or screenshot the one result where they do appear. Different tools. Different wrappers. Same assumption underneath all of it.
The assumption is: if I check this prompt, I know where we stand.
AI doesn’t work like that. There is no static result for prompts. No single answer. No shared result that “everyone sees.” In traditional search, everyone running the same query sees roughly the same result — that shared reality is what makes rank tracking work. AI gives you none of that.
When you ask an LLM a question, you’re not retrieving a fixed answer. You’re sampling from a probability distribution. Treating that output like a ranking position is a category error. This is why daily prompt tracking feels chaotic — one day you’re mentioned, the next you’re not. Nothing changed. You just pulled a different sample from the same underlying distribution.
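The day-to-day flicker described above can be reproduced with a small simulation. This is a sketch under a simplifying assumption: the brand has a fixed, hypothetical 50% chance of being mentioned, and each daily check is one independent draw. The function name and numbers are illustrative only.

```python
import random


def daily_checks(true_rate: float, days: int, seed: int = 0) -> list[bool]:
    """Simulate one prompt check per day against a fixed mention probability."""
    rng = random.Random(seed)
    return [rng.random() < true_rate for _ in range(days)]


# Hypothetical brand whose underlying mention probability never changes.
checks = daily_checks(true_rate=0.5, days=14)

# "Y" = mentioned that day, "n" = not mentioned.
print("".join("Y" if hit else "n" for hit in checks))
print(f"observed: {sum(checks)}/{len(checks)} days mentioned")
```

The underlying rate is constant, so any run of hits or misses in the output is sampling noise, not a change in visibility.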
What the research says
Why screenshots don’t work — and the data that proves it.
This isn’t just intuition. Two studies make the case clearly.
Study 1 — Large Language Monkeys (Brown et al., 2024)
One attempt solved 16% of problems. 250 attempts solved 56%.
Single-shot responses show only a fraction of what’s possible.
A one-off screenshot is not a data point — it’s a single draw from a much larger distribution. The takeaway is unavoidable: single-shot testing is fundamentally unreliable for probabilistic systems. You need to sample repeatedly to understand what’s actually happening.
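The same logic explains why one favourable screenshot proves little. As a rough sketch, assume each run mentions the brand independently with probability p. The chance of at least one mention in k runs is then 1 - (1 - p)^k. This is not the coverage calculation from the paper, only the simplest version of the same idea, and the 10% rate below is a made-up example.

```python
def chance_of_at_least_one_mention(p: float, k: int) -> float:
    """Probability of one or more mentions in k independent runs."""
    return 1 - (1 - p) ** k


# Hypothetical brand mentioned in only 10% of responses.
for k in (1, 5, 10, 25):
    print(f"{k:>2} runs: {chance_of_at_least_one_mention(0.10, k):.0%}")
```

Even a brand that rarely appears will produce a screenshot-worthy result if you retry enough times, which is why a single capture says almost nothing about the underlying rate.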
Study 2 — SparkToro & Gumshoe AI (January 2026)
142 participants. Same intent. Semantic similarity of 0.081.
Real users don’t repeat the same prompt. At all.
When 142 people wrote their own version of the same question, their prompts were as different from each other as “Kung Pao Chicken and Peanut Butter.” The full study ran 2,961 tests across ChatGPT, Claude, and Google AI. Less than 1% chance of getting the same brand list twice. About 0.1% — one in a thousand — chance of getting the same list in the same order.
And yet top brands still appeared in 55–77% of responses despite all that variation. That points to two things. First, ranking position is meaningless, because it changes nearly every time. Second, visibility percentage across diverse prompts is the metric that actually means something.
Mention rate.
If rankings don’t apply, what does? The most defensible metric available right now is mention rate. It’s simple:
Did we appear in this one prompt?
↓
Across N prompt variations with k runs each, what percentage mentioned us?
That framing turns AI visibility from a binary outcome into a distribution — which is exactly what LLMs produce. It accounts for LLM variance across multiple runs of the same prompt. It accounts for human variance across the different ways people ask the same question. It gives you a probability, not a position.
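In code, the calculation is small. The sketch below assumes you have already collected the responses and that a case-insensitive whole-word match is enough to detect a mention. Real brands often need alias handling. The brand names, prompts, and function names are all hypothetical.

```python
import re


def mentions(brand: str, response: str) -> bool:
    """Case-insensitive whole-word match; real matching may need aliases."""
    pattern = rf"\b{re.escape(brand)}\b"
    return re.search(pattern, response, re.IGNORECASE) is not None


def mention_rate(brand: str, runs: dict[str, list[str]]) -> float:
    """`runs` maps each prompt variation to the k responses collected for it."""
    responses = [text for texts in runs.values() for text in texts]
    if not responses:
        raise ValueError("no responses collected")
    return sum(mentions(brand, text) for text in responses) / len(responses)


# N = 2 prompt variations, k = 2 runs each (toy data).
runs = {
    "best crm for startups": ["Try Acme or Globex.", "Globex is popular."],
    "which crm should a small team use": ["Acme, Initech.", "Consider Acme."],
}
print(f"{mention_rate('Acme', runs):.0%}")  # 3 of 4 responses mention Acme
```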
When pollsters want to know who’s winning an election, they don’t ask one person. They sample across regions and over time, then report: “Candidate A has 52% support ± 3%.” That’s what valid AI visibility measurement looks like: a percentage with a margin of error, not a screenshot with a caption.
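The “± 3%” part can be attached to a mention rate with the usual normal-approximation formula for a proportion. A minimal sketch, assuming a 95% confidence level (z = 1.96) and treating every sample as independent; the rate and sample count are illustrative:

```python
import math


def margin_of_error(rate: float, n: int, z: float = 1.96) -> float:
    """Half-width of a normal-approximation interval for a proportion."""
    return z * math.sqrt(rate * (1 - rate) / n)


rate, n = 0.52, 400  # hypothetical: 52% mention rate from 400 samples
moe = margin_of_error(rate, n)
print(f"{rate:.0%} ± {moe:.1%} (n={n})")
```

Repeated runs of the same prompt are not fully independent, so treat the result as an optimistic lower bound on the real uncertainty.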
The framework
N × k sampling.
Most teams think more prompts will fix the problem. More prompts alone won’t. You need two different dimensions of sampling, and both matter.
N
Breadth — prompt diversity
How many different ways do people ask about your topic? N captures the variation in how real users phrase the same underlying intent. The SparkToro study showed this variation is massive: 0.081 semantic similarity across 142 people asking the same question. N accounts for that.
k
Depth — inference stability
How many times do you run each prompt variation? k reveals how volatile or stable each prompt actually is. Some prompts are surprisingly consistent. Others produce different results nearly every time. You only learn which is which by running them multiple times.
Running the same prompt 100 times tells you about inference stability but nothing about prompt diversity. Running 100 different prompts once each tells you about diversity but nothing about stability. Total samples = N × k. Both dimensions matter.
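A sampling run over both dimensions can be sketched as a simple grid. The `fake_model` function below is a stand-in, not a real API. It returns canned answers at random so the example runs offline, and you would swap in your own model client. The prompts and brand names are hypothetical.

```python
import random
from typing import Callable


def sample_grid(prompts: list[str], k: int, ask: Callable[[str], str]) -> dict[str, list[str]]:
    """Run each of the N prompts k times and keep every response."""
    return {prompt: [ask(prompt) for _ in range(k)] for prompt in prompts}


def per_prompt_rate(grid: dict[str, list[str]], brand: str) -> dict[str, float]:
    """Mention rate per prompt, which shows how stable each prompt is."""
    return {
        prompt: sum(brand.lower() in text.lower() for text in texts) / len(texts)
        for prompt, texts in grid.items()
    }


# Stand-in for a real model call (hypothetical).
_rng = random.Random(42)


def fake_model(prompt: str) -> str:
    return "Acme and Globex" if _rng.random() < 0.6 else "Globex and Initech"


prompts = ["best crm for startups", "crm for a 5-person team", "simple sales pipeline tool"]
grid = sample_grid(prompts, k=5, ask=fake_model)
total = sum(len(texts) for texts in grid.values())
print(f"N={len(prompts)}, k=5, total samples={total}")
for prompt, rate in per_prompt_rate(grid, "Acme").items():
    print(f"{rate:.0%}  {prompt}")
```

Keeping the responses grouped by prompt, rather than flattening them immediately, is what lets you read breadth (across prompts) and depth (within a prompt) separately.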
How many samples do you actually need?
This is standard statistics: Cochran’s formula, the same math pollsters use to calculate sample sizes for election surveys. The tiers below assume 95% confidence and a worst-case 50% mention rate. Pick the tier that matches how much you need to trust your results.
- ±10%: 20 prompts × 5 runs = ~100 samples. Good for pressure-testing a topic or early exploration. Your mention rate could swing 10 points in either direction, but that is still far better than a single screenshot. Use when you’re getting started or testing a new entity.
- ±5%: 40 prompts × 10 runs = ~400 samples. Reliable for monthly tracking and internal reporting. You can detect real shifts of 6+ points and start comparing results over time. Right for most ongoing visibility programs.
- ±2%: 80 prompts × 30 runs = ~2,400 samples. Decision-grade data for board reporting, competitive claims, or measuring campaign impact. Small changes are detectable and defensible to stakeholders.
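The tier sizes can be checked against Cochran’s formula, n = z² · p · (1 − p) / e². The sketch below assumes 95% confidence (z = 1.96) and the worst-case proportion p = 0.5. The function name is illustrative.

```python
import math


def cochran_sample_size(margin: float, p: float = 0.5, z: float = 1.96) -> int:
    """Samples needed for a given margin of error: z^2 * p * (1 - p) / margin^2."""
    return math.ceil(z**2 * p * (1 - p) / margin**2)


for margin in (0.10, 0.05, 0.02):
    print(f"±{margin:.0%}: about {cochran_sample_size(margin)} samples")
```

The results land close to the rounded 100, 400, and 2,400 figures above. The formula assumes independent samples, and k runs of the same prompt are correlated, so the true margin for an N×k design is likely somewhat wider than the tier label suggests.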
What you can and can’t do with mention rate.
After 3–6 months of consistent measurement, mention rate becomes genuinely useful. But it has real limits — and being honest about those limits is what makes it defensible.
You can
- Say “We improved from 42% to 52% over Q1”
- Say “We outperform Competitor X by 1.8×”
- Say “This campaign drove a 9-point lift”
- Track month-over-month trends
- Compare categories (“stronger in X than Y”)
- Set benchmarks and test interventions
You cannot
- Say “We rank #3” (there is no rank)
- Say “We dropped 2 points this month” (within margin of error)
- Panic over one bad screenshot
- Expect the same answer twice
- Report “47.3%” from 100 samples (false precision)
- Treat this like keyword position tracking
One rule that matters more than any of the others: only compare distributions to distributions. Never compare single data points to single data points. If your margin of error is ±5%, a 2-point drop means nothing. A 9-point lift probably does.
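One way to apply that rule is a two-proportion comparison: a shift counts only if it exceeds the combined sampling noise of both measurements. This is a minimal sketch assuming 95% confidence, independent samples, and 400 samples per period. The function name and the example rates are illustrative.

```python
import math


def shift_exceeds_noise(p1: float, n1: int, p2: float, n2: int, z: float = 1.96) -> bool:
    """True if the gap between two mention rates is larger than sampling noise."""
    standard_error = math.sqrt(p1 * (1 - p1) / n1 + p2 * (1 - p2) / n2)
    return abs(p2 - p1) > z * standard_error


# 2-point drop, 400 samples each period: inside the noise.
print(shift_exceeds_noise(0.50, 400, 0.48, 400))  # False

# 9-point lift, 400 samples each period: outside the noise.
print(shift_exceeds_noise(0.42, 400, 0.51, 400))  # True
```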
Start measuring
The tools that put this into practice.
Beta v2.0
Mention Rate Tool
Structured prompt sampling, mention rate calculation, and tier-based reporting: the N×k framework, automated so you don’t have to run it by hand.
Coming soon: Sentiment Tool — track how AI systems describe your brand, not just whether they mention it. The language used, the associations made, the framing applied.