Measurement
Screenshots are not data. Single prompts are not insights. AI visibility is a probability problem — and it needs to be measured like one.
Stop thinking like a rank tracker. Start thinking like a pollster.
The problem
We’re measuring AI visibility the same way we measure keyword rankings.
That assumption is breaking everything.
Here’s what most teams are doing right now: they ask ChatGPT about their brand, don’t see their name, and either panic or screenshot the one result where they do appear. Different tools. Different wrappers. Same assumption underneath all of it.
The assumption is: if I check this prompt, I know where we stand.
AI doesn’t work like that. There is no static SERP for prompts. No single answer. No shared result. No “what everyone else saw.”
In traditional search, everyone running the same query at the same moment sees roughly the same result. That shared reality is what makes rank tracking work. You can screenshot rank 3 and say “this is what rank 3 looks like today.”
AI gives you none of that. When you ask an LLM a question, you’re not retrieving a fixed answer. You’re sampling from a probability distribution. Treating that output like a ranking position is a category error.
This is why daily prompt tracking feels chaotic. One day you’re mentioned. The next you’re not. Nothing changed — you just pulled a different sample from the same underlying distribution.
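A quick simulation makes the point concrete. This is a minimal sketch, not a model of any real LLM: it assumes a fixed underlying mention rate of 60% (an invented number) and checks one prompt per day. The function name `daily_checks` and its parameters are illustrative.

```python
import random

def daily_checks(true_rate, days, seed=0):
    """Simulate one prompt check per day against a fixed underlying mention rate."""
    rng = random.Random(seed)
    # Each day is an independent draw: True means the brand was mentioned.
    return [rng.random() < true_rate for _ in range(days)]

# Assumed numbers for illustration only: a 60% true rate, checked for 14 days.
checks = daily_checks(true_rate=0.6, days=14)
print("".join("Y" if seen else "-" for seen in checks))
print(f"observed over 14 days: {sum(checks) / len(checks):.0%}")
```

The underlying rate never moves, yet the day-by-day string flips between mentioned and not mentioned. Only the aggregate over many draws gets close to the true rate.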
What the research says
Why screenshots don’t work and never will.
This isn’t just intuition. The research is clear.
Study 1 — Large Language Monkeys (Brown et al., 2024)
Brown et al. found that the share of problems a model solves keeps climbing as you draw more samples from it, so any single response shows only a fraction of what the model can produce. A one-off screenshot is not a data set. It is a single draw from a much larger distribution. The takeaway for visibility work: single-shot testing is unreliable for probabilistic systems.
Study 2 — SparkToro & Gumshoe AI (January 2026)
Real users don’t repeat the same prompt. When 142 people wrote their own versions of the same question, their prompts were as different from each other as “Kung Pao Chicken and Peanut Butter.” Even if you perfectly capture how one person asks a question, you’ve learned almost nothing about how the next person will ask it.
The full SparkToro study ran 2,961 tests across ChatGPT, Claude, and Google AI. What they found: less than 1% chance of getting the same brand list twice. About 0.1% — one in a thousand — chance of getting the same list in the same order. And yet top brands still appeared in 55–77% of responses despite massive variation.
This points to two things. Ranking position is close to meaningless, because it changes nearly every time. Visibility percentage across diverse prompts is the metric that holds up.
The right metric
Mention rate.
If rankings don’t apply, what does?
The most defensible metric available right now is mention rate. It’s simple:
Mention rate = the percentage of responses where your brand appears.
Not “did we show up?” But: what percent of the time do we show up?
Stop tracking: “Did we appear in this one prompt?” Start tracking: “Across N prompt variations with k runs each, what percentage mentioned us?”
That framing turns AI visibility from a binary outcome into a distribution — which is exactly what LLMs produce. It accounts for LLM variance across multiple runs of the same prompt. It accounts for human variance across the different ways people ask the same question. It gives you a probability, not a position. And it actually resembles how real users experience AI search.
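In code, the definition is a few lines. The sketch below assumes responses are already collected as plain text, grouped by prompt. The brand names and prompts are invented, and the whole-word match is a simplification: real matching would likely need aliases and misspellings.

```python
import re

def mentions_brand(response: str, brand: str) -> bool:
    """Case-insensitive whole-word match.

    Assumes the brand name starts and ends with a letter or digit,
    since the \\b word boundary depends on that.
    """
    pattern = r"\b" + re.escape(brand) + r"\b"
    return re.search(pattern, response, flags=re.IGNORECASE) is not None

def mention_rate(responses_by_prompt: dict[str, list[str]], brand: str) -> float:
    """Share of all responses (N prompts x k runs) that mention the brand."""
    all_responses = [r for runs in responses_by_prompt.values() for r in runs]
    if not all_responses:
        raise ValueError("no responses collected")
    hits = sum(mentions_brand(r, brand) for r in all_responses)
    return hits / len(all_responses)

# Hypothetical data: N = 2 prompt variations, k = 3 runs each.
sample = {
    "best running shoes for beginners": [
        "Try Acme or Zenith.", "Zenith is popular.", "Acme, Bolt, Zenith.",
    ],
    "what shoes should a new runner buy": [
        "Bolt and Zenith.", "Consider acme trainers.", "Zenith.",
    ],
}
print(f"Acme mention rate: {mention_rate(sample, 'Acme'):.0%}")
```

Six responses is far too few to trust. The example only shows the shape of the calculation.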
The mental model shift
LLMs don’t return answers. They return distributions.
When pollsters want to know who’s winning an election, they don’t ask one person. They sample many people across different regions and repeat the poll over time to check that the result is stable. Then they report: “Candidate A has 52% support ± 3%.”
That’s what valid AI visibility measurement looks like. A percentage with a margin of error — not a screenshot with a caption.
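Here is one way to attach that margin to a mention rate. It is a sketch using the textbook normal approximation at roughly 95% confidence (z = 1.96). The hit and sample counts are invented. It also treats every response as independent, which is optimistic when the same prompt is run several times.

```python
import math

def margin_of_error(hits: int, n: int, z: float = 1.96) -> tuple[float, float]:
    """Return (rate, margin) using the normal approximation for a proportion."""
    if n <= 0:
        raise ValueError("n must be positive")
    rate = hits / n
    # Standard error of a proportion, scaled by z for the chosen confidence level.
    margin = z * math.sqrt(rate * (1 - rate) / n)
    return rate, margin

# Hypothetical result: the brand appeared in 139 of 267 responses.
rate, margin = margin_of_error(hits=139, n=267)
print(f"mention rate: {rate:.0%} ± {margin * 100:.0f} points")
```

For rates very close to 0% or 100%, or for small samples, a Wilson interval is usually a safer choice than this approximation.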
The framework
N × k sampling.
Most teams think more prompts will fix the problem. They won’t. You need breadth and depth — two different dimensions of sampling.
Breadth — prompt diversity
Breadth captures the variation in how real users phrase the same underlying intent. SparkToro measured how large that variation is: a semantic similarity of just 0.081 across 142 participants asking the same question. N is the number of distinct prompt phrasings you sample, and it accounts for that.
Depth — inference stability
Depth reveals how volatile or stable each prompt actually is. Some prompts are surprisingly consistent. Others are chaos machines. You only learn which is which by running them multiple times. k is the number of times you run each prompt.
Total samples = N × k. Both dimensions matter. Running the same prompt 100 times tells you about inference stability but nothing about prompt diversity. Running 100 different prompts once each tells you about diversity but nothing about stability. You need both.
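The two dimensions map onto a simple nested loop. The sketch below is an assumption about how you might structure it, not a description of any particular tool: `ask_model` is a hypothetical stand-in for whatever LLM call you use, `fake_model` is a toy replacement so the example runs offline, and the brand matching is a plain substring check.

```python
import random
from typing import Callable

def sample_grid(prompts: list[str], k: int, ask_model: Callable[[str], str]) -> dict[str, list[str]]:
    """Run each of the N distinct prompts k times and keep every response."""
    if k <= 0:
        raise ValueError("k must be positive")
    return {prompt: [ask_model(prompt) for _ in range(k)] for prompt in prompts}

def per_prompt_rates(grid: dict[str, list[str]], brand: str) -> dict[str, float]:
    """Mention rate per prompt, using a simple case-insensitive substring match."""
    return {
        prompt: sum(brand.lower() in r.lower() for r in runs) / len(runs)
        for prompt, runs in grid.items()
    }

# Toy stand-in for a real model: mentions "Acme" about 70% of the time.
rng = random.Random(42)
def fake_model(prompt: str) -> str:
    return "Acme and Zenith" if rng.random() < 0.7 else "Zenith and Bolt"

prompts = ["best crm for startups", "which crm should a small team use", "top crm tools"]
grid = sample_grid(prompts, k=10, ask_model=fake_model)   # N = 3, k = 10, 30 samples
for prompt, rate in per_prompt_rates(grid, "Acme").items():
    print(f"{rate:.0%}  {prompt}")
```

The per-prompt view is the depth dimension. A prompt sitting near 0% or 100% is stable, while one hovering around 50% is one of the chaos machines.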
How many samples do you actually need?
This isn’t guesswork. It’s standard statistics — Cochran’s formula, the same math pollsters use to calculate sample sizes for election surveys.
Tier 1 — Exploratory
Use when you’re pressure-testing a topic or doing early exploration. Your mention rate could swing 10 points in either direction, but that is still far more informative than a single screenshot.
Tier 2 — Monitoring
Use for monthly tracking or internal reporting. You can detect real shifts of 6+ points. This is the right tier for most ongoing visibility programs.
Tier 3 — High-Stakes
Use for board reporting, competitive claims, or PR impact measurement. You can confidently measure small changes and defend the numbers to stakeholders.
You don’t need perfection. You need directional confidence. Choose the tier that matches your use case and accept the corresponding margin of error.
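Cochran's formula for a proportion is n = z² × p × (1 − p) / e², where e is the margin of error you can live with. The sketch below makes several assumptions: 95% confidence (z = 1.96), the worst case p = 0.5, and a reading of the 10-point and 6-point figures above as margins of error. The ±3 points used for the third tier is a placeholder chosen for illustration, not a number from the tiers themselves.

```python
import math

def cochran_sample_size(margin: float, z: float = 1.96, p: float = 0.5) -> int:
    """Cochran's formula for a proportion, rounded up to a whole sample."""
    if not 0 < margin < 1:
        raise ValueError("margin must be a fraction between 0 and 1")
    return math.ceil(z**2 * p * (1 - p) / margin**2)

# Margins in percentage points; the 3-point value is an assumed example.
for points in (10, 6, 3):
    n = cochran_sample_size(points / 100)
    print(f"±{points} points -> about {n} total samples (N x k)")
```

Halving the margin roughly quadruples the samples required, which is why the tightest tier costs so much more than the exploratory one.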
Using the number
What you can and can’t do with mention rate.
After 3–6 months of consistent measurement, the number becomes useful. But it has real limits. Being honest about those limits is what makes it defensible.
You CAN
You CAN’T
One more rule that matters more than any of the others:
Only compare distributions to distributions. Never compare single points to single points. If your margin of error is ±5 points, a 2-point drop means nothing. A 9-point lift probably does.
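That rule can be checked mechanically. The sketch below compares two measurement periods using the normal approximation for the difference of two proportions. The counts are invented, and 385 samples per period is assumed because it gives roughly ±5 points at 95% confidence.

```python
import math

def change_is_meaningful(hits_a, n_a, hits_b, n_b, z=1.96):
    """Compare two mention rates; return (difference, margin of the difference, verdict)."""
    p_a, p_b = hits_a / n_a, hits_b / n_b
    # The standard error of a difference combines the uncertainty of both periods.
    se = math.sqrt(p_a * (1 - p_a) / n_a + p_b * (1 - p_b) / n_b)
    diff = p_b - p_a
    margin = z * se
    return diff, margin, abs(diff) > margin

# Hypothetical periods with 385 samples each.
for label, before, after in [("2-point drop", 200, 192), ("9-point lift", 200, 235)]:
    diff, margin, real = change_is_meaningful(before, 385, after, 385)
    print(f"{label}: {diff * 100:+.1f} points, margin ±{margin * 100:.1f}, meaningful: {real}")
```

Note that the margin on a difference is wider than the margin on either period alone, because both measurements carry noise.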
The tools
Start measuring.
The Mention Rate Tool puts this framework into practice. Run structured prompt sampling, record brand mentions, and calculate mention rate across tiers — without doing it all by hand.
More tools are in development under this section, including a sentiment tool for tracking how AI systems describe your brand — not just whether they mention it.
Available now
Structured prompt sampling, mention rate calculation, and tier-based reporting. The N×k framework, automated.
Coming soon
Track how AI systems describe your brand — the language used, the associations made, the framing applied. Not just presence. Perception.