Step 04
Measure

Measurement

Screenshots are not data. Single prompts are not insights. AI visibility is a probability problem — and it needs to be measured like one.

Stop thinking like a rank tracker. Start thinking like a pollster.

The problem

We’re measuring AI visibility the same way we measure keyword rankings.

That assumption is breaking everything.

Here’s what most teams are doing right now: they ask ChatGPT about their brand, don’t see their name, and either panic or screenshot the one result where they do appear. Different tools. Different wrappers. Same assumption underneath all of it.

The assumption is: if I check this prompt, I know where we stand.

AI doesn’t work like that. There is no static SERP for prompts. No single answer. No shared result. No “what everyone else saw.”

In traditional search, everyone running the same query at the same moment sees roughly the same result. That shared reality is what makes rank tracking work. You can screenshot rank 3 and say “this is what rank 3 looks like today.”

AI gives you none of that. When you ask an LLM a question, you’re not retrieving a fixed answer. You’re sampling from a probability distribution. Treating that output like a ranking position is a category error.

This is why daily prompt tracking feels chaotic. One day you’re mentioned. The next you’re not. Nothing changed — you just pulled a different sample from the same underlying distribution.
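To make the sampling idea concrete, here is a minimal simulation, assuming a brand whose true chance of being mentioned is fixed at 50%. The function name, the 50% figure, and the 14-day window are illustrative assumptions, not measurements from any real model.

```python
import random

def simulate_daily_checks(true_mention_rate, days, seed=0):
    """Return one True/False 'were we mentioned today?' result per day.

    The underlying rate never changes; only the random draw does.
    """
    rng = random.Random(seed)  # seeded so the run is repeatable
    return [rng.random() < true_mention_rate for _ in range(days)]

checks = simulate_daily_checks(true_mention_rate=0.5, days=14)
print(["yes" if mentioned else "no" for mentioned in checks])
print(f"observed: {sum(checks)} of {len(checks)} days mentioned")
```

The yes/no sequence flips from day to day even though nothing about the underlying rate moved, which is the pattern a daily single-prompt check tends to produce.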

What the research says

Why screenshots don’t work and never will.

This isn’t just intuition. The research is clear.

Study 1 — Large Language Monkeys (Brown et al., 2024)

One attempt solved 16%. 250 attempts solved 56%.

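Single-shot responses show only a fraction of what’s possible. In the study, one attempt per problem resolved about 16% of SWE-bench Lite issues, while 250 attempts per problem resolved 56%. A one-off screenshot is a single draw from a much larger distribution, not a measurement of that distribution. The takeaway is unavoidable: single-shot testing is fundamentally unreliable for probabilistic systems.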

Study 2 — SparkToro & Gumshoe AI (January 2026)

142 participants. Same intent. Semantic similarity of 0.081.

Real users don’t repeat the same prompt. When 142 people wrote their own versions of the same question, their prompts were as different from each other as “Kung Pao Chicken and Peanut Butter.” Even if you perfectly capture how one person asks a question, you’ve learned almost nothing about how the next person will ask it.

The full SparkToro study ran 2,961 tests across ChatGPT, Claude, and Google AI. What they found: less than 1% chance of getting the same brand list twice. About 0.1% — one in a thousand — chance of getting the same list in the same order. And yet top brands still appeared in 55–77% of responses despite massive variation.

This points to two things. Ranking position is meaningless, because it changes on nearly every run. Visibility percentage across diverse prompts is the real metric.

The right metric

Mention rate.

If rankings don’t apply, what does?

The most defensible metric available right now is mention rate. It’s simple:

Mention rate = the percentage of responses where your brand appears.

Not “did we show up?” But: what percent of the time do we show up?

Stop tracking: “Did we appear in this one prompt?” Start tracking: “Across N prompt variations with k runs each, what percentage mentioned us?”

That framing turns AI visibility from a binary outcome into a distribution — which is exactly what LLMs produce. It accounts for LLM variance across multiple runs of the same prompt. It accounts for human variance across the different ways people ask the same question. It gives you a probability, not a position. And it actually resembles how real users experience AI search.
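As a rough sketch of the calculation, the snippet below counts the share of responses that mention a brand. The brand names, the alias list, and the sample responses are all made up for illustration, and whole-word matching is only one reasonable way to decide what counts as a mention.

```python
import re

def mentions_brand(response, brand_aliases):
    """True if any alias appears as a whole word or phrase, ignoring case."""
    for alias in brand_aliases:
        # Lookarounds stop "Acme" from matching inside "Acmetech".
        pattern = r"(?<!\w)" + re.escape(alias) + r"(?!\w)"
        if re.search(pattern, response, flags=re.IGNORECASE):
            return True
    return False

def mention_rate(responses, brand_aliases):
    """Share of responses (0.0 to 1.0) that mention the brand at least once."""
    if not responses:
        raise ValueError("need at least one response")
    hits = sum(mentions_brand(r, brand_aliases) for r in responses)
    return hits / len(responses)

# Hypothetical responses collected from repeated runs.
responses = [
    "Popular options include Acme Analytics and Globex.",
    "Many teams start with Globex or Initech.",
    "ACME is a common pick for small teams.",
    "It depends on your budget and team size.",
]
rate = mention_rate(responses, ["Acme Analytics", "Acme"])
print(f"{rate:.0%}")  # 2 of the 4 responses mention the brand
```

In practice the response list would come from many prompt variations and repeated runs rather than four hand-written strings.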

The mental model shift

LLMs don’t return answers. They return distributions.

When pollsters want to know who’s winning an election, they don’t ask one person. They sample many people across regions and repeat the poll over time. Then they report: “Candidate A has 52% support ± 3%.”

That’s what valid AI visibility measurement looks like. A percentage with a margin of error — not a screenshot with a caption.
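The margin of error in a statement like “52% ± 3%” can be approximated with the standard formula for a sample proportion. This sketch assumes a 95% confidence level (z = 1.96), a simple random sample, and a hypothetical sample of 1,000; none of those values come from the text above.

```python
import math

def margin_of_error(rate, n, z=1.96):
    """Half-width of a normal-approximation confidence interval for a proportion."""
    if n <= 0:
        raise ValueError("n must be positive")
    return z * math.sqrt(rate * (1 - rate) / n)

# Hypothetical poll: 52% support among 1,000 respondents.
moe = margin_of_error(rate=0.52, n=1000)
print(f"52% ± {moe:.1%}")  # roughly 3 points either way
```

The same function applies to a mention rate: replace the poll share with the observed rate and the respondent count with the number of sampled responses.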

The framework

N × k sampling.

Most teams think more prompts will fix the problem. They won’t. You need breadth and depth — two different dimensions of sampling.

Breadth — prompt diversity

How many different ways do people ask about your topic?

Breadth captures the variation in how real users phrase the same underlying intent. SparkToro found this variation is massive: 0.081 semantic similarity across 142 participants asking the same question. N, the number of distinct prompt variations, accounts for that.

Depth — inference stability

How many times do you run each prompt variation?

Depth reveals how volatile or stable each prompt actually is. Some prompts are surprisingly consistent. Others are chaos machines. You only learn which is which by running each one k times.

Total samples = N × k. Both dimensions matter. Running the same prompt 100 times tells you about inference stability but nothing about prompt diversity. Running 100 different prompts once each tells you about diversity but nothing about stability. You need both.
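One way to picture the N × k grid is as a nested loop: every prompt variation is run k times, and the results are kept per prompt so both dimensions stay visible. In this sketch `ask_model` is a hypothetical callable standing in for whatever model API you use, and `fake_ask_model` returns random placeholder brand names so the example runs offline. The substring match is a deliberate simplification.

```python
import random

def run_grid(prompts, k, ask_model):
    """Run each prompt k times and return {prompt: [response, ...]}."""
    return {prompt: [ask_model(prompt) for _ in range(k)] for prompt in prompts}

def per_prompt_rates(results, brand):
    """Mention rate per prompt, which shows which prompts are stable or volatile."""
    return {
        prompt: sum(brand.lower() in r.lower() for r in responses) / len(responses)
        for prompt, responses in results.items()
    }

# Stand-in for a real model call (assumption: a real client would go here).
_rng = random.Random(42)

def fake_ask_model(prompt):
    picks = _rng.sample(["Acme", "Globex", "Initech", "Umbrella"], k=2)
    return f"You could try {picks[0]} or {picks[1]}."

prompts = [  # N = 3 hypothetical phrasings of one intent
    "best analytics tool for startups",
    "which analytics platform should I use",
    "analytics software recommendations",
]
results = run_grid(prompts, k=5, ask_model=fake_ask_model)
rates = per_prompt_rates(results, brand="Acme")

total = sum(len(responses) for responses in results.values())
hits = sum("acme" in r.lower() for rs in results.values() for r in rs)
overall = hits / total

for prompt, rate in rates.items():
    print(f"{rate:.0%}  {prompt}")
print(f"overall: {overall:.0%} across {total} samples")
```

Keeping per-prompt rates alongside the overall figure is a design choice: the overall number is the mention rate, while the spread across prompts hints at how much of the variation comes from phrasing rather than from the model.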

How many samples do you actually need?

This isn’t guesswork. It’s standard statistics — Cochran’s formula, the same math pollsters use to calculate sample sizes for election surveys.

Tier 1 — Exploratory

~100

20 prompts × 5 runs · ±10% margin

Use when you’re pressure-testing a topic or doing early exploration. Your mention rate could swing 10 points in either direction, but it’s still far more informative than a single screenshot.

Tier 2 — Monitoring

~400

40 prompts × 10 runs · ±5% margin

Use for monthly tracking or internal reporting. You can detect real shifts of roughly 7 points or more between two measurement periods. This is the right tier for most ongoing visibility programs.

Tier 3 — High-Stakes

~2,400

80 prompts × 30 runs · ±2% margin

Use for board reporting, competitive claims, or PR impact measurement. You can confidently measure small changes and defend the numbers to stakeholders.

You don’t need perfection. You need directional confidence. Choose the tier that matches your use case and accept the corresponding margin of error.
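The tier sizes above line up with Cochran’s formula for a proportion, n = z² · p · (1 − p) / e². The sketch below assumes a 95% confidence level (z = 1.96) and the most conservative proportion (p = 0.5); the text does not state either value, so treat both as assumptions.

```python
import math

def cochran_sample_size(margin, p=0.5, z=1.96):
    """Cochran's formula for a proportion: n = z^2 * p * (1 - p) / margin^2."""
    if not 0 < margin < 1:
        raise ValueError("margin must be between 0 and 1")
    return (z ** 2) * p * (1 - p) / (margin ** 2)

tiers = [("Exploratory", 0.10), ("Monitoring", 0.05), ("High-Stakes", 0.02)]
for label, margin in tiers:
    n = cochran_sample_size(margin)
    print(f"{label}: ±{margin:.0%} needs about {n:.0f} samples")
```

Under those assumptions the formula gives about 96, 384, and 2,401 samples, which the tiers round to 100, 400, and 2,400. The formula treats every sample as independent. Repeated runs of the same prompt are likely correlated, so the true margin for an N × k design may be somewhat wider than the tier suggests.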

Using the number

What you can and can’t do with mention rate.

After 3–6 months of consistent measurement, the number becomes useful. But it has real limits. Being honest about those limits is what makes it defensible.

You CAN

✓ “We improved from 42% to 52% over Q1”

✓ “We outperform Competitor X by 1.8×”

✓ “This campaign drove a 9-point lift”

✓ Track month-over-month trends

✓ Compare categories (“stronger in X than Y”)

✓ Set benchmarks and test interventions

You CAN’T

✕ Say “we rank #3” — there is no rank

✕ Say “we dropped 2 points this month” when that is within the margin of error

✕ Panic over one bad screenshot

✕ Expect the same answer twice

✕ Report “47.3%” from 100 samples (false precision)

✕ Treat this like keyword position tracking

One more rule that matters more than any of the others:

Only compare distributions to distributions. Never compare single points to single points. If your margin of error is ±5%, a 2-point drop is indistinguishable from noise. A 9-point lift probably reflects a real change.
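A simple way to apply this rule is to compare the change between two measurement periods with the margin of error of the difference itself. The sketch below uses the normal approximation for two independent proportions at an assumed 95% confidence level; the function names and the 400-sample periods are illustrative.

```python
import math

def difference_with_margin(rate_a, n_a, rate_b, n_b, z=1.96):
    """Return (change, margin of error of the change) between two periods."""
    diff = rate_b - rate_a
    standard_error = math.sqrt(
        rate_a * (1 - rate_a) / n_a + rate_b * (1 - rate_b) / n_b
    )
    return diff, z * standard_error

def is_real_shift(rate_a, n_a, rate_b, n_b, z=1.96):
    """True when the change is larger than its own margin of error."""
    diff, margin = difference_with_margin(rate_a, n_a, rate_b, n_b, z)
    return abs(diff) > margin

# Two hypothetical comparisons, each with 400 samples per period.
print(is_real_shift(0.50, 400, 0.48, 400))  # 2-point drop: False
print(is_real_shift(0.42, 400, 0.51, 400))  # 9-point lift: True
```

With 400 samples in each period the margin on the difference is close to 7 points, so the 2-point drop falls inside the noise and the 9-point lift clears it. The calculation assumes the two periods were sampled independently with the same prompt set and method.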

The tools

Start measuring.

The Mention Rate Tool puts this framework into practice. Run structured prompt sampling, record brand mentions, and calculate mention rate across tiers — without doing it all by hand.

More tools are in development under this section, including a sentiment tool for tracking how AI systems describe your brand — not just whether they mention it.

Available now

Mention Rate Tool

Structured prompt sampling, mention rate calculation, and tier-based reporting. The N×k framework, automated.

Coming soon

Sentiment Tool

Track how AI systems describe your brand — the language used, the associations made, the framing applied. Not just presence. Perception.

The full framework

Define

Topic entities, brand, services, audiences, and concepts.

Site Structure

Turn your entities into hub pages and content architecture.

Entity Reinforcement

Internal linking and schema that connects the pieces.

Measurement

Mention rate, N×k sampling, and AI visibility with real numbers.
