Microsoft Clarity Now Catches AI Crawlers Breaking Your Website Rules — Here's Why Service Businesses Should Care

Microsoft Clarity added robots.txt violation detection to its free Bot Analytics dashboard on June 23, 2026. Here is what it means for service business AI visibility.

Ido Cohen · Published 2026-06-27 · SEO & Search

Microsoft just handed every service business a free window into something that was previously invisible: which AI crawlers are breaking into the parts of your website you explicitly told them to stay out of — and exactly which content they are targeting when they do it.

On June 23, 2026, Microsoft Clarity released a robots.txt violations layer inside its Bot Analytics dashboard. Search Engine Journal, PPC Land, and the official Microsoft Clarity blog all covered the launch within 48 hours. This is not a niche developer story. If you own a website for a plumbing company, a law firm, a dental practice, or an HVAC business, this update affects how AI systems see you — and how you can finally start measuring it.

What Actually Changed in Microsoft Clarity

The new feature is a Violations card inside the existing Bot Analytics dashboard, and it does something genuinely new.

Before this release, you could see that AI crawlers were visiting your site. Now, according to the Microsoft Clarity blog, you can see when those crawlers are accessing paths your robots.txt file explicitly disallowed — shown as a percentage of total bot traffic, filterable by operator, bot name, and activity type. The feature is free for all Clarity users who have connected a supported CDN provider, including Cloudflare, Amazon CloudFront, Fastly, Azure Front Door, and Akamai. WordPress sites using the latest Clarity plugin are also supported without any CDN setup.

To set it up, a project admin needs to enable it inside the AI Visibility section of Project Settings. Once the CDN integration is connected, the dashboard takes roughly 24 hours to populate with the first data.

Here is what the dashboard now shows you:

**AI bot request share (%):** What percentage of your total traffic comes from AI bots, agents, and crawlers
**Pages crawled (%):** What share of your site's pages AI bots are requesting
**Violations card:** Which specific crawlers are accessing paths you explicitly blocked, trended over time
**Compliance comparison view:** Side-by-side look at compliant vs. non-compliant crawlers by operator and bot name

One important limitation to understand: Clarity measures and reports violations. It does not block them. To actually stop a non-compliant crawler, you need a web application firewall or CDN-level rules, because robots.txt directives only work on crawlers that choose to honor them. Clarity tells you what is happening; enforcement is a separate step.
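To make the idea of a "violation" concrete, here is a minimal Python sketch of the underlying check: compare each logged bot request against the rules in robots.txt and count the ones that hit a disallowed path. This is not how Clarity itself is implemented (the source does not describe that); the robots.txt rules, bot names, and request list below are hypothetical examples.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt for a small service-business site.
ROBOTS_TXT = """\
User-agent: *
Disallow: /wp-admin/
Disallow: /old-pricing/
"""

# Hypothetical (bot name, requested path) pairs, as they might be
# extracted from CDN or server logs.
REQUESTS = [
    ("GPTBot", "/services/drain-cleaning/"),
    ("GPTBot", "/old-pricing/"),
    ("ClaudeBot", "/"),
    ("Bytespider", "/wp-admin/"),
]


def find_violations(robots_txt, requests):
    """Return the requests that touched a path robots.txt disallows."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return [
        (agent, path)
        for agent, path in requests
        if not parser.can_fetch(agent, path)
    ]


violations = find_violations(ROBOTS_TXT, REQUESTS)

# Violations as a percentage of all bot requests in the sample.
violation_share = 100 * len(violations) / len(REQUESTS)

for agent, path in violations:
    print(f"{agent} requested disallowed path {path}")
print(f"Violation share: {violation_share:.1f}%")
```

The sketch only measures, just like the dashboard. Blocking the offending requests would still have to happen at the firewall or CDN.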

Why This Matters More Than Most People Realize

AI crawlers are already a significant — and growing — share of your website traffic, whether you know it or not.

According to Cloudflare's 2025 data reported by PPC Land, AI bots accounted for 4.2% of all HTML requests across Cloudflare's network in December 2025. That share is still climbing. HUMAN Security's April 2026 State of AI Traffic report documented that automated traffic is growing eight times faster than human traffic. And here is the part that gets expensive for local service businesses: Botify data cited by PPC Land showed OpenAI bots crawling retail sites 198 times for every referral visit they deliver, compared with about 6 crawls per referral visit for Google. Your server is doing real work, and you are paying real infrastructure costs, for visits that mostly go nowhere.

For service businesses with modest hosting plans — shared hosting for a dentist's website, a basic managed WordPress setup for a plumber — that load matters. Kinsta's infrastructure analysis found AI bots trapped in query-string loops hitting WooCommerce cart and checkout pages up to 3.75 million times in a single day across a sample of 10 billion requests. That is not a misprint.

The violations layer is important because the scale of non-compliance is larger than most site owners realize. According to a study cited by 365i (a UK digital agency), AI crawlers violated explicit robots.txt rules on 72% of UK business websites tested, averaging 156 violation requests per site over a three-week period in 2025.

The Bigger AI Visibility Problem This Unlocks

Here is the strategic angle that goes beyond server load: your robots.txt rules can accidentally make you invisible in AI-generated answers — and you would not know it.

Traditional analytics tools like Google Analytics 4 are functionally blind to AI crawlers. The reason, as explained by AI Rockstars in their Clarity analysis, is architectural: tools like GA4 are built on client-side JavaScript. AI crawlers such as GPTBot or ClaudeBot request a page's HTML source but, to save resources, do not execute JavaScript. The result is a large category of "ghost traffic": visits that consume server resources and shape what AI systems know about your business, but never appear in any standard dashboard.

Microsoft Clarity bypasses this blind spot by analyzing CDN log data directly at the server level, capturing 100% of HTTP requests regardless of JavaScript execution. That is the core technical value of the new violations feature: it is telling you about activity that your other tools cannot see.
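For readers who want to see why server logs catch what a JavaScript tag misses, here is a small Python sketch that counts AI crawler hits in raw access-log lines. It assumes the common "combined" log format, where the user agent is the last quoted field; the list of bot tokens and the sample log lines are illustrative assumptions, not Clarity's actual detection logic.

```python
import re
from collections import Counter

# Assumed list of AI crawler user-agent tokens; extend it as needed.
AI_BOT_TOKENS = ("GPTBot", "ClaudeBot", "PerplexityBot", "Bytespider", "OAI-SearchBot")

# Matches the final quoted field of a combined-format log line (the user agent).
UA_PATTERN = re.compile(r'"([^"]*)"\s*$')


def count_ai_bot_hits(log_lines):
    """Count requests per AI bot token found in the user-agent field."""
    hits = Counter()
    for line in log_lines:
        match = UA_PATTERN.search(line)
        if not match:
            continue  # line is not in the expected format
        user_agent = match.group(1).lower()
        for token in AI_BOT_TOKENS:
            if token.lower() in user_agent:
                hits[token] += 1
                break
    return hits


# Hypothetical log lines using documentation-only IP addresses.
SAMPLE_LOG = [
    '203.0.113.5 - - [27/Jun/2026:10:00:00 +0000] "GET /services/ HTTP/1.1" 200 5120 "-" "Mozilla/5.0; compatible; GPTBot/1.1"',
    '203.0.113.9 - - [27/Jun/2026:10:00:02 +0000] "GET /contact/ HTTP/1.1" 200 2048 "-" "Mozilla/5.0 (Windows NT 10.0) Chrome/125.0"',
    '203.0.113.7 - - [27/Jun/2026:10:00:05 +0000] "GET /pricing/ HTTP/1.1" 200 4096 "-" "Mozilla/5.0; compatible; ClaudeBot/1.0"',
    '203.0.113.5 - - [27/Jun/2026:10:00:09 +0000] "GET /about/ HTTP/1.1" 200 1024 "-" "Mozilla/5.0; compatible; GPTBot/1.1"',
]

ai_hits = count_ai_bot_hits(SAMPLE_LOG)
for bot, count in ai_hits.most_common():
    print(f"{bot}: {count} requests")
```

In this sample, only the Chrome request would normally show up in a JavaScript-based tool, because it is the only one from a browser that runs the tracking script. The three bot requests exist only in the log.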

The Microsoft Clarity blog makes the business case clearly: if you are seeing a high number of unsuccessful bot requests, it may indicate that AI systems cannot properly access your content — whether due to blocked access, broken URLs, or restricted pages. That results in missed opportunities to be discovered or referenced in AI-generated answers.

This matters enormously right now. According to Search Engine Journal's coverage of the Clarity release, the feature adds to existing AI Visibility tools in the dashboard, which in May began showing the grounding queries behind AI citations. Put these two things together and you get a picture that was previously impossible to assemble: which crawlers are blocked or misbehaving, and which search queries are actually pulling citations from your content.

The Compliance Landscape Is a Mess — And Getting Messier

Not all AI crawlers behave the same way, and the differences matter for service businesses trying to manage who gets in.

According to PPC Land's reporting on the Clarity release, OpenAI made a significant policy change in late 2025: it announced that its ChatGPT-User agent would no longer follow robots.txt directives for user-initiated browsing. That means when a real ChatGPT user asks the chatbot to browse your site, the agent may disregard your robots.txt rules. Anthropic, by contrast, publicly committed in February 2026 that its three crawlers (ClaudeBot, Claude-User, and Claude-SearchBot) do respect robots.txt and will not attempt to bypass CAPTCHAs. Whether that commitment holds in practice is another matter: Reddit's lawsuit against Anthropic alleged the company accessed its platform more than 100,000 times after publicly claiming it had stopped.

There is also a spoofing problem. HUMAN Security's Satori threat intelligence team found that a significant portion of requests claiming to be ChatGPT, Mistral, and Perplexity bots did not originate from those operators' actual infrastructure. Attackers spoof user-agent strings to exploit the trust that site owners extend to recognized AI crawlers, bypassing robots.txt allowlists. This is not a theoretical risk; it is active.
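One common way to spot a spoofed crawler is to check whether the request's IP address falls inside the ranges the operator says its bots use. The Python sketch below illustrates the idea under the assumption that the operator publishes such ranges. The ranges shown are placeholder documentation addresses, not real crawler ranges, and the function name is hypothetical.

```python
import ipaddress

# Placeholder ranges (reserved documentation addresses), NOT real crawler IPs.
# In practice you would load the ranges each operator publishes.
PUBLISHED_RANGES = {
    "GPTBot": ["192.0.2.0/24"],
    "PerplexityBot": ["198.51.100.0/24"],
}


def is_probably_spoofed(claimed_bot, client_ip):
    """Return True if the IP is outside the bot's known ranges,
    False if it is inside, and None if no ranges are on file."""
    ranges = PUBLISHED_RANGES.get(claimed_bot)
    if ranges is None:
        return None  # cannot judge without published ranges
    address = ipaddress.ip_address(client_ip)
    return not any(address in ipaddress.ip_network(cidr) for cidr in ranges)


print(is_probably_spoofed("GPTBot", "192.0.2.44"))     # inside the placeholder range
print(is_probably_spoofed("GPTBot", "203.0.113.10"))   # outside the placeholder range
print(is_probably_spoofed("UnknownBot", "192.0.2.1"))  # no ranges on file
```

A user-agent string is just text the sender chooses, while the source IP is much harder to fake, which is why the check keys on the address rather than the name.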

What this means for a service business owner: your robots.txt file is a statement of intent, not a lock on the door. Clarity's new violations feature at least tells you how often the door is being tested.

The Specific Risk for Service Businesses

Here is where this becomes personal for a dentist, a real estate agent, or an HVAC contractor.

Your website probably has pages you do not want AI systems reading and citing. Possible examples:

**Outdated pricing pages** you haven't deleted yet but that no longer reflect what you actually charge
**Service areas you no longer cover** that are still buried in old blog posts
**/wp-admin/ or /wp-login.php pages** that have no business being indexed or cited by anything
**Internal scheduling or quote forms** meant for humans, not scrapers
**Post-conversion thank-you pages** that could distort an AI system's understanding of your business

If an AI crawler is accessing those pages despite your robots.txt instructions, and that content finds its way into an AI-generated answer about your business — say, a ChatGPT response recommending your business to someone in a service area you left two years ago — that is a real reputational and operational problem.

The flip side is equally important. According to analysis from ClickRank and Clarity's own FAQ, if you have accidentally blocked AI search crawlers from your high-value service pages, those pages may still rank in traditional Google results but disappear from AI summaries, citations, and recommendations entirely. Your traditional SEO stays stable while your AI-driven discovery quietly collapses.

How Clarity Fits Into the Broader AI Visibility Stack

Clarity is free, which makes it unusual in this category. But free also means limited.

Here is a practical comparison of your current options for monitoring AI crawler behavior:

Clarity is best understood as the measurement layer. It gives you the intelligence to make decisions. Enforcement happens at the CDN or firewall level, separately. For most service businesses, the right starting point is Clarity — because you need to understand what is actually happening before you can make intelligent enforcement decisions.

One more thing: Clarity's AI Visibility section already shows which grounding queries are pulling citations from your content, and which of your pages are being cited in AI-generated answers. The new violations feature completes the upstream picture. You can now see the full chain: crawlers arrive, some comply with your rules and some do not, some of that crawling eventually produces citations or referral traffic — and the gap between what gets crawled and what generates a visit is where your AI visibility is leaking.

What to Do This Week

You do not need to be technical to take these steps. If you have a web developer or marketing agency, forward them this post.

1. Install or update Microsoft Clarity on your website. It is free and takes about 10 minutes. Go to clarity.microsoft.com and follow the setup wizard. For WordPress, install the official Clarity plugin.

2. Connect your CDN to Clarity. If your site is on Cloudflare (very common for small businesses), connect it to Clarity's Bot Analytics in the AI Visibility section of Project Settings. This unlocks the violations data and the bot activity dashboard.

3. Wait 24 hours, then check your Bot Analytics dashboard. Look specifically at: (a) what percentage of your traffic is AI bots, (b) which crawlers appear in the Violations card, and (c) which pages are getting the most non-compliant bot attention.

4. Audit your robots.txt file against your actual service pages. Open your robots.txt file (it lives at yourdomain.com/robots.txt) and check whether you have accidentally blocked OAI-SearchBot (OpenAI's search crawler) or PerplexityBot. If you blocked them to stop training crawlers, you may have cut yourself out of AI search results in the process.

5. Review high-violation pages and decide on enforcement. If specific pages are getting hammered by non-compliant bots, talk to your web developer or CDN provider about adding firewall rules for the worst offenders. This is separate from Clarity, which only measures — it does not block.

6. Set up a monthly review. AI crawler behavior changes fast. Bytespider (ByteDance) nearly doubled its share of AI bot traffic in May 2026 alone, according to Cloudflare Radar data reported by TechnologyChecker.io. The landscape you map this week will look different in 60 days.
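If you hand step 4 to a developer, the audit can be sketched in a few lines of Python. The example below checks whether OAI-SearchBot or PerplexityBot would be denied access to your key service pages. The robots.txt contents, page paths, and function name are hypothetical; in practice you would paste in the contents of your own file.

```python
from urllib.robotparser import RobotFileParser

# AI search crawlers named in step 4.
SEARCH_CRAWLERS = ["OAI-SearchBot", "PerplexityBot"]


def audit_search_access(robots_txt, paths, crawlers=SEARCH_CRAWLERS):
    """Return {crawler: [paths it may not fetch]} for crawlers that are blocked."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    blocked = {}
    for crawler in crawlers:
        denied = [path for path in paths if not parser.can_fetch(crawler, path)]
        if denied:
            blocked[crawler] = denied
    return blocked


# Hypothetical robots.txt: the owner meant to stop training crawlers,
# but also cut PerplexityBot off from the service pages.
ROBOTS_TXT = """\
User-agent: GPTBot
Disallow: /

User-agent: PerplexityBot
Disallow: /services/

User-agent: *
Disallow: /wp-admin/
"""

# Hypothetical high-value pages to protect.
SERVICE_PAGES = ["/services/emergency-plumbing/", "/contact/"]

report = audit_search_access(ROBOTS_TXT, SERVICE_PAGES)
for crawler, denied in report.items():
    print(f"{crawler} is blocked from: {', '.join(denied)}")
```

An empty report means none of the listed search crawlers is blocked from the pages you checked. It says nothing about whether crawlers actually obey the file, which is what the Violations card is for.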

Frequently Asked Questions

What is robots.txt and why does it matter for my service business website?

robots.txt is a plain-text file that sits at the root of your website (yoursite.com/robots.txt) and tells automated crawlers (search bots, AI training crawlers, AI search bots) which parts of your site they are allowed to access. It is an honor-system standard: well-behaved crawlers follow it, but nothing technically forces a crawler to comply. For service businesses, it matters because what AI crawlers access shapes what AI systems like ChatGPT and Google's AI Mode know about your business, your service areas, and your pricing.

Is Microsoft Clarity actually free, and what do I need to use the new violations feature?

Yes, Clarity is entirely free with no traffic limits or upgrade tiers. The Bot Analytics violations feature specifically requires connecting a supported CDN provider — Cloudflare, Amazon CloudFront, Fastly, Azure Front Door, or Akamai — or using the latest Microsoft Clarity WordPress plugin. Standard Clarity installation alone is not enough; the CDN or plugin connection is required because the violations data comes from server-level logs, not client-side JavaScript.

Will blocking AI crawlers hurt my search rankings or AI visibility?

It depends on which crawler you block and why. Blocking Google-Extended (a robots.txt control token for Google's AI training rather than a separate crawler) does not remove you from Google search results, but it also does not prevent your content from appearing in Google AI Overviews. Blocking OAI-SearchBot, OpenAI's search crawler, could remove your content from ChatGPT search results. Blocking ClaudeBot could affect whether your business appears in Anthropic's Claude responses. The safest approach for most service businesses is to block only training-specific crawlers, keep search-specific crawlers open, and use Clarity to verify your rules are actually working as intended.

Why can't I just see AI crawler violations in Google Analytics 4?

GA4 is built on a client-side JavaScript tag, which means it only records visits from browsers that load and execute JavaScript. AI crawlers like GPTBot and ClaudeBot request raw HTML but skip JavaScript execution to conserve resources. The result is that these visits happen — consuming your server bandwidth and shaping what AI systems know about your content — but they are completely invisible in GA4. Microsoft Clarity bypasses this by ingesting CDN-level server logs directly, which capture every HTTP request regardless of JavaScript.

How do I know if AI crawlers are actually hurting my website's performance?

Open your hosting control panel or CDN dashboard and look at server-side traffic logs rather than your analytics dashboard. If your actual bandwidth usage and server request counts are significantly higher than your GA4 sessions suggest, AI crawlers likely account for much of the gap. After connecting Clarity to your CDN, check the AI bot request share percentage in the Bot Analytics dashboard. Anything above 15 to 20% of total requests from AI bots warrants a closer look, especially if your hosting plan has bandwidth limits or your site starts slowing down during high-crawl periods.
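The share calculation itself is simple arithmetic. Here is a minimal Python sketch, using the lower end of the 15 to 20% range above as the default threshold. The request counts and function names are made up for illustration.

```python
def ai_bot_share(ai_bot_requests, total_requests):
    """AI bot requests as a percentage of all requests."""
    if total_requests <= 0:
        raise ValueError("total_requests must be positive")
    return 100 * ai_bot_requests / total_requests


def needs_closer_look(share_percent, threshold_percent=15.0):
    """Flag a share above the threshold (default: low end of the 15 to 20% range)."""
    return share_percent > threshold_percent


# Hypothetical daily totals from a CDN dashboard.
share = ai_bot_share(ai_bot_requests=4200, total_requests=20000)
print(f"AI bot share: {share:.1f}%")
print("Worth a closer look" if needs_closer_look(share) else "Within the normal range")
```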

Sources:

Microsoft Clarity Blog — "New Ways to Measure Bot Activity in Clarity" — June 23, 2026
Search Engine Journal — "Microsoft Clarity Now Flags Bots That Ignore Robots.txt" — June 25, 2026
PPC Land — "Microsoft Clarity now flags robots.txt violations inside Bot Analytics" — June 23, 2026
MarketingProfs — "AI Update, June 26, 2026: AI News and Views From the Past Week" — June 26, 2026
HUMAN Security — "2026 State of AI Traffic & Cyberthreat Benchmark Report" — April 2026
TechnologyChecker.io — "We Analyzed robots.txt Across Cloudflare's Network" — June 2026
365i — "AI Crawlers Violate robots.txt on 72% of UK Sites" — January 2026
ClickRank.ai — "Crawlers to Allow or Block: The Ultimate 2026 Master List" — March 2026
AI Rockstars — "Microsoft Clarity: AI Bot Activity & Traffic-Analysis" — January 2026
Culture Foundry — "Using Clarity's AI Bot Activity Report (2026 Guide for SEO & AEO)" — March 2026
No Hacks — "The AI User-Agent Landscape in 2026: A Complete Reference" — April 2026