Why Publisher Traffic Collapsed While Google Search Revenue Grew

From inside a digital agency I see dozens of Google Analytics dashboards. Clients across niches, partners I've worked with for years, friends in publishing who let me peek at their numbers. The decline pattern in 2024 through 2026 is specific in a way that doesn't match Google's stated guidance about what helpful content gets rewarded. Some sites lost ninety percent of their organic traffic. Others held flat or grew. The split tracks a structural feature of the content, not the quality of the writing, and a structural feature of Google's revenue model that nobody seems to want to name out loud.

So the goal here isn't to argue that SEO is dead. That phrase doesn't survive contact with the data, and it bores me anyway. What I want to argue is something more specific: between 2023 and 2026 Google's Search business improved (revenue up, AI features named by management as a driver of usage growth, share of Alphabet only marginally compressed) while the publisher tier that historically depended on Search degraded along measurable axes. The two trends are tightly correlated. The product Google credits for accelerating Search is the same product whose CTR data shows publishers losing their clicks. That part is the strange one. Google said it on the earnings call.

Three layers, in order. First the empirical pattern: a decay measurement across 27 mid-tier publisher properties in six niches, 2022 through 2026, using Ahrefs organic-traffic estimates. Second the financial structure: a 12-quarter decomposition of Alphabet's reported revenue mix from primary earnings-release sources, looking at what changed for Google as the publisher tier shrank. Third the algorithmic mechanism: the March 2024 Google Content Warehouse leak, analyzed by Mike King at iPullRank and Rand Fishkin at SparkToro, which surfaces the site-level proxy signals that bias ranking against single-author and small-operation publishers by construction.

The reason nobody in the industry wants to make these three claims in the same place is that together they imply a conclusion uncomfortable for both halves of the SEO discourse. The "Google is hostile to creators" half is wrong: Google didn't kill publishers in any active sense. The "Google rewards quality" half is also wrong: the proxies that drive ranking don't measure quality. Both halves orbit a fact neither says aloud. Google still needs the open web as raw input, but it no longer needs to route enough value back through clicks to sustain the mid-tier publisher economy. The content layer didn't become useless. The click-routed subsidy that funded it did.

One framing note before the evidence. This is not a causal-identification paper, and I'm not going to pretend three datasets add up to a controlled experiment. It's a triangulation argument: three independent surfaces (traffic data, financial filings, a leaked ranking schema) point at the same structural shift, and the version of events that fits all three at once is the one I'll defend. Where the evidence is inference rather than proof, I'll say so.

What 27 publisher sites tell you

I started with a question and a list. The question: did the publisher collapse since 2022 happen uniformly across content types, or is there structure to it? The list: 27 mid-tier publisher properties across six niches, the kind that grew up in the 2013–2018 window when the SEO/AdSense/evergreen formula still worked. Travel: Nomadic Matt, The Blonde Abroad, Expert Vagabond, Never Ending Footsteps, Adventurous Kate, The World Pursuit. Recipe: Smitten Kitchen, Pinch of Yum, Minimalist Baker, Half-Baked Harvest, Sally's Baking Addiction. Tech how-to: How-To Geek, MakeUseOf, Instructables, Beebom, AddictiveTips. Hobby and craft: Allfreesewing, Crochetkim. Personal finance: Mr. Money Mustache, Wise Bread, Get Rich Slowly, Frugalwoods, Budgets Are Sexy. Lifestyle: Apartment Therapy, Cup of Jo, Design Mom, Alpha Mom.

Some of these are single-author operations; others are part of digital media portfolios. How-To Geek and MakeUseOf are owned by Valnet (acquired 2023 and 2020 respectively), Instructables by Autodesk (since 2011), Allfreesewing is part of Prime Publishing. I deliberately excluded large media conglomerates (Condé Nast, Hearst, Recurrent Ventures). The sample is mid-tier publisher properties: bigger than personal blogs, smaller than enterprise media. That tier is where Google's stated helpful-content criteria should most directly apply.

For each site I pulled monthly organic-traffic estimates from Ahrefs, January 2022 through April 2026. 1,404 monthly observations across 27 domains. Peak month is defined as the month with the highest twelve-month trailing average traffic, restricted to before August 2024 so that post-AI-Overviews recovery couldn't be confused with the actual peak. Current is the trailing three months (February through April 2026) to smooth out single-month index artifacts. Inflection month is the first time after peak where the rolling three-month mean drops fifteen percent or more and the decline holds for the next three months. That catches the cliff trigger, not the floor of the slide.
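To make those three definitions concrete, here is a minimal sketch of how they could be computed from a monthly series. It is not the exact script behind the numbers in this piece. The function names are hypothetical, the series is synthetic, and one detail is an assumption on my part: the fifteen percent drop is measured against the three-month rolling mean at the peak month.

```python
from statistics import mean

AUG_2024 = 31  # index of August 2024 when index 0 is January 2022


def trailing_mean(series, end, window):
    """Mean of the `window` values ending at index `end` (inclusive)."""
    return mean(series[end + 1 - window : end + 1])


def peak_index(series, cutoff=AUG_2024):
    """Month before `cutoff` with the highest twelve-month trailing average."""
    return max(range(11, cutoff), key=lambda i: trailing_mean(series, i, 12))


def current_level(series):
    """Mean of the last three months in the series."""
    return mean(series[-3:])


def inflection_index(series, peak, drop=0.15, hold=3):
    """First month after `peak` whose 3-month rolling mean sits at least `drop`
    below the rolling mean at the peak and stays there for `hold` more months.
    ASSUMPTION: the drop is measured against the rolling mean at the peak month."""
    threshold = trailing_mean(series, peak, 3) * (1 - drop)
    for i in range(peak + 1, len(series) - hold):
        if all(trailing_mean(series, j, 3) <= threshold
               for j in range(i, i + hold + 1)):
            return i
    return None  # gradual decline, no clean trigger


# Synthetic 52-month series (NOT real Ahrefs data): steady growth, then a cliff.
series = [80 + i for i in range(20)] + [40] * 32
peak = peak_index(series)
cliff = inflection_index(series, peak)
```

A site that declines slowly enough never satisfies the hold condition and returns `None`, which is how a "no clean inflection trigger" case would surface under this reading of the rule.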

Ahrefs is an estimate. It is not ground truth. The estimates are derived from a keyword corpus and a click model, and when Ahrefs adds new keywords or recalibrates the model, all historical estimates can shift. I'll flag specific data-quality concerns where they matter. The relative comparison across sites and the relative position of inflection months are more reliable than any single absolute number.

Publisher organic-traffic decay 2022–2026, 27 sites normalized to peak

The headline finding is that the collapse is bifurcated, not uniform. (I'll say bifurcated, not bimodal: n=27 shows a clear split into winners and losers, but it's too small to claim two formal statistical modes.) Of the 27 sites:

5 are catastrophic: lost 90 percent or more of peak traffic (How-To Geek, AddictiveTips, MakeUseOf, Expert Vagabond, The World Pursuit)
7 lost 50 to 89 percent (Mr. Money Mustache, Frugalwoods, Wise Bread, Nomadic Matt, Minimalist Baker, Crochetkim, Allfreesewing)
8 lost 1 to 49 percent (the moderate middle)
7 grew or held flat (Never Ending Footsteps +188%, Beebom +168%, Adventurous Kate +72%, Budgets Are Sexy +45%, Instructables +23%, Half-Baked Harvest +19%, The Blonde Abroad +7%)

Median decline across the whole sample is −31 percent. Mean is −20 percent. These are not "publisher apocalypse" numbers; they are bifurcation numbers. Some sites died completely. Many sites declined modestly. A meaningful minority grew through the same window.
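The bucketing itself is simple enough to sketch. The snippet below applies the four bands to the per-site changes quoted by name in this piece (twelve of the 27 sites). The `bucket` function and the exact treatment of band edges are my assumptions for illustration.

```python
from collections import Counter


def bucket(change_pct):
    """Map a percent change vs peak to one of the four bands used above.
    ASSUMPTION: band edges are inclusive on the more negative side."""
    if change_pct <= -90:
        return "catastrophic (lost 90% or more)"
    if change_pct <= -50:
        return "lost 50 to 89%"
    if change_pct <= -1:
        return "lost 1 to 49%"
    return "grew or held flat"


# Percent change vs peak, only for sites whose figure is quoted in the text.
changes = {
    "The World Pursuit": -99.8,
    "How-To Geek": -93,
    "Frugalwoods": -60,
    "Apartment Therapy": -39,
    "Smitten Kitchen": -3.5,
    "Never Ending Footsteps": 188,
    "Beebom": 168,
    "Adventurous Kate": 72,
    "Budgets Are Sexy": 45,
    "Instructables": 23,
    "Half-Baked Harvest": 19,
    "The Blonde Abroad": 7,
}

counts = Counter(bucket(pct) for pct in changes.values())
```

Because only a subset of sites is listed, `counts` will not reproduce the 5 / 7 / 8 / 7 split; it only shows how each site lands in its band.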

The niche breakdown sharpens the bifurcation:

Niche	Sites	Median decline
Recipe	5	−10%
Travel	6	−27% (wide range)
Lifestyle	4	−30%
Personal finance	5	−53%
Hobby and craft	2	−79%
Tech how-to	5	−93%

Median traffic decline by niche, per-site peak vs February–April 2026

Recipe sites largely survived. Tech how-to sites were gutted. That's an 83-point spread between the niche that did best and the niche that did worst, on the same algorithm, in the same window. Lifestyle held; how-to and craft collapsed; travel and personal finance split.

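Now look at when the cliffs happened. Across 23 detected inflection months (four sites declined gradually with no clean inflection trigger), the distribution across the named update windows is as follows (the windows do not cover every month, so the counts sum to 21 rather than 23):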

August through December 2022 (HCU first launch period): 2 inflections
September through December 2023 (HCU September 2023 + October and November core updates): 6 inflections
January through April 2024 (March 2024 core update): 1 inflection
May through August 2024 (AI Overviews US launch May 14): 1 inflection
Q4 2024 onward (post-AIO drift): 11 inflections

In this dataset, the largest synchronized cliff clusters around the September 2023 HCU and the back-to-back October and November core updates. Six of twenty-three detected inflections fall in that four-month window. The World Pursuit (October 2023), MakeUseOf (October 2023), AddictiveTips (November 2023), Crochetkim (November 2023), Allfreesewing (December 2023), and The Blonde Abroad (December 2023) all began their declines during this period.
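The window assignment can be sketched as a simple date lookup. The window boundaries below are my reading of the month ranges listed above, the labels are shorthand, and only the six inflection months named in this paragraph are included.

```python
from datetime import date

# (label, first day, last day) for each named window. Boundaries are assumptions
# taken from the month ranges in the list above.
WINDOWS = [
    ("HCU first launch period", date(2022, 8, 1), date(2022, 12, 31)),
    ("HCU Sep 2023 + Oct/Nov core updates", date(2023, 9, 1), date(2023, 12, 31)),
    ("March 2024 core update", date(2024, 1, 1), date(2024, 4, 30)),
    ("AI Overviews US launch", date(2024, 5, 1), date(2024, 8, 31)),
    ("post-AIO drift", date(2024, 10, 1), date.max),
]


def window_for(month):
    """Return the named window containing `month`, if any."""
    for label, start, end in WINDOWS:
        if start <= month <= end:
            return label
    return "outside named windows"


# Inflection months named in the text, represented as the first of the month.
inflections = {
    "The World Pursuit": date(2023, 10, 1),
    "MakeUseOf": date(2023, 10, 1),
    "AddictiveTips": date(2023, 11, 1),
    "Crochetkim": date(2023, 11, 1),
    "Allfreesewing": date(2023, 12, 1),
    "The Blonde Abroad": date(2023, 12, 1),
}

assigned = {site: window_for(month) for site, month in inflections.items()}
```

Months that fall between the named windows (for example September 2024) come back as "outside named windows" rather than being forced into a neighbor.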

The August 2022 HCU, the one with the brand-name reputation, matters less than its reputation suggests. Only two inflections fall in the August–December 2022 window. The original HCU was real but mild compared to what came thirteen months later.

The AI Overviews US launch (May 14, 2024) is barely visible as a discrete trigger: one inflection in the immediate post-launch window. AIO matters intensely for click-through rates on the queries it covers. Ahrefs measured a 34.5 percent drop in CTR for the top-ranking page when an AIO is present (Ahrefs, April 2025; an updated Ahrefs analysis from February 2026 using newer GSC data put the figure at 58 percent at position 1). But it does not show up as a sharp inflection in traffic estimates the way HCU does. The 11 post-AIO inflections from Q4 2024 onward are slow drift, not cliffs.
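For a sense of scale, here is a small worked example of what those CTR figures mean in clicks. The impressions and baseline CTR are hypothetical round numbers, and I am assuming the published drops are relative reductions in CTR, not percentage-point changes.

```python
def clicks_after_aio(impressions, baseline_ctr, relative_drop):
    """Expected clicks once CTR falls by `relative_drop` (0.345 means 34.5% lower).
    ASSUMPTION: the drop is relative to the baseline CTR."""
    return impressions * baseline_ctr * (1 - relative_drop)


IMPRESSIONS = 10_000   # hypothetical monthly impressions at position 1
BASELINE_CTR = 0.28    # hypothetical position-1 CTR without an AI Overview

baseline_clicks = IMPRESSIONS * BASELINE_CTR
clicks_2025_study = clicks_after_aio(IMPRESSIONS, BASELINE_CTR, 0.345)
clicks_2026_study = clicks_after_aio(IMPRESSIONS, BASELINE_CTR, 0.58)
```

Under these made-up inputs the page goes from about 2,800 clicks to about 1,834 on the earlier figure and about 1,176 on the later one, with no change in ranking position.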

So the data says the 2023 HCU was the cliff. AIO is the drift. One nuance on AIO, though: it doesn't behave like a launch-date shock in this data, and it shouldn't be read as one. It behaves like a surface-adoption drag. Query coverage, user behavior, and SERP answer quality all expand over time, so the click suppression compounds gradually rather than landing on May 14, 2024. The named 2022 HCU and the named May 2024 AIO are both real, but neither is the largest single discrete trigger here.

Three caveats worth honest disclosure. First, The World Pursuit lost 99.8 percent of its traffic, but a live check on the site shows the owners stopped publishing in February 2024. The cliff in October 2023 was real algorithm impact, but the subsequent slide into single-digit visit territory was abandonment, not just penalty. Second, Mr. Money Mustache shows the cleanest example of "post-AIO drift turning into a discrete cliff": a gradual climb through early 2025 followed by a sharp drop starting August 2025. The timing is suspicious enough to warrant a second-source check before treating it as pure algorithm impact. Third, Beebom's growth (+168 percent) is the most extreme outlier in the worst-hit niche. The jump from roughly 3M to 10M visits in three months reads more like an Ahrefs model recalibration than a real overnight tripling of traffic.

Removing all three suspect data points doesn't change the bifurcated pattern. The relative-decline order across niches holds, and so does the cliff timing in late 2023.

Here is where the data becomes mechanism-revealing. What got destroyed is content where the query has a summarizable answer. How-to articles ("how to convert a video file", "how to enable dark mode"), tech tips, definitions, factual lookups. The answer fits in three sentences. AI Overviews can deliver it directly in the SERP, and the user doesn't need to click through.

What survived is content where the page itself is the destination. A recipe is the page. You go to Smitten Kitchen because you're going to cook the cookies. AI Overviews can summarize the recipe in the SERP, but the user still needs the full ingredient list and the timing and the picture of the dough and the comments below from people who substituted brown butter. Lifestyle articles are read for the writer's voice and detail; they don't compress without losing the thing being consumed. Personal essays. Long-form analysis. Photographic content. The page is what you came for.

It's worth making this a real variable rather than a vibe. Call it answer-compressibility: a page category is high-compressibility if the user's search intent can be satisfied by a short procedural, definitional, or factual answer without consuming the page as an artifact. How-to, definition, and calculator queries are high-compressibility. Recipe, lifestyle, long-form essay, and photographic content are low-compressibility, because the page itself is the thing you came for, not a wrapper around a three-sentence answer.

That turns the summarizability mechanism (Mechanism A in section four) into a falsifiable prediction: high-compressibility content should lose disproportionately more organic traffic than low-compressibility content over the HCU-and-AIO window. The data, niche by niche, matches it. Recipe (−10%): low-compressibility, survived. Lifestyle (−30%): low-compressibility, survived. Tech how-to (−93%): high-compressibility, destroyed. Hobby and craft (−79%): high-compressibility, destroyed. Travel (mixed, −27%): split, because informational travel queries are high-compressibility and experiential travel content isn't, and the niche shows both extremes. Personal finance (mixed, −53%): split for the same reason. Calculator-type queries are high-compressibility; personal-philosophy essays aren't.
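The prediction can be written down as a check against the niche medians. The compressibility labels below are my own classification from the paragraph above, not something measured, and with six niches this is a consistency check, not a statistical test.

```python
# niche -> (assumed compressibility label, median decline in percent)
NICHES = {
    "Recipe": ("low", -10),
    "Lifestyle": ("low", -30),
    "Travel": ("mixed", -27),
    "Personal finance": ("mixed", -53),
    "Hobby and craft": ("high", -79),
    "Tech how-to": ("high", -93),
}


def medians(label):
    """Median declines for every niche carrying the given label."""
    return [median for (lbl, median) in NICHES.values() if lbl == label]


mildest_high = max(medians("high"))  # least-bad high-compressibility niche
worst_low = min(medians("low"))      # worst low-compressibility niche

# The prediction holds if every high-compressibility niche did worse
# than every low-compressibility niche.
prediction_holds = mildest_high < worst_low
gap_points = worst_low - mildest_high
```

The mixed niches are left out of the comparison on purpose: the claim about them is that they split internally, which a single median cannot confirm or refute.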

The mechanism that makes this prediction work shows up in two places. The algorithm itself, which we'll get to in section four. And Google's own revenue model, which is next.

Why Google's incentive to fix this is gone

The standard narrative around publisher decline contains an implicit hope: at some point, complaints will accumulate, the algorithm will be corrected, traffic will route back to good sites. Google has revised named systems before. The August 2022 HCU was retired as a standalone named system in March 2024 and incorporated into the core ranking system. The mental model assumes Google has an incentive to keep the publisher ecosystem alive because Search needs publisher pages to be the destination users click toward.

That mental model worked when Search was the dominant share of Alphabet's revenue and growth. It doesn't anymore. To see why, the cleanest place to look is Alphabet's quarterly reported revenue mix from primary sources: the earnings releases published alongside the SEC filings.

I pulled the last 12 reported quarters, Q2 2023 through Q1 2026 (the most recent quarter as of this writing, released April 29 2026). Every number below comes from the Alphabet investor-relations CDN where they post each quarter's release as a PDF, cross-checked against the comparative column in the following quarter's release. No analyst estimates, no third-party rollups.

Alphabet revenue mix evolution, Q2 2023 through Q1 2026

Three things stand out.

First, Cloud's share of Alphabet grew by about two-thirds. Google Cloud went from 10.8 percent of consolidated revenue in Q2 2023 to 18.2 percent in Q1 2026. Google Search & other slipped from 57.1 percent to 55.0 percent over the same window. Cloud absorbed roughly seven and a half percentage points of revenue mix in just under three years. The other segments held their relative shares within a couple of points; Cloud is the segment that grew its slice of the pie.

Second, Cloud is the growth-rate engine even though Search is still the larger contributor in absolute dollars. Over the window Cloud added $12.0 billion in quarterly revenue (from $8.03B to $20.03B, +149 percent). Search added $17.8 billion in quarterly revenue (from $42.63B to $60.40B, +42 percent). Search added more absolute dollars off a much larger base, and that nuance should not be buried. But Cloud's YoY growth rate accelerated through the window (+28.9% YoY in Q2 2024 climbing to +63.4% YoY in Q1 2026) while Search's growth rate hovered between +9.8% (trough) and +19.1% (most recent quarter). In the most recent quarter Cloud grew more than three times as fast as Search.
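The arithmetic behind these two points is easy to re-derive from the figures quoted above. The sketch below uses only those quoted numbers (quarterly revenue in billions of dollars, shares and YoY rates in percent). The variable and function names are mine.

```python
def growth(start, end):
    """Return (absolute change, percent change) between two quarterly figures."""
    return end - start, (end / start - 1) * 100


# Quarterly revenue in $B, Q2 2023 -> Q1 2026, as quoted in the text.
cloud_abs, cloud_pct = growth(8.03, 20.03)
search_abs, search_pct = growth(42.63, 60.40)

# Share of consolidated revenue in percent, as quoted in the text.
cloud_share_pts = 18.2 - 10.8      # percentage points gained by Cloud
search_share_pts = 55.0 - 57.1     # percentage points lost by Search
cloud_share_ratio = 18.2 / 10.8    # how many times larger Cloud's share became

# Most recent quarter YoY growth rates, as quoted in the text.
yoy_ratio = 63.4 / 19.1            # Cloud growth rate relative to Search
```

Running it gives roughly $12.0B and +149 percent for Cloud, $17.8B and +42 percent for Search, a 7.4-point share gain for Cloud, and a growth-rate ratio a little above 3.3.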

Third, and this is the part that matters for publishers, Search-line revenue keeps growing while publisher organic clicks decline. The CTR studies say that for queries with an AI Overview, the top-ranking page gets 34.5 percent fewer clicks than equivalent queries without one (Ahrefs again; the February 2026 update puts the figure higher, at 58 percent). The Authoritas study of UK news publishers found per-query CTR loss of almost 50 percent (Press Gazette, 2025). Stack Overflow's daily traffic fell about 12 percent post-ChatGPT (Burtch, Lee & Chen, 2024, Scientific Reports), and new-question volume fell about 75 percent from its 2017 peak based on the public Stack Exchange data dump (Holscher, 2025). The trend lines point the same direction: clicks are leaving the publisher tier.

But Google's Search-line revenue grew through the same period. Q1 2026: +19.1 percent year-over-year. The composition of that growth, what's driving it, is not fully in the press release segment table. Pichai's exact press-release line is that "Search had a strong quarter with AI experiences driving usage, queries at an all-time high, and 19% revenue growth." On the accompanying earnings call leadership extended that, naming AI Overviews and AI Mode specifically as usage drivers, alongside vertical strength in retail and finance as separate revenue contributors. Nobody at Alphabet has quantified how much of the reacceleration is attributable to AI features versus the other drivers. But they have named those features as material, and that is enough for the argument: the product that takes click-through rate from the publisher tier is the same product Google credits with growing Search.

Read for the publisher this is a clean causal arrow. AI Overviews extracts value from publisher content (it retrieves, summarizes, and surfaces what publishers wrote at serving time, while Gemini-class models are also trained on web-scale human text more broadly) without routing the user to the publisher. The user gets the answer, stays on Google's surface, sees Google's ads, and never visits the site that originated the information. Alphabet doesn't report per-query monetization in its segment table, so "per-query monetization rose for Google" is inference from the combination of growing Search revenue and the CTR studies above. The inference is straightforward: same product mechanism, opposite consequences for publisher and platform.

Read for Google this is rational allocation. When Search required publisher pages to be the destination users clicked toward, publisher health was Google's problem. The flywheel only spun if there was good content for the link to point at. Now the destination is the SERP itself, populated with AIO summaries, the Discussions box pulling from Reddit (the $60M-per-year Reddit-Google data licensing deal reported by Reuters in February 2024 was not coincidental), the Knowledge Panel, the People Also Ask, the embedded video clips. Publishers became inputs to a destination Google now owns, rather than destinations Google referred users to. Search didn't degrade for Google. Search got upgraded to extract value at the SERP layer rather than route it through.

This is what changed. Not "Google killed publishers": that frame is too active. The frame is that Google rebuilt the search product so that the publisher tier became an upstream supplier of content rather than the downstream destination of clicks. Suppliers can be replaced, recombined, or removed when their substitutes are good enough. Destinations could not be. The shift from destination to supplier is what makes the publisher position structurally weaker, and it's what makes the algorithm changes self-reinforcing rather than self-correcting.

The standard response to this argument is "Search is still the bigger absolute business, look at the dollars". True. The data does not say Search is collapsing or dying. The data says Search is healthy for Google. It also says Search is growing slower than Cloud, that Cloud's growth rate is roughly triple Search's, and (importantly for the publisher question) that Search is healthy despite sending fewer clicks to its underlying content tier. Health for Google and health for publishers are no longer the same thing. They were aligned when the flywheel needed publisher pages to spin. They are decoupled now that the flywheel spins inside the SERP.

One useful asymmetry to notice. Cloud's reported backlog "nearly doubled quarter on quarter to over $460 billion" per Alphabet's Q1 2026 earnings release. Annualizing the Q1 2026 Cloud revenue ($20.03B × 4 = ~$80B), $460B of backlog is roughly 5.7x that figure (though "backlog" here is contracted future revenue, RPO, not the same as a committed-pipeline number you'd see at a startup). The reported mix can't prove internal priority, but it makes the posture legible: Search remains the cash engine, Cloud is the forward-growth narrative. The filings prove the revenue asymmetry, not the boardroom psychology. What the asymmetry implies is modest but real: the strategic attention inside Alphabet (capex allocation, hiring priority, executive promotions) tends to follow growth rate and backlog, and on those metrics Search is no longer where the story is.

This explains, mechanically, why no algorithm rollback is coming. The publisher-tier degradation isn't an unfortunate side effect of an updated ranking system Google would prefer not to ship. It's the predictable outcome of moving the destination layer onto Google's surface, which is exactly what AIO and AI Mode are. To roll the algorithm back to the previous equilibrium would require unshipping the product line that explains Search's revenue reacceleration. That doesn't happen.

Why the algorithm has the bias it has, by construction

Section 3 showed Google's incentive structure has shifted: Search can grow while publisher referrals decline because per-query monetization moved onto the SERP. This section shows the algorithmic mechanism that makes the shift permanent.

There are two mechanisms operating in parallel, and the data in section 2 separated them for us without our knowing it.

Mechanism A is content summarizability. The how-to queries die because the answer is the entire content, three sentences and the user is done. The recipe queries survive because the recipe is what you do with the answer, not the answer itself. Mechanism A explains which content TYPES survive at all.

Mechanism A operates through two surfaces with different timing. The 2023 cliffs in the data above predate AI Overviews by eight months. The September 2023 HCU and the back-to-back October/November core updates directly demoted content the algorithm classified as unhelpful. Empirically, the content the helpful-content classifier downranked was the same kind of content that's summarizable in three sentences: templated how-tos, thin reference posts, evergreen one-paragraph definitions. Then in May 2024 AI Overviews launched and started suppressing clicks on those same query types from a different surface. Instead of demoting the result, it surfaces the answer in the SERP so users don't click through. Same content category targeted, two different mechanisms, two different timing windows. The 2024-2025 drift in the data is the AIO half; the 2023 cliffs are the HCU half.

Mechanism B is site-level proxy signals. Google's ranking system computes site-wide trust scores from observable correlates of operational infrastructure: backlinks, Chrome clickstream volume, brand-search velocity, sustained mention density in news corpora, schema markup completeness. These signals scale with enterprise content operations, not with first-hand experience. Mechanism B explains who WINS among the content types that survived.

Together they answer the puzzle that single-mechanism explanations cannot. A signal-catalog-only explanation cannot account for Smitten Kitchen (recipe, single-voice independent, −3.5%) surviving alongside How-To Geek (tech how-to, larger operational footprint, −93%). A summarizability-only explanation cannot account for Apartment Therapy (lifestyle, larger brand operation, −39%) outperforming Frugalwoods (personal finance, single-author, −60%) within the niches that didn't get categorically destroyed. Mechanism A determines which content TYPES the algorithm and AIO together remove from the click economy; Mechanism B determines which sites win among the survivors. The bifurcated data is what you'd expect if both mechanisms are operating.

Mechanism A is observable from outside Google. AI Overviews are public; the queries they appear on are public; the citations they include are public; the click-through rates have been measured by Ahrefs, Authoritas, and others. No insider information needed.

Mechanism B is exactly where the Google leak matters.

The leaked signals

In March through May 2024, Erfan Azimi released 14,014 attributes from Google's internal Content Warehouse API documentation. Rand Fishkin at SparkToro and Mike King at iPullRank published their analyses on the same day, May 27, 2024. Google confirmed authenticity through a statement to The Verge (also reported by Search Engine Land) that didn't deny the schema but qualified that some attributes might be "out-of-context, outdated, or incomplete."

The subset of attributes most relevant to site-level quality and trust, the layer that appears to shape ranking before any page-specific relevance work, is small and consistent. The leak shows attribute names and types in API documentation, not active weights or current deployment status, so treat the catalog below as the framework the system is built on, not as proof of any specific current weighting. Mike King's analysis is the most rigorous walk-through. Here are the load-bearing signals.

Figure 4. Mechanism architecture: site-level signals into ranking

Six site-level signals sit in the leaked schema next to the Qstar layer. The first-hand-experience attributes don't. Every field in the first group is a real protobuf field from the Content Warehouse leak. The second group is a ghost row: attributes that E-E-A-T messaging implies should exist, and which simply do not appear in the leaked surface.

Present in the leaked schema, next to the Qstar layer (6 attributes):

Attribute	What it represents	Type
siteAuthority	Site-wide authority score	score
chromeInTotal	Chrome clickstream volume	signal
hostNsr	Host-level normalized site rank	score
siteFocusScore	Topical focus	score
siteRadius	Embedding distance	vector
smallPersonalSite	Hobby-site flag	flag (disputed)

Not in the leaked schema (2+ expected attributes, no matching field in the leaked surface):

Attribute	What it would represent
firstHandExperience	Would mark an "I was actually there" attribution.
authorWasActuallyThere	Would model the "Experience" claim from E-E-A-T.

Caveat: the leak is incomplete by Google's own admission. The absence of these literal field names from the leaked surface is not proof that Google lacks any experience signal.

Sources: Content Warehouse API leak (March 2024), iPullRank, SparkToro.

Attribute names sampled from GoogleApi.ContentWarehouse.V1.Model.* protobuf definitions exposed in the March 2024 leak. The six named above are present in the leaked schema as attribute definitions, not confirmed live ranking weights. The ghost-row names are synthesized; they describe what an E-E-A-T-faithful schema would need and do not appear in the leaked surface.

The Status column is deliberately careful: a field's presence in the leaked schema confirms that Google represents the concept internally, not that the field carries a known weight in live ranking. Mike King makes the same caveat: the docs show attributes, not scoring functions or pipeline wiring.

Signal	What it appears to measure	Why it biases against independents	Status
`siteAuthority`	Site-wide authority score	A site-wide multiplier, no individual post can outrun a low site score	In schema; publicly denied by Google. Confirms a domain-authority representation exists. Live weight unknown.
`chromeInTotal`	Site-level Chrome clickstream volume	Independent sites without distribution have near-zero baseline; signal compounds with audience, not content quality	In schema; sits against John Mueller's public position that the only Chrome data used for ranking is CrUX page-experience aggregates. Live weight unknown.
`hostNsr`	Host-level normalized site rank	A site whose chunks read as a single hobbyist voice gets one score for the whole host	In schema; usage/weight unknown
`siteFocusScore`	How topically focused the site is	Personal sites mixing travel + code + essays read as unfocused by construction	In schema; usage/weight unknown
`siteRadius`	How far page embeddings deviate from the site embedding	A first-hand essay outside the site's topical centroid would be structurally penalized regardless of quality	In schema; usage/weight unknown
`smallPersonalSite`	Flag for small personal sites	Marks "this is a hobby site" as an internal category that ranking code can read	In schema; direction (boost vs demote) disputed

Two omissions worth flagging, with the standing caveat that the leak is incomplete by Google's own admission, so absence of a literal field name doesn't prove absence of a corresponding mechanism. No firstHandExperience attribute appears in the leaked surface. No authorWasActuallyThere attribute either. The absence doesn't prove Google lacks an experience signal: experience could be inferred through quality-rater pipelines, embeddings, or entity systems that the leak doesn't expose. It proves something narrower. The public concept of "first-hand experience" is not represented in the leaked schema with anything like the explicitness of the operational proxies that sit right next to it.

Figure 5. Public guidance vs. leaked schema

What Google names is not what Google measures. Public guidance talks about qualities of content. The leaked Content Warehouse schema queries properties of sites. Different unit, different vocabulary, and almost no overlap by name.

What Google says: the four principles of helpful content (Search Quality Rater Guidelines, public)

Principle	Meaning
Experience	First-hand engagement with the subject.
Expertise	Deep, demonstrable knowledge of the topic.
Authoritativeness	Recognised authority in the field.
Trustworthiness	Reliable, accurate, transparent.

What the schema queries: queryable site-level features (Content Warehouse API leak, March 2024)

Feature	Type	Description
siteAuthority	score	Site-wide authority score, not topical.
chromeInTotal	signal	Aggregated Chrome clickstream volume.
hostNsr	score	Host-level normalized site rank.
siteFocusScore	score	How topically focused a site is.
siteRadius	vector	Embedding distance from the site's centroid.
smallPersonalSite	flag	Hobby-site classifier (disputed).

Sources: Search Quality Rater Guidelines; Search Liaison; Google Content Warehouse API leak, March 2024.

Feature names sampled from GoogleApi.ContentWarehouse.V1.Model.* protobuf definitions exposed in the March 2024 leak. These names are present in the leaked schema, but the leak shows attribute definitions, not confirmed live ranking weights.

Why the schema looks this way

Google cannot verify factual quality at web scale. No service inside Search calls an LLM to ask whether the author of "What it's like to spend a month in Pattaya" actually spent a month in Pattaya. Even at current inference prices that check is uneconomic across the indexed web, and it would still be wrong half the time because the model has no ground truth either. The verification problem is structural, not a tooling gap that better models close.

So Google does what any large-scale ranker does when the thing it wants to measure is unmeasurable: it picks proxies. Observable correlates of the thing. Signals that move together with quality, on average, across a large corpus, even when any single instance is noisy. This is not a scandal. It is the only design that exists at this scale.

The question is which proxies. And here the leak is precise where Google's public messaging is vague. The proxies that drive site-level ranking are operational signatures: siteAuthority derived from Qstar, chromeInTotal derived from browser clickstream volume, hostNsr derived from sitechunk aggregates, siteFocusScore and siteRadius derived from embedding distances. Layered on top of that is the well-documented entity infrastructure: Knowledge Graph linkage, branded-search volume for navigational queries, schema markup, sustained mention velocity in news corpora. The full stack's required inputs all scale with operational infrastructure, not with whether the author was in the room.
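To show what "derived from embedding distances" could look like, here is a toy illustration. It is not Google's implementation: the leak exposes attribute names, not formulas. The functions, the three-dimensional vectors, and the choice of cosine distance are all assumptions made for the example.

```python
import math


def centroid(vectors):
    """Component-wise mean of a list of equal-length vectors."""
    n = len(vectors)
    return [sum(column) / n for column in zip(*vectors)]


def cosine_distance(a, b):
    """1 minus cosine similarity; 0 means same direction."""
    dot = sum(x * y for x, y in zip(a, b))
    norms = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return 1.0 - dot / norms


def page_radii(pages):
    """Hypothetical radius-like value: each page's distance from the site centroid."""
    center = centroid(list(pages.values()))
    return {name: cosine_distance(vec, center) for name, vec in pages.items()}


def focus_score(pages):
    """Hypothetical focus-like value: 1 minus the mean page radius."""
    radii = page_radii(pages)
    return 1.0 - sum(radii.values()) / len(radii)


# Toy embeddings (made up): a tightly clustered topical site vs a personal site
# whose pages point in unrelated directions.
focused_site = {"guide-1": [1.0, 0.1, 0.0], "guide-2": [0.9, 0.0, 0.1], "guide-3": [1.0, 0.0, 0.0]}
personal_site = {"travel-essay": [1.0, 0.0, 0.0], "code-notes": [0.0, 1.0, 0.0], "old-talk": [0.0, 0.0, 1.0]}

focused = focus_score(focused_site)
personal = focus_score(personal_site)
```

Under any scheme of this general shape, a site that writes across unrelated topics scores as less focused than a tightly clustered one, whatever the quality of the individual pages.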

Read the bullish way: these proxies correlate with trustworthy publishers because trustworthy publishers tend to build operational infrastructure. A site that has run for a decade, gets cited in news, holds Knowledge Graph linkage, and shows steady branded search has, on average, earned that footprint by being reliable. The proxy is doing its job.

Read the bearish way: the proxies don't measure trust. They measure the operational signature of an entity that has the staff, time, and budget to look like a brand. A small consultancy with a PR retainer, a domain registered as a business, consistent NAP across directories, a paid Knowledge Graph push, and a content team producing tightly-clustered topical material gets the same site-level score as a publication that actually fact-checks. Both readings are correct, and which one matters depends on whether you're asking about average quality or marginal quality.

The marginal case is where the contradiction lives. A solo author with first-hand domain experience (a working trader writing about a market they live in, a developer writing about a tool they built, a parent writing about a school system their kids attend) produces, by definition, the kind of content E-E-A-T was supposed to reward. The same author cannot economically generate the operational signature the ranker actually consumes. They have no PR budget to seed news mentions. Their NAP is one name across one personal domain. Branded search for their name returns near-zero navigational queries because nobody has heard of them yet. siteFocusScore reads as low because they write across the topics their life actually intersects. chromeInTotal is small because they have no distribution. hostNsr averages low because the sitechunks span first-hand essays, side-project documentation, and an old talk. Every one of these signals codes "small personal site" as "weak site" through the same proxy mechanism.

The smallPersonalSite flag itself is a separate question, and worth not over-reading. Its existence proves that "small personal site" exists as an internal representational category that ranking code can key on. It does not prove a live demotion path: the leak doesn't tell us whether that category boosts or demotes, and some published readings interpret it as a small-site promotion signal rather than a penalty.

Building the operational signature isn't a content problem. It's an enterprise-procurement problem. Knowledge Graph entries, sustained PR coverage, schema TravelAgency or schema NewsArticle, branded navigational query volume: these are line items in a marketing budget, not byproducts of writing well. The author who knows the most about Pattaya cannot rank against the affiliate site that has never been there, because the affiliate site has the procurement budget and the author doesn't.

The two mechanisms together

Now go back to the bifurcated data with both mechanisms in hand.

Recipe sites survived (Mechanism A) AND tend to have strong operational signatures (Mechanism B). Smitten Kitchen has sixteen years of compounding distribution, sustained mention velocity in food media, recurring direct traffic, Pinterest distribution. The combination of "AIO can't summarize the cooking experience" and "Smitten Kitchen has the infrastructure to win among survivors" yields the −3.5 percent decline we observed.

How-To Geek lost to Mechanism A first: the queries it served are exactly the queries the September 2023 HCU downranked and AIO later started answering in three sentences. Its operational signature then wasn't enough to win the much smaller market that survived. Hence the −93 percent.

Mr. Money Mustache survived Mechanism A for years, since personal-finance philosophy isn't summarizable, but lost slowly to Mechanism B (one author, narrowing focus, no enterprise content operation). His decay shows up as gradual drift through 2024 and early 2025 followed by a discrete cliff in August 2025, the shape you'd expect from a core update reweighting site-level signals landing asymmetrically on lone-author sites, with the caveat from Section 2 that the August 2025 timing is also suspicious enough to warrant a second-source check before treating it as pure algorithm impact.

The World Pursuit lost to both mechanisms in October 2023 and stopped publishing in February 2024, which exposes a third dynamic: when algorithmic losses push a small site below its economic breakeven, the site stops producing content, which then makes the algorithm's verdict self-confirming. Whatever the smallPersonalSite flag does in production, the category it names was, in this case, a leading indicator of a site that would stop existing.

This is the architecture that makes the publisher position structurally weaker than it was. Mechanism A removed an entire content category from the click economy. Mechanism B ensured that the remaining content category went to operational entities rather than individual experts. Section 3 showed Google has no financial incentive to reverse either mechanism. The next section shows what's forming in the space left over.

The licensing market forming in the rubble

Sections 2 through 4 are about what's broken. This section is about what's being built.

The thesis: while the publisher ecosystem decays on the ad-supported click economy, a new market is forming on the AI-content-licensing model. The components are visible. They have not been assembled into one story.

In December 2023 Axel Springer signed a multi-year deal with OpenAI covering POLITICO, Business Insider, BILD, and WELT; secondary reporting from Axios put the terms at three years and tens of millions of euros (the official announcement didn't disclose figures). The same month NYT filed its lawsuit against OpenAI and Microsoft seeking billions in damages and the destruction of training datasets containing NYT content. Two months later, in February 2024, Reuters reported that Reddit had signed an AI-training data licensing arrangement with Google for approximately $60 million per year. In May 2024 News Corp signed its own deal with OpenAI, valued by WSJ at more than $250 million over five years, covering WSJ, Barron's, MarketWatch, NY Post, The Times (UK), The Sun, and The Australian. Two months after that, on July 30, 2024, Perplexity launched its Publishers Program with six founding partners (TIME, Der Spiegel, Fortune, Entrepreneur, Texas Tribune, WordPress.com), structured as a revenue share whenever Perplexity earns money from interactions referencing partner content.

Then on July 1, 2025, Cloudflare made the most structural move of any of them. They flipped the default on new domains to block AI crawlers, and launched a pay-per-crawl market priced via HTTP 402. A crawler hits a publisher's site, receives 402 Payment Required, retries with a crawler-exact-price header acknowledging the publisher's stated price, and the request is authenticated cryptographically via RFC 9421. As Matthew Prince put it in the launch post, "Cloudflare, along with a majority of the world's leading publishers and AI companies, is changing the default to block AI crawlers unless they pay creators for their content."
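The handshake described above can be sketched as an in-memory simulation. This is not Cloudflare's implementation: it makes no network calls and skips the RFC 9421 signature check entirely. The crawler-exact-price header comes from the description above; crawler-price and crawler-charged are header names as I recall them from the launch post, so treat them, the price string format, and every function name here as assumptions.

```python
from dataclasses import dataclass, field

QUOTED_PRICE = "USD 0.01"  # hypothetical per-request price set by the publisher


@dataclass
class Response:
    status: int
    headers: dict = field(default_factory=dict)
    body: str = ""


def publisher_edge(request_headers: dict) -> Response:
    """Toy edge: quote a price, serve only when the crawler echoes it exactly."""
    if request_headers.get("crawler-exact-price") == QUOTED_PRICE:
        # Payment intent matches the quote: serve and report the charge.
        return Response(200, {"crawler-charged": QUOTED_PRICE}, "<article>...</article>")
    # No intent, or a mismatched one: 402 Payment Required plus the quote.
    return Response(402, {"crawler-price": QUOTED_PRICE})


def crawl(max_price_usd: float) -> Response:
    """Crawler side: probe, read the quote, retry only if it fits the budget."""
    first = publisher_edge({})
    if first.status != 402:
        return first
    currency, amount = first.headers["crawler-price"].split()
    if currency != "USD" or float(amount) > max_price_usd:
        return first  # too expensive: walk away with the 402
    return publisher_edge({"crawler-exact-price": first.headers["crawler-price"]})


print(crawl(max_price_usd=0.05).status)   # budget covers the quote
print(crawl(max_price_usd=0.001).status)  # budget does not
```

The structural point is in the second call: a crawler that will not pay gets nothing, which is what removes the substitute-by-scraping option for publishers behind the network.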

[Figure: Licensing market timeline, 2023–2026, with structural mechanisms emerging in three tiers]

Stand back from the individual deals and the shape is clear. Three structural mechanisms are forming simultaneously:

Enterprise licensing at the top of the market: News Corp, Axel Springer, and AP-class deals where AI companies pay seven- and eight-figure annual sums for blanket training and serving rights.
Revenue share in the middle: Perplexity's program, where publishers receive a cut of revenue from interactions citing their content.
Infrastructure metering at the long tail: Cloudflare's HTTP 402 layer, turning every individual crawler request into a priced transaction.

These are not competing models. They are three pricing tiers for the same underlying market: human-written signal sold to LLM providers as training data and as RAG retrieval source. The enterprise tier serves the publishers who can negotiate. The revenue-share tier serves the middle. The metering tier handles everyone else.

Why demand for human signal persists, and why licensing may still fail the median publisher

The case for inevitability rests on a single piece of empirical research that engineering readers should know specifically. In May 2023 Shumailov, Shumaylov, Zhao, Gal, Papernot and Anderson posted The Curse of Recursion: Training on Generated Data Makes Models Forget to arXiv. In July 2024 the same team published the peer-reviewed extension in Nature under the title AI models collapse when trained on recursively generated data. The mechanism they describe, model collapse, is that successive generations of model trained on data produced by previous generations of model lose statistical fidelity. The tails of the distribution vanish. Outputs converge toward the mode of an increasingly impoverished generative distribution.
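The tail-loss mechanism is easy to see in a toy setting. The sketch below is not the paper's experimental setup: it repeatedly fits a single Gaussian to a small sample drawn from the previous generation's fit, which is the simplest case where the effect shows up. The sample size, generation count, and seed are arbitrary choices of mine.

```python
import random
import statistics


def collapse_demo(generations=300, sample_size=20, seed=0):
    """Refit a Gaussian to its own samples, generation after generation.

    Returns the fitted standard deviation at each generation, starting
    from the "human" distribution N(0, 1).
    """
    rng = random.Random(seed)
    mu, sigma = 0.0, 1.0
    history = [sigma]
    for _ in range(generations):
        # Each generation sees only data produced by the previous model.
        samples = [rng.gauss(mu, sigma) for _ in range(sample_size)]
        mu = statistics.fmean(samples)
        sigma = statistics.pstdev(samples)  # maximum-likelihood estimate
        history.append(sigma)
    return history


history = collapse_demo()
for gen in (0, 50, 100, 200, 300):
    print(f"generation {gen:3d}: sigma = {history[gen]:.3g}")
```

Each refit slightly underestimates the spread on average, and the errors compound, so the fitted distribution narrows toward a point. Mixing fresh draws from the original distribution into every generation is what stops the shrinkage, which is the toy version of the demand for human signal.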

The Shumailov result establishes that LLM providers face permanent demand for high-quality human or human-curated signal, because purely synthetic training loops degrade. It does not prove the demand can only be met through licensing; curated mixtures of synthetic and human data, distillation from larger models, and filtered web crawls all reduce the collapse problem in practice. What it does mean is that ongoing access to fresh human signal is a structural input rather than an optional one. Combine that with Cloudflare-style metering pricing previously-free crawler access, the open web filling with AI-generated content that fails the freshness test, and the legal exposure the NYT lawsuit is establishing on the unilateral-scraping side, and the cost of meeting that demand through unmonetized scraping starts rising relative to the cost of licensing it.

So a market is forming. The question has never been "will publishers get paid for content licensing?" The question is what shape the payments take and at what scale.

Engaging the strongest counter

The most rigorous published counter to the licensing-market thesis is Nieman Lab's December 2025 piece arguing there is no meaningful licensing revenue for most publishers. The argument is both empirical (the 2025 numbers are small) and structural (AI firms have stronger bargaining power than publishers and may never need to pay most of them, even if a market exists). The empirical side is correct as of late 2025; license revenue for the median publisher is small, and Cloudflare's pay-per-crawl was six months old when Nieman published.

The structural side is the load-bearing one. Nieman's reading is that bargaining power favors AI providers permanently: they can substitute across publishers, train on what's already scraped, operate with thinner data over time, and pay only the few largest publishers who can credibly withhold access. My reading is that bargaining power is exactly the relevant question, and infrastructure like Cloudflare's per-domain metering changes it by removing the substitute-by-scraping option for any publisher behind the network. Whether that shift in the bargaining environment translates to revenue flowing past the top tier is the actual open question, and it's empirical for the 2026–2028 window. The relevant analogy for the "is it forming?" half of the question is infrastructure-market scale-up: Stripe took several years from founding to meaningful payments volume, and Visa took two decades. A six-month-old pay-per-crawl market not having scaled by late 2025 is consistent with new infrastructure markets, not evidence the structure won't form.

The substantive risk Nieman flags is real but different from the existential one: the licensing market may form in a shape that concentrates revenue at the top tier without flowing down to mid-tier or independent publishers. That's a distribution problem, not an existence problem. Cloudflare's per-domain pricing tries to address it on the infrastructure side. Revenue-share programs like Perplexity's try to address it in the middle. Whether these mechanisms scale enough to meaningfully compensate independent publishers is the actual open question, and it is one the 2026–2028 window will answer empirically.

What can be said with the data we have: the destination layer Google built on the SERP is now structurally extracting value from publisher content without compensation, and the AI-licensing infrastructure being built in parallel is the first mechanism that monetizes content back to the source even partially. Publishers who survive 2026–2028 will be the ones who end up on the supply side of this market with the operational capacity to negotiate or invoice. Publishers who don't survive will exit through the same door (Mechanism A and Mechanism B from Section 4) that closed off the click economy.

What would falsify this

A triangulation argument earns trust by saying in advance what would break it. Here is what I'd accept as disconfirming, stated as tests anyone with query-level GSC access could run.

If answer-compressibility is really the mechanism behind Mechanism A, then within a single domain, high-compressibility URLs should lose more organic traffic than low-compressibility URLs across the HCU-and-AIO window. If they decline at the same rate, the niche-level pattern I'm reading as compressibility is actually something else (domain-level penalty, say).
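A minimal sketch of that within-domain comparison, assuming you have exported per-URL clicks for a before and an after window and hand-labeled each URL's compressibility class. The URLs and click counts below are made up for illustration; none of them are measurements.

```python
from statistics import fmean


def decline(before, after):
    """Fractional change in clicks; -0.5 means traffic halved."""
    return (after - before) / before


def decline_by_class(rows):
    """rows: (url, compressibility_class, clicks_before, clicks_after)."""
    by_class = {}
    for _url, cls, before, after in rows:
        by_class.setdefault(cls, []).append(decline(before, after))
    return {cls: fmean(values) for cls, values in by_class.items()}


# Made-up numbers for one hypothetical domain.
rows = [
    ("/how-to-reset-router", "high", 1200, 150),
    ("/what-is-dns", "high", 900, 200),
    ("/my-homelab-diary", "low", 400, 360),
    ("/year-in-review", "low", 300, 310),
]

result = decline_by_class(rows)
gap = result["low"] - result["high"]
print(result, f"gap = {gap:.2f}")
```

A gap near zero on real data would be the disconfirming result. Averaging per-URL declines rather than summing clicks keeps one large URL from dominating its class, though either choice is defensible.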

If AIO is drift rather than a launch-date cliff, then query-level GSC data should show gradual CTR erosion that tracks AIO coverage expansion, not a discontinuity concentrated on May 2024. A sharp May 2024 step with flat CTR on either side would falsify the adoption-drag reading.
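One crude way to tell the two shapes apart is to fit both to the same monthly CTR series and see which leaves less residual error. The sketch below compares a two-level step at a known break month against a straight line; both models have two free parameters, so comparing raw squared error is roughly fair. The two series are made up to show what each outcome looks like, and a real analysis would need noise handling this omits.

```python
from statistics import fmean


def sse_step(series, break_idx):
    """Squared error of a two-level step with the break at break_idx."""
    left, right = series[:break_idx], series[break_idx:]
    mean_left, mean_right = fmean(left), fmean(right)
    return (sum((y - mean_left) ** 2 for y in left)
            + sum((y - mean_right) ** 2 for y in right))


def sse_linear(series):
    """Squared error of an ordinary least-squares straight line."""
    xs = list(range(len(series)))
    x_bar, y_bar = fmean(xs), fmean(series)
    sxx = sum((x - x_bar) ** 2 for x in xs)
    slope = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, series)) / sxx
    intercept = y_bar - slope * x_bar
    return sum((y - (intercept + slope * x)) ** 2 for x, y in zip(xs, series))


# Made-up monthly CTR series; index 4 stands in for the launch month.
drift = [0.30, 0.29, 0.28, 0.27, 0.26, 0.25, 0.24, 0.23]
cliff = [0.30, 0.30, 0.30, 0.30, 0.20, 0.20, 0.20, 0.20]

for name, series in (("drift", drift), ("cliff", cliff)):
    better = "linear" if sse_linear(series) < sse_step(series, 4) else "step"
    print(f"{name}: better fit = {better}")
```

If real query-level data consistently favors the step at May 2024, the adoption-drag reading is the one that loses.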

If operational proxies (Mechanism B) really decide outcomes among survivors, then low-compressibility content on high-brand domains should outperform the same kind of content on low-brand single-author domains, holding query intent constant. If a single-author site with no operational footprint ranks just as well as a branded operation for the same low-compressibility queries, Mechanism B isn't doing the work I claim.

And the load-bearing economic claim: if what died is the referral subsidy rather than demand for content, then AI-licensing deal flow should keep expanding down-market while click-referral traffic keeps falling. If click-referral to mid-tier publishers recovers materially in 2026-2027 without Google reversing AIO or AI Mode, the "destination moved onto the SERP" thesis is wrong and I'll have to retract it.

What this means structurally

Three concurrent shifts are documented in the data above. They are not the same shift. They reinforce each other but they have different mechanisms and different time horizons.

Algorithmically (Section 4): the site-level proxy signals that drive ranking systematically advantage operational infrastructure over first-hand content. The smallPersonalSite flag exists in the leaked schema (direction disputed). No firstHandExperience attribute appears in the leaked surface, though the leak is incomplete by Google's own admission. By construction, single-author and small-operation publishers compete against operational entities on the metrics the algorithm appears to measure.

Economically (Section 3): Google's Search-line revenue grew through the same window that publisher click-through rates collapsed. Search revenue rose while clicks routed to publishers fell, implying per-query monetization rose at Google's end. The financial incentive Google had to maintain the publisher ecosystem (when publisher pages were the destination users clicked toward) disappeared when AIO and AI Mode moved the destination onto the SERP. Search is healthy for Google. Publisher dependence on Search is the inverse.

Infrastructurally (Section 5): the AI-content-licensing market is forming in three tiers: enterprise deals, revenue share, infrastructure metering. The Shumailov et al. result establishes that LLM providers have a permanent demand for fresh human-written signal. The 2026–2028 window will determine the shape and scale of the market that emerges, with the distribution problem (does revenue flow past the top tier) as the substantive open question.

What's permanently lost is the 2013-era equilibrium where an anonymous domain could publish evergreen reference content, accumulate links over time, monetize via AdSense, and operate as a stable business at modest scale. AI Overviews removed the queries the content category served. Site-level proxy signals removed the ranking advantage anonymous independent sites might have had against operational competitors. The financial incentive that would have led Google to protect that equilibrium has been rebuilt around a destination layer Google now owns. None of these reverse.

What's still functional is content where engagement cannot be summarized, where the page is what you came for. Recipes you cook. Lifestyle you read. Long-form analysis. Photographic content. Personal philosophy. The categories that survived in the Section 2 data are the categories whose underlying queries are not three-sentence-summarizable, and whose readers want the destination and not just the answer.

What's emerging is content licensing to AI providers as a primary revenue substrate, with the per-page web visit becoming residual. The publishers who navigate the 2026–2028 window are the publishers who end up on the supply side of this market with the operational capacity to invoice. The publishers who don't are the publishers who exit through Mechanism A and Mechanism B.

The blog as a product survives only where engagement cannot be summarized. The blog as a business survives only where it has the operational infrastructure of an enterprise (or the licensing relationship of one). The middle is empty, and it is not coming back.

Appendix: the corpus

Selection method, stated plainly so you can judge the cherry-picking risk yourself. I started from a list of well-known mid-tier publisher sites across six niches that were active and ranking in the 2018-2022 window, selected for niche diversity, and excluded large media conglomerates. A few were swapped during data collection when Ahrefs history was too sparse to use (thesimpledollar.com, sold and redirected, was replaced by wisebread.com). The list was not pre-registered before I saw outcomes, so treat it as a constructed convenience sample, not a random draw. The compressibility class is assigned by content type, not by outcome, which is the point: it's a prediction made independently of the decline number sitting next to it.

Domain	Niche	Compressibility	Peak → Current	Decline	Inflection	Note
nomadicmatt.com	travel	mixed	280K → 109K	−61%	2024-04
theblondeabroad.com	travel	mixed	57K → 61K	+7%	2023-12
expertvagabond.com	travel	mixed	159K → 3.6K	−98%	2022-10
neverendingfootsteps.com	travel	low	63K → 181K	+188%	none	gradual growth
adventurouskate.com	travel	mixed	86K → 148K	+72%	2024-10
theworldpursuit.com	travel	mixed	171K → 0.3K	−100%	2023-10	abandoned Feb 2024
smittenkitchen.com	recipe	low	221K → 213K	−4%	2025-03
pinchofyum.com	recipe	low	2.5M → 2.0M	−21%	2024-10
minimalistbaker.com	recipe	low	2.4M → 931K	−61%	2025-01	thin AI-assisted variations suspected
halfbakedharvest.com	recipe	low	822K → 977K	+19%	2025-08
sallysbakingaddiction.com	recipe	low	11.6M → 10.4M	−10%	2025-03
howtogeek.com	tech-howto	high	4.8M → 330K	−93%	2022-11	Valnet-owned
makeuseof.com	tech-howto	high	6.7M → 161K	−98%	2023-10	Valnet-owned
instructables.com	tech-howto	high	2.5M → 3.0M	+23%	none	Autodesk-owned
beebom.com	tech-howto	high	2.3M → 6.2M	+168%	none	suspect; likely Ahrefs recalibration
addictivetips.com	tech-howto	high	138K → 7K	−95%	2023-11	high recent volatility
allfreesewing.com	hobby-craft	high	22K → 3.4K	−84%	2023-12	Prime Publishing
crochetkim.com	hobby-craft	high	13K → 3.4K	−74%	2023-11
mrmoneymustache.com	personal-finance	mixed	88K → 41K	−53%	2025-08	late cliff, second-source check advised
wisebread.com	personal-finance	mixed	23K → 9K	−61%	2024-06
getrichslowly.org	personal-finance	mixed	8.7K → 8.6K	−1%	2023-01
frugalwoods.com	personal-finance	mixed	2.8K → 1.3K	−54%	2025-06	low baseline, noisy
budgetsaresexy.com	personal-finance	mixed	3.1K → 4.4K	+45%	2024-10	low baseline, noisy
cupofjo.com	lifestyle	low	56K → 39K	−31%	none
apartmenttherapy.com	lifestyle	low	1.6M → 959K	−39%	2024-09	larger brand operation
designmom.com	lifestyle	low	7.0K → 6.3K	−10%	2025-11
alphamom.com	lifestyle	low	16.8K → 12.1K	−28%	2023-01

Data: Ahrefs Site Explorer, monthly worldwide organic-traffic estimates, pulled 2026-05-23. Peak = highest 12-month trailing average before August 2024. Current = mean of February through April 2026. "Compressibility" is the answer-compressibility class from Section 4, assigned by content type. The one tension worth naming: hobby-craft sites are marked high-compressibility (free patterns and how-to-stitch content are largely procedural) and they did collapse, but they're also the smallest sites in the corpus, so their decline is overdetermined: both mechanisms point the same way and the data can't cleanly separate them at this scale.
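For anyone reproducing the table from their own export, the Peak, Current, and Decline definitions above can be sketched as follows. Two things are assumptions on my part: that "before August 2024" means the 12-month window must end before that month, and that the monthly series has no gaps. The series below is synthetic and does not correspond to any site in the corpus.

```python
from statistics import fmean


def peak_and_current(monthly, peak_cutoff="2024-08",
                     current_months=("2026-02", "2026-03", "2026-04"),
                     window=12):
    """monthly: dict mapping 'YYYY-MM' to estimated organic visits."""
    months = sorted(monthly)  # 'YYYY-MM' strings sort chronologically
    trailing = []
    for i in range(window - 1, len(months)):
        if months[i] >= peak_cutoff:
            break  # only windows ending before the cutoff count toward Peak
        trailing.append(fmean(monthly[m] for m in months[i - window + 1:i + 1]))
    peak = max(trailing)
    current = fmean(monthly[m] for m in current_months)
    return peak, current, (current - peak) / peak


# Synthetic series, 2022-01 through 2026-04: flat, then a drop in 2024-01.
months = [f"{y}-{m:02d}" for y in range(2022, 2027) for m in range(1, 13)][:52]
series = {m: (1000 if m < "2024-01" else 400) for m in months}

peak, current, change = peak_and_current(series)
print(f"{peak:.0f} -> {current:.0f} ({change:+.0%})")
```

Using a trailing average for Peak rather than the single best month keeps one viral spike from inflating the baseline and therefore the decline.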


[Figure: The leaked signals. Six site-level signals sit in the leaked schema next to the Qstar layer. The first-hand-experience attributes don't. What Google names (Experience, Expertise, Authoritativeness, Trustworthiness) is not what Google measures.]