Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Aligning technology origins and adoption metric between the crux and the crawl data #42

Open
max-ostapenko opened this issue Jan 8, 2025 · 2 comments

Comments

@max-ostapenko
Copy link
Contributor

Quoting @rviscomi :

The number seemed high because I anecdotally know WordPress market share of CMSs is around 75%, but we're only showing 3.2M WordPress origins in the CMS report.
So if there are 8.8M sites that use a CMS, that puts WordPress's market share at 36%, which is way too low.
The issue seems to be that the WordPress count is taken after joining with the CrUX dataset, and many sites have fallen out of CrUX.
Modifying your query to count WordPress sites in November:

SELECT
  COUNT(DISTINCT root_page)
FROM crawl.pages
WHERE date = '2024-11-01'
  AND client = 'mobile'
  AND 'WordPress' IN UNNEST(technologies.technology)

The result is 5785472, which gets us much closer to the expected market share: 65%.
So there are about 2.5M WordPress sites that we're counting in the category total but not in the technology total.
Open to suggestions on how to fix this. One idea is to remove the CrUX join (or do some sort of outer join) when calculating origin counts.
Yeah we subtly changed the name from "CWV Tech Report" to "HTTP Archive Tech Report" so that we could lean more heavily onto the adoption side, so joining forces makes a lot of sense


I see 2 issues here:

  1. we use different URL sets in the report: November crawl is based on Oct CrUX, but we are JOINing it with Nov CrUX. It's either complete or timely. (0.6M discrepancy)
  2. we are not using tablet and NULL clients from CrUX - so more unmatched origins (1.9M). No geo and rank available for aggregation.

A promising analysis logic

Calculate adoption with crawl data, as it's the original source.
This will help us to solve adoption with the most complete set of origins, including the CrUX's tablet and NULL clients.
But only the global ones, geo dimension is part of CrUX and thus unavailable. We could still use INNER JOIN there.

@tunetheweb
Copy link
Member

What would the adoption share be if we included CrUX data in the total? i.e. we insist on both to be in this dataset?

@max-ostapenko
Copy link
Contributor Author

max-ostapenko commented Jan 12, 2025

I'm assuming you're talking about total adoption, but let's check category share as well.

In 2024-11 we have:

  • crawl total origins: 16257555
  • when joined with crux: 9316422

CMS / WordPress stats

  1. Today we show:
  • 3222201 / 9316422 = 35% adoption
  • 3222201 / 8882835 = 36% category share
SELECT
  category_total,
  tech_adoption,
  tech_adoption / category_total AS category_share
FROM (
  SELECT
    adoption.mobile AS tech_adoption
  FROM `httparchive.reports.cwv_tech_adoption`
  WHERE date = "2024-11-01"
    AND technology = 'WordPress'
    AND rank = 'ALL'
    AND geo = 'ALL'
)
CROSS JOIN (
  SELECT
    origins AS category_total
  FROM `httparchive.reports.cwv_tech_categories`
  WHERE category = 'CMS'
)

We loose 1.9M origins, when we don't include origins from part crux.device = 'tablet' OR IS NULL in the JOIN.

  1. The complete adoption (only crawl):
  • 5785472 / 9316422 = 62% adoption
  • 5785472 / 8882835 = 65% category share
SELECT
  category_total,
  tech_adoption,
  tech_adoption / category_total AS category_share
FROM (
  SELECT 
    COUNT(DISTINCT root_page) AS tech_adoption
  FROM crawl.pages
  WHERE date = '2024-11-01'
    AND client = 'mobile'
    AND 'WordPress' IN UNNEST(technologies.technology)
)
CROSS JOIN (
  SELECT
    origins AS category_total
  FROM `httparchive.reports.cwv_tech_categories`
  WHERE category = 'CMS'
)
  1. If we calculate category AND tech adoption based on CrUX:
  • 3222201 / 4945852 = 65% category_share
WITH crux AS (
  SELECT DISTINCT
    CONCAT(origin, '/') AS root_page
  FROM `chrome-ux-report.materialized.device_summary`
  WHERE
    date = '2024-11-01'
    AND device IN ('phone')
), pages AS (
  SELECT DISTINCT
    root_page
  FROM crawl.pages,
    UNNEST(technologies) AS tech
  WHERE
    date = '2024-11-01'
    AND client = 'mobile'
    AND 'CMS' IN UNNEST(tech.categories)
)

SELECT
  category_total,
  tech_adoption,
  tech_adoption / category_total AS category_share
FROM (
  SELECT
    adoption.mobile AS tech_adoption
  FROM `httparchive.reports.cwv_tech_adoption`
  WHERE date = "2024-11-01"
    AND technology = 'WordPress'
    AND rank = 'ALL'
    AND geo = 'ALL'
)
CROSS JOIN (
  SELECT
    COUNT(DISTINCT root_page) AS category_total
  FROM crux
  INNER JOIN pages
  USING (root_page)
)

Email / MailChimp

type category share adoption
mixed (today) 402076 / 5178762 = 7% 402076 / 16257555 = 2.5%
full crawl 638642 / 5178762 = 12% 638642 / 16257555 = 3.9%
both via crux join 402076 / 3018043 = 13% 402076 / 9316422 = 4.3%

Shrinking the sample of origins will have more noticeable impact for less popular technologies.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

2 participants