This repository handles the HTTP Archive data pipeline, which takes the results of the monthly HTTP Archive run and saves this to the httparchive
dataset in BigQuery.
The pipelines are run in Dataform service in Google Cloud Platform (GCP) and are kicked off automatically on crawl completion and other events. The code in the main
branch is used on each triggered pipeline run.
Tag: crawl_complete
- httparchive.crawl.pages
- httparchive.crawl.parsed_css
- httparchive.crawl.requests
Tag: cwv_tech_report
- httparchive.core_web_vitals.technologies
Consumers:
Tag: blink_features_report
- httparchive.blink_features.features
- httparchive.blink_features.usage
Consumers:
- chromestatus.com - example
Tag: crawl_results_legacy
- httparchive.all.pages
- httparchive.all.parsed_css
- httparchive.all.requests
- httparchive.lighthouse.YYYY_MM_DD_client
- httparchive.pages.YYYY_MM_DD_client
- httparchive.requests.YYYY_MM_DD_client
- httparchive.response_bodies.YYYY_MM_DD_client
- httparchive.summary_pages.YYYY_MM_DD_client
- httparchive.summary_requests.YYYY_MM_DD_client
- httparchive.technologies.YYYY_MM_DD_client
-
crawl-complete PubSub subscription
Tags: ["crawl_complete", "blink_features_report", "crawl_results_legacy"]
-
bq-poller-cwv-tech-report Scheduler
Tags: ["cwv_tech_report"]
- Create new dev workspace in Dataform.
- Make adjustments to the dataform configuration files and manually run a workflow to verify.
- Push all your changes to a dev branch & open a PR with the link to the BigQuery artifacts generated in the test workflow.
-
In workflow settings vars:
- set
env_name: dev
to process sampled data in dev workspace. - change
today
variable to a month in the past. May be helpful for testing pipelines based onchrome-ux-report
data.
- set
-
definitions/extra/test_env.sqlx
script helps to setup the tables required to run pipelines when in dev workspace. It's disabled by default.
The issues within the pipeline are being tracked using the following alerts:
- the event trigger processing fails - Dataform Trigger Function Error
- a job in the workflow fails - "Dataform Workflow Invocation Failed
Error notifications are sent to #10x-infra Slack channel.