
Update tools and jobs to publish curated data #468

Merged
merged 9 commits into main from data-curation on Jan 31, 2022

Conversation

tidoust
Member

@tidoust tidoust commented Jan 30, 2022

The goal of this update is to create a curated data view along with the raw data view and the npm packages views (see #277).

Curation means applying patches to the raw data and re-generating the idlparsed, idlnames and idlnamesparsed folders. The latter two will only contain IDL names targeted at browsers, although note that actual spec filtering remains a TODO at this stage (see corresponding TODO comments in prepare-curated.js and prepare-packages.js).

To create the curated data view, this update introduces new tools:

  • a prepare-curated.js tool that copies the raw data to the given folder, applies patches (CSS, elements, IDL) when needed, re-generates the idlparsed folder, re-generates the idlnames and idlnamesparsed folders and adjusts the index.json and idlnames.json files accordingly.
  • a prepare-packages.js tool (replaces the now gone packages/prepare.js) that copies relevant curated data from the curated folder to the packages folder.
  • a commit-curated.js tool that updates the curated branch with the contents of the given curated folder.
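The index.json adjustment mentioned in the first bullet can be pictured as a small pure function. The sketch below is hypothetical: the function name and the heuristic for spotting extract links are assumptions, not the actual prepare-curated.js code.

```javascript
// Hypothetical sketch of the index.json adjustment in prepare-curated.js.
// After curation, any per-spec property that points at an extract file
// no longer present in the curated view gets dropped.
function adjustIndex(specEntries, curatedFiles) {
  const available = new Set(curatedFiles);
  return specEntries.map(spec => {
    const adjusted = {};
    for (const [prop, value] of Object.entries(spec)) {
      // Treat relative paths such as "idl/dom.idl" as extract links;
      // URLs (which carry a scheme) and plain strings are left alone.
      const isExtractLink = typeof value === 'string' &&
        !/^\w+:/.test(value) && value.includes('/');
      if (!isExtractLink || available.has(value)) {
        adjusted[prop] = value;
      }
    }
    return adjusted;
  });
}
```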

The goal is to have the curated branch be the one published as GitHub Pages.

The test logic was partially rewritten to run the tests against the curated data, and against both the curated data and the NPM packages data when tests may yield different results.

A new curate.yml job publishes the curated data whenever the crawl data is updated. The job also takes care of preparing package release PRs as needed, replacing the previous prepare-xxx-release jobs.
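The chaining between the crawl and curation workflows could be expressed with a workflow_run trigger. The outline below is a hedged sketch of what curate.yml might contain; the step names and tool arguments are assumptions, not taken from the actual file.

```yaml
# Hypothetical outline of curate.yml; runs after the crawl workflow completes.
name: Curate data
on:
  workflow_run:
    workflows: ["update-ed"]
    types: [completed]
jobs:
  curate:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v2
      - run: node tools/prepare-curated.js    # raw data + patches -> curated view
      - run: node tools/prepare-packages.js   # curated view -> per-package views
      - run: npm test                         # test curated + package data
      - run: node tools/commit-curated.js     # update the curated branch
      # ...then open or update the package pre-release PR as needed
```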

The release workflow becomes:

  1. Crawled data is updated (update-ed.yml)
  2. Curated data and package data get generated (curate.yml)
  3. Curated data and package data get tested (curate.yml)
  4. The curated branch gets updated with the curated data (curate.yml)
  5. Npm package pre-release PR gets created (curate.yml)
  6. Someone reviews and merges the PR
  7. New versions of npm packages are released (release-package.yml)
  8. A Raw data for @webref/ttt@vx.y.z tag gets added to the relevant commit on the main branch.
  9. A @webref/ttt@vx.y.z tag gets added to the relevant commit on the curated branch.
  10. The @webref/ttt@latest tag gets updated to point to the relevant commit on the curated branch.

Note that, in order for a release to be created, curated data needs to have changed. A change to the static content in the packages folder won't be enough to trigger a release for instance. That should not be a major problem.

@jodinathan

this will help us so much :D

@tidoust tidoust marked this pull request as ready for review January 31, 2022 09:18
@tidoust
Member Author

tidoust commented Jan 31, 2022

@dontcallmedom CI tests fail because the UUID spec still exists in webref. We should automate removal of files that are linked to a spec that got removed from browser-specs (in such cases, we can be confident that the removal was not accidental).
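The automated removal could start from a small helper like the sketch below. This is hypothetical: the real data shapes in browser-specs and webref are richer, and the helper name is invented.

```javascript
// Hypothetical helper: given the shortnames still known to browser-specs
// and the extract files on disk, list files whose spec has been removed
// (e.g. "uuid.idl" once the UUID spec leaves browser-specs).
function findOrphanExtracts(knownShortnames, extractFiles) {
  const known = new Set(knownShortnames);
  return extractFiles.filter(file => {
    const shortname = file.replace(/\.[^.]+$/, ''); // drop the extension
    return !known.has(shortname);
  });
}
```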

@tidoust
Member Author

tidoust commented Jan 31, 2022

Replying to myself:

Note that, in order for a release to be created, curated data needs to have changed. A change to the static content in the packages folder won't be enough to trigger a release for instance. That should not be a major problem.

Actually, that is going to be a problem because we want to do major/minor bumps under packages once in a while, and we want to update the package PR accordingly. I think that the easiest is to also include files under the packages folder in the curated folder so that updates to these files in the main branch also trigger updates to the curated branch. I'll update the scripts accordingly.

Member

@dontcallmedom dontcallmedom left a comment


Yet another amazing piece of work, thank you so much!

A few nits for your consideration, but looks great to me in any case

packages/css/index.js (resolved review thread)
// rm dstDir/*.${fileExt}
const dstFiles = await fs.readdir(dstDir);
for (const file of dstFiles) {
  if (file.endsWith(`.${fileExt}`) && file !== 'package.json') {
    await fs.unlink(path.join(dstDir, file));
  }
}
Member


Probably not a short-term concern, but maybe something worth protecting ourselves against at the browser-specs level: what if a spec ends up using "package" as a shortname?

Member Author


Tracked in #472.

tools/commit-curated.js (resolved review thread)
tools/prepare-curated.js (resolved review thread)

async function cleanCrawlOutcome(spec) {
await Promise.all(Object.keys(spec).map(async property => {
// Only consider properties that link to an extract
Member


since the list of extracted properties is more or less hardcoded in the rest of the code (e.g. L77), it might be better to use a shared hardcoded list rather than rely on heuristics?

Member Author


I went the opposite way, actually, and dropped the hardcoded list (except to note the folders that must not be integrated), so that the script can handle new extracts without having to be modified.

Note the heuristics are exactly the same as in the expandCrawlResults function in Reffy.
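The "property links to an extract" heuristic could look something like the sketch below. This is a hypothetical approximation of the idea, not Reffy's actual expandCrawlResults code.

```javascript
// Hypothetical approximation of the "property links to an extract"
// heuristic: a value counts as an extract link when it is a relative
// path into an extract folder, e.g. "idl/dom.idl" or "css/flexbox.json".
function linksToExtract(value) {
  return typeof value === 'string' &&
    /^[a-z][a-z-]*\/[\w.-]+\.(idl|json)$/.test(value);
}
```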

tools/utils.js (resolved review thread)
Having the static package files (`index.js`, `README.md`, `package.json`) in
the `curated` branch is useful both to trigger an update of the branch when
these files are updated (and thus make it possible to publish a new NPM
package) and to have these files directly available under the
`@webref/xxx@vx.y.z` tag.
@tidoust tidoust marked this pull request as ready for review January 31, 2022 11:08
- Use destructuring assignments for options
- Drop hardcoded list of folders in `prepare-curated.js`
- Add explanation about clean function purpose
@tidoust tidoust merged commit edb3390 into main Jan 31, 2022
@tidoust tidoust deleted the data-curation branch January 31, 2022 13:43