
Update tools and jobs to publish curated data #468

Merged
merged 9 commits into main from data-curation on Jan 31, 2022

Conversation

tidoust
Member

@tidoust tidoust commented Jan 30, 2022

The goal of this update is to create a curated data view along with the raw data view and the npm packages views (see #277).

Curation means applying patches to the raw data and re-generating the idlparsed, idlnames and idlnamesparsed folders. The latter two will only contain IDL names targeted at browsers, although note that actual spec filtering remains a TODO at this stage (see corresponding TODO comments in prepare-curated.js and prepare-packages.js).

To create the curated data view, this update introduces new tools:

  • a prepare-curated.js tool that copies the raw data to the given folder, applies patches (CSS, elements, IDL) when needed, re-generates the idlparsed folder, re-generates the idlnames and idlnamesparsed folders and adjusts the index.json and idlnames.json files accordingly.
  • a prepare-packages.js tool (replaces the now gone packages/prepare.js) that copies relevant curated data from the curated folder to the packages folder.
  • a commit-curated.js tool that updates the curated branch with the contents of the given curated folder.
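The index.json adjustment mentioned in the first bullet can be pictured as a small pure function. The sketch below is hypothetical: the function name and the heuristic for spotting extract links are assumptions, not the actual prepare-curated.js code.

```javascript
// Hypothetical sketch of the index.json adjustment in prepare-curated.js.
// After curation, any per-spec property that points at an extract file
// no longer present in the curated view gets dropped.
function adjustIndex(specEntries, curatedFiles) {
  const available = new Set(curatedFiles);
  return specEntries.map(spec => {
    const adjusted = {};
    for (const [prop, value] of Object.entries(spec)) {
      // Treat relative paths such as "idl/dom.idl" as extract links;
      // URLs (which carry a scheme) and plain strings are left alone.
      const isExtractLink = typeof value === 'string' &&
        !/^\w+:/.test(value) && value.includes('/');
      if (!isExtractLink || available.has(value)) {
        adjusted[prop] = value;
      }
    }
    return adjusted;
  });
}
```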

The goal is to have the curated branch be the one published as GitHub Pages.

The test logic was partially rewritten to run the tests against the curated data, and against both the curated data and the NPM packages data when tests may yield different results.

A new curate.yml job publishes the curated data whenever the crawl data is updated. The job also takes care of preparing package release PRs as needed, replacing the previous prepare-xxx-release jobs.
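The chaining between the crawl and curation workflows could be expressed with a workflow_run trigger. The outline below is a hedged sketch of what curate.yml might contain; the step names and tool arguments are assumptions, not taken from the actual file.

```yaml
# Hypothetical outline of curate.yml; runs after the crawl workflow completes.
name: Curate data
on:
  workflow_run:
    workflows: ["update-ed"]
    types: [completed]
jobs:
  curate:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v2
      - run: node tools/prepare-curated.js    # raw data + patches -> curated view
      - run: node tools/prepare-packages.js   # curated view -> per-package views
      - run: npm test                         # test curated + package data
      - run: node tools/commit-curated.js     # update the curated branch
      # ...then open or update the package pre-release PR as needed
```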

The release workflow becomes:

  1. Crawled data is updated (update-ed.yml)
  2. Curated data and package data get generated (curate.yml)
  3. Curated data and package data get tested (curate.yml)
  4. The curated branch gets updated with the curated data (curate.yml)
  5. Npm package pre-release PR gets created (curate.yml)
  6. Someone reviews and merges the PR
  7. New versions of npm packages are released (release-package.yml)
  8. A Raw data for @webref/ttt@vx.y.z tag gets added to the relevant commit on the main branch.
  9. A @webref/ttt@vx.y.z tag gets added to the relevant commit on the curated branch.
  10. The @webref/ttt@latest tag gets updated to point to the relevant commit on the curated branch.

Note that, in order for a release to be created, curated data needs to have changed. A change to the static content in the packages folder won't be enough to trigger a release for instance. That should not be a major problem.

@jodinathan

this will help us so much :D

@tidoust tidoust marked this pull request as ready for review January 31, 2022 09:18
@tidoust
Member Author

tidoust commented Jan 31, 2022

@dontcallmedom CI tests fail because the UUID spec still exists in webref. We should automate removal of files that are linked to a spec that got removed from browser-specs (in such cases, we can be confident that the removal was not accidental).
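The automated removal could start from a small helper like the sketch below. This is hypothetical: the real data shapes in browser-specs and webref are richer, and the helper name is invented.

```javascript
// Hypothetical helper: given the shortnames still known to browser-specs
// and the extract files on disk, list files whose spec has been removed
// (e.g. "uuid.idl" once the UUID spec leaves browser-specs).
function findOrphanExtracts(knownShortnames, extractFiles) {
  const known = new Set(knownShortnames);
  return extractFiles.filter(file => {
    const shortname = file.replace(/\.[^.]+$/, ''); // drop the extension
    return !known.has(shortname);
  });
}
```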

@tidoust
Member Author

tidoust commented Jan 31, 2022

Replying to myself:

Note that, in order for a release to be created, curated data needs to have changed. A change to the static content in the packages folder won't be enough to trigger a release for instance. That should not be a major problem.

Actually, that is going to be a problem because we want to do major/minor bumps under packages once in a while, and we want to update the package PR accordingly. I think that the easiest is to also include files under the packages folder in the curated folder so that updates to these files in the main branch also trigger updates to the curated branch. I'll update the scripts accordingly.

Member

@dontcallmedom dontcallmedom left a comment


Yet another amazing piece of work, thank you so much!

A few nits for your consideration, but looks great to me in any case

packages/css/index.js (resolved review thread)
// rm dstDir/*.${fileExt}
const dstFiles = await fs.readdir(dstDir);
for (const file of dstFiles) {
  if (file.endsWith(`.${fileExt}`) && file !== 'package.json') {
    await fs.unlink(path.join(dstDir, file));
  }
}
Member


Probably not a short-term concern, but maybe something worth protecting ourselves against at the browser-specs level: what if a spec ends up using "package" as a shortname?

Member Author


Tracked in #472.

tools/commit-curated.js (resolved review thread)
tools/prepare-curated.js (resolved review thread)

async function cleanCrawlOutcome(spec) {
await Promise.all(Object.keys(spec).map(async property => {
// Only consider properties that link to an extract
Member


since the list of extracted properties is more or less hardcoded in the rest of the code (e.g. L77), it might be better to use a shared hardcoded list rather than rely on heuristics?

Member Author


I went the opposite way, actually, and dropped the hardcoded list (except to note the folders that must not be integrated), so that the script can handle new extracts without having to be modified.

Note the heuristics are exactly the same as in the expandCrawlResults function in Reffy.
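The "property links to an extract" heuristic could look something like the sketch below. This is a hypothetical approximation of the idea, not Reffy's actual expandCrawlResults code.

```javascript
// Hypothetical approximation of the "property links to an extract"
// heuristic: a value counts as an extract link when it is a relative
// path into an extract folder, e.g. "idl/dom.idl" or "css/flexbox.json".
function linksToExtract(value) {
  return typeof value === 'string' &&
    /^[a-z][a-z-]*\/[\w.-]+\.(idl|json)$/.test(value);
}
```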

tools/utils.js (resolved review thread)
Having the static package files (`index.js`, `README.md`, `package.json`) in
the `curated` branch is useful both to trigger an update of the branch when
these files are updated (and thus make it possible to publish a new NPM
package) and to have these files directly available under the
`@webref/xxx@vx.y.z` tag.
@tidoust tidoust marked this pull request as ready for review January 31, 2022 11:08
- Use destructuring assignments for options
- Drop hardcoded list of folders in `prepare-curated.js`
- Add explanation about clean function purpose
@tidoust tidoust merged commit edb3390 into main Jan 31, 2022
@tidoust tidoust deleted the data-curation branch January 31, 2022 13:43