
Recurring crawl errors! #1244

Closed
tidoust opened this issue Jun 3, 2024 · 3 comments


tidoust commented Jun 3, 2024

The crawl is resilient, happily reusing previous extracts and hiding the errors (see #1131), but it has been a while since Reffy managed to crawl all specs without any error.

The following git command can be used to track changes to the line that reports the number of errors in ed/index.json:

git log -L 650,653:ed/index.json

Looking at the result, the last time there were 0 errors was on 18 April 2024. About 20 crawl errors have been reported in ed/index.json since then. There are variations, but most errors are server errors (internal errors or rejections of requests) and timeouts. Looking at today's last crawl, I see 27 errors, including:

  • HTTP status 429: 4 errors, W3C specs (but only one /TR, the rest being Patent Policy, Process, GIF89a)
  • HTTP status 500: 5 errors, drafts.fxtf.org specs
  • HTTP status 503: 5 errors, 4 /TR specs, 1 w3c.github.io/reporting/
  • HTTP status 504: 8 errors, 7 Houdini specs, 1 for css-viewport-1
  • Network timeout: 4 errors, 2 for drafts.fxtf.org specs, 1 for https://w3c.github.io/aria/, 1 for the SVG draft
  • ReSpec generation timeout: 1 error for Gamepad Extensions (this one is easy to reproduce and needs investigation)

These errors seem representative of other crawl results. I don't get these errors when I run a crawl locally, except for the one on Gamepad Extensions... and for a 429 on https://drafts.fxtf.org/geometry-1/, which does not appear in Webref's data.

I'm creating this issue to explore possible workarounds to get back to normal. We also see recurring build failures with similar errors in browser-specs.


tidoust commented Jun 4, 2024

The crawler currently processes the list of specs 4 at a time. Most of the time, it just fetches the core URL with appropriate HTTP cache headers, gets a 304, reuses the previous crawl results, and moves on to the next spec. This allows us to crawl things faster. From a server perspective though, this might be interpreted as the crawler sending many requests at once.
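
For illustration, the conditional-request flow described above could look roughly like the sketch below. This is not Reffy's actual code; the shapes of `spec` and `previousResult` are made up for the example.

```
// Illustrative sketch: crawl one spec with a conditional request and reuse
// the previous extract when the server answers 304 Not Modified.
async function crawlSpec(spec, previousResult) {
  const headers = {};
  if (previousResult?.etag) {
    headers['If-None-Match'] = previousResult.etag;
  }
  if (previousResult?.lastModified) {
    headers['If-Modified-Since'] = previousResult.lastModified;
  }

  const response = await fetch(spec.url, { headers });
  if (response.status === 304) {
    // Nothing changed since the last crawl: reuse the previous results.
    return previousResult;
  }

  // Otherwise process the fresh response (stubbed here: only record the
  // validators needed to send a conditional request next time).
  const html = await response.text();
  return {
    url: spec.url,
    etag: response.headers.get('ETag'),
    lastModified: response.headers.get('Last-Modified'),
    length: html.length
  };
}
```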

To be seen as a nicer bot, the crawler could perhaps:

  1. Process the list 2 or 3 specs at a time. Serializing things completely would probably make the crawl too slow.
  2. Sort specs initially to "spread origins", so that the crawler needs to process a few specs before it gets back to sending another request to a given origin (see the sketch after this list).
  3. Add something like a 1-2s delay between requests sent to a given origin, to avoid reaching the 180 requests/minute limit for W3C servers.
  4. Block requests to CSS stylesheets and other known resources we don't need, such as fixup.js for /TR specs. The crawler already caches responses to these resources in practice though, so it's not obvious that we would gain anything.
  5. Schedule browser-specs builds and Webref crawls further apart. They don't run at the same time but within the same hour for now, and rate limits probably get reset after an hour or so.
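
A rough sketch of what proposal 2 could look like, assuming we simply group specs by URL host (a simplification of the looser notion of origin discussed here) and interleave the groups round-robin:

```
// Illustrative sketch of proposal 2: interleave specs round-robin across
// hosts so that consecutive specs target different servers.
function spreadByOrigin(specs) {
  // Group specs by host (a simplified notion of "origin").
  const groups = new Map();
  for (const spec of specs) {
    const host = new URL(spec.url).host;
    if (!groups.has(host)) {
      groups.set(host, []);
    }
    groups.get(host).push(spec);
  }

  // Take one spec from each group in turn until all groups are empty.
  const sorted = [];
  const queues = [...groups.values()];
  while (queues.length > 0) {
    for (let i = queues.length - 1; i >= 0; i--) {
      sorted.push(queues[i].shift());
      if (queues[i].length === 0) {
        queues.splice(i, 1);
      }
    }
  }
  return sorted;
}
```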

@dontcallmedom

I think I'd start with 2 & 3 as the likely biggest bang for the buck

tidoust added a commit to w3c/reffy that referenced this issue Jun 5, 2024
The crawler has a hard time crawling all specs nowadays due to more stringent
restrictions on servers that lead to network timeouts and errors. See:
w3c/webref#1244

The goal of this update is to reduce the load the crawler puts on servers. Two
changes:

1. The list of specs to crawl gets sorted to distribute origins. This should
help with diluting requests sent to a specific server at once. The notion of
"origin" used in the code is loose and more meant to identify the server that
serves the resource than the actual origin.

2. Requests sent to a given origin are serialized, and sent at least 2 seconds
after the last request was sent (and processed). The crawler still processes
the list 4 specs at a time otherwise (provided the specs are to be retrieved
from different origins).

The consequence of 1. is that the specs are no longer processed in order, so
logs will make the crawler look a bit drunk, processing specs seemingly
randomly, as in:

```
  1/610 - https://aomediacodec.github.io/afgs1-spec/ - crawling
  8/610 - https://compat.spec.whatwg.org/ - crawling
 12/610 - https://datatracker.ietf.org/doc/html/draft-davidben-http-client-hint-reliability - crawling
 13/610 - https://datatracker.ietf.org/doc/html/draft-ietf-httpbis-rfc6265bis - crawling
 12/610 - https://datatracker.ietf.org/doc/html/draft-davidben-http-client-hint-reliability - done
 16/610 - https://drafts.css-houdini.org/css-typed-om-2/ - crawling
 13/610 - https://datatracker.ietf.org/doc/html/draft-ietf-httpbis-rfc6265bis - done
 45/610 - https://fidoalliance.org/specs/fido-v2.1-ps-20210615/fido-client-to-authenticator-protocol-v2.1-ps-errata-20220621.html - crawling
https://compat.spec.whatwg.org/ [error] Multiple event handler named orientationchange, cannot associate reliably to an interface in Compatibility Standard
  8/610 - https://compat.spec.whatwg.org/ - done
 66/610 - https://registry.khronos.org/glTF/specs/2.0/glTF-2.0.html - crawling
https://aomediacodec.github.io/afgs1-spec/ [log] extract refs without rules
  1/610 - https://aomediacodec.github.io/afgs1-spec/ - done
```
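
For illustration, the per-origin serialization in change 2 could be sketched as follows. This is not the actual Reffy code; the origin key and the flat 2-second delay are simplifications.

```
// Illustrative sketch of change 2: serialize requests per origin and wait
// at least 2 seconds after the previous request to the same origin has
// completed. The URL host stands in for the loose notion of "origin".
const sleep = ms => new Promise(resolve => setTimeout(resolve, ms));
const lastRequestPerOrigin = new Map();

function fetchFromOrigin(url, options) {
  const origin = new URL(url).host;
  const previous = lastRequestPerOrigin.get(origin) ?? Promise.resolve();

  // Chain the new request after the previous one for that origin, plus a
  // 2-second pause, so that requests to one server never overlap.
  const request = previous
    .then(() => sleep(2000))
    .then(() => fetch(url, options));

  // Remember the chain, swallowing errors so that a failed request does
  // not block subsequent requests to the same origin.
  lastRequestPerOrigin.set(origin, request.catch(() => {}));
  return request;
}
```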
tidoust added a commit to w3c/reffy that referenced this issue Jun 6, 2024

tidoust commented Jun 8, 2024

The new throttling logic seems to work fine: no crawl errors since yesterday. The rules are now:

  • The crawler only sends one request at a time to a given origin
  • The crawler sleeps 2s between requests sent to the csswg.org server (down to 1s for www.w3.org, and 100ms for other origins)
  • The crawler avoids loading associated resources (CSS stylesheets, images including SVG, some scripts that we know we do not need)
  • csswg.org, fxtf.org and css-houdini.org are considered to be the same origin
  • All xxx.github.io URLs are considered to be the same origin (the origin grouping and per-origin delays are sketched below)

The crawl takes longer (16-20 min for a full crawl, 5-6 min when most specs can be skipped because they did not change), but it does not have to be fast and that remains reasonable.
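
For illustration, the origin grouping and per-origin delays above could be expressed as something like the following sketch. This is not the actual Reffy code; names and matching rules are simplified, and the blocking of stylesheets, images and scripts is not shown.

```
// Illustrative sketch of the origin grouping and per-origin delays listed
// above (delay values taken from the rules, everything else simplified).
function originKey(url) {
  const host = new URL(url).host;
  // CSS-related drafts are served by the same infrastructure.
  if (/(^|\.)(csswg\.org|fxtf\.org|css-houdini\.org)$/.test(host)) {
    return 'csswg.org';
  }
  // All xxx.github.io URLs are served by GitHub Pages.
  if (host.endsWith('.github.io')) {
    return 'github.io';
  }
  return host;
}

function delayBetweenRequests(url) {
  const key = originKey(url);
  if (key === 'csswg.org') return 2000;  // 2s for csswg.org, fxtf.org, css-houdini.org
  if (key === 'www.w3.org') return 1000; // 1s for www.w3.org
  return 100;                            // 100ms for other origins
}
```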

For documentation purposes, known usage limits that were put in place on servers:

tidoust closed this as completed Jun 8, 2024