Byte limit #75

Open

JanPetterMG opened this issue Aug 8, 2016 · 1 comment

JanPetterMG commented Aug 8, 2016

Feature request: Limit the maximum number of bytes to parse.

A maximum file size may be enforced per crawler. Content which is after the maximum file size may be ignored. Google currently enforces a size limit of 500 kilobytes (KB).

Source: Google

When forming the robots.txt file, you should keep in mind that the robot places a reasonable limit on its size. If the file size exceeds 32 KB, the robot assumes it allows everything.

Source: Yandex

  • Default limit of X bytes, e.g. 524,288 bytes (512 KB / 0.5 MB)
  • User-defined limit override
  • Make sure the limit is reasonable; throw an exception if it's dangerously low, e.g. 24,576 bytes (24 KB)
  • Should be possible to disable entirely, i.e. no limit (see the sketch below)
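
Here's a minimal Python sketch of how those four points could fit together. It's illustrative only; the constant values and the function name truncate_robots_txt are assumptions, not this library's actual API:

```python
from typing import Optional

DEFAULT_BYTE_LIMIT = 524288   # proposed default: 512 KB
MINIMUM_BYTE_LIMIT = 24576    # proposed floor: 24 KB, below this is considered dangerously low


def truncate_robots_txt(content: bytes, byte_limit: Optional[int] = DEFAULT_BYTE_LIMIT) -> bytes:
    """Return the robots.txt content capped at byte_limit bytes.

    Passing None disables the limit entirely. Limits below the minimum
    raise an exception so a misconfiguration can't silently cripple parsing.
    """
    if byte_limit is None:
        return content  # limit disabled: parse everything
    if byte_limit < MINIMUM_BYTE_LIMIT:
        raise ValueError(
            "Byte limit %d is dangerously low (minimum is %d bytes)"
            % (byte_limit, MINIMUM_BYTE_LIMIT)
        )
    # Everything beyond the limit is ignored.
    return content[:byte_limit]
```

Truncating at the limit matches Google's documented behaviour (content past the maximum file size is ignored); Yandex instead treats an oversized file as allowing everything, which would be an alternative policy.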
@JanPetterMG
Collaborator Author

At the moment, it's possible to generate large (fake or valid) robots.txt files with the aim of trapping the robots.txt crawler, slowing down the server, or even causing it to hang or crash.

It's also possible (depending on the setup) to trap the crawler in an infinite retry loop, if the external code using this library doesn't handle repeated fatal errors correctly...

Related to #62
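
A capped fetch is one way to defuse the first of those traps: if the client never reads more than the limit from the socket, an attacker serving an arbitrarily large robots.txt can't slow it down or exhaust memory. The Python sketch below is illustrative only (the helper name fetch_robots_txt and the 512 KB cap are assumptions, not this library's behaviour):

```python
import urllib.request

MAX_FETCH_BYTES = 524288  # same 512 KB figure proposed above


def fetch_robots_txt(url: str, max_bytes: int = MAX_FETCH_BYTES, timeout: float = 10.0) -> bytes:
    """Download at most max_bytes of the robots.txt found at url."""
    chunks = []
    received = 0
    with urllib.request.urlopen(url, timeout=timeout) as response:
        while received < max_bytes:
            chunk = response.read(min(8192, max_bytes - received))
            if not chunk:
                break  # server closed the connection; we already have everything
            chunks.append(chunk)
            received += len(chunk)
    # Bytes beyond the cap are never read from the socket at all.
    return b"".join(chunks)
```

The retry-loop issue mentioned above lives in the calling code and isn't addressed by a byte limit on its own.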
