Byte limit #75

Open

JanPetterMG opened this issue Aug 8, 2016 · 1 comment

JanPetterMG commented Aug 8, 2016

Feature request: Limit the maximum number of bytes to parse.

A maximum file size may be enforced per crawler. Content which is after the maximum file size may be ignored. Google currently enforces a size limit of 500 kilobytes (KB).

Source: Google

When forming the robots.txt file, you should keep in mind that the robot places a reasonable limit on its size. If the file size exceeds 32 KB, the robot assumes it allows everything.

Source: Yandex

  • Default limit of X bytes, e.g. 524,288 bytes (512 KB / 0.5 MB)
  • User-defined limit override
  • Make sure the limit is reasonable; throw an exception if it's dangerously low, e.g. 24,576 bytes (24 KB)
  • Should be possible to disable entirely, i.e. no limit (see the sketch below)
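
Here's a minimal Python sketch of how those four points could fit together. It's illustrative only; the constant values and the function name truncate_robots_txt are assumptions, not this library's actual API:

```python
from typing import Optional

DEFAULT_BYTE_LIMIT = 524288   # proposed default: 512 KB
MINIMUM_BYTE_LIMIT = 24576    # proposed floor: 24 KB, below this is considered dangerously low


def truncate_robots_txt(content: bytes, byte_limit: Optional[int] = DEFAULT_BYTE_LIMIT) -> bytes:
    """Return the robots.txt content capped at byte_limit bytes.

    Passing None disables the limit entirely. Limits below the minimum
    raise an exception so a misconfiguration can't silently cripple parsing.
    """
    if byte_limit is None:
        return content  # limit disabled: parse everything
    if byte_limit < MINIMUM_BYTE_LIMIT:
        raise ValueError(
            "Byte limit %d is dangerously low (minimum is %d bytes)"
            % (byte_limit, MINIMUM_BYTE_LIMIT)
        )
    # Everything beyond the limit is ignored.
    return content[:byte_limit]
```

Truncating at the limit matches Google's documented behaviour (content past the maximum file size is ignored); Yandex instead treats an oversized file as allowing everything, which would be an alternative policy.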
@JanPetterMG
Collaborator Author

At the moment, it's possible to generate large (fake or valid) robots.txt files with the aim of trapping the robots.txt crawler, slowing down the server, or even causing it to hang or crash.

It's also possible (depending on the setup) to trap the crawler in an infinite retry loop, if the external code using this library doesn't handle repeated fatal errors correctly...

Related to #62
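
A capped fetch is one way to defuse the first of those traps: if the client never reads more than the limit from the socket, an attacker serving an arbitrarily large robots.txt can't slow it down or exhaust memory. The Python sketch below is illustrative only (the helper name fetch_robots_txt and the 512 KB cap are assumptions, not this library's behaviour):

```python
import urllib.request

MAX_FETCH_BYTES = 524288  # same 512 KB figure proposed above


def fetch_robots_txt(url: str, max_bytes: int = MAX_FETCH_BYTES, timeout: float = 10.0) -> bytes:
    """Download at most max_bytes of the robots.txt found at url."""
    chunks = []
    received = 0
    with urllib.request.urlopen(url, timeout=timeout) as response:
        while received < max_bytes:
            chunk = response.read(min(8192, max_bytes - received))
            if not chunk:
                break  # server closed the connection; we already have everything
            chunks.append(chunk)
            received += len(chunk)
    # Bytes beyond the cap are never read from the socket at all.
    return b"".join(chunks)
```

The retry-loop issue mentioned above lives in the calling code and isn't addressed by a byte limit on its own.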
