Scrapy allows redirect following in protocols other than HTTP

@mvsantos

Impact

Scrapy followed redirects regardless of the scheme of the target URL, so redirects worked for data://, file://, ftp://, s3://, and any other scheme defined in the DOWNLOAD_HANDLERS setting.

However, HTTP redirects should only work between URLs that use the http:// or https:// schemes.
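
The rule above can be illustrated with a minimal, standard-library-only sketch. The function name `is_http_redirect` and the constant `HTTP_SCHEMES` are hypothetical and are not part of Scrapy; the sketch only shows the kind of scheme check that a redirect target should pass, not how Scrapy implements it.

```python
from urllib.parse import urljoin, urlsplit

# Schemes that an HTTP redirect is allowed to point to.
HTTP_SCHEMES = {"http", "https"}


def is_http_redirect(request_url: str, location: str) -> bool:
    """Return True if following `location` from `request_url` stays on http(s).

    `location` may be relative, so it is first resolved against the URL of
    the request that received the redirect.
    """
    target = urljoin(request_url, location)
    return urlsplit(target).scheme.lower() in HTTP_SCHEMES


if __name__ == "__main__":
    base = "http://example.com/start"
    for location in ("/next", "https://example.org/", "file:///etc/passwd",
                     "ftp://ftp.example.com/x", "s3://bucket/key"):
        print(location, is_http_redirect(base, location))
```

A relative or http(s) target is accepted, while file://, ftp://, and s3:// targets are rejected.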

A malicious actor, given write access to the start requests (e.g. ability to define start_urls) of a spider and read access to the spider output, could exploit this vulnerability to:

Redirect to any local file using the file:// scheme to read its contents.
Redirect to an ftp:// URL of a malicious FTP server to obtain the FTP username and password configured in the spider or project.
Redirect to any s3:// URL to read its content using the S3 credentials configured in the spider or project.

For file:// and s3://, how the spider implements its parsing of input data into an output item determines what data would be vulnerable. A spider that always outputs the entire contents of a response would be completely vulnerable, while a spider that extracted only fragments from the response could significantly limit vulnerable data.

Patches

Upgrade to Scrapy 2.11.2 or later.
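
To confirm that an environment is on a patched release, a check along the following lines may help. It is a rough sketch: `is_patched` and `PATCHED` are hypothetical names, and the parser only looks at the leading numeric components of the version string, so pre-release suffixes such as `rc1` are ignored.

```python
import re

# First release that contains the fix, per this advisory.
PATCHED = (2, 11, 2)


def is_patched(version: str) -> bool:
    """Return True if `version` is at least 2.11.2.

    Only the leading "major.minor[.patch]" digits are compared; anything
    after them (for example "rc1") is ignored by this simplified parser.
    """
    match = re.match(r"(\d+)\.(\d+)(?:\.(\d+))?", version)
    if match is None:
        raise ValueError(f"Unrecognized version string: {version!r}")
    parts = tuple(int(part or 0) for part in match.groups())
    return parts >= PATCHED
```

The installed version string can be obtained with `importlib.metadata.version("scrapy")` and passed to this function.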

Workarounds

Replace the built-in redirect middlewares (RedirectMiddleware and MetaRefreshMiddleware) with custom ones that implement the fix from Scrapy 2.11.2, and verify that they work as intended.
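
As an illustration only, the sketch below shows a different and coarser technique than the replacement middlewares described above: an extra downloader middleware that drops redirected requests whose URL is not http:// or https://. It is not the change shipped in Scrapy 2.11.2. It assumes that the built-in redirect middlewares record the redirect chain in `request.meta["redirect_urls"]`; the class name `NonHttpRedirectGuard` and the constant `ALLOWED_REDIRECT_SCHEMES` are hypothetical. Test it against your own project before relying on it.

```python
from urllib.parse import urlsplit

try:
    from scrapy.exceptions import IgnoreRequest
except ImportError:
    # Stand-in so the sketch can be exercised without Scrapy installed.
    class IgnoreRequest(Exception):
        """Placeholder for scrapy.exceptions.IgnoreRequest."""


ALLOWED_REDIRECT_SCHEMES = {"http", "https"}


class NonHttpRedirectGuard:
    """Hypothetical downloader middleware that blocks non-HTTP(S) redirects."""

    def process_request(self, request, spider):
        # Assumption: a non-empty "redirect_urls" meta key marks a request
        # that was produced by following a redirect.
        if not request.meta.get("redirect_urls"):
            return None  # Not a redirect: leave the request untouched.
        scheme = urlsplit(request.url).scheme.lower()
        if scheme not in ALLOWED_REDIRECT_SCHEMES:
            raise IgnoreRequest(
                f"Blocked redirect to non-HTTP(S) URL: {request.url}"
            )
        return None  # Allowed: continue normal processing.
```

Such a class would be enabled through the DOWNLOADER_MIDDLEWARES setting, for example under a hypothetical import path like `myproject.middlewares.NonHttpRedirectGuard`. Requests that start on a non-HTTP scheme on purpose (for example a file:// start URL) are unaffected, because only redirected requests are checked.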

References

This security issue was reported by @mvsantos at scrapy/scrapy#457.

Gallaecio published to scrapy/scrapy May 14, 2024

Published to the GitHub Advisory Database May 14, 2024

Reviewed May 14, 2024

Last updated May 14, 2024
