Indexing: Restrict indexing to a known list of content types #149

Open
m-i-l opened this issue May 8, 2024 · 0 comments
Labels
enhancement New feature or request

Comments

Contributor

m-i-l commented May 8, 2024

Search results are now restricted to two known content_types, text/html and text/plain, as per #147. It would make sense to restrict indexing to a list of known content_types too. While it may be fine to show only HTML and plain text in results, some additional file types could still be useful to retain in the index for other functionality, so I'd propose restricting indexing to:

text/html,
text/plain,
application/json,
text/xml,
application/rss+xml,
application/xml,
application/xhtml+xml,
application/atom+xml,
application/feed+json

This could be extended, e.g. if #70 is implemented.

It doesn't look like there is an easy hook in Scrapy for doing this (e.g. along the lines of allow_domains in LinkExtractor), but it should be possible to implement it as a check at the top of customparser in search_my_site_parser.py which returns None when the content type is not on the list (like the existing logic for the article type exclusion functionality).
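A minimal sketch of what such a check might look like, using the list proposed above. The names ALLOWED_CONTENT_TYPES, normalise_content_type and is_indexable are hypothetical, and the customparser signature in the trailing comment is an assumption rather than something taken from the repo. It also assumes the content type is read from the response's Content-Type header, which may carry parameters such as a charset that need stripping before comparison.

```python
# Allowlist proposed in this issue; extend it if further types are needed.
ALLOWED_CONTENT_TYPES = frozenset({
    "text/html",
    "text/plain",
    "application/json",
    "text/xml",
    "application/rss+xml",
    "application/xml",
    "application/xhtml+xml",
    "application/atom+xml",
    "application/feed+json",
})


def normalise_content_type(header_value):
    """Return the bare, lowercased media type from a Content-Type value, or None if absent."""
    if header_value is None:
        return None
    if isinstance(header_value, bytes):  # Scrapy returns header values as bytes
        header_value = header_value.decode("latin-1")
    # Drop parameters such as "; charset=utf-8" and normalise case.
    media_type = header_value.split(";", 1)[0].strip().lower()
    return media_type or None


def is_indexable(header_value):
    """True only if the media type is on the allowlist; a missing type is rejected."""
    return normalise_content_type(header_value) in ALLOWED_CONTENT_TYPES


# Hypothetical use at the top of customparser (signature assumed):
#
# def customparser(response, ...):
#     if not is_indexable(response.headers.get("Content-Type")):
#         return None
```

Note that this sketch rejects responses with no content type at all, which would drop the 102 "no content type specified" documents in the list below. Whether that is the desired behaviour is a design choice.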

For reference, the full list of currently indexed content types is:

"text/html",91813,
"application/json",1840,
"text/xml",1687,
"application/rss+xml",1532,
"application/xml",1529,
"application/javascript",648,
"text/plain",595,
"application/xhtml+xml",256,
"application/octet-stream",243,
"application/atom+xml",233,
"application/manifest+json",135,
no content type specified,102,
"text/css",89,
"application/feed+json",32,
"text/javascript",26,
"application/pgp-keys",24,
"application/opensearchdescription+xml",20,
"application/x-tex",17,
"text/gemini",17,
"text/x-bibtex",16,
"text/csv",14,
"application/json+oembed",13,
"text/markdown",13,
"application/pgp-signature",12,
"text/calendar",10,
"application/pgp-encrypted",8,
"application/rdf+xml",7,
"text/x-csrc",7,
"text/x-python",6,
"text/x-c",5,
"text/x-diff",5,
"application/stream+json",4,
"application/x-mspublisher",4,
"image/svg+xml",4,
"text/vcard",4,
"text/x-opml+xml",4,
"application/jf2feed+json",3,
"application/rsd+xml",3,
"application/x-httpd-ea-php55",3,
"application/x-sh",3,
"binary/octet-stream",3,
"text/x-c++src",3,
"application/jf2+json",2,
"application/jrd+json",2,
"application/mathematica",2,
"application/pem-certificate-chain",2,
"application/rtf",2,
"application/vnd.wolfram.mathematica.package",2,
"application/x-javascript",2,
"application/x-tcl",2,
"text/x-java-source",2,
"text/x-opml",2,
"text/x-sh",2,
"x-world/x-vrml",2,
"application/mf2+json",1,
"application/opml+xml",1,
"application/pdf",1,
"application/powder+xml",1,
"application/vnd.apple.keynote",1,
"application/x-msdownload",1,
"application/x-perl",1,
"application/x-sql",1,
"application/x-trash",1,
"application/x-x509-ca-cert",1,
"audio/mpegurl",1,
"audio/x-pn-realaudio",1,
"text/prs.fallenstein.rst",1,
"text/vnd.wap.wml",1,
"text/x-java",1,
"text/x-rsrc",1

@m-i-l m-i-l added the enhancement New feature or request label May 8, 2024