The search results are now restricted to 2 known content_types, text/html and text/plain, as per #147. It would also make sense to restrict indexing to a list of known content_types. While it might be fine to only show HTML and plain text in results, some additional file types might still be useful to retain in the index for additional functionality, so I'd propose restricting indexing to:
text/html,
text/plain,
application/json,
text/xml,
application/rss+xml,
application/xml,
application/xhtml+xml,
application/atom+xml,
application/feed+json
This could be extended, e.g. if #70 is implemented.
It doesn't look like there is an easy hook in scrapy for doing this (e.g. along the lines of allow_domains in LinkExtractor), but it should be possible to implement via a check at the top of customparser in search_my_site_parser.py which returns None (like the existing logic for article type exclusion functionality).
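As a rough sketch of what that check could look like (not the project's actual code): the helper names `media_type` and `is_indexable` are hypothetical, and the commented-out call at the bottom assumes customparser receives a scrapy response whose Content-Type header can be read via `response.headers.get`. It also assumes that responses with no content type at all should be skipped, which would need confirming.

```python
# Proposed allow-list of content types to index (from the list above).
ALLOWED_CONTENT_TYPES = frozenset({
    "text/html",
    "text/plain",
    "application/json",
    "text/xml",
    "application/rss+xml",
    "application/xml",
    "application/xhtml+xml",
    "application/atom+xml",
    "application/feed+json",
})


def media_type(content_type_header):
    """Return the bare media type, e.g. 'text/html' from b'text/html; charset=utf-8'.

    Accepts bytes (as scrapy returns header values), str, or None.
    """
    if content_type_header is None:
        return None
    if isinstance(content_type_header, bytes):
        content_type_header = content_type_header.decode("latin-1")
    # Drop parameters such as charset, and normalise case and whitespace.
    value = content_type_header.split(";", 1)[0].strip().lower()
    return value or None


def is_indexable(content_type_header):
    """True if the header's media type is on the allow-list.

    Assumption: a missing or empty Content-Type is treated as not indexable.
    """
    return media_type(content_type_header) in ALLOWED_CONTENT_TYPES


# Hypothetical use at the top of customparser in search_my_site_parser.py
# (the real signature may differ):
#
#     if not is_indexable(response.headers.get("Content-Type")):
#         return None
```

Stripping the parameters before comparing matters because many servers send values like `text/html; charset=utf-8`, which would otherwise never match the allow-list.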
For reference, the full list of currently indexed content types is:
"text/html",91813,
"application/json",1840,
"text/xml",1687,
"application/rss+xml",1532,
"application/xml",1529,
"application/javascript",648,
"text/plain",595,
"application/xhtml+xml",256,
"application/octet-stream",243,
"application/atom+xml",233,
"application/manifest+json",135,
no content type specified,102,
"text/css",89,
"application/feed+json",32,
"text/javascript",26,
"application/pgp-keys",24,
"application/opensearchdescription+xml",20,
"application/x-tex",17,
"text/gemini",17,
"text/x-bibtex",16,
"text/csv",14,
"application/json+oembed",13,
"text/markdown",13,
"application/pgp-signature",12,
"text/calendar",10,
"application/pgp-encrypted",8,
"application/rdf+xml",7,
"text/x-csrc",7,
"text/x-python",6,
"text/x-c",5,
"text/x-diff",5,
"application/stream+json",4,
"application/x-mspublisher",4,
"image/svg+xml",4,
"text/vcard",4,
"text/x-opml+xml",4,
"application/jf2feed+json",3,
"application/rsd+xml",3,
"application/x-httpd-ea-php55",3,
"application/x-sh",3,
"binary/octet-stream",3,
"text/x-c++src",3,
"application/jf2+json",2,
"application/jrd+json",2,
"application/mathematica",2,
"application/pem-certificate-chain",2,
"application/rtf",2,
"application/vnd.wolfram.mathematica.package",2,
"application/x-javascript",2,
"application/x-tcl",2,
"text/x-java-source",2,
"text/x-opml",2,
"text/x-sh",2,
"x-world/x-vrml",2,
"application/mf2+json",1,
"application/opml+xml",1,
"application/pdf",1,
"application/powder+xml",1,
"application/vnd.apple.keynote",1,
"application/x-msdownload",1,
"application/x-perl",1,
"application/x-sql",1,
"application/x-trash",1,
"application/x-x509-ca-cert",1,
"audio/mpegurl",1,
"audio/x-pn-realaudio",1,
"text/prs.fallenstein.rst",1,
"text/vnd.wap.wml",1,
"text/x-java",1,
"text/x-rsrc",1