The Search is powered by DocSearch v3 which is a free and automated solution provided by Algolia.
We index the following pages:
- Non-product pages, e.g. `/contributing/`
- Plugin Hub, except for basic-example pages, i.e. `/**/how-to/basic-example/**/`
- Every product page in every available version, including `latest` and non-released versions, e.g. `dev`
- OAS reference pages
DocSearch is split into two components:
- Algolia Crawler: automated web scraping program that extracts content from the site and updates the index. It runs once a week, but it can also be manually triggered from the Dashboard.
- Algolia Autocomplete: frontend library that provides an immersive search experience through a modal.
The Crawler's configuration, including which URLs to crawl, how to extract data from the pages, and its schedule, now lives in the Crawler's Dashboard.
From Algolia's docs:
> The Crawler is an automated web scraping program. When given a set of start URLs, it visits and extracts content from those pages. It then visits URLs these pages link to, and the process repeats itself for all linked pages. With little configuration the Crawler can populate and maintain Algolia indices for you by periodically extracting content from your web pages.
The Crawler comes with a Dashboard (credentials in 1Password: search for `Algolia - Team Docs`), in which we can edit the Crawler's configuration, trigger new crawls, test URLs, and analyze results.
The Crawler has a configuration file that defines a Crawler object which is made of top-level parameters, actions, and index settings.
Some noteworthy params:
- `startUrls`: the URLs that your Crawler starts on. Set to `https://docs.konghq.com/`.
- `schedule`: how often a complete crawl should be performed. Currently once a week, but a crawl can also be triggered manually when needed.
- `ignoreNoIndex` and `ignoreCanonicalTo`: both set to `true` so we can index older versions of product pages.
- `renderJavaScript`: the list of URLs that require JavaScript to render. It contains the list of OAS reference pages.
- `actions`: an action indicates a subset of your targeted URLs that you want to extract records from in a specific way.
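Put together, the top-level configuration looks roughly like the sketch below; all values here are illustrative, and the real configuration lives in the Crawler's Editor:

```js
// Rough sketch of the top-level Crawler configuration described above.
// All values are illustrative, not the exact production config.
new Crawler({
  startUrls: ["https://docs.konghq.com/"],
  schedule: "at 09:00 on monday", // illustrative; a complete crawl runs once a week
  ignoreNoIndex: true,            // lets us index older product versions
  ignoreCanonicalTo: true,
  renderJavaScript: [
    // OAS reference pages that need JavaScript to render (illustrative entry)
    "https://docs.konghq.com/gateway/api/admin-ee/**/",
  ],
  actions: [
    // One action per product, one for OAS pages, and a few for specific pages
  ],
});
```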
When a crawl starts, the Crawler adds the `startUrls` to its URL database, fetches all the linked pages, and extracts their content. Pages can also be ignored or skipped.
Pages are processed in five main steps:
- A page is fetched.
- Links and records are extracted from the page.
- The extracted records are indexed to Algolia.
- The extracted links are added to the Crawler’s URL database.
- For each new, non-excluded page added to the database, the process is repeated.
We use different actions for processing pages in different ways, based on the `pathsToMatch` attribute. It uses micromatch for pattern matching and supports negation, wildcards, etc.
Note: URLs should match only one action; otherwise, the Crawler returns an error when it attempts to parse the page. We define one action for each product, one for OAS pages, and a few others for specific pages.
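As a rough illustration of how these glob patterns behave (a standalone snippet that calls the micromatch package directly; it is not part of the Crawler configuration):

```js
const micromatch = require("micromatch");

// A Gateway product page matches the Gateway glob...
micromatch.isMatch(
  "https://docs.konghq.com/gateway/latest/install/",
  "https://docs.konghq.com/gateway/**/**/"
); // => true

// ...while an unrelated page does not.
micromatch.isMatch(
  "https://docs.konghq.com/contributing/",
  "https://docs.konghq.com/gateway/**/**/"
); // => false
```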
Once the Crawler finds the matching action for a page based on its URL, it extracts records from the page using the `recordExtractor`. It provides a `docsearch` helper, which automatically extracts content and formats it to be compatible with DocSearch. See the DocSearch docs for a detailed explanation of how to configure this parameter.
We also set custom variables on each record so that we can use them to filter the search results. For each page, we extract or set its `product` and `version`.
For example, Gateway OAS pages match the following action:
```js
{
  indexName: "konghq",
  pathsToMatch: [
    "https://docs.konghq.com/gateway/api/admin-ee/**/",
    "https://docs.konghq.com/gateway/api/admin-oss/**/",
    "https://docs.konghq.com/gateway/api/status/**/",
  ],
  recordExtractor: ({ url, $, contentLength, fileType, helpers }) => {
    let segments = url.pathname.split("/");
    // extract the version from the URL
    let versionId = segments[segments.length - 2] || "latest";
    return helpers.docsearch({
      recordProps: {
        content: [".app-container p"],
        lvl0: {
          selectors: "header h2",
          defaultValue: "Kong",
        },
        lvl1: [".app-container h1"],
        lvl2: [".app-container h2"],
        lvl3: [".app-container h3"],
        lvl4: [".app-container h4"],
        product: {
          defaultValue: "Kong Gateway", // Manually set the product
        },
        version: {
          defaultValue: versionId,
        },
      },
      aggregateContent: true,
      recordVersion: "v3",
      indexHeadings: true,
    });
  },
},
```
The `docsearch` helper creates a hierarchical structure for every record it extracts. We use headings for levels `lvl1` to `lvl4` and multiple CSS selectors for the `content`. These CSS selectors may vary from action to action, given that some pages use a different layout.
The `recordExtractor` uses a Cheerio instance under the hood to manipulate the DOM, and custom JavaScript code can be executed in the body of the `recordExtractor` function, which can be used to debug CSS selectors by using `console.log()`.
We set both `aggregateContent` and `indexHeadings` to `true`. This setup creates records for headings, and all the nested elements under a heading are also indexed and associated with it hierarchically.
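For instance, a quick way to check what a selector matches while testing a page (a hypothetical snippet; the selector is only an example and the `recordProps` are elided):

```js
recordExtractor: ({ url, $, helpers }) => {
  // Temporary debugging: print what the selector matches for the crawled page.
  console.log(url.pathname, $(".content h1").first().text().trim());

  return helpers.docsearch({
    recordProps: {
      // ...same recordProps as the action being debugged
    },
    aggregateContent: true,
    recordVersion: "v3",
    indexHeadings: true,
  });
},
```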
The action for Gateway pages looks like this; note how the CSS selectors differ from the OAS action above (product pages share the same layout, though):
```js
{
  indexName: "konghq",
  pathsToMatch: [
    "https://docs.konghq.com/gateway/**/**/",
    // We exclude the API pages, they are handled in a different action
    "!https://docs.konghq.com/gateway/api/**/**/",
    // Exclude the changelog, handled in a different action
    "!https://docs.konghq.com/gateway/changelog/",
  ],
  recordExtractor: ({ url, $, contentLength, fileType, helpers }) => {
    // Extract the version from the dropdown
    let versionId = $("#version-list .active a").data("versionId") || "";
    // Special case for `latest` version
    if (
      url.toString().startsWith("https://docs.konghq.com/gateway/latest/")
    ) {
      versionId = "latest";
    }
    return helpers.docsearch({
      recordProps: {
        content: [".content p, .content li"],
        lvl0: {
          selectors: ".docsets-dropdown > .dropdown-button > span",
          defaultValue: "Kong",
        },
        lvl1: [".content h1"],
        lvl2: [".content h2"],
        lvl3: [".content h3"],
        lvl4: [".content h4"],
        product: {
          // Extract the product from the dropdown
          selectors: ".docsets-dropdown > .dropdown-button > span",
        },
        version: {
          defaultValue: versionId.toString(),
        },
      },
      aggregateContent: true,
      recordVersion: "v3",
      indexHeadings: true,
    });
  },
},
```
For more information, check out Algolia's Extracting Data How-To Guide.
Our plan has a limit of 1,000,000 records.
We refine the search results by using a combination of facets and optional filters. A Jekyll generator sets the corresponding filters on every page; this is because records related to the current user's context need to be ranked higher than others.
The `optionalFilters` give a higher score to records belonging to specific products. The main advantage of this type of filter is that it can promote or demote certain records without filtering out records from the result set.
Here's a map indicating which products get a higher score based on the page from which the search was initiated. For example, if the search is initiated from any `gateway` doc, we give a higher score to records belonging to `Kong Gateway`, `Plugin Hub`, and `deck`, in that order.
A `facetFilter` filters the search results so that they have the same `version` as the page from which the search was initiated.
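As a hypothetical illustration of what the generated per-page parameters could look like (the attribute names follow Algolia's filter syntax; the values are examples, and the real ones are set by the Jekyll generator):

```js
// Illustrative per-page search parameters; actual values are generated per page.
const searchParameters = {
  // Hard filter: only return records with the same version as the current page.
  facetFilters: ["version:latest"],
  // Soft boost: promote these products without excluding any other records.
  optionalFilters: [
    "product:Kong Gateway<score=3>",
    "product:Plugin Hub<score=2>",
    "product:deck<score=1>",
  ],
};
```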
The list of searchable attributes can be found here. We use the default attribute list defined by DocSearch, i.e., `content` and headings `lvl0` to `lvl5`, but we modified their priority order so that `content` has the highest priority, followed by `lvl5` down to `lvl0`.
We found this priority to work best for us, mainly because of the types of queries we get and how our headings and content are written. We want records with paragraphs containing the query terms to rank higher than records that include them only in the headings.
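In terms of index settings, the resulting order is roughly the following (a sketch; the attribute names follow the DocSearch record schema, and the exact production setting may differ):

```js
// Sketch of the searchableAttributes priority described above (highest first).
searchableAttributes: [
  "unordered(content)",
  "unordered(hierarchy.lvl5)",
  "unordered(hierarchy.lvl4)",
  "unordered(hierarchy.lvl3)",
  "unordered(hierarchy.lvl2)",
  "unordered(hierarchy.lvl1)",
  "unordered(hierarchy.lvl0)",
],
```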
We defined a list of synonyms to improve the search results, in particular for acronyms, e.g., `mTLS`, `RBAC`, etc.
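For example, a single entry could look like this (a hypothetical entry following Algolia's regular synonym format; the actual list is maintained in the dashboard):

```js
// Hypothetical synonym entry; the real synonyms are defined in the Algolia dashboard.
{
  objectID: "mtls",
  type: "synonym",
  synonyms: ["mTLS", "mutual TLS", "mutual transport layer security"],
}
```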
We defined a specific set of rules that change the search results based on different conditions.
Two rules pin specific records (API Reference pages) to the top of the results whenever the query contains `<product> API`. The terms `<product> API` are mentioned in several places, so we wanted to increase the discoverability of the API reference pages.
The other rule handles empty queries on the search page. By default, InstantSearch always shows results, even when the query is empty. This rule pins a curated list of records when an empty query comes from `/search/`.
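Pinning rules like these follow Algolia's Rules format. Below is a hypothetical sketch (the promoted objectID is a placeholder); the real rules are maintained in the dashboard:

```js
// Hypothetical sketch of a rule pinning an API reference page for "<product> API" queries.
{
  objectID: "pin-gateway-api-reference",
  conditions: [{ pattern: "gateway api", anchoring: "contains" }],
  consequence: {
    promote: [
      { objectID: "<objectID of the Gateway API reference record>", position: 0 },
    ],
  },
}
```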
A new crawl can be manually triggered, e.g., whenever there's a new release, by clicking the Restart crawling button in the Crawler's Dashboard.
The Crawler's editor provides a URL tester, which can be used to debug which action handles a specific URL, the CSS selectors used by the extractor, the custom variables, and which links and records are extracted from the page.
Whenever a new version of a product is released, we need to trigger a crawl manually.
Adding a new product requires updating the Crawler's configuration file using the Editor and adding a new entry to our custom ranking (if needed).
A new action needs to be created for the product. Copying another product's action is the best place to start, given that they share most of the logic. Make sure that the `pathsToMatch` matches the new product pages and that the `product` and `version` custom variables are set.
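As a starting point, a copied action could look like the sketch below, which mirrors the Gateway action and assumes a hypothetical product living under /new-product/ (paths, product name, and selectors would need to be adjusted):

```js
{
  indexName: "konghq",
  // Hypothetical paths for the new product
  pathsToMatch: ["https://docs.konghq.com/new-product/**/**/"],
  recordExtractor: ({ url, $, contentLength, fileType, helpers }) => {
    // Extract the version from the dropdown, as in the Gateway action
    let versionId = $("#version-list .active a").data("versionId") || "";
    if (url.toString().startsWith("https://docs.konghq.com/new-product/latest/")) {
      versionId = "latest";
    }
    return helpers.docsearch({
      recordProps: {
        content: [".content p, .content li"],
        lvl0: {
          selectors: ".docsets-dropdown > .dropdown-button > span",
          defaultValue: "Kong",
        },
        lvl1: [".content h1"],
        lvl2: [".content h2"],
        lvl3: [".content h3"],
        lvl4: [".content h4"],
        product: {
          // Extract the product from the dropdown (or set a defaultValue manually)
          selectors: ".docsets-dropdown > .dropdown-button > span",
        },
        version: {
          defaultValue: versionId.toString(),
        },
      },
      aggregateContent: true,
      recordVersion: "v3",
      indexHeadings: true,
    });
  },
},
```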
Adding a new OAS page requires updating the Crawler's configuration file using the Editor.
The new URL needs to be added to the `renderJavaScript` list, and if an action already exists for the corresponding product, the URL needs to be added to that action's `pathsToMatch`. If an action doesn't exist, a new one needs to be created.
For example, if `https://docs.konghq.com/gateway/api/new-api-spec/latest/` were to be added, we add it to the end of the `renderJavaScript` list:
```js
renderJavaScript: [
  ...,
  "https://docs.konghq.com/gateway/api/new-api-spec/**/"
]
```
Then we look for an action that handles all the OAS pages for Gateway. If there is one, we add the URL to the corresponding `pathsToMatch`:
```js
{
  pathsToMatch: [
    // Existing list..
    "https://docs.konghq.com/gateway/api/admin-ee/**/",
    "https://docs.konghq.com/gateway/api/admin-oss/**/",
    "https://docs.konghq.com/gateway/api/status/**/",
    "https://docs.konghq.com/gateway/api/new-api-spec/**/", // <- New URL
  ],
  ...
}
```
By default, Algolia doesn't index code snippets. However, there are a few cases in which we would like to index them, for example code snippets in Troubleshooting sections or ones used to highlight error messages/codes. Users should be able to copy and paste errors into the search bar and find meaningful results.
To achieve this, we need to add a specific CSS class (`.algolia-index-code-snippet`) to the code snippets we want to index, so we can tell the Crawler to extract them.
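One way this could look in an action (a sketch; the exact production selector list may differ) is to include the class in the `content` selectors:

```js
recordProps: {
  // Also extract code snippets that opted in via the CSS class.
  content: [".content p, .content li, .content .algolia-index-code-snippet"],
  // ...remaining recordProps unchanged
},
```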