Mango covering JSON indexes RFC #4410
Conversation
Nice RFC, it explains the concept clearly. +1 to covering indexes; this will be a great feature.
Force-pushed 62d1639 to 56ae185
```
{
  "index": {
    "fields": [ "age", "name" ],
    "include": [ "occupation", "manager_id" ]
```
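For context, a query like the following (my own hypothetical example, not taken from the RFC) could be answered entirely from such an index: the selector and sort use only `fields`, and every projected field is drawn from `fields` plus `include`:

```json
{
  "selector": { "age": { "$gt": 30 } },
  "sort": [ "age", "name" ],
  "fields": [ "age", "name", "occupation", "manager_id" ]
}
```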
The order of the fields in `include` is not meaningful in any way? Should we add a note highlighting it in the API, just for completeness.

As an implementation detail, perhaps we'd just want to normalize it by sorting when creating the design doc and the view signature. That would mean that two indexes with the same details and only the `include` in a different order would be equivalent and "point to" the same view signature.
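A minimal sketch of that normalization (illustrative Python, not Mango's actual Erlang; the function name is made up):

```python
def normalize_index(index):
    """Canonicalise a Mango JSON index definition (illustrative sketch).

    `fields` keeps its order, since it determines the sort order of the
    index; `include` is sorted, since its order carries no meaning. Two
    definitions differing only in `include` order thus share one
    canonical form, and could map to the same view signature.
    """
    return {
        "fields": list(index["fields"]),
        "include": sorted(index.get("include", [])),
    }

a = normalize_index({"fields": ["age", "name"],
                     "include": ["occupation", "manager_id"]})
b = normalize_index({"fields": ["age", "name"],
                     "include": ["manager_id", "occupation"]})
assert a == b  # equivalent definitions normalize identically
```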
```
- Unlike `fields`, the fields in `include` _do not have to exist_ in the source
  document in order that the document be included in the index. This is to
  allow the index to cover more queries.
- Including a deeply nested field would follow the same pattern as for other
```
Wonder if there is any value in applying limits, or would that be over-complicating it? We'd have to decide what happens when the limit is reached: crash or ignore and fall back to just loading the doc.
Goodness -- great point 🎉
I think that we could validate the length of the list of fields when the ddoc is updated, rather than failing during indexing. We could also limit the depth by counting the `.` characters.
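Counting `.` separators is cheap to sketch (illustrative Python; Mango itself is Erlang, and real field names may contain escaped dots, which a production check would need to handle):

```python
def include_depth(field):
    """Depth of a dotted Mango field path (illustrative sketch).

    "address.geo.lat" nests three levels: one per '.'-separated part.
    """
    return field.count(".") + 1

assert include_depth("age") == 1
assert include_depth("address.geo.lat") == 3
```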
Some things we can only validate at index time, such as the length of included strings. I think here we might want to place a limit on the total size of the values, say 32kb. Even that's quite a few disk pages, though hopefully they are sequential on disk so the kernel is prefilling the page cache ahead of us.
Given it's easier to start with limits and increase them later, perhaps we should think about this more deeply. In a view index we allow ~anything I believe, but here potentially we could be more conservative.
As an example, Postgres limits indexes to 32 columns. Its max field size is 1GB; I think we'd like something a little smaller 😬
Are there other limits here?
My thought is that we do limit, and make it configurable, and perhaps start relatively low for the defaults:

- `mango_json_index_include_fields_max=16` (why 16? Powers of two always sound nice)
- `mango_json_index_include_depth_max=8`
- `mango_json_index_include_size_bytes_max=32768` (32kb)
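In an ini-style configuration file those hypothetical defaults might be written as follows (the option names are exactly as proposed above; none of these are shipped settings):

```ini
; proposed limits from this thread, not existing CouchDB settings
mango_json_index_include_fields_max = 16
mango_json_index_include_depth_max = 8
mango_json_index_include_size_bytes_max = 32768
```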
We can enforce `mango_json_index_include_fields_max` and `mango_json_index_include_depth_max` in `_index`. (We may have to belt-and-braces this, as the user can go behind Mango's back to upload views that are the "right shape".)

`mango_json_index_include_size_bytes_max` would need to be checked per document at index time. I worry about what the behaviour should be here -- I see options of marking the whole index bad; having rows with "missing" value fields, meaning complexity during query; or skipping indexing the document entirely. I lean towards skipping the doc as the least likely to cause unpredictable behaviour, but what's the current behaviour for views if indexing a doc fails?
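The per-document check under the skip-the-doc option might look like this (illustrative Python sketch; the function name, size measurement, and skip behaviour are all assumptions under discussion here, not Mango's implementation):

```python
import json

def included_values(doc, include_fields, size_limit=32768):
    """Collect `include` values for one index row, or None to skip the doc.

    Sizes are measured on the JSON encoding of the included values; a
    doc whose values exceed `size_limit` is skipped entirely, the option
    leaned towards above (the alternatives are debated in this thread).
    """
    values = {f: doc.get(f) for f in include_fields}
    encoded = json.dumps(values).encode("utf-8")
    if len(encoded) > size_limit:
        return None  # skip the doc; a real indexer would also log this
    return values

assert included_values({"occupation": "nurse"}, ["occupation"]) == {"occupation": "nurse"}
assert included_values({"notes": "x" * 40000}, ["notes"]) is None
```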
I've added a section on limits for `include` in 91f99be. I think that the topic is really important and it's a bit facepalm that I skipped it. Too excited about writing Erlang I guess 😬 What do you think?
Looks great. Thank you!
I am not sure what the behavior for `mango_json_index_include_size_bytes_max` should be either. It's pretty tricky. Off the top of my head I don't know what currently happens if we fail to index a document. With Mango we don't usually fail; if I had to guess, we'd end up in a crash loop on that document. A user might then index for a day, hit a large field value, and their index crashes; they'd remove the field, try again (getting a new view signature), index for two days and crash, etc.

The alternative is to skip the doc, but then there is a danger of it looking like data loss: a user has some very important document (a medical record, say allergies to medicine), the index skips it, and nobody notices until the patient is prescribed the wrong medication (sorry, being a bit dramatic here with a silly user story). Maybe that's something we solve outside of the RFC, and we just say that if the value made it past the max_document_size limit then it will be indexed or crash horribly...?
What if, instead of storing plain values in rows, we'd also allow storing a marker which would indicate that a value was too large, and just fall back to reading the doc? We can still have a hard-limit (strict) mode perhaps which deals with failures, but this would allow us not to deal with indexing failures, so to speak, and punt them for later.

There may be an optimal cut-off value where storing it in the index is more wasteful than reading the doc? Or there may not be, as technically I think we can write arbitrarily large values in the index; it will just spread over a lot of b-tree blocks, but we'd still duplicate the data.
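A sketch of the marker idea (illustrative Python; the marker shape and function names are invented for this example, nothing here is Mango's actual implementation):

```python
import json

TOO_LARGE = {"$truncated": True}  # hypothetical marker stored in the row

def row_value(value, size_limit=32768):
    """Store the value in the index row, or a marker if it is too large."""
    if len(json.dumps(value).encode("utf-8")) > size_limit:
        return TOO_LARGE  # indexing never fails; we defer the cost to query time
    return value

def read_field(stored, field, fetch_doc):
    """Serve a covered field, falling back to the primary doc on the marker."""
    if stored == TOO_LARGE:
        return fetch_doc()[field]  # extra trip to primary data
    return stored

assert row_value("short") == "short"
assert row_value("x" * 40000) == TOO_LARGE
assert read_field(TOO_LARGE, "bio", lambda: {"bio": "big"}) == "big"
```

This is exactly the trade-off raised below: the fallback keeps indexing infallible, at the cost of making query latency less predictable.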
> What if instead of storing plain values in rows we'd allow also storing a marker which would indicate that a value was too large, and just fallback to reading the doc?
I'd like to avoid doing this if we can -- while nice in some ways, this approach makes it more difficult for a customer to understand the performance of their query. I'd like the performance of queries to be predictable. My feeling is that large fields should require a trip to primary data, and that enforcing smaller value sizes keeps it more likely that indexes can live in RAM, where, I think, they should be.
We discussed this a bit in the developer meeting yesterday.
First, we compared this to how this would manifest in M/R views, and the example we used was what if a doc causes a JS exception to be thrown. What we do there is skip the doc and not include any result rows. We do log this in couch.log but it is not surfaced to the HTTP API user.
We went through the motions of whether to add fields inline with the result and also to record a metric for this, and concluded from experience that most people do not really care about these things.
However some folks take this seriously, and we want to accommodate those. Imagine this:
- We increment a metric each time a Mango covering index can't include a doc because of some limit.
- A user sees this number growing and finds out that this isn't good.
- Next they want to know which document caused this, and we can point them to couch.log where this should appear (TODO: decide log level).
- Finally, it'd be great if there was a variant of `_explain`, say `POST /db/_explain?include_result=true`, which returns the result like normal, but rows that are missing have an error object in them (or maybe we just show the error rows).
So some handwaving to be defined away, but: let’s do this right this time :)
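To make the last bullet concrete, the hypothetical `include_result=true` response might carry error objects in place of skipped rows; this shape is entirely speculative, sketched only to anchor the discussion:

```json
{
  "rows": [
    { "id": "doc1", "key": [33, "ann"], "value": { "occupation": "nurse" } },
    { "id": "doc2", "error": "include_size_limit_exceeded" }
  ]
}
```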
Propose covering indexes within Mango, for JSON indexes, and outline the implementation steps.
Also correct a few typos.
Force-pushed 6f5dd5f to c6ea642
```
[TIP]: # ( Artwork may be attached to the submission and linked as necessary. )
[TIP]: # ( ASCII artwork can also be included in code blocks, if desired. )

This would take place within `mango_view_cursor.erl`. The key functions
```
That is `mango_cursor_view.erl`.
```
## Phase 1: handle keys-only covering indexes

Within `execute/3` we will need to decide whether the view should be requested
```
I guess that is `mango_cursor_view:execute/3`. It might be worth using fully qualified names to aid readability.
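The core of that decision reduces to a set check (illustrative Python, not the actual Erlang in `mango_cursor_view`; a real implementation would also need the selector's fields to be satisfiable from the index keys, and to handle `_id` and nested paths):

```python
def is_covered(projected_fields, indexed_fields, include_fields):
    """True when every field the query returns is available in the index,
    either as an index key or as an included value; sketches only the
    projection side of the covering decision."""
    return set(projected_fields) <= set(indexed_fields) | set(include_fields)

# Against the example index: fields [age, name], include [occupation, manager_id]
assert is_covered(["age", "occupation"], ["age", "name"],
                  ["occupation", "manager_id"])
assert not is_covered(["salary"], ["age", "name"],
                      ["occupation", "manager_id"])
```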
Overview

RFC for covering indexes in Mango. Propose covering indexes within Mango, for JSON indexes, and outline the implementation steps. I will open a [DISCUSS] thread on the mailing list around this too.

Related Issues or Pull Requests
#4394 is a pre-req for this work, because without it we always need to read the full doc to return to the coordinator.