Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Index sections #165

Merged
merged 25 commits into from
Nov 3, 2023
Merged

Index sections #165

merged 25 commits into from
Nov 3, 2023

Conversation

Thejas-bhat
Copy link
Member

@Thejas-bhat Thejas-bhat commented Jul 18, 2023

  • Introduces the concept of section which talks about reformating the index file as a set of sections each section corresponding to a certain type of index.
  • Refactors the inverted index data i.e. the term dictionary, postings list etc. as part of a section.

@Thejas-bhat Thejas-bhat mentioned this pull request Jul 18, 2023
@abhinavdangeti abhinavdangeti self-requested a review July 18, 2023 19:14
write.go Outdated Show resolved Hide resolved
@Thejas-bhat Thejas-bhat marked this pull request as ready for review September 15, 2023 09:29
build.go Outdated Show resolved Hide resolved
build.go Outdated Show resolved Hide resolved
section_inverted_index.go Outdated Show resolved Hide resolved
section_inverted_index.go Outdated Show resolved Hide resolved
write.go Outdated Show resolved Hide resolved
Copy link
Member

@abhinavdangeti abhinavdangeti left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Couple concerns around API signatures and offset expectations.

numDocs: numDocs,
storedIndexOffset: storedIndexOffset,
fieldsIndexOffset: sectionsIndexOffset,
sectionsIndexOffset: sectionsIndexOffset,
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

How come the fieldsIndexOffset and sectionsIndexOffset are the same here?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

essentially sections index content contains the same as fields index + sections information. so, for the fieldsIndexOffset which gives info about the fields in a segment we just consult the sectionsIndexOffset to get that info. Also, the field is needed for backward compatibility purposes as well.

@@ -74,14 +73,14 @@ func mergeSegmentBases(segmentBases []*SegmentBase, drops []*roaring.Bitmap, pat
// wrap it for counting (tracking offsets)
cr := NewCountHashWriterWithStatsReporter(br, s)

newDocNums, numDocs, storedIndexOffset, fieldsIndexOffset, docValueOffset, _, _, _, err :=
newDocNums, numDocs, storedIndexOffset, fieldsIndexOffset, docValueOffset, _, _, _, sectionsIndexOffset, err :=
MergeToWriter(segmentBases, drops, chunkMode, cr, closeCh)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Let's try and keep the signature of InitSegmentBase and MergeToWriter similar in terms of the index offsets.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Did you mean interm.convert()'s signature over here?

segment.go Outdated Show resolved Hide resolved
Copy link
Member

@metonymic-smokey metonymic-smokey left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The unit test TestMergeWithEmptySegment seems to fail for me with these new changes - must be an issue on my end since it passes in CI

section.go Outdated Show resolved Hide resolved
// as part of the merge API, write the merged data to the writer and also track
// the starting offset of this newly merged section data.
Merge(opaque map[int]resetable, segments []*SegmentBase, drops []*roaring.Bitmap, fieldsInv []string,
newDocNumsIn [][]uint64, w *CountHashWriter, closeCh chan struct{}) error
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

newDocNumsIn should also have uint32 no? like newDocNums[3][4] = x will mean the new Doc Num for a document in segment 3 with docnumber 4 will be x. So x has to be uint32 only.

* initial sections code for dense vector index

---------

Co-authored-by: Abhi Dangeti <abhinav@couchbase.com>
@Thejas-bhat
Copy link
Member Author

the main reason why I kept the docNum's type to be uint64 was to ensure that we don't break anything around the existing code - I was being conservative to make additional changes to the existing function signatures etc. do you all think its better to finalize whether to use uint32 or uint64 for docNum (after testing it better for eg) in a later PR? or do you want me to include this in the current PR?

@abhinavdangeti
Copy link
Member

the main reason why I kept the docNum's type to be uint64 was to ensure that we don't break anything around the existing code - I was being conservative to make additional changes to the existing function signatures etc. do you all think its better to finalize whether to use uint32 or uint64 for docNum (after testing it better for eg) in a later PR? or do you want me to include this in the current PR?

@Thejas-bhat I pushed that last commit up to unbreak the build. The type of the faissIndexSection's Process didn't match with the section interface's Process. We can talk about changing the type later if really needed.

@abhinavdangeti abhinavdangeti merged commit 7945e8e into master Nov 3, 2023
6 checks passed
@abhinavdangeti abhinavdangeti deleted the refactor-section branch January 10, 2024 22:34
moshaad7 pushed a commit that referenced this pull request Sep 12, 2024
* index sections
* faiss index section + inverted text index section

Co-authored-by: Marty Schoch <marty.schoch@gmail.com>
Co-authored-by: Abhi Dangeti <abhinav@couchbase.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Projects
None yet
Development

Successfully merging this pull request may close these issues.

5 participants