Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

GH-5149 Lucene numdocs param #5163

Open
wants to merge 2 commits into
base: develop
Choose a base branch
from

Conversation

ate47
Copy link
Contributor

@ate47 ate47 commented Oct 28, 2024

GitHub issue resolved: #5149

Briefly describe the changes proposed in this PR:

I added a virtual property to specify the maximum amount of results the user wants in a LuceneSail search.

For example, with the example in rdf4j.org it gives something like that:

PREFIX search: <http://www.openrdf.org/contrib/lucenesail#>

?subj search:matches [
	      search:query "search terms...";
	      search:numDocs 200 ] .

At the same time, I added the sail parameter maxQueryDocuments to set the maximum amount of documents the user can query. By default, it takes the value of maxDocuments to keep a backward compatibility.

LuceneSail lc = new LuceneSail();
lc.setParameter(LuceneSail.MAX_QUERY_DOCUMENTS_KEY, "200");

PR Author Checklist (see the contributor guidelines for more details):

  • my pull request is self-contained
  • I've added tests for the changes I made
  • I've applied code formatting (you can use mvn process-resources to format from the command line)
  • I've squashed my commits where necessary
  • every commit message starts with the issue number (GH-xxxx) followed by a meaningful description of the change

* query at a time to return from a search query. The default is the value of the {@link #MAX_DOCUMENTS_KEY}
* parameter.
*/
public static final String MAX_QUERY_DOCUMENTS_KEY = "maxQueryDocuments";
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm not sure quite sure about this approach.

I understand the need as:

  • If the user doesn't set a size, then we want some default size.
  • But if the user sets a size, then that size can be both smaller and bigger than a default, but always smaller than some system level max.

Is that a correct understanding?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

You have two parameters:

  1. MAX_DOCUMENTS_KEY
  2. MAX_QUERY_DOCUMENTS_KEY

The first one is the original one, it defines the maximum amount of documents the index should search. But with the new predicate added in this PR, maybe the user would set a limit above this value.

This limit can be both a feature (We increase temporary the amount of results for a query) or a security issue (Because we are removing a limit). So I decided to add the 2nd parameter, this one defines the maximum amount of documents the user can ask using the new predicate.

By default, I took the same value as MAX_DOCUMENTS_KEY because maybe someone will update RDF4J and isn't aware of this change, so he won't set the new parameter.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can you instead have only MAX_DOCUMENTS_KEY and then you can provide a lower default from the client side if the user doesn't provide a value?

So instead of something like:

max docs = 1000
max query docs = 5000
num docs = -1

You can instead do

max docs = 5000
num docs = 1000 (client side sets num docs to 1000 as default and user can increase it to 5000 if needed, or set a lower value)

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

That would require you to add the logic to your own application.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Benefit would be that the MAX_DOCUMENTS setting would always be the maximum, even when the user specifies numDocs.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actually, I would want this PR to fix the issue ec-doris/kohesio-backend#268, which requires overwriting the original max docs parameter at query time, both above and below its original value.

Also, it might break previous configs in other projects used to the previous MAX_DOCUMENTS logic, so I'm not sure about this way.

Copy link
Contributor

@hmottestad hmottestad Nov 7, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The previous MAX_DOCUMENTS logic wouldn't change. It would still be the absolute upper limit. But if you want to allow users to go above this limit, then you would have to increase it and then specify your own numDocs on your requests if you still want some queries to return fewer than MAX_DOCUMENTS by default.

Copy link
Contributor

@hmottestad hmottestad Nov 7, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

How about introducing DEFAULT_NUM_DOCS that would be used instead of MAX_DOCUMENTS, unless numDocs is provided by the user.

So:

  • DEFAULT_NUM_DOCS = Math.min(MAX_DOCUMENTS, System.getProperty(...))
  • if numDocs is not provided by the user, use DEFAULT_NUM_DOCS
  • else limit numDocs to MAX_DOCUMENTS (e.g. Math.min(numDocs, MAX_DOCUMENTS)

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The benefit is that we only have a single MAX deal with, instead of two different MAXs depending on if the user provides a numDocs or not.

@hmottestad
Copy link
Contributor

This looks very good. Thanks for the tests!

I've added some comments. In particular I'm a bit unsure about the introduction of a second max config option.

@ate47 ate47 force-pushed the GH-5149-lucene-numdocs-param branch from ffa582e to 3e6454e Compare November 5, 2024 13:10
@ate47 ate47 requested a review from hmottestad November 5, 2024 14:47
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants