-
Notifications
You must be signed in to change notification settings - Fork 164
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
GH-5149 Lucene numdocs param #5163
base: develop
Are you sure you want to change the base?
Conversation
…s param in LuceneSail query
211c0f6
to
41d4869
Compare
...ail/elasticsearch/src/main/java/org/eclipse/rdf4j/sail/elasticsearch/ElasticsearchIndex.java
Show resolved
Hide resolved
* query at a time to return from a search query. The default is the value of the {@link #MAX_DOCUMENTS_KEY} | ||
* parameter. | ||
*/ | ||
public static final String MAX_QUERY_DOCUMENTS_KEY = "maxQueryDocuments"; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'm not sure quite sure about this approach.
I understand the need as:
- If the user doesn't set a size, then we want some default size.
- But if the user sets a size, then that size can be both smaller and bigger than a default, but always smaller than some system level max.
Is that a correct understanding?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
You have two parameters:
MAX_DOCUMENTS_KEY
MAX_QUERY_DOCUMENTS_KEY
The first one is the original one, it defines the maximum amount of documents the index should search. But with the new predicate added in this PR, maybe the user would set a limit above this value.
This limit can be both a feature (We increase temporary the amount of results for a query) or a security issue (Because we are removing a limit). So I decided to add the 2nd parameter, this one defines the maximum amount of documents the user can ask using the new predicate.
By default, I took the same value as MAX_DOCUMENTS_KEY
because maybe someone will update RDF4J and isn't aware of this change, so he won't set the new parameter.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Can you instead have only MAX_DOCUMENTS_KEY and then you can provide a lower default from the client side if the user doesn't provide a value?
So instead of something like:
max docs = 1000
max query docs = 5000
num docs = -1
You can instead do
max docs = 5000
num docs = 1000 (client side sets num docs to 1000 as default and user can increase it to 5000 if needed, or set a lower value)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
That would require you to add the logic to your own application.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Benefit would be that the MAX_DOCUMENTS setting would always be the maximum, even when the user specifies numDocs.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Actually, I would want this PR to fix the issue ec-doris/kohesio-backend#268, which requires overwriting the original max docs parameter at query time, both above and below its original value.
Also, it might break previous configs in other projects used to the previous MAX_DOCUMENTS logic, so I'm not sure about this way.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The previous MAX_DOCUMENTS logic wouldn't change. It would still be the absolute upper limit. But if you want to allow users to go above this limit, then you would have to increase it and then specify your own numDocs on your requests if you still want some queries to return fewer than MAX_DOCUMENTS by default.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
How about introducing DEFAULT_NUM_DOCS that would be used instead of MAX_DOCUMENTS, unless numDocs
is provided by the user.
So:
- DEFAULT_NUM_DOCS = Math.min(MAX_DOCUMENTS, System.getProperty(...))
- if numDocs is not provided by the user, use DEFAULT_NUM_DOCS
- else limit numDocs to MAX_DOCUMENTS (e.g. Math.min(numDocs, MAX_DOCUMENTS)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The benefit is that we only have a single MAX deal with, instead of two different MAXs depending on if the user provides a numDocs or not.
core/sail/lucene/src/main/java/org/eclipse/rdf4j/sail/lucene/impl/LuceneIndex.java
Outdated
Show resolved
Hide resolved
core/sail/lucene/src/main/java/org/eclipse/rdf4j/sail/lucene/impl/LuceneIndex.java
Show resolved
Hide resolved
...tes/lucene/src/main/java/org/eclipse/testsuite/rdf4j/sail/lucene/AbstractLuceneSailTest.java
Outdated
Show resolved
Hide resolved
...tes/lucene/src/main/java/org/eclipse/testsuite/rdf4j/sail/lucene/AbstractLuceneSailTest.java
Outdated
Show resolved
Hide resolved
...tes/lucene/src/main/java/org/eclipse/testsuite/rdf4j/sail/lucene/AbstractLuceneSailTest.java
Outdated
Show resolved
Hide resolved
This looks very good. Thanks for the tests! I've added some comments. In particular I'm a bit unsure about the introduction of a second |
ffa582e
to
3e6454e
Compare
GitHub issue resolved: #5149
Briefly describe the changes proposed in this PR:
I added a virtual property to specify the maximum amount of results the user wants in a LuceneSail search.
For example, with the example in rdf4j.org it gives something like that:
At the same time, I added the sail parameter
maxQueryDocuments
to set the maximum amount of documents the user can query. By default, it takes the value ofmaxDocuments
to keep a backward compatibility.PR Author Checklist (see the contributor guidelines for more details):
mvn process-resources
to format from the command line)