-
Notifications
You must be signed in to change notification settings - Fork 40
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Add Path Search Feature to Qlever #1335
base: master
Are you sure you want to change the base?
Conversation
Current TODOs:
|
Codecov ReportAttention: Patch coverage is
Additional details and impacted files@@ Coverage Diff @@
## master #1335 +/- ##
==========================================
- Coverage 89.06% 89.01% -0.05%
==========================================
Files 328 330 +2
Lines 29294 29763 +469
Branches 3262 3335 +73
==========================================
+ Hits 26090 26494 +404
- Misses 2054 2108 +54
- Partials 1150 1161 +11 ☔ View full report in Codecov by Sentry. |
|
This reverts commit d700df2.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
A first round of comments for the parsing/planning stuff.
I still have to dig through the more complicated boost-graph-stuff.
src/parser/GraphPatternOperation.cpp
Outdated
auto simpleTriple = triple.getSimple(); | ||
TripleComponent predicate = simpleTriple.p_; | ||
TripleComponent object = simpleTriple.o_; | ||
AD_CORRECTNESS_CHECK(predicate.isIri()); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This should be cleaned up.
- Store the result of
toStringRepresentation
to get rid of the redundant code. - make a lambda
setVariable
so that you can do something like
if (iriString.ends_with("start") { setVariable(start);); // object is captured by the lambda.
3. You should handle the case, that there are duplicate triples (e.g. something you want to set is not nullopt anymore etc).
4. You shouldn't do `AD_...CHECK` here, as this is not a programming error if it happens, but may happen in the query.
You should throw proper exceptions with proper strings.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'm unsure how to implement the lambda you suggest. Assuming the signature of the lambda is something like void setVariable(Variable var)
, I don't see how the lambda can choose to set the correct member of the PathQuery. The only two options I see is passing a pointer to the member such as start_
or using a map, such that variable can be set by parameter name (map[param] = object.getVariable()
). I don't think the amount of duplication justifies either of these solutions.
I'm also aware that I could capture the PathQuery (this
), but I'm not sure how that would help here.
I am aware that this is a lot of duplicate code. I implemented a lambda similar to what you suggested to get rid of some the duplication. While I was at it, I also improved the error handling.
src/engine/PathSearch.h
Outdated
* @param endNodes A span of end nodes. | ||
* @param edgePropertyLists A span of edge property lists. | ||
*/ | ||
void buildGraph(std::span<const Id> startNodes, std::span<const Id> endNodes, |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Is there a good reason why this doesn't simply return the graph?
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
A thorough pass on everything but the tests.
# - query: path_search_shortest_paths | ||
# type: no-text | ||
# sparql: | | ||
# PREFIX pathSearch: <https://qlever.cs.uni-freiburg.de/pathSearch/> | ||
# SELECT * WHERE { | ||
# SERVICE pathSearch: { | ||
# pathSearch: pathSearch:algorithm pathSearch:shortestPaths; | ||
# pathSearch:source <Mary_Ann_Leeper>; | ||
# pathSearch:target <Literature_Subject>; | ||
# pathSearch:pathColumn ?path; | ||
# pathSearch:edgeColumn ?edge; | ||
# pathSearch:start ?start; | ||
# pathSearch:end ?end; | ||
# {SELECT * WHERE { | ||
# ?start <is-a> ?end | ||
# }} | ||
# } | ||
# } | ||
# checks: | ||
# - num_rows: 4 | ||
# - num_cols: 4 | ||
# - selected: ["?path", "?edge", "?start", "?end"] | ||
# - contains_row: [0, 0, "<Mary_Ann_Leeper>", "<Biochemist>"] | ||
# - contains_row: [0, 1, "<Biochemist>", "<Literature_Subject>"] | ||
# - contains_row: [1, 0, "<Mary_Ann_Leeper>", "<Chemist>"] | ||
# - contains_row: [1, 1, "<Chemist>", "<Literature_Subject>"] |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Delete if not needed anymore.
checks: | ||
- num_rows: 17 | ||
- num_cols: 4 | ||
- selected: ["?path", "?edge", "?start", "?end"] |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
If the result is deterministic, then you can check some rows here.
pathQuery.childGraphPattern_ = std::move(pattern._child); | ||
} else { | ||
throw parsedQuery::PathSearchException( | ||
"Unsupported subquery in pathSearch." |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
"Unsupported subquery in pathSearch." | |
"Unsupported element in pathSearch." |
void addParameter(const SparqlTriple& triple); | ||
void fromBasicPattern(const BasicGraphPattern& pattern); | ||
std::variant<Variable, std::vector<Id>> toSearchSide( | ||
std::vector<TripleComponent> side, const Index::Vocab& vocab) const; | ||
PathSearchConfiguration toPathSearchConfiguration( | ||
const Index::Vocab& vocab) const; | ||
}; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Please add some documentation....
auto getVariable = [](std::string parameter, const TripleComponent& object) { | ||
if (!object.isVariable()) { | ||
throw PathSearchException("The value " + object.toString() + | ||
" for parameter '" + parameter + | ||
"' has to be a variable"); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
pass the parameter
as a string_view
and then use absl::StrCat
to concatenate the error message.
while (!currentPath.empty() && edge.start_ != currentPath.end()) { | ||
currentPath.pop_back(); | ||
} |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think it works that you just do a single pop in the end as soon as you have no more outgoing edges that lead to unvisited nodes. Because then you know, that the next edge will be a
visited.insert(edge.end_.getBits()); | ||
} | ||
|
||
return pathCache[source.getBits()]; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think for this function at all we should aim for the most efficient implementation thinkable.
And that is, I think, one that directly stores to the IdTable and has very simple path creations. We can talk about this again if you want.
One thing we have to think about: In the case where you have pairs of [source, target], the path cache has to be invalidated the set of targets changes. So you basically should
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Okay,
I just thought of an idea:
- You store the complete subtree in a deterministic and sorted way.
- Then the
Edge
type can just be a single rowIndex into that table (or the individual spans for start and end nodes, which is the same). That way you don't really have the problem with the overhead, and the edge Properties etc, as you only have to touch all this when you actually materialize the result. This should all be really efficient (no allocations etc).
if (pathCache.contains(edge.end_.getBits())) { | ||
for (auto subPath : pathCache[edge.end_.getBits()]) { | ||
if (subPath.first() == currentPath.first()) { | ||
addToCache(pathCache, currentPath, currentPath.size()); | ||
} else { | ||
auto fullPath = currentPath.concat(subPath); | ||
addToCache(pathCache, fullPath, currentPath.size()); | ||
} | ||
} | ||
continue; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
You should definitely factor out the caching etc. to separate lambdas outside the core-DFS.
That way we can more easily reason about the code and refactor it if necessary.
} | ||
for (auto source : sources) { | ||
for (auto path : findPaths(source, targetSet, binSearch)) { | ||
paths.push_back(path); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This implementation is the one for two separate values clauses, doing the full cartesian product of sources and targets.
std::holds_alternative<Variable>(config.sources_) | ||
? AD_FIELD( | ||
PathSearchConfiguration, sources_, | ||
VariantWith<Variable>(Eq(std::get<Variable>(config.sources_)))) | ||
: AD_FIELD( | ||
PathSearchConfiguration, sources_, | ||
VariantWith<std::vector<ValueId>>(UnorderedElementsAreArray( | ||
std::get<std::vector<ValueId>>(config.sources_)))); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Can't this just be AD_FIELD(Config, sources_, Eq(config.sources_))
?
Same for the targetMatcher
.
This PR will add a feature for performing path-search in Qlever. It can be called via Federated Query. Currently, there are two different algorithms for computing paths: one for the computation of all paths and one to compute all shortest paths. The PathSearch is designed in such a way, that adding more algorithms later is straightforward.