Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Add Path Search Feature to Qlever #1335

Open
wants to merge 53 commits into
base: master
Choose a base branch
from

Conversation

JoBuRo
Copy link
Contributor

@JoBuRo JoBuRo commented Apr 30, 2024

This PR will add a feature for performing path-search in Qlever. It can be called via Federated Query. Currently, there are two different algorithms for computing paths: one for the computation of all paths and one to compute all shortest paths. The PathSearch is designed in such a way, that adding more algorithms later is straightforward.

@JoBuRo
Copy link
Contributor Author

JoBuRo commented Apr 30, 2024

Current TODOs:

  • Implement federated query parsing for path search
  • Improve estimates of the PathSearch operation
  • Check if building the ResultTable can be improved
  • Refactor PathSearch into smaller classes

Copy link

codecov bot commented Apr 30, 2024

Codecov Report

Attention: Patch coverage is 83.36842% with 79 lines in your changes missing coverage. Please review.

Project coverage is 89.01%. Comparing base (14d6e1c) to head (1c209fe).

Files Patch % Lines
src/engine/PathSearch.cpp 82.48% 42 Missing and 6 partials ⚠️
src/parser/GraphPatternOperation.cpp 82.07% 13 Missing and 6 partials ⚠️
src/parser/sparqlParser/SparqlQleverVisitor.cpp 80.00% 5 Missing and 1 partial ⚠️
src/engine/QueryPlanner.cpp 93.87% 2 Missing and 1 partial ⚠️
src/engine/CheckUsePatternTrick.cpp 0.00% 2 Missing ⚠️
src/engine/PathSearch.h 92.85% 1 Missing ⚠️
Additional details and impacted files
@@            Coverage Diff             @@
##           master    #1335      +/-   ##
==========================================
- Coverage   89.06%   89.01%   -0.05%     
==========================================
  Files         328      330       +2     
  Lines       29294    29763     +469     
  Branches     3262     3335      +73     
==========================================
+ Hits        26090    26494     +404     
- Misses       2054     2108      +54     
- Partials     1150     1161      +11     

☔ View full report in Codecov by Sentry.
📢 Have feedback on the report? Share it here.

Copy link

sonarcloud bot commented Apr 30, 2024

Quality Gate Passed Quality Gate passed

Issues
17 New issues
0 Accepted issues

Measures
0 Security Hotspots
No data about Coverage
0.0% Duplication on New Code

See analysis details on SonarCloud

@JoBuRo JoBuRo marked this pull request as ready for review June 26, 2024 07:33
Copy link
Member

@joka921 joka921 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

A first round of comments for the parsing/planning stuff.

I still have to dig through the more complicated boost-graph-stuff.

src/parser/GraphPatternOperation.cpp Show resolved Hide resolved
auto simpleTriple = triple.getSimple();
TripleComponent predicate = simpleTriple.p_;
TripleComponent object = simpleTriple.o_;
AD_CORRECTNESS_CHECK(predicate.isIri());
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This should be cleaned up.

  1. Store the result of toStringRepresentation to get rid of the redundant code.
  2. make a lambda setVariable so that you can do something like
if (iriString.ends_with("start") { setVariable(start);); // object is captured by the lambda.

3. You should handle the case, that there are duplicate triples (e.g. something you want to set is not nullopt anymore etc).

4. You shouldn't do `AD_...CHECK` here, as this is not a programming error if it happens, but may happen in the query.
You should throw proper exceptions with proper strings.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm unsure how to implement the lambda you suggest. Assuming the signature of the lambda is something like void setVariable(Variable var), I don't see how the lambda can choose to set the correct member of the PathQuery. The only two options I see is passing a pointer to the member such as start_ or using a map, such that variable can be set by parameter name (map[param] = object.getVariable()). I don't think the amount of duplication justifies either of these solutions.
I'm also aware that I could capture the PathQuery (this), but I'm not sure how that would help here.

I am aware that this is a lot of duplicate code. I implemented a lambda similar to what you suggested to get rid of some the duplication. While I was at it, I also improved the error handling.

src/parser/GraphPatternOperation.cpp Outdated Show resolved Hide resolved
src/parser/GraphPatternOperation.cpp Outdated Show resolved Hide resolved
src/parser/GraphPatternOperation.cpp Show resolved Hide resolved
src/parser/sparqlParser/SparqlQleverVisitor.cpp Outdated Show resolved Hide resolved
src/parser/sparqlParser/SparqlQleverVisitor.cpp Outdated Show resolved Hide resolved
src/parser/GraphPatternOperation.h Show resolved Hide resolved
src/engine/QueryPlanner.cpp Outdated Show resolved Hide resolved
* @param endNodes A span of end nodes.
* @param edgePropertyLists A span of edge property lists.
*/
void buildGraph(std::span<const Id> startNodes, std::span<const Id> endNodes,
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is there a good reason why this doesn't simply return the graph?

Copy link

sonarcloud bot commented Jul 11, 2024

Copy link
Member

@joka921 joka921 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

A thorough pass on everything but the tests.

Comment on lines +1043 to +1068
# - query: path_search_shortest_paths
# type: no-text
# sparql: |
# PREFIX pathSearch: <https://qlever.cs.uni-freiburg.de/pathSearch/>
# SELECT * WHERE {
# SERVICE pathSearch: {
# pathSearch: pathSearch:algorithm pathSearch:shortestPaths;
# pathSearch:source <Mary_Ann_Leeper>;
# pathSearch:target <Literature_Subject>;
# pathSearch:pathColumn ?path;
# pathSearch:edgeColumn ?edge;
# pathSearch:start ?start;
# pathSearch:end ?end;
# {SELECT * WHERE {
# ?start <is-a> ?end
# }}
# }
# }
# checks:
# - num_rows: 4
# - num_cols: 4
# - selected: ["?path", "?edge", "?start", "?end"]
# - contains_row: [0, 0, "<Mary_Ann_Leeper>", "<Biochemist>"]
# - contains_row: [0, 1, "<Biochemist>", "<Literature_Subject>"]
# - contains_row: [1, 0, "<Mary_Ann_Leeper>", "<Chemist>"]
# - contains_row: [1, 1, "<Chemist>", "<Literature_Subject>"]
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Delete if not needed anymore.

checks:
- num_rows: 17
- num_cols: 4
- selected: ["?path", "?edge", "?start", "?end"]
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If the result is deterministic, then you can check some rows here.

pathQuery.childGraphPattern_ = std::move(pattern._child);
} else {
throw parsedQuery::PathSearchException(
"Unsupported subquery in pathSearch."
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
"Unsupported subquery in pathSearch."
"Unsupported element in pathSearch."

Comment on lines +169 to +175
void addParameter(const SparqlTriple& triple);
void fromBasicPattern(const BasicGraphPattern& pattern);
std::variant<Variable, std::vector<Id>> toSearchSide(
std::vector<TripleComponent> side, const Index::Vocab& vocab) const;
PathSearchConfiguration toPathSearchConfiguration(
const Index::Vocab& vocab) const;
};
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Please add some documentation....

Comment on lines +84 to +88
auto getVariable = [](std::string parameter, const TripleComponent& object) {
if (!object.isVariable()) {
throw PathSearchException("The value " + object.toString() +
" for parameter '" + parameter +
"' has to be a variable");
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

pass the parameter as a string_view and then use absl::StrCat to concatenate the error message.

Comment on lines +304 to +306
while (!currentPath.empty() && edge.start_ != currentPath.end()) {
currentPath.pop_back();
}
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think it works that you just do a single pop in the end as soon as you have no more outgoing edges that lead to unvisited nodes. Because then you know, that the next edge will be a

visited.insert(edge.end_.getBits());
}

return pathCache[source.getBits()];
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think for this function at all we should aim for the most efficient implementation thinkable.
And that is, I think, one that directly stores to the IdTable and has very simple path creations. We can talk about this again if you want.

One thing we have to think about: In the case where you have pairs of [source, target], the path cache has to be invalidated the set of targets changes. So you basically should

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Okay,
I just thought of an idea:

  1. You store the complete subtree in a deterministic and sorted way.
  2. Then the Edge type can just be a single rowIndex into that table (or the individual spans for start and end nodes, which is the same). That way you don't really have the problem with the overhead, and the edge Properties etc, as you only have to touch all this when you actually materialize the result. This should all be really efficient (no allocations etc).

Comment on lines +311 to +320
if (pathCache.contains(edge.end_.getBits())) {
for (auto subPath : pathCache[edge.end_.getBits()]) {
if (subPath.first() == currentPath.first()) {
addToCache(pathCache, currentPath, currentPath.size());
} else {
auto fullPath = currentPath.concat(subPath);
addToCache(pathCache, fullPath, currentPath.size());
}
}
continue;
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

You should definitely factor out the caching etc. to separate lambdas outside the core-DFS.
That way we can more easily reason about the code and refactor it if necessary.

}
for (auto source : sources) {
for (auto path : findPaths(source, targetSet, binSearch)) {
paths.push_back(path);
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This implementation is the one for two separate values clauses, doing the full cartesian product of sources and targets.

Comment on lines +267 to +274
std::holds_alternative<Variable>(config.sources_)
? AD_FIELD(
PathSearchConfiguration, sources_,
VariantWith<Variable>(Eq(std::get<Variable>(config.sources_))))
: AD_FIELD(
PathSearchConfiguration, sources_,
VariantWith<std::vector<ValueId>>(UnorderedElementsAreArray(
std::get<std::vector<ValueId>>(config.sources_))));
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can't this just be AD_FIELD(Config, sources_, Eq(config.sources_))
?
Same for the targetMatcher.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

2 participants