Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

feat: allow spread operators in to-many relationships #3640

Open
wants to merge 1 commit into
base: main
Choose a base branch
from

Conversation

laurenceisla
Copy link
Member

@laurenceisla laurenceisla commented Jul 5, 2024

Closes #3041

@laurenceisla
Copy link
Member Author

laurenceisla commented Jul 6, 2024

My approach right now is to generate this query for a to-many request:

curl 'localhost:3000/clients?select=name,...projects(name,id)'
SELECT "test"."clients"."name",
       "clients_projects_1"."name",
       "clients_projects_1"."id"
FROM "test"."clients"
LEFT JOIN LATERAL (
  SELECT json_agg("projects_1"."name") AS "name",
         json_agg("projects_1"."id") AS "id"
  FROM "test"."projects" AS "projects_1"
  WHERE "projects_1"."client_id" = "test"."clients"."id"
) AS "clients_projects_1" ON TRUE

Right now this gives the expected result. But aggregates are not working correctly, because they are designed to be selected in the top query with a GROUP BY. A solution would be to not do the json_agg() inside the sub-query and do it in the top one and treat it as another aggregate (with GROUP BY). Like this:

SELECT "test"."clients"."name",
      json_agg("clients_projects_1"."name") AS "name",
      json_agg("clients_projects_1"."id") AS "id"
FROM "test"."clients"
LEFT JOIN LATERAL (
  SELECT "projects_1"."name",
         "projects_1"."id"
  FROM "test"."projects" AS "projects_1"
  WHERE "projects_1"."client_id" = "test"."clients"."id"
) AS "clients_projects_1" ON TRUE
GROUP BY "test"."clients"."name"

Not sure which one is better/easier right now... I'm thinking the latter.

src/PostgREST/Error.hs Outdated Show resolved Hide resolved
@wolfgangwalther
Copy link
Member

Right now this gives the expected result. But aggregates are not working correctly, because they are designed to be selected in the top query with a GROUP BY. A solution would be to not do the json_agg() inside the sub-query and do it in the top one and treat it as another aggregate (with GROUP BY).

Having the json_agg in the outer query would make the query cleaner, imho.

@laurenceisla
Copy link
Member Author

laurenceisla commented Jul 17, 2024

Some caveats I encountered:

Repeated values and order

Do we want to keep repeated values in the results? For example (not the best use case, just to illustrate):

curl 'localhost:3000/project?select=name,...tasks(tasks:name,due_dates:due_date)'
[
  {
    "name": "project 1",
    "tasks": ["task 1", "task 2", "task 3", "task 4"],
    "due_dates": [null, "2024-08-08", "2024-08-08", null]
  }
]

Here we're repeating null and "2024-08-08", so maybe we don't want to do this and just return [null, "2024-08-08"] (perhaps also remove null values?). Doing this will not guarantee the same dimensions for the same aggregated relationship, and definitely not the order of the results (which wasn't guaranteed before either). Doing a DISTINCT inside the json_agg() is a possible solution (the next caveat has an example query).

Nested To-Many Spreads

I have a doubt on what to expect with nested to-many spreads. For example, on a non-nested to-many spread like this one:

curl 'localhost:3000/entities?select=name,...child_entities(children:name)'

We would expect:

[
  {"name": "entity 1", "children": ["child entity 1", "child entity 2"]},
  {"name": "entity 2", "children": ["child entity 3"]},
  "..."
]

But what if we nest another to-many spread embedding with a new column to aggregate:

curl 'localhost:3000/entities?select=name,...child_entities(children:name,...grandchild_entities(grandchildren:name))'

I understand that we're hoisting all the aggregates to the top level, and not grouping by the intermediate columns (entities.name), because they should also be aggregated. I'm assuming that the result should be the same as above but also with the aggregated grandchild_entities.name.

[
  {"name": "entity 1", "children": ["child entity 1", "child entity 2"], "grandchildren": ["grandchild entity 1", "grandchild entity 2", "..."]},
  {"name": "entity 2", "children": ["child entity 3"], "grandchildren": []},
  "..."
]

This cannot be achieved by a simple GROUP BY, because duplicated values will be returned by entities.name (which, perhaps, is not what we want). A solution would also be to use DISTINCT. The query would look like this:

SELECT "api"."entities"."name",
       json_agg(DISTINCT "entities_child_entities_1"."children") AS "children",
       json_agg(DISTINCT "entities_child_entities_1"."grandchildren") AS "grandchildren"
FROM "api"."entities"
LEFT JOIN LATERAL (
  SELECT "child_entities_1"."name" AS "children",
         "child_entities_grandchild_entities_2"."grandchildren" AS "grandchildren"
  FROM "api"."child_entities" AS "child_entities_1"
  LEFT JOIN LATERAL (
    SELECT "grandchild_entities_2"."name" AS "grandchildren"
    FROM "api"."grandchild_entities" AS "grandchild_entities_2"
    WHERE "grandchild_entities_2"."parent_id" = "child_entities_1"."id"
  ) AS "child_entities_grandchild_entities_2" ON TRUE
  WHERE "child_entities_1"."parent_id" = "api"."entities"."id"
) AS "entities_child_entities_1" ON TRUE
GROUP BY "api"."entities"."name";

If there is no sensible interpretation of the query, another option is to prohibit these intermediate columns altogether (aggregates like sum, avg, etc. should still be possible).

@laurenceisla laurenceisla force-pushed the feat-spread-m2m branch 2 times, most recently from 6507878 to bd93514 Compare July 26, 2024 01:50
src/PostgREST/Plan.hs Outdated Show resolved Hide resolved
@laurenceisla
Copy link
Member Author

OK, this is what I got implemented so far. For example, using the tables in our spec test:

Factories <-02M-> processes <-M2M-> supervisors
curl 'localhost:3000/factories?select=name,...processes(processes:name,...supervisors(supervisors:name))'
[
 {
  "name": "Factory C",
  "processes": ["Process C1", "Process C2", "Process XX"],
  "supervisors": ["Peter", "Peter", null]
 },
 {
  "name": "Factory B",
  "process": ["Process B1", "Process B1", "Process B2", "Process B2"],
  "supervisors": ["Peter", "Sarah", "Mary", "John"]
 },
 {
  "name": "Factory A",
  "process": ["Process A1", "Process A2"],
  "supervisors": ["Mary", "John"]
 },
 {
  "name": "Factory D",
  "process": [null],
  "supervisors": [null]
 }
]
[
  {
  	"name":"Factory C",
  	"processes":["Process C1", "Process C2", "Process XX"],
  	"supervisors":[{"name": "Peter"}, {"name": "Peter"}, null]},
  {
  	"name":"Factory B",
  	"processes":["Process B1", "Process B1", "Process B2", "Process B2"],
  	"supervisors":[{"name": "Peter"}, {"name": "Sarah"}, {"name": "Mary"}, {"name": "John"}]},
  {
  	"name":"Factory A",
  	"processes":["Process A1", "Process A2"],
  	"supervisors":[{"name": "Mary"}, {"name": "John"}]},
  {
  	"name":"Factory D",
  	"processes":[null],
	"supervisors":[null]
  }
]

As I mentioned in previous comments, some values will repeat, since we're grouping by the factory "name" without doing a DISTINCT or NOT NULL. The next step would be to implement the .. operator as mentioned here: #3640 (comment), it shouldn't be too complicated.

There's a problem when the embeddings have no values, as seen in the "Factory D" example, which has no processes and no supervisors. This is the same issue as this SO question. One solution is to do a COALESCE(NULLIF(..., '[null]'), '[]'), but this does not take into consideration valid null values (the row exists but the column value is null). The best solution is in one of the answers (filtering by the PK of the relationship), but it doesn't seem like a trivial task.

@laurenceisla laurenceisla force-pushed the feat-spread-m2m branch 7 times, most recently from 9002110 to b3e5483 Compare September 18, 2024 23:37
Copy link
Member Author

@laurenceisla laurenceisla left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This feature should be ready for review now.

I'm leaving the .. for DISTINCT and NOT NULL for another PR to keep it cleaner. Edit: Nvm. I figured that it should be OK to include that feature here too, although in different commits.

Here are some comments on the changes done:

src/PostgREST/Query/SqlFragment.hs Show resolved Hide resolved
src/PostgREST/Query/SqlFragment.hs Outdated Show resolved Hide resolved
src/PostgREST/ApiRequest/Types.hs Outdated Show resolved Hide resolved
src/PostgREST/Plan/ReadPlan.hs Show resolved Hide resolved
@laurenceisla laurenceisla changed the title feat: WIP allow spread operators in to-many relationships feat: allow spread operators in to-many relationships Sep 18, 2024
@laurenceisla laurenceisla marked this pull request as ready for review September 18, 2024 23:40
@laurenceisla laurenceisla force-pushed the feat-spread-m2m branch 4 times, most recently from 67e6419 to 87a13ef Compare September 25, 2024 22:14
@laurenceisla laurenceisla force-pushed the feat-spread-m2m branch 3 times, most recently from 5a80bf2 to eaac818 Compare October 28, 2024 20:26
@steve-chavez
Copy link
Member

One thing I'm a bit concerned about is that we seem to have lost performance from v12.2.3, see https://github.com/PostgREST/postgrest/pull/3640/checks?check_run_id=32182824626

param head main v12.2.3
throughput 418 419 468

This has happened somewhere outside this PR on main though.

@laurenceisla
Copy link
Member Author

This has happened somewhere outside this PR on main though.

Found the loadtest in which head and main started to differ: https://github.com/PostgREST/postgrest/actions/runs/9846138600/job/27183597190

After some local tests, I think the commit f31848f is where the throughput started to go down (the previous one returned similar values to v12.2.3):

param v12.2.3 735e1ed f31848f 86c3257
throughput 468 462 420 437

@wolfgangwalther
Copy link
Member

After some local tests, I think the commit f31848f is where the throughput started to go down

Uh, that would indicate that only the root endpoint should be affected, right? Or maybe even only the schema cache load - I wonder whether we have any kind of "ramp up" in the loadtest to discard stuff at the beginning when the schema cache still has to load?

I can't imagine this affects any regular requests.

@wolfgangwalther
Copy link
Member

To prove my above point, I ran those tests as well:

  • regular loadtest/targets.http: 558 req/s vs 521 req/s (similar to your observation)
  • removed the openapi request from loadtest/targets.http: 1544 req/s vs 1597 req/s

So the regression really only happens on the root endpoint for the openapi output. Given that we're working towards removing / replacing that anyway, I don't consider this regression problematic right now.

WDYT?

@laurenceisla
Copy link
Member Author

laurenceisla commented Oct 30, 2024

So the regression really only happens on the root endpoint for the openapi output. Given that we're working towards removing / replacing that anyway, I don't consider this regression problematic right now.

Cool, since it only affects the OpenAPI output, then yes, I also don't consider it a problem.

Copy link
Member

@steve-chavez steve-chavez left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

To make sure these headings are unique

docs/references/api/resource_embedding.rst Outdated Show resolved Hide resolved
docs/references/api/resource_embedding.rst Outdated Show resolved Hide resolved
Copy link
Member

@steve-chavez steve-chavez left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks great now! Awesome work! 💯

@jdgamble555
Copy link

This is awesome guys, I was going to ask about aggregations, but just works!

Sweet!

Copy link
Member

@wolfgangwalther wolfgangwalther left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Terrific work with the test cases, very extensive. My head explodes, though.

Because we just discussed commit message / prefixes in another PR - what's your opinion on docs/feat commits? Should they be split like in this PR or do they belong together, i.e. was the idea to squash this?

I think they should go into the same feat: commit. A feature without docs is not a feature.

Comment on lines +1244 to +1245
It is expected to get ``null`` values in the resulting array.
You can exclude them with :ref:`stripped_nulls`.
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I haven't looked at the code, but I assume this with a special case handling this for the aggregation.

I don't think it's a good idea. This will lead to inconsistent results, because: Assume you have a regular aggregation, some spread embedding aggregation, a regular array and a json array - all with some null values in them. Some of them will be stripped, but others won't.

json(b)_strip_nulls only strips nulls in objects for a reason, I don't think we should change that.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We had the discussion about it here: #3640 (comment)

This will lead to inconsistent results [...] Some of them will be stripped, but others won't.

From the convo above, I was also on the fence about it, but I figured that adding "this only works on to-many spreads" to the docs would clarify some things (I forgot to do that btw). Still, I agree, I think the inconsistency you mention is enough to look for an alternative (maybe another parameter in the header?).

Copy link
Member

@wolfgangwalther wolfgangwalther Nov 6, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Still, I agree, I think the inconsistency you mention is enough to look for an alternative (maybe another parameter in the header?).

I think filtering NULLs should be very explicit.

I don't remember seeing any tests with filters in the tests (but I didn't look again now).

Is something like this supposed to work?

get "/factories?select=factory:name,...processes(process:name)&processes.process=not.is.null"

And also any other filter on the embedding?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

And also any other filter on the embedding?

Yes, filters on the spread embed resource work, there's a couple of tests with them.

Is something like this supposed to work?

Yes, it will work on a single embed resource and won't include the null values. But deeply nested resources could include nulls inside the array when the value is null or when no embedded row is returned. This is a problem with the current implementation.

Hmm... with the fixed implementation (non-flattened arrays) this may not be a problem anymore, since it should return empty arrays instead of null... but I'm not entirely sure, I need to check the new design of the resulting queries to verify.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hmm... with the fixed implementation (non-flattened arrays) this may not be a problem anymore, since it should return empty arrays instead of null...

Yes, AFAICT this is correct. Since the array_agg is done in the same sub-query selection, it would return null on a failed JOIN. But it returns [null] when the JOIN is successful and the value is null (which is what we want). So just the explicit filter should be needed here, not the header.

Comment on lines +173 to +190
it "should aggregate spread columns from a nested one-to-many relationship" $
get "/factories?select=factory:name,...processes(process:name,...process_supervisor(supervisor_ids:supervisor_id))&id=lte.2" `shouldRespondWith`
[json|[
{"factory":"Factory B","process":["Process B1", "Process B1", "Process B2", "Process B2"],"supervisor_ids":[3, 4, 1, 2]},
{"factory":"Factory A","process":["Process A1", "Process A2"],"supervisor_ids":[1, 2]}
]|]
{ matchStatus = 200
, matchHeaders = [matchContentTypeJson]
}
it "should aggregate spread columns from a nested many-to-many relationship" $ do
get "/factories?select=factory:name,...processes(process:name,...supervisors(supervisors:name))&id=lte.2" `shouldRespondWith`
[json|[
{"factory":"Factory B","process":["Process B1", "Process B1", "Process B2", "Process B2"],"supervisors":["Peter", "Sarah", "Mary", "John"]},
{"factory":"Factory A","process":["Process A1", "Process A2"],"supervisors":["Mary", "John"]}
]|]
{ matchStatus = 200
, matchHeaders = [matchContentTypeJson]
}
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The alternative would be to spread this into an array of arrays, right?

I would have expected that, I think.

This flattening of nested arrays somehow breaks my mental model about this. Especially when I read on, because in the next step we are aggregating arrays of objects (which makes sense). I find it more consistent if we had arrays of arrays?

Copy link
Member Author

@laurenceisla laurenceisla Nov 6, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The alternative would be to spread this into an array of arrays, right?

Yes, that was another option. In that case, the response for the first test would be something like:

[
  {"factory":"Factory B","process":["Process B1", "Process B2"],"supervisor_ids":[[3, 4], [1, 2]]},
  {"factory":"Factory A","process":["Process A1", "Process A2"],"supervisor_ids":[[1], [2]]}
]

A caveat here is that it's difficult to hoist aggregates to the top resource. For example if we wanted a COUNT on supervisors, we could expect this:

{"factory":"Factory B","process":["Process B1", "Process B2"],"supervisor_ids":[[3, 4], [1, 2]],"supervisor_count":[2,2]}

If we wanted to aggregate to the top level we would need to do a COUNT(supervisors) and then a SUM(supervisors_result) to get the correct "count": 4 (unless there's another alternative).

Also, in the issue description under "Multiple Levels" an example of an expected result is given with flatten arrays (although with a junction table). Which I find quite useful to have, instead of further nesting the result in arrays (every n nested embed spread will return n nested arrays).

This flattening of nested arrays somehow breaks my mental model about this.

Yes, it actually goes away from the concept of to-one spread in that sense (removing the object to select the columns). To avoid both of the issues above, the mental model I went for is this one:

  • Treat every "to-many" spread as an array_agg aggregate of every selected column in the embedded resources. This aggregate will have a single GROUP BY at the top resource, instead of a GROUP BY at every level of embedding.
  • Since a single GROUP BY is used, other aggregates can be easily hoisted to the top resource and grouped by the columns selected there.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Also, #3041 (comment) "Multiple Levels" an example of an expected result is given with flatten arrays (although with a junction table). Which I find quite useful to have, instead of further nesting the result in arrays (every n nested embed spread will return n nested arrays).

Oh, I fully agree with that one. If there is a junction table to make it m2m, and you don't specifiy this table in your request, then you should get a single array.

But the case is different when you specify the intermediate table.

A caveat here is that it's difficult to hoist aggregates to the top resource. For example if we wanted a COUNT on supervisors
[..]
If we wanted to aggregate to the top level we would need to do a COUNT(supervisors) and then a SUM(supervisors_result) to get the correct "count": 4 (unless there's another alternative).

In this specific example, 4 is the right answer. What is the right answer if the response had "supervisor_ids":[[3, 4], [1, 2, 3]], i.e. it had duplicates?

I assume the query would return 5. We are not counting "supervisors", but "rows of process_supervisors". I think this result can be surprising, but if we make it more explicit like so:

get "/factories?select=factory:name,...processes(process:name,...supervisors(count()).sum())&id=lte.2"

... then it's much less surprising. Because we are summing up counts, it's clear that the information about multiple processes having the same supervisors is lost in the "count" aggregation.

Copy link
Member Author

@laurenceisla laurenceisla Nov 7, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, it's less surprising, although a bit confusing on what to expect when there are more than one column, aggregates or nested relationships.

This case is straight forward enough:

...supervisors(count()).sum()

What about something like this:

...supervisors(name,count(),...processes(count())).sum()

What would be the rules to use sum() in the spread relationship? These could be some options:

  • Allow only one single selected aggregate (throw an error otherwise)
  • Allow many aggregates and apply the sum() to each of them (throw error on non-aggregates)
  • Apply to all the columns regardless and let PostgreSQL return the error
  • Ignore non-aggregates and apply only on aggregated columns (does not throw an error, but PostgreSQL could return one if the aggregate type is a mismatch)

The last one could be a good approach, since it still allows spreading other columns... the implementation may not be trivial, though and it could also be extra burden for the UX.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Apply to all the columns regardless and let PostgreSQL return the error

This sounds like the most sensible approach to me.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Now that we have to-many spreads, they'll be applied to the spread relationship itself, that is, count all of processes by factory (since I don't believe that a response like count: [1,1] would be sensible in this case):

I disagree. The [1, 1] is what I would expect here.

Copy link
Member Author

@laurenceisla laurenceisla Nov 19, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I disagree. The [1, 1] is what I would expect here.

Yes, I agree. I didn't mean to publish the comment (just deleted it before reading your answer, sorry). I was still experimenting with the expected results and found out that it's expected to get the full count() only when not specifying an extra column, which makes sense (it groups by those columns).

Ignore my previous comment, I'll make a new example in a moment.

test/spec/Feature/Query/SpreadQueriesSpec.hs Show resolved Hide resolved
@laurenceisla
Copy link
Member Author

what's your opinion on docs/feat commits? Should they be split like in this PR or do they belong together, i.e. was the idea to squash this?

I think they should go into the same feat: commit. A feature without docs is not a feature.

Makes sense, yes. I'll squash them to avoid problems when merging.

The selected columns in the embedded resources are aggregated into arrays
}
]

The order of the values inside the resulting array is unspecified.
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Imho, this is unsatisfactory, it would basically make the feature unusable for me. This is because we snapshot test all our api responses and if we can't generate predictable output, then we can't use the feature. So ordering is very important.

Would something like this be hard to do?

get "/factories?select=factory:name,...processes(name)&processes.order=name"

(I hope I got the syntax right, this should be the regular "sort the embedded response" syntax, right?)

For the spread, this could then move the ORDER BY into the aggregate function call. This would only allow to specify a single ORDER BY for multiple spread aggregates - which I consider a good thing, because this would ensure the array items still match between arrays.

Copy link
Member Author

@laurenceisla laurenceisla Nov 6, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Would something like this be hard to do?

For the spread, this could then move the ORDER BY into the aggregate function call. This would only allow to specify a single ORDER BY for multiple spread aggregates - which I consider a good thing, because this would ensure the array items still match between arrays.

I don't think it'd be hard to do. Yes, the syntax is OK, internally it would need to treat every order done inside a to_many spread as an array_agg order for every column, instead of a subquery one.

Right now, the order as you mentioned in your example works: it orders the subquery and the aggregated columns will be sorted. But it's not guaranteed to behave the same way for other more complex cases, as mentioned in this convo. So yes, I'll implement the order by in the aggregate here.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Development

Successfully merging this pull request may close these issues.

Spread operator for many-to-many relationships, aliases, and aggregations
4 participants