Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Issue found on page 'SAMPLE Clause' #4041

Open
rpbouman opened this issue Nov 11, 2024 · 0 comments
Open

Issue found on page 'SAMPLE Clause' #4041

rpbouman opened this issue Nov 11, 2024 · 0 comments

Comments

@rpbouman
Copy link
Contributor

rpbouman commented Nov 11, 2024

This is in relation to the issueI filed earlier: duckdb/duckdb#14773

I have read both the SQL Query syntax page on the SAMPLE clause (https://duckdb.org/docs/sql/query_syntax/sample.html) as well as the general feature overview page on samples (https://duckdb.org/docs/sql/samples.html).

Together these pages describe the syntax of the SAMPLE and TABLESAMPLE clauses in detail.
However, it does not precisely explain where the SAMPLE or TABLESAMPLE clause fits syntactically inside the query as a whole.

At first, I thought that this piece of text in the opening paragraph from https://duckdb.org/docs/sql/query_syntax/sample explained that:

The sample clause is applied right after anything in the FROM clause (i.e., after any joins, but before the WHERE clause or any aggregates).

But from what I understand now, the term "applied" does not refer to the syntactical placement but to the effect of its operation.
I think the documentation would be improved if it would more explicitly explain whether it is talking about the order of operation vs the syntactical position.

There is currently some text in https://duckdb.org/docs/sql/samples.html#table-samples, that reads:

The TABLESAMPLE and USING SAMPLE clauses are identical in terms of syntax and effect, with one important difference: tablesamples sample directly from the table for which they are specified, whereas the sample clause samples after the entire from clause has been resolved.

Here, just like in the earlier explanation of the effect of the SAMPLE clause, this text seems to be more about the order of operations than about the syntactical position. In retrospect I understand now that the term "identical in terms of syntax" is about the internal syntax of the clause itself, and not about the syntactical position in the statement as a whole. I think the documentation would be improved again by making that distinction explicit.

Proposed changes:

  1. From what I gather now, it looks like the syntactical placement of the SAMPLE-clause is after the GROUP BY-clause, but before the ORDER BY clause. I think this should be stated explicitly right before or right after the syntax graph of the clause itself.

  2. For TABLESAMPLE, the placement is after the table expression to which it applies. If a table alias is present, the TABLESAMPLE-clause should follow after the alias. I think it would be good to state this explicitly when describing the TABLESAMPLE clause.

  3. Right now the documentation seems to use the term "SAMPLE-clause" throughout, and its not always clear whether that refers to USING SAMPLE, TABLESAMPLE, or both. I think it would be an improvement to explicitly talk about either the USING SAMPLE-clause, or the TABLESAMPLE-clause. In those cases where you just in general want to talk about the effect of such clauses, the uncapitalized term "sampling" or "sampling operation" could be used.

  4. The syntax diagram in the FROM-clause documentation (https://duckdb.org/docs/sql/query_syntax/from) explicitly describes the position of the TABLESAMPLE clause, but it shows it only for the first table expression in the FROM-clause, not for any Joined table expressions.

  5. The effect of REPEATABLE sub-clause and the significance of the seed option appear not to be documented. There are a few examples of how it is used in the general feature overview page, but there is no explanation of how it works, what types of values are appropriate for the seed parameter, and how the choice of the seed parameter affects the result.

Page URL: https://duckdb.org/docs/sql/query_syntax/sample.html and https://duckdb.org/docs/sql/samples.html

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

1 participant