Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Understanding SEQUENCE and nextval #3960

Open
houghtonap opened this issue Oct 24, 2024 · 0 comments
Open

Understanding SEQUENCE and nextval #3960

houghtonap opened this issue Oct 24, 2024 · 0 comments

Comments

@houghtonap
Copy link

I have a text dataset that contains 1 record per line and each line has positional information. Each record has a type which is the first 2 characters. The records in the dataset are ordered and grouped. The first record is always XA and the last record is always XE. Then there is one or more groups of records starting with XE and containing records XW, XO, XS, XT, XU.

I like DuckDB's SEQUENCE, it abstracts creating sequences in a simple manner. I thought this problem would be a good use for SEQUENCE to number all the records in each XE record group, but did not get the results I expected. Here the SQL I initially wrote:

SET VARIABLE dataset_document = 'test.txt' ;

DROP TABLE IF EXISTS dataset ;

CREATE TABLE dataset
  (
    filename       VARCHAR   NOT NULL,
    record_line    UINTEGER  NOT NULL,
    record_offset  UINTEGER  NULL,
    record_group   UINTEGER  NULL,
    record_type    VARCHAR   NOT NULL,
    record_data    VARCHAR   NULL
  )
;

CREATE OR REPLACE SEQUENCE dataset_line START WITH 1 INCREMENT BY 1 MINVALUE 1 ;
CREATE OR REPLACE SEQUENCE record_group START WITH 1 INCREMENT BY 1 MINVALUE 1 ;

INSERT INTO dataset BY NAME
  SELECT
    nextval( 'dataset_line' ) AS record_line,
    upper( left( csv.record_data, 2 ) ) AS record_type,
    CASE
      WHEN record_type IN ( 'XE' ) THEN
        nextval( 'record_group' )
      WHEN record_type IN ( 'XW', 'XO', 'XS', 'XT', 'XU' ) THEN
        currval( 'record_group' )
      ELSE
        NULL
    END AS record_group,
    csv.*
  FROM
    read_csv
    (
      getvariable( 'dataset_document' ),
      delim = '\t',
      filename = true,
      header = false,
      new_line = '\r\n',
      names = [ 'record_data' ]
    ) AS csv
;

SELECT * FROM dataset ;

What the last SELECT reported was:

┌──────────┬─────────────┬───────────────┬──────────────┬─────────────┬────────────────────────────────────────────────┐
│ filename │ record_line │ record_offset │ record_group │ record_type │                  record_data                   │
│ varchar  │   uint32    │    uint32     │    uint32    │   varchar   │                    varchar                     │
├──────────┼─────────────┼───────────────┼──────────────┼─────────────┼────────────────────────────────────────────────┤
│ test.txt │           1 │               │              │ XA          │ XA182588437555052967957830707312983580301420.  │
│ test.txt │           2 │               │            1 │ XE          │ XE697135992770119806071561226874178792171695.  │
│ test.txt │           3 │               │            2 │ XW          │ XW697135992770119806071561226874178792171695.  │
│ test.txt │           4 │               │            2 │ XT          │ XT706835212857442410629300363593579965757305.  │
│ test.txt │           5 │               │            2 │ XE          │ XE697135992770119806071561226874178792171695.  │
│ test.txt │           6 │               │            2 │ XW          │ XW697135992770119806071561226874178792171695.  │
│ test.txt │           7 │               │            2 │ XO          │ XO697135992770119806071561226874178792171695.  │
│ test.txt │           8 │               │            2 │ XT          │ XT706835212857442410629300363593579965757305.  │
│ test.txt │           9 │               │            2 │ XU          │ XU706835212857442410629300363593579965757305.  │
│ test.txt │          10 │               │              │ XF          │ XF765537662619707161566341078716460324648257.  │
├──────────┴─────────────┴───────────────┴──────────────┴─────────────┴────────────────────────────────────────────────┤
│ 10 rows                                                                                                    6 columns │
└──────────────────────────────────────────────────────────────────────────────────────────────────────────────────────┘

However, what I expected was:

┌──────────┬─────────────┬───────────────┬──────────────┬─────────────┬────────────────────────────────────────────────┐
│ filename │ record_line │ record_offset │ record_group │ record_type │                  record_data                   │
│ varchar  │   uint32    │    uint32     │    uint32    │   varchar   │                    varchar                     │
├──────────┼─────────────┼───────────────┼──────────────┼─────────────┼────────────────────────────────────────────────┤
│ test.txt │           1 │               │              │ XA          │ XA182588437555052967957830707312983580301420.  │
│ test.txt │           2 │               │            1 │ XE          │ XE697135992770119806071561226874178792171695.  │
│ test.txt │           3 │               │            1 │ XW          │ XW697135992770119806071561226874178792171695.  │
│ test.txt │           4 │               │            1 │ XT          │ XT706835212857442410629300363593579965757305.  │
│ test.txt │           5 │               │            2 │ XE          │ XE697135992770119806071561226874178792171695.  │
│ test.txt │           6 │               │            2 │ XW          │ XW697135992770119806071561226874178792171695.  │
│ test.txt │           7 │               │            2 │ XO          │ XO697135992770119806071561226874178792171695.  │
│ test.txt │           8 │               │            2 │ XT          │ XT706835212857442410629300363593579965757305.  │
│ test.txt │           9 │               │            2 │ XU          │ XU706835212857442410629300363593579965757305.  │
│ test.txt │          10 │               │              │ XF          │ XF765537662619707161566341078716460324648257.  │
├──────────┴─────────────┴───────────────┴──────────────┴─────────────┴────────────────────────────────────────────────┤
│ 10 rows                                                                                                    6 columns │
└──────────────────────────────────────────────────────────────────────────────────────────────────────────────────────┘

At this point I decided to step back and reread the documentation and analyze what was happening. The documentation for the nextval function indicates you call the function to get the next value. Unfortunately, the function nextval is a misnomer. It does not return the next value from the SEQUENCE, but is essentially equivalent to the C language post increment operator, where the current value is returned and then the function increments the SEQUENCE. I feel the documentation should be clearer about what the function does.

It seem to me that the function nextval is really postincr and there is an analogous missing preincr function which is analogous to the C language pre increment operator. My mismatched expectation for the problem I was trying to solve was that the documentation and the name of the function nextval meant that it was analogous to the C language pre increment operator which would have produced the correct results.

A suggested DuckDB enhancement would be to create the preincr function and make postincr an alias to the existing nextval function. These two functions, preincr and postincr could be used for different use case scenarios. An alternate enhancement would be to abstract the knowledge of pre/post increment into the SEQUENCE statement: CREATE SEQUENCE serial START WITH 1 PRE INCREMENT BY 2 ; or CREATE SEQUENCE serial START WITH 1 POST INCREMENT BY 2 ; with the default being POST when neither is specified to remain backward compatible.

I decided that a simple correction to my initial SQL could produce the results I was looking for. So I changed the CASE statement to subtract 1 from the currval function.

    CASE
      WHEN record_type IN ( 'XE' ) THEN
        nextval( 'record_group' )
      WHEN record_type IN ( 'XW', 'XO', 'XS', 'XT', 'XU' ) THEN
        currval( 'record_group' ) - 1
      ELSE
        NULL
    END AS record_group,

My expectation on how the CASE statement would work on the dataset with this change was:

  1. record_line 1 is evaluated as NULL
  2. record_line 2 is evaluated with nextval returning the value 1 and incrementing the SEQUENCE value to 2.
  3. record_line 3 is evaluated with currval returning the value 2 minus 1 resulting in the value 1.
  4. record_line 4 is evaluated with currval returning the value 2 minus 1 resulting in the value 1.
  5. record_line 5 is evaluated with nextval returning the value 2 and incrementing the SEQUENCE value to 3.
  6. record_line 6 is evaluated with currval returning the value 3 minus 1 resulting in the value 2.
  7. record_line 7 is evaluated with currval returning the value 3 minus 1 resulting in the value 2.
  8. record_line 8 is evaluated with currval returning the value 3 minus 1 resulting in the value 2.
  9. record_line 9 is evaluated with currval returning the value 3 minus 1 resulting in the value 2.
  10. record_line 10 is evaluated asNULL.

Unfortunately, the result I got was:

┌──────────┬─────────────┬───────────────┬──────────────┬─────────────┬────────────────────────────────────────────────┐
│ filename │ record_line │ record_offset │ record_group │ record_type │                  record_data                   │
│ varchar  │   uint32    │    uint32     │    uint32    │   varchar   │                    varchar                     │
├──────────┼─────────────┼───────────────┼──────────────┼─────────────┼────────────────────────────────────────────────┤
│ test.txt │           1 │               │              │ XA          │ XA182588437555052967957830707312983580301420.  │
│ test.txt │           2 │               │            1 │ XE          │ XE697135992770119806071561226874178792171695.  │
│ test.txt │           3 │               │            1 │ XW          │ XW697135992770119806071561226874178792171695.  │
│ test.txt │           4 │               │            1 │ XT          │ XT706835212857442410629300363593579965757305.  │
│ test.txt │           5 │               │            2 │ XE          │ XE697135992770119806071561226874178792171695.  │
│ test.txt │           6 │               │            1 │ XW          │ XW697135992770119806071561226874178792171695.  │
│ test.txt │           7 │               │            1 │ XO          │ XO697135992770119806071561226874178792171695.  │
│ test.txt │           8 │               │            1 │ XT          │ XT706835212857442410629300363593579965757305.  │
│ test.txt │           9 │               │            1 │ XU          │ XU706835212857442410629300363593579965757305.  │
│ test.txt │          10 │               │              │ XF          │ XF765537662619707161566341078716460324648257.  │
├──────────┴─────────────┴───────────────┴──────────────┴─────────────┴────────────────────────────────────────────────┤
│ 10 rows                                                                                                    6 columns │
└──────────────────────────────────────────────────────────────────────────────────────────────────────────────────────┘

It appears that record_line 2-4 has record_group as expected, but record_line 5-9 is not as expected, except record_line 5 which correctly specifies record_group 2. It appears that on record_line 5, that the nextval function correctly returned the current value of the SEQUENCE, but failed to increment the SEQUENCE value, hence the currval function for record_line 6-9 returns 2 - 1 instead of 3 - 1.

OK at this point I decided to change the CASE statement to only record the change when seeing an XE record and I added a third XE group to the test.txt:

    CASE
      WHEN record_type IN ( 'XE' ) THEN
        nextval( 'record_group' )
      -- WHEN record_type IN ( 'XW', 'XO', 'XS', 'XT', 'XU' ) THEN
        -- currval( 'record_group' ) - 1
      ELSE
        NULL
    END AS record_group,

This produced the following results which seems to indicate that the function nextval is working as expected and there is some oddity with the currval function.

┌──────────┬─────────────┬───────────────┬──────────────┬─────────────┬────────────────────────────────────────────────┐
│ filename │ record_line │ record_offset │ record_group │ record_type │                  record_data                   │
│ varchar  │   uint32    │    uint32     │    uint32    │   varchar   │                    varchar                     │
├──────────┼─────────────┼───────────────┼──────────────┼─────────────┼────────────────────────────────────────────────┤
│ test.txt │           1 │               │              │ XA          │ XA182588437555052967957830707312983580301420.  │
│ test.txt │           2 │               │            1 │ XE          │ XE697135992770119806071561226874178792171695.  │
│ test.txt │           3 │               │              │ XW          │ XW697135992770119806071561226874178792171695.  │
│ test.txt │           4 │               │              │ XT          │ XT706835212857442410629300363593579965757305.  │
│ test.txt │           5 │               │            2 │ XE          │ XE697135992770119806071561226874178792171695.  │
│ test.txt │           6 │               │              │ XW          │ XW697135992770119806071561226874178792171695.  │
│ test.txt │           7 │               │              │ XO          │ XO697135992770119806071561226874178792171695.  │
│ test.txt │           8 │               │              │ XT          │ XT706835212857442410629300363593579965757305.  │
│ test.txt │           9 │               │              │ XU          │ XU706835212857442410629300363593579965757305.  │
│ test.txt │          10 │               │            3 │ XE          │ XE697135992770119806071561226874178792171695.  │
│ test.txt │          11 │               │              │ XW          │ XW697135992770119806071561226874178792171695.  │
│ test.txt │          12 │               │              │ XT          │ XT706835212857442410629300363593579965757305.  │
│ test.txt │          13 │               │              │ XF          │ XF765537662619707161566341078716460324648257.  │
├──────────┴─────────────┴───────────────┴──────────────┴─────────────┴────────────────────────────────────────────────┤
│ 13 rows                                                                                                    6 columns │
└──────────────────────────────────────────────────────────────────────────────────────────────────────────────────────┘

My understanding of SEQUENCE, nextval and currval seems way off and I would welcome comments explaining where I'm going off the trail and if possible an alternate SQL way to assign these groups. My current thinking is that this might be solved by using a gaps/islands approach.

BTW, also another suggested enhancement, it would be nice to have an import/export for positional data allowing the options filename, new_line, compression, etc. like the CSV import/export. Could use a map or list to specify the column names, positions and datatypes.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

1 participant