I have a text dataset that contains one record per line, and each line holds positional (fixed-width) information. Each record has a type, which is its first 2 characters. The records in the dataset are ordered and grouped. The first record is always XA and the last record is always XE. In between are one or more groups of records, each starting with XE and containing records XW, XO, XS, XT, XU.
I like DuckDB's SEQUENCE; it abstracts creating sequences in a simple manner. I thought this problem would be a good use for SEQUENCE to number all the records in each XE record group, but I did not get the results I expected. Here is the SQL I initially wrote:
    SET VARIABLE dataset_document = 'test.txt' ;

    DROP TABLE IF EXISTS dataset ;

    CREATE TABLE dataset
    (
        filename      VARCHAR  NOT NULL,
        record_line   UINTEGER NOT NULL,
        record_offset UINTEGER NULL,
        record_group  UINTEGER NULL,
        record_type   VARCHAR  NOT NULL,
        record_data   VARCHAR  NULL
    )
    ;

    CREATE OR REPLACE SEQUENCE dataset_line START WITH 1 INCREMENT BY 1 MINVALUE 1 ;
    CREATE OR REPLACE SEQUENCE record_group START WITH 1 INCREMENT BY 1 MINVALUE 1 ;

    INSERT INTO dataset BY NAME
    SELECT
        nextval( 'dataset_line' ) AS record_line,
        upper( left( csv.record_data, 2 ) ) AS record_type,
        CASE
            WHEN record_type IN ( 'XE' ) THEN
                nextval( 'record_group' )
            WHEN record_type IN ( 'XW', 'XO', 'XS', 'XT', 'XU' ) THEN
                currval( 'record_group' )
            ELSE
                NULL
        END AS record_group,
        csv.*
    FROM
        read_csv
        (
            getvariable( 'dataset_document' ),
            delim = '\t',
            filename = true,
            header = false,
            new_line = '\r\n',
            names = [ 'record_data' ]
        ) AS csv
    ;

    SELECT * FROM dataset ;
The output of the last SELECT did not match what I expected, so at this point I decided to step back, reread the documentation, and analyze what was happening. The documentation for the nextval function indicates that you call the function to get the next value. Unfortunately, the name nextval is a misnomer. It does not return the next value from the SEQUENCE; it is essentially equivalent to the C language post-increment operator, where the current value is returned and then the function increments the SEQUENCE. I feel the documentation should be clearer about what the function does.
It seems to me that the function nextval is really postincr, and that a matching preincr function, analogous to the C language pre-increment operator, is missing. For the problem I was trying to solve, I had assumed from the documentation and from the name nextval that the function behaved like the C language pre-increment operator, which would have produced the correct results.
A suggested DuckDB enhancement would be to create the preincr function and make postincr an alias for the existing nextval function. The two functions, preincr and postincr, could then be used for different use cases. An alternate enhancement would be to move the pre/post increment choice into the SEQUENCE statement: CREATE SEQUENCE serial START WITH 1 PRE INCREMENT BY 2 ; or CREATE SEQUENCE serial START WITH 1 POST INCREMENT BY 2 ; with POST being the default when neither is specified, to remain backward compatible.
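To make the pre/post distinction in the proposal concrete, here is a minimal Python sketch. It models only the interpretation described in this post, not DuckDB's actual SEQUENCE implementation; the class `CounterModel` and its methods `postincr` and `preincr` are hypothetical names taken from the suggestion above.

```python
class CounterModel:
    """Toy counter illustrating the proposal; NOT DuckDB's SEQUENCE."""

    def __init__(self, start=1, increment=1):
        self.value = start
        self.increment = increment

    def postincr(self):
        # Return the stored value, then advance it (like C's i++).
        result = self.value
        self.value += self.increment
        return result

    def preincr(self):
        # Advance the stored value first, then return it (like C's ++i).
        self.value += self.increment
        return self.value


# Both counters mirror "START WITH 1 ... INCREMENT BY 2" from the proposal.
post = CounterModel(start=1, increment=2)
pre = CounterModel(start=1, increment=2)

post_values = [post.postincr() for _ in range(3)]
pre_values = [pre.preincr() for _ in range(3)]

print(post_values)  # [1, 3, 5]
print(pre_values)   # [3, 5, 7]
```

With the same START and INCREMENT, the two variants differ only in whether the first returned value is the start value or the start value plus one increment.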
I decided that a simple correction to my initial SQL could produce the results I was looking for, so I changed the CASE statement to subtract 1 from the result of the currval function.

        CASE
            WHEN record_type IN ( 'XE' ) THEN
                nextval( 'record_group' )
            WHEN record_type IN ( 'XW', 'XO', 'XS', 'XT', 'XU' ) THEN
                currval( 'record_group' ) - 1
            ELSE
                NULL
        END AS record_group,
My expectation on how the CASE statement would work on the dataset with this change was:
record_line 1 is evaluated as NULL.
record_line 2 is evaluated with nextval returning the value 1 and incrementing the SEQUENCE value to 2.
record_line 3 is evaluated with currval returning the value 2 minus 1 resulting in the value 1.
record_line 4 is evaluated with currval returning the value 2 minus 1 resulting in the value 1.
record_line 5 is evaluated with nextval returning the value 2 and incrementing the SEQUENCE value to 3.
record_line 6 is evaluated with currval returning the value 3 minus 1 resulting in the value 2.
record_line 7 is evaluated with currval returning the value 3 minus 1 resulting in the value 2.
record_line 8 is evaluated with currval returning the value 3 minus 1 resulting in the value 2.
record_line 9 is evaluated with currval returning the value 3 minus 1 resulting in the value 2.
record_line 10 is evaluated as NULL.
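The expectation for record_line 1 through 9 can be written out as a small Python simulation. This is only a model of the row-by-row, post-increment reading described in this post, not of how DuckDB evaluates the CASE statement. The function name `expected_groups` is hypothetical, and the specific member types chosen for lines 3, 4 and 6 to 9 are assumptions, since the contents of test.txt are not shown.

```python
MEMBER_TYPES = {"XW", "XO", "XS", "XT", "XU"}


def expected_groups(record_types):
    """Walk the records in order under the assumed post-increment model."""
    stored = 1  # SEQUENCE record_group START WITH 1
    groups = []
    for record_type in record_types:
        if record_type == "XE":
            # Assumed nextval: return the stored value, then increment it.
            groups.append(stored)
            stored += 1
        elif record_type in MEMBER_TYPES:
            # Assumed currval( 'record_group' ) - 1.
            groups.append(stored - 1)
        else:
            groups.append(None)  # ELSE NULL
    return groups


# Assumed record types for record_line 1..9 (hypothetical sample data).
sample_types = ["XA", "XE", "XW", "XO", "XE", "XW", "XO", "XS", "XT"]
print(expected_groups(sample_types))
# [None, 1, 1, 1, 2, 2, 2, 2, 2]
```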
Unfortunately, that is not the result I got. record_line 2-4 have the expected record_group, and record_line 5 correctly gets record_group 2, but record_line 6-9 do not have the expected value. It appears that on record_line 5 the nextval function correctly returned the current value of the SEQUENCE but failed to increment the SEQUENCE value, hence the currval function for record_line 6-9 returns 2 - 1 instead of 3 - 1.
At this point I decided to change the CASE statement to assign a group only when seeing an XE record, and I added a third XE group to test.txt:
        CASE
            WHEN record_type IN ( 'XE' ) THEN
                nextval( 'record_group' )
            -- WHEN record_type IN ( 'XW', 'XO', 'XS', 'XT', 'XU' ) THEN
            --     currval( 'record_group' ) - 1
            ELSE
                NULL
        END AS record_group,
This produced results that seem to indicate that the function nextval is working as expected and that there is some oddity with the currval function.
My understanding of SEQUENCE, nextval and currval seems way off, and I would welcome comments explaining where I'm going off the trail and, if possible, an alternate SQL way to assign these groups. My current thinking is that this might be solved by using a gaps-and-islands approach.
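As a rough sketch of what a gaps-and-islands approach could look like, the group number can be computed as a running count of XE records ordered by record_line, with no SEQUENCE involved. To keep the example self-contained it runs against SQLite through Python's standard `sqlite3` module rather than DuckDB. The query uses only a standard `SUM(...) OVER (ORDER BY ...)` window, which I would expect to carry over to DuckDB, but I have not verified that here. The sample rows are assumptions, and the CTE name `numbered` and column `xe_seen` are hypothetical.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE dataset (record_line INTEGER, record_type TEXT)")

# Assumed sample: header XA, then two XE groups (hypothetical data).
sample_rows = [
    (1, "XA"), (2, "XE"), (3, "XW"), (4, "XO"), (5, "XE"),
    (6, "XW"), (7, "XO"), (8, "XS"), (9, "XT"),
]
conn.executemany("INSERT INTO dataset VALUES (?, ?)", sample_rows)

# Running count of XE records seen so far = group number of the current row.
query = """
WITH numbered AS (
    SELECT
        record_line,
        record_type,
        SUM(CASE WHEN record_type = 'XE' THEN 1 ELSE 0 END)
            OVER (ORDER BY record_line) AS xe_seen
    FROM dataset
)
SELECT
    record_line,
    record_type,
    CASE
        WHEN record_type IN ('XE', 'XW', 'XO', 'XS', 'XT', 'XU') THEN xe_seen
        ELSE NULL
    END AS record_group
FROM numbered
ORDER BY record_line
"""

result = conn.execute(query).fetchall()
groups = [record_group for _, _, record_group in result]
print(groups)  # [None, 1, 1, 1, 2, 2, 2, 2, 2]
```

Because the group number depends only on the explicit ORDER BY in the window, it does not rely on the order in which the engine happens to evaluate rows.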
BTW, another suggested enhancement: it would be nice to have an import/export for positional data that accepts options such as filename, new_line, compression, etc., like the CSV import/export. It could use a map or list to specify the column names, positions and datatypes.
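To illustrate the kind of column specification meant here, the following Python sketch slices one fixed-width line using a list of (name, start, length, type) entries. Everything in it is hypothetical: the field names, positions, sample line and the helper `parse_positional` are invented for illustration and do not describe an existing DuckDB feature or the real layout of my dataset.

```python
# Hypothetical layout: (column name, start offset, length, converter).
FIELD_SPEC = [
    ("record_type", 0, 2, str),
    ("account", 2, 6, int),
    ("label", 8, 10, str),
]


def parse_positional(line, spec):
    """Slice one fixed-width line into a dict according to the spec."""
    record = {}
    for name, start, length, convert in spec:
        raw = line[start:start + length].strip()
        # Blank fields become None, similar to NULL in SQL.
        record[name] = convert(raw) if raw else None
    return record


parsed = parse_positional("XW000042Widget    ", FIELD_SPEC)
print(parsed)
# {'record_type': 'XW', 'account': 42, 'label': 'Widget'}
```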