Uncoalesced table handling #2262

niloc132 · 2024-10-22T14:48:37Z

Presently, the web UI seems to share too much between PartitionedTables and uncoalesced Tables. Recapping briefly, a PartitionedTable is a kind of widget, while a Table that is uncoalesced has no size and returns true for isUncoalesced. The distinction is meaningful, in that a PartitionedTable is a Table that has since been partitioned, while an uncoalesced Table represents many tables on disk that have not been read yet and should not be read until the user either selects a partition or explicitly requests that the table be coalesced.

From #2066

I'm not certain that PartitionedTable and PartitionAwareSourceTable should be displayed in the same way.

From #2049

If I have a PartitionAwareSourceTable, I get the partitioned table viewer in the UI, which is very helpful, but a bit confusing because it tells me I have a "Partitioned Table"...

Note that some uncoalesced tables have zero partition columns - in this case, they can and should be immediately coalesced, as there is only one table on disk (see #1763 and I think #1904).

Testing to try:

Multiple parquet partitions

This will produce an uncoalesced table with two partitions and one partition column. Data should not be visible until a partition is selected.

from deephaven.parquet import write, read
from deephaven import empty_table
t1 = empty_table(10).update_view(['I=i', 'Partition=0'])
t2 = empty_table(10).update_view(['I=i', 'Partition=1'])

write(t2, '/tmp/tableA/P=1/p1.parquet')
write(t1, '/tmp/tableA/P=0/p0.parquet')
A = read('/tmp/tableA')

No partition columns (and so, only one partition)

This will produce an uncoalesced table with one partition and no partition columns. Data should be visible right away, with no partition selector.

from deephaven.parquet import write, read
from deephaven import empty_table
t = empty_table(10).update_view(['I=i', 'Partition=0'])

write(t, '/tmp/tableB.parquet')
B = read('/tmp/tableB.parquet')

Empty table, multiple partitions

I'm not entirely sure what we expect here, but we should not coalesce: there could be many empty partitions, and reading them is expensive even though they are empty (consider the case where each one is on S3 and needs its own round trip). We probably should inform the user that there are no partitions, but not subscribe to the data itself.

This should probably show a partition selector, even though there are (presently) no partitions to select?

from deephaven.parquet import write, read
from deephaven import empty_table
t1 = empty_table(0).update_view(['I=i', 'Partition=0'])
t2 = empty_table(0).update_view(['I=i', 'Partition=1'])

write(t2, '/tmp/tableC/P=1/p1.parquet')
write(t1, '/tmp/tableC/P=0/p0.parquet')
C = read('/tmp/tableC')
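
As a rough illustration of why a partition selector does not need the data itself: in the `P=<value>` directory layout used above, the partition keys are recoverable from directory names alone. The sketch below is plain Python, not Deephaven's implementation; `list_partition_keys` is a hypothetical helper, and the `.parquet` files it creates are empty placeholders in a temporary directory.

```python
import tempfile
from pathlib import Path


def list_partition_keys(root):
    """Return sorted (column, value) pairs parsed from key=value directory names.

    Only directory names are inspected; no data file is opened.
    """
    keys = []
    for child in Path(root).iterdir():
        if child.is_dir() and "=" in child.name:
            column, _, value = child.name.partition("=")
            keys.append((column, value))
    return sorted(keys)


# Build a stand-in for the /tmp/tableC layout in a temporary directory.
# The .parquet files are empty placeholders, not real parquet data.
with tempfile.TemporaryDirectory() as tmp:
    root = Path(tmp) / "tableC"
    for value in ("0", "1"):
        part_dir = root / f"P={value}"
        part_dir.mkdir(parents=True)
        (part_dir / f"p{value}.parquet").touch()
    partition_keys = list_partition_keys(root)

print(partition_keys)
```

Listing keys this way costs one directory listing, whereas coalescing would touch every file, which is the expense the empty-partition case is meant to avoid.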

Empty table, no partition columns

In this case, as with the other single-partition case, we can simply subscribe. The important distinction here is that while the table is still empty (as in the previous case), there are no partition columns.

from deephaven.parquet import write, read
from deephaven import empty_table
t = empty_table(0).update_view(['I=i', 'Partition=0'])

write(t, '/tmp/tableD.parquet')
D = read('/tmp/tableD.parquet')

Partitioned columns exist, only one partition

While we still don't want to coalesce, we can display the initial partition, so data will be visible. I believe this is how we already handle cases where many partitions are present, so this case is only here to validate that it still works.

from deephaven.parquet import write, read
from deephaven import empty_table
t1 = empty_table(10).update_view(['I=i', 'Partition=0'])

# Write only one partition directory, so the table has a single partition
write(t1, '/tmp/tableE/P=0/p0.parquet')
E = read('/tmp/tableE')

Partitioned columns exist, no partitions

I'm not sure if this case is really possible at this time with parquet.

from deephaven.parquet import write, read
from deephaven import empty_table
import os

# Create an empty directory with no partition subdirectories or files
os.makedirs('/tmp/tableF', exist_ok=True)
F = read('/tmp/tableF')

This results in an error.

PartitionedTable widget

This could be for zero to many partitions, though a PartitionedTable always has at least one partitioned column (aka "key columns").

from deephaven import empty_table, merge
t0 = empty_table(0).update_view(['I=i', 'Partition=0'])
t1 = empty_table(10).update_view(['I=i', 'Partition=0'])
t2 = empty_table(10).update_view(['I=i', 'Partition=1'])
t3 = merge([t1,t2])

# no partitions, no data - but perhaps should still show the partition selector?
A = t0.partition_by('Partition')
# one partition
B = t1.partition_by('Partition')
# multiple partitions
C = t3.partition_by('Partition')
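
The uncoalesced-table cases above can be summarized as a small decision sketch. This is only one reading of the issue, not the web UI's actual code: `ViewPlan` and `plan_uncoalesced_table` are hypothetical names, and showing the selector for empty or zero-partition tables is still an open question in the text.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ViewPlan:
    coalesce: bool                 # read and merge all underlying data right away
    show_partition_selector: bool  # let the user pick a partition first


def plan_uncoalesced_table(partition_column_count):
    """Hypothetical helper: choose a plan from the number of partition columns."""
    if partition_column_count < 0:
        raise ValueError("partition_column_count must be non-negative")
    if partition_column_count == 0:
        # No partition columns means a single table on disk: coalesce and subscribe.
        return ViewPlan(coalesce=True, show_partition_selector=False)
    # With partition columns, never coalesce implicitly, even if every partition is empty.
    return ViewPlan(coalesce=False, show_partition_selector=True)


print(plan_uncoalesced_table(0))
print(plan_uncoalesced_table(1))
```

The plan deliberately ignores row counts and partition counts, since the empty-table cases are meant to show that neither should trigger a coalesce.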


dsmmcken · 2024-10-22T15:09:50Z

I'm not certain that PartitionedTable and PartitionAwareSourceTable should be displayed in the same way.

Note that #2066 and #2049 were addressed in #2079 after discussion with @rbasralian and @rcaudy; there are differences in how the two are displayed.

niloc132 · 2024-10-22T15:13:49Z

Unfortunately, the "Merge" vs "Uncoalesce" button text difference isn't working, at least in my testing. The uncoalesced tables are also briefly loading the entire coalesced contents, which is incredibly expensive in some cases.

mofojed · 2024-10-22T15:17:31Z

the uncoalesced tables are briefly loading the entire coalesced contents, which is incredibly expensive in some cases.

This part we definitely need to fix. There may be some more discussion about some of the other behaviours.

niloc132 added the "bug" label (Something isn't working) Oct 22, 2024

mofojed added this to the October 2024 milestone Oct 22, 2024
