2023.6.1 - Breaking release with ~6x speed up
This release (2023.6.1) addresses #61 with a significant number of changes and a ~6x speedup.
This release note summarizes the changes from 2023.6.dev1 through 2023.6.dev14.
For a quick summary, see the tutorial.
Changes
A new `Config` class is available in `config.py`, which sets:
- `Config.page_size = 1_000_000`
- `Config.workdir = tempfile.gettempdir()`
- `Config.DISABLE_TQDM`: a boolean switch to silence tqdm.
The workdir stores pages in a process-specific subfolder (`.../pages`), so that different Python processes don't interact with the same storage location. Previously, two Python processes would conflict on the same `tablite.h5` file.
If the Python process is killed with SIGKILL, the temp folder will be left behind. If the Python process exits normally (including on SIGINT), Python will clean up the temp folders.
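A minimal sketch of adjusting these settings, assuming the attribute names listed above and that the class is importable from `tablite.config` (casing and import path may differ in your version):

```python
import tempfile
from pathlib import Path

from tablite.config import Config  # assumed import path; the note says the class lives in config.py

Config.page_size = 500_000                    # attribute names as listed above
Config.workdir = Path(tempfile.gettempdir())  # the default location shown above
Config.DISABLE_TQDM = True                    # silence tqdm progress bars
```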
`Table` now accepts the keyword `columns` as a dict in `__init__`:

```python
t = Table(columns={'b': [4,5,6], 'c': [7,8,9]})
```
`Table` now accepts header/data combinations in `__init__`:

```python
t = Table(header=['b','c'], data=[[4,5,6],[7,8,9]])
```
With these features it is no longer necessary to write:

```python
t = Table()
t['b'] = [4,5,6]
t['c'] = [7,8,9]
```
The preferred approach for subclassing tables is:

```python
import tablite

class MyTable(tablite.Table):
    def __init__(self, *args, **kwargs):
        super().__init__(*args, **kwargs)
        self.x = kwargs.get("x", 42)  # <== special variable required on MyTable.

    def copy(self):
        # tablite.Table implements:
        #     new = cls()  # self.x is now 42 !!!
        #     for name, column in self.items():
        #         new[name] = column
        # MyTable therefore implements:
        cp = super(MyTable, self).copy()
        cp.x = self.x  # updating to the real x.
        return cp
```
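A short usage sketch of the hypothetical `MyTable` subclass above:

```python
mt = MyTable(columns={'b': [4, 5, 6]})  # x defaults to 42
mt.x = 7                                # set the subclass-specific attribute
cp = mt.copy()
assert cp.x == 7                        # the copy() override carries x across
```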
`replace` has been refactored from using a single value to a mapping. Example:

```python
>>> t = Table(columns={'A': [1,2,3,4]})
>>> t['A'].replace({2: 20, 4: 40})
>>> t['A'][:]
array([ 1, 20,  3, 40])
```
- Performance benchmarks have been added.
- `Table.head` has been moved to `tablite.tools` where it belongs.
- `histogram` now returns two lists instead of a dict, as Python treats `True` and `1` as identical.
- `Table.save` now defaults to `zipfile.ZIP_DEFLATED` with `compression_level=1`, as this is only 10% slower yet saves 80% disk space (see the sketch after this list).
- `Column.iter_by_page`: an iterator for traversing data by page.
- The module `reindex` implements a constant-memory method for re-indexing tasks, which is now used by Lookup, Join, Sort and Filter.
- A new function `unique_index` allows `drop_duplicates` and sortation to run with a constant memory footprint.
- `get_headers` now has `text_qualifier` to match importing.
- `Table.types()` is practically instantaneous, as type information is kept on Pages (a dev9 feature). `Table.dtypes()` is deprecated as it almost duplicates the functionality of `.types()`.
- Multiprocessing has been disabled for algorithms that guarantee a constant memory footprint.
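As an illustration of the save/load and `types()` items above, a minimal round trip; the file name is hypothetical and the output format of `types()` is not shown here:

```python
from pathlib import Path
from tablite import Table

t = Table(columns={'A': [1, 2, 3, 4]})
path = Path("example.tpz")  # hypothetical file name/extension
t.save(path)                # now zipfile.ZIP_DEFLATED with compression_level=1 by default
t2 = Table.load(path)
print(t2.types())           # fast: type information is stored on the pages
```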
Deprecated
The following assignment method is DEPRECATED:
```python
t = Table()
t[('b','c')] = [ [4,5,6], [7,8,9] ]
```

which then produced a table with two columns:

```python
t['b'] == [4,5,6]
t['c'] == [7,8,9]
```
Further deprecations and removals:

- `copy_to_clipboard` and `copy_from_clipboard`, as `pyperclip` doesn't seem to be maintained.
- The class method `from_dict`, as `Table(columns=dict)` is now supported.
- Insert and append have been removed to discourage operations that are IO-intensive. `Column.insert` is being removed as it encourages the user to use slow operations. It is better to perform the data manipulation in memory and drop the result into the column using `col.extend(...)` or `col[a:b] = [result]`.
- `reload_saved_tables`. Use `t.save(path)` and `Table.load(path)` instead.
- `reset_storage` is deprecated, as all tables not explicitly saved are considered temporary/volatile.
- `from_dict` is deprecated, as `Table(columns={dict}, ...)` is accepted.
- `to_numpy`. By default `table['name']` returns a numpy array. Users should call `table['name'].tolist()` to get Python lists, which is up to 6x slower than just retrieving the numpy array (see the sketch below).
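A brief sketch of the preferred replacements, following the notes above (column access behaviour as described in the `to_numpy` note):

```python
from tablite import Table

# instead of from_dict or the deprecated t[('b','c')] = ... assignment:
t = Table(columns={'b': [4, 5, 6], 'c': [7, 8, 9]})

arr = t['b']           # returns a numpy array by default (per the note above)
lst = t['b'].tolist()  # python list, up to ~6x slower than the numpy array
```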
Everything else remains unchanged.