Releases: root-11/tablite
Fixes for xls get_headers inconsistencies
Fix issues with get_headers
Falls back to text reading when reading 0 lines of an Excel file; fixes an issue where reading an Excel file would ignore the file count; the Excel file reader now has parity for line count selection.
Fixed get_headers linecount issue
Fixed a logic bug in get_headers that caused one more line to be returned than requested.
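An off-by-one of this kind is easy to introduce when capping the number of lines read. The sketch below is not tablite's get_headers code; `read_first_lines` is a hypothetical helper that shows one way to return exactly the requested number of lines, including the zero-line case.

```python
import io
from itertools import islice


def read_first_lines(stream, linecount):
    """Return at most `linecount` lines from a text stream (hypothetical helper)."""
    if linecount < 0:
        raise ValueError("linecount must be non-negative")
    # islice stops after exactly `linecount` items, which avoids the
    # "<=" versus "<" counter mistake that yields one line too many.
    return [line.rstrip("\n") for line in islice(stream, linecount)]


sample = io.StringIO("a,b\n1,2\n3,4\n5,6\n")
first_two = read_first_lines(sample, 2)
```

Asking for two lines yields the header and the first data row only.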
Fix refcount for deep table copies
Updated the way reference counting works. Tablite now tracks references to used pages and cleans them up based on the number of references to those pages in the current process. This change makes it possible to handle deep table clones when sending tables between processes (pickling/unpickling). The previous implementation would corrupt all tables using the same pages, because reference counting assumed that all tables were shallow copies of the same object.
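The idea of counting page references per process can be sketched as follows. This is an illustration only, not tablite's implementation: `PageRegistry`, its methods and the `deleted` list are hypothetical names, and appending to `deleted` stands in for removing a page file from disk.

```python
from collections import Counter


class PageRegistry:
    """Hypothetical per-process registry counting references to pages."""

    def __init__(self):
        self._refs = Counter()
        self.deleted = []  # records pages that were cleaned up

    def acquire(self, page_id):
        self._refs[page_id] += 1

    def release(self, page_id):
        if self._refs[page_id] <= 0:
            raise KeyError(f"page {page_id!r} is not referenced")
        self._refs[page_id] -= 1
        if self._refs[page_id] == 0:
            del self._refs[page_id]
            self.deleted.append(page_id)  # stand-in for deleting the page file

    def count(self, page_id):
        return self._refs[page_id]


registry = PageRegistry()
registry.acquire("page-1")  # original table
registry.acquire("page-1")  # unpickled deep copy in the same process
registry.release("page-1")  # the copy goes away; the page is kept
still_alive = registry.count("page-1")
registry.release("page-1")  # last reference gone; the page is cleaned up
```

A page is only cleaned up once the last table referring to it in this process has released it, so a deep copy no longer invalidates the original.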
Update mplite dependency
Updated the mplite dependency and changed it to a soft version requirement to prevent pipeline freezes due to small bugfixes in mplite.
2023.6.1 - Breaking release with ~6x speed up
This release, 2023.6.1, addresses #61 with a significant number of changes and features a ~6x speedup.
This release note summarizes the changes from 2023.6.dev1 through 2023.6.dev14.
For a quick summary, view the tutorial.
Changes
A new Config class is available in config.py, which sets:
- Config.page_size = 1_000_000
- Config.workdir = tempfile.gettempdir()
- Config.DISABLE_TQDM: as a boolean switch to silence tqdm.
The workdir uses ///pages for managing pages, so that different Python processes don't interact with the same storage location. Previously, two Python processes would come into conflict over the same tablite.h5 file.
If the Python process is SIGKILLed, the temp folder will linger. If the Python process exits (SIGINT), Python will clean up the temp folders.
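The difference between the two exit paths follows from how Python runs cleanup handlers. The sketch below uses only the standard library and is not tablite's code: `make_process_workdir` and the `pages-<pid>-` folder prefix are assumptions made for illustration, not tablite's actual directory layout.

```python
import atexit
import os
import shutil
import tempfile
from pathlib import Path


def make_process_workdir(base=None):
    """Create a private page folder for this process (hypothetical layout)."""
    base = Path(base or tempfile.gettempdir())
    # mkdtemp guarantees a unique folder; the pid prefix is only for readability.
    folder = Path(tempfile.mkdtemp(prefix=f"pages-{os.getpid()}-", dir=base))
    # atexit handlers run on normal exit and after an unhandled
    # KeyboardInterrupt (SIGINT), but never after SIGKILL.
    atexit.register(shutil.rmtree, folder, ignore_errors=True)
    return folder


workdir = make_process_workdir()
```

Because SIGKILL terminates the interpreter without running any handlers, a folder created this way stays behind until it is removed by hand or by the operating system's temp cleanup.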
Table now accepts the keyword columns as a dict in init:
t = Table(columns={'b':[4,5,6], 'c':[7,8,9]})
Table now accepts header/data combinations in init:
t = Table(header=['b','c'], data=[[4,5,6],[7,8,9]])
With these features it is no longer necessary to write:
t = Table()
t['b'] = [4,5,6]
t['c'] = [7,8,9]
The preferred approach for subclassing tables is:
class MyTable(tablite.Table):
    def __init__(self, *args, **kwargs):
        super().__init__(*args, **kwargs)
        self.x = kwargs.get("x", 42)  # <== special variable required on MyTable.

    def copy(self):
        # tablite.Table implements
        #   new = cls()  # self.x is now 42 !!!
        #   for name, column in self.items():
        #       new[name] = column
        # MyTable therefore implements:
        cp = super(MyTable, self).copy()
        cp.x = self.x  # updating to real x.
        return cp
Replace has been refactored from using a single value to using a mapping, for example:
>>> t = Table(columns={'A': [1,2,3,4]})
>>> t['A'].replace({2:20,4:40})
>>> t[:]
np.ndarray([1,20,3,40])
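The semantics of a mapping-based replace can be illustrated on a plain Python list. This is a sketch of the behaviour only, not tablite's Column.replace; `replace_values` is a hypothetical helper.

```python
def replace_values(values, mapping):
    """Return a new list with each value swapped according to `mapping`."""
    # Values without an entry in the mapping are left unchanged.
    return [mapping.get(value, value) for value in values]


replaced = replace_values([1, 2, 3, 4], {2: 20, 4: 40})
```

A mapping lets several substitutions happen in a single pass over the data, where a single-value replace would need one pass per value.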
- Performance benchmarks have been added.
- Table.head has been moved to tablite.tools, where it belongs.
- histogram now returns two lists instead of a dict, as Python treats True and 1 as identical.
- Table.save now defaults to zipfile.ZIP_DEFLATED, compression_level=1, as this is only 10% slower yet saves 80% disk space.
- Column.iter_by_page: an iterator for traversing data by page.
- The module reindex implements a constant-memory method for re-indexing tasks, which is now used by Lookup, Join, Sort and Filter.
- New function unique_index, which allows drop_duplicates and sortation to run with a constant memory footprint.
- get_headers now has text_qualifier to match importing.
- Table.types() is practically instantaneous as type information is kept on Pages.
- The dev9 feature Table.dtypes() is deprecated as it almost duplicates the functionality of .types.
- Multiprocessing has been disabled for algorithms that guarantee a constant memory footprint.
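The reason a dict is a poor return type for a histogram can be shown in plain Python. The sketch below is not tablite's histogram implementation; keying on `(type, value)` is just one assumed way to keep the two values apart before returning parallel lists.

```python
values = [True, 1, False, 0, 1]

# A dict cannot keep True and 1 apart: they hash alike and compare equal.
as_dict = {}
for value in values:
    as_dict[value] = as_dict.get(value, 0) + 1

# Keying on (type, value) and returning two parallel lists keeps them distinct.
counts = {}
for value in values:
    key = (type(value), value)
    counts[key] = counts.get(key, 0) + 1

uniques = [value for (_, value) in counts]
frequencies = list(counts.values())
```

Here `as_dict` collapses five values into two keys, while `uniques` and `frequencies` report four distinct entries.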
Deprecated
The following assignment method is DEPRECATED:
t = Table()
t[('b','c')] = [ [4,5,6], [7,8,9] ]
Which then produced the table with two columns:
t['b'] == [4,5,6]
t['c'] == [7,8,9]
- copy_to_clipboard and copy_from_clipboard, as pyperclip doesn't seem to be maintained.
- class method from_dict, as Table(columns=dict) now is supported.
- Insert and append have been removed to discourage operations that are IO-intensive.
- Column.insert is being removed as it encourages the user to use slow operations. It is better to perform the data manipulation in memory and drop the result into the column using col.extend(....) or col[a:b] = [result].
- reload_saved_tables. Use t.save(path) and Table.load(path) instead.
- reset_storage is deprecated as all tables not explicitly saved are considered temporary/volatile.
- from_dict is deprecated as Table(columns={dict}, ...) is accepted.
- to_numpy. By default, table['name'] returns a numpy array. Users should call table['name'].tolist() to get Python lists, which is up to 6x slower than just retrieving the numpy array.
Everything else remains.
Bugfix for join & filter
This is a bugfix release. No API changes.
Bugfix for sort
No api changes.
This release solves bug #71
Table.dtypes removed.
Omission in the previous RC.
instant Table.types()
- Table.types() is practically instantaneous as type information is kept on Pages.
- Table.dtypes() is deprecated as it almost duplicates the functionality of .types.
get_headers improvements
get_headers now has text_qualifier to match importing.
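Why a text qualifier matters when peeking at headers can be shown with the standard library's csv module. This is a sketch, not tablite's get_headers: `peek_header` is a hypothetical helper, and only the parameter name `text_qualifier` is taken from the note above.

```python
import csv
import io
from itertools import islice


def peek_header(text, delimiter=",", text_qualifier='"', linecount=1):
    """Parse the first `linecount` rows, honouring the text qualifier."""
    reader = csv.reader(
        io.StringIO(text), delimiter=delimiter, quotechar=text_qualifier
    )
    return list(islice(reader, linecount))


rows = peek_header('"last, first",age\n"Doe, Jane",42\n')
```

Without the qualifier, the comma inside the quoted column name would be read as a delimiter and the header would come back with three fields instead of two.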