
Releases: root-11/tablite

Fixes for xls get_headers inconsistencies

18 Aug 05:15
c020586

Fixed get_headers falling back to text reading when reading 0 lines of an Excel file. Fixed an issue where reading an Excel file would ignore file count. The Excel file reader now has parity for linecount selection.

Fixed get_headers linecount issue

16 Aug 08:52
2d5b7f2

Fixed a logic bug in get_headers that caused one more line to be returned than requested.
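
As an illustration only (this is not tablite's get_headers code), an off-by-one of this kind can be avoided by letting itertools.islice stop the iteration after exactly the requested number of lines. The helper name read_first_lines and its parameters are hypothetical.

```python
from itertools import islice


def read_first_lines(path, linecount, encoding="utf-8"):
    """Return at most `linecount` lines from a text file (hypothetical helper)."""
    with open(path, encoding=encoding) as f:
        # islice stops after `linecount` items, so linecount=0 yields no lines
        # and linecount=3 yields exactly three (or fewer if the file is shorter).
        return [line.rstrip("\n") for line in islice(f, linecount)]
```

Because the limit is applied by islice rather than by a hand-written counter, there is no loop condition that can run one iteration too far.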

Fix refcount for deep table copies

14 Aug 11:05
53bb033

Updated the way reference counting works. Tablite now tracks references to used pages and cleans them up based on the number of references to those pages in the current process. This change makes it possible to handle deep table clones when sending tables between processes (pickling/unpickling), whereas the previous implementation would corrupt all tables using the same pages, because reference counting assumed that all tables were shallow copies of the same object.

Update mplite dependency

10 Aug 12:23
a1553a9

Updated the mplite dependency and changed it to a soft version requirement to prevent pipeline freezes due to small bugfixes in mplite.

2023.6.1 - Breaking release with ~6x speed up

01 Aug 08:53

Release 2023.6.1 addresses #61 with a significant number of changes and features a ~6x speedup.
This release note summarizes the changes from 2023.6.dev1 through 2023.6.dev14.

For a quick summary, view the tutorial.

Changes

A new Config class is available in config.py, which sets:

  • Config.page_size = 1_000_000
  • Config.workdir = tempfile.gettempdir()
  • Config.DISABLE_TQDM: as a boolean switch to silence tqdm.

The workdir uses ///pages for managing pages, so that different Python processes don't interact with the same storage location. Previously, two Python processes would conflict over the same tablite.h5 file.
If the Python process is killed with SIGKILL, the temp folder is left behind. If the Python process exits (SIGINT), Python will clean up the temp folders.
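
The cleanup behaviour can be illustrated with the standard library alone. The sketch below is not tablite's code and does not reproduce its folder layout: the function name make_process_workdir and the folder naming are hypothetical. It only shows why a folder registered for cleanup survives a SIGKILL but not a normal exit.

```python
import atexit
import os
import shutil
import tempfile
from pathlib import Path


def make_process_workdir(base=None):
    """Create a private scratch folder for this process (hypothetical layout)."""
    base = Path(base or tempfile.gettempdir())
    # mkdtemp always creates a fresh, uniquely named folder, so two Python
    # processes never end up sharing the same storage location.
    workdir = Path(tempfile.mkdtemp(prefix=f"pages-{os.getpid()}-", dir=base))
    # atexit handlers run when the interpreter shuts down normally, which
    # includes an unhandled KeyboardInterrupt raised by SIGINT. They do not
    # run after SIGKILL, so in that case the folder stays on disk.
    atexit.register(shutil.rmtree, workdir, ignore_errors=True)
    return workdir
```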

Table now accepts the keyword columns as a dict in init:

    t = Table(columns={'b':[4,5,6], 'c':[7,8,9]})

Table now accepts header/data combinations in init:

    t = Table(header=['b','c'], data=[[4,5,6],[7,8,9]])

With these features it is no longer necessary to write:

    t = Table()
    t['b'] = [4,5,6]
    t['c'] = [7,8,9]

Preferred approach for subclassing tables is:

class MyTable(tablite.Table):
    def __init__(self, *args, **kwargs):
        super().__init__(*args, **kwargs)
        self.x = kwargs.get("x", 42)  # <== special variable required on MyTable.

    def copy(self):
        # tablite.Table implements
        # new = cls();   # self.x is now 42 !!!
        # for name,column in self.items():
        #     new[name] = column

        # MyTable therefore implements:
        cp = super(MyTable, self).copy()
        cp.x = self.x  # updating to real x.
        return cp

Replace has been refactored to take a mapping instead of a single value.

    Example:
    >>> t = Table(columns={'A': [1,2,3,4]})
    >>> t['A'].replace({2:20,4:40})
    >>> t[:]
    np.ndarray([1,20,3,40])

  • Performance benchmarks have been added.

  • Table.head has been moved to tablite.tools where it belongs.

  • histogram now returns two lists instead of a dict, as Python treats True and 1 as the same dict key.

  • Table.save now defaults to zipfile.ZIP_DEFLATED, compression_level=1 as this is only 10% slower, yet saves 80% disk space.

  • Column.iter_by_page : An iterator for traversing data by page.

  • module reindex implements a constant memory method for re-indexing tasks which now is used by Lookup, Join, Sort and Filter.

  • New function unique_index, which allows drop_duplicates and sorting to run with a constant memory footprint.

  • get_headers now has text_qualifier to match importing.

  • Table.types() is practically instantaneous as type information is kept on Pages. The dev9 feature Table.dtypes() is deprecated as it almost duplicates the functionality of .types

  • multiprocessing has been disabled for algorithms that guarantee a constant memory footprint.
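
The reason for the histogram change in the list above can be shown in plain Python: True, 1 and 1.0 compare equal and hash identically, so a dict cannot keep them apart. The function histogram_as_lists below is a hypothetical illustration of returning two lists, not tablite's implementation.

```python
from collections import Counter

values = [True, 1, 1.0, False, 0]

# In a dict, True, 1 and 1.0 collapse into a single key, and so do False and 0.
as_dict = dict(Counter(values))


def histogram_as_lists(values):
    """Return (unique values, counts), keeping bool/int/float distinct."""
    # Keying on (type, value) keeps True apart from 1 and 1.0.
    counts = Counter((type(v), v) for v in values)
    uniques = [v for (_, v) in counts]
    frequencies = list(counts.values())
    return uniques, frequencies


uniques, frequencies = histogram_as_lists(values)
```

With the dict, the five inputs end up under only two keys; with two lists, each of the five typed values keeps its own entry.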

Deprecated

The following assignment method is DEPRECATED:

    t = Table()
    t[('b','c')] = [ [4,5,6], [7,8,9] ]
    # This previously produced a table with two columns:
    t['b'] == [4,5,6]
    t['c'] == [7,8,9]

  • copy_to_clipboard and copy_from_clipboard, as pyperclip doesn't seem to be maintained.
  • class method from_dict as Table(columns=dict) now is supported.
  • Insert and append have been removed to discourage operations that are IO-intensive.
  • Column.insert is being removed as it encourages the user to use slow operations. It is better to perform the data manipulation in memory and drop the result into the column using col.extend(....) or col[a:b] = [result].
  • reload_saved_tables. Use t.save(path) and Table.load(path) instead.
  • reset_storage is deprecated as all tables not explicitly saved are considered temporary/volatile.
  • from_dict is deprecated as Table(columns={dict}, ...) is accepted.
  • to_numpy. By default table['name'] returns a numpy array. Users should call table['name'].tolist() to get Python lists, which is up to 6x slower than just retrieving the numpy array.

Everything else remains.

Bugfix for join & filter

13 Jul 18:05
Pre-release

This is a bugfix release. No API changes.

Bugfix for sort

11 Jul 10:37
Pre-release

No API changes.
This release solves bug #71.

Table.dtypes removed.

03 Jul 11:41
Pre-release

This was an omission in the previous RC.

instant Table.types()

03 Jul 11:15
Pre-release
  • Table.types() is practically instantaneous as type information is kept on Pages
  • Table.dtypes() is deprecated as it almost duplicates the functionality of .types

get_headers improvements

27 Jun 12:03
f7a4579
Pre-release

get_headers now has text_qualifier to match importing