2023.6.1 - Breaking release with ~6x speed up

@root-11 root-11 released this 01 Aug 08:53
· 1883 commits to master since this release

This release, 2023.6.1, addresses #61 with a significant number of changes and features a ~6x speedup.
This release note summarizes the changes from 2023.6.dev1 through 2023.6.dev14.

For a quick summary, view the tutorial.

Changes

A new Config class is available in config.py. It sets:

  • Config.page_size = 1_000_000
  • Config.workdir = tempfile.gettempdir()
  • Config.DISABLE_TQDM: a boolean switch to silence tqdm.

The workdir uses ///pages for managing pages, so that different Python processes don't share the same storage location. Previously, two Python processes would conflict over the same tablite.h5 file.
If the Python process is killed with SIGKILL, the temp folder is left behind. If the Python process exits (for example on SIGINT), Python cleans up the temp folders.
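
The per-process folder and its cleanup behaviour can be illustrated with the standard library alone. This is a minimal sketch of the idea, not tablite's implementation: the function name `make_process_workdir` and the folder name `tablite-sketch-<pid>` are hypothetical.

```python
import atexit
import os
import shutil
import tempfile
from pathlib import Path


def make_process_workdir(base=None):
    """Create a pages folder that is private to the current process."""
    base = Path(base or tempfile.gettempdir())
    # Hypothetical layout: <base>/tablite-sketch-<pid>/pages
    root = base / f"tablite-sketch-{os.getpid()}"
    pages = root / "pages"
    pages.mkdir(parents=True, exist_ok=True)
    # atexit handlers run on a normal exit and after an unhandled
    # KeyboardInterrupt (SIGINT), but never after SIGKILL.
    atexit.register(shutil.rmtree, root, ignore_errors=True)
    return pages


pages_dir = make_process_workdir()
print(pages_dir)
```

Because the process id is part of the path, two interpreters started at the same time get separate folders. A process killed with SIGKILL never reaches the `atexit` handler, which is why its folder stays on disk.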

Table now accepts the keyword columns as a dict in init:

    t = Table(columns={'b':[4,5,6], 'c':[7,8,9]})

Table now accepts header/data combinations in init:

    t = Table(header=['b','c'], data=[[4,5,6],[7,8,9]])

With these features it is no longer necessary to write:

    t = Table()
    t['b'] = [4,5,6]
    t['c'] = [7,8,9]

Preferred approach for subclassing tables is:

class MyTable(tablite.Table):
    def __init__(self, *args, **kwargs):
        super().__init__(*args, **kwargs)
        self.x = kwargs.get("x", 42)  # <== special variable required on MyTable.

    def copy(self):
        # tablite.Table implements
        # new = cls();   # self.x is now 42 !!!
        # for name,column in self.items():
        #     new[name] = column

        # MyTable therefore implements:
        cp = super(MyTable, self).copy()
        cp.x = self.x  # updating to real x.
        return cp
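
Why `copy` needs overriding can be shown without tablite installed. The sketch below uses a hypothetical stand-in class, `BaseTable`, that only mimics the copy pattern described in the comments above. It is not tablite's actual code, and the explicit `x` parameter is an assumption made for the illustration.

```python
class BaseTable:
    """Hypothetical stand-in for tablite.Table; it only mimics the copy pattern."""

    def __init__(self, columns=None):
        self.columns = dict(columns or {})

    def copy(self):
        # Build a fresh instance of the same class, then copy the columns across.
        new = type(self)()
        for name, values in self.columns.items():
            new.columns[name] = list(values)
        return new


class MyTable(BaseTable):
    def __init__(self, columns=None, x=42):
        super().__init__(columns)
        self.x = x  # extra attribute the base class knows nothing about

    def copy(self):
        cp = super().copy()  # cp.x is the default 42 at this point
        cp.x = self.x        # carry the real value over
        return cp


t = MyTable({"b": [4, 5, 6]}, x=7)
c = t.copy()
print(type(c).__name__, c.x, c.columns)  # MyTable 7 {'b': [4, 5, 6]}
```

Without the override, `c.x` would silently fall back to 42, because the base class creates the new instance with default arguments.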

Replace has been refactored to take a mapping instead of a single value.

Example:

    >>> t = Table(columns={'A': [1,2,3,4]})
    >>> t['A'].replace({2:20,4:40})
    >>> t['A'][:]
    array([ 1, 20,  3, 40])
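
As an illustration of mapping-based replacement in plain Python, the sketch below applies every substitution in one pass. The helper `replace_values` is hypothetical and is not part of tablite.

```python
def replace_values(values, mapping):
    """Return a new list where every value found in `mapping` is swapped."""
    # dict.get(v, v) leaves values without a mapping entry untouched.
    return [mapping.get(v, v) for v in values]


column = [1, 2, 3, 4]
print(replace_values(column, {2: 20, 4: 40}))  # [1, 20, 3, 40]

# A mapping also handles swaps, which chained single-value replacements cannot:
print(replace_values([1, 2], {1: 2, 2: 1}))  # [2, 1]
```

Replacing 1 with 2 and then 2 with 1 in two separate steps would turn `[1, 2]` into `[1, 1]`. With a mapping, each value is looked up exactly once.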

  • Performance benchmarks have been added.

  • Table.head has been moved to tablite.tools where it belongs.

  • histogram now returns two lists instead of a dict, as Python treats True and 1 as the same dict key.

  • Table.save now defaults to zipfile.ZIP_DEFLATED, compression_level=1 as this is only 10% slower, yet saves 80% disk space.

  • Column.iter_by_page : An iterator for traversing data by page.

  • The reindex module implements a constant-memory method for re-indexing tasks, which is now used by Lookup, Join, Sort and Filter.

  • New function unique_index, which allows drop_duplicates and sorting to run with a constant memory footprint.

  • get_headers now accepts text_qualifier, to match importing.

  • Table.types() is practically instantaneous, as type information is kept on Pages. The dev9 feature Table.dtypes() is deprecated, as it almost duplicates the functionality of .types().

  • multiprocessing has been disabled for algorithms that guarantee a constant memory footprint.
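
The histogram change above follows from how Python dicts compare keys: `True == 1 == 1.0` and all three hash alike, so they collapse into one key. The sketch below demonstrates the collision and one possible way to keep the values apart in two parallel lists. The function `histogram_sketch` is hypothetical and is not tablite's implementation.

```python
from collections import Counter

values = [True, 1, 1.0, False, 0, "1"]

# A dict merges keys that compare equal, so True, 1 and 1.0 share one slot.
as_dict = dict(Counter(values))
print(as_dict)  # {True: 3, False: 2, '1': 1}


def histogram_sketch(values):
    """Count values while keeping bool, int and float apart; returns two lists."""
    counts = {}
    for v in values:
        key = (type(v), v)  # the type is part of the key, so True != 1 here
        counts[key] = counts.get(key, 0) + 1
    uniques = [v for (_, v) in counts]
    return uniques, list(counts.values())


uniques, counts = histogram_sketch(values)
print(uniques)  # [True, 1, 1.0, False, 0, '1']
print(counts)   # [1, 1, 1, 1, 1, 1]
```

Returning two lists avoids the key collision entirely, since list elements are never compared against each other.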

Deprecated

The following assignment method is DEPRECATED:

    t = Table()
    t[('b','c')] = [ [4,5,6], [7,8,9] ]

This produced a table with two columns:

    t['b'] == [4,5,6]
    t['c'] == [7,8,9]

  • copy_to_clipboard and copy_from_clipboard, as pyperclip doesn't seem to be maintained.
  • The class method from_dict, as Table(columns=dict) is now supported.
  • Insert and append have been removed to discourage operations that are IO-intensive.
  • Column.insert is being removed as it encourages the user to use slow operations. It is better to perform the data manipulation in memory and drop the result into the column using col.extend(....) or col[a:b] = [result].
  • reload_saved_tables. Use t.save(path) and Table.load(path) instead.
  • reset_storage is deprecated as all tables not explicitly saved are considered temporary/volatile.
  • from_dict is deprecated as Table(columns={dict}, ...) is accepted.
  • to_numpy. By default, table['name'] returns a numpy array. Users who need Python lists should call table['name'].tolist(), which is up to 6x slower than just retrieving the numpy array.
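
The advice to manipulate data in memory and write it once, rather than inserting value by value, can be illustrated with a toy model. `CountingColumn` below is hypothetical: it simply counts one storage write per call and makes no claim about how tablite stores pages.

```python
class CountingColumn:
    """Hypothetical column that counts how often its storage is rewritten."""

    def __init__(self):
        self.data = []
        self.writes = 0

    def insert(self, index, value):
        self.data.insert(index, value)
        self.writes += 1  # one storage write per inserted value

    def extend(self, values):
        self.data.extend(values)
        self.writes += 1  # one storage write for the whole batch


slow = CountingColumn()
for v in range(1000):
    slow.insert(len(slow.data), v)

fast = CountingColumn()
fast.extend(list(range(1000)))  # prepare in memory, then write once

print(slow.writes, fast.writes)  # 1000 1
```

Both columns end up with identical data. Only the number of writes differs, and that is the cost the removal of insert and append is meant to discourage.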

Everything else remains.