feat: implement shape_touched optimisation (#381)
* scaffolding

* separation of data and shape touched

* checkpoint

* wip

* wip: latest

* refactor: drop reference to reports

* wip

* wip

* fix: don't try to read root column

* fix: support fallback case

* fix: support typetracing sample

* test: fix tests

* fix: don't touch data buffers for sizes

Originally, when we saw a non-metadata buffer, we checked whether the current node had any unknown-length attributes that needed computing. However, only NumpyArray has a `data` buffer, and it has no unknown-length attributes to read.

* chore: appease pre-commit

* chore: remove debug statement

* feat!: remove `necessary_buffers`

* chore: add type hints

* fix: properly "deep" copy forms

* fix: typo

Co-authored-by: Doug Davis <ddavis@ddavis.io>

* test: drop local changes

* fix: correct LSP

* docs: add docstring

* refactor: separate mocking from projection more cleanly

* feat: return reports for later consumption

* feat: expose `dak.report_necessary_buffers`

* fix: ensure we only check input layers

* feat: make default buffer key nicer

* fix: restore ability to detect serialised blockwise layers

* fix: remove `_meta` when serialising IO function

* test: restore original test file

* docs: use new name for necessary_columns

* fix: restore wildcard projection for column-at-a-time readers

* refactor: use DFS to find deepest field

* fix: remove old code

* refactor: remove two-phase abstraction

* docs: add brief comment

* test: add note about broken test

* feat: add `necessary_columns` interface

* feat: add `necessary_columns` interface

* docs: improve notes

* refactor: add implementation for mixin

* docs: add note about `report_necessary_buffers`

* fix: support `dak.necessary_columns`

* refactor: export utils

* chore: require newer uproot

* chore: add local test file

* chore: bump awkward dependency

* Update docs/api/inspect.rst

---------

Co-authored-by: Doug Davis <ddavis@anaconda.com>
Co-authored-by: Doug Davis <ddavis@ddavis.io>
3 people authored Oct 6, 2023
1 parent 5981337 commit 96b19df
Showing 29 changed files with 996 additions and 693 deletions.
1 change: 1 addition & 0 deletions .github/workflows/awkward-main.yml
@@ -23,6 +23,7 @@ jobs:
 uses: actions/checkout@v4
 with:
 fetch-depth: 0
+lfs: true
 - name: Setup Python
 uses: actions/setup-python@v4
 with:
1 change: 1 addition & 0 deletions .github/workflows/conda-tests.yml
@@ -24,6 +24,7 @@ jobs:
 uses: actions/checkout@v4
 with:
 fetch-depth: 0
+lfs: true
 - name: Setup Conda Environment
 uses: conda-incubator/setup-miniconda@v2
 with:
1 change: 1 addition & 0 deletions .github/workflows/coverage.yml
@@ -23,6 +23,7 @@ jobs:
 uses: actions/checkout@v4
 with:
 fetch-depth: 0
+lfs: true
 - name: Setup Python
 uses: actions/setup-python@v4
 with:
1 change: 1 addition & 0 deletions .github/workflows/pypi-release.yml
@@ -14,6 +14,7 @@ jobs:
 uses: actions/checkout@v4
 with:
 fetch-depth: 0
+lfs: true

 - name: Setup Python
 uses: actions/setup-python@v4
1 change: 1 addition & 0 deletions .github/workflows/pypi-tests.yml
@@ -25,6 +25,7 @@ jobs:
 uses: actions/checkout@v4
 with:
 fetch-depth: 0
+lfs: true
 - name: setup Python ${{matrix.python-version}}
 uses: actions/setup-python@v4
 with:
1 change: 1 addition & 0 deletions .github/workflows/uproot-main.yml
@@ -23,6 +23,7 @@ jobs:
 uses: actions/checkout@v4
 with:
 fetch-depth: 0
+lfs: true
 - name: Setup Python
 uses: actions/setup-python@v4
 with:
3 changes: 2 additions & 1 deletion docs/api/inspect.rst
@@ -8,7 +8,8 @@ Inspection

 partition_compatibility
 PartitionCompatibility
-necessary_columns
+report_necessary_buffers
+report_necessary_columns
 sample

 .. raw:: html
15 changes: 12 additions & 3 deletions docs/more/optimization.rst
@@ -127,14 +127,14 @@ will only grab ``foo`` and ``bar.x``.
 :py:func:`ak.from_parquet` function at compute time.

 You can see which columns are determined to be necessary by calling
-:func:`dask_awkward.necessary_columns` on the collection of interest
+:func:`dask_awkward.report_necessary_columns` on the collection of interest
 (it returns a mapping that pairs an input layer with the list of
 necessary columns):

 .. code:: pycon
 >>> import dask_awkward as dak
->>> dak.necessary_columns(result)
+>>> dak.report_necessary_columns(result)
 {"some-layer-name": ["foo", "bar.x"]}
 The optimization is performed by relying on upstream Awkward-Array
@@ -156,7 +156,7 @@ parameter:
 One can also use the ``columns=`` argument (with
 :func:`~dask_awkward.from_parquet`, for example) to manually define
 which columns should be read from disk. The
-:func:`~dask_awkward.necessary_columns` function can be used to
+:func:`~dask_awkward.report_necessary_columns` function can be used to
 determine how one should use the ``columns=`` argument. Using our
 above example, we write

@@ -179,3 +179,12 @@ workflow).

 <script data-goatcounter="https://dask-awkward.goatcounter.com/count"
 async src="//gc.zgo.at/count.js"></script>
+
+
+.. note::
+
+Under the hood, the columns optimization is implemented as a *buffers* optimization; dask-awkward determines the
+buffers necessary to read from a columnar source, before translating these to column names. Some IO sources might
+not support :func:`~dask_awkward.report_necessary_columns`, e.g. if the source directly reads buffers from a container.
+
+For these IO sources, :func:`~dask_awkward.report_necessary_buffers` can be used instead.
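
The buffer-to-column translation described in the note can be pictured with a small plain-Python sketch. It does not call dask-awkward: the buffer-key layout (`<dotted column path>-<attribute>`) and the helper name `buffers_to_columns` are assumptions made purely for illustration, not the library's actual format or API.

```python
def buffers_to_columns(buffer_keys):
    """Collapse hypothetical buffer keys such as 'bar.x-data' into column names."""
    columns = set()
    for key in buffer_keys:
        # Assumed key layout: "<dotted column path>-<buffer attribute>".
        path, _, _attribute = key.rpartition("-")
        if path:  # keys without a column path are skipped
            columns.add(path)
    return sorted(columns)


# Several buffers of one column collapse into a single column name.
necessary = buffers_to_columns(["foo-data", "bar.x-offsets", "bar.x-data"])
print(necessary)
```

A source that reads buffers directly has no column names to collapse into, which is the case where the note points to `report_necessary_buffers` instead.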
4 changes: 2 additions & 2 deletions pyproject.toml
@@ -35,7 +35,7 @@ classifiers = [
 "Topic :: Software Development",
 ]
 dependencies = [
-"awkward >=2.4.4",
+"awkward >=2.4.5",
 "dask >=2023.04.0",
 "typing_extensions >=4.8.0",
 ]
@@ -70,7 +70,7 @@ test = [
 "pytest >=6.0",
 "pytest-cov >=3.0.0",
 "requests >=2.27.1",
-"uproot",
+"uproot >=5.1.0rc1",
 ]

 [project.entry-points."dask.sizeof"]
10 changes: 9 additions & 1 deletion src/dask_awkward/__init__.py
@@ -8,6 +8,7 @@
 import dask_awkward.lib.reducers as reducers
 import dask_awkward.lib.str as str
 import dask_awkward.lib.structure as structure
+import dask_awkward.lib.utils as utils
 from dask_awkward.lib.core import Array, PartitionCompatibility, Record, Scalar
 from dask_awkward.lib.core import _type as type
 from dask_awkward.lib.core import (
@@ -16,7 +17,14 @@
 partition_compatibility,
 )
 from dask_awkward.lib.describe import fields
-from dask_awkward.lib.inspect import necessary_columns, sample
+from dask_awkward.lib.inspect import (
+report_necessary_buffers,
+report_necessary_columns,
+sample,
+)
+
+necessary_columns = report_necessary_columns # Export for backwards compatibility.
+
 from dask_awkward.lib.io.io import (
 ImplementsFormTransformation,
 from_awkward,
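
The `necessary_columns = report_necessary_columns` line keeps the old public name importable after the rename. A self-contained sketch of that aliasing pattern follows. The function body is a stand-in rather than dask-awkward's implementation, and the warning wrapper is a hypothetical variant that this commit does not include.

```python
import warnings


def report_necessary_columns(collection):
    # Stand-in body for illustration; the real function inspects a task graph.
    return {"some-layer-name": sorted(collection)}


# Plain alias, as in the diff: both names refer to the same function object.
necessary_columns = report_necessary_columns


def necessary_columns_with_warning(collection):
    # Hypothetical variant (not in this commit) that nudges callers to the new name.
    warnings.warn(
        "necessary_columns is an alias of report_necessary_columns",
        DeprecationWarning,
        stacklevel=2,
    )
    return report_necessary_columns(collection)


result = necessary_columns(["foo", "bar.x"])
```

A plain alias costs nothing at call time and keeps existing imports working; a wrapper that warns is the usual next step if the old name is eventually to be removed.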
8 changes: 8 additions & 0 deletions src/dask_awkward/layers/__init__.py
@@ -3,11 +3,19 @@
 AwkwardInputLayer,
 AwkwardMaterializedLayer,
 AwkwardTreeReductionLayer,
+ImplementsIOFunction,
+ImplementsProjection,
+IOFunctionWithMocking,
+io_func_implements_projection,
 )

 __all__ = (
 "AwkwardInputLayer",
 "AwkwardBlockwiseLayer",
 "AwkwardMaterializedLayer",
 "AwkwardTreeReductionLayer",
+"ImplementsProjection",
+"ImplementsIOFunction",
+"IOFunctionWithMocking",
+"io_func_implements_projection",
 )
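
The newly exported names suggest a protocol-based check for whether an IO function supports projection. The visible diff does not show how `ImplementsProjection` or `io_func_implements_projection` are defined, so the following is only a sketch of the general technique using `typing.Protocol`. The class names, helper name, and method names below are hypothetical stand-ins, not the library's definitions.

```python
from typing import Any, Protocol, runtime_checkable


@runtime_checkable
class SupportsProjection(Protocol):
    # Hypothetical method names; the real protocol may differ.
    def prepare_for_projection(self) -> Any: ...

    def project(self, *args: Any, **kwargs: Any) -> Any: ...


def implements_projection(func: Any) -> bool:
    # isinstance() against a runtime-checkable Protocol only verifies that
    # the named methods exist; it does not check their signatures.
    return isinstance(func, SupportsProjection)


class PlainReader:
    """An IO callable with no projection support."""

    def __call__(self, *args: Any) -> list:
        return []


class ProjectableReader(PlainReader):
    """An IO callable that also offers the (assumed) projection methods."""

    def prepare_for_projection(self) -> Any:
        return None

    def project(self, *args: Any, **kwargs: Any) -> "ProjectableReader":
        return self
```

With structural typing of this kind, a third-party reader could opt in to an optimisation by defining the methods, without inheriting from a base class.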