Add a copy(::LazyBranch) method to prevent branch materialization on … #286

grasph · 2023-10-13T09:18:41Z

For a LazyBranch br, Base.copy(br) was returning a vector instead of a LazyBranch and leading to uncessary read. This PR fix the issue.

…copy In absence of such method, the copy(::AbstractVector) method that resulted in the materialization into a Vector was called.

codecov · 2023-10-13T09:25:49Z

Codecov Report

All modified and coverable lines are covered by tests ✅

Project coverage is 86.33%. Comparing base (799b290) to head (de28e08).
Report is 24 commits behind head on main.

Additional details and impacted files

@@            Coverage Diff             @@
##             main     #286      +/-   ##
==========================================
- Coverage   87.82%   86.33%   -1.49%     
==========================================
  Files          18       18              
  Lines        2366     2445      +79     
==========================================
+ Hits         2078     2111      +33     
- Misses        288      334      +46

☔ View full report in Codecov by Sentry.
📢 Have feedback on the report? Share it here.

Moelf · 2023-10-13T09:30:34Z

I'm not so sure about this one. For example, copy(reinterpret(...)) also changes type and return a Vector(). I think it could be argued that current behavior is the correct one already

Moelf · 2023-10-13T09:39:35Z

Arrow.jl

julia> d = Arrow.Table("/tmp/a.feather");

julia> d.x #this is memory-mapped
100-element Arrow.Primitive{Int64, Vector{Int64}}:
...

julia> copy(d.x)
100-element Vector{Int64}:
...

grasph · 2023-10-13T09:51:10Z

According the Base.copy documentation it must "create a shallow copy of x: the outer structure is copied, but not all internal values."

Moelf · 2023-10-13T09:54:18Z

but then you also have this from Julia itself being weird/inconsistent:

julia> a = rand(2,2)'
2×2 adjoint(::Matrix{Float64}) with eltype Float64:
 0.651229  0.952613
 0.723043  0.00959459

julia> copy(a)
2×2 Matrix{Float64}:
 0.651229  0.952613
 0.723043  0.00959459

julia> a = rand(2)'
1×2 adjoint(::Vector{Float64}) with eltype Float64:
 0.29677  0.177136

julia> copy(a)
1×2 adjoint(::Vector{Float64}) with eltype Float64:
 0.29677  0.177136

edit: it seems the adjoint of vector is a corner case because it's behavior is different from one-row matrix, but the intention here is to change type and return Array.

Let's hold off for a bit and I will ask other Julia folks on Slack about design/expectation. Potentially request a Documentation update

Moelf · 2023-10-13T10:06:18Z

another thing is I just realized we no longer need mutable struct LazyBranch, it can be immutable.

But after making it immutable it breaks CI and seems like allocation in certain operation is increased somehow.

If we can make it immutable then clearly copy should just be a no-op. Except in this case we need clarification from others regarding copy(<:AbstractVector).

grasph · 2023-10-13T10:45:54Z

My idea was that the LazyBranch was conceptually immutable and this would avoid unnecessary materialization with DataFrame if copycols=false is not passed.

Looking at the DataFrames documentation, the behavior of Arrow.Table is to allow conversion of immutable to mutable columns: "[to] get mutable columns from an immutable input table (like Arrow.Table), pass copycols=true explicitly." Changing the immutable nature of the object is arguable, although here it has some practical use here. The issue is that copycols=true is the default, which can be very expansive in HEP use case.

The case of copy(A::Transpose) and copy(A::Adjoint) is special. According to the documentation it "eagerly evaluate[s] the lazy matrix transpose/adjoint.". For sure return an object of the same type as the original is not wrong, it is changing the type which is an exception.

Moelf · 2023-10-13T11:04:00Z

The issue is that copycols=true is the default, which can be very expansive in HEP use case

But it is what it is, user of DataFrame needs to know what's default behavior. The problem of overriding copy is what user should do if they indeed want copycols=true?

I agree with your argument but I don't know if this is overly intrusive and confusing (i.e. what if user wants a mutable copy of a column?)

grasph · 2023-10-13T11:04:02Z

Concerning the method copy(<:AbstractVector), is is not documented. So I expect it can be considered as fallback implementation and should not be considered as providing the current behavior for all subtypes.

There is a Base.copymutable() function to turn an immutable into a mutable e.g, it turns a tuple to a vector.

Moelf · 2023-10-13T16:33:45Z

I got this comment from @vtjnash so maybe we should just return the original object, what do you think?

But maybe the consideration is that, iterating the copy() shouldn't change buffer of the original?

grasph · 2023-10-15T17:20:10Z

I think both options w and w/o copy of the buffer are valid. Independent buffer should provide better performance in case of concurrent access on two copies as it will prevent buffer invalidation, while no copy option minimizes memory allocations.

About this PR in general, since it is not clear what the best option is, I propose to:

postpone the decision to specialize the copy function until we have a better view of the use cases;
document the possibility for the user to disable the materialization by defining himself the copy method as a no-op.

That way, we will keep the options opened. What do you think of this plan ?

Philippe.

Moelf · 2023-10-15T17:44:16Z

that sounds good to me!

I personally think no-op is okay, because I *think vast majority use cases are similar to when you want to do something in a different container type, e.g. DataFrame, the user probably stops interacting with the original LazyBranch variable.

To me preventing full materialization seems useful enough

Moelf · 2023-10-16T16:38:44Z

then that is a violation of the copy API, which states that operations on b do not affect a

ah ok we can't do no-op, so I think we know exactly what we can do in this case.

tamasgal · 2023-10-16T16:49:54Z

This is really difficult since copying something lazy needs to decouple the structure from its laziness in order to fulfil what copy signals: having something which I can operate on without modifying the original thing.

We need to either break the laziness and materialise everything inside the structure (but then it's not a lazy thing anymore) or we create a completely different object with it's own references and independent caches, which sounds like a source of happy bugs 🙈

I don't know... I think changing the type is also a bit of a surprise, even though it makes sense in certain applications.

I agree that we should figure out what a user expects to be served when doing a copy of lazy branches.

src/iteration.jl

grasph added 2 commits October 13, 2023 11:02

Add a copy(::LazyBranch) method to prevent branch materialization on …

a5234e0

…copy In absence of such method, the copy(::AbstractVector) method that resulted in the materialization into a Vector was called.

missmerged fixed

23e577c

Moelf reviewed Jan 16, 2024

View reviewed changes

src/iteration.jl Outdated Show resolved Hide resolved

Update src/iteration.jl

54c42b8

Moelf reviewed Jan 16, 2024

View reviewed changes

src/iteration.jl Outdated Show resolved Hide resolved

Update src/iteration.jl

de28e08

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Add a copy(::LazyBranch) method to prevent branch materialization on … #286

Add a copy(::LazyBranch) method to prevent branch materialization on … #286

grasph commented Oct 13, 2023

codecov bot commented Oct 13, 2023 •

edited

Loading

Moelf commented Oct 13, 2023 •

edited

Loading

Moelf commented Oct 13, 2023

grasph commented Oct 13, 2023

Moelf commented Oct 13, 2023 •

edited

Loading

Moelf commented Oct 13, 2023

grasph commented Oct 13, 2023

Moelf commented Oct 13, 2023

grasph commented Oct 13, 2023

Moelf commented Oct 13, 2023 •

edited

Loading

grasph commented Oct 15, 2023

Moelf commented Oct 15, 2023 •

edited

Loading

Moelf commented Oct 16, 2023

tamasgal commented Oct 16, 2023

Add a copy(::LazyBranch) method to prevent branch materialization on … #286

Are you sure you want to change the base?

Add a copy(::LazyBranch) method to prevent branch materialization on … #286

Conversation

grasph commented Oct 13, 2023

codecov bot commented Oct 13, 2023 • edited Loading

Codecov Report

Moelf commented Oct 13, 2023 • edited Loading

Moelf commented Oct 13, 2023

Arrow.jl

grasph commented Oct 13, 2023

Moelf commented Oct 13, 2023 • edited Loading

Moelf commented Oct 13, 2023

grasph commented Oct 13, 2023

Moelf commented Oct 13, 2023

grasph commented Oct 13, 2023

Moelf commented Oct 13, 2023 • edited Loading

grasph commented Oct 15, 2023

Moelf commented Oct 15, 2023 • edited Loading

Moelf commented Oct 16, 2023

tamasgal commented Oct 16, 2023

codecov bot commented Oct 13, 2023 •

edited

Loading

Moelf commented Oct 13, 2023 •

edited

Loading

Moelf commented Oct 13, 2023 •

edited

Loading

Moelf commented Oct 13, 2023 •

edited

Loading

Moelf commented Oct 15, 2023 •

edited

Loading