Add pairwise #54
It works well overall but there are a few issues:
- When vectors have different types, `promote_typejoin` is used (via `broadcast`) to choose the element type. `promote_type` would be more appropriate but there is no mechanism to do this in Base.
- When skipping missing values, inference isn't able to realize that the result cannot be `missing`. This problem is fixed if I use a positional argument rather than a keyword argument for `skipmissing`, but that's far from ideal for users.
- Since `eachrow(df::DataFrame)` returns a `DataFrameRows` object which is also a table, `pairwise(cor, eachrow(df))` still computes correlation between columns rather than between rows. And anyway `DataFrameRow` is not accepted by `cor` since it's not an `AbstractVector`. One needs something like `(Vector(r) for r in eachrow(df))` to work around this limitation.
- Since Tables.jl objects can be of any type, we must have a single method that performs dispatch internally by calling `Tables.istable`. Even `AbstractVector` inputs can be either vectors of vectors or row-oriented or column-oriented tables. This means the method that returns a named array and the method that returns a plain matrix have to live in the same package (which has to depend on NamedArrays).
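To illustrate the first point, here is a minimal sketch of how the two promotion mechanisms differ on mixed `Int`/`Float64` values. It only uses Base functions; the variable names `joined` and `promoted` are made up for this example and do not come from the PR.

```julia
# `promote_typejoin` picks a common supertype, `promote_type` a common concrete type.
joined_type   = Base.promote_typejoin(Int, Float64)   # Real
promoted_type = promote_type(Int, Float64)            # Float64

# Broadcasting over values whose type cannot be inferred widens with
# `promote_typejoin`, so the result keeps an abstract element type.
joined = identity.(Any[1, 2.5])

# An array literal promotes its elements with `promote_type` instead.
promoted = [1, 2.5]

println(eltype(joined), " vs. ", eltype(promoted))
```

An abstract element type such as `Real` is usually less convenient for downstream numeric code than `Float64`, which is presumably why `promote_type` is described as more appropriate here.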
pairwise(f, x[, y], symmetric::Bool=false, skipmissing::Symbol=:none)

Return a matrix holding the result of applying `f` to all possible pairs
of vectors in iterators `x` and `y`. Rows correspond to
Why vectors? Can't we say that `x` and `y` have to be iterables of iterables and we apply `f` to their cross-product?
OK - now I see why below. Maybe then require `x` and `y` to be `AbstractVector{<:AbstractVector}`?
Yes I guess we could allow anything, `f` can take care of throwing an error if needed.
Also maybe write exactly what you have in code - if both `x` and `y` are Tables.jl compliant then they are treated as tables. Otherwise BOTH are treated as iterators?
Actually, to skip missing values we rely on `view` so I guess for that case we need `AbstractVector`. In the future we could use `skipmissings` but I'd say it's not high priority.
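A rough sketch of why `view` pushes the inputs toward `AbstractVector`: pairwise deletion needs positions that can be indexed in both inputs at once. The helper name `complete_views` is hypothetical and this is not the PR's code, only an illustration under that assumption.

```julia
# Hypothetical helper: keep only the positions where both vectors are non-missing.
function complete_views(a::AbstractVector, b::AbstractVector)
    keys(a) == keys(b) ||
        throw(ArgumentError("All input vectors must have the same indices"))
    # Indices at which neither entry is missing
    keep = [i for i in keys(a) if !ismissing(a[i]) && !ismissing(b[i])]
    # Views avoid copying, but they require indexable inputs
    return view(a, keep), view(b, keep)
end

a = [1.0, missing, 3.0, 4.0]
b = [2.0, 5.0, missing, 8.0]
va, vb = complete_views(a, b)
println(collect(va), " ", collect(vb))
```

Note that the views still have an element type that allows `missing` (here `Union{Missing, Float64}`), even though no missing value can remain in them.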
`skipmissing` is different from `:none`).

# Keyword arguments
- `symmetric::Bool=false`: If `true`, `f` is only called to compute
formally `commutative`, as `symmetric` is a property of a binary relation typically.
"Symmetric" referred to the table. Not sure what's the best term.
    return _pairwise(Val(skipmissing), f, x′, y′, symmetric)
else
    throw(ArgumentError("x and y must be either iterators of AbstractArrays, " *
                        "or Tables.jl objects"))
reverse the order here I think.
    keys(yi) == inds ||
        throw(ArgumentError("All input vectors must have the same indices"))
end
x′ = collect(x)
Maybe better to use a comprehension? It will narrow down the eltype of `x′`.
Good question. I'm not sure what's best. This will only make a difference for vector inputs, right? For these, if you write `[x1, x2]` you'll get a narrow type already (actually, narrower than what a comprehension would do alone thanks to `promote_type` as opposed to `typejoin`). So that would be useful mainly if you pass a vector that you allocated with an abstract type, in which case you may have reasons to do that (avoid unnecessary specialization...).
Anyway it shouldn't affect performance a lot since most of the time should be spent in `f(xi, yi)`. Do you have a use case in mind?
no - I just know that `collect` does not narrow type.
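The difference discussed in this thread can be checked with a small example using Base only; the variable names are illustrative and not taken from the PR.

```julia
x1, x2 = [1, 2], [3, 4]

# A container deliberately allocated with an abstract element type
abstract_input = Any[x1, x2]

# `collect` on an array keeps the declared element type
collected = collect(abstract_input)

# A comprehension recomputes the element type from the actual values
narrowed = [xi for xi in abstract_input]

# An array literal promotes its elements with `promote_type`
literal = [[1, 2], [1.0, 2.0]]

println(eltype(collected), " | ", eltype(narrowed), " | ", eltype(literal))
```

So the comprehension only matters when the caller passes a container with an abstract element type, which matches the point made above about vector inputs.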
I like it very much. The only question is if it would be hard/needed to make it more general and allow not only pairwise but arbitrary dimensional cross-product (not sure if it is needed though).
OK, thanks. It shouldn't be hard to support more types when we don't skip missing values, but it's more tricky when skipping them. Not sure it's worth adding that complexity right now given the intended use cases. EDIT: sorry I hadn't understood your point (I thought it was about allowing any iterators and not just vectors). Yes I imagine it would be possible to accept more than two arguments, but then it wouldn't be pairwise anymore, would it? I guess it would be
Maybe just anticipate the use for more than 2 arguments and rename it, without implementing it yet.
But I find it much more user-friendly to tell people "to compute pairwise correlation, use
I like calling it
I'm fine adding
@dkarrasch I don't know whether you've seen this. Ideally this would cohabit nicely with Distances.jl, but the new support for passing vectors of vectors there (JuliaStats/Distances.jl#188 and JuliaStats/Distances.jl#194) will create ambiguities. Not sure what to do about it.
No, I haven't seen this. The issue is that we relax types in those PRs, right? Is there any way you could depend on Distances here, and then define your own metric types that have fields corresponding to the keyword arguments you use here, and finally extend
The problem is that Tables.jl objects don't have a particular type. Any object can be a table, including notably vectors or iterators of named tuples. As long as Distances only defines methods for matrices or vectors, the conflict with this PR is quite limited: only Tables.jl objects that are also vectors are problematic (including
The only solution AFAICT is to have the most general function check whether an input is a table in the Tables.jl sense, and adapt its behavior depending on that. But @KristofferC didn't like having Distances depend on Tables.jl (JuliaStats/Distances.jl#123), and there's the additional problem that we also want to return a
We could maybe work around this problem by having Distances check that the eltype is
At any rate these issues are tricky. But I think the stats ecosystem would really benefit from having a general
Maybe for the time being we might require
@dkarrasch Regarding your comment at JuliaStats/Distances.jl#188 (review) (better keep the discussion in a single place):
Making
After discussing this on Slack with @bkamins and @quinnj I think the way forward is to require passing iterators over rows or columns of Tables.jl objects explicitly, via
The package should probably be renamed if we want to include this as it's clearly not a frequency table.