Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Sortedcontainerupdates #787

Merged

Conversation

StephenVavasis
Copy link
Contributor

@StephenVavasis StephenVavasis commented Jan 20, 2022

Major update to sorted containers.

Address issue #239: Dict constructors on typed data but without explicit type declarations should no longer return {Any,Any} items because collect is now used when explicit types are omitted

Address issue #479: Deprecate insert!, pop!(s), introduce popfirst!, poplast!, ss_push!, sd_push!, smd_push!

Address issue #528: push! documentation improved

Address issue #667: Docs extensively cleaned up and much greater use of documenter.jl

Address issue #669: Expanded use of tokens including a:b notation for tokens

Address issue #671: Add reverse iteration

Address issue #742: New union, merge, symdiff, setdiff, intersect algorithms for sorted containers to improve performance

Address issue #726: Clarify docs on ss_push!, sd_push!, and sdm_push! (formerly insert!)

Address issue #767: Implement firstindex, lastindex, reverse for sorted container iteration

Address issue #781: Use collect on an intermediate step
to infer types in untyped construction of sorted containers

Also, this commit cleans up the constructors and adds two new iteration-helper functions inclusive_key and exclusive_key that were useful in my own work.

    Address issue 239: Dict constructors on typed data but without explicit type declarations
    should no longer return {Any,Any} items because collect is now used when explicit types
    are omitted

    Address issue 479: Deprecate insert!, pop!(s), introduce popfirst!, poplast!,
       ss_push!, sd_push!, smd_push!

    Address issue 528: push! documentation improved

    Address issue 667: Docs extensively cleaned up and much greater use of
      documenter.jl

    Address issue 669: Expanded use of tokens including a:b notation for tokens

    Address issue 671: Add reverse iteration

    Address issue 742: New union, merge, symdiff, setdiff, intersect algorithms for
    sorted containers to improve performance

    Address issue 726: Clarify docs on insert! (now deprecated)

    Address issue 767: Implement firstindex, lastindex, reverse for sorted container iteration

    Address issue 781: Use collect on an intermediate step
    to infer types in untyped construction of sorted containers

    Also, this commit cleans up the constructors and
    adds two new iteration-helper functions inclusive_key
    and exclusive_key that were useful in my own work.
@StephenVavasis
Copy link
Contributor Author

The following test from the new testset, which is meant to exercise a new show method, is failing on some platforms:

    my_assert(remove_spaces(repr(MIME{Symbol("text/plain")}(), factors5)) ==
              "SortedMultiDict{Char,Int64,ForwardOrdering}" *
              "(ForwardOrdering(),['a'=>1,'b'=>2,'c'=>3])")

Here, remove_spaces is a locally written function-- see below. So apparently on some platforms, show works differently? Anyone know the differences? Or is there a better way to write a unit test for show?

function remove_spaces(s::String)
    b = Vector{UInt8}()
    for c in s
        if !isspace(c)
            push!(b,UInt8(c))
        end
    end
    String(b)
end

@StephenVavasis
Copy link
Contributor Author

The reason the test is failing is: In Julia 1.0, printing the vector [1=>'a',2=>'b'] prepends the output with Pair{Int64,Char}. Prepending this eltype-declaration does not happen in Julia 1.7. Apparently I need to think of a more robust unit test for a show method.

@StephenVavasis
Copy link
Contributor Author

StephenVavasis commented Jan 21, 2022

I replaced the failing unit test for show with another that does not require precisely a particular string, and now all tests are successful.

Copy link
Member

@oxinabox oxinabox left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

sorry I have not finished reviewing this.
This broadly speaking looks good and if all the existing tests still pass then it should be good

src/DataStructures.jl Outdated Show resolved Hide resolved

const SDMContainer = Union{SortedDict, SortedMultiDict}
const SortedContainer = Union{SDMContainer, SortedSet}
const Token = Tuple{SortedContainer, IntSemiToken}
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is a surprising definition for me

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Are you asking why Token is implemented as a tuple rather than a struct? This is a legacy issue. In pre-1.0 Julia, any struct that contained an immutable object had itself to be heap allocated. However, in pre-1.0 Julia, operations on tuples containing an immutable object didn't always allocate. In particular, deref((mysorteddict, semitok)) did not heap-allocate its 2-tuple argument, whereas a hypothetical deref(Token(mysorteddict, semitok)), where Token is the constructor of a hypothetical struct, would have allocated.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

More my surprise was that a token included both the container and the indexing object.
I would have expected the token to just be the indexing object (but that is defined as a semitoken)
is the idea that the token is enough to fully retrieve the stored object?
But semitoken is only half of what is expected?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think full tokens are useful, e.g., deref(token_firstindex(s)) as opposed to deref((s,firstindex(s))) where the user needs to be consistent with the argument s. So if it is OK with you, I would like to leave these.

token_firstindex(m::SortedContainer)


Return the token of
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

again this definition of token feels weird to me

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I would like to leave these. See other comment.

Comment on lines 246 to 254
function deref(ii::SortedDictToken)
has_data(ii)
deref_nocheck(ii)
end

@inline function deref_nocheck(ii::SortedMultiDictToken)
@inbounds kdrec = ii[1].bt.data[ii[2].address]
return kdrec.k => kdrec.d
end
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Perhaps we should be implementing these deref and deref_nocheck pairs, using Base.@propagate_inbounds and Base.@boundscheck.

Then when we want deref_nocheck(k) we could just write @inbounds deref(k)

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

similar for the other foo/foo_nocheck pairs?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I tried to implement this, but it is not working properly, so perhaps I am misunderstanding the rules. Here is the new definition of deref:

Base.@propagate_inbounds function deref(ii::SortedDictToken)
    Base.@boundscheck has_data(ii)
    @inbounds kdrec = ii[1].bt.data[ii[2].address]
    return Pair(kdrec.k, kdrec.d)
end

My intention is that if a deref operation is prefixed by @inbounds, then has_data will not be invoked. But this is not happening:

julia> s = SortedDict{Int,String}()
SortedDict{Int64, String, Base.Order.ForwardOrdering}()

julia> deref(token_firstindex(s))
ERROR: BoundsError
Stacktrace:
 [1] has_data
   @ c:\Users\vavasis\OneDrive - University of Waterloo\Julia_pkg_devdir\DataStructures.jl\src\sorted_container_iteration.jl:442 [inlined]
 [2] deref(ii::Tuple{SortedDict{Int64, String, Base.Order.ForwardOrdering}, DataStructures.Tokens.IntSemiToken})
   @ DataStructures c:\Users\vavasis\OneDrive - University of Waterloo\Julia_pkg_devdir\DataStructures.jl\src\sorted_container_iteration.jl:242
 [3] top-level scope
   @ REPL[24]:1

julia> @inbounds deref(token_firstindex(s))
ERROR: BoundsError
Stacktrace:
 [1] has_data
   @ c:\Users\vavasis\OneDrive - University of Waterloo\Julia_pkg_devdir\DataStructures.jl\src\sorted_container_iteration.jl:442 [inlined]
 [2] deref(ii::Tuple{SortedDict{Int64, String, Base.Order.ForwardOrdering}, DataStructures.Tokens.IntSemiToken})
   @ DataStructures c:\Users\vavasis\OneDrive - University of Waterloo\Julia_pkg_devdir\DataStructures.jl\src\sorted_container_iteration.jl:242
 [3] top-level scope
   @ REPL[25]:1

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

OK, nvm, I figured this out too. Apparently @inbounds is not necessarily active if it is called directly from the REPL as opposed to its being called from a compiled function.

src/sorted_container_iteration.jl Show resolved Hide resolved
Copy link
Member

@oxinabox oxinabox left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

some further progress, sorry this is taking so long

Documenter = "e30172f5-a6a5-5a46-863b-614d45cd2de4"

[compat]
Documenter = "0.23"
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Probably we should keep this line, so when/if Documenter makes a new breaking release we don't get it automatically.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

OK, I put it back. I'm not sure how it got deleted.

is similar to SortedDict except that each key can be associated with
multiple values. The key=>value pairs in a SortedMultiDict are stored
according to the sorted order for keys, and key=>value pairs with the
same key are stored in order of insertion.

The containers internally use a 2-3 tree, which is a kind of balanced
tree and is described in many elementary data structure textbooks.
tree and is described in data structure textbooks.
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

maybe we should mention that it is backed by a single array, rather than linked-list style?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

done

is not a bits-type.
If code profiling indicates that statements using tokens are allocating memory,
then it may be advisable to rewrite the application code using semitokens
more than tokens.
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
more than tokens.
rather than tokens.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

done

searchsortedfirst(sc,k)
```@docs
Base.first(sc::SortedContainer)
```
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is there a reason there are in seperate @docs blocks?
rather than

```@docs
deref_key(token::Token)
...
Base.first(sc::SortedContainer)
```

I think that works doesn't it?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I didn't know about the other syntax for documenter, but now that you told me, I have fixed it.

Comment on lines 262 to 271
ss_push!(ss::SortedSet, k)
```

```@docs
pop!(ss::SortedSet, k)
sd_push!(sd::SortedDict, p::Pair)
```

```@docs
pop!(ss::SortedSet)
smd_push!(smd::SortedMultiDict, p::Pair)
```
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It might be nice to put the Base functions before our special new ones?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done

import Base.+
import Base.-
import Base.isequal
export sd_push!, ss_push!, smd_push!, poplast!
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Rather than all 3 sd_push!, ss_push!, smd_push! is there a single name we can overload for these that refers to what it returns, rather than hungarian notation for it's first argument?
Like push_and_get_token! or something? (is that what they are doing?)

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done. The new function is called push_return_token!. Since it is a single function name, I unified the APIs by making the return variables Bool,IntSemiToken. The Bool return variable is redundant for SortedMultiDict because push always creates a new record for SortedMultiDict (i.e., the first return variable is always true for a SortedMultiDict).


const SDMContainer = Union{SortedDict, SortedMultiDict}
const SortedContainer = Union{SDMContainer, SortedSet}
const Token = Tuple{SortedContainer, IntSemiToken}
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

More my surprise was that a token included both the container and the indexing object.
I would have expected the token to just be the indexing object (but that is defined as a semitoken)
is the idea that the token is enough to fully retrieve the stored object?
But semitoken is only half of what is expected?

the first entry of the sorted container `m`, or the past-end token
if the container is empty. Time: O(log *n*)
"""
token_firstindex(m::SortedContainer) = (m, firstindex(m))
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

do we need this to be exported?
It seems like if someone needs it is is a trivial and obvious definition.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

See other comment about token_firstindex.

if the container is empty. Time: O(log *n*)
"""
Base.lastindex(m::SortedContainer) = endof(m)
endof(m::SortedContainer) = IntSemiToken(endloc(m.bt))
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

as per startof do we need this? or can we just use lastindex?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I have deprecated startof and endof in favor of firstindex and lastindex.

last entry of the sorted container `m`, or the before-start semitoken
if the container is empty. Time: O(log *n*)
"""
token_lastindex(m::SortedContainer) = (m, lastindex(m))
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

as per token_firstindex do we need to export this?
It is a trivial and obvious definition

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

See other comment about token_firstindex


These five invocations of `in` use the index structure of the sorted
container and test equality based on the order object of the keys
rather than `isequal`. Therefore, these five are all faster than
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

should we comment here that if isequal is defined consistently with isless etc this should have same behavour?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The issue here is not whether isless and isequal are consistent. Rather, the issue is whether the user has defined an ordering (such as case-insensitive for strings) for the container different from isequal and isless.


(:)(t1::Token, t2::Token) = _colon(t1,t2)

function _colon(t1::Token, t2::Token)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

why not define this as : ?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The definition function Base.(:)(a,b) gives a syntax error (noted above). However, function (:)(a,b) works, so I have fixed this (but I still have the issue with import noted above).

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

OK, I figured this out.

Copy link
Member

@oxinabox oxinabox left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Nice work.
I assume that the logic is at least as correct as before as all previous tests still pass.

Some mostly stylistic comments to resolve

Comment on lines 1139 to 1141
+(t1::Token, numstep::Integer) =
numstep >= 0 ? stepforward(t1, numstep) : stepback(t1, -numstep)

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

this easily fits on 1 line, no?

Suggested change
+(t1::Token, numstep::Integer) =
numstep >= 0 ? stepforward(t1, numstep) : stepback(t1, -numstep)
+(t1::Token, numstep::Integer) = numstep >= 0 ? stepforward(t1, numstep) : stepback(t1, -numstep)

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done.

Comment on lines 1142 to 1143
-(t1::Token, numstep::Integer) =
numstep >= 0 ? stepback(t1, numstep) : stepforward(t1, -numstep)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

this easily fits on 1 line, no?

Suggested change
-(t1::Token, numstep::Integer) =
numstep >= 0 ? stepback(t1, numstep) : stepforward(t1, -numstep)
-(t1::Token, numstep::Integer) = numstep >= 0 ? stepback(t1, numstep) : stepforward(t1, -numstep)

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done.

function stepforward(t1::Token, numstep::Integer)
m = t1[1]
j = t1[2].address
!(j in m.bt.useddatacells) && throw(BoundsError())
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

what do you think of

Suggested change
!(j in m.bt.useddatacells) && throw(BoundsError())
(j in m.bt.useddatacells) || throw(BoundsError())

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I can make this change, but I have used the idiom !(correct) && throw(...) (rather than correct || throw(...)) throughout my code. Do you recommend that I change them all.


"""
in(p, sc)
@deprecate insert!(m::SortedDict, k, d) sd_push!(m::SortedDict, k=>d)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

can we move this into src/deprecations.jl so we remember to delete it in next release?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done

Comment on lines 1 to 2
using Base: Ordering, Forward, Reverse, ForwardOrdering,
ReverseOrdering
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

this fits on one line

Suggested change
using Base: Ordering, Forward, Reverse, ForwardOrdering,
ReverseOrdering
using Base: Ordering, Forward, Reverse, ForwardOrdering, ReverseOrdering

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done

Comment on lines 4 to +5
import Base.lt
import Base.ForwardOrdering
import Base.ReverseOrdering
import DataStructures.IntSemiToken
import DataStructures.eq
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We should make the imports into usings i think

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done

import DataStructures.eq
using Base: Ordering, Forward, Reverse, ForwardOrdering,
ReverseOrdering
using DataStructures: IntSemiToken
import Base.lt
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

isn't this called Base.lessthan now days?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I checked ordering.jl in Base, and I still see lt. Where did you see lessthan?


struct CaseInsensitive <: Ordering
end

lt(::CaseInsensitive, a, b) = isless(lowercase(a), lowercase(b))
eq(::CaseInsensitive, a, b) = isequal(lowercase(a), lowercase(b))


Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
# for testing custom ordering

Comment on lines 41 to 50
function remove_spaces(s::String)
b = Vector{UInt8}()
for c in s
if !isspace(c)
push!(b,UInt8(c))
end
end
String(b)
end

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
function remove_spaces(s::String)
b = Vector{UInt8}()
for c in s
if !isspace(c)
push!(b,UInt8(c))
end
end
String(b)
end
remove_spaces(s::String) = replace(s, re"\s+"=>"")

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I found out that it is actually r"\s+" rather than re"\s+", but aside from that, this is done.

@@ -1765,3 +2093,4 @@ end
end
end
end

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

can we label the new tests with a comment including the URL of the github issues that they prove is solved?

  Renamed push_return_token! to push_return_semitoken!, which accurately
  describes the operation.

  Documented the fact that values(s) is not currently writeable.
@StephenVavasis
Copy link
Contributor Author

@oxinabox I believe I have addressed all of your comments in this PR. In the last commit that I just made, I have added comments to the test sets to indicate where the various issues are tested. Let me know if you need anything else before you can merge this PR.

@@ -39,12 +42,14 @@ module DataStructures
export SortedDict, SortedMultiDict, SortedSet
export SDToken, SDSemiToken, SMDToken, SMDSemiToken
export SetToken, SetSemiToken
export startof
export startof, endof ## both are deprecated
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We don't need to explictly export deprecated things @deprecate does that for us

Suggested change
export startof, endof ## both are deprecated

Copy link
Member

@oxinabox oxinabox left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Sounds good.
You have merge rights.
Merge when you are happy

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants