-
Notifications
You must be signed in to change notification settings - Fork 133
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
[DISCUSS][RFC] DLPack Versioning and ABI Update #104
Comments
what should downstream frameworks expect in the way of migration functions? should folks proposing an ABI-breaking change here also be on the hook to provide an implementation that produces a new DLTensor struct given an old one? it could be helpful to provide such implementation to clarify how to interpret such older structs. |
Thanks @areusch this is indeed something that worth covering. The compatibility implication section covers some of that
It is a bit ambiguous to call it ABI-breaking(due to S0) but would be good to list out all the implications |
one concern with placing the version information at the end of the struct is that, in a V1 DLTensor, that field is invalid. therefore, the memory in that location could be set to any value. that begs the question whether we should require some magic, and how that magic should perhaps be computed. i think adding the migration functions would help folks to consider these corner cases, which are not always obvious when writing these specs. |
I believe what you said is covered by S1 part, which I agree that some of the migration function(likely filling in the value) would be useful |
i'm not sure i follow--how can a C impl trust this memory, since it's outside the boundaries of a |
The protocol is covered in the PR to implement this, as documented here (summarized below). As I understand the proposal, if the producer has no PR protocol documentation
|
@mattip thanks for the pointer. those protocol docs make sense when Python is in the loop. I'm thinking of the following scenario--it could be something that doesn't apply to dlpack, so please correct me if I'm wrong:
Now suppose the consumer is trying to work with the DLTensor created by the producer. Presumably this involves a flow in which a When the consumer (or the consumer's wrapping Python DLTensor class) tries to access the received A way around this could be to require that all |
Independent of the discussion here, my suggestion would be to let this RFC gather feedback for 1-2 months before making changes of the code. An ABI break for such an important data exchange protocol is serious. Having enough feedback will ensure that there are no regrets. |
@wjakob thanks for the suggestion! That is indeed the goal of this RFC(to ensure a long enough period of time before we commit as we value stability), I made it clear in the post |
Can you point to this code? At some point the producer must decide to allocate and fill in the C struct. That is the point where there needs to be a protocol for specifying a dlpack version. If the architecture of the producer disallows any such protocol, how can dlpack ever change? Edit: I think the term used to describe such situations is "dependency hell". |
Some update after taking a closer look. Actually right now the proposed ABI did break S0 as well. The main reason was due to the fact that DLManagedTensor's deleter field seats right after a DLTensor. So any append of new fields would effectively change the location of deleter and break S0. To further clarify the situation, let us think about the following possible changes to DLTensor fields:
Due to our favoring of stability ideally C1 should be extremely rare. And C0 ideally should maintain backward compatibility(the new struct can be used safely in old settings -- ignoring the fields). There are a few choices if we consider the need that "C0 should not break" ABI compatibility A0: Place version in DLManagedTensorstruct DLManagedTensor {
Version version;
void * manager_ctx;
void (*deleter)(struct DLManagedTensor * self);
DLTensor dl_tensor;
}; Note that this is a ABI breaking change.
However, this allows future append to DLTensor fields without problem, so future C0 changes can be done in a backward compatible way. A1: Place version and new flags in DLManagedTensorstruct DLManagedTensor {
DLTensor dl_tensor;
void * manager_ctx;
void (*deleter)(struct DLManagedTensor * self);
Version version;
new_flags here
}; This is a ABI backward compatible change. It does bring restrictions
A2: A variant of A1 while allow upgrading DLTensor fieldsstruct DLManagedTensor {
DLTensor dl_tensor;
void * manager_ctx;
void (*deleter)(struct DLManagedTensor * self);
more flags
};
struct DLManagedTensorVersioned {
Version version;
DLManagedTensor managed_tensor;
}
Summary DiscussionsDue to our favoring of stability. Ideally we would strongly favor S0 property (so frameworks do not need to bring extra support to keep up with the protocol), and the fact that DLPack rarely changes is an important property of the exchange protocol, of course, we do want to think about mechanisms to allow us to deal with breaking in case there is really need for it. |
I would tend towards creating the new name (i.e. A2 or variations of it). Otherwise, appending the fields requires out-of-bound exchange of the information whether the field exists. We can do that in Python but it seems stranger in C? If a new name is created, new C code can indicate that by using |
How would one do that? I think the idea of A2 is to still export the |
I don't know how current API looks like. I imagine exposing something like And the only way to transition that, is probably to create a new pair of functions that use So, I feel that advantage of "ABI compatibility" with current |
As far as C/C++ is concerned, I too think this is the only way to transition. Both A1 and A2 sound like good options then. I also see now how A2 can provide type safety. @tqchen The A2 option says "In new version, allocate DLManagedTensorVersioned but still return the pointer to the managed_tensor." I don't think we should do that since it takes the type safety advantage away (making A1 a simpler solution). Otherwise, I am happy to go with A2 too. What do you think @mattip? Let's wait for some others to chime in and then I can update the PR. |
I like A2 as well. |
One of the mentioned disadvantages of A1 is "No changes to the current fields of DLTensor itself can be made". How would this look for A2? (In general, it seems to me that adding any fields to DLTensor will be an ABI break.) |
We can add new fields onto DLTensor since the version is the first field in DLManagedTensorVersioned (as opposed to A1 where the version field comes after DLTensor). |
This would work for software that knows how to interpret a |
The idea is to introduce new functions ( |
The killer feature of DLPack is its simplicity: it's just a header file with a few data structures that are trivially ported to various languages. Frameworks that can exchange DLPack tensors don't need to link to a shared or static library to do so. They don't need even deal with C or C++. Introducing required versioning-related functions as part of this repository sounds like it has the potential to significantly complicate this success story. My 2cts... (Edit: I am against versioning per se. I am just wondering what this means for ABI interoperability between software that may and may not implement this versioned extension.) |
Without versioning, DLPack can't evolve. There are already requests for read-only data support, non-native endianness, etc. Adding those without a version to separate the new structure will lead to spurious breaks everywhere (since not everyone will update at the same time and newer functions won't work with older ones). I am not sure if there is a simpler way; after a lot of discussions, these options seem to be the simplest of all. In long term, we can remove the old structs and also update the spec to not care about the old ABI.
The proposal is to add version-info in the struct to avoid this exactly. All the complications happen in the spec and not the DLPack header file... |
Hi everyone, The issue has been up for almost a month. So far, I think @mattip, @seberg, and @leofang are in favor of versioning and @wjakob is against it. Let's wait until 16'th May for more feedback and, if there is none, it would be good to merge #105 and wait until 23'rd May before closing this to allow some time to revert the changes if needed. Please feel free to add your thoughts on how DLPack should proceed with versioning and ABI break. Thanks! |
this is a tricky situation which I suspect comes down to how everyone actually uses DLPack in practice. it'd be helpful to know more about everyone's usage of DLPack so we could evaluate whether we're all debating the same use case or if there are others. my initial impressions of the proposals: A0.
might I suggest A3, a variant of A2:
here is what i think example usage might look like:
the advantage to this is that the managed fields are moved next to the version, so that they could be used to terminate the lifecycle of a tensor whose version is incompatible with the client library. this scheme is of course vulnerable to various memory tricks as well--you could consider patching Version to include some xor'd magic too. it is more complex, though. thoughts? |
That is the hard part here, and why @tirthasheshpatel submitted #106. Other users of DLPack can add additional use cases. Without concrete use cases, it is difficult to relate to "what if" scenarios. To me the A3 proposal seems too "magical". The use of @wjakob when you say
do you agree that the structure is incomplete? If not, what do you respond to NumPy who, in accordance with the Array API consortium recommendations, added support for DLPack but requested support for readonly and byteorder flags? This need for versioning was pointed out in #34 which was opened in March 2019, and has been extensively discussed in other issues over the passing years. |
Just to make that clear (it seems there was some misunderstanding). I am definitely not against versioning. However, it would be useful if versioning-related information is appended to data structures so that they are still interpretable by older versions at an ABI level. Second -- there is a lot of value to having DLPack data structures described as POD/Plain Old Data in a way that is accessible from many different languages without, e.g., having to include this repository or linking to a library. There was some of including C-level functions, for example by @tirthasheshpatel
Maybe this is fine. However, this is a deviation from the current purely POD-style interface where the only kind of function is a function pointer. Finally, one suggestion that I would also have is that the PR#101 adds some kinds of flags field. Adding Why not add a |
@mattip agreed A3 is kind of magical and that's less desirable than an explicit indication. It seemed like there was an attempt being made to avoid breaking backwards-compatibility by considering the I think there are a few possibilities:
I should also state that my interest here is purely from POV of having TVM interoperate with other libraries. These possibilities capture the range of situations I could imagine a standard such as this needing to confront. |
Yes, this is exactly what #105 will do (I will update the PR to follow A2).
I meant the C/C++ libraries that want to add support for the new ABI will have to create those new functions. DLPack itself won't have any functions and will still remain a POD. So, no linking will be necessary.
The plan is to batch all the requests like readonly, endianness, alignment, etc in one single ABI break. Althouh, it would be useful to also add some padding for future use. |
For C++ libraries: new functions like
I think this one's a tricky case. Yes, if a C++ tensor is imported in Python, there would be no |
The problem I have is, that I still don't know how the C++ usage really looks like. In Python, I think this all works out great, because the usage is only an exchange Protocol. That means we have basically 1-2 functions to worry about and we can break API on those (even using a For all I know, the C++ usage may be looking more like:
Now, if your API looks like that, you cannot replace every single function that takes a
(Those functions could be static inline functions included in the header, just for "emergency" use to support an outdated library though!). Now, if we are dealing with C++ – as far as I understand – C++ should be doing name-mangling, so you can write For a C only API, name mangling doesn't exist, so the "magic" may well be a useful way for a library to support both. From a Python API perspective: I don't think we as "Python exchange protocol" folks should care whether or not you insist on doing "magic". It simply doesn't affect us because we have better ways to deal with it. We can just trivially break the ABI by just adding a single But, at least me as a Python exchange protocol person do feel that trying to make dlpack a generic exchange protocol in Python requires an ABI break (or preparing for ABI break). There are already clear limitations, and a IMO a modern exchange protocol can be expected to have ABI versioning. |
Hope you don't mind me jumping in with some ideas even though I just today learned of this wonderful project. I'll talk about the C ABI only. (I think the Python parts can work on top of what I propose below, though I could be missing something.) It seems to me that readers (of the structs, not necessarily of the tensor data; same henceforth) usually need to evolve ahead of writers in order not to disrupt end users, given that the flags being considered for addition (read-only, byte-swapped) are things that cannot be safely ignored by a reader (it is better for a reader to give up rather than (try to) mutate a read-only buffer or use the wrong byte order). So I think what is important is for new readers to be able to safely read structs written by old writers, not the other way around. For this to be the case, there needs to be a way to check the version without danger of undefined behavior (segfault or worse). I'm assuming here that binary compatibility (affects end users) is more important than source compatibility (affects mostly library devs). If it is super-easy for the same reader to read either ABI v1 (current) or v2 (first versioned version), there could be a migration period during which libraries start supporting reading either v1 and v2 but continue to write v1 (except in testing, or in cases where the new features are essential). Once enough of the ecosystem knows how to read v2, libraries can start writing v2 with no disruption. One way to ensure that there is no danger of undefined behavior with v1 and v2 structs coexisting is to use the high bit (bit 31) of Also, looking at the suggestions (as in #105) to add Taking these points into account, here is what I came up with: typedef struct {
int32_t device_type; // High bit set if versioned tensor
int32_t device_id;
} DLDevice;
typedef struct DLTensor {
void *data;
DLDevice device; // See above
int32_t ndim;
DLDataType dtype;
int64_t *shape;
int64_t *strides;
uint64_t byte_offset; // Unchanged up to here
void *manager_ctx; // Write zero if not managed
void (*deleter)(struct DLTensor *self); // Ditto
uint16_t version; // Readers must give up if higher than known
uint16_t flags; // Bits TBD; unused bits must be written zero
uint32_t reserved_in_v2; // Must be written zero
} DLTensor;
typedef DLTensor DLManagedTensor; // For compat
#define DLPACK_TENSOR_GET_VERSION(t) \
((t).device.device_type & 0x80000000 ? (t).version : 1)
#define DLPACK_TENSOR_GET_DEVICE_TYPE(t) \
((t).device.device_type & 0x7FFFFFFF) // Or ... | 0x80000000 with new enum constants
enum { // These are TBD, I'm just showing as examples
kDLFLAGS_READONLY_MASK = 1 << 0,
kDLFLAGS_BYTESWAPPED_MASK = 1 << 1,
};
#define DLPACK_TENSOR_IS_READONLY(t) \
(DLPACK_TENSOR_GET_VERSION(t) >= 2 && (t).flags & kDLFLAGS_READONLY_MASK)
// And so on Note that The enum constants for device type could be updated to include the set high bit, making writing v2+ the default. Function macros could be made When future additions are made, the version number should be incremented if the added feature is something that readers cannot safely ignore, or if it requires enlarging the struct. Readers can indefinitely support structs of previous versions, as far as I can see, unless they require specific guarantees by the writer indicated by new features. Writers can choose to keep writing an older version (but maybe recommend >= 2) if they do not need any newer features and wish to support the broadest range of readers. The There could be exceptions to my view that readers generally need to support features before writers can adopt them. For example, a hypothetical "immutable" flag (not to be confused with read-only; immutable is an additional promise by the writer that data won't change) could be added without incrementing the version number, because readers that don't care can safely ignore it. (I'm not sure whether an "immutable" flag would be practically beneficial. Just an example to illustrate version evolution.) |
Pinging here, since it has been idle for a long time and came up in NumPy yet again. To me (and others I have talked to), this is still a major blocker for any serious adoption of DLPack in Python. It would be nice to get some movement here. Otherwise we are sacrificing Python development to C++ compatibility which is a tradeoff that is not at all appealing to me from the Python side. |
Thanks @seberg for the ping and I agree with the point being said. Sorry for dropping the thread. After reading all the suggestions in the post, here is one suggestions resolution(that mostly evolves from A2) struct DLManagedTensor {
DLTensor dl_tensor;
void * manager_ctx;
void (*deleter)(struct DLManagedTensor * self);
};
struct DLManagedTensorVersioned {
Version version;
DLTensor dl_tensor;
void * manager_ctx;
void (*deleter)(struct DLManagedTensor * self);
// more flags
}
The immediate need for more flags is the read_only one. Which I agree is useful. We can discuss whether make it a bitmask or a uint_8 flag. Likely a uint32 boolean mask might be helpful to allow future flags being added. In order to manage the breaking change, we would need to update the python APIs in two steps:
The C++ side's code can automatically detects type errors as the type of Would love to see how folks think. @tirthasheshpatel @seberg @wjakob @mattip @areusch @rgommers |
Maybe we could try to set up a half-hour zoom call around this since it seems we are not communicating the change properly. As I understand things, it is incorrect to check ABI version in As proposed in #101, the place to call |
Sorry I was not getting it clearly, As per the new data-api, def from_dlpack(other: Object):
try:
ver = other.__dlpack__info()
except:
# older version that do not have info.
# checks here
return internal(other.__dlpack__()) |
so i guess are people passing around non-wrapped @tqchen curious if you're taking a position on the exact ABI break to make here (since @marktsuchida suggested one above), or whether you're first wanting to tackle the semantics of how the program behaves in the presence of a version mismatch? |
In the case of C++:
Indeed the proposal won't handle the breaks where a C++ api takes a void* and do cast. This is a very rare case, and the benefit of having a stable clean ABI for future versions outweights the one-step transition patches to support such cases, which I am not aware of any in production. |
Cool, I guess in C++ there is no serious ABI issue (unless dirty casts things are done). In C, there is in principle and libraries may need to introduce a new function name when adopting (if they wish to be careful and distinguish), but it seems not a major concern? If we can agree on such an ABI break by introducing a new name, then I think we can proceed with making a concrete proposal that is Python friendly (or a concrete proposal with the Python API in mind). If that is the step where we are, I would suggest a meeting with those interested to discuss how a very concrete Python API/friendly proposal will look like. |
On the C++ side, as long as we make it clear we discourage C-style cast and +1 for a meeting on this. |
Happy to host a half-hour zoom call on this. Please fill out your availability here https://www.when2meet.com/?16752946-wS8RV for the week of Sep 19. Note that the time is in EDT, so if you are in different time-zone, maybe some converstion is needed. |
Sorry, @tqchen, is it possible to open up time slots in the week of Sep 26? NVIDIA folks (=me 😅) would be busy in the Sep 19 week due to GTC. |
OK, updated the link to change ro week of Sep 26 |
I don't have enough of a stake in this to warrant joining the call, but just a quick note that I wrote my above comment under the assumption that (1) existing consumers blindly assume that received capsules contain (as type-erased I think I may have been incorrect about (1) -- if the capsule names ( The scheme I proposed is complicated and ugly, and would remain so semi-permanently. Its only advantage is that it can offer memory safety (in the (2) sense) without the type information having to travel separately. I wouldn't recommended it if there are other solutions like changing the capsule names. |
thanks @marktsuchida for careful thoughts, you are more than welcome to join the call btw |
Let us lock down to Friday, 230 EDT. Sep 30 given the responses. Please mark on your calendar and I will send a zoom link here |
The open discussion meeting of DLPack ABI update is scheduled to be on Friday 230EDT Sep 30. Please use the zoom link to join the meeting |
Summary Strawman from the meeting
Thanks @leofang @seberg @kkraus14 for joining the meeting. We will keep this thread open for one week feedback then post a notice for one month before we do the update struct DLManagedTensor {
DLTensor dl_tensor;
void * manager_ctx;
void (*deleter)(struct DLManagedTensor * self);
};
/*!
* \brief The DLPack and DLPack ABI versions of the tensor.
*/
typedef struct {
/*! \brief DLPack version. */
uint32_t dlpack;
/*! \brief DLPack ABI version. */
uint32_t abi;
} DLPackVersion;
struct DLManagedTensorVersioned {
DLPackVersion version;
void * manager_ctx;
void (*deleter)(struct DLManagedTensorVersioned * self);
uint64_t flags;
DLTensor dl_tensor;
}
typedef DLPACK_BIT_MASK_READ_ONLY 1
|
@tqchen do you have any timelime to release the managed version ? we are maintaining rust bindings here https://github.com/kornia/dlpack-rs that we use for image/video loading in kornia for which we want to release soon new versions. |
This is a thread to bring awareness and discuss the recent set of proposed change of adding versioning information to DLPack.
Summary
Up until now the DLPack C struct itself does not carry ABI version information, and we always maintain backward compatibility. Up until now there is no ABI breaking changes.
The overall stability is one reason why frameworks adopt DLPack in the first place and we would like to continue to do moving forward. In the meantime, it is indeed helpful to clarify the DLPack ABI version in the data structure itself. contains a draft proposal for the change. We can still
In short, we are looking into the possibility of attaching version information to the DLTensor struct, to minimize the ABI change, the version information is appended in the end of the current DLTensor
Summary of key changes
__dlpack_info__
function that returns the current supported API/ABI version.Compatibility implications
One thing that is worth clarifying is that whether this constitute a ABI breaking change. It depends on how people use it. Let us clarify the following scenarios.
__dlpack_info__
then do the conversion according__dlpack_info__
to indicate that it is already at a new versionNormally S0 means that the data structure ABI is still backward compatible (new data structure being used in old scenarios). S1 somewhat seats in the future-compatible regime(old data structure being used in new scenarios).
Related threads
What to Expect
This is a notice and discussion thread to let the community that this is happening. It would be great to get folks to also tag related framework communities, so the ecosystem is fully aware of(and ideally buy-in) the proposed change before we formally proceed.
Given this change does have compatibility implications (although not breaking depending on how we see it), we will have a longer period of NOTICE(expect 1-2 months) before we proceed. Please tag as many folks you think are related as possible.
In the future, we also suggest to open thread of this kind to discuss the compatibility implication of the changes, if any.
The text was updated successfully, but these errors were encountered: