Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[storage] Transaction support #212

Open
wants to merge 2 commits into
base: main
Choose a base branch
from
Open

Conversation

benlwalker
Copy link

A transaction is created by calling CreateTransaction on the
TransactionService, which allocates a txn_id. This txn_id can
(optionally) be included in many storage RPCs. If a storage RPC contains
a valid txn_id, it's execution will be delayed until the transaction is
later executed using ExecuteTransaction.

When a transaction is executed, all of the steps within that transaction
either succeed or fail as a unit. When a transaction completes, the state
of the system will be either as if the transaction never occurred (for a
failure) or as if all steps succeeded.

Transactions must have limited lifetimes after they are created to deal
with clients creating transactions and then going away without executing
or deleting them. The agent implementing this service should periodically
perform clean up as it deems necessary.

Ben Walker added 2 commits November 17, 2022 14:27
A transaction is created by calling CreateTransactionRequest on the
TransactionService, which allocates a txn_id. This txn_id can
(optionally) be included in many storage RPCs. If a storage RPC contains
a valid txn_id, it's executable will be delayed until the transaction is
later executed using ExecuteTransaction.

When a transaction is executed, all of the steps within that transaction
either succeed or fail as a unit. When a transaction completes, the state
of the system will be either as if the transaction never occurred (for a
failure) or as if all steps succeeded.

Transactions must have limited lifetimes after they are created to deal
with clients creating transactions and then going away without executing
or deleting them. The agent implementing this RPC should periodically
perform clean up as it deems necessary.

Signed-off-by: Ben Walker <benjamin.walker@intel.com>
Signed-off-by: Ben Walker <benjamin.walker@intel.com>
@benlwalker benlwalker requested a review from a team as a code owner November 17, 2022 21:34
@benlwalker benlwalker changed the title Transaction support [storage] Transaction support Nov 17, 2022
Copy link
Member

@glimchb glimchb left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I like the idea, not approving officially yet, so others can comment and review before we merge

@glimchb glimchb added Storage APIs or code related to storage area Merge Candidate in the open merge window, next candidate for merge labels Nov 17, 2022
@jainvipin
Copy link
Contributor

Usually REST APIs doesn't have a concept of inter-related resources that need to be atomically committed. However we should adopt this if if it makes sense for us. Overall I like the idea... but we can keep it optional (naturally if a transaction is not started then rest is optional and life is usual). A few things that deem a discussion

  1. Do we want to support multiple ||el transactions? Both yes and no have their own pros/cons, which we can discuss.
  2. The response needs to be verbose, even though requests to add/delete went together, the response for various operations need to be reported individually so user knows what went wrong on a specific object
  3. If we follow the Spec/Status semantics, we need one object, we can call is Transaction but have Spec identifying the transaction id, and Status listing the status of the transaction, may be it is not an imperative operation and may require resources to asynchronously be available.

@pohly
Copy link

pohly commented Nov 18, 2022

I wouldn't make this support optional. As I mentioned elsewhere, idempotency is crucial for CSI drivers that want to use this API. To achieve this, they can repeat all API calls that haven't completed successfully before, but I have the impression that this is going to be tricky for some these fine-grained API calls (please correct me if I'm wrong). It can also leave the system in a partially configured state.

Both a single, big API call and many smaller calls with transaction support avoid that. I'm fine with both approaches.

@intelfisz
Copy link

Agree with @pohly. Especially when dealing with xPUs making this optional would allow the HW to be left in some partially configured/inconsistent state. I see no reason for leaving such a backdoor open - @jainvipin? If you are concerned that creating transactions may impose some performance tax I would not worry about that since these are configuration and management calls - once the offload machinery is set up, data plane traffic would not be affected.

@glimchb
Copy link
Member

glimchb commented Nov 18, 2022

  1. Can we also explore explore Batch methods https://google.aip.dev/233 #211 ? Batch methods according to https://google.aip.dev/233 ?
    instead of transaction ?
  2. I consulted with EMC CSI drivers writers and they said that idempotency and atomicity is already part of CSI drivers today, so we don't need that support just for DPUs. Other use cases don;t provide this option, so this will create a better, but unnecessary functionality and code divergency

@intelfisz
Copy link

  1. I consulted with EMC CSI drivers writers and they said that idempotency and atomicity is already part of CSI drivers today, so we don't need that support just for DPUs. Other use cases don;t provide this option, so this will create a better, but unnecessary functionality and code divergency

I find this confusing. We are not talking here about existing CSI drivers for setting up remote storage backends in particular but rather about frontend configuration of "local targets" for which new CSI drivers need to be written and which are supposed to utilize OPI API. It is up to the implementation to guarantee idempotency requirement - the transaction idea is to facilitate this.

@pohly
Copy link

pohly commented Nov 18, 2022

they said that idempotency and atomicity is already part of CSI drivers today, so we don't need that support just for DPUs

It's part of a CSI driver if its implementation of calls like NodeStageVolume are written properly. That depends on support in the underlying API. For example, suppose the underlying API has a CreateNewInterface() CreateNewInterfaceResponse call that is needed by NodeStageVolume to set up something in the DPU:

func NodeStageVolume(...) {
    resp := api.CreateNewInterface()
    // do something with the new interface
   ...
}

This implementation of NodeStageVolume is not idempotent. If it gets interrupted after CreateNewInterface has changed the DPU configuration and gets called again, it will create another "interface" (whatever that is), leaving the one created before behind as an orphan resource that will never get cleaned up.

Every single API call has to be defined so that the caller can figure out whether it has completed or be called again (idempotency). The introduction of a transaction ID is a simple way of achieving this because calls have no side effect until the transaction completes.

I just see one problem: what if ExecuteTransaction itself gets interrupted so that it completes on the DPU, but the response doesn't get back to the CSI driver?

@glimchb
Copy link
Member

glimchb commented Nov 18, 2022

I find this confusing. We are not talking here about existing CSI drivers for setting up remote storage backends in particular but rather about frontend configuration of "local targets" for which new CSI drivers need to be written and which are supposed to utilize OPI API. It is up to the implementation to guarantee idempotency requirement - the transaction idea is to facilitate this.

nobody is writing new drivers from scratch, we adding OPI to many already existing options in CSI node drivers
see https://github.com/opiproject/opi-api/blob/main/doc/images/CSI-With-xPU-option1.png
in this option1, controller driver is unchanged, node driver uses OPI gRPC in addition to what it has today

@pohly
Copy link

pohly commented Nov 18, 2022

node driver uses OPI gRPC in addition to what it has today

And it is that part where OPI gRPC calls have to support implementing an idempotent NodeStageVolume.

@intelfisz
Copy link

I find this confusing. We are not talking here about existing CSI drivers for setting up remote storage backends in particular but rather about frontend configuration of "local targets" for which new CSI drivers need to be written and which are supposed to utilize OPI API. It is up to the implementation to guarantee idempotency requirement - the transaction idea is to facilitate this.

nobody is writing new drivers from scratch, we adding OPI to many already existing options in CSI node drivers see https://github.com/opiproject/opi-api/blob/main/doc/images/CSI-With-xPU-option1.png in this option1, controller driver is unchanged, node driver uses OPI gRPC in addition to what it has today

No matter if you extend or write from scratch - it is up to you to guarantee idempotency. Transaction API can greatly facilitate this.

@glimchb
Copy link
Member

glimchb commented Nov 18, 2022

it will create another "interface" (whatever that is), leaving the one created before behind as an orphan resource that will never get cleaned up.

it will not create a new interface. it will use GET to check if interface already exists and not create a new one...

@glimchb
Copy link
Member

glimchb commented Nov 18, 2022

Transaction API can greatly facilitate this.

I agree with this, they can help. But do CSI driver providers need this help or not is another story

@pohly
Copy link

pohly commented Nov 18, 2022

it will not create a new interface. it will use GET to check if interface already exists and not create a new one...

It could do that, but the version I showed didn't to demonstrate the problem 😁

Has someone tried writing down what a correct implementation of NodeStageVolume has to look like with the current OPI gRCP API? That would be useful to verify whether the API has the necessary support (like a GET call for each create).

@glimchb
Copy link
Member

glimchb commented Nov 18, 2022

Has someone tried writing down what a correct implementation of NodeStageVolume has to look like with the current OPI gRCP API? That would be useful to verify whether the API has the necessary support (like a GET call for each create).

we are adding OPI support to existing CSI drivers now, this is where all small fixes to APIs come from my side...

@glimchb
Copy link
Member

glimchb commented Nov 18, 2022

one feedback we already got, that even with GETs or TRANSACTIONs there are too many code, too many API calls...
we were suggested to create a go library as a good practice already exists today in CSI world...

this is why we creating https://github.com/opiproject/goopicsi

@jainvipin
Copy link
Contributor

jainvipin commented Nov 18, 2022

I wouldn't make this support optional. As I mentioned elsewhere, idempotency is crucial for CSI drivers that want to use this API. To achieve this, they can repeat all API calls that haven't completed successfully before, but I have the impression that this is going to be tricky for some these fine-grained API calls (please correct me if I'm wrong). It can also leave the system in a partially configured state.

Both a single, big API call and many smaller calls with transaction support avoid that. I'm fine with both approaches.

Idempotency and transaction semantics are totally independent.
Transaction allows atomicity of commit on multiple objects, where as idempotency is natively built in REST semantics where resources can be committed anytime over and over again without a consequence, etc. I'd favor idempotency, but mandating transactional processing for multi-object atomic commit is a bit extreme.

@glimchb
Copy link
Member

glimchb commented Nov 18, 2022

Idempotency, for example using https://google.aip.dev/133#user-specified-ids

@jainvipin
Copy link
Contributor

Agree with @pohly. Especially when dealing with xPUs making this optional would allow the HW to be left in some partially configured/inconsistent state. I see no reason for leaving such a backdoor open - @jainvipin? If you are concerned that creating transactions may impose some performance tax I would not worry about that since these are configuration and management calls - once the offload machinery is set up, data plane traffic would not be affected.

Well, things are working just fine without transactions for us in our deployments. So I can ascertain that this is not mandatory. DPU is not the highest level of manageability entity in the infrastructure, therefore partial failures can be handled quite efficiently up in the caller as well. Having said, enable/disable #178 captures turning on the volume only when everything is successful, giving user a full control.

xPUs can very likely implement 2 phase commit (required by transactional semantics) - and so we could introduce the concept, but mandating it would be to impose things that are not needed in NVMe, at least not for now. In fact, I'd say that let's go without transactional semantics for now, and see if anyone runs into a problem that we need to solve here - at least here at AMD, we don't see a problem without transactions (for NVMe).

In fact transactional semantics (esp multiple transactions simultaneous) is going to break idempotency, because the update on an object later in different transaction could be void with prior commit on the same object in later transaction. So we are creating a sense of idempotency at transaction level and not object level.

@benlwalker
Copy link
Author

Adding transactions facilitates making the APIs idempotent. A single CSI call likely results in many OPI calls, each of which modifies state and can fail. Either you check a bunch of state in the CSI plugin implementation, then submit these one by one and on a failure attempt to roll it all back, or we have a transaction API that does the whole set as a unit and lets the CSI plugin check many fewer things and not deal with rollback.

The transaction API also deals with parallelism in a nicer way. You can build up transactions in parallel, but I'd highly recommend that the agent implementations take a global lock around the ExecuteTransaction call to serialize everything within that xPU. Overlapped provisioning is extremely problematic.

this is why we creating https://github.com/opiproject/goopicsi

While having a golang wrapper around OPI is convenient, I'd highly discourage that as a place where we implement any key features of OPI, such as transactionality or idempotence. This is because not all orchestration systems are written in go - some are Python, or Java, or C#, or JavaScript. Critical functionality in OPI can't be in a golang specific module.

@benlwalker
Copy link
Author

Well, things are working just fine without transactions for us in our deployments. So I can ascertain that this is not mandatory. DPU is not the highest level of manageability entity in the infrastructure, therefore partial failures can be handled quite efficiently up in the caller as well. Having said, enable/disable #178 captures turning on the volume only when everything is successful, giving user a full control.

You've just pushed the idempotence logic up into the orchestration plugin. There's nothing inherently wrong with that approach, other than now every orchestration system has to implement that. We're only arguing that it's more effective for the agent running on the xPU to implement it, given the knowledge of the xPU internals that the agent has. It's not at all about what works and what doesn't, it's about what's best.

@jainvipin
Copy link
Contributor

Well, things are working just fine without transactions for us in our deployments. So I can ascertain that this is not mandatory. DPU is not the highest level of manageability entity in the infrastructure, therefore partial failures can be handled quite efficiently up in the caller as well. Having said, enable/disable #178 captures turning on the volume only when everything is successful, giving user a full control.

You've just pushed the idempotence logic up into the orchestration plugin. There's nothing inherently wrong with that approach, other than now every orchestration system has to implement that. We're only arguing that it's more effective for the agent running on the xPU to implement it, given the knowledge of the xPU internals that the agent has. It's not at all about what works and what doesn't, it's about what's best.

Again, idempotency has nothing to do with transaction semantics.
I am not sure what is the goal here. We can have idempotency (which works best with declarative APIs vs. imperative APIs) with or without transaction semantics.

And as for handling failure, the caller of the API will need to handle the failure regardless of whether a transaction fails or an object operation fails. But I do understand that transactional failure handling may be simpler if orchestrator wants to do away with withdrawing an object upon partial failure.

Also note that I am not saying there is anything wrong with the approach, in fact I like the idea of a user wanting to do atomic commit on multiple objects. What I am objecting to is that this is the only way to do it, because I'd refuse to believe that based on my deployment experience. Of course, we can make it forceful- despite that.

In fact, I'd challenge that we run current APIs in production and see if we even need transactional semantics in storage APIs?

@pohly
Copy link

pohly commented Nov 18, 2022

this is why we creating https://github.com/opiproject/goopicsi

Is that already supposed to be idempotent?

From the OPI proto file it is not clear to me how CreateNVMeSubsystem will react when called twice, and in https://github.com/opiproject/goopicsi/blob/c545e44b269d3fb77b8331231df8448576e1959c/goopicsi.go#LL64C17-L64C37 there is no special handling of an "already exists" error.

@glimchb
Copy link
Member

glimchb commented Nov 18, 2022

this is why we creating https://github.com/opiproject/goopicsi

Is that already supposed to be idempotent?

in development, PRs in progress, you are welcome to contribute as well, this is open

From the OPI proto file it is not clear to me how CreateNVMeSubsystem will react when called twice,

one simple option is just to return same subsystems[in.SubsystemId.Value]
https://github.com/opiproject/opi-spdk-bridge/blob/main/server/frontend.go#L50
without creating new object.

PRs are welcome

@jainvipin
Copy link
Contributor

this is why we creating https://github.com/opiproject/goopicsi

Is that already supposed to be idempotent?

From the OPI proto file it is not clear to me how CreateNVMeSubsystem will react when called twice, and in https://github.com/opiproject/goopicsi/blob/c545e44b269d3fb77b8331231df8448576e1959c/goopicsi.go#LL64C17-L64C37 there is no special handling of an "already exists" error.

Typically, per REST semantics you'd do POST on an object and then PUT (to update, as many times). The point I was trying to make is that I can have transaction semantics and yet fail upon double create and so I may not get idempotency with transaction semantics. If backend doesn't implement idempotency we can seek that as the goal, not mix it with transactions.

@pohly
Copy link

pohly commented Nov 18, 2022

one simple option is just to return same subsystems[in.SubsystemId.Value]
https://github.com/opiproject/opi-spdk-bridge/blob/main/server/frontend.go#L50
without creating new object.

But doesn't that then have to be in the API spec? Clients cannot rely on this unless such behavior is required from an implementation and thus guaranteed.

PRs are welcome

I'm afraid I don't know enough about these individual APIs to suggest enhancements. I probably wouldn't even get the terminology right 😅

@glimchb
Copy link
Member

glimchb commented Nov 18, 2022

But doesn't that then have to be in the API spec? Clients cannot rely on this unless such behavior is required from an implementation and thus guaranteed.

https://google.aip.dev/133#user-specified-ids

If a user tries to create a resource with an ID that would result in a duplicate resource name, the service must error with ALREADY_EXISTS.

and

Important: Declarative-friendly resources ([AIP-128](https://google.aip.dev/128)) must support user-specified IDs.

@glimchb glimchb removed the Merge Candidate in the open merge window, next candidate for merge label Dec 8, 2022
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Storage APIs or code related to storage area
Projects
None yet
Development

Successfully merging this pull request may close these issues.

5 participants