Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Replace BinaryFormatter with MessagePack serialization #1166

Open
wants to merge 3 commits into
base: main
Choose a base branch
from

Conversation

arsdragonfly
Copy link

Fixes #1131 . As explained in my comment #1131 the whole thing is very much pointless due to the nature of Spark being RCE so we must be able to pass arbitrary data/computation around (except for one place where we don't really need the flexibility of deserializing arbitrary types and I've replaced that with a strongly typed deserializer). Nevertheless, using MessagePack's typeless serialization will hopefully saves us the trouble of the ecosystem downstream.

Copy link

@dbeavon dbeavon left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Starting review ... 04/16/2024

Copy link

@dbeavon dbeavon left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can we merge this?

Would be great to have a new version of Microsoft.Spark with these MessagePack changes, and with an upgrade to .net 8.

Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hi @arsdragonfly

I was looking at the dependencies, and the MessagePack library only seems to be compatible with .net standard 2.0.
https://www.nuget.org/packages/MessagePack/2.5.140#dependencies-body-tab

Should we start removing the targeting for .net standard 2.1?

Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

oops. meant to put this comment on the Microsoft.Spark.csproj.

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

not sure why the netstandard2.1 tf was there before, looks redundant to me

@dbeavon
Copy link

dbeavon commented Apr 17, 2024

@elvaliuliuliu @AFFogarty

Can someone please help with this merge? Or assign to reviewers who have write access?

This introduces dependencies on :

  • "MessagePack"
  • "MessagePack.Annotations"
  • "Microsoft.NET.StringTools" (MSBuild utilities)

@arsdragonfly Have you already found someone to help you merge this, by any chance?

@dbeavon
Copy link

dbeavon commented Apr 18, 2024

@luisquintanilla @GoEddie
Do you have rights to approve of PR's? Can you help with this one?
It was a primary concern that needed to be addressed, according to AFFogarty

@GoEddie
Copy link
Contributor

GoEddie commented Apr 18, 2024

@luisquintanilla @GoEddie Do you have rights to approve of PR's? Can you help with this one? It was a primary concern that needed to be addressed, according to AFFogarty

Sorry I don’t

@GeorgeS2019
Copy link

GeorgeS2019 commented Apr 19, 2024

@arsdragonfly => Great job!

@luisquintanilla

You are one of the pioneering advocates. Could you help identify who has the right to trigger the merge?

I will write you en email.

@GeorgeS2019
Copy link

@dbeavon

Can this merge also work for .NET 8.0?
#1170

@GeorgeS2019
Copy link

@arsdragonfly
@dbeavon

With this merge, what version of Spark will this support?

Spark 3.3 or beyond to spark 3.5.1?
What version has been tested to work with this merge?
#1163

@GeorgeS2019 GeorgeS2019 mentioned this pull request Apr 20, 2024
@GeorgeS2019
Copy link

GeorgeS2019 commented Apr 20, 2024

@dbeavon

Could you put all the necessary merges and upgrade e.g. to .NET8 to your fork?

I could direct people to your fork.

We state clearly which latest version of Spark this will support.

We work on this fork until the dotnet team responds.

Reference: #1174

@dbeavon
Copy link

dbeavon commented Apr 20, 2024

@GeorgeS2019

Could you put all the necessary merges and upgrade e.g. to .NET8 to your fork?

Yes, but I think we need to know what cloud platforms we want to support. I think there are some minor tweaks needed for HDInsight.

I'm guessing we should limit ourselves to HDInsight for now (and local OSS on windows).

Then we need to decide if we use a new version number or stick with 2.1.x.
Then we need to decide if we publish a nuget for our Microsoft.Spark changes (to make it easier when new developers start working on a .Net driver)

The amount of work to be done is non-trivial, maybe about 40 hours to get up and running with .net 8 on HDI. Most of the work is non-technical and NOT an area where I add value (ie. management and decision making and communicating). I don't want to host this in my own fork forever. Have you tried to contact Microsoft? Do you have a .net foundation membership?

Frankly, I'd prefer if you owned a fork, and I'll send you PR's to merge. Your help would be invaluable, whether technical or non-technical. I can send you PR's right away for .net 8 and messagepack and HDI compatibility.

@GeorgeS2019
Copy link

GeorgeS2019 commented Apr 20, 2024

Start small..we recruit more to join. Local Windows 11 OS is key.

Ideally we show the spark dotnet book examples work in e.g. Net

So new and existing users can read the book and yet doing exercises code based on
Net 8.0. And a convenient microsft spark 2.1.* version, as U have suggested

@GeorgeS2019
Copy link

Updated

@dbeavon
Copy link

dbeavon commented Apr 20, 2024

@arsdragonfly @dbeavon

With this merge, what version of Spark will this support?

The versions of Spark are probably the same as they were in Microsoft.Spark 2.1.1.

I am only testing Spark 3.1.2. (Version supported for HDI 5) That has been sitting around for a long time.

I am pretty sure we could get to other versions of spark if needed. It seems like the least of our concerns at this point.

What is the concern insofar as Spark versioning? Please remember that .net and spark are pretty independent of each other. You can support a wide range of versions without too much difficulty.

@GeorgeS2019
Copy link

Just start very small..others will join with renewed interest

@arsdragonfly
Copy link
Author

still discussing internally on how to structure project governance, stay tuned

@GrabYourPitchforks
Copy link
Member

GrabYourPitchforks commented May 14, 2024

Note to @AFFogarty and other MSFT reviewers: MessagePack's "typeless" functionality is banned for MSFT internal use under the same regulation that bans BinaryFormatter for MSFT internal use. If you accept this PR, you will still have to jump through the same "ask upper management for an exception to use this" hoops that you would've had to go through with BF.

As OP notes, you could make an argument that since no logical trust boundary is being crossed, it's all a moot point anyway and the exception should be granted. I imagine your security team would sign off on that statement. But the main point I'm trying to make is that this PR does not simplify any internal signoff + release process. This PR should be viewed wholly through the lens of broader ecosystem impact, nothing more. I am only posting this comment because I am the compliance owner for serialization related technologies and I wanted to clarify how this PR would / would not impact any signoff.

@arsdragonfly
Copy link
Author

Note to @AFFogarty and other MSFT reviewers: MessagePack's "typeless" functionality is banned for MSFT internal use under the same regulation that bans BinaryFormatter for MSFT internal use. If you accept this PR, you will still have to jump through the same "ask upper management for an exception to use this" hoops that you would've had to go through with BF.

As OP notes, you could make an argument that since no logical trust boundary is being crossed, it's all a moot point anyway and the exception should be granted. I imagine your security team would sign off on that statement. But the main point I'm trying to make is that this PR does not simplify any internal signoff + release process. This PR should be viewed wholly through the lens of broader ecosystem impact, nothing more. I am only posting this comment because I am the compliance owner for serialization related technologies and I wanted to clarify how this PR would / would not impact any signoff.

Thanks for the reminder. This is the unfortunate workaround that we'd probably have to settle around since BF is on its way out and the strictly statically typed alternative that Orleans folks chose to use since Orleans 7 will IMO make .NET for Spark completely useless for anything but the most trivial toy programs. @luisquintanilla could you help drive the security review/signoff process for this particular PR and for our future community-driven development model in general?

Copy link

@grazy27 grazy27 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for your PR!

@@ -20,7 +20,6 @@ namespace Microsoft.Spark
/// also attempts to distribute broadcast variables using efficient broadcast algorithms to
/// reduce communication cost.
/// </summary>
[Serializable]
public sealed class Broadcast<T> : IJvmObjectReferenceProvider
{
[NonSerialized]
Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Let's replace [NonSerialized] wtih [IgnoreMember] attribute from MessagePack to ensure non-breaking changes

@@ -20,7 +20,6 @@ namespace Microsoft.Spark
/// also attempts to distribute broadcast variables using efficient broadcast algorithms to
/// reduce communication cost.
/// </summary>
[Serializable]
public sealed class Broadcast<T> : IJvmObjectReferenceProvider
Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

In order to register broadcast var in the registry on it being serialized, we can implement IMessagePackSerializationCallbackReceiver, and rename existing OnSerialized to OnBeforeSerialize(). Otherwise it's impossible to access var from worker context

Comment on lines +21 to +24
// TODO: find out why MessagePack is requiring a parameterless constructor.
public TestClass()
{
}
Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

There are several more classes that require similar empty ctors, such as ExternalClass or Broadcast

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Now that the standalone nuget package for BF is now available, I'm leaning towards scraping the whole PR and moving to that instead, because this is IMO a major limitation that could break any code that uses user/3rd party classes that reference a class without an empty ctor. What do you think?

Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Since microsoft doesn't add [Serializable] attributes to new types since .NET 6, sticking to NuGet will still not work long-term - As soon as there's a new commonly used type - integration will fall apart.

Imho the best option long-term would be either to investigate MsgPack further (Custom resolver for classes without default ctor and workaround for the bug with empty stream), or, if there's no way to work-around ctor behavior - make serializer configurable and include both options

I created a PR with version bump up for .NET and dependencies, for .NET8 it's still possible to use BinaryFormatter from library, for .NET9 we'll have to migrate to NuGet

Comment on lines 73 to 76
public object Deserialize(Stream stream, int length)
{
return _formater.Deserialize(stream);
return MessagePackSerializer.Typeless.Deserialize(stream);
}
Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

There's a bug in MessagePack I found while tried to reuse these changes, when called from Collector::Collect: Deserializer erases entire stream after the first read.

Workaround for it is to either read stream in small chunks and pass to MessagePack, or load entire stream in MemoryStream and then use it.

MessagePack-CSharp/MessagePack-CSharp#1881

@arsdragonfly
Copy link
Author

arsdragonfly commented Sep 11, 2024

do not merge this PR yet (see above mentioned problems)

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

JVM IPC Deserialization uses BinaryFormatter, which is now Deprecated for OWASP CWE
7 participants