From 03b79393e71910a33a39864e563fcbeb2de56658 Mon Sep 17 00:00:00 2001
From: Niharika Dutta
Date: Sun, 19 Apr 2020 22:31:05 -0700
Subject: [PATCH 01/20] Adding section for UDF serialization

---
 docs/broadcast-guide.md |  92 +++++++++++++++++++++
 docs/udf-guide.md       | 172 ++++++++++++++++++++++++++++++++++++++++
 2 files changed, 264 insertions(+)
 create mode 100644 docs/broadcast-guide.md
 create mode 100644 docs/udf-guide.md

diff --git a/docs/broadcast-guide.md b/docs/broadcast-guide.md
new file mode 100644
index 000000000..4286c569e
--- /dev/null
+++ b/docs/broadcast-guide.md
@@ -0,0 +1,92 @@
+# Guide to using Broadcast Variables
+
+This is a guide to show how to use broadcast variables in .NET for Apache Spark.
+
+## What are Broadcast Variables
+
+[Broadcast variables in Apache Spark](https://spark.apache.org/docs/2.2.0/rdd-programming-guide.html#broadcast-variables) are a mechanism for sharing read-only variables across executors. They allow the programmer to keep a read-only variable cached on each machine rather than shipping a copy of it with tasks. They can be used, for example, to give every node a copy of a large input dataset in an efficient manner.
+
+### How to use broadcast variables in .NET for Apache Spark
+
+Broadcast variables are created from a variable `v` by calling `SparkContext.Broadcast(v)`. The broadcast variable is a wrapper around `v`, and its value can be accessed by calling the `Value()` method on it.
+
+Example:
+
+```csharp
+string v = "Variable to be broadcasted";
+Broadcast<string> bv = SparkContext.Broadcast(v);
+
+// Using the broadcast variable in a UDF:
+Func<Column, Column> udf = Udf<string, string>(
+    str => $"{str}: {bv.Value()}");
+```
+
+The type of the broadcast variable is captured by using generics in C#, as can be seen in the above example.
+
+### Deleting broadcast variables
+
+The broadcast variable can be deleted from all executors by calling the `Destroy()` method on it.
+
+```csharp
+// Destroying the broadcast variable bv:
+bv.Destroy();
+```
+
+> Note: `Destroy` deletes all data and metadata related to the broadcast variable. Use this with caution: once a broadcast variable has been destroyed, it cannot be used again.
+
+#### Caveat of using Destroy
+
+One important thing to keep in mind while using broadcast variables in UDFs is to limit the scope of the variable to only the UDF that is referencing it. The [guide to using UDFs](udf-guide.md) describes this behavior in detail. This is especially crucial when calling `Destroy` on the broadcast variable. If the destroyed broadcast variable is visible to or accessible from other UDFs, it gets picked up for serialization by all those UDFs, even if it is not being referenced by them. This will throw an error, as .NET for Apache Spark is not able to serialize the destroyed broadcast variable.
+
+Example to demonstrate:
+
+```csharp
+string v = "Variable to be broadcasted";
+Broadcast<string> bv = SparkContext.Broadcast(v);
+
+// Using the broadcast variable in a UDF:
+Func<Column, Column> udf1 = Udf<string, string>(
+    str => $"{str}: {bv.Value()}");
+
+// Destroying bv
+bv.Destroy();
+
+// Calling udf1 after destroying bv throws the following expected exception:
+// org.apache.spark.SparkException: Attempted to use Broadcast(0) after it was destroyed
+df.Select(udf1(df["_1"])).Show();
+
+// Different UDF udf2 that is not referencing bv
+Func<Column, Column> udf2 = Udf<string, string>(
+    str => $"{str}: not referencing broadcast variable");
+
+// Calling udf2 throws the following (unexpected) exception:
+// [Error] [JvmBridge] org.apache.spark.SparkException: Task not serializable
+df.Select(udf2(df["_1"])).Show();
+```
+
+The recommended way of implementing the desired behavior:
+
+```csharp
+string v = "Variable to be broadcasted";
+// Restricting the visibility of bv to only the UDF referencing it
+{
+    Broadcast<string> bv = SparkContext.Broadcast(v);
+
+    // Using the broadcast variable in a UDF:
+    Func<Column, Column> udf1 = Udf<string, string>(
+        str => $"{str}: {bv.Value()}");
+
+    // Destroying bv
+    bv.Destroy();
+}
+
+// Different UDF udf2 that is not referencing bv
+Func<Column, Column> udf2 = Udf<string, string>(
+    str => $"{str}: not referencing broadcast variable");
+
+// Calling udf2 works fine as expected
+df.Select(udf2(df["_1"])).Show();
+```
+This ensures that destroying `bv` doesn't affect calling `udf2` because of unexpected serialization behavior.
+
+Broadcast variables are very useful for transmitting read-only data to all executors, as the data is sent only once, which gives huge performance benefits when compared with using local variables that get shipped to the executors with each task. Please refer to the [official documentation](https://spark.apache.org/docs/2.2.0/rdd-programming-guide.html#broadcast-variables) to get a deeper understanding of broadcast variables and why they are used.
\ No newline at end of file
diff --git a/docs/udf-guide.md b/docs/udf-guide.md
new file mode 100644
index 000000000..bb308815d
--- /dev/null
+++ b/docs/udf-guide.md
@@ -0,0 +1,172 @@
+# Guide to User-Defined Functions (UDFs)
+
+This is a guide to show how to use UDFs in .NET for Apache Spark.
+
+## What are UDFs
+
+[User-Defined Functions (UDFs)](https://spark.apache.org/docs/latest/api/java/org/apache/spark/sql/expressions/UserDefinedFunction.html) are a feature of Spark that allow developers to use custom functions to extend the system's built-in functionality. They transform values from a single row within a table to produce a single corresponding output value per row, based on the logic defined in the UDF.
+
+Let's take the following as an example of a UDF definition:
+
+```csharp
+string s1 = "hello";
+Func<Column, Column> udf = Udf<string, string>(
+    str => $"{s1} {str}");
+```
+The above defined UDF takes a `string` as an input (in the form of a [Column](https://github.com/dotnet/spark/blob/master/src/csharp/Microsoft.Spark/Sql/Column.cs#L14) of a [Dataframe](https://github.com/dotnet/spark/blob/master/src/csharp/Microsoft.Spark/Sql/DataFrame.cs#L24)), and returns a `string` with `hello` prepended to the input.
+
+For a sample Dataframe, let's take the following Dataframe `df`:
+
+```text
++-------+
+|   name|
++-------+
+|Michael|
+|   Andy|
+| Justin|
++-------+
+```
+
+Now let's apply the above defined `udf` to the dataframe `df`:
+
+```csharp
+DataFrame udfResult = df.Select(udf(df["name"]));
+```
+
+This would return the below as the Dataframe `udfResult`:
+
+```text
++-------------+
+|         name|
++-------------+
+|hello Michael|
+|   hello Andy|
+| hello Justin|
++-------------+
+```
+To get a better understanding of how to implement UDFs, please take a look at the [UDF helper functions](https://github.com/dotnet/spark/blob/master/src/csharp/Microsoft.Spark/Sql/Functions.cs#L3616) and some [test examples](https://github.com/dotnet/spark/blob/master/src/csharp/Microsoft.Spark.E2ETest/UdfTests/UdfSimpleTypesTests.cs#L49).
+
+## UDF serialization
+
+Since UDFs are functions that need to be executed on the workers, they have to be serialized and sent to the workers as part of the payload from the driver. This involves serializing the [delegate](https://docs.microsoft.com/en-us/dotnet/csharp/programming-guide/delegates/), which is a reference to the method, along with its [target](https://docs.microsoft.com/en-us/dotnet/api/system.delegate.target?view=netframework-4.8), which is the class instance on which the current delegate invokes the instance method. Please take a look at this [code](https://github.com/dotnet/spark/blob/master/src/csharp/Microsoft.Spark/Utils/CommandSerDe.cs#L149) to get a better understanding of how UDF serialization is done.
+
+## Good to know while implementing UDFs
+
+One behavior to be aware of while implementing UDFs in .NET for Apache Spark is how the target of the UDF gets serialized. .NET for Apache Spark uses .NET Core, which does not support serializing delegates, so serialization is instead done by using reflection on the target where the delegate is defined. When multiple delegates are defined in a common scope, they share a single closure, and that closure becomes the target that reflection serializes. Let's take an example to illustrate what that means.
+
+The following code snippet defines two string variables that are referenced in two function delegates, which simply return the respective strings as the result:
+
+```csharp
+using System;
+
+public class C {
+    public void M() {
+        string s1 = "s1";
+        string s2 = "s2";
+        Func<string, string> a = str => s1;
+        Func<string, string> b = str => s2;
+    }
+}
+```
+
+For the above C# code, the compiler generates the following lowered code (credit: [sharplab.io](https://sharplab.io)):
+
+```csharp
+public class C
+{
+    [CompilerGenerated]
+    private sealed class <>c__DisplayClass0_0
+    {
+        public string s1;
+
+        public string s2;
+
+        internal string <M>b__0(string str)
+        {
+            return s1;
+        }
+
+        internal string <M>b__1(string str)
+        {
+            return s2;
+        }
+    }
+
+    public void M()
+    {
+        <>c__DisplayClass0_0 <>c__DisplayClass0_ = new <>c__DisplayClass0_0();
+        <>c__DisplayClass0_.s1 = "s1";
+        <>c__DisplayClass0_.s2 = "s2";
+        Func<string, string> func = new Func<string, string>(<>c__DisplayClass0_.<M>b__0);
+        Func<string, string> func2 = new Func<string, string>(<>c__DisplayClass0_.<M>b__1);
+    }
+}
+```
+As can be seen in the above lowered code, both `func` and `func2` share the same closure `<>c__DisplayClass0_0`, which is the target that is serialized when serializing the delegates `func` and `func2`. Hence, even though `Func<string, string> a` only references `s1`, `s2` also gets serialized when sending the bytes over to the workers.
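The shared closure can also be observed directly in plain .NET, without Spark involved: both delegates expose the same object through their `Target` property. Below is a minimal standalone sketch of that check (the `ClosureDemo` class name is illustrative and not part of the guide's code):

```csharp
using System;

public class ClosureDemo
{
    public static void Main()
    {
        string s1 = "s1";
        string s2 = "s2";
        Func<string, string> a = str => s1;
        Func<string, string> b = str => s2;

        // Both lambdas capture locals declared in the same scope, so the compiler
        // gives them one shared closure instance. Serializing either delegate's
        // Target therefore pulls in both s1 and s2.
        Console.WriteLine(ReferenceEquals(a.Target, b.Target)); // True
    }
}
```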
+ +This can lead to some unexpected behaviors at runtime (like in the case of using [broadcast variables](broadcast-guide.md)), which is why we recommend restricting the visibility of the variables used in a function to that function's scope. +Taking the above example to better explain what that means: + +Recommended user code to implement desired behavior of previous code snippet: + +```csharp +using System; + +public class C { + public void M() { + { + string s1 = "s1"; + Func a = str => s1; + } + { + string s2 = "s2"; + Func b = str => s2; + } + } +} +``` + +The above C# code generates the following C# disassembly (credit source: [sharplab.io](sharplab.io)) code from the compiler: + +```csharp +public class C +{ + [CompilerGenerated] + private sealed class <>c__DisplayClass0_0 + { + public string s1; + + internal string b__0(string str) + { + return s1; + } + } + + [CompilerGenerated] + private sealed class <>c__DisplayClass0_1 + { + public string s2; + + internal string b__1(string str) + { + return s2; + } + } + + public void M() + { + <>c__DisplayClass0_0 <>c__DisplayClass0_ = new <>c__DisplayClass0_0(); + <>c__DisplayClass0_.s1 = "s1"; + Func func = new Func(<>c__DisplayClass0_.b__0); + <>c__DisplayClass0_1 <>c__DisplayClass0_2 = new <>c__DisplayClass0_1(); + <>c__DisplayClass0_2.s2 = "s2"; + Func func2 = new Func(<>c__DisplayClass0_2.b__1); + } +} +``` + +Here we see that `func` and `func2` no longer share a closure and have their own separate closures `<>c__DisplayClass0_0` and `<>c__DisplayClass0_1` respectively. When used as the target for serialization, nothing other than the referenced variables will get serialized for the delegate. + +This above behavior is important to keep in mind while implementing multiple UDFs in a common scope. +To learn more about UDFs in general, please review the following articles that explain UDFs and how to use them: [UDFs in databricks(scala)](https://docs.databricks.com/spark/latest/spark-sql/udf-scala.html), [Spark UDFs and some gotchas](https://medium.com/@achilleus/spark-udfs-we-can-use-them-but-should-we-use-them-2c5a561fde6d). \ No newline at end of file From 4ef693dbf7616b738a6ae70d1e9dc8c12dd8e5d3 Mon Sep 17 00:00:00 2001 From: Niharika Dutta Date: Sun, 19 Apr 2020 22:32:56 -0700 Subject: [PATCH 02/20] removing guides from master --- docs/broadcast-guide.md | 92 --------------------- docs/udf-guide.md | 172 ---------------------------------------- 2 files changed, 264 deletions(-) delete mode 100644 docs/broadcast-guide.md delete mode 100644 docs/udf-guide.md diff --git a/docs/broadcast-guide.md b/docs/broadcast-guide.md deleted file mode 100644 index 4286c569e..000000000 --- a/docs/broadcast-guide.md +++ /dev/null @@ -1,92 +0,0 @@ -# Guide to using Broadcast Variables - -This is a guide to show how to use broadcast variables in .NET for Apache Spark. - -## What are Broadcast Variables - -[Broadcast variables in Apache Spark](https://spark.apache.org/docs/2.2.0/rdd-programming-guide.html#broadcast-variables) are a mechanism for sharing variables across executors that are meant to be read-only. They allow the programmer to keep a read-only variable cached on each machine rather than shipping a copy of it with tasks. They can be used, for example, to give every node a copy of a large input dataset in an efficient manner. - -### How to use broadcast variables in .NET for Apache Spark - -Broadcast variables are created from a variable `v` by calling `SparkContext.Broadcast(v)`. 
The broadcast variable is a wrapper around `v`, and its value can be accessed by calling the `Value()` method on it. - -Example: - -```csharp -string v = "Variable to be broadcasted"; -Broadcast bv = SparkContext.Broadcast(v); - -// Using the broadcast variable in a UDF: -Func udf = Udf( - str => $"{str}: {bv.Value()}"); -``` - -The type of broadcast variable is captured by using Generics in C#, as can be seen in the above example. - -### Deleting broadcast variables - -The broadcast variable can be deleted from all executors by calling the `Destroy()` function on it. - -```csharp -// Destroying the broadcast variable bv: -bv.Destroy(); -``` - -> Note: `Destroy` deletes all data and metadata related to the broadcast variable. Use this with caution- once a broadcast variable has been destroyed, it cannot be used again. - -#### Caveat of using Destroy - -One important thing to keep in mind while using broadcast variables in UDFs is to limit the scope of the variable to only the UDF that is referencing it. The [guide to using UDFs](udf-guide.md) describes this phenomenon in detail. This is especially crucial when calling `Destroy` on the broadcast variable. If the broadcast variable that has been destroyed is visible to or accessible from other UDFs, it gets picked up for serialization by all those UDFs, even if it is not being referenced by them. This will throw an error as .NET for Apache Spark is not able to serialize the destroyed broadcast variable. - -Example to demonstrate: - -```csharp -string v = "Variable to be broadcasted"; -Broadcast bv = SparkContext.Broadcast(v); - -// Using the broadcast variable in a UDF: -Func udf1 = Udf( - str => $"{str}: {bv.Value()}"); - -// Destroying bv -bv.Destroy(); - -// Calling udf1 after destroying bv throws the following expected exception: -// org.apache.spark.SparkException: Attempted to use Broadcast(0) after it was destroyed -df.Select(udf1(df["_1"])).Show(); - -// Different UDF udf2 that is not referencing bv -Func udf2 = Udf( - str => $"{str}: not referencing broadcast variable"); - -// Calling udf2 throws the following (unexpected) exception: -// [Error] [JvmBridge] org.apache.spark.SparkException: Task not serializable -df.Select(udf2(df["_1"])).Show(); -``` - -The recommended way of implementing above desired behavior: - -```csharp -string v = "Variable to be broadcasted"; -// Restricting the visibility of bv to only the UDF referencing it -{ - Broadcast bv = SparkContext.Broadcast(v); - - // Using the broadcast variable in a UDF: - Func udf1 = Udf( - str => $"{str}: {bv.Value()}"); - - // Destroying bv - bv.Destroy(); -} - -// Different UDF udf2 that is not referencing bv -Func udf2 = Udf( - str => $"{str}: not referencing broadcast variable"); - -// Calling udf2 works fine as expected -df.Select(udf2(df["_1"])).Show(); -``` - This ensures that destroying `bv` doesn't affect calling `udf2` because of unexpected serialization behavior. - - Broadcast variables are very useful for transmitting read-only data to all executors, as the data is sent only once and this gives huge performance benefits when compared with using local variables that get shipped to the executors with each task. Please refer to the [official documentation](https://spark.apache.org/docs/2.2.0/rdd-programming-guide.html#broadcast-variables) to get a deeper understanding of broadcast variables and why they are used. 
\ No newline at end of file diff --git a/docs/udf-guide.md b/docs/udf-guide.md deleted file mode 100644 index bb308815d..000000000 --- a/docs/udf-guide.md +++ /dev/null @@ -1,172 +0,0 @@ -# Guide to User-Defined Functions (UDFs) - -This is a guide to show how to use UDFs in .NET for Apache Spark. - -## What are UDFs - -[User-Defined Functions (UDFs)](https://spark.apache.org/docs/latest/api/java/org/apache/spark/sql/expressions/UserDefinedFunction.html) are a feature of Spark that allow developers to use custom functions to extend the system's built-in functionality. They transform values from a single row within a table to produce a single corresponding output value per row based on the logic defined in the UDF. - -Let's take the following as an example for a UDF definition: - -```csharp -string s1 = "hello"; -Func udf = Udf( - str => $"{s1} {str}"); - -``` -The above defined UDF takes a `string` as an input (in the form of a [Column](https://github.com/dotnet/spark/blob/master/src/csharp/Microsoft.Spark/Sql/Column.cs#L14) of a [Dataframe](https://github.com/dotnet/spark/blob/master/src/csharp/Microsoft.Spark/Sql/DataFrame.cs#L24)), and returns a `string` with `hello` appended in front of the input. - -For a sample Dataframe, let's take the following Dataframe `df`: - -```text -+-------+ -| name| -+-------+ -|Michael| -| Andy| -| Justin| -+-------+ -``` - -Now let's apply the above defined `udf` to the dataframe `df`: - -```csharp -DataFrame udfResult = df.Select(udf(df["name"])); -``` - -This would return the below as the Dataframe `udfResult`: - -```text -+-------------+ -| name| -+-------------+ -|hello Michael| -| hello Andy| -| hello Justin| -+-------------+ -``` -To get a better understanding of how to implement UDFs, please take a look at the [UDF helper functions](https://github.com/dotnet/spark/blob/master/src/csharp/Microsoft.Spark/Sql/Functions.cs#L3616) and some [test examples](https://github.com/dotnet/spark/blob/master/src/csharp/Microsoft.Spark.E2ETest/UdfTests/UdfSimpleTypesTests.cs#L49). - -## UDF serialization - -Since UDFs are functions that need to be executed on the workers, they have to be serialized and sent to the workers as part of the payload from the driver. This involves serializing the [delegate](https://docs.microsoft.com/en-us/dotnet/csharp/programming-guide/delegates/) which is a reference to the method, along with its [target](https://docs.microsoft.com/en-us/dotnet/api/system.delegate.target?view=netframework-4.8) which is the class instance on which the current delegate invokes the instance method. Please take a look at this [code](https://github.com/dotnet/spark/blob/master/src/csharp/Microsoft.Spark/Utils/CommandSerDe.cs#L149) to get a better understanding of how UDF serialization is being done. - -## Good to know while implementing UDFs - -One behavior to be aware of while implementing UDFs in .NET for Apache Spark is how the target of the UDF gets serialized. .NET for Apache Spark uses .NET Core, which does not support serializing delegates, so it is instead done by using reflection to serialize the target where the delegate is defined. When multiple delegates are defined in a common scope, they have a shared closure that becomes the target of reflection for serialization. Let's take an example to illustrate what that means. 
- -The following code snippet defines two string variables that are being referenced in two function delegates, that just return the respective strings as result: - -```csharp -using System; - -public class C { - public void M() { - string s1 = "s1"; - string s2 = "s2"; - Func a = str => s1; - Func b = str => s2; - } -} -``` - -The above C# code generates the following C# disassembly (credit source: [sharplab.io](sharplab.io)) code from the compiler: - -```csharp -public class C -{ - [CompilerGenerated] - private sealed class <>c__DisplayClass0_0 - { - public string s1; - - public string s2; - - internal string b__0(string str) - { - return s1; - } - - internal string b__1(string str) - { - return s2; - } - } - - public void M() - { - <>c__DisplayClass0_0 <>c__DisplayClass0_ = new <>c__DisplayClass0_0(); - <>c__DisplayClass0_.s1 = "s1"; - <>c__DisplayClass0_.s2 = "s2"; - Func func = new Func(<>c__DisplayClass0_.b__0); - Func func2 = new Func(<>c__DisplayClass0_.b__1); - } -} -``` -As can be seen in the above IL code, both `func` and `func2` share the same closure `<>c__DisplayClass0_0`, which is the target that is serialized when serializing the delegates `func` and `func2`. Hence, even though `Func a` is only referencing `s1`, `s2` also gets serialized when sending over the bytes to the workers. - -This can lead to some unexpected behaviors at runtime (like in the case of using [broadcast variables](broadcast-guide.md)), which is why we recommend restricting the visibility of the variables used in a function to that function's scope. -Taking the above example to better explain what that means: - -Recommended user code to implement desired behavior of previous code snippet: - -```csharp -using System; - -public class C { - public void M() { - { - string s1 = "s1"; - Func a = str => s1; - } - { - string s2 = "s2"; - Func b = str => s2; - } - } -} -``` - -The above C# code generates the following C# disassembly (credit source: [sharplab.io](sharplab.io)) code from the compiler: - -```csharp -public class C -{ - [CompilerGenerated] - private sealed class <>c__DisplayClass0_0 - { - public string s1; - - internal string b__0(string str) - { - return s1; - } - } - - [CompilerGenerated] - private sealed class <>c__DisplayClass0_1 - { - public string s2; - - internal string b__1(string str) - { - return s2; - } - } - - public void M() - { - <>c__DisplayClass0_0 <>c__DisplayClass0_ = new <>c__DisplayClass0_0(); - <>c__DisplayClass0_.s1 = "s1"; - Func func = new Func(<>c__DisplayClass0_.b__0); - <>c__DisplayClass0_1 <>c__DisplayClass0_2 = new <>c__DisplayClass0_1(); - <>c__DisplayClass0_2.s2 = "s2"; - Func func2 = new Func(<>c__DisplayClass0_2.b__1); - } -} -``` - -Here we see that `func` and `func2` no longer share a closure and have their own separate closures `<>c__DisplayClass0_0` and `<>c__DisplayClass0_1` respectively. When used as the target for serialization, nothing other than the referenced variables will get serialized for the delegate. - -This above behavior is important to keep in mind while implementing multiple UDFs in a common scope. -To learn more about UDFs in general, please review the following articles that explain UDFs and how to use them: [UDFs in databricks(scala)](https://docs.databricks.com/spark/latest/spark-sql/udf-scala.html), [Spark UDFs and some gotchas](https://medium.com/@achilleus/spark-udfs-we-can-use-them-but-should-we-use-them-2c5a561fde6d). 
\ No newline at end of file From 6bab99604db5cc8b8528b54216085afb96cbaff7 Mon Sep 17 00:00:00 2001 From: GOEddieUK Date: Mon, 27 Jul 2020 21:10:51 +0100 Subject: [PATCH 03/20] CountVectorizer --- .../ML/Feature/CountVectorizerModelTests.cs | 73 +++++++ .../ML/Feature/CountVectorizerTests.cs | 70 +++++++ .../ML/Feature/CountVectorizer.cs | 195 ++++++++++++++++++ .../ML/Feature/CountVectorizerModel.cs | 170 +++++++++++++++ 4 files changed, 508 insertions(+) create mode 100644 src/csharp/Microsoft.Spark.E2ETest/IpcTests/ML/Feature/CountVectorizerModelTests.cs create mode 100644 src/csharp/Microsoft.Spark.E2ETest/IpcTests/ML/Feature/CountVectorizerTests.cs create mode 100644 src/csharp/Microsoft.Spark/ML/Feature/CountVectorizer.cs create mode 100644 src/csharp/Microsoft.Spark/ML/Feature/CountVectorizerModel.cs diff --git a/src/csharp/Microsoft.Spark.E2ETest/IpcTests/ML/Feature/CountVectorizerModelTests.cs b/src/csharp/Microsoft.Spark.E2ETest/IpcTests/ML/Feature/CountVectorizerModelTests.cs new file mode 100644 index 000000000..3c3132dd9 --- /dev/null +++ b/src/csharp/Microsoft.Spark.E2ETest/IpcTests/ML/Feature/CountVectorizerModelTests.cs @@ -0,0 +1,73 @@ +// Licensed to the .NET Foundation under one or more agreements. +// The .NET Foundation licenses this file to you under the MIT license. +// See the LICENSE file in the project root for more information. + +using System; +using System.Collections.Generic; +using System.IO; +using Microsoft.Spark.ML.Feature; +using Microsoft.Spark.Sql; +using Microsoft.Spark.UnitTest.TestUtils; +using Xunit; + +namespace Microsoft.Spark.E2ETest.IpcTests.ML.Feature +{ + [Collection("Spark E2E Tests")] + public class CountVectorizerModelTests + { + private readonly SparkSession _spark; + + public CountVectorizerModelTests(SparkFixture fixture) + { + _spark = fixture.Spark; + } + + [Fact] + public void Test_CountVectorizerModel() + { + DataFrame input = _spark.Sql("SELECT array('hello', 'I', 'AM', 'a', 'string', 'TO', " + + "'TOKENIZE') as input from range(100)"); + + const string inputColumn = "input"; + const string outputColumn = "output"; + const double minTf = 10.0; + const bool binary = false; + + List vocabulary = new List() + { + "hello", + "I", + "AM", + "TO", + "TOKENIZE" + }; + + var countVectorizerModel = new CountVectorizerModel(vocabulary); + + Assert.IsType(new CountVectorizerModel("my-uid", vocabulary)); + + countVectorizerModel = countVectorizerModel + .SetInputCol(inputColumn) + .SetOutputCol(outputColumn) + .SetMinTF(minTf) + .SetBinary(binary); + + Assert.Equal(inputColumn, countVectorizerModel.GetInputCol()); + Assert.Equal(outputColumn, countVectorizerModel.GetOutputCol()); + Assert.Equal(minTf, countVectorizerModel.GetMinTF()); + Assert.Equal(binary, countVectorizerModel.GetBinary()); + using (var tempDirectory = new TemporaryDirectory()) + { + string savePath = Path.Join(tempDirectory.Path, "countVectorizerModel"); + countVectorizerModel.Save(savePath); + + CountVectorizerModel loadedModel = CountVectorizerModel.Load(savePath); + Assert.Equal(countVectorizerModel.Uid(), loadedModel.Uid()); + } + + Assert.IsType(countVectorizerModel.GetVocabSize()); + Assert.NotEmpty(countVectorizerModel.ExplainParams()); + Assert.NotEmpty(countVectorizerModel.ToString()); + } + } +} diff --git a/src/csharp/Microsoft.Spark.E2ETest/IpcTests/ML/Feature/CountVectorizerTests.cs b/src/csharp/Microsoft.Spark.E2ETest/IpcTests/ML/Feature/CountVectorizerTests.cs new file mode 100644 index 000000000..d54bfe376 --- /dev/null +++ 
b/src/csharp/Microsoft.Spark.E2ETest/IpcTests/ML/Feature/CountVectorizerTests.cs @@ -0,0 +1,70 @@ +// Licensed to the .NET Foundation under one or more agreements. +// The .NET Foundation licenses this file to you under the MIT license. +// See the LICENSE file in the project root for more information. + +using System; +using System.IO; +using Microsoft.Spark.ML.Feature; +using Microsoft.Spark.Sql; +using Microsoft.Spark.UnitTest.TestUtils; +using Xunit; + +namespace Microsoft.Spark.E2ETest.IpcTests.ML.Feature +{ + [Collection("Spark E2E Tests")] + public class CountVectorizerTests + { + private readonly SparkSession _spark; + + public CountVectorizerTests(SparkFixture fixture) + { + _spark = fixture.Spark; + } + + [Fact] + public void Test_CountVectorizer() + { + DataFrame input = _spark.Sql("SELECT array('hello', 'I', 'AM', 'a', 'string', 'TO', " + + "'TOKENIZE') as input from range(100)"); + + const string inputColumn = "input"; + const string outputColumn = "output"; + const double minDf = 1; + const double maxDf = 100; + const double minTf = 10; + const int vocabSize = 10000; + const bool binary = false; + + var countVectorizer = new CountVectorizer(); + + countVectorizer + .SetInputCol(inputColumn) + .SetOutputCol(outputColumn) + .SetMinDF(minDf) + .SetMaxDF(maxDf) + .SetMinTF(minTf) + .SetVocabSize(vocabSize); + + Assert.IsType(countVectorizer.Fit(input)); + Assert.Equal(inputColumn, countVectorizer.GetInputCol()); + Assert.Equal(outputColumn, countVectorizer.GetOutputCol()); + Assert.Equal(minDf, countVectorizer.GetMinDF()); + Assert.Equal(maxDf, countVectorizer.GetMaxDF()); + Assert.Equal(minTf, countVectorizer.GetMinTF()); + Assert.Equal(vocabSize, countVectorizer.GetVocabSize()); + Assert.Equal(binary, countVectorizer.GetBinary()); + + using (var tempDirectory = new TemporaryDirectory()) + { + string savePath = Path.Join(tempDirectory.Path, "countVectorizer"); + countVectorizer.Save(savePath); + + CountVectorizer loadedVectorizer = CountVectorizer.Load(savePath); + Assert.Equal(countVectorizer.Uid(), loadedVectorizer.Uid()); + } + + Assert.NotEmpty(countVectorizer.ExplainParams()); + Assert.NotEmpty(countVectorizer.ToString()); + } + } +} diff --git a/src/csharp/Microsoft.Spark/ML/Feature/CountVectorizer.cs b/src/csharp/Microsoft.Spark/ML/Feature/CountVectorizer.cs new file mode 100644 index 000000000..41e0dbdd0 --- /dev/null +++ b/src/csharp/Microsoft.Spark/ML/Feature/CountVectorizer.cs @@ -0,0 +1,195 @@ +// Licensed to the .NET Foundation under one or more agreements. +// The .NET Foundation licenses this file to you under the MIT license. +// See the LICENSE file in the project root for more information. + +using Microsoft.Spark.Interop; +using Microsoft.Spark.Interop.Ipc; +using Microsoft.Spark.Sql; + +namespace Microsoft.Spark.ML.Feature +{ + public class CountVectorizer : FeatureBase, IJvmObjectReferenceProvider + { + private static readonly string s_countVectorizerClassName = + "org.apache.spark.ml.feature.CountVectorizer"; + + /// + /// Create a without any parameters + /// + public CountVectorizer() : base(s_countVectorizerClassName) + { + } + + /// + /// Create a with a UID that is used to give the + /// a unique ID + /// + /// An immutable unique ID for the object and its derivatives. + public CountVectorizer(string uid) : base(s_countVectorizerClassName, uid) + { + } + + internal CountVectorizer(JvmObjectReference jvmObject) : base(jvmObject) + { + } + + JvmObjectReference IJvmObjectReferenceProvider.Reference => _jvmObject; + + /// Fits a model to the input data. 
+ /// The to fit the model to. + /// + public CountVectorizerModel Fit(DataFrame dataFrame) => + new CountVectorizerModel((JvmObjectReference)_jvmObject.Invoke("fit", dataFrame)); + + /// + /// Loads the that was previously saved using Save + /// + /// + /// The path the previous was saved to + /// + /// New object + public static CountVectorizer Load(string path) => + WrapAsType((JvmObjectReference) + SparkEnvironment.JvmBridge.CallStaticJavaMethod( + s_countVectorizerClassName,"load", path)); + + /// + /// Gets the binary toggle to control the output vector values. If True, all nonzero counts + /// (after minTF filter applied) are set to 1. This is useful for discrete probabilistic + /// models that model binary events rather than integer counts. Default: false + /// + /// boolean + public bool GetBinary() => (bool)_jvmObject.Invoke("getBinary"); + + /// + /// Sets the binary toggle to control the output vector values. If True, all nonzero counts + /// (after minTF filter applied) are set to 1. This is useful for discrete probabilistic + /// models that model binary events rather than integer counts. Default: false + /// + /// Turn the binary toggle on or off + /// with the new binary toggle value set + public CountVectorizer SetBinary(bool value) => + WrapAsCountVectorizer((JvmObjectReference)_jvmObject.Invoke("setBinary", value)); + + private static CountVectorizer WrapAsCountVectorizer(object obj) => + new CountVectorizer((JvmObjectReference)obj); + + /// + /// Gets the column that the should read from and convert + /// into buckets. This would have been set by SetInputCol + /// + /// string, the input column + public string GetInputCol() => _jvmObject.Invoke("getInputCol") as string; + + /// + /// Sets the column that the should read from. + /// + /// The name of the column to as the source. + /// with the input column set + public CountVectorizer SetInputCol(string value) => + WrapAsCountVectorizer((JvmObjectReference)_jvmObject.Invoke("setInputCol", value)); + + /// + /// The will create a new column in the DataFrame, this is + /// the name of the new column. + /// + /// The name of the output column. + public string GetOutputCol() => _jvmObject.Invoke("getOutputCol") as string; + + /// + /// The will create a new column in the DataFrame, this + /// is the name of the new column. + /// + /// The name of the output column which will be created. + /// New with the output column set + public CountVectorizer SetOutputCol(string value) => + WrapAsCountVectorizer((JvmObjectReference)_jvmObject.Invoke("setOutputCol", value)); + + /// + /// Gets the maximum number of different documents a term could appear in to be included in + /// the vocabulary. A term that appears more than the threshold will be ignored. If this is + /// an integer greater than or equal to 1, this specifies the maximum number of documents + /// the term could appear in; if this is a double in [0,1), then this specifies the maximum + /// fraction of documents the term could appear in. + /// + /// The maximum document term frequency + public double GetMaxDF() => (double)_jvmObject.Invoke("getMaxDF"); + + /// + /// Sets the maximum number of different documents a term could appear in to be included in + /// the vocabulary. A term that appears more than the threshold will be ignored. 
If this is + /// an integer greater than or equal to 1, this specifies the maximum number of documents + /// the term could appear in; if this is a double in [0,1), then this specifies the maximum + /// fraction of documents the term could appear in. + /// + /// The maximum document term frequency + /// New with the max df value set + public CountVectorizer SetMaxDF(double value) => + WrapAsCountVectorizer((JvmObjectReference)_jvmObject.Invoke("setMaxDF", value)); + + /// + /// Gets the minimum number of different documents a term must appear in to be included in + /// the vocabulary. If this is an integer greater than or equal to 1, this specifies the + /// number of documents the term must appear in; if this is a double in [0,1), then this + /// specifies the fraction of documents. + /// + /// The minimum document term frequency + public double GetMinDF() => (double)_jvmObject.Invoke("getMinDF"); + + /// + /// Sets the minimum number of different documents a term must appear in to be included in + /// the vocabulary. If this is an integer greater than or equal to 1, this specifies the + /// number of documents the term must appear in; if this is a double in [0,1), then this + /// specifies the fraction of documents. + /// + /// The minimum document term frequency + /// New with the min df value set + public CountVectorizer SetMinDF(double value) => + WrapAsCountVectorizer((JvmObjectReference)_jvmObject.Invoke("setMinDF", value)); + + /// + /// Filter to ignore rare words in a document. For each document, terms with + /// frequency/count less than the given threshold are ignored. If this is an integer + /// greater than or equal to 1, then this specifies a count (of times the term must appear + /// in the document); if this is a double in [0,1), then this specifies a fraction (out of + /// the document's token count). + /// + /// Note that the parameter is only used in transform of CountVectorizerModel and does not + /// affect fitting. + /// + /// Minimum term frequency + public double GetMinTF() => (double)_jvmObject.Invoke("getMinTF"); + + /// + /// Filter to ignore rare words in a document. For each document, terms with + /// frequency/count less than the given threshold are ignored. If this is an integer + /// greater than or equal to 1, then this specifies a count (of times the term must appear + /// in the document); if this is a double in [0,1), then this specifies a fraction (out of + /// the document's token count). + /// + /// Note that the parameter is only used in transform of CountVectorizerModel and does not + /// affect fitting. + /// + /// Minimum term frequency + /// New with the min term frequency set + public CountVectorizer SetMinTF(double value) => + WrapAsCountVectorizer((JvmObjectReference)_jvmObject.Invoke("setMinTF", value)); + + /// + /// Gets the max size of the vocabulary. CountVectorizer will build a vocabulary that only + /// considers the top vocabSize terms ordered by term frequency across the corpus. + /// + /// The max size of the vocabulary + public int GetVocabSize() => (int)_jvmObject.Invoke("getVocabSize"); + + /// + /// Sets the max size of the vocabulary. will build a + /// vocabulary that only considers the top vocabSize terms ordered by term frequency across + /// the corpus. 
+ /// + /// The max vocabulary size + /// with the max vocab value set + public CountVectorizer SetVocabSize(int value) => + WrapAsCountVectorizer(_jvmObject.Invoke("setVocabSize", value)); + } +} diff --git a/src/csharp/Microsoft.Spark/ML/Feature/CountVectorizerModel.cs b/src/csharp/Microsoft.Spark/ML/Feature/CountVectorizerModel.cs new file mode 100644 index 000000000..8a6e427df --- /dev/null +++ b/src/csharp/Microsoft.Spark/ML/Feature/CountVectorizerModel.cs @@ -0,0 +1,170 @@ +// Licensed to the .NET Foundation under one or more agreements. +// The .NET Foundation licenses this file to you under the MIT license. +// See the LICENSE file in the project root for more information. + +using System.Collections.Generic; +using Microsoft.Spark.Interop; +using Microsoft.Spark.Interop.Ipc; + +namespace Microsoft.Spark.ML.Feature +{ + public class CountVectorizerModel : FeatureBase + , IJvmObjectReferenceProvider + { + private static readonly string s_countVectorizerModelClassName = + "org.apache.spark.ml.feature.CountVectorizerModel"; + + /// + /// Create a without any parameters + /// + /// The vocabulary to use + public CountVectorizerModel(List vocabulary) : + this(SparkEnvironment.JvmBridge.CallConstructor( + s_countVectorizerModelClassName, vocabulary)) + { + } + + /// + /// Create a with a UID that is used to give the + /// a unique ID + /// + /// An immutable unique ID for the object and its derivatives. + /// The vocabulary to use + public CountVectorizerModel(string uid, List vocabulary) : + this(SparkEnvironment.JvmBridge.CallConstructor( + s_countVectorizerModelClassName, uid, vocabulary)) + { + } + + internal CountVectorizerModel(JvmObjectReference jvmObject) : base(jvmObject) + { + } + + JvmObjectReference IJvmObjectReferenceProvider.Reference => _jvmObject; + + /// + /// Loads the that was previously saved using Save + /// + /// + /// The path the previous was saved to + /// + /// New object + public static CountVectorizerModel Load(string path) => + WrapAsType((JvmObjectReference) + SparkEnvironment.JvmBridge.CallStaticJavaMethod( + s_countVectorizerModelClassName,"load", path)); + + /// + /// Gets the binary toggle to control the output vector values. If True, all nonzero counts + /// (after minTF filter applied) are set to 1. This is useful for discrete probabilistic + /// models that model binary events rather than integer counts. Default: false + /// + /// boolean + public bool GetBinary() => (bool)_jvmObject.Invoke("getBinary"); + + /// + /// Sets the binary toggle to control the output vector values. If True, all nonzero counts + /// (after minTF filter applied) are set to 1. This is useful for discrete probabilistic + /// models that model binary events rather than integer counts. Default: false + /// + /// Turn the binary toggle on or off + /// + /// with the new binary toggle value set + /// + public CountVectorizerModel SetBinary(bool value) => + WrapAsCountVectorizerModel((JvmObjectReference)_jvmObject.Invoke("setBinary", value)); + + private static CountVectorizerModel WrapAsCountVectorizerModel(object obj) => + new CountVectorizerModel((JvmObjectReference)obj); + + /// + /// Gets the column that the should read from and + /// convert into buckets. This would have been set by SetInputCol + /// + /// string, the input column + public string GetInputCol() => _jvmObject.Invoke("getInputCol") as string; + + /// + /// Sets the column that the should read from. + /// + /// The name of the column to as the source. 
+ /// with the input column set + public CountVectorizerModel SetInputCol(string value) => + WrapAsCountVectorizerModel( + (JvmObjectReference)_jvmObject.Invoke("setInputCol", value)); + + /// + /// The will create a new column in the DataFrame, this + /// is the name of the new column. + /// + /// The name of the output column. + public string GetOutputCol() => _jvmObject.Invoke("getOutputCol") as string; + + /// + /// The will create a new column in the DataFrame, + /// this is the name of the new column. + /// + /// The name of the output column which will be created. + /// New with the output column set + public CountVectorizerModel SetOutputCol(string value) => + WrapAsCountVectorizerModel( + (JvmObjectReference)_jvmObject.Invoke("setOutputCol", value)); + + /// + /// Gets the maximum number of different documents a term could appear in to be included in + /// the vocabulary. A term that appears more than the threshold will be ignored. If this is + /// an integer greater than or equal to 1, this specifies the maximum number of documents + /// the term could appear in; if this is a double in [0,1), then this specifies the maximum + /// fraction of documents the term could appear in. + /// + /// The maximum document term frequency + public double GetMaxDF() => (double)_jvmObject.Invoke("getMaxDF"); + + /// + /// Gets the minimum number of different documents a term must appear in to be included in + /// the vocabulary. If this is an integer greater than or equal to 1, this specifies the + /// number of documents the term must appear in; if this is a double in [0,1), then this + /// specifies the fraction of documents. + /// + /// The minimum document term frequency + public double GetMinDF() => (double)_jvmObject.Invoke("getMinDF"); + + /// + /// Filter to ignore rare words in a document. For each document, terms with + /// frequency/count less than the given threshold are ignored. If this is an integer + /// greater than or equal to 1, then this specifies a count (of times the term must appear + /// in the document); if this is a double in [0,1), then this specifies a fraction (out of + /// the document's token count). + /// + /// Note that the parameter is only used in transform of CountVectorizerModel and does not + /// affect fitting. + /// + /// Minimum term frequency + public double GetMinTF() => (double)_jvmObject.Invoke("getMinTF"); + + /// + /// Filter to ignore rare words in a document. For each document, terms with + /// frequency/count less than the given threshold are ignored. If this is an integer + /// greater than or equal to 1, then this specifies a count (of times the term must appear + /// in the document); if this is a double in [0,1), then this specifies a fraction (out of + /// the document's token count). + /// + /// Note that the parameter is only used in transform of CountVectorizerModel and does not + /// affect fitting. + /// + /// Minimum term frequency + /// + /// New with the min term frequency set + /// + public CountVectorizerModel SetMinTF(double value) => + WrapAsCountVectorizerModel((JvmObjectReference)_jvmObject.Invoke("setMinTF", value)); + + /// + /// Gets the max size of the vocabulary. will build a + /// vocabulary that only considers the top vocabSize terms ordered by term frequency across + /// the corpus. 
+ /// + /// The max size of the vocabulary + public int GetVocabSize() => (int)_jvmObject.Invoke("getVocabSize"); + } +} From e2a566b1f4b29775be9b57616a258802e294f304 Mon Sep 17 00:00:00 2001 From: GOEddieUK Date: Mon, 27 Jul 2020 21:24:35 +0100 Subject: [PATCH 04/20] moving private methods to bottom --- src/csharp/Microsoft.Spark/ML/Feature/CountVectorizer.cs | 6 +++--- .../Microsoft.Spark/ML/Feature/CountVectorizerModel.cs | 6 +++--- 2 files changed, 6 insertions(+), 6 deletions(-) diff --git a/src/csharp/Microsoft.Spark/ML/Feature/CountVectorizer.cs b/src/csharp/Microsoft.Spark/ML/Feature/CountVectorizer.cs index 41e0dbdd0..cf68f7c4a 100644 --- a/src/csharp/Microsoft.Spark/ML/Feature/CountVectorizer.cs +++ b/src/csharp/Microsoft.Spark/ML/Feature/CountVectorizer.cs @@ -71,9 +71,6 @@ public static CountVectorizer Load(string path) => public CountVectorizer SetBinary(bool value) => WrapAsCountVectorizer((JvmObjectReference)_jvmObject.Invoke("setBinary", value)); - private static CountVectorizer WrapAsCountVectorizer(object obj) => - new CountVectorizer((JvmObjectReference)obj); - /// /// Gets the column that the should read from and convert /// into buckets. This would have been set by SetInputCol @@ -191,5 +188,8 @@ public CountVectorizer SetMinTF(double value) => /// with the max vocab value set public CountVectorizer SetVocabSize(int value) => WrapAsCountVectorizer(_jvmObject.Invoke("setVocabSize", value)); + + private static CountVectorizer WrapAsCountVectorizer(object obj) => + new CountVectorizer((JvmObjectReference)obj); } } diff --git a/src/csharp/Microsoft.Spark/ML/Feature/CountVectorizerModel.cs b/src/csharp/Microsoft.Spark/ML/Feature/CountVectorizerModel.cs index 8a6e427df..8e225a179 100644 --- a/src/csharp/Microsoft.Spark/ML/Feature/CountVectorizerModel.cs +++ b/src/csharp/Microsoft.Spark/ML/Feature/CountVectorizerModel.cs @@ -74,9 +74,6 @@ public static CountVectorizerModel Load(string path) => public CountVectorizerModel SetBinary(bool value) => WrapAsCountVectorizerModel((JvmObjectReference)_jvmObject.Invoke("setBinary", value)); - private static CountVectorizerModel WrapAsCountVectorizerModel(object obj) => - new CountVectorizerModel((JvmObjectReference)obj); - /// /// Gets the column that the should read from and /// convert into buckets. 
This would have been set by SetInputCol @@ -166,5 +163,8 @@ public CountVectorizerModel SetMinTF(double value) => /// /// The max size of the vocabulary public int GetVocabSize() => (int)_jvmObject.Invoke("getVocabSize"); + + private static CountVectorizerModel WrapAsCountVectorizerModel(object obj) => + new CountVectorizerModel((JvmObjectReference)obj); } } From 5f682a601ec783f1609e6fd6e32c4d83ff1491d1 Mon Sep 17 00:00:00 2001 From: GOEddieUK Date: Tue, 28 Jul 2020 20:47:31 +0100 Subject: [PATCH 05/20] changing wrap method --- src/csharp/Microsoft.Spark/ML/Feature/CountVectorizer.cs | 2 +- src/csharp/Microsoft.Spark/ML/Feature/CountVectorizerModel.cs | 2 +- 2 files changed, 2 insertions(+), 2 deletions(-) diff --git a/src/csharp/Microsoft.Spark/ML/Feature/CountVectorizer.cs b/src/csharp/Microsoft.Spark/ML/Feature/CountVectorizer.cs index cf68f7c4a..b3fa0ef8a 100644 --- a/src/csharp/Microsoft.Spark/ML/Feature/CountVectorizer.cs +++ b/src/csharp/Microsoft.Spark/ML/Feature/CountVectorizer.cs @@ -49,7 +49,7 @@ public CountVectorizerModel Fit(DataFrame dataFrame) => /// /// New object public static CountVectorizer Load(string path) => - WrapAsType((JvmObjectReference) + WrapAsCountVectorizer((JvmObjectReference) SparkEnvironment.JvmBridge.CallStaticJavaMethod( s_countVectorizerClassName,"load", path)); diff --git a/src/csharp/Microsoft.Spark/ML/Feature/CountVectorizerModel.cs b/src/csharp/Microsoft.Spark/ML/Feature/CountVectorizerModel.cs index 8e225a179..52bbd72c3 100644 --- a/src/csharp/Microsoft.Spark/ML/Feature/CountVectorizerModel.cs +++ b/src/csharp/Microsoft.Spark/ML/Feature/CountVectorizerModel.cs @@ -50,7 +50,7 @@ internal CountVectorizerModel(JvmObjectReference jvmObject) : base(jvmObject) /// /// New object public static CountVectorizerModel Load(string path) => - WrapAsType((JvmObjectReference) + WrapAsCountVectorizerModel((JvmObjectReference) SparkEnvironment.JvmBridge.CallStaticJavaMethod( s_countVectorizerModelClassName,"load", path)); From 31371db73b4faa653c07fdb8082e7aed02c0a031 Mon Sep 17 00:00:00 2001 From: GOEddieUK Date: Fri, 31 Jul 2020 18:45:46 +0100 Subject: [PATCH 06/20] setting min version required --- .../IpcTests/ML/Feature/CountVectorizerTests.cs | 14 ++++++++++---- .../Microsoft.Spark/ML/Feature/CountVectorizer.cs | 2 ++ .../Microsoft.Spark/ML/Feature/FeatureBase.cs | 3 ++- src/csharp/Microsoft.Spark/Microsoft.Spark.csproj | 5 +---- 4 files changed, 15 insertions(+), 9 deletions(-) diff --git a/src/csharp/Microsoft.Spark.E2ETest/IpcTests/ML/Feature/CountVectorizerTests.cs b/src/csharp/Microsoft.Spark.E2ETest/IpcTests/ML/Feature/CountVectorizerTests.cs index d54bfe376..95b9bc504 100644 --- a/src/csharp/Microsoft.Spark.E2ETest/IpcTests/ML/Feature/CountVectorizerTests.cs +++ b/src/csharp/Microsoft.Spark.E2ETest/IpcTests/ML/Feature/CountVectorizerTests.cs @@ -4,6 +4,7 @@ using System; using System.IO; +using Microsoft.Spark.E2ETest.Utils; using Microsoft.Spark.ML.Feature; using Microsoft.Spark.Sql; using Microsoft.Spark.UnitTest.TestUtils; @@ -30,7 +31,6 @@ public void Test_CountVectorizer() const string inputColumn = "input"; const string outputColumn = "output"; const double minDf = 1; - const double maxDf = 100; const double minTf = 10; const int vocabSize = 10000; const bool binary = false; @@ -41,7 +41,6 @@ public void Test_CountVectorizer() .SetInputCol(inputColumn) .SetOutputCol(outputColumn) .SetMinDF(minDf) - .SetMaxDF(maxDf) .SetMinTF(minTf) .SetVocabSize(vocabSize); @@ -49,7 +48,6 @@ public void Test_CountVectorizer() Assert.Equal(inputColumn, 
countVectorizer.GetInputCol()); Assert.Equal(outputColumn, countVectorizer.GetOutputCol()); Assert.Equal(minDf, countVectorizer.GetMinDF()); - Assert.Equal(maxDf, countVectorizer.GetMaxDF()); Assert.Equal(minTf, countVectorizer.GetMinTF()); Assert.Equal(vocabSize, countVectorizer.GetVocabSize()); Assert.Equal(binary, countVectorizer.GetBinary()); @@ -65,6 +63,14 @@ public void Test_CountVectorizer() Assert.NotEmpty(countVectorizer.ExplainParams()); Assert.NotEmpty(countVectorizer.ToString()); - } + } + + [SkipIfSparkVersionIsLessThan(Versions.V2_4_0)] + public void CountVectorizer_MaxDF() + { + const double maxDf = 100; + CountVectorizer countVectorizer = new CountVectorizer().SetMaxDF(maxDf); + Assert.Equal(maxDf, countVectorizer.GetMaxDF()); + } } } diff --git a/src/csharp/Microsoft.Spark/ML/Feature/CountVectorizer.cs b/src/csharp/Microsoft.Spark/ML/Feature/CountVectorizer.cs index b3fa0ef8a..5689e19fd 100644 --- a/src/csharp/Microsoft.Spark/ML/Feature/CountVectorizer.cs +++ b/src/csharp/Microsoft.Spark/ML/Feature/CountVectorizer.cs @@ -110,6 +110,7 @@ public CountVectorizer SetOutputCol(string value) => /// fraction of documents the term could appear in. /// /// The maximum document term frequency + [Since(Versions.V2_4_0)] public double GetMaxDF() => (double)_jvmObject.Invoke("getMaxDF"); /// @@ -121,6 +122,7 @@ public CountVectorizer SetOutputCol(string value) => /// /// The maximum document term frequency /// New with the max df value set + [Since(Versions.V2_4_0)] public CountVectorizer SetMaxDF(double value) => WrapAsCountVectorizer((JvmObjectReference)_jvmObject.Invoke("setMaxDF", value)); diff --git a/src/csharp/Microsoft.Spark/ML/Feature/FeatureBase.cs b/src/csharp/Microsoft.Spark/ML/Feature/FeatureBase.cs index fcc90b43d..0895dace1 100644 --- a/src/csharp/Microsoft.Spark/ML/Feature/FeatureBase.cs +++ b/src/csharp/Microsoft.Spark/ML/Feature/FeatureBase.cs @@ -98,7 +98,7 @@ public Param.Param GetParam(string paramName) => public T Set(Param.Param param, object value) => WrapAsType((JvmObjectReference)_jvmObject.Invoke("set", param, value)); - private static T WrapAsType(JvmObjectReference reference) + internal static T WrapAsType(JvmObjectReference reference) { ConstructorInfo constructor = typeof(T) .GetConstructors(BindingFlags.NonPublic | BindingFlags.Instance) @@ -111,5 +111,6 @@ private static T WrapAsType(JvmObjectReference reference) return (T)constructor.Invoke(new object[] {reference}); } + } } diff --git a/src/csharp/Microsoft.Spark/Microsoft.Spark.csproj b/src/csharp/Microsoft.Spark/Microsoft.Spark.csproj index 2cddc5627..f284de8c6 100644 --- a/src/csharp/Microsoft.Spark/Microsoft.Spark.csproj +++ b/src/csharp/Microsoft.Spark/Microsoft.Spark.csproj @@ -38,10 +38,7 @@ - + From 60eb82f40ac37c553ca00a3ab4d0e404e4447dca Mon Sep 17 00:00:00 2001 From: GOEddieUK Date: Fri, 31 Jul 2020 19:52:23 +0100 Subject: [PATCH 07/20] undoing csproj change --- .ionide/symbolCache.db | Bin 28672 -> 0 bytes .../Microsoft.Spark/Microsoft.Spark.csproj | 5 ++++- 2 files changed, 4 insertions(+), 1 deletion(-) delete mode 100644 .ionide/symbolCache.db diff --git a/.ionide/symbolCache.db b/.ionide/symbolCache.db deleted file mode 100644 index 43e567d6d682d85dd32b3baebb0fdf61f67c1643..0000000000000000000000000000000000000000 GIT binary patch literal 0 HcmV?d00001 literal 28672 zcmeHPYiuJ|6}A(f3n*{Ao>?UI&RXX3biC7$u_ zru*Uw_|aboAq4!aPz(G7Dp3^)2^9$msX|362qA<70{*l}EvWneB<{TIImtK)+7qdu zu{?L~_%Yvi&pr3fz2}^JGgp@L0vB1UR7NW^3^wa~*(5A|iH8H;*B 
zsF<0ra;!kxqpY{^imti@RPfs8^z`%@jfgc_8rKLgN6`#2wD@X-UA_4A%qCGc(UU6~ zuc#A&3DvB(O|fV<3^&xn+pKdkFx9uS8&=()>Y`TKtVBwi1a7T|SN5q>Q|Fbs-R8Wi r5i_V}JR8*1*wjg2b^;fqlb8T7-kvp6;i)FU08O@N<|Vu8nsWLNN`{i9 diff --git a/src/csharp/Microsoft.Spark/Microsoft.Spark.csproj b/src/csharp/Microsoft.Spark/Microsoft.Spark.csproj index f284de8c6..2cddc5627 100644 --- a/src/csharp/Microsoft.Spark/Microsoft.Spark.csproj +++ b/src/csharp/Microsoft.Spark/Microsoft.Spark.csproj @@ -38,7 +38,10 @@ - + From ed36375561e3495a675f9ac14ab80f79f3fbb38d Mon Sep 17 00:00:00 2001 From: GOEddieUK Date: Fri, 31 Jul 2020 19:55:49 +0100 Subject: [PATCH 08/20] member doesnt need to be internal --- src/csharp/Microsoft.Spark/ML/Feature/FeatureBase.cs | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/src/csharp/Microsoft.Spark/ML/Feature/FeatureBase.cs b/src/csharp/Microsoft.Spark/ML/Feature/FeatureBase.cs index 0895dace1..8446b9f4e 100644 --- a/src/csharp/Microsoft.Spark/ML/Feature/FeatureBase.cs +++ b/src/csharp/Microsoft.Spark/ML/Feature/FeatureBase.cs @@ -98,7 +98,7 @@ public Param.Param GetParam(string paramName) => public T Set(Param.Param param, object value) => WrapAsType((JvmObjectReference)_jvmObject.Invoke("set", param, value)); - internal static T WrapAsType(JvmObjectReference reference) + private static T WrapAsType(JvmObjectReference reference) { ConstructorInfo constructor = typeof(T) .GetConstructors(BindingFlags.NonPublic | BindingFlags.Instance) From c7baf7231914b10300175e67158b604d646b97d4 Mon Sep 17 00:00:00 2001 From: GOEddieUK Date: Fri, 31 Jul 2020 19:56:29 +0100 Subject: [PATCH 09/20] too many lines --- src/csharp/Microsoft.Spark/ML/Feature/FeatureBase.cs | 3 +-- 1 file changed, 1 insertion(+), 2 deletions(-) diff --git a/src/csharp/Microsoft.Spark/ML/Feature/FeatureBase.cs b/src/csharp/Microsoft.Spark/ML/Feature/FeatureBase.cs index 8446b9f4e..9ccd64d5b 100644 --- a/src/csharp/Microsoft.Spark/ML/Feature/FeatureBase.cs +++ b/src/csharp/Microsoft.Spark/ML/Feature/FeatureBase.cs @@ -106,11 +106,10 @@ private static T WrapAsType(JvmObjectReference reference) { ParameterInfo[] parameters = c.GetParameters(); return (parameters.Length == 1) && - (parameters[0].ParameterType == typeof(JvmObjectReference)); + (parameters[0].ParameterType == typeof(JvmObjectReference)); }); return (T)constructor.Invoke(new object[] {reference}); } - } } From d13303ccaeb691691c4d294d96e0995f3597becb Mon Sep 17 00:00:00 2001 From: GOEddieUK Date: Fri, 31 Jul 2020 20:01:07 +0100 Subject: [PATCH 10/20] removing whitespace change --- src/csharp/Microsoft.Spark/ML/Feature/FeatureBase.cs | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/src/csharp/Microsoft.Spark/ML/Feature/FeatureBase.cs b/src/csharp/Microsoft.Spark/ML/Feature/FeatureBase.cs index 9ccd64d5b..326268a5e 100644 --- a/src/csharp/Microsoft.Spark/ML/Feature/FeatureBase.cs +++ b/src/csharp/Microsoft.Spark/ML/Feature/FeatureBase.cs @@ -105,7 +105,7 @@ private static T WrapAsType(JvmObjectReference reference) .Single(c => { ParameterInfo[] parameters = c.GetParameters(); - return (parameters.Length == 1) && + return (parameters.Length == 1) && (parameters[0].ParameterType == typeof(JvmObjectReference)); }); From f5b477c72158599b1c6552c7eb1af20edfab7779 Mon Sep 17 00:00:00 2001 From: GOEddieUK Date: Fri, 31 Jul 2020 20:01:57 +0100 Subject: [PATCH 11/20] removing whitespace change --- src/csharp/Microsoft.Spark/ML/Feature/FeatureBase.cs | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/src/csharp/Microsoft.Spark/ML/Feature/FeatureBase.cs 
b/src/csharp/Microsoft.Spark/ML/Feature/FeatureBase.cs index 326268a5e..9ccd64d5b 100644 --- a/src/csharp/Microsoft.Spark/ML/Feature/FeatureBase.cs +++ b/src/csharp/Microsoft.Spark/ML/Feature/FeatureBase.cs @@ -105,7 +105,7 @@ private static T WrapAsType(JvmObjectReference reference) .Single(c => { ParameterInfo[] parameters = c.GetParameters(); - return (parameters.Length == 1) && + return (parameters.Length == 1) && (parameters[0].ParameterType == typeof(JvmObjectReference)); }); From 73db52b400637585b2216f44aac616828800b9d2 Mon Sep 17 00:00:00 2001 From: GOEddieUK Date: Fri, 31 Jul 2020 20:06:12 +0100 Subject: [PATCH 12/20] ionide --- .ionide/symbolCache.db | Bin 0 -> 28672 bytes .../Microsoft.Spark/ML/Feature/FeatureBase.cs | 2 +- 2 files changed, 1 insertion(+), 1 deletion(-) create mode 100644 .ionide/symbolCache.db diff --git a/.ionide/symbolCache.db b/.ionide/symbolCache.db new file mode 100644 index 0000000000000000000000000000000000000000..43e567d6d682d85dd32b3baebb0fdf61f67c1643 GIT binary patch literal 28672 zcmeHPYiuJ|6}A(f3n*{Ao>?UI&RXX3biC7$u_ zru*Uw_|aboAq4!aPz(G7Dp3^)2^9$msX|362qA<70{*l}EvWneB<{TIImtK)+7qdu zu{?L~_%Yvi&pr3fz2}^JGgp@L0vB1UR7NW^3^wa~*(5A|iH8H;*B z&*Jr7uND(C74gzvnY~{#(YNt3Bw$FukbofpLjr~b3<($#FeG3|z>t6;0j&hSxNf$G zdV0*S4h!s^BA3}J-Ki9L<b^F{6=TjCuvK9>U*;l97n^=RUn$l~UNgYJCMRsfNA1?9Bl`LCS%K#Ngzf z;{QO@KD+;){!jcL`9JXd#`A*vLwDKru4~cx6X&wy2aZ|$+xDBbpW9ZfKeZmWylYu) z{ax!q%P(3^n}21#VfwY{Wv%+^=L?ZG(qpDCr$c=8^*P6`^IVl5<5tIVd0~tSzgigM z?z5uk`LPT6Y_-By)&wRae!(ne*4gR?lUBdKT&?7)Y>8Rp+k4w%>fhA!qkaWU!g8j& z9avyP?TH38h17hd$}v3E&T>vpABO?_g&-RIX#2Phe6h%7MQ!I9p4+7FTpy5icQ=}> z549iWIkuWzms8^H1xPCcSS58?UH(Q%WgSo}pSi&1%ghFqw{V?jb6g|G_h<$0h{v%C z-gd4n!}2^=x>Jxh}w>8SGqHC9v|D(^V*GlIoc1w|iaru~I2##}Q_|T1nBeK3-|27DujYN}U*Qnog;yWG zZr^}f59*n?f&gs=trhybzAhapG&7CzWFvl0k6CDOn7FsU92`wI{g3@Pu)FM&(n0byiegJk8tp$;dZ(^ zv=Y$fNsAXqBh!KP@Nsv!7c*PDz?GP*+?q1NVCFOrK}H;XiU+ZH0EsJTZI6;T-JAq- zjuS+DU_#0}R28`M+eFlgwc9Ox&3NKZQZlZ_NMqF!>tue`EenJqeO= z-(k%EWxt9n4P*W<`j}){81sMGVPVYwoyPqCY2B>G{GWCz|I_n-YoKm$sAt63{IKZ@ zCjVW()Axq2+xxcnw&!C{$^DLd+4ZLDu=DedKRby1-}XDUcWqa#zqMYm{LFH`_1mrC zmLIi@o8L1(L;w7AFA|##XrEvtfaXaX7#xVzi>ihYNYj*Mww$X`*YV|QzC^=M?s7b{ zR2E&Ad_Jr7vJDN1UN$i$u~P>{b1*gdEEdg`lZyRFw(CoE4YZL{_BddI=VPNxf)tFtiS1c{kwLRP_Bs2KJFK{M=Zfgp z5MzaS^-RLY01>>A4J9(PJCPk;3|-Gg3h=}8Y*2oI=KR!=4YAJvd^~4-*c;Zwp=d)e zwB3Zp8E>CHASQ8d{J&ySn^K6#J;CrWR!`-^; zqM`|6+mM`(61?aElrp4!0${VlSjO{EDzvau3op=cAg;PpUaK$*T(-!H0bn9Ea6!6~ zfK+Z2jZ}ANN{^JVURgcM@|@U<%-5<_8pe2m6F=O3P0ZtfS{ltsMe8cM8#S4aNHRO7 zP>{8>qXSDzJ6)YVwwmL`gM=7R(3d8$>Y$^w!>jYuqOlTbI-J)^}tn~k4R`# z%gp&{VwO;uNmcVHV%BjK=nOZ4Rh#YB_F$tnC4XE!;#3Ygq;Hbn}0)EjsT`7(7Hm(^n4Shvcww9bHjiGUTkb|JUFBEjjap;AiR+{ zRfi)Sw-Q%wk3G;24hEwfM_e&LA1|CP7zp+@8d-fr1qb>{%Ti8k6mY>C>QgR<x&4FUUSXg<^HJJmg z+yu~gPaO(^pg4n>Feuf`;H>7b0t(b?!3eIh&iXMNlE!-ryH6Vpc*fwWGc*$D;%gTqk&q-Cl6 z&AhPQ$ki?Yc)V6Ocw}&wR=2ebVDZ#{FgVz|M=$;yfW}jQD)4qec<>o;`)ASWbD(2j zKwGjUe(ny9$7ZuaBe${y!12mP!8_>_%6uN&A8%dEKw#jIev4{6q2zTjkIn%KC!$yb zu1H5@5XzU1b*Mvx8ey0CdUjSW00%Hm3Vam6}SWN^JQ4~XCiyMMBOimJ2yEVw`3RFM``6MP+8rh)j(l45I9&m?x{pdOk1GyTW(8~A=7nOb-8>4TyfiALgBG{yP< zfN9y}f7L(ad(C&&`$zA~ooi^`*Tu02K?}xl$2k0 zbAGCma`TY_FI()QQ7l(SGLY<5OiOdr+o8!4+e1*~+h_)dnql67B-dwfHZZUiz zzabvn1HdDMHf^Ll5)zb@WM2U=c;2LBe==J4CfT-n1+TU#`8q*c)H!K{K&_voHgaIsS6#Dmy)4GV_@uWLz6 zCQj|6Ygy;mRq^1efCpaJLU|n1bS;|z2Cr*L98HWhx|R)qg4eYq#uF1dLT%c$kd6vq zg6CvS^dv?aO!Sley-p$ur@`DqgVL}*9Lx0e7KVs&&S^rGFGR!5--h~O1_78cOIj<(HBye+O7eATP3F@vFj0ZfVApl|kbfCFCq0p*~C8rNO| z5O|G}`oPsFL8V8~HcD~+AGF&{{x#oUeXn>w_CEI7Jzw?M-QRS_UEg(`cE0b-IsWD- 
z*gvut>CcRpApt`Ih6D@=7!vqDl0cTI`}N6#m<}|fmUX}y7k4Wa4`L?jmVGkJ6*QPm zYp1g@6xEwkmnZ@8xJSUAea_QpR|;ZeGEFy$x=)|HN*vD)pBhe0C^NlncK8B;yY%7BE>k_JC|$KEu;nbLo^u3l z$A=f=Oa&b*Noz^uG)L4O_hgO6Lz)T9gtSOo1n!-O7mYD7>XTNG)BbMWD7*yKTnn{p2kcErn^;5QO|^~@dfjb=g`(vO!rQ!^N%?io_SZ? zsF<0ra;!kxqpY{^imti@RPfs8^z`%@jfgc_8rKLgN6`#2wD@X-UA_4A%qCGc(UU6~ zuc#A&3DvB(O|fV<3^&xn+pKdkFx9uS8&=()>Y`TKtVBwi1a7T|SN5q>Q|Fbs-R8Wi r5i_V}JR8*1*wjg2b^;fqlb8T7-kvp6;i)FU08O@N<|Vu8nsWLNN`{i9 literal 0 HcmV?d00001 diff --git a/src/csharp/Microsoft.Spark/ML/Feature/FeatureBase.cs b/src/csharp/Microsoft.Spark/ML/Feature/FeatureBase.cs index 9ccd64d5b..326268a5e 100644 --- a/src/csharp/Microsoft.Spark/ML/Feature/FeatureBase.cs +++ b/src/csharp/Microsoft.Spark/ML/Feature/FeatureBase.cs @@ -105,7 +105,7 @@ private static T WrapAsType(JvmObjectReference reference) .Single(c => { ParameterInfo[] parameters = c.GetParameters(); - return (parameters.Length == 1) && + return (parameters.Length == 1) && (parameters[0].ParameterType == typeof(JvmObjectReference)); }); From 8e1685cd270657c5e7a6769e732bf85d5ae6cb2e Mon Sep 17 00:00:00 2001 From: Niharika Dutta Date: Thu, 13 Aug 2020 12:59:34 -0700 Subject: [PATCH 13/20] Revert "Merge branch 'master' into ml/countvectorizer" This reverts commit a766146f56014ccae4118b35495b84da588af94f, reversing changes made to 73db52b400637585b2216f44aac616828800b9d2. Reverting countvectorizer changes --- .gitignore | 3 --- .ionide/symbolCache.db | Bin 0 -> 28672 bytes .../Processor/BroadcastVariableProcessor.cs | 3 +-- 3 files changed, 1 insertion(+), 5 deletions(-) create mode 100644 .ionide/symbolCache.db diff --git a/.gitignore b/.gitignore index faada9c8a..251cfa7e2 100644 --- a/.gitignore +++ b/.gitignore @@ -367,6 +367,3 @@ hs_err_pid* # The target folder contains the output of building **/target/** - -# F# vs code -.ionide/ diff --git a/.ionide/symbolCache.db b/.ionide/symbolCache.db new file mode 100644 index 0000000000000000000000000000000000000000..43e567d6d682d85dd32b3baebb0fdf61f67c1643 GIT binary patch literal 28672 zcmeHPYiuJ|6}A(f3n*{Ao>?UI&RXX3biC7$u_ zru*Uw_|aboAq4!aPz(G7Dp3^)2^9$msX|362qA<70{*l}EvWneB<{TIImtK)+7qdu zu{?L~_%Yvi&pr3fz2}^JGgp@L0vB1UR7NW^3^wa~*(5A|iH8H;*B z&*Jr7uND(C74gzvnY~{#(YNt3Bw$FukbofpLjr~b3<($#FeG3|z>t6;0j&hSxNf$G zdV0*S4h!s^BA3}J-Ki9L<b^F{6=TjCuvK9>U*;l97n^=RUn$l~UNgYJCMRsfNA1?9Bl`LCS%K#Ngzf z;{QO@KD+;){!jcL`9JXd#`A*vLwDKru4~cx6X&wy2aZ|$+xDBbpW9ZfKeZmWylYu) z{ax!q%P(3^n}21#VfwY{Wv%+^=L?ZG(qpDCr$c=8^*P6`^IVl5<5tIVd0~tSzgigM z?z5uk`LPT6Y_-By)&wRae!(ne*4gR?lUBdKT&?7)Y>8Rp+k4w%>fhA!qkaWU!g8j& z9avyP?TH38h17hd$}v3E&T>vpABO?_g&-RIX#2Phe6h%7MQ!I9p4+7FTpy5icQ=}> z549iWIkuWzms8^H1xPCcSS58?UH(Q%WgSo}pSi&1%ghFqw{V?jb6g|G_h<$0h{v%C z-gd4n!}2^=x>Jxh}w>8SGqHC9v|D(^V*GlIoc1w|iaru~I2##}Q_|T1nBeK3-|27DujYN}U*Qnog;yWG zZr^}f59*n?f&gs=trhybzAhapG&7CzWFvl0k6CDOn7FsU92`wI{g3@Pu)FM&(n0byiegJk8tp$;dZ(^ zv=Y$fNsAXqBh!KP@Nsv!7c*PDz?GP*+?q1NVCFOrK}H;XiU+ZH0EsJTZI6;T-JAq- zjuS+DU_#0}R28`M+eFlgwc9Ox&3NKZQZlZ_NMqF!>tue`EenJqeO= z-(k%EWxt9n4P*W<`j}){81sMGVPVYwoyPqCY2B>G{GWCz|I_n-YoKm$sAt63{IKZ@ zCjVW()Axq2+xxcnw&!C{$^DLd+4ZLDu=DedKRby1-}XDUcWqa#zqMYm{LFH`_1mrC zmLIi@o8L1(L;w7AFA|##XrEvtfaXaX7#xVzi>ihYNYj*Mww$X`*YV|QzC^=M?s7b{ zR2E&Ad_Jr7vJDN1UN$i$u~P>{b1*gdEEdg`lZyRFw(CoE4YZL{_BddI=VPNxf)tFtiS1c{kwLRP_Bs2KJFK{M=Zfgp z5MzaS^-RLY01>>A4J9(PJCPk;3|-Gg3h=}8Y*2oI=KR!=4YAJvd^~4-*c;Zwp=d)e zwB3Zp8E>CHASQ8d{J&ySn^K6#J;CrWR!`-^; zqM`|6+mM`(61?aElrp4!0${VlSjO{EDzvau3op=cAg;PpUaK$*T(-!H0bn9Ea6!6~ zfK+Z2jZ}ANN{^JVURgcM@|@U<%-5<_8pe2m6F=O3P0ZtfS{ltsMe8cM8#S4aNHRO7 
zP>{8>qXSDzJ6)YVwwmL`gM=7R(3d8$>Y$^w!>jYuqOlTbI-J)^}tn~k4R`# z%gp&{VwO;uNmcVHV%BjK=nOZ4Rh#YB_F$tnC4XE!;#3Ygq;Hbn}0)EjsT`7(7Hm(^n4Shvcww9bHjiGUTkb|JUFBEjjap;AiR+{ zRfi)Sw-Q%wk3G;24hEwfM_e&LA1|CP7zp+@8d-fr1qb>{%Ti8k6mY>C>QgR<x&4FUUSXg<^HJJmg z+yu~gPaO(^pg4n>Feuf`;H>7b0t(b?!3eIh&iXMNlE!-ryH6Vpc*fwWGc*$D;%gTqk&q-Cl6 z&AhPQ$ki?Yc)V6Ocw}&wR=2ebVDZ#{FgVz|M=$;yfW}jQD)4qec<>o;`)ASWbD(2j zKwGjUe(ny9$7ZuaBe${y!12mP!8_>_%6uN&A8%dEKw#jIev4{6q2zTjkIn%KC!$yb zu1H5@5XzU1b*Mvx8ey0CdUjSW00%Hm3Vam6}SWN^JQ4~XCiyMMBOimJ2yEVw`3RFM``6MP+8rh)j(l45I9&m?x{pdOk1GyTW(8~A=7nOb-8>4TyfiALgBG{yP< zfN9y}f7L(ad(C&&`$zA~ooi^`*Tu02K?}xl$2k0 zbAGCma`TY_FI()QQ7l(SGLY<5OiOdr+o8!4+e1*~+h_)dnql67B-dwfHZZUiz zzabvn1HdDMHf^Ll5)zb@WM2U=c;2LBe==J4CfT-n1+TU#`8q*c)H!K{K&_voHgaIsS6#Dmy)4GV_@uWLz6 zCQj|6Ygy;mRq^1efCpaJLU|n1bS;|z2Cr*L98HWhx|R)qg4eYq#uF1dLT%c$kd6vq zg6CvS^dv?aO!Sley-p$ur@`DqgVL}*9Lx0e7KVs&&S^rGFGR!5--h~O1_78cOIj<(HBye+O7eATP3F@vFj0ZfVApl|kbfCFCq0p*~C8rNO| z5O|G}`oPsFL8V8~HcD~+AGF&{{x#oUeXn>w_CEI7Jzw?M-QRS_UEg(`cE0b-IsWD- z*gvut>CcRpApt`Ih6D@=7!vqDl0cTI`}N6#m<}|fmUX}y7k4Wa4`L?jmVGkJ6*QPm zYp1g@6xEwkmnZ@8xJSUAea_QpR|;ZeGEFy$x=)|HN*vD)pBhe0C^NlncK8B;yY%7BE>k_JC|$KEu;nbLo^u3l z$A=f=Oa&b*Noz^uG)L4O_hgO6Lz)T9gtSOo1n!-O7mYD7>XTNG)BbMWD7*yKTnn{p2kcErn^;5QO|^~@dfjb=g`(vO!rQ!^N%?io_SZ? zsF<0ra;!kxqpY{^imti@RPfs8^z`%@jfgc_8rKLgN6`#2wD@X-UA_4A%qCGc(UU6~ zuc#A&3DvB(O|fV<3^&xn+pKdkFx9uS8&=()>Y`TKtVBwi1a7T|SN5q>Q|Fbs-R8Wi r5i_V}JR8*1*wjg2b^;fqlb8T7-kvp6;i)FU08O@N<|Vu8nsWLNN`{i9 literal 0 HcmV?d00001 diff --git a/src/csharp/Microsoft.Spark.Worker/Processor/BroadcastVariableProcessor.cs b/src/csharp/Microsoft.Spark.Worker/Processor/BroadcastVariableProcessor.cs index bf8f48ed8..41c817d02 100644 --- a/src/csharp/Microsoft.Spark.Worker/Processor/BroadcastVariableProcessor.cs +++ b/src/csharp/Microsoft.Spark.Worker/Processor/BroadcastVariableProcessor.cs @@ -54,8 +54,7 @@ internal BroadcastVariables Process(Stream stream) else { string path = SerDe.ReadString(stream); - using FileStream fStream = - File.Open(path, FileMode.Open, FileAccess.Read, FileShare.Read); + using FileStream fStream = File.Open(path, FileMode.Open, FileAccess.Read); object value = formatter.Deserialize(fStream); BroadcastRegistry.Add(bid, value); } From 255515eecbd6cb8e7919fbd2b857d99e335c66d2 Mon Sep 17 00:00:00 2001 From: Niharika Dutta Date: Thu, 13 Aug 2020 13:04:05 -0700 Subject: [PATCH 14/20] Revert "Merge branch 'ml/countvectorizer' of https://github.com/GoEddie/spark" This reverts commit ad6bcede69de012c22178825e76c6b175c770b8f, reversing changes made to 4c5d502a9f56e79ea071b12d2a49dced3873dea8. 
reverting countvectorizer changes -2 --- .../ML/Feature/CountVectorizerModelTests.cs | 73 ------- .../ML/Feature/CountVectorizerTests.cs | 76 ------- .../ML/Feature/CountVectorizer.cs | 197 ------------------ .../ML/Feature/CountVectorizerModel.cs | 170 --------------- .../Microsoft.Spark/ML/Feature/FeatureBase.cs | 4 +- 5 files changed, 2 insertions(+), 518 deletions(-) delete mode 100644 src/csharp/Microsoft.Spark.E2ETest/IpcTests/ML/Feature/CountVectorizerModelTests.cs delete mode 100644 src/csharp/Microsoft.Spark.E2ETest/IpcTests/ML/Feature/CountVectorizerTests.cs delete mode 100644 src/csharp/Microsoft.Spark/ML/Feature/CountVectorizer.cs delete mode 100644 src/csharp/Microsoft.Spark/ML/Feature/CountVectorizerModel.cs diff --git a/src/csharp/Microsoft.Spark.E2ETest/IpcTests/ML/Feature/CountVectorizerModelTests.cs b/src/csharp/Microsoft.Spark.E2ETest/IpcTests/ML/Feature/CountVectorizerModelTests.cs deleted file mode 100644 index 3c3132dd9..000000000 --- a/src/csharp/Microsoft.Spark.E2ETest/IpcTests/ML/Feature/CountVectorizerModelTests.cs +++ /dev/null @@ -1,73 +0,0 @@ -// Licensed to the .NET Foundation under one or more agreements. -// The .NET Foundation licenses this file to you under the MIT license. -// See the LICENSE file in the project root for more information. - -using System; -using System.Collections.Generic; -using System.IO; -using Microsoft.Spark.ML.Feature; -using Microsoft.Spark.Sql; -using Microsoft.Spark.UnitTest.TestUtils; -using Xunit; - -namespace Microsoft.Spark.E2ETest.IpcTests.ML.Feature -{ - [Collection("Spark E2E Tests")] - public class CountVectorizerModelTests - { - private readonly SparkSession _spark; - - public CountVectorizerModelTests(SparkFixture fixture) - { - _spark = fixture.Spark; - } - - [Fact] - public void Test_CountVectorizerModel() - { - DataFrame input = _spark.Sql("SELECT array('hello', 'I', 'AM', 'a', 'string', 'TO', " + - "'TOKENIZE') as input from range(100)"); - - const string inputColumn = "input"; - const string outputColumn = "output"; - const double minTf = 10.0; - const bool binary = false; - - List vocabulary = new List() - { - "hello", - "I", - "AM", - "TO", - "TOKENIZE" - }; - - var countVectorizerModel = new CountVectorizerModel(vocabulary); - - Assert.IsType(new CountVectorizerModel("my-uid", vocabulary)); - - countVectorizerModel = countVectorizerModel - .SetInputCol(inputColumn) - .SetOutputCol(outputColumn) - .SetMinTF(minTf) - .SetBinary(binary); - - Assert.Equal(inputColumn, countVectorizerModel.GetInputCol()); - Assert.Equal(outputColumn, countVectorizerModel.GetOutputCol()); - Assert.Equal(minTf, countVectorizerModel.GetMinTF()); - Assert.Equal(binary, countVectorizerModel.GetBinary()); - using (var tempDirectory = new TemporaryDirectory()) - { - string savePath = Path.Join(tempDirectory.Path, "countVectorizerModel"); - countVectorizerModel.Save(savePath); - - CountVectorizerModel loadedModel = CountVectorizerModel.Load(savePath); - Assert.Equal(countVectorizerModel.Uid(), loadedModel.Uid()); - } - - Assert.IsType(countVectorizerModel.GetVocabSize()); - Assert.NotEmpty(countVectorizerModel.ExplainParams()); - Assert.NotEmpty(countVectorizerModel.ToString()); - } - } -} diff --git a/src/csharp/Microsoft.Spark.E2ETest/IpcTests/ML/Feature/CountVectorizerTests.cs b/src/csharp/Microsoft.Spark.E2ETest/IpcTests/ML/Feature/CountVectorizerTests.cs deleted file mode 100644 index 95b9bc504..000000000 --- a/src/csharp/Microsoft.Spark.E2ETest/IpcTests/ML/Feature/CountVectorizerTests.cs +++ /dev/null @@ -1,76 +0,0 @@ -// Licensed to 
the .NET Foundation under one or more agreements. -// The .NET Foundation licenses this file to you under the MIT license. -// See the LICENSE file in the project root for more information. - -using System; -using System.IO; -using Microsoft.Spark.E2ETest.Utils; -using Microsoft.Spark.ML.Feature; -using Microsoft.Spark.Sql; -using Microsoft.Spark.UnitTest.TestUtils; -using Xunit; - -namespace Microsoft.Spark.E2ETest.IpcTests.ML.Feature -{ - [Collection("Spark E2E Tests")] - public class CountVectorizerTests - { - private readonly SparkSession _spark; - - public CountVectorizerTests(SparkFixture fixture) - { - _spark = fixture.Spark; - } - - [Fact] - public void Test_CountVectorizer() - { - DataFrame input = _spark.Sql("SELECT array('hello', 'I', 'AM', 'a', 'string', 'TO', " + - "'TOKENIZE') as input from range(100)"); - - const string inputColumn = "input"; - const string outputColumn = "output"; - const double minDf = 1; - const double minTf = 10; - const int vocabSize = 10000; - const bool binary = false; - - var countVectorizer = new CountVectorizer(); - - countVectorizer - .SetInputCol(inputColumn) - .SetOutputCol(outputColumn) - .SetMinDF(minDf) - .SetMinTF(minTf) - .SetVocabSize(vocabSize); - - Assert.IsType(countVectorizer.Fit(input)); - Assert.Equal(inputColumn, countVectorizer.GetInputCol()); - Assert.Equal(outputColumn, countVectorizer.GetOutputCol()); - Assert.Equal(minDf, countVectorizer.GetMinDF()); - Assert.Equal(minTf, countVectorizer.GetMinTF()); - Assert.Equal(vocabSize, countVectorizer.GetVocabSize()); - Assert.Equal(binary, countVectorizer.GetBinary()); - - using (var tempDirectory = new TemporaryDirectory()) - { - string savePath = Path.Join(tempDirectory.Path, "countVectorizer"); - countVectorizer.Save(savePath); - - CountVectorizer loadedVectorizer = CountVectorizer.Load(savePath); - Assert.Equal(countVectorizer.Uid(), loadedVectorizer.Uid()); - } - - Assert.NotEmpty(countVectorizer.ExplainParams()); - Assert.NotEmpty(countVectorizer.ToString()); - } - - [SkipIfSparkVersionIsLessThan(Versions.V2_4_0)] - public void CountVectorizer_MaxDF() - { - const double maxDf = 100; - CountVectorizer countVectorizer = new CountVectorizer().SetMaxDF(maxDf); - Assert.Equal(maxDf, countVectorizer.GetMaxDF()); - } - } -} diff --git a/src/csharp/Microsoft.Spark/ML/Feature/CountVectorizer.cs b/src/csharp/Microsoft.Spark/ML/Feature/CountVectorizer.cs deleted file mode 100644 index 5689e19fd..000000000 --- a/src/csharp/Microsoft.Spark/ML/Feature/CountVectorizer.cs +++ /dev/null @@ -1,197 +0,0 @@ -// Licensed to the .NET Foundation under one or more agreements. -// The .NET Foundation licenses this file to you under the MIT license. -// See the LICENSE file in the project root for more information. - -using Microsoft.Spark.Interop; -using Microsoft.Spark.Interop.Ipc; -using Microsoft.Spark.Sql; - -namespace Microsoft.Spark.ML.Feature -{ - public class CountVectorizer : FeatureBase, IJvmObjectReferenceProvider - { - private static readonly string s_countVectorizerClassName = - "org.apache.spark.ml.feature.CountVectorizer"; - - /// - /// Create a without any parameters - /// - public CountVectorizer() : base(s_countVectorizerClassName) - { - } - - /// - /// Create a with a UID that is used to give the - /// a unique ID - /// - /// An immutable unique ID for the object and its derivatives. 
- public CountVectorizer(string uid) : base(s_countVectorizerClassName, uid) - { - } - - internal CountVectorizer(JvmObjectReference jvmObject) : base(jvmObject) - { - } - - JvmObjectReference IJvmObjectReferenceProvider.Reference => _jvmObject; - - /// Fits a model to the input data. - /// The to fit the model to. - /// - public CountVectorizerModel Fit(DataFrame dataFrame) => - new CountVectorizerModel((JvmObjectReference)_jvmObject.Invoke("fit", dataFrame)); - - /// - /// Loads the that was previously saved using Save - /// - /// - /// The path the previous was saved to - /// - /// New object - public static CountVectorizer Load(string path) => - WrapAsCountVectorizer((JvmObjectReference) - SparkEnvironment.JvmBridge.CallStaticJavaMethod( - s_countVectorizerClassName,"load", path)); - - /// - /// Gets the binary toggle to control the output vector values. If True, all nonzero counts - /// (after minTF filter applied) are set to 1. This is useful for discrete probabilistic - /// models that model binary events rather than integer counts. Default: false - /// - /// boolean - public bool GetBinary() => (bool)_jvmObject.Invoke("getBinary"); - - /// - /// Sets the binary toggle to control the output vector values. If True, all nonzero counts - /// (after minTF filter applied) are set to 1. This is useful for discrete probabilistic - /// models that model binary events rather than integer counts. Default: false - /// - /// Turn the binary toggle on or off - /// with the new binary toggle value set - public CountVectorizer SetBinary(bool value) => - WrapAsCountVectorizer((JvmObjectReference)_jvmObject.Invoke("setBinary", value)); - - /// - /// Gets the column that the should read from and convert - /// into buckets. This would have been set by SetInputCol - /// - /// string, the input column - public string GetInputCol() => _jvmObject.Invoke("getInputCol") as string; - - /// - /// Sets the column that the should read from. - /// - /// The name of the column to as the source. - /// with the input column set - public CountVectorizer SetInputCol(string value) => - WrapAsCountVectorizer((JvmObjectReference)_jvmObject.Invoke("setInputCol", value)); - - /// - /// The will create a new column in the DataFrame, this is - /// the name of the new column. - /// - /// The name of the output column. - public string GetOutputCol() => _jvmObject.Invoke("getOutputCol") as string; - - /// - /// The will create a new column in the DataFrame, this - /// is the name of the new column. - /// - /// The name of the output column which will be created. - /// New with the output column set - public CountVectorizer SetOutputCol(string value) => - WrapAsCountVectorizer((JvmObjectReference)_jvmObject.Invoke("setOutputCol", value)); - - /// - /// Gets the maximum number of different documents a term could appear in to be included in - /// the vocabulary. A term that appears more than the threshold will be ignored. If this is - /// an integer greater than or equal to 1, this specifies the maximum number of documents - /// the term could appear in; if this is a double in [0,1), then this specifies the maximum - /// fraction of documents the term could appear in. - /// - /// The maximum document term frequency - [Since(Versions.V2_4_0)] - public double GetMaxDF() => (double)_jvmObject.Invoke("getMaxDF"); - - /// - /// Sets the maximum number of different documents a term could appear in to be included in - /// the vocabulary. A term that appears more than the threshold will be ignored. 
If this is - /// an integer greater than or equal to 1, this specifies the maximum number of documents - /// the term could appear in; if this is a double in [0,1), then this specifies the maximum - /// fraction of documents the term could appear in. - /// - /// The maximum document term frequency - /// New with the max df value set - [Since(Versions.V2_4_0)] - public CountVectorizer SetMaxDF(double value) => - WrapAsCountVectorizer((JvmObjectReference)_jvmObject.Invoke("setMaxDF", value)); - - /// - /// Gets the minimum number of different documents a term must appear in to be included in - /// the vocabulary. If this is an integer greater than or equal to 1, this specifies the - /// number of documents the term must appear in; if this is a double in [0,1), then this - /// specifies the fraction of documents. - /// - /// The minimum document term frequency - public double GetMinDF() => (double)_jvmObject.Invoke("getMinDF"); - - /// - /// Sets the minimum number of different documents a term must appear in to be included in - /// the vocabulary. If this is an integer greater than or equal to 1, this specifies the - /// number of documents the term must appear in; if this is a double in [0,1), then this - /// specifies the fraction of documents. - /// - /// The minimum document term frequency - /// New with the min df value set - public CountVectorizer SetMinDF(double value) => - WrapAsCountVectorizer((JvmObjectReference)_jvmObject.Invoke("setMinDF", value)); - - /// - /// Filter to ignore rare words in a document. For each document, terms with - /// frequency/count less than the given threshold are ignored. If this is an integer - /// greater than or equal to 1, then this specifies a count (of times the term must appear - /// in the document); if this is a double in [0,1), then this specifies a fraction (out of - /// the document's token count). - /// - /// Note that the parameter is only used in transform of CountVectorizerModel and does not - /// affect fitting. - /// - /// Minimum term frequency - public double GetMinTF() => (double)_jvmObject.Invoke("getMinTF"); - - /// - /// Filter to ignore rare words in a document. For each document, terms with - /// frequency/count less than the given threshold are ignored. If this is an integer - /// greater than or equal to 1, then this specifies a count (of times the term must appear - /// in the document); if this is a double in [0,1), then this specifies a fraction (out of - /// the document's token count). - /// - /// Note that the parameter is only used in transform of CountVectorizerModel and does not - /// affect fitting. - /// - /// Minimum term frequency - /// New with the min term frequency set - public CountVectorizer SetMinTF(double value) => - WrapAsCountVectorizer((JvmObjectReference)_jvmObject.Invoke("setMinTF", value)); - - /// - /// Gets the max size of the vocabulary. CountVectorizer will build a vocabulary that only - /// considers the top vocabSize terms ordered by term frequency across the corpus. - /// - /// The max size of the vocabulary - public int GetVocabSize() => (int)_jvmObject.Invoke("getVocabSize"); - - /// - /// Sets the max size of the vocabulary. will build a - /// vocabulary that only considers the top vocabSize terms ordered by term frequency across - /// the corpus. 
- /// - /// The max vocabulary size - /// with the max vocab value set - public CountVectorizer SetVocabSize(int value) => - WrapAsCountVectorizer(_jvmObject.Invoke("setVocabSize", value)); - - private static CountVectorizer WrapAsCountVectorizer(object obj) => - new CountVectorizer((JvmObjectReference)obj); - } -} diff --git a/src/csharp/Microsoft.Spark/ML/Feature/CountVectorizerModel.cs b/src/csharp/Microsoft.Spark/ML/Feature/CountVectorizerModel.cs deleted file mode 100644 index 52bbd72c3..000000000 --- a/src/csharp/Microsoft.Spark/ML/Feature/CountVectorizerModel.cs +++ /dev/null @@ -1,170 +0,0 @@ -// Licensed to the .NET Foundation under one or more agreements. -// The .NET Foundation licenses this file to you under the MIT license. -// See the LICENSE file in the project root for more information. - -using System.Collections.Generic; -using Microsoft.Spark.Interop; -using Microsoft.Spark.Interop.Ipc; - -namespace Microsoft.Spark.ML.Feature -{ - public class CountVectorizerModel : FeatureBase - , IJvmObjectReferenceProvider - { - private static readonly string s_countVectorizerModelClassName = - "org.apache.spark.ml.feature.CountVectorizerModel"; - - /// - /// Create a without any parameters - /// - /// The vocabulary to use - public CountVectorizerModel(List vocabulary) : - this(SparkEnvironment.JvmBridge.CallConstructor( - s_countVectorizerModelClassName, vocabulary)) - { - } - - /// - /// Create a with a UID that is used to give the - /// a unique ID - /// - /// An immutable unique ID for the object and its derivatives. - /// The vocabulary to use - public CountVectorizerModel(string uid, List vocabulary) : - this(SparkEnvironment.JvmBridge.CallConstructor( - s_countVectorizerModelClassName, uid, vocabulary)) - { - } - - internal CountVectorizerModel(JvmObjectReference jvmObject) : base(jvmObject) - { - } - - JvmObjectReference IJvmObjectReferenceProvider.Reference => _jvmObject; - - /// - /// Loads the that was previously saved using Save - /// - /// - /// The path the previous was saved to - /// - /// New object - public static CountVectorizerModel Load(string path) => - WrapAsCountVectorizerModel((JvmObjectReference) - SparkEnvironment.JvmBridge.CallStaticJavaMethod( - s_countVectorizerModelClassName,"load", path)); - - /// - /// Gets the binary toggle to control the output vector values. If True, all nonzero counts - /// (after minTF filter applied) are set to 1. This is useful for discrete probabilistic - /// models that model binary events rather than integer counts. Default: false - /// - /// boolean - public bool GetBinary() => (bool)_jvmObject.Invoke("getBinary"); - - /// - /// Sets the binary toggle to control the output vector values. If True, all nonzero counts - /// (after minTF filter applied) are set to 1. This is useful for discrete probabilistic - /// models that model binary events rather than integer counts. Default: false - /// - /// Turn the binary toggle on or off - /// - /// with the new binary toggle value set - /// - public CountVectorizerModel SetBinary(bool value) => - WrapAsCountVectorizerModel((JvmObjectReference)_jvmObject.Invoke("setBinary", value)); - - /// - /// Gets the column that the should read from and - /// convert into buckets. This would have been set by SetInputCol - /// - /// string, the input column - public string GetInputCol() => _jvmObject.Invoke("getInputCol") as string; - - /// - /// Sets the column that the should read from. - /// - /// The name of the column to as the source. 
- /// with the input column set - public CountVectorizerModel SetInputCol(string value) => - WrapAsCountVectorizerModel( - (JvmObjectReference)_jvmObject.Invoke("setInputCol", value)); - - /// - /// The will create a new column in the DataFrame, this - /// is the name of the new column. - /// - /// The name of the output column. - public string GetOutputCol() => _jvmObject.Invoke("getOutputCol") as string; - - /// - /// The will create a new column in the DataFrame, - /// this is the name of the new column. - /// - /// The name of the output column which will be created. - /// New with the output column set - public CountVectorizerModel SetOutputCol(string value) => - WrapAsCountVectorizerModel( - (JvmObjectReference)_jvmObject.Invoke("setOutputCol", value)); - - /// - /// Gets the maximum number of different documents a term could appear in to be included in - /// the vocabulary. A term that appears more than the threshold will be ignored. If this is - /// an integer greater than or equal to 1, this specifies the maximum number of documents - /// the term could appear in; if this is a double in [0,1), then this specifies the maximum - /// fraction of documents the term could appear in. - /// - /// The maximum document term frequency - public double GetMaxDF() => (double)_jvmObject.Invoke("getMaxDF"); - - /// - /// Gets the minimum number of different documents a term must appear in to be included in - /// the vocabulary. If this is an integer greater than or equal to 1, this specifies the - /// number of documents the term must appear in; if this is a double in [0,1), then this - /// specifies the fraction of documents. - /// - /// The minimum document term frequency - public double GetMinDF() => (double)_jvmObject.Invoke("getMinDF"); - - /// - /// Filter to ignore rare words in a document. For each document, terms with - /// frequency/count less than the given threshold are ignored. If this is an integer - /// greater than or equal to 1, then this specifies a count (of times the term must appear - /// in the document); if this is a double in [0,1), then this specifies a fraction (out of - /// the document's token count). - /// - /// Note that the parameter is only used in transform of CountVectorizerModel and does not - /// affect fitting. - /// - /// Minimum term frequency - public double GetMinTF() => (double)_jvmObject.Invoke("getMinTF"); - - /// - /// Filter to ignore rare words in a document. For each document, terms with - /// frequency/count less than the given threshold are ignored. If this is an integer - /// greater than or equal to 1, then this specifies a count (of times the term must appear - /// in the document); if this is a double in [0,1), then this specifies a fraction (out of - /// the document's token count). - /// - /// Note that the parameter is only used in transform of CountVectorizerModel and does not - /// affect fitting. - /// - /// Minimum term frequency - /// - /// New with the min term frequency set - /// - public CountVectorizerModel SetMinTF(double value) => - WrapAsCountVectorizerModel((JvmObjectReference)_jvmObject.Invoke("setMinTF", value)); - - /// - /// Gets the max size of the vocabulary. will build a - /// vocabulary that only considers the top vocabSize terms ordered by term frequency across - /// the corpus. 
- /// - /// The max size of the vocabulary - public int GetVocabSize() => (int)_jvmObject.Invoke("getVocabSize"); - - private static CountVectorizerModel WrapAsCountVectorizerModel(object obj) => - new CountVectorizerModel((JvmObjectReference)obj); - } -} diff --git a/src/csharp/Microsoft.Spark/ML/Feature/FeatureBase.cs b/src/csharp/Microsoft.Spark/ML/Feature/FeatureBase.cs index 326268a5e..fcc90b43d 100644 --- a/src/csharp/Microsoft.Spark/ML/Feature/FeatureBase.cs +++ b/src/csharp/Microsoft.Spark/ML/Feature/FeatureBase.cs @@ -105,8 +105,8 @@ private static T WrapAsType(JvmObjectReference reference) .Single(c => { ParameterInfo[] parameters = c.GetParameters(); - return (parameters.Length == 1) && - (parameters[0].ParameterType == typeof(JvmObjectReference)); + return (parameters.Length == 1) && + (parameters[0].ParameterType == typeof(JvmObjectReference)); }); return (T)constructor.Invoke(new object[] {reference}); From 3c2c936b007d7b5d761fda737625dc8f7d03728b Mon Sep 17 00:00:00 2001 From: Niharika Dutta Date: Fri, 14 Aug 2020 13:32:54 -0700 Subject: [PATCH 15/20] fixing merge errors --- .gitignore | 3 +++ .../Processor/BroadcastVariableProcessor.cs | 3 ++- 2 files changed, 5 insertions(+), 1 deletion(-) diff --git a/.gitignore b/.gitignore index 251cfa7e2..8e67b5699 100644 --- a/.gitignore +++ b/.gitignore @@ -367,3 +367,6 @@ hs_err_pid* # The target folder contains the output of building **/target/** + +# F# vs code +.ionide/ \ No newline at end of file diff --git a/src/csharp/Microsoft.Spark.Worker/Processor/BroadcastVariableProcessor.cs b/src/csharp/Microsoft.Spark.Worker/Processor/BroadcastVariableProcessor.cs index 41c817d02..bf8f48ed8 100644 --- a/src/csharp/Microsoft.Spark.Worker/Processor/BroadcastVariableProcessor.cs +++ b/src/csharp/Microsoft.Spark.Worker/Processor/BroadcastVariableProcessor.cs @@ -54,7 +54,8 @@ internal BroadcastVariables Process(Stream stream) else { string path = SerDe.ReadString(stream); - using FileStream fStream = File.Open(path, FileMode.Open, FileAccess.Read); + using FileStream fStream = + File.Open(path, FileMode.Open, FileAccess.Read, FileShare.Read); object value = formatter.Deserialize(fStream); BroadcastRegistry.Add(bid, value); } From 88e834d53b7be8931147a095a7b0df3c08cd9aa8 Mon Sep 17 00:00:00 2001 From: Niharika Dutta Date: Wed, 19 Aug 2020 19:24:14 -0700 Subject: [PATCH 16/20] removing ionid --- .gitignore | 2 +- .ionide/symbolCache.db | Bin 28672 -> 0 bytes 2 files changed, 1 insertion(+), 1 deletion(-) delete mode 100644 .ionide/symbolCache.db diff --git a/.gitignore b/.gitignore index 8e67b5699..faada9c8a 100644 --- a/.gitignore +++ b/.gitignore @@ -369,4 +369,4 @@ hs_err_pid* **/target/** # F# vs code -.ionide/ \ No newline at end of file +.ionide/ diff --git a/.ionide/symbolCache.db b/.ionide/symbolCache.db deleted file mode 100644 index 43e567d6d682d85dd32b3baebb0fdf61f67c1643..0000000000000000000000000000000000000000 GIT binary patch literal 0 HcmV?d00001 literal 28672 zcmeHPYiuJ|6}A(f3n*{Ao>?UI&RXX3biC7$u_ zru*Uw_|aboAq4!aPz(G7Dp3^)2^9$msX|362qA<70{*l}EvWneB<{TIImtK)+7qdu zu{?L~_%Yvi&pr3fz2}^JGgp@L0vB1UR7NW^3^wa~*(5A|iH8H;*B z&*Jr7uND(C74gzvnY~{#(YNt3Bw$FukbofpLjr~b3<($#FeG3|z>t6;0j&hSxNf$G zdV0*S4h!s^BA3}J-Ki9L<b^F{6=TjCuvK9>U*;l97n^=RUn$l~UNgYJCMRsfNA1?9Bl`LCS%K#Ngzf z;{QO@KD+;){!jcL`9JXd#`A*vLwDKru4~cx6X&wy2aZ|$+xDBbpW9ZfKeZmWylYu) z{ax!q%P(3^n}21#VfwY{Wv%+^=L?ZG(qpDCr$c=8^*P6`^IVl5<5tIVd0~tSzgigM z?z5uk`LPT6Y_-By)&wRae!(ne*4gR?lUBdKT&?7)Y>8Rp+k4w%>fhA!qkaWU!g8j& z9avyP?TH38h17hd$}v3E&T>vpABO?_g&-RIX#2Phe6h%7MQ!I9p4+7FTpy5icQ=}> 
z549iWIkuWzms8^H1xPCcSS58?UH(Q%WgSo}pSi&1%ghFqw{V?jb6g|G_h<$0h{v%C z-gd4n!}2^=x>Jxh}w>8SGqHC9v|D(^V*GlIoc1w|iaru~I2##}Q_|T1nBeK3-|27DujYN}U*Qnog;yWG zZr^}f59*n?f&gs=trhybzAhapG&7CzWFvl0k6CDOn7FsU92`wI{g3@Pu)FM&(n0byiegJk8tp$;dZ(^ zv=Y$fNsAXqBh!KP@Nsv!7c*PDz?GP*+?q1NVCFOrK}H;XiU+ZH0EsJTZI6;T-JAq- zjuS+DU_#0}R28`M+eFlgwc9Ox&3NKZQZlZ_NMqF!>tue`EenJqeO= z-(k%EWxt9n4P*W<`j}){81sMGVPVYwoyPqCY2B>G{GWCz|I_n-YoKm$sAt63{IKZ@ zCjVW()Axq2+xxcnw&!C{$^DLd+4ZLDu=DedKRby1-}XDUcWqa#zqMYm{LFH`_1mrC zmLIi@o8L1(L;w7AFA|##XrEvtfaXaX7#xVzi>ihYNYj*Mww$X`*YV|QzC^=M?s7b{ zR2E&Ad_Jr7vJDN1UN$i$u~P>{b1*gdEEdg`lZyRFw(CoE4YZL{_BddI=VPNxf)tFtiS1c{kwLRP_Bs2KJFK{M=Zfgp z5MzaS^-RLY01>>A4J9(PJCPk;3|-Gg3h=}8Y*2oI=KR!=4YAJvd^~4-*c;Zwp=d)e zwB3Zp8E>CHASQ8d{J&ySn^K6#J;CrWR!`-^; zqM`|6+mM`(61?aElrp4!0${VlSjO{EDzvau3op=cAg;PpUaK$*T(-!H0bn9Ea6!6~ zfK+Z2jZ}ANN{^JVURgcM@|@U<%-5<_8pe2m6F=O3P0ZtfS{ltsMe8cM8#S4aNHRO7 zP>{8>qXSDzJ6)YVwwmL`gM=7R(3d8$>Y$^w!>jYuqOlTbI-J)^}tn~k4R`# z%gp&{VwO;uNmcVHV%BjK=nOZ4Rh#YB_F$tnC4XE!;#3Ygq;Hbn}0)EjsT`7(7Hm(^n4Shvcww9bHjiGUTkb|JUFBEjjap;AiR+{ zRfi)Sw-Q%wk3G;24hEwfM_e&LA1|CP7zp+@8d-fr1qb>{%Ti8k6mY>C>QgR<x&4FUUSXg<^HJJmg z+yu~gPaO(^pg4n>Feuf`;H>7b0t(b?!3eIh&iXMNlE!-ryH6Vpc*fwWGc*$D;%gTqk&q-Cl6 z&AhPQ$ki?Yc)V6Ocw}&wR=2ebVDZ#{FgVz|M=$;yfW}jQD)4qec<>o;`)ASWbD(2j zKwGjUe(ny9$7ZuaBe${y!12mP!8_>_%6uN&A8%dEKw#jIev4{6q2zTjkIn%KC!$yb zu1H5@5XzU1b*Mvx8ey0CdUjSW00%Hm3Vam6}SWN^JQ4~XCiyMMBOimJ2yEVw`3RFM``6MP+8rh)j(l45I9&m?x{pdOk1GyTW(8~A=7nOb-8>4TyfiALgBG{yP< zfN9y}f7L(ad(C&&`$zA~ooi^`*Tu02K?}xl$2k0 zbAGCma`TY_FI()QQ7l(SGLY<5OiOdr+o8!4+e1*~+h_)dnql67B-dwfHZZUiz zzabvn1HdDMHf^Ll5)zb@WM2U=c;2LBe==J4CfT-n1+TU#`8q*c)H!K{K&_voHgaIsS6#Dmy)4GV_@uWLz6 zCQj|6Ygy;mRq^1efCpaJLU|n1bS;|z2Cr*L98HWhx|R)qg4eYq#uF1dLT%c$kd6vq zg6CvS^dv?aO!Sley-p$ur@`DqgVL}*9Lx0e7KVs&&S^rGFGR!5--h~O1_78cOIj<(HBye+O7eATP3F@vFj0ZfVApl|kbfCFCq0p*~C8rNO| z5O|G}`oPsFL8V8~HcD~+AGF&{{x#oUeXn>w_CEI7Jzw?M-QRS_UEg(`cE0b-IsWD- z*gvut>CcRpApt`Ih6D@=7!vqDl0cTI`}N6#m<}|fmUX}y7k4Wa4`L?jmVGkJ6*QPm zYp1g@6xEwkmnZ@8xJSUAea_QpR|;ZeGEFy$x=)|HN*vD)pBhe0C^NlncK8B;yY%7BE>k_JC|$KEu;nbLo^u3l z$A=f=Oa&b*Noz^uG)L4O_hgO6Lz)T9gtSOo1n!-O7mYD7>XTNG)BbMWD7*yKTnn{p2kcErn^;5QO|^~@dfjb=g`(vO!rQ!^N%?io_SZ? zsF<0ra;!kxqpY{^imti@RPfs8^z`%@jfgc_8rKLgN6`#2wD@X-UA_4A%qCGc(UU6~ zuc#A&3DvB(O|fV<3^&xn+pKdkFx9uS8&=()>Y`TKtVBwi1a7T|SN5q>Q|Fbs-R8Wi r5i_V}JR8*1*wjg2b^;fqlb8T7-kvp6;i)FU08O@N<|Vu8nsWLNN`{i9 From 39b3950f50db5d96037a666e375fe1af198e967d Mon Sep 17 00:00:00 2001 From: Niharika Dutta Date: Mon, 5 Oct 2020 14:05:18 -0700 Subject: [PATCH 17/20] first draft --- README.md | 26 ++- docs/broadcast-guide.md | 92 ---------- docs/building/ubuntu-instructions.md | 25 +-- docs/building/windows-instructions.md | 33 ++-- docs/getting-started/macos-instructions.md | 85 --------- docs/getting-started/ubuntu-instructions.md | 77 --------- docs/getting-started/windows-instructions.md | 65 ------- docs/how-to-guides.md | 16 ++ docs/udf-guide.md | 171 ------------------- 9 files changed, 67 insertions(+), 523 deletions(-) delete mode 100644 docs/broadcast-guide.md delete mode 100644 docs/getting-started/macos-instructions.md delete mode 100644 docs/getting-started/ubuntu-instructions.md delete mode 100644 docs/getting-started/windows-instructions.md create mode 100644 docs/how-to-guides.md delete mode 100644 docs/udf-guide.md diff --git a/README.md b/README.md index 7aef188eb..8592c4464 100644 --- a/README.md +++ b/README.md @@ -8,7 +8,7 @@ .NET for Apache Spark is compliant with .NET Standard - a formal specification of .NET APIs that are common across .NET implementations. 
This means you can use .NET for Apache Spark anywhere you write .NET code allowing you to reuse all the knowledge, skills, code, and libraries you already have as a .NET developer. -.NET for Apache Spark runs on Windows, Linux, and macOS using .NET Core, or Windows using .NET Framework. It also runs on all major cloud providers including [Azure HDInsight Spark](deployment/README.md#azure-hdinsight-spark), [Amazon EMR Spark](deployment/README.md#amazon-emr-spark), [AWS](deployment/README.md#databricks) & [Azure](deployment/README.md#databricks) Databricks. +.NET for Apache Spark runs on Windows, Linux, and macOS using .NET Core, or Windows using .NET Framework. It also runs on all major cloud providers including [Azure HDInsight Spark](deployment/README.md#azure-hdinsight-spark), [Amazon EMR Spark](deployment/README.md#amazon-emr-spark), [AWS](deployment/README.md#databricks), [Azure Databricks](deployment/README.md#databricks) & [Azure Synapse Analytics](https://azure.microsoft.com/en-us/services/synapse-analytics/). **Note**: We currently have a Spark Project Improvement Proposal JIRA at [SPIP: .NET bindings for Apache Spark](https://issues.apache.org/jira/browse/SPARK-27006) to work with the community towards getting .NET support by default into Apache Spark. We highly encourage you to participate in the discussion. @@ -39,7 +39,7 @@ 2.3.* - v0.12.1 + v1.0 2.4.0 @@ -56,6 +56,18 @@ 2.4.5 + + 2.4.6 + + + 2.4.7 + + + 3.0.0 + + + 3.0.1 + 2.4.2 Not supported @@ -69,9 +81,9 @@ ## Get Started These instructions will show you how to run a .NET for Apache Spark app using .NET Core. -- [Windows Instructions](docs/getting-started/windows-instructions.md) -- [Ubuntu Instructions](docs/getting-started/ubuntu-instructions.md) -- [MacOs Instructions](docs/getting-started/macos-instructions.md) +- [Windows Instructions](https://docs.microsoft.com/en-us/dotnet/spark/tutorials/get-started?tabs=windows) +- [Ubuntu Instructions](https://docs.microsoft.com/en-us/dotnet/spark/tutorials/get-started?tabs=linux) +- [MacOs Instructions](https://docs.microsoft.com/en-us/dotnet/spark/tutorials/get-started?tabs=linux) ## Build Status @@ -155,6 +167,10 @@ We welcome contributions to both categories! +## Learn More + +To learn more about some features of .NET for Apache Spark, please visit [this compilation of How-To guides](docs/how-to-guides.md). + ## Contributing We welcome contributions! Please review our [contribution guide](CONTRIBUTING.md). diff --git a/docs/broadcast-guide.md b/docs/broadcast-guide.md deleted file mode 100644 index c3026516b..000000000 --- a/docs/broadcast-guide.md +++ /dev/null @@ -1,92 +0,0 @@ -# Guide to using Broadcast Variables - -This is a guide to show how to use broadcast variables in .NET for Apache Spark. - -## What are Broadcast Variables - -[Broadcast variables in Apache Spark](https://spark.apache.org/docs/2.2.0/rdd-programming-guide.html#broadcast-variables) are a mechanism for sharing variables across executors that are meant to be read-only. They allow the programmer to keep a read-only variable cached on each machine rather than shipping a copy of it with tasks. They can be used, for example, to give every node a copy of a large input dataset in an efficient manner. - -### How to use broadcast variables in .NET for Apache Spark - -Broadcast variables are created from a variable `v` by calling `SparkContext.Broadcast(v)`. The broadcast variable is a wrapper around `v`, and its value can be accessed by calling the `Value()` method. 
- -Example: - -```csharp -string v = "Variable to be broadcasted"; -Broadcast bv = SparkContext.Broadcast(v); - -// Using the broadcast variable in a UDF: -Func udf = Udf( - str => $"{str}: {bv.Value()}"); -``` - -The type parameter for `Broadcast` should be the type of the variable being broadcasted. - -### Deleting broadcast variables - -The broadcast variable can be deleted from all executors by calling the `Destroy()` method on it. - -```csharp -// Destroying the broadcast variable bv: -bv.Destroy(); -``` - -> Note: `Destroy()` deletes all data and metadata related to the broadcast variable. Use this with caution - once a broadcast variable has been destroyed, it cannot be used again. - -#### Caveat of using Destroy - -One important thing to keep in mind while using broadcast variables in UDFs is to limit the scope of the variable to only the UDF that is referencing it. The [guide to using UDFs](udf-guide.md) describes this phenomenon in detail. This is especially crucial when calling `Destroy` on the broadcast variable. If the broadcast variable that has been destroyed is visible to or accessible from other UDFs, it gets picked up for serialization by all those UDFs, even if it is not being referenced by them. This will throw an error as .NET for Apache Spark is not able to serialize the destroyed broadcast variable. - -Example to demonstrate: - -```csharp -string v = "Variable to be broadcasted"; -Broadcast bv = SparkContext.Broadcast(v); - -// Using the broadcast variable in a UDF: -Func udf1 = Udf( - str => $"{str}: {bv.Value()}"); - -// Destroying bv -bv.Destroy(); - -// Calling udf1 after destroying bv throws the following expected exception: -// org.apache.spark.SparkException: Attempted to use Broadcast(0) after it was destroyed -df.Select(udf1(df["_1"])).Show(); - -// Different UDF udf2 that is not referencing bv -Func udf2 = Udf( - str => $"{str}: not referencing broadcast variable"); - -// Calling udf2 throws the following (unexpected) exception: -// [Error] [JvmBridge] org.apache.spark.SparkException: Task not serializable -df.Select(udf2(df["_1"])).Show(); -``` - -The recommended way of implementing above desired behavior: - -```csharp -string v = "Variable to be broadcasted"; -// Restricting the visibility of bv to only the UDF referencing it -{ - Broadcast bv = SparkContext.Broadcast(v); - - // Using the broadcast variable in a UDF: - Func udf1 = Udf( - str => $"{str}: {bv.Value()}"); - - // Destroying bv - bv.Destroy(); -} - -// Different UDF udf2 that is not referencing bv -Func udf2 = Udf( - str => $"{str}: not referencing broadcast variable"); - -// Calling udf2 works fine as expected -df.Select(udf2(df["_1"])).Show(); -``` - This ensures that destroying `bv` doesn't affect calling `udf2` because of unexpected serialization behavior. - - Broadcast variables are useful for transmitting read-only data to all executors, as the data is sent only once and this can give performance benefits when compared with using local variables that get shipped to the executors with each task. Please refer to the [official documentation](https://spark.apache.org/docs/2.2.0/rdd-programming-guide.html#broadcast-variables) to get a deeper understanding of broadcast variables and why they are used. 
\ No newline at end of file diff --git a/docs/building/ubuntu-instructions.md b/docs/building/ubuntu-instructions.md index b259768e5..6bf51c333 100644 --- a/docs/building/ubuntu-instructions.md +++ b/docs/building/ubuntu-instructions.md @@ -3,7 +3,7 @@ Building Spark .NET on Ubuntu 18.04 # Table of Contents - [Open Issues](#open-issues) -- [Pre-requisites](#pre-requisites) +- [Prerequisites](#prerequisites) - [Building](#building) - [Building Spark .NET Scala Extensions Layer](#building-spark-net-scala-extensions-layer) - [Building .NET Sample Applications using .NET Core CLI](#building-net-sample-applications-using-net-core-cli) @@ -12,17 +12,17 @@ Building Spark .NET on Ubuntu 18.04 # Open Issues: - [Building through Visual Studio Code]() -# Pre-requisites: +# Prerequisites: -If you already have all the pre-requisites, skip to the [build](ubuntu-instructions.md#building) steps below. +If you already have all the prerequisites, skip to the [build](ubuntu-instructions.md#building) steps below. 1. Download and install **[.NET Core 3.1 SDK](https://dotnet.microsoft.com/download/dotnet-core/3.1)** - installing the SDK will add the `dotnet` toolchain to your path. - 2. Install **[OpenJDK 8](https://openjdk.java.net/install/)** + 2. Install **[OpenJDK 8](https://openjdk.java.net/install/)** . - You can use the following command: ```bash sudo apt install openjdk-8-jdk ``` - - Verify you are able to run `java` from your command-line + - Verify you are able to run `java` from your command-line.
📙 Click to see sample java -version output @@ -49,7 +49,7 @@ If you already have all the pre-requisites, skip to the [build](ubuntu-instructi ``` Note that these environment variables will be lost when you close your terminal. If you want the changes to be permanent, add the `export` lines to your `~/.bashrc` file. - - Verify you are able to run `mvn` from your command-line + - Verify you are able to run `mvn` from your command-line.
📙 Click to see sample mvn -version output @@ -61,8 +61,8 @@ If you already have all the pre-requisites, skip to the [build](ubuntu-instructi OS name: "linux", version: "4.4.0-142-generic", arch: "amd64", family: "unix" ``` 4. Install **[Apache Spark 2.3+](https://spark.apache.org/downloads.html)** - - Download [Apache Spark 2.3+](https://spark.apache.org/downloads.html) and extract it into a local folder (e.g., `~/bin/spark-2.3.2-bin-hadoop2.7`) - - Add the necessary [environment variables](https://www.java.com/en/download/help/path.xml) `SPARK_HOME` e.g., `~/bin/spark-2.3.2-bin-hadoop2.7/` + - Download [Apache Spark 2.3+](https://spark.apache.org/downloads.html) and extract it into a local folder (e.g., `~/bin/spark-2.3.2-bin-hadoop2.7`). + - Add the necessary [environment variables](https://www.java.com/en/download/help/path.xml) `SPARK_HOME` to point to the local directory where you installed Apache Spark e.g., `~/bin/spark-2.3.2-bin-hadoop2.7/`. ```bash export SPARK_HOME=~/bin/spark-2.3.2-hadoop2.7 export PATH="$SPARK_HOME/bin:$PATH" @@ -96,7 +96,7 @@ Please make sure you are able to run `dotnet`, `java`, `mvn`, `spark-shell` from # Building -For the rest of the section, it is assumed that you have cloned Spark .NET repo into your machine e.g., `~/dotnet.spark/` +For the rest of the section, it is assumed that you have cloned Spark .NET repo into your machine e.g., `~/dotnet.spark/`. ``` git clone https://github.com/dotnet/spark.git ~/dotnet.spark @@ -104,7 +104,7 @@ git clone https://github.com/dotnet/spark.git ~/dotnet.spark ## Building Spark .NET Scala Extensions Layer -When you submit a .NET application, Spark .NET has the necessary logic written in Scala that inform Apache Spark how to handle your requests (e.g., request to create a new Spark Session, request to transfer data from .NET side to JVM side etc.). This logic can be found in the [Spark .NET Scala Source Code](../../src/scala). +When you submit a .NET application, Spark .NET has the necessary logic written in Scala that informs Apache Spark how to handle your requests (e.g., request to create a new Spark Session, request to transfer data from .NET side to JVM side etc.). This logic can be found in the [Spark .NET Scala Source Code](../../src/scala). Let us now build the Spark .NET Scala extension layer. This is easy to do: @@ -164,7 +164,7 @@ You should see JARs created for the supported Spark versions: # Run Samples -Once you build the samples, you can use `spark-submit` to submit your .NET Core apps. Make sure you have followed the [pre-requisites](#pre-requisites) section and installed Apache Spark. +Once you build the samples, you can use `spark-submit` to submit your .NET Core apps. Make sure you have followed the [prerequisites](#prerequisites) section and installed Apache Spark. 1. Set the `DOTNET_WORKER_DIR` or `PATH` environment variable to include the path where the `Microsoft.Spark.Worker` binary has been generated (e.g., `~/dotnet.spark/artifacts/bin/Microsoft.Spark.Worker/Debug/netcoreapp3.1/linux-x64/publish`) 2. 
Open a terminal and go to the directory where your app binary has been generated (e.g., `~/dotnet.spark/artifacts/bin/Microsoft.Spark.CSharp.Examples/Debug/netcoreapp3.1/linux-x64/publish`) @@ -172,6 +172,7 @@ Once you build the samples, you can use `spark-submit` to submit your .NET Core ```bash spark-submit \ [--jars ] \ + --conf = \ --class org.apache.spark.deploy.dotnet.DotnetRunner \ --master local \ \ @@ -214,4 +215,4 @@ Once you build the samples, you can use `spark-submit` to submit your .NET Core ./Microsoft.Spark.CSharp.Examples Sql.Streaming.StructuredKafkaWordCount localhost:9092 subscribe test ``` -Feel this experience is complicated? Help us by taking up [Simplify User Experience for Running an App](https://github.com/dotnet/spark/issues/6) +Feel this experience is complicated? Help us by taking up [Simplify User Experience for Running an App](https://github.com/dotnet/spark/issues/6). diff --git a/docs/building/windows-instructions.md b/docs/building/windows-instructions.md index aad141b68..d3ec922f2 100644 --- a/docs/building/windows-instructions.md +++ b/docs/building/windows-instructions.md @@ -3,7 +3,7 @@ Building Spark .NET on Windows # Table of Contents - [Open Issues](#open-issues) -- [Pre-requisites](#pre-requisites) +- [Prerequisites](#prerequisites) - [Building](#building) - [Building Spark .NET Scala Extensions Layer](#building-spark-net-scala-extensions-layer) - [Building .NET Samples Application](#building-net-samples-application) @@ -16,9 +16,9 @@ Building Spark .NET on Windows - [Building through Visual Studio Code]() - [Building fully automatically through .NET Core CLI]() -# Pre-requisites: +# Prerequisites: -If you already have all the pre-requisites, skip to the [build](windows-instructions.md#building) steps below. +If you already have all the prerequisites, skip to the [build](windows-instructions.md#building) steps below. 1. Download and install the **[.NET Core 3.1 SDK](https://dotnet.microsoft.com/download/dotnet-core/3.1)** - installing the SDK will add the `dotnet` toolchain to your path. 2. Install **[Visual Studio 2019](https://www.visualstudio.com/downloads/)** (Version 16.4 or later). The Community version is completely free. When configuring your installation, include these components at minimum: @@ -29,17 +29,17 @@ If you already have all the pre-requisites, skip to the [build](windows-instruct * All Required Components 3. Install **[Java 1.8](https://www.oracle.com/technetwork/java/javase/downloads/jdk8-downloads-2133151.html)** - Select the appropriate version for your operating system e.g., jdk-8u201-windows-x64.exe for Win x64 machine. - - Install using the installer and verify you are able to run `java` from your command-line + - Install using the installer and verify you are able to run `java` from your command-line. 4. Install **[Apache Maven 3.6.3+](https://maven.apache.org/download.cgi)** - - Download [Apache Maven 3.6.3](http://mirror.metrocast.net/apache/maven/maven-3/3.6.3/binaries/apache-maven-3.6.3-bin.zip) - - Extract to a local directory e.g., `c:\bin\apache-maven-3.6.3\` - - Add Apache Maven to your [PATH environment variable](https://www.java.com/en/download/help/path.xml) e.g., `c:\bin\apache-maven-3.6.3\bin` - - Verify you are able to run `mvn` from your command-line + - Download [Apache Maven 3.6.3](http://mirror.metrocast.net/apache/maven/maven-3/3.6.3/binaries/apache-maven-3.6.3-bin.zip). + - Extract to a local directory e.g., `c:\bin\apache-maven-3.6.3\`. 
+ - Add Apache Maven to your [PATH environment variable](https://www.java.com/en/download/help/path.xml) e.g., `c:\bin\apache-maven-3.6.3\bin`. + - Verify you are able to run `mvn` from your command-line. 5. Install **[Apache Spark 2.3+](https://spark.apache.org/downloads.html)** - Download [Apache Spark 2.3+](https://spark.apache.org/downloads.html) and extract it into a local folder (e.g., `c:\bin\spark-2.3.2-bin-hadoop2.7\`) using [7-zip](https://www.7-zip.org/). - - Add Apache Spark to your [PATH environment variable](https://www.java.com/en/download/help/path.xml) e.g., `c:\bin\spark-2.3.2-bin-hadoop2.7\bin` - - Add a [new environment variable](https://www.java.com/en/download/help/path.xml) `SPARK_HOME` e.g., `C:\bin\spark-2.3.2-bin-hadoop2.7\` - - Verify you are able to run `spark-shell` from your command-line + - Add Apache Spark to your [PATH environment variable](https://www.java.com/en/download/help/path.xml) e.g., `c:\bin\spark-2.3.2-bin-hadoop2.7\bin`. + - Add a [new environment variable](https://www.java.com/en/download/help/path.xml) `SPARK_HOME` and point it to the directory you downloaded Apache Spark to, for e.g., `C:\bin\spark-2.3.2-bin-hadoop2.7\`. + - Verify you are able to run `spark-shell` from your command-line.
📙 Click to see sample console output @@ -63,7 +63,7 @@ If you already have all the pre-requisites, skip to the [build](windows-instruct 6. Install **[WinUtils](https://github.com/steveloughran/winutils)** - Download `winutils.exe` binary from [WinUtils repository](https://github.com/steveloughran/winutils). You should select the version of Hadoop the Spark distribution was compiled with, e.g. use hadoop-2.7.1 for Spark 2.3.2. - - Save `winutils.exe` binary to a directory of your choice e.g., `c:\hadoop\bin` + - Save `winutils.exe` binary to a directory of your choice e.g., `c:\hadoop\bin`. - Set `HADOOP_HOME` to reflect the directory with winutils.exe (without bin). For instance, using command-line: ```powershell set HADOOP_HOME=c:\hadoop @@ -80,7 +80,7 @@ Please make sure you are able to run `dotnet`, `java`, `mvn`, `spark-shell` from # Building -For the rest of the section, it is assumed that you have cloned Spark .NET repo into your machine e.g., `c:\github\dotnet-spark\` +For the rest of the section, it is assumed that you have cloned Spark .NET repo into your machine e.g., `c:\github\dotnet-spark\`. ```powershell git clone https://github.com/dotnet/spark.git c:\github\dotnet-spark @@ -88,7 +88,7 @@ git clone https://github.com/dotnet/spark.git c:\github\dotnet-spark ## Building Spark .NET Scala Extensions Layer -When you submit a .NET application, Spark .NET has the necessary logic written in Scala that inform Apache Spark how to handle your requests (e.g., request to create a new Spark Session, request to transfer data from .NET side to JVM side etc.). This logic can be found in the [Spark .NET Scala Source Code](../../src/scala). +When you submit a .NET application, Spark .NET has the necessary logic written in Scala that informs Apache Spark how to handle your requests (e.g., request to create a new Spark Session, request to transfer data from .NET side to JVM side etc.). This logic can be found in the [Spark .NET Scala Source Code](../../src/scala). Regardless of whether you are using .NET Framework or .NET Core, you will need to build the Spark .NET Scala extension layer. This is easy to do: @@ -196,7 +196,7 @@ You should see JARs created for the supported Spark versions: # Run Samples -Once you build the samples, running them will be through `spark-submit` regardless of whether you are targeting .NET Framework or .NET Core apps. Make sure you have followed the [pre-requisites](#pre-requisites) section and installed Apache Spark. +Once you build the samples, running them will be through `spark-submit` regardless of whether you are targeting .NET Framework or .NET Core apps. Make sure you have followed the [prerequisites](#prerequisites) section and installed Apache Spark. 1. Set the `DOTNET_WORKER_DIR` or `PATH` environment variable to include the path where the `Microsoft.Spark.Worker` binary has been generated (e.g., `c:\github\dotnet\spark\artifacts\bin\Microsoft.Spark.Worker\Debug\net461` for .NET Framework, `c:\github\dotnet-spark\artifacts\bin\Microsoft.Spark.Worker\Debug\netcoreapp3.1\win-x64\publish` for .NET Core) 2. 
Open Powershell and go to the directory where your app binary has been generated (e.g., `c:\github\dotnet\spark\artifacts\bin\Microsoft.Spark.CSharp.Examples\Debug\net461` for .NET Framework, `c:\github\dotnet-spark\artifacts\bin\Microsoft.Spark.CSharp.Examples\Debug\netcoreapp3.1\win1-x64\publish` for .NET Core) @@ -204,6 +204,7 @@ Once you build the samples, running them will be through `spark-submit` regardle ```powershell spark-submit.cmd ` [--jars ] ` + --conf = ` --class org.apache.spark.deploy.dotnet.DotnetRunner ` --master local ` ` @@ -246,4 +247,4 @@ Once you build the samples, running them will be through `spark-submit` regardle Microsoft.Spark.CSharp.Examples.exe Sql.Streaming.StructuredKafkaWordCount localhost:9092 subscribe test ``` -Feel this experience is complicated? Help us by taking up [Simplify User Experience for Running an App](https://github.com/dotnet/spark/issues/6) +Feel this experience is complicated? Help us by taking up [Simplify User Experience for Running an App](https://github.com/dotnet/spark/issues/6). diff --git a/docs/getting-started/macos-instructions.md b/docs/getting-started/macos-instructions.md deleted file mode 100644 index 91fbf8f88..000000000 --- a/docs/getting-started/macos-instructions.md +++ /dev/null @@ -1,85 +0,0 @@ -# Getting Started with Spark .NET on MacOS - -These instructions will show you how to run a .NET for Apache Spark app using .NET Core on MacOSX. - -## Pre-requisites - -- Download and install **[.NET Core 2.1 SDK](https://dotnet.microsoft.com/download/dotnet-core/2.1)** -- Install **[Java 8](https://www.oracle.com/technetwork/java/javase/downloads/jdk8-downloads-2133151.html)** - - Select the appropriate version for your operating system e.g., `jdk-8u231-macosx-x64.dmg`. - - Install using the installer and verify you are able to run `java` from your command-line -- Download and install **[Apache Spark 2.4.4](https://archive.apache.org/dist/spark/spark-2.4.4/spark-2.4.4-bin-hadoop2.7.tgz)**: - - Add the necessary environment variables SPARK_HOME e.g., `~/bin/spark-2.4.4-bin-hadoop2.7/` - ```bash - export SPARK_HOME=~/bin/spark-2.4.4-bin-hadoop2.7/ - export PATH="$SPARK_HOME/bin:$PATH" - source ~/.bashrc - ``` -- Download and install **[Microsoft.Spark.Worker](https://github.com/dotnet/spark/releases)** release: - - Select a **[Microsoft.Spark.Worker](https://github.com/dotnet/spark/releases)** release from .NET for Apache Spark GitHub Releases page and download into your local machine (e.g., `/bin/Microsoft.Spark.Worker/`). - - **IMPORTANT** Create a new environment variable using ```export DOTNET_WORKER_DIR ``` and set it to the directory where you downloaded and extracted the Microsoft.Spark.Worker (e.g., `/bin/Microsoft.Spark.Worker/`). - - -## Authoring a .NET for Apache Spark App -- Use the `dotnet` CLI to create a console application. 
- ``` - dotnet new console -o HelloSpark - ``` -- Install `Microsoft.Spark` Nuget package into the project from the [spark nuget.org feed](https://www.nuget.org/profiles/spark) - see [Ways to install Nuget Package](https://docs.microsoft.com/en-us/nuget/consume-packages/ways-to-install-a-package) - ``` - cd HelloSpark - dotnet add package Microsoft.Spark - ``` -- Replace the contents of the `Program.cs` file with the following code: - ```csharp - using Microsoft.Spark.Sql; - - namespace HelloSpark - { - class Program - { - static void Main(string[] args) - { - var spark = SparkSession.Builder().GetOrCreate(); - var df = spark.Read().Json("people.json"); - df.Show(); - } - } - } - ``` -- Use the `dotnet` CLI to build the application: - ```bash - dotnet build - ``` - -## Running your .NET for Apache Spark App -- Open your terminal and navigate into your app folder: - ```bash - cd - ``` -- Create `people.json` with the following content: - ```json - { "name" : "Michael" } - { "name" : "Andy", "age" : 30 } - { "name" : "Justin", "age" : 19 } - ``` -- Run your app - ```bash - spark-submit \ - --class org.apache.spark.deploy.dotnet.DotnetRunner \ - --master local \ - microsoft-spark-2.4.x-.jar \ - dotnet HelloSpark.dll - ``` - **Note**: This command assumes you have downloaded Apache Spark and added it to your PATH environment variable to be able to use `spark-submit`, otherwise, you would have to use the full path (e.g., `~/spark/bin/spark-submit`). - -- The output of the application should look similar to the output below: - ```text - +----+-------+ - | age| name| - +----+-------+ - |null|Michael| - | 30| Andy| - | 19| Justin| - +----+-------+ - ``` diff --git a/docs/getting-started/ubuntu-instructions.md b/docs/getting-started/ubuntu-instructions.md deleted file mode 100644 index 4821bbec6..000000000 --- a/docs/getting-started/ubuntu-instructions.md +++ /dev/null @@ -1,77 +0,0 @@ -# Getting Started with Spark.NET on Ubuntu - -These instructions will show you how to run a .NET for Apache Spark app using .NET Core on Ubuntu 18.04. - -## Pre-requisites - -- Download and install the following: **[.NET Core 3.1 SDK](https://dotnet.microsoft.com/download/dotnet-core/3.1)** | **[OpenJDK 8](https://openjdk.java.net/install/)** | **[Apache Spark 2.4.1](https://archive.apache.org/dist/spark/spark-2.4.1/spark-2.4.1-bin-hadoop2.7.tgz)** -- Download and install **[Microsoft.Spark.Worker](https://github.com/dotnet/spark/releases)** release: - - Select a **[Microsoft.Spark.Worker](https://github.com/dotnet/spark/releases)** release from .NET for Apache Spark GitHub Releases page and download into your local machine (e.g., `~/bin/Microsoft.Spark.Worker`). - - **IMPORTANT** Create a [new environment variable](https://help.ubuntu.com/community/EnvironmentVariables) `DOTNET_WORKER_DIR` and set it to the directory where you downloaded and extracted the Microsoft.Spark.Worker (e.g., `~/bin/Microsoft.Spark.Worker`). - -For detailed instructions, you can see [Building .NET for Apache Spark from Source on Ubuntu](../building/ubuntu-instructions.md). - -## Authoring a .NET for Apache Spark App - -- Use the `dotnet` CLI to create a console application. 
- ```shell - dotnet new console -o HelloSpark - ``` -- Install `Microsoft.Spark` Nuget package into the project from the [spark nuget.org feed](https://www.nuget.org/profiles/spark) - see [Ways to install Nuget Package](https://docs.microsoft.com/en-us/nuget/consume-packages/ways-to-install-a-package) - ```shell - cd HelloSpark - dotnet add package Microsoft.Spark - ``` -- Replace the contents of the `Program.cs` file with the following code: - ```csharp - using Microsoft.Spark.Sql; - - namespace HelloSpark - { - class Program - { - static void Main(string[] args) - { - var spark = SparkSession.Builder().GetOrCreate(); - var df = spark.Read().Json("people.json"); - df.Show(); - } - } - } - ``` -- Use the `dotnet` CLI to build the application: - ```shell - dotnet build - ``` - - -## Running your .NET for Apache Spark App -- Open your terminal and navigate into your app folder. - ```shell - cd - ``` -- Create `people.json` with the following content: - ```json - {"name":"Michael"} - {"name":"Andy", "age":30} - {"name":"Justin", "age":19} - ``` -- Run your app. - ```shell - spark-submit \ - --class org.apache.spark.deploy.dotnet.DotnetRunner \ - --master local \ - microsoft-spark-2.4.x-.jar \ - dotnet HelloSpark.dll - ``` - **Note**: This command assumes you have downloaded Apache Spark and added it to your PATH environment variable to be able to use `spark-submit`, otherwise, you would have to use the full path (e.g., `~/spark/bin/spark-submit`). For detailed instructions, you can see [Building .NET for Apache Spark from Source on Ubuntu](../building/ubuntu-instructions.md). -- The output of the application should look similar to the output below: - ```text - +----+-------+ - | age| name| - +----+-------+ - |null|Michael| - | 30| Andy| - | 19| Justin| - +----+-------+ - ``` diff --git a/docs/getting-started/windows-instructions.md b/docs/getting-started/windows-instructions.md deleted file mode 100644 index 698ca8b94..000000000 --- a/docs/getting-started/windows-instructions.md +++ /dev/null @@ -1,65 +0,0 @@ -# Getting Started with Spark .NET on Windows - -These instructions will show you how to run a .NET for Apache Spark app using .NET Core on Windows. - -## Pre-requisites - -- Download and install the following: **[.NET Core 3.1 SDK](https://dotnet.microsoft.com/download/dotnet-core/3.1)** | **[Visual Studio 2019](https://www.visualstudio.com/downloads/)** | **[Java 1.8](https://www.oracle.com/technetwork/java/javase/downloads/jdk8-downloads-2133151.html)** | **[Apache Spark 2.4.1](https://archive.apache.org/dist/spark/spark-2.4.1/spark-2.4.1-bin-hadoop2.7.tgz)** -- Download and install **[Microsoft.Spark.Worker](https://github.com/dotnet/spark/releases)** release: - - Select a **[Microsoft.Spark.Worker](https://github.com/dotnet/spark/releases)** release from .NET for Apache Spark GitHub Releases page and download into your local machine (e.g., `c:\bin\Microsoft.Spark.Worker\`). - - **IMPORTANT** Create a [new environment variable](https://www.java.com/en/download/help/path.xml) `DOTNET_WORKER_DIR` and set it to the directory where you downloaded and extracted the Microsoft.Spark.Worker (e.g., `c:\bin\Microsoft.Spark.Worker`). - -For detailed instructions, you can see [Building .NET for Apache Spark from Source on Windows](../building/windows-instructions.md). 
- -## Authoring a .NET for Apache Spark App -- Open Visual Studio -> Create New Project -> Console App (.NET Core) -> Name: `HelloSpark` -- Install `Microsoft.Spark` Nuget package into the solution from the [spark nuget.org feed](https://www.nuget.org/profiles/spark) - see [Ways to install Nuget Package](https://docs.microsoft.com/en-us/nuget/consume-packages/ways-to-install-a-package) -- Write the following code into `Program.cs`: - ```csharp - using Microsoft.Spark.Sql; - - namespace HelloSpark - { - class Program - { - static void Main(string[] args) - { - var spark = SparkSession.Builder().GetOrCreate(); - var df = spark.Read().Json("people.json"); - df.Show(); - } - } - } - ``` -- Build the solution - -## Running your .NET for Apache Spark App -- Open your terminal and navigate into your app folder: - ``` - cd - ``` -- Create `people.json` with the following content: - ```json - {"name":"Michael"} - {"name":"Andy", "age":30} - {"name":"Justin", "age":19} - ``` -- Run your app - ``` - spark-submit ` - --class org.apache.spark.deploy.dotnet.DotnetRunner ` - --master local ` - microsoft-spark-2.4.x-.jar ` - dotnet HelloSpark.dll - ``` - **Note**: This command assumes you have downloaded Apache Spark and added it to your PATH environment variable to be able to use `spark-submit`, otherwise, you would have to use the full path (e.g., `c:\bin\apache-spark\bin\spark-submit`). For detailed instructions, you can see [Building .NET for Apache Spark from Source on Windows](../building/windows-instructions.md). -- The output of the application should look similar to the output below: - ```text - +----+-------+ - | age| name| - +----+-------+ - |null|Michael| - | 30| Andy| - | 19| Justin| - +----+-------+ - ``` diff --git a/docs/how-to-guides.md b/docs/how-to-guides.md new file mode 100644 index 000000000..d31cea6e5 --- /dev/null +++ b/docs/how-to-guides.md @@ -0,0 +1,16 @@ +# How-To Guides + +.NET for Apache Spark applications can be used to perform a variety of tasks from connecting to external sources to running in notebooks interactively. Here is a list of things you can do through your application along with some important things to know: + +1. [How to use Broadcast variables](https://docs.microsoft.com/en-us/dotnet/spark/how-to-guides/broadcast-guide) in .NET for Apache Spark. +2. [How to use UDFs (User-defined Functions)](https://docs.microsoft.com/en-us/dotnet/spark/how-to-guides/udf-guide) in .NET for Apache Spark. +3. [How to deploy UDF binaries](docs/deploy-worker-udf-binaries.md). +4. [How to connect to Azure Storage accounts]() locally through .NET for Apache Spark. +5. [How to connect to SQL server]() locally through .NET for Apache Spark. +6. [How to connect to Azure EventHub]() through your .NET for Apache Spark application. +7. [How to connect to MongoDB]() through you .NET for Apache Spark application. +8. [How to call UDFs in cross-platform]() applications for example, invoking Scala/Java UDFs from .NET for Apache Spark application, or invoking C# UDFs from Spark apps written in Scala/Java/Python. +9. [How to use .NET for Apache Spark locally using Jupyter notebooks](). +10. [How to use .NET for Apache Spark locally using VS code](). +11. [How to get started](https://dotnet.microsoft.com/learn/data/spark-tutorial/intro) with .NET for Apache Spark through a tutorial. +12. [How to get started with .NET for Apache Spark in Azure Synapse Analytics](). 
diff --git a/docs/udf-guide.md b/docs/udf-guide.md deleted file mode 100644 index 6a2905bf4..000000000 --- a/docs/udf-guide.md +++ /dev/null @@ -1,171 +0,0 @@ -# Guide to User-Defined Functions (UDFs) - -This is a guide to show how to use UDFs in .NET for Apache Spark. - -## What are UDFs - -[User-Defined Functions (UDFs)](https://spark.apache.org/docs/latest/api/java/org/apache/spark/sql/expressions/UserDefinedFunction.html) are a feature of Spark that allow developers to use custom functions to extend the system's built-in functionality. They transform values from a single row within a table to produce a single corresponding output value per row based on the logic defined in the UDF. - -Let's take the following as an example for a UDF definition: - -```csharp -string s1 = "hello"; -Func udf = Udf( - str => $"{s1} {str}"); - -``` -The above defined UDF takes a `string` as an input (in the form of a [Column](https://github.com/dotnet/spark/blob/master/src/csharp/Microsoft.Spark/Sql/Column.cs#L14) of a [Dataframe](https://github.com/dotnet/spark/blob/master/src/csharp/Microsoft.Spark/Sql/DataFrame.cs#L24)), and returns a `string` with `hello` appended in front of the input. - -For a sample Dataframe, let's take the following Dataframe `df`: - -```text -+-------+ -| name| -+-------+ -|Michael| -| Andy| -| Justin| -+-------+ -``` - -Now let's apply the above defined `udf` to the dataframe `df`: - -```csharp -DataFrame udfResult = df.Select(udf(df["name"])); -``` - -This would return the below as the Dataframe `udfResult`: - -```text -+-------------+ -| name| -+-------------+ -|hello Michael| -| hello Andy| -| hello Justin| -+-------------+ -``` -To get a better understanding of how to implement UDFs, please take a look at the [UDF helper functions](https://github.com/dotnet/spark/blob/master/src/csharp/Microsoft.Spark/Sql/Functions.cs#L3616) and some [test examples](https://github.com/dotnet/spark/blob/master/src/csharp/Microsoft.Spark.E2ETest/UdfTests/UdfSimpleTypesTests.cs#L49). - -## UDF serialization - -Since UDFs are functions that need to be executed on the workers, they have to be serialized and sent to the workers as part of the payload from the driver. This involves serializing the [delegate](https://docs.microsoft.com/en-us/dotnet/csharp/programming-guide/delegates/) which is a reference to the method, along with its [target](https://docs.microsoft.com/en-us/dotnet/api/system.delegate.target?view=netframework-4.8) which is the class instance on which the current delegate invokes the instance method. Please take a look at this [code](https://github.com/dotnet/spark/blob/master/src/csharp/Microsoft.Spark/Utils/CommandSerDe.cs#L149) to get a better understanding of how UDF serialization is being done. - -## Good to know while implementing UDFs - -One behavior to be aware of while implementing UDFs in .NET for Apache Spark is how the target of the UDF gets serialized. .NET for Apache Spark uses .NET Core, which does not support serializing delegates, so it is instead done by using reflection to serialize the target where the delegate is defined. When multiple delegates are defined in a common scope, they have a shared closure that becomes the target of reflection for serialization. Let's take an example to illustrate what that means. 
- -The following code snippet defines two string variables that are being referenced in two function delegates that return the respective strings as result: - -```csharp -using System; - -public class C { - public void M() { - string s1 = "s1"; - string s2 = "s2"; - Func a = str => s1; - Func b = str => s2; - } -} -``` - -The above C# code generates the following C# disassembly (credit source: [sharplab.io](https://sharplab.io)) code from the compiler: - -```csharp -public class C -{ - [CompilerGenerated] - private sealed class <>c__DisplayClass0_0 - { - public string s1; - - public string s2; - - internal string b__0(string str) - { - return s1; - } - - internal string b__1(string str) - { - return s2; - } - } - - public void M() - { - <>c__DisplayClass0_0 <>c__DisplayClass0_ = new <>c__DisplayClass0_0(); - <>c__DisplayClass0_.s1 = "s1"; - <>c__DisplayClass0_.s2 = "s2"; - Func func = new Func(<>c__DisplayClass0_.b__0); - Func func2 = new Func(<>c__DisplayClass0_.b__1); - } -} -``` -As can be seen in the above decompiled code, both `func` and `func2` share the same closure `<>c__DisplayClass0_0`, which is the target that is serialized when serializing the delegates `func` and `func2`. Hence, even though `Func a` is only referencing `s1`, `s2` also gets serialized when sending over the bytes to the workers. - -This can lead to some unexpected behaviors at runtime (like in the case of using [broadcast variables](broadcast-guide.md)), which is why we recommend restricting the visibility of the variables used in a function to that function's scope. - -Going back to the above example, the following is the recommended way to implement the desired behavior of previous code snippet: - -```csharp -using System; - -public class C { - public void M() { - { - string s1 = "s1"; - Func a = str => s1; - } - { - string s2 = "s2"; - Func b = str => s2; - } - } -} -``` - -The above C# code generates the following C# disassembly (credit source: [sharplab.io](https://sharplab.io)) code from the compiler: - -```csharp -public class C -{ - [CompilerGenerated] - private sealed class <>c__DisplayClass0_0 - { - public string s1; - - internal string b__0(string str) - { - return s1; - } - } - - [CompilerGenerated] - private sealed class <>c__DisplayClass0_1 - { - public string s2; - - internal string b__1(string str) - { - return s2; - } - } - - public void M() - { - <>c__DisplayClass0_0 <>c__DisplayClass0_ = new <>c__DisplayClass0_0(); - <>c__DisplayClass0_.s1 = "s1"; - Func func = new Func(<>c__DisplayClass0_.b__0); - <>c__DisplayClass0_1 <>c__DisplayClass0_2 = new <>c__DisplayClass0_1(); - <>c__DisplayClass0_2.s2 = "s2"; - Func func2 = new Func(<>c__DisplayClass0_2.b__1); - } -} -``` - -Here we see that `func` and `func2` no longer share a closure and have their own separate closures `<>c__DisplayClass0_0` and `<>c__DisplayClass0_1` respectively. When used as the target for serialization, nothing other than the referenced variables will get serialized for the delegate. - -This behavior is important to keep in mind while implementing multiple UDFs in a common scope. -To learn more about UDFs in general, please review the following articles that explain UDFs and how to use them: [UDFs in databricks(scala)](https://docs.databricks.com/spark/latest/spark-sql/udf-scala.html), [Spark UDFs and some gotchas](https://medium.com/@achilleus/spark-udfs-we-can-use-them-but-should-we-use-them-2c5a561fde6d). 
\ No newline at end of file From 9377692845fc9322a489345e1e1ff691b9404eac Mon Sep 17 00:00:00 2001 From: Niharika Dutta Date: Mon, 12 Oct 2020 07:13:43 -0700 Subject: [PATCH 18/20] removing duplicate docs --- README.md | 10 +- docs/building/ubuntu-instructions.md | 218 ---------------------- docs/building/windows-instructions.md | 250 -------------------------- docs/deploy-worker-udf-binaries.md | 115 ------------ docs/features.md | 1 - docs/how-to-guides.md | 16 -- 6 files changed, 5 insertions(+), 605 deletions(-) delete mode 100644 docs/building/ubuntu-instructions.md delete mode 100644 docs/building/windows-instructions.md delete mode 100644 docs/deploy-worker-udf-binaries.md delete mode 100644 docs/features.md delete mode 100644 docs/how-to-guides.md diff --git a/README.md b/README.md index 8592c4464..b9d4f00b1 100644 --- a/README.md +++ b/README.md @@ -8,7 +8,7 @@ .NET for Apache Spark is compliant with .NET Standard - a formal specification of .NET APIs that are common across .NET implementations. This means you can use .NET for Apache Spark anywhere you write .NET code allowing you to reuse all the knowledge, skills, code, and libraries you already have as a .NET developer. -.NET for Apache Spark runs on Windows, Linux, and macOS using .NET Core, or Windows using .NET Framework. It also runs on all major cloud providers including [Azure HDInsight Spark](deployment/README.md#azure-hdinsight-spark), [Amazon EMR Spark](deployment/README.md#amazon-emr-spark), [AWS](deployment/README.md#databricks), [Azure Databricks](deployment/README.md#databricks) & [Azure Synapse Analytics](https://azure.microsoft.com/en-us/services/synapse-analytics/). +.NET for Apache Spark runs on Windows, Linux, and macOS using .NET Core, or Windows using .NET Framework. It also runs on all major cloud providers including [Azure HDInsight Spark](https://docs.microsoft.com/en-us/dotnet/spark/tutorials/hdinsight-deployment), [Amazon EMR Spark](https://docs.microsoft.com/en-us/dotnet/spark/tutorials/amazon-emr-spark-deployment), [AWS](deployment/README.md#databricks), [Azure Databricks](https://docs.microsoft.com/en-us/dotnet/spark/tutorials/databricks-deployment) & [Azure Synapse Analytics](https://azure.microsoft.com/en-us/services/synapse-analytics/). **Note**: We currently have a Spark Project Improvement Proposal JIRA at [SPIP: .NET bindings for Apache Spark](https://issues.apache.org/jira/browse/SPARK-27006) to work with the community towards getting .NET support by default into Apache Spark. We highly encourage you to participate in the discussion. @@ -39,7 +39,7 @@ 2.3.* - v1.0 + v0.12.1 2.4.0 @@ -98,8 +98,8 @@ Building from source is very easy and the whole process (from cloning to being a | | | Instructions | | :---: | :--- | :--- | -| ![Windows icon](docs/img/windows-icon-32.png) | **Windows** |
  • Local - [.NET Framework 4.6.1](docs/building/windows-instructions.md#using-visual-studio-for-net-framework-461)
  • Local - [.NET Core 3.1](docs/building/windows-instructions.md#using-net-core-cli-for-net-core)
    • | -| ![Ubuntu icon](docs/img/ubuntu-icon-32.png) | **Ubuntu** |
      • Local - [.NET Core 3.1](docs/building/ubuntu-instructions.md)
      • [Azure HDInsight Spark - .NET Core 3.1](deployment/README.md)
      | +| ![Windows icon](docs/img/windows-icon-32.png) | **Windows** |
      • Local - [.NET Framework 4.6.1](https://docs.microsoft.com/en-us/dotnet/spark/how-to-guides/windows-instructions#using-visual-studio-for-net-framework)
      • Local - [.NET Core 3.1](https://docs.microsoft.com/en-us/dotnet/spark/how-to-guides/windows-instructions#using-net-core-cli-for-net-core)
        • | +| ![Ubuntu icon](docs/img/ubuntu-icon-32.png) | **Ubuntu** |
          • Local - [.NET Core 3.1](https://docs.microsoft.com/en-us/dotnet/spark/how-to-guides/ubuntu-instructions)
          • [Azure HDInsight Spark - .NET Core 3.1](https://docs.microsoft.com/en-us/dotnet/spark/tutorials/hdinsight-deployment)
          | ## Samples @@ -169,7 +169,7 @@ We welcome contributions to both categories! ## Learn More -To learn more about some features of .NET for Apache Spark, please visit [this compilation of How-To guides](docs/how-to-guides.md). +To learn more about some features of .NET for Apache Spark, please visit [the official .NET documentation]](https://docs.microsoft.com/en-us/dotnet/spark/). ## Contributing diff --git a/docs/building/ubuntu-instructions.md b/docs/building/ubuntu-instructions.md deleted file mode 100644 index 6bf51c333..000000000 --- a/docs/building/ubuntu-instructions.md +++ /dev/null @@ -1,218 +0,0 @@ -Building Spark .NET on Ubuntu 18.04 -========================== - -# Table of Contents -- [Open Issues](#open-issues) -- [Prerequisites](#prerequisites) -- [Building](#building) - - [Building Spark .NET Scala Extensions Layer](#building-spark-net-scala-extensions-layer) - - [Building .NET Sample Applications using .NET Core CLI](#building-net-sample-applications-using-net-core-cli) -- [Run Samples](#run-samples) - -# Open Issues: -- [Building through Visual Studio Code]() - -# Prerequisites: - -If you already have all the prerequisites, skip to the [build](ubuntu-instructions.md#building) steps below. - - 1. Download and install **[.NET Core 3.1 SDK](https://dotnet.microsoft.com/download/dotnet-core/3.1)** - installing the SDK will add the `dotnet` toolchain to your path. - 2. Install **[OpenJDK 8](https://openjdk.java.net/install/)** . - - You can use the following command: - ```bash - sudo apt install openjdk-8-jdk - ``` - - Verify you are able to run `java` from your command-line. -
          - 📙 Click to see sample java -version output - - ``` - openjdk version "1.8.0_191" - OpenJDK Runtime Environment (build 1.8.0_191-8u191-b12-2ubuntu0.18.04.1-b12) - OpenJDK 64-Bit Server VM (build 25.191-b12, mixed mode) - ``` - - If you already have multiple OpenJDK versions installed and want to select OpenJDK 8, use the following command: - ```bash - sudo update-alternatives --config java - ``` - 3. Install **[Apache Maven 3.6.3+](https://maven.apache.org/download.cgi)** - - Run the following command: - ```bash - mkdir -p ~/bin/maven - cd ~/bin/maven - wget https://www-us.apache.org/dist/maven/maven-3/3.6.3/binaries/apache-maven-3.6.3-bin.tar.gz - tar -xvzf apache-maven-3.6.3-bin.tar.gz - ln -s apache-maven-3.6.3 current - export M2_HOME=~/bin/maven/current - export PATH=${M2_HOME}/bin:${PATH} - source ~/.bashrc - ``` - - Note that these environment variables will be lost when you close your terminal. If you want the changes to be permanent, add the `export` lines to your `~/.bashrc` file. - - Verify you are able to run `mvn` from your command-line. -
          - 📙 Click to see sample mvn -version output - - ``` - Apache Maven 3.6.3 (cecedd343002696d0abb50b32b541b8a6ba2883f) - Maven home: ~/bin/apache-maven-3.6.3 - Java version: 1.8.0_242, vendor: Oracle Corporation, runtime: /usr/lib/jvm/java-8-openjdk-amd64/jre - Default locale: en_US, platform encoding: ANSI_X3.4-1968 - OS name: "linux", version: "4.4.0-142-generic", arch: "amd64", family: "unix" - ``` - 4. Install **[Apache Spark 2.3+](https://spark.apache.org/downloads.html)** - - Download [Apache Spark 2.3+](https://spark.apache.org/downloads.html) and extract it into a local folder (e.g., `~/bin/spark-2.3.2-bin-hadoop2.7`). - - Add the necessary [environment variables](https://www.java.com/en/download/help/path.xml) `SPARK_HOME` to point to the local directory where you installed Apache Spark e.g., `~/bin/spark-2.3.2-bin-hadoop2.7/`. - ```bash - export SPARK_HOME=~/bin/spark-2.3.2-hadoop2.7 - export PATH="$SPARK_HOME/bin:$PATH" - source ~/.bashrc - ``` - - Note that these environment variables will be lost when you close your terminal. If you want the changes to be permanent, add the `export` lines to your `~/.bashrc` file. - - Verify you are able to run `spark-shell` from your command-line -
          - 📙 Click to see sample console output - - ``` - Welcome to - ____ __ - / __/__ ___ _____/ /__ - _\ \/ _ \/ _ `/ __/ '_/ - /___/ .__/\_,_/_/ /_/\_\ version 2.3.2 - /_/ - - Using Scala version 2.11.8 (Java HotSpot(TM) 64-Bit Server VM, Java 1.8.0_201) - Type in expressions to have them evaluated. - Type :help for more information. - - scala> sc - res0: org.apache.spark.SparkContext = org.apache.spark.SparkContext@6eaa6b0c - ``` - -
          - -Please make sure you are able to run `dotnet`, `java`, `mvn`, `spark-shell` from your command-line before you move to the next section. Feel there is a better way? Please [open an issue](https://github.com/dotnet/spark/issues) and feel free to contribute. - -# Building - -For the rest of the section, it is assumed that you have cloned Spark .NET repo into your machine e.g., `~/dotnet.spark/`. - -``` -git clone https://github.com/dotnet/spark.git ~/dotnet.spark -``` - -## Building Spark .NET Scala Extensions Layer - -When you submit a .NET application, Spark .NET has the necessary logic written in Scala that informs Apache Spark how to handle your requests (e.g., request to create a new Spark Session, request to transfer data from .NET side to JVM side etc.). This logic can be found in the [Spark .NET Scala Source Code](../../src/scala). - -Let us now build the Spark .NET Scala extension layer. This is easy to do: - -``` -cd src/scala -mvn clean package -``` -You should see JARs created for the supported Spark versions: -* `microsoft-spark-2.3.x/target/microsoft-spark-2.3.x-.jar` -* `microsoft-spark-2.4.x/target/microsoft-spark-2.4.x-.jar` - -## Building .NET Sample Applications using .NET Core CLI - - 1. Build the Worker - ```bash - cd ~/dotnet.spark/src/csharp/Microsoft.Spark.Worker/ - dotnet publish -f netcoreapp3.1 -r linux-x64 - ``` -
          - 📙 Click to see sample console output - - ```bash - user@machine:/home/user/dotnet.spark/src/csharp/Microsoft.Spark.Worker$ dotnet publish -f netcoreapp3.1 -r linux-x64 - Microsoft (R) Build Engine version 16.0.462+g62fb89029d for .NET Core - Copyright (C) Microsoft Corporation. All rights reserved. - - Restore completed in 36.03 ms for /home/user/dotnet.spark/src/csharp/Microsoft.Spark.Worker/Microsoft.Spark.Worker.csproj. - Restore completed in 35.94 ms for /home/user/dotnet.spark/src/csharp/Microsoft.Spark/Microsoft.Spark.csproj. - Microsoft.Spark -> /home/user/dotnet.spark/artifacts/bin/Microsoft.Spark/Debug/netstandard2.0/Microsoft.Spark.dll - Microsoft.Spark.Worker -> /home/user/dotnet.spark/artifacts/bin/Microsoft.Spark.Worker/Debug/netcoreapp3.1/linux-x64/Microsoft.Spark.Worker.dll - Microsoft.Spark.Worker -> /home/user/dotnet.spark/artifacts/bin/Microsoft.Spark.Worker/Debug/netcoreapp3.1/linux-x64/publish/ - ``` - -
          - - 2. Build the Samples - ```bash - cd ~/dotnet.spark/examples/Microsoft.Spark.CSharp.Examples/ - dotnet publish -f netcoreapp3.1 -r linux-x64 - ``` -
          - 📙 Click to see sample console output - - ```bash - user@machine:/home/user/dotnet.spark/examples/Microsoft.Spark.CSharp.Examples$ dotnet publish -f netcoreapp3.1 -r linux-x64 - Microsoft (R) Build Engine version 16.0.462+g62fb89029d for .NET Core - Copyright (C) Microsoft Corporation. All rights reserved. - - Restore completed in 37.11 ms for /home/user/dotnet.spark/src/csharp/Microsoft.Spark/Microsoft.Spark.csproj. - Restore completed in 281.63 ms for /home/user/dotnet.spark/examples/Microsoft.Spark.CSharp.Examples/Microsoft.Spark.CSharp.Examples.csproj. - Microsoft.Spark -> /home/user/dotnet.spark/artifacts/bin/Microsoft.Spark/Debug/netstandard2.0/Microsoft.Spark.dll - Microsoft.Spark.CSharp.Examples -> /home/user/dotnet.spark/artifacts/bin/Microsoft.Spark.CSharp.Examples/Debug/netcoreapp3.1/linux-x64/Microsoft.Spark.CSharp.Examples.dll - Microsoft.Spark.CSharp.Examples -> /home/user/dotnet.spark/artifacts/bin/Microsoft.Spark.CSharp.Examples/Debug/netcoreapp3.1/linux-x64/publish/ - ``` - -
          - -# Run Samples - -Once you build the samples, you can use `spark-submit` to submit your .NET Core apps. Make sure you have followed the [prerequisites](#prerequisites) section and installed Apache Spark. - - 1. Set the `DOTNET_WORKER_DIR` or `PATH` environment variable to include the path where the `Microsoft.Spark.Worker` binary has been generated (e.g., `~/dotnet.spark/artifacts/bin/Microsoft.Spark.Worker/Debug/netcoreapp3.1/linux-x64/publish`) - 2. Open a terminal and go to the directory where your app binary has been generated (e.g., `~/dotnet.spark/artifacts/bin/Microsoft.Spark.CSharp.Examples/Debug/netcoreapp3.1/linux-x64/publish`) - 3. Running your app follows the basic structure: - ```bash - spark-submit \ - [--jars ] \ - --conf = \ - --class org.apache.spark.deploy.dotnet.DotnetRunner \ - --master local \ - \ - - ``` - - Here are some examples you can run: - - **[Microsoft.Spark.Examples.Sql.Batch.Basic](../../examples/Microsoft.Spark.CSharp.Examples/Sql/Batch/Basic.cs)** - ```bash - spark-submit \ - --class org.apache.spark.deploy.dotnet.DotnetRunner \ - --master local \ - ~/dotnet.spark/src/scala/microsoft-spark-/target/microsoft-spark-.jar \ - ./Microsoft.Spark.CSharp.Examples Sql.Batch.Basic $SPARK_HOME/examples/src/main/resources/people.json - ``` - - **[Microsoft.Spark.Examples.Sql.Streaming.StructuredNetworkWordCount](../../examples/Microsoft.Spark.CSharp.Examples/Sql/Streaming/StructuredNetworkWordCount.cs)** - ```bash - spark-submit \ - --class org.apache.spark.deploy.dotnet.DotnetRunner \ - --master local \ - ~/dotnet.spark/src/scala/microsoft-spark-/target/microsoft-spark-.jar \ - ./Microsoft.Spark.CSharp.Examples Sql.Streaming.StructuredNetworkWordCount localhost 9999 - ``` - - **[Microsoft.Spark.Examples.Sql.Streaming.StructuredKafkaWordCount (maven accessible)](../../examples/Microsoft.Spark.CSharp.Examples/Sql/Streaming/StructuredKafkaWordCount.cs)** - ```bash - spark-submit \ - --packages org.apache.spark:spark-sql-kafka-0-10_2.11:2.3.2 \ - --class org.apache.spark.deploy.dotnet.DotnetRunner \ - --master local \ - ~/dotnet.spark/src/scala/microsoft-spark-/target/microsoft-spark-.jar \ - ./Microsoft.Spark.CSharp.Examples Sql.Streaming.StructuredKafkaWordCount localhost:9092 subscribe test - ``` - - **[Microsoft.Spark.Examples.Sql.Streaming.StructuredKafkaWordCount (jars provided)](../../examples/Microsoft.Spark.CSharp.Examples/Sql/Streaming/StructuredKafkaWordCount.cs)** - ```bash - spark-submit \ - --jars path/to/net.jpountz.lz4/lz4-1.3.0.jar,path/to/org.apache.kafka/kafka-clients-0.10.0.1.jar,path/to/org.apache.spark/spark-sql-kafka-0-10_2.11-2.3.2.jar,`path/to/org.slf4j/slf4j-api-1.7.6.jar,path/to/org.spark-project.spark/unused-1.0.0.jar,path/to/org.xerial.snappy/snappy-java-1.1.2.6.jar \ - --class org.apache.spark.deploy.dotnet.DotnetRunner \ - --master local \ - ~/dotnet.spark/src/scala/microsoft-spark-/target/microsoft-spark-.jar \ - ./Microsoft.Spark.CSharp.Examples Sql.Streaming.StructuredKafkaWordCount localhost:9092 subscribe test - ``` - -Feel this experience is complicated? Help us by taking up [Simplify User Experience for Running an App](https://github.com/dotnet/spark/issues/6). 
diff --git a/docs/building/windows-instructions.md b/docs/building/windows-instructions.md deleted file mode 100644 index d3ec922f2..000000000 --- a/docs/building/windows-instructions.md +++ /dev/null @@ -1,250 +0,0 @@ -Building Spark .NET on Windows -========================== - -# Table of Contents -- [Open Issues](#open-issues) -- [Prerequisites](#prerequisites) -- [Building](#building) - - [Building Spark .NET Scala Extensions Layer](#building-spark-net-scala-extensions-layer) - - [Building .NET Samples Application](#building-net-samples-application) - - [Using Visual Studio for .NET Framework](#using-visual-studio-for-net-framework) - - [Using .NET Core CLI for .NET Core](#using-net-core-cli-for-net-core) -- [Run Samples](#run-samples) - -# Open Issues: -- [Allow users to choose which .NET framework to build for]() -- [Building through Visual Studio Code]() -- [Building fully automatically through .NET Core CLI]() - -# Prerequisites: - -If you already have all the prerequisites, skip to the [build](windows-instructions.md#building) steps below. - - 1. Download and install the **[.NET Core 3.1 SDK](https://dotnet.microsoft.com/download/dotnet-core/3.1)** - installing the SDK will add the `dotnet` toolchain to your path. - 2. Install **[Visual Studio 2019](https://www.visualstudio.com/downloads/)** (Version 16.4 or later). The Community version is completely free. When configuring your installation, include these components at minimum: - * .NET desktop development - * All Required Components - * .NET Framework 4.6.1 Development Tools - * .NET Core cross-platform development - * All Required Components - 3. Install **[Java 1.8](https://www.oracle.com/technetwork/java/javase/downloads/jdk8-downloads-2133151.html)** - - Select the appropriate version for your operating system e.g., jdk-8u201-windows-x64.exe for Win x64 machine. - - Install using the installer and verify you are able to run `java` from your command-line. - 4. Install **[Apache Maven 3.6.3+](https://maven.apache.org/download.cgi)** - - Download [Apache Maven 3.6.3](http://mirror.metrocast.net/apache/maven/maven-3/3.6.3/binaries/apache-maven-3.6.3-bin.zip). - - Extract to a local directory e.g., `c:\bin\apache-maven-3.6.3\`. - - Add Apache Maven to your [PATH environment variable](https://www.java.com/en/download/help/path.xml) e.g., `c:\bin\apache-maven-3.6.3\bin`. - - Verify you are able to run `mvn` from your command-line. - 5. Install **[Apache Spark 2.3+](https://spark.apache.org/downloads.html)** - - Download [Apache Spark 2.3+](https://spark.apache.org/downloads.html) and extract it into a local folder (e.g., `c:\bin\spark-2.3.2-bin-hadoop2.7\`) using [7-zip](https://www.7-zip.org/). - - Add Apache Spark to your [PATH environment variable](https://www.java.com/en/download/help/path.xml) e.g., `c:\bin\spark-2.3.2-bin-hadoop2.7\bin`. - - Add a [new environment variable](https://www.java.com/en/download/help/path.xml) `SPARK_HOME` and point it to the directory you downloaded Apache Spark to, for e.g., `C:\bin\spark-2.3.2-bin-hadoop2.7\`. - - Verify you are able to run `spark-shell` from your command-line. -
          - 📙 Click to see sample console output - - ``` - Welcome to - ____ __ - / __/__ ___ _____/ /__ - _\ \/ _ \/ _ `/ __/ '_/ - /___/ .__/\_,_/_/ /_/\_\ version 2.3.2 - /_/ - - Using Scala version 2.11.8 (Java HotSpot(TM) 64-Bit Server VM, Java 1.8.0_201) - Type in expressions to have them evaluated. - Type :help for more information. - - scala> sc - res0: org.apache.spark.SparkContext = org.apache.spark.SparkContext@6eaa6b0c - ``` - -
          - - 6. Install **[WinUtils](https://github.com/steveloughran/winutils)** - - Download `winutils.exe` binary from [WinUtils repository](https://github.com/steveloughran/winutils). You should select the version of Hadoop the Spark distribution was compiled with, e.g. use hadoop-2.7.1 for Spark 2.3.2. - - Save `winutils.exe` binary to a directory of your choice e.g., `c:\hadoop\bin`. - - Set `HADOOP_HOME` to reflect the directory with winutils.exe (without bin). For instance, using command-line: - ```powershell - set HADOOP_HOME=c:\hadoop - ``` - - Set PATH environment variable to include `%HADOOP_HOME%\bin`. For instance, using command-line: - ```powershell - set PATH=%HADOOP_HOME%\bin;%PATH% - ``` - - -Please make sure you are able to run `dotnet`, `java`, `mvn`, `spark-shell` from your command-line before you move to the next section. Feel there is a better way? Please [open an issue](https://github.com/dotnet/spark/issues) and feel free to contribute. - -> **Note**: A new instance of the command-line may be required if any environment variables were updated. - -# Building - -For the rest of the section, it is assumed that you have cloned Spark .NET repo into your machine e.g., `c:\github\dotnet-spark\`. - -```powershell -git clone https://github.com/dotnet/spark.git c:\github\dotnet-spark -``` - -## Building Spark .NET Scala Extensions Layer - -When you submit a .NET application, Spark .NET has the necessary logic written in Scala that informs Apache Spark how to handle your requests (e.g., request to create a new Spark Session, request to transfer data from .NET side to JVM side etc.). This logic can be found in the [Spark .NET Scala Source Code](../../src/scala). - -Regardless of whether you are using .NET Framework or .NET Core, you will need to build the Spark .NET Scala extension layer. This is easy to do: - -```powershell -cd src\scala -mvn clean package -``` -You should see JARs created for the supported Spark versions: -* `microsoft-spark-2.3.x\target\microsoft-spark-2.3.x-.jar` -* `microsoft-spark-2.4.x\target\microsoft-spark-2.4.x-.jar` - -## Building .NET Samples Application - -### Using Visual Studio for .NET Framework - - 1. Open `src\csharp\Microsoft.Spark.sln` in Visual Studio and build the `Microsoft.Spark.CSharp.Examples` project under the `examples` folder (this will in turn build the .NET bindings project as well). If you want, you can write your own code in the `Microsoft.Spark.Examples` project: - - ```csharp - // Instantiate a session - var spark = SparkSession - .Builder() - .AppName("Hello Spark!") - .GetOrCreate(); - - var df = spark.Read().Json(args[0]); - - // Print schema - df.PrintSchema(); - - // Apply a filter and show results - df.Filter(df["age"] > 21).Show(); - ``` - Once the build is successfuly, you will see the appropriate binaries produced in the output directory. -
          - 📙 Click to see sample console output - - ```powershell - Directory: C:\github\dotnet-spark\artifacts\bin\Microsoft.Spark.CSharp.Examples\Debug\net461 - - - Mode LastWriteTime Length Name - ---- ------------- ------ ---- - -a---- 3/6/2019 12:18 AM 125440 Apache.Arrow.dll - -a---- 3/16/2019 12:00 AM 13824 Microsoft.Spark.CSharp.Examples.exe - -a---- 3/16/2019 12:00 AM 19423 Microsoft.Spark.CSharp.Examples.exe.config - -a---- 3/16/2019 12:00 AM 2720 Microsoft.Spark.CSharp.Examples.pdb - -a---- 3/16/2019 12:00 AM 143360 Microsoft.Spark.dll - -a---- 3/16/2019 12:00 AM 63388 Microsoft.Spark.pdb - -a---- 3/16/2019 12:00 AM 34304 Microsoft.Spark.Worker.exe - -a---- 3/16/2019 12:00 AM 19423 Microsoft.Spark.Worker.exe.config - -a---- 3/16/2019 12:00 AM 11900 Microsoft.Spark.Worker.pdb - -a---- 3/16/2019 12:00 AM 23552 Microsoft.Spark.Worker.xml - -a---- 3/16/2019 12:00 AM 332363 Microsoft.Spark.xml - ------------------------------------------- More framework files ------------------------------------- - ``` - -
          - -### Using .NET Core CLI for .NET Core - -> Note: We are currently working on automating .NET Core builds for Spark .NET. Until then, we appreciate your patience in performing some of the steps manually. - - 1. Build the Worker - ```powershell - cd C:\github\dotnet-spark\src\csharp\Microsoft.Spark.Worker\ - dotnet publish -f netcoreapp3.1 -r win-x64 - ``` -
          - 📙 Click to see sample console output - - ```powershell - PS C:\github\dotnet-spark\src\csharp\Microsoft.Spark.Worker> dotnet publish -f netcoreapp3.1 -r win-x64 - Microsoft (R) Build Engine version 16.0.462+g62fb89029d for .NET Core - Copyright (C) Microsoft Corporation. All rights reserved. - - Restore completed in 299.95 ms for C:\github\dotnet-spark\src\csharp\Microsoft.Spark\Microsoft.Spark.csproj. - Restore completed in 306.62 ms for C:\github\dotnet-spark\src\csharp\Microsoft.Spark.Worker\Microsoft.Spark.Worker.csproj. - Microsoft.Spark -> C:\github\dotnet-spark\artifacts\bin\Microsoft.Spark\Debug\netstandard2.0\Microsoft.Spark.dll - Microsoft.Spark.Worker -> C:\github\dotnet-spark\artifacts\bin\Microsoft.Spark.Worker\Debug\netcoreapp3.1\win-x64\Microsoft.Spark.Worker.dll - Microsoft.Spark.Worker -> C:\github\dotnet-spark\artifacts\bin\Microsoft.Spark.Worker\Debug\netcoreapp3.1\win-x64\publish\ - ``` - -
          - 2. Build the Samples - ```powershell - cd C:\github\dotnet-spark\examples\Microsoft.Spark.CSharp.Examples\ - dotnet publish -f netcoreapp3.1 -r win-x64 - ``` -
          - 📙 Click to see sample console output - - ```powershell - PS C:\github\dotnet-spark\examples\Microsoft.Spark.CSharp.Examples> dotnet publish -f netcoreapp3.1 -r win10-x64 - Microsoft (R) Build Engine version 16.0.462+g62fb89029d for .NET Core - Copyright (C) Microsoft Corporation. All rights reserved. - - Restore completed in 44.22 ms for C:\github\dotnet-spark\src\csharp\Microsoft.Spark\Microsoft.Spark.csproj. - Restore completed in 336.94 ms for C:\github\dotnet-spark\examples\Microsoft.Spark.CSharp.Examples\Microsoft.Spark.CSharp.Examples.csproj. - Microsoft.Spark -> C:\github\dotnet-spark\artifacts\bin\Microsoft.Spark\Debug\netstandard2.0\Microsoft.Spark.dll - Microsoft.Spark.CSharp.Examples -> C:\github\dotnet-spark\artifacts\bin\Microsoft.Spark.CSharp.Examples\Debug\netcoreapp3.1\win-x64\Microsoft.Spark.CSharp.Examples.dll - Microsoft.Spark.CSharp.Examples -> C:\github\dotnet-spark\artifacts\bin\Microsoft.Spark.CSharp.Examples\Debug\netcoreapp3.1\win-x64\publish\ - ``` - -
          - -# Run Samples - -Once you build the samples, running them will be through `spark-submit` regardless of whether you are targeting .NET Framework or .NET Core apps. Make sure you have followed the [prerequisites](#prerequisites) section and installed Apache Spark. - - 1. Set the `DOTNET_WORKER_DIR` or `PATH` environment variable to include the path where the `Microsoft.Spark.Worker` binary has been generated (e.g., `c:\github\dotnet\spark\artifacts\bin\Microsoft.Spark.Worker\Debug\net461` for .NET Framework, `c:\github\dotnet-spark\artifacts\bin\Microsoft.Spark.Worker\Debug\netcoreapp3.1\win-x64\publish` for .NET Core) - 2. Open Powershell and go to the directory where your app binary has been generated (e.g., `c:\github\dotnet\spark\artifacts\bin\Microsoft.Spark.CSharp.Examples\Debug\net461` for .NET Framework, `c:\github\dotnet-spark\artifacts\bin\Microsoft.Spark.CSharp.Examples\Debug\netcoreapp3.1\win1-x64\publish` for .NET Core) - 3. Running your app follows the basic structure: - ```powershell - spark-submit.cmd ` - [--jars ] ` - --conf = ` - --class org.apache.spark.deploy.dotnet.DotnetRunner ` - --master local ` - ` - - ``` - - Here are some examples you can run: - - **[Microsoft.Spark.Examples.Sql.Batch.Basic](../../examples/Microsoft.Spark.CSharp.Examples/Sql/Batch/Basic.cs)** - ```powershell - spark-submit.cmd ` - --class org.apache.spark.deploy.dotnet.DotnetRunner ` - --master local ` - C:\github\dotnet-spark\src\scala\microsoft-spark-\target\microsoft-spark-.jar ` - Microsoft.Spark.CSharp.Examples.exe Sql.Batch.Basic %SPARK_HOME%\examples\src\main\resources\people.json - ``` - - **[Microsoft.Spark.Examples.Sql.Streaming.StructuredNetworkWordCount](../../examples/Microsoft.Spark.CSharp.Examples/Sql/Streaming/StructuredNetworkWordCount.cs)** - ```powershell - spark-submit.cmd ` - --class org.apache.spark.deploy.dotnet.DotnetRunner ` - --master local ` - C:\github\dotnet-spark\src\scala\microsoft-spark-\target\microsoft-spark-.jar ` - Microsoft.Spark.CSharp.Examples.exe Sql.Streaming.StructuredNetworkWordCount localhost 9999 - ``` - - **[Microsoft.Spark.Examples.Sql.Streaming.StructuredKafkaWordCount (maven accessible)](../../examples/Microsoft.Spark.CSharp.Examples/Sql/Streaming/StructuredKafkaWordCount.cs)** - ```powershell - spark-submit.cmd ` - --packages org.apache.spark:spark-sql-kafka-0-10_2.11:2.3.2 ` - --class org.apache.spark.deploy.dotnet.DotnetRunner ` - --master local ` - C:\github\dotnet-spark\src\scala\microsoft-spark-\target\microsoft-spark-.jar ` - Microsoft.Spark.CSharp.Examples.exe Sql.Streaming.StructuredKafkaWordCount localhost:9092 subscribe test - ``` - - **[Microsoft.Spark.Examples.Sql.Streaming.StructuredKafkaWordCount (jars provided)](../../examples/Microsoft.Spark.CSharp.Examples/Sql/Streaming/StructuredKafkaWordCount.cs)** - ```powershell - spark-submit.cmd - --jars path\to\net.jpountz.lz4\lz4-1.3.0.jar,path\to\org.apache.kafka\kafka-clients-0.10.0.1.jar,path\to\org.apache.spark\spark-sql-kafka-0-10_2.11-2.3.2.jar,`path\to\org.slf4j\slf4j-api-1.7.6.jar,path\to\org.spark-project.spark\unused-1.0.0.jar,path\to\org.xerial.snappy\snappy-java-1.1.2.6.jar ` - --class org.apache.spark.deploy.dotnet.DotnetRunner ` - --master local ` - C:\github\dotnet-spark\src\scala\microsoft-spark-\target\microsoft-spark-.jar ` - Microsoft.Spark.CSharp.Examples.exe Sql.Streaming.StructuredKafkaWordCount localhost:9092 subscribe test - ``` - -Feel this experience is complicated? 
Help us by taking up [Simplify User Experience for Running an App](https://github.com/dotnet/spark/issues/6). diff --git a/docs/deploy-worker-udf-binaries.md b/docs/deploy-worker-udf-binaries.md deleted file mode 100644 index 7887d2c37..000000000 --- a/docs/deploy-worker-udf-binaries.md +++ /dev/null @@ -1,115 +0,0 @@ -# Deploy Worker and UDF Binaries General Instruction - -This how-to provides general instructions on how to deploy Worker and UDF (User-Defined Function) binaries, -including which Environment Variables to set up and some commonly used parameters -when launching applications with `spark-submit`. - -## Configurations - -### 1. Environment Variables -When deploying workers and writing UDFs, there are a few commonly used environment variables that you may need to set: - - - - - - - - - - - - - - - - - - -
| Environment Variable | Description |
| :--- | :--- |
| `DOTNET_WORKER_DIR` | Path where the `Microsoft.Spark.Worker` binary has been generated.<br>It's used by the Spark driver and will be passed to Spark executors. If this variable is not set up, the Spark executors will search the path specified in the `PATH` environment variable.<br>e.g. `"C:\bin\Microsoft.Spark.Worker"` |
| `DOTNET_ASSEMBLY_SEARCH_PATHS` | Comma-separated paths where `Microsoft.Spark.Worker` will load assemblies.<br>Note that if a path starts with ".", the working directory will be prepended. If in `yarn` mode, "." would represent the container's working directory.<br>e.g. `"C:\Users\<user name>\<mysparkapp>\bin\Debug\<dotnet version>"` |
| `DOTNET_WORKER_DEBUG` | If you want to debug a UDF, then set this environment variable to `1` before running `spark-submit`. |
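For example, on a Linux machine these variables could be set in the shell before calling `spark-submit`. The following is only an illustrative sketch; the paths are placeholders rather than values from this guide:

```bash
# Illustrative sketch: adjust every path to your own machine.
# Directory containing the published Microsoft.Spark.Worker binaries.
export DOTNET_WORKER_DIR=~/bin/Microsoft.Spark.Worker

# Directory (or comma-separated directories) containing your application assemblies.
export DOTNET_ASSEMBLY_SEARCH_PATHS=~/mySparkApp/bin/Debug/netcoreapp3.1

# Optional: set to 1 only when you want to debug a UDF.
export DOTNET_WORKER_DEBUG=1
```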
### 2. Parameter Options

Once the Spark application is [bundled](https://spark.apache.org/docs/latest/submitting-applications.html#bundling-your-applications-dependencies), you can launch it using `spark-submit`. The following table shows some of the commonly used options:

| Parameter Name | Description |
| :--- | :--- |
| `--class` | The entry point for your application.<br>e.g. `org.apache.spark.deploy.dotnet.DotnetRunner` |
| `--master` | The master URL for the cluster.<br>e.g. `yarn` |
| `--deploy-mode` | Whether to deploy your driver on the worker nodes (`cluster`) or locally as an external client (`client`).<br>Default: `client` |
| `--conf` | Arbitrary Spark configuration property in `key=value` format.<br>e.g. `spark.yarn.appMasterEnv.DOTNET_WORKER_DIR=.\worker\Microsoft.Spark.Worker` |
| `--files` | Comma-separated list of files to be placed in the working directory of each executor.<br>• Please note that this option is only applicable in yarn mode.<br>• It supports specifying file names with `#` similar to Hadoop.<br>e.g. `myLocalSparkApp.dll#appSeen.dll`: your application should use the name `appSeen.dll` to reference `myLocalSparkApp.dll` when running on YARN. |
| `--archives` | Comma-separated list of archives to be extracted into the working directory of each executor.<br>• Please note that this option is only applicable in yarn mode.<br>• It supports specifying file names with `#` similar to Hadoop.<br>e.g. `hdfs://<path to your worker file>/Microsoft.Spark.Worker.zip#worker`: this will copy and extract the zip file to the `worker` folder. |
| `application-jar` | Path to a bundled jar including your application and all dependencies.<br>e.g. `hdfs://<path to your jar>/microsoft-spark-<version>.jar` |
| `application-arguments` | Arguments passed to the main method of your main class, if any.<br>e.g. `hdfs://<path to your app>/<your app>.zip <your app name> <app args>` |
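As a quick illustration of how these options fit together, a minimal local-mode launch might look like the sketch below; the jar version and application name are placeholders, not values taken from this guide:

```bash
# Illustrative sketch: substitute your own jar version and application binary.
spark-submit \
  --class org.apache.spark.deploy.dotnet.DotnetRunner \
  --master local \
  microsoft-spark-<version>.jar \
  dotnet mySparkApp.dll
```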
          - -> Note: Please specify all the `--options` before `application-jar` when launching applications with `spark-submit`, otherwise they will be ignored. Please see more `spark-submit` options [here](https://spark.apache.org/docs/latest/submitting-applications.html) and running spark on YARN details [here](https://spark.apache.org/docs/latest/running-on-yarn.html). - -## FAQ -#### 1. Question: When I run a spark app with UDFs, I get the following error. What should I do? -> **Error:** [ ] [ ] [Error] [TaskRunner] [0] ProcessStream() failed with exception: System.IO.FileNotFoundException: Assembly 'mySparkApp, Version=1.0.0.0, Culture=neutral, PublicKeyToken=null' file not found: 'mySparkApp.dll' - -**Answer:** Please check if the `DOTNET_ASSEMBLY_SEARCH_PATHS` environment variable is set correctly. It should be the path that contains your `mySparkApp.dll`. - -#### 2. Question: After I upgraded my Spark Dotnet version and reset the `DOTNET_WORKER_DIR` environment variable, why do I still get the following error? -> **Error:** Lost task 0.0 in stage 11.0 (TID 24, localhost, executor driver): java.io.IOException: Cannot run program "Microsoft.Spark.Worker.exe": CreateProcess error=2, The system cannot find the file specified. - -**Answer:** Please try restarting your PowerShell window (or other command windows) first so that it can take the latest environment variable values. Then start your program. - -#### 3. Question: After submitting my Spark application, I get the error `System.TypeLoadException: Could not load type 'System.Runtime.Remoting.Contexts.Context'`. -> **Error:** [ ] [ ] [Error] [TaskRunner] [0] ProcessStream() failed with exception: System.TypeLoadException: Could not load type 'System.Runtime.Remoting.Contexts.Context' from assembly 'mscorlib, Version=4.0.0.0, Culture=neutral, PublicKeyToken=...'. - -**Answer:** Please check the `Microsoft.Spark.Worker` version you are using. We currently provide two versions: **.NET Framework 4.6.1** and **.NET Core 2.1.x**. In this case, `Microsoft.Spark.Worker.net461.win-x64-` (which you can download [here](https://github.com/dotnet/spark/releases)) should be used since `System.Runtime.Remoting.Contexts.Context` is only for .NET Framework. - -#### 4. Question: How to run my spark application with UDFs on YARN? Which environment variables and parameters should I use? - -**Answer:** To launch the spark application on YARN, the environment variables should be specified as `spark.yarn.appMasterEnv.[EnvironmentVariableName]`. Please see below as an example using `spark-submit`: -```shell -spark-submit \ ---class org.apache.spark.deploy.dotnet.DotnetRunner \ ---master yarn \ ---deploy-mode cluster \ ---conf spark.yarn.appMasterEnv.DOTNET_WORKER_DIR=./worker/Microsoft.Spark.Worker- \ ---conf spark.yarn.appMasterEnv.DOTNET_ASSEMBLY_SEARCH_PATHS=./udfs \ ---archives hdfs:///Microsoft.Spark.Worker.net461.win-x64-.zip#worker,hdfs:///mySparkApp.zip#udfs \ -hdfs:///microsoft-spark-2.4.x-.jar \ -hdfs:///mySparkApp.zip mySparkApp -``` diff --git a/docs/features.md b/docs/features.md deleted file mode 100644 index ead022319..000000000 --- a/docs/features.md +++ /dev/null @@ -1 +0,0 @@ -# Features diff --git a/docs/how-to-guides.md b/docs/how-to-guides.md deleted file mode 100644 index d31cea6e5..000000000 --- a/docs/how-to-guides.md +++ /dev/null @@ -1,16 +0,0 @@ -# How-To Guides - -.NET for Apache Spark applications can be used to perform a variety of tasks from connecting to external sources to running in notebooks interactively. 
Here is a list of things you can do through your application along with some important things to know: - -1. [How to use Broadcast variables](https://docs.microsoft.com/en-us/dotnet/spark/how-to-guides/broadcast-guide) in .NET for Apache Spark. -2. [How to use UDFs (User-defined Functions)](https://docs.microsoft.com/en-us/dotnet/spark/how-to-guides/udf-guide) in .NET for Apache Spark. -3. [How to deploy UDF binaries](docs/deploy-worker-udf-binaries.md). -4. [How to connect to Azure Storage accounts]() locally through .NET for Apache Spark. -5. [How to connect to SQL server]() locally through .NET for Apache Spark. -6. [How to connect to Azure EventHub]() through your .NET for Apache Spark application. -7. [How to connect to MongoDB]() through you .NET for Apache Spark application. -8. [How to call UDFs in cross-platform]() applications for example, invoking Scala/Java UDFs from .NET for Apache Spark application, or invoking C# UDFs from Spark apps written in Scala/Java/Python. -9. [How to use .NET for Apache Spark locally using Jupyter notebooks](). -10. [How to use .NET for Apache Spark locally using VS code](). -11. [How to get started](https://dotnet.microsoft.com/learn/data/spark-tutorial/intro) with .NET for Apache Spark through a tutorial. -12. [How to get started with .NET for Apache Spark in Azure Synapse Analytics](). From 129593444986d40c1f3b3d4059aa6d15ce5b20b5 Mon Sep 17 00:00:00 2001 From: Niharika Dutta Date: Mon, 12 Oct 2020 07:14:51 -0700 Subject: [PATCH 19/20] reverting table changes --- README.md | 12 ------------ 1 file changed, 12 deletions(-) diff --git a/README.md b/README.md index b9d4f00b1..0a2eaaf11 100644 --- a/README.md +++ b/README.md @@ -56,18 +56,6 @@ 2.4.5 - - 2.4.6 - - - 2.4.7 - - - 3.0.0 - - - 3.0.1 - 2.4.2 Not supported From 7497bd76b83403d140f093762d8549e7498966ce Mon Sep 17 00:00:00 2001 From: Niharika Dutta Date: Mon, 12 Oct 2020 07:22:04 -0700 Subject: [PATCH 20/20] changes --- README.md | 4 ++-- 1 file changed, 2 insertions(+), 2 deletions(-) diff --git a/README.md b/README.md index 0a2eaaf11..3e8d08424 100644 --- a/README.md +++ b/README.md @@ -8,7 +8,7 @@ .NET for Apache Spark is compliant with .NET Standard - a formal specification of .NET APIs that are common across .NET implementations. This means you can use .NET for Apache Spark anywhere you write .NET code allowing you to reuse all the knowledge, skills, code, and libraries you already have as a .NET developer. -.NET for Apache Spark runs on Windows, Linux, and macOS using .NET Core, or Windows using .NET Framework. It also runs on all major cloud providers including [Azure HDInsight Spark](https://docs.microsoft.com/en-us/dotnet/spark/tutorials/hdinsight-deployment), [Amazon EMR Spark](https://docs.microsoft.com/en-us/dotnet/spark/tutorials/amazon-emr-spark-deployment), [AWS](deployment/README.md#databricks), [Azure Databricks](https://docs.microsoft.com/en-us/dotnet/spark/tutorials/databricks-deployment) & [Azure Synapse Analytics](https://azure.microsoft.com/en-us/services/synapse-analytics/). +.NET for Apache Spark runs on Windows, Linux, and macOS using .NET Core, or Windows using .NET Framework. 
It also runs on all major cloud providers including [Azure HDInsight Spark](https://docs.microsoft.com/en-us/dotnet/spark/tutorials/hdinsight-deployment), [Amazon EMR Spark](https://docs.microsoft.com/en-us/dotnet/spark/tutorials/amazon-emr-spark-deployment), [AWS](deployment/README.md#databricks), [Azure Databricks](https://docs.microsoft.com/en-us/dotnet/spark/tutorials/databricks-deployment) & [Azure Synapse Analytics](https://docs.microsoft.com/en-us/azure/synapse-analytics/spark/spark-dotnet). **Note**: We currently have a Spark Project Improvement Proposal JIRA at [SPIP: .NET bindings for Apache Spark](https://issues.apache.org/jira/browse/SPARK-27006) to work with the community towards getting .NET support by default into Apache Spark. We highly encourage you to participate in the discussion. @@ -157,7 +157,7 @@ We welcome contributions to both categories! ## Learn More -To learn more about some features of .NET for Apache Spark, please visit [the official .NET documentation]](https://docs.microsoft.com/en-us/dotnet/spark/). +To learn more about some features of .NET for Apache Spark, please visit [the official .NET documentation](https://docs.microsoft.com/en-us/dotnet/spark/). ## Contributing