diff --git a/README.md b/README.md
index 03dd08e39..f45ef06fb 100644
--- a/README.md
+++ b/README.md
@@ -8,7 +8,7 @@
.NET for Apache Spark is compliant with .NET Standard - a formal specification of .NET APIs that are common across .NET implementations. This means you can use .NET for Apache Spark anywhere you write .NET code allowing you to reuse all the knowledge, skills, code, and libraries you already have as a .NET developer.
-.NET for Apache Spark runs on Windows, Linux, and macOS using .NET Core, or Windows using .NET Framework. It also runs on all major cloud providers including [Azure HDInsight Spark](deployment/README.md#azure-hdinsight-spark), [Amazon EMR Spark](deployment/README.md#amazon-emr-spark), [AWS](deployment/README.md#databricks) & [Azure](deployment/README.md#databricks) Databricks.
+.NET for Apache Spark runs on Windows, Linux, and macOS using .NET Core, or Windows using .NET Framework. It also runs on all major cloud providers including [Azure HDInsight Spark](https://docs.microsoft.com/en-us/dotnet/spark/tutorials/hdinsight-deployment), [Amazon EMR Spark](https://docs.microsoft.com/en-us/dotnet/spark/tutorials/amazon-emr-spark-deployment), [AWS Databricks](deployment/README.md#databricks), [Azure Databricks](https://docs.microsoft.com/en-us/dotnet/spark/tutorials/databricks-deployment) & [Azure Synapse Analytics](https://docs.microsoft.com/en-us/azure/synapse-analytics/spark/spark-dotnet).
**Note**: We currently have a Spark Project Improvement Proposal JIRA at [SPIP: .NET bindings for Apache Spark](https://issues.apache.org/jira/browse/SPARK-27006) to work with the community towards getting .NET support by default into Apache Spark. We highly encourage you to participate in the discussion.
@@ -58,9 +58,9 @@
## Get Started
These instructions will show you how to run a .NET for Apache Spark app using .NET Core.
-- [Windows Instructions](docs/getting-started/windows-instructions.md)
-- [Ubuntu Instructions](docs/getting-started/ubuntu-instructions.md)
-- [MacOs Instructions](docs/getting-started/macos-instructions.md)
+- [Windows Instructions](https://docs.microsoft.com/en-us/dotnet/spark/tutorials/get-started?tabs=windows)
+- [Ubuntu Instructions](https://docs.microsoft.com/en-us/dotnet/spark/tutorials/get-started?tabs=linux)
+- [macOS Instructions](https://docs.microsoft.com/en-us/dotnet/spark/tutorials/get-started?tabs=linux)
## Build Status
@@ -75,8 +75,8 @@ Building from source is very easy and the whole process (from cloning to being a
| | | Instructions |
| :---: | :--- | :--- |
-| ![Windows icon](docs/img/windows-icon-32.png) | **Windows** | <ul><li>Local - [.NET Framework 4.6.1](docs/building/windows-instructions.md#using-visual-studio-for-net-framework-461)</li><li>Local - [.NET Core 3.1](docs/building/windows-instructions.md#using-net-core-cli-for-net-core)</li></ul> |
-| ![Ubuntu icon](docs/img/ubuntu-icon-32.png) | **Ubuntu** | <ul><li>Local - [.NET Core 3.1](docs/building/ubuntu-instructions.md)</li><li>[Azure HDInsight Spark - .NET Core 3.1](deployment/README.md)</li></ul> |
+| ![Windows icon](docs/img/windows-icon-32.png) | **Windows** | <ul><li>Local - [.NET Framework 4.6.1](https://docs.microsoft.com/en-us/dotnet/spark/how-to-guides/windows-instructions#using-visual-studio-for-net-framework)</li><li>Local - [.NET Core 3.1](https://docs.microsoft.com/en-us/dotnet/spark/how-to-guides/windows-instructions#using-net-core-cli-for-net-core)</li></ul> |
+| ![Ubuntu icon](docs/img/ubuntu-icon-32.png) | **Ubuntu** | <ul><li>Local - [.NET Core 3.1](https://docs.microsoft.com/en-us/dotnet/spark/how-to-guides/ubuntu-instructions)</li><li>[Azure HDInsight Spark - .NET Core 3.1](https://docs.microsoft.com/en-us/dotnet/spark/tutorials/hdinsight-deployment)</li></ul> |
## Samples
@@ -144,6 +144,10 @@ We welcome contributions to both categories!
+## Learn More
+
+To learn more about some features of .NET for Apache Spark, please visit [the official .NET documentation](https://docs.microsoft.com/en-us/dotnet/spark/).
+
## Contributing
We welcome contributions! Please review our [contribution guide](CONTRIBUTING.md).
diff --git a/docs/broadcast-guide.md b/docs/broadcast-guide.md
deleted file mode 100644
index c3026516b..000000000
--- a/docs/broadcast-guide.md
+++ /dev/null
@@ -1,92 +0,0 @@
-# Guide to using Broadcast Variables
-
-This is a guide to show how to use broadcast variables in .NET for Apache Spark.
-
-## What are Broadcast Variables
-
-[Broadcast variables in Apache Spark](https://spark.apache.org/docs/2.2.0/rdd-programming-guide.html#broadcast-variables) are a mechanism for sharing read-only variables across executors. They allow the programmer to keep a read-only variable cached on each machine rather than shipping a copy of it with tasks. They can be used, for example, to give every node a copy of a large input dataset in an efficient manner.
-
-### How to use broadcast variables in .NET for Apache Spark
-
-Broadcast variables are created from a variable `v` by calling `SparkContext.Broadcast(v)`. The broadcast variable is a wrapper around `v`, and its value can be accessed by calling the `Value()` method.
-
-Example:
-
-```csharp
-string v = "Variable to be broadcasted";
-Broadcast<string> bv = SparkContext.Broadcast(v);
-
-// Using the broadcast variable in a UDF:
-Func<Column, Column> udf = Udf<string, string>(
-    str => $"{str}: {bv.Value()}");
-```
-
-The type parameter for `Broadcast` should be the type of the variable being broadcasted.
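-
-As an additional illustration (not from the original guide), a broadcast variable can also wrap a collection such as a lookup set that is then reused inside a UDF. The sketch below assumes a DataFrame `df` with a hypothetical `country` column and uses `SparkContext` the same way as the examples above:
-
-```csharp
-// Sketch only: broadcast a lookup set once and reuse it from a UDF.
-// SparkContext here is the session's SparkContext, as in the example above.
-var countries = new HashSet<string> { "USA", "India", "Brazil" };
-Broadcast<HashSet<string>> bc = SparkContext.Broadcast(countries);
-
-Func<Column, Column> isKnown = Udf<string, bool>(
-    name => bc.Value().Contains(name));
-
-df.Select(isKnown(df["country"])).Show();
-```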
-
-### Deleting broadcast variables
-
-The broadcast variable can be deleted from all executors by calling the `Destroy()` method on it.
-
-```csharp
-// Destroying the broadcast variable bv:
-bv.Destroy();
-```
-
-> Note: `Destroy()` deletes all data and metadata related to the broadcast variable. Use this with caution - once a broadcast variable has been destroyed, it cannot be used again.
-
-#### Caveat of using Destroy
-
-One important thing to keep in mind while using broadcast variables in UDFs is to limit the scope of the variable to only the UDF that is referencing it. The [guide to using UDFs](udf-guide.md) describes this phenomenon in detail. This is especially crucial when calling `Destroy` on the broadcast variable. If the broadcast variable that has been destroyed is visible to or accessible from other UDFs, it gets picked up for serialization by all those UDFs, even if it is not being referenced by them. This will throw an error as .NET for Apache Spark is not able to serialize the destroyed broadcast variable.
-
-Example to demonstrate:
-
-```csharp
-string v = "Variable to be broadcasted";
-Broadcast<string> bv = SparkContext.Broadcast(v);
-
-// Using the broadcast variable in a UDF:
-Func<Column, Column> udf1 = Udf<string, string>(
-    str => $"{str}: {bv.Value()}");
-
-// Destroying bv
-bv.Destroy();
-
-// Calling udf1 after destroying bv throws the following expected exception:
-// org.apache.spark.SparkException: Attempted to use Broadcast(0) after it was destroyed
-df.Select(udf1(df["_1"])).Show();
-
-// Different UDF udf2 that is not referencing bv
-Func<Column, Column> udf2 = Udf<string, string>(
-    str => $"{str}: not referencing broadcast variable");
-
-// Calling udf2 throws the following (unexpected) exception:
-// [Error] [JvmBridge] org.apache.spark.SparkException: Task not serializable
-df.Select(udf2(df["_1"])).Show();
-```
-
-The recommended way of implementing the desired behavior above is:
-
-```csharp
-string v = "Variable to be broadcasted";
-// Restricting the visibility of bv to only the UDF referencing it
-{
-    Broadcast<string> bv = SparkContext.Broadcast(v);
-
-    // Using the broadcast variable in a UDF:
-    Func<Column, Column> udf1 = Udf<string, string>(
-        str => $"{str}: {bv.Value()}");
-
-    // Destroying bv
-    bv.Destroy();
-}
-
-// Different UDF udf2 that is not referencing bv
-Func<Column, Column> udf2 = Udf<string, string>(
-    str => $"{str}: not referencing broadcast variable");
-
-// Calling udf2 works fine as expected
-df.Select(udf2(df["_1"])).Show();
-```
- This ensures that destroying `bv` doesn't affect calling `udf2` because of unexpected serialization behavior.
-
- Broadcast variables are useful for transmitting read-only data to all executors, as the data is sent only once and this can give performance benefits when compared with using local variables that get shipped to the executors with each task. Please refer to the [official documentation](https://spark.apache.org/docs/2.2.0/rdd-programming-guide.html#broadcast-variables) to get a deeper understanding of broadcast variables and why they are used.
\ No newline at end of file
diff --git a/docs/building/ubuntu-instructions.md b/docs/building/ubuntu-instructions.md
deleted file mode 100644
index dd36c5337..000000000
--- a/docs/building/ubuntu-instructions.md
+++ /dev/null
@@ -1,218 +0,0 @@
-Building Spark .NET on Ubuntu 18.04
-==========================
-
-# Table of Contents
-- [Open Issues](#open-issues)
-- [Pre-requisites](#pre-requisites)
-- [Building](#building)
- - [Building Spark .NET Scala Extensions Layer](#building-spark-net-scala-extensions-layer)
- - [Building .NET Sample Applications using .NET Core CLI](#building-net-sample-applications-using-net-core-cli)
-- [Run Samples](#run-samples)
-
-# Open Issues:
-- [Building through Visual Studio Code]()
-
-# Pre-requisites:
-
-If you already have all the pre-requisites, skip to the [build](ubuntu-instructions.md#building) steps below.
-
- 1. Download and install **[.NET Core 3.1 SDK](https://dotnet.microsoft.com/download/dotnet-core/3.1)** - installing the SDK will add the `dotnet` toolchain to your path.
- 2. Install **[OpenJDK 8](https://openjdk.java.net/install/)**
- - You can use the following command:
- ```bash
- sudo apt install openjdk-8-jdk
- ```
- - Verify you are able to run `java` from your command-line
-
- 📙 Click to see sample java -version output
-
- ```
- openjdk version "1.8.0_191"
- OpenJDK Runtime Environment (build 1.8.0_191-8u191-b12-2ubuntu0.18.04.1-b12)
- OpenJDK 64-Bit Server VM (build 25.191-b12, mixed mode)
- ```
- - If you already have multiple OpenJDK versions installed and want to select OpenJDK 8, use the following command:
- ```bash
- sudo update-alternatives --config java
- ```
- 3. Install **[Apache Maven 3.6.3+](https://maven.apache.org/download.cgi)**
- - Run the following command:
- ```bash
- mkdir -p ~/bin/maven
- cd ~/bin/maven
- wget https://www-us.apache.org/dist/maven/maven-3/3.6.3/binaries/apache-maven-3.6.3-bin.tar.gz
- tar -xvzf apache-maven-3.6.3-bin.tar.gz
- ln -s apache-maven-3.6.3 current
- export M2_HOME=~/bin/maven/current
- export PATH=${M2_HOME}/bin:${PATH}
- source ~/.bashrc
- ```
-
- Note that these environment variables will be lost when you close your terminal. If you want the changes to be permanent, add the `export` lines to your `~/.bashrc` file.
- - Verify you are able to run `mvn` from your command-line
-
- 📙 Click to see sample mvn -version output
-
- ```
- Apache Maven 3.6.3 (cecedd343002696d0abb50b32b541b8a6ba2883f)
- Maven home: ~/bin/apache-maven-3.6.3
- Java version: 1.8.0_242, vendor: Oracle Corporation, runtime: /usr/lib/jvm/java-8-openjdk-amd64/jre
- Default locale: en_US, platform encoding: ANSI_X3.4-1968
- OS name: "linux", version: "4.4.0-142-generic", arch: "amd64", family: "unix"
- ```
- 4. Install **[Apache Spark 2.3+](https://spark.apache.org/downloads.html)**
- - Download [Apache Spark 2.3+](https://spark.apache.org/downloads.html) and extract it into a local folder (e.g., `~/bin/spark-2.3.2-bin-hadoop2.7`)
- Add the necessary [environment variable](https://www.java.com/en/download/help/path.xml) `SPARK_HOME`, e.g., `~/bin/spark-2.3.2-bin-hadoop2.7/`
- ```bash
- export SPARK_HOME=~/bin/spark-2.3.2-bin-hadoop2.7
- export PATH="$SPARK_HOME/bin:$PATH"
- source ~/.bashrc
- ```
-
- Note that these environment variables will be lost when you close your terminal. If you want the changes to be permanent, add the `export` lines to your `~/.bashrc` file.
- - Verify you are able to run `spark-shell` from your command-line
-
- 📙 Click to see sample console output
-
- ```
- Welcome to
- ____ __
- / __/__ ___ _____/ /__
- _\ \/ _ \/ _ `/ __/ '_/
- /___/ .__/\_,_/_/ /_/\_\ version 2.3.2
- /_/
-
- Using Scala version 2.11.8 (Java HotSpot(TM) 64-Bit Server VM, Java 1.8.0_201)
- Type in expressions to have them evaluated.
- Type :help for more information.
-
- scala> sc
- res0: org.apache.spark.SparkContext = org.apache.spark.SparkContext@6eaa6b0c
- ```
-
-
-
-Please make sure you are able to run `dotnet`, `java`, `mvn`, `spark-shell` from your command-line before you move to the next section. Feel there is a better way? Please [open an issue](https://github.com/dotnet/spark/issues) and feel free to contribute.
-
-# Building
-
-For the rest of this section, it is assumed that you have cloned the Spark .NET repo onto your machine, e.g., into `~/dotnet.spark/`
-
-```
-git clone https://github.com/dotnet/spark.git ~/dotnet.spark
-```
-
-## Building Spark .NET Scala Extensions Layer
-
-When you submit a .NET application, Spark .NET has the necessary logic written in Scala that informs Apache Spark how to handle your requests (e.g., a request to create a new Spark Session, or a request to transfer data from the .NET side to the JVM side). This logic can be found in the [Spark .NET Scala Source Code](../../src/scala).
-
-Let us now build the Spark .NET Scala extension layer. This is easy to do:
-
-```
-cd src/scala
-mvn clean package
-```
-You should see JARs created for the supported Spark versions:
-* `microsoft-spark-2-3/target/microsoft-spark-2-3_2.11-<version>.jar`
-* `microsoft-spark-2-4/target/microsoft-spark-2-4_2.11-<version>.jar`
-* `microsoft-spark-3-0/target/microsoft-spark-3-0_2.12-<version>.jar`
-
-## Building .NET Sample Applications using .NET Core CLI
-
- 1. Build the Worker
- ```bash
- cd ~/dotnet.spark/src/csharp/Microsoft.Spark.Worker/
- dotnet publish -f netcoreapp3.1 -r linux-x64
- ```
-
- 📙 Click to see sample console output
-
- ```bash
- user@machine:/home/user/dotnet.spark/src/csharp/Microsoft.Spark.Worker$ dotnet publish -f netcoreapp3.1 -r linux-x64
- Microsoft (R) Build Engine version 16.0.462+g62fb89029d for .NET Core
- Copyright (C) Microsoft Corporation. All rights reserved.
-
- Restore completed in 36.03 ms for /home/user/dotnet.spark/src/csharp/Microsoft.Spark.Worker/Microsoft.Spark.Worker.csproj.
- Restore completed in 35.94 ms for /home/user/dotnet.spark/src/csharp/Microsoft.Spark/Microsoft.Spark.csproj.
- Microsoft.Spark -> /home/user/dotnet.spark/artifacts/bin/Microsoft.Spark/Debug/netstandard2.0/Microsoft.Spark.dll
- Microsoft.Spark.Worker -> /home/user/dotnet.spark/artifacts/bin/Microsoft.Spark.Worker/Debug/netcoreapp3.1/linux-x64/Microsoft.Spark.Worker.dll
- Microsoft.Spark.Worker -> /home/user/dotnet.spark/artifacts/bin/Microsoft.Spark.Worker/Debug/netcoreapp3.1/linux-x64/publish/
- ```
-
-
-
- 2. Build the Samples
- ```bash
- cd ~/dotnet.spark/examples/Microsoft.Spark.CSharp.Examples/
- dotnet publish -f netcoreapp3.1 -r linux-x64
- ```
-
- 📙 Click to see sample console output
-
- ```bash
- user@machine:/home/user/dotnet.spark/examples/Microsoft.Spark.CSharp.Examples$ dotnet publish -f netcoreapp3.1 -r linux-x64
- Microsoft (R) Build Engine version 16.0.462+g62fb89029d for .NET Core
- Copyright (C) Microsoft Corporation. All rights reserved.
-
- Restore completed in 37.11 ms for /home/user/dotnet.spark/src/csharp/Microsoft.Spark/Microsoft.Spark.csproj.
- Restore completed in 281.63 ms for /home/user/dotnet.spark/examples/Microsoft.Spark.CSharp.Examples/Microsoft.Spark.CSharp.Examples.csproj.
- Microsoft.Spark -> /home/user/dotnet.spark/artifacts/bin/Microsoft.Spark/Debug/netstandard2.0/Microsoft.Spark.dll
- Microsoft.Spark.CSharp.Examples -> /home/user/dotnet.spark/artifacts/bin/Microsoft.Spark.CSharp.Examples/Debug/netcoreapp3.1/linux-x64/Microsoft.Spark.CSharp.Examples.dll
- Microsoft.Spark.CSharp.Examples -> /home/user/dotnet.spark/artifacts/bin/Microsoft.Spark.CSharp.Examples/Debug/netcoreapp3.1/linux-x64/publish/
- ```
-
-
-
-# Run Samples
-
-Once you build the samples, you can use `spark-submit` to submit your .NET Core apps. Make sure you have followed the [pre-requisites](#pre-requisites) section and installed Apache Spark.
-
- 1. Set the `DOTNET_WORKER_DIR` or `PATH` environment variable to include the path where the `Microsoft.Spark.Worker` binary has been generated (e.g., `~/dotnet.spark/artifacts/bin/Microsoft.Spark.Worker/Debug/netcoreapp3.1/linux-x64/publish`)
- 2. Open a terminal and go to the directory where your app binary has been generated (e.g., `~/dotnet.spark/artifacts/bin/Microsoft.Spark.CSharp.Examples/Debug/netcoreapp3.1/linux-x64/publish`)
- 3. Running your app follows the basic structure:
- ```bash
- spark-submit \
- [--jars <any-jars-your-app-is-dependent-on>] \
- --class org.apache.spark.deploy.dotnet.DotnetRunner \
- --master local \
- <path-to-microsoft-spark-jar> \
- <path-to-your-app-binary> <argument(s)-to-your-app>
- ```
-
- Here are some examples you can run:
- - **[Microsoft.Spark.Examples.Sql.Batch.Basic](../../examples/Microsoft.Spark.CSharp.Examples/Sql/Batch/Basic.cs)**
- ```bash
- spark-submit \
- --class org.apache.spark.deploy.dotnet.DotnetRunner \
- --master local \
- ~/dotnet.spark/src/scala/microsoft-spark-<version>/target/microsoft-spark-<version>.jar \
- ./Microsoft.Spark.CSharp.Examples Sql.Batch.Basic $SPARK_HOME/examples/src/main/resources/people.json
- ```
- - **[Microsoft.Spark.Examples.Sql.Streaming.StructuredNetworkWordCount](../../examples/Microsoft.Spark.CSharp.Examples/Sql/Streaming/StructuredNetworkWordCount.cs)**
- ```bash
- spark-submit \
- --class org.apache.spark.deploy.dotnet.DotnetRunner \
- --master local \
- ~/dotnet.spark/src/scala/microsoft-spark-<version>/target/microsoft-spark-<version>.jar \
- ./Microsoft.Spark.CSharp.Examples Sql.Streaming.StructuredNetworkWordCount localhost 9999
- ```
- - **[Microsoft.Spark.Examples.Sql.Streaming.StructuredKafkaWordCount (maven accessible)](../../examples/Microsoft.Spark.CSharp.Examples/Sql/Streaming/StructuredKafkaWordCount.cs)**
- ```bash
- spark-submit \
- --packages org.apache.spark:spark-sql-kafka-0-10_2.11:2.3.2 \
- --class org.apache.spark.deploy.dotnet.DotnetRunner \
- --master local \
- ~/dotnet.spark/src/scala/microsoft-spark-<version>/target/microsoft-spark-<version>.jar \
- ./Microsoft.Spark.CSharp.Examples Sql.Streaming.StructuredKafkaWordCount localhost:9092 subscribe test
- ```
- - **[Microsoft.Spark.Examples.Sql.Streaming.StructuredKafkaWordCount (jars provided)](../../examples/Microsoft.Spark.CSharp.Examples/Sql/Streaming/StructuredKafkaWordCount.cs)**
- ```bash
- spark-submit \
- --jars path/to/net.jpountz.lz4/lz4-1.3.0.jar,path/to/org.apache.kafka/kafka-clients-0.10.0.1.jar,path/to/org.apache.spark/spark-sql-kafka-0-10_2.11-2.3.2.jar,path/to/org.slf4j/slf4j-api-1.7.6.jar,path/to/org.spark-project.spark/unused-1.0.0.jar,path/to/org.xerial.snappy/snappy-java-1.1.2.6.jar \
- --class org.apache.spark.deploy.dotnet.DotnetRunner \
- --master local \
- ~/dotnet.spark/src/scala/microsoft-spark-<version>/target/microsoft-spark-<version>.jar \
- ./Microsoft.Spark.CSharp.Examples Sql.Streaming.StructuredKafkaWordCount localhost:9092 subscribe test
- ```
-
-Feel this experience is complicated? Help us by taking up [Simplify User Experience for Running an App](https://github.com/dotnet/spark/issues/6)
diff --git a/docs/building/windows-instructions.md b/docs/building/windows-instructions.md
deleted file mode 100644
index 754a4bf61..000000000
--- a/docs/building/windows-instructions.md
+++ /dev/null
@@ -1,250 +0,0 @@
-Building Spark .NET on Windows
-==========================
-
-# Table of Contents
-- [Open Issues](#open-issues)
-- [Pre-requisites](#pre-requisites)
-- [Building](#building)
- - [Building Spark .NET Scala Extensions Layer](#building-spark-net-scala-extensions-layer)
- - [Building .NET Samples Application](#building-net-samples-application)
- - [Using Visual Studio for .NET Framework](#using-visual-studio-for-net-framework)
- - [Using .NET Core CLI for .NET Core](#using-net-core-cli-for-net-core)
-- [Run Samples](#run-samples)
-
-# Open Issues:
-- [Allow users to choose which .NET framework to build for]()
-- [Building through Visual Studio Code]()
-- [Building fully automatically through .NET Core CLI]()
-
-# Pre-requisites:
-
-If you already have all the pre-requisites, skip to the [build](windows-instructions.md#building) steps below.
-
- 1. Download and install the **[.NET Core 3.1 SDK](https://dotnet.microsoft.com/download/dotnet-core/3.1)** - installing the SDK will add the `dotnet` toolchain to your path.
- 2. Install **[Visual Studio 2019](https://www.visualstudio.com/downloads/)** (Version 16.4 or later). The Community version is completely free. When configuring your installation, include these components at minimum:
- * .NET desktop development
- * All Required Components
- * .NET Framework 4.6.1 Development Tools
- * .NET Core cross-platform development
- * All Required Components
- 3. Install **[Java 1.8](https://www.oracle.com/technetwork/java/javase/downloads/jdk8-downloads-2133151.html)**
- Select the appropriate version for your operating system, e.g., `jdk-8u201-windows-x64.exe` for a Windows x64 machine.
- - Install using the installer and verify you are able to run `java` from your command-line
- 4. Install **[Apache Maven 3.6.3+](https://maven.apache.org/download.cgi)**
- - Download [Apache Maven 3.6.3](http://mirror.metrocast.net/apache/maven/maven-3/3.6.3/binaries/apache-maven-3.6.3-bin.zip)
- - Extract to a local directory e.g., `c:\bin\apache-maven-3.6.3\`
- - Add Apache Maven to your [PATH environment variable](https://www.java.com/en/download/help/path.xml) e.g., `c:\bin\apache-maven-3.6.3\bin`
- - Verify you are able to run `mvn` from your command-line
- 5. Install **[Apache Spark 2.3+](https://spark.apache.org/downloads.html)**
- - Download [Apache Spark 2.3+](https://spark.apache.org/downloads.html) and extract it into a local folder (e.g., `c:\bin\spark-2.3.2-bin-hadoop2.7\`) using [7-zip](https://www.7-zip.org/).
- - Add Apache Spark to your [PATH environment variable](https://www.java.com/en/download/help/path.xml) e.g., `c:\bin\spark-2.3.2-bin-hadoop2.7\bin`
- - Add a [new environment variable](https://www.java.com/en/download/help/path.xml) `SPARK_HOME` e.g., `C:\bin\spark-2.3.2-bin-hadoop2.7\`
- - Verify you are able to run `spark-shell` from your command-line
-
- 📙 Click to see sample console output
-
- ```
- Welcome to
- ____ __
- / __/__ ___ _____/ /__
- _\ \/ _ \/ _ `/ __/ '_/
- /___/ .__/\_,_/_/ /_/\_\ version 2.3.2
- /_/
-
- Using Scala version 2.11.8 (Java HotSpot(TM) 64-Bit Server VM, Java 1.8.0_201)
- Type in expressions to have them evaluated.
- Type :help for more information.
-
- scala> sc
- res0: org.apache.spark.SparkContext = org.apache.spark.SparkContext@6eaa6b0c
- ```
-
-
-
- 6. Install **[WinUtils](https://github.com/steveloughran/winutils)**
- - Download `winutils.exe` binary from [WinUtils repository](https://github.com/steveloughran/winutils). You should select the version of Hadoop the Spark distribution was compiled with, e.g. use hadoop-2.7.1 for Spark 2.3.2.
- - Save `winutils.exe` binary to a directory of your choice e.g., `c:\hadoop\bin`
- - Set `HADOOP_HOME` to reflect the directory with winutils.exe (without bin). For instance, using command-line:
- ```powershell
- set HADOOP_HOME=c:\hadoop
- ```
- - Set PATH environment variable to include `%HADOOP_HOME%\bin`. For instance, using command-line:
- ```powershell
- set PATH=%HADOOP_HOME%\bin;%PATH%
- ```
-
-
-Please make sure you are able to run `dotnet`, `java`, `mvn`, `spark-shell` from your command-line before you move to the next section. Feel there is a better way? Please [open an issue](https://github.com/dotnet/spark/issues) and feel free to contribute.
-
-> **Note**: A new instance of the command-line may be required if any environment variables were updated.
-
-# Building
-
-For the rest of this section, it is assumed that you have cloned the Spark .NET repo onto your machine, e.g., into `c:\github\dotnet-spark\`
-
-```powershell
-git clone https://github.com/dotnet/spark.git c:\github\dotnet-spark
-```
-
-## Building Spark .NET Scala Extensions Layer
-
-When you submit a .NET application, Spark .NET has the necessary logic written in Scala that informs Apache Spark how to handle your requests (e.g., a request to create a new Spark Session, or a request to transfer data from the .NET side to the JVM side). This logic can be found in the [Spark .NET Scala Source Code](../../src/scala).
-
-Regardless of whether you are using .NET Framework or .NET Core, you will need to build the Spark .NET Scala extension layer. This is easy to do:
-
-```powershell
-cd src\scala
-mvn clean package
-```
-You should see JARs created for the supported Spark versions:
-* `microsoft-spark-2-3\target\microsoft-spark-2-3_2.11-<version>.jar`
-* `microsoft-spark-2-4\target\microsoft-spark-2-4_2.11-<version>.jar`
-* `microsoft-spark-3-0\target\microsoft-spark-3-0_2.12-<version>.jar`
-
-## Building .NET Samples Application
-
-### Using Visual Studio for .NET Framework
-
- 1. Open `src\csharp\Microsoft.Spark.sln` in Visual Studio and build the `Microsoft.Spark.CSharp.Examples` project under the `examples` folder (this will in turn build the .NET bindings project as well). If you want, you can write your own code in the `Microsoft.Spark.Examples` project:
-
- ```csharp
- // Instantiate a session
- var spark = SparkSession
- .Builder()
- .AppName("Hello Spark!")
- .GetOrCreate();
-
- var df = spark.Read().Json(args[0]);
-
- // Print schema
- df.PrintSchema();
-
- // Apply a filter and show results
- df.Filter(df["age"] > 21).Show();
- ```
- Once the build is successful, you will see the appropriate binaries produced in the output directory.
-
- 📙 Click to see sample console output
-
- ```powershell
- Directory: C:\github\dotnet-spark\artifacts\bin\Microsoft.Spark.CSharp.Examples\Debug\net461
-
-
- Mode LastWriteTime Length Name
- ---- ------------- ------ ----
- -a---- 3/6/2019 12:18 AM 125440 Apache.Arrow.dll
- -a---- 3/16/2019 12:00 AM 13824 Microsoft.Spark.CSharp.Examples.exe
- -a---- 3/16/2019 12:00 AM 19423 Microsoft.Spark.CSharp.Examples.exe.config
- -a---- 3/16/2019 12:00 AM 2720 Microsoft.Spark.CSharp.Examples.pdb
- -a---- 3/16/2019 12:00 AM 143360 Microsoft.Spark.dll
- -a---- 3/16/2019 12:00 AM 63388 Microsoft.Spark.pdb
- -a---- 3/16/2019 12:00 AM 34304 Microsoft.Spark.Worker.exe
- -a---- 3/16/2019 12:00 AM 19423 Microsoft.Spark.Worker.exe.config
- -a---- 3/16/2019 12:00 AM 11900 Microsoft.Spark.Worker.pdb
- -a---- 3/16/2019 12:00 AM 23552 Microsoft.Spark.Worker.xml
- -a---- 3/16/2019 12:00 AM 332363 Microsoft.Spark.xml
- ------------------------------------------- More framework files -------------------------------------
- ```
-
-
-
-### Using .NET Core CLI for .NET Core
-
-> Note: We are currently working on automating .NET Core builds for Spark .NET. Until then, we appreciate your patience in performing some of the steps manually.
-
- 1. Build the Worker
- ```powershell
- cd C:\github\dotnet-spark\src\csharp\Microsoft.Spark.Worker\
- dotnet publish -f netcoreapp3.1 -r win-x64
- ```
-
- 📙 Click to see sample console output
-
- ```powershell
- PS C:\github\dotnet-spark\src\csharp\Microsoft.Spark.Worker> dotnet publish -f netcoreapp3.1 -r win-x64
- Microsoft (R) Build Engine version 16.0.462+g62fb89029d for .NET Core
- Copyright (C) Microsoft Corporation. All rights reserved.
-
- Restore completed in 299.95 ms for C:\github\dotnet-spark\src\csharp\Microsoft.Spark\Microsoft.Spark.csproj.
- Restore completed in 306.62 ms for C:\github\dotnet-spark\src\csharp\Microsoft.Spark.Worker\Microsoft.Spark.Worker.csproj.
- Microsoft.Spark -> C:\github\dotnet-spark\artifacts\bin\Microsoft.Spark\Debug\netstandard2.0\Microsoft.Spark.dll
- Microsoft.Spark.Worker -> C:\github\dotnet-spark\artifacts\bin\Microsoft.Spark.Worker\Debug\netcoreapp3.1\win-x64\Microsoft.Spark.Worker.dll
- Microsoft.Spark.Worker -> C:\github\dotnet-spark\artifacts\bin\Microsoft.Spark.Worker\Debug\netcoreapp3.1\win-x64\publish\
- ```
-
-
- 2. Build the Samples
- ```powershell
- cd C:\github\dotnet-spark\examples\Microsoft.Spark.CSharp.Examples\
- dotnet publish -f netcoreapp3.1 -r win-x64
- ```
-
- 📙 Click to see sample console output
-
- ```powershell
- PS C:\github\dotnet-spark\examples\Microsoft.Spark.CSharp.Examples> dotnet publish -f netcoreapp3.1 -r win10-x64
- Microsoft (R) Build Engine version 16.0.462+g62fb89029d for .NET Core
- Copyright (C) Microsoft Corporation. All rights reserved.
-
- Restore completed in 44.22 ms for C:\github\dotnet-spark\src\csharp\Microsoft.Spark\Microsoft.Spark.csproj.
- Restore completed in 336.94 ms for C:\github\dotnet-spark\examples\Microsoft.Spark.CSharp.Examples\Microsoft.Spark.CSharp.Examples.csproj.
- Microsoft.Spark -> C:\github\dotnet-spark\artifacts\bin\Microsoft.Spark\Debug\netstandard2.0\Microsoft.Spark.dll
- Microsoft.Spark.CSharp.Examples -> C:\github\dotnet-spark\artifacts\bin\Microsoft.Spark.CSharp.Examples\Debug\netcoreapp3.1\win-x64\Microsoft.Spark.CSharp.Examples.dll
- Microsoft.Spark.CSharp.Examples -> C:\github\dotnet-spark\artifacts\bin\Microsoft.Spark.CSharp.Examples\Debug\netcoreapp3.1\win-x64\publish\
- ```
-
-
-
-# Run Samples
-
-Once you build the samples, running them will be through `spark-submit` regardless of whether you are targeting .NET Framework or .NET Core apps. Make sure you have followed the [pre-requisites](#pre-requisites) section and installed Apache Spark.
-
- 1. Set the `DOTNET_WORKER_DIR` or `PATH` environment variable to include the path where the `Microsoft.Spark.Worker` binary has been generated (e.g., `c:\github\dotnet-spark\artifacts\bin\Microsoft.Spark.Worker\Debug\net461` for .NET Framework, `c:\github\dotnet-spark\artifacts\bin\Microsoft.Spark.Worker\Debug\netcoreapp3.1\win-x64\publish` for .NET Core)
- 2. Open PowerShell and go to the directory where your app binary has been generated (e.g., `c:\github\dotnet-spark\artifacts\bin\Microsoft.Spark.CSharp.Examples\Debug\net461` for .NET Framework, `c:\github\dotnet-spark\artifacts\bin\Microsoft.Spark.CSharp.Examples\Debug\netcoreapp3.1\win-x64\publish` for .NET Core)
- 3. Running your app follows the basic structure:
- ```powershell
- spark-submit.cmd `
- [--jars <any-jars-your-app-is-dependent-on>] `
- --class org.apache.spark.deploy.dotnet.DotnetRunner `
- --master local `
- <path-to-microsoft-spark-jar> `
- <path-to-your-app-binary> <argument(s)-to-your-app>
- ```
-
- Here are some examples you can run:
- - **[Microsoft.Spark.Examples.Sql.Batch.Basic](../../examples/Microsoft.Spark.CSharp.Examples/Sql/Batch/Basic.cs)**
- ```powershell
- spark-submit.cmd `
- --class org.apache.spark.deploy.dotnet.DotnetRunner `
- --master local `
- C:\github\dotnet-spark\src\scala\microsoft-spark-<version>\target\microsoft-spark-<version>.jar `
- Microsoft.Spark.CSharp.Examples.exe Sql.Batch.Basic %SPARK_HOME%\examples\src\main\resources\people.json
- ```
- - **[Microsoft.Spark.Examples.Sql.Streaming.StructuredNetworkWordCount](../../examples/Microsoft.Spark.CSharp.Examples/Sql/Streaming/StructuredNetworkWordCount.cs)**
- ```powershell
- spark-submit.cmd `
- --class org.apache.spark.deploy.dotnet.DotnetRunner `
- --master local `
- C:\github\dotnet-spark\src\scala\microsoft-spark-<version>\target\microsoft-spark-<version>.jar `
- Microsoft.Spark.CSharp.Examples.exe Sql.Streaming.StructuredNetworkWordCount localhost 9999
- ```
- - **[Microsoft.Spark.Examples.Sql.Streaming.StructuredKafkaWordCount (maven accessible)](../../examples/Microsoft.Spark.CSharp.Examples/Sql/Streaming/StructuredKafkaWordCount.cs)**
- ```powershell
- spark-submit.cmd `
- --packages org.apache.spark:spark-sql-kafka-0-10_2.11:2.3.2 `
- --class org.apache.spark.deploy.dotnet.DotnetRunner `
- --master local `
- C:\github\dotnet-spark\src\scala\microsoft-spark-<version>\target\microsoft-spark-<version>.jar `
- Microsoft.Spark.CSharp.Examples.exe Sql.Streaming.StructuredKafkaWordCount localhost:9092 subscribe test
- ```
- - **[Microsoft.Spark.Examples.Sql.Streaming.StructuredKafkaWordCount (jars provided)](../../examples/Microsoft.Spark.CSharp.Examples/Sql/Streaming/StructuredKafkaWordCount.cs)**
- ```powershell
- spark-submit.cmd `
- --jars path\to\net.jpountz.lz4\lz4-1.3.0.jar,path\to\org.apache.kafka\kafka-clients-0.10.0.1.jar,path\to\org.apache.spark\spark-sql-kafka-0-10_2.11-2.3.2.jar,path\to\org.slf4j\slf4j-api-1.7.6.jar,path\to\org.spark-project.spark\unused-1.0.0.jar,path\to\org.xerial.snappy\snappy-java-1.1.2.6.jar `
- --class org.apache.spark.deploy.dotnet.DotnetRunner `
- --master local `
- C:\github\dotnet-spark\src\scala\microsoft-spark-<version>\target\microsoft-spark-<version>.jar `
- Microsoft.Spark.CSharp.Examples.exe Sql.Streaming.StructuredKafkaWordCount localhost:9092 subscribe test
- ```
-
-Feel this experience is complicated? Help us by taking up [Simplify User Experience for Running an App](https://github.com/dotnet/spark/issues/6)
diff --git a/docs/deploy-worker-udf-binaries.md b/docs/deploy-worker-udf-binaries.md
deleted file mode 100644
index c1a84c813..000000000
--- a/docs/deploy-worker-udf-binaries.md
+++ /dev/null
@@ -1,115 +0,0 @@
-# Deploy Worker and UDF Binaries General Instruction
-
-This how-to provides general instructions on how to deploy Worker and UDF (User-Defined Function) binaries,
-including which Environment Variables to set up and some commonly used parameters
-when launching applications with `spark-submit`.
-
-## Configurations
-
-### 1. Environment Variables
-When deploying workers and writing UDFs, there are a few commonly used environment variables that you may need to set:
-
-| Environment Variable | Description |
-| :--- | :--- |
-| `DOTNET_WORKER_DIR` | Path where the `Microsoft.Spark.Worker` binary has been generated. It's used by the Spark driver and will be passed to Spark executors. If this variable is not set up, the Spark executors will search the path specified in the `PATH` environment variable. e.g. `"C:\bin\Microsoft.Spark.Worker"` |
-| `DOTNET_ASSEMBLY_SEARCH_PATHS` | Comma-separated paths where `Microsoft.Spark.Worker` will load assemblies. Note that if a path starts with ".", the working directory will be prepended. If in **yarn** mode, "." would represent the container's working directory. e.g. `"C:\Users\<user name>\<mysparkapp>\bin\Debug\<dotnet version>"` |
-| `DOTNET_WORKER_DEBUG` | If you want to debug a UDF, then set this environment variable to `1` before running `spark-submit`. |
-
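-As a quick illustration (not part of the original table), these variables can be exported in a shell before calling `spark-submit`; the paths below are placeholders, not required locations:
-
-```bash
-# Placeholder paths - point these at your own worker and app output folders.
-export DOTNET_WORKER_DIR=~/bin/Microsoft.Spark.Worker
-export DOTNET_ASSEMBLY_SEARCH_PATHS=~/mySparkApp/bin/Debug/netcoreapp3.1
-export DOTNET_WORKER_DEBUG=1   # only needed while debugging a UDF
-```
-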
-### 2. Parameter Options
-Once the Spark application is [bundled](https://spark.apache.org/docs/latest/submitting-applications.html#bundling-your-applications-dependencies), you can launch it using `spark-submit`. The following table shows some of the commonly used options:
-
-| Parameter Name | Description |
-| :--- | :--- |
-| `--class` | The entry point for your application. e.g. `org.apache.spark.deploy.dotnet.DotnetRunner` |
-| `--master` | The master URL for the cluster. e.g. `yarn` |
-| `--deploy-mode` | Whether to deploy your driver on the worker nodes (`cluster`) or locally as an external client (`client`). Default: `client` |
-| `--conf` | Arbitrary Spark configuration property in `key=value` format. e.g. `spark.yarn.appMasterEnv.DOTNET_WORKER_DIR=.\worker\Microsoft.Spark.Worker` |
-| `--files` | Comma-separated list of files to be placed in the working directory of each executor. Please note that this option is only applicable for yarn mode. It supports specifying file names with `#` similar to Hadoop. e.g. `myLocalSparkApp.dll#appSeen.dll`. Your application should use the name `appSeen.dll` to reference `myLocalSparkApp.dll` when running on YARN. |
-| `--archives` | Comma-separated list of archives to be extracted into the working directory of each executor. Please note that this option is only applicable for yarn mode. It supports specifying file names with `#` similar to Hadoop. e.g. `hdfs://<path to your worker file>/Microsoft.Spark.Worker.zip#worker`. This will copy and extract the zip file to the `worker` folder. |
-| `application-jar` | Path to a bundled jar including your application and all dependencies. e.g. `hdfs://<path to your jar>/microsoft-spark-<version>.jar` |
-| `application-arguments` | Arguments passed to the main method of your main class, if any. e.g. `hdfs://<path to your app>/<your app>.zip <your app name> <app args>` |
-
-> Note: Please specify all the `--options` before `application-jar` when launching applications with `spark-submit`, otherwise they will be ignored. Please see more `spark-submit` options [here](https://spark.apache.org/docs/latest/submitting-applications.html) and running spark on YARN details [here](https://spark.apache.org/docs/latest/running-on-yarn.html).
-
-## FAQ
-#### 1. Question: When I run a spark app with UDFs, I get the following error. What should I do?
-> **Error:** [ ] [ ] [Error] [TaskRunner] [0] ProcessStream() failed with exception: System.IO.FileNotFoundException: Assembly 'mySparkApp, Version=1.0.0.0, Culture=neutral, PublicKeyToken=null' file not found: 'mySparkApp.dll'
-
-**Answer:** Please check if the `DOTNET_ASSEMBLY_SEARCH_PATHS` environment variable is set correctly. It should be the path that contains your `mySparkApp.dll`.
-
-#### 2. Question: After I upgraded my Spark Dotnet version and reset the `DOTNET_WORKER_DIR` environment variable, why do I still get the following error?
-> **Error:** Lost task 0.0 in stage 11.0 (TID 24, localhost, executor driver): java.io.IOException: Cannot run program "Microsoft.Spark.Worker.exe": CreateProcess error=2, The system cannot find the file specified.
-
-**Answer:** Please try restarting your PowerShell window (or other command windows) first so that it can take the latest environment variable values. Then start your program.
-
-#### 3. Question: After submitting my Spark application, I get the error `System.TypeLoadException: Could not load type 'System.Runtime.Remoting.Contexts.Context'`.
-> **Error:** [ ] [ ] [Error] [TaskRunner] [0] ProcessStream() failed with exception: System.TypeLoadException: Could not load type 'System.Runtime.Remoting.Contexts.Context' from assembly 'mscorlib, Version=4.0.0.0, Culture=neutral, PublicKeyToken=...'.
-
-**Answer:** Please check the `Microsoft.Spark.Worker` version you are using. We currently provide two versions: **.NET Framework 4.6.1** and **.NET Core 2.1.x**. In this case, `Microsoft.Spark.Worker.net461.win-x64-<version>` (which you can download [here](https://github.com/dotnet/spark/releases)) should be used since `System.Runtime.Remoting.Contexts.Context` is only for .NET Framework.
-
-#### 4. Question: How to run my spark application with UDFs on YARN? Which environment variables and parameters should I use?
-
-**Answer:** To launch the spark application on YARN, the environment variables should be specified as `spark.yarn.appMasterEnv.[EnvironmentVariableName]`. Please see below as an example using `spark-submit`:
-```shell
-spark-submit \
---class org.apache.spark.deploy.dotnet.DotnetRunner \
---master yarn \
---deploy-mode cluster \
---conf spark.yarn.appMasterEnv.DOTNET_WORKER_DIR=./worker/Microsoft.Spark.Worker-<version> \
---conf spark.yarn.appMasterEnv.DOTNET_ASSEMBLY_SEARCH_PATHS=./udfs \
---archives hdfs:///Microsoft.Spark.Worker.net461.win-x64-<version>.zip#worker,hdfs:///mySparkApp.zip#udfs \
-hdfs:///microsoft-spark-<version>.jar \
-hdfs:///mySparkApp.zip mySparkApp
-```
diff --git a/docs/features.md b/docs/features.md
deleted file mode 100644
index ead022319..000000000
--- a/docs/features.md
+++ /dev/null
@@ -1 +0,0 @@
-# Features
diff --git a/docs/getting-started/macos-instructions.md b/docs/getting-started/macos-instructions.md
deleted file mode 100644
index 2d553a8e6..000000000
--- a/docs/getting-started/macos-instructions.md
+++ /dev/null
@@ -1,85 +0,0 @@
-# Getting Started with Spark .NET on MacOS
-
-These instructions will show you how to run a .NET for Apache Spark app using .NET Core on macOS.
-
-## Pre-requisites
-
-- Download and install **[.NET Core 2.1 SDK](https://dotnet.microsoft.com/download/dotnet-core/2.1)**
-- Install **[Java 8](https://www.oracle.com/technetwork/java/javase/downloads/jdk8-downloads-2133151.html)**
- - Select the appropriate version for your operating system e.g., `jdk-8u231-macosx-x64.dmg`.
- - Install using the installer and verify you are able to run `java` from your command-line
-- Download and install **[Apache Spark 2.4.4](https://archive.apache.org/dist/spark/spark-2.4.4/spark-2.4.4-bin-hadoop2.7.tgz)**:
- Add the necessary environment variable `SPARK_HOME`, e.g., `~/bin/spark-2.4.4-bin-hadoop2.7/`
- ```bash
- export SPARK_HOME=~/bin/spark-2.4.4-bin-hadoop2.7/
- export PATH="$SPARK_HOME/bin:$PATH"
- source ~/.bashrc
- ```
-- Download and install **[Microsoft.Spark.Worker](https://github.com/dotnet/spark/releases)** release:
- - Select a **[Microsoft.Spark.Worker](https://github.com/dotnet/spark/releases)** release from .NET for Apache Spark GitHub Releases page and download into your local machine (e.g., `/bin/Microsoft.Spark.Worker/`).
- **IMPORTANT** Create a new environment variable `DOTNET_WORKER_DIR` (e.g., using `export DOTNET_WORKER_DIR=<path-to-worker>`) and set it to the directory where you downloaded and extracted the Microsoft.Spark.Worker (e.g., `/bin/Microsoft.Spark.Worker/`).
-
-
-## Authoring a .NET for Apache Spark App
-- Use the `dotnet` CLI to create a console application.
- ```
- dotnet new console -o HelloSpark
- ```
-- Install `Microsoft.Spark` Nuget package into the project from the [spark nuget.org feed](https://www.nuget.org/profiles/spark) - see [Ways to install Nuget Package](https://docs.microsoft.com/en-us/nuget/consume-packages/ways-to-install-a-package)
- ```
- cd HelloSpark
- dotnet add package Microsoft.Spark
- ```
-- Replace the contents of the `Program.cs` file with the following code:
- ```csharp
- using Microsoft.Spark.Sql;
-
- namespace HelloSpark
- {
- class Program
- {
- static void Main(string[] args)
- {
- var spark = SparkSession.Builder().GetOrCreate();
- var df = spark.Read().Json("people.json");
- df.Show();
- }
- }
- }
- ```
-- Use the `dotnet` CLI to build the application:
- ```bash
- dotnet build
- ```
-
-## Running your .NET for Apache Spark App
-- Open your terminal and navigate into your app folder:
- ```bash
- cd <your-app-folder-path>
- ```
-- Create `people.json` with the following content:
- ```json
- { "name" : "Michael" }
- { "name" : "Andy", "age" : 30 }
- { "name" : "Justin", "age" : 19 }
- ```
-- Run your app
- ```bash
- spark-submit \
- --class org.apache.spark.deploy.dotnet.DotnetRunner \
- --master local \
- microsoft-spark-<version>.jar \
- dotnet HelloSpark.dll
- ```
- **Note**: This command assumes you have downloaded Apache Spark and added it to your PATH environment variable to be able to use `spark-submit`, otherwise, you would have to use the full path (e.g., `~/spark/bin/spark-submit`).
-
-- The output of the application should look similar to the output below:
- ```text
- +----+-------+
- | age| name|
- +----+-------+
- |null|Michael|
- | 30| Andy|
- | 19| Justin|
- +----+-------+
- ```
diff --git a/docs/getting-started/ubuntu-instructions.md b/docs/getting-started/ubuntu-instructions.md
deleted file mode 100644
index 8f5b2fd6b..000000000
--- a/docs/getting-started/ubuntu-instructions.md
+++ /dev/null
@@ -1,77 +0,0 @@
-# Getting Started with Spark.NET on Ubuntu
-
-These instructions will show you how to run a .NET for Apache Spark app using .NET Core on Ubuntu 18.04.
-
-## Pre-requisites
-
-- Download and install the following: **[.NET Core 3.1 SDK](https://dotnet.microsoft.com/download/dotnet-core/3.1)** | **[OpenJDK 8](https://openjdk.java.net/install/)** | **[Apache Spark 2.4.1](https://archive.apache.org/dist/spark/spark-2.4.1/spark-2.4.1-bin-hadoop2.7.tgz)**
-- Download and install **[Microsoft.Spark.Worker](https://github.com/dotnet/spark/releases)** release:
- - Select a **[Microsoft.Spark.Worker](https://github.com/dotnet/spark/releases)** release from .NET for Apache Spark GitHub Releases page and download into your local machine (e.g., `~/bin/Microsoft.Spark.Worker`).
- - **IMPORTANT** Create a [new environment variable](https://help.ubuntu.com/community/EnvironmentVariables) `DOTNET_WORKER_DIR` and set it to the directory where you downloaded and extracted the Microsoft.Spark.Worker (e.g., `~/bin/Microsoft.Spark.Worker`).
-
-For detailed instructions, you can see [Building .NET for Apache Spark from Source on Ubuntu](../building/ubuntu-instructions.md).
-
-## Authoring a .NET for Apache Spark App
-
-- Use the `dotnet` CLI to create a console application.
- ```shell
- dotnet new console -o HelloSpark
- ```
-- Install `Microsoft.Spark` Nuget package into the project from the [spark nuget.org feed](https://www.nuget.org/profiles/spark) - see [Ways to install Nuget Package](https://docs.microsoft.com/en-us/nuget/consume-packages/ways-to-install-a-package)
- ```shell
- cd HelloSpark
- dotnet add package Microsoft.Spark
- ```
-- Replace the contents of the `Program.cs` file with the following code:
- ```csharp
- using Microsoft.Spark.Sql;
-
- namespace HelloSpark
- {
- class Program
- {
- static void Main(string[] args)
- {
- var spark = SparkSession.Builder().GetOrCreate();
- var df = spark.Read().Json("people.json");
- df.Show();
- }
- }
- }
- ```
-- Use the `dotnet` CLI to build the application:
- ```shell
- dotnet build
- ```
-
-
-## Running your .NET for Apache Spark App
-- Open your terminal and navigate into your app folder.
- ```shell
- cd <your-app-folder-path>
- ```
-- Create `people.json` with the following content:
- ```json
- {"name":"Michael"}
- {"name":"Andy", "age":30}
- {"name":"Justin", "age":19}
- ```
-- Run your app.
- ```shell
- spark-submit \
- --class org.apache.spark.deploy.dotnet.DotnetRunner \
- --master local \
- microsoft-spark-<version>.jar \
- dotnet HelloSpark.dll
- ```
- **Note**: This command assumes you have downloaded Apache Spark and added it to your PATH environment variable to be able to use `spark-submit`, otherwise, you would have to use the full path (e.g., `~/spark/bin/spark-submit`). For detailed instructions, you can see [Building .NET for Apache Spark from Source on Ubuntu](../building/ubuntu-instructions.md).
-- The output of the application should look similar to the output below:
- ```text
- +----+-------+
- | age| name|
- +----+-------+
- |null|Michael|
- | 30| Andy|
- | 19| Justin|
- +----+-------+
- ```
diff --git a/docs/getting-started/windows-instructions.md b/docs/getting-started/windows-instructions.md
deleted file mode 100644
index 7b45987a9..000000000
--- a/docs/getting-started/windows-instructions.md
+++ /dev/null
@@ -1,65 +0,0 @@
-# Getting Started with Spark .NET on Windows
-
-These instructions will show you how to run a .NET for Apache Spark app using .NET Core on Windows.
-
-## Pre-requisites
-
-- Download and install the following: **[.NET Core 3.1 SDK](https://dotnet.microsoft.com/download/dotnet-core/3.1)** | **[Visual Studio 2019](https://www.visualstudio.com/downloads/)** | **[Java 1.8](https://www.oracle.com/technetwork/java/javase/downloads/jdk8-downloads-2133151.html)** | **[Apache Spark 2.4.1](https://archive.apache.org/dist/spark/spark-2.4.1/spark-2.4.1-bin-hadoop2.7.tgz)**
-- Download and install **[Microsoft.Spark.Worker](https://github.com/dotnet/spark/releases)** release:
- - Select a **[Microsoft.Spark.Worker](https://github.com/dotnet/spark/releases)** release from .NET for Apache Spark GitHub Releases page and download into your local machine (e.g., `c:\bin\Microsoft.Spark.Worker\`).
- - **IMPORTANT** Create a [new environment variable](https://www.java.com/en/download/help/path.xml) `DOTNET_WORKER_DIR` and set it to the directory where you downloaded and extracted the Microsoft.Spark.Worker (e.g., `c:\bin\Microsoft.Spark.Worker`).
-
-For detailed instructions, you can see [Building .NET for Apache Spark from Source on Windows](../building/windows-instructions.md).
-
-## Authoring a .NET for Apache Spark App
-- Open Visual Studio -> Create New Project -> Console App (.NET Core) -> Name: `HelloSpark`
-- Install `Microsoft.Spark` Nuget package into the solution from the [spark nuget.org feed](https://www.nuget.org/profiles/spark) - see [Ways to install Nuget Package](https://docs.microsoft.com/en-us/nuget/consume-packages/ways-to-install-a-package)
-- Write the following code into `Program.cs`:
- ```csharp
- using Microsoft.Spark.Sql;
-
- namespace HelloSpark
- {
- class Program
- {
- static void Main(string[] args)
- {
- var spark = SparkSession.Builder().GetOrCreate();
- var df = spark.Read().Json("people.json");
- df.Show();
- }
- }
- }
- ```
-- Build the solution
-
-## Running your .NET for Apache Spark App
-- Open your terminal and navigate into your app folder:
- ```
- cd <your-app-folder-path>
- ```
-- Create `people.json` with the following content:
- ```json
- {"name":"Michael"}
- {"name":"Andy", "age":30}
- {"name":"Justin", "age":19}
- ```
-- Run your app
- ```
- spark-submit `
- --class org.apache.spark.deploy.dotnet.DotnetRunner `
- --master local `
- microsoft-spark-<version>.jar `
- dotnet HelloSpark.dll
- ```
- **Note**: This command assumes you have downloaded Apache Spark and added it to your PATH environment variable to be able to use `spark-submit`, otherwise, you would have to use the full path (e.g., `c:\bin\apache-spark\bin\spark-submit`). For detailed instructions, you can see [Building .NET for Apache Spark from Source on Windows](../building/windows-instructions.md).
-- The output of the application should look similar to the output below:
- ```text
- +----+-------+
- | age| name|
- +----+-------+
- |null|Michael|
- | 30| Andy|
- | 19| Justin|
- +----+-------+
- ```
diff --git a/docs/udf-guide.md b/docs/udf-guide.md
deleted file mode 100644
index 6a2905bf4..000000000
--- a/docs/udf-guide.md
+++ /dev/null
@@ -1,171 +0,0 @@
-# Guide to User-Defined Functions (UDFs)
-
-This is a guide to show how to use UDFs in .NET for Apache Spark.
-
-## What are UDFs
-
-[User-Defined Functions (UDFs)](https://spark.apache.org/docs/latest/api/java/org/apache/spark/sql/expressions/UserDefinedFunction.html) are a feature of Spark that allow developers to use custom functions to extend the system's built-in functionality. They transform values from a single row within a table to produce a single corresponding output value per row based on the logic defined in the UDF.
-
-Let's take the following as an example for a UDF definition:
-
-```csharp
-string s1 = "hello";
-Func<Column, Column> udf = Udf<string, string>(
-    str => $"{s1} {str}");
-
-```
-The above defined UDF takes a `string` as an input (in the form of a [Column](https://github.com/dotnet/spark/blob/master/src/csharp/Microsoft.Spark/Sql/Column.cs#L14) of a [Dataframe](https://github.com/dotnet/spark/blob/master/src/csharp/Microsoft.Spark/Sql/DataFrame.cs#L24)), and returns a `string` with `hello` prepended to the input.
-
-For a sample Dataframe, let's take the following Dataframe `df`:
-
-```text
-+-------+
-| name|
-+-------+
-|Michael|
-| Andy|
-| Justin|
-+-------+
-```
-
-Now let's apply the above defined `udf` to the dataframe `df`:
-
-```csharp
-DataFrame udfResult = df.Select(udf(df["name"]));
-```
-
-This would return the below as the Dataframe `udfResult`:
-
-```text
-+-------------+
-| name|
-+-------------+
-|hello Michael|
-| hello Andy|
-| hello Justin|
-+-------------+
-```
-To get a better understanding of how to implement UDFs, please take a look at the [UDF helper functions](https://github.com/dotnet/spark/blob/master/src/csharp/Microsoft.Spark/Sql/Functions.cs#L3616) and some [test examples](https://github.com/dotnet/spark/blob/master/src/csharp/Microsoft.Spark.E2ETest/UdfTests/UdfSimpleTypesTests.cs#L49).
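-
-As an additional sketch (not from the original guide), a UDF can also take more than one input column via the multi-argument `Udf` overloads and is applied the same way with `Select`; the column names below are hypothetical:
-
-```csharp
-// Sketch only: a two-argument UDF over hypothetical "firstName"/"lastName" columns.
-Func<Column, Column, Column> fullName = Udf<string, string, string>(
-    (first, last) => $"{first} {last}");
-
-DataFrame result = df.Select(fullName(df["firstName"], df["lastName"]));
-```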
-
-## UDF serialization
-
-Since UDFs are functions that need to be executed on the workers, they have to be serialized and sent to the workers as part of the payload from the driver. This involves serializing the [delegate](https://docs.microsoft.com/en-us/dotnet/csharp/programming-guide/delegates/) which is a reference to the method, along with its [target](https://docs.microsoft.com/en-us/dotnet/api/system.delegate.target?view=netframework-4.8) which is the class instance on which the current delegate invokes the instance method. Please take a look at this [code](https://github.com/dotnet/spark/blob/master/src/csharp/Microsoft.Spark/Utils/CommandSerDe.cs#L149) to get a better understanding of how UDF serialization is being done.
-
-## Good to know while implementing UDFs
-
-One behavior to be aware of while implementing UDFs in .NET for Apache Spark is how the target of the UDF gets serialized. .NET for Apache Spark uses .NET Core, which does not support serializing delegates, so it is instead done by using reflection to serialize the target where the delegate is defined. When multiple delegates are defined in a common scope, they have a shared closure that becomes the target of reflection for serialization. Let's take an example to illustrate what that means.
-
-The following code snippet defines two string variables that are being referenced in two function delegates that return the respective strings as result:
-
-```csharp
-using System;
-
-public class C {
- public void M() {
- string s1 = "s1";
- string s2 = "s2";
- Func<string, string> a = str => s1;
- Func<string, string> b = str => s2;
- }
-}
-```
-
-The compiler turns the above C# code into the following, shown here as a C# disassembly (credit: [sharplab.io](https://sharplab.io)):
-
-```csharp
-public class C
-{
- [CompilerGenerated]
- private sealed class <>c__DisplayClass0_0
- {
- public string s1;
-
- public string s2;
-
- internal string <M>b__0(string str)
- {
- return s1;
- }
-
- internal string <M>b__1(string str)
- {
- return s2;
- }
- }
-
- public void M()
- {
- <>c__DisplayClass0_0 <>c__DisplayClass0_ = new <>c__DisplayClass0_0();
- <>c__DisplayClass0_.s1 = "s1";
- <>c__DisplayClass0_.s2 = "s2";
- Func<string, string> func = new Func<string, string>(<>c__DisplayClass0_.<M>b__0);
- Func<string, string> func2 = new Func<string, string>(<>c__DisplayClass0_.<M>b__1);
- }
-}
-```
-As can be seen in the above decompiled code, both `func` and `func2` share the same closure `<>c__DisplayClass0_0`, which is the target that is serialized when serializing the delegates `func` and `func2`. Hence, even though `Func<string, string> a` is only referencing `s1`, `s2` also gets serialized when sending over the bytes to the workers.
-
-This can lead to some unexpected behaviors at runtime (like in the case of using [broadcast variables](broadcast-guide.md)), which is why we recommend restricting the visibility of the variables used in a function to that function's scope.
-
-Going back to the above example, the following is the recommended way to implement the desired behavior of previous code snippet:
-
-```csharp
-using System;
-
-public class C {
- public void M() {
- {
- string s1 = "s1";
- Func<string, string> a = str => s1;
- }
- {
- string s2 = "s2";
- Func<string, string> b = str => s2;
- }
- }
-}
-```
-
-The compiler turns the above C# code into the following, shown here as a C# disassembly (credit: [sharplab.io](https://sharplab.io)):
-
-```csharp
-public class C
-{
- [CompilerGenerated]
- private sealed class <>c__DisplayClass0_0
- {
- public string s1;
-
- internal string <M>b__0(string str)
- {
- return s1;
- }
- }
-
- [CompilerGenerated]
- private sealed class <>c__DisplayClass0_1
- {
- public string s2;
-
- internal string <M>b__1(string str)
- {
- return s2;
- }
- }
-
- public void M()
- {
- <>c__DisplayClass0_0 <>c__DisplayClass0_ = new <>c__DisplayClass0_0();
- <>c__DisplayClass0_.s1 = "s1";
- Func<string, string> func = new Func<string, string>(<>c__DisplayClass0_.<M>b__0);
- <>c__DisplayClass0_1 <>c__DisplayClass0_2 = new <>c__DisplayClass0_1();
- <>c__DisplayClass0_2.s2 = "s2";
- Func<string, string> func2 = new Func<string, string>(<>c__DisplayClass0_2.<M>b__1);
- }
-}
-```
-
-Here we see that `func` and `func2` no longer share a closure and have their own separate closures `<>c__DisplayClass0_0` and `<>c__DisplayClass0_1` respectively. When used as the target for serialization, nothing other than the referenced variables will get serialized for the delegate.
-
-This behavior is important to keep in mind while implementing multiple UDFs in a common scope.
-To learn more about UDFs in general, please review the following articles that explain UDFs and how to use them: [UDFs in databricks(scala)](https://docs.databricks.com/spark/latest/spark-sql/udf-scala.html), [Spark UDFs and some gotchas](https://medium.com/@achilleus/spark-udfs-we-can-use-them-but-should-we-use-them-2c5a561fde6d).
\ No newline at end of file