The goal of the .NET for Apache Spark project is to provide an easy-to-use, .NET-friendly integration with the popular big data platform, Apache Spark. This document describes the tentative plan for the project in the short and long term.
.NET for Apache Spark is a community effort and we welcome community feedback on our plans. The best way to give feedback is to open an issue in this repo. We are also excited to receive contributions (check out the contribution guide). It's always a good idea to open an issue for discussion before embarking on a large code change, to make sure effort is not duplicated. Where we know that efforts are already underway, we have used the (*) marker below.
- 1:1 API compatibility for DataFrames with Apache Spark 2.3.x, Apache Spark 2.4.x and Apache Spark 3.0.x (*)
- Improvements to C# Pickling Library
- Improvements to Arrow .NET Library
- Exploiting .NET Vectorization (*)
- Micro-benchmarking framework for Interop
- Benchmarking scripts for all languages that include generating the dataset and running queries against it (*)
- Published reproducible benchmarks against TPC-H (industry-standard database benchmark) (*)
- VS Code support (*)
- Jupyter integration with C# & F# notebook support (*)
- Improved user experience for .NET app submission to a remote Spark cluster
- Idiomatic C# and F# APIs
- Contribute extensible interop layer to Apache Spark
- Published reproducible benchmarks against TPC-DS (industry-standard database benchmark)
- Visual Studio Extension for .NET app submission to a remote Spark cluster
- Visual Studio Extension for .NET app debugging
- Make it easy to copy and paste Scala examples into Visual Studio