Releases: oap-project/oap-tools
v1.5.0
OAP 1.5.0 is released with three major components: Gazelle, OAP MLlib and CloudTik. Here are the major features/improvements in OAP 1.5.0.
CloudTik
- Optimize the CloudTik Spark runtime and achieve a 1.4X performance speedup. The Spark optimizations include runtime filter, top-N computation optimization, size-based join reorder, distinct before intersect, flattening of scalar subqueries and single-row aggregate subqueries, and removal of joins for subqueries used in “IN” filtering
- Support Kubernetes and integrate with AWS EKS and GCP GKE
- Design and implement an automatic scale-up and scale-down mechanism
- Support a network topology in which all nodes have only private IPs by utilizing VPC peering, providing a practical and isolated way to run private clusters without a public IP on the head node
- Enhance the CloudTik API for in-cluster usage. This new in-cluster API (prefixed with This) can be used in jobs to query the cluster or drive it to desired states, which is particularly useful in ML/DL job integrations (see the sketch after this list)
- Implement the experimental Spark ML/DL runtime: Horovod on Spark for distributed training, Hyperopt for hyperparameter tuning, and MLflow for training and model management, working together with various deep learning frameworks including TensorFlow, PyTorch and MXNet. The Spark ML/DL runtime supports running distributed training on cloud storage (S3, Azure Data Lake Gen2, GCS)
- Dozens of improvements to usability, features and stability
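The in-cluster API mentioned above can be pictured with a short Python sketch. This is a minimal, hypothetical example: the module path, the `ThisCluster` class name and the method signatures are assumptions for illustration, not the exact CloudTik interface.

```python
# Hypothetical sketch of CloudTik's in-cluster ("This"-prefixed) API.
# Module path, class and method names are assumptions for illustration.
from cloudtik.runtime.api import ThisCluster

cluster = ThisCluster()

# Query the cluster from within a running job.
info = cluster.get_info()
print("workers:", info.get("total-workers", 0))

# Drive the cluster to a desired state before a heavy training stage,
# then block until the requested workers are ready.
cluster.scale(num_workers=8)
cluster.wait_for_ready(min_workers=8)
```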
Gazelle
- Reach 1.57X overall performance speedup vs. vanilla Spark 3.2.1 on TPC-DS with a 5TB dataset on ICX clusters
- Reach 1.77X overall performance speedup vs. vanilla Spark 3.2.1 on TPC-H with a 5TB dataset on ICX clusters
- Support more columnar expressions
- Support dynamic merging of file partitions
- Improve error handling in codegen, making Gazelle more robust
- Fix memory leak issues
- Improve native Parquet read/write performance
- Improve operator performance on Sort/HashAgg/HashJoin
- Achieve performance goals on customer workloads
- Add Spark 3.2.2 support; Gazelle now supports Spark 3.1.1, Spark 3.1.2, Spark 3.1.3, Spark 3.2.1 and Spark 3.2.2 (see the configuration sketch after this list)
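To show where the version support fits in practice, here is a minimal PySpark sketch for enabling Gazelle on Spark 3.2.x. The plugin class and configuration keys follow the Gazelle documentation; treat the exact names, values and the jar path as assumptions to verify for your version.

```python
# Minimal sketch of enabling the Gazelle plugin in a PySpark session.
# The jar path is a placeholder; config keys follow the Gazelle docs.
from pyspark.sql import SparkSession

spark = (
    SparkSession.builder
    .appName("gazelle-example")
    # Gazelle plugin jar built for the matching Spark version.
    .config("spark.driver.extraClassPath", "/path/to/gazelle-plugin.jar")
    .config("spark.executor.extraClassPath", "/path/to/gazelle-plugin.jar")
    .config("spark.plugins", "com.intel.oap.GazellePlugin")
    # Columnar shuffle is required by the native operators.
    .config("spark.shuffle.manager",
            "org.apache.spark.shuffle.sort.ColumnarShuffleManager")
    # Gazelle allocates its columnar batches off-heap.
    .config("spark.memory.offHeap.enabled", "true")
    .config("spark.memory.offHeap.size", "20g")
    .getOrCreate()
)
```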
OAP MLlib
- Reach over 11X performance speedup vs. vanilla Spark with PCA, Linear Regression and Ridge Regression on ICX clusters
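Because OAP MLlib is a drop-in accelerator for the standard Spark MLlib API, the accelerated algorithms are driven by ordinary MLlib code. The PCA sketch below is plain PySpark and assumes only that the OAP MLlib jar is on the classpath; the data is a toy stand-in for the benchmark datasets.

```python
# Standard Spark MLlib PCA; with OAP MLlib on the classpath this same
# code path is transparently accelerated.
from pyspark.ml.feature import PCA
from pyspark.ml.linalg import Vectors
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("pca-example").getOrCreate()

# Toy vectors standing in for a real benchmark dataset.
data = [(Vectors.dense([2.0, 0.0, 3.0, 4.0, 5.0]),),
        (Vectors.dense([4.0, 0.0, 0.0, 6.0, 7.0]),),
        (Vectors.dense([1.0, 2.0, 4.0, 0.0, 3.0]),)]
df = spark.createDataFrame(data, ["features"])

pca = PCA(k=2, inputCol="features", outputCol="pca_features")
model = pca.fit(df)
model.transform(df).select("pca_features").show(truncate=False)
```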
Changelog
Gazelle Plugin
Features
#931 | Reuse partition vectors for arrow scan |
#955 | implement missing expressions |
#1120 | Support aggregation window functions with order by |
#1135 | Supports Spark 3.2.2 shims |
#1114 | Remove tmp directory after application exits |
#862 | implement row_number window function |
#1007 | Document how to test columnar UDF |
#942 | Use hash aggregate for string type input |
Performance
#1144 | Optimize cast WSCG performance |
Bugs Fixed
#1170 | Segfault on data source v2 |
#1164 | Limit the column num in WSCG |
#1166 | Peers' values should be considered in window function for CURRENT ROW in range mode |
#1149 | Vulnerability issues |
#1112 | Validate Error: “Invalid: Length spanned by binary offsets (21) larger than values array (size 20)” |
#1103 | wrong hashagg results |
#929 | Failed to add user extension while using gazelle |
#1100 | Wildcard in json path is not supported |
#1079 | Like function gets wrong result when default escape char is contained |
#1046 | Fall back to use row-based operators, error is makeStructField is unable to parse from conv |
#1053 | Exception when there is function expression in pos or len of substring |
#1024 | ShortType is not supported in ColumnarLiteral |
#1034 | Exception when there is unix_timestamp in CaseWhen |
#1032 | Missing WSCG check for ExistenceJoin |
#1027 | partition by literal in window function |
#1019 | Support more date formats for from_unixtime & unix_timestamp |
#999 | The performance of using ColumnarSort operator to sort string type is significantly lower than that of native spark Sortexec |
#984 | concat_ws |
#958 | JVM/Native R2C and CoalesceBatch process time inaccuracy |
#979 | Failed to find column while reading parquet with case insensitive |
PRs
#1192 | [NSE-1191] fix AQE exchange reuse in Spark3.2 |
#1180 | [NSE-1193] fix jni unload |
#1175 | [NSE-1171] Support merge parquet schema and read missing schema |
#1178 | [NSE-1161][FOLLOWUP] Remove extra compression type check |
#1162 | [NSE-1161] Support read-write parquet conversion to read-write arrow |
#1014 | [NSE-956] allow to write parquet with compression |
#1176 | bump h2/pgsql version |
#1173 | [NSE-1171] Throw RuntimeException when reading duplicate fields in case-insensitive mode |
#1172 | [NSE-1170] Setting correct row number in batch scan w/ partition columns |
#1169 | [NSE-1161] Format sql config string key |
#1167 | [NSE-1166] Cover peers' values in sum window function in range mode |
#1165 | [NSE-1164] Limit the max column num in WSCG |
#1160 | [NSE-1149] upgrade guava to 30.1.1 |
#1158 | [NSE-1149] upgrade guava to 30.1.1 |
#1152 | [NSE-1149] upgrade guava to 24.1.1 |
#1153 | [NSE-1149] upgrade pgsql to 42.3.3 |
#1150 | [NSE-1149] Remove log4j in shims module |
#1146 | [NSE-1135] Introduce shim layer for supporting spark 3.2.2 |
#1145 | [NSE-1144] Optimize cast wscg performance |
#1136 | Remove project from wscg when it's the child of window |
#1122 | [NSE-1120] Support sum window function with order by statement |
#1131 | [NSE-1114] Remove temp directory without FileUtils.forceDeleteOnExit |
#1129 | [NSE-1127] Use larger buffer for hash agg |
#1130 | [NSE-610] fix hashjoin build time metric |
#1126 | [NSE-1125] Add status check for hashing GetOrInsert |
#1056 | [NSE-955] Support window function lag |
#1123 | [NSE-1118] fix codegen on TPCDS Q88 |
#1119 | [NSE-1118] adding more checks for SMJ codegen |
#1058 | [NSE-981] Add a test suite for projection codegen |
#1117 | [NSE-1116] Disable columnar url_decoder |
#1113 | [NSE-1112] Fix Arrow array meta data validating issue when writing parquet files |
#1039 | [NSE-1019] fix codegen for all expressions |
#1115 | [NSE-1114] Remove tmp directory after application exits |
#1111 | remove debug log |
#1098 | [NSE-1108] allow to use different cases in column names |
[#1082](h... |
v1.4.0
Overview
OAP 1.4.0 is released with three major components: Gazelle, OAP MLlib and CloudTik (a new addition). In this release, 59 features/improvements were committed to Gazelle, while work on OAP MLlib is paused to focus more on oneDAL. CloudTik is a cloud-scale platform for distributed analytics and AI on public cloud providers including AWS, Azure and GCP. It enables users and enterprises to easily create and manage an analytics and AI platform on public clouds with out-of-box optimized functionality and performance, so they can focus on running business workloads within minutes or hours instead of spending months constructing and optimizing the platform.
Here are the major features/improvements in OAP 1.4.0.
Gazelle
- Reach 1.6X overall performance vs. vanilla Spark on TPC-DS 103 queries with a 5TB dataset on ICX clusters
- Reach 1.8X overall performance vs. vanilla Spark on TPC-H 22 queries with a 5TB dataset on ICX clusters
- Implement column-wise split by reducer and allocate a large block of memory for all reducers to optimize shuffle
- Optimize Columnar2Row and Row2Columnar performance
- Add support for 10+ expressions
- Pack the classes into one single jar
- Bug fixes for WholeStage Codegen on unsupported patterns, sort spill, etc.
CloudTik
- Scalable, robust, powerful and unified control plane for Cloud cluster and runtime management
- Support the 3 major public cloud providers, AWS, GCP and Azure, with managed cloud storage, plus an on-premise mode
- Support cloud workspaces to manage shared cloud resources including VPCs and subnets, cloud storage, firewalls, identities/roles and so on (see the workflow sketch after this list)
- Out-of-box runtimes including Spark, Presto/Trino, HDFS, Metastore, Kafka, Zookeeper and Ganglia
- Integrate with OAP (Gazelle and MLlib) and various tools to run benchmarks on CloudTik
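The workspace/cluster workflow referenced in this list can be sketched in Python as below. This is a hedged illustration: the `cloudtik.core.api` module and the `Workspace`/`Cluster` methods shown are assumptions based on the description above, not a verified interface, and the YAML paths are placeholders.

```python
# Hypothetical sketch of the CloudTik workspace/cluster workflow.
# Module path, classes and method names are assumptions for illustration.
from cloudtik.core.api import Workspace, Cluster

# A workspace provisions the shared cloud resources (VPC and subnets,
# cloud storage, firewalls, identities/roles) described above.
workspace = Workspace("my-workspace.yaml")
workspace.create()

# A cluster is launched into the workspace; its runtimes (Spark, HDFS,
# Metastore, ...) are declared in the cluster YAML.
cluster = Cluster("my-cluster.yaml")
cluster.start()
cluster.wait_for_ready()

# Tear everything down when the workload finishes.
cluster.stop()
workspace.delete()
```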
OAP MLlib
- Reach over 11X performance vs. vanilla Spark with PCA, Linear Regression and Ridge Regression on ICX clusters
Changelog
Gazelle
Features
#781 | Add spark eventlog analyzer for advanced analyzing |
#927 | Column2Row further enhancement |
#913 | Add Hadoop 3.3 profile to pom.xml |
#869 | implement first agg function |
#926 | Support UDF URLDecoder |
#856 | [SHUFFLE] manually split of Variable length buffer (String likely) |
#886 | Add pmod function support |
#855 | [SHUFFLE] HugePage support in shuffle |
#872 | implement replace function |
#867 | Add substring_index function support |
#818 | Support length, char_length, locate, regexp_extract |
#864 | Enable native parquet write by default |
#828 | CoalesceBatches native implementation |
#800 | Combine datasource and columnar core jar |
Performance
#848 | Optimize Columnar2Row performance |
#943 | Optimize Row2Columnar performance |
#854 | Enable skipping columnarWSCG for queries with small shuffle size |
#857 | [SHUFFLE] split by reducer by column |
Bugs Fixed
#827 | Github action is broken |
#987 | TPC-H q7, q8, q9 run failed when using String for Date |
#892 | Q47 and q57 failed on ubuntu 20.04 OS without open-jdk. |
#784 | Improve Sort Spill |
#788 | Spark UT of "randomSplit on reordered partitions" encountered "Invalid: Map array child array should have no nulls" issue |
#821 | Improve Wholestage Codegen check |
#831 | Support more expression types in getting attribute |
#876 | Write arrow hang with OutputWriter.path |
#891 | Spark executor lost while DatasetFileWriter failed with speculation |
#909 | "INSERT OVERWRITE x SELECT /*+ REPARTITION(2) */ * FROM y LIMIT 2" drains 4 rows into table x using Arrow write extension |
#889 | Failed to write with ParquetFileFormat while using ArrowWriteExtension |
#910 | TPCDS failed, segfault caused by PR903 |
#852 | Unit test fix for NSE-843 |
#843 | ArrowDataSource: Arrow dataset inspect() is called every time a file is read |
PRs
#1005 | [NSE-800] Fix an assembly warning |
#1002 | [NSE-800] Pack the classes into one single jar |
#988 | [NSE-987] fix string date |
#977 | [NSE-126] set default codegen opt to O1 |
#975 | [NSE-927] Add macro AVX512BW check for different CPU architecture |
#962 | [NSE-359] disable unit tests on spark32 package |
#966 | [NSE-913] Add support for Hadoop 3.3.1 when packaging |
#936 | [NSE-943] Optimize IsNULL() function for Row2Columnar |
#937 | [NSE-927] Implement AVX512 optimization selection in Runtime and merge two C2R code files into one. |
#951 | [DNM] update sparklog |
#938 | [NSE-581] implement rlike/regexp_like |
#946 | [DNM] update on sparklog script |
#939 | [NSE-581] adding ShortType/FloatType in ColumnarLiteral |
#934 | [NSE-927] Extract and inline functions for native ColumnartoRow |
#933 | [NSE-581] Improve GetArrayItem(Split()) performance |
#922 | [NSE-912] Remove extra handleSafe costs |
#925 | [NSE-926] Support a UDF: URLDecoder |
#924 | [NSE-927] Enable AVX512 in Binary length calculation for native ColumnartoRow |
#918 | [NSE-856] Optimize of string/binary split |
#908 | [NSE-848] Optimize performance for Column2Row |
#900 | [NSE-869] Add 'first' agg function support |
#917 | [NSE-886] Add pmod expression support |
#916 | [NSE-909] fix slow test |
#915 | [NSE-857] Further optimizations of validity buffer split |
#912 | [NSE-909] "INSERT OVERWRITE x SELECT /*+ REPARTITION(2) */ * FROM y L… |
#896 | [NSE-889] Failed to write with ParquetFileFormat while using ArrowWriteExtension |
#911 | [NSE-910] fix bug of PR903 |
#901 | [NSE-891] Spark executor lost while D... |
v1.3.1
Overview
OAP 1.3.1 is a maintenance release and contains two major components: Gazelle and OAP MLlib. In this release, 51 issues/improvements were committed.
Here are the major features/improvements in OAP 1.3.1.
Gazelle (Native SQL Engine)
- Reach 1.5X overall performance vs. vanilla Spark on TPC-DS 103 queries with 5TB dataset on ICX clusters
- Reach 1.7X overall performance vs. vanilla Spark on TPC-H 22 queries with 3TB dataset on ICX clusters
- Support Spark-3.1.1, Spark-3.1.2, Spark-3.1.3, Spark-3.2.0 and Spark-3.2.1
- Support rand expression and complex types for ColumnarSortExec
- Refactor on shuffled hash join/hash agg
- Bug fixes for SMJ, memory allocation in row-to-columnar, etc.
OAP MLlib
- Reach over 12X performance vs. vanilla Spark using PCA, Linear Regression and Ridge Regression on ICX clusters
- Support Spark-3.1.1, Spark-3.1.2, Spark-3.1.3, Spark-3.2.0, Spark-3.2.1 and CDH Spark
- Bump Intel oneAPI Base Toolkit to 2022.1.2
Changelog
Gazelle Plugin
Features
#710 | Add rand expression support |
#745 | improve codegen check |
#761 | Update the document to reflect the changes in build and deployment |
#635 | Document the incompatibility with Spark on Expressions |
#702 | Print output datatype for columnar shuffle on WebUI |
#712 | [Nested type] Optimize Array split and support nested Array |
#732 | [Nested type] Support Struct and Map nested types in Shuffle |
#759 | Add spark 3.1.2 & 3.1.3 as supported versions for 3.1.1 shim layer |
Performance
#610 | refactor on shuffled hash join/hash agg |
Bugs Fixed
#755 | GetAttrFromExpr unsupported issue when run TPCDS Q57 |
#764 | add java.version to clarify jdk version |
#774 | Fix runtime issues on spark 3.2 |
#778 | Failed to find include file while running code gen |
#725 | gazelle failed to run with spark local |
#746 | Improve memory allocation on native row to column operator |
#770 | There are cast exceptions and null pointer exceptions in spark-3.2 |
#772 | ColumnarBatchScan name missing in UI for Spark321 |
#740 | Handle exceptions like std::out_of_range in casting string to numeric types in WSCG |
#727 | Create table failed with TPCH partition dataset |
#719 | Wrong result on TPC-DS Q38, Q87 |
#705 | Two unit tests failed on master branch |
PRs
#834 | [NSE-746]Fix memory allocation in row to columnar |
#809 | [NSE-746]Fix memory allocation in row to columnar |
#817 | [NSE-761] Update document to reflect spark 3.2.x support |
#805 | [NSE-772] Code refactor for ColumnarBatchScan |
#802 | [NSE-794] Fix count() with decimal value |
#779 | [NSE-778] Failed to find include file while running code gen |
#798 | [NSE-795] Fix a consecutive SMJ issue in wscg |
#799 | [NSE-791] fix xchg reuse in Spark321 |
#773 | [NSE-770] [NSE-774] Fix runtime issues on spark 3.2 |
#787 | [NSE-774] Fallback broadcast exchange for DPP to reuse |
#763 | [NSE-762] Add complex types support for ColumnarSortExec |
#783 | [NSE-782] prepare 1.3.1 release |
#777 | [NSE-732]Adding new config to enable/disable complex data type support |
#776 | [NSE-770] [NSE-774] Fix runtime issues on spark 3.2 |
#765 | [NSE-764] declare java.version for maven |
#767 | [NSE-610] fix unit tests on SHJ |
#760 | [NSE-759] Add spark 3.1.2 & 3.1.3 as supported versions for 3.1.1 shim layer |
#757 | [NSE-746]Fix memory allocation in row to columnar |
#724 | [NSE-725] change the code style for ExecutorManger |
#751 | [NSE-745] Improve codegen check for expression |
#742 | [NSE-359] [NSE-273] Introduce shim layer to fix compatibility issues for gazelle on spark 3.1 & 3.2 |
#754 | [NSE-755] Quick fix for ConverterUtils.getAttrFromExpr for TPCDS queries |
#749 | [NSE-732] Support Map complex type in Shuffle |
#738 | [NSE-610] hashjoin opt1 |
#733 | [NSE-732] Support Struct complex type in Shuffle |
#744 | [NSE-740] fix codegen with out_of_range check |
#743 | [NSE-740] Catch out_of_range exception in casting string to numeric types in wscg |
#735 | [NSE-610] hashagg opt#2 |
#707 | [NSE-710] Add rand expression support |
#734 | [NSE-727] Create table failed with TPCH partition dataset, patch 2 |
#715 | [NSE-610] hashagg opt#1 |
#731 | [NSE-727] Create table failed with TPCH partition dataset |
#713 | [NSE-712] Optimize Array split and support nested Array |
#721 | [NSE-719][backport]fix null check in SMJ |
#720 | [NSE-719] fix null check in SMJ |
#718 | Following NSE-702, fix for AQE enabled case |
#691 | [NSE-687]Try to upgrade log4j |
#703 | [NSE-702] Print output datatype for columnar shuffle on WebUI |
#706 | [NSE-705] Fallback R2C on unsupported cases |
#657 | [NSE-635] Add document to clarify incompatibility issues in expressions |
#623 | [NSE-602] Fix Array type shuffle split segmentation fault |
#693 | [NSE-692] JoinBenchmark is broken |
OAP MLlib
Features
#189 | Intel-MLlib not support spark-3.2.1 version |
#186 | [Core] Support CDH versions |
#187 | Intel-MLlib not support spark-3.1.3 version. |
[#180](https://github.c... |
v1.3.0
Overview
OAP 1.3.0 contains two major components: Gazelle and OAP MLlib. In this release, 74 issues/improvements were committed.
Here are the major features/improvements in OAP 1.3.0.
Gazelle (Native SQL Engine):
- Further optimization to gain 1.5X overall performance on TPC-DS 103 queries;
- Further optimization to gain 1.7X overall performance on TPC-H 22 queries;
- Add ORC Read support;
- Add Native Row to Column support;
- Add 10+ expressions support;
- Complete function & performance evaluation on GCP Dataproc platform;
- Bug fixes for Columnar WholeStage Codegen, Columnar Sort operators, etc.
OAP MLlib:
- Support Covariance & Low Order Moments algorithm optimizations for CPU;
- Gain over 8x training performance for the Covariance algorithm;
- Add support for multiple Spark versions (Spark 3.1.1 and Spark 3.2).
Changelog
Gazelle Plugin
Features
#550 | [ORC] Support ORC Format Reading |
#188 | Support Dockerfile |
#574 | implement native LocalLimit/GlobalLimit |
#684 | BufferedOutputStream causes massive futex system calls |
#465 | Provide option to rely on JVM GC to release Arrow buffers in Java |
#681 | Enable gazelle to support two math expressions: ceil & floor |
#651 | Set Hadoop 3.2 as default in pom.xml |
#126 | speed up codegen |
#596 | [ORC] Verify whether ORC file format supported complex data types in gazelle |
#581 | implement regex/trim/split expr |
#473 | Optimize the ArrowColumnarToRow performance |
#647 | Leverage buffered write in shuffle |
#674 | Add translate expression support |
#675 | Add instr expression support |
#645 | Add support to cast data in bool type to bigint type or string type |
#463 | version bump on 1.3 |
#583 | implement get_json_object |
#640 | Disable compression for tiny payloads in shuffle |
#631 | Do not write schema in shuffle writing |
#609 | Implement date related expression like to_date, date_sub |
#629 | Improve codegen failure handling |
#612 | Add metric "prepare time" for shuffle writer |
#576 | columnar FROM_UNIXTIME |
#589 | [ORC] Add TPCDS and TPCH UTs for ORC Format Reading |
#537 | Increase partition number adaptively for large SHJ stages |
#580 | document how to create metadata for data source V1 based testing |
#555 | support batch size > 32k |
#561 | document the code generation behavior on driver, suggest to deploy driver on powerful server |
#523 | Support ArrayType in ArrowColumnarToRow operator |
#542 | Add rule to propagate local window for rank + filter pattern |
#21 | JNI: Unexpected behavior when executing codes after calling JNIEnv::ThrowNew |
#512 | Add strategy to force use of SHJ |
#518 | Arrow buffer cleanup: Support both manual release and auto release as a hybrid mode |
#525 | Support AQE in columnWriter |
#516 | Support External Sort in sort kernel |
#503 | Can you provide the detailed configuration used for the official performance tests? |
#501 | Remove ArrowRecordBatchBuilder and its usages |
#461 | Support ArrayType in Gazelle |
#479 | Optimize sort materialization |
#449 | Refactor sort codegen kernel |
#667 | 1.3 RC release |
#352 | Map/Array/Struct type support for Parquet reading in Arrow Data Source |
Bugs Fixed
#660 | support string builder in window output |
#636 | Remove log4j 1.2 Support for security issue |
#540 | reuse subquery in TPC-DS Q14a |
#687 | log4j 1.2.17 in spark-core |
#617 | Exceptions handling for stoi, stol, stof, stod in whole stage code gen |
#653 | Handle special cases for get_json_object in WSCG |
#650 | Scala test ArrowColumnarBatchSerializerSuite is failing |
#642 | Fail to cast unresolved reference to attribute reference |
#599 | data source unit tests are broken |
#604 | Sort with special projection key broken |
#627 | adding security instructions |
#615 | An exception in trying to cast attribute in getResultAttrFromExpr of ConverterUtils |
#588 | preallocated memory for shuffle split |
#606 | NullpointerException getting map values from ArrowWritableColumnVector |
#569 | CPU overhead on fine grain / concurrent off-heap acquire operations |
#553 | Support casting string type to types like int, bigint, float, double |
#514 | Fix the core dump issue in Q93 when enable columnar2row |
#532 | Fix the failed UTs in ArrowEvalPythonExecSuite when enable ArrowColumnarToRow |
#534 | Columnar SHJ: Error if probing with empty record batch |
#529 | Wrong build side may be chosen for SemiJoin when forcing use of SHJ |
#504 | Fix non-decimal window function unit test failures |
#493 | Three unit tests newly failed on master branch |
PRs
#690 | [NSE-667] backport patches to 1.3 branch |
#688 | [NSE-687]remove exclude log4j when running ut |
#686 | [NSE-400] Fix the bug for negative decimal data |
#685 | [NSE-684] BufferedOutputStream causes massive futex system calls |
#680 | [NSE-667] backport patches to 1.3 branch |
#683 | [NSE-400] fix leakage in row to column operator |
#637 | [NSE-400] Native Arrow Row to columnar support |
#648 | [NSE-647] Leverage buffered write in shuffle |
#682 | [NSE-681] Add floor & ceil expression support |
#672 | [NSE-674] Add translate expression support |
#676 | [NSE-675] Add instr expression support |
#652 | [NSE-651]Use Hadoop 3.2 as default hadoop.version |
#666 | [NSE-667] backport patches to 1.3 branch |
#644 | [NSE-645] Add support to cast bool type to bigint type & string type |
[#659](oap-project/gazelle_plugin#65... |
v1.2.1
OAP 1.2.1 is a bug-fix release that removes the log4j 2 dependency; it contains no new features or code changes.
OAP 1.2.1 has been updated to remove Log4j version 2.13.3 and may not include the latest functional and security updates. OAP 1.3 is targeted to be released in January 2022 and will include additional functional and/or security updates. Customers should update to the latest version as it becomes available.
v1.2.0
OAP 1.2.0 is the second release since we reorganized the source code and transitioned to the dedicated oap-project organization (https://github.com/oap-project). In this release, 36 issues/improvements were committed. We also completed the first round of cloud integration evaluation for Native SQL Engine, OAP MLlib and SQL Data Source Cache on different cloud platforms, including AWS EMR, GCP Dataproc and AWS EKS. The performance data is not yet promising, and 5 cloud integration functionality and performance issues were postponed. In addition, we released two other dedicated Conda packages for EMR and Dataproc cloud integration.
Here are the major features/improvements in OAP 1.2.0:
- SQL Data Source Cache adds K8S support, with a Unix domain socket deployment limitation remaining.
- Native SQL Engine gains 25% overall performance from further optimization on TPC-DS 103 queries; adds RDD cache support; adds spill & UDF support; completes the Column-to-Row optimization feature; and further enhances performance stability.
- OAP MLlib further enhances the performance stability of KMeans, PCA & ALS; supports Linear Regression & Bayes algorithm optimization for CPU, gaining over 10x training performance for the Linear Regression algorithm; and completes KMeans & PCA algorithm support for GPU.
- PMem Shuffle adds Remote PMem pool feature (experimental).
Gazelle Plugin
Features
#394 | Support ColumnarArrowEvalPython operator |
#368 | Encountered Hadoop version (3.2.1) conflict issue on AWS EMR-6.3.0 |
#375 | Implement a series of datetime functions |
#183 | Add Date/Timestamp type support |
#362 | make arrow-unsafe allocator as the default |
#343 | configurable codegen opt level |
#333 | Arrow Data Source: CSV format support fix |
#223 | Add Parquet write support to Arrow data source |
#320 | Add build option to enable unsafe Arrow allocator |
#337 | UDF: Add test case for validating basic row-based udf |
#326 | Update Scala unit test to spark-3.1.1 |
Performance
#400 | Optimize ColumnarToRow Operator in NSE. |
#411 | enable ccache on C++ code compiling |
Bugs Fixed
#358 | Running TPC DS all queries with native-sql-engine for 10 rounds will have performance degradation problems in the last few rounds |
#481 | JVM heap memory leak on memory leak tracker facilities |
#436 | Fix for Arrow Data Source test suite |
#317 | persistent memory cache issue |
#382 | Hadoop version conflict when supporting to use gazelle_plugin on Google Cloud Dataproc |
#384 | ColumnarBatchScanExec reading parquet failed on java.lang.IllegalArgumentException: not all nodes and buffers were consumed |
#370 | Failed to get time zone: NoSuchElementException: None.get |
#360 | Cannot compile master branch. |
#341 | build failed on v2 with -Phadoop-3.2 |
PRs
#489 | [NSE-481] JVM heap memory leak on memory leak tracker facilities (Arrow Allocator) |
#486 | [NSE-475] restore coalescebatches operator before window |
#482 | [NSE-481] JVM heap memory leak on memory leak tracker facilities |
#470 | [NSE-469] Lazy Read: Iterator objects are not correctly released |
#464 | [NSE-460] fix decimal partial sum in 1.2 branch |
#439 | [NSE-433]Support pre-built Jemalloc |
#453 | [NSE-254] remove arrow-data-source-common from jar with dependency |
#452 | [NSE-254]Fix redundant arrow library issue. |
#432 | [NSE-429] TPC-DS Q14a/b get slowed down within setting spark.oap.sql.columnar.sortmergejoin.lazyread=true |
#426 | [NSE-207] Fix aggregate and refresh UT test script |
#442 | [NSE-254]Issue0410 jar size |
#441 | [NSE-254]Issue0410 jar size |
#440 | [NSE-254]Solve the redundant arrow library issue |
#437 | [NSE-436] Fix for Arrow Data Source test suite |
#387 | [NSE-383] Release SMJ input data immediately after being used |
#423 | [NSE-417] fix sort spill on inplace sort |
#416 | [NSE-207] fix left/right outer join in SMJ |
#422 | [NSE-421]Disable the wholestagecodegen feature for the ArrowColumnarToRow operator |
#369 | [NSE-417] Sort spill support framework |
#401 | [NSE-400] Optimize ColumnarToRow Operator in NSE. |
#413 | [NSE-411] adding ccache support |
#393 | [NSE-207] fix scala unit tests |
#407 | [NSE-403]Add Dataproc integration section to README |
#406 | [NSE-404]Modify repo name in documents |
#402 | [NSE-368]Update emr-6.3.0 support |
#395 | [NSE-394]Support ColumnarArrowEvalPython operator |
#346 | [NSE-317]fix columnar cache |
#392 | [NSE-382]Support GCP Dataproc 2.0 |
#388 | [NSE-382]Fix Hadoop version issue |
#385 | [NSE-384] "Select count(*)" without group by results in error: java.lang.IllegalArgumentException: not all nodes and buffers were consumed |
#374 | [NSE-207] fix left anti join and support filter wo/ project |
#376 | [NSE-375] Implement a series of datetime functions |
#373 | [NSE-183] fix timestamp in native side |
#356 | [NSE-207] fix issues found in scala unit tests |
#371 | [NSE-370] Failed to get time zone: NoSuchElementException: None.get |
#347 | [NSE-183] Add Date/Timestamp type support |
#363 | [NSE-362] use arrow-unsafe allocator by default |
#361 | [NSE-273] Spark shim layer infrastructure |
#364 | [NSE-360] fix ut compile and travis test |
#264 | [NSE-207] fix issues found from join unit tests |
#344 | [NSE-343]allow to config codegen opt level |
#342 | [NSE-341] fix maven build failure |
#324 | [NSE-223] Add Parquet write support to Arrow data source |
#321 | [NSE-320] Add build option to enable unsafe Arrow allocator |
#299 | [NSE-207] fix unsuppored types in aggregate |
#338 | [NSE-337] UDF: Add test case for validating basic row-based udf |
#336 | [NSE-333] Arrow Data Source: CSV format support fix |
#327 | [NSE-326] update scala unit tests to spark-3.1.1 |
OAP MLlib
Features
#110 | Update isOAPEnabled for Kmeans, PCA & ALS |
[#108](https://github.com/oap-project/oap-m... |
v1.1.1-spark-3.1.1
OAP 1.1.1 is a maintenance release based on OAP 1.1.0. The major work in this release was supporting Spark 3.1.1; 24 issues/improvements were committed.
Release 1.1.1
Native SQL Engine
Features
#304 | Upgrade to Arrow 4.0.0 |
#285 | ColumnarWindow: Support Date/Timestamp input in MAX/MIN |
#297 | Disable incremental compiler in CI |
#245 | Support columnar rdd cache |
#276 | Add option to switch Hadoop version |
#274 | Comment to trigger tpc-h RAM test |
#256 | CI: do not run ram report for each PR |
Bugs Fixed
#325 | java.util.ConcurrentModificationException: mutation occurred during iteration |
#329 | numPartitions are not the same |
#318 | fix Spark 311 on data source v2 |
#311 | Build reports errors |
#302 | test on v2 failed due to an exception |
#257 | different version of slf4j-log4j |
#293 | Fix BHJ loss if key = 0 |
#248 | arrow dependency must put after arrow installation |
PRs
#332 | [NSE-325] fix incremental compile issue with 4.5.x scala-maven-plugin |
#335 | [NSE-329] fix out partitioning in BHJ and SHJ |
#328 | [NSE-318]check schema before reuse exchange |
#307 | [NSE-304] Upgrade to Arrow 4.0.0 |
#312 | [NSE-311] Build reports errors |
#272 | [NSE-273] support spark311 |
#303 | [NSE-302] fix v2 test |
#306 | [NSE-304] Upgrade to Arrow 4.0.0: Change basic GHA TPC-H test target … |
#286 | [NSE-285] ColumnarWindow: Support Date input in MAX/MIN |
#298 | [NSE-297] Disable incremental compiler in GHA CI |
#291 | [NSE-257] fix multiple slf4j bindings |
#294 | [NSE-293] fix unsafemap with key = '0' |
#233 | [NSE-207] fix issues found from aggregate unit tests |
#246 | [NSE-245]Adding columnar RDD cache support |
#289 | [NSE-206]Update installation guide and configuration guide. |
#277 | [NSE-276] Add option to switch Hadoop version |
#275 | [NSE-274] Comment to trigger tpc-h RAM test |
#271 | [NSE-196] clean up configs in unit tests |
#258 | [NSE-257] fix different versions of slf4j-log4j12 |
#259 | [NSE-248] fix arrow dependency order |
#249 | [NSE-241] fix hashagg result length |
#255 | [NSE-256] do not run ram report test on each PR |
SQL Data Source Cache
Features
#118 | port to Spark 3.1.1 |
Bugs Fixed
#121 | OAP Index creation stuck issue |
PRs
#132 | Fix SampleBasedStatisticsSuite UnitTest case |
#122 | [sql-ds-cache-121] Fix Index stuck issues |
#119 | [SQL-DS-CACHE-118][POAE7-1130] port sql-ds-cache to Spark3.1.1 |
OAP MLlib
Features
#26 | [PIP] Support Spark 3.0.1 / 3.0.2 and upcoming 3.1.1 |
PRs
#39 | [ML-26] Build for different spark version by -Pprofile |
PMem Spill
Features
#34 | Support vanilla spark 3.1.1 |
PRs
#41 | [PMEM-SPILL-34][POAE7-1119]Port RDD cache to Spark 3.1.1 as separate module |
PMem Common
Features
#10 | add -mclflushopt flag to enable clflushopt for gcc |
#8 | use clflushopt instead of clflush |
PRs
#11 | [PMEM-COMMON-10][POAE7-1010]Add -mclflushopt flag to enable clflushop… |
#9 | [PMEM-COMMON-8][POAE7-896]use clflush optimize version for clflush |
PMem Shuffle
Features
#15 | Doesn't work with Spark3.1.1 |
PRs
#16 | [pmem-shuffle-15] Make pmem-shuffle support Spark3.1.1 |
Remote Shuffle
Features
#18 | upgrade to Spark-3.1.1 |
#11 | Support DAOS Object Async API |
PRs
#19 | [REMOTE-SHUFFLE-18] upgrade to Spark-3.1.1 |
#14 | [REMOTE-SHUFFLE-11] Support DAOS Object Async API |
v1.1.0-spark-3.0.0
OAP 1.1.0 is the first release after the OAP repository's transition to the dedicated oap-project GitHub organization. In this release, 85 issues/improvements were committed, and we also provide website documentation.
Here are the major features/improvements in OAP 1.1.0:
- OAP MLlib supports the PCA & ALS algorithms based on the latest oneCCL version, gaining 2x-3x training performance with good stability.
- Native SQL Engine gains over 30% overall performance on the TPC-DS 103-query benchmark with partitioned tables (tpcds-kit 2.4), and supports DecimalType.
- SQL Data Source Cache improves the Plasma Backend stability.
- PMem Spill adds FSDAX mode support.
- Remote Shuffle adds DAOS support (experimental).
Native SQL Engine
Features
#261 | ArrowDataSource: Add S3 Support |
#239 | Adopt ARROW-7011 |
#62 | Support Arrow's Build from Source and Package dependency library in the jar |
#145 | Support decimal in columnar window |
#31 | Decimal data type support |
#128 | Support Decimal in Aggregate |
#130 | Support decimal in project |
#134 | Update input metrics during reading |
#120 | Columnar window: Reduce peak memory usage and fix performance issues |
#108 | Add end-to-end test suite against TPC-DS |
#68 | Adaptive compression select in Shuffle. |
#97 | optimize null check in codegen sort |
#29 | Support multiple-key sort without codegen |
#75 | Support HashAggregate in ColumnarWSCG |
#73 | improve columnar SMJ |
#51 | Decimal fallback |
#38 | Supporting expression as join keys in columnar SMJ |
#27 | Support REUSE exchange when DPP enabled |
#17 | ColumnarWSCG further optimization |
Performance
#194 | Arrow Parameters Update when compiling Arrow |
#136 | upgrade to arrow 3.0 |
#103 | reduce codegen in multiple-key sort |
#90 | Refine HashAggregate to do everything in CPP |
Bugs Fixed
#278 | fix arrow dep in 1.1 branch |
#265 | TPC-DS Q67 failed with memmove exception in native split code. |
#280 | CMake version check |
#241 | TPC-DS q67 failed for XXH3_hashLong_64b_withSecret.constprop.0+0x180 |
#262 | q18 has different digits compared with vanilla spark |
#196 | clean up options for native sql engine |
#224 | update 3rd party libs |
#227 | fix vulnerabilities from klockwork |
#237 | Add ARROW_CSV=ON to default C++ build commands |
#229 | Fix the deprecated code warning in shuffle_split_test |
#119 | consolidate batch size |
#217 | TPC-H query20 result not correct when use decimal dataset |
#211 | IndexOutOfBoundsException during running TPC-DS Q2 |
#167 | Cannot successfully run q.14a.sql and q14b.sql when using double format for TPC-DS workload. |
#191 | libarrow.so and libgandiva.so not copy into the tmp directory |
#179 | Unable to find Arrow headers during build |
#153 | Fix incorrect queries after enabled Decimal |
#173 | fix the incorrect result of q69 |
#48 | unit tests for c++ are broken |
#101 | ColumnarWindow: Remove obsolete debug code |
#100 | Incorrect result in Q45 w/ v2 bhj threshold is 10MB sf500 |
#81 | Some ArrowVectorWriter implementations doesn't implement setNulls method |
#82 | Incorrect result in TPCDS Q72 SF1536 |
#70 | Duplicate IsNull check in codegen sort |
#64 | Memleak in sort when SMJ is disabled |
#58 | Issues when running tpcds with DPP enabled and AQE disabled |
#52 | memory leakage in columnar SMJ |
#53 | Q24a/Q24b SHJ tail task took about 50 secs in SF1500 |
#42 | reduce columnar sort memory footprint |
#40 | columnar sort codegen fallback to executor side |
#1 | columnar whole stage codegen failed due to empty results |
#23 | TPC-DS Q8 failed due to unsupported operation in columnar sortmergejoin |
#22 | TPC-DS Q95 failed due in columnar wscg |
#4 | columnar BHJ failed on new memory pool |
#5 | columnar BHJ failed on partitioned table with prefercolumnar=false |
PRs
#288 | [NSE-119] clean up on comments |
#282 | [NSE-280]fix cmake version check |
#281 | [NSE-280] bump cmake to 3.16 |
#279 | [NSE-278]fix arrow dep in 1.1 branch |
#268 | [NSE-186] backport to 1.1 branch |
#266 | [NSE-265] Reserve enough memory before UnsafeAppend in builder |
#270 | [NSE-261] ArrowDataSource: Add S3 Support |
#263 | [NSE-262] fix remainer loss in decimal divide |
#215 | [NSE-196] clean up native sql options |
#231 | [NSE-176]Arrow install order issue |
#242 | [NSE-224] update third party code |
#240 | [NSE-239] Adopt ARROW-7011 |
#238 | [NSE-237] Add ARROW_CSV=ON to default C++ build commands |
#230 | [NSE-229] Fix the deprecated code warning in shuffle_split_test |
#225 | [NSE-227]fix issues from codescan |
#219 | [NSE-217] fix missing decimal check |
#212 | [NSE-211] IndexOutOfBoundsException during running TPC-DS Q2 |
#187 | [NSE-185] Avoid unnecessary copying when simply projecting on fields |
#195 | [NSE-194]Turn on several Arrow parameters |
#189 | [NSE-153] Following NSE-153, optimize fallback conditions for columnar window |
#192 | [NSE-191]Fix issue0191 for .so file copy to tmp. |
#181 | [NSE-179]Fix arrow include directory not include when using ARROW_ROOT |
[#175](https://github... |