Releases: apache/incubator-gluten
Gluten v1.0.0
Release Notes - Gluten - Version 1.0.0
Highlights (Velox backend only)
- Support Spark 3.2 and Spark 3.3
- Pass all Velox and Spark 3.2 UTs, and part of the Spark 3.3 UTs
- Support Ubuntu 20.04/22.04, CentOS 7/8, alinux 3, Anolis 7/8
- Support FileSystem: localfs, HDFS, S3, OSS (via s3a)
- Support data types: primitive types, Decimal, Date, Timestamp
- Support 20 operators, detail here
- Support 164 functions, detail here
- Support native Parquet write
- Support native ORC read
- Support Intel® In-memory Analytics Accelerator (IAA/IAX) hardware accelerator in Shuffle compression
- Support cap-based spill (static memory allocation) for join/agg/sort operator (experimental feature)
- Support static build method via vcpkg
- Support local cache (experimental feature)
- 2.71x speedup in Decision Support Benchmark1 (TPC-H Like) testing
- 2.29x speedup in Decision Support Benchmark2 (TPC-DS Like) testing
- Velox code updated to commit
- Documentation improvements for supported features and configuration
Known Issues
- Parquet write only supports the compression.codec, parquet.block.size, and parquet.block.rows configurations
- Velox backend does not support dynamic partition write and bucket write
- Spill may throw OutOfMemoryException
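Because native Parquet write honors only three configuration keys, a job may want to check its writer options before relying on it. The following is a minimal sketch of such a check, not part of Gluten. The helper name `split_write_options` is hypothetical, the three key names are copied verbatim from the Known Issues entry above, and matching on the key suffix (so that a prefixed key is also recognized) is an assumption about how the keys might be spelled in a job.

```python
# Keys named in the Known Issues entry; copied verbatim from the release notes.
SUPPORTED_PARQUET_WRITE_KEYS = frozenset(
    {"compression.codec", "parquet.block.size", "parquet.block.rows"}
)


def split_write_options(options):
    """Partition writer options into (supported, unsupported) dicts.

    Hypothetical helper. A key counts as supported when it equals one of the
    listed names or ends with "." + name (assumption: prefixed spellings such
    as "spark.sql.parquet.compression.codec" refer to the same setting).
    """
    supported, unsupported = {}, {}
    for key, value in options.items():
        listed = any(
            key == name or key.endswith("." + name)
            for name in SUPPORTED_PARQUET_WRITE_KEYS
        )
        (supported if listed else unsupported)[key] = value
    return supported, unsupported


if __name__ == "__main__":
    # "parquet.page.size" is used here only as an example of a key that is
    # not in the supported list.
    opts = {
        "spark.sql.parquet.compression.codec": "snappy",
        "parquet.block.size": "134217728",
        "parquet.page.size": "1048576",
    }
    ok, other = split_write_options(opts)
    print("supported:", sorted(ok))
    print("not listed as supported:", sorted(other))
```

Any key that lands in the second dictionary is one the notes do not list as supported, so it is worth reviewing before enabling native write.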
New Features
- [GLUTEN-1243][VL] Support bit_xor aggregate function
- [GLUTEN-1245][VL][Feat] Add VeloxParquetFileFormat to support parquet write in velox backend
- [GLUTEN-1270][VL][Feat] Support multiple HDFS endpoints
- [GLUTEN-1306][VL] feat: Link static depends via vcpkg
- [GLUTEN-1306][FOLLOWUP] vcpkg setup script add alinux3 support
- [GLUTEN-1346][VL] Support native velox row to column
- [GLUTEN-1367] Support running gluten on anolis
- [GLUTEN-1371][VL] Support First/Last aggregate functions
- [GLUTEN-1374][VL] RangePartitioning supports velox columnar batch
- [GLUTEN-1409][VL] feat: Support named_struct in Velox backend
- [GLUTEN-1476][VL] Support GetStructField
- [GLUTEN-1478] Support ordered result check for MapData
- [GLUTEN-1490] refactor substrait literals using generics, and support map/struct/array literals based on it
- [GLUTEN-1521][Core] Support to add the customer columnar rules by config
- [GLUTEN-1623][VL] Support asinh, acosh, atanh, sec, csc math functions for Velox backend
- [GLUTEN-1638][VL] feat: Add hdfs support in parquet write
- [GLUTEN-1640] Support judging whether the execution plan has a fallback
- [GLUTEN-1654][VL] support approx_count_distinct for velox
- [GLUTEN-1658][CORE] feat: Support SparkResourcesUtil.scala in k8s
- [GLUTEN-1662][VL] feat: Support InsertIntoHiveDirCommand in velox parquet write
- [GLUTEN-1704][VL] Support metrics on splits and row groups by
- [GLUTEN-1794][VL] support split preload
- [GLUTEN-1860] StructLiteral support null literal
- [CORE] Support submit subqueries concurrently to improve scalar subquery performance
- [VL] package.sh support centos7 and centos8
- [VL] feat: support partial merge phase in aggregation
- [VL] package and velox scripts add alinux support
- [VL] feat: support more distinct functions
- [VL] Support mocking map stage with no input files in micro benchmark
- [VL] add support for reading ORC
- [VL] add long decimal type support for Orc file format
Improvements
- [GLUTEN-842][VL] convert expand op to expand exec in velox
- [GLUTEN-842] remove group id transformer
- [GLUTEN-1108][VL] Init NativeRowToColumnarJniWrapper with memory pool and schema
- [GLUTEN-1199] Avoid throwing exception from destructor of JavaInputStreamAdaptor
- [GLUTEN-1205][VL] Rename some class name and dir name for columnar sh…
- [GLUTEN-1205][VL] Refactor shuffle partition writer
- [GLUTEN-1205][VL] Refactor shuffle partitioner
- [GLUTEN-1205][VL][FOLLOWUP] Refactor shuffle partition writer
- [GLUTEN-1209][VL] refactor: Refactor Java Celeborn into an independent module
- [GLUTEN-1296][VL] Remove some logs in CI
- [GLUTEN-1325][VL] Optimize decimal arithmetic
- [GLUTEN-1331][CORE] Enable some functions
- [GLUTEN-1336][VL] add spark3.3 UT under connector and expression
- [GLUTEN-1336][VL] move Spark3.3 Unit tests to separate job
- [GLUTEN-1336][VL] add more spark3.3 UT
- [GLUTEN-1336][VL] CI: move slow tests into another job for Spark3.3
- [GLUTEN-1357][CORE] Change soft-affinity log level from INFO to DEBUG
- [GLUTEN-1369][Core] Move config 'spark.gluten.enabled' to GlutenConfig from QueryPlanSelector
- [GLUTEN-1393][VL] feat: Change velox pipeline input from arrow to velox ValueStreamNode
- [GLUTEN-1407] Let profile control shim version
- [GLUTEN-1416][VL] NoSuchMethodError from shaded Arrow
- [GLUTEN-1433][VL] feat: offload timestamp scan to Velox - phase 1
- [GLUTEN-1433][VL] Enable GlutenStatisticsCollectionSuite
- [GLUTEN-1434][VL] Delete some unused files and functions
- [GLUTEN-1434][VL] Refactor to add ColumnarBatchIterator
- [GLUTEN-1434][VL] Remove unused arrow code and add GLUTEN_CHECK and GLUTEN_DCHECK
- [GLUTEN-1458][VL][CI] feat: Adding Spark3.3 w/ Ubuntu22.04 test
- [GLUTEN-1476][VL] Enable scan on struct and map types
- [GLUTEN-1476][CORE] Use correct field name in struct type
- [GLUTEN-1478][VL] enable timestamp expression tests
- [GLUTEN-1478] Enable failed UT in GlutenIntervalExpressionsSuite
- [GLUTEN-1478][VL] Enable some spark UTs for cast function
- [GLUTEN-1478][VL] Enable tests on casting from string to decimal
- [GLUTEN-1478][VL] Enable test on casting from decimal to bool
- [GLUTEN-1480][DOC] Refactor to enable github pages
- [GLUTEN-1491][VL][feat] Refine row_number() method in velox backend
- [GLUTEN-1500][VL] feat: Use 0.6 * task memory cap as spill threshold for all spillable operators
- [GLUTEN-1500][VL] Implement OOM cap shared by tasks, and spill threshold shared by tasks and operators
- [GLUTEN-1500][VL] Integrate with Velox arbitration API
- [GLUTEN-1533][VL][Feat] Replace sort agg with gluten hash agg
- [[GLUTEN-1534][VL]](https://github.com/oap-proj...
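The [GLUTEN-1500] entry in the list above sets the spill threshold for spillable operators to 0.6 times the task memory cap. As a worked illustration of that arithmetic only, the sketch below computes the threshold in bytes. The function name and parameters are hypothetical, and Gluten's actual implementation is not reproduced here.

```python
def spill_threshold_bytes(task_memory_cap_bytes, numerator=6, denominator=10):
    """Return numerator/denominator of the task memory cap, in whole bytes.

    Hypothetical helper: the default ratio 6/10 mirrors the "0.6 * task
    memory cap" wording in the release notes. Integer arithmetic is used
    so the result is not affected by floating-point rounding.
    """
    if task_memory_cap_bytes < 0:
        raise ValueError("task memory cap must be non-negative")
    return task_memory_cap_bytes * numerator // denominator


if __name__ == "__main__":
    cap = 10 * 1024**3  # an illustrative 10 GiB task memory cap
    print(spill_threshold_bytes(cap) // 1024**3, "GiB")
```

With an illustrative 10 GiB cap, the threshold works out to 6 GiB, so under this rule an operator would start spilling once it holds more than that.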
Gluten 0.5.0
Change log
Generated on 2023-04-07
Gluten 0.5.0 is the first preview release from the repository (https://github.com/oap-project/gluten).
In this release, we have merged 971 PRs and fixed 216 issues.
Here are the major highlights in Gluten 0.5.0:
- Support Spark 3.2 and Spark 3.3
- Support Ubuntu 20.04 or later
- Support CentOS 7 and 8
- Support JDK 8 only
- Support GCC 9 or later
- Use Substrait as unified plan
- Use Velox as default backend engine
- Use Celeborn as default RSS
- Support the most popular data types, including Boolean, Byte, Short, Int, Long, Float, Double, Date, Decimal, String, etc.
- Support Spill for Sort, Agg, and Join operators
- Pass all Spark 3.2 unit tests
- 2.5x speedup in Decision Support Benchmark1 (TPC-H Like) testing
- 2x speedup in Decision Support Benchmark2 (TPC-DS Like) testing
- Support Intel QAT accelerators in Shuffle compression
Limitations
- Complex data types such as Array, Map, and Struct are not supported
- OOM may occur in operators that do not support spill
- Decimal results may mismatch in some cases
Features
#974 | [CH] Support string repeat function |
#1008 | [CH] Support locate function |
#1273 | Implement cast decimal to int |
#1223 | [CH] support reading from S3 and using Clickhouse local cache to speed up |
#1131 | [Gluten-core] Add an option to only fallback once |
#1165 | Reduce GC Time when executing BHJ for CH backend. |
#1147 | [Gluten-core] Make validate failure logLevel configurable |
#1100 | Making transformer plan log more obvious |
#1112 | Refactor Gluten metrics and add apis for each backend |
#926 | gluten timezone not the same as backend |
#1039 | Remove compute pid metric in shuffle operator. |
#882 | Selective query execution |
#959 | Upgrade Arrow version to 11.0.0 |
#969 | Docker for gluten running on centos 8 |
#986 | Align and enrich metrics compare to Spark |
#972 | Can we separate native dynamic library from build generated jars? |
#913 | No Spark Shim Provider found for 3.2.0 |
#853 | Support named struct type |
#888 | Clickhouse backend broadcast relation support r2c |
#850 | Add cast check in ExpressionTransformer |
#825 | Setup development environment for macOS |
#788 | Pass needed hadoop conf from driver to executor |
Bugs Fixed
#1284 | Scala double data is wrongly compared with null in a ut |
#729 | Validation failed for GlutenHashAggregateExecTransformer class |
#799 | This operator doesn't support doExecuteColumnar |
#527 | archives for Spark patch versions become unavailable on new releases affecting shims versioning |
#523 | Some basic failed SQL cases |
#1028 | [VL] SubstraitToVeloxPlan error |
#858 | Sort result mismatch issue with different input records. |
#877 | Array/Map DataType result mismatch issue when containing null value |
#1227 | [CH] Scalar subquery filters execute twice for parquet file |
#1265 | [CH] Rescale decimal trigger fallback |
#1233 | [CH] Fix fallback issue when reading csv files |
#1235 | [CH] Fix missing reading from the broadcasted value when executing DPP |
#1234 | [CH] Fix error 'Invalid number of columns in chunk pushed to OutputPort' when executing hash agg after union all |
#1207 | shims-spark32 and shims-spark33 may be depended on at the same time |
#1161 | Bundle built by buildbundle-veloxbe.sh for Spark3.3 is broken |
#1210 | [CH] Fix the wrong table path of the orders table for TPCH in UT |
#1175 | FileNotFoundException while executing spark jobs -.so files |
#1179 | [VL] CI is failing on boost's checksum |
#1162 | [CK] fix CoalesceBatches metrics |
#1124 | Memory management not suitable with Velox split preload feature. |
#1149 | Run tpc-ds core |
#741 | Handle remainder for the case that its right input is zero |
#1090 | [TPCH][VL] tpch has some query execution error logs but queries could finish and the result is correct |
#1068 | [VL] Managed memory leak in imported Spark UTs |
#772 | Velox does not install folly in centos8 by default, break compile in centos8. |
#789 | Jar conflicts on Arrow and Protobuf between Vanilla Spark and Gluten |
#700 | AARCH64 port of Gluten |
#1027 | [VL] unsupported method |
#1072 | [CH] Fix NPE when executing BatchScanExecTransformer.getInputFilePaths with MergeTree DS V2 |
#489 | cannot build gluten (velox backend) in Amazon Linux 2 |
#1012 | Enable local cache throw exception |
#995 | Fix memory leak for ClickHouse Backend |
#914 | System variables related to Folly could not be found when compiling gluten. |
#990 | Failed to build velox |
#946 | Upgrade arrow version to 10.0.1 |
#860 | CH backend inset result not equals spark result |
#601 | Can't decide data type of null value in gluten test framework, when transforming InternalRow to DataFrame |
#843 | Unable to convert BHJ to SHJ by using hint |
#826 | ch_backend not support inset is empty |
#815 | Gluten + Velox backend does not support Struct dataset with same element name. |
#563 | Error compiling within -Pbackends-xx,spark-3.3,spark-ut |
#560 | An unsupportedOperationException interrupted the query execution |
#770 | VeloxRuntimeError when reading parquet file with only meta data |
#800 | [UT]ExpectedAnswer may not match SparkAnswer when is sorted |
#676 | WholeStageTransformerSuite#logForFailedTest() swallows exceptions |
#790 | Join RuntimeException when having duplicated equal-join keys |
#757 | Parquet scan not offloaded |
#797 | It won't load the libparquet.so.1000 when we use Gluten with Velox backend and run it on the yarn. |
#784 | No Spark Shim Provider found for 3.3.0 |
#547 | Jar conflict issue |
#727 | build from local velox repo doesn't work |
PRs
#1266 | [GLUTEN-1246] [CORE] Fix scale may be negative issue |
#1313 | [VL] Update doc for centos7 install |
#1312 | [CH] Ignore ch backend tpcds suite |
#1198 | [VL] fix: Update Velox setup scripts for centos 7 |
#1294 | [VL] Following #1185, do some clean-ups against Velox + Celeborn CI |
[#1196](https://github.com/oa... |