Gluten v1.0.0

@xieqi xieqi released this 14 Jul 03:07
bfe394b

Release Notes - Gluten - Version 1.0.0

Highlights (Velox backend only)

  • Support Spark 3.2 and Spark 3.3
  • Pass all Velox UTs and Spark 3.2 UTs, and part of the Spark 3.3 UTs
  • Support Ubuntu 20.04/22.04, CentOS 7/8, alinux 3, Anolis 7/8
  • Support FileSystem: localfs, HDFS, S3, OSS (via s3a)
  • Support data types: primitive types, Decimal, Date, Timestamp
  • Support 20 operators, detail here
  • Support 164 functions, detail here
  • Support native Parquet write
  • Support native ORC read
  • Support Intel® In-memory Analytics Accelerator (IAA/IAX) hardware accelerator in Shuffle compression
  • Support cap-based spill (static memory allocation) for join/agg/sort operators (experimental feature)
  • Support static build method via vcpkg
  • Support local cache (experimental feature)
  • 2.71x speedup in Decision Support Benchmark1 (TPC-H Like) testing
  • 2.29x speedup in Decision Support Benchmark2 (TPC-DS Like) testing
  • Velox code updated to commit
  • Documentation improvements for supported features and configuration

Known Issues

  • Parquet write only supports the compression.codec, parquet.block.size and parquet.block.rows configurations
  • Velox backend does not support dynamic partition write and bucket write
  • Spill may throw OutOfMemoryException
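
The Parquet write limitation listed above can be illustrated with a small pre-flight check. This is a minimal sketch, not Gluten code: the function name `split_parquet_write_options` and the sample option values are hypothetical, and only the three option names are taken from the list (written here exactly as they appear in it).

```python
# Option names copied from the known-issues list; everything else is illustrative.
SUPPORTED_PARQUET_WRITE_OPTIONS = frozenset({
    "compression.codec",
    "parquet.block.size",
    "parquet.block.rows",
})


def split_parquet_write_options(options):
    """Split write options into (honored, not_honored) by option name."""
    honored = {}
    not_honored = {}
    for name, value in options.items():
        if name in SUPPORTED_PARQUET_WRITE_OPTIONS:
            honored[name] = value
        else:
            not_honored[name] = value
    return honored, not_honored


# Hypothetical sample values, chosen only to exercise both branches.
honored, not_honored = split_parquet_write_options({
    "compression.codec": "snappy",
    "parquet.block.size": "134217728",
    "parquet.page.size": "1048576",
})
print("honored:", sorted(honored))
print("not honored:", sorted(not_honored))
```

A check of this kind could run before a write job is submitted, so that options outside the supported set are noticed up front.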

New Features

Improvements

  • [GLUTEN-842][VL] convert expand op to expand exec in velox
  • [GLUTEN-842] remove group id transformer
  • [GLUTEN-1108][VL] Init NativeRowToColumnarJniWrapper with memory pool and schema
  • [GLUTEN-1199] Avoid throwing exception from destructor of JavaInputStreamAdaptor
  • [GLUTEN-1205][VL] Rename some class name and dir name for columnar sh…
  • [GLUTEN-1205][VL] Refactor shuffle partition writer
  • [GLUTEN-1205][VL] Refactor shuffle partitioner
  • [GLUTEN-1205][VL][FOLLOWUP] Refactor shuffle partition writer
  • [GLUTEN-1209][VL] refactor: Refactor Java Celeborn into an independent module
  • [GLUTEN-1296][VL] Remove some logs in CI
  • [GLUTEN-1325][VL] Optimize decimal arithmetic
  • [GLUTEN-1331][CORE] Enable some functions
  • [GLUTEN-1336][VL] add spark3.3 UT under connector and expression
  • [GLUTEN-1336][VL] move Spark3.3 Unit tests to seperate job
  • [GLUTEN-1336][VL] add more spark3.3 UT
  • [GLUTEN-1336][VL] CI: move slow tests into another job for Spark3.3
  • [GLUTEN-1357][CORE] Change soft-affinity log level from INFO to DEBUG
  • [GLUTEN-1369][Core] Move config 'spark.gluten.enabled' to GlutenConfig from QueryPlanSelector
  • [GLUTEN-1393][VL] feat: Change velox pipeline input from arrow to velox ValueStreamNode
  • [GLUTEN-1407] Let profile control shim version
  • [GLUTEN-1416][VL] NoSuchMethodError from shaded Arrow
  • [GLUTEN-1433][VL] feat: offload timestamp scan to Velox - phase 1
  • [GLUTEN-1433][VL] Enable GlutenStatisticsCollectionSuite
  • [GLUTEN-1434][VL] Delete some unused files and functions
  • [GLUTEN-1434][VL] Refactor to add ColumnarBatchIterator
  • [GLUTEN-1434][VL] Remove unused arrow code and add GLUTEN_CHECK and GLUTEN_DCHECK
  • [GLUTEN-1458][VL][CI] feat: Adding Spark3.3 w/ Ubuntu22.04 test
  • [GLUTEN-1476][VL] Enable scan on struct and map types
  • [GLUTEN-1476][CORE] Use correct field name in struct type
  • [GLUTEN-1478][VL] enable timestamp expression tests
  • [GLUTEN-1478] Enable failed UT in GlutenIntervalExpressionsSuite
  • [GLUTEN-1478][VL] Enable some spark UTs for cast function
  • [GLUTEN-1478][VL] Enable tests on casting from string to decimal
  • [GLUTEN-1478][VL] Enable test on casting from decimal to bool
  • [GLUTEN-1480][DOC] Refactor to enable github pages
  • [GLUTEN-1491][VL][feat] Refine row_number() method in velox backend
  • [GLUTEN-1500][VL] feat: Use 0.6 * task memory cap as spill threshold for all spillable operators
  • [GLUTEN-1500][VL] Implement OOM cap shared by tasks, and spill threshold shared by tasks and operators
  • [GLUTEN-1500][VL] Integrate with Velox arbitration API
  • [GLUTEN-1533][VL][Feat] Replace sort agg with gluten hash agg
  • [GLUTEN-1534][VL] reduce pre-projection for hash join
  • [GLUTEN-1543] Use Spark's function name for substrait mapping if not supported by substrait
  • [GLUTEN-1575][VL] Add a parameter run_setup_script to control whether to run velox setup
  • [GLUTEN-1577][CORE] Respect spark's config for case sensitive when get attribute name
  • [GLUTEN-1690][VL] Header Search Paths do not include the complete velox directory during development
  • [GLUTEN-1695][VL] Bug: Show the right writen row number of DataWritingCommandExec in spark UI
  • [GLUTEN-1712][VL] feat: Write parquet data into temp dir
  • [GLUTEN-1747][VL] Change the profile for backends velox and rss
  • [GLUTEN-1772] Pass batch size to native
  • [GLUTEN-1792][VL] Remove local cache files on Executor exit
  • [GLUTEN-1803][VL] Refactor gluten-data and cleanup class names' prefixes
  • [GLUTEN-1809][VL] Enable date type for kPreceding & kFollowing window range bound
  • [GLUTEN-1812][VL] package missing libthrift.so into jar on Ubuntu20/22
  • [GLUTEN-1858][CORE] Add PlanOneRowRelation to make gluten work with OneRowRelation
  • [GLUTEN-1868][VL] Optimize ColumnarToRow to reuse the allocated Buffer in native side
  • [GLUTEN-1879][Core] Skip some operator when judging whether the execution plan has a fallback
  • [GLUTEN-1928][Core] Support to add the custom expressions transformer by Spark conf
  • [GLUTEN-1928][Core] Followup: fix the custom expressions transformer ut bug
  • [GLUTEN-1945] Support summarizing the supported spark built-in functions
  • [GLUTEN-1947][VL] Parquet should respect user-specified write options
  • [GLUTEN-1972][CORE] Log gluten build info
  • [GLUTEN-1980][VL][DOCS] Update the documentation for compiling Gluten+Velox with Docker
  • [GLUTEN-2002][VL] Build: add boost sort lib in static linking job
  • [GLUTEN-2025][VL] disable sort in window operations
  • [GLUTEN-2055][VL] Support s3 iam role credentials
  • [GLUTEN-2100] backport recent important changes to 1.0 branch
  • [GLUTEN-2125][VL] Decouple S3 endpoint/ssl.enabled/path-style-access config from AK/SK
  • [GLUTEN-2213][VL] Upgrade the dependencies version to meet the requir…
  • [GLUTEN-CORE] Refactor cpp dependency on gtest and gbenchmark in more general way
  • [Gluten-Core] Make GlutenConfig slightly more extensible
  • [GLUTEN-CORE][VL] Avoid duplicate window column name
  • [GLUTEN-CORE] minor, add another profile for hadoop 3.3
  • [GLUTEN-CORE][VL] Add SparkTaskInfo to wrap stageId, taskId, partitionId
  • [GLUTEN-CORE][VL] Minor refactor c2r codes to improve readability
  • [GLUTEN-2154][DOC] Fix Github pages parsing Markdown files issue
  • [GLUTEN-2240][DOC] Update doc for supported functions and operators
  • [VL QPL] WIP Add Intel®-IAA/QPL-based Codec
  • [VL] Add velox cpp test to CI
  • [VL][CI] Increase the scale of testing dataset
  • [CI] Update repo url for centos 7
  • [VL] Optimize failure output for WrappedVeloxMemoryPool
  • [VL] Add xsimd search path
  • [VL] Upgrade velox to 2023/4/20
  • [VL] Short decimal should memcpy int64 data to arrow int128
  • [VL] Refine the naming for two native build options and correct the queries path for tpch test
  • [VL] Use a config to control coalesce batches
  • [VL] Add config option "spark.gluten.sql.columnar.backend.velox.joinSpillMemoryThreshold"
  • [VL]minor: Enable coalesce batches by default
  • [VL][DOC] Add the limitation for df.describe() method
  • [VL] Adjust batch size to 4096
  • [VL] Use clang-format 12 to format check
  • [VL] Rename kSparkBatchSize
  • [VL] Shuffle reader is not tracked by Spark memory manager
  • [VL] minor improvements for building velox behind proxy
  • [VL] C++: Rename identifiers based on the CPP code style
  • [VL] Script dev/package.sh ignores C++ compilation failures
  • [VL] Add shortcut tools/gluten-it/sbin/gluten-it.sh to run gluten-it
  • [VL] CI: Update mirror of CentOS 7
  • [VL] Generate compile_commands.json on arrow build
  • [VL] Remove unused option --fixed-width-as-double from gluten-it
  • [VL] Following GLUTEN-1205, fix malformed namings
  • [VL] Enable tests for string for ascii, replace
  • [VL] Avoid initializing C++ backend instance more than once
  • [VL] Upgrade arrow to 12
  • [VL] Use current available off-heap memory to decide on partition buffer size in shuffle writer
  • [VL] Port shuffle mirco benchmark
  • [VL] Result mismatch with vanilla Spark SQL on atan2
  • [VL] In unit testing, optimize diff tolerance for doubles
  • [VL] Logging velox log to stderr
  • [VL] Result mismatch with Vanilla Spark on log2/log10
  • [VL] Add parquet write benchmark
  • [VL][DOCS] Improve hdfs and kerberos docs with velox
  • [VL] Make git checkout script in get_arrow.sh use an unique local tag name
  • [VL] Parameterized benchmark tool for gluten-it
  • [VL] refactor setBackendFactory
  • [VL][DOC] Update celeborn doc
  • [VL] Add quotes for CPU_TARGET to avoid warning
  • [VL] CI: Set environment variables (e.g. http proxy settings) for all shell types
  • [VL] Doc: How to prioritize loading Gluten jars in Spark
  • [VL][BUILD] Pop!_OS as Ubuntu alias
  • [BUILD] Allow use custom ARROW_HOME for Velox
  • [VL] Refactor VeloxMemoryPool and VeloxInitializer
  • [VL] Doc: Update VeloxNotSupport.md about limitations of spilling
  • [VL] Avoid print conf in default situation
  • [VL][Doc] Add a troubleshooting document
  • [VL] Adopt Spark local-cluster run mode in gluten-it
  • [VL][BUILD] Update vcpkg scripts for development workflow
  • [VL] DOC: refresh velox S3 docs
  • [VL] Backport Delete shuffle spilled file directories
  • [VL][Doc] Update Velox backend performance in README.md
  • [Doc] Update docs and remove a useless config
  • [MINOR] add missing ASF header
  • [Minor] Auto set system arch library path
  • [MINOR] change logInfo to logDebug in RowToArrowColumnarExec#javaConvert
  • Minor: upgrade guava to 32.0.1-jre
  • [UT] Fix a config overwritten issue in UT
  • [UT] Minor: add jvm xss configuration in gluten-ut to avoid stackoverflow
  • [UT][VL] Exclude unstable test
  • [PR-1308]minor log change for filter validation
  • [PR-1735]open decimal test in literal
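
One of the GLUTEN-1500 entries above describes using 0.6 * task memory cap as the spill threshold for all spillable operators. The arithmetic can be sketched as follows. The function name, the use of bytes as the unit, and rounding down to a whole number are assumptions made for illustration; this is not Gluten's implementation, and the later GLUTEN-1500 entries may have changed how the threshold is derived.

```python
def spill_threshold_bytes(task_memory_cap_bytes):
    """Return 0.6 * task memory cap, rounded down (hypothetical helper)."""
    if task_memory_cap_bytes < 0:
        raise ValueError("task memory cap must be non-negative")
    # Integer arithmetic (6/10) avoids floating-point rounding surprises.
    return task_memory_cap_bytes * 6 // 10


# Hypothetical cap of 1,000,000,000 bytes per task.
cap = 1_000_000_000
print(spill_threshold_bytes(cap))
```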

Bug Fixes

Note

Binary files are provided for the Velox backend only.

Full Changelog: 0.5.0...v1.0.0

What's Changed

  • [VL] Optimize variable name in nativeConvertRowToColumnar by @kerwin-zk in #1290
  • minor log change for filter validation by @zhli1142015 in #1308
  • [GLUTEN-1251][ClickHouse-Backend] feat: fallback operations not supported by ch backend in CHHashJoinExecTransformer && CHBroadcastHashJoinExecTransformer && CHSortMergeJoinExecTransformer by @zheniantoushipashi in #1252
  • [GLUTEN-1217][VL] Fix native columnar to row in avx mode by @jinchengchenghh in #1314
  • [GLUTEN-1277][CORE] Fix spark33 TPCDS Q66 decimal * int expression by @jinchengchenghh in #1278
  • [VL] Remove low version libre2-dev by @zhejiangxiaomai in #1320
  • [VL] Ut for bloomfilter validation fix by @zhli1142015 in #1291
  • [GLUTEN-905][VL]fix: Skip jemalloc building by default by @zhouyuan in #1322
  • [VL] package.sh support centos7 and centos8 by @zhejiangxiaomai in #1321
  • [VL] CI: Throw error on managed memory-leak in TPC tests by @zhztheplayer in #1310
  • [GLUTEN-1013][VL]fix: improve local cache related configurations in Spark side by @zhouyuan in #1092
  • [VL] Update README.md by @zhejiangxiaomai in #1329
  • [VL] Correct task-wise off-heap memory calculation by @zhztheplayer in #1316
  • [CH-397] support posexplode/sequence functions by @taiyang-li in #1281
  • [GLUTEN-1296][VL][CI]fix: Speed up on running Spark unit tests by @zhouyuan in #1302
  • [GLUTEN-1296][VL] Remove some logs in CI by @jinchengchenghh in #1304
  • [VL] package and velox scripts add alinux support by @liujiayi771 in #1339
  • [GLUTEN-1108] [VL] Init NativeRowToColumnarJniWrapper with memory pool and schema by @jinchengchenghh in #1318
  • [GLUTEN-1205][VL] Rename some class name and dir name for columnar sh… by @CLTFOREVER in #1334
  • [VL QPL] WIP Add Intel®-IAA/QPL-based Codec by @marin-ma in #1057
  • [VL] Quick fix package.sh by @zhejiangxiaomai in #1342
  • [VL] Optimize failure output for WrappedVeloxMemoryPool by @zhztheplayer in #1335
  • [CI] Update repo url for centos 7 by @ccat3z in #1327
  • [VL] fix: use bool for isEmpty in decimal sum by @rui-mo in #1343
  • [CH-398] Add test case for collect_list by @lgbo-ustc in #1301
  • [minor] fix typo by @leesf in #1352
  • Fix aarch64 compile error by @liujiayi771 in #1350
  • [GLUTEN-1325] [VL] Optimize decimal arithmetic by @jinchengchenghh in #1337
  • [GLUTEN-1270][VL][Feat] Support multiple HDFS endpoints by @PHILO-HE in #1271
  • [VL] Fix broken memory limit used for Velox spilling by @zhztheplayer in #1356
  • [VL] feat: support partial merge phase in aggregation by @rui-mo in #1330
  • [VL] Add xsimd search path by @Yohahaha in #1364
  • [GLUTEN-1331] [CORE] Enable some functions by @jinchengchenghh in #1332
  • [GLUTEN-1357][CORE] Change soft-affinity log level from INFO to DEBUG by @jackylee-ch in #1358
  • [CH-413] implement function reinterpretAsStringSpark and fix bugs of length(binary)/cast(x as binary) by @taiyang-li in #1328
  • [CH] [Refactor Repo] Move libch source code from Kyligence/Clickhouse by @liuneng1994 in #1153
  • [GLUTEN-1141][VL]fix: building hashmap with new key equality check by @zhouyuan in #1288
  • [GLUTEN-1351][VL] Fix aarch64 velox shuffle writer short decimal core dump by @liujiayi771 in #1370
  • [VL][CI] Increase the scale of testing dataset by @rui-mo in #1268
  • [GLUTEN-1209] [VL] refactor: Refactor Java Celeborn into an independent module by @kerwin-zk in #1319
  • [GLUTEN-1346][VL] Support native velox row to column by @jinchengchenghh in #1347
  • [GLUTEN-1379] [CH] update clickhouse backends doc by @liuneng1994 in #1380
  • [GLUTEN-1367] Support running gluten on anolis by @leesf in #1368
  • [VL] Add velox cpp test to CI by @jinchengchenghh in #1181
  • [VL] feat: support more distinct functions by @rui-mo in #1365
  • [VL] Fix broken memory limit used for Velox spilling (patch 2) by @zhztheplayer in #1363
  • [GLUTEN-1372][clickhouse_backend] fix null partition read for ch by @lhuang09287750 in #1373
  • [GLUTEN-1396][ClickHouse-Backend] fix assert failure in c2r and r2c by @taiyang-li in #1401
  • [GLUTEN-842][VL] convert expand op to expand exec in velox by @zhli1142015 in #1361
  • [CH-320] Support space function by @KevinyhZou in #1054
  • [CH] Update doc of ClickHouse by @zhanglistar in #1400
  • [VL] Support mocking map stage with no input files in micro benchmark by @marin-ma in #1388
  • [GLUTEN-1407] Let profile control shim version by @Yohahaha in #1408
  • [VL] Refine the naming for two native build options and correct the queries path for tpch test by @zhejiangxiaomai in #1422
  • [GLUTEN-1369][Core] Move config 'spark.gluten.enabled' to GlutenConfig from QueryPlanSelector by @zzcclp in #1420
  • [GLUTEN-1405] [CH] fix CH continue executing when JNI call throw exception by @shuai-xu in #1406
  • [GLUTEN-1374] [VL] RangePartitioning supports velox columnar batch by @jinchengchenghh in #1375
  • [GLUTEN-1429][CH] trigger clickhouse CI job on specific files change by @lwz9103 in #1423
  • [VL] Add config option "spark.gluten.sql.columnar.backend.velox.joinSpillMemoryThreshold" by @zhztheplayer in #1427
  • [VL] fix: remove Scala-side validation for functions by @rui-mo in #1360
  • [GLUTEN-1245][VL][Feat] Add VeloxParquetFileFormat to support parquet write in velox backend. by @JkSelf in #1344
  • [GLUTEN-1450] [CH] enable forked repository's GITHUB_TOKEN comment in PR by @lwz9103 in #1451
  • [CH] refine clickhouse doc by @binmahone in #1462
  • [GLUTEN-1410] [CH] convert partition name to lower case by @shuai-xu in #1411
  • [GLUTEN-1452][CH]Bug fix select column with lots of NULL value by @KevinyhZou in #1454
  • [GLUTEN-1434] [VL] Delete some unused files and functions by @jinchengchenghh in #1436
  • [VL] Use a config to control coalesce batches by @rui-mo in #1424
  • [GLUTEN-1389][CH] Fix ORC/Parquet column name case-insensitive matching issue by @taiyang-li in #1390
  • [VL] Short decimal should memcpy int64 data to arrow int128 by @liujiayi771 in #1399
  • [VL] Upgrade velox to 2023/4/20 by @zhejiangxiaomai in #1378
  • [GLUTEN-1483][CH] Fix ut errors after disable CoalesceBatchesExec for CH backend by @zzcclp in #1484
  • [VL] Fix ci compile error by @zhejiangxiaomai in #1491
  • [GLUTEN-1458][VL][CI]feat: Adding Spark3.3 w/ Ubuntu22.04 test by @zhouyuan in #1459
  • [GLUTEN-1409][VL] feat: Support named_struct in Velox backend by @rui-mo in #1385
  • [GLUTEN-1474][VL][Fix] Align with spark's cast behavior by allowing decimal by @PHILO-HE in #1475
  • [GLUTEN-CORE] Refactor cpp dependency on gtest and gbenchmark in more general way by @Yohahaha in #1469
  • [GLUTEN-1376][CH] feat: support spill settings: max_bytes_before_external_sort and max_bytes_before_external_group_by by @taiyang-li in #1377
  • [GLUTEN--1491] [VL] [feat] Refine row_number() method in velox backend. by @JkSelf in #1493
  • [GLUTEN-1416][VL] NoSuchMethodError from shaded Arrow by @kerwin-zk in #1487
  • [Gluten] Fix expression ut result is not check by @loneylee in #1465
  • [GLUTEN-1460] [CH] upgrade clickhouse version to 23.3 by @liuneng1994 in #1461
  • [GLUTEN-1500][VL] feat: Use 0.6 * task memory cap as spill threshold for all spillable operators by @zhztheplayer in #1463
  • [GLUTEN-384][CH] support reading from S3 and local cache on S3 by @binmahone in #1497
  • [GLUTEN-1391][CH]support function regexp_extract_all by @taiyang-li in #1324
  • [GLUTEN-1392][CH] Support new ExpandRel by @exmy in #1432
  • [GLUTEN-1502][CH] Fix timezone config sync to backend by @loneylee in #1506
  • [GLUTEN-1205][VL] Refactor shuffle partition writer by @kerwin-zk in #1414
  • [GLUTEN-1199] Avoid throwing exception from destructor of JavaInputStreamAdaptor by @cambyzju in #1505
  • [GLUTEN-1472][CH] Refine clickhouse memory audit by @binmahone in #1473
  • [VL][GLUTEN-1243] Support bit_xor aggregate function by @Yohahaha in #1501
  • [GLUTEN-1480][DOC] Refactor to enable github pages by @xieqi in #1481
  • [GLUTEN-842] remove group id transformer by @zhli1142015 in #1519
  • [VL]minor: Enable coalesce batches by default by @marin-ma in #1510
  • [VL] Adjust batch size to 4096 by @zhejiangxiaomai in #1522
  • [GLUTEN-1521][Core] Support to add the customer columnar rules by config by @zzcclp in #1523
  • [VL] Use clang-format 12 to format check by @zhejiangxiaomai in #1529
  • [CH] Avoid exporting unnecessary symbols by @baibaichen in #1517
  • [VL] [DOC] Add the limitation for df.describe() method by @JkSelf in #1520
  • [GLUTEN-1394][CH] support sha1/sha2/crc32 hash functions by @taiyang-li in #1417
  • [GLUTEN-1386][CH] Improve max(NULL)/min(NULL) execution efficiency by @KevinyhZou in #1387
  • [CH] un-ignore some test cases by @binmahone in #1516
  • [GLUTEN-1393] [VL] feat: Change velox pipeline input from arrow to velox ValueStreamNode by @jinchengchenghh in #1397
  • [GLUTEN-1404][CH] Bug fix no result returned in max/min/sum/count while select from empty table/partition by @KevinyhZou in #1345
  • [GLUTEN-1545][DOC] Fix link issues in github pages by @xieqi in #1546
  • [GLUTEN-1534] [VL] reduce pre-projection for hash join by @zhli1142015 in #1535
  • [GLUTEN-1205][VL] Refactor shuffle partitioner by @CLTFOREVER in #1531
  • [GLUTEN-1507][VL] fix issue that GlutenBroadcastJoinSuite run tests using spark without gluten enabled by @gaoyangxiaozhu in #1508
  • [VL] Rename kSparkBatchSize by @zhejiangxiaomai in #1542
  • [GLUTEN-1533][VL][Feat] Replace sort agg with gluten hash agg by @PHILO-HE in #984
  • [VL] Shuffle reader is not tracked by Spark memory manager by @zhztheplayer in #1547
  • [GLUTEN-1548][CH] Fix ci in GlutenClickHouseTPCHParquetSuite test max(NULL)/min(NULL) by @KevinyhZou in #1549
  • [GLUTEN-1524][CH] Apply new functions merged into ClickHouse/ClickHouse after v23.1: regexpExtract/jsonarraylength by @taiyang-li in #1540
  • [GLUTEN-1434] [VL] Refactor to add ColumnarBatchIterator by @jinchengchenghh in #1504
  • [VL] add support for reading ORC by @zuochunwei in #1513
  • [VL] set velox repo/branch to oap-project by @zuochunwei in #1556
  • [Gluten-Core] Make GlutenConfig slightly more extensible by @jackylee-ch in #1563
  • [GLUTEN-1500][VL] Implement OOM cap shared by tasks, and spill threshold shared by tasks and operators by @zhztheplayer in #1527
  • [GLUTEN-1431][CH]Fix cast exception when input value is null by @taiyang-li in #1566
  • [CH][Refactor Repo] minor improments on symbolic link to substrait (t… by @binmahone in #1570
  • [GLUTEN-1560][VL][FIX]Optimize parquet write perf by @JkSelf in #1561
  • [GLUTEN-1524][CH]support function to_date/to_timestamp/dayofweek/weekday/btrim/ltrim/rtrim by @taiyang-li in #1562
  • [GLUTEN-1575][VL] Add a parameter run_setup_script to control whether to run velox setup by @liujiayi771 in #1576
  • [VL] Fix some compilation errors with Clang 12 by @zhztheplayer in #1573
  • [DOC] Fix links for [Build with Velox] and [Build with ClickHouse Backend] by @kerwin-zk in #1574
  • [VL] Script dev/package.sh ignores C++ compilation failures by @zhztheplayer in #1586
  • [VL] Add shortcut tools/gluten-it/sbin/gluten-it.sh to run gluten-it by @zhztheplayer in #1588
  • [VL] C++: Rename identifiers based on the CPP code style by @zhztheplayer in #1580
  • [GLUTEN-1511][CH] add ut to make sure native exceptions are well caug… by @binmahone in #1512
  • [GLUTEN-1583][VL] Fix the memory leak of shuffle reader by @kerwin-zk in #1602
  • [GLUTEN-1205][VL][FOLLOWUP] Refactor shuffle partition writer by @kerwin-zk in #1544
  • [VL] Generate compile_commands.json on arrow build by @FelixYBW in #1605
  • [VL] minor improvements for building velox behind proxy by @binmahone in #1557
  • [GLUTEN-1577][CORE] Respect spark's config for case sensitive when get attribute name by @exmy in #1578
  • [GLUTEN-1306][VL] feat: Link static depends via vcpkg by @ccat3z in #1384
  • [GLUTEN-1551][VL][Fix] Trim leading/trailing whitespace in string for casting to integral type by @PHILO-HE in #1569
  • [VL] CI: Update mirror of CentOS 7 by @ccat3z in #1603
  • [VL] Following GLUTEN-1205, fix malformed namings by @zhztheplayer in #1608
  • [VL] Change OAP/main_squash branch as main branch by @zhejiangxiaomai in #1601
  • [CH] Enable Clickhouse Backend TPCDS Suite by @loneylee in #1579
  • [VL][HOTFIX] disable static linking test temporary by @zhouyuan in #1619
  • [GLUTEN-1585][CH]feat: Use --version-script to limit export symbols by @baibaichen in #1614
  • [GLUTEN-1612] [CH] add format check for ch backend by @liuneng1994 in #1618
  • [GLUTEN-1306][FOLLOWUP]vcpkg setup script add alinux3 support by @liujiayi771 in #1616
  • [GLUTEN-CORE] [VL] Avoid duplicate window column name by @Yohahaha in #1606
  • [CH] Fix github action ch trigger error by @liuneng1994 in #1629
  • [GLUTEN-1543] Use Spark's function name for substrait mapping if not supported by substrait by @Yohahaha in #1550
  • [GLUTEN-1617][VL] Fix static build cpp unit tests by @ccat3z in #1625
  • [CH] Enable tpcds more tests by @loneylee in #1624
  • Revert "[CH] Enable tpcds more tests" by @zzcclp in #1637
  • [GLUTEN-1609][CORE] Fix MemoryLeak with speculation by @jackylee-ch in #1610
  • [Gluten-1315][CH] Enable tpcds fixed tests by @loneylee in #1646
  • [GLUTEN-1434] [VL] Remove unused arrow code and add GLUTEN_CHECK and GLUTEN_DCHECK by @jinchengchenghh in #1611
  • [GLUTEN-1634][CH] enable update clickhouse version with config file by @lwz9103 in #1635
  • [GLUTEN-1433][VL] feat: offload timestamp scan to Velox - phase 1 by @rui-mo in #1435
  • (VL) bugfix for query benchmark and test for orc/decimal reader by @zuochunwei in #1613
  • [VL][Minor] Fix arm build by @marin-ma in #1650
  • [GLUTEN-1485] [CH] make the parameters for aggregator configurable by @shuai-xu in #1486
  • [Minor] Auto set system arch library path by @marin-ma in #1655
  • [GLUTEN-1594][CH] support native parquet writer for clickhouse backend by @binmahone in #1595
  • [GLUTEN-1371][VL] Support First/Last aggregate functions by @Yohahaha in #1581
  • [GLUTEN-1433][VL] Enable GlutenStatisticsCollectionSuite by @rui-mo in #1665
  • [GLUTEN-1490] refactor substrait literals using generics, and support map/struct/array literals based on it by @taiyang-li in #1494
  • [VL] Avoid initializing C++ backend instance more than once by @zhztheplayer in #1642
  • [GLUTEN-1658] [CORE] feat: Support SparkResourcesUtil.scala in k8s by @zbbkeepgoing in #1657
  • [GLUTEN-1478][VL][Fix] Fix a unit test: "missing cases - from boolean" by @PHILO-HE in #1686
  • [GLUTEN-1645][CH] Disable vectorized reading of vanilla Spark for CH backend by @exmy in #1647
  • [GLUTEN-1638][VL] feat: Add hdfs support in parquet write by @JkSelf in #1639
  • [MINOR] add missing ASF header by @leesf in #1701
  • [CH] minor improvements for BroadCastJoinBuilder by @exmy in #1591
  • [CH] minor, update jenkins ci url by @lwz9103 in #1711
  • [VL] In unit testing, optimize diff tolerance for doubles by @zhztheplayer in #1693
  • [GLUTEN-1690][VL] Header Search Paths do not include the complete velox directory during development by @kerwin-zk in #1691
  • [GLUTEN-1478][VL] fix: Enable the date relate functions in GlutenDateExpressionSuite by @JkSelf in #1724
  • [GLUTEN-1716][CORE][Fix] Evaluate foldable expression used to specify window bound type by @PHILO-HE in #1717
  • [CH] Add clean CH backend broadcast data after execution end by @loneylee in #1604
  • [GLUTEN-1709][CH] add jvmArgs in scala-maven-plugin by @lwz9103 in #1710
  • open decimal test in literal by @Ma-Jian1 in #1735
  • [VL][DOCS] Improve hdfs and kerberos docs with velox by @ulysses-you in #1734
  • [GLUTEN-1704][VL] Support metrics on splits and row groups by @rui-mo in #1705
  • [GLUTEN-1632][CH]Update Clickhouse Version (20230517) by @lwz9103 in #1692
  • [GLUTEN-1738][CH]Update libchbuilder dockerfile to make build latest Clickhouse by @lwz9103 in #1740
  • [GLUTEN-1654] [VL] support approx_count_distinct for velox by @zhli1142015 in #1676
  • [GLUTEN-1478][VL] enable timestamp expression tests by @rui-mo in #1721
  • [GLUTEN-1478][VL]Fix SPARK-30633: xxHash with different type seeds by @marin-ma in #1730
  • [GLUTEN-1679][CH] Fixed: compatibility between collect_list/collect_set and groupArray/groupUniqArray by @lgbo-ustc in #1698
  • [GLUTEN-1478][VL] fix: enable makedecimal suite by @JkSelf in #1736
  • [GLUTEN-1712][VL] feat: Write parquet data into temp dir by @JkSelf in #1713
  • [VL] Logging velox log to stderr by @jackylee-ch in #1694
  • [GLUTEN-1640] Support judging whether the execution plan has a fallback by @zheniantoushipashi in #1641
  • [GLUTEN-1623] [VL] Support asinh, acosh, atanh, sec, csc math functions for Velox backend by @Yohahaha in #1742
  • [VL] Upgrade arrow to 12 by @jinchengchenghh in #1660
  • [GLUTEN-1478] Support ordered result check for MapData by @rui-mo in #1739
  • [GLUTEN-1695][VL] Bug: Show the right writen row number of DataWritingCommandExec in spark UI by @JkSelf in #1696
  • [VL] Add parquet write benchmark by @JkSelf in #1723
  • [VL] add long decimal type support for Orc file format by @yimin-yang in #1726
  • [GLUTEN-1478] Enable failed UT in GlutenIntervalExpressionsSuite by @marin-ma in #1750
  • [VL] Make git checkout script in get_arrow.sh use an unique local tag name by @zhztheplayer in #1757
  • [VL] Result mismatch with Vanilla Spark on log2/log10 by @zhztheplayer in #1707
  • [VL] fix: fix parquet write benchmark compile by @JkSelf in #1764
  • [VL] Enable tests for string for ascii, replace by @izchen in #1628
  • [GLUTEN-1632][CH]Daily Update Clickhouse Version (20230525) by @lwz9103 in #1771
  • [VL] Remove unused option --fixed-width-as-double from gluten-it by @zhztheplayer in #1607
  • [GLUTEN-1714][VL][Fix] Align the implementation for ascii func with latest spark by @PHILO-HE in #1715
  • [GLUTEN-1478][VL] Enable some spark UTs for cast function by @PHILO-HE in #1756
  • [VL] Result mismatch with vanilla Spark SQL on atan2 by @zhztheplayer in #1689
  • [GLUTEN-1760][VL] fix alinux3 velox compilation exception by @kerwin-zk in #1761
  • [VL] Port shuffle mirco benchmark by @marin-ma in #1678
  • [GLUTEN-1772] Pass batch size to native by @rui-mo in #1774
  • [GLUTEN-1747][VL] Change the profile for backends velox and rss by @jackylee-ch in #1748
  • [GLUTEN-1478][VL] Enable tests on casting from string to decimal by @rui-mo in #1766
  • [VL] Fixup micro benchmark failed to compile by @marin-ma in #1779
  • [GLUTEN-1632][CH]Daily Update Clickhouse Version (20230526) by @lwz9103 in #1784
  • [CH] Fix crash on ubuntu 16.04 by @zhanglistar in #1773
  • [GLUTEN-1792][VL] Remove local cache files on Executor exit by @jackylee-ch in #1793
  • [GLUTEN-1662][VL] feat: Support InsertIntoHiveDirCommand in velox parquet write by @JkSelf in #1663
  • [GLUTEN-1478][VL] Enable test on casting from decimal to bool by @rui-mo in #1783
  • [GLUTEN-1632][CH]Daily Update Clickhouse Version (20230530) by @lwz9103 in #1797
  • [GLUTEN-1796][CH] Collect performance metrics from CH backend by @zzcclp in #1674
  • [GLUTEN-1518][CH] Improve compatibility of get_json_object by @lgbo-ustc in #1333
  • [GLUTEN-1794][VL] support split preload by @zhli1142015 in #1791
  • [GLUTEN-1515][CH] Support agg functions bit_and/bit_or/bit_xor by @exmy in #1536
  • [Gluten-1809][VL] Block offloading window operator by checking some limitations by @PHILO-HE in #1729
  • [GLUTEN-1754][CH] Support aggregate functions with multiple arguments by @lgbo-ustc in #1788
  • [CH]fix a typos by @lgbo-ustc in #1811
  • [GLUTEN-1754][CH] Followup: update substrait plan metrics test cases by @zzcclp in #1814
  • [GLUTEN-1812][VL] package missing libthrift.so into jar on Ubuntu20/22 by @ulysses-you in #1813
  • [VL] Fix: Support select sum(1) by @zhouyuan in #1810
  • [GLUTEN-1767][CH]Bug fix posexplode function argument check error by @KevinyhZou in #1768
  • [CH] Format log output for CH backend by @exmy in #1746
  • [GLUTEN-1682] [CH] fix hasNext will always be true if input is empty by @shuai-xu in #1683
  • [GLUTEN-384][CH] fix bug for aws cn case by @binmahone in #1815
  • [GLUTEN-1809][VL] Enable date type for kPreceding & kFollowing window range bound by @PHILO-HE in #1825
  • [GLUTEN-1826][CH] Fix error convert time statistics for RowToCHNativeColumnarExec by @zzcclp in #1828
  • [CH] Suppport asinh/acosh/atanh by @exmy in #1778
  • [CH] Fix expand transformer can't collect metrics by @exmy in #1808
  • [GLUTEN-1829] [CH] Refactor: introduce gluten_clickhouse_backend_libs and gluten_spark_functions by @baibaichen in #1830
  • [VL] Use current available off-heap memory to decide on partition buffer size in shuffle writer by @zhztheplayer in #1675
  • [GLUTEN-1620][CH] Fix error 'attribute binding failed.' when executing hash agg by @zzcclp in #1827
  • [GLUTEN-1478][VL][Fix] fix compare function UT in spark3.2 by @yma11 in #1749
  • [GLUTEN-1582][CH] Improve txt/json read by use native engine by @KevinyhZou in #1584
  • [GLUTEN-1817][CH] Fix: Initialize global thread pool at backend initalization by @baibaichen in #1834
  • [GLUTEN-1336][VL] add spark3.3 UT under connector and expression by @yma11 in #1685
  • [GLUTEN-1632][CH]Daily Update Clickhouse Version (20230603) by @lwz9103 in #1837
  • [CORE] minor, add another profile for hadoop 3.3 by @binmahone in #1816
  • [CH] Support json_tuple function by @KevinyhZou in #1135
  • [VL] refactor setBackendFactory by @zuochunwei in #1820
  • [GLUTEN-1855][VL] Fix the describe and summary issue in gluten by @JkSelf in #1856
  • [VL][BUILD] Fix use pre-build arrow/parquet/thrift lib by @Yohahaha in #1821
  • [GLUTEN-1806][CH] More friendly to build gluten binary package by @lwz9103 in #1833
  • [GLUTEN-1803][VL] Refactor gluten-data and cleanup class names' prefixes by @rui-mo in #1807
  • [GLUTEN-1847][CH] Prefetch data from source nodes by @lgbo-ustc in #1849
  • [GLUTEN-1632][CH]Daily Update Clickhouse Version (20230607) by @lwz9103 in #1867
  • [GLUTEN-CORE] [VL] Add SparkTaskInfo to wrap stageId, taskId, partitionId by @Yohahaha in #1857
  • [GLUTEN-1848][Core] Fix execute subquery repeatedly issue with ReusedSubquery when aqe is on by @zzcclp in #1851
  • [GLUTEN-1875][CH] Support UnionExecTransformer as BroadcastRelation by @exmy in #1876
  • [GLUTEN-1589][CH] support udf for clickhouse backend by @taiyang-li in #1596
  • [GLUTEN-1476][CORE] Use correct field name in struct type by @rui-mo in #1878
  • [GLUTEN-1336][VL] move Spark3.3 Unit tests to separate job by @zhouyuan in #1845
  • [GLUTEN-1871][CH] Refactor: StorageJoinFromReadBuffer doesn't inherit from StorageSetOrJoinBase any more. by @baibaichen in #1895
  • [GLUTEN-1860] StructLiteral support null literal by @exmy in #1861
  • [GLUTEN-1858][CORE] Add PlanOneRowRelation to make gluten work with OneRowRelation by @ulysses-you in #1859
  • [CORE] Support submit subqueries concurrently to improve scalar subquery performance by @WangGuangxin in #1097
  • [VL] Minor: fix centos job by using new mirror by @zhouyuan in #1890
  • [GLUTEN-1476][VL] Support GetStructField by @rui-mo in #1495
  • [VL]Fix aarch64 linux arrow-c-data jni lib missing by @liujiayi771 in #1896
  • [GLUTEN-1879][Core] Skip some operator when judging whether the execution plan has a fallback by @zzcclp in #1905
  • [GLUTEN-1632][CH]Daily Update Clickhouse Version (20230611) by @lwz9103 in #1909
  • [GLUTEN-1910][CH] Fix fallback when executing window function: lead, lag and so on by @zzcclp in #1911
  • [VL] Fix QAT/IAA build by @marin-ma in #1888
  • [GLUTEN-1871][CH]Fix: Remove BroadcastJoinBuilder :: storage_join_map and using guava cache to synchronize access by @baibaichen in #1873
  • [CH] Early throw exception if ColumnarBatch can't convert to CHNativeBlock by @exmy in #1886
  • [VL][DOC] Update celeborn doc by @kerwin-zk in #1917
  • [GLUTEN-1918][Core]Fix: Using Caffeine cache instead of guava cache by @baibaichen in #1919
  • [VL] Parameterized benchmark tool for gluten-it by @zhztheplayer in #1795
  • [GLUTEN-1868][VL] Optimize ColumnarToRow to reuse the allocated Buffer in native side. by @JkSelf in #1869
  • [VL] Add quotes for CPU_TARGET to avoid warning by @liujiayi771 in #1923
  • [GLUTEN-1877][CH] Fix bugs when reading hive text by @taiyang-li in #1881
  • Backport all commits since 1.0 code freeze by @xieqi in #2032
  • [GLUTEN-2100] backport recent important changes to 1.0 branch by @zhouyuan in #2123
  • [GLUTEN-2187][VL] Fix coverity cpp scan issues by @xieqi in #2189
  • [GLUTEN-2213][VL] Upgrade the dependencies version to meet the requir… by @xieqi in #2214
  • [GLUTEN-2227][VL] Fix coverity Java scan issues by @xieqi in #2230
  • [VL] Backport Delete shuffle spilled file directories by @marin-ma in #2249
  • [GLUTEN-2291][DOC] Cherry pick doc update to 1.0 branch by @xieqi in #2294
  • [VL][DOC] Cherry-pick: refine build guide for velox backend by @PHILO-HE in #2334
  • [GLUTEN-2274] Preparing Gluten release v1.0.0 by @xieqi in #2331

New Contributors

Usage

Two categories of jars are available for download, depending on whether you will use the Intel QuickAssist Technology (QAT) accelerator.

No QAT Test

Download and deploy the single jar (without the "qat" suffix) that matches your OS and Spark version. No extra third-party jar is needed. For example, for Ubuntu 20.04 and Spark 3.2, download the following jar:
gluten-velox-bundle-spark3.2_2.12-ubuntu_20.04-1.0.0.jar

QAT Test

To test Gluten on QAT, download and deploy the jar (with the "qat" suffix) that matches your OS and Spark version. Because Gluten's QAT jar is built with a dynamically linked native library, you also need to download and deploy the corresponding third-party jar, and enable spark.gluten.loadLibFromJar in your Spark configuration. For example, for CentOS 8 and Spark 3.2, download the following two jars:
gluten-velox-bundle-spark3.2_2.12-centos_8-1.0.0-qat.jar
gluten-thirdparty-lib-centos-8.jar
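
The bundle jar names above appear to follow a common pattern. As a rough sketch, the helper below (gluten_jar_name is a hypothetical name, not part of Gluten) rebuilds a file name from the Spark version, OS name and OS version. The pattern is inferred only from the two example names in these notes, so treat it as an assumption and check the release assets for other OS and Spark combinations.

```shell
#!/usr/bin/env bash
# Hypothetical helper: builds a bundle jar file name from the Spark version,
# OS name and OS version. The naming pattern is an assumption inferred from
# the two example jar names in these release notes.
gluten_jar_name() {
  local spark_version="$1" os_name="$2" os_version="$3" qat="${4:-}"
  local name="gluten-velox-bundle-spark${spark_version}_2.12-${os_name}_${os_version}-1.0.0"
  # Append the "qat" suffix only when the fourth argument asks for it.
  if [ "$qat" = "qat" ]; then
    name="${name}-qat"
  fi
  printf '%s.jar\n' "$name"
}

gluten_jar_name 3.2 ubuntu 20.04      # non-QAT bundle
gluten_jar_name 3.2 centos 8 qat      # QAT bundle
```

The two calls reproduce the Ubuntu 20.04 and CentOS 8 file names listed above.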

Configuration Example

spark-shell --name run_gluten \
 --master yarn --deploy-mode client \
 --conf spark.plugins=io.glutenproject.GlutenPlugin \
 --conf spark.memory.offHeap.enabled=true \
 --conf spark.memory.offHeap.size=20g \
 --conf spark.shuffle.manager=org.apache.spark.shuffle.sort.ColumnarShuffleManager \
 --jars https://github.com/oap-project/gluten/releases/download/v1.0.0/gluten-velox-bundle-spark3.2_2.12-ubuntu_20.04-1.0.0.jar
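
For the QAT case, a similar command might look like the sketch below. It reuses the options from the example above, passes both CentOS 8 jars to --jars, and enables spark.gluten.loadLibFromJar. Two things are assumptions here: that the QAT jars are published under the same v1.0.0 release download path, and that "enabled" means setting the option to true. The function name qat_spark_shell_command is illustrative only, and it prints the command for review rather than running it.

```shell
#!/usr/bin/env bash
# Assumption: the QAT jars live under the same release download path as the
# non-QAT jar used in the example above.
RELEASE_URL="https://github.com/oap-project/gluten/releases/download/v1.0.0"
QAT_JARS="${RELEASE_URL}/gluten-velox-bundle-spark3.2_2.12-centos_8-1.0.0-qat.jar,${RELEASE_URL}/gluten-thirdparty-lib-centos-8.jar"

# Prints the spark-shell command instead of executing it, so the options can
# be checked first. Remove the leading "echo" to launch spark-shell.
qat_spark_shell_command() {
  echo spark-shell --name run_gluten \
    --master yarn --deploy-mode client \
    --conf spark.plugins=io.glutenproject.GlutenPlugin \
    --conf spark.memory.offHeap.enabled=true \
    --conf spark.memory.offHeap.size=20g \
    --conf spark.shuffle.manager=org.apache.spark.shuffle.sort.ColumnarShuffleManager \
    --conf spark.gluten.loadLibFromJar=true \
    --jars "$QAT_JARS"
}

qat_spark_shell_command
```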

Note: This release does not support Hadoop HA or Kerberos.