Release Notes - Gluten - Version 1.1.0

We are excited to announce the release of Gluten-1.1.0.
This version is the culmination of work from 45 contributors, who landed over 800 commits of features and bug fixes since 1.0.0.

Highlights (Velox backend only)

20% performance improvement in Decision Support Benchmarks compared to v1.0.0
Support Spark 3.2 and Spark 3.3
Support Spark 3.4 (experimental)
Pass all Velox UTs and Spark 3.2/3.3 SQL-related UTs
Support Ubuntu 20.04/22.04, CentOS 7/8, alinux 3, Anolis 7/8
Support File System: localfs, HDFS, S3, OSS(via s3a), GCS
Support File Format: Parquet, ORC
Support Data Lake: deltalake (experimental)
Support Data Types: Primitive Type, Decimal, Date, Timestamp, Array (partial), Map (partial), Struct (partial)
Support 28 common Spark Operators (see the project documentation for the full list)
Support 199 common Spark Functions (see the project documentation for the full list)
Support Dynamic Memory Pool and Spill
Support Velox UDF
Support Gluten UI to show fallback events in the History Server
Support Hadoop HA and Kerberos
Velox code updated to 20231123 (commit-id: aff0cde)
Documentation improvements for supported features and configuration

Known Issues

Only static partition writes are supported in Spark 3.2 and 3.3

New Features


#3722	[CH] improve mutex usage in shuffle writer
#2063	[CH] Spark sql config load dynamic by task
#3257	[VL] We may need more metrics collected by Velox
#3528	[VL] Construct unique partition/sort keys and removing overlapping sort key for window plan
#3381	[CH]Reuse last WholeStageTransformer instead of creating new one in FileFormatWriter
#2118	[CH] Support hive udtf
#2128	[CH]Support tablesample clause
#2163	[CH] support approx_percentile aggregate function
#2193	[CH] Support some array functions
#2207	[CH] Support function to_utc_timestamp/from_utc_timestamp
#2136	[CH] HiveTransform add metrics `readBytes`
#2439	[VL] array_aggregate support with lambda function
#2451	[CH] Support StaticInvoke function
#2460	Avoid force check Java thread in native side
#2465	Remove operator level fallback policy
#2472	[CH] Remove BasicScanExecTransformer#getInputFilePaths when CH support more general partition location parsing
#3187	[CH] Implement runtime native bloom filter
#2267	[CH] Support urldecoder which is used in reflect("java.net.URLDecoder", "decode", event.event_info['currenturl'], "UTF-8")
#2309	Implement Streaming Window in Velox backend to reduce the memory usage.
#2323	[CH] Build optimization
#2343	[VL] ShuffleWrite: Larger shuffle size than vanilla spark and long compression time
#2365	[CH] gluten should support setting max bytes for a partition for orc/parquet
#2390	[CH] Aligning the NULL and NaN compare semantics of Spark and CH
#2600	[CH] enhance S3 client caching
#2617	[VL][Spark 3.3+] support pushdown aggregate to native scan insteads of fallback
#2619	[VL][Spark 3.3+] support match columns use filedIds in native insteads of fallback
#2667	[VL] Stacktrace-categorized memory allocation dumping for debugging
#2730	Request for documentation on how to write a backend for 3rd party engines
#2761	[DOC] A doc named index.md share same content with README.md
#2772	[VL] When performance degradation，What factors may affect the performance？
#2783	[VL]Run CI with DEBUG build mode to enhance stability
#2791	[VL] Support spark function: concat_ws
#2793	Code refactor: move some common code to a root module named common
#2807	Code cleanup: FunctionConfig may be useless
#2515	when we will support spark -gpu ,now we need spark -gpu feature to train big model
#2535	UnsupportedOperationException is abused
#2593	List parquet write semantic differents in Spark and gluten
#2804	Handle timeZoneId for TimezoneAwareExpression
#2815	[VL] complex data type support in parquet scan
#2825	[VL] In Java, consolidate GlutenColumnarBatchSerializer and CelebornColumnarBatchSerializer
#2826	[VL] Use a dedicate class to maintain gluten native config
#2845	[VL] Separate each jni wrapper to different files
#2874	[VL] support `spark.sql.decimalOperations.allowPrecisionLoss`
#2877	[VL] Support read iceberg
#2905	[VL] Support percentile function
#2919	[VL] Support ORC format in HiveTableScanExecTransformer
#2956	[VL] Support NullType in Project
#2975	[VL] Track MemoryManager feature
#3015	[CH] ReusedExchange: Gluten does not touch it or does not support it
#3017	[VL] Allow users to set spill partitions/levels
#3033	[CH] Support aggregation spill for the second stage
#3049	[CORE] Statement level controls whether to use gluten
#3817	[CH] Optimize mergetree prewhwhere
#3704	[CH] support tuple subcolumn pruning for orc/parquet
#3784	DNM
#3144	[CH] Aggregation supports complicate type
#3715	[VL] Add support for GCS
#2106	[VL] CI: allow to benchmark TPCH performance on comment
#3702	[VL] Add sort based window support in velox backend
#2404	[VL] Enable Velox memory reclaimer for auto disk-spilling
#3082	[CORE] Support columnar CollectLimit
#3739	[VL] Add config to disable velox file handle cache
#3055	[VL] Use mixed memory (off-heap and on-heap) for native
#3077	[VL] EP: Centralized lifecycle management for C++ / JNI contextual objects
#3142	[VL] Tight Java-C++ object binding
#3075	[VL] Support static partition write in VL backend
#2533	Degrade Arrow version to 8.0 in VL backend.
#2629	Use Project + Unnest to implement Expand operator
#3132	Add streamingwindow support in velox backend
#3361	Support Spark 3.4 in Gluten.
#3425	[VL] Create Hdfs folder in Gluten side when writing hdfs file
#3541	[VL] Add minimal GHA CI job for debug build
#3705	[CORE] Support mapping one custom aggregate function to more than one backend functions
#3689	[CH] s3 cache enhancement
#3667	[CH] Controller read mergetree block size by config
#3635	[VL] Clean up arrow related doc
#3553	[CH] Support bucket scan for ch backend
#3594	[VL] Allow users to set bloom filter configurations
#3609	[CH] csv integer read enhancement in excel format
#3141	[CH] Do optimization for date comparison
#3542	[CH] cancel empty string as null representation when reading csv with excel format
#3598	[CH] Support to config the hash algorithm for the ch shuffle hash partitioner
#3590	[CORE] Reducing driver memory usage by using serialized bytes instead of proto objects in GlutenPartition
#3297	[CH] utilize ORC filter push down to reduce remote read IO
#3383	[CH] respect spark config spark.sql.orc.compression.codec
#3511	[CORE]Support to add custom aggregate functions for extension
#3459	Skip unnecessary local sort for static partition write. From [SPARK-37194]
#3335	in package name io.glutenproject.substrait.ddlplan, ddlplan maybe dllplan?
#3408	Remove ENABLE_LOCAL_FORMATS build option
#3405	[CH] tolerate empty blocks when native writing
#3388	[VL] Remove datediff function name mapping
#3373	[CH] Support uuid() function
#2756	Add a contributing guide
#2436	[CH] clickhouse startsWith and endsWith function support utf8 version
#2583	[CH] Implement native orc reader without arrow dependency.
#3145	[CH] array join supports nullable array/map as input argument, which simplifies explode/posexplode transformation
#3334	fix: fix typo in `ExpandExecTransformer`
#3206	[CH] SPARK-36926 is not supported by ch backend
#3244	[VL] Refine ExecutionCtx to return raw pointer instead of shared pointer
#3250	[CH] Enable aggregation hash table prefetch
#3258	[DOC] Corrent bitwise_or function symbol and malposition
#3195	[CH] Unsupport sort field
#3248	[CORE] Export gluten version in property
#3190	[CH] Support decode function
#2943	[VL] implement `LATERAL VIEW` for Velox backend
#3143	[CH] Multiple Lateral view posexplode use too much memory
#3160	[CH] simplify the logic of array_contains transformer
#2849	[VL] EP: Memory management enhancements after integrating with Velox arbitrator API
#2848	[VL] Use one single Velox root memory pool instance per Spark task
#3150	[CH] Simplify filter rel parsing in CH
#3115	[CH] using new native orc input format and support it with nullable complex types
#3019	[CH/VL] Memory management should only care about physical memory
#3040	[VL] Make Backend be stateful to hold all native resources
#2484	[CH] Enable Spark33 UT
#2974	[VL] Let VeloxMemoryManager manage AllocationListenr and be non copyable/assignable
#2808	[VL] All VeloxMemoryPools should be shrinkable through Spark's spill calls
#2970	[CH] respect clickhouse settings: query_plan_enable_optimizations
#2686	[CH] Gluten shuffle service with ClickHouse backend supports Celeborn
#2886	[CH] Improve clickhouse spill settings
#2881	[CH] use both \N and empty string as null representation when reading csv with excel format
#2967	[CH] add a switch to control int type infer reading with excel format
#2853	[VL] Refactor arrow memory pool usage in native
#2504	[VL][Discussion] Better way to deploy UDF libraries
#2942	[CH] Scanning natively written orc files with/without native engine returns different results
#2929	[CH] Read s3 file using context readsetting instead of new
#2953	Simplify SerializedPlanParser
#2854	[VL] Shuffle writer should use tracked memory pool in native
#2915	[VL] Enhance the way to debug cpp
#2816	[VL] hive scan in ORC format support
#2860	[CH] Support removing files cache for CH Backend
#2862	[VL] OrcTest was broken and should be fixed or skip
#2822	[VL] Let NativeMemoryAllocator#contextInstance return a prepared mempool
#2829	[CH] adding fallback related suites for CH backend
#2830	[CH] Enable more Spark 32 UT
#2723	Improve CH Backend Broadcast Join performance
#2851	[VL] Use more verbose way for ArrowSchema's allocate and release
#2610	[CORE] Refactor some ExpressionTransformer
#2833	[CH] Avoid unnecessary copy of std::vector in c++
#2781	[CH] Enable spark33 UT
#2785	[CH] Optimize the setting of lib path
#2710	[CH] Fallback when date_format with non-literal format
#2688	[CH] Support function find_in_set
#2203	[CH] Support read file from alluxio
#2682	[VL] support array extract
#2587	[VL] Support GetArrayItem/GetMapValue functions
#2627	[VL]surpport GetMapValue expression
#2456	[CH] Support filter push down in parquet format reader
#2626	[CORE] JsonString is initialized during wholestage transform
#2312	Move some of velox functions to gluten side
#2738	[CH] Improve join performance while left/right table is empty
#2733	Introduce automatic-resource-management pattern
#2670	[CH] ignore a certain number of special characters surrounding int value when reading from csv with excelformat
#2677	[CH] ignore a certain number of special characters surrounding float value when reading from csv with excelformat
#2611	[CH] Remove throw exception for improve excelformat performence
#2673	[CH] lpad/rpad support non-constant third arguments
#2679	[CH] Support function elt
#2676	[CH] support function shiftrightunsigned
#2655	[CH] support function substring_index
#2672	Let C2R/R2C operators be parts of whole stage
#2561	[VL] can you provide the release 1.0 jar package for spark 3.3?
#2490	[CH] Improve the performance of FormatFile::readSchema
#2228	[CH] support ansi cast used in native writer scenario
#2164	[CH] Support function concat_ws
#2552	[CH] support native insert into table with orc input format
#2590	[VL] Can the community provide version 1.0.0 of the thirdparty-centos7 jar?
#2182	[VL] native UDF framework
#2508	[CH] support initcap function
#2531	Introduce GlutenException
#2391	[CH] Improve date type read when using excel format
#2443	[CH] Inttype support read float value from csv when using excelformat
#2348	[CH] Function needed to be supported with high priority
#2389	Support scan from custom partition
#2199	[CH] support dynamically configure per bucket roles
#2360	[CORE] Adding fallback polcies
#2432	[CH] support str_to_map
#2217	[CH] Introduce Signal Handler to print stack info when get `SIGSEGV`
#2397	[VL] enable stacktrace for velox user error
#2399	[CH] Seperate aggregate function parses from scalar function parers
#2264	[CH] Support parse_url
#2367	[CH] Support functions positive/negative
#2288	[CH] Functions needed to be support in high priority
#2362	Record all the changes to Substrait
#2265	[CH] Support str_to_map function
#2355	[CH] Support user define `TextInputFormat` in HiveTableScanExecTransformer
#2263	[CH] Support last_day function
#2346	[VL] Move Velox-Substrait plan conversion from Velox to Gluten cpp

Bugs Fixed


#3865	[CH] Refactor the method of aggregating without keys
#2093	[VL] JDK ERROR
#3245	[VL] Runtime issue when serializeFromObjectExec is used
#2595	[VL] Distinct agg doesn't spill to disk
#2113	[CH] Csv reader has different result with spark when has boolean field
#3215	[VL] centos7 spark 3.2.4 + velox protobuf conflict lead to jvm crash.
#3230	[CH] missing input size in spark ui
#3231	[CH] Tpcds71 can not pass if set `supportShuffleWithProject` return true
#3234	build source error centos not sudo -E
#3235	[VL] Shuffle writer stop throws Error writing bytes to file
#3778	[VL] Remaining filter behavior change after velox switch to meta velox
#2195	compile from source code error on ubuntu
#2209	[CH] Inconsistent behavior on arithmetic calculations
#2233	[CH] Empty data using agg cause different result betweem gluten and spark
#2416	[VL] readInt128 error
#2418	[VL] enqueueRowGroup error
#3826	[CH] Data is lost with large shuffle partition number
#2144	[CH] `date_trunc`: support specified timezone and `foldable` `format` expression
#2419	[VL] eval error
#2420	[VL] TreeOfLosers error
#2428	[CH] spark AQE may not work for adjusting partition number
#2433	error "Unable to locate package thrift" when run ./build_velox.sh
#2438	CMake Error at CMake/resolve_dependency_modules/re2.cmake:38
#2447	Gluten 1.0 executor crash issue on CentOS
#2453	Not in a Spark task
#2461	[CH] memory usage record in gluten is not correct
#2471	[CH] output rows seem to double in some operators
#2476	[VL] gluten not support alios 7
#2376	[CH] collect_list(null) will throw exception
#2378	[VL] Writing parquet table still use vanilla spark InsertIntoHadoopFsRelationCommand
#2415	[VL] Invalid varint value: too few bytes
#2615	[VL]The problem with running./build_velox.sh stopping
#2620	[VL] free(): invalid pointer
#2621	OOM with TPCDS Q4 and Q95
#2658	[core] attribute bind error happens when Exchange rangepartitioning get attribute from its child's output
#2684	Spark parquet scan always return timestamp_ntz column when reading test file
#2685	Memory not controled
#2776	[VL] get_json object error
#2777	[CH] get_json_object path parse error while path is '$.123'
#2812	[VL] with the same memory size, vanilla spark passed but Gluten reports OOM
#2477	`checkExceptionInExpression` not hooked by gluten
#2497	[VL] driverUrl NullPointerException
#2503	[CH] read from hdfs may cost miniutes
#2506	[CH] ListLiteralNode may encounter exception: Not supported on UnsafeArrayData.
#2525	[VL] DictionaryVector error
#2532	native shuffle reader seems has bug for array type
#2537	[VL] Crush at shuffle using gluten with velox backend at spark3.2
#2549	[VL] java.lang.RuntimeException: java.io.FileNotFoundException: libthrift-0.13.0.so
#2591	[CH] precision may overflow
#2798	[VL] Failed to read FlatVector
#2805	[CH] ExpandStep may have duplicated col name
#2814	[VL] substrait plan conversion error
#2821	[CH]per_bucket_clients is not thread safe
#2876	[CH] io.glutenproject.exception.GlutenException: Indices in strings are 1-based
#2917	[CH] Read parquet from s3 accured an error.
#2938	[Core]InvalidClassException BasicWriteJobStatsTracker
#2951	[CH] if exchange fallback job will fail using celeborn
#2952	[VL]Bad memory usage with Spark speculation, possible memory leakage
#2969	[CH] aggregate meta mismatch
#3007	[VL] Would not build Gluten from spark 3.2.0
#3008	[VL] nonSpilledRows is not empty after spill finish
#3016	[CH] Task core while set config `spark.speculation true`
#3034	[VL] Fix allocating negative buffer in R2C
#3801	[VL] isAdaptiveContext null value for ColumnarOverrideRules
#3796	[VL] remove flaky unit tests
#2509	Is Gluten limited by CPU type?
#3779	[CH] Fix core dump when executing sql with runtime filter
#2048	[VL] Extra copy introduced to workaround the non-16B aligned crash
#2820	[VL] CI: Error setting up docker container
#3118	[CH] Function Row_Number result is not same compare to spark
#3119	[CH] Sum of float field’s result is not unique
#3048	[VL] Unreliable exception handling causing JVM crash
#3054	[CH] core when free memory
#3718	[VL] The latest code is very slow when reading HDFS data
#3668	[CH] Performance regresses seriously after PR 3169 merge when executing convert string to date
#3692	[CH] No result returned while query sum/count from empty table
#3548	[CH] TextFile reader can not read csv file while end of line is CR(\r)
#3058	[VL] executor core dump on speculation on
#1902	CH has some limitations on translate function, but velox doesn't
#3109	[VL] Core dump in gluten::VeloxShuffleWriter::shrinkPartitionBuffers()
#2746	Enable velox parquet write in upstream_velox branch
#3390	Spark Plan error when running TPC-DS Q75
#3596	[VL] Fix the NPE exception when static partition writing in S3 system
#3695	[VL] libre2.so.5: cannot open shared object file: No such file or directory
#3684	[VL] Cannot turn off the shuffle compression with --conf spark.shuffle.compress=false
#3670	[CH] Escape error when read csv use excel format
#2421	[VL] values_->capacity() >= byteSize
#2891	Velox doesn't work with Spark Delta Lake
#3576	[VL]Why is Gluten significantly slower than Spark with the same configuration?
#3654	TPC-DS Q74 failed with found duplicate key in TopNNode.
#3637	WholeStage pipeline metric was broken
#3572	[VL] Compile.sh script fails with libarrow not found when built with Debug flag.
#3644	[CH] Revert the logic to support the custom aggregate functions
#3627	[CH] null value bug when reading csv with excel format
#3631	[VL] Operators not getting self-spilled
#3451	[CH] to_date can not handle timestamp which equals '1970-01-01 00:00:00'
#3552	[CH] TextFile csv reader should not remove the field whitespaces
#3521	[CH] Bug fix substring function start index must start from 1
#3621	Unit test OnHeapFileSystemTest always failed
#2564	[VL][UT] GlutenBroadcastJoinSuite failed with LiveListenerBus is stopped
#3380	[VL] Gluten is Slower than Vanilla Spark
#3534	[CH] Fix incorrect logic of judging whether supports pre-project for the shuffle
#3535	[CH] Support hive scan dir recursive
#3509	[CH] Fix partition lock problems when multi-thread in shuffle write
#3423	[CORE] convertBroadcastExchangeToColumnar will fail for non-WholeStageCodegenExec fallback plan
#3501	[CH] to_json function result is diffrent from spark when the input is `struct` type
#3512	[CH] Empty block causes IllegalStateException when native writing
#3412	[CH] element_at function error while first argument type is map type
#3450	[CH] Support decimal `allowPrecisionLoss=false`
#3462	[CH] Round function get different result from spark
#3489	[CH] In Filter contains null value will cause exception
#3446	[CH] Cast float-type string to float lost precision
#3113	When set `spark.plugins` with gluten, but disabled `spark.gluten.enabled`, cause some issue
#3241	[CH] Disable conversion of ColumnarToRowExec to CHColumnarToRowExec when operator of SourceScan fallback vilina spark
#3252	[CH] fix race condition at the global variable of isAdaptiveContext in ColumnarOverrideRules
#3467	Fix 'Names of tuple elements must be unique' error for ch backend
#3480	Fix incorrect metrics values for ch backend
#3429	[VL] TPCDS 100T Core in __memmove_evex_unaligned_erms
#3431	[CH] Unexpected inf or nan to integer conversion
#3371	Fix CoalesecExec operator
#3417	[CH] Reduce memory consumption in `Expand` operator
#3134	[CH] Left join result is null while the field type not match
#3377	Columnar table cache should be compatible with TableCacheQueryStage
#3400	[CH] Backend named literal direct use literal value causing performance issues
#3366	[CH] Empty block make celeborn shuffle write failed
#3136	[CH] Array join tries to allocate too much memory leading to fail
#3094	[CH] CAST float-type string to int is null
#1649	libch.so coredump when release rocksdb static resources
#3135	[CH] to_date function can not work on timestamp string with timezone
#3216	[CH]Core dump casuse by `ReadRel` built from aggregate transform
#3287	[CH] Diff when divide zero
#3271	[CH] Result not same from spark while get arrayElement from split function array result
#3337	[CH] get_json_object can not parse string contains ctrl chars
#3351	HiveFileFormat has incompatible class error when running TPC-H q18
#3328	[VL] batch holder in executorCtx is not thread safe but batch access from two threads
#3253	[CH] csv integer read bug with sign(+-) at the end of line in excel format
#3221	[CH] negative decimal read wrong when r2c
#3104	[CH] get_json_object can not parse string which contains multi backslash json string
#3302	[VL] PR 3239 introduce a dead lock in our environment
#1653	Is the nothing type of substrait will be stable?
#3189	[VL] ShuffleWriter spill throws Error reallocating 0 bytes
#3208	[VL] Too many logs print to stdout
#3149	[CH] Unexpected inf or nan to integer conversion
#3213	[CH] null column determination bug when execute row to column transform
#3124	[CH] lateral view result is different from spark while the array is null
#3226	[CH] Data is not evenly divided in parquet/orc reader
#2752	[fatal bug] query case when with hour result wrong
#2759	case when unix_timestamp(xx) query data result wrong
#3026	[VL] The comparison query result between string time and timestamp is incorrect.
#3028	[VL] DATE_FORMAT(time_str, 'yyyyMMdd') BETWEEN xx AND xx query incorrect
#3071	[CH] MapFromArrays throws exception
#2612	[VL] updateHdfsTokens double free or corruption
#2964	[CH] Memory statistic issue
#2870	[CH] Memory statistics maybe not correct
#3201	[CH] spark load gluten jars classpath not consistent
#3140	[CH] Array_contains can not use with split function
#2941	[VL] Error: (shuffle) nativeEvict: evict failed - Cannot evict partition -1 because writer is stopped
#2847	[CH] HDFS config not work
#3148	[VL] Unreasonable OOMs in isolated memory mode
#3154	[CORE] BatchScanExecTransformer not override doCanonicalize
#3137	[CH] Skip map field with complex key type in native orc input format when reading shema from orc file
#3072	[CH] Optimization #2410 takes large memory consumption
#3108	[CH] Hive table can not be deserialized while set input format `org.apache.hadoop.hive.serde2.OpenCSVSerde`
#3105	[CH] to_json has different result with vanilla spark
#3127	[CH] posexplode index start from 1
#3012	[CH] Incorrect metric for shuffle read records when running with celeborn
#3098	[CH] Field can not be deserialized while field.delim is set empty string in TextInputFormat
#2454	[CH] posexplode return incorrect position
#2492	[CH] posexplode cannot handle null value as argument
#2894	[CH] Missing output rows metric in native insert
#3102	[CH] Column shuffle with celeborn hangs on spill
#3090	[CH] Method get64 is not supported for Nothing
#3067	[CH] unix_timestamp throw exception
#3062	[VL] Crash in libunwind during shuffle reader is actively running, after task is killed
#1727	[CORE] Protocol message had too many levels of nesting
#2414	[CORE] Protocol message had too many levels of nesting
#3057	[VL] Crash by dangling pointer to Arrow memory pool in shuffle writer, when task is killed
#3051	[CH] native write core because of invalid partition column index passed to Java_org_apache_spark_sql_execution_datasources_CHDatasourceJniWrapper_splitBlockByPartitionAndBucket
#3041	[CH] `stop` is called before `write` finished in `CelebornHashBasedCHColumnarShuffleWriter`
#3023	[CH] scan from json table failed
#2987	[VL] Error: Native split: shuffle writer stop failed - Merging from spilled file NO.7 is not exhausted
#2988	[CH] Memory leak in native writer
#2947	[CH] arg types not supportted for multiIf
#2505	[VL]VeloxMemoryPool error
#2340	[CH] Parse List Literal does't consider nullable
#2834	[CH] wrong negative decimal32 values after columnar to row
#2767	[CH] Cast to decimal32 reutrn wrong value
#2966	[CH] native write compatiability for union operator
#2978	[VL] GlutenConfig.getConf should not be called before SQLConf initialize
#2972	[CH] Read bool field from excel csv cause core dump
#2934	[CH] Scala.MatchError: knownfloatingpointnormalized
#2959	Issue in getting mapped substrait name for expression not defined
#2925	[CH] Error with partition key is date
#2930	[CH] Crash at hash when using celeborn column shuffle
#2900	[CH] blocks are not materialized if native insert's child is union, which will cause crash in CH ORC/Parquet output format
#2906	[CH] ThreadStatus destroy on other threads.
#2884	[CH] Fix crash when native write empty dataset
#2863	[CH] error reading for double type column with excel format
#2839	[CH] duplicated actions in step: "Remove nullable properties"
#2796	[VL] unix_timestamp("20230817", "yyyyMMdd") return wrong value
#2836	[VL] Complex type does not release memory util shuffle writer destructor
#2157	[CH] Some string functions cannot handle non-ASCII strings
#2498	[CH] gluten submit additional jobs to get input paths
#2809	[CH] S3 config bucket name not support with char '.'
#2661	[Core] not support insert values
#2762	TPCDS q4 need more memory and run slower after memory arbitrator
#2491	[CH] sequence(1, null) will throw exception
#2625	Memory leak found on TPCDS
#2758	[CH] Thrift server failed start due to jar conflict
#2417	[VL] Invalid regular expression ^0+(?!$)
#2740	The Gluten endpoint address should support IPV6
#2180	[CH] TxtFormat table deserialize failed while the input fields too much/few, or type mismatch
#2699	[VL] org.apache.spark.sql.execution.ColumnarBuildSideRelation cannot be cast to org.apache.spark.sql.execution.joins.HashedRelation
#2513	[fatal bug] can't query hive data
#2703	[CH]`get_json_object` result's format is wrong when there is only one result
#2711	[CH] Core dump when filter not has expression
#2689	[VL]HdfsConfigNotFound: config key: dfs.ha.namenodes.xxxx not found
#2641	[CH] Tuple cast error while query array
#2582	[CH] Program crash while use `array<struct>` or `map<struct>` type as filter
#1680	[CH] ExpandTransform cause exception "Cannot pull block from port which is not needed"
#2639	[CH] log1p(-1.0) should return null
#2474	[CH] unequal result with spark
#1907	Get exception after config spark.gluten.sql.columnar.backend.ch.runtime_config.config_file for libch
#2652	[CH] init common csv input format settings in global context
#2375	[CH] split function throw exception
#2638	[VL] s3 endpoint can't use default setting of instance
#2586	[CH] window operator meet parse exception
#2029	[CH] HiveTableScanTransform cann't read child dirs files
#2431	[CH] Transform `ObjectHashAggregate` failed with `concat_ws`
#2584	[CH] mismatch complex type values during native orc writing
#2430	[CH] Add pre-projection for hash shuffle with expressions failed
#2569	Potential bug when read orc and parquet table in the same time
#2424	[VL] ProjectExecTransformer: Validation failed for class io.glutenproject.execution.ProjectExecTransformer due to key not found: transform
#2556	Why use glutenAlloc only for allocateBytes and freeBytes?
#2459	[CH] Reading from struct fields returns null
#2520	No Java license checker
#2463	[CH] wrong result when filter with struct subcolumn
#2222	The implementation for to_unix_timestamp unix_timestamp should be separated for velox/ch
#2306	[VL] Support timezone for timezone aware functions
#2450	[CH] permission denied during native insert
#2441	Dataproc/Spark 3.3 build fails
#2466	[CH] Empty array/tuple/map values are written in orc/parquet files when null values are natively inserted into table
#2219	Cover more modules: let the build fail if there is a warning
#2448	[CH] `count` failed with multiple arguments with different types
#2357	[VL] MakeDecimal is not able get the correct precision
#2316	overwrite dir failed
#1567	Expression cast(3 as bool) returns different results in spark and ch
#2394	[VL]Fix wrong schema info in Velox parquet writer
#2422	[CH] Parse NaN/Inf exception in RangeSelectorBuilder
#1914	[CH] return empty when read complex type field which is null

Note

Binary files are provided for the Velox backend only.

Full Changelog: v1.0.0...v1.1.0

Usage

We provide two categories of jars for you to download and use, depending on whether an Intel QuickAssist Technology (QAT) accelerator will be used. Please note:

To simplify usage, we only provide statically built jar files. Gluten also supports dynamic builds from source.
The release jars are built against the minimum versions of the dependency libraries so that they run on most operating systems. For the best performance, we suggest building Gluten from source in your own environment.

No QAT Test

You can download and deploy a single jar (without the "qat" suffix) corresponding to your Spark version. No extra third-party jar is needed. For example, if you are using Spark 3.2, download the jar below.
gluten-velox-bundle-spark3.2_2.12-1.1.0.jar

QAT Test

For testing Gluten on QAT, you need to download & deploy a jar (with "qat" suffix) corresponding to your OS & Spark version.
gluten-velox-bundle-spark3.2_2.12-ubuntu_22.04-1.1.0-qat.jar
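The two file names above suggest a naming pattern: Spark version, Scala 2.12, an optional OS tag, the release version, and a "qat" suffix for QAT builds. The sketch below composes a jar name from that pattern. The function name `gluten_jar_name` is hypothetical and not part of Gluten, and only the two names listed above are confirmed by these notes; check the release assets before relying on any other combination.

```shell
#!/usr/bin/env bash
# Hypothetical helper (not shipped with Gluten): compose a release jar name
# from the pattern seen in the two file names listed in these notes.
gluten_jar_name() {
  local spark_version="$1"   # e.g. 3.2
  local os_tag="${2:-}"      # e.g. ubuntu_22.04; leave empty for the non-QAT jar
  if [ -n "$os_tag" ]; then
    # QAT jars carry an OS tag and a "qat" suffix
    printf 'gluten-velox-bundle-spark%s_2.12-%s-1.1.0-qat.jar\n' "$spark_version" "$os_tag"
  else
    # Non-QAT jars depend only on the Spark version
    printf 'gluten-velox-bundle-spark%s_2.12-1.1.0.jar\n' "$spark_version"
  fi
}

gluten_jar_name 3.2                # prints gluten-velox-bundle-spark3.2_2.12-1.1.0.jar
gluten_jar_name 3.2 ubuntu_22.04   # prints gluten-velox-bundle-spark3.2_2.12-ubuntu_22.04-1.1.0-qat.jar
```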

Configuration Example

spark-shell --name run_gluten \
 --master yarn --deploy-mode client \
 --conf spark.plugins=io.glutenproject.GlutenPlugin \
 --conf spark.memory.offHeap.enabled=true \
 --conf spark.memory.offHeap.size=20g \
 --conf spark.shuffle.manager=org.apache.spark.shuffle.sort.ColumnarShuffleManager \
 --jars https://github.com/oap-project/gluten/releases/download/v1.1.0/gluten-velox-bundle-spark3.2_2.12-1.1.0.jar
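If you prefer not to repeat these flags on every launch, the same settings can be kept in a Spark properties file. The sketch below writes one, assuming `spark.jars` as the property equivalent of `--jars`. The file location is a hypothetical choice and the values are copied from the command above.

```shell
#!/usr/bin/env bash
# Hypothetical output location; any writable path works.
conf_file="${TMPDIR:-/tmp}/gluten-spark-defaults.conf"

# Same settings as the spark-shell command above, in properties-file form
# (key and value separated by whitespace).
cat > "$conf_file" <<'EOF'
spark.plugins                 io.glutenproject.GlutenPlugin
spark.memory.offHeap.enabled  true
spark.memory.offHeap.size     20g
spark.shuffle.manager         org.apache.spark.shuffle.sort.ColumnarShuffleManager
spark.jars                    https://github.com/oap-project/gluten/releases/download/v1.1.0/gluten-velox-bundle-spark3.2_2.12-1.1.0.jar
EOF

echo "Wrote $conf_file"
# On a machine with Spark installed, the file can then be passed explicitly:
#   spark-shell --name run_gluten --master yarn --deploy-mode client \
#     --properties-file "$conf_file"
```

Note that `--properties-file` is read instead of the default `conf/spark-defaults.conf`, so any settings you rely on from that file need to be copied over as well.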
