Changelog

11.0.0 (2022-08-16)

Full Changelog

Breaking changes:

Implement exact median, add AggregateState #3009 [sql] (andygrove)

Implemented enhancements:

Make RowAccumulator public #3138
docs: proposal for consolidating docs into a Contributor Guide #3127
feat: support Timestamp +/- Interval #3103
Add an arrow_typeof like PostgreSQL's pg_typeof #3095
Add DataFrame section to user guide #3066
Document all scalar SQL functions in user guide #3065
Simplify implementation of approx_median so that it can be exposed in Python #3063
Support double quoted literal strings for dialects (such as mysql, bigquery) #3055
Simplify / speed up implementation of character_length to unicode points #3049
Follow-up on Clickbench benchmark #3048
Why is the PhysicalPlanner an async trait? #3032
Optimize file stream metrics. #3024
Proposal: Enable typed strings expressions for VALUES clause #3017
Proposal: Add date_bin function #3015
The upcoming release of Arrow (20?) breaks datafusion #3006
Can I select some files for query based on the filtering rules in the directory? #2993
Rename FormatReader to FileOpener #2990
Derive Hash trait for JoinType #2971
CAST from Utf8 to Boolean #2967
Add baseline_metrics for FileStream to record metrics like elapsed time, record output, etc #2961
Example to show how to convert query result into rust struct #2959
simplify not clause #2957
Implement Debug for ColumnarValue #2950
Parallel fetching of column chunks when reading parquet files #2949
Extension mechanism for SessionConfig #2939
Streaming CSV/JSON Object Store Read #2935
Support CSV Limit Pushdown to Object Storage #2930
Add support for pow scalar function #2926
Add support for exact median aggregate function #2925
Support mean as synonym for avg #2922
Rename a column name #2919
Move ScalarValue tests alongside implementation, move from_slice to core #2913
Fail gracefully if optimization rule fails #2908
Make ObjectStoreRegistry a trait so that Ballista can introduce its own ObjectStoreRegistry #2905
Remove datafusion-data-access crate #2903
Improve formatting of logical plans containing subquery expressions #2898
Atan2 added to built-in functions #2897
The explain statements only print logical plans for debug/other purpose. #2894
JSON version of display_indent() #2889
It would be nice to have a way to generate unique IDs in optimizer rules #2886
Add support for TIME literal values #2883
Add h2o benchmark #2879
Implement from_unixtime function #2871
Add cast function for creating logical cast expression #2870
Release DataFusion 10.0.0 #2862
Implement information_schema.views #2857
Migrate from avro_rs to apache_avro #2783
Add optimizer rule to remove OFFSET 0 #2584
Preserve Element Name in ScalarValue::List #2450
Add EXISTS subquery support to Ballista #2338
Add documentation on supported functions to datafusion website #1487
Documentation for datafusion-cli can be consolidated a bit more #1352
Optimizer: Predicate Rewrite pass for TPCH Q19 #217
feat: add optimize rule rewrite_disjunctive_predicate #2858 (xudong963)

Fixed bugs:

Regression in SQL support for ORDER BY and aliased expressions #3160
panic when dealing with @ operator #3137
Incorrect type coercion rule for date + interval #3093
Casting string to timestamp crashes for input times before 1970 with fractional seconds #3082
INTEGER type doesn't work while importing csv #3059
Cannot GROUP BY Binary #3050
incorrect i32 coercion for to_timestamp #3046
Error pruning IsNull expressions: Column 'instance_null_count' is declared as non-nullable but contains null values #3042
I want to query some files in a directory. Is there any way? #3013
The expression to get an indexed field is only valid for List types (common_sub_expression_eliminate) #3002
Double to_timestamp_seconds produces abnormal result #2998
External parquet table fails when schema contains differing key / value metadata #2982
SELECT on column with uppercase column name fails with FieldNotFound error #2978
panic reading AWS-generated parquet file #2963
Can't filter row groups for parquet pruning for some data types #2962
CI test is failing with final link failed: No space left on device #2947
bug: new ObjectStore breaks backward compatibility with contrib plugins #2931
bug: file types handled wrong #2929
bug: changing the number of partitions does not increase concurrency #2928
csv_explain fails on RC verifier #2916
index out of range error from datafusion_row::write::write_field #2910
Optimization rule CommonSubexprEliminate creates invalid projections #2907
serde_json requires that either std (default) or alloc feature is enabled #2896
Inconsistent type coercion rules with comparison expressions #2890
Doc Error: the test directory link 404 which is in CONTRIBUTING.md #2880
Round trips through ScalarValue's sometimes don't preserve types (e.g. change types from DictionaryArray) #2874
Error with CASE and DictionaryArrays: ArrowError(InvalidArgumentError("arguments need to have the same data type")) #2873
window functions not supported in expressions #2869
Unable to work with month intervals #2796
Discord invite link in communication page has expired #2743
Test (path normalization) failures while verifying release candidate 9.0.0 RC1 #2719
Reading parquet with (pre-release) arrow fails with "out of order projection is not supported" #2543
Fix SQL planner bug when resolving columns with same name as a relation #3003 [sql] (andygrove)
fix RowWriter index out of bounds error #2968 (comphead)
fix: support decimal statistic for row group prune #2966 (liukun4515)
Fix invalid projection in CommonSubexprEliminate #2915 (andygrove)

Documentation updates:

MINOR: Fix broken links in contrib guide #3135 (andygrove)
MINOR: User Guide: Move expressions to top-level page #3134 (andygrove)
User Guide: Combine CLI pages #3133 (andygrove)
User Guide: Add documentation for JOIN syntax #3130 (andygrove)
separate contributors guide #3128 (kmitchener)
minor: remove python docs, now they're in another project #3119 (kmitchener)
minor: doc fixes: fix link to datafusion-python project and add link to slides for rece… #3118 (kmitchener)
Add all scalar SQL functions to user guide #3090 (andygrove)
Add DataFrame reference to the user guide #3067 (andygrove)
MINOR: Add CeresDB to list of products using DataFusion #3060 (andygrove)
Minor: improve some docstrings about pruning #3041 (alamb)
doc: add a new video link about datafusion #3025 (xudong963)
Update README.md to add CnosDB into the Known Uses #2933 (cnoshb)

Performance improvements:

Use code points instead of grapheme clusters for string functions #3054 (Dandandan)

Closed issues:

Rename do_data_time_math() to do_date_time_math() #3172
Automatic version updates for github actions with dependabot #3106
[EPIC] Proposal for Date/Time enhancement #3100
Upgrade prost/tonic everywhere #3028
[Question] interested in helping with documentation #2866
Introducing a new optimizer framework for datafusion. #2633
Enable discussion tab? #2350
Add support for AVG(Timestamp) types #200
TPC-H Query 22 #175
TPC-H Query 21 #172
TPC-H Query 20 #171
TPC-H Query 17 #168
TPC-H Query 11 #163
TPC-H Query 4 #160
TPC-H Query 2 #159
[Datafusion] Optimize literal expression evaluation #106

Merged pull requests:

Rename do_data_time_math() to do_date_time_math() #3173 (JasonLi-cn)
[Minor] Remove some redundant code #3169 (alamb)
Support INTEGER again in addition to INT in CREATE TABLE and CAST statements #3167 [sql] (alamb)
Fix regression in SQL parser related to resolution of aliased expressions #3165 [sql] (andygrove)
update cargo lock #3164 (waitingkuo)
add test case for cast_timestamp_before_1970 #3163 (waitingkuo)
Return proper error message for ill formed variable reference #3162 (alamb)
Remove outdated license text left over from arrow repo #3154 (alamb)
Expose RowAccumulator in physical_plan #3151 (iajoiner)
Rename DateIntervalExpr to DateTimeIntervalExpr #3150 (alamb)
Bump actions/labeler from 4.0.0 to 4.0.1 #3144 (dependabot[bot])
User Guide: Add documentation for subquery syntax #3132 (andygrove)
MINOR: User Guide: Move Data Types and Information Schema to their own pages #3131 (andygrove)
Minor: Clean up array test #3121 (alamb)
add arrow_typeof #3120 (waitingkuo)
Bump actions/labeler from 2.2.0 to 4.0.0 #3114 (dependabot[bot])
Bump actions/checkout from 2 to 3 #3113 (dependabot[bot])
Bump actions/setup-node from 2 to 3 #3112 (dependabot[bot])
Bump actions/setup-python from 3 to 4 #3111 (dependabot[bot])
Feature/support timestamp plus minus interval #3110 (JasonLi-cn)
docs: fix typo #3109 (dzvon)
Remove offset if its zero #3102 (turbo1912)
Hash binary values #3098 [sql] (Dandandan)
Update to object_store 0.4 #3089 (tustvold)
Add cast function for creating cast expression #3084 (turbo1912)
Upgrade to arrow 20.0.0 (but no change to object_store), including prost, and tonic #3083 [sql] (avantgardnerio)
impl Debug for ColumnarValue, add some docs #3076 (alamb)
[Minor] run cargo update in datafusion-cli directory #3075 (alamb)
update cargo.lock in datafusion-cli #3074 (waitingkuo)
Update sql parser to v0.20.0 #3072 [sql] (waitingkuo)
Add opening, scanning, processing metrics in file stream #3070 (Ted-Jiang)
Simplify approx_median implementation, expose via DataFrame API #3064 [sql] (andygrove)
docs: fix PruningStatistics example and some typos #3062 (roeap)
feat: support double quoted literal strings for dialects (such as mysql, bigquery, spark) #3056 [sql] (Rachelint)
Allow Overriding AsyncFileReader used by ParquetExec #3051 (Cheappie)
to_timestamp i32 coerced to i64 #3047 (waitingkuo)
Fix IsNull pruning expression generation without null_count statistics #3044 (alamb)
feat: Support week, decade, century for Interval literal #3038 [sql] (ovr)
feat: Support Binary bitwise shift operators (<< and >>) #3037 [sql] (ovr)
Use concat_elements_utf8 from arrow rather than custom kernel #3036 (alamb)
minor: update minimal rust version to 1.62, matching arrow-rs #3035 [sql] (kmitchener)
feat: Add date_bin built-in function #3034 (stuartcarnie)
Split binary_expr.rs into smaller modules #3026 (alamb)
feat: Enable typed strings expressions for VALUES clause #3018 [sql] (stuartcarnie)
fix typo for PR3003 #3011 (waitingkuo)
feat: Add support for TIME literal values #3010 [sql] (stuartcarnie)
add TimeUnit::Second as signature for ToTimestampSeconds #3004 (waitingkuo)
Rename FileReader to FileOpener (#2990) #2991 (tustvold)
minor: collate the prune tests #2986 (liukun4515)
Optionally skip metadata from schema when merging parquet files #2985 (alamb)
[Minor] Extract interval parsing logic, add unit tests #2984 [sql] (alamb)
Update sqlparser to 0.19 #2981 [sql] (alamb)
test: add file/SQL level test for pruning parquet row group with decimal data type. #2977 (liukun4515)
Derive Hash for JoinType #2972 (liurenjie1024)
Example that shows how to convert query result into rust struct #2959 #2969 (thomas-k-cameron)
Add baseline_metrics for FileStream to record metrics like elapsed ti… #2965 (Ted-Jiang)
test: add test for decimal and pruning for decimal column #2960 (liukun4515)
Simplify expressions with NOT clause #2958 (AssHero)
chore: update jit-related dependencies #2956 (xudong963)
Update to arrow 19.0.0 #2955 [sql] (alamb)
Remove CI Caching to preserve diskspace #2948 (alamb)
Add metadata_size_hint for optimistic fetching of parquet metadata #2946 (thinkharderdev)
Minor: Remove left over debugging statement #2944 (alamb)
add Atan2 #2942 (waitingkuo)
Use Arc<ObjectStoreRegistry> and remove ObjectStoreRegistry::clone #2941 (tustvold)
add extension system to SessionConfig #2940 (crepererum)
Update prost-build requirement from 0.7 to 0.10 #2937 (dependabot[bot])
Add streaming JSON and CSV reading, NewlineDelimitedStream (#2935) #2936 (tustvold)
feat(catalog): Implement information_schema.views #2934 [sql] (BaymaxHWY)
Support window functions in expressions by re-write projection after building window plan #2932 [sql] (AssHero)
Add pow as synonym for power #2927 (andygrove)
Add from_unixtime function #2924 (waitingkuo)
fix(aggregate): support mean as synonym avg #2923 (BaymaxHWY)
Add DataFrame::with_column_renamed #2920 (andygrove)
Run clippy with optional features #2918 (tustvold)
Fix release verification script by not overriding ARROW_TEST_DATA or PARQUET_TEST_DATA #2917 (alamb)
Move ScalarValue tests alongside implementation, move from_slice to datafusion_core #2914 (alamb)
Optimizer should have option to skip failing rules #2909 (andygrove)
Introduce ObjectStoreProvider to create an object store based on the url #2906 (yahoNanJing)
Remove datafusion-data-access crate #2904 (yahoNanJing)
Combine all comparison coercion rules #2901 (andygrove)
Add Projection::try_new and Projection::try_new_with_schema #2900 (andygrove)
Improve formatting of logical plans containing subqueries #2899 [sql] (andygrove)
Add session option 'datafusion.explain.logical_plan'. When set to true, the explain statement will only print logical plans. #2895 (AssHero)
Preserve field name in ScalarValue::List #2893 [sql] (comphead)
Adds optional serde support to datafusion-proto #2892 (tustvold)
Implement ScalarValue::Dictionary and preserve type through conversion back/forth to Array #2891 (alamb)
Add an ID generator in preparation for PR 2885 #2887 (avantgardnerio)
Add support for correlated subqueries & fix all related TPC-H benchmark issues #2885 (avantgardnerio)
fix(doc): update test directory link in CONTRIBUTING.md #2882 (BaymaxHWY)
Add h2o bench groupby queries #2881 (andygrove)
Add support for month & year intervals #2797 (avantgardnerio)
Migrate from avro_rs (0.13) to apache_avro (0.14) #2784 (martin-g)

10.0.0-rc1 (2022-07-12)

Full Changelog

10.0.0 (2022-07-12)

Full Changelog

Breaking changes:

Convert batch_size to config option #2771 (andygrove)
MINOR: Remove Offset struct #2734 (andygrove)
feat: async extension planner #2713 (waynexia)
Switch to object_store crate (#2489) #2677 (tustvold)

Implemented enhancements:

update documentation, fix styling to match main Arrow project #2864
Update top-level README #2850
[Question] How to call an async function in ExecutionPlan::exec method? #2847
Add DataFrame::with_column #2844
Improve ergonomics of physical expr lit #2827
Add Python examples for reading CSV and query by SQL in Doc #2824
eliminate multi limit-offset nodes to EmptyRelation if possible #2822
Make LogicalPlan::Union be consistent with other plans #2816
Use coerced data type from value and list expressions during planning inlist expression #2793
Add configuration option to enable/disable CoalesceBatchesExec #2790
Simplify FilterNullJoinKeys rule #2780
Allow configuration settings to be specified with environment variables #2776
Automatically update configs.md in user guide #2770
Support multiple paths for ListingTableScanNode #2768
Reduce outer joins #2757
support data type coercion and decimal in INLIST expr #2755
Change ExtensionPlanner::plan_extension() to an async function #2749
Add IsNotNull filter to join inputs if one side of join condition does not allow null #2739
Sort preserving MergeJoin #2698
Improve readability of table scan projections in query plans #2697
DataFusion 9.0.0 Release #2676
Improve UX for UNION vs UNION ALL (introduce a LogicalPlan::Distinct) #2573 [sql]
Implement some way to show the sql used to create a view #2529
Consider adopting IOx ObjectStore abstraction #2489
Support sum0 as a built-in agg function #2067
implement grouping sets, cubes, and rollups #1327
Ruby bindings #1114
Support dates in hash join #2746 (andygrove)

Fixed bugs:

Docker Error #2851
Anti join ignores join filters #2842
Can't test or compile sub-model code after upgrade to arrow-rs 17.0.0 #2835
The set expr in the InList is not evaluated for the optimization #2820
CASE When: result type should be coercible to a common type #2818
IN/NOT IN List: NULL is not equal to NULL #2817
panic when case statement returns null #2798
InList: Can't cast the list expr data type to value expr data type directly #2774
InList Expr: expr and list values must be convertible to the same data type #2759
tpchgen docker syntax change prevents volume from binding #2751
Cannot join on date columns (Unsupported data type in hasher: Date32) #2744
rewrite_expression does not properly handle Exists and ScalarSubquery #2736
LocalFileSystem is not sorted by file name; as a result, the data lines queried from multiple files are out of order. #2730
Filter push down needs to consider alias columns #2725
Recent API change in GlobalLimitExec breaks compatibility with Ballista #2720
Common Subexpression Elimination pass errors if run twice on some plans: Schema contains duplicate unqualified field name 'IsNull-Column-sys.host' #2712
The data type is not compatible with other systems, for example Spark or PostgreSQL #1379

Documentation updates:

Fix docs styling #2865 (kmitchener)
Various updates to top-level README #2854 (andygrove)
MINOR: Add documentation for running integration tests #2839 (andygrove)
add csv registration and sql query to examples #2825 (waitingkuo)
[minor] refine doc #2753 (Ted-Jiang)

Closed issues:

Consider adding a prominent note in the readme about ballista #2853
support decimal in (NULL) #2800
InList: Don't treat Null as UTF8(None) #2782
InList: don't need to treat Null as UTF8 data type #2773
Implement extensible configuration mechanism #138

Merged pull requests:

Update CONTRIBUTING.md #2876 (waitingkuo)
Make LogicalPlan::Union be consistent with other plans #2868 (comphead)
minor: remove unneeded files from project root #2863 (kmitchener)
chore: make cargo clippy happy in nightly #2860 [sql] (xudong963)
Update to arrow 18.0.0 #2856 [sql] (alamb)
chore: remove ballista-related docker-compose file #2852 (xudong963)
Adding dataframe with_column function #2849 (comphead)
anti joins now respect join filters #2843 (andygrove)
MINOR: make name meaningful and clean up code #2841 (liukun4515)
Make lit implementation more concise #2838 (alamb)
InList: set/list value must be evaluated to get the values #2834 (liukun4515)
Add SHOW CREATE TABLE with initial support for views #2830 [sql] (mrob95)
Improve ergonomics of physical expr lit #2828 (alamb)
Eliminate multi limit-offset nodes to emptyRelation #2823 (AssHero)
Fix the ci #2821 (liukun4515)
CaseWhen: coerce all the then and else data types to a common data type #2819 (liukun4515)
Fix ScalarValue::isNull calculation #2815 (alamb)
Fix nullability calculation for CASE expressions #2814 (alamb)
Bump numpy from 1.21.3 to 1.22.0 in /integration-tests #2811 (xudong963)
Fix data type calculation for CaseExprs with NULLs #2810 (AssHero)
InList: fix bug for comparing with Null in the list using the set optimization #2809 (liukun4515)
Use specialized dictionary kernels (#1178) #2808 (tustvold)
fix schema nullability for information_schema schema #2804 (alamb)
fix: correctly calculate join output schema nullability #2803 (alamb)
Correct schema nullability declaration in tests #2802 (alamb)
Don't treat Null as UTF8(None) and change error info. #2801 (liukun4515)
MINOR: Remove reference to docker image that is no longer available #2795 (andygrove)
Use coerced type in inlist expr planning #2794 (viirya)
Add LogicalPlan::Distinct #2792 [sql] (mrob95)
Add config option for coalesce_batches physical optimization rule, make optional #2791 (andygrove)
Improve readability of table scan projections in query plans (remove Some and None) #2789 [sql] (comphead)
Simplify FilterNullJoinKeys rule #2781 (andygrove)
MINOR: re-export sqlparser from datafusion-sql crate #2779 [sql] (andygrove)
Update to arrow 17.0.0 #2778 [sql] (alamb)
Support multiple paths for ListingTableScanNode #2775 (Ted-Jiang)
Remove expr_sub_expressions and rewrite_expression functions #2772 (mrob95)
minor: update cranelift related dependencies #2769 (xudong963)
minor: panic rather than fail silently on bad dictionary in hash join #2767 (alamb)
MINOR: make prettier use consistent between CI and contributing guide #2766 (andygrove)
Rewrite subexpressions of InSubquery in rewrite_expression #2765 (mrob95)
Support DataType::Decimal for IN and NOT IN expressions #2764 (liukun4515)
Implement extensible configuration mechanism #2754 (andygrove)
Remove redundant docker argument #2752 (avantgardnerio)
Add optimizer pass to reduce left/right/full joins to inner join if possible #2750 [sql] (AssHero)
MINOR: Remove legacy CLI context enum #2748 (andygrove)
CSE unit test for duplicate fields #2747 (waynexia)
MINOR: Improve unsupported data type error message #2745 (andygrove)
Add optimizer rule to filter out null keys before a join #2740 (andygrove)
Sort file names in a directory #2730 #2735 (yourenawo)
fix: filter push down with InList expressions #2729 (Ted-Jiang)
[Minor] add debug info in optimizer.rs #2726 (Ted-Jiang)
Add public API for GlobalLimitExec and LocalLimitExec #2722 (andygrove)
Add support for additional data types in hash join #2721 (AssHero)
Upgrade to arrow 16.0.0 #2718 [sql] (alamb)
Fix clippy warnings with toolchain 1.63 #2717 [sql] (waynexia)
Support for GROUPING SETS/CUBE/ROLLUP #2716 (thinkharderdev)
fix: check redundant fields while building projection plan #2715 (waynexia)
Sort preserving SortMergeJoin #2699 (korowa)
fix: union schema fix #2688 [sql] (gandronchik)
Support default precision and scale for CAST <EXPR> AS DECIMAL #2680 [sql] (gandronchik)

9.0.0 (2022-06-10)

Full Changelog

Breaking changes:

MINOR: Move simplify_expression rule to datafusion-optimizer crate #2686 (andygrove)
Move physical expression planning to datafusion-physical-expr crate #2682 (andygrove)
Create new datafusion-optimizer crate for logical optimizer rules #2675 (andygrove)
Remove ExecutionProps dependency from OptimizerRule #2666 (andygrove)
Remove ObjectStoreSchemaProvider (#2656) #2665 (tustvold)
Move LogicalPlanBuilder to datafusion-expr crate #2576 (andygrove)
LogicalPlanBuilder now uses TableSource instead of TableProvider #2569 (andygrove)
Remove scan_empty method from LogicalPlanBuilder #2568 (andygrove)
MINOR: Move expression utils from sql module to expr crate #2553 (andygrove)
Remove scan_json methods from LogicalPlanBuilder #2541 (andygrove)
Remove scan_avro methods from LogicalPlanBuilder #2540 (andygrove)
Remove scan_parquet methods from LogicalPlanBuilder #2539 (andygrove)
MINOR: Move ExprVisitable and exprlist_to_columns to datafusion-expr crate #2538 (andygrove)
Remove scan_csv methods from LogicalPlanBuilder #2537 (andygrove)
Fix Redundant ScalarValue Boxed Collection #2523 (comphead)
Support for OFFSET in LogicalPlan #2521 (jdye64)

Implemented enhancements:

[EPIC] JIT support for DataFusion #2703
Show column names instead of column indices in query plans #2689
Proposal: remove automated ballista CI checks from DataFusion #2679
Pass SessionState to TableProvider #2658
Is ObjectStoreSchemaProvider Still Needed? #2656
Add logical plan support to datafusion-proto #2630
Like, NotLike expressions work with literal NULL #2626
Move JOIN ON predicates push down logic from planner to optimizer #2619
Remove ExecutionProps from OptimizerRule trait #2614
Add, Minus, Multiply, Divide, Modulo operators work with literal NULL #2609
Support DESCRIBE <table> to show table schemas #2606
Support CREATE OR REPLACE TABLE #2605
filter_push_down tests should not rely on TableProvider and ExecutionPlan #2600
Move logical optimizer rules out of the core datafusion crate #2599
Push Limit through outer Join #2579
datafusion_proto crate should have exhaustive match statements for handling Expr #2565
String representation of Expr variant #2563
File URI Scheme Interpretation #2562
Implement physical plan for OFFSET #2551
Update limit pushdown rule to support offsets #2550
Move LogicalPlanBuilder to datafusion-expr crate #2536
Logical optimizer rule "simplify expressions" should not depend on the core datafusion crate #2535
Support optional filter in Join #2509
Improve SQL planner & logical plan support for JOIN conditions #2496
Numeric, String, Boolean comparisons with literal NULL #2482
Redundant ScalarValue Boxed Collection #2449
ObjectStore Directory Semantics #2445
Add support for OFFSET in SQL query planner + logical plan #2377
SQL planner should use TableSource not TableProvider #2346
Move SQL query planning to new crate #2345
Update LogicalPlan rustdoc code to not use LogicalPlanBuilder #2308
[Optimizer] Refactor convert join #2256
[Optimizer] Infer is not null predicate from where clause #2254
Support ArrayIndex for ScalarValue(List) #2207
[Ballista] Fill functional gaps between datafusion and ballista #2062
[Ballista] support datafusion built_in UDAF work in ballista cluster #1985
Export C API #1113

Fixed bugs:

Fix Typos in Docs #2695
Unable to build a docker image #2691
Optimization pass AggregateStatistics changes type of output from Int64 to UInt64 #2673
ViewTable Circular Reference #2657
ScalarValue::to_array_of_size panics computing statistics for nested parquet file #2653
The result type of count/count_distinct #2635
limit_push_down is not working properly with OFFSET #2624
Avro Tests Fail To Compile #2570
Unused window function expression is wrongly removed from LogicalPlan during optimization #2542
Bug: ObjectStoreRegistry get_by_uri does not return correct path when "scheme" is provided #2525
There are duplicate and inconsistent copies of datafusion.proto #2514
Projection pushdown produces incorrect results when column names are reused #2462
Incorrect Parquet Projection For Nested Types #2453
LogicalPlanBuilder::scan_csv creates scans with invalid table names #2278
Inner join incorrectly pushdown predicate with OR operation #2271
Ignored alias for columns with aggregate function and incorrect results when collecting statistics is enabled #2176
Join on path partitioned columns fails with error #2145

Documentation updates:

Fix Ballista link #2654 (dsaxton)
MINOR: Add Blaze as a project using DataFusion #2618 (yjshen)
[MINOR] remove datafusion-cli's ballista feature from docs #2612 (Ted-Jiang)
chore(doc) remove ballista from datafusion-cli readme #2604 (ming535)

Closed issues:

[Question] Converting TableSource to custom TableProvider #2644
[Question] Why is DataFusion shipped with arrow version 9.1.0 on crates.io? #2474

Merged pull requests:

Test optional features in CI #2708 (tustvold)
support indexed fields proto #2707 (nl5887)
Update sqlparser-rs to 0.18.0 #2705 (alamb)
[MINOR]: Add documentation to datafusion-row modules #2704 (alamb)
Make sure that the data types are supported in hashjoin before genera… #2702 (AssHero)
Move remaining code out of legacy core/logical_plan module #2701 (andygrove)
Move some tests from core to expr #2700 (andygrove)
MINOR: Improve Docs Readability #2696 (ryanrussell)
Combine limit and offset to fetch and skip and implement physical plan support #2694 (ming535)
MINOR: Add datafusion-sql example #2693 (andygrove)
Remove Ballista related lines from Dockerfile #2692 (mocknen)
Show column names instead of indices in query plans #2690 (andygrove)
MINOR: Remove uses of TryClone for Parquet #2681 (tustvold)
Fix AggregateStatistics optimization so it doesn't change output type #2674 (alamb)
If the Max/Min statistics of a column do not exist in the parquet file, set Min/Max to None #2671 (AssHero)
MINOR: Move more expression code to datafusion-expr crate #2669 (andygrove)
MINOR: Rewrite imports in optimizer module #2667 (andygrove)
Update snmalloc-rs requirement from 0.2 to 0.3 #2663 (dependabot[bot])
Add module doc for RuntimeEnv, SessionContext, TaskContext, etc... #2655 (tustvold)
Prune unused dependencies from datafusion-proto #2651 (tustvold)
MINOR: Implement serde for join filter #2649 (andygrove)
pushdown support for predicates in ON clause of joins #2647 (korowa)
Move SortKeyCursor and RowIndex into modules, add sort_key_cursor test #2645 (alamb)
Implement DESCRIBE <table> #2642 (LiuYuHui)
Implement LogicalPlan serde in datafusion-proto #2639 (andygrove)
Fix limit + offset pushdown #2638 (ming535)
change result type of count/count_distinct from uint64 to int64 #2636 (liukun4515)
If no columns in window expr are needed, remove the window exprs #2634 (AssHero)
Like, NotLike expressions work with literal NULL #2627 (WinkerDu)
MINOR: Refactor datafusion-proto dependencies and imports #2623 (andygrove)
MINOR: add optimizer struct #2616 (jackwener)
Remove FilterPushDown dependency on physical plan #2615 (andygrove)
Support CREATE OR REPLACE TABLE #2613 (AssHero)
Support binary mathematical operators with NULL literals #2610 (WinkerDu)
chore: try fix CI coverage #2608 (Ted-Jiang)
MINOR: Rename benchmark crate #2607 (andygrove)
chore(dep): bump cranelift to 0.84.0 #2598 (waynexia)
fix some typos #2597 (ming535)
Support limit pushdown through left right outer join #2596 (Ted-Jiang)
Unignore rustdoc code examples in datafusion-expr crate #2590 (andygrove)
Evaluate JIT'd expression over arrays #2587 (waynexia)
[minor] Fix CI clippy for unused import #2586 (Ted-Jiang)
[Doc] Add doc noting that enabling SIMD needs cargo nightly #2577 (Ted-Jiang)
Add DataFrame union_distinct and fix documentation for distinct #2574 (andygrove)
Fix avro tests (#2570) #2571 (tustvold)
Make datafusion-proto match exhaustive #2567 (andygrove)
Support limit push down for offset_plan #2566 (Ted-Jiang)
Introduce Expr.variant_name() function #2564 (jdye64)
Fix some 404 links in the contribution guide #2561 (hi-rustin)
Update datafusion-cli readme cli version #2559 (hi-rustin)
MINOR: Move expr_rewriter.rs to datafusion-expr crate #2552 (andygrove)
Fix JOINs with complex predicates in ON (split ON expressions only by AND operator) #2534 (korowa)
Reduce duplication in file scan tests #2533 (tustvold)
Fix size_of_scalar test #2531 (alamb)
Update to arrow-rs 14.0.0 #2528 (alamb)
ObjectStoreRegistry get_by_uri now returns correct path when "scheme" is provided #2526 (timvw)
MINOR: Add ORDER BY clause to test #2524 (andygrove)
Remove unused binary_array_op_scalar! in binary.rs #2512 (alamb)
fix NULL <op> column evaluation, tests for same #2510 (alamb)
Fix projection pushdown produces incorrect results when column names are reused #2463 (jonmmease)
Benchmark for sort preserving merge #2431 (alamb)
Support GetIndexedFieldExpr for ScalarValue #2196 (ovr)

8.0.0 (2022-05-12)

Full Changelog

Breaking changes:

Add SQL planner support for ROLLUP and CUBE grouping set expressions #2446 (andygrove)
Make ExecutionPlan::execute Sync #2434 (tustvold)
Introduce new DataFusionError::SchemaError type #2371 (andygrove)
Add Expr::InSubquery and Expr::ScalarSubquery #2342 (andygrove)
Add Expr::Exists to represent EXISTS subquery expression #2339 (andygrove)
Move LogicalPlan enum to datafusion-expr crate #2294 (andygrove)
Remove dependency from LogicalPlan::TableScan to ExecutionPlan #2284 (andygrove)
Move logical expression type-coercion code from physical-expr crate to expr crate #2257 (andygrove)
feat: 2061 create external table ddl table partition cols #2099 [sql] (jychen7)
Reorganize the project folders #2081 (yahoNanJing)
Support more ScalarFunction in Ballista #2008 (Ted-Jiang)
Merge dataframe and dataframe imp #1998 (vchag)
Rename ExecutionContext to SessionContext, ExecutionContextState to SessionState, add TaskContext to support multi-tenancy configurations - Part 1 #1987 (mingmwang)
Add Coalesce function #1969 (msathis)
Add Create Schema functionality in SQL #1959 [sql] (matthewmturner)
omit some clone when converting sql to logical plan #1945 [sql] (doki23)
[split/16] move physical plan expressions folder to datafusion-physical-expr crate #1889 (Jimexist)
remove sync constraint of SendableRecordBatchStream #1884 (doki23)
[split/15] move built in window expr and partition evaluator #1865 (Jimexist)

Implemented enhancements:

Include Expr to datafusion::prelude #2347
Implement Serialization API for DataFusion #2340
Implement power function #1493
allow lit python function to support boolean and other types #1136
Automate dependency updates #37
Add CREATE VIEW #2279 (matthewmturner)
[Ballista] Support Union in ballista. #2098 (Ted-Jiang)
Change the DataFusion explain plans to make it clearer in the predicate/filter #2063 (Ted-Jiang)
Add write_json, read_json, register_json, and JsonFormat to CREATE EXTERNAL TABLE functionality #2023 (matthewmturner)
Qualified wildcard #2012 [sql] (doki23)
support bitwise or/'|' operation #1876 [sql] (liukun4515)
Introduce JIT code generation #1849 (yjshen)

Fixed bugs:

CASE expr with NULL literals panics 'WHEN expression did not return a BooleanArray' #1189
Function calls with NULL literals do not work #1188
Add SQL planner support for calling round function with two arguments #2503 (andygrove)
nested query fix #2402 (comphead)
fix issue#2058 file_format/json.rs attempt to subtract with overflow #2066 (silence-coding)
fix bug the optimizer rule filter push down #2039 (jackwener)
fix: replace ExecutionContext and ExecutionConfig with SessionContext and SessionConfig #2030 (xudong963)
Fixed parquet path partitioning when only selecting partitioned columns #2000 (pjmore)
Fix ambiguous reference error in filter plan #1925 (jonmmease)
platform aware partition parsing #1867 (korowa)
Fix incorrect aggregation in case that GROUP BY contains duplicate column names #1855 (alex-natzka)

Documentation updates:

MINOR: Make crate READMEs consistent #2437 (andygrove)
minor: Improve documentation for DFSchema join and merge functions #2367 (andygrove)
Change the code location and add annotation #2037 [sql] (jackwener)
Fix typos (Datafusion -> DataFusion) #1993 (andygrove)
Add examples to use MemTable and TableProvider (#1864) #1946 (PierreZ)
Add doc for building datafusion-cli when connect the ballista #1866 (liukun4515)
Add benchmarks section to DEVELOPERS.md #1838 (tustvold)

Performance improvements:

Avoid an Arc::clone per row in benchmark #1975 (jhorstmann)
Update datafusion-cli allocator #1878 (matthewmturner)

Closed issues:

Make expected result string in unit tests more readable #2412
remove duplicated fn aggregate() in aggregate expression tests #2399
split distinct_expression.rs into count_distinct.rs and array_agg_distinct.rs #2385
move sql tests in context.rs to corresponding test files in datafusion/core/tests/sql #2328
Date32/Date64 as join keys for merge join #2314
Error precision and scale for decimal coercion in logic comparison #2232
Support Multiple row layout #2188
TPC-H Query 18 #169
TPC-H Query 16 #167
Implement Sort-Merge Join #141
Split logical expressions out into separate source files #114

Merged pull requests:

Minor: remove code that is now included in arrow-rs #2511 (alamb)
MINOR: Enable multi-statement benchmark queries #2507 (andygrove)
MINOR: Add ignored tests for all remaining benchmark queries #2506 (andygrove)
Update to sqlparser 0.17.0 #2500 (alamb)
Add metrics for ParquetExec #2499 (Ted-Jiang)
Limit cpu cores used when generating changelog #2494 (andygrove)
Optimize MergeJoin by storing joined indices instead of creating small record batches for each match #2492 (richox)
Add SQL planner support for grouping() aggregate expressions #2486 (andygrove)
MINOR: Parameterize changelog script #2484 (jychen7)
Numeric, String, Boolean comparisons with literal NULL #2481 (WinkerDu)
Adds unit test cases of mathematical expressions working with null literal #2478 (WinkerDu)
Minor: Move test code from context.rs into sql_integration #2473 (alamb)
Minor: Use ExprVisitor to find columns referenced by expr #2471 (alamb)
minor: remove expr dependency from the row crate, update crate-deps.dot/svg #2470 (yjshen)
Fix read_from_registered_table_with_glob_path fails if path contains // #2465 #2468 (timvw)
Add support for list_dir() on local fs #2467 (wjones127)
MINOR: Partial fix for SQL aggregate queries with aliases #2464 (andygrove)
minor: move struct definition out of aggregate/mod.rs, etc #2458 (WinkerDu)
Fix bugs in SQL planner with GROUP BY scalar function and alias #2457 (andygrove)
feat: Support CompoundIdentifier as GetIndexedField access #2454 (ovr)
Table provider error propagation #2438 (jdye64)
MINOR: Improve error messages for GROUP BY / HAVING queries #2435 (andygrove)
minor: remove redundant code #2432 (jackwener)
minor: update versions and paths in changelog scripts #2429 (andygrove)
Fix Ballista executing during plan #2428 (tustvold)
minor: format table result vec & remove some unnecessary semicolons #2425 (WinkerDu)
Basic support for IN and NOT IN Subqueries by rewriting them to SEMI / ANTI Join #2421 (korowa)
Allow subqueries without aliases #2418 (andygrove)
Fix bug in subquery join filters referencing outer query #2416 (andygrove)
MINOR: remove duplicated function format_state_name() #2414 (WinkerDu)
Make expected result string in unit tests more readable #2413 (WinkerDu)
sum(distinct) support #2405 (WinkerDu)
Update ordered-float requirement from 2.10 to 3.0 #2403 (dependabot[bot])
remove duplicated fn aggregate() in aggregate expression tests #2400 (WinkerDu)
Support type-coercion from Decimal to Float64 #2396 (comphead)
minor: SchemaError code cleanup and improvements #2391 (andygrove)
Support struct_expr generate struct in sql #2389 (Ted-Jiang)
Re-organize and rename aggregates physical plan #2388 (yjshen)
refactor distinct_expressions.rs and split into count_distinct.rs and array_agg_distinct.rs #2386 (WinkerDu)
Allow CTEs to be referenced from subquery expressions #2384 (andygrove)
Upgrade to arrow 13 #2382 (alamb)
Grouped Aggregate in row format #2375 (yjshen)
Fix bugs with CTE aliasing and normalize all identifiers in the SQL planner #2373 (andygrove)
Stop optimizing queries twice #2369 (andygrove)
feat: Support casting to arrays to primitive type #2366 (ovr)
Add proper support for null literal by introducing ScalarValue::Null #2364 (WinkerDu)
minor: fix duplicate column bug in subquery support #2362 (andygrove)
Normalize subquery aliases #2359 (andygrove)
Implement physical planner support for DATE +/- INTERVAL #2357 (andygrove)
Add SQL query planner support for Scalar Subqueries #2354 (andygrove)
Add SQL query planner support for IN subqueries #2352 (andygrove)
Add Expr to prelude #2348 (alamb)
Add SQL planner support for EXISTS subqueries #2344 (andygrove)
Add public Serialization/Deserialization API for Expr to/from bytes #2341 (alamb)
Support for date32 and date64 in sort merge join #2336 (hntd187)
[physical-expr] move aggregate exprs and window exprs to their own modules #2335 (yjshen)
fix: union schema #2334 (gandronchik)
Improve sql integration test organization #2333 (alamb)
Support scalar values for func Array #2332 (Ted-Jiang)
move sql tests from context.rs to corresponding test files in tests/sql #2329 (WinkerDu)
deprecate index_of and make index_of_column_by_name public #2320 (jdye64)
Fix HashJoin evaluating during plan #2317 (tustvold)
minor: remove two source files that only had re-exports #2313 (andygrove)
Don't sort batches during plan #2312 (tustvold)
Move case/when expressions to datafusion-expr crate #2311 (andygrove)
Fix CrossJoinExec evaluating during plan #2310 (tustvold)
Make SortPreservingMerge Usable Outside Tokio (#2201) #2305 (tustvold)
chore: update cranelift to 0.83.0 #2304 (yjshen)
Always increment timer on record #2298 (tustvold)
Remove unnecessary env var for parquet_sql example #2297 (sergey-melnychuk)
Simplify sort streams #2296 (tustvold)
MINOR: beautify code with neat idents #2295 (WinkerDu)
Move FileType enum from sql module to logical_plan module #2290 (andygrove)
Remove Parquet Empty Projection Workaround #2289 (tustvold)
Add BatchPartitioner (#2285) #2287 (tustvold)
Make row its own crate to make it accessible from physical-expr #2283 (yjshen)
Enable filter pushdown when using In_list on parquet #2282 (Ted-Jiang)
Update uuid requirement from 0.8 to 1.0 #2280 (dependabot[bot])
Add bytes scanned metric to ParquetExec #2273 (thinkharderdev)
Fix outer join output with all-null indices on empty batch #2272 (yjshen)
Re-export DataFusion crates #2264 (andygrove)
rewrite approx_median to approx_percentile_cont while planning phase #2262 (korowa)
Introduce RowLayout to represent rows for different purposes #2261 (yjshen)
fix string coercion missing in Eq/NotEq operator #2258 (WinkerDu)
Update to Arrow 12.0.0, update tonic and prost #2253 (alamb)
minor: move field_util from physical-expr crate to expr crate #2250 (andygrove)
Move identifier case tests to sql_integ, add negative cases, Debug for DataFrame #2243 (alamb)
Implement sort-merge join #2242 (richox)
fix: find the right wider decimal datatype for comparison operation #2241 (liukun4515)
Fix join without constraints #2240 (Dandandan)
Add type coercion rule for date + interval #2235 (andygrove)
support array with scalar arithmetic operation for decimal data type #2233 (liukun4515)
chore: add debug! log in some execution operators #2231 (NGA-TRAN)
Introduce new optional scheduler, using Morsel-driven Parallelism + rayon (#2199) #2226 (tustvold)
minor: add editor config file #2224 (jackwener)
minor: Refactor to avoid repeated code in replace_qualifier #2222 (andygrove)
update cli readme #2220 (liukun4515)
Use filter (filter_record_batch) instead of take to avoid using indices #2218 (Dandandan)
Add single line description of ExecutionPlan (#2216) #2217 (tustvold)
Remove tokio::spawn from HashAggregateExec (#2201) #2215 (tustvold)
Remove tokio::spawn from WindowAggExec (#2201) #2203 (tustvold)
Make ParquetExec usable outside of a tokio runtime (#2201) #2202 (tustvold)
add sql level test for decimal data type #2200 (liukun4515)
case when supports NULL constant #2197 (WinkerDu)
feat: Support simple Arrays with Literals #2194 (ovr)
[Ballista] Enable ApproxPercentileWithWeight in Ballista and fill UT #2192 (Ted-Jiang)
refactor: simplify prepare_select_exprs #2190 (jackwener)
Multiple row-layout support, part-1: Restructure code for clearness #2189 (yjshen)
make nightly clippy happy #2186 (xudong963)
[Ballista] Make PhysicalAggregateExprNode have repeated PhysicalExprNode #2184 (Ted-Jiang)
MINOR: handle NULL in advance to avoid value copy in string_concat #2183 (WinkerDu)
fix: Sort with a lot of repetition values #2182 (yjshen)
cli: update lockfile #2178 (happysalada)
Add LogicalPlan::SubqueryAlias #2172 (andygrove)
minor: Avoid per cell evaluation in Coalesce, use zip in CaseWhen #2171 (yjshen)
Handle merged schemas in parquet pruning #2170 (thinkharderdev)
Implement fast path of with_new_children() in ExecutionPlan #2168 (mingmwang)
enable explain for ballista #2163 (doki23)
Add delimiter for create external table #2162 (matthewmturner)
[MINOR] enable EXTRACT week and add test (after sqlparser update to 0.16) #2157 (Ted-Jiang)
Optimize the evaluation of IN for large lists using InSet #2156 (Ted-Jiang)
Update sqlparser requirement from 0.15 to 0.16 #2152 (dependabot[bot])
fix not(null) with constant null #2144 (WinkerDu)
Add IF NOT EXISTS to CREATE TABLE and CREATE EXTERNAL TABLE #2143 (matthewmturner)
implement 'StringConcat' operator to support sql like "select 'aa' || 'b' " #2142 (WinkerDu)
#2109 By default, use only 1000 rows to infer the schema #2139 (jychen7)
[CLI] Add show tables in ballista for datafusion-cli #2137 (gaojun2048)
fix: incorrect memory usage track for sort #2135 (yjshen)
Update quarterly roadmap for Q2 #2133 (matthewmturner)
Reduce SortExec memory usage by avoiding construction of a single huge batch #2132 (yjshen)
MINOR: fix concat_ws corner bug #2128 (WinkerDu)
Minor add clarifying comment in parquet #2127 (alamb)
Minor: make disk_manager public #2126 (yjshen)
JIT-compile DataFusion expression with column name #2124 (Dandandan)
minor: replace array_equals in case evaluation with eq_dyn from arrow-rs #2121 (alamb)
Serialize timezone in timestamp scalar values #2120 (thinkharderdev)
minor: fix some clippy warnings from nightly rust #2119 (alamb)
Fix case evaluation with NULLs #2118 (alamb)
issue#1967 ignore channel close #2113 (silence-coding)
cli: add cargo.lock #2112 (happysalada)
doc: update release schedule #2110 (jychen7)
fix df union all bug #2108 [sql] (WinkerDu)
Reduce repetition in Decimal binary kernels, upgrade to arrow 11.1 #2107 (alamb)
update zlib version to 1.2.12 #2106 (waitingkuo)
Create jit-expression from datafusion expression #2103 (Dandandan)
Add CREATE DATABASE command to SQL #2094 [sql] (matthewmturner)
Refactor SessionContext, BallistaContext to support multi-tenancy configurations - Part 3 #2091 (mingmwang)
minor: remove duplicate test #2089 (jackwener)
minor: remove repeated test #2085 (jackwener)
Fix lost filters and projections in ParquetExec, CSVExec etc #2077 (Ted-Jiang)
Remove dependency of common for the storage crate #2076 (yahoNanJing)
[MINOR] fix doc in `EXTRACT(field FROM source)` #2074 (Ted-Jiang)
[Bug][Datafusion] fix TaskContext session_config bug #2070 (gaojun2048)
Short-circuit evaluation for CaseWhen #2068 (yjshen)
split datafusion-object-store module #2065 (yahoNanJing)
Allow CatalogProvider::register_catalog to return an error #2052 (alamb)
Add test in register_catalog and change to use named symbolic constants #2050 (alamb)
Update to arrow/parquet 11.0 #2048 (alamb)
minor: format comments (// to // ) #2047 (jackwener)
use cargo-tomlfmt to check Cargo.toml formatting in CI #2033 (WinkerDu)
feat: #2004 approx percentile with weight #2031 (jychen7)
Refactor SessionContext, SessionState and SessionConfig to support multi-tenancy configurations - Part 2 #2029 (mingmwang)
Simplify prerequisites for running examples #2028 (doki23)
Replace usage of println! with logger macros #2020 (silence-coding)
Automatically test examples in user guide #2018 (vchag)
return VecDeque for DFParser::parse_sql #2017 [sql] (doki23)
Eliminate the scalar value filter #2002 (jackwener)
Fixing a typo in documentation #1997 (psvri)
Correct documentation of ExprVisitor #1996 (alamb)
Make it possible to only scan part of a parquet file in a partition #1990 (yjshen)
Update Dockerfile to fix integration tests #1982 (andygrove)
Remove some more unnecessary cloning in sql_expr_to_logical_expr #1981 [sql] (alamb)
Add ticket reference to clippy allow #1978 [sql] (alamb)
Implement EXTRACT expression with week, month, day, hour #1974 (Ted-Jiang)
Address typo in ExprVisitable trait documentation #1970 (jdye64)
Update sqlparser requirement from 0.14 to 0.15 #1966 (dependabot[bot])
PruningPredicate should take owned Expr #1960 (thinkharderdev)
Update to arrow 10.0.0, pyo3 0.16 #1957 (alamb)
update jit-related dependencies #1953 (xudong963)
minor code refinement: if_exists name change, wildcard field for logical plan, etc. #1951 [sql] (xudong963)
Allow different types of query variables (@@var) rather than just string #1943 [sql] (maxburke)
Pruning serialization #1941 (thinkharderdev)
Add write_parquet to DataFrame #1940 (matthewmturner)
Fix select from EmptyExec always return 0 row after optimizer passes #1938 (Ted-Jiang)
Add debug log when waiting for spilling on other consumers #1933 (viirya)
Add db benchmark script #1928 (matthewmturner)
Add write_csv to DataFrame #1922 (matthewmturner)
[MINOR] Update copyright year in Docs #1918 (alamb)
add metadata to DFSchema, close #1806. #1914 [sql] (jiacai2050)
Clippy fix on nightly #1907 (yjshen)
Updated Rust version to 1.59 in all the files #1903 (NaincyKumariKnoldus)
support extract second and minute in expr. #1901 (Ted-Jiang)
Update crate descriptions #1899 (alamb)
Remove unneeded Mutex in Ballista Client #1898 (alamb)
[split/17] move the rest of physical expr to datafusion-physical-expr crate #1892 (Jimexist)
Avoid unnecessary branching in row read/write if schema is null-free #1891 (yjshen)
Make parquet support optional for datafusion-common crate #1886 (jonmmease)
Fix clippy lints #1885 (HaoYang670)
Add support for ~/.datafusionrc and cli option for overriding it to datafusion-cli #1875 (matthewmturner)
[Minor] Clean up DecimalArray API Usage #1869 [sql] (alamb)
Changes after going through the "Datafusion as a library" section #1868 (nonontb)
Enhance MemorySchemaProvider to support register_listing_table #1863 (matthewmturner)
Increase default partition column type from Dict(UInt8) to Dict(UInt16) #1860 (Igosuki)
Update to arrow 9.1.0 #1851 (alamb)
move some tests out of context and into sql #1846 (alamb)
[split/14] create datafusion-physical-expr module #1843 (Jimexist)
Return Error when parquet reader fails rather than no data with println! #1837 (alamb)
determine build side in hash join by total_byte_size instead of num_rows #1831 (xudong963)
Make ballista support an optional feature to datafusion-cli #1816 (alamb)
Update documentation example for change in API #1812 (alamb)
rename references of expr in physical plan module after datafusion-expr split #1798 (Jimexist)
DataFusion + Conbench Integration #1791 (dianaclarke)
The returned path value of get_by_uri should be self-described with entire path #1779 (yahoNanJing)
Use eq_dyn, neq_dyn, lt_dyn, lt_eq_dyn, gt_dyn, gt_eq_dyn kernels from arrow #1475 (alamb)

7.1.0 (2022-04-10)

Full Changelog

Fixed bugs:

By default, use only 1000 rows to infer the schema #2159

7.0.0 (2022-02-14)

Full Changelog

Breaking changes:

Consolidate various configurations options, remove unrelated batch_size #1565
Extract logical plans in LogicalPlan as independent struct #1228
Update ExecutionPlan to know about sortedness and repartitioning optimizer pass respect the invariants #1776 (alamb)
Update to arrow 8.0.0 #1673 (alamb)
Remove non idiomatic DataFusionError::into_arrow_external_error in favor of From conversion #1645 (alamb)
Remove Accumulator::update and Accumulator::merge #1582 (Jimexist)
implement Hash for various types and replace PartialOrd #1580 (Jimexist)
Replace DatafusionError with GenericError in ObjectStore interface #1541 (matthewmturner)
Make FLOAT SQL type map to Float32 rather than Float64 #1423 [sql] (liukun4515)
Map REAL SQL type to Float32 rather than Float64 to be consistent with pg #1390 [sql] (hntd187)

Implemented enhancements:

Create new datafusion_expr crate #1753
Create new datafusion_common crate #1752
API to get Expr's type and nullability without a DFSchema #1725
Cleaner API to create Expr::ScalarFunction programmatically #1718
Introduce a Vec<u8> based row-wise representation for DataFusion #1708
Simplify creating new ListingTable #1705
Implement TableProvider for DataFrameImpl to allow registration of logical plans #1698
Public Expr simplification API #1694
Query Optimizer: Add OUTER --> INNER join conversion #1670
Support reading from CSV, Avro and Json files that have mergeable/compatible, but not identical schemas #1669
Remove DataFusionError::into_arrow_external_error in favor of From conversion #1644
Include join type in display implementation for logical plan #1620
Switch datafusion to using eq_dyn_scalar, etc kernels #1610
Proposal: Remove Accumulator::update and Accumulator::merge #1549
Replace DataFusionError/Result with impl Error for ObjectStore and Reader #1540
Add approx_quantile support #1538
support sorting decimal data type #1522
Keep all datafusion's packages up to date with Dependabot #1472
ExecutionContext support init ExecutionContextState with new(state: Arc<Mutex<ExecutionContextState>>) method #1439
support the decimal scalar value #1393
Documentation for using scalar functions with the DataFrame API #1364
Support boolean == boolean and boolean != boolean operators #1159
Support DataType::Decimal(15, 2) in TPC-H benchmark #174
Make MemoryStream public #150
Add support for Parquet schema merging #132
Add SQL support for IN expression #118
Add logging to datafusion-cli #1789 (alamb)
Add approx_median() aggregate function #1729 (realno)
Add join type for logical plan display #1674 [sql] (xudong963)
Fix null comparison for Parquet pruning predicate #1595 (viirya)
Add corr aggregate function #1561 (realno)
Add covar, covar_pop and covar_samp aggregate functions #1551 (realno)
Add approx_quantile() aggregation function #1539 (domodwyer)
Initial MemoryManager and DiskManager APIs for query execution + External Sort implementation #1526 (yjshen)
Add stddev and variance #1525 (realno)
Add rem operation for Expr #1467 (liukun4515)
support decimal data type in create table #1431 [sql] (liukun4515)
Ordering by index in select expression #1419 [sql] (hntd187)
Add support for ORDER BY on unprojected columns #1415 (viirya)
Support decimal for min and max aggregate #1407 (liukun4515)
Consolidate ConstantFolding and SimplifyExpression #1375 (alamb)
Datafusion cli quiet mode command to contain option bool #1345 (Jimexist)
Implement array_agg aggregate function #1300 (viirya)
Add a command to switch output format in cli #1284 (capkurmagati)
Support =, <, <=, >, >=, !=, is distinct from, is not distinct from for BooleanArray #1163 (alamb)

Fixed bugs:

Unsupported data type in hasher: Timestamp(Second, None) #1768
SQL column identifiers should be converted to lowercase when unquoted #1746
Data type Dictionary(Int32, Utf8) not supported for binary operation 'eq' on dyn arrays #1605
datafusion doesn't process predicate pushdown correctly when there is outer join #1586
casting Int64 to Float64 unsuccessfully caused tpch8 to fail #1576
CTE/WITH .. UNION ALL confuses name resolution in WHERE #1509
ORDER BY min(x) results in error Plan("No field named 'foo.x'. Valid fields are 'MIN(foo.x)'.") #1479
Sort discards field metadata on the output schema #1476
Datafusion should not strip out timezone information from existing types #1454
Error on some queries: "column types must match schema types, expected XXX but found YYY" #1447
Query failing to return any results when filter is an equality check on strings (bad statistics in parquet) #1433
Field names containing period such as f.c1 cannot be named in SQL query #1432
Select * returns an unexpected result #1412
Turn off unused default features of chrono and ahash #1398
real data type is float32 in PG database, but in the datafusion it is as float64 #1380
TPC-H q10 performance regression (expression for filter with added alias is not pushed down) #1367
ProjectionExec Loses Field Metadata #1361
Support Filter on unprojected columns #1351
NULLS ORDER is inconsistent with postgres #1343
Fix bug while merging RecordBatch, add SortPreservingMerge fuzz tester #1678 (alamb)
fix a cte block with same name for many times #1639 [sql] (xudong963)
fix: casting Int64 to Float64 unsuccessfully caused tpch8 to fail #1601 (xudong963)
Fix single_distinct_to_groupby for arbitrary expressions #1519 (james727)
Fix SortExec discards field metadata on the output schema #1477 (alamb)
fix calculate in many_to_many_hash_partition test. #1463 (Ted-Jiang)
Add Timezone to Scalar::Time* types, and better timezone awareness to Datafusion's time types #1455 (maxburke)
Support identifiers with . in them #1449 [sql] (alamb)
Fixes for working with functions in dataframes, additional documentation #1430 (tobyhede)
[Minor] Fix send_time metric for hash-repartition #1421 (Dandandan)
fix: Select * returns an unexpected result #1413 [sql] (xudong963)
Make cli handle multiple whitespaces #1388 (capkurmagati)
Metadata is kept in projections for non-derived columns #1378 (hntd187)
Fix Predicate Pushdown: split_members should be able to split aliased predicate #1368 (viirya)
Change the arg names and make parameters more meaningful #1357 (liukun4515)
collect table stats by default for listing table #1347 (houqp)
fix: make nulls-order consistent with postgres #1344 [sql] (xudong963)
Avoid changing expression names during constant folding #1319 (viirya)
improve error message for invalid create table statement #1294 [sql] (houqp)
Forbid creating the table with the same name #1288 (liukun4515)

Documentation updates:

Clarify docs about Accumulator::update and Accumulator::update_batch #1542 (alamb)
Fix duplicated cargo run --example parquet_sql #1482 (sergey-melnychuk)
add documentation to Datafusion cli's new commands #1348 (liukun4515)
fix some clippy warnings from nightly channel #1277 [sql] (Jimexist)

Performance improvements:

Parquet pruning predicate for IS NULL #1591
Fix predicate pushdown for outer joins #1618 (james727)
fix: sql planner creates cross join instead of inner join from select predicates #1566 [sql] (xudong963)
Split fetch_metadata into fetch_statistics and fetch_schema #1365 (Dandandan)
Optimize the performance queries with a single distinct aggregate #1315 (ic4y)
Left join could use bitmap for left join instead of Vec<bool> #1291 (boazberman)

Closed issues:

Add release compile to CI #1728
DiskManager and TempFiles getting created several times per query #1690
Add a test for the pyarrow feature in CI #1635
SQL tests for when sorting exceeded available memory and had to spill to disk #1573
Consolidate the N-way merging code and SortPreservingMergeStream (which has quite good tests of what is often quite tricky code, and it will be performance critical) #1572
Consolidate the SortExec code (so there is only a single sort operator that does in memory sorting if it has enough memory budget but then spills to disk if needed). #1571
Track memory usage in Non Limited Operators #1569
[Question] Why does ballista store tables in the client instead of in the SchedulerServer #1473
Consolidate Projection for Schema and RecordBatch #1425
Support Sort on unprojected columns #1372
Unused code in hash_aggregate #1362
Why use the expr types before coercion to get the result type? #1358
A problem about the projection_push_down optimizer gathers valid columns #1312
apply constant folding to LogicalPlan::Values #1170
reduce usage of IntoIterator<Item = Expr> in logical plan builder window fn #372
Why does DataFusion throw a Tokio 0.2 runtime error? #176
TPC-H Query 14 #165
Length kernel returns bytes not character length #156
Split the logical operators out into separate source files #115

Merged pull requests:

Fixup some doc warnings #1811 (alamb)
Ensure most of links in docs are correct #1808 [sql] (HaoYang670)
Update CHANGELOG.md, update release scripts #1807 (alamb)
Update versions for split crates #1803 (matthewmturner)
Improve the error message and UX of tpch benchmark program #1800 (alamb)
rename references of expr in logical plan module after datafusion-expr split #1797 (Jimexist)
Update to sqlparser 0.14 #1796 [sql] (alamb)
[split/13] move rest of expr to expr_fn in datafusion-expr module #1794 (Jimexist)
Update datafusion versions #1793 (matthewmturner)
Less verbose plans in debug logging #1787 (alamb)
[split/11] split expr type and null info to be expr-schemable #1784 (Jimexist)
Introduce Row format backed by raw bytes #1782 (yjshen)
rewrite predicates before pushing to union inputs #1781 (korowa)
Update datafusion to use arrow 9.0.0 #1775 (alamb)
[split/10] split up expr for rewriting, visiting, and simplification traits #1774 [sql] (Jimexist)
#1768 Support TimeUnit::Second in hasher #1769 (jychen7)
TPC-H benchmark can optionally write JSON output file with benchmark summary #1766 (andygrove)
[split/8] move Accumulator and ColumnarValue to datafusion-expr #1765 (Jimexist)
[split/7] move built-in scalar function to datafusion-expr #1764 (Jimexist)
[split/6] move signature, type signature, volatility to datafusion-expr #1763 (Jimexist)
[split/9+12] move udf, udaf, Expr to datafusion-expr module #1762 [sql] (Jimexist)
[split/5] move window frame and operator to datafusion-expr module #1761 (Jimexist)
[split/4] move scalar value to datafusion-common #1760 (Jimexist)
[split/3] split datafusion expr module and move aggregate and window function expr #1759 (Jimexist)
[split/2] move column and dfschema to datafusion-common module #1758 (Jimexist)
Use ordered-float 2.10 #1756 (andygrove)
[split/1] split datafusion-common module #1751 (Jimexist)
use clap 3 style args parsing for datafusion cli #1749 (Jimexist)
fix: Case insensitive unquoted identifiers in SQL #1747 [sql] (mkmik)
Move more tests out of context.rs #1743 (alamb)
Move optimize test out of context.rs #1742 (alamb)
Fix typos in crate documentation #1739 (r4ntix)
add cargo check --release to ci #1737 (xudong963)
Update parking_lot requirement from 0.11 to 0.12 #1735 (dependabot[bot])
Create built-in scalar functions programmatically #1734 (HaoYang670)
Prevent repartitioning of certain operator's direct children (#1731) #1732 (tustvold)
API to get Expr's type and nullability without a DFSchema #1726 (alamb)
minor: fix cargo run --release error #1723 (xudong963)
substitute parking_lot::Mutex for std::sync::Mutex #1720 (xudong963)
Convert boolean case expressions to boolean logic #1719 (tustvold)
Add Expression Simplification API #1717 (alamb)
Create ListingTableConfig which includes file format and schema inference #1715 (matthewmturner)
make select_to_plan clearer #1714 [sql] (xudong963)
Add upper bound for public function signature #1713 (HaoYang670)
Add tests and CI for optional pyarrow module #1711 (wjones127)
Create SchemaAdapter trait to map table schema to file schemas #1709 (thinkharderdev)
refine test in repartition.rs & coalesce_batches.rs #1707 (xudong963)
Fuzz test for spillable sort #1706 (yjshen)
Support create_physical_expr and ExecutionContextState or DefaultPhysicalPlanner for faster speed #1700 (alamb)
Implement TableProvider for DataFrameImpl #1699 (cpcloud)
Move timestamp related tests out of context.rs and into sql integration test #1696 (alamb)
Lazy TempDir creation in DiskManager #1695 (alamb)
Add MemTrackingMetrics to ease memory tracking for non-limited memory consumers #1691 (yjshen)
(minor) Reduce memory manager and disk manager logs from info! to debug! #1689 (alamb)
Make SortPreservingMergeStream stable on input stream order #1687 (alamb)
Incorporate dyn scalar kernels #1685 (matthewmturner)
Move information_schema tests out of execution/context.rs to sql_integration tests #1684 (alamb)
Add a new metric type: Gauge + CurrentMemoryUsage to metrics #1682 (yjshen)
refactor array_agg to not have update and merge #1681 (Jimexist)
Use NamedTempFile rather than String in DiskManager #1680 (alamb)
upgrade clap to version 3 #1672 (Jimexist)
Improve configuration and resource use of MemoryManager and DiskManager #1668 (alamb)
feat: Support quarter granularity in date_trunc function #1667 (ovr)
Fix failure to load parquet table from Spark in datafusion-cli. #1665 (Ted-Jiang)
Make MemoryManager and MemoryStream public #1664 (yjshen)
[Cleanup] Move AggregatedMetricsSet to metrics for further reuse #1663 (yjshen)
fix: substr - correct behaviour with negative start pos #1660 (ovr)
support bitwise and as an example #1653 [sql] (liukun4515)
refine match pattern related code #1650 (xudong963)
update md-5, sha2, blake2 #1647 (xudong963)
Add DataFusionError -> ArrowError conversion #1643 (alamb)
Add spill_count and spilled_bytes to BaselineMetrics, test sort with spill #1641 (yjshen)
support hash decimal array and group by #1640 (liukun4515)
Consolidate Schema and RecordBatch projection #1638 (alamb)
Update hashbrown requirement from 0.11 to 0.12 #1631 (dependabot[bot])
Update pyo3 requirement from 0.14 to 0.15 #1627 (dependabot[bot])
Optimize SortPreservingMergeStream to avoid SortKeyCursor sharing #1624 (yjshen)
Handle merging of evolved schemas in ParquetExec #1622 (thinkharderdev)
feat: Support Substring(str [from int] [for int]) #1621 [sql] (ovr)
feat: Support complex interval via IntervalMonthDayNano #1615 [sql] (ovr)
consolidate binary_expr coercion rule code into binary_rule.rs module #1607 (alamb)
Fix comparison of dictionary arrays #1606 (alamb)
add test for decimal to decimal #1603 (liukun4515)
update nightly version #1597 (Jimexist)
Consolidate sort and external_sort #1596 (yjshen)
support from_slice for binary, string, and boolean array types #1589 (Jimexist)
add from_slice trait to ease arrow2 migration #1588 (Jimexist)
Implement ARRAY_AGG(DISTINCT ...) #1579 (james727)
Rename sql integration tests from mod to sql_integration #1575 (alamb)
minor: improve the benchmark readme #1567 (xudong963)
Consolidate batch_size configuration in ExecutionConfig, RuntimeConfig and PhysicalPlanConfig #1562 (yjshen)
Update to rust 1.58 #1557 (xudong963)
support mathematics operation for decimal data type #1554 (liukun4515)
Address clippy warnings #1553 (sergey-melnychuk)
enhance arithmetic operation for array with scalar #1552 (liukun4515)
Remove unused update and merge implementations from Aggregates and supporting ScalarValue arithmetic #1550 (alamb)
Add batch operations to stddev #1547 (realno)
Mark ARRAY_AGG(DISTINCT ...) not implemented #1534 (james727)
Update to arrow-7.0.0 #1523 (alamb)
Fix ORDER BY on aggregate #1506 (viirya)
Add example on how to query multiple parquet files #1497 (nitisht)
Refactor testing modules #1491 (hntd187)
add rfcs for datafusion #1490 (xudong963)
support comparison for decimal data type and refactor the binary coercion rule #1483 (liukun4515)
Minor: Rename predicate_builder --> pruning_predicate for consistency #1481 (alamb)
Tests for support try_cast/cast decimal to numeric #1465 (liukun4515)
Avoid sending empty batches for Hash partitioning. #1459 (Ted-Jiang)
Planner code cleanup #1450 [sql] (alamb)
Fix bug in projection: "column types must match schema types, expected XXX but found YYY" #1448 (alamb)
Update arrow-rs to 6.4.0 and replace boolean comparison in datafusion with arrow compute kernel #1446 (xudong963)
support cast/try_cast for decimal: signed numeric to decimal #1442 (liukun4515)
Consolidate decimal error checking and improve error messages #1438 [sql] (alamb)
use 0.13 sql parser #1435 (Jimexist)
Minor Code cleanups #1428 (alamb)
Clarify communication on bi-weekly sync #1427 (alamb)
support sum/avg agg for decimal, change sum(float32) --> float64 #1408 [sql] (liukun4515)
Fix bugs with nullability during rewrites: Combine simplify and Simplifier #1401 (alamb)
Minimize features #1399 (carols10cents)
Update rust version to 1.57 #1395 [sql] (xudong963)
support decimal scalar value #1394 (liukun4515)
Add coercion rules for AggregateFunctions #1387 (liukun4515)
upgrade the arrow-rs version #1385 (liukun4515)
add array agg name #1382 (liukun4515)
Make tests for simplify and Simplifier consistent #1376 (alamb)
Refactor: Consolidate expression simplification code in simplify_expression.rs #1374 (alamb)
remove unused code in hash_aggregate #1370 (ic4y)
Use BufReader for LocalFileReader to revert performance regression in parquet reading #1366 (Dandandan)
Add unit test for constant folding on values #1355 (viirya)
Extract logical plan: rename the plan name (follow up) #1354 [sql] (liukun4515)
Moved aggr_test_schema to test_utils #1338 (rdettai)
upgrade arrow-rs to 6.2.0 #1334 (liukun4515)
Update release instructions #1331 (alamb)
#1268: allow datafusion-cli to toggle quiet flag within CLI #1330 (jgoday)
Extract Aggregate, Sort, and Join to struct from AggregatePlan #1326 (matthewmturner)
Extract EmptyRelation, Limit, Values from LogicalPlan #1325 (liukun4515)
Extract CrossJoin, Repartition, Union in LogicalPlan #1322 (liukun4515)
Fifth batch of updating sql tests to use assert_batches_eq #1318 (matthewmturner)
Extract Explain, Analyze, Extension in LogicalPlan as independent struct #1317 [sql] (xudong963)
Extract CreateMemoryTable, DropTable, CreateExternalTable in LogicalPlan as independent struct #1311 [sql] (liukun4515)
Extract Projection, Filter, Window in LogicalPlan as independent struct #1309 (ic4y)
Add PSQL comparison tests for except, intersect #1292 (mrob95)
Extract logical plans in LogicalPlan as independent struct: TableScan #1290 (xudong963)
Add statement helper command to cli #1285 (matthewmturner)
Python bindings for window functions #819 [sql] (jgoday)

6.0.0 (2021-11-13)

Full Changelog

Breaking changes:

Removed deprecated with_concurrency #1200 (rdettai)
File partitioning for ListingTable #1141 (rdettai)
Add function volatility to Signature #1071 [sql] (pjmore)
fix: allow duplicate field names in table join, fix output with duplicated names #1023 (houqp)
Make TableProvider.scan() and PhysicalPlanner::create_physical_plan() async #1013 (rdettai)
Reorganize table providers by table format #1010 (rdettai)
Make Metrics::labels() public #999 (alamb)
Rename NthValue::{first_value,last_value,nth_value} to satisfy clippy in Rust 1.55 #986 (alamb)
Move CBOs and Statistics to physical plan #965 (rdettai)
Update to sqlparser v 0.10.0 #934 [sql] (alamb)
FilePartition and PartitionedFile for scanning flexibility #932 [sql] (yjshen)
Improve SQLMetric APIs, port existing metrics #908 (alamb)
Add support for EXPLAIN ANALYZE #858 [sql] (alamb)
Rename concurrency to target_partitions #706 (andygrove)

Implemented enhancements:

Add booleans support to the CASE statement #1156
Implement General Purpose Constant Folding with the Expression Evaluator #1070
Mark volatility categories of functions #1069
Add "show" support to DataFrame API #937
Add support for TRIM BOTH/LEADING/TRAILING #935
Add "baseline" metrics to all built in operators #866
Add SQL support for referencing fields in structs #119
add filename completer for create table statement #1278 (Jimexist)
Add drop table support #1266 [sql] (viirya)
Dataframe supports except and update readme #1261 (xudong963)
Implement EXCEPT & EXCEPT DISTINCT #1259 [sql] (xudong963)
Add DataFrame support for INTERSECT and update readme #1258 (xudong963)
use arrow 6.1.0 #1255 (Jimexist)
fix 1250, add editor support for datafusion cli with validation #1251 (Jimexist)
Add support for create table as via MemTable #1243 [sql] (Dandandan)
Add cli show columns command to describe tables #1231 (Jimexist)
datafusion-cli to add list table command #1229 (Jimexist)
datafusion cli to handle EoF and interrupt signal #1225 (Jimexist)
add \q as quit command and add ? for help #1224 (Jimexist)
Add algebraic simplifications to constant_folding #1208 (matthewmturner)
Improve GetIndexedFieldExpr adding utf8 key based access for struct v… #1204 [sql] (Igosuki)
Fix between in select query #1202 [sql] (capkurmagati)
Move code to fold Stable functions like now() from Simplifier to ConstEvaluator #1176 (alamb)
DataFrame supports window function #1167 [sql] (xudong963)
add values list expression #1165 [sql] (Jimexist)
Add booleans support to the CASE statement #1161 (xudong963)
Improve error messages when operations are not supported #1158 (alamb)
Generic constant expression evaluation #1153 (alamb)
python lit function to support bool and byte vec #1152 (Jimexist)
[nit] simplify datafusion optimizer module codes #1146 (panarch)
Add ScalarValue support for arbitrary list elements #1142 (jonmmease)
Multiple files per partition for CSV Avro Json #1138 (rdettai)
Implement INTERSECT & INTERSECT DISTINCT #1135 [sql] (xudong963)
Simplify file struct abstractions #1120 (rdettai)
Implement is [not] distinct from #1117 [sql] (Dandandan)
Clean up spawned task on drop for RepartitionExec, SortPreservingMergeExec, WindowAggExec #1112 (crepererum)
add hyperloglog implementation (add and count) #1095 (Jimexist)
Add ScalarValue::Struct variant #1091 (jonmmease)
add digest(utf8, method) function and refactor all current hash digest functions #1090 (Jimexist)
[crypto] add blake3 algorithm to digest function #1086 (Jimexist)
[crypto] add blake2b and blake2s functions #1081 (Jimexist)
[nit] make schema qualifier error message in field lookup more readable #1079 (Jimexist)
[window function] add percent_rank window function #1077 (Jimexist)
[window function] add cume_dist implementation #1076 (Jimexist)
Add a LogicalPlanBuilder::schema() function #1075 (alamb)
Add support for UNION [DISTINCT] sql #1068 [sql] (xudong963)
fix: fix joins on Float32/Float64 columns bug #1054 (francis-du)
Update sqlparser-rs to 0.11 #1052 [sql] (alamb)
Support querying CSV files without providing the schema #1050 [sql] (xudong963)
remove hard coded partition count in ballista logicalplan deserialization #1044 (xudong963)
feat: add lit_timestamp_nanosecond #1030 (NGA-TRAN)
Ignore metadata on schema merge #1024 (Smurphy000)
add ExecutionConfig.with_optimizer_rules #1022 (seddonm1)
Add baseline execution stats to WindowAggExec and UnionExec, and fixup CoalescePartitionsExec #1018 (alamb)
Derive PartialOrd for Expr #1015 (alamb)
Indexed field access for List #1006 [sql] (Igosuki)
Add metrics for Limit and Projection, and CoalesceBatches #1004 (alamb)
Update DataFusion to arrow 6.0 #984 (alamb)
Implement Display for Expr, improve operator display #971 [sql] (matthewmturner)
Add metrics for FilterExec #960 (alamb)
Change compound column field name rules #952 (waynexia)
ObjectStore API to read from remote storage systems #950 (yjshen)
Add baseline metrics to SortPreservingMergeExec #948 (alamb)
Add support for TRIM LEADING/TRAILING/BOTH syntax #947 [sql] (adsharma)
fixes #933 replace placeholder fmt_as for ExecutionPlan impls #939 (tiphaineruy)
Add metrics for SortExec + HashAggregateExec #938 (alamb)
Add some additional asserts in utils::from_plan #930 (alamb)
Avro Table Provider #910 [sql] (Igosuki)
Add BaselineMetrics, Timestamp metrics, add for CoalescePartitionsExec, rename output_time -> elapsed_compute #909 (alamb)
add cross join support to ballista #891 (houqp)
Add Ballista support to DataFusion CLI #889 (andygrove)
support like on DictionaryArray #876 (b41sh)
Register table based on known schema without file IO #872 (Dandandan)
Add support for PostgreSQL regex match #870 [sql] (b41sh)
Include planning time in datafusion-cli printing #860 (Dandandan)
Implement basic common subexpression elimination optimization #792 (waynexia)
Impl ops::Not for expr #763 (Jimexist)

Fixed bugs:

Can not use between in the select list: #1196
ORDER BY does not work with literals: Sort operation is not applicable to scalar value 'foo' #1195
window functions with NULL literals in partition by and order by do not work: Internal("Sort operation is not applicable to scalar value NULL") #1194
Operation name not included in internal errors -- Internal("Data type Boolean not supported for binary operation on dyn arrays") #1157
Physical plan explain UNION query says "ExecutionPlan(PlaceHolder)" #933
Can not use LIKE on DictionaryArray encoded strings #815
physical_plan::repartition::tests::repartition_with_dropping_output_stream failing locally #614
Fix some BuiltinScalarFunction panics with zero arguments #1249 (capkurmagati)
fix: not do boolean folding on NULL and/or expr #1245 (NGA-TRAN)
ignore case of with header row in sql when creating external table #1237 [sql] (lichuan6)
fix: Min/Max aggregation data type should not be dictionary #1235 (NGA-TRAN)
Fix build with --no-default-features #1219 (alamb)
Prevent "future cannot be sent between threads safely" compilation error #1155 (jonmmease)
Clean up spawned task on drop for AnalyzeExec, CoalescePartitionsExec, HashAggregateExec #1121 (crepererum)
Clean up spawned task on SortStream drop #1105 (crepererum)
fix UNION ALL bug: thread 'main' panicked at 'index out of bounds: the len is 1 but the index is 1', ./src/datatypes/schema.rs:165:10 #1088 (xudong963)
python: fix generated table name in dataframe creation #1078 (houqp)
fix subquery alias #1067 [sql] (xudong963)
fix pattern handling in regexp_match function #1065 (houqp)
fix: joins on Timestamp columns #1055 (francis-du)
Fix metric name typo #943 (alamb)
EXPLAIN ANALYZE should run all Optimizer passes #929 (alamb)

Documentation updates:

update docs to fix DataFusion User Guide link #1238 (jiangzhx)
[docs] datafusion cli run via homebrew #1198 (Jimexist)
add support for unary and binary values in values list, update docs #1172 [sql] (Jimexist)
Add additional docstring comments to from_plan #1168 (alamb)
[nit] fix document issue for approx_distinct #1110 (Jimexist)
implement approx_distinct function using HyperLogLog #1087 (Jimexist)
Remove unused use statements from examples #1032 (alamb)
consolidate datafusion docs with sphinx #993 (houqp)
Updated user-guide library docs with optimized config #976 (matthewmturner)
Improve User Guide #954 (andygrove)
[MINOR] Fix typos in doc comments #945 (alamb)
[DataFusion] - Add show and show_limit function for DataFrame #923 (francis-du)
Typo fix in DataFusion crate documentation #914 (antoinewdg)

Performance improvements:

Improve avro reader performance by avoiding some cloning on avro_rs::Value #1206 (Igosuki)
optimize build profile for datafusion python binding, cli and ballista #1137 (houqp)
Avoid stack overflow by reducing stack usage of BinaryExpr::evaluate in debug builds #1047 (alamb)
Add ScalarValue::eq_array optimized comparison function #844 (alamb)
Rework GroupByHash for faster performance and support grouping by nulls #808 (alamb)

Closed issues:

InList expr with NULL literals do not work #1190
update the homepage README to include values, approx_distinct, etc. #1171
[Python]: Inconsistencies with Python package name #1011
Wanting to contribute to project where to start? #983
delete redundant code #973
How to build DataFusion python wheel #853
Add support for partition pruning #204
[Datafusion] Support joins on TimestampMillisecond columns #187
TPC-H Query 21 #173
TPC-H Query 13 #164
TPC-H Query 8 #162
implement split_part(string, delimiter, position) #157
Join Statement: Schema contains duplicate unqualified field name #155
ParquetTable should avoid scanning all files twice #136
Add support for reading partitioned Parquet files #133
Add support for Parquet schema merging #132
Catalog abstraction #126
Optimizer rules should work with qualified column names #125
Add optional qualifier to Expr::Column #121
Implement modulus expression #99
[Rust] Add constant folding to expressions during logically planning #98
[Rust] Implement pretty print for physical query plan #93
Can not group by boolean columns (add boolean to valid keys of groupBy) #91
improve performance of building literal arrays #90
[rust][datafusion] optimize count(*) queries on parquet sources #89
Produce a design for a metrics framework #21

Merged pull requests:

Add timezone string to stabilize test #1265 (viirya)
numerical_coercion pattern match optimize #1256 (Jimexist)
fix and update window function sql tests #1059 (Jimexist)
reduce ScalarValue from trait boilerplate with macro #989 (houqp)

For older versions, see apache/arrow/CHANGELOG.md

5.0.0 (2021-08-10)

Full Changelog

Breaking changes:

Box ScalarValue:Lists, reduce size by half #788 (alamb)
JOIN conditions are order dependent #778 (seddonm1)
Show the result of all optimizer passes in EXPLAIN VERBOSE #759 (alamb)
#723 Datafusion add option in ExecutionConfig to enable/disable parquet pruning #749 (lvheyang)
Update API for extension planning to include logical plan #643 (alamb)
Rename MergeExec to CoalescePartitionsExec #635 (andygrove)
fix 593, reduce cloning by taking ownership in logical planner's from fn #610 (Jimexist)
fix join column handling logic for On and Using constraints #605 (houqp)
Rewrite pruning logic in terms of PruningStatistics using Array trait (option 2) #426 (alamb)
Support reading from NdJson formatted data sources #404 (heymind)
Add metrics to RepartitionExec #398 (andygrove)
Use 4.x arrow-rs from crates.io rather than git sha #395 (alamb)
Return Vec<bool> from PredicateBuilder rather than an Fn #370 (alamb)
Refactor: move RowGroupPredicateBuilder into its own module, rename to PruningPredicateBuilder #365 (alamb)
[Datafusion] NOW() function support #288 (msathis)
Implement select distinct #262 (Dandandan)
Refactor datafusion/src/physical_plan/common.rs build_file_list to take less param and reuse code #253 (Jimexist)
Support qualified columns in queries #55 (houqp)
Read CSV format text from stdin or memory #54 (heymind)
Use atomics for SQLMetric implementation, remove unused name field #25 (returnString)

Implemented enhancements:

Allow extension nodes to correctly plan physical expressions with relations #642
Filters aren't passed down to table scans in a union #557
Support pruning for boolean columns #490
Implement SQLMetrics for RepartitionExec #397
DataFusion benchmarks should show executed plan with metrics after query completes #396
Use published versions of arrow rather than github shas #393
Add Compare to GroupByScalar #364
Reusable "row group pruning" logic #363
Add an Order Preserving merge operator #362
Implement Postgres compatible now() function #251
COUNT DISTINCT does not support dictionary types #249
Use standard make_null_array for CASE #222
Implement date_trunc() function #203
COUNT DISTINCT does not support Float64 #199
Update SQLMetric to use atomics rather than a Mutex #30
Implement PartialOrd for ScalarValue #838 (viirya)
Support date datatypes in max/min #820 (viirya)
Implement vectorized hashing for DictionaryArray types #812 (alamb)
Convert unsupported conditions in left right join to filters #796 [sql] (Dandandan)
Implement streaming versions of Dataframe.collect methods #789 (andygrove)
impl from str for column and scalar #762 (Jimexist)
impl fmt::Display for PlanType #752 (Jimexist)
Remove unnecessary projection in logical plan optimization phase #747 (waynexia)
Support table columns alias #735 (Dandandan)
Derive PartialEq for datasource enums #734 (alamb)
Allow filetype to be lowercase, Implement FromStr for FileType #728 (Jimexist)
Update to use arrow 5.0 #721 (alamb)
#554: Lead/lag window function with offset and default value arguments #687 (jgoday)
dedup using join column in wildcard expansion #678 (houqp)
Implement metrics for HashJoinExec #664 (andygrove)
Show physical plan with metrics in benchmark #662 (andygrove)
Allow non-equijoin filters in join condition #660 (Dandandan)
Add End-to-end test for parquet pruning + metrics for ParquetExec #657 (alamb)
Add support for leading field in interval #647 (Dandandan)
Remove hard-coded PartitionMode from Ballista serde #637 (andygrove)
Ballista: Implement scalable distributed joins #634 (andygrove)
implement rank and dense_rank function and refactor built-in window function evaluation #631 (Jimexist)
Improve "field not found" error messages #625 (andygrove)
Support modulus op #577 (gangliao)
implement std::default::Default for execution config #570 (Jimexist)
to_timestamp_millis(), to_timestamp_micros(), to_timestamp_seconds() #567 (velvia)
Filter push down for Union #559 (Dandandan)
Implement window functions with partition_by clause #558 (Jimexist)
support table alias in join clause #547 (houqp)
Not equal predicate in physical_planning pruning #544 (jgoday)
add error handling and boundary checking for window frames #530 (Jimexist)
Implement window functions with order_by clause #520 (Jimexist)
support group by column positions #519 [sql] (jychen7)
Implement constant folding for CAST #513 (msathis)
Add window frame constructs - alternative #506 (Jimexist)
Add partition by constructs in window functions and modify logical planning #501 (Jimexist)
Add support for boolean columns in pruning logic #500 (alamb)
#215 resolve aliases for group by exprs #485 (jychen7)
Support anti join #482 (Dandandan)
Support semi join #470 (Dandandan)
add order by construct in window function and logical plans #463 (Jimexist)
Remove redundant filters (e.g. c> 5 AND c>5 --> c>5) #436 (jgoday)
fix: display the content of debug explain #434 (NGA-TRAN)
implement lead and lag built-in window function #429 (Jimexist)
add support for ndjson for datafusion-cli #427 (Jimexist)
add first_value, last_value, and nth_value built-in window functions #403 (Jimexist)
export both now and random functions #389 (Jimexist)
Function to create ArrayRef from an iterator of ScalarValues #381 (alamb)
Sort preserving merge (#362) #379 (tustvold)
Add support for multiple partitions with SortExec (#362) #378 (tustvold)
add window expression stream, delegated window aggregation to aggregate functions, and implement row_number #375 (Jimexist)
Add PartialOrd and Ord to GroupByScalar (#364) #368 (tustvold)
Implement readable explain plans for physical plans #337 (alamb)
Add window expression part 1 - logical and physical planning, structure, to/from proto, and explain, for empty over clause only #334 (Jimexist)
Use NullArray to Pass row count to ScalarFunctions that take 0 arguments #328 (Jimexist)
add --quiet/-q flag and allow timing info to be turned on/off #323 (Jimexist)
Implement hash partitioned aggregation #320 (Dandandan)
Support COUNT(DISTINCT timestamps) #319 (charlibot)
add random SQL function #303 (Jimexist)
allow datafusion cli to take -- comments #296 (Jimexist)
Add json print format mode to datafusion cli #295 (Jimexist)
Add print format param with support for tsv print format to datafusion cli #292 (Jimexist)
Add print format param and support for csv print format to datafusion cli #289 (Jimexist)
allow datafusion-cli to take a file param #285 (Jimexist)
add param validation for datafusion-cli #284 (Jimexist)
[breaking change] fix 265, log should be log10, and add ln #271 (Jimexist)
Implement count distinct for dictionary arrays #256 (alamb)
Count distinct floats #252 (pjmore)
Add rule to eliminate LIMIT 0 and replace it with an EmptyRelation #213 (Dandandan)
Allow table providers to indicate their type for catalog metadata #205 (returnString)
Use arrow eq kernels in CaseWhen expression evaluation #52 (Dandandan)
Re-export Arrow and Parquet crates from DataFusion #39 (returnString)
[DataFusion] Optimize hash join inner workings, null handling fix #24 (Dandandan)
[ARROW-12441] [DataFusion] Cross join implementation #11 (Dandandan)

Fixed bugs:

Projection pushdown removes unqualified column names even when they are used #617
Panic while running join datatypes/schema.rs:165:10 #601
Indentation is incorrect for joins in formatted physical plans #345
Error while running COUNT DISTINCT (timestamp): 'Unexpected DataType for list #314
When joining two tables, get Error: Plan("Schema contains duplicate unqualified field name 'xxx'") #311
Incorrect answers with SELECT DISTINCT queries #250
Intermittent failure in CI join_with_hash_collision #227
Concat from Dataframe API no longer accepts multiple expressions #226
Fix right, full join handling when having multiple non-matching rows at the left side #845 (Dandandan)
Qualified field resolution too strict #810 [sql] (seddonm1)
Better join order resolution logic #797 [sql] (seddonm1)
Produce correct answers for Group BY NULL (Option 1) #793 (alamb)
Use consistent version of string_to_timestamp_nanos in DataFusion #767 (alamb)
#723 limit pruning rule to simple expression #764 (lvheyang)
#699 fix return type conflict when calling builtin math functions #716 (lvheyang)
Fix Date32 and Date64 parquet row group pruning #690 (alamb)
Remove qualifiers on pushed down predicates / Fix parquet pruning #689 (alamb)
use Weak ptr to break catalog list <> info schema cyclic reference #681 (crepererum)
honor table name for csv/parquet scan in ballista plan serde #629 (houqp)
fix 621, where unnamed window functions shall be differentiated by partition and order by clause #622 (Jimexist)
RFC: Do not prune out unnecessary columns with unqualified references #619 (alamb)
[fix] select * on empty table #613 (rdettai)
fix 592, support alias in window functions #607 (Jimexist)
RepartitionExec should not error if output has hung up #576 (alamb)
Fix pruning on not equal predicate #561 (alamb)
hash float arrays using primitive unsigned integer type #556 (houqp)
Return errors properly from RepartitionExec #521 (alamb)
refactor sort exec stream and combine batches #515 (Jimexist)
Fix display of execution time in datafusion-cli #514 (Dandandan)
Wrong aggregation arguments error. #505 (jgoday)
fix window aggregation with alias and add integration test case #454 (Jimexist)
fix: don't duplicate existing filters #409 (e-dard)
Fixed incorrect logical type in GroupByScalar. #391 (jorgecarleitao)
Fix indented display for multi-child nodes #358 (alamb)
Fix SQL planner to support multibyte column names #357 (agatan)
Fix wrong projection 'optimization' #268 (Dandandan)
Fix Left join implementation is incorrect for 0 or multiple batches on the right side #238 (Dandandan)
Count distinct boolean #230 (pjmore)
Fix Filter / where clause without column names is removed in optimization pass #225 (Dandandan)

Documentation updates:

No way to get to the examples from docs.rs #186
Update docs to use vendored version of arrow #772 (alamb)
Fix typo in DEVELOPERS.md #692 (lvheyang)
update stale documentations related to window functions #598 (Jimexist)
update readme to reflect work on window functions #471 (Jimexist)
Add examples section to datafusion crate doc #457 (mluts)
add invariants spec #443 (houqp)
add output field name rfc #422 (houqp)
Update more docs and also the developer.md doc #414 (Jimexist)
use prettier to format md files #367 (Jimexist)
Add new logo svg with white background #313 (parthsarthy)
Add projects (Squirtle and Tensorbase) to list in readme #312 (parthsarthy)
docs - fix the ballista link #274 (haoxins)
misc(README): Replace Cube.js with Cube Store #248 (ovr)
Initial docs for SQL syntax #242 (Dandandan)
Deduplicate README.md #79 (msathis)

Performance improvements:

Speed up inlist for strings and primitives #813 (Dandandan)
perf: improve performance of SortPreservingMergeExec operator #722 (e-dard)
Optimize min/max queries with table statistics #719 (b41sh)
perf: Improve materialisation performance of SortPreservingMergeExec #691 (e-dard)
Optimize count(*) with table statistics #620 (Dandandan)
optimize window function's find_ranges_in_range #595 (Jimexist)
Collapse sort into window expr and do sort within logical phase #571 (Jimexist)
Use repartition in window functions to speed up #569 (Jimexist)
Constant fold / optimize to_timestamp function during planning #387 (msathis)
Speed up create_batch_from_map #339 (Dandandan)
Simplify math expression code (use unary kernel) #309 (Dandandan)

Closed issues:

Confirm git tagging strategy for releases #770
arrow::util::pretty::pretty_format_batches missing #769
move the assert_batches_eq! macros to a non part of datafusion #745
fix an issue where aliases are not respected in generating downstream schemas in window expr #592
make the planner to print more succinct and useful information in window function explain clause #526
move window frame module to be in logical_plan #517
use a more rust idiomatic way of handling nth_value #448
create a test with more than one partition for window functions #435
COUNT DISTINCT does not support Boolean #202
Read CSV format text from stdin or memory #198
Fix null handling hash join #195
Allow TableProviders to indicate their type for the information schema #191
Make DataFrame extensible #190
TPC-H Query 19 #170
TPC-H Query 7 #161
Upgrade hashbrown to 0.10 #151
Implement vectorized hashing for hash aggregate #149
More efficient LEFT join implementation #143
Implement vectorized hashing #142
RFC Roadmap for 2021 (DataFusion) #140
Implement hash partitioning #131
Grouping by column position #110
[Datafusion] GROUP BY with a high cardinality doesn't seem to finish #107
[Rust] Add support for JSON data sources #103
[Rust] Implement metrics framework #95
Publicly export Arrow crate from datafusion #36
Implement hash-partitioned hash aggregate #27
Consider using GitHub pages for DataFusion/Ballista documentation #18
Update "repository" in Cargo.toml #16

Merged pull requests:

Use RawTable API in hash join #827 (Dandandan)
Add test for window functions on dictionary #823 (alamb)
Update dependencies: prost to 0.8 and tonic to 0.5 #818 (alamb)
Move hash_array into hash_utils.rs #807 (alamb)
Remove GroupByScalar and use ScalarValue in preparation for supporting null values in GroupBy #786 (alamb)
fix 226, make concat, concat_ws, and random work with Python crate #761 (Jimexist)
Test for parquet pruning disabling #754 (alamb)
Add explain verbose with limit push down #751 (Jimexist)
Move assert_batches_eq! macros to test_utils.rs #746 (alamb)
Show optimized physical and logical plans in EXPLAIN #744 (alamb)
update python crate to support latest pyo3 syntax and gil semantics #741 (Jimexist)
update python crate dependencies #740 (Jimexist)
provide more details on required .parquet file extension error message #729 (Jimexist)
split up windows functions into a dedicated module with separate files #724 (Jimexist)
Use pytest in integration test #715 (Jimexist)
replace once iter chain with array::IntoIter #704 (houqp)
avoid iterator materialization in column index lookup #703 (houqp)
Fix build with 1.52.1 #696 (alamb)
Fix test output due to logical merge conflict #694 (alamb)
add more integration tests #668 (Jimexist)
Bump arrow and parquet versions to 4.4 #654 (toddtreece)
Add query 15 to TPC-H queries #645 (Dandandan)
Improve error message and comments #641 (alamb)
add integration tests for rank, dense_rank, fix last_value evaluation with rank #638 (Jimexist)
round trip TPCH queries in tests #630 (houqp)
use Into<String> as argument type wherever applicable #615 (houqp)
reuse alias map in aggregate logical planning and refactor position resolution #606 (Jimexist)
fix clippy warnings #581 (Jimexist)
Add benchmarks to window function queries #564 (Jimexist)
reuse code for now function expr creation #548 (houqp)
turn on clippy rule for needless borrow #545 (Jimexist)
Refactor hash aggregates's planner building code #539 (Jimexist)
Cleanup Repartition Exec code #538 (alamb)
reuse datafusion physical planner in ballista building from protobuf #532 (Jimexist)
remove redundant into_iter() calls #527 (Jimexist)
Fix 517 - move window_frames module to logical_plan #518 (Jimexist)
Refactor window aggregation, simplify batch processing logic #516 (Jimexist)
Add datafusion::test_util, resolve test data paths without env vars #498 (mluts)
Avoid warnings in tests when compiling without default features #489 (alamb)
update cargo.toml in python crate and fix unit test due to hash joins #483 (Jimexist)
use prettier check in CI #453 (Jimexist)
Optimize nth_value, remove first_value, last_value structs and use idiomatic rust style #452 (Jimexist)
Fixed typo / logical merge conflict #433 (jorgecarleitao)
include test data and add aggregation tests in integration test #425 (Jimexist)
Add some padding around the logo #411 (parthsarthy)
Benchmark subcommand to distinguish between DataFusion and Ballista #402 (jgoday)
refactor datafusion/scalar_value to use more macro and avoid dup code #392 (Jimexist)
Update TPC-H benchmark to show physical plan when debug mode is enabled #386 (andygrove)
Update arrow dependencies again #341 (alamb)
Update arrow-rs deps #317 (alamb)
Update PR template by commenting out instructions #315 (alamb)
fix clippy warning #286 (Jimexist)
add integration test to compare datafusion-cli against psql #281 (Jimexist)
Update arrow deps #269 (alamb)
Use multi-stage build dockerfile in datafusion-cli and reduce image size from 2.16GB to 89.9MB #266 (Jimexist)
Enable redundant_field_names clippy lint #261 (Dandandan)
fix clippy lint #259 (alamb)
Move datafusion-cli to new crate #231 (Dandandan)
Make test join_with_hash_collision deterministic #229 (Dandandan)
Update arrow-rs deps (to fix build due to flatbuffers update) #224 (alamb)
Use standard make_null_array for CASE #223 (alamb)
update arrow-rs deps to latest master #216 (alamb)
MINOR: Remove empty rust dir #61 (andygrove)

* This Changelog was automatically generated by github_changelog_generator
