This repository has been archived by the owner on Oct 5, 2022. It is now read-only.

Changelog

11.0.0 (2022-08-16)

Full Changelog

Breaking changes:

Implemented enhancements:

  • Make RowAccumulator public #3138
  • docs: proposal for consolidating docs into a Contributor Guide #3127
  • feat: support Timestamp +/- Interval #3103
  • An arrow_typeof like PostgreSQL's pg_typeof #3095
  • Add DataFrame section to user guide #3066
  • Document all scalar SQL functions in user guide #3065
  • Simplify implementation of approx_median so that it can be exposed in Python #3063
  • Support double quoted literal strings for dialects (such as MySQL, BigQuery) #3055
  • Simplify / speed up implementation of character_length to unicode points #3049
  • Follow-up on Clickbench benchmark #3048
  • Why is the PhysicalPlanner an async trait? #3032
  • Optimize file stream metrics. #3024
  • Proposal: Enable typed strings expressions for VALUES clause #3017
  • Proposal: Add date_bin function #3015
  • The upcoming release of Arrow (20?) breaks datafusion #3006
  • Can I select some files for query based on the filtering rules in the directory? #2993
  • Rename FormatReader to FileOpener #2990
  • Derive Hash trait for JoinType #2971
  • CAST from Utf8 to Boolean #2967
  • Add baseline_metrics for FileStream to record metrics like elapsed time, record output, etc #2961
  • Example to show how to convert query result into rust struct #2959
  • simplify not clause #2957
  • Implement Debug for ColumnarValue #2950
  • Parallel fetching of column chunks when reading parquet files #2949
  • Extension mechanism for SessionConfig #2939
  • Streaming CSV/JSON Object Store Read #2935
  • Support CSV Limit Pushdown to Object Storage #2930
  • Add support for pow scalar function #2926
  • Add support for exact median aggregate function #2925
  • Support mean as synonym for avg #2922
  • Rename a column name #2919
  • Move ScalarValue tests alongside implementation, move from_slice to core #2913
  • Fail gracefully if optimization rule fails #2908
  • Make ObjectStoreRegistry a trait which can allow Ballista to introduce a self registry ObjectStoreRegistry #2905
  • Remove datafusion-data-access crate #2903
  • Improve formatting of logical plans containing subquery expressions #2898
  • Atan2 added to built-in functions #2897
  • The explain statements only print logical plans for debug/other purpose. #2894
  • JSON version of display_indent() #2889
  • It would be nice to have a way to generate unique IDs in optimizer rules #2886
  • Add support for TIME literal values #2883
  • Add h2o benchmark #2879
  • Implement from_unixtime function #2871
  • Add cast function for creating logical cast expression #2870
  • Release DataFusion 10.0.0 #2862
  • Implement information_schema.views #2857
  • Migrate from avro_rs to apache_avro #2783
  • Add optimizer rule to remove OFFSET 0 #2584
  • Preserve Element Name in ScalarValue::List #2450
  • Add EXISTS subquery support to Ballista #2338
  • Add documentation on supported functions to datafusion website #1487
  • documentations for datafusion-cli can be consolidated a bit more #1352
  • Optimizer: Predicate Rewrite pass for TPCH Q19 #217
  • feat: add optimize rule rewrite_disjunctive_predicate #2858 (xudong963)

Fixed bugs:

  • Regression in SQL support for ORDER BY and aliased expressions #3160
  • Panic when dealing with @ operator #3137
  • Incorrect type coercion rule for date + interval #3093
  • Casting string to timestamp crashes when the input time is before 1970 and has fractional seconds #3082
  • INTEGER type doesn't work while importing CSV #3059
  • Cannot GROUP BY Binary #3050
  • incorrect i32 coercion for to_timestamp #3046
  • Error pruning IsNull expressions: Column 'instance_null_count' is declared as non-nullable but contains null values #3042
  • I want to query some files in a directory. Is there any way? #3013
  • The expression to get an indexed field is only valid for List types (common_sub_expression_eliminate) #3002
  • Double to_timestamp_seconds produces abnormal result #2998
  • External parquet table fails when schema contains differing key / value metadata #2982
  • SELECT on column with uppercase column name fails with FieldNotFound error #2978
  • panic reading AWS-generated parquet file #2963
  • Can't filter rowgroup for parquet prune for some data type #2962
  • CI test is failing with final link failed: No space left on device #2947
  • bug: new ObjectStore breaks backward compatibility with contrib plugins #2931
  • bug: file types handled wrong #2929
  • bug: changing the number of partitions does not increase concurrency #2928
  • csv_explain fails on RC verifier #2916
  • index out of range error from datafusion_row::write::write_field #2910
  • Optimization rule CommonSubexprEliminate creates invalid projections #2907
  • serde_json requires that either std (default) or alloc feature is enabled #2896
  • Inconsistent type coercion rules with comparison expressions #2890
  • Doc Error: the test directory link 404 which is in CONTRIBUTING.md #2880
  • Round trips through ScalarValue's sometimes don't preserve types (e.g. change types from DictionaryArray) #2874
  • Error with CASE and DictionaryArrays: ArrowError(InvalidArgumentError("arguments need to have the same data type")) #2873
  • window functions not supported in expressions #2869
  • Unable to work with month intervals #2796
  • Discord invite link in communication page has expired #2743
  • Test (path normalization) failures while verifying release candidate 9.0.0 RC1 #2719
  • Reading parquet with (pre-release) arrow fails with "out of order projection is not supported" #2543
  • Fix SQL planner bug when resolving columns with same name as a relation #3003 [sql] (andygrove)
  • fix RowWriter index out of bounds error #2968 (comphead)
  • fix: support decimal statistic for row group prune #2966 (liukun4515)
  • Fix invalid projection in CommonSubexprEliminate #2915 (andygrove)

Documentation updates:

Performance improvements:

  • Use code points instead of grapheme clusters for string functions #3054 (Dandandan)

Closed issues:

  • Rename do_data_time_math() to do_date_time_math() #3172
  • Automatic version updates for github actions with dependabot #3106
  • [EPIC] Proposal for Date/Time enhancement #3100
  • Upgrade prost/tonic everywhere #3028
  • [Question] interested in helping with documentation #2866
  • Introducing a new optimizer framework for datafusion. #2633
  • Enable discussion tab? #2350
  • Add support for AVG(Timestamp) types #200
  • TPC-H Query 22 #175
  • TPC-H Query 21 #172
  • TPC-H Query 20 #171
  • TPC-H Query 17 #168
  • TPC-H Query 11 #163
  • TPC-H Query 4 #160
  • TPC-H Query 2 #159
  • [Datafusion] Optimize literal expression evaluation #106

Merged pull requests:

10.0.0-rc1 (2022-07-12)

Full Changelog

10.0.0 (2022-07-12)

Full Changelog

Breaking changes:

Implemented enhancements:

  • update documentation, fix styling to match main Arrow project #2864
  • Update top-level README #2850
  • [Question] How to call an async function in ExecutionPlan::exec method? #2847
  • Add DataFrame::with_column #2844
  • Improve ergonomics of physical expr lit #2827
  • Add Python examples for reading CSV and query by SQL in Doc #2824
  • eliminate multi limit-offset nodes to EmptyRelation if possible #2822
  • Make LogicalPlan::Union be consistent with other plans #2816
  • Use coerced data type from value and list expressions during planning inlist expression #2793
  • Add configuration option to enable/disable CoalesceBatchesExec #2790
  • Simplify FilterNullJoinKeys rule #2780
  • Allow configuration settings to be specified with environment variables #2776
  • Automatically update configs.md in user guide #2770
  • Support multiple paths for ListingTableScanNode #2768
  • Reduce outer joins #2757
  • support data type coerced and decimal in INLIST expr #2755
  • Change ExtensionPlanner::plan_extension() to an async function #2749
  • Add IsNotNull filter to join inputs if one side of join condition does not allow null #2739
  • Sort preserving MergeJoin #2698
  • Improve readability of table scan projections in query plans #2697
  • DataFusion 9.0.0 Release #2676
  • Improve UX for UNION vs UNION ALL (introduce a LogicalPlan::Distinct) #2573 [sql]
  • Implement some way to show the sql used to create a view #2529
  • Consider adopting IOx ObjectStore abstraction #2489
  • Support sum0 as a built-in agg function #2067
  • implement grouping sets, cubes, and rollups #1327
  • Ruby bindings #1114
  • Support dates in hash join #2746 (andygrove)

Fixed bugs:

  • Docker Error #2851
  • Anti join ignores join filters #2842
  • Can't test or compile sub-model code after upgrade to arrow-rs 17.0.0 #2835
  • Not evaluate the set expr in the InList for the optimization #2820
  • CASE When: result type should be coercible to a common type #2818
  • IN/NOT IN List: NULL is not equal to NULL #2817
  • panic when case statement returns null #2798
  • InList: Can't cast the list expr data type to value expr data type directly #2774
  • InList Expr: expr and list values must be convertible to the same data type #2759
  • tpchgen docker syntax change prevents volume from binding #2751
  • Cannot join on date columns (Unsupported data type in hasher: Date32) #2744
  • rewrite_expression does not properly handle Exists and ScalarSubquery #2736
  • LocalFileSystem is not sorted by file name; as a result, the data lines queried across multiple files are out of order. #2730
  • Filter push down needs to consider alias columns #2725
  • Recent API change in GlobalLimitExec breaks compatibility with Ballista #2720
  • Common Subexpression Elimination pass errors if run twice on some plans: Schema contains duplicate unqualified field name 'IsNull-Column-sys.host' #2712
  • The data type is not compatible with other systems, for example Spark or the PG database #1379

Documentation updates:

Closed issues:

  • Consider adding a prominent note in the readme about ballista #2853
  • support decimal in (NULL) #2800
  • InList: Don't treat Null as UTF8(None) #2782
  • InList: don't need to treat Null as UTF8 data type #2773
  • Implement extensible configuration mechanism #138

Merged pull requests:

9.0.0 (2022-06-10)

Full Changelog

Breaking changes:

  • MINOR: Move simplify_expression rule to datafusion-optimizer crate #2686 (andygrove)
  • Move physical expression planning to datafusion-physical-expr crate #2682 (andygrove)
  • Create new datafusion-optimizer crate for logical optimizer rules #2675 (andygrove)
  • Remove ExecutionProps dependency from OptimizerRule #2666 (andygrove)
  • Remove ObjectStoreSchemaProvider (#2656) #2665 (tustvold)
  • Move LogicalPlanBuilder to datafusion-expr crate #2576 (andygrove)
  • LogicalPlanBuilder now uses TableSource instead of TableProvider #2569 (andygrove)
  • Remove scan_empty method from LogicalPlanBuilder #2568 (andygrove)
  • MINOR: Move expression utils from sql module to expr crate #2553 (andygrove)
  • Remove scan_json methods from LogicalPlanBuilder #2541 (andygrove)
  • Remove scan_avro methods from LogicalPlanBuilder #2540 (andygrove)
  • Remove scan_parquet methods from LogicalPlanBuilder #2539 (andygrove)
  • MINOR: Move ExprVisitable and exprlist_to_columns to datafusion-expr crate #2538 (andygrove)
  • Remove scan_csv methods from LogicalPlanBuilder #2537 (andygrove)
  • Fix Redundant ScalarValue Boxed Collection #2523 (comphead)
  • Support for OFFSET in LogicalPlan #2521 (jdye64)

Implemented enhancements:

  • [EPIC] JIT support for DataFusion #2703
  • Show column names instead of column indices in query plans #2689
  • Proposal: remove automated ballista CI checks from DataFusion #2679
  • Pass SessionState to TableProvider #2658
  • Is ObjectStoreSchemaProvider Still Needed? #2656
  • Add logical plan support to datafusion-proto #2630
  • Like, NotLike expressions work with literal NULL #2626
  • Move JOIN ON predicates push down logic from planner to optimizer #2619
  • Remove ExecutionProps from OptimizerRule trait #2614
  • Add, Minus, Multiply, divide, Modulo operator work with literal NULL #2609
  • Support DESCRIBE <table> to show table schemas #2606
  • Support CREATE OR REPLACE TABLE #2605
  • filter_push_down tests should not rely on TableProvider and ExecutionPlan #2600
  • Move logical optimizer rules out of the core datafusion crate #2599
  • Push Limit through outer Join #2579
  • datafusion_proto crate should have exhaustive match statements for handling Expr #2565
  • String representation of Expr variant #2563
  • File URI Scheme Interpretation #2562
  • Implement physical plan for OFFSET #2551
  • Update limit pushdown rule to support offsets #2550
  • Move LogicalPlanBuilder to datafusion-expr crate #2536
  • Logical optimizer rule "simplify expressions" should not depend on the core datafusion crate #2535
  • Support optional filter in Join #2509
  • Improve SQL planner & logical plan support for JOIN conditions #2496
  • Numeric, String, Boolean comparisons with literal NULL #2482
  • Redundant ScalarValue Boxed Collection #2449
  • ObjectStore Directory Semantics #2445
  • Add support for OFFSET in SQL query planner + logical plan #2377
  • SQL planner should use TableSource not TableProvider #2346
  • Move SQL query planning to new crate #2345
  • Update LogicalPlan rustdoc code to not use LogicalPlanBuilder #2308
  • [Optimizer] Refactor convert join #2256
  • [Optimizer] Infer is not null predicate from where clause #2254
  • Support ArrayIndex for ScalarValue(List) #2207
  • [Ballista] Fill functional gaps between datafusion and ballista #2062
  • [Ballista] support datafusion built_in UDAF work in ballista cluster #1985
  • Export C API #1113

Fixed bugs:

  • Fix Typos in Docs #2695
  • Unable to build a docker image #2691
  • Optimization pass AggregateStatistics changes type of output from Int64 to UInt64 #2673
  • ViewTable Circular Reference #2657
  • ScalarValue::to_array_of_size panics computing statistics for nested parquet file #2653
  • The result type of count/count_distinct #2635
  • limit_push_down is not working properly with OFFSET #2624
  • Avro Tests Fail To Compile #2570
  • Unused window function expression is wrongly removed from LogicalPlan during optimization #2542
  • Bug: ObjectStoreRegistry get_by_uri does not return correct path when "scheme" is provided #2525
  • There are duplicate and inconsistent copies of datafusion.proto #2514
  • Projection pushdown produces incorrect results when column names are reused #2462
  • Incorrect Parquet Projection For Nested Types #2453
  • LogicalPlanBuilder::scan_csv creates scans with invalid table names #2278
  • Inner join incorrectly pushdown predicate with OR operation #2271
  • Ignored alias for columns with aggregate function and incorrect results when collecting statistics is enabled #2176
  • Join on path partitioned columns fails with error #2145

Documentation updates:

Closed issues:

  • [Question] Converting TableSource to custom TableProvider #2644
  • [Question] Why DataFusion is shipped with arrow version 9.1.0 on crates.io ? #2474

Merged pull requests:

  • Test optional features in CI #2708 (tustvold)
  • support indexed fields proto #2707 (nl5887)
  • Update sqlparser-rs to 0.18.0 #2705 (alamb)
  • [MINOR]: Add documentation to datafusion-row modules #2704 (alamb)
  • Make sure that the data types are supported in hashjoin before genera… #2702 (AssHero)
  • Move remaining code out of legacy core/logical_plan module #2701 (andygrove)
  • Move some tests from core to expr #2700 (andygrove)
  • MINOR: Improve Docs Readability #2696 (ryanrussell)
  • Combine limit and offset to fetch and skip and implement physical plan support #2694 (ming535)
  • MINOR: Add datafusion-sql example #2693 (andygrove)
  • Remove Ballista related lines from Dockerfile #2692 (mocknen)
  • Show column names instead of indices in query plans #2690 (andygrove)
  • MINOR: Remove uses of TryClone for Parquet #2681 (tustvold)
  • Fix AggregateStatistics optimization so it doesn't change output type #2674 (alamb)
  • If column Max/Min statistics do not exist in the parquet file, set Min/Max to None #2671 (AssHero)
  • MINOR: Move more expression code to datafusion-expr crate #2669 (andygrove)
  • MINOR: Rewrite imports in optimizer module #2667 (andygrove)
  • Update snmalloc-rs requirement from 0.2 to 0.3 #2663 (dependabot[bot])
  • Add module doc for RuntimeEnv, SessionContext, TaskContext, etc... #2655 (tustvold)
  • Prune unused dependencies from datafusion-proto #2651 (tustvold)
  • MINOR: Implement serde for join filter #2649 (andygrove)
  • pushdown support for predicates in ON clause of joins #2647 (korowa)
  • Move SortKeyCursor and RowIndex into modules, add sort_key_cursor test #2645 (alamb)
  • Implement DESCRIBE <table> #2642 (LiuYuHui)
  • Implement LogicalPlan serde in datafusion-proto #2639 (andygrove)
  • Fix limit + offset pushdown #2638 (ming535)
  • change result type of count/count_distinct from uint64 to int64 #2636 (liukun4515)
  • If no columns in window expr are needed, remove the window exprs #2634 (AssHero)
  • Like, NotLike expressions work with literal NULL #2627 (WinkerDu)
  • MINOR: Refactor datafusion-proto dependencies and imports #2623 (andygrove)
  • MINOR: add optimizer struct #2616 (jackwener)
  • Remove FilterPushDown dependency on physical plan #2615 (andygrove)
  • Support CREATE OR REPLACE TABLE #2613 (AssHero)
  • Support binary mathematical operators work with NULL literals #2610 (WinkerDu)
  • chore: try fix CI coverage #2608 (Ted-Jiang)
  • MINOR: Rename benchmark crate #2607 (andygrove)
  • chore(dep): bump cranelift to 0.84.0 #2598 (waynexia)
  • fix some typos #2597 (ming535)
  • Support limit pushdown through left right outer join #2596 (Ted-Jiang)
  • Unignore rustdoc code examples in datafusion-expr crate #2590 (andygrove)
  • Evaluate JIT'd expression over arrays #2587 (waynexia)
  • [minor]Fix ci clippy for unused import #2586 (Ted-Jiang)
  • [Doc] Add doc noting that enabling SIMD needs cargo nightly #2577 (Ted-Jiang)
  • Add DataFrame union_distinct and fix documentation for distinct #2574 (andygrove)
  • Fix avro tests (#2570) #2571 (tustvold)
  • Make datafusion-proto match exhaustive #2567 (andygrove)
  • Support limit push down for offset_plan #2566 (Ted-Jiang)
  • Introduce Expr.variant_name() function #2564 (jdye64)
  • Fix some 404 links in the contribution guide #2561 (hi-rustin)
  • Update datafusion-cli readme cli version #2559 (hi-rustin)
  • MINOR: Move expr_rewriter.rs to datafusion-expr crate #2552 (andygrove)
  • Fix JOINs with complex predicates in ON (split ON expressions only by AND operator) #2534 (korowa)
  • Reduce duplication in file scan tests #2533 (tustvold)
  • Fix size_of_scalar test #2531 (alamb)
  • Update to arrow-rs 14.0.0 #2528 (alamb)
  • ObjectStoreRegistry get_by_uri now returns correct path when "scheme" is provided #2526 (timvw)
  • MINOR: Add ORDER BY clause to test #2524 (andygrove)
  • Remove unused binary_array_op_scalar! in binary.rs #2512 (alamb)
  • fix NULL <op> column evaluation, tests for same #2510 (alamb)
  • Fix projection pushdown produces incorrect results when column names are reused #2463 (jonmmease)
  • Benchmark for sort preserving merge #2431 (alamb)
  • Support GetIndexedFieldExpr for ScalarValue #2196 (ovr)

8.0.0 (2022-05-12)

Full Changelog

Breaking changes:

  • Add SQL planner support for ROLLUP and CUBE grouping set expressions #2446 (andygrove)
  • Make ExecutionPlan::execute Sync #2434 (tustvold)
  • Introduce new DataFusionError::SchemaError type #2371 (andygrove)
  • Add Expr::InSubquery and Expr::ScalarSubquery #2342 (andygrove)
  • Add Expr::Exists to represent EXISTS subquery expression #2339 (andygrove)
  • Move LogicalPlan enum to datafusion-expr crate #2294 (andygrove)
  • Remove dependency from LogicalPlan::TableScan to ExecutionPlan #2284 (andygrove)
  • Move logical expression type-coercion code from physical-expr crate to expr crate #2257 (andygrove)
  • feat: 2061 create external table ddl table partition cols #2099 [sql] (jychen7)
  • Reorganize the project folders #2081 (yahoNanJing)
  • Support more ScalarFunction in Ballista #2008 (Ted-Jiang)
  • Merge dataframe and dataframe imp #1998 (vchag)
  • Rename ExecutionContext to SessionContext, ExecutionContextState to SessionState, add TaskContext to support multi-tenancy configurations - Part 1 #1987 (mingmwang)
  • Add Coalesce function #1969 (msathis)
  • Add Create Schema functionality in SQL #1959 [sql] (matthewmturner)
  • Omit some clones when converting SQL to logical plan #1945 [sql] (doki23)
  • [split/16] move physical plan expressions folder to datafusion-physical-expr crate #1889 (Jimexist)
  • remove sync constraint of SendableRecordBatchStream #1884 (doki23)
  • [split/15] move built in window expr and partition evaluator #1865 (Jimexist)

Implemented enhancements:

  • Include Expr to datafusion::prelude #2347
  • Implement Serialization API for DataFusion #2340
  • Implement power function #1493
  • allow lit python function to support boolean and other types #1136
  • Automate dependency updates #37
  • Add CREATE VIEW #2279 (matthewmturner)
  • [Ballista] Support Union in ballista. #2098 (Ted-Jiang)
  • Change the DataFusion explain plans to make it clearer in the predicate/filter #2063 (Ted-Jiang)
  • Add write_json, read_json, register_json, and JsonFormat to CREATE EXTERNAL TABLE functionality #2023 (matthewmturner)
  • Qualified wildcard #2012 [sql] (doki23)
  • support bitwise or/'|' operation #1876 [sql] (liukun4515)
  • Introduce JIT code generation #1849 (yjshen)

Fixed bugs:

  • CASE expr with NULL literals panics 'WHEN expression did not return a BooleanArray' #1189
  • Function calls with NULL literals do not work #1188
  • Add SQL planner support for calling round function with two arguments #2503 (andygrove)
  • nested query fix #2402 (comphead)
  • fix issue#2058 file_format/json.rs attempt to subtract with overflow #2066 (silence-coding)
  • Fix bug in the optimizer rule filter push down #2039 (jackwener)
  • fix: replace ExecutionContext and ExecutionConfig with SessionContext and SessionConfig #2030 (xudong963)
  • Fixed parquet path partitioning when only selecting partitioned columns #2000 (pjmore)
  • Fix ambiguous reference error in filter plan #1925 (jonmmease)
  • platform aware partition parsing #1867 (korowa)
  • Fix incorrect aggregation in case that GROUP BY contains duplicate column names #1855 (alex-natzka)

Documentation updates:

Performance improvements:

Closed issues:

  • Make expected result string in unit tests more readable #2412
  • remove duplicated fn aggregate() in aggregate expression tests #2399
  • split distinct_expression.rs into count_distinct.rs and array_agg_distinct.rs #2385
  • move sql tests in context.rs to corresponding test files in datafusion/core/tests/sql #2328
  • Date32/Date64 as join keys for merge join #2314
  • Error precision and scale for decimal coercion in logic comparison #2232
  • Support Multiple row layout #2188
  • TPC-H Query 18 #169
  • TPC-H Query 16 #167
  • Implement Sort-Merge Join #141
  • Split logical expressions out into separate source files #114

Merged pull requests:

7.1.0 (2022-04-10)

Full Changelog

Fixed bugs:

  • By default, use only 1000 rows to infer the schema #2159

7.0.0 (2022-02-14)

Full Changelog

Breaking changes:

  • Consolidate various configurations options, remove unrelated batch_size #1565
  • Extract logical plans in LogicalPlan as independent struct #1228
  • Update ExecutionPlan to know about sortedness and repartitioning optimizer pass respect the invariants #1776 (alamb)
  • Update to arrow 8.0.0 #1673 (alamb)
  • Remove non idiomatic DataFusionError::into_arrow_external_error in favor of From conversion #1645 (alamb)
  • Remove Accumulator::update and Accumulator::merge #1582 (Jimexist)
  • implement Hash for various types and replace PartialOrd #1580 (Jimexist)
  • Replace DatafusionError with GenericError in ObjectStore interface #1541 (matthewmturner)
  • Make FLOAT SQL type map to Float32 rather than Float64 #1423 [sql] (liukun4515)
  • Map REAL SQL type to Float32 rather than Float64 to be consistent with pg #1390 [sql] (hntd187)

Implemented enhancements:

  • Create new datafusion_expr crate #1753
  • Create new datafusion_common crate #1752
  • API to get Expr's type and nullability without a DFSchema #1725
  • Cleaner API to create Expr::ScalarFunction programmatically #1718
  • Introduce a Vec<u8> based row-wise representation for DataFusion #1708
  • Simplify creating new ListingTable #1705
  • Implement TableProvider for DataFrameImpl to allow registration of logical plans #1698
  • Public Expr simplification API #1694
  • Query Optimizer: Add OUTER --> INNER join conversion #1670
  • Support reading from CSV, Avro and Json files that have mergeable/compatible, but not identical schemas #1669
  • Remove DataFusionError::into_arrow_external_error in favor of From conversion #1644
  • Include join type in display implementation for logical plan #1620
  • Switch datafusion to using eq_dyn_scalar, etc kernels #1610
  • Proposal: Remove Accumulator::update and Accumulator::merge #1549
  • Replace DataFusionError/Result with impl Error for ObjectStore and Reader #1540
  • Add approx_quantile support #1538
  • support sorting decimal data type #1522
  • Keep all datafusion's packages up to date with Dependabot #1472
  • ExecutionContext support init ExecutionContextState with new(state: Arc<Mutex<ExecutionContextState>>) method #1439
  • support the decimal scalar value #1393
  • Documentation for using scalar functions with the DataFrame API #1364
  • Support boolean == boolean and boolean != boolean operators #1159
  • Support DataType::Decimal(15, 2) in TPC-H benchmark #174
  • Make MemoryStream public #150
  • Add support for Parquet schema merging #132
  • Add SQL support for IN expression #118
  • Add logging to datafusion-cli #1789 (alamb)
  • Add approx_median() aggregate function #1729 (realno)
  • Add join type for logical plan display #1674 [sql] (xudong963)
  • Fix null comparison for Parquet pruning predicate #1595 (viirya)
  • Add corr aggregate function #1561 (realno)
  • Add covar, covar_pop and covar_samp aggregate functions #1551 (realno)
  • Add approx_quantile() aggregation function #1539 (domodwyer)
  • Initial MemoryManager and DiskManager APIs for query execution + External Sort implementation #1526 (yjshen)
  • Add stddev and variance #1525 (realno)
  • Add rem operation for Expr #1467 (liukun4515)
  • support decimal data type in create table #1431 [sql] (liukun4515)
  • Ordering by index in select expression #1419 [sql] (hntd187)
  • Add support for ORDER BY on unprojected columns #1415 (viirya)
  • Support decimal for min and max aggregate #1407 (liukun4515)
  • Consolidate ConstantFolding and SimplifyExpression #1375 (alamb)
  • Datafusion cli quiet mode command to contain option bool #1345 (Jimexist)
  • Implement array_agg aggregate function #1300 (viirya)
  • Add a command to switch output format in cli #1284 (capkurmagati)
  • Support =, <, <=, >, >=, !=, is distinct from, is not distinct from for BooleanArray #1163 (alamb)

Fixed bugs:

  • Unsupported data type in hasher: Timestamp(Second, None) #1768
  • SQL column identifiers should be converted to lowercase when unquoted #1746
  • Data type Dictionary(Int32, Utf8) not supported for binary operation 'eq' on dyn arrays #1605
  • datafusion doesn't process predicate pushdown correctly when there is outer join #1586
  • casting Int64 to Float64 unsuccessfully caused tpch8 to fail #1576
  • CTE/WITH .. UNION ALL confuses name resolution in WHERE #1509
  • ORDER BY min(x) results in error Plan("No field named 'foo.x'. Valid fields are 'MIN(foo.x)'.") #1479
  • Sort discards field metadata on the output schema #1476
  • Datafusion should not strip out timezone information from existing types #1454
  • Error on some queries: "column types must match schema types, expected XXX but found YYY" #1447
  • Query failing to return any results when filter is an equality check on strings (bad statistics in parquet) #1433
  • Field names containing period such as f.c1 cannot be named in SQL query #1432
  • Select * returns an unexpected result #1412
  • Turn off unused default features of chrono and ahash #1398
  • REAL data type is float32 in the PG database, but in DataFusion it is float64 #1380
  • TPC-H q10 performance regression (expression for filter with added alias is not pushed down) #1367
  • ProjectionExec Loses Field Metadata #1361
  • Support Filter on unprojected columns #1351
  • NULLS ORDER is inconsistent with postgres #1343
  • Fix bug while merging RecordBatch, add SortPreservingMerge fuzz tester #1678 (alamb)
  • Fix a CTE block with the same name used many times #1639 [sql] (xudong963)
  • fix: casting Int64 to Float64 unsuccessfully caused tpch8 to fail #1601 (xudong963)
  • Fix single_distinct_to_groupby for arbitrary expressions #1519 (james727)
  • Fix SortExec discards field metadata on the output schema #1477 (alamb)
  • Fix calculation in many_to_many_hash_partition test. #1463 (Ted-Jiang)
  • Add Timezone to Scalar::Time* types, and better timezone awareness to Datafusion's time types #1455 (maxburke)
  • Support identifiers with . in them #1449 [sql] (alamb)
  • Fixes for working with functions in dataframes, additional documentation #1430 (tobyhede)
  • [Minor] Fix send_time metric for hash-repartition #1421 (Dandandan)
  • fix: Select * returns an unexpected result #1413 [sql] (xudong963)
  • Make cli handle multiple whitespaces #1388 (capkurmagati)
  • Metadata is kept in projections for non-derived columns #1378 (hntd187)
  • Fix Predicate Pushdown: split_members should be able to split aliased predicate #1368 (viirya)
  • Change the arg names and make parameters more meaningful #1357 (liukun4515)
  • collect table stats by default for listing table #1347 (houqp)
  • fix: make nulls-order consistent with postgres #1344 [sql] (xudong963)
  • Avoid changing expression names during constant folding #1319 (viirya)
  • improve error message for invalid create table statement #1294 [sql] (houqp)
  • Forbid creating the table with the same name #1288 (liukun4515)

Documentation updates:

Performance improvements:

  • Parquet pruning predicate for IS NULL #1591
  • Fix predicate pushdown for outer joins #1618 (james727)
  • fix: sql planner creates cross join instead of inner join from select predicates #1566 [sql] (xudong963)
  • Split fetch_metadata into fetch_statistics and fetch_schema #1365 (Dandandan)
  • Optimize the performance queries with a single distinct aggregate #1315 (ic4y)
  • Left join could use bitmap for left join instead of Vec<bool> #1291 (boazberman)

Closed issues:

  • Add release compile to CI #1728
  • DiskManager and TempFiles getting created several times per query #1690
  • Add a test for the pyarrow feature in CI #1635
  • SQL tests for when sorting exceeded available memory and had to spill to disk #1573
  • Consolidate the N-way merging code and SortPreservingMergeStream (which has quite good tests of what is often quite tricky code, and it will be performance critical) #1572
  • Consolidate the SortExec code (so there is only a single sort operator that does in memory sorting if it has enough memory budget but then spills to disk if needed). #1571
  • Track memory usage in Non Limited Operators #1569
  • [Question] Why does ballista store tables in the client instead of in the SchedulerServer #1473
  • Consolidate Projection for Schema and RecordBatch #1425
  • Support Sort on unprojected columns #1372
  • Unused code in hash_aggregate #1362
  • Why use the expr types before coercion to get the result type? #1358
  • A problem about the projection_push_down optimizer gathers valid columns #1312
  • apply constant folding to LogicalPlan::Values #1170
  • reduce usage of IntoIterator<Item = Expr> in logical plan builder window fn #372
  • Why does DataFusion throw a Tokio 0.2 runtime error? #176
  • TPC-H Query 14 #165
  • Length kernel returns bytes not character length #156
  • Split the logical operators out into separate source files #115

Merged pull requests:

  • Fixup some doc warnings #1811 (alamb)
  • Ensure most of links in docs are correct #1808 [sql] (HaoYang670)
  • Update CHANGELOG.md, update release scripts #1807 (alamb)
  • Update versions for split crates #1803 (matthewmturner)
  • Improve the error message and UX of tpch benchmark program #1800 (alamb)
  • rename references of expr in logical plan module after datafusion-expr split #1797 (Jimexist)
  • Update to sqlparser 0.14 #1796 [sql] (alamb)
  • [split/13] move rest of expr to expr_fn in datafusion-expr module #1794 (Jimexist)
  • Update datafusion versions #1793 (matthewmturner)
  • Less verbose plans in debug logging #1787 (alamb)
  • [split/11] split expr type and null info to be expr-schemable #1784 (Jimexist)
  • Introduce Row format backed by raw bytes #1782 (yjshen)
  • rewrite predicates before pushing to union inputs #1781 (korowa)
  • Update datafusion to use arrow 9.0.0 #1775 (alamb)
  • [split/10] split up expr for rewriting, visiting, and simplification traits #1774 [sql] (Jimexist)
  • #1768 Support TimeUnit::Second in hasher #1769 (jychen7)
  • TPC-H benchmark can optionally write JSON output file with benchmark summary #1766 (andygrove)
  • [split/8] move Accumulator and ColumnarValue to datafusion-expr #1765 (Jimexist)
  • [split/7] move built-in scalar function to datafusion-expr #1764 (Jimexist)
  • [split/6] move signature, type signature, volatility to datafusion-expr #1763 (Jimexist)
  • [split/9+12] move udf, udaf, Expr to datafusion-expr module #1762 [sql] (Jimexist)
  • [split/5] move window frame and operator to datafusion-expr module #1761 (Jimexist)
  • [split/4] move scalar value to datafusion-common #1760 (Jimexist)
  • [split/3] split datafusion expr module and move aggregate and window function expr #1759 (Jimexist)
  • [split/2] move column and dfschema to datafusion-common module #1758 (Jimexist)
  • Use ordered-float 2.10 #1756 (andygrove)
  • [split/1] split datafusion-common module #1751 (Jimexist)
  • use clap 3 style args parsing for datafusion cli #1749 (Jimexist)
  • fix: Case insensitive unquoted identifiers in SQL #1747 [sql] (mkmik)
  • Move more tests out of context.rs #1743 (alamb)
  • Move optimize test out of context.rs #1742 (alamb)
  • Fix typos in crate documentation #1739 (r4ntix)
  • add cargo check --release to ci #1737 (xudong963)
  • Update parking_lot requirement from 0.11 to 0.12 #1735 (dependabot[bot])
  • Create built-in scalar functions programmatically #1734 (HaoYang670)
  • Prevent repartitioning of certain operator's direct children (#1731) #1732 (tustvold)
  • API to get Expr's type and nullability without a DFSchema #1726 (alamb)
  • minor: fix cargo run --release error #1723 (xudong963)
  • substitute parking_lot::Mutex for std::sync::Mutex #1720 (xudong963)
  • Convert boolean case expressions to boolean logic #1719 (tustvold)
  • Add Expression Simplification API #1717 (alamb)
  • Create ListingTableConfig which includes file format and schema inference #1715 (matthewmturner)
  • make select_to_plan clearer #1714 [sql] (xudong963)
  • Add upper bound for public function signature #1713 (HaoYang670)
  • Add tests and CI for optional pyarrow module #1711 (wjones127)
  • Create SchemaAdapter trait to map table schema to file schemas #1709 (thinkharderdev)
  • refine test in repartition.rs & coalesce_batches.rs #1707 (xudong963)
  • Fuzz test for spillable sort #1706 (yjshen)
  • Support create_physical_expr and ExecutionContextState or DefaultPhysicalPlanner for faster speed #1700 (alamb)
  • Implement TableProvider for DataFrameImpl #1699 (cpcloud)
  • Move timestamp related tests out of context.rs and into sql integration test #1696 (alamb)
  • Lazy TempDir creation in DiskManager #1695 (alamb)
  • Add MemTrackingMetrics to ease memory tracking for non-limited memory consumers #1691 (yjshen)
  • (minor) Reduce memory manager and disk manager logs from info! to debug! #1689 (alamb)
  • Make SortPreservingMergeStream stable on input stream order #1687 (alamb)
  • Incorporate dyn scalar kernels #1685 (matthewmturner)
  • Move information_schema tests out of execution/context.rs to sql_integration tests #1684 (alamb)
  • Add a new metric type: Gauge + CurrentMemoryUsage to metrics #1682 (yjshen)
  • refactor array_agg to not have update and merge #1681 (Jimexist)
  • Use NamedTempFile rather than String in DiskManager #1680 (alamb)
  • upgrade clap to version 3 #1672 (Jimexist)
  • Improve configuration and resource use of MemoryManager and DiskManager #1668 (alamb)
  • feat: Support quarter granularity in date_trunc function #1667 (ovr)
  • Fix can not load parquet table from spark in datafusion-cli. #1665 (Ted-Jiang)
  • Make MemoryManager and MemoryStream public #1664 (yjshen)
  • [Cleanup] Move AggregatedMetricsSet to metrics for further reuse #1663 (yjshen)
  • fix: substr - correct behaviour with negative start pos #1660 (ovr)
  • support bitwise and as an example #1653 [sql] (liukun4515)
  • refine match pattern related code #1650 (xudong963)
  • update md-5, sha2, blake2 #1647 (xudong963)
  • Add DataFusionError -> ArrowError conversion #1643 (alamb)
  • Add spill_count and spilled_bytes to BaselineMetrics, test sort with spill #1641 (yjshen)
  • support hash decimal array and group by #1640 (liukun4515)
  • Consolidate Schema and RecordBatch projection #1638 (alamb)
  • Update hashbrown requirement from 0.11 to 0.12 #1631 (dependabot[bot])
  • Update pyo3 requirement from 0.14 to 0.15 #1627 (dependabot[bot])
  • Optimize SortPreservingMergeStream to avoid SortKeyCursor sharing #1624 (yjshen)
  • Handle merging of evolved schemas in ParquetExec #1622 (thinkharderdev)
  • feat: Support Substring(str [from int] [for int]) #1621 [sql] (ovr)
  • feat: Support complex interval via IntervalMonthDayNano #1615 [sql] (ovr)
  • consolidate binary_expr coercion rule code into binary_rule.rs module #1607 (alamb)
  • Fix comparison of dictionary arrays #1606 (alamb)
  • add test for decimal to decimal #1603 (liukun4515)
  • update nightly version #1597 (Jimexist)
  • Consolidate sort and external_sort #1596 (yjshen)
  • support from_slice for binary, string, and boolean array types #1589 (Jimexist)
  • add from_slice trait to ease arrow2 migration #1588 (Jimexist)
  • Implement ARRAY_AGG(DISTINCT ...) #1579 (james727)
  • Rename sql integration tests from mod to sql_integration #1575 (alamb)
  • minor: improve the benchmark readme #1567 (xudong963)
  • Consolidate batch_size configuration in ExecutionConfig, RuntimeConfig and PhysicalPlanConfig #1562 (yjshen)
  • Update to rust 1.58 #1557 (xudong963)
  • support mathematics operation for decimal data type #1554 (liukun4515)
  • Address clippy warnings #1553 (sergey-melnychuk)
  • enhance arithmetic operation for array with scalar #1552 (liukun4515)
  • Remove unused update and merge implementations from Aggregates and supporting ScalarValue arithmetic #1550 (alamb)
  • Add batch operations to stddev #1547 (realno)
  • Mark ARRAY_AGG(DISTINCT ...) not implemented #1534 (james727)
  • Update to arrow-7.0.0 #1523 (alamb)
  • Fix ORDER BY on aggregate #1506 (viirya)
  • Add example on how to query multiple parquet files #1497 (nitisht)
  • Refactor testing modules #1491 (hntd187)
  • add rfcs for datafusion #1490 (xudong963)
  • support comparison for decimal data type and refactor the binary coercion rule #1483 (liukun4515)
  • Minor: Rename predicate_builder --> pruning_predicate for consistency #1481 (alamb)
  • Tests for support try_cast/cast decimal to numeric #1465 (liukun4515)
  • Avoid send empty batches for Hash partitioning. #1459 (Ted-Jiang)
  • Planner code cleanup #1450 [sql] (alamb)
  • Fix bug in projection: "column types must match schema types, expected XXX but found YYY" #1448 (alamb)
  • Update arrow-rs to 6.4.0 and replace boolean comparison in datafusion with arrow compute kernel #1446 (xudong963)
  • support cast/try_cast for decimal: signed numeric to decimal #1442 (liukun4515)
  • Consolidate decimal error checking and improve error messages #1438 [sql] (alamb)
  • use 0.13 sql parser #1435 (Jimexist)
  • Minor Code cleanups #1428 (alamb)
  • Clarify communication on bi-weekly sync #1427 (alamb)
  • support sum/avg agg for decimal, change sum(float32) --> float64 #1408 [sql] (liukun4515)
  • Fix bugs with nullability during rewrites: Combine simplify and Simplifier #1401 (alamb)
  • Minimize features #1399 (carols10cents)
  • Update rust version to 1.57 #1395 [sql] (xudong963)
  • support decimal scalar value #1394 (liukun4515)
  • Add coercion rules for AggregateFunctions #1387 (liukun4515)
  • upgrade the arrow-rs version #1385 (liukun4515)
  • add array agg name #1382 (liukun4515)
  • Make tests for simplify and Simplifier consistent #1376 (alamb)
  • Refactor: Consolidate expression simplification code in simplify_expression.rs #1374 (alamb)
  • remove unused code in hash_aggregate #1370 (ic4y)
  • Use BufReader for LocalFileReader to revert performance regression in parquet reading #1366 (Dandandan)
  • Add unit test for constant folding on values #1355 (viirya)
  • Extract logical plan: rename the plan name (follow up) #1354 [sql] (liukun4515)
  • Moved aggr_test_schema to test_utils #1338 (rdettai)
  • upgrade arrow-rs to 6.2.0 #1334 (liukun4515)
  • Update release instructions #1331 (alamb)
  • #1268: allow datafusion-cli to toggle quiet flag within CLI #1330 (jgoday)
  • Extract Aggregate, Sort, and Join to struct from AggregatePlan #1326 (matthewmturner)
  • Extract EmptyRelation, Limit, Values from LogicalPlan #1325 (liukun4515)
  • Extract CrossJoin, Repartition, Union in LogicalPlan #1322 (liukun4515)
  • Fifth batch of updating sql tests to use assert_batches_eq #1318 (matthewmturner)
  • Extract Explain, Analyze, Extension in LogicalPlan as independent struct #1317 [sql] (xudong963)
  • Extract CreateMemoryTable, DropTable, CreateExternalTable in LogicalPlan as independent struct #1311 [sql] (liukun4515)
  • Extract Projection, Filter, Window in LogicalPlan as independent struct #1309 (ic4y)
  • Add PSQL comparison tests for except, intersect #1292 (mrob95)
  • Extract logical plans in LogicalPlan as independent struct: TableScan #1290 (xudong963)
  • Add statement helper command to cli #1285 (matthewmturner)
  • Python bindings for window functions #819 [sql] (jgoday)

6.0.0 (2021-11-13)

Full Changelog

Breaking changes:

  • Removed deprecated with_concurrency #1200 (rdettai)
  • File partitioning for ListingTable #1141 (rdettai)
  • Add function volatility to Signature #1071 [sql] (pjmore)
  • fix: allow duplicate field names in table join, fix output with duplicated names #1023 (houqp)
  • Make TableProvider.scan() and PhysicalPlanner::create_physical_plan() async #1013 (rdettai)
  • Reorganize table providers by table format #1010 (rdettai)
  • Make Metrics::labels() public #999 (alamb)
  • Rename NthValue::{first_value,last_value,nth_value} to satisfy clippy in Rust 1.55 #986 (alamb)
  • Move CBOs and Statistics to physical plan #965 (rdettai)
  • Update to sqlparser v 0.10.0 #934 [sql] (alamb)
  • FilePartition and PartitionedFile for scanning flexibility #932 [sql] (yjshen)
  • Improve SQLMetric APIs, port existing metrics #908 (alamb)
  • Add support for EXPLAIN ANALYZE #858 [sql] (alamb)
  • Rename concurrency to target_partitions #706 (andygrove)

Implemented enhancements:

  • Add booleans support to the CASE statement #1156
  • Implement General Purpose Constant Folding with the Expression Evaluator #1070
  • Mark volatility categories of functions #1069
  • Add "show" support to DataFrame API #937
  • Add support for TRIM BOTH/LEADING/TRAILING #935
  • Add "baseline" metrics to all built in operators #866
  • Add SQL support for referencing fields in structs #119
  • add filename completer for create table statement #1278 (Jimexist)
  • Add drop table support #1266 [sql] (viirya)
  • Dataframe supports except and update readme #1261 (xudong963)
  • Implement EXCEPT & EXCEPT DISTINCT #1259 [sql] (xudong963)
  • Add DataFrame support for INTERSECT and update readme #1258 (xudong963)
  • use arrow 6.1.0 #1255 (Jimexist)
  • fix 1250, add editor support for datafusion cli with validation #1251 (Jimexist)
  • Add support for create table as via MemTable #1243 [sql] (Dandandan)
  • Add cli show columns command to describe tables #1231 (Jimexist)
  • datafusion-cli to add list table command #1229 (Jimexist)
  • datafusion cli to handle EoF and interrupt signal #1225 (Jimexist)
  • add \q as quit command and add ? for help #1224 (Jimexist)
  • Add algebraic simplifications to constant_folding #1208 (matthewmturner)
  • Improve GetIndexedFieldExpr adding utf8 key based access for struct v… #1204 [sql] (Igosuki)
  • Fix between in select query #1202 [sql] (capkurmagati)
  • Move code to fold Stable functions like now() from Simplifier to ConstEvaluator #1176 (alamb)
  • DataFrame supports window function #1167 [sql] (xudong963)
  • add values list expression #1165 [sql] (Jimexist)
  • Add booleans support to the CASE statement #1161 (xudong963)
  • Improve error messages when operations are not supported #1158 (alamb)
  • Generic constant expression evaluation #1153 (alamb)
  • python lit function to support bool and byte vec #1152 (Jimexist)
  • [nit] simplify datafusion optimizer module codes #1146 (panarch)
  • Add ScalarValue support for arbitrary list elements #1142 (jonmmease)
  • Multiple files per partitions for CSV Avro Json #1138 (rdettai)
  • Implement INTERSECT & INTERSECT DISTINCT #1135 [sql] (xudong963)
  • Simplify file struct abstractions #1120 (rdettai)
  • Implement is [not] distinct from #1117 [sql] (Dandandan)
  • Clean up spawned task on drop for RepartitionExec, SortPreservingMergeExec, WindowAggExec #1112 (crepererum)
  • add hyperloglog implementation (add and count) #1095 (Jimexist)
  • Add ScalarValue::Struct variant #1091 (jonmmease)
  • add digest(utf8, method) function and refactor all current hash digest functions #1090 (Jimexist)
  • [crypto] add blake3 algorithm to digest function #1086 (Jimexist)
  • [crypto] add blake2b and blake2s functions #1081 (Jimexist)
  • [nit] make schema qualifier error message in field lookup more readable #1079 (Jimexist)
  • [window function] add percent_rank window function #1077 (Jimexist)
  • [window function] add cume_dist implementation #1076 (Jimexist)
  • Add a LogicalPlanBuilder::schema() function #1075 (alamb)
  • Add support for UNION [DISTINCT] sql #1068 [sql] (xudong963)
  • fix: fix joins on Float32/Float64 columns bug #1054 (francis-du)
  • Update sqlparser-rs to 0.11 #1052 [sql] (alamb)
  • Support querying CSV files without providing the schema #1050 [sql] (xudong963)
  • remove hard coded partition count in ballista logicalplan deserialization #1044 (xudong963)
  • feat: add lit_timestamp_nanosecond #1030 (NGA-TRAN)
  • Ignore metadata on schema merge #1024 (Smurphy000)
  • add ExecutionConfig.with_optimizer_rules #1022 (seddonm1)
  • Add baseline execution stats to WindowAggExec and UnionExec, and fixup CoalescePartitionsExec #1018 (alamb)
  • Derive PartialOrd for Expr #1015 (alamb)
  • Indexed field access for List #1006 [sql] (Igosuki)
  • Add metrics for Limit and Projection, and CoalesceBatches #1004 (alamb)
  • Update DataFusion to arrow 6.0 #984 (alamb)
  • Implement Display for Expr, improve operator display #971 [sql] (matthewmturner)
  • Add metrics for FilterExec #960 (alamb)
  • Change compound column field name rules #952 (waynexia)
  • ObjectStore API to read from remote storage systems #950 (yjshen)
  • Add baseline metrics to SortPreservingMergeExec #948 (alamb)
  • Add support for TRIM LEADING/TRAILING/BOTH syntax #947 [sql] (adsharma)
  • fixes #933 replace placeholder fmt_as for ExecutionPlan impls #939 (tiphaineruy)
  • Add metrics for SortExec + HashAggregateExec #938 (alamb)
  • Add some additional asserts in utils::from_plan #930 (alamb)
  • Avro Table Provider #910 [sql] (Igosuki)
  • Add BaselineMetrics, Timestamp metrics, add for CoalescePartitionsExec, rename output_time -> elapsed_compute #909 (alamb)
  • add cross join support to ballista #891 (houqp)
  • Add Ballista support to DataFusion CLI #889 (andygrove)
  • support like on DictionaryArray #876 (b41sh)
  • Register table based on known schema without file IO #872 (Dandandan)
  • Add support for PostgreSQL regex match #870 [sql] (b41sh)
  • Include planning time in datafusion-cli printing #860 (Dandandan)
  • Implement basic common subexpression eliminate optimization #792 (waynexia)
  • Impl ops::Not for expr #763 (Jimexist)

Fixed bugs:

  • Can not use between in the select list: #1196
  • ORDER BY does not work with literals: Sort operation is not applicable to scalar value 'foo' #1195
  • window functions with NULL literals in partition by and order by do not work: Internal("Sort operation is not applicable to scalar value NULL") #1194
  • Operation name not included in internal errors -- Internal("Data type Boolean not supported for binary operation on dyn arrays") #1157
  • Physical plan explain UNION query says "ExecutionPlan(PlaceHolder)" #933
  • Can not use LIKE on DictionaryArray encoded strings #815
  • physical_plan::repartition::tests::repartition_with_dropping_output_stream failing locally #614
  • Fix some BuiltinScalarFunction panics with zero arguments #1249 (capkurmagati)
  • fix: not do boolean folding on NULL and/or expr #1245 (NGA-TRAN)
  • ignore case of with header row in sql when creating external table #1237 [sql] (lichuan6)
  • fix: Min/Max aggregation data type should not be dictionary #1235 (NGA-TRAN)
  • Fix build with --no-default-features #1219 (alamb)
  • Prevent "future cannot be sent between threads safely" compilation error #1155 (jonmmease)
  • Clean up spawned task on drop for AnalyzeExec, CoalescePartitionsExec, HashAggregateExec #1121 (crepererum)
  • Clean up spawned task on SortStream drop #1105 (crepererum)
  • fix UNION ALL bug: thread 'main' panicked at 'index out of bounds: the len is 1 but the index is 1', ./src/datatypes/schema.rs:165:10 #1088 (xudong963)
  • python: fix generated table name in dataframe creation #1078 (houqp)
  • fix subquery alias #1067 [sql] (xudong963)
  • fix pattern handling in regexp_match function #1065 (houqp)
  • fix: joins on Timestamp columns #1055 (francis-du)
  • Fix metric name typo #943 (alamb)
  • EXPLAIN ANALYZE should run all Optimizer passes #929 (alamb)

Documentation updates:

Performance improvements:

  • Improve avro reader performance by avoiding some cloning on avro_rs::Value #1206 (Igosuki)
  • optimize build profile for datafusion python binding, cli and ballista #1137 (houqp)
  • Avoid stack overflow by reducing stack usage of BinaryExpr::evaluate in debug builds #1047 (alamb)
  • Add ScalarValue::eq_array optimized comparison function #844 (alamb)
  • Rework GroupByHash for faster performance and support grouping by nulls #808 (alamb)

Closed issues:

  • InList expr with NULL literals do not work #1190
  • update the homepage README to include values, approx_distinct, etc. #1171
  • [Python]: Inconsistencies with Python package name #1011
  • Wanting to contribute to project where to start? #983
  • delete redundant code #973
  • How to build DataFusion python wheel #853
  • Add support for partition pruning #204
  • [Datafusion] Support joins on TimestampMillisecond columns #187
  • TPC-H Query 21 #173
  • TPC-H Query 13 #164
  • TPC-H Query 8 #162
  • implement split_part(string, delimiter, position) #157
  • Join Statement: Schema contains duplicate unqualified field name #155
  • ParquetTable should avoid scanning all files twice #136
  • Add support for reading partitioned Parquet files #133
  • Add support for Parquet schema merging #132
  • Catalog abstraction #126
  • Optimizer rules should work with qualified column names #125
  • Add optional qualifier to Expr::Column #121
  • Implement modulus expression #99
  • [Rust] Add constant folding to expressions during logically planning #98
  • [Rust] Implement pretty print for physical query plan #93
  • Can not group by boolean columns (add boolean to valid keys of groupBy) #91
  • improve performance of building literal arrays #90
  • [rust][datafusion] optimize count(*) queries on parquet sources #89
  • Produce a design for a metrics framework #21

Merged pull requests:

  • Add timezone string to stabilize test #1265 (viirya)
  • numerical_coercion pattern match optimize #1256 (Jimexist)
  • fix and update window function sql tests #1059 (Jimexist)
  • reduce ScalarValue from trait boilerplate with macro #989 (houqp)

For older versions, see apache/arrow/CHANGELOG.md

5.0.0 (2021-08-10)

Full Changelog

Breaking changes:

  • Box ScalarValue:Lists, reduce size by half #788 (alamb)
  • JOIN conditions are order dependent #778 (seddonm1)
  • Show the result of all optimizer passes in EXPLAIN VERBOSE #759 (alamb)
  • #723 Datafusion add option in ExecutionConfig to enable/disable parquet pruning #749 (lvheyang)
  • Update API for extension planning to include logical plan #643 (alamb)
  • Rename MergeExec to CoalescePartitionsExec #635 (andygrove)
  • fix 593, reduce cloning by taking ownership in logical planner's from fn #610 (Jimexist)
  • fix join column handling logic for On and Using constraints #605 (houqp)
  • Rewrite pruning logic in terms of PruningStatistics using Array trait (option 2) #426 (alamb)
  • Support reading from NdJson formatted data sources #404 (heymind)
  • Add metrics to RepartitionExec #398 (andygrove)
  • Use 4.x arrow-rs from crates.io rather than git sha #395 (alamb)
  • Return Vec<bool> from PredicateBuilder rather than an Fn #370 (alamb)
  • Refactor: move RowGroupPredicateBuilder into its own module, rename to PruningPredicateBuilder #365 (alamb)
  • [Datafusion] NOW() function support #288 (msathis)
  • Implement select distinct #262 (Dandandan)
  • Refactor datafusion/src/physical_plan/common.rs build_file_list to take less param and reuse code #253 (Jimexist)
  • Support qualified columns in queries #55 (houqp)
  • Read CSV format text from stdin or memory #54 (heymind)
  • Use atomics for SQLMetric implementation, remove unused name field #25 (returnString)

Implemented enhancements:

  • Allow extension nodes to correctly plan physical expressions with relations #642
  • Filters aren't passed down to table scans in a union #557
  • Support pruning for boolean columns #490
  • Implement SQLMetrics for RepartitionExec #397
  • DataFusion benchmarks should show executed plan with metrics after query completes #396
  • Use published versions of arrow rather than github shas #393
  • Add Compare to GroupByScalar #364
  • Reusable "row group pruning" logic #363
  • Add an Order Preserving merge operator #362
  • Implement Postgres compatible now() function #251
  • COUNT DISTINCT does not support dictionary types #249
  • Use standard make_null_array for CASE #222
  • Implement date_trunc() function #203
  • COUNT DISTINCT does not support for Float64 #199
  • Update SQLMetric to use atomics rather than a Mutex #30
  • Implement PartialOrd for ScalarValue #838 (viirya)
  • Support date datatypes in max/min #820 (viirya)
  • Implement vectorized hashing for DictionaryArray types #812 (alamb)
  • Convert unsupported conditions in left right join to filters #796 [sql] (Dandandan)
  • Implement streaming versions of Dataframe.collect methods #789 (andygrove)
  • impl from str for column and scalar #762 (Jimexist)
  • impl fmt::Display for PlanType #752 (Jimexist)
  • Remove unnecessary projection in logical plan optimization phase #747 (waynexia)
  • Support table columns alias #735 (Dandandan)
  • Derive PartialEq for datasource enums #734 (alamb)
  • Allow filetype to be lowercase, Implement FromStr for FileType #728 (Jimexist)
  • Update to use arrow 5.0 #721 (alamb)
  • #554: Lead/lag window function with offset and default value arguments #687 (jgoday)
  • dedup using join column in wildcard expansion #678 (houqp)
  • Implement metrics for HashJoinExec #664 (andygrove)
  • Show physical plan with metrics in benchmark #662 (andygrove)
  • Allow non-equijoin filters in join condition #660 (Dandandan)
  • Add End-to-end test for parquet pruning + metrics for ParquetExec #657 (alamb)
  • Add support for leading field in interval #647 (Dandandan)
  • Remove hard-coded PartitionMode from Ballista serde #637 (andygrove)
  • Ballista: Implement scalable distributed joins #634 (andygrove)
  • implement rank and dense_rank function and refactor built-in window function evaluation #631 (Jimexist)
  • Improve "field not found" error messages #625 (andygrove)
  • Support modulus op #577 (gangliao)
  • implement std::default::Default for execution config #570 (Jimexist)
  • to_timestamp_millis(), to_timestamp_micros(), to_timestamp_seconds() #567 (velvia)
  • Filter push down for Union #559 (Dandandan)
  • Implement window functions with partition_by clause #558 (Jimexist)
  • support table alias in join clause #547 (houqp)
  • Not equal predicate in physical_planning pruning #544 (jgoday)
  • add error handling and boundary checking for window frames #530 (Jimexist)
  • Implement window functions with order_by clause #520 (Jimexist)
  • support group by column positions #519 [sql] (jychen7)
  • Implement constant folding for CAST #513 (msathis)
  • Add window frame constructs - alternative #506 (Jimexist)
  • Add partition by constructs in window functions and modify logical planning #501 (Jimexist)
  • Add support for boolean columns in pruning logic #500 (alamb)
  • #215 resolve aliases for group by exprs #485 (jychen7)
  • Support anti join #482 (Dandandan)
  • Support semi join #470 (Dandandan)
  • add order by construct in window function and logical plans #463 (Jimexist)
  • Remove redundant filters (e.g. c> 5 AND c>5 --> c>5) #436 (jgoday)
  • fix: display the content of debug explain #434 (NGA-TRAN)
  • implement lead and lag built-in window function #429 (Jimexist)
  • add support for ndjson for datafusion-cli #427 (Jimexist)
  • add first_value, last_value, and nth_value built-in window functions #403 (Jimexist)
  • export both now and random functions #389 (Jimexist)
  • Function to create ArrayRef from an iterator of ScalarValues #381 (alamb)
  • Sort preserving merge (#362) #379 (tustvold)
  • Add support for multiple partitions with SortExec (#362) #378 (tustvold)
  • add window expression stream, delegated window aggregation to aggregate functions, and implement row_number #375 (Jimexist)
  • Add PartialOrd and Ord to GroupByScalar (#364) #368 (tustvold)
  • Implement readable explain plans for physical plans #337 (alamb)
  • Add window expression part 1 - logical and physical planning, structure, to/from proto, and explain, for empty over clause only #334 (Jimexist)
  • Use NullArray to Pass row count to ScalarFunctions that take 0 arguments #328 (Jimexist)
  • add --quiet/-q flag and allow timing info to be turned on/off #323 (Jimexist)
  • Implement hash partitioned aggregation #320 (Dandandan)
  • Support COUNT(DISTINCT timestamps) #319 (charlibot)
  • add random SQL function #303 (Jimexist)
  • allow datafusion cli to take -- comments #296 (Jimexist)
  • Add json print format mode to datafusion cli #295 (Jimexist)
  • Add print format param with support for tsv print format to datafusion cli #292 (Jimexist)
  • Add print format param and support for csv print format to datafusion cli #289 (Jimexist)
  • allow datafusion-cli to take a file param #285 (Jimexist)
  • add param validation for datafusion-cli #284 (Jimexist)
  • [breaking change] fix 265, log should be log10, and add ln #271 (Jimexist)
  • Implement count distinct for dictionary arrays #256 (alamb)
  • Count distinct floats #252 (pjmore)
  • Add rule to eliminate LIMIT 0 and replace it with an EmptyRelation #213 (Dandandan)
  • Allow table providers to indicate their type for catalog metadata #205 (returnString)
  • Use arrow eq kernels in CaseWhen expression evaluation #52 (Dandandan)
  • Re-export Arrow and Parquet crates from DataFusion #39 (returnString)
  • [DataFusion] Optimize hash join inner workings, null handling fix #24 (Dandandan)
  • [ARROW-12441] [DataFusion] Cross join implementation #11 (Dandandan)

Fixed bugs:

  • Projection pushdown removes unqualified column names even when they are used #617
  • Panic while running join datatypes/schema.rs:165:10 #601
  • Indentation is incorrect for joins in formatted physical plans #345
  • Error while running COUNT DISTINCT (timestamp): 'Unexpected DataType for list #314
  • When joining two tables, get Error: Plan("Schema contains duplicate unqualified field name 'xxx'") #311
  • Incorrect answers with SELECT DISTINCT queries #250
  • Intermittent failure in CI join_with_hash_collision #227
  • Concat from Dataframe API no longer accepts multiple expressions #226
  • Fix right, full join handling when having multiple non-matching rows at the left side #845 (Dandandan)
  • Qualified field resolution too strict #810 [sql] (seddonm1)
  • Better join order resolution logic #797 [sql] (seddonm1)
  • Produce correct answers for Group BY NULL (Option 1) #793 (alamb)
  • Use consistent version of string_to_timestamp_nanos in DataFusion #767 (alamb)
  • #723 limit pruning rule to simple expression #764 (lvheyang)
  • #699 fix return type conflict when calling builtin math functions #716 (lvheyang)
  • Fix Date32 and Date64 parquet row group pruning #690 (alamb)
  • Remove qualifiers on pushed down predicates / Fix parquet pruning #689 (alamb)
  • use Weak ptr to break catalog list <> info schema cyclic reference #681 (crepererum)
  • honor table name for csv/parquet scan in ballista plan serde #629 (houqp)
  • fix 621, where unnamed window functions shall be differentiated by partition and order by clause #622 (Jimexist)
  • RFC: Do not prune out unnecessary columns with unqualified references #619 (alamb)
  • [fix] select * on empty table #613 (rdettai)
  • fix 592, support alias in window functions #607 (Jimexist)
  • RepartitionExec should not error if output has hung up #576 (alamb)
  • Fix pruning on not equal predicate #561 (alamb)
  • hash float arrays using primitive unsigned integer type #556 (houqp)
  • Return errors properly from RepartitionExec #521 (alamb)
  • refactor sort exec stream and combine batches #515 (Jimexist)
  • Fix display of execution time in datafusion-cli #514 (Dandandan)
  • Wrong aggregation arguments error. #505 (jgoday)
  • fix window aggregation with alias and add integration test case #454 (Jimexist)
  • fix: don't duplicate existing filters #409 (e-dard)
  • Fixed incorrect logical type in GroupByScalar. #391 (jorgecarleitao)
  • Fix indented display for multi-child nodes #358 (alamb)
  • Fix SQL planner to support multibyte column names #357 (agatan)
  • Fix wrong projection 'optimization' #268 (Dandandan)
  • Fix Left join implementation is incorrect for 0 or multiple batches on the right side #238 (Dandandan)
  • Count distinct boolean #230 (pjmore)
  • Fix Filter / where clause without column names is removed in optimization pass #225 (Dandandan)

Documentation updates:

Performance improvements:

  • Speed up inlist for strings and primitives #813 (Dandandan)
  • perf: improve performance of SortPreservingMergeExec operator #722 (e-dard)
  • Optimize min/max queries with table statistics #719 (b41sh)
  • perf: Improve materialisation performance of SortPreservingMergeExec #691 (e-dard)
  • Optimize count(*) with table statistics #620 (Dandandan)
  • optimize window function's find_ranges_in_range #595 (Jimexist)
  • Collapse sort into window expr and do sort within logical phase #571 (Jimexist)
  • Use repartition in window functions to speed up #569 (Jimexist)
  • Constant fold / optimize to_timestamp function during planning #387 (msathis)
  • Speed up create_batch_from_map #339 (Dandandan)
  • Simplify math expression code (use unary kernel) #309 (Dandandan)

Closed issues:

  • Confirm git tagging strategy for releases #770
  • arrow::util::pretty::pretty_format_batches missing #769
  • move the assert_batches_eq! macros to a non part of datafusion #745
  • fix an issue where aliases are not respected in generating downstream schemas in window expr #592
  • make the planner to print more succinct and useful information in window function explain clause #526
  • move window frame module to be in logical_plan #517
  • use a more rust idiomatic way of handling nth_value #448
  • create a test with more than one partition for window functions #435
  • COUNT DISTINCT does not support Boolean #202
  • Read CSV format text from stdin or memory #198
  • Fix null handling hash join #195
  • Allow TableProviders to indicate their type for the information schema #191
  • Make DataFrame extensible #190
  • TPC-H Query 19 #170
  • TPC-H Query 7 #161
  • Upgrade hashbrown to 0.10 #151
  • Implement vectorized hashing for hash aggregate #149
  • More efficient LEFT join implementation #143
  • Implement vectorized hashing #142
  • RFC Roadmap for 2021 (DataFusion) #140
  • Implement hash partitioning #131
  • Grouping by column position #110
  • [Datafusion] GROUP BY with a high cardinality doesn't seem to finish #107
  • [Rust] Add support for JSON data sources #103
  • [Rust] Implement metrics framework #95
  • Publicly export Arrow crate from datafusion #36
  • Implement hash-partitioned hash aggregate #27
  • Consider using GitHub pages for DataFusion/Ballista documentation #18
  • Update "repository" in Cargo.toml #16

Merged pull requests:

  • Use RawTable API in hash join #827 (Dandandan)
  • Add test for window functions on dictionary #823 (alamb)
  • Update dependencies: prost to 0.8 and tonic to 0.5 #818 (alamb)
  • Move hash_array into hash_utils.rs #807 (alamb)
  • Remove GroupByScalar and use ScalarValue in preparation for supporting null values in GroupBy #786 (alamb)
  • fix 226, make concat, concat_ws, and random work with Python crate #761 (Jimexist)
  • Test for parquet pruning disabling #754 (alamb)
  • Add explain verbose with limit push down #751 (Jimexist)
  • Move assert_batches_eq! macros to test_utils.rs #746 (alamb)
  • Show optimized physical and logical plans in EXPLAIN #744 (alamb)
  • update python crate to support latest pyo3 syntax and gil semantics #741 (Jimexist)
  • update python crate dependencies #740 (Jimexist)
  • provide more details on required .parquet file extension error message #729 (Jimexist)
  • split up window functions into a dedicated module with separate files #724 (Jimexist)
  • Use pytest in integration test #715 (Jimexist)
  • replace once iter chain with array::IntoIter #704 (houqp)
  • avoid iterator materialization in column index lookup #703 (houqp)
  • Fix build with 1.52.1 #696 (alamb)
  • Fix test output due to logical merge conflict #694 (alamb)
  • add more integration tests #668 (Jimexist)
  • Bump arrow and parquet versions to 4.4 #654 (toddtreece)
  • Add query 15 to TPC-H queries #645 (Dandandan)
  • Improve error message and comments #641 (alamb)
  • add integration tests for rank, dense_rank, fix last_value evaluation with rank #638 (Jimexist)
  • round trip TPCH queries in tests #630 (houqp)
  • use Into<String> as argument type wherever applicable #615 (houqp)
  • reuse alias map in aggregate logical planning and refactor position resolution #606 (Jimexist)
  • fix clippy warnings #581 (Jimexist)
  • Add benchmarks to window function queries #564 (Jimexist)
  • reuse code for now function expr creation #548 (houqp)
  • turn on clippy rule for needless borrow #545 (Jimexist)
  • Refactor hash aggregate's planner building code #539 (Jimexist)
  • Cleanup Repartition Exec code #538 (alamb)
  • reuse datafusion physical planner in ballista building from protobuf #532 (Jimexist)
  • remove redundant into_iter() calls #527 (Jimexist)
  • Fix 517 - move window_frames module to logical_plan #518 (Jimexist)
  • Refactor window aggregation, simplify batch processing logic #516 (Jimexist)
  • Add datafusion::test_util, resolve test data paths without env vars #498 (mluts)
  • Avoid warnings in tests when compiling without default features #489 (alamb)
  • update cargo.toml in python crate and fix unit test due to hash joins #483 (Jimexist)
  • use prettier check in CI #453 (Jimexist)
  • Optimize nth_value, remove first_value, last_value structs and use idiomatic rust style #452 (Jimexist)
  • Fixed typo / logical merge conflict #433 (jorgecarleitao)
  • include test data and add aggregation tests in integration test #425 (Jimexist)
  • Add some padding around the logo #411 (parthsarthy)
  • Benchmark subcommand to distinguish between DataFusion and Ballista #402 (jgoday)
  • refactor datafusion/scalar_value to use more macro and avoid dup code #392 (Jimexist)
  • Update TPC-H benchmark to show physical plan when debug mode is enabled #386 (andygrove)
  • Update arrow dependencies again #341 (alamb)
  • Update arrow-rs deps #317 (alamb)
  • Update PR template by commenting out instructions #315 (alamb)
  • fix clippy warning #286 (Jimexist)
  • add integration test to compare datafusion-cli against psql #281 (Jimexist)
  • Update arrow deps #269 (alamb)
  • Use multi-stage build dockerfile in datafusion-cli and reduce image size from 2.16GB to 89.9MB #266 (Jimexist)
  • Enable redundant_field_names clippy lint #261 (Dandandan)
  • fix clippy lint #259 (alamb)
  • Move datafusion-cli to new crate #231 (Dandandan)
  • Make test join_with_hash_collision deterministic #229 (Dandandan)
  • Update arrow-rs deps (to fix build due to flatbuffers update) #224 (alamb)
  • Use standard make_null_array for CASE #223 (alamb)
  • update arrow-rs deps to latest master #216 (alamb)
  • MINOR: Remove empty rust dir #61 (andygrove)

* This Changelog was automatically generated by github_changelog_generator