Improvements
- Spark 3.0.0 - 3.1.3 supported (DATAFU-169)
- New Aggregators replace deprecated UserDefinedAggregateFunction (DATAFU-173)
Breaking changes
- Spark 2.x no longer supported
Improvements
- dedupWithCombiner method now supports a list of columns in the order / group by params (DATAFU-171)
- Scala Python bridge now uses secure gateway (DATAFU-167)
Breaking changes
- Spark 2.2.0, 2.2.1, and 2.3.0 no longer supported
Additions
- Add collectLimitedList and dedupRandomN methods (DATAFU-165)
- Improve broadcastJoinSkewed function performance and allow all join types (DATAFU-170)
Improvements
- Upgrade Log4j version (DATAFU-162)
- Added count filtering option to broadcastJoinSkewed
Fixes
- explodeArray method not exposed in Python (DATAFU-163)
Breaking changes
- Spark 2.1.x no longer supported
Additions
- Explode Array method (DATAFU-154)
Improvements
- Add support for newer versions of Gradle (DATAFU-157)
- Document Explode Array usage recommendation (DATAFU-158)
Fixes
- Gradle build fails (DATAFU-156)
Additions:
- datafu-spark library (DATAFU-148)
Improvements:
- Remove log suppression in unit tests (DATAFU-82)
Fixes:
- Failure to assemble due to jcenter HTTP usage (DATAFU-152)
Additions:
- dedup macro (DATAFU-129)
- sample_by_keys macro (DATAFU-127)
Improvements:
- Update Ruby gem for site generation (DATAFU-147)
- Make DataFu compile with Java 8 (DATAFU-132)
Changes:
- Upgrade to Gradle v4.8.1 (DATAFU-146)
Changes:
- Removed MD5 hash for source release artifact.
Additions:
- UDF for hash functions such as murmur3 and others. (DATAFU-47)
- UDF for diffing tuples. (DATAFU-119)
- Support for macros in DataFu. Macros count_all_non_distinct and count_distinct_keys were added. (DATAFU-123)
- Macro for TFIDF. (DATAFU-61)
Improvements:
- Added lifecycle hooks to ContextualEvalFunc. (DATAFU-50)
- SessionCount and Sessionize now support millisecond precision. (DATAFU-124)
- Upgraded to Guava 20.0. (DATAFU-48)
- Updated Gradle to 3.5.1. (DATAFU-125)
- Rat tasks automatically run during assemble. (DATAFU-118)
- Building now works on Windows. (DATAFU-99)
Improvements:
- LICENSE, NOTICE, and DISCLAIMER now included in META-INF of JARs.
- Test files now generated to build/test-files within projects.
- AliasableEvalFunc now uses getInputSchema.
Additions:
- New UDF CountDistinctUpTo that counts tuples within a bag to a preset limit (DATAFU-117)
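The capped counting idea behind CountDistinctUpTo can be sketched in a few lines of Python. This is only an illustration of the technique, not DataFu's implementation: the function name `count_distinct_up_to` and its parameters are hypothetical, and it assumes (from the UDF's name) that distinct items are counted and that counting stops once the limit is reached.

```python
def count_distinct_up_to(items, limit):
    """Count distinct items, stopping early once `limit` distinct items are seen.

    Hypothetical helper for illustration; not part of DataFu.
    """
    if limit <= 0:
        return 0
    seen = set()
    for item in items:
        seen.add(item)
        # Stop as soon as the cap is reached so memory use stays bounded by `limit`.
        if len(seen) >= limit:
            return limit
    return len(seen)
```

The early return is the point of the technique: the set of seen items never grows past the limit, however large the input is.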
Improvements:
- TupleFromBag and FirstTupleFromBag now implement Accumulator interface as well (DATAFU-114, DATAFU-115)
Build System:
- IntelliJ IDEA support added to build file (DATAFU-103)
- JDK version now validated when building (DATAFU-95)
Additions:
- New UDFs for entropy and weighted sampling algorithms (DATAFU-2, DATAFU-26)
- Updated SimpleRandomSample to be consistent with SimpleRandomSampleWithReplacement (DATAFU-5)
- Created OpenNLP UDF wrappers (DATAFU-8)
- Created RandomUUID UDF (DATAFU-18)
- Added LSH implementation (DATAFU-37)
- Added Base64Encode/Decode (DATAFU-52)
- URLInfo UDF (DATAFU-62)
- Created SelectFieldByName UDF (DATAFU-69)
- Added generic BagJoin that supports inner, left, and full outer joins (DATAFU-70)
- Added ZipBags UDF which can zip an arbitrary number of bags into one (DATAFU-79)
- Hadoop 2.0 compatibility (DATAFU-58)
- Created TupleFromBag.java file (DATAFU-92)
Improvements:
- Simplified BagGroup output (DATAFU-42)
Changes:
- StagedOutputJob no longer writes counters by default (DATAFU-35)
Fixes:
- ReservoirSample does not behave as expected when grouping by a key other than ALL (DATAFU-11)
- DistinctBy does not work correctly on strings containing minuses (DATAFU-31)
- Hourglass does not honor "fail on missing" in all cases (DATAFU-35)
- Hash UDFs return zero-padded strings of uniform length even when leading bits are zero (DATAFU-46)
- UDF examples work again (DATAFU-49)
- SampleByKey can throw NullPointerException (DATAFU-68)
Build system:
- Removed legacy checked in jars (DATAFU-55)
- Updated to use Pig 0.12.1 (DATAFU-10)
- Switched from Ant to Gradle 1.12 (DATAFU-27, DATAFU-44, DATAFU-43, DATAFU-66)
- Removed checked-in jars, which are now downloaded where necessary (DATAFU-55)
- Fixed test.sh to use gradlew (DATAFU-77)
Release related:
- NOTICE updated with dependencies used or shipped with DataFu.
- Apache license headers added to all necessary files (DATAFU-4, DATAFU-75)
- Added doap file (DATAFU-36)
- Source tarball generation, gradle bootstrapping, and release instructions (DATAFU-57, DATAFU-78, DATAFU-72)
- Removed author tags (DATAFU-74)
- Resolved issues with build-plugin directory (DATAFU-76)
- Used Apache RAT to verify correct file headers (DATAFU-73, DATAFU-84)
Documentation related:
- New website (DATAFU-20, etc.)
- StreamingQuantile PDF link is broken (DATAFU-29)
- README file updated
Additions:
- Pair of UDFs for simple random sampling with replacement.
- More dependencies now packaged in DataFu so fewer JAR dependencies required.
- SetDifference UDF for computing set difference A-B or A-B-C.
- HyperLogLogPlusPlus UDF for efficient cardinality estimation.
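The "A-B or A-B-C" semantics mentioned for SetDifference can be illustrated with a small Python sketch. It shows only the set arithmetic, under the assumption that every set after the first is subtracted from the first; the function name `set_difference` is hypothetical, and any input requirements of the actual UDF are not modeled here.

```python
def set_difference(first, *others):
    """Return the items of `first` that appear in none of `others`.

    Hypothetical helper for illustration; not part of DataFu.
    The order of `first` is preserved and duplicates are dropped.
    """
    excluded = set()
    for other in others:
        excluded.update(other)
    # dict.fromkeys keeps first-seen order while removing duplicates.
    return [item for item in dict.fromkeys(first) if item not in excluded]
```

With one extra argument this computes A-B; with two it computes A-B-C, which is the same as A-(B∪C).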
This release adds compatibility with Pig 0.12 (courtesy of jarcec).
Additions:
- Added SHA hash UDF.
- InUDF and AssertUDF added for Pig 0.12 compatibility. These are the same as In and Assert.
- SimpleRandomSample, which implements a scalable simple random sampling algorithm.
Fixes:
- Fixed the schema declarations of several UDFs for compatibility with Pig 0.12, which is now stricter with schemas.
This is not a backwards compatible release.
Additions:
- Added SampleByKey, which provides a way to sample tuples based on certain fields.
- Added Coalesce, which returns the first non-null value from a list of arguments like SQL's COALESCE.
- Added BagGroup, which performs an in-memory group operation on a bag.
- Added ReservoirSample
- Added In filter func, which behaves like SQL's IN
- Added EmptyBagToNullFields, which enables multi-relation left joins using COGROUP
- Sessionize now supports long values for timestamp, in addition to string representation of time.
- BagConcat can now operate on a bag of bags, in addition to a tuple of bags
- Created TransposeTupleToBag, which creates a bag of key-value pairs from a tuple
- SessionCount now implements Accumulator interface
- DistinctBy now implements Accumulator interface
- Using PigUnit from Maven for testing, instead of checked-in JAR
- Added many more test cases to improve coverage
- Improved documentation
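The Coalesce entry above describes the same rule as SQL's COALESCE: return the first argument that is not null. A minimal Python sketch of that rule follows. The function name `coalesce` is hypothetical, `None` stands in for a null value, and this is not the UDF's actual code.

```python
def coalesce(*values):
    """Return the first argument that is not None, or None if all are None.

    Hypothetical helper for illustration; not part of DataFu.
    """
    for value in values:
        # Test against None explicitly so falsy values such as 0 or "" still count.
        if value is not None:
            return value
    return None
```

Note that only nulls are skipped: a zero or an empty string is a valid result.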
Changes:
- Moved WeightedSample to datafu.pig.sampling
- Using Pig 0.11.1 for testing.
- Renamed package datafu.pig.numbers to datafu.pig.random
- Renamed package datafu.pig.bag.sets to datafu.pig.sets
- Renamed TimeCount to SessionCount, moved to datafu.pig.sessions
- ASSERT renamed to Assert
- MD5Base64 merged into the MD5 implementation; a constructor argument selects the encoding, with hex as the default
Removals:
- Removed ApplyQuantiles
- Removed AliasBagFields, since the same result can now be achieved with a nested foreach
Fixes:
- Quantile now outputs schemas consistent with StreamingQuantile
- Necessary fastutil classes now packaged in datafu JAR, so fastutil JAR not needed as dependency
- Non-deterministic UDFs now marked as such
Additions:
- CountEach now implements Accumulator
- Added AliasableEvalFunc, a base class to enable UDFs to access fields in tuple by name instead of position
- Added BagLeftOuterJoin, which can perform left join on two or more reasonably sized bags without a reduce
Fixes:
- StreamingQuantile schema fix
Additions:
- WeightedSample can now take a seed
Changes:
- Test against Pig 0.11.0
Fixes:
- Null pointer fix for Enumerate's Accumulator implementation