ORC-1610: Reduce the number of hash computation in CuckooSetBytes
### What changes were proposed in this pull request?
Add boundary conditions on "length" with the min/max length stored in the hashes.

### Why are the changes needed?
https://issues.apache.org/jira/browse/HIVE-24205

> This would significantly reduce the number of hash computation that needs to happen.

```
main insert:00:00:00.689
main lookup:00:00:01.124
PR insert:00:00:00.628
PR lookup:00:00:01.055
```

```java
@Test
public void testLen() {
  int maxSize = 200000;
  Random gen = new Random();
  String[] strings = new String[maxSize];
  for (int i = 0; i < maxSize; i++) {
    strings[i] = RandomStringUtils.random(Math.abs(gen.nextInt(1000)));
  }
  byte[][] values = getByteArrays(strings);

  // Baseline: length check disabled
  StopWatch mainSW = new StopWatch();
  // load set
  mainSW.start();
  CuckooSetBytes main = new CuckooSetBytes(strings.length);
  main.fastLookup = false;
  for (byte[] v : values) {
    main.insert(v);
  }
  mainSW.split();
  System.out.println("main insert:" + mainSW);
  // test that the values we added are there
  for (byte[] v : values) {
    assertTrue(main.lookup(v, 0, v.length));
  }
  mainSW.stop();
  System.out.println("main lookup:" + mainSW);

  // This PR: length check enabled
  StopWatch prSW = new StopWatch();
  prSW.start();
  CuckooSetBytes pr = new CuckooSetBytes(strings.length);
  pr.fastLookup = true;
  for (byte[] v : values) {
    pr.insert(v);
  }
  prSW.split();
  System.out.println("PR insert:" + prSW);
  for (byte[] v : values) {
    assertTrue(pr.lookup(v, 0, v.length));
  }
  prSW.stop();
  System.out.println("PR lookup:" + prSW);
}
```

### How was this patch tested?
GA

### Was this patch authored or co-authored using generative AI tooling?
No

Closes #1785 from cxzl25/ORC-1610.

Authored-by: sychen <sychen@ctrip.com>
Signed-off-by: Dongjoon Hyun <dongjoon@apache.org>
(cherry picked from commit 4eff23a)
Signed-off-by: Dongjoon Hyun <dongjoon@apache.org>
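
For readers unfamiliar with the idea, the length boundary check described above can be sketched roughly as follows. This is a minimal illustration, not the actual `CuckooSetBytes` code: the class `LengthBoundedByteSet`, its fields, and the `hashedLookups` counter are hypothetical names, and a plain `HashSet` stands in for the cuckoo hash tables. The sketch tracks the minimum and maximum length of inserted values and rejects any lookup whose length falls outside that range before any hash is computed.

```java
import java.nio.ByteBuffer;
import java.nio.charset.StandardCharsets;
import java.util.HashSet;
import java.util.Set;

/** Hypothetical sketch of a length-bounded lookup; not the ORC implementation. */
class LengthBoundedByteSet {
  // A HashSet stands in for the cuckoo tables (assumption for brevity).
  private final Set<ByteBuffer> entries = new HashSet<>();
  private int minLen = Integer.MAX_VALUE;
  private int maxLen = Integer.MIN_VALUE;
  // Counts lookups that passed the length check and had to be hashed.
  int hashedLookups = 0;

  void insert(byte[] value) {
    // Widen the recorded length range to cover the new value.
    minLen = Math.min(minLen, value.length);
    maxLen = Math.max(maxLen, value.length);
    entries.add(ByteBuffer.wrap(value.clone()));
  }

  boolean lookup(byte[] bytes, int start, int len) {
    // No stored value has this length, so skip hashing entirely.
    if (len < minLen || len > maxLen) {
      return false;
    }
    hashedLookups++;
    // ByteBuffer hashes and compares only the wrapped range.
    return entries.contains(ByteBuffer.wrap(bytes, start, len));
  }

  public static void main(String[] args) {
    LengthBoundedByteSet set = new LengthBoundedByteSet();
    set.insert("apple".getBytes(StandardCharsets.UTF_8));
    set.insert("banana".getBytes(StandardCharsets.UTF_8));
    byte[] kiwi = "kiwi".getBytes(StandardCharsets.UTF_8);
    byte[] apple = "apple".getBytes(StandardCharsets.UTF_8);
    System.out.println(set.lookup(kiwi, 0, kiwi.length));
    System.out.println(set.lookup(apple, 0, apple.length));
    System.out.println("hashed lookups: " + set.hashedLookups);
  }
}
```

The check only helps for probes whose length lies outside the stored range. In the benchmark above every probed value is present in the set, so the measured difference there reflects the cost of the check rather than its best-case benefit.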