v 3.0.2

liborty · Oct 29, 2023 · ebfe8ba · ebfe8ba
1 parent 075d6b2
commit ebfe8ba
Show file tree

Hide file tree

Showing 6 changed files with 197 additions and 435 deletions.
diff --git a/Cargo.toml b/Cargo.toml
@@ -1,6 +1,6 @@
 [package]
 name = "medians"
-version = "3.0.1"
+version = "3.0.2"
 authors = ["Libor Spacek"]
 edition = "2021"
 description = "Median, Statistical Measures, Mathematics, Statistics"

diff --git a/README.md b/README.md
@@ -5,32 +5,39 @@
 Fast algorithm for finding medians of one dimensional data, implemented in 100% safe Rust.
 
 ```rust
-use medians::{Medians,Medianf64,Median};
+use medians::{medianu8,Medians,Medianf64,Median};
 ```
 
 ## Introduction
 
-Finding medians is a common task in statistics and data analysis. At least it ought to be, because median is a more stable measure of central tendency than mean. Similarly, `mad` (median of absolute differences) is a more stable measure of data spread than standard deviation, dominated by squared outliers. Median and mad are not used nearly enough mostly for practical historical reasons: they are more difficult to compute. The fast algorithms presented here provide a practical remedy for this situation.
+Finding medians is a common task in statistics and data analysis. At least it ought to be, because median is a more stable measure of central tendency than mean. Similarly, `mad` (median of absolute differences) is a more stable measure of data spread than is standard deviation, which is dominated by squared outliers. Median and mad are not used nearly enough mostly for practical reasons: they are more difficult to compute. The fast algorithms presented here provide remedy for this situation.
 
 We argued in [`rstats`](https://github.com/liborty/rstats), that using the Geometric Median is the most stable way to characterise multidimensional data. The one dimensional case is addressed here.
 
 See [`tests.rs`](https://github.com/liborty/medians/blob/main/tests/tests.rs) for examples of usage. Their automatically generated output can also be found by clicking the 'test' icon at the top of this document and then examining the latest log.
 
 ## Algorithms Analysis
 
-Median can be found naively by sorting the list of data and then picking its midpoint. The best comparison sort algorithm(s) have complexity `O(nlog(n))`. However, faster median algorithms, with complexity `O(n)` are possible. They are based on the observation that not all data need to be sorted, only partitioned and counted off. Therefore, the sort method can not compete, as is demonstrated by the tests. It has been deleted as of version 2.0.0.
+Simple primitive types are best dealt with by extra fast radix search. We have implemented it for `u8`, other primitive types to follow soon:
 
-Currently considered to be the 'state of the art' algorithm is Floyd-Rivest (1975) Median of Medians. This divides the data into groups of five items, finds a median of each group and then recursively finds medians of five of these medians, and so on, until only one is left. This is then used as a pivot for the partitioning of the original data. Such  pivot is guaranteed to produce 'pretty good' partitioning, though not necessarily perfect, so iteration is necessary.
+```rust
+/// Median of primitive type u8 by fast radix search
+pub fn medianu8( s:&[u8] ) -> Result<f64, Me> { ... }
+```
+
+More complex types require general comparison search. Median can be found naively by sorting the list of data and then picking its midpoint. The best comparison sort algorithm(s) have complexity `O(nlog(n))`. However, faster median algorithms, with complexity `O(n)` are possible. They are based on the observation that not all data need to be sorted, only partitioned and counted off. Therefore, the sort method can not compete, as is demonstrated by the tests. It has been deleted as of version 2.0.0.
 
-However, to find the best pivot is not the main overall objective. Rather, it is to eliminate (count off) eccentric data items as fast as possible. Therefore, the expense of choosing the pivot must be considered. It is possible to allow less optimal pivots, as we do here and yet, on average, to find the median faster.
+Currently considered to be the 'state of the art' comparison algorithm is Floyd-Rivest (1975) Median of Medians. This divides the data into groups of five items, finds a median of each group by sort and then recursively finds medians of five of these medians, and so on, until only one is left. This is then used as a pivot for the partitioning of the original data. Such pivot is guaranteed to produce 'pretty good' partitioning, though not necessarily perfect, so iteration is still necessary.
 
-Let our average ratio of items remaining after one partitioning be rs and the Floyd-Rivest be rf. Where `1/2 <= rf <= rs < 1` and rf is more optimal, being nearer to the perfect ratio `1/2`. However, suppose that we can perform two partitions in the time it takes Floyd-Rivest to do one (because of their expensive pivot selection process). Then it is enough for better performance that `rs^2 < rf`, which is entirely possible and seems to be confirmed in practice. For example, `rf=0.65` (nearly optimal), `rs=0.8` (deeply suboptimal), yet `rs^2 < rf`.
+To find the best pivot is not the main overall objective. Rather, it is to eliminate (count off) eccentric data items as fast as possible. Therefore, the expense of choosing the pivot must be considered. It is possible to allow less optimal pivots, as we do here and yet, on average, to find the median faster.
+
+Let our average ratio of items remaining after one partitioning be `rs` and the Floyd-Rivest's be `rf`. Typically, `1/2 <= rf <= rs < 1`, i.e. `rf` is more optimal, being nearer to the perfect partitioning ratio `1/2`. However, suppose that we can perform two partitions in the time it takes Floyd-Rivest to do one (because of their expensive pivot selection process). Then it is enough for better performance that `rs^2 < rf`, which is entirely possible and seems to be confirmed in practice. For example, `rf=0.65` (nearly optimal), `rs=0.8` (deeply suboptimal), yet `rs^2 < rf`.
 
 The main features of our median algorithm are:
 
 * Linear complexity.
-* Fast (in place) iterative partitioning into three subranges (lesser,equal,greater), minimising data movements and memory management.
-* Simple pivot selection strategy. We define the `middling` value of a sample of four as one of the middle pair of sorted items. Whereas full sort of four items takes at least five comparisons, we only need three. A `middling` pivot is enough to guarantee convergence of iterative schemes, such as the search for the median. Also, poor pivots are unlikely to be picked repeatedly.
+* Fast (in-place) iterative partitioning into three subranges (lesser,equal,greater), minimising data movements and memory management.
+* Simple pivot selection strategy. We define the `middling` value of a sample of four as one of the middle pair of items in order. Whereas full (merge) sort of four items takes five comparisons, we only need three. A `middling` pivot is enough to guarantee convergence of iterative schemes, such as the search for the median. Also, poor pivots are unlikely to be picked repeatedly.
 
 ## Trait Medianf64
 
@@ -60,7 +67,8 @@ Most of its methods take a quantify closure `q`, which converts its generic argu
 
 Weaker partial ordinal comparison is used instead of numerical comparison. The search algorithm remains the same. The only additional cost is the extra layer of referencing to prevent the copying of data.
 
-**`median_by()`**  
+**`median_by()`**
+
 For all end-types quantifiable to f64, we simply averaged the two midpoints of even length data to obtain a single median (of type `f64`). When the data items are unquantifiable to `f64`, this is no longer possible. Then `median_by` should be used. It returns both middle values within `Medians` enum type, the lesser one first:
 
 ```rust
@@ -101,9 +109,11 @@ pub trait Median<'a, T> {
 
 ## Release Notes
 
+**Version 3.0.2** - Added function `medianu8` that finds median byte by superfast radix search. More primitive types to follow.
+
 **Version 3.0.1** - Renamed `correlation` to `med_correlation` to avoid name clashes elsewhere.
 
-**Version 3.0.0** - Numerous improvements to speed and generality and renaming. 
+**Version 3.0.0** - Numerous improvements to speed and generality and renaming.
 
 **Version 2.3.1** - Further speed optimisation of `partf64`.
 

diff --git a/src/algos.rs b/src/algos.rs
@@ -83,19 +83,14 @@ pub fn isort_refs<T>(s: &mut [&T], rng: Range<usize>, c: impl Fn(&T, &T) -> Orde
 
 /// Index of the middling value of four refs. Makes only three comparisons
 fn middling(
-    idx0:usize,idx1:usize,idx2:usize,idx3:usize,
-    c: &mut impl FnMut(usize,usize) -> Ordering,
+    idx0: usize,
+    idx1: usize,
+    idx2: usize,
+    idx3: usize,
+    c: &mut impl FnMut(usize, usize) -> Ordering,
 ) -> usize {
-    let max1 = if c(idx0,idx1) == Less {
-        idx1
-    } else {
-        idx0
-    };
-    let max2 = if c(idx2,idx3) == Less {
-        idx3
-    } else {
-        idx2
-    };
+    let max1 = if c(idx0, idx1) == Less { idx1 } else { idx0 };
+    let max2 = if c(idx2, idx3) == Less { idx3 } else { idx2 };
     if c(max1, max2) == Less {
         max1
     } else {
@@ -115,9 +110,9 @@ pub fn min<'a, T>(s: &[&'a T], rng: Range<usize>, c: &mut impl FnMut(&T, &T) ->
     min
 }
 
-/// Two minimum values within rng in slice s
-/// Finds maxima, when invoked with arguments of c swapped: `|a,b| c(b,a)`
-/// min1 is the overall minimum/maximum.
+/// Two minimum values within rng in slice s.  
+/// Finds maxima, when invoked with arguments of c swapped: `|a,b| c(b,a)`.  
+/// The first returned item refers to the primary minimum/maximum.
 pub fn min2<'a, T>(
     s: &[&'a T],
     rng: Range<usize>,
@@ -153,11 +148,11 @@ pub fn qbalance<T>(s: &[T], centre: &f64, q: impl Fn(&T) -> f64) -> i64 {
     let mut eq = 0_i64;
     for si in s {
         match &q(si).total_cmp(centre) {
-            Less => bal -= 1,
-            Equal => eq += 1,
+            Less => bal -= 1, 
             Greater => bal += 1,
+            _ => eq += 1,
         };
-    }
+    };
     if bal == 0 {
         return 0;
     };
@@ -167,22 +162,21 @@ pub fn qbalance<T>(s: &[T], centre: &f64, q: impl Fn(&T) -> f64) -> i64 {
     1
 }
 
-/// Partitions `s: &mut [&T]` within range `rng` by pivot, which is the first item: s[rng.start]
+/// Partitions `s: &mut [&T]` within range `rng` by pivot, which is the first item: `s[rng.start]`.  
 /// The three resulting partitions are divided by eqstart,gtstart, where:  
-/// `rng.start..eqstart` (may be empty) contains refs to items lesser than the pivot,
-/// `gtstart-eqstart` is the number (>= 1) of items equal to the pivot,
-///  values of s within this subrange are undefined,  
-/// `gtstart..rng.end` (may be empty) contains refs to items greater than the pivot
+/// `rng.start..eqstart` (may be empty) contains refs to items lesser than the pivot,  
+/// `gtstart-eqstart` is the number (>= 1) of items equal to the pivot, values within this subrange are undefined,  
+/// `gtstart..rng.end` (may be empty) contains refs to items greater than the pivot.
 pub fn part<T>(
     s: &mut [&T],
-    rng: &Range<usize>, 
+    rng: &Range<usize>,
     c: &mut impl FnMut(&T, &T) -> Ordering,
-) -> (usize, usize) { 
-    // place pivot in the first location
+) -> (usize, usize) {
+    // get pivot from the first location
     let pivot = s[rng.start];
     let mut eqstart = rng.start;
-    let mut gtstart = eqstart+1;
-    for t in rng.start+1..rng.end {
+    let mut gtstart = eqstart + 1;
+    for t in rng.start + 1..rng.end {
         match c(s[t], pivot) {
             Less => {
                 s[eqstart] = s[t];
@@ -200,14 +194,63 @@ pub fn part<T>(
     (eqstart, gtstart)
 }
 
+/// Odd median of `&[u8]`
+pub fn oddmedianu8(s: &[u8]) -> f64 {
+    let need = s.len() / 2; // median target position
+    let mut histogram = [0_usize; 256];
+    let mut cummulator = 0_usize;
+    let mut resbucket = 0_u8;
+    for &u in s.iter() {
+        histogram[u as usize] += 1;
+    };
+    for &hist in histogram.iter() {
+        cummulator += hist;
+        if need < cummulator { break; };
+        resbucket += 1;
+    };
+    resbucket as f64
+}
+
+/// Even median of `&[u8]`
+pub fn evenmedianu8(s: &[u8]) -> f64 {
+    let mut need = s.len() / 2; // first median target position
+    let mut histogram = [0_usize; 256];
+    let mut cummulator = 0_usize;
+    let mut resbucket = 0u8;
+    let mut res = f64::MIN;
+    for &u in s.iter() {
+        histogram[u as usize] += 1;
+    }
+    for &hist in histogram.iter() {
+        cummulator += hist; 
+        if need < cummulator { break; }; // cummulator exceeds need, found at least two items
+        if need == cummulator { // the last (possibly only) item in this bucket
+            if res == f64::MIN { // is the first median
+                res = resbucket as f64; // save it
+                need += 1; // next look for the second one (in the following buckets)
+            } else { break; }; // this item is already the second median
+        };
+        resbucket += 1;
+        continue;
+    };
+    if  res == f64::MIN { resbucket as f64 } else { (res + resbucket as f64)/ 2.0 } 
+}
+
 /// Median of odd sized generic data with Odering comparisons by custom closure
 pub fn oddmedian_by<'a, T>(s: &mut [&'a T], c: &mut impl FnMut(&T, &T) -> Ordering) -> &'a T {
     let mut rng = 0..s.len();
     let need = s.len() / 2; // median target position in fully partitioned set
     loop {
-        let pivotsub = middling(rng.start,rng.start + 1,rng.end - 2,rng.end - 1,
-            &mut |a,b| c(s[a],s[b]));
-        if pivotsub != rng.start { s.swap(rng.start,pivotsub); };
+        let pivotsub = middling(
+            rng.start,
+            rng.start + 1,
+            rng.end - 2,
+            rng.end - 1,
+            &mut |a, b| c(s[a], s[b]),
+        );
+        if pivotsub != rng.start {
+            s.swap(rng.start, pivotsub);
+        };
         let pivotref = s[rng.start];
         let (eqsub, gtsub) = part(s, &rng, c);
         // well inside lt partition, iterate on it
@@ -235,7 +278,7 @@ pub fn oddmedian_by<'a, T>(s: &mut [&'a T], c: &mut impl FnMut(&T, &T) -> Orderi
         };
         // second place in gt partition, the solution is the next minimum
         if need == gtsub + 1 {
-           return min2(s, gtsub..rng.end, c).1;
+            return min2(s, gtsub..rng.end, c).1;
         };
         // well inside gt partition, iterate on it
         rng.start = gtsub;
@@ -250,11 +293,18 @@ pub fn evenmedian_by<'a, T>(
     let mut rng = 0..s.len();
     let need = s.len() / 2 - 1; // median target position in fully partitioned set
     loop {
-        let pivotsub = middling(rng.start,rng.start + 1,rng.end - 2,rng.end - 1,
-            &mut |a,b| c(s[a],s[b]));
-        if pivotsub != rng.start { s.swap(rng.start,pivotsub); };
+        let pivotsub = middling(
+            rng.start,
+            rng.start + 1,
+            rng.end - 2,
+            rng.end - 1,
+            &mut |a, b| c(s[a], s[b]),
+        );
+        if pivotsub != rng.start {
+            s.swap(rng.start, pivotsub);
+        };
         let pivotref = s[rng.start];
-        let (eqsub,gtsub) = part(s, &rng, c); 
+        let (eqsub, gtsub) = part(s, &rng, c);
         // well inside lt partition, iterate on it narrowing the range
         if need + 2 < eqsub {
             rng.end = eqsub;
@@ -263,7 +313,7 @@ pub fn evenmedian_by<'a, T>(
         // penultimate place in lt partition, solution:
         if need + 2 == eqsub {
             // swapping comparison arguments to get two maxima
-            let (m1,m2) = min2(s, rng.start..eqsub, &mut |a, b| c(b, a));
+            let (m1, m2) = min2(s, rng.start..eqsub, &mut |a, b| c(b, a));
             return (m2, m1);
         };
         // last place in the lt partition, solution:
@@ -284,7 +334,7 @@ pub fn evenmedian_by<'a, T>(
             return min2(s, gtsub..rng.end, c);
         };
         // inside gt partition, iterate on it, narrowing the range
-        rng.start = gtsub; 
+        rng.start = gtsub;
     }
 }