Skip to content

Commit

Permalink
v 3.0.2
Browse files Browse the repository at this point in the history
  • Loading branch information
liborty committed Oct 29, 2023
1 parent 075d6b2 commit ebfe8ba
Show file tree
Hide file tree
Showing 6 changed files with 197 additions and 435 deletions.
2 changes: 1 addition & 1 deletion Cargo.toml
Original file line number Diff line number Diff line change
@@ -1,6 +1,6 @@
[package]
name = "medians"
version = "3.0.1"
version = "3.0.2"
authors = ["Libor Spacek"]
edition = "2021"
description = "Median, Statistical Measures, Mathematics, Statistics"
Expand Down
30 changes: 20 additions & 10 deletions README.md
Original file line number Diff line number Diff line change
Expand Up @@ -5,32 +5,39 @@
Fast algorithm for finding medians of one dimensional data, implemented in 100% safe Rust.

```rust
use medians::{Medians,Medianf64,Median};
use medians::{medianu8,Medians,Medianf64,Median};
```

## Introduction

Finding medians is a common task in statistics and data analysis. At least it ought to be, because median is a more stable measure of central tendency than mean. Similarly, `mad` (median of absolute differences) is a more stable measure of data spread than standard deviation, dominated by squared outliers. Median and mad are not used nearly enough mostly for practical historical reasons: they are more difficult to compute. The fast algorithms presented here provide a practical remedy for this situation.
Finding medians is a common task in statistics and data analysis. At least it ought to be, because median is a more stable measure of central tendency than mean. Similarly, `mad` (median of absolute differences) is a more stable measure of data spread than is standard deviation, which is dominated by squared outliers. Median and mad are not used nearly enough mostly for practical reasons: they are more difficult to compute. The fast algorithms presented here provide remedy for this situation.

We argued in [`rstats`](https://github.com/liborty/rstats), that using the Geometric Median is the most stable way to characterise multidimensional data. The one dimensional case is addressed here.

See [`tests.rs`](https://github.com/liborty/medians/blob/main/tests/tests.rs) for examples of usage. Their automatically generated output can also be found by clicking the 'test' icon at the top of this document and then examining the latest log.

## Algorithms Analysis

Median can be found naively by sorting the list of data and then picking its midpoint. The best comparison sort algorithm(s) have complexity `O(nlog(n))`. However, faster median algorithms, with complexity `O(n)` are possible. They are based on the observation that not all data need to be sorted, only partitioned and counted off. Therefore, the sort method can not compete, as is demonstrated by the tests. It has been deleted as of version 2.0.0.
Simple primitive types are best dealt with by extra fast radix search. We have implemented it for `u8`, other primitive types to follow soon:

Currently considered to be the 'state of the art' algorithm is Floyd-Rivest (1975) Median of Medians. This divides the data into groups of five items, finds a median of each group and then recursively finds medians of five of these medians, and so on, until only one is left. This is then used as a pivot for the partitioning of the original data. Such pivot is guaranteed to produce 'pretty good' partitioning, though not necessarily perfect, so iteration is necessary.
```rust
/// Median of primitive type u8 by fast radix search
pub fn medianu8( s:&[u8] ) -> Result<f64, Me> { ... }
```

More complex types require general comparison search. Median can be found naively by sorting the list of data and then picking its midpoint. The best comparison sort algorithm(s) have complexity `O(nlog(n))`. However, faster median algorithms, with complexity `O(n)` are possible. They are based on the observation that not all data need to be sorted, only partitioned and counted off. Therefore, the sort method can not compete, as is demonstrated by the tests. It has been deleted as of version 2.0.0.

However, to find the best pivot is not the main overall objective. Rather, it is to eliminate (count off) eccentric data items as fast as possible. Therefore, the expense of choosing the pivot must be considered. It is possible to allow less optimal pivots, as we do here and yet, on average, to find the median faster.
Currently considered to be the 'state of the art' comparison algorithm is Floyd-Rivest (1975) Median of Medians. This divides the data into groups of five items, finds a median of each group by sort and then recursively finds medians of five of these medians, and so on, until only one is left. This is then used as a pivot for the partitioning of the original data. Such pivot is guaranteed to produce 'pretty good' partitioning, though not necessarily perfect, so iteration is still necessary.

Let our average ratio of items remaining after one partitioning be rs and the Floyd-Rivest be rf. Where `1/2 <= rf <= rs < 1` and rf is more optimal, being nearer to the perfect ratio `1/2`. However, suppose that we can perform two partitions in the time it takes Floyd-Rivest to do one (because of their expensive pivot selection process). Then it is enough for better performance that `rs^2 < rf`, which is entirely possible and seems to be confirmed in practice. For example, `rf=0.65` (nearly optimal), `rs=0.8` (deeply suboptimal), yet `rs^2 < rf`.
To find the best pivot is not the main overall objective. Rather, it is to eliminate (count off) eccentric data items as fast as possible. Therefore, the expense of choosing the pivot must be considered. It is possible to allow less optimal pivots, as we do here and yet, on average, to find the median faster.

Let our average ratio of items remaining after one partitioning be `rs` and the Floyd-Rivest's be `rf`. Typically, `1/2 <= rf <= rs < 1`, i.e. `rf` is more optimal, being nearer to the perfect partitioning ratio `1/2`. However, suppose that we can perform two partitions in the time it takes Floyd-Rivest to do one (because of their expensive pivot selection process). Then it is enough for better performance that `rs^2 < rf`, which is entirely possible and seems to be confirmed in practice. For example, `rf=0.65` (nearly optimal), `rs=0.8` (deeply suboptimal), yet `rs^2 < rf`.

The main features of our median algorithm are:

* Linear complexity.
* Fast (in place) iterative partitioning into three subranges (lesser,equal,greater), minimising data movements and memory management.
* Simple pivot selection strategy. We define the `middling` value of a sample of four as one of the middle pair of sorted items. Whereas full sort of four items takes at least five comparisons, we only need three. A `middling` pivot is enough to guarantee convergence of iterative schemes, such as the search for the median. Also, poor pivots are unlikely to be picked repeatedly.
* Fast (in-place) iterative partitioning into three subranges (lesser,equal,greater), minimising data movements and memory management.
* Simple pivot selection strategy. We define the `middling` value of a sample of four as one of the middle pair of items in order. Whereas full (merge) sort of four items takes five comparisons, we only need three. A `middling` pivot is enough to guarantee convergence of iterative schemes, such as the search for the median. Also, poor pivots are unlikely to be picked repeatedly.

## Trait Medianf64

Expand Down Expand Up @@ -60,7 +67,8 @@ Most of its methods take a quantify closure `q`, which converts its generic argu

Weaker partial ordinal comparison is used instead of numerical comparison. The search algorithm remains the same. The only additional cost is the extra layer of referencing to prevent the copying of data.

**`median_by()`**
**`median_by()`**

For all end-types quantifiable to f64, we simply averaged the two midpoints of even length data to obtain a single median (of type `f64`). When the data items are unquantifiable to `f64`, this is no longer possible. Then `median_by` should be used. It returns both middle values within `Medians` enum type, the lesser one first:

```rust
Expand Down Expand Up @@ -101,9 +109,11 @@ pub trait Median<'a, T> {

## Release Notes

**Version 3.0.2** - Added function `medianu8` that finds median byte by superfast radix search. More primitive types to follow.

**Version 3.0.1** - Renamed `correlation` to `med_correlation` to avoid name clashes elsewhere.

**Version 3.0.0** - Numerous improvements to speed and generality and renaming.
**Version 3.0.0** - Numerous improvements to speed and generality and renaming.

**Version 2.3.1** - Further speed optimisation of `partf64`.

Expand Down
126 changes: 88 additions & 38 deletions src/algos.rs
Original file line number Diff line number Diff line change
Expand Up @@ -83,19 +83,14 @@ pub fn isort_refs<T>(s: &mut [&T], rng: Range<usize>, c: impl Fn(&T, &T) -> Orde

/// Index of the middling value of four refs. Makes only three comparisons
fn middling(
idx0:usize,idx1:usize,idx2:usize,idx3:usize,
c: &mut impl FnMut(usize,usize) -> Ordering,
idx0: usize,
idx1: usize,
idx2: usize,
idx3: usize,
c: &mut impl FnMut(usize, usize) -> Ordering,
) -> usize {
let max1 = if c(idx0,idx1) == Less {
idx1
} else {
idx0
};
let max2 = if c(idx2,idx3) == Less {
idx3
} else {
idx2
};
let max1 = if c(idx0, idx1) == Less { idx1 } else { idx0 };
let max2 = if c(idx2, idx3) == Less { idx3 } else { idx2 };
if c(max1, max2) == Less {
max1
} else {
Expand All @@ -115,9 +110,9 @@ pub fn min<'a, T>(s: &[&'a T], rng: Range<usize>, c: &mut impl FnMut(&T, &T) ->
min
}

/// Two minimum values within rng in slice s
/// Finds maxima, when invoked with arguments of c swapped: `|a,b| c(b,a)`
/// min1 is the overall minimum/maximum.
/// Two minimum values within rng in slice s.
/// Finds maxima, when invoked with arguments of c swapped: `|a,b| c(b,a)`.
/// The first returned item refers to the primary minimum/maximum.
pub fn min2<'a, T>(
s: &[&'a T],
rng: Range<usize>,
Expand Down Expand Up @@ -153,11 +148,11 @@ pub fn qbalance<T>(s: &[T], centre: &f64, q: impl Fn(&T) -> f64) -> i64 {
let mut eq = 0_i64;
for si in s {
match &q(si).total_cmp(centre) {
Less => bal -= 1,
Equal => eq += 1,
Less => bal -= 1,
Greater => bal += 1,
_ => eq += 1,
};
}
};
if bal == 0 {
return 0;
};
Expand All @@ -167,22 +162,21 @@ pub fn qbalance<T>(s: &[T], centre: &f64, q: impl Fn(&T) -> f64) -> i64 {
1
}

/// Partitions `s: &mut [&T]` within range `rng` by pivot, which is the first item: s[rng.start]
/// Partitions `s: &mut [&T]` within range `rng` by pivot, which is the first item: `s[rng.start]`.
/// The three resulting partitions are divided by eqstart,gtstart, where:
/// `rng.start..eqstart` (may be empty) contains refs to items lesser than the pivot,
/// `gtstart-eqstart` is the number (>= 1) of items equal to the pivot,
/// values of s within this subrange are undefined,
/// `gtstart..rng.end` (may be empty) contains refs to items greater than the pivot
/// `rng.start..eqstart` (may be empty) contains refs to items lesser than the pivot,
/// `gtstart-eqstart` is the number (>= 1) of items equal to the pivot, values within this subrange are undefined,
/// `gtstart..rng.end` (may be empty) contains refs to items greater than the pivot.
pub fn part<T>(
s: &mut [&T],
rng: &Range<usize>,
rng: &Range<usize>,
c: &mut impl FnMut(&T, &T) -> Ordering,
) -> (usize, usize) {
// place pivot in the first location
) -> (usize, usize) {
// get pivot from the first location
let pivot = s[rng.start];
let mut eqstart = rng.start;
let mut gtstart = eqstart+1;
for t in rng.start+1..rng.end {
let mut gtstart = eqstart + 1;
for t in rng.start + 1..rng.end {
match c(s[t], pivot) {
Less => {
s[eqstart] = s[t];
Expand All @@ -200,14 +194,63 @@ pub fn part<T>(
(eqstart, gtstart)
}

/// Odd median of `&[u8]`
pub fn oddmedianu8(s: &[u8]) -> f64 {
let need = s.len() / 2; // median target position
let mut histogram = [0_usize; 256];
let mut cummulator = 0_usize;
let mut resbucket = 0_u8;
for &u in s.iter() {
histogram[u as usize] += 1;
};
for &hist in histogram.iter() {
cummulator += hist;
if need < cummulator { break; };
resbucket += 1;
};
resbucket as f64
}

/// Even median of `&[u8]`
pub fn evenmedianu8(s: &[u8]) -> f64 {
let mut need = s.len() / 2; // first median target position
let mut histogram = [0_usize; 256];
let mut cummulator = 0_usize;
let mut resbucket = 0u8;
let mut res = f64::MIN;
for &u in s.iter() {
histogram[u as usize] += 1;
}
for &hist in histogram.iter() {
cummulator += hist;
if need < cummulator { break; }; // cummulator exceeds need, found at least two items
if need == cummulator { // the last (possibly only) item in this bucket
if res == f64::MIN { // is the first median
res = resbucket as f64; // save it
need += 1; // next look for the second one (in the following buckets)
} else { break; }; // this item is already the second median
};
resbucket += 1;
continue;
};
if res == f64::MIN { resbucket as f64 } else { (res + resbucket as f64)/ 2.0 }
}

/// Median of odd sized generic data with Odering comparisons by custom closure
pub fn oddmedian_by<'a, T>(s: &mut [&'a T], c: &mut impl FnMut(&T, &T) -> Ordering) -> &'a T {
let mut rng = 0..s.len();
let need = s.len() / 2; // median target position in fully partitioned set
loop {
let pivotsub = middling(rng.start,rng.start + 1,rng.end - 2,rng.end - 1,
&mut |a,b| c(s[a],s[b]));
if pivotsub != rng.start { s.swap(rng.start,pivotsub); };
let pivotsub = middling(
rng.start,
rng.start + 1,
rng.end - 2,
rng.end - 1,
&mut |a, b| c(s[a], s[b]),
);
if pivotsub != rng.start {
s.swap(rng.start, pivotsub);
};
let pivotref = s[rng.start];
let (eqsub, gtsub) = part(s, &rng, c);
// well inside lt partition, iterate on it
Expand Down Expand Up @@ -235,7 +278,7 @@ pub fn oddmedian_by<'a, T>(s: &mut [&'a T], c: &mut impl FnMut(&T, &T) -> Orderi
};
// second place in gt partition, the solution is the next minimum
if need == gtsub + 1 {
return min2(s, gtsub..rng.end, c).1;
return min2(s, gtsub..rng.end, c).1;
};
// well inside gt partition, iterate on it
rng.start = gtsub;
Expand All @@ -250,11 +293,18 @@ pub fn evenmedian_by<'a, T>(
let mut rng = 0..s.len();
let need = s.len() / 2 - 1; // median target position in fully partitioned set
loop {
let pivotsub = middling(rng.start,rng.start + 1,rng.end - 2,rng.end - 1,
&mut |a,b| c(s[a],s[b]));
if pivotsub != rng.start { s.swap(rng.start,pivotsub); };
let pivotsub = middling(
rng.start,
rng.start + 1,
rng.end - 2,
rng.end - 1,
&mut |a, b| c(s[a], s[b]),
);
if pivotsub != rng.start {
s.swap(rng.start, pivotsub);
};
let pivotref = s[rng.start];
let (eqsub,gtsub) = part(s, &rng, c);
let (eqsub, gtsub) = part(s, &rng, c);
// well inside lt partition, iterate on it narrowing the range
if need + 2 < eqsub {
rng.end = eqsub;
Expand All @@ -263,7 +313,7 @@ pub fn evenmedian_by<'a, T>(
// penultimate place in lt partition, solution:
if need + 2 == eqsub {
// swapping comparison arguments to get two maxima
let (m1,m2) = min2(s, rng.start..eqsub, &mut |a, b| c(b, a));
let (m1, m2) = min2(s, rng.start..eqsub, &mut |a, b| c(b, a));
return (m2, m1);
};
// last place in the lt partition, solution:
Expand All @@ -284,7 +334,7 @@ pub fn evenmedian_by<'a, T>(
return min2(s, gtsub..rng.end, c);
};
// inside gt partition, iterate on it, narrowing the range
rng.start = gtsub;
rng.start = gtsub;
}
}

Expand Down
Loading

0 comments on commit ebfe8ba

Please sign in to comment.