SWAT /

Order Statistics

Reading

Outdoors

Games

Hobbies

LEGO

Food

Code

Events

Nook

sidebar

Order Statistics

Order Statistics

The below is some Perl code to calculate the minimum, maximum, and median of an array a of n numbers. If n is odd, the median is taken as the value k in a for which ((n-1)/2) values in a are less than k. If n is even, the median is taken as the mean of the value k in a for which (n/2)-1 values are less than k and the value l in a for which (n/2) values are less than l (its successor in the values of a, handling duplicates). This sort of thing is useful if, like me, you write a lot of Perl scripts to transform and analyze logged data.

The algorithm for this code is based off a typical general selection algorithm and runs in expected linear time. It is not theoretically optimal and the implementation could be optimized more, but some care has been taken and it should perform reasonably well. It could be somewhat easily extended to operate over arrays of arbitrary data and not just numbers (e.g. an array of hash tables, a field of which is used for comparison).

Usage

Generally you would use the code in a script like so:

@test=(0, 3, 9, 11, 7, 10, 2, 5);
($min, $max, $median) = kmedian(\@test);
print "min $min max $max median $median\n";

Another example:

require "kmedian.pl";

@tests = ( [],
           [1],
           [1, 1],
           [2, 0],
           [1, 2, 3],
           [1, 2, 2],
           [1, 2, 1],
           [2, 1, 1],
           [1, 2, 1, 2],
           [2, 2, 1, 2],
           [2, 2, 1, 2],
           [1, 2, 1, 1],
           [1, 2, 3, 4],
           [2, 4, 3, 1],
	   );

foreach (@tests) {
    print "@{$_}\n";
    ($min, $max, $median) = kmedian($_);
    print "min $min max $max median $median\n\n";
}

The above assumes that the code has been saved into a file to be called as a module. That can be a hassle, as it has to be put into Perl's search path. You could just copy your code into your script, or you can use these tips to handle that.

Code

The code is available for download here (rename as kmedian.pl), or as follows:

#---------------------------------------------------------------
#-- kmedian.pl - Function to calculate minimum, maximum, and median of
#-- an array of numbers.  Runs in expected linear time and reasonable
#-- care has been taken in its implementation.
#--
#-- This code is released into the public domain.  Original
#-- author is Joseph B. Kopena (tjkopena@cs.drexel.edu), December 2006.
#---------------------------------------------------------------

#---------------------------------------------------------------
#-------------------------------------------------------
#-- (min, max, median) kmedian(array reference)
#--
#-- Determine the minimum, maximum, and median values for the given
#-- array of numbers.  Duplicate values in the array are acceptable.
#-- The order of the array will not be preserved---a copy is not made,
#-- and neither will it be fully sorted.
sub kmedian {

    my $a = shift;

    # Bail early if there's nothing going on... -tj
    if ($#{@{$a}} == -1) {
        # Should throw an error or something... -tj
	return (0, 0, 0);
    } elsif ($#{@{$a}} == 0) {
	return ($a->[0], $a->[0], $a->[0]);
    } elsif ($#{@{$a}} == 1) {
	if ($a->[0] <= $a->[1]) {
	    return ($a->[0], $a->[1], ($a->[0]+$a->[1])/2);
	} else {
	    return ($a->[1], $a->[0], ($a->[0]+$a->[1])/2);
	}
      # end there is more than 0, 1, or 2 elements
    }

    # mys cost, so push them until they're needed. -tj
    my $left = 0;
    my $right = $#{@{$a}};

    # Pick a random location for the first pivot index, which we'll
    # keep in $k so we can increment and not reassign the first
    # $pivotIndex. -tj
    my $k = int(rand($#{@{$a}}+1)), $pivotIndex = 0, $i;

    # Initialize min and max to a value in the table.
    # $a->[$k] will not be compared in the following loop, so
    # double bonus for using it as the initializer. -tj
    my $min = $a->[$k];
    my $max = $a->[$k];

    swap(\$a->[$k], \$a->[$right]);

    # This whole section is unrolled instead of using kpartition()
    # because we only have to do the min/max comparisons on the first
    # partitioning loop. -tj
    for ($i = 0; $i < $right; $i++) {
	if ($a->[$i] <= $a->[$right]) {
	    if ($a->[$i] < $min) {
		$min = $a->[$i];
	    }
	    swap(\$a->[$pivotIndex], \$a->[$i]);
	    $pivotIndex++;
	} elsif ($a->[$i] > $max) {
	    $max = $a->[$i];
	}

      # end looping over for initial partition
    }

    swap(\$a->[$right], \$a->[$pivotIndex]);

    # This algorithm is based off general selection, but we just want
    # the median. -tj
    $k = int($right/2);

    while ($pivotIndex != $k) {
	if ($k < $pivotIndex) {
	    $right = $pivotIndex - 1;
	} else {
	    $left = $pivotIndex + 1;
	}

        # Partition the subarray.
	swap(\$a->[$right], \$a->[$left+int(rand(($right-$left+1)))]);
	$pivotIndex = $left;
	for ($i = $left; $i < $right; $i++) {
	    if ($a->[$i] <= $a->[$right]) {
		swap(\$a->[$pivotIndex], \$a->[$i]);
		$pivotIndex++;
	    }
	}
	swap(\$a->[$right], \$a->[$pivotIndex]);

      # end looping through the array
    }

    if ($#{@{$a}} % 2) {
      # It's even---$#a = length-1.  Find the successor of the value
      # at the current pivotIndex and then return # the mean of the
      # two as the median. -tj

	$k = $max; # Note that max will either be on the right of
                   # the pivot or will be the pivot. -tj
	for ($i = $pivotIndex+1; $i <= $#{@{$a}}; $i++) {
	    if ($a->[$i] < $k) {
		$k = $a->[$i];
	    }
	}

	return ($min, $max, ($a->[$pivotIndex] + $k)/2);
    }

    return ($min, $max, $a->[$pivotIndex]);

# end kmedian
}

#-------------------------------------------------------
sub swap {
    my $a = ${$_[0]};
    ${$_[0]} = ${$_[1]};
    ${$_[1]} = $a;
}

#-------------------------------------------------------
1; # All modules have to return one -tj

License

This code is released into the public domain. If you find it useful or have questions, comments, or problems, I'd love to hear them. I assume no liability for any bugs or defects in the code or any problems resultant from its correct or incorrect use. It's been tested & looked over, but I'd take a good look before committing it to mission critical use.

Recent Changes (All) | Edit SideBar Page last modified on December 11, 2006, at 09:06 PM Edit Page | Page History