Analytical Paleobiology Workshop 2018

Assignment Tasks

Write a function that computes Kendall rank coefficient Tau (see lecture notes for additional details on Kendall Tau).

Kendall Coefficient {as quoted from Wikipedia):

Let (x1, y1), (x2, y2), ., (xn, yn) be a set of observations of the joint random variables X and Y respectively, such that all the values of xi and yi are unique. Any pair of observations (xi and yi) are said to be concordant if the ranks for both elements (more precisely, the sort order by x and by y) agree: that is, if both xi > xj and yi > yi or if both xi < xj and yi < yj. They are said to be discordant, if xi > xj and yi < yj or if xi < xj and yi > yj. If xi = xj or yi = yj, the pair is neither concordant nor discordant.

Equation 1

TAU=(number of concordant pairs-number of discordant pairs)/(0.5 x n x (n - 1))

The above equations ignore ties (sets of pairs where either xi = xj or yi = yj). One way of correcting for ties is to consider how many ties (or non-ties) are present in the data. This variant known as Kendall’s tau_b is computed by modifying the denominator 0.5n(n-1) in Equation 1 to consider ties:

sqrt(mx) x sqrt(my)

Where mx is number of non-ties for x and my is number of non-ties for y.

Note: The R function {cor} returns tau_b.

YOUR TASKS: 1. Write a custom function that computes Kendall ignoring ties (Equation 1) and correcting for ties (as explained above).

Your function should return a single value (Kendall’s coefficient)
Your function should allow the user to choose to ignore ties or not and thus produce either tau or tau_b.
Apply the function to two vectors provided below and check the results provided by your function against those obtained using the R function {cor}.
Check how fast is your function relative to {cor} function using microbenchmark function from package ‘microbenchmark’.
Finally, generate large vectors x and y (n = 1000) from normal distribution using rnorm function. Check again how fast is your function relative to {cor} function using microbenchmark function from package ‘microbenchmark’.

To carry out this exercise, please start by uploading the library ‘microbenchmark’ and please use x and y variables as defined below. This will ensure that all results are the same for all of us. It will also help you to evaluate if your outcomes agree with those provided in solution functions or alternative R functions.

library(microbenchmark)
x <- c(-30, 0, 14, 5, 6, 7)
y <- c(-5, 0, 12, 16, 16, 8)

Once your function is ready, you can continue to the next step. The function solutions included in this tutorial are hidden here, but can be reviewed in the .rmd file or in the answer key file (.html). Please try to solve this on your own before checking the solution.

my.kend <- function(x, y, ties=T) { 
 n <- length(x)
 nzero <- 0.5*n*(n-1)
 s1 <- sign(x - t(matrix(x, n, n, byrow=F)))
 s2 <- sign(y - t(matrix(y, n, n, byrow=F)))
 t1 <- sum(s1[lower.tri(s1)]!=0)
 t2 <- sum(s2[lower.tri(s2)]!=0)
 n2 <- t1^0.5*t2^0.5
 if (!ties) tau <- sum(s1[lower.tri(s1)]*s2[lower.tri(s2)])/nzero
 if (ties) tau <- sum(s1[lower.tri(s1)]*s2[lower.tri(s2)])/n2
 return(tau)
}

my.kend2 <- function(x, y, ties=T) { 
 n <- length(x)
 nzero <- 0.5*n*(n-1)
 k <- 0
 s1 <- 0
 s2 <- 0
 out <- vector(mode='numeric', length=nzero)
 for (i in 1:length(x)) {
  for (j in 1:length(x))  {
    if (j > i) {
    k <- k + 1
    out[k] <- sign(x[i] - x[j])*sign(y[i] - y[j]) 
    if(x[i] - x[j]==0) s1 = s1 + 1
    if(y[i] - y[j]==0) s2 = s2 + 1
  }
 }
}
if (!ties) tau <- sum(out)/nzero
if (ties) tau <- sum(out)/((nzero-s1)^0.5*sum(nzero-s2)^0.5)
tau
}

Now check if your function yields comparable results to those provided by standard R functions or custom functions developed by your instructor.

x <- c(-30, 0, 14, 5, 6, 7)
y <- c(-5, 0, 12, 16, 16, 8)

# compute kendall tau using generic function {cor}
cor(x, y, method='kendall')

## [1] 0.4140393

# check if your function is consistent with standard cor function
my.kend(x, y, ties=T)

## [1] 0.4140393

# if you developed a second solution check if that solution works as well.
my.kend2(x, y, ties=T)      # check if your function works as well

## [1] 0.4140393

# now check if your function(s) also compute(s) uncorrected tau correctly
my.kend(x, y, ties=F)

## [1] 0.4

my.kend2(x, y, ties=F)

## [1] 0.4

Once you established that your function works, you should investigate if your function is faster or slower than cor function. Keep in mind, however, that the performance of the function may depend on sample size, so your function may work better than ‘cor’ function for small sample sizes, but perform poorly for large sample sizes (or vice versa).

microbenchmark(cor(x, y, method='kendall'), my.kend(x, y, ties=T),
               my.kend2(x, y, ties=T))  # which is faster for a much larger dataset?

## Unit: microseconds
##                           expr    min      lq     mean  median       uq
##  cor(x, y, method = "kendall") 65.700 80.0480 95.31707 95.1510 109.6875
##        my.kend(x, y, ties = T) 23.033 26.2425 36.04066 35.6815  42.2900
##       my.kend2(x, y, ties = T) 10.195 10.9500 15.45499 12.0830  24.5435
##      max neval
##  215.976   100
##   70.986   100
##   40.780   100

You can see that my custom functions my.kend and my.kend2 are much faster than cor function when analyzed data are tiny.

Let’s now check how the functions perform for much larger datasets.

x <- rnorm(500,1,1)
y <- rnorm(500,1,3) + x
cor(x, y, method='kendall')

## [1] 0.19301

my.kend(x, y, ties=T)                                               # check if your function agrees

## [1] 0.19301

my.kend2(x, y, ties=T)                                               # check if your function agrees

## [1] 0.19301

microbenchmark(cor(x, y, method='kendall'), my.kend(x, y, ties=T),
               my.kend2(x, y, ties=T))  # which is faster for a much larger dataset?

## Unit: milliseconds
##                           expr       min        lq      mean    median
##  cor(x, y, method = "kendall")  3.233603  3.529627  4.097588  3.924766
##        my.kend(x, y, ties = T) 13.880630 18.497122 21.550797 20.513782
##       my.kend2(x, y, ties = T) 50.971912 58.680038 64.747797 62.303116
##         uq      max neval
##   4.657272  5.69241   100
##  23.238596 52.78090   100
##  69.767135 94.52662   100

As clear from the report, the custom-written solutions that I wrote (you may come up with better algorithms) perform well at small sample size, but underperform at larger sample sizes. That is, compiled, professionally written R functions are faster for large samples than custom written solutions written by amatuers. This means that when we iterate small datasets, custom functions can be faster. When huge datasets are processed, R functions are likley to be superior to custom solutions.

Let us evaluate this issue here by using ‘cor’ and our custom function to carry out the same analysis.

r1 <- as.data.frame(matrix(rnorm(20000), 10, 2000))
# NOTE r1 is a dataframe of many (s=2000) tiny samples (n=10 observations). Each column is a sample.
r2 <- as.data.frame(matrix(rnorm(20000), 2000, 10)) 
# NOTE r2 is a dataframe of s=10 huge samples (n=2000 observations). Each column is a sample.

system.time(sapply(r1, function(x) cor(x, r1[,1], method='kendall')))

##    user  system elapsed 
##    0.16    0.03    0.19

system.time(sapply(r1, function(x) my.kend(x, r1[,1])))

##    user  system elapsed 
##    0.09    0.00    0.09

system.time(sapply(r1, function(x) my.kend2(x, r1[,1])))

##    user  system elapsed 
##    0.09    0.00    0.09

system.time(sapply(r2, function(x) cor(x, r2[,1], method='kendall')))

##    user  system elapsed 
##    0.59    0.00    0.60

system.time(sapply(r2, function(x) my.kend(x, r2[,1])))

##    user  system elapsed 
##    3.92    1.01    4.93

system.time(sapply(r2, function(x) my.kend2(x, r2[,1])))

##    user  system elapsed 
##    9.39    0.03    9.42

Clearly, my functions worked faster when we processed many, many short vectors (data frame r1), but underperformed badly when a few very large vectors were evaluated (data frame r2).

Always cite packages.

citation('microbenchmark')

## 
## To cite package 'microbenchmark' in publications use:
## 
##   Olaf Mersmann (2018). microbenchmark: Accurate Timing Functions.
##   R package version 1.4-4.
##   https://CRAN.R-project.org/package=microbenchmark
## 
## A BibTeX entry for LaTeX users is
## 
##   @Manual{,
##     title = {microbenchmark: Accurate Timing Functions},
##     author = {Olaf Mersmann},
##     year = {2018},
##     note = {R package version 1.4-4},
##     url = {https://CRAN.R-project.org/package=microbenchmark},
##   }

Comments/Questions/Corrections: Michal Kowalewski (kowalewski@ufl.edu)

Peer-review: This document has NOT been peer-reviewed.

Our Sponsors: National Science Foundation (Sedimentary Geology and Paleobiology Program), National Science Foundation (Earth Rates Initiative), Paleontological Society, Society of Vertebrate Paleontology

This work is licensed under a Creative Commons Attribution-NonCommercial 4.0 International License.

Analytical Paleobiology Workshop 2018

Assignment: Kendall Tau - ANSWER KEY

Michal Kowalewski

August 8, 2018

Assignment Tasks