Write a function that computes Kendall's rank correlation coefficient, tau (see lecture notes for additional details on Kendall's tau).

Kendall Coefficient (as quoted from Wikipedia):

Let (x1, y1), (x2, y2), …, (xn, yn) be a set of observations of the joint random variables X and Y, such that all the values of xi and yi are unique. Any pair of observations (xi, yi) and (xj, yj) is said to be concordant if the ranks for both elements (more precisely, the sort order by x and by y) agree: that is, if both xi > xj and yi > yj, or if both xi < xj and yi < yj. They are said to be discordant if xi > xj and yi < yj, or if xi < xj and yi > yj. If xi = xj or yi = yj, the pair is neither concordant nor discordant.

Equation 1

TAU=(number of concordant pairs-number of discordant pairs)/(0.5 x n x (n - 1))
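Equation 1 can be illustrated with a brute-force sketch on a tiny tie-free example. This is only an illustration of the definition, not the exercise solution (the variable names `conc` and `disc` are my own):

```r
# brute-force count of concordant and discordant pairs on a tiny example
x <- c(1, 2, 3)
y <- c(3, 1, 2)
conc <- disc <- 0
n <- length(x)
for (i in 1:(n - 1)) for (j in (i + 1):n) {
  s <- sign(x[i] - x[j]) * sign(y[i] - y[j])   # >0 concordant, <0 discordant
  if (s > 0) conc <- conc + 1 else if (s < 0) disc <- disc + 1
}
(conc - disc) / (0.5 * n * (n - 1))            # tau = (1 - 2)/3 = -0.3333333
```

Here one pair is concordant and two are discordant, so tau = (1 - 2) / (0.5 x 3 x 2) = -1/3.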

Equation 1 ignores ties (pairs where either xi = xj or yi = yj). One way of correcting for ties is to consider how many ties (or non-ties) are present in the data. This variant, known as Kendall's tau_b, is computed by replacing the denominator 0.5 x n x (n - 1) in Equation 1 with:

sqrt(mx) x sqrt(my)

Where mx is the number of non-tied pairs for x and my is the number of non-tied pairs for y.
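One possible way to obtain mx and my is sketched below, using the exercise vectors defined later in this tutorial; this is only an assumption about the bookkeeping, and the hidden solution functions may compute these quantities differently:

```r
# sketch of the tau_b denominator for the exercise vectors
x <- c(-30, 0, 14, 5, 6, 7)
y <- c(-5, 0, 12, 16, 16, 8)
n     <- length(x)
total <- n * (n - 1) / 2             # number of all pairs (15 here)
tx    <- sum(choose(table(x), 2))    # pairs tied on x (0: all x unique)
ty    <- sum(choose(table(y), 2))    # pairs tied on y (1: the two 16s)
mx    <- total - tx                  # non-tied pairs for x
my    <- total - ty                  # non-tied pairs for y
sqrt(mx) * sqrt(my)                  # tau_b denominator: sqrt(15)*sqrt(14)
```

For these vectors there are 10 concordant and 4 discordant pairs, so (10 - 4) / (sqrt(15) x sqrt(14)) reproduces the tau_b value of about 0.414 reported by cor below.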

Note: The R function {cor} returns tau_b.

YOUR TASKS: Write a custom function that computes Kendall's coefficient both ignoring ties (Equation 1) and correcting for ties (as explained above).

1. Your function should return a single value (Kendall’s coefficient)

2. Your function should allow the user to choose to ignore ties or not and thus produce either tau or tau_b.

3. Apply the function to two vectors provided below and check the results provided by your function against those obtained using the R function {cor}.

4. Check how fast your function is relative to the {cor} function using the microbenchmark function from the package ‘microbenchmark’.

5. Finally, generate large vectors x and y (n = 1000) from a normal distribution using the rnorm function. Check again how fast your function is relative to the {cor} function using the microbenchmark function from the package ‘microbenchmark’.

To carry out this exercise, please start by loading the library ‘microbenchmark’ and please use the x and y variables as defined below. This will ensure that all results are the same for all of us. It will also help you evaluate whether your outcomes agree with those provided in the solution functions or alternative R functions.

library(microbenchmark)
x <- c(-30, 0, 14, 5, 6, 7)
y <- c(-5, 0, 12, 16, 16, 8)

Once your function is ready, you can continue to the next step. The solution functions included in this tutorial are hidden here, but can be reviewed in the .rmd file or in the answer key file (.html). Please try to solve this on your own before checking the solution.

Now check if your function yields comparable results to those provided by standard R functions or custom functions developed by your instructor.

x <- c(-30, 0, 14, 5, 6, 7)
y <- c(-5, 0, 12, 16, 16, 8)

# compute kendall tau using generic function {cor}
cor(x, y, method='kendall')
##  0.4140393
# check if your function is consistent with standard cor function
my.kend(x, y, ties=T)
##  0.4140393
# if you developed a second solution check if that solution works as well.
my.kend2(x, y, ties=T)      # check if your function works as well
##  0.4140393
# now check if your function(s) also compute(s) uncorrected tau correctly
my.kend(x, y, ties=F)
##  0.4
my.kend2(x, y, ties=F)
##  0.4

Once you have established that your function works, you should investigate whether it is faster or slower than the cor function. Keep in mind, however, that the performance of a function may depend on sample size, so your function may outperform the ‘cor’ function for small sample sizes but perform poorly for large sample sizes (or vice versa).

microbenchmark(cor(x, y, method='kendall'), my.kend(x, y, ties=T),
my.kend2(x, y, ties=T))  # which is faster for this tiny dataset?
## Unit: microseconds
##                           expr    min      lq     mean  median      uq
##  cor(x, y, method = "kendall") 57.770 59.6590 62.28257 61.1690 62.8675
##        my.kend(x, y, ties = T) 20.390 21.5225 23.72020 23.5995 24.9210
##       my.kend2(x, y, ties = T)  9.062  9.4400 10.29720 10.1960 10.9500
##      max neval
##  120.071   100
##   59.658   100
##   17.370   100

You can see that my custom functions my.kend and my.kend2 are much faster than the cor function when the analyzed datasets are tiny.

Let’s now check how the functions perform for much larger datasets.

x <- rnorm(500,1,1)
y <- rnorm(500,1,3) + x
cor(x, y, method='kendall')
##  0.2152305
my.kend(x, y, ties=T)                                               # check if your function agrees
##  0.2152305
my.kend2(x, y, ties=T)                                               # check if your function agrees
##  0.2152305
microbenchmark(cor(x, y, method='kendall'), my.kend(x, y, ties=T),
my.kend2(x, y, ties=T))  # which is faster for a much larger dataset?
## Unit: milliseconds
##                           expr       min        lq      mean    median
##  cor(x, y, method = "kendall")  3.213214  3.283444  3.655006  3.419185
##        my.kend(x, y, ties = T) 13.088842 16.598647 19.032110 17.722893
##       my.kend2(x, y, ties = T) 49.585435 51.565470 55.769142 53.307439
##        uq       max neval
##   3.71804  10.94192   100
##  20.13243  43.07858   100
##  55.73528 134.17111   100

As is clear from the report, the custom-written solutions that I wrote (you may come up with better algorithms) perform well at small sample sizes but underperform at larger sample sizes. That is, compiled, professionally written R functions are faster for large samples than custom solutions written by amateurs. This means that when we iterate over many small datasets, custom functions can be faster; when huge datasets are processed, built-in R functions are likely to be superior to custom solutions.

Let us evaluate this issue here by using ‘cor’ and our custom function to carry out the same analysis.

r1 <- as.data.frame(matrix(rnorm(20000), 10, 2000))
# NOTE r1 is a dataframe of many (s=2000) tiny samples (n=10 observations). Each column is a sample.
r2 <- as.data.frame(matrix(rnorm(20000), 2000, 10))
# NOTE r2 is a dataframe of s=10 huge samples (n=2000 observations). Each column is a sample.

system.time(sapply(r1, function(x) cor(x, r1[,1], method='kendall')))
##    user  system elapsed
##    0.19    0.00    0.19
system.time(sapply(r1, function(x) my.kend(x, r1[,1])))
##    user  system elapsed
##    0.11    0.00    0.11
system.time(sapply(r1, function(x) my.kend2(x, r1[,1])))
##    user  system elapsed
##    0.10    0.00    0.09
system.time(sapply(r2, function(x) cor(x, r2[,1], method='kendall')))
##    user  system elapsed
##    0.69    0.00    0.69
system.time(sapply(r2, function(x) my.kend(x, r2[,1])))
##    user  system elapsed
##    4.05    0.89    4.94
system.time(sapply(r2, function(x) my.kend2(x, r2[,1])))
##    user  system elapsed
##    9.53    0.02    9.54

Clearly, my functions worked faster when we processed many short vectors (data frame r1), but underperformed badly when a few very large vectors were evaluated (data frame r2).

Always cite packages.

##
## To cite package 'microbenchmark' in publications use:
##
##   Olaf Mersmann (2018). microbenchmark: Accurate Timing Functions.
##   R package version 1.4-4.
##   https://CRAN.R-project.org/package=microbenchmark
##
## A BibTeX entry for LaTeX users is
##
##   @Manual{,
##     title = {microbenchmark: Accurate Timing Functions},
##     author = {Olaf Mersmann},
##     year = {2018},
##     note = {R package version 1.4-4},
##     url = {https://CRAN.R-project.org/package=microbenchmark},
##   }

Comments/Questions/Corrections: Michal Kowalewski (kowalewski@ufl.edu)

Peer-review: This document has NOT been peer-reviewed.

Our Sponsors: National Science Foundation (Sedimentary Geology and Paleobiology Program), National Science Foundation (Earth Rates Initiative), Paleontological Society, Society of Vertebrate Paleontology 