The Unfinished duplicated Function

Published November 20, 2016

If you’re a regular R user, you’ve probably used, or at least seen, the function duplicated(). If you grew up entirely in the world of dplyr and the other tidyverse packages, with its cool function distinct(), then I highly envy you. Either way, you might find yourself reaching for duplicated() when you need more control. The function has a handy argument, incomparables, that lets you exclude certain values from the duplicate comparison.

One day I wanted to use this to remove duplicate customer entries from our CRM system based on name and phone number. I went through the typical pre-processing steps: removing unnecessary punctuation, converting to lowercase, removing spaces, etc. Once I got familiar with the data, I realized that some of the data points were missing or too incomplete to support any meaningful determination of a duplicate. I needed a way to ignore certain missing-value patterns, and it had to cover two or more columns of data.

One hack would be to paste0() the columns together and work on the single pasted column, but this doesn’t handle multiple incomparable values. As I found out, duplicated() can handle multiple columns, but the incomparables argument does not work for multiple columns, let alone different incomparable values for each column. Here is the error I get:

duplicated(mtcars[,c('am', 'gear', 'cyl')], incomparables=6)

## Error: argument 'incomparables != FALSE' is not used (yet)
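
To make the limitation concrete, here is a small sketch of what does work in base R and where the paste0() hack falls short. The customers data frame and the "|" separator are made up for illustration; they are not from the CRM data described above.

```r
# On a plain vector, incomparables is honored: 6 is never flagged as a duplicate.
vec_result <- duplicated(c(6, 4, 6, 4), incomparables = 6)
# FALSE FALSE FALSE TRUE

# Hypothetical CRM-like data (illustration only): rows 3 and 4 have no name.
customers <- data.frame(
  name  = c("ann", "ann", "", ""),
  phone = c("555", "555", "123", "123"),
  stringsAsFactors = FALSE
)

# The paste0() hack: collapse both columns into one key per row.
key <- paste0(customers$name, "|", customers$phone)

# Declaring "" incomparable has no effect, because no pasted key equals "".
# Row 4 is still flagged even though its name is missing.
paste_result <- duplicated(key, incomparables = "")
# FALSE TRUE FALSE TRUE
```

Once the columns are pasted together, incomparables is matched against the whole key, so a rule like "ignore rows with an empty name" would require listing every combined key that contains an empty name.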

This is really a bummer, because it is nice to be able to state explicitly which values should be considered when checking for duplicates. The error message sure is hopeful at the end there: “is not used (yet)”! After some digging I concluded that the function has existed, as-is, at least since R version 1.4, which was released in 2001! I wouldn’t hold my breath for anyone to actually fix this issue. Contributing changes to base R has become much more difficult. I submitted a patch to R core to fix this very issue, but they weren’t too responsive. I guess it will stay as a hidden gem in base R, waiting for the next data miner to unearth it.

If you run across this issue and really want a fix, feel free to use the code below. I’ve implemented a function that supports incomparables across multiple columns, where a list of vectors specifies the incomparable values for each column. If the list is unnamed and shorter than the number of columns being compared, the remaining columns are padded with the default, FALSE (no incomparable values).

new_duplicated <- function(x, incomparables = FALSE, fromLast = FALSE, ...) {
  
  if(!identical(incomparables, FALSE)) {
    n <- ncol(x)
    nmx <- names(x)
    nmincomparables <- names(incomparables)
    lincomparables <- length(incomparables)
    if(is.null(nmincomparables)) {
      if(lincomparables < n) {
        # pad any incomparables lists with the default value if list is shorter than the number columns in the supplied data.frame
        tmp <- c(incomparables, as.list(rep_len(FALSE, n - lincomparables)))
        names(tmp) <- nmx
        incomparables <- tmp 
      }
      if(lincomparables > n) {
        # if the list is unnamed and has more elements than there are columns, use only the first n elements
        warning(paste("more columns in 'incomparables' than x, only using the first", n, "elements"))
        incomparables <- incomparables[1:n]
      }
    } else {
      # for named lists, find match, else default value
      tmp <- as.list(rep_len(FALSE, n))
      names(tmp) <- nmx
      i <- match(nmincomparables, nmx, 0L)
      if(any(i <= 0L))
        warning("not all columns named in 'incomparables' exist")
      tmp[ i[i > 0L] ] <- incomparables[i > 0L]
      incomparables <- tmp[nmx]
    }
    
    # first determine duplicates, then override when an incomparable value is found in a row since the existence of even 1 incomparable value in a row means it cannot be a duplicate
    res <- duplicated(do.call("paste", c(x, sep="\r")), fromLast = fromLast)
    
    # for better performance, only check the columns whose incomparable values differ from the default: !identical(v, FALSE)
    run_incomp_check <- vapply(incomparables, FUN = function(v) !identical(v, FALSE), FUN.VALUE = logical(1))
    if (any(run_incomp_check)) {
      # one logical vector per checked column: TRUE where the value is incomparable
      # (Map + Reduce also behaves correctly when x has a single row)
      incomp_check <- Map(function(column, values) column %in% values, x[run_incomp_check], incomparables[run_incomp_check])
      # a row with at least one incomparable value can never be a duplicate, so override the result
      overwrite <- Reduce(`|`, incomp_check)
      res[overwrite] <- FALSE
    }
    
    return(res)
  } else if(length(x) != 1L) {
    duplicated(do.call("paste", c(x, sep="\r")), fromLast = fromLast)
  } else {
    duplicated(x[[1L]], fromLast = fromLast, ...)
  }
}

Here is the function in action:

mtcars2 <- head(mtcars[,c('am', 'gear', 'cyl')])
mtcars2$dup <- new_duplicated(mtcars2, incomparables=1)
mtcars2

##                   am gear cyl   dup
## Mazda RX4          1    4   6 FALSE
## Mazda RX4 Wag      1    4   6 FALSE
## Datsun 710         1    4   4 FALSE
## Hornet 4 Drive     0    3   6 FALSE
## Hornet Sportabout  0    3   8 FALSE
## Valiant            0    3   6  TRUE

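Typically, the second row would be considered a duplicate of the first row, but since 1 is an incomparable value it is not considered a duplicate. In fact, any row containing a 1 in the am column is labeled FALSE, since that value renders the row incomparable. A single unnamed value like this applies only to the first column; the remaining columns fall back to the default of FALSE.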
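
The call above passes a single unnamed value. As a rough illustration of the per-column form, where a named list gives each column its own incomparable values, here is a condensed, self-contained sketch of the same idea. The helper name dup_by_column is hypothetical and it is not the function above: it assumes a named list and leaves out the padding and warning logic.

```r
# Hypothetical condensed helper (not the new_duplicated() from the post):
# flags duplicate rows, then clears the flag for any row that holds an
# incomparable value in one of the named columns.
dup_by_column <- function(x, incomparables, fromLast = FALSE) {
  # one key per row, built as in the post (columns pasted with "\r")
  key <- do.call(paste, c(unname(as.list(x)), sep = "\r"))
  res <- duplicated(key, fromLast = fromLast)

  # TRUE for rows containing at least one incomparable value
  hit <- rep(FALSE, nrow(x))
  for (nm in intersect(names(incomparables), names(x))) {
    hit <- hit | x[[nm]] %in% incomparables[[nm]]
  }
  res[hit] <- FALSE
  res
}

mtcars2 <- head(mtcars[, c("am", "gear", "cyl")])

# Baseline: plain duplicated() flags rows 2 and 6.
baseline <- duplicated(mtcars2)
# FALSE TRUE FALSE FALSE FALSE TRUE

# Same setting as the example above: 1 is incomparable in the am column.
am_only <- dup_by_column(mtcars2, list(am = 1))
# FALSE FALSE FALSE FALSE FALSE TRUE

# A different incomparable value per column: 1 in am, 6 in cyl.
am_and_cyl <- dup_by_column(mtcars2, list(am = 1, cyl = 6))
# all FALSE: rows 1-3 have am == 1, rows 4 and 6 have cyl == 6, row 5 is unique
```

Naming the list elements keeps the intent explicit and avoids relying on column order, which is why the sketch matches on names only.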
