The R ecosystem already has some very good packages that deal with labelled objects. In particular, the inter-connected packages haven and labelled provide all the functionality most users would ever need.
As nice and useful as these packages are, it has become apparent they have some fundamental design features that run, in some situations, against users’ expectations. This often relates to the treatment of declared missing values that are instrumental for the social sciences.
The following minimal example (adapted from the vignette in package haven) illustrates the situation:
The printed objects from this package nicely display some properties:
x1
#> <labelled_spss<double>[6]>
#> [1] 1 2 3 4 5 -91
#> Missing values: -91
#>
#> Labels:
#> value label
#> -91 Missing
There are 5 normal (non-missing) values (supposedly, they represent
the number of children), and one declared missing value coded
-91
. This value acts as a missing value, but it is
different from a regular missing value in R, coded NA
. The
latter stands for any missing information (something like an empty cell)
regardless of the reason.
On the other hand, the cell is not empty, but the value
-91
is not a valid value either. It cannot possibly
represent -91 children in the household, but for instance, it could have
meant the respondent did not want to respond. It is properly identified
as missing with:
But when calculating a mean, for instance, the normal expectation is
that value -91
would not play any role in the calculations
(since it should be missing). However:
This means the value -91
did play an active role despite
being identified as “missing”. In an ideal world, the expected mean
would be 3, or at best, employ the argument na.rm = TRUE
if the result is NA
because of the declared missing
value.
A solution to this problem is offered by package
labelled, which has a function called user_na_to_na()
:
A similar behavior is observed into another package memisc, that also deals with labels and missing values:
library(memisc)
memx <- c(1:5, -91)
labels(memx) <- c("Missing" = -91)
missing.values(memx) <- -91
mean(memx)
#> [1] -12.66667
Here too, despite being declared as missing, the value
-91
still plays a role in calculating the mean. It has to
be coerced to numeric to really convert that value to
NA
:
While solving the problem, the above solutions force two additional operations:
converting the (already) declared user missing values, and
employing the na.rm
argument.
This should not be necessary, especially if (and it is extremely likely that) users may forget the declared missing values are not actually missing values. This scenario is quite possible, as many users previously using other software like SPSS or Stata where nothing else should be done after declaring the missing values may not realize more is needed.
To solve this problem, the declared package creates
a similar object, where declared missing values are stored (hence
interpreted as) regular NA
missing values in R.
library(declared)
x2 <- declared(
x = c(1:5, -91),
labels = c("Missing" = -91),
na_value = -91
)
x2
#> <declared<numeric>[6]>
#> [1] 1 2 3 4 5 NA(-91)
#> Missing values: -91
#>
#> Labels:
#> value label
#> -91 Missing
The print method makes it obvious the value -91
is not a
regular number but an actual missing value. More importantly, this type
of storage circumvents the need to convert user-defined missing values
to regular NA
values since they are already stored as
regular NA
values. The average value is calculated simply
as follows:
Notice that neither user_na_to_na()
, nor
employing na.rm = TRUE
are necessary, and, despite being stored as an NA
value,
the value -91
is not equivalent to an empty cell.
The information still exists, but it is simply ignored in the
calculations.
At first glance, providing a class method for this function seems
unnecessary because activating the argument na.rm
will
return the correct result. Explaining the importance of the class method
requires a discussion about the base R decision to have this argument
deactivated by default. This is most likely to alert users about
possible problems in the data since a default value of TRUE
would obscure such problems; the mean is calculated irrespective of
potentially problematic NA
values.
This is where differentiating between empty and declared missing values proves valuable. The declared missing values are neither problematic nor signal potential problems in the data, given that once a reason is declared, it is already known why a particular value is missing.
The genuinely problematic values are the empty NA
values, and the custom class method still allows identifying such values
if they exist:
Since all declared values are stored as regular NA values, the base
function is.na()
, as well as
all related functions such as anyNA()
etc., are
unaware and can not differentiate between empty and declared missing
values:
To overcome this situation, package declared complementary provides an additional function to account for the difference:
All missing values, empty and declared, play well with the NA
oriented, base functions such as na.omit()
or na.exclude()
:
na.omit(x2)
#> <declared<numeric>[5]>
#> [1] 1 2 3 4 5
#> Missing values: -91
#>
#> Labels:
#> value label
#> -91 Missing
It should be made obvious the excellent packages memisc, haven and labelled are not inherently doing a bad thing: the very same result is obtained, just via a different route. Package declared was created as an alternative to the design philosophy of these packages, with a fundamental difference: instead of interpreting existing values as missing, the missing values are interpreted as existing.
It does so by storing an additional attribute containing the
positions (indexes) of the regular NA
values in the object,
which should be treated as missing, and, even more so, to be interpreted
as a particular missing response category, as specified in the value
labels attribute.