For the examples in this vignette, the following data frame is created:
library(declared)
n <- 1234
set.seed(n)
dfm <- data.frame(
Area = declared(
sample(1:2, n, replace = TRUE, prob = c(0.45, 0.55)),
labels = c("Rural" = 1, "Urban" = 2)
),
Gender = declared(
sample(1:2, n, replace = TRUE, prob = c(0.55, 0.45)),
labels = c("Males" = 1, "Females" = 2)
),
Opinion = declared(
sample(c(1:5, NA, -91), n, replace = TRUE),
labels = c(
"Very bad" = 1, "Bad" = 2, "Neither" = 3,
"Good" = 4, "Very good" = 5, "Don't know" = -91
),
na_values = -91
),
Age = sample(18:90, n, replace = TRUE),
Children = sample(0:5, n, replace = TRUE)
)
One of the most interesting applications of declared missing values
is the table of frequencies. The base function table() ignores missing
values by default, but they can be revealed with the useNA argument:
table(dfm$Opinion, useNA = "ifany")
#>
#> Very bad Bad Neither Good Very good <NA>
#> 180 170 188 171 162 363
However, it does not differentiate between empty and declared missing
values. Since “Opinion” is the equivalent of a categorical variable,
this can be improved through a custom-built coercion to the base factor
class, applied after undeclaring the missing values:
table(as.factor(undeclare(dfm$Opinion)), useNA = "ifany")
#>
#> Don't know Very bad Bad Neither Good Very good <NA>
#> 180 180 170 188 171 162 183
The dedicated function w_table() does the same thing by automatically
recognizing objects of class "declared", and additionally prints more
detailed information:
w_table(dfm$Opinion, values = TRUE)
#>
#> fre rel per vld cpd
#> -----------------------------
#> Very bad 1 180 0.146 14.6 20.7 20.7
#> Bad 2 170 0.138 13.8 19.5 40.2
#> Neither 3 188 0.152 15.2 21.6 61.8
#> Good 4 171 0.139 13.9 19.6 81.4
#> Very good 5 162 0.131 13.1 18.6 100.0
#> ------
#> Don't know -91 180 0.146 14.6
#> NA 183 0.148 14.8
#> -----------------------------
#> 1234 1.000 100.0
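The last two rows of the table above separate the declared missing values from the empty ones. A minimal sketch of the same distinction on a small vector is given below. The vector x and its labels are hypothetical, invented for illustration only, and the counts rely on the assumption that is.na() reports both kinds of missing values for a declared object.

```r
library(declared)

# Hypothetical toy vector: two valid answers, one declared missing
# value (-91) and one empty NA
x <- declared(
  c(1, 2, -91, NA),
  labels = c("Good" = 1, "Bad" = 2, "Don't know" = -91),
  na_values = -91
)

# Both kinds of missing values count as missing for is.na()
n_missing <- sum(is.na(x))

# After undeclare(), -91 is an ordinary value and only the empty NA remains
n_empty <- sum(is.na(undeclare(x)))

# The difference is the number of declared missing values
n_declared <- n_missing - n_empty
```

Under these assumptions, n_declared counts the cases that w_table() would report on a separate row, below the valid values.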
The prefix w_ in the function name stands for “weighted”. Weighting is
another example of functionality where the declared missing values play
a different role than the empty, base NA missing values.
It is important to differentiate between frequency weights, on the one
hand, and other probability-based, post-stratification weights on the
other, the latter being thoroughly treated by the specialized package
survey. The w_ family of functions deals solely with frequency weights,
to allow corrections in descriptive statistics, such as the tables of
frequencies and other similar descriptive measures for both categorical
and numeric variables.
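What a frequency weight does to a table of frequencies can be sketched in base R alone, without any function from the package. The vectors answer and fw below are hypothetical toy data, and tapply() is used here only to illustrate the idea; it is not a description of how the w_ functions are implemented.

```r
# Hypothetical toy data: a categorical answer and one frequency weight per case
answer <- c("yes", "no", "yes", "yes", "no")
fw     <- c(0.5, 1.5, 1.0, 1.0, 1.0)

# Unweighted frequencies: every case counts exactly once
unweighted <- table(answer)

# Weighted frequencies: every case counts as much as its weight
weighted <- tapply(fw, answer, sum)
```

In this sketch the weights sum to the number of cases, so the total of the table is preserved and only the distribution across categories changes.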
To exemplify, a frequency weight variable is constructed to bring the sample distribution of gender (males and females) and of residential area (urban and rural settlements) in line with their theoretical distributions.
# Observed proportions
op <- with(dfm, proportions(table(Gender, Area)))
# Theoretical / population proportions:
# 53% Rural, and 50% Females
tp <- rep(c(0.53, 0.47), each = 2) * rep(0.5, 4)
weights <- tp / op
dfm$fweight <- weights[
match(10 * dfm$Area + dfm$Gender, c(11, 12, 21, 22))
]
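A quick way to check this kind of construction is to confirm that the weighted cell proportions reproduce the theoretical ones. The sketch below repeats the same steps on a hypothetical sample of ten cases, using plain numeric codes instead of declared variables; the names area, gender, w and wp are invented for illustration. It assumes R 4.0.1 or later, where proportions() is available.

```r
# Hypothetical toy sample of 10 cases, coded as in the data frame above
area   <- c(1, 1, 1, 1, 2, 2, 2, 2, 2, 2)   # 1 = Rural, 2 = Urban
gender <- c(1, 1, 2, 2, 1, 1, 1, 2, 2, 2)   # 1 = Males, 2 = Females

# Observed cell proportions, gender in rows and area in columns
op <- proportions(table(gender, area))

# Target proportions in the same column-major order:
# (Males, Rural), (Females, Rural), (Males, Urban), (Females, Urban)
tp <- rep(c(0.53, 0.47), each = 2) * 0.5

# One weight per cell, then one weight per case
w <- tp / op
fweight <- w[match(10 * area + gender, c(11, 12, 21, 22))]

# Weighted cell proportions: sum of weights per cell over the total weight
wp <- tapply(fweight, list(gender, area), sum) / sum(fweight)
```

In this sketch the columns of wp sum to 0.53 and 0.47 and its rows to 0.5 each, while the weights themselves sum to the number of cases.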
The updated frequency table, this time using the frequency weights,
can be constructed by passing the weights to the argument wt:
with(dfm, w_table(Opinion, wt = fweight, values = TRUE))
#>
#> fre rel per vld cpd
#> -----------------------------
#> Very bad 1 179 0.145 14.5 20.5 20.5
#> Bad 2 167 0.135 13.5 19.2 39.7
#> Neither 3 187 0.152 15.2 21.4 61.1
#> Good 4 171 0.139 13.9 19.6 80.7
#> Very good 5 168 0.136 13.6 19.3 100.0
#> ------
#> Don't know -91 179 0.145 14.5
#> NA 183 0.148 14.8
#> -----------------------------
#> 1234 1.000 100.0
Except for the empty NA values, for which the weights cannot be applied, almost all other frequencies (including the one for the declared missing value -91) are now updated by applying the weights. This shows that, despite being interpreted as “missing” values, the declared ones can and should also be weighted, with a very useful result. Other versions of weighted frequencies do exist in R, but a custom one was needed to identify (and weight) the declared missing values.
In the same spirit, many other similar functions are provided, such as
w_mean(), w_var(), w_sd() etc., and the list will likely grow in the
future. They are similar to their base package counterparts, with a
single difference: the argument na.rm is activated by default, with or
without weighting. This is a deliberate decision, about which users are
alerted in the functions’ respective help pages.
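The difference in the default treatment of missing values can be illustrated with a short comparison against base R. The vectors x and wt are hypothetical toy data, and the exact signature of w_mean() (in particular the argument wt) is assumed from the description above and should be checked against its help page.

```r
library(declared)

# Hypothetical toy values with one empty NA, plus frequency weights
x  <- c(2, 4, NA, 6)
wt <- c(1, 1, 1, 2)

# Base R keeps missing values unless told otherwise
base_default <- mean(x)
base_na_rm   <- mean(x, na.rm = TRUE)

# w_mean() is described as dropping missing values by default
w_default <- w_mean(x)

# Weighted version, with base R's weighted.mean() alongside for comparison
w_weighted    <- w_mean(x, wt = wt)
base_weighted <- weighted.mean(x, wt, na.rm = TRUE)
```

Here base_default is NA, while w_default is expected to equal base_na_rm, since both ignore the missing value.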
The package declared was built with the specific intention to provide a lightweight, zero-dependency resource in the R ecosystem. It already contains extensive, robust and ready-to-use functionality that duly takes into account the difference between empty and declared missing values.
It extends base R and opens up data analysis possibilities without
precedent. By providing generic classes for all its objects and
functions, package declared is easily extensible to any type of object,
for both creation and coercion to class "declared".