Commit c35deca

Merge pull request #183 from florianhartig/182-outlier (182 outlier)
2 parents a937ca4 + 20a4e9a

6 files changed: +203 −137 lines
(new file) +45

@@ -0,0 +1,45 @@
+---
+title: "Calibration of p-values"
+author: "Florian Hartig"
+date: "6/3/2020"
+output: html_document
+---
+
+```{r setup, include=FALSE}
+knitr::opts_chunk$set(echo = TRUE)
+```
+
+```{r}
+reps = 200
+
+pVals = matrix(ncol = 6, nrow = reps)
+
+for(i in 1:reps){
+  testData = createData(sampleSize = 10000, overdispersion = 0, pZeroInflation = 0, randomEffectVariance = 0, family = gaussian())
+  fittedModel <- lm(observedResponse ~ Environment1, data = testData)
+  simulationOutput <- simulateResiduals(fittedModel = fittedModel, n = 100)
+
+  pVals[i,1] = testOutliers(simulationOutput, plot = F, alternative = "two.sided")$p.value
+  pVals[i,2] = testOutliers(simulationOutput, plot = F, alternative = "greater")$p.value
+  pVals[i,3] = testOutliers(simulationOutput, plot = F, alternative = "less")$p.value
+
+  pVals[i,4] = testOutliers(simulationOutput, plot = F, alternative = "two.sided", margin = "upper")$p.value
+  pVals[i,5] = testOutliers(simulationOutput, plot = F, alternative = "greater", margin = "upper")$p.value
+  pVals[i,6] = testOutliers(simulationOutput, plot = F, alternative = "less", margin = "upper")$p.value
+}
+```
+
+```{r}
+par(mfrow = c(2,3))
+for(i in 1:6) hist(pVals[,i], breaks = 50)
+```
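The calibration script above relies on a basic property: under a correct null model, p-values are uniform on [0, 1], so the six histograms should look flat. A standalone base-R sanity check of that property (not part of the commit; it uses simulated t-tests rather than DHARMa, so it runs without the package):

```r
# Under a true null, p-values are uniform on [0, 1]: at alpha = 0.05 we
# expect about 5% rejections, and the histogram should be roughly flat.
set.seed(42)
reps = 2000
pVals = replicate(reps, t.test(rnorm(20))$p.value)  # the null is true here

mean(pVals < 0.05)      # close to 0.05
hist(pVals, breaks = 50)
```

This is the same visual check the Rmd applies to testOutliers(): miscalibration would show up as a histogram piling up near 0 or 1.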

DHARMa/DESCRIPTION

+1 −1

@@ -1,6 +1,6 @@
 Package: DHARMa
 Title: Residual Diagnostics for Hierarchical (Multi-Level / Mixed) Regression Models
-Version: 0.3.1
+Version: 0.3.1.1
 Date: 2020-05-10
 Authors@R: c(person("Florian", "Hartig", email = "florian.hartig@biologie.uni-regensburg.de", role = c("aut", "cre"), comment=c(ORCID="0000-0002-6255-9059")))
 Description: The 'DHARMa' package uses a simulation-based approach to create
DHARMa/R/tests.R

+35 −28

@@ -145,54 +145,61 @@ testQuantiles <- function(simulationOutput, predictor = NULL, quantiles = c(0.25

 #' Test for outliers
 #'
-#' This function tests if the number of observations that are strictly greater / smaller than all simulations are larger than expected
+#' This function tests if the number of observations outside the simulation envelope is larger or smaller than expected
 #'
 #' @param simulationOutput an object of class DHARMa with simulated quantile residuals, either created via \code{\link{simulateResiduals}} or by \code{\link{createDHARMa}} for simulations created outside DHARMa
-#' @param alternative a character string specifying whether the test should test if observations are "greater", "less" or "two.sided" compared to the simulated null hypothesis
+#' @param alternative a character string specifying whether the test should test if observations are "greater", "less" or "two.sided" (default) compared to the simulated null hypothesis
+#' @param margin whether to test for outliers only at the lower margin, only at the upper margin, or at both margins (default) of the simulated data distribution
 #' @param plot if T, the function will create an additional plot
 #' @details DHARMa residuals are created by simulating from the fitted model, and comparing the simulated values to the observed data. It can occur that all simulated values are higher or smaller than the observed data, in which case they get the residual value of 0 and 1, respectively. I refer to these values as simulation outliers, or simply outliers.
 #'
-#' Because no data was simulated in the range of the observed value, we actually don't know "how much" these values deviate from the model expecation, so the term "outlier" should be used with a grain of salt - it's not a judgement about the probability of a deviation from an expectation, but denotes that we are outside the simulated range. The number of outliers would usually decrease if the number of DHARMa simulations is increased.
+#' Because no data was simulated in the range of the observed value, we don't know "how strongly" these values deviate from the model expectation, so the term "outlier" should be used with a grain of salt - it is not a judgment about the magnitude of the deviation from an expectation, but simply denotes that we are outside the simulated range, and thus cannot say anything more about the location of the residual.
+#'
+#' Note also that the number of outliers will decrease as the number of simulations increases. Under the null hypothesis that the model is correct, we expect nData / (nSim + 1) outliers at each margin of the distribution. To see why, consider that if the data and the model distribution are identical, the probability that a given observation is higher than all simulations is 1 / (nSim + 1).
 #'
-#' The probability of an outlier depends on the number of simulations (in fact, it is 1/(nSim +1) for each side), so whether the existence of outliers is a reason for concern depends also on the number of simulations. The expected number of outliers is therefore binomially distributed, and we can calculate a p-value from that
+#' Based on this null expectation, we can test for an excess or a lack of outliers. By default, testOutliers() tests for both, so if you get a significant p-value, check whether you have too many or too few outliers. An excess of outliers means too many values lie outside the simulation envelope; this could be caused by overdispersion, or by what we classically call outliers. A lack of outliers could be caused, for example, by underdispersion.
 #'
 #'
 #' @author Florian Hartig
 #' @seealso \code{\link{testResiduals}}, \code{\link{testUniformity}}, \code{\link{testDispersion}}, \code{\link{testZeroInflation}}, \code{\link{testGeneric}}, \code{\link{testTemporalAutocorrelation}}, \code{\link{testSpatialAutocorrelation}}, \code{\link{testQuantiles}}
-#' @example inst/examples/testsHelp.R
+#' @example inst/examples/testOutliersHelp.R
 #' @export
-testOutliers <- function(simulationOutput, alternative = c("two.sided", "greater", "less"), plot = T){
-
-  out = list()
-  out$data.name = deparse(substitute(simulationOutput))
+testOutliers <- function(simulationOutput, alternative = c("two.sided", "greater", "less"), margin = c("both", "upper", "lower"), plot = T){

+  # check inputs
+  alternative = match.arg(alternative)
+  margin = match.arg(margin)
+  data.name = deparse(substitute(simulationOutput)) # remember: needs to be called before ensureDHARMa
   simulationOutput = ensureDHARMa(simulationOutput, convert = "Model")

-  alternative <- match.arg(alternative)
-
-  out$alternative = alternative
-  out$method = "DHARMa outlier test based on exact binomial test"
+  # calculation of outliers
+  if(margin == "both") outliers = sum(simulationOutput$scaledResiduals %in% c(0, 1))
+  if(margin == "upper") outliers = sum(simulationOutput$scaledResiduals == 1)
+  if(margin == "lower") outliers = sum(simulationOutput$scaledResiduals == 0)

-  outLow = sum(simulationOutput$scaledResiduals == 0)
-  outHigh = sum(simulationOutput$scaledResiduals == 1)
+  # calculation of trials and H0
+  outFreqH0 = 1 / (simulationOutput$nSim + 1) * ifelse(margin == "both", 2, 1)
   trials = simulationOutput$nObs
-  outFreqH0 = 1/(simulationOutput$nSim +1)

-  out$testlow = binom.test(outLow, trials, p = outFreqH0, alternative = "less")
-  out$testhigh = binom.test(outHigh, trials, p = outFreqH0, alternative = "greater")
+  out = binom.test(outliers, trials, p = outFreqH0, alternative = alternative)

-  out$statistic = c(outLow = outLow, outHigh = outHigh, nobs = trials, freqH0 = outFreqH0)
+  # overwrite information in the htest object returned by binom.test
+  out$data.name = data.name
+  out$margin = margin
+  out$method = "DHARMa outlier test based on exact binomial test"

-  if(alternative == "greater") p = out$testlow$p.value
-  if(alternative == "less") p = out$testhigh$p.value
-  if(alternative == "two.sided") p = min(min(out$testlow$p.value, out$testhigh$p.value) * 2,1)
+  names(out$statistic) = paste("outliers at", margin, "margin(s)")
+  names(out$parameter) = "simulations"
+  names(out$estimate) = paste("frequency of outliers (expected:", out$null.value, ")")

-  out$p.value = p
-  class(out) = "htest"
-
   if(plot == T) {

-    hist(simulationOutput, main = "", max(round(simulationOutput$nSim / 5), 20))
+    hist(simulationOutput, main = "")

     main = ifelse(out$p.value <= 0.05,
                   "Outlier test significant",

@@ -467,12 +474,12 @@ testTemporalAutocorrelation <- function(simulationOutput, time = NULL , alternat
 #' @export
 testSpatialAutocorrelation <- function(simulationOutput, x = NULL, y = NULL, distMat = NULL, alternative = c("two.sided", "greater", "less"), plot = T){

+  alternative <- match.arg(alternative)
+  data.name = deparse(substitute(simulationOutput)) # needs to be called before ensureDHARMa
   simulationOutput = ensureDHARMa(simulationOutput, convert = T)

   if(any(duplicated(cbind(x,y)))) stop("testing for spatial autocorrelation requires unique x,y values - if you have several observations per location, either use the recalculateResiduals function to aggregate residuals per location, or extract the residuals from the fitted object, and plot / test each of them independently for spatially repeated subgroups (a typical scenario would be repeated spatial observations, in which case one could plot / test each time step separately for temporal autocorrelation). Note that the latter must be done by hand, outside testSpatialAutocorrelation.")

-  alternative <- match.arg(alternative)
-
   if( (!is.null(x) | !is.null(y)) & !is.null(distMat) ) message("both coordinates and distMat provided, calculations will be done based on the distance matrix, coordinates will only be used for plotting")
   # if not provided, fill x and y with random numbers (Null model)
   if(is.null(x)){

@@ -498,7 +505,7 @@ testSpatialAutocorrelation <- function(simulationOutput, x = NULL, y = NULL, distMat = NULL, alternative = c("two.sided", "greater", "less"), plot = T){
   out$method = "DHARMa Moran's I test for spatial autocorrelation"
   out$alternative = "Spatial autocorrelation"
   out$p.value = MI$p.value
-  # out$data.name = deparse(substitute(simulationOutput))
+  out$data.name = data.name

   class(out) = "htest"
DHARMa/inst/examples/testOutliersHelp.R

(new file) +24

@@ -0,0 +1,24 @@
+set.seed(123)
+
+testData = createData(sampleSize = 200, overdispersion = 1, randomEffectVariance = 0)
+fittedModel <- glm(observedResponse ~ Environment1, family = "poisson", data = testData)
+simulationOutput <- simulateResiduals(fittedModel = fittedModel)
+
+# default outlier test (with plot)
+testOutliers(simulationOutput)
+
+# note that the default is to test outliers at both margins, for both an excess
+# and a lack of outliers. Here we see that we mostly have an excess of outliers
+# at the upper margin. You can see that it is an excess because the observed
+# frequency of outliers is 0.055, while the expected frequency is 0.008
+
+# let's see what would have happened if we had only checked the lower margin
+testOutliers(simulationOutput, margin = "lower", plot = FALSE)
+
+# now the frequency of outliers is 0, so we have too few, but this is n.s.
+# against the expectation
+
+# just for completeness, what would have happened if we had checked both
+# margins, but only for a lack of outliers (i.e. underdispersion)
+testOutliers(simulationOutput, alternative = "less", plot = FALSE)
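The expected frequency of 0.008 quoted in the example comments follows from the null expectation, assuming DHARMa's default of n = 250 simulations in simulateResiduals() (an assumption here, not stated in the diff). A quick base-R check of the formula used in testOutliers():

```r
# expected outlier frequency under H0: 1/(nSim + 1) per margin,
# doubled when both margins are tested
nSim = 250  # assumed default number of simulations in simulateResiduals()
outFreqH0 = function(margin) 1 / (nSim + 1) * ifelse(margin == "both", 2, 1)

round(outFreqH0("both"), 3)   # 0.008, the expected frequency in the example
round(outFreqH0("upper"), 3)  # 0.004 per single margin
```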

DHARMa/man/testOutliers.Rd

+25 −46 (generated file, not rendered)