---
title: "Wechsler Test of Adult Reading in Parkinson’s: a stable yet imperfect measure of premorbid cognitive function"
author:
  - name: Kyla-Louise Horne, PhD
    affiliation: '1'
    corresponding: yes
    address: 66 Stewart St, Christchurch 8011, New Zealand
    email: kyla.horne@nzbri.org
  - name: Reza Shoorangiz, PhD
    affiliation: '1'
  - name: Daniel J. Myall, PhD
    affiliation: '1'
  - name: Toni L. Pitcher, PhD
    affiliation: '1,2'
  - name: Tim J. Anderson, FRACP, MD
    affiliation: '1,2,3'
  - name: John C. Dalrymple-Alford, PhD
    affiliation: '1,2,4'
  - name: Michael R. MacAskill, PhD
    affiliation: '1,2'
shorttitle: WTAR premorbid measure in Parkinson's
output:
  papaja::apa6_pdf:
    latex_engine: xelatex
  papaja::apa6_word:
link-citations: yes
csl: format/bmj.csl
bibliography: format/wtar-references.bib
appendix: WTAR-supplementary.Rmd
linenumbers: yes
mask: no
draft: no
documentclass: apa6
classoption: man
affiliation:
  - id: '1'
    institution: New Zealand Brain Research Institute, 66 Stewart St,
      Christchurch, New Zealand
  - id: '2'
    institution: Department of Medicine, University of Otago, Christchurch;
      Christchurch, New Zealand
  - id: '3'
    institution: Department of Neurology, Christchurch Hospital, Christchurch,
      New Zealand
  - id: '4'
    institution: School of Psychology, Speech, and Hearing, University of
      Canterbury, Christchurch, New Zealand
abstract: Purpose. To assess long-term trajectories of Wechsler Test of Adult
  Reading (WTAR) scores in people with Parkinson’s as a function of age. The
  WTAR has been recommended as a measure of premorbid cognitive function in
  English speakers with Parkinson’s disease and as a reference against which to
  assess current cognitive status. For this, however, it needs to be shown that
  WTAR scores remain stable despite the substantial cognitive deterioration that
  can occur in Parkinson’s. Methods. From 252 Parkinson's and 57 Control
  participants who had completed at least two WTARs, we analyzed scores over
  time using latent class trajectory modeling. This allows individual
  participants to be classified into data-driven clusters, depending on the
  shape of their longitudinal trajectory. Results. WTAR scores were quite stable
  within both Control and Parkinson's participants, even for those who
  progressed to dementia. This validates it as a research tool for comparing
  premorbid function at a group level. In both Parkinson's and Controls, and
  regardless of current cognitive status, the distribution of scores was,
  however, higher than expected from the population norms, making it an
  unreliable benchmark against which to detect cognitive decline at an
  individual level. Conclusion. The WTAR is stable in Parkinson's even when
  participants decline from normal cognitive function to early dementia.
  Nonetheless, its apparent over-estimation of premorbid IQ and the
  impracticality of implementing analogous tests in many other languages make
  it poorly suited to detecting current cognitive impairment in individuals
  with Parkinson's.
keywords: Parkinson disease, cognitive impairment, dementia, neuropsychology,
  premorbid function
wordcount: 'Abstract: 239 (limit 250), main body: 3870 (limit 4000)'
editor_options:
  chunk_output_type: console
---
```{r setup, include=FALSE}
## NB notice the chunk_output_type: console setting in the YAML header above.
## This was part of addressing an issue where one of the linear models would
## converge in R but kept failing during knitting in R Markdown. The problem
## seemed to be caused by the parallel package and how it generates random
## numbers, so we now create/destroy a new parallel pool within the for-loop
## (problem and solution identified by Reza).
knitr::opts_chunk$set(
  echo = FALSE,
  message = FALSE,
  warning = FALSE)
options(knitr.kable.NA = '')
```
```{r define-variables}
################################################################################
### 1. NZBRI users can save the time taken to re-import live data online
### every time this document is generated, by setting the IMPORT_LIVE_DATA
### variable to be FALSE. This will instead import static values from a
### locally-cached .csv data file.
### 2. That anonymised cached version of the dataset is distributed publicly
### with these files, so that external (non-NZBRI) users can also generate
### and alter the document, without needing the live data import functions
### provided by the in-house `chchpd` package.
### 3. You should usually choose not to re-compute the latent class models on
### each run, as this is very time-consuming (hours to days). Instead use
### the cached model objects to save re-computing them each time, by setting
### RUN_NEW_CUBIC_MODELS and RUN_NEW_LINEAR_MODELS to FALSE.
################################################################################
IMPORT_LIVE_DATA = FALSE
RUN_NEW_CUBIC_MODELS = FALSE
RUN_NEW_LINEAR_MODELS = FALSE
################################################################################
# define the maximum number of classes to run models to:
max_classes_cubic = 5
max_classes_linear = 5
# define colour scheme values for normal, MCI, and dementia:
cog_colours <- c('#339900', '#FF9326', '#D92121')
# colour schemes for two categories:
blue_orange = c('#599BC2', '#E08F6B')
blue_orange_sat = c('#0072B2', '#D55E00')
blue_brown = c('#599BC2', '#CD8969')
blue_brown_sat = c('#0072B2', '#B85628')
border_colour = '#8E2226' # JMD reddish
fig_1_width = 8.5; fig_1_height = 20.0
fig_2_width = 17.5; fig_2_height = 21.5
DPI = 600 # set resolution of exported PNGs
```
```{r import-packages}
# An NZBRI-only in-house package for importing the latest live data:
if (IMPORT_LIVE_DATA) {
library(chchpd) # initial install via devtools::install_github('nzbri/chchpd')
}
library(readr) # to read/write locally cached .csv data files
library(dplyr) # for data manipulation
library(tidyr) # for fill & pivot_
library(lubridate) # for date intervals
library(magrittr) # for the %<>% operator
library(ggplot2) # for visualisation
library(scales) # improved labelling in ggplot
library(ggnewscale) # allow multiple colour scales in ggplot
library(directlabels)# to label lines on plots
library(cowplot) # combine multiple separate plots in one figure
library(lcmm) # latent class models
library(parallel) # for parallel computation
library(huxtable) # tables that work in Word output
library(knitr) # for kable pretty tables
library(kableExtra) # extra kable options
library(papaja) # for apa_table
library(stringr) # to manipulate apa_table output
```
```{r font-issues}
# Needed to get around a bug, introduced by a macOS Monterey update, in which
# fonts could not be found:
library(showtext)
# Using my favourite font will limit portability of this script:
#font_add('GillSans', '/System/Library/Fonts/Supplemental/GillSans.ttc')
# So instead go online for a font that everyone can access, regardless of
# platform (Lato is not as beautiful as Gill Sans but an okay-ish substitute).
font_add_google(name = 'Lato', family = 'Lato', regular.wt = 400, bold.wt = 700)
# Note: a lighter regular weight can be chosen for this font if desired (300 or
# even 100, see https://fonts.google.com/specimen/Lato).
showtext_auto()
# needed to avoid microscopic text in PNG output:
showtext_opts(dpi = DPI)
# see https://community.rstudio.com/t/font-gets-really-small-when-saving-to-png-using-ggsave-and-showtext/147029/7
```
```{r import-data}
# evaluated only if the chchpd package has been imported, to
# access live data sources:
if (IMPORT_LIVE_DATA) { # put this here for when not knitting
participants = chchpd::import_participants(anon_id = FALSE) %>%
rename(group = participant_group)
sessions = chchpd::import_sessions() %>%
# the PDD follow-up sessions were mostly not coded as part of the
# Progression study, so gather sessions from both:
filter(study %in% c('Follow-up', 'PDD Followup')) %>%
# but some sessions *were* entered in both studies, so remove any duplicates:
group_by(session_id) %>%
filter(row_number() == 1) %>%
# this person's session seems to have a fictitious WTAR score (none should
# have been gathered on that session, no original can be found,
# and the value changes so markedly from the two previous ones that it
# results in a separate class just to fit this one person). Can
# drop this filter step once the value is removed from the database:
filter(session_id != '310PDS_2020-03-04')
np = chchpd::import_neuropsyc()
updrs = chchpd::import_motor_scores()
}
```
```{r collate-dataset}
if (IMPORT_LIVE_DATA) {
dat = right_join(participants, sessions, by = 'subject_id') %>%
left_join(np, by = 'session_id') %>%
left_join(updrs, by = 'session_id')
dat = dat %>%
filter(np_group %in% c('Control', 'PD'), # exclude atypical cases
!is.na(cognitive_status), # remove not-yet classified sessions
!is.na(WTAR)) %>%
mutate(group = factor(group, levels = c('Control', 'PD'))) %>%
mutate(cognitive_status = factor(cognitive_status,
levels = c('N', 'MCI', 'D'),
labels = c('Normal', 'MCI', 'Dementia'))) %>%
group_by(subject_id) %>%
arrange(years_from_baseline) %>%
mutate(assessment_num = row_number()) %>%
mutate(n_sessions = max(assessment_num)) %>%
filter(n_sessions > 1) %>% # exclude cases without follow-up
mutate(ever_demented = 'Dementia' %in% cognitive_status) %>%
# to make line colours match the destination point rather than the origin
# point, need to use a diagnosis variable that leads the current diagnosis
# for each person by one step:
mutate(cognitive_status_lead = lead(cognitive_status)) %>%
# this makes the last value for each person NA, which can cause issues.
# fill() replaces them with the latest previous diagnosis:
fill(cognitive_status_lead) %>%
mutate(sampling_interval =
years_from_baseline - lag(years_from_baseline)) %>%
# should be much the same but here to be explicit:
mutate(wtar_interval =
interval(start = lag(session_date), end = session_date)/years(1)) %>%
ungroup()
# lcmm insists on access to a numeric subject variable:
dat = dat %>%
mutate(subject_id = factor(subject_id)) %>%
mutate(subject_id_num = as.integer(subject_id))
# for lcmm we standardise the age to avoid problems with polynomial effects,
# as per Proust-Lima et al. (2017). Note that we are using the grand mean age at
# assessment, rather than averaging across some landmark age for each subject.
# We use the same standardisation for the linear models, so we can compare them.
dat$age_std = (dat$age - mean(dat$age))/10
dat = dat %>%
arrange(subject_id_num, age) %>%
select(subject_id_num, group, sex, ethnicity, symptom_onset_age,
diagnosis_age, education, age, age_std, wtar_interval, group,
cognitive_status, cognitive_status_lead, ever_demented, WTAR, MoCA,
global_z, attention_domain, executive_domain, visuo_domain,
learning_memory_domain, language_domain, H_Y, Part_III)
# store a cached version of the dataset for optional access on subsequent runs
# of generating this document:
dat %>% write_csv(file = 'data/dat_wtar.csv')
}
# Regardless of whether IMPORT_LIVE_DATA has been set to TRUE, we import from
# the cached dataset to ensure consistency:
dat = read_csv(file = 'data/dat_wtar.csv') %>%
mutate(cognitive_status =
factor(cognitive_status,
levels = c('Normal', 'MCI', 'Dementia')),
cognitive_status_lead =
factor(cognitive_status_lead,
levels = c('Normal', 'MCI', 'Dementia')),
cognitive_group =
case_when(group == 'Control' ~ 'Control',
cognitive_status == 'Normal' ~ 'PDN',
cognitive_status == 'MCI' ~ 'PD-MCI',
cognitive_status == 'Dementia' ~ 'PDD'),
cognitive_group =
factor(cognitive_group,
levels = c('Control', 'PDN', 'PD-MCI', 'PDD'))) %>%
# labels to avoid abbreviations in Figure 1:
mutate(panel_label_1 =
factor(cognitive_group,
labels = c('Control', 'PD unimpaired', 'PD-MCI', 'PD dementia'))) %>%
# divide subjects into three panels for Figure 2:
mutate(panel_label_2 =
case_when(group == 'Control' ~ 'Control',
ever_demented == FALSE ~ 'PD non-dementia',
TRUE ~ 'PD dementia'),
panel_label_2 =
factor(panel_label_2,
levels = c('Control', 'PD non-dementia', 'PD dementia'))) %>%
# do this to avoid an issue with lcmm not correctly detecting numeric
# column types in a tibble:
as.data.frame()
```
# Introduction
The prevalence of Parkinson's is
rising,[@myall2017_ParkinsonOldestOld; @pitcher2018_ParkinsonDiseaseEthnicities]
and hence the burden of dementia associated with the disease will also continue
to grow. There is therefore value in detecting the transitional stage of mild
cognitive impairment in Parkinson's, in particular because it may provide a
therapeutic window for future treatments, prior to the irreversible pathological
damage associated with dementia. To provide for consistent delineation of this
transitional state, the International Parkinson and Movement Disorder
Society (MDS) published guidelines in 2012 for the diagnosis of mild cognitive
impairment in Parkinson's (PD-MCI).[@litvan2012_DiagnosticCriteriaMild]
When the diagnosis is based on a comprehensive battery of multiple individual
neuropsychological tests (rather than on a single scale of global cognitive
ability), the MDS guidelines propose that significant impairment may be
determined in three ways. Firstly, current performance on a test can be shown to
be below appropriate norms (which may be established either relative to
standardized norms for that test, or relative to a local matched control group).
This approach has the disadvantage that it cannot distinguish whether poor
current performance is due to a recent decline (such as the onset of MCI or
dementia) or to a long-standing low level of cognitive ability. To
unambiguously detect recent impairment in cognitive status therefore requires
some way of demonstrating a _decline_ in function within an individual, rather
than simply poor current performance. To achieve this, the guidelines suggest
seeking evidence of either a decline in performance on serial testing, or, in
the absence of such prior testing, evidence of a decline from an individual's
_estimated_ premorbid level of function. For pragmatic reasons, however, methods
based upon current norms still tend to predominate in research
studies.[@aarsland2021_ParkinsonDiseaseassociatedCognitive] In practice, one
seldom has reference to prior formal serial testing. Even in a well-resourced
research setting, serial testing remains somewhat ambiguous, as the first
available test session may itself be contaminated by early disease-related
cognitive impairment. This would produce a falsely low baseline and thereby
impede the ability to detect cognitive decline.
A reliable method of estimating premorbid cognitive function is therefore
appealing. Establishing a person's premorbid ability addresses the shortcomings
of both the current-norms approach (by showing whether current poor performance
is indeed a decline from past status), and the issue with the timing of serial
testing (which will usually begin only after disease diagnosis). The Wechsler
Test of Adult Reading (WTAR)[@psychologicalcorporation2001_WTARWechslerTest] was
suggested in the MDS guidelines as being suitable for this purpose (along with
the National Adult Reading Test,
NART).[@strauss2006_CompendiumNeuropsychologicalTests] Both tests are based on
the ability to correctly pronounce phonetically-irregular English words, such as
'porpoise' or 'hyperbole'. Unlike many neuropsychological functions, this
ability has been shown to be preserved in the face of both normal ageing and a
broad range of brain
insults.[@psychologicalcorporation2001_WTARWechslerTest; @strauss2006_CompendiumNeuropsychologicalTests]
When the WTAR was initially normed, it was tested in Parkinson's as well as a
number of other neurological or psychiatric disorders, such as Huntington's,
schizophrenia, and traumatic brain
injury.[@psychologicalcorporation2001_WTARWechslerTest] Only in Alzheimer's
disease were scores shown to be significantly lower than in matched
controls.[@psychologicalcorporation2001_WTARWechslerTest; @strauss2006_CompendiumNeuropsychologicalTests]
Additionally in the Alzheimer's group, WTAR scores were lower in those with
lower overall current cognitive
status,[@strauss2006_CompendiumNeuropsychologicalTests] further indicating that
it is not a stable measure of premorbid function in that condition. Those
results were, however, published in 2001, well before the formal MDS guidelines
on diagnosing PD-MCI[@litvan2012_DiagnosticCriteriaMild] and Parkinson's
dementia[@dubois2007_DiagnosticProceduresParkinson] were promulgated. It is
therefore unknown to what extent the findings from that sample are valid in the
face of the significant cognitive deterioration that can occur in Parkinson's,
as it could not have been formally diagnosed at the time. Moreover, their
Parkinson's sub-sample comprised only 10 people: given the marked
heterogeneity of function in this disorder, this restricts the validity of
generalizing from those findings. We know of only one study in which the WTAR
has actually been used as a criterion against which to establish mild cognitive
impairment in Parkinson's.[@marras2013_MeasuringMildCognitive] The proportion
classified as PD-MCI using WTAR-estimated premorbid functioning was strikingly
higher than when using the more conventional method of comparing current
performance against norms (79% vs 33%). This naturally leads to questions about
whether the
WTAR is in fact a valid measure to be used for cognitive classification in
Parkinson's. In particular, if using the WTAR leads to most people with
Parkinson's being classified as having MCI, does that result in a category that
no longer has any prognostic or discriminative utility?
Performance on the WTAR is known to be impacted negatively by Alzheimer’s
dementia, but as noted the initial claim of stability in Parkinson’s was based
on very limited evidence.[@psychologicalcorporation2001_WTARWechslerTest] We
therefore sought to examine WTAR scores longitudinally in a large cohort whose
cognitive status had been formally characterized, spanning the range from normal
function through to mild cognitive impairment and dementia. We included only
those participants who had completed a minimum of two WTARs, so that the
trajectory of the measure within individuals over time could be examined. To do
this, we used latent class trajectory analysis[@proust-lima2017estimation-of-e].
This allows for detecting heterogeneous patterns of change over time in a
data-driven fashion, rather than assigning individuals to _a priori_ sub-groups.
That is, we combined both Control and Parkinson's participants in a single pool.
The modeling process examines individual subjects and assigns them into clusters
depending on similarities in their longitudinal trajectories. If the presence of
either Parkinson's or cognitive impairment affects longitudinal WTAR scores,
this should be reflected by differential membership across groups in the
resulting data-driven trajectory clusters. For example, Parkinson’s participants
with normal cognition might cluster together with most of the Controls in a
“flat” trajectory class, while those showing cognitive impairment might cluster
in one or more trajectory classes showing decline over time. Conversely, if the
WTAR is truly stable in people with Parkinson’s, we would expect them to cluster
together with controls in relatively flat trajectory classes, regardless of
their current level of cognitive function. In particular, we sought to examine
whether WTAR performance declines in dementia due to Parkinson's, as it does
in Alzheimer's. If so, this would impact its usefulness in estimating premorbid
function in the way proposed in the MDS guidelines for cognitive diagnosis.
# Methods
## Participants
```{r descriptives}
descriptives = dat %>%
group_by(subject_id_num) %>%
mutate(session_num = row_number(),
n_sessions = max(session_num)) %>%
slice_head() %>% # one row per subject
ungroup()
n_group = descriptives %>%
group_by(group) %>%
summarise(n = n(),
min_sessions = min(n_sessions),
mean_sessions = mean(n_sessions),
sd_sessions = sd(n_sessions),
max_sessions = max(n_sessions))
n_wtar = descriptives %>%
summarise(n = n(),
min_sessions = min(n_sessions),
mean_sessions = mean(n_sessions),
sd_sessions = sd(n_sessions),
max_sessions = max(n_sessions),
total_sessions = sum(n_sessions))
n_ethnicity = descriptives %>%
group_by(group, ethnicity) %>%
summarise(n = n()) %>%
# remove some (currently 3) missing values, which otherwise cause problems:
drop_na()
n_pd_with_ethnicity =
with(n_ethnicity, sum(n[group == 'PD']))
n_pd_pakeha =
with(n_ethnicity, n[group == 'PD' & ethnicity == 'New Zealand European'])
n_pd_maori =
with(n_ethnicity, n[group == 'PD' & ethnicity == 'Maori'])
n_pd_samoan =
with(n_ethnicity, n[group == 'PD' & ethnicity == 'Samoan'])
n_pd_indian =
with(n_ethnicity, n[group == 'PD' & ethnicity == 'Indian'])
n_pd_other =
with(n_ethnicity, n[group == 'PD' & ethnicity == 'Other'])
n_control_with_ethnicity =
with(n_ethnicity, sum(n[group == 'Control']))
n_control_pakeha =
with(n_ethnicity, n[group == 'Control' & ethnicity == 'New Zealand European'])
n_control_other =
with(n_ethnicity, n[group == 'Control' & ethnicity == 'Other'])
WTAR_by_status = dat %>%
group_by(cognitive_group) %>%
summarise(wtar_mean = mean(WTAR),
wtar_median = median(WTAR),
wtar_sd = sd(WTAR),
wtar_n = n())
durations = dat %>% # 1155 rows
filter(group == 'PD') %>% # 902 rows
group_by(subject_id_num) %>%
arrange(age) %>%
mutate(disease_duration = age - diagnosis_age) %>%
summarise(dur_at_onset = first(disease_duration), # 252 rows
dur_at_end = last(disease_duration),
dur_of_followup = dur_at_end - dur_at_onset) %>%
summarise(mean_at_onset =
format(round(mean(dur_at_onset), digits = 1), digits = 2, nsmall = 1), # 1 row
min_at_onset = round(min(dur_at_onset)),
max_at_onset = round(max(dur_at_onset)),
mean_followup =
format(round(mean(dur_of_followup), digits = 1), digits = 2, nsmall = 1),
min_followup = round(min(dur_of_followup)),
max_followup = round(max(dur_of_followup)))
```
The New Zealand Parkinson's Progression Programme (NZP^3^) is a longitudinal
study of a convenience prevalence sample of idiopathic Parkinson's
participants,[@macaskill2022_NewZealandParkinsona] largely recruited from the
specialist Movement Disorders Clinic at the New Zealand Brain Research Institute
(NZBRI). Ethical approval was granted by the Southern Health and Disability
Ethics Committee of the New Zealand Ministry of Health. The ongoing recruitment
commenced in 2007, and includes patients ranging from the recently-diagnosed to
those with advanced disease. Inclusion and exclusion criteria have been
described previously.[@wood2016_DifferentPDMCICriteria] For this analysis, we
selected data from the `r n_group$n[n_group$group =='PD']` Parkinson's and
`r n_group$n[n_group$group == 'Control']` Control participants who had completed
at least two WTARs (range 2 – `r n_wtar$max_sessions`, mean =
`r round(n_wtar$mean_sessions, digits = 1)`, total number of measures =
`r n_wtar$total_sessions`). The distribution of intervals between successive WTAR
assessments was multi-modal, mostly clustering around one and two years, due to
varying follow-up schedules over the course of the wider study (see Supplementary
Material). Of the Parkinson's participants, the mean duration since diagnosis
at the first WTAR assessment was `r durations$mean_at_onset` years (range
`r durations$min_at_onset` – `r durations$max_at_onset`) and the mean duration
of follow-up was a further `r durations$mean_followup` years (range
`r durations$min_followup` – `r durations$max_followup`). All participants were
classified as having normal cognition, mild cognitive impairment, or dementia
via a comprehensive Level II neuropsychological
battery,[@dalrymple-alford2011_CharacterizingMildCognitive;@wood2016_DifferentPDMCICriteria]
administered in accordance with MDS
guidelines.[@dubois2007_DiagnosticProceduresParkinson;@litvan2012_DiagnosticCriteriaMild]
A PD-MCI classification was based on at least two test scores falling $\geq$ 1.5
standard deviations below norms in at least one cognitive domain, without
significant impairment in activities of daily living. For PD dementia, a
significant impairment ($\geq$ 2.0 standard deviations below normative data) in
at least two cognitive domains was required, as well as evidence of significant
impairment in activities of daily
living.[@dalrymple-alford2011_CharacterizingMildCognitive;@wood2016_DifferentPDMCICriteria]
The WTAR itself was not used in the cognitive diagnostic classification
procedures. Characteristics of the sample are shown in Table \@ref(tab:table-1).
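
As a purely illustrative, non-evaluated sketch (not the procedure used for the
actual classifications, and based on hypothetical per-test z-scores and one
reading of the criteria above), the decision rule can be expressed as:

```r
# Illustrative only: classify cognitive status from hypothetical per-test
# z-scores grouped by cognitive domain, plus an activities-of-daily-living flag.
classify_cognition <- function(z_by_domain, adl_impaired) {
  # count, per domain, the test scores falling 1.5 or 2.0 SD below norms:
  n_below_1_5 <- sapply(z_by_domain, function(z) sum(z <= -1.5))
  n_below_2_0 <- sapply(z_by_domain, function(z) sum(z <= -2.0))
  if (sum(n_below_2_0 >= 1) >= 2 && adl_impaired) {
    'Dementia' # impairment >= 2.0 SD in at least two domains, plus ADL impairment
  } else if (any(n_below_1_5 >= 2) && !adl_impaired) {
    'MCI'      # at least two scores >= 1.5 SD below norms within one domain
  } else {
    'Normal'
  }
}

# e.g. two attention scores 1.6 and 1.8 SD below norms, ADL intact -> 'MCI':
classify_cognition(list(attention = c(-1.6, -1.8), memory = c(0.2, -0.4)),
                   adl_impaired = FALSE)
```
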
All participants were speakers of New Zealand English, and were assessed against
the US norms of the WTAR, corrected for age, sex, and years of education. The
provided ethnicity classifications are not applicable in a New Zealand context
and all participants were assessed against norms for "Whites". Of the
`r n_pd_with_ethnicity` Parkinson's participants with a recorded ethnicity,
`r n_pd_pakeha`
(`r scales::label_percent(accuracy = 0.1)(n_pd_pakeha/n_pd_with_ethnicity)`)
identified as Pākehā (New Zealand European), `r n_pd_maori` as Māori,
`r n_pd_samoan` as Samoan, `r n_pd_indian` as Indian, and `r n_pd_other` were
'Other'. Of the Controls, `r n_control_pakeha`
(`r scales::label_percent(accuracy = 0.1)(n_control_pakeha/n_control_with_ethnicity)`)
identified as Pākehā and `r n_control_other` was 'Other'.
## Latent class trajectory modeling
We fitted latent class trajectory models using the _hlme_ function from the
_lcmm_ package [@proust-lima2017estimation-of-e] (version
`r packageVersion('lcmm')`), running in the R statistical environment
[@R-Core-Team_R_2021] (version `r R.version$major`.`r R.version$minor`), and
used the _tidyverse_ constellation of packages [@wickham2019_WelcomeTidyverse]
for data manipulation and visualisation. The dependent variable was
WTAR-estimated premorbid IQ, initially modeled within each subject as a
polynomial (cubic) function of age, to allow for non-linear trajectories. The
resulting trajectories were close to straight lines (see Supplementary Material)
so we simplified the models to be a linear function of age. The models were not
informed by any other variables (such as the diagnostic categories of PD-MCI or
dementia). We used age rather than disease duration as the time metric to allow
for direct comparison against any aging effect in controls.
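
A minimal, non-evaluated sketch of the two model specifications follows (the
full fitting code, including the multi-class grid searches, is in the analysis
chunks below; only a one-class fit of each form is shown here, where `age_std`
is age centred on the sample mean and scaled to decades):

```r
# Sketch of the hlme specifications used in the analysis chunks below:
cubic_fit <- lcmm::hlme(WTAR ~ poly(age_std, degree = 3, raw = TRUE),
                        random = ~ 1 + poly(age_std, degree = 3, raw = TRUE),
                        subject = 'subject_id_num', ng = 1, data = dat)
linear_fit <- lcmm::hlme(WTAR ~ age_std,
                         random = ~ 1 + age_std,
                         subject = 'subject_id_num', ng = 1, data = dat)
```
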
A latent class model is fitted by first specifying an _a priori_ number of
classes (i.e. clusters) into which the participants can be divided. For example,
if just one cluster is specified, then the trajectory that is produced will
simply describe the average change in score over time for all participants,
disregarding any possible sub-groups. If two classes are specified, then the
model will determine two trajectories that optimally separate the total sample
into two clusters, on the basis of differing performance over time. Multiple
independent models are fitted, with increasing numbers of classes. Those models
are then compared to see which produces the best description of the data. The
classes are termed 'latent' as they are not known _a priori_, but instead arise
from the data. The resulting classes can, however, then be compared to known
groupings (such as Parkinson's vs Controls). This allows us to examine to what
extent performance over time might be driven by known factors. For example, if
Control and Parkinson's participants perform differently over time, they should
fall unequally into the various latent classes (for models with more than one
class).
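
As a sketch of that comparison (using the object names from the analysis code
below, and not evaluated here), the modal class assignments can be
cross-tabulated against the known groups:

```r
# Sketch: compare data-driven class assignments against the known groups.
# pprob holds each subject's posterior class probabilities and modal class.
assignments <- chosen_linear_model$pprob %>%
  left_join(distinct(dat, subject_id_num, group), by = 'subject_id_num')
table(assignments$group, assignments$class)              # observed counts
chisq.test(table(assignments$group, assignments$class))  # test of association
```
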
We fitted five separate models, with the specified number of latent classes
ranging from 1 to 5. The _gridsearch_ function of the _lcmm_ package was used
to run 500 departures from random initial values for each multi-class model,
using the one-class model to generate the starting values. The
maximum number of iterations within the _hlme_ function was set at 1000.
Parameters corresponding to the best log-likelihood were used as initial values
for the final estimation of the parameters[@proust-lima2017estimation-of-e]. The
optimal model of the five candidates was selected on the basis of it converging
and having the lowest BIC (Bayesian information criterion). That is, this
allowed us to determine whether longitudinal performance on the WTAR was best
described by the participants falling into either 1, 2, 3, 4, or 5 underlying
trajectory classes.
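
A condensed, non-evaluated sketch of this procedure for one candidate model
follows (the full version, including caching and parallel computation, appears
in the analysis chunks below):

```r
# Sketch: 500 random departures from the one-class solution for a two-class
# linear model, then comparison of candidates on convergence and BIC:
two_class_fit <- lcmm::gridsearch(
  rep = 500, maxiter = 100, minit = linear_models[[1]],
  hlme(WTAR ~ age_std, random = ~ 1 + age_std, mixture = ~ age_std,
       ng = 2, nwg = TRUE, maxiter = 1000,
       subject = 'subject_id_num', data = dat))
summarytable(linear_models[[1]], two_class_fit,
             which = c('G', 'loglik', 'conv', 'BIC', 'entropy', '%class'))
```
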
When reporting values from the chosen model, the interval in brackets following
each maximum likelihood estimate (MLE) was formed as the MLE minus and plus 1.96
times the standard error given by _hlme_.
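
For example, using a hypothetical helper (the same arithmetic as in the
results-formatting code below; `mle_with_interval` is illustrative only and not
part of the analysis):

```r
# Sketch: report an estimate with its interval as MLE [MLE - 1.96*SE, MLE + 1.96*SE]
mle_with_interval <- function(estimate, se) {
  sprintf('%.1f [%.1f, %.1f]',
          estimate, estimate - 1.96 * se, estimate + 1.96 * se)
}
mle_with_interval(102.3, 0.9)  # "102.3 [100.5, 104.1]"
```
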
```{r table-1}
table_df = dat %>%
group_by(subject_id_num) %>%
arrange(age) %>%
slice_tail() %>% # last observation per subject
group_by(group) %>%
summarise(n = n(),
Male = scales::percent_format(accuracy = 0.1)(sum(sex == 'Male')/n),
Age = paste0(sprintf('%.1f', mean(age)), ' (', sprintf('%.1f',sd(age)), ')'),
Education = paste0(sprintf('%.1f', mean(education)), ' (', sprintf('%.1f', sd(education)), ')'),
WTAR = paste0(sprintf('%.1f', mean(WTAR)), ' (', sprintf('%.1f', sd(WTAR)), ')'),
Normal = scales::percent_format(accuracy = 0.1)(sum(cognitive_status == 'Normal')/n),
MCI = scales::percent_format(accuracy = 0.1)(sum(cognitive_status == 'MCI')/n),
Dementia = scales::percent_format(accuracy = 0.1)(sum(cognitive_status == 'Dementia')/n)) %>%
mutate(n = as.character(n)) %>%
pivot_longer(n:Dementia) %>%
pivot_wider(names_from = 'group')
table_df %>%
mutate(name = case_when(name == 'Age' ~ 'Age (SD)',
name == 'Education' ~ 'Years of education (SD)',
name == 'WTAR' ~ 'WTAR (SD)',
name == 'Normal' ~ 'Normal cognition',
TRUE ~ name)) %>%
rename(' ' = name,
"Parkinson's" = PD) %>%
apa_table(caption = "Demographics of the Parkinson's and Control groups, and cognitive status as at their latest assessment.", align = 'lrr')
```
## Reproducibility
The code and anonymised dataset extracts sufficient to reproduce the analyses
and generate this manuscript are publicly available at
[github.com/nzbri/wtar-trajectory](https://github.com/nzbri/wtar-trajectory).
A number of decisions can affect the outcome of a latent class modeling
analysis. For reproducibility, we therefore report in the Supplementary Material
our performance against the 16-item checklist "Guidelines for
Reporting on Latent Trajectory Studies".[@van_de_schoot_grolts-checklist_2017]
# Results
```{r wtar-modeling-cubic, include=FALSE}
# prepare list to hold models with 1 through max_classes_cubic classes:
cubic_models = list()
if (RUN_NEW_CUBIC_MODELS) {
# currently have to coerce data to a dataframe (done above), as tibbles lead
# to a bug with the subject id column not being accepted by lcmm as numeric.
# use the 1-class model to set initial start values:
cubic_models[[1]] <-
lcmm::hlme(WTAR ~ poly(age_std, degree = 3, raw = TRUE),
random = ~1 + poly(age_std, degree = 3, raw = TRUE),
subject = 'subject_id_num', ng = 1, data = dat)
saveRDS(cubic_models[[1]], 'models/cubic_WTAR_1.RDS') # cache to disk
# for each higher-class model, run with 500 random starts based on the
# initial one-class model, then run the selected one for up to 1000 iterations
# towards (hopeful) convergence:
for (class_num in 2:max_classes_cubic) {
print(class_num)
# make use of parallel computation to speed up working through the large
# number of iterations to be used for each model. Needs to be done within
# the loop to work around an issue when knitting in R Markdown with the
# generation of random numbers in the parallel package:
num_cores = parallel::detectCores()
cl = parallel::makeCluster(num_cores)
# need to explicitly pass some variables to the namespace of each process,
# or the gridsearch and hlme functions below will trip up by not "knowing"
# some of their parameters:
parallel::clusterExport(cl, c('class_num', 'cubic_models'))
cubic_models[[class_num]] <-
lcmm::gridsearch(rep = 500, maxiter = 100, minit = cubic_models[[1]], cl = cl,
hlme(WTAR ~ poly(age_std, degree = 3, raw = TRUE),
random = ~1 + poly(age_std, degree = 3, raw = TRUE),
ng = class_num, nwg = TRUE, maxiter = 1000,
mixture = ~ poly(age_std, degree = 3, raw = TRUE),
subject = 'subject_id_num', data = dat))
parallel::stopCluster(cl)
saveRDS(cubic_models[[class_num]], paste0('models/cubic_WTAR_',
class_num, '.RDS'))
}
} else { # use previously-computed cubic models:
for (class_num in 1:max_classes_cubic) {
cubic_models[[class_num]] = readRDS(paste0('models/cubic_WTAR_',
class_num, '.RDS'))
}
}
cubic_WTAR_model_table =
summarytable(cubic_models[[1]], cubic_models[[2]], cubic_models[[3]],
cubic_models[[4]], cubic_models[[5]],
which = c('G', 'loglik', 'conv', 'BIC', 'entropy',
'%class')) %>%
as_tibble() %>%
mutate(conv = factor(conv, levels = c(1, 2, 3, 4, 5),
labels = c('yes', 'no', '', 'optimisation problem',
'optimisation problem'))) %>%
rename(converged = conv)
# manually select model, on the basis of one that converged & had lowest BIC:
chosen_cubic_classes = 1
chosen_cubic_model = cubic_models[[chosen_cubic_classes]]
# create a vector of ages to predict against, centred on the mean across all
# sessions, and in decades rather than years to limit polynomial instabilities:
mean_age = mean(dat$age)
new_data_cubic =
data.frame(age = seq(from = 43, to = 89, by = 0.5)) %>%
mutate(age_std = (age - mean_age)/10)
# create a dataframe that contains predictions for all candidate models
# (will be used in the supplementary material).
predictions = list()
for (i in 1:max_classes_cubic) {
prediction = lcmm::predictY(cubic_models[[i]], newdata = new_data_cubic,
var.time = 'age_std', draws = TRUE)
predictions[[i]] =
as.data.frame(prediction$pred) %>% # extract just the predicted values
cbind(new_data_cubic) # and add in the corresponding age values
predictions[[i]]$ng = i # and label with the number of classes
}
# the first one needs to be modified so that it can act as the template to
# row bind to the others:
predictions[[1]] = predictions[[1]] %>% # add a suffix to the variable names
rename('Ypred_class1' = 'Ypred',
'lower.Ypred_class1' = 'lower.Ypred',
'upper.Ypred_class1' = 'upper.Ypred') %>%
select(ng, age, age_std, everything()) # set order for the resulting dataframe
cubic_predictions = bind_rows(predictions[1:5]) %>%
# reshape to allow ggplotting:
pivot_longer(cols = Ypred_class1:last_col(),
names_to = c('line', 'class'), names_sep = '_',
values_to = 'predicted_WTAR') %>%
pivot_wider(names_from = line, values_from = predicted_WTAR) %>%
filter(!is.na(Ypred)) %>%
arrange(ng, class, age)
```
```{r wtar-modeling-linear, include=FALSE}
# prepare list to hold models with 1 through max_classes_linear classes:
linear_models = list()
if (RUN_NEW_LINEAR_MODELS) {
# currently have to coerce data to a dataframe (done above), as tibbles lead
# to a bug with the subject id column not being accepted by lcmm as numeric.
# use the 1-class model to set initial start values:
linear_models[[1]] <-
lcmm::hlme(WTAR ~ age_std,
random = ~1 + age_std,
subject = 'subject_id_num', ng = 1, data = dat)
saveRDS(linear_models[[1]], 'models/linear_WTAR_1.RDS') # cache to disk
# for each higher-class model, run with 500 random starts based on the
# initial one-class model, then run the selected one for up to 1000 iterations
# towards (hopeful) convergence:
for (class_num in 2:max_classes_linear) {
print(class_num)
# make use of parallel computation to speed up working through the large
# number of iterations to be used for each model. Needs to be done within
# the loop to work around an issue when knitting in R Markdown with the
# generation of random numbers in the parallel package:
num_cores = parallel::detectCores()
cl = parallel::makeCluster(num_cores)
# need to explicitly pass some variables to the namespace of each process,
# or the gridsearch and hlme functions below will trip up by not "knowing"
# some of their parameters:
parallel::clusterExport(cl, c('class_num', 'linear_models'))
linear_models[[class_num]] <-
lcmm::gridsearch(rep = 500, maxiter = 100, minit = linear_models[[1]],
cl = cl,
hlme(WTAR ~ age_std,
random = ~1 + age_std,
ng = class_num, nwg = TRUE,
mixture = ~ age_std, maxiter = 1000,
subject = 'subject_id_num', data = dat))
parallel::stopCluster(cl)
saveRDS(linear_models[[class_num]], paste0('models/linear_WTAR_',
class_num, '.RDS'))
}
} else { # use previously-computed linear models:
for (class_num in 1:max_classes_linear) {
linear_models[[class_num]] = readRDS(paste0('models/linear_WTAR_',
class_num, '.RDS'))
}
}
linear_WTAR_model_table =
summarytable(linear_models[[1]], linear_models[[2]], linear_models[[3]],
linear_models[[4]], linear_models[[5]],
which = c('G', 'loglik', 'conv', 'BIC', 'entropy',
'%class')) %>%
as_tibble() %>%
mutate(conv = factor(conv, levels = c(1, 2, 3, 4, 5),
labels = c('yes', 'no', '', 'optimisation problem',
'optimisation problem'))) %>%
rename(converged = conv)
# manually select model, on the basis of one that converged & had lowest BIC:
chosen_linear_classes = 2
chosen_linear_model = linear_models[[chosen_linear_classes]]
coeffs = coef(chosen_linear_model)
# create a vector of ages to predict against, centred on the mean across all
# sessions, and in decades rather than years to be consistent with the cubic
# modeling, where we need to limit instabilities:
mean_age = mean(dat$age)
new_data_linear =
data.frame(age = seq(from = 43, to = 89, by = 0.5)) %>%
mutate(age_std = (age - mean_age)/10)
# create a dataframe that contains predictions for all candidate models
predictions = list()
for (i in 1:max_classes_linear) {
prediction = lcmm::predictY(linear_models[[i]], newdata = new_data_linear,
var.time = 'age_std', draws = TRUE)
predictions[[i]] =
as.data.frame(prediction$pred) %>% # extract just the predicted values
cbind(new_data_linear) # and add in the corresponding age values
predictions[[i]]$ng = i # and label with the number of classes
}
# the first one needs to be modified so that it can act as the template to
# row bind to the others:
predictions[[1]] = predictions[[1]] %>% # add a suffix to the variable names
rename('Ypred_class1' = 'Ypred',
'lower.Ypred_class1' = 'lower.Ypred',
'upper.Ypred_class1' = 'upper.Ypred') %>%
select(ng, age, age_std, everything()) # set order for the resulting dataframe
linear_predictions = bind_rows(predictions[1:5]) %>%
# reshape to allow ggplotting:
pivot_longer(cols = Ypred_class1:last_col(),
names_to = c('line', 'class'), names_sep = '_',
values_to = 'predicted_WTAR') %>%
pivot_wider(names_from = line, values_from = predicted_WTAR) %>%
filter(!is.na(Ypred)) %>%
arrange(ng, class, age)
chosen_linear_predictions = linear_predictions %>%
filter(ng == chosen_linear_classes) %>%
mutate(class =
factor(class,
levels = c('class2', 'class1'),
labels = c('High performers', 'Typical'))) %>%
rename(class_for_lines = class)
# link subjects to their assigned classes:
dat_with_classes = dat %>%
left_join(chosen_linear_model$pprob, by = 'subject_id_num') %>%
mutate(class =
factor(class,
levels = c('2', '1'),
labels = c('High performers', 'Typical'))) %>%
rowwise() %>%
mutate(class_probability = max(c(prob1, prob2)))
# get a dataframe with just one row per subject
individuals = dat_with_classes %>%
group_by(subject_id_num) %>% arrange(age) %>%
slice_tail()
class_proportions = individuals %>%
group_by(class) %>%
tally() %>%
mutate(proportion = scales::percent_format(accuracy = 1)(n/sum(n)))
class_proportions_by_group = individuals %>%
group_by(group, class) %>%
tally() %>% group_by(group) %>%
mutate(proportion = scales::percent_format(accuracy = 1)(n/sum(n)))
```
```{r percent-above-100}
above_100 = dat %>%
mutate(above_100 = WTAR >= 100) %>%
group_by(group, above_100) %>%
tally() %>%
pivot_wider(names_from = above_100,
values_from = n, names_prefix = 'above_') %>%
mutate(proportion =
scales::percent_format(accuracy = 0.1)(above_TRUE/(above_FALSE + above_TRUE)))
```
```{r chi-square}
class_n_by_group = individuals %>%
group_by(group, class) %>%
tally() %>%
pivot_wider(names_from = group, values_from = n)
chi = chisq.test(class_n_by_group[, -1])
chi_n_observed = chi$observed
chi_n_expected = chi$expected
chi_p_value = chi$p.value
chi_statistic = chi$statistic
chi_df = chi$parameter
```
```{r formatted-text-results, include=FALSE}
# extract coefficients from model summary:
summary = data.frame(summary(chosen_linear_model))
typical_slope = summary$coef[row.names(summary) == 'age_std class1']
typical_slope_se = summary$Se[row.names(summary) == 'age_std class1']
typical_intcpt = summary$coef[row.names(summary) == 'intercept class1']
typical_intcpt_se = summary$Se[row.names(summary) == 'intercept class1']
high_slope = summary$coef[row.names(summary) == 'age_std class2']
high_slope_se = summary$Se[row.names(summary) == 'age_std class2']
high_intcpt = summary$coef[row.names(summary) == 'intercept class2']
high_intcpt_se = summary$Se[row.names(summary) == 'intercept class2']
# construct strings for the coefficients and intervals:
typical_slope_str = sprintf('%.1f', typical_slope)
high_slope_str = sprintf('%.1f', high_slope)
typical_intcpt_str = sprintf('%.1f', typical_intcpt)
high_intcpt_str = sprintf('%.1f', high_intcpt)
typical_slope_intrvl =
paste0('[', sprintf('%.1f', typical_slope - typical_slope_se * 1.96), ', ',
sprintf('%.1f', typical_slope + typical_slope_se * 1.96), ']')
high_slope_intrvl =
paste0('[', sprintf('%.1f', high_slope - high_slope_se * 1.96), ', ',
sprintf('%.1f', high_slope + high_slope_se * 1.96), ']')
typical_intcpt_intrvl =
paste0('[', sprintf('%.1f', typical_intcpt - typical_intcpt_se * 1.96), ', ',
sprintf('%.1f', typical_intcpt + typical_intcpt_se * 1.96), ']')
high_intcpt_intrvl =
paste0('[', sprintf('%.1f', high_intcpt - high_intcpt_se * 1.96), ', ',
sprintf('%.1f', high_intcpt + high_intcpt_se * 1.96), ']')
```
## Distribution of scores
The distributions of all WTAR-estimated IQ scores for the Parkinson's and
Control participants are shown in Figure \@ref(fig:figure-1-modified-histogram).
The mean of the latest score for participants in each of the groups is given in
Table \@ref(tab:table-1), with both being well above the expected population
norm mean IQ score of 100. The population distribution of IQ should be
symmetrical about 100, yet in our sample,
`r above_100$proportion[above_100$group == 'PD']` of Parkinson's WTAR scores
were greater than or equal to 100, as were
`r above_100$proportion[above_100$group == 'Control']` of Control scores.
## Longitudinal trajectories
The raw data showed that WTAR scores within individuals were relatively constant
over time (Figure \@ref(fig:figure-2-trajectories)A). This was evident even in
participants whose overall cognitive performance declined substantially, as they
progressed to PD-MCI or dementia (Figure \@ref(fig:figure-2-trajectories)B). We
modeled WTAR scores as a linear function of age, with repeated measures within
individuals, and fitted candidate models that ranged from having one to five
latent trajectory classes. We selected the two-class model on the basis of it
having the lowest BIC value. Full details on the modeling process are provided
in the Supplementary Material.
Individuals were assigned to the class for which they had the highest posterior
classification probability. The mean assigned probability was