Using rpact with Raw Data

Nestlé rpact Training 2024, Lausanne

Gernot Wassmer and Friedrich Pahlke

May 24, 2024

Introduction

We show how to use the function getDataset() as an utility function to process adjusted means and estimated standard deviations from raw data, how to convert the raw data into an rpact dataset, and finally how to use this information for the analysis at interim and the final stage. Essentially, this is through the use of the CRAN package emmeans that allows the extraction of least squares means from a specified model, e.g., an ANCOVA model considered here.

What will be illustrated?

How to import raw data from a SAS file
How to get the adjusted means and standard deviations from the data
How to convert the raw data into an dataset
How to analyze the dataset

Load required `R` packages

library(rpact)

# compute estimated marginal means (EMMs) for specified factors 
# in a linear model
library(emmeans)

Analysis of raw data – Data import from csv file

Artificial dataset that was randomly generated with simulated normal data.
The dataset has six variables:

Subject id
Stage number
Group name
An example outcome in that we are interested in
The first covariate gender
The second covariate covariate

data <- read.csv(file = "../files/data_two_arm_adjusted_means.csv")
data <- data[, !(colnames(data) %in% c("visit", "seed"))]
data1 <- data[data$stage == 1, ]
data2 <- data[data$stage == 2, ]
data3 <- data[data$stage == 3, ]
head(data1)

  subject stage           group   outcome gender covariate
1    1101     1 Treatment group  60.91533      m  41.74395
2    1102     1 Treatment group  57.65198      f  43.94315
3    1103     1 Treatment group 118.36755      f  38.82385
4    1104     1 Treatment group 118.25124      m  36.79198
5    1105     1 Treatment group 174.35321      f  42.92030
6    1106     1 Treatment group 113.80682      f  36.30031

Analysis of raw data – Descriptive Statistics

s1 <-aggregate(x = data$outcome, by = list(data$group,
    data$stage), FUN = mean, na.rm = TRUE)
s2 <-aggregate(x = data$outcome, by = list(data$group,
    data$stage), FUN = sd, na.rm = TRUE)
s3 <-aggregate(x = data$outcome, by = list(data$group,
    data$stage), FUN = length)
tab <- data.frame(stage = s1$Group.2,
                    arm = s1$Group.1,
                    n = s3$x,
                    mean = s1$x,
                    std = s2$x) |> 
    print()

  stage             arm  n      mean      std
1     1   Control group 87  91.66591 49.13310
2     1 Treatment group 87 110.88731 45.65258
3     2   Control group 87 101.83443 41.97826
4     2 Treatment group 87 114.06969 49.06744
5     3   Control group 87 103.31525 37.87368
6     3 Treatment group 87 112.17591 39.91061

The mean of outcome is larger in the treatment group as in the control group.

Check whether `group` is a character vector or `factor`:

class(data1$group)

[1] "character"

unique(data1$group)

[1] "Treatment group" "Control group"

Convert group to a factor (so that the order can be changed later):

data1$group <- as.factor(data1$group)

$\hspace{1.5cm}$

Since its first release, R has by default converted character strings to factors when creating data frames directly with data.frame() or as the result of using read.table() variants (e.g., read.csv()) to read in tabular data, i.e., stringsAsFactors = TRUE by default.

With the release of R 4.0.0 the default was changed to stringsAsFactors = FALSE!

Explore covariates

Frequency of gender levels per group:

tapply(data1$gender, data1$group, table)

$`Control group`

 f  m 
45 42 

$`Treatment group`

 f  m 
40 47

The frequency of gender has a good balance, overall and per group.

Mean of covariate per group:

tapply(data1$covariate, data1$group, mean, na.rm = TRUE)

  Control group Treatment group 
       49.96940        40.43999

The mean of covariate is larger in the control group than in the treatment group.

Interim analyses

First look: analysis at stage 1

For illustration we create a group sequential design with default parameters:

design <- getDesignGroupSequential()

Then we create a linear model with dependent variable y = outcome, independent variable group, and covariates covariate and gender, i.e., the model formula is outcome ~ group + covariate + gender:

lmResult1 <- lm(outcome ~ group + covariate + gender, data = data1)
lmResult1


Call:
lm(formula = outcome ~ group + covariate + gender, data = data1)

Coefficients:
         (Intercept)  groupTreatment group             covariate               genderm  
            139.2393               11.8357               -0.8502              -10.5387

Here we are not interested in the model output but we need it to compute the estimated marginal means (EMMs; least-squares means) for the factor group in the model:

emResult1 <- emmeans(lmResult1, "group")
emResult1

 group           emmean   SE  df lower.CL upper.CL
 Control group     95.4 6.64 167     82.3      109
 Treatment group  107.3 6.82 167     93.8      121

Results are averaged over the levels of: gender 
Confidence level used: 0.95

Note that the estimated standard deviation is also included in the object.

The rpact function getDataset() directly accepts the result objects of emmeans(). To create the dataset for the first interim analysis we only have to put the object emResult1 as exclusive argument to the function getDataset():

dataset1 <- getDataset(emResult1)
dataset1

Dataset of means

Stages: 1, 1
Treatment groups: 1, 2
Sample sizes: 84, 87
Means: 107.28, 95.45
Standard deviations: 62.48, 61.92

Calculated data

Cumulative sample sizes: 84, 87
Cumulative means: 107.28, 95.45
Cumulative standard deviations: 62.48, 61.92

Check the getDataset result

The order of the group levels is important to get the correct analysis results. Therefore we do some check before we continue.

rpact expects the following group level order:

1 = Treatment group
2 = Control group

Carefully check and define the order of the group levels. Depending on the group names it is important to guarantee that the order is defined as expected from rpact (see above).

For correct modelling, the control group should be the reference level and is therefore set to position 1:

data1$group <- relevel(data1$group, ref = "Control group")
levels(data1$group)

[1] "Control group"   "Treatment group"

Alternative command to reorder the group levels:

data1$group <- factor(data1$group, 
                      levels = c("Control group", "Treatment group"))

If the treatment group shall be the reference level you can later add the argument directionUpper = FALSE to getAnalysisResult().

After modifying the order double check that the dataset is still correct. We do that by creating the unadjusted dataset using a simple model without covariates:

datasetCheck1 <- getDataset(
    emmeans(lm(outcome ~ group, data = data1), "group"))
tapply(datasetCheck1$means, datasetCheck1$groups, mean, na.rm = TRUE)

        1         2 
110.88731  91.66591

tapply(data1$outcome, data1$group, mean, na.rm = TRUE)

  Control group Treatment group 
       91.66591       110.88731

The control group has a mean of 91.7 in the raw data and in the dataset, where 2 = control.

Recalculate the estimated marginal means for the complete model:

lmresult1 <- lm(outcome ~ group + covariate + gender, data = data1)
emResult1 <- emmeans(lmResult1, "group")
plot(emResult1)

The estimated marginal means and corresponding confidence intervals within the strata of the factors of the model can be obtained as follows:

emmeans(lmResult1, ~ group + covariate + gender)

 group           covariate gender emmean   SE  df lower.CL upper.CL
 Control group        45.3 f       100.7 7.45 167     86.0      115
 Treatment group      45.3 f       112.6 7.94 167     96.9      128
 Control group        45.3 m        90.2 7.68 167     75.0      105
 Treatment group      45.3 m       102.0 7.50 167     87.2      117

Confidence level used: 0.95

Analysis results of stage 1

As from above, let dataset1 be the dataset for the complete model:

analysisResults1 <- getAnalysisResults(design = design,
        dataInput = dataset1)
summary(analysisResults1)

Analysis results of stage 1

Analysis results for a continuous endpoint

Sequential analysis with 3 looks (group sequential design). The results were calculated using a two-sample t-test (one-sided, alpha = 0.025), equal variances option. H0: mu(1) - mu(2) = 0 against H1: mu(1) - mu(2) > 0.

Stage	1	2	3
Planned information rate	0.333	0.667	1
Efficacy boundary (z-value scale)	3.471	2.454	2.004
Cumulative alpha spent	0.0003	0.0072	0.0250
Stage level	0.0003	0.0071	0.0225
Cumulative effect size	11.836
Cumulative (pooled) standard deviation	62.192
Overall test statistic	1.244
Overall p-value	0.1076
Test action	continue
Conditional rejection probability	0.0621
95% repeated confidence interval	[-21.835; 45.506]
Repeated p-value	0.3127

The test decision at stage 1 is to not reject the null hypothesis and to continue the trial.

If the treatment group shall be the reference level you may add directionUpper = FALSE (default is TRUE):

data1b <- data1
data1b$group <- relevel(data1b$group, ref = "Treatment group")
dataset1b <- getDataset(emmeans(lm(
    outcome ~ group + covariate + gender,
    data = data1b), "group"))
analysisResults1b <- getAnalysisResults(
    design = design,
    dataInput = dataset1b, directionUpper = FALSE
)
summary(analysisResults1b)

Analysis results for a continuous endpoint

Stage	1	2	3
Planned information rate	0.333	0.667	1
Efficacy boundary (z-value scale)	3.471	2.454	2.004
Cumulative alpha spent	0.0003	0.0072	0.0250
Stage level	0.0003	0.0071	0.0225
Cumulative effect size	-11.836
Cumulative (pooled) standard deviation	62.192
Overall test statistic	-1.244
Overall p-value	0.1076
Test action	continue
Conditional rejection probability	0.0621
95% repeated confidence interval	[-45.506; 21.835]
Repeated p-value	0.3127

Repeated confidence interval: values turn around from [-21.8; 45.5] to [-45.5; 21.8].

Second look: analysis at stage 2

Similar to stage 1, we create a linear model with dependent variable y = outcome, independent variable group, and covariates covariate and gender:

lmResult2 <- lm(outcome ~ group + covariate + gender, data = data2)

Then we calculate the estimated marginal means of this linear model:

emResult2 <- emmeans(lmResult2, "group")

To create the dataset for the second interim analysis we only have to put the objects emResult1 and emResult2 as exclusive comma-separated arguments to the function getDataset():

dataset2 <- getDataset(emResult1, emResult2)
summary(dataset2)

Dataset of means

The dataset contains the sample sizes, means, and standard deviations of one treatment and one control group. The total number of looks is two; stage-wise and cumulative data are included.

Stage	1	1	2	2
Group	1	2	1	2
Stage-wise sample size	84	87	79	82
Cumulative sample size	84	87	163	169
Stage-wise mean	107.284	95.448	110.150	105.487
Cumulative mean	107.284	95.448	108.673	100.319
Stage-wise standard deviation	62.477	61.916	62.075	61.662
Cumulative standard deviation	62.477	61.916	62.107	61.814

Display the analysis results for stage 2:

analysisResults2 <- getAnalysisResults(design = design, dataInput = dataset2)
summary(analysisResults2)

Analysis results for a continuous endpoint

Stage	1	2	3
Planned information rate	0.333	0.667	1
Efficacy boundary (z-value scale)	3.471	2.454	2.004
Cumulative alpha spent	0.0003	0.0072	0.0250
Stage level	0.0003	0.0071	0.0225
Cumulative effect size	11.836	8.354
Cumulative (pooled) standard deviation	62.192	61.958
Overall test statistic	1.244	1.228
Overall p-value	0.1076	0.1101
Test action	continue	continue
Conditional rejection probability	0.0621	0.0412
95% repeated confidence interval	[-21.835; 45.506]	[-8.430 ; 25.138]
Repeated p-value	0.3127	0.2002

The test decision at stage 2 is to not reject the null hypothesis and to continue the trial.

Last look: analysis at stage 3

After collection of the additional trial data planned for stage 3, we can read the raw data, e.g., from a local csv file:

data3 <- data[data$stage == 3, ]
data3$group <- as.factor(data3$group)
data3$group <- relevel(data3$group, ref = "Control group")
head(data3)

    subject stage           group   outcome gender covariate
349    3101     3 Treatment group        NA      f  37.05855
350    3102     3 Treatment group 132.68682      m  35.38209
351    3103     3 Treatment group 116.46593      m  40.54073
352    3104     3 Treatment group  66.91700      f  40.46552
353    3105     3 Treatment group  47.36712      f  30.74220
354    3106     3 Treatment group  58.37924      f  37.44284

dim(data3)

[1] 174   6

Similar to stages 1 and 2, we create a linear model with dependent variable y = outcome, independent variabe group, and covariates covariate and gender:

lmResult3 <- lm(outcome ~ group + covariate + gender, data = data3)

Calculate the estimated marginal means of this linear model:

emResult3 <- emmeans(lmResult3, "group")

To create the dataset for the third interim analysis we only have to put the objects emResult1, emResult2, and emResult3 as exclusive comma-separated arguments to the function getDataset():

dataset3 <- getDataset(emResult1, emResult2, emResult3) 
summary(dataset3)

Dataset of means

The dataset contains the sample sizes, means, and standard deviations of one treatment and one control group. The total number of looks is three; stage-wise and cumulative data are included.

Stage	1	1	2	2	3	3
Group	1	2	1	2	1	2
Stage-wise sample size	84	87	79	82	81	84
Cumulative sample size	84	87	163	169	244	253
Stage-wise mean	107.284	95.448	110.150	105.487	113.769	101.753
Cumulative mean	107.284	95.448	108.673	100.319	110.365	100.795
Stage-wise standard deviation	62.477	61.916	62.075	61.662	53.135	52.656
Cumulative standard deviation	62.477	61.916	62.107	61.814	59.218	58.830

As final step of stage 3 we create and display the analysis results:

analysisResults3 <- getAnalysisResults(design = design, dataInput = dataset3)
summary(analysisResults3)

Analysis results for a continuous endpoint

Stage	1	2	3
Planned information rate	0.333	0.667	1
Efficacy boundary (z-value scale)	3.471	2.454	2.004
Cumulative alpha spent	0.0003	0.0072	0.0250
Stage level	0.0003	0.0071	0.0225
Cumulative effect size	11.836	8.354	9.570
Cumulative (pooled) standard deviation	62.192	61.958	59.021
Overall test statistic	1.244	1.228	1.807
Overall p-value	0.1076	0.1101	0.0357
Test action	continue	continue	accept
Conditional rejection probability	0.0621	0.0412
95% repeated confidence interval	[-21.835; 45.506]	[-8.430 ; 25.138]	[-1.070 ; 20.210]
Repeated p-value	0.3127	0.2002	0.0404
Final p-value			0.0374
Final confidence interval			[-0.949; 19.884]
Median unbiased estimate			9.483

The test decision at stage 3 is reject the null hypothesis.

Multi-stage raw data import

Let us assume that the data of the different stages are not stored in different data files but in one dataset where a stage column acts as an identifier for the data of the different stages.

Question: what is an efficient way to put the data to getDataset()?

Solution

First, to enable the later usage of sapply, we write a short function that returns the emmeans of a linear model result that uses only the data at the specified stage.

fun <- function(stage, f, formula, data, specs = "group") {
    emmeans(f(formula = formula, data = data[data$stage == stage, ]), 
    specs)
}

The arguments of the function are

stage An integer specifying the stage number.
f The function which shall produce the linear model result, e.g., stats::lm.
formula The model formula.
data The data.frame containing the multi-stage raw data (must contain the column stage).
specs The specs argument of the emmeans() function (used to specify the name of the group column).

Read the data of all stages:

data <- read.csv(file = "../files/dataAllStages.csv")
data$group <- as.factor(data$group)

Then we use the function unique to identify the available stages in the data.frame:

stages <- sort(as.integer(unique(data$stage)))

as.integer ensures that the stages will be valid also if data$stage is a factor.

Now we can create the rpact datasets for different model formulas in one step.

Example 1: Simple model without any covariates

dataset3a <- getDataset(sapply(
    stages, fun, stats::lm, 
    outcome ~ group, data))
summary(dataset3a)

Example 1: Simple model without any covariates

Dataset of means

The dataset contains the sample sizes, means, and standard deviations of one treatment and one control group. The total number of looks is three; stage-wise and cumulative data are included.

Stage	1	1	2	2	3	3
Group	1	2	1	2	1	2
Stage-wise sample size	84	87	79	82	81	84
Cumulative sample size	84	87	163	169	244	253
Stage-wise mean	110.887	91.666	114.070	101.834	112.176	103.315
Cumulative mean	110.887	91.666	112.430	96.600	112.345	98.829
Stage-wise standard deviation	48.078	48.078	46.171	46.171	39.386	39.386
Cumulative standard deviation	48.078	48.078	47.045	47.298	44.567	44.859

Example 2: Linear model with covariates

dataset3b <- getDataset(sapply(
    stages, fun, stats::lm,
    outcome ~ group + covariate + gender, data))
summary(dataset3b)

Example 2: Linear model with covariates

Dataset of means

The dataset contains the sample sizes, means, and standard deviations of one treatment and one control group. The total number of looks is three; stage-wise and cumulative data are included.

Stage	1	1	2	2	3	3
Group	1	2	1	2	1	2
Stage-wise sample size	84	87	79	82	81	84
Cumulative sample size	84	87	163	169	244	253
Stage-wise mean	107.284	95.448	110.150	105.487	113.769	101.753
Cumulative mean	107.284	95.448	108.673	100.319	110.365	100.795
Stage-wise standard deviation	63.292	62.723	62.854	62.437	53.813	53.328
Cumulative standard deviation	63.292	62.723	62.902	62.601	59.974	59.579

Note that the simple model outcome ~ group in this example results in an analysis result that rejects the null hypothesis already at stage 2 because covariate has an effect on the outcome.

getAnalysisResults(design = design, dataInput = dataset3a) |> 
    summary()

Analysis results for a continuous endpoint

Stage	1	2	3
Planned information rate	0.333	0.667	1
Efficacy boundary (z-value scale)	3.471	2.454	2.004
Cumulative alpha spent	0.0003	0.0072	0.0250
Stage level	0.0003	0.0071	0.0225
Cumulative effect size	19.221	15.830	13.516
Cumulative (pooled) standard deviation	48.078	47.174	44.716
Overall test statistic	2.614	3.057	3.369
Overall p-value	0.0049	0.0012	0.0004
Test action	continue	reject and stop	reject
Conditional rejection probability	0.3238	0.7934
95% repeated confidence interval	[-6.808; 45.251]	[3.051 ; 28.609]	[5.455 ; 21.577]
Repeated p-value	0.0798	0.0071	0.0004
Final p-value		0.0014
Final confidence interval		[5.438; 25.824]
Median unbiased estimate		15.652

Conclusions

getDataset supports emmeans result objects since rpact version 3.1.1
Easy to use: just add one emmeans result object per stage as argument of getDataset
rpact will support all models that are supported by emmeans, e.g., complex longitudinal models with random intercept and random slope (not yet supported)
Multi-arm analysis is also supported (carefully define order of treatments and control!)

Using rpact with Raw Data

Introduction

Load required R packages

Analysis of raw data – Data import from csv file

Analysis of raw data – Descriptive Statistics

Check whether group is a character vector or factor:

Explore covariates

Interim analyses

Check the getDataset result

Alternative command to reorder the group levels:

Recalculate the estimated marginal means for the complete model:

Analysis results of stage 1

Analysis results of stage 1

Second look: analysis at stage 2

Last look: analysis at stage 3

Multi-stage raw data import

Solution

Example 1: Simple model without any covariates

Example 1: Simple model without any covariates

Example 2: Linear model with covariates

Example 2: Linear model with covariates

Conclusions

Load required `R` packages

Check whether `group` is a character vector or `factor`: