Plot regression line with confidence intervals

twey · 11 September 2023 17:08

Apologies if this is covered elsewhere, but is there a way to plot confidence intervals around a regression line from the output of ‘ds.glm’ or ‘ds.glmSLMA’? I get the sense that a ‘newobj’ created on the server side from ‘ds.glmSLMA’ might have the elements to do this, but I’m unsure how to access and use this information in graphing functions.

Thank you for any help!

demetris.avraam · 20 September 2023 18:24

Hi Tina,

Sorry for the late response to this.

Do you have a simple linear regression model with only one continuous exposure and one continuous outcome?

There is a way to calculate the confidence intervals around a regression line. Here is a code on how I did it in R:

set.seed(1)

x <- rnorm(100, 0, 1)
y <- x + rnorm(100, 0, 0.5)

plot(x,y)
abline(lm(y~x))

lm.out <- lm(y~x)

newx = seq(min(x), max(x), by = 0.05)
conf_interval <- predict(lm.out, newdata=data.frame(x=newx), interval="confidence",
                         level = 0.95)
plot(x, y, xlab="x", ylab="y")
abline(lm.out, col="blue")
lines(newx, conf_interval[,2], col="blue", lty=2)
lines(newx, conf_interval[,3], col="blue", lty=2)


### USING FORMULAS

# Find length of y to use as sample size
n <- length(y) 

# Extract fitted coefficients from model object
b0 <- lm.out$coefficients[1]
b1 <- lm.out$coefficients[2]

# Calculate critical t-value
t.val <- qt(0.975, n - 2) 

# Fit linear model with extracted coefficients
y.fit <- b1 * x + b0

# Find the standard error of the regression line
se <- sqrt(sum((y - y.fit)^2) / (n - 2)) * sqrt(1 / n + (x - mean(x))^2 / sum((x - mean(x))^2))
# Note: the first sqrt is the MSE which is given as SSE/(n-2)

# Plot the fitted linear regression line and the computed confidence bands
plot(x, y, xlab="x", ylab="y")
lines(x, y.fit, col="red")
lines(sort(x), sort((y.fit + t.val * se)), col = 'red', lty = 2)
lines(sort(x), sort((y.fit - t.val * se)), col = 'red', lty = 2)

We can do something similar in DataSHIELD. However, the DataSHIELD graphical functions, like the scatterplots show anonymised values and not the actual values. So if you want to do it, you will get a scatterplot with anonymized x and y data and anonymised regression line and its CI. For more details on the anonymisatiion process you can have a look on this paper: Privacy preserving data visualizations | EPJ Data Science | Full Text

I am happy to discuss this further in a call if you like and I can show you how to calculate the CIs of the regression line in DataSHIELD.

Many thanks, Demetris

twey · 27 September 2023 20:16

Hi Demetris,

Thank you so much for the detailed response. I am indeed working with simple regression models with continuous outcome and one continuous exposure. The same model was run on different subgroups, which have different slopes. I can plot the different regression lines together using the estimated intercepts and slopes, but thought it would be useful to include the confidence intervals to have additional visual information about the differences between the regression lines in each population across the range of values. I am not interested in plotting any individual points (no scatterplots), but rather just looking to show if/where the confidence intervals overlap for the different regression lines. I thought this would represent summary information but maybe it is still considered a disclosure risk?

My question was about how to get the confidence intervals from the output of the ds.glm function, rather than the output of a call to lm(), so unless I’m misunderstanding, the example script in base R can’t be used in this case. So it sounds like the way to get a plot of the confidence intervals from a ds.glm object would be to create new objects for y.fit and se on the server-side and create a plot with anonymized values? For now I think we will just plot the regression lines as that was the main objective, but I might follow up in the future if needed.

Thank you again!

demetris.avraam · 28 September 2023 16:07

Hi Tina,

I used lm() in my example but you can do the same with glm() since the outcome is continuous.

Yes, the way to plot the regression line and its CI is to create the y.fit and se on the server-side and use the scatterplot function to return back anonymised values that you can then plot using the native R plot function. The anonymised values lie on the same line or curve as the actual values so you get the same line and CI curves as those you expect.

Here is an example on how to do that in DataSHIELD:

ds.dataFrame(c('D$LAB_TRIG','D$PM_BMI_CONTINUOUS'), newobj='Dcompl', completeCases = TRUE)

mod <- ds.glm(formula='LAB_TRIG~PM_BMI_CONTINUOUS', data='Dcompl', family='gaussian')
mod  # estimate of intercept=-0.01941156, estimate of x=0.07714569
mod$Nvalid  # n=1730

ds.make(toAssign = '-0.01941156+0.07714569*Dcompl$PM_BMI_CONTINUOUS', newobj='y.fit')

ds.mean('Dcompl$PM_BMI_CONTINUOUS')  # mean(x)=27.42582

# calculate standard error of regression line using the formula:
# sqrt(sum((y - y.fit)^2) / (n - 2)) * sqrt(1 / n + (x - mean(x))^2 / sum((x - mean(x))^2))

# do the above in parts:

ds.make(toAssign = '(Dcompl$LAB_TRIG-y.fit)^2', newobj='p1')
ds.dataFrame(x = c('p1','p1','p1','p1','p1','p1','p1','p1','p1','p1'), newobj = 'p1df')
ds.rowColCalc(x = 'p1df', operation='colSums', newobj='p2')
ds.mean('p2') # 4047.194
sqrt(4047.194/(1730-2)) # 1.530401

ds.make(toAssign = '(Dcompl$PM_BMI_CONTINUOUS-27.42582)^2', newobj = 'p3')
ds.dataFrame(x = c('p3','p3','p3','p3','p3','p3','p3','p3','p3','p3'), newobj = 'p3df')
ds.rowColCalc(x = 'p3df', operation='colSums', newobj='p4')
ds.mean('p4') # 42465.73
ds.make(toAssign = 'p3/42465.73', newobj = 'p5')
ds.make(toAssign = '(1/1730)+p5', newobj = 'p6')
ds.sqrt('p6', newobj='p7')
ds.make(toAssign = '1.530401*p7', newobj='SE')

# Calculate critical t-value
t.val <- qt(0.975, 1730 - 2) # 1.961338

ds.make(toAssign = 'y.fit + (1.961338 * SE)', newobj = 'upper')
ds.make(toAssign = 'y.fit - (1.961338 * SE)', newobj = 'lower')

upper <- ds.scatterPlot(x='Dcompl$PM_BMI_CONTINUOUS', y='upper', return.coords = TRUE)
lower <- ds.scatterPlot(x='Dcompl$PM_BMI_CONTINUOUS', y='lower', return.coords = TRUE)
fit <- ds.scatterPlot(x='Dcompl$PM_BMI_CONTINUOUS', y='y.fit', return.coords = TRUE)

plot(fit[[1]]$study1[,'x'], fit[[1]]$study1[,'y'], type='l', xlab='x', ylab='y')
lines(sort(lower[[1]]$study1[,'x']), sort(lower[[1]]$study1[,'y']), lty=2)
lines(sort(upper[[1]]$study1[,'x']), sort(upper[[1]]$study1[,'y']), lty=2)

and you get this plot:

In the above code i used only one study but you can do the same if you have multiple studies and also if you run the same model in multiple subgroups.

Let me know if you need to discuss this code in more details or if you need any other help!

Thanks, Demetris

Topic		Replies	Views
Model diagnostics on ds.glm Beginner Support	3	229	14 March 2023
Cox Regression in DataSHIELD Statistical help	20	1958	29 May 2022
Robust standard errors (sandwich estimator) Support	2	176	30 June 2023
LifeCycle wishlist New functionality	15	992	18 July 2019
Where can I find a tutorial for adding a new datashield function and its installation？ Developer support	3	605	19 February 2020

Plot regression line with confidence intervals

Related topics