Model diagnostics on ds.glm

twey · 13 March 2023 21:02

Hello everyone, I don’t suppose there is an approach for model diagnostics that would normally be done with the residuals? For example, if there were highly influential values in a multiple regression that would be detected by Cook’s distance, would there be any way to test for that type of issue in ds.glm outputs?

Thank you!

demetris.avraam · 14 March 2023 10:26

Hi @twey

There is no dedicated function for model diagnostics in the current version of dsBase. I am developing a function for diagnostic plots, which will be included in a future release.

At the moment you can check the normality and homoscedasticity of the residuals, and check for linearity, but I haven’t yet thought of a way to detect outliers and influential points with the existing functionality.

Here is an example of some checks you can do:

modDS <- ds.glm(formula = 'D$LAB_TRIG~D$GENDER+D$PM_BMI_CONTINUOUS+D$LAB_HDL', family='gaussian')
modDS$coefficients

ds.asNumeric('D$GENDER', newobj = 'gender.n')
ds.table('D$GENDER')
ds.histogram('gender.n', type='combine')

# get the complete cases of the set of variables in the model
ds.dataFrame(c('D$LAB_TRIG','gender.n','D$PM_BMI_CONTINUOUS','D$LAB_HDL'), newobj='Data_regress')
ds.dim('Data_regress')
ds.completeCases(x1='Data_regress', newobj='Data_regress.compl')
ds.dim('Data_regress.compl')

# create fitted values and residuals on the server side
# (the numeric coefficients below are copied by hand from modDS$coefficients)
ds.make(toAssign = '1.70005820+((-0.59987524)*Data_regress.compl$gender.n)+(0.05791745*Data_regress.compl$PM_BMI_CONTINUOUS)+((-0.61425296)*Data_regress.compl$LAB_HDL)', newobj = 'fitted_values')
ds.make(toAssign = '(Data_regress.compl$LAB_TRIG-fitted_values)', newobj = 'residuals')
ds.ls()

# check that the residuals are centred on zero and look roughly normal
ds.mean('residuals', type = 'combine')
ds.histogram('residuals', type = 'combine')

# check for linearity and homoscedasticity
# ('connections' is assumed to be the object returned by the DataSHIELD login)
ds.scatterPlot(x='fitted_values', y='residuals', type='combine', datasources=connections)

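# optional (left commented out): standardise the residuals by their pooled
# standard deviation, i.e. the square root of the variance returned by ds.var
#ds.var('residuals', type='combine')
#sqrt(2.026274)
#ds.make(toAssign = 'residuals/1.423473', newobj = 'std.residuals')
#ds.mean('std.residuals')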
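The ds.make call above has the coefficients typed in by hand, which is easy to get wrong. A minimal sketch of building the same expression string programmatically follows. It is plain client-side R; make_fitted_expr is a hypothetical helper, not part of dsBase or dsBaseClient, and it assumes you supply the estimates in the order intercept first, then one estimate per server-side variable.

```r
# Hypothetical helper (not a DataSHIELD function): build the string passed to
# ds.make(toAssign = ...) from a vector of estimates and the matching
# server-side variable names.
make_fitted_expr <- function(estimates, server_vars) {
  # estimates: intercept first, then one slope per entry of server_vars
  stopifnot(length(estimates) == length(server_vars) + 1)
  intercept <- sprintf("%.8f", estimates[1])
  # wrap each coefficient in parentheses so negative values parse cleanly
  terms <- sprintf("((%.8f)*%s)", estimates[-1], server_vars)
  paste(c(intercept, terms), collapse = "+")
}

# Estimates taken from the example in this thread
est <- c(1.70005820, -0.59987524, 0.05791745, -0.61425296)
vars <- c("Data_regress.compl$gender.n",
          "Data_regress.compl$PM_BMI_CONTINUOUS",
          "Data_regress.compl$LAB_HDL")

fitted_expr <- make_fitted_expr(est, vars)
print(fitted_expr)
```

The resulting string could then be passed as the toAssign argument of ds.make in place of the hand-typed expression. Check that the order of the estimates matches the order of the variables before using it.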

Also note that if you use ds.glmSLMA instead of ds.glm, you can create the fitted values with the ds.glmPredict function rather than with ds.make.

Another point: the points shown in scatter plots (e.g. residuals vs fitted values) are anonymised points, not the actual values. The anonymisation used in the plots is described in the paper "Privacy preserving data visualizations" (EPJ Data Science).

Also, if you are interested in how to calculate other regression statistics, such as the Wald test, the likelihood ratio test, type I and II errors, etc., let me know and I can share examples for those too.
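As one illustration of the statistics mentioned above, here is a minimal sketch of a Wald test computed on the client from an estimate and its standard error. It is plain R using the normal approximation, not a dsBase function. wald_test is a hypothetical helper, and the standard error in the example is a made-up value for illustration, not a result from this thread.

```r
# Hypothetical helper: two-sided Wald z-test for H0: coefficient = 0
wald_test <- function(estimate, std_error) {
  z <- estimate / std_error
  p <- 2 * pnorm(-abs(z))  # two-sided p-value from the standard normal
  data.frame(z = z, p.value = p)
}

# The estimate is the BMI coefficient from the thread; the standard error
# of 0.01 is illustrative only
res <- wald_test(estimate = 0.05791745, std_error = 0.01)
print(res)
```

In practice the estimates and standard errors would be read from the coefficient table returned by the fitted model rather than typed in.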

Many thanks, Demetris

twey · 14 March 2023 15:06

Thank you so much, Demetris! The example scripts are extremely useful. So the calculated residuals are the actual values and can be used with the fitted values as usual, but the plotted points (on the client side) are anonymized.

If you would be willing to share the examples for other regression statistics, that would be greatly appreciated! It’s always very helpful to have example scripts as reference.

Cheers, Tina

demetris.avraam · 14 March 2023 22:24

Yes, the calculated residuals saved on the server side are the actual values, but the plotted points shown on the client side are anonymised.

I will try to clean and comment some of my scripts and share them here and/or in the DataSHIELD wiki.

In the meantime, if you need to calculate a specific regression statistic, let me know. I will also think about what we can do for Cook’s distance.
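For reference, this is what the Cook’s distance check looks like in ordinary R, where the individual-level data are available locally. It is a sketch on simulated data, not a DataSHIELD workflow: lm and cooks.distance need the row-level data and so cannot be run against ds.glm output. The 4/n cut-off is a common rule of thumb, not a fixed standard.

```r
# Simulated data with one planted influential observation
set.seed(1)
n <- 50
x <- rnorm(n)
y <- 1 + 2 * x + rnorm(n)
x[n] <- 6     # far from the other x values (high leverage)
y[n] <- -10   # far from the trend (large residual)

fit <- lm(y ~ x)
cd <- cooks.distance(fit)

# Flag observations above the rule-of-thumb threshold 4/n
flagged <- which(cd > 4 / n)
print(flagged)
print(unname(which.max(cd)))
```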
