Model diagnostics on ds.glm

Hello everyone, is there an approach for the kind of model diagnostics that would normally be done with the residuals? For example, if a multiple regression contained highly influential values that Cook’s distance would detect, is there any way to test for that type of issue using ds.glm outputs?

Thank you!

Hi @twey

There is no specific function for model diagnostics in the current version of dsBase. I am developing a function for diagnostic plots, which will be included in one of the upcoming releases.

At the moment you can do some checks for the normality and homoscedasticity of the residuals and for linearity, but I have not yet thought of a way to detect outliers and influential points with the existing functionality.

Here is an example of some checks you can do:

modDS <- ds.glm(formula = 'D$LAB_TRIG~D$GENDER+D$PM_BMI_CONTINUOUS+D$LAB_HDL', family='gaussian')
modDS$coefficients

ds.asNumeric('D$GENDER', newobj = 'gender.n')
ds.table('D$GENDER')
ds.histogram('gender.n', type='combine')

# get the complete cases of the set of variables in the model
ds.dataFrame(c('D$LAB_TRIG','gender.n','D$PM_BMI_CONTINUOUS','D$LAB_HDL'), newobj='Data_regress')
ds.dim('Data_regress')
ds.completeCases(x1='Data_regress', newobj='Data_regress.compl')
ds.dim('Data_regress.compl')

# create fitted values and residuals
# (the numeric constants below are the estimates reported in modDS$coefficients)
ds.make(toAssign = '1.70005820+((-0.59987524)*Data_regress.compl$gender.n)+(0.05791745*Data_regress.compl$PM_BMI_CONTINUOUS)+((-0.61425296)*Data_regress.compl$LAB_HDL)', newobj = 'fitted_values')
ds.make(toAssign = '(Data_regress.compl$LAB_TRIG-fitted_values)', newobj = 'residuals')
ds.ls()

# check normality of residuals
ds.mean('residuals', type = 'combine')
ds.histogram('residuals', type = 'combine')

# check for linearity and homoscedasticity
# ('connections' is assumed to be the object holding the server connections created at login)
ds.scatterPlot(x='fitted_values', y='residuals', type='combine', datasources=connections)

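# optional: standardise the residuals by dividing them by their standard deviation
# (here 2.026274 is taken to be the variance from ds.var, and 1.423473 is its square root)
#ds.var('residuals', type='combine')
#sqrt(2.026274)
#ds.make(toAssign = 'residuals/1.423473', newobj = 'std.residuals')
#ds.mean('std.residuals')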
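
The fitted-values string above is typed out by hand, which is easy to get wrong when the model changes. As a minimal sketch (not part of dsBaseClient), the same string could be assembled from a named vector of estimates. The helper name build_fitted_expr is hypothetical, the estimates are simply the ones copied from the example, and the assumption that ds.make accepts the extra parentheses around positive coefficients is untested here.

```r
# Hypothetical helper (not a dsBaseClient function): builds the arithmetic
# string passed to ds.make() from a named vector of coefficient estimates.
build_fitted_expr <- function(estimates, data_name) {
  # the first element is expected to be the intercept
  stopifnot(names(estimates)[1] == "(Intercept)")
  intercept <- sprintf("%.8f", estimates[[1]])
  slopes <- estimates[-1]
  # one "((coef)*data$variable)" term per predictor, 8 decimals as in the thread
  terms <- paste0("((", sprintf("%.8f", slopes), ")*",
                  data_name, "$", names(slopes), ")")
  paste(c(intercept, terms), collapse = "+")
}

# Estimates copied from the example above. In practice they would be read
# from the model output, with the names set to the columns of the
# complete-cases data frame (assumption).
est <- c("(Intercept)"       = 1.70005820,
         "gender.n"          = -0.59987524,
         "PM_BMI_CONTINUOUS" = 0.05791745,
         "LAB_HDL"           = -0.61425296)

fitted_expr <- build_fitted_expr(est, "Data_regress.compl")
print(fitted_expr)

# The string could then be used in place of the hand-typed one:
# ds.make(toAssign = fitted_expr, newobj = 'fitted_values')
```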

Also note that if you use ds.glmSLMA instead of ds.glm, you can create the fitted values with the ds.glmPredict function instead of ds.make.

Another note: the points shown in scatter plots (e.g. residuals vs fitted values) are anonymised points and not the actual values. You can find some information about how the anonymisation in the plots is done in the paper "Privacy preserving data visualizations" (EPJ Data Science).

Also, if you are interested in how to calculate other statistics from regressions, such as the Wald test, the likelihood ratio test, type I and II errors, etc., let me know and I can share examples for those too.
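
To illustrate the kind of calculation mentioned above, a Wald test for a single coefficient only needs its estimate and standard error, both of which are non-disclosive summary values. The following is a minimal sketch in plain R using the normal approximation. The function name wald_test is hypothetical, the estimate is the gender coefficient from the example, and the standard error of 0.05 is made up for illustration.

```r
# Hypothetical helper: Wald z-test and confidence interval for one coefficient.
wald_test <- function(estimate, std_error, level = 0.95) {
  z <- estimate / std_error                       # Wald statistic
  p <- 2 * pnorm(-abs(z))                         # two-sided p-value (normal approximation)
  half <- qnorm(1 - (1 - level) / 2) * std_error  # half-width of the interval
  list(z = z, p.value = p, ci = c(lower = estimate - half, upper = estimate + half))
}

# Estimate taken from the example; the standard error is an invented value.
res <- wald_test(estimate = -0.59987524, std_error = 0.05)
print(res)
```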

Many thanks, Demetris

Thank you so much, Demetris! The example scripts are extremely useful. So the calculated residuals are the actual values and can be used with the fitted values as usual, but the plotted points (on the client side) are anonymized.

If you would be willing to share the examples for other regression statistics, that would be greatly appreciated! It’s always very helpful to have example scripts as reference.

Cheers, Tina

Yes, the calculated residuals saved on the server side are the actual values, but the plotted points shown on the client side are anonymised.

I will try to clean and comment some of my scripts and share them here and/or in the DataSHIELD wiki.

In the meantime, if you need to calculate a specific regression statistic, let me know. I will also think about what we can do for Cook’s distance.
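
For reference, this is the quantity being asked about, written out in ordinary local R on simulated data. It is only a sketch of the textbook formula and does not use DataSHIELD: it relies on individual-level residuals and leverages, so it says nothing about how a future server-side function might work. All variable names and the planted outlier are invented for the illustration.

```r
# Simulated data with one planted high-leverage, badly fitting point.
set.seed(1)
n <- 50
x <- rnorm(n)
y <- 1 + 2 * x + rnorm(n)
x[1] <- 5
y[1] <- -10

fit <- lm(y ~ x)

e  <- residuals(fit)        # raw residuals
h  <- hatvalues(fit)        # leverages (diagonal of the hat matrix)
p  <- length(coef(fit))     # number of estimated coefficients
s2 <- sum(e^2) / (n - p)    # residual variance estimate

# Cook's distance: D_i = e_i^2 / (p * s2) * h_i / (1 - h_i)^2
cooks_manual <- (e^2 / (p * s2)) * h / (1 - h)^2

# Agrees with the built-in function from the stats package
print(all.equal(unname(cooks_manual), unname(cooks.distance(fit))))

# Index of the most influential observation
print(which.max(cooks_manual))
```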
