Incorrect results in pooled regression if the levels of factor covariates are not in the same order across studies

Hi all,

I want to warn you about an issue with the ds.glm function that can arise when you include pre-processed factors as covariates. If a factor covariate has a different reference group in some studies (that is, if the levels of the factor are not in the same order across all studies), ds.glm still assumes that the factor has the same reference group in every study. The output then shows a single estimate whose label reflects the reference group of the first study included in the analysis, even though the studies are not all coded that way.
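To make the mechanism concrete, here is a minimal base R sketch (no DataSHIELD involved). The simulated data, the coefficients used to generate it, and all object names (make_study, study_a, study_b, and so on) are hypothetical and chosen only for illustration. It shows that the order of the levels decides which dummy column the model contains, and that the same data give a coefficient with the opposite sign when the reference group is swapped.

```r
# Illustration only: simulated data, hypothetical names, plain glm() instead of ds.glm.
set.seed(42)

make_study <- function(n, level_order) {
  gender <- rbinom(n, 1, 0.5)
  bmi <- rnorm(n, mean = 27, sd = 4)
  p <- plogis(-4 + 0.1 * bmi - 0.5 * gender)
  data.frame(
    diabetes = rbinom(n, 1, p),
    bmi = bmi,
    # The first entry of level_order becomes the reference group
    gender = factor(gender, levels = level_order)
  )
}

study_a <- make_study(2000, c("0", "1"))  # reference group is "0"
study_b <- make_study(2000, c("1", "0"))  # reference group is "1"

# The dummy column is named after the non-reference level
cols_a <- colnames(model.matrix(~ bmi + gender, data = study_a))
cols_b <- colnames(model.matrix(~ bmi + gender, data = study_b))
print(cols_a)  # "(Intercept)" "bmi" "gender1"
print(cols_b)  # "(Intercept)" "bmi" "gender0"

# Same data, two codings: the gender coefficient only changes sign
fit_b <- glm(diabetes ~ bmi + gender, family = binomial, data = study_b)
study_b_fixed <- transform(study_b, gender = relevel(gender, ref = "0"))
fit_b_fixed <- glm(diabetes ~ bmi + gender, family = binomial, data = study_b_fixed)

print(coef(fit_b)["gender0"])
print(coef(fit_b_fixed)["gender1"])
```

If a pooled fit treats the third column as the same quantity in every study, a study coded like study_b contributes an effect in the opposite direction, which is consistent with the attenuated pooled estimate reported below.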

In the example below, a model fitted to data from 3 studies gives a pooled estimate for gender of -0.1071360 (labelled gender1, so it should be the estimate for gender=1 compared with the reference group gender=0). However, if we run the regression in each study separately, the ds.glm output shows that in study 2 the reference group of the factor gender is the level gender=1 (the coefficient there is labelled gender0). The output of ds.levels confirms this: the order of the levels of gender is not the same across the 3 studies.

So, to make sure that the results of a pooled regression with ds.glm are correct, please first check the order of the levels of every factor covariate using the ds.levels function. If the order is not the same in all studies, use the ds.changeRefGroup function to set the same reference group for that factor across all studies and then run ds.glm, which will then return the correct results (in this example, the correct estimate for gender is -0.4425188).
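One way to automate this check is sketched below. The helper levels_consistent is hypothetical (it is not part of dsBaseClient); it only assumes that ds.levels returns one list element per study with a Levels entry, as in the output further down. The mock lists stand in for real ds.levels output so that the snippet runs without a server connection.

```r
# Hypothetical helper: TRUE if every study reports the same levels in the same order.
levels_consistent <- function(levels_by_study) {
  lv <- lapply(levels_by_study, function(s) s$Levels)
  all(vapply(lv, function(x) identical(x, lv[[1]]), logical(1)))
}

# Mock objects mirroring the ds.levels output shown in this post
before <- list(
  study1 = list(Levels = c("0", "1"), ValidityMessage = "VALID ANALYSIS"),
  study2 = list(Levels = c("1", "0"), ValidityMessage = "VALID ANALYSIS"),
  study3 = list(Levels = c("0", "1"), ValidityMessage = "VALID ANALYSIS")
)
after <- before
after$study2$Levels <- c("0", "1")

print(levels_consistent(before))  # FALSE
print(levels_consistent(after))   # TRUE

# With a live connection, the same check could guard the model fit
# (calls copied from the session below; not run here):
# lv <- ds.levels(x = 'gender', datasources = connections)
# if (!levels_consistent(lv)) {
#   ds.changeRefGroup(x = 'gender', ref = 0, newobj = 'gender', datasources = connections)
# }
# ds.glm(formula = 'diabetes~bmi+gender', family = 'binomial', datasources = connections)
```

The comparison uses identical() so that both the set of levels and their order have to match; a study with an extra or missing level would also be flagged.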

I am looking into adding a check to ds.glm that returns a warning message to users in such cases. In the meantime, if you have any questions about this, please contact me.

Thanks, Demetris

> ds.glm(formula = 'diabetes~bmi+gender', family = 'binomial', datasources = connections)
  Aggregated (glmDS1(diabetes ~ bmi + gender, "binomial", NULL, NULL, NULL)) [===========] 100% / 1s
Iteration 1...
  Aggregated (glmDS2(diabetes ~ bmi + gender, "binomial", "0,0,0", NULL, NULL, ) [=======] 100% / 1s
CURRENT DEVIANCE:      12375.4497617173
Iteration 2...
  Aggregated (glmDS2(diabetes ~ bmi + gender, "binomial", "-2.17504842265572,0.00871150851295946,...
CURRENT DEVIANCE:      2915.6575776071
Iteration 3...
  Aggregated (glmDS2(diabetes ~ bmi + gender, "binomial", "-3.66029698749236,0.0263127570728691,-...
CURRENT DEVIANCE:      1690.39475563765
Iteration 4...
  Aggregated (glmDS2(diabetes ~ bmi + gender, "binomial", "-5.39969079376931,0.0626465593519114,-...
CURRENT DEVIANCE:      1395.63400256565
Iteration 5...
  Aggregated (glmDS2(diabetes ~ bmi + gender, "binomial", "-7.18032222933903,0.110785116443226,-0...
CURRENT DEVIANCE:      1338.4226273586
Iteration 6...
  Aggregated (glmDS2(diabetes ~ bmi + gender, "binomial", "-8.05336242853471,0.135793921983864,-0...
CURRENT DEVIANCE:      1333.0006603598
Iteration 7...
  Aggregated (glmDS2(diabetes ~ bmi + gender, "binomial", "-8.16398000198871,0.138884396458108,-0...
CURRENT DEVIANCE:      1332.92722345965
Iteration 8...
  Aggregated (glmDS2(diabetes ~ bmi + gender, "binomial", "-8.1656229295076,0.138929750884362,-0....
CURRENT DEVIANCE:      1332.92720698603
Iteration 9...
  Aggregated (glmDS2(diabetes ~ bmi + gender, "binomial", "-8.16562330293191,0.138929761172965,-0...
CURRENT DEVIANCE:      1332.92720698603
SUMMARY OF MODEL STATE after iteration 9
Current deviance 1332.92720698603 on 8924 degrees of freedom
Convergence criterion TRUE (5.11708255281577e-16)

beta: -8.16562330293191 0.138929761172964 -0.107136049184977

Information matrix overall:
            (Intercept)        bmi    gender1
(Intercept)   131.51623   4045.766   60.57624
bmi          4045.76620 128005.832 1847.21071
gender1        60.57624   1847.211   60.57624

Score vector overall:
                     [,1]
(Intercept) -3.541611e-12
bmi         -1.109370e-10
gender1     -1.627143e-12

Current deviance: 1332.92720698603

$Nvalid
[1] 8927

$Nmissing
[1] 452

$Ntotal
[1] 9379

$disclosure.risk
       RISK OF DISCLOSURE
study1                  0
study2                  0
study3                  0

$errorMessage
       ERROR MESSAGES
study1 "No errors"   
study2 "No errors"   
study3 "No errors"   

$nsubs
[1] 8927

$iter
[1] 9

$family

Family: binomial 
Link function: logit 


$formula
[1] "diabetes ~ bmi + gender"

$coefficients
              Estimate Std. Error     z-value      p-value low0.95CI.LP high0.95CI.LP         P_OR
(Intercept) -8.1656233 0.53425481 -15.2841362 9.754780e-53   -9.2127435    -7.1185031 0.0002841786
bmi          0.1389298 0.01680753   8.2659218 1.386194e-16    0.1059876     0.1718719 1.1490433897
gender1     -0.1071360 0.17514150  -0.6117114 5.407287e-01   -0.4504071     0.2361350 0.8984034376
            low0.95CI.P_OR high0.95CI.P_OR
(Intercept)   9.975003e-05    0.0008093228
bmi           1.111808e+00    1.1875257282
gender1       6.373686e-01    1.2663452237

$dev
[1] 1332.927

$df
[1] 8924

$output.information
[1] "SEE TOP OF OUTPUT FOR INFORMATION ON MISSING DATA AND ERROR MESSAGES"

> 
> ds.glm(formula = 'diabetes~bmi+gender', family = 'binomial', datasources = connections[1])
  Aggregated (glmDS1(diabetes ~ bmi + gender, "binomial", NULL, NULL, NULL)) [===========] 100% / 0s
Iteration 1...
  Aggregated (glmDS2(diabetes ~ bmi + gender, "binomial", "0,0,0", NULL, NULL, ) [=======] 100% / 0s
CURRENT DEVIANCE:      2864.08415007369
Iteration 2...
  Aggregated (glmDS2(diabetes ~ bmi + gender, "binomial", "-2.13202979794411,0.00744868082955048,...
CURRENT DEVIANCE:      664.091745207632
Iteration 3...
  Aggregated (glmDS2(diabetes ~ bmi + gender, "binomial", "-3.5312037887809,0.0225403720954142,-0...
CURRENT DEVIANCE:      376.077873636936
Iteration 4...
  Aggregated (glmDS2(diabetes ~ bmi + gender, "binomial", "-5.09595491819132,0.0538611685523527,-...
CURRENT DEVIANCE:      305.060218021482
Iteration 5...
  Aggregated (glmDS2(diabetes ~ bmi + gender, "binomial", "-6.64316058862581,0.0954630406913807,-...
CURRENT DEVIANCE:      290.53336981469
Iteration 6...
  Aggregated (glmDS2(diabetes ~ bmi + gender, "binomial", "-7.37241744003646,0.116488672050102,-0...
CURRENT DEVIANCE:      289.008148333213
Iteration 7...
  Aggregated (glmDS2(diabetes ~ bmi + gender, "binomial", "-7.46062108286853,0.118970774445128,-0...
CURRENT DEVIANCE:      288.981069589415
Iteration 8...
  Aggregated (glmDS2(diabetes ~ bmi + gender, "binomial", "-7.46199501499254,0.119009690846053,-0...
CURRENT DEVIANCE:      288.981056969819
Iteration 9...
  Aggregated (glmDS2(diabetes ~ bmi + gender, "binomial", "-7.46199544706323,0.119009703477003,-0...
CURRENT DEVIANCE:      288.981056969815
SUMMARY OF MODEL STATE after iteration 9
Current deviance 288.981056969815 on 2063 degrees of freedom
Convergence criterion TRUE (1.19947276616999e-14)

beta: -7.4619954470633 0.119009703477005 -0.609693778835752

Information matrix overall:
            (Intercept)        bmi    gender1
(Intercept)   28.247764   869.6887   8.891516
bmi          869.688718 27696.7482 261.610627
gender1        8.891516   261.6106   8.891516

Score vector overall:
                     [,1]
(Intercept) -1.926846e-12
bmi         -5.533748e-11
gender1     -1.711487e-12

Current deviance: 288.981056969815

$Nvalid
[1] 2066

$Nmissing
[1] 97

$Ntotal
[1] 2163

$disclosure.risk
       RISK OF DISCLOSURE
study1                  0

$errorMessage
       ERROR MESSAGES
study1 "No errors"   

$nsubs
[1] 2066

$iter
[1] 9

$family

Family: binomial 
Link function: logit 


$formula
[1] "diabetes ~ bmi + gender"

$coefficients
              Estimate Std. Error   z-value      p-value low0.95CI.LP high0.95CI.LP         P_OR
(Intercept) -7.4619954 1.07344198 -6.951466 3.615096e-12    -9.565903    -5.3580878 0.0005741788
bmi          0.1190097 0.03339485  3.563714 3.656438e-04     0.053557     0.1844624 1.1263808480
gender1     -0.6096938 0.41055755 -1.485039 1.375336e-01    -1.414372     0.1949842 0.5435172801
            low0.95CI.P_OR high0.95CI.P_OR
(Intercept)   7.007299e-05     0.004687824
bmi           1.055017e+00     1.202571770
gender1       2.430783e-01     1.215291815

$dev
[1] 288.9811

$df
[1] 2063

$output.information
[1] "SEE TOP OF OUTPUT FOR INFORMATION ON MISSING DATA AND ERROR MESSAGES"

> ds.glm(formula = 'diabetes~bmi+gender', family = 'binomial', datasources = connections[2])
  Aggregated (glmDS1(diabetes ~ bmi + gender, "binomial", NULL, NULL, NULL)) [===========] 100% / 0s
Iteration 1...
  Aggregated (glmDS2(diabetes ~ bmi + gender, "binomial", "0,0,0", NULL, NULL, ) [=======] 100% / 0s
CURRENT DEVIANCE:      4072.93283297024
Iteration 2...
  Aggregated (glmDS2(diabetes ~ bmi + gender, "binomial", "-2.24627544296761,0.0107040207472627,0...
CURRENT DEVIANCE:      956.077881764171
Iteration 3...
  Aggregated (glmDS2(diabetes ~ bmi + gender, "binomial", "-3.87786831037723,0.0323345801998211,0...
CURRENT DEVIANCE:      548.967284423667
Iteration 4...
  Aggregated (glmDS2(diabetes ~ bmi + gender, "binomial", "-5.92977626926902,0.0769132344707401,0...
CURRENT DEVIANCE:      446.162757054575
Iteration 5...
  Aggregated (glmDS2(diabetes ~ bmi + gender, "binomial", "-8.14674303562793,0.135297302043061,0....
CURRENT DEVIANCE:      422.584835929038
Iteration 6...
  Aggregated (glmDS2(diabetes ~ bmi + gender, "binomial", "-9.30614037435908,0.165912147648399,0....
CURRENT DEVIANCE:      419.664329404253
Iteration 7...
  Aggregated (glmDS2(diabetes ~ bmi + gender, "binomial", "-9.49962996098236,0.170750652776734,0....
CURRENT DEVIANCE:      419.59769692462
Iteration 8...
  Aggregated (glmDS2(diabetes ~ bmi + gender, "binomial", "-9.50467644612636,0.170871561506694,0....
CURRENT DEVIANCE:      419.597650514433
Iteration 9...
  Aggregated (glmDS2(diabetes ~ bmi + gender, "binomial", "-9.50467996467969,0.170871641631032,0....
CURRENT DEVIANCE:      419.597650514408
SUMMARY OF MODEL STATE after iteration 9
Current deviance 419.597650514408 on 2935 degrees of freedom
Convergence criterion TRUE (5.91868313099369e-14)

beta: -9.50467996468151 0.170871641631071 0.442407754385301

Information matrix overall:
            (Intercept)        bmi   gender0
(Intercept)    42.34133  1340.5383  28.59491
bmi          1340.53828 43567.8350 922.63471
gender0        28.59491   922.6347  28.59491

Score vector overall:
                     [,1]
(Intercept) -1.115877e-11
bmi         -3.013110e-10
gender0     -2.492645e-12

Current deviance: 419.597650514408

$Nvalid
[1] 2938

$Nmissing
[1] 150

$Ntotal
[1] 3088

$disclosure.risk
       RISK OF DISCLOSURE
study2                  0

$errorMessage
       ERROR MESSAGES
study2 "No errors"   

$nsubs
[1] 2938

$iter
[1] 9

$family

Family: binomial 
Link function: logit 


$formula
[1] "diabetes ~ bmi + gender"

$coefficients
              Estimate Std. Error   z-value      p-value low0.95CI.LP high0.95CI.LP         P_OR
(Intercept) -9.5046800 0.95799362 -9.921444 3.358620e-23  -11.3823130    -7.6270470 7.449679e-05
bmi          0.1708716 0.03023733  5.651016 1.595022e-08    0.1116076     0.2301357 1.186338e+00
gender0      0.4424078 0.33301190  1.328504 1.840116e-01   -0.2102836     1.0950991 1.556450e+00
            low0.95CI.P_OR high0.95CI.P_OR
(Intercept)   1.139513e-05      0.00048686
bmi           1.118074e+00      1.25877084
gender0       8.103544e-01      2.98947889

$dev
[1] 419.5977

$df
[1] 2935

$output.information
[1] "SEE TOP OF OUTPUT FOR INFORMATION ON MISSING DATA AND ERROR MESSAGES"

> ds.glm(formula = 'diabetes~bmi+gender', family = 'binomial', datasources = connections[3])
  Aggregated (glmDS1(diabetes ~ bmi + gender, "binomial", NULL, NULL, NULL)) [===========] 100% / 0s
Iteration 1...
  Aggregated (glmDS2(diabetes ~ bmi + gender, "binomial", "0,0,0", NULL, NULL, ) [=======] 100% / 0s
CURRENT DEVIANCE:      5438.43277867333
Iteration 2...
  Aggregated (glmDS2(diabetes ~ bmi + gender, "binomial", "-2.11966145035052,0.00706903494537585,...
CURRENT DEVIANCE:      1294.72069118682
Iteration 3...
  Aggregated (glmDS2(diabetes ~ bmi + gender, "binomial", "-3.49132828533595,0.021318854784575,-0...
CURRENT DEVIANCE:      762.985935101846
Iteration 4...
  Aggregated (glmDS2(diabetes ~ bmi + gender, "binomial", "-4.98842420098296,0.0506082609379765,-...
CURRENT DEVIANCE:      639.19675364723
Iteration 5...
  Aggregated (glmDS2(diabetes ~ bmi + gender, "binomial", "-6.42974865101814,0.0893235660463956,-...
CURRENT DEVIANCE:      617.463312221223
Iteration 6...
  Aggregated (glmDS2(diabetes ~ bmi + gender, "binomial", "-7.11163311520548,0.109571838249463,-0...
CURRENT DEVIANCE:      615.686548946395
Iteration 7...
  Aggregated (glmDS2(diabetes ~ bmi + gender, "binomial", "-7.19078931189972,0.111919779017506,-0...
CURRENT DEVIANCE:      615.666756661298
Iteration 8...
  Aggregated (glmDS2(diabetes ~ bmi + gender, "binomial", "-7.19171463936573,0.111947040802439,-0...
CURRENT DEVIANCE:      615.666753696161
SUMMARY OF MODEL STATE after iteration 8
Current deviance 615.666753696161 on 3920 degrees of freedom
Convergence criterion TRUE (4.81535632481671e-09)

beta: -7.1917147736614 0.111947044776525 -0.362363530796833

Information matrix overall:
            (Intercept)        bmi   gender1
(Intercept)    60.61566  1822.1393  22.64236
bmi          1822.13933 56229.3265 660.86093
gender1        22.64236   660.8609  22.64236

Score vector overall:
                     [,1]
(Intercept) -1.443043e-06
bmi         -3.712178e-05
gender1     -9.584138e-07

Current deviance: 615.666753696161

$Nvalid
[1] 3923

$Nmissing
[1] 205

$Ntotal
[1] 4128

$disclosure.risk
       RISK OF DISCLOSURE
study3                  0

$errorMessage
       ERROR MESSAGES
study3 "No errors"   

$nsubs
[1] 3923

$iter
[1] 8

$family

Family: binomial 
Link function: logit 


$formula
[1] "diabetes ~ bmi + gender"

$coefficients
              Estimate Std. Error   z-value      p-value low0.95CI.LP high0.95CI.LP         P_OR
(Intercept) -7.1917148 0.82558874 -8.711014 3.011693e-18  -8.80983896    -5.5735906 0.0007522309
bmi          0.1119470 0.02646973  4.229247 2.344747e-05   0.06006732     0.1638268 1.1184536311
gender1     -0.3623635 0.26807060 -1.351747 1.764564e-01  -0.88777226     0.1630452 0.6960292938
            low0.95CI.P_OR high0.95CI.P_OR
(Intercept)    0.000149235     0.003782462
bmi            1.061908032     1.178010230
gender1        0.411571607     1.177089889

$dev
[1] 615.6668

$df
[1] 3920

$output.information
[1] "SEE TOP OF OUTPUT FOR INFORMATION ON MISSING DATA AND ERROR MESSAGES"

> 
> ds.levels(x = 'gender', datasources = connections)
  Aggregated (exists("gender")) [========================================================] 100% / 3s
  Aggregated (classDS("gender")) [=======================================================] 100% / 0s
  Aggregated (levelsDS(gender)) [========================================================] 100% / 1s
$study1
$study1$Levels
[1] "0" "1"

$study1$ValidityMessage
[1] "VALID ANALYSIS"


$study2
$study2$Levels
[1] "1" "0"

$study2$ValidityMessage
[1] "VALID ANALYSIS"


$study3
$study3$Levels
[1] "0" "1"

$study3$ValidityMessage
[1] "VALID ANALYSIS"


> 
> ds.changeRefGroup(x = 'gender', ref = 0, newobj = 'gender', datasources = connections)
  Aggregated (exists("gender")) [========================================================] 100% / 1s
  Aggregated (classDS("gender")) [=======================================================] 100% / 0s
  Assigned expr. (gender <- changeRefGroupDS(gender,'0',FALSE)) [========================] 100% / 1s
  Aggregated (exists("gender")) [========================================================] 100% / 1s
> 
> ds.levels(x = 'gender', datasources = connections)
  Aggregated (exists("gender")) [========================================================] 100% / 1s
  Aggregated (classDS("gender")) [=======================================================] 100% / 0s
  Aggregated (levelsDS(gender)) [========================================================] 100% / 1s
$study1
$study1$Levels
[1] "0" "1"

$study1$ValidityMessage
[1] "VALID ANALYSIS"


$study2
$study2$Levels
[1] "0" "1"

$study2$ValidityMessage
[1] "VALID ANALYSIS"


$study3
$study3$Levels
[1] "0" "1"

$study3$ValidityMessage
[1] "VALID ANALYSIS"


> 
> ds.glm(formula = 'diabetes~bmi+gender', family = 'binomial', datasources = connections)
  Aggregated (glmDS1(diabetes ~ bmi + gender, "binomial", NULL, NULL, NULL)) [===========] 100% / 1s
Iteration 1...
  Aggregated (glmDS2(diabetes ~ bmi + gender, "binomial", "0,0,0", NULL, NULL, ) [=======] 100% / 1s
CURRENT DEVIANCE:      12375.4497617173
Iteration 2...
  Aggregated (glmDS2(diabetes ~ bmi + gender, "binomial", "-2.15622992309117,0.00835653658473421,...
CURRENT DEVIANCE:      2915.15452834427
Iteration 3...
  Aggregated (glmDS2(diabetes ~ bmi + gender, "binomial", "-3.60345579963896,0.0252352254499992,-...
CURRENT DEVIANCE:      1688.85182157007
Iteration 4...
  Aggregated (glmDS2(diabetes ~ bmi + gender, "binomial", "-5.26379007863147,0.0600362362113038,-...
CURRENT DEVIANCE:      1392.2543056008
Iteration 5...
  Aggregated (glmDS2(diabetes ~ bmi + gender, "binomial", "-6.93445026106209,0.105943099087113,-0...
CURRENT DEVIANCE:      1333.38210025201
Iteration 6...
  Aggregated (glmDS2(diabetes ~ bmi + gender, "binomial", "-7.74236578021945,0.129578089015349,-0...
CURRENT DEVIANCE:      1327.52015296092
Iteration 7...
  Aggregated (glmDS2(diabetes ~ bmi + gender, "binomial", "-7.84478123492401,0.132513026838178,-0...
CURRENT DEVIANCE:      1327.43126687773
Iteration 8...
  Aggregated (glmDS2(diabetes ~ bmi + gender, "binomial", "-7.84637183574817,0.132558462251436,-0...
CURRENT DEVIANCE:      1327.43124065026
Iteration 9...
  Aggregated (glmDS2(diabetes ~ bmi + gender, "binomial", "-7.84637225890271,0.132558474471549,-0...
CURRENT DEVIANCE:      1327.43124065026
SUMMARY OF MODEL STATE after iteration 9
Current deviance 1327.43124065026 on 8924 degrees of freedom
Convergence criterion TRUE (2.05530688978913e-15)

beta: -7.84637225890274 0.132558474471551 -0.442518769668178

Information matrix overall:
            (Intercept)        bmi  gender1
(Intercept)    131.3925   4041.377   45.296
bmi           4041.3775 127861.635 1342.211
gender1         45.2960   1342.211   45.296

Score vector overall:
                     [,1]
(Intercept)  1.191935e-12
bmi          4.318679e-11
gender1     -1.723066e-13

Current deviance: 1327.43124065026

$Nvalid
[1] 8927

$Nmissing
[1] 452

$Ntotal
[1] 9379

$disclosure.risk
       RISK OF DISCLOSURE
study1                  0
study2                  0
study3                  0

$errorMessage
       ERROR MESSAGES
study1 "No errors"   
study2 "No errors"   
study3 "No errors"   

$nsubs
[1] 8927

$iter
[1] 9

$family

Family: binomial 
Link function: logit 


$formula
[1] "diabetes ~ bmi + gender"

$coefficients
              Estimate Std. Error    z-value      p-value low0.95CI.LP high0.95CI.LP         P_OR
(Intercept) -7.8463723 0.54307740 -14.447982 2.581403e-47  -8.91078440   -6.78196011 0.0003910155
bmi          0.1325585 0.01697825   7.807546 5.831236e-15   0.09928171    0.16583524 1.1417457771
gender1     -0.4425188 0.18585796  -2.380951 1.726799e-02  -0.80679368   -0.07824386 0.6424162829
            low0.95CI.P_OR high0.95CI.P_OR
(Intercept)   0.0001349078     0.001132765
bmi           1.1043773730     1.180378602
gender1       0.4462867140     0.924738890

$dev
[1] 1327.431

$df
[1] 8924

$output.information
[1] "SEE TOP OF OUTPUT FOR INFORMATION ON MISSING DATA AND ERROR MESSAGES"
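
As a rough cross-check of the corrected estimate, the three study-specific gender coefficients printed above can be combined with inverse-variance weights. This is only a sketch: it is a simple fixed-effect average, not what ds.glm computes (ds.glm fits one model to all studies jointly), so the numbers should agree only approximately. The estimates and standard errors are copied from the per-study outputs; the object names are hypothetical.

```r
# Study-specific gender coefficients and standard errors, copied from the outputs above.
# Study 2 was labelled gender0 (reference group "1"); studies 1 and 3 were labelled gender1.
est_as_reported <- c(study1 = -0.6096938, study2 = 0.4424078, study3 = -0.3623635)
se <- c(study1 = 0.41055755, study2 = 0.33301190, study3 = 0.26807060)

# Express study 2 as gender=1 versus gender=0 by flipping its sign
est_aligned <- est_as_reported
est_aligned["study2"] <- -est_aligned["study2"]

w <- 1 / se^2  # inverse-variance weights

naive_average   <- weighted.mean(est_as_reported, w)  # ignores the different coding
aligned_average <- weighted.mean(est_aligned, w)      # same reference group everywhere

print(naive_average)
print(aligned_average)
```

The aligned average comes out at roughly -0.44, in line with the corrected pooled estimate of -0.4425188, while the naive average is pulled towards zero, in the same direction as the incorrect pooled estimate of -0.1071360.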