DSI and R environments for v6.0

Hi,

I am trying to fix a problem in DataSHIELD v6.0, which is the release that will use DSI for the first time and hence allows the use of DSLite.

If I have understood correctly, DSLite works by having its own environment in R, thus allowing you to run the client and server on the same instance of R. Basically the environments keep the server and client side separate.

Previously @yannick wrote:

I think I need to understand more about this. In the *SLMADS2 group of functions, a model formula is created with this envir = parent.frame() parameter. Subsequently offset and weight vectors are created, also with this parameter set. However, when the glm command is called, it cannot find the offset and weight vectors, I think because they are not in the same environment as the formula. glm uses the formula’s environment by default.

So, I wonder how do I create the offset and weight vectors in the same environment as the formula?

Thanks

Tom

Yes, I noticed that too and showed it to Stuart during the last DS workshop. The test suite wasn’t stable enough at that time and there was still a lot to be done before the release. It’s good that you brought this subject back.

Yannick

Thanks for the reassurance that there is a problem and it is not me doing something wrong!

After a fair amount of investigation, it seems to come down to parent.frame() returning different environments for the following:

formula2use <- stats::as.formula(paste0(Reduce(paste, deparse(originalFormula))), env = parent.frame())

and

offset.to.use <- eval(parse(text=cbindtext.offset), envir = parent.frame())

That is, formula2use and offset.to.use end up in different environments. When you try to fit the glm model using formula2use with the offset offset.to.use, R complains it can’t find offset.to.use, even though exists returns true. I also wrote tests to return the environments of the 2 objects and they are indeed different.

It seems I don’t understand enough about R environments to know why this is happening so I will try investigating more. My suspicion is that it has something to do with parent.frame() returning the environment that the function was invoked from, which could be different for as.formula compared to eval?
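One way to check this suspicion is a small base R sketch. The function names inner_demo and outer_demo are made up for illustration and are not dsBase code. It suggests that an explicit parent.frame() passed to as.formula and to eval from the same function resolves to the same environment (the caller of that function), whereas eval without an envir argument works in the function's own frame.

```r
# Hypothetical helper: records which environment each call ends up using.
inner_demo <- function() {
  # Explicit parent.frame(): the argument is evaluated in inner_demo's frame,
  # so it refers to whoever called inner_demo.
  f <- stats::as.formula("y ~ x", env = parent.frame())
  explicit_env <- eval(quote(environment()), envir = parent.frame())
  # Default envir: parent.frame() is evaluated inside eval itself,
  # so it refers to inner_demo's own frame.
  default_env <- eval(quote(environment()))
  list(own = environment(),
       formula_env = environment(f),
       explicit_env = explicit_env,
       default_env = default_env)
}

outer_demo <- function() {
  res <- inner_demo()
  c(formula_is_caller  = identical(res$formula_env, environment()),
    explicit_is_caller = identical(res$explicit_env, environment()),
    default_is_own     = identical(res$default_env, res$own))
}

env_check <- outer_demo()
print(env_check)
```

If all three checks come back TRUE, the two explicit parent.frame() calls agree with each other, and the difference has to come from somewhere else, for example from which environment the fitted model later uses to look up its variables.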

I’ve spent more time on this than I like to admit, but I think I have an answer. The problem is it is not easy to convey all the moving parts that generated the errors I saw. I think this will be useful for @swheater to know because I think it could affect other functions.

The shortest possible answer I can give is as follows. To allow a function to work correctly with DSLite, it is not simply a case of adding envir = parent.frame() to every environment-dependent function like eval, get, assign and as.formula. This will also cause problems with non-DSLite set ups (as I found). Instead you need to pay attention to the environments of the objects that you are dealing with.

The detail of the problem for the SLMADS functions is that a formula object has an associated environment, and this environment (rather than the parent environment) is used by model.frame to evaluate variables that are not found in the supplied data argument. So if you put env = parent.frame() in as.formula, model.frame looks for the offset in the parent environment, and it can’t find it there because the offset was created inside the function. Therefore the env argument should be removed from the as.formula call.
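The behaviour can be reproduced with a minimal sketch in base R. Here fit_demo, demo_data and demo_offset are hypothetical names, and this is not the actual glmSLMADS2 code. The offset exists only in the function's frame, so the fit works when the formula's environment is that frame and fails when it is the parent frame.

```r
# Hypothetical stand-in for a server-side fitting function (not the dsBase code).
fit_demo <- function(use_parent_env) {
  # The data frame and the offset exist only in this function's frame.
  demo_data <- data.frame(y = c(1.2, 1.9, 3.1, 4.2, 4.8, 6.1, 7.0, 8.2),
                          x = 1:8)
  demo_offset <- rep(0.5, 8)

  # Choose which environment gets attached to the formula.
  formula_env <- if (use_parent_env) parent.frame() else environment()
  f <- stats::as.formula("y ~ x", env = formula_env)

  # y and x come from demo_data; the offset is looked up via the
  # formula's environment because it is not a column of demo_data.
  stats::glm(f, data = demo_data, offset = demo_offset)
}

# Formula environment = the function's own frame: the offset is found.
local_fit <- fit_demo(use_parent_env = FALSE)

# Formula environment = the caller: the offset is not visible from there.
parent_fit <- tryCatch(fit_demo(use_parent_env = TRUE),
                       error = function(e) conditionMessage(e))
print(parent_fit)
```

Note that exists("demo_offset") would return TRUE inside fit_demo in both cases, which matches the confusing symptom described earlier in the thread.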

Hi Tom, I have noticed that adding envir = parent.frame() has caused a problem with ds.tapply, ds.tapply.assign and other functions.

I need to review the changes to “dsBase”.

Stuart

Tom, I think you have hit the nail on the head. DSLite could be used for development and for testing some ideas (at the early stage of function development). However, it should not replace DataSHIELD as a whole.

It has a different architecture and may not be able to check the conversion of data from the databases in the Opal servers and other elements. It is an idea that needs more experimenting with and also more testing. I can see a lot of potential, but perhaps not with the current aspect of DataSHIELD.

Patricia

Yes I think you have the same problem with these functions. For example, with tapplyDS.R it fails on line 50, where you have:

active.factor.name<-eval(parse(text=activation.text.0), envir = parent.frame())

because it can’t find an object that is called for using the activation.text.0 variable. This variable contains the name of another variable that is instantiated in the environment of the function, not in the parent.frame(), so if you tell eval to look there it won’t find it. Hence in this case the eval should not have parent.frame() in it.
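A reduced version of this pattern, with made-up names (lookup_demo, demo_factor, activation_text) rather than the real tapplyDS code, might look as follows. It is only a sketch of the lookup behaviour under the assumption that the named variable is created inside the function.

```r
# Hypothetical reduction of the pattern; not the actual tapplyDS code.
lookup_demo <- function(use_parent_frame) {
  # The object is created inside the function ...
  demo_factor <- factor(c("a", "b", "a"))
  # ... and referred to by name through a text variable.
  activation_text <- "demo_factor"

  if (use_parent_frame) {
    # Looks in the caller's environment, where demo_factor does not exist.
    eval(parse(text = activation_text), envir = parent.frame())
  } else {
    # Default envir: evaluates in this function's own frame.
    eval(parse(text = activation_text))
  }
}

found <- lookup_demo(use_parent_frame = FALSE)
missing_msg <- tryCatch(lookup_demo(use_parent_frame = TRUE),
                        error = function(e) conditionMessage(e))
print(missing_msg)
```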

I am happy to help further if that doesn’t make sense.

Tom

I think that DSLite forces you to think clearly about where the data are defined. It is not always obvious when reading dsBase code that some data are assigned in the GlobalEnv, or how the formula variables are resolved. The code is stronger if it works with DSLite the same way it works with DSOpal.

Yannick

I agree that a DSLite instance cannot replace Opal or another backend in a production or formal testing set up, but it is so useful for development and saves a lot of time.

I also agree that it ensures developers are rigorous in where they look for variables in their code, which should reduce the potential for errors.

It is my aim to have the Continuous Integration test both DSOpal and DSLite configurations.

Stuart

As you will have seen, I have submitted a pull request for fixing tapplyDS, which was relatively simple to fix.

Today I started to look at fixing glmSLMADS*, as this will allow me to do the same for (g)lmerSLMADS2. I got as far as having a branch ready for a pull request, and then I came across another issue as follows.

For v5.1, you can fit your model in the following ways:

  1. formula='BMI ~ age + male', dataName = 'D', that is, a dataframe called D with the appropriate columns
  2. formula='D$BMI ~ D$age + D$male', dataName = NULL, that is, a dataframe called D with the appropriate columns, but not using the dataName parameter
  3. formula='BMI_vec ~ age_vec + male_vec', dataName = NULL, that is, 3 separate vectors (of the same dimensions)

I think #1 tends to be most commonly used. #2 and #3 will require quite a lot of recoding for these types of functions. For #1, the dataName parameter is used to get a copy of the dataframe from the parent environment and place it in the local environment. Assuming you do the same with the offset and weights, the model can then be evaluated in the local environment. But for #2 and #3, the dataName parameter is not used, so the objects exist only in the parent environment. With DSLite they are not automatically found, since the parent environment is not globalenv(). (This explains why the problem doesn’t exist pre-DSLite: you could be lazy about where objects are and they would get found in globalenv().)
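To make case #1 concrete, here is a rough base R sketch. The names fit_case1, session_demo and D_demo are hypothetical, the numbers are invented, and this is not the dsBase implementation. The data frame lives only in the calling frame (as it would with DSLite), is copied into the local environment by name, and the model is then fitted locally.

```r
# Hypothetical server-side function for case #1 (not the dsBase implementation).
fit_case1 <- function(formula_text, dataName) {
  # Fetch the data frame by name from the caller and keep a local copy.
  data_local <- eval(parse(text = dataName), envir = parent.frame())
  # No env argument: the formula's environment is this function's frame.
  f <- stats::as.formula(formula_text)
  # All variables are columns of data_local, so nothing else needs to be found.
  stats::glm(f, data = data_local)
}

# Stand-in for a DSLite-like session: D_demo lives in this frame,
# not in globalenv().
session_demo <- function() {
  D_demo <- data.frame(
    BMI  = c(22.1, 25.3, 27.8, 24.0, 30.2, 21.5, 26.7, 28.9),
    age  = c(34, 45, 52, 38, 61, 29, 47, 55),
    male = c(0, 1, 1, 0, 1, 0, 0, 1)
  )
  fit_case1("BMI ~ age + male", dataName = "D_demo")
}

case1_fit <- session_demo()
print(stats::coef(case1_fit))
```

For cases #2 and #3 there is no dataName to copy from, so the same local evaluation has nothing to work with unless the formula's environment points at the frame where D or the vectors actually live.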

I can think of 2 potential solutions. The first is to do all evaluation in the parent environment. I don’t like this because you will have to create a load of stuff in the parent environment and then clean it up again at the end. The other is to remove options #2 and #3, which might not be so bad because the ds.dataFrame function can be used to make the dataframe before fitting the model.

Of course, I might be missing a much more obvious and simple solution…

Hi Tom,

I don’t have a strong opinion on this, but I believe that there are some (probably rare) occasions where combining the required variables in one dataframe is difficult and therefore option #2 could be necessary. For example, if at some point we move forward with the implementation of vertical DataSHIELD, then a syntax like ‘D1$BMI ~ D2$age + D3$male’, where D1, D2 and D3 are different data frames, will be used. This might also be the case for some omics analyses where exposures and outcomes are stored in different tables (with different structures) even in the same location. This is my thought but I might be wrong. :)

I agree that it would be better not to have to change this. I feel that it should be possible to make the changes so that it works both for DSLite and the standard approach. The limiting factor is my understanding of how environments/frames and so on work in R. I must admit I find it really challenging.

This page can help to understand environments in R: http://adv-r.had.co.nz/Environments.html

Tom, I am working on integrating “ds.glmerSLMA” and “ds.lmerSLMA” into dsBaseClient, and “glmerSLMADS2” and “lmerSLMADS2” into dsBase. I will try to keep this work in line with your changes. What I have done is in the “StuartWheater/dsBaseClient” and “StuartWheater/dsBase” repositories, “to_v6.0-dev” branch on GitHub. Stuart

I think I can see what is happening. Normally a function looks for variables in its own environment and then in its chain of enclosing environments, which includes globalenv.

With a ‘normal’ set up, the data from Opal are created in globalenv and hence get found.

With DSLite, the data are not in globalenv but in the environment that calls glmSLMADS2 (or whatever). Hence we need to use parent.frame.
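The difference between the two lookups can be sketched in base R. The names server_fn_demo, dslite_like_demo and D_session_demo are invented for illustration, and the second function only mimics the idea that the data sit in the frame that calls the server-side function rather than in globalenv.

```r
# Hypothetical server-side function: reports where a named object is visible.
server_fn_demo <- function(name) {
  c(
    # Lexical lookup: this frame, then its enclosing environments.
    lexical = exists(name, envir = environment(), inherits = TRUE),
    # Calling-frame lookup: only the frame that called this function.
    calling_frame = exists(name, envir = parent.frame(), inherits = FALSE)
  )
}

# Stand-in for a DSLite-like session: the data exist only in this frame.
dslite_like_demo <- function() {
  D_session_demo <- data.frame(a = 1:3)
  server_fn_demo("D_session_demo")
}

scope_check <- dslite_like_demo()
print(scope_check)
```

In this sketch the lexical lookup does not see the object, while the lookup through parent.frame() does, which is the situation described for DSLite.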

Anyway, I think I can now solve the problems and get it working for my functions.

Sure, I will work with what you have there by merging it into my fork and add some tests.

Just to follow up about the issue of the different formula notations. I managed to get it working for the glmSLMA case, so all variants can be used.

However, for (g)lmerSLMA, the ‘D1$BMI ~ D2$age + (1|D3$male)’ style notation does not work for the grouping factor (D3$male) on DSLite. In the help for lmer it says:

data is an optional data frame containing the variables named in formula. By default the variables are taken from the environment from which lmer is called. While data is optional, the package authors strongly recommend its use, especially when later applying methods such as update and drop1 to the fitted model (such methods are not guaranteed to work properly if data is omitted). If data is omitted, variables will be taken from the environment of formula (if specified as a formula) or from the parent frame (if specified as a character vector).

I will have a bit of a dig to see if I can make it work, but clearly the package authors aren’t that comfortable with omitting the ‘data’ parameter.

Tom

And for my own records as much as anything else, the following code reproduces the problem. It seems that the code for (g)lmer does not look for the grouping factor in the environment of formula but instead in the calling frame.

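# Assumes a data frame `dtStudent` with columns test, trtGrp, Male and idSchool
# is already visible from f (for example in globalenv()); it is not created here.
f <- function(my_formula, dataName) {
  # Local copy, so that dtStudente exists only in f's frame.
  dtStudente <- dtStudent
  the_form <- my_formula
  the_data <- dataName
  g(the_form, the_data)
}
# Intermediate layer, so that h has to look two frames up to reach f.
g <- function(my_formula2, dataName2) {
  the_form <- my_formula2
  the_data <- dataName2
  h(the_form, the_data)
}
h <- function(my_formula3, dataName3) {
  # parent.frame(2) is f's frame, where dtStudente lives.
  form_to_use <- as.formula(my_formula3, env = parent.frame(2))
  data_to_use <- eval(parse(text = dataName3), envir = parent.frame(2))
  # Variant with an explicit data argument:
  #res <- lme4::lmer(formula = form_to_use, data = data_to_use)
  # Variant without data; this is the call that reproduces the problem:
  res <- lme4::lmer(formula = form_to_use)
  summary(res)
}
# Column-name notation, for use with the explicit data variant:
#f(my_formula = "test ~ trtGrp + Male + (1|idSchool)", dataName = "dtStudente")
# Dollar notation, including the grouping factor:
f(my_formula = "dtStudente$test ~ dtStudente$trtGrp + dtStudente$Male + (1|dtStudente$idSchool)", dataName = "dtStudente")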