Pipes and DataSHIELD parser

I am working on mixed models, which have formulae like:

bmi.26 ~ noise.56 + female + (1|id)

(the part with the pipe is the random effect)

This seems to work fine when developing on DSLite.

I am now trying it with Opal, and I get a 400 error (bad request), failing where the aggregate function is called.

If I pass a formula without a pipe into my function, the aggregate function is called and I get the expected error “no random effect specified”, which suggests it is working properly.

Also if I try a formula with a pipe in it with the standard glm function I get the same 400 error.

So, my suspicion is that with Opal, the parser is complaining because of the pipe. Is that correct?

That is correct, the DataSHIELD R parser is rather restrictive regarding the formula syntax. Logical operators (& and |) are not allowed. Like I said for the “Polynomial regression” topic, the allowed syntax can be amended if you think it would be useful and safe.

Thanks Yannick.

I don’t think there is another way that we can specify this type of model in a formula, unless @demetris.avraam or @paul.genepi know different?

Is it possible that the constraint could be relaxed in the context of an R formula only, and still block it in other parts of a request? Or is that too much detail for the parser to understand?

Hi @tombishop

The only way that I can think of, without changing the parser, is to send the separate parts to the serverside and then create the formula there (as I also said for polynomial regression):

For example instead of trying to pass the argument formula=“bmi.26 ~ noise.56 + female + (1|id)” you can pass three different arguments, e.g. formula.for.fixed.effects=“bmi.26 ~ noise.56 + female”, random.effects.before.the.pipe=“1”, random.effects.after.the.pipe=“id” (I know that these names for arguments are not good!).

Then, you can use R functions in the serverside to put the three arguments together and form the mixed model formula.
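A minimal sketch of that server-side assembly, assuming the three pieces arrive as plain strings. The function name `build_mixed_formula` and its argument names are hypothetical, not part of any DataSHIELD package:

```r
# Hypothetical server-side helper; names are illustrative only.
build_mixed_formula <- function(fixed, re_before, re_after) {
  # Rebuild the random-effect term, e.g. "(1|id)"
  random_term <- paste0("(", re_before, "|", re_after, ")")
  # Append it to the fixed-effects part and convert to a formula object
  stats::as.formula(paste(fixed, random_term, sep = " + "))
}

f <- build_mixed_formula("bmi.26 ~ noise.56 + female", "1", "id")
all.vars(f)  # the variables the rebuilt formula refers to
```

As written this handles exactly one random-effect term; more terms would need more arguments or some encoding of a list.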

Hi Demetris,

Yes I will give that some more thought.

Basically I would need to encode (1|X) and (Z+Y|X) on the client side into an acceptable form for as many terms as needed, and then decode this on the server side to construct the true formula.

I guess the trade off is between the extra work in doing this versus the loss of security. But this is what has been done for ds.Boole, right?

Tom

That’s right. For ds.Boole, instead of sending the Boolean operators, which can’t pass through the parser, we send numerical indicators which are then converted back to operators in the serverside.
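To illustrate the idea of numeric indicators, here is a rough sketch. The code table and both function names are assumptions made for this example and may not match what ds.Boole actually uses:

```r
# Illustrative code table only; the real ds.Boole mapping may differ.
decode_operator <- function(code) {
  operators <- c("==", "!=", "<", "<=", ">", ">=")
  if (!is.numeric(code) || length(code) != 1 ||
      !(code %in% seq_along(operators))) {
    stop("unknown operator code")
  }
  operators[code]
}

# Server side: turn the code back into an operator and apply it
apply_operator <- function(x, code, threshold) {
  op <- match.fun(decode_operator(code))
  op(x, threshold)
}

# Code 5 stands for ">" in this sketch
result <- apply_operator(c(18, 25, 31), 5, 24)
```

Only a number crosses the parser; the operator itself exists solely on the server, and anything outside the known codes is rejected.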

My initial idea for this was to simply split my formula into pieces at each pipe (thus removing them), pass the character vector to the server and then insert the pipes at the splits on the server side:

on the client:

items = strsplit(x = "BMI ~ trtGrp + Male+ (1|idSurgery) + (trtGrp||idDoctor)", split="|", fixed=TRUE)

on the server:

formula = paste(items[[1]],collapse="|")

However, this seems to fail on transmission to the server side:

Error in if (!is.na(replacement)) { : argument is of length zero

It’s a bit hard to decipher this error message as no further detail is provided, but it occurs when calling the server side function. However, trying a formula with no pipe in it, which gives a character vector length 1, does not cause this error.

Does the parser not like it if a vector is passed to DS?
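As a local sanity check (no DataSHIELD involved), the split-and-rejoin step does round-trip in plain R, so the logic itself looks sound. What differs from the no-pipe case is only that the argument becomes a character vector of length greater than one:

```r
original <- "BMI ~ trtGrp + Male+ (1|idSurgery) + (trtGrp||idDoctor)"

# "Client" step: split at every pipe; "||" leaves an empty string between pieces
items <- strsplit(x = original, split = "|", fixed = TRUE)[[1]]

# "Server" step: put the pipes back between the pieces
rebuilt <- paste(items, collapse = "|")

length(items)                # more than one element when a pipe is present
identical(rebuilt, original)
```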

Only simple types are allowed as parameters: string, numeric, range, subset. Even string content is restricted. See examples in DataShieldExprTest and the grammar definition datashield.jjt.

Yannick

Thanks Yannick, that’s useful to see the actual rules written out.

In that case, my next suggestion is to gsub each pipe with another string on the client side, and reverse this on the server side. At the moment for testing I have just used an arbitrary sequence of numbers (matching on client and server). I am trying to think if this is a bad idea, and what a better idea might look like. Maybe something like [pipe] would be better?

Hi Tom (cc Yannick)

Demetris and I discussed this on Friday. It should be relatively straightforward - it is fundamentally no different to any of the other concealing manoeuvres we use to get things through the parser which can then only be interpreted if the serverside function expects them and they can then be reversed.

Tom: just remembered Demetris is away all week - I’ll call you

P

Having thought I had solved this, I am still having a few problems.

I think I did not fully appreciate that DSLite does not feature the parser, and so during development I was doing things that are not permitted by the parser. @yannick, please can you confirm that DSLite does not feature the parser and hence I cannot use it to test ways of passing through the formulae?

It seems that the only way I can get this to work with the current parser is to make my mixed model formula look like a standard GLM formula by substituting the pipe and brackets for other strings:

# swap the characters the parser rejects for placeholder strings
formula <- "BMI ~ trtGrp + Male+ (1|idSurgery)"
formula <- gsub("|", "xxxxx", formula, fixed = TRUE)
formula <- gsub("(", "yyyyy", formula, fixed = TRUE)
formula <- gsub(")", "zzzzz", formula, fixed = TRUE)
formula_to_send <- as.formula(formula)

and then reverse this on the server side. My concern is that this looks a bit clumsy and is trying to cheat the parser.
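For reference, a sketch of the full round trip, using the same placeholder strings as above. The function names `encode_formula` and `decode_formula` are hypothetical, and this only checks the string handling in plain R; it does not show that the encoded text gets through the Opal parser:

```r
# Placeholders copied from the post; they must match on client and server.
tokens <- c("|" = "xxxxx", "(" = "yyyyy", ")" = "zzzzz")

# Client side: replace each rejected character with its placeholder
encode_formula <- function(formula_string) {
  for (ch in names(tokens)) {
    formula_string <- gsub(ch, tokens[[ch]], formula_string, fixed = TRUE)
  }
  formula_string
}

# Server side: restore the characters, then build the real formula
decode_formula <- function(encoded_string) {
  for (ch in names(tokens)) {
    encoded_string <- gsub(tokens[[ch]], ch, encoded_string, fixed = TRUE)
  }
  stats::as.formula(encoded_string)
}

original <- "BMI ~ trtGrp + Male + (1|idSurgery) + (trtGrp||idDoctor)"
encoded <- encode_formula(original)
decoded <- decode_formula(encoded)
```

Unlike the three-argument approach, this copes with any number of random-effect terms. One weakness is that a variable whose name happens to contain a placeholder string would be corrupted on decoding, so the placeholders need to be chosen so they cannot collide with variable names.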

I don’t think it is feasible to split the formula into 3 arguments (‘fixed.effects’, ‘random.effect.before.pipe’, ‘random.effect.after.pipe’) because I think this constrains us to having one random effect, or at least having to define a specified number of random effects.

The other option might be to change the parser, because at the moment I think it must understand that ‘+’ and ‘~’ are allowed in a formula but not in a string. So then ‘|’, ‘(’ and ‘)’ could be added as permitted in a formula. Is that correct? The disadvantage of this is that I presume it would be tied to an Opal release.

There is a parser in DSLite, but a “light” one: it verifies that the function calls are the allowed ones, but it does not verify the arguments. The parser implementation is not as strict as the datashield4j one because in DSLite the user has access to the individual-level data anyway. On the other hand, I understand your use case as a developer, and we could consider implementing a proper DS parser in DSLite as well.

I don’t think it would be an issue to require a minimum version of Opal for your DS code to work. If you think it is safe to have ‘|’ chars in a formula and that it will make your life easier in the coming years, it can be done easily.

Yannick

Thanks Yannick. I think it would be useful to have the full parser in the future to help with development, but now I know of the difference I can work on the structure of the information passed between client and server on a proper Opal instance.

As for adding characters to formulae, I would want to make sure that this is specifically applied to formulae only and not just strings. Also, I believe that ‘|’ does act as a logical operator if we have a formula like X ~ A + (B|C). So, we might want the parser to only accept patterns like (1+X|Y), (1|X), (X||Y), (0+X|Y). Would that be possible?
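To make the suggestion concrete, here is a sketch in R of the kind of whitelist I have in mind. The real grammar lives in datashield.jjt, not in R, so this only illustrates the rule; the function name `random_terms_ok` and the exact regular expression are my own assumptions:

```r
# Hypothetical whitelist check; illustrates the rule, not the actual parser.
name <- "[A-Za-z.][A-Za-z0-9._]*"
# Left of the pipe: 0, 1 or a variable, optionally followed by "+variable" terms
lhs <- paste0("(?:[01]|", name, ")(?:\\+", name, ")*")
# Whole term: "(" lhs, one or two pipes, a grouping variable, ")"
pattern <- paste0("^\\(", lhs, "\\|\\|?", name, "\\)$")

random_terms_ok <- function(formula_string) {
  compact <- gsub("\\s+", "", formula_string)
  # Pull out every innermost parenthesised term
  terms <- regmatches(compact, gregexpr("\\([^()]*\\)", compact))[[1]]
  piped <- terms[grepl("|", terms, fixed = TRUE)]
  # No pipe may remain outside those parenthesised terms
  remaining <- gsub("\\([^()]*\\)", "", compact)
  no_stray_pipe <- !grepl("|", remaining, fixed = TRUE)
  no_stray_pipe && all(grepl(pattern, piped, perl = TRUE))
}

random_terms_ok("BMI ~ trtGrp + Male + (1 + trtGrp | idDoctor)")
random_terms_ok("y ~ (x > 3 | z)")
```

The first call passes and the second is rejected because of the comparison inside the brackets. A bare pipe outside brackets, as in y ~ a | b, is rejected as well.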

Hi all

Thanks for this useful set of messages around this topic. I’m sorry I’ve not joined in earlier, but I’ve been off sick with a respiratory infection and have almost totally lost my voice and so can’t discuss anything by phone :o(

I hadn’t fully understood the parser situation with DSLite, but now that I do, I understand Tom’s earlier questions. Once I am better and back in action, it would be good to use one of our various web meetings to ensure we all understand the nature of the interface with the parser in all of the evolving flavours of DataSHIELD.

Cheers

Paul