How to use ds.dataFrameSubset to only remove NAs, not compare columns

alexwesterberg · 8 July 2020 14:50

Hi all,

I am working on the beginners tutorial material and trying to show how ds.dataFrameSubset can be used to create a serverside object (dataframe) that is the same except for all rows with missing values (NAs) are removed.

Reading the function help, I expected that the argument “keep.NAs” would be all i would need to specify, thus writing the command:

ds.dataFrameSubset(df.name = “D”, keep.NAs = “FALSE”, newobj = “D_without_NA”, datasources = connections)

However this complained, giving the error

Error: Please provide the name of the column or scalar that holds V1 as a character string: eg ‘xxx’ or ‘3’

so I added V1.name = “D”,

and then got the error,

Error: Please provide the name of the column or scalar that holds V2 as a character string: eg ‘xxx’ or ‘3’

The way this is going it is going to ask me what column to compare against what other column, and what boolean operator to use- which is entirely not what I want. I JUST want it to hunt for rows with NA and remove them…

Any tips?

Cheers, Alex

paul.genepi · 8 July 2020 15:44

Hi Alex.

I can explain what is wrong - keep.NAs does something rather different. You can’t do what you want to do with a single call to ds.dataFrameSubset however the function ds.completeCases does precisely that in one step (but I know you are wanting to describe ds.dataFrameSubset. However, the real reason you are confused is that the help for ds.dataFrameSubset seems to be for an earlier version and the information for keep.NAs has lost the explanation of what it really does. I’m not sure what has happened here - I’ll see if stuart can shed any light.

I’m going to send you a separate email containing the full help for the current version of ds.dataFrameSubset.

Let’s discuss this at tech meeting tomorrow morning

Cheers paul

paul.genepi · 8 July 2020 15:47

Hi Alex

Here is the full text for ds.dataFrameSubset including its full header. I think this should help.

Cheers

Paul

#’ @title subsetting a serverside data.frame #’ @description Subsets a data frame by: row (select rows defined on the basis of #’ the values of one of the columns in the data.frame or of a separate variable); #’ column (specify numbered columns to keep or drop); or both. #’ @details ds.dataFrameSubset calls a serverside aggregate function {dataFrameSubsetDS1} #’ and then a serverside assign function {dataFrameSubsetDS2}. A data.frame is a #’ list of variables all with the same number of rows, #’ which is of class ‘data.frame’. ds.dataFrameSubset will subset a #’ pre-existing data.frame by specifying the values of a subsetting variable #’ (subsetting by row) or by selecting columns to keep or remove (subsetting #’ by column). When subsetting by row, the resultant subset must strictly be #’ as large or larger than the disclosure trap value nfilter.subset. If you #’ wish to keep all rows in the subset (e.g. if the primary plan is to subset by column #’ not by row) then V1.name can be used to specify a vector of the same length #’ as the data.frame to be subsetted in each study which consists entirely of 1s #’ and there are no NAs. Then if V2 encodes the value 1 and the #’ Boolean.operator is ‘==’ then ALL rows will be kept in the generated subset. #’ To achieve this you can easily generate your own vector of 1s and declare that #’ vector as <V1.name>, then specify <V2.name> =‘1’ and Boolean.operator = ‘==’. But, #’ as a simpler alternative if either <keep.cols> or <rm.cols> is non-null #’ i.e. you are deliberately subsetting by column, if you declare <V1.name>, <V2.name> #’ and <Boolean.operator> all to be NULL, then the dataFrameSubset function will #’ create an internal vector of 1s called ‘ONES’ which is equal in length to #’ to the number of rows in the data.frame in each source, will set both #’ <V1.name> and <V2.name> to ‘ONES’, and <Boolean.operator> to ‘==’. This then has #’ precisely the same effect. #’ @param df.name a character string providing the name for the data.frame #’ to be sorted. #’ @param V1.name A character string specifying the name of a subsetting vector #’ to which a Boolean operator will be applied to define the subset to be created. #’ @param V2.name A character string specifying the name of the vector #’ or scalar to which the values in the vector specified by the argument <V1.name> #’ is to be compared. So, for example, if <V2.name> #’ is a scalar (e.g. ‘4’) #’ and the <Boolean.operator> argument is ‘<=’, the subset data.frame that is created #’ will include all rows that correspond to a value of 4 or less in the subsetting #’ vector specified by the <V1.name> argument. If <V2.name> specifies a vector #’ (which must be of strictly the same length as the vector specified by <V1.name>) #’ and the <Boolean.operator> argument is ‘==’, the subset data.frame that is #’ created will include #’ all rows in which the values in the vectors specified by <V1.name> and <V2.name> #’ are equal. #’ @param Boolean.operator A character string specifying one of six possible Boolean operators: #’ ‘==’, ‘!=’, ‘>’, ‘>=’, ‘<’, ‘<=’ #’ @param keep.cols a numeric vector specifying the numbers of the columns to be kept in the #’ final subset when subsetting by column. For example: keep.cols=c(2:5,7,12) will keep #’ columns 2,3,4,5,7 and 12. #’ @param rm.cols a numeric vector specifying the numbers of the columns to be removed before #’ creating the final subset when subsetting by column. For example: rm.cols=c(2:5,7,12) #’ will remove columns 2,3,4,5,7 and 12. #’ @param keep.NAs logical, if TRUE any NAs in the internally generated Boolean vector #’ that indicates whether each row is kept or dropped (1 or TRUE = keep, 0 or FALSE = drop) #’ will be converted into 1 or TRUE and so these rows will be included in the subset. #’ Such NAs could be caused by NAs in the vectors identified by <V1.name> or <V2.name>. #’ If <keep.NAs> is FALSE or NULL, NAs in the final Boolean vector will #’ be converted to 0s and the corresponding row will therefore be dropped from the subset. #’ Default is NULL, so NAs are dropped. #’ @param newobj This a character string providing a name for the subset #’ data.frame representing the primary output of the ds.dataFrameSubset() function. #’ If no name is specified, newobj will be given the default name ‘dfsubset’ #’ @param datasources specifies the particular ‘connection object(s)’ to use. #’ e.g. if you have several data sets in the sources you are working with #’ called opals.a, opals.w2, and connection.xyz, you can choose which of #’ these to work with. The call ‘datashield.connections_find()’ lists all of #’ the different datasets available and if one of these is called ‘default.connections’ #’ that will be the dataset used by default if no other dataset is specified. If you #’ wish to change the connections you wish to use by default the call #’ datashield.connections_default(‘opals.a’) will set ‘default.connections’ #’ to be ‘opals.a’ and so in the absence of specific instructions to the contrary #’ (e.g. by specifiying a particular dataset to be used via the #’ argument) all subsequent function calls will be to the datasets held in opals.a. #’ If the argument is specified, it should be set without #’ inverted commas: e.g. datasources=opals.a or datasources=default.connections. #’ The argument also allows you to apply a function solely to a subset #’ of the studies/sources you are working with. For example, the second source #’ in a set of three, can be specified using a call such as datasources=connection.xyz[2]. #’ On the other hand, if you wish to specify solely the first and third sources, the #’ appropriate call will be datasources=connections.xyz[c(1,3)] #’ @return the object specified by the argument (or default name ‘dfsubset’). #’ which is written to the serverside. In addition, two validity messages are returned #’ indicating whether has been created in each data source and if so whether #’ it is in a valid form. If its form is not valid in at least one study - e.g. because #’ a disclosure trap was tripped and creation of the full output object was blocked - #’ ds.dataFrameSubset() returns helpful error messages which are generated by #’ the first call to the serverside aggregate function dataFrameSubsetDS1. #’ @author Paul Burton #’ @export

ds.dataFrameSubset<-function(df.name=NULL, V1.name=NULL, V2.name=NULL, Boolean.operator=NULL, keep.cols=NULL, rm.cols=NULL, keep.NAs=NULL, newobj=NULL, datasources=NULL){

if no opal login details are provided look for ‘opal’ objects in the environment

if(is.null(datasources)){ datasources <- datasources <- datashield.connections_find() }

check if user has provided the name of the data.frame to be subsetted

if(is.null(df.name)){ stop(“Please provide the name of the data.frame to be subsetted as a character string: eg ‘xxx’”, call.=FALSE) }

check if user has provided the name of the column or scalar that holds V1. If not

have they instead provided a non.null keep.cols or non.null rm.cols argument

in either of the latter cases then one needs to select all rows so

specify V1.name = “ONES” and this will specify a vector of 1s on the serverside

of length equal to dim[1] of the data.frame

If there is already a useful variable called “ONES” on the serverside, then that

will be left untouched but if no such object exists an eval(parse) activation

on the serverside will fail so the object called “ONES” [specifed as above] is automatically

created on the serverside

if(is.null(keep.cols)&&is.null(rm.cols)&&is.null(V1.name)) { return(“Please provide the name of the column or scalar that holds V1 as a character string: eg ‘xxx’ or ‘3’”, call.=FALSE) }

if((!is.null(keep.cols)||!is.null(rm.cols))&&is.null(V1.name)) { V1.name<-“ONES” }

check if user has provided the name of the column or scalar that holds V2. If not

have they instead provided a non.null keep.cols or non.null rm.cols argument

in either of the latter cases then one needs to select all rows so

specify V2.name = “ONES” and this will specify a vector of 1s on the serverside

of length equal to dim[1] of the data.frame

if(is.null(keep.cols)&&is.null(rm.cols)&&is.null(V2.name)) { return(“Please provide the name of the column or scalar that holds V2 as a character string: eg ‘xxx’ or ‘3’”, call.=FALSE) }

if((!is.null(keep.cols)||!is.null(rm.cols))&&is.null(V2.name)) { V2.name<-“ONES” }

if(is.null(keep.cols)&&is.null(rm.cols)&&is.null(Boolean.operator)) { message1<-“Unless you are only subsetting columns, please provide a” message2<-“Boolean operator in character format: eg ‘==’ or ‘>=’ or ‘<’ or ‘!=’.” message3<-“However, if either keep.cols or rm.cols is non-null because you want” message4<-“to subset columns and you specify both V1.name and V2.name as NULL (or ‘ONES’)” message5<-“and Boolean.operator as NULL,ds.dataFrameSubset will subset out” message6<-“the specified columns but will keep all rows.”

error.message<-paste(message1,message2,message3,message4,message5,message6) return(error.message) }

if((!is.null(keep.cols)||!is.null(rm.cols))&&is.null(Boolean.operator)) { Boolean.operator<-"==" }

#if keep.NAs is set as NULL convert to FALSE as otherwise the call to datashield.assign will fail if(is.null(keep.NAs)){ keep.NAs<-FALSE }

#convert Boolean operator to numeric

BO.n<-0 if(Boolean.operator == “==”){ BO.n<-1 }

if(Boolean.operator == “!=”){ BO.n<-2 }

if(Boolean.operator == “<”){ BO.n<-3 }

if(Boolean.operator == “<=”){ BO.n<-4 }

if(Boolean.operator == “>”){ BO.n<-5 }

if(Boolean.operator == “>=”){ BO.n<-6 }

if no value spcified for output object, then specify a default

if(is.null(newobj)){ newobj <- “dfsubset” }

if(!is.null(keep.cols)){ keep.cols<-paste0(as.character(keep.cols),collapse=",") }

if(!is.null(rm.cols)){ rm.cols<-paste0(as.character(rm.cols),collapse=",") }

calltext1 <- call(“dataFrameSubsetDS1”, df.name, V1.name, V2.name, BO.n, keep.cols, rm.cols, keep.NAs=keep.NAs) return.warning.message<-datashield.aggregate(datasources, calltext1)

calltext2 <- call(“dataFrameSubsetDS2”, df.name, V1.name, V2.name, BO.n, keep.cols, rm.cols, keep.NAs=keep.NAs) datashield.assign(datasources, newobj, calltext2)

numsources<-length(datasources) for(s in 1:numsources){ num.messages<-length(return.warning.message[[s]]) if(num.messages==1){ cat("\nSource",s,"\n",return.warning.message[[s]][[1]],"\n\n") }else{ cat("\nSource",s,"\n") for(m in 1:(num.messages-1)){ cat(return.warning.message[[s]][[m]],"\n") } cat(return.warning.message[[s]][[num.messages]],"\n\n") } }

Topic		Replies	Views
Function "ds.dataFrameSubset" breaks when using DSLite Beginner Support	11	423	21 March 2022
WARNING: useNA='ifany' argument in ds.table function might create incorrect results Analyst Support	0	490	30 September 2021
An error when accessing the table using a long vector of column names (ds.subset, ds.rbind and ds.merge) Analyst Support	2	568	4 February 2020
Operating on date class variables Beginner Support	2	523	28 July 2020
Fill in missing values of variable with values of second variable: ds.replaceNA? Beginner Support	11	424	12 December 2022

How to use ds.dataFrameSubset to only remove NAs, not compare columns

if no opal login details are provided look for ‘opal’ objects in the environment

check if user has provided the name of the data.frame to be subsetted

check if user has provided the name of the column or scalar that holds V1. If not

have they instead provided a non.null keep.cols or non.null rm.cols argument

in either of the latter cases then one needs to select all rows so

specify V1.name = “ONES” and this will specify a vector of 1s on the serverside

of length equal to dim[1] of the data.frame

If there is already a useful variable called “ONES” on the serverside, then that

will be left untouched but if no such object exists an eval(parse) activation

on the serverside will fail so the object called “ONES” [specifed as above] is automatically

created on the serverside

check if user has provided the name of the column or scalar that holds V2. If not

have they instead provided a non.null keep.cols or non.null rm.cols argument

in either of the latter cases then one needs to select all rows so

specify V2.name = “ONES” and this will specify a vector of 1s on the serverside

of length equal to dim[1] of the data.frame

if no value spcified for output object, then specify a default

Related topics