Data.frame vs. tibble

Hi,

Following a discussion we had with @swheater and @PatRyserWelch, I would like to know your DS developer opinion regarding the possibility for Opal to assign tables as tibbles instead of plain-old data.frames. The tibble data structure is a “modern reimagining of the data.frame”, part of the Tidyverse project. It aims to be easier to work with and potentially faster. As an example, the dplyr package is a powerful library for manipulating tibbles (select, group, filter, join, etc.).

Opal has been using the tibble format to push its tables to R for 2 years (since tibble 1.2), except in a Datashield context… for legacy and backward compatibility reasons. Also, one specificity of Opal’s data frames for Datashield is that the participant identifiers are set as the row names of the data frame instead of being a separate column. Tibbles do not allow row names to be set; moreover, because row names must be unique, the legacy approach makes it impossible to have a data frame with multiple rows per participant (which can be problematic when there are, for instance, several measures per individual).
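To make the difference concrete, here is a minimal sketch of the two representations (the column name `id` is hypothetical, not necessarily what Opal uses):

```r
library(tibble)

# Legacy Datashield shape: participant IDs as row names.
# Row names must be unique, so at most one row per participant.
df <- data.frame(age = c(34, 57), row.names = c("P1", "P2"))

# Tibble shape: IDs live in an ordinary column, so repeated
# measures per participant are possible.
tb <- tibble(id = c("P1", "P1", "P2"), age = c(34, 35, 57))
```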

The impact of switching to tibbles will be:

  • the checkClass() function assumes that the class is a single string, but in the case of a tibble the class names returned include “tbl” and “data.frame” (there can be even more). This breaks the current checkClass() implementation and the subsequent class comparisons (the %in% operator should be used in place of the != operator)
  • there will be a new column for the identifiers, which will appear in the output of colnames().
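As an illustration of the first point, a tibble carries a class vector of length three, so a scalar comparison no longer works and %in% is the natural fix:

```r
library(tibble)

tb <- tibble(x = 1:3)
class(tb)
#> [1] "tbl_df"     "tbl"        "data.frame"

# Old-style scalar check: class(tb) != "data.frame" yields a
# length-3 logical vector, not a single TRUE/FALSE.

# Robust check, as suggested above:
"data.frame" %in% class(tb)  # TRUE
```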

Regarding checkClass(), it is not a big deal and I have already fixed it in the DSI branch of dsBase and dsBaseClient.

Regarding a column with the identifiers, it is up to you to decide whether it is a disclosive information, in which case it should be hidden from the client.

I will make a release of Opal next week, with a magic system setting that makes Opal assign tibbles for Datashield. The default behavior will still be to assign a data.frame, but that setting gives you the opportunity to test the tibble option.

The decision to use tibbles or not will have an impact on the other data repositories willing to integrate with the Datashield platform (Molgenis, in the near future).

Yannick

This seems like a good idea to me, particularly around the point of multiple measurements of the same person. @demetris.avraam - is this cropping up as a problem people already have?

@yannick - from an opal maintainer perspective - if we don’t use tibble will it become more difficult to maintain the interface to DS (as you are presumably having to add logic just for DS)?

Would the existing code that currently uses data.frames just work with tibbles (other than the checkClass functions), or would we have to edit everything?

Hi Olly,

Since Opal 2.14 (the latest version) the code that assigns tables to R has been cleaned and simplified: it is now always a tibble that is assigned, then downgraded to a data.frame using the as.data.frame() function, with the ID column values set as the row names. So I would not say there are maintenance issues.

A tibble IS a data.frame, so server-side R code that processes data.frames can process tibbles the same way, except for the ID column vs. row names difference and the checkClass() output.
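For the curious, the downgrade step described above can be sketched roughly like this (not Opal’s actual code, and the `id` column name is an assumption):

```r
library(tibble)

tb <- tibble(id = c("P1", "P2"), age = c(34, 57))

# Downgrade to a plain data.frame, legacy Datashield style:
df <- as.data.frame(tb)
rownames(df) <- df$id
df$id <- NULL   # the IDs now live in the row names only
```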

Yannick

Unless the IDs have been poorly chosen, I don't see why these would add an additional risk of disclosure.

I can't comment on whether this will be a problem for Molgenis - @sidohaakma?

I spoke to Paul about this yesterday and he thinks the features in tibble would have made development of some of the new stuff a bit easier.

@swheater is going to build a VM with it turned on for us to have a play with and feed back to you.

Hi, I don’t really know enough (or have enough of an opinion) to comment, but I just saw this post which I thought was interesting and maybe relevant:

I think the specific question raised is whether there are any performance issues, now or potential future ones, due to whether it’s tibble or data.frame?

Hi all,

For MOLGENIS, as far as I know, it will work as well. For very large datasets we need to work out how we will deal with that, but that is a more general question.

Kind regards,

Sido

Thanks for sharing the link Andrei, that’s very interesting. Opal is making tibbles because Tidyverse provides a convenient ecosystem for reading/writing data in various formats. Regarding performance I recently loaded a 20M rows dataset (1G of data) in a tibble in a very reasonable amount of time. If data.table is more appropriate for your analyses (subsetting, grouping etc.), we can still convert using as.data.table(). One interesting (future) use case to be considered is when having relational data: I am not sure whether data.table is handling this as nicely as dplyr.
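For anyone who wants to try the conversion path mentioned above, a short sketch (column names are illustrative):

```r
library(tibble)
library(data.table)

tb <- tibble(id = rep(c("P1", "P2"), each = 3), value = rnorm(6))

# Convert the assigned tibble to a data.table, then use
# data.table's grouping syntax:
dt <- as.data.table(tb)
dt[, .(mean_value = mean(value)), by = id]
```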