Our group has written analysis scripts that currently run at five locations within the MIRACUM consortium, and an ongoing difficulty we face is some kind of disconnect between locations. It happens at least weekly on average, and so far we haven't been able to figure out why.
Usually we notice the issue when our scripts stop progressing at a random point for more than 5 minutes, after which all further commands fail. If we stop the run immediately, we get the information that one of the five locations is not reachable.
When we try to log back in, the “Logging into the collaborating servers” step still passes without issue, but if the problematic location is included, “Assigning table data…” does not progress beyond 0%.
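Until the root cause is found, one generic way to fail fast instead of stalling for minutes is to put a deadline around blocking calls. This is a plain Python sketch of the pattern, not a DataSHIELD API (our scripts are R, so this only illustrates the idea; the function name is made up):

```python
# Generic pattern: run a blocking call with a deadline so a hung connection
# surfaces as an error after a fixed time instead of stalling indefinitely.
from concurrent.futures import ThreadPoolExecutor
from concurrent.futures import TimeoutError as FutureTimeout

def call_with_deadline(fn, seconds, *args, **kwargs):
    """Run fn(*args, **kwargs); raise RuntimeError if it exceeds `seconds`."""
    pool = ThreadPoolExecutor(max_workers=1)
    try:
        future = pool.submit(fn, *args, **kwargs)
        try:
            return future.result(timeout=seconds)
        except FutureTimeout:
            raise RuntimeError(
                f"call did not finish within {seconds}s; "
                "a collaborating location may be unreachable")
    finally:
        # don't wait for a possibly hung worker thread
        pool.shutdown(wait=False)
```

With a 300-second deadline around each step, an unreachable location would show up as a clear error rather than a silent stall.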
We found a workaround: restarting the Docker containers at the problematic location. However, contacting the responsible people at the external locations delays the analysis considerably, and so far we haven't been able to find a pattern for when, or at which location, this will happen next.
In another post on this forum (Long runs stop with error: Waiting - #2 by yannick) it was recommended to clean up unnecessary data, as R memory might be limited. I have to admit that we may not be doing this in the best way. However, our biggest dataset is around 3.5 MB, and multiplied by the theoretical maximum number of subsets (around 1050) we would reach roughly 3.6 GB. We upgraded to 64 GB of RAM, of which R can use up to 62 GB, so I'd assume memory shouldn't be our bottleneck. As expected, RAM and CPU measurements didn't indicate high workloads either. Still, our own location occasionally has the same issue, which we currently solve by restarting the Docker containers.
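For what it's worth, that back-of-the-envelope estimate written out (figures taken from the post; a real run would also carry per-subset overhead that this ignores):

```python
# Rough upper bound on the data assigned to the R server, using the figures above.
largest_dataset_mb = 3.5   # size of our biggest dataset
max_subsets = 1050         # theoretical maximum number of subsets
total_gb = largest_dataset_mb * max_subsets / 1024
print(round(total_gb, 1))  # → 3.6, far below the ~62 GB available to R
```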
Have you heard of similar issues before, or do you have an idea of what could be happening in the background?
There are several layers of applications:

1. the Opal server,
2. the Rock server,
3. the R server.
An R session is a child process of (3), and when it dies or freezes you can restart it from Opal's Administration > R page. This is pure R, and the cause of failure is also R-related (it could be memory management, but it could also be a failing R package). From the same administration page it is possible to download the R server log; you may find error messages in it.
The Rock server (2) is very unlikely to fail, as it only forwards R commands from (1) and returns R results from (3).
The Opal server (1) could also fail when extracting data from the database (is the connection to the database stable?). There is some caching that prevents Opal from connecting to the database again and again, so that phase may be bypassed in practice. Then there is the assignment of the extracted dataset into the R server. If any error occurred, you should also check the Opal logs, which can be downloaded from the Administration > Java Virtual Machine page.
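As a small illustration of scanning such a downloaded log for failure markers (the sample lines below are invented; real Opal or R server logs will look different):

```python
# Scan a downloaded server log for common failure markers.
# The log excerpt is made up for illustration only.
import re

sample_log = """\
2024-01-10 10:02:11 INFO  Assigning table data
2024-01-10 10:07:43 ERROR java.net.SocketTimeoutException: Read timed out
"""

marker = re.compile(r"ERROR|Exception|OutOfMemory")
hits = [line for line in sample_log.splitlines() if marker.search(line)]
print("\n".join(hits))
```

A timeout or out-of-memory line around the moment the assignment stalls would already narrow down which layer is failing.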
It is hard to tell where the problem is without knowing the application versions and whether there are error messages in the logs.
Hope this helps, regards
Sorry for the delay, and thank you very much for your answer! It definitely helped us get a better insight into how the system is built up.
I guess the next step would be to see whether an R restart via Administration also solves the issue we are observing; in that case we could narrow the cause down to the R side. As soon as we face the problem again, we'll try that.
We didn't see any obvious error messages in the logs that could explain our problems, but we found that we should definitely update our systems, as we're still running Opal v3.0.2 and DataSHIELD 6.1. Maybe that changes the question a bit: has this also occurred in older versions?
I guess an update will happen after the official DataSHIELD 6.2 release. Maybe that will alleviate our issues, but if the problems continue after the update, I'll come back with more details.