R is a free and widely used programming language for statistical computation and graphics. ProActive PARConnector provides an API which makes it possible to write distributed R applications which execute over networks of machines.

In this tutorial, we will use the ProActive PARConnector API to write a simple R application which executes on four different machines (a.k.a. ProActive Nodes) at the same time. We will use RGui as the software development environment.

1 Install and configure R environment


This tutorial was designed for and is tested with Ubuntu 16.04.

  1. Install and configure the R software environment:

    Add the R repository.
    $ echo "deb http://cran.rstudio.com/bin/linux/ubuntu xenial/" | sudo tee -a /etc/apt/sources.list
    Add R repository keys.
    $ gpg --keyserver keyserver.ubuntu.com --recv-key 51716619E084DAB9
    $ gpg -a --export 51716619E084DAB9 | sudo apt-key add -
    Update packages.
    $ sudo apt-get update
    Install R and rJava.
    $ sudo apt-get install r-base r-cran-rjava
  2. Install Java JDK 8:

    $ sudo apt-get install openjdk-8-jdk
  3. Download the PARConnector:

    $ wget https://s3.amazonaws.com/par-connector-tutorial/par-connector-tutorial-R-x86_64-pc-linux-gnu.tar.gz
  4. Start your R environment:

    $ R
  5. Install the following additional R packages:

    • stringr
    • codetools
    • gtools

    For example, you can type the following command in the R console to install them.

    > install.packages(c('gtools', 'codetools', 'stringr'), Sys.getenv('R_LIBS_USER'), repos='http://cran.case.edu')

    Attention:

    Check the command output to verify that all packages were installed successfully. If an installation fails, searching online for the exact error message is usually the quickest way to find a solution.

  6. Install the PARConnector package:

    For example, you can type the following R command in the R console to install it.

    > install.packages('<PATH-TO-par-connector-tutorial-R-x86_64-pc-linux-gnu.tar.gz>', Sys.getenv('R_LIBS_USER'), repos = NULL)

  1. Install Docker:

    Install curl.
    $ sudo apt-get install curl
    Add the GPG key for the official Docker repository to your system.
    $ curl -fsSL https://download.docker.com/linux/ubuntu/gpg | sudo apt-key add -
    Add the Docker repository to APT sources.
    $ sudo add-apt-repository "deb [arch=amd64] https://download.docker.com/linux/ubuntu $(lsb_release -cs) stable"
    $ sudo apt-get update
    Install Docker.
    $ sudo apt-get install -y docker-ce
    Check that Docker is installed correctly.
    $ sudo docker -v
  2. Start the ActiveEon par-connector-tutorial Docker container:

    $ sudo docker run -ti activeeon/par-connector-tutorial
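
Once the installation steps above are complete, a quick check from the R console can confirm that the packages are visible to R. The following is a minimal sketch using base R only; the helper name check_packages is hypothetical and is not part of PARConnector.

```r
# Hypothetical helper (not part of PARConnector): reports, for each package
# name, whether an installed copy can be found in the current library paths.
# system.file() returns "" when the package is not installed.
check_packages <- function(pkgs) {
  vapply(pkgs, function(p) nzchar(system.file(package = p)), logical(1))
}

# Packages this tutorial relies on.
required <- c("rJava", "gtools", "codetools", "stringr", "PARConnector")
status <- check_packages(required)
print(status)

# List the packages that still need to be installed, if any.
not_found <- names(status)[!status]
if (length(not_found) > 0) {
  message("Not installed: ", paste(not_found, collapse = ", "))
}
```

This check only looks for the installed files and does not load the packages, so it does not start Java or connect to anything.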

2 Hello World job


Connection to the Scheduler

  1. Start the R environment.

  2. Type the following commands to load PARConnector and connect to the scheduler.
    Replace 'login' and 'pwd' with the login and password you received when subscribing to the tutorial.

> library("PARConnector");

Loading required package: rJava
Loading required package: gtools
Loading required package: codetools
Loading required package: stringr

> PAConnect(url='https://tryqa.activeeon.com/rest', login='login', pwd='pwd', insecure=TRUE);

Connected to Scheduler at  https://tryqa.activeeon.com/rest 
[1] "Java-Object{org.ow2.proactive.scheduler.rest.SchedulerClient@108dacb}"


Hello World n°1 : one parameter, one execution

In this first example, we'll remotely execute a simple Hello World function on a single machine.

We define the function hellow1, which prints Hello followed by the function argument:

> hellow1 <- function(x) print(paste('Hello',x))
                                        
We submit this function to the Scheduler using the function PASolve. PASolve returns an object which we store in a variable job. If displayed, this object describes the status of the job.

> job <- PASolve( hellow1, 'World')

Job submitted (id : 2725)
 with tasks : t1

> job
PARJob1 (id: 2725)  (status: Running)
t1 : Running at pacagrid.cloudapp.net (SSH-slice1-2) (0%)

                                        
We wait for the job completion by calling the function PAWaitFor. This function returns the result of the job and prints the remote output.
> val <- PAWaitFor(job)

t1 : 
[1630000@tryqa.activeeon.com;13:37:12] [1] 
[1630000@tryqa.activeeon.com;13:37:12]  "Hello World"

> val
$t1
[1] "Hello World"
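
The value stored under t1 is the return value of hellow1, not its printed output. This can be checked locally, without the Scheduler, because the base R function print returns its argument invisibly. The sketch below runs in a plain R session; mock_val is a hand-built list that only imitates the shape of the result shown above.

```r
hellow1 <- function(x) print(paste('Hello', x))

# Calling the function locally prints the string and also returns it,
# since print() returns its argument invisibly.
local_val <- hellow1('World')

# Hand-built list imitating the shape of the result displayed above
# (an assumption for illustration, not an object produced by PAWaitFor).
mock_val <- list(t1 = local_val)
mock_val$t1
```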

Hello World n°2 : one parameter, multiple executions

In this second example, we'll remotely execute the hellow1 function across several machines.

To do so, we pass a list parameter as shown below. The syntax of PASolve is similar to that of the R function mapply: it produces as many executions as there are elements in its list parameters:
> res <- PASolve( hellow1, list('World1','World2','World3'))
Job submitted (id : 2726)
 with tasks : t1, t2, t3

> val <- PAWaitFor(res)
t1 : 
[1640000@tryqa.activeeon.com;13:40:52] [1] 
[1640000@tryqa.activeeon.com;13:40:52]  "Hello World1" 

t2 : 
[1640001@tryqa.activeeon.com;13:40:52] [1] 
[1640001@tryqa.activeeon.com;13:40:52]  "Hello World2" 

t3 : 
[1640002@tryqa.activeeon.com;13:40:52] [1] 
[1640002@tryqa.activeeon.com;13:40:52]  "Hello World3"


Explanation:

In this second example, instead of a single string parameter, we pass a list of three strings.
PASolve interprets this list as multiple evaluations, just like mapply does.

It will evaluate in the cloud the following calls:

print(paste('Hello', 'World1'))
print(paste('Hello', 'World2'))
print(paste('Hello', 'World3'))
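
For comparison, the same three calls can be produced locally with mapply. This is a sketch of the base R analogue only; it runs sequentially in the current session and does not involve the Scheduler.

```r
hellow1 <- function(x) print(paste('Hello', x))

# mapply calls hellow1 once per list element.
# SIMPLIFY = FALSE keeps the results as a list, one entry per call.
local_res <- mapply(hellow1, list('World1', 'World2', 'World3'), SIMPLIFY = FALSE)

length(local_res)   # one result per element of the input list
```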

Hello World n°3 : Multiple parameters, one execution

In this third example, we'll remotely execute a new Hello World function with two parameters on a single machine.

We define the function hellow3, which prints its two arguments:
> hellow3 <- function(x,y) print(paste(x,y))
                                        
As the function takes two parameters instead of one, the corresponding PASolve call contains one additional parameter:
> job <- PASolve( hellow3, 'Hello', 'World')
Job submitted (id : 2802)
 with tasks : t1
> val <- PAWaitFor(job)
t1 : 
[1650000@tryqa.activeeon.com;13:43:38] [1] 
[1650000@tryqa.activeeon.com;13:43:38]  "Hello World"
                                        
The execution scheme is similar to example 1, just with two parameters instead of one.

Hello World n°4 : Multiple parameters, multiple executions

In this last example, we'll remotely execute the hellow3 function across several machines.

To do so, we pass list parameters as shown below:
> res <- PASolve( hellow3, list('Hello1', 'Hello2', 'Hello3'), list('World1','World2','World3'))
Job submitted (id : 2727)
 with tasks : t1, t2, t3

> val <- PAWaitFor(res)
t1 : 
[1660000@tryqa.activeeon.com;13:45:21] [1] 
[1660000@tryqa.activeeon.com;13:45:21]  "Hello1 World1" 

t2 : 
[1660001@tryqa.activeeon.com;13:45:20] [1] 
[1660001@tryqa.activeeon.com;13:45:20]  "Hello2 World2" 

t3 : 
[1660002@tryqa.activeeon.com;13:45:21] [1] 
[1660002@tryqa.activeeon.com;13:45:21]  "Hello3 World3"
                                        

Explanation:

PASolve matches each element of the first list with the corresponding element of the second list.

Doing so, it will evaluate in the cloud the following calls:

print(paste('Hello1', 'World1'))
print(paste('Hello2', 'World2'))
print(paste('Hello3', 'World3'))
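
The element-by-element pairing can be reproduced locally with mapply and two lists. The sketch below is a base R analogue, not PARConnector code; the variable names firsts and seconds are illustrative, and the t1, t2, t3 names are assigned by hand to mirror the task names in the output above.

```r
hellow3 <- function(x, y) print(paste(x, y))

firsts  <- list('Hello1', 'Hello2', 'Hello3')
seconds <- list('World1', 'World2', 'World3')

# mapply pairs the i-th element of each list and calls hellow3 on the pair.
local_res <- mapply(hellow3, firsts, seconds, SIMPLIFY = FALSE)

# Name the results by hand after the task names used in this tutorial.
names(local_res) <- paste0("t", seq_along(local_res))
local_res$t2
```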

3 Job With File Transfer


In this chapter, we will demonstrate how to submit simple jobs that use input and output files.

Single file copy

In this example, we will show a file transfer to a single machine.
We define a function mycopy which copies one file to another. It uses its parameter to determine the file names:

> mycopy <- function(i) file.copy(paste0("in",i,".txt"), paste0("out",i,".txt"))
                                        
Check the current directory of your R session by using the command getwd:
> getwd()
[1] "H:/Users/demo/Documents"
                                        
Create a file in1.txt in this directory. You can put some text content in this file if you want.
Submit the mycopy function with 1 as its parameter. Input and output files are described using the additional parameters input.files and output.files:
> job <- PASolve( mycopy, 1, input.files="in1.txt", output.files="out1.txt")
Job submitted (id : 2805)
 with tasks : t1
> val <- PAWaitFor(job)
> val
$t1
[1] TRUE
                                        
  • At the end of the job, the file out1.txt will be present in the folder H:/Users/demo/Documents next to the in1.txt file.
  • No remote execution log is displayed, as the mycopy function does not print anything.
  • The result of the task t1 is simply the return value of the file.copy function, which is TRUE if the copy is successful.
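
Before submitting, mycopy can be tried locally to confirm that it behaves as expected. The sketch below uses base R only and works in a scratch directory (an assumption made here so that no file in your working directory is touched); it creates the input file from R instead of by hand.

```r
mycopy <- function(i) file.copy(paste0("in", i, ".txt"), paste0("out", i, ".txt"))

# Work in a fresh scratch directory and remember the previous one.
scratch <- tempfile("mycopy-demo-")
dir.create(scratch)
old_wd <- setwd(scratch)

# Create the input file with some text content.
writeLines("some text content", "in1.txt")

# file.copy returns TRUE on success. By default it does not overwrite
# an existing destination file and returns FALSE in that case.
copied <- mycopy(1)

# Compare the contents of the source and the copy.
same <- identical(readLines("in1.txt"), readLines("out1.txt"))

# Restore the original working directory.
setwd(old_wd)
```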

Multiple file copies

In this example, we will show a file transfer to multiple machines.

Remove the out1.txt file produced by the previous example and create two additional files, in2.txt and in3.txt, in the working directory.
Submit the mycopy function with the vector 1:3 as its parameter.
As there are multiple input files, it is no longer possible to describe the input/output files with a single name. We will use wildcard patterns:
> job <- PASolve( mycopy, 1:3, input.files="in%1%.txt", output.files="out%1%.txt")
Job submitted (id : 2804)
 with tasks : t1, t2, t3
> val <- PAWaitFor(job)
> val
$t1
[1] TRUE

$t2
[1] TRUE

$t3
[1] TRUE

                                        
  • The input.files and output.files parameters contain a special pattern %1% which is replaced by the elements of the vector 1:3.
    In this pattern, 1 stands for the first parameter. If the mycopy function had a second parameter, we could have used %2%, and so on.
  • At the end of the job, the files out1.txt, out2.txt and out3.txt will be present in the folder H:/Users/demo/Documents next to the in1.txt, in2.txt and in3.txt files.
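
To see which file names the %1% pattern stands for in this example, the substitution can be imitated locally. The helper expand_pattern below is hypothetical and only illustrates the outcome described above; it is not how PARConnector implements the replacement.

```r
# Hypothetical helper (not PARConnector code): replaces the literal text
# "%1%" in a pattern by each value of the first parameter.
expand_pattern <- function(pattern, values) {
  vapply(values,
         function(v) gsub("%1%", as.character(v), pattern, fixed = TRUE),
         character(1), USE.NAMES = FALSE)
}

inputs  <- expand_pattern("in%1%.txt", 1:3)
outputs <- expand_pattern("out%1%.txt", 1:3)

# One input/output pair per task.
print(data.frame(task = paste0("t", 1:3), input = inputs, output = outputs))
```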