1. Grid segmentation

Hardware segmentation

Among IGRIDA resources, identical nodes are “virtually” grouped together into clusters. This grouping, implemented through an OAR property called cluster, provides a convenient way to ensure that your jobs always run on the same architecture. Note that some IGRIDA nodes do not belong to any cluster.

Each “cluster” is thus made up of servers sharing the same hardware architecture (CPU, memory, network, etc.). Details about this global hardware segmentation are given in the following table:

Cluster name | # nodes | CPU reference | RAM | Disk | Network
Calda | 5 | 2 x 8 cores, Sandy Bridge, Intel(R) Xeon(R) CPU E5-2450 0 @ 2.10GHz | 48GB | 2 x 600GB | Infiniband + 1 Gb/s
Lambda | 11 | 2 x 6 cores, Westmere-EP, Intel(R) Xeon(R) CPU E5645 @ 2.40GHz | 48GB | 2 x 600GB | Infiniband + 1 Gb/s
Flagada | 10 | 2 x 2 cores, Intel(R) Xeon(R) CPU 5140 @ 2.33GHz | 4GB | 50GB | 1 Gb/s
Mida | 3 | 2 x 8 cores, Sandy Bridge-EP, Intel(R) Xeon(R) CPU E5-2660 0 @ 2.20GHz | 64GB | 2 x 300GB | 1 Gb/s
Manda | 4 | 2 x 8 cores, Sandy Bridge-EP, Intel(R) Xeon(R) CPU E5-2660 0 @ 2.20GHz | 128GB | 2 x 300GB | 1 Gb/s
Panda | 10 | 2 x 4 cores, Clovertown, Intel(R) Xeon(R) CPU L5310 @ 1.60GHz | 8GB | 1 x 73GB | 1 Gb/s
Gouda | 8 | 2 x 4 cores, Clovertown, Intel(R) Xeon(R) CPU E5345 @ 2.33GHz | 8GB | 1 x 73GB SAS 15k | 1 Gb/s
Dalida | 20 | 2 x 4 cores, Clovertown, Intel(R) Xeon(R) CPU E5345 @ 2.33GHz | 8GB | 1 x 73GB SAS 10k | 1 Gb/s
Bermuda | 48 | 2 x 4 cores, Gulftown, Intel(R) Xeon(R) CPU E5640 @ 2.67GHz | 48GB | 2 x 300GB SAS 10k | 1 Gb/s
Neurinfo1 | 12 | 2 x 20 cores (HyperThreading), Intel(R) Xeon(R) CPU E5-2660 v3 @ 2.60GHz | 128GB | 2 x 300GB SAS 10k | 1 Gb/s

Note that the HyperThreading feature is activated only on the nodes indicated in the table above (currently the Neurinfo1 nodes).

You can easily access any “cluster” in interactive mode with the following syntax:

my_login@igrida-oar-frontend:~$ oarsub -I -p "cluster='flagada'"

The same syntax may be used in job scripts:

my_login@igrida-oar-frontend:~$ head job.sh
   #!/bin/bash
   #OAR -p cluster = 'flagada'
   #OAR -n test_cluster_restriction
   (...)
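
Properties can also be combined in a single request. For instance, to accept nodes from either of two clusters, you can use the same OR syntax as in the besteffort section further below (a hedged example, any cluster names could be substituted):

my_login@igrida-oar-frontend:~$ oarsub -I -p "cluster='lambda' OR cluster='mida'"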

Walltime segmentation

To prevent resource starvation, jobs are additionally scheduled on specific nodes depending on their requested walltime. Details about the global walltime segmentation are given in the following table:

Job class | Job duration (walltime) | Nodes list | max_walltime (OAR property, in minutes)
small | < 10 min | nodes | 10
medium | < 4 h | nodes | 240
large | < 12 h | nodes | 720
huge | < 7 days | nodes | 10080

According to the walltime specified in your OAR request, your job will be executed on a node configured to host jobs of that class or larger ones. For example, let’s ask for 3 hours:

oarsub -l walltime=3:0:0 -S ./job.sh
   [ADMISSION RULE] Modify resource description with type constraints
   [ADMISSION RULE] Job walltime greater than 10 minutes, adding property max_walltime >= 180 minutes.
   (...)

Notice that the “admission rules” have automatically added the max_walltime >= 180 property to your job. As a consequence, this job may run on any host configured to host medium, large or huge jobs.
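
If you want to double-check the resources and properties actually attached to a submitted job, oarstat can print the full job description (the job id below is of course hypothetical):

my_login@igrida-oar-frontend:~$ oarstat -f -j 123456 | grep -i properties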

2. Memory limitation per job

Principle

A memory limitation is defined for each job. More precisely, a memory slot is dedicated to each core, pro rata with the node memory (after reserving 512 MB for the system).

Note

How is the amount of memory per core defined? Let’s take an example. On igrida09-01, the total memory is 49554876 KB (output of “free -o”). After reserving at least 512 MB for the system, the remaining node memory is (49554876 - 512*1024)/1024 = 47881.43 ≈ 47881 MB. This corresponds to the mem_node OAR property you can see in the Monika console. The corresponding mem_core value is 47881 MB / 8 cores ≈ 5985 MB (see Monika again).
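
As a quick sanity check, you can reproduce this computation directly on a node; a minimal sketch, assuming an 8-core node (adapt the core count, and note that the output format of free may differ slightly between versions):

   # total memory in KB, minus 512 MB for the system, converted to MB and divided by 8 cores
   MEM_KB=$(free -k | awk '/^Mem:/ {print $2}')
   echo $(( (MEM_KB - 512*1024) / 1024 / 8 ))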

If your job gets killed because it exceeds its memory slot, try launching it again with additional cores on the same node. For example, you may use the following syntax to ask for 4 cores on the same node:

oarsub -l /nodes=1/core=4 ...

Example: my job needs 2GB

Let’s assume that your job needs at most 2GB of RAM. To achieve this requirement, you may proceed in the following manner:

  • on nodes with less than 1GB/core, request 3 cores (this practically ensures that more than 2GB will be reserved for the job)
  • on nodes with 1-2GB/core, request 2 cores
  • on nodes with more than 2GB/core, request 1 single core

A possible syntax to implement this in your OAR script is given here:

cat my_job.sh

#!/bin/sh
#-----------------------------------------------------------------
#OAR -t besteffort
#-----------------------------------------------------------------
# My job needs 2GB of memory (i.e. 2*1024Mo)
#OAR -l {mem_core < 1024}/nodes=1/core=3,walltime=00:00:30
# The following line is commented out because this kind of resource does not currently exist in IGRIDA
###OAR -l {mem_core > 1024 AND mem_core < 2048}/nodes=1/core=2,walltime=00:00:30
#OAR -l {mem_core > 2048}/nodes=1/core=1,walltime=00:00:30
#-----------------------------------------------------------------
#OAR -n memory_select_test
#-----------------------------------------------------------------
#OAR -O /temp_dd/igrida-fs1/my_login/job_mem.%jobid%.output
#OAR -E /temp_dd/igrida-fs1/my_login/job_mem.%jobid%.error
#-----------------------------------------------------------------

etc.

You can adapt this example for your specific needs.

Example: I need a 45GB node

Assume you need to work, in interactive mode (typically for debugging purposes), on a node with more than 45GB of memory. You may use the following syntax:

my_login@igrida-oar-frontend:~$ oarsub -I -l /nodes=1 -p 'mem_node > 45*1024'

Similarly, to request one CPU (i.e. one socket, with several cores depending on the hardware) with more than 3GB of memory, you may use the following syntax:

my_login@igrida-oar-frontend:~$ oarsub -I -l /cpu=1 -p 'mem_cpu > 3*1024'

You can adapt these examples for your specific needs.

3. Best effort computing

An important and distinctive OAR feature is the best effort mode. If you launch your jobs in best effort mode, they may run on the complete grid, but they can be killed at any time. However, they will be automatically rescheduled with the same behaviour if you additionally use the idempotent mode. Here is the command to use these features:

oarsub -t besteffort -t idempotent -S /path/to/your/job

Obviously, using the besteffort mode is a good practice for short enough jobs (typically 1-4h, or a bit more at night or at weekends).
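
The besteffort and idempotent types can also be declared directly in the job script header, as already shown with #OAR -t besteffort in the memory example above. A minimal sketch, in which the job name, resource request, output paths and program are placeholders; your program should ideally be able to resume from a checkpoint when the job is killed and resubmitted:

#!/bin/bash
#OAR -t besteffort
#OAR -t idempotent
#OAR -n my_besteffort_test
#OAR -l /nodes=1/core=1,walltime=2:00:00
#OAR -O /temp_dd/igrida-fs1/my_login/besteffort.%jobid%.output
#OAR -E /temp_dd/igrida-fs1/my_login/besteffort.%jobid%.error
./my_computation    # hypothetical program, restarted from its last checkpoint if any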

Note that some dedicated nodes are available to IGRIDA users only in besteffort mode. This is the case for the so-called “serpico” nodes. To launch besteffort jobs on the whole IGRIDA grid, including those dedicated nodes, you have to add an explicit request:

oarsub -t besteffort -t idempotent -p "dedicated = 'none' or dedicated = 'serpico'" -S /path/to/your/job

For more information, refer to the official OAR documentation on besteffort jobs.

4. Hybrid MPI and OpenMP jobs

In the following OAR script, you will find several use cases for launching hybrid MPI/OpenMP (or MPI/pthreads) jobs. The source code used below is called hello.c.
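
The original hello.c is not reproduced on this page; if you need a starting point, the following minimal MPI/OpenMP “hello world” (a sketch, not the original file) compiles with the mpicc -fopenmp hello.c -o hello line used in the script below:

#include <stdio.h>
#include <mpi.h>
#include <omp.h>

int main(int argc, char **argv)
{
    int rank, size, len;
    char hostname[MPI_MAX_PROCESSOR_NAME];

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);
    MPI_Get_processor_name(hostname, &len);

    /* each MPI process spawns OMP_NUM_THREADS OpenMP threads */
    #pragma omp parallel
    printf("Hello from thread %d/%d of MPI process %d/%d on %s\n",
           omp_get_thread_num(), omp_get_num_threads(), rank, size, hostname);

    MPI_Finalize();
    return 0;
}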

#!/bin/bash
#OAR -n hybrid
#OAR -p cluster='lambda' AND dedicated='none' 
#OAR -l /nodes=2
#OAR -O /temp_dd/igrida-fs1/my_login/igrida.%jobid%.output
#OAR -E /temp_dd/igrida-fs1/my_login/igrida.%jobid%.output

#patch to be aware of "module" inside a job
. /etc/profile.d/modules.sh

set -xv

module show openmpi
module load openmpi
rm -f hello
mpicc -fopenmp hello.c -o hello

# N.B. The following use cases are designed for the 'lambda' cluster (each node has 12 cores)

###############################################################################################
# Example 1 (1 node): we launch 3 MPI processes on this node, and each one launches 4 threads
###############################################################################################

export OMP_NUM_THREADS=4 
mpirun -np 3 ./hello

###############################################################################################
# Example 2 (2 nodes): we launch 1 MPI process per node, and each one launches 6 threads
###############################################################################################

mkdir -p /temp_dd/igrida-fs1/$USER/tmp/   # -p: do not fail if the directory already exists

MPIRUN_HOSTFILE1=/temp_dd/igrida-fs1/$USER/tmp/hostfile1_${OAR_JOBID}
rm -f $MPIRUN_HOSTFILE1
cat $OAR_NODEFILE | uniq > $MPIRUN_HOSTFILE1

cat $OAR_NODEFILE
cat $MPIRUN_HOSTFILE1

mpirun -x OMP_NUM_THREADS=6 --machinefile $MPIRUN_HOSTFILE1 ./hello

###############################################################################################
# Example 3 (2 nodes): we launch 3 MPI processes per node, and each one launches 4 threads
###############################################################################################

NB_ALLOCATED_NODES=`cat $OAR_NODEFILE | uniq | wc -l`
NB_ALLOCATED_CORES=`cat $OAR_NODEFILE|wc -l`

#>>>>>>>>> edit me >>>>>>>>>>
NB_MPIPROCESS_PER_NODE=3
NB_THREADS_PER_MPIPROCESS=4
#<<<<<<<<<<<<<<<<<<<<<<<<<<<<

NB_MPIPROCESS_TOTAL=$(( $NB_MPIPROCESS_PER_NODE * $NB_ALLOCATED_NODES ))

# Build hostfile
MPIRUN_HOSTFILE2=/temp_dd/igrida-fs1/$USER/tmp/hostfile2_${OAR_JOBID}
rm -f $MPIRUN_HOSTFILE2

HOST_LIST=`cat $OAR_NODEFILE | uniq`
for NODE in $HOST_LIST; do
   echo $NODE slots=$NB_MPIPROCESS_PER_NODE max-slots=$NB_MPIPROCESS_PER_NODE >> $MPIRUN_HOSTFILE2
done

cat $MPIRUN_HOSTFILE2

# Consistency check
if [ $(( $NB_MPIPROCESS_TOTAL * $NB_THREADS_PER_MPIPROCESS )) -ne $NB_ALLOCATED_CORES ]; then
   echo "The number of process/task does not appear coherent with the allocated resources"
   echo "Comment this check to proceed..."
   exit 1
fi

# Launch computations
mpirun -x OMP_NUM_THREADS=$NB_THREADS_PER_MPIPROCESS -np $NB_MPIPROCESS_TOTAL --machinefile $MPIRUN_HOSTFILE2 ./hello
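
To submit this script in passive (batch) mode from the front-end, make it executable and use oarsub -S as in the previous sections (job_hybrid.sh is a hypothetical file name for the script above):

oarsub -S ./job_hybrid.sh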

5. Using the OAR API

The OAR API is based on the REST model. Note that the API commands de facto enable job submission from the compute nodes themselves (for example, from inside a passive job).

curl -X POST -H'Accept: application/json' -ki http://igrida-oar-server/oarapi/jobs -d 'resources=core=1&command=/path/to/my/script.sh&name=testAPI'
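
The status of a submitted job can then be retrieved through the same API with a GET request; a hedged example (the job id is hypothetical, and the exact URL and output format may vary with the OAR version):

curl -H'Accept: application/json' -ki http://igrida-oar-server/oarapi/jobs/123456.json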

For more details, please refer to the official API user’s guide.

6. GPU computing

Accessing GPU nodes

You can access special nodes, equipped with NVIDIA GPUs, that allow you to work with CUDA or OpenCL.

First, you need to get access to the IGRIDA front-end, so please launch:

ssh igrida-oar-frontend

To reserve one single core on a GPU node,

oarsub -I -p "gpu = 'YES'"

Note that the default time limit for this connection is 10 minutes (the default walltime for interactive jobs). If you need more time, please add the walltime option to your command (example: 30 minutes):

oarsub -I -p "gpu = 'YES'" -l walltime=0:30:0

To reserve a full node,

oarsub -I -l nodes=1 -p "gpu = 'YES'"

Finally, to reserve an interactive session on the machine named igrida-dizzy, please call:

oarsub -I -p "host = 'igrida-dizzy.irisa.fr' AND gpu='YES'"

Available CUDA versions

The latest NVIDIA driver is installed on all GPU nodes. However, for backward compatibility it is possible to use former CUDA versions. To list the available CUDA versions, use the following command:

module avail

You may use one CUDA version and easily switch to another. For example:

my_login@igrida-dizzy:~$ module load cuda/5.0

my_login@igrida-dizzy:~$ nvcc --version
   nvcc: NVIDIA (R) Cuda compiler driver
   Copyright (c) 2005-2012 NVIDIA Corporation
   Built on Fri_Sep_21_17:28:58_PDT_2012
   Cuda compilation tools, release 5.0, V0.2.1221

my_login@igrida-dizzy:~$ echo $LD_LIBRARY_PATH
   /soft/igrida/cuda/5.0/lib:/soft/igrida/cuda/5.0/lib64:/soft/igrida/cuda/5.0/SDK/lib:/lib64

my_login@igrida-dizzy:~$ module switch cuda/4.0

my_login@igrida-dizzy:~$ nvcc --version
   nvcc: NVIDIA (R) Cuda compiler driver
   Copyright (c) 2005-2011 NVIDIA Corporation
   Built on Thu_May_12_11:09:45_PDT_2011
   Cuda compilation tools, release 4.0, V0.2.1221

my_login@igrida-dizzy:~$ echo $LD_LIBRARY_PATH
   /soft/igrida/cuda/4.0/lib:/soft/igrida/cuda/4.0/lib64:/soft/igrida/cuda/4.0/SDK/C/lib:/lib64

Available GPU Nodes

GPU node | CPU | GPU | RAM
igrida-suanpan | Bi-Xeon quadcore 5400 (2 x 4 cores) | 1 NVIDIA GTX 1080 with 8 GB GRAM + 1 NVIDIA GTX 1070 with 8 GB GRAM | 2 x 4 GB
igrida-quipu | Bi-Xeon quadcore 5400 (2 x 4 cores) | 1 NVIDIA GTX 1080 with 8 GB GRAM + 1 NVIDIA GTX 1070 with 8 GB GRAM | 2 x 4 GB
igrida-dizzy | Intel Core2 Duo CPU E8400 (2 cores) | 2 NVIDIA GTX 580 with 1.5 GB GRAM | 4 GB
igrida-paratonnerre | Intel Core2 Duo CPU E8400 (2 cores) | 2 NVIDIA GTX 580 with 1.5 GB GRAM | 4 GB
igrida-yupana | Intel Xeon CPU E5-2650 (2 x 8 cores) | 1 NVIDIA Quadro K6000 with 12 GB GRAM | 2 x 32 GB
igrida-abacus | Bi-Xeon CPU E5-2650 (2 x 2 x 12 cores) | 1 NVIDIA Tesla K80 with 12 GB GRAM + 1 NVIDIA Tesla M40 with 12 GB GRAM | 2 x 64 GB

7. Transferring data

Use rsync to transfer data from your local PC to the dedicated network storage (and vice versa). The lightweight “arcfour” SSH cipher can be selected to speed up the transfer:

rsync -avz -e "ssh -c arcfour" ~/my_local_data login@igrida-oar-frontend:/temp_dd/igrida-fs1/login
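
Conversely, to fetch results back from the network storage to your local PC (my_results is a hypothetical directory name):

rsync -avz login@igrida-oar-frontend:/temp_dd/igrida-fs1/login/my_results ~/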

9. Frequently Asked Questions


If you don’t find the answer to your question in this FAQ, please submit an issue to the HelpDesk in the servinfo/igrida category. This will also help us improve this FAQ with new items.

10. Worst practices

Do not run on the front-end!

The IGRIDA front-end node (igrida-oar-frontend) is dedicated to job submission (oarsub, oarstat, oardel, etc.), and should never be used for running any program.

You’re on holiday

During the execution of your jobs, you must remain reachable. Your jobs could fail and/or block other users, and we prefer to contact the owner before taking measures such as killing their jobs.

Client/Server connections

Of course, running a job that merely waits for external communications is not recommended. Like other computing resources, the usage of IGRIDA is covered by the IRISA charter, so all activities such as peer-to-peer sharing of illegal content are prohibited.