Scheduler
This document contains a set of exercises for learning the basic commands used to run interactive MPI applications on the CrossGrid testbed, from an advanced user's point of view. It is not aimed at users who submit jobs to the grid through the Portal; they will never need any of the commands described here.
Requirements:
A minimal, well-configured LCG-1 testbed is needed for exercises 1 and 2, and an LCG-2 testbed is needed for exercise 3. Each testbed must have at least the modified RB and the UI working properly, together with a properly configured II. In addition, the user needs a valid certificate signed by a CA.
For the three proposed exercises you will need to create a file named file.jdl. You will also need the two example applications provided with this tutorial.
1) Learning how to find available resources for your job
We will learn to use the 'edg-job-list-match' command to get the list of resources on which your job can be executed. Remember that commands such as 'edg-job-submit' also call the resource selector subsystem, although you will not notice it.
The 'edg-job-list-match' command is used to submit jdl files that represent a parallel MPI job to the Resource Broker.
This jdl file must contain the specifications and requirements of the job:
JobType: field that defines the type of job. Possible values are:
“normal” - (default) common sequential job
“mpich” - defines an MPI job compiled with the ch_p4 device
“mpich-g2” - defines an MPI job compiled with the G2 device
NodeNumber: field that defines the number of CPUs required to execute the MPI job
The example jdl file (file.jdl) for this exercise looks for groups of CEs whose queue type is PBS and that have at least 10 free CPUs in the group, in order to run the MPICH-G2 job named mpi_app.
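As a rough illustration of what such a file might contain, the following Python sketch writes out a candidate file.jdl. Only JobType, NodeNumber and the name mpi_app come from the text above. The Executable attribute, the GLUE attribute used in Requirements, and the idea that NodeNumber = 10 is what drives the 10-CPU condition are assumptions, and the helper render_jdl is hypothetical. Check the attribute names against the documentation of your testbed before using them.

```python
def render_jdl(attributes):
    """Render a dict of JDL attribute -> already formatted value as text."""
    lines = [f"{name} = {value};" for name, value in attributes.items()]
    return "\n".join(lines) + "\n"


# Illustrative values only; this is not the file shipped with the tutorial.
file_jdl = render_jdl({
    "Executable": '"mpi_app"',       # assumed: the MPICH-G2 binary to run
    "JobType": '"mpich-g2"',         # from the text: MPI job, G2 device
    "NodeNumber": "10",              # from the text: CPUs required
    # Assumed GLUE attribute for selecting PBS queues.
    "Requirements": 'other.GlueCEInfoLRMSType == "PBS"',
})

print(file_jdl)

# To create the file on disk:
# with open("file.jdl", "w") as handle:
#     handle.write(file_jdl)
```

Keeping the values as pre-formatted strings leaves quoting under the author's control, since JDL strings are quoted while numbers and expressions are not.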
The command used to get the available CEs is:
edg-job-list-match file.jdl
The output of the command will depend on the available machines. An example of the output obtained is the following:
Connecting to host cg07.ific.uv.es, port 7772
********************************************************************************
GROUPS OF CE IDs LIST
The following groups of CE(s) matching your job requirements have been found:
*Groups with 1 CEs*                                 *TotalCPUs* *FreeCPUs*
[Rank=650]
ce001.grid.ucy.ac.cy:2119/jobmanager-pbs-infinite   10          10
[Rank=650]
ce001.grid.ucy.ac.cy:2119/jobmanager-pbs-long       10          10
[Rank=650]
ce001.grid.ucy.ac.cy:2119/jobmanager-pbs-short      10          10
[Rank=630]
cluster.ui.sav.sk:2119/jobmanager-pbs-workq         16          16
[Rank=400]
zeus24.cyf-kr.edu.pl:2119/jobmanager-pbs-infinite   58          57
[Rank=400]
zeus24.cyf-kr.edu.pl:2119/jobmanager-pbs-long       58          57
[Rank=400]
zeus24.cyf-kr.edu.pl:2119/jobmanager-pbs-short      58          57
*Groups with 2 CEs*                                 *TotalCPUs* *FreeCPUs*
[Rank=440 TotalCPUs=12 FreeCPUs=12]
cagnode45.cs.tcd.ie:2119/jobmanager-pbs-infinite    4           4
ce100.fzk.de:2119/jobmanager-pbs-long               8           8
[Rank=498 TotalCPUs=10 FreeCPUs=10]
ce01.lip.pt:2119/jobmanager-pbs-infinite            2           2
ce100.fzk.de:2119/jobmanager-pbs-long               8           8
[Rank=433.6 TotalCPUs=10 FreeCPUs=10]
ce100.fzk.de:2119/jobmanager-pbs-long               8           8
cg01.ific.uv.es:2119/jobmanager-pbs-infinite        2           2
[Rank=448 TotalCPUs=10 FreeCPUs=10]
ce100.fzk.de:2119/jobmanager-pbs-long               8           8
cgnode00.di.uoa.gr:2119/jobmanager-pbs-infinite     2           2
[Rank=498 TotalCPUs=12 FreeCPUs=10]
ce100.fzk.de:2119/jobmanager-pbs-long               8           8
cms.fuw.edu.pl:2119/jobmanager-pbs-infinite         4           2
[Rank=566.667 TotalCPUs=12 FreeCPUs=12]
cagnode45.cs.tcd.ie:2119/jobmanager-pbs-infinite    4           4
xgrid.icm.edu.pl:2119/jobmanager-pbs-infinite       8           8
[Rank=650 TotalCPUs=10 FreeCPUs=10]
ce01.lip.pt:2119/jobmanager-pbs-infinite            2           2
xgrid.icm.edu.pl:2119/jobmanager-pbs-infinite       8           8
[Rank=555 TotalCPUs=16 FreeCPUs=16]
ce100.fzk.de:2119/jobmanager-pbs-long               8           8
xgrid.icm.edu.pl:2119/jobmanager-pbs-infinite       8           8
[Rank=585.6 TotalCPUs=10 FreeCPUs=10]
cg01.ific.uv.es:2119/jobmanager-pbs-infinite        2           2
xgrid.icm.edu.pl:2119/jobmanager-pbs-infinite       8           8
[Rank=600 TotalCPUs=10 FreeCPUs=10]
cgnode00.di.uoa.gr:2119/jobmanager-pbs-infinite     2           2
xgrid.icm.edu.pl:2119/jobmanager-pbs-infinite       8           8
[Rank=650 TotalCPUs=12 FreeCPUs=10]
cms.fuw.edu.pl:2119/jobmanager-pbs-infinite         4           2
xgrid.icm.edu.pl:2119/jobmanager-pbs-infinite       8           8
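When the list is long, it can help to post-process it. The sketch below is one possible way to do so in Python. It assumes that each CE row reaches the script as a single line of the form "CE id, TotalCPUs, FreeCPUs", as in the listing above. The function name parse_ce_lines and the sample text are illustrative and not part of any CrossGrid tool.

```python
import re

# A CE row: "<host>:<port>/jobmanager-<queue>  <TotalCPUs>  <FreeCPUs>"
CE_LINE = re.compile(r"^\s*(\S+:\d+/jobmanager-\S+)\s+(\d+)\s+(\d+)\s*$")


def parse_ce_lines(output):
    """Return (ce_id, total_cpus, free_cpus) for every CE row in the text."""
    rows = []
    for line in output.splitlines():
        match = CE_LINE.match(line)
        if match:
            rows.append((match.group(1),
                         int(match.group(2)),
                         int(match.group(3))))
    return rows


# A few rows copied from the example output; [Rank=...] lines are skipped.
sample = """\
[Rank=630]
cluster.ui.sav.sk:2119/jobmanager-pbs-workq 16 16
[Rank=400]
zeus24.cyf-kr.edu.pl:2119/jobmanager-pbs-infinite 58 57
[Rank=650 TotalCPUs=12 FreeCPUs=10]
cms.fuw.edu.pl:2119/jobmanager-pbs-infinite 4 2
xgrid.icm.edu.pl:2119/jobmanager-pbs-infinite 8 8
"""

rows = parse_ce_lines(sample)
# CEs that currently have fewer free CPUs than total CPUs.
busy = [ce for ce, total, free in rows if free < total]
print(busy)
```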
2) Learning how to submit a simple MPICH-G2 job
The jdl file must specify the mpich-g2 job type and the number of nodes needed to execute the application, as in the file used in the previous exercise. The command used to submit such a file is:
edg-job-submit file.jdl
The output of the command will look like the following:
Connecting to host aow5grid.uab.es, port 7772
Logging to host aow5grid.uab.es, port 9002
*****************************************************************************
JOB SUBMIT OUTCOME
The job has been successfully submitted to the Network Server.
Use edg-job-status command to check job current status. Your job identifier (edg_jobId) is:
- https://aow5grid.uab.es:9000/jR0hjTzOlyFkRkpP_i1R8Q
*****************************************************************************
This output indicates that the job has been sent to the RB and now you must check its status using the edg-job-status command. The job will pass through three different states: Waiting, Running and Done.
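Since the job identifier is needed by the follow-up commands, a script that wraps the submission may want to pull it out of the text. The following Python sketch shows one way to do that. It assumes the identifier is the first https URL with an explicit port in the output, as in the example above. The helper name extract_job_id is hypothetical.

```python
import re

# Job identifiers in the example look like https://<host>:<port>/<unique id>
JOB_ID = re.compile(r"https://[^\s/]+:\d+/\S+")


def extract_job_id(submit_output):
    """Return the first job identifier found in the text, or None."""
    match = JOB_ID.search(submit_output)
    return match.group(0) if match else None


# Text copied from the example output of edg-job-submit.
submit_output = """\
JOB SUBMIT OUTCOME
The job has been successfully submitted to the Network Server.
Use edg-job-status command to check job current status. Your job identifier (edg_jobId) is:
- https://aow5grid.uab.es:9000/jR0hjTzOlyFkRkpP_i1R8Q
"""

job_id = extract_job_id(submit_output)
print(job_id)
```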
The output of the edg-job-status command in each of these states is shown below:
*************************************************************
BOOKKEEPING INFORMATION:
Printing status info for the Job : https://aow5grid.uab.es:9000/jR0hjTzOlyFkRkpP_i1R8Q
Current Status: Waiting
reached on: Mon Feb 2 15:20:24 2004
*************************************************************
*************************************************************
BOOKKEEPING INFORMATION:
Printing status info for the Job : https://aow5grid.uab.es:9000/jR0hjTzOlyFkRkpP_i1R8Q
Current Status: Running
Status Reason: unavailable
Destination: ce001.grid.ucy.ac.cy:2119/jobmanager-pbs-infinite
reached on: Mon Feb 2 15:31:29 2004
*************************************************************
*************************************************************
BOOKKEEPING INFORMATION:
Printing status info for the Job : https://aow5grid.uab.es:9000/jR0hjTzOlyFkRkpP_i1R8Q
Current Status: Done (Success)
Exit code: 0
Status Reason: Job terminated successfully
Destination: ce001.grid.ucy.ac.cy:2119/jobmanager-pbs-infinite
reached on: Mon Feb 2 15:38:16 2004
*************************************************************
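The Waiting, Running, Done sequence lends itself to a simple polling loop. The Python sketch below is a minimal illustration under stated assumptions: it reads the "Current Status:" line shown in the outputs above, and it takes the status text from a caller-supplied function, so it can be tried without a testbed. The names current_status and wait_until_done are hypothetical. In real use the supplied function would run edg-job-status with the job identifier and return what it prints.

```python
import re
import time

STATUS = re.compile(r"^Current Status:\s*(.+?)\s*$", re.MULTILINE)


def current_status(status_output):
    """Return the text after 'Current Status:', or None if it is absent."""
    match = STATUS.search(status_output)
    return match.group(1) if match else None


def wait_until_done(fetch_status_output, interval=0.0, max_polls=100):
    """Poll until the status starts with 'Done'.

    Returns the list of distinct consecutive states that were observed.
    """
    seen = []
    for _ in range(max_polls):
        state = current_status(fetch_status_output())
        if state is not None and (not seen or seen[-1] != state):
            seen.append(state)
        if state is not None and state.startswith("Done"):
            return seen
        time.sleep(interval)
    raise TimeoutError("job did not reach the Done state")


# Canned outputs standing in for three successive edg-job-status calls.
canned = iter([
    "Current Status: Waiting\n",
    "Current Status: Running\nStatus Reason: unavailable\n",
    "Current Status: Done (Success)\nExit code: 0\n",
])

history = wait_until_done(lambda: next(canned))
print(history)
```

Passing the status source in as a function keeps the parsing logic independent of how the command is actually invoked on a given UI machine.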
When the job has finished, you can get the output using the command edg-job-get-output:
Retrieving files from host aow5grid.uab.es
*****************************************************************************
JOB GET OUTPUT OUTCOME
Output sandbox files for the job:
- https://aow5grid.uab.es:9000/jR0hjTzOlyFkRkpP_i1R8Q
have been successfully retrieved and stored in the directory:
/tmp/jR0hjTzOlyFkRkpP_i1R8Q
*****************************************************************************
The output and error files of each subjob of the application can be found in the directory reported by edg-job-get-output.
3) Learning how to submit Interactive MPICH-P4 and MPICH-G2 jobs
This exercise assumes that a minimal, well-configured LCG-2 testbed is available.
a) First of all, we need to define the interactive feature in the job descriptor file. To do this, we set the JobType attribute to one of the following values:
JobType
{ “interactive”, “mpich” } - Defines an interactive mpich-p4 job
{ “interactive”, “mpich-g2” } - Defines an interactive mpich-g2 job
In the JDL file, you must specify both the fields that define an mpich-p4/g2 job and the ones dealing with interactivity, e.g. ListenerPort for an interactive job and NodeNumber for an mpich job.
Interactive jobs must not define any of the following attributes: OutputSandbox, StdOutput and StdError.
To practice this new feature, write a jdl file (file.jdl) that follows these rules.
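The rules above can be checked mechanically before submitting. The following Python sketch is a hedged illustration: it treats the job description as a plain dictionary rather than parsing real JDL, and it only encodes the constraints stated in this section. The function check_interactive_mpi, the Executable entry and the std.out file name are hypothetical.

```python
FORBIDDEN_FOR_INTERACTIVE = {"OutputSandbox", "StdOutput", "StdError"}
MPI_TYPES = {"mpich", "mpich-g2"}


def check_interactive_mpi(attributes):
    """Return a list of problems found in a dict of JDL-like attributes."""
    problems = []
    job_type = set(attributes.get("JobType", []))
    if "interactive" not in job_type:
        problems.append("JobType must include 'interactive'")
    if not job_type & MPI_TYPES:
        problems.append("JobType must include 'mpich' or 'mpich-g2'")
    if "NodeNumber" not in attributes:
        problems.append("NodeNumber is required for an mpich job")
    for name in sorted(FORBIDDEN_FOR_INTERACTIVE & set(attributes)):
        problems.append(f"{name} is not allowed in an interactive job")
    return problems


candidate = {
    "Executable": "mpi_app",                  # hypothetical application name
    "JobType": ["interactive", "mpich-g2"],
    "NodeNumber": 2,
    "StdOutput": "std.out",                   # deliberately violates the rule
}
print(check_interactive_mpi(candidate))

# Removing the offending attribute leaves no problems.
fixed = {k: v for k, v in candidate.items() if k != "StdOutput"}
print(check_interactive_mpi(fixed))
```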
b) Next, we use the modified UI to submit the jdl file to the modified RB, which supports this new feature.
edg-job-submit file.jdl
The output of the command will look like the following:
Selected Virtual Organisation name (from JDL): cg
Connecting to host aorbgrid.uab.es, port 7772
Logging to host aorbgrid.uab.es, port 9002
**********************************************************************
JOB SUBMIT OUTCOME
The job has been successfully submitted to the Network Server.
Use edg-job-status command to check job current status. Your job identifier (edg_jobId) is:
- https://aorbgrid.uab.es:9000/IAYUQS7E6J4aySd3bjImVQ
---
The Interactive Session Listener has been successfully launched
with the following parameters:
Host: 158.109.65.39
Port: 24501
**********************************************************************
***************************************
Interactive Job console started for https://aorbgrid.uab.es:9000/IAYUQS7E6J4aySd3bjImVQ
Please press ^C to exit from the session
***************************************
This output indicates that the job has been sent to the RB; the user must now wait for the job to start executing. Once the job is running, the output of an interactive mpich-g2 job should be similar to the following:
Subjob0: my_id 0 numprocs 2
Subjob0: Number of trips around the ring ? 2
Subjob0: Verbosity (yes/no) ? yes
Subjob0: Processor name: cg05.ific.uv.es
Subjob0: Starting trip 1 of 2: before sending num=1 to dest=1
Subjob0: Inside trip 1 of 2: before receiving from source=1
Subjob0: End of trip 1 of 2: after receiving passed_num=2 (should be =trip*numprocs=2) from source=1
Subjob0: Starting trip 2 of 2: before sending num=3 to dest=1
Subjob0: Inside trip 2 of 2: before receiving from source=1
Subjob0: End of trip 2 of 2: after receiving passed_num=4 (should be =trip*numprocs=4) from source=1
Subjob0: >>>>>>>>>>>>>>> INTERACTIVE JOB FINISHED <<<<<<<<<<<<<<<
Subjob1: my_id 1 numprocs 2
Subjob1: Processor name: cg04.ific.uv.es
Subjob1: Top of trip 1 of 2: before receiving from source=0
Subjob1: Inside trip 1 of 2: after receiving passed_num=1 from source=0
Subjob1: Inside trip 1 of 2: before sending passed_num=2 to dest=0
Subjob1: Bottom of trip 1 of 2: after send to dest=0
Subjob1: Top of trip 2 of 2: before receiving from source=0
Subjob1: Inside trip 2 of 2: after receiving passed_num=3 from source=0
Subjob1: Inside trip 2 of 2: before sending passed_num=4 to dest=0
Subjob1: Bottom of trip 2 of 2: after send to dest=0
Subjob1: >>>>>>>>>>>>>>> INTERACTIVE JOB FINISHED <<<<<<<<<<<<<<<
Note that “Subjob0” and “Subjob1” indicate the mpich process id.
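If the console output of an interactive mpich-g2 job is saved, the “SubjobN:” prefix makes it easy to separate the text produced by each process. The Python sketch below shows one way to do this. It assumes every line of interest starts with that prefix, as in the output above. The function name split_by_subjob is hypothetical.

```python
import re
from collections import defaultdict

PREFIX = re.compile(r"^Subjob(\d+):\s?(.*)$")


def split_by_subjob(console_output):
    """Group console lines by the process id in their 'SubjobN:' prefix."""
    streams = defaultdict(list)
    for line in console_output.splitlines():
        match = PREFIX.match(line)
        if match:
            streams[int(match.group(1))].append(match.group(2))
    return dict(streams)


# Lines taken from the example output, interleaved for illustration.
console_output = """\
Subjob0: my_id 0 numprocs 2
Subjob1: my_id 1 numprocs 2
Subjob0: Processor name: cg05.ific.uv.es
Subjob1: Processor name: cg04.ific.uv.es
"""

streams = split_by_subjob(console_output)
for process_id in sorted(streams):
    print(process_id, streams[process_id])
```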
However, if it is an interactive mpich-p4 job, the output will be the following:
my_id 0 numprocs 2
Number of trips around the ring ? 2
Verbosity (yes/no) ? yes
my_id 1 numprocs 2
Slave 1: Processor name: cg04.ific.uv.es
Slave 1: top of trip 1 of 2: before receiving from source=0
Slave 1: inside trip 1 of 2: after receiving passed_num=1 from source=0
Slave 1: inside trip 1 of 2: before sending passed_num=2 to dest=0
Slave 1: bottom of trip 1 of 2: after send to dest=0
Slave 1: top of trip 2 of 2: before receiving from source=0
Slave 1: inside trip 2 of 2: after receiving passed_num=3 from source=0
Slave 1: inside trip 2 of 2: before sending passed_num=4 to dest=0
Slave 1: bottom of trip 2 of 2: after send to dest=0
Master: Processor name: cg06.ific.uv.es
Master: starting trip 1 of 2: before sending num=1 to dest=1
Master: inside trip 1 of 2: before receiving from source=1
Master: end of trip 1 of 2: after receiving passed_num=2 (should be =trip*numprocs=2) from source=1
Master: starting trip 2 of 2: before sending num=3 to dest=1
Master: inside trip 2 of 2: before receiving from source=1
Master: end of trip 2 of 2: after receiving passed_num=4 (should be =trip*numprocs=4) from source=1
>>>>>>>>>>>>>>> INTERACTIVE JOB FINISHED <<<<<<<<<<<<<<<
Note that when the job is an interactive mpich-p4 job, the output does not show the “Subjob0” ... “SubjobN” strings, because in this case what is shown is the mpirun output. For the same reason, there is only one end-of-job string.
When all the mpich-g2 job subtasks have finished or when mpirun has finished, the following message will appear:
***************************************
Interactive Session has finish correctly.
Removing Listener and input/output streams...
Done
Press <enter> to go to prompt
***************************************
However, if the user cancels the job by pressing Ctrl+C, the following message appears in the console:
***************************************
Interactive Session ended by user.
Removing Listener and input/output streams...
Done
Press <enter> to go to prompt
***************************************
If the job was cancelled, it must be removed with the edg-job-cancel command. However, if the job is not executing on the remote host, it can be restored with the edg-job-attach command:
edg-job-attach https://aorbgrid.uab.es:9000/ta84GWSCp0qnuTH3g2N8yQ
**********************************************************************
JOB ATTACHED:
The Interactive Session Listener has been successfully launched
with the following parameters:
---
Host: 158.109.65.39
Port: 24501
**********************************************************************
***************************************
Interactive Job console started for https://aorbgrid.uab.es:9000/ta84GWSCp0qnuTH3g2N8yQ
Please press ^C to exit from the session
***************************************
From then on, the job will continue its execution exactly as if the edg-job-submit command had been used.