Scheduler
This document contains a set of exercises for learning the basic commands used to execute interactive MPI applications on the CrossGrid testbed, from an advanced user's point of view. It is not aimed at users who submit jobs to the grid through the Portal, who will never need to run any of the commands described here.
Requirements:
A minimal, well-configured LCG-1 testbed is needed for exercises 1 and 2, and an LCG-2 testbed is needed for exercise 3. These testbeds must have at least the modified RB and the UI working properly, together with a properly configured II. In addition, the user needs a valid certificate signed by a CA, and the user proxy must be generated with the grid-proxy-init command.
In this tutorial you will need to create two files for the three proposed exercises: file.jdl for the batch jobs and file_i.jdl for the interactive job. You will also need the example application mpi_app provided with this tutorial, which is a simple mpich-g2 application.
1) Learning how to find available resources for your job
We will learn to use the 'edg-job-list-match' command to get the list of resources on which to execute your job. Remember that commands like 'edg-job-submit' will also call the resource selector subsystem although you will not notice it.
The 'edg-job-list-match' command is used to submit jdl files that represent a parallel MPI job to the Resource Broker.
This jdl file must contain the specifications and requirements of the job:
JobType: field that defines whether the job is an MPI job. Possible values are:
"normal" - (default) common sequential job
"mpich" - defines an MPI job compiled with the ch_p4 device
"mpich-g2" - defines an MPI job compiled with the G2 device
NodeNumber: field that defines the number of CPUs required to execute the MPI job
Below is an example jdl file (file.jdl) that looks for groups of CEs whose queue type is PBS and that together have at least 10 free CPUs, to run the MPICH-G2 job named mpi_app.
Executable = "mpi_app";
JobType = "mpich-g2";
NodeNumber = 10;
Arguments = "-n";
StdOutput = "std.out";
StdError = "std.err";
Requirements = other.GlueCEInfoLRMSType=="pbs";
Rank = other.GlueHostBenchmarkSI00;
OutputSandbox = {"std.out","std.err"};
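When several jdl files that differ only in a few values are needed, it can be convenient to generate them from a small script. The following Python sketch is one possible way to do so and is not part of the CrossGrid tools; the names Expr, format_value and render_jdl are hypothetical helpers introduced here for illustration. It assumes that string values are written in double quotes, lists in braces, and expressions such as Requirements and Rank unquoted, as in the example above.

```python
class Expr(str):
    """Marks a value that is written unquoted (e.g. Requirements, Rank)."""


def format_value(value):
    # Expressions are emitted verbatim, plain strings are double-quoted,
    # lists become {"a","b"}, and anything else (e.g. integers) uses str().
    if isinstance(value, Expr):
        return str(value)
    if isinstance(value, str):
        return '"%s"' % value
    if isinstance(value, (list, tuple)):
        return "{" + ",".join(format_value(v) for v in value) + "}"
    return str(value)


def render_jdl(attributes):
    """Return JDL text with one 'Name = value;' statement per line."""
    return "\n".join(
        "%s = %s;" % (name, format_value(value))
        for name, value in attributes.items()
    )


if __name__ == "__main__":
    job = {
        "Executable": "mpi_app",
        "JobType": "mpich-g2",
        "NodeNumber": 10,
        "Arguments": "-n",
        "StdOutput": "std.out",
        "StdError": "std.err",
        "Requirements": Expr('other.GlueCEInfoLRMSType=="pbs"'),
        "Rank": Expr("other.GlueHostBenchmarkSI00"),
        "OutputSandbox": ["std.out", "std.err"],
    }
    print(render_jdl(job))
```

Redirecting the printed text to file.jdl gives a file equivalent to the example above.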
The command used to get the available CEs is:
edg-job-list-match file.jdl
The output of the command will depend on the available machines. An example of the output obtained is the following:
Connecting to host cg07.ific.uv.es, port 7772
********************************************************************************
GROUPS OF CE IDs LIST
The following groups of CE(s) matching your job requirements have been found:
*Groups with 1 CEs* *TotalCPUs* *FreeCPUs*
[Rank=650]
ce001.grid.ucy.ac.cy:2119/jobmanager-pbs-infinite 10 10
[Rank=650]
ce001.grid.ucy.ac.cy:2119/jobmanager-pbs-long 10 10
[Rank=650]
ce001.grid.ucy.ac.cy:2119/jobmanager-pbs-short 10 10
[Rank=630]
cluster.ui.sav.sk:2119/jobmanager-pbs-workq 16 16
[Rank=400]
zeus24.cyf-kr.edu.pl:2119/jobmanager-pbs-infinite 58 57
[Rank=400]
zeus24.cyf-kr.edu.pl:2119/jobmanager-pbs-long 58 57
[Rank=400]
zeus24.cyf-kr.edu.pl:2119/jobmanager-pbs-short 58 57
*Groups with 2 CEs* *TotalCPUs* *FreeCPUs*
[Rank=440 TotalCPUs=12 FreeCPUs=12]
cagnode45.cs.tcd.ie:2119/jobmanager-pbs-infinite 4 4
ce100.fzk.de:2119/jobmanager-pbs-long 8 8
[Rank=498 TotalCPUs=10 FreeCPUs=10]
ce01.lip.pt:2119/jobmanager-pbs-infinite 2 2
ce100.fzk.de:2119/jobmanager-pbs-long 8 8
[Rank=433.6 TotalCPUs=10 FreeCPUs=10]
ce100.fzk.de:2119/jobmanager-pbs-long 8 8
cg01.ific.uv.es:2119/jobmanager-pbs-infinite 2 2
[Rank=448 TotalCPUs=10 FreeCPUs=10]
ce100.fzk.de:2119/jobmanager-pbs-long 8 8
cgnode00.di.uoa.gr:2119/jobmanager-pbs-infinite 2 2
[Rank=498 TotalCPUs=12 FreeCPUs=10]
ce100.fzk.de:2119/jobmanager-pbs-long 8 8
cms.fuw.edu.pl:2119/jobmanager-pbs-infinite 4 2
[Rank=566.667 TotalCPUs=12 FreeCPUs=12]
cagnode45.cs.tcd.ie:2119/jobmanager-pbs-infinite 4 4
xgrid.icm.edu.pl:2119/jobmanager-pbs-infinite 8 8
[Rank=650 TotalCPUs=10 FreeCPUs=10]
ce01.lip.pt:2119/jobmanager-pbs-infinite 2 2
xgrid.icm.edu.pl:2119/jobmanager-pbs-infinite 8 8
[Rank=555 TotalCPUs=16 FreeCPUs=16]
ce100.fzk.de:2119/jobmanager-pbs-long 8 8
xgrid.icm.edu.pl:2119/jobmanager-pbs-infinite 8 8
[Rank=585.6 TotalCPUs=10 FreeCPUs=10]
cg01.ific.uv.es:2119/jobmanager-pbs-infinite 2 2
xgrid.icm.edu.pl:2119/jobmanager-pbs-infinite 8 8
[Rank=600 TotalCPUs=10 FreeCPUs=10]
cgnode00.di.uoa.gr:2119/jobmanager-pbs-infinite 2 2
xgrid.icm.edu.pl:2119/jobmanager-pbs-infinite 8 8
[Rank=650 TotalCPUs=12 FreeCPUs=10]
cms.fuw.edu.pl:2119/jobmanager-pbs-infinite 4 2
xgrid.icm.edu.pl:2119/jobmanager-pbs-infinite 8 8
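A listing like this quickly becomes long, so it may help to post-process it. The Python sketch below shows one way to read such a listing into groups and to find the highest-ranked group with enough free CPUs. It assumes each group starts with a "[Rank=...]" line followed by one line per CE of the form "CE-id TotalCPUs FreeCPUs"; parse_groups and best_group are hypothetical names, and the Resource Broker makes the actual selection itself when the job is submitted.

```python
import re

# A group header such as "[Rank=650]" or "[Rank=440 TotalCPUs=12 FreeCPUs=12]".
HEADER = re.compile(r"^\[Rank=([0-9.]+)")
# A CE line: "<host>:<port>/<jobmanager-queue> <TotalCPUs> <FreeCPUs>".
CE_LINE = re.compile(r"^(\S+:\d+/\S+)\s+(\d+)\s+(\d+)$")


def parse_groups(text):
    """Return a list of {'rank': float, 'ces': [(ce_id, total, free), ...]}."""
    groups = []
    for raw in text.splitlines():
        line = raw.strip()
        header = HEADER.match(line)
        if header:
            groups.append({"rank": float(header.group(1)), "ces": []})
            continue
        ce = CE_LINE.match(line)
        if ce and groups:
            groups[-1]["ces"].append(
                (ce.group(1), int(ce.group(2)), int(ce.group(3)))
            )
    return groups


def best_group(groups, needed_cpus):
    """Highest-ranked group whose free CPUs add up to at least needed_cpus."""
    candidates = [
        g for g in groups
        if sum(free for _, _, free in g["ces"]) >= needed_cpus
    ]
    return max(candidates, key=lambda g: g["rank"]) if candidates else None


if __name__ == "__main__":
    sample = """\
[Rank=630]
cluster.ui.sav.sk:2119/jobmanager-pbs-workq 16 16
[Rank=440 TotalCPUs=12 FreeCPUs=12]
cagnode45.cs.tcd.ie:2119/jobmanager-pbs-infinite 4 4
ce100.fzk.de:2119/jobmanager-pbs-long 8 8
"""
    print(best_group(parse_groups(sample), needed_cpus=10))
```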
2) Learning how to submit a simple MPICH-G2 job
The jdl file must specify the mpich-g2 job type and the number of nodes needed to execute the application, as in the file shown in the previous exercise. The command used to submit such a file is:
edg-job-submit file.jdl
The output of the command looks like this:
Connecting to host aow5grid.uab.es, port 7772
Logging to host aow5grid.uab.es, port 9002
*****************************************************************************
JOB SUBMIT OUTCOME
The job has been successfully submitted to the Network Server.
Use edg-job-status command to check job current status. Your job identifier (edg_jobId) is:
- https://aow5grid.uab.es:9000/jR0hjTzOlyFkRkpP_i1R8Q
*****************************************************************************
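The job identifier is needed by the commands used in the rest of this exercise, so a script that wraps the submission has to pick it out of this output. A minimal Python sketch follows, assuming the identifier is the first https URL with a port number in the text; extract_job_id is a hypothetical helper name.

```python
import re

# Assumption: the job identifier has the form https://<host>:<port>/<unique-id>.
JOB_ID = re.compile(r"https://[\w.-]+:\d+/\S+")


def extract_job_id(submit_output):
    """Return the first job identifier found in the text, or None."""
    match = JOB_ID.search(submit_output)
    return match.group(0) if match else None


if __name__ == "__main__":
    sample = (
        "Your job identifier (edg_jobId) is:\n"
        "- https://aow5grid.uab.es:9000/jR0hjTzOlyFkRkpP_i1R8Q\n"
    )
    print(extract_job_id(sample))
```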
This output indicates that the job has been sent to the RB. You must now check its status using the edg-job-status command. The job will pass through three different states: Waiting, Running and Done.
The output of the edg-job-status command in each of these states is shown here:
*************************************************************
BOOKKEEPING INFORMATION:
Printing status info for the Job : https://aow5grid.uab.es:9000/jR0hjTzOlyFkRkpP_i1R8Q
Current Status: Waiting
reached on: Mon Feb 2
*************************************************************
*************************************************************
BOOKKEEPING INFORMATION:
Printing status info for the Job : https://aow5grid.uab.es:9000/jR0hjTzOlyFkRkpP_i1R8Q
Current Status: Running
Status Reason: unavailable
Destination: ce001.grid.ucy.ac.cy:2119/jobmanager-pbs-infinite
reached on: Mon Feb 2
*************************************************************
*************************************************************
BOOKKEEPING INFORMATION:
Printing status info for the Job : https://aow5grid.uab.es:9000/jR0hjTzOlyFkRkpP_i1R8Q
Current Status: Done (Success)
Exit code: 0
Status Reason: Job terminated successfully
Destination: ce001.grid.ucy.ac.cy:2119/jobmanager-pbs-infinite
reached on: Mon Feb 2
*************************************************************
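Checking the status by hand can be replaced by a small polling loop. The sketch below is only an illustration: it assumes edg-job-status is on the PATH and accepts the job identifier as its argument, that the state appears on a "Current Status:" line as in the outputs above, and that only the three states named in this tutorial occur. current_status and wait_until_done are hypothetical helper names.

```python
import re
import subprocess
import time

STATUS_LINE = re.compile(r"^\s*Current Status:\s*(.+?)\s*$", re.MULTILINE)


def current_status(status_output):
    """Return the text after 'Current Status:', or None if the line is absent."""
    match = STATUS_LINE.search(status_output)
    return match.group(1) if match else None


def wait_until_done(job_id, interval=30, max_polls=120):
    """Poll edg-job-status until the status starts with 'Done'.

    Returns the final status text, or None if max_polls is exhausted.
    Other final states are not handled here; that is a simplification.
    """
    for _ in range(max_polls):
        result = subprocess.run(
            ["edg-job-status", job_id], capture_output=True, text=True
        )
        status = current_status(result.stdout)
        if status is not None and status.startswith("Done"):
            return status
        time.sleep(interval)
    return None


if __name__ == "__main__":
    sample = "Current Status: Done (Success)\nExit code: 0\n"
    print(current_status(sample))
```

Limiting the number of polls keeps the loop from running forever if the job never reaches the Done state.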
When the job has finished, you can get the output using the command edg-job-get-output:
Retrieving files from host aow5grid.uab.es
*****************************************************************************
JOB GET OUTPUT OUTCOME
Output sandbox files for the job:
- https://aow5grid.uab.es:9000/jR0hjTzOlyFkRkpP_i1R8Q
have been successfully retrieved and stored in the directory:
/tmp/jR0hjTzOlyFkRkpP_i1R8Q
*****************************************************************************
The directory reported by edg-job-get-output contains the output and error files of each subjob of the application.
3) Learning how to submit Interactive MPICH-P4 and MPICH-G2 jobs
This exercise assumes that a minimal, well-configured LCG-2 testbed is available.
a) First of all, we need to declare the job as interactive in the job description file. To do this, set JobType to one of the following values:
JobType = {"interactive", "mpich"} - defines an interactive mpich-p4 job
JobType = {"interactive", "mpich-g2"} - defines an interactive mpich-g2 job
In the JDL file, you must specify both the fields that define an mpich-p4/g2 job and the ones dealing with interactivity, e.g. NodeNumber for an mpich job and ListenerPort for an interactive job.
Interactive jobs must not define any of the following attributes: OutputSandbox, StdOutput and StdError.
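These rules can be checked before submission. The Python sketch below takes the job attributes as a dictionary and reports any violation of the constraints just listed. The function name and the dictionary representation are assumptions made for this example, and treating both NodeNumber and ListenerPort as mandatory is this sketch's reading of the text above, not a statement about what the RB enforces.

```python
FORBIDDEN = ("OutputSandbox", "StdOutput", "StdError")
REQUIRED = ("NodeNumber", "ListenerPort")
MPI_TYPES = ("mpich", "mpich-g2")


def check_interactive_mpi(attributes):
    """Return a list of problems; an empty list means no rule is violated."""
    problems = []
    job_type = attributes.get("JobType", "normal")
    # JobType may be a single string or a list such as ["interactive", "mpich"].
    types = [job_type] if isinstance(job_type, str) else list(job_type)
    if "interactive" not in types:
        problems.append('JobType does not include "interactive"')
    if not any(t in MPI_TYPES for t in types):
        problems.append('JobType does not include "mpich" or "mpich-g2"')
    for name in FORBIDDEN:
        if name in attributes:
            problems.append("%s is not allowed in an interactive job" % name)
    for name in REQUIRED:
        if name not in attributes:
            problems.append("%s is missing" % name)
    return problems


if __name__ == "__main__":
    job = {
        "Executable": "mpig2_app",
        "JobType": ["interactive", "mpich-g2"],
        "NodeNumber": 2,
        "StdOutput": "std.out",
    }
    for problem in check_interactive_mpi(job):
        print(problem)
```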
To practice this new feature, here is an example jdl file (file_i.jdl):
Executable = "mpig2_app";
JobType = {"interactive","mpich-g2"};
NodeNumber = 2;
ListenerPort = 24100;
Arguments = "-n";
InputSandbox = {"mpig2_app"};
b) Next, we use the modified UI to submit the jdl file to the modified RB that supports this new feature.
edg-job-submit file_i.jdl
The output of the command looks like this:
Selected Virtual Organisation name (from JDL): cg
Connecting to host aorbgrid.uab.es, port 7772
Logging to host aorbgrid.uab.es, port 9002
**********************************************************************
JOB SUBMIT OUTCOME
The job has been successfully submitted to the Network Server.
Use edg-job-status command to check job current status. Your job identifier (edg_jobId) is:
- https://aorbgrid.uab.es:9000/IAYUQS7E6J4aySd3bjImVQ
---
The Interactive Session Listener has been successfully launched
with the following parameters:
Host: 158.109.65.39
Port: 24501
**********************************************************************
***************************************
Interactive Job console started for
https://aorbgrid.uab.es:9000/IAYUQS7E6J4aySd3bjImVQ
Please press ^C to exit from the session
***************************************
This output indicates that the job has been sent to the RB; the user must now wait for the job to start executing. Once an interactive mpich-g2 job is running, the output should be similar to the following:
Subjob1: I am 1 out of 2.
Subjob0: Hello world!
Subjob1: Process 1 on cgwn07.ifca.org.es
Subjob0: I am 0 out of 2.
Subjob0: Process 0 on cgwn06.ifca.org.es
Subjob1: >>>>>>>>>>>>>>> INTERACTIVE JOB FINISHED <<<<<<<<<<<<<<<
Subjob0: >>>>>>>>>>>>>>> INTERACTIVE JOB FINISHED <<<<<<<<<<<<<<<
>>>>>>>>>>>>>> INTERACTIVE JOB FINISHED <<<<<<<<<<<<<<<
Note that "Subjob0" and "Subjob1" indicate the mpich process id.
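Because the lines of the different processes arrive interleaved, it can be useful to regroup a saved copy of the console text by process. A minimal Python sketch, assuming every relevant line starts with "Subjob<N>:" as in the output above; split_by_subjob is a hypothetical helper name, and lines without that prefix are ignored.

```python
import re
from collections import defaultdict

PREFIX = re.compile(r"^Subjob(\d+):\s*(.*)$")


def split_by_subjob(console_text):
    """Map each process id to the list of lines it printed, in arrival order."""
    per_process = defaultdict(list)
    for line in console_text.splitlines():
        match = PREFIX.match(line.strip())
        if match:
            per_process[int(match.group(1))].append(match.group(2))
    return dict(per_process)


if __name__ == "__main__":
    sample = """\
Subjob1: I am 1 out of 2.
Subjob0: Hello world!
Subjob1: Process 1 on cgwn07.ifca.org.es
Subjob0: I am 0 out of 2.
Subjob0: Process 0 on cgwn06.ifca.org.es
"""
    for process_id, lines in sorted(split_by_subjob(sample).items()):
        print(process_id, lines)
```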
When all the mpich-g2 job subtasks have finished or when mpirun has finished, the following message will appear:
***************************************
Interactive Session has finish correctly.
Removing Listener and input/output streams...
Done
Press <enter> to go to prompt
***************************************
However, if the user cancels the session by pressing Ctrl+C, the following message appears in the console:
***************************************
Interactive Session ended by user.
Removing Listener and input/output streams...
Done
Press <enter> to go to prompt
***************************************
After the session has been ended this way, the job must be removed using the edg-job-cancel command. If the job is still executing on the remote host, however, the session can instead be restored with the edg-job-attach command:
edg-job-attach https://aorbgrid.uab.es:9000/ta84GWSCp0qnuTH3g2N8yQ
**********************************************************************
JOB ATTACHED:
The Interactive Session Listener has been successfully launched
with the following parameters:
---
Host: 158.109.65.39
Port: 24501
**********************************************************************
***************************************
Interactive Job console started for
https://aorbgrid.uab.es:9000/ta84GWSCp0qnuTH3g2N8yQ
Please press ^C to exit from the session
***************************************
From then on, the job will continue its execution exactly as if the edg-job-submit command had been used.