Scheduler

 

This document contains a set of exercises for learning the basic commands used to execute interactive MPI applications on the CrossGrid testbed, from an advanced user's point of view.  It is not aimed at users who submit jobs to the grid via the Portal, who will never need to run any of the commands described here.

 

Requirements:

 

Exercises 1 and 2 need a minimal, well-configured LCG-1 testbed, while exercise 3 needs an LCG-2 testbed.  Each testbed must have at least the modified RB (Resource Broker) and the UI working properly, together with a properly configured II.  In addition, the user needs a valid certificate signed by a CA.

 

For the three proposed exercises you will need to create the file file.jdl.  You will also use the two example applications provided with this tutorial.

 

 

1) Learning how to find available resources for your job

 

In this exercise we use the 'edg-job-list-match' command to get the list of resources on which your job can be executed.  Remember that commands like 'edg-job-submit' also call the resource selector subsystem, although you will not notice it.

 

The 'edg-job-list-match' command sends a jdl file that describes a parallel MPI job to the Resource Broker, which returns the matching resources without running the job.

 

The jdl file must contain the specifications and requirements of the job:

            JobType         Defines the type of job. Possible values are:

                                   "normal"    - (default) common sequential job

                                   "mpich"     - an MPI job compiled with the ch_p4 device

                                   "mpich-g2"  - an MPI job compiled with the G2 device

            NodeNumber      Defines the number of CPUs required to execute the MPI job

 

 

Below is an example jdl file (file.jdl) that looks for groups of CEs whose queue type is PBS and that have at least 10 free CPUs in the group, in order to run the MPICH-G2 job named mpi_app.

 

 

Executable      = "mpi_app";
JobType         = "mpich-g2";
NodeNumber      = 10;
Arguments       = "-n";
StdOutput       = "std.out";
StdError        = "std.err";
Requirements    = other.GlueCEInfoLRMSType == "pbs";
Rank            = other.GlueHostBenchmarkSI00;
OutputSandbox   = {"std.out","std.err"};

 

 

The command used to get the available CEs is:

 

            edg-job-list-match   file.jdl

 

The output of the command will depend on the available machines.  An example of the output obtained is the following:

 

Connecting to host cg07.ific.uv.es, port 7772

 

********************************************************************************

GROUPS OF CE IDs LIST

The following groups of CE(s) matching your job requirements have been found:

 

*Groups with 1 CEs*                 *TotalCPUs* *FreeCPUs*

 

[Rank=650]

ce001.grid.ucy.ac.cy:2119/jobmanager-pbs-infinite      10         10

[Rank=650]

ce001.grid.ucy.ac.cy:2119/jobmanager-pbs-long          10         10

[Rank=650]

ce001.grid.ucy.ac.cy:2119/jobmanager-pbs-short         10         10

[Rank=630]

cluster.ui.sav.sk:2119/jobmanager-pbs-workq            16         16

[Rank=400]

zeus24.cyf-kr.edu.pl:2119/jobmanager-pbs-infinite      58         57

[Rank=400]

zeus24.cyf-kr.edu.pl:2119/jobmanager-pbs-long          58         57

[Rank=400]

zeus24.cyf-kr.edu.pl:2119/jobmanager-pbs-short         58         57

 

 

*Groups with 2 CEs*                 *TotalCPUs* *FreeCPUs*

 

[Rank=440 TotalCPUs=12 FreeCPUs=12]

cagnode45.cs.tcd.ie:2119/jobmanager-pbs-infinite       4          4

ce100.fzk.de:2119/jobmanager-pbs-long                  8          8

[Rank=498 TotalCPUs=10 FreeCPUs=10]

ce01.lip.pt:2119/jobmanager-pbs-infinite               2          2

ce100.fzk.de:2119/jobmanager-pbs-long                  8          8

[Rank=433.6 TotalCPUs=10 FreeCPUs=10]

ce100.fzk.de:2119/jobmanager-pbs-long                  8          8

cg01.ific.uv.es:2119/jobmanager-pbs-infinite           2          2

[Rank=448 TotalCPUs=10 FreeCPUs=10]

ce100.fzk.de:2119/jobmanager-pbs-long                  8          8

cgnode00.di.uoa.gr:2119/jobmanager-pbs-infinite        2          2

[Rank=498 TotalCPUs=12 FreeCPUs=10]

ce100.fzk.de:2119/jobmanager-pbs-long                  8          8

cms.fuw.edu.pl:2119/jobmanager-pbs-infinite            4          2

[Rank=566.667 TotalCPUs=12 FreeCPUs=12]

cagnode45.cs.tcd.ie:2119/jobmanager-pbs-infinite       4          4

xgrid.icm.edu.pl:2119/jobmanager-pbs-infinite          8          8

[Rank=650 TotalCPUs=10 FreeCPUs=10]

ce01.lip.pt:2119/jobmanager-pbs-infinite               2          2

xgrid.icm.edu.pl:2119/jobmanager-pbs-infinite          8          8

[Rank=555 TotalCPUs=16 FreeCPUs=16]

ce100.fzk.de:2119/jobmanager-pbs-long                  8          8

xgrid.icm.edu.pl:2119/jobmanager-pbs-infinite          8          8

[Rank=585.6 TotalCPUs=10 FreeCPUs=10]

cg01.ific.uv.es:2119/jobmanager-pbs-infinite           2          2

xgrid.icm.edu.pl:2119/jobmanager-pbs-infinite          8          8

[Rank=600 TotalCPUs=10 FreeCPUs=10]

cgnode00.di.uoa.gr:2119/jobmanager-pbs-infinite        2          2

xgrid.icm.edu.pl:2119/jobmanager-pbs-infinite          8          8

[Rank=650 TotalCPUs=12 FreeCPUs=10]

cms.fuw.edu.pl:2119/jobmanager-pbs-infinite            4          2

xgrid.icm.edu.pl:2119/jobmanager-pbs-infinite          8          8
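
The group listing above lends itself to simple post-processing. The following is a minimal Python sketch, not part of the CrossGrid tools, that parses text in the format shown above and picks the highest-ranked group with enough free CPUs. The helper names parse_groups and best_group are hypothetical, and the sketch assumes that a higher Rank value is preferred and that the real output follows the layout of this example.

```python
import re

# "[Rank=650]" or "[Rank=433.6 TotalCPUs=10 FreeCPUs=10]"
RANK_RE = re.compile(r"^\[Rank=([0-9.]+)")
# "<host>:<port>/<jobmanager-queue>   <TotalCPUs>   <FreeCPUs>"
CE_RE = re.compile(r"^(\S+:\d+/\S+)\s+(\d+)\s+(\d+)\s*$")


def parse_groups(text):
    """Return a list of {'rank': float, 'ces': [(ce_id, total, free), ...]}."""
    groups = []
    for raw in text.splitlines():
        line = raw.strip()
        rank = RANK_RE.match(line)
        if rank:
            # A "[Rank=...]" line opens a new group of CEs.
            groups.append({"rank": float(rank.group(1)), "ces": []})
            continue
        ce = CE_RE.match(line)
        if ce and groups:
            groups[-1]["ces"].append(
                (ce.group(1), int(ce.group(2)), int(ce.group(3)))
            )
    return groups


def best_group(groups, min_free):
    """Highest-ranked group whose CEs offer at least min_free free CPUs."""
    candidates = [
        g for g in groups
        if sum(free for _, _, free in g["ces"]) >= min_free
    ]
    return max(candidates, key=lambda g: g["rank"], default=None)


if __name__ == "__main__":
    sample = """
[Rank=650]
ce001.grid.ucy.ac.cy:2119/jobmanager-pbs-infinite      10         10
[Rank=498 TotalCPUs=12 FreeCPUs=10]
ce100.fzk.de:2119/jobmanager-pbs-long                  8          8
cms.fuw.edu.pl:2119/jobmanager-pbs-infinite            4          2
"""
    chosen = best_group(parse_groups(sample), min_free=10)
    print(chosen["rank"], [ce_id for ce_id, _, _ in chosen["ces"]])
```

In practice the text would come from saving the output of edg-job-list-match to a file; the Resource Broker performs its own selection when the job is submitted, so this is only a way to inspect the candidates.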

 

 

 

 

 

2) Learning how to submit a simple MPICH-G2 job

 

The jdl file must specify the mpich-g2 job type and the number of nodes needed to execute the application, as in the file shown in the previous exercise. The command used to submit such a file is:

 

edg-job-submit file.jdl

 

The output of the command looks like this:

 

Connecting to host aow5grid.uab.es, port 7772

Logging to host aow5grid.uab.es, port 9002

 

 

*****************************************************************************

 

                               JOB SUBMIT OUTCOME

 The job has been successfully submitted to the Network Server.

 Use edg-job-status command to check job current status. Your job identifier (edg_jobId) is:

 

 - https://aow5grid.uab.es:9000/jR0hjTzOlyFkRkpP_i1R8Q

 

 

*****************************************************************************

 

This output indicates that the job has been sent to the RB. You must now check its status using the edg-job-status command. The job will pass through three different states: Waiting, Running and Done.

The output of the edg-job-status command in each state is shown below:

 

*************************************************************

BOOKKEEPING INFORMATION:

 

Printing status info for the Job : https://aow5grid.uab.es:9000/jR0hjTzOlyFkRkpP_i1R8Q

Current Status:    Waiting

reached on:        Mon Feb  2 15:20:24 2004

*************************************************************

 

*************************************************************

BOOKKEEPING INFORMATION:

 

Printing status info for the Job : https://aow5grid.uab.es:9000/jR0hjTzOlyFkRkpP_i1R8Q

Current Status:    Running

Status Reason:     unavailable

Destination:       ce001.grid.ucy.ac.cy:2119/jobmanager-pbs-infinite

reached on:        Mon Feb  2 15:31:29 2004

*************************************************************

 

*************************************************************

BOOKKEEPING INFORMATION:

 

Printing status info for the Job : https://aow5grid.uab.es:9000/jR0hjTzOlyFkRkpP_i1R8Q

Current Status:    Done (Success)

Exit code:         0

Status Reason:     Job terminated successfully

Destination:       ce001.grid.ucy.ac.cy:2119/jobmanager-pbs-infinite

reached on:        Mon Feb  2 15:38:16 2004

*************************************************************
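
Checking the status by hand can be automated. The following is a minimal Python sketch, assuming the output formats shown above; the helper names find_job_id, parse_status, is_done and wait_until_done are hypothetical, and the polling interval and retry limit are illustrative values. Only the Done state described in this tutorial is treated as final, so any other final state would simply exhaust the retries.

```python
import re
import subprocess
import time

# Job identifiers look like https://<host>:<port>/<unique string>
JOB_ID_RE = re.compile(r"https://[^\s/]+:\d+/\S+")


def find_job_id(text):
    """Return the first job identifier found in command output, or None."""
    match = JOB_ID_RE.search(text)
    return match.group(0) if match else None


def parse_status(text):
    """Collect the 'Name: value' lines of edg-job-status output into a dict."""
    fields = {}
    for line in text.splitlines():
        key, sep, value = line.partition(":")
        if sep and value.strip():
            fields[key.strip()] = value.strip()
    return fields


def is_done(fields):
    """True once the job reports a 'Done ...' status."""
    return fields.get("Current Status", "").startswith("Done")


def wait_until_done(job_id, interval=30, max_polls=60):
    """Poll edg-job-status until the job is Done; return None on timeout."""
    for _ in range(max_polls):
        result = subprocess.run(
            ["edg-job-status", job_id], capture_output=True, text=True
        )
        fields = parse_status(result.stdout)
        if is_done(fields):
            return fields
        time.sleep(interval)
    return None
```

A typical use would be to pass the output of edg-job-submit to find_job_id and then call wait_until_done with the identifier before retrieving the output.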

 

When the job has finished, you can get the output using the command edg-job-get-output:

 

Retrieving files from host aow5grid.uab.es

 

*****************************************************************************

                        JOB GET OUTPUT OUTCOME

 

 Output sandbox files for the job:

 - https://aow5grid.uab.es:9000/jR0hjTzOlyFkRkpP_i1R8Q

 have been successfully retrieved and stored in the directory:

 /tmp/jR0hjTzOlyFkRkpP_i1R8Q

 

*****************************************************************************

 

The directory reported by edg-job-get-output contains the output and error files of each subjob of the application.

 

 

3)  Learning how to submit Interactive MPICH-P4 and MPICH-G2 jobs

 

This exercise assumes that a minimal, well-configured LCG-2 testbed is available.

 

a) First, declare the job as interactive in the job description file. To do this, set JobType to one of the following values:

 

            JobType         { "interactive", "mpich" }        - Defines an interactive mpich-p4 job

                            { "interactive", "mpich-g2" }     - Defines an interactive mpich-g2 job

 

In the JDL file, you must specify both the fields that define an mpich-p4/g2 job and the fields that deal with interactivity, for example NodeNumber for an mpich job and ListenerPort for an interactive job.

 

Interactive jobs must not define any of the following attributes: OutputSandbox, StdOutput and StdError.

To practice this new feature, prepare a jdl file that follows these rules.
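
The rules above can be checked before submitting. The following is a minimal Python sketch, not part of the CrossGrid tools, that validates a set of attributes against those rules and renders them as jdl text. The helper names check_interactive, format_value and render_jdl are hypothetical, the attribute values (mpi_app, two nodes) are illustrative only, and the renderer handles just strings, integers and lists, not Requirements or Rank expressions.

```python
FORBIDDEN_WHEN_INTERACTIVE = ("OutputSandbox", "StdOutput", "StdError")
MPICH_TYPES = ("mpich", "mpich-g2")


def check_interactive(attrs):
    """Return a list of rule violations for an interactive job description."""
    job_type = attrs.get("JobType", [])
    if isinstance(job_type, str):
        job_type = [job_type]
    problems = []
    if "interactive" not in job_type:
        return problems  # the rules below only concern interactive jobs
    for name in FORBIDDEN_WHEN_INTERACTIVE:
        if name in attrs:
            problems.append(name + " is not allowed in interactive jobs")
    if any(t in MPICH_TYPES for t in job_type) and "NodeNumber" not in attrs:
        problems.append("NodeNumber is required for mpich jobs")
    return problems


def format_value(value):
    """Format a string, integer or list in the style used by the jdl examples."""
    if isinstance(value, int):
        return str(value)
    if isinstance(value, (list, tuple)):
        return "{" + ", ".join(format_value(v) for v in value) + "}"
    return '"' + str(value) + '"'


def render_jdl(attrs):
    """Render the attributes as 'Name = value;' lines."""
    return "\n".join(
        "{} = {};".format(name, format_value(value))
        for name, value in attrs.items()
    )


if __name__ == "__main__":
    # Illustrative values only; adapt them to your own application.
    job = {
        "Executable": "mpi_app",
        "JobType": ["interactive", "mpich-g2"],
        "NodeNumber": 2,
    }
    issues = check_interactive(job)
    print(render_jdl(job) if not issues else "\n".join(issues))
```

Attributes such as ListenerPort can be added to the dictionary in the same way.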

 

 

 

b) Next, submit the jdl file from the modified UI to the modified RB, which supports this new feature.

 

edg-job-submit file.jdl

 

The output of the command looks like this:

 

Selected Virtual Organisation name (from JDL): cg

Connecting to host aorbgrid.uab.es, port 7772

Logging to host aorbgrid.uab.es, port 9002

 

**********************************************************************

                         JOB SUBMIT OUTCOME

 The job has been successfully submitted to the Network Server.

 Use edg-job-status command to check job current status. Your job identifier (edg_jobId) is:

 

 - https://aorbgrid.uab.es:9000/IAYUQS7E6J4aySd3bjImVQ

 

       ---

 The Interactive Session Listener has been successfully launched

 with the following parameters:

 

 Host:                            158.109.65.39

 Port:                            24501

 

**********************************************************************

 

***************************************

Interactive Job console started for

https://aorbgrid.uab.es:9000/IAYUQS7E6J4aySd3bjImVQ

Please press ^C to exit from the session

***************************************

 

This output indicates that the job has been sent to the RB and that the user must now wait for the job to start executing.

 

When the job is running, if it is an interactive mpich-g2 job, the output should be similar to the following:

 

Subjob0: my_id 0 numprocs 2

Subjob0: Number of trips around the ring ? 2

Subjob0: Verbosity (yes/no) ? yes

Subjob0: Processor name: cg05.ific.uv.es

Subjob0: Starting trip 1 of 2: before sending num=1 to dest=1

Subjob0: Inside trip 1 of 2: before receiving from source=1

Subjob0: End of trip 1 of 2: after receiving passed_num=2 (should be =trip*numprocs=2) from source=1

Subjob0: Starting trip 2 of 2: before sending num=3 to dest=1

Subjob0: Inside trip 2 of 2: before receiving from source=1

Subjob0: End of trip 2 of 2: after receiving passed_num=4 (should be =trip*numprocs=4) from source=1

Subjob0: >>>>>>>>>>>>>>>  INTERACTIVE JOB FINISHED  <<<<<<<<<<<<<<<

Subjob1: my_id 1 numprocs 2

Subjob1: Processor name: cg04.ific.uv.es

Subjob1: Top of trip 1 of 2: before receiving from source=0

Subjob1: Inside trip 1 of 2: after receiving passed_num=1 from source=0

Subjob1: Inside trip 1 of 2: before sending passed_num=2 to dest=0

Subjob1: Bottom of trip 1 of 2: after send to dest=0

Subjob1: Top of trip 2 of 2: before receiving from source=0

Subjob1: Inside trip 2 of 2: after receiving passed_num=3 from source=0

Subjob1: Inside trip 2 of 2: before sending passed_num=4 to dest=0

Subjob1: Bottom of trip 2 of 2: after send to dest=0

Subjob1: >>>>>>>>>>>>>>>  INTERACTIVE JOB FINISHED  <<<<<<<<<<<<<<<

 

Note that "Subjob0" and "Subjob1" indicate the mpich process id.

However, if it is an interactive mpich-p4 job, the output will be the following:

 

 

my_id 0 numprocs 2

Number of trips around the ring ? 2

Verbosity (yes/no) ? yes

my_id 1 numprocs 2

Slave 1: Processor name: cg04.ific.uv.es

Slave 1: top of trip 1 of 2: before receiving from source=0

Slave 1: inside trip 1 of 2: after receiving passed_num=1 from source=0

Slave 1: inside trip 1 of 2: before sending passed_num=2 to dest=0

Slave 1: bottom of trip 1 of 2: after send to dest=0

Slave 1: top of trip 2 of 2: before receiving from source=0

Slave 1: inside trip 2 of 2: after receiving passed_num=3 from source=0

Slave 1: inside trip 2 of 2: before sending passed_num=4 to dest=0

Slave 1: bottom of trip 2 of 2: after send to dest=0

Master: Processor name: cg06.ific.uv.es

Master: starting trip 1 of 2: before sending num=1 to dest=1

Master: inside trip 1 of 2: before receiving from source=1

Master: end of trip 1 of 2: after receiving passed_num=2 (should be =trip*numprocs=2) from source=1

Master: starting trip 2 of 2: before sending num=3 to dest=1

Master: inside trip 2 of 2: before receiving from source=1

Master: end of trip 2 of 2: after receiving passed_num=4 (should be =trip*numprocs=4) from source=1

>>>>>>>>>>>>>>>  INTERACTIVE JOB FINISHED  <<<<<<<<<<<<<<<

 

Note that when the job is an interactive mpich-p4 job, the output shows neither "Subjob0" nor "SubjobN" strings, because in this case the mpirun output is shown directly. For the same reason, it has only one end-of-job string.
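
For an mpich-g2 session, the SubjobN prefix makes it easy to separate the output of each process after the fact. The following is a minimal Python sketch, assuming the console text has been saved and follows the "SubjobN: message" layout shown above; the helper name split_by_subjob is hypothetical. Lines without the prefix, such as mpich-p4 output, are ignored.

```python
import re
from collections import defaultdict

# "Subjob0: my_id 0 numprocs 2" -> process id 0, message "my_id 0 numprocs 2"
PREFIX_RE = re.compile(r"^Subjob(\d+): ?(.*)$")


def split_by_subjob(text):
    """Group console lines by mpich process id, dropping the prefix."""
    per_process = defaultdict(list)
    for line in text.splitlines():
        match = PREFIX_RE.match(line)
        if match:
            per_process[int(match.group(1))].append(match.group(2))
    return dict(per_process)


if __name__ == "__main__":
    console = (
        "Subjob0: my_id 0 numprocs 2\n"
        "Subjob1: my_id 1 numprocs 2\n"
        "Subjob0: Processor name: cg05.ific.uv.es\n"
        "Subjob1: Processor name: cg04.ific.uv.es\n"
    )
    for process_id, lines in sorted(split_by_subjob(console).items()):
        print(process_id, lines)
```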

 

When all the subtasks of the mpich-g2 job have finished, or when mpirun has finished, the following message appears:

 

***************************************

Interactive Session has finish correctly.

Removing Listener and input/output streams...

Done

Press <enter> to go to prompt

***************************************

 

However, if the user cancels the job by pressing Ctrl+C, the following message appears in the console:

 

***************************************

Interactive Session ended by user.

Removing Listener and input/output streams...

Done

Press <enter> to go to prompt

***************************************

 

If the job was cancelled, it must be removed using the edg-job-cancel command. However, if the job is not executing on the remote host, the session can be restored with the edg-job-attach command:

 

edg-job-attach https://aorbgrid.uab.es:9000/ta84GWSCp0qnuTH3g2N8yQ

 

 

 

**********************************************************************

 JOB ATTACHED:

 The Interactive Session Listener has been successfully launched

 with the following parameters:

       ---

 Host:                            158.109.65.39

 Port:                            24501

**********************************************************************

 

***************************************

Interactive Job console started for

https://aorbgrid.uab.es:9000/ta84GWSCp0qnuTH3g2N8yQ

Please press ^C to exit from the session

***************************************

 

From then on, the job will continue its execution exactly as if the edg-job-submit command had been used.