Grid Application Monitoring with OCM-G Tool

 

The exercise of monitoring a Grid application with OCM-G consists of the following steps:

  1. Adapting the user’s application for monitoring.
  2. Starting the OCM-G Monitoring System.
  3. Running the user’s application.
  4. Getting some output from the monitoring system.
  5. Shutting down the monitoring system.

 

Prerequisites

To run the exercise you will need:

  1. The sample application (‘ping.c’).
  2. The ‘ocmg’ and ‘ocmg-devel’ packages installed on your User Interface machine.
  3. A valid proxy certificate (use ‘grid-proxy-init’).

 

Step One (I): Preparing the application (Source code modification)

Description

Enabling monitoring of an application requires only a very small modification to its source code. You just need to insert one function call, which informs the monitoring system that the application wants to cooperate with the OCM-G monitoring system.

The mentioned function call has the following declaration:

int ocmg_register(int rank, int *argc, char ***argv)

where

rank - the calling process identifier; most commonly you will put an MPI process rank here,

argc - a pointer to the variable containing the number of command-line parameters,

argv - a pointer to the array of command-line parameters.

The call must be inserted in the source code only after the MPI_Init() call; otherwise the application will not be provided with the argc and argv parameters properly.

An example call can be: ocmg_register(rank, &argc, &argv);

Your tasks

1. Insert the ocmg_register() call into the ping.c source code around line 37. Search for the comment line beginning with “/*INSERT OCMG_REGISTER() BELOW*/” and insert the call below it. Use your favourite editor to do that.

%> vi ping.c +37

insert a line containing:

ocmg_register(rank, &argc, &argv);


Step One (II): Preparing the application (Compilation)

Description

After inserting the call to the registration function, the application must be linked against the monitoring libraries. For MPI applications, an instrumented version of the MPI library is also linked in to enable monitoring of MPI calls. The OCM-G provides a compiler wrapper named cg-ocmg-cc which does the necessary compilation and linking automatically. To use this wrapper, prefix your usual compilation command with the wrapper’s name, e.g.:

%> cg-ocmg-cc mpicc myapp.c -o myapp

This will generate a myapp executable, which is now monitoring-ready.

Your tasks

  1. Recompile your application by typing:

 

%> cg-ocmg-cc mpicc -o ping ping.c


Step Two: Starting the OCM-G Monitoring System

Description

The OCM-G uses a daemon, the Main Service Manager (MainSM), which needs to be started prior to running the application. This process will collect data about the application and expose it to external entities, such as performance evaluation tools or debuggers.

The MainSM is usually started on the User Interface machine. Other components of the OCM-G (Local Monitors on Worker Nodes and Service Managers on Computing Elements) are started automatically. The MainSM returns a connection string which should be passed as a command-line parameter both to the application and tools.

Your tasks

  1. Start Main Service Manager:

%> cg-ocmg-monitor

Main SM Connection String: 959c0921:4e20

  2. Save the connection string returned by the MainSM.

Hint: In Linux you can do that by highlighting the string in an xterm window. To paste it from the selection buffer, press the middle mouse button.

 

 

Step Three: Running the application

Description
The application requires several command line parameters to pass essential information to the OCM-G. These parameters are as follows:

--ocmg-appname <name>                                               [Mandatory]

The name of the monitored application. The name is chosen arbitrarily by the user and should later be passed to the tool.

--ocmg-regcont                                                   [Optional]

This parameter instructs the OCM-G to continue the application execution (i.e., not to suspend it) after registration with the monitoring system. If it is not passed, the application processes will be suspended right after they register with the OCM-G; a later monitoring request must then be used to continue execution.

--ocmg-mainsm <hexIPaddr:hexPort>    [Mandatory]

This mandatory parameter is used to pass the MainSM connection string to the OCM-G.

Your tasks

1. Prepare the execution of the application. To do this, edit the ping.jdl file, go to the line starting with “Arguments” and substitute the connection string with the one returned a moment ago by the Main Service Manager. If your connection string is e.g. 959c0921:4e20, the line should contain:

Arguments = "--ocmg-regcont --ocmg-appname ping --ocmg-mainsm 959c0921:4e20";

Save the changes and exit the editor.


2. Submit the application to the Resource Broker.

%> edg-job-submit -o JOBID ping.jdl

Now, observe the Main Service Manager window; it should show the application processes registering. You should notice a string similar to:

authorization callback: peer identity: /C=PL/O=GRID/O=Cyfronet/CN=CG Tutorial User XX

and after that a line of debugging info.

Important: It is highly recommended to stay at this step until the processes have registered.

Both the application and the OCM-G are running now. Next, we can run a tool to attach to the OCM-G and obtain some information about the application.

 

[In case of problems]: Checking the status of the grid job

If the Main Service Manager window does not change, your grid job has encountered some trouble. To investigate what happened, you can check the status of the job by typing:

%> edg-job-status -i JOBID

You will get information from the Resource Broker about where the job is supposed to be executed and what the current status of the execution is.

 

Step Four (I): Obtaining simple information from the monitoring system

Description

Tools communicate with the OCM-G via a standardized protocol, OMIS (On-line Monitoring Interface Specification). For example, one may run the G-PM tool and set up some measurements and visualization charts to see how the application performs. The G-PM tool converts user interactions into OMIS requests to collect data from the OCM-G. In this exercise, we will attach a very simple console tool via which we can specify monitoring requests directly in the OMIS protocol and see the replies from the OCM-G. Please note that these requests are normally issued by tools.

Your tasks

  1. Run the console tool by typing, e.g.:

%> cg-ocmg-tool --ocmg-mainsm 959c0921:4e20

Do not forget to replace the connection string with your own! You should see the following prompt displayed:

 

*** Tool

ocm.1>

 

  2. Send the OMIS requests below and observe the incoming replies.
    a. attach to your application

ocm.1> :app_attach2("ping")

Got reply 1:

    Element 0:

                           |   0 |

             Element 1:

                          app_1 |   0 |

 

Note that the application name passed as the request parameter is the same as the one passed on the command line via the `--ocmg-appname’ option. This request returns the application token, which from now on should be used as the application’s identifier. In the example above the token is `app_1’.

 

    b. obtain the process list of your application

 

ocm.2> :app_get_proclist([app_1])

This should return a list of application process tokens, e.g.:

 

Got reply 2:

   Element 0:

                                       | 0 |

   Element 1: app_1 | 0 | 2,[n_2, p_29026_n_2, n_1, p_21598_n_1]

 

This list contains two process tokens beginning with the prefix `p_’ (p_29026_n_2, p_21598_n_1). The tokens with the `n_’ prefix denote the hosts on which your processes run.

    c. obtain host information

To obtain some information about the nodes the application is running on, first we have to attach to the nodes:

ocm.3> :node_attach([n_1,n_2])

Got reply 3:

    Element 0:

                                    | 0 |

    Element 1:

                                    | 0 |

 

Note that you should pass the node tokens you received in the reply to the `app_get_proclist’ request.

 

You obtain the host information with the following request:

ocm.4> :node_get_info([n_1],3)

Got reply 4:

  Element 0:

                        | 0 |

  Element 1:

                   n_1   | 0 | "zeus26.cyf-kr.edu.pl","Linux","#1 SMP Fri Jan 10 11:08:19 CET 2003","2.4.20"

 

Step Four (II): Monitoring MPI calls

Description

So far we have sent a couple of unconditional requests to the OCM-G, i.e., the result of each request was a single immediate response containing some information. Now we will see how to specify conditional requests that trace events occurring in the application. In this example we will track the MPI calls in the ping-pong application.

We assume, like in the steps above, that the tokens of the nodes are ‘n_1’ and ‘n_2’ and two running application processes are ‘p_21598_n_1’ and ‘p_29026_n_2’. We will trace calls to MPI_Send in the first process and MPI_Recv in the second one.

Your tasks

 

d.    Attach your tool to allow monitoring of the processes

ocm.5> :proc_attach([p_21598_n_1,p_29026_n_2])

Got reply 5:

 Element 0:

                      | 0 |

 Element 1:

                      | 0 |

e.    Define a conditional request to trace MPI_Send calls in process ‘p_21598_n_1’

ocm.6>

thread_has_started_libcall([p_21598_n_1],"MPI_Send"): print(["MPI_Send started! Time stamp:"], $time)

Got reply 6:

    Element 0:

                       | 0 |

  Element 1:

                  c_NDU_1 | 0 |

 

Note that this request is composed of two parts separated by a colon. The first part is the specification of the event which is of interest to us (in this case: the start of a call to MPI_Send in the specified process). The second part specifies the actions to be executed when this event occurs. In this case there is only one action, `print’, which returns a string and a time stamp to the OCM-G.

This request resulted in installing the conditional request and returned its token: ‘c_NDU_1’. The request is still disabled, though.

 

 

f.      Define a conditional request to trace MPI_Recv calls in process ‘p_29026_n_2’

ocm.7>

thread_has_started_libcall([p_29026_n_2],"MPI_Recv"): print(["MPI_Recv started! Time stamp:"], $time)

 

Got reply 7:

  Element 0:

                       | 0 |

  Element 1:

                  c_NDU_2 | 0 |

 

Note the token returned for this conditional request. As in the previous case, the request is disabled.

 

g.    To enable both conditional requests, type the following request:

ocm.8> :csr_enable([c_NDU_1,c_NDU_2])

Got reply 8:

  Element 0:

                       | 0 |

  Element 1:

                    | 0 |

 

From now on you should observe the periodically incoming results of the `print’ action.

 

h.    Disabling conditional requests.

When you are done watching the output from the monitoring system, you can disable the conditional requests with ‘csr_disable’. You can also quit the console tool by typing ‘quit’.
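For example, assuming ‘csr_disable’ accepts the same token list as ‘csr_enable’ above (this form is an inference from the enable request, not shown in a reply here), disabling both requests would look like:

```
ocm.9> :csr_disable([c_NDU_1,c_NDU_2])
```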

 

Step Five: Shutting down the OCM-G

Your tasks

  1. To shut down the monitoring system, type:

%> cg-ocmg-terminate --ocmg-mainsm 959c0921:4e20

Do not forget to replace the connection string with your own!

 

Now the monitoring system is shut down. Please note that your application processes are still running.