Grid Application Monitoring with the OCM-G Tool
The exercise of monitoring a Grid application with OCM-G consists of the following steps:
Prerequisites
To run the exercise you will need:
Step One (I): Preparing the application (Source code modification)
Description
Enabling monitoring of an application requires only a very small modification to the application source code. You just need to insert one function call which informs the monitoring system that the application wants to cooperate with the OCM-G monitoring system.
The mentioned function call has the following declaration:
int ocmg_register(int rank, int *argc, char ***argv)
where
rank – the identifier of the calling process; most commonly you will put an MPI process rank here,
argc – a pointer to the variable containing the number of command line parameters,
argv – a pointer to the array of command line parameters.
This call must be inserted in the source code only after the MPI_Init() call; otherwise the application will not be provided with the argc and argv parameters properly.
An example call can be: ocmg_register(rank, &argc, &argv);
Your tasks
1. Insert the ocmg_register() call into the ping.c source code around line 37. Search for the comment line beginning with "/*INSERT OCMG_REGISTER() BELOW*/" and insert the call below it. Use your favourite editor to do that:
%> vi ping.c +37
and insert a line containing:
ocmg_register(rank, &argc, &argv);
Step One (II): Preparing the application (Compilation)
Description
After inserting the call to the registration function, the application must be linked against the monitoring libraries. For MPI applications, an instrumented version of the MPI library should also be linked to enable monitoring of MPI calls. The OCM-G provides a compiler wrapper named cg-ocmg-cc which does the necessary compilation and linking automatically. To use this wrapper, issue your usual compilation command preceded by the name of the OCM-G wrapper, e.g.:
%> cg-ocmg-cc mpicc myapp.c -o myapp
This will generate a myapp executable, which is now monitoring-ready.
Your tasks
%> cg-ocmg-cc mpicc -o ping ping.c
Step Two: Starting the OCM-G Monitoring System
Description
The OCM-G uses a daemon, the Main Service Manager, which needs to be started prior to running the application. This process will collect data about the application and expose it to external entities, such as performance evaluation tools or debuggers.
The MainSM is usually started on the User Interface machine. Other components of the OCM-G (Local Monitors on Worker Nodes and Service Managers on Computing Elements) are started automatically. The MainSM returns a connection string which should be passed as a command-line parameter both to the application and tools.
Your tasks
%> cg-ocmg-monitor
Main SM Connection String: 959c0921:4e20
Hint: On Linux you can copy the string by highlighting it in an xterm window; to paste it from the buffer, press the middle (third) mouse button.
Step Three: Running the application
Description
The application requires several command line parameters to pass essential information to the OCM-G. These parameters are as follows:
--ocmg-appname <name> [Mandatory]
The name of the monitored application. This name is chosen arbitrarily by the user and should later be passed to the tool.
--ocmg-regcont [Optional]
This parameter instructs the OCM-G to continue the application execution (i.e., not suspend it) after registration in the monitoring system. If it is not passed, the processes of the application will be suspended right after they register with the OCM-G, and a later monitoring request must be used to continue execution.
--ocmg-mainsm <hexIPaddr:hexPort> [Mandatory]
This mandatory parameter is used to pass the MainSM connection string to the OCM-G.
Your tasks
5. Prepare the execution of the application. To do this, edit the ping.jdl file, go to the line starting with "Arguments", and substitute the connection string with the one just returned by the Main Service Manager. If your connection string is e.g. 959c0921:4e20, the line should contain:
Arguments = "--ocmg-regcont --ocmg-appname ping --ocmg-mainsm 959c0921:4e20";
Save the changes and exit the editor.
Your tasks
6. Submit the application to the Resource Broker:
%> edg-job-submit -o JOBID ping.jdl
Now, observe the Main Service Manager window; it should show the application processes registering. You should notice a string similar to:
authorization callback: peer identity: /C=PL/O=GRID/O=Cyfronet/CN=CG Tutorial User XX
and after that a line of debugging info.
Important: It is highly recommended to stay at this step until the processes have registered.
Both the application and the OCM-G are running now. Next, we can run a tool to attach to the OCM-G and obtain some information about the application.
[In case of problems]: Checking the status of the grid job
If the Main Service Manager window does not change, your grid job has encountered some trouble. To investigate what happened, you can check the status of the job by typing:
%> edg-job-status -i JOBID
You will get information from the Resource Broker about where the job is supposed to be executed and what the current status of the execution is.
Step Four (I): Obtaining simple information from the monitoring system
Description
Tools communicate with the OCM-G via a standardized protocol, OMIS (On-line Monitoring Interface Specification). For example, one may run the G-PM tool and set up some measurements and visualization charts to see how the application performs. The G-PM tool converts user interactions into OMIS requests to collect data from the OCM-G. In this exercise, we will attach a very simple console tool via which we can specify monitoring requests directly in the OMIS protocol and see the replies from the OCM-G. Please note that these requests are normally issued by tools.
Your tasks
Run the console tool, passing the connection string, e.g.:
%> cg-ocmg-tool --ocmg-mainsm 959c0921:4e20
Do not forget to replace the connection string with your own! You should see the following prompt displayed:
*** Tool
ocm.1>
ocm.1> :app_attach2("ping")
Got reply 1:
Element 0: | 0 |
Element 1: app_1 | 0 |
Note that the application name passed as the request parameter is the same as the one passed on the command line via the '--ocmg-appname' option. This request returns the application token, which should from now on be used as the application's identifier. In the example above the token is 'app_1'.
ocm.2> :app_get_proclist([app_1])
This should return a list of application process tokens, e.g.:
Got reply 2:
Element 0: | 0 |
Element 1: app_1 | 0 | 2,[n_2, p_29026_n_2, n_1, p_21598_n_1]
This list contains two process tokens beginning with the prefix 'p_' (p_29026_n_2, p_21598_n_1). The tokens with the 'n_' prefix denote the hosts on which your processes run.
To obtain some information about the nodes the application is running on, first we have to attach to the nodes:
ocm.3> :node_attach([n_1,n_2])
Got reply 3:
Element 0: | 0 |
Element 1: | 0 |
Note that you should pass the tokens of the nodes you got in the reply to the 'app_get_proclist' request.
You obtain the host information with the following request:
ocm.4> :node_get_info([n_1],3)
Got reply 4:
Element 0: | 0 |
Element 1: n_1 | 0 | "zeus26.cyf-kr.edu.pl","Linux","#1 SMP Fri Jan 10
Step Four (II): Monitoring MPI calls
Description
So far we have sent a couple of unconditional requests to the OCM-G, i.e., the result of each request was a single immediate response containing some information. Now we will see how we can specify conditional requests to trace events occurring in the application. In this example we will track the MPI calls in the ping-pong application.
We assume, like in the steps above, that the tokens of the nodes are ‘n_1’ and ‘n_2’ and two running application processes are ‘p_21598_n_1’ and ‘p_29026_n_2’. We will trace calls to MPI_Send in the first process and MPI_Recv in the second one.
Your tasks
d. Attach your tool to allow monitoring of the processes:
ocm.5> :proc_attach([p_21598_n_1,p_29026_n_2])
Got reply 5:
Element 0: | 0 |
Element 1: | 0 |
e. Define a conditional request to trace MPI_Send calls in process 'p_21598_n_1':
ocm.6> thread_has_started_libcall([p_21598_n_1],"MPI_Send"): print(["MPI_Send started! Time stamp:"], $time)
Got reply 6:
Element 0: | 0 |
Element 1: c_NDU_1 | 0 |
Note that this request is composed of two parts separated by a colon. The first part is the specification of the event which is of interest to us (in this case: the start of a call to MPI_Send in the specified process). The second part specifies the actions to be executed when this event occurs. In this case there is only one action, 'print', which returns a string and a time stamp to the OCM-G.
This request resulted in installing the conditional request and returned its token: 'c_NDU_1'. The request is still disabled, though.
f. Define a conditional request to trace MPI_Recv calls in process 'p_29026_n_2':
ocm.7> thread_has_started_libcall([p_29026_n_2],"MPI_Recv"): print(["MPI_Recv started! Time stamp:"], $time)
Got reply 7:
Element 0: | 0 |
Element 1: c_NDU_2 | 0 |
Note the token returned for this conditional request. As in the previous case, the request is disabled.
g. To enable both conditional requests, type the following request, using the tokens returned above:
ocm.8> :csr_enable([c_NDU_1,c_NDU_2])
Got reply 8:
Element 0: | 0 |
Element 1: | 0 |
From now on you should observe the periodically incoming results of the 'print' action.
h. Disabling the conditional requests.
When you are done watching the output from the monitoring system, you can disable the conditional requests with 'csr_disable'. You can also quit the console tool by typing 'quit'.
Step Five: Shutting down the OCM-G
Your tasks
%> cg-ocmg-terminate --ocmg-mainsm 959c0921:4e20
Do not forget to replace the connection string with your own!
The monitoring system is now shut down. Please note that your application processes are still running.