Grid Application Monitoring with the OCM-G Tool

 


 

The exercise of monitoring a Grid application with OCM-G consists of the following steps:

Adapting the user’s application for monitoring.

Starting the OCM-G Monitoring System.

Running the user’s application.

Getting some output from the monitoring system.

Shutting down the monitoring system.

 

Prerequisites

To run the exercise you will need:

Sample application source code: ping.c; for monitoring with probes you will also need ping_p.c.

The ‘ocmg’ and ‘ocmg-devel’ packages installed on your User Interface machine.

A valid proxy certificate (use ‘grid-proxy-init’).

 

Step One: Preparing the application

Description

Enabling monitoring of an MPI application does not require any modification of the application source code. An instrumented MPI library containing the initialization functions relieves the user of having to insert any additional function calls into the source code.

Registration with the monitoring system is hidden in the MPI_Init() function. The application only needs to be linked against the monitoring libraries. For MPI applications, the instrumented version of the MPI library should also be linked to enable monitoring of MPI calls. The OCM-G provides a compiler wrapper named cg-ocmg-cc which performs the necessary compilation and linking automatically. To use the wrapper, prefix your usual compile command with its name, e.g.:

%> cg-ocmg-cc mpicc myapp.c -o myapp

 

To gather application-specific metrics, you can insert OCM-G probes into the source code of your application. A probe is just a call to an empty function that the user places at a relevant point in the application sources, e.g., at the end of the main loop. Inside ping_p.c you can find a probe called ocmg_probe1.

To compile the application with probes, use the –probes option followed by the name of the file in which the probes are defined, i.e.:

%> cg-ocmg-cc –probes probes.c mpicc myapp.c -o myapp

 

This generates the myapp executable, which is now probe-enabled and ready for monitoring.

 

Your tasks

Recompile your application by typing:

 

%> cg-ocmg-cc –probes probes.c mpicc -o ping_p ping_p.c

 

You can check whether the application has been compiled correctly with the cg-ocmg-check-app tool:

%> cg-ocmg-check-app ping

 

 

 

The correct output should look similar to this:

Application is linked with OCM-G

Instrumented libraries:  libmpich.a

Probes files:  probes.c

 

Step Two: Starting the OCM-G Monitoring System

Description

The OCM-G uses a daemon, the Main Service Manager (MainSM), which must be started before the application is run. This process collects data about the application and exposes it to external entities, such as performance evaluation tools or debuggers.

The MainSM is usually started on the User Interface machine. Other components of the OCM-G (Local Monitors on Worker Nodes and Service Managers on Computing Elements) are started automatically. The MainSM returns a connection string which should be passed as a command-line parameter to both the application and the tools.

Your tasks

Start the Main Service Manager:

%> cg-ocmg-monitor

     

 

 

Main SM Connection String: 959c0921:4e20

Save the connection string returned by MainSM.

Hint: On Linux you can copy the string by highlighting it in an xterm window. To paste it from the buffer, press the middle mouse button.

 

 

Step Three: Running the application

Description

The application requires several command-line parameters to pass essential information to the OCM-G. These parameters are as follows:

--ocmg-appname <name>    [Mandatory]

The name of the monitored application. The name is chosen arbitrarily by the user and must later be passed to the tool.

--ocmg-regcont    [Optional]

Instructs the OCM-G to continue the application execution (i.e., not to suspend it) after registration with the monitoring system. If this parameter is not passed, the application processes are suspended right after they register with the OCM-G, and a later monitoring request must be used to continue execution.

--ocmg-mainsm <hexIPaddr:hexPort>    [Mandatory]

Passes the MainSM connection string to the OCM-G.
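The <hexIPaddr:hexPort> notation suggests that the connection string can be decoded by hand when you need to check which host and port the MainSM is listening on. The following is a minimal Python sketch, not part of the OCM-G. It assumes that the first field is an IPv4 address written as eight hexadecimal digits in network byte order and that the second field is the port number in hexadecimal. The function name is hypothetical.

```python
import ipaddress
from typing import Tuple


def decode_connection_string(conn: str) -> Tuple[str, int]:
    """Decode 'hexIPaddr:hexPort' into a dotted IPv4 address and a port.

    Assumption (not confirmed by the OCM-G documentation): the address
    field holds the 32-bit IPv4 address in network byte order.
    """
    hex_ip, hex_port = conn.strip().split(":")
    address = ipaddress.IPv4Address(int(hex_ip, 16))
    port = int(hex_port, 16)
    return str(address), port


if __name__ == "__main__":
    # The sample connection string used throughout this exercise.
    host, port = decode_connection_string("959c0921:4e20")
    print(host, port)
```

Under these assumptions the sample string 959c0921:4e20 decodes to address 149.156.9.33 and port 20000.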

Your tasks

Prepare the execution of the application. To do this, edit the ping.jdl file, go to the line starting with “Arguments” and replace the connection string with the one just returned by the Main Service Manager. If your connection string is, e.g., 959c0921:4e20, the line should read:

 

Arguments = "--ocmg-regcont --ocmg-appname ping --ocmg-mainsm 959c0921:4e20";

 

Save the changes and exit the editor.
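If you restart the MainSM often, retyping the connection string by hand is error-prone. The sketch below shows one way to generate the “Arguments” line from its parts. It is only an illustration: the function name and the simple format check are assumptions, and only the three options described above are used.

```python
import re


def jdl_arguments(appname: str, mainsm: str, regcont: bool = True) -> str:
    """Build the JDL 'Arguments' line for an OCM-G monitored application."""
    # Assumed format check: two hexadecimal fields separated by a colon.
    if not re.fullmatch(r"[0-9a-fA-F]+:[0-9a-fA-F]+", mainsm):
        raise ValueError("unexpected connection string: %r" % mainsm)
    parts = []
    if regcont:
        # Optional: do not suspend the processes after registration.
        parts.append("--ocmg-regcont")
    parts += ["--ocmg-appname", appname, "--ocmg-mainsm", mainsm]
    return 'Arguments = "%s";' % " ".join(parts)


if __name__ == "__main__":
    print(jdl_arguments("ping", "959c0921:4e20"))
```

The printed line can be pasted into ping.jdl in place of the existing “Arguments” line.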

Your tasks

Submit the application to the Resource Broker:

 

%> edg-job-submit -o JOBID ping.jdl

 

 

 

Now observe the Main Service Manager window; it should show the application processes registering. You should see a line similar to:

authorization callback: peer identity: /C=PL/O=GRID/O=Cyfronet/CN=CG Tutorial User XX

followed by a line of debugging information.

Important: It is highly recommended to stay at this step until the processes have registered.

Once registration succeeds, both the application and the OCM-G are running. Next, we can run a tool to attach to the OCM-G and obtain some information about the application.

 

[In case of problems]: Checking the status of the Grid job

If the Main Service Manager window does not change, your Grid job has run into trouble. To investigate what happened, check the status of the job by typing:

%> edg-job-status -i JOBID

The Resource Broker will report where the job is supposed to be executed and the current status of its execution.

 

Step Four (I): Obtaining simple information from the monitoring system

Description

Tools communicate with the OCM-G via a standardized protocol, OMIS (On-line Monitoring Interface Specification). For example, you may run the G-PM tool and set up some measurements and visualization charts to see how your application performs. The G-PM tool converts user interactions into OMIS requests to collect data from the OCM-G. In this exercise, we will attach a very simple console tool with which we can specify monitoring requests directly in the OMIS protocol and see the replies from the OCM-G. Please note that these requests are normally issued by tools.

Your tasks

Run the console tool by typing, e.g.:

%> cg-ocmg-tool --ocmg-mainsm 959c0921:4e20

 

 

 

Do not forget to replace the connection string with your own! You should see the following prompt displayed:

 

*** Tool

ocm.1>

 

Send the OMIS requests below and observe the incoming replies.

1. Attach to your application

ocm.1> :app_attach2("ping")

 

 

 

Got reply 1:
    Element 0:
              |   0 |
    Element 1:
        app_1 |   0 |

Note that the application name passed as the request parameter is the same as the one passed on the command line via the ‘--ocmg-appname’ option. This request returns the application token, which should be used from now on as the application’s identifier. In the example above the token is ‘app_1’.

 

2. Obtain the process list of your application

 

ocm.2> :app_get_proclist([app_1])

 

 

 

 

 

This should return a list of application process tokens, e.g.:

 

Got reply 2:
    Element 0:
              | 0 |
    Element 1:
        app_1 | 0 | 2,[n_2, p_29026_n_2, n_1, p_21598_n_1]

This list contains two process tokens beginning with the prefix ‘p_’ (p_29026_n_2, p_21598_n_1). The tokens with the ‘n_’ prefix denote the hosts on which your processes run.
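When scripting around the console tool, the process list has to be separated into node and process tokens before the next requests can be written. The following Python sketch does this for the data field of the sample reply above. The function name is hypothetical, and the parsing relies only on the ‘n_’ and ‘p_’ prefixes described in the text. The meaning of the leading number in the field is not specified here, so it is ignored.

```python
import re
from typing import List, Tuple


def parse_proclist(field: str) -> Tuple[List[str], List[str]]:
    """Split the bracketed token list into (node tokens, process tokens)."""
    match = re.search(r"\[(.*)\]", field)
    if match is None:
        raise ValueError("no token list found in %r" % field)
    tokens = [t.strip() for t in match.group(1).split(",") if t.strip()]
    nodes = [t for t in tokens if t.startswith("n_")]
    procs = [t for t in tokens if t.startswith("p_")]
    return nodes, procs


if __name__ == "__main__":
    nodes, procs = parse_proclist("2,[n_2, p_29026_n_2, n_1, p_21598_n_1]")
    print("nodes:", nodes)
    print("processes:", procs)
```

The two lists can then be used as the arguments of the node and process attach requests in the following steps.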

3. Obtain host information

To obtain some information about the nodes the application is running on, we first have to attach to the nodes, e.g.:

ocm.3> :node_attach([n_1,n_2])

 

Got reply 3:
    Element 0:
        | 0 |
    Element 1:
        | 0 |

Note that you should pass the node tokens you received in the reply to the ‘app_get_proclist’ request.

 

You obtain the host information with the following request:

 

Got reply 4:
    Element 0:
            | 0 |
    Element 1:
        n_1 | 0 | "zeus26.cyf-kr.edu.pl","Linux","#1 SMP Fri Jan 10 11:08:19 CET 2003","2.4.20"

 

Step Four (II): Monitoring MPI calls

Description

So far we have sent a couple of unconditional requests to the OCM-G, i.e., the result of each request was a single immediate response containing some information. Now we will see how to specify conditional requests to trace events occurring in the application. In this example we will track the MPI calls in the ping-pong application.

We assume, like in the steps above, that the tokens of the nodes are ‘n_1’ and ‘n_2’ and two running application processes are ‘p_21598_n_1’ and ‘p_29026_n_2’. We will trace calls to MPI_Send in the first process and MPI_Recv in the second one.

Your tasks

 

4. Attach your tool to the processes so that they can be monitored

ocm.5> :proc_attach([p_21598_n_1,p_29026_n_2])

 

 

 

 

 

Got reply 5:
    Element 0:
        | 0 |
    Element 1:
        | 0 |

5. Define a conditional request to trace MPI_Send calls in process ‘p_21598_n_1’

 

Got reply 6:
    Element 0:
                | 0 |
    Element 1:
        c_NDU_1 | 0 |

Note that this request is composed of two parts separated by a colon. The first part specifies the event that is of interest to us (in this case, the start of a call to MPI_Send in the specified process). The second part specifies the actions to be executed when this event occurs. In this case there is only one action, ‘print’, which returns a string and a time stamp to the OCM-G.

This request installs the conditional request and returns its token: ‘c_NDU_1’. The request is still disabled, though.
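The two-part structure of a request can be illustrated with a few lines of Python. This is only a sketch that splits a request at its first colon; it is not an OMIS parser, and the function names are hypothetical. It assumes, based on the requests shown in this exercise, that an unconditional request has an empty event part (the line starts with a colon) and that the event part itself contains no colon. The sample strings are requests that appear elsewhere in this exercise.

```python
from typing import Tuple


def split_request(request: str) -> Tuple[str, str]:
    """Split a request into (event part, action part) at the first colon."""
    event, separator, actions = request.partition(":")
    if not separator:
        raise ValueError("a request must contain a colon: %r" % request)
    return event.strip(), actions.strip()


def is_conditional(request: str) -> bool:
    """A request with a non-empty event part is treated as conditional."""
    return split_request(request)[0] != ""


if __name__ == "__main__":
    samples = [
        ':csr_enable([c_NDU_3])',
        'thread_has_started_lib_call([p_26164_n_c1924b12],"ocmg_probe1")'
        ' : print([$time,$par0])',
    ]
    for text in samples:
        print(is_conditional(text), split_request(text))
```

For the first sample the event part is empty, so the request is executed immediately; for the second, the action part is executed each time the event occurs.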

 

 

6. Define a conditional request to trace MPI_Recv calls in process ‘p_29026_n_2’

 

 

Got reply 7:
    Element 0:
                | 0 |
    Element 1:
        c_NDU_2 | 0 |

 

Note the token returned for this conditional request. As in the previous case, the request is disabled.

 

7. To enable both conditional requests, type the following request:

ocm.8> :csr_enable([c_NDU_1,c_NDU_2])

 

 

 

 

Got reply 8:
    Element 0:
        | 0 |
    Element 1:
        | 0 |

From now on you should observe the periodically incoming results of the ‘print’ action.

 

8. Disable the conditional requests.

When you have finished watching the output from the monitoring system, you can disable the conditional requests with ‘csr_disable’. You can also quit the console tool by typing ‘quit’.

9. Sample use of probes (optional).

Using probes you can measure application-specific metrics. In the example below, we show how to determine the progress of the application on the basis of a probe inserted at the end of the application’s main loop. The name of the probe is ocmg_probe1.

 

1. Connect to the application, then obtain the list of processes and attach to at least one of them. If you followed the previous part of the exercise (paragraphs 1-8), you can skip to the next step.

ocm.1> :app_attach2("ping_p")

ocm.2> :app_get_proclist([app_1])

ocm.3> :node_attach([n_c1924b12])

ocm.4> :proc_attach([p_26164_n_c1924b12])

 

 

 

 

 

2. Define an action to be taken whenever the application executes the probe (type it as a single line):

ocm.5> thread_has_started_lib_call([p_26164_n_c1924b12],"ocmg_probe1") : print([$time,$par0])

 

 

 

 

As a result you will receive a CSR token, e.g., c_NDU_3. Now you need to enable it:

ocm.6> :csr_enable([c_NDU_3])

 

 

 

 

From this moment on, you should start receiving responses from the OCM-G whenever a probe is executed. A response should look as follows:

ocm.17> Got reply 10:
    Element 0:
        p_22380_n_d4570266 |  10 |
    Element 1:
                           |   0 | 2,[1102691387.57487798,709]

The square brackets contain two values: the timestamp and the parameter passed by the probe, which in this case is an iteration counter. Notice that the timestamp increases by 10 in each step.

Probes can also pass more values, and values of other types (including floating point), to the monitoring system from any place in the application.
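To post-process probe output outside the console tool, the data field of each reply can be parsed into numbers. The following Python sketch handles the sample field shown above. The function names are hypothetical, and the conversion to a calendar date assumes that the timestamp is expressed in seconds since the Unix epoch, which this exercise does not state explicitly.

```python
import re
from datetime import datetime, timezone
from typing import Tuple

# Matches fields such as "2,[1102691387.57487798,709]".
_FIELD = re.compile(r"\s*\d+\s*,\s*\[\s*([^,\]]+),\s*([^,\]]+)\]\s*")


def parse_probe_field(field: str) -> Tuple[float, int]:
    """Return (timestamp, counter) from the data field of a probe reply."""
    match = _FIELD.fullmatch(field)
    if match is None:
        raise ValueError("unexpected probe field: %r" % field)
    return float(match.group(1)), int(match.group(2))


def as_utc(timestamp: float) -> datetime:
    """Interpret the timestamp as Unix epoch seconds (an assumption)."""
    return datetime.fromtimestamp(timestamp, tz=timezone.utc)


if __name__ == "__main__":
    stamp, counter = parse_probe_field("2,[1102691387.57487798,709]")
    print(counter, as_utc(stamp).isoformat())
```

Collecting the (timestamp, counter) pairs of successive replies gives a simple progress curve for the application's main loop.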

 

 

 

Step Five: Shutting down the OCM-G

Your tasks

To shut down the monitoring system, type:

 

Do not forget to replace the connection string with the proper one!

 

The monitoring system is now shut down. Please note that your application processes are still running.

 
