IANOS - Efficient Use of HPC Grid Resources

IANOS solves the nontrivial problem of scheduling applications within a Grid of high-performance computing (HPC) machines, through a novel approach of combined cost and execution time evaluation based on job requirements, application and resource characteristics, and historical data. These algorithms are the backbone of a service-based framework that automatically finds the resources best suited to a certain application request, and negotiates the specific Quality of Service demands with the respective resource providers. For that purpose, IANOS' inherent negotiation framework uses standard service-level agreements, a mechanism suitable for both academic and industrial purposes.

The value of IANOS for users (or customers) is evident: they need not deal with the peculiarities of the underlying Grid system and its topology, but merely define their job-specific requirements. These are processed by IANOS to schedule the job and to generate the Grid middleware-specific job descriptions. The latter are realized through an adapter customized per middleware, since IANOS in general is middleware-agnostic, a fact that minimizes the overhead of its deployment.

A number of Grid schedulers currently exist, including some that are application-oriented. They serve different needs, user groups and applications, apply different scheduling algorithms, and integrate (or not) with different Grid middlewares or infrastructures. IANOS is one such Grid scheduler, but it is also more than that. It comprises a number of components and suggests a particular modus operandi for application-based Grid scheduling, yet the system has been designed to match the requirements of a generic Grid-scheduling framework. IANOS:

is based on standards wherever possible. It currently implements the Web Services Resource Framework (WSRF), Job Submission Description Language (JSDL), WS-Agreement, and GLUE (Grid Laboratory for a Uniform Environment), and the IANOS team is cooperating closely with some of the standardization groups. In particular the work towards a generic scheduling architecture conducted by Open Grid Forum's 'Grid Scheduling Architecture Research Group' is worth mentioning.
applies novel scheduling methods for HPC Grids. These are developed within the CoreGRID Network of Excellence through close collaboration between application users, system administrators and Grid developers.
is designed to serve multiple application areas. The architecture is service-oriented, based on Web services, modular and as generic as possible. Workflow scheduling, co-allocation and additional applications are already on the research agenda.
is middleware-agnostic. The system has been designed so that it can be adapted to different middlewares with minimal effort. Currently a UNICORE (Uniform Interface to Computing Resources) adapter is available and a Globus adapter is under development.
is targeting production systems. Once the extensive tests have been finalized, IANOS will be used in production mode.
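
The middleware-agnostic design described above can be pictured as a thin adapter layer. The following Python sketch is only an illustration under stated assumptions: the class names, the job dictionary and both output formats are hypothetical placeholders, not the actual IANOS interfaces and not real UNICORE or Globus job syntax.

```python
import json
from abc import ABC, abstractmethod


class MiddlewareAdapter(ABC):
    """Translates a middleware-neutral job request into one middleware's format."""

    @abstractmethod
    def render(self, job: dict) -> str:
        ...


class KeyValueAdapter(MiddlewareAdapter):
    # Placeholder output format (hypothetical); NOT real UNICORE syntax.
    def render(self, job: dict) -> str:
        return "\n".join(f"{key}={job[key]}" for key in sorted(job))


class JsonAdapter(MiddlewareAdapter):
    # Placeholder output format (hypothetical); NOT real Globus syntax.
    def render(self, job: dict) -> str:
        return json.dumps(job, sort_keys=True)


# Registry of available adapters; supporting a new middleware means adding one entry.
ADAPTERS = {"keyvalue": KeyValueAdapter(), "json": JsonAdapter()}


def describe_job(job: dict, middleware: str) -> str:
    """The scheduling core stays unchanged; only the adapter lookup varies."""
    return ADAPTERS[middleware].render(job)


if __name__ == "__main__":
    request = {"executable": "mxm", "nodes": 4}
    print(describe_job(request, "keyvalue"))
    print(describe_job(request, "json"))
```

In such a layout the scheduler never needs to know which middleware runs on the target site, which is one plausible way to keep the adaptation effort small.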

The scientific work done within CoreGRID involves the research groups on Grid Scheduling Architecture, Integration of Intelligent Scheduling Service (ISS) into the VIOLA (Vertically Integrated Optical Testbed for Large Applications)/MetaScheduling Environment, performance prediction using the Gamma model, and service-level agreements for resource management and scheduling. In addition, IANOS incorporates results from the collaboration with the SwissGrid Association SWING and from the mutual exchange of results with several standardization groups at the Open Grid Forum. Although the focus is on Grid scheduling, areas such as business-oriented Grids, security, information management and accounting are also of great importance. The core components of IANOS are as follows:

(1) The broker of the IANOS middleware uses two models: the cost function model and the execution time evaluation model. Both are based on a parameterization of the applications and resources. The cost function model calculates a cost value for each candidate resource. The execution time evaluation model forecasts the execution time of a given application on a given resource; the prediction is based on knowledge of how the application performs on the CPU nodes. The broker itself relies on information coming from an information service and diverse monitoring services.
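
How the two models might work together can be sketched in a few lines of Python. This is a deliberately simplified illustration, not the IANOS broker: the data classes, the runtime formula (work divided by sustained node performance), the weighted-sum cost and all parameter names are assumptions made for the example.

```python
from dataclasses import dataclass


@dataclass
class Resource:
    name: str
    sustained_gflops: float  # assumed sustained performance for this application class
    price_per_hour: float    # assumed charge per wall-clock hour


@dataclass
class Application:
    name: str
    work_gflop: float        # assumed total work, e.g. estimated from historical runs


def predict_runtime_hours(app: Application, res: Resource) -> float:
    """Toy execution time model: total work divided by sustained performance."""
    return app.work_gflop / res.sustained_gflops / 3600.0


def cost(app: Application, res: Resource,
         time_weight: float = 1.0, money_weight: float = 1.0) -> float:
    """Toy cost function: weighted sum of predicted runtime and monetary charge."""
    hours = predict_runtime_hours(app, res)
    return time_weight * hours + money_weight * hours * res.price_per_hour


def rank(app: Application, resources: list, **weights) -> list:
    """Return candidate resources ordered from lowest to highest cost value."""
    return sorted(resources, key=lambda res: cost(app, res, **weights))


if __name__ == "__main__":
    job = Application("mxm", work_gflop=3.6e6)
    sites = [Resource("fast", 100.0, 2.0), Resource("cheap", 50.0, 0.2)]
    print([r.name for r in rank(job, sites)])
    print([r.name for r in rank(job, sites, time_weight=10.0)])
```

Changing the weights shifts the ranking between the faster and the cheaper machine, which is the kind of trade-off a combined cost and execution time evaluation is meant to expose.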

(2) The MetaScheduling Service (MSS) applies multi-level scheduling strategies with interaction between local resource management systems and higher-level Grid-scheduling entities. Collaboration of Grid-level schedulers with local resource management systems in a heterogeneous environment raises a number of issues, such as determining the level of interaction, coping efficiently with heterogeneity, negotiating with entities that in general do not support negotiation, and deciding which negotiation protocol to use. Here the IANOS team collaborates closely with the Grid Scheduling Architecture Research Group of the Open Grid Forum.

(3) Service-Level Agreements (SLAs) serve as electronic contracts that formalize the level of service upon which the user and the MetaScheduling Service agree. IANOS implements an SLA framework based on the WS-Agreement standard. Since WS-Agreement also serves as the foundation for a number of business-oriented Grid systems, IANOS may also be applied in a business context.

Figure 1: IANOS architecture.

IANOS was recently demonstrated at Open Grid Forum 23 in Barcelona. The demonstration showed the framework in action, spanning three heterogeneous HPC resources at three sites residing in three different administrative domains: Ecole Polytechnique Fédérale de Lausanne, Fraunhofer SCAI and Dortmund University of Technology. Each site has the necessary IANOS components and applications installed, including:

MXM for matrix-matrix multiplication. This core component of many scientific applications is dominated by CPU speed requirements.
MXV for sparse matrix times vector multiplication. MXV is dominated by memory bandwidth.
SpecuLOOS, a spectral element solver for the Navier-Stokes equations in 3D. It runs in parallel and is dominated by point-to-point communication.
LBM-Solver, a Lattice Boltzmann Method solver in 3D. It is dominated by multicast communication.
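
Why MXM is limited by CPU speed while MXV is limited by memory bandwidth can be seen from a back-of-the-envelope count of floating-point operations per byte moved. The Python sketch below is an idealized estimate under stated assumptions: dense n x n double-precision matrices that are each transferred exactly once for MXM, and a sparse format storing an 8-byte value plus a 4-byte column index per nonzero for MXV, with vector traffic ignored. The function names are illustrative only.

```python
def mxm_intensity(n: int, word_bytes: int = 8) -> float:
    """Flops per byte for dense C = A * B with n x n matrices (idealized)."""
    flops = 2 * n**3                     # one multiply and one add per inner-product term
    bytes_moved = 3 * n * n * word_bytes  # read A and B, write C, each exactly once
    return flops / bytes_moved


def spmv_intensity(word_bytes: int = 8, index_bytes: int = 4) -> float:
    """Upper bound on flops per byte for sparse matrix times vector."""
    flops_per_nonzero = 2                         # one multiply and one add
    bytes_per_nonzero = word_bytes + index_bytes  # matrix value plus column index
    return flops_per_nonzero / bytes_per_nonzero


if __name__ == "__main__":
    print(mxm_intensity(1200))   # grows linearly with n
    print(spmv_intensity())      # constant, independent of matrix size
```

Under these assumptions the dense product performs more work per byte as the matrices grow, whereas the sparse product stays at a small constant ratio, so its speed is set by how fast memory can deliver data.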

Links:
http://www.ianos.org
http://www.coregrid.eu

Please contact:
Philipp Wieder
Dortmund University of Technology, Germany
Tel: +49 231 755 2767
E-mail: philipp.wieder@udo.edu