Low overhead, high performance network resource monitoring for distributed real-time applications “Buzzard”

Fault tolerance of distributed applications can be improved by independent, rapid detection of equipment and/or software failure. Ibridge has developed a Computer Failure Response and Notification System (referred to as “Buzzard”) to provide this function with minimal cost overhead.

Buzzard was built in part as a response to the summary findings of the joint Canadian / American committee studying the blackout of August 14, 2003.

Buzzard can be incorporated into a distributed application for improved reliability, resiliency and fault-tolerance of the overall design. It can also be used as a standalone application that will log events in a distributed environment and notify personnel as appropriate when anomalies occur.

Buzzard is built on top of the Ibridge GPPC protocol and borrows elements from the Ibridge HTS distributed application for time synchronization. Buzzard provides a reliable, distributed service for resource state monitoring in mixed operating system environments.

Reliability, Resiliency and Fault-Tolerance for distributed real-time applications

A process group is a set of processes that can be treated as a single entity for some purposes. In some distributed applications there is only one process group, which implicitly contains all processes; in others, programmers can assign processes to groups statically when configuring their program, or dynamically by having processes create, join and leave groups during execution. Accurate and timely resource state data is critical to the development of these applications.
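
The static and dynamic membership described above can be illustrated with a minimal sketch. It is not part of Buzzard or the GPPC protocol; the class name `ProcessGroup`, its methods and the process identifiers are hypothetical names chosen for illustration only.

```python
class ProcessGroup:
    """Hypothetical sketch: a named set of processes treated as one entity."""

    def __init__(self, name, members=()):
        self.name = name
        # Static assignment: members supplied when the program is configured.
        self._members = set(members)

    def join(self, process_id):
        # Dynamic assignment: a process joins during execution.
        self._members.add(process_id)

    def leave(self, process_id):
        # Leaving a group the process is not in is treated as a no-op.
        self._members.discard(process_id)

    def members(self):
        return sorted(self._members)

    def broadcast(self, message, send):
        # Treat the group as a single destination: deliver to every member.
        # `send` is a caller-supplied function (process_id, message) -> None.
        for process_id in self.members():
            send(process_id, message)


group = ProcessGroup("monitors", members=["p1", "p2"])  # static configuration
group.join("p3")                                        # dynamic join
group.leave("p1")                                       # dynamic leave

delivered = []
group.broadcast("status?", lambda pid, msg: delivered.append((pid, msg)))
```

After these calls the group contains `p2` and `p3`, and the broadcast reaches exactly those two processes.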

The distribution of an application across resources need not be symmetrical: during operation, one process may bear more of the communication or computation burden than another. It is useful to develop a distributed application by envisioning it as a process group, or as a process executing within a process group. Two process group communication models are frequently used: the publisher-subscriber model and the client-server model.
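
As a rough illustration of the publisher-subscriber model, the following in-process sketch routes events by topic. It assumes a single process and synchronous callbacks; the `Broker` class and the topic name are hypothetical and do not describe how Buzzard exchanges messages.

```python
from collections import defaultdict


class Broker:
    """Hypothetical in-process publisher-subscriber hub."""

    def __init__(self):
        # Maps a topic name to the callbacks subscribed to it.
        self._subscribers = defaultdict(list)

    def subscribe(self, topic, callback):
        self._subscribers[topic].append(callback)

    def publish(self, topic, event):
        # The publisher does not know who the subscribers are;
        # the broker fans the event out and reports how many received it.
        callbacks = list(self._subscribers.get(topic, ()))
        for callback in callbacks:
            callback(event)
        return len(callbacks)


broker = Broker()
received_a, received_b = [], []
broker.subscribe("node-state", received_a.append)
broker.subscribe("node-state", received_b.append)

count = broker.publish("node-state", {"node": "alpha", "state": "down"})
unheard = broker.publish("db-state", {"db": "orders", "state": "up"})
```

In the client-server model, by contrast, each client addresses a known server directly and waits for its reply rather than receiving events it has subscribed to.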

Communication within a group may need to be reliable. Real-time video and audio transmission are cases where reliable delivery is not a concern: reliable communication is of little use there unless strict real-time guarantees can also be placed on the messages, so that the quality and latency desired by the application can be achieved. Many applications, however, require that their messages be sent reliably. Reliability is especially important where transactions and global state information are concerned.

In the case of a failure, resiliency may be desired. Resiliency aims to ensure that the application's function is still performed under certain failure conditions. This concept goes hand in hand with fault tolerance, which aims to ensure that under certain failure conditions the system can recover transparently to the application and proceed. Together these concepts provide a system model that is robust in the face of failures and able to recover from them.

Buzzard Features

  • Provides resource monitoring in environments where a large “enterprise style” network management package is not practical.
  • Supports monitoring across different platforms such as Unix, Linux, Windows and even OpenVMS.
  • Designed to provide a simple, effective means of exchanging time-critical machine state data among large numbers of heterogeneous nodes.
  • Designed to monitor node up/down status, relational database up/down status and process up/down status. Additional monitoring functions can be added with relative ease.
  • Designed with distributed application program developers in mind.
  • The Buzzard real-time state data can be viewed by a user or can be accessed from an application program.
  • Historical state events are logged with accurate timestamps and tracked. These events can be viewed with a user-friendly interface (developed in Java for cross-platform support).
  • Designed for high availability systems and can provide state data from multiple application nodes to two redundant servers.
  • Efficient propagation of state data, with minimal latency for each individual message, was of paramount concern in the design.
  • The protocol also supports highly efficient information flow from the client back to the server in order to support customized implementations relating to network resource management and improved resource utilization.
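
The up/down monitoring and time-stamped event logging listed above can be pictured with a small heartbeat-timeout sketch. This is one common way to detect failures and is an assumption made for illustration; the source does not state how Buzzard detects state changes. The `HeartbeatMonitor` class, the resource names and the 3-second timeout are all hypothetical.

```python
import time


class HeartbeatMonitor:
    """Hypothetical sketch: mark a resource down when its heartbeats stop."""

    def __init__(self, timeout_s=3.0, clock=time.monotonic):
        self.timeout_s = timeout_s
        self._clock = clock      # injectable so the sketch can be tested
        self._last_seen = {}     # resource -> time of last heartbeat
        self._state = {}         # resource -> "up" or "down"
        self.events = []         # log of (timestamp, resource, new_state)

    def heartbeat(self, resource):
        # Called whenever a node, database or process reports in.
        now = self._clock()
        self._last_seen[resource] = now
        self._set_state(resource, "up", now)

    def check(self):
        # Any resource silent for longer than the timeout is marked down.
        now = self._clock()
        for resource, seen in self._last_seen.items():
            if now - seen > self.timeout_s:
                self._set_state(resource, "down", now)
        return dict(self._state)

    def _set_state(self, resource, state, now):
        # Only state changes are logged, each with its timestamp.
        if self._state.get(resource) != state:
            self._state[resource] = state
            self.events.append((now, resource, state))


# Usage with a fake clock so the timings are deterministic.
fake_time = [0.0]
monitor = HeartbeatMonitor(timeout_s=3.0, clock=lambda: fake_time[0])
monitor.heartbeat("node-a")
monitor.heartbeat("db-1")
fake_time[0] = 2.0
monitor.heartbeat("node-a")   # node-a keeps reporting; db-1 goes silent
fake_time[0] = 4.0
states = monitor.check()
```

In this run `node-a` stays up because it reported two seconds before the check, while `db-1` has been silent for four seconds and is marked down. A real deployment would also need wall-clock timestamps synchronized across nodes for the event log, which this sketch does not attempt.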

Copyright Ibridge Inc., 2010