To support some of the main future use cases of Vitrage, including full HA support, alarm history and RCA history, we will need to make some architectural changes.
This document contains the required use cases and a high level design for supporting them.
Vitrage should have full HA support. There are several aspects that should be considered, as described in the use cases below.
Note: This use case covers only the RCA information. See also the next use case, ‘Alarm History’.
At the moment, Vitrage shows Root Cause Analysis only for alarms that are currently triggered. We would like Vitrage to also include information about alarms that have already been disabled.
An example: if a host is down, then an instance on it is down, as well as an application running on that instance. Later on, the host problem might be fixed, but the application might not recover automatically. The cloud operator should be aware that the alarm on the application resulted from the alarm on the host (even though the host alarm no longer exists).
Vitrage should keep alarm history for a specified period of time. This history can be used for analytics or machine learning purposes, as well as to show the user statistics about the alarms in the cloud.
Note: This use case is of a lower priority and is not addressed by the current design. It can be implemented in the future by adding tables with alarm information to a relational database.
Vitrage should perform well under load. To support this, we might want to introduce a persistent graph database as an alternative to the current in-memory implementation based on NetworkX.
There are several aspects to this decision. For now, we believe that an in-memory graph database will be faster, so this use case does not require introducing a persistent graph database.
The in-memory NetworkX graph can work well with XXX vertices. To support a bigger entity graph, we will have to switch to a persistent graph database.
The Vitrage entity graph must remain consistent even if Vitrage is down. Note that this is usually the case with the current implementation, since the entity graph is recalculated after every restart. The only exception is the collectd datasource, which does not have a ‘get all’ implementation and works only by notifications; as a result, after Vitrage recovers we won’t have the alarms that were previously reported by collectd.
The datasource drivers will be responsible for periodically querying the external datasources for all of their resources/alarms. They are already separated from the vitrage-graph process and run in their own processes. Upon failure of a datasource driver, another driver process will take over calling the ‘get all’ method. A certain delay in this call is not critical, since by default it is called only every 10 minutes.
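A rough sketch of this flow is shown below. The driver.get_all() and bus.publish() interfaces and the topic name are hypothetical placeholders for illustration, not actual Vitrage APIs::

    import time

    GET_ALL_INTERVAL = 600  # seconds; by default 'get all' runs every 10 minutes

    def run_driver(driver, bus):
        """Periodically query the external datasource for all of its
        resources/alarms, and push the resulting events onward."""
        while True:
            for event in driver.get_all():
                bus.publish('vitrage.driver.events', event)
            time.sleep(GET_ALL_INTERVAL)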
The service listeners will be responsible for getting notifications from the OpenStack message bus (RabbitMQ1), enriching them and passing them on to the processors. Upon failure, the notifications will remain in the message bus until another service listener gets them.
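One way to implement such a listener is with oslo.messaging’s notification listener API, as in the following sketch. The enrich and publish callbacks, the pool name and the topic are assumptions for illustration. Running several listeners in the same pool means each notification is delivered to exactly one of them, and remains queued if that listener dies::

    from oslo_config import cfg
    import oslo_messaging

    class VitrageNotificationEndpoint(object):
        def __init__(self, enrich, publish):
            self._enrich = enrich    # datasource-specific enrich logic (assumed)
            self._publish = publish  # pushes the enriched event onward (assumed)

        def info(self, ctxt, publisher_id, event_type, payload, metadata):
            # Enrich the raw OpenStack notification and pass it on
            self._publish(self._enrich(event_type, payload))

    def start_listener(enrich, publish):
        transport = oslo_messaging.get_notification_transport(cfg.CONF)
        targets = [oslo_messaging.Target(topic='notifications')]
        listener = oslo_messaging.get_notification_listener(
            transport, targets, [VitrageNotificationEndpoint(enrich, publish)],
            pool='vitrage-listeners')
        listener.start()
        return listener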
The current multi-processing queue between the datasource drivers and the processor will be replaced with RabbitMQ. That way, if a processor fails, the events will be kept in RabbitMQ until they are processed by another processor.
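The sketch below illustrates this behavior using kombu (chosen here only for illustration; the actual transport library is an implementation detail), with an assumed queue name. Acknowledging an event only after it was fully processed means that events held by a crashed processor are redelivered to another one::

    from kombu import Connection, Queue

    events_queue = Queue('vitrage.processor.events')  # assumed queue name

    def run_processor(broker_url, handle_event):
        def on_event(body, message):
            handle_event(body)  # e.g. pass the event to the transformer
            message.ack()       # remove from the queue only once processed

        with Connection(broker_url) as conn:
            with conn.Consumer(events_queue, callbacks=[on_event]):
                while True:
                    conn.drain_events()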
Events will arrive at RabbitMQ2 after the filter/enrich phase (done either by the datasource driver or by the service listener). The processor will pass the events to the transformer, as is done today.
The persister process will also listen to RabbitMQ2 (on a different topic) and will asynchronously write the events to a relational database. All events will be stored after the filter/enrich phase. The first version will support MariaDB; other databases can be supported in the future if needed.
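A minimal sketch of what the persister might store is shown below, using SQLAlchemy. The table layout and column names are assumptions; the actual schema will be defined during implementation::

    import datetime

    from sqlalchemy import Column, DateTime, Integer, String, Text, create_engine
    from sqlalchemy.orm import declarative_base, sessionmaker

    Base = declarative_base()

    class Event(Base):
        __tablename__ = 'events'  # assumed table name

        id = Column(Integer, primary_key=True, autoincrement=True)
        timestamp = Column(DateTime, default=datetime.datetime.utcnow)
        event_type = Column(String(64))
        payload = Column(Text)  # the post-enrich event, e.g. serialized as JSON

    def make_session(db_url='mysql+pymysql://user:password@localhost/vitrage'):
        engine = create_engine(db_url)
        Base.metadata.create_all(engine)
        return sessionmaker(bind=engine)()

    def persist_event(session, event_type, payload):
        # Called for every event consumed from the persister topic
        session.add(Event(event_type=event_type, payload=payload))
        session.commit()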
The processor will be responsible for exporting the NetworkX graph as a snapshot into MariaDB whenever it is convenient (i.e. when it is not busy handling events). The snapshot frequency should be determined by a combination of the time that has passed and the number of events that have arrived since the last snapshot.
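The decision logic might look like the following sketch; the thresholds are arbitrary placeholders, not decided values::

    import time

    SNAPSHOT_MAX_AGE = 600      # assumed: snapshot at least every 10 minutes
    SNAPSHOT_MAX_EVENTS = 1000  # assumed: or after 1000 events, whichever is first

    class SnapshotPolicy(object):
        def __init__(self):
            self._last_time = time.time()
            self._events_since = 0

        def on_event(self):
            self._events_since += 1

        def should_snapshot(self):
            # Snapshot if enough time passed or enough events arrived
            return (time.time() - self._last_time > SNAPSHOT_MAX_AGE or
                    self._events_since > SNAPSHOT_MAX_EVENTS)

        def mark_done(self):
            self._last_time = time.time()
            self._events_since = 0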
Reconstructing the graph from the historic data will be controlled by the processor, and will be used in two cases: answering an RCA history query, and initializing a standby processor after a failover (both are described below).
The first phase of the graph reconstruction will be to identify the relevant snapshot in MariaDB and import it. The second phase will be to replay all of the events that happened from the snapshot time until the desired time of the graph reconstruction. The replay will be done by pushing the relevant events to RabbitMQ2, as if they had arrived from the datasource drivers or from the service listeners.
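Putting the two phases together, the reconstruction might look like the following sketch, where db and bus are hypothetical placeholder interfaces and the topic name is an assumption::

    def reconstruct_graph(db, bus, target_time):
        # Phase 1: import the latest snapshot taken before target_time
        snapshot = db.get_latest_snapshot(before=target_time)
        graph = snapshot.load_networkx_graph()

        # Phase 2: replay the stored events from the snapshot time up to
        # target_time, as if they had just arrived from the drivers or
        # the service listeners
        for event in db.get_events(since=snapshot.time, until=target_time):
            bus.publish('vitrage.replay.events', event)

        return graph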
In order to support the RCA history use case, we will have to reconstruct the graph on a separate graph instance, using a different RabbitMQ, while keeping the current active graph intact.
In general, each component will manage its own HA. A specific implementation is required for the processor process: if it fails, a standby will take over. The standby will not be initialized from scratch; instead, it will be initialized using the graph reconstruction described above, i.e. by importing the latest snapshot and replaying the events that followed it.
TBD: While the processor was down, the persister kept storing events in the database. When the standby processor takes over, the desired behavior is that it handles only the events that were not yet processed. We need a way to determine which events were processed and which were not. This is relevant for the Reliable Notification feature that has been discussed in the past, and will be handled as part of the implementation of that feature.
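One possible direction, shown here only as an illustration (nothing in this area has been decided), is to give every event a monotonically increasing id and to store the id of the last processed event together with each snapshot. The interfaces and the last_event_id field below are assumptions::

    def replay_after_failover(db, handle_event):
        snapshot = db.get_latest_snapshot()          # placeholder interface
        last_processed_id = snapshot.last_event_id   # assumed snapshot field
        for event in db.get_events(after_id=last_processed_id):
            handle_event(event)  # only events that were never processed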
Short-term RCA history (~1 day long) can be implemented with the current architecture.
Implementation tasks:
In order to query RCA for a longer period back in history, we will reconstruct the graph for the requested point in time, using the snapshot-and-replay mechanism described above, on a separate graph instance.
Alarm history will be implemented in the future, probably based on new information that will be stored in the database.
Performance is not affected by this architectural change. Whether a persistent graph DB should be used will be discussed in a separate document.
Scalability will require a persistent and distributed graph DB. Replacing the graph DB should have no effect on the overall architectural change.
Full consistency will be achieved by the new architecture, since every unprocessed notification will be stored in RabbitMQ, and every processed notification will be stored as an event in MariaDB.
The service listeners do very little: they call a single enrich method and pass the event on to RabbitMQ2, so they do not have to run in separate processes. The problem is that if we moved this code into the processor processes, the processor would have two different sources of information: events arriving from the datasource drivers through RabbitMQ2, and notifications arriving directly from the OpenStack message bus.
The processor could handle this situation; the problem is with the persister. We would like the persister to store only events after the filter/enrich phase, and the easiest way to achieve this is to have all of the events pushed to RabbitMQ2.