There are several use cases that require support for either alarm equivalence or resource equivalence. The design of these features is in progress, and is not trivial. The purpose of this document is to define the basic requirements and use cases that should be supported, regardless of the implementation that will be selected later on.
The term “equivalence” denotes resources or alarms that are considered “equal” even though they are reported by different datasources and some of their properties might conflict. Alternative terms could be equality, merge, overlapping, etc.
We currently have two use cases for resource equivalence.
Both cases could perhaps be solved with hard-coded logic in the datasources themselves. This option should be checked against the use cases.
We should support the following use cases:
In order to support these use cases, we must define a way for the user to determine which entities are equivalent.
For resources we should define:
For alarms we should define:
Equivalence should be transitive. If the user defines two equivalences with a common entity, then all entities should be equivalent to one another.
For example:
Vitrage will handle Zabbix, Nagios and Prometheus CPU alarms as all equivalent to one another.
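The transitivity requirement can be illustrated with a small union-find structure. This is only a sketch and not the selected implementation: the class `EquivalenceRegistry`, its method names, and the Prometheus alarm name `cpu_high` are hypothetical, chosen for illustration.

```python
from collections import defaultdict


class EquivalenceRegistry:
    """Groups entity keys into equivalence classes (union-find)."""

    def __init__(self):
        self._parent = {}

    def _find(self, key):
        # An unknown key starts out as the root of its own class.
        self._parent.setdefault(key, key)
        root = key
        while self._parent[root] != root:
            root = self._parent[root]
        # Path compression: point every visited key directly at the root.
        while self._parent[key] != root:
            self._parent[key], key = root, self._parent[key]
        return root

    def add_equivalence(self, a, b):
        """Record one user-defined (or hard-coded) equivalence."""
        root_a, root_b = self._find(a), self._find(b)
        if root_a != root_b:
            self._parent[root_b] = root_a

    def are_equivalent(self, a, b):
        return self._find(a) == self._find(b)

    def classes(self):
        """Return all classes that contain more than one entity."""
        groups = defaultdict(set)
        for key in list(self._parent):
            groups[self._find(key)].add(key)
        return [group for group in groups.values() if len(group) > 1]


registry = EquivalenceRegistry()
# Two definitions that share the Nagios alarm as a common entity.
registry.add_equivalence(("zabbix", "high_cpu"), ("nagios", "HIGH_CPU"))
registry.add_equivalence(("nagios", "HIGH_CPU"), ("prometheus", "cpu_high"))

# Zabbix and Prometheus were never paired directly, yet they are equivalent.
print(registry.are_equivalent(("zabbix", "high_cpu"), ("prometheus", "cpu_high")))
```

With this structure, the order in which the definitions are added does not affect the resulting classes.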
Note: We must support both hard-coded and user-defined equivalence definitions.
There are different approaches for what information the user should see in case there is a conflict between two datasources. The user should be able to choose the desired “merge strategy” from the following options: last_update, most_credible, or worst_state.
The default, which is the current behavior, will be worst_state.
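A minimal sketch of how the three strategies could be applied to conflicting reports is shown below. It assumes that last_update picks the most recently updated report, most_credible picks the report of the datasource that ranks first in a user-supplied credibility list, and worst_state picks the highest severity. The `Report` class, the `aggregate_severity` function, the severity ordering, and the sample input are all hypothetical.

```python
from dataclasses import dataclass

# Assumed severity ordering, lowest to highest; not taken from the spec.
SEVERITY_ORDER = ["OK", "WARNING", "CRITICAL"]


@dataclass
class Report:
    datasource: str
    severity: str
    update_time: float  # seconds since the epoch


def aggregate_severity(reports, strategy="worst_state", credibility=()):
    """Return the severity shown to the user for equivalent reports."""
    if not reports:
        raise ValueError("no reports to merge")
    if strategy == "last_update":
        chosen = max(reports, key=lambda r: r.update_time)
    elif strategy == "most_credible":
        # Datasources missing from the credibility list rank last.
        rank = {name: i for i, name in enumerate(credibility)}
        chosen = min(reports, key=lambda r: rank.get(r.datasource, len(rank)))
    elif strategy == "worst_state":
        chosen = max(reports, key=lambda r: SEVERITY_ORDER.index(r.severity))
    else:
        raise ValueError(f"unknown merge strategy: {strategy}")
    return chosen.severity


# Hypothetical input: Nagios is considered more credible but reported first.
reports = [
    Report("nagios", "WARNING", update_time=100.0),
    Report("zabbix", "CRITICAL", update_time=200.0),
]
credibility = ("nagios", "zabbix")

for strategy in ("last_update", "most_credible", "worst_state"):
    print(strategy, aggregate_severity(reports, strategy, credibility))
```

The default argument mirrors the current behavior, worst_state. The same function shape would apply to resource states, given a state ordering in place of the severity ordering.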
Expected behavior: Vitrage API returns a single host
Similar to 1.a, but the discovery agent reports first
Expected behavior: There should be no change in what the API returns
Expected behavior: Vitrage API returns a single host with a state that depends on the merge strategy.
Merge Strategy | Aggregated state |
---|---|
last_update | ACTIVE |
most_credible | ERROR |
worst_state | ERROR |
Both vms are equivalent by the Nova UUID.
Expected behavior: Vitrage API will return a single instance. Its name will be determined by one of the datasources in a consistent way (meaning it will be either always the K8s name or always the Nova name).
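One way to make the name choice consistent is a fixed priority order over datasources, so that the result does not depend on which datasource reported first. The sketch below assumes such an order; the datasource type names, the instance names, and the function `merged_name` are hypothetical.

```python
# Hypothetical datasource type names and priority order; the spec only
# requires that the choice be consistent.
NAME_PRIORITY = ("nova.instance", "k8s")


def merged_name(names_by_datasource, priority=NAME_PRIORITY):
    """Pick the name reported by the highest-priority datasource present."""
    for datasource in priority:
        if datasource in names_by_datasource:
            return names_by_datasource[datasource]
    # Fall back to a deterministic choice for datasources not in the list.
    first = min(names_by_datasource)
    return names_by_datasource[first]


# The same instance, with the reports arriving in different orders.
k8s_first = {"k8s": "worker-node-1", "nova.instance": "vm-1"}
nova_first = {"nova.instance": "vm-1", "k8s": "worker-node-1"}

print(merged_name(k8s_first), merged_name(nova_first))
```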
Expected behavior:
Expected behavior: Vitrage API returns a single alarm with a severity that depends on the merge strategy.
Merge Strategy | Aggregated severity |
---|---|
last_update | WARNING |
most_credible | CRITICAL |
worst_state | CRITICAL |
Expected behavior: depends on the merge strategy.
Merge Strategy | Aggregated severity |
---|---|
last_update | OK (the alarm is deleted) |
most_credible | WARNING |
worst_state | WARNING |
Assume that the merge strategy is worst_state.
Expected behavior: Vitrage API returns a single alarm with severity CRITICAL
Expected behavior: Vitrage API returns two alarms
This use case is also detailed in https://review.openstack.org/#/c/547931/
Expected behavior: Vitrage API returns a single alarm with severity that depends on the merge strategy.
Merge Strategy | Aggregated severity |
---|---|
last_update | CRITICAL |
most_credible | WARNING |
worst_state | CRITICAL |
Expected behavior: depends on the merge strategy.
Merge Strategy | Aggregated severity |
---|---|
last_update | OK (the alarm is deleted) |
most_credible | OK (the alarm is deleted) |
worst_state | WARNING |
The behavior for worst_state strategy:
Expected behavior: Vitrage API returns a single alarm with properties from Nagios, Zabbix and Vitrage and severity that depends on the merge strategy.
Merge Strategy | Aggregated severity |
---|---|
last_update | WARNING |
most_credible | WARNING |
worst_state | CRITICAL |
Assume that the merge strategy is last_update.
Expected behavior: Vitrage API returns two alarms:
Note: Since in Rocky we are going to implement vitrage-graph start-up from the database, it makes no real difference whether or not the user restarts the graph after changing the equivalence definition.
Assume that the merge strategy is last_update.
Expected behavior: Vitrage API returns a single alarm with severity CRITICAL
Assume that Zabbix high_cpu alarm is equivalent to Nagios HIGH_CPU alarm.
Template example:
    definitions:
      entities:
        - entity:
            category: ALARM
            rawtext: high_cpu
            type: zabbix
            template_id: zabbix_alarm
    scenarios:
      - scenario:
          condition: zabbix_alarm_on_host
          actions:
            - ...
Expected behavior: the actions in the scenario are executed as a result of the Nagios alarm.
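The expected behavior can be sketched as a template-matching check that consults the equivalence definitions. This is an illustration only, not how the Vitrage evaluator is implemented: `EQUIVALENCE_CLASSES`, `alarm_key` and `matches_template` are hypothetical names, and the sketch assumes that Zabbix alarms are identified by `rawtext` and other alarms by `name`, as in the template examples.

```python
# Hypothetical equivalence definition: Zabbix high_cpu == Nagios HIGH_CPU.
EQUIVALENCE_CLASSES = [
    {("zabbix", "high_cpu"), ("nagios", "HIGH_CPU")},
]


def alarm_key(entity):
    """Identify an alarm by datasource type and by rawtext or name."""
    return (entity["type"], entity.get("rawtext") or entity.get("name"))


def matches_template(template_entity, alarm):
    """True if the alarm matches the template entity directly or through
    an equivalence definition."""
    if template_entity["category"] != alarm["category"]:
        return False
    template_key, key = alarm_key(template_entity), alarm_key(alarm)
    if template_key == key:
        return True
    return any(template_key in cls and key in cls for cls in EQUIVALENCE_CLASSES)


zabbix_template_entity = {
    "category": "ALARM", "type": "zabbix", "rawtext": "high_cpu",
}
nagios_alarm = {"category": "ALARM", "type": "nagios", "name": "HIGH_CPU"}
other_alarm = {"category": "ALARM", "type": "nagios", "name": "DISK_FULL"}

print(matches_template(zabbix_template_entity, nagios_alarm))
print(matches_template(zabbix_template_entity, other_alarm))
```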
Assume that Nova host is equivalent to Vitrage discovery agent host.
Template example:
    definitions:
      entities:
        - entity:
            category: RESOURCE
            type: nova.host
            template_id: nova_host
        - entity:
            category: RESOURCE
            type: discovery_host (???)
            template_id: discovery_host
    scenarios:
      - scenario:
          condition: discovery_host and discovery_host_contains_instance
          actions:
            - ...
Expected behavior: the scenario will work if the host contains an instance, no matter if the host is defined by Nova or by Vitrage discovery agent.
Assume that Zabbix high_cpu alarm is equivalent to Nagios HIGH_CPU alarm and Nova host is equivalent to Vitrage discovery agent host.
Template example:
    scenarios:
      - scenario:
          condition: discovery_host and discovery_host_contains_instance and zabbix_alarm_on_discovery_host
          actions:
            - ...
Expected behavior: the scenario will work if the host contains an instance, regardless of whether the host is defined by Nova or by the Vitrage discovery agent, and if either the Zabbix alarm or the Nagios alarm was raised on the host.
Assume that Zabbix high_cpu alarm is equivalent to Nagios HIGH_CPU alarm.
Template example:
    definitions:
      entities:
        - entity:
            category: ALARM
            rawtext: high_cpu
            type: zabbix
            severity: warning
            template_id: zabbix_alarm
        - entity:
            category: ALARM
            name: HIGH_CPU
            type: nagios
            template_id: nagios_alarm
    scenarios:
      - scenario:
          condition: zabbix_alarm_on_host
          actions:
            - ...
This use case is the same as 5.1, with one exception: the template entity zabbix_alarm is defined only for the case where the severity is warning. What will happen if a Nagios alarm is raised with severity warning? And what if it is raised with a different severity?
Is the overlapping templates mechanism somehow related to the equivalence use cases?
Entity equivalence should be defined for a specific tenant. One tenant may want to see Nagios and Zabbix alarms as one alarm, while the other tenant may want to see them separated.
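As a sketch of the per-tenant requirement, equivalence definitions could be keyed by tenant and applied when grouping alarms for the API response. The tenant IDs, the `TENANT_EQUIVALENCES` mapping and the `group_alarms` function below are hypothetical.

```python
# Hypothetical per-tenant definitions: tenant_a merges the two CPU alarms,
# tenant_b defines no equivalence and sees them separately.
TENANT_EQUIVALENCES = {
    "tenant_a": [frozenset({("zabbix", "high_cpu"), ("nagios", "HIGH_CPU")})],
    "tenant_b": [],
}


def group_alarms(alarm_keys, tenant_id):
    """Group alarm keys according to the tenant's equivalence classes."""
    classes = TENANT_EQUIVALENCES.get(tenant_id, [])
    groups = {}
    for key in alarm_keys:
        # An alarm outside every class forms a group of its own.
        cls = next((c for c in classes if key in c), frozenset({key}))
        groups.setdefault(cls, []).append(key)
    return list(groups.values())


raised = [("zabbix", "high_cpu"), ("nagios", "HIGH_CPU")]

print(len(group_alarms(raised, "tenant_a")))  # one merged alarm
print(len(group_alarms(raised, "tenant_b")))  # two separate alarms
```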
Is it possible that equivalent resources will be reported on different tenants?
What do we do in such a case?
Except where otherwise noted, this document is licensed under Creative Commons Attribution 3.0 License.