Enable Health Alerts for this Role |
When set, Cloudera Manager sends alerts when the health of this role reaches the threshold specified by the EventServer setting eventserver_health_events_alert_threshold. |
|
true |
enable_alerts
|
false |
Enable Configuration Change Alerts |
When set, Cloudera Manager will send alerts when this entity's configuration changes. |
|
false |
enable_config_alerts
|
false |
Heap Dump Directory Free Space Monitoring Absolute Thresholds |
The health test thresholds for monitoring of free space on the filesystem that contains this role's heap dump directory. |
|
Warning: 10 GiB, Critical: 5 GiB |
heap_dump_directory_free_space_absolute_thresholds
|
false |
Heap Dump Directory Free Space Monitoring Percentage Thresholds |
The health test thresholds for monitoring of free space on the filesystem that contains this role's heap dump directory. Specified as a percentage of the capacity on that filesystem. This setting is not used if a Heap Dump Directory Free Space Monitoring Absolute Thresholds setting is configured. |
|
Warning: Never, Critical: Never |
heap_dump_directory_free_space_percentage_thresholds
|
false |
Log Directory Free Space Monitoring Absolute Thresholds |
The health test thresholds for monitoring of free space on the filesystem that contains this role's log directory. |
|
Warning: 10 GiB, Critical: 5 GiB |
log_directory_free_space_absolute_thresholds
|
false |
Log Directory Free Space Monitoring Percentage Thresholds |
The health test thresholds for monitoring of free space on the filesystem that contains this role's log directory. Specified as a percentage of the capacity on that filesystem. This setting is not used if a Log Directory Free Space Monitoring Absolute Thresholds setting is configured. |
|
Warning: Never, Critical: Never |
log_directory_free_space_percentage_thresholds
|
false |
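The interaction between the absolute and percentage free-space thresholds above can be sketched as follows. This is a minimal illustration, not Cloudera Manager's implementation: the function name `free_space_health`, the health labels, the strict "less than" comparison, and the use of `None` to stand for "Never" are all assumptions made for the example.

```python
GIB = 1024 ** 3

def free_space_health(free_bytes, capacity_bytes, absolute=None, percentage=None):
    """Classify free space as GOOD, CONCERNING, or BAD (assumed labels).

    absolute:   (warning_bytes, critical_bytes) or None if not configured.
    percentage: (warning_pct, critical_pct) or None if not configured.
    A None entry inside a tuple stands for "Never".
    """
    if absolute is not None:
        # Percentage thresholds are ignored when absolute thresholds are set.
        warning, critical = absolute
        value = free_bytes
    elif percentage is not None:
        warning, critical = percentage
        value = 100.0 * free_bytes / capacity_bytes
    else:
        return "GOOD"
    # Less free space is worse, so compare from the critical side first.
    if critical is not None and value < critical:
        return "BAD"
    if warning is not None and value < warning:
        return "CONCERNING"
    return "GOOD"

# Default absolute thresholds from the table: Warning 10 GiB, Critical 5 GiB.
DEFAULT_ABSOLUTE = (10 * GIB, 5 * GIB)
print(free_space_health(8 * GIB, 100 * GIB, absolute=DEFAULT_ABSOLUTE))
```

With the default absolute thresholds, a filesystem with 8 GiB free falls between the warning and critical values, so the sketch reports the warning state.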
Rules to Extract Events from Log Files |
This file contains the rules that govern how log messages are turned into events by the custom log4j appender that this role loads. It is in JSON format and consists of a list of rules. Every log message is evaluated against each of these rules in turn to decide whether to send an event for that message. If a log message matches multiple rules, the first matching rule is used. Each rule has some or all of the following fields:
alert - whether events generated from this rule should be promoted to alerts. A value of "true" causes alerts to be generated. If not specified, the default is "false".
rate (mandatory) - the maximum number of log messages matching this rule that can be sent as events every minute. If more than rate matching log messages are received in a single minute, the extra messages are ignored. If rate is less than 0, the number of messages per minute is unlimited.
periodminutes - the number of minutes during which the publisher will only publish rate events or fewer. If not specified, the default is one minute.
threshold - apply this rule only to messages with this log4j severity level or above. An example is "WARN" for warning level messages or higher.
content - match only those messages whose contents match this regular expression.
exceptiontype - match only those messages that are part of an exception message. The exception type must match this regular expression.
Example: {"alert": false, "rate": 10, "exceptiontype": "java.lang.StringIndexOutOfBoundsException"}
This rule sends events to Cloudera Manager for every StringIndexOutOfBoundsException, up to a maximum of 10 every minute.
{"alert": false, "rate": 1, "periodminutes": 1, "exceptiontype": ".*"}, {"alert": true, "rate": 1, "periodminutes": 1, "threshold":"ERROR"}
In this example, an event generated from an ERROR log message may not be promoted to an alert if the message contains an exception, because the first rule, with alert = false, matches first.
|
|
{"version": "0", "rules": [
{"alert": false, "rate": 1, "periodminutes": 1, "threshold": "FATAL"},
{"alert": false, "rate": 0, "threshold": "WARN", "content": ".* is deprecated. Instead, use .*"},
{"alert": false, "rate": 0, "threshold": "WARN", "content": ".* is deprecated. Use .* instead"},
{"alert": false, "rate": 0, "exceptiontype": "java.io.IOException"},
{"alert": false, "rate": 0, "exceptiontype": "java.net.SocketException"},
{"alert": false, "rate": 0, "exceptiontype": "java.net.SocketClosedException"},
{"alert": false, "rate": 0, "exceptiontype": "java.io.EOFException"},
{"alert": false, "rate": 0, "exceptiontype": "java.nio.channels.CancelledKeyException"},
{"alert": false, "rate": 0, "threshold": "WARN", "content": "IPC Server handler.*took.*appending an edit to hlog.*"},
{"alert": false, "rate": 1, "periodminutes": 1, "threshold": "WARN", "content": "ABORTING region server serverName.*"},
{"alert": false, "rate": 1, "periodminutes": 1, "threshold": "WARN", "content": "DFSOutputStream ResponseProcessor exception.*"},
{"alert": false, "rate": 1, "periodminutes": 1, "threshold": "WARN", "content": "Error Recovery for block blk.*"},
{"alert": false, "rate": 1, "periodminutes": 1, "threshold": "WARN", "content": "Failed init"},
{"alert": false, "rate": 1, "periodminutes": 1, "threshold": "WARN", "content": "Problem renewing lease for DFSClient.*"},
{"alert": false, "rate": 1, "periodminutes": 1, "threshold": "WARN", "content": "remote error telling master we are up"},
{"alert": false, "rate": 1, "periodminutes": 1, "threshold": "WARN", "content": "Session.*for server.*closing socket connection and attempting reconnect."},
{"alert": false, "rate": 1, "periodminutes": 2, "exceptiontype": ".*"},
{"alert": false, "rate": 0, "threshold": "WARN", "content": "Error executing shell command .+ No such process.+"},
{"alert": false, "rate": 0, "threshold": "WARN", "content": ".*attempt to override final parameter.+"},
{"alert": false, "rate": 0, "threshold": "WARN", "content": "[^ ]+ is a deprecated filesystem name. Use.*"},
{"alert": false, "rate": -1, "threshold": "INFO", "content": "^Starting .*compaction on region (.+)$", "attribute:CATEGORY": "HBASE", "attribute:EVENTCODE": "EV_HBASE_COMPACTION_REGION_STARTED", "attribute:SEVERITY": "INFORMATIONAL", "group0": "REGION"},
{"alert": false, "rate": -1, "threshold": "INFO", "content": "^completed compaction on region (.+) after (.+)$", "attribute:CATEGORY": "HBASE", "attribute:EVENTCODE": "EV_HBASE_COMPACTION_REGION_COMPLETED", "attribute:SEVERITY": "INFORMATIONAL", "group0": "REGION", "group1": "DURATION"},
{"alert": false, "rate": -1, "threshold": "INFO", "content": "^Starting compaction on (.+) in region (.+)$", "attribute:CATEGORY": "HBASE", "attribute:EVENTCODE": "EV_HBASE_COMPACTION_COLUMN_FAMILY_STARTED", "attribute:SEVERITY": "INFORMATIONAL", "group0": "COLUMN_FAMILY", "group1": "REGION"},
{"alert": false, "rate": -1, "threshold": "INFO", "content": "^completed compaction: regionName\u003d(.+), storeName\u003d(.+), fileCount\u003d(.+), fileSize\u003d(.+), priority\u003d(.+), time\u003d(.+); duration\u003d(.+)$", "attribute:CATEGORY": "HBASE", "attribute:EVENTCODE": "EV_HBASE_COMPACTION_COLUMN_FAMILY_COMPLETED", "attribute:SEVERITY": "INFORMATIONAL", "group0": "REGION", "group1": "COLUMN_FAMILY", "group2": "FILE_COUNT", "group3": "FILE_SIZE", "group4": "PRIORITY", "group5": "REQUEST_TIME_NANOS", "group6": "DURATION"},
{"alert": false, "rate": -1, "threshold": "INFO", "content": "^Completed compaction: Request \u003d regionName\u003d(.+), storeName\u003d(.+), fileCount\u003d(.+), fileSize\u003d(.+), priority\u003d(.+), time\u003d(.+); duration\u003d(.+)$", "attribute:CATEGORY": "HBASE", "attribute:EVENTCODE": "EV_HBASE_COMPACTION_COLUMN_FAMILY_COMPLETED", "attribute:SEVERITY": "INFORMATIONAL", "group0": "REGION", "group1": "COLUMN_FAMILY", "group2": "FILE_COUNT", "group3": "FILE_SIZE", "group4": "PRIORITY", "group5": "REQUEST_TIME_NANOS", "group6": "DURATION"},
{"alert": false, "rate": -1, "threshold": "INFO", "content": "^aborted compaction: regionName\u003d(.+), storeName\u003d(.+), fileCount\u003d(.+), fileSize\u003d(.+), priority\u003d(.+), time\u003d(.+); duration\u003d(.+)$", "attribute:CATEGORY": "HBASE", "attribute:EVENTCODE": "EV_HBASE_COMPACTION_COLUMN_FAMILY_ABORTED", "attribute:SEVERITY": "IMPORTANT", "group0": "REGION", "group1": "COLUMN_FAMILY", "group2": "FILE_COUNT", "group3": "FILE_SIZE", "group4": "PRIORITY", "group5": "REQUEST_TIME_NANOS", "group6": "DURATION"},
{"alert": false, "rate": -1, "threshold": "INFO", "content": "^Finished memstore flush of .+ for region (.+) in (.+), sequenceid\u003d(.+), compaction requested\u003d(.+)$", "attribute:CATEGORY": "HBASE", "attribute:EVENTCODE": "EV_HBASE_FLUSH_COMPLETED", "attribute:SEVERITY": "INFORMATIONAL", "group0": "REGION", "group1": "DURATION", "group2": "SEQUENCE_ID", "group3": "COMPACTION_REQUESTED"},
{"alert": false, "rate": -1, "threshold": "INFO", "content": "^Flush of region (.+) due to global heap pressure$", "attribute:CATEGORY": "HBASE", "attribute:EVENTCODE": "EV_HBASE_FLUSH_DUE_TO_HEAP_PRESSURE", "attribute:SEVERITY": "IMPORTANT", "group0": "REGION"},
{"alert": false, "rate": -1, "threshold": "WARN", "content": "^Region (.+) has too many store files; delaying flush up to 90000ms$", "attribute:CATEGORY": "HBASE", "attribute:EVENTCODE": "EV_HBASE_FLUSH_DELAYED_TOO_MANY_STORE_FILES", "attribute:SEVERITY": "CRITICAL", "group0": "REGION"},
{"alert": false, "rate": -1, "threshold": "INFO", "content": "^Starting split of region (.+)$", "attribute:CATEGORY": "HBASE", "attribute:EVENTCODE": "EV_HBASE_SPLIT_STARTED", "attribute:SEVERITY": "INFORMATIONAL", "group0": "REGION"},
{"alert": false, "rate": -1, "threshold": "INFO", "content": "^Running rollback/cleanup of failed split of (.+);.+$", "attribute:CATEGORY": "HBASE", "attribute:EVENTCODE": "EV_HBASE_SPLIT_ABORTED", "attribute:SEVERITY": "IMPORTANT", "group0": "REGION"},
{"alert": false, "rate": -1, "threshold": "INFO", "content": "^Region split, hbase:meta updated, and report to master. Parent\u003d(.+), new regions: (.+, .*, .+), (.+, .*, .+). Split took (.+)$", "attribute:CATEGORY": "HBASE", "attribute:EVENTCODE": "EV_HBASE_SPLIT_COMPLETED", "attribute:SEVERITY": "INFORMATIONAL", "group0": "REGION", "group1": "DAUGHTER_REGIONS", "group2": "DAUGHTER_REGIONS", "group3": "DURATION"},
{"alert": false, "rate": -1, "threshold": "INFO", "content": "^Region split, META updated, and report to master. Parent\u003d(.+), new regions: (.+, .*, .+), (.+, .*, .+). Split took (.+)$", "attribute:CATEGORY": "HBASE", "attribute:EVENTCODE": "EV_HBASE_SPLIT_COMPLETED", "attribute:SEVERITY": "INFORMATIONAL", "group0": "REGION", "group1": "DAUGHTER_REGIONS", "group2": "DAUGHTER_REGIONS", "group3": "DURATION"},
{"alert": false, "rate": 1, "periodminutes": 1, "threshold": "WARN"}
]}
|
log_event_whitelist
|
false |
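The rule evaluation described for log_event_whitelist (rules tried in order, first match wins, rate caps events per period) can be sketched as below. This is an illustrative approximation, not the appender's actual code: the helper names `rule_matches`, `first_matching_rule`, and `within_rate` are hypothetical, and the use of `re.search` (rather than a full-string match) for both `content` and `exceptiontype` is an assumption.

```python
import re

# log4j severity levels, lowest to highest.
LEVELS = ["TRACE", "DEBUG", "INFO", "WARN", "ERROR", "FATAL"]

def rule_matches(rule, level, message, exception_type=None):
    """Return True if a log message satisfies every field present in the rule."""
    threshold = rule.get("threshold")
    if threshold is not None and LEVELS.index(level) < LEVELS.index(threshold):
        return False
    content = rule.get("content")
    if content is not None and re.search(content, message) is None:
        return False
    exc_pattern = rule.get("exceptiontype")
    if exc_pattern is not None:
        # A rule with exceptiontype only applies to messages carrying an exception.
        if exception_type is None or re.search(exc_pattern, exception_type) is None:
            return False
    return True

def first_matching_rule(rules, level, message, exception_type=None):
    """Rules are tried in order; the first match decides the outcome."""
    for rule in rules:
        if rule_matches(rule, level, message, exception_type):
            return rule
    return None

def within_rate(rule, already_sent):
    """A negative rate means unlimited; otherwise cap events per period."""
    rate = rule["rate"]
    return rate < 0 or already_sent < rate

# The two-rule example from the description above.
rules = [
    {"alert": False, "rate": 1, "periodminutes": 1, "exceptiontype": ".*"},
    {"alert": True, "rate": 1, "periodminutes": 1, "threshold": "ERROR"},
]
matched = first_matching_rule(rules, "ERROR", "disk failure", "java.io.IOException")
print(matched["alert"])
```

Under these assumptions, an ERROR message that carries an exception is caught by the first rule and is not promoted to an alert, while an ERROR message without an exception falls through to the second rule. Note also that a rate of 0, used by many default rules, never passes `within_rate`, which effectively suppresses events for those messages.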
Navigator Audit Failure Thresholds |
The health test thresholds for failures encountered when monitoring audits within a recent period specified by the mgmt_navigator_failure_window configuration for the role. The value that can be specified for this threshold is the number of bytes of audit data that remain to be sent to the audit server. |
mgmt.navigator.failure.thresholds
|
Warning: Never, Critical: Any |
mgmt_navigator_failure_thresholds
|
false |
Monitoring Period For Audit Failures |
The period to review when checking if audits are blocked and not getting processed. |
mgmt.navigator.failure.window
|
20 minute(s) |
mgmt_navigator_failure_window
|
false |
Navigator Audit Pipeline Health Check |
Enables the test of the audit event processing pipeline. The test checks whether audit events are failing to be processed by the Audit Server for a role that generates audit events. |
mgmt.navigator.status.check.enabled
|
true |
mgmt_navigator_status_check_enabled
|
false |
Process Swap Memory Thresholds |
The health test thresholds on the swap memory usage of the process. This takes precedence over the host level threshold. |
|
Warning: 200 B, Critical: Never |
process_swap_memory_thresholds
|
false |
HBase RegionServer Compaction Queue Monitoring Thresholds |
The health test thresholds of the weighted average size of the HBase RegionServer compaction queue over a recent period. See HBase RegionServer Compaction Queue Monitoring Period. |
|
Warning: 10.0, Critical: Never |
regionserver_compaction_queue_thresholds
|
false |
HBase RegionServer Compaction Queue Monitoring Period |
The period over which to compute the moving average of the compaction queue size. |
|
5 minute(s) |
regionserver_compaction_queue_window
|
false |
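How a monitoring period and a threshold combine for the queue metrics above can be illustrated with a small sliding-window average. The table does not specify how the "weighted average" is weighted, so this sketch uses a plain arithmetic mean over samples inside the window; the class name `WindowedAverage` and the sample values are hypothetical.

```python
from collections import deque

class WindowedAverage:
    """Plain mean of samples within a trailing time window (a simplification)."""

    def __init__(self, window_seconds):
        self.window_seconds = window_seconds
        self.samples = deque()  # (timestamp_seconds, value) pairs, oldest first

    def add(self, timestamp, value):
        self.samples.append((timestamp, value))
        # Drop samples that fall outside the trailing window.
        cutoff = timestamp - self.window_seconds
        while self.samples and self.samples[0][0] < cutoff:
            self.samples.popleft()

    def average(self):
        if not self.samples:
            return None
        return sum(value for _, value in self.samples) / len(self.samples)

# 5-minute window and Warning threshold of 10.0, as in the defaults above.
queue = WindowedAverage(window_seconds=5 * 60)
queue.add(0, 4)
queue.add(60, 8)
queue.add(400, 20)  # the samples at t=0 and t=60 are now older than 5 minutes
warning = queue.average() > 10.0
print(queue.average(), warning)
```

A longer window smooths out short spikes in the queue size, at the cost of reacting more slowly to a sustained backlog.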
HBase Region Server Connectivity Tolerance at Startup |
The amount of time to wait for the HBase Region Server to fully start up and connect to the HBase Master before enforcing the connectivity check. |
|
3 minute(s) |
regionserver_connectivity_tolerance
|
false |
File Descriptor Monitoring Thresholds |
The health test thresholds of the number of file descriptors used. Specified as a percentage of file descriptor limit. |
|
Warning: 50.0 %, Critical: 70.0 % |
regionserver_fd_thresholds
|
false |
HBase RegionServer Flush Queue Monitoring Thresholds |
The health test thresholds of the average size of the HBase RegionServer flush queue over a recent period. See HBase RegionServer Flush Queue Monitoring Period. |
|
Warning: 10.0, Critical: Never |
regionserver_flush_queue_thresholds
|
false |
HBase RegionServer Flush Queue Monitoring Period |
The period over which to compute the moving average of the flush queue size. |
|
5 minute(s) |
regionserver_flush_queue_window
|
false |
Garbage Collection Duration Thresholds |
The health test thresholds for the weighted average time spent in Java garbage collection. Specified as a percentage of elapsed wall clock time. |
|
Warning: 30.0, Critical: 60.0 |
regionserver_gc_duration_thresholds
|
false |
Garbage Collection Duration Monitoring Period |
The period to review when computing the moving average of garbage collection time. |
|
5 minute(s) |
regionserver_gc_duration_window
|
false |
RegionServer Host Health Test |
When computing the overall RegionServer health, consider the host's health. |
|
true |
regionserver_host_health_enabled
|
false |
HBase RegionServer to Master Connectivity Test |
Enables the health test that checks whether the RegionServer is connected to the Master. |
|
true |
regionserver_master_connectivity_enabled
|
false |
HBase RegionServer Memstore Size Thresholds |
The health test thresholds of the total size of RegionServer's memstores. Specified as a percentage of the configured upper limit. See Maximum Size of All Memstores in RegionServer. |
|
Warning: 95.0 %, Critical: 100.0 % |
regionserver_memstore_size_thresholds
|
false |
HBase RegionServer HDFS Read Latency Thresholds |
The health test thresholds of the latency that the RegionServer sees for HDFS read operations. |
|
Warning: 50 millisecond(s), Critical: 100 millisecond(s) |
regionserver_read_latency_thresholds
|
false |
HBase RegionServer HDFS Read Latency Monitoring Period |
The period over which to compute the moving average of the HDFS read latency of the HBase RegionServer. |
|
5 minute(s) |
regionserver_read_latency_window
|
false |
RegionServer Process Health Test |
Enables the health test that checks whether the RegionServer's process state is consistent with the role configuration. |
|
true |
regionserver_scm_health_enabled
|
false |
Percentage of Heap Used by HStoreFile Index |
The health test thresholds of the size used by the HStoreFile index. Specified as a percentage of the total heap size. |
|
Warning: 10.0 %, Critical: Never |
regionserver_store_file_idx_size_thresholds
|
false |
HBase RegionServer HDFS Sync Latency Thresholds |
The health test thresholds for the latency of HDFS write operations that the RegionServer detects. |
|
Warning: 500 millisecond(s), Critical: 5 second(s) |
regionserver_sync_latency_thresholds
|
false |
HBase RegionServer HDFS Sync Latency Monitoring Period |
The period over which to compute the moving average of the HDFS sync latency of the HBase RegionServer. |
|
5 minute(s) |
regionserver_sync_latency_window
|
false |
Web Metric Collection |
Enables the health test that the Cloudera Manager Agent can successfully contact and gather metrics from the web server. |
|
true |
regionserver_web_metric_collection_enabled
|
false |
Web Metric Collection Duration |
The health test thresholds on the duration of the metrics request to the web server. |
|
Warning: 10 second(s), Critical: Never |
regionserver_web_metric_collection_thresholds
|
false |
Role Triggers |
The configured triggers for this role. This is a JSON-formatted list of triggers. These triggers are evaluated as part of the health system. Every trigger expression is parsed, and if the trigger condition is met, the list of actions provided in the trigger expression is executed. Each trigger has the following fields:
triggerName (mandatory) - The name of the trigger. This value must be unique for the specific role.
triggerExpression (mandatory) - A tsquery expression representing the trigger.
streamThreshold (optional) - The maximum number of streams that can satisfy a condition of a trigger before the condition fires. By default set to 0, and any stream returned causes the condition to fire.
enabled (optional) - By default set to 'true'. If set to 'false', the trigger is not evaluated.
expressionEditorConfig (optional) - Metadata for the trigger editor. If present, the trigger should only be edited from the Edit Trigger page; editing the trigger here can lead to inconsistencies.
For example, the following JSON-formatted trigger configured for a DataNode fires if the DataNode has more than 1500 file descriptors opened:
[{"triggerName": "sample-trigger",
"triggerExpression": "IF (SELECT fd_open WHERE roleName=$ROLENAME and last(fd_open) > 1500) DO health:bad",
"streamThreshold": 0, "enabled": "true"}]
See the trigger rules documentation for more details on how to write triggers using tsquery. The JSON format is evolving and may change; as a result, backward compatibility is not guaranteed between releases. |
|
[] |
role_triggers
|
true |
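Before pasting a value into role_triggers, it can help to check locally that the JSON parses and that the mandatory fields are present. The following is a minimal sketch of such a check; `validate_triggers` is a hypothetical helper written for this example, not part of Cloudera Manager, and it does not validate the tsquery expression itself.

```python
import json

MANDATORY_FIELDS = ("triggerName", "triggerExpression")

def validate_triggers(raw):
    """Parse a role_triggers value and check the documented field rules."""
    triggers = json.loads(raw)
    if not isinstance(triggers, list):
        raise ValueError("role_triggers must be a JSON list")
    seen_names = set()
    for trigger in triggers:
        for field in MANDATORY_FIELDS:
            if not trigger.get(field):
                raise ValueError(f"missing mandatory field: {field}")
        name = trigger["triggerName"]
        # Trigger names must be unique for the role.
        if name in seen_names:
            raise ValueError(f"duplicate triggerName: {name}")
        seen_names.add(name)
    return triggers

# The sample trigger from the description above.
SAMPLE = """
[{"triggerName": "sample-trigger",
  "triggerExpression": "IF (SELECT fd_open WHERE roleName=$ROLENAME and last(fd_open) > 1500) DO health:bad",
  "streamThreshold": 0, "enabled": "true"}]
"""
triggers = validate_triggers(SAMPLE)
print(triggers[0]["triggerName"])
```

The default value `[]` also passes this check, since an empty list contains no triggers to validate.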
Unexpected Exits Thresholds |
The health test thresholds for unexpected exits encountered within a recent period specified by the unexpected_exits_window configuration for the role. |
|
Warning: Never, Critical: Any |
unexpected_exits_thresholds
|
false |
Unexpected Exits Monitoring Period |
The period to review when computing unexpected exits. |
|
5 minute(s) |
unexpected_exits_window
|
false |