Concepts and Terminology
Layered Architecture
Three layers of monitoring exist: infrastructure / systems monitoring, resource monitoring, and service / application monitoring. Together, these layers span the dichotomy between technology and business.
The lowest level is infrastructure and systems monitoring, which seeks to quantify the performance of network elements such as load balancers, switches, routers, firewalls, and other such devices. Metrics at this level tend to be very technical and reflect the capabilities of the networking infrastructure. In cloud environments, this kind of data may be difficult to obtain because the provider controls the underlying infrastructure.
The middle level is resource monitoring, where the performance of computing and storage elements is quantified.
The highest level is service or application-level monitoring, where the performance of the application or service is quantified. At this level, metrics are business focused and reflect the link between technology and business function. The challenge is in defining the services and metrics, because unlike technical metrics, they are not expressed in universal terms. The benefit is that the business can reason about performance in business terms, which is how customers think of services.
Supporting Tools
IT PM can be defined as the process of monitoring and reporting on the availability and performance of a service, including the detection and diagnosis of performance issues. IT PM tools include the following (a brief sketch of the agent-to-server flow appears after the list):
- Agents and other mechanisms, which collect or generate metric data relating to application performance and status
- IT PM servers, which store, compile, and communicate the performance metric data
- Analysis tools, which analyze and report on the metric data
- A dashboard portal, which provides data center operators with status information
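To make the agent-to-server flow concrete, the following is a minimal sketch in Python. The endpoint URL, metric name, and payload format are illustrative assumptions, not the interface of any specific CMS IT PM tool.

```python
import json
import time
import urllib.request

# Hypothetical IT PM server endpoint; real collection mechanisms vary by tool.
ITPM_SERVER = "http://itpm-server.example/metrics"

def measure_response_time(url: str) -> float:
    """Time one request to the monitored application, in seconds."""
    start = time.monotonic()
    urllib.request.urlopen(url, timeout=10).read()
    return time.monotonic() - start

def report_metric(name: str, value: float) -> None:
    """Send a single metric sample to the IT PM server, which stores and
    compiles it for the analysis tools and dashboard portal."""
    payload = json.dumps({"metric": name, "value": value, "ts": time.time()})
    request = urllib.request.Request(
        ITPM_SERVER,
        data=payload.encode("utf-8"),
        headers={"Content-Type": "application/json"},
    )
    urllib.request.urlopen(request, timeout=10)

if __name__ == "__main__":
    elapsed = measure_response_time("http://app.example/health")
    report_metric("app.response_time_seconds", elapsed)
```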
IT PM analysis tools track and monitor response time and other performance metrics. IT PM administrators configure the analysis tools with Event Situation Rules so that Performance Events are detected from metric data and responded to automatically as they occur.
Configuration
Configuring the IT PM analysis tools involves settings that define metrics, condition thresholds, events, rules, and actions. Configuration relies on the following concepts (a sketch of how they compose follows the list):
- Metrics refer to data reported to the IT PM analysis tool concerning specific measures of an application’s performance, such as response time.
- Thresholds are defined for each metric. A threshold may be a stated value or range of values for a given metric that, if exceeded, indicates that a certain condition is true.
- A Condition is true when a monitored metric data value exceeds a given threshold.
- A Performance Event occurs when some predefined combination of conditions is true. Each type of Performance Event has its own criteria.
- A set of Event Situation Rules consists of the criteria for a given type of Performance Event and the actions to take when that Performance Event occurs.
- An Alert is triggered when the Event Situation Rules are met.
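The following sketch illustrates how these concepts compose, from thresholds through conditions to a Performance Event and its actions. The class names, metric names, and threshold values are hypothetical, not drawn from any specific IT PM product.

```python
from dataclasses import dataclass, field
from typing import Callable

@dataclass
class Threshold:
    metric: str
    limit: float  # the stated value for this metric

    def condition(self, samples: dict) -> bool:
        """A Condition is true when the monitored value exceeds the threshold."""
        return samples.get(self.metric, 0.0) > self.limit

@dataclass
class EventSituationRule:
    event_type: str
    thresholds: list[Threshold]                       # criteria for the event
    actions: list[Callable] = field(default_factory=list)

    def evaluate(self, samples: dict) -> None:
        """Log a Performance Event and run the rule's actions when every
        condition in the criteria is true."""
        if all(t.condition(samples) for t in self.thresholds):
            print(f"Performance Event logged: {self.event_type}")
            for action in self.actions:
                action(self.event_type, samples)

def display_alert(event_type: str, samples: dict) -> None:
    """One possible action: display an alert for operators."""
    print(f"ALERT [{event_type}]: {samples}")

# Hypothetical rule: slow responses combined with a high error rate.
rule = EventSituationRule(
    event_type="slow-and-failing",
    thresholds=[Threshold("response_time_seconds", 2.0),
                Threshold("error_rate_percent", 5.0)],
    actions=[display_alert],
)
rule.evaluate({"response_time_seconds": 3.1, "error_rate_percent": 7.2})
```

Here the rule fires because both conditions are true; in practice each type of Performance Event has its own criteria and its own set of actions.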
When the predefined conditions of a rule are met, a Performance Event is logged and the IT PM analysis tools automatically perform the rule’s predefined set of actions, such as displaying an alert, sending a notification, or generating a ticket. The metrics, thresholds, and rules used to monitor a CMS application are defined through collaboration among the business application owner, the Office of Information Technology (OIT), and the IT PM team.
Other Common Terms
Business Service Metrics
Measures of the availability or performance of a business service as provided by an application.
Event Correlation
The process of correlating multiple events using decision logic, either to derive a measurement across a multi-event process or to detect a pattern that indicates a performance issue.
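As an illustrative sketch of event correlation, the decision logic below groups events within a time window and flags a repeating pattern. The event types, window size, and pattern count are assumptions for illustration only.

```python
from collections import Counter

# Hypothetical events as (timestamp_seconds, event_type) pairs.
events = [(0, "timeout"), (12, "timeout"), (25, "db-slow"), (30, "timeout")]

WINDOW_S = 60       # correlate events within a 60-second window (assumed)
PATTERN_COUNT = 3   # repeated timeouts in one window suggest a deeper issue

def correlate(events, window=WINDOW_S):
    """Count event types in the window ending at the most recent event and
    apply simple decision logic to detect a pattern."""
    latest = max(ts for ts, _ in events)
    recent = Counter(etype for ts, etype in events if latest - ts <= window)
    if recent["timeout"] >= PATTERN_COUNT:
        return "Pattern detected: repeated timeouts indicate a performance issue"
    return None

print(correlate(events))
```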
Event Management
Activity of monitoring and taking action on events.
Event Enrichment
Information and analysis added to the description of an event defined by a set of Event Situation Rules. Examples include a description of the event’s business impact, or a note that the event may be affected by cache size.
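A minimal sketch of enrichment, assuming events are represented as simple dictionaries; the field names and the business-impact text are hypothetical.

```python
def enrich(event: dict) -> dict:
    """Attach business impact and operational notes to a raw event."""
    enriched = dict(event)
    if event.get("type") == "slow-and-failing":
        enriched["business_impact"] = "Claims intake throughput degraded"
        enriched["note"] = "May be affected by cache size; check cache hit rate"
    return enriched

raw_event = {"type": "slow-and-failing", "response_time_seconds": 3.1}
print(enrich(raw_event))
```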
Performance Testing vs. Performance Monitoring
Performance and Stress Testing occurs in an Implementation environment and seeks to:
- Verify that an application system’s performance meets system baseline performance requirements
- Baseline the application’s performance and resource use under different workloads to establish initial norms
- Stress an application system to identify failure points and inform design decisions before service implementation
In the CMS Processing Environment, Performance Monitoring seeks to:
- Verify that a service is performing within prescribed control limits
- Detect when a service is not performing within predefined standards so that corrective steps can be taken
- Forecast when a service may exceed prescribed control limits (or thresholds) so that corrective action can begin (a simple forecasting sketch follows this list)
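As a minimal sketch of the forecasting idea, the code below fits a simple linear trend to evenly spaced metric samples and estimates when the metric will cross a threshold. Real monitoring tools use more sophisticated models; the sampling interval and values here are assumptions.

```python
def forecast_crossing(samples: list[float], threshold: float,
                      interval_s: float = 60.0):
    """Fit a linear trend to recent samples and estimate the number of
    seconds until the metric first exceeds the threshold; returns None
    if the trend is flat or improving."""
    n = len(samples)
    if n < 2:
        return None  # not enough data to fit a trend
    xs = range(n)
    mean_x = (n - 1) / 2
    mean_y = sum(samples) / n
    slope = (sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, samples))
             / sum((x - mean_x) ** 2 for x in xs))
    if slope <= 0:
        return None  # no upward trend, so no projected crossing
    intercept = mean_y - slope * mean_x
    steps_until_cross = (threshold - intercept) / slope - (n - 1)
    return max(steps_until_cross, 0) * interval_s

# Response time (seconds) sampled once a minute and trending upward;
# forecast how long until it crosses the 2.0-second control limit.
history = [1.2, 1.3, 1.45, 1.6, 1.7]
print(forecast_crossing(history, threshold=2.0))  # roughly 134 seconds
```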
Although these activities use the same metrics and sometimes the same tools, their implementation differs, as shown in Table - Implementation Differences between Performance Testing and Performance Monitoring.
Leveraging Performance and Stress Testing
By providing common guidelines for Performance Monitoring across all applications and leveraging Performance and Stress Testing in the Implementation environment, CMS seeks to benefit from:
- Reuse of test scripts
- Cross-training on common performance tools
- Use of a single, end-to-end process with smoother handoffs
- Sharing of best practices
- Accurate baseline test scripts that yield more accurate results
- Greater efficiency / reduced cost
Additional Testing vs. Monitoring Considerations
It is to CMS’s advantage to leverage Performance and Stress Testing within its approach to Performance Monitoring. To accomplish this, performance metrics, performance requirements, and the naming conventions used in test scripts and transaction properties must be consistent.
Tests developed for Performance Testing may not be directly reusable in production Performance Monitoring because of additional constraints such as the following:
- Capability differences may exist between testing tools in the Implementation environment and monitoring tools in the CMS Production Environments at different CMS data centers.
- The presence of production data for Performance Monitoring may constrain the types of transactions that may be used or the metric data that may be collected.
- Logging too much data, or logging too often, in the Production Environment could have unintended performance consequences and create storage issues.
- Access to performance logs in the Production Environment is controlled.