Concepts and Terminology
Layered Architecture
Three layers of monitoring exist: infrastructure / systems monitoring, resource monitoring, and service / application monitoring. Together, these layers span the dichotomy between technology and business.
The lowest level is infrastructure and systems monitoring, which seeks to quantify the performance of network elements such as load balancers, switches, routers, firewalls, and other such devices. Metrics at this level tend to be very technical and reflect the capabilities of the networking infrastructure. In cloud environments, this kind of data may be difficult to obtain because the provider controls the underlying infrastructure.
The middle level is resource monitoring, where the performance of computing and storage elements is quantified.
The highest level is service or application-level monitoring, where the performance of the application or service is quantified. At this level, metrics are business focused and reflect the link between technology and business function. The challenge is in defining the services and metrics, because unlike technical metrics, they are not expressed in universal terms. The benefit is that the business can reason about performance in business terms, which is how customers think of services.
Supporting Tools
IT PM can be defined as the process of monitoring and reporting on the availability and performance of a service, including the detection and diagnosis of performance issues. IT PM tools include the following (a brief sketch of the agent-to-server flow appears after the list):
- Agents and other mechanisms, which collect or generate metric data relating to application performance and status
- IT PM servers, which store, compile, and communicate the performance metric data
- Analysis tools, which analyze and report on the metric data
- A dashboard portal, which provides data center operators with status information
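To make the agent-to-server flow concrete, the following is a minimal sketch in Python. The endpoint URL, metric name, and payload format are illustrative assumptions, not the interface of any specific CMS IT PM tool.

```python
import json
import time
import urllib.request

# Hypothetical IT PM server endpoint; real collection mechanisms vary by tool.
ITPM_SERVER = "http://itpm-server.example/metrics"

def measure_response_time(url: str) -> float:
    """Time one request to the monitored application, in seconds."""
    start = time.monotonic()
    urllib.request.urlopen(url, timeout=10).read()
    return time.monotonic() - start

def report_metric(name: str, value: float) -> None:
    """Send a single metric sample to the IT PM server, which stores and
    compiles it for the analysis tools and dashboard portal."""
    payload = json.dumps({"metric": name, "value": value, "ts": time.time()})
    request = urllib.request.Request(
        ITPM_SERVER,
        data=payload.encode("utf-8"),
        headers={"Content-Type": "application/json"},
    )
    urllib.request.urlopen(request, timeout=10)

if __name__ == "__main__":
    elapsed = measure_response_time("http://app.example/health")
    report_metric("app.response_time_seconds", elapsed)
```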
IT PM analysis tools track and monitor response time and other performance metrics. IT PM administrators configure the analysis tools with Event Situation Rules so that Performance Events are detected from metric data and responded to automatically as they occur.
Configuration
Configuring the IT PM analysis tools involves settings that define metrics, condition thresholds, events, rules, and actions. Configuration relies on the following concepts (a sketch of how they compose follows the list):
- Metrics refer to data reported to the IT PM analysis tool concerning specific measures of an application’s performance, such as response time.
- Thresholds are defined for each metric. A threshold may be a stated value or range of values for a given metric that, if exceeded, indicates that a certain condition is true.
- A Condition is true when a monitored metric data value exceeds a given threshold.
- A Performance Event occurs when some predefined combination of conditions is true. Each type of Performance Event has its own criteria.
- A set of Event Situation Rules consists of the criteria for a given type of Performance Event and the actions to take when that Performance Event occurs.
- An Alert is triggered when the Event Situation Rules are met.
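The following sketch illustrates how these concepts compose, from thresholds through conditions to a Performance Event and its actions. The class names, metric names, and threshold values are hypothetical, not drawn from any specific IT PM product.

```python
from dataclasses import dataclass, field
from typing import Callable

@dataclass
class Threshold:
    metric: str
    limit: float  # the stated value for this metric

    def condition(self, samples: dict) -> bool:
        """A Condition is true when the monitored value exceeds the threshold."""
        return samples.get(self.metric, 0.0) > self.limit

@dataclass
class EventSituationRule:
    event_type: str
    thresholds: list[Threshold]                       # criteria for the event
    actions: list[Callable] = field(default_factory=list)

    def evaluate(self, samples: dict) -> None:
        """Log a Performance Event and run the rule's actions when every
        condition in the criteria is true."""
        if all(t.condition(samples) for t in self.thresholds):
            print(f"Performance Event logged: {self.event_type}")
            for action in self.actions:
                action(self.event_type, samples)

def display_alert(event_type: str, samples: dict) -> None:
    """One possible action: display an alert for operators."""
    print(f"ALERT [{event_type}]: {samples}")

# Hypothetical rule: slow responses combined with a high error rate.
rule = EventSituationRule(
    event_type="slow-and-failing",
    thresholds=[Threshold("response_time_seconds", 2.0),
                Threshold("error_rate_percent", 5.0)],
    actions=[display_alert],
)
rule.evaluate({"response_time_seconds": 3.1, "error_rate_percent": 7.2})
```

Here the rule fires because both conditions are true; in practice each type of Performance Event has its own criteria and its own set of actions.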
When the predefined conditions of a rule are met, a Performance Event is logged and the IT PM analysis tools automatically perform the rule’s predefined set of actions, such as displaying an alert, sending a notification, or generating a ticket. The metrics, thresholds, and rules used to monitor a CMS application are defined through collaboration among the business application owner, the Office of Information Technology (OIT), and the IT PM team.
Other Common Terms
Business Service Metrics
Measures of the availability or performance of a business service as provided by an application.
Event Correlation
The process of correlating multiple events using decision logic, either to derive a measurement across a multi-event process or to detect a pattern that indicates a performance issue.
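As an illustrative sketch of event correlation, the decision logic below groups events within a time window and flags a repeating pattern. The event types, window size, and pattern count are assumptions for illustration only.

```python
from collections import Counter

# Hypothetical events as (timestamp_seconds, event_type) pairs.
events = [(0, "timeout"), (12, "timeout"), (25, "db-slow"), (30, "timeout")]

WINDOW_S = 60       # correlate events within a 60-second window (assumed)
PATTERN_COUNT = 3   # repeated timeouts in one window suggest a deeper issue

def correlate(events, window=WINDOW_S):
    """Count event types in the window ending at the most recent event and
    apply simple decision logic to detect a pattern."""
    latest = max(ts for ts, _ in events)
    recent = Counter(etype for ts, etype in events if latest - ts <= window)
    if recent["timeout"] >= PATTERN_COUNT:
        return "Pattern detected: repeated timeouts indicate a performance issue"
    return None

print(correlate(events))
```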
Event Management
Activity of monitoring and taking action on events.
Event Enrichment
Information and analysis added to the description of an event defined by a set of Event Situation Rules. Examples include a description of the event’s business impact, or a note that the event may be affected by cache size.
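A minimal sketch of enrichment, assuming events are represented as simple dictionaries; the field names and the business-impact text are hypothetical.

```python
def enrich(event: dict) -> dict:
    """Attach business impact and operational notes to a raw event."""
    enriched = dict(event)
    if event.get("type") == "slow-and-failing":
        enriched["business_impact"] = "Claims intake throughput degraded"
        enriched["note"] = "May be affected by cache size; check cache hit rate"
    return enriched

raw_event = {"type": "slow-and-failing", "response_time_seconds": 3.1}
print(enrich(raw_event))
```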
Performance Testing vs. Performance Monitoring
Performance and Stress Testing occurs in an Implementation environment and seeks to:
- Verify that an application system’s performance meets system baseline performance requirements
- Baseline the application’s performance and resource use under different workloads to establish initial norms
- Stress an application system to identify failure points and inform design decisions before service implementation
In the CMS Processing Environment, Performance Monitoring seeks to:
- Verify that a service is performing within prescribed control limits
- Detect when a service is not performing within predefined standards so that corrective steps can be taken
- Forecast when a service may exceed prescribed control limits (or thresholds) so that corrective action can begin (a simple forecasting sketch follows this list)
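As a minimal sketch of the forecasting idea, the code below fits a simple linear trend to evenly spaced metric samples and estimates when the metric will cross a threshold. Real monitoring tools use more sophisticated models; the sampling interval and values here are assumptions.

```python
def forecast_crossing(samples: list[float], threshold: float,
                      interval_s: float = 60.0):
    """Fit a linear trend to recent samples and estimate the number of
    seconds until the metric first exceeds the threshold; returns None
    if the trend is flat or improving."""
    n = len(samples)
    if n < 2:
        return None  # not enough data to fit a trend
    xs = range(n)
    mean_x = (n - 1) / 2
    mean_y = sum(samples) / n
    slope = (sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, samples))
             / sum((x - mean_x) ** 2 for x in xs))
    if slope <= 0:
        return None  # no upward trend, so no projected crossing
    intercept = mean_y - slope * mean_x
    steps_until_cross = (threshold - intercept) / slope - (n - 1)
    return max(steps_until_cross, 0) * interval_s

# Response time (seconds) sampled once a minute and trending upward;
# forecast how long until it crosses the 2.0-second control limit.
history = [1.2, 1.3, 1.45, 1.6, 1.7]
print(forecast_crossing(history, threshold=2.0))  # roughly 134 seconds
```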
Although these activities use the same metrics and sometimes the same tools, their implementation differs, as shown in Table - Implementation Differences between Performance Testing and Performance Monitoring.
Leveraging Performance and Stress Testing
By providing common guidelines for Performance Monitoring across all applications and leveraging Performance and Stress Testing in the Implementation environment, CMS seeks to benefit from:
- Reuse of test scripts
- Cross-training on common performance tools
- Use of a single, end-to-end process with smoother handoffs
- Sharing of best practices
- Accurate baseline test scripts that yield more accurate results
- Greater efficiency / reduced cost
Additional Testing vs. Monitoring Considerations
It is to CMS’s advantage to leverage Performance and Stress Testing within its approach to Performance Monitoring. To accomplish this, performance metrics, performance requirements, and the naming conventions used in test scripts and transaction properties must be consistent.
Tests developed for Performance Testing may not be directly reusable in production Performance Monitoring because of additional constraints such as the following:
- Capability differences may exist between testing tools in the Implementation environment and monitoring tools in the CMS Production Environments at different CMS data centers.
- The presence of production data for Performance Monitoring may constrain the types of transactions that may be used or the metric data that may be collected.
- Logging too much data, or logging too often, in the Production Environment could have unintended performance consequences and create storage issues.
- Access to performance logs in the Production Environment is controlled.