Guidelines for Performance Monitoring Planning

This topic provides CMS guidance on developing a Performance Monitoring Plan for an application. The suggested methodology proceeds from considering business service performance, to formulating a Performance Monitoring Strategy, to developing a Logical Application Performance Monitoring Plan, and finally to producing the Detailed Application Performance Monitoring Plan, which becomes part of the application’s Operations and Maintenance Manual (OM&M). Of these, only the OM&M and the Detailed Application Performance Monitoring Plan are required of all applications. It is strongly suggested that a Performance Monitoring Strategy be presented at the Preliminary Design Review (PDR).

Throughout the development of an Application Performance Monitoring Plan, the business owners and application maintainers should work closely with the CMS Enterprise Monitoring and Management Team.

The Enterprise Monitoring & Management Matrix topic below also describes the Enterprise Monitoring & Management Matrix Form that application owners should use to request the setting of thresholds and actions.

Performance Monitoring Strategy

The Performance Monitoring Plan should implement a performance monitoring strategy based on the following factors:

  • Business needs, which drive business service performance requirements.

  • Technical needs, which drive technical performance requirements.

  • Impact of monitoring, which drives cost and performance tradeoffs.

The subtopics that follow address each of these factors as steps in the development of a Performance Monitoring Strategy.

Business Service Performance Considerations

An application’s business services exist to support business operations. These operations have performance and capacity requirements based on the business mission, such as the need to process and store a specific number of claims per week. A poorly performing application can bog down business operations, reduce customer satisfaction, and even cause additional work or penalties for business operations. Therefore, some of the performance standards (and metrics) for application performance must be derived from the business operations’ performance and capacity requirements.

For each business application service, a business owner or operations manager must consider:

  • What is the slowest permissible service response time before users or operations are significantly impacted?

  • How often can the service be unavailable before users or operations are significantly impacted?

  • For how long can a service be unavailable before users or operations are significantly impacted?

A “significant impact” to users or operations is when one of the following occurs:

  • Operations are slowed or delayed to the point where costs increase, backlogs occur, or time-critical deadlines for business processes are missed.

  • A user’s productivity is reduced. If the user cannot work as quickly due to waiting for the system to respond, that represents a significant impact. In time-sensitive business operations such as a call center, slow system response can translate directly into higher operations costs, negatively impact staff performance ratings, and increase per-call costs.

  • A user becomes frustrated with the service performance. For interactive services accessed directly by the public, higher costs or productivity impacts from poor service performance may not be obvious. For example, a frustrated user may make mistakes, give up on his or her online request, or file complaints—all of which have negative consequences for CMS.

  • The service fails to meet user performance expectations. Today’s users expect computer applications to respond rapidly, continuously, at “internet speeds,” and with very high availability. When CMS does not meet these expectations, it reflects poorly on the Agency and may cause users to fall back on older, less efficient, more costly methods of interacting with the Agency, such as paper and postal mail.
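To make the availability questions above concrete, it helps to translate the answers into a measurable downtime budget. The following Python sketch assumes an illustrative 99.9% availability target; the figure is an example, not a CMS standard:

    # Convert an availability target into a monthly downtime budget.
    # The 99.9% target is illustrative only, not a CMS requirement.
    AVAILABILITY_TARGET = 0.999          # fraction of time the service is up
    MINUTES_PER_MONTH = 30 * 24 * 60     # 43,200 minutes in a 30-day month

    downtime_budget = MINUTES_PER_MONTH * (1 - AVAILABILITY_TARGET)
    print(f"Allowed downtime: {downtime_budget:.0f} minutes per month")
    # -> Allowed downtime: 43 minutes per month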

Step 1: Define Service Performance Objectives

For IT performance monitoring (IT PM) to be an effective tool for avoiding the significant impacts described above, application performance must be monitored for compliance with business-driven performance standards, and data center IT staff must understand the impact of sub-standard performance on business operations and mission. After evaluating the foregoing Business Service Performance Considerations, the application owner or operations manager should produce a list of Service Performance Objectives. For each objective, the following information should be provided (a sketch representing one such objective as a structured record follows this list):

  • A clear, measurable definition

  • An Acceptable Performance Standard as a minimum, maximum, or range of performance

  • A delineation of where the standard applies (everywhere, regional offices, public internet, etc.)

  • A monitoring threshold (a minimum, maximum, or range of performance) beyond which the business owner must be notified that the application’s performance has degraded.

  • A brief description of the business service and its role in business operations

  • A brief explanation of the potential impact to business operations if the business service is degraded or unavailable for a few seconds, minutes, or hours.
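As a concrete illustration of the record referenced above, the sketch below captures one Service Performance Objective with these fields. The class and the example values are hypothetical, not CMS standards:

    from dataclasses import dataclass

    @dataclass
    class ServicePerformanceObjective:
        definition: str            # clear, measurable definition
        acceptable_standard: str   # minimum, maximum, or range of performance
        applies_where: str         # everywhere, regional offices, public internet, etc.
        monitoring_threshold: str  # performance level that triggers notification
        service_description: str   # the business service and its operational role
        degradation_impact: str    # impact if degraded or unavailable

    # Illustrative objective; the values are examples, not CMS standards.
    claim_lookup = ServicePerformanceObjective(
        definition="95th-percentile claim-lookup response time",
        acceptable_standard="<= 3 seconds",
        applies_where="public internet",
        monitoring_threshold="> 5 seconds for 5 consecutive minutes",
        service_description="Claim status lookup used by beneficiaries",
        degradation_impact="Call volume rises as users phone in status requests",
    )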

Armed with the above information, data center IT staff can plan for and monitor application performance, proactively anticipate performance problems, and prioritize restoration of services when problems occur. The Service Performance Objectives should be reviewed with the data center operators and may be used to define Business Service Performance Standards in application hosting service agreements.

Step 2: Map Service Performance Objectives to Components and Transactions

Application developers and maintainers can identify additional critical performance metrics. They should begin by reviewing the Service Performance Objectives identified by the business owner and the operations manager and mapping the objectives to specific child transactions and application components. External services such as database operations, remote procedure calls, and remote queues should be included as components addressed by the Service Performance Objectives. Additional Service Performance Objectives may be defined as needed, based on non-functional requirements, component dependencies, or other technical requirements.
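A hypothetical sketch of such a mapping, continuing the claim-lookup example, associates each Service Performance Objective with the child transactions and components it depends on:

    # Hypothetical mapping of a Service Performance Objective to the
    # child transactions and components that affect it.
    objective_map = {
        "claim-lookup response time": [
            "web front end: /claims/status transaction",
            "application server: ClaimStatusService.lookup()",
            "database: CLAIM table query",
            "remote queue: eligibility-check request/reply",
        ],
    }

    for objective, components in objective_map.items():
        print(objective)
        for component in components:
            print(f"  depends on: {component}")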

Step 3: Develop a Performance Monitoring Strategy

The Performance Monitoring Strategy identifies which transactions, resources, and application performance data will be routinely monitored in production.

Prioritize Service Performance Objectives

The application developer and maintainer should identify which Service Performance Objectives carry the highest risk, considering both the probability of performance degradation and the impact of that degradation. Higher-risk objectives require more frequent monitoring, while lower-risk objectives may be left unmonitored or monitored only when required for testing changes or diagnosing problems.
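One simple way to rank objectives, sketched below with entirely hypothetical scores, is to rate the probability and impact of degradation on a small scale and order by their product:

    # Hypothetical risk scoring: probability and impact rated 1 (low) to 5 (high).
    objectives = [
        ("claim-lookup response time", 4, 5),   # (name, probability, impact)
        ("nightly batch completion",   2, 4),
        ("report download time",       3, 2),
    ]

    # Higher risk = monitor more frequently; lowest risk may go unmonitored.
    for name, probability, impact in sorted(
            objectives, key=lambda o: o[1] * o[2], reverse=True):
        print(f"risk {probability * impact:2d}: {name}")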

Identify Transactions and Resources for Monitoring

Once it has been decided which Service Performance Objectives will be monitored, the application developer and maintainer should identify the smallest, lowest-impact set of transactions and resources whose routine monitoring can determine whether those objectives are being met. The set can comprise any combination of application resources, specific application transactions, or application-provided data.
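Because a single transaction often exercises several components at once, a small greedy selection over hypothetical candidates can suggest a minimal monitoring set:

    # Hypothetical candidates: each monitorable item and the objectives it covers.
    candidates = {
        "synthetic /claims/status transaction": {"response time", "availability"},
        "database query timing":                {"response time"},
        "application heartbeat":                {"availability"},
    }

    uncovered = {"response time", "availability"}
    selected = []
    while uncovered:
        # Pick the candidate covering the most still-uncovered objectives.
        best = max(candidates, key=lambda c: len(candidates[c] & uncovered))
        selected.append(best)
        uncovered -= candidates[best]

    print(selected)   # -> ['synthetic /claims/status transaction']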

Cost and Performance Tradeoffs

Over-monitoring an application has negative consequences: it can degrade performance, increase resource utilization, produce excessive volumes of performance data, and increase costs and complexity. By carefully selecting which transactions will be monitored, and when, these negative impacts can be controlled and minimized. The list of proposed transactions for monitoring should be refined accordingly.
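The cost side of the tradeoff can be estimated before any monitoring is deployed. A back-of-the-envelope sketch, with all figures hypothetical:

    # Hypothetical estimate of daily monitoring data volume.
    metrics_monitored   = 25      # number of metrics routinely collected
    sample_interval_sec = 60      # one sample per metric per minute
    bytes_per_sample    = 200     # metric name, timestamp, value, labels

    samples_per_day = (24 * 3600 // sample_interval_sec) * metrics_monitored
    daily_bytes = samples_per_day * bytes_per_sample
    print(f"{samples_per_day:,} samples/day, ~{daily_bytes / 1e6:.1f} MB/day")
    # -> 36,000 samples/day, ~7.2 MB/day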

Logical IT PM Plan

The Performance Monitoring Strategy is the basis for the initial Logical IT PM Plan. This plan should contain:

  • A list of application resources to be monitored.

  • A list of specific application transactions to be monitored.

  • Definitions of application-provided performance data.

  • Recommendations about where monitored transactions should originate (data center, call center, regional office, etc.) and when or how often they should be measured.

These details should be based on the standards defined for the Service Performance Objectives, the risks associated with the Service Performance Objectives, and the possible impact of monitoring the transaction. To minimize the number of items monitored, the plan should avoid any unnecessary or duplicate application resources, specific application transactions, or application-provided data.

When complete, the Logical IT PM Plan should include tables or matrices providing traceability between Service Performance Objectives and the resources, transactions, or data for routine monitoring. It is unnecessary to include any items that will not be monitored routinely.

For each Service Performance Objective, the Logical IT PM Plan should identify some combination of application resources, specific application transactions, or application-provided data that can be used to verify that the Service Performance Objective is being met.
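Such a traceability matrix can be generated directly from the mapping data. A minimal sketch, reusing the hypothetical examples above:

    # Hypothetical traceability matrix: objective -> items routinely monitored.
    traceability = {
        "claim-lookup response time": ["synthetic /claims/status transaction"],
        "claim-lookup availability":  ["synthetic /claims/status transaction",
                                       "application heartbeat"],
    }

    print(f"{'Service Performance Objective':<32}Monitored Item")
    for objective, items in traceability.items():
        for item in items:
            print(f"{objective:<32}{item}")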

It is important for the application developer to consult with the data center operators because many elements may already be monitored with existing tools, without requiring modifications to the application.

Detailed IT PM Plan

The application developer or maintainer works closely with data center operators and the CMS Enterprise Monitoring and Management Team to create the Detailed Application Performance Monitoring Plan by defining the properties for each resource, transaction, or data element in the Logical Application Performance Monitoring Plan.

The resources, transactions, or data may each have different types of properties, including:

  • Identity properties, which define the type or class of an application or transaction.

  • Context properties, which may be unique for each instance of a given application or transaction. Examples include userid, processid, or business information, such as a claim number.

  • Relationship properties, which show how one transaction relates to another. Typically, these are parent/child relationships.

  • Metric properties, which describe measurements of the resource, transaction, or data elements. Typical examples include status, response time, time of day, blocked time, number of bytes, number of records, number of threads, and queue size.
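One way to picture these property groups is as fields of a single monitored-transaction record. The following Python sketch is hypothetical; actual property definitions are worked out with the CMS Enterprise Monitoring and Management Team:

    from dataclasses import dataclass
    from typing import Optional

    @dataclass
    class MonitoredTransaction:
        # Identity properties: the type or class of the transaction.
        transaction_class: str
        # Context properties: unique to each instance.
        userid: str
        claim_number: Optional[str] = None
        # Relationship properties: typically parent/child links.
        parent_transaction_id: Optional[str] = None
        # Metric properties: measurements of this instance.
        status: str = "OK"
        response_time_ms: float = 0.0
        blocked_time_ms: float = 0.0
        bytes_transferred: int = 0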

Once approved, the Detailed Application Performance Monitoring Plan becomes part of the application’s OM&M.

Enterprise Monitoring & Management Matrix

An Enterprise Monitoring & Management Matrix Form is used to request the setting of thresholds and actions. The form must be submitted to the CMS Enterprise Monitoring and Management Team, which will review and approve the request and implement the settings.

Table - Contents of the Enterprise Monitoring & Management Matrix Form summarizes the major columns in an Enterprise Monitoring & Management Matrix Form.

Table - Contents of the Enterprise Monitoring & Management Matrix Form

Identity Properties

  • Resource Name: Names the resource to be monitored. A resource may be a physical or virtual resource (memory, bandwidth, etc.), a process starting or dying, a queue, or a transaction. EXAMPLE: Disk space

  • Monitored Resource Description: A short statement that explains the reason for monitoring the resource. EXAMPLE: Warn when disk space usage exceeds the 85% threshold

Context Properties

  • Host Name: Identifies the host name or device name to be monitored.

Relationship Properties

  • Metric or Event Dependency: Identifies any dependencies on other events or thresholds (usually NONE).

Metric Properties

  • Condition/Threshold Reached: Lists the specific metric value, or the specific indication that the event has occurred. EXAMPLE: Maximum Disk Space Percent >= 85

  • Sampling Interval: Identifies the interval between samples for the metric or event. EXAMPLE: 15-minute intervals

  • Occurrences within Sampling Interval: Identifies situations in which alerts are needed only when a specified number of metric or event occurrences takes place within the Sampling Interval. EXAMPLE: Alert if more than 3 occurrences during a Sampling Interval

  • Criticality: Lists the criticality of the detected situation, such as Critical, Warning, or Informational. EXAMPLE: Warning (Yellow)

  • Action to be Taken: Identifies the action to be taken when the metric or event is detected. EXAMPLE: Send email notification to support team at <email address>

  • Situation/Monitor Name: Names the monitor performing the monitoring task. EXAMPLE: UNIX_DiskSpace_Monitor
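Pulling the table’s running disk-space example together, a completed form row could be represented as a simple record. The keys below mirror the form columns; the host name is hypothetical, and the remaining values are the examples from the table:

    # The disk-space example rows from the table above, as one record.
    matrix_form_row = {
        # Identity properties
        "resource_name": "Disk space",
        "monitored_resource_description": "Warn when disk space usage exceeds 85%",
        # Context properties
        "host_name": "examplehost01",          # hypothetical host name
        # Relationship properties
        "metric_or_event_dependency": "NONE",
        # Metric properties
        "condition_threshold": "Maximum Disk Space Percent >= 85",
        "sampling_interval": "15 minutes",
        "occurrences_within_interval": "more than 3",
        "criticality": "Warning (Yellow)",
        "action": "Send email notification to support team",
        "situation_monitor_name": "UNIX_DiskSpace_Monitor",
    }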