Best Practices and Recommendations
RP-PMM-1: Coordinate Application Changes with Monitoring Operations
Significant application changes must be coordinated with the existing IT PM infrastructure to avoid negatively affecting IT PM services or the monitored application.
Rationale:
Changes to an application may invalidate some of the rules, transactions, or resources used for IT PM. The result may degrade the ability to monitor application performance, or possibly the performance of the application itself or of a related application. Coordinating changes, typically via the project or data center change control board (CCB), can help mitigate the risk of negatively impacting IT PM.
RP-PMM-2: Consider Monitoring Lower Environments
Monitoring applications in the Development, Test, and Implementation environments is a business decision. Applications in Production must be monitored for application performance.
Rationale:
The additional cost of IT PM may be considered unjustified by business owners, who may choose not to monitor lower environments. However, instrumenting and monitoring applications in lower environments can help identify problem code before it is promoted to production. In addition, it makes lower environments execute similarly to production, which typically helps smooth migration.
RP-PMM-3: Provide Performance Data to CMS NOC
If requested, CMS data centers must provide performance event data to the CMS Network Operations Center where it can be monitored.
Rationale:
By having an integrated, enterprise-wide view of performance, CMS can better control and manage resource usage. This can be accomplished using either open standards-based products or the CMS-sanctioned IT PM gateways.
RP-PMM-4: Conduct Performance Management Planning
All CMS application maintainers must provide and maintain a Performance Management Plan approved by the data center operations contractor(s).
Rationale:
A Performance Management Plan identifies specific application monitoring requirements, business service metrics, key performance indicators (KPI), and thresholds for alerts and SLA violations. This plan also identifies contacts to be notified of performance degradation issues.
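As an illustration only, the elements of a Performance Management Plan can be sketched as a small Python data structure. The class names, field names, threshold values, and contact address below are hypothetical and do not represent a CMS-defined schema. The sketch also assumes that a higher observed value is worse (for example, response time).

```python
from __future__ import annotations

from dataclasses import dataclass, field


@dataclass
class KpiThreshold:
    """One KPI with an alert threshold and an SLA-violation threshold."""
    name: str
    unit: str
    alert_at: float  # observed value at or above this raises an alert
    sla_at: float    # observed value at or above this is an SLA violation


@dataclass
class PerformanceManagementPlan:
    """Hypothetical plan record: KPIs, thresholds, and contacts to notify."""
    application: str
    kpis: list[KpiThreshold] = field(default_factory=list)
    contacts: list[str] = field(default_factory=list)

    def evaluate(self, kpi_name: str, observed: float) -> str:
        """Classify an observed value as 'ok', 'alert', or 'sla_violation'."""
        for kpi in self.kpis:
            if kpi.name == kpi_name:
                if observed >= kpi.sla_at:
                    return "sla_violation"
                if observed >= kpi.alert_at:
                    return "alert"
                return "ok"
        raise KeyError(f"KPI not defined in plan: {kpi_name}")


# Hypothetical example values, for illustration only.
plan = PerformanceManagementPlan(
    application="example-app",
    kpis=[KpiThreshold("response_time", "ms", alert_at=800.0, sla_at=2000.0)],
    contacts=["app-oncall@example.invalid"],
)
```

In this sketch, a KPI that is not listed in the plan raises an error rather than being silently treated as healthy, so that gaps in the plan are visible.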
RP-PMM-5: Use a Trouble Ticketing System to Track Performance Problems
Use the CMS enterprise trouble ticket system to track the resolution of performance incidents in production systems.
Related CMS ARS Security Controls include: IR-4 - Incident Handling, IR-5 - Incident Monitoring, and IR-6 - Incident Reporting.
Rationale:
An enterprise trouble ticketing system allows for trouble tickets to be tracked, prioritized, and dispositioned according to business and technical priorities.
RP-PMM-6: At Least One End-to-End Test
Before going into production, applications hosted in CMS data centers must have at least one test to verify end-to-end operational status of the performance monitoring system as part of Production Readiness testing from within the Production environment.
Rationale:
This rule ensures that applications processing CMS data have implemented and validated the performance monitoring system.
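One possible shape for such an end-to-end check is sketched below in Python: a uniquely tagged synthetic event is emitted on the application side, and the monitoring side is polled until the event is observed or a timeout expires. The function name and the two callables (emit_event and event_received) are hypothetical placeholders for whatever interfaces the data center's IT PM tools actually provide; the timeout values are likewise assumptions.

```python
import time
import uuid
from typing import Callable


def verify_monitoring_end_to_end(
    emit_event: Callable[[str], None],
    event_received: Callable[[str], bool],
    timeout_s: float = 30.0,
    poll_interval_s: float = 1.0,
) -> bool:
    """Return True if a synthetic event reaches the monitoring system in time.

    emit_event:     hypothetical hook that sends a marker from the application.
    event_received: hypothetical hook that asks the monitoring system whether
                    the marker has arrived.
    """
    # A unique marker avoids confusing this run with earlier test events.
    marker = f"e2e-check-{uuid.uuid4()}"
    emit_event(marker)

    deadline = time.monotonic() + timeout_s
    while True:
        if event_received(marker):
            return True
        if time.monotonic() >= deadline:
            return False
        time.sleep(poll_interval_s)
```

A False result indicates only that the marker was not seen within the timeout; it does not identify which part of the monitoring path failed.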
RP-PMM-7: Identify Un-Monitorable Components as a Risk
Application owners and maintainers should identify any component that cannot be monitored with existing CMS IT PM tools as part of the application project risk register.
Rationale:
Application components that cannot be monitored represent an application performance risk and need to be tracked in the risk register. Rather than being wholly un-monitorable, such components may require different monitoring capabilities than the infrastructure possesses. Risk mitigations must be provided.
RP-PMM-8: Support Standards-Based Tools
Performance-critical applications hosted in CMS data centers should support standards-based centralized monitoring, alerting, and event management. It is more important, however, to integrate with the existing CMS monitoring infrastructure than to deviate from it in order to meet a standard.
Rationale:
This applies to all performance-critical production applications providing business or infrastructure services. A performance-critical application is one for which significant degradation of the application's performance or availability may have an immediate impact on end users or business operations, either directly or by affecting another application.
Exceptions may be required for some existing or COTS applications; however, data centers contain tools for monitoring most of these applications and their underlying platforms without requiring modification of the application code.
RP-PMM-9: Define Services in a Service Catalog
Services must be defined and collected into a Service Catalog before use. The Service Catalog should also contain service level agreements and definitions of metrics so that all service consumers can understand service constraints.
Rationale:
Without service definitions, service level agreements cannot be defined. There are usually additional constraints (such as capacity) and dependencies (such as other subordinate services or resources) that should be described in the service definition as well. The Service Catalog is the collection of active service definitions.
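The relationship between service definitions and the Service Catalog can be illustrated with a minimal Python sketch. All class names, field names, service names, and SLA values here are hypothetical and are not taken from any CMS catalog. The sketch assumes, as a design choice, that a service's dependencies must already be defined in the catalog before the service itself can be registered.

```python
from __future__ import annotations

from dataclasses import dataclass, field


@dataclass
class ServiceDefinition:
    """Hypothetical service definition with SLA targets and constraints."""
    name: str
    sla: dict[str, float]  # metric name -> agreed target
    constraints: dict[str, str] = field(default_factory=dict)  # e.g. capacity
    depends_on: list[str] = field(default_factory=list)  # subordinate services


class ServiceCatalog:
    """Collection of active service definitions, keyed by service name."""

    def __init__(self) -> None:
        self._services: dict[str, ServiceDefinition] = {}

    def register(self, service: ServiceDefinition) -> None:
        # Assumption: every dependency must already be defined in the catalog.
        missing = [d for d in service.depends_on if d not in self._services]
        if missing:
            raise ValueError(
                f"undefined dependencies for {service.name}: {missing}"
            )
        self._services[service.name] = service

    def lookup(self, name: str) -> ServiceDefinition:
        # A service that is not in the catalog is undefined and cannot be used.
        if name not in self._services:
            raise KeyError(f"service not defined in catalog: {name}")
        return self._services[name]


# Hypothetical entries, for illustration only.
catalog = ServiceCatalog()
catalog.register(ServiceDefinition("directory", sla={"availability_pct": 99.9}))
catalog.register(
    ServiceDefinition(
        "claims-lookup",
        sla={"availability_pct": 99.5, "p95_response_ms": 1500.0},
        constraints={"capacity": "200 requests/second"},
        depends_on=["directory"],
    )
)
```

Rejecting lookups of undefined services mirrors the rule that services must be defined in the catalog before use.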