Cloud Environment Governance and Management

As noted in NIST SP 800-146, Cloud Computing Synopsis and Recommendations:

Attempts to describe cloud computing in general terms have been problematic because cloud computing is not a single kind of system, but instead spans a spectrum of underlying technologies, configuration possibilities, service models, and deployment models.

Many factors influence the definition, benefits, risks, and implementation of a cloud infrastructure. From a pragmatic standpoint, deploying into a cloud infrastructure relies on the same underlying technologies, principles, and processes as a managed-services virtualized data center. A cloud infrastructure differs from a virtualized data center by providing the following additional capabilities:

Self-service, allowing authorized users to provision servers and applications on demand, on behalf of business owners, using streamlined processes
Automation, beyond that of virtualized and physical data centers
Rapid elasticity, allowing applications to meet short- and long-term growth goals in a regulated but automated fashion
Multi-tenancy (in Community Clouds), allowing multiple cloud customers to share resources while securely separating resources for one customer from those of another
Metered billing, which allows CMS to lease rather than own the cloud resources and pay only for what CMS uses from the CSP

Utilizing both the proper controls, as defined by applicable CMS and federal security mandates, and the requirements for the handling and processing of CMS information and information systems, will help assure a trusted, shared environment that provides the benefits and efficiencies of the cloud environment and cloud solution providers.

PREFERRED

The CMS strategic implementation that supports these requirements is CMS Cloud. Governance and management facilities include:

Cloud Environment Management

Controlling and managing the cloud environment requires deployment of cloud management tools.

In aggregate, these tools must provide the following capabilities:

Cloud virtual resource management
Cost management
Capacity management
License compliance management
Security management
Automation of virtual resource configuration management

The following subtopics address each of these cloud management tool capabilities.

Cloud Virtual Resource Management

Cloud resources are virtual resources temporarily allocated to the business owners on demand and by request. At some point when a business owner determines that it no longer requires these resources, the business owner may release the resources back to the CSP’s available resource pool. If the business owner fails to do this, the organization will continue to pay for resources that are no longer utilized. Even though the cloud resource capacity may seem endless, the applications need to be designed and configured to operate with a specific minimum and maximum capacity in mind. The capacity consumed should be monitored carefully to ensure that costs remain bounded within planned parameters. The CSP’s cloud resource management tools and reports are designed for these purposes. At a minimum, these tools must provide:

Resource provisioning
Resource de-provisioning
Resource accounting and inventory
Capacity management

Cloud Resource Life Cycle

There are six states in the cloud resource life cycle:

Provisioned virtual resources – have been created either from previously cleansed resources that were made available for use or from archived resources that must be restored to operational status. Before making any resources operational, preparatory steps (such as enabling encryption) may be necessary.
Operational resources – have been provisioned and are running normally.
Elastic resources – have already been provisioned but may be burst or made elastic (as needed) by adding or removing capacity. The last atomic unit of a resource may not be removed except during resource de-provisioning.
Archived resources – no longer occupy execution resources but do occupy storage. These resources may be provisioned again in the future.
Cleansed resources – were previously used by CMS and have been wiped based on CMS ARS requirements. Where wiping is impractical, such as with Solid-State Drive (SSD) technology, resources must have been previously encrypted and data handled in accordance with CMS ARS requirements.
De-provisioned resources – have been fully reclaimed by the CSP’s hypervisor systems.
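
The six states above can be sketched as a small state machine. This is a minimal illustration, not a CMS or CSP implementation: the state names come from the list, but the ALLOWED transition table is an assumption inferred from the descriptions (for example, that both cleansed and archived resources can be provisioned again), and the names State, ALLOWED, and transition are hypothetical.

```python
from enum import Enum


class State(Enum):
    PROVISIONED = "provisioned"
    OPERATIONAL = "operational"
    ELASTIC = "elastic"
    ARCHIVED = "archived"
    CLEANSED = "cleansed"
    DEPROVISIONED = "de-provisioned"


# Assumed transitions: the text names the states but does not list every edge.
ALLOWED = {
    State.PROVISIONED: {State.OPERATIONAL},
    State.OPERATIONAL: {State.ELASTIC, State.ARCHIVED, State.CLEANSED},
    State.ELASTIC: {State.OPERATIONAL},
    State.ARCHIVED: {State.PROVISIONED, State.CLEANSED},
    State.CLEANSED: {State.PROVISIONED, State.DEPROVISIONED},
    State.DEPROVISIONED: set(),
}


def transition(current: State, target: State) -> State:
    """Return target if the move is allowed; otherwise raise ValueError."""
    if target not in ALLOWED[current]:
        raise ValueError(f"{current.value} -> {target.value} is not allowed")
    return target
```

Encoding the transitions as data keeps the rule set easy to review and adjust if a CSP's actual life cycle differs from these assumptions.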

Resource Provisioning

Once CMS has contracted for cloud services, actual provisioning of IaaS and PaaS services within the cloud must adhere to the CMS TRA, the CMS ARS, the CMS ISPG Cloud Computing Standard (RMH Volume III, Version 1.0), and any other federally mandated policies.

Table - Resource Provisioning Methods presents the three common methods for provisioning resources in existing clouds. Not all vendors accommodate every method. Some of the methods may be accommodated through different means and capability levels.

Table - Resource Provisioning Methods
Method	Benefits	Risks
Manually through the CSP’s Web interface	Improved controls and awareness of services; better control over resource costs	Cannot provision instantaneously, so short-term performance degradation may occur; inability to respond effectively to bursts in activity due to manual monitoring
Programmatically through cloud API calls embedded in scripts	Can quickly meet demand and scale accordingly (up or down); cost efficient	Must be synchronized with the CSP’s capabilities; increases vendor lock-in; must be performed securely to prevent external exploitation of CSP resources and CMS data at CMS expense
Automatically by a cloud management system based on rules	Can quickly meet demand and scale accordingly (up or down); cost efficient; no potential conflict due to configuration management	The CSP is responsible for managing resources and SLAs; automatic provisioning can lead to spiraling costs

Whether a CSP or a CMS application owner manages the resource provisioning, the capability to monitor and manage the resources’ performance characteristics must be made available to both parties. This helps ensure consistent expectations for the quality and levels of service.

It is essential to carefully control the use of APIs for resource provisioning. Some form of logging, auditing, and notification should support the automated provisioning of resources. Uncontrolled use of automated provisioning routines could have devastating consequences for service performance and cost. Thus, rigorous control over the capability to perform automated provisioning and de-provisioning is mandatory.
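
One way to attach logging, auditing, and notification to a provisioning routine is a wrapper that records every call. The sketch below is illustrative only: provision_vm stands in for a real CSP API call, the sent list stands in for a notification hook such as e-mail or ticketing, and all names are hypothetical.

```python
import functools
import json
import logging
from datetime import datetime, timezone
from typing import Callable

audit_log = logging.getLogger("provisioning.audit")


def audited(action: str, notify: Callable[[dict], None]):
    """Decorator: log and notify on every call to a provisioning routine."""
    def wrap(func):
        @functools.wraps(func)
        def inner(requester: str, **params):
            record = {
                "time": datetime.now(timezone.utc).isoformat(),
                "action": action,
                "requester": requester,
                "params": params,
            }
            try:
                result = func(requester, **params)
                record["outcome"] = "success"
                return result
            except Exception as exc:
                record["outcome"] = f"failed: {exc}"
                raise
            finally:
                # Runs on success and failure, so no call goes unrecorded.
                audit_log.info(json.dumps(record, sort_keys=True))
                notify(record)
        return inner
    return wrap


sent: list[dict] = []  # stand-in for an e-mail or ticketing hook


@audited("provision_vm", notify=sent.append)
def provision_vm(requester: str, *, size: str, count: int) -> list[str]:
    # Placeholder for a real CSP API call.
    return [f"vm-{i}" for i in range(count)]
```

Calling provision_vm("alice", size="small", count=2) returns the placeholder identifiers and leaves one audit record in sent, whether or not the underlying call succeeds.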

CSPs offer tools to gauge performance of their service offerings. Additional tools may be considered to help CMS measure performance characteristics of a CSP’s full services. Aggregation and analysis of CSP data along with CMS data is a likely use case.

System owners must clearly understand the CSP provisioning policies and the physical location of the resources. CSPs have different allocation policies and practices for normal provisioning and elasticity capabilities. Some CSPs may support elasticity within the confines of a specific data center, and only migrate to alternative sites based on complete outages; others may spread provisioned resources over several data centers as a matter of course.

Resource De-Provisioning

Disposing of technical resources after retiring a business capability is not always a clear and simple process. It is even less so in a cloud, where previously allocated resources return to a pool to be reused minutes or even seconds later. Identifying all dependent resources can be particularly difficult in a cloud environment. In private data centers, these types of remnants do not typically have the significant impact on resources or expenses that they do in a private or community cloud environment, where services are billed based on metered consumption and resources are shared among the cloud’s tenants.

CSPs supply resource management tools to manage a business owner’s resource consumption. Depending on the CSP’s infrastructure capabilities, pay-as-you-go services can be very dynamic in nature and difficult to track. In cloud environments, where resources may be manually provisioned for temporary use through bursting or elasticity, the business owners must remember to de-provision the resources when they are no longer required. A follow-up audit may be necessary because CMS may not be aware that copies of data, via mirroring and similar redundancy services, may reside on different logical units of the CSP’s infrastructure.

Although de-provisioning can be performed either manually or through automation, automated de-provisioning is generally recommended. Manual de-provisioning is not properly aligned with the security architecture principle of decreased system footprint. Manual de-provisioning also increases the risk of unnecessary costs and is likely to be more error prone than automated de-provisioning. To operate successfully, both automated and manual de-provisioning require reliable system inventory and architectural information.
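
The dependence of de-provisioning on a reliable inventory can be illustrated with a short sketch. It assumes a hypothetical inventory in which each resource records its owning application and the resources it depends on, and it returns the retired application's resources in an order that releases dependents before the resources they rely on. The Resource fields and the function name are illustrative, not part of any CSP tool.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Resource:
    resource_id: str
    application: str
    depends_on: tuple[str, ...] = ()  # IDs of resources this one relies on


def deprovision_order(inventory: list[Resource], retired_app: str) -> list[str]:
    """List the retired application's resources, dependents first."""
    owned = {r.resource_id: r for r in inventory if r.application == retired_app}
    order: list[str] = []
    visited: set[str] = set()

    def visit(rid: str) -> None:
        if rid in visited or rid not in owned:
            return
        visited.add(rid)
        # Release everything that depends on this resource before it.
        for other in owned.values():
            if rid in other.depends_on:
                visit(other.resource_id)
        order.append(rid)

    for rid in owned:
        visit(rid)
    return order
```

For an inventory in which a VM depends on a storage volume, the VM is listed before the volume, and resources owned by other applications are left untouched.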

License Management

Provisioning and de-provisioning virtual machines in an IaaS cloud may generate software license issues. The business application must have sufficient licenses to cover its fully elastic configuration (unless previously negotiated). If bursting is enabled, there may be implications for software that is licensed per CPU or core.
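
As a rough worked example of sizing licenses for the fully elastic configuration, the sketch below counts per-core licenses at the maximum instance count rather than the baseline. The figures and the cores_per_license parameter are hypothetical; actual terms depend on the negotiated license.

```python
def licenses_required(max_instances: int, cores_per_instance: int,
                      cores_per_license: int = 1) -> int:
    """Licenses needed to cover every core at the maximum elastic size."""
    if min(max_instances, cores_per_instance, cores_per_license) <= 0:
        raise ValueError("all arguments must be positive")
    total_cores = max_instances * cores_per_instance
    # Ceiling division: a partially used license still counts as one.
    return -(-total_cores // cores_per_license)


# Hypothetical application: 4 instances at baseline, bursting to 10,
# 8 cores each, with each license covering 2 cores.
baseline = licenses_required(4, 8, cores_per_license=2)
fully_elastic = licenses_required(10, 8, cores_per_license=2)
```

In this hypothetical case the baseline needs 16 licenses while the fully elastic configuration needs 40, which is the number the application would have to hold unless other terms were negotiated.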

Although open-source software at CMS may be freely employed, there can be associated licensing fees for support, which may limit the number of supported instances of that software.

Automated Virtualized Environments and APIs

CSPs operate an advanced, virtualized environment, using tools to manage multi-tenancy, self-service, and resource consumption tracking that provide expected cloud benefits. Due to the economics of operating a cloud data center, CSPs rely heavily on automation to reduce their labor costs and permit rapid bursting and elasticity.

The need for highly automated environments has driven CSPs to provide APIs to their environments. Essentially, a cloud API allows a cloud consumer to control cloud resources via a programmatic interface instead of using a manual, cloud consumer portal. For example, APIs typically allow for such activities as starting VMs and expanding capacity.

The cloud industry has not yet standardized on APIs for managing cloud environments. Instead, other providers are supporting de facto standards such as the Amazon Elastic Compute Cloud (EC2) APIs in both commercial and open-source products (such as Eucalyptus). Currently, it is too early to standardize on a vendor API because of the immature state of the cloud industry. As a result, programming to a cloud vendor’s API involves vendor lock-in. Architectural design techniques may mitigate this risk to some degree, and business owners should consider such mitigations when deploying business applications to the cloud. For example, Apache Libcloud provides API wrappers for proprietary cloud APIs, allowing consumers to program to a compatibility layer instead of vendor-specific APIs. Although this approach may be preferable, using API wrappers requires thorough technical analysis of the business application’s requirements.
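
The compatibility-layer idea can be sketched without reference to any particular vendor. In the example below, application code is written against a small ComputeDriver interface, and InMemoryDriver stands in for a vendor-specific adapter. All names are hypothetical and do not reflect the API of Apache Libcloud or any CSP; a real adapter would translate these calls into the vendor's API.

```python
from typing import Protocol


class ComputeDriver(Protocol):
    """Vendor-neutral operations the application is allowed to use."""
    def start_vm(self, name: str, size: str) -> str: ...
    def stop_vm(self, vm_id: str) -> None: ...
    def list_vms(self) -> list[str]: ...


class InMemoryDriver:
    """Stand-in adapter; a real one would call the CSP's API."""
    def __init__(self) -> None:
        self._vms: dict[str, str] = {}
        self._next = 0

    def start_vm(self, name: str, size: str) -> str:
        self._next += 1
        vm_id = f"vm-{self._next}"
        self._vms[vm_id] = name
        return vm_id

    def stop_vm(self, vm_id: str) -> None:
        del self._vms[vm_id]

    def list_vms(self) -> list[str]:
        return list(self._vms)


def scale_to(driver: ComputeDriver, name: str, size: str, target: int) -> list[str]:
    """Start or stop VMs until exactly `target` are running."""
    running = driver.list_vms()
    while len(running) < target:
        running.append(driver.start_vm(name, size))
    while len(running) > target:
        driver.stop_vm(running.pop())
    return running
```

Because scale_to depends only on the interface, switching CSPs would mean writing a new adapter rather than changing application logic, although features that only one vendor offers would still fall outside the common interface.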

Cost Management Implications

Given that dynamic resource management is a key element of the cloud value proposition, cost management and its architectural impact are critical elements of cloud resource management.

The cloud’s cost management component must be able to provide detailed resource utilization billing, allowing CMS data center operations to identify the costs associated with CMS business applications. In an IaaS cloud, this means accounting for CPU, memory, network bandwidth, and other agreed-upon billable resources. In a PaaS cloud, the billable resources may be number of user accounts, number of customer accounts, or other metered resources.
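
Attributing metered usage to business applications can be sketched as a simple aggregation. The metric names and unit rates below are hypothetical placeholders, not actual CSP pricing, and the record layout (application, metric, quantity) is an assumption made for illustration.

```python
from collections import defaultdict
from decimal import Decimal

# Hypothetical unit rates in dollars; real rates come from the CSP contract.
RATES = {
    "cpu_hours": Decimal("0.05"),
    "memory_gb_hours": Decimal("0.01"),
    "network_gb": Decimal("0.09"),
}


def cost_by_application(usage_records):
    """Sum metered charges per application from (app, metric, quantity) records."""
    totals = defaultdict(Decimal)
    for app, metric, quantity in usage_records:
        # Decimal avoids binary floating-point rounding in currency totals.
        totals[app] += RATES[metric] * Decimal(str(quantity))
    return dict(totals)


records = [
    ("app-a", "cpu_hours", 100),
    ("app-a", "network_gb", 10),
    ("app-b", "memory_gb_hours", 200),
]
costs = cost_by_application(records)
```

With these placeholder rates, app-a is charged 5.90 and app-b 2.00 for the sample records.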

Pay-as-you-go billing may have an architectural impact on custom business applications. For example, it is possible that certain resource tradeoffs (such as memory vs. disk I/O) will lower the overall costs. CMS application developers and contractors should provide alternative architectures, when appropriate, to reduce overall costs.

Elasticity, while an attractive concept, has financial impacts that have not been fully explored at CMS. CMS business owners will need to include the cost of resource elasticity in their estimated operational cost for a business application. Cost models should be capable of translating pay-as-you-go billing to existing billing models until pay-as-you-go becomes acceptable in the CMS financial environment. For example, it may be feasible to have a model that uses prior-year average (or even moving average) billing to determine future costs once applications achieve a resource utilization steady state.
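
The moving-average model mentioned above can be sketched in a few lines. The billing figures are made up for illustration, and the function name and window parameter are hypothetical; the approach only makes sense once utilization has reached a steady state, as the text notes.

```python
def moving_average_forecast(monthly_bills: list[float], window: int = 12) -> float:
    """Forecast next month's bill as the mean of the last `window` bills."""
    if window <= 0:
        raise ValueError("window must be positive")
    if len(monthly_bills) < window:
        raise ValueError("not enough billing history for the chosen window")
    recent = monthly_bills[-window:]
    return sum(recent) / window


# Made-up monthly pay-as-you-go bills, oldest first.
bills = [100, 120, 110, 130]
next_month = moving_average_forecast(bills, window=3)
annual_estimate = next_month * 12
```

Here the three most recent bills average 120, giving an annual estimate of 1440 that could be carried into an existing fixed-budget billing model.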

Resource elasticity does not normally include the licensing cost associated with COTS software. Provisioning a VM does not mean that the OS and application software are properly licensed. It is CMS’s responsibility to ensure compliance with licenses. Although some CSPs offer license management, it is typically not the CSP’s responsibility to do so.

Capacity Management

The management tools must provide authorized users a snapshot or “dashboard” view, at any given time, of the capacity management statistics, such as percentage of resources utilized, available capacity, and growth trends, over a range of durations (day, week, month, year). These tools should have threshold settings that take some automated action when limits are exceeded. These actions could range from simple notifications (alarms) to automatically provisioning more VM resources or VMs (at predetermined allocation increments). Many of these features depend on the CSP and its offered tool set. Different CSPs offer a variety of tool sets based on the cloud framework and vendor-specific technology in use. In addition to thresholds, “caps” should be considered. Caps are soft thresholds put in place on various cloud resources (such as CPUs, CPU cycles, terabytes of storage, memory, and bandwidth) that may not be exceeded without a certain level of CMS approval. This approach would prevent a spiraling of costs during an unusual surge situation, much like a circuit breaker on an electrical circuit.
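
The interaction between thresholds and caps can be sketched as a small decision function. The threshold values, increment, and cap below are hypothetical, as are the Policy and Action names; an actual CSP tool set would expose its own settings.

```python
from dataclasses import dataclass
from enum import Enum


class Action(Enum):
    NONE = "none"
    NOTIFY = "notify"
    SCALE_OUT = "scale out"
    REQUIRE_APPROVAL = "require approval"


@dataclass(frozen=True)
class Policy:
    notify_at: float  # utilization fraction that raises an alarm
    scale_at: float   # utilization fraction that triggers provisioning
    increment: int    # units added per automatic scale-out
    cap: int          # units that may not be exceeded without approval


def evaluate(used: int, allocated: int, policy: Policy) -> Action:
    """Decide what to do for the current utilization of one resource type."""
    utilization = used / allocated
    if utilization >= policy.scale_at:
        # The cap acts like a circuit breaker on automatic growth.
        if allocated + policy.increment > policy.cap:
            return Action.REQUIRE_APPROVAL
        return Action.SCALE_OUT
    if utilization >= policy.notify_at:
        return Action.NOTIFY
    return Action.NONE


policy = Policy(notify_at=0.7, scale_at=0.9, increment=2, cap=10)
```

With this policy, 8 of 8 units in use triggers an automatic scale-out to 10, while 10 of 10 in use requires approval because another increment would exceed the cap.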

Current CMS virtualization business rules prohibit over-subscription of virtual machines and resources in CMS Production Environments. This means that CSPs may not overburden physical machines under the assumption that workload overlap is unlikely. Each production VM must have enough capacity to run at 100 percent load (peak capacity) without impacting another production VM.
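
A check for the no-over-subscription rule can be sketched as a comparison between each host's physical capacity and the combined peak allocation of the VMs placed on it. The example considers CPU cores only and uses hypothetical host names and figures; a fuller check would apply the same comparison to memory and other resources.

```python
def oversubscribed_hosts(host_cores: dict[str, int],
                         vm_cores_by_host: dict[str, list[int]]) -> list[str]:
    """Return hosts whose VMs' combined peak core allocation exceeds capacity."""
    return sorted(
        host
        for host, vm_cores in vm_cores_by_host.items()
        if sum(vm_cores) > host_cores[host]
    )


# Hypothetical placement: physical cores per host and peak cores per VM.
host_cores = {"host-1": 32, "host-2": 16}
vm_cores_by_host = {"host-1": [8, 8, 16], "host-2": [8, 8, 4]}
violations = oversubscribed_hosts(host_cores, vm_cores_by_host)
```

In this placement host-1 is fully allocated but compliant, while host-2 is flagged because its VMs could demand 20 cores at peak against 16 physical cores.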

Cloud Service Management Support

Cloud operations and support should be the same as the ongoing operational support of any other system / technology in the CMS enterprise. In fact, cloud support should leverage existing support infrastructure and managed services, such as operations staff, help desks, and service management processes. The key distinction between data center and cloud service management support is the varying degrees of managed services and possibly different service providers / contractors for the IaaS and PaaS. This also means there is potentially a large set of stakeholders involved in communications. The larger communications component necessitates more coordination across processes and reporting to help CMS make informed decisions and provide necessary oversight to the CSP and associated contractor(s) regarding their contracted performance.

Service Level Management

All ongoing operations and support should follow a formal process framework, be adequately documented, and be part of a continuous process improvement regimen. SLAs with the CSP should be no different from SLAs with any other CMS data center hosting vendor that provides integrated support and services to CMS. With some cloud providers, IaaS SLAs may be the CSP’s responsibility while PaaS SLAs are the responsibility of an associated integration contractor or broker.

This separation of concerns requires better CMS oversight to assure appropriate management of service levels. The details of service level management are outside of the scope of this chapter, but are addressed by CMS Cloud: Defining Service Level Objectives (SLOs).

IT Processes and Best Practices

Using a CSP makes the integration of standard IT processes and procedures more complex. Problem management, incident management, change management, and configuration management remain critical processes, but often their responsibility is shared between the CSP and associated integration and application contractors. IaaS changes, patches, and configuration adjustments are the CSP’s responsibility. PaaS changes and configuration settings are typically the responsibility of the integration or application contractor. CMS is responsible for ensuring smooth identification, reporting, and resolution of issues.

TRA 2024 Revision 3.0 • General Distribution / Unclassified Information