Data Management
|
Active Data Warehouse |
Warehouse featuring high-performance transaction processing which supplies data for online processing and updating. |
Data Lakehouse |
A modern data architecture that creates a single platform by combining the key benefits of data lakes (large repositories of raw data in its original form) and data warehouses (organized sets of structured data). Specifically, data lakehouses enable organizations to use low-cost storage to store large amounts of raw data while providing structure and data management functions. |
Enterprise Data Mesh |
A CMS-wide connectivity facility that enables CMS enterprise data to be “shared in place” for consumption both within and outside of CMS. |
Enterprise Data Warehouse |
Any CMS data warehouse or data mart built for decision support. |
Data Use Agreement (DUA) |
A legally binding agreement between CMS and an external entity (e.g., contractor, private industry, academic institution, federal government agency, or state agency), used when the external entity requests use of CMS personally identifiable data covered by the Privacy Act of 1974. The agreement delineates the confidentiality requirements of the Privacy Act, security safeguards, and CMS’s data use policies and procedures. The DUA serves as both a means of informing data users of these requirements and a means of obtaining their agreement to abide by these requirements. The DUA also serves as a control mechanism by which CMS can track the location of data and the reason for the data’s release. A DUA requires that a System of Records (SOR) be in effect, which allows for the data to be disclosed. |
Information Exchange Agreement (IEA) |
A legally binding agreement between CMS and an external entity; Business Owners and Privacy Advisors work together to determine the terms of sharing PII with other federal or state agencies. |
Interconnection Security Agreement (ISA) |
A legally binding agreement between CMS and an external entity that defines the relationship between CMS information systems and external systems. |
Computer Matching Agreement (CMA) |
A CMA is created when CMS records are matched with records from another Federal or State agency and the results of such a match may have an adverse impact on an individual in relation to a Federal benefit program. |
Data Storage Services
|
Online Storage |
Online storage is in constant use in the data center and performs real-time data transactions for applications. Online storage consists of disk drive-based storage that resides in or is attached (direct or fabric) to a server. Direct-attached storage can be accessed only by the server to which it is attached. Fabric-attached storage enables all servers attached to the fabric to share the available storage resources, such as in a Network Attached Storage (NAS) or Storage Area Network (SAN) configuration. (The “fabric” is the hardware that connects workstations and servers to storage devices in a SAN.)
Online storage devices should allow high-speed access while at the same time providing data protection and security. High-speed access is achieved with high-speed Input/Output (I/O) across the network, system bus, and disk drive interfaces.
|
Offline Storage |
Commonly referred to as “archive” or “backup” storage. Offline storage typically consists of a tape drive or low-end disk drive (virtual tape). Offline storage backs up the data stored on both the online and online archive storage devices. Offline storage is designed to store data for long periods.
Because data is archived, offline storage appliances should focus on data accuracy, protection, and security.
|
Data Archiving |
Data Archiving (or data migration) is driven by proactive and efficient Information Lifecycle Management. This process involves relocating static, inactive and rarely accessed data from the production environment to a secure, reliable and more cost-effective storage/archival location. Data will be migrated or archived to different storage classes based on such factors as age of data, frequency of use, probability of access, size, usage profile, etc. Data can also be restored from the archival storage location to the production environment as needed. |
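A minimal Python sketch of the kind of tier-selection rule described in the Data Archiving entry above. The tier names, age thresholds, and access-count cutoffs are hypothetical illustrations, not CMS policy.

```python
from datetime import datetime, timedelta

# Hypothetical storage classes, ordered from most to least expensive.
TIERS = ["production", "nearline_archive", "offline_archive"]

def select_tier(last_accessed: datetime, accesses_last_90_days: int) -> str:
    """Pick a storage class from the age of the data and its frequency of use.

    Thresholds are illustrative only; a real Information Lifecycle Management
    policy would also weigh size, usage profile, and probability of access.
    """
    idle = datetime.now() - last_accessed
    if idle < timedelta(days=90) or accesses_last_90_days > 10:
        return "production"            # active data stays in place
    if idle < timedelta(days=365):
        return "nearline_archive"      # rarely accessed, still restorable quickly
    return "offline_archive"           # static data moves to cheapest storage

# Example: data untouched for two years with no recent reads is archived.
print(select_tier(datetime.now() - timedelta(days=730), 0))
```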
Direct Access Storage Device (DASD) |
Storage that is attached directly to a computer (typically a mainframe computer). This differentiates DASD from storage that is attached to a network, such as Network Attached Storage and Storage Area Networks. |
Network Attached Storage (NAS) |
Describes a complete storage system that is attached to a traditional data network.
NAS clusters or grids enable scaling of capacity and/or performance into the multi-petabyte (PB) range, with bandwidth in the tens of GB per second at up to 1,000,000 IOPS.
NAS consists of one or more file servers. These file servers serve Windows clients via the Common Internet File System (CIFS) protocol and UNIX/Linux clients via the Network File System (NFS) protocol. Storage on these file servers is provided to the remote system over Ethernet.
In most cases, NAS is less expensive to purchase and less complex to operate than a SAN; however, a SAN can provide better performance and a larger range of configuration options.
|
Storage Area Network (SAN) |
A Storage Area Network is a network specifically dedicated to the task of transporting data for storage and retrieval. SAN architectures are alternatives to storing data on disks directly attached to servers or storing data on Network Attached Storage devices that are connected through general purpose networks.
A SAN is used to access block-level data. The remote system accesses a Logical Unit Number (LUN) as though it were a local disk. The remote system can put a file system on the storage or, in the case of databases, use it as a raw device.
Storage Area Networks are traditionally connected over Fibre Channel (FC) networks. SANs have also been built using SCSI (Small Computer System Interface) technology. An Ethernet network that is dedicated solely to storage purposes would also qualify as a SAN. Internet Small Computer Systems Interface (iSCSI) is a SCSI variant that encapsulates SCSI data in Transmission Control Protocol (TCP) packets and transmits them over Internet Protocol (IP) networks.
SAN connectivity options and protocols include Fibre Channel (FC), Fibre Channel over IP (FCIP), Fibre Channel over Ethernet (FCoE), and SCSI over Ethernet. Fibre Channel is the preferred connectivity protocol. Each SAN should consist of two (2) independent fabrics (connections) for high availability.
|
Disk Storage Virtualization |
Disk Storage Virtualization carves up physical disks into smaller chunks that are then used to build traditional RAID (Redundant Array of Independent Disks) constructs. Storage virtualization abstracts the concept of “Disk” to “Logical Unit Numbers,” or LUNs. LUNs may be treated exactly the same as physical disks. Each LUN is assembled from parts of one or more physical disks and may be arranged as RAID stripes or other logical organizations. LUNs may be “Thin Provisioned”.
The key advantage of this technology is that the storage administrator does not have to think about which business applications might be sharing a given set of disks in a RAID set, because virtually every RAID set uses some portion of every disk in the array. Disk virtualization also makes it possible to move individual chunks between different tiers of storage within a single array: infrequently accessed chunks can be migrated to lower-cost tiers, and if the data are then referenced often, the storage administrator can move them dynamically back to the higher-performance storage.
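A minimal Python sketch of the chunk-level abstraction described above: a LUN assembled round-robin from fixed-size chunks on several physical disks, so the LUN ends up using some portion of every disk. The chunk size and disk names are illustrative assumptions.

```python
# Illustrative only: a virtualized "disk" (LUN) built from fixed-size chunks
# taken round-robin from several physical disks.
CHUNK_MB = 256

def build_lun(lun_size_mb: int, physical_disks: list[str]) -> list[tuple[str, int]]:
    """Return a chunk map: (disk, chunk index on that disk) for each LUN chunk."""
    chunks_needed = -(-lun_size_mb // CHUNK_MB)          # ceiling division
    next_free = {disk: 0 for disk in physical_disks}
    chunk_map = []
    for i in range(chunks_needed):
        disk = physical_disks[i % len(physical_disks)]   # round-robin striping
        chunk_map.append((disk, next_free[disk]))
        next_free[disk] += 1
    return chunk_map

# A 1 GB LUN striped across four physical disks.
print(build_lun(1024, ["disk0", "disk1", "disk2", "disk3"]))
```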
The main business driver for virtualization is consolidating multiple storage systems and services into a single environment. Virtualization reduces costs by sharing hardware, infrastructure, and administration. The benefits of virtualization include:
- Increased hardware utilization
- Greater flexibility in resource allocation
- Reduced power requirements
- Lower management costs
- Lower cost of ownership
- Administrative and resource boundaries between applications on a system.
There are three (3) Virtualization levels:
- Fabric Level Virtualization: A SAN fabric enables any-server-to-any-storage device connectivity through the use of protocols such as Fibre Channel switching technology. (A “fabric” is the hardware that connects workstations and servers to storage devices in a SAN.) The SAN fabric is zoned to allow the virtualization appliances to see the storage subsystems and the servers to see the virtualization appliances; servers would not be able to directly see or operate on the storage subsystems.
- Storage Subsystem Level Virtualization: RAID subsystems are an example of virtualization at the storage level.
- Server Level Virtualization: Abstraction at the server level is achieved by means of the logical volume management of the operating systems on the servers.
|
Redundant Array of Independent Disks (RAID) |
Used to create LUNs that span multiple physical disks. RAID can enhance the reliability of storage compared to single disks, and also decouples the storage from the details of the physical disks.
There are several forms of RAID:
- RAID 0 = Stripe across several disks. This offers the advantage of creating LUNs larger than a single physical disk, as well as greater performance than a single disk. The disadvantage of RAID 0 is that a single disk failure corrupts the LUN, which must then be rebuilt from backup.
- RAID 1 = Mirror between pairs of disks. In RAID 1, the same data is stored on two disks. If one disk fails, the other disk still has all of the data. RAID 1 has the disadvantage of requiring twice the amount of physical storage.
- RAID 5 = Stripe with parity. In RAID 5, the data is striped as in RAID 0, but parity information (equivalent to one disk’s capacity) is also written and distributed across the drives. If a single disk fails, the data is not lost because it can be recreated from the remaining data and the parity (a parity sketch follows this list). RAID 5 has the disadvantage of slower write speeds due to the parity calculation.
- RAID 6 = Striped set with dual distributed parity. RAID 6 provides fault tolerance from two drive failures; the array continues to operate with up to two failed drives. This makes larger RAID groups more practical, especially for high-availability systems, and becomes increasingly important because large-capacity drives lengthen the time needed to recover from the failure of a single drive. Single-parity RAID levels are vulnerable to data loss until the failed drive is rebuilt: the larger the drive, the longer the rebuild will take. Dual parity gives time to rebuild the array without putting the data at risk if a (single) additional drive fails before the rebuild completes.
- A hybrid level, RAID 10, combines the performance of RAID 0 (striping) with the reliability of RAID 1 (mirroring).
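As referenced in the RAID 5 item above, a toy Python sketch of striping with parity: parity is the bitwise XOR of the data blocks, so a single lost block can be recomputed from the surviving blocks and the parity. The block contents are made up for illustration.

```python
# Toy illustration of striping with parity: parity is the XOR of the data
# blocks, so any single lost block can be recomputed from the survivors.
def xor_blocks(blocks):
    parity = bytes(len(blocks[0]))
    for block in blocks:
        parity = bytes(a ^ b for a, b in zip(parity, block))
    return parity

data = [b"CLAIMS__", b"PAYMENTS", b"ACCOUNTS"]   # stripes on three data disks
parity = xor_blocks(data)                        # parity stored on the array

# Simulate losing the second disk and rebuilding it from the survivors + parity.
rebuilt = xor_blocks([data[0], data[2], parity])
assert rebuilt == data[1]
print(rebuilt)   # b'PAYMENTS'
```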
|
Thin Provisioning |
Thin provisioning allows the storage administrator to over-commit storage on a per-volume basis as long as the amount of data that is actually written does not exhaust the free space in the array or in a particular pool of storage from which the thin-provisioned volumes are backed.
With thin provisioning, one can create LUNs that exceed the total amount of physical storage present. The storage appliance allocates only as much physical storage as is actually used. For instance, suppose a 500GB LUN is created but only 400GB of it is in use: with thin provisioning, only 400GB of physical storage would be allocated to the LUN, and additional storage would be allocated only if the amount of storage used grows beyond 400GB.
Thin provisioning should be approached carefully. Heavily over-committing volumes on a relatively small number of disks can lead to poor performance. The combination of disk spindle virtualization and thin provisioning can mitigate these performance concerns.
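A minimal Python sketch of the thin-provisioning behavior described above, reusing the 500GB/400GB example. The class, block size, and method names are illustrative assumptions, not a vendor API.

```python
class ThinVolume:
    """Illustrative thin-provisioned LUN: physical blocks are allocated only
    when written, so the logical size can exceed what is physically used."""
    BLOCK_MB = 1

    def __init__(self, logical_size_mb: int):
        self.logical_size_mb = logical_size_mb
        self.allocated = {}                    # logical block -> written data

    def write(self, logical_block: int, data: bytes) -> None:
        if logical_block >= self.logical_size_mb:
            raise ValueError("write beyond logical size of volume")
        self.allocated[logical_block] = data   # allocate on first write only

    def physical_usage_mb(self) -> int:
        return len(self.allocated) * self.BLOCK_MB

# The 500GB-vs-400GB example from the entry above, expressed in MB:
# a 500,000 MB LUN with 400,000 MB written consumes 400,000 MB of storage.
vol = ThinVolume(logical_size_mb=500_000)
for block in range(400_000):
    vol.write(block, b"\x00")
print(vol.physical_usage_mb())   # 400000
```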
|
De-Duplication |
De-duplication reduces the physical storage that is required by identifying duplicate files or blocks. These duplicates are then referenced by a pointer to an original and the space associated with the duplicate is available for other writes. This technology is most common in backup or archiving applications. It may fit into other areas such as server virtualization, where there are many identical blocks or files across multiple operating system instances. |
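A minimal Python sketch of hash-based, block-level de-duplication as described above: duplicate blocks are replaced with pointers (here, SHA-256 digests) to a single stored copy. The block contents are made up for illustration.

```python
import hashlib

def deduplicate(blocks: list[bytes]) -> tuple[dict[str, bytes], list[str]]:
    """Store each unique block once, keyed by its hash; duplicates become
    pointers (hashes) to the original, freeing their space for other writes."""
    store: dict[str, bytes] = {}
    pointers: list[str] = []
    for block in blocks:
        digest = hashlib.sha256(block).hexdigest()
        if digest not in store:          # first copy is stored physically
            store[digest] = block
        pointers.append(digest)          # later copies are just references
    return store, pointers

blocks = [b"OS image block", b"claim data", b"OS image block"]
store, pointers = deduplicate(blocks)
print(len(blocks), "logical blocks,", len(store), "physical blocks")  # 3 vs 2
```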
File Transfer
|
Connect:Direct |
An IBM Sterling multi-platform file transfer application capable of transferring files as well as executing job scripts. Connect:Direct also describes the proprietary protocol used to transfer files securely from system to system. Also known as Network Data Mover (NDM). |
File Transfer Protocol (FTP) |
A TCP/IP-based protocol for transferring files. FTP is popular but insecure; secure alternatives at CMS include S/FTP or FTP/S. |
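For illustration only, a minimal S/FTP transfer sketch using the third-party Python library paramiko (not a tool named in this document); the host, credentials, and file paths are placeholders.

```python
import paramiko

# Placeholder host, credentials, and paths; illustrative only.
transport = paramiko.Transport(("sftp.example.com", 22))
transport.connect(username="svc_account", password="change-me")
sftp = paramiko.SFTPClient.from_transport(transport)

sftp.put("/tmp/claims_extract.csv", "/incoming/claims_extract.csv")  # upload
sftp.close()
transport.close()
```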
IBM Sterling Business Integration Suite (ISIS) |
An IBM file transfer management suite used at CMS, capable of communicating with S/FTP, FTP/S, and Connect:Direct protocols. |
TIBCO Platform Server |
A managed file transfer application used at CMS, able to communicate with S/FTP, FTP/S, and Connect:Direct protocols. |
Trigger Scripts |
A script executed once the Sweeps system has determined that a transferred file requires further processing. Scripts can be applications, Job Control Language (JCL), Restructured Extended Executor (REXX), etc. |
Analytics and Business Intelligence
|
Ad Hoc Queries |
Queries formulated by users on the spur of the moment and therefore unpredictable in nature. |
Dashboards and Scorecards |
Dashboards are a subset of reporting and include the ability to publish formal, web-based reports with intuitive displays of information, including dials, gauges, sliders, check boxes, and traffic lights. They are designed to deliver historical, current, and predictive information typically represented by key performance indicators (KPI), and they use visual cues to focus user attention on important conditions, trends, and exceptions. Scorecards take the dashboard metrics a step further by applying them to a strategy map that aligns KPIs with a strategic objective. A scorecard implies the use of a performance management methodology such as Balanced Scorecard, Six Sigma, or the Capability Maturity Model. |
Data Mart |
A specific subset of a data warehouse for specific analyses needed by a specific group of users. |
Data Mining |
Enables users to conduct exploratory and predictive analytics to extract meaning from seemingly unrelated data by using parameters to search for relationships and patterns. |
Data Quality Assurance |
Process of testing data for consistency, completeness, and fitness for publishing to the user community. |
Data Visualization |
Provides the ability to display numerous aspects of data more efficiently by using interactive pictures and charts instead of rows and columns. The main goal of data visualization is to communicate information clearly and efficiently through graphical means. |
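For illustration, a minimal Python sketch that renders a small, made-up summary as a chart using matplotlib; the regions and counts are hypothetical.

```python
import matplotlib.pyplot as plt

# Hypothetical summary data that would otherwise be read as rows and columns.
regions = ["East", "Midwest", "South", "West"]
claim_counts = [120_000, 95_000, 160_000, 110_000]

plt.bar(regions, claim_counts)            # the same facts, shown graphically
plt.title("Claims processed by region")   # made-up figures for illustration
plt.ylabel("Claim count")
plt.show()
```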
Data Warehouse (DW) |
A collection of data serving as the basis for informational processing and created for decision support. A DW is subject-oriented, integrated, non-volatile, and time-variant. |
Enterprise User Administration (EUA) |
A CMS system used to manage enterprise user IDs and passwords.
Administrators enter access requests using an EUA workflow system, and these requests are forwarded to approvers. Upon approval, the system automatically grants the requested access.
|
Geographic Information System (GIS) |
A type of location intelligence revealing trends and patterns using spatial and geographic relationships in the data. For instance, Medicare data at the national, state, zip code, or congressional district level is displayed in a graphical map format with data roll-up and drill-down capabilities integrated into the map itself. |
Hybrid Online Analytical Processing (HOLAP) |
Analytical processing that combines ROLAP and MOLAP techniques to provide the benefits of both. Some HOLAP tools provide the capability to drill through from aggregated data stored in multi-dimensional cubes to detailed data stored in relational tables. |
Lightweight Directory Access Protocol (LDAP) |
An application protocol for querying and modifying the data of directory services implemented in Internet Protocol (IP) networks. |
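A minimal sketch of an LDAP directory query using the third-party Python library ldap3 (not named in this document); the server, bind DN, base DN, and filter are placeholders.

```python
from ldap3 import Server, Connection, SUBTREE

# Placeholder server, bind DN, and base DN; illustrative only.
server = Server("ldap://directory.example.com")
conn = Connection(server, user="cn=reader,dc=example,dc=com",
                  password="change-me", auto_bind=True)

# Query the directory for one user's common name and mail attributes.
conn.search(search_base="dc=example,dc=com",
            search_filter="(uid=jdoe)",
            search_scope=SUBTREE,
            attributes=["cn", "mail"])
for entry in conn.entries:
    print(entry)
conn.unbind()
```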
Multidimensional Online Analytical Processing (MOLAP) |
A form of OLAP that works with multidimensional array storage rather than a relational database. MOLAP requires the pre-computation and storage of information in multi-dimensional cubes for slicing and dicing. The data in the cubes is usually aggregated. |
Online Analytical Processing (OLAP) |
While standard and ad hoc query tools are typically used to answer questions like “What happened?” and “When and where did it happen?” OLAP tools are used to answer questions like “Why did it happen?” and to perform “What if?” analysis. Also known as “slicing and dicing” analysis, OLAP allows power users to see facts (numerical, typically additive numbers like claims, payment amounts, and account balances) almost instantaneously regrouped, re-aggregated, and re-sorted according to any specified dimension (descriptive elements like time, region, claim type, or diagnosis). OLAP methodology can be implemented in ROLAP, MOLAP, or HOLAP depending upon data storage requirements. |
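A minimal Python sketch of the slicing-and-dicing idea using pandas: a toy fact table of payments is re-grouped and re-aggregated along different dimensions. The data and column names are made up for illustration.

```python
import pandas as pd

# Toy fact table: claim payment amounts with descriptive dimensions.
claims = pd.DataFrame({
    "region":     ["East", "East", "West", "West", "West"],
    "claim_type": ["A",    "B",    "A",    "A",    "B"],
    "year":       [2023,   2023,   2023,   2024,   2024],
    "payment":    [100.0,  250.0,  80.0,   120.0,  300.0],
})

# "Slice and dice": re-aggregate the payment fact along chosen dimensions.
by_region_type = claims.pivot_table(values="payment", index="region",
                                    columns="claim_type", aggfunc="sum")
by_year = claims.groupby("year")["payment"].sum()

print(by_region_type)
print(by_year)
```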
Predictive Modeling |
Answers questions about what is likely to happen next. Using various statistical models, these tools attempt to predict the likelihood of attaining certain metrics in the future, given various existing and future conditions. |
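A minimal Python sketch of predictive modeling using scikit-learn's logistic regression: a model is fit to made-up historical records and then estimates the likelihood of an outcome under a new condition. The features and outcomes are entirely hypothetical.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

# Toy historical data: two made-up features per record and a binary outcome.
X = np.array([[1.0, 0.2], [2.0, 0.1], [0.5, 0.9], [3.0, 0.4], [0.3, 1.2]])
y = np.array([0, 0, 1, 0, 1])   # 1 = the metric of interest was attained

model = LogisticRegression().fit(X, y)

# Predicted likelihood of attaining the metric under a new condition.
new_condition = np.array([[1.5, 0.8]])
print(model.predict_proba(new_condition)[0, 1])
```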
Relational Online Analytical Processing (ROLAP) |
Works directly with relational databases and stores database and dimension tables as relational tables. This methodology relies on manipulating the data stored in the relational database to produce the appearance of traditional OLAP’s slicing and dicing functionality. |
Role-Based Access Control (RBAC) |
A security approach that restricts system access to authorized users. RBAC is used in BI server administration. |
Role-Based Authorization |
Process in which users are granted access rights based on their primary role (group to which a BI user belongs). In the CMS BI Environment, roles are divided into the following user classifications, depending upon the BI application: standard users, power users, BI developers, and BI analysts. |
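A minimal Python sketch of role-based authorization as described in the two entries above: access is granted based solely on the user's primary role. The permission names are hypothetical; the role names follow the user classifications listed above.

```python
# Illustrative role-to-permission mapping; permissions are made up for the example.
ROLE_PERMISSIONS = {
    "standard_user": {"run_standard_reports"},
    "power_user":    {"run_standard_reports", "run_ad_hoc_queries"},
    "bi_analyst":    {"run_standard_reports", "run_ad_hoc_queries", "build_dashboards"},
    "bi_developer":  {"run_standard_reports", "run_ad_hoc_queries",
                      "build_dashboards", "publish_reports"},
}

def is_authorized(role: str, action: str) -> bool:
    """Grant access based solely on the user's primary role."""
    return action in ROLE_PERMISSIONS.get(role, set())

print(is_authorized("power_user", "run_ad_hoc_queries"))   # True
print(is_authorized("standard_user", "publish_reports"))   # False
```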
Standard/Ad Hoc Reporting |
Provides the ability to create formatted and interactive reports with highly refined scheduling capabilities for offline batch processing. The ad hoc query capability enables users to ask their own questions of a set of data, without relying on IT staff to create a report. Most BI tools have a robust semantic layer enabling users to navigate available data sources. In addition, these tools offer query governance, security, and auditing capabilities to ensure that queries perform well, making only the appropriate data available to users based on user role and access rights. |