Databases and Knowledge Bases
Potential knowledge bases were evaluated against the Good Practice Guidelines found in the Clinical Practice Guidelines: Directions for a New Program, published by the Institute of Medicine in 1990. Per the publication, the Institute recommended eight Attributes of Good Practice Guidelines: Validity, Reliability/Reproducibility, Clinical Applicability, Clinical Flexibility, Clarity, Multidisciplinary Process, Scheduled Review, and Documentation. While the definitions and descriptions of each attribute are only briefly summarized in this LCD, the rationale for how each potential knowledge base was evaluated against these attributes is detailed below.27
Per the Good Practice Guidelines, the Validity attribute addresses both the guidelines as a whole and the content within them. For the former, validity is demonstrated when the guidelines lead to improvement in health and cost outcomes, a metric best assessed through an external study of the guidelines in clinical practice. For the latter, validity is demonstrated when the guideline content assesses how a clinical action/recommendation affects health and cost outcomes. Validity of the guideline content is ideally evaluated using 11 elements: Projected health outcomes, Projected costs, Relationship between the evidence and the guidelines, Preference for empirical evidence over expert judgment, Thorough literature review, Methods used to evaluate the scientific literature, Strength of the evidence, Use of expert judgment, Strength of expert consensus, Independent review, and Pretesting.
The NCCN generally meets the definition of validity. In 2022, the NCCN was the subject of a retrospective study performed by CVS Health that was presented at that year’s American Society of Clinical Oncology conference. The study found that adherence to NCCN guidelines lowered the total cost of care in the treatment of colorectal and breast cancers in Medicare patients.172,173 Similarly, adherence to NCCN guidelines has been shown to improve health outcomes in cancer patients.174-177 As for the guideline content, the NCCN thoroughly details its recommendations using language that acknowledges the potential complexities and variations found in patient cases (terminology such as “consider” or “preferred” or “should be guided by”) and supports these recommendations with ample peer-reviewed literature. The discussion of the evidence often highlights strengths and weaknesses in studies and tailors the strength of a recommendation accordingly. Additionally, the guidelines are created, reviewed, and updated by subject matter experts through consensus panels.
Unlike NCCN, OncoKB acts as more of a catalogue of genetic variants with proven clinical validity and utility. OncoKB does not make formal recommendations regarding specific clinical actions, but rather provides expert and evidentiary analysis of known variants and explains how confident one can be in the actionability of a variant. For instance, a genetic variant with a low score on OncoKB may not have enough research and clinical evidence to prove a certain drug is more or less effective when that variant is present. This does not mean that testing for the variant will not (in OncoKB’s view) become useful in the future, but OncoKB suggests, based on expert consensus and evidence, that the variant should not yet be used in clinical decision making. Because of these characteristics, OncoKB does not meet the above definition of the attribute of validity, although it does provide valuable information from expert and evidentiary analysis.178
ClinGen demonstrates features of both NCCN and OncoKB as a knowledge base in the context of the attribute of validity. In terms of similarities to OncoKB, ClinGen assesses the relationship between a gene and disease (including hereditary cancer) and determines whether there is sufficient evidence supporting this causative relationship. In this instance, again, the knowledge base acts as guidance for whether a gene should be evaluated as part of a particular cancer workup. However, more like NCCN, ClinGen also provides very specific recommendations on “Clinical Actionability.” The recommendations provided by this section of ClinGen, however, are relatively simplistic, including clinical actions such as “Surveillance” and preventative surgery. Given that ClinGen is designed to evaluate hereditary diseases, it is unsurprising that most clinical actions evaluated by ClinGen revolve around screening and preventative interventions as opposed to therapeutic actions. Moreover, the prospective nature of the clinical actions evaluated on ClinGen makes their health and cost outcomes difficult to assess through external study.
The Reliability/Reproducibility attribute in the Good Practice Guidelines refers to consistency in guideline development and procedures. In practice, by following a knowledge base’s standard operating procedures and criteria in evaluating a topic, different groups of reviewers should come up with the same guideline for that topic. Reliability/reproducibility also refers to consistency in how a guideline is interpreted and utilized by its audience. This means that patients under different providers who use the same guideline should see equivalent care and management of their needs. Providers who use the same guideline should also come to the same conclusions and management decisions when evaluating similar patient complaints.
Evaluating knowledge bases for their reliability and reproducibility is difficult, as the Good Practice Guidelines notes. Guidelines, by definition, are meant to guide (not necessarily mandate) clinical decisions. As described in the Validity attribute, the practice of medicine requires flexibility in clinical management to address different patient presentations. This means that guidelines should not only be clear in their recommendations but also provide alternatives, when available.
As stated earlier, the NCCN guidelines thoroughly detail their recommendations using language that acknowledges the potential complexities and variations found in patient cases (terminology such as “consider” or “preferred” or “should be guided by”) and support these recommendations with ample peer-reviewed literature. Given the complexity and variation inherent to managing even similar-appearing clinical presentations, external comparison of the reliability/reproducibility of NCCN guidelines between these presentations would not be productive. However, one could interpret improvements in health outcomes for those using NCCN guidelines (as opposed to not using them) as an indication that the guidelines reliably improve patient outcomes.174-177
OncoKB provides a very detailed standard operating procedure describing how it reviews, scores, and updates the knowledge base. The standard operating procedure, while providing step-by-step instructions in granular detail, still requires OncoKB’s software and personnel infrastructure to review evidence and score genetic content. In this sense, while consistency within the organization is likely robust, it cannot be translated to external reviewers outside the OncoKB team, preventing evaluation of reproducibility as described in the Good Practice Guidelines. Additionally, several steps rely on the judgment of experienced individuals, as can be seen in the procedure. The thought processes of these individuals are not provided publicly. As for how OncoKB is used by clinicians, no studies addressing this topic were identified. This lack of studies, however, is to be expected given that OncoKB provides scored data without direct clinical management recommendations. Overall, OncoKB meets the Reliability/Reproducibility attribute, albeit only within the OncoKB organization, without the ability to translate to external reviewers.
ClinGen, like OncoKB, supplies clinicians with scored data but additionally includes minimal clinical management recommendations (e.g., when the clinical action of cancer surveillance is given a high score and is thus implicitly recommended for patients with relevant genetic findings). ClinGen, however, is less centralized than OncoKB in its standard operating procedures. Generalizable standard operating procedures applying to all workgroups are provided; however, workgroups sometimes have panel-specific operating procedures as well. Given the broad scope of ClinGen as a whole (pediatric and adult hereditary disorders, both non-cancer and cancer related), having different protocols per workgroup under an overarching, more general policy makes sense, even if it leads to variability in how genetic material is evaluated. The disadvantage of this decentralization is that consistency can be lost from workgroup to workgroup, making reproducibility more difficult. Given the simplicity of the clinical recommendations, with more focus on providing users with information to help make their own decisions, external studies comparing the reproducibility of ClinGen recommendations in clinical practice would be difficult if not impractical. ClinGen, while having some internal variation in standard operating procedures, still captures the general features of the Reliability/Reproducibility attribute.
To fulfill the attribute of Clinical Applicability, a knowledge base should clearly define in what circumstances and/or patient populations a guideline applies. For all three knowledge bases in this LCD, Clinical Applicability was demonstrated de facto by gene-disease correlations; namely, every guideline relevant to this LCD identified genes in the context of a cancer(s) or risk of cancer. More specifically, NCCN further contextualizes its recommendations based on several factors, including strength of evidence, cost of intervention, and special patient population considerations. ClinGen similarly expounds upon its scored recommendations in attached Summary Reports that address details such as relevant ages, gender, and clinical presentations. OncoKB categorizes its scoring based on the clinical indication (e.g., prognostic scores Px1, Px2, Px3 and therapeutic scores 1, 2, 3A, 3B, 4, R1, R2), with ties to FDA scoring where indicated.
Clinical Flexibility, per the Good Practice Guidelines, describes the practice of providing alternatives and exceptions to primary recommendations in a guideline. At best, a guideline is systematic, not only considering all major alternatives and exceptions but also describing the rationales behind them and the circumstances that support not utilizing a primary recommendation, including consideration of patient preferences, clinical judgment, and the logistics behind an alternative recommendation.
Again, the NCCN’s guidelines demonstrate thoroughness backed by evidence, as is seen in the details surrounding each recommendation statement. Often, NCCN guidelines will provide parameters to their recommendations, including what elements should be considered before performing a clinical action and what exceptions and alternatives should be considered, when appropriate. The NCCN guidelines also tend to defer clinical actions to clinical judgment, including recommending team-based decision-making in collaboration with the patient.
ClinGen acts less as a set of guidelines and more as a knowledge repository with respect to providing recommendations. As stated earlier, the clinical actionability statements in ClinGen generally refer to hereditary cancer syndromes, which focus primarily on surveillance and prophylactic interventions. However, ClinGen is designed to provide scoring and evidence for clinical actions, allowing clinicians to decide if the scores are high enough and the evidence strong enough to move forward with a clinical action. In this way, ClinGen does not directly provide clinical instructions, but rather indirectly guides clinical behavior by supplying evidence without frank conclusions. Thus, in this sense, the Clinical Flexibility attribute does not readily apply to ClinGen.
OncoKB likewise acts as a knowledge repository, though with even less description of clinical actions. OncoKB also supplies clinicians with information and evidence, but without recommending a certain clinical action and providing alternatives (such as selecting one chemotherapy over another). As a result, the Clinical Flexibility attribute is not applicable to OncoKB.
According to the Good Practice Guidelines, a knowledge base should also demonstrate Clarity; namely, it should be straightforward, logical, organized, and well defined. Terminology and clinical descriptions (e.g., anemia versus hemoglobin level) should be as precise as possible.
The attribute of Clarity is exemplified by NCCN guidelines in many respects. The guidelines use several modalities to present their recommendations, including a searchable database and frequent use of flowchart-style diagrammatic algorithms. Additionally, NCCN provides numerous resources describing its process of developing recommendations and what the scoring means. It should be noted that clarity varies per NCCN guideline, with some guidelines less clear in terminology and in defining recommendations. In fact, embedding recommendations within long paragraphs of discussion can make a recommendation harder to find. Another weakness of the NCCN guidelines is the practice of stating that any unscored recommendation within a guideline should be treated as Level 2A, namely: “Based upon lower-level evidence, there is uniform NCCN consensus that the intervention is appropriate.”179 The great majority of NCCN recommendations are Level 2A, but given that these recommendations received “uniform NCCN consensus,” Level 2A recommendations were considered strong enough to warrant coverage as clinically reasonable and necessary in Medicare. Overall, the NCCN guidelines aim to be transparent and clear with their recommendations and use a variety of modalities to ensure the recommendations are clearly communicated.
ClinGen addresses the Clarity attribute through thorough explanation of its scoring systems, separation of different topics (such as gene-disease relationships versus clinical actionability of genetic testing), use of searchable databases, and presentation of scores in consistently structured reports. ClinGen allows its users to determine how to use the presented information, providing recommendations only indirectly by listing potential clinical actions, ranking the strength of evidence, and consensus scoring of the clinical actions. ClinGen, however, presents its information to an apparent audience of users with substantial experience in genetics. Due to the complexity of the ClinGen website, the technical nature of evidence review, and the various topics assessed per gene (gene-disease relationship, dosage sensitivity, clinical actionability, variant pathogenicity, and pharmacogenetics), the ClinGen knowledge base can be challenging to use depending on the user’s background and foundation of knowledge. ClinGen requires its users to learn how to work with the data before they can properly use the website. Moreover, the absence of stated recommendations requires users to understand the details provided before arriving at a conclusion. In fact, per the website: “The information on this website is not intended for direct diagnostic use or medical decision-making without review by a genetics professional.”24 Overall, these weaknesses reduce the clarity of the knowledge base; however, the website does provide considerable definitional and procedural detail that adds to its transparency.
OncoKB, much like ClinGen, addresses the Clarity attribute through its clearly described scoring system. OncoKB does not provide direct recommendations, but rather provides the strength of evidence and expert assessment of genetic content in the context of cancer(s). Also like ClinGen, the absence of direct recommendations requires users to come to their own decisions based on the provided scoring and data. However, OncoKB’s website is comparatively straightforward and intuitive and has a searchable database.
Noted as “one of the committee’s [Institute of Medicine] strongest recommendations,” the attribute of Multidisciplinary Process should be a focus of good practice guidelines.27 The committee for the Good Practice Guidelines felt that guideline development should include all stakeholders potentially affected by the guideline, suggesting that even patients and payors be considered. Part of the process of organizing a guideline committee would be determining the participants’ conflicts of interest, not necessarily to exclude individuals but rather to recognize and account for potential biases. It is noteworthy that even among the committee writing the Good Practice Guidelines there was debate on who should be included in guideline development and who should lead it.
Per the NCCN’s Disclosure Policies: “The NCCN Guidelines are updated at least annually in an evidence-based process integrated with the expert judgment of multidisciplinary panels of experts from NCCN Member Institutions. NCCN depends on the NCCN Guidelines Panel Members to reach decisions objectively, without being influenced or appearing to be influenced by conflicting interests.”179 Panel members are overseen by various levels of staff and leadership, with conflicts of interest updated at least semiannually. It is recognized that panel members will have outside activities with other entities such as industry and patient advocacy groups, and thresholds for untenable conflicts of interest have been set, as described on the NCCN website. It should also be noted that while panels are composed of subject matter experts, the guidelines also receive input from patient advocates. Overall, the process of selecting and maintaining panel members, as well as other NCCN leadership, participants, and staff, is well detailed on the NCCN website. As a whole, the NCCN system robustly meets the Multidisciplinary Process attribute.
As described on its website and documented in its standard operating procedures, OncoKB maintains its knowledge base through the use of curators, a Clinical Genomics Annotation Committee (CGAC), and an External Advisory Board. The CGAC consists of “Core” members with broad skillsets that include clinical management, research, and translational cancer biology expertise. The CGAC also includes “Extended” members, including service chiefs, physicians, and scientists who represent multidisciplinary clinical leadership within the Memorial Sloan Kettering Cancer Center (MSKCC). Each member must submit their conflicts of interest, and the system of developing and approving genetic assertions (including scoring) takes conflicts of interest into consideration before a member is allowed to approve or deny these assertions. The CGAC is additionally overseen by an External Advisory Board consisting of leaders from the oncology and genomics community who are not employed by MSKCC. One of the inherent weaknesses of the OncoKB knowledge base is the limited representation from experts outside of MSKCC. Moreover, involvement from other stakeholders, such as patient advocacy groups, does not appear to be part of OncoKB development and maintenance. The current CGAC appears to consist of doctoral-level members (MDs and PhDs) only. While these weaknesses are noted, the overall system utilized by OncoKB meets the core elements of the Multidisciplinary Process attribute.
ClinGen utilizes Expert Panels to score, review, and report recommendations. Expert Panels are open to public volunteers, who first receive training, are required to disclose conflicts of interest, and finally are assigned to a topic-specific panel(s) by ClinGen leadership. Members of Expert Panels are listed on the ClinGen website, and their conflict of interest disclosures usually can be found within their workgroup’s webpage. These members represent a variety of professions in research, medicine, academia, and/or industry and diverse educational backgrounds, including both doctoral and non-doctoral degrees. Additionally, membership is not limited to the United States. While ClinGen provides one of the most inclusive memberships of the three knowledge bases discussed, transparency around these members (including leadership) is lacking. Moreover, as would be expected given the scope of ClinGen’s content, the membership leans predominantly towards genetics research as opposed to clinical management. ClinGen does demonstrate weaknesses that would benefit from improvement, but the essence of the knowledge base, inclusivity guided by centralized review-and-reporting standard operating procedures and multidisciplinary input, meets the core elements of the Multidisciplinary Process attribute.
The Institute of Medicine committee also endorsed the attribute of Scheduled Review for knowledge bases. While the committee does not specify exact timelines for scheduled reviews, the Good Practice Guidelines does state that reviews should be scheduled based on the known or expected frequency of updates in the topic of interest. Additionally, scheduled reviews do not preclude ad hoc reviews as new evidence is identified.
NCCN guidelines are reviewed at least yearly and updated as necessary. However, NCCN guidelines are often updated more frequently as stakeholder inquiries are submitted or new information is learned. A good example of this process can be seen in NCCN’s transparency documents, which record guideline-specific meetings and the topics discussed at those meetings. NCCN clearly meets the Scheduled Review attribute.
OncoKB, of the three discussed knowledge bases, appears to have the most rigorous and frequent scheduled reviews of its content. OncoKB’s team reviews various data sources at frequencies ranging from weekly for databases like cBioPortal and COSMIC to monthly for peer-reviewed literature from a variety of well-established journals. Additionally, OncoKB keeps abreast of new information from major annual conferences (e.g., ASH Annual Meeting) and evaluates other sources ad hoc, such as user feedback or data from clinical trials. All of these procedures can be found in the Standard Operating Procedure for the website.178 This process strongly meets the Scheduled Review attribute described in the Good Practice Guidelines.
Based on the structure of ClinGen, the Scheduled Review attribute is not well met. In 2018, McGlaughon and colleagues published an analysis asking how frequently a gene curation should be reassessed and updated.180 The retrospective study recommended different timelines for re-evaluation based on the initial strength of a gene-disease association (limited, moderate, strong), ranging from greater than five years to three years. In 2019, ClinGen did develop a policy along these lines, with the shortest time to recuration being two years for gene-disease associations with a Moderate classification, mandating the policy for “all current and future GCEPs” (Gene Curation Expert Panels).181 However, a similar policy for recuration of Clinical Actionability scores and recommendations was not found. Based on what is available, it appears that ClinGen may re-evaluate Clinical Actionability on an ad hoc basis, such as the re-evaluation of hereditary cardiac disease scores in 2021.182 It should also be noted that for all released Clinical Actionability reports, the oldest update was April 2020, meaning that all released reports appear to be current to at least 2020 (with the vast majority of reports showing updates in 2022). While the absence of a clear recuration policy for Clinical Actionability is problematic, ClinGen’s reports are based on current evaluations and recent updates.
The eighth and final attribute described in the Good Practice Guidelines is Documentation. This attribute refers to the written procedures, policies, scoring metrics, etc., describing how a knowledge base makes its determinations and guidelines. Good documentation should also record the activities of the knowledge base, including which individuals developed a guideline, the evidence they utilized, their rationales and assumptions, and any analytical methodology they used.
NCCN demonstrates robust standard operating procedures and vetting of its guideline developers. Additionally, NCCN both records submissions from external sources and documents the subsequent panel meetings where submissions are considered. Much of the text in NCCN guidelines outlines both the evidence and the rationale for NCCN scoring and recommendations. However, NCCN is less transparent about the dialogue and content of its guideline panel meetings, for instance not providing comprehensive meeting minutes or transcriptions of the meetings’ discussions. The votes of the panel on an issue are likewise anonymous, even if the members attending a discussion are disclosed. These practices can obscure full understanding of the rationales and assumptions behind a panel’s final decisions and recommendations. As a whole, the NCCN is fairly transparent in both the procedures used to develop guidelines and the people creating and updating these guidelines. There are areas in which more transparency would be needed to fully meet the Documentation attribute, but as a whole, NCCN demonstrates good documentation.
OncoKB provides a very detailed standard operating procedure on its website and lists all team members with their respective conflict of interest disclosures. However, a major weakness of OncoKB is the lack of transparency as to what data and/or citations support the score for genetic content. For instance, although alterations are listed with their score, disease, and drug associations, only the number of citations (but not the actual references) is listed. Moreover, meeting minutes do not appear to be publicly available, limiting insight into the rationales and assumptions behind scores. OncoKB, because of these weaknesses, does not completely meet the Documentation attribute as described above, although the knowledge base does do a good job explaining its general procedures and documenting who is curating the knowledge base.
ClinGen provides overall scoring criteria and parameters for its knowledge base, but some Expert Panels create their own working group protocols. Given the broad scope of ClinGen as a whole (pediatric and adult hereditary disorders, both non-cancer and cancer related), having different protocols per workgroup under an overarching, more general policy makes sense, even if it leads to variability in how genetic material is evaluated and documented. Members of Expert Panels are listed on the ClinGen website, and their conflict of interest disclosures usually can be found within their workgroup’s webpage. Because of this variability and the incomplete availability of disclosures, ClinGen does not completely meet the Documentation attribute as described above, although the knowledge base does do a good job explaining its general procedures and documenting who is curating the knowledge base.
Specific Lab Tests
Cxbladder (Detect, Triage, Monitor, Resolve)
The fundamental methodology of the Cxbladder tests is founded on a 2008 paper from Holyoake and colleagues describing the creation of an RNA expression assay that could predict the likelihood of urothelial carcinoma from urine.39 Gene expression profiles were compared between urothelial carcinoma tissue and normal urothelial tissue, the latter of which was collected as non-malignant tissue from patients with renal cell carcinoma who had undergone a radical nephrectomy. The gene expression profiles that demonstrated the most promise in differentiating between cancer and non-cancer were gathered into a four-gene expression panel and then optimized to discriminate between urine from patients with any grade/stage of urothelial cancer and patients without urothelial cancer. Gene expression tests that are used to predict the presence or absence of cancer, however, must take into consideration many potential complicating and confounding factors. The absence of a rigorous approach to addressing these complicating/confounding factors undermined the clinical validity of the Cxbladder tests, as will be detailed below.
As evidenced in the publications reviewed in the Summary of Evidence, the key weakness of the Cxbladder tests is found within their test design. Cxbladder tests are founded on the concept that differences in gene expression between urothelial cancer and non-urothelial cancer (including non-neoplastic tissue) can be measured in urine to determine whether urothelial cancer is present. This means that a well-designed test must be able to discriminate not only between cancer and normal tissue, but also between different types of malignancy.
For the precursor test uRNA, Holyoake and colleagues started with a custom-printed array from MWG Biotech that allowed gene expression profiling of 26,600 genes.39 This array was used to analyze normal tissue (18 specimens) and urothelial carcinoma tissue (28 specimens from Ta tumors and 30 specimens from T1-T4 tumors). The preliminary data were then analyzed to select the most promising genes for creation of a gene expression profile (GEP) test. This subset of promising genes was further pruned by testing urine from patients with transitional cell carcinoma (TCC) (urothelial carcinoma) (n=75), patients with other “urological cancer” (n=33), and patients without cancer, including patients with infection (n=20) and “other benign urinary tract disease” (n=24). Additionally, the paper mentions testing blood to obtain gene expression levels for blood and inflammatory cells, but the results of this subset of tests were not provided in the paper. It must be noted here that the characteristics of the non-TCC cancers were also not disclosed in the paper.
After this additional testing, Holyoake and colleagues settled on a four-gene expression test (uRNA-D) that utilized the genes MDK, CDC2 (now officially known as CDK1), IGFBP5, and HOXA13.39 Unfortunately, the false positives and false negatives received little attention in the paper, including the false positives for patients with other non-TCC cancers (n=3). The authors concluded that their results “will need to be further validated in a prospective setting to more accurately determine test characteristics, particularly in patients presenting with hematuria and other urological conditions.”
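For context on the general technique, a GEP test of this kind typically reduces the measured expression of a small set of marker genes to a single score through a weighted model fit on training specimens, with a cutoff chosen to separate cancer from non-cancer. The sketch below (in Python) illustrates only that general approach: the gene names match the paper, but the weights, normalization, and cutoff are placeholder assumptions, as the actual uRNA-D model is not disclosed.

    # Minimal sketch of a multi-gene expression classifier of the uRNA-D type.
    # The gene list matches Holyoake et al. (MDK, CDC2/CDK1, IGFBP5, HOXA13),
    # but the weights, intercept, and cutoff are invented placeholders; the
    # publication does not disclose the actual model.
    import math

    GENES = ["MDK", "CDK1", "IGFBP5", "HOXA13"]
    WEIGHTS = {"MDK": 0.9, "CDK1": 0.7, "IGFBP5": -0.4, "HOXA13": 0.6}  # assumed
    INTERCEPT = -2.0  # assumed
    CUTOFF = 0.5      # assumed probability threshold

    def classify(expression):
        """Combine per-gene expression (arbitrary units) into a probability."""
        z = INTERCEPT + sum(WEIGHTS[g] * math.log2(expression[g] + 1.0)
                            for g in GENES)
        p = 1.0 / (1.0 + math.exp(-z))  # logistic link
        return p, "positive" if p >= CUTOFF else "negative"

    # Example specimen with arbitrary expression values:
    print(classify({"MDK": 40.0, "CDK1": 12.0, "IGFBP5": 3.0, "HOXA13": 8.0}))

The central design question raised in this section, namely whether such a score separates urothelial carcinoma from other malignancies and benign conditions, depends entirely on which specimen types were represented when the weights and cutoff were fit.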
Standing alone, the 2008 paper from Holyoake and colleagues lacks the scientific rigor to establish that the uRNA-D test can accurately distinguish between urothelial carcinoma and other cancers or other non-cancer urological conditions.39 One very notable gap was the lack of details or definitions for the non-urothelial cancers, many of which would involve the urinary system, including prostate cancers, renal cancers, and metastatic or locally invasive cancers from other organs. It would be expected that a well-designed test would assess not only the full spectrum of potential cancers, but also that the test design would include a much higher count of specimens (beyond the 33 undefined cancers found in this study). In the same sense, the absence of details regarding non-malignant specimens, which included only 20 “urinary tract infections,” was a major and notable gap in this test’s development. There were many other issues identified with this paper, including a strong population bias towards male patients, but altogether, this validation of uRNA-D was insufficient to support that the test accurately distinguished between urine from patients with and without urothelial carcinoma.
It is critical to understand the limitations of the 2008 publication from Holyoake and colleagues because the test uRNA-D was used to create the Cxbladder line of tests, with the main difference between uRNA-D and Cxbladder being the addition of a single gene, CXCR2, to the gene expression profile of the Cxbladder assay.39,40 It is noted that other versions of Cxbladder use non-genetic data in an overarching algorithm to produce results, but the focus of this discussion will be upon the gene expression profile technology of Cxbladder tests.
In 2012, the first paper describing a Cxbladder test was published by O’Sullivan and colleagues.40 This paper acted as both a test validation and a comparison of the new Cxbladder test with other urine tests on the market. While the statistical results of Cxbladder seem promising, we must return to the foundation of the test, namely its ability to distinguish between urothelial carcinoma and other cancerous or non-cancerous conditions (or patients without disease). In this paper, other malignancies (n=7) were assessed only when they were found in patients with urothelial carcinoma. Moreover, the types of other malignancy were not disclosed. There were also 255 “nonmalignant disease” specimens, which included representations of “benign prostatic hyperplasia/prostatitis,” “cystitis/infection or inflammation of urinary tract,” calculi, and “hematuria secondary to warfarin,” and 164 specimens from patients with “no specific diagnosis.” This first paper from 2012 also does not sufficiently address Cxbladder’s ability to distinguish between urothelial carcinoma and other malignancies, which is of particular relevance given that a majority of the patient population was male (78%) with a median age of 64 years and thus at higher risk of prostate carcinoma. The paucity of clinical data also created gaps in data integrity, failing to answer questions such as how many of the urothelial carcinoma specimens had coincident inflammation and what other medical conditions (and medications) were present in this patient population. Additionally, the paper does not spend significant time discussing the potential reasons for false positives and false negatives. These issues are compounded by a short follow-up period (only 12 months) with participating patients.
In the most recent paper published for Pacific Edge Diagnostics, by Lotan and colleagues in 2022, Cxbladder Triage and Detect were “enhanced” by the addition of a different test methodology, digital droplet PCR, adding a different approach to detecting urothelial carcinoma: identification of genetic variants associated with urothelial carcinoma.44 This approach was based on the premise that the six variants (called single nucleotide polymorphisms or SNPs in the paper) are either acquired as mutations in the carcinogenesis of urothelial carcinoma or already present as inherited variants in the patient’s germline DNA, representing a higher risk of urothelial carcinoma. However, it is known that these SNPs can also show up in the context of other malignancies (such as papillary renal cell carcinoma), which is not addressed by Lotan and colleagues.183,184 Moreover, as mentioned in the paper’s discussion, the presence of these SNPs in urine may not coincide with clinically detectable (e.g., cystoscopically visible) carcinoma. This could lead to further confusion with false positives, especially when the positive predictive value (PPV) of Cxbladder tests tends to be very low. If numerous false positive results in Cxbladder are accepted as an inherent trait of the test, providers may not be as vigilant in closely following patients with a positive Cxbladder result after a negative cystoscopy. In addition, providers may not search for other malignancies (e.g., papillary renal cell carcinoma) as a potential cause for the “false positive” Cxbladder result. Another weakness of the 2022 study was seen in the differences between cohorts. Notably, the six SNPs alone were less sensitive for urothelial carcinoma in the Singapore cohort (66%) than in the United States cohort (83%). This could indicate differences in the genetic etiology of urothelial carcinoma in different populations, meaning that the six SNPs may not be as representative in populations not evaluated in this study. Furthermore, while this study claimed to evaluate multiple ethnicities, the paper does not disclose which ethnicities were evaluated or the numbers of patients from each.
Each new Cxbladder test builds on its predecessors, often utilizing the same specimens from prior studies in test validations and performance characterizations. Moreover, the insufficient assessment of potential confounding factors is perpetuated through these studies. For instance, if we look at the assessment of non-urothelial cancer through all major published uRNA and Cxbladder studies, we see the following:
- Holyoake 2008: 33 undefined cancers were noted39
- O’Sullivan 2012: Seven other malignancies (undefined) were noted, all in patients with urothelial carcinoma40
- Kavalieris 2015: Non-urothelial neoplasms were not discussed (study population included 517 patients from the O’Sullivan 2012 study)40,41
- Breen 2015: Non-urothelial neoplasms were not discussed (study population included patients from the O’Sullivan 2012 study)40,45
- Kavalieris 2017: Non-urothelial neoplasms were not discussed (same patient population as Lotan 2017)42,46
- Lotan 2017: Non-urothelial neoplasms were not discussed (same patient population as Kavalieris 2017)42,46
- Konety 2019: In some subpopulations, patients with history of prostate or renal cell carcinoma were excluded from the study; otherwise, non-urothelial neoplasms were not discussed (study population included patients from O’Sullivan 2012 and Kavalieris/Lotan 2017)40,42,46,49
- Davidson 2019: Other non-bladder malignancies and neoplasms were identified (but not subclassified) in a study evaluating hematuria; notably, Cxbladder-Triage was positive in most of these other malignancies (seven of nine total) and neoplasms (two of three total)52
- Koya 2020: Non-urothelial neoplasms were not discussed50
- Davidson 2020: Other non-bladder malignancies and neoplasms were identified in the study but data was not presented to allow association of these other malignancies and neoplasms with positive or negative results from Cxbladder.53
- Raman 2021: In some subpopulations, patients with history of prostate or renal cell carcinoma were excluded from the study; otherwise, non-urothelial neoplasms were not discussed (study population included patients from O’Sullivan 2012 and Konety 2021)40,43,49
- Lotan 2022: In some subpopulations, patients with history of prostate or renal cell carcinoma were excluded from the study; otherwise, non-urothelial neoplasms were not discussed44
- Li 2023: Some patients (24 of 92) were noted to have “other cancers”; except for one mention (a patient with breast cancer who missed their nine-month follow-up due to conflict with breast cancer treatment), other types of cancers are not described or significantly discussed51
There are numerous potential malignancies that can contribute to the genetic composition of urine (e.g., renal cell cancer, bladder cancer, prostate cancer). However, by using only 40 unspecified neoplasm specimens, 33 of which were tested only for uRNA (not Cxbladder), the validation from Pacific Edge Diagnostics underrepresents potentially confounding variables. This underrepresentation is further substantiated through data found in studies not performed or funded by Pacific Edge Diagnostics. In an independent study from Davidson and colleagues in 2019 performed on patients with hematuria, seven of nine malignant prostate or kidney lesions were discovered in patients with a positive Cxbladder-Triage result.52 False positive Cxbladder-Triage results were also seen in a majority of patients without bladder cancer who were instead diagnosed with radiation cystitis, vascular prostate, bladder stones, anticoagulation-related bleeding, post-TURP bleeding, or urethral stricture. No cause for hematuria was found in 225 patients, 137 of whom had a positive Cxbladder-Triage result. This 2019 study, as well as others, indicates that Cxbladder is less sensitive for detecting smaller, low-grade malignancies, making it unlikely that these false positives represent urothelial malignancies below the limit of detection of cystoscopy and other conventional evaluations.52
The exclusion criteria further weakened the development and validation of Cxbladder tests. Consistent with the aforementioned confounding variables of other urinary tract cancers and metastases, the Cxbladder studies generally excluded patients with a history of prostate or renal cancer; however, these exclusions were not always seen in studies funded and/or published by Pacific Edge Diagnostics. The validation studies for Cxbladder also typically excluded inflammatory disorders such as pyelonephritis and active urinary tract infections, as well as known causes of hematuria like bladder or renal calculi and recent manipulation of the genitourinary tract (e.g., cystoscopy).40-43,46 In the first published validation study for Cxbladder, the authors stated that the additional fifth RNA marker, CXCR2, “was predicted to reduce the risk of false-positive results in patients with acutely or chronically inflamed urothelium.”40 However, in this same study, the authors went on to exclude patients with “documented urinary tract infection.” Fortunately, publications from other sources such as Davidson and colleagues in 2019 provide insight into how Cxbladder tests (namely Cxbladder Triage) perform under these benign conditions.52 Davidson and colleagues found that false positives were seen in a majority of patients where the underlying etiology of hematuria was radiation cystitis, vascular prostate, bladder stones, anticoagulation-related bleeding, post-TURP bleeding, or urethral stricture. False positives were also seen in 10 of 23 (43%) patients with urinary tract infection and 8 of 10 (80%) patients with “other” inflammatory etiologies. Altogether, over half (59%) of patients with an inflammatory etiology of their hematuria received a false positive result from the Cxbladder Triage (CxbT) test. In a subsequent study by Davidson and colleagues in 2020, “approximately 10% of patients (85 of 884) required a repeat CxbT assay because [of] quality control failures, mainly caused by interference of inflammatory products or a large number of white blood cells.”53

The few systematic reviews and meta-analyses that included Cxbladder tests were mixed in their assessment of this line of tests. Chou and colleagues in 2015 reviewed only one of the Cxbladder papers (O’Sullivan 2012) and concluded that the strength of evidence for the study was low.40,54,55 In 2022, Laukhtina and colleagues performed a more involved assessment of Cxbladder tests, particularly Cxbladder Monitor, concluding that it had “potential value in preventing unnecessary cystoscopies.”56 The authors also determined that there was “not enough data to support” using Cxbladder Triage and Detect in the “initial diagnosis setting.” Laukhtina and colleagues did acknowledge that their study had several potential limitations, including the “absence of data on blinding to pathologist and urologists” and the inability to “perform subgroup analyses for HG [High Grade urothelial carcinoma] recurrence detection only.” However, for both the 2015 and 2022 systematic reviews and meta-analyses, the evaluation of the actual clinical features of each Cxbladder study was relatively superficial, focusing more on the statistical values and less on the quality of the studies and the designs underlying those values.54-56
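The positivity rates quoted above in benign settings follow directly from the counts reported by Davidson and colleagues, as restated in this LCD; a short calculation makes the magnitudes explicit (the subgroup labels below are paraphrased):

    # Cxbladder Triage positivity in patients without bladder cancer, using
    # counts from Davidson et al. 2019 as restated above.
    subgroups = {
        "urinary tract infection": (10, 23),          # (positive CxbT, total)
        "'other' inflammatory etiologies": (8, 10),
        "no cause for hematuria identified": (137, 225),
    }
    for name, (pos, n) in subgroups.items():
        print(f"{name}: {pos}/{n} = {pos / n:.0%} positive without bladder cancer")
    # urinary tract infection: 10/23 = 43%
    # 'other' inflammatory etiologies: 8/10 = 80%
    # no cause for hematuria identified: 137/225 = 61%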
In conclusion, the Cxbladder line of tests suffers from the foundational problem of insufficient validation in potentially confounding clinical circumstances, including non-urothelial malignancies and inflammatory conditions of the urinary tract. Cxbladder also demonstrates several population biases, including early papers with a strong bias towards male patients of European ancestry. The majority of Cxbladder papers avoid disclosing the PPV and number of false positives of their tests. Cxbladder tests generally have low PPVs (down to 15-16%, as seen in Konety et al. 2019) and high numbers of false positives (also in Konety et al. 2019, there were 464 false positive results as compared to 86 true positive results).49 These values are significant in that false test results, particularly false positives, can lead to patient anxiety and distress, among other procedural issues related to follow-up of an inaccurate result. Most of the primary literature regarding Cxbladder test development and performance is funded by, if not directly written by, the test’s parent company, Pacific Edge Diagnostics. This conflict of interest must be taken into account when reviewing these papers. Finally, and most importantly, due to the insufficient representation of confounding factors in the validation populations, the Cxbladder tests have not been adequately vetted in the context of the Medicare population. Given all of these findings, the Cxbladder line of tests is considered not medically reasonable and necessary for Medicare patients.
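The PPV figure cited from Konety et al. 2019 can be reproduced from the reported counts, since PPV is simply the proportion of positive results that are true positives:

    # Positive predictive value implied by the Konety et al. 2019 counts above.
    tp, fp = 86, 464
    ppv = tp / (tp + fp)
    print(f"PPV = {tp}/{tp + fp} = {ppv:.1%}")  # PPV = 86/550 = 15.6%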
ThyroSeq CRC, CBLPath, Inc, University of Pittsburgh Medical Center
ThyroSeq CRC is a prognostic test for malignant cytology that predicts the five-year likelihood of cancer recurrence (low, intermediate, or high risk) based on algorithmic synthesis of raw data from next generation sequencing (NGS) of DNA and RNA from 112 genes. Nodules proven to be malignant on fine needle aspiration (FNA) cytology are typically surgically resected, sometimes with coincident lymph node dissection as warranted, and the total features of the cancer are then assessed on permanent pathology of the resection specimen; a prognostic test predicting risk of cancer recurrence on cytology, before assessment of the entire resection, therefore seems premature. However, ThyroSeq CRC is proposed to direct the extent of surgery for Bethesda VI nodules, increasing the aggressiveness of surgery for more aggressive cancers. Therefore, ThyroSeq CRC must not only supply information that is not obtained through standard clinical and pathologic procedures prior to a resection, but also provide results that are subsequently confirmed on patient follow-up after the resection. Ultimately, a prognostic test should provide information that predicts the course of a patient’s disease before therapy is implemented and thus informs future clinical management to preemptively reduce adverse outcomes. For a prognostic test to be clinically useful, it must ultimately improve patient outcomes.
In the first publication describing the evaluation of the ThyroSeq CRC test, a small population of patients (n=287) with differentiated thyroid cancer (DTC) was evaluated with the CRC prognostic algorithm, and each patient’s molecular risk group (Low, Intermediate, High) was compared to their outcome in terms of distant metastases (DM), as identified through pathology or whole body scans with iodine-131.57 Patients were divided into two groups: control (n=225) and DM within five years (n=62). For the control group, precise numbers of how many patients fell into each CRC risk category were not supplied by the paper. Instead, the control group was further segregated into a subcategory of propensity-matched patients, where each DM-positive patient was compared with a control patient with similar demographic and pathologic characteristics, although the authors clearly state that histologic subtype was not used to perform this propensity match. Using this propensity matching technique, comparisons were provided between 53 DM-positive patients and 55 control patients. In this subgroup comparison, the DM-positive patients demonstrated more high risk scores (Low=1 patient; Intermediate=17 patients; High=35 patients) than the control patients (Low=28 patients; Intermediate=19 patients; High=8 patients). This comparison was felt to be adequate by the authors to conclude that their “molecular profile can robustly and quite accurately stratify the risk of aggressive DTC defined as DM.”
This study had numerous limitations and drew dramatic conclusions from a very small sample size that was poorly presented by the paper.57 The immediate issue with this study was the lack of transparency. Thyroid cancer is a complex category of malignancy that includes many different subtypes of cancer, each with a variety of behaviors depending on numerous demographic, clinical, and pathologic factors. Consideration of cancer patient management is thus a multifactorial and interdisciplinary process that requires careful evaluation. The study from Yip and colleagues not only oversimplifies the descriptions of patient populations, but the background data for each patient are also not provided to allow for objective review by readers. We are not given crucial details such as key findings in pathology reports (mitoses, lymphovascular invasion, capsular invasion, histologic subtype of the cancer) nor the number of patients with positive lymph nodes found during resection of the cancer. Instead, the patient demographics and molecular characteristics provided (Table 1) include simplifications such as generalized cancer types without subclassification (Papillary, Follicular, or Oncocytic) and non-specific metastatic locations (Bone, Lung, “>1,” and Other). Additionally, the propensity-matched description table (Table 2) lists only mean age at diagnosis, mean tumor size, and gender ratio.
Yip and colleagues also did not provide significant insight into why some controls (n=8) were ranked as high risk while one patient with DM was categorized as low risk.57 The purpose of the intermediate risk category is unclear and concerningly unhelpful when the number of patients in this risk category was essentially the same between propensity-matched DM and control patients (n=17 versus n=19, respectively).
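One way to make the intermediate-category problem concrete is to compute the operating characteristics implied by the propensity-matched counts above under the two possible dichotomizations of the three tiers. This calculation is this contractor’s illustration, not an analysis performed in the paper:

    # Sensitivity/specificity implied by the propensity-matched counts in Yip
    # et al. (DM-positive: Low=1, Intermediate=17, High=35; controls: Low=28,
    # Intermediate=19, High=8), dichotomizing the three tiers two ways.
    # Illustrative only; this analysis does not appear in the paper.
    dm = {"Low": 1, "Intermediate": 17, "High": 35}    # n = 53
    ctrl = {"Low": 28, "Intermediate": 19, "High": 8}  # n = 55

    def perf(positive_tiers):
        tp = sum(dm[t] for t in positive_tiers)
        tn = sum(v for t, v in ctrl.items() if t not in positive_tiers)
        return tp / sum(dm.values()), tn / sum(ctrl.values())

    print(perf({"High"}))                  # sensitivity ~0.66, specificity ~0.85
    print(perf({"High", "Intermediate"}))  # sensitivity ~0.98, specificity ~0.51

Under these counts, treating only High as a positive result misses roughly a third of the DM-positive patients, while folding Intermediate into the positive result cuts specificity roughly in half, underscoring how uninformative the intermediate tier is as presented.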
Ultimately, it was unclear how this test would be used in patient care.57 Given that the test is performed on cytology before resection, the authors conjectured their test could be used to guide extent of surgery (lobectomy versus total thyroidectomy) or help direct patients to therapeutic trials. However, these potential clinical utilities were not assessed in this paper.
In the second publication evaluating ThyroSeq CRC, Skaugen and colleagues performed a single-institution retrospective cohort study assessing 128 Bethesda V (suspicious for malignancy) cytology specimens.58 The study assessed both the ThyroSeq v3 diagnostic test and the ThyroSeq CRC test. For the CRC portion of the study, 100 specimens were assessed, with five excluded due to a benign diagnosis upon resection and three excluded due to concurrent metastatic disease discovered at resection. For the remaining 92 specimens, there was a mean follow-up of 51.2 months (about four years). The shortest follow-up time was less than one month, and the longest was 470 months (nearly 40 years). It must immediately be noted here that the ThyroSeq CRC test claims to predict a five-year risk of DM, which means over half of the CRC-tested specimens (more than 46 specimens) demonstrated potentially inadequate follow-up to assess the core five-year prognostic claim. The importance of these follow-ups becomes even more evident when the authors drew conclusions about the prognostic power of the CRC’s three risk categories: High, Intermediate, and Low. Distant metastases were identified in 12 of the 92 specimens: 6 of 11 specimens with a high risk result and 6 of 63 specimens with an intermediate risk result. The authors did not provide a deeper analysis of the five high risk specimens without DM, including no speculation as to why the test potentially misclassified these specimens. Additionally, the authors did not provide significant discussion of the meaning of the intermediate risk result that was given to 66 of the 100 specimens tested. In the paper’s conclusions, ThyroSeq CRC was again proposed as potentially helpful in deciding the extent of surgery required.
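The per-tier distant metastasis rates implied by these counts are worth tabulating. Note that the low-risk denominator below (18) is inferred by subtraction (92 − 11 − 63) and is an assumption of this illustration; the absence of DM among low risk specimens is as reported by the authors:

    # Observed distant metastasis (DM) rates per ThyroSeq CRC risk tier among
    # the 92 Skaugen et al. specimens with follow-up. High and intermediate
    # counts are as reported; the low-risk denominator (18 = 92 - 11 - 63) is
    # inferred by subtraction and is an assumption.
    tiers = {"High": (6, 11), "Intermediate": (6, 63), "Low": (0, 18)}
    for tier, (dm_count, n) in tiers.items():
        print(f"{tier}: {dm_count}/{n} = {dm_count / n:.0%} with DM")
    # High: 6/11 = 55%; Intermediate: 6/63 = 10%; Low: 0/18 = 0%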
Much like the first paper, the second paper (Skaugen and colleagues) lacked data transparency, making further assessments by readers difficult.58 While Table 3 provided patient characteristics, surgical findings, and pathologic findings, all to a much greater extent than the first paper, readers were still unable to synthesize how data categories corresponded to each other (e.g., of the patients who received lymph node dissection, what subtypes of thyroid cancer were represented).
Ultimately, Skaugen and colleagues lacked sufficient follow-up to draw significant conclusions about the accuracy of the ThyroSeq CRC results.58 The paper, while data rich, was neither transparent nor thorough enough for readers to draw their own conclusions about the validity of the test. Moreover, the conclusions given by the authors regarding the prognostic test were overly simplified, such as highlighting the presence of DMs in some patients with intermediate and high risk results, and considering this correlation significant when they also noted the absence of DMs in patients with low risk results. Finally, the actual use of ThyroSeq CRC in the clinical setting remains unclear based on the discussion in the paper.
In the third publication, Liu and colleagues assessed their three-tier classification system (low, intermediate, and high risk for recurrence) in the context of primary thyroid cancer recurrence after a primary thyroidectomy and subsequent initial oncologic therapy.59 Notably, the test name ThyroSeq CRC was never used in this paper, even though the three-tier system of risk stratification appears to be the same. This raises a concern that the classification system used in this paper may not use the same methodology as the marketed ThyroSeq CRC. With that caveat, and for the purposes of this Analysis of Evidence, this third paper will be considered contributory to the body of literature evaluating ThyroSeq CRC.
From the methodology section of the publication alone, we can see immediate differences between this paper and the two previously described papers.57-59 First, surgical specimens were permitted in the study, not just cytology specimens. This allowance of a non-cytology specimen type (“final surgical specimens,” without specification of post-resection tissue handling, formalin fixation versus fresh-frozen preservation) in a test that was presumably designed for cytologic specimens would require a separate validation of the test for the new specimen type. Validation for this change in pre-analytic procedure was not evidenced in this paper or in either of the two prior publications. Second, the study was not blinded due to its retrospective nature. Third, in cases where multifocal cancer was identified, only samples from the “most aggressive biology” were selected for molecular testing; however, the paper does not define what constitutes “most aggressive biology.” Fourth, the study included patients with preoperative Bethesda I, II, III, and IV cytology as well as Bethesda V and VI cytology. This starkly contrasts with the inclusion criteria seen in the prior two studies. Overall, these methodologic differences between papers reduce the comparability of results across the three studies.
Data collection in this study from Liu and colleagues also differed from the previous two papers.57-59 For instance, Liu and colleagues recorded several details on the surgical and post-surgical treatments of the patients. These data included types of lymph node dissection (central versus lateral and prophylactic versus therapeutic), postoperative complications (e.g., hematoma, hypercalcemia, surgical site infection), and long-term complications (such as hypocalcemia and recurrent laryngeal nerve paresis). Several of these data categories were similar to those seen in Skaugen and colleagues’ study, but differences found in Liu’s publication included post-operative details, grouping of several types of papillary thyroid cancer (such as tall cell variant) into a more general category (“Papillary, high risk”), evaluating only all-cause mortality (without substratifying into disease-specific mortality), and detailing AJCC prognostic stages. Note that, as mentioned above, there was a paucity of clinical and pathologic data provided for samples in the study from Yip and colleagues, and the data from Liu and colleagues were far more diverse than in that prior study.
The study followed patients for a median of 19 months (interquartile range [IQR] 10-31 months).59 None of the patients were followed for a full five years, which means the data in this study are insufficient to substantiate the five-year prognostication claims of the ThyroSeq CRC test.
The above analysis captures only some of the issues identified with the study from Liu and colleagues.59 In fact, careful reading of the paper’s discussion brings up numerous other “limitations” to the study, not already described above, as identified by the authors. While the authors’ discussion remains upbeat, statements such as “how to manage the intermediate group?” draw attention to the novelty of this classification scheme and the uncertainty of how the results can impact patient care and outcomes. While the authors suggest numerous ways their classifications can affect patient management, and even note that they use this test within their institution to guide their decision-making, the lack of evidence demonstrating this prognostic test’s clinical utility through carefully designed studies suggests that the test is currently inadequately studied for use in patient care. Ultimately, in spite of the numerous data supplied, this paper still failed to adequately evaluate the clinical validity and utility of the prognostic three tier system.
In summary, the validity of the ThyroSeq CRC test is not sufficiently supported by the three peer-reviewed papers identified. The three papers were exceptionally difficult to compare to each other due to differences in information provided, types of samples tested, and methodologies described. The clinical utility of the test is not significantly evaluated by any of the papers. Due to the inadequate quality of the papers and the insufficiency of data, this test does not have sufficient evidence to prove clinical reasonableness and necessity and will be considered non-covered in Medicare patients.
PancraGEN – Interpace Diagnostics
A 2006 patent described a topographic genotyping molecular analysis test (which would later become PathfinderTG and then be renamed PancraGEN) for risk classification of pancreatic cysts and solid pancreaticobiliary lesions when first-line results were inconclusive.185 PancraGEN integrates the molecular results (loss of heterozygosity [LOH] markers and oncogene variants) with a pathologist interpretation to provide four categories of risk (benign, statistically indolent, statistically higher-risk, or aggressive).
Topographic genotyping (also called integrated molecular pathology [IMP]) was created to integrate molecular and microscopic analyses when a definitive pathologic diagnosis or prognosis was inconclusive. Typically, investigation of a pancreatic cyst or solid pancreaticobiliary lesion is an interdisciplinary process that involves a battery of clinical evaluations including imaging, cytology, and, when applicable, cyst fluid analysis. Given the complexity of this work-up process, it is surprising that only a small number of PancraGEN studies compared its test results with the histology, cytology, and/or pathology of surgical biopsy specimens. For the PancraGEN studies addressing pancreatic cysts, all three studies were retrospective in nature and contained significant limitations. The largest study, by Al-Haddad in 2015, analyzed 492 patients registered with the National Pancreatic Cyst Registry (NPCR).85 The majority of the patients reviewed for inclusion (n=1,732) did not meet the study inclusion criteria due to insufficient or inaccessible documentation, which resulted in many cases not meeting the follow-up threshold of ≥23 months. Researchers evaluated how well PancraGEN (PathfinderTG) and the 2012 Sendai International Consensus Guideline classification categorized patients with pancreatic cysts in terms of their chance of developing cancer. However, the publication from Al-Haddad and colleagues did not adequately address the validity of PancraGEN due to several shortcomings and limitations, including, but not limited to:
- Data used for the categorization of patients according to Sendai 2012 criteria were not specified for most patients, as the collection of information started prior to publication of the 2012 guidelines.
- Only a small fraction of the patients reviewed (26%) met the inclusion criteria.
- The study used only a retrospective design, without randomization or prognostic data.
- All patients in the study had been scheduled for surgery, while typically not all patients with pancreatic cysts are referred for surgery.
- The mean follow-up period for benign disease in this study was too short for firm conclusions to be made beyond three years (insufficient follow-up times).
Of note, during the study, the criteria for the test evolved, and older cases in the registry had to be recategorized based on the new criteria.
Two other retrospective studies, by Malhotra et al (2014) and Winner et al (2015), analyzed data from patients with pancreatic cysts between 2006 and 2012 who had surgical resection and analysis with PancraGEN (PathfinderTG).86,87 The study by Winner and colleagues had an extremely small cohort of 36 patients, 85% of whom were “white” with no other race/ethnicity reported. All patients in the study were scheduled for surgery even though not all patients with pancreatic cysts undergo surgery. Moreover, the authors were unable to include the majority of patients recruited because of the lack of final pathology results. The study by Malhotra and colleagues utilized 26 patients with no demographic characteristics reported and only three months of follow-up. In Malhotra (2014), no clinical validity outcomes, such as sensitivity, specificity, or predictive values, were calculated or reported. Both Malhotra et al (2014) and Winner et al (2015) were performed at single institutions with no blinding.
Al-Haddad and colleagues assessed clinical utility by describing how PancraGEN might provide incremental improvement to international consensus guidelines (Sendai [2006] and Fukuoka [2012]).85 Of the 289 patients who met the consensus criteria for surgery, 229 had a benign outcome. The PancraGEN test correctly classified 84% as benign and correctly categorized four out of six as high risk. A 2016 study by Loren and colleagues evaluated clinical utility by comparing the association between PancraGEN diagnoses and international consensus guidelines for the classification of intraductal papillary mucinous neoplasms and mucinous cystic pancreatic neoplasms.88 In the study, 491 patients were categorized as (1) "low-risk" or "high-risk" using the PancraGEN diagnostic algorithm; (2) meeting "surveillance" criteria or "surgery" criteria using consensus guidelines; and (3) having "benign" or "malignant" outcomes during clinical follow-up. Additionally, the real-world management decision was categorized as "intervention" if there was a surgical report, surgical pathology, chemotherapy, or positive cytology within 12 months of the index endoscopic ultrasound-guided fine-needle aspiration (EUS-FNA), and otherwise categorized as "surveillance". A 2016 study by Kowalski and colleagues analyzed false negatives from the NPCR to examine clinical utility.89 The study hypothesized that PancraGEN might appropriately classify some pancreatic cysts that had been misclassified by consensus guidelines, but the numbers of cases where PancraGEN and the consensus guidelines disagreed were small, limiting the value of these results.
The clinical validity of PancraGEN has been addressed in several retrospective studies. Most evaluated performance characteristics of PancraGEN for classifying pancreatic cysts according to the risk of malignancy without comparison to current diagnostic algorithms. The best evidence regarding incremental clinical validity comes from the report from the NPCR, which found that PancraGEN has slightly lower sensitivity (83% vs. 91%), similar NPV (97% vs. 97%), but better specificity (91% vs. 46%) and PPV (58% vs. 21%) than consensus guidelines.85
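To illustrate how these four paired metrics relate, the sketch below converts the reported sensitivities and specificities into predictive values using Bayes' rule over a two-by-two table. The roughly 13% prevalence of a malignant outcome is our assumption for illustration, chosen because it approximately reproduces the reported PPV and NPV; it is not a figure taken from the NPCR report. The exercise makes plain that the large PPV gap is driven almost entirely by the specificity gap (91% vs. 46%).

```python
def predictive_values(sens: float, spec: float, prev: float) -> tuple[float, float]:
    """Convert sensitivity and specificity into PPV and NPV at a given
    prevalence of malignant outcome (Bayes' rule over a 2x2 table)."""
    tp, fn = sens * prev, (1 - sens) * prev
    tn, fp = spec * (1 - prev), (1 - spec) * (1 - prev)
    return tp / (tp + fp), tn / (tn + fn)

# Reported operating points; the ~13% prevalence is an illustrative assumption.
for name, sens, spec in [("PancraGEN", 0.83, 0.91),
                         ("Consensus guidelines", 0.91, 0.46)]:
    ppv, npv = predictive_values(sens, spec, 0.13)
    print(f"{name}: PPV = {ppv:.0%}, NPV = {npv:.0%}")
# PancraGEN: PPV = 58%, NPV = 97%
# Consensus guidelines: PPV = 20%, NPV = 97%
```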
Throughout their publications, Interpace Diagnostics has indicated that the PancraGEN test is meant to support first-line testing, but no process for combining PancraGEN with consensus guidelines for decision making has been proposed, and the data reporting outcomes in patients where the PancraGEN and consensus guideline diagnoses disagreed were limited. There are no prospective studies with a simultaneous control population that prove PancraGEN can affect patient-relevant outcomes (e.g., survival, reduction in unnecessary surgeries). Moreover, the evidence reviewed does not demonstrate that PancraGEN has incremental clinical value in the prognosis of pancreatic cysts and associated cancer.
The evidence for the clinical validity of using PancraGEN to evaluate solid pancreaticobiliary lesions consists of three retrospective studies by Khosravi and colleagues (2018), Kushnir and colleagues (2018), and Gonda and colleagues (2017).90-92 One study assessed the ability of PancraGEN to classify solid pancreatic lesions, while the other two evaluated the classification of biliary strictures. Biliary strictures can be caused by solid pancreaticobiliary lesions but also by other processes such as pancreatitis or trauma. Additionally, the studies did not specify what percentage of patients with biliary stricture had solid pancreaticobiliary lesions. While the three retrospective studies noted that the use of cytology plus FISH plus PancraGEN increased sensitivity significantly, the incremental value of cytology plus FISH plus PancraGEN over cytology plus FISH alone is unclear. Interpace Diagnostics has indicated that PancraGEN is meant as an adjunct to first-line testing for pancreatic cysts but has not effectively tested or assessed how PancraGEN performs for solid pancreaticobiliary lesions. Therefore, the evidence reviewed does not demonstrate that PancraGEN has incremental clinical value for the diagnosis of solid pancreaticobiliary lesions.
Notably, there are no studies assessing the analytical validity of the PancraGEN test. Without such data, the technical performance of the test cannot be truly determined. A Technology Assessment and systematic review of PathfinderTG prepared for CMS found no studies that evaluated the analytic validity of LOH analyses in the PathfinderTG framework as compared to a reference standard such as pathology reports or radiologic findings.91 The systematic review addressed questions about analytical validity, clinical validity, and clinical utility, but found no studies which “directly measured whether using LOH-based topographic genotyping with PancraGEN/[PathfinderTG] improved patient-relevant clinical outcomes.” The review also found that the included studies had small sample sizes and methodological limitations and were all retrospective, with no prospective studies. To date, no study has been performed to identify how PancraGEN impacts patient outcomes, such as reducing patient mortality from pancreatic cancer or leading to improved survival.
In summary, the body of peer-reviewed literature concerning PancraGEN is insufficient to establish the analytic validity, clinical validity, and clinical utility of this test in the Medicare population. There is insufficient literature evidence to demonstrate that the topographic genotyping used in PancraGEN is an effective method to aid in the management of individuals with pancreatic cysts or solid pancreaticobiliary lesions when other testing methods are inconclusive or unsuccessful. There is also a lack of peer-reviewed evidence demonstrating that the use of topographic genotyping in the management of individuals with pancreatic cysts results in improved clinical outcomes. As such, this test does not currently meet medically reasonable and necessary criteria for Medicare patients and will not currently be covered.
DecisionDx – Castle Biosciences
DecisionDx-Melanoma
In order to systematically evaluate such a large body of publications, we will discuss both test design and study design of DecisionDx-Melanoma in the context of what information is offered to providers and patients. Looking at an example report from the Castle Biosciences website, we can see that their test provides prognostic information based on GEP data alone (class assignment) and GEP data in combination with clinicopathologic data (i31-ROR and i31-SLNB).186 We will spend most of the discussion below focused on the class assignment portion of the results since most of the peer-reviewed literature focuses on this result alone and the i31-ROR and i31-SLNB were developed much more recently.
Fundamentally, DecisionDx-Melanoma is a GEP that analyzes 28 genes of interest (considered by the company to be significantly informative of the prognosis of melanoma) and anchors these 28 genes to three reference genes. As controls, the three reference genes should provide a consistent baseline across all types of melanoma and non-melanoma tissue. Unfortunately, the literature described in the summary of evidence did not provide insight into the consistency of expression of the three reference genes across tissue types and other pre-analytic variables (for instance, fixation time and age of formalin-fixed, paraffin-embedded [FFPE] tissue). Of note, one of the reference genes (FXR1) serves as a gene of interest (not a control gene) in Castle Bioscience’s DecisionDx-UM.187
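To make the role of the reference genes concrete, the sketch below shows the general principle of reference-gene normalization used in RT-PCR-based expression tests. It is a minimal, hypothetical simplification (gene names and cycle-threshold values are invented) and does not represent Castle Biosciences’ proprietary algorithm; its point is that any instability in the reference baseline propagates directly into every normalized value.

```python
from statistics import mean

def normalize_expression(ct_targets: dict, ct_references: dict) -> dict:
    """Illustrative delta-Ct normalization: each gene of interest is
    anchored to the mean cycle threshold (Ct) of the reference genes.
    If reference-gene expression drifts with tissue type, fixation, or
    specimen age, every normalized value drifts with it."""
    baseline = mean(ct_references.values())
    return {gene: round(ct - baseline, 2) for gene, ct in ct_targets.items()}

# Hypothetical Ct values for two of the 28 genes of interest and the
# three reference genes (all names and numbers invented for illustration).
targets = {"GENE_A": 24.1, "GENE_B": 27.8}
references = {"REF_1": 20.0, "REF_2": 20.4, "REF_3": 19.6}
print(normalize_expression(targets, references))  # {'GENE_A': 4.1, 'GENE_B': 7.8}
```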
Gene expression profiles are founded on the principle of differential gene expression in a cell of interest, like a cancer cell, when compared to background cells, namely all other cells in the surrounding tissue. Tissues are composed of many different types of cells. Skin, for instance, contains a variety of cell types including melanocytes, keratinocytes, immune cells such as lymphocytes, structural cells such as fibroblasts and fibrocytes, hair generating cells, and specialized glandular cells such as apocrine cells. Many different factors influence gene expression in a cell, including changes in the cell’s surroundings (as is seen in a response to sun damage) and the cell’s stage of development, such as whether a cell is part of a germinal layer (and mitotically active) or terminally differentiated. This means that even among cells of the same type (same lineage) the GEP can be different. Considering all of these complexities, a GEP from cells of interest can be difficult to untangle from the GEP of background cells.
In terms of DecisionDx-Melanoma, the test development publication from Gerami and colleagues did not adequately address these complexities inherent to GEPs.109 Many key questions were left unaddressed, such as: how did the relative quantity of tumor versus non-tumor cells affect test sensitivity, did lighter or darker skin tones affect the test outcome, did the test perform differently between histologically distinct melanomas such as acral lentiginous melanoma versus desmoplastic melanoma, and did the presence of sun damage affect test results? As a more specific example, we would expect the presence of tumor-infiltrating lymphocytes (TILs) to affect the outcome of a GEP test. Tumor-infiltrating lymphocytes change the composition of the background tissue, increasing the density of background cells. Additionally, TILs are thought to interact with tumor cells, which suggests tumor cells would respond with differential GEP expression. These factors alone could influence the expression profile of the 28 genes in the DecisionDx-Melanoma test, but until this scenario is tested, its impact is unknown. In summary, understanding the potential pitfalls of GEP testing is critical for understanding the reliability, performance, and accuracy of a GEP test.
During the development of any test, consideration and documentation of pre-analytic variables is critical for establishing test accuracy and precision. For instance, RNA extracted from formalin-fixed, paraffin-embedded tissue is a challenging analyte. Of all the macromolecules (DNA, carbohydrates, lipids, and proteins), RNA is one of the most fragile and unstable, meaning that adverse conditions such as delayed tissue fixation can result in its rapid degradation.188 When creating a valid clinical test (such as a GEP that uses RNA extracted from FFPE specimens), the material used to develop the test must be comparable to the material tested in the clinic. One cannot legitimately develop a test for blood using only urine, a test for females using only male specimens, or a test for breast cancer using only prostate cancer specimens. Similarly, a test using archived, older specimens cannot represent newer, <1 year-old specimens without, at minimum, comparisons demonstrating that the specimens behave similarly in the test. In the case of DecisionDx-Melanoma, test development and validation utilized archival specimens with ages up to 14 years.109 The effect of aged material on RNA integrity was not thoroughly addressed (only a brief statement about quantity and quality assessed using NanoDrop 1000 and Agilent Bioanalyzer 2100 was provided), and older material was not differentiated from newer material.
In a later study assessing the analytic validity of DecisionDx-Melanoma, Cook and colleagues presented data supporting the performance and reproducibility of their test.113 Per the paper introduction, Cook and colleagues stated that they performed their study in accordance with published guidelines, specifically the 2010 publication “Quality, Regulation and Clinical Utility of Laboratory-developed Molecular Tests” from the AHRQ and the 2011 publication “NCCN molecular testing white paper: effectiveness, efficiency, and reimbursement.”189,190 In the study, several points of concern were identified, a few of which are described below.
First, the RNA stability studies extracted RNA only once from each specimen, relying on downstream analysis of the same pool of RNA (kept at -80°C) to confirm analytic validity.113 This means that the study did not assess the processes of macrodissection and RNA extraction for reproducibility and reliability. Note that per the AHRQ, “if the assay incorporates an extraction step, reproducibility of the extraction step should be incorporated into the validation studies, and likewise for any other steps of the procedure.”189
Second, while Cook and colleagues did try to evaluate the effect of FFPE block age on GEP testing, the study did not compare GEP results from the same tumor specimen at different time points.113 Instead, the study evaluated whether or not a GEP result could be obtained from an older FFPE block. Interestingly, Figure 3 in the paper diagrams test failure rates in yearly increments (for specimens aged up to four years) and then lumps together all data from specimens older than four years. Although 6,772 FFPE specimens were represented by Figure 3, the breakdown of how many specimens fell into each age category was not given. The origin and handling of these 6,772 specimens are also unclear. Nowhere else in the paper are such large numbers (thousands versus hundreds) of specimens evaluated despite the apparent availability of 6,772 FFPE specimens. Not only does Figure 3 demonstrate an expected decline in the testability of older specimens, but it also highlights the quandary of using older, less reliable specimens to develop a test intended for clinical specimens that will invariably be under one year of age. Moreover, data regarding the measurement of RNA integrity (as was done in the Gerami publication from 2015) were not provided, even though this would be valuable for comparing specimens of different ages.109 Altogether, the evaluation represented by Figure 3 does not answer the question of whether an older specimen would have the same test result as a younger version of itself.
Cook and colleagues (along with other studies from Castle Biosciences) also failed to sufficiently address many other pre-analytic variables including, but not limited to:113
- Protocols for central pathologic review of cohort specimens:
- How many pathologists participated?
- What specific features were evaluated in each slide?
- How were discrepancies between outside report and internal review handled?
- Protocols for diagnosis of sentinel lymph nodes:
- How many sections/levels were evaluated per lymph node?
- Was immunohistochemistry used for every specimen to identify occult or subtle tumor deposits?
- How much time passed between biopsy or wide excision and placing the specimen in fixative?
- Was the fixation time (time in fixation before processing to FFPE) consistent for each specimen?
- Was the same fixative (e.g., formalin) used for each specimen?
- What was the time between tissue sectioning for slide creation and RNA extraction? Was this time consistent between different specimens?
- When cDNA was “preamplified” prior to testing, was the process confirmed to amplify all relevant genes to the same, consistent degree, or were some amplifications more efficient than others?
These questions address known pitfalls in both the comparability of specimens and the integrity of extracted RNA. Moreover, even if some of the above questions were addressed during test development, the lack of transparency in the published literature prevents clear assessment of the integrity of the test development.
In terms of study design, a prognostic test should ideally evaluate itself in the context of the current standard of care. We would anticipate that a prognostic test for malignancy would both compare its accuracy with the best prognostic standards available and would also compare itself against real world outcomes. Once accuracy is sufficiently established, proving clinical utility becomes crucial. One of the key factors in determining clinical utility is a test’s impact on patient outcome. A test without an improvement in patient outcome is not clinically utile for the purposes of Medicare coverage.
The initial assessment of newly diagnosed melanoma is complicated. For the primary melanoma alone, clinical and pathologic evaluations are critical for developing a proper plan of management for the patient. This plan must consider many factors both in the primary melanoma and the surrounding clinical context, including exposure and family histories. The American Joint Committee on Cancer (AJCC) acts as an authority on the grading and staging of primary melanomas based on many clinical, radiologic, and pathologic factors.191,192 Additionally, the AJCC provides extensive prognostic data tied to the factors used in the grading and staging of melanomas. Many of these factors are assessed during pathologic evaluation and include histologic features such as tumor mitotic rate, surface ulceration, and Breslow depth. At the same time, it must be recognized that AJCC staging is only one consideration in a multitude of data points that the clinical team weighs when developing plans for patient management. For instance, melanoma subtype, which is not explicitly factored into the AJCC scoring, can play a significant role in determining patient management. In general, the term “melanoma” represents a category of malignancies comprising a spectrum of subtypes, each with its own etiologies, behaviors, and properties. For instance, acral lentiginous melanoma is known to be more aggressive than other subtypes of melanoma and to have a poorer prognosis.193 Without consideration of this subtype, a patient could be misclassified as having a less dangerous form of melanoma based on AJCC clinical and pathologic staging alone. For this reason, while AJCC staging is invaluable to patient management in melanoma, it does not represent the only clinical consideration in patient care.
The development and assessment of DecisionDx-Melanoma relied heavily on comparisons to AJCC clinical and pathologic staging and the factors used in these AJCC scores. Often, the authors of DecisionDx-Melanoma studies would focus primarily on a single factor, such as sentinel lymph node positivity, and compare the prognostic value of that factor to the prognostic value of DecisionDx-Melanoma. This strategy often set up false dichotomies, since in clinical practice a single prognostic factor such as sentinel lymph node biopsy is not considered in isolation from other clinical data. Even in more complicated, multifactorial comparisons, studies involving DecisionDx-Melanoma failed to account for the whole clinical and pathologic picture, sometimes evaluating only a limited number of the factors used in AJCC scoring when attempting to establish the prognostic validity of the test. This can be seen in the variability of demographics and clinical information provided from study to study. In general, most studies at least provided information on patient age, Breslow thickness, presence/absence of tumor ulceration, and AJCC clinical and pathologic stages. Conversely, most studies did not provide information regarding the subtype of melanoma, location of the primary tumor, presence/absence of a transected tumor base, and presence/absence of lymphovascular invasion. Moreover, none of the studies identified in the Summary of Evidence provided sufficient information to determine the interrelationships between demographic and clinicopathologic data points. For instance, despite knowing the count of patients with a specific subtype of melanoma, one could not further explore other characteristics within a melanoma subtype group, such as the average Breslow thickness per subtype group or the AJCC clinical stages represented in a subtype group.116
Of all the clinicopathologic factors used in describing melanoma, the Breslow thickness is a central factor, critical in both AJCC clinical and pathologic staging. Measuring Breslow thickness requires histologic identification of both the surface of the melanoma and the deepest point of tumor growth. Obviously, transection of the tumor base during biopsy or wide excision would compromise the accurate measurement of Breslow thickness. Moreover, since AJCC pathologic staging is primarily based on Breslow thickness (with subgrouping currently based on the presence or absence of ulceration), undermeasurement of Breslow thickness can dramatically affect both clinical and pathologic stage assignment. For instance, according to AJCC’s 8th edition, the cutoff between pathologic stage T2 and T3 tumors is a Breslow thickness of 2 mm. Looking only at the pathologic stage without consideration of nodal or metastasis status, a T2 could be AJCC clinical stage I or II depending on the presence or absence of ulceration.192 However, a T3 melanoma will always be at least a clinical stage II tumor. Undermeasurement of a T3 melanoma without ulceration would drop the melanoma at least one clinical stage, from II to I. While this seems to be a minor technicality, several DecisionDx-Melanoma studies draw conclusions through comparison of clinical stage I and stage II melanomas (such as Zager, 2018).117 Interestingly, most of the DecisionDx-Melanoma studies do not present data on how many specimens were transected at the tumor base, although this metric does appear in three more recent studies.110,126,194 In fact, the rate of transection in the more recent studies is striking, seen in 39.5%, 34.9%, and 53.29% of all specimens, respectively.110,126,194 It is further notable that even with the presence of transection, the specimens were still used in the papers’ analyses and conclusions.
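The staging consequence described above can be sketched in a few lines. The thickness cutoffs below follow the AJCC 8th edition T categories; nodal and metastasis status and ulceration subgrouping are omitted for simplicity, so this is an illustration of the undermeasurement problem rather than a full staging implementation.

```python
def t_category(breslow_mm: float) -> str:
    """Simplified AJCC 8th edition T category from Breslow thickness alone
    (ulceration subgrouping and nodal/metastasis status omitted)."""
    if breslow_mm <= 1.0:
        return "T1"
    if breslow_mm <= 2.0:
        return "T2"
    if breslow_mm <= 4.0:
        return "T3"
    return "T4"

# A 2.2 mm non-ulcerated melanoma is T3 and therefore at least clinical
# stage II. If the tumor base is transected and only 1.8 mm is measurable,
# the same tumor is recorded as T2, which without ulceration maps to
# clinical stage I: a one-stage drop caused purely by undermeasurement.
print(t_category(2.2), t_category(1.8))  # T3 T2
```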
Limited patient follow-up proved to be another critical weakness in many of the DecisionDx-Melanoma studies. DecisionDx-Melanoma advertises its results as five year prognosticators for risk of recurrence, metastases, and/or death.186 At a baseline, data supporting this assertion must account for a minimum of five years of patient follow-up, even if the patient experiences a recurrence event. If the patient experiences a local recurrence, they may still develop distant metastases and/or pass away from the melanoma within the five year time frame, both events of which would be relevant to the DecisionDx-Melanoma prognostics. Of all the studies reviewed in the summary of evidence, only one study monitored all of its patients for a minimum of five years.115 Even the publication from Gerami and colleagues in 2015 that described the development and validation of DecisionDx-Melanoma reported use of specimens with well under five years of follow-up.109 Their training cohort included patients with as little as 0.06 years of follow-up (with a claimed median of 6.8 years for all training specimens), and their validation cohort included patients with as little as 0.5 years of follow-up (with a claimed median of 7.3 years for all validation specimens).109 Overall, studies demonstrated median follow-ups of patients without disease recurrence that ranged from 1.5 to 7.5 years.115,117
Publications involving DecisionDx-Melanoma also lacked consistent definitions from study to study. Definitional inconsistency was well captured by Marchetti and colleagues in the Melanoma Prevention Working Group, which convened in 2020 to discuss prognostic GEP tests for melanoma.108 For instance, the definition of “melanoma recurrence,” as used in the outcome metrics of Disease-Free Survival (DFS) or Recurrence-Free Survival (RFS), differed from study to study. In Hsueh (2017), RFS was defined by regional and distant metastases, while in Zager (2018) RFS included local metastases in addition to regional and distant metastases and excluded sentinel lymph node positivity.115,117 Podlipnik (2019) used the term DFS, defining it by “relapse” without further detail, and Keller (2019) used the term RFS without providing a clear definition altogether.121,122 A majority of studies indicated that the outcome risk estimates represented the first five years following a primary diagnosis of melanoma, with a few studies reducing the risk estimate to cover only the first three years. Note again that only the one study from Zager and colleagues in 2018 included patients all followed up for a minimum of five years.117
As evidenced in the previous paragraphs, there are several weaknesses in both the quality and thoroughness of data collection in DecisionDx-Melanoma studies as well as methodologic and definitional inconsistencies. In terms of conclusions and results, we see the potential corollaries of these weaknesses. For instance, the PPV and NPV of these studies are particularly striking. Not all papers used these metrics when evaluating their results, but when PPV and NPV were provided, their values changed dramatically from study to study. This finding is particularly relevant when examining the latest version of the DecisionDx-Melanoma report.186 The DecisionDx-Melanoma report supplies a three to four tier prognostic classification of melanomas (Classes 1A, 1B/2A, 2B). In one of the interpretation tables in the DecisionDx-Melanoma report, the classes (1A, 1B/2A, and 2B) are paired with the AJCC clinical stages (I, II, or III) to provide five year risk estimates for three potential outcomes: Melanoma-Specific Survival (MSS), Distant Metastasis-Free Survival (DMFS), and RFS. According to the report, this interpretation table may be referenced to Greenhaw and colleagues’ publication in 2020.107 If we look at Greenhaw’s meta-analysis study, we find that the PPV and NPV are only provided for RFS (PPV 46%; NPV 92%) and DMFS (PPV 35%; NPV 93%). Remember that PPV and NPV represent the number of true results divided by the number of all positive or all negative results, respectively, both true and false. This means that for patients with a positive result, 35 of 100 patients will experience a distant metastasis (positive for this event) within five years of their original melanoma diagnosis and 65 will not experience a distant metastasis within five years of their original diagnosis. Negative predictive value provides the opposite reassurance, namely that a negative result means 93 of 100 patients will NOT experience a distant metastasis within five years while 7 of 100 patients will still experience a distant metastasis within five years. The reason the concept of PPV and NPV is described here in such basic detail is to highlight the risks of relying on the Class designation provided by DecisionDx-Melanoma to prognosticate patient outcome. The concern for test accuracy is further compounded when one considers that the PPV and NPV differ from study to study. The PPVs for DMFS in the studies described in the Summary of Evidence ranged from 14.6% to 62%.117,124 For reference, the only study with a minimum of five years of follow-up for all patients recorded a PPV of 40% for DMFS.117
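Written explicitly, with TP, FP, TN, and FN denoting true positive, false positive, true negative, and false negative results for the outcome in question:

$$\mathrm{PPV} = \frac{TP}{TP + FP} \qquad\qquad \mathrm{NPV} = \frac{TN}{TN + FN}$$

Applied to the Greenhaw DMFS figures above, a PPV of 35% corresponds to 35 true positives per 100 positive calls, and an NPV of 93% corresponds to 93 true negatives per 100 negative calls.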
Several studies were published addressing the clinical utility of DecisionDx-Melanoma. All of these studies focused on how DecisionDx-Melanoma would impact patient management, typically by measuring to what degree and how the test result changed patient management. Several of the studies utilized hypothetical scenarios and polled providers (ranging from trainee residents to practicing attendings) on how they would respond to these scenarios with and without DecisionDx-Melanoma results.195-198 These studies did not assess real world interactions of the test with patient management. A couple of studies prospectively measured changes in physician behavior and patient management when providers were given DecisionDx-Melanoma results for their patients.184,199 However, as defined for the purposes of this LCD, a clinically utile test must positively affect patient outcome. While these six studies altogether demonstrated changes in physician behavior and/or patient management when DecisionDx-Melanoma was used, none of the studies demonstrated how this positively impacted patient outcome, i.e., by increasing patient survival. A demonstration of clinical utility could be accomplished in a clinical trial where overall survival is compared between patients managed with the test results and patients managed without them. To date, such a trial has not been performed for DecisionDx-Melanoma.
Finally, as discussed in an editorial review by WH Chan, MS, and H Tsao, MD, PhD, in 2020, management of cutaneous melanoma has dramatically changed within the past few years.102 Prognostication now plays less of a role in patient management as other factors are used to determine predictive (therapy-related) outcomes. For instance, sentinel lymph node biopsy status is used to determine if a patient should receive adjuvant chemotherapy. Targeted sequence analyses for specific gene mutations (such as BRAF V600E) can now inform clinicians on which targeted therapy would most benefit their patients. This changing landscape appears to be recognized by Castle Biosciences, which most recently added clinicopathologic algorithmic prognostic results to its test.186 Unfortunately, there is currently insufficient peer-reviewed literature to establish the clinical validity and utility of these new features: two papers as of the writing of this LCD, both of which are validation papers, one for i31-SLNB and the other for i31-ROR.110,111 Without more published literature, including clinical trials, the i31-SLNB and i31-ROR cannot be considered clinically valid or utile for Medicare patients.
It is beyond the scope of this LCD to provide comprehensive analysis of all individual papers reviewed. While the major concerns regarding peer-reviewed literature for DecisionDx-Melanoma are well characterized above, many other concerns exist that should still be addressed, even if not detailed in this Analysis of Evidence. Examples of such concerns include:
- Inadequate information regarding patients with hereditary melanoma disorders
- Inadequate study of the effects of therapies on measured outcomes
- Inadequate information comparing melanomas with different mutational profiles (e.g., tumors with BRAF V600E)
In summary, the body of peer-reviewed literature concerning DecisionDx-Melanoma is insufficient to establish the analytic validity, clinical validity, and clinical utility of this test in the Medicare population. As such, this test does not currently meet medically reasonable and necessary criteria for Medicare patients and will not be currently covered.
DecisionDx-SCC
In 2020, Wysong and colleagues described a 40-gene GEP test (which would later become DecisionDx-SCC) for risk classification in cases of cutaneous squamous cell carcinoma (cSCC).134 Their biomarker study aimed to validate a GEP test that could assess the risk of metastasis in cSCC. Fundamentally, DecisionDx-SCC is a GEP that analyzes 34 genes of interest (considered by Castle Biosciences to be significantly informative of the prognosis of cSCC) and six control genes.
As discussed above for DecisionDx-Melanoma, gene expression profiles rest on the principle of differential gene expression in a cell of interest relative to the many background cell types in the surrounding tissue. The same complexities apply here: skin contains numerous cell types, gene expression varies with a cell’s surroundings (for example, in response to sun damage) and stage of development, and expression can differ even among cells of the same lineage. A GEP from cells of interest can therefore be difficult to untangle from the GEP of background cells.
In terms of DecisionDx-SCC, the test development publication from Wysong and colleagues did not adequately address these complexities inherent to GEPs.134 Many key questions were left unaddressed, such as: how did the relative quantity of tumor versus non-tumor cells affect test sensitivity, did lighter or darker skin tones affect the test outcome, did the test perform differently between histologically distinct cSCCs, and did the presence of sun damage affect test results? As a more specific example, we would expect the presence of TILs to affect the outcome of a GEP test. Tumor-infiltrating lymphocytes change the composition of the background tissue, increasing the density of background cells. Additionally, TILs are thought to interact with tumor cells, which suggests tumor cells would respond with differential GEP expression. These factors alone could influence the expression profile of the 40 genes in the DecisionDx-SCC test, but until this scenario is tested, its impact is unknown. In summary, understanding the potential pitfalls of GEP testing is critical for understanding the reliability, performance, and accuracy of a GEP test.
An additional validation study from Borman and colleagues was published in 2022.135 This paper primarily focused on whether or not the DecisionDx-SCC test would provide “actionable class call outcomes.” They did not provide any information regarding patient follow-up or accuracy of the class call outcomes. While they did test for replication and precision, the sample sizes for these assessments were considerably smaller than the overall cohort used in the study. Additionally, they (as seen in other studies from Castle Biosciences) failed to sufficiently address many other pre-analytic variables including, but not limited to:
- Protocols for central pathologic review of cohort specimens:
- How many pathologists participated?
- What specific features were evaluated in each slide?
- How were discrepancies between outside report and internal review handled?
- How much time passed between biopsy or wide excision and placing the specimen in fixative?
- Was the fixation time (time in fixation before processing to FFPE) consistent for each specimen?
- Was the same fixative (e.g., formalin) used for each specimen?
- What was the time between tissue sectioning for slide creation and RNA extraction? Was this time consistent between different specimens?
- When cDNA was “preamplified” prior to testing, was the process confirmed to amplify all relevant genes to the same, consistent degree, or were some amplifications more efficient than others?
These questions address known pitfalls in both the comparability of specimens and the integrity of extracted RNA. Moreover, even if some of the above questions were addressed during test development, the lack of transparency in the published literature prevents clear assessment of the integrity of the test development.
The additional observational studies produced by Castle Biosciences included three cohort studies and a case series, all published between 2020 and 2022.136-138
Farberg and colleagues published a paper, using the same dataset as the validation study, aiming to assess whether or not the DecisionDx-SCC test could be integrated into the existing NCCN guidelines for the management of patients with cSCC.136 Another paper, from Aaron and colleagues, also used samples from the same dataset as the original validation study.137 This paper assessed whether DecisionDx-SCC could predict recurrence and “provide independent prognostic value to complement current risk assessment methods.” The third cohort study was from Ibrahim and colleagues whose paper attempted to clinically validate the DecisionDx-SCC test.138 In general, these studies had the same issues as those outlined above, and in the two studies that assessed rates of recurrence, patient follow-up data was not reported. The studies stated that “cases had a documented regional or distant metastasis, or documented follow-up of at least three years post-diagnosis of the primary tumor without a metastatic event” but did not give any further information.136-138 DecisionDx-SCC advertises its results as three year prognosticators for risk of recurrence, metastases, and/or death. At a baseline, data supporting this assertion must account for a minimum of three years of patient follow-up, even if the patient experiences a recurrence event. If the patient experiences a local recurrence, they may still develop distant metastases and/or death may result from the cSCC within the three year time frame, both events of which would be relevant to the DecisionDx-SCC prognostics.
Au and colleagues described two cases of cSCC, one with fatal recurrence and one without recurrence, and the retrospective results of DecisionDx-SCC testing on tissue samples from each case.139 While the results did show that the recurrent case was classified as high risk of recurrence and the non-recurrent case was classified as low risk, two cases are insufficient to provide meaningful insight into the generalizability of the test to the general population. Additionally, there is no evidence that the test results would have changed management decisions for the cases or eventual patient outcomes.
Aside from the paper from Au and colleagues, papers addressing clinical utility included surveys, a panel review, and literature reviews.128-133 These papers had a number of shortcomings and limitations, including, but not limited to:
- A high likelihood of selection and response bias in the surveys
- No description of survey participant recruitment methods
- An expert panel composed of Castle Bioscience employees, consultants, and researchers
- Reviews that cited the authors’, or their colleagues’, previous work without acknowledgement
- Lack of methods descriptions or, in review papers, inclusion criteria
Notably, there are no significant studies assessing patient outcomes or clinician treatment decisions in a real-world setting following a DecisionDx-SCC test. Without such data, clinical utility cannot be determined. For example, a demonstration of clinical utility could be accomplished in a clinical trial where patients’ overall survival is compared between patients tested with DecisionDx-SCC and patients managed without this test. To date, such a trial has not been performed for DecisionDx-SCC.
In summary, the body of peer-reviewed literature concerning DecisionDx-SCC is insufficient to establish the analytic validity, clinical validity, and clinical utility of this test in the Medicare population. As such, this test does not currently meet medically reasonable and necessary criteria for Medicare patients and will not be currently covered.
UroVysion fluorescence in situ hybridization (FISH) – Abbott
In order to systematically evaluate the available research surrounding the uFISH test for bladder cancer, it is important to first discuss the test within the context of the test information provided for clinicians and patients. The product page on Abbott Molecular’s website states that the uFISH test “is designed to detect aneuploidy for chromosomes 3, 7, 17, and loss of the 9p21 locus via fluorescence in situ hybridization (FISH) in urine specimens from persons with hematuria suspected of having bladder cancer.”200 UroVysion fluorescence in situ hybridization (uFISH) is often compared to urine cytology, and the manufacturer specifically states that the uFISH test has “greater sensitivity in detecting bladder cancer than cytology across all stages and grades.” A positive result is defined by the manufacturer as four or more cells out of 25 showing gains for two or more chromosomes (3, 7, or 17) in the same cell, or 12 or more out of 25 cells having no 9p21 signals detected. However, not all bladder cancers have these alterations, and these chromosomal changes can also be seen occasionally in healthy tissues and other types of cancer, as noted by Bonberg and colleagues (2014) and Ke and colleagues (2022).140,161 Additionally, genomic profiles and chromosomal abnormalities can vary between low-grade and high-grade bladder cancer, which can make the detection of low-grade cancer less likely.
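For clarity, the manufacturer's positivity threshold quoted above can be formalized as a simple decision rule. The sketch below is an illustrative rendering only; the per-cell signal-count data structure is our assumption, and this is not Abbott's scoring procedure or software.

```python
def ufish_positive(cells: list) -> bool:
    """Illustrative formalization of the manufacturer's stated rule over
    25 evaluated cells: positive if >=4 cells show gains (more than two
    signals) for two or more of chromosomes 3, 7, and 17, or if >=12
    cells show no 9p21 signal."""
    assert len(cells) == 25, "rule is defined over 25 evaluated cells"
    polysomy = sum(
        1 for cell in cells
        if sum(cell[chrom] > 2 for chrom in ("chr3", "chr7", "chr17")) >= 2
    )
    loss_9p21 = sum(1 for cell in cells if cell["9p21"] == 0)
    return polysomy >= 4 or loss_9p21 >= 12

# Example: 4 of 25 hypothetical cells show gains of chromosomes 3 and 7.
normal = {"chr3": 2, "chr7": 2, "chr17": 2, "9p21": 2}
gained = {"chr3": 3, "chr7": 4, "chr17": 2, "9p21": 2}
print(ufish_positive([gained] * 4 + [normal] * 21))  # True
```

Even written this plainly, the rule presumes that the aberrations it counts are specific to urothelial carcinoma, which, as discussed in this section, they are not.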
As noted by Lavery and colleagues in 2017 and Mettman and colleagues in 2021, much of the literature assessing uFISH uses a variety of definitions for positivity.150,157 Lavery aimed to overcome these shortcomings by using a strict definition for a positive uFISH test: the manufacturer’s definition along with the addition of “tetraploidy in at least 10 morphologically abnormal cells.”157 Tetraploidy can be seen in normal cell division and in other non-cancerous processes, so this addition was made to account for false-positive results from the uFISH test. The blinded study described in the paper found no significant difference between uFISH and urine cytology, with sensitivities of 67% and 69% and specificities of 72% and 76%, respectively. Additionally, the authors found that inclusion of the tetraploidy requirement in their definition effectively reduced false-positive rates, but also determined that some bladder cancer tumors do not have the chromosomal alterations for which uFISH assesses (30% of the tumors tested by the authors). Mettman similarly attempted to increase the accuracy of the uFISH test by including tetraploidy in their positivity definition.150 The authors reported considerably different results than the paper from Lavery, with sensitivity of uFISH ranging from 58-95% depending on the definition used, and a specificity of 99% under each definition. However, that study specifically evaluated the test in patients suspected of having pancreatobiliary stricture malignancies, which could account for the differences seen between the two papers.
Sassa and colleagues (2019) compared the uFISH test to urine cytology in 113 patients prior to nephroureterectomy and 23 volunteers with no history of urothelial carcinoma.153 In cases of high-grade urothelial carcinoma (HGUC), the sensitivity, specificity, PPV, and NPV for detection by urinary cytology were 28.0%, 100.0%, 100.0%, and 31.6%, respectively. For uFISH, these values were 60.0%, 84.0%, 93.8%, and 41.2%, respectively. In cases of low-grade urothelial carcinoma (LGUC), however, the results were significantly worse, with sensitivities for both UroVysion and urine cytology of only 30%.
UroVysion fluorescence in situ hybridization (uFISH) has also been assessed as a prognostic test for the recurrence of bladder cancer and as a means of identifying recurrence in patients sooner. A paper from Guan and colleagues in 2018 evaluated the value of uFISH as a prognostic risk factor for bladder cancer recurrence and survival in patients with upper tract urothelial cancer (UTUC).155 One hundred and fifty-nine patients in China received a uFISH test prior to surgery and were then monitored for recurrence. While the authors did indicate that there was a relationship between uFISH results and recurrence, the results were non-significant (p=.07). Liem and colleagues (2017) conducted a prospective cohort study to evaluate whether uFISH can be used to identify recurrence early during treatment with Bacillus Calmette-Guerin (BCG).156 During the study, three bladder washouts at different time points during treatment (t0 = week 0, pre-BCG; t1 = 6 weeks following transurethral resection of bladder tumor [TURBT]; t2 = 3 months following TURBT) were collected for uFISH from patients with bladder cancer who were treated with BCG. The authors found no significant association between a positive uFISH result at t0 or t1 and recurrence but found that a positive uFISH result at t2 was associated with a higher risk of recurrence. Additionally, in 2020, Ikeda and colleagues published a paper that aimed to evaluate the relationship between uFISH test results following TURBT and subsequent intravesical recurrence.158 They indicated that uFISH test positivity was a prognostic indicator for recurrence following TURBT. However, the recurrence rate in patients with two positive uFISH tests was only 33.3%, and in patients with one positive uFISH test (out of two tests total) the recurrence rate was only 16.5%.
Limited patient follow-up was a repeated weakness in papers evaluating uFISH to detect or predict recurrence. For example, the paper from Guan had a median follow-up of 27 months (range: 3-55 months), the paper from Liem had a median of 23 months of follow-up (range: 2-32 months), and the paper from Ikeda had a median follow-up of 27 months (range: 1-36.4 months).155,156,160 These ranges indicate that at least one patient was followed for as little as one month, and at least half of all patients had less than the median follow-up time. This limited follow-up means that cases of recurrence were likely overlooked in the studies. Even in cases where shorter follow-up may have been due to the early detection of recurrence, lack of continued follow-up could result in overlooking a patient with reduced survival following a recurrence; this additional information would be relevant to the uFISH prognostics.
Other observational studies identified included two cohort studies from Nagai and colleagues (2019) and Gomella and colleagues (2017), a case-control study from Freund and colleagues (2018), and a cross-sectional study from Todenhöfer and colleagues (2014).152,154,158,159 Each of these studies reported similar results and limitations to the papers described above. Additionally, Breen and colleagues (2015)45 evaluated uFISH in a comparative study with other tests used to detect urothelial carcinoma in urine. The other tests included Cxbladder Detect, cytology, and NMP22. The study utilized five cohorts of patients, only one of which evaluated all four tests for the entire cohort. Data from the five cohorts were evaluated and integrated, with several different imputation analyses utilized to fill in missing test values and create a “new, imputed, comprehensive dataset.” The authors report that before imputation uFISH had a sensitivity of 40% (the lowest of the four tests) and a specificity of 87.3% (the second lowest of the four tests). Utilizing several different imputation methodologies, similar comparative sensitivities and specificities were seen, leading to the conclusion that the imputed data sets were valid, with the best imputation methodology being the 3NN model. In this 3NN model, uFISH had considerably lower sensitivity than the other three tests and lower specificity than two of the three tests.
In recent years, other authors have conducted reviews and meta-analyses in order to better address the validity and utility of uFISH, and of urinary biomarkers in general. In 2022, Zheng and colleagues published a meta-analysis and review that assessed the prognostic value of uFISH to detect recurrence in the surveillance of non-muscle invasive bladder cancer (NMIBC).141 They identified 15 studies from 2005-2019 that met their inclusion criteria and in their meta-analysis determined that the pooled sensitivity of uFISH in detecting recurrence was 68% (95% CI: 58-76%) and the pooled specificity was 64% (95% CI: 53-74%).
Sciarra and colleagues (2019) conducted a systematic review to evaluate the diagnostic performance of urinary biomarkers for the initial diagnosis of bladder cancer.146 The review identified 12 studies addressing uFISH, with a combined sample size of 5,033 uFISH test results. The mean sensitivity was 64.3% and the median was 64.4%, with a range of 37-100%. Additionally, the mean specificity was 88.4% and the median was 91.3%, with a range of 48-100%.
Another recent paper identified was from Soputro and colleagues (2022), who conducted a literature review and meta-analysis to evaluate the diagnostic performance of urinary biomarkers to detect bladder cancer in primary hematuria.145 The authors identified only two studies assessing uFISH that met their inclusion criteria. The pooled sensitivity and specificity of uFISH in the identified studies were 0.712 and 0.818, respectively. The authors noted that the “current diagnostic abilities of the FDA-approved biomarkers remain insufficient for their general application as a rule out test for bladder cancer diagnosis and as a triage test for cystoscopy in patients with primary hematuria.”145
Sathianathen and colleagues also conducted a literature review and meta-analysis to evaluate the performance of urinary biomarkers in the evaluation of primary hematuria.148 The authors were only able to identify one paper addressing uFISH that met their inclusion criteria; that paper determined that uFISH was comparable to the other biomarker tests being evaluated. However, given that only one paper met the authors’ criteria for inclusion, the findings regarding uFISH could not be properly assessed.
The most recent meta-analysis identified was written by Papavasiliou and colleagues (2023) who assessed the diagnostic performance of urinary biomarkers potentially suitable for use in primary and community care settings.143 The authors identified 10 studies addressing the diagnostic performance of uFISH between 2000 and 2022. These studies had a wide range of sensitivities (0.38-0.96) but a narrower range of specificities (0.76-0.99).
Three additional literature reviews were identified from Bulai and colleagues (2022), Miyake and colleagues (2018), and Nagai and colleagues (2021). Each of these papers noted significant issues with the literature support for these biomarkers in general, and uFISH in particular, but also lacked unambiguous inclusion criteria, search methods, and other information necessary to validate their assessments.142,144,147
Only two identified papers significantly addressed the clinical utility of the uFISH test: Guan (2018) and Meleth (2014).149,155 Guan noted that they did not find any association between a positive uFISH test and survival; however, as noted above, limited follow-up was a significant shortcoming of their study.155 Meleth and colleagues conducted a review of the available literature and were unable to find any papers meeting their inclusion criteria that directly assessed patient survival, physician decision-making, or downstream health outcomes in relation to uFISH test results.149 This lack of information regarding clinical utility is notable; without studies assessing for improvement in patient outcomes in a real-world setting, the evidence supporting the uFISH test for use in the Medicare population is severely lacking.
It is also important to note that no studies were identified that established that uFISH was able to accurately distinguish between urothelial carcinoma and other cancers or other non-cancer urological conditions. As noted above, the specific chromosomal changes that uFISH uses to identify urothelial carcinoma have been identified in non-cancerous tissues and other types of carcinomas. This very notable gap in the identified research includes a lack of details or definitions for non-urothelial cancers, many of which would feed into the urinary system, including prostate cancers, renal cancers, and metastatic or locally invasive cancers from other organs. With the knowledge that the chromosomal changes that uFISH uses to identify urothelial carcinoma can also be found in the context of other malignancies and non-malignancies, and that their identification in urine may not coincide with clinically detectable (e.g., cystoscopically visible) carcinoma, confusion could arise from false positives, especially when the PPV of uFISH tests tends to be very low. If numerous false positive results in uFISH are accepted as an inherent trait of the test, providers may not be as vigilant in closely following patients with a positive uFISH result after a negative cystoscopy. In addition, providers may not search for other malignancies as a potential cause for the “false positive” uFISH result.
It is beyond the scope of this LCD to provide a comprehensive analysis of every individual paper reviewed. However, the major concerns regarding the peer-reviewed literature for uFISH are well represented above.
In summary, the body of peer-reviewed literature concerning UroVysion FISH is insufficient to establish the analytic validity, clinical validity, and clinical utility of this test in the Medicare population. As such, this test does not currently meet medically reasonable and necessary criteria for Medicare patients and will not currently be covered.
Colvera – Clinical Genomics
In April 2015, Pedersen and colleagues published a validation paper describing a blood test that would later be named Colvera.162 The test was designed to identify two methylated genes, branched-chain amino acid transaminase 1 (BCAT1) and ikaros family zinc finger protein 1 (IKZF1), both of which Clinical Genomics had previously identified as important in screening for colorectal cancer (CRC). The study used methylation-specific PCR assays to measure the levels of methylated BCAT1 and IKZF1 in DNA extracted from plasma obtained from 144 colonoscopy-confirmed healthy controls and 74 CRC cases. The authors found that their test was positive in 77% of cancer cases and 7.6% of controls. This study, however, failed to sufficiently address many pre-analytic variables, such as the protocols for pathologic review and diagnosis.
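For context, the reported positivity rates convert directly into the standard accuracy metrics (a restatement of the paper's own figures, not additional data):

\[ Se \approx 0.77 \;(\approx 57/74 \text{ cases}), \qquad Sp = 1 - 0.076 \approx 0.924 \;(\approx 133/144 \text{ controls}) \]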
Later that same year, another validation paper (also led by Pedersen) was published.161 This cohort study used both prospective and retrospective methods to collect plasma samples from 2,105 volunteers and reported a test sensitivity of 66% (95% CI: 57%–74%). For CRC stages I-IV, the respective positivity rates were 38% (95% CI: 21%–58%), 69% (95% CI: 53%–82%), 73% (95% CI: 56%–85%), and 94% (95% CI: 70%–100%). Specificity was 94% (95% CI: 92%–95%) in all 838 cases with non-neoplastic pathology and 95% (95% CI: 92%–97%) in those with no colonic pathology detected (n=450). It is important to note that case diagnosis was performed by one independent physician and that there was no control of colonoscopy or pathology procedures. The authors stated that this was due to their aim “to assess marker performance relative to outcomes determined in usual clinical practice.”
An additional validation paper was published by Murray and colleagues in 2017, assessing both the analytic and clinical validity of the Colvera blood test.164 The authors reported drawing on the archived samples from the earlier study by Pedersen and colleagues (n=2,105 samples) but used only a subset of them (n=222 specimens, 26 with cancer).162,163 The authors did not describe their selection criteria for these samples; specifically, whether sample selection was randomized or why the majority of the archived specimens were not selected. Murray and colleagues found that the Colvera test had good reproducibility and repeatability, with a reported sensitivity of 73.1% and specificity of 89.3%. In addition to the questions regarding sample selection, other questions were left unanswered in the paper, including, but not limited to:
- Does the precision of the test vary in different stages of cancer?
- Does treatment (such as chemotherapy/radiation) impact the precision of the test?
- For apparent false positives, would a longer follow-up reveal them to be true positives?
- In general, would serial sampling or longitudinal data impact the precision estimates of the test?
An additional paper, published in 2018 by Murray and colleagues, sought to establish the clinical validity of the Colvera test.170 The authors tested patients post-surgery (a median of 2.3 months after surgery) and followed them to establish whether recurrence was detected. Median follow-up for recurrence was 23.3 months (IQR: 14.3-29.5 months). Twenty-three participants were diagnosed with recurrence, but the Colvera test was positive in 28 participants. It should be noted that cancer treatment varied considerably between cases, even between patients with a positive Colvera result and those with a negative result. Only 61% of patients with a positive Colvera result completed their initial course of treatment, while 87% of patients with a negative result did. The authors state that this was due “to either patients declining ongoing therapy, or due to comorbidities or complications precluding a full course of treatment.”170 This could have significantly confounded the results, given the higher likelihood of recurrence in patients who did not receive a full course of treatment compared with those who did. Additionally, while the median follow-up was 23.3 months, by definition half of the patients were followed for a shorter period, and without long-term follow-up, additional cases of recurrence were likely missed.
Five other papers, all cohort studies, were identified that assessed the clinical validity of the Colvera test, in particular its performance compared to carcinoembryonic antigen (CEA) and/or fecal immunochemical tests (FIT).165-167,169,171 These papers, from Clinical Genomics, found the sensitivity of Colvera to be 62-68% and the specificity to be 87-97.9%, better than the results for both CEA and FIT in the same studies.
Young and colleagues (2016) assessed 122 patients being monitored for recurrent CRC (28 of whom had confirmed recurrence) to determine whether Colvera or CEA was more accurate.165 The study obtained a blood sample only within the window from 12 months prior to three months following verification of a patient’s recurrence status. This method of determining test accuracy was problematic, in particular because follow-up lengths varied considerably between patients. In patients with confirmed recurrence, median follow-up was 28.3 months (IQR: 21.9-41.0); in patients without confirmed recurrence, median follow-up was only 17.3 months (IQR: 12.0-29.2). This indicates that the majority of “confirmed” non-recurrence cases were followed for less time than the median follow-up in recurrent cases, and without an adequate length of follow-up, it is certainly possible that cases of recurrence were missed. Additionally, while the authors did report some longitudinal data (the concordance of test results from the same patient taken at different times), those data were limited to only 30 of the 122 cases. Among the cases with longitudinal data, multiple cases were reported to have false-positive test results. Combined with the insufficient follow-up already discussed, the likelihood that some results labeled false positive were in fact unrecognized recurrences increases considerably.
Musher and colleagues (2020) and Symonds and colleagues (2020) also compared Colvera to CEA for detecting recurrent CRC.166,167 Musher, like Young (2016), had short follow-up periods and insufficient longitudinal data (median follow-up of 15 months; range: 1-60 months).166,167 Symonds (2020), however, did obtain comparatively more longitudinal data with longer follow-up, and showed that Colvera could return a positive result months before imaging confirmation.167 However, without any assessment of the impact of test results on clinical outcomes, the utility of the test cannot be ascertained. Also, in the papers from Young, Musher, and Symonds, CEA sensitivity was considerably lower than typically reported elsewhere in the literature (32%, 48%, and 32%, respectively).165-167 While not a direct reflection on the validity or utility of Colvera, this discrepancy is important to note because the authors were comparing Colvera against the CEA test.
The two additional cohort studies evaluating Colvera, Symonds and colleagues (2016) and Symonds and colleagues (2018), had findings and shortcomings similar to those of the three studies described above, with test sensitivities of 62% in both papers and similar study designs.169,171
Finally, the paper from Cock and colleagues (2019) assessed the precision of both Colvera and FIT testing in the detection of sessile serrated adenomas/polyps (SSPs).168 For this study, the authors used the same samples used by Symonds and colleagues (2016).169 While the paper addressed pre-analytic variables and other shortcomings more sufficiently than the previously discussed studies, the results do not support the use of Colvera for the detection of SSPs. Forty-nine SSPs were identified during the colonoscopies of 1,403 participants who were also tested with the Colvera test. In patients with SSPs, the Colvera test had a sensitivity of only 8.8%, and when combined with FIT, the sensitivity increased to only 26.5%.
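The reported gain from adding FIT is consistent with an “either-test-positive” rule. Under the simplifying assumption that the two tests miss lesions independently (an assumption made here for illustration only, not a finding of the study), the combined sensitivity would be

\[ Se_{combined} = 1 - (1 - Se_{Colvera})(1 - Se_{FIT}) \]

Working backward from the reported figures, \(1 - (1 - 0.088)(1 - Se_{FIT}) = 0.265\) implies \(Se_{FIT} \approx 0.19\), suggesting that most of the combined sensitivity for SSPs came from FIT rather than from Colvera.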
Notably, there are no studies assessing patient outcomes or clinician treatment decisions in a real-world setting following a Colvera test. Without such data, clinical utility cannot be determined. One of the key factors in determining clinical utility is a test’s impact on patient outcomes. For example, clinical utility could be demonstrated in a clinical trial comparing overall survival between patients managed with Colvera testing and patients managed without it, as sketched below. To date, such a trial has not been performed for Colvera.
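A minimal sketch of how such a trial’s primary endpoint might be analyzed follows. It assumes a hypothetical dataset (trial_outcomes.csv, with columns arm, months, and died) and uses the open-source lifelines library as one possible tool choice; all names are illustrative, and this does not describe any existing study.

```python
# A minimal sketch of an overall-survival comparison between a
# Colvera-guided arm and a usual-care arm. The dataset, file name, and
# column names are hypothetical; no such trial has been performed.
import pandas as pd
from lifelines import KaplanMeierFitter
from lifelines.statistics import logrank_test

df = pd.read_csv("trial_outcomes.csv")  # hypothetical trial data

tested = df[df["arm"] == "colvera"]     # patients managed with Colvera
usual = df[df["arm"] == "usual_care"]   # patients managed without it

# Estimate overall survival in the tested arm via Kaplan-Meier.
kmf = KaplanMeierFitter()
kmf.fit(tested["months"], event_observed=tested["died"], label="Colvera-guided")
print("Median survival (tested arm):", kmf.median_survival_time_)

# Compare the two survival curves with a log-rank test; a durable survival
# advantage in the tested arm would be one form of clinical utility evidence.
result = logrank_test(
    tested["months"], usual["months"],
    event_observed_A=tested["died"], event_observed_B=usual["died"],
)
print("Log-rank p-value:", result.p_value)
```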
In general, papers assessing the validity of the Colvera test for CRC have a number of shortcomings, including short follow-up times, insufficient longitudinal data, insufficient descriptions of study methodology, and a failure to sufficiently address important pre-analytic variables. Additionally, no paper has been published addressing the clinical utility of Colvera; a test that has not been shown to improve patient outcomes is not clinically useful for the purposes of Medicare coverage.
In summary, the body of peer-reviewed literature concerning Colvera is insufficient to establish the analytic validity, clinical validity, and clinical utility of this test in the Medicare population. As such, this test does not currently meet medically reasonable and necessary criteria for Medicare patients and will not currently be covered.
PancreaSeq® Genomic Classifier, Molecular and Genomic Pathology Laboratory, University of Pittsburgh Medical Center
Because no peer-reviewed literature was identified for this test, and thus no peer-reviewed evidence detailing its analytic validity, clinical validity, or clinical utility, this test is currently non-covered for the Medicare population.