Genomics: Insight
Integrating Multi-Omics and Deep Learning for Enhanced Breast Cancer Subtype and Stage Recognition.
An advanced diagnosis approach through biomarker analysis with the use of machine learning
Research Question: How can multi-omics data integration with Deep Learning (DL) improve the prediction of breast cancer subtypes and stages of progression?
Background
Accounting for an estimated 16% of new cancer cases and 7% of cancer-related deaths recorded in 2024, breast cancer is the most prevalent cancer in the world and can be lethal if detected too late (National Cancer Institute, 2018). A woman in the United States has a roughly 13% chance (1 in 8) of developing breast cancer, and the cancer is most often found in middle-aged and older women, with a median age at diagnosis of 62 years (American Cancer Society, 2023). Cancer-specific survival (CSS) rates are also significantly higher when the cancer is caught early. CSS counts only deaths caused by breast cancer and excludes deaths from any other cause, which makes it particularly useful for evaluating the impact of the disease. Zuo et al. (2017) found that five-year CSS rates for stage I, II, III, and IV breast cancer were 97.1%, 92.6%, 75.6%, and 42.7%, respectively. These statistics show that early detection is essential to improving patient survival.
Current diagnostic techniques for breast cancer rely primarily on imaging, including mammography, ultrasonography, magnetic resonance imaging (MRI), and positron emission tomography (PET) scans (Karellas & Vedantham, 2008). While effective, these techniques may not always detect early-stage cancer or accurately distinguish between subtypes (Cho, 2016). Pathological analyses such as HER2, estrogen, and progesterone receptor testing may give some insight but often lack the precision needed for more personalized care. To address these limitations, researchers are starting to integrate multi-omics data into machine learning models (Yang et al., 2024). Multi-omics approaches combine data from genomics, transcriptomics, proteomics, and epigenomics, giving a comprehensive view of molecular changes in cancer cells. Machine learning algorithms can analyze these large datasets to identify patterns and predict cancer subtypes with higher accuracy than traditional methods. Using multi-omics data can therefore increase the precision of breast cancer classification, improving patient outcomes and making treatments more specific to each patient.
Breast Cancer Subtypes and Complexity
Breast cancer is a heterogeneous disease with multiple subtypes, including hormone receptor-positive (HR-positive), human epidermal growth factor receptor 2-positive (HER2-positive), and triple-negative breast cancer (TNBC) (Ensenyat-Mendez et al., 2021). Each subtype has molecular differences that influence treatment and outcomes. In HR-positive cancers, receptors for estrogen or progesterone are present, so hormone-blocking treatments that inhibit cancer-driving hormones can slow or stop tumor progression. HER2-positive cancers overexpress the HER2 protein and are therefore sensitive to targeted therapies that block HER2 receptors and disrupt tumor growth (Froelich, 2024). TNBC lacks hormone receptors and HER2 overexpression, which makes it aggressive and hard to treat; chemotherapy therefore remains the primary treatment (Gupta et al., 2020). While current diagnostics are effective, imaging can fail to detect early-stage cancers, especially in dense breast tissue. Additionally, traditional biomarker tests may miss molecular differences, limiting their accuracy (Esserman and Yau, 2015). These challenges show the need for precise, personalized approaches. Personalized medicine uses multi-omics and machine learning to examine the individual genetic, protein, and metabolic profile of each tumor and to design treatment strategies tailored to it (Hasin et al., 2017). By targeting specific molecular features, personalized medicine can optimize treatment efficacy and reduce unnecessary side effects, supporting a better quality of life and improved survival.
Machine Learning for Multi-Omics Data Integration
Alkhateeb et al. (2022) introduced a novel model to predict breast cancer stages by integrating three types of multi-omics data: gene expression, copy number alteration (CNA), and DNA methylation. The integration of these diverse data types leads to a “large p, small n” problem, where the number of features is much larger than the number of samples. This may lead to overfitting, where the model performs well on training data but does not generalize well to new data. To handle this large feature space, the model uses dimensionality reduction to find a simpler representation of the complex data. More precisely, it is based on Isomap, a technique that projects high-dimensional data onto a simplified 2D map, known as a gene similarity network (GSN), which represents the relationships among the genes. The map is then color-coded based on the values from each data type: red for gene expression, green for CNA, and blue for DNA methylation. This allows a convolutional neural network (CNN) to treat the combined data as a single, easy-to-read image.
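The color-coding step can be illustrated with a minimal Python sketch. It assumes each gene already has a 2D coordinate (for example, from Isomap) and three omics values scaled to the range 0 to 1. The function name `build_gsn_image`, the grid size, and the toy values are hypothetical and are not taken from Alkhateeb et al. (2022).

```python
def build_gsn_image(coords, expression, cna, methylation, size=4):
    """Place each gene on a size x size grid and store its three omics
    values as the red, green, and blue channels of that pixel."""
    xs = [x for x, _ in coords]
    ys = [y for _, y in coords]
    x_min, x_max = min(xs), max(xs)
    y_min, y_max = min(ys), max(ys)

    def to_pixel(value, low, high):
        # Map a coordinate onto an integer pixel index in [0, size - 1].
        if high == low:
            return 0
        return min(size - 1, int((value - low) / (high - low) * size))

    # Start from a black image: every pixel is [R, G, B] = [0, 0, 0].
    image = [[[0.0, 0.0, 0.0] for _ in range(size)] for _ in range(size)]
    for (x, y), r, g, b in zip(coords, expression, cna, methylation):
        row = to_pixel(y, y_min, y_max)
        col = to_pixel(x, x_min, x_max)
        # Simplification in this sketch: if two genes share a pixel,
        # the later gene overwrites the earlier one.
        image[row][col] = [r, g, b]
    return image


# Toy example: three genes with made-up 2D coordinates and omics values.
coords = [(0.0, 0.0), (1.0, 0.0), (1.0, 1.0)]
image = build_gsn_image(
    coords,
    expression=[0.9, 0.1, 0.5],   # red channel
    cna=[0.2, 0.8, 0.5],          # green channel
    methylation=[0.0, 0.3, 1.0],  # blue channel
)
```

Each patient would yield one such image, so samples with similar molecular profiles produce similar color patterns that a CNN can learn from.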
The CNN model has multiple processing layers that detect important patterns in the GSN images; different layers adjust the images, sharpen details, and prevent overfitting. Adjustment refers to filters that highlight biologically significant patterns, such as regions of high gene expression or high methylation. Sharpening is achieved through layers that keep important features while discarding irrelevant information. After these layers, the data passes through fully connected layers that perform the final task of predicting breast cancer stages. When tested on a large breast cancer dataset, the model reached an accuracy of 99.85% (Figure 1).
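As a rough illustration of how a filter highlights a pattern and how a later layer keeps only the strongest responses, the sketch below applies one convolution with ReLU followed by max pooling to a toy single-channel image. Reading "sharpening" as pooling is an assumption made here, the filter is written by hand (a real CNN learns its filters), and the exact layers used by Alkhateeb et al. (2022) are not reproduced.

```python
def conv2d(image, kernel):
    """Valid 2D cross-correlation followed by ReLU, on square inputs."""
    k = len(kernel)
    out_size = len(image) - k + 1
    output = []
    for i in range(out_size):
        row = []
        for j in range(out_size):
            total = sum(image[i + a][j + b] * kernel[a][b]
                        for a in range(k) for b in range(k))
            row.append(max(0.0, total))  # ReLU: keep only positive responses
        output.append(row)
    return output


def max_pool(feature_map, window=2):
    """Keep the strongest response in each non-overlapping window."""
    n = len(feature_map) // window
    return [[max(feature_map[i * window + a][j * window + b]
                 for a in range(window) for b in range(window))
             for j in range(n)]
            for i in range(n)]


# Toy 5x5 image with a bright vertical stripe in the middle column.
image = [[0, 0, 1, 0, 0] for _ in range(5)]

# Hypothetical hand-written filter that responds where intensity
# increases from left to right.
vertical_edge = [[-1, 1],
                 [-1, 1]]

features = conv2d(image, vertical_edge)  # 4x4 map, strong at the stripe's left edge
pooled = max_pool(features)              # 2x2 summary of the strongest responses
```

In a full model, the pooled maps from many learned filters would be flattened and passed to the fully connected layers that output the predicted stage.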
Multi-omics and ML can increase precision in diagnosing and personalizing breast cancer treatments.
The success of this model suggests it could be influential in predicting breast cancer stages. However, the model has a limitation: its RGB color coding provides only three channels, so it can integrate only three types of data. Future work could extend the model with a framework that supports and combines more data types, enabling further insights.
Model | Author | ACC | AUC | F1 |
MVGNN | Ren et al. | 91.80% | 95.30% | 71.55% |
Proposed CNN | Alkhateeb et al. | 99.85% | 99.97% | 99.4% |
Table 1: A comparison of the two discussed DL models based on the performance metrics accuracy (ACC), area under the curve (AUC), and F1 score, created by the authors
The multi-view graph neural network (MVGNN) is a machine learning model designed to analyze complex data from different perspectives. It uses Graph Convolutional Networks (GCNs) to capture features from various kinds of data (for instance, genomics and proteomics), which are represented as nodes within a graph. The MVGNN merges these features to help the model recognize patterns across data sources. This is particularly useful for tasks like predicting disease subtypes in multi-omics studies.
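The core operation of a GCN can be sketched in a few lines: each node's new features are a normalized average of its own and its neighbors' features, passed through a weight matrix and a ReLU. The sketch below uses the widely used symmetric normalization; the toy graph, features, and weights are made up, and it is not claimed to match the exact layer definition used in MVGNN.

```python
import math


def matmul(a, b):
    """Multiply two matrices given as lists of rows."""
    return [[sum(a[i][k] * b[k][j] for k in range(len(b)))
             for j in range(len(b[0]))]
            for i in range(len(a))]


def gcn_layer(adjacency, features, weights):
    """One graph convolution: ReLU(D^-1/2 (A + I) D^-1/2 H W)."""
    n = len(adjacency)
    # Add self-loops so each node also keeps its own features.
    a_hat = [[adjacency[i][j] + (1 if i == j else 0) for j in range(n)]
             for i in range(n)]
    degree = [sum(row) for row in a_hat]
    # Symmetric normalization prevents high-degree nodes from dominating.
    norm = [[a_hat[i][j] / math.sqrt(degree[i] * degree[j]) for j in range(n)]
            for i in range(n)]
    hidden = matmul(matmul(norm, features), weights)
    return [[max(0.0, value) for value in row] for row in hidden]


# Toy graph: two connected nodes, each with a 2-dimensional feature vector.
adjacency = [[0, 1],
             [1, 0]]
features = [[1.0, 0.0],
            [0.0, 1.0]]
weights = [[1.0, 0.0],   # identity weights, so only the neighbor
           [0.0, 1.0]]   # averaging effect is visible

updated = gcn_layer(adjacency, features, weights)
```

After one layer, both nodes hold a blend of the two original feature vectors, which is how information spreads along the edges of the graph.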
A recent study by Ren et al. (2024) proposes an MVGNN that improves the prediction of breast cancer subtypes by integrating multi-omics data, including genomics, transcriptomics, and proteomics, to represent the complexity of the disease. As Valous et al. (2024) describe, GCNs in the MVGNN encode omics-specific features, while an attention mechanism integrates these features to achieve better accuracy in subtype classification. An attention mechanism is a process in which the model assigns different levels of importance to features based on their relevance to the task. In this case, it helps the MVGNN focus on the most critical patterns in the multi-omics data. This approach shows the potential of multi-omics data fusion for developing precise and effective predictive models in cancer research.
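A minimal sketch of attention-based fusion is given below, assuming each omics view has already been encoded as a vector of the same length. Each view receives a relevance score, the scores are turned into weights with a softmax, and the fused representation is the weighted sum. The `query` vector stands in for parameters a real model would learn, and all numbers are illustrative; the attention formulation in MVGNN may differ.

```python
import math


def softmax(scores):
    """Convert raw scores to positive weights that sum to 1."""
    peak = max(scores)  # subtract the maximum for numerical stability
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def attention_fuse(view_embeddings, query):
    """Weight each view by its dot-product relevance to the query vector."""
    scores = [sum(e * q for e, q in zip(embedding, query))
              for embedding in view_embeddings]
    weights = softmax(scores)
    dim = len(view_embeddings[0])
    fused = [sum(w * embedding[d] for w, embedding in zip(weights, view_embeddings))
             for d in range(dim)]
    return weights, fused


# Hypothetical per-view embeddings for a single sample.
views = [
    [1.0, 0.0],  # genomics
    [0.0, 1.0],  # transcriptomics
    [0.5, 0.5],  # proteomics
]
query = [2.0, 0.0]  # stand-in for a learned vector

weights, fused = attention_fuse(views, query)
```

Here the genomics view aligns best with the query, so it receives the largest weight and contributes most to the fused vector.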
Table 1 lists the reported ACC, AUC, and F1 scores for the two models discussed. According to Ren et al. (2024), the MVGNN outperformed the other models it was evaluated against for breast cancer subtype classification, reaching 0.9180 for ACC, 0.9530 for AUC, and 0.7155 for F1. Because the CNN of Alkhateeb et al. (2022) predicts stages rather than subtypes, the two rows of Table 1 describe different tasks and are not a direct head-to-head comparison. Even so, the MVGNN results demonstrate effective integration of multi-omics data for subtype classification, signifying its potential to drive precision oncology.
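To make the difference between ACC and F1 concrete, the sketch below computes both on a small, made-up set of predictions. It uses macro-averaged F1, which is an assumption, since the averaging used in the cited studies is not stated here. Whether class imbalance explains any gap between ACC and F1 in Table 1 is also not stated in the sources; the example only shows that such a gap can occur.

```python
def accuracy(y_true, y_pred):
    """Fraction of samples whose predicted label matches the true label."""
    correct = sum(t == p for t, p in zip(y_true, y_pred))
    return correct / len(y_true)


def macro_f1(y_true, y_pred):
    """Unweighted mean of the per-class F1 scores."""
    scores = []
    for label in sorted(set(y_true)):
        tp = sum(t == label and p == label for t, p in zip(y_true, y_pred))
        fp = sum(t != label and p == label for t, p in zip(y_true, y_pred))
        fn = sum(t == label and p != label for t, p in zip(y_true, y_pred))
        denominator = 2 * tp + fp + fn
        # F1 = 2TP / (2TP + FP + FN); defined as 0 when the class is never hit.
        scores.append(2 * tp / denominator if denominator else 0.0)
    return sum(scores) / len(scores)


# Made-up imbalanced example: 8 samples of subtype "A", 2 of subtype "B",
# and a model that always predicts "A".
y_true = ["A"] * 8 + ["B"] * 2
y_pred = ["A"] * 10

acc = accuracy(y_true, y_pred)
f1 = macro_f1(y_true, y_pred)
```

In this toy case accuracy is 0.8 while macro F1 is about 0.44, because the model never identifies the rarer class. This is why F1 is usually reported alongside accuracy for subtype classification.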
One major limitation of multi-omics models such as these is the "large p, small n" problem: datasets have many features but relatively few samples. This imbalance between features and samples may result in overfitting, in which models capture noise rather than meaningful patterns, limiting predictive accuracy on new data. Feature selection techniques can be applied to reduce the number of features, but even those methods may omit relevant biological information.
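One simple feature selection technique is to keep only the features that vary most across samples. The sketch below illustrates the idea on made-up data; variance filtering is chosen here purely as an example and is not stated to be the method used in the cited studies.

```python
def top_variance_features(samples, k):
    """Return the indices of the k highest-variance features and the
    reduced data matrix containing only those columns."""
    n = len(samples)
    n_features = len(samples[0])
    variances = []
    for j in range(n_features):
        column = [row[j] for row in samples]
        mean = sum(column) / n
        variances.append(sum((value - mean) ** 2 for value in column) / n)
    ranked = sorted(range(n_features), key=lambda j: variances[j], reverse=True)
    keep = sorted(ranked[:k])  # restore the original column order
    reduced = [[row[j] for j in keep] for row in samples]
    return keep, reduced


# Made-up data: 3 samples (n) described by 5 features (p), so p > n.
samples = [
    [1.0, 5.0, 0.0, 2.0, 9.0],
    [1.0, 1.0, 0.1, 2.0, 3.0],
    [1.0, 3.0, 0.0, 2.0, 6.0],
]

kept, reduced = top_variance_features(samples, k=2)
```

The caveat from the text applies directly: the third feature varies only slightly and is discarded, yet a small change in a real gene could still be biologically meaningful.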
MVGNN effectively integrates multi-omics data, enhancing accuracy in breast cancer subtype classification by 22%
Conclusion
Integrating multi-omics data with advanced machine learning techniques can lead to significant advancement in recognizing breast cancer stages and subtypes. This is imperative to improving survival rates in a disease that accounts for the highest number of cancer cases worldwide. Traditional imaging methods have shortcomings when identifying early-stage cancers or correctly differentiating between the diverse subtypes, which can change the outcomes of treatments. Leveraging multi-omics data–genomics, transcriptomics, proteomics, and epigenomics–allows researchers to discover vital patterns that lead to more precise classifications and personalized treatment options. Overall, the proposed CNN model and MVGNN show strong potential in assisting physicians in categorizing breast cancer subtypes and identifying its stages.
References
- Alkhateeb, A., Bashier ElKarami, H., Qattous, H., Al-Refai, A., AlAfeshat, N., Shahrrava, B., & Azzeh, M. (2022). Multi-omics data integration model based on Isomap and convolutional neural network. IEEE. https://doi.org/10.1109/icmla55696.2022.00218
- American Cancer Society. (2023, January 12). Key statistics for breast cancer. American Cancer Society. https://www.cancer.org/cancer/types/breast-cancer/about/how-common-is-breast-cancer.html
- Cho, N. (2016). Molecular subtypes and imaging phenotypes of breast cancer. Ultrasonography, 35(4), 281–288. https://doi.org/10.14366/usg.16030
- Ensenyat-Mendez, M., Llinàs-Arias, P., Orozco, J. I. J., Íñiguez-Muñoz, S., Salomon, M. P., Sesé, B., DiNome, M. L., & Marzese, D. M. (2021). Current triple-negative breast cancer subtypes: Dissecting the most aggressive form of breast cancer. Frontiers in Oncology, 11(15). https://doi.org/10.3389/fonc.2021.681476
- Esserman, L., & Yau, C. (2015). Rethinking the standard for ductal carcinoma in situ treatment. JAMA Oncology, 1(7), 881. https://doi.org/10.1001/jamaoncol.2015.2607
- Froelich, W. (2024). Chemotherapy-free therapy for HR+/HER2+ breast cancer patients. Oncology Times, 46(2), 32–32. https://doi.org/10.1097/01.cot.0001007220.34406.8e
- Gupta, G. K., Collier, A. L., Lee, D., Hoefer, R. A., Zheleva, V., Siewertsz van Reesema, L. L., Tang-Tan, A. M., Guye, M. L., Chang, D. Z., Winston, J. S., Samli, B., Jansen, R. J., Petricoin, E. F., Goetz, M. P., Bear, H. D., & Tang, A. H. (2020). Perspectives on triple-negative breast cancer: Current treatment strategies, unmet needs, and potential targets for future therapies. Cancers, 12(9), 2392. https://doi.org/10.3390/cancers12092392
- Hasin, Y., Seldin, M., & Lusis, A. (2017). Multi-omics approaches to disease. Genome Biology, 18(1). https://doi.org/10.1186/s13059-017-1215-1
- Karellas, A., & Vedantham, S. (2008). Breast cancer imaging: A perspective for the next decade. Medical Physics, 35(11), 4878–4897. https://doi.org/10.1118/1.2986144
- National Cancer Institute. (2018). Common cancer sites - Cancer stat facts. SEER. https://seer.cancer.gov/statfacts/html/common.html
- Ren, Y., Gao, Y., Du, W., Qiao, W., Li, W., Yang, Q., Liang, Y., & Li, G. (2024). Classifying breast cancer using multi-view graph neural network based on multi-omics data. Frontiers in Genetics, 15. https://doi.org/10.3389/fgene.2024.1363896
- Valous, N. A., Popp, F., Zörnig, I., Jäger, D., & Charoentong, P. (2024). Graph machine learning for integrated multi-omics analysis. British Journal of Cancer, 131(2), 205–211. https://doi.org/10.1038/s41416-024-02706-7
- Yang, S., Wang, Z., Wang, C., Li, C., & Wang, B. (2024). Comparative evaluation of machine learning models for subtyping triple-negative breast cancer: A deep learning-based multi-omics data integration approach. Journal of Cancer, 15(12), 3943–3957. https://doi.org/10.7150/jca.93215
- Zuo, T., Zeng, H., Li, H., Liu, S., Yang, L., Xia, C., Zheng, R., Ma, F., Liu, L., Wang, N., Xuan, L., & Chen, W. (2017). The influence of stage at diagnosis and molecular subtype on breast cancer patient survival: A hospital-based multi-center study. Chinese Journal of Cancer, 36(1). https://doi.org/10.1186/s40880-017-0250-3
About the Authors
Shakeel Abdulkareem, Aiden Wang, and Aneesh Gudipati are high school juniors at John Champe High School in Aldie, VA. They are interested in computational biology and the applications of deep learning in analyzing the human genome, and are planning to continue researching disease diagnosis through machine learning.
Mentor: Andrew Riggleman. Affiliation: John Champe High School