## Abstract

Heterogeneity in terms of tumor characteristics, prognosis, and survival among cancer patients has been a persistent problem for many decades. Currently, prognosis and outcome predictions are made based on clinical factors and/or by incorporating molecular profiling data. However, inaccurate prognosis and prediction may result by using only clinical or molecular information directly. One of the main shortcomings of past studies is the failure to incorporate prior biological information into the predictive model, given strong evidence of the pathway-based genetic nature of cancer, *i.e.*, the potential for oncogenes to be grouped into pathways based on biological functions such as cell survival, proliferation, and metastatic dissemination. To address this problem, we propose a two-stage approach to incorporate pathway information into the prognostic modeling using large-scale gene expression data. In the first stage, we fit all predictors within each pathway using the penalized Cox model and Bayesian hierarchical Cox model. In the second stage, we combine the cross-validated prognostic scores of all pathways obtained in the first stage as new predictors to build an integrated prognostic model for prediction. We apply the proposed method to analyze two independent breast and ovarian cancer datasets from The Cancer Genome Atlas (TCGA), predicting overall survival using large-scale gene expression profiling data. The results from both datasets show that the proposed approach not only improves survival prediction compared with the alternative analyses that ignore the pathway information, but also identifies significant biological pathways.

OVER the past three decades, remarkable improvement has been achieved in cancer treatment in the United States, with the annual death rate from cancer between 2002 and 2011 declining by 1.4% for women, 1.8% for men, and 2.3% for children aged 0–10 years (Edwards *et al.* 2014). However, the problem that has persisted in cancer treatment is the heterogeneity of prognostic prediction across patients (Barillot 2013). This heterogeneity is, for the most part, genetically determined and rooted in the molecular profile of patients. The Precision Medicine Initiative has been introduced by the White House to expand cancer genomics research as the goal to develop better prevention and treatment methods for cancers (Collins and Varmus 2015). Recent high-throughput technologies, which can easily and robustly generate large-scale molecular profiling data, including genomic, epi-genomic, transcriptomic, and proteomic markers, offer extraordinary opportunities to integrate clinical and genomic data into prediction models, improving our understanding of interindividual differences that may be critical for the application of precision medicine strategies (Barillot 2013; Collins and Varmus 2015).

The recent development of molecular signatures to predict recurrence of breast, colon, and prostate cancers is notable and clinically useful but may not be sufficient to achieve the goals of precision medicine (Mook *et al.* 2007; Pohl and Lenz 2008; Eng *et al.* 2013). Gene signatures across different studies with very few overlapped genes can have similar prediction results, implying that some underlying mechanisms may exist (van de Vijver *et al.* 2002; Wang *et al.* 2005; Sotiriou and Piccart 2007). According to the analysis of 24 pancreatic tumors by Jones *et al.* (2008), altered genes varied greatly across tumors but the pathways with the altered genes remain largely the same. It implied that one potential reason for the discrepancies may be that many genes are associated with outcomes in complex diseases, yet with a small individual contribution to marginal effect. Thus, for small sample sizes, many substantial but weak signals may be missed (Mootha *et al.* 2003). It has also been revealed that the genetic nature of cancer is pathway-based, that is, oncogenes can be grouped into pathways based on biological functions such as cell survival, proliferation, and metastatic dissemination (Barillot 2013; Huang *et al.* 2014). Therefore, by combining weak signals from a number of genes within each pathway could increase power in prediction and prognosis methods.

Various studies have thus focused on developing pathway-based methods for cancer prognosis prediction (Jones 2008; Jones *et al.* 2008; Lee *et al.* 2008; Reyal *et al.* 2008; Abraham *et al.* 2010; Teschendorff *et al.* 2010; Eng *et al.* 2013; Huang *et al.* 2014). These methods can be divided into two major categories. The first category focused on employing sophisticated statistical methods for variable selection with grouped predictors or pathways such as group lasso with an “all-in-all-out” idea, meaning that when one predictor in a group is chosen, then all variables in that group are chosen (Park *et al.* 2007; Wei and Li 2007; Jones 2008). The second category, on the other hand, reduces the data dimension by generating pathway risk scores to be used in downstream data analysis. Abraham *et al.* (2010) adopted a gene set statistic to provide stability of prognostic signatures instead of individual genes. Huang *et al.* (2014) converted the gene matrix to a pathway matrix through “principal curve,” similar to principal components analysis. Both of these methods did not incorporate outcome when generating the pathways scores from the individual genes. Eng *et al.* (2013) proposed a method to reduce the computational complexity by incorporating a binary outcome to stand for decreased or increased risk score in each pathway which inferred potentially loss of information.

In this article, we propose a two-stage procedure to incorporate pathway information into the prognostic models using large-scale gene expression data. In the first stage, we calculate the leave-one-out cross-validated (LOOCV) prognostic score as risk score for each pathway by fitting all predictors within each pathway using the penalized Cox model and Bayesian hierarchical Cox model. In the second stage, we combine the risk scores of all pathways obtained in the first stage as new predictors to build a super prediction model. We used the proposed method in both breast cancer and ovarian cancer projects from The Cancer Genome Atlas (TCGA) to predict overall survival using gene expression profiling.

Breast cancer is the second most commonly diagnosed malignancy after skin cancer in women (Huang *et al.* 2014). It is widely understood that breast cancer can be categorized into four clinical subtypes, Luminal A, Luminal B, Triple Negative/Basal-like, and Her2, and the survival/metastasis outcomes differ significantly among these four subtypes (Carey *et al.* 2006; O’Brien *et al.* 2010; Haque *et al.* 2012). However, it is increasingly being realized that using only the clinical subtypes cannot fully discriminate breast cancer patients, and that better prediction of prognosis is needed. Ovarian cancer is the leading cause of death from gynecologic malignancies in the western world. According to statistics from the American Cancer Society, the 5-year survival rate is 46.2% for ovarian cancer. However, the survival rates are discrepant among different cancer stages. The majority of patients are diagnosed with advanced stages, requiring a highly toxic 6–8 cycles of platinum-based chemotherapy (Pignata *et al.* 2011). To this end, because of inaccurate prediction of response with clinical measures, ∼30% of patients will be identified as chemo-resistant after undergoing multiple cycles of therapy with little or no benefit. Thus the need for better prediction of response and prognosis is under great pressure (Barakat *et al.* 2009). By applying our method to these two gene expression datasets from TCGA, we not only improve survival prediction compared with the alternative gene-based model that ignores the pathway information, but also identify significant biologically relevant pathways.

## Materials and Methods

### Data collection

We downloaded two datasets with clinical information, microarray mRNA gene expression for breast cancer and ovarian cancer projects from TCGA using *R* package *TCGA-Assembler* (Zhu *et al.* 2014). Overall survival is the outcome of interest for both datasets. We downloaded and analyzed the processed level 3 [log2 lowess normalized (cy5/cy3) collapsed by gene symbol] gene expression data for both datasets. It represents up-regulation or down-regulation of a gene compared to the reference population (Zhao *et al.* 2014).

### Data preprocessing and pathway analysis

For both datasets, samples with missing or zero overall survival time were removed. As the ratios of missing gene expression data were low, simple imputation with mean values across samples were adopted. To construct the pathways, we used genome annotation tools, KEGG (Kanehisa and Goto 2000), to map genes to pathways. We first mapped gene symbols to Entrez ids with *R/bioconductor* package *AnnotationDbi*, and then mapped all the probes to KEGG pathways using the Bioinformatics tool *DAVID* (Huang *et al.* 2009a,b).

### Cox proportional hazards models

Cox proportional hazards model is the commonly used method for analyzing censored survival data (van Houwelingen and Putter 2012), for which the hazard function of survival time takes the form:(1)where is the baseline hazard function, *X* and are the vectors of predictors and coefficients, respectively, and is the linear predictor or called the prognostic index. The coefficients are estimated by maximizing the partial log-likelihood:(2)where the censoring indicator *d _{i}* takes 1 if the observed survival time

*t*for individual

_{i}*i*is uncensored and 0 if it is censored, and is the risk set at time For molecular data, the number of coefficients is much larger than the number of individuals, and/or covariates are usually highly correlated, for which Cox regression is not directly applicable.

#### Penalized Cox Model (ridge, lasso, and elastic-net Cox models):

The elastic net is a widely used penalization approach to deal with high-dimensional models, which adds the elastic-net penalty to the log-likelihood function and estimates the parameters by maximizing the penalized log-likelihood (Zou and Hastie 2005; Hastie *et al.* 2009, 2015; Friedman *et al.* 2010; Simon *et al.* 2011). For the Cox model described above, we estimate the parameters by maximizing the penalized partial log-likelihood:(3)where () is a predetermined elastic-net parameter, () is a penalty parameter, and is the partial log-likelihood of the Cox model. The penalty parameter controls the overall strength of penalty and the size of the coefficients; for a small many coefficients can be large, and for a large many coefficients will be shrunk toward zero. The elastic net includes the lasso () and ridge Cox regression () as special cases (Tibshirani 1997; Gui and Li 2005; van Houwelingen *et al.* 2006; Simon *et al.* 2011; van Houwelingen and Putter 2012).

The ridge, lasso, and elastic-net Cox models can be fitted by the cyclic coordinate descent algorithm, which successively optimizes the penalized log-likelihood over each parameter with others fixed and cycles repeatedly until convergence. The cyclic coordinate descent algorithm has been implemented in the *R* package *glmnet*. Cross-validation is the most widely used method to select an optimal value (*e.g.*, an optimal Cox model) that gives minimum cross-validated error.

#### Bayesian hierarchical Cox model:

Hierarchical modeling is another efficient approach to handling high-dimensional data, where the regression coefficients are themselves modeled (Gelman and Hill 2007; Gelman *et al.* 2014). Hierarchical models are more easily interpreted and handled in the Bayesian framework, where the distribution of the coefficient is the prior distribution and statistical inference is based on the posterior estimation. The commonly used prior is the double-exponential (or Laplace) prior distribution (Park and Casella 2008; Yi and Xu 2008; Yi and Ma 2012):(4)where the inverse scale *s* is shrinkage parameter and controls the amount of shrinkage; a smaller scale *s* induces stronger shrinkage and thus forces the estimates of toward the prior mean zero. For the hierarchical Cox model with the double-exponential prior, the log posterior distribution of the parameters can be expressed as(5)By maximizing the log posterior distribution, we estimate the parameters by finding the posterior modes of the parameters. We have developed an algorithm for fitting the hierarchical Cox model by incorporating an expectation-maximization (EM) procedure into the usual Newton–Raphson algorithm for fitting classical Cox models. Our algorithm has been implemented in our *R* package *BhGLM* (http://www.ssg.uab.edu/bhglm/). The above hierarchical Cox model with is equivalent to the lasso, if one estimates the mode of the posterior distribution.

#### Optimizing the penalty parameter λ and the inverse scale s:

The penalized Cox models heavily depend on the penalty tuning parameter *λ*. The tuning parameter is usually chosen using the *K*-fold cross-validation procedure and estimated by maximizing the cross-validated partial likelihood (CVPL) (van Houwelingen *et al.* 2006; Simon *et al.* 2011):(6)where is the estimate of *β* from all the data except the *k*-th part of the data in *K*-fold cross-validation, is the partial likelihood of all the data points, and is the partial likelihood excluding the *k*-th part of the data. By subtracting the log-partial likelihood evaluated on the non-left-out data from that evaluated on the full data, we can make efficient use of the death times of the left-out data in relation to the death times of all the data. We choose the *λ* value which maximizes Based on the relationship between the lasso and the hierarchical Cox model, the inverse scale in the hierarchical Cox model is chosen to be

### Two-stage approach for pathway integration

Intuitively we can simultaneously analyze all the genes in a single model. In some previous studies, penalized Cox models have been used to analyze molecular data with this gene-based single model approach (Rappaport 2007; Bovelstad *et al.* 2009; Jacob 2009; Zhang *et al.* 2013; Yuan *et al.* 2014; Zhao *et al.* 2014). However, due to the high dimension of molecular data, fitting a model including all genes can lead to instability of the predictive model and may result in decreased prediction performance with the increased model complexity.

To overcome these problems, we propose a two-stage procedure to build a prediction model, inspired by the super learner of van der Laan *et al.* (2007) and van Houwelingen and Putter (2011). Suppose that genes are assigned into *G* pathways, with the *g-th* pathway *G _{g}* containing

*J*genes, and denote the vector of predictors in the

_{g}*g-th*pathway by

*X*. Overlapping is common in pathways analysis, that is, a gene could belong to multiple functional pathways. The two-stage procedure can easily deal with overlapping.

^{g}In the first stage, we separately analyze each pathway by fitting all the predictors within each pathway using the hierarchical Cox model or penalized Cox regression. For the *g-th* pathway, we fit the model: and can obtain the estimate of prognostic score, for each individual. However, directly using the prognostic score in the second stage can result in overfitting. To prevent overfitting, we estimate cross-validated prognostic scores using LOOCV. The LOOCV prognostic score for the *i-th* individual and the *g*-*th* pathway is calculated as follows; we first estimate the coefficients *β ^{g}* of the predictors

*X*using all other (

^{g}*n*– 1) individuals (

*i.e.*, excluding the

*i-th*individual). Denote the estimate by then the LOOCV prognostic score is(7)The purpose of using LOOCV prognostic scores instead of prognostic scores directly is to prevent overfitting in the super prediction model in the second stage. The LOOCV prognostic score for each patient is derived independently of the observed data of the patient, and hence the scores along with the observed data of the patients can essentially be treated as a “new dataset.” Therefore, this procedure can overcome overfitting (Tibshirani and Efron 2002; Hastie

*et al.*2015).

In the second stage, we combine the LOOCV prognostic scores of pathways with cross-validated C-index >0.5 obtained in the first stage as new predictors to build a super prediction model:(8)To prevent the super prediction model from being nonidentifiable and overfitting, we utilize the hierarchical Cox model approach, *i.e.*, assuming that the pathway effects, *α ^{g}*, follow the double-exponential prior as described in (4). The hierarchical Cox model with the EM algorithm can produce interval estimates and

*P*-values for the pathway effects,

*α*, and thus allow us to detect significant pathways. To evaluate the prognostic performance of the super prediction model, we carry out 10-fold cross-validation over 10 repeats. The two-stage procedure to build the pathway-structured predictive model is presented in Figure 1. The R source code for implementing the two-stage approach is provided in Supplemental Material, File S1.

^{g}### Evaluating the predictive performance

To assess the prognostic utility of the fitted model, we need to evaluate the quality of the fitted model and its predictive value. There are several ways to measure the performance of a Cox model (Steyerberg 2009; van Houwelingen and Putter 2012):

Concordance Index (C-index) (Harrell

*et al.*1996): the standard way to measure the concordance between the observed survival times and predicted survival times. The performance is better when the C-index is greater.CVPL: a general measurement of model quality and prediction as mentioned above as

Prediction Error: the most popular measure of prediction error is the Brier score, which is defined as (van Houwelingen and Putter 2012): where is the estimated survival probability of an individual beyond t

_{0}given the predictor*x*.Prevalidated Kaplan–Meier Analysis: We compute the cross-validated prognostic score for each patient for the super prediction model in the second stage. We then divide the individuals into two groups of high and low risks based on the median of the cross-validated prognostic scores. The samples with higher scores are in the high-risk group, while the samples with lower scores are in the low-risk group. We then get the Kaplan–Meier plot and log-rank test by comparing the two groups (Tibshirani and Efron 2002; Hastie

*et al.*2015). This provides a valid assessment of the predictive performance of the model (Tibshirani and Efron 2002; Hastie*et al.*2015).

### Comparison to gene-based model

To demonstrate the advantages of our approach, we compared our two-stage pathway-structured predictive model to the previously used gene-based model. For the gene-based model, we used the penalized Cox models and the hierarchical Cox model to simultaneously fit all the genes that are mapped into pathways as one single model. Tenfold cross-validation over 10 repeats was used to evaluate the predictive performance of the gene-based model. Then we compared the prediction performances of the pathway-structured predictive model and the gene-based model.

### Simulation study

We carried out a simulation study to evaluate the false-positive rate of the proposed two-stage approach. We simulated 10 groups (or called pathways) of 500 genes with 50 genes in each group. The sample size was set to be 500, and the correlation between genes within each group was 0.6. To assess the false-positive rate under the null, the coefficient *β _{j}* for each gene was set to be 0. Following Simon

*et al.*(2011), we generated “true” survival time

*T*for each individual from the exponential distribution: and then generated censoring time

_{i}*C*for each individual from the exponential distribution: where

_{i}*r*were randomly sampled from a standard normal distribution. The observed censored survival time

_{i}*t*was set to be the minimum of the “true” survival and censoring times, and the censoring indicator

_{i}*d*was set to be 1 if

_{i}*C*>

_{i}*T*and 0 otherwise.

_{i}The simulation was repeated 1000 times to evaluate the performance of the two-stage approach in controlling the false-positive rate. Our simulation results showed that the false-positive rate of our two-stage approach was ∼0.01 at the significance level of 0.05. This indicates that our two-stage approach well controls the false-positive rate.

### Data availability

The authors state that all data necessary for confirming the conclusions presented in the article are represented fully within the article.

## Results

### Data summary

For breast cancer data, 54 samples were removed for missing or zero overall survival time, and 17,815 features across 533 samples were profiled for gene expression, which included a total of 1571 missing observations. For our analysis, only 505 samples were kept for whom survival time and gene expression were both available. Among these 505 patients, only 65 were dead and thus the event rate was 12.9%. For ovarian cancer data, the final sample size was 538 with 17,814 features of gene expression profiled, which included a total of 746 missing observations. Among these 538 patients, 278 were dead and thus the event rate was 51.7%.

Simple imputation with mean values across samples was adopted to fill the missing gene expression values for both cancer datasets. In pathway analysis, we mapped gene symbols to Entrez ids with R/Bioconductor package AnnotationDbi.

For breast cancer, 3181 probes were mapped to 109 pathways. For ovarian cancer, 4887 probes were mapped to 193 pathways with at least five genes in each pathway. Therefore, we used the expression data of 3181 and 4887 genes to predict overall survival for breast and ovarian cancers, respectively.

### Building pathway-structured predictive model with two-stage approach

In the first stage, we calculated the LOOCV prognostic score for each pathway and each patient by fitting all the genes in that pathway. The procedure was then repeated for all the pathways. For breast cancer data, the predictive model was built with ridge, lasso, and elastic-net Cox models and hierarchical Cox model. For ovarian cancer data, the predictive model was built with the lasso Cox model and hierarchical Cox model. For each pathway, the tuning parameters in the penalized Cox models were estimated by 10-fold cross-validation over 10 repeats. For the inverse scale *s* in the hierarchical Cox model, we set *s* = 1/(*nλ*) based on the relationship between the lasso and hierarchical Cox model. For breast cancer data, we tested three inverse scales: *s* = 1/(*nλ*), 1/(*nλ*) + 0.03, 0.08. For ovarian cancer data, we only used *s* = 1/(*nλ*) for the hierarchical Cox model.

In the second stage, we used pathways with cross-validated C-index >0.5 obtained in the first stage to build a super prognostic model. A super prognostic model was built with the LOOCV prognostic scores obtained in the first stage as new predictors with the hierarchical Cox model approach. Tenfold cross-validation over 10 repeats was carried out to validate the super prognostic models. To select the prior scale for the hierarchical Cox model, we calculated the CVPL from 10-fold cross-validation for different prior scales, *s* = 0.08, 0.10, 0.12, 0.14, 0.16, 0.18, 0.20, and 0.22, and then the scale with the highest CVPL was chosen.

For breast cancer, Table 1 shows the number of pathways with C-index >0.5, CVPL, and C-index for all the two-stage models. It can be seen that the hierarchical Cox models with *s* = 1/(*nλ*) or 1/(*nλ*) + 0.03 generated a larger CVPL and C-index, and thus had better prediction performance than the other approaches. For ovarian cancer data, CVPL were −1652.928 (3.753) and −1650.489 (2.599), and C-index were 0.718 (0.003) and 0.722 (0.004), for the two-stage hierarchical–hierarchical Cox model with *s* = 1/(*nλ*) and two-stage lasso–hierarchical Cox model, respectively.

### Pathway-structured predictive model superior to gene-based model in prediction performance

To compare the two-stage approach with the single model analysis, we simultaneously fit all the genes that were mapped into pathways using the lasso and the hierarchical Cox model. The CVPL and C-index for the joint analyses are presented in Table 2. For breast cancer data, CVPL were −364.845 (0.949) and −363.554 (0.626), and C-index were 0.507 (0.023) and 0.572 (0.023), for the joint lasso and joint hierarchical Cox models, respectively. For ovarian cancer data, CVPL were −1716.385 (1.565) and −1714.236 (0.631), and C-index were 0.541 (0.011) and 0.573 (0.006), for the joint lasso and joint hierarchical Cox models, respectively. Our results of the joint analyses were similar to those of Zhao *et al.* (2014) and Yuan *et al.* (2014). Therefore, the two-stage approach provided much larger CVPL and lower C-index than the gene-based models, and thus significantly outperformed the gene-based models for both datasets.

The predictive performances of all the models were also assessed by the prediction error. Figure 2 and Figure 3 present the Brier prediction error curves for breast and ovarian cancer data, respectively. For breast cancer data, the pathway-structured predictive models consistently performed better than the joint lasso. Among all the pathway-structured predictive models, the two-stage ridge–hierarchical Cox model had the highest prediction error. The two-stage hierarchical–hierarchical Cox models with *s* = 1/(*nλ*) or 1/(*nλ*) + 0.03 and the two-stage lasso–hierarchical Cox model had lower prediction error than the best fitted pathway. For ovarian cancer data, all the pathway-structured predictive models consistently performed better than the joint lasso and the best-fitted pathway.

### Risk group stratification

We also demonstrated that our pathway-structured predictive models were superior to the gene-based models by evaluating their ability in stratifying samples. For both breast cancer and ovarian cancer datasets, the Kaplan–Meier curves were drawn for the low-risk and high-risk groups to compare among the joint lasso, joint hierarchical Cox model, two-stage lasso–hierarchical (L-H) model, and two-stage hierarchical–hierarchical (H-H) model. Log-rank tests were carried out for each model. The Kaplan–Meier curves and *P*-values of log-rank tests are shown in Figure 4 and Figure 5 for breast and ovarian cancer datasets, respectively. For both breast and ovarian cancer datasets, our pathway-structured predictive models generated significantly larger differences in Kaplan–Meier curves between the low-risk group and high-risk group (*P*-value = 4.69e−11, *P*-value = 5.81e−11, *P*-value < 0.001, *P*-value < 0.001, respectively). Both gene-based models did not show significant differences of Kaplan–Meier curves between two groups (*P*-value = 0.242, *P*-value = 0.231, *P*-value = 0.0504, *P*-value = 0.0539, respectively).

### Biological relevance of associated pathways

We evaluated the ability of the pathway-structured predictive models to identify prognostic cancer relevant pathways by comparing them with previously published results. Figure 6 and Figure 7 show the estimated effects and *P*-values of the identified significant pathways for the two-stage lasso–hierarchical Cox model and the two-stage hierarchical–hierarchical Cox model with *s* = 1/(*nλ*) for breast and ovarian cancer datasets, respectively. For breast cancer, we compared all the detected significant pathways with 15 core cancer pathways discussed in Eng *et al.* (2013) and found that 5 out of 15 were consistent for the two-stage hierarchical–hierarchical Cox model. Besides that, metabolic pathways such as *Pyrimidine metabolism*, *Glycerolipid metabolism*, and *Metabolism of xenobiotics by cytochrome P450* have been identified in both the two-stage lasso–hierarchical Cox model and two-stage hierarchical–hierarchical Cox model. They have been reported as important pathways in regulating breast cancer in previous literature (Murray *et al.* 1993; Schramm *et al.* 2010; Merdad *et al.* 2015). Moreover, *Ribosome* and *Proteasome* are functional as protein builders and degraders, respectively, which also play an essential role in breast cancer. For ovarian cancer, several significant pathways are also consistent with those 15 core cancer pathways discussed in Eng *et al.* (2013). Besides that, metabolic pathways such as *Nicotinate and nicotinamide metabolism* and *fatty acid metabolism* have been identified in the two-stage hierarchical–hierarchical Cox model. They have been discussed as important pathways in regulating ovarian cancer in previous literature (Tania *et al.* 2010; Vermeersch *et al.* 2014).

### Incorporating clinical factors with pathway information to predict cancer survival

The most widely used method for integrating clinical factors and gene expression data is to include all the predictors in a single model. However, previous studies rarely found meaningful improvement in terms of prediction performance compared to models with only traditional clinical factors or a single type of molecular data. There are many drawbacks with these previous methods: (1) they create a huge model with too many predictors which cannot be easily handled and thus can reduce the prediction accuracy; (2) models including different types of predictors can result in different coefficient estimates compared with the models with only clinical or molecular predictors; and (3) the predictive values of clinical and molecular factors obtained from such models cannot be easily interpreted, and cannot easily be applied to other datasets.

To overcome the limitations stated above, we incorporated clinical factors with pathway predictors in the second stage of the two-stage approach. Compared to the methods incorporating clinical factors and molecular data directly, this approach reduces the instability of the model drastically. Besides, as we could apply different prior scales for hierarchical Cox regression to clinical factors and pathways, the contribution of clinical factors in prediction will be preserved in our predictive model. We applied this approach in both the TCGA breast and ovarian cancer datasets to incorporate age of patients at diagnosis and tumor stages with pathway matrix information to predict cancer survival. Our results showed that the prediction performance can be improved after combining clinical factors with pathway information. The results are presented in detail in File S2.

## Discussion

### Pathway-structured two-stage predictive modeling

The heterogeneity of prognostic prediction in cancers has been a persistent problem for decades (Barillot 2013). It is now realized that cancer is a disease of the genome and can be understood by identifying the abnormal genes and proteins that are associated with the risk of developing cancer. Some statistical methods have been used to analyze genomic data with a gene-based approach to search for gene signature and to predict prognosis (Rappaport 2007; Bovelstad *et al.* 2009; Jacob 2009; Zhang *et al.* 2013; Yuan *et al.* 2014; Zhao *et al.* 2014). Due to the complicated genetic nature of cancer and potentially underpowered statistical analysis of a gene-based approach, it has been suggested that the complexity of cancer should be handled based on pathway-centric instead of gene-centric perspectives (Jones 2008). Oncogenes and tumor suppressor genes have been well studied and can be arranged into signaling pathways according to their biological functions such as cell survival, proliferation, and metastatic dissemination. Other studies have investigated methods to analyze high-throughput cancer genomics data based on functional units, *i.e.*, pathways (Goeman and Buhlmann 2007; Lee *et al.* 2008; Reyal *et al.* 2008; Abraham *et al.* 2010; Teschendorff *et al.* 2010).

Our two-stage approach was developed to incorporate the functional structure of pathways to predict survival for cancer patients. The proposed method has two outstanding features in reducing the large-scale molecular matrix to a predictable pathway-based matrix: (i) it incorporates the correlation with survival information in calculating risk scores for each pathway; and (ii) we use LOOCV prognostic score as risk score, which not only prevents overfitting and can be easily carried out, but also gives an unbiased view on the contribution of the different information from pathways to the super prediction model. Meanwhile, pathway-structured predictive models perform consistently better than the gene-based models using lasso or hierarchical Cox models in terms of C-index, CVPL, and reduced prediction error in two cancer datasets. Our pathway-structured predictive models also show remarkable performance in discriminating the prognostic effects between different patients compared with gene-based methods based on the Kaplan–Meier analysis results. It is noteworthy that different penalized Cox models and hierarchical Cox models with double-exponential prior have been compared in generating pathway-based matrix, which suggests that lasso and hierarchical Cox models are more stable in preserving the pathway information for super prediction models.

Our approach is capable of identifying important pathways. Among those pathways identified for both breast and ovarian cancers, we have identified in total 7 out of 15 core cancer pathways. *Mitogen-activated protein kinase* (*MAPK*) pathways are important in controlling fundamental cellular processes, *i.e.*, growth, proliferation, differentiation, migration, and apoptosis (Dhillon *et al.* 2007). When abnormally activated, *MAPK* pathways can lead to the progression of cancer (Ussar and Voss 2004; McCubrey *et al.* 2007). Another pathway, *the mammalian target of rapamycin* (*mTOR*), also plays an essential role in the regulation of cell proliferation, growth, differentiation, migration and survival. Similar to *MAPK* pathways, the dysregulation of *mTOR signaling* happens in various human tumors, resulting in higher susceptibility to inhibitors of *mTOR* (Huang and Houghton 2003). The *Hedgehog* pathway regulates many fundamental processes including stem cell maintenance, cell differentiation, tissue polarity, and cell proliferation. It has been demonstrated that inappropriate activation of the *Hedgehog* pathway occurs in various cancers such as brain, gastrointestinal, lung, breast, and prostate cancer (Gupta *et al.* 2010). The *JAK-STAT* pathway is also identified by our approach. This pathway regulates various cellular processes such as stem cell maintenance, apoptosis, and the inflammatory response and was found to be frequently dysregulated in diverse types of cancer (Thomas *et al.* 2015). Furthermore, the *cell cycle* pathway and *cell adhesion molecules* pathway have also been reported to play an essential role in cancer progression and the potential to find cancer therapy (Okegawa *et al.* 2004).

However, there are some issues that need to be addressed in the future based on our results. We implemented our approach in microarray gene expression data from two separate TCGA breast and ovarian cancer projects. There are data from other platforms, such as RNA-Seq data. Our model can be directly adopted in RNA-Seq data by applying additional constraints in normalization of the RNA-Seq data. The second issue is the potential loss of gene information, as only 18–30% of genes are mapped into pathways in both datasets. One plausible solution is to calculate a risk score for the unmapped genes as one additional group, yet requiring prefiltering with univariate analysis approach or variance-filter to reduce the number of unmapped genes to a certain feasible level. Meanwhile, our method can easily incorporate clinical factors with the pathway information, yielding great potential for application in clinical research. In the future, we hope to apply our approach to other levels of genomic data, *e.g.*, DNA methylation, miRNA, and copy number alterations, for more than 30 types of cancer, by combining clinical biomarkers into our pathway-structured predictive model to better predict cancer survival in clinical research.

### Key points

Current development of molecular signatures to predict for breast, colon, and prostate cancers is notable but may not be sufficient to achieve the goal of precision medicine. It has been revealed that the genetic nature of cancer is pathway based, that is, oncogenes can be grouped into pathways based on biological functions.

Current methods have shortcomings in incorporating pathway information into predictive modeling. We proposed a two-stage procedure to incorporate pathway information into the predictive modeling using large-scale gene expression data and applied the proposed method to analyze two independent breast and ovarian cancer datasets from TCGA project for predicting overall survival.

The results show that the proposed approach not only improves survival prediction compared with the alternative gene-based methods that ignore the pathway information, but also identifies significant biological pathways.

The approach can be extended to data from other platforms, such as RNA-Seq data or other molecular levels of data including DNA methylation, miRNA, and copy number alterations, for various types of cancer.

## Acknowledgments

We thank two reviewers and the associate editor for their constructive suggestions and comments that have improved the manuscript. This work was supported in part by the following research grants: National Institutes of Health (NIH) R01-GM069430, NIH U01-CA158428, NIH R03-DE024198, NIH R03-DE025646, R01-CA133093, R01-ES016354, the Alabama Innovation Fund, Avon Foundation grant 02-2014-030, and V Foundation Scholar Award V2015-009.

## Footnotes

Supplemental material is available online at www.genetics.org/lookup/suppl/doi:10.1534/genetics.116.189191/-/DC1.

*Communicating editor: W. Valdar*

- Received March 14, 2016.
- Accepted October 31, 2016.

- Copyright © 2017 by the Genetics Society of America