PINCAGE - Probabilistic INTegration of CAncer GEnomics data

Overview

PINCAGE is a method for probabilistic integration of cancer genomics data. It jointly evaluates RNA-seq gene expression and 450K array DNA methylation measurements of promoters as well as gene bodies. The method learns the specific relationships between the data types and exploits these for biomarker discovery and classification of new samples. It also explicitly models the uncertainty of both the count-based NGS data and the continuous array measurements. This approach is specifically tailored for cancer studies, where much heterogeneity is observed among tumours. The method combines the graphical model formalism with non-parametric specification of probability distributions to capture the highly context-specific relationships between methylation patterns and gene expression.

Instructions

PINCAGE is implemented in R (3.0.1+) and is currently available as scripts that handle data preparation, model training and evaluation. It relies on an external binary from the “phy” C++ library. We provide a statically compiled binary, dfgEval_static, built on Ubuntu 12.04.5 LTS (GNU/Linux 3.2.0-75-generic x86_64). If your system cannot run that static binary, compile the phy library as described at http://github.com/jakob-skou-pedersen/phy/ and use the resulting ./phy/bin/dfgEval instead. In that case, rename the binary to dfgEval_static and place it in the directory containing the R scripts and data. The scripts also require the R packages aws and smoothie. To re-run the simulated data analysis, you additionally need the MASS package.
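Before running anything, it can help to confirm that the prerequisites listed above are in place. The following is a minimal sketch, not part of the PINCAGE scripts: the function names is_installed and check_pincage_setup are hypothetical helpers, and the check assumes the binary sits in the directory you pass in.

```r
# Packages named in the instructions; MASS is only needed for the
# simulated-data analysis.
required_pkgs <- c("aws", "smoothie")
optional_pkgs <- c("MASS")

# TRUE/FALSE per package, named by package (hypothetical helper).
is_installed <- function(pkgs) {
  vapply(pkgs, function(p) requireNamespace(p, quietly = TRUE), logical(1))
}

# Report missing packages and whether dfgEval_static is present in `dir`
# (hypothetical helper; `dir` is the directory holding the R scripts).
check_pincage_setup <- function(dir = ".") {
  req <- is_installed(required_pkgs)
  opt <- is_installed(optional_pkgs)
  binary <- file.path(dir, "dfgEval_static")
  list(
    missing_required  = names(req)[!req],
    missing_optional  = names(opt)[!opt],
    binary_found      = file.exists(binary),
    # file.access() returns 0 when the requested permission is granted
    binary_executable = unname(file.access(binary, mode = 1) == 0)
  )
}

status <- check_pincage_setup(".")
print(status)
```

Any package reported as missing can be installed with install.packages() before the scripts are run.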

Data download

We demonstrate PINCAGE on the Breast Invasive Carcinoma (BRCA) dataset from The Cancer Genome Atlas. The processed data files are split by analysis: normal vs tumour and progressing vs non-progressing. Download these .RData files to run the model; they contain the processed data in the formats expected by the scripts.

To run the scripts below, enter in your prompt:
$ Rscript scriptName ID_start ID_end

The scripts are set up for running on HPC systems and therefore identify genes by numerical IDs. Each script processes the IDs consecutively from ID_start to ID_end. To run a single gene, set ID_start and ID_end to the same number. To see the list of 17728 genes that were suitable for PINCAGE modelling, load any of the provided .RData files and inspect the gene names in the workingList_BRCA vector.
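As an illustration of the ID-range convention, the sketch below splits the 17728 gene IDs into consecutive blocks and builds one Rscript command per block, for example for submission as separate HPC jobs. The helper names make_job_commands and gene_id and the block size of 500 are assumptions made for this example. The lookup further assumes that a gene's numerical ID is its position in workingList_BRCA and that the vector holds the gene names as its values, which you should verify against the loaded .RData file.

```r
# Build one "Rscript script ID_start ID_end" command per block of IDs.
# ID_start and ID_end are inclusive (hypothetical helper).
make_job_commands <- function(script, n_genes = 17728, block_size = 500) {
  starts <- as.integer(seq(1, n_genes, by = block_size))
  ends   <- as.integer(pmin(starts + block_size - 1, n_genes))
  sprintf("Rscript %s %d %d", script, starts, ends)
}

cmds <- make_job_commands("PINCAGE-integrative_eval.R")
print(head(cmds, 3))
print(tail(cmds, 1))

# Look up the numerical ID of a gene by name (hypothetical helper).
# ASSUMPTION: the ID is the gene's position in the name vector.
gene_id <- function(gene, gene_names) match(gene, gene_names)

# Toy vector standing in for workingList_BRCA.
toy_names <- c("GENE_A", "GENE_B", "GENE_C")
print(gene_id("GENE_B", toy_names))
```

With the real data, load one of the provided .RData files first and pass workingList_BRCA as the second argument to gene_id.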

Scripts for BRCA normal vs tumour comparison

The BRCA tumour (G2) vs normal (G1) analysis was performed on 2/3 of the data set using either the single data type sub-models or the integrative model. Additionally, predictions were made with the integrative PINCAGE model on the remaining BRCA tumours and normals. The following scripts are available:

PINCAGE-expression_eval.R: evaluation using gene expression PINCAGE sub-model

PINCAGE-geneBody_eval.R: evaluation using gene body methylation PINCAGE sub-model

PINCAGE-promoter_eval.R: evaluation using promoter methylation PINCAGE sub-model

PINCAGE-integrative_eval.R: evaluation using integrative PINCAGE model. This script also produces model files for use with the prediction script:

PINCAGE-integrative_predict.R: produces an output with posterior probabilities for the left-out 1/3 of the BRCA dataset.

 

The following is an example output which contains 4 lines:

$ head 1.result

2.18940699622378e-307 60.8660000000091 1.85376000000571 1.5754059866035 37.458433255818

0 1.26170400928105 1.76036812802714 2.25370085162364 2.73682905236275 3.16286316599919 3.64057425338269 4.04940757249639 4.4294496481436 4.86153688417511 5.32437204069118 5.82180088987547 6.31358459692797 6.80504171869964 7.36508705450302 8.06933729509091 8.89610027687383 9.63792988608681 10.3480857295364 11.0540271381453 12.2252022238576 14.3937423746706 16.476045600892 19.4155215191607 28.1912320879913 4980.8466724282

-7.01 -6.02 -5.52 -5.175 -4.9 -4.655 -4.43 -4.2 -3.915 -3.315 -0.32 0.77 1.29 1.655 1.93 2.15 2.345 2.515 2.665 2.805 2.935 3.07 3.215 3.385 3.61 7.01

-7.01 0.13 0.61 0.95 1.23 1.48 1.715 1.935 2.15 2.35 2.545 2.75 2.965 3.165 3.36 3.535 3.68 3.815 3.945 4.08 4.22 4.375 4.56 4.795 5.13 7.01

 

The first line holds the test results for the gene. Its first number is the Z-test p-value. The second number is the value of the D statistic (from the Likelihood Ratio Test). The third and fourth numbers are the mean and standard deviation of the random expectation of the D statistic. The fifth number is the Z-score for the given gene. The remaining lines give the discretization boundaries for normalized gene expression (2nd line), gene body methylation (3rd line) and promoter methylation (4th line). Additionally, the script produces a .tar file containing the trained model specification, which can be used with the PINCAGE-integrative_predict.R script. The prediction script takes the tar file, unpacks it and predicts for the remaining 1/3 of the BRCA dataset, producing a table with sample IDs, tumour posterior probabilities (posterior_G2) and negative log-likelihoods (under the G1 and G2 models):
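The numbers on the first line of the example 1.result are related in a simple way, which the sketch below makes explicit. It recomputes the Z-score as (D - mean) / sd and the p-value as the one-sided upper tail of a standard normal distribution. The one-sided form is an assumption inferred from the example values, not something stated by the scripts, and the field names are chosen here for readability.

```r
# First line of the example 1.result shown above.
first_line <- paste("2.18940699622378e-307 60.8660000000091",
                    "1.85376000000571 1.5754059866035 37.458433255818")

vals <- as.numeric(strsplit(first_line, " +")[[1]])
# Field names are illustrative labels, not names used by PINCAGE.
names(vals) <- c("p_value", "D", "D_null_mean", "D_null_sd", "Z")

# Z-score: how many standard deviations D lies above its random expectation.
z <- (vals[["D"]] - vals[["D_null_mean"]]) / vals[["D_null_sd"]]

# ASSUMPTION: one-sided (upper-tail) Z-test.
p <- pnorm(z, lower.tail = FALSE)

print(c(recomputed_Z = z, reported_Z = vals[["Z"]]))
print(c(recomputed_p = p, reported_p = vals[["p_value"]]))
```

For a real output file, the same vector can be obtained with readLines("1.result", n = 1) in place of the hard-coded string.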

$ head 1.predicted

sample_ID posterior_G2 G2_mloglik G1_mloglik

TCGA-A7-A0D9-11A 0.310025518872388 108.23 107.43

TCGA-A7-A0DB-11A 0.294422667408569 100.18 99.306

TCGA-AC-A2FB-11A 0.394126331568238 128.65 128.22

TCGA-AC-A2FM-11B 0.725119497789823 131.11 132.08

TCGA-BH-A0AU-11A 0.455121107626422 111.05 110.87

TCGA-BH-A0AY-11A 0.413382421082671 120.78 120.43

TCGA-BH-A0B3-11B 0.495000166660001 122.21 122.19

TCGA-BH-A0B8-11A 0.293177778906433 109.88 109

TCGA-BH-A0BA-11A 0.447692090425673 133.68 133.47
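The posterior_G2 column can be reproduced from the two negative log-likelihood columns. The sketch below assumes equal prior probabilities for G1 and G2. That assumption is inferred from the example rows above, where it reproduces the listed posteriors, and is not taken from the script source. The function name posterior_g2 is hypothetical.

```r
# Posterior probability of G2 from negative log-likelihoods (hypothetical helper).
# ASSUMPTION: prior_g2 = 0.5 (equal priors), inferred from the example table.
posterior_g2 <- function(g1_mloglik, g2_mloglik, prior_g2 = 0.5) {
  # log posterior odds of G2 vs G1 = log prior odds + log-likelihood difference;
  # the log-likelihood of a model is minus its negative log-likelihood.
  log_odds <- log(prior_g2 / (1 - prior_g2)) + (g1_mloglik - g2_mloglik)
  plogis(log_odds)  # logistic function: 1 / (1 + exp(-log_odds))
}

# Three rows from the example 1.predicted (note the column order G2, G1).
g2 <- c(108.23, 131.11, 122.21)
g1 <- c(107.43, 132.08, 122.19)
reported <- c(0.310025518872388, 0.725119497789823, 0.495000166660001)

recomputed <- posterior_g2(g1, g2)
print(cbind(reported, recomputed))
```

A posterior above 0.5 means the sample is more likely under the tumour (G2) model than under the normal (G1) model.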

Scripts for BRCA progressing vs non-progressing comparison

The analysis of 14 progressing (G1) vs 57 non-progressing (G2) BRCA tumours was performed using the integrative model in a cross-validation fashion. The provided script therefore performs the 14-fold evaluation and validation in a single run. The following script is available:

PINCAGE-integrative_evalANDpredict_LOU.R

 

The example output contains the gene Z-scores from each fold of the validation procedure in the first line, followed by a table with the negative log-likelihoods of each sample under the cross-validation G1 and G2 models:

$ head 1.result

1.59 1.44 1.58 2.60 2.20 0.98 1.30 1.87 1.02 2.12 2.56 1.43 1.62 4.22

sample_ID G1_mloglik G2_mloglik

TCGA-A7-A3RF 108.63 108.73

TCGA-A7-A425 94.815 94.879

TCGA-LL-A5YM 106.13 106.18

TCGA-E9-A243 106.07 105.98

TCGA-A7-A13G 92.953 92.995

TCGA-A7-A26H 98.799 98.934

TCGA-LQ-A4E4 105.64 105.69

TCGA-A7-A13H 88.738 88.796

 

To repeat the full progression set analysis, run the above script for all 17728 genes and join the respective sample likelihoods according to the gene ranking by Z-score.
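As a starting point for such post-processing, the sketch below reads one result file of the progression analysis into its two parts: the per-fold Z-scores and the per-sample table. The function name parse_lou_result and the added G2_minus_G1 column are illustrative assumptions. How the per-gene tables are then combined across genes is not specified here and is left out.

```r
# Lines of an example result file (a subset of the rows shown above).
result_text <- c(
  "1.59 1.44 1.58 2.60 2.20 0.98 1.30 1.87 1.02 2.12 2.56 1.43 1.62 4.22",
  "sample_ID G1_mloglik G2_mloglik",
  "TCGA-A7-A3RF 108.63 108.73",
  "TCGA-A7-A425 94.815 94.879",
  "TCGA-E9-A243 106.07 105.98"
)

# Split a result file into fold Z-scores and the sample table (hypothetical helper).
parse_lou_result <- function(lines) {
  lines <- lines[nzchar(trimws(lines))]  # drop blank lines
  z <- as.numeric(strsplit(trimws(lines[1]), "\\s+")[[1]])
  tab <- read.table(text = lines[-1], header = TRUE, stringsAsFactors = FALSE)
  # A lower negative log-likelihood means a better fit, so a positive
  # difference here means the sample fits the G1 model better than G2.
  tab$G2_minus_G1 <- tab$G2_mloglik - tab$G1_mloglik
  list(z_scores = z, samples = tab)
}

res <- parse_lou_result(result_text)
print(res$z_scores)
print(res$samples)
```

For a file on disk, pass readLines("1.result") to parse_lou_result instead of the hard-coded lines.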

Scripts for data simulation and analysis

We also evaluated PINCAGE and its sub-models on simulated datasets. We provide the following scripts for simulating the analysed data sets:

sim_dilution_true.R: simulates true positive genes of the tumour heterogeneity set

sim_dilution_false.R: simulates true negative genes of the tumour heterogeneity set

sim_deltaCorelation_true.R: simulates true positive genes of the delta correlation set

sim_deltaCorelation_false.R: simulates true negative genes of the delta correlation set

 

The corresponding sub-model and integrative PINCAGE scripts for analysing the simulated data sets are:

PINCAGE-expression_eval_simData.R: evaluation of the simulated gene expression data only, using the respective PINCAGE sub-model

PINCAGE-geneBody_eval_simData.R: evaluation of the simulated gene body methylation data only, using the respective PINCAGE sub-model

PINCAGE-promoter_eval_simData.R: evaluation of the simulated promoter methylation data only, using the respective PINCAGE sub-model

PINCAGE-integrative_eval_simData.R: evaluation of all data types using the integrative PINCAGE model

Contact

Should you have questions, please contact: michal.switnicki [at] gmail.com

Reference

PINCAGE has been described in the following manuscript:

Michał P. Świtnicki, Malene Juul, Tobias Madsen, Karina D. Sørensen, Jakob S. Pedersen. PINCAGE: Probabilistic integration of cancer genomics data for perturbed gene identification and sample classification. (Submitted.)