Essentiality with continuous expression: Breast CCND1 expression example

DOWNLOAD R CODE FILE FOR THIS TUTORIAL

We previously worked through a basic 2 class differential essentiality analysis using cell line HER2+ status as an example. Here we see how to predict genes whose essentiality is associated (positively or negatively) with a continuous variable, such as RNA-Seq expression of a gene. We’ll use the expression of well-known Breast oncogene CCND1 as a case study.

First DOWNLOAD SIMEM R CODE zip bundle containing the R code files loaded below.

Also note that, while the code below uses relative paths, you should edit these to point to where you’ve stored the code, annotation or data files.

### To install required packages, uncomment and run this
# source("http://www.bioconductor.org/biocLite.R")
# biocLite(c("Biobase", "preprocessCore", "genefilter"))
# install.packages(c("blme", "doParallel", "ggplot2", "locfit", "MASS", "plyr", "reshape"))

suppressPackageStartupMessages(library("Biobase"))
suppressPackageStartupMessages(library("blme"))
suppressPackageStartupMessages(library("doParallel"))
suppressPackageStartupMessages(library("genefilter"))
suppressPackageStartupMessages(library("ggplot2"))
suppressPackageStartupMessages(library("locfit"))
suppressPackageStartupMessages(library("MASS"))
suppressPackageStartupMessages(library("plyr"))
suppressPackageStartupMessages(library("preprocessCore"))
suppressPackageStartupMessages(library("reshape"))


source("../../R/data_format_lib.R")
source("../../R/model_lib.R")
source("../../R/simem_lib.R")

The screen data formatted as an ExpressionSet, including measurement weights, can be downloaded HERE.

load("../../data/shrna/breast_screens_with_weights.eset")
#breast_screens

The reagent annotation table we’re about to load, including the weight column which will be used to exclude some reagents, can be downloaded HERE.

hp = read.delim("../../data/annotations/hairpin_annotations.txt", header=T, as.is=T, check.names=F)
hpWeights = hp[,c("trcn_id", "gene_id", "weight")]

The breast RNA-Seq expression file, from which we’ll extract CCND1 log2-FPKM expression values, can be downloaded HERE.

options(width=100)

rnaseq = read.delim("../../data/rnaseq/breast_rnaseq_qn.txt", header=T, as.is=T, check.names=F)
# Get the row index for CCND1 expression
ccnd1Index = which(rnaseq$symbol == "CCND1")
# Remove all identifier columns, and drop the columns dimension, resulting in a cell-line-named vector of log2-FPKM values
ccnd1Expr = unlist(rnaseq[ccnd1Index,-grep("gene_id|symbol|ensembl_id", colnames(rnaseq))])
ccnd1Expr

##       184a1       184b5      600mpe       au565        bt20       bt474       bt483       bt549 
##       4.845       5.481       7.756       6.255       4.346       5.855       6.853       2.300 
##      cal120      cal148       cal51      cal851       cama1      du4475       efm19     efm192a 
##       6.601      -1.627       6.540       5.338       7.981       2.870       7.333       7.039 
##       evsat      hbl100     hcc1008     hcc1143     hcc1187     hcc1395     hcc1419     hcc1428 
##       3.654       1.670       6.476       9.054       5.051       5.409       7.099       7.352 
##     hcc1500     hcc1569     hcc1599     hcc1806     hcc1937     hcc1954      hcc202     hcc2185 
##       9.495      -1.764       7.520       6.337       6.712       8.416       4.122       6.823 
##     hcc2218     hcc2688     hcc3153       hcc38       hcc70      hcc712       hdqp1      hs578t 
##       6.965       4.411       3.626       5.846       5.333       8.130       7.402       6.433 
##       jimt1        kpl1         ly2      macls2       mb157      mcf10a      mcf12a        mcf7 
##       6.435       7.021       7.290       6.288       7.873       4.964       4.317       8.044 
##  mdamb134vi    mdamb157 mdamb175vii    mdamb231    mdamb330    mdamb361    mdamb415    mdamb436 
##      10.783       4.955       8.378       6.927      10.540       6.868       8.606       5.258 
##    mdamb453    mdamb468      mfm223         mx1       ocubm       skbr3       skbr5       skbr7 
##       9.097       5.966       7.336       4.108       2.271       5.702       6.838       4.382 
##      sum102     sum1315      sum149      sum159      sum185      sum190      sum225      sum229 
##       5.812       4.665       4.266       6.623       5.341       7.542       6.844       3.509 
##       sum44       sum52       sw527        t47d    uacc3199     uacc812     uacc893       zr751 
##       6.577       5.705       7.586       6.412       5.646       5.615       7.171       7.696 
##      zr7530       zr75b 
##       7.897       7.703

Once we have the CCND1 log2-FPKM expression values, we want to construct a data frame with 2 columns: cell_line as the 1st column, and a second column to contain the expression values, which we’ll creatively name ccnd1. We will then specify the covariate = "ccnd1" and annotationsPerCellLine = ccnd1Values parameters in the simem() function.

ccnd1Values = cbind.data.frame(cell_line=names(ccnd1Expr),
                               ccnd1=ccnd1Expr,
                               stringsAsFactors=FALSE)
ccnd1Values[1:5,]

##        cell_line ccnd1
## 184a1      184a1 4.845
## 184b5      184b5 5.481
## 600mpe    600mpe 7.756
## au565      au565 6.255
## bt20        bt20 4.346

For the purposes of this example, to reduce computation time, we’ll perform the analysis for CCND1, which is expected, based on the breast cancer literature, to be more essential in cell lines with higher CCND1 expression. We’ll also examine CDK4 and CDK6 essentiality, since these are known to directly form protein-protein complexes with CCND1 and are known to be required for CCND1’s cell cycle functions in different contexts.

genesOfInterest = c(595,   #CCND1
                    1019,  #CDK4
                    1021)  #CDK6

If you want to perform a genome-wide analysis, simply omit the geneIds = genesOfInterest parameter. As noted, this analysis typically takes ~18 hours but can be greatly reduced using the parallelNodes parameter on systems with multiple processor cores.

If we specify an analysis using both precision and signal-noise measurement weights (inverseVarianceWeights = TRUE and signalProbWeights = TRUE parameters, respectively), we must ensure that we’ve added these weights to the ExpressionSet beforehand (DETAILED HERE).

We’re now ready to predict genes whose essentiality is significantly associated with CCND1 log2-FPKM values.

results = simem(screens = breast_screens,
                geneIds = genesOfInterest,
                covariate = "ccnd1",
                reagentWeights = hpWeights,
                annotationsPerCellLine = ccnd1Values,
                inverseVarianceWeights = TRUE,
                signalProbWeights = TRUE,
                analyzeReagents = TRUE,
                parallelNodes = 1
                )

Here are the gene-level summaries (more detailed gene-level results can be obtained using results$gene_detailed). As expected, CCND1 and CDK4 show strongly significant increases in essentiality in cell lines with higher CCND1 expression. This is consistent with the knockdown of either member of the CCND1-CDK4 complex disrupting the function of the complex. CDK6 is not found to be significantly associated with CCND1 expression, suggesting that in Breast Cancer contexts, CDK4 rather than CDK6 is the primary complex partner for CCND1.

The difference parameter can be interpreted as the magnitude of the average change in dropout slope associated with each unit increase in log2-FPKM CCND1 expression. In other words, on average the dropout is faster (more negative), indicating increased essentiality, in cell lines with higher CCND1 expression.

options(width=100)

results$gene

##   gene_id symbol   difference       pvalue          fdr warnings errors
## 1     595  CCND1 -0.069142031 2.108795e-13 3.163193e-13                
## 2    1019   CDK4 -0.098144306 9.259274e-24 2.777782e-23                
## 3    1021   CDK6 -0.006342742 1.509899e-01 1.509899e-01

Since analyzeReagents = TRUE parameter was specified, per-reagent context-specific essentiality predictions are also available (more detailed gene-level results can be obtained using results$reagent_detailed)

options(width=100)
reagent = results$reagent
reagent[order(reagent$symbol, reagent$trcn_id), ]

##           trcn_id gene_id symbol    difference       pvalue          fdr warnings errors
## 13 TRCN0000010315     595  CCND1  0.0043898353 7.554301e-01 8.183826e-01                
## 14 TRCN0000010317     595  CCND1 -0.1245646522 5.426349e-06 2.351418e-05                
## 21 TRCN0000040038     595  CCND1 -0.1298592714 8.398346e-07 1.068391e-05                
## 22 TRCN0000040039     595  CCND1 -0.1264545296 1.232759e-06 1.068391e-05                
## 23 TRCN0000040040     595  CCND1  0.0006743605 9.617795e-01 9.617795e-01                
## 24 TRCN0000040041     595  CCND1  0.0372417472 4.209963e-01 6.755768e-01                
## 25 TRCN0000040042     595  CCND1 -0.0803004670 2.374300e-05 8.818829e-05                
## 1  TRCN0000000362    1019   CDK4 -0.1486592957 1.036791e-03 2.695657e-03                
## 2  TRCN0000000363    1019   CDK4 -0.1151053469 4.804589e-06 2.351418e-05                
## 3  TRCN0000000364    1019   CDK4 -0.0517594494 1.520418e-04 4.392318e-04                
## 4  TRCN0000000365    1019   CDK4 -0.0509033521 1.002917e-02 2.370530e-02                
## 9  TRCN0000009876    1019   CDK4  0.0090458961 4.825386e-01 6.970002e-01                
## 15 TRCN0000010472    1019   CDK4 -0.1686939532 3.232102e-06 2.100866e-05                
## 17 TRCN0000010520    1019   CDK4 -0.1148474472 7.583172e-05 2.464531e-04                
## 18 TRCN0000018364    1019   CDK4 -0.2472135162 3.969374e-09 1.032037e-07                
## 5  TRCN0000000485    1021   CDK6  0.0040365171 4.384718e-01 6.755768e-01                
## 6  TRCN0000000486    1021   CDK6 -0.0071061656 4.417233e-01 6.755768e-01                
## 7  TRCN0000000487    1021   CDK6  0.0033456048 5.704377e-01 7.105954e-01                
## 8  TRCN0000000488    1021   CDK6 -0.0059697800 6.012731e-01 7.105954e-01                
## 10 TRCN0000009877    1021   CDK6 -0.0302379973 4.811596e-02 8.935821e-02                
## 11 TRCN0000010074    1021   CDK6  0.0059725757 5.945774e-01 7.105954e-01                
## 12 TRCN0000010082    1021   CDK6 -0.0243369962 3.146235e-02 6.816842e-02                
## 16 TRCN0000010473    1021   CDK6  0.0145013231 5.720802e-01 7.105954e-01                
## 19 TRCN0000039745    1021   CDK6 -0.0181547237 4.497263e-02 8.935821e-02                
## 20 TRCN0000039747    1021   CDK6  0.0153839690 6.909228e-01 7.810431e-01                
## 26 TRCN0000055435    1021   CDK6 -0.0037305824 8.303432e-01 8.635569e-01

The above analysis tests whether the essentiality of each specified gene is associated with CCND1 expression. In the next section we will see how to test whether the essentiality of a gene is significantly associated with that gene’s own mRNA expression level.