Collection of RNA-seq data from immune cell types and tumor cell lines
To build the signature matrix, we collected 51 data sets generated from paired-end Illumina RNA-seq of blood-derived immune cells (Supplementary Table 1). In addition, we downloaded from the Cancer Genomics Hub (CGHub, https://cghub.ucsc.edu, accessed on February 2016) RNA-seq data from a breast (G41726.MCF7.5) and a colorectal (G27202.SW480.1) cancer cell line.
BAM files of mapped reads gathered from to the CGHub were converted to FASTQ with samtools1, whereas SRA files downloaded from the Sequence Read Archive (SRA, https://www.ncbi.nlm.nih.gov/sra/) were converted to FASTQ with the “fastq-dump” function of the SRA Toolkit.
RNA-seq data pre-processing
FASTQ files of RNA-seq reads were pre-processed with Trimmomatic2 to remove adapter sequences and read ends with Phred quality scores lower than 20, to discard reads shorter than 36 bp, and to trim long reads to a maximum length of 50 bp.
This analysis is implemented in the “Preprocessing” module of quanTIseq (step 1 in Fig. 1c), which also allows selecting different parameters for data preprocessing.
Quantification of gene expression and normalization
The pre-processed RNA-seq reads were analyzed with Kallisto3 to generate gene counts and transcripts per millions (TPM) using the “hg19_M_rCRS” human reference. For single-end data, the following Kallisto options were used: “--single -l 50 -s 20”.
After gene expression quantification, gene names were re-annotated to updated gene symbols defined by the HUGO Gene Nomenclature Committee (http://www.genenames.org, annotations downloaded on April, 2017). In case of duplicates, the median expression per gene symbol was considered.
The final expression value for each gene in library was computed from TPM with the following formula:
For microarrays data, before the normalization of Equation 1, expression data were transformed from log scale to natural scale (when needed) and quantile-normalized.
TPM can be computed from RNA-seq reads with the “Gene Expression Quantification” module of quanTIseq (step 2 in Fig. 1c). Gene re-annotation and expression normalization are performed by the quanTIseq “Deconvolution” module before deconvolution (step 3 in Fig. 1c) and quantile normalization is performed if the “--arrays” option is set to “TRUE”.
We simulated RNA-seq data from breast tumors with different purity values and immune infiltrates by mixing pre-processed reads from immune cell types and from a tumor cell line (G41726.MCF7.5) of the RNA-seq compendium. We simulated 100 different immune-cell mixtures by sampling the cell fractions from a uniform distribution in the [0-1] interval. The cell fractions were combined with 11 different tumor purity scenarios: 0:10:100% tumor purity, defined as the fraction of read pairs from the tumor cell line over total read pairs. Each simulated data set consisted of 1 million paired-end reads. In addition, for the data set with 60% purity (which is the minimum value considered by the TCGA consortium for tumor specimen inclusion4) we simulated different sequencing depths, namely: 1, 2, 5, 10, 20, 50, and 100 million read pairs. In total, we generated 1,700 simulated RNA-seq data sets. The simulated data are available at the following link: http://icbi.i-med.ac.at/software/quantiseq/doc/index.html.
Generation of the TIL10 signature matrix
An expression matrix was generated from the compendium of RNA-seq data as described in “RNA-seq data pre-processing” and “Quantification of gene expression and normalization”, andconsisted in 19,423 genes and 53 immune- and tumor-cell libraries. From this matrix, we filtered out the genes that were not detected in at least two immune libraries and selected the genes specific for each cell type considering the criteria described in the following. Gene expression is here considered in terms of normalized values (Equation 1) on a natural scale, if not differently stated.
Cell-specific expression. We quantized the expression of each gene into three bins representing low, medium, and high expression, computed as in 5. For each immune cell type, we selected the genes having: (ii) high quantized expression in all libraries belonging to the considered immune-cell type; and (iii) low or medium quantized expression in all other libraries.
Expression in tumors. We filtered the signature genes that were highly expressed also in tumor cells by discarding the genes having a median log2 expression larger than 7 in all non-hematopoietic cancer cell lines assayed in the Cancer Cell Line Encyclopedia (CCLE) 6, as done in 7. Moreover, RNA-seq data from 8,243 TCGA solid tumors were used to remove genes that provide little support for bulk-tissue deconvolution because their expression in tumor samples is generally low or null. More precisely, we discarded the genes having an average expression across all TCGA samples lower than 1 TPM.
Specificity of marker genes. Signature genes specific for a certain cell type should not be associated to another cell type. We considered a compendium of 489 gene sets specific for 64 cell types recently proposed in 8 and removed the signature genes that were listed in a gene set specific for another cell type. CD4+ T cell gene sets were not used to filter Treg-cell signature genes, as the CD4+ T cell population may contain bona fide Treg-cell expression markers such like the forkhead box P3 (FOXP3).
Range of expression. As genes with high expression can bias deconvolution results, we excluded the genes whose expression exceeded the 700 TPM.
Correlation with true cell fractions. The 1,700 simulated RNA-seq data sets (see “Generation of the simulated data sets”) were then used to identify the signature genes that provide valuable information over cell fractions and are more robust to the sequencing depth and unknown tumor content. For each cell type, we selected the genes whose expression levels had a correlation with the true cell fractions equal or greater than 0.6.
Restricted expression. We considered four external expression data sets from enriched/purified immune cells: two microarray data sets (GEO accession: GSE28490 and GSE2849)9, an RNA-seq data set10, and a microarray compendium that was used to build the CIBERSORT LM22 signature matrix7. All data sets were preprocessed and normalized as explained in the previous paragraphs. For each gene specific for a cell type in the signature matrix, we computed the ratio between the median expression across all libraries in data set belonging to the cell type and the median expression across all libraries in data set not belonging to the cell type . For each cell type, the top 30 ranked signature genes (or less, when not available) with were selected for the final signature matrix. When processing the Treg signature genes, the data sets belonging to CD4+ T cells were not considered. Treg signature genes were further filtered with a similar approach, but considering the RNA-seq data of circulating CD4+ T and Treg cells from 11 and selecting only the genes with .
The final signature matrix TIL10 (Supplementary Table 2) was built considering the 170 genes satisfying all the criteria reported above. The expression profile of each cell type was computed as the median of the expression values over all libraries belonging to that cell type:
For the analysis of RNA-seq data quanTIseq further reduces this signature matrix by removing a manually-curated list of genes that showed a variable expression in the considered data sets: CD36, CSTA, NRGN, C5AR2, CEP19, CYP4F3, DOCK5, HAL, LRRK2, LY96, NINJ2, PPP1R3B, TECPR2, TLR1, TLR4, TMEM154, CD248. This default signature considered by quanTIseq for the analysis of RNA-seq data consist of 153 genes and has a lower condition number then the full TIL10 signature (6.73 compared to 7.45), confirming its higher cell-specificity. We advise using the full TIL10 matrix (--rmgenes=“none”) for the analysis of microarray data, as they often lack some signature genes, and the reduced matrix (--rmgenes= “default”) for RNA-seq data. Alternatively, the “rmgenes” option allows specifying a custom list of signature genes to be disregarded (see quanTIseq manual).
The quanTIseq deconvolution module takes as input:
A signature matrix of expression values over signature genes and cell types.
After re-annotation of gene symbols and normalization of the mixture matrix (see “Quantification of gene expression and normalization”), quanTIseq performs deconvolution of the unknown cells fractions over immune cell types and samples. For each sample, the following system of equations is solved to estimate the cell fractions (the subscript is omitted):
where is the set of signature genes that are present in the mixture matrix. quanTIseq solves this inverse problem using constrained least squares regression, i.e. by minimizing the formula: , imposing the constraints:
To account for differences in the average mRNA content per cell type, which might otherwise bias deconvolution results12–14, the estimated cell fractions are normalized by a cell-type-specific scaling factor :
Then, the cell fractions are scaled so to sum up to the original percentage of total cells, as:
Finally, the proportion of “other” (uncharacterized) cells is estimated as:
As the population of “other” cells might include different types of malignant and normal cells with various mRNA contents15 depending on the sample under investigation, quanTIseq does not scale these estimates.
The scaling factors were computed as the median expression of the Proteasome Subunit Beta 2 (PSMB2) housekeeping gene16 in the immune cell types of the RNA-seq compendium. In the analysis of the simulated RNA-seq data, where the true fractions represented mRNA fractions and not cell fractions, deconvolution was performed without mRNA-content normalization (