|Supplementary Table 4).
The deconvolution of Treg cells and CD4+ T cells is inherently hampered by the high correlation of their expression signatures (namely, multi-collinearity7) and can result in the underestimation of Treg cells present in low fractions. As there is currently no consensus on whether regularization methods can overcome multi-collinearity in regression-based deconvolution17,18, we adopted an heuristic strategy to specifically address the issue of Treg-cell underestimation. First, quanTIseq estimates the Treg cell fractions considering all cell types together. Then, for the samples with , quanTIseq re-estimates the Treg cell fractions removing from the signature matrix the expression profiles of the CD4+ T cells. The final Treg cell fractions are then estimated by averaging the results:
whereas CD4+ T cell fractions are scaled to:
Finally, all cell fractions are normalized to sum up to 1.
The analysis described in this section is implemented in the “Deconvolution” module of quanTIseq (step 3 in Fig. 1c).
Deconvolution of bulk tumors
Aberrant de-methylation and sequence duplication can lead to over-expression of immune signature genes in tumor cells. Tumor RNA-seq data can be analyzed with quanTIseq setting the “--tumor” option to “TRUE”. This setting discards the signature genes whose expression in the TCGA RNA-seq data exceed 11 TPM, which are: NUPR1, CD36, CSTA, HPGD, CFB, ECM1, FCGBP, PLTP, FXYD6, HOPX, SERPING1, ENPP2, GATM, PDPN, ADAM6, FCRLA, SLC1A3. All tumor data sets presented in this work have been analyzed with this parameter setting (Supplementary Table 4).
Validation data sets
To benchmark quanTIseq, we considered the expression data sets listed in Supplementary Table 3, using the options reported in Supplementary Table 4.
Normalized microarray data were downloaded from the Gene Expression Omnibus (GEO) (https://www.ncbi.nlm.nih.gov/geo) with the GEOquery R package19. Probes were mapped to gene symbols with the biomaRt R package20. In case of multiple probes mapping to the same gene symbol, the probe with the highest average expression across all samples was selected.
Immune cell fractions estimated with flow cytometry, Coulter Counter, or from images of stained tissue slides were used as ground truth to validate quanTIseq. Where necessary, different functional states of an immune cell type were aggregated by summing up the corresponding cell fractions (e.g., for the Newman’s data set 7, B cells were quantified summing up the fractions of naïve and memory B cells).
Generation of flow cytometry and RNA-seq data from blood-derived immune-cell mixtures
Blood samples from healthy human donors were obtained from the Blood Bank Innsbruck under approval of the local ethics committee. Peripheral blood mononuclear cells (PBMC) were isolated from human whole blood by density centrifugation using Lymphocyte Separation Medium (Capricorn, Ebsdorfergrund, Germany). The PBMC fraction was collected and washed three times with Dulbecco’s phosphate buffered saline. To isolate polymorphonuclear (PMN) cells, the cells on top of the erythrocytes were collected and contaminating red blood cells were removed by two rounds of lysis with 0.2% NaCl solution at 4°C. PMN were added to the PBMC fractions in low abundance (3-6% of total cells) and aliquots were taken for RNA extraction and flow cytometry analysis.
Total RNA was extracted with the Qiagen RNeasy mini kit (Qiagen GmbH, Hilden, Austria), including on-column DNAse I treatment. INVIEW polyA RNA library preparation, and Illumina 50bp SR sequencing at >60 Million reads per library was obtained from an external provider (GATC biotech, Konstanz, Germany).
The fractions of the following cell types in the immune-cell mixtures were determined by flow cytometry using specific marker combinations: CD4+ T cells (CD3+CD4+), CD8+ T cells (CD3+CD8+), Treg cells (CD3+CD4+CD25+CD127-), B cells (CD19+), NK cells (CD3-CD16+CD56+), myeloid dendritic cells (Lin-HLA-DR+CD11c+), monocytes (CD14+) and neutrophils (CD15+CD16+). Labeled antibodies specific for the following antigens were purchased from BD Biosciences (San Jose, CA, USA) and Biolegend (San Diego, CA, USA): CD3 (UCHT1), CD4 (RPA-T4), CD8 (HIT8a), CD11c (3.9), CD14 (M5E2), CD15 (W6D3), CD16 (3G8), CD19 (HIB19), CD20 (2H7), CD25 (BC96), CD56 (B159), CD127 (A019D5), HLA-DR (L243), Lin: CD3, CD14, CD19, CD20, CD56. The measurements were performed on a BD LSRFortessa flow cytometer and the data were evaluated with FlowLogic 7.1 software (Inivai Technologies, Melbourne, Australia).
Leiden validation data set
Fresh frozen and formalin-fixed material was available from 9 colorectal cancer patients. Their usage was approved by the local ethics committee. All the specimens were anonymized and handled according to the ethical guidelines described in the Code for Proper Secondary Use of Human Tissue in the Netherlands of the Dutch Federation of Medical Scientific Societies.
RNA was isolated with the NucleoSpin RNA kit (Macherey-Nagel, Düren, Germany) including on-column DNAse I treatment. Library preparation was preceeded by rRNA depletion with the NEBNext rRNA depletion kit (New England Biolabs, MA, USA). PE 150bp sequencing was performed at GenomeScan (Leiden, The Netherlands) on a HiSeq 4000 (Illumina, San Diego CA, USA).
4µm sections of formalin-fixed paraffin-embedded tissues were deparaffinized and underwent heat-mediated antigen retrieval in 10 mmol/L citrate buffer solution (pH 6). Unspecific antibody binding was prevented with the SuperBlock PBS buffer (Thermo Fisher Scientific, Waltham, MA, USA) according to the manufacturer’s instructions. Immunofluorescence detection was performed with the following antibodies: pan-keratin (AE1/AE3, Biolegend and C11, Cell Signaling Technology), anti-CD3 (D7AE), anti-CD45RO (UCHL1), and anti-PD1 (D4W2J) from Cell Signaling Technology (Danvers, MA, USA), anti-FOXP3 (236A/E7, Thermo Fisher Scientific, Waltham, MA, USA), and anti-CD8 (4B11, Leica, Wetzlar, Germany). Immunofluorescence detection was performed directly and indirectly with Alexa 488, 546, 594, 633, 647, and 680 with an in-house methodology (manuscript in preparation).
Image analysis was performed with the Vectra 3.0 Automated Quantitative Pathology Imaging System and the inFORM Cell Analysis software (Perkin Elmer, Waltham, MA, USA) including spectral separation of dyes, tissue and cell segmentation, and automated cell counting of immune phenotypes.
Vanderbilt validation data set
Patient samples. 70 melanoma patient samples and data were procured based on availability of tissue and were not collected according to a pre-specified power analysis. Included in these, 42 samples were baseline pre-anti-PD1 therapy. Remaining patients were treated with either anti-CTLA-4 alone (n=14), or combinations of anti-PD-1 and anti-CTLA-4 (n=5). Finally, 9 samples were obtained from progressing tumors in patients experiencing an initial response (n=7, 1, and 1 for anti-PD-1, anti-CTLA-4 and anti-PD-+anti-CTLA-4, respectively). Clinical characteristics and objective response data were obtained by retrospective review of the electronic medical record. Patients were classified in responders (Complete Response and Partial Response) and non-responders (Progressive Disease, Mixed Response) according to investigator assessed, RECIST defined responses. All patients provided informed written consent on IRB approved protocols (Vanderbilt IRB # 030220 and 100178).
RNA sequencing. Total RNA quality was assessed using the 2200 Tapestation (Agilent). At least 20ng of DNase-treated total RNA having at least 30% of the RNA fragments with a size >200 nt (DV200) was used to generate RNA Access libraries (Illumina) following manufacturer’s recommendations. Library quality was assessed using the 2100 Bioanalyzer (Agilent) and libraries were quantitated using KAPA Library Quantification Kits (KAPA Biosystems). Pooled libraries were subjected to 75 bp paired-end sequencing according to the manufacturer’s protocol (Illumina HiSeq3000). Bcl2fastq2 Conversion Software (Illumina) was used to generate de-multiplexed Fastq files.
IHC analysis. For Foxp3, CD4 and CD8 staining, slides were placed on a Leica Bond Max IHC stainer. All steps besides dehydration, clearing and coverslipping were performed on the Bond Max. Heat induced antigen retrieval was performed on the Bond Max using their Epitope Retrieval 2 solution for 20 minutes. Slides were incubated with anti-CD4 (PA0427, Leica, Buffalo Grove, IL), Foxp3 (14-4777-82, eBiosciences), or anti-CD8 (MS-457-R7, ThermoScientific, Kalamazoo, MI) for one hour.
Analysis of Vanderbilt images with IHCount
After quality control and filtering of low-quality images, 68 IHC images were available for 31 melanoma patients of the Vanderbilt cohort. To quantify total cells and tumor-infiltrating immune cells from IHC images, we implemented a semi-automated computational pipeline called IHCount, which performs different analytical tasks, including: image pre-processing, annotation and algorithm training, image segmentation and classification, and cell counting.
For the pre-processing of the IHC images we used the script collection bftools from the Open Microscopy Environment (OME)21. As a first step, high-resolution bright-field images were extracted from the image containers, available in Leica (.scn) format. Then, each of these high-resolution images was tiled into smaller images to be used as training data. The tiling was performed to limit the computational cost of the subsequent analytical steps.
The Pixel Classificator module of Ilastik22 was used to build classifiers from a subset of the previously generated image tiles. Leveraging on manual annotation of the segmented images, the classifiers were trained to identify positively-stained cells and all nuclei on the selected IHC images, as well as tissue and background. As a result of running the classifier as a batch process on all tiles, we obtained two sets of probability maps, reporting either the probabilities for positively-stained cells or that of all nuclei on the slide.
Image segmentation and cell counting was performed with CellProfiler23 starting from the probability maps built at the previous step. We created a CellProfiler pipeline that uses several internal modules to identify and count positive stained cells, nuclei and the area of the tissue. First, the probability maps were split by the red, blue and green color channels and converted into gray scale. Second, multiple intensity-based operations were run to reduce noise and to identify positively-stained cells, nuclei and tissue. Finally, the aggregated results for each image were obtained by summing up the data obtained for the single tiles.
We implemented a python script (runCP.py) that allows the user to run all the computational tasks from image pre-processing until the generation of the probability maps and a CellProfiler pipeline (IHCount.cppipe4) for image segmentation and cell counting. IHCount code is available at the following link: https://github.com/mui-icbi/IHCount.
IHCount results for the Leiden cohort are reported in Supplementary Table 5. Total cell densities per tumor sample to be used to scale quanTIseq cell fractions were estimated as the median number of all cell nuclei per mm2 across all images generated from that tumor.
Correlation between numeric variables was assessed with Pearson’s correlation. The area under the receiver operating characteristic curve (AUROC) for multi-class classification was computed with the “multiclass.roc” function of the pROC R package. Constrained least squares regression was performed with the “lsei” function from the “limSolve” R package. The root-mean-squared error was computed as . Statistically significant differences between two groups were tested with two-sided Wilcoxon’s test. For comparisons across multiple groups, Kruskal-Wallis test followed by two-sided Dunn’s pairwise post hoc were used. Normality of the data distribution was tested with Shapiro-Wilk test. Overall survival analyses were performed using the R package survival on TCGA survival data ('vital_status','days_to_death', and 'days_to_last_followup'). For each cancer type, patients were dichotomized in two groups according to maximal logrank statistics within a selection interval from the 10% to 90% percentile of the respective immune cell fractions. The Kaplan-Meier estimator was used to generate survival curves and logrank tests (corresponding to two sided z-test) were applied. Resulting p-values were corrected according to the minimal p-value approach24.
1. Li, H. et al. The Sequence Alignment/Map format and SAMtools. Bioinforma. Oxf. Engl. 25, 2078–2079 (2009).
2. Bolger, A. M., Lohse, M. & Usadel, B. Trimmomatic: a flexible trimmer for Illumina sequence data. Bioinforma. Oxf. Engl. 30, 2114–2120 (2014).
3. Bray, N. L., Pimentel, H., Melsted, P. & Pachter, L. Near-optimal probabilistic RNA-seq quantification. Nat. Biotechnol. 34, 525–527 (2016).
4. Aran, D., Sirota, M. & Butte, A. J. Systematic pan-cancer analysis of tumour purity. Nat. Commun. 6, 8971 (2015).
5. Birnbaum, K. D. & Kussell, E. Measuring cell identity in noisy biological systems. Nucleic Acids Res. 39, 9093–9107 (2011).
6. Barretina, J. et al. The Cancer Cell Line Encyclopedia enables predictive modelling of anticancer drug sensitivity. Nature 483, 603–607 (2012).
7. Newman, A. M. et al. Robust enumeration of cell subsets from tissue expression profiles. Nat. Methods 12, 453–457 (2015).
8. Aran, D. & Butte, A. J. Digitally deconvolving the tumor microenvironment. Genome Biol. 17, 175 (2016).
9. Allantaz, F. et al. Expression profiling of human immune cell subsets identifies miRNA-mRNA regulatory relationships correlated with cell type specific expression. PloS One 7, e29979 (2012).
10. De Simone, M. et al. Transcriptional Landscape of Human Tissue Lymphocytes Unveils Uniqueness of Tumor-Infiltrating T Regulatory Cells. Immunity 45, 1135–1147 (2016).
11. Plitas, G. et al. Regulatory T Cells Exhibit Distinct Features in Human Breast Cancer. Immunity 45, 1122–1134 (2016).
12. Abbas, A. R., Wolslegel, K., Seshasayee, D., Modrusan, Z. & Clark, H. F. Deconvolution of blood microarray data identifies cellular activation patterns in systemic lupus erythematosus. PloS One 4, e6098 (2009).
13. Gong, T. et al. Optimal deconvolution of transcriptional profiling data using quadratic programming with application to complex clinical blood samples. PloS One 6, e27156 (2011).
14. Racle, J., de Jonge, K., Baumgaertner, P., Speiser, D. E. & Gfeller, D. Simultaneous Enumeration Of Cancer And Immune Cell Types From Bulk Tumor Gene Expression Data. eLIFE 6, e26476 (2017).
15. Marinov, G. K. et al. From single-cell to cell-pool transcriptomes: stochasticity in gene expression and RNA splicing. Genome Res. 24, 496–510 (2014).
16. Eisenberg, E. & Levanon, E. Y. Human housekeeping genes, revisited. Trends Genet. TIG 29, 569–574 (2013).
17. Newman, A. M., Gentles, A. J., Liu, C. L., Diehn, M. & Alizadeh, A. A. Data normalization considerations for digital tumor dissection. Genome Biol. 18, 128 (2017).
18. Li, B., Liu, J. S. & Liu, X. S. Revisit linear regression-based deconvolution methods for tumor gene expression data. Genome Biol. 18, 127 (2017).
19. Davis, S. & Meltzer, P. S. GEOquery: a bridge between the Gene Expression Omnibus (GEO) and BioConductor. Bioinforma. Oxf. Engl. 23, 1846–1847 (2007).
20. Durinck, S. et al. BioMart and Bioconductor: a powerful link between biological databases and microarray data analysis. Bioinforma. Oxf. Engl. 21, 3439–3440 (2005).
21. Linkert, M. et al. Metadata matters: access to image data in the real world. J. Cell Biol. 189, 777–782 (2010).
22. Sommer, C., Straehle, C., Koethe, U. & Hamprecht, F. A. Ilastik: Interactive learning and segmentation toolkit. in Biomedical Imaging: From Nano to Macro, 2011 IEEE International Symposium on 230–233 (IEEE, 2011).
23. Kamentsky, L. et al. Improved structure, function and compatibility for CellProfiler: modular high-throughput image analysis software. Bioinforma. Oxf. Engl. 27, 1179–1180 (2011).
24. Altman, D. G., Lausen, B., Sauerbrei, W. & Schumacher, M. Dangers of using ‘optimal’ cutpoints in the evaluation of prognostic factors. J. Natl. Cancer Inst. 86, 829–835 (1994).