The accurate detection of mutations within single cells from their transcriptomic sequences enables the inference of their clonal relationships. In cancerous tissues, reconstructing the clonal evolution of tumour cells is well established, principally through the genomic sequencing of relatively large numbers of cells from different tumour regions. Lineage tracing of normal cells is more challenging, however lymphocyte lineage tracing has more recently become routine through single cell 5’ RNA sequencing and enrichment of the T and B cell receptors. Determining lymphocyte clonality has enabled researchers to infer the degree of clonal expansion (thought principally via neo-epitope antigenic stimulation), and to study how cells mature from naivety through to exhaustion.
Methods such as Souporcell have been developed to genotype single cells from their transcriptomic readouts. Other methods, such as mgatk have demonstrated the ability to clonally trace single cells through using scATAC-seq and mtDNA mutations.
This methodology, which we have termed deSCeRNAMut (de novo Single Cell RNA Mutation), aims to accurately detect mutations present in different cell populations within a mixed cellular environment from droplet based single cell RNA sequencing data. The principal challenge is the lack of consistent coverage, and low depth of reads, alongside error-prone sequencing reads. To abrogate these challenges, this method relies upon the vast numbers of cells that can be sequenced through droplet based technologies, and the implausibility of shared post-embryonic mutations between different cell-type lineages.
To demonstrate this method, we use transcriptomes generated using droplet based sequencing from Stromal and tumour cell evolution in clear cell renal cell carcinoma. The dataset provides a detailed insight into cellular heterogneity in 12 kidney tumours through multi-regional sequencing, with a large number of cells sequenced across different primary cell lineages.
Bcell | Endo | Epi | Fibro | Myeloid | NK | RCC | TCD4 | TCD8 | |
---|---|---|---|---|---|---|---|---|---|
PD43824 | 1426 | 260 | 493 | 90 | 7040 | 9298 | 3070 | 8631 | 13042 |
PD43948 | 544 | 103 | 139 | 111 | 1839 | 2231 | 1873 | 2508 | 7185 |
PD44714 | 173 | 1 | 15 | 1 | 1132 | 271 | 0 | 1099 | 353 |
PD44966 | 976 | 48 | 4 | 6 | 3789 | 3214 | 453 | 4011 | 10129 |
PD44967 | 762 | 4 | 4 | 32 | 2875 | 1643 | 149 | 1884 | 1152 |
PD45814 | 916 | 239 | 102 | 343 | 2931 | 3428 | 289 | 5117 | 9589 |
PD45815 | 1459 | 929 | 673 | 168 | 5420 | 6546 | 3012 | 5060 | 7507 |
PD45816 | 373 | 2753 | 2191 | 870 | 5062 | 3026 | 2359 | 2408 | 20222 |
PD47171 | 661 | 46 | 68 | 32 | 6070 | 1553 | 2351 | 2414 | 23282 |
PD47172 | 1041 | 1651 | 566 | 1908 | 2817 | 2294 | 0 | 2201 | 5033 |
PD47465 | 207 | 8 | 9 | 6 | 3695 | 1747 | 267 | 2838 | 2826 |
PD47512 | 567 | 20 | 168 | 15 | 7933 | 1069 | 612 | 3739 | 6151 |
This page describes an overview for detecting and filtering putative somatic mutations from the above cells. We then benchmark the tumour cell calls by comparison to whole exome sequencing, and T cell calls by reference to whether they are restricted to cells with specific TCR clonotypes (i.e. acquired post thymic selection).
In order to call cell specific mutations, indexed BAM files from the cellranger pipeline were fist split into cell specific BAM files. We borrowed this code for the task. Cell specific BAM files were indexed using samtools.
Mutations were initially called using bcftools mpileup. The choice of mutation caller was primarily influenced by the need for high sensitivity calls of variants with few supporting reads, as demonstrated by this useful systematic comparison of single-nucleotide variant callers (Liu et al 2019).
Unsurprisingly, a huge number of mutations were called - with between 800,000 and 4,000,000 mutations called per patient. To facilitate more efficient downstream filtering of putative mutations, we perform the first filter step at this point:
After these steps, we are left with between 40,000 and 300,000 mutations per patient.
So far, we have generated a list of putative variant sites, but we are unaware how many variants may have been missed at each loci, and we have no information regarding reference calls at those loci. We therefore run alleleCount to generate count tables of each base for all cells at every putative patient-specific loci.
The code used to run the above steps is archived here.
Reference and variant counts were collated for all of the loci called above to create a sparse matrix of counts for cell derived from every patient. In the absence of copy number variants, if an autosomal chromosome posses a true mutation, one expects an approximately equal number of reference and variant calls. The exception is for genes that exhibit a high degree of allelic specific expression, or that typically transcribe a particular allele in concentrated bursts. Alternatively, a high ratio of reference to variant counts in a cell base may imply artifact associated with high depth sequencing/ poorly mapped regions. A binomial filter (p<0.05) is therefore applied in each cell, with calls ignored in future analyses if there are significantly higher reference than variant counts.
Each genomic loci is annotated using ANNOVAR, the trinucleotide context of the variant, and any loci associated with previously documented RNA editing sites using DARNED and RADAR.
The number of cells containing either the reference and variant base are collated for:
The Fisher’s exact test is used to compute whether there are proportionally greater numbers of mutations in the cell lineage/ clonotype with the greatest number of mutations:
Number of CD8 T cells | Number of other cell types | |
---|---|---|
Alternative base | 21 | 0 |
Reference base | 1256 | 6089 |
Number of cells with TRA:CAVTGGGADGLTF;TRB:CASSQGGGGADTQYF CDR3 | Number of cells with other CDR3 | |
---|---|---|
Alternative base | 9 | 11 |
Reference base | 21 | 2683 |
An enrichment factor is also calculated for each mutation that represents the multiple of the increased prevalence in the predominant cell type compared to all others. The top few lines of the output table appears as such:
patient | chr | pos | ref | mut | tissue | mutTP | mutTO | refTP | refTO | pValT | clono | mutCP | mutCO | refCP | refCO | pValC | trinuc_sub | channel | AtoI | Func.refGene | Gene.refGene | ExonicFunc.refGene | cytoBand | ExAC_ALL | avsnp147 | strand | enrich | Func.refGene.1 | Gene.refGene.1 | ExonicFunc.refGene.1 | cytoBand.1 | ExAC_ALL.1 | avsnp147.1 | strand.1 |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
PD43824 | 1 | 137825 | G | A | Myeloid | 7 | 0 | 0 | 0 | 1.0000000 | 0 | 0 | 0 | 0 | 1.0000000 | CGC>CAC | 43 | FALSE | ncRNA_exonic | LOC729737 | . | 1p36.33 | . | rs147252685 | - | NA | ncRNA_exonic | LOC729737 | . | 1p36.33 | . | rs147252685 | - | |
PD43824 | 1 | 182640 | C | A | TCD4 | 2 | 0 | 18 | 39 | 0.1110462 | TRA:CAMSALLGQNFVF;TRB:CSARVNNGEAFF | 1 | 1 | 0 | 30 | 0.0625000 | TCT>TAT | 16 | FALSE | ncRNA_exonic | LOC102725121 | . | 1p36.33 | . | . | + | Inf | ncRNA_exonic | LOC102725121 | . | 1p36.33 | . | . | + |
PD43824 | 1 | 187153 | T | C | Myeloid | 25 | 36 | 55 | 114 | 0.2729017 | TRA:CAASPANTGGFKTIF;TRB:CASSATVTGDTEAFF | 1 | 13 | 0 | 61 | 0.1866667 | CTG>CCG | 71 | FALSE | downstream | MIR6859-1;MIR6859-2;MIR6859-3;MIR6859-4 | . | 1p36.33 | . | . | - | 1.302083 | downstream | MIR6859-1;MIR6859-2;MIR6859-3;MIR6859-4 | . | 1p36.33 | . | . | - |
PD43824 | 1 | 187497 | G | A | Myeloid | 15 | 28 | 28 | 60 | 0.8431712 | TRA:CASQGGGGNKLTF;TRB:CASSGVRAEAFF | 2 | 13 | 0 | 31 | 0.1014493 | CGG>CAG | 39 | FALSE | downstream | MIR6859-1;MIR6859-2;MIR6859-3;MIR6859-4 | . | 1p36.33 | . | . | - | 1.096345 | downstream | MIR6859-1;MIR6859-2;MIR6859-3;MIR6859-4 | . | 1p36.33 | . | . | - |
PD43824 | 1 | 200104 | T | C | Epi | 17 | 13 | 1 | 8 | 0.0232371 | TRA:CAAQNNNAGNMLTF;TRA:CAVRGDMRF;TRB:CASNLWGGDTEAFF | 1 | 4 | 0 | 4 | 1.0000000 | GTC>GCC | 74 | FALSE | intergenic | MIR6859-1;FAM138D | . | 1p36.33 | . | . | - | 1.525641 | intergenic | MIR6859-1;FAM138D | . | 1p36.33 | . | . | - |
PD43824 | 1 | 629937 | A | G | TCD8 | 21 | 0 | 1256 | 6089 | 0.0000000 | TRA:CAVTGGGADGLTF;TRB:CASSQGGGGADTQYF | 9 | 11 | 21 | 2683 | 0.0000000 | AAT>AGT | 68 | FALSE | upstream | LOC101928626 | . | 1p36.33 | . | . | - | Inf | upstream | LOC101928626 | . | 1p36.33 | . | . | - |
The code to run the above steps has been archived here.
We apply the following thresholds to filter all possible mutations
The discriminatory power of the above metrics in reducing mutational calls is examined through a Venn diagram. Unsurprisingly, the cell lineage enrichment metrics provide the strongest power in curating the dataset.
Venn diagram demonstrating the discriminatory power of the above filtering metrics
We can also examine the trinucleotide context of the mutations pre- and post-filtering.
Trinucleotide context of called mutations pre- and post- filtering
Note is made of high levels of mutations that are otherwise unexplained from published catalogues of mutational signatures (particularly in a GCN>GGN and GTN>GGN context). By separating the trinucleotide context into the positive versus negative transcribed strand, we see differences that are otherwise unexplained by DNA derived mutational signatures, implying artifact either through library prep, sequencing, or RNA editing.
Trinucleotide context of filtered mutations by transcribed strand
A striking strand bias is demonstrated that cannot be accounted for by known mutational processes. Given the disparity between transcribed strands, mutations that have arisen with a highly biased context are removed (binomial filter, p < 0.005). The difference in the relative height of the removed mutations indicates that the majority of those removed are artefacts.
We finally remove all mutations that are clustered within 4 bases in a given patient, to yield the final mutation calls. The code to run the above steps is archived here
Multi-regional whole exome sequencing data has been processed for tumour tissue adjacent to the regions that have undergone single cell RNA sequencing. The exonic mutations may therefore be used as a benchmark to determine the precision and sensitivity of the single cell mutation calling method above. To provide a fair comparison between single cell RNA and bulk exonic DNA mutation calls, and to account for differences in coverage between the methods, we also examine whether there is evidence of a given mutation using the reciprocal technology by performing a pileup at that mutation locus.
We can therefore classify mutations called using the above pipeline as:
Note that it is possible that some of the false positive results may be real mutations that simply have not been captured spacially as adjacent tissue was sequenced. Overall, this scenario is unlikely as the majority of mutations are clonal and present throughout the tumour. The code used to run the above steps, including the pileup of scRNAseq and whole exome sequencing BAMS is provided here.
We show the classification of the mutations in the barplots below. The top plot shows all mutations, whereas the lower plot is restricted to exonic only mutations.
It is therefore trivial to compute an average precision and sensitivity of mutation calling, using whole exome sequencing data as the comparator:
All mutations | Exonic mutations only | |
---|---|---|
Precision | 0.6444444 | 0.7045455 |
Sensitivity | 0.5272727 | 0.4920635 |
In adult tumours, one expects a high proportion of somatic mutations in expanded CD8 positive T cells to have been acquired post thymic selection. Most called mutations should therefore be restricted to a single T cell receptor clonotype. By using identical metrics to those used to select mutations across all cell types, we can examine the proportion of CD8 positive T cell mutations that are restricted to a single clonotype. Again, in order to call a mutation, we use thresholds requiring at least 5 cells with the variant in the most prevalent clonotype, with a least 20 cells covering the reference allele in the other clonotypes.
## [1] "overall restricted fraction = 0.84"
We sought to investigate whether differences in the numbers of mutations expressed between the cell types present in the tumour micro-environment can be used to infer the degree of clonal expansion. The proportion of cells with one, two, three, or greater than three mutations from the final filtered set above was therefore calculated. To account for the lack of discriminatory power in rarer cell populations, we required at least 100 cells from each cell lineage and patient. As expected, the lineage with the highest number of cells expressing called mutations were the cancerous cells themselves, mainly explained by the known clonal structure of the lineage, but also due to the likelihood of increased tumour burden when compared to the normal cell types. For similar reasons, all of the normal cell types did not tend to have discernible numbers of cells with more than one called mutation. A surprisingly large number of Myeloid cells express mutations, indicating that a sizable proportion of these cells are clonally related. These are followed by fibroblasts, and CD8 positive T cells (which we know are clonally expanded based on the TCR sequencing results). A very small proportion of CD4 positive T cells expressed mutations, consistent with the low degree of clonality based on TCR analysis. In summary, it seems possible to infer clonality of different cell types from expressed scRNA seq data, and the results are consistent with other sequencing modalities.
We explored whether it might be possible to determine the broad phylogenetic structure of the different cell types. Unfortunately, the small number of called mutations, in combination with the low expression at these loci mean that formal methods to build phylogenetic structure cannot be used. One can look instead at the proportion of cells that express any mutations, and the number of cells that express increasing numbers of mutations to give a very rough account of the structure. These might range from predominantly clonal (RCC cells), through a continuum of clonality, to the polyclonal CD4 positive T cells subsets as depicted in the schematic below. Although it is impossible to estimate the relative length of the branches, or sizes of the expansion, we envisage that may appear similar to that below.Â
Schematic of the phylogenies of trees for the main cell types
TCR data from our CD8 positive T cells may be used to determine the relationship between the size of a clonal population and the likelihood of a cell expressing any mutations in that cell population. If such a relationship can be established, it may be possible therefore to infer the degree of clonality of other cell populations, where we do not have any prior information to the clonality. Called mutations in CD8 positive T cells are annotated with their TCR clonotype in order to calculate the proportion of cells that express a mutations. We plot the proportion of cells expressing a mutation with clonotype size to determine whether this proportion appears to reach an symptotic limit. The proportion of cells with an expressed mutation appears to reach a limit, at a population size of 200 cells reaching approximately 11% of cells that express a somatic mutation. We can sanity check this proportion by comparing to our known clonal renal cancer cells, where a mean proportion of 19% of cells express a mutation. Â
Mutations expressed by cells across different cell states infer a common progenitor, and allow lineage tracing of cells as they mature from one cell state to another. Myeloid cells (in particular monocytes and macrophages), lend themselves to this approach as they exhibit significant phenotypic heterogeneity yet also appear to be the most clonal of all of the normal cell types. To assess the relationships between cell states (such as classical/ non-classical monocytes, tissue resident/ tumour-associated macrophages), a distance matrix was calculated, which represents the number of mutations that are expressed in any cell derived from exactly one of the two comparitor states as a fraction of the number of mutations that are present in any cell from either of the two comparitor states:
\[d = (Ac ⋂ B + A ⋂ Bc) / A ⋃ B \]
The distance matrix can be then used to directly show the developmental trajectory of myeloid cells as described in our paper.Â