Background

The accurate detection of mutations within single cells from their transcriptomic sequences enables the inference of their clonal relationships. In cancerous tissues, reconstructing the clonal evolution of tumour cells is well established, principally through the genomic sequencing of relatively large numbers of cells from different tumour regions. Lineage tracing of normal cells is more challenging, however lymphocyte lineage tracing has more recently become routine through single cell 5’ RNA sequencing and enrichment of the T and B cell receptors. Determining lymphocyte clonality has enabled researchers to infer the degree of clonal expansion (thought principally via neo-epitope antigenic stimulation), and to study how cells mature from naivety through to exhaustion.

Methods such as Souporcell have been developed to genotype single cells from their transcriptomic readouts. Other methods, such as mgatk have demonstrated the ability to clonally trace single cells through using scATAC-seq and mtDNA mutations.

This methodology, which we have termed deSCeRNAMut (de novo Single Cell RNA Mutation), aims to accurately detect mutations present in different cell populations within a mixed cellular environment from droplet based single cell RNA sequencing data. The principal challenge is the lack of consistent coverage, and low depth of reads, alongside error-prone sequencing reads. To abrogate these challenges, this method relies upon the vast numbers of cells that can be sequenced through droplet based technologies, and the implausibility of shared post-embryonic mutations between different cell-type lineages.

Method

To demonstrate this method, we use transcriptomes generated using droplet based sequencing from Stromal and tumour cell evolution in clear cell renal cell carcinoma. The dataset provides a detailed insight into cellular heterogneity in 12 kidney tumours through multi-regional sequencing, with a large number of cells sequenced across different primary cell lineages.

Numbers of cells per lineage and patient
Bcell Endo Epi Fibro Myeloid NK RCC TCD4 TCD8
PD43824 1426 260 493 90 7040 9298 3070 8631 13042
PD43948 544 103 139 111 1839 2231 1873 2508 7185
PD44714 173 1 15 1 1132 271 0 1099 353
PD44966 976 48 4 6 3789 3214 453 4011 10129
PD44967 762 4 4 32 2875 1643 149 1884 1152
PD45814 916 239 102 343 2931 3428 289 5117 9589
PD45815 1459 929 673 168 5420 6546 3012 5060 7507
PD45816 373 2753 2191 870 5062 3026 2359 2408 20222
PD47171 661 46 68 32 6070 1553 2351 2414 23282
PD47172 1041 1651 566 1908 2817 2294 0 2201 5033
PD47465 207 8 9 6 3695 1747 267 2838 2826
PD47512 567 20 168 15 7933 1069 612 3739 6151

This page describes an overview for detecting and filtering putative somatic mutations from the above cells. We then benchmark the tumour cell calls by comparison to whole exome sequencing, and T cell calls by reference to whether they are restricted to cells with specific TCR clonotypes (i.e. acquired post thymic selection).

Initial variant calling

In order to call cell specific mutations, indexed BAM files from the cellranger pipeline were fist split into cell specific BAM files. We borrowed this code for the task. Cell specific BAM files were indexed using samtools.

Mutations were initially called using bcftools mpileup. The choice of mutation caller was primarily influenced by the need for high sensitivity calls of variants with few supporting reads, as demonstrated by this useful systematic comparison of single-nucleotide variant callers (Liu et al 2019).

Unsurprisingly, a huge number of mutations were called - with between 800,000 and 4,000,000 mutations called per patient. To facilitate more efficient downstream filtering of putative mutations, we perform the first filter step at this point:

  • Remove singlet variants only called in a single cell. The rationale is that is will be exceptionally difficult to accurately determine whether these mutations are real or artifact.
  • Removal of variants that are shared between the different cell lineages shown in the table above. The vast majority of somatic mutations are acquired post embryonic differentiation, and therefore any true degree of sharing is implausible.

After these steps, we are left with between 40,000 and 300,000 mutations per patient.

So far, we have generated a list of putative variant sites, but we are unaware how many variants may have been missed at each loci, and we have no information regarding reference calls at those loci. We therefore run alleleCount to generate count tables of each base for all cells at every putative patient-specific loci.

The code used to run the above steps is archived here.

Collation and annotation of counts

Reference and variant counts were collated for all of the loci called above to create a sparse matrix of counts for cell derived from every patient. In the absence of copy number variants, if an autosomal chromosome posses a true mutation, one expects an approximately equal number of reference and variant calls. The exception is for genes that exhibit a high degree of allelic specific expression, or that typically transcribe a particular allele in concentrated bursts. Alternatively, a high ratio of reference to variant counts in a cell base may imply artifact associated with high depth sequencing/ poorly mapped regions. A binomial filter (p<0.05) is therefore applied in each cell, with calls ignored in future analyses if there are significantly higher reference than variant counts.

Each genomic loci is annotated using ANNOVAR, the trinucleotide context of the variant, and any loci associated with previously documented RNA editing sites using DARNED and RADAR.

The number of cells containing either the reference and variant base are collated for:

  • The cell lineage with the greatest number of mutations.
  • All of the other cell lineages.
  • The TCR clonotype with the greatest numb.er of mutations.
  • All other TCR clonotypes

The Fisher’s exact test is used to compute whether there are proportionally greater numbers of mutations in the cell lineage/ clonotype with the greatest number of mutations:

Contingency table of counts for single mutation by broad cell lineage
Number of CD8 T cells Number of other cell types
Alternative base 21 0
Reference base 1256 6089
Contingency table of counts for single mutation by clonotype
Number of cells with TRA:CAVTGGGADGLTF;TRB:CASSQGGGGADTQYF CDR3 Number of cells with other CDR3
Alternative base 9 11
Reference base 21 2683

An enrichment factor is also calculated for each mutation that represents the multiple of the increased prevalence in the predominant cell type compared to all others. The top few lines of the output table appears as such:

Top of mutation calls (examples above are the bottom line)
patient chr pos ref mut tissue mutTP mutTO refTP refTO pValT clono mutCP mutCO refCP refCO pValC trinuc_sub channel AtoI Func.refGene Gene.refGene ExonicFunc.refGene cytoBand ExAC_ALL avsnp147 strand enrich Func.refGene.1 Gene.refGene.1 ExonicFunc.refGene.1 cytoBand.1 ExAC_ALL.1 avsnp147.1 strand.1
PD43824 1 137825 G A Myeloid 7 0 0 0 1.0000000 0 0 0 0 1.0000000 CGC>CAC 43 FALSE ncRNA_exonic LOC729737 . 1p36.33 . rs147252685 - NA ncRNA_exonic LOC729737 . 1p36.33 . rs147252685 -
PD43824 1 182640 C A TCD4 2 0 18 39 0.1110462 TRA:CAMSALLGQNFVF;TRB:CSARVNNGEAFF 1 1 0 30 0.0625000 TCT>TAT 16 FALSE ncRNA_exonic LOC102725121 . 1p36.33 . . + Inf ncRNA_exonic LOC102725121 . 1p36.33 . . +
PD43824 1 187153 T C Myeloid 25 36 55 114 0.2729017 TRA:CAASPANTGGFKTIF;TRB:CASSATVTGDTEAFF 1 13 0 61 0.1866667 CTG>CCG 71 FALSE downstream MIR6859-1;MIR6859-2;MIR6859-3;MIR6859-4 . 1p36.33 . . - 1.302083 downstream MIR6859-1;MIR6859-2;MIR6859-3;MIR6859-4 . 1p36.33 . . -
PD43824 1 187497 G A Myeloid 15 28 28 60 0.8431712 TRA:CASQGGGGNKLTF;TRB:CASSGVRAEAFF 2 13 0 31 0.1014493 CGG>CAG 39 FALSE downstream MIR6859-1;MIR6859-2;MIR6859-3;MIR6859-4 . 1p36.33 . . - 1.096345 downstream MIR6859-1;MIR6859-2;MIR6859-3;MIR6859-4 . 1p36.33 . . -
PD43824 1 200104 T C Epi 17 13 1 8 0.0232371 TRA:CAAQNNNAGNMLTF;TRA:CAVRGDMRF;TRB:CASNLWGGDTEAFF 1 4 0 4 1.0000000 GTC>GCC 74 FALSE intergenic MIR6859-1;FAM138D . 1p36.33 . . - 1.525641 intergenic MIR6859-1;FAM138D . 1p36.33 . . -
PD43824 1 629937 A G TCD8 21 0 1256 6089 0.0000000 TRA:CAVTGGGADGLTF;TRB:CASSQGGGGADTQYF 9 11 21 2683 0.0000000 AAT>AGT 68 FALSE upstream LOC101928626 . 1p36.33 . . - Inf upstream LOC101928626 . 1p36.33 . . -

The code to run the above steps has been archived here.

Final filter

We apply the following thresholds to filter all possible mutations

  • Fisher’s exact significance of enrichment by cell lineage, p < 0.0001 with proportionally at least 5 times greater mutations in the most enriched lineage.
  • Absence of any known single nucleotide polymorphisms from either ExAC or dbSNP.
  • No shared mutations between patients
  • Adequate coverage with at least 5 cells with variant base from the mutated cell lineage and at least 20 cells with reference base from the reference population

The discriminatory power of the above metrics in reducing mutational calls is examined through a Venn diagram. Unsurprisingly, the cell lineage enrichment metrics provide the strongest power in curating the dataset.

\label{fig:figs} Venn diagram demonstrating the discriminatory power of the above filtering metrics

Venn diagram demonstrating the discriminatory power of the above filtering metrics

We can also examine the trinucleotide context of the mutations pre- and post-filtering.

\label{fig:figs} Trinucleotide context of called mutations pre- and post- filtering

Trinucleotide context of called mutations pre- and post- filtering

Note is made of high levels of mutations that are otherwise unexplained from published catalogues of mutational signatures (particularly in a GCN>GGN and GTN>GGN context). By separating the trinucleotide context into the positive versus negative transcribed strand, we see differences that are otherwise unexplained by DNA derived mutational signatures, implying artifact either through library prep, sequencing, or RNA editing.

\label{fig:figs} Trinucleotide context of filtered mutations by transcribed strand

Trinucleotide context of filtered mutations by transcribed strand

A striking strand bias is demonstrated that cannot be accounted for by known mutational processes. Given the disparity between transcribed strands, mutations that have arisen with a highly biased context are removed (binomial filter, p < 0.005). The difference in the relative height of the removed mutations indicates that the majority of those removed are artefacts.

We finally remove all mutations that are clustered within 4 bases in a given patient, to yield the final mutation calls. The code to run the above steps is archived here

Benchmarking Data

Whole exome sequencing

Multi-regional whole exome sequencing data has been processed for tumour tissue adjacent to the regions that have undergone single cell RNA sequencing. The exonic mutations may therefore be used as a benchmark to determine the precision and sensitivity of the single cell mutation calling method above. To provide a fair comparison between single cell RNA and bulk exonic DNA mutation calls, and to account for differences in coverage between the methods, we also examine whether there is evidence of a given mutation using the reciprocal technology by performing a pileup at that mutation locus.

We can therefore classify mutations called using the above pipeline as:

  • True positive - The mutation has been called in both in the scRNAseq pipeline and CaVEMan.
  • True positive, pileup only - The mutation has been called in the scRNAseq pipeline, and there is evidence of the mutation in exome sequencing from tumour regions, with no mutations in the normal sample BAM files. The most common reasons for these mutations not being called by CaVEMan is low coverage or the mutation being called in mitochondrial DNA.
  • False positive - The mutation has been called in the scRNAseq pipeline, but there are fewer than 5 supporting reads for the variant base, and more than 20 reads for the reference base in the exome data.
  • False negative - The mutation has been called by CaVEMan from the exome data, and has not been called from the scRNAseq data, despite there being adequate coverage of at least 5 cells with the variant and at least 20 cells with the reference base.
  • Indeterminate - The mutation has been called by the scRNAseq pipeline, but there is not sufficient depth in the exome data to corroborate the call.

Note that it is possible that some of the false positive results may be real mutations that simply have not been captured spacially as adjacent tissue was sequenced. Overall, this scenario is unlikely as the majority of mutations are clonal and present throughout the tumour. The code used to run the above steps, including the pileup of scRNAseq and whole exome sequencing BAMS is provided here.

We show the classification of the mutations in the barplots below. The top plot shows all mutations, whereas the lower plot is restricted to exonic only mutations.

It is therefore trivial to compute an average precision and sensitivity of mutation calling, using whole exome sequencing data as the comparator:

Precision and sensitivity of scRNAseq pipeline as compared to whole exome sequencing
All mutations Exonic mutations only
Precision 0.6444444 0.7045455
Sensitivity 0.5272727 0.4920635

T cell clonotype restriction

In adult tumours, one expects a high proportion of somatic mutations in expanded CD8 positive T cells to have been acquired post thymic selection. Most called mutations should therefore be restricted to a single T cell receptor clonotype. By using identical metrics to those used to select mutations across all cell types, we can examine the proportion of CD8 positive T cell mutations that are restricted to a single clonotype. Again, in order to call a mutation, we use thresholds requiring at least 5 cells with the variant in the most prevalent clonotype, with a least 20 cells covering the reference allele in the other clonotypes.

## [1] "overall restricted fraction = 0.84"

Applications

Inferring the degree of clonal expansion in different cell types

We sought to investigate whether differences in the numbers of mutations expressed between the cell types present in the tumour micro-environment can be used to infer the degree of clonal expansion. The proportion of cells with one, two, three, or greater than three mutations from the final filtered set above was therefore calculated. To account for the lack of discriminatory power in rarer cell populations, we required at least 100 cells from each cell lineage and patient. As expected, the lineage with the highest number of cells expressing called mutations were the cancerous cells themselves, mainly explained by the known clonal structure of the lineage, but also due to the likelihood of increased tumour burden when compared to the normal cell types. For similar reasons, all of the normal cell types did not tend to have discernible numbers of cells with more than one called mutation. A surprisingly large number of Myeloid cells express mutations, indicating that a sizable proportion of these cells are clonally related. These are followed by fibroblasts, and CD8 positive T cells (which we know are clonally expanded based on the TCR sequencing results). A very small proportion of CD4 positive T cells expressed mutations, consistent with the low degree of clonality based on TCR analysis. In summary, it seems possible to infer clonality of different cell types from expressed scRNA seq data, and the results are consistent with other sequencing modalities.

Estimating the phylogenetic structure of cell types

We explored whether it might be possible to determine the broad phylogenetic structure of the different cell types. Unfortunately, the small number of called mutations, in combination with the low expression at these loci mean that formal methods to build phylogenetic structure cannot be used. One can look instead at the proportion of cells that express any mutations, and the number of cells that express increasing numbers of mutations to give a very rough account of the structure. These might range from predominantly clonal (RCC cells), through a continuum of clonality, to the polyclonal CD4 positive T cells subsets as depicted in the schematic below. Although it is impossible to estimate the relative length of the branches, or sizes of the expansion, we envisage that may appear similar to that below. 

Schematic of the phylogenies of trees for the main cell types

Schematic of the phylogenies of trees for the main cell types

Using known T cell clonality and population size to determine detection thresholds

TCR data from our CD8 positive T cells may be used to determine the relationship between the size of a clonal population and the likelihood of a cell expressing any mutations in that cell population. If such a relationship can be established, it may be possible therefore to infer the degree of clonality of other cell populations, where we do not have any prior information to the clonality. Called mutations in CD8 positive T cells are annotated with their TCR clonotype in order to calculate the proportion of cells that express a mutations. We plot the proportion of cells expressing a mutation with clonotype size to determine whether this proportion appears to reach an symptotic limit. The proportion of cells with an expressed mutation appears to reach a limit, at a population size of 200 cells reaching approximately 11% of cells that express a somatic mutation. We can sanity check this proportion by comparing to our known clonal renal cancer cells, where a mean proportion of 19% of cells express a mutation.  

Lineage tracing across cell states

Mutations expressed by cells across different cell states infer a common progenitor, and allow lineage tracing of cells as they mature from one cell state to another. Myeloid cells (in particular monocytes and macrophages), lend themselves to this approach as they exhibit significant phenotypic heterogeneity yet also appear to be the most clonal of all of the normal cell types. To assess the relationships between cell states (such as classical/ non-classical monocytes, tissue resident/ tumour-associated macrophages), a distance matrix was calculated, which represents the number of mutations that are expressed in any cell derived from exactly one of the two comparitor states as a fraction of the number of mutations that are present in any cell from either of the two comparitor states:

\[d = (Ac ⋂ B + A ⋂ Bc) / A ⋃ B \]

The distance matrix can be then used to directly show the developmental trajectory of myeloid cells as described in our paper.Â