de Novo Mutation Calling from scRNA seq Data

Background

The accurate detection of mutations within single cells from their transcriptomic sequences enables the inference of their clonal relationships. In cancerous tissues, reconstructing the clonal evolution of tumour cells is well established, principally through the genomic sequencing of relatively large numbers of cells from different tumour regions. Lineage tracing of normal cells is more challenging, however lymphocyte lineage tracing has more recently become routine through single cell 5’ RNA sequencing and enrichment of the T and B cell receptors. Determining lymphocyte clonality has enabled researchers to infer the degree of clonal expansion (thought principally via neo-epitope antigenic stimulation), and to study how cells mature from naivety through to exhaustion.

Methods such as Souporcell have been developed to genotype single cells from their transcriptomic readouts. Other methods, such as mgatk have demonstrated the ability to clonally trace single cells through using scATAC-seq and mtDNA mutations.

This methodology, which we have termed deSCeRNAMut (de novo Single Cell RNA Mutation), aims to accurately detect mutations present in different cell populations within a mixed cellular environment from droplet based single cell RNA sequencing data. The principal challenge is the lack of consistent coverage, and low depth of reads, alongside error-prone sequencing reads. To abrogate these challenges, this method relies upon the vast numbers of cells that can be sequenced through droplet based technologies, and the implausibility of shared post-embryonic mutations between different cell-type lineages.

Method

To demonstrate this method, we use transcriptomes generated using droplet based sequencing from Stromal and tumour cell evolution in clear cell renal cell carcinoma. The dataset provides a detailed insight into cellular heterogneity in 12 kidney tumours through multi-regional sequencing, with a large number of cells sequenced across different primary cell lineages.

Numbers of cells per lineage and patient
	Bcell	Endo	Epi	Fibro	Myeloid	NK	RCC	TCD4	TCD8
PD43824	1426	260	493	90	7040	9298	3070	8631	13042
PD43948	544	103	139	111	1839	2231	1873	2508	7185
PD44714	173	1	15	1	1132	271	0	1099	353
PD44966	976	48	4	6	3789	3214	453	4011	10129
PD44967	762	4	4	32	2875	1643	149	1884	1152
PD45814	916	239	102	343	2931	3428	289	5117	9589
PD45815	1459	929	673	168	5420	6546	3012	5060	7507
PD45816	373	2753	2191	870	5062	3026	2359	2408	20222
PD47171	661	46	68	32	6070	1553	2351	2414	23282
PD47172	1041	1651	566	1908	2817	2294	0	2201	5033
PD47465	207	8	9	6	3695	1747	267	2838	2826
PD47512	567	20	168	15	7933	1069	612	3739	6151

This page describes an overview for detecting and filtering putative somatic mutations from the above cells. We then benchmark the tumour cell calls by comparison to whole exome sequencing, and T cell calls by reference to whether they are restricted to cells with specific TCR clonotypes (i.e. acquired post thymic selection).

Initial variant calling

In order to call cell specific mutations, indexed BAM files from the cellranger pipeline were fist split into cell specific BAM files. We borrowed this code for the task. Cell specific BAM files were indexed using samtools.

Mutations were initially called using bcftools mpileup. The choice of mutation caller was primarily influenced by the need for high sensitivity calls of variants with few supporting reads, as demonstrated by this useful systematic comparison of single-nucleotide variant callers (Liu et al 2019).

Unsurprisingly, a huge number of mutations were called - with between 800,000 and 4,000,000 mutations called per patient. To facilitate more efficient downstream filtering of putative mutations, we perform the first filter step at this point:

Remove singlet variants only called in a single cell. The rationale is that is will be exceptionally difficult to accurately determine whether these mutations are real or artifact.
Removal of variants that are shared between the different cell lineages shown in the table above. The vast majority of somatic mutations are acquired post embryonic differentiation, and therefore any true degree of sharing is implausible.

After these steps, we are left with between 40,000 and 300,000 mutations per patient.

So far, we have generated a list of putative variant sites, but we are unaware how many variants may have been missed at each loci, and we have no information regarding reference calls at those loci. We therefore run alleleCount to generate count tables of each base for all cells at every putative patient-specific loci.

The code used to run the above steps is archived here.

Collation and annotation of counts

Reference and variant counts were collated for all of the loci called above to create a sparse matrix of counts for cell derived from every patient. In the absence of copy number variants, if an autosomal chromosome posses a true mutation, one expects an approximately equal number of reference and variant calls. The exception is for genes that exhibit a high degree of allelic specific expression, or that typically transcribe a particular allele in concentrated bursts. Alternatively, a high ratio of reference to variant counts in a cell base may imply artifact associated with high depth sequencing/ poorly mapped regions. A binomial filter (p<0.05) is therefore applied in each cell, with calls ignored in future analyses if there are significantly higher reference than variant counts.

Each genomic loci is annotated using ANNOVAR, the trinucleotide context of the variant, and any loci associated with previously documented RNA editing sites using DARNED and RADAR.

The number of cells containing either the reference and variant base are collated for:

The cell lineage with the greatest number of mutations.
All of the other cell lineages.
The TCR clonotype with the greatest numb.er of mutations.
All other TCR clonotypes

The Fisher’s exact test is used to compute whether there are proportionally greater numbers of mutations in the cell lineage/ clonotype with the greatest number of mutations:

Contingency table of counts for single mutation by broad cell lineage
	Number of CD8 T cells	Number of other cell types
Alternative base	21	0
Reference base	1256	6089

Contingency table of counts for single mutation by clonotype
	Number of cells with TRA:CAVTGGGADGLTF;TRB:CASSQGGGGADTQYF CDR3	Number of cells with other CDR3
Alternative base	9	11
Reference base	21	2683

An enrichment factor is also calculated for each mutation that represents the multiple of the increased prevalence in the predominant cell type compared to all others. The top few lines of the output table appears as such:

Top of mutation calls (examples above are the bottom line)
patient	chr	pos	ref	mut	tissue	mutTP	mutTO	refTP	refTO	pValT	clono	mutCP	mutCO	refCP	refCO	pValC	trinuc_sub	channel	AtoI	Func.refGene	Gene.refGene	ExonicFunc.refGene	cytoBand	ExAC_ALL	avsnp147	strand	enrich	Func.refGene.1	Gene.refGene.1	ExonicFunc.refGene.1	cytoBand.1	ExAC_ALL.1	avsnp147.1	strand.1
PD43824	1	137825	G	A	Myeloid	7	0	0	0	1.0000000		0	0	0	0	1.0000000	CGC>CAC	43	FALSE	ncRNA_exonic	LOC729737	.	1p36.33	.	rs147252685	-	NA	ncRNA_exonic	LOC729737	.	1p36.33	.	rs147252685	-
PD43824	1	182640	C	A	TCD4	2	0	18	39	0.1110462	TRA:CAMSALLGQNFVF;TRB:CSARVNNGEAFF	1	1	0	30	0.0625000	TCT>TAT	16	FALSE	ncRNA_exonic	LOC102725121	.	1p36.33	.	.	+	Inf	ncRNA_exonic	LOC102725121	.	1p36.33	.	.	+
PD43824	1	187153	T	C	Myeloid	25	36	55	114	0.2729017	TRA:CAASPANTGGFKTIF;TRB:CASSATVTGDTEAFF	1	13	0	61	0.1866667	CTG>CCG	71	FALSE	downstream	MIR6859-1;MIR6859-2;MIR6859-3;MIR6859-4	.	1p36.33	.	.	-	1.302083	downstream	MIR6859-1;MIR6859-2;MIR6859-3;MIR6859-4	.	1p36.33	.	.	-
PD43824	1	187497	G	A	Myeloid	15	28	28	60	0.8431712	TRA:CASQGGGGNKLTF;TRB:CASSGVRAEAFF	2	13	0	31	0.1014493	CGG>CAG	39	FALSE	downstream	MIR6859-1;MIR6859-2;MIR6859-3;MIR6859-4	.	1p36.33	.	.	-	1.096345	downstream	MIR6859-1;MIR6859-2;MIR6859-3;MIR6859-4	.	1p36.33	.	.	-
PD43824	1	200104	T	C	Epi	17	13	1	8	0.0232371	TRA:CAAQNNNAGNMLTF;TRA:CAVRGDMRF;TRB:CASNLWGGDTEAFF	1	4	0	4	1.0000000	GTC>GCC	74	FALSE	intergenic	MIR6859-1;FAM138D	.	1p36.33	.	.	-	1.525641	intergenic	MIR6859-1;FAM138D	.	1p36.33	.	.	-
PD43824	1	629937	A	G	TCD8	21	0	1256	6089	0.0000000	TRA:CAVTGGGADGLTF;TRB:CASSQGGGGADTQYF	9	11	21	2683	0.0000000	AAT>AGT	68	FALSE	upstream	LOC101928626	.	1p36.33	.	.	-	Inf	upstream	LOC101928626	.	1p36.33	.	.	-

The code to run the above steps has been archived here.

Final filter

We apply the following thresholds to filter all possible mutations

Fisher’s exact significance of enrichment by cell lineage, p < 0.0001 with proportionally at least 5 times greater mutations in the most enriched lineage.
Absence of any known single nucleotide polymorphisms from either ExAC or dbSNP.
No shared mutations between patients
Adequate coverage with at least 5 cells with variant base from the mutated cell lineage and at least 20 cells with reference base from the reference population

The discriminatory power of the above metrics in reducing mutational calls is examined through a Venn diagram. Unsurprisingly, the cell lineage enrichment metrics provide the strongest power in curating the dataset.

$\label{fig:figs} Venn diagram demonstrating the discriminatory power of the above filtering metrics$

Venn diagram demonstrating the discriminatory power of the above filtering metrics

We can also examine the trinucleotide context of the mutations pre- and post-filtering.

$\label{fig:figs} Trinucleotide context of called mutations pre- and post- filtering$

Trinucleotide context of called mutations pre- and post- filtering

Note is made of high levels of mutations that are otherwise unexplained from published catalogues of mutational signatures (particularly in a GCN>GGN and GTN>GGN context). By separating the trinucleotide context into the positive versus negative transcribed strand, we see differences that are otherwise unexplained by DNA derived mutational signatures, implying artifact either through library prep, sequencing, or RNA editing.

$\label{fig:figs} Trinucleotide context of filtered mutations by transcribed strand$

Trinucleotide context of filtered mutations by transcribed strand

A striking strand bias is demonstrated that cannot be accounted for by known mutational processes. Given the disparity between transcribed strands, mutations that have arisen with a highly biased context are removed (binomial filter, p < 0.005). The difference in the relative height of the removed mutations indicates that the majority of those removed are artefacts.

We finally remove all mutations that are clustered within 4 bases in a given patient, to yield the final mutation calls. The code to run the above steps is archived here

Benchmarking Data

Whole exome sequencing

Multi-regional whole exome sequencing data has been processed for tumour tissue adjacent to the regions that have undergone single cell RNA sequencing. The exonic mutations may therefore be used as a benchmark to determine the precision and sensitivity of the single cell mutation calling method above. To provide a fair comparison between single cell RNA and bulk exonic DNA mutation calls, and to account for differences in coverage between the methods, we also examine whether there is evidence of a given mutation using the reciprocal technology by performing a pileup at that mutation locus.

We can therefore classify mutations called using the above pipeline as:

True positive - The mutation has been called in both in the scRNAseq pipeline and CaVEMan.
True positive, pileup only - The mutation has been called in the scRNAseq pipeline, and there is evidence of the mutation in exome sequencing from tumour regions, with no mutations in the normal sample BAM files. The most common reasons for these mutations not being called by CaVEMan is low coverage or the mutation being called in mitochondrial DNA.
False positive - The mutation has been called in the scRNAseq pipeline, but there are fewer than 5 supporting reads for the variant base, and more than 20 reads for the reference base in the exome data.
False negative - The mutation has been called by CaVEMan from the exome data, and has not been called from the scRNAseq data, despite there being adequate coverage of at least 5 cells with the variant and at least 20 cells with the reference base.
Indeterminate - The mutation has been called by the scRNAseq pipeline, but there is not sufficient depth in the exome data to corroborate the call.

Note that it is possible that some of the false positive results may be real mutations that simply have not been captured spacially as adjacent tissue was sequenced. Overall, this scenario is unlikely as the majority of mutations are clonal and present throughout the tumour. The code used to run the above steps, including the pileup of scRNAseq and whole exome sequencing BAMS is provided here.

We show the classification of the mutations in the barplots below. The top plot shows all mutations, whereas the lower plot is restricted to exonic only mutations.

It is therefore trivial to compute an average precision and sensitivity of mutation calling, using whole exome sequencing data as the comparator:

Precision and sensitivity of scRNAseq pipeline as compared to whole exome sequencing
	All mutations	Exonic mutations only
Precision	0.6444444	0.7045455
Sensitivity	0.5272727	0.4920635

T cell clonotype restriction

In adult tumours, one expects a high proportion of somatic mutations in expanded CD8 positive T cells to have been acquired post thymic selection. Most called mutations should therefore be restricted to a single T cell receptor clonotype. By using identical metrics to those used to select mutations across all cell types, we can examine the proportion of CD8 positive T cell mutations that are restricted to a single clonotype. Again, in order to call a mutation, we use thresholds requiring at least 5 cells with the variant in the most prevalent clonotype, with a least 20 cells covering the reference allele in the other clonotypes.

## [1] "overall restricted fraction = 0.84"

Applications

Inferring the degree of clonal expansion in different cell types

We sought to investigate whether differences in the numbers of mutations expressed between the cell types present in the tumour micro-environment can be used to infer the degree of clonal expansion. The proportion of cells with one, two, three, or greater than three mutations from the final filtered set above was therefore calculated. To account for the lack of discriminatory power in rarer cell populations, we required at least 100 cells from each cell lineage and patient. As expected, the lineage with the highest number of cells expressing called mutations were the cancerous cells themselves, mainly explained by the known clonal structure of the lineage, but also due to the likelihood of increased tumour burden when compared to the normal cell types. For similar reasons, all of the normal cell types did not tend to have discernible numbers of cells with more than one called mutation. A surprisingly large number of Myeloid cells express mutations, indicating that a sizable proportion of these cells are clonally related. These are followed by fibroblasts, and CD8 positive T cells (which we know are clonally expanded based on the TCR sequencing results). A very small proportion of CD4 positive T cells expressed mutations, consistent with the low degree of clonality based on TCR analysis. In summary, it seems possible to infer clonality of different cell types from expressed scRNA seq data, and the results are consistent with other sequencing modalities.

Estimating the phylogenetic structure of cell types

We explored whether it might be possible to determine the broad phylogenetic structure of the different cell types. Unfortunately, the small number of called mutations, in combination with the low expression at these loci mean that formal methods to build phylogenetic structure cannot be used. One can look instead at the proportion of cells that express any mutations, and the number of cells that express increasing numbers of mutations to give a very rough account of the structure. These might range from predominantly clonal (RCC cells), through a continuum of clonality, to the polyclonal CD4 positive T cells subsets as depicted in the schematic below. Although it is impossible to estimate the relative length of the branches, or sizes of the expansion, we envisage that may appear similar to that below.

Schematic of the phylogenies of trees for the main cell types

Using known T cell clonality and population size to determine detection thresholds

TCR data from our CD8 positive T cells may be used to determine the relationship between the size of a clonal population and the likelihood of a cell expressing any mutations in that cell population. If such a relationship can be established, it may be possible therefore to infer the degree of clonality of other cell populations, where we do not have any prior information to the clonality. Called mutations in CD8 positive T cells are annotated with their TCR clonotype in order to calculate the proportion of cells that express a mutations. We plot the proportion of cells expressing a mutation with clonotype size to determine whether this proportion appears to reach an symptotic limit. The proportion of cells with an expressed mutation appears to reach a limit, at a population size of 200 cells reaching approximately 11% of cells that express a somatic mutation. We can sanity check this proportion by comparing to our known clonal renal cancer cells, where a mean proportion of 19% of cells express a mutation.

Lineage tracing across cell states

Mutations expressed by cells across different cell states infer a common progenitor, and allow lineage tracing of cells as they mature from one cell state to another. Myeloid cells (in particular monocytes and macrophages), lend themselves to this approach as they exhibit significant phenotypic heterogeneity yet also appear to be the most clonal of all of the normal cell types. To assess the relationships between cell states (such as classical/ non-classical monocytes, tissue resident/ tumour-associated macrophages), a distance matrix was calculated, which represents the number of mutations that are expressed in any cell derived from exactly one of the two comparitor states as a fraction of the number of mutations that are present in any cell from either of the two comparitor states:

\[d = (Ac ⋂ B + A ⋂ Bc) / A ⋃ B \]

The distance matrix can be then used to directly show the developmental trajectory of myeloid cells as described in our paper.