Parents of identical twins cannot emphasize enough how different their children are despite their striking resemblance to each other. Now extend that idea right down to the level of the single cell. What if, despite their uncanny similarities, each cell is vastly different from its neighbor?
Thanks to incredible advances in the field of genomics, it is now possible for scientists to collect gene-expression data from individual cells. In single-cell RNA sequencing, however, data is often gathered from multiple experiments conducted by different personnel using different methods, reagents, equipment and platforms.
All of these minute experimental differences add up and can lead to large variations in the data, known as batch effects. Correcting for batch effects therefore helps align different datasets while preserving key biological variation.
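To make the idea concrete, here is a minimal toy sketch of a batch effect and a naive correction. Everything in it is an assumption for illustration: the expression values are invented, the cell-type labels `type_X` and `type_Y` are hypothetical, and per-batch mean-centering is a deliberately simple stand-in rather than any of the algorithms compared in the study. It also assumes both batches contain the same mix of cell types, which real datasets often do not.

```python
from statistics import mean

# Invented expression values for a single gene (illustration only).
# Both batches measure the same two hypothetical cell types, but batch_2
# carries a constant technical offset of +5 on every measurement.
batch_1 = {"type_X": [1.0, 1.2, 0.8], "type_Y": [3.0, 3.2, 2.8]}
batch_2 = {"type_X": [6.0, 6.2, 5.8], "type_Y": [8.0, 8.2, 7.8]}


def center(batch):
    """Subtract the batch-wide mean from every value in the batch."""
    all_values = [v for values in batch.values() for v in values]
    batch_mean = mean(all_values)
    return {
        cell_type: [v - batch_mean for v in values]
        for cell_type, values in batch.items()
    }


corrected_1 = center(batch_1)
corrected_2 = center(batch_2)

# Gap between batches for the same cell type, before and after centering.
raw_gap = mean(batch_2["type_X"]) - mean(batch_1["type_X"])
corrected_gap = mean(corrected_2["type_X"]) - mean(corrected_1["type_X"])

# Biological difference between cell types within one corrected batch.
biological_gap = mean(corrected_1["type_Y"]) - mean(corrected_1["type_X"])
```

In this toy case the technical gap between batches shrinks to roughly zero after centering, while the difference between the two cell types is left intact. Real batch effects are rarely a single constant offset, which is part of why more sophisticated integration methods are needed.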
“If not corrected, batch effects can introduce false signals while masking the underlying biological differences that we are interested in,” explained Jinmiao Chen, a Principal Investigator at A*STAR’s Singapore Immunology Network (SIgN). Chen was the corresponding author on a study that compared 14 state-of-the-art algorithms to determine the most suitable method for correcting batch-specific variations.
The algorithms were tested on ten biological datasets, covering diverse cell types such as dendritic cells, pancreatic cells, retinal cells and peripheral blood mononuclear cells, with datasets from both human and mouse samples. The datasets were collected using a range of RNA-sequencing technologies, namely 10x, SMART-seq, Drop-seq and SMARTer.
Based on five evaluation scenarios—ranging from identical cell types with different technologies, to non-identical cell types, multiple batches, big data and simulated data—the researchers found no superior algorithm among the 14 tested, as each had its strengths and weaknesses.
That being said, Harmony, LIGER, and Seurat 3 were the top three recommendations for batch integration based on rank-sum scores of performance across ten datasets. All three methods were able to complete runs on the large datasets, making them valuable as datasets grow in size and complexity.
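A rank-sum score of this kind can be sketched as follows. This is only an illustration under stated assumptions, not the study's actual scoring procedure: the method names, dataset names and scores are all invented, a higher score is assumed to mean better performance, and ties are not handled specially. The function `rank_sum` is a hypothetical helper written for this example.

```python
def rank_sum(scores):
    """Sum each method's per-dataset rank (1 = best) across all datasets.

    scores: {dataset_name: {method_name: score}}, higher score = better.
    Returns {method_name: summed_rank}; a lower total is better overall.
    """
    totals = {}
    for per_method in scores.values():
        # Order methods from best to worst on this dataset.
        ordered = sorted(per_method, key=per_method.get, reverse=True)
        for rank, method in enumerate(ordered, start=1):
            totals[method] = totals.get(method, 0) + rank
    return totals


# Invented scores for three placeholder methods on three placeholder datasets.
example_scores = {
    "dataset_1": {"method_A": 0.90, "method_B": 0.70, "method_C": 0.50},
    "dataset_2": {"method_A": 0.60, "method_B": 0.80, "method_C": 0.40},
    "dataset_3": {"method_A": 0.85, "method_B": 0.80, "method_C": 0.30},
}

totals = rank_sum(example_scores)

# Overall ordering, best (lowest rank sum) first.
overall = sorted(totals, key=totals.get)
```

Summing ranks rather than raw scores keeps any single dataset from dominating the comparison, since each dataset contributes the same range of ranks regardless of how its metric is scaled.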
Due to its significantly shorter runtime, Harmony was recommended as the first method to try when dealing with large datasets. Conversely, ComBat, MMD-ResNet and limma were ranked the worst-performing methods overall.
“With the continued advancements in single-cell technologies, it will be necessary to identify more efficient and effective methods capable of scaling up in terms of the number of cells and batches,” Chen said.
The hallmark of an excellent algorithm, Chen noted, is a fine balance between superior batch integration and the ability to operate within the constraints of the available computational resources.
The A*STAR-affiliated researchers contributing to this research are from the Singapore Immunology Network (SIgN),