Few microbes are as well known to science as Escherichia coli. Once considered a harmless bacterial resident of animal and human guts, E. coli is now known to be a vast collection of over 700 strains, some of which pose threats to human health. If you’ve ever had a severe bout of food poisoning, chances are you’ve met a pathogenic member of the species.
According to Frank Eisenhaber, a Senior Fellow at A*STAR’s Genome Institute of Singapore (GIS), the secrets behind an E. coli strain’s disease-causing potential lie hidden in its genome, which holds 4,000 to 5,500 genes. While researchers have historically relied on the genome sequence of a ‘typical’ E. coli strain to represent the species, this approach can mask subtle differences between benign and disease-causing strains.
“E. coli strains have enormous genomic and mutational diversity; only a few hundred gene families are shared among all of them,” said Eisenhaber. “A single reference genome can’t completely represent that diversity.”
To create a more comprehensive reference, Eisenhaber and colleagues from GIS and A*STAR’s Bioinformatics Institute (BII) built an E. coli pangenome using computational tools and the publicly available sequences of 1,324 complete strain genomes. The team created a systematic map of over 25,000 E. coli gene families, unlocking new insights on the evolutionary history, adaptability and functional diversity of the species.
“To date, our E. coli pangenome study is by far the largest in terms of the number of complete E. coli genomes included,” said Eisenhaber.

The distribution of 1,324 sequenced E. coli genomes across eight E. coli phylogroups, showing the proportion of genomes in each phylogroup that fell into one of four virulence categories. Based on the total number of virulence factors (VFs) identified in each genome, the team classified the genomes as non-pathogenic (<6 VFs), likely virulent (6–14 VFs), highly virulent (14–22 VFs) or very highly virulent (>22 VFs).
©️ A*STAR Research
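As a rough illustration of the four-way classification described in the caption, the thresholds can be written as a small Python function. This is only a sketch, not the team's actual pipeline: the function name `classify_virulence` is hypothetical, and because the caption's ranges share the boundary values 14 and 22, the code assumes each boundary value belongs to the lower category.

```python
def classify_virulence(vf_count: int) -> str:
    """Map a genome's virulence factor (VF) count to one of four categories.

    Assumption: the caption's ranges overlap at 14 and 22, so this sketch
    assigns those boundary values to the lower category.
    """
    if vf_count < 0:
        raise ValueError("VF count cannot be negative")
    if vf_count < 6:
        return "non-pathogenic"
    if vf_count <= 14:
        return "likely virulent"
    if vf_count <= 22:
        return "highly virulent"
    return "very highly virulent"


# Hypothetical VF counts, for illustration only (not data from the study).
for count in (3, 10, 18, 30):
    print(count, classify_virulence(count))
```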
The team found that a set of around 3,000 gene families made up a stable ‘softcore’ genome: one shared by at least 95 percent of E. coli strains. There were also three divergent groups of strains (phylogroups) with distinct genetic profiles—B1, B2 and E—which had acquired specialised functions. For example, phylogroup B2 had multiple genes to efficiently acquire iron, an important nutrient for survival.
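The softcore definition, gene families shared by at least 95 percent of strains, can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions rather than the method used in the study: the function `softcore_families`, the strain names and the family names are all hypothetical, and each genome is assumed to be already reduced to a set of gene family identifiers.

```python
from collections import Counter


def softcore_families(genomes: dict[str, set[str]], threshold: float = 0.95) -> set[str]:
    """Return the gene families present in at least `threshold` of the genomes.

    `genomes` maps a strain name to the set of gene families found in it.
    """
    if not genomes:
        return set()
    counts = Counter()
    for families in genomes.values():
        counts.update(families)  # each family counts once per genome
    total = len(genomes)
    return {family for family, n in counts.items() if n / total >= threshold}


# Toy data (hypothetical): 20 strains, with famA in all of them,
# famB in 19 (95 percent) and famC in 10 (50 percent).
toy = {f"strain_{i}": {"famA"} for i in range(20)}
for i in range(19):
    toy[f"strain_{i}"].add("famB")
for i in range(10):
    toy[f"strain_{i}"].add("famC")

print(sorted(softcore_families(toy)))
```

In this toy set, famA and famB meet the 95 percent cut-off while famC does not. Raising `threshold` to 1.0 would leave only the families shared by every strain.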
Curiously, the team also noticed that the ST131 strain from phylogroup B2 had viral DNA integrated into its genome, allowing it to produce tailocin: a distinctive protein structure used by bacteriophages to ‘pop’ bacterial membranes. This suggests that ST131’s rising dominance in global disease outbreaks may be partly due to this uniquely lethal weapon, which can kill closely related bacterial neighbours in a host.
These results present exciting new angles that challenge long-held beliefs about bacterial virulence. “So far, virulence factors were almost exclusively seen as tools for undermining host defences. The interbacterial competition for access to the host was never in the spotlight,” said Eisenhaber. He added that these findings can open up a new possibility: the engineering of ‘good’ bacteria to safely destroy ‘bad’ bacteria as an alternative to antibiotics.
With help from Lars Jensen of the University of Copenhagen, Denmark, Eisenhaber’s team published a follow-up paper that mapped the existing literature on E. coli gene families and biomolecular functions to their pangenome, revealing that many of its genetic secrets remain unexplored. The team noted it may take up to 30 years for the scientific community to fully characterise the gene functions of the E. coli softcore genome, a painstaking but necessary effort to shed light on an iconic species.
The A*STAR-affiliated researchers contributing to this research are from the Genome Institute of Singapore (GIS) and the Bioinformatics Institute (BII).