Field of Science

#Microtwjc A Salmonella of Doubt?

The microbiology twitter journal club beckons, and it is now time to wade through the latest paper, on Salmonella, and how to spot different subtypes of this bacterium. This allows us to get a look at the recent evolutionary history of the bacteria, and to track new subtypes as they arise. In the past, these new subtypes were tracked and analysed through serotyping.

So how does serotyping work?
It essentially relies on antisera detecting proteins or sugars (antigens) on the surface of the bacteria. The antibodies in each antiserum are specific for particular antigens, like the H and O antigens, so you can use different antisera to detect each of these antigens.
The expression of these antigens differs between subtypes of Salmonella, so you can identify different strains based on whether or not they have these antigens.
When the bacteria are mixed with an antiserum, the antibodies within bind to their targets and, because each antibody can bind more than one bacterium, link the cells together. So a positive reaction will cause the bacteria to stick to each other, or agglutinate.
An example of the reaction is shown below.

This reaction is a fairly quick way of identifying these different types of Salmonella.
So in the past, different subspecies of Salmonella have been classed into groups based on the way they serotype. These groups are known as serotypes or serovars.

Figure 1

The first figure essentially paints a picture of how Salmonella types are classified from the species level down to the serovar level, with an example of how the species is divided at each level.
To define a species, DNA hybridisation and MLST are used, although biochemistry and serology are more commonly used.
Strains can then be differentiated by the diseases they cause. Once the disease is identified, antisera for different O and H antigens can also be used to identify different serovars.
One can look at this figure and perhaps draw the assumption that certain serovars are tied to specific diseases. Indeed, this assumption is widespread: if you track one serovar as it spreads through a community, you will consequently track the outbreak of a specific type of disease, such as gastroenteritis.

However, if we take a closer look at this assumption, cracks appear. Serovars are classified by their surface antigens, primarily the O and H antigens. What are these antigens? O is lipopolysaccharide, a component of the bacterial cell wall, and H is the flagella, which act as propellers for bacterial movement. These components differ enough between serotypes to allow them to be told apart, but are they accurate markers of disease? Is it possible for two Salmonella subtypes with different evolutionary ancestry to appear as one serotype through the sheer luck of having the same combination of antigens? Does each of these serotypes represent just one genetic line?

Could we track virulence genes instead? Whilst this is tempting, as virulence genes often have direct effects on disease pathogenesis, there is a problem: many virulence genes are carried by phages, viruses that infect bacteria and can bring foreign DNA sequences with them. These phages can transfer genes between different strains of bacteria through a process known as horizontal gene transfer. In fact, it's possible that the genes for the H and O antigens themselves can be transferred between different bacteria.

The truth is that you need sequences from multiple locations on the bacterial genome. This allows scientists to characterise the evolutionary history of the Salmonella more fully than they could from any one gene. This was done by Multi-Locus Sequence Typing (MLST).
How does MLST work for Salmonella?
Seven different genes, aroC, dnaN, hemD, hisD, purE, sucA and thrA, had previously been used to analyse the heritage of Salmonella Typhi. Only small fragments of these genes were needed to differentiate subtypes of this bacterium.
As a subtype of Salmonella evolves, the sequences of these seven genes change. Through these alterations, it's possible to look at multiple subtypes and deduce how closely related they are to each other.
Strains with exactly the same sequences for all seven genes are assigned to the same Sequence Type (ST).
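
To make the idea of a Sequence Type concrete, here is a minimal sketch in Python. It assumes each strain has already been reduced to one allele number per gene; the isolate names, the allele numbers and the ST numbers it hands out are all invented for illustration, and this is not how the MLST database itself assigns STs.

```python
from typing import Dict, Tuple

# The seven MLST genes named in the post, in a fixed order.
LOCI = ("aroC", "dnaN", "hemD", "hisD", "purE", "sucA", "thrA")

Profile = Tuple[int, ...]  # one allele number per gene, in LOCI order


def assign_sequence_types(profiles: Dict[str, Profile]) -> Dict[str, int]:
    """Give every isolate an ST number; identical profiles share one ST."""
    st_of_profile: Dict[Profile, int] = {}
    st_of_isolate: Dict[str, int] = {}
    for isolate, profile in profiles.items():
        if len(profile) != len(LOCI):
            raise ValueError(f"{isolate}: expected {len(LOCI)} alleles")
        # A profile seen for the first time gets the next free (made-up) number.
        if profile not in st_of_profile:
            st_of_profile[profile] = len(st_of_profile) + 1
        st_of_isolate[isolate] = st_of_profile[profile]
    return st_of_isolate


# Hypothetical isolates and allele numbers, invented for this example.
example_profiles = {
    "isolate_A": (10, 7, 12, 9, 5, 9, 2),
    "isolate_B": (10, 7, 12, 9, 5, 9, 2),  # same profile as isolate_A
    "isolate_C": (10, 7, 12, 9, 5, 9, 3),  # differs at thrA only
}
sequence_types = assign_sequence_types(example_profiles)
```

Here isolate_A and isolate_B land in the same Sequence Type, while isolate_C gets its own because a single gene differs.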
The eBURST algorithm was then used to find the most closely related Sequence Types and assign them to a group. This algorithm groups Sequence Types together if they are identical at six of the seven genes.
These groups were called eBURST groups (eBGs), after the algorithm used to define them
(in the same way that serotypes are named after the process of serotyping).
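
The six-out-of-seven rule can be sketched as a simple linking exercise. The Python below is my own simplified illustration, not the eBURST program: it just chains together any Sequence Types that share at least six genes, and the ST names and allele numbers are made up.

```python
from itertools import combinations
from typing import Dict, List, Set, Tuple

Profile = Tuple[int, ...]  # one allele number per MLST gene


def shared_loci(a: Profile, b: Profile) -> int:
    """Count the genes at which two profiles carry the same allele."""
    return sum(x == y for x, y in zip(a, b))


def group_sequence_types(sts: Dict[str, Profile], min_shared: int = 6) -> List[Set[str]]:
    """Chain together STs that share at least `min_shared` genes."""
    parent = {name: name for name in sts}

    def find(name: str) -> str:
        # Follow parent links to the group representative (with path halving).
        while parent[name] != name:
            parent[name] = parent[parent[name]]
            name = parent[name]
        return name

    for a, b in combinations(sts, 2):
        if shared_loci(sts[a], sts[b]) >= min_shared:
            parent[find(a)] = find(b)  # merge the two groups

    groups: Dict[str, Set[str]] = {}
    for name in sts:
        groups.setdefault(find(name), set()).add(name)
    return sorted(groups.values(), key=lambda g: sorted(g))


# Hypothetical Sequence Types, invented for this example.
example_sts = {
    "ST_a": (1, 1, 1, 1, 1, 1, 1),
    "ST_b": (1, 1, 1, 1, 1, 1, 2),  # shares 6/7 with ST_a
    "ST_c": (1, 1, 1, 1, 1, 3, 2),  # shares 6/7 with ST_b, only 5/7 with ST_a
    "ST_d": (5, 5, 5, 5, 5, 5, 5),  # unrelated
}
groups = group_sequence_types(example_sts)
```

Note that ST_c ends up in the same group as ST_a even though they share only five genes, because ST_b links them. That chaining is one reason a grouping like this is better at close relationships than distant ones.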
When ten or more different Salmonella strains were found to belong to the same Sequence Type, that Sequence Type was upgraded to an eBURST group. This adds a layer of complication, as it means that an eBURST group, which is a collection of related Sequence Types, could itself contain another eBURST group if one of those Sequence Types included ten or more strains.
If Sequence Types that matched an eBG at only five genes also shared its serovar, the authors allowed those Sequence Types to be grouped in with the eBG.
Using these methods, they managed to distil 3,550 separate Salmonella strains, isolated from around the world, into 138 eBURST groups.
It should be noted that whilst the eBURST algorithm is great at spotting close relationships between strains, it is somewhat less trustworthy for more distant relationships.

Figure 2

This diagram shows a minimal spanning tree. Each Sequence Type is shown as a dot, and the distance between dots reflects how closely related the Sequence Types are. Thick lines link dots that share six of the seven genes, and thin lines link dots that share five.
The colour codes indicate the serovar to which each Sequence Type belongs (with white being unidentified).
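
For anyone wondering how such a tree gets built, here is a rough Python sketch using Prim's algorithm, with the number of differing genes as the distance. It is an illustration under my own assumptions (invented ST names and allele numbers, ties broken alphabetically), not the software the authors used to draw the figure.

```python
from typing import Dict, List, Tuple

Profile = Tuple[int, ...]  # one allele number per MLST gene


def allele_distance(a: Profile, b: Profile) -> int:
    """Number of genes at which two profiles differ (0 to 7 for MLST)."""
    return sum(x != y for x, y in zip(a, b))


def minimal_spanning_tree(sts: Dict[str, Profile]) -> List[Tuple[str, str, int]]:
    """Prim's algorithm; returns edges as (st_in_tree, new_st, distance)."""
    names = sorted(sts)
    if not names:
        return []
    in_tree = [names[0]]
    remaining = set(names[1:])
    edges: List[Tuple[str, str, int]] = []
    while remaining:
        # Cheapest link from the tree to an outside ST; ties go to the
        # alphabetically first pair because tuples compare element by element.
        dist, src, dst = min(
            (allele_distance(sts[s], sts[d]), s, d)
            for s in in_tree
            for d in remaining
        )
        edges.append((src, dst, dist))
        in_tree.append(dst)
        remaining.remove(dst)
    return edges


# Hypothetical Sequence Types, invented for this example.
example_sts = {
    "ST_a": (1, 1, 1, 1, 1, 1, 1),
    "ST_b": (1, 1, 1, 1, 1, 1, 2),
    "ST_c": (1, 1, 1, 1, 1, 3, 2),
    "ST_d": (5, 5, 5, 5, 5, 5, 5),
}
tree_edges = minimal_spanning_tree(example_sts)
```

In the figure's terms, an edge with distance 1 would be drawn as a thick line (six of seven genes shared) and an edge with distance 2 as a thin line.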
But as I said before, eBURST grouping is not the be-all and end-all here, and it is always preferable to use multiple lines of evidence to double-check the validity of the results. So the authors used three methods from previous work to re-check the results shown above. I should point out that all of these techniques are completely new to me; I have tried to simplify the descriptions and emphasise the differences between the models as best I can.

  1.  ClonalFrame was used. This works slightly differently from eBURST, by taking into account the genes which do not undergo recombination, and treating those as a way of looking at how the bacteria are related.
  2.  Gene-by-gene bootstrap analysis. How does it work? This is based on the molecular clock: each gene is subject to mutation over time, and the accumulated mutations can be used to trace the genealogy of each gene. Moreover, the changes within each of these genes can make some homologous recombination events more likely than others, and these in themselves can be used to determine how each Sequence Type evolves. This can be used to generate a UPGMA tree.
  3.  Bayesian Analysis of Population Structure (BAPS) was the final check they used against their data. This model uses Bayesian inference to examine the question of whether a population of bacteria consists of subgroups which have genetically drifted apart.
Each of these methods was used to generate clusters of related subtypes, and these were mapped against the eBGs defined above to check whether they were in agreement.
This produced the next figure:

Figure 3

This Venn diagram shows the overlap and the differences between the three models. The models are in agreement for 108 clusters, and BAPS and ClonalFrame have a fairly large overlap of 21 clusters. Unfortunately, I am not yet familiar enough with any of the three models and their construction to understand why there is more overlap between BAPS and ClonalFrame than with gene-by-gene bootstrapping.
Nor can I say whether agreement between these three models naturally reflects the state of nature, or whether the models share the same base assumptions and simply approach them from different directions. Considering that they are all based on the same data, the latter conclusion could easily be drawn.
Nevertheless, even with those thoughts taken into account, the agreement between these different groupings is very convincing, and helps ensure that any weakness in one model is compensated for by the strength of another.

So with this done, we can look back at Figure 2 and view the relationships shown in it with more confidence. What becomes clearer is that the relationship between serotype and eBG is not as clear-cut as one would first assume. It looks like serovars such as Newport, Paratyphi and Oranienburg encompass at least three eBG clusters each.

Figure 4

This figure focuses on Salmonella Typhimurium, which corresponds to eBG1 in Figure 2. Typhimurium has serotype variants, which don't express some of the antigens seen in other strains. These are the monophasic strains (in red) and the non-motile/rough strains (in brown).
This is where things get interesting.
Each circle represents a Sequence Type, in which all seven of the genes tested were identical. We can see that the biggest one includes a large chunk of monophasic variants. In fact, monophasic variants never seem to form their own cluster. Previous work suggests that the "monophasic" phenotype in these cases must arise through multiple unrelated genetic events, and so doesn't really form its own independent subtype.
Also, these data suggest that Typhimurium itself comprises more than one eBURST group, as the smaller eBG138 shares only three genes with the larger eBG1 cluster.
The serotypes Hato, Kunduchi and Farsta are not generally associated with Typhimurium, so seeing some of them group in the same cluster is unexpected. This suggests that serotyping is unreliable, although that blade can cut both ways.

Figure 5

So now we must look at eBG4, which mostly corresponds serologically to Enteritidis, although there is some relationship to Gallinarum and Gallinarum var. Pullorum.
eBG53 seems to enclose most of the Dublin strains.
But eBG93 appears to serotype as both Dublin and Enteritidis, suggesting that it may be related to both. eBG32 is slightly more diverse, containing both of the previous serovars plus Paratyphi B var. Java (monophasic).
This was interesting enough that the authors delved a little deeper, sequencing the antigen genes of ST74 (which is in eBG32). The gene responsible for this difference is fliC. If the FliC protein has Ala220* and Thr315, then the strain will test positive as Enteritidis. If it has Ala318, then it tests positive for Dublin. If it has all three, then it will test positive for both. Salmonella in ST74 test positive for both. Whether they were assigned to Dublin or to Enteritidis was therefore likely down to chance, depending on the lab in which they were first identified.
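
The FliC rule described above is simple enough to write down as a few lines of Python. This is only a restatement of the three positions mentioned in this post, not a real serotype predictor, and the "Val" residue in the second example is an invented placeholder.

```python
from typing import Dict, Set


def predicted_flic_reactions(residues: Dict[int, str]) -> Set[str]:
    """Map FliC residues at positions 220, 315 and 318 to the expected serotyping calls."""
    reactions: Set[str] = set()
    # Ala220 together with Thr315 gives the Enteritidis reaction.
    if residues.get(220) == "Ala" and residues.get(315) == "Thr":
        reactions.add("Enteritidis")
    # Ala318 gives the Dublin reaction.
    if residues.get(318) == "Ala":
        reactions.add("Dublin")
    return reactions


# A strain carrying all three residues, as described for ST74.
st74_like = {220: "Ala", 315: "Thr", 318: "Ala"}
# Hypothetical strain: "Val" at 318 is a made-up placeholder residue.
enteritidis_like = {220: "Ala", 315: "Thr", 318: "Val"}

both = predicted_flic_reactions(st74_like)
only_enteritidis = predicted_flic_reactions(enteritidis_like)
```

A strain with all three residues returns both names, which is the ambiguity that left ST74 isolates labelled as one serovar or the other.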

Also shown in this figure are various eBGs which serotype as Paratyphi B; if they can digest d-tartrate, they are classified as Java. These two serotypes are believed to cause different diseases. However, each is spread across multiple eBGs. D-tartrate digestion is controlled by a single nucleotide change, and this work suggests that the mutation can arise multiple times. Further analysis suggests that virulence factors can vary between eBGs, as can the presence of antibiotic resistance.
What these data show is that even within one sequence type, there can be traits that vary between strains, and these need to be investigated with further sequence typing.

Figure 6

This figure shows the structure of the 6,7:c:1,5 strains, which are not as well characterised as the ones previously described, mainly because they primarily occur in Asia rather than Europe and the US.
The serotypes are mainly distinguished by the way in which they digest tartrate and dulcitol, as these traits seem to correlate with differences in the diseases caused by the bacteria.
Many of these diseases affect swine (suis is Latin for "of the pig", hence Choleraesuis = cholera in swine, Typhisuis = typhoid in swine), whereas Paratyphi C is associated with disease in humans.
Again, the authors note that they had to overcome some difficulties with serotyping, but most of these sequence types map to the closely related eBGs 6 and 20. The main exceptions are strains that serotype as Decatur, which seem to be unrelated to each other as well as to the main complex, calling into question its definition as a type in itself.

But this raises an interesting question: how did these unrelated strains come to have the same serotype?
The answer lies in the flagella, which are the targets of serotyping in this instance. The authors examined the full flagellar genes for each of these strains and subjected them to analysis.
In addition, they ran BLAST searches against genome databases to pull in similar genes, in order to work out what it was about the Decatur types that separates them from all of the others.
What they found suggested that these flagella are actually related to each other, despite the seven genes used for multilocus sequence typing being different. This suggests that a horizontal gene transfer event has occurred, at least for the FliC flagellar protein. The FljB protein, however, varies a lot, and could lead to serotyping confusion.
An interesting wrinkle in these data is the nature of the mutations seen in the Decatur group.
With most mutations, the ones that are favoured are generally those which don't affect the structure of the protein, i.e. that don't change the amino acids. But in Decatur, changes that alter the amino acids (relative to the standard sequence) seem to have occurred.
Were these changes due to some form of evolutionary adaptation?
To test this they looked at ω. This measurement tells us about the accumulation of mutations which change the protein sequence relative to mutations which do not, and we can compare it with other genes which may not be subject to the same evolutionary pressures to check whether this specific gene is undergoing selection.
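
For reference, ω is conventionally written as the ratio of two rates. The form below is the standard textbook definition; I am assuming the authors used something equivalent, and I don't know which estimation method they applied.

```latex
\omega = \frac{d_N}{d_S},
\qquad
d_N = \frac{\text{nonsynonymous substitutions}}{\text{nonsynonymous sites}},
\qquad
d_S = \frac{\text{synonymous substitutions}}{\text{synonymous sites}}
```

As a rule of thumb, ω well below 1 points to selection against amino acid changes, ω near 1 to neutral change, and ω above 1 to selection in favour of amino acid changes.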

Whilst the ω seemed to be different, it was within the range of variation seen in the seven genes used for MLST, suggesting that any selective pressures affecting the flagella genes are affecting these seven housekeeping genes as well.


This paper gave me a lot to chew on whilst reading it. It touches on multiple facets of salmonella epidemiology, as well as MLST and serotyping. It's big, it's ambitious and it's thorough. Moreover, towards the end of the paper, you could feel the joy of discovery here. 
I mean, it wasn't enough for them to simply create their eBURST groups, look at the serotypes and say that their technique was better. They made a concerted effort with Dublin, Enteritidis and Decatur to explain why these serotypes may be different, elevating this above a simple re-classification exercise and turning it into interesting science.
I also have to give props to them for not taking their eBGs at face value, and subjecting them to alternative methods of classification to demonstrate their robustness. I really got the impression (as someone who isn't very familiar with MLST) that these data were scrutinized to the limits that current techniques allow.
The major problem in trying to look at the development of Salmonella is horizontal gene transfer. The techniques they use minimise some of the problems this causes by looking at multiple genes simultaneously.

Now, I have a confession to make. I don't often comment on the discussion sections of papers. For good papers, the discussion doesn't usually add anything that the results themselves don't implicitly suggest. For bad papers, it allows authors to run wild with evidence-free speculation (see dinosaurs in space), and I can't face cutting through all of the bullshit.
But I'm going to comment on this discussion, because it provides something different: a warning for researchers who seek to follow up on this research. The authors realise that the eBG map they have created is a starting point, and that there are gaps that may be filled by others following in their footsteps.
They warn against filling gaps for its own sake, and note that errors can arise if anything less than best practice is followed when isolating new sequence types.
They also warn that Lineage 3, one of the few already established trees of Salmonella, may break down under further analysis, as it undergoes recombination frequently between sequence types of the same lineage.

So all in all, what have they shown in this paper?
To sum it up in one sentence, they have shown that serotyping and traditional methods miss a lot of important details that eBGs do capture. And the differences in serotype across eBGs tell us something new and interesting about Salmonella types and how they evolve.
Moreover, the eBGs defined here are robust and stand up to other analysis methods. MLST provides a standard technique that also has the potential to track new groups as they develop.

Will this method be adopted in labs? Well, I would have to hand that question over to microbiologists.
As seen above, the serotyping reaction is relatively quick. These sorts of classifications are often done in hospitals which may not have access to advanced molecular biology equipment, or the refrigeration required to maintain the enzymes for PCR. MLST also carries a penalty in terms of cost, but this can be solved by having core reference labs dedicated to doing this sort of work.
But I'm wondering whether MLST is the minidisc player of the epidemiological world. At this moment it is probably the best method available. But just over the horizon is whole genome sequencing, which will blow MLST out of the water when it becomes reliable and cheap enough. Shall we build our infrastructure for MLST, only to have to redo it all for whole genome sequencing?
I don't have the answers to these questions, but it speaks to the confidence of this paper that the authors draw attention to these problems themselves.

(2012). Multilocus Sequence Typing as a Replacement for Serotyping in Salmonella enterica, PLoS Pathogens, 8 (6) DOI: 10.1371/journal.ppat.1002776

*These are amino acid positions within proteins. Ala220 = alanine is present 220 amino acids along the protein.


