Tuesday, August 22, 2017

Unattested character states

In an earlier post from January 2016, I argued that it is important to account for directional processes when modeling language history through character-state evolution. In previous papers (List 2016; Chacon and List 2015), I tried to show that this can be easily done with asymmetric step matrices in a parsimony framework. Only later did I realize that this is nothing new for biologists who work on morphological characters, thus supporting David's claim that we should not compare linguistic characters with the genotype, but with the phenotype (Morrison 2014). Early this year, a colleague introduced me to Mk-models in phylogenetics, which were first introduced by Lewis (2001) and allow analysis of multi-state characters in a likelihood framework.
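To make the idea of an asymmetric step matrix concrete, here is a minimal sketch of Sankoff-style parsimony on a fixed tree. The three sounds, the cost values, the four languages and the tree shape are all made up for illustration; they are not taken from the papers cited above.

```python
from math import inf

def sankoff(tree, leaf_states, states, cost):
    """Return (minimal total cost, per-state cost vector at the root).

    tree        -- nested tuples of leaf names, e.g. (("L1", "L2"), "L3")
    leaf_states -- dict mapping leaf name to its observed state
    cost[a][b]  -- cost of a change from ancestral state a to descendant
                   state b; cost[a][b] may differ from cost[b][a]
    """
    def up(node):
        if isinstance(node, str):
            # Leaf: only the observed state is allowed.
            return {s: (0 if s == leaf_states[node] else inf) for s in states}
        child_vectors = [up(child) for child in node]
        # For each candidate ancestral state, add the cheapest way to
        # reach each child subtree.
        return {
            s: sum(min(cost[s][t] + vec[t] for t in states)
                   for vec in child_vectors)
            for s in states
        }

    root = up(tree)
    return min(root.values()), root

# Hypothetical directional costs: p > f > h is cheap, the reverse is expensive.
states = ["p", "f", "h"]
cost = {
    "p": {"p": 0, "f": 1, "h": 2},
    "f": {"p": 5, "f": 0, "h": 1},
    "h": {"p": 5, "f": 5, "h": 0},
}
tree = (("L1", "L2"), ("L3", "L4"))
leaf_states = {"L1": "p", "L2": "f", "L3": "h", "L4": "h"}

best, root = sankoff(tree, leaf_states, states, cost)
print(best, min(root, key=root.get))
```

With these invented costs, the cheapest reconstruction puts "p" at the root, although "h" is the most frequent sound among the four languages. That is the effect of directionality: the ancestral state is not chosen by majority vote.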

What surprised me is that Mk-models seem to outperform parsimony frameworks, despite being much simpler than the elaborate step matrices defined for morphological characters (Wright and Hillis 2014). Today, I read that a recent paper by Wright et al. (2016) even shows how asymmetric transition rates can be handled in likelihood frameworks.
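For readers who, like me, find it hard to see what is being modeled: in its basic symmetric form, the Mk-model treats every change between any two of the k states as equally probable, which makes it a k-state generalization of the Jukes-Cantor model. The sketch below computes the resulting transition probabilities along a branch. The parametrization with a single off-diagonal rate `alpha` and the function name are my own choices for illustration, and the asymmetric case discussed by Wright et al. (2016) is not covered here.

```python
import math

def mk_transition_probability(i, j, k, alpha, t):
    """Probability of ending in state j after time t when starting in state i.

    Symmetric Mk-model: every off-diagonal entry of the rate matrix is alpha,
    and every diagonal entry is -(k - 1) * alpha.
    """
    decay = math.exp(-k * alpha * t)
    if i == j:
        return 1.0 / k + (k - 1.0) / k * decay
    return (1.0 - decay) / k

# Example: a character with four states, rate 0.3, branch length 1.2.
k, alpha, t = 4, 0.3, 1.2
row = [mk_transition_probability(0, j, k, alpha, t) for j in range(k)]
print(row, sum(row))
```

Each row of probabilities sums to one, a branch of length zero leaves the state unchanged, and very long branches approach the uniform value 1/k. Note that only the k states listed in the data can ever appear, which is exactly the limitation discussed in this post.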

Being by no means an expert in phylogenetic analyses, especially not in likelihood frameworks, I tend to have a hard time understanding what is actually being modeled. However, if I correctly understand the gist of the Wright et al. paper, it seems that we are slowly approaching a situation in which more complex scenarios of lexical character evolution in linguistics no longer need to rely on parsimony frameworks.

But, unfortunately, we are not there yet; and it is even questionable whether we will ever be. The reason is that all multi-state models that have been proposed so far only handle transitions between attested characters: unattested characters can neither be included in the analyses nor can they be inferred.

I have pointed to this problem in some previous blogposts, the last one published in June, where I mentioned Ferdinand de Saussure (1857-1913), who postulated two unattested consonantal sounds for Indo-European (Saussure 1879), of which one was later found to have still survived in Hittite, a language that was deciphered and shown to be Indo-European only about 30 years later (Lehmann 1992: 33).

The fact that it is possible to use our traditional methods to infer unattested sounds from circumstantial evidence, but not to include our knowledge about them into phylogenetic analyses, is a huge drawback. Potentially even greater are the situations where even our traditional methods do not allow us to infer unattested data. Think, for example, of a word that was once present in some language but was later completely lost. Given the ephemeral nature of human language, we have no way to know this, but we know very well that it happens easily: just think of terms used for old technology, like walkman or soon even iPod, which the younger generations have never heard of.

Colleagues with whom I have discussed my concerns in this regard are often more optimistic than I am, saying that even if the methods cannot handle unattested characters they could still find the major signal, and thus tell us at least the general tendency as to how a language family evolved. However, for classical linguists, who can infer quite a lot using the laborious methods that still need to be applied manually, it leaves a sour taste if they are told that the analysis deliberately ignored crucial aspects of the processes and phenomena they understand very well. For example, if we found that some intelligence test is right in only about 80% of all cases, we would also abstain from using it to decide whom we allow to take up studies at university.

I also think that it is not a satisfying solution for the analysis of morphological data in biology. It is quite likely that some ancient species had certain traits that later evolved into the traits we observe, but that are simply no longer attested anywhere, either in fossils or in the genes. I also wonder how well phylogenetic frameworks generally account for the fact that the evidence we are left with may reflect much less than what was once there.

In Chacon and List (2015), we circumvent the problem by adding ancestral but unattested sounds to the step matrices in our parsimony analysis. This is of course not entirely satisfactory, as it adds a heavy bias to the analysis of sound change, which no longer tests for all possible solutions but only for the ones we fed into the algorithm. For sound change, it may be possible to substantially expand the character space by adding sounds attested across the world's languages, and then having the algorithms select the most probable transitions. But given that we still barely know anything about general transition probabilities of sound change, and that databases like Phoible (Moran et al. 2014) list more than 2,000 different sounds for a bit more than 2,000 languages, it seems like a Sisyphean challenge to tackle this problem consistently.

What can we do in the meantime? Not very much, it seems. But we can still try to improve our methods in baby steps, trying to get a better understanding of the major and minor processes in linguistic and biological evolution; and not forgetting that, although I was only talking about phylogenetic tree reconstruction, in the end we also want to have all of this done in network approaches.

  • Chacon, T. and J.-M. List (2015) Improved computational models of sound change shed light on the history of the Tukanoan languages. Journal of Language Relationship 13: 177-204.
  • Lehmann, W. (1992) Historical linguistics. An Introduction. Routledge: London.
  • Lewis, P. (2001) A likelihood approach to estimating phylogeny from discrete morphological character data. Systematic Biology 50: 913-925.
  • List, J.-M. (2016) Beyond cognacy: Historical relations between words and their implication for phylogenetic reconstruction. Journal of Language Evolution 1: 119-136.
  • Moran, S., D. McCloy, and R. Wright (eds) (2014) PHOIBLE Online. Max Planck Institute for Evolutionary Anthropology: Leipzig.
  • Morrison, D.A. (2014) Are phylogenetic patterns the same in anthropology and biology? bioRxiv.
  • Saussure, F. (1879) Mémoire sur le système primitif des voyelles dans les langues indo-européennes. Teubner: Leipzig.
  • Wright, A. and D. Hillis (2014) Bayesian analysis using a simple likelihood model outperforms parsimony for estimation of phylogeny from discrete morphological data. PLoS ONE 9.10. e109210.
  • Wright, A., G. Lloyd, and D. Hillis (2016) Modeling character change heterogeneity in phylogenetic analyses of morphology through the use of priors. Systematic Biology 65: 602-611.

Tuesday, August 15, 2017

Is reticulation as important in rice as in wheat?

I have previously discussed the use of phylogenetic networks to study the Complex hybridizations in wheat, due to the very reticulate evolutionary history. It seems that the situation for the other major world food source, rice, also requires network analysis, although this time introgression is the biological source of reticulation, rather than hybridization.

Jae Young Choi, Adrian E. Platts, Dorian Q. Fuller, Yue-Ie Hsing, Rod A. Wing, and Michael D. Purugganan (2017) The rice paradox: multiple origins but single domestication in Asian rice. Molecular Biology & Evolution 34: 969-979.

The authors note:
The Asian rice Oryza sativa is the world’s most important food crop, and is a staple for more than one-third of the world’s population. Oryza sativa is genetically differentiated into several groups, the main ones being japonica and indica, which have been considered as subspecies / subpopulations with distinct morphological and physiological characteristics

The origin of domesticated Asian rice has been a contentious topic, with conflicting evidence for either single or multiple domestication of this key crop species. We examined the evolutionary history of domesticated rice by analyzing de novo assembled genomes from domesticated rice and its wild progenitors. Our results indicate multiple origins, where each domesticated rice subpopulation (japonica, indica, and aus) arose separately from progenitor O. rufipogon and / or O. nivara.

We also show that there is significant gene flow from japonica to both indica (c. 17%) and aus (c. 15%), which led to the transfer of domestication alleles from early-domesticated japonica to proto-indica and proto-aus populations. Our results provide support for a model in which different rice subspecies had separate origins, but that de novo domestication occurred only once, in O. sativa ssp. japonica, and introgressive hybridization from early japonica to proto-indica and proto-aus led to domesticated indica and aus rice.
Similar reticulation histories have, of course, been reported for most domesticated organisms (see Are phylogenetic trees useful for domesticated organisms?), including dogs, cattle, horses, sheep, grapes, etc.

Tuesday, August 8, 2017

Where to retire - a network analysis

I am an elderly man, and it is getting towards time to retire. But where?

I could retire back in Australia; but, as Thomas Wolfe said: "You can't go home again." I could retire in Sweden, but the tax authorities are then likely to take 25% of my pension, which I need to live on. So, where to go?

This is a question that has occupied the minds of many people, for themselves as well as others; and so, inevitably, you will find web sites on the matter. For example, Live and Invest Overseas has a Retire Overseas Index, recommending particular places, which it updates annually; and International Living has a similar Annual Global Retirement Index.

To help me in my decision, let's look at the International Living data, The World’s Best Places to Retire in 2017. This site provides a rating (out of 100) of ten important characteristics, for 24 countries that might be of interest to retirees:
  • Benefits & discounts
  • Buying & renting
  • Climate
  • Cost of living
  • Entertainment & amenities
  • Fitting in
  • Health care
  • Healthy lifestyle
  • Infrastructure
  • Visas & residence
For 2017, the individual scores vary from 57 to 100, with "Benefits & discounts" and "Cost of living" varying the most between countries, and "Fitting in" and "Health care" varying the least.

The ten scores for each country can be averaged, to provide a rank ordering of the 24 countries. These average scores vary from 73.3 to 90.9, as shown in the first graph.

There is little to choose between the first three countries in terms of their average score (Ecuador, Mexico, Panama), nor between the next three (Colombia, Costa Rica, Malaysia). But this does not make these countries intrinsically equal. After all, both Panama and Ecuador handsomely outdo Mexico on "Benefits & discounts", while Mexico does better on "Cost of living". I need an analysis that takes into account which characteristics differ between the countries.

This is where a network analysis comes in handy, as a tool for exploratory data analysis. As usual in this blog, I have calculated the Manhattan distance pairwise between the countries; and I am displaying this in the next figure using a NeighborNet network. Countries that have similar retirement characteristics are near each other in the network; and the further apart they are in the network, the more their characteristics differ.
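As a small illustration of why the average alone is not enough, here is a sketch of the pairwise Manhattan distance calculation. The three countries and their scores are invented, with only three characteristics instead of ten; the real ratings are on the International Living site.

```python
from itertools import combinations

def manhattan(a, b):
    """Sum of absolute differences between two equally long score lists."""
    return sum(abs(x - y) for x, y in zip(a, b))

def pairwise_distances(scores):
    """Manhattan distance for every unordered pair of countries."""
    return {
        (p, q): manhattan(scores[p], scores[q])
        for p, q in combinations(sorted(scores), 2)
    }

# Invented scores for three hypothetical countries.
scores = {
    "X": [90, 80, 70],
    "Y": [70, 80, 90],
    "Z": [88, 82, 70],
}

means = {country: sum(v) / len(v) for country, v in scores.items()}
dist = pairwise_distances(scores)
print(means)
print(dist)
```

All three invented countries have the same average score of 80, but X and Z are only 4 units apart while Y is 40 units from both. A distance matrix like this is what is then passed on to the NeighborNet algorithm.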

The countries are color-coded by geography, which shows that their actual location has little effect on the Retirement Index. However, the European countries are gathered at the bottom-left, without any representative from Asia. The six top-ranked countries are all clustered in the bottom-right of the network.

Next to this top-rank cluster come Portugal and Spain on one hand, and Nicaragua on the other. These three countries have similar Retirement Scores, but they are separated in the network because Nicaragua scores poorly on "Infrastructure" and "Health care", but better than Europe on "Cost of living", "Buying & renting" and "Healthy lifestyle".

Spain does better than Portugal on "Entertainment & amenities"!

All in all, Portugal looks like a good bet to me. The Live and Invest Overseas site lists individual places to retire, not just countries, and for the past three years it has recommended the Algarve region in Portugal as the top location.

Importantly, the Portuguese also won't tax my pension (Pension i Portugal ger skattefria miljoner), although the Swedish government is not happy about this, of course (Skattefrihet ska stoppas: Portugal till förhandlingsbordet).

Tuesday, August 1, 2017

Stacking neighbour-nets: a real-world example

In my last post, I outlined two ideas about how stacking neighbour-nets can assist in tracing evolutionary change over time, using a theoretical example. In this post, I will show how this could work using a (tricky) real-world example: a morphological matrix including a high proportion of fossil taxa and a good deal of (strongly) homoplasious characters (Bomfleur, Grimm & McLoughlin 2017).

Stacking can be valuable when both fossil and extant taxa are included in the study. The idea of stacking is to construct networks for each time slice, rather than creating one giant network that tries to encompass everything. Adjacent time-slice networks can then be directly compared, which should reveal the evolutionary changes that occurred between those two times. The final phylogeny can then be constructed from this information, including all of the extant taxa and fossils together.

I regard our work as quite innovative for a palaeobotanical/-phylogenetic systematic study, as it generated a taxon-dense dataset down to species (sometimes individual specimens) as ‘operational taxonomic units’ (OTUs). Our goal was to provide a unifying classification for extant and fossil Osmundales (royal ferns) rhizomes. The primary purpose is hence not to infer a phylogenetic tree but to assist in describing and placing new-found rhizome fossils in the phylogeny. The placement workflow (see this tutorial) combines a polytomous key (using conserved, lineage-diagnostic traits) with neighbour-nets that use different taxon sets. We discussed odd placements in the splits graphs, and matrix signal quality (robustness) from differential branch support, as estimated by non-parametric bootstrapping (least-squares, maximum likelihood, maximum parsimony).

Sources of incompatible data patterns in real-world data

The main problem with real-world data when it comes to inferring phylogenetic relationships, i.e. estimating the true phylogeny, is incompatible data patterns. For molecular matrices, the two main sources of signals that will be incompatible with the true phylogeny are back-mutations and model-bias. For instance, there is usually a higher probability for transitions than for transversions; and for coding gene regions, the 3rd codon position can become over-saturated and thus stochastically distributed, providing little phylogenetic signal. By adapting the model in a probabilistic environment, we can (try to) counter such biases during inference.

In the case of morphological (or other non-molecular) traits, incompatible signals arise from:
  1. homoplasious characters – traits that evolve convergently or in parallel, which are frequently included in such matrices;
  2. epigenetic effects – morphological traits not, or not fully, controlled by the genetic composition of the organism; and
  3. pseudo-homologies – traits that are seemingly the same but are the endpoint of different evolutionary pathways.
Inferring a tree reflecting the true phylogeny from such a matrix may be very difficult or even impossible. For a perfect probabilistic approach, we would need to establish character-wise probabilities for change, which requires that a lineage has a modern-day diversity fairly matching that in the past.

Fossils add further sources of signals incompatible with the true phylogeny, such as: preservation artefacts and misinterpretations (false homologies); uncertainty linked to heterochrony; and, last but not least, ‘temporal’ convergences, i.e. the parallel or convergent evolution of the same (or similar) trait in an ancient sister or unrelated lineage of a modern (or much younger) lineage.

For all of these aspects, the royal fern rhizomes provide a nice example (i.e. a bad-case scenario). Only a few of the 45 scored traits that can be observed in fossil material are conserved within the modern lineages and their extant representatives, and hence are of high diagnostic value for assigning fossils to one of these lineages. Many other rhizome features are variable within extant members of the now six genera (some even within a species), and increasingly so looking back into the past.

The royal ferns became arborescent several times, as reflected by convergent adaptations in rhizome anatomy — highly complex stele architectures are found from the Permian onwards in (morpho)species that differ in all relatively stable, lineage-diagnostic traits. The most complex modern-day rhizomes have anatomies that appear to be less derived than those of some of their ancient counterparts. Nonetheless, the rhizomes, scored for 129/130 OTUs (fossil species, partly referring to individual specimens) in our matrix (click here for an annotated version for use with Mesquite), reflect a substantial past diversity and cover more than 250 million years of evolution.

Basic data situation

The all-inclusive neighbour-net (Fig. 1; see here for a fully annotated version) captures aspects of similarity patterns related to phylogenetic relationships, but does not clearly resolve the known (modern) or putative (extinct) genera within the core group Osmundoideae, for example. Overall branch-support is generally low for any alternative (details can be found here), independent of the optimality criterion used. [For our systematic treatment, we used data subsets to generate a series of networks including only members of the same (putative) lineage, which were increasingly proficient to sort the OTUs.]

The main problems are: (i) the differentiation between less-derived rhizome anatomies of the Osmundoideae found in the likely paraphyletic extinct genus Millerocaulis (pink in Fig. 1) and the modern genus Claytosmunda (magenta, paraphyletic with one survivor); and (ii) the distinctness and superficial similarity of two arborescent lineages, the genus Osmundacaulis (red) and the extinct (Permian to Jurassic) family Guaireaceae (greenish). They differ in all stable, lineage-diagnostic characters but share highly dissected steles. Phylogenetic trees "resolve" this conflict by creating an artificial clade (e.g. the parsimony cladogram by Wang et al. 2014). The neighbour-net (Fig. 1) places Osmundacaulis between the Guaireaceae and the Osmundoideae, the subfamily of Osmundaceae including the surviving modern genera.

Fig. 1. Neighbour-net based on a morphological distance matrix of 122 OTUs representing Permian to extant Osmundales and their putative relatives, the Grammatopteridales (black).

Stacking procedure one: identifying closest relatives in subsequent time-slices

Signal ambiguity (from homoplastic characters and the related resolution issue) also affects the time-slice networks to some degree. Figures 2–4 show the network-per-time-slice stacks. Each neighbour-net includes only the OTUs from one stratigraphic period (Permian, Triassic, Jurassic, Cretaceous, Paleogene + Neogene) and the modern-day survivors. For simplicity, links are only established for the closest potential relative in the subsequent or preceding time-slice; and only shown when the mean morphological distance (MD) does not exceed 0.25. The colouring of the dots reflects the systematic affinity of the taxon as established by Bomfleur et al. and shown in Fig. 1.
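The linking rule just described can be sketched as follows. This is only an illustration of the rule, not the procedure used to draw the figures; the function name, the OTU labels and the distance values are hypothetical.

```python
def closest_links(dist, slice_a, slice_b, max_md=0.25):
    """Link each OTU to its closest OTU in the adjacent time-slice.

    dist    -- dict mapping an (otu, otu) pair to its mean morphological
               distance; each pair is stored once, in either order
    slice_a -- OTU names of the older time-slice
    slice_b -- OTU names of the younger time-slice
    max_md  -- links with a larger distance are not shown
    """
    def d(x, y):
        return dist[(x, y)] if (x, y) in dist else dist[(y, x)]

    links = set()
    # Look forward in time from slice_a, then backward from slice_b.
    for sources, targets in ((slice_a, slice_b), (slice_b, slice_a)):
        for otu in sources:
            best = min(targets, key=lambda other: d(otu, other))
            if d(otu, best) <= max_md:
                links.add(frozenset((otu, best)))
    return links

# Hypothetical distances between two Triassic and two Jurassic OTUs.
dist = {
    ("T1", "J1"): 0.10, ("T1", "J2"): 0.30,
    ("T2", "J1"): 0.20, ("T2", "J2"): 0.40,
}
links = closest_links(dist, ["T1", "T2"], ["J1", "J2"])
print(sorted(sorted(link) for link in links))
```

In this toy case both Triassic OTUs are linked to J1, while J2 gets no link because its closest older match (T1, at 0.30) lies beyond the threshold. A real data set can have ties for the closest match; the sketch simply keeps the first one it finds.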

A major taxonomic turnover characterises the transition from the (late) Permian to the Triassic (Fig. 2). The most primitive (rhizome-wise) Osmundales, the Thamnopterioideae (brown) become extinct, and are completely replaced by the Osmundoideae, their modern counterparts. The only representative of the Permian diversity remaining in the Triassic appears to be Millerocaulis (?Palaeosmunda) stipabonnetiorum, and this may provide a good taxon for rooting the Triassic phylogeny. However, it is also one of the worst-preserved and most poorly described taxa; to some degree, its similarity with both lineages of Permian Osmundaceae (Thamnopterioideae and Palaeosmunda) may hint that the distances are under-estimated, since traits could not be scored that would otherwise lead to increased distances.

Fig. 2. Taxon-reduced neighbour-nets, including only species from the same time-slice (as labelled). Inter-time-slice links indicate the morphologically closest match in the preceding or following time-slice for each species (in case of pairwise distances < 0.25)

The Jurassic graph (in Fig. 2) highlights a decrease in overall diversity, despite the much higher numbers of OTUs. The links can help to establish relationships between congeners of both time-slices; but for Osmundastrum (today represented by a single, genetically and morphologically derived species) a more pronounced evolutionary shift is indicated: the Triassic putative member is linked to Jurassic Millerocaulis species (a paraphyletic Osmundoideae genus defined by the absence of a trait found in all extant genera), which are relatively close to the first unambiguous Osmundastrum. We also find that the three Jurassic newcomers have little relation to the Triassic basis (Fig. 2).

The linking of the Jurassic and Cretaceous time-slices highlights (Fig. 3) a general weakness of the approach using this matrix: poorer preserved, incompletely described fossils included in the matrix (Cretaceous Millerocaulis) attract most links from the Jurassic Osmundoideae — their distances are under-estimated.
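The under-estimation can be illustrated with a toy calculation. The sketch assumes that the mean morphological distance is the proportion of differing states among the characters scored in both OTUs, with unscored characters skipped; this is an assumption made for the example, and the character strings are invented.

```python
def mean_distance(a, b, missing="?"):
    """Proportion of differing states among characters scored in both OTUs."""
    pairs = [(x, y) for x, y in zip(a, b) if x != missing and y != missing]
    if not pairs:
        return None  # no character can be compared
    return sum(x != y for x, y in pairs) / len(pairs)

# Invented eight-character scorings.
reference = "01011010"
well_preserved = "01001110"    # differs from the reference in two characters
poorly_preserved = "010?1?10"  # same fossil, but those two characters are unscored

print(mean_distance(reference, well_preserved))
print(mean_distance(reference, poorly_preserved))
```

Under this assumption the well-preserved fossil is 0.25 away from the reference, whereas the poorly preserved version of the same fossil has a distance of 0.0, simply because the differing characters could not be scored. Such OTUs will therefore attract links they may not deserve.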

Fig. 3. As above, but linking the Jurassic and subsequent Cretaceous neighbour-nets. Note the decreasing diversity but clear signals for Osmundacaulis (red) in contrast to the group of modern Osmundoideae (purplish). Plenasium (light blue) is a modern arborescent genus with complex and highly dissected steles and generally derived rhizomes.

The two Osmundastrum, which are probably part of the same evolutionary lineage, are not linked (see Bomfleur, Grimm & McLoughlin 2015 for the reasons). Two modern lineages with more or strongly derived rhizomes appear in the Cretaceous, the Todinae and Plenasium.

In the case of the Todinae the Jurassic links are partly ambiguous, with one Cretaceous OTU linked to Jurassic Claytosmunda (part of the Todinae’s sister clade according to molecular data), but the other with some relatively distinct Millerocaulis. The problem here is that the Todinae may have diverged earlier (Bomfleur, Grimm & McLoughlin 2015; Grimm et al. 2015), but their rhizome fossils have so far not been found (or lack the diagnostic characters of the lineage). Gaps in the fossil record can hinder establishing meaningful links. The links are, however, to a group of Millerocaulis that are closer to coeval Claytosmunda – which show a rhizome anatomy that may be closest to that of the common ancestor of all modern-day royal ferns – than to their congeners. In the case of Plenasium, the genus with the most-derived rhizomes of all modern Osmundaceae, the closest older relative is part of the same subgroup of Millerocaulis. These potentially false links may reflect that some Millerocaulis show derived character suites, which are typically found also in one or another modern Osmundaceae genus (similarity due to convergence).

The closer we get to the modern-day situation, the more interpretable the links become (Fig. 4). Lineages with distinct and derived rhizome anatomies such as Osmundastrum and Plenasium are linked across time-slices. Cross-generic links from Cretaceous Millerocaulis to Paleogene-Neogene Osmunda to modern-day Claytosmunda relate directly to higher numbers of shared, possibly primitive characters in the connected taxa; these links can again be informative for rooting the graphs. Substantially weaker links (mean morphological distances > 0.1 between time-slices) are found for distantly related pairings (Cretaceous and extant Todinae with Paleogene-Neogene Osmundastrum and Claytosmunda).

Fig. 4. As above, but for Cretaceous to modern-day.

Stacking procedure two: graphs including taxa of two subsequent time-slices

Figures 5 and 6 show the two-adjacent-time-slices-per-graph stacks. Interpretation of these figures is more straightforward: one just compares the placement of the connecting taxa (Triassic and Jurassic in Fig. 5; Paleogene and Neogene in Fig. 6). The resolution issue regarding the relationship between Millerocaulis and genera representing the modern lineage (Claytosmunda, Osmundastrum, Plenasium, Leptopteris, Todea) is obvious: the Triassic Millerocaulis are clustered in the Permo-Triassic graph, but are placed apart within the spider-web-like portion in the Triassic-Jurassic graph (Fig. 5). This could mean that several lineages of Millerocaulis diversified in the Jurassic, all of which have their roots in the Triassic. Some of the emerging Millerocaulis groups remain coherent in the Jurassic-Cretaceous graph (and can include Cretaceous species), but their position relative to each other can change. In contrast, for Osmundacaulis the Cretaceous newcomers simply fit into the existing organisation.

Fig. 5. Stack of neighbour-nets comprising species of two subsequent time-slices, covering the time from the Permian to the Cretaceous. Connections relate to Triassic (lower half) or Jurassic (upper half) species that are included in two subsequent splits graphs.

The transition from the Cretaceous to the modern-day situation (Fig. 6) fairly reflects what could be inferred by mapping morphological characters onto the molecular tree. The placement of Osmunda species in the graphs reflects evolutionary change towards the modern-day species, whereas stasis can be assumed for Osmundastrum, and a loss of diversity for Claytosmunda. According to the structures of the graphs, the modern-day Plenasium (subgenus Plenasium) replaced the more diverse (and partly more derived) Cretaceous-Paleogene Plenasium (subgenus Aurealcaulis); but the genus is absent from the Neogene, so there are no connections between the ‘65–5 Ma’ and ‘last 25 Ma’ graphs.

Fig. 6. As above, but covering the time from the Cretaceous to now. Connections refer to Paleogene (lower half) and Neogene (upper half) species.

Now that it’s done, what can be said?

Establishing similarity links across time-slices can be tedious or even misleading, especially with increasing numbers of taxa and increasing complexity of the signals in the matrix (Figs 2–3). The process is more time-consuming and the result (Figs 2–4) is graphically more challenging than the alternative stacking procedure (Figs 5–6).

With most real-world data, it may be difficult to get a set of links between time slices that reflect the true phylogeny, as was the case in my earlier theoretical example. Nonetheless, the procedure can help to identify potential relatives (ancestors, descendants, sister lineages) of groups that are restricted to a single time slice, or highlight the lack of potential or favourable candidates.

However, in general, joining the taxa from two subsequent time-slices in one graph, and connecting these graphs by the shared taxa, seems to be a more feasible and straightforward approach. Once a matrix is compiled, the distance calculation and splits-graph inference are a matter of minutes, and it takes less than half an hour to produce a first graphical output using the graphical functions in SplitsTree and software to graphically stack the exported SVG or EPS files (further beautification may take a day). Taxa with odd signals (with ambiguous affinity) will be placed accordingly in the nets and eventually move around in the two containing graphs (Fig. 5), and the amount of evolutionary change across time may be directly visible (Fig. 6).

Additional links for readers interested in details

Figure illustrating the history of taxonomic systems for Osmundales.
— An archive including all analysis files generated in the course of the original study is hosted at the Dryad Digital Repository.
— Further annotated versions of the figures shown in this post and the used analysis files have been published under a CC-BY licence: Grimm G. (2017) Osmundales diverstity through time: stacking networks. figshare. https://doi.org/10.6084/m9.figshare.5255014.v1.


Bomfleur B, Grimm GW, McLoughlin S (2015) Osmunda pulchella sp. nov. from the Jurassic of Sweden—reconciling molecular and fossil evidence in the phylogeny of modern royal ferns (Osmundaceae). BMC Evolutionary Biology 15: 126.

Bomfleur B, Grimm GW, McLoughlin S (2017) The fossil Osmundales (Royal Ferns)—a phylogenetic network analysis, revised taxonomy, and evolutionary classification of anatomically preserved trunks and rhizomes. PeerJ 5: e3433.

Grimm GW, Kapli P, Bomfleur B, McLoughlin S, Renner SS (2015) Using more than the oldest fossils: Dating Osmundaceae with the fossilized birth-death process. Systematic Biology 64: 396-405.

Huson DH, Bryant D (2006) Application of phylogenetic networks in evolutionary studies. Molecular Biology & Evolution 23: 254-267.

Maddison WP, Maddison DR (2001 onwards) Mesquite: a modular system for evolutionary analysis.

Wang S-J, Hilton J, He X-Y, Seyfullah LJ, Shao L (2014) The anatomically preserved Zhongmingella gen. nov. from the Upper Permian of China: evaluating the early evolution and phylogeny of the Osmundales. Journal of Systematic Palaeontology 1: 1-22.