Tuesday, August 22, 2017

Unattested character states

In an earlier post from January 2016, I argued that it is important to account for directional processes when modeling language history through character-state evolution. In previous papers (List 2016; Chacon and List 2015), I tried to show that this can be easily done with asymmetric step matrices in a parsimony framework. Only later did I realize that this is nothing new for biologists who work on morphological characters, thus supporting David's claim that we should not compare linguistic characters with the genotype, but with the phenotype (Morrison 2014). Early this year, a colleague introduced me to Mk-models in phylogenetics, which were first proposed by Lewis (2001) and allow analysis of multi-state characters in a likelihood framework.

What surprised me is that Mk-models seem to outperform parsimony frameworks, despite being much simpler than the elaborate step matrices defined for morphological characters (Wright and Hillis 2014). Today, I read that a recent paper by Wright et al. (2016) even shows how asymmetric transition rates can be handled in likelihood frameworks.
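To make the simplicity of the basic model concrete, here is a minimal sketch of the transition probabilities of a symmetric k-state model of the Mk type, in which every change between two different states happens at the same rate. The function names and the `rate` and `t` parameters are my own illustration, not code from Lewis (2001) or from any phylogenetic package.

```python
import math

def mk_transition_prob(k, rate, t, same_state):
    """Probability of ending in the same state (or in one specific other
    state) after time t, when all k states interchange at the same rate."""
    decay = math.exp(-k * rate * t)
    if same_state:
        return 1.0 / k + (k - 1) / k * decay
    return 1.0 / k - decay / k

def mk_matrix(k, rate, t):
    """Full k x k matrix of transition probabilities for a branch of length t."""
    return [
        [mk_transition_prob(k, rate, t, i == j) for j in range(k)]
        for i in range(k)
    ]

# Three states, an arbitrary (made-up) rate and branch length.
for row in mk_matrix(3, 0.5, 1.0):
    print([round(p, 3) for p in row])
```

Every state is equally reachable from every other one here, so this sketch covers only the symmetric case; asymmetric rates of the kind discussed by Wright et al. (2016) would need a different rate matrix.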

Being by no means an expert in phylogenetic analyses, especially not in likelihood frameworks, I tend to have a hard time understanding what is actually being modeled. However, if I correctly understand the gist of the Wright et al. paper, it seems that we are slowly approaching a situation in which more complex scenarios of lexical character evolution in linguistics no longer need to rely on parsimony frameworks.

But, unfortunately, we are not there yet; and it is even questionable whether we will ever be. The reason is that all multi-state models that have been proposed so far only handle transitions between attested characters: unattested characters can neither be included in the analyses nor can they be inferred.

I have pointed to this problem in some previous blogposts, the last one published in June, where I mentioned Ferdinand de Saussure (1857-1913), who postulated two unattested consonantal sounds for Indo-European (Saussure 1879), of which one was later found to have still survived in Hittite, a language that was deciphered and shown to be Indo-European only about 30 years later (Lehmann 1992: 33).

The fact that it is possible to use our traditional methods to infer unattested sounds from circumstantial evidence, but not to include our knowledge about them in phylogenetic analyses, is a huge drawback. Potentially even more serious are the situations where even our traditional methods do not allow us to infer unattested data. Think, for example, of a word that was once present in some language but was later completely lost. Given the ephemeral nature of human language, we have no way to know this, but we know very well that it easily happens: just think of some terms used for old technology, like walkman or, soon, even iPod, which the younger generations have never heard of.

Colleagues with whom I have discussed my concerns in this regard are often more optimistic than I am, saying that even if the methods cannot handle unattested characters they could still find the major signal, and thus tell us at least the general tendency as to how a language family evolved. However, for classical linguists, who can infer quite a lot using the laborious methods that still need to be applied manually, it leaves a sour taste if they are told that the analysis deliberately ignored crucial aspects of the processes and phenomena they understand very well. For example, if we detect that some intelligence test is right in about 80% of all cases, we would also abstain from using it to judge who we allow to take up their studies at university.

I also think that it is not a satisfying solution for the analysis of morphological data in biology. It is quite likely that some ancient species had certain traits that later evolved into the traits we observe, and that are simply no longer attested anywhere, either in fossils or in the genes. I also wonder how well phylogenetic frameworks generally account for the fact that the evidence we are left with may reflect much less than what was once there.

In Chacon and List (2015), we circumvent the problem by adding ancestral but unattested sounds to the step matrices in our parsimony analysis. This is of course not entirely satisfactory, as it adds a heavy bias to the analysis of sound change, which no longer tests for all possible solutions but only for the ones we fed into the algorithm. For sound change, it may be possible to substantially expand the character space by adding sounds attested across the world's languages, and then having the algorithms select the most probable transitions. But given that we still barely know anything about general transition probabilities of sound change, and that databases like Phoible (Moran et al. 2014) list more than 2,000 different sounds for a bit more than 2,000 languages, it seems like a Sisyphean challenge to tackle this problem consistently.
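The effect of adding an unattested state to an asymmetric step matrix can be sketched with a small Sankoff-style parsimony calculation. This is only an illustration under invented assumptions: the tree, the three sounds and all costs are hypothetical, and the code is not the implementation used in Chacon and List (2015).

```python
INF = float("inf")

def sankoff(node, leaf_states, states, cost):
    """Return {state: minimal cost of the subtree if `node` has that state}.

    `node` is a leaf name (str) or a tuple of child nodes;
    `cost[a][b]` is the price of a change from parent state a to child state b.
    """
    if isinstance(node, str):
        return {s: (0 if s == leaf_states[node] else INF) for s in states}
    tables = [sankoff(child, leaf_states, states, cost) for child in node]
    return {
        parent: sum(
            min(cost[parent][child] + table[child] for child in states)
            for table in tables
        )
        for parent in states
    }

# Hypothetical asymmetric step matrix: p > f and p > b are cheap,
# every other change is expensive. The numbers are invented.
cost = {
    "p": {"p": 0, "f": 1, "b": 1},
    "f": {"p": 10, "f": 0, "b": 10},
    "b": {"p": 10, "f": 10, "b": 0},
}
tree = (("A", "B"), ("C", "D"))
leaves = {"A": "f", "B": "b", "C": "f", "D": "f"}  # "p" is never attested

attested_only = sankoff(tree, leaves, ["f", "b"], cost)
with_unattested = sankoff(tree, leaves, ["p", "f", "b"], cost)
print(min(attested_only.values()))                       # 10
best = min(with_unattested, key=with_unattested.get)
print(best, with_unattested[best])                       # p 3
```

With only the attested sounds, the cheapest scenario costs 10; once the unattested "p" is allowed, the algorithm reconstructs it at the root for a cost of 3. The bias mentioned above is visible too: the result depends entirely on which extra states and costs were fed in.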

What can we do in the meantime? Not very much, it seems. But we can still try to improve our methods in baby steps, trying to get a better understanding of the major and minor processes in linguistic and biological evolution; and not forgetting that, although I was only talking about phylogenetic tree reconstruction, in the end we also want to have all of this done in network approaches.

  • Chacon, T. and J.-M. List (2015) Improved computational models of sound change shed light on the history of the Tukanoan languages. Journal of Language Relationship 13: 177-204.
  • Lehmann, W. (1992) Historical linguistics. An Introduction. Routledge: London.
  • Lewis, P. (2001) A likelihood approach to estimating phylogeny from discrete morphological character data. Systematic Biology 50: 913-925.
  • List, J.-M. (2016) Beyond cognacy: Historical relations between words and their implication for phylogenetic reconstruction. Journal of Language Evolution 1: 119-136.
  • Moran, S., D. McCloy, and R. Wright (eds) (2014) PHOIBLE Online. Max Planck Institute for Evolutionary Anthropology: Leipzig.
  • Morrison, D.A. (2014) Are phylogenetic patterns the same in anthropology and biology? bioRxiv.
  • Saussure, F. (1879) Mémoire sur le système primitif des voyelles dans les langues indo-européennes. Teubner: Leipzig.
  • Wright, A. and D. Hillis (2014) Bayesian analysis using a simple likelihood model outperforms parsimony for estimation of phylogeny from discrete morphological data. PLoS ONE 9.10. e109210.
  • Wright, A., G. Lloyd, and D. Hillis (2016) Modeling character change heterogeneity in phylogenetic analyses of morphology through the use of priors. Systematic Biology 65: 602-611.

Tuesday, August 15, 2017

Is reticulation as important in rice as in wheat?

I have previously discussed the use of phylogenetic networks to study the complex hybridizations in wheat, due to the very reticulate evolutionary history. It seems that the situation for the other major world food source, rice, also requires network analysis, although this time introgression is the biological source of reticulation, rather than hybridization.

Jae Young Choi, Adrian E. Platts, Dorian Q. Fuller, Yue-Ie Hsing, Rod A. Wing, and Michael D. Purugganan (2017) The rice paradox: multiple origins but single domestication in Asian rice. Molecular Biology & Evolution 34: 969-979.

The authors note:
The Asian rice Oryza sativa is the world’s most important food crop, and is a staple for more than one-third of the world’s population. Oryza sativa is genetically differentiated into several groups, the main ones being japonica and indica, which have been considered as subspecies / subpopulations with distinct morphological and physiological characteristics.

The origin of domesticated Asian rice has been a contentious topic, with conflicting evidence for either single or multiple domestication of this key crop species. We examined the evolutionary history of domesticated rice by analyzing de novo assembled genomes from domesticated rice and its wild progenitors. Our results indicate multiple origins, where each domesticated rice subpopulation (japonica, indica, and aus) arose separately from progenitor O. rufipogon and / or O. nivara.

We also show that there is significant gene flow from japonica to both indica (c. 17%) and aus (c. 15%), which led to the transfer of domestication alleles from early-domesticated japonica to proto-indica and proto-aus populations. Our results provide support for a model in which different rice subspecies had separate origins, but that de novo domestication occurred only once, in O. sativa ssp. japonica, and introgressive hybridization from early japonica to proto-indica and proto-aus led to domesticated indica and aus rice.
Similar reticulation histories have, of course, been reported for most domesticated organisms (see Are phylogenetic trees useful for domesticated organisms?), including dogs, cattle, horses, sheep, grapes, etc.

Tuesday, August 8, 2017

Where to retire - a network analysis

I am an elderly man, and it is getting towards time to retire. But where?

I could retire back in Australia; but, as Thomas Wolfe said: "You can't go home again." I could retire in Sweden, but the tax authorities are then likely to take 25% of my pension, which I need to live on. So, where to go?

This is a question that has occupied the minds of many people, for themselves as well as others; and so, inevitably, you will find web sites on the matter. For example, Live and Invest Overseas has a Retire Overseas Index, recommending particular places, which it updates annually; and International Living has a similar Annual Global Retirement Index.

To help me in my decision, let's look at the International Living data, The World’s Best Places to Retire in 2017. This site provides a rating (out of 100) of ten important characteristics, for 24 countries that might be of interest to retirees:
  • Benefits & discounts
  • Buying & renting
  • Climate
  • Cost of living
  • Entertainment & amenities
  • Fitting in
  • Health care
  • Healthy lifestyle
  • Infrastructure
  • Visas & residence
For 2017, the individual scores vary from 57 to 100, with "Benefits & discounts" and "Cost of living" varying the most between countries, and "Fitting in" and "Health care" varying the least.

The ten scores for each country can be averaged, to provide a rank ordering of the 24 countries. These average scores vary from 73.3 to 90.9, as shown in the first graph.

There is little to choose between the first three countries in terms of their average score (Ecuador, Mexico, Panama), nor between the next three (Colombia, Costa Rica, Malaysia). But this does not make these countries intrinsically equal. After all, both Panama and Ecuador handsomely outdo Mexico on "Benefits & discounts", while Mexico does better on "Cost of living". I need an analysis that takes into account which characteristics differ between the countries.

This is where a network analysis comes in handy, as a tool for exploratory data analysis. As usual in this blog, I have calculated the Manhattan distance pairwise between the countries; and I am displaying this in the next figure using a NeighborNet network. Countries that have similar retirement characteristics are near each other in the network; and the further apart they are in the network then the more different are their characteristics.
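The difference between ranking by average score and comparing by Manhattan distance can be sketched as follows. The three countries and their scores are invented placeholders, not International Living data, and only three characteristics are used instead of ten.

```python
from itertools import combinations

# Hypothetical scores (out of 100) for three characteristics per country.
scores = {
    "Country A": [90, 80, 70],
    "Country B": [70, 80, 90],
    "Country C": [88, 80, 72],
}

def manhattan(a, b):
    """Sum of absolute differences across all characteristics."""
    return sum(abs(x - y) for x, y in zip(a, b))

# Average score per country (the basis of a simple rank ordering).
averages = {name: sum(v) / len(v) for name, v in scores.items()}

# Pairwise Manhattan distances (the input for a distance-based network).
distances = {
    (a, b): manhattan(scores[a], scores[b])
    for a, b in combinations(scores, 2)
}

print(averages)   # all three averages are 80.0
print(distances)  # A-B: 40, A-C: 4, B-C: 36
```

All three invented countries share the same average, yet A and C are close while B is far from both. A distance matrix like this is what would be handed to a program that computes the NeighborNet; that step is not shown here.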

The countries are color-coded by geography, which shows that their actual location has little effect on the Retirement Index. However, the European countries are gathered at the bottom-left, without any representative from Asia. The six top-ranked countries are all clustered in the bottom-right of the network.

Next to this top-rank cluster come Portugal and Spain on one hand, and Nicaragua on the other. These three countries have similar Retirement Scores, but they are separated in the network because Nicaragua scores poorly on "Infrastructure" and "Health care", but better than Europe on "Cost of living", "Buying & renting" and "Healthy lifestyle".

Spain does better than Portugal on "Entertainment & amenities"!

All in all, Portugal looks like a good bet to me. The Live and Invest Overseas site lists individual places to retire, not just countries, and for the past three years it has recommended the Algarve region in Portugal as the top location.

Importantly, the Portuguese also won't tax my pension (Pension i Portugal ger skattefria miljoner), although the Swedish government is not happy about this, of course (Skattefrihet ska stoppas: Portugal till förhandlingsbordet).

Tuesday, August 1, 2017

Stacking neighbour-nets: a real-world example

In my last post, I outlined two ideas about how stacking neighbour-nets can assist in tracing evolutionary change over time, using a theoretical example. In this post, I will show how this could work using a (tricky) real-world example: a morphological matrix including a high proportion of fossil taxa and a good deal of (strongly) homoplasious characters (Bomfleur, Grimm & McLoughlin 2017).

Stacking can be valuable when both fossil and extant taxa are included in the study. The idea of stacking is to construct networks for each time slice, rather than creating one giant network that tries to encompass everything. Adjacent time-slice networks can then be directly compared, which should reveal the evolutionary changes that occurred between those two times. The final phylogeny can then be constructed from this information, including all of the extant taxa and fossils together.

I regard our work as quite innovative for a palaeobotanical/-phylogenetic systematic study, as it generated a taxon-dense dataset down to species (sometimes individual specimens) as ‘operational taxonomic units’ (OTUs). Our goal was to provide a unifying classification for extant and fossil Osmundales (royal ferns) rhizomes. The primary purpose is hence not to infer a phylogenetic tree but to assist in describing and placing new-found rhizome fossils in the phylogeny. The placement workflow (see this tutorial) combines a polytomous key (using conserved, lineage-diagnostic traits) with neighbour-nets that use different taxon sets. We discussed odd placements in the splits graphs, and matrix signal quality (robustness) from differential branch support, as estimated by non-parametric bootstrapping (least-squares, maximum likelihood, maximum parsimony).

Sources of incompatible data patterns in real-world data

The main problem with real-world data when it comes to inferring phylogenetic relationships, i.e. estimating the true phylogeny, is incompatible data patterns. For molecular matrices, the two main sources of signals that will be incompatible with the true phylogeny are back-mutations and model-bias. For instance, there is usually a higher probability for transitions than for transversions; and for coding gene regions, the 3rd codon position can become over-saturated and thus stochastically distributed, providing little phylogenetic signal. By adapting the model in a probabilistic environment, we can (try to) counter such biases during inference.

In the case of morphological (or other non-molecular) traits, incompatible signals arise from:
  1. homoplasious characters – traits that evolve convergently or in parallel, which are frequently included in such matrices;
  2. epigenetic effects – morphological traits not, or not fully, controlled by the genetic composition of the organism; and
  3. pseudo-homologies – traits that are seemingly the same but are the endpoint of different evolutionary pathways.
Inferring a tree reflecting the true phylogeny from such a matrix may be very difficult or even impossible. For a perfect probabilistic approach, we would need to establish character-wise probabilities for change, which requires that a lineage has a modern-day diversity fairly matching that in the past.

Fossils add further sources of signals incompatible with the true phylogeny, such as: preservation artefacts and misinterpretations (false homologies); uncertainty linked to heterochrony; and, last but not least, ‘temporal’ convergences, i.e. the parallel or convergent evolution of the same (or similar) trait in an ancient sister or unrelated lineage of a modern (or much younger) lineage.

For all of these aspects, the royal fern rhizomes provide a nice example (i.e. a bad-case scenario). Only a few of the 45 scored traits that can be observed in fossil material are conserved within the modern lineages and their extant representatives, and hence are of high diagnostic value for assigning fossils to one of these lineages. Many other rhizome features are variable within extant members of the now six genera (some even within a species), and increasingly so looking back into the past.

The royal ferns became arborescent several times, as reflected by convergent adaptations in rhizome anatomy — highly complex stele architectures are found from the Permian onwards in (morpho)species that differ in all relatively stable, lineage-diagnostic traits. The most complex modern-day rhizomes have anatomies that appear to be less derived than those of some of their ancient counterparts. Nonetheless, the rhizomes, scored for 129/130 OTUs (fossil species, partly referring to individual specimens) in our matrix (click here for an annotated version for use with Mesquite), reflect a substantial past diversity and cover more than 250 million years of evolution.

Basic data situation

The all-inclusive neighbour-net (Fig. 1; see here for a fully annotated version) captures aspects of similarity patterns related to phylogenetic relationships, but does not clearly resolve the known (modern) or putative (extinct) genera within the core group Osmundoideae, for example. Branch support is generally low for any alternative (details can be found here), independent of the optimality criterion used. [For our systematic treatment, we used data subsets to generate a series of networks including only members of the same (putative) lineage, which were increasingly able to sort the OTUs.]

The main problems are: (i) the differentiation between less-derived rhizome anatomies of the Osmundoideae found in the likely paraphyletic extinct genus Millerocaulis (pink in Fig. 1) and the modern genus Claytosmunda (magenta, paraphyletic with one survivor); and (ii) the distinctness and superficial similarity of two arborescent lineages, the genus Osmundacaulis (red) and the extinct (Permian to Jurassic) family Guaireaceae (greenish). They differ in all stable, lineage-diagnostic characters but share highly dissected steles. Phylogenetic trees "resolve" this conflict by creating an artificial clade (e.g. the parsimony cladogram by Wang et al. 2014). The neighbour-net (Fig. 1) places Osmundacaulis between the Guaireaceae and the Osmundoideae, the subfamily of Osmundaceae including the surviving modern genera.

Fig. 1. Neighbour-net based on a morphological distance matrix of 122 OTUs representing Permian to extant Osmundales and their putative relatives, the Grammatopteridales (black).

Stacking procedure one: identifying closest relatives in subsequent time-slices

Signal ambiguity (from homoplastic characters and the related resolution issue) also affects the time-wise networks to some degree. Figures 2–4 show the network-per-time-slice stacks. Each neighbour-net includes only the OTUs from one stratigraphic period (Permian, Triassic, Jurassic, Cretaceous, Paleogene + Neogene) and the modern-day survivors. For simplicity, links are only established for the closest potential relative in the subsequent or preceding time-slice; and only shown when the mean morphological distance (MD) does not exceed 0.25. The colouring of the dots reflects the systematic affinity of the taxon as established by Bomfleur et al. and shown in Fig. 1.
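The linking rule can be sketched in a few lines. The sketch assumes that the mean morphological distance is the proportion of differing characters among those scored in both taxa (missing data coded as "?"); this is my assumption for illustration, and the taxa and character strings are invented, not taken from the Osmundales matrix.

```python
def mean_distance(a, b):
    """Proportion of differing characters among those scored in both taxa."""
    shared = [(x, y) for x, y in zip(a, b) if x != "?" and y != "?"]
    if not shared:
        raise ValueError("no character scored in both taxa")
    return sum(x != y for x, y in shared) / len(shared)

def closest_links(matrix, slice_a, slice_b, max_dist=0.25):
    """For each taxon in slice_a, link it to its closest taxon in slice_b,
    keeping the link only if the distance does not exceed max_dist."""
    links = []
    for taxon in slice_a:
        best = min(slice_b, key=lambda o: mean_distance(matrix[taxon], matrix[o]))
        d = mean_distance(matrix[taxon], matrix[best])
        if d <= max_dist:
            links.append((taxon, best, d))
    return links

# Hypothetical character matrix: two "Triassic" and two "Jurassic" taxa.
matrix = {
    "T1": "00110",
    "T2": "11000",
    "J1": "00111",
    "J2": "1?0??",  # poorly preserved: only two characters scored
}
print(closest_links(matrix, ["T1", "T2"], ["J1", "J2"]))
# [('T1', 'J1', 0.2), ('T2', 'J2', 0.0)]
```

The same function can be called with the slices swapped to obtain links to the preceding time-slice. Note how the sparsely scored J2 receives a distance of zero from only two comparable characters, which is one way a poorly preserved fossil can attract links.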

A major taxonomic turnover characterises the transition from the (late) Permian to the Triassic (Fig. 2). The most primitive (rhizome-wise) Osmundales, the Thamnopterioideae (brown), become extinct, and are completely replaced by the Osmundoideae, their modern counterparts. The only representative of the Permian diversity remaining in the Triassic appears to be Millerocaulis (?Palaeosmunda) stipabonnetiorum, and this may provide a good taxon for rooting the Triassic phylogeny. However, it is also one of the worst-preserved and most poorly described taxa; to some degree, its similarity with both lineages of Permian Osmundaceae (Thamnopterioideae and Palaeosmunda) may hint that the distances are under-estimated, since traits could not be scored that would otherwise lead to increased distances.

Fig. 2. Taxon-reduced neighbour-nets, including only species from the same time-slice (as labelled). Inter-time-slice links indicate the morphologically closest match in the preceding or following time-slice for each species (in case of pairwise distances < 0.25)

The Jurassic graph (in Fig. 2) highlights a decrease in overall diversity, despite the much higher numbers of OTUs. The links can help to establish relationships between congeners of both time scales; but for Osmundastrum (today represented by a single, genetically and morphologically derived species) a more pronounced evolutionary shift is indicated: the Triassic putative member is linked to Jurassic Millerocaulis species (a paraphyletic Osmundoideae genus defined by the absence of a trait found in all extant genera), which are relatively close to the first unambiguous Osmundastrum. We also find that the three Jurassic newcomers have little relation to the Triassic basis (Fig. 2).

The linking of the Jurassic and Cretaceous time-slices (Fig. 3) highlights a general weakness of the approach using this matrix: more poorly preserved, incompletely described fossils included in the matrix (Cretaceous Millerocaulis) attract most links from the Jurassic Osmundoideae, because their distances are under-estimated.

Fig. 3. As above, but linking the Jurassic and subsequent Cretaceous neighbour-nets. Note the decreasing diversity but clear signals for Osmundacaulis (red) in contrast to the group of modern Osmundoideae (purplish). Plenasium (light blue) is a modern arborescent genus with complex and highly dissected steles and generally derived rhizomes.

The two Osmundastrum, which are probably part of the same evolutionary lineage, are not linked (see Bomfleur, Grimm & McLoughlin 2015 for the reasons). Two modern lineages with more or strongly derived rhizomes appear in the Cretaceous, the Todinae and Plenasium.

In the case of the Todinae the Jurassic links are partly ambiguous, with one Cretaceous OTU linked to Jurassic Claytosmunda (part of the Todinae’s sister clade according to molecular data), but the other with some relatively distinct Millerocaulis. The problem here is that the Todinae may have diverged earlier (Bomfleur, Grimm & McLoughlin 2015; Grimm et al. 2015), but their rhizome fossils have so far not been found (or lack the diagnostic characters of the lineage). Gaps in the fossil record can hinder establishing meaningful links. The links are, however, to a group of Millerocaulis that are closer to coeval Claytosmunda – which show a rhizome anatomy that may be closest to that of the common ancestor of all modern-day king ferns – than to their congeners. In the case of Plenasium, the genus with the most-derived rhizomes of all modern Osmundaceae, the closest older relative is part of the same subgroup of Millerocaulis. These potentially false links may reflect that some Millerocaulis show derived character suites, which are typically found also in one or another modern Osmundaceae genus (similarity due to convergence).

The closer we get to the modern-day situation, the more interpretable the links become (Fig. 4). Lineages with distinct and derived rhizome anatomies such as Osmundastrum and Plenasium are linked across time-slices. Cross-generic links from Cretaceous Millerocaulis to Paleogene-Neogene Osmunda to modern-day Claytosmunda relate directly to higher numbers of shared, possibly primitive characters in the connected taxa; these links can again be informative for rooting the graphs. Substantially weaker links (mean morphological distances > 0.1 between time-slices) are found for distantly related pairings (Cretaceous and extant Todinae with Paleogene-Neogene Osmundastrum and Claytosmunda).

Fig. 4. As above, but for Cretaceous to modern-day.

Stacking procedure two: graphs including taxa of two subsequent time-slices

Figures 5 and 6 show the two-adjacent-time-slices-per-graph stacks. Interpretation of these figures is more straightforward: one just compares the placement of the connecting taxa (Triassic and Jurassic in Fig. 5; Paleogene and Neogene in Fig. 6). The resolution issue regarding the relationship between Millerocaulis and genera representing the modern lineage (Claytosmunda, Osmundastrum, Plenasium, Leptopteris, Todea) is obvious: the Triassic Millerocaulis are clustered in the Permo-Triassic graph, but are placed apart within the spider-web-like portion in the Triassic-Jurassic graph (Fig. 5). This could mean that several lineages of Millerocaulis diversified in the Jurassic, all of which have their roots in the Triassic. Some of the emerging Millerocaulis groups remain coherent in the Jurassic-Cretaceous graph (and can include Cretaceous species), but their position relative to each other can change. In contrast, for Osmundacaulis the Cretaceous newcomers simply fit into the existing organisation.

Fig. 5. Stack of neighbour-nets comprising species of two subsequent time-slices, covering the time from the Permian to the Cretaceous. Connections relate to Triassic (lower half) or Jurassic (upper half) species that are included in two subsequent splits graphs.

The transition from the Cretaceous to the modern-day situation (Fig. 6) fairly reflects what could be inferred by mapping morphological characters onto the molecular tree. The placement of Osmunda species in the graphs reflects evolutionary change towards the modern-day species, whereas stasis can be assumed for Osmundastrum, and a loss of diversity for Claytosmunda. According to the structures of the graphs, the modern-day Plenasium (subgenus Plenasium) replaced the more diverse (and partly more derived) Cretaceous-Paleogene Plenasium (subgenus Aurealcaulis); but the genus is absent from the Neogene, so there are no connections between the ‘65–5 Ma’ and ‘last 25 Ma’ graphs.

Fig. 6. As above, but covering the time from the Cretaceous to now. Connections refer to Paleogene (lower half) and Neogene (upper half) species.

Now that it’s done, what can be said?

Establishing similarity links across time-slices can be tedious or even misleading, especially with increasing numbers of taxa and increasing complexity of the signals in the matrix (Figs 2–3). The process is more time-consuming and the result (Figs 2–4) is graphically more challenging than the alternative stacking procedure (Figs 5–6).

With most real-world data, it may be difficult to get a set of links between time slices that reflect the true phylogeny, as it did in my earlier theoretical example. Nonetheless, the procedure can help to identify potential relatives (ancestors, descendants, sister lineages) of groups that are restricted to a single time slice, or highlight the lack of potential or favourable candidates.

However, in general, joining the taxa from two subsequent time-slices in one graph, and connecting these graphs by the shared taxa, seems to be a more feasible and straightforward approach. Once a matrix is compiled, the distance calculation and splits-graph inference are a matter of minutes, and it takes less than half an hour to produce a first graphical output using the graphical functions in SplitsTree and software to graphically stack the exported SVG or EPS files (further beautification may take a day). Taxa with odd signals (with ambiguous affinity) will be placed accordingly in the nets and eventually move around in the two containing graphs (Fig. 5), and the amount of evolutionary change across time may be directly visible (Fig. 6).

Additional links for readers interested in details

Figure illustrating the history of taxonomic systems for Osmundales.
— An archive including all analysis files generated in the course of the original study is hosted at the Dryad Digital Repository.
— Further annotated versions of the figures shown in this post and the used analysis files have been published under a CC-BY licence: Grimm G. (2017) Osmundales diverstity through time: stacking networks. figshare. https://doi.org/10.6084/m9.figshare.5255014.v1.


Bomfleur B, Grimm GW, McLoughlin S (2015) Osmunda pulchella sp. nov. from the Jurassic of Sweden—reconciling molecular and fossil evidence in the phylogeny of modern royal ferns (Osmundaceae). BMC Evolutionary Biology 15: 126.

Bomfleur B, Grimm GW, McLoughlin S (2017) The fossil Osmundales (Royal Ferns)—a phylogenetic network analysis, revised taxonomy, and evolutionary classification of anatomically preserved trunks and rhizomes. PeerJ 5: e3433.

Grimm GW, Kapli P, Bomfleur B, McLoughlin S, Renner SS (2015) Using more than the oldest fossils: Dating Osmundaceae with the fossilized birth-death process. Systematic Biology 64: 396-405.

Huson DH, Bryant D (2006) Application of phylogenetic networks in evolutionary studies. Molecular Biology & Evolution 23: 254-267.

Maddison WP, Maddison DR (2001 onwards) Mesquite: a modular system for evolutionary analysis.

Wang S-J, Hilton J, He X-Y, Seyfullah LJ, Shao L (2014) The anatomically preserved Zhongmingella gen. nov. from the Upper Permian of China: evaluating the early evolution and phylogeny of the Osmundales. Journal of Systematic Palaeontology 1: 1-22.

Tuesday, July 25, 2017

More on similarities in linguistics

In an earlier blogpost I discussed various reasons for similarity of certain traits in languages. I emphasized four major reasons for similarities, for example, in the lexicon of languages: coincidence, natural reasons, inheritance, and contact (see also List 2014: 55f and Aikhenvald 2007: 5). Despite the problems of distinguishing inherited from borrowed traits (which together I called historical reasons for similarity), controlling for coincidence and history can often be done in a rather straightforward way. Coincidence can be controlled for by applying a frequency criterion: if certain similarities are extremely sporadic, they are usually due to chance. Historical similarities can be detected with the help of classical methods for language comparison. If, using these methods, we know, for example, that two or more languages are genetically related or have been developing in close contact with each other, then we will usually assume that shared traits among them are due to their shared history.

The third group of similarities, on the other hand, which I called natural, is a bit more difficult to interpret, since it is not entirely clear what "natural" means in this context. My earlier example was the word for "mother", which in many languages is expressed as "mama", similar to "father", which is often expressed as "papa", even in languages that we know are not related, or only extremely distantly related (if we assume that language was only invented once). Such words are easy to pronounce, and will thus be acquired rather early by children.

In the case of "mama" and "papa", we can blame our articulatory apparatus, which makes sounds like [m], [p], and [a] very easy to pronounce for all humans, no matter where and when they are born. Calling this "nature" is probably justified, given that pronounceability is not per se characteristic for language as a general means of complex communication. In sign languages, for example, pronounceability does not play any role, as those languages are never pronounced, but expressed with the help of gestures. But even in sign languages, we also find cross-linguistic similarities, which seem to be independent of coincidence or history: body parts, for example, are often expressed iconically, e.g., by pointing to them (see Woodward 1993 for details).

However, not all of those similarities between languages that are not due to history or coincidence are necessarily due to our articulation apparatus. We can think of many different reasons for cross-linguistic similarities, such as, for example, innate settings of the human brain, or global similarities of the environment in which humans live. In the past, colleagues have occasionally pointed out to me the heterogeneity of this class of "natural" similarities. When trying to further subdivide them, the former could be called "similarities due to cognition", while the latter could be called "similarities due to environment". But neither of these two groups seems to be quite satisfying, as we do not really know the relation between environment and cognition. We may also assume that there is a certain influence between the two, and depending on where we draw the border, we would either subscribe to a predominantly Aristotelian viewpoint, where we assign the predominant role to the environment, or a Platonic viewpoint, where we assign it to the innate "ideas" which are given to us along with our brain.

As an example for the difficulty of distinguishing different sources of "natural" similarity, let us have a look at how languages of the world express a fixed set of concepts. In a very simplistic view, given only two things we want to express, for instance the concept "hand" and the concept "arm", we can ask whether a given language will use the same or different words as a rule. English, for example, uses two different words, namely hand and arm, and so does German (Hand and Arm), while Russian uses only one word, ruka, to refer to both concepts in most situations (in Russian, there is another word kist', which can be used to denote "hand", but it is rarely used). We can say that Russian ruka is polysemous, since the word form has at least two meanings. A better way of expressing this is to say that Russian colexifies "hand" and "arm" (François 2008), since the term polysemy has a specific usage in linguistics, referring to words expressing multiple meanings that should be "conceptually close" or "developed from semantic change", which is an extremely vague definition that further requires us to know the history of a given word form and the development of its meanings.

Cross-linguistically, the colexification of "arm" and "hand", i.e. that many languages tend to use a single word to denote both concepts, occurs extremely often in the languages of the world; so often that we can rule out that the use of one word for two concepts is due to coincidence (compare the colexifications of "arm" in the CLICS database by List et al. 2014 through this link). Given that the colexification recurs also in different language families spoken in different regions of the world, we can further rule out historical reasons. This leaves us with the heterogeneous class of "natural reasons for similarities". But what kind of natural similarities are we dealing with here? Are they cognitive? They surely are in some sense, as we can say that humans have good reasons to consider the hand and the arm as one continuous part of their body.

But this continuity is also given by the structure of our body, which itself is given independently of our perception. One could argue that our perception is grounded in our bodily experience, but if we look further into other frequent colexifications, e.g. between "dark" and "black" (this occurs in more than 20 language families, see here), as well as "bright" and "white" (occurs in three language families, see here), our perception is less dependent on our body and more on the environment in which we experience darkness and brightness, since most humans have eyesight and do not live entirely in caves.

It is a kind of chicken-and-egg problem of which was there first, and the more I think about it, the more I prefer to avoid giving any clear-cut preference to either the egg or the hen. We can obviously try to make a more fine-grained distinction between different kinds of non-historical and non-coincidental similarities between languages, but unless psychologists and cognitive scientists solve general problems of perception and environment, it seems that, at least for the moment, "natural similarities" is explicit enough as a term to describe universal patterns in the languages of the world.

  • François, A. (2008) Semantic maps and the typology of colexification: intertwining polysemous networks across languages. In: Vanhove, M. (ed.): From polysemy to semantic change. Benjamins: Amsterdam. 163-215.
  • List, J.-M., T. Mayer, A. Terhalle, and M. Urban (eds.) (2014) CLICS: Database of Cross-Linguistic Colexifications. Forschungszentrum Deutscher Sprachatlas: Marburg. http://www.webcitation.org/6ccEMrZYM.
  • List, J.-M., M. Cysouw, and R. Forkel (2016) Concepticon. A resource for the linking of concept lists. In: Proceedings of the Tenth International Conference on Language Resources and Evaluation, 2393-2400.
  • Woodward, J. (1993) Lexical evidence for the existence of South Asian and East Asian sign language families. Journal of Asian Pacific Communication 4.2: 91-107.

Tuesday, July 18, 2017

Stacking neighbour-nets: ancestors and descendants

Although they are not phylogenetic networks in an evolutionary sense, neighbour-nets are amazingly efficient when it comes to depicting actual ancestor-descendant relationships. Spencer et al. (2004), for example, applied various methods of phylogenetic inference to reconstruct a known phylogeny of scripts, text copies made by scribes based on an original text, or copies of that text. They found that the neighbour-net algorithm produced a graph that best depicts the actual ancestor (original text) to descendant (text copy made by a scribe) relationships. The reason is relatively simple: neighbour-nets are well-suited to extract and reflect differentiation patterns from a distance matrix.

In this post I will explore this idea in some detail, because I think that it has important practical implications (which I will further elaborate in future posts). If data are available for time slices of the evolutionary history, then it is possible to stack a series of networks that describe each time slice, thus providing a much more comprehensively inferred genealogy.


If a distance matrix reflects exactly the phylogeny (i.e. if the signal from the matrix is trivial), then the neighbour-net looks like a tree, with the ancestors placed seemingly at the internal nodes (medians) of the subtree(s) containing their descendants — that is, the neighbour-net looks like a median network (Figure 1). In fact, this is just mimicry. The ancestors are not actually placed on the internal nodes (medians) but are connected by zero-length edge bundles to the centre of the graph (the roots of their descendants).

Figure 1. Neighbour-net inferred from a perfect distance matrix (from Denk & Grimm 2009)

In reality, distance matrices will not be exact reflections of the phylogeny (the ‘true’ tree) but distorted; and this will be reflected also in the neighbour-net (Fig. 2).

Figure 2. Neighbour-net inferred from an imperfect distance matrix (cf. Denk & Grimm 2009, fig. 1)

I superimposed a potentially inferred tree on Figure 2 to highlight some distorting effects. Note that this could be a tree optimised using a distance criterion such as minimum evolution or least-squares, or a tree optimised under maximum likelihood or maximum parsimony (one of the alternative topologies found in the sample of equally parsimonious trees).

The following distorting effects may apply during tree-inference: (i) a misplaced outgroup-inferred ingroup root (due to convergences shared by the outgroup and members of lineage B), (ii) lineage B is dissolved into a grade, although it should be a clade (convergences shared by members of lineages A and B), and (iii) the all-ingroup ancestor is resolved as the sister to lineage A, but not B. The neighbour-net illustrates the uncertainties of placing several taxa (the box-like parts of the graph), while keeping the ancestors equally close to their descendants. This provides information lost (or overlooked) when just inferring a tree (but not by exploratory data analysis of e.g. the bootstrap support patterns).

The basics — why stack networks?

Figure 3 shows a hypothetical evolution of a phylogenetic lineage in a two-dimensional morpho-space. The common ancestor (black dot) gives rise to two, morphologically somewhat distinct lineages (bluish vs. reddish coloured). The lineages evolve over time and diverge again. The overall differentiation within the group increases, but eventually the potential niche/morphospace is filled. When looking only at the final situation (i.e. the modern-day situation), we may be tempted to infer wrong relationships based on the morphological distinctness. One daughter lineage each of the blue and red lineages evolved into similar niches and obtained somewhat similar morphological character suites, substantially distinct from those of their respective sister taxa. Translating this situation into a distance matrix (or a character matrix) to infer a tree will yield a wrong topology once long-branch attraction steps in, one that recognises one or two blue-red sister pair(s).

Figure 3. Evolution (vertically) and diversification of a lineage in a two-dimensional morphospace (horizontally)

Adding all of the ancestors can help to escape long-branch attraction (Figure 4; see also Wiens 2005). We don’t find red111 as sister to blu11, but still resolve the wrong sister relationship between the overall too-similar endpoints of the converging red and blue sub-lineages (red222 and blu22). The remainder of the red lineage is erroneously dissolved into (A) an ancient sister lineage (red0); (B) a real clade (red1, red11, red111), the red sub-lineage most distinct from the blue lineage; and (C) a grade (red2, red22), “basal” (wrong terminology, but often used) to the blue lineage, collecting the older members of the red sub-lineage evolving towards the morphospace of the blue lineage.

Figure 4. Neighbour-joining tree inferred on a distance matrix exactly reflecting the pairwise distances along both axes in the 2-dimensional morphospace

Only a few, (data-wise) trivial branches of the true tree (broadened green edges) are found in the inferred tree. An obvious defect of this tree is that the phylogenetic distance, the sum of branch lengths between two tips, does not reflect the pairwise distances encoded in the matrix. For instance, the all-ancestor should be equally distant from both of the ancestors of the red and blue lineages (red0, blu0).
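The mismatch between tree-path distances and matrix distances can be checked mechanically. The following is only a minimal sketch in Python, not the procedure used for the figures: the tree, its branch lengths, the internal node name n1, and the matrix values are all invented for illustration, and only the taxon names red0 and blu0 echo the example.

```python
from itertools import combinations

# Hypothetical toy tree (branch lengths invented for illustration):
# the all-ancestor "anc" joins an internal node "n1", from which red0 and blu0 hang.
TREE = {
    "anc": {"n1": 0.1},
    "n1": {"anc": 0.1, "red0": 0.1, "blu0": 0.3},
    "red0": {"n1": 0.1},
    "blu0": {"n1": 0.3},
}

# Hypothetical matrix distances: the all-ancestor is equally distant from red0 and blu0.
MATRIX = {
    frozenset(("anc", "red0")): 0.2,
    frozenset(("anc", "blu0")): 0.2,
    frozenset(("red0", "blu0")): 0.4,
}


def patristic_distance(tree, start, goal):
    """Sum of branch lengths along the unique path between two nodes of a tree."""
    stack = [(start, None, 0.0)]
    while stack:
        node, parent, dist = stack.pop()
        if node == goal:
            return dist
        for neighbour, length in tree[node].items():
            if neighbour != parent:  # never walk back along the edge we came from
                stack.append((neighbour, node, dist + length))
    raise ValueError(f"no path between {start} and {goal}")


def distortion(tree, matrix):
    """Absolute gap between tree-path distance and matrix distance for each taxon pair."""
    taxa = sorted({taxon for pair in matrix for taxon in pair})
    return {
        (a, b): abs(patristic_distance(tree, a, b) - matrix[frozenset((a, b))])
        for a, b in combinations(taxa, 2)
    }


if __name__ == "__main__":
    for pair, gap in distortion(TREE, MATRIX).items():
        print(pair, round(gap, 3))
```

In this toy case the tree places the all-ancestor twice as far from blu0 as from red0, although the matrix says the two distances are equal; a non-zero gap of this kind is the defect described above.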

The neighbour-net shows something rather different (Figure 5). A large box can be seen, referring to the highly incompatible signal induced by the ancestors and their direct and subsequent descendants.

Figure 5. Neighbour-net splits graph inferred on the same distance matrix

Red222 and blu22, the false sisters, are placed next to each other and share an edge bundle. They are most similar to each other and increasingly distinct from all other taxa included in the analysis. However, their nearest relatives are their actual ancestors (red22 and blu2), which are already quite distinct from each other, and show affinities to further members to their clade but not to the other clade (the topology of the tree from Figure 4 is shown in yellow in Figure 5).

The network includes the true tree in addition to edge bundles referring to wrong alternatives (induced by imperfect data; in this case, convergence due to evolution into a similar morphospace). The phylogenetic distances between two tips (via alternative pathways) reflect much better the pairwise distances (almost exactly). Thus, the neighbour-net is a much more comprehensive display of the actual signals in the imperfect matrix than a tree could ever be (independent of the optimisation criterion used).

The fossils included in our example represent a time sequence, an actual change in time. With that information as background, we can much more easily access the neighbour-net’s structure (Figure 6).

Figure 6. The same neighbour-net with time-slices

Already in the second time-slice the lineage diverged into two distinct branches represented by red0 and blu0. Both evolved (blu0→blu00) and diverged (red1, red2) in the next time slice, and so on. The fact that the neighbour-net is not a phylogenetic graph (in an evolutionary sense) becomes a strength. In any tree, we would need to deduce, or be tempted to deduce, (inclusive) common origins (Hennig’s monophyly) from the clades exhibited in the rooted version of the tree. Here, rooting the tree with the most natural root, the oldest representative of the lineage (the ancestor), misplaces the two representatives of the second time-slice along with those involved in subsequent radiations.

Stacking networks

There are two simple ways to trace the change in differentiation patterns through time using stacked networks: (A) Generate a series of networks per time-slice and identify the closest relatives in the next (and/or preceding) time-slice, or (B) Generate networks that combine the taxa of two subsequent time-slices.

Figure 7. A sequence of neighbour-nets, with the taxa filtered by age. (These are actual reconstructions inferred by SplitsTree based on the taxon-filtered subsets of the all-inclusive distance matrix.)

My example only includes up to four co-eval taxa, and hence the neighbour-nets are trivial graphs for each time-slice (Figure 7). The connecting lines between the time-slices indicate the closest and next-closest possible descendants (as defined by the smallest and next-smallest distance) of each earlier taxon. The thickness of the connections reflects their absolute similarity — the thickest lines indicate a morphological pairwise distance of 0.13, and the thinnest are distances of > 0.33. The colour indicates whether the connection reflects a true (green) or false (yellow) relationship.
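The inter-time-slice connections described here can be derived from the distance matrix in a few lines. The sketch below is a hedged illustration in Python, not the way the figure was produced (those connections were drawn by hand); the function name, the data layout, and all distance values are assumptions made for the example.

```python
def nearest_in_next_slice(dist, slices, k=2):
    """For each taxon, list the k closest taxa of the following time-slice.

    dist   : dict mapping frozenset({a, b}) to the pairwise distance
    slices : list of lists of taxon names, oldest time-slice first
    """
    links = {}
    for older, younger in zip(slices, slices[1:]):
        for taxon in older:
            # Rank candidate descendants by increasing distance to the older taxon.
            ranked = sorted(younger, key=lambda other: dist[frozenset((taxon, other))])
            links[taxon] = [(other, dist[frozenset((taxon, other))]) for other in ranked[:k]]
    return links


# Hypothetical time-slices and invented distances (only the taxon names echo the example).
SLICES = [["anc"], ["red0", "blu0"], ["red1", "red2", "blu00"]]
DIST = {
    frozenset(("anc", "red0")): 0.20,
    frozenset(("anc", "blu0")): 0.20,
    frozenset(("red0", "red1")): 0.13,
    frozenset(("red0", "red2")): 0.15,
    frozenset(("red0", "blu00")): 0.40,
    frozenset(("blu0", "red1")): 0.45,
    frozenset(("blu0", "red2")): 0.35,
    frozenset(("blu0", "blu00")): 0.13,
}

if __name__ == "__main__":
    for taxon, candidates in nearest_in_next_slice(DIST, SLICES).items():
        print(taxon, "->", candidates)
```

The distance returned with each link could then be mapped to line thickness. Taxa of the youngest time-slice get no entry, since they have no following slice to connect to.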

Analysed this way, the matrix’s signal appears quite perfect in relation to the true tree. One can trace the increasing diversity within the clade (all descendants of the all-ancestor), as well as the misleading decreasing distance between the blu2/blu22 and red22/red222 lineages. In this case, the same stacking procedure would also work with trees, as the neighbour-nets are trivial and very tree-like. With real-world data, the differences may be more profound.

With more complex or less complete data, we have a higher risk that ancestor-descendant relationships will not be straightforwardly identified by highest-similarity pairs. Missing data, for instance, can result in distances that mask or over-estimate the actual phylogenetic distances between an older and a younger taxon. Using the stacking procedure illustrated in Figure 7, such problems can become visible in the form of related taxa from the same time-slice that are connected to unrelated or distantly related taxa in the preceding or following time-slices.

But how to identify more likely candidates? One possibility is to assess potential phylogenetic relationship by combining taxa of two subsequent time-slices. The connectives between the reconstructions are then straightforward: each taxon is always used in two different reconstructions (Figure 8). This procedure allows us to establish the phylogenetic affinities of a taxon with respect to co-eval and older taxa (potential siblings and ancestors) or co-eval and younger taxa (potential siblings and descendants).
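Preparing the input for this second procedure amounts to cutting overlapping sub-matrices out of the all-inclusive distance matrix, one per pair of subsequent time-slices. A minimal Python sketch follows, assuming the same hypothetical data layout as a dictionary of pairwise distances; the function name and all values are invented, and each resulting sub-matrix would still have to be exported to whatever network software is used.

```python
def paired_slice_matrices(dist, slices):
    """Yield (taxa, square distance matrix) for each pair of subsequent time-slices."""
    for older, younger in zip(slices, slices[1:]):
        taxa = list(older) + list(younger)
        matrix = [
            [0.0 if a == b else dist[frozenset((a, b))] for b in taxa]
            for a in taxa
        ]
        yield taxa, matrix


# Hypothetical time-slices and invented distances for illustration only.
SLICE_TAXA = [["anc"], ["red0", "blu0"], ["red1", "blu00"]]
PAIR_DIST = {
    frozenset(("anc", "red0")): 0.20,
    frozenset(("anc", "blu0")): 0.20,
    frozenset(("red0", "blu0")): 0.40,
    frozenset(("red0", "red1")): 0.13,
    frozenset(("red0", "blu00")): 0.45,
    frozenset(("blu0", "red1")): 0.45,
    frozenset(("blu0", "blu00")): 0.13,
    frozenset(("red1", "blu00")): 0.50,
}

if __name__ == "__main__":
    for taxa, matrix in paired_slice_matrices(PAIR_DIST, SLICE_TAXA):
        print(taxa)
        for row in matrix:
            print("  ", row)
```

Taxa of the middle time-slices appear in two sub-matrices, once with their potential ancestors and once with their potential descendants; those of the oldest and youngest slice appear only once.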

Figure 8. A sequence of neighbour-nets, each one including the taxa of two subsequent time-slices

Stacking networks — what’s next?

The first thing, obviously, is to test the suggested procedures for real-world data, involving groups with a dense and well-studied fossil record. I will provide a real-world example in my next post using the matrix we put up for our systematic revision of Osmundaceae (King Ferns) rhizomes (Bomfleur, Grimm & McLoughlin 2017).

Simulations may help to identify misleads caused by missing data, and the resulting distorted distance matrices, and non-comprehensively sampled (time-wise) phylogenies. They may also be informative regarding whether consensus networks reflecting competing branch support can be used for similar approaches.

Programmers are needed, too. For my graphics, I established the inter-time-slice connectives by hand; but it would be handy to have a programme environment that can do this.


Bomfleur B, Grimm GW, McLoughlin S. 2017. The fossil Osmundales (Royal Ferns)—a phylogenetic network analysis, revised taxonomy, and evolutionary classification of anatomically preserved trunks and rhizomes. PeerJ 5: e3433. Open access: https://peerj.com/articles/3433/

Denk T, Grimm GW. 2009. The biogeographic history of beech trees. Review of Palaeobotany and Palynology 158: 83-100.

Spencer M, Davidson EA, Barbrook AC, Howe CJ. 2004. Phylogenetics of artificial manuscripts. Journal of Theoretical Biology 227: 503-511.

Wiens JJ. 2005. Can incomplete taxa rescue phylogenetic analyses from long-branch attraction? Systematic Biology 54: 731-742. Open access: https://academic.oup.com/sysbio/article-lookup/doi/10.1080/10635150500234583

Tuesday, July 11, 2017

The curious case of the word “stemma” — from circles to trees

Each word has its own history, according to a maxim attributed to Jules Gilliéron that makes some historical linguists tremble. One with a curious history is the word stemma (plural stemmata), which we stumble upon when investigating the development of phylogenetic trees.

David has been exploring this question for some time, showing how the origin ultimately lies in the alternative to the hierarchical model of the Aristotelian "scale" offered by the practice (and the metaphors) of genealogies and pedigrees. While dealing with possible influences on 19th-century biology, I have explored a different field, stemmatics ("textual criticism"), which shares with genealogical practices both the tree model and, obviously, the word stemma. As stemmatics is one of the first scientific approaches to the idea, and considering that the now widespread tree is likely a calque of German Baum, itself a calque of stemma, it is worth writing a bit about the history of this word.

Stemma / stemmata

Dictionaries (as well as my queries in Google Books, which would probably fail to impress a reviewer) agree that not even in Romance languages does this Latin word display an uninterrupted tradition from the time of Caesar. It only entered languages such as English and Italian, with the meaning of "genealogical tree; pedigree; nobility", from the mid 17th century on. This date supports the theory that family pedigrees were not commonplace before the 17th century (when they became a true fashion, as in Strein, 1559), despite being drawn since the Middle Ages and always discussed — as in royal disputes or in the case of the genealogies of Jesus found in the Gospels (likely drawn to confirm the messianic claims with Jewish criteria, but assimilated to the European mindset). In short, modern stemmata are mostly a product of Neoclassical fashion, and their popularity was influenced by the same descriptions of Roman pedigrees where the word was learned.

Speaking of pedigrees, this Latin word is of Greek descent, a loanword of στέμμα [stémma], meaning "wreath, crown". This sense was already a development of an original "that which surrounds; circle": in Homer's Iliad, for example, we still find an occurrence in the first sense of "circle" (of warriors, cf. XIII.736); but elsewhere the word refers to a laurel-wreath wound around a staff, mostly in the plural and in relation to the laurel god Apollo (cf. Iliad, I.14, I.28, and I.373). The development is due to the custom of awarding crowning wreaths, with στέμμα deriving from the verb στέφειν [stéphein] ("to encircle, to crown, to wreathe, to tie around") by the addition of the morpheme -μα [-ma], used to form nouns denoting the result of an action, as in the analogous case of γράφω (gráphō, "write") and γράμμα (grámma, "that which is written"). Our word ultimately derives from the Proto-Indo-European root *stebh- "post, stem; to place firmly on; to fasten", related to English "(to) step" and "staff".

Theodosius offers a laurel wreath to the victor;
on the base of the obelisk in the Hippodrome (Istanbul)
[source: Wikipedia]

The "wreath, garland, chaplet" meaning is attested in Ancient Greek literature of all times and genres, such as in tragedy (cf. Euripides, Andromache, 894), comedy (cf. Aristophanes, Wealth, 39), philosophy (cf. Plato, Republic, 617c), and historiography (cf. Thucydides, Peloponnesian War, IV.133). At least one metaphoric usage is attested, in the sense of "web/tangle of life" (cf. Euripides, Orestes, 12), and various inscriptions indicate an additional meaning of "guild" (such as in one "guild of huntsmen" epigraph quoted by Liddell & Scott, 1940). The genealogical meaning is only found in later Greek authors like Plutarch (1st century CE), suggesting that it was imported from Latin.

The Roman meaning developed from the custom of decorating the portraits of one’s ancestors, sometimes in elaborate full-wall genealogies, with laurel wreaths indicating both excellence and nobility (as "noble" pretty much meant "descending from gods"). Domestic cults were central to Roman religion, and this practice seems to have become so widespread in Imperial times that it turned into a banality, with the laurel decoration being decried as a symbol of vanity by poets and philosophers alike. The custom – and the usage of stemma for "genealogical tree" – is mentioned twice by Seneca in essays of utmost importance for Roman Stoicism. In Ad Lucilium Epistulae Morales, XLIV.1, he says:
Si quid est aliud in philosophia boni, hoc est, quod stemma non inspicit. Omnes, si ad originem primam revocantur, a dis sunt. [Philosophy also has this advantage: it does not look at your genealogical tree. Everyone, if we look at their remotest origin, descends from the gods].
A similar reference, with a more detailed description of the practice, is found in his De Beneficiis, III.28:
We all spring from the same source, have the same origin; no man is nobler than another except in so far as the nature of one man is more upright and more capable of good actions. Those who display ancestral busts in their halls [qui imagines in atrio exponunt], and place in the entrance of their houses the names of their family, arranged in a long row and entwined in the multiple ramifications of a genealogical tree [ac multis stemmatum illigata flexuris] – are these not notable rather than noble? Heaven is the one parent of us all, whether from his earliest origin each one arrives at his present degree by an illustrious or obscure line of ancestors. You must not be duped by those who, in making a review of their ancestors, wherever they find an illustrious name lacking, foist in the name of a god. [adapted from the translation of Basore, 1935]

A golden laurel wreath, probably originating from Cyprus, 4th-3rd century BC
[source: Wikipedia]

In matters of phylogenetics, to prove that something existed is usually not enough, as we should try to demonstrate its influence and descent. Both are clear in the case of Seneca: his moral essays were read and copied without interruption in the early years of Christianity, proliferated during the Carolingian Renaissance, and were among the most published works of secular Western literature for centuries. The Wikipedia article on the second essay is well referenced on the matter:
Three translations were made into English during the sixteenth and early seventeenth century. The first translation at all into English was made in 1569 by Nicolas Haward, of books one to three, while the first full translation into English was made in 1578 by Arthur Golding, and the second in 1614 by Thomas Lodge. Roger L'Estrange made a relevant work in 1678, he had been making efforts on Seneca's works since at least 1639. A partial Latin publication of books 1 to 3, being edited by M. Charpentier & F. Lemaistre, was made circa 1860, books 1 to 3 were translated into French by de Wailly, and a translation into English was made by JW. Basore circa 1928-1935.
The new meaning of the word is confirmed by many other authors popular in Medieval times, and especially after the Renaissance, such as Suetonius (cf. Nero, 37; Galba, 2) and Statius (cf. Silvae, 3). Pliny the Elder's Naturalis Historia, obligatory reading for all Western scholars from the Renaissance to at least the 19th century, is another important source. When setting out the history of Roman art and discussing the honor attached to portraits, Pliny mentions that "in ancient times" people took great care over faithful likeness, when "portraits modeled in wax were arranged, each in its separate niche, to be always in readiness to accompany the funeral processions of the family [... while the] pedigree [stemmata] of the individual was traced in lines upon each of these coloured portraits" (XXXV.6, adapted from Bostock, 1855).

The last important source to note is the eighth Satire of Juvenal, on the paradoxes of the Roman aristocracy, where the word stemma, as usual in the plural, is used to open the poem:
Stemmata quid faciunt? quid prodest, Pontice, longo / sanguine censeri, pictos ostendere uultus / maiorum et stantis in curribus Aemilianos / et Curios iam dimidios umeroque minorem / Coruinum et Galbam auriculis nasoque carentem [Genealogies, what are they worth? What is in it for you, Ponticus, in being judged by ancient bloodline, in flaunting the portraits of your ancestors, the Aemilians standing on chariots, only half of the Curii, a Corvinus devoid of shoulders, and a Galba missing ears and nose?]
Sources suggest that the new meaning was well established by the reign of Hadrian (2nd century CE), including the derivative meanings of "high value" (cf. Martial, Satyra, VIII.6) and "antique", as in Prudentius (cf. Liber Cathemerinon, VII.81), a Christian author much read in Medieval times. As already mentioned, the word even found its way back into Greek with the new semantic shift, such as in Plutarch, one of the most popular Greek authors since the Renaissance. In his Numa, 1, we find:
ἔστι δὲ καὶ περὶ τῶν Νομᾶ τοῦ βασιλέως χρόνων, καθ᾽ οὓς γέγονε, νεανικὴ διαφορά, καίπερ ἐξ ἀρχῆς εἰς τοῦτον κατάγεσθαι τῶν στεμμάτων ἀκριβῶς δοκούντων ["There is likewise a vigorous dispute about the time at which King Numa lived, although from the beginning down to him the genealogies seem to be made out accurately"; Perrin, 1914].
It is somewhat ironic that the accusations of futility and uselessness of genealogical trees probably contributed to the Medieval and Renaissance restoration of such practices. Informed about the Roman tradition, and equipped with examples from nobility and religion, people turned genealogy and its trees into a fashion. This helped to lay the ground for the acceptance of the tree model when new scientific endeavors required a better way to describe things, like dog breeds and strawberry varieties, especially when non-ascending genealogies (who descends from whom, instead of who are the ancestors of whom) were already common, and when the concept of the "tree of life" gained a new popularity.

Neptune's genealogy as per Boccaccio.
Paris: Louis Hornken, 1511. [source]

  • Aristophanes (1938) Wealth. The Complete Greek Drama, vol. 2. Eugene O'Neill, Jr. New York: Random House.
  • στέμμα in Autenrieth, Georg (1891) A Homeric Dictionary for Schools and Colleges. New York: Harper and Brothers.
  • στέμμα in Bailly, Anatole (1935) Le Grand Bailly: Dictionnaire grec-français. Paris: Hachette.
  • Euripides (forthcoming) Euripides, with an English translation by David Kovacs. Cambridge MA: Harvard University Press.
  • Euripides (1938) The Complete Greek Drama, edited by Whitney J. Oates and Eugene O'Neill, Jr. in two volumes. New York: Random House.
  • stemma in Lewis, Charlton T.; Short, Charles (1879) A Latin Dictionary. Founded on Andrews' edition of Freund's Latin dictionary. Oxford: Clarendon Press.
  • στέμμα in Liddell & Scott (1940) A Greek–English Lexicon. Oxford: Clarendon Press.
  • στέμμα in Liddell & Scott (1889) An Intermediate Greek–English Lexicon. New York: Harper & Brothers.
  • Omero (1990) Iliade. Traduzione di Rosa Calzecchi Onesti. Torino: Giulio Einaudi editore.
  • Plato (1903) Platonis Opera, ed. John Burnet. Oxford: Oxford University Press.
  • Pliny the Elder (1855) The Natural History. John Bostock, H.T. Riley. London. Taylor and Francis.
  • Plutarch (1914) Plutarch's Lives, with an English translation by Bernadotte Perrin. Cambridge MA: Harvard University Press; London: William Heinemann.
  • Seneca (1917-1925) Ad Lucilium Epistulae Morales, volume 1-3. Richard M. Gummere. Cambridge MA: Harvard University Press; London: William Heinemann.
  • Seneca, Lucius Annaeus (1928-1935) Moral Essays. Translated by John W. Basore. The Loeb Classical Library. London: W. Heinemann. 3 vols.: Volume III.
  • Statius, P. Papinius (1928) Statius, Vol I. John Henry Mozley. London: William Heinemann; New York: G.P. Putnam's Sons.
  • Strein, Richardus (1559) Gentium et familiarum Romanorum stemmata. Paris[?]: Henr. Stephanus.
  • Suetonius (1889) The Lives of the Twelve Caesars; An English Translation, Augmented with the Biographies of Contemporary Statesmen, Orators, Poets, and Other Associates. Translated by Alexander Thomson, edited by J. Eugene Reed. Philadelphia: Gebbie & Co.
  • Thucydides (1942) Historiae in two volumes. Oxford: Oxford University Press.

Tuesday, July 4, 2017

Should we try to infer trees on tree-unlikely matrices?

Spermatophyte morphological matrices that combine extinct and extant taxa notoriously have low branch support, as traditionally established using non-parametric bootstrapping under parsimony as the optimality criterion. Coiro, Chomicki & Doyle (2017) recently published a pre-print to show that this can be overcome to some degree by changing to Bayesian-inferred posterior probabilities. They also highlight the use of support consensus networks for investigating potential conflict in the data. This is a good start for a scientific community that so far has put more of its trust in either (i) direct visual comparison of fossils with extant taxa or (ii) collections of most parsimonious trees inferred from matrices with a high level of probably homoplasious characters and low compatibility. But do those matrices really require or support a tree? Here, I try to answer this question.


Coiro et al. mainly rely on a recent matrix by Rothwell & Stockey (2016), which marks the current endpoint of a long history of putting up and re-scoring morphology-based matrices (Coiro et al.’s fig. 1b). All of these matrices provide, to various degrees, ambiguous signal. This is not overly surprising, as these matrices include a relatively high number of fossil taxa with many data gaps (due to preservation and scoring problems), and combine taxa that perished a hundred or more million years ago with highly derived, possibly distantly related modern counterparts.

Rothwell & Stockey state (p. 929) "As is characteristic for the results from the analysis of matrices with low character state/taxon ratios, results of the bootstrap analysis (1000 replicates) yielded a much less fully resolved tree (not figured)." Coiro et al.’s consensus trees and network based on 10,000 parsimony bootstrap replicates nicely depict this issue, and may explain why Rothwell & Stockey decided against showing those results. When studying an earlier version of their matrix (Rothwell, Crepet & Stockey 2009), they did not provide any support values, citing a paper published in 2006, where the authors state (Rothwell & Nixon 2006, p. 739): “… support values, whether low or high for particular groups, would only mislead the reader into believing we are presenting a proposed phylogeny for the groups in question. Differences among most-parsimonious trees are sufficient to illuminate the points we wish to make here, and support values only provide what we consider to be a false sense of accuracy in these assessments”.

Do the data support a tree?

The problem is not just low support. In fact, the tree shown by Rothwell & Stockey with its “pectinate arrangement” conflicts in parts with the best-supported topology, a problem that also applied to its 2009 predecessor. This general “pectinate” arrangement of a large, poorly supported or unsupported grade is not uncommon for strict consensus trees based on morphological matrices that include fossils and extant taxa (see e.g. the more proximal parts of the Tree of Life, e.g. birds and their dinosaur ancestors).

The support patterns indicate that some of the characters are compatible with the tree, but many others are not. Of the 34 internodes (branches) in the shown tree (their fig. 28 shows a strict consensus tree based on a collection of equally parsimonious trees), 12 have lower bootstrap support under parsimony than their competing alternatives (Fig. 1). Support may be generally low for any alternative, but the ones in the tree can be among the worst.
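The comparison between a branch in the published tree and its competing alternatives can be sketched in a few lines of Python. This is only an illustration, not the analysis behind Fig. 1: it assumes that each bootstrap replicate tree has already been reduced to its list of splits (taxon bipartitions), and the function names and the five-taxon toy data are hypothetical.

```python
from collections import Counter

def normalise(split, taxa):
    """Represent a bipartition by the side that excludes the first taxon."""
    side = frozenset(split)
    return frozenset(taxa) - side if taxa[0] in side else side

def compatible(a, b, taxa):
    """Two splits fit on one tree if at least one of the four intersections is empty."""
    everything = frozenset(taxa)
    a_rest, b_rest = everything - a, everything - b
    return (not (a & b) or not (a & b_rest)
            or not (a_rest & b) or not (a_rest & b_rest))

def split_support(replicates, taxa):
    """Percentage of replicate trees that contain each split."""
    counts = Counter()
    for tree in replicates:
        for split in {normalise(s, taxa) for s in tree}:
            counts[split] += 1
    return {s: 100.0 * c / len(replicates) for s, c in counts.items()}

def best_conflict(split, support, taxa):
    """Best-supported split that is incompatible with the given one, or None."""
    target = normalise(split, taxa)
    rivals = [(bs, s) for s, bs in support.items()
              if not compatible(target, s, taxa)]
    return max(rivals, key=lambda pair: pair[0]) if rivals else None

# Hypothetical toy data: four replicate trees on five taxa.
taxa = ["A", "B", "C", "D", "E"]
replicates = [
    [{"A", "B"}, {"D", "E"}],
    [{"A", "C"}, {"D", "E"}],
    [{"A", "C"}, {"D", "E"}],
    [{"A", "C"}, {"D", "E"}],
]
support = split_support(replicates, taxa)
```

A branch of the published tree is then "worse than its alternative" whenever `best_conflict` returns a support value higher than the branch's own entry in `support`.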

The main problem is that the matrix simply does not provide enough tree-like signal to infer a tree. Delta Values (Holland et al. 2002) can be used as a quick estimate for the treelikeness of signal in a matrix. In the case of large all-spermatophyte matrices (Hilton & Bateman 2006; Friis et al. 2007; Rothwell, Crepet & Stockey 2009; Crepet & Stevenson 2010), the matrix Delta Values (mDV) are ≥ 0.3. For comparison, molecular matrices resulting in more or less resolved trees have mDVs of ≤ 0.15. The individual Delta Values (iDV), which can be an indicator of how well a taxon behaves during tree inference, go down to 0.25 for extant angiosperms – very distinct from all other taxa in the all-spermatophyte matrices with low proportions of missing data/gaps – and reach values of 0.35 for fossil taxa with long-debated affinities.
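For readers who want to see what a Delta Value measures, here is a minimal Python sketch of the quartet-based calculation described by Holland et al. (2002). For each quartet of taxa, the three possible pairwise distance sums are sorted; the delta value is the difference between the largest and the middle sum divided by the difference between the largest and the smallest. It is 0 for perfectly tree-like distances and approaches 1 for maximally box-like ones. The function names are my own, and setting delta to 0 when all three sums are equal is an assumed convention.

```python
from itertools import combinations

def quartet_delta(d, x, y, u, v):
    """Delta value of one quartet, given a symmetric distance matrix d."""
    smallest, middle, largest = sorted([
        d[x][y] + d[u][v],
        d[x][u] + d[y][v],
        d[x][v] + d[y][u],
    ])
    if largest == smallest:
        return 0.0  # assumed convention for a completely unresolved quartet
    return (largest - middle) / (largest - smallest)

def delta_values(d):
    """Return (matrix delta value, list of individual delta values)."""
    n = len(d)
    per_taxon_sum = [0.0] * n
    per_taxon_count = [0] * n
    total, quartets = 0.0, 0
    for quartet in combinations(range(n), 4):
        dv = quartet_delta(d, *quartet)
        total += dv
        quartets += 1
        for taxon in quartet:
            per_taxon_sum[taxon] += dv
            per_taxon_count[taxon] += 1
    individual = [s / c for s, c in zip(per_taxon_sum, per_taxon_count)]
    return total / quartets, individual

# Tree-like toy distances: ((A,B),(C,D)) with all branch lengths equal to 1.
tree_like = [
    [0, 2, 3, 3],
    [2, 0, 3, 3],
    [3, 3, 0, 2],
    [3, 3, 2, 0],
]
# Box-like toy distances: four taxa on the corners of a square.
box_like = [
    [0, 1, 2, 1],
    [1, 0, 1, 2],
    [2, 1, 0, 1],
    [1, 2, 1, 0],
]
```

The matrix value averages over all quartets, the individual value over all quartets that include a given taxon, which is why a single poorly scored fossil can have a high iDV without changing the mDV much. The number of quartets grows with the fourth power of the number of taxa, so this brute-force loop is only practical for small matrices.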

The newest 2016 matrix is no exception with an mDV of 0.322 (the highest of all mentioned matrices), and iDVs range between 0.26 (monocots and other extant angiosperms) and 0.39 for Doylea mongolica (a fossil with very few scored characters). In the original tree, Doylea (represented by two taxa) is part of the large grade and indicated as the sister to Gnetidae (or Gnetales) + angiosperms (molecular trees associate the Gnetidae with conifers and Ginkgo). According to the bootstrap analysis, Doylea is closest to the extant Pinales, the modern conifers. Coiro et al. found the same using Bayesian inference. Their posterior probability (PP) of a Doylea-Podocarpus-Pinus clade is 0.54, and Rothwell & Stockey’s Doylea-Ginkgo-angiosperm clade conflicts with a series of splits with PPs up to 0.95.

Figure 1. Parsimony bootstrap network based on 10,000 pseudoreplicate trees
inferred from the matrix of Rothwell & Stockey.
Edges not found in the authors’ tree in red, edges also found in the tree in green.
Extant taxa in blue bold font. The edge length is proportional to the frequency of the
corresponding split (taxon bipartition, branch in a possible tree) in the pseudoreplicate
tree sample. The network includes all edges of the authors’ tree except for
Doylea + Gnetidae + Petriellales + angiosperms vs. all other gymnosperms and
extinct seed plant groups. Such a split also has no bootstrap support (BS < 10)
using least-squares and maximum likelihood optimality criteria.

Do the data require a tree?

As David pointed out in an earlier post, neighbour-nets are not really “phylogenetic networks” in the evolutionary sense. Being unrooted and 2-dimensional, they don’t depict a phylogeny, which has to be a sort of (rooted) tree, a one-dimensional graph with time as the only axis (this includes reticulation networks where nodes can be the crossing point of two internodes rather than their divergence point). The neighbour-net algorithm is an extension into two dimensions of the neighbour-joining algorithm; the latter infers a phylogenetic tree serving a distance criterion such as minimum evolution or least-squares (Felsenstein 2004). Essentially, the neighbour-net is a ‘meta-phylogenetic’ graph inferring and depicting the best and second-best alternative for each relationship. Thus, neighbour-nets can help to establish whether the signal from a matrix, whether treelike or not (as is the case here), supports potential phylogenetic relationships, and to explore the alternatives much more comprehensively than would be possible with a strict-consensus or other tree (Fig. 2).

Figure 2. Neighbour-net based on a mean distance matrix inferred
from the matrix of Rothwell & Stockey.
The distance to the "progymnosperms", a potential ancestral group of the
seed plants, can be taken as a measurement for the derivedness of each
major group. The primitive seed ferns are placed between progymnosperms
and the gymnosperms connected by partly compatible edge bundles; the
putatively derived "higher seed ferns" are isolated between the progymnosperms
and the long-edged angiosperms. Shared edge-bundles and 'neighbourness'
reflect quite well potential phylogenetic relationships and eventual ambiguities,
as in the case of Gnetidae. Colouring as in Figure 1; some taxon names
are abbreviated.
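A mean distance matrix of the kind used as input for Figure 2 can be sketched as follows. This is a minimal Python illustration under the assumption that the distance between two taxa is the proportion of differing characters among those scored in both taxa, with missing ("?") and inapplicable ("-") cells skipped; polymorphic scorings are not handled, and the function names and toy matrix are hypothetical.

```python
MISSING = {"?", "-"}  # assumed codes for missing and inapplicable cells

def mean_character_difference(row_a, row_b):
    """Proportion of differing characters among those scored in both taxa."""
    compared = differing = 0
    for a, b in zip(row_a, row_b):
        if a in MISSING or b in MISSING:
            continue  # skip characters not scored in both taxa
        compared += 1
        if a != b:
            differing += 1
    # No shared scored character: the distance is undefined.
    return differing / compared if compared else None

def distance_matrix(matrix):
    """All pairwise mean distances for a {taxon name: character string} matrix."""
    names = list(matrix)
    return {(p, q): mean_character_difference(matrix[p], matrix[q])
            for p in names for q in names}

# Hypothetical toy matrix: two reasonably scored taxa and one unscored fossil.
toy = {
    "extant_1": "0101",
    "extant_2": "01?0",
    "fossil": "????",
}
distances = distance_matrix(toy)
```

The undefined distances returned for pairs without any shared scored character illustrate why poorly preserved fossils are difficult for distance-based methods: the fewer characters two taxa share, the less reliable their distance becomes.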

In addition, neighbour-nets usually are better backgrounds to map patterns of conflicting or partly conflicting support seen in a bootstrap, jackknife or Bayesian-inferred tree sample. In Fig. 3, I have mapped the bootstrap support for alternative taxon bipartitions (branches in a tree) on the background of the neighbour-net in Fig. 2.

Obvious and less-obvious relationships are revealed simultaneously, together with their competing support patterns. Based on the graph, we can see that there is a relatively weak primary signal (edge lengths of the neighbour-net) but substantial bootstrap support for the Petriellales (a recently described taxon new to the matrix) as sister to the angiosperms. Several taxa, or groups of closely related taxa, are characterised by long terminal edges/edge bundles, rooting in the boxy central part of the graph. Any alternative relationship of these taxa/taxon groups receives similarly low support, although the actual values differ notably.

There is little signal to place most of the fossil “seed ferns” (extinct seed plants) in relation to the modern groups, and a very ambiguous signal regarding the relationship of the Gnetidae (or Gnetales) with the two main groups of extant seed plants, the conifers (Pinidae; see C. Earle’s gymnosperm database) and angiosperms (for a list and trees, see P. Stevens’ Angiosperm Phylogeny Website).

The Gnetidae is a strongly distinct (also genetically) group of three surviving genera, and a persistent source of headaches for plant phylogeneticists. Early molecular trees placed it as sister to the Pinaceae (‘Gnepine’ hypothesis; long-branch attraction artefact), whereas the currently favoured hypothesis (‘Gnetifer’) places the Gnetidae as sister to all conifers (Pinidae) in an all-gymnosperm clade (including Ginkgo and possibly the cycads).

As favoured by the branch support analyses, and contrasting with the preferred 2016 tree, the two Doyleas are placed closest to the conifers, nested within a commonly found group including the modern and ancient conifers and their long-extinct relatives (Cordaitales), and possibly Ginkgo (Ginkgoidae). In the original parsimony strict consensus tree, they are placed in the distal part as sister to Gnetidae and Petriellales + angiosperms (possibly long-branch attraction). The grade including the ‘primitive seed ferns’ (Elkinsia through Callistophyton), seen also in Rothwell and Stockey’s 2016 tree, may be poorly supported under maximum parsimony (the criterion used to generate the tree), but receives quite high support when using a probabilistic approach such as maximum likelihood bootstrapping and, to some degree, Bayesian inference (Fig. 3; Coiro, Chomicki & Doyle 2017).

Figure 3. Neighbour-net from above used to map alternative support patterns.
Numbers refer to non-parametric bootstrap (BS) support for alternative phylogenetic
splits under three optimality criteria: maximum likelihood (ML) as implemented in
RAxML (using MK+G model), maximum parsimony (MP), and least-squares
(via neighbour-joining, NJ; using PAUP*); and Bayesian posterior probabilities
(using MrBayes 3.2; see Denk & Grimm 2009, for analysis set-up). The circular
arrangement of the taxa allows tracking most edges in the authors’ tree and their,
sometimes better supported, alternatives. The edge lengths provide direct
information about the distinctness of the included taxa to each other; the structure
of the graph informs about how tree-like the signal is regarding possible
phylogenetic relationships or their alternatives. Colouring as in Figure 1;
some taxon names are abbreviated.

Numerous morphological matrices provide non-treelike signals. A tree can be inferred, but its topology may be only one of many possible trees. In the framework of total evidence, this may not be such a big problem, because the molecular partitions will predefine a tree, and fossils will simply be placed in that tree based on their character suites. Without such data, any tree may be biased and a poor reflection of the differentiation patterns.

By not forcing the data into a series of dichotomies, neighbour-nets provide a quick, simple alternative. Unambiguous, well-supported branches in a tree will usually result in tree-like portions of the neighbour-net. Boxy portions in the neighbour-net pinpoint the ambiguous or even problematic signals from the matrix. Based on the graph, one can extract the alternatives worth testing or exploring. Support for the alternatives can be established using traditional branch support measures. Since any morphological matrix will combine those characters that are in line with the phylogeny as well as those that are at odds with it (convergences, character misinterpretations), the focus cannot be to infer a tree, but to establish the alternative scenarios and the support for them in the data matrix.


Coiro M, Chomicki G, Doyle JA. 2017. Experimental signal dissection and method sensitivity analyses reaffirm the potential of fossils and morphology in the resolution of seed plant phylogeny. bioRxiv DOI:10.1101/134262

Crepet WL, Stevenson DM. 2010. The Bennettitales (Cycadeoidales): a preliminary perspective of this arguably enigmatic group. In: Gee CT, ed. Plants in Mesozoic Time: Morphological Innovations, Phylogeny, Ecosystems. Bloomington: Indiana University Press, pp. 215–244.

Denk T, Grimm GW. 2009. The biogeographic history of beech trees. Review of Palaeobotany and Palynology 158: 83–100.

Felsenstein J. 2004. Inferring Phylogenies. Sunderland, MA, U.S.A.: Sinauer Associates Inc.

Friis EM, Crane PR, Pedersen KR, Bengtson S, Donoghue PCJ, Grimm GW, Stampanoni M. 2007. Phase-contrast X-ray microtomography links Cretaceous seeds with Gnetales and Bennettitales. Nature 450: 549–552 [all important information needed for this post is in the supplement to the paper; a figure showing the actual full analysis results can be found at figshare].

Hilton J, Bateman RM. 2006. Pteridosperms are the backbone of seed-plant phylogeny. Journal of the Torrey Botanical Society 133: 119–168.

Holland BR, Huber KT, Dress A, Moulton V. 2002. Delta Plots: A tool for analyzing phylogenetic distance data. Molecular Biology and Evolution 19: 2051–2059.

Rothwell GW, Crepet WL, Stockey RA. 2009. Is the anthophyte hypothesis alive and well? New evidence from the reproductive structures of Bennettitales. American Journal of Botany 96: 296–322.

Rothwell GW, Nixon K. 2006. How does the inclusion of fossil data change our conclusions about the phylogenetic history of the euphyllophytes? International Journal of Plant Sciences 167: 737–749.

Rothwell GW, Stockey RA. 2016. Phylogenetic diversification of Early Cretaceous seed plants: The compound seed cone of Doylea tetrahedrasperma. American Journal of Botany 103: 923–937.

Schliep K, Potts AJ, Morrison DA, Grimm GW. 2017. Intertwining phylogenetic trees and networks. Methods in Ecology and Evolution DOI:10.1111/2041-210X.12760.