Monday, May 21, 2018

Misunderstandings and misrepresentations about Linné's alleged family motto


This is a joint post by Magnus Lidén and David Morrison

The Swedish biologist Carl Linnaeus (1707-1778) is well known in biology as the father of modern taxonomic nomenclature, although he is better known in his own country for writing a series of travel books that cataloged the cultures and resources of Sweden.* He was knighted in 1757, and took the noble name Carl von Linné, as well as adopting a coat of arms (shown below).

It is often claimed that at the same time he adopted a family motto:
Deus creavit, Linnaeus disposuit [Latin]
God created, Linnaeus organized [English]
Gud skapade, Linné ordnade [Swedish]
Gott erschuf, Linné ordnete [German]
This claim is repeated around the internet, almost always attributing the words directly to the man himself: "Deus creavit, Linnaeus disposuit he liked to say" (Smithsonian Institution); "Deus creavit, Linnaeus disposuit he took as his motto" (Harvard University); "Deus creavit, Linnaeus disposuit was how Linnaeus himself summed up his lifetime achievements" (Uppsala University; and Svenska Linnésällskapet, the Swedish Linnaean Society).

The motto has been used both to mock him for his presumptuousness and to praise him for his piety. Primary references for this alleged motto are, however, conspicuously absent from all of the web sites; and our search of the literature, as well as consultation with Linné experts, has failed to produce any evidence that he ever used this motto himself.

In the standard Linné biography of Fries (1903), it is simply referred to as an "illuminating epigram which admiring contemporaries used" (see Jackson 1923), which explains neither how it came to be attributed to Linnaeus nor where it came from. FV Hope (Anon. 1843) suspected it had originated as an act of malice. Although it has been used to that end by his adversaries, it was originally meant to express awe and admiration.

As far as we can determine, the first English-language use of the motto appears as the frontispiece of this book:
The Life of Sir Charles Linnæus, Knight of the Swedish Order of the Polar Star, &c, &c.
to which is Added a Copious List of His Works, and a Biographical Sketch of the Life of His Son

By D.H. Stoever, Ph.D.
Translated from the original German
By Joseph Trapp, A.M.
1794
B. and J. White, Fleet Street, London


As you can see, the motto is used as a banner situated directly below the coat of arms of Linné, and to all appearances is a part of it, with a portrait in profile above. This gives the impression that the words were coined by Linné himself (as was the case for the coat of arms).

However, the original German-language version of the book reveals a very different situation:
Leben des Ritters Carl von Linné
Nebst den biographischen Merkwürdigkeiten seines Sohnes, des Professors Carl von Linné
und einem vollständigen Verzeichnisse seiner Schriften, deren Ausgaben, Übersetzungen, Auszüge und Commentare

von Dietrich Heinrich Stöver, Doctor der Philosophie
1792
Benj. Gottl. Hoffmann, Hamburg


The frontispiece has the alleged motto flanking the coat of arms of Linnaeus, rather than being part of it. This makes all the difference to the interpretation. The portrait, incidentally, is a poor copper engraving, drawn from a plaster medallion by Inländer from 1773 (cf. Tullberg 1907).

Stöver reveals his source for the words in his 1792 preface:
Das Motto unter dem Bildnisse Linné's [...] wird hoffentlich mit der Religiosität keines Lesers in Collision kommen. Es rührt von einem Manne her, der ein langer Freund des Verstorbnen war.
However, in the 1794 English translation, "langer Freund" is embellished to the point of confusion:
The motto beneath the portrait of Linnaeus [...] will not, it is humbly presumed, offend the religious opinions of any reader. It originates with a man who has lived many years in the closest ties of intimacy with the deceased.**
Whoever devised it, it seems probable that this phrase is a post-Linnéan laudation communicated to Stöver orally or by letter. At any rate, it does not appear in print until 14 years after Linné's death.

This may seem like a rather harmless "factoid", but it highlights how easily erroneous beliefs can be established, even in a scientific environment.

Other myths

This brings us to a second myth, a misconstruction of the very core of Linné's views on classification, which has seriously distorted how the development of 18th century systematics is perceived. The widely held picture of Linné as an Aristotelian Essentialist, classifying nature by Medieval Scholastic Principles of Logical Division, dates from the work of Cain (1958; see Winsor 2006), and was uncritically accepted by several influential authors, such as Mayr (1982) and Futuyma (1998). But this is like stating that Darwin was a creationist!

On the contrary, the scholastic approach is strongly criticized by Linné. He was the first to clarify the conceptual difference between the top-down divisionis leges (which he claimed will by necessity result in artificial groupings and disruption of natural taxa) and synthetic systematization. Linné emphasized that natural taxa are not defined by characters but must be built from the basic entities (species) upwards (Linnaeus 1737). He was far ahead of his time in doing this. The misrepresentation of Linné's views by Cain and his followers has been thoroughly debunked by, for example, Skvortsov (2002), Winsor (2006) and Müller-Wille (2013), but it seems to be hard to eradicate.

A more amusing misunderstanding is the so-called flower clock, reputedly planted by Linné in the Hortus Academicus of Uppsala (now called Linnéträdgården, The Linné garden), about which numerous visitors and journalists ask each year. However, Linné's flower clock (1751) was a list of selected phenological observations, which never materialized in the Uppsala academic garden as an actual plantation, nor was it ever meant to. Attempts to plant flower clocks in gardens have shown that they are not very accurate as to general time-keeping across seasons and latitudes.

Note:
It seems to be quite common in English to insist on the use of titles for British people but not for foreigners. As noted by Stöver and Trapp in their book, "Carl von Linné" is best treated as the Swedish equivalent of "Sir Carl Linnaeus".

References

Anon. (1843) Summary of a lecture by F. V. Hope – on the portraits of Linnaeus – read for the Linnean Society 21 Feb 1843 (E. Forster, Esq. in the chair). The Athenæum (Journal of English and Foreign Literature, Science and the Fine Arts) 801: 218. [in vol. 1 for the year 1843, installments 783 to 817]

Cain AJ (1958) Logic and memory in Linnaeus' system of taxonomy. Proceedings of the Linnean Society of London 169: 144-163.

Fries TM (1903) Linné. Lefnadsteckning, 2 vols. Stockholm.

Futuyma DJ (1998) Evolutionary Biology, 3rd edn. Sinauer Associates, Sunderland MA.

Jackson BD (1923) Linnaeus. Abridged and adapted from Fries 1903. London.

Linnaeus C (1737) Genera Plantarum. Conrad Wishoff, Leiden.

Linnaeus C (1751) Philosophia Botanica. Godofr. Kiesewetter, Stockholm.

Mayr E (1982) The Growth of Biological Thought. Harvard University Press, Cambridge MA.

Müller-Wille S (2013) Systems and how Linnaeus looked at them in retrospect. Annals of Science 70: 305-317.

Skvortsov AK (2002) Systematics on the threshold of the 21st century: traditional principles and basics from the contemporary viewpoint. Zhurnal Obshchei Biologii 63: 82-93. [In Russian; abridged translation by Irina Kadis on WWW]

Tullberg T (1907) Linnéporträtt. Aktiebolaget Ljus, Stockholm.

Winsor MP (2006) Linnaeus' biology was not essentialist. Annals of the Missouri Botanical Garden 93: 2-7.



* On May 18 we had Linnés trädgårdsfest, which is Uppsala's celebration of Linné's working life in the town.

**According to Guido Grimm, a more literal translation would be: "It originates from an old friend of the deceased, who, being of rare noble character, summarized the widely accepted opinion(s) of experts".

Monday, May 14, 2018

Addition of a Message Board to the blog


This is a short post just to point out that there is now a Message Board on this blog, where people can post community information, such as jobs and scholarships, as well as any other requests or information. The link is at the upper-right of the blog pages.

To post a message to the Board, send an email to: Leo van Iersel.


Monday, May 7, 2018

Keeping it simple in phylogenetics


This is a post by Guido, with a bit of help from David.

There's an old saying in physics, to the effect that: "If you think you need a more complex model, then you actually need better data." This is often considered to be nonsense in the biological sciences and the humanities, because the data produced by biodiversity are orders of magnitude more complex than anything known to physicists:
The success of physics has been obtained by applying extremely complicated methods to extremely simple systems ... The electrons in copper may describe complicated trajectories but this complexity pales in comparison with that of an earthworm. (Craig Bohren)
Or, more succinctly:
If it isn’t simple, it isn’t physics. (Polykarp Kusch)
So, in both biology and the humanities there has been a long-standing trend towards developing and using more and more complex models for data analysis. Sometimes, it seems like every little nuance in the data is important, and needs to be modeled.


However, even at the grossest level, complexity can be important. For example, in evolutionary studies, a tree-based model is often adequate for analyzing the origin and development of biodiversity, but it is inadequate for studying many reticulation processes, such as hybridization and transfer (either in biology or linguistics, for example). In the latter case, a network-based model is more appropriate.

Nevertheless, the physicists do have a point. After all, it is a long-standing truism in science that we should keep things simple:
We may assume the superiority, all things being equal, of the demonstration that derives from fewer postulates or hypotheses. (Aristotle)
It is futile to do with more things that which can be done with fewer. (William of Ockham) 
Plurality must never be posited without necessity. (William of Ockham) 
Everything should be as simple as it can be, but not simpler. (Albert Einstein)
To this end, it is often instructive to investigate your data with a simple model, before proceeding to a more complex analysis.

Simplicity in phylogenetics

In the case of phylogenetics, there are two parts to a model: (i) the biodiversity model (eg. chain, tree, network), and (ii) the character-evolution model. A simple analysis might drop the latter, for example, and simply display the data unadorned by any considerations of how characters might evolve, or what processes might lead to changes in biodiversity.

This way, we can see what patterns are supported by our actual data, rather than by the data processed through some pre-conceived model of change. If we were physicists, then we might find the outcome to be a more reliable representation of the real world. Furthermore, if the complex model and the simple model produce roughly the same answer, then we may not need "better data".


Modern-day geographic distribution of Dravidian languages (Fig. 1 of Kolipakam, Jordan, et al., 2018)

Historical linguistics of Dravidian languages

Vishnupriya Kolipakam, Fiona M. Jordan, Michael Dunn, Simon J. Greenhill, Remco Bouckaert, Russell D. Gray and Annemarie Verkerk (2018; A Bayesian phylogenetic study of the Dravidian language family. Royal Society Open Science) dated the splits within the Dravidian language family in a Bayesian framework. Aware of uncertainty regarding the phylogeny of this language family, they constrained and dated several topological alternatives. Furthermore, they checked how stable the age estimates are when using different, increasingly elaborate linguistic substitution models implemented in the software (BEAST2).

The preferred and unconstrained result of the Bayesian optimization is shown in their Figs 3 and 4 (their Fig. 2 shows the neighbour-net).

Fig. 3 of Kolipakam et al. (2018), constraining the North (purple), South I (red) and South II (yellow) groups as clades (PP := 1)
Fig. 4 of Kolipakam et al. (2018), result of the Bayesian dating using the same model but no constraints. The Central and South II groups are mixed up.

As you can see, many branches have rather low PP support, which is a common (and inevitable) phenomenon when analyzing non-molecular data matrices providing non-trivial signals. This is a situation where support consensus networks may come in handy, which Guido pointed out in his (as yet unpublished) comment to the paper (find it here).

On Twitter, Simon Greenhill (one of the authors) posted a Bayesian PP support network as a reply.

A PP consensus network of the Bayesian tree sample, probably the one used for Fig. 3 of Kolipakam et al. 2018, constraining the North, South I, and South II groups as clades (S. Greenhill, 23/3/2018, on Twitter).

Greenhill himself didn't find it too revealing, but for fans of exploratory data analysis it shows, for example, that the low support for Tulu as sister to the remainder of the South I clade (PP = 0.25) is due to a lack of decisive signal. In the case of the low support (PP = 0.37) for the North-Central clade, one faces two alternatives: it is equally likely that the Central languages Parji and Ollari Gadba are related to the South II group, which forms a highly supported clade (PP = 0.95) that includes the third language of the Central group (one of the topological alternatives tested by the authors).

A question that pops up is: when we want to explore the signal in this matrix, do we need to consider complex models?

Using the simplest-possible model

The maximum-likelihood inference used here is naive in the sense that each binary character in the matrix is treated as an independent character. The matrix, however, represents a binary sequence of concepts in the lexica of the Dravidian languages (see the original paper for details).

For instance, the first, invariant, character encodes "I" (same for all languages and coded as "1"), characters 2–16 encode "all", and so on. Whereas "I" (character 1) may be independent of "all" (characters 2–16), the binary encodings for "all" are inter-dependent, and effectively encode a micro-phylogeny for the concept "all": characters 2–4 are parsimony-informative (ie. each splits the taxon set into two subsets of at least two languages, and the three splits are mutually compatible); the remainder are parsimony-uninformative (ie. unique to a single taxon).
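The distinction between invariant, parsimony-informative and parsimony-uninformative binary characters can be sketched in a few lines of Python. This is only an illustration: the four-language matrix and the name classify_character are made up for the example, and are not taken from the Dravidian data.

```python
from collections import Counter

def classify_character(column):
    """Classify one binary character (one '0'/'1' state per language)."""
    # Ignore anything that is not a scored state (e.g. '?' for missing data).
    counts = Counter(state for state in column if state in ("0", "1"))
    if len(counts) < 2:
        return "invariant"
    # Parsimony-informative: both states occur in at least two languages.
    if min(counts.values()) >= 2:
        return "informative"
    # Otherwise one state is unique to a single language.
    return "uninformative"

# Hypothetical toy matrix: rows are languages, columns are binary characters.
matrix = {
    "LangA": "11100",
    "LangB": "11000",
    "LangC": "10010",
    "LangD": "10001",
}
n_chars = len(next(iter(matrix.values())))
columns = ["".join(row[i] for row in matrix.values()) for i in range(n_chars)]
labels = [classify_character(col) for col in columns]
print(labels)
```

In this toy matrix the first character behaves like "I" (invariant), the second defines a two-versus-two split, and the remaining three are each unique to one language.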


The binary sequence for "All" defines three non-trivial splits, visualized as branches, which are partly compatible with the Bayesian tree; eg. Kolami groups with members of South I, and within South II we have two groups matching the subclades in the Bayesian tree.

Two analyses were run by the original authors. The first used the standard binary model, Lewis’ Mk (1-parameter) model, and allowed for site-specific rate variation modelled using a Gamma distribution (option -m BINGAMMA). As in the case of morphological data matrices (or certain SNP data sets), and in contrast to molecular data matrices, most of the characters are variable (not constant) in linguistic matrices. The lack of invariant sites may lead to so-called “ascertainment bias” when optimizing the substitution model and calculating the likelihood.

Hence, RAxML includes an option to correct for this bias for morphological or other binary or multi-state matrices. In the case of the Dravidian language matrix, four out of the over 700 characters (sites) are invariant, and these were removed prior to rerunning the analysis with the correction applied (option -m ASC_BINGAMMA). The results of both runs are highly correlated: the Pearson correlation coefficient of the bipartition frequencies (bootstrap support, BS) is 0.964. Nonetheless, BS support for individual branches can differ by up to 20 (which may be a genuine or random result, we don't know yet). The following figures show the bootstrap consensus networks for the standard analysis and for the analysis correcting for the ascertainment bias.
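For readers who want to make this kind of comparison themselves, here is a minimal sketch of how the bipartition frequencies of two bootstrap runs can be correlated. The split labels and support values are hypothetical placeholders (not the Dravidian results), and a split missing from one run is assumed to have a support of 0 there.

```python
import math

def pearson(xs, ys):
    """Pearson correlation coefficient of two equally long lists of numbers."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two lists of equal length (at least 2 values)")
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)

# Hypothetical bootstrap support (in %) per bipartition for the two runs.
bs_uncorrected = {"split1": 100, "split2": 55, "split3": 30, "split4": 85}
bs_corrected = {"split1": 100, "split2": 70, "split4": 80, "split5": 45}

# Compare over the union of splits; absent splits count as 0 (an assumption).
splits = sorted(set(bs_uncorrected) | set(bs_corrected))
xs = [bs_uncorrected.get(s, 0) for s in splits]
ys = [bs_corrected.get(s, 0) for s in splits]

r = pearson(xs, ys)
largest = max(splits, key=lambda s: abs(bs_uncorrected.get(s, 0) - bs_corrected.get(s, 0)))
print(f"r = {r:.3f}; largest difference at {largest}")
```

A high overall correlation can hide sizeable differences for individual branches, which is why it is worth listing the splits with the largest difference as well.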

Maximum likelihood (ML) bootstrap (BS) consensus network for the standard analysis. Green edges correspond to branches seen in the unconstrained Bayesian tree in Kolipakam et al. (2018, fig. 4), the olive edges to alternatives in the PP support network by S. Greenhill. Edge values show ML-BS support, and PP for comparison.

ML-BS consensus network for the analysis correcting for the ascertainment bias. BSasc annotated at edges in bold font, with BSunc and PP (graph before) provided for comparison. Note the higher tree-likeness of the graph.

Both graphs show that this character-naïve approach is relatively decisive, even more so when we correct for the ascertainment bias. The graphs show relatively few boxes, which represent competing, tree-incompatible signals in the underlying matrix.

Differences involve Kannada, a language that is resolved as equally related to Malayalam-Tamil and Kodava-Yeruva (BSasc = 39/35 when correcting for ascertainment bias, but BSunc < 20/40 using the standard analysis); and Kolami, which is supported as sister to Koya-Telugu (BSasc = 69 vs. BSunc = 49) rather than Gondi (BSasc < 20, BSunc = 21).

They also show that from a tree-inference point of view, we don't need highly sophisticated models. All branches with high (or unambiguous) PP in the original analysis are also inferred, and can be supported using maximum likelihood with the simple 1-parameter Mk model. This also means that if the scoring were to include certain biases, the models may not correct against this. At best, they help to increase the support and minimize the alternatives, although the opposite can also be true.

For relationships within the Central-South II clade (unconstrained and constrained analyses), the PP were low. The character-naïve maximum likelihood analysis reflects some signal ambiguity, too, and its BS support can occasionally be higher than the PP. BS > PP values are directly indicative of issues with the phylogenetic signal (eg. lack of discriminative signal, topological ambiguity), because in general PP tends to overestimate support and BS to underestimate it. The only obvious difference is that maximum likelihood failed to provide support for the putative sister relationship between Ollari Gadba and Parji of the Central group.

The crux with using trees

When inferring a tree as the basis of our hypothesis testing, we do this under the assumption that a series of dichotomies can model the diversification process. Languages are particularly difficult in this respect, because even when we clean the data of borrowings, we cannot be sure that the formation of languages represents a simple split of one unit into two units. Support consensus networks based on the Bayesian or bootstrap tree samples can open a new viewpoint by visualizing internal conflict.
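The idea behind a support consensus network can be sketched as follows: count how often each bipartition (split) occurs in the tree sample, keep the splits above some frequency cut-off, and look for pairs of retained splits that cannot sit on the same tree. The five taxa, the four "trees" and the 20% cut-off below are all hypothetical; real tree samples would be read from Bayesian or bootstrap output and drawn with dedicated network software.

```python
from collections import Counter
from itertools import combinations

TAXA = frozenset("ABCDE")  # hypothetical taxon set

def normalise(side):
    """Represent a split by the side that lacks the reference taxon."""
    side = frozenset(side)
    reference = min(TAXA)
    return TAXA - side if reference in side else side

def compatible(s1, s2):
    """Two splits fit on one tree if one of the four intersections is empty."""
    c1, c2 = TAXA - s1, TAXA - s2
    return any(not (a & b) for a in (s1, c1) for b in (s2, c2))

# Hypothetical tree sample: each tree is listed by its non-trivial splits.
trees = [
    [{"A", "B"}, {"D", "E"}],
    [{"A", "B"}, {"D", "E"}],
    [{"A", "B"}, {"C", "D"}],
    [{"A", "C"}, {"D", "E"}],
]

counts = Counter(normalise(split) for tree in trees for split in tree)
support = {split: 100 * n / len(trees) for split, n in counts.items()}

# Keep splits found in at least 20% of the trees (cut-off is an assumption).
kept = [split for split, value in support.items() if value >= 20]
conflicts = [(a, b) for a, b in combinations(kept, 2) if not compatible(a, b)]

for split, value in sorted(support.items(), key=lambda kv: -kv[1]):
    print(sorted(split), value)
print(len(conflicts), "pairs of incompatible splits")
```

Each pair of incompatible splits that survives the cut-off is what shows up as a box in the network, with the competing support values attached to its edges.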

This tree-model conflict may be genuine. For example, when languages evolve and establish they may be closer or farther from their respective sibling languages and may have undergone some non-dichotomous sorting process. Alternatively, the conflict may be due to character scoring, the way one transforms a lexicon into a sequence of (here) binary characters. The support networks allow exploring these phenomena beyond the model question. Ideally, a BS of 40 vs. 30 means that 40% of the binary characters support the one alternative and 30% support the competing one.

In this respect, historical-linguistic and morphological-biology matrices have a lot in common. Languages and morphologies can provide tree-incompatible signals, or contain signals that infer different topologies. By mapping the characters on the alternatives, we can investigate whether this is a genuine signal or one related to our character coding.

Mapping the binary sequences for the concept "all" (the example used above to illustrate the matrix's basic properties; equalling 15 binary characters) on the ML-BS consensus network. We can see that its evolution is in pretty good agreement with the overall reconstruction. Two binaries support the sister relationship of the South II languages Koya and Telugu, and a third collects most members of the South I group. All other binaries are specific to one language and hence do not produce a conflict with the edges in the network.

Monday, April 30, 2018

Stratification: how linguists traditionally identify borrowings


In my previous blog post, I illustrated how important it is to take the systemic aspects of sound change into account when comparing languages. What surfaces as a surprisingly regular process is in fact a process during which the sound system of a language changes. Since the words in a given language are derived from the sound system, a change in the system will necessarily change all words in which the respective sound occurs.

On one hand, this makes it much more difficult for linguists to identify homologous words across languages. On the other hand, however, it enables us to identify borrowings, by searching for exceptions to regular sound correspondences. I will be discussing the latter here.

Sound changes and borrowing

In order to illustrate how this can be done in practice, consider the following table, which shows the first 10 of our 15 examples of cognates between German and English:

No.  German    English
1    Dach      thatch
2    Daumen    thumb
3    Degen     thane
4    Ding      thing
5    drei      three
6    Durst     thirst
7    denken    think
8    Dieb      thief
9    dreschen  thresh
10   Drossel   throat

When comparing these words quickly, it is easy to see that in all cases where German has a d as the initial sound, English has a th. This sound correspondence, as we call it in historical linguistics, reflects a very typical systematic similarity between English and German, which we can identify for all related words in English and German which go back to Proto-Germanic θ-, a very regular sound change which is well accounted for in Indo-European linguistics.

Not all homologous words between English and German, however, show this correspondence, as we can easily see from the five examples provided in the next table:

No.  German    English
11   Dill      dill
12   dumm      dumb
13   Damm      dam
14   Dunst     dunst
15   Dollar    dollar

It is easy to see that these words don't fit our expected pattern (d matching th as the first consonant). It is also clear from the overall similarity of the words that it is rather unlikely that they trace back to different words, and thus turn out to be not cognate at all. One of the simplest possible explanations for the divergence from our initial d in German corresponding to θ in English, which now surfaces as d = d, is borrowing, be it from German to English, from English to German, or from some third language.
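The reasoning so far can be mimicked with a very small script: extract the initial sound of each word, count the German-English correspondences, treat the most frequent one as the regular pattern, and flag everything else as an exception. This is a deliberately naive sketch working on spelling rather than on phonetic transcriptions, and the helper name initial_sound is invented for the example.

```python
from collections import Counter

# The 15 German-English word pairs from the two tables above.
pairs = [
    ("Dach", "thatch"), ("Daumen", "thumb"), ("Degen", "thane"),
    ("Ding", "thing"), ("drei", "three"), ("Durst", "thirst"),
    ("denken", "think"), ("Dieb", "thief"), ("dreschen", "thresh"),
    ("Drossel", "throat"), ("Dill", "dill"), ("dumm", "dumb"),
    ("Damm", "dam"), ("Dunst", "dunst"), ("Dollar", "dollar"),
]

def initial_sound(word, digraphs=("th",)):
    """Return the first sound of a word, treating 'th' as a single unit."""
    word = word.lower()
    for digraph in digraphs:
        if word.startswith(digraph):
            return digraph
    return word[0]

# Count how often each pair of initial sounds corresponds.
correspondences = Counter(
    (initial_sound(german), initial_sound(english)) for german, english in pairs
)
regular = correspondences.most_common(1)[0][0]

# Everything that deviates from the majority pattern needs an explanation.
exceptions = [
    (german, english) for german, english in pairs
    if (initial_sound(german), initial_sound(english)) != regular
]
print("regular pattern:", regular)
print("exceptions:", exceptions)
```

Taking the majority pattern as "regular" is itself an assumption that only works because the inherited words outnumber the exceptions in this sample.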

Among the five examples, the final one, Dollar, is the easiest to explain, as we are dealing with a recent borrowing of the name of the U.S. currency. English dollar itself has another cognate with German, namely German Taler, the name of a currency from ancient times (see here for the full etymology, based on Pfeifer 1993).

The other four terms in the table may seem less straightforward to explain as borrowings, as they are by no means of recent origin; but we can confirm their exceptional status by contrasting them with older Middle High German readings (11-14th century), which are listed in the following table for all 15 of our examples:

No.  German    English  Middle High German
1    Dach      thatch   dah
2    Daumen    thumb    dūm
3    Degen     thane    degan
4    Ding      thing    ding
5    drei      three    drī
6    Durst     thirst   durst
7    denken    think    denken
8    Dieb      thief    diob
9    dreschen  thresh   dreskan
10   Drossel   throat   drozze
11   Dill      dill     tilli
12   dumm      dumb     tumb
13   Damm      dam      tam
14   Dunst     dunst    tunst
15   Dollar    dollar

As can be easily seen from this table, examples 11-14 all have a t as the initial consonant in Middle High German, and not d, as in the other cases. The change from original Proto-Germanic d to t in German is a well-attested sound change, for which we have many examples in the form of sound correspondences (cf. day vs. Tag, do vs. tun, etc.). We can therefore conclude that the Middle High German readings like tilli vs. English dill reflect the readings we would expect if all words had changed according to the rules. Since no regular change from t in Middle High German to d in Standard High German can be attested, it is furthermore safe to assume that the words have been modified under the influence of contact with other Germanic language varieties.
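The same check can be written down as a crude stratification rule: compare the initial consonant of the modern German word with that of its Middle High German reading, and treat any mismatch as a candidate for borrowing, because no regular change from t to d is attested. The labels and the function name stratum are illustrative choices only, and the rule looks at nothing but the first letter.

```python
# Modern German words with their Middle High German readings (table above);
# None marks the one case (Dollar) where no older reading is listed.
entries = [
    ("Dach", "dah"), ("Daumen", "dūm"), ("Degen", "degan"), ("Ding", "ding"),
    ("drei", "drī"), ("Durst", "durst"), ("denken", "denken"), ("Dieb", "diob"),
    ("dreschen", "dreskan"), ("Drossel", "drozze"), ("Dill", "tilli"),
    ("dumm", "tumb"), ("Damm", "tam"), ("Dunst", "tunst"), ("Dollar", None),
]

def stratum(modern, older):
    """Assign a word to a layer based on its initial consonant only."""
    if older is None:
        return "no older reading"
    if modern[0].lower() == older[0].lower():
        return "regular continuation"
    # Initial t in the older reading but d today: not a regular change.
    return "borrowing candidate"

layers = {modern: stratum(modern, older) for modern, older in entries}
candidates = [word for word, layer in layers.items() if layer == "borrowing candidate"]
print(candidates)
```

A real stratification analysis would of course compare full sound correspondences across the whole word, and across more than two stages of the language.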

Here, English is not the most obvious candidate for contact; and the influence is rather due to contact with neighboring language varieties in the North-West of Germany, such as Frisian or Dutch. Similar to English, they have retained the original d (cf. Dutch dille vs. English dill). If speakers of High German varieties borrowed the term from speakers of Low German varieties, they would re-introduce the original d into their language, as we can see in our examples 11-14.

Why some of these borrowings took place and some did not is hard to say. That people in the North-West, living on the coast, know more about the building of dams, for example, is probably a good explanation why High German borrowed the term: obviously, the High German speakers did not use the word tam all that frequently, but instead heard the word dam often in conversations with neighboring varieties closer to the coast. For the other words, however, it is difficult to tell what was the reason for the success of the alternative forms.

Conclusions

Despite its important role for historical language comparison, the kind of analysis described here, by which linguists infer exceptional patterns in order to identify borrowings, is not well documented, either in handbooks of historical linguistics or in the journal literature. Following Lee and Sagart (2008), it is probably best called stratification analysis, since linguists try to identify the layers of contact and inheritance which surface in the form of sound correspondences. If these layers are correctly identified, linguists can often not only determine the direction in which a borrowing occurred, but also the relative time window in which this borrowing must have happened. This is the reason why linguists can often give very detailed word histories, which show where a word was first borrowed and how it then traveled through linguistic landscapes.

As for so many methods in historical language comparison, it is difficult to identify a straightforward counterpart of this technique in biology. What probably comes closest is the usage of GC content as a proxy for the inference of directed networks of lateral gene transfer (as described in, for example, Popa et al. 2011). In contrast to lateral gene transfer in biology, however, our linguistic word histories are often much more detailed, especially in those cases where we have well-documented languages.

For the future, I hope that increased efforts to formalize the process of cognate identification, cognate annotation, and phonetic alignments in computer-assisted frameworks to historical language comparison may help to improve the way we infer borrowings in linguistics. There are so many open questions about lateral word transfer in historical linguistics that we cannot answer by sifting manually through datasets. We will need all the support we can get from automatic and semi-automatic approaches, if we want to shed some light on the many mysterious non-vertical aspects of language evolution.

References

Lee, Y.-J. and L. Sagart (2008) No limits to borrowing: The case of Bai and Chinese. Diachronica 25.3: 357-385.

Pfeifer, W. (1993) Etymologisches Wörterbuch des Deutschen. Akademie: Berlin.

Popa, O., E. Hazkani-Covo, G. Landan, W. Martin, and T. Dagan (2011) Directed networks reveal genomic barriers and DNA repair bypasses to lateral gene transfer among prokaryotes. Genome Research 21.4: 599-609.

Monday, April 23, 2018

A (wal)nut to crack – what a network tells you that no tree can


In this post, I will show a network that I generated some time ago as illustration of a point: morphological data should not be used to infer trees, but networks, instead — especially when the goal is to place some fossils in a modern-day phylogenetic framework.

In 2007, Manos et al. (Systematic Biology 56:412–430) published an interesting phylogenetic study that provided a phylogenetic framework to place some enigmatic fossils of the Juglandaceae, the walnut family. Following my preferred procedure (presumably without realizing it), they recruited a palaeobotanical expert to erect a morphological partition.

Given the high quality of the matrix, this is an ideal example to demonstrate the utility of networks in (palaeo)phylogenetic research and to discuss the question of potential ancestor-descendant relationships, and their poor representation in trees (especially cladograms). Phylogenetic relationships within modern Juglandaceae are relatively well resolved. Rhoiptelea, a relict genus found in the mountains of northern Vietnam and south-western China, is sister to the remainder of the family; it is now subfamily Rhoipteleoideae, but was traditionally its own family. Rhoiptelea is a living fossil: flowers with fitting in-situ pollen and seeds have been found in the Late Cretaceous (Heřmanová et al. 2011, IJPS 172: 285–293; cryptically named Budvaricarpus serialis, the "Serial Budvarseed", because one is not allowed to use a modern-day genus for naming an 85–90 million year old angiosperm, even when it looks the same). The remainder of the Juglandaceae falls into two main clades, recognized as subfamilies:
  1. the Juglandoideae — the walnuts (Juglans) and their closest relatives: the (eastern) North American-East Asian disjunct genus Carya, the Eurasian relict genus Pterocarya (mainly Transcaucasia, East Asia), and the monotypic genera Cyclocarya and Platycarya.
  2. the Engelhardioideae — a group of tropical-subtropical, mostly relict genera: Alfaroa + Oreomunnea in the equatorial regions of the New World; and South East Asian-Malesian genus Engelhardia and the, probably monotypic, Alfaropsis widespread in China (sometimes still included in Engelhardia; e.g. current Flora of China, despite unambiguous molecular and morphological evidence).
Juglandaceae produce (winged) seeds and pollen that are relatively easy to identify. They are well-known and very common companions of palaeontologists during much of the Cenozoic, especially the (today geographically very restricted) Engelhardioideae. But in addition to the modern genera, the family includes some very interesting, unique fossils — the idea is to place these in a phylogenetic framework.

Results of the study of Manos et al. (2007).
Arrows indicate the position of the fossils. a) A majority rule consensus cladogram using a cut-off of 50 based on the morphological partition; b) the total evidence counterpart.

As can be seen from the above trees (taken from the paper), morphology reflects some of the molecular phylogenetic relationships: the Juglandoideae are supported as a clade, as are most genera (except for Engelhardia and Oreomunnea). Two fossils, Pal(a)eoplatycarya and Platycarya americana, were resolved as sister taxa to their modern counterpart, Platycarya strobilacea; and the two enigmatic fossils Polyptera (the "many-winged one") and Cruciptera (the "cross-winged one") could be associated with the Juglandoideae. The total evidence approach indicated that Cruciptera is part of the "crown-group" Juglandoideae, in contrast to Polyptera, which appears at a more "basal" (root-proximal) position in this subclade. A fifth fossil, Pal(a)eooreomunnea, could not be resolved with certainty (placed as sister to all Juglandoideae in the total evidence tree). As the name indicates, literally the "Ancient Oreomunnea", we would have expected it to group with the Engelhardioideae, which form a clade in the total evidence tree.

This is okay so far as it goes but, beyond potential sister relationships, these cladograms show very little. When I place a fossil such as Cruciptera in the phylogeny, I would like to know whether it is more closely related to Juglans, Pterocarya or Cyclocarya. Is it an early sister lineage of all of these, or even a precursor? Cladograms cannot answer such questions.

The persistent issue of pseudo-clades

It has been pointed out in earlier posts that clades/grades are not necessarily synonyms of Hennig's concepts of monophyly and paraphyly, mainly because of convergent evolution creating data splits that are incongruent with the true tree. Parsimony-based analyses are especially vulnerable, because each change represents a step to be optimized.

One alternative method to place fossils in a (molecular-based) phylogenetic framework is the evolutionary placement algorithm (EPA; Berger & Stamatakis 2010, AICCSA conference paper). This changes to a probabilistic framework, and queries each fossil alone using its morphological partition but using the molecular-based tree as framework.

Summarized result of the evolutionary placement algorithm as implemented in RAxML.
The numbers give the probability of placing the fossil on the corresponding branch, using maximum likelihood as the optimality criterion.

This gives the above tree as the result for the Walnut data set. Palaeooreomunnea is now unambiguously linked to one of the two included species of Oreomunnea, O. mexicana. Cruciptera is associated (again unambiguously) with Cyclocarya. Furthermore, not only are Palaeoplatycarya and the extinct North American Platycarya relatives of the modern-day Platycarya, but so is Polyptera. This, according to the original analysis, is the first-branching member of the remainder of the Juglandoideae, i.e. all genera except Platycarya.

And the network shows us why

The most important problem with morphological data sets is that their signals are complex, and usually not very tree-like. Hence, whenever we optimize fossils along a tree (either by directly analyzing the morphological data or by some form of total evidence approach), the analysis has to fit in this odd little OTU at all cost, even when it means collapsing an entire clade. Simultaneous optimisation of two or more fossils triggers further branching artifacts, and may decrease branch support, because we have no molecular data to compensate for any branch attraction that conflicts with the actual phylogeny.

Let's take Polyptera as an example. If we de-root the trees, the original total evidence placement and the ML-EPA are not that different from each other: Polyptera is just moved one node. An easily inferred Neighbour-net, which is not 1-dimensional like a phylogenetic tree, but 2-dimensional, shows the reason why (and only by using the morphological data partition).

The neighbour-net based on the Manos et al.'s morpho-data partition.
Numbers at branches represent nonparametric bootstrap support (Least-squares and Maximum parsimony criteria) and Bayesian posterior probabilities.

  • We can see that Polyptera has a unique morphology (it shows the longest terminal edge of all fossils), making it equally similar to Platycarya and the remaining Juglandoideae: Juglans, Pterocarya, Cyclocarya, and Carya (Annamocarya is a not-widely-accepted Chinese genus, genetically indistinct from other East Asian Carya). This explains its instability in tree-based reconstructions. Assuming that Rhoiptelea points to the actual root, one could use the relatively high branch support values as an argument to say that Polyptera evolved after Platycarya split from the remainder of the Juglandoideae. But the network shows that the signal is not that straightforward, and Polyptera may just be a third lineage within the Juglandoideae (note the short orange edge bundle in contrast to the large red and green ones). A crucial question to check, also regarding the ML-EPA result, is whether the orange-edge clade (including Polyptera) is supported by uniquely shared characters and is not just a tree-branching artifact caused by the distinctness of the Platycarya group. Being substantially distinct (genetically and morphologically) from the remainder of the Juglandoideae, the members of the Platycarya group must be placed as sister taxa. Being a fossil, Polyptera is not that distinct and is hence placed in the Juglandoideae core clade. Distance-based and parsimony methods are more vulnerable to long-branch attraction (or short-branch culling) than is ML; and Bayesian analysis optimizes to a tree best conforming to all signals in the data (compatible or not).
  • Cruciptera is more similar to Cyclocarya and Pterocarya than to Juglans, and represents a more primitive (ancestral) form. Based on the position of Cyclocarya and Pterocarya, we can directly conclude that they are morphologically less derived than Juglans, their sister taxon. Hence, one should be careful interpreting Cruciptera as a precursor of e.g. Pterocarya, but would have to go back into the matrix and assess which characters differentiate within this part of the graph, in order to decide whether the similarity between them is a genuine representation of shared (common) origin, and not just due to symplesiomorphies.
  • The fossil counterparts of modern-day Platycarya span a quite prominent box-like structure in the network, but the blue edge has little support from tree-based analyses. A simple explanation would be that these are two more ancient members of the Platycarya lineage, and are less derived than their modern counterpart and the other Juglandoideae.
  • Palaeooreomunnea is placed as one would expect for an ancestral form of the Engelhardioideae. It is clearly closer to the New World pair Alfaroa and Oreomunnea than to the Old World Alfaropsis and Engelhardia.
Data & software for EPA

The data matrix that I used for the ML-EPA, the Neighbour-net and the competing branch support analyses can be found in the supplementary information of the original paper.

EPA is implemented in RAxML since Version 7 and usually used to place environmental short sequence reads (Berger et al. 2011, Syst. Biol. 60:291–302). For a published application of EPA to place fossils, see e.g. Bomfleur et al. 2015, BMC Evol. Biol. 15:126.

Monday, April 16, 2018

Networks in the news, at last


Phylogenetic networks do not always fare very well in the traditional media. The general public has enough trouble dealing with a phylogenetic tree, let alone networks. For example, many people still consider that Darwin claimed that monkeys are our ancestors (a chain-based relationship) rather than our cousins (a tree-based relationship); who knows what they must think about humans inter-breeding with Neandertals (a network-based relationship).

Nevertheless, a few news reports about a recent network-based paper have suggested that the situation might be improving.


The paper in question is:
Úlfur Árnason, Fritjof Lammers, Vikas Kumar, Maria A. Nilsson, Axel Janke. Whole-genome sequencing of the blue whale and other rorquals finds signatures for introgressive gene flow. Science Advances 4: eaap9873.
This paper details extensive genomic admixture among six species of baleen whales. The phylogenetic scenarios involving gene flow cannot be represented by a tree, of course, so the authors include the following set of networks (along with a Median network).


News reports have appeared in at least two places, reporting on this paper, that discuss the difference between networks and "Darwinian trees", and do quite a good job of it.

For example, this quotation is from the New York Times ("Baleen Whales intermingled as they evolved, and share DNA with distant cousins"):
The relationships are so complicated, however, that the senior researcher Axel Janke said "family tree" is too simple a metaphor. Instead, the species, all part of a group called rorquals, have evolved more into a network, sharing large segments of DNA with even distant cousins. Scientists expressed surprise that there had been so much intermingling of baleen whales, given the variety of sizes and shapes.
This quotation is from Popular Science ("A new study on whales suggests Darwin didn't quite get it right"):
Evolutionary network analysis takes the tree metaphor and turns it into a complex web, which acknowledges the different kinds of familial connections shown by whole-genome sequencing. Comparing the whole genomes of rorquals shows that genetics is much more fluid than the Darwinian “tree” model, Janke says.
"Gene flow and hybridization is more common than biologists usually think," Janke says. Analysis of the rorquals’ genes shows that they've interbred in different ways at various times in their evolutionary history. This doesn't make much sense if you rely only on Darwin's model, where branches of the family tree never touch again after they separate.
I think that these give us all a reason for optimism.

Monday, April 9, 2018

The curious case(s) of tree-like matrices with no synapomorphies


(This is a joint post by Guido Grimm and David Morrison)

Phylogenetic data matrices can have odd patterns in them, which presumably represent phylogenetic signals of some sort. This seems to apply particularly to morphological matrices. In this post, we will show examples of matrices that are packed with homoplasious characters, and thus lead to trees with a low Consistency Index (CI), but which nevertheless have high tree-likeness, as measured by a high Retention Index (RI) and a low matrix Delta Value (mDV). We will also try to explore the reasons for this apparently contradictory situation.

Background

A colleague of ours was recently asked, when trying to publish a paper, to explain why there were low CI but high RI values in his study. This reminded Guido of a set of analyses he started about a decade ago, using an arbitrary selection of plant morphological matrices he had access to.

The idea of that study was to advocate the use of networks for phylogenetic studies using morphological matrices, based on the two dozen data sets that he had at hand. The datasets were each used to infer trees and quantify branch support, under three different optimality criteria: least-squares (via neighbour-joining, NJ), maximum likelihood, and maximum parsimony. This study was never wrapped up for a formal paper, for several reasons (one being that 10 years ago Guido had absolutely no idea which journal could possibly consider publishing such a paper, another that he struggled to find many suitable published matrices).

The signals detected in the collected matrices were quite different from each other. The set included matrices with very high matrix Delta Values (mDV), nontree-like signals, and astonishingly low mDVs, for a morphological matrix. Equally divergent were the CI and RI of the inferred equally most-parsimonious trees (MPT) and the NJ tree. The data for the MPTs and the primary matrices are shown in the first graph, as a series of scatterplots, where each axis covers the values 0-1. (Note: in most cases the NJ topologies are as optimal as the MPTs, and have similar CI and RI values.)
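For readers who have not met the Delta Value before, here is a minimal Python sketch of how such a score can be computed from a pairwise distance matrix. It follows the quartet-based delta score of Holland et al. (2002) as we understand it; whether this matches, detail for detail, the implementation used for the values reported here is an assumption, and the two toy distance matrices are invented for illustration.

```python
from itertools import combinations


def symmetric(pairs):
    """Build a symmetric dict-of-dicts distance matrix from {(a, b): d}."""
    dist = {}
    for (a, b), d in pairs.items():
        dist.setdefault(a, {})[b] = d
        dist.setdefault(b, {})[a] = d
    return dist


def delta_value(dist, taxa):
    """Mean quartet delta score: 0 = perfectly tree-like, towards 1 = not tree-like."""
    deltas = []
    for x, y, u, v in combinations(taxa, 4):
        # The three ways of pairing up four taxa.
        sums = sorted(
            (dist[x][y] + dist[u][v],
             dist[x][u] + dist[y][v],
             dist[x][v] + dist[y][u]),
            reverse=True,
        )
        spread = sums[0] - sums[2]
        # On a tree the two largest sums are equal, so the numerator is 0.
        deltas.append((sums[0] - sums[1]) / spread if spread > 0 else 0.0)
    return sum(deltas) / len(deltas)


taxa = ["A", "B", "C", "D"]

# Toy distances that fit the tree ((A,B),(C,D)) exactly.
tree_like = symmetric({("A", "B"): 2, ("C", "D"): 2, ("A", "C"): 4,
                       ("A", "D"): 4, ("B", "C"): 4, ("B", "D"): 4})

# Toy distances with a conflicting signal (a "box" in a Neighbour-net).
boxy = symmetric({("A", "B"): 2, ("C", "D"): 2, ("A", "C"): 4,
                  ("B", "D"): 4, ("A", "D"): 3, ("B", "C"): 3})

print(delta_value(tree_like, taxa))
print(delta_value(boxy, taxa))
```

The first matrix scores 0 and the second 0.5, which is the sense in which a low mDV indicates a tree-like signal.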


As you can see, the CI values (parsimony-uninformative characters not considered) are not correlated with either the RI or mDV values, whereas the latter two are highly correlated, with one exception.

The most tree-like matrix (mDV = 0.184, which is a value typically found for molecular matrices allowing for inference of unambiguous trees) was the one of Hufford & McMahon (2004) on Besseya and Synthyris. The number of MPTs was undetermined: using a ChuckScore of 39 steps (the best value found in test runs), PAUP* found more than 80,000 MPTs with a CI of 0.39 (third-lowest of all of the datasets), but an RI of 0.9 (highest value found).

A strict consensus network of the 80,003 equally parsimonious solutions, the network equivalent to the commonly seen strict consensus tree cladograms. Trivial splits are collapsed. Colours solely added for orientation (see next graph).

Oddly, the NJ tree had the same number of steps (under parsimony), but a much higher CI (0.69). The proportion of branches with a bootstrap support of > 50% was twice as large in a distance-based framework as under parsimony.

Bootstrap consensus networks based on 10,000 pseudoreplicates each. Left, distance-based and inferred using the Neighbour-Joining algorithm; right, using a branch-and-bound search under parsimony as optimality criterion (one tree saved per replicate). Edge-lengths reflect branch support of sole or competing alternatives; alternatives found in less than 20% of the replicates not shown; trivial splits are collapsed. Same colour scheme as above for orientation.

The Neighbour-net based on this matrix has quite an interesting structure. Tree-like portions are clearly visible (hence, the low mDV) but the branches are not twigs but well developed trunks. The large number of MPTs is mainly due to the relative indistinctness of many OTUs from each other.


Neighbour-net based on simple mean (Hamming) morphological distances. Same colour scheme as above.
This distance-based 2-dimensional graph captures all main aspects of the tree inferences and bootstrap analyses, with one notable exception: B. alpina, which is clearly part of the red clade in the tree-based analyses. We can see that the orange group, B. wyomingensis and close relatives, is (morphology-wise) less derived than the red species group. Although B. alpina is usually placed in a red clade, it would represent a morphotype much more similar to the orange cluster as it lacks most of the derived character suite that defines the rest of the red clade. In trees, B. alpina is accordingly connected to the short red root branch as first diverging "sister" with a very short to zero-long terminal branch, but in the network it is placed intermediate between the poorly differentiated but morphologically inhomogeneous oranges and the strongly derived reds, being a slightly reddish orange. This reddishness may reflect a shared common origin of B. alpina and the other reds, in which case the tree-based inferences show us the true tree. Or it may be just a parallel derivation in a member of the B. wyomingensis species aggregate, in which case the unambiguous clade would be a pseudo-monophylum (see also our recent posts on Clades, cladistics, and why networks are inevitable and Let's distinguish between Hennig and cladistics).
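The distances behind such a Neighbour-net are simple to compute. The following Python sketch shows one way to obtain a mean (Hamming) distance between two rows of a morphological matrix. Skipping characters that are missing or inapplicable in either OTU is our assumption about how "simple mean" distances are handled, and the three OTUs and their character states are invented for illustration.

```python
def mean_hamming(row_a, row_b, missing="?-"):
    """Proportion of differing characters among those scored in both OTUs."""
    compared = differing = 0
    for a, b in zip(row_a, row_b):
        if a in missing or b in missing:
            continue  # character not comparable for this pair
        compared += 1
        differing += a != b
    return differing / compared if compared else float("nan")


# Hypothetical matrix: one string of character states per OTU.
matrix = {
    "otu1": "0010?1",
    "otu2": "0110?0",
    "otu3": "1?1001",
}

names = sorted(matrix)
for i, a in enumerate(names):
    for b in names[i + 1:]:
        print(a, b, round(mean_hamming(matrix[a], matrix[b]), 3))
```

The resulting pairwise matrix can then be passed to any program that infers a Neighbour-net from distances.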

Interpretation, what does low CI but high RI stand for?

The distinction between the Consistency Index and the Retention index has been of long-standing practical importance in phylogenetics. For a detailed discussion, you can consult the paper by Gavin Naylor and Fred Kraus (The Relationship between s and m and the Retention Index. Systematic Biology 44: 559-562. 1995).

For each character, the consistency index is the fraction of changes in a character that are implied to be unique on any given tree (i.e. one change for each character state): m / s, where m = the minimum possible number of character-state changes on any tree, and s = the observed number of character-state changes on the given tree. The ensemble consistency index for the dataset (CI) is obtained by summing m and s separately across all characters and taking the ratio of the two sums.

The retention index (also called the homoplasy excess ratio) for each character quantifies the apparent synapomorphy in the character that is retained as synapomorphy on the tree: (g - s) / (g - m), where g = the greatest amount of change that the character may require on any tree. Once again, the ensemble retention index for the dataset (RI) is obtained from the sums of g, s and m across all characters.
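To make the two definitions concrete, here is a small Python sketch that computes m, s and g for unordered characters on a rooted binary tree, and from them the per-character and ensemble indices. It assumes Fitch (unordered) parsimony, for which m is the number of observed states minus one and g is the number of taxa minus the count of the most common state. The ensemble values are taken from the summed m, s and g. The six-taxon tree and the two characters are invented for illustration.

```python
from collections import Counter


def fitch_steps(tree, states):
    """Observed number of changes (s) for one character, by Fitch's algorithm.

    tree is a nested tuple of taxon names; states maps taxon name to state.
    """
    steps = 0

    def visit(node):
        nonlocal steps
        if isinstance(node, str):
            return {states[node]}
        left, right = (visit(child) for child in node)
        if left & right:
            return left & right
        steps += 1  # disjoint state sets imply one change
        return left | right

    visit(tree)
    return steps


def m_s_g(tree, states):
    counts = Counter(states.values())
    m = len(counts) - 1                        # minimum conceivable changes
    g = len(states) - max(counts.values())     # maximum conceivable changes
    return m, fitch_steps(tree, states), g


def indices(m, s, g):
    ci = m / s if s else None
    ri = (g - s) / (g - m) if g > m else None  # undefined if uninformative
    return ci, ri


tree = ((("A", "B"), "C"), (("D", "E"), "F"))
characters = [
    dict(zip("ABCDEF", "111000")),  # fits the tree perfectly
    dict(zip("ABCDEF", "110100")),  # one extra, homoplasious change
]

per_char = [m_s_g(tree, ch) for ch in characters]
for m, s, g in per_char:
    print((m, s, g), indices(m, s, g))

M, S, G = (sum(column) for column in zip(*per_char))
ensemble_ci, ensemble_ri = indices(M, S, G)
print("CI", round(ensemble_ci, 3), "RI", round(ensemble_ri, 3))
```

Even in this tiny example the ensemble RI (0.75) is higher than the ensemble CI (about 0.67), because RI rescales the observed changes against the worst case g rather than against s alone.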

Both CI and RI are comparative measures of homoplasy — that is, the degree to which the data fit the given tree. However, CI is negatively correlated with both the number of taxa and the number of characters, and it is inflated by the inclusion of parsimony-uninformative characters. RI is less sensitive to these characteristics. However, RI is inflated by the presence of unique states in multi-state characters that have some other states shared among taxa and, therefore, are potentially synapomorphic.

It is these different responses to character-state distributions (among the taxa) that apparently create the situation noted above for morphological data. Neither CI nor RI directly measures tree-likeness, but instead they are related to homoplasy. So, it is the relative character-state distributions among the taxa that matter in determining their values, not just the tree itself.

For example, increasing the number of states per character will, in general, increase CI faster than RI. Increasing the number of states per character that occur in only one taxon will, in general, increase RI faster than CI.

Take-home message

This is just another example demonstrating that morphological data sets should not be used to infer (parsimony) trees alone, but analysed using a combination of Neighbour-nets and support Consensus Networks. No matter which optimality criterion is preferred by the researcher, the signal in such matrices is typically not trivial. It calls for exploratory data analysis, and inference methods that are able to capture more than a trivial sequence of dichotomies.

Monday, April 2, 2018

Things you can learn in a blink about your data


As phylogeneticists, we commonly have to deal with data that we don't initially understand. In this post, I'll use a recently published 8-gene dataset on lizards to show how much can be learned prior to any deeper analysis, just from producing a few Neighbour-nets.

The data

Solovyeva et al. (Cenozoic aridization in Central Eurasia shaped diversification of toad-headed agamas, PeerJ, 2018) sampled species of toad-headed agamas (lizards) across their natural range (north-western China to the western side of the Caspian Sea), to study their genetic differentiation in time and space. To do so they used two datasets. The mitochondrial data covers four gene regions: coxI, cytB, nad2, and nad4, and are complemented by four nuclear gene regions: AKAP9, NKTR, BDNF, RAG1.

This caught my eye, because the authors' preferred trees have a bunch of low branch-support values, so that this would be a good opportunity to advocate some Consensus networks. They also report only values above a certain threshold, as apparently recommended by several reviewers. My reviewers have often recommended the same, but I have always ignored this. I believe we should give the value, because it makes a difference whether it's just below the threshold (e.g. bootstrap support, BS, of 49), or non-existent (BS < 5). The authors also note that their mitochondrial and nuclear genealogies are not fully congruent. In short, the signal from their matrix is probably not trivial, but could be interesting.

In contrast to many other journals, PeerJ has a strict open-data policy. Solovyeva et al. provide each gene as FASTA-formatted alignment as Supporting Information. So let's have some quick-and-dirty Neighbour-nets.

Using Neighbour-nets to decide on an analysis strategy

A comprehensive outgroup sampling can avoid outgroup-rooting artefacts, but adding very distant outgroups comes at a price. We need to invest much more computational effort, because the inference programmes not only try to optimize our focus group, but the entire taxon set. Another principal question is: what can an outgroup taxon provide as information for rooting an ingroup, while being completely different? Furthermore, when we do an ML (or Bayesian) analysis, e.g. with RAxML, we leave it to the program to optimize a substitution model (even when we predefine a model, its parameters will usually be optimized by the inference software on the fly). By adding distant outgroups, we optimize a model for them plus our focus group — by not using any outgroup, we optimize a model suiting just the situation in our focus group.

Fig. 1 shows the neighbour-net (uncorrected, codon-naive p-distances) for the first of the mitochondrial genes, coxI (the others are similar), which tells us a lot about the data to be used for the tree inferences.

Fig. 1 Neighbour-net based on mitochondrial (coxI) uncorrected p-distances. The diffuse, non-treelike signal expressed in the A and B fans will be a hard nut for the tree inference, and will have little influence on questions dealing with the focal genus.
We can see that outgroup diversity is much higher than for the focus group, and that most outgroup taxa are very distinct from the ingroup. Looking at the closest outgroups (Stellagama, Agama, Laudakia, Paralaudakia, Xenagama, Pseudotrapelus), we see that finding an unambiguous sister taxon to the focal genus will be difficult. We can also see that including more-distant taxa just gives the algorithm much more work (note the A and B bushes), but will hardly have any benefit for rooting the ingroup.

We also can see that the 3rd codon position is probably saturated to some degree, and that we will be dealing with a high level of stochasticity (randomly distributed mutation patterns) here — all terminal edges are long to very long. Since the same thing holds for the other three mitochondrial regions, it would not be a bad idea to do an additional inference including only the 1st and 2nd codon positions, in case all taxa should be included.
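As a rough illustration of what goes into such a graph, the Python sketch below computes uncorrected p-distances for a pair of aligned sequences, once for all sites and once separately for the 1st + 2nd and the 3rd codon positions. It assumes that the alignment is in frame and starts at a first codon position, and that gaps and ambiguous sites are skipped pairwise; the two short sequences are invented, not taken from the agama data.

```python
def p_distance(seq_a, seq_b, skip="-?N"):
    """Uncorrected p-distance: proportion of differing sites among compared sites."""
    compared = differing = 0
    for a, b in zip(seq_a.upper(), seq_b.upper()):
        if a in skip or b in skip:
            continue
        compared += 1
        differing += a != b
    return differing / compared if compared else float("nan")


def codon_positions_12(seq):
    """Keep 1st and 2nd codon positions (assumes the alignment is in frame)."""
    return "".join(base for i, base in enumerate(seq) if i % 3 != 2)


def codon_position_3(seq):
    """Keep 3rd codon positions only (assumes the alignment is in frame)."""
    return seq[2::3]


# Hypothetical in-frame fragments differing only at 3rd codon positions.
seq_a = "ATGGCCAAA"
seq_b = "ATGGCTAAG"

print("all sites ", round(p_distance(seq_a, seq_b), 3))
print("pos 1 + 2 ", p_distance(codon_positions_12(seq_a), codon_positions_12(seq_b)))
print("pos 3     ", round(p_distance(codon_position_3(seq_a), codon_position_3(seq_b)), 3))
```

If nearly all of the distance sits in the 3rd positions, as in this toy pair, long terminal edges in the all-sites network are a hint that a run with only the 1st and 2nd positions is worth the extra effort.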

Using Neighbour-nets to understand the basic signal properties of your data

Fig. 2 shows the Neighbour-net (again, uncorrected p-distances) for one of the nuclear genes, AKAP9. The outgroup sample is somewhat different, but we can immediately see that this gene has more potential to resolve unambiguous phylogenetic relationships among the sampled taxa: the graph has distinctly tree-like portions. We also see that saturation of the 3rd codon position is much less of an issue here, compared to the coxI gene (Fig. 1): the terminal edges are comparatively short, with respect to the central edge bundles. [Nonetheless, it is never wrong to analyze coding gene data partitioned: 1st and 2nd codon positions vs. 3rd codon position.]


Fig. 2 Neighbour-net based on the nuclear (AKAP9) genetic distances. Note the much more treelike structure of the graph, the generally shorter terminal edges, and last-but-not-least the notable difference between ingroup (focal genus) and outgroup taxa.
For the general differentiation patterns, compare the minute extent of the focal group, green background in Fig. 2 vs. the prominent bush in Fig. 1. It is clear that including distant outgroups will not have any benefit. We may even consider reducing the outgroup sample (if one has to include an outgroup at all) to the two genetically closest genera Stellagama and Paralaudakia.

Similarly structured graphs are found for the other three nuclear genes.


Producing some quick Neighbour-nets doesn't hurt

Sometimes reviewers will pick on them — "distance-based phenetic method" is something I used to get a lot. In this case, you can still produce them just to get some basic impressions on your data set. This will help you to understand the results of your tree inferences, including why some of your branches have ambiguous support.

It comes as little surprise that the taxa one can identify, in these networks, as likely sister genera of the focal genus, come up as sister taxa in the explicit phylogenetic analyses done by Solovyeva et al., e.g. their fig. 2, showing the combined mitochondrial tree, and their fig. 3, showing the combined nuclear tree.

Solovyeva et al. (2018) performed some incongruence tests (AU-topology test) using single-gene inferences (going further than many other studies), but did not dig deeper. One of the authors answered my question about potential signal issues that may cause topological incongruence between ML and Bayesian trees, as well as ambiguous support, but he considers this to be solely a problem with methods: different algorithms prefer different phylogenies. Having looked at the basic differentiation pattern in the gene regions using Neighbour-nets, it may be more than just an issue with methods, since ML and Bayesian analysis should always support the same splits when using the same or similar substitution models.

Like many other studies, the authors also use the data for Bayesian dating and dating-dependent biogeographic analysis. Lacking any ingroup fossils, the authors could only constrain nodes within the outgroup subtree, which are nodes far from those that they discuss and estimate. I have my doubts that we can put much faith in the uncorrelated clock process to handle such extreme differences between focus group (ingroup) and (constrained) outgroup-taxon lineages as seen in Fig. 2. Estimates for rate shifts between outgroup and ingroup usually render ingroup age estimates to be too young, compared to age estimates obtained with ingroup fossils. This is something that can be directly deduced from a graph like the one in Fig. 2.

Data and networks can be found at figshare

The original paper provides a comprehensive supplement with a lot of interesting information, but the FASTA-files, each comprising a single gene region and a few editing issues, are not yet ready to use. Hence, I transformed them into NEXUS-files, and generated a combined data matrix. The files and the Neighbour-nets for each gene region (and a full single-gene maximum likelihood analysis) can be found on figshare.

Monday, March 26, 2018

It's the system, stupid! More thoughts on sound change in language history


In various blog posts in the past I have tried to emphasize that sound change in linguistics is fundamentally different from the kind of change in phenotype / genotype that we encounter in biology. The most crucial difference is that sound sequences, i.e., our words or parts of the words we use when communicating, do not manifest as a physical substance but, as linguists say, "ephemerically", i.e. by the air flow that comes out of the mouth of a speaker and is perceived as an acoustic signal by the listener. This is in strong contrast to DNA sequences, for example, which are undeniably somewhere "out there". They can be sliced, investigated, and they preserve information for centuries if not millennia, as the recent boom in archaeogenetics illustrates.

Here, I explore the consequences of this difference in a bit more detail.

Language as an activity

Language, as Wilhelm von Humboldt (1767-1835) — the boring linguist who investigated languages from his armchair while his brother Alexander was traveling the world — put it, is an activity (energeia). If we utter sentences, we pursue this activity and produce sample output of the system hidden in our heads. Since the sound signal is only determined by the capacity of our mouth to produce certain sounds, and the capacity of our brain to parse the signals we hear, we find a much stronger variation in the different sounds available in the languages of the world than we find when comparing the alphabets underlying DNA or protein sequences.

Despite the large variation in the sound systems of the world's languages, it is clear that there are striking common tendencies. A language without vowels does not make much sense, as we would have problems pronouncing the words or perceiving them at longer distances. A language without consonants would also be problematic; and even artificial communication systems developed for long-distance communication, like the different kinds of yodeling practiced in different parts of the world, make use of consonants to allow for a clearer distinction between vowels (see the page about Yodeling on Wikipedia). But, between both extremes we find great variation in the languages of the world, and this does not seem to follow any specific pattern that could point to any kind of selective pressure, although scholars have repeatedly tried to demonstrate it (see Everett et al. 2015 and the follow-up by Roberts 2018).

What is also important here is that, not only is the number of the sounds we find in the sound system of a given language highly variable, but there is also variation in the rules by which sounds can be concatenated to form words (called the phonotactics of a language), along with the frequency of the sounds in the words of different languages. Some languages tolerate clusters of multiple consonants (compare Russian vzroslye or German Herbst), others refuse them (compare the Chinese name for Frankfurt: fǎlánkèfú), yet others allow words to end in voiced stops (compare English job in standard pronunciation), and some turn voiced stops into voiceless ones (compare the standard pronunciation of Job in German as jop).

Language as a system

Language is a system which essentially concatenates a fixed number of sounds to sequences, being only restricted by the encoding and decoding capacities of its users. This is the core reason why sound change is so different from change in biological characters. If we say that German d goes back to Proto-Germanic *θ (pronounced as th in path), this does not mean that there were a couple of mutations in a couple of words of the German language. Instead it means that the system which produced the words for Proto-Germanic changed the way in which the sound *θ was produced in the original system.

In some sense, we can think metaphorically of a typewriter, in which we replace a letter by another one. As a result, whenever we want to type a given word in the way we know it, we will type it with the new letter instead. But this analogy would be too restricted, as we can also add new letters to the typewriter, or remove existing ones. We can also split one letter key into two, as happens in the case of palatalization, which is a very common type of sound change during which sounds like [k] or [g] turn into sounds like [tʃ] and [dʒ] when being followed by front vowels (compare Italian cento "hundred", which was pronounced [kɛntum] in Latin and is now pronounced as [tʃɛnto]).

Sound change is not the same as mutation in biology

Since it is the sound system that changes during the process we call sound change, and not the words (which are just a reflection of the output of the system), we cannot equate sound change with mutations in biological sequences, since mutations do not recur across all sequences in a genome, replacing one DNA segment by another one, which may not even have existed before. The change in the system, as opposed to the sequences that the system produces, is the reason for the apparent regularity of sound change.

This culminates in Leonard Bloomfield's (1887-1949) famous (at least among old-school linguists) expression that 'phonemes [i. e., the minimal distinctive units of language] change' (Bloomfield 1933: 351). From the perspective of formal approaches to sequence comparison, we could restate this as: 'alphabets change'. Hruschka et al. (2015) have compared sound change with concerted evolution in biology. We can state the analogy in simpler terms: sound change reflects systemics in language history, and concerted evolution results from systemic changes in biological evolution. It's the system, stupid!

Given that sound systems change in language history, this means that the problem of character alignments (i.e. determining homology/cognacy) in linguistics cannot be directly solved with the same techniques that are used in biology, where the alphabets are assumed to be constant, and alignments are supposed to identify mutations alone. If we want to compare sequences in linguistics, where we have to compare sequences that were basically drawn from different alphabets, this means that we need to find out which sounds correspond to which sounds across different languages while at the same time trying to align them.

An artificial example for the systemic grounding of sound change

Let me provide a concrete artificial example, to illustrate the peculiarities of sound change. Imagine two people who originally spoke the same language, but then suffered from diseases or accidents that inhibited them from producing their speech in the way they did before. Let the first person suffer from a cold, which blocks the nose, and therefore turns all nasal sounds into their corresponding voiced stops, i.e., n becomes a d, ng becomes a g, and m becomes a b. Let the other person suffer from the loss of the front teeth, which makes it difficult to pronounce the sounds s and z correctly, so that they sound like a th (in its voiceless and voiced forms, respectively, as in thing vs. that).


Artificial sound change resulting from a cold or the loss of the front teeth.

If we now let both persons pronounce the same words in their original language, they won't sound very similar anymore, as I have tried to depict in the following table (dh points to the th in words like father, as opposed to the voiceless th in words like thatch).

No.   Speaker Cold   Speaker Tooth
1     bass           math
2     buzic          mudhic
3     dose           nothe
4     doizy          noidhy
5     sig            thing
6     rizig          ridhing
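The two artificial sound changes can be written down as simple replacement rules. The following Python sketch applies them to several of the original words; the names COLD, TOOTH, tokenize and apply_change are assumptions made for illustration, and treating the spelling ss as a single sound is a further assumption needed to get from mass to math.

```python
# Rules taken from the description of the two speakers.
COLD = {"m": "b", "n": "d", "ng": "g"}        # blocked nose: nasals become voiced stops
TOOTH = {"s": "th", "ss": "th", "z": "dh"}    # missing front teeth: s/z become th/dh
DIGRAPHS = ("ng", "ss")                       # letter pairs treated as one sound (assumption)


def tokenize(word):
    """Split a word into sound segments, keeping digraphs together."""
    segments, i = [], 0
    while i < len(word):
        step = 2 if word.startswith(DIGRAPHS, i) else 1
        segments.append(word[i:i + step])
        i += step
    return segments


def apply_change(word, rules):
    """Apply a sound change to every segment the rules mention."""
    return "".join(rules.get(segment, segment) for segment in tokenize(word))


ORIGINALS = ["mass", "muzic", "nose", "sing", "rizing"]
for word in ORIGINALS:
    print(f"{word:8}{apply_change(word, COLD):8}{apply_change(word, TOOTH)}")
```

Because each rule applies to every occurrence of a sound, the change shows up in all words that contain that sound, which is what makes it look regular.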

By comparing the words systematically, however, bearing in mind that we need to find the best alignment and the mapping between the alphabets, we can retrieve a set of what linguists call sound correspondences. We can see that the s of speaker Cold corresponds to the th of speaker Tooth, z corresponds to dh, b to m, d to n, and g to ng. Having probably figured out by now that my words were taken from the English language (spelling voiced s consistently as z), you will find it easy to come up with a reconstruction of the original words (mass, music[=muzik], nose, noisy[=noizy], etc.).
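The counting of sound correspondences and the reconstruction step can be sketched in a few lines of Python. This is a toy illustration under strong assumptions: the words are aligned position by position (which only works because the segment lists are equally long here), and the PROTO table, which says which sound of each correspondence is taken as the original, is filled in by hand rather than inferred.

```python
from collections import Counter

# (speaker Cold, speaker Tooth) word pairs taken from the table in the text.
PAIRS = [("bass", "math"), ("buzic", "mudhic"), ("dose", "nothe"),
         ("sig", "thing"), ("rizig", "ridhing")]
DIGRAPHS = ("ss", "th", "dh", "ng")   # letter pairs treated as one sound

# Analyst's decision (an assumption here): the original sound behind
# each correspondence.
PROTO = {("b", "m"): "m", ("d", "n"): "n", ("g", "ng"): "ng",
         ("s", "th"): "s", ("ss", "th"): "ss", ("z", "dh"): "z"}


def tokenize(word):
    """Split a word into sound segments, keeping digraphs together."""
    segments, i = [], 0
    while i < len(word):
        step = 2 if word.startswith(DIGRAPHS, i) else 1
        segments.append(word[i:i + step])
        i += step
    return segments


def aligned(cold, tooth):
    """Pair up segments position by position (valid only for equal lengths)."""
    a, b = tokenize(cold), tokenize(tooth)
    if len(a) != len(b):
        raise ValueError("toy alignment needs equally long segment lists")
    return list(zip(a, b))


def reconstruct(cold, tooth):
    """Replace each corresponding pair by its assumed original sound."""
    return "".join(PROTO.get(pair, pair[0]) for pair in aligned(cold, tooth))


# Count every aligned pair of differing sounds across all word pairs.
correspondences = Counter(
    pair for cold, tooth in PAIRS for pair in aligned(cold, tooth)
    if pair[0] != pair[1]
)
for (c, t), count in sorted(correspondences.items()):
    print(f"{c} : {t}  ({count}x)")
for cold, tooth in PAIRS:
    print(cold, tooth, "->", reconstruct(cold, tooth))
```

Correspondences that recur across several word pairs are the ones a linguist would trust; a pairing seen only once could just as well be chance.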

Reconstructing ancestral sounds in our artificial example with help of regular sound correspondences.

Summary

Systemic changes are difficult to handle in phylogenetic analyses. They leave specific traces in the evolving objects we investigate that are often difficult to interpret. While it has long been known to linguists that sound change is an inherently systemic phenomenon, it is still very difficult to communicate to non-linguists what this means, and why it is so difficult for us to compare languages by comparing their words. Although it may seem tempting to compare languages using simple sequence-alignment algorithms, treating their differences like the differences between biological sequences that result from mutations (see for example Wheeler and Whiteley 2015), this is basically an oversimplification.

Simple models undeniably have their merits, especially when dealing with big datasets that are difficult to inspect manually — there is nothing to say against their use. But we should always keep in mind that we can, and should, do much better than this. Handling systemic changes remains a major challenge for phylogenetic approaches, no matter whether they use trees, networks, bushes, or forests.

Given the peculiarity of sound change in linguistic evolution, and how well the phenomena are understood in our discipline, it seems worthwhile to invest time in exploring ways to formalize and model the process. During the past two decades, linguists have taken a lot of inspiration from biology. The time will come when we need to pay something back. Providing models and analyses to deal with systemic processes like sound change might be a good start.

References

Bloomfield, L. (1973 [1933]) Language. Allen & Unwin: London.

Everett, C., D. Blasi, and S. Roberts (2015) Climate, vocal folds, and tonal languages: connecting the physiological and geographic dots. Proceedings of the National Academy of Sciences 112.5: 1322-1327.

Hruschka, D., S. Branford, E. Smith, J. Wilkins, A. Meade, M. Pagel, and T. Bhattacharya (2015) Detecting regular sound changes in linguistics as events of concerted evolution. Curr. Biol. 25.1: 1-9.

Roberts, S. (2018) Robust, causal, and incremental approaches to investigating linguistic adaptation. Frontiers in Psychology 9: 166.

Wheeler, W. and P. Whiteley (2015) Historical linguistics as a sequence optimization problem: the evolution and biogeography of Uto-Aztecan languages. Cladistics 31.2: 113-125.

Monday, March 19, 2018

Comparing neighbour-nets and PCA graphs – the example of Mediterranean oaks


Distance matrices offer many avenues for exploring data. A common method is Principal Component Analysis (PCA). A much less common method is the use of Neighbour-nets. We have previously compared PCA and Neighbour-nets using theoretical data. In this post, I'll compare a PCA graph and the corresponding Neighbour-net using some empirical data.

Genetic differentiation in Mediterranean oaks

In the paper by Vitelli et al. (2017), we explored the phylogeographic structuring of a group of Mediterranean oak species. The species represented the westernmost populations of one of the main Eurasian oak lineages: the evergreen Quercus section Ilex ("Ilex oaks"; see Denk et al. 2017 for an up-to-date classification of oaks; see also this figshare-spread-sheet). It was a follow-up study to the one by Simeone et al. (2016).

We found that one species, the most widespread (Quercus ilex), carries plastids of quite different origins. The 2016 paper identified three main plastid haplotypes in the Ilex oaks: the unique (within the entire genus) "Euro-Med" haplotype; the "Cerris-Ilex" haplotype shared with western Eurasian members of (essentially deciduous) section Cerris, the sister clade of section Ilex (see Denk & Grimm 2010; confirmed by NGS SNP data, Hipp et al. 2015); and the "WAHEA" haplotype, an east-bound haplotype of section Ilex. Vitelli et al. aimed to characterise the range of these three main haplotypes throughout the four Ilex oak species found in the Mediterranean.

Figure 1 shows the two multivariate data analyses, along with a map of the sample locations.

Fig. 1 Phylogeographic structure of Quercus section Ilex around the Mediterranean (after Vitelli et al. 2017). a. PCA graph, and b. Neighbour-net based on the same inter-haplotype pairwise distance matrix. c. A map depicting the distribution of the main haplotype groups, labelled by Roman numerals: I, haplotypes of the "WAHEA" lineage; II, "Cerris-Ilex" lineage; III–VI, subtypes of the "Euro-Med" lineage (cf. Simeone et al. 2016, fig. 1).

Regarding the overall diversification pattern, the PCA graph and the Neighbour-net show similar things. The "Euro-Med" lineage is the most diverse group, with four subgroups — two larger (and widespread) ones (haplotypes IV, V) and two rare ones (III, VI) only found in the Aegean region.
  • According to the PCA, haplotype III (colored olive) is intermediate between "Euro-Med" IV (blue) and the haplotype II (yellow), which represents another lineage of oak haplotypes, the Aegean/Northern Turkish "Cerris-Ilex" lineage. The same can be seen in the Neighbour-net.
  • The PCA further places haplotype VI (red) as equidistant from all of the other types, with IV and I (green; representing the oriental "WAHEA" lineage) being a bit closer. In the Neighbour-net, we can sum up the lengths of the connecting edge-bundles to find the same pattern. A difference between the two analyses is that, in the Neighbour-net, VI is connected only with part of V (purple) by a pronounced edge bundle, but not connected to I (green). This is strikingly different from III, which shares an edge bundle with II and IV+V.

At this point in the analysis, we can exploit the fact that a Neighbour-net can act both as a distance-based 2-dimensional graph and as a meta-phylogenetic network (Fig. 2). Based on the PCA, which is also a 2-dimensional depiction of the differentiation, one may be tempted to interpret VI as a bridge between IV/V and I, not much different from how III bridges between II and IV (Fig. 1). On the other hand, the network (Figs 1, 2) informs us that VI is a likely relative of V, which in turn is a likely relative of IV; and the only connection between I and VI is their increasing distinctness from the other haplotypes of the "Euro-Med" lineage, III/IV/V.

Fig. 2 The main splits expressed in the Neighbour-net. III may either be sister to II, or be part of a clade comprising IV and V.

Using the main split patterns in the Neighbour-net, we can infer the one phylogenetic hypothesis, a tree, that can accommodate them all (Fig. 3).

Fig. 3 The tree solution congruent with the major split patterns (Fig. 2).

I rejected the alternative sister relationship between II and III because this would imply a sister clade that only includes IV, V and VI but not III, which clashes with the affinity of III to IV and V (Fig. 2). Interpreting III as a sister of IV and V explains its affinity both to II (putative sister lineage to III–VI) and to IV and V.
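The reasoning about clashing splits can be checked mechanically. Two splits of the same taxon set fit on one tree only if at least one of the four pairwise intersections of their sides is empty. The Python sketch below applies this test to three groupings that are my reading of the text (II+III, III+IV+V, and the "Euro-Med" group III–VI); they are illustrative assumptions, not the actual split weights from the Neighbour-net.

```python
from itertools import combinations

TAXA = frozenset({"I", "II", "III", "IV", "V", "VI"})


def compatible(side_a, side_b, taxa=TAXA):
    """Two splits fit on one tree iff one of the four intersections is empty."""
    a, b = frozenset(side_a), frozenset(side_b)
    parts = (a & b, a - b, b - a, taxa - (a | b))
    return any(len(part) == 0 for part in parts)


# Each split is given by one of its two sides; the other side is the rest.
SPLITS = {
    "II+III": {"II", "III"},
    "III+IV+V": {"III", "IV", "V"},
    "III+IV+V+VI": {"III", "IV", "V", "VI"},
}

for (name_a, a), (name_b, b) in combinations(SPLITS.items(), 2):
    verdict = "compatible" if compatible(a, b) else "conflict"
    print(f"{name_a} vs {name_b}: {verdict}")
```

In this reading, II+III conflicts with both other groupings, while III+IV+V nests inside III–VI, so only the latter two can sit on the same tree.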

We might accept that all three plastome lineages are reciprocally monophyletic (in a quite broad sense), meaning that each lineage evolved from a pool of closely related mother plants. If so, then the higher similarity between III ("Euro-Med") and II ("Cerris-Ilex") may represent a relative lack of derivation, whereas the dissimilarity of VI ("Euro-Med") and I ("WAHEA") to all other types can be due to a higher level of distinctness. We can then come up with a "cactus"-type metaphorical tree (Fig. 4) explaining the Neighbour-net (and PCA graph).

Fig. 4 A "cactus"-type tree metaphor for the evolution of oak plastomes (based on the results of Simeone et al. 2016, Vitelli et al. 2017, and – outside the focus group, i.e. Mediterranean oaks of Subgenus Cerris – some partly arcane, not yet published knowledge that I have access to).

We thus learn more from the Neighbour-net than from the PCA.

There's no reason to stop with a PCA

One empirical example is far from being conclusive, but it shows what Neighbour-nets have to offer.

Trees are fine for proposing phylogenetic hypotheses, but we should always be aware of equally valid alternatives to the tree that we have optimized. And with increasing numbers of taxa, inferring optimal trees and assessing their alternatives require increasing effort, and checking. For many questions, PCA has been used as a quick alternative, including in large-sample genetic studies (see Continued misuse of PCA in genomics studies).

Neighbour-nets are just a natural step further towards a phylogeny; they come with very little extra effort and can use the same data basis: a matrix of pairwise distances. In the case of genetic data, which usually reflect at least the main aspects of the actual phylogeny (trivial or complex) behind them, the "true tree", they should be obligatory. They are much more than just a clustering approach (even though their algorithm is based on a cluster algorithm) or a bivariate analysis. Neighbour-nets are meta-phylogenetic networks that have the capacity to contain the one or many topologies explaining the data. They are as straightforward as PCA when it comes to recognising "natural", coherent and equal, groups (in contrast to phylogenetic trees).
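For readers who want to try the ordination side on their own distance matrix, here is a minimal sketch using NumPy. It uses classical multidimensional scaling (also called principal coordinates analysis), which is one common way to obtain a PCA-like 2-dimensional graph directly from pairwise distances; whether this matches the exact procedure behind Fig. 1a is not stated here, so treat it as an assumption. The function name pcoa and the toy matrix are hypothetical and are not data from the oak study.

```python
import numpy as np


def pcoa(distances, n_axes=2):
    """Classical MDS / principal coordinates from a symmetric distance matrix."""
    d = np.asarray(distances, dtype=float)
    n = d.shape[0]
    centering = np.eye(n) - np.ones((n, n)) / n
    gram = -0.5 * centering @ (d ** 2) @ centering   # double-centred Gram matrix
    eigvals, eigvecs = np.linalg.eigh(gram)          # eigenvalues in ascending order
    order = np.argsort(eigvals)[::-1][:n_axes]       # keep the largest ones
    top = np.clip(eigvals[order], 0.0, None)         # guard against tiny negatives
    return eigvecs[:, order] * np.sqrt(top)


# Hypothetical toy distances between four items lying on a line.
labels = ["A", "B", "C", "D"]
toy = np.array([[0, 1, 4, 5],
                [1, 0, 3, 4],
                [4, 3, 0, 1],
                [5, 4, 1, 0]])
coords = pcoa(toy)
for label, (x, y) in zip(labels, coords):
    print(f"{label}: {x:+.2f} {y:+.2f}")
```

The same matrix can then be passed to a program that infers a Neighbour-net, so that the two graphs are compared on identical input.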

Postscript

I would have liked to add some more examples with non-genetic data, i.e. data sets where the distances are not the result of an explicit phylogenetic process. But this requires much more effort, since none of the PCA studies I browsed had documented the distance data/matrix used. However, I'm sure that inferring a Neighbour-net from whatever similarity data were used for a PCA can be a fruitful and revealing endeavour (which is why you find Neighbour-nets based on U.S. gun legislation, breast sizes, languages, cryptocurrencies, etc. on this blog, but few PCAs). So, try it out the next time you make a PCA, and share the results, e.g. by using our comment option or even a post as guest-blogger.

References

Denk T, Grimm GW. 2010. The oaks of western Eurasia: traditional classifications and evidence from two nuclear markers. Taxon 59: 351–366.

Denk T, Grimm GW, Manos PS, Deng M, Hipp AL. 2017. An updated infrageneric classification of the oaks: review of previous taxonomic schemes and synthesis of evolutionary patterns. In: Gil-Pelegrín E, Peguero-Pina JJ, and Sancho-Knapik D, eds. Oaks Physiological Ecology. Heidelberg, New York: Springer, p. 13–38. Free Pre-Print at bioRxiv [major change: Ponticae and Virentes accepted as additional sections in final version]

Hipp AL, Manos P, McVay JD, ... , Avishai M, Simeone MC. 2015 [abstract]. A phylogeny of the World's oaks. Botany 2015. Edmonton.

Simeone MC, Grimm GW, Papini A, Vessella F, Cardoni S, Tordoni E, Piredda R, Franc A, Denk T. 2016. Plastome data reveal multiple geographic origins of Quercus Group Ilex. PeerJ 4: e1897 [open access, comments/questions welcomed]

Vitelli M, Vessella F, Cardoni S, Pollegioni P, Denk T, Grimm GW, Simeone MC. 2017. Phylogeographic structuring of plastome diversity in Mediterranean oaks (Quercus Group Ilex, Fagaceae). Tree Genetics and Genomes 13:3.