First, we propose an algorithm called RAP to infer speciation and duplication events by comparing gene and species trees (tree reconciliation).
Second, we have developed a general method to search gene families for which the tree topology matches a peculiar pattern. A tree pattern is a peculiar tree structure, with various taxonomic and evolutionary parameters contained in nodes and leaves.
It can also be considered as a subtree that is part of a larger tree. This tree pattern is then compared with all the phylogenetic trees of the database in order to retrieve the families in which one or several of its occurrences are found. In this way, it is possible to automatically retrieve all orthologs among a given set of species.
This system is not limited to the identification of orthologs, as it can be used to retrieve any complex tree pattern. For example, it is possible to search for events of gene loss or gene transfer, or to search for gene duplication events. The standard procedure to determine whether a node in a phylogenetic tree corresponds to a speciation or a duplication event consists of comparing the gene tree with the species tree.
Efficient algorithms have been proposed to solve this problem of tree reconciliation (Page and Charleston; Eulenstein et al.). However, an important limitation to their use is that they require completely resolved trees. In fact, species trees often have ambiguities, due to limitations in available paleontological and molecular data. On the other hand, gene trees are rarely completely reliable, because of limitations in the number of informative sites in sequence alignments, and because of approximations in the evolutionary models and algorithms that are presently used in molecular phylogeny.
Thus, when the gene tree contradicts the species tree, it is required to assess the reliability of the gene tree. This can be done by bootstrap values, or—when these values are not available—by considering the length of internal branches. This is another limitation of the previous algorithms, as they take into account only the topology of the gene and species trees, but not the length of their branches. To circumvent these problems, we propose an improved algorithm for tree reconciliation, allowing the presence of unresolved nodes both in the gene tree and in the species tree, and taking into account not only the tree topology, but also branch lengths.
This algorithm is based on the tree mapping method (Page and Charleston; Eulenstein et al.). The result of the comparison of the gene tree G and the species tree S is a third tree, the reconciled tree R (see figure).
The first step of the method is to define R with the same topology as S. Then R (initially equivalent to S) and G are traversed simultaneously: for each incongruent pair of nodes, a duplication node is inserted in R and gene losses are annotated.
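As an illustration of the underlying idea, the following minimal sketch labels gene tree nodes using the classic least-common-ancestor mapping on fully resolved or multifurcating trees. It is not the RAP implementation: it ignores branch lengths, bootstrap values and gene-loss annotation, and the names `Node`, `species_index`, `lca` and `label_events` are hypothetical helpers introduced only for this example. It also assumes that every species tree node, including internal ones, has a unique name.

```python
from dataclasses import dataclass, field


@dataclass
class Node:
    name: str = ""
    children: list = field(default_factory=list)


def species_index(species_root):
    """Build parent and depth lookups keyed by species-tree node name."""
    parent = {species_root.name: None}
    depth = {species_root.name: 0}
    stack = [species_root]
    while stack:
        node = stack.pop()
        for child in node.children:
            parent[child.name] = node.name
            depth[child.name] = depth[node.name] + 1
            stack.append(child)
    return parent, depth


def lca(a, b, parent, depth):
    """Least common ancestor of two species-tree nodes, by name."""
    while a != b:
        if depth[a] >= depth[b]:
            a = parent[a]
        else:
            b = parent[b]
    return a


def label_events(gene_node, leaf_species, parent, depth, events):
    """Post-order traversal of the gene tree.

    Returns the species-tree node the gene node maps to and records
    'duplication' when a node maps to the same species node as one of
    its children (an incongruent pair), 'speciation' otherwise.
    """
    if not gene_node.children:
        return leaf_species[gene_node.name]
    mapped = [label_events(c, leaf_species, parent, depth, events)
              for c in gene_node.children]
    here = mapped[0]
    for m in mapped[1:]:
        here = lca(here, m, parent, depth)
    events[gene_node.name] = "duplication" if here in mapped else "speciation"
    return here


# Species tree: ((human, mouse)Euarchontoglires, chicken)Amniota
species = Node("Amniota", [
    Node("Euarchontoglires", [Node("human"), Node("mouse")]),
    Node("chicken"),
])
# Gene tree: ((h1, m1)n1, (h2, c1)n2)root
gene = Node("root", [
    Node("n1", [Node("h1"), Node("m1")]),
    Node("n2", [Node("h2"), Node("c1")]),
])
leaf_species = {"h1": "human", "m1": "mouse", "h2": "human", "c1": "chicken"}

parent, depth = species_index(species)
events = {}
label_events(gene, leaf_species, parent, depth, events)
```

In this toy case the gene tree root maps to the same species node as its child `n2`, so it is the only node labelled as a duplication.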
In the example given in Figure 1, the roots of G and S are the only pair of incongruent nodes. Our algorithm is intended to be used on the homologous gene family databases we developed, and one problem is that these datasets often include redundant sequences.
Although efforts are made to minimize redundancy, there are many cases where a single protein is represented by several entries. These redundant sequences are often not exactly identical, either because of polymorphism, sequencing errors, or because they correspond to alternative splice variants. With such data, the standard tree reconciliation algorithms would tend to overestimate the number of gene duplications, because these redundant sequences would be interpreted as paralogs.
To solve that problem, we have added a functionality to our algorithm so that two sequences from the same species are treated as paralogs only if they are more divergent than a given threshold, set by the user. It is possible to formulate the unordered tree pattern matching problem as follows:
The unordered tree pattern matching problem is well known in computer science (Aho et al.). Moreover, for performing searches, both the target tree and the tree pattern have to be rooted. Another problem is that gene trees are very often not reliable, and some of their parts may be erroneous. This is due to the limitations of phylogenetic reconstruction methods linked to saturation problems, long branch attraction artefacts or the difficulty of taking into account differences in evolutionary rates.
In order to cope with possible errors in the trees, we introduced the use of wildcards in the pattern searches. Such wildcards can be represented by multifurcations. Phylogenetic trees are binary trees and hence a true multifurcation cannot exist in a target tree.
In this section, we describe an implementation of the tree-mapping algorithm, which isolates the congruence function. The notations used are the following: G indicates a node or a leaf of the gene tree, so it corresponds to the whole tree if G is the root.
R indicates a node or a leaf of the reconciled tree, so it corresponds to the whole tree if R is the root. Gs indicates the set of children of G, and Rs indicates the set of children of R. Card(X) is the number of elements in the set X, and Species(Y) is the number of leaves that are not losses under the node Y.
Reconcile(G, R). The tree mapping method uses a very basic definition of congruence; therefore the algorithm listed above is not directly applicable to real data. In the version implemented in RAP, the congruence function is improved in order to deal with n-ary nodes that may be encountered in species trees.
Also, many branches in gene trees, and even in species trees, are not absolutely reliable. In this case, the congruence function tries to collapse some branches to make the topologies correspond. Branches that are collapsed must be of low reliability, which is checked by requiring that the bootstrap value of the branch be below a threshold score.
If the tree is not bootstrapped, branch lengths may be used for the same goal. For that purpose, the congruence function compares branch length ratios of G and S. The ratio of branch lengths is also used to introduce duplication nodes. In the example given in Figure 2, the topologies of G and S are equivalent, but the ratio of branch lengths is too high to consider these genes as orthologs.
A duplication node is then created to explain such a ratio. The minimum rate ratio before duplication is a parameter of the reconciliation method. Finally, to deal with polymorphism and redundancy, we treat two sequences from the same species as paralogs only if they are more divergent than a given threshold. If this is not the case, they are considered redundant entries in the database.
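The bootstrap-based collapsing described above can be sketched as follows. This is a simplified illustration under stated assumptions, not the RAP congruence function: it collapses every internal branch whose support is below the threshold, whereas the text describes collapsing only where needed to make the topologies correspond. The `Node` class, its `support` field and the `collapse_low_support` function are hypothetical names chosen for this example.

```python
from dataclasses import dataclass, field


@dataclass
class Node:
    name: str = ""
    support: float = 100.0  # bootstrap support of the branch above this node
    children: list = field(default_factory=list)


def collapse_low_support(node, threshold):
    """Return a copy of the tree in which every internal branch with
    support below `threshold` is removed, so that its children are
    attached directly to the parent (creating a multifurcation)."""
    new_children = []
    for child in node.children:
        child = collapse_low_support(child, threshold)
        if child.children and child.support < threshold:
            # Unreliable internal branch: splice its children into the parent.
            new_children.extend(child.children)
        else:
            new_children.append(child)
    return Node(node.name, node.support, new_children)


# ((A, B) with support 40, C)
tree = Node("root", 100.0, [
    Node("", 40.0, [Node("A"), Node("B")]),
    Node("C"),
])
collapsed = collapse_low_support(tree, threshold=50.0)
kept = collapse_low_support(tree, threshold=30.0)
```

With a threshold of 50 the weakly supported (A, B) grouping becomes an unresolved trifurcation (A, B, C); with a threshold of 30 the original topology is kept.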
Here, we describe the tree pattern matching algorithm we have implemented. The notations used are the following: T indicates a leaf or a node of the target tree and P indicates a leaf or a node of the tree pattern. SearchPattern(T, P) returns true if P is detected at least once in T and returns false otherwise. Taxa(X) returns the taxon or taxa labelled on any node or leaf X of the pattern or target tree.
BranchConstraint(P) returns the constraints on the branch just above the node or leaf P. Nature(X) returns speciation or duplication, depending on the node X. The SearchPattern algorithm varies, as explained below, depending on whether T and P are leaves or internal nodes. The definition of the first case of the recurrence allows the use of different taxonomic levels.
In this first case, a pattern leaf P is detected in a target tree leaf T if and only if the taxon of T is included in the set of taxa of P. The second case is a pattern leaf searched in a target tree node: the search is propagated recursively through the whole subtree under T, and the only constraints to check are the branch constraints. The third case is a non-binary node P searched in a node T: as explained above, a non-binary node P is detected in a target tree node T if and only if at least one binary version of P is detected.
To find the pattern P in tree T , we can have two kinds of hypotheses. First, when comparing a node T with node P , we can suppose that T and P are two matching nodes.
Then we must verify that children of P can be found in children of T. But for large phylogenetic trees and tree patterns, the hypotheses and solutions to explore are too numerous and it is not reasonable to apply this simple method to real data. A minor change, though, makes this algorithm efficient on large trees.
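The recurrence described above can be illustrated with a minimal, unoptimized sketch. It is an assumption-laden simplification rather than the FamFetch implementation: branch constraints, node nature (speciation or duplication) and wildcards are ignored, the children of a pattern node are allowed to be found anywhere inside distinct child subtrees of the target node, and it explores hypotheses exhaustively, which is exactly the behaviour the text warns does not scale to large trees. `Node` and `search_pattern` are hypothetical names.

```python
from dataclasses import dataclass
from itertools import permutations


@dataclass(frozen=True)
class Node:
    taxa: frozenset = frozenset()  # leaf label: one taxon (target) or a set (pattern)
    children: tuple = ()


def search_pattern(t, p):
    """Return True if pattern p is detected at least once in target t."""
    if not p.children:
        if not t.children:
            # Leaf against leaf: the target taxon must be among the pattern taxa.
            return bool(t.taxa) and t.taxa <= p.taxa
        # Pattern leaf against target node: propagate through the subtree.
        return any(search_pattern(c, p) for c in t.children)
    if not t.children:
        return False  # a pattern node cannot be found in a target leaf
    # Hypothesis 1: t and p are matching nodes, so each child of p must be
    # found under a distinct child of t (children are unordered).
    if len(p.children) <= len(t.children):
        for chosen in permutations(t.children, len(p.children)):
            if all(search_pattern(tc, pc) for tc, pc in zip(chosen, p.children)):
                return True
    # Hypothesis 2: p occurs entirely inside one child subtree of t.
    return any(search_pattern(c, p) for c in t.children)


def leaf(*taxa):
    return Node(frozenset(taxa))


# Target tree: ((human, mouse), chicken)
target = Node(children=(Node(children=(leaf("human"), leaf("mouse"))), leaf("chicken")))

pattern_ok = Node(children=(leaf("human"), leaf("chicken")))
pattern_bad = Node(children=(Node(children=(leaf("human"), leaf("chicken"))), leaf("mouse")))
pattern_level = Node(children=(leaf("human", "mouse"), leaf("chicken")))

found_ok = search_pattern(target, pattern_ok)
found_bad = search_pattern(target, pattern_bad)
found_level = search_pattern(target, pattern_level)
```

The third pattern shows how a leaf labelled with a set of taxa stands in for a higher taxonomic level: either human or mouse satisfies it.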
This simple verification can be done for each recurrence path of the algorithm. RAP has been developed in Java 1. We set RAP parameters so as not to overestimate the number of gene duplications: we considered that a node in the gene tree should be interpreted as a speciation event as long as there is no strong evidence that the gene and species trees are incongruent. Since the trees from the three databases are not bootstrapped, the reliability of each gene tree topology was estimated by taking into account the length of internal branches.
As the NCBI tree has no branch lengths, we did not set any value for the rate ratio parameter, which allows duplication events to be inferred when reconciling a gene tree with the species tree. Tree pattern matching searches can be composed in the FamFetch interface, which also supports many other kinds of queries based on keywords, sequence names or accession numbers, family accession numbers, or taxa crossing.
The pattern editor of FamFetch is made of two frames: the tool frame and the pattern frame (see figure). The pattern frame is an interactive editor that permits the construction of any pattern, node by node and leaf by leaf.
Patterns can be loaded, saved and matched with a tree database from this frame. The tool frame lets the user choose which tool to use in the pattern frame. The possibilities provided by these tools are: (1) add a new node to any part of the tree pattern; (2) set unresolved topologies. After the pattern matching operation, the main frame of FamFetch displays the list of matching families. Finally, the results can be saved in a flat file, each pattern being numbered and described with its gene list.
This last feature is of special importance, as ortholog identification is a key point when establishing molecular phylogenies. For that purpose, the user only has to build a pattern in which duplications are forbidden (see figure). Also, because the trees have been reconciled with RAP, even hidden paralogies due to duplications followed by gene losses in some lineages are taken into account.
For instance, a search of all orthologous pairs between human and mouse in HOVERGEN release 46 found families in which 13 orthologs have been identified. It is important to note that reconciliation algorithms require that the trees are correctly rooted.
The use of other labels like co-expression and phylogeny requires an extensive integration of databases and tools, yet as we have shown the gain is potentially substantial.
Protein sequences were taken from the COG database [7].
As one of the two labels we are interested in is gene order conservation, and gene order evolves more slowly in bacteria [9], we excluded the three genomes of eukaryotes, leaving us with 63 bacterial genomes. Gene order information was taken from Genbank [23] after mapping COG identifiers to Genbank gene identifiers. Gene order was considered to be conserved when two genes have one or more neighbours belonging to the same COG. Information on protein families was taken from the PFAM database. In the cases where an exact match of a protein in our data set could not be identified in the PFAM database, we assigned protein families using HMMer [5] and the models provided by PFAM; in all other cases we used the pre-calculated results distributed with the database.
For each bin we counted the homologous pairs sharing gene order, the homologous pairs not sharing gene order, the non-homologous pairs sharing gene order and the non-homologous pairs not sharing gene order. Numbers of hits were normalized by dividing the number of hits in a given category for a query protein by the total number of hits for that query, in order to prevent large protein families, which have many very similar hits, from skewing the results.
Our methodology requires both a gold standard for defining homologous proteins and a standard for defining non-homologous proteins. Pairs were considered homologous when the proteins belong to the same protein family as defined by the PFAM database or to protein families belonging to the same clan [19], while pairs were considered to be non-homologous when they were part of protein families not belonging to the same clan.
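The per-query normalization can be sketched as below. This is an illustrative reading of the description, applied to the hits of a single bin; the function name `normalized_counts`, the category labels and the input layout (a list of query and category pairs) are assumptions made for the example.

```python
from collections import Counter, defaultdict


def normalized_counts(hits):
    """Sum, over all query proteins, the fraction of each query's hits
    that falls in each category.

    `hits` is an iterable of (query_id, category) pairs for one bin.
    Each query contributes a total weight of 1, so a query with many
    similar hits cannot dominate the result.
    """
    per_query = defaultdict(Counter)
    for query, category in hits:
        per_query[query][category] += 1

    totals = Counter()
    for counts in per_query.values():
        n_hits = sum(counts.values())
        for category, count in counts.items():
            totals[category] += count / n_hits
    return dict(totals)


hits = [
    ("q1", "homologous_shared_order"),
    ("q1", "homologous_shared_order"),
    ("q1", "homologous_shared_order"),
    ("q1", "nonhomologous_no_shared_order"),
    ("q2", "homologous_shared_order"),
]
result = normalized_counts(hits)
```

Here `q1` contributes 0.75 and 0.25 to its two categories and `q2` contributes 1.0, so the four raw hits of `q1` weigh no more than the single hit of `q2`.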
The number of false negatives was reduced by excluding pairs belonging to different PFAM clans while at the same time belonging to the same orthologous group of the COG database. Query-hit pairs formed by one or two proteins not belonging to any protein family were left out of the analysis; this keeps protein pairs that are in fact homologous, but that are not covered by any of the models in the PFAM database, as well as proteins that are hit by a PFAM model but fall just below the gathering threshold established by PFAM, from being counted as non-homologous.
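The labelling rules of the two preceding paragraphs can be summarized in a short sketch. The function `label_pair` and its arguments are hypothetical, and the treatment of families that belong to no clan (two different such families are labelled non-homologous) is an assumption about how the rule "not belonging to the same clan" applies.

```python
def label_pair(fam_a, fam_b, clan_of, cog_a=None, cog_b=None):
    """Label a query-hit pair as 'homologous', 'non-homologous', or
    None (left out of the analysis).

    fam_a, fam_b: PFAM family of each protein, or None if unassigned.
    clan_of: mapping from family to clan (families without a clan are absent).
    cog_a, cog_b: COG orthologous group of each protein, or None.
    """
    if fam_a is None or fam_b is None:
        return None  # proteins without a family are excluded
    if fam_a == fam_b:
        return "homologous"
    clan_a, clan_b = clan_of.get(fam_a), clan_of.get(fam_b)
    if clan_a is not None and clan_a == clan_b:
        return "homologous"
    if cog_a is not None and cog_a == cog_b:
        return None  # same COG but different clans: likely false negative
    return "non-homologous"


# Toy family and clan identifiers, invented for the example.
clan_of = {"famA": "clan1", "famB": "clan1", "famC": "clan2"}

same_family = label_pair("famA", "famA", clan_of)
same_clan = label_pair("famA", "famB", clan_of)
different_clan = label_pair("famA", "famC", clan_of)
same_cog = label_pair("famA", "famC", clan_of, cog_a="cogX", cog_b="cogX")
unassigned = label_pair("famA", None, clan_of)
```

Returning `None` for excluded pairs keeps them out of both the homologous and the non-homologous counts, which is the purpose of the exclusions described in the text.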
Koonin EV: Orthologs, paralogs, and evolutionary genomics. Annu Rev Genet , — Curr Opin Genet Dev , 13 6 — J Mol Biol , 3 — Nucleic Acids Res , 25 17 — Eddy SR: Profile hidden Markov models. Bioinformatics , 14 9 — Omics , 11 1 — BMC Bioinformatics , 4: Kursula P, Ojala J, Lambeir AM, Wierenga RK: The catalytic cycle of biosynthetic thiolase: a conformational journey of an acetyl group through four binding modes and two oxyanion holes.
Biochemistry , 41 52 — Trends Genet , 17 6 — Trends Biochem Sci , 23 9 — BMC Bioinformatics , 5: J Mol Biol , 1 — Bioinformatics , 21 7 —
Coin L, Bateman A, Durbin R: Enhanced protein domain discovery by using language modeling techniques from speech recognition. BMC Genomics , 7: Hartman H, Fedorov A: The origin of the eukaryotic cell: a genomic investigation. Nucleic Acids Res , 34 Database issue :D— J Mol Biol , 4 — Nature , — Koonin EV: Eugene V.
Koonin Interview. Curr Biol , 14 3 :R96—7. Nucleic Acids Res , 34 Database issue :D16— Nucleic Acids Res , 32 5 — Guindon S, Gascuel O: A simple, fast, and accurate algorithm to estimate large phylogenies by maximum likelihood. Syst Biol , 52 5 — Page RD: TreeView: an application to display phylogenetic trees on personal computers. Comput Appl Biosci , 12 4 —
Correspondence to Jos Boekhorst. BS conceived the study, participated in its design and coordination and contributed to the manuscript. JB performed the analysis and drafted the manuscript. Both authors read and approved the final manuscript.
This article is published under license to BioMed Central Ltd.
Boekhorst, J. Identification of homologs in insignificant blast hits by exploiting extrinsic gene properties. BMC Bioinformatics 8,
Here are corresponding sections of the Pax6-like eye-building genes for our visionaries. Similarities to the mouse gene are highlighted in green.
But why are these genes so similar when the animals from which they come, and the eyes that they develop, are so different?
As discussed earlier, there are two basic evolutionary explanations for similarities: homology and analogy. Are these genes homologous or analogous? Based on the observations that all of these gene versions are remarkably similar in sequence, have related functions, and are incredibly widespread (animals all across the tree of life have them), scientists have concluded that they must be homologous and must have been inherited from the common ancestor of all these animals.
It is just too unlikely that all these different animal lineages happened to independently evolve remarkably similar genes that do remarkably similar jobs.