Tree Edit Distance

About

The Tree Edit Distance website is a reference place to measuring similarity of tree structured data using the tree edit distance (TED) measure. The tree edit distance is defined as the minimum-cost sequence of node edit operations that transform one tree into another.

On this website you can:

download the tree edit distance implementation as a runnable Java JAR file with manual
download the tree edit distance source code in Java
learn about the key concepts of the algorithms
get pointers to tree edit distance literature

Tree Edit Distance

Definition

The tree edit distance between ordered labeled trees is the minimal-cost sequence of node edit operations that transforms one tree into another. We consider following three edit operations on labeled ordered trees:

delete a node and connect its children to its parent maintaining the order.
insert a node between an existing node and a subsequence of consecutive children of this node.
rename the label of a node.

There are many different sequences that transform one tree into another. We assign a cost to each edit operation. Then, the cost of an edit sequence is the sum of the costs of its edit operations. Tree edit distance is the sequence with the minimal cost.

More details

Recursive solution

The tree edit distance problem has a recursive solution that decomposes the trees into subtrees and subforests. The distance between two forests is computed in constant time from the solution of smaller subproblems. At each recursive step there are two ways in which the forests can be decomposed into smaller problems: either by deleting the leftmost or the rightmost root node of the forests.

The distance between two forests is the minimum distance of four smaller problems. The costs of the edit operations are added to the corresponding problems.

The subproblems that occur during the computation of the tree edit distance are called relevant subproblems. Their number, which describes also the time complexity, depends on the choice between left and right solution at each recursive step. The tree edit distance algorithms try to minimize the number of relevant subproblems.

Algorithms

The straightforward implementation of the recursive solution results in exponential complexity. The state-of-the-art algorithms use dynamic programming to achieve polynomial runtime. Smaller subproblems are memorized and reused for solving bigger subproblems. They use a so called decomposition strategy to determine the choice of left vs. right at each recursive step.

Several algorithms have been propesed. Tai [1] presented the first non-exponential solution in 1979. Tai's algorithm runs in \(O(m^3n^3)\) time and space for trees with \(m\) and \(n\) nodes respectively. Zhang and Shasha [2] improve the time complexity to \(O(m^2n^2)\). They always choose the left decomposition rule. Morover, they take advantage of an observation that not all possible subproblems are needed to compute the tree edit distance. They partition the subproblems in a smart way and result in better, \(O(mn)\), space complexity. Klein [3] takes advantage of heavy paths in one of the trees to guide the choice between two recursive solutions. Klein obtains \(O(n^2m\log m)\) time and space complexity. Demaine et al. [4] use also heavy paths. However, contrary to Klein, the heavy paths are used in both trees. This makes the algorithm more efficient with \(O(n^2m(1+\log \frac{m}{n}))\) time complexity. Moreover, similarly to Zhang and Shasha, Demaine et al. found a smart way to partition subproblems and obtain \(O(mn)\) space complexity.

Problem of existing solutions

Due to the time and space complexities, the two main competitors are the algorithms of Zhang and Shasha and Demaine. The first one is effcient for some interesting tree shapes, e.g. balanced trees. The second one was shown to be worst-case optimal, but unfortunately the worst case happens frequently. The efficiency of the two algorithms havily depends on the tree shape and it is hard to chose between them. The choice of the wrong algorithm leads to a prohibitive runtime.

Addressing this problem, a robust algorithm (RTED) has been proposed by Pawlik and Augsten [5]. RTED is not only tree-shape independent, but for any problem instance computes as many relevant subproblems as the best competitor must compute. Moreover, in many cases RTED outperforms the competitors. RTED runs in \(O(n^3)\) time and requires \(O(mn)\) space.

Path strategies

The choice between left and right recursive solution can be determined by so called path decompositions. They use root-leaf paths in a tree to decompose it into subtrees and subforests which compose relevant subproblems.

The relevant subtrees of a tree for some root-leaf path are all subtrees that result from removing a path from the tree.

For each subproblem composed of two relevant subtrees we have to decide a path for the path decomposition. A path can be chosen either in the left-hand or in the right-hand tree. The paths which we choose define a path strategy. A path strategy is represented by a mapping between each pair of subtrees of the two input trees to a root-leaf path in one of those subtrees.

General Tree Edit Distance Algorithm

The concept of path decomposition (relevant subtrees and subforests) has been used in the literature to compute the tree edit distance in the optimal, \(O(mn)\), memory [2][4]. Moreover, these concepts allow to develop a general algortihm for the tree edit distance, which for any path strategy computes the tree edit distance in \(O(mn)\) memory.

A single-path function computes the tree edit distance between two relevant subtrees according to the chosen path. For two given trees, \(F\) and \(G\), a single path function computes the distances between each subtree of \(F\) rooted in the chosen path and all subtrees of \(G\). The distances are stored in a distance matrix for later reuse. The precondition is that the distances between every relevant subtree of \(F\) (according to the chosen path) and all subtrees of \(G\) have been computed beforehand.

In literature single-path functions for left and right paths [2] and heavy paths [3][4] have been described.

We were able to formulate the general algorithm with the following observation. Given two trees, \(F\) and \(G\), we can compute the distance between them in two steps. First, we choose a path in one of them, say \(F\), and compute the distances between the relevant subtrees of \(F\) and tree \(G\). Those distances will be reused in the next step. Second, we compute the distance between \(F\) and \(G\) in a bottom-up manner computing the distances between the relevant subforests of \(F\) and all corresponding subforests of \(G\). We do not recompute the distances obtained in the first step.

The general algorithm for the tree edit distance uses a path strategy to compute the tree edit distance. It performs the following steps.

For a given pair of trees, \((F, G)\), look up the path in the path strategy.
If the path is in \(F\) do the following steps, otherwise run the algorithm for \((G, F)\).

Run the algorithm for every relevant subtree \(F'\) in \(F\) and the tree \(G\).
Compute the single-path function for \((F, G)\) according to the path's type.

RTED - Robust Algorithm for the Tree Edit Distance

RTED computes the optimal strategy by performing an exhaustive search in the space of all possible strategies. The optimal strategy is computed in quadratic time and space, thus the strategy computation does not increase the complexity of the tree edit distance algorithm, which is at least \(O(n^2)\).

Running the optimal strategy with the general tree edit distance algorithm described above leads to the robust RTED algorithm. RTED is uaranteed to perform as good or better than its competitors in terms of relevant subproblems that must be computed.

Download

Source code and runnables

TopDiff

Tree edit distance for large trees.

TJoin

The ongoing research on tree similarity join.

Algorithms

Algorithms repository (under active development)

Experiments

Experiments framework repository

Datasets

Datasets repository

Citing TJoin

If you want to refer to TJoin in a publication, please cite the following paper.

T. Hütter, M. Pawlik, R. Löschinger, N. Augsten. Effective Filters and Linear Time Verification for Tree Similarity Joins. ICDE, 2019.

APTED

The current state of our research and experience is implemented in the APTED Java source code.

Visit https://github.com/DatabaseGroup/apted for the latest version of the source code.

Licence

APTED is free to redistribute and/or modify under the terms of the MIT licence.

Citing APTED

If you want to refer to APTED in a publication, please cite the following papers. The final implementation builds on top of both of these papers.

M. Pawlik and N. Augsten. Tree edit distance: Robust and memory-efficient. Information Systems 56. 2016.

M. Pawlik and N. Augsten. Efficient Computation of the Tree Edit Distance. ACM Transactions on Database Systems (TODS) 40(1). 2015.

Known issues

Strategy computation for trees larger than 40k nodes may be faulty. This may be due to float type used for cost representation.

RTED

Updates to RTED implementation are discontinued. APTED supersedes RTED.

RTED 1.2 [fixed support for non-unit costs]

RTED 1.1 [JAR file including source code; for manual run java -jar RTED_v1.1.jar -h]

Licence

RTED is free to redistribute and/or modify under the terms of the GNU Affero General Public License. Contact us for other licensing options.

Citing RTED

If you want to refer to RTED in a publication, please cite the following PVLDB paper.

M. Pawlik and N. Augsten. RTED: A Robust Algorithm for the Tree Edit Distance. PVLDB 5(4). 2011.

Known issues

Support for non-unit costs was faulty. Fixed in version 1.2. Deletion and insertion costs must be equal.
Non-alphanumeric characters cause input parsing errors - not fixed.
For large trees, the JVM's memory and stack size may need to be increased.

Repeatability packages

We provide the repeatability packages for two of our publications.

ACM TODS

The package is ready. Contact us if you'd like to have it.

Information Systems

We are currently working on the package.

Bibliography

Literature for the classic tree edit distance problem. All the solutions are given for ordered labeled trees and allow three node edit operations, i.e., delete a node, insert a node, rename a node's label.

[9] S. Schwarz, M. Pawlik and N. Augsten. A New Perspective on the Tree Edit Distance. SISAP. 2017. [PDF]

[8] M. Pawlik and N. Augsten. Tree edit distance: Robust and memory-efficient. Information Systems. 2016. [PDF]

[7] M. Pawlik and N. Augsten. Efficient Computation of the Tree Edit Distance. ACM Transactions on Database Systems (TODS). 2015. [PDF]

[6] M.Pawlik and N.Augsten. A Memory-Efficient Tree Edit Distance Algorithm. DEXA. 2014. [PDF] [presentation]

Pawlik and Augsten address the memory problem of the RTED algorithm. The optimal strategy computation step may use twice the memory of the distance computation and thus it is a memory bottleneck. The authors show a memory efficient algorithm for reducing the memory usage. This is achieved by systematically releasing memory early during the strategy computation.

[5] M.Pawlik and N.Augsten. RTED: A Robust Algorithm for the Tree Edit Distance. PVLDB. 2011. [PDF] [presentation]

Pawlik and Augsten show that the efficiency of the previous solutions for the tree edit distance heavily depends on the tree shape and run into their worst cases for some data instances. They generalize the previous approaches and develop a new algorithm with \(O(n^3)\) time (with \(n>m\)) and \(O(mn)\) space complexity. Their solution efficient for all tree shapes and never runs in the worst case if a better solution exists.

[4] E.D. Demaine, S. Mozes, B. Rossman and O. Weimann. An optimal decomposition algorithm for tree edit distance. ACM Trans. on Alg. 2009. [PDF]

Demiane et al. develop the first worst-case optimal algorithm for the tree edit distance. The algorithm runs in \(O(n^2m(1+\log \frac{m}{n}))\) time and \(O(mn)\) space. They show a solution for binary trees and explain how to extend it to general trees.

[3] P.N. Klein. Computing the edit-distance between unrooted ordered trees. In European Symposium on Algorithms (ESA). 1998. [PDF]

Klein improves the time complexity of Zhang and Shasha's algorithm to \(O(n^2m\log m)\). However, this solution requires \(O(n^2m\log m)\) space. They use Euler strings to encode the trees.

[2] K.Zhang and D.Shasha. Simple fast algorithms for the editing distance between trees and related problems. SIAM J. Comput. 1989. [PDF]

The algorithm of Zhang and Shasha improves Tai's algorithm and obtains \(O(m^2n^2)\) time and \(O(mn)\) space complexity. For trees with \(l\) nodes and depth \(d\) the runtime is \(O(mn \min(l_F,d_F) \min(l_G,d_G))\). The improvement is achieved by the observation that we don't need the distance between each pair of subforests. Furthermore, they partition the computation the way that only two distance matrices of \(|mn|\) size are used for memorizing the intermediate results.

[1] K.-C. Tai. The tree-to-tree correction problem. J. ACM. 1979. [PDF]

Tai presents the first non-exponential algorithm for computing the tree edit distance. It runs in \(O(m^3n^3)\) time and space for trees \(F\) and \(G\) with \(m\) and \(n\) nodes respectively.

Compare your trees with ease

About

Tree Edit Distance

Definition

More details

Recursive solution

Algorithms

Problem of existing solutions

Path strategies

General Tree Edit Distance Algorithm

RTED - Robust Algorithm for the Tree Edit Distance

Download

Source code and runnables

TopDiff

Algorithms

Experiments

Datasets

Citing TopDiff

TJoin

Algorithms

Experiments

Datasets

Citing TJoin

APTED

Licence

Citing APTED

Known issues

RTED

Licence

Citing RTED

Known issues

Repeatability packages

ACM TODS

Information Systems

Bibliography

Contact

Mateusz Pawlik

Nikolaus Augsten