[plt-scheme] Interacting w/ MzScheme
Felix Klock's PLT scheme proxy
pltscheme@pnkfx.org
Fri, 10 Dec 2004 01:00:54 -0500
Don-
On Dec 10, 2004, at 12:33 AM, Don Blaheta wrote:
> For list-related administrative tasks:
> http://list.cs.brown.edu/mailman/listinfo/plt-scheme
>
> Quoth Matthias Felleisen:
>> Way back when I was young and at Rice, I actually started planning a
>> tree diff -- and promptly forgot about it that afternoon.
>
> I needed a tree diff for some of my natural language processing
> research, and I got a lot of mileage out of plain old gnudiff with a
> little bit of preprocessing. Is the problem algorithmic, or just a
> "getting around to it" issue?
>
I don't know what Matthias's response is going to be, but I'd say big=20
part of it is that its an algorithmic problem. The best algorithms for=20=
calculating a minimal tree diff tend to be O(n^4) or so, where n is the=20=
number of nodes in the tree.
Of course, this isn't a proof that you are stuck with O(n^4) to get=20
useful results. The standard diff implementation doesn't calculate a=20
minimal diff either, so you might be able to choose a different=20
algorithm and get usable results in a reasonable amount of time (see=20
the work of Chawathe and Garcia-Molina referenced below).
----
Another problem that I don't think many people have given much thought=20=
to (and thus would be a big part of the research) is how to *display*=20
the feedback of the tree diff to the user; these algorithms all tend to=20=
return edit script, which are just sequences of tree editing=20
operations. This is because that's the way the algorithms for string=20
diff were formulated as well; its just that there happens to be some=20
very nice ways to present string edit scripts graphically. I'm not=20
sure that the tree edit scripts will be so easily applied to the=20
visualized s-exp in your editors, especially if you choose to include=20
operations like "copy-subtree" or "move-subtree" in your set of edit=20
operations.
----
Here's my summary of a literature review I did during October while=20
preparing a project proposal.
- Zhang and Shasha [15] give an algorithm for solving the tree edit=20
problem in time
O(|T1||T2| min{height (T1), |leaves(T1)|} min{height (T2),=20
|leaves(T2)|})
- Bille [3] develops a unified framework for describing tree-diff=20
algorithms and summarizes the worst-case behavior of various tree diff=20=
algorithms (including Zhang and Shasha=92s work) with various optional=20=
constraints on the problem domain (ordered versus unordered children,=20
subtree matchings, unit-cost, etc).
- Barnard et al. [2] extend previous work in Tree-to-Tree correction=20
(not specifically about program parse trees), adding Subtree Insertion,=20=
Deletion, and Swapping to the set of allowed operations.
- Chawathe and Garcia-Molina [6, 5] explored Tree transformation=20
supporting subtree move and copy. where the rules are applied in=20
parallel rather than in sequence. Their technique produces non-optimal=20=
solutions (in order to allow a more efficient algorithm), reducing the=20=
problem of change detection to one of choosing a min-cost edge cover of=20=
a bipartite graph.
- Yang [13] developed a tool for calculating differences between parse=20=
trees of C programs, focusing mainly on how to generate matchings=20
between the nodes of the parse tree and on how to display the diff to=20
the user. His tool did not support a Copy-Subtree operation in the tree=20=
edit scripts, and did not encode any knowledge of the operational=20
semantics of C or of any common refactorings for C programs. Yang did=20
consider some tree comparison algorithms much such like those those in=20=
Section 7.1. However, he discounted their relevance for his problem,=20
since he felt that their reliance on a distance metric between nodes=20
(specifically with respect to the relabeling tree operation) was at=20
odds with the constraints imposed upon him by the C language and its=20
mixing of types and constants in a single namespace that cannot be=20
distinguished by its context-free grammar.
- Horwitz [7] and Yang et al. [14] did incorporate knowledge of program=20=
behavior into a diff tool. Their approaches handled variable renaming=20
and other semantics preserving transformations. However, their tool was=20=
only applicable to a particularly weak language (no procedural=20
abstraction; just variables, constants, assignments, conditionals, and=20=
whileloops).
- Binkley [4] applies semantic differencing on dependence graphs to=20
determine which test cases in a regression test suite should be re-run,=20=
and also to generate a new program that captures only the behavior of=20
the affected components. While the technology employed by Binkley may=20
be applicable to the work here, the goals are decidedly different: his=20=
tool was meant to produce output intended for consumption by a testing=20=
tool, while my tool is intended to provide direct feedback for human=20
developers.
----
Bibliography
[2] David T. Barnard, Gwen Clarke, and Nicolas Duncan. Tree-to-tree=20
correction for document trees. Technical Report 95-372, Queen=92s=20
University, January 1995.
[3] Philip Bille. Tree edit distance, alignment distance and inclusion.=20=
Technical Report TR-2003-23, IT University of Copenhagen, March 2003.
[4] D. Binkley. Using semantic differencing to reduce the cost of=20
regression testing. In Proceedings of the International Conference on=20
Software Maintenance 1992, pages 41=96 50, 1992.
[5] S. Chawathe and H. Garcia-Molina. An expressive model for comparing=20=
tree-structured data, 1997.
[6] Sudarshan S. Chawathe, Anand Rajaraman, Garcia-Molina=20
Garcia-Molina, and Jennifer Widom. Change detection in hierarchically=20
structured information. In Proceedings of theACM SIGMOD International=20
Conference on Management of Data, volume 25, 2 of ACM SIGMOD Record,=20
pages 493=96504, New York, June 4=966 1996. ACM Press.
[7] Susan Horwitz. Identifying the semantic and textual differences=20
between two versions of a program. ACM SIGPLAN Notices, 25(6):234=96245,=20=
June 1990.
[12] Joel Winstead and David Evans. Towards differential program=20
analysis. In Proceedings WODA 2003 the Workshop on Dynamic Analysis,=20
pages 37=9640, Portland, Oregon, May 2003.
[13] Wuu Yang. Identifying syntactic differences between two programs.=20=
Software - Practice and Experience, 21(7):739=96755, 1991.
[14] Wuu Yang, Susan Horwitz, and Thomas Reps. Detecting program=20
components with equivalent behaviors. Technical Report CS-TR-1989-840,=20=
University of Wisconsin, 1989.
[15] K. Zhang and D. Shasha. Simple fast algorithms for the editing=20
distance between trees and related problems. SIAM J. Comput.,=20
18(6):1245=961262, 1989.
-Felix
----
"I probably would, Nick... but I'd appreciate it more
if you wouldn't park it on my kid."
-www.redmeat.com