[plt-scheme] Interacting w/ MzScheme

Felix Klock's PLT scheme proxy pltscheme@pnkfx.org
Fri, 10 Dec 2004 01:00:54 -0500


Don-

On Dec 10, 2004, at 12:33 AM, Don Blaheta wrote:

>   For list-related administrative tasks:
>   http://list.cs.brown.edu/mailman/listinfo/plt-scheme
>
> Quoth Matthias Felleisen:
>> Way back when I was young and at Rice, I actually started planning a
>> tree diff -- and promptly forgot about it that afternoon.
>
> I needed a tree diff for some of my natural language processing
> research, and I got a lot of mileage out of plain old gnudiff with a
> little bit of preprocessing.  Is the problem algorithmic, or just a
> "getting around to it" issue?
>

I don't know what Matthias's response is going to be, but I'd say big=20
part of it is that its an algorithmic problem.  The best algorithms for=20=

calculating a minimal tree diff tend to be O(n^4) or so, where n is the=20=

number of nodes in the tree.

Of course, this isn't a proof that you are stuck with O(n^4) to get=20
useful results.  The standard diff implementation doesn't calculate a=20
minimal diff either, so you might be able to choose a different=20
algorithm and get usable results in a reasonable amount of time (see=20
the work of Chawathe and Garcia-Molina referenced below).

----

Another problem that I don't think many people have given much thought=20=

to (and thus would be a big part of the research) is how to *display*=20
the feedback of the tree diff to the user; these algorithms all tend to=20=

return edit script, which are just sequences of tree editing=20
operations.  This is because that's the way the algorithms for string=20
diff were formulated as well; its just that there happens to be some=20
very nice ways to present string edit scripts graphically.  I'm not=20
sure that the tree edit scripts will be so easily applied to the=20
visualized s-exp in your editors, especially if you choose to include=20
operations like "copy-subtree" or "move-subtree" in your set of edit=20
operations.

----

Here's my summary of a literature review I did during October while=20
preparing a project proposal.

- Zhang and Shasha [15] give an algorithm for solving the tree edit=20
problem in time
O(|T1||T2| min{height (T1), |leaves(T1)|} min{height (T2),=20
|leaves(T2)|})

- Bille [3] develops a unified framework for describing tree-diff=20
algorithms and summarizes the worst-case behavior of various tree diff=20=

algorithms (including Zhang and Shasha=92s work) with various optional=20=

constraints on the problem domain (ordered versus unordered children,=20
subtree matchings, unit-cost, etc).

- Barnard et al. [2] extend previous work in Tree-to-Tree correction=20
(not specifically about program parse trees), adding Subtree Insertion,=20=

Deletion, and Swapping to the set of allowed operations.

- Chawathe and Garcia-Molina [6, 5] explored Tree transformation=20
supporting subtree move and copy. where the rules are applied in=20
parallel rather than in sequence. Their technique produces non-optimal=20=

solutions (in order to allow a more efficient algorithm), reducing the=20=

problem of change detection to one of choosing a min-cost edge cover of=20=

a bipartite graph.

- Yang [13] developed a tool for calculating differences between parse=20=

trees of C programs, focusing mainly on how to generate matchings=20
between the nodes of the parse tree and on how to display the diff to=20
the user. His tool did not support a Copy-Subtree operation in the tree=20=

edit scripts, and did not encode any knowledge of the operational=20
semantics of C or of any common refactorings for C programs. Yang did=20
consider some tree comparison algorithms much such like those those in=20=

Section 7.1. However, he discounted their relevance for his problem,=20
since he felt that their reliance on a distance metric between nodes=20
(specifically with respect to the relabeling tree operation) was at=20
odds with the constraints imposed upon him by the C language and its=20
mixing of types and constants in a single namespace that cannot be=20
distinguished by its context-free grammar.

- Horwitz [7] and Yang et al. [14] did incorporate knowledge of program=20=

behavior into a diff tool. Their approaches handled variable renaming=20
and other semantics preserving transformations. However, their tool was=20=

only applicable to a particularly weak language (no procedural=20
abstraction; just variables, constants, assignments, conditionals, and=20=

whileloops).

- Binkley [4] applies semantic differencing on dependence graphs to=20
determine which test cases in a regression test suite should be re-run,=20=

and also to generate a new program that captures only the behavior of=20
the affected components. While the technology employed by Binkley may=20
be applicable to the work here, the goals are decidedly different: his=20=

tool was meant to produce output intended for consumption by a testing=20=

tool, while my tool is intended to provide direct feedback for human=20
developers.

----

Bibliography

[2] David T. Barnard, Gwen Clarke, and Nicolas Duncan. Tree-to-tree=20
correction for document trees. Technical Report 95-372, Queen=92s=20
University, January 1995.

[3] Philip Bille. Tree edit distance, alignment distance and inclusion.=20=

Technical Report TR-2003-23, IT University of Copenhagen, March 2003.

[4] D. Binkley. Using semantic differencing to reduce the cost of=20
regression testing. In Proceedings of the International Conference on=20
Software Maintenance 1992, pages 41=96 50, 1992.

[5] S. Chawathe and H. Garcia-Molina. An expressive model for comparing=20=

tree-structured data, 1997.

[6] Sudarshan S. Chawathe, Anand Rajaraman, Garcia-Molina=20
Garcia-Molina, and Jennifer Widom. Change detection in hierarchically=20
structured information. In Proceedings of theACM SIGMOD International=20
Conference on Management of Data, volume 25, 2 of ACM SIGMOD Record,=20
pages 493=96504, New York, June 4=966 1996. ACM Press.

[7] Susan Horwitz. Identifying the semantic and textual differences=20
between two versions of a program. ACM SIGPLAN Notices, 25(6):234=96245,=20=

June 1990.

[12] Joel Winstead and David Evans. Towards differential program=20
analysis. In Proceedings WODA 2003 the Workshop on Dynamic Analysis,=20
pages 37=9640, Portland, Oregon, May 2003.

[13] Wuu Yang. Identifying syntactic differences between two programs.=20=

Software - Practice and Experience, 21(7):739=96755, 1991.

[14] Wuu Yang, Susan Horwitz, and Thomas Reps. Detecting program=20
components with equivalent behaviors. Technical Report CS-TR-1989-840,=20=

University of Wisconsin, 1989.

[15] K. Zhang and D. Shasha. Simple fast algorithms for the editing=20
distance between trees and related problems. SIAM J. Comput.,=20
18(6):1245=961262, 1989.

-Felix

----
"I probably would, Nick... but I'd appreciate it more
  if you wouldn't park it on my kid."
  -www.redmeat.com