[CS241] complexity of complete-link clustering

Matt Lease mlease at cs.brown.edu
Fri Sep 15 15:59:31 EDT 2006


In class Monday I described the time complexity of complete-link
clustering as O(n^3).  One student asked about this as the book
indicates the complexity is actually O(n^2 log n).  It turns out that
the latest edition of M&S has this improved time bound as a revision not
found in my older edition of the book.  While there is an online errata
of the book (http://nlp.stanford.edu/fsnlp/errata.html), in this case
the errata does not mention this revision, so I did not know about it.

Instead of recomputing the similarity matrix (n^2) n times, giving
O(n^3), the improvement comes from using a sorted priority queue:

"The different similarities ... can be computed once for each pair of
clusters and inserted into a priority queue. As a pair of clusters i and
j is selected to be merged to form cluster p, then the priority queue is
updated so that any gains corresponding to cluster pairs involving
either i or j are removed, and the gains of merging the rest of the
clusters with the newly formed cluster p are inserted. During the lth
agglomeration step, that involves O(n - l) priority queue delete and
insert operations. If the priority queue is implemented using a binary
heap, the total complexity of these operations is O((n - l) log(n - l)),
and the overall complexity over the n - 1 agglomeration steps is O(n2
log n). "

http://www-users.cs.umn.edu/~karypis/publications/Papers/Postscript/vhcluster2.ps



More information about the CS241 mailing list