Making All the Difference

Norman Walsh, Staff Engineer
Sun Microsystems, XML Technology Center


Table of Contents

Useful Links

Have you seen the review version of the Second Edition of the XML 1.0 Recommendation? By highlighting the changes since the first edition, the XML Core Working Group made it vastly easier to determine what has been revised. In this column, we'll look at diffmk a tool that makes it easy to build similar "diff" versions of your XML documents.

I write a lot of documents: white papers, technical reports, design specifications, notes, conference papers, proposals, etc. And, of course, I write them all in XML. Most of these documents go through some sort of review before publication, even if it's as informal as asking some of my colleagues to skim the document and make sure I haven't made a fool of myself.

After the first round of review, I want to republish the document in some format that will allow the reviewers to see what has been changed, that's where diffmk comes in.

Considering the Differences Between Two Documents

For the purposes of exposition, let's consider two versions of a contrived document. The first draft looks like this:

 

Example 1. The First Draft of a Document

<!DOCTYPE article 
  PUBLIC "-//Norman Walsh//DTD Simplified DocBook XML V4.1.2.3//EN"
  "http://nwalsh.com/docbook/simple/4.1.2.3/sdocbook.dtd">
<article>
<title>A Test Document</title>
<para>This is para 1.</para>
<para>This is para 2 <emphasis>with emphasis</emphasis> in it.</para>
<para>This is para 3.</para>
<para id="p4">This is para 4.</para>
<para id="p5">This is para 5.</para>
<para>This is para 6.</para>
<para>This is para 7.</para>
<para>This is para 8.</para>
<para>This is para 9.</para>
</article>


After hypothetical review, I arrive at another version:

 

Example 2. A Subsequent Draft of the Document

<!DOCTYPE article 
  PUBLIC "-//Norman Walsh//DTD Simplified DocBook XML V4.1.2.3//EN"
  "http://nwalsh.com/docbook/simple/4.1.2.3/sdocbook.dtd">
<article><title>A Contrived Test Document</title>
<para>This is para 1.</para>
<para>This is para 2 <emphasis role="bold">with emphasis</emphasis> changed in it.</para>
<para>This is a new para 2b.</para>
<para>This is
para 3.</para>
<para id="p4">This is a different para 4.</para>
<para>This is a new para 4b.</para>
<para id="p5">This is para 5.</para>
<para>This is para 8.</para>
<para>This is para 9.</para>
</article>


As the author, I can usually get by with the diff command that shows me what lines have changed in the document:

 

Example 3. A diff View of the Changes

4,5c4
< <article>
< <title>A Test Document</title>
---
> <article><title>A Contrived Test Document</title>
7,9c6,11
< <para>This is para 2 <emphasis>with emphasis</emphasis> in it.</para>
< <para>This is para 3.</para>
< <para id="p4">This is para 4.</para>
---
> <para>This is para 2 <emphasis role="bold">with emphasis</emphasis> changed in it.</para>
> <para>This is a new para 2b.</para>
> <para>This is
> para 3.</para>
> <para id="p4">This is a different para 4.</para>
> <para>This is a new para 4b.</para>
11,12d12
< <para>This is para 6.</para>
< <para>This is para 7.</para>

That's relatively useful to me as the author; I can see what I've edited and I can determine if the changes are significant or not. But it's nearly useless for my colleagues. It's likely that they've seen an HTML version of the document, not the XML sources. And even if they've seen the XML, it's almost certainly been filtered through some sort of stylesheet for presentation.

What I really want is the ability to publish a "colorized" diff version, like this:

 

Figure 1. A "colorized" View of the Changes

A screen shot of the colorized result

Here we see changed text in green, new text in yellow, and deleted text in red (with strike-outs).

There are three ways to achieve this appearance:

  1. By hand.

    With patience and care, the author can manually keep track of the changes made to a document.

    This is tedious and error-prone, but if carefully managed it has the advantage of providing the highest level of fidelity. The author can make value judgments about where best to identify changes (at the sentence, paragraph, or division level, for example). This is the technique used to produce the Second Edition of XML 1.0.

  2. With an intelligent editing tool.

    An editing application with enough information about the authoring schema can track changes to a document. Sophisticated applications can maintain a complete history of all changes.

    Although the author generally loses the ability to specify the level of granularity for individual changes, the tremendous benefit of having the authoring tool keep track of the changes generally outweighs this small sacrifice. Unfortunately, these applications are uncommon and often expensive. And because they require information about the schema, they are not easily portable across different authoring schemas.

  3. By automated comparison of two documents.

    This is the approach that diffmk takes. Both documents are loaded and the Perl Algorithm::Diff module[1] is used to calculate the differences between them. (Finding the smallest number of differences between two documents is an interesting problem. All of the modern implementations, including Algorithm::Diff, seem to be based on work done by Hunt and McIlroy, see [HM76] and [HS77].)

    The automated approach has several things going for it: the author is not required to manually track changes, it will work with any schema, and it does not require a specific editing tool. Its primary drawback is that it does not offer the same degree of fidelity.

    On the whole, the ease of use afforded by the automated approach, and the fact that it can be applied "after the fact" makes it a win.

What Does diffmk Do?

In a nutshell, diffmk works like this: it converts the documents into two lists of nodes (text and/or element nodes) and attempts to find the longest common subsequence of nodes. Phrased another way, it finds the smallest number of additions and deletions to each list that are required to make the two lists the same.

Unfortunately, the actual set of changes is not always the shortest distance between two versions. So, naturally, the algorithm sometimes identifies changes incorrectly. However, it does identify all of the changes. In other words, it sometimes incorrectly identifies what we would consider a "change" as an "addition followed by a deletion" or an "addition followed by a deletion" as a "change", but it never misses anything. So the highlighted text always shows at least everything that has changed, and sometimes a little more.

Identifying Changes

Simply finding the differences wouldn't be very useful if the changes weren't identified in some way. There are a number of possible ways for diffmk to record the differences:

In the long run, XLink is probably the most interesting answer, but given the current technologies available, using elements and attributes in the same namespace as the source document is easiest.

The consequence for diffmk is that in order for the diffmk-version of a document to be schema-valid, the schema must support a global "diff" attribute with values to identify changed, deleted, and added text.

The name of this attribute, its values, and the elements that diffmk will add to a document, if necessary, are controlled by the diffmk.xml Control File. The control file is an XML document, as described in the diffmk(1) reference page.

Styling the Result

If we consider the diffmk-version of our sample documents, we'll see that running diffmk is only half the battle:

 

Example 4. A diffmk View of the Changes

<!DOCTYPE article PUBLIC "-//Norman Walsh//DTD Simplified DocBook XML V4.1.2.3//EN" "http://nwalsh.com/docbook/simple/4.1.2.3/sdocbook.dtd">
<?diffmk version='0.1'
    oldfile='test1.xml'
    newfile='test2.xml'
    attribute='revisionflag'
    changed='changed'
    added='added'
    deleted='deleted'
    diff='both'
    showdelete='1'
    includedeletedelements='0'
    ignorewhitespace='1'
    usechanged='1'
?>
<article><title revisionflag="changed">A Contrived Test Document</title>
<para>This is para 1.</para>
<para>This is para 2 <emphasis role="bold" revisionflag="changed">with emphasis</emphasis><phrase revisionflag="changed"> changed in it.</phrase></para>
<para revisionflag="added">This is a new para 2b.</para>
<para revisionflag="added">This is
para 3.</para>
<para id="p4" revisionflag="changed">This is a different para 4.</para>
<para revisionflag="deleted">This is para 5.</para><para revisionflag="changed">This is a new para 4b.</para>
<para id="p5" revisionflag="changed">This is para 5.</para>
<para>This is para 8.</para>
<para>This is para 9.</para>
</article>

We really need stylesheets to display this information in a meaningful way. Exactly how you accomplish this will depend on the tools that you use to apply style. If you're displaying XML in a browser natively, you may be able to get the results you want with a CSS or XSL stylesheet. If you're printing, you'll probably need a different stylesheet to render the changed regions in a typographically distinct way.

One of the common ways to display XML documents today is to use XSLT to transform the XML sources into HTML, so that they can be viewed in any browser. With that in mind, let's look at a couple of techniques that may be useful for this purpose.

Using xsl:import

If you're using XSLT to transform your source documents into HTML, you probably already have an XSLT stylesheet for your authoring schema. The question is, how can you add support for the "diff" attributes?

The easiest way to start is by writing a stylesheet extension using xsl:import. Here's an example extracted from the DocBook stylesheets:

 

Example 5. Extending an XSL Stylesheet for Highlighting

<?xml version="1.0"?>
<xsl:stylesheet xmlns:xsl="http://www.w3.org/1999/XSL/Transform"
                version="1.0">

<xsl:import href="docbook.xsl"/>

<xsl:template match="*[@revisionflag]">
  <xsl:choose>
    <xsl:when test="local-name(.) = 'para'
                    or local-name(.) = 'section'
                    or local-name(.) = 'appendix'">
      <div class='{@revisionflag}'>
    <xsl:apply-imports/>
      </div>
    </xsl:when>
    <xsl:when test="local-name(.) = 'phrase'
                    or local-name(.) = 'ulink'
                    or local-name(.) = 'xref'">
      <span class='{@revisionflag}'>
    <xsl:apply-imports/>
      </span>
    </xsl:when>
    <xsl:otherwise>
      <xsl:message>
    <xsl:text>Revisionflag on unexpected element: </xsl:text>
    <xsl:value-of select="local-name(.)"/>
    <xsl:text>(Assuming block)</xsl:text>
      </xsl:message>
      <div class='{@revisionflag}'>
    <xsl:apply-imports/>
      </div>
    </xsl:otherwise>
  </xsl:choose>
</xsl:template>

</xsl:stylesheet>

This works because XSLT matches templates in the current stylesheet before looking at imported stylesheets. So any element ("*") that has a revisionflag attribute ("[@revisionflag]") will be matched by the template in this stylesheet. Based on whether the element is a block or inline element, we output a div or span with an appropriate class then apply imports.

Applying imports makes XSLT call the "real" template for the element in question.

This approach will probably cover most of the cases in your documents. You may have to add a few extra templates to handle special elements (like table rows and list items) where HTML does not allow arbitrary div or span elements.

Conclusion

The diffmk command provides an easy way to identify the differences between two XML documents. Although it is not 100% accurate, it never underestimates and it does not require author intervention during revision.

If you have comments or suggestions for diffmk, please let me know.

Bibliography

Hunt, J. W. and M. D. McIlroy, "An algorithm for differential file comparison", Computing Science Technical Report 41, AT&T Bell Laboratories, Murray Hill, N.J., 1976.

Hunt, J.W and T.G. Szymanski, "A fast algorithm for computing longest common subsequences", Communications of the ACM, vol. 20, no. 5, pp. 350-353, 1977.


The Author: Norman Walsh, a Staff Engineer in Sun's XML Technology Center, is chair of the OASIS DocBook Technical Committee and serves on a number of W3C Working Groups, including the XML Core, XSL, and XML Schema WGs.


[1] My thanks to Randal L. Schwartz for bringing this module to my attention.