Table of Contents:
- General overview
- The definition
- Using catalogs
- Some examples
- How to tune catalog usage
- How to debug catalog processing
- How to create and maintain catalogs
- The implementor corner: quick review of the API
- Other resources
What is a catalog? Basically it's a lookup mechanism used when an entity (a
file or a remote resource) references another entity. The catalog lookup is
inserted between the moment the reference is recognized by the software (XML
parser, stylesheet processing, or even images referenced for inclusion in a
rendering) and the time when loading that resource actually starts.

It is basically used for 3 things:
- mapping from "logical" names, the public identifiers, to a more concrete
name usable for download (a URI). For example it can associate the
logical name
"-//OASIS//DTD DocBook XML V4.1.2//EN"
of the DocBook 4.1.2 XML DTD with the actual URL where it can
be downloaded
http://www.oasis-open.org/docbook/xml/4.1.2/docbookx.dtd
- remapping from a given URL to another one, like an HTTP
indirection saying that
"http://www.oasis-open.org/committes/tr.xsl"
should really be looked up at
"http://www.oasis-open.org/committes/entity/stylesheets/base/tr.xsl"
- providing a local cache mechanism that allows loading the
entities associated with public identifiers or remote resources. This is a
really important feature for any significant deployment of XML or SGML
since it avoids the hazards and delays associated with fetching
remote resources.
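The first two uses in the list above boil down to an exact-match table lookup. Here is a minimal sketch in C of that idea, reusing the two examples from the list. The names toy_catalog and toy_catalog_lookup() are made up for this illustration, and this is not how libxml2 stores or searches catalogs internally.

```c
#include <stddef.h>
#include <string.h>

/* One toy catalog entry: an identifier and the location it maps to. */
struct catalog_entry {
    const char *id;       /* public identifier or URL, matched exactly */
    const char *location; /* where the resource should really be loaded from */
};

/* Hypothetical table filled with the two examples from the list above. */
const struct catalog_entry toy_catalog[] = {
    { "-//OASIS//DTD DocBook XML V4.1.2//EN",
      "http://www.oasis-open.org/docbook/xml/4.1.2/docbookx.dtd" },
    { "http://www.oasis-open.org/committes/tr.xsl",
      "http://www.oasis-open.org/committes/entity/stylesheets/base/tr.xsl" },
};

/*
 * Return the mapped location, or NULL when the table has no entry for id.
 * On NULL the caller would simply load the resource under its original name.
 */
const char *toy_catalog_lookup(const char *id)
{
    size_t i;

    if (id == NULL)
        return NULL;
    for (i = 0; i < sizeof(toy_catalog) / sizeof(toy_catalog[0]); i++) {
        if (strcmp(toy_catalog[i].id, id) == 0)
            return toy_catalog[i].location;
    }
    return NULL;
}
```

The important property is the fallback: a lookup that finds nothing changes nothing, so a document keeps working on a machine where no catalog is installed.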
Libxml, as of 2.4.3, implements 2 kinds of catalogs:
- the older SGML catalogs: the official spec is SGML Open Technical
Resolution TR9401:1997, but it is better understood by reading the SP Catalog
page from James Clark. This is relatively old and not the preferred
mode of operation of libxml.
- XML Catalogs are
far more flexible, more recent, use an XML syntax and should scale much
better. This is the default option of libxml.
In a normal environment libxml2 will by default check for the presence of
a catalog in /etc/xml/catalog, and assuming it has been correctly
populated, the processing is completely transparent to the document user. To
take a concrete example, suppose you are authoring a DocBook document that
starts with the following DOCTYPE definition:

<?xml version='1.0'?>
<!DOCTYPE book PUBLIC "-//Norman Walsh//DTD DocBk XML V3.1.4//EN"
"http://nwalsh.com/docbook/xml/3.1.4/db3xml.dtd">

When validating the document with libxml, the catalog will be automatically
consulted to look up the public identifier "-//Norman Walsh//DTD DocBk XML
V3.1.4//EN" and the system
identifier "http://nwalsh.com/docbook/xml/3.1.4/db3xml.dtd", and if these
entities have been installed on your system and the catalogs actually point to
them, libxml will fetch them from the local disk.

Note: Really don't use this DOCTYPE example; it's a really old version, but
it is fine as an example.

Libxml2 will check the catalog each time it is requested to load
an entity; this includes DTDs, external parsed entities, stylesheets, etc.
If your system is correctly configured, all the authoring and
processing phases should use only local files, while your document stays
portable because it uses the canonical public and system IDs, referencing the
remote document.

Here are a couple of fragments from XML Catalogs used in libxml2
early regression tests in test/catalogs:

<?xml version="1.0"?>
<!DOCTYPE catalog PUBLIC
"-//OASIS//DTD Entity Resolution XML Catalog V1.0//EN"
"http://www.oasis-open.org/committees/entity/release/1.0/catalog.dtd">
<catalog xmlns="urn:oasis:names:tc:entity:xmlns:xml:catalog">
<public publicId="-//OASIS//DTD DocBook XML V4.1.2//EN"
uri="http://www.oasis-open.org/docbook/xml/4.1.2/docbookx.dtd"/>
...

This is the beginning of a catalog for DocBook 4.1.2. XML Catalogs
are written in XML, and there is a specific namespace for catalog
elements: "urn:oasis:names:tc:entity:xmlns:xml:catalog". The first entry in
this catalog is a public mapping: it associates a
Public Identifier with a URI.

...
<rewriteSystem systemIdStartString="http://www.oasis-open.org/docbook/"
rewritePrefix="file:///usr/share/xml/docbook/"/>
...

A rewriteSystem is a very powerful instruction: it says
that any URI starting with a given prefix should be looked up at another
URI constructed by replacing the prefix with a new one. In effect this acts
like a cache system for a full area of the Web. In practice it is extremely
useful with a file prefix if you have installed a copy of those resources on
your local system.

...
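The prefix replacement done by a rewriteSystem entry can be sketched in a few lines of C. This is only an illustration of the idea with a made-up helper, toy_rewrite_system(), handling a single entry. A real resolver following the XML Catalog specification also has to pick the longest matching prefix when several rewriteSystem entries apply, which this sketch does not attempt.

```c
#include <stdlib.h>
#include <string.h>

/*
 * If system_id starts with start_string, return a newly allocated string
 * in which that prefix is replaced by rewrite_prefix. Returns NULL when
 * the prefix does not match or when allocation fails. The caller releases
 * the result with free().
 */
char *toy_rewrite_system(const char *system_id,
                         const char *start_string,
                         const char *rewrite_prefix)
{
    size_t start_len, prefix_len, rest_len;
    char *result;

    if (system_id == NULL || start_string == NULL || rewrite_prefix == NULL)
        return NULL;

    start_len = strlen(start_string);
    if (strncmp(system_id, start_string, start_len) != 0)
        return NULL; /* this identifier is outside the rewritten area */

    prefix_len = strlen(rewrite_prefix);
    rest_len = strlen(system_id + start_len);
    result = malloc(prefix_len + rest_len + 1);
    if (result == NULL)
        return NULL;

    /* New prefix first, then the untouched tail including its final '\0'. */
    memcpy(result, rewrite_prefix, prefix_len);
    memcpy(result + prefix_len, system_id + start_len, rest_len + 1);
    return result;
}
```

With the values of the rewriteSystem entry shown earlier, the system identifier http://www.oasis-open.org/docbook/xml/4.1.2/docbookx.dtd becomes file:///usr/share/xml/docbook/xml/4.1.2/docbookx.dtd, while an identifier outside http://www.oasis-open.org/docbook/ is left for other catalog entries to handle.
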
<delegatePublic publicIdStartString="-//OASIS//DTD XML Catalog //"
catalog="file:///usr/share/xml/docbook.xml"/>
<delegatePublic publicIdStartString="-//OASIS//ENTITIES DocBook XML"
catalog="file:///usr/share/xml/docbook.xml"/>
<delegatePublic publicIdStartString="-//OASIS//DTD DocBook XML"
catalog="file:///usr/share/xml/docbook.xml"/>
<delegateSystem systemIdStartString="http://www.oasis-open.org/docbook/"
catalog="file:///usr/share/xml/docbook.xml"/>
<delegateURI uriStartString="http://www.oasis-open.org/docbook/"
catalog="file:///usr/share/xml/docbook.xml"/>
...

Delegation is the core feature which allows building a tree of
catalogs, easier to maintain than a single catalog. Based on Public
Identifier, System Identifier or URI prefixes, it instructs the catalog
software to look up entries in another resource. This feature allows building
hierarchies of catalogs. The set of entries presented should be sufficient to
redirect the resolution of all DocBook references to the specific catalog
in /usr/share/xml/docbook.xml; this one in turn could delegate
all references for DocBook 4.2.1 to a specific catalog installed at the same
time as the DocBook resources on the local machine.

The user can change the default catalog behaviour by redirecting queries to
their own set of catalogs. This can be done by setting
the XML_CATALOG_FILES environment variable to a list of catalogs;
an empty one should deactivate loading of the default
/etc/xml/catalog catalog.

Setting the XML_DEBUG_CATALOG environment variable will make
libxml2 output debugging information for each catalog operation,
for example:

orchis:~/XML -> xmllint --memory --noout test/ent2
warning: failed to load external entity "title.xml"
orchis:~/XML -> export XML_DEBUG_CATALOG=
orchis:~/XML -> xmllint --memory --noout test/ent2
Failed to parse catalog /etc/xml/catalog
Failed to parse catalog /etc/xml/catalog
warning: failed to load external entity "title.xml"
Catalogs cleanup
orchis:~/XML ->

The test/ent2 file references an entity; running the parser from memory
makes the base URI unavailable and the "title.xml" entity cannot be
loaded. Setting the debug environment variable makes it possible to detect
that an attempt is made to load /etc/xml/catalog, but since it's not
present the resolution fails.

But the most advanced way to debug XML catalog processing is to use
the xmlcatalog command shipped with libxml2; it allows you to
load catalogs and make resolution queries to see what is going on. This is
also used for the regression tests:

orchis:~/XML -> ./xmlcatalog test/catalogs/docbook.xml \
"-//OASIS//DTD DocBook XML V4.1.2//EN"
http://www.oasis-open.org/docbook/xml/4.1.2/docbookx.dtd
orchis:~/XML ->

For debugging what is going on, adding one -v flag increases the
verbosity level to indicate the processing done (adding a second flag also
indicates what elements are recognized at parsing):

orchis:~/XML -> ./xmlcatalog -v test/catalogs/docbook.xml \
"-//OASIS//DTD DocBook XML V4.1.2//EN"
Parsing catalog test/catalogs/docbook.xml's content
Found public match -//OASIS//DTD DocBook XML V4.1.2//EN
http://www.oasis-open.org/docbook/xml/4.1.2/docbookx.dtd
Catalogs cleanup
orchis:~/XML ->

A shell interface is also available to debug and process multiple
queries (and for regression tests):

orchis:~/XML -> ./xmlcatalog -shell test/catalogs/docbook.xml \
"-//OASIS//DTD DocBook XML V4.1.2//EN"
> help
Commands available:
public PublicID: make a PUBLIC identifier lookup
system SystemID: make a SYSTEM identifier lookup
resolve PublicID SystemID: do a full resolver lookup
add 'type' 'orig' 'replace' : add an entry
del 'values' : remove values
dump: print the current catalog state
debug: increase the verbosity level
quiet: decrease the verbosity level
exit: quit the shell
> public "-//OASIS//DTD DocBook XML V4.1.2//EN"
http://www.oasis-open.org/docbook/xml/4.1.2/docbookx.dtd
> quit
orchis:~/XML ->

This should be sufficient for most debugging purposes; it was
actually used heavily to debug the XML Catalog implementation itself.

Basically XML Catalogs are XML files; you can either use XML tools
to manage them or use xmlcatalog for this. The basic step
is to create a catalog; the --create option provides this facility:

orchis:~/XML -> ./xmlcatalog --create tst.xml
<?xml version="1.0"?>
<!DOCTYPE catalog PUBLIC "-//OASIS//DTD Entity Resolution XML Catalog V1.0//EN"
"http://www.oasis-open.org/committees/entity/release/1.0/catalog.dtd">
<catalog xmlns="urn:oasis:names:tc:entity:xmlns:xml:catalog"/>
orchis:~/XML ->

By default xmlcatalog does not overwrite the original catalog and saves
the result on the standard output; this can be overridden using the
--noout option. The --add command allows adding entries to
the catalog:

orchis:~/XML -> ./xmlcatalog --noout --create --add "public" \
"-//OASIS//DTD DocBook XML V4.1.2//EN" \
http://www.oasis-open.org/docbook/xml/4.1.2/docbookx.dtd tst.xml
orchis:~/XML -> cat tst.xml
<?xml version="1.0"?>
<!DOCTYPE catalog PUBLIC "-//OASIS//DTD Entity Resolution XML Catalog V1.0//EN"
"http://www.oasis-open.org/committees/entity/release/1.0/catalog.dtd">
<catalog xmlns="urn:oasis:names:tc:entity:xmlns:xml:catalog">
<public publicId="-//OASIS//DTD DocBook XML V4.1.2//EN"
uri="http://www.oasis-open.org/docbook/xml/4.1.2/docbookx.dtd"/>
</catalog>
orchis:~/XML ->

The --add option always takes 3 parameters even if some
of the XML Catalog constructs (like nextCatalog) have only a
single argument; just pass an empty string as the third one, it will be
ignored.

Similarly the --del option removes matching entries from
the catalog:

orchis:~/XML -> ./xmlcatalog --del \
"http://www.oasis-open.org/docbook/xml/4.1.2/docbookx.dtd" tst.xml
<?xml version="1.0"?>
<!DOCTYPE catalog PUBLIC "-//OASIS//DTD Entity Resolution XML Catalog V1.0//EN"
"http://www.oasis-open.org/committees/entity/release/1.0/catalog.dtd">
<catalog xmlns="urn:oasis:names:tc:entity:xmlns:xml:catalog"/>
orchis:~/XML ->

The catalog is now empty. Note that the matching of
--del is exact and would have worked in a similar fashion with the
Public ID string.

This is rudimentary but should be sufficient to manage a not too
complex catalog tree of resources.

First, and like for every other module of libxml, there is an automatically
generated API page for catalog
support.

The header for the catalog interfaces should be included as:

#include <libxml/catalog.h>

The API is deliberately kept very simple. First, it is not obvious
that applications really need access to it since it is the default behaviour
of libxml2 (Note: it is possible to completely override the libxml2 default
catalog by using xmlSetExternalEntityLoader to plug in an
application-specific resolver).

Basically libxml2 supports 2 catalog lists:
- the default one, global and shared by the whole application
- a per-document catalog; this one is built if the document uses
the
oasis-xml-catalog PIs to specify its own catalog list. It
is associated with the parser context and destroyed when the parsing
context is destroyed.
When both lists exist, the per-document one is used first.

Initialization routines:
xmlInitializeCatalog(), xmlLoadCatalog() and xmlLoadCatalogs() should
be used at startup to initialize the catalog. If the catalog should
be initialized with specific values, xmlLoadCatalog() or
xmlLoadCatalogs() should be called before xmlInitializeCatalog(), which would
otherwise do a default initialization first.

The xmlCatalogAddLocal() call is used by the parser to grow the
document's own catalog list if needed.

Preferences setup:
The XML Catalog spec requires the possibility to select default preferences
between public and system delegation; xmlCatalogSetDefaultPrefer() allows
this. xmlCatalogSetDefaults() and xmlCatalogGetDefaults() control whether
XML Catalogs resolution should be forbidden, allowed for the global catalog,
for the document catalog or both; the default is to allow both.

And of course xmlCatalogSetDebug() allows generating debug
messages (through the xmlGenericError() mechanism).

Querying routines:
xmlCatalogResolve(), xmlCatalogResolveSystem(),
xmlCatalogResolvePublic() and xmlCatalogResolveURI() are relatively explicit
if you read the XML Catalog specification; they correspond to the section 7
algorithms. They should also work if you have loaded an SGML catalog, with
simplified semantics.

xmlCatalogLocalResolve() and xmlCatalogLocalResolveURI() are the same
but operate on the document catalog list.

Cleanup and Miscellaneous:
xmlCatalogCleanup() frees the global catalog; xmlCatalogFreeLocal()
is the per-document equivalent.

xmlCatalogAdd() and xmlCatalogRemove() are used to dynamically modify
the first catalog in the global list, and xmlCatalogDump() dumps
a catalog's state. Those routines are primarily designed for xmlcatalog; I'm
not sure that exposing more complex interfaces (like navigation ones) would
be really useful.

The xmlParseCatalogFile() function is used to load XML Catalog
files; it's similar to xmlParseFile() except that it bypasses all catalog
lookups. It's provided because this functionality may be useful for client
tools.

Threaded environments:
Since the catalog tree is built progressively, some care has been taken
to try to avoid troubles in multithreaded environments. The code is now
thread safe assuming that the libxml2 library has been compiled with
threads support.

The XML Catalog specification is relatively recent so there isn't
much literature to point at.

If you have suggestions for corrections or additions, simply contact me: Daniel Veillard