The latest version of this document should be located here. The latest version of repat should be downloadable from here. repat is now using the beta expat 1.95.1. The last version of repat using the stable expat 1.2 can be downloaded from here.
repat is a callback-based RDF parser built on James Clark's expat. It's written in C and should be usable in most environments. It can be used to extract statements from any "valid" RDF Model & Syntax Specification conforming document. I'm using the term "valid" rather loosely. See the sections below for the known "enhancements" that were made to the syntax.
repat was originally based on David Megginson's RDFFilter but has changed significantly since its inception. Any bugs or deficiencies were undoubtedly introduced by myself.
The file rdfparse.h defines the API for repat. Please refer to it for precise details. The following is a very general overview.
Once an RDF parser is created (via RDF_ParserCreate), various handlers can be registered to receive callbacks from the parser when certain events take place. Applications that want to parse XML documents that contain embedded RDF or RDF documents that contain properties with parseType set to "Literal" can also register the standard expat callbacks to receive those "raw" XML events.
Callbacks to the host application informing it of an assertion require a little more information than the M&S would lead you to believe. In reality, the basic triple model has been polluted with distributed statements, prefixes, literal values, language tagging, and other little "quirks".
A host application's statement handler will receive most of the attention. It's required to accept the following parameters:
void * user_data,
RDF_SubjectType subject_type,
const XML_Char * subject,
const XML_Char * predicate,
int ordinal,
RDF_ObjectType object_type,
const XML_Char * object,
const XML_Char * xml_lang
Arguments of interest include subject_type, ordinal, object_type, and xml_lang.
RDF_SubjectType can be one of RDF_SUBJECT_TYPE_URI, RDF_SUBJECT_TYPE_DISTRIBUTED, RDF_SUBJECT_TYPE_PREFIX, or RDF_SUBJECT_TYPE_ANONYMOUS.
Because repat is a stream-based parser, distributed statements (aboutEach and aboutEachPrefix) need to be marked as such. It's possible that the bag containing the intended subjects has already been parsed by the time a distributed statement is encountered. The host application (or any layer above repat) is responsible for actually generating the final statements (if necessary).
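One way a layer above repat might generate those final statements is to buffer each distributed statement and expand it once the container memberships are known. The following is only a sketch and not part of repat: the membership, deferred, and triple structs and the expand_distributed function are hypothetical, and it assumes that the subject reported with RDF_SUBJECT_TYPE_DISTRIBUTED is the URI of the container named by aboutEach, which should be checked against rdfparse.h.

```c
#include <stddef.h>
#include <string.h>

/* Hypothetical host-side records; none of these types come from repat. */
typedef struct { const char *container; const char *member; } membership;
typedef struct { const char *container; const char *predicate; const char *object; } deferred;
typedef struct { const char *subject; const char *predicate; const char *object; } triple;

/* Writes one triple per member of d->container into out (at most cap of
   them) and returns how many members matched, so that a return value
   larger than cap tells the caller the output was truncated. */
size_t expand_distributed(const deferred *d,
                          const membership *members, size_t n,
                          triple *out, size_t cap)
{
    size_t i, matched = 0;

    for (i = 0; i < n; i++) {
        if (strcmp(members[i].container, d->container) != 0)
            continue;                       /* belongs to another container */
        if (matched < cap) {
            out[matched].subject = members[i].member;
            out[matched].predicate = d->predicate;
            out[matched].object = d->object;
        }
        matched++;
    }
    return matched;
}
```

Because the parser is streaming, an application using something like this would most likely run the expansion after parsing has finished, when every membership statement has been seen.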
RDF_SUBJECT_TYPE_ANONYMOUS is used to indicate to the host application that the subject URI is not a "real" URI but something that was generated by repat. The generated URI is prefixed with the document base which is set by calling RDF_SetBase.
The ordinal serves as a convenience when the predicate is one of the rdf:_n properties. Applications can check to see if the ordinal parameter is greater than 0 rather than parsing the predicate to determine if it's a member property.
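To illustrate what the ordinal parameter saves, here is a sketch of the string parsing an application would otherwise have to do itself. member_ordinal_from_predicate is a hypothetical helper, not something repat provides, and it assumes the predicate arrives as a full URI in the RDF namespace.

```c
#include <limits.h>
#include <string.h>

#define RDF_NS "http://www.w3.org/1999/02/22-rdf-syntax-ns#"

/* Returns n when predicate is the rdf:_n member property, 0 otherwise.
   Hypothetical helper: with repat the same answer is already in ordinal. */
int member_ordinal_from_predicate(const char *predicate)
{
    size_t ns_len = strlen(RDF_NS);
    const char *p;
    int n = 0;

    if (strncmp(predicate, RDF_NS, ns_len) != 0 || predicate[ns_len] != '_')
        return 0;

    p = predicate + ns_len + 1;
    if (*p < '1' || *p > '9')               /* rejects "_", "_0", "_01" */
        return 0;

    for (; *p != '\0'; p++) {
        int digit;
        if (*p < '0' || *p > '9')
            return 0;                       /* trailing junk such as "_1x" */
        digit = *p - '0';
        if (n > (INT_MAX - digit) / 10)
            return 0;                       /* would overflow int */
        n = n * 10 + digit;
    }
    return n;
}
```

Inside a statement handler none of this is needed: testing ordinal > 0 is enough to recognize a member property.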
RDF_ObjectType can be one of RDF_OBJECT_TYPE_RESOURCE, RDF_OBJECT_TYPE_LITERAL, or RDF_OBJECT_TYPE_XML.
Note that when a statement is asserted with an object_type of XML, the object parameter will be NULL. The application will then receive a StartParseTypeLiteral event, followed by all of the StartElement, EndElement, and CharacterData events that make up that XML value, and finally an EndParseTypeLiteral event.
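An application that wants the XML value as a string could collect that run of events into a buffer. The sketch below assumes XML_Char is plain char and that the two parseType literal callbacks receive only the user data pointer; the literal_sink type and all of the function names are hypothetical, and the real callback signatures are in rdfparse.h. The element and character data handlers follow the shapes of expat's standard handlers.

```c
#include <string.h>

/* Hypothetical sink for one parseType="Literal" value. */
typedef struct {
    int in_literal;      /* nonzero between the start and end literal events */
    size_t len;
    char buf[1024];
} literal_sink;

/* Appends n bytes, silently truncating when the buffer is full. */
static void sink_append(literal_sink *s, const char *text, size_t n)
{
    size_t room = sizeof s->buf - 1 - s->len;

    if (!s->in_literal)
        return;                 /* events outside a literal are not ours */
    if (n > room)
        n = room;
    memcpy(s->buf + s->len, text, n);
    s->len += n;
    s->buf[s->len] = '\0';
}

void on_start_literal(void *user_data)
{
    literal_sink *s = (literal_sink *)user_data;
    s->in_literal = 1;
    s->len = 0;
    s->buf[0] = '\0';
}

void on_end_literal(void *user_data)
{
    ((literal_sink *)user_data)->in_literal = 0;
}

/* atts is a NULL-terminated array of name/value pairs, as in expat.
   No escaping of markup characters is attempted in this sketch. */
void on_start_element(void *user_data, const char *name, const char **atts)
{
    literal_sink *s = (literal_sink *)user_data;
    size_t i;

    sink_append(s, "<", 1);
    sink_append(s, name, strlen(name));
    for (i = 0; atts != NULL && atts[i] != NULL; i += 2) {
        sink_append(s, " ", 1);
        sink_append(s, atts[i], strlen(atts[i]));
        sink_append(s, "=\"", 2);
        sink_append(s, atts[i + 1], strlen(atts[i + 1]));
        sink_append(s, "\"", 1);
    }
    sink_append(s, ">", 1);
}

void on_end_element(void *user_data, const char *name)
{
    literal_sink *s = (literal_sink *)user_data;
    sink_append(s, "</", 2);
    sink_append(s, name, strlen(name));
    sink_append(s, ">", 1);
}

/* text is not NUL-terminated; len gives its length, as in expat. */
void on_character_data(void *user_data, const char *text, int len)
{
    sink_append((literal_sink *)user_data, text, (size_t)len);
}
```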
xml:lang attributes add no triples to the model. Therefore, all statements with literal values include an optional language tag. Applications are free to use or ignore this as they choose.
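Putting the parameters together, a statement handler might look something like the sketch below. The typedefs are stand-ins so the example compiles on its own: the enumerator names are the ones listed above, but their numeric values, the app_state struct, and the output format are all made up for illustration. The real declarations are in rdfparse.h. Whether a missing language tag is reported as NULL or as an empty string is not stated here, so the sketch accepts both.

```c
#include <stdio.h>

/* Stand-ins for declarations that really live in expat and rdfparse.h. */
typedef char XML_Char;
typedef enum {
    RDF_SUBJECT_TYPE_URI,
    RDF_SUBJECT_TYPE_DISTRIBUTED,
    RDF_SUBJECT_TYPE_PREFIX,
    RDF_SUBJECT_TYPE_ANONYMOUS
} RDF_SubjectType;
typedef enum {
    RDF_OBJECT_TYPE_RESOURCE,
    RDF_OBJECT_TYPE_LITERAL,
    RDF_OBJECT_TYPE_XML
} RDF_ObjectType;

/* Hypothetical application state handed to the parser as user_data. */
typedef struct {
    char line[512];      /* last statement, formatted for display */
    int members_seen;    /* how many rdf:_n statements have arrived */
} app_state;

void statement_handler(void *user_data,
                       RDF_SubjectType subject_type, const XML_Char *subject,
                       const XML_Char *predicate, int ordinal,
                       RDF_ObjectType object_type, const XML_Char *object,
                       const XML_Char *xml_lang)
{
    app_state *st = (app_state *)user_data;
    const char *mark = "";

    switch (subject_type) {
    case RDF_SUBJECT_TYPE_DISTRIBUTED: mark = "each:";   break;
    case RDF_SUBJECT_TYPE_PREFIX:      mark = "prefix:"; break;
    case RDF_SUBJECT_TYPE_ANONYMOUS:   mark = "anon:";   break;
    case RDF_SUBJECT_TYPE_URI:         break;
    }

    if (ordinal > 0)
        st->members_seen++;      /* member property, no string parsing */

    switch (object_type) {
    case RDF_OBJECT_TYPE_RESOURCE:
        snprintf(st->line, sizeof st->line, "%s<%s> <%s> <%s>",
                 mark, subject, predicate, object);
        break;
    case RDF_OBJECT_TYPE_LITERAL:
        if (xml_lang != NULL && xml_lang[0] != '\0')
            snprintf(st->line, sizeof st->line, "%s<%s> <%s> \"%s\"@%s",
                     mark, subject, predicate, object, xml_lang);
        else
            snprintf(st->line, sizeof st->line, "%s<%s> <%s> \"%s\"",
                     mark, subject, predicate, object);
        break;
    case RDF_OBJECT_TYPE_XML:
        /* object is NULL here; the value arrives as later events */
        snprintf(st->line, sizeof st->line, "%s<%s> <%s> (XML literal follows)",
                 mark, subject, predicate);
        break;
    }
}
```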
This version of repat is merely a preview release. It needs to be cleaned up just a tad before it can be incorporated into other applications. I'm hoping that the community could give any feedback possible while I prepare it for a final release. Is the API usable? Does it correctly parse valid RDF documents? Does it build on all platforms? Suggestions, bug reports, and patches will all be greatly appreciated.
I'm also looking for comments on my proposed changes to the syntax. See below for my list of issues.
repat is available under either of the LGPL or MPL open source licenses.
repat 2000-11-10 requires the outermost rdf:RDF element. Production [6.1] of the RDF Model & Syntax 1.0 indicates that this is optional. I'm not entirely certain whether this is intentional or not since there aren't any examples in the M&S that actually omit the rdf:RDF element.
Requiring this element makes it easier to detect when to start interpreting XML as RDF when the RDF is embedded in some arbitrary XML document.
For example:
<?xml version="1.0"?>
<html xmlns="http://www.w3.org/1999/xhtml">
  <head>
    <title>The END</title>
    <rdf:RDF
      xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#"
      xmlns:dc="http://purl.org/dc/elements/1.0/"
    >
      <rdf:Description
        about="http://injektilo.org/"
        dc:author="jason@injektilo.org"
        dc:title="ThE EnD"
        dc:subject="METADATA"
        dc:date="2000-10-08"
      />
    </rdf:RDF>
  </head>
  <body>
    <p>this is not a breast.</p>
  </body>
</html>
Given the above document, repat will pass the html, head, and title elements through to the host application before attempting to interpret the XML as RDF. It would be possible to add an option to the parser to immediately start interpreting the XML as RDF without requiring an rdf:RDF element but I'm not sure how useful this would be.
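The pass-through behavior described above can be pictured as a small depth counter that switches on when rdf:RDF opens and off when it closes. This is a sketch of the idea rather than repat's actual code: embed_state, embed_on_start, and embed_on_end are hypothetical names, element names are assumed to arrive in expat's namespace-expanded form with '^' as the separator, and treating elements after the closing rdf:RDF tag as pass-through again is an assumption.

```c
#include <string.h>

/* Expanded name of rdf:RDF, assuming '^' as the namespace separator. */
#define RDF_ROOT "http://www.w3.org/1999/02/22-rdf-syntax-ns#^RDF"

typedef struct {
    int rdf_depth;        /* 0 while outside rdf:RDF, 1 at rdf:RDF itself */
    int passed_through;   /* elements handed to the host as raw XML */
    int interpreted;      /* elements inside rdf:RDF, treated as RDF */
} embed_state;

void embed_on_start(embed_state *st, const char *name)
{
    if (st->rdf_depth > 0) {
        st->rdf_depth++;
        st->interpreted++;             /* would go to the RDF machinery */
    } else if (strcmp(name, RDF_ROOT) == 0) {
        st->rdf_depth = 1;             /* start interpreting XML as RDF */
    } else {
        st->passed_through++;          /* would go to the host's handlers */
    }
}

void embed_on_end(embed_state *st)
{
    if (st->rdf_depth > 0)
        st->rdf_depth--;               /* reaches 0 when rdf:RDF closes */
}
```

Fed the document above, the html, head, title, body, and p elements would be counted as pass-through and only rdf:Description as RDF.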
Productions [6.2] and [6.3] state that RDF subjects are either rdf:Description elements, containers, or typed nodes. Furthermore, productions [6.25] through [6.31] severely limit the expressiveness possible with containers and their members.
Even though the grammar restricts the containers to rdf:Bag, Seq, or Alt, a note in Section 3.2 states that RDF Schema may define a mechanism to declare subclasses of these containers and that production [18] (as listed in Section 3.2) should be extended to include these additional subclasses.
Given that a low-level RDF parser won't know the subclasses of rdfs:Container (nor is it even possible for it to know), should the grammar be restricting the rdf:_n and rdf:li properties to containers? Note that it doesn't necessarily say that you can't have an rdf:_n or rdf:li on a description, but both of the RDF parsers that I've been studying (SiRPAC and Megginson's RDFFilter) either ignore them or throw an error.
Since it's possible for me to declare foo:Bar as a subClassOf rdfs:Container, all RDF parsers will have to somehow know that foo:Bar is one of the permissible container types. Or, they will have to allow rdf:_n and rdf:li properties on all resources. Whether or not that passes validation at a higher level is up to that higher level and not the lower level parser. This is the approach that repat takes.
This doesn't really break the syntax but my next example does. In Section 2.3.2 of RDF Schema, it states that "A class may be a subclass of more than one class." The schema serialization in Appendix A shows that rdf:Bag, Seq, and Alt each contain only one subClassOf property. So everything's cool. But what if I decide that my foo:Bar class is also a subClassOf baz:Quux which is NOT a subClassOf rdfs:Container?
Now, I would expect that since instances of foo:Bar are not just containers but also baz:Quux's, that I would be able to do something like this:
<foo:Bar about="..." foo:predicate="...">
  <rdf:li resource="..."/>
  ...
</foo:Bar>
Unfortunately, production [18] in Section 3.2 of the M&S and its accompanying note say that since foo:Bar is a subClassOf rdfs:Container, it really needs to follow the rules for rdf:Bag, Seq, and Alt, which means that it can't have an about attribute nor can it have any property attributes other than rdf:_n.
Additionally, RDF M&S Section 3.2 states that "Container resources may have other properties in addition to the membership properties and the type property." But the grammar states that the only property elements that a container can have are members (as described in production [6.28]). Yes, I realize that you could always include an rdf:Description element elsewhere in your document with an about attribute equal to the container's ID but that's hardly intuitive.
I imagine that when the Formal Grammar was defined, the possibility that classes might be subclasses of both containers and non-containers wasn't considered. I can't find anything in RDF Schema that precludes this. Should it be restricted (I'd say no) or should the grammar be loosened just a tad so it's not as restrictive with regard to containers?
repat doesn't include any "support" for rdf:Bag, Seq, or Alt. They're treated exactly like other typed nodes. It's possible to give them an rdf:about attribute, property attributes other than rdf:_n, and property elements other than rdf:li. This means that rdf:_n and rdf:li can be used as properties for any subject and rdf:li is not restricted to just the rdf:resource or rdf:parseType attributes. They're treated just like other property elements with the exception that they're converted into a sequence of magic ordinal properties.
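The conversion of rdf:li into ordinal properties amounts to keeping a counter for each subject. A minimal sketch, assuming one counter per open node element: li_counter and next_member_predicate are hypothetical names and this is not repat's internal code. How explicitly written rdf:_n properties interact with the counter is not specified here, so the sketch leaves that out.

```c
#include <stdio.h>

#define RDF_NS "http://www.w3.org/1999/02/22-rdf-syntax-ns#"

/* Hypothetical per-subject counter: one instance per open node element. */
typedef struct {
    int last_li;    /* ordinal given to the most recent rdf:li, 0 if none */
} li_counter;

/* Writes the rdf:_n predicate URI for the next rdf:li child into buf and
   returns its ordinal. Returns 0, leaving the counter untouched, when buf
   is too small to hold the whole URI. */
int next_member_predicate(li_counter *c, char *buf, size_t cap)
{
    int ordinal = c->last_li + 1;
    int written = snprintf(buf, cap, "%s_%d", RDF_NS, ordinal);

    if (written < 0 || (size_t)written >= cap)
        return 0;
    c->last_li = ordinal;
    return ordinal;
}
```

The returned value is what a statement handler would then see in its ordinal parameter.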
According to the formal grammar in Section 6, the attributes "ID", "about", and friends are NOT qualified with the RDF namespace.
Section 6 also mentions that: "It is recommended that property names always be qualified with a namespace prefix to unambiguously connect the property definition with the corresponding schema."
Does this mean that property attributes without namespace prefixes should NOT generate a statement? (With the exception of "type" via production [6.11]).
This seems to be the most logical interpretation of the spec that I can come up with, but it's certainly not what most people would reasonably assume, including whoever wrote the fourth example from Section 2.2.2:
<rdf:RDF>
  <rdf:Description about="http://www.w3.org/Home/Lassila">
    <s:Creator rdf:resource="http://www.w3.org/staffId/85740"/>
  </rdf:Description>
  <rdf:Description about="http://www.w3.org/staffId/85740">
    <v:Name>Ora Lassila</v:Name>
    <v:Email>lassila@w3.org</v:Email>
  </rdf:Description>
</rdf:RDF>
Notice the "rdf" prefix on the "resource" attribute of the "s:Creator" property element. This is illegal according to production [6.18]. There are examples that both qualify and do not qualify the resource attribute throughout the spec.
In Sections 7.3, 7.4, and 7.5, "parseType" is qualified on all elements that are not in the RDF namespace. Productions [6.32] and [6.33] don't allow for this.
Section 7.5 contains an example that qualifies "about". See production [6.7].
Curiously, member attributes (according to production [6.31]) are the only attributes required to be qualified in the RDF namespace. Why are they singled out?
I can't find any examples in the M&S with a typed node resource that contains a property attribute from the same namespace as the element. Given the recommendation quoted above, I would assume that they would be qualified even though most humans would assume they would inherit their namespace from the element as it seems the writers of the M&S also assumed. Are the examples that explicitly qualify "about", "resource", and "parseType" simply in error?
The approach taken by repat is to require that all property attributes be qualified. RDF "keywords" like ID, about, etc, can be unqualified or qualified with the RDF namespace. This correctly parses all of the examples in the M&S.
Consider the last example in Section 2.2.2:
<rdf:RDF>
  <rdf:Description about="http://www.w3.org/Home/Lassila">
    <s:Creator>
      <s:Person about="http://www.w3.org/staffId/85740">
        <v:Name>Ora Lassila</v:Name>
        <v:Email>lassila@w3.org</v:Email>
      </s:Person>
    </s:Creator>
  </rdf:Description>
</rdf:RDF>
Ignoring the fact that there are no namespace declarations, the "about" property attribute on the "s:Person" typed node resource is lacking an "rdf" prefix. This is legal according to the grammar.
Should the above about attribute be interpreted as rdf:about or s:about? Currently, repat treats it like rdf:about. Unqualified attributes that are not RDF "keywords" are simply ignored. This goes against the quote in Section 6 of the M&S that states: "Specifically; each XML attribute A specified with a Description start tag other than the attributes ID, about, aboutEach, aboutEachPrefix, bagID, xml:lang, or any attribute starting with the characters xmlns results in the creation of a triple {p,r,v} where p is the expansion of the namespace-qualified attribute name of A..." Notice that it says "namespace-qualified attribute name of A."
Now I'm almost positive that the writers did not intend that this rule only apply to qualified attributes but that's how I'm interpreting it until further clarification is given. I could really go either way. I chose this approach simply because it didn't "break" any of the examples in the M&S.
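The attribute rules described above can be summarized in a few lines of code. This is a sketch of the rules as stated in this document, not an excerpt from repat: classify_attribute and attr_kind are hypothetical names, the keyword list contains only the attribute names mentioned in this document, and attribute names are assumed to arrive in expat's namespace-expanded form (URI, separator character, local name). xml:lang and the unqualified "type" attribute are left out.

```c
#include <string.h>

#define RDF_NS "http://www.w3.org/1999/02/22-rdf-syntax-ns#"

typedef enum { ATTR_KEYWORD, ATTR_PROPERTY, ATTR_IGNORED } attr_kind;

/* Keyword list assumed from the names used in this document. */
static int is_rdf_keyword(const char *local)
{
    static const char *const keywords[] = {
        "ID", "about", "aboutEach", "aboutEachPrefix",
        "bagID", "resource", "parseType"
    };
    size_t i;

    for (i = 0; i < sizeof keywords / sizeof keywords[0]; i++)
        if (strcmp(local, keywords[i]) == 0)
            return 1;
    return 0;
}

/* name is either a bare local name or "namespace-URI<sep>local-name". */
attr_kind classify_attribute(const char *name, char sep)
{
    const char *cut = strrchr(name, sep);
    size_t ns_len;

    if (cut == NULL)            /* unqualified: keyword or nothing at all */
        return is_rdf_keyword(name) ? ATTR_KEYWORD : ATTR_IGNORED;

    ns_len = (size_t)(cut - name);
    if (ns_len == strlen(RDF_NS) &&
        strncmp(name, RDF_NS, ns_len) == 0 &&
        is_rdf_keyword(cut + 1))
        return ATTR_KEYWORD;    /* rdf:about behaves like plain about */

    return ATTR_PROPERTY;       /* any other qualified attribute */
}
```

With '^' as the separator, "about" and the RDF-qualified form of about both classify as keywords, a Dublin Core title attribute classifies as a property, and a bare "title" is ignored.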
Thanks,
jason@injektilo.org.