Semantic Sling Part II - Turtle Soup

Monday, July 2, 2012

In the first post of the series I have since affectionately titled "Semantic Sling," I took a rather hypothetical look at serving Linked-Data out of the Sling server and whether Sling and JCR exposed any functionality which facilitated such service. This analysis took the form of a mapping between the tenets of Linked-Data and the capabilities inherent to Sling and JCR. Of all the tenets, number three was underserved by both directly, though I did indicate that functionality present in both could be built upon.

Realization of Tenet III required work along two lines. The first is to ensure data can be stored in a format conducive to later presentation as Linked Data. The second is the presentation itself, which for the purposes of this post I will limit to the RDFa and Turtle formats. To illustrate the concepts concretely, I have put a proof of concept project on GitHub for your viewing and using pleasure. Functionality in the proof of concept is, at this time, limited to the Turtle format for reasons described later.

The Storage of Data

Data presented in a Linked format traditionally takes on a [subject, predicate, object] triple structure. This structure conveniently resembles the JCR [resource, property, value] structure. A mapping between the two would provide the means by which the Linked-Data could be extracted. What we want to avoid, however, is the need to imbue this mapping with much in the way of intelligence. That is to say, while the [resource, property, value] patterns need to be mapped to [subject, predicate, object] triples, we want to avoid having to craft complex rewriting rules in order to translate property names into predicates, resources into subjects, or values into objects.

Storing the data initially with names and value types reflective of the representation we seek would help avoid such rewriting rules. For example, assuming we want to present the title of a resource using the Dublin Core Terms specification, instead of storing a property with a name such as "title":

Resource 
  Property name="title" value="A Test Resource"

the property could be named "dc:title"

Resource
  Property name="dc:title" value="A Test Resource"

where "dc:" represents the Dublin Core Terms namespace. In this way, no assumption would need to be made (or coded for) in order to expose the property using the desired predicate. As another example, instead of storing a numeric property value as a String, it could be stored as a Long. This would allow us to base the output object data type on the type of the node property's value. Both of these examples illustrate data storage practices which may be undertaken in order to ease the translation from JCR content data to Linked-Data.
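The idea that well-named properties need no rewriting rules can be sketched outside of JCR. The snippet below is a minimal illustration in Python, not the proof of concept's code: it assumes a node is represented as a plain dictionary of property names to values, and the function name node_to_triples and the sample path are made up for the example.

```python
def node_to_triples(path, properties):
    """Map [resource, property, value] entries to (subject, predicate, object) triples.

    No rewriting rules are applied: each property name is used as the predicate
    as-is, which only works because it was stored as a qualified name.
    """
    return [(path, name, properties[name]) for name in sorted(properties)]


# Hypothetical node content, stored with a qualified name and a numeric value.
node = {"dc:title": "A Test Resource", "number": 3}

for triple in node_to_triples("/content/mynode", node):
    print(triple)
```

Because the number is stored as a number rather than as a string, whatever consumes these triples can still tell the two kinds of object apart.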

The latter of these practices is trivial insofar as it can be achieved simply by selecting the proper JCR property value type for any given property. In order to achieve the former practice, however, we need to register the "dc" namespace with the JCR. The JCR specification itself places very few limitations on the names of properties. One important (and reasonable) constraint JCR does impose, however, is that the namespace associated with a qualified property name be registered in JCR's namespace registry. There are a number of namespaces which are registered by default, though only one of them (XML Schema) is immediately germane. Ideally all the namespaces which we plan on using in our property names (or resource names, or values for that matter, if your properties are typed as NAMEs) are registered up front, perhaps as part of the build process.

Conveniently, the Sling site provides a number of starter Maven archetypes, some of which, such as sling-initial-content-archetype, include the deployment of node types and namespaces defined in the JCR node type notation (CND). Using this mechanism, namespaces can easily be added to the registry at deployment time by adding lines of the form

<namespace-qualifier = 'namespace URI'>

to the nodetypes.cnd file which is included in the sling-initial-content-archetype (of course you can change what this file is called and where it is placed so long as you update the POM accordingly). For the Dublin Core Terms namespace used above, that line would read <dc = 'http://purl.org/dc/terms/'>.

Posting Data

With semantically useful namespaces registered, posting data that follows the storage practices discussed above becomes straightforward. To show this, I'll slightly modify the "Create Some Content" example provided in the Discovering Sling in 15 Minutes tutorial.

curl -u admin:admin -F"sling:resourceType=foo/bar" -F"dc:title=some title" http://localhost:8080/content/mynode

The modification which I have made is the change of "title" to "dc:title", which is allowed only after the dc namespace has been registered. With this one small change the example follows our data storage practices. Systems which understand the semantics of the Dublin Core Terms specification will now be able to glean much more information from the resource title than they could before, assuming they adhere to the same meaning of "dc:title" as that which Dublin Core documents.

Going one step further, by coupling posted properties with @TypeHint suffix properties we can store property values with an appropriate type. This is unnecessary if you are storing a String, but if you are storing, let's say, a number, the following command could be issued:

curl -u admin:admin -F"sling:resourceType=foo/bar" -F"dc:title=The Number 3" -F"number=3" -F"number@TypeHint=Long" http://localhost:8080/content/mynode

This would result in the creation of the node "mynode" which would contain a "number" property of type "Long".
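When posts are generated by a script rather than typed by hand, the @TypeHint fields can be derived from the values themselves. The following Python sketch shows one way to do that. The helper name with_type_hints and the mapping from Python types to the hint names Boolean and Double are assumptions made for illustration; only the @TypeHint suffix and the Long hint come from the command above.

```python
def with_type_hints(properties):
    """Return form fields for a Sling POST, adding @TypeHint for non-String values."""
    fields = []
    for name, value in properties.items():
        # bool must be tested before int, because bool is a subclass of int.
        if isinstance(value, bool):
            hint, text = "Boolean", str(value).lower()
        elif isinstance(value, int):
            hint, text = "Long", str(value)
        elif isinstance(value, float):
            hint, text = "Double", repr(value)
        else:
            hint, text = None, str(value)  # Strings need no hint
        fields.append((name, text))
        if hint is not None:
            fields.append((name + "@TypeHint", hint))
    return fields


fields = with_type_hints({"dc:title": "The Number 3", "number": 3})
# Print the fields as curl -F arguments, in the style of the command above.
print(" ".join('-F"{}={}"'.format(key, value) for key, value in fields))
```

The sketch only builds the form fields; sending them to the server is left to curl or to whichever HTTP client you prefer.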

Presenting Data

Linked-Data can be presented in a number of ways to an end user. The primary format that I'm interested in for the purposes of this POC is Turtle, though I will touch on RDFa conceptually. At a high level, data in the Turtle format manifests as a literal representation of Linked-Data triples, encoding [subject, predicate, object] with little decoration. I find that this makes it both easy to output and easy for a human to read (not that the latter has much to do with the ends of Linked-Data, but it does help with understanding your output).

The heart of the proof of concept project is a Servlet, org.apache.sling.servlets.semantic.servlet.impl.TtlServlet, which handles any GET request with a .ttl extension. Such handling is easily achieved by leveraging Sling's URL decomposition paradigm and can be defined via annotation using the Maven SCR plugin. The only complexities to be found, then, are within the coding of the servlet itself; Sling makes the rest easy. Also, so long as your node properties are appropriately typed as discussed above, it is conceptually simple to map those JCR property types to XML Schema types.
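To make the type mapping concrete, here is a small Python sketch of how typed properties might be written out as Turtle. It is not the TtlServlet code. The mapping table is my assumption of a reasonable pairing between JCR property type names and XML Schema datatypes, and the "ex" prefix, its URI, and the function names are hypothetical.

```python
# Assumed pairing of JCR property type names with XML Schema datatypes.
JCR_TO_XSD = {
    "String": "xsd:string",
    "Long": "xsd:long",
    "Double": "xsd:double",
    "Decimal": "xsd:decimal",
    "Boolean": "xsd:boolean",
    "Date": "xsd:dateTime",
    "URI": "xsd:anyURI",
}


def turtle_literal(value, jcr_type):
    """Render a value as a Turtle literal, typed by its JCR property type."""
    if isinstance(value, bool):
        text = "true" if value else "false"  # xsd:boolean lexical form
    else:
        text = str(value)
    # Escape the characters that may not appear raw in a quoted Turtle string.
    for raw, escaped in (("\\", "\\\\"), ('"', '\\"'), ("\n", "\\n"), ("\r", "\\r")):
        text = text.replace(raw, escaped)
    datatype = JCR_TO_XSD.get(jcr_type, "xsd:string")
    if datatype == "xsd:string":
        return '"{}"'.format(text)  # a plain literal is already a string
    return '"{}"^^{}'.format(text, datatype)


def to_turtle(path, properties, prefixes):
    """Serialize one resource; properties maps a qualified name to (value, jcr_type)."""
    lines = ["@prefix {}: <{}> .".format(p, prefixes[p]) for p in sorted(prefixes)]
    lines.append("")
    pairs = []
    for name in sorted(properties):
        value, jcr_type = properties[name]
        pairs.append("    {} {}".format(name, turtle_literal(value, jcr_type)))
    lines.append("<{}>\n{} .".format(path, " ;\n".join(pairs)))
    return "\n".join(lines)


print(to_turtle(
    "/content/mynode",
    {"dc:title": ("The Number 3", "String"), "ex:number": (3, "Long")},
    {
        "dc": "http://purl.org/dc/terms/",
        "ex": "http://example.org/ns#",  # hypothetical namespace
        "xsd": "http://www.w3.org/2001/XMLSchema#",
    },
))
```

Note that the sketch requires every property name to be qualified, since an unqualified name such as "number" is not a valid Turtle predicate on its own.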

Below (in the section titled "Turtle Servlet Quandaries") I've added some notes concerning issues I ran up against in coding the servlet. I've relegated said notes to the end of the post, because they are more of a dessert than a meal (and probably not a particularly delectable dessert).

Presentation of data in RDFa, a format which is embedded into HTML, may also be eased by following the data storage practices set forth above. I'm choosing not to dive too deeply into this topic, largely because the variety of methods available for publishing HTML out of Sling makes it impractical, and I have no desire to tether the discussion to JSP, or JSF, or some other particular technology. What can be said is that the same value which the TtlServlet finds in the aforementioned data storage practices can be found in the encoding of RDFa regardless of your technology of choice. Property names can be written directly, because they will already be the qualified names of the desired predicates. Namespaces can be extracted from the JCR namespace registry using JCR's API. Literal types can even be inferred directly from the JCR property types, as the TtlServlet shows.
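Independent of any templating technology, the shape of the resulting markup can be sketched with plain string building. The Python snippet below is only an illustration under assumptions: the function name to_rdfa, the choice of div and span elements, and the use of the RDFa 1.1 prefix, about, property, and datatype attributes are mine and are not taken from the proof of concept.

```python
from html import escape


def to_rdfa(path, properties, prefixes):
    """Render one resource as an RDFa fragment.

    properties maps a qualified name to (value, datatype), where datatype is a
    qualified name such as "xsd:long" or None for a plain string.
    """
    prefix_attr = " ".join("{}: {}".format(p, prefixes[p]) for p in sorted(prefixes))
    rows = []
    for name in sorted(properties):
        value, datatype = properties[name]
        dt_attr = ' datatype="{}"'.format(escape(datatype)) if datatype else ""
        rows.append('  <span property="{}"{}>{}</span>'.format(
            escape(name), dt_attr, escape(str(value))))
    return '<div prefix="{}" about="{}">\n{}\n</div>'.format(
        escape(prefix_attr), escape(path), "\n".join(rows))


print(to_rdfa(
    "/content/mynode",
    {"dc:title": ("A Test Resource", None)},
    {"dc": "http://purl.org/dc/terms/"},
))
```

The property attribute receives the stored name untouched, which is the point: nothing has to be translated between the repository and the page.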

Getting and Running the Code

As noted above, the code for the proof of concept has been posted on CITYTECH's GitHub along with instructions in the README file concerning how to compile, deploy, and use the code. Feel free to discuss, fork, suggest additional functionality, or ask questions, ideally via GitHub, though I did include my e-mail on the project itself. My goal for the project is that it be augmented with support for further representations, such as JSON-LD, and maybe even some RDFa or XML support.

Turtle Servlet Quandaries

Namespaces and Namespace Fragility

Early on in thinking about the Servlet and about housing qualified properties in general, I ran up against a concern with changing namespaces. Specifically, I wondered what the best way would be to handle Ontologies with updated versions at different URIs. The JCR specification concerning namespaces gives no guarantees as to what an implementation provides around changing and unregistering namespaces, insofar as it states that an implementation can refuse to allow namespace updates and unregistrations for any reason. Jackrabbit takes the stance of not allowing namespace changes and unregistrations (see org.apache.jackrabbit.core.NamespaceRegistryImpl), because it cannot guarantee that content is not already associated with the namespace. As such, you essentially have one shot at registering namespaces, so if you want to update the version of an Ontology that you are utilizing, it's going to be less than trivial.

This spurred me to consider the matter of namespace fragility in general. If you're feeling particularly bored, you can read my thoughts on why you wouldn't want to update your namespaces.

To Type or not to Type

The node type notation provides for the definition of namespaces and node types. As I brought up in the prior post, there may be value in a transformation between Resource type definitions in an Ontology and Node type definitions in JCR. I did not undertake any such work, though, since for the most part I've found strict node typing in JCR to be overkill. Further, the property which defines a Resource type in a Semantic Web context is the rdf:type property, which can be established independently of the jcr:primaryType property.

The Canonical URI of a Resource

When writing out paths, references, etc., in Turtle format, I ran across the question of what URI to associate with the path in order to allow a user to dereference the URI to obtain a representation of the resource. Since I was writing Turtle data, I was tempted to add a .ttl extension to the end of any paths under the presumption that a user or agent would want to remain in the mode that they are presently in. I ended up writing the path by itself, sans extension, because this seemed to me like the proper URI to associate with the Node itself. Concerns about the relationship between the URI without extension and the one with might be alleviated by establishing some explicit relationship between the two.

The Semantics of Children

The proper handling of child nodes is an area I have not resolved to my satisfaction in this project. First, the JCR specification defines a jcr:content node which has implied semantics in certain contexts. Specifically, this node represents the content of the node of which it is a child. The most immediate example is its use in a file context, where the jcr:content node holds the actual file data. If you are an Adobe CQ5 user, you will also recognize this node name as that housing the Page Content node associated with a particular page. Due to the importance of this node, it made some sense to handle it differently from other child nodes.

This leaves the issue of handling all other child nodes. Should references to those nodes be written in the RDF document of the parent node? Should the user be able to request a tree of content in Turtle format, much like they can do using the default GET servlet's JSON rendering? For now I've taken the approach of not rendering children with the parent node at all. While the default JSON rendering allows the user to drill down through the content tree to an arbitrary depth, the use case I wanted to support does not line up with that of the JSON rendering. A user of this servlet is requesting a representation of a particular resource. As such, only the information for that resource is returned, not the entire content tree.

I am entertaining the idea of writing out a collection (ordered or unordered depending on the node type) of children along with a parent node, but I'm not sure that's an appropriate handling of the situation either. What may be best is to have users explicitly define more meaningful relationships in the data itself if such relationships exist.
