Implementing the semantic web in EPiServer 6 using Smartlogic and the Google Search Appliance, by Paul Dunstone @ Rufus Leonard.
Our client came to us with a problem: they wanted a semantic website that modelled their business and allowed users to locate unstructured content efficiently, all integrated into a content management system.
To take a brief step back, it is worth defining the semantic web and the concept behind it, although I am sure most readers are familiar with it. The term was coined by Tim Berners-Lee to describe a way of allowing computers to understand the meaning, or semantics, of information on the World Wide Web. By adding metadata to content, external agents and automated software are able to classify this data more intelligently and therefore create logical groupings of information, providing extra insight into the user’s search that a normal search might not surface.
The out-of-the-box EPiServer search allows for generic searching but, like most search solutions, it doesn’t cater for enhanced search facilities based on semantic groupings. Our client suggested using Smartlogic’s suite of software, known as Semaphore, to bring semantic web concepts to the content-managed website. As EPiServer caters for in-depth customisation of the CMS, it seemed that these two pieces of software could successfully work in parallel to provide the rich functionality we required while still being easy for editors to use.
Semaphore itself is not a complete search solution; rather, it is a middleware product set that provides the mechanism to define and store taxonomical/ontological information, automatically classify documents and enhance search facilities based upon that information. We therefore still required a search engine to provide our search results, so we opted to use a Google Search Appliance (GSA) box for its rich search indexing.
The goals of the project were as follows:
- Integrate Semaphore services seamlessly into EPiServer to make it easy for editors to classify content and relate it to other content
- Add the ability to tag content with terms in the ontology so that we can force Google to index the content based on preferred terms
- Provide queryable services for the GSA results
- Wrap these in a reusable component that can be rolled out and configured for future developments
Firstly, here’s a short overview of the major components of Semaphore (you can find more details about the software on Smartlogic’s website):
- Ontology Manager/Server - The Semaphore Ontology Manager is a client application designed for knowledge workers who have familiarity with the use of controlled vocabularies – thesauri, taxonomies and ontologies. The ontology is used to define specific relationships between terms, which can then be used to build up a much richer semantic map of the client’s business.
- Classification Server - The Semaphore Classification Server (CS) is a server-based application that extracts keywords or terms from documents that are posted or uploaded to it via its XML API. An application such as a website posts content in the form of a web page, MS Word document or Adobe PDF document, and Classification Server processes it and sends back an XML document containing a set of terms that best describe the document. It does this by making use of the information created by the Ontology Manager and Rulebase Generator.
- Search Enhancement Server - The Semaphore Search Enhancement Server (SES) is a server-based application that exposes the organization’s model stored in Semaphore through an XML web service API. It offers a number of services, ranging from finding information about a specific term to finding related terms.
Integration between Semaphore and EPiServer had never been built before, so the challenge for Rufus Leonard was to integrate these services seamlessly into EPiServer, making it easy for users to tag their content with terms in the ontology and automatically classify their documents using the Classification Server. We created a fully configurable Rufus Leonard Smartlogic solution that sits as a layer between EPiServer and Semaphore and ties the two pieces of software together. It uses an n-tier architecture to provide services to an EPiServer solution, allowing developers to query those services and extract the data they require. The solution provides the following layers to support the integration:
- Configuration Layer - The Configuration Layer sits at the bottom of the pile. As its name suggests, it provides the application with the capability to configure each of the Services via configuration sections in the Web.config files.
- Service Layer - The Service Layer acts as a conduit for the application to request and handle the response from the Semaphore Classification Server, Search Enhancement Server, Search Enhancement Baseliner Service and Google’s Search Appliance.
- Data Layer - The Data Layer sits on top of the Service Layer and provides a wrapper to each of the services that the Service Layer handles. Instead of the developer calling the Service Layer objects directly and having to invoke web client requests and consume responses, the Data Layer handles all this for them. The developer simply instantiates an appropriate service data object, sets a request object value containing information needed to query the web service, passes the Service Layer web client the service object, invokes the web client to post or get the data from the web service and then consumes the response.
- Cache Layer - The Cache Layer sits on top of the Data Layer and supports the ability to cache data. Currently this is implemented for the Search Enhancement Service data and holds all the hierarchical Terms/Tags in the Stylus Ontology. Holding the data in this cache is very important for application performance, as it avoids the need to request the information from the remote web service each time. When dealing with a large amount of data without the caching layer, the website very quickly grinds to a halt.
- Presentation Layer - The Presentation Layer consists of EPiServer plug-ins that provide the EPiServer CMS editor / administrator with an interface into Smartlogic’s Semaphore Servers.
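As a rough illustration of how these layers stack, the following Python sketch wires a configuration object, a service layer, a data layer and a cache layer together. It is not the Rufus Leonard code, which is built on EPiServer; every class name, the `ses_url` setting and the fake transport are hypothetical, chosen so that the example runs without a Semaphore server.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List


@dataclass
class ServiceConfig:
    """Configuration Layer stand-in (the real solution reads Web.config)."""
    ses_url: str  # hypothetical setting name


class ServiceLayer:
    """Sends requests to a remote service via an injected transport function."""

    def __init__(self, config: ServiceConfig,
                 transport: Callable[[str, str], List[str]]) -> None:
        self.config = config
        self.transport = transport
        self.remote_calls = 0  # counts round trips, for demonstration only

    def fetch_related(self, term: str) -> List[str]:
        self.remote_calls += 1
        return self.transport(self.config.ses_url, term)


class TermData:
    """Data Layer: wraps the service call and tidies the response."""

    def __init__(self, service: ServiceLayer) -> None:
        self.service = service

    def related_terms(self, term: str) -> List[str]:
        # De-duplicate and sort so callers get a stable list.
        return sorted(set(self.service.fetch_related(term)))


class CachedTermData:
    """Cache Layer: remembers results so repeat lookups skip the remote call."""

    def __init__(self, data: TermData) -> None:
        self.data = data
        self.cache: Dict[str, List[str]] = {}

    def related_terms(self, term: str) -> List[str]:
        if term not in self.cache:
            self.cache[term] = self.data.related_terms(term)
        return self.cache[term]


def fake_transport(url: str, term: str) -> List[str]:
    """Stands in for the remote web service and returns canned data."""
    return {"Savings": ["ISA", "Bonds", "ISA"]}.get(term, [])


service = ServiceLayer(ServiceConfig(ses_url="http://ses.example/"), fake_transport)
terms = CachedTermData(TermData(service))
first = terms.related_terms("Savings")
second = terms.related_terms("Savings")  # served from the cache
```

The second lookup never reaches the transport, which is the effect the Cache Layer is there to provide when the ontology is large.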
By referencing these projects and configuring them through the web.config, users can easily add the services and plug-ins required to start classifying content and getting data back to display in their front ends.
The plug-ins allow users to classify documents and post the results to the GSA box as an XML document so that the content is indexed. By injecting Dublin Core meta tags into the XML document we are able to weight tags by their relevancy. For example, when we classify a document, Semaphore may rate one term as more relevant than another. Based on this, we inject the more relevant term multiple times into the Dublin Core meta tags so that Google interprets it as a more relevant term and indexes the page and its associated terms accordingly. As well as posting the XML to the GSA box, it is of course also possible to have the GSA box crawl the website and to output the Dublin Core meta tags on the front end to achieve the same result. This can all be configured through the GSA front end.
The process for classifying documents through the custom EPiServer plug-in and posting the XML to Google has been refined to be easy to use and highly intuitive for content editors. The process is as follows:
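To make the weighting idea concrete, here is a minimal Python sketch that repeats a Dublin Core subject tag according to a relevancy score. The score range of 0 to 1, the repeat formula, the `DC.subject` name and the sample terms are all assumptions for illustration. The `record`/`metadata`/`meta` element names loosely follow the GSA feed format and should be checked against Google's feed documentation before use.

```python
import xml.etree.ElementTree as ET
from typing import Dict


def repeats_for(score: float, max_repeats: int = 3) -> int:
    """Map a score in [0, 1] to a repeat count between 1 and max_repeats."""
    clamped = min(max(score, 0.0), 1.0)
    return 1 + int(clamped * (max_repeats - 1))


def build_record(url: str, term_scores: Dict[str, float],
                 max_repeats: int = 3) -> ET.Element:
    """Build one feed record whose meta tags repeat the more relevant terms."""
    record = ET.Element("record", {"url": url, "mimetype": "text/html"})
    metadata = ET.SubElement(record, "metadata")
    for term, score in term_scores.items():
        for _ in range(repeats_for(score, max_repeats)):
            ET.SubElement(metadata, "meta",
                          {"name": "DC.subject", "content": term})
    return record


record = build_record("http://example.com/page",
                      {"Mortgages": 1.0, "Savings": 0.4})
xml_text = ET.tostring(record, encoding="unicode")
mortgage_tags = [m for m in record.iter("meta") if m.get("content") == "Mortgages"]
savings_tags = [m for m in record.iter("meta") if m.get("content") == "Savings"]
```

With these sample scores the top-rated term appears three times and the weaker one once. How strongly repetition influences ranking depends on the GSA configuration.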
- The user creates and publishes their page in EPiServer as usual.
- The user accesses the new Classification tab in Edit mode. Here they can configure the Classification Service threshold, which determines how relevant a term must be for the service to return it. The lower the threshold, the less relevant the returned terms are. Once the ontology grows in size, the threshold needs to be quite high to keep the results relevant.
- The user can then review the suggested tags that the Semaphore Classification service returned and decide if they are relevant to their page and dismiss them if necessary.
- The Classification service may not always bring back the expected tags, especially if the threshold is set too high. On the next screen we have given content editors the ability to review and choose their own tags. The tree is based on the full ontology created in the Ontology Manager and then cached locally in a serialized XML file for efficiency.
- Finally, the editors review their tags and the XML document is created and posted to the GSA box so that it can index the content.
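The editor workflow above amounts to a threshold filter followed by a manual review. A minimal Python sketch, assuming the Classification Server returns each term with a score between 0 and 1 (the scale, the function names and the sample terms are assumptions, not Semaphore's API):

```python
from typing import Dict, Iterable, List


def suggest_tags(scored_terms: Dict[str, float], threshold: float) -> List[str]:
    """Return terms scoring at or above the threshold, most relevant first."""
    ranked = sorted(scored_terms.items(), key=lambda item: item[1], reverse=True)
    return [term for term, score in ranked if score >= threshold]


def final_tags(suggested: Iterable[str], dismissed: Iterable[str],
               added: Iterable[str]) -> List[str]:
    """Drop the suggestions the editor dismissed, then append manual choices."""
    rejected = set(dismissed)
    tags = [term for term in suggested if term not in rejected]
    for term in added:
        if term not in tags:
            tags.append(term)
    return tags


scores = {"Mortgages": 0.92, "Savings": 0.55, "Insurance": 0.20}
loose = suggest_tags(scores, threshold=0.10)   # low threshold: everything returned
strict = suggest_tags(scores, threshold=0.50)  # higher threshold: weak term dropped
chosen = final_tags(strict, dismissed=["Savings"], added=["Remortgaging"])
```

Raising the threshold trims weaker suggestions, and the manual step lets the editor recover any tag the filter removed.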
As the GSA indexes are created, we are able to call services in the service layer of the Rufus Leonard Smartlogic solution to receive meaningful results, bringing back content that would not have been found in a standard search. Not only do we use these search results when a user actually queries the site, but we also use them to drive users to other content by suggesting related pages or populating content areas with pages that might be of interest. Although these pages may not have been specifically tagged with the same term, through the semantic search we can drive users to meaningful related content as defined by the business’s ontology.
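The related-content idea can be sketched in a few lines of Python: expand a term with its related terms from the ontology, then return any page tagged with one of them. In the real solution the GSA index does the matching; here an in-memory dictionary stands in for it, and the ontology, page paths and function names are all hypothetical.

```python
from typing import Dict, List, Set


def expand_term(term: str, ontology: Dict[str, List[str]]) -> Set[str]:
    """Return the term together with the terms the ontology relates to it."""
    return {term} | set(ontology.get(term, []))


def related_pages(term: str, ontology: Dict[str, List[str]],
                  page_tags: Dict[str, List[str]]) -> List[str]:
    """Find pages tagged with the term or with any of its related terms."""
    wanted = expand_term(term, ontology)
    return sorted(page for page, tags in page_tags.items()
                  if wanted & set(tags))


ontology = {"Savings": ["ISA", "Bonds"]}
page_tags = {
    "/isa-guide": ["ISA"],
    "/savings": ["Savings"],
    "/car-insurance": ["Insurance"],
}
pages = related_pages("Savings", ontology, page_tags)
```

The ISA guide is returned for "Savings" even though it was never tagged with that term, because the ontology relates the two.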
Currently the solution is only compatible with EPiServer 6, although it can be retrofitted for earlier versions if required. The working solution took several months to implement, with a large amount of that time spent building the underlying infrastructure without much to demo, so it was great once everything came together and we were able to see a fully functional example of the semantic web in action.
If you would like to hear more about the Rufus Leonard Semaphore integration project or talk to us about EPiServer in general please contact us: http://www.rufusleonard.com/london/