There are at least two major reasons why you may want to shy away from many of the popular text analysis and annotation services in an enterprise setting. First, they are hosted services, which forces you to send your content to a third party. Second, they offer little choice of target datasets: links to named entities are mostly suggested from public glossaries such as Wikipedia. In contrast, Apache Stanbol (incubating) gives you the freedom to work within your own IT environment and with your own business terminology as you see fit.
Apache Stanbol now enables you to upload your own custom vocabulary and to annotate unstructured text with related web documents indexed to that vocabulary. This particular enhancement engine is called the “Keyword Linking Engine”. The engine, together with the “Entity Hub” for managing local terminologies, was designed and written by Rupert Westenthaler. The enhanced content, along with the entities, can then be used in more advanced semantic search applications. In this blog post I will show how to use the enhancement capabilities of Stanbol together with an inline annotation widget to enrich unstructured texts with images.
Enrich texts with images
In my example, I use metadata from the image archive of the Austrian National Library, which has been made available as Europeana Linked Open Data on CKAN (Europeana produces several datasets from various other European archives and museums). What makes this dataset special is that its main entities are images and photographs from Austrian history, and not just descriptions of entities (persons, places, organisations, concepts) as in Wikipedia or other open data sources.
(1) Create a Solr index out of your custom vocabulary
To start, you need to make use of the indexing capabilities of the Stanbol Entity Hub, the component for caching indexes of linked data to be used as targets for the enhancement process. See the Readme, which describes the entire process in detail. For our example, the most important steps are:
- First, build the indexing tool itself by running the following in the directory genericrdf:
  $ mvn assembly:single
- Then copy the resulting org.apache.stanbol.entityhub.indexing.genericrdf-*-jar-with-dependencies.jar from the target directory into a working directory and initialize the indexing process with
  $ java -jar org.apache.stanbol.entityhub.indexing.genericrdf-*-jar-with-dependencies.jar init
- The indexing tool provides you with configuration options you may want to use in order to get a proper index of your RDF input.
- Adjust the mappings configuration. This file defines which properties will be indexed. In our case, as the main namespaces such as Dublin Core, FOAF etc. are already present, you just need to add a few lines to the file mappings.txt:
  # --- Europeana / Austrian National Library
  http://www.europeana.eu/schemas/edm/*
  http://www.openarchives.org/ore/terms/*
- Provide a name, a short description of the source and licence information in indexing.properties, and choose the indexing strategy. In most cases you can just use the default values.
- Put the RDF source files into the indexing/resources folder and call
  $ java -Xmx1024m -jar org.apache.stanbol.entityhub.indexing.genericrdf-*-jar-with-dependencies.jar index
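The mapping adjustment from the steps above can also be scripted, which helps when you rebuild the index repeatedly. The following is only a minimal sketch in Python: the function name `add_mappings` is hypothetical, and the location of mappings.txt inside the working directory (for example indexing/config/mappings.txt) is an assumption you should check against what the init step actually created. It appends the two namespace patterns only if they are not in the file yet.

```python
from pathlib import Path

# Namespace patterns taken from the mappings.txt lines shown in this post.
EXTRA_MAPPINGS = [
    "http://www.europeana.eu/schemas/edm/*",
    "http://www.openarchives.org/ore/terms/*",
]


def add_mappings(mappings_file, patterns=EXTRA_MAPPINGS):
    """Append every pattern that is not yet a line of the file.

    Returns the list of patterns that were actually added, so calling
    the function a second time is harmless and returns an empty list.
    """
    path = Path(mappings_file)
    existing = set()
    if path.exists():
        text = path.read_text(encoding="utf-8")
        existing = {line.strip() for line in text.splitlines()}
    missing = [p for p in patterns if p not in existing]
    if missing:
        with path.open("a", encoding="utf-8") as handle:
            # Leading newline guards against a file without a final newline.
            handle.write("\n# --- Europeana / Austrian National Library\n")
            for pattern in missing:
                handle.write(pattern + "\n")
    return missing
```

Calling `add_mappings("indexing/config/mappings.txt")` (path assumed) before the index step leaves the existing mappings untouched and only adds what is missing.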
(2) Configure the keyword linking engine to work with your vocabulary
- Move the ZIP archive of the index into your {stanbol-root}/sling/datafiles directory.
- Install and start the bundle created by the indexing at the OSGi console.
- Deactivate all other EnhancementEngines and configure the KeywordLinkingEngine to use the index by specifying the referenced site.
The user interface for configuring the Stanbol Keyword Linking Engine allows you, for example, to choose the target vocabulary, to set the number of suggestions and to restrict the engine to specific languages.
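Once the engine is configured, a quick way to check that it picks up your vocabulary is to post a sentence to the Stanbol enhancer over HTTP and look at the returned RDF. The sketch below uses only the Python standard library. The endpoint URL is an assumption: it presumes a local Stanbol launcher on port 8080, and the enhancer path depends on the Stanbol version you run, so adjust it to your installation. The function names are hypothetical.

```python
import urllib.request


def build_enhance_request(text, endpoint="http://localhost:8080/engines"):
    """Build (but do not send) a POST request with plain text for the enhancer.

    The default endpoint is an assumption; pass the enhancer URL of your
    own Stanbol installation.
    """
    return urllib.request.Request(
        endpoint,
        data=text.encode("utf-8"),
        headers={
            "Content-Type": "text/plain; charset=utf-8",
            # Ask for the enhancement results serialized as Turtle.
            "Accept": "text/turtle",
        },
        method="POST",
    )


def enhance(text, endpoint="http://localhost:8080/engines", timeout=30):
    """Send the text to a running Stanbol instance and return the RDF response."""
    request = build_enhance_request(text, endpoint)
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return response.read().decode("utf-8")
```

If the returned enhancements reference entities from your own index rather than from other sources, the referenced site is wired up as intended.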
(3) Get (semi-)automatic depiction for articles of your domain
Paste an example text from Wikipedia about the Austrian Civil War in the 1930s into the system (the image library covers this time period and region). Use the IKS annotate widget together with Apache Stanbol to get entity annotation suggestions for some occurrences within your text. With the annotate widget, designed and written by Szaby Grünwald, you can select, accept or decline annotations. When you accept an annotation, the entity link is stored as HTML/RDFa, a format that is both human and machine readable.
For all selected and accepted links, a slightly modified HTML view of annotate.js, built for this showcase, retrieves and displays the relevant images from the image repository. In this example, we retrieved the images directly from the Europeana library.
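To make the idea of storing an accepted link as HTML/RDFa more concrete, here is a purely illustrative sketch in Python. It is not the markup annotate.js actually produces, which may well differ; the function name `rdfa_span` and the example URIs are hypothetical. It only shows the general principle of wrapping the annotated text in an element whose RDFa attributes point to the entity.

```python
from html import escape


def rdfa_span(text, entity_uri, type_uri=None):
    """Wrap annotated text in a span that links it to an entity via RDFa.

    `about` names the entity the text refers to, `typeof` optionally
    states its type. All values are HTML-escaped before insertion.
    """
    attrs = ['about="%s"' % escape(entity_uri, quote=True)]
    if type_uri:
        attrs.append('typeof="%s"' % escape(type_uri, quote=True))
    return "<span %s>%s</span>" % (" ".join(attrs), escape(text))


# Hypothetical entity URI, for illustration only.
print(rdfa_span("Vienna", "http://example.org/entity/vienna"))
```

A reader still sees the plain text, while a script can read the `about` attribute and use the entity URI to look up related resources such as images.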
What could be done better?
What I’ve shown in this example is the ability of Apache Stanbol to easily handle local vocabularies and to use them in the enhancement process. The frontend widgets retrieve this information and support (semi-)automatic annotation of unstructured texts. The example is about depicting historical situations, but the system is not restricted to this use. One could also imagine taking a very specific product catalogue and using the engine to create a faceted semantic search over a repository of documents about those products, or using the same engine to classify incoming mails according to enterprise-specific keywords.
Still, some features are missing that would be needed to support more real-world implementations, such as
- improved multilingual support for both the analysis engines and the frontend interaction widget,
- better human and visual disambiguation support through previews of entities,
- the possibility to switch to an automatic annotation mode with a very high recall rate,
- a broader connection from the frontend to the datastore in order to easily change views according to the client’s needs.