"Unstructured information may be defined as the direct product of human communication. Examples include natural language documents, email, speech, images and video. It is information that was not specifically encoded for machines to process but rather authored by humans for humans to understand. We say it is “unstructured” because it lacks explicit semantics (“structure”) required for applications to interpret the information as intended by the human author or required by the end-user application. "
UIMA - Unstructured Information Management Architecture - is a technical infrastructure for processing unstructured information such as natural language documents. It was initiated by IBM and has since become an OASIS standard.
The basic concept is a pipeline of three stages.
|1.||Collection Reader - a collection reader provides information as text. A simple collection reader could just read a text file from the file system and provide the contained text as it is.|
|2.||Analysis Engine - an analysis engine processes the text provided by the collection reader. It annotates pieces of the text as being a particular kind of information. As an example, an analysis engine may scan the text for phone numbers. The resulting annotations are stored in indexes.|
|3.||Consumer - a consumer accesses the structured information in the indexes and processes it in some way. A typical consumer may store the information somewhere, e.g. in a database.|
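The three stages above can be sketched in a few lines of plain Python. This is only an illustration of the data flow, not the UIMA API: the names `Document`, `Annotation`, `collection_reader`, `phone_number_engine`, `consumer` and `run_pipeline` are hypothetical, the phone number pattern is deliberately simplistic, and a list stands in for the database.

```python
import re
from dataclasses import dataclass, field


@dataclass
class Annotation:
    """A span of text, identified by a type name and begin/end offsets."""
    type_name: str
    begin: int
    end: int


@dataclass
class Document:
    """Holds the text plus an index of the annotations found in it."""
    text: str
    index: list = field(default_factory=list)


def collection_reader(texts):
    """Stage 1: provide each input as a document, text unchanged."""
    for text in texts:
        yield Document(text)


# Simplistic pattern, for illustration only: a digit, then at least five
# digits or spaces, then a closing digit.
PHONE_PATTERN = re.compile(r"\+?\d[\d ]{5,}\d")


def phone_number_engine(doc):
    """Stage 2: annotate phone-number-like spans and add them to the index."""
    for match in PHONE_PATTERN.finditer(doc.text):
        doc.index.append(Annotation("PhoneNumber", match.start(), match.end()))
    return doc


def consumer(doc, store):
    """Stage 3: read the index and store the covered text (here: in a list)."""
    for ann in doc.index:
        store.append((ann.type_name, doc.text[ann.begin:ann.end]))


def run_pipeline(texts):
    """Run reader, analysis engine and consumer over all input texts."""
    store = []
    for doc in collection_reader(texts):
        consumer(phone_number_engine(doc), store)
    return store
```

Calling `run_pipeline(["Call 0711 123456 today."])` would pass the text through all three stages and leave one `PhoneNumber` entry in the store. In a real UIMA application each stage would be a separate, configurable component rather than a function.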
UIMA provides an extensible type system which includes primitive types like integers or strings. In addition there is a built-in type for (textual) annotations which consists of a start and an end index within a text. Usually each analysis engine extends the type system by one or more new types. For example, a new type 'PhoneNumber' can be added as a subtype of the annotation type. The analysis engine then annotates the text by creating new instances of PhoneNumber, each holding the start and end index of a span of text that is a textual representation of a phone number.
Each type may have associated features. A feature is a field on each instance of a type that holds a particular piece of information. For example, the start and the end index are features of the annotation type. An analysis engine can define new features on its types, so the PhoneNumber type could have a feature 'normalizedNumber' which contains a normalized version of the phone number found in the text, for example by removing spaces.
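The relationship between the annotation type, a subtype and its features can be pictured with a small class hierarchy. The sketch below uses plain Python dataclasses as a stand-in; it is not how UIMA declares types (the real type system is defined by the framework), and `annotate_phone_number`, `covered_text` and the snake_case feature name `normalized_number` are hypothetical.

```python
from dataclasses import dataclass


@dataclass
class Annotation:
    """Stand-in for the built-in annotation type: features begin and end."""
    begin: int
    end: int

    def covered_text(self, text):
        """Return the span of the text that this annotation covers."""
        return text[self.begin:self.end]


@dataclass
class PhoneNumber(Annotation):
    """Subtype of Annotation that adds one feature of its own."""
    normalized_number: str = ""


def annotate_phone_number(text, begin, end):
    """Create a PhoneNumber instance; normalization here just removes spaces."""
    raw = text[begin:end]
    return PhoneNumber(begin, end, normalized_number=raw.replace(" ", ""))
```

A `PhoneNumber` created this way inherits the start and end index from `Annotation` and carries the normalized number as an additional field, so a consumer can use either the original covered text or the normalized form.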
NOTE: For more information about UIMA, take a look at http://uima.apache.org/documentation.html