Text Analytics 2.0 with Carabao Language Kit

Business data does not always come pre-packaged in spreadsheets, relational databases, or XML. Developers often face the challenging task of processing unstructured text. As the expectation that computers should be 'smarter' persists, text mining and text analytics have become increasingly important parts of Business Intelligence.

 

 

Architecture

General Principle

Text analytics is a broad field that encompasses many smaller areas, including but not limited to:

  • Entity extraction
  • Domain extraction
  • Sentiment analysis
  • Stemming and lemmatization
  • Semantic search
  • more

While humans apply the same cognitive mechanism to all of these tasks, most natural language processing applications have to be built specifically for each one. This is not the case with Carabao Language Kit.

Once the text is converted to a disambiguated, language-neutral representation, the majority of natural language processing tasks are reduced to traversing lists of codes.
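To make the idea of "traversing lists of codes" concrete, here is a minimal Python sketch. It is not the Carabao API: the numeric codes, the list shape of a sentence, and the function names are all assumptions made for illustration.

```python
# Hypothetical concept codes; the real Carabao identifiers are not given in the text.
CITY, BUY, CAR, GOOD = 101, 202, 303, 404

# A disambiguated sentence as a list of language-neutral codes (assumed shape).
sentence = [GOOD, CAR, BUY, CITY]


def contains_concept(codes, concept):
    """Semantic search reduced to a membership test over codes."""
    return concept in codes


def extract(codes, wanted):
    """Entity extraction reduced to filtering codes against a set of interest."""
    return [code for code in codes if code in wanted]


print(contains_concept(sentence, CAR))   # True
print(extract(sentence, {CITY, CAR}))    # [303, 101]
```

Under these assumptions, two tasks that normally need separate tooling become two short traversals of the same list.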

Cross-lingual Semantic Network

The linguistic database is built around a semantic network that is shared across languages. As shown in Fig. 2, synonyms and word forms carry an identification number, called the family ID, which is shared across all words denoting the same concept.
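The family ID can be pictured as a lookup keyed by language and surface form. The sketch below is illustrative only: the table contents, the ID values, and the helper names are hypothetical and do not reflect the actual database schema.

```python
# Hypothetical lexicon: (language, surface form) -> family ID.
LEXICON = {
    ("en", "car"): 1001,
    ("en", "cars"): 1001,        # word form of the same concept
    ("en", "automobile"): 1001,  # synonym
    ("es", "coche"): 1001,
    ("de", "Auto"): 1001,
    ("en", "cat"): 1002,
    ("es", "gato"): 1002,
}


def family_id(language, word):
    """Return the family ID for a word, or None if it is not in the lexicon."""
    return LEXICON.get((language, word))


def same_concept(first, second):
    """True when two (language, word) pairs share a known family ID."""
    first_id, second_id = family_id(*first), family_id(*second)
    return first_id is not None and first_id == second_id


print(same_concept(("en", "cars"), ("es", "coche")))  # True
print(same_concept(("en", "car"), ("es", "gato")))    # False
```

Because the comparison happens on IDs rather than on strings, the same check works regardless of which languages the two words come from.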

Linguistic Abstraction

To achieve higher efficiency and flexibility, the kernel of Carabao is made linguistically abstract. Even fundamental linguistic terms, such as “part of speech”, “noun”, “verb”, “singular”, and “plural”, are defined outside of the kernel, in the linguistic database, via the linguistic management tool. The linguistic abstraction achieves two purposes.

First, the linguistic development is less dependent on IT personnel.

Second, given the wide variety of languages and the unique challenges posed by certain linguistic tasks, the standard set of grammatical metadata is often not enough and must be adjusted. Linguistic terms, rules, parsing parameters, classification rules for non-dictionary entries, and nearly every other aspect of the text processing are accessible and editable in a linguistic workbench tool called Carabao Data Manager.
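One way to picture a linguistically abstract kernel is code that never names a grammatical term itself and reads every term from data. The following Python sketch assumes a plain dictionary as a stand-in for the linguistic database; the structure, feature names, and functions are hypothetical and are not taken from Carabao.

```python
# Stand-in for the linguistic database: every linguistic term here is data, not code.
DATABASE = {
    "features": {
        "part_of_speech": ["noun", "verb"],
        "number": ["singular", "plural"],
    },
    "entries": {
        "cars": {"part_of_speech": "noun", "number": "plural"},
        "runs": {"part_of_speech": "verb", "number": "singular"},
    },
}


def validate(database):
    """Check entries only against feature definitions found in the data itself."""
    features = database["features"]
    errors = []
    for word, tags in database["entries"].items():
        for name, value in tags.items():
            if value not in features.get(name, []):
                errors.append((word, name, value))
    return errors


def select(database, **criteria):
    """Return words whose tags match all criteria; no term is hard-coded here."""
    return sorted(
        word
        for word, tags in database["entries"].items()
        if all(tags.get(key) == value for key, value in criteria.items())
    )


# Extending the grammar is a data edit; the functions above stay untouched.
DATABASE["features"]["animacy"] = ["animate", "inanimate"]
DATABASE["entries"]["cars"]["animacy"] = "inanimate"

print(validate(DATABASE))                     # []
print(select(DATABASE, animacy="inanimate"))  # ['cars']
```

In this sketch a new grammatical category is introduced by editing data only, which mirrors the claim that linguistic development need not depend on changes to the kernel.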

Domain of Discourse Extraction

To track the context of the current content, the system analyzes the domains of discourse for every sentence in the text. While the main purpose is to help find the exact meaning of every word (or collocation), the domains of discourse can be used for a variety of tasks, such as classification of documents, or contextual

The system supplies a breakdown of the domains per sentence, as shown in Fig. 10. The domains are not limited to a fixed set of 20 to 40 items; every concept may be a domain by itself.
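A per-sentence domain breakdown can be sketched as a tally of the domains that each concept in a sentence signals. The mapping below is invented for illustration, and readable strings stand in for the ID codes a real system would use; none of it is Carabao's actual data or output format.

```python
from collections import Counter

# Hypothetical mapping: concept -> domain concepts it signals.
# Any concept may serve as a domain, so the domain set is open-ended.
DOMAINS_OF = {
    "goalkeeper": ["football"],
    "penalty": ["football", "law"],
    "court": ["law", "tennis"],
    "judge": ["law"],
}


def domains_per_sentence(sentences):
    """Return one Counter of domain votes per sentence."""
    breakdown = []
    for concepts in sentences:
        counts = Counter()
        for concept in concepts:
            counts.update(DOMAINS_OF.get(concept, []))
        breakdown.append(counts)
    return breakdown


text = [["goalkeeper", "penalty"], ["judge", "court", "penalty"]]
result = domains_per_sentence(text)

print(result[0].most_common(1))  # [('football', 2)]
print(result[1].most_common(1))  # [('law', 3)]
```

Here the ambiguous concept "penalty" ends up in a football context in the first sentence and a legal context in the second, which is the kind of signal that can support word sense selection or document classification.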

 
Entity Extraction

Entity extraction provides its output as cross-lingual ID codes. This means that the same code will work for all languages in the database, as demonstrated