In this article, we will briefly discuss why you should build your customer knowledge around terms and how our term identification system works.
Why do terms matter?
Terms, which we also call key expressions, key phrases or sometimes keywords, are the pieces of information that describe single, important ideas in a text.
Here, we consider terms to be the true holders of the information that matters to you, as opposed to, say, whole documents or sentences. There are several reasons for this, and they are too often neglected.
The document identity crisis
A document can contain a single idea, multiple ideas, or none. So when you try to sort whole texts exhaustively into meaningful boxes (or categories), you will have a hard time deciding which goes where.
Like the Rolling Stones said, you’re not the only one with mixed emotions: most people include positive and negative points in their reviews, which is only discernible by working at the term level.
We’ll discuss this in more depth in a future article on opinion detection.
You can see terms as the building blocks of your user-generated content.
As such, manipulating precise terms instead of raw documents lets you apply a coherent structure to your data, so that you can get a clearer understanding of the emerging insights.
Why should I use term extraction?
I have a nice dictionary of terms of my own, so why bother searching for new terms every time?
We strongly believe in letting the texts speak for themselves, in order to discover reality for what it is, instead of approaching the analysis with some a priori bias. Furthermore, the human mind is simply not adapted to the volumes of text that you typically process through our API.
If you use a dictionary, you will find what you are looking for, but you are likely to miss important things you haven’t thought about. In other words, you are not listening to the voice of your customers.
Instead of constraining our technology to meet expected results, we prefer to apply dictionary-free, unconstrained term extraction and let the important terms emerge.
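To make the point concrete, here is a minimal Python sketch (not our production code) that compares terms already extracted from a set of reviews with a hand-made dictionary. The function name `terms_missed_by_dictionary` and the sample data are hypothetical and only there for illustration.

```python
from collections import Counter
from typing import Iterable, List, Set, Tuple


def terms_missed_by_dictionary(
    extracted_terms: Iterable[str], dictionary: Set[str]
) -> List[Tuple[str, int]]:
    """Return (term, count) pairs for terms found in the texts but absent
    from the dictionary, most frequent first."""
    counts = Counter(term.lower() for term in extracted_terms)
    return [(term, n) for term, n in counts.most_common() if term not in dictionary]


# Hypothetical output of a term extraction step over a few reviews.
extracted = [
    "battery life", "price", "battery life",
    "delivery delay", "price", "battery life",
]
# A dictionary that only anticipated the idea of price.
my_dictionary = {"price"}

print(terms_missed_by_dictionary(extracted, my_dictionary))
# [('battery life', 3), ('delivery delay', 1)]
```

In this toy case the most frequent term in the reviews is one the dictionary never anticipated.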
Term extraction methods
There are two main approaches when it comes to term extraction: rule-based matching, which is a more linguistically oriented, hands-on approach, and sequence labeling, which is a more probabilistic, abstract approach to the problem. In the end, it’s all a matter of pathfinding in the graph of words that compose the document, provided that this document has at least been correctly tokenized.
In the case of rules, this graph is built manually, on the basis of linguistic knowledge and trial and error on a good amount of sample documents.
The pathfinding is then heavily constrained, as the only possible paths are those defined by these rules. Also, time being a limited resource, the manual work required for this method can only focus on the most efficient elements that represent words: the words themselves, or their POS tags.
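As an illustration of the rule-based approach, here is a minimal Python sketch with a single hand-written rule over POS tags: zero or more adjectives followed by one or more nouns. The rule, the tag names (`ADJ`, `NOUN`, and so on), the function name `extract_terms_with_rule` and the hand-tagged sample sentence are all assumptions made for this example. They are not the rules our system uses.

```python
from typing import List, Tuple

Token = Tuple[str, str]  # (word, POS tag)


def extract_terms_with_rule(tokens: List[Token]) -> List[str]:
    """Extract every maximal span matching the rule ADJ* NOUN+."""
    terms = []
    i, n = 0, len(tokens)
    while i < n:
        # Optional run of adjectives.
        j = i
        while j < n and tokens[j][1] == "ADJ":
            j += 1
        # Mandatory run of nouns.
        k = j
        while k < n and tokens[k][1] == "NOUN":
            k += 1
        if k > j:
            # At least one noun: the whole span is a term.
            terms.append(" ".join(word for word, _ in tokens[i:k]))
            i = k
        else:
            # No noun here: skip the adjectives (or the current token).
            i = max(j, i + 1)
    return terms


# Hand-tagged sample sentence (a real pipeline would use a POS tagger).
sentence = [
    ("the", "DET"), ("battery", "NOUN"), ("life", "NOUN"), ("is", "VERB"),
    ("great", "ADJ"), ("but", "CONJ"), ("the", "DET"), ("customer", "NOUN"),
    ("service", "NOUN"), ("was", "VERB"), ("slow", "ADJ"),
]

print(extract_terms_with_rule(sentence))
# ['battery life', 'customer service']
```

Only the paths allowed by the rule can ever be found, which is exactly the constraint described above: isolated adjectives such as "great" and "slow" are never extracted.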
On the other hand, a sequence labeling model can benefit from more elements, or features, that define each word. The pathfinding in this case is based on previous observations.
The characteristic sequences of words that represent a term are then defined by the most probable series of features among all the possible combinations. Here the manual bottleneck (because there’s always one) is to produce enough of such observations for the model to be trained to recognize terms in various situations.
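To give an idea of what "the most probable series" means in practice, here is a minimal Python sketch of Viterbi decoding, a classic algorithm for finding the most probable label sequence. It is a sketch under several assumptions: the B/I/O labeling scheme (beginning of a term, inside a term, outside), the use of the POS tag as the only feature, and every probability in the tables are invented for illustration. A real model would learn these numbers from annotated observations, and nothing here describes how our own system is built.

```python
from typing import Dict, List

LABELS = ["B", "I", "O"]

# Invented probabilities, for illustration only.
START = {"B": 0.3, "I": 0.0, "O": 0.7}
TRANS = {  # TRANS[previous][current]
    "B": {"B": 0.05, "I": 0.65, "O": 0.3},
    "I": {"B": 0.1, "I": 0.4, "O": 0.5},
    "O": {"B": 0.4, "I": 0.0, "O": 0.6},
}
EMIT = {  # EMIT[label][POS tag]
    "B": {"NOUN": 0.6, "ADJ": 0.3, "VERB": 0.05, "DET": 0.05},
    "I": {"NOUN": 0.8, "ADJ": 0.1, "VERB": 0.05, "DET": 0.05},
    "O": {"NOUN": 0.1, "ADJ": 0.2, "VERB": 0.3, "DET": 0.4},
}


def viterbi(observations: List[str]) -> List[str]:
    """Return the most probable label sequence for a non-empty list of POS tags."""
    # best[t][label] = (score of the best path ending in label at t, previous label)
    best = [{lab: (START[lab] * EMIT[lab][observations[0]], None) for lab in LABELS}]
    for obs in observations[1:]:
        column = {}
        for lab in LABELS:
            prev, score = max(
                ((p, best[-1][p][0] * TRANS[p][lab]) for p in LABELS),
                key=lambda pair: pair[1],
            )
            column[lab] = (score * EMIT[lab][obs], prev)
        best.append(column)
    # Backtrack from the best final label.
    path = [max(LABELS, key=lambda lab: best[-1][lab][0])]
    for column in reversed(best[1:]):
        path.append(column[path[-1]][1])
    path.reverse()
    return path


def labels_to_terms(words: List[str], labels: List[str]) -> List[str]:
    """Turn a B/I/O label sequence back into a list of terms."""
    terms: List[str] = []
    current: List[str] = []
    for word, label in zip(words, labels):
        if label == "B":
            if current:
                terms.append(" ".join(current))
            current = [word]
        elif label == "I" and current:
            current.append(word)
        else:
            if current:
                terms.append(" ".join(current))
            current = []
    if current:
        terms.append(" ".join(current))
    return terms


words = ["the", "battery", "life", "lasts"]
pos_tags = ["DET", "NOUN", "NOUN", "VERB"]
labels = viterbi(pos_tags)
print(labels)                          # ['O', 'B', 'I', 'O']
print(labels_to_terms(words, labels))  # ['battery life']
```

Unlike the hand-written rule, nothing in this code says that two nouns in a row form a term: that behavior comes entirely from the numbers in the tables, which is why the quantity and quality of the training observations matter so much.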
So what exactly is a term?
It’s important to note that in both of the methods we’ve just seen, the graph that is used by the term identification algorithm entirely depends on the initial human input. Because of this, the formal description of what a term is, on which this manual work heavily relies, must remain as coherent as possible.
What's wrong with the other terms?
It seems obvious that terms that do not bring any relevant information to the document, or that are longer than needed, are less interesting, and are thus considered “noise” in our data.
However, the choice of ruling out certain forms of terms, such as single adjectives or verbs, is less trivial: it comes from our experience in information extraction.
The problematic cases arise when such words seem strongly related to an idea, and appear as valid alternatives to an existing term to describe this idea, such as “pay” or “cheap” which are semantically related to the idea of “price”.
The reality is that in customer reviews, it is extremely rare that such meaningful adjectives or verbs are employed outside a compound noun or verb, whereas single adjectives or verbs that do not hold any useful information are plentiful, so it is not worth the trouble to extract these forms.
Terms and ideas
Terms are the fine-grained blocks that build up documents. However, you need a coarser grain to extract meaningful insights from a huge set of documents.
What you really want to work with, though, are ideas, not terms. When you are analyzing opinions about the price of a product, you don’t want to be pointed only to the reviews that contain the term “price”, but also to those that contain terms such as “cost”, “value”, “amount”, “charge” and maybe other related terms.
One way to look at this is to consider ideas as groups, or clusters, of terms. Gathering the variants of a term is a very difficult task, yet it is necessary in order to provide useful insights to our customers.
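Seen this way, an idea can be sketched as a plain set of terms. The short Python example below assumes the clusters already exist: the `IDEA_CLUSTERS` table is written by hand from the “price” example above, and the function name `reviews_about` and the review data are hypothetical. It says nothing about how the clusters are actually built, which is the difficult part.

```python
from typing import Dict, List, Set

# Hand-written cluster, taken from the "price" example in the text.
IDEA_CLUSTERS: Dict[str, Set[str]] = {
    "price": {"price", "cost", "value", "amount", "charge"},
}


def reviews_about(idea: str, terms_by_review: Dict[str, Set[str]]) -> List[str]:
    """Return the ids of the reviews containing at least one term of the idea."""
    cluster = IDEA_CLUSTERS[idea]
    return [rid for rid, terms in terms_by_review.items() if terms & cluster]


# Hypothetical terms extracted from three reviews.
terms_by_review = {
    "review-1": {"price", "battery life"},
    "review-2": {"cost"},
    "review-3": {"delivery delay"},
}

print(reviews_about("price", terms_by_review))
# ['review-1', 'review-2']
```

A search on the term “price” alone would have returned only the first review, while the idea of price also covers the second one.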