Can Text Analytics Become More Analytical?
Practical Text Analytics Author Steven Struhl Explains the Need to Combine Text Analytics with Predictive Models to Go Beyond Counting and Describing
The term text analytics sounds highly technical. Yet in articles and discussions about this field, the two forms of output that get the most mentions are little more than forms of counting. The realm of analytics does not get stretched this liberally anywhere else.
What are these outputs and why do they pass the cut as “analytical” in connection with text? Let’s start with the output, and then move on to the reasons.
Perhaps the most ubiquitous representation of text analytics is an image showing words of various sizes stacked and piled into a shape. Most often this is a simple rectangle, although in Practical Text Analytics we have a striking example that looks like a highly distressed grand piano. This display is sometimes called a word cloud and sometimes a wordle.
For all its visual appeal, this display simply shows the relative frequencies of words in a document. Bigger words are more prevalent, smaller ones less so. While this is often passed off as a slick analytical feat, the only real trick is the computer’s skill at fitting the words together.
Additional computer processing is required in the background, starting with the removal of so-called stop words. These are the smaller, frequent words that typically do not bear the meaning of a document, but rather serve to stitch it together. In the last sentence, stop words would include “these,” “are,” “the,” “that,” “but,” “to,” and so on. The computer eliminates these by consulting a dictionary. This often is a simple text file that the user can modify. (We discuss the process of purging stop words in Chapter 2 of Practical Text Analytics.) Wordles also require the computer to do stemming, the process of removing tenses and regularizing endings so that different forms of the same word do not get counted as two distinct entities.
Some programs go further and perform lemmatization, which tries to recognize the part of speech of a word, so that (for instance) “busy” (the verb) does not get lumped in with “business” (the noun) or “busy” (the adjective). However, stop word removal, stemming and even lemmatization are computational problems, not analytical ones. In practice, you just ask for a wordle and one quickly appears. The computer may work hard, but you yourself can get by without any analytical thinking.
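The whole pipeline behind a wordle can be sketched in a few lines of Python. This is a toy illustration, not any particular product’s implementation: the stop-word list is a tiny stand-in for the user-editable dictionaries described above, and the suffix stripping is a crude stand-in for a real stemmer such as Porter’s.

```python
import re
from collections import Counter

# Toy stop-word list; real systems consult much larger, user-editable dictionaries.
STOP_WORDS = {"these", "are", "the", "that", "but", "to", "a",
              "of", "and", "it", "so", "do", "not", "in"}

def crude_stem(word):
    """Very crude stemming: strip a few common English suffixes.
    Real stemmers (e.g., Porter) apply many ordered rewrite rules."""
    for suffix in ("ing", "ed", "es", "s"):
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[:-len(suffix)]
    return word

def word_frequencies(text):
    """The counting at the heart of a word cloud:
    frequencies of stemmed, non-stop words."""
    tokens = re.findall(r"[a-z]+", text.lower())
    return Counter(crude_stem(t) for t in tokens if t not in STOP_WORDS)

freqs = word_frequencies(
    "The analysts counted words and counting words pleased the analysts.")
# "counted"/"counting" both become "count"; "the" and "and" are dropped.
```

Everything after this point is layout: scaling each surviving word by its count and packing the words into a shape.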
Sentiment analysis is the other often-mentioned form of output which is not quite analytical. As in the removal of stop words, this involves the use of a dictionary. And as in the creation of the wordle, it is centered on counting. That is, the analysis typically refers to dictionaries of positive and negative words and phrases. Looking at a document or set of documents, the computer counts the occurrences of each type. The balance of plus and minus gives the “sentiment score.” Dictionaries of sentiment words can be found online, and even downloaded for use. We mention a popular one in Chapter 5 of Practical Text Analytics, consisting of about 2000 positive phrases and about 6000 negative ones.
Once again, the computational challenges can be ferocious. Recapping an example in the book, you would make a mistake about sentiment if you just counted positive and negative words in this sentence: “My new SoggyOs breakfast food is so good, it makes Sorghum Sweeties taste bad by comparison, and Kardboard Krunchies taste terrible.” There are two negatives and one positive, so a simple count would peg this sentence as negative.
Yet clearly, the writer is singing the praises of his/her new breakfast-like substance (if somewhat idiosyncratically). Indeed, unusual locutions, neologisms, slang, sarcasm and awful sentence structure often prove highly confusing to computers. But resolving these requires better algorithms and faster machines, not more thought from the analyst.
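The failure is easy to reproduce. Here is a minimal sketch of dictionary-based sentiment counting, with toy positive and negative lists standing in for the dictionaries of thousands of phrases mentioned above:

```python
import re

# Toy dictionaries; real sentiment lexicons run to thousands of entries.
POSITIVE = {"good", "great", "excellent", "tasty"}
NEGATIVE = {"bad", "terrible", "awful", "poor"}

def naive_sentiment(text):
    """Positive count minus negative count.
    Blind to context, comparison, sarcasm and neologisms."""
    words = re.findall(r"[a-z]+", text.lower())
    pos = sum(w in POSITIVE for w in words)
    neg = sum(w in NEGATIVE for w in words)
    return pos - neg

sentence = ("My new SoggyOs breakfast food is so good, it makes Sorghum "
            "Sweeties taste bad by comparison, and Kardboard Krunchies "
            "taste terrible.")
score = naive_sentiment(sentence)
# One positive ("good") minus two negatives ("bad", "terrible") -> -1,
# so the counter flags a boast as a complaint.
```

The count is arithmetically correct and semantically wrong, which is exactly the point: better dictionaries and parsers can shrink this error, but the method remains counting.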
We can expect counting and enumeration to continue getting more accurate. But still, these will remain counting.
Now let’s go on to the reasons the bar for the “analytical” is set so low. These counting-based activities are so strongly identified as “analytical” in part because of the quick evolution of computers and their abilities. Not that many years ago, stemming and stop word removal were the height of high technology. Doing only that much with text constituted a remarkable feat of computation.
Still, things are moving very quickly and it is time that we expect more. Indeed, text analytics has already taken the next step, going from processing and enumerating to solving problems and forecasting outcomes so that actions can be changed. That is, we actually can move from the realm of summing up what is in text to the realm called the predictive.
Adoption of true analytical methods has been slow because this involves challenges. First and foremost, we need to think and to frame the right questions. For instance, we should not stop with simply asking, “What are my customers saying?” We can instead ask, “What are my customers saying that we can use to get them to behave differently?”
In Chapters 6, 7 and 8 of Practical Text Analytics, we show how verbatim text comments can provide insights that can lead to change—when linked with other information. We highlight three analytical methods (regression, classification tree analysis and Bayesian Networks) that do this.
The catch is that to get to predictive models, you do need this connection to other data, whether it is purchasing patterns, online behavior, or responses to a survey. Without an outcome variable—such as percent of business renewed, a product or service rating, or willingness to recommend—text cannot be used for prediction.
You may see broad patterns—and indeed text comments alone may give early warnings about an impending disaster—but you must combine other knowledge with text to get beyond counting and describing, and so get to changing behaviors in ways you want. You need to measure a behavior to change it. The effort required to merge text with additional information may be another reason so much of text analytics has remained not quite analytic.
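What that linkage buys you can be sketched with a toy example. The data below are invented for illustration, and the technique shown is deliberately simpler than the regression, tree and Bayesian-network methods the book covers: it just asks, for each word, how often comments containing that word ended in renewal. Even this crude step is only possible because each comment is paired with a measured outcome.

```python
from collections import defaultdict

# Hypothetical toy data: each customer comment paired with an outcome
# (1 = business renewed, 0 = not renewed). Predictive methods need
# exactly this linkage between text and a measured behavior.
data = [
    ("support was slow and unhelpful", 0),
    ("slow delivery ruined the launch", 0),
    ("great product and fast support", 1),
    ("fast setup and friendly support team", 1),
]

def renewal_rate_by_term(data):
    """For each word, the fraction of comments containing it
    that ended in renewal."""
    tallies = defaultdict(lambda: [0, 0])  # word -> [renewals, appearances]
    for text, outcome in data:
        for word in set(text.lower().split()):
            tallies[word][0] += outcome
            tallies[word][1] += 1
    return {w: renewed / total for w, (renewed, total) in tallies.items()}

rates = renewal_rate_by_term(data)
# In this toy sample, "slow" always precedes non-renewal (rate 0.0)
# and "fast" always precedes renewal (rate 1.0).
```

Without the outcome column, the same code could only report which words are frequent; with it, the words become candidate predictors of a behavior you can act on.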
About the Author: Dr. Steven Struhl has been involved in marketing science, statistics and psychology for over 25 years. Before founding Converge Analytic, he was Sr. Vice President at Total Research/Harris Interactive for 15 years, and earlier served as director of market analytics and new product development at statistical software maker SPSS. He brings a wealth of practical experience at the forefront of marketing science and data analytics. Steven is also known for academic work, having written a book on market segmentation and over 25 articles for academic and trade journals, and having taught both at the graduate level and in a research certification program. He is a regular speaker at trade conventions and seminars.
Save 20% when you buy Practical Text Analytics with code MKTPTAB