Text Analytics and Finding the Needle in the Haystack
13th July 2015 | Steven Struhl
Practical Text Analytics Author Steven Struhl Outlines How to Use Great Masses of Text Data in Guiding Behavior
If you want to find a needle in a haystack, according to an old joke, just sit down in the haystack. This speaks to the way things so often seem to go wrong, but it also has a second and subtler meaning: people harbor a deep-down wish for quick fixes, things that will happen fortuitously (even if they do the wrong thing) and still get them what they want. Sitting down and getting poked by a sharp object is clearly the wrong way to search, but it does find you that needle.
When we look at the field of text analytics, and the somewhat terrifying terrain of “big data,” this wish for a quick answer, something that happens almost by itself, is strongly visible. It has been this way ever since computers became big enough and fast enough to hold and process more data than we can easily imagine.
“If you just get more data, you will find something in it,” unfortunately, is an often-heard claim. With text, this is modified slightly, to something like, “If you just get enough commentary, then you will know what to do.” Sadly, this was false 25 or so years ago, back in the early days of data mining, and it remains just as wrong today. Those deeply held wishes we mentioned must be the reason so many people are still swayed by these claims.
In fact, it was definitively proven about 80 years ago that more data is not always better. Before the 1936 US presidential election, a magazine called Literary Digest conducted the largest pre-election poll that had ever been done. They sent out ten million letters asking prospective voters who they would choose for president. This represented about 25% of the US voting population at the time. The magazine crowed that, with this many people polled, their prediction would be accurate to within tenths of a percent.
They got back 2.4 million responses and called the election for a Mr. Alf Landon, saying he would beat the slightly better-known Franklin Delano Roosevelt by a margin of 57% to 43%. As it turned out, Roosevelt won 62% to 38%.
This is said to be the biggest error ever in a large public poll. Meanwhile, another familiar figure, Mr. George Gallup (of the eponymous Gallup Poll) used a much smaller sample and got the election results right.
How did this happen?
The answer is that, while the Digest did get 2.4 million people, these were the wrong people to use for calling an election. They used lists from telephone directories, club memberships and magazine subscriptions. However, in 1936, a telephone was still something of a luxury item (only about 40% of US households had one), and in the cash-strapped years of the Great Depression, only the relatively affluent could afford club memberships or magazine subscriptions.
Their poll missed the vast majority of less fortunate and less privileged Americans, who turned out to vote for Roosevelt in droves. Gallup took a more scientific approach, making sure to represent the entire population more accurately. He got the election results right even with a far smaller sample (about 50,000, which is large by present-day polling standards but was likely needed then because forecasting methods were less sophisticated).
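The Literary Digest lesson can be shown numerically. The sketch below is a hypothetical simulation, not a reconstruction of the actual 1936 data: it assumes a made-up electorate in which affluent voters (a minority) lean one way and everyone else leans the other. A huge sample drawn only from the affluent "directory" group lands far from the truth, while a far smaller random sample lands close to it.

```python
import random

random.seed(36)  # fixed seed so the sketch is reproducible

# Hypothetical electorate: ~62% overall support for candidate A,
# but the affluent 30% (phone/club/magazine lists) lean against A.
N = 1_000_000
population = []  # list of (is_affluent, votes_for_a) pairs
for _ in range(N):
    affluent = random.random() < 0.30
    if affluent:
        votes_a = random.random() < 0.35   # affluent: 35% for A
    else:
        votes_a = random.random() < 0.74   # everyone else: 74% for A
    population.append((affluent, votes_a))
# Overall support: 0.30 * 0.35 + 0.70 * 0.74 = 0.623

def share(sample):
    """Fraction of a sample voting for candidate A."""
    return sum(sample) / len(sample)

# "Literary Digest" style: an enormous sample, but drawn only
# from the affluent sampling frame.
affluent_frame = [v for aff, v in population if aff]
big_biased = random.sample(affluent_frame, 200_000)

# "Gallup" style: a small but representative random sample.
small_random = random.sample([v for _, v in population], 2_000)

true_support = share([v for _, v in population])
print(f"True support:        {true_support:.3f}")        # near 0.62
print(f"Huge biased sample:  {share(big_biased):.3f}")   # near 0.35
print(f"Small random sample: {share(small_random):.3f}") # near 0.62
```

The percentages, group sizes, and sample sizes here are illustrative assumptions; the point is only that the sampling frame, not the sample size, determines whether the estimate is anywhere near the truth.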
The moral is clear: More data is not always better. To return briefly to the haystack, simply adding more hay does not make it easier to find the needle, but rather makes it harder.
Similarly, with text, getting hold of a wealth of online commentary does not mean you have found what you need. You may find thousands of comments on a topic, but that does not make them the right comments on which to base your actions. For instance, it is now estimated that some 200 billion tweets are posted to Twitter each year. Clearly, one is not required to invest much mental, physical or emotional effort to post a tweet. Those tweeting away on a topic of great interest to you may, on average, have very low levels of involvement with it.
And, as our story about the presidential poll illustrated, you can go terribly wrong by acting on the opinions of the wrong people. For instance, the person complaining so bitterly about your fine alcoholic beverage product could be a bored twelve-year-old in Buzzard Flats, Nevada.
It is not even clear if a storm of well-deserved online outrage causes any lasting damage. In Practical Text Analytics we mentioned an incident in which Domino’s Pizza suffered from online videos posted by two employees doing things with the food that you honestly do not want to see, and unfortunately cannot un-see once you have viewed them. The Web buzzed with outrage, but Domino’s took quick action, and after a minor setback, their stock rose rapidly—over several years to some seven times its value at the time the videos were posted.
Clearly, Domino’s did the right thing in noticing and facing this online maelstrom—but it is also apparent that the negative commentary did little to influence the typical behavior of their customers. And indeed, while there appears to have been nothing in print about customers’ responses to these damaging videos, their actions speak very clearly.
Most companies go directly to their customers and prospects after a public relations disaster, and it would be very surprising if Domino’s did not do this to help shape their responses. Most typically, questioning would be done with a survey, and the sample would number in the thousands or even the hundreds: hardly big data.
Perhaps as tellingly, a major social network with which I have worked extensively relies on focused direct questioning when it wants to improve its service offerings. It has the opportunity to listen passively to billions of public comments, but when it wants to influence change, it asks about what it needs to know. Rather than flailing around in a huge haystack, it goes directly to the information it needs.
This smart practice, taken together with the Domino’s story, underlines the correct use for great masses of text data in guiding behavior. Monitor it to keep alert to any whiff of an oncoming problem. But if something appears on the horizon, do not keep gathering more in hopes that this will show you what to do. Rather, focus your efforts, zero in on a manageable sample, and ask them the questions to which you need answers. That will find you your needle.
About the Author: Dr. Steven Struhl has been involved in marketing science, statistics and psychology for over 25 years. Before founding Converge Analytic, he was Senior Vice President at Total Research/Harris Interactive for 15 years, and earlier served as director of market analytics and new product development at statistical software maker SPSS. He brings a wealth of practical experience at the forefront of marketing science and data analytics. Steven is also known for academic work, having written a book on market segmentation and over 25 articles for academic and trade journals, and having taught at the graduate level and in a research certification program. He is a regular speaker at trade conventions and seminars.
Save 20% when you order Practical Text Analytics with discount code MKTPTAB