Unstructured text represents the largest pool of data available to enterprises, yet most are unable to turn the vast amounts they collect into meaningful insight. Up to 80 percent of the data available to enterprises is unstructured, and it comes in a variety of forms, such as intellectual property, financial statements, CRM notes, news, analyst reports and social media posts. Analyzed correctly, this data can yield knowledge on everything from customer sentiment to service-level optimization. With the right tools, businesses can implement a wide range of applications that draw on past experience to make better business decisions in the future.
Enterprises can realize the true potential of their unstructured text data by employing machine-learning models. Trained on the appropriate data, such a model can streamline business processes and decision making. However, creating an appropriate training set for a given machine-learning problem is easier said than done.
Take, for instance, a recent case study: training supervised machine-learning models to classify tweets when no prior training sets exist.
At Skytree, we were presented with the problem of tagging tweets by the categories they belonged to. At first glance, this seemed like a classic text-categorization problem, but a deeper look at the data revealed several major challenges. Tweets are very short and carry little lexical signal, and the copious use of alternate spellings and abbreviations adds noise to the data. To further complicate the task, there were no existing training sets for the target categories. One possible approach, without machine learning, would have been to manually create keyword sets targeting each category. This is undesirable, however: manually curating keyword sets is time-consuming and produces inadequate results, because the keywords are typically both incomplete, missing potential matches (false negatives), and ambiguous, flagging irrelevant tweets (false positives). The task also required a high-precision model that could be deployed to production quickly, and it had to be adaptive so that it would improve with feedback over time.
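To make those failure modes concrete, here is a minimal sketch in Python, with a hypothetical hand-built keyword set, of how naive keyword matching goes wrong in both directions:

```python
keywords = {"nba", "warriors", "lakers"}  # hypothetical hand-curated set

tweets = [
    "Warriors of the ancient realm unite!",  # ambiguous: matched, but not about the NBA
    "Dubs pull off another comeback win",    # incomplete: about the NBA, but no keyword
]
for tweet in tweets:
    matched = any(kw in tweet.lower() for kw in keywords)
    print(f"{matched}\t{tweet}")
```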
To create a seed high-precision model, a training set was still needed: a set of tweets labeled as positive examples of a category and a set labeled as negative examples. Instead of creating the training set manually, we approached the task from a different angle. Starting from a seed Wikipedia article, we traversed the knowledge base's link graph and collected article titles up to a specified depth, using them as keywords. For example, to train a classifier for the category “NBA,” we would start the traversal at the article NBA and collect titles including the names of NBA teams, players, stadiums and so on. Tweets containing any of the keywords were labeled positive and the rest negative. This labeling rule is obviously imperfect, but the rationale was that with a large enough training set, we could build a classifier that generalizes beyond the initial keywords. After creating the training set this way, we trained a machine-learning model tuned for high precision and ran the resulting model on new and completely unseen tweets. As hoped, the resulting model did exhibit very high precision on new data: almost every tweet that it labeled as positive was indeed positive.
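The whole pipeline can be sketched in a few lines. The snippet below is a minimal illustration, not Skytree's production system: it substitutes a toy in-memory dictionary (WIKI_LINKS) for Wikipedia's real link graph and assumes scikit-learn's TfidfVectorizer and LogisticRegression as stand-in components.

```python
from collections import deque

from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression

# Toy stand-in for Wikipedia's article-link graph; the real pipeline
# would traverse the live knowledge base.
WIKI_LINKS = {
    "NBA": ["Los Angeles Lakers", "Boston Celtics", "Madison Square Garden"],
    "Los Angeles Lakers": ["LeBron James", "Staples Center"],
    "Boston Celtics": ["Jayson Tatum"],
}

def collect_keywords(seed, max_depth=2):
    """Breadth-first traversal from the seed article, collecting titles."""
    keywords, frontier = {seed.lower()}, deque([(seed, 0)])
    while frontier:
        title, depth = frontier.popleft()
        if depth >= max_depth:
            continue
        for linked in WIKI_LINKS.get(title, []):
            if linked.lower() not in keywords:
                keywords.add(linked.lower())
                frontier.append((linked, depth + 1))
    return keywords

def weak_labels(tweets, keywords):
    """Label a tweet positive iff it contains any keyword."""
    return [int(any(kw in t.lower() for kw in keywords)) for t in tweets]

tweets = [
    "LeBron James drops 40 at Staples Center tonight",
    "Celtics clinch the East behind Jayson Tatum",
    "New banana bread recipe, so good",
    "Traffic on the 405 is brutal today",
]
keywords = collect_keywords("NBA")
vec = TfidfVectorizer(ngram_range=(1, 2))
clf = LogisticRegression().fit(vec.fit_transform(tweets),
                               weak_labels(tweets, keywords))

# Tune for precision: accept only confident positives.
probs = clf.predict_proba(vec.transform(["Tatum and the Celtics again!"]))[:, 1]
print(probs >= 0.7)  # raising the threshold trades recall for precision
```

Raising the probability threshold is one simple way to trade recall for precision, which fits the requirement that nearly every positive prediction be correct.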
The question, however, was whether this model was any better than the initial keyword set. To answer it, we trained models on a topic, withheld one or more keywords and checked whether the classifier would label tweets containing those keywords, and no other keyword, as positive. For example, we trained a classifier for the category “NBA” without using the word “NBA” as a keyword, and a classifier for the category “sports” without using the keyword “baseball.” In all these cases, the classifier was able to recover relevant tweets containing the omitted keywords, and more. Additionally, these initial models had sufficient precision and recall to use in production.
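The holdout check itself is straightforward to express. The sketch below again assumes scikit-learn; keyword_holdout_eval and weak_labels are hypothetical helpers. It withholds one keyword during weak labeling, then counts how many unseen tweets matching only that keyword the model still recovers. On a toy corpus this small the model has little signal to generalize from; the effect described above only emerges at scale.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression

def weak_labels(tweets, keywords):
    """Label a tweet positive iff it contains any keyword."""
    return [int(any(kw in t.lower() for kw in keywords)) for t in tweets]

def keyword_holdout_eval(train_tweets, unseen_tweets, keywords, held_out):
    """Train with `held_out` removed from the keyword set, then check how
    many unseen tweets matching only that keyword the model recovers."""
    reduced = set(keywords) - {held_out}
    vec = TfidfVectorizer(ngram_range=(1, 2))
    clf = LogisticRegression().fit(
        vec.fit_transform(train_tweets), weak_labels(train_tweets, reduced))
    # Probe tweets contain the withheld keyword and no other keyword,
    # so the weak-labeling rule alone could never flag them.
    probes = [t for t in unseen_tweets
              if held_out in t.lower()
              and not any(kw in t.lower() for kw in reduced)]
    preds = clf.predict(vec.transform(probes)) if probes else []
    print(f"recovered {sum(preds)}/{len(probes)} tweets "
          f"matching only '{held_out}'")

train = [
    "LeBron James drops 40 at Staples Center tonight",
    "Celtics clinch the East behind Jayson Tatum",
    "New banana bread recipe, so good",
    "Traffic on the 405 is brutal today",
]
unseen = ["NBA finals tip off tonight", "Remember to water the plants"]
keyword_holdout_eval(train, unseen,
                     {"nba", "lebron james", "jayson tatum"}, "nba")
```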
The results of this exercise would not have been possible without a large dataset. The data provided the machine-learning algorithm with enough linguistic variation and related lexical patterns to pick up additional reliable signals.
For unstructured-text tasks on large datasets such as tweets, manually curating keywords and labels would take a lifetime. External knowledge sources such as Wikipedia can bootstrap the learning process and aid in the curation of training data when the task at hand requires a vast amount of world knowledge otherwise inaccessible to machine-learning systems. By pulling in large unstructured text datasets to create training sets, machine learning can distinguish signal from noise. The key to deriving strong value from unstructured text is to approach the task with what is already available, rather than manually annotating training data from the ground up.