Data Doesn’t Lie – But it Doesn’t Tell the Whole Truth


The recent Scientific American article “How a Machine Learns Prejudice” provides an excellent wake-up call for the technology world in which the buzzwords “Big Data” and “Machine Learning” are flying about connected to grandiose promises, as if they were the passwords to finally opening up the vault of Truth. Here we finally have the tools to create Artificial Intelligence that is not based on the manual encoding of biased human judgments but is instead based on the actual raw data, and the data does not lie.

Or does it?

As with most real-world questions, the answer is both yes and no. The data does not lie – but it may not tell the story you think it does. The ways in which we collect data can still unconsciously impart human biases, notes author Jesse Emspak. For example, medical studies and clinical trials have long been criticized because of the overwhelming representation of white males among study subjects. This leads to conclusions that overlook the different reactions different bodies may have – or the fact, for example, that women with cardiovascular problems might present different symptoms than men do.

Sometimes, as Kate Crawford opined in a related New York Times article, science can be stuck in the “white guy problem,” a self-perpetuating culture in which achieving diverse representation is hindered because those already in positions of power are simply unaware that everyone looks like them. Studies have shown that diverse engineering teams are more creative problem solvers than non-diverse teams (see this article by Forbes and this one by the American Society of Mechanical Engineers), and yet women still make up a tiny fraction of our industry. When those non-diverse teams turn their attention to designing solutions or creating interfaces, they may fail to realize that their solutions take only their own needs into account, or that their interfaces succeed only when the test population resembles the creators.

Likewise, running a Machine Learning algorithm on a set of data is a powerful tool to bring out the patterns and facts from that data – but it also runs the risk of “over-fitting,” where it can learn patterns that are specific to that data and not generalizable to other data. From the apocryphal story about tanks and neural nets (which I, in my turn, made the mistake of perpetuating when I was teaching A.I.) to the recent struggles Crawford mentions with processing non-Caucasian facial features, the data scientist must actively engage in an effort to make certain the data reflects the whole story and not just his perspective on it.
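The failure described above, a model latching onto a pattern that belongs to the training data rather than to the problem, can be sketched in a few lines of Python. This is only a toy loosely modeled on the tank story: the brightness values, the data sets, and the function names are all hypothetical, and the "learner" is a simple threshold rather than a neural net.

```python
# Hypothetical toy data: each example is (brightness, has_tank).
# In the biased training set, every tank photo happens to be dark.
train = [(0.2, True), (0.3, True), (0.25, True),
         (0.8, False), (0.9, False), (0.85, False)]

# New photos in which lighting and tanks are unrelated.
test = [(0.2, False), (0.9, True), (0.3, True), (0.8, False)]

def fit_threshold(examples):
    """Learn a brightness cutoff midway between the two class means."""
    tank = [b for b, y in examples if y]
    no_tank = [b for b, y in examples if not y]
    return (sum(tank) / len(tank) + sum(no_tank) / len(no_tank)) / 2

def predict(threshold, brightness):
    """Predict 'tank' whenever the photo is darker than the cutoff."""
    return brightness < threshold

def accuracy(threshold, examples):
    """Fraction of examples whose prediction matches the label."""
    correct = sum(predict(threshold, b) == y for b, y in examples)
    return correct / len(examples)

threshold = fit_threshold(train)
train_acc = accuracy(threshold, train)
test_acc = accuracy(threshold, test)
print(f"train accuracy: {train_acc:.2f}")
print(f"test accuracy:  {test_acc:.2f}")
```

The model scores perfectly on the photos it was trained on and no better than a coin flip on the new ones, because what it learned was the lighting, not the tanks.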

The television show “Burnistoun” lampooned this problem in a scene where two Scots try to get a voice-activated elevator to go to the 11th floor. “You need to try an American accent,” says one of the characters when they continually fail. This is because many voice-recognition systems are the result of using Machine Learning on a large set of example sentences spoken by Americans; they learn a model of how Americans pronounce something, which does not generalize to other accents.

In my own area of science, gathering diverse textual linguistic data is just as challenging. It is a key step in creating a chatbot or virtual assistant, but a single developer entering sentences into a chatbot toolkit will create a bot that can answer questions posed by that developer – but not questions from someone who expresses herself in very different ways. Diverse linguistic data sets are difficult to gather but vitally important for creating general-use bots that reflect the broad diversity of global linguistic patterns.

The saying “Garbage In, Garbage Out” is as old as the digital computer, and yet it is just as true in the age of Machine Learning as it ever has been. Microsoft’s Tay chatbot fiasco was perhaps a very explicit example of this; once the Twitterverse realized that she was learning word patterns from the tweets directed at her in order to construct her own, natural-sounding utterances, they fed her a diet heavy on racism and inflammatory language – a Tweetstorm of vindictiveness – resulting in similar tweets of garbage. But most garbage is unintentional, and it is only through vigilance – and explicit work towards the inclusion of diverse sources and voices – that our industry will avoid these errors.
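To make “Garbage In, Garbage Out” concrete, here is a minimal sketch of a bot that learns word patterns from the messages sent to it. It is not how Tay actually worked (those details are not given here); it is a hypothetical bigram model, with made-up function names and a made-up handful of messages, that always picks the most frequent next word.

```python
from collections import Counter, defaultdict

def train_bigrams(sentences):
    """Count which word follows which across all training sentences."""
    following = defaultdict(Counter)
    for sentence in sentences:
        words = sentence.lower().split()
        for current, nxt in zip(words, words[1:]):
            following[current][nxt] += 1
    return following

def generate(following, start, max_words=5):
    """Build an utterance by always taking the most frequent next word."""
    words = [start]
    while len(words) < max_words and words[-1] in following:
        words.append(following[words[-1]].most_common(1)[0][0])
    return " ".join(words)

# Hypothetical input: one friendly message drowned out by coordinated junk.
tweets = ["you are kind", "you are awful", "you are awful", "you are awful"]
model = train_bigrams(tweets)
reply = generate(model, "you")
print(reply)
```

The bot has no opinion of its own; it simply echoes whichever pattern dominates its input, so whoever supplies the most data decides what it says.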


Lisa Michaud

Lisa Michaud is a Data Architect on the Enterprise Architecture team at Aspect. She has 20 years of research experience in the field of Natural Language Processing / Computational Linguistics and pursues diverse interests in user modeling, dialogue, parsing, generation, and the analysis of non-grammatical text. She holds a PhD in Computer Science and has been published in multiple international journals, workshops, and conferences in the fields of user-adaptive interaction and Computational Linguistics.