The Future of ITR: Normalizing Text


It’s not news to anyone who communicates via text-based web or phone channels that these environments have developed a dialect of their own; “textspeak” often involves deliberate modifications to the spelling and grammatical standards of natural language, and has spawned entire domains of linguistic research. The phenomenon, however, is hardly new to human communication. Morse code operators, in the interest of economizing their keystrokes, also developed a shorthand, which can still be observed today by listening in on a conversation between any two ham operators. For example:

NC1M DE AA1JD    GA DR OM UR RST 5NN HR     QTH TIMBUKTU     OP IS MATT     HW? NC1M DE AA1JD KN

Translation: To NC1M from AA1JD: Good afternoon, dear old man. You are RST 599 here (signal is very readable (5) and very strong (9), with very good tone (9)). I’m located in Timbuktu. The operator’s name is Matt. How do you copy? To NC1M from AA1JD, listening for response from a specific station.

Multiple publications have claimed that around 15% of the words occurring in SMS and Twitter text are not in the dictionary, which is a significant problem for automated Natural Language Understanding (NLU) tools. Furthermore, while there will always be the “fat fingers” problem of simple unintentional typos, many of these spelling variants are intentional; for reasons of brevity, style, or emphasis, texters deviate from standard spellings with acronyms, shortened/simplified versions of words, and even lengthened variations. One study (Bieswanger, 2007) found an average of 5.5 shortened spellings per text message sent by English-speaking millennials. This poses a serious challenge to the ITR domain; user adoption will suffer if the system understands only perfect spelling.

So, how do we “fix” the spelling? While there are many resources listing common slang and abbreviations, experiments have shown that their coverage isn’t enough (cf. Han et al., 2013). This means that we need to augment our slang dictionary with a reliable way to transform unrecognized text into standard spelling so that its meaning can be understood. Most approaches normalize text by generating potential respelling candidates and ranking them from most likely to least likely in order to pick the top one. In typo correction, respelling candidates are generated through knowledge of how fingers or memories typically stumble. Thankfully, we can do the same with intentional spelling variants; some studies (cf. Beckley, 2015) have shown that if you can describe the systematic ways people typically modify words in text-based communication, you can use this to generate and rank respelling possibilities effectively. Combined with the differentiation enabled by understanding the syntactic and semantic context of the word, we are on our way toward achieving the goal of a robust dialogue where users can express themselves naturally.
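The generate-and-rank pipeline described above can be sketched in a few lines of Python. This is a toy illustration, not any published system: the lexicon, slang dictionary, frequency counts, and the two candidate-generation rules (slang lookup and collapsing repeated letters) are all hypothetical stand-ins for the large, corpus-derived resources a real normalizer would use.

```python
import re
from difflib import SequenceMatcher

# Toy resources for illustration only -- real systems use large lexicons,
# slang dictionaries, and corpus-derived word frequencies.
LEXICON = {"good", "great", "see", "you", "tomorrow", "later", "talk", "to"}
SLANG = {"gr8": "great", "2moro": "tomorrow", "l8r": "later", "u": "you"}
FREQ = {"good": 1000, "great": 800, "see": 900, "you": 2000,
        "tomorrow": 400, "later": 500, "talk": 600, "to": 3000}

def candidates(token):
    """Generate standard-spelling candidates for an out-of-vocabulary token."""
    cands = set()
    # Rule 1: direct lookup in the slang/abbreviation dictionary.
    if token in SLANG:
        cands.add(SLANG[token])
    # Rule 2: collapse runs of 3+ repeated letters ("goooood") down to
    # one or two, keeping only results that land in the lexicon.
    for repl in (r"\1", r"\1\1"):
        squeezed = re.sub(r"(.)\1{2,}", repl, token)
        if squeezed in LEXICON:
            cands.add(squeezed)
    return cands

def normalize(token):
    """Pick the most likely standard spelling, or leave the token unchanged."""
    if token in LEXICON:
        return token
    cands = candidates(token)
    if not cands:
        return token
    # Rank: prefer higher corpus frequency, break ties by string similarity.
    return max(cands, key=lambda c: (FREQ.get(c, 0),
                                     SequenceMatcher(None, token, c).ratio()))

print([normalize(t) for t in "see u l8r goooood talk 2moro".split()])
# -> ['see', 'you', 'later', 'good', 'talk', 'tomorrow']
```

A production system would add many more generation rules (phonetic respellings, digit substitutions, keyboard-distance typos) and a context-aware ranker, but the shape — generate candidates, score them, take the top one — is the same.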

References:

Morse code example modified from Wikipedia https://en.wikipedia.org/wiki/Morse_code_abbreviations
with thanks to NC1M for additional consultation.

Beckley, R. (2015). Bekli: A Simple Approach to Twitter Text Normalization. Proceedings of the ACL 2015 Workshop on Noisy User-generated Text, pages 82–86, Beijing, China.

http://www.aclweb.org/anthology/W/W15/W15-43.pdf#page=94

Bieswanger, M. (2007). 2 abbrevi8 or not 2 abbrevi8: A Contrastive Analysis of Different Space- and Time-Saving Strategies in English and German Text Messages. Texas Linguistic Forum, Vol. 50.

http://studentorgs.utexas.edu/salsa/proceedings/2006/Bieswanger.pdf

Han, B., Cook, P., and Baldwin, T. (2013). Lexical normalisation for social media text. ACM Transactions on Intelligent Systems and Technology, 4(1):5:1–5:27.

http://people.eng.unimelb.edu.au/tbaldwin/pubs/tist2013-lexnorm.pdf
