How come my Nokia text messaging programme prefers to turn my pressing of 782 into ‘rub’ first and only then offers me ‘pub’? Surely, the dictionaries which predictive texting programmes use can be improved. The only question is how. Psycholinguistics may have found the answer on the television screen.
The basis on which the initial texting dictionaries were compiled has not been disclosed. Still, the order in which alternatives are offered suggests that it is out of date. The chart below (word frequency on the vertical axis, time on the horizontal) indicates why ‘rub’ is offered first and ‘pub’ only thereafter: that is how their respective frequencies used to be ordered. Google’s Ngram Viewer, however, suggests that by 1980 ‘pub’ had overtaken ‘rub’. Other such word pairs also switched their order of use during the twentieth century: ‘boy’/‘box’ (1990), ‘rope’/‘pose’ (1983), ‘lord’/‘lose’ (1904), ‘Ford’/‘dose’ (1951).
Still, predictive texting faces not only the challenge of how to order suggestions. It also has to limit the choices it offers, for memory and usability reasons. For example, my mobile phone knows ‘afores’ for 236737 but not ‘adorer’. The challenge of which words to include is faced by all dictionaries. The Oxford Dictionary (LINK) uses a huge collection of texts called the Oxford English Corpus and includes new words if there is evidence that they are significant or important. I suspect that the most crucial criterion is simply how often a word is used.
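To make the mechanics concrete, here is a minimal Python sketch of how a T9-style engine might map key presses to candidate words ranked by frequency. The keypad layout is the standard phone one; the tiny dictionary and its frequency counts are invented purely for illustration, not taken from any real texting programme.

```python
from collections import defaultdict

# Standard phone keypad: each digit covers several letters.
KEYPAD = {
    '2': 'abc', '3': 'def', '4': 'ghi', '5': 'jkl',
    '6': 'mno', '7': 'pqrs', '8': 'tuv', '9': 'wxyz',
}
LETTER_TO_DIGIT = {l: d for d, letters in KEYPAD.items() for l in letters}

def encode(word):
    """Map a word to its digit sequence, e.g. 'pub' -> '782'."""
    return ''.join(LETTER_TO_DIGIT[c] for c in word.lower())

# Toy frequency dictionary (counts are made up for this example).
FREQ = {'rub': 120, 'pub': 340, 'sub': 90, 'adorer': 15, 'afores': 2}

# Index words by the digit sequence they share.
INDEX = defaultdict(list)
for w in FREQ:
    INDEX[encode(w)].append(w)

def suggest(digits):
    """Return candidate words for a digit sequence, most frequent first."""
    return sorted(INDEX.get(digits, []), key=lambda w: -FREQ[w])

print(suggest('782'))     # with these toy counts: ['pub', 'rub', 'sub']
print(suggest('236737'))  # ['adorer', 'afores']
```

The point of the sketch is that the ordering of suggestions falls entirely out of the frequency table: swap in a different corpus’s counts and ‘rub’ and ‘pub’ trade places.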
But which corpus is best for determining that? Surely, for text messaging one would prefer a corpus made up only of text messages. In the absence of such corpora for all the languages in which text messaging is used, one may turn to big corpora under the assumption that size matters, or to corpora of sources users are likely to read, e.g. the internet. But how do you arbitrate between these options? Ideally, one would simply try out different approaches in real-world text messaging and see how they perform. For the moment, however, it is easier to turn to what we already know from psycholinguistic research about word frequencies.
Psycholinguists have been interested in word frequencies for a long time because they account for up to 40% of the variance in reading speed, as measured by how long it takes people to judge whether a letter string is a word or not (Brysbaert et al., 2011). Given that reading speed is often taken as a proxy for how words are represented in the mental lexicon, controlling for frequency is now standard practice in psycholinguistics. Recently, Brysbaert and colleagues (2011) pitted different corpora against each other in order to see which one accounts best for reading speed. Google’s scanning of books yielded the biggest corpus in the study, but it did not perform best. Even limiting the Google corpus to more recent books did not improve its performance enough to outperform its unlikely rival, one that is 99.9% smaller: subtitles.
Whether in English, French, German or Chinese, word frequencies based on subtitles outperform their rivals based on books. They account for more variance in word-judgement times both in typical young research participants and in older adults in their seventies (Brysbaert et al., 2011). Furthermore, corpora based on subtitles have the added benefit of being easy to compile and update, and they could be made available for all televised languages.
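Part of why subtitle corpora are easy to compile is that subtitle files are plain text with a simple, predictable structure. The sketch below counts word frequencies from lines in the common .srt format; the example subtitles and the minimal tokenisation are my own for illustration. Real subtitle corpora such as those used by Brysbaert and colleagues involve far more cleanup.

```python
import re
from collections import Counter

def subtitle_frequencies(lines):
    """Count word occurrences in subtitle lines, skipping .srt
    cue numbers and timestamp lines.  The tokenisation here is
    deliberately minimal."""
    counts = Counter()
    for line in lines:
        # Skip cue numbers ("1") and timing lines
        # ("00:00:01,000 --> 00:00:03,000").
        if line.strip().isdigit() or '-->' in line:
            continue
        counts.update(re.findall(r"[a-z']+", line.lower()))
    return counts

# A made-up two-cue subtitle file:
srt = [
    "1",
    "00:00:01,000 --> 00:00:03,000",
    "Meet me at the pub.",
    "2",
    "00:00:03,500 --> 00:00:05,000",
    "The pub? Again?",
]
freqs = subtitle_frequencies(srt)
print(freqs['pub'])  # 2
```

A few hundred films processed this way already yield counts for tens of thousands of word types, which is what makes a corpus 99.9% smaller than Google Books competitive.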
Too many people waste too much time correcting predicted words for the need for better dictionaries to be ignored. Given the strong psycholinguistic support for subtitle-based word frequencies, it would be worth trying them out in predictive texting programmes. Perhaps this way ‘rub’ and ‘pub’ will be offered in the right order, and 236737 will no longer be an ‘afores’ but instead an ‘adorer’.
Brysbaert, M., Keuleers, E., & New, B. (2011). Assessing the usefulness of Google Books’ word frequencies for psycholinguistic research on word processing. Frontiers in Psychology, 2, 1-8. doi: 10.3389/fpsyg.2011.00027