Michael Rundell shows that word-frequency data is relevant to teaching and learning languages.


Michael Rundell has been a lexicographer since 1980. He is Editor-in-Chief of the Macmillan family of learner’s dictionaries, notably the two editions of the Macmillan English Dictionary for Advanced Learners (2002, 2007). He has had a leading role in the design and development of English corpora, including the British National Corpus, and has published widely in the field of corpus-based language study. He is one of the directors of the Lexicography MasterClass (www.lexmasterclass.com), an independent company that runs dictionary projects and provides training courses in lexicography and lexical computing.

The following article is reproduced with the kind permission of English Teaching professional magazine.

The 'rules'

Language is a living, creative force, and words are capable of infinite nuance and variation. So at first sight, it may seem a rather dreary enterprise to reduce our communicative behaviour to the level of dry statistics. In fact, however, we can learn a lot about language from frequency data. Almost every aspect of our linguistic output is to some degree ‘rule-governed’ – if we use the term ‘rule’ to refer to a regular pattern of behaviour that can be observed when we look at language in use, rather than to prescriptions and prohibitions handed down from above. Investigating language by analysing large corpora (or data on the web, for that matter) has never been easier [1]: computers are not very intelligent, but counting is one thing they are really good at. And all this exploration is revealing facts about language whose existence we never even guessed.

Some ‘rules’ are familiar and easy to identify. Much of what we call grammar boils down to statistically significant tendencies for words to change their form or combine with other words in particular ways. But many other properties of language are statistically marked too. There are some words, for example, whose collocational networks are so stable and predictable that the word cannot really be ‘known’ unless we also know the set of words that regularly occur alongside it. Take the phrasal verb build up. When it is used transitively, the contexts are almost always positive: you can build up a successful business, a reputation, valuable experience, reserves of cash, and so on. We know this because good corpus data, allied to the right software, allows us to observe that contexts like this occur repeatedly in all sorts of texts, and it is very rare to find cases of people ‘building up’ something bad or unwanted.

Oddly, however, the same verb behaves quite differently when intransitive. Here it almost always describes the gradual accumulation of some undesirable quality or substance: anger, frustration and resentment build up, and so do traffic, pollution and tooth decay. Fluent speakers make these linguistic choices intuitively and unconsciously, and it is unlikely that most of us could articulate our reasons for choosing one set of collocates rather than another. But few language learners have the luxury of relying on intuition: if they are to assimilate information of this type, it needs to be explicitly described. This is what dictionaries and other reference resources are now able to do, thanks essentially to the software we use for analysing the language in large text corpora.


Word frequency

One of the most useful things to know about a word is its frequency. It may be interesting to learn that the word esurient means the same as hungry, but the other vital fact about this word is that it is extremely rare – so rare, in fact, that most native-speakers probably don’t know it. It appears just once in the 100-million-word British National Corpus (compared with about 1,900 occurrences of hungry), while a search on Google produces an even more extreme ‘hungry:esurient ratio’ of about 4,000:1. It is all very well knowing what esurient means, but the word isn’t much use to you if no one understands you when you say it.
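
To make the comparison concrete, the raw counts quoted above can be normalised to occurrences per million words, a common way of comparing frequencies across corpora of different sizes. The following is a minimal sketch in Python; the function name `per_million` is an illustrative assumption, and the counts are simply the approximate British National Corpus figures given in the text.

```python
# Approximate figures quoted in the text for the 100-million-word
# British National Corpus.
CORPUS_SIZE = 100_000_000
counts = {"hungry": 1900, "esurient": 1}


def per_million(count, corpus_size=CORPUS_SIZE):
    """Normalise a raw count to occurrences per million words.

    The name and signature are illustrative, not taken from any corpus tool.
    """
    return count * 1_000_000 / corpus_size


for word, count in counts.items():
    print(f"{word}: {per_million(count):.2f} per million words")

# Ratio of the two raw counts within the same corpus.
ratio = counts["hungry"] / counts["esurient"]
print(f"hungry:esurient ratio is about {ratio:,.0f}:1")
```

On these figures, hungry occurs about 19 times per million words, whereas esurient occurs about once per hundred million.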

At the other extreme, there is a relatively small number of very common words, and we now know (statistics again) that most of what we say and write is made up of this ‘core’ vocabulary, recycled again and again. An educated native-speaker may know more than 50,000 English words, but most of these are used very infrequently, while over 90% of our language output consists of the 7,000 commonest words. Even more strikingly, about 50% of our output is made up of the most frequent 100 words.
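
Coverage figures of this kind come from counting every word in a corpus, ranking the words by frequency, and asking what share of the running text the top-ranked words account for. The sketch below shows the idea on a toy sentence; the function name `coverage`, the crude tokenisation and the sample text are all assumptions for illustration, and a real study would use a large corpus and proper lemmatisation.

```python
import re
from collections import Counter


def coverage(text, top_n):
    """Return the share of running words covered by the top_n commonest words.

    Tokenisation here is a crude lower-cased regular expression, used only
    for illustration.
    """
    tokens = re.findall(r"[a-z']+", text.lower())
    if not tokens:
        return 0.0
    counts = Counter(tokens)
    covered = sum(count for _, count in counts.most_common(top_n))
    return covered / len(tokens)


# Toy sample, invented for this example.
sample = "the cat sat on the mat and the dog sat on the rug"

for n in (1, 3):
    print(f"top {n} word(s) cover {coverage(sample, n):.0%} of the text")
```

Even in this tiny sample, the three commonest words account for well over half of the running text, which is the same effect that the corpus statistics show on a much larger scale.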

It follows, logically, that very common words are more ‘worth knowing’ than words that are used only rarely: knowing a word like hungry contributes much more to your ability to communicate than learning a word like esurient.

The general idea was well understood long before we had access to word-frequency data. From Ogden and Richards’ ‘Basic English’ (a language-learning programme developed in the 1930s and based on just 850 common words), to Longman’s 2,000-word ‘defining vocabulary’ (launched in 1978, and the model for similar lists used for simplifying the definitions in all the main learner’s dictionaries), efforts have been made to identify a core of ‘must-have’ items that will equip learners to negotiate most kinds of simple communicative task. The same thinking underlies ‘graded readers’: a restricted set of words and constructions is used to ensure that students can read texts without getting bogged down in difficult language.


Language corpora

The arrival of large language corpora in the 1980s gave a major boost to the whole enterprise of establishing a core vocabulary. We now have very reliable data on word frequency [2], and this goes far beyond simply knowing how often a word occurs. We can learn about the relative frequency of all the different meanings of a common word or of all the different grammatical patterns and phraseological relationships that it enters into. Corpora based on work produced by learners can help us identify the words or structures that cause them the biggest problems. And with the right software, we can now provide a detailed account of collocation. Adam Kilgarriff’s ‘Word Sketches’ use state-of-the-art computational techniques to build a complete picture of a word’s collocational networks. The Word Sketches supplied the data for the ‘Collocation Boxes’ in the Macmillan learner’s dictionaries.
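
The simplest form of collocation analysis counts the words that occur within a few words of a chosen ‘node’ word. The sketch below does only that, and it is far cruder than Word Sketches: it uses no grammatical information, no statistical weighting and no awareness of sentence boundaries. The function name `collocates`, the window size and the sample sentences are assumptions made up for illustration.

```python
import re
from collections import Counter


def collocates(text, node_forms, window=3):
    """Count words occurring within `window` tokens of any form in node_forms.

    A plain window-based count: it ignores grammar and sentence boundaries,
    so it is only a rough approximation of real collocation software.
    """
    tokens = re.findall(r"[a-z']+", text.lower())
    counts = Counter()
    for i, token in enumerate(tokens):
        if token in node_forms:
            start = max(0, i - window)
            end = min(len(tokens), i + window + 1)
            for j in range(start, end):
                if j != i:
                    counts[tokens[j]] += 1
    return counts


# Invented sample sentences, not corpus data.
sample = (
    "They build up a reputation. She builds up a business. "
    "We built up a reputation over years."
)

result = collocates(sample, {"build", "builds", "built", "building"})
print(result.most_common(5))
```

In a real corpus, very common words such as a would be filtered out or down-weighted by a statistical measure, leaving collocates such as reputation and business at the top of the list.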

It is important to stress that information like this isn’t intended to be normative or prescriptive: it simply describes what usually happens in language – and corpus-based statistics continue to show us that there is a lot more regularity in our linguistic behaviour than most of us suspected. The Longman Grammar of Spoken and Written English, for example, is packed with statistics about every aspect of language use. Much of this information illustrates the effects that register exerts on the way we use words. Take, for instance, the use of that-clauses, where we learn that ‘in conversation, the omission of that is the norm’, whereas ‘retention of that is the norm in academic prose’. In other words, we are much more likely (by a factor of almost 9 to 1) to say ‘I told her I would be late’ than ‘I told her that I would be late’ when we are talking to someone, but the position is reversed in academic writing. These are both registers in which many learners are expected to operate, so it can only be helpful to learn that each text-type has different characteristics, and that the differences are more or less systematic. This has nothing to do with either intelligibility or ‘correctness’. There is no law saying that learners should conform to native-speaker norms, but it must surely be useful for them to know what the norms are. 


Learner's dictionaries

One of the main ways in which statistical data can be made available is in dictionaries, and several of the advanced learner’s dictionaries provide word-frequency information. The Longman Dictionary of Contemporary English, for example, identifies the top 3,000 words in spoken and written text. The Macmillan learner’s dictionaries have used frequency data in an overtly pedagogical way, to identify a central core of vocabulary that is most likely to be needed by students working in both receptive and productive modes. These key words are shown in red (all the other words in the dictionary are in black), and additionally marked with one, two, or three stars – three-star red words being the most basic vocabulary of all. And a distinction is made between what counts as key vocabulary at advanced level, and what is seen as vital at intermediate level: in the Macmillan English Dictionary for Advanced Learners (2002, new edition 2007), there are about 7,500 ‘red’ words, while the intermediate Macmillan Essential Dictionary (2003) marks just 3,500 words in this way. Feedback from users has been positive, and in a questionnaire a large majority of the 1,200 or so respondents noted their approval of this approach, with only 12% disagreeing with the proposition that ‘a dictionary should make a distinction between productive vocabulary that learners need to use and receptive vocabulary that they need to recognise’.

As far as the future goes, the availability of ever-larger corpora – and the increasing sophistication of the software for analysing them – looks set not only to bring us closer to understanding how language works, but also to uncover statistical information that has the potential to help students become better learners.


References

1. For more information, see my article in the Pilgrims webzine, Humanising Language Teaching, May 2003: http://www.hltmag.co.uk/may03/idea1.htm

2. See, for example, Adam Kilgarriff’s homepage, from which several word-frequency lists can be downloaded: http://www.kilgarriff.co.uk/ and a similar site run by the University of Lancaster: http://www.comp.lancs.ac.uk/ucrel/bncfreq/.