One of the things I always thought computers could help me with is learning languages. I’m a big language fan, and a very small fan of paying teachers for things I don’t really need. So I’m always looking for ways to teach myself.
As someone who loves to start learning languages and not so much to practice, my biggest need is help practicing. One of my favorite ways to practice is reading books in the target language – my eBook library has books in about 10 different languages. But this can only help in very specific cases – I need to understand a large enough portion of the book so I can still enjoy a story rather than just spend all the time with the dictionary deciphering words. If it takes me an entire day to figure out who Samuel Ferguson is and what exactly he wants to do (although in this case the huge balloon picture on the cover was a good clue), I’m probably not going to keep reading it for long.
Another problem is knowing when to look up a word and when not to, when to bother memorizing a word and when not to. Some words are very common, and learning them is a very important step towards the previous goal (understanding a large enough portion of the book). Others can only waste your time in the beginning – if you’re learning English, and you’re still struggling with “is” and “that”, then you will only confuse yourself by learning “fuchsia”. But without knowing the language, how can we know which words are worth learning? If Hamlet says “Where be your gibes now”, how do you know that “where” is a useful word to know, but that “gibes” is one you’ll probably learn and then forget after months without encountering it again?
The second problem can be partially solved by using a list of the most common words in a language, but those lists will obviously come from certain collections of texts, which might be very different from what you want to read – if you’re learning English to read Shakespeare, you’ll probably want to learn “exeunt” more than “computer”, but if you’re trying to read the Linux man pages, it might be different.
So what do we do? We use a computer to help us. Generally, it might be difficult for a computer to recognize words well. We would want “book” and “books” to count as a single word, and that might not be easy to achieve. However, it’s not impossible, and in some languages might be easier than in others – as I’m trying to start learning Mandarin these days, I noticed that Chinese languages are perfect for such analysis – with no suffixes and inflections, it’s very easy to count how many times a character appears in a text.
My first target text was the Analects of Confucius. I wrote a little program to read it, count every instance of a character, and list the most common ones. The text (as it appears on Project Gutenberg) has 15967 Chinese characters, of which 1349 are unique. Here are the top ten, along with their frequencies:
1. 子: 979
2. 曰: 758
3. 之: 613
4. 不: 584
5. 也: 533
6. 而: 346
7. 其: 268
8. 人: 219
9. 者: 219
10. 以: 211
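The program itself isn’t shown here, so the following is only a minimal sketch of how such a count might be done in Python. The names `is_cjk` and `count_characters` are made up for illustration, and treating only the basic CJK Unified Ideographs block (U+4E00 to U+9FFF) as “Chinese characters” is an assumption; a different definition would give slightly different totals.

```python
from collections import Counter


def is_cjk(ch):
    # Assumption: only the basic CJK Unified Ideographs block counts as a
    # Chinese character. Punctuation, digits and Latin letters are skipped.
    return "\u4e00" <= ch <= "\u9fff"


def count_characters(text):
    """Return a Counter mapping each Chinese character to its frequency."""
    return Counter(ch for ch in text if is_cjk(ch))


if __name__ == "__main__":
    # A short sample in the style of the Analects; a real run would read
    # the whole text from a file instead.
    sample = "子曰：學而時習之，不亦說乎？子曰：巧言令色，鮮矣仁。"
    counts = count_characters(sample)
    print("total characters:", sum(counts.values()))
    print("unique characters:", len(counts))
    for rank, (ch, n) in enumerate(counts.most_common(3), start=1):
        print(f"{rank}. {ch}: {n}")
```

`Counter.most_common` returns the characters sorted by frequency, which is all the ranked list needs.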
This is indeed informative. Almost every passage in the Analects starts with some variation of “子曰” (which means “The master said”). Most other characters are very common in Chinese, as I saw later in other texts. So it certainly looks like these words are worth memorizing.
But how useful is knowing those words? Well, let’s check how much of the total text each group of top characters covers – we can easily get that by summing the top characters’ frequencies and dividing by the total number of characters in the text (which, as we said, was 15967):
10 characters cover 29% of the text.
20 characters cover 39% of the text.
30 characters cover 46% of the text.
40 characters cover 51% of the text.
50 characters cover 54% of the text.
75 characters cover 61% of the text.
100 characters cover 66% of the text.
250 characters cover 81% of the text.
500 characters cover 91% of the text.
1000 characters cover 97% of the text.
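The coverage figures are just running sums divided by the total. Here is a rough sketch of that arithmetic in Python, using only the ten frequencies listed earlier; the function name `coverage` and its parameters are hypothetical, not taken from the original program.

```python
from itertools import accumulate


def coverage(frequencies, total, cutoffs):
    """For each cutoff k, return the fraction of the text covered by the
    k most frequent characters (all of them, if fewer than k are given)."""
    ordered = sorted(frequencies, reverse=True)
    running = list(accumulate(ordered))  # running[i] = sum of the top i + 1
    result = {}
    for k in cutoffs:
        top = min(k, len(running))
        covered = running[top - 1] if top > 0 else 0
        result[k] = covered / total
    return result


if __name__ == "__main__":
    # Frequencies of the top ten characters and the total from the text.
    top_ten = [979, 758, 613, 584, 533, 346, 268, 219, 219, 211]
    total = 15967
    for k, fraction in coverage(top_ten, total, [5, 10]).items():
        print(f"{k} characters cover {fraction:.1%} of the text.")
```

The top ten frequencies sum to 4730, and 4730 / 15967 is about 29.6%, which agrees with the 29% above if the percentages were truncated rather than rounded.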
So this not only tells us how useful this method is (memorizing the top 10 characters gives us 29% of the text, while the next 10 add only another 10%, so we really want to focus on the top ones), but also gives us an idea to help with our first problem – we wanted to find texts that we can mostly read, so that we won’t spend so much time in the dictionary that we miss the reading experience. Now we can take all the texts we are considering and see which ones require only a small number of words to mostly understand. This becomes really powerful if we add a list of known words, so that we check only how many unknown words we need in order to cover the text. We’ll get back to that later.
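As a small preview, here is one way that comparison might be sketched in Python. The function `characters_needed`, its `target` parameter and the `known` set are all hypothetical names, and “mostly understand” is reduced here to a simple coverage threshold, which is an assumption.

```python
from collections import Counter


def characters_needed(counts, target, known=frozenset()):
    """Return how many not-yet-known characters must be learned, most
    frequent first, before the covered fraction of the text reaches target."""
    total = sum(counts.values())
    if total == 0:
        return 0
    # Characters we already know count as covered from the start.
    covered = sum(n for ch, n in counts.items() if ch in known)
    needed = 0
    for ch, n in counts.most_common():
        if covered / total >= target:
            break
        if ch in known:
            continue
        covered += n
        needed += 1
    return needed


if __name__ == "__main__":
    # Tiny made-up sample; a real comparison would use whole texts.
    counts = Counter("子曰學而子曰不亦子曰")
    print(characters_needed(counts, 0.5))
    print(characters_needed(counts, 0.5, known={"子"}))
```

Running this for every candidate text with the same `target` and `known` set gives one number per text, and the lowest number points to the most approachable read.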
