Only 10 Words Make Up 25% Of The English Language
Ten words ("the," "be," "to," "of," "and," "a," "in," "that," "have," and "I"), listed here in order of frequency, make up around 25% of recorded English, according to an ambitious project at Oxford University.
The project, called the Oxford English Corpus, is a growing database of examples from 21st-century English, ranging from literature and scientific journals to emails. The Corpus contains more than two billion instances of words, called "tokens."
"A type is a unique string of letters, regardless of how often it is used. A token is a single occurrence of a type. The sentence 'the cat sat on the mat' contains six tokens but five types, because there are two occurrences of the type 'the,'" Professor Patrick Hanks, former editor of English dictionaries at the Oxford University Press, told Business Insider.
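The type/token distinction Hanks describes can be checked with a few lines of Python. This is a minimal sketch, not the Oxford English Corpus method: it assumes words are separated by whitespace and that case is ignored, and the function name `count_tokens_and_types` is made up for illustration.

```python
from collections import Counter


def count_tokens_and_types(text):
    """Return (number of tokens, number of types) for a piece of text.

    Assumption: a token is any whitespace-separated word, lowercased.
    Real corpus tooling handles punctuation and word forms more carefully.
    """
    tokens = text.lower().split()   # every occurrence counts as one token
    types = Counter(tokens)         # one entry per unique string
    return len(tokens), len(types)


tokens, types = count_tokens_and_types("the cat sat on the mat")
print(tokens, types)  # 6 5
```

The two occurrences of "the" count as two tokens but only one type, which gives the six-versus-five split in Hanks's example.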
The Corpus hit two billion tokens in 2010. Lexicographers then deduced the ten words that appear the most. But a Harvard professor named George Kingsley Zipf had already predicted the result back in 1935.
"The weak version of Zipf's Law says that words are not evenly distributed across texts; instead, there are a few words that are very common and a very large number of words that are very rare. And there is a neat curve linking the two extremes. Useful words such as 'useful' and 'curve' are quite low on the curve; boring words like 'thing,' 'go,' 'say,' 'give' and 'take' are quite high on the curve," Hanks said.
Hanks doesn't mean "neat" as an outdated form of "cool," either. He means orderly, organized, statistically beautiful.
The ten aforementioned words comprise about 25% of our language. Going further, the top 100 words comprise about 50% of our language, while 50,000 words comprise 95% of our language. To account for the last 5%, we need a vocabulary of more than a million words.
To test the theory, I counted the number of times each of the ten words appears in this article: 98 out of 391. Thus, "the," "be," "to," "of," "and," "a," "in," "that," "have," and "I" make up about 25.06% of this article. Right on the money.
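A count like this one can be reproduced roughly in Python. The sketch below rests on assumptions the article does not state: it matches exact strings only, so inflected forms such as "is" or "has" are not counted under "be" or "have," and it treats any run of letters or apostrophes as a word. The name `top_ten_share` is hypothetical.

```python
import re

# The ten words named in the article.
TOP_TEN = ["the", "be", "to", "of", "and", "a", "in", "that", "have", "I"]


def top_ten_share(text, words=TOP_TEN):
    """Return the fraction of tokens in `text` that are one of `words`.

    Assumption: exact, case-insensitive string matches only; inflected
    forms ("is", "was", "has") are not folded into "be" or "have".
    """
    tokens = re.findall(r"[a-z']+", text.lower())
    if not tokens:
        return 0.0
    targets = {w.lower() for w in words}
    hits = sum(1 for t in tokens if t in targets)
    return hits / len(tokens)


# The arithmetic behind the article's own figure: 98 hits in 391 words.
print(round(98 / 391 * 100, 2))  # 25.06
```

Calling `top_ten_share` on the article's text may give a different number than 98 out of 391, depending on how word forms and punctuation were handled in the original count.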
If we consider "content words" (words with tangible meaning) instead of "function words," the top ten list changes to include: "time," "person," "year," "way," "day," "thing," "man," "world," "life," and "hand."