Talking Thai <> English Dictionary+Phrasebook

for iPhone/iPad/iPod Touch

Thai Phrasebook and Dictionary App Entry Counts

Note 1: The following explanation of our Thai Phrasebook and Dictionary App Entry Counts appears both here on our website (to benefit those who are deciding whether to purchase our apps) and inside our apps (version 2.0 and greater). If you have our apps already, we recommend you read this explanation inside the app (go to "Help" > "Entry Counts") because the text will have lots of useful links directly to the Help pages and features being discussed.

Note 2: Though we use Thai as our example here, we use the same principles to count entries in our Chinese language apps.

There are many ways to compare different phrasebooks and dictionaries meant for those learning Thai.

Probably the most important factor is the quality of the entries: whether the words and phrases presented are useful, correctly translated, and whether the system for reading entries (see app "Help" > "Using the App" > "How to Read Entries") is clearly documented so that you can get the most out of them (for example, knowing how to choose the right translation (see app "Help" > "Using the App" > "How to Read Entries") so you don't get into trouble).

The second most important factor is the depth of the entries: whether entries include high-quality Thai sound recordings (see app "Help" > "Using the App" > "Hearing Words") (or any Thai sound recordings at all, for that matter), well-documented parts of speech (see app "Help" > "Speaking/Listening" > "Parts of Speech"), glosses (see app "Help" > "Using the App" > "How to Read Entries") (extra explanations which clarify shades of meaning when the English word is ambiguous), word registers (see app "Help" > "Speaking/Listening" > "Word Register") (which tell you the social context in which you need to use a word, e.g. formal or vulgar), classifiers/measure words (see app "Help" > "Speaking/Listening" > "Classifiers") (crucial for counting or measuring anything in Thai), and whether or not translations found in entries are ordered by commonality (see app "Help" > "Using the App" > "How to Read Entries" > "Which Thai Word Should I Pick?").

The third most important factor, but sadly one which often receives people's sole attention, is the entry count.

Fortunately, our dictionary and phrasebook apps excel in all three of these categories.

In this section we warn you about several pitfalls in comparing entry counts published by different vendors, describe our entry counting method, and present our current counts. We link to this section from all parts of the app and our website where we cite entry counts.

We want to make people aware of several surprising reasons why judging by entry count alone will likely lead you into trouble.

First, in the Thai learning space you should be aware that the vast majority of Thai dictionary apps and programs today get their entries solely from the large, free, public-domain LEXiTRON dataset from the Thai government organization NECTEC. LEXiTRON was designed to help Thai people learn English, not the other way around. While it is fantastic that NECTEC released this free resource to the Thai public to help Thai people, the dataset is full of errors that can easily lead you into trouble, and the dataset lacks English glosses (see app "Help" > "Using the App" > "How to Read Entries") and other annotations needed to really be useful to an English speaker (does that Thai word for "glass" that you found mean "pane of glass" or "drinking glass"?). Free is nice, but you get what you pay for. Paiboon Publishing's dictionary production team spent years hand-crafting our dictionary and phrasebook datasets from scratch specifically with the needs of English speakers in mind.

Secondly, entry counts themselves can be amazingly misleading. We've noticed that different vendors use vastly different ways of counting their entries, and at the same time no vendor is currently doing a good job of explaining how their published counts are computed. The differences in counting methods can easily make a 2-5x difference in the final number, making comparisons meaningless.

To help resolve this situation so that our customers can make fair, meaningful, informed comparisons, here we will present crucial issues of entry counting that you may not even be aware of, and then we will show our counts and exactly how we computed them. We encourage all phrasebook and dictionary vendors to do the same.

We look at each of the counting issues from two useful perspectives: does each method of counting reflect how useful the dictionary/phrasebook is to actual customers, and does each method of counting fairly reflect the amount of work the author had to do? In other words, is the way of counting fair from the customer's point of view, and is it fair from the author's point of view?

In this section we use our Thai phrasebook and dictionary apps as an example, but we use the same counting method for all our dictionary and phrasebook apps in any language.

The main counting issues are:

Do you count the dictionary directions/sections (Thai-to-English vs. English-to-Thai) separately?

Clearly the answer is "yes" for traditional paper dictionaries: some dictionaries only had one direction and some had two, and those with two were clearly a lot more useful and should be counted as such, especially since the total entry count corresponded directly to the size and weight of the book: a critical factor for customers. But in any case, dictionary authors need to be clear about their counting method so that customers can compare meaningfully (e.g. a bidirectional dictionary with 50,000 total entries actually has roughly half as many Thai-to-English entries as a single-direction dictionary with 50,000 entries).

When we move over to the modern realm of software dictionaries, things become more complex, but surprisingly a lot of the same issues remain.

For software, "space" is much less of an issue (space is now primarily tied to sound recordings) and "weight" is a total non-issue. Furthermore, because software dictionaries can support full-text search (see app "Help" > "Using the App" > "Search" > "Search Modes" > "Power Search"), it means that some dictionary authors can choose only to author in a single direction (for example, they may spend all their authoring time writing Thai-to-English entries only) and then rely on full-text search to give their software dictionary the appearance of having two directions.

Should the published entry count only include the direction they authored, or both directions?

This is where we must understand a surprising and really significant harsh reality: authors who want to create a bidirectional dictionary cannot simply author in one direction, flip the Thai and English, re-sort the entries, and call it done. In fact, even with the best high-tech authoring tools, authors of bidirectional dictionaries need to do almost twice as much work as authors of single-direction dictionaries, and the benefit to customers of a true bidirectional dictionary is also as much as double. This is because of two surprising harsh realities of language:

The first harsh reality is that in both languages, there are words that simply have no translation in the other language: you need to have a long phrase to explain the concept in the other language. This includes many simple, everyday terms in both languages like "room service" and เกรงใจ [greeng-jai], terms that native speakers of each language would never expect to lack direct translations into the other language. Authors who create single-direction dictionaries only have to face that issue in one direction, and customers only get the benefit of translations in one direction. Having a single-direction app with full-text search won't improve the situation, because you simply won't find non-directly-translatable words like these when you search. Authors who create true bidirectional dictionaries need to write a lot more entries. In our experience this can easily add 30-40% more entries that would not exist in the other direction, and easily consume half the work, since these entries are much more difficult to translate.
The second harsh reality is that almost every English word translates to more than one Thai word, and almost every Thai word translates to more than one English word. As just one typical example, English "fat" translates either to Thai อ้วน [ûuan] or มัน [man] depending on the meaning, and มัน [man] translates back to English as "fat," "it," or "entertainingly interesting!" Languages are like tangled spider webs with links going off in totally different directions for Thai-to-English vs. English-to-Thai. That means bidirectional paper dictionary authors (including Paiboon Publishing) had to painstakingly choose which headwords and translations to include in each section by balancing their expert sense of how useful the word is in that direction against the available space (since the cost of paper and weight of the book are major factors for authors too). Just because a given pair of words appears in the Thai-to-English section does not mean that it should also appear in the English-to-Thai section. For example, you definitely want to include the Thai word มัน [man] in the Thai-to-English section as "fat," since customers will hear Thai people use this common contraction and will need to be able to look it up, but in the English-to-Thai section under "fat," you want to include only ไขมัน [kǎi-man] because that longer word has fewer other meanings and so is a better choice for the customer to use when speaking Thai. This editing process is incredibly time-consuming.
When we move from paper to the software world, we have plenty of space, so we can include every word in every direction, but that doesn't actually solve the problem for the customer: if you look up the word "eat" and you see 10 different Thai translations, of course you will scream "Which word do I choose? (see app "Help" > "Using the App" > "How to Read Entries") I just want to know how to say 'eat!'" In a true bidirectional dictionary, the author will spend time ordering the translations that appear under headwords in each section so that the most general-purpose translation is first, followed by other translations, annotated with what makes their use more restricted (which, in the case of the verb "eat," is likely to be their word register (see app "Help" > "Speaking/Listening" > "Word Register")). And even for words that have the same register, there is still a sense of which word is the most common—the word choice that will make you sound more native or will avoid problems for you. This is just like English where "I went into the shop yesterday" sounds more natural in most contexts than "I entered the shop yesterday" so "went into" is the translation that should be listed first in a dictionary meant for Thai people learning English.

You, as an end-customer, can only get these extra benefits if the dictionary author spent the huge amount of extra time needed to write or hand-pick each translation separately in each direction.

Authoring in only one direction certainly saves a lot of time for the author, but in reality it's not as useful to the customer, even if we compare dictionaries with similar entry quality and depth as defined above.

So when we only think about entry counts, and we ignore the crucial reality of whether the author created a true bidirectional dictionary dataset or just slapped full-text search on top of a single-direction dataset, we do ourselves a great disservice.

In reality, true bidirectional datasets are both much harder for the author to create, and much more useful for customers like you.

So the issue of whether a bidirectional dataset should have a higher entry count than a single-directional dataset is really a red herring: the real issue is that each software dictionary vendor must clearly disclose whether the underlying dataset is truly bidirectional or not, and provide entry and translation counts for each section the vendor originally authored.

How do you count if the dictionary also has Thai Sound?

Things get even more tricky when you think about three-way (see app "Help" > "Using the App" > "Search" > "Three Ways to Search") dictionaries: those that let you Search-By-Sound™ in a third Thai Sound section.

Paiboon Publishing created its industry-first three-way paper dictionary in 1996, and all of Paiboon's software dictionaries and apps have been three-way as well.

To understand the impact of three-way on counting methods, it helps to name the three sections using kind of silly terminology compatible with the discussion just above. A three-way dictionary has these sections:

English-to-Thai-script-and-sound
Thai-script-to-English-and-Thai-sound
Thai-sound-to-English-and-Thai-script

Should the third section (the Thai Sound section) count toward the entry count? Even if not, should the fact that all sections now have Thai Sound count somehow?

The third section certainly adds a lot of utility for customers, and it takes a huge amount of time for authors like us to add Thai Sound pronunciation guides alongside each Thai Script word (especially in our software apps where you can choose from 12 different systems and our app keeps the entries sorted according to your system), so for both of these reasons it makes sense to "credit" the third section in the entry count. And for the paper volumes, for the practical reasons of weight and size above it also makes sense to include the Thai Sound section in entry count since that is such a crucial factor for both authors and customers.

But again we are trying to wedge something that is fundamentally more complicated than a single number into a single number, for no rational reason. We have to get past our obsession over entry count, in particular our desire to force a single number on each product for comparison!

A much better system is for each dictionary vendor to state what "sections" they have (e.g. English, Thai Sound, and Thai Script), provide separate entry counts for these, and provide a total, as we do below. That way, customers can make meaningful comparisons.

Do you count bold headwords (individually or together), translations, or what?

Further complicating matters is that a typical dictionary entry doesn't just have one Thai word and one English word. A single entry often has one or more bold headwords, and one or more translations (plus glosses (see app "Help" > "Using the App" > "How to Read Entries"), classifiers/measure words (see app "Help" > "Speaking/Listening" > "Classifiers"), and other content).

Do you count each entry as 1? Do you count each entry according to the number of bold headwords? Do you count each entry by the number of translations? What about the extra depth information like glosses, classifiers, or categories, which may not even be present in other dictionaries?
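To make the difference concrete, here is a minimal Python sketch that counts the same tiny dataset three ways. The `Entry` class, its field names, and the placeholder strings `T1` to `T6` and `C1` are made up for illustration only; they are not real data and not the format any actual dictionary product uses.

```python
from dataclasses import dataclass, field


@dataclass
class Entry:
    """One dictionary entry: bold headwords plus translations (hypothetical model)."""
    headwords: list[str]
    translations: list[str]
    classifiers: list[str] = field(default_factory=list)


# Toy English-to-Thai section. T1..T6 and C1 stand in for Thai words.
entries = [
    Entry(headwords=["can"], translations=["T1", "T2"]),                  # verb
    Entry(headwords=["can", "tin"], translations=["T3"], classifiers=["C1"]),  # noun
    Entry(headwords=["glass"], translations=["T4", "T5", "T6"]),
]

# Method 1: every entry counts as 1.
count_by_entry = len(entries)

# Method 2: every bold headword counts as 1.
count_by_headword = sum(len(e.headwords) for e in entries)

# Method 3: every translation counts as 1 (classifiers are not translations).
count_by_translation = sum(len(e.translations) for e in entries)

print(count_by_entry)        # 3
print(count_by_headword)     # 4
print(count_by_translation)  # 6
```

Even on three entries the published number could be 3, 4, or 6 depending on the method, which is why an undisclosed method makes two vendors' figures impossible to compare.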

Currently vendors count differently and do not disclose how they count, which prevents you from making meaningful comparisons. They should disclose their method.

We give our exact counting method below.

Or do you count only unique headwords? Unique Words?

In all dictionary products you will have the same English or Thai words appear more than once as translations, and in some dictionary products you will have the same English or Thai words appear as headwords.

For example, in many dictionaries, entries for the same headword are split by parts of speech, classifiers, or other factors. You might have one verb entry with English headword "can" and translations meaning "able to," and another noun entry with English headword "can" and translations meaning "metal container for liquid."

Should those multiple "can" entries be counted separately, or should we count all the "can"s together as 1 (so that we're really counting the number of "unique" headwords)? On the one hand, it's two separate meanings so it should count as 2 since it's twice as useful to you compared to a word that only has one meaning (and the author had to do twice as much work to create it). But on the other hand, it's only one English word.

Yet another way to count is to count the total number of unique words in each language that appear in any entry (either as a headword or a translation), counting each unique word once. But this method will under-represent the very common words that have more meanings and are used more often in our daily speech.
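As a rough illustration of how far these methods can drift apart, the following Python sketch counts one toy English-to-Thai section by entries, by unique headwords, and by unique words in each language. The data layout and the placeholder strings `T1` to `T4` (standing in for Thai words) are assumptions made up for this example.

```python
# Each toy entry is (headwords, translations). T1..T4 are placeholder Thai words.
entries = [
    (("can",), ("T1",)),        # verb: "able to"
    (("can",), ("T2",)),        # noun: "metal container"
    (("fat",), ("T3", "T4")),   # two translations with different meanings
    (("oil",), ("T4",)),        # shares a translation with "fat"
]

# Count every entry separately: both "can" entries count.
entry_count = len(entries)

# Count unique headwords: both "can" entries collapse into one.
unique_headwords = {h for headwords, _ in entries for h in headwords}

# Count unique words per language: each word counts once no matter
# how many entries it appears in.
unique_english_words = unique_headwords
unique_thai_words = {t for _, translations in entries for t in translations}

print(entry_count)                # 4
print(len(unique_headwords))      # 3
print(len(unique_english_words))  # 3
print(len(unique_thai_words))     # 4
```

Note how the unique-word counts give the two-meaning word "can" no more weight than a one-meaning word, which is the under-representation described above.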

There are valid arguments for all these counting methods. Ultimately it doesn't matter as long as you know two products you are comparing use the same method, but...

Just like the issue above, currently vendors count differently and do not disclose how they count, preventing you from making meaningful comparisons.

We give our exact counting method below.

What about "glossary" and technical word padding?

Perhaps the single worst thing about comparing dictionaries using entry counts is that most words are useless! That is, in dictionaries with more than around 30,000 entries per section, the remaining words are likely to be words you will never use. This can include formal/technical variants of common words, but it can also include huge "glossary" laundry-lists of place names, biological species (LEXiTRON in particular is rife with these), names of Kings, Gods, and other historical or religious figures, common surnames, and any other kind of lists that are easily added to a dictionary dataset en masse without adding that much value to most users.

At the very least, these "pad" entries unintentionally mislead you into thinking a dictionary is more useful than it is. In the worst case, the dictionary author may have put them in intentionally to mislead you (in some cases an author can get a 20,000-50,000 entry bump by simply downloading a few publicly available spreadsheets from the internet and merging them into the product in a few hours).

The "pad" words are certainly of use to someone, and so should not be deleted, but it is important for dictionary authors to disclose these kinds of words when stating the entry count to customers, the majority of whom will not get any benefit from the "pad" words.

For our Paiboon dictionaries, the closest thing we have to "pad" words is the more than 53,000 entries that contain Thai place names. Starting back in 2012 when we introduced the first place names, and continuing until today, we have never included the place name entries in our entry counts, and have noted this explicitly in our published stats.

What about phrasebook vs. dictionary?

Some apps combine the traditional roles of dictionary and phrasebook. In some of these dual-purpose apps, such as ours, you can find phrasebook phrases (see app "Help" > "Five Minute Tour" > "Start in the Search Screen") from the Search function that is traditionally meant for the dictionary, and in some apps, you cannot.

So if the vendor publishes a single entry count for the whole app, does that include both phrasebook and dictionary?

Where terms overlap between the two, are they double-counted?

What about the fact that phrasebooks typically only go from English-to-Thai, whereas the dictionary may have two or three directions as explained above? If the vendor sums up the two/three directions for the dictionary entries, should the vendor do the same for the phrasebook entries even if they are only searchable as English-to-Thai?

As you can see, our old bad habit of trying to assign a single number to a complex concept is hurting us again.

In reality, a vendor needs to disclose exactly how each dataset (dictionary and phrasebook) was authored, how they are searchable, and how any overlap is counted.
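The double-counting question can be sketched in a few lines of Python. The two sets below are invented toy data, and treating an entry as a plain string is a simplifying assumption; the point is only to show how a naive sum differs from a count of distinct entries.

```python
# Toy datasets: each string stands for one entry (hypothetical data).
dictionary_entries = {"eat", "fat", "glass", "room service"}
phrasebook_entries = {"eat", "room service", "where is the bathroom?"}

# Naive total: add the two counts, so shared entries are counted twice.
naive_total = len(dictionary_entries) + len(phrasebook_entries)

# Entries that appear in both datasets.
overlap = dictionary_entries & phrasebook_entries

# Distinct total: each entry counted once, however many datasets hold it.
distinct_total = len(dictionary_entries | phrasebook_entries)

print(naive_total)     # 7
print(len(overlap))    # 2
print(distinct_total)  # 5
```

A vendor who publishes 7 and a vendor who publishes 5 could be describing exactly the same app, so the overlap rule needs to be stated alongside the number.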

Paiboon Publishing Entry Counts

Ok, so with the really important background established above (you didn't just scroll down here, did you?), we can get to the numbers. It's really important to read the text above before you try to interpret these numbers.

For the Paiboon phrasebook and dictionary apps:

We define one "entry" as what appears between the thin horizontal rules in the Search or Categories screen: a collection of one or more bold headwords and one or more non-bold translations. So an entry with multiple bold headwords still counts as 1 entry. And an entry with multiple translations still counts as 1 entry. For counting purposes, super phrases (see app "Help" > "Using the App" > "Placeholders"), which may have tens to thousands of possible variations depending on your choices, still count as 1 entry (which, in and of itself, is an example of why using a single number to judge an app is misleading!).
In our dictionary dataset, it is possible to have multiple entries with the same bold headword. This happens when the two entries have different parts of speech (e.g. one "can" noun entry as in "metal container" and one "can" verb entry as in "able to"), different classifiers, or when the headword in question is grouped differently with other headwords (e.g. one noun "can" entry meaning "slang word for jail" and another noun "can, tin" entry meaning "metal container"). Additionally, for English entries only, this happens when the two English entries have a different gloss (see app "Help" > "Using the App" > "How to Read Entries") (e.g. "glass (drinking)" vs "glass (pane)"). Finally, for Thai entries only, this happens when two entries have a different word register (see app "Help" > "Speaking/Listening" > "Word Register").
We define one "translation" as a non-bold word that appears in an entry in the opposite language from the bold headword, not including classifiers. So an entry with bold English headwords that lists 3 Thai words (not including classifiers) would count as 3 translations.
Paiboon Publishing's dictionary dataset is a true bidirectional dataset. Our dictionary team spent almost double the time because we hand-edited entries in the Thai-to-English direction separately from entries in the English-to-Thai direction, and this gives you significant benefits, as explained above.
Paiboon Publishing's phrasebook dataset is a single-directional dataset because it was authored and edited in an English-to-Thai direction only. However, during phrasebook production, knowing that our apps would support full-text search (see app "Help" > "Using the App" > "Search" > "Search Modes" > "Power Search"), we included approximately 500-700 entries where the Thai is a common simple word but the English is a long explanatory phrase, which normally one would only have included during a bidirectional production process. So our phrasebook dataset has some degree of bidirectionality.
In the Categories screen (both phrasebook and dictionary apps), all entries show with English bold headwords, except for a handful of categories where all entries have Thai Script bold headwords.
In the Search screen (both phrasebook and dictionary apps), entries will show with English, Thai Script, or Thai Sound bold headwords depending on how you search and depending on whether the entry itself has English headwords or not. As explained above, the Thai Script and Thai Sound entries are not simply reversals of the English entries: our authoring team painstakingly chose which translations to include for each entry, and in what order.
The Search screen can find any entry in the Categories screen, including phrases and complete sentences found in Categories. So Search has 100% overlap with what is in Categories.
Our dictionary dataset includes a large number of "place name" entries that include Thai district, subdistrict, and city names, country names, and major world city names. We separate out these place names in our counts below for reasons explained above.
From the dictionary Search screen, if you search in English, you can reach:
- 81,494 different entries total,
- 10,625 of which are place names entries, so
- 70,869 entries not including place names
From the dictionary Search screen, if you search in Thai Script, you can reach:
- 84,547 different entries total,
- 21,231 of which are place names entries, so
- 63,316 entries not including place names
From the dictionary Search screen, if you search in Thai Sound, the number of entries you can reach is the same as Thai Script above (there is a tiny difference because a few Thai words have multiple spellings or multiple pronunciations (see app "Help" > "Reading and Writing" > "Irregular Spellings"), but the difference is not significant).
To compute a total count for dictionary English, Thai Script, and Thai Sound that is consistent with the way we have counted our entries since the earliest paper dictionary (and so allows comparison), we add together the English, Thai Script, and Thai Sound numbers above:
- 250,588 different entries total,
- 53,087 of which are place names entries, so
- 197,501 entries not including place names
In our dictionary app Help and marketing materials, we use the total entry count 197,501 rounded down to 195,000 and the total translation count 250,588 rounded down to 250,000, again both without place names. When discussing the place names, we round 53,087 down to 53,000. When using these numbers we provide a link directly to this explanation wherever possible.
Of the total number of dictionary entries above, the following numbers of entries include Thai classifiers/measure words (if an entry has more than one classifier, it still counts as 1 in this count):
- 117,816 different entries total,
- 53,087 of which are place names entries, so
- 64,729 entries not including place names
So we have "60,000+" noun entries with classifiers.
In our phrasebook dataset (which appears inside both phrasebook and dictionary apps), we have:
- 1,356 total categories (including the top-level home category), 1,242 of which are "leaf" categories containing only items and no sub-categories
- 1,003 and 926 of which are place name categories, respectively
- 353 and 316 of which are non-place-name categories, respectively
So we round down our total of 316 non-place-name, leaf categories to 300 and publish that we have "300+" categories. Note we have taken the most conservative possible interpretation of Category count.
Inside our categories, we have the following number of "items," where an item is an entry, a link to the Help, or a cross-reference to another category that begins with See: (but not a subcategory):
- 23,703 different items total,
- 10,612 of which are place names entries, so
- 13,091 items not including place names
Inside our categories, we have the following number of entries (so this does not count links, cross-references, or subcategories):
- 23,514 different entries total,
- 10,612 of which are place names entries, so
- 12,902 entries not including place names
So we round down our total of 12,902 non-place-name entries to 12,000 and publish that we have "12,000+" entries visible from the Categories screen.
Important: unlike the Search figures above, in the Categories screen you do not get to choose whether you see entries with English, Thai Script, or Thai Sound bold headwords, nor did the phrasebook authoring team spend time separately creating and editing English-to-Thai and Thai-to-English entries as we did for the dictionary dataset (with the exception of the 500-700 phrasebook entries mentioned above). So the Categories item and entry figures above are not a sum of English + Thai Script + Thai Sound as with Search. Keep this in mind when comparing how much overlap there is between Search and Categories.
In our phrasebook apps specifically (as opposed to our dictionary apps), there are around 4,000 entries that are searchable from the Search screen but which cannot be found in any category of the Categories screen. These include the most useful and common words like "is" and "good" and "person" which don't belong in any category.
So, adding together the "12,000+" entries visible from the Categories screen (see above) with these additional 4,000 Search-screen-only entries, we arrive at the published "16,000+" figure for the total number of entries in our phrasebook apps.
In our phrasebook dataset there are a few categories outside the place name categories with a large number of rare entries. These categories have far fewer entries than the place name categories, so we didn't subtract them out from our entry counts above, but you may want to consider them when evaluating entry count. The main ones are "Categories" > "Glossary" > "Nature" > "Animals", "Categories" > "Glossary" > "Nature" > "Birds of Thailand", and "Categories" > "Glossary" > "Nature" > "Ornamental Fish". There are also other "Categories" > "Glossary" categories you can browse, but most of them are either very small, contain common words, or both.
Our dictionary contains 60,484 high-quality Thai sound recordings (see app "Help" > "Using the App" > "Hearing Words") totalling 29 hours and 58 minutes of audio. This includes the Thai words used as headwords and translations in entries as well as the sample sounds used in the "Help" > "Speaking/Listening" and "Help" > "Reading and Writing" sections.
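As a quick check on the dictionary Search figures in the list above, the per-section numbers can be re-added in a few lines of Python. This is only an arithmetic sketch: the dictionary layout and the `round_down` helper are made up for illustration, and Thai Sound is assumed to equal Thai Script exactly (the text notes the real difference is tiny).

```python
# Per-section dictionary Search figures as listed above.
# Assumption: Thai Sound is treated as identical to Thai Script.
SECTIONS = {
    "English":     {"total": 81_494, "place_names": 10_625},
    "Thai Script": {"total": 84_547, "place_names": 21_231},
    "Thai Sound":  {"total": 84_547, "place_names": 21_231},
}


def round_down(n: int, step: int) -> int:
    """Round n down to the nearest multiple of step (hypothetical helper)."""
    return n - n % step


# Sum each column across the three sections.
total = sum(s["total"] for s in SECTIONS.values())
place_names = sum(s["place_names"] for s in SECTIONS.values())
without_place_names = total - place_names

print(total)                # 250588
print(place_names)          # 53087
print(without_place_names)  # 197501

# Conservative published figure: round down, never up.
print(round_down(without_place_names, 5_000))  # 195000
```

Keeping the place names in their own column, rather than folding them into one headline number, is what lets a reader see both the full total and the more conservative figure.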

The long-term solution to allow customers like you to meaningfully compare dictionary/phrasebook product entry counts is:

First, each vendor should publish their detailed counts and counting method, as we did above, so at least we know what we have and whether it can be compared.
Then, the next step is for the vendors to pick one or more common counting methods and all count their products using the same method(s) so as to benefit customers. We can choose a variety of methods to fairly highlight the strengths of each product.