Talking Thai <> English Dictionary+Phrasebook
for iPhone/iPad/iPod Touch
Thai Phrasebook and Dictionary App Entry Counts
Note 1: The following explanation of our Thai Phrasebook and Dictionary App
Entry Counts appears both here on our website (to benefit those who
are deciding whether to purchase our apps) and inside our apps
(version 2.0 and greater).
If you have our apps already, we recommend
you read this explanation inside the app
(go to "Help" > "Entry Counts")
because the text will have lots of useful links directly to
the Help pages and features being discussed.
Note 2: Though we use Thai as our example here, we use the same
principles to count entries in our Chinese language apps.
There are many ways to compare different phrasebooks and dictionaries
meant for those learning Thai.
Probably the most important factor is the quality of the
entries: whether the words and phrases
presented are useful, correctly translated,
and whether the
system for reading entries (see app "Help" > "Using the App" > "How to Read Entries")
is clearly documented so that you can get the most out of them
(for example, knowing how to
choose the right
translation (see app "Help" > "Using the App" > "How to Read Entries") so you don't get into trouble).
The second most important factor is the depth of the entries:
whether entries include high-quality
Thai sound recordings (see app "Help" > "Using the App" > "Hearing Words")
(or any Thai sound
recordings at all, for that matter), well-documented
parts of speech (see app "Help" > "Speaking/Listening" > "Parts of Speech"),
glosses (see app "Help" > "Using the App" > "How to Read Entries") (extra explanations which
clarify shades of meaning when the English word is ambiguous),
word registers (see app "Help" > "Speaking/Listening" > "Word Register") (which tell you the
social context in which you need to use a word,
e.g. formal or vulgar),
classifiers/measure words (see app "Help" > "Speaking/Listening" > "Classifiers") (crucial for
counting or measuring anything in Thai), and whether or not
translations found in entries are
ordered by commonality (see app "Help" > "Using the App" > "How to Read Entries" > "Which Thai Word Should I Pick?").
The third most important factor, but sadly one which often receives
people's sole attention, is the entry count.
Fortunately, our dictionary and phrasebook apps excel in
all three of these categories.
In this section we warn you about several pitfalls in comparing
entry counts published by different vendors, describe our entry
counting method, and present our current counts.
We link to this section from all parts of the app
and our website where we cite entry counts.
We want to make people aware of several surprising reasons
why judging by entry count alone will likely lead you into
trouble.
First, in the Thai learning space you should be aware that the vast
majority of Thai dictionary apps and programs today get their entries
solely from the large, free, public-domain LEXiTRON dataset from the
Thai government organization NECTEC. LEXiTRON was designed to help
Thai people learn English, not the other way around. While it is
fantastic that NECTEC released this free resource to the Thai public
to help Thai people, the dataset is full of errors that can easily
lead you into trouble, and the dataset lacks English
glosses (see app "Help" > "Using the App" > "How to Read Entries") and other annotations needed to
really be useful to an English speaker (does that Thai word for
"glass" that you found mean "pane of glass" or "drinking glass"?).
Free is nice, but you get what you pay for. Paiboon Publishing's
dictionary production team spent years hand-crafting our dictionary
and phrasebook datasets from scratch specifically with the needs of
English speakers in mind.
Second, entry counts themselves can be amazingly misleading. We've
noticed that different vendors use vastly different ways of counting
their entries, and at the same time no vendor is currently doing a
good job of explaining how their published counts are computed. The
differences in counting methods can easily make a 2-5x difference in
the final number, making comparisons meaningless.
To help resolve this situation so that our customers can make fair,
meaningful, informed comparisons, here we will present crucial issues
of entry counting that you may not even be aware of, and then we
will show our counts and exactly how we computed them. We encourage
all phrasebook and dictionary vendors to do the same.
We look at each of the counting issues from two useful perspectives:
does each method of counting reflect how useful the
dictionary/phrasebook is to actual customers, and does each method of
counting fairly reflect the amount of work the author had to do? In
other words, is the way of counting fair from the customer's point of
view, and is it fair from the author's point of view?
In this section we use our Thai phrasebook and
dictionary apps as an example, but we use the same counting method for
all our dictionary and phrasebook apps in any
language.
The main counting issues are:
Do you count the dictionary directions/sections (Thai-to-English vs. English-to-Thai) separately?
Clearly the answer is "yes" for traditional paper dictionaries: some
dictionaries only had one direction and some had two, and those with
two were clearly a lot more useful and should be counted as such,
especially since the total entry count corresponded directly to the
size and weight of the book: a critical factor for customers. But in
any case, dictionary authors need to be clear about their counting
method so that customers can compare meaningfully (e.g. a
bidirectional dictionary with 50,000 total entries actually has
roughly half as many Thai-to-English entries as a
single-direction dictionary with 50,000 entries).
When we move over to the modern realm of software dictionaries, things
become more complex, but surprisingly a lot of the same
issues remain.
For software, "space" is much less of an issue (space is now primarily
tied to sound recordings) and "weight" is a total non-issue.
Furthermore, because software dictionaries can support
full-text search (see app "Help" > "Using the App" > "Search" > "Search Modes" > "Power Search"), it
means that some dictionary authors can choose only to author in a
single direction (for example, they may spend all their authoring time
writing Thai-to-English entries only) and then rely
on full-text search to give their software dictionary the appearance
of having two directions.
Should the published entry count only include the direction they
authored, or both directions?
This is where we must understand a surprising and significant
harsh reality: authors who want to create a bidirectional
dictionary cannot simply author in one direction, flip the
Thai and English, re-sort the entries, and call it
done. In fact, even with the best high-tech authoring tools, authors
of bidirectional dictionaries need to do almost twice as much work as
single-direction dictionaries, and the benefit to customers of a true
bidirectional dictionary is also as much as double. This is because
of two surprising harsh realities of language:
- The first harsh reality is that in both languages, there are words
that simply have no translation in the other language: you need to
have a long phrase to explain the concept in the other
language. This includes many simple, everyday terms in both
languages like
"room service" and เกรงใจ [greeng-jai], terms that native speakers of each language
would never expect to lack direct
translations into the other language.
Authors who create single-direction dictionaries only have
to face that issue in one direction, and customers only get the benefit
of translations in one direction. Having a single-direction
app with full-text search won't improve the situation,
because you simply won't find
non-directly-translatable words like these when you search.
Authors who create true bidirectional
dictionaries need to write a lot more entries. In our experience
this can easily add 30-40% more entries that would not exist in the
other direction, and easily consume half the work,
since these entries are much more difficult to translate.
- The second harsh reality is that almost every English word
translates to more than one Thai word, and almost
every Thai word translates to more than one
different English word.
As just one typical example, English "fat"
translates either to Thai
อ้วน [ûuan]
or
มัน [man]
depending on the meaning, and
มัน [man]
translates back to English as "fat," "it,"
or "entertainingly interesting!"
Languages are like tangly spider webs
with links going off in totally different directions for
Thai-to-English vs.
English-to-Thai.
That means bidirectional paper dictionary authors (including
Paiboon Publishing) had to painstakingly choose which headwords and
translations to include in each section by balancing their expert
sense of how useful the word is in that direction against the
available space (since the cost of paper and weight of the book are
major factors for authors too). Just because a given pair of words
appears in the Thai-to-English section does not
mean that it should also appear in the English-to-Thai
section.
For example, you definitely want to include the Thai word
มัน [man]
in the Thai-to-English section as "fat," since
customers will hear Thai people use this common contraction and
will need to be able to look it up,
but in the English-to-Thai section under "fat,"
you want to include only
ไขมัน [kǎi-man] because that longer word has
fewer other meanings and so is a better choice for
the customer to use when speaking Thai.
This editing process is incredibly time-consuming.
When we move from paper to the software world, we have plenty of
space, so we can include every word in every direction, but that
doesn't actually solve the problem for the customer: if you look up
the word "eat" and you see 10 different Thai translations, of course
you will scream "Which word do I
choose? (see app "Help" > "Using the App" > "How to Read Entries") I just want to know how to say 'eat!'" In a true
bidirectional dictionary, the author will spend time ordering
the translations that appear under headwords in each section
so that the most general-purpose translation
is first, followed by other translations,
annotated with what makes their use more restricted (which, in the
case of the verb "eat," is likely to be their
word register (see app "Help" > "Speaking/Listening" > "Word Register")). And even for words
that have the same register, there is still a sense of which word is
the most common—the word choice that will make you sound more
native or will avoid problems for you.
This is just like English where "I went into the shop
yesterday" sounds more natural in most contexts than "I entered the
shop yesterday" so "went into" is the translation that should be
listed first in a dictionary meant for Thai people learning English.
You, as an end-customer, can
only get these extra benefits if the
dictionary author spent the huge amount of extra time needed to write
or hand-pick each translation separately in each direction.
Authoring in only one direction certainly saves a lot of time for the
author, but in reality it's not as useful to the customer, even if we
compare dictionaries with similar entry quality
and depth as defined above.
So when we only think about entry counts, and we ignore the crucial
reality of whether the author created a true bidirectional dictionary
dataset or just slapped full-text search on top of a single-direction
dataset, we do ourselves a great disservice.
In reality, true bidirectional datasets are both much harder for
the author to create, and much more useful for customers like you.
So the issue of whether a bidirectional dataset should have a higher
entry count than a single-directional dataset is really a red herring:
the real issue is that each software dictionary vendor must clearly
disclose whether the underlying dataset is truly bidirectional or not,
and provide entry and translation
counts for each section the vendor originally
authored.
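To make the "flip and re-sort" shortcut concrete, here is a minimal Python sketch. The tiny datasets, the flip helper, and the variable names are hypothetical illustrations only, not our production data or authoring tools. The sketch mechanically reverses a Thai-to-English section and compares the result with a hand-authored English-to-Thai section.

```python
from collections import defaultdict

# Hypothetical Thai-to-English section (toy data for illustration only).
thai_to_english = {
    "มัน": ["fat", "it"],
    "อ้วน": ["fat"],
    "ไขมัน": ["fat"],
}


def flip(section):
    """Mechanically reverse a section: every translation becomes a headword."""
    flipped = defaultdict(list)
    for headword, translations in section.items():
        for translation in translations:
            flipped[translation].append(headword)
    return dict(flipped)


auto_english_to_thai = flip(thai_to_english)

# Hypothetical hand-authored English-to-Thai section: the author chose which
# translations to list and in what order, and added an English term that has
# no single-word Thai equivalent (the Thai side is a placeholder here).
hand_english_to_thai = {
    "fat": ["อ้วน", "ไขมัน"],
    "it": ["มัน"],
    "room service": ["<long explanatory Thai phrase>"],
}

# Headwords that the mechanical flip can never produce.
missing_from_flip = sorted(set(hand_english_to_thai) - set(auto_english_to_thai))

print(auto_english_to_thai["fat"])  # arbitrary source order, nothing filtered
print(hand_english_to_thai["fat"])  # curated choice and order
print(missing_from_flip)
```

In this toy data the flipped section lists the ambiguous contraction มัน first under "fat" simply because of source order, and it has no "room service" entry at all. Those are the two gaps that separate authoring in each direction is meant to close.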
How do you count if the dictionary also has Thai Sound?
Things get even more tricky when you think about
three-way (see app "Help" > "Using the App" > "Search" > "Three Ways to Search") dictionaries:
those that let you
Search-By-Sound™ in a third Thai Sound section.
Paiboon Publishing created its industry-first three-way paper
dictionary in 1996, and all of Paiboon's software dictionaries
and apps have been three-way as well.
To understand the impact of three-way on counting methods, it helps
to name the three sections using kind of silly terminology compatible
with the discussion just above. A three-way dictionary has
these sections:
- English-to-Thai-script-and-sound
- Thai-script-to-English-and-Thai-sound
- Thai-sound-to-English-and-Thai-script
Should the third section (the
Thai Sound section)
count toward the entry count? Even if not,
should the fact that all sections now
have
Thai Sound count somehow?
The third section certainly adds a lot of utility for customers, and
it takes a huge amount of time for authors like us to add Thai Sound
pronunciation guides alongside each Thai Script word (especially in our
software apps where you can choose from 12
different systems
and our app keeps the entries sorted according to your system), so for
both of these reasons it makes sense to "credit" the third section in
the entry count. And for the paper volumes, for the practical reasons
of weight and size above it also makes sense to include the Thai Sound
section in entry count since that is such a crucial factor for both
authors and customers.
But again we are trying to wedge something that is fundamentally more
complicated than a single number into a single number, for no rational
reason. We have to get past our obsession over entry count, in
particular our desire to force a single number on each product for
comparison!
A much better system is for each dictionary vendor to state what
"sections" they have (e.g. English, Thai Sound, and Thai Script),
provide separate entry counts for these, and provide a total,
as we do below. That way, customers can make meaningful comparisons.
Do you count bold headwords (individually or together), translations, or what?
Further complicating matters is that a typical dictionary entry
doesn't just have one Thai word and one English word.
A single entry often has one or more bold headwords, and one or more
translations (plus glosses (see app "Help" > "Using the App" > "How to Read Entries"),
classifiers/measure words (see app "Help" > "Speaking/Listening" > "Classifiers"),
and other content).
Do you count each entry as 1? Do you count each entry according to
the number of bold headwords? Do you count each entry by the number
of translations? What about the extra depth
information like glosses,
classifiers, or categories, which may not even be present in other
dictionaries?
Currently, vendors count differently and do not disclose how they count,
which prevents you from making meaningful comparisons. They should
disclose their methods.
We give our exact counting method below.
Or do you count only unique headwords? Unique Words?
In all dictionary products you will have the same English or
Thai words appear more than once as translations,
and in some dictionary products you will have the same English or
Thai words appear as headwords.
For example, in many dictionaries, entries for the same headword are
split by parts of speech, classifiers, or other factors. You might
have one verb entry with English headword "can" and translations
meaning "able to," and another noun entry with English headword
"can" and translations meaning "metal container for liquid."
Should those multiple "can" entries be counted separately, or should
we count all the "can"s together as 1 (so that we're really counting
the number of "unique" headwords)? On the one hand, it's two separate
meanings so it should count as 2 since it's twice as useful to you
compared to a word that only has one meaning (and the author had to
do twice as much work to create it). But on the other hand,
it's only one English word.
Yet another way to count is to count the total number of unique words
in each language that appear in any entry (either as a headword or a
translation), counting each unique word once. But this method
will under-represent the very common words that have more meanings
and are used more often in our daily speech.
There are valid arguments for all these counting methods. Ultimately
it doesn't matter as long as you know two products you are comparing
use the same method, but...
Just like the issue above, currently vendors count differently and do
not disclose how they count, preventing you from making meaningful
comparisons.
We give our exact counting method below.
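As a rough illustration of how far these counting methods can diverge, the following Python sketch counts one toy dataset four ways. The Entry class, its fields, and the placeholder translation strings are assumptions made for this example only. They do not describe how any vendor stores or counts its data.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Entry:
    headwords: tuple      # bold headwords
    translations: tuple   # non-bold translations (classifiers excluded)
    part_of_speech: str


# Toy English-section entries; translation strings are placeholders.
entries = [
    Entry(("can",), ("thai_able_1", "thai_able_2"), "verb"),
    Entry(("can", "tin"), ("thai_container",), "noun"),
    Entry(("glass",), ("thai_drinking_glass",), "noun"),
    Entry(("glass",), ("thai_pane",), "noun"),
]


def count_entries(items):
    """One per entry, however many headwords or translations it has."""
    return len(items)


def count_headwords(items):
    """Every bold headword counted individually."""
    return sum(len(e.headwords) for e in items)


def count_translations(items):
    """Every translation counted individually."""
    return sum(len(e.translations) for e in items)


def count_unique_headwords(items):
    """Each distinct headword spelling counted once."""
    return len({h for e in items for h in e.headwords})


counts = {
    "entries": count_entries(entries),
    "bold headwords": count_headwords(entries),
    "translations": count_translations(entries),
    "unique headwords": count_unique_headwords(entries),
}
print(counts)
# {'entries': 4, 'bold headwords': 5, 'translations': 5, 'unique headwords': 3}
```

Even with four entries, the published "count" could honestly be 3, 4, or 5 depending on the method, which is why a number without a stated method cannot be compared.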
What about "glossary" and technical word padding?
Perhaps the single worst thing about comparing dictionaries using
entry counts is that most words are useless! That is, in
dictionaries with more than around 30,000 entries per section, the
remaining words are likely to be words you will never use. This can
include formal/technical variants of common words,
but it can also include
huge "glossary" laundry-lists of place names, biological species
(LEXiTRON in particular is rife with these),
names of kings, gods, and other historical or religious figures,
common surnames, and any other kind of lists that are easily added
to a dictionary dataset en masse without adding that much
value to most users.
At the very least, these "pad" entries unintentionally mislead you
into thinking a dictionary is more useful than it is. In the worst
case, the dictionary author may have put them in intentionally to
mislead you (in some cases an author can get a 20,000-50,000 entry
bump by simply downloading a few publicly available spreadsheets
from the internet and merging them into the product in a few hours).
The "pad" words are certainly of use to someone, and so should not be
deleted, but it is important for dictionary authors to disclose these
kinds of words when stating the entry count to customers, the majority
of whom will not get any benefit from the "pad" words.
For our Paiboon dictionaries, the closest thing we have to "pad" words
is the more than 53,000 entries that contain Thai place names.
Starting back in 2012 when we introduced the first place names, and
continuing until today,
we have never included the place name entries
in our entry counts, and have
noted this explicitly in our published stats.
What about phrasebook vs. dictionary?
Some apps combine the traditional
roles of dictionary and phrasebook. In some of these
dual-purpose apps,
such as ours, you can find phrasebook
phrases (see app "Help" > "Five Minute Tour" > "Start in the Search Screen") from the Search function that is traditionally meant for
the dictionary, and in some apps, you cannot.
So if the vendor publishes a single entry count for the whole app,
does that include both phrasebook and dictionary?
Where terms overlap between the two, are they double-counted?
What about the fact that phrasebooks typically only go from
English-to-Thai, whereas the dictionary may have
two or three directions as explained above? If the vendor sums
up the two/three directions for the dictionary entries, should
the vendor do the same for the phrasebook entries even if they
are only searchable as English-to-Thai?
As you can see, our old bad habit of trying to assign a single
number to a complex concept is hurting us again.
In reality, a vendor needs to disclose exactly how each
dataset (dictionary and phrasebook) was authored, how they are
searchable, and how any overlap is counted.
Paiboon Publishing Entry Counts
Ok, so with the really important background established above
(you didn't just scroll down here, did you?), we can get to the
numbers. It's really important to read the text above before
you try to interpret these numbers.
For the Paiboon phrasebook and dictionary apps:
- We define one "entry" as what appears between the thin horizontal
rules in the Search or Categories screen: a collection of one or more
bold headwords and one or more non-bold translations. So an entry
with multiple bold headwords still counts as 1 entry. And an entry
with multiple translations still counts as 1 entry. For counting
purposes, super phrases (see app "Help" > "Using the App" > "Placeholders"),
which may have tens
to thousands of possible variations depending on your choices, still
count as 1 entry (which, in and of itself, is an example of why using
a single number to judge an app is misleading!).
- In our dictionary dataset, it is possible to have multiple entries
with the same bold headword. This happens when the two entries have
different parts of speech (e.g. one "can" noun entry as in "metal
container" and one "can" verb entry as in "able to"), different
classifiers, or when the headword in question is grouped differently
with other headwords (e.g. one noun "can" entry meaning "slang word
for jail" and another noun "can, tin" entry meaning "metal
container"). Additionally, for English entries only, this happens
when the two English entries have a different
gloss (see app "Help" > "Using the App" > "How to Read Entries") (e.g. "glass (drinking)" vs "glass (pane)").
Finally, for
Thai entries only, this happens when two entries
have a different word register (see app "Help" > "Speaking/Listening" > "Word Register").
- We define one "translation" as a non-bold word that appears in an
entry in the opposite language from the bold headword,
not including classifiers.
So an entry with bold English headwords
that lists 3 Thai words (not
including classifiers) would count as 3 translations.
- Paiboon Publishing's dictionary dataset is a true bidirectional
dataset. Our dictionary team spent almost double the time because
we hand-edited entries in the Thai-to-English
direction separately from entries in the English-to-Thai
direction, and this gives you significant benefits, as explained above.
- Paiboon Publishing's phrasebook dataset is a single-directional
dataset because it was authored and
edited in an English-to-Thai
direction only. However, during phrasebook production, knowing
that our apps would support
full-text search (see app "Help" > "Using the App" > "Search" > "Search Modes" > "Power Search"),
we included
approximately 500-700 entries where the Thai is a common simple word
but the English is a long explanatory phrase, which normally one
would only have included during a bidirectional production process.
So our phrasebook dataset has some degree of bidirectionality.
- In the Categories screen (both phrasebook and dictionary
apps), all entries show with English
bold headwords, except for a handful of categories
where all entries have Thai Script bold headwords.
- In the Search screen (both phrasebook and dictionary
apps), entries will show with
English, Thai Script, or Thai Sound bold headwords depending on how you
search and depending on whether the entry itself has
English headwords or not. As explained above, the
Thai Script and Thai Sound entries are not simply reversals of the
English entries: our authoring team painstakingly
chose which translations to include for each entry, and
in what order.
- The Search screen can find any entry in the Categories screen,
including phrases and complete sentences found in Categories.
So Search has 100% overlap with what is in Categories.
- Our dictionary dataset includes a large number of "place name"
entries that include Thai district, subdistrict, and city names,
country names, and major world city names. We separate out these
place names in our counts below for reasons explained above.
- From the dictionary Search screen, if you search in English, you can reach:
  - 81,494 different entries total,
  - 10,625 of which are place name entries, so
  - 70,869 entries not including place names
- From the dictionary Search screen, if you search in Thai Script,
you can reach:
  - 84,547 different entries total,
  - 21,231 of which are place name entries, so
  - 63,316 entries not including place names
- From the dictionary Search screen, if you search in Thai Sound, the number
of entries you can reach is the same as Thai Script above
(there is a tiny difference because
a few Thai words have
multiple spellings or multiple pronunciations (see app "Help" > "Reading and Writing" > "Irregular Spellings"),
but the difference is not significant).
- To compute a total count for dictionary English, Thai Script, and
Thai Sound that is consistent with the way we have counted our
entries since the earliest paper dictionary, we add together the
English, Thai Script, and
Thai Sound numbers above:
  - 250,588 different entries total,
  - 53,087 of which are place name entries, so
  - 197,501 entries not including place names
In our dictionary app Help and marketing materials, we use
the total entry count 197,501 rounded down to 195,000 and the
total translation count 250,588 rounded down to 250,000, again
both without place names. When discussing the place names,
we round 53,087 down to 53,000. When using these numbers
we provide a link directly to this explanation wherever
possible.
- Of the total number of dictionary entries above, the following numbers of
entries include Thai classifiers/measure words
(if an entry has more than one classifier, it still counts as
1 in this count):
  - 117,816 different entries total,
  - 53,087 of which are place name entries, so
  - 64,729 entries not including place names
So we have "60,000+" noun entries with classifiers.
- In our phrasebook dataset (which appears inside both phrasebook
and dictionary apps), we have:
  - 1,356 total categories (including the top-level home category),
    1,242 of which are
    "leaf" categories containing only items and no sub-categories
  - 1,003 and 926 of which are place name categories, respectively
  - 353 and 316 of which are non-place-name categories, respectively
So we round down our total of 316 non-place-name, leaf categories to
300 and publish that we have "300+" categories. Note we have
taken the most conservative possible interpretation of Category
count.
- Inside our categories, we have the following number of "items,"
where an item is an entry, a link to the Help, or a cross-reference to
another category that begins with See: (but not a subcategory):
  - 23,703 different items total,
  - 10,612 of which are place name entries, so
  - 13,091 items not including place names
- Inside our categories, we have the following number of entries (so
this does not count links, cross-references, or subcategories):
  - 23,514 different entries total,
  - 10,612 of which are place name entries, so
  - 12,902 entries not including place names
So we round down our total of 12,902 non-place-name entries to 12,000
and publish that we have "12,000+" entries visible from the Categories
screen.
Important: unlike the Search figures above, in the Categories screen
you do not get to choose whether you see entries with English,
Thai Script, or Thai Sound bold headwords, nor did the phrasebook
authoring team spend time separately creating and editing
English-to-Thai and
Thai-to-English entries as we did for the
dictionary dataset (with the exception of the
500-700 phrasebook entries mentioned above).
So the Categories item and entry
figures above are not a sum of English + Thai Script +
Thai Sound as with Search. Keep this in mind when comparing how much
overlap there is between Search and Categories.
- In our phrasebook apps specifically
(as opposed to our dictionary apps),
there are around 4,000 entries that
are searchable from the Search screen but which cannot be found in
any category of the Categories screen. These include the most
useful and common words like "is" and "good" and "person" which
don't belong in any category.
So, adding together the "12,000+" entries visible from the Categories
screen (see above) with these additional 4,000 Search-screen-only
entries, we arrive at the published "16,000+" figure for the
total number of entries in our phrasebook apps.
- In our phrasebook dataset there are a few categories outside
the place name categories with a large number of rare entries.
These categories have far fewer entries than the place name
categories, so we didn't subtract them out from our entry counts above,
but you may want to consider them when evaluating entry count.
The main ones are
"Categories" > "Glossary" > "Nature" > "Animals",
"Categories" > "Glossary" > "Nature" > "Birds of Thailand",
and
"Categories" > "Glossary" > "Nature" > "Ornamental Fish".
There are also other "Categories" > "Glossary" categories you can
browse, but most of them are either very small, contain common words,
or both.
- Our dictionary contains 60,484
high-quality Thai
sound recordings (see app "Help" > "Using the App" > "Hearing Words")
totalling 29 hours and 58 minutes of audio. This includes the
Thai words used as headwords and translations
in entries as well as the sample sounds used in the
"Help" > "Speaking/Listening"
and "Help" > "Reading and Writing"
sections.
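The arithmetic behind the dictionary Search and Categories figures above can be checked with a few lines of Python. This is only a sketch: the variable names are ours for illustration, and the Thai Sound figures are assumed to equal the Thai Script figures, as described above.

```python
# Per-section figures as listed above: (total entries, place name entries).
sections = {
    "English": (81_494, 10_625),
    "Thai Script": (84_547, 21_231),
    "Thai Sound": (84_547, 21_231),  # assumed equal to Thai Script
}

# Each section's count without place names.
without_place_names = {
    name: total - places for name, (total, places) in sections.items()
}

# Sum across the three sections, as in the paper dictionaries.
total_entries = sum(total for total, _ in sections.values())
total_place_names = sum(places for _, places in sections.values())
total_without_place_names = total_entries - total_place_names

# Categories screen: entries minus place name entries.
category_entries = 23_514
category_place_names = 10_612
category_without_place_names = category_entries - category_place_names

print(without_place_names)
print(total_entries, total_place_names, total_without_place_names)
print(category_without_place_names)
```

Keeping the per-section pairs separate until the final step makes it easy to publish both the section counts and the total, which is the disclosure we recommend to other vendors.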
The long-term solution to allow customers like you to meaningfully
compare dictionary/phrasebook product entry counts is:
- First, each vendor should publish their detailed counts and
counting method, as we did above, so at least we know what we have and
whether it can be compared.
- Then, the next step is for the vendors to pick one or more common
counting methods and all count their products using the same method(s)
so as to benefit customers. We can choose a variety of methods to
fairly highlight the strengths of each product.