java.lang.Object
  |
  +--org.thdl.tib.text.tshegbar.UnicodeGraphemeCluster
A UnicodeGraphemeCluster is either a non-Tibetan codepoint (such
as whitespace or control characters or a Latin "character"), or a
vertically stacked set of Tibetan consonants, vowels, marks, and
signs. The Unicode string "\u0F40\u0F0B\u0F41\u0F0B" specifies
four UnicodeGraphemeClusters (the name of the Tibetan alphabet,
you might notice), while the Unicode string
"\u0F66\u0FA5\u0F39\u0F90\u0FB5\u0F71\u0F80\u0F7F"
is one Tibetan stack, sa over fa over ka over Sha with an a-chung,
a reversed gi-gu, and a visarga, plus a ngas-bzung-sgor-rtags mark
underneath all of that. I assume the latter grapheme cluster is
nonsense, but it is considered one grapheme cluster because all
but the first char are combining chars. See Unicode Technical
Report 29.
As the above example demonstrates, not all UnicodeGraphemeClusters are syntactically legal in the Tibetan language. Not all of them are syntactically legal in Sanskrit transcribed in the Tibetan alphabet, either.
The Unicode 3.2 standard (see especially Technical Report 29)
refers to "grapheme clusters." A UnicodeGraphemeCluster is
precisely a grapheme cluster as described by that standard. We
interpret the standard as saying that U+0F3E and U+0F3F are each
grapheme clusters unto themselves, even though they are combining
codepoints.
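The cluster rule described above (a base codepoint followed by any number of combining codepoints, with U+0F3E and U+0F3F standing alone) can be illustrated with a minimal sketch. The class GraphemeClusterSketch and its methods are hypothetical names, not part of org.thdl.tib.text.tshegbar, and the sketch approximates "combining codepoint" by the Unicode general categories Mn, Mc, and Me as reported by java.lang.Character. It performs no validation, so a string that begins with a combining codepoint is simply treated as starting a cluster.

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical illustration only; not the THDL implementation. */
class GraphemeClusterSketch {

    /** Returns true if cp attaches to the cluster begun by an earlier codepoint. */
    static boolean continuesCluster(int cp) {
        // Per the class description, U+0F3E and U+0F3F are treated as
        // clusters unto themselves even though they are combining codepoints.
        if (cp == 0x0F3E || cp == 0x0F3F) {
            return false;
        }
        // Assumption: "combining" means general category Mn, Mc, or Me.
        int type = Character.getType(cp);
        return type == Character.NON_SPACING_MARK
            || type == Character.COMBINING_SPACING_MARK
            || type == Character.ENCLOSING_MARK;
    }

    /** Splits s into substrings, each one base codepoint plus its combining codepoints. */
    static List<String> split(String s) {
        List<String> clusters = new ArrayList<>();
        StringBuilder current = new StringBuilder();
        for (int i = 0; i < s.length(); ) {
            int cp = s.codePointAt(i);
            // A non-combining codepoint closes the cluster in progress.
            if (current.length() > 0 && !continuesCluster(cp)) {
                clusters.add(current.toString());
                current.setLength(0);
            }
            current.appendCodePoint(cp);
            i += Character.charCount(cp);
        }
        if (current.length() > 0) {
            clusters.add(current.toString());
        }
        return clusters;
    }

    public static void main(String[] args) {
        // ka, tsheg, kha, tsheg: four clusters
        System.out.println(split("\u0F40\u0F0B\u0F41\u0F0B").size());
        // one base consonant followed only by combining codepoints: one cluster
        System.out.println(split("\u0F66\u0FA5\u0F39\u0F90\u0FB5\u0F71\u0F80\u0F7F").size());
    }
}
```

Under these assumptions the two strings from the class description split into four clusters and one cluster respectively, matching the counts given in the text.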
Field Summary |
Fields inherited from interface org.thdl.tib.text.tshegbar.UnicodeConstants |
EW_ABSENT, EW_achung, EWC_a, EWC_achen, EWC_ba, EWC_ca, EWC_cha, EWC_da, EWC_dza, EWC_ga, EWC_ha, EWC_ja, EWC_ka, EWC_kha, EWC_la, EWC_ma, EWC_na, EWC_nga, EWC_nya, EWC_pa, EWC_pha, EWC_ra, EWC_sa, EWC_sha, EWC_ta, EWC_tha, EWC_tsa, EWC_tsha, EWC_wa, EWC_ya, EWC_za, EWC_zha, EWSUB_la_btags, EWSUB_ra_btags, EWSUB_wa_zur, EWSUB_ya_btags, EWV_e, EWV_i, EWV_o, EWV_u, NORM_NFC, NORM_NFD, NORM_NFKC, NORM_NFKD, NORM_NFTHDL, NORM_UNNORMALIZED |
Constructor Summary | |
UnicodeGraphemeCluster(String unicodeString)
Creates a new GraphemeCluster given a legal sequence of Unicode codepoints corresponding to a single grapheme cluster. |
Method Summary | |
static int |
breakUnicodeIntoGraphemeClusters(Vector grcls,
String unicode,
boolean validate,
boolean correctErrors)
Given some (possibly unnormalized) Unicode 3.2 string unicode, appends grapheme clusters to the vector of GraphemeClusters grcls if grcls is non-null. |
String |
getThdlWylie(boolean needsVowel)
Returns the THDL Extended Wylie transliteration of this grapheme cluster, or null if there is none (which happens for a few Tibetan codepoints, if you'll recall). |
String |
getTopToBottomCodepoints()
FIXMEDOC |
String |
getUnicodeRepresentation()
Returns a string of codepoints in NFTHDL form. |
boolean |
hasUnicodeRepresentation()
Returns true. |
boolean |
isLegalTibetan()
Returns true iff this stack could occur in syntactically correct, run-of-the-mill Tibetan (as opposed to Tibetanized Sanskrit, Chinese, et cetera). |
boolean |
isTibetan()
DLC SOON |
String |
toConciseXML()
Returns a |
String |
toVerboseXML()
Returns a |
Methods inherited from class java.lang.Object |
clone, equals, finalize, getClass, hashCode, notify, notifyAll, toString, wait, wait, wait |
Constructor Detail |
public UnicodeGraphemeCluster(String unicodeString) throws IllegalArgumentException
Throws:
IllegalArgumentException - if unicodeString is not a
syntactically correct Unicode 3.2 sequence (if it begins with
a combining codepoint or has a Tibetan vowel before another
combining character, for example), or if it is more than one
grapheme cluster. Note that syntactical correctness for
non-Tibetan codepoints is not likely to be known by this
routine.
Method Detail |
public String getUnicodeRepresentation()
Specified by: getUnicodeRepresentation in interface UnicodeReadyThunk
public boolean hasUnicodeRepresentation()
Specified by: hasUnicodeRepresentation in interface UnicodeReadyThunk
public boolean isLegalTibetan()
public String toConciseXML()
public String toVerboseXML()
public String getThdlWylie(boolean needsVowel)
"\u0F56\u0F4F\u0F42\u0F66", and needsVowel must be set to false for
all but the grapheme cluster corresponding to \u0F4F if you wish
to get the preferred THDL Extended Wylie.
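The effect of needsVowel on the string "\u0F56\u0F4F\u0F42\u0F66" can be sketched as follows. This is a toy model, not the THDL implementation: the class WylieVowelSketch, its four-entry consonant table, and the helper transliterate are all hypothetical, and the sketch assumes that needsVowel simply controls whether the inherent vowel "a" is appended to a consonant's Wylie form.

```java
import java.util.HashMap;
import java.util.Map;

/** Hypothetical illustration only; not the THDL implementation. */
class WylieVowelSketch {

    // Assumed Wylie forms for just the four consonants in the example string.
    private static final Map<Integer, String> CONSONANTS = new HashMap<>();
    static {
        CONSONANTS.put(0x0F56, "b");
        CONSONANTS.put(0x0F4F, "t");
        CONSONANTS.put(0x0F42, "g");
        CONSONANTS.put(0x0F66, "s");
    }

    /** Returns the Wylie for one consonant, or null if this toy table has no entry. */
    static String wylie(int cp, boolean needsVowel) {
        String base = CONSONANTS.get(cp);
        if (base == null) {
            return null;
        }
        // Assumption: needsVowel appends the inherent vowel "a".
        return needsVowel ? base + "a" : base;
    }

    /** Joins per-consonant Wylie, passing needsVowel = true only at rootIndex. */
    static String transliterate(String syllable, int rootIndex) {
        StringBuilder out = new StringBuilder();
        int index = 0;
        for (int i = 0; i < syllable.length(); index++) {
            int cp = syllable.codePointAt(i);
            String piece = wylie(cp, index == rootIndex);
            if (piece == null) {
                throw new IllegalArgumentException(
                    "no Wylie in this sketch for U+" + Integer.toHexString(cp));
            }
            out.append(piece);
            i += Character.charCount(cp);
        }
        return out.toString();
    }

    public static void main(String[] args) {
        // Only the cluster for U+0F4F (index 1) gets the inherent vowel.
        System.out.println(transliterate("\u0F56\u0F4F\u0F42\u0F66", 1));
    }
}
```

With needsVowel true only for the cluster corresponding to \u0F4F, the pieces "b", "ta", "g", and "s" join to "btags". Setting it to true for every cluster would instead give "batagasa" in this toy model.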
public static int breakUnicodeIntoGraphemeClusters(Vector grcls, String unicode, boolean validate, boolean correctErrors) throws IllegalArgumentException, NullPointerException
Throws:
BadTibetanUnicodeException - if the unicode is not syntactically legal
IllegalArgumentException - if correctErrors and validate are both true
NullPointerException - if unicode is null
public String getTopToBottomCodepoints()
public boolean isTibetan()
Specified by: isTibetan in interface UnicodeReadyThunk