java.lang.Object
  |
  +--org.thdl.tib.text.tshegbar.UnicodeUtils
This non-instantiable class contains utility routines for dealing with Tibetan Unicode codepoints and strings of such codepoints.
Field Summary |
Fields inherited from interface org.thdl.tib.text.tshegbar.UnicodeConstants |
EW_ABSENT, EW_achung, EWC_a, EWC_achen, EWC_ba, EWC_ca, EWC_cha, EWC_da, EWC_dza, EWC_ga, EWC_ha, EWC_ja, EWC_ka, EWC_kha, EWC_la, EWC_ma, EWC_na, EWC_nga, EWC_nya, EWC_pa, EWC_pha, EWC_ra, EWC_sa, EWC_sha, EWC_ta, EWC_tha, EWC_tsa, EWC_tsha, EWC_wa, EWC_ya, EWC_za, EWC_zha, EWSUB_la_btags, EWSUB_ra_btags, EWSUB_wa_zur, EWSUB_ya_btags, EWV_e, EWV_i, EWV_o, EWV_u, NORM_NFC, NORM_NFD, NORM_NFKC, NORM_NFKD, NORM_NFTHDL, NORM_UNNORMALIZED |
Method Summary | |
static boolean |
containsRa(char unicodeCP)
Inefficient shortcut. |
static boolean |
containsRa(String unicodeString)
Returns true iff there exists at least one codepoint cp in unicodeString such that cp is ra or contains ra (like \u0F77). |
static boolean |
isDiscouraged(char tibetanUnicodeCP)
Returns true iff tibetanUnicodeCP is a Tibetan codepoint and the Unicode 3.2 standard discourages its use. |
static boolean |
isEntirelyTibetanUnicode(String unicodeString)
Returns true iff unicodeString consists only of codepoints from the Unicode range U+0F00-U+0FFF. |
static boolean |
isInTibetanRange(char unicodeCP)
Returns true iff unicodeCP is a codepoint from the Unicode range U+0F00-U+0FFF. |
static boolean |
isNonSubjoinedConsonant(char x)
Returns true iff x is a Unicode codepoint that represents a consonant or two-consonant stack that has a Unicode code point. |
static boolean |
isPreferredFormOfConsonant(char x)
Returns true iff x is the preferred representation of a Tibetan or Sanskrit consonant and cannot be broken down any further. |
static boolean |
isRa(char ch)
Returns true iff ch corresponds to the Tibetan letter ra. |
static boolean |
isSubjoinedConsonant(char x)
Returns true iff x is a Unicode codepoint that represents a subjoined consonant or subjoined two-consonant stack that has a Unicode code point. |
static boolean |
isTibetanConsonant(char cp)
Returns true iff cp is a Unicode 3.2 Tibetan consonant, subjoined or not. |
static boolean |
isWa(char ch)
Returns true iff ch corresponds to the Tibetan letter wa. |
static boolean |
isYa(char ch)
Returns true iff ch corresponds to the Tibetan letter ya. |
static void |
toMostlyDecomposedUnicode(StringBuffer tibetanUnicode, byte normForm)
Puts the Tibetan codepoints in tibetanUnicode, a sequence of Unicode codepoints, into either Normalization Form KD (NFKD), D (NFD), or THDL (NFTHDL), depending on the value of normForm. |
static String |
toMostlyDecomposedUnicode(String tibetanUnicode, byte normForm)
Like toMostlyDecomposedUnicode(StringBuffer, byte), but does not modify its input. |
static String |
toNormalizedForm(char tibetanUnicodeCP, byte normalizationForm)
There are 19 codepoints in the Tibetan range of Unicode 3.2 which can be decomposed into longer strings of codepoints in the Tibetan range of Unicode. |
static String |
unicodeCodepointToString(char cp)
Returns a human-readable, ASCII form of the Unicode codepoint cp. |
static String |
unicodeStringToString(String s)
|
Methods inherited from class java.lang.Object |
clone, equals, finalize, getClass, hashCode, notify, notifyAll, toString, wait, wait, wait |
Method Detail |
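public static boolean isNonSubjoinedConsonant(char x)
Returns true iff x is a Unicode codepoint that represents a consonant or two-consonant stack that has a Unicode code point. Returns true for ordinary consonants (like \u0F40), for Sanskrit consonants (like \u0F71), and for the simple two-consonant stacks in Unicode (like \u0F43). Returns false for, among other things, subjoined consonants like \u0F90.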
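public static boolean isSubjoinedConsonant(char x)
Returns true iff x is a Unicode codepoint that represents a subjoined consonant or subjoined two-consonant stack that has a Unicode code point. Returns true for ordinary subjoined consonants (like \u0F90), for Sanskrit consonants (like \u0F9C), and for the simple two-consonant stacks in Unicode (like \u0FAC). Returns false for, among other things, non-subjoined consonants like \u0F40.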
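public static boolean isPreferredFormOfConsonant(char x)
Returns true iff x is the preferred representation of a Tibetan or Sanskrit consonant and cannot be broken down any further. Returns false for, among other things, subjoined consonants like \u0F90, two-component consonants like \u0F43, and fixed-form consonants like \u0F6A. The new consonants (for transcribing Chinese, I believe) "\u0F55\u0F39" (which EWTS calls "fa"), "\u0F56\u0F39" ("va"), and "\u0F5F\u0F39" ("Dza") are two-codepoint sequences, so a single char cannot represent them, but you should be aware of them also.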
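Because the three sequences above span two chars, no char-based predicate can see them. The following is a minimal sketch of how a caller might spot them while scanning a string. The class and method names are hypothetical and are not part of UnicodeUtils; the only facts relied on are the three sequences listed above and that U+0F39 is the tsa-phru mark.

```java
/** Standalone sketch; not part of UnicodeUtils. */
final class TsaPhruSketch {
    private TsaPhruSketch() { }

    /** U+0F39, the tsa-phru mark that follows the base letter. */
    static final char TSA_PHRU = '\u0F39';

    /**
     * True iff s has one of the three two-codepoint consonants
     * ("fa", "va", "Dza" in EWTS) starting at index i.
     */
    static boolean isTwoCodepointConsonantAt(String s, int i) {
        if (i < 0 || i + 1 >= s.length() || s.charAt(i + 1) != TSA_PHRU) {
            return false;
        }
        char base = s.charAt(i);
        return base == '\u0F55' || base == '\u0F56' || base == '\u0F5F';
    }

    public static void main(String[] args) {
        System.out.println(isTwoCodepointConsonantAt("\u0F55\u0F39", 0)); // true
        System.out.println(isTwoCodepointConsonantAt("\u0F55\u0F72", 0)); // false
        System.out.println(isTwoCodepointConsonantAt("\u0F55", 0));       // false
    }
}
```

A scanner that finds such a pair would normally advance by two chars instead of one.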
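public static boolean isInTibetanRange(char unicodeCP)
Returns true iff unicodeCP is a codepoint from the Unicode range U+0F00-U+0FFF.
See Also: isEntirelyTibetanUnicode(String)
public static boolean isEntirelyTibetanUnicode(String unicodeString)
Returns true iff unicodeString consists only of codepoints from the Unicode range U+0F00-U+0FFF.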
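As a rough illustration of these two range checks, the standalone sketch below reimplements them from their one-line descriptions. The class name TibetanRangeSketch is hypothetical. Returning true for the empty string is an assumption, since the documentation does not say what UnicodeUtils does in that case.

```java
/** Standalone sketch; not the UnicodeUtils source. */
final class TibetanRangeSketch {
    private TibetanRangeSketch() { } // non-instantiable, like UnicodeUtils

    /** True iff cp lies in the range U+0F00-U+0FFF. */
    static boolean isInTibetanRange(char cp) {
        return cp >= '\u0F00' && cp <= '\u0FFF';
    }

    /** True iff every char of s lies in U+0F00-U+0FFF (assumption: true for ""). */
    static boolean isEntirelyTibetanUnicode(String s) {
        for (int i = 0; i < s.length(); i++) {
            if (!isInTibetanRange(s.charAt(i))) {
                return false;
            }
        }
        return true;
    }

    public static void main(String[] args) {
        System.out.println(isInTibetanRange('\u0F40'));               // true
        System.out.println(isEntirelyTibetanUnicode("\u0F40\u0F0B")); // true
        System.out.println(isEntirelyTibetanUnicode("ka \u0F40"));    // false
    }
}
```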
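public static void toMostlyDecomposedUnicode(StringBuffer tibetanUnicode, byte normForm)
Puts the Tibetan codepoints in tibetanUnicode, a sequence of Unicode codepoints, into either Normalization Form KD (NFKD), D (NFD), or THDL (NFTHDL), depending on the value of normForm. NFTHDL is needed for org.thdl.tib.text.tshegbar#UnicodeGraphemeCluster because NFKD normalizes U+0F0C and neither NFD nor NFKD breaks down U+0F00 into its constituent codepoints. NFTHDL uses a maximum of codepoints, and it never uses codepoints whose use has been discouraged.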
The Tibetan passages of the resulting string are in the chosen normalized form, but codepoints outside of the range U+0F00-U+0FFF are not necessarily put into normalized form.
Recall that normalized forms are not necessarily closed under string concatenation, but are closed under substringing.
Note well that only well-formed input guarantees well-formed output.
Parameters:
tibetanUnicode - the codepoints to be decomposed
normForm - NORM_NFKD, NORM_NFTHDL, or NORM_NFD
public static String toMostlyDecomposedUnicode(String tibetanUnicode, byte normForm)
Like toMostlyDecomposedUnicode(StringBuffer, byte), but does not modify its input. Instead, it returns the NFKD- or NFD-normalized version of tibetanUnicode.
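The remarks about U+0F0C and U+0F00 can be checked against the standard forms with the JDK's own java.text.Normalizer. The sketch below does not call UnicodeUtils, and the class name is hypothetical. Normalizer follows the Unicode version bundled with the JDK, not Unicode 3.2, and it knows nothing about NFTHDL, so this shows only what NFD and NFKD do.

```java
import java.text.Normalizer;

/** Standalone sketch using the JDK normalizer; not the UnicodeUtils source. */
final class TibetanNormalizationSketch {
    private TibetanNormalizationSketch() { }

    static String nfd(String s) {
        return Normalizer.normalize(s, Normalizer.Form.NFD);
    }

    static String nfkd(String s) {
        return Normalizer.normalize(s, Normalizer.Form.NFKD);
    }

    /** Formats each char as U+XXXX so results are readable in ASCII. */
    static String hex(String s) {
        StringBuilder sb = new StringBuilder();
        for (int i = 0; i < s.length(); i++) {
            if (i > 0) {
                sb.append(' ');
            }
            sb.append(String.format("U+%04X", (int) s.charAt(i)));
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        // U+0F43 has a canonical decomposition, so NFD already splits it.
        System.out.println(hex(nfd("\u0F43")));  // U+0F42 U+0FB7
        // U+0F0C has only a compatibility mapping: NFKD changes it, NFD does not.
        System.out.println(hex(nfd("\u0F0C")));  // U+0F0C
        System.out.println(hex(nfkd("\u0F0C"))); // U+0F0B
        // U+0F00 has no decomposition mapping, so both forms leave it alone.
        System.out.println(hex(nfkd("\u0F00"))); // U+0F00
    }
}
```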
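public static String toNormalizedForm(char tibetanUnicodeCP, byte normalizationForm)
There are 19 codepoints in the Tibetan range of Unicode 3.2 which can be decomposed into longer strings of codepoints in the Tibetan range of Unicode.
Parameters:
tibetanUnicodeCP - the codepoint to normalize
normalizationForm - NORM_NFTHDL, NORM_NFKD, or NORM_NFD if you expect something nontrivial to happen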
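The figure of 19 decomposable codepoints quoted for toNormalizedForm can be compared with what the JDK reports. The sketch below lists every char in U+0F00-U+0FFF whose NFKD form is longer than one char. It is an independent check with a hypothetical class name, it relies on the JDK's Unicode data and not on Unicode 3.2, and it does not cover NFTHDL.

```java
import java.text.Normalizer;
import java.util.ArrayList;
import java.util.List;

/** Standalone sketch using the JDK normalizer; not the UnicodeUtils source. */
final class DecomposableTibetanSketch {
    private DecomposableTibetanSketch() { }

    /** Tibetan-range chars whose NFKD form is longer than the char itself. */
    static List<Character> lengtheningCodepoints() {
        List<Character> result = new ArrayList<>();
        for (char cp = '\u0F00'; cp <= '\u0FFF'; cp++) {
            String nfkd = Normalizer.normalize(String.valueOf(cp), Normalizer.Form.NFKD);
            if (nfkd.length() > 1) {
                result.add(cp);
            }
        }
        return result;
    }

    public static void main(String[] args) {
        List<Character> found = lengtheningCodepoints();
        for (char cp : found) {
            System.out.println(String.format("U+%04X", (int) cp));
        }
        System.out.println(found.size() + " codepoints lengthen under NFKD");
    }
}
```

U+0F0C does not appear in the list: NFKD maps it to a single different codepoint, so the string changes without getting longer.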
public static boolean isDiscouraged(char tibetanUnicodeCP)
Returns true iff tibetanUnicodeCP is a Tibetan codepoint and the Unicode 3.2 standard discourages its use.
public static boolean isRa(char ch)
Returns true iff ch corresponds to the Tibetan letter ra. Several codepoints do; \u0F62 is thought of as the nominal representation. Returns false for some codepoints that contain ra but are not merely ra, such as \u0F77.
public static boolean isWa(char ch)
Returns true iff ch corresponds to the Tibetan letter wa. Several codepoints do; \u0F5D is thought of as the nominal representation.
public static boolean isYa(char ch)
Returns true iff ch corresponds to the Tibetan letter ya. Several codepoints do; \u0F61 is thought of as the nominal representation.
public static boolean containsRa(String unicodeString)
Returns true iff there exists at least one codepoint cp in unicodeString such that cp is ra or contains ra (like \u0F77). This method is not as fast as it could be: it calls the canonicalization code in order to maximize reuse and minimize the possibility of coder error.
public static boolean containsRa(char unicodeCP)
Inefficient shortcut.
See Also: containsRa(String)
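The "decompose first, then look for ra" approach described for containsRa(String) can be sketched with the JDK normalizer standing in for the library's canonicalization code. Everything here is an assumption made for illustration: the class name, the use of NFKD, and in particular the set of four codepoints treated as ra (ra, fixed-form ra, and their subjoined forms), which the documentation does not enumerate.

```java
import java.text.Normalizer;

/** Standalone sketch; not the UnicodeUtils source. */
final class ContainsRaSketch {
    private ContainsRaSketch() { }

    /** Assumed ra set: U+0F62, U+0F6A, U+0FB2, U+0FBC. */
    static boolean isRa(char ch) {
        return ch == '\u0F62' || ch == '\u0F6A' || ch == '\u0FB2' || ch == '\u0FBC';
    }

    /** Decomposes s with NFKD, then checks each resulting char with isRa. */
    static boolean containsRa(String s) {
        String decomposed = Normalizer.normalize(s, Normalizer.Form.NFKD);
        for (int i = 0; i < decomposed.length(); i++) {
            if (isRa(decomposed.charAt(i))) {
                return true;
            }
        }
        return false;
    }

    /** Shortcut for a single char; delegates to the String version. */
    static boolean containsRa(char cp) {
        return containsRa(String.valueOf(cp));
    }

    public static void main(String[] args) {
        System.out.println(isRa('\u0F77'));       // false: not merely ra
        System.out.println(containsRa('\u0F77')); // true: decomposes to a string with U+0FB2
        System.out.println(containsRa("\u0F40")); // false
    }
}
```

Reusing the normalizer costs a string allocation per call, which is the trade-off the description of containsRa(String) accepts in exchange for not maintaining a second table of ra-bearing codepoints.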
public static String unicodeCodepointToString(char cp)
Returns a human-readable, ASCII form of the Unicode codepoint cp.
public static String unicodeStringToString(String s)
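The exact output format of these two methods is not documented here. As a hedged illustration of what a human-readable ASCII rendering might look like, the sketch below prints each char as U+XXXX and separates chars with a space. The class name, method names, and format are all assumptions, not the library's behavior.

```java
/** Standalone sketch; the real UnicodeUtils format may differ. */
final class CodepointDumpSketch {
    private CodepointDumpSketch() { }

    /** Renders one char as U+ followed by four uppercase hex digits. */
    static String codepointToString(char cp) {
        return String.format("U+%04X", (int) cp);
    }

    /** Renders each char of s and joins the results with single spaces. */
    static String stringToString(String s) {
        StringBuilder sb = new StringBuilder();
        for (int i = 0; i < s.length(); i++) {
            if (i > 0) {
                sb.append(' ');
            }
            sb.append(codepointToString(s.charAt(i)));
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        System.out.println(codepointToString('\u0F40'));     // U+0F40
        System.out.println(stringToString("\u0F40\u0F0B"));  // U+0F40 U+0F0B
    }
}
```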
public static boolean isTibetanConsonant(char cp)
Returns true iff cp is a Unicode 3.2 Tibetan consonant, subjoined or not. This counts precomposed consonants like U+0FA7 as consonants. If you don't wish to treat these as consonants, put the input into NORM_NFD, NORM_NFKD, or NORM_NFTHDL first. If it changes under such a normalization, it is a precomposed consonant.
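The "does it change under normalization" test can be sketched with the JDK normalizer in place of the library's own forms. This is a minimal illustration with a hypothetical class name. It uses NFD only, and it assumes the caller has already established that cp is a consonant.

```java
import java.text.Normalizer;

/** Standalone sketch using the JDK normalizer; not the UnicodeUtils source. */
final class PrecomposedConsonantSketch {
    private PrecomposedConsonantSketch() { }

    /** True iff NFD rewrites cp, i.e. cp has a canonical decomposition. */
    static boolean changesUnderNfd(char cp) {
        String original = String.valueOf(cp);
        String normalized = Normalizer.normalize(original, Normalizer.Form.NFD);
        return !normalized.equals(original);
    }

    public static void main(String[] args) {
        System.out.println(changesUnderNfd('\u0FA7')); // true: precomposed
        System.out.println(changesUnderNfd('\u0F40')); // false: already atomic
    }
}
```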