org.thdl.tib.text.tshegbar
Class UnicodeUtils

java.lang.Object
  |
  +--org.thdl.tib.text.tshegbar.UnicodeUtils
All Implemented Interfaces:
UnicodeConstants

public class UnicodeUtils
extends Object
implements UnicodeConstants

This non-instantiable class contains utility routines for dealing with Tibetan Unicode codepoints and strings of such codepoints.

Author:
David Chandler

Field Summary
 
Fields inherited from interface org.thdl.tib.text.tshegbar.UnicodeConstants
EW_ABSENT, EW_achung, EWC_a, EWC_achen, EWC_ba, EWC_ca, EWC_cha, EWC_da, EWC_dza, EWC_ga, EWC_ha, EWC_ja, EWC_ka, EWC_kha, EWC_la, EWC_ma, EWC_na, EWC_nga, EWC_nya, EWC_pa, EWC_pha, EWC_ra, EWC_sa, EWC_sha, EWC_ta, EWC_tha, EWC_tsa, EWC_tsha, EWC_wa, EWC_ya, EWC_za, EWC_zha, EWSUB_la_btags, EWSUB_ra_btags, EWSUB_wa_zur, EWSUB_ya_btags, EWV_e, EWV_i, EWV_o, EWV_u, NORM_NFC, NORM_NFD, NORM_NFKC, NORM_NFKD, NORM_NFTHDL, NORM_UNNORMALIZED
 
Method Summary
static boolean containsRa(char unicodeCP)
          Inefficient shortcut for containsRa(String) on a single codepoint.
static boolean containsRa(String unicodeString)
          Returns true iff there exists at least one codepoint cp in unicodeString such that cp is ra or contains ra (like \u0F77).
static boolean isDiscouraged(char tibetanUnicodeCP)
          Returns true iff tibetanUnicodeCP is a Tibetan codepoint whose use the Unicode 3.2 standard discourages.
static boolean isEntirelyTibetanUnicode(String unicodeString)
          Returns true iff unicodeString consists only of codepoints from the Unicode range U+0F00-U+0FFF.
static boolean isInTibetanRange(char unicodeCP)
          Returns true iff unicodeCP is a codepoint from the Unicode range U+0F00-U+0FFF.
static boolean isNonSubjoinedConsonant(char x)
          Returns true iff x is a Unicode codepoint that represents a consonant or two-consonant stack that has a Unicode code point.
static boolean isPreferredFormOfConsonant(char x)
          Returns true iff x is the preferred representation of a Tibetan or Sanskrit consonant and cannot be broken down any further.
static boolean isRa(char ch)
          Returns true iff ch corresponds to the Tibetan letter ra.
static boolean isSubjoinedConsonant(char x)
          Returns true iff x is a Unicode codepoint that represents a subjoined consonant or subjoined two-consonant stack that has a Unicode code point.
static boolean isTibetanConsonant(char cp)
          Returns true iff cp is a Unicode 3.2 Tibetan consonant, subjoined or not.
static boolean isWa(char ch)
          Returns true iff ch corresponds to the Tibetan letter wa.
static boolean isYa(char ch)
          Returns true iff ch corresponds to the Tibetan letter ya.
static void toMostlyDecomposedUnicode(StringBuffer tibetanUnicode, byte normForm)
          Puts the Tibetan codepoints in tibetanUnicode, a sequence of Unicode codepoints, into either Normalization Form KD (NFKD), D (NFD), or THDL (NFTHDL), depending on the value of normForm.
static String toMostlyDecomposedUnicode(String tibetanUnicode, byte normForm)
          Like toMostlyDecomposedUnicode(StringBuffer, byte), but does not modify its input.
static String toNormalizedForm(char tibetanUnicodeCP, byte normalizationForm)
          There are 19 codepoints in the Tibetan range of Unicode 3.2 which can be decomposed into longer strings of codepoints in the Tibetan range of Unicode.
static String unicodeCodepointToString(char cp)
          Returns a human-readable, ASCII form of the Unicode codepoint cp.
static String unicodeStringToString(String s)
           
 
Methods inherited from class java.lang.Object
clone, equals, finalize, getClass, hashCode, notify, notifyAll, toString, wait, wait, wait
 

Method Detail

isNonSubjoinedConsonant

public static boolean isNonSubjoinedConsonant(char x)
Returns true iff x is a Unicode codepoint that represents a consonant or two-consonant stack that has a Unicode code point. Returns true only for the usual suspects (like \u0F40) and for Sanskrit consonants (like \u0F71) and the simple two-consonant stacks in Unicode (like \u0F43). Returns false for, among other things, subjoined consonants like \u0F90.


isSubjoinedConsonant

public static boolean isSubjoinedConsonant(char x)
Returns true iff x is a Unicode codepoint that represents a subjoined consonant or subjoined two-consonant stack that has a Unicode code point. Returns true only for the usual suspects (like \u0F90) and for Sanskrit consonants (like \u0F9C) and the simple two-consonant stacks in Unicode (like \u0FAC). Returns false for, among other things, non-subjoined consonants like \u0F40.


isPreferredFormOfConsonant

public static boolean isPreferredFormOfConsonant(char x)
Returns true iff x is the preferred representation of a Tibetan or Sanskrit consonant and cannot be broken down any further. Returns false for, among other things, subjoined consonants like \u0F90, two-component consonants like \u0F43, and fixed-form consonants like '\u0F6A'. The new consonants (for transcribing Chinese, I believe) "\u0F55\u0F39" (which EWTS calls "fa"), "\u0F56\u0F39" ("va"), and "\u0F5F\u0F39" ("Dza") are two-codepoint sequences, but you should be aware of them also.


isInTibetanRange

public static boolean isInTibetanRange(char unicodeCP)
Returns true iff unicodeCP is a codepoint from the Unicode range U+0F00-U+0FFF.

See Also:
isEntirelyTibetanUnicode(String)
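
The range test described above can be sketched in a few lines. This is a stand-alone illustration, not the THDL source; the class name TibetanRangeSketch is hypothetical and only the range U+0F00-U+0FFF is taken from the description.

```java
// Sketch only: a stand-alone range check, not the THDL implementation.
final class TibetanRangeSketch {
    /** Returns true iff cp lies in the Tibetan block U+0F00-U+0FFF, inclusive. */
    static boolean isInTibetanRange(char cp) {
        return cp >= '\u0F00' && cp <= '\u0FFF';
    }

    public static void main(String[] args) {
        System.out.println(isInTibetanRange('\u0F40')); // TIBETAN LETTER KA
        System.out.println(isInTibetanRange('A'));      // Latin letter, outside the block
    }
}
```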

isEntirelyTibetanUnicode

public static boolean isEntirelyTibetanUnicode(String unicodeString)
Returns true iff unicodeString consists only of codepoints from the Unicode range U+0F00-U+0FFF. (Note that these codepoints are typically not enough to represent a Tibetan text; you may also need ZWSP (zero-width space) and various whitespace from other ranges.)
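
To illustrate the caveat about ZWSP, here is a minimal sketch of such a check. It is an assumption about how the test could be written, not the library's code; the class and method names are hypothetical, and treating the empty string as entirely Tibetan is a choice of this sketch only.

```java
// Sketch only: every char must fall in U+0F00-U+0FFF.
final class EntirelyTibetanSketch {
    static boolean isEntirelyTibetan(String s) {
        for (int i = 0; i < s.length(); i++) {
            char c = s.charAt(i);
            if (c < '\u0F00' || c > '\u0FFF') {
                return false; // first codepoint outside the Tibetan block
            }
        }
        return true; // vacuously true for the empty string (sketch's choice)
    }

    public static void main(String[] args) {
        String word = "\u0F56\u0F7C\u0F51";        // ba, vowel sign o, da
        String withZwsp = word + "\u200B" + word;  // ZWSP lies outside the block
        System.out.println(isEntirelyTibetan(word));
        System.out.println(isEntirelyTibetan(withZwsp));
    }
}
```

A caller that wants to accept real-world text would therefore need to allow ZWSP and whitespace separately.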


toMostlyDecomposedUnicode

public static void toMostlyDecomposedUnicode(StringBuffer tibetanUnicode,
                                             byte normForm)
Puts the Tibetan codepoints in tibetanUnicode, a sequence of Unicode codepoints, into either Normalization Form KD (NFKD), D (NFD), or THDL (NFTHDL), depending on the value of normForm. NFD and NFKD are specified by Unicode 3.2; NFTHDL is needed for org.thdl.tib.text.tshegbar.UnicodeGraphemeCluster because NFKD normalizes U+0F0C and neither NFD nor NFKD breaks down U+0F00 into its constituent codepoints. NFTHDL uses as many codepoints as possible (it is maximally decomposed), and it never uses codepoints whose use has been discouraged.

The Tibetan passages of the resulting string are in the chosen normalized form, but codepoints outside of the range U+0F00-U+0FFF are not necessarily put into normalized form.

Recall that normalized forms are not necessarily closed under string concatenation, but are closed under substringing.

Note well that only well-formed input guarantees well-formed output.

Parameters:
tibetanUnicode - the codepoints to be decomposed
normForm - NORM_NFKD, NORM_NFTHDL, or NORM_NFD
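
The behaviour of the two standard forms on Tibetan codepoints can be observed with the JDK's own java.text.Normalizer (available in later Java versions). The sketch below is not this class's implementation and has no equivalent of NFTHDL; the class name and the hex helper are hypothetical. It prints how NFD and NFKD treat U+0F43, U+0F0C, and U+0F00.

```java
import java.text.Normalizer;

// Sketch only: uses the JDK normalizer, not UnicodeUtils.
final class TibetanNormalizationSketch {
    /** Renders each char of s as U+XXXX, separated by spaces. */
    static String hex(String s) {
        StringBuilder sb = new StringBuilder();
        for (int i = 0; i < s.length(); i++) {
            if (i > 0) sb.append(' ');
            sb.append(String.format("U+%04X", (int) s.charAt(i)));
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        // A precomposed stack, the non-breaking tsheg, and the syllable OM.
        String[] samples = { "\u0F43", "\u0F0C", "\u0F00" };
        for (String s : samples) {
            String nfd = Normalizer.normalize(s, Normalizer.Form.NFD);
            String nfkd = Normalizer.normalize(s, Normalizer.Form.NFKD);
            System.out.println(hex(s) + "  NFD: " + hex(nfd) + "  NFKD: " + hex(nfkd));
        }
    }
}
```

In this sketch only NFKD alters U+0F0C, and neither form alters U+0F00, which matches the motivation given for NFTHDL.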

toMostlyDecomposedUnicode

public static String toMostlyDecomposedUnicode(String tibetanUnicode,
                                               byte normForm)
Like toMostlyDecomposedUnicode(StringBuffer, byte), but does not modify its input. Instead, it returns the version of tibetanUnicode normalized to the form selected by normForm.


toNormalizedForm

public static String toNormalizedForm(char tibetanUnicodeCP,
                                      byte normalizationForm)
There are 19 codepoints in the Tibetan range of Unicode 3.2 which can be decomposed into longer strings of codepoints in the Tibetan range of Unicode. Often one wants to manipulate decomposed codepoint strings. Also, HTML and XML are W3C standards that require certain normalization forms. This routine returns a chosen normalized form for such codepoints, and returns null for codepoints that are already normalized or are not in the Tibetan range of Unicode.

Parameters:
tibetanUnicodeCP - the codepoint to normalize
normalizationForm - NORM_NFTHDL, NORM_NFKD, or NORM_NFD if you expect something nontrivial to happen
Returns:
null if tibetanUnicodeCP is already in the chosen normalized form, or a string of two or three codepoints otherwise
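
The "null means nothing to do" contract can be sketched as follows. This is an illustration built on the JDK's java.text.Normalizer rather than on this class; decomposeOrNull is a hypothetical name, it supports only the standard forms (there is no JDK counterpart to NFTHDL), and under NFKD it can return a single codepoint for U+0F0C, which differs from the two-or-three-codepoint results described above.

```java
import java.text.Normalizer;

// Sketch only: per-codepoint decomposition with a null result for "unchanged".
final class NormalizedFormSketch {
    /**
     * Returns the decomposition of cp under the given form, or null if cp is
     * outside U+0F00-U+0FFF or is unchanged by that form.
     */
    static String decomposeOrNull(char cp, Normalizer.Form form) {
        if (cp < '\u0F00' || cp > '\u0FFF') {
            return null;
        }
        String in = String.valueOf(cp);
        String out = Normalizer.normalize(in, form);
        return out.equals(in) ? null : out;
    }

    public static void main(String[] args) {
        // U+0F73 decomposes canonically; U+0F40 (ka) does not decompose.
        System.out.println(decomposeOrNull('\u0F73', Normalizer.Form.NFD) != null);
        System.out.println(decomposeOrNull('\u0F40', Normalizer.Form.NFD) == null);
    }
}
```

Returning null lets a caller skip allocation for the common case in which a codepoint is already normalized.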

isDiscouraged

public static boolean isDiscouraged(char tibetanUnicodeCP)
Returns true iff tibetanUnicodeCP is a Tibetan codepoint whose use the Unicode 3.2 standard discourages.


isRa

public static boolean isRa(char ch)
Returns true iff ch corresponds to the Tibetan letter ra. Several Unicode codepoints correspond to the Tibetan letter ra (in its subscribed form or otherwise). Oftentimes, \u0F62 is thought of as the nominal representation. Returns false for some codepoints that contain ra but are not merely ra, such as \u0F77.


isWa

public static boolean isWa(char ch)
Returns true iff ch corresponds to the Tibetan letter wa. Several Unicode codepoints correspond to the Tibetan letter wa. Oftentimes, \u0F5D is thought of as the nominal representation.


isYa

public static boolean isYa(char ch)
Returns true iff ch corresponds to the Tibetan letter ya. Several Unicode codepoints correspond to the Tibetan letter ya. Oftentimes, \u0F61 is thought of as the nominal representation.


containsRa

public static boolean containsRa(String unicodeString)
Returns true iff there exists at least one codepoint cp in unicodeString such that cp is ra or contains ra (like \u0F77). This method is not implemented as fast as it could be. It calls on the canonicalization code in order to maximize reuse and minimize the possibility of coder error.
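
The decompose-then-scan approach described above can be sketched like this. It is not the library's code: it uses the JDK's java.text.Normalizer with NFKD in place of this class's canonicalization routines, and the set of codepoints treated as "ra" (U+0F62, U+0F6A, U+0FB2, U+0FBC) is an assumption of the sketch.

```java
import java.text.Normalizer;

// Sketch only: decompose first, then look for any form of ra.
final class ContainsRaSketch {
    /** Assumed set of "merely ra" codepoints: full, fixed-form, and subjoined. */
    static boolean isRa(char c) {
        return c == '\u0F62' || c == '\u0F6A' || c == '\u0FB2' || c == '\u0FBC';
    }

    static boolean containsRa(String s) {
        // NFKD exposes the ra hidden inside codepoints such as U+0F77.
        String decomposed = Normalizer.normalize(s, Normalizer.Form.NFKD);
        for (int i = 0; i < decomposed.length(); i++) {
            if (isRa(decomposed.charAt(i))) {
                return true;
            }
        }
        return false;
    }

    public static void main(String[] args) {
        System.out.println(isRa('\u0F77'));        // not merely ra
        System.out.println(containsRa("\u0F77"));  // but it contains ra
    }
}
```

Reusing the normalizer costs an extra string allocation per call, which is the trade-off the description accepts in exchange for less hand-written table code.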


containsRa

public static boolean containsRa(char unicodeCP)
Inefficient shortcut for containsRa(String) on a single codepoint.

See Also:
containsRa(String)

unicodeCodepointToString

public static String unicodeCodepointToString(char cp)
Returns a human-readable, ASCII form of the Unicode codepoint cp.


unicodeStringToString

public static String unicodeStringToString(String s)

isTibetanConsonant

public static boolean isTibetanConsonant(char cp)
Returns true iff cp is a Unicode 3.2 Tibetan consonant, subjoined or not. This counts precomposed consonant stacks like U+0FA7 as consonants. If you don't wish to treat such stacks as consonants, then put the input into NORM_NFD, NORM_NFKD, or NORM_NFTHDL first. If it changes under such a normalization, it is a precomposed consonant.
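
The "does it change under normalization" test for precomposed stacks can be sketched with the JDK's java.text.Normalizer. This stands in for the normalization routines of this class and is an assumption, not the library's code; the class and method names are hypothetical.

```java
import java.text.Normalizer;

// Sketch only: a codepoint that NFD rewrites is treated as precomposed.
final class PrecomposedConsonantSketch {
    static boolean changesUnderNfd(char cp) {
        String in = String.valueOf(cp);
        return !Normalizer.normalize(in, Normalizer.Form.NFD).equals(in);
    }

    public static void main(String[] args) {
        System.out.println(changesUnderNfd('\u0FA7')); // precomposed subjoined stack
        System.out.println(changesUnderNfd('\u0FA6')); // plain subjoined consonant
    }
}
```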



These API docs were created 02/02/2003 08:19 PM.
Copyright © 2001-2002 Tibetan and Himalayan Digital Library. All Rights Reserved.