org.thdl.tib.text.tshegbar
Class UnicodeGraphemeCluster

java.lang.Object
  |
  +--org.thdl.tib.text.tshegbar.UnicodeGraphemeCluster
All Implemented Interfaces:
UnicodeConstants, UnicodeReadyThunk

public class UnicodeGraphemeCluster
extends Object
implements UnicodeReadyThunk, UnicodeConstants

A UnicodeGraphemeCluster is either a non-Tibetan codepoint (such as whitespace or control characters or a Latin "character"), or a vertically stacked set of Tibetan consonants, vowels, marks, and signs. The Unicode string "\u0F40\u0F0B\u0F41\u0F0B" specifies four UnicodeGraphemeClusters (the name of the Tibetan alphabet, you might notice), while the Unicode string "\u0F66\u0FA5\u0F39\u0F90\u0FB5\u0F71\u0F80\u0F7F" is one Tibetan stack, sa over fa over ka over Sha with an a-chung, a reversed gi-gu, and a visarga, plus a ngas-bzung-sgor-rtags mark underneath all of that. I assume the latter grapheme cluster is nonsense, but it is considered one grapheme cluster because all but the first char are combining chars. See Unicode Technical Report 29.

As the above example demonstrates, not all UnicodeGraphemeClusters are syntactically legal in the Tibetan language. Not all of them are syntactically legal in Sanskrit transcribed in the Tibetan alphabet, either.

The Unicode 3.2 standard (see especially Technical Report 29) refers to "grapheme clusters." A UnicodeGraphemeCluster is precisely a grapheme cluster as described by that standard. We interpret the standard as saying that U+0F3E and U+0F3F are each grapheme clusters unto themselves, even though they are combining codepoints.
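
The boundary rule described above can be approximated in a few lines. What follows is a minimal sketch, not the class's actual implementation: the class name GraphemeClusterSketch and its methods are hypothetical, it relies only on java.lang.Character general categories, it handles only BMP codepoints, and it performs none of the Tibetan-specific validation that the real class documents.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical illustration; not part of org.thdl.tib.text.tshegbar.
class GraphemeClusterSketch {

    /** True for combining marks (general categories Mn, Mc, Me). */
    static boolean isCombining(char c) {
        int type = Character.getType(c);
        return type == Character.NON_SPACING_MARK
            || type == Character.COMBINING_SPACING_MARK
            || type == Character.ENCLOSING_MARK;
    }

    /** U+0F3E and U+0F3F are combining, yet are treated as clusters unto themselves. */
    static boolean standsAlone(char c) {
        return c == '\u0F3E' || c == '\u0F3F';
    }

    /** Splits s so that each combining char joins the base char before it. */
    static List<String> split(String s) {
        List<String> clusters = new ArrayList<>();
        StringBuilder current = new StringBuilder();
        for (int i = 0; i < s.length(); i++) {
            char c = s.charAt(i);
            boolean startsNewCluster = standsAlone(c) || !isCombining(c);
            if (startsNewCluster && current.length() > 0) {
                clusters.add(current.toString());
                current.setLength(0);
            }
            current.append(c);
            if (standsAlone(c)) {
                // Close the cluster at once so nothing attaches to it.
                clusters.add(current.toString());
                current.setLength(0);
            }
        }
        if (current.length() > 0) {
            clusters.add(current.toString());
        }
        return clusters;
    }

    public static void main(String[] args) {
        // ka, tsheg, kha, tsheg: four clusters
        System.out.println(split("\u0F40\u0F0B\u0F41\u0F0B").size());
        // one base char followed by seven combining chars: one cluster
        System.out.println(split("\u0F66\u0FA5\u0F39\u0F90\u0FB5\u0F71\u0F80\u0F7F").size());
    }
}
```

A string that begins with a combining codepoint is accepted leniently here and becomes a cluster of its own, whereas the constructor documented below rejects such input.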

Author:
David Chandler

Field Summary
 
Fields inherited from interface org.thdl.tib.text.tshegbar.UnicodeConstants
EW_ABSENT, EW_achung, EWC_a, EWC_achen, EWC_ba, EWC_ca, EWC_cha, EWC_da, EWC_dza, EWC_ga, EWC_ha, EWC_ja, EWC_ka, EWC_kha, EWC_la, EWC_ma, EWC_na, EWC_nga, EWC_nya, EWC_pa, EWC_pha, EWC_ra, EWC_sa, EWC_sha, EWC_ta, EWC_tha, EWC_tsa, EWC_tsha, EWC_wa, EWC_ya, EWC_za, EWC_zha, EWSUB_la_btags, EWSUB_ra_btags, EWSUB_wa_zur, EWSUB_ya_btags, EWV_e, EWV_i, EWV_o, EWV_u, NORM_NFC, NORM_NFD, NORM_NFKC, NORM_NFKD, NORM_NFTHDL, NORM_UNNORMALIZED
 
Constructor Summary
UnicodeGraphemeCluster(String unicodeString)
          Creates a new UnicodeGraphemeCluster given a legal sequence of Unicode codepoints corresponding to a single grapheme cluster.
 
Method Summary
static int breakUnicodeIntoGraphemeClusters(Vector grcls, String unicode, boolean validate, boolean correctErrors)
          Given some (possibly unnormalized) Unicode 3.2 string unicode, appends grapheme clusters to the vector of GraphemeClusters grcls if grcls is nonnull.
 String getThdlWylie(boolean needsVowel)
          Returns the THDL Extended Wylie transliteration of this grapheme cluster, or null if there is none (which happens for a few Tibetan codepoints, if you'll recall).
 String getTopToBottomCodepoints()
          FIXMEDOC
 String getUnicodeRepresentation()
          Returns a string of codepoints in NFTHDL form.
 boolean hasUnicodeRepresentation()
          Returns true.
 boolean isLegalTibetan()
          Returns true iff this stack could occur in syntactically correct, run-of-the-mill Tibetan (as opposed to Tibetanized Sanskrit, Chinese, et cetera).
 boolean isTibetan()
          DLC SOON
 String toConciseXML()
          Returns an XML element that contains the THDL Extended Wylie transliteration for this cluster.
 String toVerboseXML()
          Returns an XML element that contains this cluster broken down into its constituent decomposed codepoints.
 
Methods inherited from class java.lang.Object
clone, equals, finalize, getClass, hashCode, notify, notifyAll, toString, wait, wait, wait
 

Constructor Detail

UnicodeGraphemeCluster

public UnicodeGraphemeCluster(String unicodeString)
                       throws IllegalArgumentException
Creates a new UnicodeGraphemeCluster given a legal sequence of Unicode codepoints corresponding to a single grapheme cluster.

Throws:
IllegalArgumentException - if unicodeString is not a syntactically correct Unicode 3.2 sequence (if it begins with a combining codepoint or has a Tibetan vowel before another combining character, for example), or if it is more than one grapheme cluster. Note that syntactical correctness for non-Tibetan codepoints is not likely to be known by this routine.
Method Detail

getUnicodeRepresentation

public String getUnicodeRepresentation()
Returns a string of codepoints in NFTHDL form.

Specified by:
getUnicodeRepresentation in interface UnicodeReadyThunk
Returns:
a String of Unicode codepoints

hasUnicodeRepresentation

public boolean hasUnicodeRepresentation()
Returns true.

Specified by:
hasUnicodeRepresentation in interface UnicodeReadyThunk

isLegalTibetan

public boolean isLegalTibetan()
Returns true iff this stack could occur in syntactically correct, run-of-the-mill Tibetan (as opposed to Tibetanized Sanskrit, Chinese, et cetera). sga is a legal Tibetan stack, but g+g is not, for example.


toConciseXML

public String toConciseXML()
Returns an XML element that contains the THDL Extended Wylie transliteration for this cluster.


toVerboseXML

public String toVerboseXML()
Returns an XML element that contains this cluster broken down into its constituent decomposed codepoints.


getThdlWylie

public String getThdlWylie(boolean needsVowel)
Returns the THDL Extended Wylie transliteration of this grapheme cluster, or null if there is none (which happens for a few Tibetan codepoints, if you'll recall). If needsVowel is true, then an "a" will be appended when there is no EW_achung or explicit simple vowel. If there is an explicit vowel or EW_achung, it will always be present. Note that needsVowel is provided because btags is the preferred THDL Extended Wylie for the four contiguous grapheme clusters "\u0F56\u0F4F\u0F42\u0F66", and needsVowel must be set to false for all but the grapheme cluster corresponding to \u0F4F if you wish to get the preferred THDL Extended Wylie.
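
The role of needsVowel in the btags example can be illustrated with a small stand-in. This is only a sketch under stated assumptions: the class NeedsVowelSketch, its four-entry consonant table, and the wylie and join methods are hypothetical, cover just the four consonants above, and ignore EW_achung and explicit vowels entirely.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical illustration; not part of org.thdl.tib.text.tshegbar.
class NeedsVowelSketch {

    // Assumed table covering only the four consonants of the example.
    static final Map<Character, String> CONSONANTS = new HashMap<>();
    static {
        CONSONANTS.put('\u0F56', "b");
        CONSONANTS.put('\u0F4F', "t");
        CONSONANTS.put('\u0F42', "g");
        CONSONANTS.put('\u0F66', "s");
    }

    /** Returns the transliteration, or null when the table has no entry. */
    static String wylie(char consonant, boolean needsVowel) {
        String base = CONSONANTS.get(consonant);
        if (base == null) {
            return null;
        }
        // The inherent vowel is appended only on request.
        return needsVowel ? base + "a" : base;
    }

    /** Joins single-consonant clusters, one needsVowel flag per cluster. */
    static String join(String consonants, boolean[] needsVowel) {
        StringBuilder sb = new StringBuilder();
        for (int i = 0; i < consonants.length(); i++) {
            sb.append(wylie(consonants.charAt(i), needsVowel[i]));
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        String clusters = "\u0F56\u0F4F\u0F42\u0F66";
        // Only the cluster for U+0F4F asks for a vowel: prints btags
        System.out.println(join(clusters, new boolean[] {false, true, false, false}));
        // Asking every cluster for a vowel: prints batagasa
        System.out.println(join(clusters, new boolean[] {true, true, true, true}));
    }
}
```

Deciding which cluster should receive the vowel is left to the caller in this sketch, just as the description above leaves the choice of needsVowel to the caller.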


breakUnicodeIntoGraphemeClusters

public static int breakUnicodeIntoGraphemeClusters(Vector grcls,
                                                   String unicode,
                                                   boolean validate,
                                                   boolean correctErrors)
                                            throws IllegalArgumentException,
                                                   NullPointerException
Given some (possibly unnormalized) Unicode 3.2 string unicode, appends grapheme clusters to the vector of GraphemeClusters grcls if grcls is nonnull. Performs good error checking if validate is true. If an error is found and grcls is nonnull, grcls may already have been modified. Setting grcls to null and setting validate to true is sometimes useful for testing the validity of a Unicode string.

Returns:
the number of grapheme clusters that were or would have been added to grcls
Throws:
BadTibetanUnicodeException - if the unicode is not syntactically legal
IllegalArgumentException - if correctErrors and validate are both true
NullPointerException - if unicode is null
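
The argument contract above, including the idiom of passing a null vector purely to validate, might be exercised as in the following stand-in. Everything here is an assumption made for illustration: BreakContractSketch, breakIntoClusters, isValid, and BadUnicodeSketchException are hypothetical names, the vector holds plain Strings rather than cluster objects, the only validation performed is rejecting a leading combining codepoint, and the special treatment of U+0F3E and U+0F3F is ignored.

```java
import java.util.Vector;

// Hypothetical stand-in that mirrors the documented contract only.
class BreakContractSketch {

    /** Stand-in for BadTibetanUnicodeException, whose superclass is not shown here. */
    static class BadUnicodeSketchException extends RuntimeException {
        BadUnicodeSketchException(String message) {
            super(message);
        }
    }

    static boolean isCombining(char c) {
        int type = Character.getType(c);
        return type == Character.NON_SPACING_MARK
            || type == Character.COMBINING_SPACING_MARK
            || type == Character.ENCLOSING_MARK;
    }

    /** Returns the number of clusters that were, or would have been, appended. */
    static int breakIntoClusters(Vector<String> grcls, String unicode,
                                 boolean validate, boolean correctErrors) {
        if (unicode == null) {
            throw new NullPointerException("unicode is null");
        }
        if (validate && correctErrors) {
            throw new IllegalArgumentException("validate and correctErrors are both true");
        }
        if (validate && unicode.length() > 0 && isCombining(unicode.charAt(0))) {
            throw new BadUnicodeSketchException("begins with a combining codepoint");
        }
        int count = 0;
        int start = 0;
        for (int i = 1; i <= unicode.length(); i++) {
            // A cluster ends at the end of input or before the next base char.
            if (i == unicode.length() || !isCombining(unicode.charAt(i))) {
                if (grcls != null) {
                    grcls.add(unicode.substring(start, i));
                }
                count++;
                start = i;
            }
        }
        return count;
    }

    /** Validation-only call: a null vector with validate set to true. */
    static boolean isValid(String unicode) {
        try {
            breakIntoClusters(null, unicode, true, false);
            return true;
        } catch (BadUnicodeSketchException e) {
            return false;
        }
    }

    public static void main(String[] args) {
        Vector<String> grcls = new Vector<>();
        int n = breakIntoClusters(grcls, "\u0F40\u0F0B\u0F41\u0F0B", false, false);
        System.out.println(n + " clusters appended, vector size " + grcls.size());
        System.out.println(isValid("\u0F71\u0F40"));
    }
}
```

Because the vector may already hold some clusters when an error is detected, callers who need all-or-nothing behavior could run the validation-only call first and append in a second pass.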

getTopToBottomCodepoints

public String getTopToBottomCodepoints()
FIXMEDOC


isTibetan

public boolean isTibetan()
DLC SOON

Specified by:
isTibetan in interface UnicodeReadyThunk


These API docs were created 02/02/2003 08:19 PM.
Copyright © 2001-2002 Tibetan and Himalayan Digital Library. All Rights Reserved.