org.thdl.tib.text.tshegbar
Class UnicodeGraphemeCluster

java.lang.Object
  |
  +--org.thdl.tib.text.tshegbar.UnicodeGraphemeCluster
All Implemented Interfaces:
UnicodeConstants, UnicodeReadyThunk

public class UnicodeGraphemeCluster
extends Object
implements UnicodeReadyThunk, UnicodeConstants

A UnicodeGraphemeCluster is either a non-Tibetan codepoint (such as whitespace or control characters or a Latin "character"), or a vertically stacked set of Tibetan consonants, vowels, marks, and signs. The Unicode string "\u0F40\u0F0B\u0F41\u0F0B" specifies four UnicodeGraphemeClusters (the name of the Tibetan alphabet, you might notice), while the Unicode string "\u0F66\u0FA5\u0F39\u0F90\u0FB5\u0F71\u0F80\u0F7F" is one Tibetan stack, sa over fa over ka over Sha with an a-chung, a reversed gi-gu, and a visarga, plus a ngas-bzung-sgor-rtags mark underneath all of that. I assume the latter grapheme cluster is nonsense, but it is considered one grapheme cluster because all but the first char are combining chars. See Unicode Technical Report 29.

As the above example demonstrates, not all UnicodeGraphemeClusters are syntactically legal in the Tibetan language. Not all of them are syntactically legal in Sanskrit transcribed in the Tibetan alphabet, either.

The Unicode 3.2 standard (see especially Technical Report 29) refers to "grapheme clusters." A UnicodeGraphemeCluster is precisely a grapheme cluster as described by that standard. We interpret the standard as saying that U+0F3E and U+0F3F are each grapheme clusters unto themselves, even though they are combining codepoints.
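
The cluster boundaries described above can be illustrated with a minimal, standalone sketch. It does not use this class; GraphemeClusterSketch, isCombining, and split are hypothetical names, and the rule "a cluster is one base codepoint plus the combining codepoints that follow it" is a simplification of Technical Report 29. It relies only on java.lang.Character general categories.

```java
import java.util.ArrayList;
import java.util.List;

class GraphemeClusterSketch {
    /** True for codepoints that attach to the preceding base codepoint. */
    static boolean isCombining(char c) {
        // U+0F3E and U+0F3F are combining codepoints, but the text above
        // treats each of them as a grapheme cluster unto itself.
        if (c == '\u0F3E' || c == '\u0F3F') return false;
        int type = Character.getType(c);
        return type == Character.NON_SPACING_MARK
            || type == Character.COMBINING_SPACING_MARK
            || type == Character.ENCLOSING_MARK;
    }

    /** Splits s so that each piece is one base codepoint plus trailing combining codepoints. */
    static List<String> split(String s) {
        List<String> clusters = new ArrayList<>();
        StringBuilder current = new StringBuilder();
        for (int i = 0; i < s.length(); i++) {
            char c = s.charAt(i);
            // A non-combining codepoint closes the cluster built so far.
            if (current.length() > 0 && !isCombining(c)) {
                clusters.add(current.toString());
                current.setLength(0);
            }
            current.append(c);
        }
        if (current.length() > 0) clusters.add(current.toString());
        return clusters;
    }

    public static void main(String[] args) {
        // The two example strings from the class description.
        System.out.println(split("\u0F40\u0F0B\u0F41\u0F0B").size());
        System.out.println(split("\u0F66\u0FA5\u0F39\u0F90\u0FB5\u0F71\u0F80\u0F7F").size());
    }
}
```

Under these assumptions the first example string yields four pieces and the second yields one, matching the description above. The sketch performs no validation and no normalization.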

Author:
David Chandler

Field Summary
private static int MAX_HEIGHT
           
private static int MIN_HEIGHT
           
private  String unicodeString
          The Unicode codepoints that compose this grapheme cluster.
 
Fields inherited from interface org.thdl.tib.text.tshegbar.UnicodeConstants
EW_ABSENT, EW_achung, EWC_a, EWC_achen, EWC_ba, EWC_ca, EWC_cha, EWC_da, EWC_dza, EWC_ga, EWC_ha, EWC_ja, EWC_ka, EWC_kha, EWC_la, EWC_ma, EWC_na, EWC_nga, EWC_nya, EWC_pa, EWC_pha, EWC_ra, EWC_sa, EWC_sha, EWC_ta, EWC_tha, EWC_tsa, EWC_tsha, EWC_wa, EWC_ya, EWC_za, EWC_zha, EWSUB_la_btags, EWSUB_ra_btags, EWSUB_wa_zur, EWSUB_ya_btags, EWV_e, EWV_i, EWV_o, EWV_u, NORM_NFC, NORM_NFD, NORM_NFKC, NORM_NFKD, NORM_NFTHDL, NORM_UNNORMALIZED
 
Constructor Summary
private UnicodeGraphemeCluster()
          Do not use this constructor.
  UnicodeGraphemeCluster(String unicodeString)
          Creates a new UnicodeGraphemeCluster given a legal sequence of Unicode codepoints corresponding to a single grapheme cluster.
 
Method Summary
static int breakUnicodeIntoGraphemeClusters(Vector grcls, String unicode, boolean validate, boolean correctErrors)
          Given some (possibly unnormalized) Unicode 3.2 string unicode, appends grapheme clusters to the vector of UnicodeGraphemeClusters grcls if grcls is nonnull.
private static int getCPHeight(char x)
          Returns the height for the Tibetan Unicode codepoint x.
 String getThdlWylie(boolean needsVowel)
          Returns the THDL Extended Wylie transliteration of this grapheme cluster, or null if there is none (which happens for a few Tibetan codepoints, if you'll recall).
 String getTopToBottomCodepoints()
          FIXMEDOC
private static StringBuffer getTopToBottomCodepoints(StringBuffer NFTHDLString, int start, int end)
          Returns a new StringBuffer consisting of the codepoints in NFTHDLString at indices [start, end) sorted in top-to-bottom order, or null on some occasions when NFTHDLString is already sorted.
 String getUnicodeRepresentation()
          Returns a string of codepoints in NFTHDL form.
 boolean hasUnicodeRepresentation()
          Returns true.
 boolean isLegalTibetan()
          Returns true iff this stack could occur in syntactically correct, run-of-the-mill Tibetan (as opposed to Tibetanized Sanskrit, Chinese, et cetera).
 boolean isTibetan()
          DLC SOON
 String toConciseXML()
          Returns an XML element that contains the THDL Extended Wylie transliteration for this cluster.
 String toVerboseXML()
          Returns an XML element that contains this cluster broken down into its constituent decomposed codepoints.
 
Methods inherited from class java.lang.Object
clone, equals, finalize, getClass, hashCode, notify, notifyAll, toString, wait, wait, wait
 

Field Detail

MIN_HEIGHT

private static final int MIN_HEIGHT
See Also:
getCPHeight(char), Constant Field Values

MAX_HEIGHT

private static final int MAX_HEIGHT
See Also:
getCPHeight(char), Constant Field Values

unicodeString

private String unicodeString
The Unicode codepoints that compose this grapheme cluster. The sequence is legal; that is, if there is a Tibetan vowel, it is the last codepoint. The sequence is in Normalization Form THDL (NFTHDL).

Constructor Detail

UnicodeGraphemeCluster

private UnicodeGraphemeCluster()
Do not use this constructor.


UnicodeGraphemeCluster

public UnicodeGraphemeCluster(String unicodeString)
                       throws IllegalArgumentException
Creates a new UnicodeGraphemeCluster given a legal sequence of Unicode codepoints corresponding to a single grapheme cluster.

Throws:
IllegalArgumentException - if unicodeString is not a syntactically correct Unicode 3.2 sequence (if it begins with a combining codepoint or has a Tibetan vowel before another combining character, for example), or if it is more than one grapheme cluster. Note that syntactical correctness for non-Tibetan codepoints is not likely to be known by this routine.
Method Detail

getUnicodeRepresentation

public String getUnicodeRepresentation()
Returns a string of codepoints in NFTHDL form.

Specified by:
getUnicodeRepresentation in interface UnicodeReadyThunk
Returns:
a String of Unicode codepoints

hasUnicodeRepresentation

public boolean hasUnicodeRepresentation()
Returns true.

Specified by:
hasUnicodeRepresentation in interface UnicodeReadyThunk

isLegalTibetan

public boolean isLegalTibetan()
Returns true iff this stack could occur in syntactically correct, run-of-the-mill Tibetan (as opposed to Tibetanized Sanskrit, Chinese, et cetera). For example, sga is a legal Tibetan stack, but g+g is not.
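
The sga versus g+g contrast can be sketched for one narrow case. The following is not this method's implementation; it is a hypothetical check (SaMgoSketch and isLegalSaSuperscriptStack are invented names) covering only two-codepoint stacks with a superscribed sa, and it assumes the traditional list of eleven root letters that take sa-mgo. All other legal stacks (ra-mgo, la-mgo, subscripts, vowels) are outside its scope.

```java
class SaMgoSketch {
    // Root letters that traditionally take a superscribed sa:
    // ka, ga, nga, nya, ta, da, na, pa, ba, ma, tsa.
    private static final String ROOTS_UNDER_SA =
        "\u0F40\u0F42\u0F44\u0F49\u0F4F\u0F51\u0F53\u0F54\u0F56\u0F58\u0F59";

    /**
     * True iff stack is exactly sa followed by the subjoined form of one of
     * the root letters above. Subjoined consonants sit 0x50 above their
     * full forms in the Tibetan block (ga 0F42, subjoined ga 0F92).
     */
    static boolean isLegalSaSuperscriptStack(String stack) {
        if (stack.length() != 2) return false;
        char top = stack.charAt(0);
        char fullFormOfBottom = (char) (stack.charAt(1) - 0x50);
        return top == '\u0F66' && ROOTS_UNDER_SA.indexOf(fullFormOfBottom) >= 0;
    }

    public static void main(String[] args) {
        System.out.println(isLegalSaSuperscriptStack("\u0F66\u0F92")); // sga
        System.out.println(isLegalSaSuperscriptStack("\u0F42\u0F92")); // g+g
    }
}
```

A false result here means only "not a sa-superscript stack known to this sketch", not that the stack is illegal Tibetan.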


toConciseXML

public String toConciseXML()
Returns an XML element that contains the THDL Extended Wylie transliteration for this cluster.


toVerboseXML

public String toVerboseXML()
Returns an XML element that contains this cluster broken down into its constituent decomposed codepoints.


getThdlWylie

public String getThdlWylie(boolean needsVowel)
Returns the THDL Extended Wylie transliteration of this grapheme cluster, or null if there is none (which happens for a few Tibetan codepoints, if you'll recall). If needsVowel is true, then an "a" will be appended when there is no EW_achung or explicit simple vowel. If there is an explicit vowel or EW_achung, it will always be present. Note that needsVowel is provided because btags is the preferred THDL Extended Wylie for the four contiguous grapheme clusters "\u0F56\u0F4F\u0F42\u0F66", and needsVowel must be set to false for all but the grapheme cluster corresponding to \u0F4F if you wish to get the preferred THDL Extended Wylie.
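
The effect of needsVowel on the btags example can be shown with a toy stand-in. The sketch below does not call this class; NeedsVowelSketch, wylie, and transliterate are hypothetical names, and the lookup table covers only the four consonants in the example.

```java
import java.util.HashMap;
import java.util.Map;

class NeedsVowelSketch {
    // Toy table: only the four consonants used in the btags example.
    private static final Map<Character, String> CONSONANTS = new HashMap<>();
    static {
        CONSONANTS.put('\u0F56', "b");
        CONSONANTS.put('\u0F4F', "t");
        CONSONANTS.put('\u0F42', "g");
        CONSONANTS.put('\u0F66', "s");
    }

    /** Returns the consonant's transliteration, with "a" appended if needsVowel is true; null if unknown. */
    static String wylie(char consonant, boolean needsVowel) {
        String base = CONSONANTS.get(consonant);
        if (base == null) return null;
        return needsVowel ? base + "a" : base;
    }

    /** Transliterates single-consonant clusters, requesting the vowel only at vowelIndex. */
    static String transliterate(String consonants, int vowelIndex) {
        StringBuilder out = new StringBuilder();
        for (int i = 0; i < consonants.length(); i++) {
            String piece = wylie(consonants.charAt(i), i == vowelIndex);
            if (piece == null) return null;
            out.append(piece);
        }
        return out.toString();
    }

    public static void main(String[] args) {
        // Only the cluster for U+0F4F (index 1) gets needsVowel == true.
        System.out.println(transliterate("\u0F56\u0F4F\u0F42\u0F66", 1));
    }
}
```

Requesting the vowel for every cluster would instead give "batagasa" in this sketch, which is why the caller, not the cluster, decides where the inherent vowel belongs.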


breakUnicodeIntoGraphemeClusters

public static int breakUnicodeIntoGraphemeClusters(Vector grcls,
                                                   String unicode,
                                                   boolean validate,
                                                   boolean correctErrors)
                                            throws IllegalArgumentException,
                                                   NullPointerException
Given some (possibly unnormalized) Unicode 3.2 string unicode, appends grapheme clusters to the vector of UnicodeGraphemeClusters grcls if grcls is nonnull. Performs thorough error checking if validate is true. If an error is found, grcls may already have been modified if it is nonnull. Setting grcls to null and validate to true is sometimes useful for testing the validity of a Unicode string.

Returns:
the number of grapheme clusters that were or would have been added to grcls
Throws:
BadTibetanUnicodeException - if the unicode is not syntactically legal
IllegalArgumentException - if correctErrors and validate are both true
NullPointerException - if unicode is null
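
The calling convention above (a null vector means "count only", and validation is reported by exception) can be sketched with a stand-in that mirrors the parameter list. This is an illustration of the documented contract, not the real method: BreakContractSketch, breakIntoClusters, isValid, and BadUnicodeSketchException are hypothetical, the vector holds plain Strings rather than cluster objects, the only validation performed is the leading-combining-codepoint check, and no error correction is attempted.

```java
import java.util.Vector;

class BreakContractSketch {
    /** Hypothetical stand-in for the library's bad-Unicode exception. */
    static class BadUnicodeSketchException extends RuntimeException {
        BadUnicodeSketchException(String msg) { super(msg); }
    }

    static boolean isCombining(char c) {
        if (c == '\u0F3E' || c == '\u0F3F') return false; // stand-alone per the class description
        int type = Character.getType(c);
        return type == Character.NON_SPACING_MARK
            || type == Character.COMBINING_SPACING_MARK
            || type == Character.ENCLOSING_MARK;
    }

    /** Appends clusters to grcls if it is non-null; always returns the cluster count. */
    static int breakIntoClusters(Vector<String> grcls, String unicode,
                                 boolean validate, boolean correctErrors) {
        if (unicode == null) throw new NullPointerException("unicode is null");
        if (validate && correctErrors)
            throw new IllegalArgumentException("validate and correctErrors are both true");
        if (validate && !unicode.isEmpty() && isCombining(unicode.charAt(0)))
            throw new BadUnicodeSketchException("starts with a combining codepoint");
        int count = 0;
        int start = 0;
        for (int i = 1; i <= unicode.length(); i++) {
            // A cluster ends at the end of input or just before the next base codepoint.
            if (i == unicode.length() || !isCombining(unicode.charAt(i))) {
                if (grcls != null) grcls.add(unicode.substring(start, i));
                count++;
                start = i;
            }
        }
        return count;
    }

    /** Validity test: null vector, validate == true, correctErrors == false. */
    static boolean isValid(String unicode) {
        try {
            breakIntoClusters(null, unicode, true, false);
            return true;
        } catch (BadUnicodeSketchException e) {
            return false;
        }
    }

    public static void main(String[] args) {
        Vector<String> clusters = new Vector<>();
        System.out.println(breakIntoClusters(clusters, "\u0F40\u0F0B\u0F41\u0F0B", false, false));
        System.out.println(isValid("\u0F71"));
    }
}
```

Passing null for the vector keeps the validity check free of side effects, which matters because a failed call may leave a non-null vector partially filled.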

getTopToBottomCodepoints

public String getTopToBottomCodepoints()
FIXMEDOC


getTopToBottomCodepoints

private static StringBuffer getTopToBottomCodepoints(StringBuffer NFTHDLString,
                                                     int start,
                                                     int end)
Returns a new StringBuffer consisting of the codepoints in NFTHDLString at indices [start, end) sorted in top-to-bottom order, or null on some occasions when NFTHDLString is already sorted. A top-to-bottom ordering is a useful form for applications wishing to render the grapheme cluster. Note that this method is only useful if NFTHDLString is part of or an entire grapheme cluster. Does no error checking on NFTHDLString.

Parameters:
NFTHDLString - a buffer with characters at indices i, where start <= i < end, being the Unicode codepoints for a single grapheme cluster or part of a grapheme cluster
start - NFTHDLString.charAt(start) is the first codepoint dealt with
end - NFTHDLString.charAt(end) is the first codepoint NOT dealt with
Returns:
null only if (but not necessarily if) NFTHDLString is already sorted top-to-bottom, or the sorted form of NFTHDLString
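
The half-open range and the null-when-already-sorted convention can be sketched as follows. This is a hedged illustration rather than the private method itself: TopToBottomSketch, topToBottom, and demoHeight are hypothetical names, the height function is passed in as a parameter, demoHeight covers only a few codepoints, and this version returns null whenever the range is already sorted (the documented contract permits, but does not require, that).

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;
import java.util.function.IntUnaryOperator;

class TopToBottomSketch {
    /** Tiny demo height table: gi-gu above the base, subjoined consonants and a-chung below. */
    static int demoHeight(int c) {
        if (c == 0x0F72) return 1;                  // gi-gu
        if (c == 0x0F71) return -3;                 // a-chung
        if (c >= 0x0F90 && c <= 0x0FBC) return -1;  // subjoined consonant
        return 0;                                   // base consonant and everything else
    }

    /**
     * Returns the codepoints at indices [start, end) ordered from greatest
     * height to least, or null if that range is already in that order.
     */
    static StringBuffer topToBottom(StringBuffer s, int start, int end, IntUnaryOperator height) {
        List<Character> chars = new ArrayList<>();
        for (int i = start; i < end; i++) chars.add(s.charAt(i));
        // List.sort is stable, so codepoints of equal height keep their relative order.
        chars.sort(Comparator.comparingInt((Character c) -> -height.applyAsInt(c)));
        StringBuffer sorted = new StringBuffer(end - start);
        for (char c : chars) sorted.append(c);
        if (sorted.toString().equals(s.substring(start, end))) return null;
        return sorted;
    }

    public static void main(String[] args) {
        // ga, a-chung, subjoined ra: the a-chung moves below the ra.
        StringBuffer in = new StringBuffer("\u0F42\u0F71\u0FB2");
        StringBuffer out = topToBottom(in, 0, in.length(), TopToBottomSketch::demoHeight);
        System.out.println(out == null ? "already sorted" : "reordered");
    }
}
```

Callers must treat a null result as "use the input range unchanged", since null signals that no reordering was needed rather than a failure.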

getCPHeight

private static int getCPHeight(char x)
Returns the height for the Tibetan Unicode codepoint x. This relative height is 0 for a base consonant, digit, punctuation, mark, or sign. It is -1 for a subjoined consonant, -2 for EWSUB_wa_zur, -3 for EW_achung, +1 for EWV_gigu, and so on, according to the height at which these codepoints appear relative to one another when on the same stack. If two codepoints have equal height, they should not exist in the same grapheme cluster unless one is U+0F39, which is an integral part of a consonant when tacked on to, e.g., EWC_pha.

If x is not a Unicode 3.2 codepoint in the Tibetan range, or if x is not in NFTHDL form, 0 is returned. The height code of U+0F76 is not valid, and it is not an accident that U+0F76 is not in NFTHDL form.
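
The documented heights and the equal-height rule can be written out as a small table. This sketch is an assumption-laden illustration, not the private method: CodepointHeightSketch, height, and mayShareCluster are hypothetical names, only the heights stated above are filled in, every other Tibetan codepoint is given 0 as a placeholder, and U+0FAD is assumed to be the codepoint behind EWSUB_wa_zur.

```java
class CodepointHeightSketch {
    /** Relative height of x within a stack; 0 outside the Tibetan block. */
    static int height(char x) {
        if (x < '\u0F00' || x > '\u0FFF') return 0;      // not a Tibetan codepoint
        if (x == '\u0F72') return 1;                     // gi-gu sits above the base
        if (x == '\u0F71') return -3;                    // a-chung
        if (x == '\u0FAD') return -2;                    // wa-zur (assumed codepoint)
        if (x >= '\u0F90' && x <= '\u0FBC') return -1;   // other subjoined consonants
        return 0;                                        // base consonants; placeholder for the rest
    }

    /**
     * Applies the rule stated above: two codepoints of equal height should
     * not share a grapheme cluster unless one of them is U+0F39.
     */
    static boolean mayShareCluster(char a, char b) {
        if (a == '\u0F39' || b == '\u0F39') return true;
        return height(a) != height(b);
    }

    public static void main(String[] args) {
        System.out.println(height('\u0FAD'));
        System.out.println(mayShareCluster('\u0F40', '\u0F42')); // two base consonants
    }
}
```

Because the table is incomplete, a result of 0 for a vowel or sign not listed above reflects a gap in the sketch, not a statement about the real height codes.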


isTibetan

public boolean isTibetan()
DLC SOON

Specified by:
isTibetan in interface UnicodeReadyThunk


These API docs were created 02/02/2003 08:20 PM.
Copyright © 2001-2002 Tibetan and Himalayan Digital Library. All Rights Reserved.