Unicode Phase 1: The Review


"Beyond Transliteration: Digitizing South Asian Scripts 
Using the Unicode Standard"

Phase 1, The Review
Prepared by: Suzanne McMahon, Project Manager
Draft 2
Dec. 11, 1997
Table of Contents


     Character Sets
          Unicode/ISO 10646
     Markup Languages
     Transliteration Schemes for the Internet
          ISO Working Group for Transliteration of Indic Scripts

     Operating Systems
          Windows NT
     Development Software
          JDK 1.*
          Duke Humanities Computing Facility's UniEdit and WinCalis
          GammaPro's UniType
          MATHEMA TextEditor
          Monotype Fonts
          TamilWeb in Singapore Fonts
          Apple Indian Language Kit
          Netscape Communicator 4.0

     Non-Unicode Software and Standards
          TeX and LaTeX
     Inventory of Electronic Text Archives
          Other South Asian Languages

     The Environment
     The Standards
     The Texts

Appendix 1: Keyboards for Latin 1-*
Appendix 2: Unicode/ISO 10646 Summary Table
Appendix 3: Unicode Maps for South Asian Languages
Appendix 4: Unicode Project Proposal

While the English language has dominated computing technology, the need to create
hardware and software that can handle the world's many script systems has been
recognized for over a decade.  Working towards solutions for multi-lingual computing has
been called internationalization.  The first phase of internationalization concentrated on the
creation of localized systems.  These systems generally used non-standard character sets
coded in the upper half of the ASCII keyboard and were capable of handling only one
language using language-specific applications, for instance, Chinese Windows.  A second
phase came with the development of software like Apple's WorldScript, which made it
possible to install language kits on native systems and then to switch between two or more
languages.

But as networking became widespread the need for portability across applications,
platforms, and networks became critical.  A central issue of internationalization became
character encoding and the need for character set standards.

"By the late 1980s it was clear that existing international encoding standards and quickly
emerging national sub-standards, while serving their local needs well, could not provide the
necessary foundation for a unified global system. ASCII, the venerable 7-bit system, could
code only 128 characters . . . ISO 8859 (aka Latin-1), an 8-bit step-up, could code 256
characters. While sufficient for most European alphabets and some non-roman alphabets,
such  as Cyrillic, Hebrew, and Arabic, these standards lacked the capacity to render
non-alphabetic languages that use  sets of symbols and ideographs. To accommodate the
goals of internationalization, the lowest common denominator -- the number of bits assigned
to each character -- had to be increased."{1}

While national standards agencies in various countries had already begun to develop
character set standards using 7-bit and 8-bit schemes, a truly interoperable scheme
needed international scope and expanded capacity.  The 16- and 32-bit schemes of
Unicode and ISO 10646 were created to address these needs and they have the potential
to enable reliable multilingual information interchange world-wide. These character set
standards now map most of the world's known scripts, but software to implement the
maps has not yet been fully developed. Software to support Unicode must recognize the
Unicode character set and be able to render it on the screen and on a printout.  Most
major companies are now developing software that is Unicode conformant and it is safe to
predict that Unicode will eventually become as widely accepted as ASCII is today.  But the
label "Unicode conformant" can be a little misleading. It indicates accuracy of coverage
rather than breadth of coverage.  Unicode conformant software must successfully enable
some sub-set of the Unicode standard, but does not need to enable the entire standard.
Just because a piece of software is Unicode conformant does not mean that it will handle all
languages or enable true multi-lingual computing.

Most of the Indic scripts of South Asia are derived from Brahmi, known to exist at least
from the time of Ashoka (3rd century B.C.).  These include Assamese, Bangla, Devanagari,
Gujarati, Gurumukhi, Kannada, Malayalam, Tamil, and Telugu.  These scripts share many
structural similarities, and the languages written with them can be abstracted to a small
number of features, so technical solutions applicable to one can often be extended to the
others. {2} The Perso-Arabic script is used for Urdu, Sindhi, and sometimes for Panjabi.

The purpose of this research project has been to review and evaluate available Unicode
conformant fonts and software for rendering, i.e., converting generic character codes into
shaped glyphs for output to screen or printer, and for transferring character codes over
the Internet.  I have also considered existing transliteration schemes, text archives, and
alternatives to Unicode. Based on this review I have formulated recommendations for a
software environment suitable for creating prototype texts in Tamil and Perso-Arabic scripts.

Character Sets
The English language has, until recently, dominated the Internet. E-mail, gopher, and ftp use the
ASCII standard, a 7-bit scheme allowing for 128 character positions.  Reliance on ASCII for
information interchange has been severely limiting for languages written in non-roman
scripts, and scholars have often been forced to use various inadequate transliteration
schemes to handle non-roman characters when digitizing text.  Extended ASCII, the
character standard for microcomputer software, uses 8 bits to achieve 256 characters.
Before networking of personal computers became common, font and software developers
for languages written in non-Roman scripts laid out non-standard character maps, usually
employing extended ASCII positions 128-255, to accommodate non-Roman characters. The
non-standard character maps made it possible to use non-roman characters with many
stand-alone applications and, after loading special fonts, even on the Web. But
interoperability remains problematic with non-standard schemes.

The increasing popularity of the Internet has made interoperability a major issue.  HTML
(RFC1866) has relied for character encoding on ISO-8859-1, known as Latin-1, a version of
extended ASCII, which is appropriate only for English and Western European languages. 
However, ISO-8859 has made it possible to standardize character sets for some widely
used scripts and it now includes a series of 10 standardized character sets for writing in
occidental alphabetic languages: 1. Latin1 (West European); 2. Latin2 (East European);
3. Latin3 (South European); 4. Latin4 (North European); 5. Latin-Cyrillic; 6. Latin-Arabic;
7. Latin-Greek; 8. Latin-Hebrew; 9. Latin5 (Turkish); 10. Latin6 (Nordic).

Compared to Unicode these ISO-8859 character sets are extremely limited but they are
usable on the Internet, for instance, with MIME software, and they are a major
improvement over the 7-bit US-ASCII.  Characters 0 to 127 are always identical with
US-ASCII and the upper positions hold characters for other scripts.  The ISO 8859
charsets were designed by the European Computer Manufacturers Association (ECMA)
and endorsed by the International Organization for Standardization (ISO). {3} Keyboard layouts of the
approved ISO8859 character sets are attached as Appendix 1.
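The mechanics of the 8859 family can be sketched in a few lines (a present-day illustration, not software from the period under review): the lower half of every part decodes identically to US-ASCII, while a single upper-half byte yields a different letter under each part.

```python
# Sketch: one upper-half byte, three different letters, depending on
# which ISO 8859 part is in force; the lower half is always US-ASCII.
b = bytes([0xE9])
print(b.decode("iso-8859-1"))   # an accented Latin letter (Latin1)
print(b.decode("iso-8859-5"))   # a Cyrillic letter (Latin-Cyrillic)
print(b.decode("iso-8859-7"))   # a Greek letter (Latin-Greek)

# Positions 0-127 decode identically to US-ASCII in every part.
ascii_part = bytes(range(0, 128))
assert ascii_part.decode("iso-8859-1") == ascii_part.decode("ascii")
```

This is exactly why the scheme cannot scale to multilingual text: the meaning of a byte above 127 depends on out-of-band knowledge of which part is in force.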

"Dev" is a draft proposal for encoding Latin/Devanagari as an ISO-8859 character set. The
aim of "Dev" is to standardize on an encoding for Devanagari just as has been done for
Greek, Hebrew, Arabic, Russian etc. This will enable one to use Devanagari directly in
documents, email etc.{4}

Since the 1970s, committees of the Indian Department of Official Languages and the
Department of Electronics (DoE) have been evolving character set encodings for Indic
scripts, derived from Brahmi, based on their common phonetic structure.

In July 1983, DOE announced the ISCII (Indian Script Code for Information Interchange),
which complied with the ISO 8-bit code standard. While retaining the ASCII character set in
the lower half, it provides the Indic script character sets in the upper slots. Along with the
character set there was also a recommendation for a phonographic based keyboard layout
for all the Indian scripts called Inscript.  In 1988 DoE revised ISCII, making it more compact
and the standard has since been adopted by the Bureau of Indian Standards as IS
13194:1991. {5} The ISCII standard is currently undergoing another substantive revision.
Soon, the evolution of standards was expanded to other scripts ranging from Perso-Arabic
writing systems to Thai, Sinhalese, Tibetan, and Dzongkha.  Two more systems based on
ISCII have also been approved: ACII (Alphabetic Code for Information Interchange) and ISFOC
(Intelligence based Script Font Code).{6} 

The relationship between Unicode, developed by the Unicode Consortium, which includes
among others IBM, Metaphor, Claris, Microsoft, NeXT, Sun Microsystems, Apple, and
Xerox, and ISO 10646, developed by the International Organization for Standardization, has been
the source of some confusion.  The two standards are virtually identical and a recent decision
to merge them, with Unicode becoming a 16-bit subset of ISO 10646, should eliminate
future difficulties ("Unicode" and "ISO 10646" are used interchangeably in this report). 

Unicode, which is identical with the ISO 10646 BMP, is based on a 16-bit encoding that permits
65,536 characters instead of the 256 characters of Latin-1.  BMP stands for Basic
Multilingual Plane, sometimes referred to as Plane Zero.  While Latin-1 assigns a value of
0-255 to each character, Unicode assigns the value U+nnnn, where nnnn is a
four-digit number in hexadecimal notation; this value is referred to as a code point (a
summary table of the Unicode code points and graphic representations of the schemes for
South Asian scripts are included as Appendix 2 and Appendix 3).  Because scripts may be
made up of combinations of letters and diacritics, character names in the standard include
two special symbols along with the values: the dashed circle and the dashed box.  A
character that is shown in the standard with a dashed circle must be rendered in relation
to the previous character in the data stream.  A character that is shown as text surrounded
by a dashed box has no visible manifestation on its own. {7}
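The U+nnnn notation and the behavior of dependent marks can be illustrated with a short script (a latter-day sketch using Python's built-in Unicode tables, not software reviewed in this report):

```python
import unicodedata

# The U+nnnn notation names a code point in four hexadecimal digits.
ta = "\u0924"       # DEVANAGARI LETTER TA
sign_i = "\u093F"   # DEVANAGARI VOWEL SIGN I, shown with a dashed circle

print(f"U+{ord(ta):04X}", unicodedata.name(ta))
print(f"U+{ord(sign_i):04X}", unicodedata.name(sign_i))

# The vowel sign is a dependent mark (category Mc), not a free-standing
# letter (Lo): a renderer must position it relative to the preceding
# consonant in the data stream.
print(unicodedata.category(ta), unicodedata.category(sign_i))
print(ta + sign_i)  # the two code points render as one syllable, "ti"
```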

Beyond the BMP, ISO 10646 allows for 32-bit encoding and is divided into 32,768 planes
of 65,536 characters each, or over two billion character slots in all.
The larger capacity might be necessary in the future. For instance, while Unicode, in order
to save space, employed a scheme called Han unification that maps duplicate or very
similar characters from Chinese Han, Japanese Kanji, and Korean Hanja to one position,
ISO 10646 planes beyond the BMP would be needed for full unique sets of the characters.
Currently processing time is prohibitive, even for the 16-bit schemes, but the situation
should improve rapidly.  Even the large Unicode font files for the Asian scripts are
small in comparison to the video and sound files that are now routinely transferred over the
Internet. Currently only the BMP of ISO 10646 has been implemented.

You can represent Unicode using 7-bit, 8-bit, and 16-bit format standards.  UTF-7
(Universal Character Set Transformation Format-7) is a format that breaks the Unicode
codepoint into 7-bit values; these can be transferred through email (which uses the 7-bit
ASCII encoding) and on the Internet. This format presents difficulties because some values
are ambiguous, but it is still usable. The 8-bit UTF-8 (Universal Character Set
Transformation Format-8)  format breaks Unicode values into 8-bit sequences, which work
well on the Web.  UCS-2 (Universal Character Set-2) stores each character as a 16-bit
value, and each value corresponds directly to the character's code point in the Unicode standard.
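The three formats can be compared directly (a latter-day sketch; the codec names are Python's, with "utf-16-be" standing in for raw UCS-2):

```python
# One code point, DEVANAGARI LETTER KA, in three transformation formats.
ka = "\u0915"

print(ka.encode("utf-7"))       # 7-bit-safe form that survives mail transport
print(ka.encode("utf-8"))       # 8-bit byte sequence, three bytes here
print(ka.encode("utf-16-be"))   # a 16-bit value equal to the code point itself

# Plain ASCII text is unchanged under UTF-8, one byte per character.
print("Tamil".encode("utf-8"))
```

Note how UTF-8 leaves existing ASCII data untouched, which is the property that makes it attractive on the Web.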

Unicode encodes text by script, not by language. This avoids duplication of letters. For
instance, the Latin A is used without distinction for text in Catalan, English, Indonesian,
Swedish, or Swahili. Similarly a letter in Devanagari script could be used in Sanskrit, Hindi,
or Nepali.   

Character set standards relate to basic interchange of text.  They don't include attributes
regarding language, display format, color, or typeface. Unicode characters are made visible
through a distinct rendering process that maps characters (entities used in data
interchange that generically specify a particular symbol) to glyphs (the particular shapes of
given characters as they are displayed).{8}   For example, in English lower-case "c" is the
character (a code for a generic lower-case "c") transmitted.  When the code is received it
is converted into a glyph, a letter with a particular shape that will be displayed on the
screen or sent to a printer.  Loosely defined under the rendering process is the interaction
of hardware and software that will translate character codes into glyphs. This interaction
may differ substantially from one system to another.  The process includes at least some of
the following components:  operating system; locale and language settings; keyboard and
display software; word processing software; type rasterizer; hardware for input and output. 

Unicode has presented new challenges for developers of rendering software.  In the English
and Western European scripts handled by ASCII and Latin-1 there is always a one-to-one
correspondence between the character and the glyph, so the character set can also serve as a
font encoding.  The case is different for languages written in, say, the Perso-Arabic script, such
as Urdu or Persian, where an identical character may have a different basic form, or glyph,
depending on whether it occurs initially, medially, or finally in a word.  In this case font encoding
won't have a one-to-one match with the character set.
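The character/glyph distinction is easy to demonstrate with the Arabic letter beh (an illustrative sketch using Unicode's legacy "presentation form" code points, which exist precisely to name the positional glyphs):

```python
import unicodedata

# One character, four glyphs: ARABIC LETTER BEH (U+0628) takes a different
# shape depending on its position in the word. The presentation-form code
# points name those shapes explicitly; compatibility normalization folds
# each of them back to the single character that is actually transmitted.
forms = {
    "isolated": "\uFE8F",
    "final":    "\uFE90",
    "initial":  "\uFE91",
    "medial":   "\uFE92",
}
for position, glyph in forms.items():
    base = unicodedata.normalize("NFKC", glyph)
    print(f"{position:8} U+{ord(glyph):04X} -> U+{ord(base):04X}")
```

A rendering engine performs the reverse mapping at display time: it receives the single character and must choose the correct positional glyph from the font.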

TSCII (draft)
The draft TSCII standard will be submitted in January 1998 to the Tamilnadu
Computer Standardisation Committee (TNC) for possible adoption. An international
conference devoted to Tamil information processing, called TamilNet'97, was organized
early this year in Singapore by the Internet Resources and Development Unit (IRDU) of the
National University of Singapore to discuss possible standardization of the large number of
different Tamil fonts being used on the Internet.  The two highly active Tamil email
discussion lists tamilnet and TamilWeb (recently merged) have been discussing the issue
avidly for over a year and list participants are now commenting on the proposed standard. 

The standard differs from Unicode and ISCII in that it defines glyphs as well as characters. 
The glyphs chosen were included because of frequency of occurrence in modern Tamil. 
The  standard is being developed anticipating that it will eventually be superseded by
Unicode. However, solutions for widespread use of Unicode are not yet in place and the
8-bit TSCII standard can be used immediately with existing software and hardware.{9}

The Hypertext Markup Language (HTML) is a markup language used to create hypertext
documents that are platform independent.  Initially, it used Latin-1 (ISO-8859-1) for
document text and ASCII for markup. This limitation of character sets means that HTML is
useful for Western languages, but problematic for anything written in non-Roman scripts. 
An Internet draft on the internationalization of HTML (draft-ietf-html-i18n-03.txt), prepared by the
Internet Engineering Task Force (IETF), suggests altering the HTML Document Type
Definition (DTD), i.e., the formal definition of the HTML syntax in terms of SGML, by 
encompassing a larger character repertoire than ISO-8859-1, while still remaining SGML
compliant.{10} The  larger character repertoire suggested is Unicode, ISO 10646 BMP.

XML (draft)
Extensible Markup Language (XML) has been designed as an alternative to HTML that can
be used easily over the Internet.  "XML describes a class of data objects stored on
computers and partially describes the behavior of programs which process these objects. 
Such objects are called XML documents. XML is an application profile or restricted form of
SGML, the Standard Generalized Markup Language [ISO 8879]."{11} 

XML documents are made up of storage units called entities, which contain either text or
binary data. Text is made up of characters, some of which form the character data in the
document, and some of which form markup. Markup encodes a description of the
document's storage layout, structure, and arbitrary attribute-value pairs associated with
that structure. XML provides a mechanism to impose constraints on the storage layout and
logical structure. A software module called an XML processor is used to read XML
documents and provide access to their content and structure. It is assumed that an XML
processor is doing its work on behalf of another module, referred to as the application. 

Each text entity in an XML document may use a different encoding for its characters, but
all XML processors must be able to read entities in either the UTF-8 or UCS-2 Unicode
encodings. It is recognized that for some advanced work, particularly with Asian languages,
the use of the UTF-16 encoding is required, and correct handling of this encoding is a
desirable characteristic in XML processor implementations. 
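A minimal sketch of this requirement (the element name and text below are invented for illustration, and Python's bundled XML parser stands in for the "XML processor"):

```python
import xml.etree.ElementTree as ET

# The same small entity serialized in two encodings every XML processor
# must accept. The declaration names the encoding; the parser reads both.
template = '<?xml version="1.0" encoding="{enc}"?><line>\u0924\u092e\u093f\u0932</line>'

utf8_doc = template.format(enc="UTF-8").encode("utf-8")
utf16_doc = template.format(enc="UTF-16").encode("utf-16")  # BOM included

for doc in (utf8_doc, utf16_doc):
    root = ET.fromstring(doc)
    print(root.tag, root.text)  # identical content either way
```

The application sees identical character data regardless of which encoding the entity used on disk or on the wire.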

An important question is how XML will relate to non-HTML DTDs like TEI. 
As the situation currently stands, TEI specifies a number of constructs which are not legal
in XML unless the user provides a modified DTD.  The TEI group is currently studying the
work involved in modifying the TEI DTD to be legal XML.{12}

ISO Working Group for Transliteration of Indic Scripts
In May 1997, a Working Group was set up under the International Organization for
Standardization, responsible for developing transliteration schemes from Indic scripts into
Latin script (ISO/TC46/SC2/WG12).  The following scripts are to be covered by this
working group: Assamese, Bengali, Devanagari, Gujarati, Gurumukhi, Kannada, Oriya,
Malayalam, Sinhala, Telugu, Tamil.  The purposes of transliteration to be covered by the
standard may include academic publications (both traditional and electronic), electronic
transmission of texts, bibliography, geographical names, email, and Web documents. Both
7-bit and 8-bit schemes will be considered.  It will be particularly important that any
transliteration scheme in the standard applies to each of the scripts listed above. This will
involve the discussion of linguistic equivalents in the different scripts. 

CSX stands for Classical Sanskrit eXtended.   It is the closest thing to an accepted
standard encoding for romanised Indian text that currently exists. CSX is an 8-bit
character-set, mapped in the upper half of extended ASCII.  It includes the accented
characters most likely to be needed by scholars working in Indian languages. While it is
especially useful for Sanskrit, including Vedic Sanskrit, it has been expanded to include
letters necessary for working with modern Indic scripts. {13}

As mentioned above, Unicode conformance does not necessarily mean that software implements
the entire standard. Software is said to be Unicode conformant if it makes use of
independent fixed-width 16-bit characters and uses Unicode code points to represent
Unicode-defined characters. Code conversion from other standards to the Unicode
standard will be considered conformant if the matching table produces accurate conversion
in both directions.{14}

While most living languages have been mapped in the Unicode standard no software yet
supports the entire standard.  Some applications do offer partial support. 

Windows NT
Windows NT is a 2-byte operating system.   To say that Windows NT "supports" Unicode
is somewhat misleading.  Windows doesn't have built-in support for all of the character
sets mapped by ISO 10646.  A more accurate description is that Windows NT uses
Unicode strings natively, and that it also supports ANSI (8-bit) strings. An API (Application
Programming Interface) in Windows NT is coded in Unicode strings. If a programmer uses
an ANSI (single-byte, 8-bit) designation of an API function, Windows NT converts
ANSI to Unicode strings and uses the Unicode version to do the real work.  All of the
programs that come with Windows NT are compiled to use the Unicode APIs.  Since the
building blocks of the NT system are two-byte Unicode strings, this means that the NT
system can efficiently handle fonts and rendering software for Unicode text, particularly if
the software is also written using Unicode strings instead of ANSI. {15}
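The ANSI-to-Unicode conversion NT performs can be mimicked in a few lines (an illustrative sketch; the code page and string are invented, and real programs would call the Win32 API rather than do this by hand):

```python
# An "ANSI" (8-bit) string arrives; NT widens it to 16-bit Unicode and
# does the real work on the wide form.
ansi_bytes = b"caf\xe9"             # "café" in Windows code page 1252
wide = ansi_bytes.decode("cp1252")  # the Unicode string NT operates on
stored = wide.encode("utf-16-le")   # two bytes per character internally

print(len(ansi_bytes), "ANSI bytes ->", len(stored), "Unicode bytes")
```

The doubling of storage is the cost of the uniform two-byte representation; the benefit is that every API works on the same string type regardless of the original code page.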

Microsoft plans to incorporate multilingual capability in its Windows NT 5.0 OS, which
should be available in 1998.  NT 5.0 will allow users to switch easily between languages by
adding language packs. 

Mac OS 8.0  and Win98 will also have Unicode support features, but not extensive
multilingual capability.

Development Software
JDK 1.*
Java was one of the first programming languages to offer Unicode support. In the Java
Development Kit (JDK) all relevant data types have been extended from 8 to 16 bits and
thus can potentially hold Unicode data. In addition, the JDK supports the UTF-8 storage format
and special notations for the integration of Unicode characters. This makes Java a
natural development platform for all kinds of Unicode applications. However, for the
development of a word processing system capable of handling texts consisting of a
large variety of different Unicode characters (and not only a small subset), Java's built-in
Unicode facilities are not yet sufficient. In order to display Unicode documents there is a
need for specially adapted Unicode font management which provides the necessary fonts;
Java currently supports only a small selection of fonts.

Duke Humanities Computing Facility's UniEdit and WinCalis 
 UniEdit, a standalone Unicode text editor, was developed at Duke University's Humanities
Computing Facility as part of WinCALIS, an authoring system which allows instructors to
create and present lessons for students in almost any language through a graphical
interface. WinCALIS provides full support for the Unicode character set and allows
representation of multiple languages simultaneously. The basic components of the
WinCALIS system are WinCALIS Author (for creation of materials), UniEdit (for text editing),
the Multimedia Editor (for incorporating images, sound, and video clips in a lesson) and the
Student Delivery Station.{16}

GammaPro's UniType 
Gamma has developed Unicode conformant fonts for most scripts mapped by Unicode. 
The fonts can be used in most Microsoft software but aren't reliable with other
software. UniType also includes a text editor. Gamma's distribution has recently been
transferred to Unitype and it is unclear who will be responsible for further development of
the fonts and text editor.

Monotype Fonts
Monotype, a company that has been making type for a hundred years, is entering the
Unicode market. They have developed a base level module of Unicode conformant fonts
including: Pan-European Latin, Cyrillic, Greek, Hebrew, and Arabic.  They also are considering
development of 3 similar modules for Indic languages--Indic 1: Devanagari, Bengali, Tamil,
Telugu; Indic 2: Gujarati, Oriya, Sinhalese; Indic 3: Gurumukhi, Malayalam, Kannada.  Their
strategy is to combine groups of complementary Unicode-conformant fonts into 'families'
of fonts.{17}

TamilWeb in Singapore Fonts
The fonts available on the Singapore TamilWeb are not Unicode conformant but are
encoded in the upper half of the ASCII scheme.  Viewers who don't choose to load the
special font can click on an icon beside the Tamil text to have the Tamil letters converted on
the fly and displayed by a Java applet. TamilWeb is sponsored as an experimental
collaborative project between the National Institute of Education/Nanyang Technological
University and the Internet Research and Development Unit (IRDU)/National University of
Singapore. {18}

Apple Indian Language Kit
The Apple Indian language kit contains the necessary software to work with Devanagari,
Gurumukhi, and Gujarati scripts on the Mac platform. The kit is not Unicode conformant,
but it does conform to ISCII, which was the basis of the Unicode mappings.

Netscape Communicator 4.0 
Netscape's Communicator 4.0, available for Windows 95, NT, and the Mac, makes it possible
to browse Unicode documents in the UTF-7 and UTF-8 encoding schemes, but you must
install a Multilanguage Support plug-in.  Multilanguage Support handles only a subset of
scripts from the Unicode standard. 

Other browsers offering some level of multi-lingual Unicode support include
Accent Multilingual Mosaic, a stand-alone multilingual Web browser (Navigate with an
Accent is a multilingual plug-in, also produced by Accent, to be used with Netscape
Navigator); HotJava Win32; and Tango. {19} Tango has recently licensed the Unicode
conformant fonts created by Monotype.{20}

While the browsers listed above permit you to see Latin1 and one other character set
(e.g., Arabic, Hebrew, or Greek) within a single document, they lack the functionality for
viewing more than two languages at once.  Babble, created at the University of Virginia by
John Unsworth, enables browsing of multiple non-roman character sets in a single
document. Babble also makes it possible to search within Unicode documents, to compare
multiple texts, and to handle non-HTML SGML.

Babble is designed to read UCS-2 documents only. Several programs are available with
Babble that can convert documents to and from UTF-8.  Babble is not a text editor and it
doesn't come with its own internal fonts, so third-party products are needed if Babble is
going to be used for input or text editing.{21}

The approval of the Unicode standard and the development of fonts, text editors, and web
browsers to support it could have significant implications for researchers working with
languages written in non-roman scripts. Humanist scholars in the fields of language,
literature, and linguistics have already initiated significant projects to digitize literary corpora
in non-roman languages, for instance, the Thesaurus Linguae Graecae and the Buddhist
Pali canon. Most of the major corpora are not currently encoded in Unicode and tend to
adhere to local standards developed for a particular project.  But the desirability of
achieving widespread interoperability, as well as the accuracy and authenticity vital to collation,
concordancing, and interpretation, should ensure that future corpora projects will conform to
the broader standards.

No major South Asian archive has yet been built using Unicode.  Most South Asian
electronic text files use local schemes of encoding in South Asian scripts or of
transliteration to roman script.  Some transliteration schemes have developed software to
convert transliterated text files into a display in the original script.  A few archives include
significant numbers of bit-mapped scans of original manuscripts or books. 

Below I have given basic information about standards and software that are not Unicode
conformant, but that have wide acceptance among South Asianists involved in digitizing text
in South Asian Scripts.  I have also reviewed major text archives; when an encoding scheme
or software is used only for one particular archive I have included information about the
encoding scheme or software with the description of the archive.

The  GIST Group in Pune, India, part of the Center for Development of Advanced Computing
(C-DAC), developed the Graphics & International Script Technology (GIST).  GIST
technology has used the ISCII standard for character sets (requires content-specific
pre-processing before display or printing), but has recently also developed another standard
called ISCLAP, for use with technology such as pagers. This standard defines a few
thousand characters for each Indian script, taking into account all the possible
combinations of consonants and vowels. This scheme allows the "characters" to be placed
one after the other in the manner of Roman script languages.  GIST has submitted the
standard to ISO for possible inclusion in ISO 10646.

The foundation of GIST was laid in 1983 by an IIT Kanpur student, Mohan Tambe.  He
developed the Integrated Devnagri Terminal (IDT), and demonstrated it at the First
International Hindi Sammelan the same year. The Department of Electronics (DoE) funded
further development and the IDT evolved into the GIST Terminal, which supported a code
based on a common phonetic overlay for all Indian scripts. An add-on GIST card was also
developed for IBM PCs as they became more popular. In 1988, the GIST technology was
entrusted to C-DAC and the GIST group.  In 1992 the Group began commercialization of
some of its products like the GIST Terminal, GIST 9000 (a micro-chip) and the GIST Card. 
The technology is being applied in diverse areas such as sub-titles for movies shown on
television, messages displayed on pagers, railway reservation charts, and records of
banking transactions. {22}

ITRANS is a transliteration package which uses romanized (ASCII) input and produces
output in a number of Indian languages (Hindi, Marathi, Sanskrit, Bengali, and Telugu). 
Output is produced by running the ASCII input through a chain of processors (ITRANS, TeX, and
a DVI driver) to produce a PostScript file that can either be viewed on the screen or printed out
on a PostScript printer.  ITRANS is used for the ITRANS Song Book, which includes lyrics
for over 1000 Hindi songs.  It is also being used to transliterate Sanskrit text that is being
mounted on the Web. {23}
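The flavor of the scheme can be conveyed with a toy transliterator (the mapping below is a tiny invented subset, not the real ITRANS tables, and it emits Unicode rather than the TeX the actual package produces):

```python
# Toy romanized-input -> Devanagari converter in the spirit of ITRANS.
# Real ITRANS matches multi-letter tokens ("kh", "aa", "RR") longest-first;
# this sketch handles only a few single letters.
TABLE = {
    "k": "\u0915",  # KA
    "t": "\u0924",  # TA
    "m": "\u092E",  # MA
    "a": "",        # inherent vowel: no mark needed after a consonant
    "i": "\u093F",  # vowel sign I
}

def transliterate(word: str) -> str:
    return "".join(TABLE.get(ch, ch) for ch in word)

print(transliterate("mati"))  # four Latin letters -> three Devanagari code points
```

Even this toy shows why such schemes caught on: text can be typed, stored, and mailed as plain ASCII, with conversion to the script deferred to output time.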

TeX and LaTeX
TeX is a public domain typesetting package. To create a document using TeX the user first
produces an ASCII file, which is then input into TeX. TeX reads the file and converts it to a
DVI file (DeVice Independent). A DVI is called device independent because it can be output
with equal accuracy on a laser printer, computer screen, or phototypesetter. The DVI file is
read by another program (called a device driver) that produces the output to the computer
screen or printer.

The LaTeX macro package (this is the package the ITRANS Song Book uses) expands the
capabilities of TeX.  LaTeX can automatically create an index, a table of contents, and a
bibliography.  It can insert some elementary graphic figures such as circles, ovals, lines, and
arrows. The user can also define "style files" to set up specific page parameters. {24} TeX
and LaTeX are used widely with many non-Roman scripts, not just Indic scripts, but the
programs are mentioned here because of their wide implementation in projects digitizing
texts in Indic scripts.

By far the largest number of South Asian archives of electronic text are in Sanskrit.  Many
of these texts are encoded in ITRANS or transliterated using the CSX encoding scheme. 
The most extensive archive is the "Virtual e-Text Archive of Indic Texts" on Dominik
Wujastyk's INDOLOGY website.{25}  Readme files explain transliteration schemes for texts
not using ITRANS or CSX.  Major works included are the Rig Veda, the Mahabharata,
Panini's Astadhyayi and Patanjali's Yogasutra.  The page also links to the Sanskrit text
archive at Kyoto, the Pali Canon, John Smith's Cambridge Archive, and a number of other
important Sanskrit sites. The Indology site also includes digitized images of Sanskrit
manuscripts from the Wellcome Institute at the University of London. The University
of Pennsylvania has the largest number of Sanskrit manuscripts in the United States and
that collection is being scanned by the University's Center for Electronic Text and Image
and mounted on the University's website. {26} One group inputting Sanskrit texts is the
Sanskrit Documents project in Utah.  Much of the input is by volunteers and texts are
encoded using ITRANS. Ongoing projects include the Online Sanskrit Dictionary, the
RigVeda, the SaamaVeda, and the Amarakosha.

There are three major Tamil electronic text archives.  The first is the Pioneer Tamil World
Wide Web Archive in Singapore.{27} The site includes selections from Sangam literature,
the Thirukkural, selections from Singapore Tamil poetry, and the Purananuru in Unicode
UTF-8.  The non-Unicode texts on the site are encoded in the draft standard TSCII.  The
second important site is the University of Koeln archive, which includes several online
Tamil and Sanskrit lexicons and the Tamil Text Thesaurus (TTT).{28} Tamil texts on the
Koeln site are transliterated according to a special scheme developed for the Koeln
digitizing project. The third important site is at EPF-Lausanne. The EPF-L site has put up
around 50 texts in the Mylai font, beginning with the 2nd-century Thirukkural, and around
25 texts in a transliteration scheme developed for the Lausanne project.{29}

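To make concrete what "Unicode UTF-8" means for a Tamil text, the following sketch (illustrative only; the two code points spell a single sample syllable, not a passage from any of the archives above) shows how each Tamil character expands to three bytes in UTF-8:

```python
# Tamil letter TA (U+0BA4) followed by vowel sign I (U+0BBF).
word = "\u0BA4\u0BBF"

# Every code point in the Tamil block (U+0B80-U+0BFF) takes
# three bytes in the UTF-8 encoding form.
utf8_bytes = word.encode("utf-8")

print(len(word))        # 2 code points
print(len(utf8_bytes))  # 6 bytes
print(utf8_bytes.hex()) # e0aea4e0aebf
```

This byte expansion is one reason the draft TSCII scheme, which packs Tamil into a single-byte code, remained attractive for mailing lists and older software.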
Other South Asian Languages
For electronic text archives in other Indian and Pakistani languages, refer to
the following websites: Links to India Information: Languages{30}, Unicode Progress
Report{31}, and Pakistan: Education and Languages{32}.  For other South Asian countries,
check the appropriate listing in the World Wide Web Virtual Library.{33}

My original Phase 1 proposal for the Unicode project covered only a review of the
standards, software, and existing archives.  However, sufficient time and funds remain to
complete prototype data entry before submitting a proposal for Phase 2 implementation.
Based on the completed review, the standards and software detailed below will be used to
complete a prototype project.

*Operating System
     Windows NT (installed)
     Babble with Netscape Communicator 4.0 (Communicator 4.0 installed)
*Text Editor
     Gamma UniType (ordered)
          second choice is:
          Duke Humanities Computing Facility's UniEdit
*Fonts
     Gamma UniType fonts (ordered)
          second choice is:
          Monotype Indic Packages 1-3
*Character Set
     Unicode, UCS-2
*Markup
     TEI Lite (markup will remain conformant with the draft XML standard)
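Since the prototype stores text in UCS-2, every character of the scripts involved occupies exactly one 16-bit code unit. A minimal sketch (the two characters are chosen purely for illustration, one Devanagari and one Arabic-script letter used in Urdu):

```python
# UCS-2 stores each Basic Multilingual Plane character in one 16-bit unit.
# Devanagari letter KA (U+0915) and Arabic letter GHAIN (U+063A).
sample = "\u0915\u063A"

# For BMP-only text, big-endian UTF-16 is byte-identical to UCS-2.
ucs2 = sample.encode("utf-16-be")

print(len(sample))  # 2 characters
print(len(ucs2))    # 4 bytes: two 16-bit code units
print(ucs2.hex())   # 0915063a
```

The fixed two-byte width is what makes UCS-2 simple for editors and databases to index, at the cost of doubling storage for texts that are mostly ASCII markup such as TEI Lite tags.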

For Tamil the texts will be selections from the modern Tamil poet Kaviyarasu Kannadhasan,
and for Urdu, selections from the ghazals of Ghalib.  An additional selection from Hir
Varis Shah will be input in both Urdu and Gurmukhi scripts. Prototypes will be completed
by July 1998.


{1}   Joseph Hargitai "Unicode: Writing in the Global Village" in "Connect: Fall 1996"
Accessed 12/1/97

{2}   Glenn Adams, "Introduction to Unicode, A Tutorial"
Accessed 12/5/97

{3}   Roman Czyborra, "The ISO 8859 Alphabet Soup"
Accessed 12/1/97

{4}  Sandeep Sibal, "Dev - an encoding for Devanagari" 
Accessed 12/1/97

{5}   Hacharan Singh and A. S. Rawat, "Asian Forum for Standardization of Information
Technology: Universal Code Set and Internationalization: Country Report," delivered at the
7th AFSIT, October 22, 1992, Tokyo, Japan.
Accessed 12/4/97

{6}  Indian Script Code for Information Exchange - ISCII. (New Delhi: Bureau of
Indian Standards, 1991).

{7}   Dave Johnson, "Concepts of C/UNIX Internationalization: Unicode" 
Accessed 12/4/97

{8}  "A Unicode Font Solution: A White Paper from Monotype"
Accessed 12/1/97

{9}  "A Proposal for A Tamil Standard Code For Information Interchange
(TSCII)"  http://www.geocities.com/Athens/5180/tsic.html
Accessed 12/7/97

{10}  F. Yergeau et al., "Network Working Group Internet Draft:
Internationalization"
Accessed 12/4/97

{11}  Tim Bray and C. M. Sperberg-McQueen, "Extensible Markup Language
(XML): WD-xml-961114: W3C Working Draft 14-Nov-96"
Accessed 12/1/97

{12} Michael Sperberg-McQueen, personal email transmission, 12/13/97.

{13}  John Smith,  "The CSX encoding"
Accessed 12/12/97

{14} Dave Johnson, "Concepts of C/UNIX Internationalization: Unicode" 
Accessed 12/4/97

{15}  Matt Pietrek, "Unicode strings used natively in Windows NT,"
     Microsoft Systems Journal v12, n12 (Dec. 1997): 67 (7 pages).
telnet://Melvyl.ucop.edu/
Accessed 12/7/97

{16} Duke University, Humanities Computing Facility, "WinCALIS Language Learning
and Authoring System"
Accessed 12/15/97

{17} "A Unicode Font Solution: A White Paper from Monotype"
Accessed 12/1/97

{18}  "The Pioneer Tamil World Wide Web Archive in Singapore"
Accessed 12/13/97

{19}  "Internet with an Accent"
Accessed 12/5/97

{20}  "Tango Browser"
Accessed 12/1/97

{21} "Babble: A Synoptic Unicode Browser"
Accessed 12/3/97

{22}  "The GIST Of Languages," Dataquest India, 30 August 1997
Accessed 11/9/97

{23}
Accessed 12/3/97

{24}  "An Intro to TeX/LATEX"
Accessed 12/1/97

{25} Dominik Wujastyk, "INDOLOGY: Internet Resources for Indological Scholarship:
Virtual e-Text Archive of Indic Texts"
Accessed 12/1/97

{26} "Sanskrit Manuscripts"
Accessed 12/1/97

{27}  "Pioneer Tamil World Wide Web Archive in Singapore"
Accessed 11/27/97

{28}  "IITS -- Institute of Indology and Tamil Studies: Indological Resources"
Accessed 11/15/97

{29} "Tamil ETexts available at EPF-Lausanne"
Accessed 12/3/97

{30}  Sergio Paoli, "Links to India Information: Languages"
Accessed 12/1/97    

{31} Lakshmi Vasist, "Unicode Project Progress Report"
Accessed 12/1/97

{32} "Pakistan: Education and Languages"
Accessed 12/1/97

{33}  "World Wide Web Virtual Library"
Accessed 12/1/97

Copyright (C) 1997 by the Library, University of California, Berkeley. All rights reserved.
Document maintained on server: http://www.lib.berkeley.edu/ smcmahon@library.berkeley.edu
Last update 12/21/97. Server manager: webman@library.berkeley.edu