SDA 3.4 Documentation for LANGUAGE


NAME

language - Using non-English languages

DESCRIPTION

SDA allows users to set up SDA datasets and to display results in practically any language. This document summarizes the issues involved. Note that Unicode may be used for many languages, so long as the UTF-8 encoding is used (NOT the UTF-16 or UTF-32 encoding).

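This document includes the following topics:

    Data definitions in various languages
    Specifying the character encoding
    Specifying the font for charts
    Language limitations in SDA search
    Specifying the lang attribute
    Modifying the user interface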

DATA DEFINITIONS IN VARIOUS LANGUAGES

The names and labels and question text for variables are all defined in a DDL file. The variable names must only contain ASCII characters, but the labels and question text may be encoded with any character set.
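Before running MAKESDA, it can be handy to screen variable names for non-ASCII characters. The following is a minimal sketch of such a check. The function name and the sample names are hypothetical, and the code does not parse DDL syntax; it only assumes the variable names have already been collected into a list.

```python
def non_ascii_names(names):
    """Return the variable names that contain any non-ASCII character."""
    return [name for name in names if not name.isascii()]


# Hypothetical variable names. Labels and question text are not checked,
# since they may be encoded with any character set.
candidates = ["AGE", "REVENU", "ÂGE", "région"]
print(non_ascii_names(candidates))
```

Names reported by this check would need to be renamed using ASCII characters only; the accented form can still appear in the label.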

After the DDL file has been used to create the SDA dataset (by using the MAKESDA program), all displays of SDA results will (try to) use that character set.


SPECIFYING THE CHARACTER ENCODING

If the text in your DDL file is just plain ASCII (also known as ’US-ASCII’), then you don’t have to worry about character encoding issues. However, if the data definitions are encoded with another character set (to include accent marks, for example), or if the user interface has been modified using another character set, browsers might not display the characters properly.

If you are not using ASCII text, then the name of the character encoding used for a dataset should be specified using the ’CHARSET=’ keyword in the general section of a DDL file. The name of this character set will then be stored as a permanent part of the SDA dataset (in the STUDYINF/studyinf file) when MAKESDA is executed.

For a list of recognized character sets, see the list of IANA Character Sets. Some commonly encountered encodings are: ’Windows-1252’ (older Windows files) and ’ISO-8859-1’ (Western European). However, UTF-8 is today the preferred encoding for storing a study’s metadata text because:

Therefore UTF-8 should be used for creating and storing metadata for datasets whenever possible. (Remember, ASCII text is UTF-8 text. So if your metadata is plain ASCII then you’re already using UTF-8.)
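The point that ASCII text is already UTF-8 text can be checked directly. The short sketch below uses illustrative sample strings (not taken from any SDA dataset). It shows that plain ASCII text yields the same bytes under both encodings, while accented text yields different bytes under ISO-8859-1 and UTF-8.

```python
text_ascii = "Age of respondent"
text_accented = "Âge du répondant"

# Plain ASCII text produces identical bytes under ASCII and UTF-8.
same_bytes = text_ascii.encode("ascii") == text_ascii.encode("utf-8")

# Accented text is stored differently: one byte per character in
# ISO-8859-1, but two bytes for each accented character in UTF-8.
latin1_bytes = text_accented.encode("iso-8859-1")
utf8_bytes = text_accented.encode("utf-8")

print(same_bytes)
print(len(latin1_bytes), len(utf8_bytes))
```

Because the byte sequences differ for accented text, a file saved in one encoding and declared as another will not display correctly.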

When HTML pages are generated by various SDA programs, the charset information stored with the dataset will be taken into account so the pages can be displayed correctly in a browser. Usually the charset information will be used to write a meta tag in the head element of an HTML page. For example:

    <meta http-equiv="Content-Type" content="text/html;charset=utf-8">

HTML output written by one of SDA’s servlet-based webapps is an exception to this rule, however. Java internally represents all characters in Unicode. The crucial point is therefore that encoded metadata text from an SDA dataset must be decoded correctly -- using the charset specification stored in the dataset’s STUDYINF/studyinf file -- when it is imported into the Java environment and turned into Unicode. After that, the "native" encoding of the text is lost. For this reason, HTML pages produced by a servlet will typically not include a meta tag specifying the original charset for the dataset. Instead, servlet pages are usually written using UTF-8 encoding.
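The decode-then-re-encode step described above can be illustrated outside of Java. The sketch below is written in Python and is not SDA's own code: the function name and the sample label are assumptions made for illustration. It decodes bytes using a declared charset, writes them back out as UTF-8, and shows what happens when the declared charset is wrong.

```python
def to_utf8(raw: bytes, declared_charset: str) -> bytes:
    """Decode raw metadata bytes with the declared charset, then re-encode as UTF-8."""
    text = raw.decode(declared_charset)  # encoded bytes -> Unicode string
    return text.encode("utf-8")          # Unicode string -> UTF-8 bytes


# Bytes as they might be stored for a label saved in Windows-1252.
label_bytes = "Zürich".encode("windows-1252")

# Correct charset: the text survives the round trip.
good = to_utf8(label_bytes, "windows-1252").decode("utf-8")

# Wrong charset: no error is raised here, but the text is silently corrupted.
bad = label_bytes.decode("iso-8859-5")

print(good)
print(bad == "Zürich")
```

The second case is the reason the charset stored with the dataset matters: a wrong declaration often produces no error, only wrong characters.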

There are a couple of other technical issues concerning character encoding that should be kept in mind.

SPECIFYING THE FONT FOR CHARTS

Normally it is the browser that is responsible for selecting the correct font for displaying text, using whatever information is available in the HTML code (or response header from the server) as a guide. However, charts present a special problem because the chart inserted into the HTML output is just an image -- a picture -- and the browser has no control over selecting the font that is used in the chart’s headings, labels, etc. Instead, the font is selected when the chart image is created by the chartgen servlet on the server.

By default, the chartgen servlet uses the generic Java "SansSerif" font when displaying text. (This "SansSerif" font is mapped to a particular physical font on the server on a system-dependent basis.) In many instances this default font will work fine. However, there may be cases where a specific font is required to display a given language. There are two ways to relay the required font to the chartgen servlet:

    1) a font specification can be applied globally in the chartgen configuration file;
    2) a chart font for a particular dataset can be specified in the HARC file (overriding any global specification).

For more information on specifying fonts for charts, see the section on charts in the SDA Archive Developer’s Guide.

Remember too that the specified font must actually be present on the server machine that is running the JVM, and the server must be configured so that the font is available to Tomcat (or your chosen servlet container). For more information on language issues in Java and in the servlet environment, see the Java Internationalization FAQ.

LANGUAGE LIMITATIONS IN SDA SEARCH

The SDA search utility currently works with search terms entered in English or a Western European language. The search utility is configured so that accented Latin characters (German umlauts, etc.) will be displayed correctly; however, the search terms themselves can only be entered using non-accented characters. Languages that aren’t compatible with the Latin character set at all -- Asian ideographs, Georgian script, etc. -- can’t be used in search terms (although they will still display correctly in search results). These language limitations in SDA searching will likely be removed in a future version of SDA. However, it is important to be aware of these issues if you have datasets that are not in English.
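Since search terms must be entered without accents, a term can be pre-processed before it is typed into the search form. The sketch below is a user-side helper, not part of SDA, and the function name is hypothetical. It uses Unicode decomposition to drop accent marks from Latin letters. Whether an unaccented term then matches accented text in a given dataset is not specified here.

```python
import unicodedata


def strip_accents(term: str) -> str:
    """Replace accented Latin letters with their unaccented base letters."""
    # NFKD splits a letter such as 'ü' into 'u' plus a combining mark.
    decomposed = unicodedata.normalize("NFKD", term)
    # Drop the combining marks and keep everything else.
    return "".join(ch for ch in decomposed if not unicodedata.combining(ch))


print(strip_accents("Müller"))
print(strip_accents("français"))

# Non-Latin scripts are left unchanged and remain non-ASCII,
# so they still cannot be used as search terms.
print(strip_accents("ქართული").isascii())
```

Letters that have no decomposition, such as "ß" or "ø", pass through this function unchanged and would have to be replaced by hand.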

SPECIFYING THE LANG ATTRIBUTE

In addition to specifying a charset in the global section of the DDL file, you can also specify the dataset’s "lang" attribute. The "lang" attribute is generally of far less importance than the "charset" specification in displaying HTML correctly and will probably rarely be needed. However, if you do specify a "lang" attribute in the DDL file, it will also be written to the SDA dataset’s STUDYINF/studyinf file when MAKESDA is executed.

Here is an example of specifying a "charset" and a "lang" attribute in the global section of a DDL file:

    title = French Canadian Study
    charset = utf-8
    lang = fr-CA

When SDA programs write HTML, the dataset’s "lang" (if any) will be written as an attribute of the main "html" tag. For example:

    <html lang="fr-CA">

A two-character language name like ’fr’ represents the language itself. An optional subfield can be added, to indicate the country in which that language is spoken, in case a browser might know what to do with that information. In the example above "fr-CA" indicates that the language is French, as spoken in Canada.
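To make the relationship between the stored "charset" and "lang" values and the generated HTML concrete, here is a rough sketch in Python. It is not SDA's implementation, and both function names are hypothetical. It simply mirrors the behavior described in this document: an optional "lang" attribute on the "html" tag and a meta tag carrying the charset.

```python
from html import escape


def split_lang(lang: str):
    """Split a value like 'fr-CA' into (language, region); region is None if absent."""
    language, _, region = lang.partition("-")
    return language, (region or None)


def html_open(charset: str, lang: str = "") -> str:
    """Build the opening lines of an HTML page from a charset and an optional lang."""
    html_tag = f'<html lang="{escape(lang)}">' if lang else "<html>"
    meta = (
        '<meta http-equiv="Content-Type" '
        f'content="text/html;charset={escape(charset)}">'
    )
    return f"{html_tag}\n<head>\n{meta}\n</head>"


print(split_lang("fr-CA"))
print(html_open("utf-8", "fr-CA"))
```

When no "lang" value is given, the sketch writes a bare "html" tag, which matches the description that the attribute is written only if the dataset has one.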

For more information on the declaration of the "lang" attribute and the uses of that attribute by browsers (or other user agents), see this W3C document on the "lang" attribute in HTML.


MODIFYING THE USER INTERFACE

The SDA option screens and the output from analysis programs can be changed to any language. There is a separate interface document that explains how to do this.

To ensure that browsers display characters properly, also specify the character encoding, as described above, if the modified interface uses a character set other than ’US-ASCII’.


SEE ALSO

archive Archive Developer’s Guide
interface Modifying the SDA User Interface


CSM, UC Berkeley
February 10, 2010