SDA 4.1 Documentation for INTERNATIONALIZATION

NAME

Internationalization - Using non-English languages

DESCRIPTION

SDA allows users to set up SDA datasets and to display results with the SDAWEB interface in practically any language. This document summarizes the issues involved.

This document includes the following topics:

Specifying the 'charset' (character encoding) in the DDL file
Specifying the 'lang' (language) in the DDL file
Specifying labels and question text in the DDL file
Non-English character variables in data files
Modifying the language files for the user interface, analysis output and codebook output
Specifying the font for charts
Importing Stata and SPSS files
Other technical issues

SPECIFYING THE 'CHARSET' (CHARACTER ENCODING) IN THE DDL FILE

If the language you're using is plain 'US-ASCII' (also known as simply 'ASCII'), then you don't have to worry about character encoding issues. However, if you are using a non-English language, then the browser needs to know the encoding you are using to display the characters properly.

For a list of recognized character sets, see the list of IANA Character Sets. Some commonly encountered encodings are: "ISO-8859-1" (Latin alphabet no. 1) and the similar "Windows-1252" (found in some older Windows files). Both were used in the past to encode various European languages. However, UTF-8 is today the preferred encoding for the Web because:

It is part of the Unicode standard.
UTF-8 encodes all languages supported by the Unicode standard -- essentially all languages. Other encodings, such as ISO-8859-1, support only a small subset of languages.
It does not have any byte-order "endian" issues (unlike other Unicode encodings such as UTF-16 and UTF-32).
It is a superset of the original ASCII text encoding and is therefore fully backwards compatible with it. ASCII text is UTF-8 text.
The W3C, the main international standards organization for the World Wide Web, recommends using UTF-8 encoding for all HTML content.

Therefore, today, UTF-8 should always be used as the character encoding for non-English languages.

Warning: do NOT use other Unicode encodings such as UTF-16 or UTF-32. Also, do NOT use so-called "character entities" for non-ASCII characters. These are the HTML codes that start with a '&' and end with a ';'.

If you have documentation in another encoding, there are various tools available to convert it to UTF-8. The Linux, Unix and Mac OS X operating systems all include the "iconv" utility program which converts text from one encoding to another. For example, the following command will convert an "originalfile" encoded in ISO-8859-1 to a "newfile" encoded in UTF-8.

iconv -f ISO-8859-1 -t UTF-8 originalfile > newfile

Various Windows editors also provide encoding conversion capabilities:

In Microsoft Word: open a file, select "Save as ...", then "Plain Text (*.txt)", then "Save". In the "File Conversion" dialog box click "Other encoding" and choose "Unicode (UTF-8)".

The popular freeware program Notepad++ also provides encoding conversion. Open a file, select the "Encoding" menu, then choose "Convert to UTF-8". (The built-in Microsoft Notepad will also save files as UTF-8, but it automatically inserts a BOM at the beginning of the file -- which is not ideal.)
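Where neither "iconv" nor one of these editors is convenient, the same conversion can be scripted. The following is a minimal Python sketch, not part of SDA; the function name "convert_to_utf8" and the file names are illustrative assumptions. It reads and writes raw bytes so that line endings are left untouched, and it writes no BOM.

```python
from pathlib import Path


def convert_to_utf8(src, dst, src_encoding="iso-8859-1"):
    """Re-encode the text file `src` as UTF-8 (without a BOM) and write it to `dst`."""
    # Decode the original bytes using the legacy encoding ...
    text = Path(src).read_bytes().decode(src_encoding)
    # ... and write the same characters back out as UTF-8.
    Path(dst).write_bytes(text.encode("utf-8"))
```

Calling convert_to_utf8("originalfile", "newfile") mirrors the "iconv" command shown above. For a Windows-1252 source file, pass src_encoding="cp1252" instead.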

In SDA 4.1.1 and later, UTF-8 is the default charset, so you do not need to specify the charset in the DDL file if your dataset uses this recommended encoding -- which, again, includes US-ASCII. You only need to explicitly specify a charset in the DDL file if your dataset is in one of the outdated, legacy encodings such as ISO-8859-1. The name of the character encoding used for a dataset is specified in the general section of the DDL file. The name of this character set will then be stored as a permanent part of the SDA dataset (in the STUDYINF/studyinf file) when MAKESDA is executed.

Note that in earlier versions of SDA you were required to specify the charset in the DDL file if you were not using US-ASCII. If you have DDL files that specify the UTF-8 charset explicitly, there is no need to remove that specification. It is just no longer required.

When HTML pages are generated by various SDA programs, the charset information stored with the dataset (if any) will be taken into account so the pages can be displayed correctly in a browser. The charset information will be used to write a meta tag in the head element of an HTML page. If no charset information is stored in the dataset, then the meta tag will specify UTF-8 as the charset by default.

SPECIFYING THE 'LANG' (LANGUAGE) IN THE DDL FILE

In addition to specifying a charset in the global section of the DDL file, you should also specify the dataset's 'lang' attribute. If you specify a lang attribute in the DDL file, it will also be written to the SDA dataset's STUDYINF/studyinf file when MAKESDA is executed.

Here is an example of specifying a charset and a lang attribute in the global section of a DDL file:

  title = French Canadian Study
  charset = utf-8
  lang = fr-CA

When SDA programs write HTML, the dataset's lang will be written as an attribute of the main HTML tag. For example:

   <html lang="fr-CA">

A two-character language code like 'fr' represents the generic language. An optional subfield can be added, to indicate a regional dialect. (Even finer variations -- with longer "lang" codes -- are occasionally found.) In the example above "fr-CA" indicates that the language is French, as spoken in Canada. However, unless you have a compelling reason to distinguish between regional dialects of a given language you should always use the generic language code.

A complete list of the "lang" codes and their corresponding resource bundle file names is available.

SPECIFYING LABELS AND QUESTION TEXT IN THE DDL FILE

The names, labels and question text for variables are all defined in the DDL file. Variable names may contain only ASCII characters, but the variable labels, category labels and question text may be in any language. However, as noted above, you must specify the character encoding (if it's not UTF-8) using the 'CHARSET=' keyword and the language using the 'LANG=' keyword in the general section of the DDL file.

After the DDL file has been used to create the SDA dataset (by using the SDAMANAGER or by using the MAKESDA program directly), all displays of SDA results will use that language.

NON-ENGLISH CHARACTER VARIABLES IN DATA FILES

Before SDA 4.1.1 any raw data file that was used to create an SDA dataset had to be encoded in plain US-ASCII. SDA 4.1.1 and later can now handle character variables in other languages as long as they are encoded in UTF-8.

Note that raw data files in non-English languages should be delimited data files (CSV or TSV), not fixed-format files. The width of a variable in a fixed-format data file must be the number of bytes the variable occupies in the data file. In US-ASCII encoding each character occupies one byte -- so this is not a problem. However, UTF-8 is a variable-length encoding where a single character can occupy anywhere from one to four bytes. This makes it extremely difficult to use UTF-8 encoded data as fixed-format data.
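The gap between character count and byte count can be seen directly. The short Python sketch below is only an illustration (the helper name "utf8_width" is hypothetical and not part of SDA); it reports how many characters a string contains and how many bytes it occupies once encoded as UTF-8.

```python
def utf8_width(text):
    """Return (number of characters, number of bytes when encoded as UTF-8)."""
    return len(text), len(text.encode("utf-8"))


# Plain ASCII: one byte per character.
print(utf8_width("cafe"))                # (4, 4)
# 'e' with an acute accent (U+00E9) takes two bytes.
print(utf8_width("caf\u00e9"))           # (4, 5)
# Three Japanese characters take three bytes each.
print(utf8_width("\u65e5\u672c\u8a9e"))  # (3, 9)
# A character outside the Basic Multilingual Plane takes four bytes.
print(utf8_width("\U0001F600"))          # (1, 4)
```

Two values that look equally wide on screen can therefore occupy different numbers of bytes, which is why a fixed byte width per variable does not work reliably for UTF-8 data.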

MODIFYING THE LANGUAGE FILES FOR THE USER INTERFACE, ANALYSIS OUTPUT AND CODEBOOK OUTPUT

The user interface for SDAWEB, analysis output and codebooks can be changed to another language by modifying some or all of the default English character strings with alternate wording. (You can even modify the English wording if desired.)

The strings to be modified are contained in a number of different files, corresponding to the displays generated by the main SDAWEB interface for selecting options and by the various analysis and codebook programs. The following sections describe how to obtain copies of the files with the default language strings, how to modify them, and where to put the modified files.

OBTAINING COPIES OF THE DEFAULT LANGUAGE FILES

There are three separate language files that can be modified.

User Interface
The user interface screens are those used to specify which procedure to run, which variables to analyze, and which options to use. The language strings for the interface screens can be found in the following subdirectory of your Tomcat application:
[tomcat-directory]/webapps/sdaweb/WEB-INF/classes
Within that subdirectory, the default (English) strings are in the file 'sdaweblang.properties'. Make a copy of that file for purposes of modification.

Analysis Output
The analysis output is produced by the various analysis programs (like the TABLES program or the MEANS program) to display the results of the analysis. A copy of those default language strings can be obtained by using the '-t' option with the TABLES program. The following command will put a copy of those strings into the file 'lang-analysis.txt':
```
   tables -t lang-analysis.txt 
```
The TABLES program is located in the directory in which the SDA programs have been installed.

Codebook Output
The XCODEBK program has its own set of language strings. A copy of those default language strings can be obtained by using the '-t' option with the XCODEBK program. The following command will put a copy of those strings into the file 'lang-codebook.txt':
```
   xcodebk -t lang-codebook.txt 
```
The XCODEBK program is located in the directory in which the SDA programs have been installed.

Once you have a copy of the language files, you can proceed to modify them.

MODIFYING THE LANGUAGE FILES

All of the language files have the same format. There is a keyword, then an equal sign, then the string used by the SDA programs. When entering foreign language strings on the right side of the equal sign, you will sometimes need to enter characters that are not included in the set of simple 'US-ASCII' characters.

Although it is possible to enter these special characters using an English keyboard and special ALT-codes, it is probably easiest in most situations to invest in a language-appropriate keyboard so these special characters can be typed directly.

Also, if you use Microsoft Word (or similar word-processing software) be sure to save the file as a ".txt" file instead of a ".doc" or ".docx" file; ".doc" and ".docx" files contain hidden formatting that will interfere with the processing of the language files.

The analysis and codebook language files should be saved as UTF-8 files. However, the language file for the user interface is a Java "resource bundle" file, which must conform to the special requirements of these files. Java resource bundles must be encoded using ISO-8859-1 (Western European). Or, if the language cannot be encoded in ISO-8859-1, then Unicode escape codes (such as '\u62b5') must be used. Fortunately, the Java JDK provides a "native2ascii" tool which can be used to convert "any character encoding that is supported by the Java runtime environment to files encoded in ASCII, using Unicode escapes for all characters that are not part of the ASCII character set." See this Oracle documentation for more information on how to use the "native2ascii" tool. Note that although the input language file for the user interface is not UTF-8, the resulting output HTML for the user interface is UTF-8.
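To make the escape format concrete, here is a minimal Python sketch of the kind of conversion described above. It is not the "native2ascii" tool and does not claim to reproduce all of its behavior; the function name "to_unicode_escapes" is a hypothetical helper. ASCII characters pass through unchanged, and every other character becomes a backslash, a 'u', and four hexadecimal digits (characters beyond U+FFFF become a pair of such escapes, one per UTF-16 surrogate).

```python
def to_unicode_escapes(text):
    """Replace each non-ASCII character with a \\uXXXX escape."""
    pieces = []
    for ch in text:
        code = ord(ch)
        if code < 128:
            # Plain ASCII is kept as-is.
            pieces.append(ch)
        elif code <= 0xFFFF:
            pieces.append("\\u%04x" % code)
        else:
            # Characters beyond U+FFFF are written as a UTF-16 surrogate pair.
            code -= 0x10000
            high = 0xD800 + (code >> 10)
            low = 0xDC00 + (code & 0x3FF)
            pieces.append("\\u%04x\\u%04x" % (high, low))
    return "".join(pieces)


print(to_unicode_escapes("caf\u00e9"))  # caf\u00e9
print(to_unicode_escapes("\u62b5"))     # \u62b5
```

The result contains only ASCII characters, so it can be saved in a properties file regardless of which encoding the Java runtime uses to read it.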

Update on Java "resource bundle" files and JDK 9: In Java 9 and later, properties files are loaded in UTF-8 encoding, rather than ISO-8859-1. This is a big improvement for anyone who has to use resource bundles. For more information see the Oracle document Internationalization Enhancements in JDK 9.
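Before moving a modified file into place, it can be useful to confirm that its keywords still match those in the default file, since a mistyped keyword is easy to miss by eye. The Python sketch below is an illustration under stated assumptions, not an SDA utility: it assumes one "keyword = string" entry per line, treats everything left of the first equal sign as the keyword, and skips lines starting with '#' or '!' (the comment markers of Java properties files; whether the analysis and codebook files use comments is not stated here). The function names are hypothetical.

```python
def read_keywords(path):
    """Collect the keywords (text left of the first '=') from a language file."""
    keywords = set()
    # errors="replace" lets the same sketch read an ISO-8859-1 properties file;
    # only the keywords on the left-hand side are inspected.
    with open(path, encoding="utf-8", errors="replace") as handle:
        for line in handle:
            stripped = line.strip()
            if not stripped or stripped[0] in "#!":
                continue  # blank line or properties-style comment (assumption)
            if "=" in stripped:
                keywords.add(stripped.split("=", 1)[0].strip())
    return keywords


def compare_keywords(default_path, modified_path):
    """Return (keywords only in the default file, keywords only in the modified file)."""
    default_keys = read_keywords(default_path)
    modified_keys = read_keywords(modified_path)
    return sorted(default_keys - modified_keys), sorted(modified_keys - default_keys)
```

A keyword that appears only in the modified file is a likely typo. A keyword that appears only in the default file has simply not been given alternate wording.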

After you have modified the strings, you can proceed to put the modified files into the appropriate locations.

WHERE TO PUT THE MODIFIED LANGUAGE FILES

The location of each language file depends on the program for which it is intended.

The modified user interface strings must be located in the same subdirectory of the Tomcat application as the default strings -- namely:
[tomcat-directory]/webapps/sdaweb/WEB-INF/classes
Within that subdirectory, the default (English) strings are in the file 'sdaweblang.properties'. Put the modified user interface strings in a file with the appropriate name corresponding to the "lang" code. For example, French strings would be in the file 'sdaweblang_fr.properties'.
A complete list of the "lang" codes and their corresponding resource bundle file names is available.
The language files used by the SDA analysis programs and the codebook program can be named anything and can be put anywhere on the server computer. You must specify the full pathnames of those files within the SDA manager. These pathnames are part of the global configuration used by specified datasets.
The programs will use the default strings unless you have given a pathname for the analysis language file and/or the codebook language file.
If you run the analysis programs or the XCODEBK program in batch mode, the pathname of the language file is given after the 'LANGuagefile=' keyword in the batch file.

SPECIFYING THE FONT FOR CHARTS

Normally it is the browser that is responsible for selecting the correct font for displaying text, using whatever information is available in the HTML code (or response header from the server) as a guide. However, charts present a special problem because the chart inserted into the HTML output is just an image -- a picture -- and the browser has no control over selecting the font that is used in the chart's headings, labels, etc. Instead, the font is selected when the chart image is created by the SDAWEB application on the server.

By default, the SDAWEB application uses the generic Java "SansSerif" font when displaying text. (This "SansSerif" font is mapped to a particular physical font on the server on a system-dependent basis.) In many instances this default font will work fine. However, there may be cases where a specific font is required to display a given language. This font setting is done in the SDA Manager under "Custom chart font" in the "Global Specifications" section. Remember that the font specified must actually be present on the server machine that's running the Java JVM. And the server must be configured so that the font is available to Tomcat.

IMPORTING STATA AND SPSS FILES

Since SDA 4.1 the SDAMANAGER has been able to import Stata .dta files and SPSS .sav files. Both Stata and SPSS are now UTF-8 based. Below are some notes on each.

Stata 14 and later versions use UTF-8 to encode all strings, including variable labels and category labels. Therefore, the SDA import procedure simply processes and outputs all strings from Stata 14 and later as UTF-8.
Stata 13 and earlier used extended ASCII instead of UTF-8. The SDA import procedure does not attempt to support all extended ASCII encodings; it supports only the most popular one, ISO 8859-1 (also known as ISO Latin 1), which covers the most common Western European languages. Even so, the SDA import procedure still outputs the metadata (DDL) in UTF-8 encoding.
It should also be noted that recent versions of Stata provide a "unicode translate" command which translates Stata 13 and earlier .dta files from any extended ASCII encoding into UTF-8. If you have legacy .dta files in an extended ASCII encoding that is not ISO 8859-1, then you should use Stata's "unicode translate" procedure to convert these files to UTF-8 before importing them into SDA.
SPSS has supported UTF-8 since version 16. Since version 21 it operates in "Unicode mode" (i.e., UTF-8) by default. Although it is possible to specify an alternative "code page mode", SPSS files should be processed in the default "Unicode mode" if they will be imported into SDA. The SDA import procedure assumes SPSS .sav files are encoded in UTF-8 and it outputs metadata (DDL) in UTF-8.

OTHER TECHNICAL ISSUES

There are a couple of other technical issues concerning character encoding that should be kept in mind.

The Byte-Order Mark (BOM): Some Unicode-capable text editors (such as Windows' Notepad) will automatically output a "BOM" (Byte-Order Mark) at the beginning of any file saved in a Unicode format. However, byte-order is irrelevant for UTF-8 encoded files and it is recommended that you use a text editor (such as the popular Notepad++) that can output UTF-8 files without a BOM. For more information see the section on the BOM in the Unicode FAQ. If a DDL file contains a BOM in the initial bytes of the file, then MAKESDA (and other SDA programs that process DDL files) will ignore the BOM during processing. However, if MAKESDA is run from the command line, the program will display a message informing the user that a BOM was encountered and ignored.
Apache Web Server's Default Charset: Although most Web pages for SDA version 4 are served up by Tomcat, in some SDA configurations codebook pages may still be served by Apache or IIS. If you are using Apache to display codebook pages -- and are experiencing problems with displaying non-ASCII characters -- then you should be aware of the following issue. In 2.x versions of the Apache configuration file (httpd.conf), there is often an "AddDefaultCharset" directive that is turned on by default. The charset specified by this directive will be added to the server's response header that accompanies every HTML page and will override any charset setting in the meta tag of an HTML file, making all SDA charset specifications inoperative. Therefore, this directive should usually be commented out in the Apache httpd.conf file so that SDA charset specifications will be effective. For more information see the section on the AddDefaultCharset directive in the online Apache Manual.
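For the Byte-Order Mark issue described above, the BOM in a UTF-8 file is the three-byte sequence EF BB BF at the very start of the file. The following Python sketch shows one way to detect and drop it before a file is passed on to other tools; the function names are hypothetical and this is not how MAKESDA itself is implemented.

```python
import codecs


def has_utf8_bom(data):
    """Return True if the byte string starts with the UTF-8 BOM (EF BB BF)."""
    return data.startswith(codecs.BOM_UTF8)


def strip_utf8_bom(data):
    """Return the byte string without a leading UTF-8 BOM, if one is present."""
    if has_utf8_bom(data):
        return data[len(codecs.BOM_UTF8):]
    return data
```

When reading text rather than raw bytes, Python's "utf-8-sig" codec has the same effect: open(path, encoding="utf-8-sig") silently skips a leading BOM and otherwise behaves like UTF-8.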