If you are not using ASCII text, then the name of the character encoding used for a dataset must be specified using the 'CHARSET=' keyword in the general section of the DDL file. The name of this character set will then be stored as a permanent part of the SDA dataset (in the STUDYINF/studyinf file) when MAKESDA is executed.
For a list of recognized character sets, see the list of IANA Character Sets. Some commonly encountered encodings are 'ISO-8859-1' (Latin alphabet no. 1) and the similar 'Windows-1252' (found in some older Windows files); both were used in the past to encode various European languages. Today, however, UTF-8 is the preferred encoding for the Web, and UTF-8 should always be used as the character encoding for non-English languages.
Warning: do NOT use other Unicode encodings such as UTF-16 or UTF-32. Also, do NOT use so-called "character entities" for non-ASCII characters. (These are the HTML codes that start with a '&' and end with a ';'.) If you have documentation in another encoding, there are various tools available to convert it to UTF-8. The Linux, Unix and Mac OS X operating systems all include the "iconv" utility program, which converts text from one encoding to another. For example, the following command will convert an "originalfile" encoded in ISO-8859-1 to a "newfile" encoded in UTF-8:

iconv -f ISO-8859-1 -t UTF-8 originalfile > newfile

Various Windows editors also provide encoding conversion capabilities:
In Microsoft Word: open a file, select "Save as ...", then "Plain Text (*.txt)", then "Save". In the "File Conversion" dialog box click "Other encoding" and choose "Unicode (UTF-8)".
The popular freeware program Notepad++ also provides encoding conversion. Open a file, select the "Encoding" menu, then choose "Convert to UTF-8". (The built-in Microsoft Notepad will also save files as UTF-8, but it automatically inserts a BOM at the beginning of the file -- which is not ideal.)
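If a file does end up with a BOM at its start, the mark can be stripped programmatically. Here is a minimal sketch in Python (the function name is my own; it is not part of SDA):

```python
def strip_utf8_bom(path):
    """Remove a UTF-8 byte-order mark (EF BB BF) from the start of a file, if present."""
    with open(path, "rb") as f:
        data = f.read()
    if data.startswith(b"\xef\xbb\xbf"):
        with open(path, "wb") as f:
            f.write(data[3:])
```

The function is a no-op on files that have no BOM, so it is safe to run on any UTF-8 language file before using it with SDA.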
When HTML pages are generated by various SDA programs, the charset information stored with the dataset will be taken into account so the pages can be displayed correctly in a browser. Usually the charset information will be used to write a meta tag in the head element of an HTML page. For example:
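One common form of such a tag, assuming a dataset whose charset is UTF-8, is shown below (the exact markup that SDA emits may differ; older pages often use the longer 'http-equiv' form instead):

```html
<meta charset="utf-8">
```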
Here is an example of specifying a charset and a lang attribute in the global section of a DDL file:

title = French Canadian Study
charset = utf-8
lang = fr-CA

When SDA programs write HTML, the dataset's lang will be written as an attribute of the main HTML tag.
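For the fr-CA dataset above, the generated page would begin with a tag along these lines (the exact markup may vary by SDA version):

```html
<html lang="fr-CA">
```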
A two-character language code like 'fr' represents the generic language. An optional subfield can be added to indicate a regional dialect. (Even finer variations -- with longer "lang" codes -- are occasionally found.) In the example above, "fr-CA" indicates that the language is French as spoken in Canada. However, unless you have a compelling reason to distinguish between regional dialects of a given language, you should always use the generic language code.
A complete list of the "lang" codes and their corresponding resource bundle file names can be found here.
After the DDL file has been used to create the SDA dataset (by using the SDAMANAGER or by using the MAKESDA program directly), all displays of SDA results will use that language.
Note that the raw data file must be encoded in plain US-ASCII. Any other encoding will not work.

The strings to be modified are contained in a number of different files, corresponding to the displays generated by the main SDAWEB interface for selecting options and by the various analysis and codebook programs. The following sections describe how to obtain copies of the files with the default language strings, how to modify them, and where to put the modified files.
The user interface screens are those used to specify which procedure to run, which variables to analyze, and which options to use. The language strings for the interface screens can be found in the following subdirectory of your Tomcat application:
[tomcat-directory]/webapps/sdaweb/WEB-INF/classes
Within that subdirectory, the default (English) strings are in the file 'sdaweblang.properties'. Make a copy of that file for purposes of modification.
The analysis output is produced by the various analysis programs (like the TABLES program or the MEANS program) to display the results of the analysis.
A copy of those default language strings can be obtained by downloading the 'lang-analysis.txt' file.
A copy of these strings can also be obtained by using the ’-t’ option with the TABLES program. The following command will put a copy of those strings into the file ’filename2.txt’:
tables -t filename2.txt

The TABLES program is located in the directory in which the SDA programs have been installed.
The XCODEBK program has its own set of language strings.
A copy of those default language strings can be obtained by downloading the ’lang-codebk.txt’ file.
A copy of those strings can also be obtained by using the ’-t’ option with the XCODEBK program. The following command will put a copy of those strings into the file ’filename3.txt’:
xcodebk -t filename3.txt

However, be aware that the codebook strings obtained directly from the XCODEBK program (using the '-t' option) will include many extra strings that were formerly used to document questionnaires. Those extra strings should be ignored or deleted.
The XCODEBK program is located in the directory in which the SDA programs have been installed.
Here are a few such strings used for the output from analysis programs:
ROWVAR = Row
COLVAR = Column
WGT = Weight
FLT = Filter
Here are those same strings converted to Portuguese:
ROWVAR = Var. de linha
COLVAR = Var. de coluna
WGT = Peso
FLT = Var. de Seleção

The first three strings are simple to enter. The fourth one, however, includes characters that are not in the simple 'US-ASCII' character set: the Portuguese words in the 'FLT' string include a 'c' with a cedilla and an 'a' with a tilde over it. Although it is possible to enter these special characters using an English keyboard and special ALT codes, it is probably easiest in most situations to invest in a language-appropriate keyboard so these special characters can be typed directly. (These keyboards can often be purchased for $30 or less.)

Also, if you use Microsoft Word (or similar word-processing software), be sure to save the file as a ".txt" file instead of a ".doc" or ".docx" file; ".doc" and ".docx" files contain hidden formatting that will interfere with the processing of the language files.

The analysis and codebook language files should be saved as UTF-8 files. However, the language file for the user interface is a Java "resource bundle" file, which must conform to the special requirements of those files: Java resource bundles must be encoded using ISO-8859-1 (Western European). If the language cannot be encoded in ISO-8859-1, then Unicode escape codes (such as '\u62b5') must be used. Fortunately, the Java JDK provides a "native2ascii" tool which can be used to convert "any character encoding that is supported by the Java runtime environment to files encoded in ASCII, using Unicode escapes for all characters that are not part of the ASCII character set." See the Oracle documentation for more information on how to use the "native2ascii" tool. Note that although the input language file for the user interface is not UTF-8, the resulting output HTML for the user interface is UTF-8.
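The escaping that native2ascii performs can be illustrated with a short Python sketch (this is only an illustration of the idea, not a replacement for the JDK tool; characters above U+FFFF, which need surrogate pairs in Java, are not handled here):

```python
# Illustration of native2ascii-style escaping: each non-ASCII character
# is replaced by a \uXXXX Unicode escape, leaving a pure-ASCII line that
# a Java resource bundle can read.
def to_ascii_escapes(text):
    return "".join(c if ord(c) < 128 else "\\u%04x" % ord(c) for c in text)

print(to_ascii_escapes("FLT = Var. de Seleção"))
# FLT = Var. de Sele\u00e7\u00e3o
```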
After you have modified the strings, you can proceed to put the modified files into the appropriate locations.
The modified user interface strings must be located in the same subdirectory of the Tomcat application as the default strings -- namely:
[tomcat-directory]/webapps/sdaweb/WEB-INF/classes
Within that subdirectory, the default (English) strings are in the file ’sdaweblang.properties’. Put the modified user interface strings in a file with the appropriate name corresponding to the "lang" code. For example, French strings would be in the file ’sdaweblang_fr.properties’.
A complete list of the "lang" codes and their corresponding resource bundle file names can be found here.
The language files used by the SDA analysis programs and the codebook program can be named anything and can be put anywhere on the server computer. You must specify the full pathnames of those files within the SDA manager. These pathnames are part of the global configuration used by specified datasets.
The programs will use the default strings unless you have given a pathname for the analysis language file and/or the codebook language file.
If you run the analysis programs or the XCODEBK program in batch mode, the pathname of the language file is given after the ’LANGuagefile=’ keyword in the batch file.
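For example, a batch file might contain a line like the following (the pathname is hypothetical; only the 'LANGuagefile=' keyword itself comes from SDA, and the capitalized portion shows the minimum abbreviation):

```
languagefile = /usr/local/sda/lang/lang-analysis-pt.txt
```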
By default, the SDAWEB application uses the generic Java "SansSerif" font when displaying text. (This "SansSerif" font is mapped to a particular physical font on the server on a system-dependent basis.) In many instances this default font will work fine. However, there may be cases where a specific font is required to display a given language. This font setting is done in the SDA Manager under "Custom chart font" in the "Global Specifications" section. Remember that the font specified must actually be present on the server machine that’s running the Java JVM. And the server must be configured so that the font is available to Tomcat.
Languages that are not compatible with the Latin character set at all -- Asian ideographs, Georgian script, etc. -- cannot be used in search terms (although they will still display correctly in search results). These language limitations in SDA searching will likely be removed in a future version of SDA. However, it is important to be aware of these issues if you have datasets that are not in English.
DDL: Data Description Language