SDA 4.1 Documentation for MAKESDA

NAME

makesda - generate SDA variables out of DDL and a data file

USAGE

makesda [-option] -l DDLfile -d datafile

DESCRIPTION

MAKESDA creates or modifies an SDA dataset. Prior to SDA version 4.0, the MAKESDA program had to be run as a command-line program. Starting with version 4.0, the SDA Manager generally handles the creation of SDA datasets (by itself running MAKESDA). However, MAKESDA can still be used directly by the archivist to create or modify SDA datasets, provided that the location of the resulting dataset is communicated to the SDA database by updating the configuration information for the dataset in the SDA Manager.

OVERVIEW

MAKESDA reads a data file (specified after '-d') and a DDL file (specified after '-l' -- a lower-case L) which contains the metadata describing the content of the data file. It then stores the defined variables in a special format in the SDA dataset.

An SDA dataset consists of a 'study directory' and two subdirectories named 'VARS' and 'STUDYINF'.

The 'VARS' subdirectory contains one file for each SDA variable. Each variable file contains a binary version of the values of that variable for each case. It also contains the metadata for that variable, as specified in the DDL file.
The 'STUDYINF' subdirectory contains study-level information for that dataset. There is always a file named 'studyinf' which contains the title for the study. There may also be information used for searching or for imposing certain disclosure rules.
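
The layout described above can be checked with a short script before a dataset location is registered. The following Python sketch is illustrative only and is not part of SDA. The function name looks_like_sda_dataset is hypothetical, and the check covers only the directory and file names mentioned in this document.

```python
from pathlib import Path


def looks_like_sda_dataset(study_dir):
    """Return True if study_dir has the layout described in this document.

    Checks only for the 'VARS' and 'STUDYINF' subdirectories and the
    'studyinf' file; it does not inspect the variable files themselves.
    """
    root = Path(study_dir)
    return (
        (root / "VARS").is_dir()
        and (root / "STUDYINF").is_dir()
        and (root / "STUDYINF" / "studyinf").is_file()
    )
```

A True result means only that the expected names are present, not that the dataset is usable by SDA.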

Depending on the type of input data file being used, MAKESDA treats any existing variables in an SDA dataset differently.

If the input data file is a TSV or CSV file (where the variables are delimited by tabs or commas) then MAKESDA always deletes any previously existing variables in the SDA dataset before creating all the variables listed in the DDL file.

If the input data file is a fixed-format file (where each variable is in a fixed set of columns) then, by default, MAKESDA will add any new variables listed in the DDL file but not modify or delete any existing variables. To modify existing variables or delete all existing variables before creating new ones, the -m (modify) or -z (zap) options can be used. See the discussion below.
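
The rules above can be summarized as a small decision function. This Python sketch is only a model of the documented behavior, not MAKESDA's actual implementation. The function name plan_variables and the format labels "fixed", "csv", and "tsv" are hypothetical. It also assumes that in modify mode, existing variables not listed in the DDL file are left in place.

```python
def plan_variables(existing, ddl_vars, data_format, option=None):
    """Model which variables MAKESDA would write and which it would keep.

    existing    -- names already in the SDA dataset
    ddl_vars    -- names defined in the DDL file
    data_format -- "fixed", "csv", or "tsv" (labels chosen for this sketch)
    option      -- None, "-m", or "-z"; ignored for CSV/TSV input

    Returns (written, kept_unchanged) as two sorted lists.
    """
    existing, ddl_vars = set(existing), set(ddl_vars)
    if data_format in ("csv", "tsv") or option == "-z":
        # Zap mode: all old variables are deleted, all DDL variables created.
        return sorted(ddl_vars), []
    if option == "-m":
        # Modify mode: DDL variables are written even if they already exist.
        return sorted(ddl_vars), sorted(existing - ddl_vars)
    # Default for fixed-format input: only new variables are added.
    return sorted(ddl_vars - existing), sorted(existing)
```

For example, with existing variables AGE and SEX and a DDL file defining SEX and INCOME, the default fixed-format run writes only INCOME, while the same run with "-m" also rewrites SEX.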

A list of the variables defined in the DDL file is written to the file 'MAKESDA.LST' whenever MAKESDA is run. If that file already exists (from a previous MAKESDA run), it is overwritten. This list of variables can be very useful for creating a variable list file for the XCODEBK program. Note that variable names longer than 32 characters cannot currently be used by the MAKESDA program to create variables in SDA.

MEANING OF THE OPTIONS

The meaning of the options is listed below. Note that the -m (modify) and -z (zap) options are only relevant when using a fixed-format data file. TSV and CSV data files always use 'zap' mode automatically when creating an SDA dataset.

-m: For fixed-format files only: modify existing variables in an SDA dataset. Without this option, only new variables defined in the DDL file will be created, and pre-existing variables will NOT be overwritten. If you use this option, and if you have defined a CASEID variable, the content of the CASEID variable for each case in the existing SDA dataset must match the value in the data file specified after the '-d' flag.
-z: For fixed-format files only: delete (zap) all existing variables in the specified SDA dataset before creating new ones. This option removes all of the SDA variables in the 'VARS' subdirectory of the SDA dataset, including the CASEID variable. If you are adding cases to an SDA dataset, you must use this option, which in effect removes the previous SDA dataset to make room for the new one. However, this option does not remove the contents of the 'STUDYINF' subdirectory, such as the 'SEARCH' directory used by the SDA search procedures or the 'disclosure.txt' file, if they exist. Note that the 'studyinf' file (located in the 'STUDYINF' subdirectory) is overwritten every time 'makesda' is run.
-c: Check the syntax of the DDL file, but do not create any SDA variables. It is not necessary to specify the name of a datafile on the command line if this option is requested.
-x filename: Write an expanded version of the DDL file to the file named 'filename'. If the DDL file has been created with 'copy' commands (to avoid repeating identical specifications for many variables), this expansion procedure will eliminate those commands and produce a full data description for each variable. Also, keywords that have been set globally (in the first segment of a DDL file) are repeated in each variable definition.
-h: Display short program help and available options. (The program will not do anything else.)
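
When MAKESDA is driven from a script, the command line can be assembled from the options described above. The following Python sketch is a hedged illustration, not an official wrapper. The function name build_makesda_command is hypothetical, and the sketch covers only the -m, -z, and -c options, passing at most one of them, as the USAGE line suggests.

```python
def build_makesda_command(ddl_file, data_file=None, option=None):
    """Build an argument list for MAKESDA from the documented options.

    option -- None, "-m", "-z", or "-c" (other options are not handled here)
    """
    if option not in (None, "-m", "-z", "-c"):
        raise ValueError(f"unsupported option for this sketch: {option!r}")
    if option != "-c" and data_file is None:
        # Only the syntax check (-c) can run without a data file.
        raise ValueError("a data file is required unless option is '-c'")
    cmd = ["makesda"]
    if option is not None:
        cmd.append(option)
    cmd += ["-l", ddl_file]
    if data_file is not None:
        cmd += ["-d", data_file]
    return cmd
```

The returned list can be passed to Python's subprocess.run, assuming the 'makesda' executable is on the search path.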

INPUT FILES

The data file used as input to MAKESDA must be a plain text file (not a binary file). The data file may be formatted as a CSV file (comma-separated values), a TSV file (tab-separated values), or a fixed-column ASCII data file. If the format is "fixed," each variable must be in a fixed set of columns, and the file must have a fixed number of records for each case. If a record is shorter than the number of characters defined by the 'reclen=' or 'lrecl=' keyword, it is padded at the end with blanks.
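
The padding of short fixed-format records can be pictured with the following Python sketch. It is an illustration under stated assumptions, not MAKESDA's code: the function name pad_record is hypothetical, and records longer than the declared length are simply returned unchanged, because this document does not say how MAKESDA treats them.

```python
def pad_record(line, reclen):
    """Pad a fixed-format record with trailing blanks up to reclen characters."""
    # Drop the line terminator first so it is not counted as a data column.
    record = line.rstrip("\r\n")
    # ljust never truncates, so longer records pass through unchanged.
    return record.ljust(reclen)
```

With a record length of 8, the record "12345" becomes "12345" followed by three blanks, so a variable defined in columns 6-8 reads as an all-blank field.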

The data description (metadata) file must be written in SDA's Data Description Language (DDL). The DDL file can be created with a text editor or with a converter program like XCONVERT. MAKESDA can also read older DDL files in the format used by the CSA programs.

The DDL file must describe the characteristics of both the overall data file and the individual variables to be converted into SDA variables. If the optional CASEID variable is defined, the first variable description MUST be for that variable.

A note on using a fixed-format data file: if variables are added to an existing SDA dataset, MAKESDA checks the contents of the CASEID variable (if one exists) to make sure that the CASEID value for each case matches the value stored previously in the SDA dataset. It also checks the contents of CASEID if variables are being modified. If you anticipate adding or modifying variables, it is a good idea to have a CASEID variable, to enable this checking.
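
The CASEID check described above amounts to a case-by-case comparison. The Python sketch below shows the idea only. The function name first_caseid_mismatch is hypothetical, and the sketch assumes that cases are compared in file order and that a different number of cases counts as a mismatch. How MAKESDA itself reports a mismatch is not covered here.

```python
def first_caseid_mismatch(stored_ids, incoming_ids):
    """Return the 0-based index of the first case whose CASEID differs, or None.

    A difference in the number of cases counts as a mismatch at the first
    position where one list has run out.
    """
    for i, (old, new) in enumerate(zip(stored_ids, incoming_ids)):
        if old != new:
            return i
    if len(stored_ids) != len(incoming_ids):
        return min(len(stored_ids), len(incoming_ids))
    return None
```

A result of None means every case lines up, which is the condition needed before variables are added or modified.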

CHARACTER VARIABLES

Beginning with version 2.0, SDA expanded the treatment of character variables. This section provides important information on how MAKESDA reads character data from the input data file and stores the data as character variables.

Blanks in a character field

When MAKESDA processes character variable values, spaces are automatically "normalized" before being stored as SDA variables. This means:

Leading and trailing blanks are NOT considered significant for character values.
Multiple INTERNAL spaces are replaced with a single space.

For example " New York " is stored as "New York".
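
The normalization rule can be expressed in a few lines. This Python sketch is an illustration of the rule as stated above, not MAKESDA's own code. The function name normalize_blanks is hypothetical, and the sketch treats only the space character as a blank.

```python
def normalize_blanks(value):
    """Strip leading/trailing spaces and collapse internal runs of spaces."""
    # Splitting on a single space yields empty strings for repeated spaces;
    # dropping them and rejoining leaves exactly one space between words.
    return " ".join(part for part in value.split(" ") if part)
```

Under this rule an all-blank field reduces to the empty string, which is the case handled in the next subsection.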

All-blank fields

There are various things you can do with an input field that is completely blank:

Leave it as a valid code:
An input field that is completely blank can be stored as such and can be used as a filter variable by specifying the content as two quotation marks with nothing in between: ""
Define it as missing-data:
An all-blank field can be defined as a missing-data field by using the following DDL specification:
md_c = ""
Convert it to other characters:
An all-blank input field can be converted to some other character value before being stored in an SDA variable file by using the following DDL specification:
blank_c = New Content
If the "New Content" you specify has more characters than are defined in the 'width=' specification for this variable, the full "New Content" is still stored in the SDA dataset.

Forcing Upper or Lower Case

By default, the case of a character input field is left as is, and it is stored in SDA as a case-sensitive character variable.

However, in the DDL file specifications for a character variable, you can specify that the input string be converted entirely to upper case or to lower case. This is done by specifying either 'case_c=upper' or 'case_c=lower' for a particular variable (or globally for all character variables defined in that DDL file).

Be aware that this conversion only works if the character input field is written in US-ASCII. If the field contains non-ASCII characters encoded in UTF-8 (which is legitimate), the conversion will not be carried out.

Unless the case of a character variable really matters, it is often a good idea to force the characters to be all the same case. For example, if you have a character variable for gender, and if the contents are 'M' or 'm' for male, and 'F' or 'f' for female, you would probably want to make all of the values either upper or lower case. Otherwise, when you use that variable in a table, you will get four rows or columns for gender instead of two.
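
The effect on the number of categories can be seen in a small Python sketch. It is illustrative only: the function name category_counts is hypothetical, and the sample codes are ASCII only, since Python's own case conversion handles non-ASCII characters differently from the behavior described above.

```python
from collections import Counter


def category_counts(values, case=None):
    """Count distinct codes, optionally forcing upper or lower case first.

    case -- None, "upper", or "lower" (mirroring the case_c= values)
    """
    if case == "upper":
        values = [v.upper() for v in values]
    elif case == "lower":
        values = [v.lower() for v in values]
    return Counter(values)


codes = ["M", "m", "F", "f", "F"]
print(len(category_counts(codes)))           # 4 distinct categories
print(len(category_counts(codes, "upper")))  # 2 distinct categories
```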

If you use the 'case_c=' specification, there are some ramifications:

Missing data definitions:
The case conversion is applied to the character code defined as a missing-data code.
For example, if you specify that the input string should be converted to upper case, then the following specifications all have the same meaning:
Category labels:
The case conversion is applied to the character code for which a label is defined.
For example, if you specify that the input string should be converted to upper case, then the following specifications all have the same meaning:

Note, however, that the case conversion applies only to the category code -- and NOT to the category label. In the above example, the label "Don't know" remains in mixed upper and lower case, regardless of what happens to the category code itself.

Selection filter variables:

Character variables can be used as selection filters in the same way that numeric variables are used. Note, however, that the values of a selection filter variable are NOT case sensitive. Also, leading and trailing blanks are stripped from character codes specified as filter variables, and multiple internal blanks are reduced to a single blank. This is the same as happens to character values before they are stored as SDA variables, so the filter values should match the stored character values unless there is a substantive difference in the codes.
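
The matching rule for character filter values can be sketched as follows. This Python code is an assumption-based illustration, not SDA's implementation. The function names are hypothetical, only the space character is treated as a blank, and case-insensitive comparison is shown for ASCII values only.

```python
def _normalize(value):
    """Strip leading/trailing spaces and collapse internal runs of spaces."""
    return " ".join(part for part in value.split(" ") if part)


def filter_matches(stored_value, filter_code):
    """Compare a stored value and a filter code the way the text describes:
    blank-normalized on both sides, then compared without regard to case."""
    return _normalize(stored_value).lower() == _normalize(filter_code).lower()
```

Under these assumptions a stored value of "New York" is selected by a filter code written with extra blanks or in a different case, but not by a substantively different code such as "Newark".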

For example, the following filter specifications all have the same effect, regardless of whether the values of the character variable 'state' have been forced to upper case or to lower case, or have been left as mixed case:

DIAGNOSTIC MESSAGES

Diagnostic and error messages are appended to the file 'MAKESDA.MSG'. Progress messages showing the number of variables processed are displayed on the screen.

EXAMPLES

makesda -c -l myddl: Check the DDL file named 'myddl'
makesda -l myddl -d mydata: Create an SDA dataset out of the files 'myddl' and 'mydata', but do NOT modify any existing SDA variables in the study specified in the 'path=' keyword in the top section of the DDL file 'myddl'.
makesda -m -l myddl -d mydata: Create an SDA dataset out of the files 'myddl' and 'mydata', and MODIFY any existing SDA variables that are included in 'myddl'.