SDA 4.0 Documentation for MAKESDA

NAME

makesda - generate SDA variables from a DDL file and a data file

USAGE

makesda [-option] -l DDLfile -d datafile

DESCRIPTION

MAKESDA creates or modifies an SDA dataset. Prior to SDA version 4.0, the MAKESDA program had to be run as a command-line program. Starting with version 4.0, the SDA Manager generally handles the creation of SDA datasets (by itself running MAKESDA). However, MAKESDA can still be used directly by the archivist to create or modify SDA datasets, provided that the location of the resulting dataset is communicated to the SDA database by updating the configuration information for the dataset in the SDA Manager.

OVERVIEW

MAKESDA reads a plain text data file (specified after '-d') and a DDL file (specified after '-l' -- a lower-case L) which contains the metadata describing the content of the data file. It then stores the defined variables in a special format in the SDA dataset.

An SDA dataset consists of a 'study directory' and two subdirectories named 'VARS' and 'STUDYINF'.

The 'VARS' subdirectory contains one file for each SDA variable. Each variable file contains a binary version of the values of that variable for each case. It also contains the metadata for that variable, as specified in the DDL file.
The 'STUDYINF' subdirectory contains study-level information for that dataset. There is always a file named 'studyinf' which contains the title for the study. There may also be information used for searching or for imposing certain disclosure rules.

MAKESDA can create an entirely new dataset, or add new variables to an existing dataset, or modify (overwrite) existing variables.

A list of the variables defined in the DDL file is written to the file 'MAKESDA.LST' whenever MAKESDA is run. If that file already exists (from a previous MAKESDA run), it is overwritten. This list can be very useful for creating a variable list file for the XCODEBK program. Note that MAKESDA currently cannot create SDA variables whose names are longer than 32 characters.

MEANING OF THE OPTIONS

The meaning of the options is as follows:

-c: Check the syntax of the DDL file, but do not create any SDA variables. It is not necessary to specify the name of a datafile on the command line if this option is requested.
-m: Modify existing variables in an SDA dataset. Without this option, only new variables defined in the DDL file will be created, and pre-existing variables will NOT be overwritten. If you use this option, the content of the CASEID variable for each case in the existing SDA dataset must match the value in the data file specified after the '-d' flag. This 'modify' option, therefore, cannot be used if you are adding new cases to an existing SDA dataset.
-z: Remove (zap) the specified SDA dataset before creating a new one. This option removes all of the SDA variables in the 'VARS' subdirectory of the SDA dataset, including the CASEID variable. If you are adding cases to an SDA dataset, you must use this option. The option does not remove the contents of the 'STUDYINF' subdirectory, such as the 'SEARCH' directory used by the SDA search procedures or the 'disclosure.txt' file, if they exist. Note, however, that the 'studyinf' file (located in the 'STUDYINF' subdirectory) is overwritten every time the program 'makesda' is run.
-x filename: Write an expanded version of the DDL file to the file named 'filename'. If the DDL file has been created with 'copy' commands (to avoid repeating identical specifications for many variables), this expansion procedure will eliminate those commands and produce a full data description for each variable. Also, keywords that have been set globally (in the first segment of a DDL file) are repeated in each variable definition.
-h: Display short program help and available options. (The program will not do anything else.)

INPUT FILES

The data file used as input to MAKESDA must be a plain text file with a fixed number of records for each case. If a record is shorter than the number of characters defined by the 'reclen=' or 'lrecl=' keyword, it is padded at the end with blanks.
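
The padding rule can be illustrated with a short Python sketch. This is not MAKESDA's own code; the function names 'pad_record' and 'field' are hypothetical, and the use of 1-based column positions is an assumption made only for the illustration.

```python
def pad_record(line: str, reclen: int) -> str:
    """Right-pad a record with blanks up to reclen characters.

    Assumption: only the short-record case described in the text is
    modelled; a record longer than reclen is returned unchanged.
    """
    record = line.rstrip("\r\n")  # drop the line terminator first
    return record.ljust(reclen)


def field(record: str, start: int, width: int) -> str:
    """Extract a fixed-width field (hypothetical 1-based start column)."""
    return record[start - 1:start - 1 + width]


padded = pad_record("0001NY\n", 10)
trailing = field(padded, 7, 4)  # lies entirely in the padded area: all blanks
```

Because of the padding, a field that lies beyond the end of a short record reads as an all-blank field rather than causing an error in this sketch.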

The data description file must be written in the Data Description Language (DDL). The DDL file can be created with a text editor or with various converter programs. MAKESDA can also read older DDL files in the format used by the CSA programs.

The DDL file must describe the characteristics of both the overall data file and the individual variables to be converted into SDA variables. The first variable description MUST be for a variable named 'CASEID'.

If variables are added to an existing SDA dataset, MAKESDA checks the contents of the CASEID variable to make sure that the CASEID value for each case matches the value stored previously in the SDA dataset. It also checks the contents of CASEID if variables are being modified.
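
The effect of this check can be sketched in Python. The exact comparison MAKESDA performs is not documented here, so the function below (with the hypothetical name 'first_caseid_mismatch') only assumes a case-by-case comparison in file order.

```python
from itertools import zip_longest


def first_caseid_mismatch(stored, incoming):
    """Return the 0-based position of the first case whose CASEID differs.

    Returns None when both sequences agree case by case. A difference in
    the number of cases also counts as a mismatch, because zip_longest
    fills the shorter sequence with None.
    """
    for index, (old, new) in enumerate(zip_longest(stored, incoming)):
        if old != new:
            return index
    return None


same = first_caseid_mismatch([1, 2, 3], [1, 2, 3])          # no mismatch
extra_case = first_caseid_mismatch([1, 2, 3], [1, 2, 3, 4])  # new case added
```

In this sketch, a data file that contains additional cases always produces a mismatch, which is consistent with the requirement to rebuild the dataset with the '-z' option when cases are added.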

CHARACTER VARIABLES

Beginning with version 2.0, SDA expanded the treatment of character variables. This section provides important information on how MAKESDA reads character data from the input data file and stores the data as character variables.

Blanks in a character field

When MAKESDA processes character variable values, spaces are automatically "normalized" before being stored as SDA variables. This means:

Leading and trailing blanks are NOT considered significant for character values, and are stripped.
Multiple INTERNAL spaces are replaced with a single space.

For example, "  New   York  " is stored as "New York".
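
The two rules can be expressed as a minimal Python sketch. The function name 'normalize_blanks' is hypothetical, and the sketch assumes that only the space character counts as a blank.

```python
def normalize_blanks(value: str) -> str:
    """Strip leading/trailing blanks and collapse runs of internal blanks."""
    # Splitting on a single space yields empty strings for every extra
    # blank; dropping them and re-joining leaves exactly one space
    # between the remaining words.
    return " ".join(part for part in value.split(" ") if part)


city = normalize_blanks("  New   York  ")
empty = normalize_blanks("    ")  # an all-blank field becomes ""
```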

All-blank fields

There are various things you can do with an input field that is completely blank:

Leave it as a valid code:
An input field that is completely blank can be stored as such. It can then be selected in a filter by specifying the value as two quotation marks with nothing in between: ""
Define it as missing-data:
An all-blank field can be defined as a missing-data field by using the following DDL specification:
md_c = ""
Convert it to other characters:
An all-blank input field can be converted to some other character value before being stored in an SDA variable file by using the following DDL specification:
blank_c = New Content
If the "New Content" you specify has more characters than are defined in the 'width=' specification for this variable, the full "New Content" string is still stored in the SDA dataset; it is not truncated to the defined width.
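
The blank-conversion behavior can be sketched as follows. This is an illustration under assumptions, not MAKESDA's implementation: the function name 'store_char_field' is hypothetical, and its 'blank_c' parameter merely stands in for the DDL keyword of the same name.

```python
def store_char_field(raw, blank_c=None):
    """Return the value that would be stored for one character field.

    raw     -- the field as read from the data file (already 'width' long)
    blank_c -- replacement for an all-blank field, or None to keep it blank
    """
    # Normalize blanks: strip the ends and collapse internal runs.
    value = " ".join(part for part in raw.split(" ") if part)
    if value == "" and blank_c is not None:
        # The replacement is stored in full, even if it is longer than
        # the width of the input field.
        return blank_c
    return value


replaced = store_char_field("    ", blank_c="Not stated")  # width 4, 10 stored
kept_blank = store_char_field("    ")
```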

Forcing Upper or Lower Case

By default, the case of a character input field is left as is, and it is stored in SDA as a case-sensitive character variable.

However, in the DDL file specifications for a character variable, you can specify that the input string be converted entirely to upper case or to lower case. This is done by specifying either 'case_c=upper' or 'case_c=lower' for a particular variable (this can also be set globally for all character variables defined in that DDL file).

Unless the case of a character variable really matters, it is often a good idea to force the characters to be all the same case. For example, if you have a character variable for gender, and if the contents are 'M' or 'm' for male, and 'F' or 'f' for female, you would probably want to make all of the values either upper or lower case. Otherwise, when you use that variable in a table, you will get four rows or columns for gender instead of two.
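
The gender example can be reproduced with a small Python sketch that counts the distinct categories before and after case conversion. The function 'apply_case' is hypothetical; its 'case_c' argument stands in for the DDL keyword and accepts only the two documented settings.

```python
from collections import Counter


def apply_case(value, case_c=None):
    """Convert a value according to a case_c setting ('upper' or 'lower')."""
    if case_c == "upper":
        return value.upper()
    if case_c == "lower":
        return value.lower()
    return value  # default: leave the case as is


values = ["M", "m", "F", "f", "F"]
as_is = Counter(apply_case(v) for v in values)            # 4 categories
forced = Counter(apply_case(v, "upper") for v in values)  # 2 categories
```

With the default setting the five cases fall into four categories ('M', 'm', 'F', 'f'); with upper case forced they fall into two.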

If you use the 'case_c=' specification, there are some ramifications:

Missing data definitions:
The case conversion is applied to the character code defined as a missing-data code.
For example, if you specify that the input string should be converted to upper case, then the following specifications all have the same meaning:
Category labels:
The case conversion is applied to the character code for which a label is defined.
For example, if you specify that the input string should be converted to upper case, then the following specifications all have the same meaning:

Note, however, that the case conversion applies only to the category code -- and NOT to the category label. In the above example, the label "Don't know" remains in mixed upper and lower case, regardless of what happens to the category code itself.

Selection filter variables:

Character variables can be used as selection filters in the same way that numeric variables are used. Note, however, that the values of a selection filter variable are NOT case sensitive. Also, leading and trailing blanks are stripped from character codes specified as filter values, and multiple internal blanks are reduced to a single blank. This is the same normalization that is applied to character values before they are stored as SDA variables, so the filter values should match the stored character values unless there is a substantive difference in the codes.
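
The matching rule described above can be sketched in Python. The sketch compares plain code values only and makes no assumption about the filter syntax itself; the names 'normalize_blanks' and 'filter_matches' are hypothetical.

```python
def normalize_blanks(value: str) -> str:
    """Strip leading/trailing blanks and collapse internal runs of blanks."""
    return " ".join(part for part in value.split(" ") if part)


def filter_matches(stored: str, filter_code: str) -> bool:
    """Compare a stored character value with a filter code.

    Both sides are blank-normalized and compared without regard to case.
    """
    return (normalize_blanks(stored).casefold()
            == normalize_blanks(filter_code).casefold())


hit = filter_matches("New York", "  new   york ")
miss = filter_matches("CA", "NY")
```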

For example, the following filter specifications all have the same effect, regardless of whether the values of the character variable 'state' have been forced to upper case or to lower case, or have been left as mixed case:

DIAGNOSTIC MESSAGES

Diagnostic and error messages are appended to the file 'MAKESDA.MSG'. Progress messages showing the number of variables processed are displayed on the screen.

EXAMPLES

makesda -c -l myddl: Check the DDL file named 'myddl'
makesda -l myddl -d mydata: Create an SDA dataset out of the files 'myddl' and 'mydata', but do NOT modify any existing SDA variables in the study specified in the 'path=' keyword in the top section of the DDL file 'myddl'.
makesda -m -l myddl -d mydata: Create an SDA dataset out of the files 'myddl' and 'mydata', and MODIFY any existing SDA variables that are included in 'myddl'.
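
When MAKESDA is run from a script, the command lines above can be assembled programmatically. The following Python sketch uses a hypothetical helper, 'makesda_command', and only the flags documented in this file; it builds the argument list but does not run the program.

```python
import shlex


def makesda_command(ddl_file, data_file=None, option=None):
    """Build the argument list for one makesda run.

    option is one of the documented single-letter flags (for example
    '-c' or '-m'), or None for a plain run. The data file may be omitted,
    as is allowed with '-c'.
    """
    args = ["makesda"]
    if option is not None:
        args.append(option)
    args += ["-l", ddl_file]
    if data_file is not None:
        args += ["-d", data_file]
    return args


check_cmd = shlex.join(makesda_command("myddl", option="-c"))
modify_cmd = shlex.join(makesda_command("myddl", "mydata", option="-m"))
```

The resulting list can be passed to a process-launching call such as 'subprocess.run'; 'shlex.join' is used here only to show the equivalent command line.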