SDA 4.1 Documentation for MAKESDA
NAME
makesda - generate SDA variables out of DDL and a data file
USAGE
makesda [-option] -l DDLfile -d datafile
DESCRIPTION
MAKESDA creates or modifies an SDA dataset. Prior to SDA version
4.0, the MAKESDA program had to be run as a command-line program.
Starting with version 4.0, the SDA Manager generally handles the
creation of SDA datasets (by itself running MAKESDA).
However, MAKESDA can still be used directly by the archivist to
create or modify SDA datasets, provided that the location of the
resulting dataset is communicated to the SDA database by updating
the configuration information for the dataset in the SDA Manager.
OVERVIEW
MAKESDA reads a data file (specified after '-d') and a DDL file
(specified after '-l', a lower-case L), which contains the
metadata describing the content of the data file. It then stores
the defined variables in a special format in the SDA dataset.
An SDA dataset consists of a 'study directory' and two
subdirectories named 'VARS' and 'STUDYINF'.
- The 'VARS' subdirectory contains one file for each SDA
variable. Each variable file contains a binary version of the
values of that variable for each case. It also contains the
metadata for that variable, as specified in the DDL file.
- The 'STUDYINF' subdirectory contains study-level information
for that dataset. There is always a file named 'studyinf' which
contains the title for the study. There may also be information
used for searching or for imposing certain disclosure rules.
MAKESDA can create an entirely new dataset, or add new variables
to an existing dataset, or modify (overwrite) existing variables.
A list of the variables defined in the DDL file is written to
the file 'MAKESDA.LST' whenever MAKESDA is run. If that file
already exists (from a previous MAKESDA run), it is overwritten.
This list of variables can be very useful for creating a
variable list file for the XCODEBK program.
Note that variable names longer than 32 characters cannot
currently be used by the MAKESDA program to create variables in
SDA.
MEANING OF THE OPTIONS
The meaning of the options is as follows:
- -c
- Check the syntax of the DDL file, but do not create any SDA
variables. It is not necessary to specify the name of a datafile
on the command line if this option is requested.
- -m
- Modify existing variables in an SDA dataset. Without this
option, only new variables defined in the DDL file will be
created, and pre-existing variables will NOT be overwritten. If
you use this option, and if you have defined a CASEID variable,
the content of the CASEID variable for each case in the existing
SDA dataset must match the value in the data file specified after
the '-d' flag.
- -z
- Remove (zap) the specified SDA dataset before creating a new
one. This option will remove all of the SDA variables in the
'VARS' subdirectory of the SDA dataset, including the CASEID
variable. If you are adding cases to an SDA dataset, you must
use this option, because the previous SDA dataset has to be
removed to make room for the new one. This option does not,
however, remove the contents of the 'STUDYINF' subdirectory, such
as the 'SEARCH' directory used by the SDA search procedures or
the 'disclosure.txt' file, if they exist. Note that the
'studyinf' file (located in the 'STUDYINF' subdirectory) is
overwritten every time the program 'makesda' is run.
- -x filename
- Write an expanded version of the DDL file to the file
named `filename'. If the DDL file has been created with `copy'
commands (to avoid repeating identical specifications for many
variables), this expansion procedure will eliminate those
commands and produce a full data description for each variable.
Also, keywords that have been set globally (in the first segment
of a DDL file) are repeated in each variable definition.
- -h
- Display short program help and available options. (The
program will not do anything else.)
INPUT FILES
The data file used as input to MAKESDA must be
a plain text file (not a binary file). The data file may be
formatted as a CSV file (comma-separated values) or a TSV file
(tab-separated values) or a fixed-column ASCII data file. If the
format is "fixed," each variable must be in a fixed set of
columns, and the file must have a fixed number of records for
each case. If a record is shorter than the number of
characters defined by the `reclen=' or `lrecl=' keyword, it is
padded at the end with blanks.
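The padding rule for fixed-column records can be pictured with a
short Python sketch. This is not MAKESDA's own code: the function
names `pad_record' and `field', the sample record, and the record
length of 10 are all assumptions made for illustration.

```python
def pad_record(line: str, reclen: int) -> str:
    """Pad a record with trailing blanks up to reclen characters."""
    # Drop the line terminator first so it is not counted as data.
    record = line.rstrip("\r\n")
    # ljust() leaves records that are already long enough unchanged.
    return record.ljust(reclen)


def field(record: str, start_col: int, width: int) -> str:
    """Extract a field given a 1-based starting column and a width."""
    return record[start_col - 1 : start_col - 1 + width]


# Hypothetical 6-character record in a file declared with reclen=10.
padded = pad_record("0001NY\n", 10)
print(repr(padded))
print(repr(field(padded, 5, 2)))   # a field inside the original data
print(repr(field(padded, 7, 4)))   # a field that falls in the padding
```

A field that lies entirely in the padded area comes back as all
blanks, so it is handled like any other all-blank input field.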
The data description file must be written in the Data Description
Language (DDL). The DDL file can be created with a text editor or
with a converter program like XCONVERT.
MAKESDA can also read older DDL files in the format used by the
CSA programs.
The DDL file must describe the characteristics of both the
overall data file and the individual variables to be
converted into SDA variables. If there is a CASEID variable, the
first variable description MUST be for that variable.
If variables are added to an existing SDA dataset, MAKESDA checks
the contents of the CASEID variable (if one exists) to make sure
that the CASEID value for each case matches the value stored
previously in the SDA dataset. It also checks the contents of
CASEID if variables are being modified. If you anticipate adding
or modifying variables, it is a good idea to have a CASEID
variable, to enable this checking.
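The idea behind the CASEID check can be sketched in Python as a
case-by-case comparison. This document does not describe how
MAKESDA performs the check or how it reports a mismatch, so the
function name `check_caseids', the in-order comparison, and the
error raised on differing case counts are all assumptions.

```python
def check_caseids(stored, incoming):
    """Compare CASEID values case by case (illustrative only).

    Returns a list of (case_number, stored_id, incoming_id) tuples
    for every case whose CASEID differs. Case numbers start at 1.
    """
    # Assumption: both sources must hold the same number of cases.
    if len(stored) != len(incoming):
        raise ValueError(
            f"case count differs: {len(stored)} stored, "
            f"{len(incoming)} in data file"
        )
    return [
        (case_no, old, new)
        for case_no, (old, new) in enumerate(zip(stored, incoming), start=1)
        if old != new
    ]


# Hypothetical CASEID values: the second case does not match.
mismatches = check_caseids(["1001", "1002", "1003"],
                           ["1001", "1009", "1003"])
print(mismatches)
```

An empty result means every case in the data file lines up with
the case already stored in the dataset.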
CHARACTER VARIABLES
Beginning with version 2.0, SDA expanded the treatment of
character variables. This section provides important information
on how MAKESDA reads character data from the input data file and
stores the data as character variables.
Blanks in a character field
When MAKESDA processes character variable values, spaces are
automatically "normalized" before being stored as SDA variables.
This means:
- Leading and trailing blanks are NOT considered significant
for character values
- Multiple INTERNAL spaces are replaced with a single space.
For example "   New     York   " is stored as "New York".
All-blank fields
There are various things you can do with an input field that is
completely blank:
- Leave it as a valid code:
An input field that is completely blank can be stored as such and
can be used as a filter variable by specifying the content as two
quotation marks with nothing in between: ""
- Define it as missing-data:
An all-blank field can be defined as a missing-data field by
using the following DDL specification:
md_c = ""
- Convert it to other characters
An all-blank input field can be converted to some other character
value before being stored in an SDA variable file
by using the following DDL specification:
blank_c = New Content
If the "New Content" you specify has more characters than are
defined in the 'width=' specification for this variable, the full
"New Content" string is still stored in the SDA dataset.
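The handling of blanks described in this section can be
summarized in a small Python sketch. It is an illustration under
stated assumptions, not MAKESDA's implementation: the function
names `normalize_blanks' and `stored_value' are hypothetical, and
the `blank_c' parameter merely stands in for the DDL keyword of
the same name.

```python
import re


def normalize_blanks(value: str) -> str:
    """Strip leading/trailing blanks and collapse internal runs."""
    return re.sub(r" {2,}", " ", value.strip(" "))


def stored_value(raw_field: str, blank_c=None) -> str:
    """Return the character value that would be stored (sketch)."""
    value = normalize_blanks(raw_field)
    if value == "" and blank_c is not None:
        # The replacement is stored whole, even when it is longer
        # than the width of the input field.
        return blank_c
    return value


width = 2                      # hypothetical 'width=' of the field
all_blank = " " * width
print(repr(stored_value("  New   York ")))
print(repr(stored_value(all_blank)))
print(repr(stored_value(all_blank, blank_c="New Content")))
```

Without a replacement, an all-blank field is stored as an empty
string, which is the value matched by "" in a filter or in an
`md_c' specification.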
Forcing Upper or Lower Case
By default, the case of a character input field is left as is,
and it is stored in SDA as a case-sensitive character variable.
However, in the DDL file specifications for a character variable,
you can specify that the input string be converted entirely to
upper case or to lower case. This is done by specifying either
'case_c=upper' or 'case_c=lower' for a particular variable (or
this can also be done globally for all character variables
defined in that DDL file).
Be aware that this conversion only works if the character input
field is written in US-ASCII. If the field contains non-ASCII
characters in UTF-8 (which is legitimate), the conversion will
not be carried out.
Unless the case of a character variable really matters, it is
often a good idea to force the characters to be all the same
case. For example, if you have a character variable for gender,
and if the contents are 'M' or 'm' for male, and 'F' or 'f' for
female, you would probably want to make all of the values either
upper or lower case. Otherwise, when you use that variable in a
table, you will get four rows or columns for gender instead of
two.
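The effect on a table can be seen by counting categories before
and after forcing the case. The sample values below are invented
for illustration. Note also that Python's `str.upper()' converts
non-ASCII letters as well, whereas the conversion described above
applies only to US-ASCII input.

```python
from collections import Counter

# Hypothetical gender codes as they might appear in a data file.
values = ["M", "m", "F", "f", "F", "m"]

# Case left as is: each spelling becomes its own category.
as_is = Counter(values)

# Case forced to upper: the spellings collapse into two categories.
forced = Counter(v.upper() for v in values)

print(len(as_is), sorted(as_is))
print(len(forced), sorted(forced))
```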
If you use the 'case_c=' specification, there are some
ramifications:
- Missing data definitions:
The case conversion is applied to the character code defined as a
missing-data code.
For example, if you specify that the input string should be
converted to upper case, then the following specifications all
have the same meaning:
md_c= REFUSED
md_c= Refused
md_c= refused
- Category labels:
The case conversion is applied to the character code for which a
label is defined.
For example, if you specify that the input string should be
converted to upper case, then the following specifications all
have the same meaning:
catlabels=
DK Don't know
dk Don't know
Dk Don't know
Note, however, that the case conversion applies only to the
category code -- and NOT to the category label. In the above
example, the label "Don't know" remains in mixed upper and lower
case, regardless of what happens to the category code itself.
Selection filter variables:
Character variables can be used as selection filters in the same
way that numeric variables are used. Note, however, that the
values of a selection filter variable are NOT case sensitive.
Also, leading and trailing blanks are stripped from character
codes specified as filter variables, and multiple internal blanks
are reduced to a single blank. This is the same as happens to
character values before they are stored as SDA variables, so the
filter values should match the stored character values unless
there is a substantive difference in the codes.
For example, the following filter specifications all have the
same effect, regardless of whether the values of the character
variable 'state' have been forced to upper case or to lower case,
or have been left as mixed case:
state("New York")
state(" NEW YORK ")
state("New    York")
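The matching rule for character filter values can be sketched in
Python as "normalize blanks, then compare without regard to
case". The function names `normalize' and `filter_matches' are
hypothetical, and the use of `lower()' for the case-insensitive
comparison is an assumption. The SDA programs may implement the
comparison differently.

```python
import re


def normalize(value: str) -> str:
    """Strip leading/trailing blanks and collapse internal runs."""
    return re.sub(r" {2,}", " ", value.strip(" "))


def filter_matches(filter_value: str, stored_value: str) -> bool:
    """True if a filter code selects a stored character value."""
    # Assumption: case-insensitive matching via lower-casing.
    return normalize(filter_value).lower() == normalize(stored_value).lower()


filters = ["New York", "  NEW YORK  ", "New    York"]
stored_forms = ["New York", "NEW YORK", "new york"]

# Every filter spelling selects every stored spelling.
print(all(filter_matches(f, s) for f in filters for s in stored_forms))
print(filter_matches("Newark", "New York"))
```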
DIAGNOSTIC MESSAGES
Diagnostic and error messages are appended to the file
`MAKESDA.MSG'. Messages about progress in the number of
variables processed are displayed on the screen.
EXAMPLES
- makesda -c -l myddl
- Check the DDL file named 'myddl'
- makesda -l myddl -d mydata
- Create an SDA dataset out of the files 'myddl' and 'mydata',
but do NOT modify any existing SDA variables in the study
specified in the 'path=' keyword in the top section of the DDL
file 'myddl'.
- makesda -m -l myddl -d mydata
- Create an SDA dataset out of the files 'myddl' and 'mydata',
and MODIFY any existing SDA variables that are included in
'myddl'.
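When MAKESDA is driven from a script, the command lines above
can be assembled programmatically. The following Python sketch
only builds the argument list and does not run the program. The
function name `build_makesda_command' and its parameters are
hypothetical. It uses only the '-c', '-m', '-z', '-l', and '-d'
flags documented above, and it does not verify whether several
options may be combined, since this document does not say.

```python
import shlex


def build_makesda_command(ddl_file, data_file=None, *,
                          check=False, modify=False, zap=False):
    """Build a makesda argument list (illustrative sketch)."""
    cmd = ["makesda"]
    if check:
        cmd.append("-c")
    if modify:
        cmd.append("-m")
    if zap:
        cmd.append("-z")
    cmd += ["-l", ddl_file]
    if data_file is not None:
        cmd += ["-d", data_file]
    elif not check:
        # Only the syntax check is documented as not needing a data file.
        raise ValueError("a data file is required unless check=True")
    return cmd


print(shlex.join(build_makesda_command("myddl", check=True)))
print(shlex.join(build_makesda_command("myddl", "mydata")))
print(shlex.join(build_makesda_command("myddl", "mydata", modify=True)))
```

The resulting list can be passed to `subprocess.run' on a system
where `makesda' is installed and on the search path.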
SEE ALSO
DDL
Summary of the Data Description Language
CSM, UC Berkeley/ISA
September 25, 2020