SDA 4.1 Documentation for DDL

NAME

DDL - Data Description Language (SDA metadata format)

DESCRIPTION

SDA's Data Description Language (DDL) is a metadata format used for describing the characteristics of a dataset. The descriptions are of two types: a description of the study as a whole, and a description of each variable. A DDL file can be created with a text editor or with a converter program like XCONVERT. DDL files can be modified or merged with the DDLMOD program.

Beginning with version 4.0, the DDL file is generally imported and processed by the SDA Manager to produce the SDA dataset. Note that the current version of the SDA Manager can also import SPSS and Stata system files directly into SDA, without the need to create a DDL file at all.

OVERVIEW

The description of the study as a whole includes a title for the dataset, and (optionally) the number of cases in the data file, and the pathname of the directory into which the SDA dataset should be placed (by the MAKESDA program). If the text and labels for variables are written using a special character set, that information should also be included here.

Beginning with SDA version 4.1, the description may also include the filetype of the data file. The default filetype is still a fixed-column text file. However, the data file can now be a CSV file (comma-separated values) or a TSV file (tab-separated values). In a CSV file or a TSV file, the first row must contain the names of the variables, separated by the appropriate delimiters (commas or tabs).

Default values for many of the keywords of individual variables may also be specified in this section. (Examples would be for the minimum valid code, or the default missing-data code.) In that case, the corresponding characteristic will be set to this specified default, unless overridden in the individual variable specification.

The description of each variable MUST include at least the name of the variable. (See the Rules for variable names.)

If the filetype is fixed, the description must also include the beginning column number for the variable. If they are different from the default values, the variable description must also include the record number and the width (number of columns).

Each variable description MAY also include a long label, descriptive text (such as questionnaire wording), and category labels for the code categories. If some of the code values represent invalid response codes, they may be flagged for exclusion from analysis; a minimum and a maximum valid code can also be specified (default values for these specifications can also be set).

For fixed-format input data files only: if the dataset has a case identifier which you want to define as the CASEID variable, that variable definition must be the first one in the DDL file. It is useful to have a CASEID variable when you add or modify variables. If new variables are added to an existing SDA dataset, or if a new version of the data is used to modify existing variables, MAKESDA will compare each value of CASEID in the data file with the value of CASEID for the same case in the SDA dataset. If the values do not match, an error message is generated. Note that the values of CASEID do not have to be unique for each case. The only thing that matters is that the new and old values be the same for each case when MAKESDA is run on a pre-existing SDA dataset.
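
The CASEID comparison described above can be pictured with a short Python sketch. It is illustrative only and is not MAKESDA's code: the function name `first_caseid_mismatch` is hypothetical, and the sketch assumes both sequences list the cases in the same order and have the same length.

```python
def first_caseid_mismatch(existing_ids, incoming_ids):
    """Return the 0-based position of the first case whose CASEID differs,
    or None if every pair of values matches."""
    for position, (old, new) in enumerate(zip(existing_ids, incoming_ids)):
        if old != new:
            return position
    return None
```

Because the comparison is made case by case, repeated CASEID values cause no problem as long as each case carries the same value in both files.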

RULES FOR VARIABLE NAMES

Names of variables in SDA must follow these rules:

No longer than 32 characters
Only US-ASCII letters (lower or upper case), numbers, or underscores (_)
Cannot begin with a number or underscore
Cannot be one of the Microsoft Windows reserved filenames:
CLOCK$, CON, PRN, AUX, NUL, COM1, COM2, COM3, COM4, COM5, COM6, COM7, COM8, COM9, COM0, LPT1, LPT2, LPT3, LPT4, LPT5, LPT6, LPT7, LPT8, LPT9, LPT0
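
As a rough illustration, the four rules above can be checked with a few lines of Python. The function name `is_valid_sda_name` is hypothetical, and the sketch assumes that the reserved-name check ignores case (as Windows itself does); the rules above do not say so explicitly.

```python
import re

# Reserved names copied from the list above (COM0-COM9 and LPT0-LPT9 included).
RESERVED = {"CLOCK$", "CON", "PRN", "AUX", "NUL"} | {
    f"{prefix}{n}" for prefix in ("COM", "LPT") for n in range(10)
}

# A letter first, then up to 31 more letters, digits, or underscores.
NAME_PATTERN = re.compile(r"[A-Za-z][A-Za-z0-9_]{0,31}")


def is_valid_sda_name(name: str) -> bool:
    """Return True if `name` satisfies the naming rules listed above."""
    if NAME_PATTERN.fullmatch(name) is None:
        return False
    # Assumption: reserved names are rejected regardless of letter case.
    return name.upper() not in RESERVED
```

For example, `v75` and `hh_income` pass, while `_id`, `9lives`, and `con` are rejected.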

DDL FILE LAYOUT

The DDL file must be laid out in various parts, separated by an asterisk (*) in column 1. The first part is the description of the study as a whole. Each subsequent part describes one variable. The general layout is as follows:

description of the dataset as a whole
*
description of the CASEID variable (if there is one)
*
description of another variable
*
description of another variable
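
A program that reads this layout has to cut the file into parts at the asterisk lines first. The following Python sketch shows one way to do that; `split_ddl_segments` is a hypothetical helper, not part of SDA, and it simply discards anything that follows the asterisk on the separator line.

```python
def split_ddl_segments(text: str) -> list[list[str]]:
    """Split DDL text into segments at lines that have '*' in column 1.

    The first segment describes the dataset as a whole; each later
    segment describes one variable.
    """
    segments: list[list[str]] = [[]]
    for line in text.splitlines():
        if line.startswith("*"):
            segments.append([])        # a new variable description begins
        else:
            segments[-1].append(line)
    return segments
```

An asterisk that appears anywhere other than column 1 (for instance inside a missing-data range) does not start a new part.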

KEYWORDS USED FOR THE DATA DESCRIPTIONS

The descriptions are given in the form "keyword= something" with the keyword placed at the beginning of a line. Keywords may be either uppercase or lowercase and may be given in any order within each segment. Lines beginning with `#' are treated as comments and are not significant. Blank lines are significant within a block of text, but otherwise they are ignored. The valid keywords are as given below; note that the possible specifications and the defaults are those currently used by the MAKESDA program.
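
The conventions in the paragraph above can be sketched in Python as follows. This is a simplified reading, not the parser used by MAKESDA: `parse_segment` is a hypothetical name, the keyword set is taken from the tables in this document, and the sketch assumes that only a known keyword starting in column 1 begins a new specification.

```python
KEYWORDS = {
    "title", "filetype", "ncases", "path", "charset", "lang",
    "records/case", "reclen", "blank", "blank_c", "other", "case_c",
    "min", "max", "md", "md_c", "sysmdlabel", "type", "decimals",
    "record", "column", "width", "name", "iname", "label",
    "catlabels", "text",
}


def parse_segment(lines: list[str]) -> dict[str, str]:
    """Collect the keyword values found in one segment of a DDL file."""
    values: dict[str, str] = {}
    current = None
    for line in lines:
        if line.startswith("#"):
            continue                          # comment line
        head, sep, rest = line.partition("=")
        key = head.rstrip().lower()           # keywords are case-insensitive
        if sep and key in KEYWORDS:
            current = key
            values[current] = rest.strip()
        elif current in ("catlabels", "text"):
            # Multi-line value: keep every line, blank lines included.
            values[current] += "\n" + line.rstrip()
        # Any other line (such as a blank line between keywords) is ignored.
    return {k: v.strip("\n") for k, v in values.items()}
```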

GENERAL DATASET CHARACTERISTICS


Keyword       Possible Specification          Default (if no keyword)
_____________________________________________________________________


(Descriptions of the data file)


title=        remainder of the line           REQUIRED

filetype=     CSV or TSV                      fixed

ncases=       number of cases                 No checking for a
                                                specific N of cases
path=         directory for new dataset       Current directory

charset=      character set                   UTF-8
               (used to specify an alternate
                encoding of text;
                see below)

lang=         language                        No language enforcing
               (code to pass to browsers
                for display purposes;
                see the internationalization
                document)

(if filetype = fixed)

records/case= number of records per case      1

reclen=       number of characters per record 80



(RESET DEFAULT VALUES for individual variable specifications)


blank=        a number into which an          No default conversion
                all-blank field will be         for blanks
                converted
blank_c=      blank conversion for            No default conversion
                character variables

other=        a number into which a field     No default conversion
                with other non-numeric          for other characters
                characters will be converted
                (numeric type only)

case_c=       upper or lower                  No default
                (default case conversion        case conversion
                for character variables
                in ASCII only)

min=          default minimum valid code      No default min
max=          default maximum valid code      No default max

md=           default missing-data code(s)    No default md
                for numeric variables
md_c=         default MD code(s) for          No default md
                character variables
sysmdlabel=   default label for system        (No Data)
                missing-data value

type=         default variable type:          numeric
                numeric or character
decimals=     default number of implied       0
                decimal places

(if filetype=fixed):

record=       default record number for       1
                location of variables
width=        default number of columns       1
                for each variable

If default values for variable specifications have been set as part of the general dataset characteristics, those defaults (or global values) can be overridden for a particular variable by simply re-specifying the keyword as part of the definition of that variable.

Those default values can be nullified for a particular variable by setting the keyword equal to a blank or by specifying 'noglobal'. For example, `min= ' or `min=noglobal' will nullify the default `min' for the current variable being defined (because a minimum valid value does not need to be defined for that variable).
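
The override and nullify rules can be summarized in a small Python sketch. The function `effective_value` and the use of dictionaries to hold the specifications are illustrative assumptions, not part of SDA.

```python
def effective_value(keyword, dataset_defaults, variable_spec):
    """Resolve one keyword for a variable.

    Returns the variable's own value if it gives one, None if the
    variable nullifies the default (blank or 'noglobal'), and the
    dataset default otherwise.
    """
    if keyword in variable_spec:
        value = variable_spec[keyword].strip()
        if value == "" or value.lower() == "noglobal":
            return None                       # default nullified
        return value                          # variable-level override
    return dataset_defaults.get(keyword)      # fall back to the default
```

With a dataset default of `min=1`, a variable that specifies `min=0` gets 0, a variable that specifies `min=noglobal` gets no minimum at all, and a variable that says nothing about `min` gets 1.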

CHARACTERISTICS OF EACH NUMERIC VARIABLE


Keyword       Possible Specification          Default (if no keyword)
_____________________________________________________________________

name=         a single string of 1-32         REQUIRED (see rules)
                ASCII characters
iname=        name for this item in the       No instrument name
                instrument or questionnaire
decimals=     number of implied decimal       Use dataset default
                places                          (usually 0)
type=         numeric                         numeric, unless another
                                                type has been set as
                                                dataset default

label=        remainder of the line           No long variable label

catlabels=    category labels and text        No category labels
                (see discussion below)

md=           list of invalid codes and/or    No md codes
                 ranges of codes (separated
                 by blanks or commas)
                See discussion below.

min=          minimum valid code              No defined minimum
max=          maximum valid code              No defined maximum

blank=        code into which a field         System missing-data code
                containing only blanks
                will be converted

other=        code into which a field         Unless a non-numeric
                containing non-numeric          character is defined as MD,
                characters will be              non-numeric fields will
                converted                       become system missing-data

sysmdlabel=   label for system missing-data   (No Data)
                value (from a blank input
                field)

text=         descriptive text of any length  No text stored for this
                (until next keyword)            variable


(if filetype=fixed):

record=       number of the record            Use dataset default
                containing this variable        (usually 1)
column=       column location of the          REQUIRED
                left-most character
width=        number of columns used by       Use dataset default
                this variable                   (usually 1)

CHARACTERISTICS OF EACH CHARACTER VARIABLE (if any)


Keyword       Possible Specification          Default (if no keyword)
_____________________________________________________________________

name=         a single string of 1-32         REQUIRED (see rules)
                ASCII characters
iname=        name for this item in the       No instrument name
                instrument or questionnaire

type=         character                       REQUIRED, unless default
                                                type is character

label=        remainder of the line           No long variable label

catlabels=    category values and labels      No category labels
                (see discussion below)

md_c=         list of character codes to be   No missing data codes
                treated as invalid or MD
                (multiples separated by
                blanks or commas; blanks can
                be specified as MD by using
                empty quotes -- "")
                See discussion below.

blank_c=      character field into which an   No conversion of blanks
                all-blank field will be
                converted (including quotes
                if you give them)

case_c=       upper or lower                  Mixed case preserved
                (Convert all the characters
                 into upper or lower case.
                 Note that this only works
                 if the characters are in
                 US-ASCII.)

text=         descriptive text of any length  No text stored for this
                (until next keyword)            variable


(if filetype=fixed):

record=       number of the record            Use dataset default
                containing this variable        (usually 1)
column=       column location of the          REQUIRED
                left-most character
width=        number of columns used by       Use dataset default
                this variable                   (usually 1)

MISSING DATA SPECIFICATIONS -- NUMERIC VARIABLES

In numeric variables, it is possible to flag any number of codes or ranges as invalid and to be excluded from analysis. Multiple codes or ranges are separated by a comma or a blank. An asterisk (*) can be used to indicate all codes above or below a certain value. Some examples are as follows:

MD= 8,9: Codes 8 and 9 are missing-data
MD= 7-9: Codes 7 through 9 are missing-data
MD= 8-*: All numbers 8 and larger are missing-data
MD= *-0: All numbers less than or equal to zero are missing-data
MD= D,R: The characters 'D' and 'R' are missing-data
MD= -1,97-99,D,R: Several missing-data codes
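
To make the range syntax concrete, here is a Python sketch that interprets an `md=` value and tests whether an input value is flagged. The names `parse_md` and `is_missing` are hypothetical, and the handling of negative numbers inside ranges is an assumption; the sketch is not taken from the SDA programs.

```python
import math
import re

NUMBER = r"-?\d+(?:\.\d+)?"
RANGE_RE = re.compile(rf"({NUMBER}|\*)-({NUMBER}|\*)")


def parse_md(spec: str):
    """Split an md= value into numeric (low, high) ranges and other codes."""
    ranges, codes = [], set()
    for token in re.split(r"[,\s]+", spec.strip()):
        if not token:
            continue
        match = RANGE_RE.fullmatch(token)
        if match:
            # '*' stands for "no limit" on that side of the range.
            low = -math.inf if match.group(1) == "*" else float(match.group(1))
            high = math.inf if match.group(2) == "*" else float(match.group(2))
            ranges.append((low, high))
        elif re.fullmatch(NUMBER, token):
            ranges.append((float(token), float(token)))   # a single code
        else:
            codes.add(token)                  # non-numeric code such as D
    return ranges, codes


def is_missing(value: str, ranges, codes) -> bool:
    """Return True if `value` is flagged by the parsed md specification."""
    value = value.strip()
    if value in codes:
        return True
    try:
        number = float(value)
    except ValueError:
        return False
    return any(low <= number <= high for low, high in ranges)
```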

MISSING DATA SPECIFICATIONS -- CHARACTER VARIABLES

Missing data specifications for character variables work the same way as for numeric variables, except that the appropriate keyword for character variables is `md_c'. However, ranges are not allowed. Multiple missing data codes must be separated by a comma or a blank.

Note in particular that embedded blanks or quotes must be enclosed in single or double quotation marks.

Some examples are as follows:

MD_C = DK, REF: Cases with the values `DK' and `REF' are treated as missing. Those two values are separated by a comma and/or a blank.
MD_C = "Refused to answer": Because of the embedded blanks, quotes are necessary to define the whole phrase as MD. Otherwise, the individual words `Refused', `to', and `answer' would be defined as three distinct missing-data values.
MD_C = "Don't know": Notice that the single quote in "Don't know" requires double quotes around the whole character value.
MD_C = 'Don''t know', Refused: Alternatively, a single quote can be repeated inside of a pair of single quotes. In this case the apostrophe or single quote in "Don't know" is repeated.
MD_C = 'Mr. "Jack" Doe': A character value with double quotes must be surrounded with single quotes.
MD_C = "": Empty quotes refer to a blank. This specification means that a completely blank character value should be treated as missing data. Blank character fields are not treated as missing data unless you provide this specification.
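
The quoting rules illustrated by these examples can be expressed as a small tokenizer. The Python sketch below is one possible reading of the rules; `parse_md_c` is a hypothetical name, and representing a blank value as an empty string is an assumption of the sketch.

```python
import re

TOKEN = re.compile(r"""
    "([^"]*)"            # double-quoted value
  | '((?:[^']|'')*)'     # single-quoted value; a doubled quote is one quote
  | ([^\s,'"]+)          # bare value, ended by a blank or a comma
""", re.VERBOSE)


def parse_md_c(spec: str) -> list[str]:
    """Return the list of character values named in an md_c= specification."""
    values = []
    for match in TOKEN.finditer(spec):
        double_quoted, single_quoted, bare = match.groups()
        if double_quoted is not None:
            values.append(double_quoted)
        elif single_quoted is not None:
            values.append(single_quoted.replace("''", "'"))
        else:
            values.append(bare)
    return values
```

Applied to the specification `'Don''t know', Refused`, the sketch yields the two values `Don't know` and `Refused`.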

CATEGORY LABELS OR TEXT

Labels or text assigned to response categories can be of any length. The basic format is to specify pairs of codes and labels, with each pair on a separate line:

For a numeric variable with values of 1 and 2:

     catlabels=
       1 Yes
       2 No

The equivalent for a character variable with values of Y and N is:

     catlabels=
       Y Yes
       N No

The syntax rules for specifying the category codes of character variables are the same as for specifying missing-data codes, as described in the section immediately above. In particular see that section if the category codes for a character variable include blanks or quotation marks.

Long Category Text

If the text corresponding to a category is long, some analysis programs (outside of SDA) will create a shorter category label; this shorter label would be more appropriate for printing the results of an analysis such as crosstabulation.

Depending on the analysis program, the category label might be created by truncating the text to the first 16 or 20 characters. If the label created by truncating the text would be unclear or ambiguous, it is useful to provide your own abbreviated category label. This is done by enclosing the short label in square brackets after the category text. Programs that read the DDL file can then differentiate between the (long) text of a category and the (short) label corresponding to the same category.

     catlabels=
       1 Definitely will vote in the next election     [Definitely vote]
       2 Probably will vote in the next election       [Probably vote]
       3 Probably will not vote in the next election   [Prob not vote]
       4 Definitely will not vote in the next election [Def not vote]

Category text can extend over more than one line, provided that a backslash (`\') is the last character of every line except the last line:

     catlabels=
       1 Definitely will vote\
           in the next election       [Def vote]
       2 Probably will vote\
           in the next election       [Prob vote]
       3 Probably will not\
           vote in the next election  [Prob not vote]
       4 Definitely will not vote\
           in the next election       [Def not vote]
       8 Don't know
       9 Refused
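
A program reading such a block has to join the continued lines and separate the optional bracketed label from the text. The Python sketch below shows one way to do so; `parse_catlabels` is a hypothetical helper, it assumes unquoted category codes, and joining continued lines with a single space is also an assumption.

```python
import re


def parse_catlabels(block: str):
    """Return (code, text, short_label) tuples from the lines that follow
    'catlabels='. short_label is None when no [label] is given."""
    # Step 1: join physical lines that end with a backslash.
    logical, pending = [], ""
    for raw in block.splitlines():
        line = raw.strip()
        if line.endswith("\\"):
            pending += line[:-1].rstrip() + " "
            continue
        logical.append(pending + line)
        pending = ""
    # Step 2: split each logical line into code, text, and optional label.
    entries = []
    for line in logical:
        if not line:
            continue
        code, _, rest = line.partition(" ")
        rest = rest.strip()
        match = re.fullmatch(r"(.*?)\s*\[([^\]]*)\]", rest)
        if match:
            entries.append((code, match.group(1), match.group(2)))
        else:
            entries.append((code, rest, None))
    return entries
```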

OTHER OPTIONS AND CLARIFICATIONS

DECIMAL PLACES FOR NUMERIC VARIABLES

There are two ways to process numeric values with decimals, depending on whether or not the input data fields contain an explicit decimal point.
(Users should note that the treatment of numeric variables with decimal points changed, beginning in Version 2.1 of SDA.)

IMPLIED decimal places in the input data field
If an input data field does NOT have an explicit decimal point, you can specify that the number in the input file should be interpreted as having a certain number of IMPLIED decimal places by using the `decimals=' keyword. The value given after the `decimals=' keyword is the number of IMPLIED decimal places in the input data.
If decimals=2, for example, the input value `1234' would be stored as `12.34'.
(This is the same as in previous versions of SDA.)

EXPLICIT decimal points in the input data field
If a number in the input data field has an EXPLICIT decimal point, that number is stored with the full number of decimal places given in the data, regardless of the value given after the `decimals=' specification for that variable. (This is also true if the data field for a case contains scientific notation.)
If decimals=2, for example, the input value `1.237' will retain all of its decimals and will be stored as `1.237' in versions 2.1 and later of SDA. (In previous versions of SDA that input value would have been rounded to 2 decimal places and would have been stored as `1.24'.)
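
The two cases can be written out in a few lines of Python. The sketch follows the behavior described for version 2.1 and later; the function name `stored_value` is hypothetical, and the use of Python's `Decimal` type says nothing about how SDA stores numbers internally.

```python
from decimal import Decimal


def stored_value(field: str, decimals: int = 0) -> Decimal:
    """Interpret a numeric input field using the decimals= setting."""
    text = field.strip()
    if "." in text or "e" in text.lower():
        # Explicit decimal point or scientific notation: keep the value
        # exactly as given and ignore the decimals= setting.
        return Decimal(text)
    # No explicit decimal point: shift by the implied number of places.
    return Decimal(text).scaleb(-decimals)
```

With `decimals=2`, the field `1234` becomes 12.34, while the field `1.237` stays 1.237.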

NON-NUMERIC INPUT FOR NUMERIC VARIABLES

An input field defined as `numeric' (the default type) can contain a variety of characters, in addition to simple numbers. A numeric input field may contain leading and trailing blanks (which are ignored), and a possible minus or plus sign. It can also contain decimal points or numbers in scientific notation (like 5e02, for 500).

In SDA, blank input fields will be set to the system missing-data value, unless the DDL specification for that variable (or for all variables, globally) includes the `blank=' keyword, to specify what number those fields are to be converted into. (For example, one could specify `blank=-1', to convert all blank numeric input fields to `-1' in the SDA dataset.) This conversion does NOT affect the original ASCII data file.

Non-numeric characters such as 'D' and 'R' are valid for a numeric variable in SDA, and those characters will be stored as such in the dataset, provided that those characters have been defined as missing-data codes. (If those non-numeric characters have not been defined as missing-data codes, they will be treated as invalid codes.)

A period ('.') by itself in a field, or an input field containing other non-numeric characters that have not been defined as missing-data codes, will ordinarily be converted to the system missing-data value in SDA. However, if the DDL specification for that variable (or for all variables, globally) includes the `other=' keyword, the non-numeric fields will be converted by SDA to the value specified after `other='. That value will then be examined like any other input value, to see whether it is a valid value or has been defined as missing-data or out-of-range.
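
The order of the checks described in this section can be sketched in Python as follows. The sketch is a simplified model rather than SDA's implementation: `convert_field` is a hypothetical name, `None` stands in for the system missing-data value, and the later check of the converted value against the md, min, and max specifications is left out.

```python
import re

SYSMIS = None   # stand-in for the system missing-data value

# Optional sign, digits with an optional decimal point, optional exponent.
NUMERIC = re.compile(r"[+-]?(\d+\.?\d*|\.\d+)([eE][+-]?\d+)?")


def convert_field(field, md_codes=(), blank=None, other=None):
    """Convert one numeric input field following the rules above."""
    text = field.strip()                  # leading/trailing blanks ignored
    if text == "":
        return SYSMIS if blank is None else blank
    if text in md_codes:
        return text                       # e.g. 'D' or 'R', stored as such
    if NUMERIC.fullmatch(text):
        return float(text)
    # A lone period or any other non-numeric content.
    return SYSMIS if other is None else other
```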

SPECIAL CHARACTER SETS

Note that the NAMES of variables must currently be written only in US-ASCII. The contents of the DATA FILE and all other specifications (including text fields for character variables) can also be written in UTF-8, which can cover any language. (UTF-8 includes US-ASCII as a subset.) Other character sets can only be used in a DDL file for long variable labels, category labels, and descriptive text (question text) for variables.

For information on using other character sets, see the document on Internationalization.

COPY MODE

A set of variables will often have many attributes in common, including missing-data specifications and category labels. Instead of repeating those attributes for every variable, it is possible to specify that the attributes of some previously defined variable (in the same DDL file) apply to another variable. (Of course the name and column location of a variable cannot be copied; those attributes must be unique for each variable.)

This copy feature is invoked by putting the word `copy' on the asterisked line preceding the variable's specifications. The variable whose attributes can be copied is either the previous variable (if no specific name is given) or some specific variable defined earlier in the same DDL file. The general layout is as follows:

description of v101

* copy
description of v102,
using all variable definitions of the PREVIOUS variable (v101) that are not specifically redefined in this new variable description.

* copy v75
description of v103,
using all variable definitions for v75 that are not specifically redefined in this new variable description (assuming that v75 has already been defined).
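
In effect, copy mode merges the new variable's own keywords over those of the source variable. The Python sketch below models that merge; `resolve_copy` is a hypothetical helper, the specifications are held in plain dictionaries for illustration, and variable names are matched exactly as written.

```python
def resolve_copy(marker, spec, defined):
    """Return the full specification for a variable.

    marker  -- the asterisked line, e.g. '*', '* copy' or '* copy v75'
    spec    -- keywords given explicitly for the new variable
    defined -- variables defined so far, keyed by name, in file order
    """
    parts = marker.lstrip("*").split()
    if not parts or parts[0].lower() != "copy":
        return dict(spec)                     # plain '*': nothing is copied
    # 'copy' alone refers to the previous variable; 'copy NAME' to NAME.
    source = parts[1] if len(parts) > 1 else list(defined)[-1]
    inherited = {
        key: value
        for key, value in defined[source].items()
        if key not in ("name", "column")      # these are never copied
    }
    return {**inherited, **spec}              # explicit keywords win
```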

BACKWARD COMPATIBILITY

Although there are some extensions and changes in DDL syntax since previous releases of the non-Web CSA programs, the programs that read DDL also read the earlier version of DDL.

The following keywords are still recognized and are equivalent to the new keywords shown after the equal sign:

labels = catlabels
lrecl = reclen
noglob = noglobal
scale = decimals

The older missing-data keywords `MD1=mdvalue1' and `MD2=mdvalue2' are also recognized and are equivalent to the new form:

MD= mdvalue1, mdvalue2
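
A reader of older DDL files could apply these equivalences before any other processing. The Python sketch below illustrates the idea; `normalize_legacy` is a hypothetical name, and the sketch assumes that a variable does not give both the old `MD1`/`MD2` keywords and the new `md` keyword.

```python
ALIASES = {"labels": "catlabels", "lrecl": "reclen", "scale": "decimals"}


def normalize_legacy(spec):
    """Rewrite older keywords and values into their current form."""
    result = {}
    legacy_md = []
    for key, value in spec.items():
        key = key.lower()
        value = value.strip()
        if key in ("md1", "md2"):
            legacy_md.append(value)           # merged into md= below
            continue
        if value.lower() == "noglob":
            value = "noglobal"
        result[ALIASES.get(key, key)] = value
    if legacy_md:
        result["md"] = ", ".join(legacy_md)
    return result
```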

EXAMPLE OF DDL

title=    Some Election Study
records/case=2
reclen=   80
path=     /mysda/election

*
name=     CASEID
label=    Case ID of Respondent
record=   1
column=   1
width=    4

*
name=     v75
label=    R's Interest in Campaign
record=   1
column=   11
md=       8,9
catlabels=
          1 Very Interested
          2 Somewhat Interested
          5 Not Interested
          8 Don't know, can't answer [DK]
          9 Refused to answer        [Ref]
text=
   Some people don't pay much attention to political
   campaigns so far this year.  How about you, are you very
   interested, somewhat interested, or not interested at all?

* copy v75
# Copy the category labels and MD definitions from the variable 'v75'
# (Other specifications are redefined for 'v76')
name=     v76
label=    R's Interest in Primary Election Results
column=   12
text=
   How about the results of primary elections.
   How interested in those results are you?
   Are you very interested, somewhat interested,
   or not interested at all?

*
name=     age
label=    Age of respondent
record=   2
column=   20
width=    2
md=       97-*
catlabels =
          97 Age 97 or over
          98 Don't know
          99 Refused

*
name=     region
label=    Character code for each region
record=   2
column=   24
width =   2
type  =   character
md_c  =   X
catlabels=
          NE Northeastern states
          NC North Central states
          S Southern states
          W Western states
          X (Not available)
text =
Region of the country - coded from the state codes

*
name=     weight
label=    Weight variable
record=   2
column=   50
width=    6
decimals= 4
md=       0
text=
Weight variable with 4 implied decimal places.

_____________________________________________________________________

(For a more extended example, see the DDL file for the SDA test data which is distributed with the SDA programs.)