SDA 3.5 Documentation for DDL


NAME

DDL - Data Description Language

DESCRIPTION

Data Description Language (DDL) is used for describing the characteristics of a dataset. The descriptions are of two types: a description of the study as a whole, and a description of each variable. The file with DDL can be created with a text editor, or with the XCONVERT or Q4TODDL programs. DDL files can be modified or merged with the DDLMOD program.


OVERVIEW

The description of the study as a whole includes a title for the dataset, the number of records per case in the data file, the length of each record (number of characters), and, optionally, the number of cases in the data file and the pathname of the directory into which the SDA dataset should be placed (by the MAKESDA program). If the text and labels for variables are written using a special character set, that information should also be included here.

Default values for many of the keywords of individual variables may also be specified in this section (for example, the width of the field, the minimum valid code, or the default missing-data code). Each such characteristic is then set to the specified default, unless overridden in the individual variable specification.

The description of each variable MUST include its name and its location in the data file (beginning column number). The description must also include the following specifications, IF they are different from the default values: the width (number of columns), the record number, and the number of implied decimal places (if there are implied decimal places in the input field).
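
As a rough illustration of how the location specifications combine, the following Python sketch pulls the raw field for one variable out of the records that make up a single case. The function name 'extract_field' and its arguments are hypothetical. The sketch assumes that columns and records are counted from 1 and that a record shorter than the field location reads as blanks; it is not a description of how MAKESDA is implemented.

```python
def extract_field(case_records, column, width=1, record=1):
    """Return the raw characters for one variable within one case.

    case_records: list of the record strings that make up a single case.
    column, record: 1-based positions, as in the column= and record= keywords.
    """
    line = case_records[record - 1]
    start = column - 1
    # Assumption: pad short records so a field past the end reads as blanks.
    return line.ljust(start + width)[start:start + width]


# A two-record case, loosely following the example at the end of this document.
case = ["0001      15", " " * 19 + "42  NE"]
print(extract_field(case, column=1, width=4))              # CASEID field
print(extract_field(case, column=20, width=2, record=2))   # age field
```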

Each variable description MAY also include a long label, descriptive text (such as questionnaire wording), and category labels for the code categories. If some of the code values represent invalid response codes, they may be flagged for exclusion from analysis; a minimum and a maximum valid code can also be specified (default values for these specifications can also be set).

The first variable description MUST be for a variable named ‘CASEID’, if the DDL file is to be input to the program MAKESDA in order to create or to add variables to an SDA dataset. If variables are added to an existing SDA dataset, MAKESDA checks the contents of CASEID to make sure that the value for each case matches the value stored previously.


DDL FILE LAYOUT

The DDL file must be laid out in various parts, separated by an asterisk (*) in column 1. The first part is the description of the study as a whole. The second part contains the description of the CASEID variable. Each subsequent part describes one variable. The general layout is as follows:

description of the dataset as a whole
*
description of the CASEID variable
*
description of a variable
*
description of another variable
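
The layout above can be sketched in Python as a simple split on lines that have an asterisk in column 1. The function name 'split_segments' is hypothetical, and the sketch assumes that no line inside a description begins with an asterisk. Anything that follows the asterisk on the separator line is kept, since the copy feature described later in this document uses that position.

```python
def split_segments(ddl_text):
    """Split DDL text into segments at lines with '*' in column 1.

    Returns a list of (separator_args, lines) pairs. The first pair is
    the dataset description and has an empty separator_args string.
    """
    segments = [("", [])]
    for line in ddl_text.splitlines():
        if line.startswith("*"):
            # Keep whatever follows the asterisk (for example "copy v75").
            segments.append((line[1:].strip(), []))
        else:
            segments[-1][1].append(line)
    return segments


sample = "title= Some Study\n*\nname= CASEID\ncolumn= 1\n* copy v75\nname= v76\n"
for args, lines in split_segments(sample):
    print(repr(args), lines)
```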

KEYWORDS USED FOR THE DATA DESCRIPTIONS

The descriptions are given in the form "keyword= something" with the keyword placed at the beginning of a line. Keywords may be either uppercase or lowercase and may be given in any order within each segment. Lines beginning with ‘#’ are treated as comments and are not significant. Blank lines are significant within a block of text, but otherwise they are ignored. The valid keywords are as given below; note that the possible specifications and the defaults are those currently used by the MAKESDA program.
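
A minimal Python sketch of these rules follows. The names 'KEYWORDS', 'MULTILINE' and 'parse_segment' are hypothetical, and the keyword set is simply copied from the tables in this document. The sketch assumes that only 'text=' and 'catlabels=' continue over following lines and that blank lines are kept only inside 'text='; the programs that actually read DDL may behave differently in edge cases.

```python
import re

# Keywords copied from the tables in this document (illustrative only).
KEYWORDS = {
    "title", "records/case", "reclen", "ncases", "path", "charset", "lang",
    "blank", "blank_c", "other", "case_c", "min", "max", "md", "md_c",
    "sysmdlabel", "record", "decimals", "type", "width",
    "name", "iname", "column", "label", "catlabels", "text",
}
MULTILINE = {"text", "catlabels"}
KEYWORD_RE = re.compile(r"^([A-Za-z_/0-9]+)\s*=(.*)$")


def parse_segment(lines):
    """Parse the lines of one segment into a {keyword: value} dict."""
    spec = {}
    current = None
    for line in lines:
        if line.startswith("#"):
            continue                      # comment line
        m = KEYWORD_RE.match(line)
        if m and m.group(1).lower() in KEYWORDS:
            current = m.group(1).lower()  # keywords are case-insensitive
            spec[current] = [m.group(2).strip()]
        elif current in MULTILINE and (line.strip() or current == "text"):
            # Continuation line; blank lines are kept only inside text=.
            spec[current].append(line.strip())
    # Join continuation lines and drop leading/trailing blank lines.
    return {k: "\n".join(v).strip("\n") for k, v in spec.items()}
```

For example, a segment containing 'NAME= v75' and 'column= 11' yields the dictionary entries 'name' and 'column', regardless of the case used for the keywords.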

GENERAL DATASET CHARACTERISTICS


Keyword       Possible Specification          Default (if no keyword)
_____________________________________________________________________


(Descriptions of the data file)


title=        remainder of the line           REQUIRED

records/case= number of records per case      1

reclen=       number of characters per record 80

ncases=       number of cases                 No checking for a
                                                specific N of cases
path=         directory for new dataset       Current directory

charset=      character set                   U.S. ASCII
               (used to specify an alternate
                encoding of text;
                see below)

lang=         language                        No language enforcing
               (code to pass to browsers
                for display purposes;
                see the language document)



(DEFAULT VALUES for individual variable specifications)


blank=        a number into which an          No default conversion
                all-blank field will be         for blanks
                converted
blank_c=      blank conversion for            No default conversion
                character variables

other=        a number into which a field     No default conversion
                with other non-numeric          for other characters
                characters will be converted
                (numeric type only)

case_c=       default case conversion         No default case
                for character variables         conversion

min=          default minimum valid code      No default min
max=          default maximum valid code      No default max

md=           default missing-data code(s)    No default md
                for numeric variables
md_c=         default MD code(s) for          No default md
                character variables
sysmdlabel=   default label for system        (No Data)
                missing-data value

record=       default record number for       1
                location of variables
decimals=     default number of implied       0
                decimal places
type=         default variable type:          numeric
                numeric or character
width=        default number of columns       1
                for each variable

If default values for variable specifications have been set as part of the general dataset characteristics, those defaults (or global values) can be overridden for a particular variable by simply re-specifying the keyword as part of the definition of that variable.

Those default values can be nullified for a particular variable by setting the keyword equal to a blank or by specifying ‘noglobal’. For example, ‘min=   ’ or ‘min=noglobal’ will nullify the default ‘min’ for the variable currently being defined (for instance, because a minimum valid value does not need to be defined for that variable).
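
The override and nullify rules can be summarized in a short Python sketch. The function name 'effective_value' and the use of dictionaries for the specifications are assumptions made for illustration only.

```python
def effective_value(keyword, var_spec, dataset_defaults):
    """Resolve one keyword for a variable against the dataset defaults.

    Returns None when no value applies to the variable.
    """
    if keyword in var_spec:
        value = var_spec[keyword].strip()
        if value == "" or value.lower() == "noglobal":
            return None          # default nullified for this variable
        return value             # default overridden for this variable
    return dataset_defaults.get(keyword)


defaults = {"min": "1", "md": "8,9"}
print(effective_value("min", {}, defaults))                   # dataset default
print(effective_value("min", {"min": "0"}, defaults))         # overridden
print(effective_value("min", {"min": "noglobal"}, defaults))  # nullified
```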


CHARACTERISTICS OF EACH NUMERIC VARIABLE


Keyword       Possible Specification          Default (if no keyword)
_____________________________________________________________________

name=         a single string of 1-32         REQUIRED
                alphanumeric characters
iname=        name for this item in the       No instrument name
                instrument or questionnaire
record=       number of the record            Use dataset default
                containing this variable        (usually 1)
column=       column location of the          REQUIRED
                left-most character
width=        number of columns used by       Use dataset default
                this variable                   (usually 1)
decimals=     number of implied decimal       Use dataset default
                places                          (usually 0)
type=         numeric                         numeric, unless another
                                                type has been set as
                                                dataset default

label=        remainder of the line           No long variable label

catlabels=    category labels and text        No category labels
                (see discussion below)

md=           list of invalid codes and/or    No md codes
                 ranges of codes (separated
                 by blanks or commas)
                See discussion below.

min=          minimum valid code              No defined minimum
max=          maximum valid code              No defined maximum

blank=        code into which a field         System missing-data code
                containing only blanks
                will be converted

other=        code into which a field         Unless a non-numeric
                containing non-numeric          character is defined as MD,
                characters will be              non-numeric fields will
                converted                       become system missing-data

sysmdlabel=   label for system missing-data   (No Data)
                value (from a blank input
                field)

text=         descriptive text of any length  No text stored for this
                (until next keyword)            variable



CHARACTERISTICS OF EACH CHARACTER VARIABLE (if any)


Keyword       Possible Specification          Default (if no keyword)
_____________________________________________________________________

name=         a single string of 1-32         REQUIRED
                alphanumeric characters
iname=        name for this item in the       No instrument name
                instrument or questionnaire
record=       number of the record            Use dataset default
                containing this variable        (usually 1)
column=       column location of the          REQUIRED
                left-most character
width=        number of columns used by       Use dataset default
                this variable                   (usually 1)

type=         character                       REQUIRED, unless default
                                                type is character

label=        remainder of the line           No long variable label

catlabels=    category values and labels      No category labels
                (see discussion below)

md_c=         list of character codes to be   No missing data codes
                treated as invalid or MD
                (multiples separated by
                blanks or commas; blanks can
                be specified as MD by using
                empty quotes -- "")
                See discussion below.

blank_c=      character field into which an   No conversion of blanks
                all-blank field will be
                converted (including quotes
                if you give them)

case_c=       upper or lower                  Mixed case preserved
                (convert all the characters
                 into upper or lower case)

text=         descriptive text of any length  No text stored for this
                (until next keyword)            variable

MISSING DATA SPECIFICATIONS -- NUMERIC VARIABLES

In numeric variables, it is possible to flag any number of codes or ranges as invalid and to be excluded from analysis. Multiple codes or ranges are separated by a comma or a blank. An asterisk (*) can be used to indicate all codes above or below a certain value. Some examples are as follows:

MD= 8,9
Codes 8 and 9 are missing-data

MD= 7-9
Codes 7 through 9 are missing-data

MD= 8-*
All numbers 8 and larger are missing-data

MD= *-0
All numbers less than or equal to zero are missing-data

MD= D,R
The characters ‘D’ and ‘R’ are missing-data

MD= -1,97-99,D,R
Several missing-data codes
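
One possible way to interpret such specifications is sketched below in Python. The names 'parse_md' and 'is_missing' are hypothetical, and the regular expression reflects only the forms shown in the examples above (single codes, ranges, and an asterisk for an open end). How the SDA programs handle other forms is not covered here.

```python
import re

NUM = r"-?\d+(?:\.\d+)?"
RANGE_RE = re.compile(rf"^(\*|{NUM})-(\*|{NUM})$")


def parse_md(spec):
    """Parse an md= value into a list of codes and (low, high) ranges."""
    items = []
    for token in re.split(r"[,\s]+", spec.strip()):
        if not token:
            continue
        m = RANGE_RE.match(token)
        if m:
            low = float("-inf") if m.group(1) == "*" else float(m.group(1))
            high = float("inf") if m.group(2) == "*" else float(m.group(2))
            items.append((low, high))
        else:
            items.append(token)   # a single code such as 8, -1 or D
    return items


def is_missing(raw, md_items):
    """Return True if the raw field value matches any md code or range."""
    raw = raw.strip()
    for item in md_items:
        if isinstance(item, tuple):
            try:
                if item[0] <= float(raw) <= item[1]:
                    return True
            except ValueError:
                pass              # non-numeric values never fall in a range
        elif raw == item:
            return True
        else:
            try:
                if float(raw) == float(item):
                    return True
            except ValueError:
                pass
    return False


md = parse_md("-1,97-99,D,R")
print(md)
print(is_missing("98", md), is_missing("D", md), is_missing("5", md))
```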


MISSING DATA SPECIFICATIONS -- CHARACTER VARIABLES

Missing data specifications for character variables work the same way as for numeric variables, except that the appropriate keyword for character variables is ‘md_c’. However, ranges are not allowed. Multiple missing data codes must be separated by a comma or a blank.

Note in particular that values containing embedded blanks or quotes must be enclosed in single or double quotation marks.

Some examples are as follows:

MD_C = DK, REF
Cases with the values ‘DK’ and ‘REF’ are treated as missing. Those two values are separated by a comma and/or a blank.

MD_C = "Refused to answer"
Because of the embedded blanks, quotes are necessary to define the whole phrase as MD. Otherwise, the individual words ‘Refused’, ‘to’, and ‘answer’ would be defined as three distinct missing-data values.

MD_C = "Don't know"
Notice that the single quote in "Don't know" requires double quotes around the whole character value.

MD_C = 'Don''t know', Refused
Alternatively, a single quote can be repeated inside of a pair of single quotes. In this case the apostrophe or single quote in "Don't know" is repeated.

MD_C = 'Mr. "Jack" Doe'
A character value with double quotes must be surrounded with single quotes.

MD_C = ""
Empty quotes refer to a blank. This specification means that a completely blank character value should be treated as missing data. Blank character fields are not treated as missing data unless you provide this specification.
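
The quoting rules illustrated above can be sketched as a small tokenizer in Python. The function name 'parse_md_c' is hypothetical. The sketch assumes that a doubled quote character inside a quoted value stands for one literal quote for both quote types, although the examples above show this only for single quotes, and it returns an empty string for the empty-quotes form.

```python
def parse_md_c(spec):
    """Split an md_c= value into a list of character codes."""
    values = []
    i, n = 0, len(spec)
    while i < n:
        ch = spec[i]
        if ch in ", \t":
            i += 1                          # separator between values
        elif ch in "'\"":
            quote, buf = ch, []
            i += 1
            while i < n:
                if spec[i] == quote:
                    if i + 1 < n and spec[i + 1] == quote:
                        buf.append(quote)   # doubled quote -> literal quote
                        i += 2
                        continue
                    i += 1                  # closing quote
                    break
                buf.append(spec[i])
                i += 1
            values.append("".join(buf))
        else:
            start = i
            while i < n and spec[i] not in ", \t":
                i += 1
            values.append(spec[start:i])    # unquoted value
    return values


print(parse_md_c("DK, REF"))
print(parse_md_c("'Don''t know', Refused"))
print(parse_md_c('""'))
```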

CATEGORY LABELS OR TEXT

Labels or text assigned to response categories can be of any length. The basic format is to specify pairs of codes and labels, with each pair on a separate line, as in the ‘catlabels=’ examples later in this section.

Long Category Text

If the text corresponding to a category is long, some analysis programs (outside of SDA) will create a shorter category label; this shorter label would be more appropriate for printing the results of an analysis such as crosstabulation.

Depending on the analysis program, the category label might be created by truncating the text to the first 16 or 20 characters. If the label created by truncating the text would be unclear or ambiguous, it is useful to provide your own abbreviated category label. This is done by enclosing the short label in square brackets after the category text. Programs that read the DDL file can then differentiate between the (long) text of a category and the (short) label corresponding to the same category.

     catlabels=
       1 Definitely will vote in the next election     [Definitely vote]
       2 Probably will vote in the next election       [Probably vote]
       3 Probably will not vote in the next election   [Prob not vote]
       4 Definitely will not vote in the next election [Def not vote]
     

Category text can extend over more than one line, provided that a backslash (‘\’) is the last character of every line except the last line:

     catlabels=
       1 Definitely will vote\
           in the next election       [Def vote]
       2 Probably will vote\
           in the next election       [Prob vote]
       3 Probably will not\
           vote in the next election  [Prob not vote]
       4 Definitely will not vote\
           in the next election       [Def not vote]
       8 Don’t know
       9 Refused
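
As an illustration of how a program reading DDL might separate codes, category text and bracketed short labels, consider the following Python sketch. The function name 'parse_catlabels' is hypothetical. The sketch assumes that continued lines are joined with a single space and that runs of blanks inside the text are not significant.

```python
import re

LABEL_RE = re.compile(r"^(.*?)\s*\[([^\]]*)\]\s*$")


def parse_catlabels(block):
    """Return a list of (code, text, short_label_or_None) tuples."""
    # Join physical lines that end with a backslash into one logical line.
    logical, pending = [], ""
    for line in block.splitlines():
        line = line.strip()
        if line.endswith("\\"):
            pending += line[:-1].rstrip() + " "
        else:
            if pending or line:
                logical.append(pending + line)
            pending = ""
    if pending.strip():
        logical.append(pending.strip())

    categories = []
    for line in logical:
        parts = line.split(None, 1)
        code = parts[0]
        rest = parts[1] if len(parts) > 1 else ""
        m = LABEL_RE.match(rest)
        if m:
            text, short = m.group(1), m.group(2)
        else:
            text, short = rest, None
        categories.append((code, " ".join(text.split()), short))
    return categories


# A raw string keeps the trailing backslashes exactly as they appear in DDL.
block = r"""
  1 Definitely will vote\
      in the next election       [Def vote]
  8 Don't know
  9 Refused
"""
for category in parse_catlabels(block):
    print(category)
```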
     


OTHER OPTIONS AND CLARIFICATIONS


DECIMAL PLACES FOR NUMERIC VARIABLES

There are two ways to process numeric values with decimals, depending on whether or not the input data fields contain an explicit decimal point.
(Users should note that the treatment of numeric variables with decimal points changed, beginning in Version 2.1 of SDA.)

  1. IMPLIED decimal places in the input data field
    If an input data field does NOT have an explicit decimal point, you can specify that the number in the input file should be interpreted as having a certain number of IMPLIED decimal places by using the ‘decimals=’ keyword. The value given after the ‘decimals=’ keyword is the number of IMPLIED decimal places in the input data.

    If decimals=2, for example, the input value ‘1234’ would be stored as ‘12.34’.
    (This is the same as in previous versions of SDA.)

  2. EXPLICIT decimal points in the input data field
    If a number in the input data field has an EXPLICIT decimal point, that number is stored with the full number of decimal places given in the data, regardless of the value given after the ‘decimals=’ specification for that variable. (This is also true if the data field for a case contains scientific notation.)

    If decimals=2, for example, the input value ‘1.237’ will retain all of its decimals and will be stored as ‘1.237’ in versions 2.1 and later of SDA. (In previous versions of SDA that input value would have been rounded to 2 decimal places and would have been stored as ‘1.24’.)
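
The two cases can be sketched in Python as follows. The function name 'apply_decimals' is hypothetical, and the sketch follows the behavior described above for versions 2.1 and later; it is not taken from the SDA source code.

```python
from decimal import Decimal


def apply_decimals(field, decimals=0):
    """Convert a numeric input field, honoring implied decimal places."""
    text = field.strip()
    if "." in text or "e" in text.lower():
        # Explicit decimal point or scientific notation: keep the value as given.
        return Decimal(text)
    # No explicit decimal point: shift by the number of implied places.
    return Decimal(text).scaleb(-decimals)


print(apply_decimals("1234", 2))    # implied decimal places
print(apply_decimals("1.237", 2))   # explicit decimal point is retained
```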


NON-NUMERIC INPUT FOR NUMERIC VARIABLES

An input field defined as ‘numeric’ (the default type) can contain a variety of characters, in addition to simple numbers. A numeric input field may contain leading and trailing blanks (which are ignored), and a possible minus or plus sign. It can also contain decimal points or numbers in scientific notation (like 5e02, for 500).

In SDA, blank input fields will be set to the system missing-data value, unless the DDL specification for that variable (or for all variables, globally) includes the ‘blank=’ keyword, to specify what number those fields are to be converted into. (For example, one could specify ‘blank=-1’, to convert all blank numeric input fields to ‘-1’ in the SDA dataset.) This conversion does NOT affect the original ASCII data file.

Non-numeric characters such as ‘D’ and ‘R’ are valid for a numeric variable in SDA, and those characters will be stored as such in the dataset, provided that those characters have been defined as missing-data codes. (If those non-numeric characters have not been defined as missing-data codes, they will be treated as invalid codes.)

A period (‘.’) by itself in a field, or an input field containing other non-numeric characters that have not been defined as missing-data codes, will ordinarily be converted to the system missing-data value in SDA. However, if the DDL specification for that variable (or for all variables, globally) includes the ‘other=’ keyword, the non-numeric fields will be converted by SDA to the value specified after ‘other=’. That value will then be examined like any other input value, to see whether it is a valid value or has been defined as missing-data or out-of-range.
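
The order of these checks can be sketched in Python as follows. The names 'SYSMISS' and 'convert_numeric_field' are hypothetical, None merely stands in for the system missing-data value, and implied decimal places are ignored here. The converted value would still have to be checked against the md, min and max specifications afterwards, which this sketch does not do.

```python
import re

SYSMISS = None  # stand-in for the system missing-data value
NUMERIC_RE = re.compile(r"^[+-]?(\d+\.?\d*|\.\d+)([eE][+-]?\d+)?$")


def convert_numeric_field(field, md_chars=(), blank=None, other=None):
    """Convert one raw input field for a numeric variable."""
    text = field.strip()                  # leading/trailing blanks are ignored
    if text == "":
        return SYSMISS if blank is None else float(blank)
    if text in md_chars:
        return text                       # character code defined as missing-data
    if NUMERIC_RE.match(text):
        return float(text)
    # A lone period or any other non-numeric content.
    return SYSMISS if other is None else float(other)


print(convert_numeric_field("   ", blank="-1"))
print(convert_numeric_field("D", md_chars=("D", "R")))
print(convert_numeric_field(".", other="99"))
```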


SPECIAL CHARACTER SETS

Most languages can be handled through the Unicode character set named ‘UTF-8’. Some documentation, however, is only available in different character encodings. The appropriate character set to use should be specified in the DDL file. For a list of recognized character sets, see: http://www.iana.org/assignments/character-sets. Examples include ‘Windows-1252’ (older Windows files) and ‘ISO-8859-1’ (Western European). If the DDL file contains characters other than U.S. ASCII, it is preferable today to use ‘UTF-8’ where possible.
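
A program that reads DDL files outside of SDA could honor the ‘charset=’ keyword along the lines of the following Python sketch. The function name 'decode_ddl' is hypothetical. The sketch assumes that the keyword appears in the dataset description (before the first asterisk line) and that the character set name is one that Python's codec registry recognizes.

```python
import codecs


def decode_ddl(raw_bytes, default="ascii"):
    """Decode the bytes of a DDL file using its own charset= declaration."""
    charset = default
    # Latin-1 maps every byte to a character, so it is safe for scanning.
    for line in raw_bytes.decode("latin-1").splitlines():
        if line.startswith("*"):
            break                         # end of the dataset description
        key, sep, value = line.partition("=")
        if sep and key.strip().lower() == "charset" and value.strip():
            charset = value.strip()
            break
    codecs.lookup(charset)                # raises LookupError if unknown
    return raw_bytes.decode(charset)


raw = "title= Caf\u00e9 Study\ncharset= UTF-8\n*\nname= CASEID\n".encode("utf-8")
print(decode_ddl(raw).splitlines()[0])
```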

COPY MODE

A set of variables will often have many attributes in common, including missing-data specifications and category labels. Instead of repeating those attributes for every variable, it is possible to specify that the attributes of some previously defined variable (in the same DDL file) apply to another variable. (Of course the name and column location of a variable cannot be copied; those attributes must be unique for each variable.)

This copy feature is invoked by putting the word ‘copy’ on the asterisked line preceding the variable’s specifications. The variable whose attributes can be copied is either the previous variable (if no specific name is given) or some specific variable defined earlier in the same DDL file. The general layout is as follows:

description of v101

* copy
description of v102,
using all variable definitions of the PREVIOUS variable (v101) that are not specifically redefined in this new variable description.

* copy v75
description of v103,
using all variable definitions for v75 that are not specifically redefined in this new variable description (assuming that v75 has already been defined).
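
The copy rules can be sketched in Python as a merge of two sets of specifications. The names 'NOT_COPIED' and 'resolve_copy' are hypothetical, and the sketch assumes that variable names are matched exactly as written and that every attribute other than the name and column location is eligible for copying.

```python
NOT_COPIED = {"name", "column"}   # must be given for every variable


def resolve_copy(separator_args, new_spec, defined):
    """Apply '* copy' or '* copy NAME' to a new variable specification.

    separator_args: the text after the asterisk on the separator line.
    defined: dict of name -> spec for earlier variables, in file order.
    """
    parts = separator_args.split()
    if not parts or parts[0].lower() != "copy":
        return dict(new_spec)                     # no copying requested
    if len(parts) > 1:
        source = defined[parts[1]]                # a specific earlier variable
    else:
        source = list(defined.values())[-1]       # the previous variable
    merged = {k: v for k, v in source.items() if k not in NOT_COPIED}
    merged.update(new_spec)                       # redefinitions take priority
    return merged


defined = {
    "v75": {"name": "v75", "column": "11", "md": "8,9", "label": "Campaign"},
    "v80": {"name": "v80", "column": "15", "md": "9"},
}
print(resolve_copy("copy v75", {"name": "v76", "column": "12"}, defined))
```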


BACKWARD COMPATIBILITY

Although DDL syntax has been extended and changed since previous releases of the non-Web CSA programs, the programs that read DDL can also read the earlier version of DDL.

The following keywords are still recognized and are equivalent to the new keywords shown after the equal sign:

labels = catlabels
lrecl = reclen
noglob = noglobal
scale = decimals

The older missing-data keywords ‘MD1=mdvalue1’ and ‘MD2=mdvalue2’ are also recognized and are equivalent to the new form:

MD= mdvalue1, mdvalue2
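
A program that reads older DDL files could map the legacy forms onto the current ones roughly as in the following Python sketch. The names 'ALIASES' and 'normalize_legacy' are hypothetical. The sketch treats 'noglob' as a value (as in 'min=noglobal') and assumes that a variable does not use both the old 'MD1=' / 'MD2=' keywords and the new 'MD=' keyword at the same time.

```python
ALIASES = {"labels": "catlabels", "lrecl": "reclen", "scale": "decimals"}


def normalize_legacy(spec):
    """Rewrite legacy keywords and values in a {keyword: value} dict."""
    out = {}
    legacy_md = {}
    for key, value in spec.items():
        k = key.lower()
        if k in ("md1", "md2"):
            legacy_md[k] = value.strip()
            continue
        if value.strip().lower() == "noglob":
            value = "noglobal"
        out[ALIASES.get(k, k)] = value
    if legacy_md:
        # MD1 and MD2 become one list, in that order.
        out["md"] = ", ".join(legacy_md[k] for k in sorted(legacy_md))
    return out


old = {"LABELS": "1 Yes", "lrecl": "80", "scale": "2", "MD2": "9", "MD1": "8"}
print(normalize_legacy(old))
```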

EXAMPLE OF DDL

title=    Some Election Study
records/case=2
reclen=   80
path=     /mysda/election

*
name=     CASEID
label=    Case ID of Respondent
record=   1
column=   1
width=    4

*
name=     v75
label=    R's Interest in Campaign
record=   1
column=   11
md=       8,9
catlabels=
          1 Very Interested
          2 Somewhat Interested
          5 Not Interested
          8 Don't know, can't answer [DK]
          9 Refused to answer        [Ref]
text=
   Some people don't pay much attention to political
   campaigns so far this year.  How about you, are you very
   interested, somewhat interested, or not interested at all?

* copy v75
# Copy the category labels and MD definitions from the variable 'v75'
# (Other specifications are redefined for 'v76')
name=     v76
label=    R's Interest in Primary Election Results
column=   12
text=
   How about the results of primary elections.
   How interested in those results are you?
   Are you very interested, somewhat interested,
   or not interested at all?

*
name=     age
label=    Age of respondent
record=   2
column=   20
width=    2
md=       97-*
catlabels =
          97 Age 97 or over
          98 Don't know
          99 Refused

*
name=     region
label=    Character code for each region
record=   2
column=   24
width =   2
type  =   character
md_c  =   X
catlabels=
          NE Northeastern states
          NC North Central states
          S Southern states
          W Western states
          X (Not available)
text =
Region of the country - coded from the state codes

*
name=     weight
label=    Weight variable
record=   2
column=   50
width=    6
decimals= 4
md=       0
text=
Weight variable with 4 implied decimal places.

_____________________________________________________________________

(For a more extended example, see the DDL file for the SDA test data which is distributed with the SDA programs.)

SEE ALSO

ddlmod     Modify or merge DDL files
language   Using non-English languages
makesda    Make SDA variables out of DDL and an ASCII data file
q4toddl    Convert CASES Q language files into DDL
xconvert   Convert SAS, SPSS, or Stata data definitions into DDL


CSM, UC Berkeley
April 12, 2011