Online Help for Creating New Variables - SDA 4.0

SDA Recode Program

This program recodes one or more existing numeric variables into a new SDA variable.

Steps to take

Assign a name to the new variable to be created
This is the name you will use to include the new variable in a subsequent analysis.

Specify the input variables
The input variables are the existing variables (one or more) used to create the new variable. Use the name for each variable as given in the documentation for this study.

Specify the recoding rules
These recoding rules specify how to create the new variable out of the existing variable(s).

Provide a label and other specifications
Various optional specifications can be provided for the new variable.

Start recoding
After specifying all variables and options, press the Start Recoding button.

REQUIRED variable names

Name for the new variable
The new variable to be created will be stored under the name specified here. If a variable of the same name already exists (in the location for new variables), it will NOT be replaced, unless the option to replace it is selected.

Name(s) of existing variables to use for the recode
The variable(s) specified here will be recoded into the new variable. From one to six existing variables may be specified.

For example, to recode the variables 'age' and 'sex', enter the names of those variables in the text box, separated by spaces and/or a comma (for instance: age, sex).


Recoding rules and examples

The recoding rules specify how the values of the input variable(s) for each case are to be converted into the values of the new variable. The basic rules are very simple, but certain options can make the specifications a little more complex.

Basic recoding - One input variable
For example, to combine the values of the variable 'age' into three categories numbered 1-3 and create a new variable named 'age3', first specify 'age3' as the variable to be created.

Then specify 'age' as the input variable.

And then you can specify the recoding rule as:

Value Label age
1 Younger 18-34
2 Middle 35-64
3 Older 65-95

The specified codes on the input variable (here 'age') can consist of single code values, ranges, or a combination of many values and/or ranges (separated by commas). The input box for entering these code values and ranges only displays 8 characters at a time, but you can actually enter up to 80 characters in each box. (You can use the arrow keys to scroll back and forth in the box.)

The label for each category is the text that will be displayed in a table with the new variable. The label is optional, but it is helpful in most cases, especially when there is no obvious ordering of the categories. The input box for labels only displays 16 characters at a time, but you can actually enter up to 250 characters for each label.

If you enter a very long description for a category, you may also want to specify an abbreviated label for the category, to be used when running tables. Such optional labels are specified in brackets, after the longer text. For example:

Respondent agrees with all the major policies listed [Agrees with all]

The recoding rules ignore the missing-data status of NUMERIC codes on the input variable, if they are mentioned explicitly or in a range. For instance, if the value 90 for 'age' were flagged as a missing-data code, but included in the range 65-95 as in the example above, it would be recoded into the value 3 on the new variable. (There is additional help on the treatment of numeric missing-data codes.)

Any categories of the input variable not included in the recoding rules will generally become missing-data on the new variable, and they will ordinarily be excluded from analyses of the new variable. For example, if some cases in the recode above had codes of 96 or 97, they would be recoded into a missing-data category on the new variable. You can specify what that missing-data category should be. (See the help on that topic.)
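
The behavior of the basic 'age3' recode can be pictured with a short Python sketch. This is only an illustration of the rules described above, not SDA code: the names RULES, MISSING, and recode_age are made up for the example, and None stands in for whatever missing-data category is assigned on the new variable.

```python
# Hypothetical sketch of the 'age3' recode; not SDA code.
MISSING = None  # stands in for the missing-data category on the new variable

# (new value, label, inclusive ranges on 'age'), in the order entered
RULES = [
    (1, "Younger", [(18, 34)]),
    (2, "Middle", [(35, 64)]),
    (3, "Older", [(65, 95)]),
]

def recode_age(age):
    """Return the new code for one case, or MISSING if no rule matches."""
    for value, _label, ranges in RULES:
        if any(lo <= age <= hi for lo, hi in ranges):
            return value
    return MISSING

print([recode_age(a) for a in (18, 40, 90, 96)])  # [1, 2, 3, None]
```

Note that the value 90 lands in category 3 whether or not it was flagged as a missing-data code on 'age', because it falls inside an explicitly stated range.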

Basic recoding - Multiple input variables
It is possible to recode combinations of multiple input variables (up to six) into a new variable.

For example, to combine 'age' and 'sex' into a new variable named 'agesex' with four categories numbered 1-4, first specify 'agesex' as the variable to be created.

Then specify both 'age' and 'sex' as input variables.

And then you can specify the recoding rule as:

Value Label age sex
1 Younger Men 18-40 1
2 Younger Women 18-40 2
3 Older Men 41-95 1
4 Older Women 41-95 2

These recoding rules are easily extended to handle more than two input variables. You can add more rows for recoding rules by clicking on the button labeled 'Add empty row to table'.
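
As a rough illustration, the 'agesex' rules can be read as "a case gets the value of the first row whose conditions are all met." The Python sketch below assumes exactly that reading; the names RULES and recode_agesex are hypothetical and None stands in for the missing-data category.

```python
# Hypothetical sketch of the 'agesex' recode; not SDA code.
# (new value, label, inclusive range on 'age', code on 'sex')
RULES = [
    (1, "Younger Men", (18, 40), 1),
    (2, "Younger Women", (18, 40), 2),
    (3, "Older Men", (41, 95), 1),
    (4, "Older Women", (41, 95), 2),
]

def recode_agesex(age, sex):
    """A row matches only if every input variable matches its column."""
    for value, _label, (lo, hi), sex_code in RULES:
        if lo <= age <= hi and sex == sex_code:
            return value
    return None  # unspecified combination

print(recode_agesex(30, 2), recode_agesex(70, 1), recode_agesex(30, 3))  # 2 3 None
```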

Assigning labels to the new code values
As you can see in the recode examples, there is an input box for assigning a category label or descriptive text for each category of the new variable. These labels will appear in analysis tables that include the new variable. The input box for labels only displays 16 characters at a time, but you can actually enter up to 250 characters for each label.

If you enter a very long description for a category, you may also want to specify an abbreviated label for the category, to be used when running tables. Such optional labels are specified in brackets, after the longer text. For example:

Respondent agrees with all the major policies listed [Agrees with all]

Open ranges using an asterisk
If you are not sure of the ranges of the variable(s) to be recoded, you can specify an open range with an asterisk (*). That symbol matches the lowest or highest VALID code in the data for that variable. For example, the 'age' recode could be specified as:

Value Label age
1 Younger *-34
2 Middle 35-64
3 Older 65-*

Using this method, all valid age values up through 34 would go into the first recoded group. And all valid age values of 65 or older would go into the third group. (Recoding missing-data values is discussed next.)
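
One way to think about a single asterisk is that it is replaced by the lowest or highest valid code before the range is applied. The Python sketch below assumes that interpretation; resolve_open_range and the list of ages are invented for the example, and the simple parsing does not handle negative codes.

```python
# Hypothetical sketch, not SDA code.
def resolve_open_range(spec, valid_codes):
    """Turn a range such as '*-34' or '65-*' into inclusive numeric bounds."""
    lo_text, hi_text = spec.split("-")
    lo = min(valid_codes) if lo_text == "*" else int(lo_text)
    hi = max(valid_codes) if hi_text == "*" else int(hi_text)
    return lo, hi

# Made-up valid codes for 'age'; a missing-data code such as 99 is not in the list.
valid_ages = [18, 22, 34, 35, 64, 65, 89]

print(resolve_open_range("*-34", valid_ages))  # (18, 34)
print(resolve_open_range("65-*", valid_ages))  # (65, 89)
```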

Treatment of numeric missing-data codes
NUMERIC codes that have been defined as missing data on the original (input) variables can be recoded into one of the categories of the new variable in three ways.

1. Mention the input missing-data code explicitly as a single value.

For example, if the original 'age' value of 99 was defined as a missing-data code, it can be assigned to a new category of 9 on the new variable as follows:

Value Label age
1 Younger 18-34
2 Middle 35-64
3 Older 65-95
9 Missing data 99

Note that this value of 9 on the new variable will be a valid code unless you set 9 to be a missing-data code or out of the valid range. (See the help on these options.)

2. Include the input missing-data code as part of a range.

For example, the cases with the original value of 99 on age (whether or not 99 was defined as a missing-data value) can be recoded into category 3 of the new variable by specifying a range that includes that value:

Value Label age
1 Younger 18-34
2 Middle 35-64
3 Older 65-100

3. Use an open range with TWO asterisks (**) instead of one.

For example, the following specification will recode all numeric codes 65 and over into category 3 of the new variable (whether or not the codes had been defined as missing-data codes).

Value Label age
1 Younger 18-34
2 Middle 35-64
3 Older 65-**
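
The difference between one asterisk and two can be sketched as a difference in how the open end of the range is resolved. The Python below is an assumption-laden illustration, not SDA code: high_end, ages, and md_codes are made-up names, and 99 is assumed to be flagged as a missing-data code.

```python
# Hypothetical sketch, not SDA code.
ages = [18, 40, 70, 95, 99]  # made-up numeric codes found in the data
md_codes = {99}              # assumed: 99 is flagged as missing data

def high_end(stars, codes, md):
    """Upper bound for 'N-*' (highest valid code) or 'N-**' (highest code of any kind)."""
    if stars == "**":
        return max(codes)
    return max(c for c in codes if c not in md)

print(high_end("*", ages, md_codes), high_end("**", ages, md_codes))  # 95 99
```

Under this reading, '65-*' stops at 95 and leaves 99 unmatched, while '65-**' reaches 99 and recodes it into category 3.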

Treatment of character missing-data and system missing-data codes
There are two ways to recode character missing-data codes and the system missing-data code into a specific category of the new variable. (This currently works only for recoding NUMERIC variables, not for recoding character variables.)

1. Mention the input missing-data code explicitly

A character code that has been defined as missing data on an original NUMERIC variable can be assigned to one of the categories of the recoded variable by specifying that character in the recoding rules. Similarly, the system missing-data code can be recoded by referring to it as `$.' in a recoding rule. (Note the period after the dollar sign.)

For example, the characters `D' and `R' may have been defined as missing-data values for the variable `age', to indicate "Don't know" and "Refused." Also, some cases may have had a blank input field for 'age' in the original data file and were assigned the system missing-data code. Those missing-data codes in the original variable can be recoded, respectively, into the NUMERIC codes 7, 8, and 9 as follows:

Value Label age
1 Younger 18-34
2 Middle 35-64
3 Older 65-95
7 Don't know D
8 Refused R
9 No data $.

2. Use double asterisks (**)

Double asterisks match ANY code, including all missing-data codes. They can be used to assign character missing-data codes and the system missing-data code to a numeric value on the new variable.

For example, to recode ALL the rest of the codes of the variable `age' (not previously mentioned in a recoding rule) into the category `9' on the new variable, you would specify:

Value Label age
1 Younger 18-34
2 Middle 35-64
3 Older 65-95
9 All the rest **

Note that it is only possible to recode character missing-data codes into a numeric code. It is not possible to recode anything INTO a character value. Also, it is not currently possible to recode a CHARACTER variable (which is different from a NUMERIC variable with one or more character values defined as missing-data codes).

However, it is possible to recode anything in a numeric variable into the system missing-data code. Any value that does not match a recode rule will be converted into the system missing-data code, unless a user-specified missing-data value was supplied.
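
The first method (naming each character code and the system missing-data code explicitly) can be sketched in Python as follows. This is only an illustration: recode_age and SYSMIS are hypothetical names, the string "$." merely stands in for the system missing-data code, and the sketch assumes no user-specified missing-data value was supplied for the new variable.

```python
# Hypothetical sketch, not SDA code.
SYSMIS = "$."  # stands in for the system missing-data code

def recode_age(code):
    """code is a number, a character missing-data code, or SYSMIS."""
    if isinstance(code, str):
        # Character and system missing-data codes must be named explicitly.
        return {"D": 7, "R": 8, SYSMIS: 9}.get(code, SYSMIS)
    for value, lo, hi in ((1, 18, 34), (2, 35, 64), (3, 65, 95)):
        if lo <= code <= hi:
            return value
    return SYSMIS  # no rule matched: system missing-data by default

print([recode_age(c) for c in (25, "D", "R", SYSMIS, 97)])  # [1, 7, 8, 9, '$.']
```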

Overlapping ranges
If the same original code value is mentioned in two or more groupings, it is recoded the FIRST time that the value is encountered.

For example, in the following specification age 35 will be recoded into the first category, and not the second, because the first match is the one that counts. Similarly age 65 will be recoded into the second category, and not the third.

Value Label age
1 Younger 18-35
2 Middle 35-65
3 Older 65-95

Notice that order is important with overlapping ranges. The following specification will NOT have the same effect as the preceding one:

Value Label age
3 Older 65-95
2 Middle 35-65
1 Younger 18-35

In this example, age 65 will be assigned a value of 3 on the new variable (instead of a value of 2 as in the previous example), and age 35 will be assigned a value of 2 (instead of 1).
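
The first-match rule can be checked with a small Python sketch that applies the same overlapping ranges in both orders. The names first_match, forward, and backward are made up for the illustration.

```python
# Hypothetical sketch of the first-match rule; not SDA code.
def first_match(age, rules):
    """Return the value of the first rule whose range contains age."""
    for value, lo, hi in rules:
        if lo <= age <= hi:
            return value
    return None

forward = [(1, 18, 35), (2, 35, 65), (3, 65, 95)]
backward = list(reversed(forward))  # same ranges, entered in the opposite order

print(first_match(35, forward), first_match(65, forward))    # 1 2
print(first_match(35, backward), first_match(65, backward))  # 2 3
```

If the ranges did not overlap, the order of the rows would make no difference to the result.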

Multiple specifications for one recoded group
It may sometimes be useful to have more than one specification for a new value on the recoded variable. This can be done by specifying the desired outcome code a second time (or as many times as you wish).

For example, to have age recoded into two categories, with category 1 including everyone EXCEPT those aged 35-64, you could use the following recoding rules:

Value Label age
1 Not middle 18-34
2 Middle age 35-64
1 Not middle 65-95

Note that if you specify the label a second time for the same category of the new variable, it will override what you specified the first time.
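
A short Python sketch may make the label behavior concrete. It assumes that labels are simply stored per output value as the rows are read, so a later label for the same value replaces an earlier one; the second label for value 1 is invented here to make the override visible.

```python
# Hypothetical sketch, not SDA code.
rows = [
    (1, "Not middle", (18, 34)),
    (2, "Middle age", (35, 64)),
    (1, "Under 35 or over 64", (65, 95)),  # made-up second label for value 1
]

labels = {}
for value, label, _range in rows:
    labels[value] = label  # a later label replaces an earlier one

def recode(age):
    for value, _label, (lo, hi) in rows:
        if lo <= age <= hi:
            return value
    return None

print(recode(20), recode(50), recode(70), labels[1])
# 1 2 1 Under 35 or over 64
```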


Other optional specifications

Several optional specifications are common to all variable-generating programs. Those specific to the RECODE program are listed here.

What to do with unspecified combinations

If, according to the recoding rules, the values of the input variables for a specific case do not map to a valid output value, it is still necessary to assign some code to the new variable for that case. There are three options for doing this:

SDA Compute Program

This program creates a new SDA variable as a result of a computation based on one or more existing numeric variables.

Steps to take

Enter the expression
The new variable will be created as the result of applying an algebraic expression to one or more pre-existing variables (or by generating random distributions).
A simple example: newvar = 2 * oldvar

Select computation options
After specifying the expression, you can modify certain rules applied to the construction of the new variable. These computation options include treatment of missing-data codes on the input variable(s), how to code missing-data on the output variable, and the number of decimal places to store.

Select optional specifications
Using these other options, you may specify labels for the new variable and also specify ranges or codes to be considered valid or invalid.

Start computing
After specifying the expression and options, press the Start Computing button.

EXPRESSION to define the new variable

The expression is of the general form:
newvar = expression
The name of one (and only one) new variable must appear on the left-hand side of the equal sign. If a variable of the same name already exists (in the location for new variables), it will NOT be replaced, unless the option to replace it is selected.


Basic expressions - One line

Basic expressions are of the form:

newvar = spend + spend2 + spend3
  (or)
newvar = sum(spend, spend2, spend3)
                    
Only one new variable can be created at a time. And the name of a new variable can only appear once on the left side of an equal sign (except in an IF-statement).

Note that the two examples above are NOT equivalent. The 'sum' function treats missing-data codes differently than just using '+'. Using '+' will usually generate a missing-data code on the new variable for a case unless ALL of the input variables have valid codes. But the 'sum' function can skip over variables with missing-data codes and just add up the valid codes on the specified variables.
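
The difference between '+' and 'sum' can be sketched in Python, using None to stand for a missing-data code. The function names plus and sda_sum are hypothetical, and the sketch assumes the default behavior described above (at least one valid code is enough for 'sum').

```python
# Hypothetical sketch, not SDA code. None stands for a missing-data code.
MD = None

def plus(*values):
    """Like spend + spend2 + spend3: any missing input makes the result missing."""
    return MD if any(v is MD for v in values) else sum(values)

def sda_sum(*values):
    """Like sum(spend, spend2, spend3): missing inputs are skipped."""
    valid = [v for v in values if v is not MD]
    return sum(valid) if valid else MD

print(plus(10, MD, 5), sda_sum(10, MD, 5))  # None 15
```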

Descriptions of all of the operators, functions, and options that can be used with the COMPUTE program are given next.


Treatment of missing data in expressions

If ANY input variable in an expression has a missing-data code for a particular case, the output variable being created will generally be assigned a missing-data code. By default the case will be assigned the system missing-data code. However, if the user has designated some specific value as the missing-data code for the output variable, the case will be assigned that value.

This automatic assignment of an output missing-data code does not hold if the user does one of the following:

See the documentation on each of those functions or options, to see how they treat missing-data in the input variables.

IF-statements, ELSEIF-statements, and ELSE-statements are evaluated in order. When one of those statements returns a missing-data value for a particular case, no further IF/ELSEIF/ELSE statements are evaluated, and the output variable for that case is assigned the missing-data code.


Expressions with IF / ELSE IF / ELSE
if (var1 eq 1)
   newvar = var3
else if (var1 eq 2)      [A space after `else' is optional]
   newvar = var4
else
   newvar = -1
                

The `ELSE IF' part can be repeated; `ELSE' can be used only once; both parts are optional.

The expressions `IF', `ELSE IF', and `ELSE' should begin on a new line. Note that either upper or lower case can be used for `IF' and `ELSE'.

If no `ELSE' part is used, it is possible that some cases will not meet any of the conditions; the new variable will then be set to the specified missing data code for those cases.

There is an implied `ENDIF' at the end of the expression. The use of `ENDIF' is optional unless there are nested IF-statements (as shown below).
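
For readers more familiar with a general-purpose language, the IF / ELSE IF / ELSE example above corresponds roughly to the following Python function. The with_else flag is an invention of this sketch, used only to show what happens when the ELSE part is left out; None stands for the missing-data code.

```python
# Hypothetical Python rendering of the IF / ELSE IF / ELSE example; not SDA code.
def newvar(var1, var3, var4, with_else=True):
    if var1 == 1:
        return var3
    elif var1 == 2:
        return var4
    elif with_else:
        return -1
    return None  # no ELSE part: the case gets the missing-data code

print(newvar(1, 10, 20), newvar(2, 10, 20), newvar(3, 10, 20))  # 10 20 -1
print(newvar(3, 10, 20, with_else=False))                       # None
```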


Logical operators to use with If / Else if
OPERATOR                           EXAMPLES

   EQ     equal to            if (x eq y) newvar = 1
   NE     not equal to        if (x ne y) newvar = 1
   GT     greater than        if (x gt y) newvar = 1
   GE     greater or equal    if (x ge y) newvar = 1
   LT     less than           if (x lt y) newvar = 1
   LE     less or equal       if (x le y) newvar = 1

   AND    both are true       if (x gt y AND x gt z) newvar = 1
   OR     either is true      if (x gt y OR  x gt z) newvar = 1

   These operators can be in upper or lower case.

                

Nested IF-statements

IF-statements can be nested. In such cases, however, it is necessary to use `ENDIF' to eliminate ambiguity. The following example illustrates how this can be done:


IF ( oldvar1 eq 1 )
    IF ( oldvar2 lt 100 )
       newvar = 1
    ELSEIF ( oldvar3 eq 2 )
       newvar = 2
    ENDIF
ELSE
    IF ( oldvar4 gt 10 )
       newvar = 3
    ENDIF
ENDIF

                
There can only be one IF-statement at the top level of the nested expression. The example above has more than one IF-statement, but all except one are nested within the top-level IF/ELSE expression.

Notice how the use of `ENDIF' removes ambiguity about what part goes with what. It is required that `ENDIF' be used for the nested portion of complex IF-statements. The very last `ENDIF', however, could have been omitted.
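
The nested example can be read as the following Python function, where indentation plays the role that `ENDIF' plays in the expression. This is a sketch of the intended logic only; None stands for the missing-data code assigned when no branch sets a value.

```python
# Hypothetical Python rendering of the nested IF example; not SDA code.
def newvar(oldvar1, oldvar2, oldvar3, oldvar4):
    if oldvar1 == 1:
        if oldvar2 < 100:
            return 1
        elif oldvar3 == 2:
            return 2
    else:
        if oldvar4 > 10:
            return 3
    return None  # no branch assigned a value: missing-data code

print(newvar(1, 50, 0, 0), newvar(1, 200, 2, 0), newvar(2, 0, 0, 11), newvar(2, 0, 0, 5))
# 1 2 3 None
```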


Use of temporary Variables
$temp1 = var1 + var2
$temp2 = var3 / var4
newvar = $temp1 / $temp2
                

Variables with names that begin with `$' only exist while COMPUTE is running. They are not available for analysis after COMPUTE is finished creating the new variable.

A temporary variable cannot be used in the test portion of an IF-expression. For example, it is NOT legal to use:

if ($temp1 eq 1)         (NOT legal)
                
However, a temporary variable CAN appear on the left hand side of an equal sign within an IF-statement. For example, the following is legal:
if (age lt 40) $temp1 = 1
                

Arithmetic operators
+ - * /       Addition, subtraction, multiplication, division 

^             Power -- for example: var1^2 (var1 squared) 

-var1         Negative of var1 (unary -) 

( )           Parentheses are used to alter (or clarify) the
                usual order of evaluation. 
                
Order in which the various operators are applied:
  1. functions
  2. unary -
  3. ^
  4. * and /
  5. + and -
Operators at the same level are applied from left to right.
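
The listed order can be made explicit with parentheses. The Python sketch below writes out the grouping that the list implies for a + b * c^2. It also shows both possible groupings of a negated power: because unary minus is listed ahead of ^, a literal reading of the list suggests that -c^2 groups as (-c)^2, whereas Python itself groups -c**2 as -(c**2). That reading is an assumption, so parentheses are the safe choice whenever a negated power is intended.

```python
# Hypothetical illustration of operator grouping; not SDA code.
a, b, c = 1, 2, 3

grouped = a + (b * (c ** 2))   # a + b * c^2 under the listed order
minus_first = (-c) ** 2        # unary minus applied before the power
power_first = -(c ** 2)        # power applied before unary minus (Python's default)

print(grouped, minus_first, power_first)  # 19 9 -9
```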

Arithmetic functions (can be in upper or lower case)

ABS(x)               Absolute value of x 
EXP(x)               Exponential function (antilog), e^x 
LOG(x) or LN(x)      Natural logarithm 
LG10(x) or LOG10(x)  Logarithm - base 10 
MOD(x,a)             Modulus (remainder) of `x' divided by `a' 
                      (e.g., mod(5,2) equals 1) 
RND(x) or ROUND(x)   Round off 
SQRT(x)              Square root 
TRUNC(x)             Truncate; the integer part of x 

                

Summaries of variables (can be in upper or lower case)

MEAN.n(x,y,...)    Mean of the given variables 
SUM.n(x,y,...)     Sum of the given variables 
MIN.n(x,y,...)     Minimum value of the given variables 
MAX.n(x,y,...)     Maximum value of the given variables 

                

Note that the `.n' part of the function name is optional. If used, it tells the function that at least `n' of the given variables must have valid data for a case; otherwise the function returns the missing data code. The default value for `n' is 1.

For example, `mean(var1,var2,var3)' will generate the mean of the three variables, even if only one of the three has a valid code. On the other hand `mean.2(var1,var2,var3)' will generate a mean for a specific case only if at least two variables have valid codes on that case.
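
The effect of the `.n' suffix can be sketched in Python. The function mean_n is a made-up stand-in for MEAN.n, with None representing a missing-data code.

```python
# Hypothetical sketch of MEAN.n; not SDA code. None stands for missing data.
def mean_n(values, n=1):
    """Mean of the valid values, or None unless at least n values are valid."""
    valid = [v for v in values if v is not None]
    if len(valid) < n:
        return None
    return sum(valid) / len(valid)

print(mean_n([4, None, None]))       # 4.0
print(mean_n([4, None, None], n=2))  # None
print(mean_n([4, 6, None], n=2))     # 5.0
```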


Other summaries (can be in upper or lower case)

COUNT(x,y(a-b))   Number of variables with values between a and b
                    (can specify different ranges for each var;
                    missing data or out-of-range codes are not
                    counted unless include-MD option is selected)

CUM(x)            Cumulate the value of `x' from one case to
                    the next (`x' can be a variable or a constant;
                    if `x' is a missing-data value, the cumulation
                    from the previous case is carried over)

MISSING(x,y,...)  Number of variables with missing data or
                    out-of-range codes
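
These three summaries can be pictured with the Python sketch below. It is an illustration under simplifying assumptions: the function names are invented, None stands for any missing-data or out-of-range code, and the include-MD option is assumed to be off.

```python
# Hypothetical sketches of COUNT, MISSING, and CUM; not SDA code.
def count(pairs):
    """pairs holds (value, low, high) for each variable; missing values are not counted."""
    return sum(1 for v, lo, hi in pairs if v is not None and lo <= v <= hi)

def n_missing(values):
    """Number of variables with a missing-data code for this case."""
    return sum(1 for v in values if v is None)

def cum(values):
    """Running total across cases; a missing value carries the previous total over."""
    total, out = 0, []
    for v in values:
        if v is not None:
            total += v
        out.append(total)
    return out

print(count([(3, 1, 5), (None, 1, 5), (9, 1, 5)]))  # 1
print(n_missing([3, None, 9]))                      # 1
print(cum([1, 2, None, 4]))                         # [1, 3, 3, 7]
```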

                

Random Distribution Functions (can be in upper or lower case)
 

UNIFORM(x,y)        Uniform distribution between x and y
                      (x and y can be constants or variables)
DUNIFORM(x,y)       Discrete uniform distribution between x and y
                      (result is always a whole number)
NORMAL(x,y)         Normal distribution with mean=x, sd=y

                

Trigonometric Functions (can be in upper or lower case)

SIN(x), COS(x)            Sine and cosine  (x is in radians) 
ARSIN(y) or ARCSIN(y)     Arcsine 
ARTAN(y) or ARCTAN(y)     Arctangent 

                


Computation Options

Include NUMERIC missing-data values in computations

If a pre-existing variable named in an expression has a value for a particular case that is designated as missing-data or as outside the range of valid values, the new variable for that case will ordinarily be assigned a missing-data code.

If you select this option to include missing-data values in computations, the program will consider numeric missing-data values as valid, for purposes of generating the new variable.

For example, the expression 'newvar = 2 * age' would ordinarily result in a missing-data value for the new variable if the variable 'age' had the value '99' which was designated as a missing-data code (to indicate a refusal). If this option is selected, the new variable in this case would receive a valid value of (2 * 99 =) 198.

Note that this option will not override character missing-data values (such as 'D' or 'R'), nor will it override the system missing-data code. Such missing-data values do not have any numeric value that could be used in a computation.
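
The effect of this option on the 'newvar = 2 * age' example can be sketched in Python. The function double_age and its parameters are hypothetical; strings stand in for character missing-data codes and the system missing-data code, and None stands for a missing-data result.

```python
# Hypothetical sketch of the include-missing-data option; not SDA code.
def double_age(age, md_codes=frozenset({99}), include_md=False):
    if isinstance(age, str):
        # Character codes ('D', 'R') and the system missing-data code have
        # no numeric value, so the option cannot make them usable.
        return None
    if age in md_codes and not include_md:
        return None
    return 2 * age

print(double_age(99), double_age(99, include_md=True), double_age("D", include_md=True))
# None 198 None
```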


Output code to assign if no valid output value

If there is no valid value that can be assigned to the new variable for a specific case, that case will ordinarily be assigned the system missing-data value. This situation usually occurs when one or more variables in the expression have missing-data for that case.

If you prefer to assign your own missing-data code to such cases, select this option, AND ALSO list one or more values as missing-data values in the optional specifications for the new variable. Then the cases with no valid output value will be assigned the first missing-data code you specified for the new variable.


Number of decimal places for rounding

The value of the final result from calculating the expression for each case can be rounded to a specified number of decimal places. You may specify from 0 to 6 decimal places for rounding. The default is NOT to round the result, but rather to store all the decimal places resulting from the computation (within the limits of a double-precision number).

Intermediate results are never rounded. All calculations are carried out using double-precision numbers. If rounding is requested, only the final result is rounded to the specified number of decimal places.
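
The point about intermediate results can be seen in a small Python sketch, which compares rounding only the final result with what would happen if an intermediate quotient were rounded first. The function percent and its arguments are made up for the illustration.

```python
# Hypothetical sketch of final-result rounding; not SDA code.
def percent(part, whole, decimals=None):
    result = part / whole * 100  # intermediate quotient kept at full precision
    return result if decimals is None else round(result, decimals)

final_only = percent(2, 3, decimals=2)

# For contrast only: rounding the intermediate quotient changes the answer.
rounded_early = round(2 / 3, 2) * 100

print(final_only)  # 66.67
```

Here rounding the quotient early would give a result of about 67 instead of 66.67, which is why only the final value is rounded.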


Other optional specifications

Several optional specifications are common to all variable-generating programs. Those specific to the COMPUTE program are listed here.

Category text and labels for computed values

You can assign category text of any length to the individual output values that result from the computation. In the text box, put each output code and its corresponding text on a single line. For example:

Value Label
1 Lowest
5 Middle of the range
10 Highest

If the text for a category is long, you can also assign an abbreviated version that will be used as the category label in crosstabulations and other similar output. Put the desired abbreviation in brackets before or after the long text for a category. For example:

Value Label
1 Minimum value expected from the computation[Lowest]
10 Highest value expected from the computation [Highest]

(Note that the text box will only show about 20 characters at a time and will scroll, unlike this example, which shows a larger box for purposes of clarity.)


Seed for generating random numbers

The random number functions ordinarily use the system clock and the process ID to begin random number generation. You can specify the seed with this option.

One reason to specify the seed might be to generate the same series of random numbers on repeated runs for diagnostic or instructional purposes.
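
The general idea of a seed can be demonstrated with Python's standard random module. This says nothing about the generator SDA itself uses; it only shows that a fixed seed reproduces the same series, using Python calls that loosely correspond to UNIFORM, DUNIFORM, and NORMAL. The function draws is a made-up name.

```python
# Illustration of seeding with Python's own generator; not SDA code.
import random

def draws(seed=None, n=3):
    rng = random.Random(seed)  # seed=None lets Python choose its own starting point
    return [
        (rng.uniform(0, 1), rng.randint(1, 6), rng.gauss(50, 10))
        for _ in range(n)
    ]

print(draws(seed=12345) == draws(seed=12345))  # True
```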


Features Common to All Variable-Creating Programs

Replace the variable, if requested

If the name of the new variable to be created matches the name of a variable that already exists (in the directory for new variables), that variable can be replaced by the new one, provided that the option to replace that variable is in effect. If the option NOT to replace the variable is in effect, the program will send a message that the variable already exists and that you should select the 'Replace' option if you want to overwrite it. By default, the option selected is NOT to overwrite an existing variable.


Optional specifications for new variables

Label for the variable
A one-line descriptive label for the new variable can be specified when the variable is created. This label will be reported whenever the new variable is used by an SDA analysis program.

Missing-data codes
If you want one or more values of the new variable to be considered invalid or missing-data codes, you can specify them when the new variable is created. List individual values, or ranges of values.

For example: 8, 9, 91-99

If you specify one or more missing-data codes, the first such code specified can be used to assign a value on the new variable for those cases which do not have a valid outcome code. Cases having a missing-data code on the new variable are ordinarily excluded from analyses involving that variable.

Minimum valid value
All values less than this (optional) code value are considered invalid. For example, if the 'minimum' is given as '1', then zero and all negative values are considered invalid and will ordinarily be excluded from analysis results.

Maximum valid value
All values greater than this (optional) code value are considered invalid. For example, if the 'maximum' is given as '5', then all numbers greater than 5 are considered invalid and will ordinarily be excluded from analysis results.
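
Taken together, the missing-data codes and the minimum and maximum define which values count as valid. The Python sketch below shows one way to express that check; parse_md_codes and is_valid are hypothetical names, and the simple parsing assumes non-negative codes written as in the example '8, 9, 91-99'.

```python
# Hypothetical sketch of the validity rules; not SDA code.
def parse_md_codes(text):
    """Turn '8, 9, 91-99' into a list of inclusive (low, high) pairs."""
    ranges = []
    for part in text.split(","):
        part = part.strip()
        if "-" in part:
            lo, hi = part.split("-")
            ranges.append((float(lo), float(hi)))
        else:
            ranges.append((float(part), float(part)))
    return ranges

def is_valid(value, md_ranges, minimum=None, maximum=None):
    if any(lo <= value <= hi for lo, hi in md_ranges):
        return False
    if minimum is not None and value < minimum:
        return False
    if maximum is not None and value > maximum:
        return False
    return True

md = parse_md_codes("8, 9, 91-99")
print([is_valid(v, md, minimum=1) for v in (0, 1, 5, 8, 95)])
# [False, True, True, False, False]
```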

Descriptive text for the variable
Text describing the new variable can be entered and stored when the new variable is created. This text is retrievable whenever the variable is used in an analysis program.

The rules used by RECODE or COMPUTE to create the new variable are also included in the descriptive text for the variable.


Color coding on the output
After creating a new variable, the program sends back to your browser some information including, usually, the frequency distribution of the new variable. The coloring of the headings can be suppressed if desired; this may be useful if you intend to print the output on a black and white printer.

List of Variables Created by Recode or Compute

The variable list includes the following features: