Online Help for Creating New Variables - SDA 4.1

SDA Recode Program

This program recodes one or more existing numeric variables into a new SDA variable.

Steps to take

Assign a name to the new variable to be created
This is the name you will use to include the new variable in a subsequent analysis.

Specify the input variables
The input variables are the existing variables (one or more) used to create the new variable. Use the name of each variable as given in the documentation for this study.

Specify the recoding rules
These recoding rules specify how to create the new variable out of the existing variable(s).

Provide a label and other specifications
Various optional specifications can be provided for the new variable.

Start recoding
After specifying all variables and options, press the Start Recoding button.

REQUIRED variable names

Name for the new variable
The new variable to be created will be stored under the name specified here. If a variable of the same name already exists in the location for new variables, it will NOT be replaced, unless the option to replace it is selected.

If a variable of the same name already exists in the main dataset for the study, the variable cannot be created. Choose another name for your variable.

Variable names:

Name(s) of existing variables to use for the recode
The variable(s) specified here will be recoded into the new variable. From one to six existing variables may be specified. Note: only numeric variables can be used in recoding.

For example, to recode the variables 'age' and 'sex', enter the names of those variables in the text box, separated by spaces and/or a comma (for example: age, sex).


Recoding rules and examples

The recoding rules specify how the values of the input variable(s) for each case are to be converted into the values of the new variable. The basic rules are very simple, but certain options can make the specifications a little more complex.

Basic recoding - One input variable
To combine the values of the variable 'age' into three categories numbered 1-3, and create a new variable named 'age3', first specify 'age3' as the variable to be created. Then specify 'age' as the input variable.

You can then specify the recoding rules as:

Value  Label    age
1      Younger  0-34
2      Middle   35-64
3      Older    65-99

The "Value" column specifies the category values of the new variable. Each category must be a single numeric value.

The "Label" column specifies the label for each category. This is the text that will be displayed in a table using the new variable. The label is optional, but it is helpful in most cases, especially when there is no obvious ordering of the categories. The input box for labels only displays 16 characters at a time, but you can actually enter up to 250 characters for each label. If you enter a very long description for a category, you may also want to specify an abbreviated label for the category, to be used when running tables. Such optional labels are specified in brackets, after the longer text. For example: Respondent agrees with all the major policies listed [Agrees with all]

The next column(s) specify which values of the input variable(s) should be included in each category of the new variable. The specified codes of the input variable (here 'age') can consist of single code values, ranges, or a combination of many values and/or ranges (separated by commas). The input box for entering these code values and ranges only displays 8 characters at a time, but you can actually enter up to 80 characters in each box. (You can use the arrow keys to scroll back and forth in the box.)

In addition to specifying numeric code values and ranges, you can use a few other special symbols that are useful in applying recode rules:

Symbol  Meaning
*       Matches any VALID value. If used in a range, matches the lowest or highest VALID value.
**      Matches ANY value, including missing data (both user-defined and system-missing). If used in a range, matches the lowest or highest value, including missing data.
$.      Matches the system-missing code. Note the period after the dollar sign.

Note that an asterisk or double asterisk, when NOT used in a range, cannot be combined with other specifications in a recode rule. (All other specifications can be combined in a recode rule, but must be separated by commas.)

Basic recoding - Multiple input variables
It is possible to recode combinations of multiple input variables (up to six) into a new variable.

For example, to combine 'age' and 'sex' into a new variable named 'agesex' with four categories numbered 1-4, first specify 'agesex' as the variable to be created. Then specify both 'age' and 'sex' as input variables.

Then (assuming 'sex' is coded 1=Male, 2=Female) you can specify the recoding rules as:

Value  Label          age    sex
1      Younger Men    0-40   1
2      Younger Women  0-40   2
3      Older Men      41-99  1
4      Older Women    41-99  2

These recoding rules are easily extended to handle more than two input variables. You can also add more rows for recoding rules by clicking on the button labeled 'Add empty row to table'.
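The two-variable rules above can be read as a lookup in which a case receives the value of the first row whose 'age' range and 'sex' code both match. The following Python sketch illustrates that reading only. The names RULES and recode_agesex are hypothetical and are not part of SDA, and returning None for an unmatched case is merely a stand-in for whichever option is chosen for unspecified combinations.

```python
# Illustrative sketch only; RULES and recode_agesex are hypothetical names,
# not SDA identifiers.
# Each rule: (new value, label, (lowest age, highest age), sex code)
RULES = [
    (1, "Younger Men",   (0, 40),  1),
    (2, "Younger Women", (0, 40),  2),
    (3, "Older Men",     (41, 99), 1),
    (4, "Older Women",   (41, 99), 2),
]

def recode_agesex(age, sex):
    """Return (value, label) of the first rule matching both inputs, else None."""
    for value, label, (low, high), sex_code in RULES:
        if low <= age <= high and sex == sex_code:
            return value, label
    return None  # stand-in for a case that matches no rule

print(recode_agesex(25, 2))
print(recode_agesex(41, 1))
print(recode_agesex(30, 3))  # sex code 3 is not covered by any rule
```

Adding a third input variable would simply add one more column to each rule and one more condition to the test.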

Open ranges using an asterisk
The recoding rules ignore the missing data status of numeric codes on the input variable, if they are specified explicitly either as a single value or in a range. In order to avoid inadvertently including a missing data value in a range, it is often useful to specify some ranges with an asterisk. For example, if the values 0 and 99 were missing data codes for the variable "age", but included in explicit ranges as in the examples above, then cases with these missing data codes would be mixed with cases with valid codes in the output categories for the new variable. To avoid this problem, you can specify an open range with an asterisk (*). That symbol matches the lowest or highest VALID code in the data for that variable. For example, the 'age' recode could be specified as:

Value  Label    age
1      Younger  *-34
2      Middle   35-64
3      Older    65-*

Using this method, all valid age values up through 34 would go into the first recoded group. And all valid age values of 65 or older would go into the third group. All remaining cases with missing data values would be automatically assigned the system missing value in the new variable. If you want to recode the missing data codes (0 and 99) into a numeric category in the new variable, then use the following recode rules:

Value  Label         age
1      Younger       *-34
2      Middle        35-64
3      Older         65-*
9      Missing data  0,99

Note that this value of 9 on the new variable will be a valid code unless you set 9 to be a missing data code or out of the valid range. (See the help on these options.)
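The difference between explicit ranges and open ranges can be illustrated with a small Python sketch. It assumes, as in the text, that 0 and 99 are missing data codes for 'age', and it resolves an asterisk to the lowest or highest valid value found in the data. The helper names are hypothetical, and None stands in for the system missing value.

```python
# Illustrative sketch only; these helper names are hypothetical, not SDA identifiers.
MISSING_CODES = {0, 99}   # assumed missing-data codes for 'age', as in the text

def resolve_range(spec, valid_values):
    """Turn a range such as '*-34' or '65-*' into numeric (low, high) bounds.

    An asterisk resolves to the lowest or highest VALID value in the data.
    Negative codes are not handled in this sketch.
    """
    low, high = spec.split("-")
    low = min(valid_values) if low == "*" else float(low)
    high = max(valid_values) if high == "*" else float(high)
    return low, high

def recode(age, rules, valid_values):
    """Return the value of the first matching rule, or None (system-missing stand-in)."""
    for value, spec in rules:
        low, high = resolve_range(spec, valid_values)
        if low <= age <= high:
            return value
    return None

ages = [0, 18, 34, 35, 64, 65, 89, 99]
valid = [a for a in ages if a not in MISSING_CODES]

explicit = [(1, "0-34"), (2, "35-64"), (3, "65-99")]
open_ended = [(1, "*-34"), (2, "35-64"), (3, "65-*")]

explicit_result = [recode(a, explicit, valid) for a in ages]
open_result = [recode(a, open_ended, valid) for a in ages]
print(explicit_result)  # the missing codes 0 and 99 are mixed into categories 1 and 3
print(open_result)      # the missing codes 0 and 99 match no rule
```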

Treatment of character missing data and system missing data codes
There are two ways to recode character missing data codes and the system missing data code into a specific category of the new variable.

1. Mention the input missing data code explicitly

A character string that has been defined as character missing data on a numeric input variable can be assigned to one of the categories of the recoded variable by specifying that character missing data code in the recoding rules. Similarly, the system missing data code can be recoded by referring to it as `$.' in a recoding rule. (Note the period after the dollar sign.)

In this example, the characters `D' and `R' have been defined as missing data values for the variable `age', to indicate "Don't know" and "Refused." Also, some cases had a blank input field for 'age' in the original data file and were assigned the system missing data code. Those missing data codes in the original variable can be recoded, respectively, into the NUMERIC codes 7, 8, and 9 as follows:

Value  Label       age
1      Younger     *-34
2      Middle      35-64
3      Older       65-*
7      Don't know  D
8      Refused     R
9      No data     $.

2. Use double asterisks (**)

Double asterisks match ANY code, including all missing data codes. They can be used to assign numeric missing data codes, character missing data codes, and the system missing data code to a numeric value on the new variable.

For example, to recode ALL the rest of the codes of the variable `age' (not previously mentioned in a recoding rule) into the category `9' on the new variable, you would specify:

Value  Label             age
1      Younger           *-34
2      Middle            35-64
3      Older             65-*
9      All missing data  **

If a case matches more than one recode rule, the first rule encountered will apply. In this example the last recode rule has '**' for the input variables -- which matches any value. Any cases not covered by a rule higher up in the recode rules will receive the value 9.
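The first-match behavior, together with the special symbols, can be sketched in Python as follows. This is a minimal illustration under stated assumptions, not the SDA implementation: the function names are hypothetical, the string "$." stands in for the system missing data code, valid ages are assumed to run from 18 to 89, and a lone single asterisk is not handled.

```python
# Illustrative sketch only; names are hypothetical, not SDA identifiers.
SYSMIS = "$."            # stand-in for the system missing data code
VALID = range(18, 90)    # assumed valid ages: 18 through 89

def term_matches(term, code, valid):
    """Check one comma-separated term of a rule against a case's code."""
    if isinstance(code, str):          # character missing code or SYSMIS
        return term == code
    if "-" in term:                    # a range such as '*-34' or '35-64'
        low, high = term.split("-")
        low = min(valid) if low == "*" else float(low)
        high = max(valid) if high == "*" else float(high)
        return low <= code <= high
    try:
        return float(term) == code     # a single numeric code
    except ValueError:                 # the term is a character code such as 'D'
        return False

def recode(code, rules, valid=VALID):
    """Return the value of the FIRST rule that matches; later rules are ignored."""
    for value, spec in rules:
        if spec == "**":               # matches anything, including missing data
            return value
        if any(term_matches(t.strip(), code, valid) for t in spec.split(",")):
            return value
    return SYSMIS

rules = [(1, "*-34"), (2, "35-64"), (3, "65-*"), (7, "D"), (8, "R"), (9, "**")]
for code in [20, 70, "D", "R", SYSMIS, 99]:
    print(code, "->", recode(code, rules))
```

Because the first matching rule wins, moving the '**' rule to the top of the list would send every case to category 9.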

A more complex example
Finally, here is a more complex example of specifying recode rules. Note that it is possible to have more than one rule for a single output category of the new variable:

Value  Label       var1      var2
1      Group 1     1,3-5,7   1-10
2      Group 2     8-10,12   100
2                  41,45,55  51-90
9      Unassigned  **        **

Here the output code 2 has two rules -- which are listed individually because they cannot be combined into one rule. Note that a label only has to be specified once for an output category, even if that category has multiple recode rules.


Other optional specifications

Several optional specifications are common to all variable generating programs. Those specific to the RECODE program are listed here.

What to do with unspecified combinations of input variables (if any)

If a case does not match any of the recode rules, the new output variable can take on one of several values, depending on the options you choose:

SDA Compute Program

This program creates a new SDA variable as a result of a computation based on one or more existing numeric variables.

Steps to take

Enter the expression
The new variable will be created as the result of applying an algebraic expression to one or more pre-existing variables (or by generating random distributions).
A simple example: newvar = 2 * oldvar

Select computation options
After specifying the expression, you can modify certain rules applied to the construction of the new variable. These computation options include treatment of missing data codes on the input variable(s), how to code missing data on the output variable, and the number of decimal places to store.

Select optional specifications
Using these other options, you may specify labels for the new variable and also specify ranges or codes to be considered valid or invalid.

Start computing
After specifying the expression and options, press the Start Computing button.

EXPRESSION to define the new variable

The expression is of the general form:
newvar = expression
The name of one (and only one) new variable must appear on the left-hand side of the equal sign. If a variable of the same name already exists in the location for new variables, it will NOT be replaced, unless the option to replace it is selected.

If a variable of the same name already exists in the main dataset for the study, the variable cannot be created. Choose another name for your variable.

Variable names:


Basic expressions - One line

Basic expressions are of the form:

newvar = spend + spend2 + spend3
  (or)
newvar = sum(spend, spend2, spend3)
                    
Only one new variable can be created at a time. And the name of a new variable can only appear once on the left side of an equal sign (except in an IF-statement).

Note that the two examples above are NOT equivalent. The 'sum' function treats missing data codes differently than just using '+'. Using '+' will usually generate a missing data code on the new variable for a case unless ALL of the input variables have valid codes. But the 'sum' function can skip over variables with missing data codes and just add up the valid codes on the specified variables.
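The difference between '+' and 'sum' can be sketched in Python. Here None stands in for a missing data code, and the function names plus and sum_fn are hypothetical. The sketch assumes the default behavior described in this help, in which 'sum' needs at least one valid input.

```python
# Illustrative sketch only; plus and sum_fn are hypothetical names.
MISSING = None  # stand-in for a missing data code

def plus(*values):
    """'+' semantics: any missing input makes the result missing."""
    if any(v is MISSING for v in values):
        return MISSING
    return sum(values)

def sum_fn(*values):
    """'sum' semantics: skip missing inputs and add up the valid ones."""
    valid = [v for v in values if v is not MISSING]
    return sum(valid) if valid else MISSING

spend, spend2, spend3 = 10, MISSING, 5
print(plus(spend, spend2, spend3))    # missing, because spend2 is missing
print(sum_fn(spend, spend2, spend3))  # adds only the two valid values
```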

Descriptions of all of the operators, functions, and options that can be used with the COMPUTE program are given next.


Treatment of missing data in expressions

If ANY input variable in an expression has a missing data code for a particular case, the output variable being created will generally be assigned a missing data code. By default the case will be assigned the system missing data code. However, if the user has designated some specific value as the missing data code for the output variable, the case will be assigned that value.

This automatic assignment of an output missing data code does not hold if the user does one of the following:

See the documentation on each of those functions or options, to see how they treat missing data in the input variables.

IF-statements, ELSEIF-statements, and ELSE-statements are evaluated in order. When one of those statements returns a missing data value for a particular case, no further IF/ELSEIF/ELSE statements are evaluated, and the output variable for that case is assigned the missing data code.


Expressions with IF / ELSE IF / ELSE
if (var1 eq 1)
   newvar = var3
else if (var1 eq 2)      [A space after `else' is optional]
   newvar = var4
else
   newvar = -1
                

The `ELSE IF' part can be repeated; `ELSE' can be used only once; both parts are optional.

The expressions `IF', `ELSE IF', and `ELSE' should begin on a new line. Note that either upper or lower case can be used for `IF' and `ELSE'.

If no `ELSE' part is used, it is possible that some cases will not meet any of the conditions; the new variable will then be set to the specified missing data code for those cases.

There is an implied `ENDIF' at the end of the expression. The use of `ENDIF' is optional unless there are nested IF-statements (as shown below).
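For readers more familiar with a general-purpose language, the IF / ELSE IF / ELSE example above corresponds roughly to the following Python sketch. The function names and the use of None for the missing data code are assumptions made for illustration. The second function drops the ELSE part to show how a case that meets no condition ends up with the missing data code.

```python
# Illustrative sketch only; function names are hypothetical.
MD = None  # stand-in for the missing data code of the new variable

def newvar_with_else(var1, var3, var4):
    if var1 is MD:
        return MD          # a test on missing data yields the missing data code
    if var1 == 1:
        return var3
    elif var1 == 2:
        return var4
    else:
        return -1

def newvar_without_else(var1, var3, var4):
    if var1 is MD:
        return MD
    if var1 == 1:
        return var3
    elif var1 == 2:
        return var4
    return MD              # no condition met and no ELSE part

print(newvar_with_else(3, 10, 20))
print(newvar_without_else(3, 10, 20))
```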


Logical operators to use with If / Else if
OPERATOR                           EXAMPLES

   EQ     equal to            if (x eq y) newvar = 1
   NE     not equal to        if (x ne y) newvar = 1
   GT     greater than        if (x gt y) newvar = 1
   GE     greater or equal    if (x ge y) newvar = 1
   LT     less than           if (x lt y) newvar = 1
   LE     less or equal       if (x le y) newvar = 1

   AND    both are true       if (x gt y AND x gt z) newvar = 1
   OR     either is true      if (x gt y OR  x gt z) newvar = 1

   These operators can be in upper or lower case.

                

Nested IF-statements

IF-statements can be nested. In such cases, however, it is necessary to use `ENDIF' to eliminate ambiguity. The following example illustrates how this can be done:


IF ( oldvar1 eq 1 )
    IF ( oldvar2 lt 100 )
       newvar = 1
    ELSEIF ( oldvar3 eq 2 )
       newvar = 2
    ENDIF
ELSE
    IF ( oldvar4 gt 10 )
       newvar = 3
    ENDIF
ENDIF

                
There can only be one IF-statement at the top level of the nested expression. The example above has more than one IF-statement, but all except one are nested within the top-level IF/ELSE expression.

Notice how the use of `ENDIF' removes ambiguity about what part goes with what. It is required that `ENDIF' be used for the nested portion of complex IF-statements. The very last `ENDIF', however, could have been omitted.


Use of temporary variables
$temp1 = var1 + var2
$temp2 = var3 / var4
newvar = $temp1 / $temp2
                

Variables with names that begin with `$' only exist while COMPUTE is running. They are not available for analysis after COMPUTE is finished creating the new variable.

A temporary variable cannot be used in the test portion of an IF-expression. For example, it is NOT legal to use:

if ($temp1 eq 1)         (NOT legal)
                
However, a temporary variable CAN appear on the left hand side of an equal sign within an IF-statement. For example, the following is legal:
if (age lt 40) $temp1 = 1
                

Arithmetic operators
+ - * /       Addition, subtraction, multiplication, division 

^             Power -- for example: var1^2 (var1 squared) 

-var1         Negative of var1 (unary -) 

( )           Parentheses are used to alter (or clarify) the
                usual order of evaluation. 
                
Order in which the various operators are applied:
  1. functions
  2. unary -
  3. ^
  4. * and /
  5. + and -
Operators at the same level are applied from left to right.
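The order of evaluation listed above differs from some programming languages in one respect: read literally, unary minus is applied before '^'. The Python sketch below writes out that reading with explicit parentheses and contrasts it with Python's own default. It is an interpretation of the list, not a statement about SDA internals; when in doubt, use parentheses in the expression.

```python
# Sketch of the listed order of evaluation, written out with explicit parentheses.
var1 = 3

# Python applies ** before unary minus, so this is -(3 ** 2).
python_reading = -var1 ** 2

# Reading the list above literally, unary minus (level 2) comes before ^ (level 3).
listed_reading = (-var1) ** 2

# * and / share a level and are applied left to right: (8 / 4) * 2.
same_level = 8 / 4 * 2

# ^ before * before +: 2 + (3 * (4 ^ 2)).
mixed = 2 + 3 * 4 ** 2

print(python_reading, listed_reading, same_level, mixed)
```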

Arithmetic functions (can be in upper or lower case)

ABS(x)               Absolute value of x 
EXP(x)               Exponential function (antilog), e^x 
LOG(x) or LN(x)      Natural logarithm 
LG10(x) or LOG10(x)  Logarithm - base 10 
MOD(x,a)             Modulus (remainder) of `x' divided by `a' 
                      (e.g., mod(5,2) equals 1) 
RND(x) or ROUND(x)   Round off 
SQRT(x)              Square root 
TRUNC(x)             Truncate; the integer part of x 

                

Summaries of variables (can be in upper or lower case)

MEAN.n(x,y,...)    Mean of the given variables 
SUM.n (x,y,...)    Sum of the given variables 
MIN.n (x,y,...)    Minimum value of the given variables 
MAX.n (x,y,...)    Maximum value of the given variables 

                

Note that the `.n' part of the function name is optional. If used, it tells the function that at least `n' of the given variables must have valid data for a case; otherwise the function returns the missing data code. The default value for `n' is 1.

For example, `mean(var1,var2,var3)' will generate the mean of the three variables, even if only one of the three has a valid code. On the other hand `mean.2(var1,var2,var3)' will generate a mean for a specific case only if at least two variables have valid codes on that case.
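The effect of the '.n' suffix can be sketched in Python as a threshold on the number of valid inputs. The function name mean_n is hypothetical, and None stands in for a missing data code.

```python
# Illustrative sketch only; mean_n is a hypothetical name.
def mean_n(values, n=1):
    """Mean of the valid values, or None unless at least n values are valid."""
    valid = [v for v in values if v is not None]
    if len(valid) < n:
        return None                    # stand-in for the missing data code
    return sum(valid) / len(valid)

case = [4, None, None]                 # only var1 has a valid code
print(mean_n(case))                    # like mean(var1,var2,var3)
print(mean_n(case, n=2))               # like mean.2(var1,var2,var3)
print(mean_n([4, 6, None], n=2))
```

The same threshold idea would apply to SUM.n, MIN.n, and MAX.n, with only the final aggregation changed.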


Other summaries (can be in upper or lower case)

COUNT(x,y(a-b))   Number of variables with values between a and b
                    (can specify different ranges for each var;
                    missing data or out-of-range codes are not
                    counted unless include-MD option is selected)

CUM(x)            Cumulate the value of `x' from one case to
                    the next (`x' can be a variable or a constant;
                    if `x' is a missing data value, the cumulation
                    from the previous case is carried over)

MISSING(x,y,...)  Number of variables with missing data or
                    out-of-range codes
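A rough Python sketch of these three summaries follows. The function names are hypothetical, None stands in for a missing data code, and the sketch assumes the option to include missing data is NOT selected. COUNT and MISSING work across the variables of one case, while CUM works down the cases of one variable.

```python
# Illustrative sketch only; function names are hypothetical, not SDA identifiers.
def count_in_ranges(pairs):
    """COUNT: number of variables whose value lies in its own (a, b) range."""
    return sum(1 for value, (a, b) in pairs
               if value is not None and a <= value <= b)

def n_missing(values):
    """MISSING: number of variables with a missing data code."""
    return sum(1 for v in values if v is None)

def cum(column):
    """CUM: running total down the cases; a missing value carries the total over."""
    total, out = 0, []
    for v in column:
        if v is not None:
            total += v
        out.append(total)
    return out

print(count_in_ranges([(3, (1, 5)), (7, (1, 5)), (None, (1, 5))]))
print(n_missing([3, None, None]))
print(cum([1, 2, None, 4]))
```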

                

Random Distribution Functions (can be in upper or lower case)
 

UNIFORM(x,y)        Uniform distribution between x and y
                      (x and y can be constants or variables)
DUNIFORM(x,y)       Discrete uniform distribution between x and y
                      (result is always a whole number)
NORMAL(x,y)         Normal distribution with mean=x, sd=y

                

Trigonometric Functions (can be in upper or lower case)

SIN(x), COS(x)            Sine and cosine  (x is in radians)
ARSIN(y) or ARCSIN(y)     Arcsine 
ARTAN(y) or ARCTAN(y)     Arctangent 

                


Computation Options

Include NUMERIC missing data values in computations

If a pre-existing variable named in an expression has a value for a particular case that is designated as missing data or as outside the range of valid values, the new variable for that case will ordinarily be assigned a missing data code.

If you select this option to include missing data values in computations, the program will consider numeric missing data values as valid, for purposes of generating the new variable.

For example, the expression 'newvar = 2 * age' would ordinarily result in a missing data value for the new variable if the variable 'age' had the value '99' which was designated as a missing data code (to indicate a refusal). If this option is selected, the new variable in this case would receive a valid value of (2 * 99 =) 198.

Note that this option will not override character missing data values (such as 'D' or 'R'), nor will it override the system missing data code. Such missing data values do not have any numeric value that could be used in a computation.
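The effect of this option on the expression 'newvar = 2 * age' can be sketched in Python. The function name, the parameter names, and the use of None for the system missing data code are assumptions made for illustration; character missing data codes are represented as strings.

```python
# Illustrative sketch only; names are hypothetical, not SDA identifiers.
NUMERIC_MD = frozenset({99})   # assumed numeric missing data code for 'age'

def double_age(age, include_numeric_md=False):
    """Compute 2 * age, honoring the 'include numeric missing data' option."""
    if age is None or isinstance(age, str):
        return None            # system missing or character MD: never usable
    if age in NUMERIC_MD and not include_numeric_md:
        return None            # numeric MD is excluded by default
    return 2 * age

print(double_age(99))                            # missing by default
print(double_age(99, include_numeric_md=True))   # treated as a valid 99
print(double_age("D", include_numeric_md=True))  # still missing
```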


Output code to assign if no valid output value

If there is no valid value that can be assigned to the new variable for a specific case, that case will ordinarily be assigned the system missing data value. This situation usually occurs when one or more variables in the expression have missing data for that case.

If you prefer to assign your own missing data code to such cases, select this option, AND ALSO list one or more values as missing data values in the optional specifications for the new variable. Then the cases with no valid output value will be assigned the first missing data code you specified for the new variable.


Number of decimal places for rounding

The value of the final result from calculating the expression for each case can be rounded to a specified number of decimal places. You may specify from 0 to 6 decimal places for rounding. The default is NOT to round the result, but rather to store all the decimal places resulting from the computation (within the limits of a double-precision number).

Intermediate results are never rounded. All calculations are carried out using double-precision numbers. If rounding is requested, only the final result is rounded to the specified number of decimal places.
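Why it matters that only the final result is rounded can be seen in a short Python sketch. It compares rounding an intermediate value with rounding only at the end. Python's built-in round is used here as a stand-in; its exact rounding rule may differ from the one SDA applies.

```python
# Illustrative sketch: rounding an intermediate value versus rounding only the result.
third = 1 / 3

# Rounding the intermediate value to 2 places loses precision before the multiply.
early = round(round(third, 2) * 3, 2)

# Carrying full double precision and rounding only the final result,
# as described above.
late = round(third * 3, 2)

print(early, late)
```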


Other optional specifications

Several optional specifications are common to all variable generating programs. Those specific to the COMPUTE program are listed here.

Category text and labels for computed values

You can assign category text of any length to the individual output values that result from the computation. In the text box, put each output code and its corresponding text on a single line. For example:

Value Label
1 Lowest
5 Middle of the range
10 Highest

If the text for a category is long, you can also assign an abbreviated version that will be used as the category label in crosstabulations and other similar output. Put the desired abbreviation in brackets before or after the long text for a category. For example:

Value Label
1 Minimum value expected from the computation[Lowest]
10 Highest value expected from the computation [Highest]

(Note that the text box will only show about 20 characters at a time and will scroll, unlike this example, which shows a larger box for purposes of clarity.)
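The line format described above (an output code, then the category text, with an optional bracketed abbreviation before or after the text) can be parsed along the following lines. This Python sketch is only an illustration of the format; the function name is hypothetical, and falling back to the full text when no abbreviation is given is an assumption.

```python
import re

# Illustrative sketch only; parse_category_line is a hypothetical name.
def parse_category_line(line):
    """Split 'code text [abbrev]' into (code, text, label).

    The label is the bracketed abbreviation if present, else the full text.
    """
    code, _, rest = line.strip().partition(" ")
    match = re.search(r"\[(.*?)\]", rest)
    if match:
        label = match.group(1).strip()
        text = (rest[:match.start()] + rest[match.end():]).strip()
    else:
        text = rest.strip()
        label = text
    return float(code), text, label

for line in [
    "1 Minimum value expected from the computation[Lowest]",
    "5 Middle of the range",
    "10 Highest value expected from the computation [Highest]",
]:
    print(parse_category_line(line))
```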


Seed for generating random numbers

The random number functions ordinarily use the system clock to begin random number generation. You can specify the starting point or seed with this option.

One reason to specify the seed might be to generate the same series of random numbers on repeated runs for diagnostic or instructional purposes. Note, however, that the same seed might not generate the same random numbers on different platforms (for example, on Linux versus Windows).
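The role of the seed can be illustrated with Python's standard random module, whose uniform, randint, and gauss methods are loose analogues of UNIFORM, DUNIFORM, and NORMAL. This is an analogy only: Python's generator is not the one SDA uses, so the numbers themselves will not match SDA output, and the function name draw is hypothetical.

```python
import random

# Illustrative analogy only; draw is a hypothetical name.
def draw(seed, n=3):
    """Return n (uniform, discrete uniform, normal) draws from a seeded generator."""
    rng = random.Random(seed)   # with seed=None Python picks its own starting point
    return [
        (rng.uniform(0, 1),     # like UNIFORM(0,1)
         rng.randint(1, 6),     # like DUNIFORM(1,6): always a whole number
         rng.gauss(50, 10))     # like NORMAL(50,10): mean=50, sd=10
        for _ in range(n)
    ]

first = draw(12345)
second = draw(12345)
print(first == second)          # the same seed reproduces the same series
```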


Features Common to All Variable-Creating Programs

Replace the variable, if requested

If the name of the new variable to be created matches the name of a variable that already exists (in the directory for new variables), that variable can be replaced by the new one, provided that the option to replace that variable is in effect. If the option NOT to replace the variable is in effect, the program will send a message that the variable already exists and that you should select the 'Replace' option if you want to overwrite it. If you are creating variables in a public workspace (shared with other users) please be kind: replace a variable only if you created it.


Optional specifications for new variables

Label for the variable
A one-line descriptive label for the new variable can be specified when the variable is created. This label will be reported whenever the new variable is used by an SDA analysis program.

Missing data codes
If you want one or more values of the new variable to be considered invalid or missing data codes, you can specify them when the new variable is created. List individual values, or ranges of values. Values and ranges must be separated by a comma.

For example: 8, 9, 91-99

If you specify one or more missing data codes, the first such code specified can be used to assign a value on the new variable for those cases which do not have a valid outcome code. Cases having a missing data code on the new variable are ordinarily excluded from analyses involving that variable.

Minimum valid value
All values less than this (optional) code value are considered invalid. For example, if the 'minimum' is given as '1', then zero and all negative values are considered invalid and will ordinarily be excluded from analysis results.

Maximum valid value
All values greater than this (optional) code value are considered invalid. For example, if the 'maximum' is given as '5', then all numbers greater than 5 are considered invalid and will ordinarily be excluded from analysis results.
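How the missing data codes, the minimum, and the maximum jointly determine whether a value is valid can be sketched in Python. The function names are hypothetical, and the sketch handles only non-negative codes in a specification such as '8, 9, 91-99'.

```python
# Illustrative sketch only; function names are hypothetical, not SDA identifiers.
def parse_md_spec(spec):
    """Parse '8, 9, 91-99' into a list of (low, high) ranges (non-negative codes only)."""
    ranges = []
    for term in spec.split(","):
        low, sep, high = term.strip().partition("-")
        ranges.append((float(low), float(high) if sep else float(low)))
    return ranges

def is_valid(value, md_ranges, minimum=None, maximum=None):
    """A value is valid unless it is a missing data code or outside [minimum, maximum]."""
    if any(low <= value <= high for low, high in md_ranges):
        return False
    if minimum is not None and value < minimum:
        return False
    if maximum is not None and value > maximum:
        return False
    return True

md = parse_md_spec("8, 9, 91-99")
print(md)
print(is_valid(5, md, minimum=1, maximum=5))
print(is_valid(0, md, minimum=1))
print(is_valid(95, md))
```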

Descriptive text for the variable
Text describing the new variable can be entered and stored when the new variable is created. This text is retrievable whenever the variable is used in an analysis program.

The rules used by RECODE or COMPUTE to create the new variable are also included in the descriptive text for the variable.


Color coding on the output
After creating a new variable, the program sends back to your browser some information including, usually, the frequency distribution of the new variable. The coloring of the headings can be suppressed if desired; this may be useful if you intend to print the output on a black and white printer.

List of Variables Created by Recode or Compute

The variable list includes the following features: