This program recodes one or more existing numeric variables into a new SDA variable.
If a variable of the same name already exists in the main dataset for the study, the variable cannot be created. Choose another name for your variable.
Variable names:
For example, if you want to recode the variables 'age' and 'sex', you enter the names of those variables into the text box, separated by spaces and/or a comma:
Then you specify 'age' as the input variable:
And then you can specify the recoding rules as:
Value | Label | age |
---|---|---|
1 | Younger | 0-34 |
2 | Middle | 35-64 |
3 | Older | 65-99 |
The "Value" column specifies the category values of the new variable. Each category must be a single numeric value.
The "Label" column specifies the label for each category. This is the text that will be displayed in a table using the new variable. The label is optional, but it is helpful in most cases, especially when there is no obvious ordering of the categories. The input box for labels only displays 16 characters at a time, but you can actually enter up to 250 characters for each label. If you enter a very long description for a category, you may also want to specify an abbreviated label for the category, to be used when running tables. Such optional labels are specified in brackets, after the longer text. For example: Respondent agrees with all the major policies listed [Agrees with all]
The next column(s) specify which values of the input variable(s) should be included in each category of the new variable. The specified codes of the input variable (here 'age') can consist of single code values, ranges, or a combination of many values and/or ranges (separated by commas). The input box for entering these code values and ranges only displays 8 characters at a time, but you can actually enter up to 80 characters in each box. (You can use the arrow keys to scroll back and forth in the box.)
In addition to specifying numeric code values and ranges, you can use a few other special symbols that are useful in applying recode rules:
Symbol | Meaning |
---|---|
* | Matches any VALID value. If used in a range, matches the lowest or highest VALID value. |
** | Matches ANY value, including missing data (both user-defined and system-missing). If used in a range, matches the lowest or highest value, including missing data. |
$. | Matches the system-missing code. Note the period after the dollar sign. |
Note that an asterisk or double asterisk, when NOT used in a range, cannot be combined with other specifications in a recode rule. (All other specifications can be combined in a recode rule, but must be separated by commas.)
For example, to combine 'age' and 'sex' into a new variable named 'agesex' with four categories numbered 1-4, you first specify 'agesex' as the variable to be created:
Then you would specify both 'age' and 'sex' as input variables:
Then (assuming 'sex' is coded 1=Male, 2=Female) you can specify the recoding rule as:
Value | Label | age | sex |
---|---|---|---|
1 | Younger Men | 0-40 | 1 |
2 | Younger Women | 0-40 | 2 |
3 | Older Men | 41-99 | 1 |
4 | Older Women | 41-99 | 2 |
These recoding rules are easily extended to handle more than two input variables. You can also add more rows for recoding rules by clicking on the button labeled 'Add empty row to table'.
Value | Label | age |
---|---|---|
1 | Younger | *-34 |
2 | Middle | 35-64 |
3 | Older | 65-* |
Using this method, all valid age values up through 34 go into the first recoded group, and all valid age values of 65 or older go into the third group. Any remaining cases with missing data values are automatically assigned the system-missing value in the new variable. If you want to recode the missing data codes (0 and 99) into a numeric category in the new variable, use the following recode rules:
Value | Label | age |
---|---|---|
1 | Younger | *-34 |
2 | Middle | 35-64 |
3 | Older | 65-* |
9 | Missing data | 0,99 |
Note that this value of 9 on the new variable will be a valid code unless you set 9 to be a missing data code or out of the valid range. (See the help on these options.)
1. Mention the input missing data code explicitly
A character string that has been defined as character missing data on a numeric input variable can be assigned to one of the categories of the recoded variable by specifying that character missing data code in the recoding rules. Similarly, the system missing data code can be recoded by referring to it as `$.' in a recoding rule. (Note the period after the dollar sign.)
In this example, the characters `D' and `R' have been defined as missing data values for the variable `age', to indicate "Don't know" and "Refused." Also, some cases had a blank input field for 'age' in the original data file and were assigned the system missing data code. Those missing data codes in the original variable can be recoded, respectively, into the NUMERIC codes 7, 8, and 9 as follows:
Value | Label | age |
---|---|---|
1 | Younger | *-34 |
2 | Middle | 35-64 |
3 | Older | 65-* |
7 | Don't know | D |
8 | Refused | R |
9 | No data | $. |
2. Use double asterisks (**)
Double asterisks match ANY code, including all missing data codes. They can be used to assign numeric missing data codes, character missing data codes, and the system missing data code to a numeric value on the new variable.
For example, to recode ALL the rest of the codes of the variable `age' (not previously mentioned in a recoding rule) into the category `9' on the new variable, you would specify:
Value | Label | age |
---|---|---|
1 | Younger | *-34 |
2 | Middle | 35-64 |
3 | Older | 65-* |
9 | All missing data | ** |
If a case matches more than one recode rule, the first rule encountered will apply. In this example the last recode rule has '**' for the input variables -- which matches any value. Any cases not covered by a rule higher up in the recode rules will receive the value 9.
Value | Label | var1 | var2 |
---|---|---|---|
1 | Group 1 | 1,3-5,7 | 1-10 |
2 | Group 2 | 8-10,12 | 100 |
2 | | 41,45,55 | 51-90 |
9 | Unassigned | ** | ** |
Here the output code 2 has two rules -- which are listed individually because they cannot be combined into one rule. Note that a label only has to be specified once for an output category, even if that category has multiple recode rules.
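The matching behavior described above (comma-separated values and ranges, `*` for valid values, `**` for any value, `$.` for system-missing, and "first matching rule wins") can be sketched in Python. This is only an illustration of the rules as documented here, not SDA's implementation. The names `matches` and `recode`, the use of `None` for the system-missing code, and the `missing` sets are assumptions made for the example; it also assumes non-negative integer codes and does not handle `**` inside a range.

```python
# Illustrative sketch only; names and data layout are assumptions, not SDA code.
SYSMIS = None  # stands in for the system-missing code ($.)


def matches(spec, value, missing):
    """Check one value against one specification such as '1,3-5,7' or '*-34'."""
    spec = spec.strip()
    is_valid = value is not SYSMIS and value not in missing
    if spec == "**":   # any value at all, including missing data
        return True
    if spec == "*":    # any valid value
        return is_valid
    for part in (p.strip() for p in spec.split(",")):
        if part == "$.":
            if value is SYSMIS:
                return True
        elif "-" in part:
            lo, hi = part.split("-", 1)
            if "*" in (lo, hi) and not is_valid:
                continue  # a `*` endpoint only reaches valid values
            if isinstance(value, (int, float)):
                lo_ok = lo == "*" or value >= float(lo)
                hi_ok = hi == "*" or value <= float(hi)
                if lo_ok and hi_ok:
                    return True
        elif str(value) == part:  # single code, numeric or character
            return True
    return False


def recode(rules, case, missing):
    """Apply rules in order; the first rule whose specs all match wins."""
    for output_value, specs in rules:
        if all(matches(spec, case[var], missing.get(var, set()))
               for var, spec in specs.items()):
            return output_value
    return SYSMIS  # no rule matched: system-missing on the new variable


age_rules = [
    (1, {"age": "*-34"}),
    (2, {"age": "35-64"}),
    (3, {"age": "65-*"}),
    (9, {"age": "**"}),
]
age_missing = {"age": {0, 99}}  # 0 and 99 are missing data codes for 'age'

results = [recode(age_rules, {"age": a}, age_missing)
           for a in (20, 50, 70, 99, None)]
print(results)  # [1, 2, 3, 9, 9]
```

Because the `**` rule is listed last, it only catches cases that no earlier rule claimed; moving it to the top would send every case to category 9.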
This program creates a new SDA variable as a result of a computation based on one or more existing numeric variables.
Basic expressions are of the form:

    newvar = expression

The name of one (and only one) new variable must appear on the left-hand side of the equal sign. If a variable of the same name already exists in the location for new variables, it will NOT be replaced, unless the option to replace it is selected.
If a variable of the same name already exists in the main dataset for the study, the variable cannot be created. Choose another name for your variable.
Variable names:
Only one new variable can be created at a time, and the name of a new variable can appear only once on the left side of an equal sign (except in an IF-statement). For example:

    newvar = spend + spend2 + spend3
    (or)
    newvar = sum(spend, spend2, spend3)
Note that the two examples above are NOT equivalent. The 'sum' function treats missing data codes differently than just using '+'. Using '+' will usually generate a missing data code on the new variable for a case unless ALL of the input variables have valid codes. But the 'sum' function can skip over variables with missing data codes and just add up the valid codes on the specified variables.
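The difference between `+` and `sum` can be illustrated with a small Python sketch. It is not SDA code: `None` stands in for a missing data code, and the helper names `plus` and `sum_fn` are made up for the example.

```python
# Illustrative only: None stands in for a missing data code.
def plus(*values):
    """Mimic `a + b + c`: the result is missing if ANY operand is missing."""
    if any(v is None for v in values):
        return None
    return sum(values)


def sum_fn(*values):
    """Mimic sum(a, b, c): add up whichever operands are valid."""
    valid = [v for v in values if v is not None]
    return sum(valid) if valid else None


spend, spend2, spend3 = 10, None, 5   # spend2 is missing for this case
print(plus(spend, spend2, spend3))    # None
print(sum_fn(spend, spend2, spend3))  # 15
```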
Descriptions of all of the operators, functions, and options that can be used with the COMPUTE program are given next.
If ANY input variable in an expression has a missing data code for a particular case, the output variable being created will generally be assigned a missing data code. By default the case will be assigned the system missing data code. However, if the user has designated some specific value as the missing data code for the output variable, the case will be assigned that value.
This automatic assignment of an output missing data code does not hold if the user does one of the following:
See the documentation on each of those functions or options, to see how they treat missing data in the input variables.
IF-statements, ELSEIF-statements, and ELSE-statements are evaluated in order. When one of those statements returns a missing data value for a particular case, no further IF/ELSEIF/ELSE statements are evaluated, and the output variable for that case is assigned the missing data code.
    if (var1 eq 1)
        newvar = var3
    else if (var1 eq 2)      [A space after `else' is optional]
        newvar = var4
    else
        newvar = -1
The `ELSE IF' part can be repeated; `ELSE' can be used only once; both parts are optional.
The expressions `IF', `ELSE IF', and `ELSE' should begin on a new line. Note that either upper or lower case can be used for `IF' and `ELSE'.
If no `ELSE' part is used, it is possible that some cases will not meet any of the conditions; the new variable will then be set to the specified missing data code for those cases.
There is an implied `ENDIF' at the end of the expression. The use of `ENDIF' is optional unless there are nested IF-statements (as shown below).
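As a rough Python analogue of an IF / ELSE IF expression with no `ELSE' part, the sketch below returns `None` (standing in for the missing data code) when no condition is met. The function name `compute_newvar` is hypothetical, and the sketch does not model how SDA treats missing values inside the conditions themselves.

```python
# Illustrative only: None stands in for the missing data code.
def compute_newvar(var1, var3, var4):
    """Analogue of:
         if (var1 eq 1)       newvar = var3
         else if (var1 eq 2)  newvar = var4
       with no ELSE part."""
    if var1 == 1:
        return var3
    elif var1 == 2:
        return var4
    return None  # no condition met: case gets the missing data code


print(compute_newvar(1, 10, 20))  # 10
print(compute_newvar(2, 10, 20))  # 20
print(compute_newvar(3, 10, 20))  # None
```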
Operator | Meaning | Example |
---|---|---|
EQ | equal to | if (x eq y) newvar = 1 |
NE | not equal to | if (x ne y) newvar = 1 |
GT | greater than | if (x gt y) newvar = 1 |
GE | greater or equal | if (x ge y) newvar = 1 |
LT | less than | if (x lt y) newvar = 1 |
LE | less or equal | if (x le y) newvar = 1 |
AND | both are true | if (x gt y AND x gt z) newvar = 1 |
OR | either is true | if (x gt y OR x gt z) newvar = 1 |

These operators can be in upper or lower case.
IF-statements can be nested. In such cases, however, it is necessary to use `ENDIF' to eliminate ambiguity. The following example illustrates how this can be done:
    IF ( oldvar1 eq 1 )
        IF ( oldvar2 lt 100 )
            newvar = 1
        ELSEIF ( oldvar3 eq 2 )
            newvar = 2
        ENDIF
    ELSE
        IF ( oldvar4 gt 10 )
            newvar = 3
        ENDIF
    ENDIF

There can only be one IF-statement at the top level of the nested expression. The example above has more than one IF-statement, but all except one are nested within the top-level IF/ELSE expression.
Notice how the use of `ENDIF' removes ambiguity about what part goes with what. It is required that `ENDIF' be used for the nested portion of complex IF-statements. The very last `ENDIF', however, could have been omitted.
    $temp1 = var1 + var2
    $temp2 = var3 / var4
    newvar = $temp1 / $temp2
Variables with names that begin with `$' only exist while COMPUTE is running. They are not available for analysis after COMPUTE is finished creating the new variable.
A temporary variable cannot be used in the test portion of an IF-expression. For example, it is NOT legal to use:
    if ($temp1 eq 1)      (NOT legal)

However, a temporary variable CAN appear on the left-hand side of an equal sign within an IF-statement. For example, the following is legal:

    if (age lt 40) $temp1 = 1
Operator | Meaning |
---|---|
+ - * / | Addition, subtraction, multiplication, division |
^ | Power -- for example: var1^2 (var1 squared) |
-var1 | Negative of var1 (unary -) |
( ) | Parentheses are used to alter (or clarify) the usual order of evaluation. |

Order in which the various operators are applied:
Function | Meaning |
---|---|
ABS(x) | Absolute value of x |
EXP(x) | Exponential function (antilog), e^x |
LOG(x) or LN(x) | Natural logarithm |
LG10(x) or LOG10(x) | Logarithm - base 10 |
MOD(x,a) | Modulus (remainder) of `x' divided by `a' (e.g., mod(5,2) equals 1) |
RND(x) or ROUND(x) | Round off |
SQRT(x) | Square root |
TRUNC(x) | Truncate; the integer part of x |
Function | Meaning |
---|---|
MEAN.n(x,y,...) | Mean of the given variables |
SUM.n(x,y,...) | Sum of the given variables |
MIN.n(x,y,...) | Minimum value of the given variables |
MAX.n(x,y,...) | Maximum value of the given variables |
Note that the `.n' part of the function name is optional. If used, it tells the function that at least `n' of the given variables must have valid data for a case; otherwise the function returns the missing data code. The default value for `n' is 1.
For example, `mean(var1,var2,var3)' will generate the mean of the three variables, even if only one of the three has a valid code. On the other hand `mean.2(var1,var2,var3)' will generate a mean for a specific case only if at least two variables have valid codes on that case.
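The effect of the `.n' suffix can be mimicked in a few lines of Python. This is a sketch of the rule as described above, not SDA's code; `mean_n` is a hypothetical helper and `None` stands in for a missing data code.

```python
# Illustrative only: None stands in for a missing data code.
def mean_n(values, n=1):
    """Mean of the valid values, or None if fewer than n values are valid."""
    valid = [v for v in values if v is not None]
    if len(valid) < n:
        return None
    return sum(valid) / len(valid)


print(mean_n([4, None, None]))       # 4.0  (like mean(var1,var2,var3))
print(mean_n([4, None, None], n=2))  # None (like mean.2(var1,var2,var3))
print(mean_n([4, 6, None], n=2))     # 5.0
```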
Function | Meaning |
---|---|
COUNT(x,y(a-b)) | Number of variables with values between a and b (can specify different ranges for each var; missing data or out-of-range codes are not counted unless include-MD option is selected) |
CUM(x) | Cumulate the value of `x' from one case to the next (`x' can be a variable or a constant; if `x' is a missing data value, the cumulation from the previous case is carried over) |
MISSING(x,y,...) | Number of variables with missing data or out-of-range codes |
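The carry-over behavior of CUM can be pictured with the following Python sketch. It is an illustration only: `cum` is a hypothetical helper, `None` stands in for a missing data value, and starting the running total at 0 (which is what a missing first case would receive here) is an assumption that the text above does not address.

```python
# Illustrative only: None stands in for a missing data value.
def cum(values):
    """Running total across cases; a missing value carries the total over."""
    total = 0  # assumed starting point
    out = []
    for v in values:
        if v is not None:
            total += v
        out.append(total)
    return out


print(cum([1, 2, None, 4]))  # [1, 3, 3, 7]
```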
Function | Meaning |
---|---|
UNIFORM(x,y) | Uniform distribution between x and y (x and y can be constants or variables) |
DUNIFORM(x,y) | Discrete uniform distribution between x and y (result is always a whole number) |
NORMAL(x,y) | Normal distribution with mean=x, sd=y |
Function | Meaning |
---|---|
SIN(x), COS(x) | Sine and cosine (x is in radians) |
ARSIN(y) or ARCSIN(y) | Arcsine |
ARTAN(y) or ARCTAN(y) | Arctangent |
If you select this option to include missing data values in computations, the program will consider numeric missing data values as valid, for purposes of generating the new variable.
For example, the expression 'newvar = 2 * age' would ordinarily result in a missing data value for the new variable if the variable 'age' had the value '99' which was designated as a missing data code (to indicate a refusal). If this option is selected, the new variable in this case would receive a valid value of (2 * 99 =) 198.
Note that this option will not override character missing data values (such as 'D' or 'R'), nor will it override the system missing data code. Such missing data values do not have any numeric value that could be used in a computation.
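A small Python sketch of this option, applied to the expression 'newvar = 2 * age', may help. It is not SDA code: the function name `double_age`, the `include_md` flag, and the use of `None` for the system missing data code are assumptions for illustration.

```python
# Illustrative only: None stands in for the system missing data code.
def double_age(age, numeric_missing=frozenset({99}), include_md=False):
    """Analogue of newvar = 2 * age, with the include-missing-data option."""
    if age is None or isinstance(age, str):
        # System-missing and character codes ('D', 'R') have no numeric
        # value, so the option cannot rescue them.
        return None
    if age in numeric_missing and not include_md:
        return None
    return 2 * age


print(double_age(99))                    # None
print(double_age(99, include_md=True))   # 198
print(double_age("D", include_md=True))  # None
```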
If you prefer to assign your own missing data code to such cases, select this option, AND ALSO list one or more values as missing data values in the optional specifications for the new variable. Then the cases with no valid output value will be assigned the first missing data code you specified for the new variable.
Intermediate results are never rounded. All calculations are carried out using double-precision numbers. If rounding is requested, only the final result is rounded to the specified number of decimal places.
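To see why it matters that only the final result is rounded, compare the two approaches in Python. The variable names are made up, and the values are chosen to avoid ties, since the rounding rule for exact halves is not stated here.

```python
# Illustrative only: rounding to 0 decimal places.
var1, var2 = 1.4, 1.4

final_only = round(var1 + var2)         # round once, at the end
every_step = round(var1) + round(var2)  # what rounding intermediates would give

print(final_only)  # 3
print(every_step)  # 2
```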
Value | Label |
---|---|
1 | Lowest |
5 | Middle of the range |
10 | Highest |
If the text for a category is long, you can also assign an abbreviated version that will be used as the category label in crosstabulations and other similar output. Put the desired abbreviation in brackets before or after the long text for a category. For example:
Value | Label |
---|---|
1 | Minimum value expected from the computation[Lowest] |
10 | Highest value expected from the computation [Highest] |
(Note that the text box will only show about 20 characters at a time and will scroll, unlike this example, which shows a larger box for purposes of clarity.)
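The bracket convention for abbreviated labels can be illustrated with a short Python sketch. The helper `split_label` is hypothetical, and falling back to the full text when no bracketed abbreviation is given is an assumption about display behavior.

```python
import re


def split_label(text):
    """Return (long_text, short_label) from a label with an optional [abbrev]."""
    m = re.search(r"\[([^\]]*)\]", text)
    if not m:
        full = text.strip()
        return full, full  # assumed fallback: use the full text as the label
    long_text = (text[:m.start()] + text[m.end():]).strip()
    return long_text, m.group(1).strip()


print(split_label("Highest value expected from the computation [Highest]"))
# ('Highest value expected from the computation', 'Highest')
print(split_label("[Lowest] Minimum value expected from the computation"))
# ('Minimum value expected from the computation', 'Lowest')
```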
One reason to specify the seed might be to generate the same series of random numbers on repeated runs for diagnostic or instructional purposes. Note, however, that the same seed might not generate the same random numbers on different platforms (for example, on Linux versus Windows).
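The general idea of a seed can be demonstrated with Python's standard `random` module. This only illustrates the concept of repeatable pseudo-random draws; it says nothing about the generator SDA uses, and the function name `draw` and the seed value are arbitrary.

```python
import random


def draw(seed, n=3):
    """Return n pseudo-random numbers between 0 and 1 for a given seed."""
    rng = random.Random(seed)
    return [rng.uniform(0, 1) for _ in range(n)]


first_run = draw(12345)
second_run = draw(12345)
print(first_run == second_run)  # True: same seed, same series
```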
If the name of the new variable to be created matches the name of a variable that already exists (in the directory for new variables), that variable can be replaced by the new one, provided that the option to replace that variable is in effect. If the option NOT to replace the variable is in effect, the program will send a message that the variable already exists and that you should select the 'Replace' option if you want to overwrite it. If you are creating variables in a public workspace (shared with other users) please be kind: replace a variable only if you created it.
For example: 8, 9, 91-99
If you specify one or more missing data codes, the first such code specified can be used to assign a value on the new variable for those cases which do not have a valid outcome code. Cases having a missing data code on the new variable are ordinarily excluded from analyses involving that variable.
The rules used by RECODE or COMPUTE to create the new variable are also included in the descriptive text for the variable.
The variable list includes the following features: