SDA 4.0 Documentation for CORREL

NAME

correl - Correlation coefficients

USAGE

correl -b batchfile

DESCRIPTION

CORREL generates, by default, Pearson correlation coefficients among pairs of specified variables. Alternatively, the natural logarithm of the odds ratio can be calculated for each pair. A weight variable can be used to give different weights to each case, and filter variables may be used to exclude some of the cases.

If a case has missing data on ANY of the specified variables, by default it is excluded from all the calculations. However, there is an option to exclude cases pairwise -- that is, to calculate each correlation coefficient using all cases having valid data on that PAIR of variables.
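The difference between the two missing-data treatments can be sketched in Python. This is only an illustration, not CORREL's own code: the function names, the use of None to mark missing data, and the sample data are all assumptions made for the example, and weighting is left out.

```python
import math

def pearson(xs, ys):
    """Plain (unweighted) Pearson correlation of two equal-length lists."""
    n = len(xs)
    mx = sum(xs) / n
    my = sum(ys) / n
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sxx = sum((x - mx) ** 2 for x in xs)
    syy = sum((y - my) ** 2 for y in ys)
    return sxy / math.sqrt(sxx * syy)

def valid_cases(rows, columns):
    """Keep only the rows with no missing value (None) in the given columns."""
    return [r for r in rows if all(r[c] is not None for c in columns)]

def correlate(rows, a, b, pairwise=False):
    """Correlate columns a and b; return (r, number of cases used)."""
    # Default: a case must be valid on ALL variables.
    # Pairwise: a case only needs valid data on this PAIR of variables.
    check = (a, b) if pairwise else range(len(rows[0]))
    kept = valid_cases(rows, check)
    return pearson([r[a] for r in kept], [r[b] for r in kept]), len(kept)

# Hypothetical data: three variables, None marks missing data.
DATA = [
    (1, 2, 5),
    (2, 4, None),
    (3, 5, 1),
    (4, 9, None),
    (5, 10, 2),
]

r_default, n_default = correlate(DATA, 0, 1)
r_pairwise, n_pairwise = correlate(DATA, 0, 1, pairwise=True)
```

In this made-up dataset the default treatment uses only the three cases that are valid on every variable, while the pairwise treatment uses all five cases for the first two variables, so the two coefficients differ.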

Ordinarily this program is invoked by the Web interface for the SDA programs, and the user does not have to deal with the keywords given in this document. Output from the program is in HTML, which can be viewed with a Web browser.

It is also possible to run the program directly by preparing a batch command file, which specifies the variables to be analyzed and the options to use. This document explains how to prepare such a file. The name of this batch command file is specified to the program after the `-b' option flag.

KEYWORDS

The batch file contains specifications for the analysis. These specifications are given in the form "keyword = something" with one keyword per line. Keywords may be given in any order, either in upper or in lower case. The valid keywords are as follows (with significant characters shown in capital letters):


Keyword       Possible Specification          Default (if no keyword)
_____________________________________________________________________


STUdy=        path(s) of dataset(s)           Look for variables in
                                                current directory only

Vars=         names of vars to correlate      REQUIRED
               (separated by spaces/commas)

Weight=       name of weight variable         No weighting

Filter=       name(s) and codes of filter     No filter
                variable(s)

GVARCase=     LOWER or UPPER                  No force to lower/upper case

MD=           Pairwise                        Cases with any MD are
                                                excluded

SAvefile=     filename to receive output      Output sent to screen
                (overwrite existing file)       (standard output)

TExt=         Yes                             No text for variables

LAnguagefile= Name of file with non-English   English labels on
                labels and messages             output

RUNtitle=     Title or comments for run       No title or comments

Main Statistic to Display

The main statistic displayed in each cell of the matrix is either the Pearson correlation coefficient or the log of the odds ratio. The default is the Pearson correlation coefficient.

For each statistic the user can specify the number of desired decimal places (in parentheses, after the name of the statistic). See below for the default number of decimals for each statistic. Since the default main statistic is the Pearson correlation coefficient, it is not necessary to specify that statistic unless you want to change the number of decimal places to display.

It is possible to reverse the sign of one or more of the variables. This may be desirable, for example, in order to have all of the expected correlations positive, so that a negative correlation stands out as unexpected. (See the discussion below for more on this option.) To reverse the sign of a variable, give its index position after the 'reverse=' keyword. A variable's index position is its position in the list given after the 'vars=' keyword; the first variable listed is 1. See the last example below.


Keyword       Possible Specification          Default (if no keyword)
_____________________________________________________________________


MAINstat=     CORR (ndec)                     Display correlations,
              LOGodds (ndec)                    with default number
                                                of decimal places

REVerse=      list                            Do not reverse the signs
               (separated by spaces/commas)     of variables

Other Statistics to Display

In addition to the main statistic, several optional statistics can be displayed. You can specify the desired number of decimal places in parentheses if the default numbers of decimals (listed below) are not satisfactory.

Standard errors of the correlations: These statistics are placed in a matrix, beneath the matrix of correlation coefficients. See below for a note on their calculation.

Univariate statistics: The statistics available for each variable include its mean, standard deviation, standard error, valid N of cases, and (if there is a weight variable) valid weighted N of cases.

Paired statistics: These statistics are available if the 'md=pairwise' option is specified. The paired statistics displayed are the same as the univariate statistics, minus the standard errors. Each statistic is based on the number of valid cases for that pair of variables. Note that the number of valid cases for various pairs of variables can be very different from one another.

P-Square statistics: For an explanation of the PSQ statistic, see below.


Keyword       Possible Specification          Default (if no keyword)
_____________________________________________________________________


OTHERstats=
              SECOR (ndec)                    No standard errors of
                                                the correlations

              (Univariate statistics)

              MEANs (ndec)                    No means
              SD (ndec)                       No standard deviations
              SEVAR (ndec)                    No standard errors
              Ncases                          No unweighted N's
              WNcases (ndec)                  No weighted N's

              (Paired statistics)

              PMEANs (ndec)                   No paired means
              PSD (ndec)                      No paired std devs
              PSEVAR (ndec)                   No paired std errs
              PNcases                         No paired N's
              PWNcases (ndec)                 No paired weighted N's

PSQ=          list1 ; list2 (ndec)            No P-square statistics
               (see below)

Note that the 'otherstats=' keyword can be repeated on subsequent lines if necessary.

MORE STATISTICAL INFORMATION

DICHOTOMIZING VARIABLES FOR ODDS RATIOS

The calculation of an odds ratio assumes that each of the two variables in a pair has only two categories. If these statistics are requested, CORREL treats all of the specified variables as dichotomies, regardless of the number of categories they may actually have. The minimum valid value of each variable is treated as one category, and all valid values greater than the minimum are combined into the other category. If this default dichotomization is not appropriate for a particular variable, you can recode the variable within CORREL by using the standard SDA temporary recoding syntax.
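A minimal Python sketch of the default dichotomization and the resulting log odds ratio follows. It is an illustration only: the function names are hypothetical, weights are ignored, and the handling of an empty cell in the 2x2 table (here the division or logarithm simply fails) is an assumption, since this document does not say what CORREL does in that case.

```python
import math

def dichotomize(values):
    """Map the minimum valid value to 0 and every larger valid value to 1."""
    lowest = min(values)
    return [0 if v == lowest else 1 for v in values]

def log_odds_ratio(xs, ys):
    """Natural log of the odds ratio for two variables, after dichotomizing both."""
    dx, dy = dichotomize(xs), dichotomize(ys)
    counts = {(0, 0): 0, (0, 1): 0, (1, 0): 0, (1, 1): 0}
    for cell in zip(dx, dy):
        counts[cell] += 1
    # Odds ratio = (n00 * n11) / (n01 * n10). An empty cell is not handled here.
    ratio = (counts[(0, 0)] * counts[(1, 1)]) / (counts[(0, 1)] * counts[(1, 0)])
    return math.log(ratio)

# Hypothetical variables with more than two categories each.
X = [1, 1, 1, 2, 3, 3, 2, 1]
Y = [1, 1, 2, 2, 5, 1, 2, 1]
result = log_odds_ratio(X, Y)
```

With these made-up values, codes 2 and 3 of the first variable fall into the same category, as do codes 2 and 5 of the second.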

CALCULATION OF STANDARD ERRORS

If standard errors are requested, they are computed with the standard formulas for each statistic or its transformation, assuming simple random sampling. Note that the confidence interval for the Pearson correlation coefficient is not symmetric; therefore, there is no single standard error that applies in both directions. CORREL outputs the average distance of the upward and the downward confidence band for one standard error (based on the retransformation of Fisher's Z), since that number is ordinarily a useful approximation.

The calculation of the standard error of the correlation coefficient in each cell is based by default on the UNWEIGHTED number of cases, even if a weight variable has been used for calculating the correlation coefficient. Ordinarily this procedure will generate a more appropriate statistical test than one based on the weighted N in each cell.
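The averaging described above can be sketched in Python as follows. The function name is hypothetical, and the standard error of Fisher's Z is taken to be 1/sqrt(n - 3), the usual textbook formula under simple random sampling; this document does not state the exact expression CORREL uses, so treat that as an assumption.

```python
import math

def se_of_correlation(r, n):
    """Average one-standard-error distance above and below r, via Fisher's Z.

    n is the UNWEIGHTED number of valid cases. The 1/sqrt(n - 3) standard
    error of Z is the textbook formula, assumed here for illustration.
    """
    z = math.atanh(r)                 # Fisher's Z transformation of r
    se_z = 1.0 / math.sqrt(n - 3)     # standard error on the Z scale
    upper = math.tanh(z + se_z)       # retransform one standard error up
    lower = math.tanh(z - se_z)       # retransform one standard error down
    return ((upper - r) + (r - lower)) / 2.0

def band(r, n):
    """Return the (downward, upward) distances separately, to show the asymmetry."""
    z = math.atanh(r)
    se_z = 1.0 / math.sqrt(n - 3)
    return r - math.tanh(z - se_z), math.tanh(z + se_z) - r
```

For a positive correlation the upward distance is smaller than the downward distance, which is why a single averaged figure is only an approximation.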

CALCULATION OF P-SQUARE STATISTICS

The p-square statistic is an index of proportionality for the rows in a correlation matrix. (The correlation matrix is usually a matrix of Pearson correlations, although the p-square procedure will also work with the logs of odds ratios.)

If all of the correlation coefficients in one row are exactly double the size of the coefficients in another row, for example, there is a constant proportionality, and the index will be 1.0. Usually this statistic is used to examine the consistency of the relationships of several items (defining the rows of the matrix) with respect to a number of criterion variables (defining the columns of the matrix). For a discussion of the use of this statistic for creating scales, see Thomas Piazza, "The Analysis of Attitude Items," American Journal of Sociology, vol. 86 (1980) pp. 584-603.

The `PSQ=' keyword allows you to specify which items should be used for the rows (list1), and which items should be used as the criterion variables (list2). Each list is a set of numbers, referring to the order in which the variables were specified after the `Vars=' keyword. Each list can consist of single numbers or ranges, separated by commas or blanks. The two lists are separated by a semicolon. An example is given below.
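To make the list syntax concrete, here is a small Python sketch that expands a 'psq=' specification into variable names. It is not SDA's parser; the function names are hypothetical and the sketch assumes well-formed input.

```python
import re

def expand_list(text):
    """Expand '1-4' or '1, 3 5-7' into a list of 1-based variable positions."""
    positions = []
    for token in re.split(r"[,\s]+", text.strip()):
        if not token:
            continue
        if "-" in token:
            first, last = token.split("-")
            positions.extend(range(int(first), int(last) + 1))
        else:
            positions.append(int(token))
    return positions

def parse_psq(spec, varnames):
    """Split 'list1 ; list2 (ndec)' into row items, criterion variables, and decimals."""
    ndec = None
    match = re.search(r"\((\d+)\)\s*$", spec)
    if match:
        ndec = int(match.group(1))
        spec = spec[:match.start()]
    list1, list2 = spec.split(";")
    rows = [varnames[i - 1] for i in expand_list(list1)]
    criteria = [varnames[i - 1] for i in expand_list(list2)]
    return rows, criteria, ndec

VARS = ["spend", "spend2", "spend3", "spend4", "age", "educ", "sex"]
rows, criteria, ndec = parse_psq("1-4; 5-7 (3)", VARS)
```

Here the first list selects the four 'spend' items for the rows and the second list selects age, educ, and sex as the criterion variables.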

DECIMAL PLACES

Each statistic has a default number of decimal places with which it will be printed. To change the default, put the desired number of decimals in parentheses after specifying the statistic (or package of statistics). The default number of decimal places for the main statistics (correlations and logs of odds ratios) is 2 places. For their standard errors the default is 3 places. The defaults for the univariate and the paired statistics are: means (2), std deviations (2), std errors (3), and wncases(0). It is not necessary to request the 'correlation' main statistic unless you want to change the number of decimal places; unless otherwise specified, the Pearson correlation coefficient is the statistic that will be displayed.
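The defaults listed above can be summarized in a short Python sketch that reads a statistics specification and fills in the number of decimals. The function name and the table layout are hypothetical, and the sketch expects full statistic names rather than abbreviations.

```python
import re

# Defaults as listed in this section; Ncases and PNcases take no decimals.
DEFAULT_DECIMALS = {
    "corr": 2, "logodds": 2, "secor": 3,
    "means": 2, "sd": 2, "sevar": 3, "wncases": 0,
    "pmeans": 2, "psd": 2, "psevar": 3, "pwncases": 0,
}

def parse_stats(spec):
    """Turn 'means(3), sd' into {'means': 3, 'sd': 2}."""
    result = {}
    for name, ndec in re.findall(r"([A-Za-z]+)\s*(?:\(\s*(\d+)\s*\))?", spec):
        name = name.lower()
        # An explicit (ndec) overrides the default for that statistic.
        result[name] = int(ndec) if ndec else DEFAULT_DECIMALS.get(name)
    return result

requested = parse_stats("means(3), sd, ncases")
```

A statistic with no entry in the table, such as ncases, comes back with None because no decimal setting applies to it.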

ADDITIONAL INFORMATION

ABBREVIATIONS FOR KEYWORDS

Keywords can usually be abbreviated down to the number of characters required to differentiate them from other keywords. The keyword for the names of the variables, for instance, can be given as `variables=' or `vars=' or even `v='. Either upper or lower case may be used. In the list of keywords given above, the minimum set of characters for each keyword is capitalized.
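One way to picture the abbreviation rule is the Python sketch below. It assumes that a typed keyword is accepted when it begins with the capitalized minimum characters, which is consistent with 'variables=', 'vars=' and 'v=' all being accepted; whether SDA also checks the remaining characters is not stated here. The function names are hypothetical.

```python
# Keywords as capitalized in the tables above.
KEYWORDS = ["STUdy", "Vars", "Weight", "Filter", "GVARCase", "MD", "SAvefile",
            "TExt", "LAnguagefile", "RUNtitle", "MAINstat", "REVerse",
            "OTHERstats", "PSQ"]

def minimum_prefix(keyword):
    """Return the leading run of capital letters, lowercased."""
    prefix = ""
    for ch in keyword:
        if not ch.isupper():
            break
        prefix += ch
    return prefix.lower()

def resolve(typed):
    """Return the keyword whose minimum prefix begins `typed`, or None."""
    typed = typed.lower()
    matches = [k for k in KEYWORDS if typed.startswith(minimum_prefix(k))]
    return matches[0] if len(matches) == 1 else None
```

Under this assumption 's' alone resolves to nothing, because it is shorter than the minimum for both 'study=' and 'savefile='.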

MENTION OF KEYWORD SUFFICIENT

The form `keyword=yes' may be shortened to `keyword'. That is, the `=yes' may be omitted for those options which require no further specification. For example, `text=yes' can be shortened to `text'.

COMMENTS

Anything on a line beginning with "#" is ignored by the batch processor and can therefore be used for comments. Blank lines are also ignored.

REPETITION OF KEYWORDS

If there is not enough room on one line to list all of the desired variables, the keyword can be repeated on a new line, and more variables can be listed. In such a case the second list is appended to the first list for the analysis.

This appending feature applies to the keywords that specify the variables to be correlated and the filter variables, and to the `otherstats=' keyword. It also applies to the 'study=' keyword, which specifies the locations of the SDA dataset directories. If any other keyword is repeated, the program will print an error message and stop.
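The rules in this section (comments, blank lines, bare keywords, and repeated keywords) can be drawn together in a short Python sketch. It is not the SDA batch processor: the function name is hypothetical, keywords must be spelled out in full, and joining appended lists with a space is an assumption.

```python
# Keywords whose repeated values are appended rather than rejected.
APPENDABLE = {"vars", "filter", "otherstats", "study"}

def parse_batch(text):
    """Parse 'keyword = something' lines into a dict of keyword -> value."""
    settings = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # blank lines and comment lines are ignored
        keyword, sep, value = line.partition("=")
        keyword = keyword.strip().lower()
        value = value.strip() if sep else "yes"  # bare 'text' means 'text=yes'
        if keyword in settings:
            if keyword not in APPENDABLE:
                raise ValueError("keyword repeated: " + keyword)
            settings[keyword] += " " + value  # append the second list to the first
        else:
            settings[keyword] = value
    return settings

EXAMPLE = """# demonstration batch file

vars = spend spend2
vars = spend3 spend4
text
savefile = out.htm
"""
parsed = parse_batch(EXAMPLE)
```

In this sketch the two 'vars=' lines end up as a single list of four variables, and a second 'savefile=' line would raise an error.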

REVERSING THE SIGNS OF VARIABLES

It is often useful to reverse the sign of the correlation coefficients of one or more variables with the other variables. In a group of attitudinal variables, for example, some variables might be coded so that a high score means a liberal response, while other variables might be coded so that a high score means a conservative response. In the correlation matrix the correlation coefficients with a variable like 'age' might then be expected to be positive for the "high = conservative" items and negative for the "high = liberal" items.

If the correlation matrix has more than a few items, it will be easier to interpret the correlations if all of the attitudinal items are scored so that a high score means "conservative" (or "liberal" -- either way). However, it is not necessary to actually recode the items to achieve this goal. The CORREL program allows you to specify one or more items for a reversal of the signs you would otherwise get with those items. Then a departure from the expected sign will be easier to detect. For example, if the "high = liberal" items in the group have their signs reversed, then ALL of the attitudinal correlations with 'age' might be expected to be positive. So if one or more of the items have a negative correlation with 'age' it will be more obvious that the items in question are measuring something different from what the other items are measuring.
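The effect of a reversal on the matrix can be sketched in Python. The sketch assumes that reversing a variable is equivalent to negating it, so a coefficient changes sign when exactly one of its two variables is reversed and keeps its sign when both are; this document does not spell out the two-reversed case, so treat it as an assumption. The function name and the sample matrix are hypothetical.

```python
def reverse_signs(matrix, positions):
    """Flip the sign of each coefficient that involves exactly one reversed variable.

    positions are 1-based index positions, as given after the 'reverse=' keyword.
    """
    flipped = {p - 1 for p in positions}
    result = []
    for i, row in enumerate(matrix):
        new_row = []
        for j, value in enumerate(row):
            # Negating one variable flips r; negating both leaves r unchanged.
            sign = -1 if (i in flipped) != (j in flipped) else 1
            new_row.append(sign * value)
        result.append(new_row)
    return result

# Hypothetical 3 x 3 correlation matrix; the 2nd item is scored the other way.
MATRIX = [
    [1.0, -0.4, 0.3],
    [-0.4, 1.0, -0.2],
    [0.3, -0.2, 1.0],
]
reversed_matrix = reverse_signs(MATRIX, [2])
```

After reversing the second item, every off-diagonal coefficient in this made-up matrix is positive, so any remaining negative value in a real matrix would stand out.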

EXAMPLES OF BATCH FILES

Basic example

     study = /sa/testdata

     vars = spend spend2 spend3 spend4

     savefile = mymatrix.htm

Use weight and filter variables, and request some univariate statistics and descriptive text for the variables.

     vars = spend spend2 spend3 spend4
     otherstats = means, ncases

     weight= wtvar
     filters= age(18-50) gender(1)

     text = yes

     savefile = mymatrix.htm

Generate a P-square matrix of the four 'spend' variables, using age, educ, and sex as the criterion variables. Also request 3 decimal places.

     vars = spend spend2 spend3 spend4 age educ sex
     psq = 1-4; 5-7 (3)

     runtitle= Test run to demonstrate P-square stats
     savefile= mypsq.htm

Reverse the sign of the correlations involving two of the four 'spend' variables -- the 2nd and 4th mentioned after the 'vars=' keyword.

     vars = spend spend2 spend3 spend4
     reverse = 2 4

     text
     runtitle= Test run to demonstrate reversing signs
     savefile= mytest.htm

CSM, UC Berkeley/ISA
May 22, 2018