Online Help for Analysis Programs - SDA 4.1

This file contains the online help that is available from inside each SDA analysis program. In addition to the help specific to each program, this file includes information on features common to all analysis programs.

Help for Specific Analysis Programs

Frequencies and Crosstabulation
Comparison of Means
Correlation Matrix
Comparison of Correlations
Multiple Regression
Logit/Probit Regression
List Values of Individual Cases

Features Common to All Analysis Programs

Options for specifying variables
Optional variables
Display question text or variable description
Display a title or label for this analysis
Save an analysis run
Actions to take

SDA Frequencies and Crosstabulation Program

This program generates the univariate distribution of one variable or the crosstabulation of two variables. If a control variable is specified, a separate table will be produced for each category of the control variable.

Steps to take

Specify variables: To specify that a certain survey question or variable is to be included in a table, use the name for that variable as given in the documentation for this study. Aside from simply specifying the name of a variable, certain variable options are available to change or restrict the scope of a variable.
Select display options: After specifying the names of variables, select the display options you wish. These affect percentaging, text to display, and statistics to show.
Select an action: After specifying all variables and options, select the action to take.

REQUIRED variable name

Row variable(s): Variable down the side of the table

OPTIONAL variable names

Column variable(s): Variable along the top of the table
Control variable(s): A separate table is produced for each category of a control variable. If charts are being generated, a separate chart is also produced for each category of the control variable. If more than one row, column and/or control variable is specified, a separate table (and chart) will be generated for each combination of variables.
Selection filter variable(s): Some cases are included in the analysis; others are excluded.
Weight variable: Cases are given different relative weights.

Table Display Options for Crosstabulation

Cell Contents

Percentaging
Sample design for the standard errors
Confidence intervals
Standard error of each percent
Design effect for each percent
Show the Z-statistics
N of cases to display

Other Options

Summary statistics
Question text or variable description
Color coding of the table cells
Suppress display of the table
Include missing-data values
Display a title or label for this analysis

Percentaging

Defines which way to make the percents add up to 100 percent:

Column: down each column
Row: across each row
Total: as a percent of the total number of cases in the table

You can request more than one type of percentaging in a table, but such tables are hard to read.

It is important to understand that if a weight variable has been specified, the percentages and the statistics are always computed using the weighted number of cases. If you want to calculate percentages and statistics using only the unweighted N's, do not specify a weight variable.
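
As a rough illustration of the three percentaging directions, the sketch below computes column, row, and total percents from a small table of cell counts (weighted counts, if a weight variable is in use). The function name `percentages`, the mode strings, and the sample counts are hypothetical and are not part of SDA.

```python
def percentages(counts, mode):
    """Return percents for a table of cell counts given as a list of rows.

    mode is "column", "row", or "total" (hypothetical names for illustration).
    """
    row_totals = [sum(row) for row in counts]
    col_totals = [sum(col) for col in zip(*counts)]
    grand_total = sum(row_totals)
    result = []
    for i, row in enumerate(counts):
        out_row = []
        for j, n in enumerate(row):
            if mode == "column":
                base = col_totals[j]      # percents add to 100 down each column
            elif mode == "row":
                base = row_totals[i]      # percents add to 100 across each row
            elif mode == "total":
                base = grand_total        # percents add to 100 over the whole table
            else:
                raise ValueError("unknown mode: " + mode)
            out_row.append(100.0 * n / base)
        result.append(out_row)
    return result


# Made-up 2 x 2 table of counts.
counts = [[30, 10], [20, 40]]
column_pct = percentages(counts, "column")
row_pct = percentages(counts, "row")
total_pct = percentages(counts, "total")
```

With these made-up counts, the first cell is 60 percent of its column, 75 percent of its row, and 30 percent of the table.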

Sample design

For complex samples, standard errors and confidence intervals are calculated that take the complex design into account. If bivariate statistics are requested, the Rao-Scott adjustment to the chi-square statistics is used to create F statistics. In this case, probability values are calculated only for the Rao-Scott-based F statistics, and not for the unadjusted chi-square statistics.

Nevertheless, you can specify that the standard errors, confidence intervals, and chi-square probability values should be calculated as if the sample were a simple random sample (SRS). One reason to request SRS calculations might be to compare the size of the SRS standard errors or confidence intervals with the corresponding statistics based on the complex sample design.

Confidence intervals

If this option is selected, an additional row of numbers is generated that contains the upper and lower bounds of the confidence interval of the percentage (column, row, and/or total) in each cell. The confidence interval is the range of values within which the population value of the statistic is likely to fall. By default, the level of confidence is 95 percent, but the user can also select 99 percent or 90 percent.

The confidence interval is computed by converting the standard error of each percentage to a natural logarithm and then multiplying the log of the standard error by the value of Student's t appropriate to the level of confidence requested and to the number of degrees of freedom. The result is added to the log of the percentage to obtain the upper bound of the confidence interval, and it is subtracted from the log of the percentage to obtain the lower bound. The logs of the upper bound and of the lower bound are then converted back to percentages (by taking the antilogs) and displayed in the table cell.

This conversion back and forth to logarithms results in confidence intervals that are asymmetric -- they are a little wider in the direction of 50% than in the direction of 0% or 100%. This is the same procedure used by Stata to calculate confidence intervals of percentages. Notice that the calculation of confidence intervals for a proportion (or for any mean) by the Comparison of Means program does not use this log transformation. Therefore, the confidence intervals calculated by the Comparison of Means program will be a little different from the confidence intervals calculated by the Crosstabulation program for the same proportions. This is also the case for Stata.
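
The description above is verbal and does not give an explicit formula. One transformation that is consistent with it, and with the stated asymmetry toward 50%, is the logit transform that Stata documents for tabulated proportions. The sketch below assumes that form. It is an illustration only, not a statement of SDA's exact computation; the function name is hypothetical, and the Student's t value is supplied by the caller.

```python
import math


def logit_confidence_interval(p, se, t_value):
    """Asymmetric interval for a proportion p (0 < p < 1) with standard error se.

    Assumption: the interval is built on the logit scale and transformed back.
    t_value is the Student's t for the desired confidence level and df.
    """
    logit = math.log(p / (1.0 - p))
    # Standard error of the proportion carried over to the logit scale.
    half_width = t_value * se / (p * (1.0 - p))
    lower = 1.0 / (1.0 + math.exp(-(logit - half_width)))
    upper = 1.0 / (1.0 + math.exp(-(logit + half_width)))
    return lower, upper


# Made-up example: 10 percent with a standard error of 2 percentage points,
# using 1.96 as the t value for a large number of degrees of freedom.
lower, upper = logit_confidence_interval(0.10, 0.02, 1.96)
```

In this made-up example the upper bound lies farther from 10 percent than the lower bound does, so the interval is wider in the direction of 50%.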

Standard error of each percent

Standard errors for each type of percentage (column, row, or total) can be computed and displayed for each cell of the table. Standard errors are used to create confidence intervals for the percentages in each cell.

Simple random samples
If the sample is equivalent to a simple random sample of a population, the standard error of each percentage is computed using the familiar "pq/n" formula for the normal approximation to the standard error of a proportion. For each proportion p, the formula is:
sqrt(p * (1-p) / (n-1))
where n is the number of cases in the denominator of the percentage -- the total number of cases in that particular column, row, or total table, depending on the percentage being calculated. For this calculation, n is the actual number of cases, even if weights have been used to calculate the percentages.
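
A minimal sketch of this formula follows. The function name is hypothetical; `n` is the actual (unweighted) number of cases in the denominator, as stated above.

```python
import math


def srs_standard_error(p, n):
    """Standard error of a proportion p based on n actual (unweighted) cases."""
    if n < 2:
        raise ValueError("n must be at least 2")
    return math.sqrt(p * (1.0 - p) / (n - 1))


# Made-up example: a proportion of 0.5 based on 101 cases.
se = srs_standard_error(0.5, 101)  # sqrt(0.25 / 100)
```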

Complex samples
If the sample for a particular study is more complex than a simple random sample, the appropriate standard errors can still be computed provided that the stratum and/or cluster variables were specified when the dataset was set up in the SDA Web archive. Otherwise, the standard errors calculated by assuming simple random sampling are probably too small.

For complex samples the appropriate standard errors are computed using the Taylor series method. If you want additional technical information, see the document on standard error calculation methods.

Note that the calculations for standard errors in cluster samples require that the coefficient of variation of the sample size of the denominator for each percentage, CV(x), be under 0.20; otherwise, the computed standard errors (and the confidence intervals) are probably too small, and they are flagged in the table with an asterisk. CV(x) and other diagnostic information are available for standard error calculations done by the SDA Comparison of Means program. That program and the SDA Crosstabulation program use the same information and methods to calculate standard errors.

Design effect (deft) for each percent

The design effect for each percentage based on a complex sample is the ratio of the standard error of that percent in a table cell to the standard error of the same percent in a simple random sample of the same size. For the calculation of standard errors, see the discussion of standard errors above. (The design effect for a percent based on a simple random sample is 1.)

The design effect for each percent in a cell is used to calculate the effective number of cases (N / deft-squared) on which the percent is based, for purposes of precision-based suppression.
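
The two definitions above can be sketched as follows. The function names and the example numbers are hypothetical.

```python
def design_effect(se_complex, se_srs):
    """deft: complex-sample standard error divided by the SRS standard error."""
    return se_complex / se_srs


def effective_n(n, deft):
    """Effective number of cases: N / deft-squared."""
    return n / deft ** 2


# Made-up example: a complex-sample SE of 0.03 against an SRS SE of 0.02.
deft = design_effect(0.03, 0.02)
n_eff = effective_n(900, deft)
```

In this made-up example the deft is 1.5, so 900 actual cases carry about the same precision as 400 cases from a simple random sample.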

The design effects for all of the total percents in a table are used to calculate the Rao-Scott adjustment to the chi-square statistic, if bivariate statistics have been requested for a complex sample.

DF -- Degrees of freedom

The number of degrees of freedom (df) is used to compute the width of each confidence interval. For a simple random sample the df equal the number of cases in the denominator for each percentage for that cell, minus one.

For complex samples, the df equal the number of primary sampling units (clusters, for cluster samples; individual cases in the denominator, for unclustered samples) minus the number of strata (unstratified samples have a single stratum). Note that the number of strata and clusters used for this calculation is usually the number in the overall sample, and not in the subclass represented by a cell in a table. For a fuller discussion of this issue, see the treatment of domains and subclasses in the document on standard error methods.

The value of Student's t used for computing confidence intervals depends on the desired level of confidence (95 percent, by default) and the df. The fewer the df, the larger the required value of Student's t and, consequently, the larger the width of the confidence intervals. As the df increase, the size of the required Student's t value decreases until it approaches the familiar value for the normal distribution (which is 1.96, for the 95 percent confidence level).
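
The two df rules described in this section can be sketched as below. The function names and the example counts are hypothetical; the resulting df would then be used to look up the Student's t value in a table or a statistics library.

```python
def df_simple_random_sample(n_denominator):
    """df for SRS: cases in the denominator of the percentage, minus one."""
    return n_denominator - 1


def df_complex_sample(n_psu, n_strata=1):
    """df for a complex sample: primary sampling units minus strata.

    An unstratified sample counts as a single stratum.
    """
    return n_psu - n_strata


# Made-up examples.
df_srs = df_simple_random_sample(500)            # 500 cases
df_stratified = df_complex_sample(120, 40)       # 120 clusters in 40 strata
df_unstratified = df_complex_sample(50)          # 50 clusters, no stratification
```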

Show the Z-statistics

The Z-statistic controls the color coding of cells in the table. If you select this option, the statistic will be displayed in each cell.

The Z-statistic shows whether the frequencies in a cell are greater or fewer than expected (in the same sense as used for the chi-square statistic). It also takes into account the total number of cases in the table. If there are only a few cases in the table, the deviations from the expected values are not as significant as if there are many cases in the table.

The Z-statistics are standardized residuals. The residual for each cell is calculated as the ratio of two quantities:

The numerator is the difference between the observed and the expected number of cases in each cell. (The number "expected" is the same number used to calculate the chi-square statistic.)
The denominator is the following quantity:
sqrt(expected_n * (1-row_proportion) * (1-column_proportion))

For a discussion of the standardized residuals, see Alan Agresti, An Introduction to Categorical Data Analysis, New York: John Wiley, 1996, p. 31.
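
A minimal sketch of the standardized residual follows, for a table of unweighted counts. The function name and the sample table are hypothetical, and the sketch assumes that no row or column total is zero or equal to the table total.

```python
import math


def z_statistics(counts):
    """Standardized residual for each cell of a table of counts (list of rows)."""
    row_totals = [sum(row) for row in counts]
    col_totals = [sum(col) for col in zip(*counts)]
    total = sum(row_totals)
    z = []
    for i, row in enumerate(counts):
        z_row = []
        for j, observed in enumerate(row):
            row_proportion = row_totals[i] / total
            column_proportion = col_totals[j] / total
            # Same expected count as used for the chi-square statistic.
            expected_n = row_totals[i] * col_totals[j] / total
            denominator = math.sqrt(
                expected_n * (1 - row_proportion) * (1 - column_proportion)
            )
            z_row.append((observed - expected_n) / denominator)
        z.append(z_row)
    return z


# Made-up 2 x 2 table of counts.
z = z_statistics([[30, 10], [20, 40]])
```

In this made-up table the first cell has 30 cases against 20 expected, which gives a Z-statistic of about 4.08; the cell to its right has the same magnitude with a negative sign.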

Note that if the frequencies in the table are weighted, the Z-statistic can be artificially inflated (or deflated). Consequently, if weights are used, each Z-statistic is divided by the average size of the weights. The average size of the weights is the ratio of the total number of weighted cases in the table to the actual number of unweighted cases in the table. For example, if the table is based on 1,000 actual cases, but the weighted number of cases is 100,000, the average size of the weights is 100,000/1,000 = 100. (The chi-square statistics are adjusted in the same way, to compensate for weights whose average is different from 1.) Note also that the Z-statistic does not take into account the complex sample design, if the table is based on such a sample.

N of cases to display

By default, the number of cases used to calculate percentages is displayed in each cell. The box to display the weighted N is initially checked on the option form. If no weight variable was specified for the analysis, the unweighted N of cases is displayed in each cell, even if the box for weighted N was checked.

However, you can uncheck both boxes, and no N will be displayed. Or you can check both boxes, and both the unweighted and the weighted N of cases will be displayed (if a weight variable has been specified).

It is important to understand that if a weight variable has been specified, the percentages and the statistics are always computed using the weighted number of cases, regardless of which N is displayed in the table. If you want to calculate percentages and statistics using only the unweighted N's, do not specify a weight variable.

Summary statistics (Bivariate or Univariate)

Various numbers or statistics can be used to summarize the distributions of the variables. If you specify both a row and a column variable, a package of bivariate statistics is generated. If you specify a row variable only, a package of univariate statistics is generated. Consult any statistics textbook for more information on the meaning of these statistics.

Bivariate statistics

The bivariate statistics summarize the strength or the statistical significance of the observed relationship between the row and the column variables. Several of the most common statistics are displayed if you select this option.

Nominal-level statistics

A nominal-level statistic does not take into account any ordering of the categories of the row and column variables. That is, you would get the same result even if the categories were put into another order.
SDA displays two versions of the chi-square statistic, which is the most commonly used nominal-level statistic. For simple random samples (SRS) a probability level (p-value) is also calculated for each chi-square statistic.
For complex samples a Rao-Scott adjustment to each chi-square is calculated. An F statistic is derived from the adjusted Rao-Scott statistics and is added to the statistics package. The p-values corresponding to those F statistics are displayed (instead of the p-values for the regular chi-square statistics, which do not take the sample design into account).
If the p-value is low (about .05 or less), the chances that the observed relationship is only due to sampling error are correspondingly low, and in that case the relationship is said to be statistically significant. On the other hand, if the p-value is high, the chances are correspondingly high that the row and the column variables are not related to one another in the whole population from which the sample was drawn but are only related in the sample that happens to have been selected and that we are observing (analyzing).
- The chi-square statistics
  Two versions of the chi-square statistic are displayed -- Pearson's Chi-square, displayed after 'Chisq-P(df)=', where df is the number of degrees of freedom; and the Likelihood-ratio Chi-square, displayed after 'Chisq-LR(df)='. For SRS, the p-value (probability statistic) corresponding to each chi-square statistic for the given df (degrees of freedom) is also displayed.
  Note that if the frequencies in the table are weighted, the chi-square statistic can be artificially inflated (or deflated). Consequently, if weights are used, the chi-square is adjusted by the factor: (Total unweighted N) / (Total weighted N).
- Rao-Scott adjustment to chi-square (for complex samples)
  The probability associated with a regular Pearson or Likelihood ratio chi-square statistic assumes that the sample was a simple random sample. For complex samples, the probability associated with a given chi-square statistic is usually too small. This means that a particular relationship between two variables may appear to be statistically significant when it could really have arisen by chance.
  The Rao-Scott adjustment to the chi-square statistic takes the complex sample design into account. The probability associated with the Rao-Scott statistic is a more accurate indicator of the statistical significance of the relationship between the row and the column variables than the probability corresponding to a regular chi-square statistic.
  SDA displays the F statistic derived from each Rao-Scott statistic and the associated p-value of the F. This is done both for the Pearson chi-square, displayed after 'Rao-Scott-P:F(dfn, dfd)'; and for the Likelihood-ratio chi-square, displayed after 'Rao-Scott-LR:F(dfn, dfd)'. These are F-tests, where dfn is the number of numerator degrees of freedom and dfd is the number of denominator degrees of freedom.
  In generating these test statistics, SDA uses the first-order Rao-Scott approximation. The first step is to generate design effects for the estimated proportion of cases in each cell of the table and then to calculate a generalized design effect based on the cell design effects. The two chi-square statistics are divided by the generalized design effect, to obtain design-adjusted chi-square statistics. Then each design-adjusted chi-square statistic is divided by its numerator degrees of freedom to obtain F-statistics, which are then tested. The Rao-Scott adjustments to chi-square are explained in the following journal article: J.N.K. Rao and A.J. Scott, "On Chi-squared Tests for Multiway Contingency Tables with Cell Proportions Estimated from Survey Data," The Annals of Statistics, Vol. 12 (1984), No. 1, pp.46-60.
  Note that this use of the first-order Rao-Scott approximation is the same as in SAS. Stata uses a second-order approximation, which is a little different but should give the same substantive results.
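
The chi-square calculations and the final Rao-Scott steps described above can be sketched as follows. This is an illustration under stated assumptions, not SDA's code: the function names are hypothetical, the generalized design effect is taken as an input (its derivation from the cell design effects is not shown here), and p-values are left out because they require an F or chi-square distribution function.

```python
import math


def chi_square_statistics(counts, unweighted_n=None):
    """Return (Pearson chi-square, likelihood-ratio chi-square, df).

    counts is a table of cell counts (list of rows). If the counts are
    weighted, pass the actual unweighted N so that both statistics are
    scaled by (total unweighted N) / (total weighted N).
    """
    row_totals = [sum(row) for row in counts]
    col_totals = [sum(col) for col in zip(*counts)]
    total = sum(row_totals)
    pearson = 0.0
    likelihood_ratio = 0.0
    for i, row in enumerate(counts):
        for j, observed in enumerate(row):
            expected = row_totals[i] * col_totals[j] / total
            pearson += (observed - expected) ** 2 / expected
            if observed > 0:  # empty cells contribute nothing to the LR sum
                likelihood_ratio += 2.0 * observed * math.log(observed / expected)
    if unweighted_n is not None:
        factor = unweighted_n / total
        pearson *= factor
        likelihood_ratio *= factor
    df = (len(counts) - 1) * (len(counts[0]) - 1)
    return pearson, likelihood_ratio, df


def rao_scott_f(chi_square, generalized_deff, dfn):
    """First-order adjustment: divide by the generalized design effect, then by dfn."""
    return (chi_square / generalized_deff) / dfn


# Made-up unweighted table, and the same table with every weight equal to 100.
pearson, lr, df = chi_square_statistics([[30, 10], [20, 40]])
pearson_w, lr_w, df_w = chi_square_statistics(
    [[3000, 1000], [2000, 4000]], unweighted_n=100
)
# Made-up generalized design effect of 2.0.
f_pearson = rao_scott_f(pearson, 2.0, df)
```

With equal weights, the weighted table gives the same chi-square values as the unweighted one after the adjustment, which is the purpose of the scaling factor.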

Ordinal-level statistics

Ordinal statistics take into account the order of the row and the column categories. However, there is no assumption made that the distance between successive rows or columns is of the same magnitude. Only the order is considered.
Four ordinal statistics are given: Gamma, Tau (2 versions) and Somers' d (assuming the row variable to be the dependent variable).
The ordinal statistics can be calculated either for numeric variables or for character variables (with the categories sorted into alphabetic order).
These ordinal statistics are purely descriptive. No attempt is made to test them for sampling error.
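
As one illustration of an ordinal statistic, the sketch below computes Gamma from a table of counts using the standard textbook definition: concordant pairs minus discordant pairs, divided by their sum. The help text does not state SDA's exact computation, so treat this as an assumption; the function name and sample table are hypothetical.

```python
def goodman_kruskal_gamma(counts):
    """Gamma for a table of counts whose rows and columns are in category order."""
    n_rows = len(counts)
    n_cols = len(counts[0])
    concordant = 0
    discordant = 0
    for i in range(n_rows):
        for j in range(n_cols):
            # Pair each cell with every cell in a later row.
            for k in range(i + 1, n_rows):
                for m in range(n_cols):
                    if m > j:
                        concordant += counts[i][j] * counts[k][m]
                    elif m < j:
                        discordant += counts[i][j] * counts[k][m]
    return (concordant - discordant) / (concordant + discordant)


# Made-up 2 x 2 table of counts.
gamma = goodman_kruskal_gamma([[30, 10], [20, 40]])
```

Only the order of the categories enters the calculation; the code values themselves are never used, which is what distinguishes an ordinal statistic from an interval-level one.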

Interval-level statistics

Interval-level statistics take into account the ordering of the row and column categories (like ordinal statistics). And they also make the assumption that the distance between each successive category code is of equal importance.
If interval-level statistics are reported for numeric variables that are ordered, note that they must be ordered in a way that approximates interval-level variables. This refers to variables coded like 1=Agree strongly; 2=Agree somewhat; 3=Disagree somewhat; 4=Disagree strongly. To report interval-level statistics for such variables, you must assume that the "distance" between 1 and 2 is equal in importance to the distance between 2 and 3, and between 3 and 4.
Two interval-level statistics are given: R (the Pearson correlation coefficient), and Eta (the correlation ratio assuming the row variable to be the dependent variable).
If the row variable is a character variable, Eta cannot be calculated. If either the row variable or the column variable is a character variable, the correlation coefficient cannot be calculated.
These interval statistics are purely descriptive. No attempt is made to test them for sampling error. Use the regression program for tests of significance and confidence intervals for correlation statistics. The regression program can also handle complex sample designs.

Univariate statistics

The univariate statistics package includes the mean, median, mode, standard deviation, variance, and the coefficient of variation (standard deviation divided by the mean) of the specified variable, plus a few other descriptive statistics. All of these statistics are calculated using the weight variable, if one is specified.

Note that the univariate statistics cannot be calculated for character variables. If a character variable is used as a row variable, the request for univariate statistics is ignored. Even for numeric variables, be aware that the univariate statistics will not be meaningful unless the code values of the row variable are ordered in a way that approximates interval-level data.

These univariate statistics are purely descriptive. No attempt is made to test them for sampling error. To get standard errors and confidence intervals for the mean of a variable, you can use the Comparison of Means program.

Other Display Options

Question text or variable description

The text of the question or other descriptive text.

Color coding of the table cells

The table cells are color coded, in order to aid in detecting patterns. Cells with more cases than expected (based on the marginal percentages) become redder, the more they exceed the expected value. Cells with fewer cases than expected become bluer, the smaller they are, compared to the expected value.

The transition from a lighter shade of red or blue to a darker shade depends on the magnitude (absolute value) of the Z-statistic. The lightest shade corresponds to magnitudes between 0 and 1. The medium shade corresponds to magnitudes between 1 and 2. The darkest shade corresponds to magnitudes greater than 2.
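
The mapping from Z-statistic to cell color can be sketched as below. The function name and the returned labels are hypothetical. Two details are assumptions: the thresholds are applied to the absolute value of Z, and values exactly at 0, 1, or 2 are assigned as shown, since the text does not say how those boundaries are handled.

```python
def cell_shade(z):
    """Return (color, shade) for a cell, given its Z-statistic."""
    if z > 0:
        color = "red"      # more cases than expected
    elif z < 0:
        color = "blue"     # fewer cases than expected
    else:
        color = "none"     # assumption: exactly as expected, no color
    magnitude = abs(z)
    if magnitude > 2:
        shade = "dark"
    elif magnitude > 1:
        shade = "medium"
    else:
        shade = "light"
    return color, shade


# Made-up Z-statistics.
shades = [cell_shade(z) for z in (0.4, -1.5, 2.7)]
```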

The color coding can be turned off, if you prefer. Color coding may not be helpful if you are using a black-and-white monitor or if you intend to print out the table on a black-and-white printer.

Suppress display of the table

Occasionally you may want to see the summary statistics for a table and/or the chart, without wishing to view the table itself, especially if the table is a very large one. If you select this option, the table is generated internally but is not displayed.

Include missing-data values

With this option, the row, column, and control variables in the table will include ALL categories, including those defined as missing-data or out-of-range categories. The system-missing code will also appear in the table. Its category label will be the default "(No Data)" unless another label has been assigned to the system-missing code. Any range restrictions or temporary recode commands will be ignored, and every category will be shown.

If bivariate statistics are requested, nominal and ordinal statistics will be produced as usual, with the missing data codes sorted into order with the valid codes.

Interval-level statistics will also be computed if the included missing-data codes allow it. The Eta statistic will be calculated if the included missing data codes on the ROW variable are all numeric. The Pearson correlation coefficient can be calculated only if the included missing data codes are all numeric on BOTH the row and column variables.

If univariate statistics are requested, the row variable can only have numeric missing-data codes. Otherwise, no statistics can be generated, and the request is ignored.

Number of decimals to display

Each statistic displayed in the cells of the table has a default number of decimal places. If you want more or fewer decimal places, you can generally specify from 0 to 6 decimal places for most of the statistics displayed in each cell (with the exception of the unweighted number of cases). Note that the decimal place specifications for standard errors are RELATIVE to the number of decimal places in the percentages.

Chart Options for Crosstabulation

Type of chart to display

Select the type of chart you would like. A stacked bar chart is relatively compact and is suitable for most tables. Regular side-by-side bar charts, pie charts, and line charts are also available.

If you select column percentaging, the chart will include a separate set of bars (or a separate pie) describing the row variable, for each category of the column variable. For a line chart, there will be a separate line for each category of the row variable, plotted against the values of the column variable. The column variable is treated as the "break variable" in this layout.

If you select row percentaging, the chart will include a separate set of bars (or a separate pie) describing the column variable, for each category of the row variable. For a line chart, there will be a separate line for each category of the column variable, plotted against the values of the row variable. The row variable is treated as the "break variable" in this layout.

If you select total percentaging, a combination of row and column percentaging, or no percentaging at all, the effect is the same as selecting column percentaging only.

If there is only a row variable specified for the table, the chart will include one set of bars (or one pie, or one line) to show the distribution of that row variable.
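
The layout rules in this section amount to a small decision, sketched below. The function name, the argument names, and the string labels are hypothetical.

```python
def chart_break_variable(percentaging, has_column_variable=True):
    """Return which variable is treated as the break variable in the chart.

    percentaging is a collection of the selected types, drawn from
    "column", "row", and "total". Returns "column", "row", or None when
    the table has only a row variable.
    """
    if not has_column_variable:
        return None            # single distribution of the row variable
    if set(percentaging) == {"row"}:
        return "row"           # row percentaging only
    # Column, total, mixed, or no percentaging all behave like column.
    return "column"


# Made-up selections.
layout_row = chart_break_variable({"row"})
layout_mixed = chart_break_variable({"row", "column"})
layout_none = chart_break_variable(set())
```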

Bar chart options

The appearance of bar charts (both stacked and side-by-side bar charts) can be modified in two ways:

Orientation (vertical or horizontal): The default orientation is vertical, but a horizontal orientation will sometimes offer a clearer picture, especially if there are many categories in the break variable. A horizontal orientation will also accommodate longer category labels for the break variable.
Visual effects (2-D or 3-D): The bars in the bar charts can be shaded, to give a 3-dimensional effect. The desirability of shading depends mostly on personal preference.

Show Percents

Each bar, pie slice, or point on a line will have its percent included on the chart, if you select this option.

Note that these percents may not always appear or may not be legible in all situations.

On stacked bar charts the percents may not have sufficient room to appear inside the area allocated to small categories.

On pie charts and line charts the percents for some slices or for some points on the lines may be almost overlaid and become illegible, if there are many categories or if the lines are very close together.

If you still want to show the percents in those situations, it will usually help if you increase the size of the charts. For stacked bar charts it can also help to change from a vertical to a horizontal orientation.

Palette

The charts are usually output in color. If you wish to print or copy the charts on a black-and-white printer or copier, you can select the grayscale palette for your charts. The charts will then be output in various shades of gray (instead of in various colors).

Size of chart

The width and height of the chart (expressed in the number of pixels) can be modified. If there is a large (or a very small) number of categories in either the row or the column variable, it may be helpful to increase (or decrease) one or both of the dimensions of the chart.

Pie charts in particular may require an increase in the dimensions of the chart if the number of category slices is large. Otherwise, the labels for each slice of the pies might overlay one another.

Stacked bar charts with only two or three break categories may look better if the chart is made narrower. But if there is a large number of break categories (like years of age), the best solution is often to combine a horizontal chart orientation with an increase in the height of the chart.

Side-by-side bar charts are best limited to tables with a relatively small number of categories in both the row and the column variables. If there are many categories in either or both of the variables, the proliferation of bars can be confusing, even if the chart dimensions are increased. In such cases it is probably better to use stacked bar charts instead of side-by-side bar charts.

Line charts may need to be enlarged if the lines are close to being overlaid. If percents are being shown, they also can become overlaid. In such cases it may help to increase the height of the chart.

CSV output file

Create a CSV output file for downloading

You can create a CSV format file (based on the currently selected options) by clicking on the "Create CSV file" button. Once a CSV file is created, another button labeled "Download CSV file" will appear. Clicking on this button will allow you to download the CSV file to your computer. CSV files are useful for importing into other applications (such as Excel) for creating custom charts. CSV files can also be useful for preparing tables for inclusion in manuscripts.

When creating a CSV file it is usually easiest to first create preliminary HTML output, by clicking on the "Run the Table" button, while you choose the correct variables, filters, weight, cell statistics, etc. The HTML output is quick and easy to read while you're fine-tuning your options. When you have the desired output, create a CSV file.

Once the CSV file has been created, click on the "Download CSV file" button and download it. (Note that this same CSV file will remain available for downloading -- even multiple times -- until you create a new CSV file by clicking the "Create CSV file" button again.) Once you have downloaded the CSV file, you can import it into an appropriate application on your computer.

For example, to create a chart in Excel, use SDA's default CSV output option to separate statistics into multiple tables (see below). Once the CSV file has been imported into Excel, select the desired table of statistics, including the row and column labels. (You may also want to include the row or column totals, depending on the statistic and your preferences.) Then select the "Insert" tab and click on "Recommended Charts". You can now preview various chart types by clicking on them. Once you've chosen your desired chart type, click on "OK". You can then customize your chart in various ways using Excel's tools.

CSV table format

If the cells in your HTML table contain more than one statistic then, by default, in the equivalent CSV output file a separate table is created for each statistic. This is often the most useful format for importing into a charting tool. However, for some applications, it is more useful to combine the statistics into one table. You can specify whether the statistics should be output in separate tables or combined in one table. (Note that in either case, only one CSV file is created.)

Name of CSV download file

By default, the name of the CSV download file is "tables.csv". However, you can specify another name of your choice. This is useful for giving a more meaningful name to the CSV file, especially when you are creating several different CSV files. However, the file ending or extension must always be ".csv". If you do not specify a ".csv" extension, it will be automatically appended to your specified name.
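
The naming rule can be sketched as follows. The function name is hypothetical, and treating the extension check as case-insensitive is an assumption; the text does not say how an upper-case ".CSV" is handled.

```python
def csv_download_name(requested_name=None):
    """Return the name to use for the CSV download file."""
    if not requested_name:
        return "tables.csv"                 # default name
    if requested_name.lower().endswith(".csv"):
        return requested_name               # already has the extension
    return requested_name + ".csv"          # extension appended automatically


# Made-up file names.
names = [csv_download_name(n) for n in (None, "income_by_region", "age.csv")]
```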

SDA Comparison of Means Program

This program calculates the mean of the dependent variable separately within categories of the row variable and, optionally, the column variable. If a control variable is specified, a separate table will be produced for each category of the control variable.

Steps to take

Specify variables: To specify that a certain survey question or variable is to be included in a table, use the name for that variable as given in the documentation for this study. Aside from simply specifying the name of a variable, certain variable options are available to change or restrict the scope of a variable.
Select display options: After specifying the names of variables, select the display options you wish. These affect the number of decimals to show, text to display, and statistics to compute.
Select an action: After specifying all variables and options, select the action to take.

REQUIRED variable names

Dependent variable(s): A numeric variable whose mean or average value is to be computed for each combination of the row and (optionally) column and control variables and displayed in a table.
Row variable(s): Variable down the side of the table

OPTIONAL variable names

Column variable(s): Variable along the top of the table
Control variable(s): A separate table is produced for each category of a control variable. If more than one dependent variable, row variable, column variable, and/or control variable is specified, a separate table will be generated for each combination of variables.
Selection filter variable(s): Some cases are included in the analysis; others are excluded.
Weight variable: Cases are given different relative weights.

Display Options for Comparison of Means

Main statistic to display

Each cell of the table will usually contain the MEAN of the dependent variable for that particular combination of the row and (optionally) column and control variables.

Sometimes, however, it is more helpful to express each cell mean in another way:

DIFFERENCES from the overall mean.
Select this option to have those differences calculated and put into each cell of the table.
DIFFERENCES from a ROW category.
A specific ROW category is used as the base category. The other cells in the table are expressed as the difference between that cell and the base category cell in the same column.
The rightmost column of the table usually shows the Row Totals. In some setups, however, the average of the differences is shown. This is the weighted average of the differences in that row. The weight is the number of cases (weighted number of cases if a weight was used) in that cell plus the number of cases in the corresponding base category.
DIFFERENCES from a COLUMN category.
A specific COLUMN category is used as the base category. The other cells in the table are expressed as the difference between that cell and the base category cell in the same row.
The bottom row of the table usually shows the Column Totals. In some setups, however, the average of the differences is shown. This is the weighted average of the differences in that column. The weight is the number of cases (weighted number of cases if a weight was used) in that cell plus the number of cases in the corresponding base category.
TOTALS for each cell.
The total is the numerator of the ratio used to calculate the mean. (The denominator of the ratio is the number of cases in that cell.)
The totals are usually of interest only when a weight is being used to expand the cell counts up to their estimated values in the population. For example, one may be interested in the total estimated NUMBER of persons in each cell who have some characteristic (e.g., who smoke, or drive cars), instead of the PROPORTION of persons who have that characteristic. This assumes that the dependent variable is coded `1' for a case which has the characteristic (smokes, for example) and `0' for a case which does not have the characteristic.
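As a rough illustration of how the total relates to the mean, the sketch below uses made-up data for a single cell. The names `smokes` and `weights` and all of the values are hypothetical, the denominator is taken to be the weighted number of cases because a weight is in use, and the code is not taken from SDA itself.

```python
# Hypothetical data for ONE cell of a table: a 0/1 dependent variable
# and an expansion weight for each case (illustrative values only).
smokes = [1, 0, 1, 1, 0]
weights = [100.0, 150.0, 200.0, 50.0, 500.0]


def cell_total(values, wts):
    """Weighted total: the numerator of the ratio used for the mean."""
    return sum(v * w for v, w in zip(values, wts))


def cell_weighted_n(wts):
    """Weighted number of cases: the denominator of that ratio."""
    return sum(wts)


total = cell_total(smokes, weights)    # estimated NUMBER of smokers
weighted_n = cell_weighted_n(weights)  # estimated population in the cell
proportion = total / weighted_n        # the weighted mean (a PROPORTION)
```

With these invented values the total is 350 estimated smokers out of an estimated 1,000 persons, which corresponds to a proportion of .35.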

Base row or column category

When the main statistic to display is a DIFFERENCE from a row or column category, it is necessary to specify which row or column category is the base category.

Enter the code value for the row or column category that you want to consider the base category.

Transformation of the dependent variable (for 0/1 dependent variables)

The mean of a dependent variable coded 0 or 1 is a proportion. The problem with analyzing a proportion is that the standard deviation and variance depend on the magnitude of the proportion.

The proportion in each cell of the table can be transformed into another statistic that has a more stable distribution. These options are provided for didactic purposes, so that students and researchers can readily compare the logit and probit transformations with the original proportions in a table. The following options are available:

Logit
The proportion p is reexpressed as log(p/(1-p)).
Proportions greater than .5 have a positive logit. Proportions less than .5 have a negative logit. A proportion of .5 has a logit value of 0.
The logit has a constant standard deviation of 1.81 (pi / sqrt(3), to be exact).
The standard error of each logit is 1.81 / sqrt(n) where n is the number of cases in that cell. (This assumes simple random sampling. For complex samples, it is necessary to use the Logit/Probit Regression program to calculate standard errors.)
Probit
The proportion is reexpressed as the value corresponding to that specific probability on the cumulative distribution function of the normal distribution.
Proportions greater than .5 have a positive probit. Proportions less than .5 have a negative probit. A proportion of .5 has a probit value of 0.
The probit has a constant standard deviation of 1.0.
The standard error of each probit is 1.0 / sqrt(n) where n is the number of cases in that cell. (This assumes simple random sampling. For complex samples, it is necessary to use the Logit/Probit Regression program with the probit regression option to calculate standard errors.)
Logit scaled as a probit
The logit and probit distributions are very similar. They differ primarily in the tails of the distributions. However, because the two statistics are scaled differently, the similarity is not evident by simply examining the statistics.
This option converts a proportion into a logit and then rescales it by making the standard deviation equal to 1.0 (like a probit) instead of 1.81 (the usual standard deviation of a logit).
The standard error of each logit scaled as a probit is 1.0 / sqrt(n) where n is the number of cases in that cell. (This assumes simple random sampling. For complex samples, it is necessary to use the Logit/Probit Regression program to calculate standard errors.)

These transformations require that the dependent variable be coded as a value of 0 or 1. If the variable is not coded that way, SDA will create a temporary 0/1 variable by recoding the lowest value to 0 and all other values to 1.
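The three transformations described above can be sketched in a few lines. This is only an illustration that uses the Python standard library; the function names are hypothetical and nothing here is SDA's own code. The proportion `p` is assumed to be strictly between 0 and 1.

```python
import math
from statistics import NormalDist

# Standard deviation of the logit, pi / sqrt(3), about 1.81.
LOGIT_SD = math.pi / math.sqrt(3)


def logit(p):
    """Re-express a proportion p as log(p / (1 - p))."""
    return math.log(p / (1 - p))


def probit(p):
    """Value on the standard normal distribution whose CDF equals p."""
    return NormalDist().inv_cdf(p)


def logit_scaled_as_probit(p):
    """Logit rescaled so that its standard deviation is 1.0, like a probit."""
    return logit(p) / LOGIT_SD


# Compare the three transformations for a few proportions.
for p in (0.2, 0.5, 0.8):
    print(p, round(logit(p), 3), round(probit(p), 3),
          round(logit_scaled_as_probit(p), 3))
```

Proportions above .5 give positive values on all three scales, and proportions below .5 give negative values.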

Calculate a median or other percentile for each cell

The drop-down menu allows you to select either the median or a percentile. (The median is the same as the 50th percentile.) If you select 'Percentile', another drop-down list will appear, from which you can pick any percentile between 1 and 99 (the default is the 90th percentile).

The median or the specified percentile of the dependent variable will be displayed as the first statistic in each cell. If a weight variable is used, the medians or percentiles will be calculated using the weights. The medians or percentiles calculated by the MEANS program are purely descriptive. No attempt is made to test them for sampling error.

Base chart on medians/percentiles (instead of means): A chart generated by the MEANS program is based, by default, on the mean of the dependent variable (or whatever else has been selected as the "Main statistic to display"). If you have requested that medians or percentiles be displayed in each cell (in addition to the means), you can choose to base the chart on these medians or percentiles by checking this box.

Estimate the values of medians or percentiles

If you have a very large number of cases, dependent variable categories, and table cells, there may only be enough memory to calculate the exact median or percentile for some of the cells of the table. By default, no median or percentile is output for the remaining cells. By checking this box, however, you can request that an estimated value be calculated for the median or percentile in those cells that otherwise would be left without any statistic at all.

An asterisk (*) next to a median or percentile indicates that it was estimated using an algorithm for what is called the "remedian". For further information on this method of estimating medians and percentiles, see Peter J. Rousseeuw and Gilbert W. Bassett, Jr., "The Remedian: A Robust Averaging Method for Large Data Sets." Journal of the American Statistical Association, March 1990, vol. 85, pp. 97-104. Note that SDA uses a base of 101 to calculate the remedian.
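The basic idea of the remedian can be sketched as follows: values are collected in a small buffer of `base` elements, and each time a buffer fills up, its median is passed on to the next buffer. This is a simplified illustration of the general scheme described by Rousseeuw and Bassett, not SDA's implementation. The function name is hypothetical, and the sketch assumes for simplicity that the number of values is an exact power of `base`, so that no partially filled buffers remain at the end.

```python
from statistics import median


def remedian(values, base=101):
    """Approximate median using a chain of small buffers of size `base`.

    Simplifying assumption: len(values) is an exact power of `base`.
    Values left over in lower, partially filled buffers are ignored here.
    """
    buffers = [[]]
    for v in values:
        buffers[0].append(v)
        level = 0
        # Whenever a buffer is full, replace it by its median,
        # which is pushed into the buffer one level up.
        while len(buffers[level]) == base:
            m = median(buffers[level])
            buffers[level].clear()
            if level + 1 == len(buffers):
                buffers.append([])
            buffers[level + 1].append(m)
            level += 1
    # The estimate is the median of the highest non-empty buffer.
    for buf in reversed(buffers):
        if buf:
            return median(buf)
    raise ValueError("no values supplied")


# Small demonstration with base 3 and 9 values.
estimate = remedian([9, 1, 5, 2, 8, 3, 7, 4, 6], base=3)
```

Only a few buffers of `base` values each need to be held in memory at any time, which is why the method is attractive for very large tables.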

Confidence intervals

If this option is selected, an additional row of numbers is generated that contains the upper and lower bound of the confidence interval of the statistic (mean or difference or total) in each cell. The confidence interval is the range of values within which the population value of the statistic is likely to fall. By default, the level of confidence is 95 percent, but the user can also select 99 percent or 90 percent.

The confidence interval or range is computed by multiplying the standard error of the mean (or difference or total) by the value of Student's t appropriate to the level of confidence requested and to the number of degrees of freedom. The result is added to the mean (or difference or total) to obtain the upper bound of the confidence interval, and the result is subtracted from the mean (or difference or total) to obtain the lower bound. Note that if both complex and SRS standard errors are requested, only the complex standard errors are used to compute the confidence intervals.

For a very large random sample (in a particular cell of a table), for instance, the appropriate value for Student's t for a 95 percent confidence interval is close to the familiar 1.96 value for the normal distribution.
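The calculation can be sketched as below for the large-sample case. Because the Python standard library has no Student's t distribution, this illustration substitutes the normal distribution for Student's t, which is only reasonable when the number of degrees of freedom is large. The function name and the example values are hypothetical.

```python
from statistics import NormalDist


def confidence_interval(estimate, std_error, level=0.95):
    """Lower and upper bounds of a confidence interval.

    Assumption: the degrees of freedom are large, so the normal
    critical value is used in place of Student's t.
    """
    critical = NormalDist().inv_cdf(1 - (1 - level) / 2)
    margin = critical * std_error
    return estimate - margin, estimate + margin


# Hypothetical cell: mean of 10.0 with a standard error of 1.0.
lower, upper = confidence_interval(10.0, 1.0)
```

For small numbers of degrees of freedom the critical value from Student's t is larger than the normal value, so the interval is wider than this approximation suggests.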

Additional statistics to display (for simple random samples): There are several additional statistics that can be displayed in each cell:

Standard errors for the means (or for differences or for totals) can be computed and displayed for each cell of the table.
Note the following for the various main statistics:
- Means: If the sample is equivalent to a simple random sample of a population, the standard error of the mean is computed by dividing the standard deviation by the square root of the number of (unweighted) cases in each cell.
- Totals: The standard error (SE) is equal to the SE of the mean multiplied by the (weighted) number of cases in that cell.
- Differences from a row or column category: The SE for each difference between two means is the square root of the sum of the two variances, and it is calculated as: sqrt(VARIANCE1 + VARIANCE2). The variance of each mean is the square of the corresponding SE. If the Row Total or the Column Total is displayed, the row or column total is treated like the other columns or rows, and the standard error of the difference is calculated in the same way.
  If the "Average of the Differences" is shown in the last column or row, its standard error is calculated by computing the weighted average of the variances of the differences in that row or column, where the weight is the square of the (unweighted) N for each comparison. This weighted average is then divided by the square of the total N for the comparisons in that row or column, and the square root of the result is the SE for the "Average of the Differences." (This optional display in the last column or row can be set up by the data archive for certain didactic purposes.)
- Differences from the overall mean: The SE for this difference is not calculated.
If the sample for a particular study is more complex than a simple random sample, the appropriate standard errors can still be computed (except for transformed dependent variables) provided that the stratum and/or cluster variables were specified when the dataset was set up in the SDA data archive. Otherwise, the standard errors calculated by assuming simple random sampling are probably too small.
Standard errors are used to create confidence intervals for the mean in each cell. For cells with at least 30 cases, you can be 95% confident that the mean in the population (for each cell) is within the interval bounded by approximately two standard errors above and below the mean in the sample (ignoring the problem of potential bias in the sample).
Standard deviations can be computed and displayed for each cell of the table. These statistics measure how much variation there is in the dependent variable within each cell of the table. The calculation of standard deviations uses weights, if a weight variable has been specified.
Standard deviations are displayed when the main statistic to display is specified to be either means or totals. Standard deviations are not displayed when the main statistic is a difference.
Minimum and/or Maximum value
The minimum and/or the maximum value of the dependent variable within each cell of the table can be displayed by checking the corresponding box.
N of cases to display
By default, the number of cases used to calculate means is displayed in each cell. The box to display the weighted N is initially checked on the option form. If no weight variable was specified for the analysis, the unweighted N of cases is displayed in each cell, even if the box for weighted N was checked.
However, you can uncheck both boxes, and no N will be displayed. Or you can check both boxes, and both the unweighted and the weighted N of cases will be displayed (if a weight variable has been specified).
It is important to understand that if a weight variable has been specified, the means and the statistics are always computed using the weighted number of cases, regardless of which N is displayed in the table. If you want to calculate means and statistics using only the unweighted N's, do not specify a weight variable.
The Z-statistic or the t-statistic (depending on the main statistic selected for display). This statistic controls the color coding of cells in the table of means.
- The Z-statistic, which is purely descriptive, is shown if the main statistic displayed is the mean or the difference from the overall mean. The Z-statistic shows whether the mean in a cell is larger or smaller than the overall mean, and by how many standard units. It is calculated by dividing the difference from the overall mean by the standard deviation of the overall mean. In other words, the difference is converted to a standardized difference.
  Note that this standardized difference is purely descriptive. It does not assess the statistical significance of the observed difference between the cell mean and the overall mean.
- The t-statistic is shown if the main statistic displayed is the difference between the cell mean and the mean in a specified row or column. It is calculated by dividing the difference by the standard error of that difference. If both a complex standard error and an SRS standard error are requested, the complex standard error is used for calculating the t-statistic.
  The t-statistic can be used to calculate and display a p-value if requested.
The p-value. If you select this option, and if a t-statistic is (or can be) calculated for the difference between the cell mean and the mean in a specified row or column, the p-value corresponding to that t-statistic and its degrees of freedom will be displayed.
If the p-value is low (about .05 or less), the chances that the observed difference is only due to sampling error are correspondingly low, and in that case the difference is said to be statistically significant. On the other hand, if the p-value is high, the chances are correspondingly high that the difference between the cell mean and the mean in the specified row or column does not reflect a difference in the whole population from which the sample was drawn, but is only found in the particular sample that happens to have been selected and analyzed.
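The simple-random-sample formulas described above for standard errors and the t-statistic can be sketched as follows. The function names and the example numbers are hypothetical, and the code is only an illustration of the arithmetic, not SDA's own code.

```python
import math


def se_mean_srs(std_dev, n_unweighted):
    """SE of a mean: standard deviation / sqrt(unweighted N in the cell)."""
    return std_dev / math.sqrt(n_unweighted)


def se_total(se_mean, weighted_n):
    """SE of a total: SE of the mean times the (weighted) N in the cell."""
    return se_mean * weighted_n


def se_difference_srs(se1, se2):
    """SE of a difference between two means: sqrt(VARIANCE1 + VARIANCE2)."""
    return math.sqrt(se1 ** 2 + se2 ** 2)


def t_statistic(difference, se_diff):
    """t-statistic: the difference divided by its standard error."""
    return difference / se_diff


# Hypothetical cell and base category, each with 100 unweighted cases.
se_cell = se_mean_srs(30.0, 100)              # 3.0
se_base = se_mean_srs(40.0, 100)              # 4.0
se_diff = se_difference_srs(se_cell, se_base)  # 5.0
t = t_statistic(10.0, se_diff)                 # 2.0
```

Converting the t-statistic to a p-value additionally requires Student's t distribution with the appropriate degrees of freedom, which is not shown in this sketch.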

Additional statistics to display (for complex probability samples): There are several additional statistics that can be displayed in each cell:

Standard errors for complex samples can be computed and displayed for the means (or for differences or for totals) in each cell of the table (except for transformed dependent variables).

Note the following for the various main statistics:
- Means: The standard errors for complex samples are computed using the Taylor series method. The method used is reported when you run the program. If you want additional technical information, see the discussion of standard error calculation methods.
  You can also request SRS standard errors, which are calculated as if the sample were a simple random sample, for purposes of comparison.
- Totals: The standard error (SE) is equal to the SE of the mean multiplied by the (weighted) number of cases in that cell.
- Differences from a row or column category: The SE for each difference between two means is the square root of the sum of the two variances of the means minus the covariance, and it is calculated as: sqrt(VARIANCE1 + VARIANCE2 - COVARIANCE12). The variance of each mean is the square of the corresponding SE. The covariance term arises because of the complex design. If the Row Total or the Column Total is displayed, the row or column total is treated like the other columns or rows, and the standard error of the difference is calculated in the same way.
  If the "Average of the Differences" is shown in the last column or row, its standard error is calculated by computing the weighted average of the variances of the differences in that row or column, where the weight is the square of the (unweighted) N for each comparison. This weighted average is then divided by the square of the total N for the comparisons in that row or column, and the square root of the result is the SE for the "Average of the Differences." (This optional display in the last column or row can be set up by the data archive for certain didactic purposes.)
  You can also request SRS standard errors for the differences, which are calculated as if the sample were a simple random sample. However, if you request BOTH complex and SRS standard errors for the differences, only the complex standard errors are computed and reported.
- Differences from the overall mean: The SE for this difference is not calculated.
Standard errors are used to create confidence intervals for the mean in each cell, or for the difference between a cell mean and the mean in a specified row or column, or for the total in each cell (if a weight is being used to expand the cell counts to the estimated size of the population). The optional diagnostic table reports the degrees of freedom used to generate the appropriate t-statistic for creating the confidence intervals.
Note that the calculations for standard errors in cluster samples require that the coefficient of variation of the sample size in each cell, CV(x), be under 0.20; otherwise, the computed standard errors are probably too small, and they are flagged in the table with an asterisk. CV(x) for each cell is available in the optional diagnostic table.
Standard errors for simple random sampling (SRS) can also be displayed in each cell. These standard errors are computed by the formula used for simple random samples, ignoring stratification and clustering. These standard errors can be compared with the standard errors that take the complex sample design into account. In general the SRS standard errors will be somewhat smaller.
The Design Effect (DEFT) is the ratio of the complex standard error to the SRS standard error. It indicates how much the standard error has been inflated by the complex sample design. For example, a DEFT of 1.25 means that the calculated standard error is 25 percent larger than the standard error of a simple random sample of the same size.
DEFT is calculated for each subgroup of the data defined by the values of the row, column, control, and filter variables (if any). DEFT is only calculated for the means and totals. It is not calculated for the differences between means.
The RHO statistic or clustering coefficient is a measure of the effect of clustering (if the sample is a cluster sample). If the statistic is zero, it indicates that the standard error has not been inflated because of the cluster design. A rho statistic between .05 and .10 is of moderate size.
The effect of rho on the standard error depends on the size of the clusters. The larger the clusters, the larger the effect of rho. The formula for calculating rho is shown with the explanation of the stratified average cluster size. The average cluster size is represented as 'b' and is displayed for each cell in the optional diagnostic table.
If the design effect is less than 1.0, the rho statistic will be negative. This means that the differences between clusters within the same stratum are relatively small, compared to the variability between elements in the sample as a whole.
RHO is calculated for each subgroup of the data defined by the values of the row, column, control, and filter variables (if any). RHO is only calculated for the means and totals. It is not calculated for the differences between means.
Standard deviations can be computed and displayed for each cell of the table. These statistics measure how much variation there is in the dependent variable within each cell of the table. The calculation of standard deviations uses weights, if a weight variable has been specified, but it does not take the stratification or clustering of the sample into account.
Standard deviations are displayed when the main statistic to display is specified to be either means or totals. Standard deviations are not displayed when the main statistic is a difference.
The Z-statistic or the t-statistic (depending on the main statistic selected for display). This statistic controls the color coding of cells in the table of means.
- The Z-statistic, which is purely descriptive, is shown if the main statistic displayed either is the mean or is the difference from the overall mean. The Z-statistic shows whether the mean in a cell is larger or smaller than the overall mean, and by how many standard units. It is calculated by dividing the difference between the cell mean and the overall mean by the standard deviation of the overall mean. In other words, the difference is converted to a standardized difference.
  Note that this standardized difference is purely descriptive. It does not assess the statistical significance of the observed difference between the cell mean and the overall mean.
- The t-statistic is shown if the main statistic displayed is the difference between the cell mean and the mean in a specified row or column. It is calculated by dividing the difference by the standard error of that difference. If both a complex standard error and an SRS standard error are requested, the complex standard error is used for calculating the t-statistic.
  The t-statistic can be used to calculate and display a p-value if requested.
The p-value. If you select this option, and if a t-statistic is (or can be) calculated for the difference between the cell mean and the mean in a specified row or column, the p-value corresponding to that t-statistic and its degrees of freedom will be displayed.
If the p-value is low (about .05 or less), the chances that the observed difference is only due to sampling error are correspondingly low, and in that case the difference is said to be statistically significant. On the other hand, if the p-value is high, the chances are correspondingly high that the difference between the cell mean and the mean in the specified row or column does not reflect a difference in the whole population from which the sample was drawn, but is only found in the particular sample that happens to have been selected and analyzed.

Multiple Classification Analysis (MCA)

If this option is selected, an MCA table is generated showing the effect on the dependent variable of each of the categories of each row, column, and control variable. (Those variables must all be numeric variables. If one or more are character variables, the MCA request is ignored.)

These MCA statistics are purely descriptive. No attempt is made to test them for sampling error. You can run the SDA regression program to calculate standard errors and confidence intervals, even for complex samples.

The MCA procedure shows the average effect of each category, and it ignores any interactions between the variables. If interaction effects are statistically significant, MCA is generally not appropriate.

The first set of columns shows the effects of each category of each variable.

The Unadjusted Effect is the difference between the dependent variable score of respondents in each category and the overall mean of the dependent variable.
The Adjusted Effect of each category takes into account the effects of the other variables. The adjustment process is similar to running a regression with dummy variables for the various categories. Regression coefficients for dummy variables, however, represent deviations from the effect of the omitted category. MCA coefficients, on the other hand, are deviations from the overall mean of the dependent variable.
The eta coefficient for each variable is like a bivariate correlation coefficient. It is the square root of the proportion of variance of the dependent variable "explained" by the categories of each variable, unadjusted for the effects of the other variables.
The beta coefficient for each variable is like a standardized regression coefficient. It adjusts the eta coefficient for each variable by taking into account the effects of the other variables.
The Multiple R is the multiple correlation coefficient for all of the categories of all of the independent variables. The Multiple R takes into account all of the overlapping linear effects of all of the independent variables. The Multiple R-Squared shows the proportion of the variance of the dependent variable explained by all of the categories of the independent variables.

The second set of columns shows the mean of the dependent variable for each category of the row (and column and control) variable(s).

The Unadjusted Means are the (weighted) mean values of the dependent variable within the various categories of each variable. Each mean is the sum of the overall mean and the unadjusted effect for that category.
The Adjusted Means are the adjusted mean values of the dependent variable, taking into account the other categories of all the variables. Each adjusted mean is the sum of the overall mean and the adjusted effect for that category.

The "Difference" column shows the difference between the adjusted and unadjusted effects (or, equivalently, the means) for each category.
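The unadjusted part of an MCA table can be sketched with a small, unweighted example. The function names and the data are hypothetical. The adjusted effects and beta coefficients require fitting all of the variables together and are not attempted here, and this is not SDA's own code.

```python
import math
from statistics import fmean


def unadjusted_effects(groups):
    """Overall mean, plus each category mean minus the overall mean.

    `groups` maps a category code to the list of dependent-variable
    values for the cases in that category (unweighted, for simplicity).
    """
    all_values = [v for vals in groups.values() for v in vals]
    overall = fmean(all_values)
    effects = {cat: fmean(vals) - overall for cat, vals in groups.items()}
    return overall, effects


def eta(groups):
    """Square root of the proportion of variance explained by the categories."""
    overall, effects = unadjusted_effects(groups)
    ss_between = sum(len(vals) * effects[cat] ** 2
                     for cat, vals in groups.items())
    ss_total = sum((v - overall) ** 2
                   for vals in groups.values() for v in vals)
    return math.sqrt(ss_between / ss_total)


# Hypothetical variable with two categories, coded 1 and 2.
data = {1: [2.0, 4.0], 2: [6.0, 8.0]}
overall_mean, effects = unadjusted_effects(data)
# Each unadjusted mean is the overall mean plus the unadjusted effect.
unadjusted_means = {cat: overall_mean + e for cat, e in effects.items()}
eta_value = eta(data)
```

In this invented example the overall mean is 5, the unadjusted effects are -2 and +2, and the unadjusted means are therefore 3 and 7.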

Diagnostic output for standard errors (for complex probability samples): If this option is selected, an additional table is generated that contains the following statistics in each cell:

Number of strata
The number of strata used for computing the standard error of the mean is reported in the diagnostic table for each cell. The number of strata in each cell is the same as the number of strata in the overall sample, except for strata without any cases at all in the subgroup represented by that specific cell in the table. For a fuller discussion of this issue, see the treatment of domains and subclasses in the document on standard error methods.
Number of clusters (for cluster samples)
The number of clusters or primary sampling units (in a cluster sample) used for computing the standard error of the mean is reported in the diagnostic table for each cell. The number of clusters in each cell is the same as the number of clusters in the overall sample, except for clusters in strata without any cases at all in the subgroup represented by that specific cell in the table. For a fuller discussion of this issue, see the treatment of domains and subclasses in the document on standard error methods.
DF -- Degrees of freedom
The number of degrees of freedom (df) is used to compute the width of each confidence interval. For each cell of the table, the df equal the number of primary sampling units (clusters, for cluster samples; individual cases, for unclustered samples) minus the number of strata (unstratified samples have a single stratum).
The value of Student's t used for computing confidence intervals depends on the desired level of confidence (95 percent, by default) and the df. The fewer the df, the larger the required value of Student's t and, consequently, the larger the width of the confidence interval. As the df increase, the size of the required Student's t value decreases until it approaches the familiar value for the normal distribution (which is 1.96, for the 95 percent confidence level).
Design effect (deft) due to weighting
If weights were used to estimate means or totals, part of the total design effect MAY be due to weighting. The estimated design effect (deft) attributable to weighting (assuming that it would have been optimal to use the same sampling fraction for the whole sample) is given in the table of diagnostic information. The overall design effect for each cell could then be divided by the deft due to weighting, to estimate how much of the overall deft is due to the other characteristics of the sample design (like clustering).
The variation in the values of the weight variable is used to estimate the design effect due to weighting, assuming that it would have been optimal to use the same sampling fraction within all the strata. The deft due to weighting is based on formula 11.7.6 given in Kish, Survey Sampling, p. 430. That formula gives the design effect in terms of variances. The square root of that result gives the design effect in terms of standard errors (deft).
Note that this estimation of the design effect due to weighting is based entirely on the variation in the weight variable, and it does not consider the specific dependent variable being analyzed. Not every use of weights will increase the standard error of the mean of a specific dependent variable. If the weights result from the use of different sampling fractions in different strata of the sample and if that stratification was effective in reducing sampling error for this particular dependent variable, the estimated deft due to weighting may be greater than the overall deft. If this occurs, it is an indication that the weighting did not increase sampling error for this dependent variable as much as was estimated from the variation in the weight variable (if at all).
Frequently, however, differential rates of sampling are used in different strata simply to achieve the oversampling of some group(s) relative to others. Weights are then used to compensate for the different probabilities of selection. In those cases the different strata are sampled at different rates in a way that departs from optimum allocation, and the sampling variance of the mean of the dependent variable is increased (see Kish, Survey Sampling, pp. 429-433).
b -- Average cluster size (for cluster samples)
The effect of clustering on the size of standard errors depends on two factors: b (average cluster size, combined across strata), which is reported in the diagnostic table, and rho (the intra-cluster correlation) which is reported (optionally) in the main table. The relationship between these factors and DEFF, the design effect in terms of sampling variances (the square of the DEFT reported in the main table), is given by Kish (Survey Sampling, pp. 161-164) as:
DEFF = 1 + (rho)(b-1)
The b for each cell is a stratified estimate. The program calculates the average cluster size within each stratum. The individual stratum b's are then combined into an overall b for each cell. This overall b is reported in the optional table of diagnostic statistics.
CV(x) -- Coefficient of variation of cluster sizes (for cluster samples)
For a ratio mean computed as `y/x', the denominator `x' is the number of cases in a cluster. A requirement of the Taylor series method is that `x', the size of clusters within each stratum, not vary excessively. Concretely, this means that the coefficient of variation of the cluster sizes should be less than 0.20, and preferably under 0.10. (See Kish, Survey Sampling, pp. 206-209.)
If the value of CV(x) is greater than 0.20, the calculated standard error is probably too small. Such standard errors are flagged in the main table with an asterisk. The corresponding confidence intervals are also flagged with an asterisk.
The CV for each cell is a stratified estimate. The program calculates the coefficient of variation of the number of valid cases for the clusters within each stratum. The individual stratum CVs are then combined into an overall CV for each cell. This overall CV is reported in the optional table of diagnostic statistics available in the Comparison of Means program (but not in the Frequencies and Crosstabulation program). If the CV(x) is greater than 0.20, it is flagged with an asterisk.
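The relationships among several of these diagnostic quantities can be sketched as follows. The function names and example values are hypothetical. The sketch simply rearranges the formula DEFF = 1 + (rho)(b-1) given above and applies the stated rule for degrees of freedom; it does not reproduce SDA's stratified estimation of b or CV(x).

```python
def deft(complex_se, srs_se):
    """Design effect in terms of standard errors."""
    return complex_se / srs_se


def deff_from_deft(deft_value):
    """Design effect in terms of variances: the square of DEFT."""
    return deft_value ** 2


def rho_from_deff(deff, b):
    """Solve DEFF = 1 + rho * (b - 1) for rho (requires b > 1)."""
    return (deff - 1) / (b - 1)


def degrees_of_freedom(n_psus, n_strata):
    """Number of primary sampling units minus number of strata."""
    return n_psus - n_strata


def cv_is_flagged(cv_x, threshold=0.20):
    """True if CV(x) exceeds the 0.20 limit and would get an asterisk."""
    return cv_x > threshold


# Hypothetical cell: complex SE 2.5, SRS SE 2.0, average cluster size 10.
d = deft(2.5, 2.0)                       # 1.25
rho = rho_from_deff(deff_from_deft(d), 10)  # 0.0625
df = degrees_of_freedom(60, 30)          # 30
```

A DEFT below 1.0 gives a DEFF below 1.0 and therefore a negative rho, in line with the description of the RHO statistic.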

ANOVA

An analysis of variance can be carried out and presented after the table of means. The Eta squared statistic shows the proportion of the variance of the dependent variable accounted for by the row variable (and by the column variable, if there is one, and by the interaction between the row and column variables).

If the sample is a simple random sample, the ANOVA can also be used to assess the statistical significance of the effects of the row variable (and the column variable, if there is one) on the dependent variable. An F statistic is calculated as the ratio of each mean square to the residual mean square, and the probability of the F statistic is evaluated. If the p-value (probability statistic) associated with a particular row or column effect is low (about .05 or less), the chances are correspondingly low that the observed effect on the dependent variable is due only to sampling error. In that case the effect is said to be statistically significant.
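
The quantities described above can be sketched for the simplest case: a row variable only, unweighted cases, and a simple random sample. This is the textbook one-way calculation, not SDA's own code. The function name and the data are made up, and the p-value lookup is left out because it needs an F distribution table or a statistics library.

```python
def one_way_anova(groups):
    """groups: one list of dependent-variable values per row category.

    Returns (eta_squared, f_statistic) for a one-way layout.
    """
    all_values = [y for group in groups for y in group]
    n = len(all_values)
    k = len(groups)
    grand_mean = sum(all_values) / n

    # Sum of squares explained by the row variable.
    ss_between = sum(
        len(group) * (sum(group) / len(group) - grand_mean) ** 2
        for group in groups
    )
    ss_total = sum((y - grand_mean) ** 2 for y in all_values)
    ss_residual = ss_total - ss_between

    eta_squared = ss_between / ss_total
    # F = mean square for the effect / residual mean square.
    f_statistic = (ss_between / (k - 1)) / (ss_residual / (n - k))
    return eta_squared, f_statistic

# Invented data: three row categories with three cases each.
print(one_way_anova([[1, 2, 3], [4, 5, 6], [7, 8, 9]]))
```

For the invented data the row variable accounts for 54 of the 60 units of total variation, so Eta squared is 0.9 and F is 27 with 2 and 6 degrees of freedom.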

If the sample is a complex sample, like a cluster sample, the ANOVA is only of descriptive value. The F tests and their associated probability statistics are omitted because they would likely underestimate the size of the true p-value and therefore overstate the statistical significance of the observed row and/or column effects. Only the Eta squared statistic for each effect is displayed. You can use the SDA regression program to calculate the statistical significance of the independent variables in complex samples.

Other Display Options

Suppress display of the table

Occasionally you may want to see ANOVA statistics or a Multiple Classification Analysis (MCA) or a chart without viewing the table of means, especially if the table is a very large one. If you select this option, the table of means is generated internally but is not displayed. Tables containing confidence intervals and diagnostic information (for complex standard errors) are also suppressed.

Color coding of the cells

The cells of the table of means are color coded, in order to aid in detecting patterns. Cells with higher means than the overall mean become redder, the more they exceed the overall mean. Cells with lower means than the overall mean become bluer, the smaller they are.

The transition from a lighter shade of red or blue to a darker shade depends on the magnitude of the absolute value of the Z-statistic or t-statistic. The transition points vary, depending on which of those two statistics is calculated:

For Z-statistics the lightest shade corresponds to Z-statistics between 0 and .5. The medium shade corresponds to Z-statistics between .5 and 1. The darkest shade corresponds to Z-statistics greater than 1. If the Z-statistic is zero, no color is shown. Also, if the cell has fewer than 10 cases, no color is shown, since the difference from the overall mean is probably unstable.
For t-statistics the lightest shade corresponds to t-statistics between 0 and 1. The medium shade corresponds to t-statistics between 1 and 2. The darkest shade corresponds to t-statistics greater than 2. If the t-statistic is zero, no color is shown.
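
The transition points listed above can be summarized in a short Python sketch. It is only an illustration: the function name is hypothetical, a positive statistic is assumed to mean that the cell mean exceeds the overall mean, and a value exactly on a transition point is assumed to take the lighter shade (the text does not say which side it falls on).

```python
def cell_shade(stat, kind="z", n_cases=None):
    """Return (hue, shade) for a cell, or None when no color is shown."""
    if stat == 0:
        return None
    # For Z-statistics, cells with fewer than 10 cases are left uncolored.
    if kind == "z" and n_cases is not None and n_cases < 10:
        return None
    # Transition points: .5 and 1 for Z-statistics, 1 and 2 for t-statistics.
    low, high = (0.5, 1.0) if kind == "z" else (1.0, 2.0)
    size = abs(stat)
    if size > high:
        shade = "darkest"
    elif size > low:
        shade = "medium"
    else:
        shade = "lightest"
    hue = "red" if stat > 0 else "blue"
    return hue, shade

print(cell_shade(-0.8, kind="z", n_cases=50))
print(cell_shade(2.5, kind="t"))
```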

The color coding can be turned off, if you prefer. Color coding may not be helpful if you are using a black-and-white monitor or if you intend to print out the table on a black-and-white printer.

Number of decimals to display

Each statistic displayed in the cells of the table has a default number of decimal places. If you want more or fewer decimal places, you can generally specify from 0 to 6 decimal places for most of the statistics displayed in each cell. Note that some decimal place specifications are RELATIVE to the number of decimal places in the main statistic (means or totals).

Question text or variable description

The text of the question or other descriptive text.

Chart Options for Comparison of Means

Type of chart to display

Select the type of chart you would like. A bar chart is suitable for most tables and is the default chart format. Line charts are also available and are suitable especially when the categories of the row variable are ordered.

If only a row variable is specified (and no column variable), the bars or the line will show the value of the dependent variable (on the vertical axis) for each value of the row variable.

If both a row variable and a column variable are specified, there will be a separate set of bars, or a separate line, for each category of the column variable. For a bar chart, there will be sub-bars for each column category within the bar for each row category. For a line chart, there will be a separate line for each column category.

Bar chart options

The appearance of bar charts can be modified in two ways:

Orientation (vertical or horizontal): The default orientation is vertical, but a horizontal orientation will sometimes offer a clearer picture, especially if there are many categories in the row variable. A horizontal orientation will also accommodate longer category labels for the row variable.
Visual effects (2-D or 3-D): The bars in the bar charts can be shaded, to give a 3-dimensional effect. The desirability of shading depends mostly on personal preference.

Show means/medians/percentiles

Each bar or each point on a line will have its statistic included on the chart, if you select this option. This could be the mean (or another "main statistic"), or the median, or the specified percentile, depending on which statistic you requested as the basis for the chart.

Note that the chosen statistic may not always appear or may not be legible in all situations. Especially on line charts, the statistics for some points on the lines may be almost overlaid and become illegible, if there are many categories or if the lines are very close together. If you still want to show the statistics in those situations, it will usually help if you increase the size of the charts.

Palette

Size of chart

Bar charts are best limited to tables with a relatively small number of categories in both the row and the column variables. If there are many categories in either or both of the variables, the proliferation of bars can be confusing, even if the chart dimensions are increased.

Line charts may need to be enlarged if the lines are close to being overlaid. If means are being shown, they also can become overlaid. In such cases it may help to increase the height of the chart.

CSV output file

Create a CSV output file for downloading

CSV table format

Name of CSV download file

By default, the name of the CSV download file is "means.csv". However, you can specify another name of your choice. This is useful for giving a more meaningful name to the CSV file, especially when you are creating several different CSV files. However, the file ending or extension must always be ".csv". If you do not specify a ".csv" extension, it will be automatically appended to your specified name.

SDA Correlation Matrix Program

This program calculates the correlation between all pairs of two or more variables.

Steps to take

Specify variables: To specify that a certain survey question or variable is to be correlated or used as a filter or weight variable, give the name for that variable as given in the documentation for this study. Aside from simply specifying the name of a variable, certain variable options are available to change or restrict the scope of a variable.
Select display options: After specifying the names of variables, select the display options you wish. These affect the statistics to compute, the number of decimals to show, and text to display.
Run correlations: After specifying all variables and options, select Run correlations to run the program.
Or you can select Clear fields to delete all previously specified variables and options, so that you can start over.

REQUIRED variable names

Variables to be correlated

Enter the names of two or more numeric variables whose correlation coefficients are to be computed for each pair of variables. (There are various optional ways of specifying variable names for analysis.)

Enter the name of each variable in a text box. To go from one text box to another, use the tab key or your mouse. It is all right to skip a text box and leave it blank -- to use only text boxes 1, 5, and 9, for example.

It is possible to enter more than one variable name in a text box (the underlying text-entry area will scroll). This has consequences for other options which refer to variable numbers. For example, if you enter two variables in text box number 3, and then you request that the signs of the correlations be reversed for variable number 3, the signs of BOTH variables in text box number 3 will be reversed.

Each text box, consequently, defines a variable GROUP. Ordinarily it is clearer to put only one variable in each text box, but the possibility of defining groups of variables exists.

OPTIONAL variable names

Selection filter variable(s): Some cases are included in the analysis; others are excluded.
Weight variable: Cases are given different relative weights in calculating the correlation coefficients.

How to exclude cases with missing data

Listwise exclusion: If a case has a missing-data value on ANY of the variables to be correlated, it is excluded from ALL of the correlation calculations. This is the default procedure.
Pairwise exclusion: If a case has a missing-data value on SOME of the variables to be correlated, but not on others, it is excluded from the calculations for those PAIRS of variables in which one of the values is missing.
This procedure retains all of the information about each pairwise relationship. However, the multivariate relationships can be inconsistent, if many of the cases have different missing-data patterns on different variables.
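
The difference between the two procedures can be seen in a small Python sketch. The function name, the variable names, and the data are invented, and missing data is represented here by None.

```python
def cases_for_pair(rows, var_a, var_b, all_vars, method="listwise"):
    """Return the cases used to correlate var_a with var_b."""
    # Listwise: a case must be valid on every variable in the matrix.
    # Pairwise: a case only needs to be valid on the two variables involved.
    needed = all_vars if method == "listwise" else (var_a, var_b)
    return [row for row in rows if all(row[v] is not None for v in needed)]

rows = [
    {"v1": 1, "v2": 2, "v3": 3},
    {"v1": 2, "v2": None, "v3": 1},
    {"v1": 3, "v2": 1, "v3": None},
    {"v1": 4, "v2": 4, "v3": 4},
]
all_vars = ("v1", "v2", "v3")

print(len(cases_for_pair(rows, "v1", "v3", all_vars, "listwise")))
print(len(cases_for_pair(rows, "v1", "v3", all_vars, "pairwise")))
```

In this example listwise exclusion keeps 2 cases for every pair, while pairwise exclusion keeps 3 cases for v1 with v3 but only 2 for v2 with v3, which is how the correlations can end up based on different subsets.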

Correlation Measure to Calculate

The Pearson correlation coefficient

This is the usual correlation coefficient and is the default correlation measure to calculate. It is appropriate for ordered numeric variables.

Log of the odds-ratio

The log of the odds-ratio is an optional measure for dichotomous variables. The calculation of the odds ratio assumes that the two variables have only two categories each. If these statistics are requested, the correlation program treats each variable as a dichotomy, regardless of the number of categories it may actually have. The minimum valid value of each variable is treated as one category, and all valid values greater than the minimum are combined into the other category.

If this default dichotomization is not appropriate for a particular analysis, you can recode the variable temporarily within the correlation program using the standard methods of recoding variables.

Consult any beginners' statistics book for more information on the meaning of these statistics.
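
The default dichotomization and the resulting log of the odds-ratio can be sketched in Python as follows. This is a textbook illustration with hypothetical function names, not SDA's code. It assumes that the input lists contain only valid values and that no cell of the 2x2 table is empty; how SDA handles empty cells is not described here.

```python
import math

def dichotomize(values):
    """Minimum valid value becomes 0; all larger valid values become 1."""
    lowest = min(values)
    return [0 if v == lowest else 1 for v in values]

def log_odds_ratio(x, y):
    """Log of the odds-ratio for two variables after dichotomizing each."""
    counts = {(0, 0): 0, (0, 1): 0, (1, 0): 0, (1, 1): 0}
    for pair in zip(dichotomize(x), dichotomize(y)):
        counts[pair] += 1
    a, b = counts[(0, 0)], counts[(0, 1)]
    c, d = counts[(1, 0)], counts[(1, 1)]
    return math.log((a * d) / (b * c))

# Invented data: x has three categories, y is already a dichotomy.
x = [1, 1, 1, 2, 3, 2, 1, 3]
y = [0, 0, 1, 1, 1, 1, 0, 0]
print(log_odds_ratio(x, y))
```

Here codes 2 and 3 of x are combined into one category, the 2x2 table has cell counts 3, 1, 1, 3, and the result is the natural log of 9.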

Additional Statistics to Calculate

Alpha coefficient

Cronbach's alpha coefficient is a measure of how well the variables in the correlation matrix could be said to measure the same thing. If you added together all of the variables included in the correlation matrix to form a scale, alpha is the square of the correlation between the scale and the underlying factor.

The alpha coefficient is a function of the average correlation between the variables and of the number of variables. If some of the variables are scored in opposite directions, you should use the option to reverse the signs of some of the variables, so that a high score on all variables means the same thing.
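
One widely used formula with exactly those two inputs is the standardized alpha, k*r / (1 + (k-1)*r), where k is the number of variables and r is the average correlation between them. The Python sketch below applies that formula. Whether SDA uses this exact form is an assumption, and the function name is hypothetical.

```python
def standardized_alpha(corr):
    """Standardized alpha from a square correlation matrix (list of lists)."""
    k = len(corr)
    # Average the correlations above the diagonal.
    pairs = [corr[i][j] for i in range(k) for j in range(i + 1, k)]
    mean_r = sum(pairs) / len(pairs)
    return (k * mean_r) / (1 + (k - 1) * mean_r)

# Invented matrix: three variables that all correlate .5 with each other.
corr = [
    [1.0, 0.5, 0.5],
    [0.5, 1.0, 0.5],
    [0.5, 0.5, 1.0],
]
print(standardized_alpha(corr))
```

With three variables and an average correlation of .5 the result is .75. A negative correlation caused by opposite scoring would pull the average down, which is why the signs should be reversed first.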

Standard errors

A standard error for each correlation coefficient can be computed. If this option is requested, the standard errors are placed in a separate matrix, right under the matrix of correlation coefficients. If the sample is more complex than a simple random sample, the standard errors calculated here are probably too small.

The standard errors can be used to create confidence intervals for each correlation coefficient. For example, you can be 95% confident that the correlation coefficient in the population (for each pair of variables) is within the interval bounded by approximately two standard errors above and below the correlation coefficient calculated from the sample (as shown in the matrix). The actual multiple to use for creating confidence intervals is the t-statistic with (n-1) degrees of freedom.

The calculation of the standard error of the correlation coefficient in each cell is based by default on the UNWEIGHTED number of cases, even if a weight variable has been used for calculating the correlation coefficient. Ordinarily this procedure will generate a more appropriate statistical test than one based on the weighted N in each cell.

The standard error is computed differently, depending on which correlation coefficient you have selected.

Standard errors for Pearson correlation coefficients: The confidence interval for the Pearson correlation coefficient is not symmetric; therefore, there is no single standard error that applies in both directions. The standard error output by this program is the average distance of the upward and the downward confidence band for one standard error (based on the retransformation of Fisher's Z), since that number is ordinarily a useful approximation.
Standard errors for the log of the odds ratio: The standard error for the log of the odds ratio is calculated with standard formulas for that statistic. Consult a statistics book for details.
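
Both calculations can be sketched in Python. This is a hedged illustration, not SDA's code: the function names are hypothetical, the standard error of Fisher's Z is assumed to be the textbook 1/sqrt(n-3), and the standard error of the log of the odds ratio is assumed to be the usual sqrt(1/a + 1/b + 1/c + 1/d) for a 2x2 table with cell counts a, b, c, d.

```python
import math

def pearson_se(r, n):
    """Average of the upward and downward one-standard-error bands."""
    z = math.atanh(r)                # Fisher's Z transformation
    se_z = 1.0 / math.sqrt(n - 3)    # assumed textbook SE of Z
    upper = math.tanh(z + se_z)      # retransform one SE above
    lower = math.tanh(z - se_z)      # retransform one SE below
    return ((upper - r) + (r - lower)) / 2.0

def log_odds_ratio_se(a, b, c, d):
    """Assumed standard formula for the SE of the log odds ratio."""
    return math.sqrt(1.0 / a + 1.0 / b + 1.0 / c + 1.0 / d)

print(pearson_se(0.8, 28))
print(log_odds_ratio_se(4, 4, 4, 4))
```

For r = .8 the upward band is narrower than the downward band, because the correlation cannot exceed 1; the averaged value is the single number reported.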

Univariate statistics

Univariate statistics for each of the variables in the correlation matrix will be computed and displayed, if this option is selected.

The statistics available for each variable include its mean, standard deviation, standard error, valid N of cases, and (if there is a weight variable) valid weighted N of cases.

If missing-data cases have been excluded LISTWISE (the default), the univariate statistics for all variables will be based on the SAME cases -- those which have valid data on ALL of the variables.

If missing-data cases have been excluded PAIRWISE, the univariate statistics for each variable will be based on all the cases with valid data for that one variable.

Paired univariate statistics

If missing-data cases have been excluded pairwise, each correlation coefficient is based (potentially) on a different subset of the cases. Univariate statistics based on that same subset of cases for each pair of variables will be calculated and displayed, if this option is selected.

The paired statistics for each variable include its mean, standard deviation, valid N of cases for the pair, and (if there is a weight variable) valid weighted N of cases for the pair.

These statistics are displayed as a series of matrices. Each statistic for a given variable is (potentially) somewhat different, depending on which other variable it is being paired with.

Index of proportionality (P-squared)

It is sometimes useful to know the degree to which the correlations in each row of the correlation matrix are proportional to the correlations in the other rows. This is particularly the case in creating scales or indexes of items. If variables are measuring the same thing, they should have similar correlations to other relevant (criterion) variables.

The P-squared statistic is a way to measure the proportionality of rows in a correlation matrix. For example, if all of the coefficients in one row are exactly double the size of the coefficients in another row, there is a constant proportionality, and the index will be 1.0.

Usually we want to limit this comparison to a subset of the matrix -- namely, to the part corresponding to the correlations of the criterion variables with the variables of interest. To do this, we specify on the option screen the variable numbers (next to each text box on the option screen) corresponding to the variables for which we want the P-squared measure, and the variable numbers corresponding to the criterion variables.

For example, we could examine the degree to which the variables v1, v2, and v3 have proportional correlations to the criterion variables x1, x2, and x3. We would enter v1, v2, and v3 into the first 3 text boxes on the option screen; and x1, x2, and x3 into text boxes 4 through 6. To get the P-squared statistic for all the combinations of v1, v2, and v3, in respect to the criterion variables, we would then specify:

Vars to measure: 1-3
Criterion vars: 4-6

These variable numbers can be specified either as a range (1-3) or as a list (1,2,3); and the variables need not be adjacent in the original correlation matrix -- a list like '1,3,5' is valid.
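
To make the two notations concrete, here is a small Python sketch that expands such a specification into a list of variable numbers. It only illustrates the notation. The function name is hypothetical, and accepting a mixture of ranges and single numbers in one specification is an assumption.

```python
def parse_variable_numbers(spec):
    """Expand a spec such as '1-3' or '1,3,5' into a list of integers."""
    numbers = []
    for part in spec.split(","):
        part = part.strip()
        if "-" in part:
            # A range such as 1-3 includes both end points.
            start, end = part.split("-")
            numbers.extend(range(int(start), int(end) + 1))
        else:
            numbers.append(int(part))
    return numbers

print(parse_variable_numbers("1-3"))
print(parse_variable_numbers("1,3,5"))
```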

The P-squared statistics are presented in a symmetrical matrix. Each row and column corresponds to one of the variables that we specified as a "variable to measure."

For a discussion of how to use this statistic, see Thomas Piazza, "The Analysis of Attitude Items," American Journal of Sociology, vol. 86 (1980) pp. 584-603.

Other Display Options

Reverse signs of some correlations

In order to detect patterns in the correlation matrix, it is sometimes useful to reverse the signs of the correlations corresponding to one or more variables. Enter the variable number of each variable for which you want the signs reversed. The variable number corresponds to the text box number on the option screen.

For example, we may know that var1 is scaled in such a way that a HIGH score or value corresponds to a LOW score on var2 and var3, so we expect the correlations of var1 to be negative with var2 and var3. But if we are interested in the relationships of those variables to other variables, it will be easier to detect different patterns if we reverse all the signs corresponding to var1. That way, we can expect var1, var2, and var3 to have correlations of the same sign with other variables. Then if we do observe a difference in the signs, it will catch our attention.
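
The effect of the option can be sketched in Python. The sketch assumes that reversing a variable is equivalent to multiplying its row and its column of the matrix by -1, which is what rescoring the variable in the opposite direction would produce. The function name and the matrix are invented, and variables are numbered from 1 as on the option screen.

```python
def reverse_signs(corr, reversed_vars):
    """Flip the signs of all correlations involving the listed variables."""
    k = len(corr)
    flip = [-1 if (i + 1) in reversed_vars else 1 for i in range(k)]
    # A correlation between two reversed variables keeps its sign,
    # and the diagonal stays at 1.
    return [[corr[i][j] * flip[i] * flip[j] for j in range(k)]
            for i in range(k)]

# Invented matrix: var1 is scored in the opposite direction from var2, var3.
corr = [
    [1.0, -0.4, -0.3],
    [-0.4, 1.0, 0.5],
    [-0.3, 0.5, 1.0],
]
print(reverse_signs(corr, {1}))
```

After reversing variable number 1, all three variables correlate positively with one another.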

Color coding of the correlations

The correlation coefficients are color coded, in order to aid in detecting patterns. Correlation coefficients greater than zero become redder, the larger they are. Correlation coefficients less than zero become bluer, the more negative they are.

The transition from a lighter shade of red or blue to a darker shade depends on the magnitude of the correlation coefficient in each cell. The lightest shade corresponds to coefficients between 0 and .15. The colors become darker as the absolute value of the correlation exceeds .15, then .30, then .45.

Color coding is also used for the P-squared matrix, if one has been requested. However, the dividing points for colors are double in magnitude. The lightest shade corresponds to P-squared coefficients between 0 and .30. The colors become darker as the absolute value of the P-squared coefficient exceeds .30, then .60, then .90.

The color coding can be turned off, if you prefer. Color coding may not be helpful if you are using a black-and-white monitor or if you intend to print out the matrix on a black-and-white printer.

Question text or variable description

The text of the question or other descriptive text.

SDA Comparison of Correlations Program

This program calculates the correlation between two variables separately within categories of the row variable and, optionally, the column variable. If a control variable is specified, a separate table will be produced for each category of the control variable.

Steps to take

Specify variables: To specify that a certain survey question or variable is to be included in a table, use the name for that variable as given in the documentation for this study. Aside from simply specifying the name of a variable, certain variable options are available to change or restrict the scope of a variable.
Select display options: After specifying the names of variables, select the display options you wish. These affect the number of decimals to show, statistics to compute, and text to display.
Select an action: After specifying all variables and options, select the action to take.

REQUIRED variable names

Variables to be correlated: Two numeric variables whose correlation coefficient is to be computed for each combination of the row and (optionally) column and control variables and displayed in a table
Row variable(s): Variable down the side of the table

OPTIONAL variable names

Column variable(s): Variable along the top of the table
Control variable(s): A separate table is produced for each category of a control variable. If more than one correlation variable, row variable, column variable, and/or control variable is specified, a separate table will be generated for each combination of variables.
Selection filter variable(s): Some cases are included in the analysis; others are excluded.
Weight variable: Cases are given different relative weights.

Display Options for Comparison of Correlations

Correlation measure to calculate

The Pearson correlation coefficient is the default correlation measure to calculate. It is appropriate for ordered numeric variables.

The log of the odds-ratio is an optional measure for dichotomous variables. The calculation of the odds ratio assumes that the two variables to be correlated have only two categories each. If these statistics are requested, CORRTAB treats Var 1 and Var 2 as dichotomies, regardless of the number of categories they may actually have. The minimum valid value of each variable is treated as the base category (coded 0), and all valid values greater than the minimum are combined into the other category (coded 1). If this default dichotomization is not appropriate for a particular variable, you can specify another temporary recode after the variable name is given.

Show differences from overall correlation (instead of cell correlations)

Usually each cell of the table will contain the correlation coefficient of the two variables being correlated, for that particular combination of the row and (optionally) column and control variables. Sometimes, however, it is more helpful to express each cell correlation as the DIFFERENCE from the overall correlation. Select this option to have those differences calculated and put into each cell of the table.

Standard errors

Standard errors for the correlations can be computed and displayed for each cell of the table. The standard errors can be used to create confidence intervals for the correlation in each cell. If the sample is equivalent to a simple random sample of a population, you can be about 95% confident that the correlation in the population (for each cell) is within the interval bounded by two standard errors above and below the correlation in the sample (shown in the table).

The standard error is computed differently, depending on which correlation coefficient you have selected. The standard error for the Pearson correlation is based on Fisher's Z, and it is calculated as the average distance of the upward and the downward confidence band for one standard error (based on the retransformation of Fisher's Z into Pearson's R). The standard error for the log of the odds ratio is calculated with standard formulas for that statistic.

If the sample is more complex than a simple random sample, the standard errors calculated here are probably too small.

Consult any beginners' statistics book for more information on the meaning of these statistics.

Other Display Options

Color coding of the cells

The cells of the table of correlations are color coded, in order to aid in detecting patterns. Cells with higher correlations than the overall correlation become redder, the more they exceed the overall correlation. Cells with lower correlations than the overall correlation become bluer, the smaller they are.

The transition from a lighter shade of red or blue to a darker shade depends on the magnitude of the t-statistic. The lightest shade corresponds to t-statistics between 0 and 1. The medium shade corresponds to t-statistics between 1 and 2. The darkest shade corresponds to t-statistics greater than 2.

The color coding can be turned off, if you prefer. Color coding may not be helpful if you are using a black-and-white monitor or if you intend to print out the table on a black-and-white printer.

Show the t-statistic

If you select this option, the t-statistic will also be displayed in each cell.

The t-statistic shows whether the correlation in a cell is larger or smaller than the overall correlation. It also takes into account the total number of cases in each cell. A deviation from the overall correlation in a cell with only a few cases is less significant than the same deviation in a cell with many cases.

The t-statistic is calculated as the ratio of two quantities: The numerator is the difference between the correlation in the cell and the overall correlation. The denominator is the standard error of the correlation in that cell.
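
The ratio just described amounts to the following one-line calculation, shown here as a Python sketch with a hypothetical function name and invented values.

```python
def cell_t_statistic(cell_r, overall_r, cell_se):
    """(cell correlation - overall correlation) / SE of the cell correlation."""
    return (cell_r - overall_r) / cell_se

# Invented values: a cell correlation of .5 against an overall .3.
print(cell_t_statistic(0.5, 0.3, 0.1))
print(cell_t_statistic(0.5, 0.3, 0.4))
```

The same difference of .2 gives a t-statistic of about 2 when the standard error is .1, but only about .5 when the standard error is .4, as it might be in a cell with few cases.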

Note that the t-statistic controls the color coding of cells in the table of correlations.

Number of decimals for the correlation

You can select from 1 to 6 decimal places. The default is 2 decimal places.

Question text or variable description

The text of the question or other descriptive text.

SDA Multiple Regression Program

This program calculates the regression coefficients for one or more independent or predictor variables, using ordinary least squares.

Two versions of the regression coefficient are given for each variable:

The unstandardized regression coefficient -- labeled B
The standardized regression coefficient -- labeled Beta

For each version of the coefficient there is also a standard error -- labeled either as SE(B) or as SE(Beta). The calculation of these standard errors depends on the sample design, as specified when the dataset was set up for SDA. For simple random samples, the standard formulas are used.

For complex sample designs, the user has a choice to specify SRS or complex standard errors. If your analysis is exploratory, and if you are only interested in the magnitude of the coefficients, you might want to specify that the sample is SRS, since the calculation of complex standard errors can be time consuming and does not affect the coefficients themselves. However, the complex standard errors should be used for significance tests and for the presentation of results.

In addition to the coefficients for each independent variable, a few summary measures for the regression as a whole are given. These include the Multiple R (multiple correlation coefficient), the R-Squared (the square of the Multiple R, also called the Coefficient of Determination), the Adjusted R-Squared, and the Standard Error of the Estimate (also called the root mean square error).

The Adjusted R-Squared is a measure that compensates for the inflation of the regular R-Squared statistic due simply to the inclusion of additional independent variables. The Adjusted R-Squared will increase only if the additional independent variables increase the predictive power of the model more than would be expected by chance. It will always be less than or equal to the regular R-Squared.
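
The usual textbook formula for this adjustment is 1 - (1 - R-Squared)(n - 1)/(n - k - 1), where n is the number of cases and k is the number of independent variables. The Python sketch below applies it. That SDA uses exactly this formula is an assumption, and the function name and values are invented.

```python
def adjusted_r_squared(r_squared, n_cases, n_predictors):
    """Textbook adjustment of R-Squared for the number of predictors."""
    penalty = (n_cases - 1) / (n_cases - n_predictors - 1)
    return 1.0 - (1.0 - r_squared) * penalty

# Invented values: R-Squared of .50 from 101 cases and 4 predictors.
print(adjusted_r_squared(0.50, 101, 4))
```

In this example the adjusted value is about .479, slightly below the unadjusted .50.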

Steps to take

Specify variables

To specify that a certain survey question or variable is to be used as the dependent variable, give the name for that variable as given in the documentation for this study. Then specify the names of one or more independent variables. Selection filter variables and a weight variable may also be specified.

Aside from simply specifying the name of a variable, it is possible to restrict the range of a variable or to recode the variable temporarily. Note in particular that you can create dummy variables and product terms.

Select display options

After specifying the names of variables, select the display options you wish. These affect the statistics to compute, the number of decimals to show, and text to display.

Run regression

After specifying all variables and options, select Run Regression to run the program.

Or you can select Clear Fields to delete all previously specified variables and options, so that you can start over.

REQUIRED variable names

Dependent variable

Enter the name of one numeric variable to be used as the dependent variable or the variable to be predicted.

Independent variables

Enter the names of one or more numeric variables whose regression coefficients are to be computed. Note that you can specify dummy variables and product terms as independent variables. It is also possible to restrict the range of a variable or to recode the variable temporarily.

It is possible to enter more than one variable name in a text box (the underlying text-entry area will scroll). Ordinarily it is clearer to put only one variable in each text box, but it is possible to enter more variables than there are text boxes.

Create a Temporary Dummy Variable

A dummy variable is a dichotomous variable coded 0 or 1. Cases that have a certain characteristic are coded as 1; whereas cases that do NOT have the characteristic are coded as 0.

To create such a variable temporarily, for a single regression run, for example, use the following syntax:

varname(d:1-3)

This would create a variable in which cases coded 1 through 3 on the variable 'varname' receive a code of 1, and all other VALID cases receive a code of 0. If 'varname' has a code defined as missing-data or out of range, the dummy variable will have the system-missing data value.

The characters 'd:' (or 'D:') indicate that you want to create a temporary dummy variable. The codes that follow show which codes on the original variable should become the code of 1 on the new dummy variable. One or more single code values or ranges can be specified. Multiple codes or ranges are separated by a comma.

You can give the '1' category of the dummy variable a label by putting the label in double quotes or in square brackets:

occupation(d:1, 3-5, 9, 10 "Managerial occupations = 1")

If you do not give a label, SDA will take the label from the code of the input variable assigned to the '1' category on the new dummy variable, provided that only a single code is assigned to the '1' category.
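
The coding rule for a specification like varname(d:1-3) can be mimicked in a few lines of Python. This only illustrates the rule and is not how SDA is implemented. The function name is hypothetical, the data are invented, and the system-missing value is represented by None.

```python
def make_dummy(values, one_codes, valid_codes):
    """Code 1 for cases in one_codes, 0 for other valid cases, else missing."""
    dummy = []
    for value in values:
        if value not in valid_codes:
            dummy.append(None)   # missing-data or out-of-range code
        elif value in one_codes:
            dummy.append(1)
        else:
            dummy.append(0)
    return dummy

# Equivalent of varname(d:1-3), where codes 1 through 5 are valid.
values = [1, 2, 3, 4, 5, 99]
print(make_dummy(values, one_codes={1, 2, 3}, valid_codes={1, 2, 3, 4, 5}))
```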

Create Multiple Temporary Dummy Variables (Regression only)

For multiple regression, it is possible to create multiple dummy variables at the same time from a single variable -- a separate dummy variable for every category except one, which is considered the base category. Each of the dummy variables would then be used as an independent variable in the regression.

For example, a variable such as 'party' (political party) could have categories like '1=Democrat', '2=Republican', '3=Independent', '4=Other'. To make 3 dummy variables, with 4 as the base category, use the syntax:

party(m:4)

The characters 'm:' (or 'M:') indicate that you want to create multiple temporary dummy variables. The code(s) that follow show which code(s) on the original variable should become the base category -- that is, which code or codes should NOT have a dummy variable created. The use of this syntax to create multiple dummy variables also has the effect of defining the set of dummy variables as a group, whose effects as a group are tested for significance.

One or more single code values or ranges can be specified as the base category. Multiple codes or ranges are separated by a comma, as in this example:

education(m:1-8,14,15)

If you want to create dummy variables for every category except the category with the highest valid numeric code, you can designate '*' as the base category. For example:

party(m:*)

For the example above, this has the same effect as designating '4' as the base category. However, it is convenient to be able to create multiple dummy variables without knowing ahead of time which category has the highest valid code.

Note that using this multiple dummy syntax is similar to creating individual dummy variables. However, dummy variables created individually are not automatically treated as a group, for purposes of testing the significance of the group as a whole.
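
The following Python sketch mimics what a specification like party(m:4) or party(m:*) produces: one 0/1 variable for each valid category outside the base category. It illustrates the coding only. The function name is hypothetical, the data are invented, and None stands for the system-missing value.

```python
def multiple_dummies(values, valid_codes, base="*"):
    """Return {code: dummy list} for every valid code not in the base."""
    codes = sorted(valid_codes)
    # '*' designates the highest valid code as the base category.
    base_codes = {codes[-1]} if base == "*" else set(base)
    dummies = {}
    for code in codes:
        if code in base_codes:
            continue
        dummies[code] = [
            None if v not in valid_codes else int(v == code) for v in values
        ]
    return dummies

# 1=Democrat, 2=Republican, 3=Independent, 4=Other; 9 is a missing-data code.
party = [1, 2, 3, 4, 9]
print(multiple_dummies(party, valid_codes={1, 2, 3, 4}, base="*"))
```

Cases in the base category are coded 0 on all three dummy variables, so each coefficient is a comparison with that category.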

Create a Temporary Product Variable or Term (Regression and correlation only)

An independent variable in a regression can be the product of two or more variables. A product variable can also combine one or more temporary dummy variables.

To create such a variable temporarily, for a single regression run for instance, use an asterisk (*) between the component variable names. For example:

age*education

This would create a variable in which, for each case, the value of 'age' is multiplied by the value of 'education'. If either 'age' or 'education' has an invalid code for that case, the temporary product term will have the system missing-data value.

One or more dummy variables can also be part of a product term. For example, the following form is acceptable:

party(d:3)*sex

In this example, first a single dummy variable is created from the variable 'party', and then that dummy variable is multiplied by 'sex'. Note that this syntax does not work with multiple dummy variables created like 'party(m:*)'. It only works with single dummy variables.
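
The following Python sketch illustrates the product-term rules described above, assuming that None stands in for the system missing-data value. The helper names 'product_term' and 'single_dummy' are hypothetical and are not part of SDA.

```python
def product_term(*columns):
    """Multiply columns case by case; a missing (None) component gives None."""
    result = []
    for case in zip(*columns):
        if any(v is None for v in case):
            result.append(None)  # invalid code on any component
        else:
            prod = 1
            for v in case:
                prod *= v
            result.append(prod)
    return result


def single_dummy(values, codes):
    """1 if the value is one of the given codes, 0 otherwise, None if missing."""
    return [None if v is None else int(v in codes) for v in values]


age = [30, 45, None, 60]
education = [12, 16, 14, None]
print(product_term(age, education))  # [360, 720, None, None]

party = [3, 1, 3, 2]
sex = [1, 2, 2, 1]
# Comparable to party(d:3)*sex: the dummy is created first, then multiplied.
print(product_term(single_dummy(party, {3}), sex))  # [1, 0, 2, 0]
```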

OPTIONAL variable names

Selection filter variable(s): Some cases are included in the analysis; others are excluded.
Weight variable: Cases are given different relative weights in calculating the regression coefficients.

How cases with missing data are excluded

Listwise exclusion: If a case has a missing-data value on ANY of the variables to be correlated and then regressed, it is excluded from ALL of the regression calculations. This is the only allowed procedure. The pairwise option available for the correlation program is not available for the regression programs.

Additional Statistics to Calculate

T-test for each coefficient

The t-test for each regression coefficient is generally displayed. The t-statistic is the ratio of the unstandardized regression coefficient (B) to its standard error -- shown as SE(B). Dividing the standardized regression coefficient (Beta) by its standard error, shown as SE(Beta), gives the same t-statistic.

The probability estimate associated with each t-statistic is given in the last column. This is the probability of obtaining a regression coefficient (either B or Beta) that is this large or larger, if the true coefficient is equal to zero in the population from which the current sample was drawn.

If the probability value for a regression coefficient is low (about .05 or less), the chances are correspondingly low that the observed effect of that independent variable on the dependent variable is only due to sampling error. However, a low probability value does not indicate that the true value of the coefficient in the population is of any specific magnitude -- only that it is not equal to zero.

The t-statistic and associated probability value are also given for the constant term of the regression equation. This is a test that the regression equation in the population has no constant term (or intercept). This test is usually of less interest than the tests for the regression coefficients of the independent variables.
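
A small Python sketch of the t-statistic arithmetic follows. The numbers are invented for illustration. The sketch assumes the usual rescaling in which both Beta and SE(Beta) equal the unstandardized values multiplied by the ratio of the standard deviations of the independent and dependent variables, which is why the two ratios agree.

```python
def t_statistic(coef, se):
    """Ratio of a coefficient to its standard error."""
    return coef / se


b, se_b = 0.50, 0.20      # hypothetical B and SE(B)
sd_x, sd_y = 4.0, 10.0    # hypothetical standard deviations
scale = sd_x / sd_y       # assumed rescaling factor for standardization

beta, se_beta = b * scale, se_b * scale

print(round(t_statistic(b, se_b), 3))        # 2.5
print(round(t_statistic(beta, se_beta), 3))  # 2.5
```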

Sum of squares analysis

The decomposition of the sum of squares for the regression is shown after the regression coefficients and the t-statistics. The proportion of the total sum of squares in the dependent variable accounted for by the regression is the 'R-squared' statistic, often referred to as the proportion of the variance that is "explained" by the regression on the independent variables. The square root of the R-squared statistic is the 'Multiple R' or the multiple correlation coefficient.

Since the R-squared always increases with the addition of more independent variables, regardless of their independent contribution, an 'Adjusted R-squared' is also shown. The Adjusted R-squared compensates for the addition of extra variables and will be less than the R-squared if some of the additional independent variables do not contribute independent predictive power.
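
The relationships among these statistics can be sketched in Python. The adjustment shown is the common textbook formula, 1 - (1 - R-squared)(n - 1)/(n - k - 1); that SDA uses exactly this formula is an assumption, and the function name and input values are hypothetical.

```python
import math


def r_squared_summary(ss_regression, ss_total, n_cases, n_predictors):
    """Return (R-squared, Multiple R, Adjusted R-squared)."""
    r2 = ss_regression / ss_total   # proportion of total sum of squares
    multiple_r = math.sqrt(r2)      # multiple correlation coefficient
    # Textbook adjustment for the number of independent variables (assumed).
    adjusted = 1 - (1 - r2) * (n_cases - 1) / (n_cases - n_predictors - 1)
    return r2, multiple_r, adjusted


r2, multiple_r, adjusted = r_squared_summary(36.0, 100.0, 101, 4)
print(round(r2, 3), round(multiple_r, 3), round(adjusted, 3))  # 0.36 0.6 0.333
```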

Global tests

In addition to the t-tests for the individual independent variables, tests are generally carried out on groups of variables. (Uncheck this box to suppress this output.)

The first group tested is the whole set of independent variables. A Wald F-statistic for ALL the independent variables is computed. The p-value (probability value) for the F-statistic is given in the last column of the table. This is the probability of obtaining an F-statistic this large or larger, if ALL of the regression coefficients (B's and Beta's) are equal to zero in the population from which the current sample was drawn.
If the p-value for the test is low (about .05 or less), the chances are correspondingly low that ALL of the observed effects of the independent variables on the dependent variable are only due to sampling error. Nevertheless, a low p-value does not indicate that any specific independent variable has an effect on the dependent variable. The separate t-test for each independent variable should be examined for that purpose. (However, if there is only one independent variable, the t-test for that variable will give the same p-value as the Global F-test.)
If any sets of multiple dummy variables have been created, using the 'var(M:*)' syntax, each resulting group of dummy variables created from a single variable is also tested. As with the set of ALL independent variables, the Wald F-test is used to test whether ALL of the regression coefficients corresponding to the dummy variables in the group are equal to zero in the population from which the sample was drawn.

Confidence intervals

Confidence intervals for the regression coefficients can be requested for various levels of confidence. The width of the confidence interval is affected by the level of confidence requested. For the usual 95% confidence interval, you can be 95% confident that the regression coefficient in the population from which the sample was drawn is within the interval bounded by approximately two standard errors above and below the regression coefficient in the sample (ignoring the problem of potential bias in the sample).

Note that the accuracy of the confidence intervals depends on specifying the correct sample design. If the sample is not a simple random sample (SRS), the size of the SRS standard errors and confidence intervals will probably be too small.
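
The "approximately two standard errors" rule can be sketched in Python as follows. The sketch uses a normal critical value from the standard library, which is an assumption: the program may use a different reference distribution, and the standard error passed in should already reflect the sample design. The function name and numbers are hypothetical.

```python
from statistics import NormalDist


def confidence_interval(coef, se, level=0.95):
    """Interval of coef plus or minus z standard errors (normal approximation)."""
    z = NormalDist().inv_cdf(0.5 + level / 2)  # about 1.96 for 95 percent
    return coef - z * se, coef + z * se


low, high = confidence_interval(0.50, 0.20)
print(round(low, 3), round(high, 3))  # 0.108 0.892

# A higher level of confidence gives a wider interval.
low99, high99 = confidence_interval(0.50, 0.20, level=0.99)
print(high99 - low99 > high - low)    # True
```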

Univariate statistics

Univariate statistics for each of the variables will be computed and displayed, if this option is selected. The statistics displayed for each variable include its mean and standard deviation.

Product of B and the univariate statistics

For each independent variable, the product of its regression coefficient (B) with its mean and its standard deviation can be displayed.

The product of each 'B' and the mean can be thought of as the average effect of this independent variable on the dependent variable.
The product of each 'B' and the standard deviation can be thought of as the effect of an increase of one standard unit of this independent variable on the magnitude of the dependent variable.

If this option is selected, the univariate statistics will automatically be selected as well, and the products will be displayed as additional columns in that table.

Correlation matrix

The correlation matrix of all the variables with one another will be displayed. The diagonal elements are always equal to 1.0 -- that is, each variable is perfectly correlated with itself.

Covariance matrix

The covariance matrix of all the variables with one another will be displayed. Each diagonal element displays the variance of a variable. The off-diagonal elements are the covariances.

Covariance matrix of coefficients

The variance/covariance matrix of the regression coefficients with one another will be displayed. Each diagonal element displays the variance of the regression coefficient (B) of the corresponding variable -- that is, the square of its standard error. The off-diagonal elements are the covariances.

Other Display Options

Sample design

For complex samples, the standard errors, confidence intervals and global tests should be calculated in a way that takes the complex design into account. Nevertheless, you can specify that the standard errors, confidence intervals, and other test values should be calculated as if the sample were a simple random sample (SRS). One reason to request SRS calculations might be to compare the size of the SRS standard errors or confidence intervals with the corresponding statistics based on the complex sample design.

For some large datasets, SRS calculations might be set as the default method, because the calculation of complex standard errors is MUCH more computer intensive and time-consuming than the equivalent SRS calculations. In such cases, it would be appropriate to do some SRS runs for exploratory purposes and then to request complex standard errors for your final runs.

The standard errors for complex samples are computed using the jackknife repeated replication method. The method used, together with the names of the stratum and/or cluster variables, is reported when you run the program. If you want additional technical information, see the discussion of standard error calculation methods.

Question text or variable description

The text of the question or other descriptive text.

Color coding of the coefficients

The regression coefficients are color coded, in order to aid in detecting patterns, if t-tests have been requested. Regression coefficients greater than zero become redder, the larger they are. Regression coefficients less than zero become bluer, the more negative they are.

The transition from a lighter shade of red or blue to a darker shade depends on the magnitude of the t-statistic, which is the ratio of each regression coefficient (B) to its standard error. The lightest shade corresponds to t-statistics with a magnitude between 0 and 1. The medium shade corresponds to a magnitude between 1 and 2. The darkest shade corresponds to a magnitude greater than 2.

Correlation coefficients are also color coded, if a correlation matrix is requested. Correlation coefficients greater than zero become redder, the larger they are. Correlation coefficients less than zero become bluer, the more negative they are. The transition from a lighter shade of red or blue to a darker shade depends on the magnitude of the correlation coefficient in each cell of the matrix. The lightest shade corresponds to coefficients between 0 and .15. The colors become darker as the absolute value of the correlation exceeds .15, then .30, then .45.
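
The shading rules described in this section can be summarized in a short Python sketch. The function names and shade labels are hypothetical, and the handling of values that fall exactly on a cutoff (1, 2, .15, .30, .45) is an assumption, since the text does not specify it.

```python
def hue(value):
    """Red for positive coefficients, blue for negative ones."""
    if value > 0:
        return "red"
    if value < 0:
        return "blue"
    return "none"


def t_shade(t):
    """Shade for a regression coefficient, based on the size of its t-statistic."""
    size = abs(t)
    if size < 1:
        return "lightest"
    if size < 2:
        return "medium"
    return "darkest"


def correlation_shade(r):
    """Shade level 0 (lightest) to 3 (darkest) for a correlation coefficient."""
    cutoffs = (0.15, 0.30, 0.45)
    return sum(abs(r) > c for c in cutoffs)


print(hue(-0.8), t_shade(-2.7))        # blue darkest
print(hue(0.35), correlation_shade(0.35))  # red 2
```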

The color coding can be turned off, if you prefer. Color coding may not be helpful if you intend to print out the regression results on a black-and-white printer.

Suppress independent variables list

If the regression has many independent variables that are used over and over in a series of regressions, the list of independent variables usually displayed at the beginning of the regression output may seem redundant. If this option is selected, the independent variables are not listed in the top section of the regression output, nor are their labels, valid ranges, and missing data codes. The variable names themselves, however, are still displayed in the rest of the output, so there should not be any confusion.

This option does NOT suppress the output for the dependent variable and any filter or weight variables used in the analysis. Only the independent variables are dropped from the list.

Chart Options for Multiple Regression

A chart showing the values of the regression coefficients and their confidence intervals can be displayed. You can choose either the 'B' (unstandardized coefficient) or the 'Beta' (standardized coefficient) to display.

The confidence intervals in the chart are based on the confidence level selected in "Output Options" (90, 95, or 99 percent level of confidence). If you request a chart, but the "Confidence intervals" checkbox in "Output Options" is not checked, then the default 95 percent confidence level will be used for the chart.

Coefficients to chart: Select the coefficients that will be charted: B (the default) or Beta. If you do not want a chart, select "(No chart)".
Range to display: By default, the range of the x-axis for the chart will be automatically adjusted, depending on the confidence intervals that will be displayed. Usually this works well. However, you can manually set the low and high bounds of the chart's x-axis if you prefer. Select "Custom range" in the menu, then enter the low and high bounds of the range in the input boxes that appear. If you leave either the "Low" box or the "High" box blank, the automatically adjusted value will be used for that bound.
Maximum number of independent variables to include in chart: By default, all of the independent variables are included in the chart. However, you can limit the number of independent variables that are displayed by setting a maximum in the drop-down menu. For example, if you have specified ten independent variables, but set the maximum to five, then only the first five variables are included in the chart. Note that this option only affects the number of independent variables displayed in the chart. It does not affect the values of the coefficients or confidence intervals.
Size of chart: The width and height of the chart (expressed in the number of pixels) can be modified. If there is a large (or a very small) number of independent variables, it may be helpful to increase (or decrease) the dimensions of the chart.

SDA Logit/Probit Regression Program

This program calculates the logit or probit regression coefficients for one or more independent or predictor variables.

Steps to take

Select type of regression to run

The choice is between logistic (logit) regression and probit regression. The difference is summarized below.

Specify variables

Select display options

After specifying the names of variables, select the display options you wish. These affect the statistics to compute, the number of decimals to show, and text to display.

Run Logit/Probit

After specifying all variables and options, select Run Logit/Probit to run the program.

Or you can select Clear Fields to delete all previously specified variables and options, so that you can start over.

Type of regression to run

The program can run either logistic (logit) or probit regression. The difference between them is in how the dependent variable is transformed from a proportion (a mean between 0 and 1).

Logistic regression reexpresses the dependent variable as the natural logarithm of the odds that a person will have a score of 1 versus a score of 0 on the dependent variable (the logit). Therefore, the regression coefficient (B) for each independent variable measures the effect of a one unit change in that variable on the logit of the dependent variable.
The exponential (or antilog) of each logistic regression coefficient is also output. This transformed coefficient expresses the effect of a one unit change in that independent variable on the odds that a person will have a score of 1 versus a score of 0 on the dependent variable. Note that this exponential transformation converts the additive regression coefficients into multiplicative terms.
Probit regression reexpresses the dependent variable as the inverse of the cumulative distribution function of the normal distribution corresponding to the proportion of persons having a score of 1 on the dependent variable (the probit). Therefore, the regression coefficient (B) for each independent variable measures the effect of a one unit change in that variable on the probit of the dependent variable.

When the dependent variable has only two categories, logistic and probit regression are more appropriate to use than ordinary least squares regression. Both logistic and probit regression will usually generate the same substantive results. The choice between them is generally a matter of custom within a specific field or discipline.
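
The two transformations can be sketched in Python with the standard library. This is only an illustration of the definitions above (log of the odds, and the inverse of the normal cumulative distribution function); the function names are hypothetical.

```python
import math
from statistics import NormalDist


def logit(p):
    """Natural logarithm of the odds p / (1 - p)."""
    return math.log(p / (1 - p))


def probit(p):
    """Inverse of the standard normal cumulative distribution function."""
    return NormalDist().inv_cdf(p)


def odds_multiplier(b):
    """Exponential of a logistic coefficient: the factor applied to the odds."""
    return math.exp(b)


for p in (0.25, 0.50, 0.75):
    print(p, round(logit(p), 3), round(probit(p), 3))

# A logistic coefficient of ln(2) doubles the odds for a one unit increase.
print(round(odds_multiplier(math.log(2)), 3))  # 2.0
```

Both transformations map a proportion of .5 to zero and are symmetric around it, which is one reason the two methods usually lead to the same substantive conclusions.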

REQUIRED variable names

Dependent variable

Enter the name of one numeric variable to be used as the dependent variable or the variable to be predicted. In order for this variable to be used as a dependent variable in logit or probit regression, it must be coded to have exactly two categories: 0 and 1.

If the variable you want to use as a dependent variable is not already coded as a simple 0/1 variable, you can create a dummy variable, or you can recode the variable temporarily.

If the dependent variable is left as anything other than a simple 0/1 variable, the program will recode the dependent variable automatically. The lowest valid score will be recoded to the value '0', and all other scores will be recoded to the value '1'.
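
The automatic recoding rule can be sketched in Python as follows. The function name is hypothetical, and the use of None for missing data is an assumption of the sketch.

```python
def recode_dependent(values):
    """Lowest valid score becomes 0; every other valid score becomes 1."""
    valid = [v for v in values if v is not None]
    lowest = min(valid)
    return [None if v is None else (0 if v == lowest else 1) for v in values]


print(recode_dependent([1, 2, 5, 1, None]))  # [0, 1, 1, 0, None]
```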

Independent variables

Create a Temporary Dummy Variable

A dummy variable is a dichotomous variable coded 0 or 1. Cases that have a certain characteristic are coded as 1; whereas cases that do NOT have the characteristic are coded as 0.

To create such a variable temporarily, for a single regression run, for example, use the following syntax:

varname(d:1-3)

You can give the '1' category of the dummy variable a label by putting the label in double quotes or in square brackets:

occupation(d:1, 3-5, 9, 10 "Managerial occupations = 1")

Create Multiple Temporary Dummy Variables

party(m:4)

One or more single code values or ranges can be specified as the base category. Multiple codes or ranges are separated by a comma, as in this example:

education(m:1-8,14,15)

If you want to create dummy variables for every category except the category with the highest valid numeric code, you can designate '*' as the base category. For example:

party(m:*)

Product terms

An independent variable can be the product of two or more variables.

To create such a variable temporarily, for a single regression run for instance, use an asterisk (*) between the component variable names. For example:

age*education

One or more dummy variables can also be part of a product term. For example, the following form is acceptable:

party(d:3)*sex

In this example, first a dummy variable is created from the variable 'party', and then that dummy variable is multiplied by 'sex'. Note that this syntax does not work with multiple dummy variables created like 'party(m:*)'. It only works with single dummy variables.

OPTIONAL variable names

Selection filter variable(s): Some cases are included in the analysis; others are excluded.
Weight variable: Cases are given different relative weights in calculating the regression coefficients.

How to exclude cases with missing data

Listwise exclusion: If a case has a missing-data value on ANY of the variables included in the logit or probit regression, it is excluded from ALL of the regression calculations. This is the only allowed procedure. The pairwise option available for the correlation program is not available for the regression programs.

Additional Statistics to Calculate

T-test for each coefficient

The t-test for each logit or probit regression coefficient is generally displayed. (Uncheck the box to suppress this output.) The t-statistic is the ratio of the regression coefficient (B) to its standard error -- shown as SE(B). (For a discussion of the calculation of standard errors for complex samples, see the document on methods used by SDA for computing standard errors for complex samples.)

The probability estimate associated with each t-statistic is given in the last column. This is the probability of obtaining a regression coefficient (B) that is this large or larger, if the true coefficient is equal to zero in the population from which the current sample was drawn.

Exponential of the logistic regression coefficient (B)

The exponential (or antilog) of each logistic regression coefficient is usually displayed. (Uncheck the corresponding box if you want to suppress this output.) This transformed coefficient expresses the effect of a one unit change in that independent variable on the odds that a person will have a score of 1 versus a score of 0 on the dependent variable. Note that this exponential transformation converts the additive regression coefficients into multiplicative terms. Each exponential coefficient has the same significance level as the logistic coefficient on which it is based.

Probability Differences

The logit or probit coefficients can be a little difficult to interpret. This option converts the logit or probit coefficients to the scale of probabilities, to show how much each independent variable contributes to an increase in the probability that the dependent variable is predicted to be '1' rather than '0' under the specified regression model when all of the independent variables are at their mean values.

The statistics produced for each independent variable depend on the variable’s type:

Two statistics are output for interval data variables:
- For a ONE UNIT increase in THIS independent variable, how much does the probability increase that the dependent variable is '1' rather than '0', if the OTHER independent variables remain at their mean value?
- For a ONE STANDARD DEVIATION increase in THIS independent variable, how much does the probability increase that the dependent variable is '1' rather than '0', if the OTHER independent variables remain at their mean value?
- The one unit increase and the one standard deviation increase can either be calculated from the mean as the starting point (uncentered) or from a half unit or half standard deviation below the mean to a half unit or half standard deviation above the mean (centered).
One statistic is output for dummy or multidummy variables:
- For a ONE UNIT increase in THIS independent variable, how much does the probability increase that the dependent variable is '1' rather than '0', if the OTHER independent variables remain at their mean value?
- The one unit increase is always calculated from 0 to 1 for a dummy or each of a set of multidummy variables. Note that when calculating the probability difference for one of the multidummy variables, the other categories of the multidummy set are all assigned 0, not their means, since the dummies constitute an exclusive set.
No statistics are output for product variables. Product variables do, however, contribute to the probability differences calculated for their constituent variables.

If this option is selected, the univariate statistics will automatically be selected as well, to show the mean and standard deviation of each variable.
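
For interval variables in a logit model, the uncentered and centered probability differences described above can be sketched in Python. This is a simplified illustration, not the program's own calculation: it assumes a logit model with no product terms and no dummy variables, and the coefficients, means, and function names are all hypothetical.

```python
import math


def inv_logit(z):
    """Convert a logit back to a probability."""
    return 1 / (1 + math.exp(-z))


def prob_difference(constant, coefs, means, name, step=1.0, centered=False):
    """Change in predicted P(y = 1) for a 'step' increase in one variable,
    holding the other independent variables at their means."""
    def predict(x_value):
        z = constant + sum(
            b * (x_value if k == name else means[k]) for k, b in coefs.items()
        )
        return inv_logit(z)

    # Uncentered: start at the mean. Centered: start half a step below it.
    start = means[name] - step / 2 if centered else means[name]
    return predict(start + step) - predict(start)


coefs = {"age": 0.04, "education": 0.10}
means = {"age": 40.0, "education": 12.0}

one_unit = prob_difference(-2.0, coefs, means, "age")
one_unit_centered = prob_difference(-2.0, coefs, means, "age", centered=True)
one_sd = prob_difference(-2.0, coefs, means, "age", step=15.0)  # assumed SD of 15
print(round(one_unit, 4), round(one_unit_centered, 4), round(one_sd, 4))
```

For a probit model, the same sketch would replace inv_logit with the normal cumulative distribution function.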

Summary Statistics

The log of the likelihood statistic is displayed after the regression coefficients and t-tests. This statistic is an indicator of the goodness of fit of the model and is used to calculate the pseudo-R-squared statistic.

A pseudo-R-squared statistic is also displayed. It is calculated as 1 - (LL1 / LL0), where:

LL0 is the log likelihood of the model including only the constant term (and no independent variables), and
LL1 is the log likelihood of the model that includes all of the independent variables as well as the constant term.

This version of the pseudo-R-squared statistic is often referred to as "McFadden's R-squared" or the "Likelihood ratio index." It varies between 0 and (somewhat close to) 1.

The pseudo-R-squared statistic is (roughly) analogous to the R-squared statistic in ordinary least squares regression, which expresses the proportion of variance in the dependent variable explained by the entire set of independent variables. This pseudo-R-squared statistic, however, will be smaller than the R-squared in an ordinary regression, and it is not comparable across datasets. It is best used to compare regressions with different sets of independent variables within the same dataset.
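
The pseudo-R-squared formula given above, 1 - (LL1 / LL0), can be written out in Python. The log likelihood values below are invented for illustration, and the function name is hypothetical.

```python
def pseudo_r_squared(ll1, ll0):
    """1 - (LL1 / LL0): LL1 is the full model, LL0 the constant-only model."""
    return 1 - ll1 / ll0


# Log likelihoods are negative; a better-fitting model has LL1 closer to zero.
print(round(pseudo_r_squared(-80.0, -100.0), 3))   # 0.2
print(round(pseudo_r_squared(-100.0, -100.0), 3))  # 0.0
```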

Global tests

In addition to the t-tests for the individual independent variables, tests are generally carried out on groups of variables. (Uncheck this box, if you want to suppress this output.)

The first group tested is the whole set of independent variables. A Wald F-statistic for ALL the independent variables is computed. The p-value (probability value) for the F-statistic is given in the last column of the table. This is the probability of obtaining an F-statistic this large or larger, if ALL of the regression coefficients (B) are equal to zero in the population from which the current sample was drawn.
If the p-value for the test is low (about .05 or less), the chances are correspondingly low that ALL of the observed effects of the independent variables on the dependent variable are only due to sampling error. Nevertheless, a low p-value does not indicate that any specific independent variable has an effect on the dependent variable. The separate t-test for each independent variable should be examined for that purpose. (However, if there is only one independent variable, the t-test for that variable will give the same p-value as the Global F-test.)
If any sets of multiple dummy variables have been created, using the 'var(M:*)' syntax, each resulting group of dummy variables created from a single variable is also tested. As with the set of ALL independent variables, the Wald F-test is used to test whether ALL of the regression coefficients corresponding to the dummy variables in the group are equal to zero in the population from which the sample was drawn.

Confidence intervals

For logit coefficients, two confidence intervals are shown. The first is for the logit coefficient itself. The second confidence interval is for the exponential (antilog) of the logit coefficient. This second confidence interval is created by taking the exponential of each upper and lower bound of the confidence interval for the logit coefficient.
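
The construction of the second interval can be sketched in Python. The bounds used here are invented for illustration, and the function name is hypothetical.

```python
import math


def exp_interval(low, high):
    """Confidence interval for Exp(B): the exponential of each bound for B."""
    return math.exp(low), math.exp(high)


b, low, high = 0.50, 0.108, 0.892   # hypothetical logit coefficient and bounds
exp_low, exp_high = exp_interval(low, high)
print(round(exp_low, 3), round(math.exp(b), 3), round(exp_high, 3))
```

Because the exponential is applied to each bound separately, the resulting interval is not symmetric around Exp(B), even though the interval for B itself is symmetric.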

Univariate statistics

Univariate statistics for each of the variables will be computed and displayed, if this option is selected. The statistics displayed for each variable include its mean and standard deviation.

Product of B and the univariate statistics

For each independent variable, the product of its regression coefficient (B) with its mean and its standard deviation can be displayed.

The product of each 'B' and the mean can be thought of as the average effect of this independent variable on the dependent variable.
The product of each 'B' and the standard deviation can be thought of as the effect of an increase of one standard unit of this independent variable on the magnitude of the logit or probit of the dependent variable.

If this option is selected, the univariate statistics will automatically be selected as well, and the products will be displayed as additional columns in that table.

Other Display Options

Color coding of the coefficients

The color coding can be turned off, if you prefer. Color coding may not be helpful if you intend to print out the regression results on a black-and-white printer.

Chart Options for Logit/Probit Regression

A chart showing the values of the regression coefficients and their confidence intervals can be displayed.

Coefficients to chart: Select which coefficients will be charted. You can choose either 'B', or 'Exp(B)' (if Logit regression is specified), or (if 'Probability differences' are specified) 'P-Diff 1 unit' or 'P-Diff 1 SD'. If you do not want a chart, select '(No chart)'.
Range to display: By default, the range of the x-axis for the chart will be automatically adjusted, depending on the confidence intervals that will be displayed. Usually this works well. However, you can manually set the low and high bounds of the chart's x-axis if you prefer. Select "Custom range" in the menu, then enter the low and high bounds of the range in the input boxes that appear. If you leave either the "Low" box or the "High" box blank, the automatically adjusted value will be used for that bound.
Maximum number of independent variables to include in chart: By default, all of the independent variables are included in the chart. However, you can limit the number of independent variables that are displayed by setting a maximum in the drop-down menu. For example, if you have specified ten independent variables, but set the maximum to five, then only the first five variables are included in the chart. Note that this option only affects the number of independent variables displayed in the chart. It does not affect the values of the coefficients or confidence intervals.
Size of chart: The width and height of the chart (expressed in the number of pixels) can be modified. If there is a large (or a very small) number of independent variables, it may be helpful to increase (or decrease) the dimensions of the chart.

SDA Program to List Values of Individual Cases

This program lists the values of individual cases on variables specified by the user. Values of a numeric variable can also be transformed into percents of a second numeric variable. This is particularly useful when the cases in the data file are aggregate units such as cities.

One or more filter variables are used to limit the listing to a subset of the cases. In general a limit of 500 cases is enforced for each listing, in case the user has forgotten to limit the listing with sufficient filter variables.

Steps to take

Specify variables to list: To specify that a certain survey question or variable is to be included in the listing, enter into one of the text boxes the name for that variable, as given in the documentation for the study. You can also request a percent to be displayed.
Specify one or more filter variables: Selection filter variables are used to limit the listing to a subset of cases. Except for very small datasets, a filter variable will almost always be required.
Select display options: After specifying the names of variables, select the display options you wish. These affect how to display numeric variables and whether or not to display the text of each variable.
Start the listing: After specifying all variables and options, select Start Listing to begin the program.
Or you can select Clear fields to delete all previously specified variables and options, so that you can start over.

Variables to list

To specify that a certain survey question or variable is to be included in the listing, enter into one of the text boxes the name for that variable, as given in the documentation for the study.

Percentages

Aside from simply specifying the name of a variable, it is possible to convert a number into the percent of another variable. (Both variables must be numeric variables.) This is particularly useful when the cases in the data file are aggregate units such as cities.

To calculate and display a percent, use the following formats, beginning with $p, instead of a simple variable name:

$p(var1, var2): This will display the value 100 * var1 / var2 (using 1 decimal place), where 'var1' and 'var2' are variables in the dataset. It is not necessary that either 'var1' or 'var2' be specified separately for listing.
$p(var1, var2, 2): To display a percent using other than one decimal place, specify the desired number of decimal places after var2. The example above would use 2 decimal places.
$p(demo, totvote, "Percent Voted Democrat"): To give your own name to the percentage created, put the name you want within double quotes. This name will be displayed at the top of the column for that percentage.
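
The arithmetic behind the '$p' formats can be sketched in Python. The function name is hypothetical, and returning None when a value is missing or the denominator is zero is an assumption of the sketch, not documented behavior.

```python
def percent(var1, var2, decimals=1):
    """100 * var1 / var2, rounded to the requested number of decimal places."""
    if var1 is None or var2 is None or var2 == 0:
        return None  # assumed handling of missing or unusable values
    return round(100 * var1 / var2, decimals)


print(percent(4200, 10000))  # 42.0, comparable to $p(var1, var2)
print(percent(1, 3, 2))      # 33.33, comparable to $p(var1, var2, 2)
```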

Selection filter variables: After specifying the names of the variables to list, select the filter variable(s) in order to specify which cases to list. Since data files generally have a large number of cases, it is very important to limit the listing to a subset of the cases. The usual options for specifying filter variable(s) are available.
To avoid accidental attempts to list large numbers of cases, the program suppresses any listing that would exceed a certain number of cases. The default limit is 500 cases, but that limit can be modified when the datasets are set up in the Web archive.

Summaries of each variable listed

For each numeric variable listed, you can obtain summaries of the values for the selected cases in the listing. These summaries exclude missing-data or out-of-range values.

The available summaries are:

Sum of the values
Mean of the values
Minimum value listed
Maximum value listed

For a percentage (created with the '$p' command), the summaries, if requested, will be calculated as follows:

Sum: calculated from the sums of the two variables
Mean: the mean of the percentages in a column
Minimum: the smallest valid value in a column
Maximum: the greatest valid value in a column
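
The following Python sketch shows one possible reading of these rules. It assumes that the "Sum" entry is 100 * (sum of var1) / (sum of var2), and that cases with a missing value on either variable, or a zero denominator, are skipped. The function name and the sample data are hypothetical.

```python
def percent_summaries(cases, num, den):
    """Summaries for a percentage column built from two numeric variables."""
    # Keep only cases where both values are usable (assumption: None marks
    # missing data, and a zero denominator is treated as unusable).
    valid = [(c[num], c[den]) for c in cases
             if c[num] is not None and c[den] not in (None, 0)]
    if not valid:
        return None
    pcts = [100 * a / b for a, b in valid]
    return {
        # Assumed reading of "calculated from the sums of the two variables".
        "sum": 100 * sum(a for a, _ in valid) / sum(b for _, b in valid),
        "mean": sum(pcts) / len(pcts),  # mean of the listed percentages
        "min": min(pcts),
        "max": max(pcts),
    }


cases = [
    {"demo": 50, "totvote": 100},    # 50.0 percent
    {"demo": 30, "totvote": 300},    # 10.0 percent
    {"demo": None, "totvote": 200},  # skipped: missing numerator
]
print(percent_summaries(cases, "demo", "totvote"))
```

Note that under this reading the "Sum" entry (20.0 here) differs from the mean of the percentages (30.0), because larger units count for more in the ratio of sums.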

How to display variables (You may select one of the following options:)

Code value
The code value of each variable is what is stored in the dataset for each case. For numeric variables, the code value is a number (except for character missing-data values). For character variables, the code value is the string of characters that has been stored for each case.
Category label
If a category label has been defined for a specific code value, it is often more helpful to display the label rather than the code value.
However, if this option has been selected and no label has been defined for a specific code value, the code value itself will be displayed.
BOTH code value and category label (This is the default option)
Under this option, the code value is always displayed, followed by the category label if one has been defined.
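
The three display options, including the fallback to the code value when no label exists, can be sketched in Python as follows. The function name, the mode names, and the exact spacing between code and label are assumptions for illustration only.

```python
def display(code, labels, mode="both"):
    """Format one value as a code, a label, or both (the default)."""
    label = labels.get(code)
    if mode == "code" or label is None:
        # With no label defined, every option falls back to the code value.
        return str(code)
    if mode == "label":
        return label
    return f"{code} {label}"  # assumed layout: code, then label


gender_labels = {1: "Male", 2: "Female"}  # hypothetical category labels

print(display(1, gender_labels))           # both
print(display(2, gender_labels, "label"))  # label only
print(display(9, gender_labels, "label"))  # no label defined for code 9
```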

Color coding on the output: In the listing of the values of each variable, the coloring of the headings can be suppressed if desired. This may be useful if you intend to print the output on a black and white printer.

Features Common to All Analysis Programs

Multiple variable names

More than one name may be entered for variables to be analyzed, such as for the row and the column variables. The names should be separated by a comma or blanks. Separate analyses for each combination of variables will be generated.

For example, the following specifications would generate six separate tables:

Row variables: spend spend2
Column variables: gender, education income
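
The count of six follows from pairing every row variable with every column variable (2 x 3). A minimal Python sketch of that pairing is shown below; the helper 'split_names' is hypothetical and handles only plain names, not names followed by options in parentheses.

```python
import re
from itertools import product


def split_names(spec):
    """Split a list of plain variable names on commas and/or blanks."""
    return [name for name in re.split(r"[,\s]+", spec.strip()) if name]


rows = split_names("spend spend2")
cols = split_names("gender, education income")

# One table per (row, column) combination.
tables = list(product(rows, cols))
for row_var, col_var in tables:
    print(f"{row_var} by {col_var}")
print(len(tables), "tables")
```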

Restricting the valid range

The name of each analysis variable can be followed, in parentheses, by a list of values to be included in the analysis.

Basic range restriction: A single value such as 'gender(2)' or a range of codes such as 'age(30-50)' will limit the analysis to cases having those codes.

Multiple ranges and codes may be specified. For example: age(1-17, 25, 95-100)

Open-ended Ranges using '*' and '**'

In a range, one asterisk '*' can be used to signify the lowest or highest VALID value.
For example: age(*-25,75-*)
This would include all VALID values less than or equal to 25 and all VALID values greater than or equal to 75. However, any missing-data values within those ranges would still be excluded.

In a range, two asterisks '**' can be used to signify the lowest or highest NUMERIC value, regardless of whether or not the codes are defined as missing data.
For example: age(50-**)
This would include ALL numeric values greater than or equal to 50, including data values like 98 or 99, even if they had been defined as missing-data codes. Note that '**' cannot be used alone (without '-') as a range specification. If you want to include all NUMERIC codes, you can use the range '(**-**)'.
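
The difference between '*' and '**' can be sketched in Python as follows. This is an illustration only, not SDA's parser: it assumes non-negative numeric codes, takes the set of missing-data codes as an argument, and assumes that a missing-data code passes only through a range that uses '**' (the text above does not say how fully explicit ranges treat such codes).

```python
import math


def passes(value, spec, missing=frozenset()):
    """Return True if a numeric code passes a range list such as '*-25,75-*'."""
    for part in spec.split(","):
        low_txt, dash, high_txt = (p.strip() for p in part.partition("-"))
        if not dash:
            if low_txt == "**":
                raise ValueError("'**' cannot be used alone; use '**-**'")
            high_txt = low_txt  # a single code such as '25'
        keeps_missing = "**" in (low_txt, high_txt)
        low = -math.inf if low_txt in ("*", "**") else float(low_txt)
        high = math.inf if high_txt in ("*", "**") else float(high_txt)
        if low <= value <= high and (keeps_missing or value not in missing):
            return True
    return False


missing_codes = {98, 99}  # hypothetical missing-data codes for 'age'

print(passes(20, "*-25,75-*", missing_codes))  # valid value below 25
print(passes(99, "*-25,75-*", missing_codes))  # missing code, '*' range
print(passes(99, "50-**", missing_codes))      # missing code, '**' range
```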

Temporarily Transforming a Variable

A numeric variable can be transformed temporarily, for purposes of running the current analysis. There are five types of temporary transformations:

Recode a variable
Collapse a variable into fewer categories
Create a dummy variable (a dichotomy coded 0 or 1)
Create multiple dummy variables from a single variable
Create a product variable (for correlation and regression)

Temporarily Recode a Variable

Temporary recodes are created by specifying groups of codes that are to be combined into a single category. This type of transformation can be very simple, but certain options can make it a little more complex. These are the possibilities:

Basic recoding
Assigning particular new code values
Assigning labels to the new code values
Open ranges (with '*' or '**')
Overlapping ranges
Multiple specifications for one recoded group
Treatment of missing data

Basic recoding

For example, to combine the categories of 'age' into three groups, you can specify the variable as:
age(r: 18-30; 31-50; 51-95)
Notice that the name of the variable ('age') is followed by parentheses, then the instruction 'r' (or 'R') followed by a colon (':'), and then the groupings of codes. Those groupings can consist of single code values, ranges, or a combination of many values and/or ranges. Each group is separated from the other by a semicolon (';'). Spaces are optional, but are added here for readability.

Using this basic method of recoding, the new groupings of codes are given the default code values 1, 2, 3, and so forth. The default label for each group is the range of original codes that constitute that group ("18-30", for example).

Any categories of 'age' not included in the specified groupings will become missing-data on the recoded version, and they will be excluded from the analysis in the table.

On the other hand, any original missing-data categories of 'age' that are explicitly mentioned in the recode will be included. For instance, if the value '90' for 'age' were flagged as a missing-data code, but included as in the example above, it would become part of the third recoded category. This is discussed in more detail in the section on "Treatment of missing data."
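
The behavior of a basic recode can be sketched in Python as follows: groups receive the default codes 1, 2, 3 in the order written, the default label is the range itself, and any value outside every group becomes missing. The function name and the use of None for missing data are assumptions of this sketch.

```python
def recode(value, groups):
    """Return (new_code, label) for a value, or (None, None) if unmatched.

    groups: list of (low, high) tuples in the order written in the recode.
    """
    for new_code, (low, high) in enumerate(groups, start=1):
        if low <= value <= high:
            return new_code, f"{low}-{high}"
    return None, None  # not in any group: becomes missing data


age_groups = [(18, 30), (31, 50), (51, 95)]  # age(r: 18-30; 31-50; 51-95)

print(recode(25, age_groups))  # first group
print(recode(90, age_groups))  # third group, because 90 is explicitly covered
print(recode(17, age_groups))  # outside all groups
```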

Assigning particular new code values

It is possible to assign new code values that are different from the default 1, 2, 3, and so forth. To do this, give the new code value, then an equal sign, then the grouping. (The new code value must be a whole number, and decimal places will be ignored. If you want the new code value to include decimal places, use the regular SDA RECODE program.)

For example, the variable 'age' can be recoded into the same three groups as above, but with the new code values 1, 5, and 10, by specifying the recode as follows:
age(r: 1 = 18-30; 5 = 31-50; 10 = 51-95)

For column, row, or control variables it will not usually matter what the new code values are. For variables on which statistics are computed, however, the new code values will affect the value of those statistics.

Assigning labels to the new code values

To assign your own label to a new grouping of code values, place the label in double quotes after the group codes, but before the semicolon. There is no set limit on the length of these labels; however, very long labels may distort the formatting of the tables.

For example, you can assign labels to the recoded categories of race by using the following specification:
race(r: 800-869 "White"; 870-934 "Black"; 600-652, 979-982 "Asian")

These labels will appear in the table, in place of the range of original codes that constitute that group. Nevertheless, the recode specifications will still be documented. A summary is always given at the bottom of the table.

Open ranges (with '*' or '**')

If you are not sure of the ranges of the variable to be recoded, you can specify an open range with an asterisk ('*'). A single asterisk matches the lowest or highest VALID code in the data for that variable.

For example, the 'age' recode could be specified as: age(r: *-30; 31-50; 51-*)
Using this method, all valid age values up to 30 would go into the first recoded group. And all valid age values of 51 or older would go into the third group.

If you want to use a range that includes NUMERIC codes that were defined as missing-data values, you can specify the range with two asterisks ('**') instead of one.

For example, the 'age' recode could be specified as: age(r: *-30; 31-50; 51-**)
Using this method, all valid age values up to 30 would go into the first recoded group. But every numeric value of 51 or greater would go into the third group, including codes like 99 that may have been defined as missing-data codes.

For more discussion about including codes that have been defined as missing-data codes, see the section on "Treatment of missing data."

Overlapping ranges

If the same original code value is mentioned in two or more groupings, it is recoded the FIRST time that the value is encountered.
For example, the following two specifications have the same effect:
age(r: 18-30; 30-50; 50-90), and
age(r: 18-30; 31-50; 51-90)
In both cases, the original 'age' value of 30 ends up in the first group, and the original 'age' value of 50 ends up in the second group.

Notice that order is important with overlapping ranges. The following specification will NOT have the same effect as the preceding two:
age(r: 3= 50-90; 2= 30-50; 1= 18-30)
In this example, the 'age' value of 50 will end up in the recode group with the value '3' (instead of in the second group), and the 'age' value of 30 will end up in the recode group with the value '2' (instead of in the first group).

Multiple specifications for one recoded group

It may sometimes be useful to have more than one specification for a new recoded group. This can be done by specifying the desired outcome code more than once.
For example, to have race recoded into two categories, with the first category including everyone EXCEPT those originally coded as '2', you could use the following specification:
race(r: 1=1 "Non-black"; 2=2 "Black"; 1=3-20)
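
Both the first-match rule for overlapping ranges and the reuse of one outcome code in several specifications can be sketched with a single Python function. The function name and the tuple layout are assumptions for illustration; the group lists mirror the examples above.

```python
def recode_first_match(value, groups):
    """groups: list of (new_code, low, high); the FIRST matching group wins."""
    for new_code, low, high in groups:
        if low <= value <= high:
            return new_code
    return None  # unmatched values become missing data


# age(r: 18-30; 30-50; 50-90) with the default codes 1, 2, 3
ascending = [(1, 18, 30), (2, 30, 50), (3, 50, 90)]
# age(r: 3= 50-90; 2= 30-50; 1= 18-30)
descending = [(3, 50, 90), (2, 30, 50), (1, 18, 30)]
# race(r: 1=1 "Non-black"; 2=2 "Black"; 1=3-20)
race = [(1, 1, 1), (2, 2, 2), (1, 3, 20)]

print(recode_first_match(30, ascending), recode_first_match(50, ascending))
print(recode_first_match(30, descending), recode_first_match(50, descending))
print([recode_first_match(code, race) for code in (1, 2, 7)])
```

Because the groups are checked in the order written, reversing the order changes where the boundary values 30 and 50 land, while the repeated outcome code 1 in the race example simply sends two separate sets of original codes to the same new category.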

Treatment of missing data

NUMERIC codes that have been defined as missing data on the original variable can be included in one of the categories of the recoded variable in two ways.

The first method is to mention the code explicitly, either as a single value or as part of a range. For example, if the 'age' value of 99 has been defined as a missing-data code, it can still be included by either of the following specifications:
age(r: 18-30; 31-50; 51-90; 99), or
age(r: 18-30; 31-50; 51-100)
In the first case the code 99 will become its own fourth recode category. In the second case, it will be included as part of the third category.

A second method to include NUMERIC missing data codes is to use an open range with two asterisks ('**') instead of one. For example, the following specification will include all numeric codes above 50 as part of the third recoded group:
age(r: 18-30; 31-50; 51-**)

Note that at present there is no way to include in a temporary recode the system-missing value or a character missing-data value (like 'D' or 'R'). You must use the regular recode program to handle those special missing-data codes. (Your data archive may or may not have enabled that program to run on your current dataset.)

Temporarily Collapse a Variable into Fewer Categories

A simple way to recode a variable into fewer categories is to "collapse" the variable, using a fixed interval.

Collapse syntax

For example, to collapse the variable 'age' into 10-year categories, you can specify the variable as:
age(c: 10, 1)
Notice that the name of the variable ('age') is followed by parentheses, then the instruction 'c' (or 'C') followed by a colon (':'), and then the interval, a comma, and the starting point. Spaces are optional, but are added here for readability.

Using this simple method of collapsing, the new groupings of codes are given the code values 1, 2, 3, and so forth. The label for each group is the range of original codes that constitute that group ("21-30", for example).

Effect of the starting point

The specified starting point affects the range. If the starting point is '1', the age ranges will be: 1-10, 11-20, 21-30, etc. On the other hand, if the starting point is '0', the age ranges will be: 0-9, 10-19, 20-29, etc.

If the starting point is HIGHER than the lowest actual value in the data, the values lower than the starting point become missing-data. For example, with a starting point of '21', any lower values of 'age' (like 18, 19, and 20) would not be included in a range and would become missing-data.

If the starting point is LOWER than the actual minimum value in the data, the ending point of each range is not affected. However, the first range includes only the valid values in that range, if any. For example, if the starting point for collapsing 'age' is '1', with an interval of '10', but the lowest valid value in the data is '18', then the age ranges will be: 18-20, 21-30, 31-40, etc.

The highest range is affected by the highest valid value in the data. For example, if the highest valid value for 'age' is '97', and the starting point is '1' and the interval is '10', the highest intervals will be: 71-80, 81-90, 91-97.
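
The effect of the interval, the starting point, and the lowest and highest valid values can be sketched in Python as follows. The function is hypothetical and assumes integer codes; it returns only the ranges, since the text above does not say which code value the first clipped range receives.

```python
def collapse_ranges(interval, start, lowest_valid, highest_valid):
    """Return the (low, high) ranges produced by collapsing integer codes."""
    ranges = []
    low = start
    while low <= highest_valid:
        high = low + interval - 1
        if high >= lowest_valid:  # skip intervals entirely below the data
            # Clip the first and last ranges to the valid values present.
            ranges.append((max(low, lowest_valid), min(high, highest_valid)))
        low += interval
    return ranges


# age(c: 10, 1) when the valid ages run from 18 to 97
print(collapse_ranges(10, 1, 18, 97))

# A starting point of 21 leaves ages 18-20 outside every range (missing).
print(collapse_ranges(10, 21, 18, 97)[0])
```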

Treatment of missing-data in a collapse

The intervals created by the collapse procedure will exclude missing-data codes that are either above or below the valid codes. Character missing-data codes (like 'D' or 'R') will also be excluded.

A numeric missing-data code that happened to fall in between valid codes, however, would be included in the range that covers that code. For example, if '0' were defined as missing-data, but both '-1' and '+1' were actual valid codes, '0' would be included in one of the ranges.

Optional variables

Control variables
Selection filter variables
Weight variable

Control variables (for table-generating programs)

A separate table is produced for each category of a control variable. If charts are being generated, a separate chart is also produced for each category of the control variable.

For example, if the control variable is gender, there will be one table for men alone and then one table for women alone. A table will also be produced for the total of all valid categories of the control variable (e.g., men and women combined).

Only one variable at a time can be used as a control variable. If more than one control variable is specified, a separate set of tables (and charts) will be generated for each control variable.

Selection filter variables

Selection filters are used in order to limit an analysis to a subset of the cases in the data file. This is done by specifying one or more variables as selection filters, and by indicating which codes of those variables to include.

Note that it is also possible to limit the table to a subset of the cases by restricting the valid range of any of the other variables. But when the desired subset of cases is defined by a variable that is not one of the variables in the table or analysis, you must use filter variables.

Numeric variables as selection filters

Basic filter use

The name of each filter variable is followed, in parentheses, by a single value such as 'gender(2)' or a range of codes such as 'age(30-50)', to limit the analysis to cases having those codes.

Multiple ranges and codes may be specified

For example: age(1-17, 25, 95-100)

Multiple filter variables

If you specify more than one filter variable, a case must satisfy ALL of the conditions in order to be included in the table.
For example: gender(1), age(30-50)
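
The requirement that a case satisfy ALL filter conditions can be sketched in Python as follows. The case records, the meaning of the gender codes, and the function name are hypothetical.

```python
def passes_all(case, filters):
    """filters maps each filter variable to the set of codes that pass."""
    return all(case.get(var) in codes for var, codes in filters.items())


# Equivalent of the filter specification: gender(1), age(30-50)
filters = {"gender": {1}, "age": set(range(30, 51))}

cases = [
    {"id": 1, "gender": 1, "age": 35},  # passes both conditions
    {"id": 2, "gender": 2, "age": 40},  # fails the gender condition
    {"id": 3, "gender": 1, "age": 62},  # fails the age condition
]

selected = [case["id"] for case in cases if passes_all(case, filters)]
print(selected)
```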

Open-ended Ranges using '*' and '**'

A single asterisk, '*', can be used to specify that all cases with VALID codes for a variable will pass the filter.
For example: age(*) includes all cases with valid data on the variable 'age'.

In a range, the '*' can be used to signify the lowest or highest VALID value. For example: age(*-25,75-*). This filter would include all VALID values less than or equal to 25 and all VALID values greater than or equal to 75. However, any missing-data values within those ranges would still be excluded.

In a range, two asterisks '**' can be used to signify the lowest or highest numeric value, regardless of whether or not the codes are defined as missing data. For example: age(50-**) would include ALL numeric values greater than or equal to 50, including data values like 98 or 99, even if they had been defined as missing-data codes. However, any character missing-data values would still be excluded. Note that '**' cannot be used alone in a filter variable. It can only be used as part of a range.

Character variables as selection filters

The syntax for specifying character variable filters is similar to the syntax for numeric variables but with a few differences. Like numeric variable filters, character variable filters specify the variable name followed by the filter value(s) in parentheses.
For example: city( Atlanta )

Multiple filter values

Multiple filter values can be specified, separated by spaces or commas:
city( Chicago,Atlanta Seattle)

Character variable filters are case-insensitive

For example, the following filters are functionally identical:
city( Atlanta )
city( ATLANTA )
city( AtLAnta )

Spaces, commas, and quotation marks

If a filter value contains internal spaces or commas, it must be enclosed in matching quotation marks (either single or double):
city( "New York" )
state("Cal, Calif")

A filter value containing a single quote (apostrophe) can be specified by enclosing it in double quotes:
city( "Knot's Landing" )

Or, conversely, a filter value containing double quotes can be specified by enclosing it in single quotes:
name( 'William "Bill" Smith' )

Leading and trailing spaces, and multiple internal spaces, are NOT significant. The following filters are all functionally equivalent:
city( "New York    " )
city( "New    York" )
city( "   New York    " )
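
The matching rules for character filters (case-insensitive, with leading, trailing, and repeated internal spaces ignored) can be approximated in Python as follows. The function names are hypothetical, and 'casefold' is used here only as a stand-in for whatever case comparison SDA actually performs.

```python
def normalize(text):
    """Collapse runs of whitespace, trim the ends, and ignore case."""
    return " ".join(text.split()).casefold()


def matches(value, filter_values):
    """Return True if the stored value equals any of the filter values."""
    return normalize(value) in {normalize(v) for v in filter_values}


print(matches("Atlanta", ["ATLANTA"]))
print(matches("New York", ["   New    York  "]))
print(matches("Boston", ["Chicago", "Atlanta", "Seattle"]))
```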

Ranges are NOT allowed

Note that ranges, which are legal for numeric variables, are not allowed for character variables:
The following syntax is NOT legal: city( Atlanta-Seattle)

Weight variable

Depending on the design and implementation of the study, it may be appropriate to give some cases more weight than others in computing frequency distributions and statistics. To do this, specify the variable that contains the relative weight for each case as the weight variable. The documentation for the study should explain the reasons for using a weight variable, if there is one, and what its name is.
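
As a general illustration of what weighting does to a frequency distribution, the following Python sketch adds up each case's weight instead of counting each case once. It is not SDA's own computation; the variable names 'gender' and 'wt' and the sample cases are hypothetical.

```python
from collections import defaultdict


def weighted_percentages(cases, var, weight):
    """Percent distribution of 'var', with each case counted by its weight."""
    totals = defaultdict(float)
    for case in cases:
        totals[case[var]] += case[weight]
    grand_total = sum(totals.values())
    return {code: 100 * total / grand_total for code, total in totals.items()}


cases = [
    {"gender": 1, "wt": 1.5},
    {"gender": 2, "wt": 0.5},
    {"gender": 1, "wt": 1.0},
    {"gender": 2, "wt": 1.0},
]

# Unweighted, each gender is 50 percent of the cases; weighted, it is not.
print(weighted_percentages(cases, "gender", "wt"))
```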

SDA studies can be set up with a weight variable specified ahead of time so that the weight variable is used automatically. Other studies may be set up with a drop-down list of choices to be presented to the user, who then selects one of the available weight variables (or no weight variable, if that option is included in the list). If no weight variables have been pre-specified, the user is free to enter the name of an appropriate variable to be used as a weight.

Question text or variable description

All of the descriptive text available for each variable included in the analysis will be appended to the bottom of the results, if you select this option.

The usual text available for a variable is the text of the question that produced the variable, provided that the text was included in the study documentation. Sometimes other explanatory text has been included.

If the variable was created by the 'recode' or the 'compute' program, the commands used to create the new variable are included in the descriptive text.

Title or label for this analysis

On the option screen for an analysis program, you can enter a title or a label for this analysis. If a title is specified, it will appear as the first line of the HTML output generated by the SDA program.

Save an analysis run

Occasionally you may want to save the options you've selected for a particularly complicated and/or significant analysis run.

Before saving an analysis run, you should first fine-tune your output by clicking on the "Run the Table" button and checking the HTML output. Once you've got the correct variables, filters, weight, cell statistics, etc., save the run options.

Saved runs will be listed on the "Saved Runs" tab. From that list you can choose a saved run so its option form will be displayed with the saved specifications already filled-in. You can then simply re-run the analysis or, alternatively, modify the options and save the changes. If you want to save a modified run while keeping the current one, just change the saved run name (and label) before saving.

If a saved run of the same name already exists it will NOT be replaced, unless the option to replace it is selected.

Specify a name for the saved run

A name for the saved run is required. Saved run names:

must contain only US-ASCII letters (lower or upper case), numbers, or underscores
cannot begin with a number or underscore
must be no longer than 32 characters
cannot be one of the Microsoft Windows reserved filenames: "CLOCK$", "CON", "PRN", "AUX", "NUL", "COM1", "COM2", "COM3", "COM4", "COM5", "COM6", "COM7", "COM8", "COM9", "COM0", "LPT1", "LPT2", "LPT3", "LPT4", "LPT5", "LPT6", "LPT7", "LPT8", "LPT9", "LPT0"
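
The naming rules above can be checked with a short Python sketch such as the following. The function name is hypothetical, and comparing against the reserved names without regard to case is an assumption of this example.

```python
import re

# Reserved names as listed above: five fixed names plus COM0-COM9, LPT0-LPT9.
RESERVED = ({"CLOCK$", "CON", "PRN", "AUX", "NUL"}
            | {f"COM{i}" for i in range(10)}
            | {f"LPT{i}" for i in range(10)})

# A letter first, then up to 31 more letters, digits, or underscores.
NAME_PATTERN = re.compile(r"[A-Za-z][A-Za-z0-9_]{0,31}")


def is_valid_run_name(name):
    """Return True if 'name' satisfies the saved-run naming rules."""
    if not NAME_PATTERN.fullmatch(name):
        return False
    return name.upper() not in RESERVED  # assumed case-insensitive check


for candidate in ["age_by_gender", "1st_run", "_draft", "con", "my run"]:
    print(candidate, is_valid_run_name(candidate))
```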

Specify a label for the saved run

You can specify an optional label for a saved run. Rather than creating an overly long and complicated name for a saved run, it is highly recommended that you instead supply a label that gives a meaningful description.

Replace saved run

If the name of the analysis run to be saved matches the name of an already-saved run, then that run cannot be saved unless the "Replace" checkbox is selected. If you are saving runs in a public workspace (shared with other users) please be kind: replace a saved run only if you created it.

Actions to take

After you specify variables and select the options you want, go to the bottom section of the form, and select one of two actions:

Run the Table (or Run a specific type of analysis): Select this when you have finished specifying the variables and options you want. The requested table (or other analysis) will then be generated by the server computer and displayed on your screen.
Clear Fields: Select this to delete all previously specified variables and options, so that you can start over.

List of Saved Analysis Runs

The saved analysis runs list includes the following features:

Paging: if the saved runs list is long it is divided into pages for display. Page navigation tools are provided at the top and bottom of the list.
Actions: saved runs can be run/modified or deleted. Clicking the 'Run/Modify' button will bring up the appropriate options form with the options already filled in. You can simply re-run the analysis or, alternatively, modify the options. Clicking the 'Delete' button (and confirming your intention) will permanently delete the saved run, so use some caution.
Filtering: enter search terms in the input boxes at the top of a column to dynamically limit the saved runs that are shown. A match occurs if the characters in the search term occur in any position within the target value. For example, the search term 'age' will match 'respondentage', 'agecategories' or 'management'.
Sorting: use the up and down arrows at the top of each column to sort the saved runs in ascending or descending order by name, label, program or date of creation.
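
The "match in any position" rule used by the filtering boxes amounts to a substring test, sketched below in Python. The function name and the sample list are hypothetical, and the comparison here is case-sensitive because the text does not say how letter case is treated.

```python
def filter_values(values, term):
    """Keep the values that contain the search term at any position."""
    return [value for value in values if term in value]


names = ["respondentage", "agecategories", "management", "income"]
print(filter_values(names, "age"))
```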
