Online Help for Analysis Programs - SDA 4.1

This file contains the online help that is available from inside each SDA analysis program. In addition to the help specific to each program, this file includes information on features common to all analysis programs.

CONTENTS

Help for Specific Analysis Programs

Features Common to All Analysis Programs


SDA Frequencies and Crosstabulation Program

This program generates the univariate distribution of one variable or the crosstabulation of two variables. If a control variable is specified, a separate table will be produced for each category of the control variable.

Steps to take

Specify variables
To specify that a certain survey question or variable is to be included in a table, use the name for that variable as given in the documentation for this study. Aside from simply specifying the name of a variable, certain variable options are available to change or restrict the scope of a variable.

Select display options
After specifying the names of variables, select the display options you wish. These affect percentaging, text to display, and statistics to show.

Select an action
After specifying all variables and options, select the action to take.

REQUIRED variable name

Row variable(s)
Variable down the side of the table

OPTIONAL variable names

Column variable(s)
Variable along the top of the table

Control variable(s)
A separate table is produced for each category of a control variable. If charts are being generated, a separate chart is also produced for each category of the control variable.

If more than one row, column and/or control variable is specified, a separate table (and chart) will be generated for each combination of variables.

Selection filter variable(s)
Some cases are included in the analysis; others are excluded.

Weight variable
Cases are given different relative weights.

Table Display Options for Crosstabulation

Cell Contents

Other Options


Percentaging
Defines which way to make the percents add up to 100 percent:

You can request more than one type of percentaging in a table, but such tables are hard to read.

It is important to understand that if a weight variable has been specified, the percentages and the statistics are always computed using the weighted number of cases. If you want to calculate percentages and statistics using only the unweighted N's, do not specify a weight variable.


Sample design
For complex samples, standard errors and confidence intervals are calculated that take the complex design into account. If bivariate statistics are requested, the Rao-Scott adjustment to the chi-square statistics is used to create F statistics. In this case, probability values are only calculated for the Rao-Scott-based F statistics, and not for the unadjusted chi-square statistics.

Nevertheless, you can specify that the standard errors, confidence intervals, and chi-square probability values should be calculated as if the sample were a simple random sample (SRS). One reason to request SRS calculations might be to compare the size of the SRS standard errors or confidence intervals with the corresponding statistics based on the complex sample design.


Confidence intervals
If this option is selected, an additional row of numbers is generated that contains the upper and lower bound of the confidence interval of the percentage (column, row, and/or total) in each cell. The confidence interval is the range of values within which the population value of the statistic is likely to fall. By default, the level of confidence is 95 percent, but the user can also select 99 percent or 90 percent.

The confidence interval is computed by converting the standard error of each percentage to a natural logarithm and then multiplying the log of the standard error by the value of Student's t appropriate to the level of confidence requested and to the number of degrees of freedom. The result is added to the log of the percentage to obtain the upper bound of the confidence interval, and it is subtracted from the log of the percentage to obtain the lower bound. The logs of the upper bound and of the lower bound are then converted back to percentages (by taking the antilogs) and displayed in the table cell.

This conversion back and forth to logarithms results in confidence intervals that are asymmetric -- they are a little wider in the direction of 50% than in the direction of 0% or 100%. This is the same procedure used by Stata to calculate confidence intervals of percentages. Notice that the calculation of confidence intervals for a proportion (or for any mean) by the Comparison of Means program does not use this log transformation. Therefore, the confidence intervals calculated by the Comparison of Means program will be a little different from the confidence intervals calculated by the Crosstabulation program for the same proportions. This is also the case for Stata.
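The asymmetry described above can be illustrated with a short sketch. It uses the logit form of the transformation that Stata documents for survey proportions; the exact formula that SDA applies is an assumption here. The function name transformed_ci and the default t value of 1.96 (the large-df limit) are illustrative only.

```python
import math

def transformed_ci(p, se, t_value=1.96):
    """Confidence interval for a proportion p (0 < p < 1) via a logit transform.

    Assumption: logit(p) +/- t * se / (p * (1 - p)), back-transformed to a
    proportion. This is a sketch, not necessarily SDA's exact computation.
    """
    logit = math.log(p / (1.0 - p))
    se_logit = se / (p * (1.0 - p))        # delta-method SE on the logit scale
    half_width = t_value * se_logit
    lower = 1.0 / (1.0 + math.exp(-(logit - half_width)))
    upper = 1.0 / (1.0 + math.exp(-(logit + half_width)))
    return lower, upper

lower, upper = transformed_ci(0.10, 0.02)
print(f"10% with SE 2%: {100 * lower:.1f}% to {100 * upper:.1f}%")
```

For a percentage of 10 percent, the upper bound lies farther from 10 than the lower bound does, which is the "wider toward 50%" behavior described in the text.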


Standard error of each percent
Standard errors for each type of percentage (column, row, or total) can be computed and displayed for each cell of the table. Standard errors are used to create confidence intervals for the percentages in each cell.

Simple random samples
If the sample is equivalent to a simple random sample of a population, the standard error of each percentage is computed using the familiar "pq/n" formula for the normal approximation to the standard error of a proportion. For each proportion p, the formula is:
   sqrt(p * (1-p) / (n-1))
where n is the number of cases in the denominator of the percentage -- the total number of cases in that particular column, row, or total table, depending on the percentage being calculated. For this calculation, n is the actual number of cases, even if weights have been used to calculate the percentages.
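As a minimal sketch, the formula above can be written as follows. The function name srs_standard_error is hypothetical; p is a proportion between 0 and 1, and n is the unweighted number of cases in the denominator.

```python
import math

def srs_standard_error(p, n):
    """Standard error of a proportion under simple random sampling.

    Implements sqrt(p * (1 - p) / (n - 1)) as given in the text.
    """
    if n < 2:
        raise ValueError("n must be at least 2")
    return math.sqrt(p * (1.0 - p) / (n - 1))

# A column percentage of 50% based on 101 actual cases.
se = srs_standard_error(0.50, 101)
print(f"standard error: {100 * se:.1f} percentage points")
```

Multiply the result by 100 to express it in percentage points, as in the table cells.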

Complex samples
If the sample for a particular study is more complex than a simple random sample, the appropriate standard errors can still be computed provided that the stratum and/or cluster variables were specified when the dataset was set up in the SDA Web archive. Otherwise, the standard errors calculated by assuming simple random sampling are probably too small.

For complex samples the appropriate standard errors are computed using the Taylor series method. If you want additional technical information, see the document on standard error calculation methods.

Note that the calculations for standard errors in cluster samples require that the coefficient of variation of the sample size of the denominator for each percentage, CV(x), be under 0.20; otherwise, the computed standard errors (and the confidence intervals) are probably too small, and they are flagged in the table with an asterisk. CV(x) and other diagnostic information are available for standard error calculations done by the SDA Comparison of Means program. That program and the SDA Crosstabulation program use the same information and methods to calculate standard errors.


Design effect (deft) for each percent
The design effect for each percentage based on a complex sample is the ratio of the standard error of each percent in a table cell to the standard error of the same percent in a simple random sample of the same size. For the calculation of standard errors, see the discussion of standard errors above. (The design effect for a percent based on a simple random sample is 1.)

The design effect for each percent in a cell is used to calculate the effective number of cases (N / deft-squared) on which the percent is based, for purposes of precision-based suppression.
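The two quantities just described can be sketched in a few lines. The function names are hypothetical, and the input values are made up for illustration.

```python
def design_effect(se_complex, se_srs):
    """deft: complex-sample standard error divided by the SRS standard error."""
    return se_complex / se_srs

def effective_n(n, deft):
    """Effective number of cases: N / deft-squared."""
    return n / deft ** 2

deft = design_effect(se_complex=0.03, se_srs=0.02)
print("deft:", deft)
print("effective N:", effective_n(900, deft))
```

A deft above 1 shrinks the effective number of cases, so a cell based on many actual cases can still fall below a precision-based suppression threshold.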

The design effects for all of the total percents in a table are used to calculate the Rao-Scott adjustment to the chi-square statistic, if bivariate statistics have been requested for a complex sample.


DF -- Degrees of freedom
The number of degrees of freedom (df) is used to compute the width of each confidence interval. For a simple random sample the df equal the number of cases in the denominator for each percentage for that cell, minus one.

For complex samples, the df equal the number of primary sampling units (clusters, for cluster samples; individual cases in the denominator, for unclustered samples) minus the number of strata (unstratified samples have a single stratum). Note that the number of strata and clusters used for this calculation is usually the number in the overall sample, and not in the subclass represented by a cell in a table. For a fuller discussion of this issue, see the treatment of domains and subclasses in the document on standard error methods.

The value of Student's t used for computing confidence intervals depends on the desired level of confidence (95 percent, by default) and the df. The fewer the df, the larger the required value of Student's t and, consequently, the larger the width of the confidence intervals. As the df increase, the size of the required Student's t value decreases until it approaches the familiar value for the normal distribution (which is 1.96, for the 95 percent confidence level).
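The df rules above can be sketched as follows. The function names are hypothetical. The Python standard library has no Student's t distribution, so the sketch only computes the normal-distribution limit that the t value approaches as the df increase.

```python
from statistics import NormalDist

def srs_df(n_denominator):
    """df for a simple random sample: cases in the denominator, minus one."""
    return n_denominator - 1

def complex_df(n_psus, n_strata=1):
    """df for a complex sample: primary sampling units minus strata.

    An unstratified sample counts as a single stratum.
    """
    return n_psus - n_strata

def normal_critical_value(confidence=0.95):
    """Two-sided critical value of the normal distribution (large-df limit)."""
    return NormalDist().inv_cdf(0.5 + confidence / 2.0)

print(srs_df(500))                             # 499
print(complex_df(n_psus=84, n_strata=42))      # 42
print(round(normal_critical_value(0.95), 2))   # 1.96
```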


Show the Z-statistics
The Z-statistic controls the color coding of cells in the table. If you select this option, the statistic will be displayed in each cell.

The Z-statistic shows whether the frequencies in a cell are greater or fewer than expected (in the same sense as used for the chi-square statistic). It also takes into account the total number of cases in the table. If there are only a few cases in the table, the deviations from the expected values are not as significant as if there are many cases in the table.

The Z-statistics are standardized residuals. The residual for each cell is calculated as the ratio of two quantities:

For a discussion of the standardized residuals, see Alan Agresti, An Introduction to Categorical Data Analysis, New York: John Wiley, 1996, p. 31.
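For readers who want to experiment, here is a sketch of the adjusted (standardized) residual in the form Agresti gives: observed minus expected, divided by the square root of the expected count times (1 - row proportion) times (1 - column proportion). That this matches SDA's computation term for term is an assumption, and the function name is hypothetical.

```python
import math

def adjusted_residuals(table):
    """Adjusted residuals for a two-way table of (unweighted) frequencies.

    Assumed form: (observed - expected) /
                  sqrt(expected * (1 - row_prop) * (1 - col_prop))
    """
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    n = sum(row_totals)
    result = []
    for i, row in enumerate(table):
        out_row = []
        for j, observed in enumerate(row):
            expected = row_totals[i] * col_totals[j] / n
            denom = math.sqrt(
                expected * (1 - row_totals[i] / n) * (1 - col_totals[j] / n)
            )
            out_row.append((observed - expected) / denom)
        result.append(out_row)
    return result

for row in adjusted_residuals([[10, 20], [30, 40]]):
    print([round(z, 2) for z in row])
```

Positive values mark cells with more cases than expected, and negative values mark cells with fewer.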

Note that if the frequencies in the table are weighted, the Z-statistic can be artificially inflated (or deflated). Consequently, if weights are used, each Z-statistic is divided by the average size of the weights. The average size of the weights is just the ratio of the total number of weighted cases in the table, divided by the actual number of unweighted cases in the table. For example, if the table is based on 1,000 actual cases, but the weighted number of cases is 100,000, the average size of the weights is 100,000/1,000 = 100. (The chi-square statistics are adjusted in the same way, to compensate for weights whose average is different from 1.) Note also that the Z-statistic does not take into account the complex sample design, if the table is based on such a sample.
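The weight adjustment can be sketched using the numbers from the example above. The function names are hypothetical, and the division follows the description in the text rather than any published SDA source code.

```python
def average_weight(weighted_n, unweighted_n):
    """Average size of the weights: weighted N divided by actual N."""
    return weighted_n / unweighted_n

def adjust_for_weights(statistic, weighted_n, unweighted_n):
    """Divide a statistic by the average weight, as the text describes."""
    return statistic / average_weight(weighted_n, unweighted_n)

print(average_weight(100_000, 1_000))              # 100.0
print(adjust_for_weights(250.0, 100_000, 1_000))   # 2.5
```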


N of cases to display
By default, the number of cases used to calculate percentages is displayed in each cell. The box to display the weighted N is initially checked on the option form. If no weight variable was specified for the analysis, the unweighted N of cases is displayed in each cell, even if the box for weighted N was checked.

However, you can uncheck both boxes, and no N will be displayed. Or you can check both boxes, and both the unweighted and the weighted N of cases will be displayed (if a weight variable has been specified).

It is important to understand that if a weight variable has been specified, the percentages and the statistics are always computed using the weighted number of cases, regardless of which N is displayed in the table. If you want to calculate percentages and statistics using only the unweighted N's, do not specify a weight variable.


Summary statistics (Bivariate or Univariate)
Various numbers or statistics can be used to summarize the distributions of the variables. If you specify both a row and a column variable, a package of bivariate statistics is generated. If you specify a row variable only, a package of univariate statistics is generated. Consult any statistics textbook for more information on the meaning of these statistics.

Bivariate statistics

The bivariate statistics summarize the strength or the statistical significance of the observed relationship between the row and the column variables. Several of the most common statistics are displayed if you select this option.

Univariate statistics

The univariate statistics package includes the mean, median, mode, standard deviation, variance, and the coefficient of variation (standard deviation divided by the mean) of the specified variable, plus a few other descriptive statistics. All of these statistics are calculated using the weight variable, if one is specified.

Note that the univariate statistics cannot be calculated for character variables. If a character variable is used as a row variable, the request for univariate statistics is ignored. Even for numeric variables, be aware that the univariate statistics will not be meaningful unless the code values of the row variable are ordered in a way that approximates interval-level data.

These univariate statistics are purely descriptive. No attempt is made to test them for sampling error. To get standard errors and confidence intervals for the mean of a variable, you can use the Comparison of Means program.


Other Display Options

Question text
The text of the question that produced each variable is generally available.

Color coding of the table cells
The table cells are color coded, in order to aid in detecting patterns. Cells with more cases than expected (based on the marginal percentages) become redder, the more they exceed the expected value. Cells with fewer cases than expected become bluer, the smaller they are, compared to the expected value.

The transition from a lighter shade of red or blue to a darker shade depends on the magnitude of the Z-statistic. The lightest shade corresponds to Z-statistics between 0 and 1. The medium shade corresponds to Z-statistics between 1 and 2. The darkest shade corresponds to Z-statistics greater than 2.
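The shading rule can be sketched as a small lookup. The function name and the returned labels are hypothetical. The treatment of values exactly equal to 0, 1, or 2 is an assumption, since the text does not say which band the boundary values fall into.

```python
def cell_shade(z):
    """Map a Z-statistic to a (hue, depth) pair following the bands in the text.

    Assumptions: Z exactly 0 gets no hue; Z of exactly 1 or 2 (in absolute
    value) is placed in the lighter of the two neighboring bands.
    """
    if z > 0:
        hue = "red"      # more cases than expected
    elif z < 0:
        hue = "blue"     # fewer cases than expected
    else:
        hue = "none"
    magnitude = abs(z)
    if magnitude > 2:
        depth = "dark"
    elif magnitude > 1:
        depth = "medium"
    else:
        depth = "light"
    return hue, depth

for z in (0.5, 1.5, -2.5):
    print(z, cell_shade(z))
```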

The color coding can be turned off, if you prefer. Color coding may not be helpful if you are using a black-and-white monitor or if you intend to print out the table on a black-and-white printer.


Suppress display of the table
Occasionally you may want to see the summary statistics for a table and/or the chart, without wishing to view the table itself, especially if the table is a very large one. If you select this option, the table is generated internally but is not displayed.

Include missing-data values
With this option, the row, column, and control variables in the table will include ALL categories, including those defined as missing-data or out-of-range categories. The system-missing code will also appear in the table. Its category label will be the default "(No Data)" unless another label has been assigned to the system-missing code. Any range restrictions or temporary recode commands will be ignored, and every category will be shown.

If bivariate statistics are requested, nominal and ordinal statistics will be produced as usual, with the missing data codes sorted into order with the valid codes.

Interval-level statistics will also be computed if the included missing-data codes allow it. The Eta statistic will be calculated if the included missing data codes on the ROW variable are all numeric. The Pearson correlation coefficient can be calculated only if the included missing data codes are all numeric on BOTH the row and column variables.

If univariate statistics are requested, the row variable can only have numeric missing-data codes. Otherwise, no statistics can be generated, and the request is ignored.


Number of decimals to display
Each statistic displayed in the cells of the table has a default number of decimal places. If you want more or fewer decimal places, you can generally specify from 0 to 6 decimal places for most of the statistics displayed in each cell (with the exception of the unweighted number of cases). Note that the decimal place specifications for standard errors are RELATIVE to the number of decimal places in the percentages.


Chart Options for Crosstabulation

Type of chart to display
Select the type of chart you would like. A stacked bar chart is relatively compact and is suitable for most tables. Regular side-by-side bar charts, pie charts, and line charts are also available.

If you select column percentaging, the chart will include a separate set of bars (or a separate pie) describing the row variable, for each category of the column variable. For a line chart, there will be a separate line for each category of the row variable, plotted against the values of the column variable. The column variable is treated as the "break variable" in this layout.

If you select row percentaging, the chart will include a separate set of bars (or a separate pie) describing the column variable, for each category of the row variable. For a line chart, there will be a separate line for each category of the column variable, plotted against the values of the row variable. The row variable is treated as the "break variable" in this layout.

If you select total percentaging, a combination of row and column percentaging, or no percentaging at all, the effect is the same as selecting column percentaging only.

If there is only a row variable specified for the table, the chart will include one set of bars (or one pie, or one line) to show the distribution of that row variable.

Bar chart options
The appearance of bar charts (both stacked and side-by-side bar charts) can be modified in two ways:

Show Percents
Each bar, pie slice, or point on a line will have its percent included on the chart, if you select this option.

Note that these percents may not always appear or may not be legible in all situations.

On stacked bar charts the percents may not have sufficient room to appear inside the area allocated to small categories.

On pie charts and line charts the percents for some slices or for some points on the lines may be almost overlaid and become illegible, if there are many categories or if the lines are very close together.

If you still want to show the percents in those situations, it will usually help if you increase the size of the charts. For stacked bar charts it can also help to change from a vertical to a horizontal orientation.

Palette
The charts are usually output in color. If you wish to print or copy the charts on a black-and-white printer or copier, you can select the grayscale palette for your charts. The charts will then be output in various shades of gray (instead of in various colors).

Size of chart
The width and height of the chart (expressed in the number of pixels) can be modified. If there is a large (or a very small) number of categories in either the row or the column variable, it may be helpful to increase (or decrease) one or both of the dimensions of the chart.

Pie charts in particular may require an increase in the dimensions of the chart if the number of category slices is large. Otherwise, the labels for each slice of the pies might overlay one another.

Stacked bar charts with only two or three break categories may look better if the chart is made narrower. But if there is a large number of break categories (like years of age), the best solution is often to combine a horizontal chart orientation with an increase in the height of the chart.

Side-by-side bar charts are best limited to tables with a relatively small number of categories in both the row and the column variables. If there are many categories in either or both of the variables, the proliferation of bars can be confusing, even if the chart dimensions are increased. In such cases it is probably better to use stacked bar charts instead of side-by-side bar charts.

Line charts may need to be enlarged if the lines are close to being overlaid. If percents are being shown, they also can become overlaid. In such cases it may help to increase the height of the chart.


CSV output file

Create a CSV output file for downloading
You can create a CSV format file (based on the currently selected options) by clicking on the "Create CSV file" button. Once a CSV file is created, another button labeled "Download CSV file" will appear. Clicking on this button will allow you to download the CSV file to your computer. CSV files are useful for importing into other applications (such as Excel) for creating custom charts. CSV files can also be useful for preparing tables for inclusion in manuscripts.

When creating a CSV file it is usually easiest to first create preliminary HTML output, by clicking on the "Run the Table" button, while you choose the correct variables, filters, weight, cell statistics, etc. The HTML output is quick and easy to read while you're fine-tuning your options. When you have the desired output, create a CSV file.

Once the CSV file has been created, click on the "Download CSV file" button and download it. (Note that this same CSV file will remain available for downloading -- even multiple times -- until you create a new CSV file by clicking the "Create CSV file" button again.) Once you have downloaded the CSV file, you can import it into an appropriate application on your computer.

For example, to create a chart in Excel, use SDA's default CSV output option to separate statistics into multiple tables (see below). Once the CSV file has been imported into Excel, select the desired table of statistics, including the row and column labels. (You may also want to include the row or column totals, depending on the statistic and your preferences.) Then select the "Insert" tab and click on "Recommended Charts". You can now preview various chart types by clicking on them. Once you've chosen your desired chart type, click on "OK". You can then customize your chart in various ways using Excel's tools.

CSV table format
If the cells in your HTML table contain more than one statistic then, by default, in the equivalent CSV output file a separate table is created for each statistic. This is often the most useful format for importing into a charting tool. However, for some applications, it is more useful to combine the statistics into one table. You can specify whether the statistics should be output in separate tables or combined in one table. (Note that in either case, only one CSV file is created.)

Name of CSV download file
By default, the name of the CSV download file is "tables.csv". However, you can specify another name of your choice. This is useful for giving a more meaningful name to the CSV file, especially when you are creating several different CSV files. However, the file ending or extension must always be ".csv". If you do not specify a ".csv" extension, it will be automatically appended to your specified name.
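The naming rule can be sketched as follows. The function name is hypothetical, and treating the extension check as case-insensitive is an assumption.

```python
def csv_download_name(requested=""):
    """Return the download file name, applying the default and the extension.

    Assumption: ".CSV" in any letter case counts as a valid extension.
    """
    name = requested.strip()
    if not name:
        return "tables.csv"          # default name given in the text
    if not name.lower().endswith(".csv"):
        name += ".csv"               # extension is appended automatically
    return name

print(csv_download_name())                    # tables.csv
print(csv_download_name("income_by_region"))  # income_by_region.csv
print(csv_download_name("age.csv"))           # age.csv
```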

SDA Comparison of Means Program

This program calculates the mean of the dependent variable separately within categories of the row variable and, optionally, the column variable. If a control variable is specified, a separate table will be produced for each category of the control variable.

Steps to take

Specify variables
To specify that a certain survey question or variable is to be included in a table, use the name for that variable as given in the documentation for this study. Aside from simply specifying the name of a variable, certain variable options are available to change or restrict the scope of a variable.

Select display options
After specifying the names of variables, select the display options you wish. These affect the number of decimals to show, text to display, and statistics to compute.

Select an action
After specifying all variables and options, select the action to take.

REQUIRED variable names

Dependent variable(s)
A numeric variable whose mean or average value is to be computed for each combination of the row and (optionally) column and control variables and displayed in a table.

Row variable(s)
Variable down the side of the table

OPTIONAL variable names

Column variable(s)
Variable along the top of the table

Control variable(s)
A separate table is produced for each category of a control variable.

If more than one dependent variable, row variable, column variable, and/or control variable is specified, a separate table will be generated for each combination of variables.

Selection filter variable(s)
Some cases are included in the analysis; others are excluded.

Weight variable
Cases are given different relative weights.

Display Options for Comparison of Means

Main statistic to display
Each cell of the table will usually contain the MEAN of the dependent variable for that particular combination of the row and (optionally) column and control variables.

Sometimes, however, it is more helpful to express each cell mean in another way:


Base row or column category
When the main statistic to display is a DIFFERENCE from a row or column category, it is necessary to specify which row or column category is the base category.

Enter the code value for the row or column category that you want to consider the base category.


Transformation of the dependent variable (for 0/1 dependent variables)
The mean of a dependent variable coded 0 or 1 is a proportion. The problem with analyzing a proportion is that the standard deviation and variance depend on the magnitude of the proportion.

The proportion in each cell of the table can be transformed into another statistic that has a more stable distribution. These options are provided for didactic purposes, so that students and researchers can readily compare the logit and probit transformations with the original proportions in a table. The following options are available:

These transformations require that the dependent variable be coded as a value of 0 or 1. If the variable is not coded that way, SDA will create a temporary 0/1 variable by recoding the lowest value to 0 and all other values to 1.
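The recoding rule and the two transformations named above can be sketched as follows. The function names are hypothetical; the logit and probit definitions are the standard ones (log-odds and the inverse of the standard normal cumulative distribution).

```python
import math
from statistics import NormalDist

def recode_to_binary(values):
    """Recode the lowest value to 0 and all other values to 1."""
    lowest = min(values)
    return [0 if v == lowest else 1 for v in values]

def logit(p):
    """Log-odds of a proportion p (0 < p < 1)."""
    return math.log(p / (1.0 - p))

def probit(p):
    """Inverse standard normal CDF of a proportion p (0 < p < 1)."""
    return NormalDist().inv_cdf(p)

codes = recode_to_binary([1, 2, 2, 5])
p = sum(codes) / len(codes)          # the cell mean is a proportion
print(codes, p, round(logit(p), 3), round(probit(p), 3))
```

Both transformations are undefined for proportions of exactly 0 or 1, so a cell in which every case has the same code cannot be transformed.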

Calculate a median or other percentile for each cell

The drop-down menu allows you to select either the median or a percentile. (The median is the same as the 50th percentile.) If you select 'Percentile', another drop-down list will appear, from which you can pick any percentile between 1 and 99 (the default is the 90th percentile).

The median or the specified percentile of the dependent variable will be displayed as the first statistic in each cell. If a weight variable is used, the medians or percentiles will be calculated using the weights. The medians or percentiles calculated by the MEANS program are purely descriptive. No attempt is made to test them for sampling error.
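One common way to define a weighted percentile is sketched below: sort the values and return the first one at which the cumulative weight reaches the requested share of the total weight. Several conventions exist, and which one the MEANS program uses is not stated here, so this is an assumption. The function name is hypothetical.

```python
def weighted_percentile(values, weights, percentile):
    """Smallest value whose cumulative weight reaches percentile% of the total.

    Assumption: no interpolation between adjacent values.
    """
    pairs = sorted(zip(values, weights))
    total = sum(weight for _, weight in pairs)
    target = total * percentile / 100.0
    running = 0.0
    for value, weight in pairs:
        running += weight
        if running >= target:
            return value
    return pairs[-1][0]

# Unweighted median of 1..4, then the same data with a heavy weight on 4.
print(weighted_percentile([1, 2, 3, 4], [1, 1, 1, 1], 50))
print(weighted_percentile([1, 2, 3, 4], [1, 1, 1, 5], 50))
```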

Base chart on medians/percentiles (instead of means)

A chart generated by the MEANS program is based, by default, on the mean of the dependent variable (or whatever else has been selected as the "Main statistic to display"). If you have requested that medians or percentiles be displayed in each cell (in addition to the means), you can choose to base the chart on these medians or percentiles by checking this box.

Estimate the values of medians or percentiles

If you have a very large number of cases, dependent variable categories, and table cells, there may only be enough memory to calculate the exact median or percentile for some of the cells of the table. By default, no median or percentile is output for the remaining cells. By checking this box, however, you can request that an estimated value be calculated for the median or percentile in those cells that otherwise would be left without any statistic at all.

An asterisk (*) next to a median or percentile indicates that it was estimated using an algorithm for what is called the "remedian". For further information on this method of estimating medians and percentiles, see Peter J. Rousseeuw and Gilbert W. Bassett, Jr., "The Remedian: A Robust Averaging Method for Large Data Sets." Journal of the American Statistical Association, March 1990, vol. 85, pp. 97-104. Note that SDA uses a base of 101 to calculate the remedian.
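The idea behind the remedian can be sketched as follows: values fill a buffer of size "base"; each time a buffer fills, its median is passed up to the next buffer, so only a few small buffers are ever held in memory. The sketch estimates a median only (not other percentiles), uses a weighted median over leftover buffer contents, and is an illustration of the published method rather than SDA's implementation. The function name is hypothetical.

```python
from statistics import median_low

def remedian(values, base=101):
    """Estimate the median of a stream using buffers of size `base`."""
    buffers = [[]]
    for x in values:
        buffers[0].append(x)
        level = 0
        # Whenever a buffer fills, pass its median up one level and empty it.
        while len(buffers[level]) == base:
            m = median_low(buffers[level])
            buffers[level] = []
            if level + 1 == len(buffers):
                buffers.append([])
            buffers[level + 1].append(m)
            level += 1
    # Each value left at level i stands for base**i original observations.
    weighted = sorted(
        (x, base ** i) for i, buf in enumerate(buffers) for x in buf
    )
    if not weighted:
        raise ValueError("no values supplied")
    half = sum(w for _, w in weighted) / 2.0
    running = 0
    for x, w in weighted:
        running += w
        if running >= half:
            return x

# Small base so the buffering is easy to follow by hand.
print(remedian(range(1, 10), base=3))
```

With a base of 101, three buffers are enough to cover just over one million cases.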


Confidence intervals
If this option is selected, an additional row of numbers is generated that contains the upper and lower bound of the confidence interval of the statistic (mean or difference or total) in each cell. The confidence interval is the range of values within which the population value of the statistic is likely to fall. By default, the level of confidence is 95 percent, but the user can also select 99 percent or 90 percent.

The confidence interval or range is computed by multiplying the standard error of the mean (or difference or total) by the value of Student's t appropriate to the level of confidence requested and to the number of degrees of freedom. The result is added to the mean (or difference or total) to obtain the upper bound of the confidence interval, and the result is subtracted from the mean (or difference or total) to obtain the lower bound. Note that if both complex and SRS standard errors are requested, only the complex standard errors are used to compute the confidence intervals.

For a very large random sample (in a particular cell of a table), for instance, the appropriate value for Student's t for a 95 percent confidence interval is close to the familiar 1.96 value for the normal distribution.
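The computation described above is a symmetric interval around the statistic. In this minimal sketch the function name is hypothetical, and the caller supplies the Student's t value; the default of 1.96 is the large-sample value mentioned in the text.

```python
def mean_confidence_interval(mean, standard_error, t_value=1.96):
    """Lower and upper bounds: mean -/+ t * standard error."""
    half_width = t_value * standard_error
    return mean - half_width, mean + half_width

lower, upper = mean_confidence_interval(50.0, 2.0)
print(f"{lower:.2f} to {upper:.2f}")
```

Unlike the intervals for percentages in the Crosstabulation program, this interval is symmetric because no log transformation is applied.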


Additional statistics to display (for simple random samples)
There are several additional statistics that can be displayed in each cell:

Additional statistics to display (for complex probability samples)

There are several additional statistics that can be displayed in each cell:


Multiple Classification Analysis (MCA)
If this option is selected, an MCA table is generated showing the effect on the dependent variable of each of the categories of each row, column, and control variable. (Those variables must all be numeric variables. If one or more are character variables, the MCA request is ignored.)

These MCA statistics are purely descriptive. No attempt is made to test them for sampling error. You can run the SDA regression program to calculate standard errors and confidence intervals, even for complex samples.

The MCA procedure shows the average effect of each category, and it ignores any interactions between the variables. If interaction effects are statistically significant, MCA is generally not appropriate.

The first set of columns shows the effects of each category of each variable.

The second set of columns shows the mean of the dependent variable for each category of the row (and column and control) variable(s).

The "Difference" column shows the difference between the adjusted and unadjusted effects (or, equivalently, the means) for each category.


Diagnostic output for standard errors (for complex probability samples)
If this option is selected, an additional table is generated that contains the following statistics in each cell:

ANOVA
An analysis of variance can be carried out and presented after the table of means. The Eta squared statistic shows the proportion of the variance of the dependent variable accounted for by the row variable (and by the column variable, if there is one, and the interaction between the row and column variables).

If the sample is a simple random sample, the ANOVA can also be used to assess the statistical significance of the effects of the row variable (and the column variable, if there is one) on the dependent variable. An F statistic is calculated as the ratio of each mean square divided by the residual mean square, and the probability of the F statistic is evaluated. If the p-value (probability statistic) associated with a particular row or column effect is low (about .05 or less), the chances are correspondingly low that the observed effect on the dependent variable is only due to sampling error. In that case the effect is said to be statistically significant.
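For the simplest case (a row variable only, unweighted data), Eta squared and the F statistic can be sketched as follows. The function name is hypothetical, and the p-value is omitted because the Python standard library has no F distribution.

```python
def one_way_anova(groups):
    """Return (eta_squared, f_statistic) for a list of groups of values.

    Each inner list holds the dependent-variable values for one category
    of the row variable.
    """
    all_values = [x for group in groups for x in group]
    n = len(all_values)
    grand_mean = sum(all_values) / n
    ss_between = sum(
        len(g) * (sum(g) / len(g) - grand_mean) ** 2 for g in groups
    )
    ss_total = sum((x - grand_mean) ** 2 for x in all_values)
    ss_within = ss_total - ss_between
    df_between = len(groups) - 1
    df_within = n - len(groups)
    # F is the effect mean square divided by the residual mean square.
    f_statistic = (ss_between / df_between) / (ss_within / df_within)
    eta_squared = ss_between / ss_total
    return eta_squared, f_statistic

eta_sq, f_stat = one_way_anova([[1, 2, 3], [4, 5, 6]])
print(round(eta_sq, 3), round(f_stat, 2))
```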

If the sample is a complex sample, like a cluster sample, the ANOVA is only of descriptive value. The F tests and their associated probability statistics are omitted because they would likely underestimate the size of the true p-value and therefore overstate the statistical significance of the observed row and/or column effects. Only the Eta squared statistic for each effect is displayed. You can use the SDA regression program to calculate the statistical significance of the independent variables in complex samples.


Other Display Options

Suppress display of the table
Occasionally you may want to see ANOVA statistics or a Multiple Classification Analysis (MCA) or a chart without viewing the table of means, especially if the table is a very large one. If you select this option, the table of means is generated internally but is not displayed. Tables containing confidence intervals and diagnostic information (for complex standard errors) are also suppressed.

Color coding of the cells
The cells of the table of means are color coded, in order to aid in detecting patterns. Cells with higher means than the overall mean become redder, the more they exceed the overall mean. Cells with lower means than the overall mean become bluer, the smaller they are.

The transition from a lighter shade of red or blue to a darker shade depends on the magnitude of the absolute value of the Z-statistic or t-statistic. The transition points vary, depending on which of those two statistics is calculated:

The color coding can be turned off, if you prefer. Color coding may not be helpful if you are using a black-and-white monitor or if you intend to print out the table on a black-and-white printer.


Number of decimals to display
Each statistic displayed in the cells of the table has a default number of decimal places. If you want more or fewer decimal places, you can generally specify from 0 to 6 decimal places for most of the statistics displayed in each cell. Note that some decimal place specifications are RELATIVE to the number of decimal places in the main statistic (means or totals).

Question text
The text of the question that produced each variable is generally available.

Chart Options for Comparison of Means

Type of chart to display
Select the type of chart you would like. A bar chart is suitable for most tables and is the default chart format. Line charts are also available and are suitable especially when the categories of the row variable are ordered.

If only a row variable is specified (and no column variable), the bars or the line will show the value of the dependent variable (on the vertical axis) for each value of the row variable.

If both a row variable and a column variable are specified, there will be a separate set of bars, or a separate line, for each category of the column variable. For a bar chart, there will be sub-bars for each column category within the bar for each row category. For a line chart, there will be a separate line for each column category.

Bar chart options
The appearance of bar charts can be modified in two ways:

Show means/medians/percentiles
If you select this option, each bar or each point on a line will have its statistic included on the chart. This could be the mean (or another "main statistic"), the median, or the specified percentile, depending on which statistic the chart is based on.

Note that the chosen statistic may not always appear or may not be legible in all situations. Especially on line charts, the statistics for some points on the lines may be almost overlaid and become illegible, if there are many categories or if the lines are very close together. If you still want to show the statistics in those situations, it will usually help if you increase the size of the charts.

Palette
The charts are usually output in color. If you wish to print or copy the charts on a black-and-white printer or copier, you can select the grayscale palette for your charts. The charts will then be output in various shades of gray (instead of in various colors).

Size of chart
The width and height of the chart (expressed in pixels) can be modified. If there is a large (or a very small) number of categories in either the row or the column variable, it may be helpful to increase (or decrease) one or both of the dimensions of the chart.

Bar charts are best limited to tables with a relatively small number of categories in both the row and the column variables. If there are many categories in either or both of the variables, the proliferation of bars can be confusing, even if the chart dimensions are increased.

Line charts may need to be enlarged if the lines are close to being overlaid. If means are being shown, they also can become overlaid. In such cases it may help to increase the height of the chart.


CSV output file

Create a CSV output file for downloading
You can create a CSV format file (based on the currently selected options) by clicking on the "Create CSV file" button. Once a CSV file is created, another button labeled "Download CSV file" will appear. Clicking on this button will allow you to download the CSV file to your computer. CSV files are useful for importing into other applications (such as Excel) for creating custom charts. CSV files can also be useful for preparing tables for inclusion in manuscripts.

When creating a CSV file it is usually easiest to first create preliminary HTML output, by clicking on the "Run the Table" button, while you choose the correct variables, filters, weight, cell statistics, etc. The HTML output is quick and easy to read while you're fine-tuning your options. When you have the desired output, create a CSV file.

Once the CSV file has been created, click on the "Download CSV file" button and download it. (Note that this same CSV file will remain available for downloading -- even multiple times -- until you create a new CSV file by clicking the "Create CSV file" button again.) Once you have downloaded the CSV file, you can import it into an appropriate application on your computer.

For example, to create a chart in Excel, use SDA's default CSV output option to separate statistics into multiple tables (see below). Once the CSV file has been imported into Excel, select the desired table of statistics, including the row and column labels. (You may also want to include the row or column totals, depending on the statistic and your preferences.) Then select the "Insert" tab and click on "Recommended Charts". You can now preview various chart types by clicking on them. Once you've chosen your desired chart type, click on "OK". You can then customize your chart in various ways using Excel's tools.

CSV table format
If the cells in your HTML table contain more than one statistic then, by default, in the equivalent CSV output file a separate table is created for each statistic. This is often the most useful format for importing into a charting tool. However, for some applications, it is more useful to combine the statistics into one table. You can specify whether the statistics should be output in separate tables or combined in one table. (Note that in either case, only one CSV file is created.)

Name of CSV download file
By default, the name of the CSV download file is "means.csv". However, you can specify another name of your choice. This is useful for giving a more meaningful name to the CSV file, especially when you are creating several different CSV files. However, the file ending or extension must always be ".csv". If you do not specify a ".csv" extension, it will be automatically appended to your specified name.
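The naming rule described above can be sketched in a few lines of Python. This is only an illustration of the rule, not SDA's own code; the function name is hypothetical, and treating ".CSV" in upper case as an acceptable extension is an assumption.

```python
def csv_download_name(requested=""):
    """Apply the default name and the '.csv' extension rule."""
    name = requested.strip() or "means.csv"   # default when nothing is given
    if not name.lower().endswith(".csv"):     # assumption: case-insensitive check
        name += ".csv"
    return name


names = [csv_download_name(n) for n in ("", "income_by_region", "table1.csv")]
```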

SDA Correlation Matrix Program

This program calculates the correlation between all pairs of two or more variables.

Steps to take

Specify variables
To specify that a certain survey question or variable is to be correlated or used as a filter or weight variable, give the name for that variable as given in the documentation for this study. Aside from simply specifying the name of a variable, certain variable options are available to change or restrict the scope of a variable.

Select display options
After specifying the names of variables, select the display options you wish. These affect the statistics to compute, the number of decimals to show, and text to display.

Run correlations
After specifying all variables and options, select Run correlations to run the program.

Or you can select Clear fields to delete all previously specified variables and options, so that you can start over.


REQUIRED variable names

Variables to be correlated
Enter the names of two or more numeric variables whose correlation coefficients are to be computed for each pair of variables. (There are various optional ways of specifying variable names for analysis.)

Enter the name of each variable in a text box. To go from one text box to another, use the tab key or your mouse. It is all right to skip a text box and leave it blank -- to use only text boxes 1, 5, and 9, for example.

It is possible to enter more than one variable name in a text box (the underlying text-entry area will scroll). This has consequences for other options which refer to variable numbers. For example, if you enter two variables in text box number 3, and then you request that the signs of the correlations be reversed for variable number 3, the signs of BOTH variables in text box number 3 will be reversed.

Each text box, consequently, defines a variable GROUP. Ordinarily it is clearer to put only one variable in each text box, but the possibility of defining groups of variables exists.


OPTIONAL variable names

Selection filter variable(s)
Some cases are included in the analysis; others are excluded.

Weight variable
Cases are given different relative weights in calculating the correlation coefficients.

How to exclude cases with missing data

Listwise exclusion
If a case has a missing-data value on ANY of the variables to be correlated, it is excluded from ALL of the correlation calculations. This is the default procedure.

Pairwise exclusion
If a case has a missing-data value on SOME of the variables to be correlated, but not on others, it is excluded from the calculations for those PAIRS of variables in which one of the values is missing.

This procedure retains all of the information about each pairwise relationship. However, the multivariate relationships can be inconsistent, if many of the cases have different missing-data patterns on different variables.
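The difference between the two exclusion rules can be seen in a small Python sketch. Here each case is a dictionary, and None stands in for a missing-data value; both conventions, and the function names, are assumptions made for this illustration.

```python
def listwise_cases(rows):
    """Indexes of cases with valid data on EVERY variable."""
    return [i for i, row in enumerate(rows)
            if all(value is not None for value in row.values())]


def pairwise_cases(rows, var_a, var_b):
    """Indexes of cases with valid data on both variables of one pair."""
    return [i for i, row in enumerate(rows)
            if row[var_a] is not None and row[var_b] is not None]


rows = [
    {"v1": 1, "v2": 2, "v3": 3},
    {"v1": 2, "v2": None, "v3": 1},
    {"v1": None, "v2": 4, "v3": 2},
    {"v1": 4, "v2": 1, "v3": None},
]

used_listwise = listwise_cases(rows)           # the same cases for every pair
used_v1_v2 = pairwise_cases(rows, "v1", "v2")  # cases for this pair only
used_v1_v3 = pairwise_cases(rows, "v1", "v3")
```

With listwise exclusion only the first case is used for every correlation. With pairwise exclusion each pair of variables keeps two cases, but not the same two, which is why the resulting matrix can be inconsistent.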


Correlation Measure to Calculate

The Pearson correlation coefficient
This is the usual correlation coefficient and is the default correlation measure to calculate. It is appropriate for ordered numeric variables.

Log of the odds-ratio
The log of the odds-ratio is an optional measure for dichotomous variables. The calculation of the odds ratio assumes that the two variables have only two categories each. If these statistics are requested, the correlation program treats each variable as a dichotomy, regardless of the number of categories it may actually have. The minimum valid value of each variable is treated as one category, and all valid values greater than the minimum are combined into the other category.

If this default dichotomization is not appropriate for a particular analysis, you can recode the variable temporarily within the correlation program using the standard methods of recoding variables.
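The default dichotomization and the log of the odds-ratio can be sketched in Python as follows. None stands in for a missing-data value, and the function names are hypothetical. The sketch makes no adjustment for empty cells, so it fails if any of the four cells has no cases; how SDA handles that situation is not described here.

```python
import math


def dichotomize(values):
    """Code the minimum valid value as 0 and all larger valid values as 1."""
    valid = [v for v in values if v is not None]
    lowest = min(valid)
    return [None if v is None else int(v > lowest) for v in values]


def log_odds_ratio(x, y):
    """Log odds-ratio of two 0/1 lists; cases with a missing value are skipped."""
    counts = {(0, 0): 0, (0, 1): 0, (1, 0): 0, (1, 1): 0}
    for a, b in zip(x, y):
        if a is not None and b is not None:
            counts[(a, b)] += 1
    odds_ratio = (counts[(1, 1)] * counts[(0, 0)]) / (counts[(1, 0)] * counts[(0, 1)])
    return math.log(odds_ratio)


recoded = dichotomize([1, 2, 3, None, 1])
lor = log_odds_ratio([0, 0, 0, 1, 1, 1], [0, 0, 1, 0, 1, 1])
```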

Consult any beginners' statistics book for more information on the meaning of these statistics.


Additional Statistics to Calculate

Alpha coefficient
Cronbach's alpha coefficient is a measure of how well the variables in the correlation matrix could be said to measure the same thing. If you added together all of the variables included in the correlation matrix to form a scale, alpha is the square of the correlation between the scale and the underlying factor.

The alpha coefficient is a function of the average correlation between the variables and of the number of variables. If some of the variables are scored in opposite directions, you should use the option to reverse the signs of some of the variables, so that a high score on all variables means the same thing.
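The dependence of alpha on the average correlation and the number of variables can be illustrated with the standardized form of the coefficient, alpha = k * r / (1 + (k - 1) * r), where k is the number of variables and r is the average correlation between them. This is the common textbook form; whether SDA uses exactly this form or one based on covariances is not stated here, so treat the Python sketch below as an illustration only.

```python
def standardized_alpha(n_variables, mean_correlation):
    """Standardized Cronbach's alpha from k and the average inter-item correlation."""
    k, r = n_variables, mean_correlation
    return k * r / (1 + (k - 1) * r)


alpha_4_items = standardized_alpha(4, 0.5)
alpha_8_items = standardized_alpha(8, 0.5)
```

With the same average correlation, doubling the number of variables raises alpha, which is why long scales can reach a high alpha even when the individual correlations are modest.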

Standard errors
A standard error for each correlation coefficient can be computed. If this option is requested, the standard errors are placed in a separate matrix, right under the matrix of correlation coefficients. If the sample is more complex than a simple random sample, the standard errors calculated here are probably too small.

The standard errors can be used to create confidence intervals for each correlation coefficient. For example, you can be 95% confident that the correlation coefficient in the population (for each pair of variables) is within the interval bounded by approximately two standard errors above and below the correlation coefficient calculated from the sample (as shown in the matrix). The actual multiple to use for creating confidence intervals is the t-statistic with (n-1) degrees of freedom.

The calculation of the standard error of the correlation coefficient in each cell is based by default on the UNWEIGHTED number of cases, even if a weight variable has been used for calculating the correlation coefficient. Ordinarily this procedure will generate a more appropriate statistical test than one based on the weighted N in each cell.

The standard error is computed differently, depending on which correlation coefficient you have selected.

Standard errors for Pearson correlation coefficients:
The confidence interval for the Pearson correlation coefficient is not symmetric; therefore, there is no single standard error that applies in both directions. The standard error output by this program is the average distance of the upward and the downward confidence band for one standard error (based on the retransformation of Fisher's Z), since that number is ordinarily a useful approximation.
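The averaging described above can be sketched in Python. The sketch assumes the usual standard error of Fisher's Z, 1 / sqrt(n - 3), where n is the (unweighted) number of cases; the help text does not spell out that detail, and the function names are hypothetical.

```python
import math


def fisher_band(r, n):
    """One-standard-error band around r, via Fisher's Z and back."""
    z = math.atanh(r)              # Fisher's Z transformation
    se_z = 1 / math.sqrt(n - 3)    # assumed standard error of Z
    return math.tanh(z - se_z), math.tanh(z + se_z)


def pearson_se(r, n):
    """Average of the upward and downward distances of the band from r."""
    lower, upper = fisher_band(r, n)
    return ((upper - r) + (r - lower)) / 2


lower, upper = fisher_band(0.8, 28)
se = pearson_se(0.8, 28)
```

For a positive correlation the upward distance is smaller than the downward distance, because the coefficient cannot exceed 1.0. The reported standard error falls between the two.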

Standard errors for the log of the odds ratio:
The standard error for the log of the odds ratio is calculated with standard formulas for that statistic. Consult a statistics book for details.
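One widely used textbook formula takes the square root of the sum of the reciprocals of the four cell counts in the 2x2 table. The Python sketch below shows that formula; it is given as an illustration, and it is an assumption that this is the exact formula SDA applies.

```python
import math


def log_odds_ratio_se(a, b, c, d):
    """Standard error of the log odds-ratio from the four cell counts of a 2x2 table."""
    return math.sqrt(1 / a + 1 / b + 1 / c + 1 / d)


se = log_odds_ratio_se(20, 10, 10, 20)
```

The formula is undefined when any cell is empty, and the standard error becomes large when any cell has only a few cases.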

Univariate statistics
Univariate statistics for each of the variables in the correlation matrix will be computed and displayed, if this option is selected.

The statistics available for each variable include its mean, standard deviation, standard error, valid N of cases, and (if there is a weight variable) valid weighted N of cases.

If missing-data cases have been excluded LISTWISE (the default), the univariate statistics for all variables will be based on the SAME cases -- those which have valid data on ALL of the variables.

If missing-data cases have been excluded PAIRWISE, the univariate statistics for each variable will be based on all the cases with valid data for that one variable.


Paired univariate statistics
If missing-data cases have been excluded pairwise, each correlation coefficient is based (potentially) on a different subset of the cases. Univariate statistics based on that same subset of cases for each pair of variables will be calculated and displayed, if this option is selected.

The paired statistics for each variable include its mean, standard deviation, valid N of cases for the pair, and (if there is a weight variable) valid weighted N of cases for the pair.

These statistics are displayed as a series of matrices. Each statistic for a given variable is (potentially) somewhat different, depending on which other variable it is being paired with.


Index of proportionality (P-squared)
It is sometimes useful to know the degree to which the correlations in each row of the correlation matrix are proportional to the correlations in the other rows. This is particularly the case in creating scales or indexes of items. If variables are measuring the same thing, they should have similar correlations to other relevant (criterion) variables.

The P-squared statistic is a way to measure the proportionality of rows in a correlation matrix. For example, if all of the coefficients in one row are exactly double the size of the coefficients in another row, there is a constant proportionality, and the index will be 1.0.
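The formula for P-squared is not given in this help text. Purely as an illustration of what "proportional rows" means, the Python sketch below uses the squared cosine between two rows of coefficients, which equals 1.0 when one row is an exact multiple of the other. It is an assumption that this resembles the statistic SDA computes; see the reference cited at the end of this section for the actual definition.

```python
def proportionality_index(row_a, row_b):
    """Squared cosine between two rows of correlations (illustrative measure only)."""
    cross = sum(a * b for a, b in zip(row_a, row_b))
    sum_sq_a = sum(a * a for a in row_a)
    sum_sq_b = sum(b * b for b in row_b)
    return cross ** 2 / (sum_sq_a * sum_sq_b)


# Correlations of two items with the same three criterion variables
doubled = proportionality_index([0.2, 0.3, 0.4], [0.4, 0.6, 0.8])
different = proportionality_index([0.2, 0.3, 0.4], [0.4, 0.1, -0.2])
```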

Usually we want to limit this comparison to a subset of the matrix -- namely, to the part corresponding to the correlations of the criterion variables with the variables of interest. To do this, we specify on the option screen the variable numbers (next to each text box on the option screen) corresponding to the variables for which we want the P-squared measure, and the variable numbers corresponding to the criterion variables.

For example, we could examine the degree to which the variables v1, v2, and v3 have proportional correlations to the criterion variables x1, x2, and x3. We would enter v1, v2, and v3 into the first 3 text boxes on the option screen; and x1, x2, and x3 into text boxes 4 through 6. To get the P-squared statistic for all the combinations of v1, v2, and v3, with respect to the criterion variables, we would then specify variable numbers 1-3 as the variables to measure and variable numbers 4-6 as the criterion variables.

These variable numbers can be specified either as a range (1-3) or as a list (1,2,3); and the variables need not be adjacent in the original correlation matrix -- a list like '1,3,5' is valid.
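The two ways of writing variable numbers can be illustrated with a short Python sketch that expands either form into a plain list. The function name is hypothetical; the sketch also accepts a mixture such as '1-3,5', which is an assumption about what the option screen allows.

```python
def parse_variable_numbers(spec):
    """Expand a range ('1-3') or a list ('1,3,5') into a list of integers."""
    numbers = []
    for part in spec.split(","):
        part = part.strip()
        if "-" in part:
            first, last = part.split("-")
            numbers.extend(range(int(first), int(last) + 1))
        else:
            numbers.append(int(part))
    return numbers


as_range = parse_variable_numbers("1-3")
as_list = parse_variable_numbers("1,3,5")
```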

The P-squared statistics are presented in a symmetrical matrix. Each row and column corresponds to one of the variables that we specified as a "variable to measure."

For a discussion of how to use this statistic, see Thomas Piazza, "The Analysis of Attitude Items," American Journal of Sociology, vol. 86 (1980) pp. 584-603.


Other Display Options

Reverse signs of some correlations
In order to detect patterns in the correlation matrix, it is sometimes useful to reverse the signs of the correlations corresponding to one or more variables. Enter the variable number of each variable for which you want the signs reversed. The variable number corresponds to the text box number on the option screen.

For example, we may know that var1 is scaled in such a way that a HIGH score or value corresponds to a LOW score on var2 and var3, so we expect the correlations of var1 to be negative with var2 and var3. But if we are interested in the relationships of those variables to other variables, it will be easier to detect different patterns if we reverse all the signs corresponding to var1. That way, we can expect var1, var2, and var3 to have correlations of the same sign with other variables. Then if we do observe a difference in the signs, it will catch our attention.
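The effect of reversing signs can be sketched in Python. Reversing a variable changes the sign of its correlation with every variable that is not itself reversed; a correlation between two reversed variables keeps its sign, and the diagonal is unchanged. The function name is hypothetical, and the sketch assumes one variable per text box, so that variable number n is row and column n of the matrix.

```python
def reverse_signs(matrix, variable_numbers):
    """Return a copy of a correlation matrix with the given variables reversed."""
    flipped = {n - 1 for n in variable_numbers}   # variable numbers start at 1
    size = len(matrix)
    return [
        [
            # Flip only when exactly one of the two variables is reversed
            -matrix[i][j] if (i in flipped) != (j in flipped) else matrix[i][j]
            for j in range(size)
        ]
        for i in range(size)
    ]


original = [
    [1.0, -0.4, -0.5],
    [-0.4, 1.0, 0.6],
    [-0.5, 0.6, 1.0],
]
reversed_var1 = reverse_signs(original, [1])
```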


Color coding of the correlations
The correlation coefficients are color coded, in order to aid in detecting patterns. Correlation coefficients greater than zero become redder, the larger they are. Correlation coefficients less than zero become bluer, the more negative they are.

The transition from a lighter shade of red or blue to a darker shade depends on the magnitude of the correlation coefficient in each cell. The lightest shade corresponds to coefficients between 0 and .15. The colors become darker as the absolute value of the correlation exceeds .15, then .30, then .45.

Color coding is also used for the P-squared matrix, if one has been requested. However, the dividing points for colors are double in magnitude. The lightest shade corresponds to P-squared coefficients between 0 and .30. The colors become darker as the absolute value of the P-squared coefficient exceeds .30, then .60, then .90.
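The dividing points described above can be summarized in a small Python sketch that maps a coefficient to a hue and a shade level from 0 (lightest) to 3 (darkest). The names are hypothetical, and the treatment of a coefficient that falls exactly on a dividing point, or exactly at zero, is an assumption.

```python
CORRELATION_CUTOFFS = (0.15, 0.30, 0.45)
P_SQUARED_CUTOFFS = (0.30, 0.60, 0.90)   # double the correlation cutoffs


def shade(coefficient, cutoffs=CORRELATION_CUTOFFS):
    """Return (hue, level): level counts how many cutoffs the magnitude exceeds."""
    hue = "red" if coefficient > 0 else "blue" if coefficient < 0 else "none"
    level = sum(abs(coefficient) > cutoff for cutoff in cutoffs)
    return hue, level


weak_positive = shade(0.10)
moderate_negative = shade(-0.35)
p_squared_cell = shade(0.35, P_SQUARED_CUTOFFS)
```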

The color coding can be turned off, if you prefer. Color coding may not be helpful if you are using a black-and-white monitor or if you intend to print out the matrix on a black-and-white printer.


Question text
The text of the question that produced each variable is generally available.

SDA Comparison of Correlations Program

This program calculates the correlation between two variables separately within categories of the row variable and, optionally, the column variable. If a control variable is specified, a separate table will be produced for each category of the control variable.

Steps to take

Specify variables
To specify that a certain survey question or variable is to be included in a table, use the name for that variable as given in the documentation for this study. Aside from simply specifying the name of a variable, certain variable options are available to change or restrict the scope of a variable.

Select display options
After specifying the names of variables, select the display options you wish. These affect the number of decimals to show, statistics to compute, and text to display.

Select an action
After specifying all variables and options, select the action to take.

REQUIRED variable names

Variables to be correlated
Two numeric variables whose correlation coefficient is to be computed for each combination of the row and (optionally) column and control variables and displayed in a table

Row variable(s)
Variable down the side of the table

OPTIONAL variable names

Column variable(s)
Variable along the top of the table

Control variable(s)
A separate table is produced for each category of a control variable.

If more than one correlation variable, row variable, column variable, and/or control variable is specified, a separate table will be generated for each combination of variables.

Selection filter variable(s)
Some cases are included in the analysis; others are excluded.

Weight variable
Cases are given different relative weights.

Display Options for Comparison of Correlations

Correlation measure to calculate
The Pearson correlation coefficient is the default correlation measure to calculate. It is appropriate for ordered numeric variables.

The log of the odds-ratio is an optional measure for dichotomous variables. The calculation of the odds ratio assumes that the two variables to be correlated have only two categories each. If these statistics are requested, CORRTAB treats Var 1 and Var 2 as dichotomies, regardless of the number of categories they may actually have. The minimum valid value of each variable is treated as the base category (coded 0), and all valid values greater than the minimum are combined into the other category (coded 1). If this default dichotomization is not appropriate for a particular variable, you can specify another temporary recode after the variable name is given.


Show differences from overall correlation (instead of cell correlations)
Usually each cell of the table will contain the correlation coefficient of the two variables being correlated, for that particular combination of the row and (optionally) column and control variables. Sometimes, however, it is more helpful to express each cell correlation as the DIFFERENCE from the overall correlation. Select this option to have those differences calculated and put into each cell of the table.

Standard errors
Standard errors for the correlations can be computed and displayed for each cell of the table. The standard errors can be used to create confidence intervals for the correlation in each cell. If the sample is equivalent to a simple random sample of a population, you can be about 95% confident that the correlation in the population (for each cell) is within the interval bounded by two standard errors above and below the correlation in the sample (shown in the table).

The standard error is computed differently, depending on which correlation coefficient you have selected. The standard error for the Pearson correlation is based on Fisher's Z, and it is calculated as the average distance of the upward and the downward confidence band for one standard error (based on the retransformation of Fisher's Z into Pearson's R). The standard error for the log of the odds ratio is calculated with standard formulas for that statistic.

If the sample is more complex than a simple random sample, the standard errors calculated here are probably too small.

Consult any beginners' statistics book for more information on the meaning of these statistics.

Other Display Options

Color coding of the cells
The cells of the table of correlations are color coded, in order to aid in detecting patterns. Cells with higher correlations than the overall correlation become redder, the more they exceed the overall correlation. Cells with lower correlations than the overall correlation become bluer, the smaller they are.

The transition from a lighter shade of red or blue to a darker shade depends on the magnitude of the t-statistic. The lightest shade corresponds to t-statistics between 0 and 1. The medium shade corresponds to t-statistics between 1 and 2. The darkest shade corresponds to t-statistics greater than 2.

The color coding can be turned off, if you prefer. Color coding may not be helpful if you are using a black-and-white monitor or if you intend to print out the table on a black-and-white printer.


Show the t-statistic
If you select this option, the t-statistic will also be displayed in each cell.

The t-statistic shows whether the correlation in a cell is larger or smaller than the overall correlation. It also takes into account the total number of cases in each cell. If there are only a few cases in a cell, a deviation from the overall correlation is less significant than the same deviation would be in a cell with many cases.

The t-statistic is calculated as the ratio of two quantities: The numerator is the difference between the correlation in the cell and the overall correlation. The denominator is the standard error of the correlation in that cell.

Note that the t-statistic controls the color coding of cells in the table of correlations.
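The t-statistic and the shade it selects can be sketched in Python as follows. The function names are hypothetical, and the handling of a t-statistic that is exactly 1 or exactly 2 is an assumption.

```python
def cell_t_statistic(cell_r, overall_r, cell_se):
    """Difference from the overall correlation, divided by the cell's standard error."""
    return (cell_r - overall_r) / cell_se


def t_shade(t):
    """Hue from the sign of t, shade from its magnitude (dividing points 1 and 2)."""
    hue = "red" if t > 0 else "blue" if t < 0 else "none"
    magnitude = abs(t)
    if magnitude > 2:
        return hue, "darkest"
    if magnitude > 1:
        return hue, "medium"
    return hue, "lightest"


t = cell_t_statistic(cell_r=0.625, overall_r=0.25, cell_se=0.125)
color = t_shade(t)
```

Because the standard error shrinks as the number of cases in a cell grows, the same difference from the overall correlation produces a larger t-statistic, and a darker shade, in a cell with many cases.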


Number of decimals for the correlation
You can select from 1 to 6 decimal places. The default is 2 decimal places.

Question text
The text of the question that produced each variable is generally available.

SDA Multiple Regression Program

This program calculates the regression coefficients for one or more independent or predictor variables, using ordinary least squares.

Two versions of the regression coefficient are given for each variable:

  1. The unstandardized regression coefficient -- labeled B
  2. The standardized regression coefficient -- labeled Beta

For each version of the coefficient there is also a standard error -- labeled either as SE(B) or as SE(Beta). The calculation of these standard errors depends on the sample design, as specified when the dataset was set up for SDA. For simple random samples, the standard formulas are used.

For complex sample designs, the user has a choice to specify SRS or complex standard errors. If your analysis is exploratory, and if you are only interested in the magnitude of the coefficients, you might want to specify that the sample is SRS, since the calculation of complex standard errors can be time consuming and does not affect the coefficients themselves. However, the complex standard errors should be used for significance tests and for the presentation of results.

In addition to the coefficients for each independent variable, a few summary measures for the regression as a whole are given. These include the Multiple R (multiple correlation coefficient), the R-Squared (the square of the Multiple R, also called the Coefficient of Determination), the Adjusted R-Squared, and the Standard Error of the Estimate (also called the root mean square error).

The Adjusted R-Squared is a measure that compensates for the inflation of the regular R-Squared statistic due simply to the inclusion of additional independent variables. The Adjusted R-Squared will increase only if the additional independent variables increase the predictive power of the model more than would be expected by chance. It will always be less than or equal to the regular R-Squared.
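The usual formula for this adjustment is 1 - (1 - R-Squared) * (n - 1) / (n - k - 1), where n is the number of cases and k is the number of independent variables. The Python sketch below applies that standard formula; the function name is hypothetical, and the help text does not say which N (weighted or unweighted) SDA uses.

```python
def adjusted_r_squared(r_squared, n_cases, n_predictors):
    """Standard adjustment of R-squared for the number of predictors."""
    return 1 - (1 - r_squared) * (n_cases - 1) / (n_cases - n_predictors - 1)


adjusted = adjusted_r_squared(r_squared=0.5, n_cases=21, n_predictors=4)
```

The adjustment matters most when the number of independent variables is large relative to the number of cases; with thousands of cases and a handful of predictors the two statistics are nearly identical.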

Steps to take

Specify variables
To specify that a certain survey question or variable is to be used as the dependent variable, give the name for that variable as given in the documentation for this study. Then specify the names of one or more independent variables. Selection filter variables and a weight variable may also be specified.

Aside from simply specifying the name of a variable, it is possible to restrict the range of a variable or to recode the variable temporarily. Note in particular that you can create dummy variables and product terms.

Select display options
After specifying the names of variables, select the display options you wish. These affect the statistics to compute, the number of decimals to show, and text to display.

Run regression
After specifying all variables and options, select Run Regression to run the program.

Or you can select Clear Fields to delete all previously specified variables and options, so that you can start over.


REQUIRED variable names

Dependent variable
Enter the name of one numeric variable to be used as the dependent variable or the variable to be predicted.

Independent variables
Enter the names of one or more numeric variables whose regression coefficients are to be computed. Note that you can specify dummy variables and product terms as independent variables. It is also possible to restrict the range of a variable or to recode the variable temporarily.

Enter the name of each variable in a text box. To go from one text box to another, use the tab key or your mouse. It is all right to skip a text box and leave it blank -- to use only text boxes 1, 5, and 9, for example.

It is possible to enter more than one variable name in a text box (the underlying text-entry area will scroll). Ordinarily it is clearer to put only one variable in each text box, but it is possible to enter more variables than there are text boxes.


Create a Temporary Dummy Variable
A dummy variable is a dichotomous variable coded 0 or 1. Cases that have a certain characteristic are coded as 1; whereas cases that do NOT have the characteristic are coded as 0.

To create such a variable temporarily, for a single regression run, for example, use the following syntax:

varname(d:1-3)

This would create a variable in which cases coded 1 through 3 on the variable 'varname' receive a code of 1, and all other VALID cases receive a code of 0. If 'varname' has a code defined as missing-data or out of range, the dummy variable will have the system-missing data value.

The characters 'd:' (or 'D:') indicate that you want to create a temporary dummy variable. The codes that follow show which codes on the original variable should become the code of 1 on the new dummy variable. One or more single code values or ranges can be specified. Multiple codes or ranges are separated by a comma.
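The recoding rule can be mimicked outside SDA with a short Python sketch. It is not SDA's implementation: the function names are hypothetical, None stands in for a missing-data or out-of-range code, and only non-negative whole-number codes are handled.

```python
def parse_codes(spec):
    """Turn a code list such as '1, 3-5, 9' into a set of integer codes."""
    codes = set()
    for part in spec.split(","):
        part = part.strip()
        if "-" in part:
            low, high = part.split("-")
            codes.update(range(int(low), int(high) + 1))
        else:
            codes.add(int(part))
    return codes


def dummy(value, spec):
    """1 if the code is listed, 0 for any other valid code, None if missing."""
    if value is None:
        return None
    return 1 if value in parse_codes(spec) else 0


# Equivalent in spirit to varname(d:1-3)
recoded = [dummy(v, "1-3") for v in [1, 2, 3, 4, None]]
```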

You can give the '1' category of the dummy variable a label by putting the label in double quotes or in square brackets:

occupation(d:1, 3-5, 9, 10 "Managerial occupations = 1")

If you do not give a label, SDA will take the label from the code of the input variable assigned to the '1' category on the new dummy variable, provided that only a single code is assigned to the '1' category.


Create Multiple Temporary Dummy Variables (Regression only)
For multiple regression, it is possible to create multiple dummy variables at the same time from a single variable -- a separate dummy variable for every category except one, which is considered the base category. Each of the dummy variables would then be used as an independent variable in the regression.

For example, a variable such as 'party' (political party) could have categories like '1=Democrat', '2=Republican', '3=Independent', '4=Other'. To make 3 dummy variables, with 4 as the base category, use the syntax:

party(m:4)

The characters 'm:' (or 'M:') indicate that you want to create multiple temporary dummy variables. The code(s) that follow show which code(s) on the original variable should become the base category -- that is, which code or codes should NOT have a dummy variable created. The use of this syntax to create multiple dummy variables also has the effect of defining the set of dummy variables as a group, whose effects as a group are tested for significance.

One or more single code values or ranges can be specified as the base category. Multiple codes or ranges are separated by a comma, as in this example:

education(m:1-8,14,15)

If you want to create dummy variables for every category except the category with the highest valid numeric code, you can designate '*' as the base category. For example:

party(m:*)

For the example above, this has the same effect as designating '4' as the base category. However, it is convenient to be able to create multiple dummy variables without knowing ahead of time which category has the highest valid code.
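The set of dummy variables produced by this syntax can be illustrated with a Python sketch. It is only an approximation of the behavior described above: the function name is hypothetical, None stands in for a missing-data value, and the list of categories is taken from the valid codes found in the data rather than from the variable's definition.

```python
def multiple_dummies(values, base="*"):
    """Return {code: 0/1 list} for every valid code that is not a base code.

    base is a list of base codes, or '*' for the highest valid code.
    """
    valid_codes = sorted({v for v in values if v is not None})
    base_codes = {valid_codes[-1]} if base == "*" else set(base)
    return {
        code: [None if v is None else int(v == code) for v in values]
        for code in valid_codes
        if code not in base_codes
    }


party = [1, 2, 3, 4, 2, None]           # 1=Democrat ... 4=Other
explicit = multiple_dummies(party, [4])  # like party(m:4)
starred = multiple_dummies(party)        # like party(m:*)
```

Cases in the base category are coded 0 on every dummy variable, so each regression coefficient is read as a difference from the base category.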

Note that using this multiple dummy syntax is similar to creating individual dummy variables. However, dummy variables created individually are not automatically treated as a group, for purposes of testing the significance of the group as a whole.
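
The expansion performed by the 'm:' syntax can be sketched in a few lines of Python. This is only an illustration of the coding rule described above, not SDA's own code: the function name 'make_multiple_dummies' is hypothetical, and returning None for an invalid code is an assumption modeled on how single dummy variables treat missing data.

```python
def make_multiple_dummies(value, valid_codes, base_codes=None):
    """Return one 0/1 dummy per non-base category for a single case.

    base_codes=None imitates 'm:*': the highest valid code is the base.
    An invalid value yields None (assumed stand-in for system-missing).
    """
    ordered = sorted(valid_codes)
    if base_codes is None:
        base_codes = {ordered[-1]}
    if value not in valid_codes:
        return None
    return {code: int(value == code) for code in ordered if code not in base_codes}


# party: 1=Democrat, 2=Republican, 3=Independent, 4=Other
party_codes = {1, 2, 3, 4}
print(make_multiple_dummies(2, party_codes, base_codes={4}))  # like party(m:4)
print(make_multiple_dummies(4, party_codes))                  # like party(m:*)
```

A case in the base category receives 0 on every dummy variable, which is why its effect is absorbed into the constant term.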


Create a Temporary Product Variable or Term (Regression and correlation only)
An independent variable in a regression can be the product of two or more variables. A product variable can also combine one or more temporary dummy variables.

To create such a variable temporarily, for a single regression run for instance, use an asterisk (*) between the component variable names. For example:

age*education

This would create a variable in which, for each case, the value of 'age' is multiplied by the value of 'education'. If either 'age' or 'education' has an invalid code for that case, the temporary product term will have the system missing-data value.

One or more dummy variables can also be part of a product term. For example, the following form is acceptable:

party(d:3)*sex

In this example, first a single dummy variable is created from the variable 'party', and then that dummy variable is multiplied by 'sex'. Note that this syntax does not work with multiple dummy variables created like 'party(m:*)'. It only works with single dummy variables.
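
As a rough illustration of how a product term behaves for a single case, the following Python sketch multiplies the component values and passes missing data through. The function name 'product_term' is hypothetical, and None is used here as a stand-in for the system missing-data value.

```python
def product_term(*values):
    """Multiply the component values for one case.

    If any component is missing (None), the product is missing too.
    """
    if any(v is None for v in values):
        return None
    result = 1
    for v in values:
        result *= v
    return result


print(product_term(40, 12))    # age*education for one case
print(product_term(40, None))  # education is invalid, so the product is missing
print(product_term(1, 2))      # party(d:3)*sex for an Independent with sex=2
```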


OPTIONAL variable names

Selection filter variable(s)
Some cases are included in the analysis; others are excluded.

Weight variable
Cases are given different relative weights in calculating the regression coefficients.

How cases with missing data are excluded

Listwise exclusion
If a case has a missing-data value on ANY of the variables to be correlated and then regressed, it is excluded from ALL of the regression calculations. This is the only allowed procedure. The pairwise option available for the correlation program is not available for the regression programs.
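
Listwise exclusion can be pictured as a simple filter applied before any calculation. The sketch below assumes each case is a Python dictionary and that missing data is represented by None; both choices, and the name 'listwise_complete', are illustrative only.

```python
def listwise_complete(cases, variables):
    """Keep only the cases that have valid data on every listed variable."""
    return [
        case for case in cases
        if all(case.get(name) is not None for name in variables)
    ]


cases = [
    {"income": 52, "age": 40, "education": 12},
    {"income": 61, "age": None, "education": 16},  # dropped: age is missing
    {"income": 38, "age": 29, "education": 14},
]
kept = listwise_complete(cases, ["income", "age", "education"])
print(len(kept))
```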

Additional Statistics to Calculate

T-test for each coefficient
The t-test for each regression coefficient is generally displayed. The t-statistic is the unstandardized regression coefficient (B) divided by its standard error -- shown as SE(B). Dividing the standardized regression coefficient (Beta) by its standard error, shown as SE(Beta), gives the same t-statistic.

The probability estimate associated with each t-statistic is given in the last column. This is the probability of obtaining a regression coefficient (either B or Beta) that is this large or larger, if the true coefficient is equal to zero in the population from which the current sample was drawn.

If the probability value for a regression coefficient is low (about .05 or less), the chances are correspondingly low that the observed effect of that independent variable on the dependent variable is only due to sampling error. However, a low probability value does not indicate that the true value of the coefficient in the population is of any specific magnitude -- only that it is not equal to zero.

The t-statistic and associated probability value are also given for the constant term of the regression equation. This is a test that the regression equation in the population has no constant term (or intercept). This test is usually of less interest than the tests for the regression coefficients of the independent variables.
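
The arithmetic behind the t-statistic can be checked with a short Python sketch. The numbers are made up, and the sketch assumes the usual relationship between the two coefficients, in which Beta and SE(Beta) are obtained by rescaling B and SE(B) by the same factor (the standard deviation of the independent variable divided by that of the dependent variable).

```python
import math


def t_statistic(coefficient, standard_error):
    """Return a coefficient divided by its standard error."""
    return coefficient / standard_error


# Hypothetical values for one independent variable.
b, se_b = 0.84, 0.21
sd_x, sd_y = 12.0, 30.0  # standard deviations of predictor and outcome

# Standardizing rescales B and SE(B) by the same factor, so the ratio is unchanged.
beta = b * sd_x / sd_y
se_beta = se_b * sd_x / sd_y

t_from_b = t_statistic(b, se_b)
t_from_beta = t_statistic(beta, se_beta)
print(math.isclose(t_from_b, t_from_beta))
```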

Sum of squares analysis
The decomposition of the sum of squares for the regression is shown after the regression coefficients and the t-statistics. The proportion of the total sum of squares in the dependent variable accounted for by the regression is the 'R-squared' statistic, often referred to as the proportion of the variance that is "explained" by the regression on the independent variables. The square root of the R-squared statistic is the 'Multiple R' or the multiple correlation coefficient.

Since the R-squared never decreases with the addition of more independent variables, regardless of their independent contribution, an 'Adjusted R-squared' is also shown. The Adjusted R-squared compensates for the addition of extra variables and will be less than the R-squared if some of the additional independent variables do not contribute independent predictive power.
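
The relationships among these statistics can be sketched in Python. The adjustment shown is the common textbook formula, 1 - (1 - R-squared)(n - 1)/(n - k - 1); whether SDA uses exactly this form is an assumption, and the function name and input values are hypothetical.

```python
import math


def r_squared_summary(ss_regression, ss_total, n_cases, n_predictors):
    """Return (R-squared, Multiple R, Adjusted R-squared)."""
    r2 = ss_regression / ss_total
    multiple_r = math.sqrt(r2)
    # Textbook adjustment for the number of independent variables.
    adjusted = 1 - (1 - r2) * (n_cases - 1) / (n_cases - n_predictors - 1)
    return r2, multiple_r, adjusted


r2, multiple_r, adjusted = r_squared_summary(40.0, 100.0, n_cases=101, n_predictors=4)
print(round(r2, 3), round(multiple_r, 3), round(adjusted, 3))
```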

Global tests
In addition to the t-tests for the individual independent variables, tests are generally carried out on groups of variables. (Uncheck this box to suppress this output.)

Confidence intervals
Confidence intervals for the regression coefficients can be requested for various levels of confidence. The width of the confidence interval is affected by the level of confidence requested. For the usual 95% confidence interval, you can be 95% confident that the regression coefficient in the population from which the sample was drawn is within the interval bounded by approximately two standard errors above and below the regression coefficient in the sample (ignoring the problem of potential bias in the sample).

Note that the accuracy of the confidence intervals depends on specifying the correct sample design. If the sample is not a simple random sample (SRS), the size of the SRS standard errors and confidence intervals will probably be too small.
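
The "approximately two standard errors" rule can be illustrated with a short Python sketch. It uses large-sample normal multipliers (1.645, 1.960, 2.576) as an assumption; the program's own intervals may be based on a t distribution and on the sample design, so treat the numbers as approximate.

```python
# Large-sample multipliers for each confidence level (an approximation).
Z_VALUES = {90: 1.645, 95: 1.960, 99: 2.576}


def confidence_interval(coefficient, standard_error, level=95):
    """Return (lower, upper) bounds around a regression coefficient."""
    half_width = Z_VALUES[level] * standard_error
    return coefficient - half_width, coefficient + half_width


low95, high95 = confidence_interval(0.5, 0.1, level=95)
low99, high99 = confidence_interval(0.5, 0.1, level=99)
print(round(low95, 3), round(high95, 3))
print(round(low99, 3), round(high99, 3))
```

Requesting a higher level of confidence widens the interval, as the 99 percent bounds show.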

Univariate statistics
Univariate statistics for each of the variables will be computed and displayed, if this option is selected. The statistics displayed for each variable include its mean and standard deviation.

Product of B and the univariate statistics
For each independent variable, the product of its regression coefficient (B) with its mean and its standard deviation can be displayed.

If this option is selected, the univariate statistics will automatically be selected as well, and the products will be displayed as additional columns in that table.

Correlation matrix
The correlation matrix of all the variables with one another will be displayed. The diagonal elements are always equal to 1.0 -- that is, each variable is perfectly correlated with itself.

Covariance matrix
The covariance matrix of all the variables with one another will be displayed. Each diagonal element displays the variance of a variable. The off-diagonal elements are the covariances.

Covariance matrix of coefficients
The variance/covariance matrix of the regression coefficients with one another will be displayed. Each diagonal element displays the variance of the regression coefficient (B) of the corresponding variable -- that is, the square of its standard error. The off-diagonal elements are the covariances.
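
Because each diagonal element is the square of a standard error, the standard errors can be recovered by taking square roots. The following Python sketch uses a made-up 2 x 2 matrix stored as a list of lists; the function name is hypothetical.

```python
import math


def standard_errors(cov_matrix):
    """Return the square root of each diagonal element of the matrix."""
    return [math.sqrt(cov_matrix[i][i]) for i in range(len(cov_matrix))]


# Hypothetical covariance matrix of two regression coefficients.
cov = [
    [0.04, 0.01],
    [0.01, 0.09],
]
se = standard_errors(cov)
print([round(x, 3) for x in se])
```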

Other Display Options

Sample design
For complex samples, the standard errors, confidence intervals and global tests should be calculated in a way that takes the complex design into account. Nevertheless, you can specify that the standard errors, confidence intervals, and other test values should be calculated as if the sample were a simple random sample (SRS). One reason to request SRS calculations might be to compare the size of the SRS standard errors or confidence intervals with the corresponding statistics based on the complex sample design.

For some large datasets, SRS calculations might be set as the default method, because the calculation of complex standard errors is MUCH more computer intensive and time-consuming than the equivalent SRS calculations. In such cases, it would be appropriate to do some SRS runs for exploratory purposes and then to request complex standard errors for your final runs.

The standard errors for complex samples are computed using the jackknife repeated replication method. The method used, together with the names of the stratum and/or cluster variables, is reported when you run the program. If you want additional technical information, see the discussion of standard error calculation methods.


Question text
The text of the question that produced each variable is generally available.

Color coding of the coefficients
The regression coefficients are color coded, in order to aid in detecting patterns, if t-tests have been requested. Regression coefficients greater than zero become redder, the larger they are. Regression coefficients less than zero become bluer, the more negative they are.

The transition from a lighter shade of red or blue to a darker shade depends on the magnitude of the t-statistic, which is the ratio of each regression coefficient (B) divided by its standard error. The lightest shade corresponds to t-statistics between 0 and 1. The medium shade corresponds to t-statistics between 1 and 2. The darkest shade corresponds to t-statistics greater than 2.

Correlation coefficients are also color coded, if a correlation matrix is requested. Correlation coefficients greater than zero become redder, the larger they are. Correlation coefficients less than zero become bluer, the more negative they are. The transition from a lighter shade of red or blue to a darker shade depends on the magnitude of the correlation coefficient in each cell of the matrix. The lightest shade corresponds to coefficients between 0 and .15. The colors become darker as the absolute value of the correlation exceeds .15, then .30, then .45.
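
The shading rules above amount to a pair of threshold lookups, sketched here in Python. The function names and shade labels are hypothetical, and the treatment of values that fall exactly on a cutoff (for example, a t-statistic of exactly 2) is an assumption, since the text does not specify it.

```python
def coefficient_shade(coefficient, t_statistic):
    """Return (hue, shade) for a regression coefficient."""
    hue = "red" if coefficient > 0 else "blue" if coefficient < 0 else "none"
    size = abs(t_statistic)
    if size > 2:
        shade = "darkest"
    elif size > 1:
        shade = "medium"
    else:
        shade = "lightest"
    return hue, shade


def correlation_shade_level(r):
    """Return 0 (lightest) to 3 (darkest) from the cutoffs .15, .30, .45."""
    return sum(abs(r) > cutoff for cutoff in (0.15, 0.30, 0.45))


print(coefficient_shade(0.84, 4.0))
print(coefficient_shade(-0.10, -1.5))
print(correlation_shade_level(0.20), correlation_shade_level(-0.50))
```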

The color coding can be turned off, if you prefer. Color coding may not be helpful if you intend to print out the regression results on a black-and-white printer.


Suppress independent variables list
If the regression has many independent variables that are used over and over in a series of regressions, the list of independent variables usually displayed at the beginning of the regression output may seem redundant. If this option is selected, the independent variables are not listed in the top section of the regression output, nor are their labels, valid ranges, and missing data codes. The variable names themselves, however, are still displayed in the rest of the output, so there should not be any confusion.

This option does NOT suppress the output for the dependent variable and any filter or weight variables used in the analysis. Only the independent variables are dropped from the list.


Chart Options for Multiple Regression

A chart showing the values of the regression coefficients and their confidence intervals can be displayed. You can choose either the 'B' (unstandardized coefficient) or the 'Beta' (standardized coefficient) to display.

The confidence intervals in the chart are based on the confidence level selected in "Output Options" (90, 95, or 99 percent level of confidence). If you request a chart, but the "Confidence intervals" checkbox in "Output Options" is not checked, then the default 95 percent confidence level will be used for the chart.

Note that the accuracy of the confidence intervals depends on specifying the correct sample design. If the sample is not a simple random sample (SRS), the size of the SRS standard errors and confidence intervals will probably be too small.

Coefficients to chart
Select the coefficients that will be charted: B (the default) or Beta. If you do not want a chart, select "(No chart)".

Range to display
By default, the range of the x-axis for the chart will be automatically adjusted, depending on the confidence intervals that will be displayed. Usually this works well. However, you can manually set the low and high bounds of the chart's x-axis if you prefer. Select "Custom range" in the menu, then enter the low and high bounds of the range in the input boxes that appear. If you leave either the "Low" box or the "High" box blank, the automatically adjusted value will be used for that bound.

Maximum number of independent variables to include in chart
By default, all of the independent variables are included in the chart. However, you can limit the number of independent variables that are displayed by setting a maximum in the drop-down menu. For example, if you have specified ten independent variables, but set the maximum to five, then only the first five variables are included in the chart. Note that this option only affects the number of independent variables displayed in the chart. It does not affect the values of the coefficients or confidence intervals.

Size of chart
The width and height of the chart (expressed in the number of pixels) can be modified. If there is a large (or a very small) number of independent variables, it may be helpful to increase (or decrease) the dimensions of the chart.

SDA Logit/Probit Regression Program

This program calculates the logit or probit regression coefficients for one or more independent or predictor variables.

Steps to take

Select type of regression to run
The choice is between logistic (logit) regression and probit regression. The difference is summarized below.

Specify variables
To specify that a certain survey question or variable is to be used as the dependent variable, give the name for that variable as given in the documentation for this study. Then specify the names of one or more independent variables. Selection filter variables and a weight variable may also be specified.

Aside from simply specifying the name of a variable, it is possible to restrict the range of a variable or to recode the variable temporarily. Note in particular that you can create dummy variables and product terms.

Select display options
After specifying the names of variables, select the display options you wish. These affect the statistics to compute, the number of decimals to show, and text to display.

Run Logit/Probit
After specifying all variables and options, select Run Logit/Probit to run the program.

Or you can select Clear Fields to delete all previously specified variables and options, so that you can start over.


Type of regression to run

The program can run either logistic (logit) or probit regression. The difference between them is in how the dependent variable is transformed from a proportion (a mean between 0 and 1).

When the dependent variable has only two categories, logistic and probit regression are more appropriate to use than ordinary least squares regression. Both logistic and probit regression will usually generate the same substantive results. The choice between them is generally a matter of custom within a specific field or discipline.
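
For readers who want to see the two transformations side by side, here is a small Python sketch using their standard definitions: the logit is the natural log of the odds, and the probit is the inverse of the standard normal cumulative distribution. The function names are illustrative and are not part of SDA.

```python
import math
from statistics import NormalDist


def logit(p):
    """Log of the odds p / (1 - p), for 0 < p < 1."""
    return math.log(p / (1 - p))


def probit(p):
    """Standard normal quantile of p, for 0 < p < 1."""
    return NormalDist().inv_cdf(p)


for p in (0.1, 0.5, 0.9):
    print(p, round(logit(p), 3), round(probit(p), 3))
```

Both transformations map a proportion of .5 to zero and are symmetric around it. The logit values are on a wider scale, which is one reason the two kinds of coefficients differ in size while usually leading to the same substantive conclusions.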


REQUIRED variable names

Dependent variable
Enter the name of one numeric variable to be used as the dependent variable or the variable to be predicted. In order for this variable to be used as a dependent variable in logit or probit regression, it must be coded to have exactly two categories: 0 and 1.

If the variable you want to use as a dependent variable is not already coded as a simple 0/1 variable, you can create a dummy variable, or you can recode the variable temporarily.

If the dependent variable is left as anything other than a simple 0/1 variable, the program will recode the dependent variable automatically. The lowest valid score will be recoded to the value '0', and all other scores will be recoded to the value '1'.
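
The automatic recoding rule can be sketched as follows in Python. The function name is hypothetical, and None is used as a stand-in for missing or out-of-range codes, which are assumed to stay missing.

```python
def recode_dependent(values):
    """Recode the lowest valid score to 0 and all other valid scores to 1."""
    valid = [v for v in values if v is not None]
    lowest = min(valid)
    return [None if v is None else int(v != lowest) for v in values]


print(recode_dependent([1, 2, 3, None, 1]))
```

If this default is not what you want (for example, if the category of interest is the lowest code), create a dummy variable explicitly instead.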

Independent variables
Enter the names of one or more numeric variables whose regression coefficients are to be computed. Note that you can specify dummy variables and product terms as independent variables. It is also possible to restrict the range of a variable or to recode the variable temporarily.

Enter the name of each variable in a text box. To go from one text box to another, use the tab key or your mouse. It is all right to skip a text box and leave it blank -- to use only text boxes 1, 5, and 9, for example.

It is possible to enter more than one variable name in a text box (the underlying text-entry area will scroll). Ordinarily it is clearer to put only one variable in each text box, but it is possible to enter more variables than there are text boxes.


Create a Temporary Dummy Variable
A dummy variable is a dichotomous variable coded 0 or 1. Cases that have a certain characteristic are coded as 1; whereas cases that do NOT have the characteristic are coded as 0.

To create such a variable temporarily, for a single regression run, for example, use the following syntax:

varname(d:1-3)

This would create a variable in which cases coded 1 through 3 on the variable 'varname' receive a code of 1, and all other VALID cases receive a code of 0. If 'varname' has a code defined as missing-data or out of range, the dummy variable will have the system-missing data value.

The characters 'd:' (or 'D:') indicate that you want to create a temporary dummy variable. The codes that follow show which codes on the original variable should become the code of 1 on the new dummy variable. One or more single code values or ranges can be specified. Multiple codes or ranges are separated by a comma.

You can give the '1' category of the dummy variable a label by putting the label in double quotes or in square brackets:

occupation(d:1, 3-5, 9, 10 "Managerial occupations = 1")

If you do not give a label, SDA will take the label from the code of the input variable assigned to the '1' category on the new dummy variable, provided that only a single code is assigned to the '1' category.
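
The coding rule behind the 'd:' syntax can be sketched for a single case in Python. The function name and the example set of valid codes are hypothetical, and None stands in for the system-missing value.

```python
def make_dummy(value, one_codes, valid_codes):
    """Return 1 if value is in one_codes, 0 for other valid codes, None otherwise."""
    if value not in valid_codes:
        return None
    return int(value in one_codes)


# Equivalent in spirit to varname(d:1-3), assuming valid codes 1 through 5.
valid = set(range(1, 6))
ones = {1, 2, 3}
print([make_dummy(v, ones, valid) for v in (1, 3, 4, 5, 9)])
```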


Create Multiple Temporary Dummy Variables

For multiple regression, it is possible to create multiple dummy variables at the same time from a single variable -- a separate dummy variable for every category except one, which is considered the base category. Each of the dummy variables would then be used as an independent variable in the regression.

For example, a variable such as 'party' (political party) could have categories like '1=Democrat', '2=Republican', '3=Independent', '4=Other'. To make 3 dummy variables, with 4 as the base category, use the syntax:

party(m:4)

The characters 'm:' (or 'M:') indicate that you want to create multiple temporary dummy variables. The code(s) that follow show which code(s) on the original variable should become the base category -- that is, which code or codes should NOT have a dummy variable created. The use of this syntax to create multiple dummy variables also has the effect of defining the set of dummy variables as a group, whose effects as a group are tested for significance.

One or more single code values or ranges can be specified as the base category. Multiple codes or ranges are separated by a comma, as in this example:

education(m:1-8,14,15)

If you want to create dummy variables for every category except the category with the highest valid numeric code, you can designate '*' as the base category. For example:

party(m:*)

For the example above, this has the same effect as designating '4' as the base category. However, it is convenient to be able to create multiple dummy variables without knowing ahead of time which category has the highest valid code.

Note that using this multiple dummy syntax is similar to creating individual dummy variables. However, dummy variables created individually are not automatically treated as a group, for purposes of testing the significance of the group as a whole.

Product terms

An independent variable can be the product of two or more variables.

To create such a variable temporarily, for a single regression run for instance, use an asterisk (*) between the component variable names. For example:

age*education

This would create a variable in which, for each case, the value of 'age' is multiplied by the value of 'education'. If either 'age' or 'education' has an invalid code for that case, the temporary product term will have the system missing-data value.

One or more dummy variables can also be part of a product term. For example, the following form is acceptable:

party(d:3)*sex

In this example, first a dummy variable is created from the variable 'party', and then that dummy variable is multiplied by 'sex'. Note that this syntax does not work with multiple dummy variables created like 'party(m:*)'. It only works with single dummy variables.


OPTIONAL variable names

Selection filter variable(s)
Some cases are included in the analysis; others are excluded.

Weight variable
Cases are given different relative weights in calculating the regression coefficients.

How to exclude cases with missing data

Listwise exclusion
If a case has a missing-data value on ANY of the variables included in the logit or probit regression, it is excluded from ALL of the regression calculations. This is the only allowed procedure. The pairwise option available for the correlation program is not available for the regression programs.

Additional Statistics to Calculate

T-test for each coefficient
The t-test for each logit or probit regression coefficient is generally displayed. (Uncheck the box to suppress this output.) The t-statistic is the regression coefficient (B) divided by its standard error -- shown as SE(B). (For a discussion of the calculation of standard errors for complex samples, see the document on methods used by SDA for computing standard errors for complex samples.)

The probability estimate associated with each t-statistic is given in the last column. This is the probability of obtaining a regression coefficient (B) that is this large or larger, if the true coefficient is equal to zero in the population from which the current sample was drawn.

If the probability value for a regression coefficient is low (about .05 or less), the chances are correspondingly low that the observed effect of that independent variable on the dependent variable is only due to sampling error. However, a low probability value does not indicate that the true value of the coefficient in the population is of any specific magnitude -- only that it is not equal to zero.

The t-statistic and associated probability value are also given for the constant term of the regression equation. This is a test that the regression equation in the population has no constant term (or intercept). This test is usually of less interest than the tests for the regression coefficients of the independent variables.

Exponential of the logistic regression coefficient (B)
The exponential (or antilog) of each logistic regression coefficient is usually displayed. (Uncheck the corresponding box if you want to suppress this output.) This transformed coefficient expresses the effect of a one unit change in that independent variable on the odds that a person will have a score of 1 versus a score of 0 on the dependent variable. Note that this exponential transformation converts the additive regression coefficients into multiplicative terms. Each exponential coefficient has the same significance level as the logistic coefficient on which it is based.
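
The multiplicative interpretation can be checked with a few lines of Python. The coefficient values are made up and the function name is hypothetical.

```python
import math


def odds_multiplier(b, change=1.0):
    """Factor by which the odds are multiplied when the predictor rises by `change` units."""
    return math.exp(b * change)


b = math.log(2)  # a logistic coefficient of about 0.693
print(round(odds_multiplier(b), 3))     # one-unit change
print(round(odds_multiplier(b, 2), 3))  # two-unit change
```

A coefficient of zero gives a multiplier of 1 (no change in the odds); negative coefficients give multipliers between 0 and 1.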

Probability Differences
The logit or probit coefficients can be a little difficult to interpret. This option converts the logit or probit coefficients to the scale of probabilities, to show how much each independent variable contributes to an increase in the probability that the dependent variable is predicted to be '1' rather than '0' under the specified regression model when all of the independent variables are at their mean values.

Two statistics are output for EACH independent variable:

If this option is selected, the univariate statistics will automatically be selected as well, to show the mean and standard deviation of each variable.

Summary Statistics
The log of the likelihood statistic is displayed after the regression coefficients and t-tests. This statistic is an indicator of the goodness of fit of the model and is used to calculate the pseudo-R-squared statistic.

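A pseudo-R-squared statistic is also displayed. It is calculated as 1 - (LL1 / LL0), where:

LL1 = the log likelihood of the model that includes the independent variables
LL0 = the log likelihood of the model that includes only the constant term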

This version of the pseudo-R-squared statistic is often referred to as "McFadden's R-squared" or the "Likelihood ratio index." It varies between 0 and (somewhat close to) 1.

The pseudo-R-squared statistic is (roughly) analogous to the R-squared statistic in ordinary least squares regression, which expresses the proportion of variance in the dependent variable explained by the entire set of independent variables. This pseudo-R-squared statistic, however, will be smaller than the R-squared in an ordinary regression, and it is not comparable across datasets. It is best used to compare regressions with different sets of independent variables within the same dataset.
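
As a worked illustration, the pseudo-R-squared can be computed from two log likelihoods in Python. The sketch assumes the usual reading of McFadden's statistic, in which LL1 is the log likelihood of the fitted model and LL0 is the log likelihood of a model with only the constant term; the values used are made up.

```python
import math


def mcfadden_r_squared(ll_model, ll_constant_only):
    """Return 1 - (LL1 / LL0) from two log likelihoods (both negative)."""
    return 1 - (ll_model / ll_constant_only)


# Two hypothetical models fitted to the same dataset.
small_model = mcfadden_r_squared(-90.0, -100.0)
large_model = mcfadden_r_squared(-80.0, -100.0)
print(round(small_model, 3), round(large_model, 3))
```

Because LL0 is the same for every model fitted to the same cases, comparing the two values shows which set of independent variables fits better within that dataset.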

Global tests
In addition to the t-tests for the individual independent variables, tests are generally carried out on groups of variables. (Uncheck this box, if you want to suppress this output.)

Confidence intervals
Confidence intervals for the regression coefficients can be requested for various levels of confidence. The width of the confidence interval is affected by the level of confidence requested. For the usual 95% confidence interval, you can be 95% confident that the regression coefficient in the population from which the sample was drawn is within the interval bounded by approximately two standard errors above and below the regression coefficient in the sample (ignoring the problem of potential bias in the sample).

For logit coefficients, two confidence intervals are shown. The first is for the logit coefficient itself. The second confidence interval is for the exponential (antilog) of the logit coefficient. This second confidence interval is created by taking the exponential of each upper and lower bound of the confidence interval for the logit coefficient.
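
The second interval is a direct transformation of the first, as this Python sketch shows. The bounds are made up and the function name is hypothetical.

```python
import math


def exp_interval(lower, upper):
    """Exponentiate both bounds of a logit coefficient's confidence interval."""
    return math.exp(lower), math.exp(upper)


low, high = exp_interval(0.0, math.log(2))
print(round(low, 3), round(high, 3))
```

Note that the transformed interval is not symmetric around Exp(B), since the exponential stretches the upper bound more than the lower bound.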

Univariate statistics
Univariate statistics for each of the variables will be computed and displayed, if this option is selected. The statistics displayed for each variable include its mean and standard deviation.

Product of B and the univariate statistics
For each independent variable, the product of its regression coefficient (B) with its mean and its standard deviation can be displayed.

If this option is selected, the univariate statistics will automatically be selected as well, and the products will be displayed as additional columns in that table.


Other Display Options


Color coding of the coefficients
The regression coefficients are color coded, in order to aid in detecting patterns, if t-tests have been requested. Regression coefficients greater than zero become redder, the larger they are. Regression coefficients less than zero become bluer, the more negative they are.

The transition from a lighter shade of red or blue to a darker shade depends on the magnitude of the t-statistic, which is the ratio of each regression coefficient (B) divided by its standard error. The lightest shade corresponds to t-statistics between 0 and 1. The medium shade corresponds to t-statistics between 1 and 2. The darkest shade corresponds to t-statistics greater than 2.

The color coding can be turned off, if you prefer. Color coding may not be helpful if you intend to print out the regression results on a black-and-white printer.


Question text
The text of the question that produced each variable is generally available.

Chart Options for Logit/Probit Regression

A chart showing the values of the regression coefficients and their confidence intervals can be displayed.

The confidence intervals in the chart are based on the confidence level selected in "Output Options" (90, 95, or 99 percent level of confidence). If you request a chart, but the "Confidence intervals" checkbox in "Output Options" is not checked, then the default 95 percent confidence level will be used for the chart.

Note that the accuracy of the confidence intervals depends on specifying the correct sample design. If the sample is not a simple random sample (SRS), the size of the SRS standard errors and confidence intervals will probably be too small.

Coefficients to chart
Select which coefficients will be charted. You can choose either 'B', or 'Exp(B)' (if Logit regression is specified), or (if 'Probability differences' are specified) 'P-Diff 1 unit' or 'P-Diff 1 SD'. If you do not want a chart, select '(No chart)'.

Range to display
By default, the range of the x-axis for the chart will be automatically adjusted, depending on the confidence intervals that will be displayed. Usually this works well. However, you can manually set the low and high bounds of the chart's x-axis if you prefer. Select "Custom range" in the menu, then enter the low and high bounds of the range in the input boxes that appear. If you leave either the "Low" box or the "High" box blank, the automatically adjusted value will be used for that bound.

Maximum number of independent variables to include in chart
By default, all of the independent variables are included in the chart. However, you can limit the number of independent variables that are displayed by setting a maximum in the drop-down menu. For example, if you have specified ten independent variables, but set the maximum to five, then only the first five variables are included in the chart. Note that this option only affects the number of independent variables displayed in the chart. It does not affect the values of the coefficients or confidence intervals.

Size of chart
The width and height of the chart (expressed in the number of pixels) can be modified. If there is a large (or a very small) number of independent variables, it may be helpful to increase (or decrease) the dimensions of the chart.

SDA Program to List Values of Individual Cases

This program lists the values of individual cases on variables specified by the user. Values of a numeric variable can also be transformed into percents of a second numeric variable. This is particularly useful when the cases in the data file are aggregate units such as cities.

One or more filter variables are used to limit the listing to a subset of the cases. In general a limit of 500 cases is enforced for each listing, in case the user has forgotten to limit the listing with sufficient filter variables.

Steps to take

Specify variables to list
To specify that a certain survey question or variable is to be included in the listing, enter into one of the text boxes the name for that variable, as given in the documentation for the study. You can also request a percent to be displayed.

Specify one or more filter variables
Selection filter variables are used to limit the listing to a subset of cases. Except for very small datasets, a filter variable will almost always be required.

Select display options
After specifying the names of variables, select the display options you wish. These affect how to display numeric variables and whether or not to display the text of each variable.

Start the listing
After specifying all variables and options, select Start Listing to begin the program.

Or you can select Clear fields to delete all previously specified variables and options, so that you can start over.


Variables to list
To specify that a certain survey question or variable is to be included in the listing, enter into one of the text boxes the name for that variable, as given in the documentation for the study.


Percentages

Aside from simply specifying the name of a variable, it is possible to convert a number into the percent of another variable. (Both variables must be numeric variables.) This is particularly useful when the cases in the data file are aggregate units such as cities.

To calculate and display a percent, use the following formats, beginning with $p, instead of a simple variable name:

$p(var1, var2)
This will display the value: 100 * var1 / var2
(using 1 decimal place) where 'var1' and 'var2' are variables in the dataset. It is not necessary that either 'var1' or 'var2' be specified separately for listing.

$p(var1, var2, 2)
To display a percent using other than one decimal place, specify the desired number of decimal places after var2. The example above would use 2 decimal places.

$p(demo, totvote, "Percent Voted Democrat")
To give your own name to the percentage created, put the name you want within double quotes. This name will be displayed at the top of the column for that percentage.
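
The calculation behind '$p' can be sketched in Python as follows. The function name is hypothetical, and returning None when either value is missing or when the denominator is zero is an assumption; the help text does not say how the program handles those cases.

```python
def percent(var1, var2, decimals=1):
    """Return 100 * var1 / var2, rounded to the requested number of decimals."""
    if var1 is None or var2 is None or var2 == 0:
        return None  # assumed handling of missing data and zero denominators
    return round(100 * var1 / var2, decimals)


print(percent(250, 1000))  # like $p(var1, var2)
print(percent(1, 3, 2))    # like $p(var1, var2, 2)
```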

Selection filter variables
After specifying the names of the variables to list, select the filter variable(s) in order to specify which cases to list. Since data files generally have a large number of cases, it is very important to limit the listing to a subset of the cases. The usual options for specifying filter variable(s) are available.

To avoid accidental attempts to list large numbers of cases, the program suppresses any listing that would exceed a certain number of cases. The default limit is 500 cases, but that limit can be modified when the datasets are set up in the Web archive.


Summaries of each variable listed
For each numeric variable listed, you can obtain summaries of the values for the selected cases in the listing. These summaries exclude missing-data or out-of-range values.

The available summaries are:

For a percentage (created with the '$p' command), the summaries, if requested, will be calculated as follows:


How to display variables (You may select one of the following options:)


Color coding on the output
In the listing of the values of each variable, the coloring of the headings can be suppressed if desired. This may be useful if you intend to print the output on a black and white printer.

Display question text for the listed variables

If this option is selected, the text corresponding to each variable listed is displayed at the bottom of the listing. The text for each variable referenced in a percent specification is also displayed.

Features Common to All Analysis Programs

Options for specifying variables


Multiple variable names

More than one name may be entered for variables to be analyzed, such as for the row and the column variables. The names should be separated by a comma or blanks. Separate analyses for each combination of variables will be generated.
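
As a rough illustration of how the number of analyses grows, the following Python sketch pairs every row name with every column name. The function names and the variable names ('educ', 'income', 'happy', 'gender', 'race') are hypothetical, and the simple splitting only handles plain names, not names followed by a parenthesized range.

```python
from itertools import product
from typing import List, Optional, Tuple


def split_names(field: str) -> List[str]:
    """Split a field on commas or blanks, as the help text allows."""
    return field.replace(",", " ").split()


def table_combinations(row_field: str, column_field: str = "") -> List[Tuple[str, Optional[str]]]:
    """Return one (row, column) pair per table that would be requested."""
    rows = split_names(row_field)
    columns = split_names(column_field) or [None]  # no column: univariate table
    return list(product(rows, columns))


# Three hypothetical row variables and two hypothetical column variables.
for pair in table_combinations("educ, income happy", "gender race"):
    print(pair)
```

With three row names and two column names the sketch yields 3 x 2 pairs, one per table.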

For example, the following specifications would generate six separate tables:


Restricting the valid range

The name of each analysis variable can be followed, in parentheses, by a list of values to be included in the analysis.

Basic range restriction

A single value such as 'gender(2)', or a range of codes such as 'age(30-50)', will limit the analysis to cases having those codes.
Multiple ranges and codes may be specified.
For example: age(1-17, 25, 95-100)

Open-ended Ranges using '*' and '**'
In a range, one asterisk '*' can be used to signify the lowest or highest VALID value.
For example: age(*-25,75-*)
This would include all VALID values less than or equal to 25 and all VALID values greater than or equal to 75. However, any missing-data values within those ranges would still be excluded.

In a range, two asterisks '**' can be used to signify the lowest or highest NUMERIC value, regardless of whether or not the codes are defined as missing data.
For example: age(50-**)
This would include ALL numeric values greater than or equal to 50, including data values like 98 or 99, even if they had been defined as missing-data codes. Note that '**' cannot be used alone (without '-') as a range specification. If you want to include all NUMERIC codes, you can use the range '(**-**)'.
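
The difference between '*' and '**' can be sketched in Python as follows. This is not SDA's parser: the function name 'code_selected' is hypothetical, only non-negative whole-number codes are handled, and a fully explicit range such as '30-50' is assumed here to exclude missing-data codes (this section of the help does not say how that case is treated).

```python
import re
from typing import Set

_ITEM = re.compile(r"^(\*\*|\*|\d+)(?:-(\*\*|\*|\d+))?$")


def code_selected(code: int, spec: str, missing: Set[int]) -> bool:
    """Return True if `code` is selected by a spec such as '*-25,75-*'."""
    for item in spec.replace(" ", "").split(","):
        match = _ITEM.match(item)
        if match is None:
            raise ValueError(f"cannot parse {item!r}")
        low, high = match.group(1), match.group(2)
        if high is None:
            if low == "**":
                raise ValueError("'**' can only be used as part of a range")
            if low == "*":  # '*' alone: every valid code
                if code not in missing:
                    return True
                continue
            high = low  # a single code is treated as the range low-low
        # Only a '**' bound admits codes defined as missing data.
        # ASSUMPTION: explicit bounds behave like '*' in this respect.
        include_missing = "**" in (low, high)
        lower = float("-inf") if low.startswith("*") else int(low)
        upper = float("inf") if high.startswith("*") else int(high)
        if lower <= code <= upper and (include_missing or code not in missing):
            return True
    return False


missing_codes = {98, 99}  # hypothetical missing-data codes for 'age'
print(code_selected(99, "75-*", missing_codes))   # '*' stops at valid codes
print(code_selected(99, "50-**", missing_codes))  # '**' covers all numeric codes
```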


Temporarily Transforming a Variable

A numeric variable can be transformed temporarily, for purposes of running the current analysis. There are four types of temporary transformations:

Temporarily Recode a Variable

Temporary recodes are created by specifying groups of codes that are to be combined into a single category. This type of transformation can be very simple, but certain options can make it a little more complex. These are the possibilities:

Basic recoding
For example, to combine the categories of 'age' into three groups, you can specify the variable as:
age(r: 18-30; 31-50; 51-95)
Notice that the name of the variable ('age') is followed by parentheses, then the instruction 'r' (or 'R') followed by a colon (':'), and then the groupings of codes. Those groupings can consist of single code values, ranges, or a combination of many values and/or ranges. Each group is separated from the other by a semicolon (';'). Spaces are optional, but are added here for readability.

Using this basic method of recoding, the new groupings of codes are given the default code values 1, 2, 3, and so forth. The default label for each group is the range of original codes that constitute that group ("18-30", for example).

Any categories of 'age' not included in the specified groupings will become missing-data on the recoded version, and they will be excluded from the analysis in the table.

On the other hand, any original missing-data categories of 'age' that are explicitly mentioned in the recode will be included. For instance, if the value '90' for 'age' were flagged as a missing-data code, but included as in the example above, it would become part of the third recoded category. This is discussed in more detail in the section on "Treatment of missing data."
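
The basic recode rules above (default codes 1, 2, 3; default labels taken from the original ranges; unlisted values becoming missing-data) can be sketched in Python. The function name 'recode_basic' is hypothetical, the sketch handles only non-negative whole-number codes, and it is an illustration rather than SDA's implementation.

```python
from typing import List, Optional, Tuple


def recode_basic(value: int, groups: List[str]) -> Optional[Tuple[int, str]]:
    """Map `value` to (new_code, label) using default codes 1, 2, 3, ...

    Each group is a comma-separated list of single codes or 'low-high'
    ranges. None means the value is in no group, which corresponds to
    missing-data on the recoded variable.
    """
    for new_code, group in enumerate(groups, start=1):
        for part in group.replace(" ", "").split(","):
            low, _, high = part.partition("-")
            if int(low) <= value <= int(high or low):
                return new_code, group  # default label is the original range
    return None


groups = ["18-30", "31-50", "51-95"]  # as in age(r: 18-30; 31-50; 51-95)
print(recode_basic(25, groups))
print(recode_basic(90, groups))  # explicitly covered, so it is kept
print(recode_basic(17, groups))  # not covered: missing-data after the recode
```

Because the sketch never consults a list of missing-data codes, any code that a group mentions explicitly is kept, which mirrors the rule described above.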

Assigning particular new code values
It is possible to assign new code values that are different from the default 1, 2, 3, and so forth. To do this, give the new code value, then an equal sign, then the grouping. (The new code value must be a whole number, and decimal places will be ignored. If you want the new code value to include decimal places, use the regular SDA RECODE program.)

For example, the variable 'age' can be recoded into the same three groups as above, but with the new code values 1, 5, and 10, by specifying the recode as follows:
age(r: 1 = 18-30; 5 = 31-50; 10 = 51-95)

For column, row, or control variables it will not usually matter what the new code values are. For variables on which statistics are computed, however, the new code values will affect the value of those statistics.

Assigning labels to the new code values
To assign your own label to a new grouping of code values, place the label in double quotes after the group codes, but before the semicolon. There is no set limit on the length of these labels; however, very long labels may distort the formatting of the tables.

For example, you can assign labels to the recoded categories of race by using the following specification:
race(r: 800-869 "White"; 870-934 "Black"; 600-652, 979-982 "Asian")

These labels will appear in the table, in place of the range of original codes that constitute that group. The recode specification is still documented, however: a summary is always given at the bottom of the table.

Open ranges (with '*' or '**')
If you are not sure of the ranges of the variable to be recoded, you can specify an open range with an asterisk ('*'). A single asterisk matches the lowest or highest VALID code in the data for that variable.

For example, the 'age' recode could be specified as: age(r: *-30; 31-50; 51-*)
Using this method, all valid age values up to 30 would go into the first recoded group. And all valid age values of 51 or older would go into the third group.

If you want to use a range that includes NUMERIC codes that were defined as missing-data values, you can specify the range with two asterisks ('**') instead of one.

For example, the 'age' recode could be specified as: age(r: *-30; 31-50; 51-**)
Using this method, all valid age values up to 30 would go into the first recoded group. But every numeric value of 51 or greater would go into the third group, including codes like 99 that may have been defined as missing-data codes.

For more discussion about including codes that have been defined as missing-data codes, see the section on "Treatment of missing data."

Overlapping ranges
If the same original code value is mentioned in two or more groupings, it is recoded the FIRST time that the value is encountered.
For example, the following two specifications have the same effect:
age(r: 18-30; 30-50; 50-90), and
age(r: 18-30; 31-50; 51-90)

In both cases, the original 'age' value of 30 ends up in the first group, and the original 'age' value of 50 ends up in the second group.

Notice that order is important with overlapping ranges. The following specification will NOT have the same effect as the preceding two:
age(r: 3= 50-90; 2= 30-50; 1= 18-30)
In this example, the 'age' value of 50 will end up in the recode group with the value '3' (instead of in the second group), and the 'age' value of 30 will end up in the recode group with the value '2' (instead of in the first group).
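
The first-match rule for overlapping ranges can be illustrated with a short Python sketch. The function names are hypothetical, labels in double quotes and open ranges are not handled, only non-negative whole-number codes are supported, and giving an unnumbered group its position as the default code is an assumption of the sketch.

```python
from typing import List, Optional, Tuple

Rule = Tuple[int, int, int]  # (new_code, low, high)


def parse_recode(spec: str) -> List[Rule]:
    """Turn '3= 50-90; 2= 30-50' into an ordered list of rules."""
    rules: List[Rule] = []
    for position, group in enumerate(spec.split(";"), start=1):
        group = group.replace(" ", "")
        if not group:
            continue
        if "=" in group:
            code_text, ranges = group.split("=", 1)
            new_code = int(code_text)
        else:
            new_code, ranges = position, group  # default codes 1, 2, 3, ...
        for part in ranges.split(","):
            low, _, high = part.partition("-")
            rules.append((new_code, int(low), int(high or low)))
    return rules


def recode(value: int, rules: List[Rule]) -> Optional[int]:
    """Return the new code from the FIRST rule that covers `value`."""
    for new_code, low, high in rules:
        if low <= value <= high:
            return new_code
    return None  # not mentioned anywhere: missing-data after the recode


overlapping = parse_recode("18-30; 30-50; 50-90")
reordered = parse_recode("3= 50-90; 2= 30-50; 1= 18-30")
print(recode(30, overlapping), recode(50, overlapping))
print(recode(30, reordered), recode(50, reordered))
```

Because the rules are scanned in the order written, reversing the groups changes where the shared boundary values 30 and 50 end up.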

Multiple specifications for one recoded group
It may sometimes be useful to have more than one specification for a new recoded group. This can be done by specifying the desired outcome code more than once.
For example, to have race recoded into two categories, with the first category including everyone EXCEPT those originally coded as '2', you could use the following specification:
race(r: 1=1 "Non-black"; 2=2 "Black"; 1=3-20)

Treatment of missing data
NUMERIC codes that have been defined as missing data on the original variable can be included in one of the categories of the recoded variable in two ways.

The first method is to mention the code explicitly, either as a single value or as part of a range. For example, if the 'age' value of 99 has been defined as a missing-data code, it can still be included by either of the following specifications:
age(r: 18-30; 31-50; 51-90; 99), or
age(r: 18-30; 31-50; 51-100)

In the first case the code 99 will become its own fourth recode category. In the second case, it will be included as part of the third category.

A second method to include NUMERIC missing data codes is to use an open range with two asterisks ('**') instead of one. For example, the following specification will include all numeric codes above 50 as part of the third recoded group:
age(r: 18-30; 31-50; 51-**)

Note that at present there is no way to include in a temporary recode the system-missing value or a character missing-data value (like 'D' or 'R'). You must use the regular recode program to handle those special missing-data codes. (Your data archive may or may not have enabled that program to run on your current dataset.)


Temporarily Collapse a Variable into Fewer Categories

A simple way to recode a variable into fewer categories is to "collapse" the variable, using a fixed interval.

Collapse syntax
For example, to collapse the variable 'age' into 10-year categories, you can specify the variable as:
age(c: 10, 1)
Notice that the name of the variable ('age') is followed by parentheses, then the instruction 'c' (or 'C') followed by a colon (':'), and then the interval, a comma, and the starting point. Spaces are optional, but are added here for readability.

Using this simple method of collapsing, the new groupings of codes are given the code values 1, 2, 3, and so forth. The label for each group is the range of original codes that constitute that group ("21-30", for example).

Effect of the starting point
The specified starting point affects the range. If the starting point is '1', the age ranges will be: 1-10, 11-20, 21-30, etc. On the other hand, if the starting point is '0', the age ranges will be: 0-9, 10-19, 20-29, etc.

If the starting point is HIGHER than the lowest actual value in the data, the values lower than the starting point become missing-data. For example, with a starting point of '21', any lower values of 'age' (like 18, 19, and 20) would not be included in a range and would become missing-data.

If the starting point is LOWER than the actual minimum value in the data, the ending point of each range is not affected. However, the first range includes only the valid values in that range, if any. For example, if the starting point for collapsing 'age' is '1', with an interval of '10', but the lowest valid value in the data is '18', then the age ranges will be: 18-20, 21-30, 31-40, etc.

The highest range is affected by the highest valid value in the data. For example, if the highest valid value for 'age' is '97', and the starting point is '1' and the interval is '10', the highest intervals will be: 71-80, 81-90, 91-97.
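
The effect of the interval, the starting point, and the lowest and highest valid values can be sketched in Python. The function name 'collapse_ranges' is hypothetical and whole-number codes are assumed; the sketch only reproduces the ranges described above and does not show how SDA numbers the resulting categories.

```python
from typing import List, Tuple


def collapse_ranges(interval: int, start: int, valid_min: int, valid_max: int) -> List[Tuple[int, int]]:
    """Return the (low, high) ranges produced by a collapse like age(c: 10, 1)."""
    ranges: List[Tuple[int, int]] = []
    low = start
    while low <= valid_max:
        high = low + interval - 1
        if high >= valid_min:  # skip intervals that lie entirely below the data
            # Trim the first and last ranges to the valid values actually present.
            ranges.append((max(low, valid_min), min(high, valid_max)))
        low += interval
    return ranges


# Interval 10, starting point 1, valid ages running from 18 to 97.
for low, high in collapse_ranges(10, 1, 18, 97):
    print(f"{low}-{high}")
```

With a starting point of 21 and a lowest valid value of 18, the first range returned is 21-30, so the values 18 to 20 fall in no range and would become missing-data.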

Treatment of missing-data in a collapse
The intervals created by the collapse procedure will exclude missing-data codes that are either above or below the valid codes. Character missing-data codes (like 'D' or 'R') will also be excluded.

A numeric missing-data code that happened to fall in between valid codes, however, would be included in the range that covers that code. For example, if '0' were defined as missing-data, but both '-1' and '+1' were actual valid codes, '0' would be included in one of the ranges.


Optional variables


Control variables (for table-generating programs)

A separate table is produced for each category of a control variable. If charts are being generated, a separate chart is also produced for each category of the control variable.

For example, if the control variable is gender, there will be one table for men alone and then one table for women alone. A table will also be produced for the total of all valid categories of the control variable (e.g., men and women combined).
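
A minimal Python sketch of this behavior, using simple frequency counts in place of full tables: one count is kept per control category, plus one for all categories combined. The function name, the variable names ('happy', 'gender'), and the sample cases are hypothetical, and every case is assumed to have a valid code on the control variable.

```python
from collections import Counter
from typing import Dict, Hashable, List


def frequencies_by_control(cases: List[dict], row: str, control: str) -> Dict[Hashable, Counter]:
    """Count the row variable separately for each control category and in total."""
    tables: Dict[Hashable, Counter] = {"Total": Counter()}
    for case in cases:
        category = case[control]
        tables.setdefault(category, Counter())[case[row]] += 1
        tables["Total"][case[row]] += 1
    return tables


# Hypothetical cases: gender 1 = men, 2 = women.
cases = [
    {"gender": 1, "happy": 1},
    {"gender": 1, "happy": 2},
    {"gender": 2, "happy": 1},
]
for category, counts in frequencies_by_control(cases, "happy", "gender").items():
    print(category, dict(counts))
```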

Only one variable at a time can be used as a control variable. If more than one control variable is specified, a separate set of tables (and charts) will be generated for each control variable.


Selection filter variables

Selection filters are used in order to limit an analysis to a subset of the cases in the data file. This is done by specifying one or more variables as selection filters, and by indicating which codes of those variables to include.

Some filter variables may be set up ahead of time by the data archive. That type of filter variable is discussed below.

Note that it is also possible to limit the table to a subset of the cases by restricting the valid range of any of the other variables. But when the desired subset of cases is defined by a variable that is not one of the variables in the table or analysis, you must use filter variables.

Numeric variables as selection filters

Basic filter use
The name of each filter variable is followed, in parentheses, by a single value such as 'gender(2)' or a range of codes such as 'age(30-50)', to limit the analysis to cases having those codes.

Multiple ranges and codes may be specified.
For example: age(1-17, 25, 95-100)

Multiple filter variables
If you specify more than one filter variable, a case must satisfy ALL of the conditions in order to be included in the table.
For example: gender(1), age(30-50)
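
The "ALL conditions" rule amounts to a logical AND across filter variables, with an OR across the ranges listed for any one variable. A Python sketch, in which the function name, the data layout, and the sample cases are all hypothetical:

```python
from typing import Dict, List, Optional, Tuple

Ranges = List[Tuple[int, int]]  # inclusive (low, high) pairs


def passes_filters(case: Dict[str, Optional[int]], filters: Dict[str, Ranges]) -> bool:
    """Return True only if the case satisfies every filter variable."""
    for name, ranges in filters.items():
        value = case.get(name)
        if value is None:  # no usable value on this filter variable
            return False
        if not any(low <= value <= high for low, high in ranges):
            return False
    return True


# Equivalent of the filter specification: gender(1), age(30-50)
filters = {"gender": [(1, 1)], "age": [(30, 50)]}
print(passes_filters({"gender": 1, "age": 42}, filters))
print(passes_filters({"gender": 2, "age": 42}, filters))
print(passes_filters({"gender": 1, "age": 29}, filters))
```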

Open-ended Ranges using '*' and '**'
A single asterisk, '*', can be used to specify that all cases with VALID codes for a variable will pass the filter.
For example: age(*) includes all cases with valid data on the variable 'age'.

In a range, the '*' can be used to signify the lowest or highest VALID value. For example: age(*-25,75-*). This filter would include all VALID values less than or equal to 25 and all VALID values greater than or equal to 75. However, any missing-data values within those ranges would still be excluded.

In a range, two asterisks '**' can be used to signify the lowest or highest numeric value, regardless of whether or not the codes are defined as missing data. For example: age(50-**) would include ALL numeric values greater than or equal to 50, including data values like 98 or 99, even if they had been defined as missing-data codes. However, any character missing-data values would still be excluded. Note that '**' cannot be used alone in a filter variable. It can only be used as part of a range.

Character variables as selection filters

The syntax for specifying character variable filters is similar to the syntax for numeric variables but with a few differences. Like numeric variable filters, character variable filters specify the variable name followed by the filter value(s) in parentheses.
For example: city( Atlanta )

Multiple filter values can be specified, separated by spaces or commas:
city( Chicago,Atlanta Seattle)

Character variable filters are case-insensitive. For example, the following filters are functionally identical:
city( Atlanta )
city( ATLANTA )
city( AtLAnta )

If a filter value contains internal spaces or commas, it must be enclosed in matching quotation marks (either single or double):
city( "New York" )
state("Cal, Calif")

A filter value containing a single quote (apostrophe) can be specified by enclosing it in double quotes:
city( "Knot's Landing" )

Or, conversely, a filter value containing double quotes can be specified by enclosing it in single quotes:
name( 'William "Bill" Smith' )

Leading and trailing spaces, and multiple internal spaces, are NOT significant. The following filters are all functionally equivalent:
city( "New York    " )
city( "New    York" )
city( "   New York    " )

Note that ranges, which are legal for numeric variables, are not allowed for character variables. For example, the following syntax is NOT legal:
city( Atlanta-Seattle )
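
The matching rules for character filters (case-insensitive comparison, quoted values, and insignificant extra spaces) can be sketched in Python. This is an illustration under stated assumptions, not SDA's parser: the function names are hypothetical, and the input is the text found between the parentheses of the filter.

```python
import re
from typing import List

# A value is either "double quoted", 'single quoted', or a bare run of
# characters containing no whitespace or comma.
_VALUE = re.compile(r'"([^"]*)"|\'([^\']*)\'|([^\s,]+)')


def normalize(text: str) -> str:
    """Ignore case, leading/trailing spaces, and repeated internal spaces."""
    return " ".join(text.split()).lower()


def parse_values(inner: str) -> List[str]:
    """Split the text inside the parentheses into normalized filter values."""
    return [normalize(double or single or bare)
            for double, single, bare in _VALUE.findall(inner)]


def value_matches(data_value: str, inner: str) -> bool:
    """Return True if the data value equals any of the filter values."""
    return normalize(data_value) in parse_values(inner)


print(parse_values(" Chicago,Atlanta Seattle"))
print(value_matches("New York", ' "New    York" '))
print(value_matches("ATLANTA", " AtLAnta "))
```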

Pre-set selection filters

One or more filter variables may be pre-set by the archive so that they appear automatically on the option screen for the various analysis programs. The user can then select the desired filter-variable categories from a drop-down menu.

For example, the variable 'gender' might be set up as a pre-set filter variable. The user could then choose 'Males' or 'Females' (or 'Both genders') from the drop-down list.

Pre-set filter variables are only a convenience for the user. The same result can be obtained by using the regular selection filter option to specify the filter variable(s) and the desired code categories to include in the analysis.

One possible difference between the pre-set filters and the regular user-defined selection filter specifications concerns cases with missing-data on the filter variable. A user-defined filter specification of 'gender(*)' would include all cases with a valid code on the variable 'gender', excluding any cases with missing-data on that variable, if there are any. On the other hand, selecting the '(Both genders)' option (or whatever the '##none' specification is labeled) for a pre-set filter would generally include cases with missing-data on the filter variable. (The '##none' specification has the same effect as not using that variable as a filter at all.)

To avoid any doubt about which cases are included or excluded, remember that the analysis output always reports which filter variables have been used and which code values have been included in the analysis. This is true both for pre-set selection filters and for user-defined filters.


Weight variable

Depending on the design and implementation of the study, it may be appropriate to give some of the cases more weight than other cases in computing frequency distributions and statistics. To do this, specify a weight variable, that is, a variable that contains the relative weight for each case. The documentation for the study should explain the reasons for using a weight variable, if there is one, and what its name is.

SDA studies can be set up with a weight variable specified ahead of time so that the weight variable is used automatically. Other studies may be set up with a drop-down list of choices to be presented to the user, who then selects one of the available weight variables (or no weight variable, if that option is included in the list). If no weight variables have been pre-specified, the user is free to enter the name of an appropriate variable to be used as a weight.
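
To illustrate what relative weights do to a frequency distribution, the following Python sketch sums the weight of each case instead of counting each case once. The function name, the variable names ('vote', 'wt'), and the sample cases are hypothetical; the sketch does not reproduce how SDA computes weighted statistics or standard errors.

```python
from collections import defaultdict
from typing import Dict, Hashable, List


def weighted_percents(cases: List[dict], variable: str, weight: str) -> Dict[Hashable, float]:
    """Return the weighted percent distribution of `variable`."""
    totals: Dict[Hashable, float] = defaultdict(float)
    for case in cases:
        totals[case[variable]] += case[weight]  # add the case's weight, not 1
    grand_total = sum(totals.values())
    return {code: 100 * total / grand_total for code, total in totals.items()}


# Two hypothetical cases; the first counts three times as much as the second.
cases = [{"vote": 1, "wt": 3.0}, {"vote": 2, "wt": 1.0}]
print(weighted_percents(cases, "vote", "wt"))
```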


Question text

If you select this option, all of the descriptive text available for each variable included in the analysis will be appended to the bottom of the results.

The usual text available for a variable is the text of the question that produced the variable, provided that the text was included in the study documentation. Sometimes other explanatory text has been included.

If the variable was created by the 'recode' or the 'compute' program, the commands used to create the new variable are included in the descriptive text.


Title or label for this analysis

On the option screen for an analysis program, you can enter a title or a label for this analysis. If a title is specified, it will appear as the first line of the HTML output generated by the SDA program.


Actions to take

After you specify variables and select the options you want, go to the bottom section of the form, and select one of two actions:
Run the Table (or Run a specific type of analysis)
Select this when you have finished specifying the variables and options you want. The requested table (or other analysis) will then be generated by the server computer and displayed on your screen.

Clear Fields
Select this to delete all previously specified variables and options, so that you can start over.