This file contains the online help that is available from inside each SDA analysis program. In addition to the help specific to each program, this file includes information on features common to all analysis programs.
This program generates the univariate distribution of one variable or the crosstabulation of two variables. If a control variable is specified, a separate table will be produced for each category of the control variable.
It is important to understand that if a weight variable has been specified, the percentages and the statistics are always computed using the weighted number of cases. If you want to calculate percentages and statistics using only the unweighted N's, do not specify a weight variable.
Nevertheless, you can specify that the standard errors, confidence intervals, and chi-square probability values should be calculated as if the sample were a simple random sample (SRS). One reason to request SRS calculations might be to compare the size of the SRS standard errors or confidence intervals with the corresponding statistics based on the complex sample design.
The confidence interval is computed on the natural-log scale. The percentage is converted to its natural logarithm, and the standard error of the percentage is expressed on that log scale as well. The log-scale standard error is then multiplied by the value of Student's t appropriate to the level of confidence requested and to the number of degrees of freedom. The result is added to the log of the percentage to obtain the upper bound of the confidence interval, and it is subtracted from the log of the percentage to obtain the lower bound. The logs of the upper bound and of the lower bound are then converted back to percentages (by taking the antilogs) and displayed in the table cell.
This conversion back and forth to logarithms results in confidence intervals that are asymmetric -- they are a little wider in the direction of 50% than in the direction of 0% or 100%. This is the same procedure used by Stata to calculate confidence intervals of percentages. Notice that the calculation of confidence intervals for a proportion (or for any mean) by the Comparison of Means program does not use this log transformation. Therefore, the confidence intervals calculated by the Comparison of Means program will be a little different from the confidence intervals calculated by the Crosstabulation program for the same proportions. This is also the case for Stata.
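The log-scale interval can be sketched in code as follows. This is a hedged sketch, not SDA's exact routine: in particular, converting the percentage's standard error to the log scale by dividing by the percentage (the delta method) is an assumption about the computation.

```python
import math

def log_scale_ci(p, se, t):
    """Confidence interval for a proportion p (0 < p < 1) computed on the
    natural-log scale, as a sketch of the procedure described above.
    se is the standard error of p; t is the Student's t value.
    The delta-method conversion se/p is an assumption."""
    log_p = math.log(p)
    se_log = se / p                      # assumed: SE of ln(p) via delta method
    lower = math.exp(log_p - t * se_log)
    upper = math.exp(log_p + t * se_log)
    return lower, upper
```

Because the bounds are formed on the log scale and then exponentiated, the interval is asymmetric around p, as the help text notes.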
Simple random samples
If the sample is equivalent to a simple random sample of a population, the standard error of each percentage is computed using the familiar "pq/n" formula for the normal approximation to the standard error of a proportion. For each proportion p, the formula is:

sqrt(p * (1-p) / (n-1))

where n is the number of cases in the denominator of the percentage -- the total number of cases in that particular column, row, or total table, depending on the percentage being calculated. For this calculation, n is the actual number of cases, even if weights have been used to calculate the percentages.
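In code, the SRS standard error formula above can be written directly (the function name is ours):

```python
import math

def srs_se_percent(p, n):
    """SRS standard error of a proportion p based on n unweighted cases,
    using the sqrt(p * (1-p) / (n-1)) formula from the text."""
    return math.sqrt(p * (1.0 - p) / (n - 1))
```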
Complex samples
If the sample for a particular study is more complex than a simple random sample, the appropriate standard errors can still be computed provided that the stratum and/or cluster variables were specified when the dataset was set up in the SDA Web archive. Otherwise, the standard errors calculated by assuming simple random sampling are probably too small.
For complex samples the appropriate standard errors are computed using the Taylor series method. If you want additional technical information, see the document on standard error calculation methods.
Note that the calculations for standard errors in cluster samples require that the coefficient of variation of the sample size of the denominator for each percentage, CV(x), be under 0.20; otherwise, the computed standard errors (and the confidence intervals) are probably too small, and they are flagged in the table with an asterisk. CV(x) and other diagnostic information are available for standard error calculations done by the SDA Comparison of Means program. That program and the SDA Crosstabulation program use the same information and methods to calculate standard errors.
The design effect for each percent in a cell is used to calculate the effective number of cases (N / deft-squared) on which the percent is based, for purposes of precision-based suppression.
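The effective number of cases is a simple ratio (a minimal illustration of the N / deft-squared rule above; the function name is ours):

```python
def effective_n(n, deft):
    """Effective number of cases behind a percent: the actual N divided
    by the squared design effect (deft), per the text."""
    return n / (deft ** 2)
```

For example, 1,000 cases with a deft of 2.0 carry the precision of only 250 simple-random-sample cases.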
The design effects for all of the total percents in a table are used to calculate the Rao-Scott adjustment to the chi-square statistic, if bivariate statistics have been requested for a complex sample.
For complex samples, the df equal the number of primary sampling units (clusters, for cluster samples; individual cases in the denominator, for unclustered samples) minus the number of strata (unstratified samples have a single stratum). Note that the number of strata and clusters used for this calculation is usually the number in the overall sample, and not in the subclass represented by a cell in a table. For a fuller discussion of this issue, see the treatment of domains and subclasses in the document on standard error methods.
The value of Student's t used for computing confidence intervals depends on the desired level of confidence (95 percent, by default) and the df. The fewer the df, the larger the required value of Student's t and, consequently, the larger the width of the confidence intervals. As the df increase, the size of the required Student's t value decreases until it approaches the familiar value for the normal distribution (which is 1.96, for the 95 percent confidence level).
The Z-statistic shows whether the frequencies in a cell are greater or fewer than expected (in the same sense as used for the chi-square statistic). It also takes into account the total number of cases in the table. If there are only a few cases in the table, the deviations from the expected values are not as significant as if there are many cases in the table.
The Z-statistics are standardized residuals. The residual for each cell is calculated as the ratio of two quantities: the deviation of the observed cell frequency from the expected cell frequency (observed minus expected), and the square root of the expected cell frequency.
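A hedged sketch of such a standardized residual, assuming the usual Pearson form (observed minus expected, over the square root of the expected frequency):

```python
import math

def standardized_residual(observed, expected):
    """Standardized (Pearson) residual for one table cell. This form --
    (observed - expected) / sqrt(expected) -- is an assumption about
    SDA's exact formula."""
    return (observed - expected) / math.sqrt(expected)
```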
Note that if the frequencies in the table are weighted, the Z-statistic can be artificially inflated (or deflated). Consequently, if weights are used, each Z-statistic is divided by the average size of the weights. The average size of the weights is just the ratio of the total number of weighted cases in the table, divided by the actual number of unweighted cases in the table. For example, if the table is based on 1,000 actual cases, but the weighted number of cases is 100,000, the average size of the weights is 100,000/1,000 = 100. (The chi-square statistics are adjusted in the same way, to compensate for weights whose average is different from 1.) Note also that the Z-statistic does not take into account the complex sample design, if the table is based on such a sample.
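The weight adjustment described above amounts to dividing each Z-statistic by the average weight (function name ours):

```python
def adjust_z_for_weights(z, weighted_n, unweighted_n):
    """Divide a Z-statistic by the average size of the weights, which is
    the ratio of the weighted N to the unweighted N in the table."""
    avg_weight = weighted_n / unweighted_n
    return z / avg_weight
```

Using the example in the text, with 1,000 actual cases and 100,000 weighted cases, the average weight is 100, so each Z-statistic is divided by 100.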
However, you can uncheck both boxes, and no N will be displayed. Or you can check both boxes, and both the unweighted and the weighted N of cases will be displayed (if a weight variable has been specified).
It is important to understand that if a weight variable has been specified, the percentages and the statistics are always computed using the weighted number of cases, regardless of which N is displayed in the table. If you want to calculate percentages and statistics using only the unweighted N's, do not specify a weight variable.
A nominal-level statistic does not take into account any ordering of the categories of the row and column variables. That is, you would get the same result even if the categories were put into another order.
SDA displays two versions of the chi-square statistic, which is the most commonly used nominal-level statistic. For simple random samples (SRS) a probability level (p-value) is also calculated for each chi-square statistic.
For complex samples a Rao-Scott adjustment to each chi-square is calculated. An F statistic is derived from the adjusted Rao-Scott statistics and is added to the statistics package. The p-values corresponding to those F statistics are displayed (instead of the p-values for the regular chi-square statistics, which do not take the sample design into account).
If the p-value is low (about .05 or less), the chances that the observed relationship is only due to sampling error are correspondingly low, and in that case the relationship is said to be statistically significant. On the other hand, if the p-value is high, the chances are correspondingly high that the row and the column variables are not related to one another in the whole population from which the sample was drawn but appear related only in the particular sample that happened to be selected.
Note that if the frequencies in the table are weighted, the chi-square statistic can be artificially inflated (or deflated). Consequently, if weights are used, the chi-square is adjusted by the factor: (Total unweighted N) / (Total weighted N).
The Rao-Scott adjustment to the chi-square statistic takes the complex sample design into account. The probability associated with the Rao-Scott statistic is a more accurate indicator of the statistical significance of the relationship between the row and the column variables than the probability corresponding to a regular chi-square statistic.
SDA displays the F statistic derived from each Rao-Scott statistic and the associated p-value of the F. This is done both for the Pearson chi-square, displayed after 'Rao-Scott-P:F(dfn, dfd)'; and for the Likelihood-ratio chi-square, displayed after 'Rao-Scott-LR:F(dfn, dfd)'. These are F-tests, where dfn is the number of numerator degrees of freedom and dfd is the number of denominator degrees of freedom.
In generating these test statistics, SDA uses the first-order Rao-Scott approximation. The first step is to generate design effects for the estimated proportion of cases in each cell of the table and then to calculate a generalized design effect based on the cell design effects. The two chi-square statistics are divided by the generalized design effect, to obtain design-adjusted chi-square statistics. Then each design-adjusted chi-square statistic is divided by its numerator degrees of freedom to obtain F-statistics, which are then tested. The Rao-Scott adjustments to chi-square are explained in the following journal article: J.N.K. Rao and A.J. Scott, "On Chi-squared Tests for Multiway Contingency Tables with Cell Proportions Estimated from Survey Data," The Annals of Statistics, Vol. 12 (1984), No. 1, pp.46-60.
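The steps above can be sketched as follows. This is a hedged illustration only: in particular, pooling the cell design effects into a generalized design effect by taking their simple mean is an assumption; the exact pooling used by SDA is given in Rao and Scott (1984).

```python
def rao_scott_f(chisq, cell_deffs, dfn):
    """First-order Rao-Scott-style correction (sketch). Divide the
    chi-square by a generalized design effect, then by the numerator
    degrees of freedom, yielding an F statistic to be tested against
    F(dfn, dfd). The simple mean of the cell design effects used here
    is an assumption about the pooling step."""
    d_bar = sum(cell_deffs) / len(cell_deffs)  # generalized design effect (assumed: mean)
    adjusted_chisq = chisq / d_bar             # design-adjusted chi-square
    return adjusted_chisq / dfn                # F statistic
```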
Note that this use of the first-order Rao-Scott approximation is the same as in SAS. Stata uses a second-order approximation, which is a little different but should give the same substantive results.
Four ordinal statistics are given: Gamma, Tau (two versions), and Somers' d (assuming the row variable to be the dependent variable).
The ordinal statistics can be calculated either for numeric variables or for character variables (with the categories sorted into alphabetic order).
These ordinal statistics are purely descriptive. No attempt is made to test them for sampling error.
Interval-level statistics are meaningful for numeric variables only if the code values are ordered in a way that approximates an interval scale. This refers to variables coded like 1=Agree strongly; 2=Agree somewhat; 3=Disagree somewhat; 4=Disagree strongly. To report interval-level statistics for such variables, you must assume that the "distance" between codes 1 and 2 is the same as the distance between 2 and 3, and between 3 and 4.
Two interval-level statistics are given: R (the Pearson correlation coefficient), and Eta (the correlation ratio assuming the row variable to be the dependent variable).
If the row variable is a character variable, Eta cannot be calculated. If either the row variable or the column variable is a character variable, the correlation coefficient cannot be calculated.
These interval statistics are purely descriptive. No attempt is made to test them for sampling error. Use the regression program for tests of significance and confidence intervals for correlation statistics. The regression program can also handle complex sample designs.
Note that the univariate statistics cannot be calculated for character variables. If a character variable is used as a row variable, the request for univariate statistics is ignored. Even for numeric variables, be aware that the univariate statistics will not be meaningful unless the code values of the row variable are ordered in a way that approximates interval-level data.
These univariate statistics are purely descriptive. No attempt is made to test them for sampling error. To get standard errors and confidence intervals for the mean of a variable, you can use the Comparison of Means program.
The transition from a lighter shade of red or blue to a darker shade depends on the magnitude of the Z-statistic. The lightest shade corresponds to Z-statistics between 0 and 1. The medium shade corresponds to Z-statistics between 1 and 2. The darkest shade corresponds to Z-statistics greater than 2.
The color coding can be turned off, if you prefer. Color coding may not be helpful if you are using a black-and-white monitor or if you intend to print out the table on a black-and-white printer.
If bivariate statistics are requested, nominal and ordinal statistics will be produced as usual, with the missing data codes sorted into order with the valid codes.
Interval-level statistics will also be computed if the included missing-data codes allow it. The Eta statistic will be calculated if the included missing data codes on the ROW variable are all numeric. The Pearson correlation coefficient can be calculated only if the included missing data codes are all numeric on BOTH the row and column variables.
If univariate statistics are requested, the row variable can only have numeric missing-data codes. Otherwise, no statistics can be generated, and the request is ignored.
If you select column percentaging, the chart will include a separate set of bars (or a separate pie) describing the row variable, for each category of the column variable. For a line chart, there will be a separate line for each category of the row variable, plotted against the values of the column variable. The column variable is treated as the "break variable" in this layout.
If you select row percentaging, the chart will include a separate set of bars (or a separate pie) describing the column variable, for each category of the row variable. For a line chart, there will be a separate line for each category of the column variable, plotted against the values of the row variable. The row variable is treated as the "break variable" in this layout.
If you select total percentaging, a combination of row and column percentaging, or no percentaging at all, the effect is the same as selecting column percentaging only.
If there is only a row variable specified for the table, the chart will include one set of bars (or one pie, or one line) to show the distribution of that row variable.
Note that these percents may not always appear or may not be legible in all situations.
On stacked bar charts the percents may not have sufficient room to appear inside the area allocated to small categories.
On pie charts and line charts the percents for some slices or for some points on the lines may be almost overlaid and become illegible, if there are many categories or if the lines are very close together.
If you still want to show the percents in those situations, it will usually help if you increase the size of the charts. For stacked bar charts it can also help to change from a vertical to a horizontal orientation.
Pie charts in particular may require an increase in the dimensions of the chart if the number of category slices is large. Otherwise, the labels for each slice of the pies might overlay one another.
Stacked bar charts with only two or three break categories may look better if the chart is made narrower. But if there is a large number of break categories (like years of age), the best solution is often to combine a horizontal chart orientation with an increase in the height of the chart.
Side-by-side bar charts are best limited to tables with a relatively small number of categories in both the row and the column variables. If there are many categories in either or both of the variables, the proliferation of bars can be confusing, even if the chart dimensions are increased. In such cases it is probably better to use stacked bar charts instead of side-by-side bar charts.
Line charts may need to be enlarged if the lines are close to being overlaid. If percents are being shown, they also can become overlaid. In such cases it may help to increase the height of the chart.
This program calculates the mean of the dependent variable separately within categories of the row variable and, optionally, the column variable. If a control variable is specified, a separate table will be produced for each category of the control variable.
Sometimes, however, it is more helpful to express each cell mean in another way:
The rightmost column of the table usually shows the Row Totals. In some setups, however, the average of the differences is shown. This is the weighted average of the differences in that row. The weight is the number of cases (weighted number of cases if a weight was used) in that cell plus the number of cases in the corresponding base category.
The bottom row of the table usually shows the Column Totals. In some setups, however, the average of the differences is shown. This is the weighted average of the differences in that column. The weight is the number of cases (weighted number of cases if a weight was used) in that cell plus the number of cases in the corresponding base category.
The totals are usually of interest only when a weight is being used to expand the cell counts up to their estimated values in the population. For example, one may be interested in the total estimated NUMBER of persons in each cell who have some characteristic (e.g., who smoke, or drive cars), instead of the PROPORTION of persons who have that characteristic. This assumes that the dependent variable is coded '1' for a case which has the characteristic (smokes, for example) and '0' for a case which does not have the characteristic.
Enter the code value for the row or column category that you want to consider the base category.
The proportion in each cell of the table can be transformed into another statistic that has a more stable distribution. These options are provided for didactic purposes, so that students and researchers can readily compare the logit and probit transformations with the original proportions in a table. The following options are available:
Proportions greater than .5 have a positive logit. Proportions less than .5 have a negative logit. A proportion of .5 has a logit value of 0.
The logit has a constant standard deviation of 1.81 (pi / sqrt(3), to be exact).
The standard error of each logit is 1.81 / sqrt(n) where n is the number of cases in that cell. (This assumes simple random sampling. For complex samples, it is necessary to use the Logit/Probit Regression program to calculate standard errors.)
Proportions greater than .5 have a positive probit. Proportions less than .5 have a negative probit. A proportion of .5 has a probit value of 0.
The probit has a constant standard deviation of 1.0.
The standard error of each probit is 1.0 / sqrt(n) where n is the number of cases in that cell. (This assumes simple random sampling. For complex samples, it is necessary to use the Logit/Probit Regression program with the probit regression option to calculate standard errors.)
This option converts a proportion into a logit and then rescales it by making the standard deviation equal to 1.0 (like a probit) instead of 1.81 (the usual standard deviation of a logit).
The standard error of each logit scaled as a probit is 1.0 / sqrt(n) where n is the number of cases in that cell. (This assumes simple random sampling. For complex samples, it is necessary to use the Logit/Probit Regression program to calculate standard errors.)
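The three transformations and their SRS standard errors can be sketched as follows (function names are ours; using the inverse standard-normal CDF for the probit is an assumption about SDA's exact implementation):

```python
import math
from statistics import NormalDist

def logit(p):
    """Log-odds of p: 0 at p = .5, standard deviation pi/sqrt(3) ~= 1.81."""
    return math.log(p / (1.0 - p))

def probit(p):
    """Inverse standard-normal transform: 0 at p = .5, standard deviation 1.0."""
    return NormalDist().inv_cdf(p)

def logit_scaled_as_probit(p):
    """Logit rescaled to standard deviation 1.0 by dividing by pi/sqrt(3)."""
    return logit(p) / (math.pi / math.sqrt(3))

def srs_se(sd, n):
    """SRS standard error of any of the transforms: sd / sqrt(n),
    where sd is 1.81 for the logit and 1.0 for the probit."""
    return sd / math.sqrt(n)
```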
The drop-down menu allows you to select either the median or a percentile. (The median is the same as the 50th percentile.) If you select 'Percentile', another drop-down list will appear, from which you can pick any percentile between 1 and 99 (the default is the 90th percentile).
The median or the specified percentile of the dependent variable will be displayed as the first statistic in each cell. If a weight variable is used, the medians or percentiles will be calculated using the weights. The medians or percentiles calculated by the MEANS program are purely descriptive. No attempt is made to test them for sampling error.
A chart generated by the MEANS program is based, by default, on the mean of the dependent variable (or whatever else has been selected as the "Main statistic to display"). If you have requested that medians or percentiles be displayed in each cell (in addition to the means), you can choose to base the chart on these medians or percentiles by checking this box.
If you have a very large number of cases, dependent variable categories, and table cells, there may only be enough memory to calculate the exact median or percentile for some of the cells of the table. By default, no median or percentile is output for the remaining cells. By checking this box, however, you can request that an estimated value be calculated for the median or percentile in those cells that otherwise would be left without any statistic at all.
An asterisk (*) next to a median or percentile indicates that it was estimated using an algorithm for what is called the "remedian". For further information on this method of estimating medians and percentiles, see Peter J. Rousseeuw and Gilbert W. Bassett, Jr., "The Remedian: A Robust Averaging Method for Large Data Sets." Journal of the American Statistical Association, March 1990, vol. 85, pp. 97-104. Note that SDA uses a base of 101 to calculate the remedian.
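The idea behind the remedian can be sketched as follows. This is a hedged, batch-oriented sketch only: the published algorithm works in a streaming fashion with fixed-size buffers, and SDA's implementation may differ in detail.

```python
import statistics

def remedian(values, base=101):
    """Remedian sketch: take medians of successive groups of `base`
    values, then repeat on those medians until fewer than `base`
    remain. Approximates the median using little memory."""
    vals = list(values)
    while len(vals) > base:
        vals = [statistics.median(vals[i:i + base])
                for i in range(0, len(vals), base)]
    return statistics.median(vals)
```

For small inputs the remedian equals the ordinary median; for large inputs it is only an approximation, which is why SDA flags such values with an asterisk.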
Note the following for the various main statistics:
If the "Average of the Differences" is shown in the last column or row, its standard error is calculated by computing the weighted average of the variances of the differences in that row or column, where the weight is the square of the (unweighted) N for each comparison. This weighted average is then divided by the square of the total N for the comparisons in that row or column, and the square root of the result is the SE for the "Average of the Differences." (This optional display in the last column or row can be set up by the data archive for certain didactic purposes.)
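One plausible reading of that computation, sketched in code (the reading of "weighted average ... divided by the square of the total N" as sum(n_i^2 * var_i) / (total N)^2 is an assumption, as are the function and parameter names):

```python
import math

def se_average_of_differences(ns, ses):
    """SE of the 'Average of the Differences' (sketch). ns are the
    unweighted Ns for each comparison; ses are the standard errors of
    the individual differences. Combines the variances weighted by the
    squared Ns, divides by the squared total N, and takes the root."""
    weighted_var_sum = sum(n * n * se * se for n, se in zip(ns, ses))
    total_n = sum(ns)
    return math.sqrt(weighted_var_sum / (total_n ** 2))
```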
If the sample for a particular study is more complex than a simple random sample, the appropriate standard errors can still be computed (except for transformed dependent variables) provided that the stratum and/or cluster variables were specified when the dataset was set up in the SDA data archive. Otherwise, the standard errors calculated by assuming simple random sampling are probably too small.
Standard errors are used to create confidence intervals for the mean in each cell. For cells with at least 30 cases, you can be 95% confident that the mean in the population (for each cell) is within the interval bounded by approximately two standard errors above and below the mean in the sample (ignoring the problem of potential bias in the sample).
Standard deviations are displayed when the main statistic to display is specified to be either means or totals. Standard deviations are not displayed when the main statistic is a difference.
However, you can uncheck both boxes, and no N will be displayed. Or you can check both boxes, and both the unweighted and the weighted N of cases will be displayed (if a weight variable has been specified).
It is important to understand that if a weight variable has been specified, the means and the statistics are always computed using the weighted number of cases, regardless of which N is displayed in the table. If you want to calculate means and statistics using only the unweighted N's, do not specify a weight variable.
Note that this standardized difference is purely descriptive. It does not assess the statistical significance of the observed difference between the cell mean and the overall mean.
The t-statistic can be used to calculate and display a p-value if requested.
If the p-value is low (about .05 or less), the chances that the observed difference is only due to sampling error are correspondingly low, and in that case the difference is said to be statistically significant. On the other hand, if the p-value is high, the chances are correspondingly high that the difference between the cell mean and the mean in the specified row or column does not reflect a difference in the whole population from which the sample was drawn but appears only in the particular sample that happened to be selected.
Note the following for the various main statistics:
You can also request SRS standard errors, which are calculated as if the sample were a simple random sample, for purposes of comparison.
If the "Average of the Differences" is shown in the last column or row, its standard error is calculated by computing the weighted average of the variances of the differences in that row or column, where the weight is the square of the (unweighted) N for each comparison. This weighted average is then divided by the square of the total N for the comparisons in that row or column, and the square root of the result is the SE for the "Average of the Differences." (This optional display in the last column or row can be set up by the data archive for certain didactic purposes.)
You can also request SRS standard errors for the differences, which are calculated as if the sample were a simple random sample. However, if you request BOTH complex and SRS standard errors for the differences, only the complex standard errors are computed and reported.
Standard errors are used to create confidence intervals for the mean in each cell, or for the difference between a cell mean and the mean in a specified row or column, or for the total in each cell (if a weight is being used to expand the cell counts to the estimated size of the population). The optional diagnostic table reports the degrees of freedom used to generate the appropriate t-statistic for creating the confidence intervals.
Note that the calculations for standard errors in cluster samples require that the coefficient of variation of the sample size in each cell, CV(x), be under 0.20; otherwise, the computed standard errors are probably too small, and they are flagged in the table with an asterisk. CV(x) for each cell is available in the optional diagnostic table.
DEFT is calculated for each subgroup of the data defined by the values of the row, column, control, and filter variables (if any). DEFT is only calculated for the means and totals. It is not calculated for the differences between means.
The effect of rho on the standard error depends on the size of the clusters. The larger the clusters, the larger the effect of rho. The formula for calculating rho is shown with the explanation of the stratified average cluster size. The average cluster size is represented as 'b' and is displayed for each cell in the optional diagnostic table.
If the design effect is less than 1.0, the rho statistic will be negative. This means that the differences between clusters within the same stratum are relatively small, compared to the variability between elements in the sample as a whole.
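The relation between rho, the design effect, and the average cluster size can be illustrated with the standard Kish formula deft^2 = 1 + rho * (b - 1), which is an assumption here about SDA's exact stratified computation:

```python
def rho_from_deft(deft, b):
    """Intraclass correlation implied by a design effect (deft) and an
    average cluster size b, via deft^2 = 1 + rho*(b - 1). A sketch;
    SDA's stratified computation may differ in detail."""
    return (deft ** 2 - 1.0) / (b - 1.0)
```

As the text notes, a design effect below 1.0 yields a negative rho.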
RHO is calculated for each subgroup of the data defined by the values of the row, column, control, and filter variables (if any). RHO is only calculated for the means and totals. It is not calculated for the differences between means.
Standard deviations are displayed when the main statistic to display is specified to be either means or totals. Standard deviations are not displayed when the main statistic is a difference.
However, you can uncheck both boxes, and no N will be displayed. Or you can check both boxes, and both the unweighted and the weighted N of cases will be displayed (if a weight variable has been specified).
It is important to understand that if a weight variable has been specified, the means and the statistics are always computed using the weighted number of cases, regardless of which N is displayed in the table. If you want to calculate means and statistics using only the unweighted N's, do not specify a weight variable.
Note that this standardized difference is purely descriptive. It does not assess the statistical significance of the observed difference between the cell mean and the overall mean.
The t-statistic can be used to calculate and display a p-value if requested.
If the p-value is low (about .05 or less), the chances that the observed difference is only due to sampling error are correspondingly low, and in that case the difference is said to be statistically significant. On the other hand, if the p-value is high, the chances are correspondingly high that the difference between the cell mean and the mean in the specified row or column does not reflect a difference in the whole population from which the sample was drawn but appears only in the particular sample that happened to be selected.
The confidence interval or range is computed by multiplying the standard error of the mean (or difference or total) by the value of Student's t appropriate to the level of confidence requested and to the number of degrees of freedom. The result is added to the mean (or difference or total) to obtain the upper bound of the confidence interval, and the result is subtracted from the mean (or difference or total) to obtain the lower bound. Note that if both complex and SRS standard errors are requested, only the complex standard errors are used to compute the confidence intervals.
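The interval computation described above is a direct plus-and-minus on the original scale (function name ours):

```python
def confidence_interval(estimate, se, t):
    """Confidence interval for a mean, difference, or total:
    estimate +/- t * SE, where t is the Student's t value for the
    requested confidence level and degrees of freedom."""
    margin = t * se
    return estimate - margin, estimate + margin
```

Unlike the log-scale intervals for percentages in the Crosstabulation program, these intervals are symmetric around the estimate.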
For a very large random sample (in a particular cell of a table), for instance, the appropriate value for Student's t for a 95 percent confidence interval is close to the familiar 1.96 value for the normal distribution.
These MCA statistics are purely descriptive. No attempt is made to test them for sampling error. You can run the SDA regression program to calculate standard errors and confidence intervals, even for complex samples.
The MCA procedure shows the average effect of each category, and it ignores any interactions between the variables. If interaction effects are statistically significant, MCA is generally not appropriate.
The Adjusted Means are the adjusted mean values of the dependent variable, taking into account the other categories of all the variables. Each adjusted mean is the sum of the overall mean and the adjusted effect for that category.
The value of Student's t used for computing confidence intervals depends on the desired level of confidence (95 percent, by default) and the df. The fewer the df, the larger the required value of Student's t and, consequently, the larger the width of the confidence interval. As the df increase, the size of the required Student's t value decreases until it approaches the familiar value for the normal distribution (which is 1.96, for the 95 percent confidence level).
The variation in the values of the weight variable is used to estimate the design effect due to weighting, assuming that it would have been optimal to use the same sampling fraction within all the strata. The deft due to weighting is based on formula 11.7.6 given in Kish, Survey Sampling, p. 430. That formula gives the design effect in terms of variances. The square root of that result gives the design effect in terms of standard errors (deft).
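A common statement of Kish's design effect due to weighting is n * sum(w^2) / (sum(w))^2, equivalently 1 plus the squared coefficient of variation of the weights; taking this as the form behind formula 11.7.6 is an assumption in the sketch below:

```python
import math

def deft_due_to_weighting(weights):
    """Design effect due to weighting, in standard-error terms (deft):
    square root of n * sum(w^2) / (sum(w))^2. Equal weights give 1.0;
    more variable weights give larger values."""
    n = len(weights)
    deff = n * sum(w * w for w in weights) / sum(weights) ** 2
    return math.sqrt(deff)
```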
Note that this estimation of the design effect due to weighting is based entirely on the variation in the weight variable, and it does not consider the specific dependent variable being analyzed. Not every use of weights will increase the standard error of the mean of a specific dependent variable. If the weights result from the use of different sampling fractions in different strata of the sample and if that stratification was effective in reducing sampling error for this particular dependent variable, the estimated deft due to weighting may be greater than the overall deft. If this occurs, it is an indication that the weighting did not increase sampling error for this dependent variable as much as was estimated from the variation in the weight variable (if at all).
Frequently, however, differential rates of sampling are used in different strata simply to achieve the oversampling of some group(s) relative to others. Weights are then used to compensate for the different probabilities of selection. In those cases the different strata are sampled at different rates in a way that departs from optimum allocation, and the sampling variance of the mean of the dependent variable is increased (see Kish, Survey Sampling, pp. 429-433).
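One common form of Kish's result -- the design effect due to weighting equals one plus the relative variance of the weights -- can be sketched as follows (in Python, for illustration only; this is not SDA's own code):

```python
import math

def deft_weighting(weights):
    """Design effect due to weighting alone, in standard-error terms:
    deff = n * sum(w^2) / (sum(w))^2, which equals 1 + CV^2 of the
    weights; deft is its square root (one form of Kish's formula 11.7.6)."""
    n = len(weights)
    deff = n * sum(w * w for w in weights) / sum(weights) ** 2
    return math.sqrt(deff)

# Equal weights imply no design effect from weighting (deft = 1):
assert abs(deft_weighting([2.0, 2.0, 2.0]) - 1.0) < 1e-12
```

Unequal weights always push this estimate of deft above 1, which is why it can overstate the loss when the weights come from an effective stratified design.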
The average cluster size (b) for each cell is a stratified estimate. The program calculates the average cluster size within each stratum. The individual stratum b's are then combined into an overall b for each cell. This overall b is reported in the optional table of diagnostic statistics.
If the value of CV(x) is greater than 0.20, the calculated standard error is probably too small. Such standard errors are flagged in the main table with an asterisk. The corresponding confidence intervals are also flagged with an asterisk.
The CV for each cell is a stratified estimate. The program calculates the coefficient of variation of the number of valid cases for the clusters within each stratum. The individual stratum CVs are then combined into an overall CV for each cell. This overall CV is reported in the optional table of diagnostic statistics available in the Comparison of Means program (but not in the Frequencies and Crosstabulation program). If the CV(x) is greater than 0.20, it is flagged with an asterisk.
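The within-stratum calculation can be sketched as follows (in Python, for illustration; the population-style variance and the rule SDA uses to combine stratum CVs into an overall CV are assumptions here, not documented formulas):

```python
import math

def cluster_cv(cluster_ns):
    """Coefficient of variation of the number of valid cases per cluster
    within one stratum: standard deviation divided by the mean."""
    k = len(cluster_ns)
    mean = sum(cluster_ns) / k
    var = sum((n - mean) ** 2 for n in cluster_ns) / k
    return math.sqrt(var) / mean

def flag(cv):
    """Flag a CV(x) greater than 0.20 with an asterisk."""
    return "*" if cv > 0.20 else ""

# Equal cluster sizes give CV = 0; very unequal sizes get flagged:
assert flag(cluster_cv([10, 10, 10])) == ""
assert flag(cluster_cv([5, 15])) == "*"
```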
If the sample is a simple random sample, the ANOVA can also be used to assess the statistical significance of the effects of the row variable (and the column variable, if there is one) on the dependent variable. An F statistic is calculated as the ratio of each mean square divided by the residual mean square, and the probability of the F statistic is evaluated. If the p-value (probability statistic) associated with a particular row or column effect is low (about .05 or less), the chances are correspondingly low that the observed effect on the dependent variable is only due to sampling error. In that case the effect is said to be statistically significant.
If the sample is a complex sample, like a cluster sample, the ANOVA is only of descriptive value. The F tests and their associated probability statistics are omitted because they would likely underestimate the size of the true p-value and therefore overstate the statistical significance of the observed row and/or column effects. Only the Eta squared statistic for each effect is displayed. You can use the SDA regression program to calculate the statistical significance of the independent variables in complex samples.
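The F calculation described above for the SRS case can be sketched as follows (in Python, for illustration only; the p-value of F is not computed here):

```python
def one_way_f(groups):
    """F statistic for a one-way ANOVA: the between-groups mean square
    divided by the residual (within-groups) mean square."""
    cases = [x for g in groups for x in g]
    n, k = len(cases), len(groups)
    grand = sum(cases) / n
    means = [sum(g) / len(g) for g in groups]
    ss_between = sum(len(g) * (m - grand) ** 2 for g, m in zip(groups, means))
    ss_within = sum((x - m) ** 2 for g, m in zip(groups, means) for x in g)
    return (ss_between / (k - 1)) / (ss_within / (n - k))

assert abs(one_way_f([[1, 2, 3], [4, 5, 6]]) - 13.5) < 1e-9
```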
The transition from a lighter shade of red or blue to a darker shade depends on the magnitude of the absolute value of the Z-statistic or t-statistic. The transition points vary, depending on which of those two statistics is calculated:
The color coding can be turned off, if you prefer. Color coding may not be helpful if you are using a black-and-white monitor or if you intend to print out the table on a black-and-white printer.
If only a row variable is specified (and no column variable), the bars or the line will show the value of the dependent variable (on the vertical axis) for each value of the row variable.
If both a row variable and a column variable are specified, there will be a separate set of bars, or a separate line, for each category of the column variable. For a bar chart, there will be sub-bars for each column category within the bar for each row category. For a line chart, there will be a separate line for each column category.
Note that the chosen statistic may not always appear or may not be legible in all situations. Especially on line charts, the statistics for some points on the lines may be almost overlaid and become illegible, if there are many categories or if the lines are very close together. If you still want to show the statistics in those situations, it will usually help if you increase the size of the charts.
Bar charts are best limited to tables with a relatively small number of categories in both the row and the column variables. If there are many categories in either or both of the variables, the proliferation of bars can be confusing, even if the chart dimensions are increased.
Line charts may need to be enlarged if the lines are close to being overlaid. If means are being shown, they also can become overlaid. In such cases it may help to increase the height of the chart.
This program calculates the correlations between all pairs of two or more specified variables.
Or you can select Clear fields to delete all previously specified variables and options, so that you can start over.
Enter the name of each variable in a text box. To go from one text box to another, use the tab key or your mouse. It is all right to skip a text box and leave it blank -- to use only text boxes 1, 5, and 9, for example.
It is possible to enter more than one variable name in a text box (the underlying text-entry area will scroll). This has consequences for other options which refer to variable numbers. For example, if you enter two variables in text box number 3, and then you request that the signs of the correlations be reversed for variable number 3, the signs of BOTH variables in text box number 3 will be reversed.
Each text box, consequently, defines a variable GROUP. Ordinarily it is clearer to put only one variable in each text box, but the possibility of defining groups of variables exists.
This procedure retains all of the information about each pairwise relationship. However, the multivariate relationships can be inconsistent, if many of the cases have different missing-data patterns on different variables.
If this default dichotomization is not appropriate for a particular analysis, you can recode the variable temporarily within the correlation program using the standard methods of recoding variables.
Consult any beginners' statistics book for more information on the meaning of these statistics.
The alpha coefficient is a function of the average correlation between the variables and of the number of variables. If some of the variables are scored in opposite directions, you should use the option to reverse the signs of some of the variables, so that a high score on all variables means the same thing.
The standard errors can be used to create confidence intervals for each correlation coefficient. For example, you can be 95% confident that the correlation coefficient in the population (for each pair of variables) is within the interval bounded by approximately two standard errors above and below the correlation coefficient calculated from the sample (as shown in the matrix). The actual multiple to use for creating confidence intervals is the t-statistic with (n-1) degrees of freedom.
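The interval described above can be sketched as follows (in Python, for illustration; the multiplier of 2.0 stands in for Student's t with (n-1) degrees of freedom at the 95 percent level, and the exact t value should be substituted for small samples):

```python
def corr_ci(r, se, t=2.0):
    """Approximate confidence interval for a correlation coefficient:
    r plus or minus t standard errors."""
    return (r - t * se, r + t * se)

low, high = corr_ci(0.40, 0.05)   # roughly (0.30, 0.50)
```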
The calculation of the standard error of the correlation coefficient in each cell is based by default on the UNWEIGHTED number of cases, even if a weight variable has been used for calculating the correlation coefficient. Ordinarily this procedure will generate a more appropriate statistical test than one based on the weighted N in each cell.
The standard error is computed differently, depending on which correlation coefficient you have selected.
The statistics available for each variable include its mean, standard deviation, standard error, valid N of cases, and (if there is a weight variable) valid weighted N of cases.
If missing-data cases have been excluded LISTWISE (the default), the univariate statistics for all variables will be based on the SAME cases -- those which have valid data on ALL of the variables.
If missing-data cases have been excluded PAIRWISE, the univariate statistics for each variable will be based on all the cases with valid data for that one variable.
The paired statistics for each variable include its mean, standard deviation, valid N of cases for the pair, and (if there is a weight variable) valid weighted N of cases for the pair.
These statistics are displayed as a series of matrices. Each statistic for a given variable is (potentially) somewhat different, depending on which other variable it is being paired with.
The P-squared statistic is a way to measure the proportionality of rows in a correlation matrix. For example, if all of the coefficients in one row are exactly double the size of the coefficients in another row, there is a constant proportionality, and the index will be 1.0.
Usually we want to limit this comparison to a subset of the matrix -- namely, to the part corresponding to the correlations of the criterion variables with the variables of interest. To do this, we specify the variable numbers (next to each text box on the option screen) corresponding to the variables for which we want the P-squared measure, and the variable numbers corresponding to the criterion variables.
For example, we could examine the degree to which the variables v1, v2, and v3 have proportional correlations to the criterion variables x1, x2, and x3. We would enter v1, v2, and v3 into the first 3 text boxes on the option screen; and x1, x2, and x3 into text boxes 4 through 6. To get the P-squared statistic for all the combinations of v1, v2, and v3, in respect to the criterion variables, we would then specify:
The P-squared statistics are presented in a symmetrical matrix. Each row and column corresponds to one of the variables that we specified as a "variable to measure."
For a discussion of how to use this statistic, see Thomas Piazza, "The Analysis of Attitude Items," American Journal of Sociology, vol. 86 (1980) pp. 584-603.
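One way to sketch the idea of proportionality (in Python, for illustration only) is as a squared cosine similarity between two rows of coefficients, which gives exactly 1.0 for proportional rows. This ASSUMED form is only an illustration of the concept; the actual SDA formula is the one discussed in Piazza (1980) and may differ:

```python
import math

def p_squared(row_a, row_b):
    """ASSUMED form: squared cosine similarity of two rows of correlation
    coefficients, so exactly proportional rows score 1.0."""
    dot = sum(a * b for a, b in zip(row_a, row_b))
    norm_a = math.sqrt(sum(a * a for a in row_a))
    norm_b = math.sqrt(sum(b * b for b in row_b))
    return (dot / (norm_a * norm_b)) ** 2

# A row that is exactly double another row is perfectly proportional:
assert abs(p_squared([0.1, 0.2, 0.3], [0.2, 0.4, 0.6]) - 1.0) < 1e-9
```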
For example, we may know that var1 is scaled in such a way that a HIGH score or value corresponds to a LOW score on var2 and var3, so we expect the correlations of var1 to be negative with var2 and var3. But if we are interested in the relationships of those variables to other variables, it will be easier to detect different patterns if we reverse all the signs corresponding to var1. That way, we can expect var1, var2, and var3 to have correlations of the same sign with other variables. Then if we do observe a difference in the signs, it will catch our attention.
The transition from a lighter shade of red or blue to a darker shade depends on the magnitude of the correlation coefficient in each cell. The lightest shade corresponds to coefficients between 0 and .15. The colors become darker as the absolute value of the correlations exceeds .15, then .30, then .45.
Color coding is also used for the P-squared matrix, if one has been requested. However, the dividing points for colors are double in magnitude. The lightest shade corresponds to P-squared coefficients between 0 and .30. The colors become darker as the absolute value of the P-squared coefficients exceeds .30, then .60, then .90.
The color coding can be turned off, if you prefer. Color coding may not be helpful if you are using a black-and-white monitor or if you intend to print out the matrix on a black-and-white printer.
This program calculates the correlation between two variables separately within categories of the row variable and, optionally, the column variable. If a control variable is specified, a separate table will be produced for each category of the control variable.
The log of the odds-ratio is an optional measure for dichotomous variables. The calculation of the odds ratio assumes that the two variables to be correlated have only two categories each. If these statistics are requested, CORRTAB treats Var 1 and Var 2 as dichotomies, regardless of the number of categories they may actually have. The minimum valid value of each variable is treated as the base category (coded 0), and all valid values greater than the minimum are combined into the other category (coded 1). If this default dichotomization is not appropriate for a particular variable, you can specify another temporary recode after the variable name is given.
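The default dichotomization and the odds-ratio calculation can be sketched as follows (in Python, for illustration only; this is not SDA's own code):

```python
import math

def dichotomize(values, valid):
    """Default recode: the minimum valid code becomes 0, all other valid
    codes become 1, and missing/out-of-range codes stay missing (None)."""
    base = min(v for v in values if v in valid)
    return [None if v not in valid else (0 if v == base else 1)
            for v in values]

def log_odds_ratio(a, b, c, d):
    """Log odds ratio for a 2x2 table with cell counts a, b, c, d, and
    its standard error by the standard formula sqrt(1/a+1/b+1/c+1/d)."""
    return math.log((a * d) / (b * c)), math.sqrt(1/a + 1/b + 1/c + 1/d)

# Codes 1, 2, 3 are valid; code 9 is missing-data:
assert dichotomize([1, 2, 3, 9], valid={1, 2, 3}) == [0, 1, 1, None]
```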
The standard error is computed differently, depending on which correlation coefficient you have selected. The standard error for the Pearson correlation is based on Fisher's Z, and it is calculated as the average distance of the upward and the downward confidence band for one standard error (based on the retransformation of Fisher's Z into Pearson's R). The standard error for the log of the odds ratio is calculated with standard formulas for that statistic.
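The Fisher's Z procedure for the Pearson correlation can be sketched as follows (in Python, for illustration; the standard error of Z is taken as 1/sqrt(n-3), the usual large-sample value):

```python
import math

def pearson_se(r, n):
    """SE of Pearson's r via Fisher's Z: transform r, step one standard
    error of Z up and down, transform back to the r scale, and average
    the upward and downward half-distances."""
    z = math.atanh(r)
    sz = 1.0 / math.sqrt(n - 3)
    upper, lower = math.tanh(z + sz), math.tanh(z - sz)
    return ((upper - r) + (r - lower)) / 2.0
```

Because the retransformation is nonlinear, the upward and downward distances differ (except at r = 0), which is why they are averaged.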
If the sample is more complex than a simple random sample, the standard errors calculated here are probably too small.
The transition from a lighter shade of red or blue to a darker shade depends on the magnitude of the t-statistic. The lightest shade corresponds to t-statistics between 0 and 1. The medium shade corresponds to t-statistics between 1 and 2. The darkest shade corresponds to t-statistics greater than 2.
The color coding can be turned off, if you prefer. Color coding may not be helpful if you are using a black-and-white monitor or if you intend to print out the table on a black-and-white printer.
The t-statistic shows whether the correlation in a cell is larger or smaller than the overall correlation. It also takes into account the total number of cases in each cell. If there are only a few cases in a cell, the deviations from the overall correlation are not as significant as if there are many cases in that cell.
The t-statistic is calculated as the ratio of two quantities: The numerator is the difference between the correlation in the cell and the overall correlation. The denominator is the standard error of the correlation in that cell.
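The ratio just described can be sketched as follows (in Python, for illustration only):

```python
def cell_t(r_cell, r_overall, se_cell):
    """t-statistic for one cell: the cell's deviation from the overall
    correlation, divided by the standard error of the cell's correlation."""
    return (r_cell - r_overall) / se_cell

# A small deviation with a large standard error (few cases) is not notable:
assert abs(cell_t(0.35, 0.30, 0.20)) < 1
# The same deviation with a small standard error (many cases) is:
assert abs(cell_t(0.35, 0.30, 0.02)) > 2
```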
Note that the t-statistic controls the color coding of cells in the table of correlations.
This program calculates the regression coefficients for one or more independent or predictor variables, using ordinary least squares.
Two versions of the regression coefficient are given for each variable:
For complex sample designs, the user has a choice to specify SRS or complex standard errors. If your analysis is exploratory, and if you are only interested in the magnitude of the coefficients, you might want to specify that the sample is SRS, since the calculation of complex standard errors can be time consuming and does not affect the coefficients themselves. However, the complex standard errors should be used for significance tests and for the presentation of results.
In addition to the coefficients for each independent variable, a few summary measures for the regression as a whole are given. These include the Multiple R (multiple correlation coefficient), the R-Squared (the square of the Multiple R, also called the Coefficient of Determination), the Adjusted R-Squared, and the Standard Error of the Estimate (also called the root mean square error).
The Adjusted R-Squared is a measure that compensates for the inflation of the regular R-Squared statistic due simply to the inclusion of additional independent variables. The Adjusted R-Squared will increase only if the additional independent variables increase the predictive power of the model more than would be expected by chance. It will always be less than or equal to the regular R-Squared.
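The usual formula for this adjustment can be sketched as follows (in Python, for illustration; n is the number of cases and k the number of independent variables):

```python
def adjusted_r_squared(r2, n, k):
    """Adjusted R-squared: 1 - (1 - R^2) * (n - 1) / (n - k - 1)."""
    return 1.0 - (1.0 - r2) * (n - 1) / (n - k - 1)

# Always at most the regular R-squared; the penalty grows with more
# predictors and shrinks with more cases:
assert adjusted_r_squared(0.50, 101, 4) <= 0.50
```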
Aside from simply specifying the name of a variable, it is possible to restrict the range of a variable or to recode the variable temporarily. Note in particular that you can create dummy variables and product terms.
Or you can select Clear Fields to delete all previously specified variables and options, so that you can start over.
Enter the name of each variable in a text box. To go from one text box to another, use the tab key or your mouse. It is all right to skip a text box and leave it blank -- to use only text boxes 1, 5, and 9, for example.
It is possible to enter more than one variable name in a text box (the underlying text-entry area will scroll). Ordinarily it is clearer to put only one variable in each text box, but it is possible to enter more variables than there are text boxes.
To create such a variable temporarily, for a single regression run, for example, use the following syntax:
varname(d:1-3)
This would create a variable in which cases coded 1 through 3 on the variable 'varname' receive a code of 1, and all other VALID cases receive a code of 0. If 'varname' has a code defined as missing-data or out of range, the dummy variable will have the system-missing data value.
The characters 'd:' (or 'D:') indicate that you want to create a temporary dummy variable. The codes that follow show which codes on the original variable should become the code of 1 on the new dummy variable. One or more single code values or ranges can be specified. Multiple codes or ranges are separated by a comma.
You can give the '1' category of the dummy variable a label by putting the label in double quotes or in square brackets:
occupation(d:1, 3-5, 9, 10 "Managerial occupations = 1")
If you do not give a label, SDA will take the label from the code of the input variable assigned to the '1' category on the new dummy variable, provided that only a single code is assigned to the '1' category.
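The behavior of the dummy-variable recode can be sketched as follows (in Python, for illustration only; SDA parses this syntax internally):

```python
def make_dummy(values, ones, valid):
    """Emulate varname(d:...): codes listed after 'd:' become 1, all
    other VALID codes become 0, and codes defined as missing-data or
    out of range become None (system-missing)."""
    return [None if v not in valid else (1 if v in ones else 0)
            for v in values]

# varname(d:1-3), where codes 1-5 are valid and 9 is missing-data:
assert make_dummy([1, 2, 3, 4, 9],
                  ones={1, 2, 3}, valid={1, 2, 3, 4, 5}) == [1, 1, 1, 0, None]
```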
For example, a variable such as 'party' (political party) could have categories like '1=Democrat', '2=Republican', '3=Independent', '4=Other'. To make 3 dummy variables, with 4 as the base category, use the syntax:
party(m:4)
The characters 'm:' (or 'M:') indicate that you want to create multiple temporary dummy variables. The code(s) that follow show which code(s) on the original variable should become the base category -- that is, which code or codes should NOT have a dummy variable created. The use of this syntax to create multiple dummy variables also has the effect of defining the set of dummy variables as a group, whose effects as a group are tested for significance.
One or more single code values or ranges can be specified as the base category. Multiple codes or ranges are separated by a comma, as in this example:
education(m:1-8,14,15)
If you want to create dummy variables for every category except the category with the highest valid numeric code, you can designate '*' as the base category. For example:
party(m:*)
For the example above, this has the same effect as designating '4' as the base category. However, it is convenient to be able to create multiple dummy variables without knowing ahead of time which category has the highest valid code.
Note that using this multiple dummy syntax is similar to creating individual dummy variables. However, dummy variables created individually are not automatically treated as a group, for purposes of testing the significance of the group as a whole.
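The multiple-dummy behavior can be sketched as follows (in Python, for illustration only; the grouping of the dummies for the joint significance test is not modeled here):

```python
def make_multiple_dummies(values, base, valid):
    """Emulate varname(m:base): create one 0/1 dummy for each valid code
    outside the base category; invalid codes become None."""
    return {c: [None if v not in valid else (1 if v == c else 0)
                for v in values]
            for c in sorted(valid - base)}

# party(m:4) with codes 1=Democrat, 2=Republican, 3=Independent, 4=Other:
dummies = make_multiple_dummies([1, 2, 4, 3], base={4}, valid={1, 2, 3, 4})
assert sorted(dummies) == [1, 2, 3]   # three dummies, none for code 4
assert dummies[1] == [1, 0, 0, 0]     # the 'Democrat' dummy
```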
To create such a variable temporarily, for a single regression run for instance, use an asterisk (*) between the component variable names. For example:
age*education
This would create a variable in which, for each case, the value of 'age' is multiplied by the value of 'education'. If either 'age' or 'education' has an invalid code for that case, the temporary product term will have the system missing-data value.
One or more dummy variables can also be part of a product term. For example, the following form is acceptable:
party(d:3)*sex
In this example, first a single dummy variable is created from the variable 'party', and then that dummy variable is multiplied by 'sex'. Note that this syntax does not work with multiple dummy variables created like 'party(m:*)'. It only works with single dummy variables.
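The missing-data behavior of product terms can be sketched as follows (in Python, for illustration only):

```python
def product_term(x, y):
    """Emulate a product term such as age*education: multiply the two
    values case by case; an invalid (None) code on either component
    makes the product system-missing."""
    return [None if a is None or b is None else a * b
            for a, b in zip(x, y)]

assert product_term([30, None, 40], [12, 16, None]) == [360, None, None]
```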
The probability estimate associated with each t-statistic is given in the last column. This is the probability of obtaining a regression coefficient (either B or Beta) that is this large or larger, if the true coefficient is equal to zero in the population from which the current sample was drawn.
If the probability value for a regression coefficient is low (about .05 or less), the chances are correspondingly low that the observed effect of that independent variable on the dependent variable is only due to sampling error. However, a low probability value does not indicate that the true value of the coefficient in the population is of any specific magnitude -- only that it is not equal to zero.
The t-statistic and associated probability value are also given for the constant term of the regression equation. This is a test that the regression equation in the population has no constant term (or intercept). This test is usually of less interest than the tests for the regression coefficients of the independent variables.
Since the R-squared always increases with the addition of more independent variables, regardless of their independent contribution, an 'Adjusted R-squared' is also shown. The Adjusted R-squared compensates for the addition of extra variables and will be less than the R-squared if some of the additional independent variables do not contribute independent predictive power.
If the p-value for the test is low (about .05 or less), the chances are correspondingly low that ALL of the observed effects of the independent variables on the dependent variable are only due to sampling error. Nevertheless, a low p-value does not indicate that any specific independent variable has an effect on the dependent variable. The separate t-test for each independent variable should be examined for that purpose. (However, if there is only one independent variable, the t-test for that variable will give the same p-value as the Global F-test.)
Note that the accuracy of the confidence intervals depends on specifying the correct sample design. If the sample is not a simple random sample (SRS), the size of the SRS standard errors and confidence intervals will probably be too small.
If this option is selected, the univariate statistics will automatically be selected as well, and the products will be displayed as additional columns in that table.
For some large datasets, SRS calculations might be set as the default method, because the calculation of complex standard errors is MUCH more computer intensive and time-consuming than the equivalent SRS calculations. In such cases, it would be appropriate to do some SRS runs for exploratory purposes and then to request complex standard errors for your final runs.
The standard errors for complex samples are computed using the jackknife repeated replication method. The method used, together with the names of the stratum and/or cluster variables, is reported when you run the program. If you want additional technical information, see the discussion of standard error calculation methods.
The transition from a lighter shade of red or blue to a darker shade depends on the magnitude of the t-statistic, which is the ratio of each regression coefficient (B) divided by its standard error. The lightest shade corresponds to t-statistics between 0 and 1. The medium shade corresponds to t-statistics between 1 and 2. The darkest shade corresponds to t-statistics greater than 2.
Correlation coefficients are also color coded, if a correlation matrix is requested. Correlation coefficients greater than zero become redder, the larger they are. Correlation coefficients less than zero become bluer, the more negative they are. The transition from a lighter shade of red or blue to a darker shade depends on the magnitude of the correlation coefficient in each cell of the matrix. The lightest shade corresponds to coefficients between 0 and .15. The colors become darker as the absolute value of the correlations exceeds .15, then .30, then .45.
The color coding can be turned off, if you prefer. Color coding may not be helpful if you intend to print out the regression results on a black-and-white printer.
This option does NOT suppress the output for the dependent variable and any filter or weight variables used in the analysis. Only the independent variables are dropped from the list.
The confidence intervals in the chart are based on the confidence level selected in "Output Options" (90, 95, or 99 percent level of confidence). If you request a chart, but the "Confidence intervals" checkbox in "Output Options" is not checked, then the default 95 percent confidence level will be used for the chart.
Note that the accuracy of the confidence intervals depends on specifying the correct sample design. If the sample is not a simple random sample (SRS), the size of the SRS standard errors and confidence intervals will probably be too small.
This program calculates the logit or probit regression coefficients for one or more independent or predictor variables.
Aside from simply specifying the name of a variable, it is possible to restrict the range of a variable or to recode the variable temporarily. Note in particular that you can create dummy variables and product terms.
Or you can select Clear Fields to delete all previously specified variables and options, so that you can start over.
The exponential (or antilog) of each logistic regression coefficient is also output. This transformed coefficient expresses the effect of a one unit change in that independent variable on the odds that a person will have a score of 1 versus a score of 0 on the dependent variable. Note that this exponential transformation converts the additive regression coefficients into multiplicative terms.
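As a small illustration (in Python, with an invented coefficient value), a logit coefficient of about 0.693 multiplies the odds of scoring 1 versus 0 by roughly 2 for each one-unit increase in that independent variable:

```python
import math

b = 0.693                      # hypothetical logit coefficient
odds_multiplier = math.exp(b)  # effect on the odds per one-unit change
assert 1.99 < odds_multiplier < 2.01
```

A coefficient of 0 corresponds to an odds multiplier of exactly 1 (no effect), and a negative coefficient to a multiplier below 1.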
When the dependent variable has only two categories, logistic and probit regression are more appropriate to use than ordinary least squares regression. Both logistic and probit regression will usually generate the same substantive results. The choice between them is generally a matter of custom within a specific field or discipline.
If the variable you want to use as a dependent variable is not already coded as a simple 0/1 variable, you can create a dummy variable, or you can recode the variable temporarily.
If the dependent variable is left as anything other than a simple 0/1 variable, the program will recode the dependent variable automatically. The lowest valid score will be recoded to the value '0', and all other scores will be recoded to the value '1'.
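This automatic recode can be sketched as follows (in Python, for illustration only):

```python
def auto_dichotomize(values, valid):
    """Automatic recode of the dependent variable: the lowest valid score
    becomes 0, every other valid score becomes 1, and invalid codes stay
    missing (None)."""
    low = min(v for v in values if v in valid)
    return [None if v not in valid else (0 if v == low else 1)
            for v in values]

# Codes 1, 2, 5 are valid; code 9 is missing-data:
assert auto_dichotomize([1, 2, 5, 9], valid={1, 2, 5}) == [0, 1, 1, None]
```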
Enter the name of each variable in a text box. To go from one text box to another, use the tab key or your mouse. It is all right to skip a text box and leave it blank -- to use only text boxes 1, 5, and 9, for example.
It is possible to enter more than one variable name in a text box (the underlying text-entry area will scroll). Ordinarily it is clearer to put only one variable in each text box, but it is possible to enter more variables than there are text boxes.
To create such a variable temporarily, for a single regression run, for example, use the following syntax:
varname(d:1-3)
This would create a variable in which cases coded 1 through 3 on the variable 'varname' receive a code of 1, and all other VALID cases receive a code of 0. If 'varname' has a code defined as missing-data or out of range, the dummy variable will have the system-missing data value.
The characters 'd:' (or 'D:') indicate that you want to create a temporary dummy variable. The codes that follow show which codes on the original variable should become the code of 1 on the new dummy variable. One or more single code values or ranges can be specified. Multiple codes or ranges are separated by a comma.
You can give the '1' category of the dummy variable a label by putting the label in double quotes or in square brackets:
occupation(d:1, 3-5, 9, 10 "Managerial occupations = 1")
If you do not give a label, SDA will take the label from the code of the input variable assigned to the '1' category on the new dummy variable, provided that only a single code is assigned to the '1' category.
For example, a variable such as 'party' (political party) could have categories like '1=Democrat', '2=Republican', '3=Independent', '4=Other'. To make 3 dummy variables, with 4 as the base category, use the syntax:
party(m:4)
The characters 'm:' (or 'M:') indicate that you want to create multiple temporary dummy variables. The code(s) that follow show which code(s) on the original variable should become the base category -- that is, which code or codes should NOT have a dummy variable created. The use of this syntax to create multiple dummy variables also has the effect of defining the set of dummy variables as a group, whose effects as a group are tested for significance.
One or more single code values or ranges can be specified as the base category. Multiple codes or ranges are separated by a comma, as in this example:
education(m:1-8,14,15)
If you want to create dummy variables for every category except the category with the highest valid numeric code, you can designate '*' as the base category. For example:
party(m:*)
For the example above, this has the same effect as designating '4' as the base category. However, it is convenient to be able to create multiple dummy variables without knowing ahead of time which category has the highest valid code.
Note that using this multiple dummy syntax is similar to creating individual dummy variables. However, dummy variables created individually are not automatically treated as a group, for purposes of testing the significance of the group as a whole.
To create such a variable temporarily, for a single regression run for instance, use an asterisk (*) between the component variable names. For example:
age*education
This would create a variable in which, for each case, the value of 'age' is multiplied by the value of 'education'. If either 'age' or 'education' has an invalid code for that case, the temporary product term will have the system missing-data value.
One or more dummy variables can also be part of a product term. For example, the following form is acceptable:
party(d:3)*sex
In this example, first a dummy variable is created from the variable 'party', and then that dummy variable is multiplied by 'sex'. Note that this syntax does not work with multiple dummy variables created like 'party(m:*)'. It only works with single dummy variables.
The probability estimate associated with each t-statistic is given in the last column. This is the probability of obtaining a regression coefficient (B) that is this large or larger, if the true coefficient is equal to zero in the population from which the current sample was drawn.
If the probability value for a regression coefficient is low (about .05 or less), the chances are correspondingly low that the observed effect of that independent variable on the dependent variable is only due to sampling error. However, a low probability value does not indicate that the true value of the coefficient in the population is of any specific magnitude -- only that it is not equal to zero.
The t-statistic and associated probability value are also given for the constant term of the regression equation. This is a test that the regression equation in the population has no constant term (or intercept). This test is usually of less interest than the tests for the regression coefficients of the independent variables.
Two statistics are output for EACH independent variable:
If this option is selected, the univariate statistics will automatically be selected as well, to show the mean and standard deviation of each variable.
A pseudo-R-squared statistic is also displayed. It is calculated as 1 - (LL1 / LL0), where:
This version of the pseudo-R-squared statistic is often referred to as "McFadden's R-squared" or the "likelihood ratio index." It varies between 0 and (somewhat close to) 1.
The pseudo-R-squared statistic is (roughly) analogous to the R-squared statistic in ordinary least squares regression, which expresses the proportion of variance in the dependent variable explained by the entire set of independent variables. This pseudo-R-squared statistic, however, will be smaller than the R-squared in an ordinary regression, and it is not comparable across datasets. It is best used to compare regressions with different sets of independent variables within the same dataset.
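The calculation given above can be sketched as follows (in Python, with invented log-likelihood values for illustration):

```python
def mcfadden_r2(ll_model, ll_null):
    """McFadden's pseudo-R-squared: 1 - (LL1 / LL0), where LL1 is the
    log-likelihood of the fitted model and LL0 that of the null model."""
    return 1.0 - ll_model / ll_null

# Log-likelihoods are negative; a model that improves on the null model
# has LL1 closer to zero, giving a pseudo-R-squared between 0 and 1:
assert abs(mcfadden_r2(-80.0, -100.0) - 0.2) < 1e-12
```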
If the p-value for the test is low (about .05 or less), the chances are correspondingly low that ALL of the observed effects of the independent variables on the dependent variable are only due to sampling error. Nevertheless, a low p-value does not indicate that any specific independent variable has an effect on the dependent variable. The separate t-test for each independent variable should be examined for that purpose. (However, if there is only one independent variable, the t-test for that variable will give the same p-value as the Global F-test.)
For logit coefficients, two confidence intervals are shown. The first is for the logit coefficient itself. The second confidence interval is for the exponential (antilog) of the logit coefficient. This second confidence interval is created by taking the exponential of each upper and lower bound of the confidence interval for the logit coefficient.
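The two intervals can be sketched as follows. This is a minimal illustration of the exponentiation step described above, with a hypothetical coefficient, standard error, and critical t value; it is not SDA's actual code.

```python
import math

def logit_ci(b, se, t_crit):
    """Confidence interval for a logit coefficient, plus the interval
    for its exponential, obtained by exponentiating each bound of the
    coefficient's own interval."""
    lo, hi = b - t_crit * se, b + t_crit * se
    return (lo, hi), (math.exp(lo), math.exp(hi))

# Hypothetical values: B = 0.5, SE = 0.2, critical t = 1.96
(b_lo, b_hi), (exp_lo, exp_hi) = logit_ci(0.5, 0.2, 1.96)
print((round(b_lo, 3), round(b_hi, 3)))  # (0.108, 0.892)
```

Note that because the exponential function is monotonic, exponentiating the bounds preserves their order, but the resulting interval is asymmetric around the exponentiated coefficient.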
If this option is selected, the univariate statistics will automatically be selected as well, and the products will be displayed as additional columns in that table.
The transition from a lighter shade of red or blue to a darker shade depends on the magnitude of the t-statistic, which is the ratio of each regression coefficient (B) divided by its standard error. The lightest shade corresponds to t-statistics between 0 and 1. The medium shade corresponds to t-statistics between 1 and 2. The darkest shade corresponds to t-statistics greater than 2.
The color coding can be turned off, if you prefer. Color coding may not be helpful if you intend to print out the regression results on a black-and-white printer.
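The shading rule described above can be summarized in a short sketch. The function name is hypothetical; the thresholds (1 and 2) are those given in the text.

```python
def shade_for_t(b, se):
    """Map a coefficient's t-statistic (B divided by its standard
    error) to one of the three shades described above."""
    t = abs(b / se)
    if t <= 1:
        return "light"   # t-statistic between 0 and 1
    elif t <= 2:
        return "medium"  # t-statistic between 1 and 2
    return "dark"        # t-statistic greater than 2
```

For example, a coefficient of 3.0 with a standard error of 2.0 has a t-statistic of 1.5 and receives the medium shade.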
The confidence intervals in the chart are based on the confidence level selected in "Output Options" (90, 95, or 99 percent level of confidence). If you request a chart, but the "Confidence intervals" checkbox in "Output Options" is not checked, then the default 95 percent confidence level will be used for the chart.
Note that the accuracy of the confidence intervals depends on specifying the correct sample design. If the sample is not a simple random sample (SRS), the size of the SRS standard errors and confidence intervals will probably be too small.
This program lists the values of individual cases on variables specified by the user. Values of a numeric variable can also be transformed into percents of a second numeric variable. This is particularly useful when the cases in the data file are aggregate units such as cities.
One or more filter variables are used to limit the listing to a subset of the cases. In general, a limit of 500 cases is enforced for each listing, in case the user has forgotten to restrict the listing with sufficient filter variables.
Or you can select Clear fields to delete all previously specified variables and options, so that you can start over.
Percentages
Aside from simply specifying the name of a variable, it is possible to display the value of one variable as a percent of another variable. (Both variables must be numeric variables.) This is particularly useful when the cases in the data file are aggregate units such as cities.
To calculate and display a percent, use the following formats, beginning with $p, instead of a simple variable name:
To avoid accidental attempts to list large numbers of cases, the program suppresses any listing that would exceed a certain number of cases. The default limit is 500 cases, but that limit can be modified when the datasets are set up in the Web archive.
The available summaries are:
For a percentage (created with the '$p' command), the summaries, if requested, will be calculated as follows:
For example, the following specifications would generate six separate tables:
Basic range restriction
In a range, two asterisks '**' can be used to signify the lowest or highest NUMERIC value, regardless of whether or not the codes are defined as missing data.
For example: age(50-**)
This would include ALL numeric values greater than or equal to 50, including data values like 98 or 99, even if they had been defined as missing-data codes.
Note that '**' cannot be used alone (without '-') as a range specification. If you want to include all NUMERIC codes, you can use the range '(**-**)'.
Using this basic method of recoding, the new groupings of codes are given the default code values 1, 2, 3, and so forth. The default label for each group is the range of original codes that constitute that group ("18-30", for example).
Any categories of 'age' not included in the specified groupings will become missing-data on the recoded version, and they will be excluded from the analysis in the table.
On the other hand, any original missing-data categories of 'age' that are explicitly mentioned in the recode will be included. For instance, if the value '90' for 'age' were flagged as a missing-data code, but included as in the example above, it would become part of the third recoded category. This is discussed in more detail in the section on "Treatment of missing data."
For example, the variable 'age' can be recoded into the same three groups as above, but with the new code values 1, 5, and 10, by specifying the recode as follows:
age(r: 1 = 18-30; 5 = 31-50; 10 = 51-90)
For column, row, or control variables it will not usually matter what the new code values are. For variables on which statistics are computed, however, the new code values will affect the value of those statistics.
For example, you can assign labels to the recoded categories of race by using the following specification:
race(r: 800-869 "White"; 870-934 "Black"; 600-652, 979-982 "Asian")
These labels will appear in the table, in place of the range of original codes that constitute that group. Nevertheless, the recode specifications will still be documented. A summary is always given at the bottom of the table.
For example, the 'age' recode could be specified as:
age(r: *-30; 31-50; 51-*)
Using this method, all valid age values up to 30 would go into the first recoded group, and all valid age values of 51 or older would go into the third group.
If you want to use a range that includes NUMERIC codes that were defined as missing-data values, you can specify the range with two asterisks ('**') instead of one.
For example, the 'age' recode could be specified as:
age(r: *-30; 31-50; 51-**)
Using this method, all valid age values up to 30 would go into the first recoded group. But every numeric value of 51 or greater would go into the third group, including codes like 99 that may have been defined as missing-data codes.
For more discussion about including codes that have been defined as missing-data codes, see the section on "Treatment of missing data."
Notice that order is important with overlapping ranges. The following specification will NOT have the same effect as the preceding two:
age(r: 3= 50-90; 2= 30-50; 1= 18-30)
In this example, the 'age' value of 50 will end up in the recode group with the value '3' (instead of in the second group), and the 'age' value of 30 will end up in the recode group with the value '2' (instead of in the first group).
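The behavior with overlapping ranges can be sketched as a first-match rule: ranges are tried in the order listed, and the first range containing a value wins. This is an illustrative Python sketch, not SDA's actual code.

```python
def recode(value, spec):
    """Apply a list of (new_code, low, high) ranges in order.
    The FIRST range containing the value wins, which is why the
    order of overlapping ranges matters."""
    for new_code, low, high in spec:
        if low <= value <= high:
            return new_code
    return None  # not covered by any range: treated as missing-data

# Two orderings of the same overlapping ranges:
spec_a = [(1, 18, 30), (2, 30, 50), (3, 50, 90)]
spec_b = [(3, 50, 90), (2, 30, 50), (1, 18, 30)]
print(recode(50, spec_a))  # 2 -- the second group listed
print(recode(50, spec_b))  # 3 -- now the 50-90 range is tried first
```

With spec_b, the value 30 likewise falls into group 2 rather than group 1, matching the example above.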
The first method is to mention the code explicitly, either as a single value or as part of a range. For example, if the 'age' value of 99 has been defined as a missing-data code, it can still be included by either of the following specifications:
age(r: 18-30; 31-50; 51-90; 99), or
age(r: 18-30; 31-50; 51-100)
In the first case the code 99 will become its own fourth recode category. In the second case, it will be included as part of the third category.
A second method to include NUMERIC missing-data codes is to use an open range with two asterisks ('**') instead of one. For example, the following specification will include all numeric codes above 50 as part of the third recoded group:
age(r: 18-30; 31-50; 51-**)
Note that at present there is no way to include in a temporary recode the system-missing value or a character missing-data value (like 'D' or 'R'). You must use the regular recode program to handle those special missing-data codes. (Your data archive may or may not have enabled that program to run on your current dataset.)
Using this simple method of collapsing, the new groupings of codes are given the code values 1, 2, 3, and so forth. The label for each group is the range of original codes that constitute that group ("21-30", for example).
If the starting point is HIGHER than the lowest actual value in the data, the values lower than the starting point become missing-data. For example, with a starting point of '21', any lower values of 'age' (like 18, 19, and 20) would not be included in a range and would become missing-data.
If the starting point is LOWER than the actual minimum value in the data, the ending point of each range is not affected. However, the first range includes only the valid values in that range, if any. For example, if the starting point for collapsing 'age' is '1', with an interval of '10', but the lowest valid value in the data is '18', then the age ranges will be: 18-20, 21-30, 31-40, etc.
The highest range is affected by the highest valid value in the data. For example, if the highest valid value for 'age' is '97', and the starting point is '1' and the interval is '10', the highest intervals will be: 71-80, 81-90, 91-97.
A numeric missing-data code that happened to fall in between valid codes, however, would be included in the range that covers that code. For example, if '0' were defined as missing-data, but both '-1' and '+1' were actual valid codes, '0' would be included in one of the ranges.
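The interval-collapsing rule described above can be sketched as follows. This is an illustrative Python sketch with hypothetical names; note that SDA additionally caps the label of the top range at the highest valid value (e.g. 91-97), while this sketch keeps the nominal range ends.

```python
def collapse(values, start, interval):
    """Group numeric values into fixed-width ranges beginning at
    `start` (e.g. start=21, interval=10 gives 21-30, 31-40, ...).
    Values below the starting point fall into no range and would
    become missing-data."""
    groups = {}
    for v in values:
        if v < start:
            continue  # below the starting point: excluded
        k = (v - start) // interval
        lo = start + k * interval
        groups.setdefault((lo, lo + interval - 1), []).append(v)
    return groups

ages = [18, 21, 25, 30, 31, 97]
print(sorted(collapse(ages, 21, 10)))  # [(21, 30), (31, 40), (91, 100)]
```

With a starting point of 21, the age 18 falls below the first range and is excluded, matching the example above.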
For example, if the control variable is gender, there will be one table for men alone and then one table for women alone. A table will also be produced for the total of all valid categories of the control variable (e.g., men and women combined).
Only one variable at a time can be used as a control variable. If more than one control variable is specified, a separate set of tables (and charts) will be generated for each control variable.
Some filter variables may be set up ahead of time by the data archive. That type of filter variable is discussed below.
Note that it is also possible to limit the table to a subset of the cases by restricting the valid range of any of the other variables. But when the desired subset of cases is defined by a variable that is not one of the variables in the table or analysis, you must use filter variables.
Multiple ranges and codes may be specified.
For example: age(1-17, 25, 95-100)
Multiple filter variables
If you specify more than one filter variable, a case must satisfy ALL of the conditions in order to be included in the table.
For example: gender(1), age(30-50)
Open-ended Ranges using '*' and '**'
A single asterisk, '*', can be used to specify that all cases with VALID codes for a variable will pass the filter.
For example: age(*) includes all cases with valid data on the variable 'age'.
In a range, the '*' can be used to signify the lowest or highest VALID value. For example: age(*-25,75-*). This filter would include all VALID values less than or equal to 25 and all VALID values greater than or equal to 75. However, any missing-data values within those ranges would still be excluded.
In a range, two asterisks '**' can be used to signify the lowest or highest numeric value, regardless of whether or not the codes are defined as missing data. For example: age(50-**) would include ALL numeric values greater than or equal to 50, including data values like 98 or 99, even if they had been defined as missing-data codes. However, any character missing-data values would still be excluded. Note that '**' cannot be used alone in a filter variable. It can only be used as part of a range.
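The difference between a '*' bound and a '**' bound can be sketched as follows. This is an illustrative Python sketch, not SDA's actual code; the function name and missing-data codes are hypothetical.

```python
def passes_filter(value, low, high, missing_codes, include_missing):
    """Check whether a numeric value falls within [low, high].
    With a '*' bound (include_missing=False), values flagged as
    missing-data codes are excluded even if they fall in the range;
    with a '**' bound (include_missing=True) they are kept."""
    if not (low <= value <= high):
        return False
    return include_missing or value not in missing_codes

missing = {98, 99}  # hypothetical numeric missing-data codes
# age(50-*): 99 is in range but is excluded as a missing-data code
print(passes_filter(99, 50, float("inf"), missing, False))  # False
# age(50-**): 99 is included despite being a missing-data code
print(passes_filter(99, 50, float("inf"), missing, True))   # True
```

A valid value like 60 passes either form of the filter, since only the treatment of missing-data codes differs.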
Multiple filter values can be specified, separated by spaces or commas:
city( Chicago,Atlanta Seattle)
Character variable filters are case-insensitive. For example, the following filters are functionally identical:
city( Atlanta )
city( ATLANTA )
city( AtLAnta )
If a filter value contains internal spaces or commas, it must be enclosed in matching quotation marks (either single or double):
city( "New York" )
state("Cal, Calif")
A filter value containing a single quote (apostrophe) can be specified by enclosing it in double quotes:
city( "Knot's Landing" )
Or, conversely, a filter value containing double quotes can be specified by enclosing it in single quotes:
name( 'William "Bill" Smith' )
Leading and trailing spaces, and multiple internal spaces, are NOT significant. The following filters are all functionally equivalent:
city( "New York " )
city( "New York" )
city( " New York " )
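The matching rules for character filter values (case-insensitive, with insignificant leading, trailing, and repeated internal spaces) amount to comparing normalized strings. A minimal sketch, with a hypothetical helper name:

```python
def normalize(s):
    """Lowercase a filter value, strip leading/trailing spaces, and
    collapse internal runs of spaces, so that equivalent filter
    values compare equal."""
    return " ".join(s.split()).lower()

print(normalize("  New   York ") == normalize("new york"))  # True
print(normalize("ATLANTA") == normalize("AtLAnta"))         # True
```

Under this rule, city( "New York " ), city( "New York" ), and city( " New York " ) all match the same cases.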
Note that ranges, which are legal for numeric variables, are not allowed for character variables. The following syntax is NOT legal:
city( Atlanta-Seattle)
For example, the variable 'gender' might be set up as a pre-set filter variable. The user could then choose 'Males' or 'Females' (or 'Both genders') from the drop-down list.
Pre-set filter variables are only a convenience for the user. The same result can be obtained by using the regular selection filter option to specify the filter variable(s) and the desired code categories to include in the analysis.
One possible difference between the pre-set filters and the regular user-defined selection filter specifications concerns cases with missing-data on the filter variable. A user-defined filter specification of 'gender(*)' would include all cases with a valid code on the variable 'gender', excluding any cases with missing-data on that variable, if there are any. On the other hand, selecting the '(Both genders)' option (or whatever the '##none' specification is labeled) for a pre-set filter would generally include cases with missing-data on the filter variable. (The '##none' specification has the same effect as not using that variable as a filter at all.)
To avoid any doubt about which cases are included or excluded, remember that the analysis output always reports which filter variables have been used and which code values have been included in the analysis. This is true both for pre-set selection filters and for user-defined filters.
SDA studies can be set up with a weight variable specified ahead of time so that the weight variable is used automatically. Other studies may be set up with a drop-down list of choices to be presented to the user, who then selects one of the available weight variables (or no weight variable, if that option is included in the list). If no weight variables have been pre-specified, the user is free to enter the name of an appropriate variable to be used as a weight.
The usual text available for a variable is the text of the question that produced the variable, provided that the text was included in the study documentation. Sometimes other explanatory text has been included.
If the variable was created by the 'recode' or the 'compute' program, the commands used to create the new variable are included in the descriptive text.