This file contains the online help that is also available from inside each SDA analysis program (by selecting the corresponding word highlighted on the form or screen for selecting options). In addition to the help specific to each program, this file includes information on features common to all analysis programs.
This program generates the univariate distribution of one variable or the crosstabulation of two variables. If a control variable is specified, a separate table will be produced for each category of the control variable. An explanation of each option can be obtained by selecting the corresponding word highlighted on the form.
The confidence interval or range is computed by multiplying the standard error of each percentage by the value of Student's t appropriate to the level of confidence requested and to the number of degrees of freedom. The result is added to the percentage to obtain the upper bound of the confidence interval, and it is subtracted from the percentage to obtain the lower bound.
For a large random sample, the appropriate value for Student's t for a 95 percent confidence interval is close to the familiar 1.96 value of the normal distribution.
Simple random samples
If the sample is equivalent to a simple random sample of a population, the standard error of each percentage is computed using the familiar "pq/n" formula for the normal approximation to the standard error of a proportion. For each proportion p, the formula is:
sqrt(p * (1-p) / (n-1))
where n is the number of cases in the denominator of the percentage -- the total number of cases in that particular column, row, or total table, depending on the percentage being calculated. For this calculation, n is the actual number of cases, even if weights have been used to calculate the percentages.
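As a minimal sketch of the calculation described above (not SDA's own code), the following Python computes the SRS standard error of a proportion and the resulting confidence bounds. The function names are illustrative, proportions are expressed on a 0-1 scale rather than as percentages, and the default multiplier of 1.96 is the large-sample value for 95 percent confidence; for small samples the appropriate Student's t value would be passed in instead.

```python
import math

def srs_standard_error(p, n):
    """Standard error of a proportion p based on n actual (unweighted) cases."""
    return math.sqrt(p * (1 - p) / (n - 1))

def confidence_interval(p, n, t_value=1.96):
    """Lower and upper bounds: p minus and plus t times the standard error."""
    margin = t_value * srs_standard_error(p, n)
    return p - margin, p + margin

# Hypothetical example: a proportion of 0.5 based on 101 cases
se = srs_standard_error(0.5, 101)
lower, upper = confidence_interval(0.5, 101)
print(round(se, 4), round(lower, 3), round(upper, 3))
```

Note that n here is the unweighted count of cases in the denominator, as the text specifies, even when the proportion itself was computed with weights.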
Complex samples
If the sample for a particular study is more complex than a simple random sample, the appropriate standard errors can still be computed provided that the stratum and/or cluster variables were specified when the dataset was set up in the SDA Web archive. Otherwise, the standard errors calculated by assuming simple random sampling are probably too small.
For complex samples the appropriate standard errors are computed using either the Taylor series method (for cluster samples) or the formula for stratified subclass means (for stratified element samples). The method used is reported when you run the program. If you want additional technical information, see the discussion of standard error calculation methods.
Note that the calculations for standard errors in cluster samples require that the coefficient of variation of the sample size of the denominator for each percentage, CV(x), be under 0.20; otherwise, the computed standard errors are probably too small, and they are flagged in the table with an asterisk. CV(x) and other diagnostic information are available for standard error calculations done by the SDA Comparison of Means program. That program and the SDA Crosstabulation program use the same information and methods to calculate standard errors.
Standard errors are used to create confidence intervals for the percentages in each cell. For most of the percentages in a table you can be 95% confident that the percentage in the population is within the interval bounded by approximately two standard errors above and below the percentage in the sample (ignoring the problem of potential bias in the sample).
For complex samples, the df equals the number of primary sampling units (clusters, for cluster samples; individual cases in the denominator, for unclustered samples) minus the number of strata (unstratified samples have a single stratum). Note that the number of strata used for this calculation is the number in each cell after collapsing, if strata had to be combined in order to have at least two clusters in a stratum.
The value of Student's t used for computing confidence intervals depends on the desired level of confidence (usually 95 percent) and the df. The smaller the df, the larger the required value of Student's t and, consequently, the wider the confidence intervals. As the df increases, the required Student's t value decreases until it approaches the familiar constant for the normal distribution (which is 1.96, for the 95 percent confidence level).
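The df rule for complex samples can be illustrated with a short sketch. The function name and the example counts are hypothetical; the rule itself (primary sampling units minus strata) is the one stated above.

```python
def complex_df(n_psus, n_strata=1):
    """Degrees of freedom: primary sampling units minus strata."""
    return n_psus - n_strata

# Unclustered, unstratified sample: each of 500 cases is its own PSU,
# and the whole sample counts as a single stratum.
print(complex_df(500))        # 499

# Cluster sample: 40 clusters arranged in 20 strata (two per stratum).
print(complex_df(40, 20))     # 20
```

With few strata and clusters the df is small, so the Student's t multiplier is noticeably larger than 1.96.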
One reason to request SRS calculations might be to compare the size of the SRS standard errors or confidence intervals with the corresponding statistics based on the complex sample design.
The Chi-square statistics are the most often used. Two versions are displayed -- Pearson's Chi-square, and the Likelihood-ratio Chi-square, each with its P-value (probability statistic) for the given df (degrees of freedom). The chi-square statistics do not take into account any ordering of the categories of the row and column variables. That is, you would get the same result even if the categories were put into another order.
The Chi-square statistic is used to assess the statistical significance of the observed relationship between the row and the column variables in the table. If the p-value is low (about .05 or less), the chances that the observed relationship is only due to sampling error are correspondingly low, and in that case the relationship is said to be statistically significant. On the other hand, if the p-value is high, the chances are correspondingly high that the row and the column variables are not related to one another in the whole population from which the sample was drawn but that they are only related in the sample that happens to have been selected and that we are observing (analyzing).
Note that if the frequencies in the table are weighted, the Chi-square statistic can be artificially inflated (or deflated). Consequently, if weights are used, the Chi-square is adjusted by the factor: (Total unweighted N) / (Total weighted N).
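The following sketch shows how the two Chi-square statistics and the weight adjustment could be computed for a table of frequencies. It is an illustration under stated assumptions, not SDA's implementation: the function name is hypothetical, the textbook formulas for the Pearson and likelihood-ratio statistics are used, df is taken as (rows - 1) * (columns - 1), and every row and column is assumed to have a nonzero total. P-values are not computed here.

```python
import math

def chi_square_stats(table, unweighted_n=None):
    """Return (pearson, likelihood_ratio, df) for a table of frequencies.

    If unweighted_n is given, the table is treated as weighted and both
    statistics are multiplied by unweighted_n / weighted_n.
    """
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    total = sum(row_totals)
    pearson = 0.0
    likelihood = 0.0
    for i, row in enumerate(table):
        for j, observed in enumerate(row):
            expected = row_totals[i] * col_totals[j] / total
            pearson += (observed - expected) ** 2 / expected
            if observed > 0:  # empty cells add nothing to the LR statistic
                likelihood += 2 * observed * math.log(observed / expected)
    if unweighted_n is not None:
        factor = unweighted_n / total
        pearson *= factor
        likelihood *= factor
    df = (len(table) - 1) * (len(table[0]) - 1)
    return pearson, likelihood, df

# Hypothetical 2x2 table of unweighted counts
print(chi_square_stats([[10, 20], [20, 10]]))
```

Swapping the two rows of the example table gives the same statistics, which reflects the point that the Chi-square statistics ignore the ordering of categories.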
If the row variable is a character variable, Eta cannot be calculated. If either the row variable or the column variable is a character variable, the correlation coefficient cannot be calculated.
The ordinal statistics can be calculated either for numeric variables or for character variables (with the categories sorted into alphabetic order).
If the sample is a simple random sample (or is being treated as one), the univariate statistics will also include the standard error of the mean and the coefficient of variation of the mean (standard error divided by the mean). If the sample is a complex sample (with stratum and/or cluster variables defined), you must use the MEANS program to obtain those statistics.
Note that the univariate statistics cannot be calculated for character variables. If a character variable is used as a row variable, the request for univariate statistics is ignored. Even for numeric variables, be aware that the univariate statistics will not be meaningful unless the code values of the row variable are ordered in a way that approximates interval-level data.
The transition from a lighter shade of red or blue to a darker shade depends on the magnitude of the Z-statistic. The lightest shade corresponds to Z-statistics between 0 and 1. The medium shade corresponds to Z-statistics between 1 and 2. The darkest shade corresponds to Z-statistics greater than 2.
The color coding can be turned off, if you prefer. Color coding may not be helpful if you are using a black-and-white monitor or if you intend to print out the table on a black-and-white printer.
The Z-statistic shows whether the frequencies in a cell are greater or fewer than expected (in the same sense as used for the Chi-square statistic). It also takes into account the total number of cases in the table. If there are only a few cases in the table, the deviations from the expected values are not as significant as if there are many cases in the table.
The Z-statistics are standardized residuals. The residual for each cell is calculated as the ratio of two quantities:
Note that if the frequencies in the table are weighted, the Z-statistic can be artificially inflated (or deflated). Consequently, if weights are used, each Z-statistic is divided by the average size of the weights. The average size of the weights is simply the total number of weighted cases in the table divided by the actual number of unweighted cases in the table. For example, if the table is based on 1,000 actual cases, but the weighted number of cases is 100,000, the average size of the weights is 100,000/1,000 = 100. (The Chi-square statistics are adjusted in the same way, to compensate for weights whose average is different from 1.)
If bivariate statistics are requested, nominal and ordinal statistics will be produced as usual, with the missing data codes sorted into order with the valid codes.
Interval-level statistics will also be computed if the included missing-data codes allow it. The Eta statistic will be calculated if the included missing data codes on the ROW variable are all numeric. The Pearson correlation coefficient can be calculated only if the included missing data codes are all numeric on BOTH the row and column variables.
If univariate statistics are requested, the row variable can only have numeric missing-data codes. Otherwise, no statistics can be generated, and the request is ignored.
If you select column percentaging, the chart will include a separate set of bars (or a separate pie) describing the row variable, for each category of the column variable. For a line chart, there will be a separate line for each category of the row variable, plotted against the values of the column variable. The column variable is treated as the "break variable" in this layout.
If you select row percentaging, the chart will include a separate set of bars (or a separate pie) describing the column variable, for each category of the row variable. For a line chart, there will be a separate line for each category of the column variable, plotted against the values of the row variable. The row variable is treated as the "break variable" in this layout.
If you select total percentaging, a combination of row and column percentaging, or no percentaging at all, the effect is the same as selecting column percentaging only.
If there is only a row variable specified for the table, the chart will include one set of bars (or one pie, or one line) to show the distribution of that row variable.
Note that these percents may not always appear or may not be legible in all situations.
On stacked bar charts the percents may not have sufficient room to appear inside the area allocated to small categories.
On pie charts and line charts the percents for some slices or for some points on the lines may be almost overlaid and become illegible, if there are many categories or if the lines are very close together.
If you still want to show the percents in those situations, it will usually help if you increase the size of the charts. For stacked bar charts it can also help to change from a vertical to a horizontal orientation.
Pie charts in particular may require an increase in the dimensions of the chart if the number of category slices is large. Otherwise, the labels for each slice of the pies might overlay one another.
Stacked bar charts with only two or three break categories may look better if the chart is made narrower. But if there is a large number of break categories (like years of age), the best solution is often to combine a horizontal chart orientation with an increase in the height of the chart.
Side-by-side bar charts are best limited to tables with a relatively small number of categories in both the row and the column variables. If there are many categories in either or both of the variables, the proliferation of bars can be confusing, even if the chart dimensions are increased. In such cases it is probably better to use stacked bar charts instead of side-by-side bar charts.
Line charts may need to be enlarged if the lines are close to being overlaid. If percents are being shown, they also can become overlaid. In such cases it may help to increase the height of the chart.
This program calculates the mean of the dependent variable separately within categories of the row variable and, optionally, the column variable. If a control variable is specified, a separate table will be produced for each category of the control variable. An explanation of each option can be obtained by selecting the corresponding word highlighted on the form.
Steps to take
REQUIRED variable names
Sometimes, however, it is more helpful to express each cell mean in another way:
The average of the differences (shown in the rightmost column) is the weighted average of the differences in that row. The weight is the number of cases (weighted number of cases if a weight was used) in that cell plus the number of cases in the corresponding base category.
The average of the differences (shown in the bottom row) is the weighted average of the differences in that column. The weight is the number of cases (weighted number of cases if a weight was used) in that cell plus the number of cases in the corresponding base category.
The totals are usually of interest only when a weight is being used to expand the cell counts up to their estimated values in the population. For example, one may be interested in the total estimated NUMBER of persons in each cell who have some characteristic (e.g., who smoke, or drive cars), instead of the PROPORTION of persons who have that characteristic. This assumes that the dependent variable is coded '1' for a case which has the characteristic (smokes, for example) and '0' for a case which does not have the characteristic.
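A small sketch of an estimated total for one cell, using a 0/1 dependent variable and expansion weights. The function name, the four cases, and the weight values are all hypothetical.

```python
def estimated_total(values, weights):
    """Weighted sum of a 0/1 dependent variable: the estimated number of
    persons in the population who have the characteristic."""
    return sum(v * w for v, w in zip(values, weights))

# Four hypothetical sample cases in one cell; 1 = smokes, 0 = does not smoke
smokes = [1, 0, 1, 0]
weights = [1200, 800, 1000, 1500]   # expansion weights

print(estimated_total(smokes, weights))   # 2200
```

Dividing this total by the sum of the weights in the cell would give the weighted proportion instead of the estimated number of persons.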
Enter the code value for the row or column category that you want to consider the base category.
The proportion in each cell of the table can be transformed into another statistic that has a more stable distribution. The following options are available:
Proportions greater than .5 have a positive logit. Proportions less than .5 have a negative logit. A proportion of .5 has a logit value of 0.
The logit has a constant standard deviation of 1.81 (pi / sqrt(3), to be exact).
The standard error of each logit is 1.81 / sqrt(n) where n is the number of cases in that cell. (This assumes simple random sampling. Currently, no complex standard errors are available in SDA for logit statistics.)
Proportions greater than .5 have a positive probit. Proportions less than .5 have a negative probit. A proportion of .5 has a probit value of 0.
The probit has a constant standard deviation of 1.0.
The standard error of each probit is 1.0 / sqrt(n) where n is the number of cases in that cell. (This assumes simple random sampling. Currently, no complex standard errors are available in SDA for probit statistics.)
This option converts a proportion into a logit and then rescales it by making the standard deviation equal to 1.0 (like a probit) instead of 1.81 (the usual standard deviation of a logit).
This option is provided for didactic purposes, so that students and researchers can readily compare the logit and probit transformations of a table of proportions.
The standard error of each logit scaled as a probit is 1.0 / sqrt(n) where n is the number of cases in that cell. (This assumes simple random sampling. Currently, no complex standard errors are available in SDA for logit or probit statistics.)
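The three transformations and their SRS standard errors can be sketched as follows. This assumes the standard definitions (logit as the log of the odds, probit as the inverse of the standard normal distribution function), which are not spelled out in this help text; the function names are illustrative, and proportions must lie strictly between 0 and 1.

```python
import math
from statistics import NormalDist

LOGIT_SD = math.pi / math.sqrt(3)   # about 1.81

def logit(p):
    """Log of the odds p / (1 - p)."""
    return math.log(p / (1 - p))

def probit(p):
    """Standard normal quantile corresponding to proportion p."""
    return NormalDist().inv_cdf(p)

def logit_scaled_as_probit(p):
    """Logit rescaled so that its standard deviation is 1.0."""
    return logit(p) / LOGIT_SD

def srs_standard_error(sd, n):
    """Standard error under simple random sampling: sd / sqrt(n)."""
    return sd / math.sqrt(n)

for p in (0.25, 0.5, 0.75):
    print(p, round(logit(p), 3), round(probit(p), 3),
          round(logit_scaled_as_probit(p), 3))
```

For a cell with n cases, `srs_standard_error(LOGIT_SD, n)` gives the logit standard error and `srs_standard_error(1.0, n)` gives the standard error for the probit and for the logit scaled as a probit.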
Note the following for the standard errors of totals and differences:
If the sample for a particular study is more complex than a simple random sample, the appropriate standard errors can still be computed (except for transformed dependent variables) provided that the stratum and/or cluster variables were specified when the dataset was set up in the Web archive. Otherwise, the standard errors calculated by assuming simple random sampling are probably too small.
Standard errors are used to create confidence intervals for the mean in each cell. For cells with at least 30 cases, you can be 95% confident that the mean in the population (for each cell) is within the interval bounded by approximately two standard errors above and below the mean in the sample (ignoring the problem of potential bias in the sample).
The appropriate standard errors are computed using either the Taylor series method (for cluster samples) or the formula for stratified subclass means (for stratified element samples). The method used is reported when you run the program. If you want additional technical information, see the discussion of standard error calculation methods. You can also request SRS standard errors, which are calculated as if the sample were a simple random sample, for purposes of comparison.
Note the following for the standard errors of totals and differences. (These standard errors are given for the complex standard errors and/or the SRS standard errors, depending on what you have requested.)
Standard errors are used to create confidence intervals for the mean in each cell (or for the total, if a weight is being used to expand the cell counts to the estimated size of the population). The optional diagnostic table reports the degrees of freedom used to generate the appropriate t-statistic for creating the confidence intervals.
Note that the calculations for standard errors in cluster samples require that the coefficient of variation of the sample size in each cell, CV(x), be under 0.20; otherwise, the computed standard errors are probably too small, and they are flagged in the table with an asterisk. CV(x) for each cell is available in the optional diagnostic table.
A rho statistic between .05 and .10 is of moderate size. Its effect on the standard error depends on the size of the clusters. The larger the clusters, the larger the effect of rho. The stratified average cluster size is displayed for each cell in the optional diagnostic table.
If the design effect is less than 1.0, the rho statistic will be negative. This means that the differences between clusters within the same stratum are relatively small, compared to the variability between elements in the sample as a whole.
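The link between rho, cluster size, and the design effect can be illustrated with the textbook relation deff = 1 + rho * (b - 1), where deff is the design effect in terms of variances and b is the average cluster size. That relation is an assumption here; this help text does not state the exact formula SDA uses, and the function names are hypothetical.

```python
def design_effect(rho, avg_cluster_size):
    """Textbook relation: deff = 1 + rho * (b - 1)."""
    return 1 + rho * (avg_cluster_size - 1)

def rho_from_design_effect(deff, avg_cluster_size):
    """Solve the same relation for rho."""
    return (deff - 1) / (avg_cluster_size - 1)

# The same rho has a much larger effect when clusters are large
print(design_effect(0.05, 5), design_effect(0.05, 50))

# A design effect below 1.0 implies a negative rho
print(rho_from_design_effect(0.9, 11))
```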
The confidence interval or range is computed by multiplying the standard error of the mean (or total) by the value of Student's t appropriate to the level of confidence requested and to the number of degrees of freedom. The result is added to the mean (or total) to obtain the upper bound of the confidence interval, and the result is subtracted from the mean (or total) to obtain the lower bound. Note that if both complex and SRS standard errors are requested, only the complex standard errors are used to compute the confidence intervals.
For instance, for a very large random sample (in a particular cell of a table), the appropriate value for Student's t for a 95 percent confidence interval is close to the familiar 1.96 value for the normal distribution.
The MCA procedure shows the average effect of each category, and it ignores any interactions between the variables. If interaction effects are statistically significant, MCA is generally not appropriate.
The first column of the table gives the difference between the dependent variable score of respondents in each category and the overall mean of the dependent variable. This is the UNADJUSTED effect of each category.
The second column of the table gives the ADJUSTED effect of each category, taking into account the effects of the other variables. The adjustment process is similar to running a regression with dummy variables for the various categories. Regression coefficients for dummy variables, however, represent deviations from the effect of the omitted category. MCA coefficients, on the other hand, are deviations from the overall mean of the dependent variable.
The eta coefficient for each variable is like a bivariate correlation coefficient. It is the square root of the proportion of variance of the dependent variable "explained" by the categories of each variable.
The beta coefficient for each variable is like a standardized regression coefficient. It adjusts the eta coefficient for each variable by taking into account the effects of the other variables.
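The unadjusted effects and the eta coefficient for a single predictor can be sketched in a few lines. This is an unweighted illustration with hypothetical names and data; it does not cover the adjusted effects or beta, which require fitting all predictors together.

```python
import math

def unadjusted_effects_and_eta(groups):
    """groups maps each category to a list of dependent-variable scores.

    Returns (effects, eta). Each effect is the category mean minus the
    overall mean; eta is the square root of the between-category sum of
    squares divided by the total sum of squares.
    """
    all_scores = [y for scores in groups.values() for y in scores]
    overall_mean = sum(all_scores) / len(all_scores)
    effects = {}
    ss_between = 0.0
    for category, scores in groups.items():
        mean = sum(scores) / len(scores)
        effects[category] = mean - overall_mean
        ss_between += len(scores) * (mean - overall_mean) ** 2
    ss_total = sum((y - overall_mean) ** 2 for y in all_scores)
    return effects, math.sqrt(ss_between / ss_total)

effects, eta = unadjusted_effects_and_eta({"a": [1, 2, 3], "b": [5, 6, 7]})
print(effects, round(eta, 4))
```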
In an unclustered sample the df equals the number of cases in that cell, minus one. In a clustered sample with two clusters per stratum, the df equals the number of strata (or half the number of clusters). Note that the number of strata used for this calculation is the number in each cell after collapsing, if strata had to be combined in order to have at least two clusters in a stratum.
The value of Student's t used for computing confidence intervals depends on the desired level of confidence (usually 95 percent) and the df. The smaller the df, the larger the required value of Student's t and, consequently, the wider the confidence intervals. As the df increases, the required Student's t value decreases until it approaches the familiar constant for the normal distribution (which is 1.96, for the 95 percent confidence level).
Note that this estimation of the design effect due to weighting is based entirely on the variation in the weight variable, and it does not consider the specific dependent variable being analyzed. Not all uses of weights will increase the sample variance of a specific variable. If the weights reflect a stratification of the sample that was effective in reducing sampling error for this particular dependent variable, the estimated deft due to weighting may be greater than the overall deft. If this occurs, it is an indication that the weighting did not increase sampling error in this case.
Frequently, however, differential rates of sampling are used in different strata simply to achieve the oversampling of some group(s) relative to others. Weights are then used to compensate for the different probabilities of selection. In such a case, the different strata are sampled at different rates in a way that departs from optimum allocation, and the sampling variance is increased (see Kish, Survey Sampling, pp. 429-433).
The variation in the size of the weights across strata can be used to estimate the design effect due to weighting, assuming that it would have been optimal to use the same sampling fraction within all the strata. The deft due to weighting is based on formula 11.7.6 given in Kish, Survey Sampling, p. 430. That formula gives a design effect in terms of sampling variances. The square root of that result gives the deft in terms of standard errors.
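As a rough sketch, a design effect due to weighting is often approximated from the case-level weights as n * sum(w^2) / (sum(w))^2, with the deft being its square root. Whether this matches Kish's formula 11.7.6 term for term is an assumption made here for illustration, and the function name is hypothetical.

```python
import math

def deft_due_to_weighting(weights):
    """Approximate deft from the variation in the weights.

    deff = n * sum(w^2) / (sum(w))^2 (in terms of variances);
    deft = sqrt(deff) (in terms of standard errors).
    """
    n = len(weights)
    deff = n * sum(w * w for w in weights) / sum(weights) ** 2
    return math.sqrt(deff)

print(deft_due_to_weighting([2, 2, 2]))          # equal weights: 1.0
print(round(deft_due_to_weighting([1, 3]), 3))   # unequal weights: above 1.0
```

Because the calculation uses only the weights, it yields the same value for every dependent variable, which is the limitation described above.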
If the value of CV(x) is greater than 0.20, the calculated standard error is probably too small. Such standard errors are flagged in the main table with an asterisk. The corresponding confidence intervals are also flagged with an asterisk, as is the CV(x) in the table of diagnostic information that can be generated by the MEANS program (only).
The CV for each cell is a stratified estimate. The program calculates the coefficient of variation of the number of valid cases for the clusters within each stratum. The individual stratum CVs are then combined into an overall CV for each cell. This overall CV is reported in the optional table of diagnostic statistics available in the Comparison of Means program (but not in the Frequencies and Crosstabulation program).
If the p-value (probability statistic) associated with a variable is low (about .05 or less), the chances are correspondingly low that the observed effect on the dependent variable is only due to sampling error, and in that case the effect is said to be statistically significant.
Consult any beginners' statistics book for more information on the meaning of these statistics.
The transition from a lighter shade of red or blue to a darker shade depends on the magnitude of the t-statistic. The lightest shade corresponds to t-statistics between 0 and 1. The medium shade corresponds to t-statistics between 1 and 2. The darkest shade corresponds to t-statistics greater than 2. If the t-statistic is undefined because the standard error is zero or cannot be calculated, no color is shown -- this usually indicates that the mean in that cell is an unstable estimate based on a few cases that all happen to have the same value.
The color coding can be turned off, if you prefer. Color coding may not be helpful if you are using a black-and-white monitor or if you intend to print out the table on a black-and-white printer.
The t-statistic shows whether the mean in a cell is larger or smaller than the overall mean. It also takes into account the total number of cases in the cell. If there are only a few cases in a cell, the deviation from the overall mean is not as significant as if there are many cases in that cell.
The t-statistic is calculated as the ratio of two quantities: The numerator is the difference between the mean in the cell and the overall mean. The denominator is the standard error of the mean in that cell. If complex standard errors have been requested, the complex standard error for each cell is used to calculate the t-statistic.
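The t-statistic and the shading rule can be sketched together. The function names are hypothetical, and the handling of values exactly equal to 1 or 2 is an assumption, since the text gives only the ranges. The choice between red and blue is left out of the sketch.

```python
def t_statistic(cell_mean, overall_mean, standard_error):
    """(cell mean - overall mean) / standard error, or None if undefined."""
    if standard_error is None or standard_error == 0:
        return None
    return (cell_mean - overall_mean) / standard_error

def shade(t):
    """Map the magnitude of t to a shade name; None means no color."""
    if t is None:
        return None
    size = abs(t)
    if size > 2:
        return "darkest"
    if size > 1:
        return "medium"
    return "lightest"

print(shade(t_statistic(12.0, 10.0, 0.8)))   # darkest
print(shade(t_statistic(10.5, 10.0, 1.0)))   # lightest
print(shade(t_statistic(11.0, 10.0, 0.0)))   # None (standard error is zero)
```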
This program calculates the correlation between all pairs of two or more variables. An explanation of each option can be obtained by selecting the corresponding word highlighted on the form.
Or you can select Clear fields to delete all previously specified variables and options, so that you can start over.
Enter the name of each variable in a window (box). To go from one window to another, use the tab key or your mouse. It is all right to skip a window and leave it blank -- to use only windows 1, 5, and 9, for example.
It is possible to enter more than one variable name in a window (the underlying text-entry area will scroll). This has consequences for other options which refer to variable numbers. For example, if you enter two variables in window number 3, and then you request that the signs of the correlations be reversed for variable number 3, the signs of BOTH variables in window number 3 will be reversed.
Each window, consequently, defines a variable GROUP. Ordinarily it is clearer to put only one variable in each window, but the possibility of defining groups of variables exists.
This procedure retains all of the information about each pairwise relationship. However, the multivariate relationships can be inconsistent, if many of the cases have different missing-data patterns on different variables.
If this default dichotomization is not appropriate for a particular analysis, you can recode the variable temporarily within the correlation program using the standard methods of recoding variables.
Consult any beginners' statistics book for more information on the meaning of these statistics.
The alpha coefficient is a function of the average correlation between the variables and of the number of variables. If some of the variables are scored in opposite directions, you should use the option to reverse the signs of some of the variables, so that a high score on all variables means the same thing.
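One common way to express alpha in terms of the average correlation r and the number of variables k is the standardized form k * r / (1 + (k - 1) * r). The sketch below assumes that form for illustration; the help text does not state the exact formula SDA uses, and the function name is hypothetical.

```python
def standardized_alpha(avg_correlation, n_variables):
    """Standardized alpha: k * r / (1 + (k - 1) * r)."""
    k, r = n_variables, avg_correlation
    return k * r / (1 + (k - 1) * r)

# Alpha rises with both the average correlation and the number of variables
print(round(standardized_alpha(0.5, 4), 3))
print(round(standardized_alpha(0.3, 5), 3), round(standardized_alpha(0.3, 10), 3))
```

If variables scored in opposite directions are not reversed first, negative correlations pull the average toward zero and alpha is understated.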
The standard errors are used to create confidence intervals for each correlation coefficient. For example, you can be 95% confident that the correlation coefficient in the population (for each pair of variables) is within the interval bounded by approximately two standard errors above and below the correlation coefficient calculated from the sample (as shown in the matrix). The actual multiple to use for creating confidence intervals is the t-statistic with (n-1) degrees of freedom.
The calculation of the standard error of the correlation coefficient in each cell is based by default on the UNWEIGHTED number of cases, even if a weight variable has been used for calculating the correlation coefficient. Ordinarily this procedure will generate a more appropriate statistical test than one based on the weighted N in each cell.
The standard error is computed differently, depending on which correlation coefficient you have selected.
The statistics available for each variable include its mean, standard deviation, standard error, valid N of cases, and (if there is a weight variable) valid weighted N of cases.
If missing-data cases have been excluded LISTWISE (the default), the univariate statistics for all variables will be based on the SAME cases -- those which have valid data on ALL of the variables.
If missing-data cases have been excluded PAIRWISE, the univariate statistics for each variable will be based on all the cases with valid data for that one variable.
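The difference between the two exclusion rules shows up in the valid N. In this sketch the cases, the variable names, and the use of None to mark missing data are all hypothetical.

```python
def valid_n(cases, variables):
    """Count cases with valid (non-None) data on every named variable."""
    return sum(all(case[v] is not None for v in variables) for case in cases)

cases = [
    {"v1": 1,    "v2": 4,    "v3": None},
    {"v1": 2,    "v2": None, "v3": 7},
    {"v1": 3,    "v2": 5,    "v3": 8},
    {"v1": None, "v2": 6,    "v3": 9},
]

# LISTWISE: statistics for every variable use the cases valid on ALL variables
print(valid_n(cases, ["v1", "v2", "v3"]))   # 1

# PAIRWISE: statistics for v1 use every case with valid data on v1
print(valid_n(cases, ["v1"]))               # 3
```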
The paired statistics for each variable include its mean, standard deviation, valid N of cases for the pair, and (if there is a weight variable) valid weighted N of cases for the pair.
These statistics are displayed as a series of matrices. Each statistic for a given variable is (potentially) somewhat different, depending on which other variable it is being paired with.
The P-squared statistic is a way to measure the proportionality of rows in a correlation matrix. For example, if all of the coefficients in one row are exactly double the size of the coefficients in another row, there is a constant proportionality, and the index will be 1.0.
Usually we want to limit this comparison to a subset of the matrix -- namely, to the part corresponding to the correlations of the criterion variables with the variables of interest. To do this, we specify on the option screen the variable numbers (next to each window on the option screen) corresponding to the variables for which we want the P-squared measure, and the variable numbers corresponding to the criterion variables.
For example, we could examine the degree to which the variables v1, v2, and v3 have proportional correlations to the criterion variables x1, x2, and x3. We would enter v1, v2, and v3 into the first 3 windows on the option screen; and x1, x2, and x3 into windows 4 through 6. To get the P-squared statistic for all the combinations of v1, v2, and v3, in respect to the criterion variables, we would then specify:
The P-squared statistics are presented in a symmetrical matrix. Each row and column corresponds to one of the variables that we specified as a "variable to measure."
For a discussion of how to use this statistic, see Thomas Piazza, "The Analysis of Attitude Items," American Journal of Sociology, vol. 86 (1980) pp. 584-603.
For example, we may know that var1 is scaled in such a way that a HIGH score or value corresponds to a LOW score on var2 and var3, so we expect the correlations of var1 to be negative with var2 and var3. But if we are interested in the relationships of those variables to other variables, it will be easier to detect different patterns if we reverse all the signs corresponding to var1. That way, we can expect var1, var2, and var3 to have correlations of the same sign with other variables. Then if we do observe a difference in the signs, it will catch our attention.
The transition from a lighter shade of red or blue to a darker shade depends on the magnitude of the correlation coefficient in each cell. The lightest shade corresponds to coefficients between 0 and .15. The colors become darker as the absolute value of the correlation exceeds .15, then .30, then .45.
Color coding is also used for the P-squared matrix, if one has been requested. However, the dividing points for colors are double in magnitude. The lightest shade corresponds to P-squared coefficients between 0 and .30. The colors become darker as the absolute value of the P-squared coefficient exceeds .30, then .60, then .90.
The color coding can be turned off, if you prefer. Color coding may not be helpful if you are using a black-and-white monitor or if you intend to print out the matrix on a black-and-white printer.
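The dividing points described above for the correlation matrix and the P-squared matrix can be expressed as a small lookup. The function name and the numbering of shade levels from 0 (lightest) to 3 (darkest) are illustrative assumptions, as is the treatment of values exactly at a dividing point.

```python
def shade_level(coefficient, p_squared=False):
    """Return 0 (lightest) to 3 (darkest) from the absolute coefficient."""
    cutoffs = (0.30, 0.60, 0.90) if p_squared else (0.15, 0.30, 0.45)
    size = abs(coefficient)
    return sum(size > cutoff for cutoff in cutoffs)

print(shade_level(0.10))                   # 0
print(shade_level(-0.20))                  # 1
print(shade_level(0.50))                   # 3
print(shade_level(0.50, p_squared=True))   # 1
```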
This program calculates the correlation between two variables separately within categories of the row variable and, optionally, the column variable. If a control variable is specified, a separate table will be produced for each category of the control variable. An explanation of each option can be obtained by selecting the corresponding word highlighted on the form.
The log of the odds-ratio is an optional measure for dichotomous variables.
The standard error is computed differently, depending on which correlation coefficient you have selected. The standard error for the Pearson correlation is based on Fisher's Z, and it is calculated as the average distance of the upward and the downward confidence band for one standard error (based on the retransformation of Fisher's Z into Pearson's R). The standard error for the log of the odds ratio is calculated with standard formulas for that statistic.
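The Fisher's Z procedure for the Pearson correlation can be sketched as follows. This is a minimal illustration under stated assumptions, not SDA's implementation: the function name is hypothetical, and the standard error of Z is taken to be the usual textbook value 1/sqrt(n-3), which the help text does not specify.

```python
import math


def pearson_se_via_fisher_z(r, n):
    """Approximate standard error of a Pearson r, using Fisher's Z.

    r: the correlation coefficient (strictly between -1 and 1)
    n: the number of cases (must be greater than 3)
    """
    z = math.atanh(r)                # Fisher's Z transformation of r
    se_z = 1.0 / math.sqrt(n - 3)    # assumed standard error of Z
    upper = math.tanh(z + se_z)      # one standard error up, back on the r scale
    lower = math.tanh(z - se_z)      # one standard error down, back on the r scale
    # Average of the upward and downward distances from r.
    return ((upper - r) + (r - lower)) / 2.0


if __name__ == "__main__":
    print(pearson_se_via_fisher_z(0.5, 28))
```

Because the retransformed band is not symmetric around r, the upward and downward distances differ; averaging them gives a single standard error.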
If the sample is more complex than a simple random sample, the standard errors calculated here are probably too small.
The transition from a lighter shade of red or blue to a darker shade depends on the magnitude of the t-statistic. The lightest shade corresponds to t-statistics between 0 and 1. The medium shade corresponds to t-statistics between 1 and 2. The darkest shade corresponds to t-statistics greater than 2.
The color coding can be turned off, if you prefer. Color coding may not be helpful if you are using a black-and-white monitor or if you intend to print out the table on a black-and-white printer.
The t-statistic shows whether the correlation in a cell is larger or smaller than the overall correlation. It also takes into account the total number of cases in each cell. If there are only a few cases in a cell, a given deviation from the overall correlation is less statistically significant than the same deviation in a cell with many cases.
The t-statistic is calculated as the ratio of two quantities: The numerator is the difference between the correlation in the cell and the overall correlation. The denominator is the standard error of the correlation in that cell.
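The ratio described above, together with the three shades used for color coding, can be sketched in Python. The function names are hypothetical, and the handling of a t-statistic exactly equal to 1 or 2 is an assumption, since the help text does not say which shade those boundary values receive.

```python
def cell_t_statistic(cell_r, overall_r, cell_se):
    """(correlation in the cell - overall correlation) / SE of the cell correlation."""
    return (cell_r - overall_r) / cell_se


def t_shade(t):
    """Map a t-statistic to the shade described in the help text."""
    magnitude = abs(t)
    if magnitude > 2:
        return "darkest"
    if magnitude > 1:
        return "medium"
    return "lightest"


if __name__ == "__main__":
    t = cell_t_statistic(cell_r=0.45, overall_r=0.30, cell_se=0.10)
    print(t, t_shade(t))
```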
This program calculates the regression coefficients for one or more independent or predictor variables, using ordinary least squares. An explanation of each option can be obtained by selecting the corresponding word highlighted on the form.
Two versions of the regression coefficient are given for each variable:
In addition to the coefficients for each independent variable, a few summary measures for the regression as a whole are given. These include the Multiple R (multiple correlation coefficient), the R-Squared (the square of the Multiple R, also called the Coefficient of Determination), the Adjusted R-Squared, and the Standard Error of the Estimate.
The Adjusted R-Squared is a measure that compensates for the inflation of the regular R-Squared statistic due simply to the inclusion of additional independent variables. The Adjusted R-Squared will increase only if the additional independent variables increase the predictive power of the model more than would be expected by chance. It will always be less than or equal to the regular R-Squared.
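As an illustration, the common textbook formula for the Adjusted R-Squared is shown below. It is an assumption that SDA uses exactly this formula; the function and parameter names are hypothetical.

```python
def adjusted_r_squared(r_squared, n_cases, n_predictors):
    """Textbook adjustment: 1 - (1 - R^2) * (n - 1) / (n - k - 1).

    n_cases (n) is the number of cases used in the regression and
    n_predictors (k) is the number of independent variables.
    """
    return 1.0 - (1.0 - r_squared) * (n_cases - 1) / (n_cases - n_predictors - 1)


if __name__ == "__main__":
    # Same R-Squared, more predictors: the adjusted value drops.
    print(adjusted_r_squared(0.50, 101, 4))
    print(adjusted_r_squared(0.50, 101, 20))
```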
Aside from simply specifying the name of a variable, it is possible to restrict the range of a variable or to recode the variable temporarily. Note in particular that you can create dummy variables and product terms.
Or you can select Clear Fields to delete all previously specified variables and options, so that you can start over.
Enter the name of each variable in a window (box). To go from one window to another, use the tab key or your mouse. It is all right to skip a window and leave it blank -- to use only windows 1, 5, and 9, for example.
It is possible to enter more than one variable name in a window (the underlying text-entry area will scroll). Each window, consequently, defines a variable GROUP. Ordinarily it is clearer to put only one variable in each window, but the possibility of defining groups of variables exists.
To create such a variable temporarily, for a single regression run, for example, use the following syntax:
varname(d:1-3)
This would create a variable in which cases coded 1 through 3 on the variable 'varname' receive a code of 1, and all other VALID cases receive a code of 0. If 'varname' has a code defined as missing-data or out of range, the dummy variable will have the system-missing data value.
The characters 'd:' (or 'D:') indicate that you want to create a temporary dummy variable. The codes that follow show which codes on the original variable should become the code of 1 on the new dummy variable. One or more single code values or ranges can be specified. Multiple codes or ranges are separated by a comma.
You can give the '1' category of the dummy variable a label by putting the label in double quotes or in square brackets:
occupation(d:1, 3-5, 9, 10 "Managerial occupations = 1")
If you do not give a label, SDA will take the label from the code of the input variable assigned to the '1' category on the new dummy variable, provided that only a single code is assigned to the '1' category.
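The effect of a specification such as occupation(d:1, 3-5, 9, 10) on a single case can be sketched in Python. This mimics the rules described above and is not SDA's code: the function name, the use of None for the system-missing value, and the set of valid codes in the example are assumptions.

```python
def make_dummy(value, one_codes, valid_codes):
    """Return 1, 0, or None (standing in for system-missing).

    one_codes: single codes and/or (low, high) tuples that become 1
    valid_codes: the codes of the original variable that count as valid
    """
    if value is None or value not in valid_codes:
        return None  # missing-data or out-of-range codes stay missing
    for code in one_codes:
        if isinstance(code, tuple):
            low, high = code
            if low <= value <= high:
                return 1
        elif value == code:
            return 1
    return 0  # every other VALID code becomes 0


if __name__ == "__main__":
    # Hypothetical 'occupation' variable with valid codes 1-10 and 99 as missing.
    valid = set(range(1, 11))
    spec = [1, (3, 5), 9, 10]  # corresponds to d:1, 3-5, 9, 10
    for code in (1, 2, 4, 99):
        print(code, make_dummy(code, spec, valid))
```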
To create such a variable temporarily, for a single regression run for instance, use an asterisk (*) between the component variable names. For example:
age*education
This would create a variable in which, for each case, the value of 'age' is multiplied by the value of 'education'. If either 'age' or 'education' has an invalid code for that case, the temporary product term will have the system missing-data value.
One or more dummy variables can also be part of a product term. For example, the following form is acceptable:
party(d:3)*sex
In this example, first a dummy variable is created from the variable 'party', and then that dummy variable is multiplied by 'sex'.
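The way a product term treats invalid codes can be illustrated with a short Python sketch. The function name is hypothetical, and None stands in for any invalid or system-missing value.

```python
def product_term(*values):
    """Multiply the component values; the result is missing (None) if any component is."""
    if any(v is None for v in values):
        return None
    result = 1
    for v in values:
        result *= v
    return result


if __name__ == "__main__":
    # age*education for one case
    print(product_term(40, 12))
    # party(d:3)*sex: build the dummy first, then multiply
    party, sex = 3, 2
    party_dummy = 1 if party == 3 else 0
    print(product_term(party_dummy, sex))
    # An invalid code on either component makes the product missing
    print(product_term(None, 12))
```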
The probability estimate associated with each t-statistic is given in the last column. This is the probability of obtaining a regression coefficient (either B or Beta) that is this large or larger, if the true coefficient is equal to zero in the population from which the current sample was drawn. (Note that this version of the regression program assumes that the dataset is a simple random sample of the target population.)
If the probability value for a regression coefficient is low (about .05 or less), the chances are correspondingly low that the observed effect of that independent variable on the dependent variable is only due to sampling error. However, a low probability value does not indicate that the true value of the coefficient in the population is of any specific magnitude -- only that it is not equal to zero.
To construct a confidence interval for a specific regression coefficient, use the standard error of the coefficient. The approximate 95 percent confidence interval of each coefficient is formed by creating a range that is equal to the regression coefficient plus or minus two times the standard error.
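The approximate interval described above amounts to the following arithmetic, shown here as a small Python sketch with hypothetical function names and made-up values.

```python
def t_statistic(b, se):
    """Ratio of a regression coefficient to its standard error."""
    return b / se


def approx_95_ci(b, se):
    """Coefficient plus or minus two standard errors."""
    return (b - 2 * se, b + 2 * se)


if __name__ == "__main__":
    b, se = 1.5, 0.25  # illustrative values, not from any real dataset
    print(t_statistic(b, se))
    print(approx_95_ci(b, se))
```

An interval that does not include zero corresponds, roughly, to a t-statistic larger than 2 in absolute value.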
The t-statistic and associated probability value are also given for the constant term of the regression equation. This is a test that the regression equation in the population has no constant term (or intercept). This test is usually of less interest than the tests for the regression coefficients of the independent variables.
If the p-value for the regression is low (about .05 or less), the chances are correspondingly low that ALL of the observed effects of the independent variables on the dependent variable are only due to sampling error. Nevertheless, a low p-value does not indicate that any specific independent variable has an effect on the dependent variable. The separate t-test for each independent variable should be examined for that purpose. (However, if there is only one independent variable, the t-test for that variable will give the same p-value as the Global F-test.)
If this option is selected, the univariate statistics will automatically be selected as well, and the product will be displayed as an additional column in that table of statistics.
The transition from a lighter shade of red or blue to a darker shade depends on the magnitude of the t-statistic, which is the ratio of each regression coefficient (B) to its standard error. The lightest shade corresponds to t-statistics between 0 and 1. The medium shade corresponds to t-statistics between 1 and 2. The darkest shade corresponds to t-statistics greater than 2.
Correlation coefficients are also color coded, if a correlation matrix is requested. Correlation coefficients greater than zero become redder, the larger they are. Correlation coefficients less than zero become bluer, the more negative they are. The transition from a lighter shade of red or blue to a darker shade depends on the magnitude of the correlation coefficient in each cell of the matrix. The lightest shade corresponds to coefficients between 0 and .15. The colors become darker as the absolute value of the correlation exceeds .15, then .30, then .45.
The color coding can be turned off, if you prefer. Color coding may not be helpful if you are using a black-and-white monitor or if you intend to print out the table on a black-and-white printer.
This program calculates the logit or probit regression coefficients for one or more independent or predictor variables. An explanation of each option can be obtained by selecting the corresponding word highlighted on the form.
Aside from simply specifying the name of a variable, it is possible to restrict the range of a variable or to recode the variable temporarily. Note in particular that you can create dummy variables and product terms.
Or you can select Clear Fields to delete all previously specified variables and options, so that you can start over.
The exponential (or antilog) of each logistic regression coefficient is also output, if that option is selected. This transformed coefficient expresses the effect of a one-unit change in that independent variable on the odds that a person will have a score of 1 versus a score of 0 on the dependent variable. Note that this exponential transformation converts the additive regression coefficients into multiplicative terms.
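The step from an additive coefficient to a multiplicative effect on the odds can be illustrated in Python. The function name and the coefficient value are hypothetical.

```python
import math


def odds_multiplier(b, units=1):
    """Factor by which the odds are multiplied for a change of 'units' in the variable."""
    return math.exp(b * units)


if __name__ == "__main__":
    b = 0.40  # illustrative logistic regression coefficient
    one_unit = odds_multiplier(b)
    two_units = odds_multiplier(b, units=2)
    # Adding b twice on the logit scale multiplies the odds by exp(b) twice.
    print(one_unit, two_units, one_unit ** 2)
```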
When the dependent variable has only two categories, logistic and probit regression are more appropriate to use than ordinary least squares regression. Both logistic and probit regression will usually generate the same substantive results. The choice between them is generally a matter of custom within a specific field or discipline.
If the variable you want to use as a dependent variable is not already coded as a simple 0/1 variable, you can create a dummy variable, or you can recode the variable temporarily.
If the dependent variable is left as anything other than a simple 0/1 variable, the program will recode the dependent variable automatically. The lowest valid score will be recoded to the value '0', and all other scores will be recoded to the value '1'.
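The automatic recode can be mimicked with a few lines of Python. This is a sketch of the rule as stated above, not SDA's code; the function name is hypothetical and None stands in for missing data.

```python
def auto_recode_dependent(values):
    """Recode the lowest valid score to 0 and every other valid score to 1."""
    valid = [v for v in values if v is not None]
    if not valid:
        return list(values)  # nothing to recode
    lowest = min(valid)
    return [None if v is None else (0 if v == lowest else 1) for v in values]


if __name__ == "__main__":
    # A hypothetical three-category dependent variable with one missing case.
    print(auto_recode_dependent([1, 2, 3, None, 1]))
```

Note that with more than two categories, all categories above the lowest are merged into the '1' group, which may or may not be the contrast you intend.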
Enter the name of each variable in a window (box). To go from one window to another, use the tab key or your mouse. It is all right to skip a window and leave it blank -- to use only windows 1, 5, and 9, for example.
It is possible to enter more than one variable name in a window (the underlying text-entry area will scroll). Each window, consequently, defines a variable GROUP. Ordinarily it is clearer to put only one variable in each window, but the possibility of defining groups of variables exists.
To create such a variable temporarily, for a single analysis run, for example, use the following syntax:
varname(d:1-3)
This would create a variable in which cases coded 1 through 3 on the variable 'varname' receive a code of 1, and all other VALID cases receive a code of 0. If 'varname' has a code defined as missing-data or out of range, the dummy variable will have the system-missing data value.
The characters 'd:' (or 'D:') indicate that you want to create a temporary dummy variable. The codes that follow show which codes on the original variable should become the code of 1 on the new dummy variable. One or more single code values or ranges can be specified. Multiple codes or ranges are separated by a comma.
You can give the '1' category of the dummy variable a label by putting the label in double quotes or in square brackets:
occupation(d:1, 3-5, 9, 10 "Managerial occupations = 1")
If you do not give a label, SDA will take the label from the code of the input variable assigned to the '1' category on the new dummy variable, provided that only a single code is assigned to the '1' category.
To create such a variable temporarily, for a single regression run for instance, use an asterisk (*) between the component variable names. For example:
age*education
This would create a variable in which, for each case, the value of 'age' is multiplied by the value of 'education'. If either 'age' or 'education' has an invalid code for that case, the temporary product term will have the system missing-data value.
One or more dummy variables can also be part of a product term. For example, the following form is acceptable:
party(d:3)*sex
In this example, first a dummy variable is created from the variable 'party', and then that dummy variable is multiplied by 'sex'.
The probability associated with each t-statistic is given in the last column. This is the probability of obtaining a regression coefficient (B) that is this large or larger, if the true coefficient is equal to zero in the population from which the current sample was drawn. (Note that this version of the logit/probit regression program assumes that the dataset is a simple random sample of the target population.)
If the probability value for a regression coefficient is low (about .05 or less), the chances are correspondingly low that the observed effect of that independent variable on the dependent variable is only due to sampling error. However, a low probability value does not indicate that the true value of the coefficient in the population is of any specific magnitude -- only that it is not equal to zero.
To estimate the confidence interval of a specific regression coefficient, use the standard error of the coefficient -- displayed as SE(B). The approximate 95 percent confidence interval of each coefficient is formed by creating a range that is equal to the regression coefficient plus or minus two times the standard error.
The t-statistic and associated probability value are also given for the constant term of the regression equation. This is a test that the regression equation in the population has no constant term (or intercept). This test is usually of less interest than the tests for the regression coefficients of the independent variables.
A chi-square test for the regression is also computed. The p-value (probability value) for the chi-square test is the probability of obtaining a chi-square statistic that is this large or larger, if ALL of the regression coefficients (B's) are equal to zero in the population from which the current sample was drawn. (Note that this version of the logit/probit regression program assumes that the dataset is a simple random sample of the target population.)
If the p-value for the chi-square test is low (about .05 or less), the chances are correspondingly low that ALL of the observed effects of the independent variables on the dependent variable are only due to sampling error. Nevertheless, a low p-value does not indicate that any specific independent variable has an effect on the dependent variable. The t-test for each independent variable should be examined for that purpose. (However, if there is only one independent variable, the t-test for that variable will give the same p-value as the global chi-square test.)
If this option is selected, the univariate statistics will automatically be selected as well, and the product will be displayed as an additional column in that table of statistics.
The transition from a lighter shade of red or blue to a darker shade depends on the magnitude of the t-statistic, which is the ratio of each regression coefficient (B) to its standard error. The lightest shade corresponds to t-statistics between 0 and 1. The medium shade corresponds to t-statistics between 1 and 2. The darkest shade corresponds to t-statistics greater than 2.
The color coding can be turned off, if you prefer. Color coding may not be helpful if you are using a black-and-white monitor or if you intend to print out the table on a black-and-white printer.
This program lists the values of individual cases on variables specified by the user. Values of a numeric variable can also be transformed into percents of a second numeric variable. This is particularly useful when the cases in the data file are aggregate units such as cities.
One or more filter variables are used to limit the listing to a subset of the cases. In general a limit of 500 cases is enforced for each listing, in case the user has forgotten to limit the listing with sufficient filter variables.
An explanation of each option can be obtained by selecting the corresponding word highlighted on the form.
Or you can select Clear Fields to delete all previously specified variables and options, so that you can start over.
Aside from simply specifying the name of a variable, it is possible to convert a number into the percent of another variable. (Both variables must be numeric variables.) This is particularly useful when the cases in the data file are aggregate units such as cities.
To calculate and display a percent, use the following formats, beginning with $p, instead of a simple variable name:
To avoid accidental attempts to list large numbers of cases, the program suppresses any listing that would exceed a certain number of cases. The default limit is 500 cases, but that limit can be modified when the datasets are set up in the Web archive.
The available summaries are:
For a percentage (created with the '$p' command), the summaries, if requested, will be calculated as follows:
For example, the following specifications would generate six separate tables:
Basic range restriction
In a range, two asterisks '**' can be used to signify the lowest or highest NUMERIC value, regardless of whether or not the codes are defined as missing data.
For example: age(50-**)
This would include ALL numeric values greater than or equal to 50, including data values like 98 or 99, even if they had been defined as missing-data codes.
Note that '**' cannot be used alone (without '-') as a range specification. If you want to include all NUMERIC codes, you can use the range '(**-**)'.
Using this basic method of recoding, the new groupings of codes are given the default code values 1, 2, 3, and so forth. The default label for each group is the range of original codes that constitute that group ("18-30", for example).
Any categories of 'age' not included in the specified groupings will become missing-data on the recoded version, and they will be excluded from the analysis in the table.
On the other hand, any original missing-data categories of 'age' that are explicitly mentioned in the recode will be included. For instance, if the value '90' for 'age' were flagged as a missing-data code, but included as in the example above, it would become part of the third recoded category. This is discussed in more detail in the section on "Treatment of missing data."
For example, the variable 'age' can be recoded into the same three
groups as above, but with the new code values 1, 5, and 10, by
specifying the recode as follows:
age(r: 1 = 18-30; 5 = 31-50; 10 = 51-90)
For column, row, or control variables it will not usually matter what the new code values are. For variables on which statistics are computed, however, the new code values will affect the value of those statistics.
For example, you can assign labels to the recoded categories of race
by using the following specification:
race(r: 800-869 "White"; 870-934 "Black"; 600-652, 979-982 "Asian")
These labels will appear in the table, in place of the range of original codes that constitute that group. Nevertheless, the recode specifications will still be documented. A summary is always given at the bottom of the table.
For example, the 'age' recode could be specified as:
age(r: *-30; 31-50; 51-*)
Using this method, all valid age values up to 30 would go into the
first recoded group.
And all valid age values of 51 or older would go into the third group.
If you want to use a range that includes NUMERIC codes that were defined as missing-data values, you can specify the range with two asterisks ('**') instead of one.
For example, the 'age' recode could be specified as:
age(r: *-30; 31-50; 51-**)
Using this method, all valid age values up to 30 would go into the first recoded group. But every numeric value of 51 or greater would go into the third group, including codes like 99 that may have been defined as missing-data codes.
For more discussion about including codes that have been defined as missing-data codes, see the section on "Treatment of missing data."
Notice that order is important with overlapping ranges. The following specification will NOT have the same effect as the preceding two:
age(r: 3= 50-90; 2= 30-50; 1= 18-30)
In this example, the 'age' value of 50 will end up in the recode
group with the value '3' (instead of in the second group),
and the 'age' value of 30 will end up in the recode group with
the value '2' (instead of in the first group).
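The outcome described above is consistent with a rule in which each original value goes to the first listed group whose range contains it. The Python sketch below illustrates that rule; the "first match wins" behavior is inferred from the example, and the function name and the (new code, low, high) representation are assumptions.

```python
def recode(value, rules):
    """Return the new code of the first rule whose range contains value.

    rules: list of (new_code, low, high) tuples, in the order written
    in the recode specification. Unmatched values become None (missing).
    """
    for new_code, low, high in rules:
        if low <= value <= high:
            return new_code
    return None


if __name__ == "__main__":
    ascending = [(1, 18, 30), (2, 30, 50), (3, 50, 90)]
    descending = [(3, 50, 90), (2, 30, 50), (1, 18, 30)]
    for age in (30, 50):
        print(age, recode(age, ascending), recode(age, descending))
```

With overlapping ranges, the boundary values 30 and 50 change groups when the groups are listed in a different order.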
The first method is to mention the code explicitly, either as a single value or as part of a range.
For example, if the 'age' value of 99 has been defined as a missing-data
code, it can still be included by either of the following specifications:
age(r: 18-30; 31-50; 51-90; 99), or
age(r: 18-30; 31-50; 51-100)
In the first case the code 99 will become its own fourth recode category.
In the second case, it will be included as part of the third category.
A second method to include NUMERIC missing data codes is to use an
open range with two asterisks ('**') instead of one.
For example, the following specification will include all numeric
codes above 50 as part of the third recoded group:
age(r: 18-30; 31-50; 51-**)
Note that at present there is no way to include in a recode the system-missing value or a character missing-data value (like 'D' or 'R').
Using this simple method of collapsing, the new groupings of codes are given the code values 1, 2, 3, and so forth. The label for each group is the range of original codes that constitute that group ("21-30", for example).
If the starting point is HIGHER than the lowest actual value in the data, the values lower than the starting point become missing-data. For example, with a starting point of '21', any lower values of 'age' (like 18, 19, and 20) would not be included in a range and would become missing-data.
If the starting point is LOWER than the actual minimum value in the data, the ending point of each range is not affected. However, the first range includes only the valid values in that range, if any. For example, if the starting point for collapsing 'age' is '1', with an interval of '10', but the lowest valid value in the data is '18', then the age ranges will be: 18-20, 21-30, 31-40, etc.
The highest range is affected by the highest valid value in the data. For example, if the highest valid value for 'age' is '97', and the starting point is '1' and the interval is '10', the highest intervals will be: 71-80, 81-90, 91-97.
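The way the starting point, the interval, and the actual range of valid values combine can be sketched in Python. This reproduces the examples above and is not SDA's code; the function and parameter names are hypothetical, and dropping ranges that contain no valid values at all is an assumption.

```python
def collapse_ranges(start, interval, lowest_valid, highest_valid):
    """Return the (low, high) ranges produced by collapsing a variable.

    Range boundaries are fixed by 'start' and 'interval'; the first and
    last ranges are trimmed to the lowest and highest valid values.
    """
    ranges = []
    low = start
    while low <= highest_valid:
        high = low + interval - 1
        if high >= lowest_valid:  # skip ranges entirely below the data
            ranges.append((max(low, lowest_valid), min(high, highest_valid)))
        low += interval
    return ranges


if __name__ == "__main__":
    # Starting point 1, interval 10, valid ages 18 through 97
    for low, high in collapse_ranges(1, 10, 18, 97):
        print(f"{low}-{high}")
```

With a starting point of 21 instead of 1, the first range would be 21-30, and ages 18 through 20 would fall outside every range.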
A numeric missing-data code that happened to fall in between valid codes, however, would be included in the range that covers that code. For example, if '0' were defined as missing-data, but both '-1' and '+1' were actual valid codes, '0' would be included in one of the ranges.
For example, if the control variable is gender, there will be one table for men alone and then one table for women alone. A table will also be produced for the total of all valid categories of the control variable (e.g., men and women combined).
Only one variable at a time can be used as a control variable. If more than one control variable is specified, a separate set of tables (and charts) will be generated for each control variable.
Some filter variables may be set up ahead of time by the data archive. That type of filter variable is discussed below.
Note that it is also possible to limit the table to a subset of the cases by restricting the valid range of any of the other variables. But when the desired subset of cases is defined by a variable that is not one of the variables in the table or analysis, you must use filter variables.
Multiple ranges and codes may be specified.
For example: age(1-17, 25, 95-100)
Multiple filter variables
If you specify more than one filter variable, a case must satisfy
ALL of the conditions in order to be included in the table.
For example: gender(1), age(30-50)
Open-ended Ranges using '*' and '**'
A single asterisk, '*', can be used to specify that all cases with VALID
codes for a variable will pass the filter.
For example:
age(*)
includes all cases with valid data on the
variable 'age'.
In a range, the '*' can be used to signify the lowest or highest VALID value. For example: age(*-25,75-*). This filter would include all VALID values less than or equal to 25 and all VALID values greater than or equal to 75. However, any missing-data values within those ranges would still be excluded.
In a range, two asterisks '**' can be used to signify the lowest or highest numeric value, regardless of whether or not the codes are defined as missing data. For example: age(50-**) would include ALL numeric values greater than or equal to 50, including data values like 98 or 99, even if they had been defined as missing-data codes. However, any character missing-data values would still be excluded. Note that '**' cannot be used alone in a filter variable. It can only be used as part of a range.
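The filter rules for numeric variables can be summarized in a Python sketch. It is only an illustration under stated assumptions: the function names and the representation of a filter as (low, high) pairs are hypothetical, None stands in for system-missing or character missing-data values, and a missing-data code that falls inside a fully explicit range such as (95, 100) is assumed to pass the filter, following the rule given for explicitly mentioned codes in recodes.

```python
def in_range(value, low, high, missing_codes):
    """Test one range; a bound may be a number, '*' or '**'."""
    if value is None:
        return False  # non-numeric missing data never passes
    if low not in ("*", "**") and value < low:
        return False
    if high not in ("*", "**") and value > high:
        return False
    if value in missing_codes:
        # A single '*' covers VALID values only; '**' covers all numeric values.
        return "*" not in (low, high)
    return True


def passes_filter(value, ranges, missing_codes=frozenset()):
    """A case passes one filter variable if ANY of its ranges matches."""
    return any(in_range(value, lo, hi, missing_codes) for lo, hi in ranges)


def passes_all(case, filters, missing=None):
    """A case must satisfy ALL filter variables to be included."""
    missing = missing or {}
    return all(
        passes_filter(case.get(var), ranges, missing.get(var, frozenset()))
        for var, ranges in filters.items()
    )


if __name__ == "__main__":
    age_missing = {98, 99}
    print(passes_filter(99, [(50, "**")], age_missing))            # age(50-**)
    print(passes_filter(99, [("*", 25), (75, "*")], age_missing))  # age(*-25,75-*)
    # gender(1), age(30-50)
    print(passes_all({"gender": 1, "age": 35},
                     {"gender": [(1, 1)], "age": [(30, 50)]}))
```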
Multiple filter values can be specified, separated by spaces or commas:
city( Chicago,Atlanta Seattle)
Character variable filters are case-insensitive.
For example, the following filters are functionally identical:
city( Atlanta )
city( ATLANTA )
city( AtLAnta )
If a filter value contains internal spaces or commas, it must be enclosed in matching quotation marks (either single or double):
city( "New York" )
state("Cal, Calif")
A filter value containing a single quote (apostrophe) can be specified by enclosing it in double quotes:
city( "Knot's Landing" )
Or, conversely, a filter value containing double quotes can be specified by enclosing it in single quotes:
name( 'William "Bill" Smith' )
Leading and trailing spaces, and multiple internal spaces,
are NOT significant. The following filters are all functionally
equivalent:
city( "New York " )
city( "New   York" )
city( " New York " )
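The matching rules for character filters (case-insensitivity, quoting, and insignificant extra spaces) can be sketched in Python. This is an illustration of the rules as described, not SDA's parser; the function names and the regular expression are assumptions.

```python
import re

# A filter value is a double-quoted string, a single-quoted string,
# or a run of characters containing no spaces or commas.
_VALUE = re.compile(r'''"([^"]*)"|'([^']*)'|([^\s,]+)''')


def normalize(text):
    """Ignore case, leading/trailing spaces, and repeated internal spaces."""
    return " ".join(text.split()).lower()


def parse_values(spec):
    """Split the text inside the parentheses into normalized filter values."""
    return [normalize(dq or sq or bare) for dq, sq, bare in _VALUE.findall(spec)]


def matches(value, spec):
    """True if the data value equals one of the filter values."""
    return normalize(value) in parse_values(spec)


if __name__ == "__main__":
    print(parse_values(" Chicago,Atlanta Seattle"))
    print(matches("ATLANTA", " AtLAnta "))
    print(matches("New York", '"  New   York "'))
```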
Note that ranges, which are legal for numeric variables, are not allowed for character variables. The following syntax is NOT legal:
city( Atlanta-Seattle)
For example, the variable 'gender' might be set up as a pre-set filter variable. The user could then choose 'Males' or 'Females' (or 'Both genders') from the drop-down list.
Pre-set filter variables are only a convenience for the user. The same result can be obtained by using the regular selection filter option to specify the filter variable(s) and the desired code categories to include in the analysis.
One possible difference between the pre-set filters and the regular user-defined selection filter specifications concerns cases with missing-data on the filter variable. A user-defined filter specification of 'gender(*)' would include all cases with a valid code on the variable 'gender', excluding any cases with missing-data on that variable, if there are any. On the other hand, selecting the '(Both genders)' option (or whatever the '##none' specification is labeled) for a pre-set filter would generally include cases with missing-data on the filter variable. (The '##none' specification has the same effect as not using that variable as a filter at all.)
To avoid any doubt about which cases are included or excluded, remember that the analysis output always reports which filter variables have been used and which code values have been included in the analysis. This is true both for pre-set selection filters and for user-defined filters.
SDA studies can be set up with a weight variable specified ahead of time so that the weight variable is used automatically. Other studies may be set up with a drop-down list of choices to be presented to the user, who then selects one of the available weight variables (or no weight variable, if that option is included in the list). If no weight variables have been pre-specified, the user is free to enter the name of an appropriate variable to be used as a weight.
The usual text available for a variable is the text of the question that produced the variable, provided that the text was included in the study documentation. Sometimes other explanatory text has been included.
If the variable was created by the 'recode' or the 'compute' program, the commands used to create the new variable are included in the descriptive text.