Methods Used by SDA 4 for Computing
Standard Errors for Complex Samples

Specifying the sample design
Methods Specific to Certain Programs
Domains and subclasses of the sample
Differences between SDA and other programs
References

Specifying the Sample Design

The sample design is specified for each study when the SDA dataset is defined in the "Configure Datasets" section of the SDA Manager. On that page the archive manager may specify a stratum variable and/or a cluster variable for each dataset. (Alternatively, this information can be imported from a HARC file when the SDA database is initialized.)

There are three possible specifications:

Stratum and cluster BOTH specified
Cluster only specified
Stratum only specified

Each of these three specifications results in standard errors being computed differently.

Stratum and cluster variables BOTH specified
For a stratified cluster sample, both a stratum variable and a cluster (PSU) variable are specified. Each stratum must have valid cases in at least two clusters, for the sample as a whole. If this condition is not met, and stratum and cluster variables have been defined for the dataset, SDA will report an error and will not run. In such cases the stratum and cluster design variables must be fixed before SDA will run.
In assessing the completeness of the stratum and cluster information, cases with missing data on the variables used in the analysis are excluded. Users should therefore avoid creating selection-filter variables that assign a missing-data code to many of the cases.
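The requirement that every stratum contain valid cases in at least two clusters can be checked before a dataset is configured. The following is a minimal sketch, not SDA's own code; the function name and the (stratum, cluster) input layout are assumptions made for illustration.

```python
from collections import defaultdict

def strata_with_too_few_clusters(cases):
    """Return the strata that have valid cases in fewer than two clusters.

    `cases` is an iterable of (stratum, cluster) pairs, one per case that
    has valid data on the analysis variables (hypothetical input layout).
    """
    clusters_by_stratum = defaultdict(set)
    for stratum, cluster in cases:
        clusters_by_stratum[stratum].add(cluster)
    # A stratum needs at least two distinct clusters for a variance estimate.
    return sorted(s for s, c in clusters_by_stratum.items() if len(c) < 2)

# Stratum 2 has all of its cases in cluster 20, so it is flagged.
problem_strata = strata_with_too_few_clusters([(1, 10), (1, 11), (2, 20), (2, 20)])
```

An empty result means the design variables satisfy the two-clusters-per-stratum condition for the cases supplied.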
Cluster variable (only) specified
If a cluster sample has no explicit stratification, SDA can either create pseudo-strata by combining adjacent clusters or it can treat the clusters as belonging to a single stratum.
- Creating pseudo-strata: SDA by default will combine the clusters into pseudo-strata so that there are cases in each pseudo-stratum from more than one primary sampling unit (cluster).
  In order to create these pseudo-strata, each pair of adjacent clusters (in numeric order) is combined into a stratum. For example, clusters 1 and 2 would be paired as belonging to stratum 1, and clusters 3 and 4 would belong to stratum 2. If there is an odd number of clusters, the last cluster will be combined with the preceding two clusters, to form a final stratum with 3 clusters.
  If there are expected to be substantial differences between the clusters, it may be preferable to create explicit pseudo-strata yourself (based on criteria that do not involve peeking at the data), rather than to let the program create strata automatically from the numeric order of the clusters.
  It is important to understand that the automatic creation of pseudo-strata is done only once, for the sample as a whole. If a specific cell in a table (which is effectively a subclass of the sample) has no valid cases in a particular cluster, the pseudo-strata are not re-created. The calculation of variances can proceed with only one cluster in a stratum, so long as this happens in a subclass of the sample and not in the sample as a whole. See the discussion below on subclasses.
  On the other hand, if a subclass has no valid cases in any cluster in a stratum, that whole stratum is dropped from the calculation, and the missing strata and clusters do not contribute to the calculation of the degrees of freedom for the statistic in that cell.
  Once the clusters have been combined into pseudo-strata by the program, the Taylor series method is used to calculate standard errors, just as for the stratified cluster design.
- A single stratum: If the pairing of clusters into pseudo-strata might result in large differences between clusters that happen to fall into the same pseudo-stratum, it may be preferable to group all the clusters into one large stratum. This will avoid having the clusters paired automatically into pseudo-strata. This option may be specified by assigning '$1' as the name of the stratum variable in the SDA Manager (or in the HARC file to be imported).
  This procedure has the same effect as creating a stratum variable that has the same value for all the cases in the sample (the number '1', for instance) and then defining that variable as the stratum variable. The computation then proceeds as for the stratified cluster design.
  This procedure will sacrifice any potential gains that might result from the implicit stratification of the clusters (if they have been ordered by some relevant criterion). But it will also avoid the inflation of variance that could result from the pairing up of very different clusters.
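The default pairing rule described above (adjacent clusters in numeric order are paired, and a leftover odd cluster joins the preceding pair) can be sketched as follows. This is an illustration of the rule as stated here, not SDA's implementation; the function name and the numbering of pseudo-strata from 1 are assumptions.

```python
def make_pseudo_strata(cluster_ids):
    """Map each cluster id to a pseudo-stratum number (hypothetical helper).

    Clusters are sorted numerically and paired: the 1st and 2nd go to
    stratum 1, the 3rd and 4th to stratum 2, and so on. With an odd
    number of clusters, the last one joins the preceding pair.
    """
    ordered = sorted(set(cluster_ids))
    if len(ordered) < 2:
        raise ValueError("at least two clusters are needed to form a stratum")
    mapping = {cluster: index // 2 + 1 for index, cluster in enumerate(ordered)}
    if len(ordered) % 2 == 1:
        # Fold the unpaired last cluster into the previous stratum (3 clusters).
        mapping[ordered[-1]] = len(ordered) // 2
    return mapping

even_case = make_pseudo_strata([1, 2, 3, 4])      # {1: 1, 2: 1, 3: 2, 4: 2}
odd_case = make_pseudo_strata([3, 1, 2, 5, 4])    # {1: 1, 2: 1, 3: 2, 4: 2, 5: 2}
```

The single-stratum alternative corresponds to mapping every cluster to the same value, which is the effect of the '$1' setting.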
Stratum variable (only) specified
If a stratum variable is specified, but no cluster variable, the sample is treated as a stratified random (element) sample. This is equivalent to a stratified cluster sample with only one case in each cluster. In other words, each case is treated as a cluster of size one.

Methods Specific to Certain Programs

Percentages and Means
The TABLES and the MEANS programs calculate standard errors using the Taylor series approximation method.
For stratified cluster designs, the sampling variances are calculated based on the differences in the percentages or in the mean values of the dependent variable between clusters within each stratum. This method of calculation is discussed in Kish, Survey Sampling, pp. 190-193. The actual formula is 6.4.4 on p. 192. The finite population correction (1-f) is ignored.
Designs with only a cluster variable are basically converted to a stratified cluster design, as described above. Standard errors are then computed as for a stratified cluster sample.
Designs with only a stratum variable are equivalent to a stratified cluster sample with only one case in each cluster. In other words, each case is treated as a cluster of size one. The computation of standard errors for stratified element samples is a little simpler than for cluster samples, since there is no covariance between sampled elements within the strata. The actual formula used is 6.4.2 in Kish, Survey Sampling, p. 192. Once again, the finite population correction (1-f) is ignored.
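To make the Taylor series calculation concrete, here is a minimal sketch of the textbook linearization estimator for a weighted mean in a stratified cluster design, with the finite population correction omitted as described above. It is assumed, not confirmed, that this matches the form of the Kish formula cited; the function name and the (stratum, cluster, weight, y) input layout are hypothetical. A stratum-only design can be run through the same function by giving each case its own cluster id.

```python
from collections import defaultdict
from math import sqrt

def taylor_se_mean(cases):
    """Return (mean, standard error) for (stratum, cluster, weight, y) cases."""
    # Weighted totals of y and of the weights within each cluster.
    totals = defaultdict(lambda: [0.0, 0.0])
    for stratum, cluster, weight, y in cases:
        totals[(stratum, cluster)][0] += weight * y
        totals[(stratum, cluster)][1] += weight
    sum_y = sum(t[0] for t in totals.values())
    sum_x = sum(t[1] for t in totals.values())
    mean = sum_y / sum_x
    # Linearized cluster values z = y_total - mean * x_total, by stratum.
    z_by_stratum = defaultdict(list)
    for (stratum, _cluster), (ty, tx) in totals.items():
        z_by_stratum[stratum].append(ty - mean * tx)
    variance = 0.0
    for z in z_by_stratum.values():
        a = len(z)  # number of clusters in this stratum
        if a < 2:
            raise ValueError("each stratum needs at least two clusters")
        z_bar = sum(z) / a
        # Between-cluster variation within the stratum; no (1 - f) factor.
        variance += a / (a - 1) * sum((v - z_bar) ** 2 for v in z)
    return mean, sqrt(variance / sum_x ** 2)

sample = [("A", 1, 1.0, 1.0), ("A", 1, 1.0, 3.0), ("A", 2, 1.0, 2.0),
          ("A", 2, 1.0, 2.0), ("B", 3, 1.0, 5.0), ("B", 4, 1.0, 1.0)]
mean, se = taylor_se_mean(sample)
```

In this small example the two clusters in stratum A have identical linearized values, so all of the variance comes from the difference between the two clusters in stratum B.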
Differences between means
The MEANS program allows the user to specify a row or column to use as a base of comparison. Then the mean in each cell is compared to the base cell in the same row (or column). The difference between the two means is shown in the table, along with the standard error of the difference. Optionally, a table of confidence intervals can also be produced.
The standard error for each difference between two means is the square root of the sum of the two variances of the means minus the covariance, and it is calculated as: sqrt(VARIANCE1 + VARIANCE2 - COVARIANCE12). The variance of each mean is the square of the corresponding standard error. Each standard error is calculated according to the sample design, as described in the sections above. The covariance term arises because of the complex design (in cluster samples).
The confidence interval for each difference is calculated as a multiple of the standard error that is added to, or subtracted from, the difference. This multiple is based on Student's t-statistic. The value of Student's t used for computing confidence intervals depends on the desired level of confidence (usually 95 percent) and the degrees of freedom (df) for the comparison. The smaller the df, the larger the required value of Student's t and, consequently, the wider the confidence interval. As the df increase, the required value of Student's t decreases until it approaches the familiar constant for the normal distribution (which is 1.96 for the 95 percent confidence level).
In complex samples, the degrees of freedom are based on the number of clusters and strata used for the comparison. The optional diagnostic table reports those numbers for each difference shown.
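As an illustration of how the degrees of freedom feed into the interval, the sketch below computes a 95 percent confidence interval for a difference, taking df as the number of clusters minus the number of strata. The function names are hypothetical, and the small lookup table of standard two-sided critical values stands in for a full t distribution; for a df between table entries it falls back to the next lower entry, which gives a slightly wider interval.

```python
# Standard two-sided 95% critical values of Student's t for selected df.
T_95 = {1: 12.706, 2: 4.303, 5: 2.571, 10: 2.228,
        20: 2.086, 30: 2.042, 60: 2.000, 120: 1.980}

def t_critical_95(df):
    """Look up t for the largest tabulated df not exceeding `df`."""
    if df < 1:
        raise ValueError("degrees of freedom must be at least 1")
    return T_95[max(d for d in T_95 if d <= df)]

def confidence_interval_95(difference, std_error, n_clusters, n_strata):
    """Return (lower, upper, df) for a difference between two means."""
    df = n_clusters - n_strata
    half_width = t_critical_95(df) * std_error
    return difference - half_width, difference + half_width, df

# 12 clusters in 2 strata give df = 10, so t = 2.228 rather than 1.96.
lower, upper, df = confidence_interval_95(1.5, 0.5, n_clusters=12, n_strata=2)
```

With the same standard error, a comparison based on 62 clusters in 2 strata (df = 60) would use t = 2.000 and produce a narrower interval.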
Regression analysis
The SDA regression programs, REGRESS and LOGIT, calculate standard errors for complex samples using Jackknife repeated replications. Information on this method is available from a variety of sources. (See the references below.) Basically the method proceeds as follows:
- The stratum and/or cluster information for the study is used to create a series of sample replicates, each of which deletes the cases in one of the sample clusters (PSUs). (Cases are effectively deleted by assigning them a weight of zero.) For each replicate a set of weights is then created, which has the effect of compensating for the deleted cases by increasing the weights of the other cases in the same stratum as the deleted cases.
  (The current version of SDA must generate these replicate weights internally. It is not currently possible to use this method on a dataset that contains replicate weights but not the stratum and cluster variables themselves.)
- The specified regressions are then run multiple times, successively using each set of replicate weights and then the overall weight. The regression coefficients calculated from each replicate are compared to the results using the overall weight, and the differences are used to compute standard errors for each regression coefficient.
This method is relatively simple and can be used for many types of analysis. However, it does require more computation time than the Taylor series method used for TABLES and MEANS.
Logistic and probit regression, in particular, can require extra time. The LOGIT program, even without a complex sample design, uses multiple iterations to converge to a solution. For complex samples, each iteration through the data requires separate calculations for each set of replicate weights. For large datasets with many PSUs, therefore, the user should not expect the almost instantaneous results that SDA usually provides.
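The replicate procedure outlined above can be sketched in a few lines. This is a minimal illustration under stated assumptions, not SDA's code: it uses the common delete-one-PSU jackknife, in which the remaining clusters in the affected stratum have their weights multiplied by a/(a-1) and each squared difference is scaled by (a-1)/a, where a is the number of clusters in the stratum. The text above does not state which factors SDA uses. A weighted mean stands in for the regression coefficient to keep the example short, and all function names are hypothetical.

```python
from collections import defaultdict
from math import sqrt

def weighted_mean(pairs):
    """Stand-in estimator: weighted mean of (weight, value) pairs."""
    return sum(w * v for w, v in pairs) / sum(w for w, _ in pairs)

def jackknife_se(cases, estimator):
    """Return (estimate, SE) for (stratum, cluster, weight, value) cases."""
    clusters_in = defaultdict(set)
    for stratum, cluster, _w, _v in cases:
        clusters_in[stratum].add(cluster)
    full = estimator([(w, v) for _s, _c, w, v in cases])
    variance = 0.0
    for stratum, clusters in clusters_in.items():
        a = len(clusters)
        if a < 2:
            raise ValueError("each stratum needs at least two clusters")
        for dropped in clusters:
            replicate = []
            for s, c, w, v in cases:
                if s == stratum:
                    # Zero weight deletes the PSU; the rest are scaled up.
                    w = 0.0 if c == dropped else w * a / (a - 1)
                replicate.append((w, v))
            variance += (a - 1) / a * (estimator(replicate) - full) ** 2
    return full, sqrt(variance)

sample = [("A", 1, 1.0, 1.0), ("A", 1, 1.0, 3.0), ("A", 2, 1.0, 2.0),
          ("A", 2, 1.0, 2.0), ("B", 3, 1.0, 5.0), ("B", 4, 1.0, 1.0)]
estimate, se = jackknife_se(sample, weighted_mean)
```

The estimator is re-run once per cluster, which is why this approach costs more computation than the Taylor series method, and more still when the estimator itself is iterative.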

Domains and subclasses of the sample

When analyzing data, we often need to calculate statistics for subgroups of the sample, in addition to calculating statistics for the sample as a whole.

Defining subgroups within a sample
There are several ways to define subgroups of the sample:
- Row and/or column categories in a table
- Categories of a control variable
- Cases defined by a selection filter variable
- Combinations of the above
After defining the subgroups, we will then calculate statistics like percentages or means, together with standard errors and confidence intervals, for each subgroup.
Domains versus subclasses
In calculating confidence intervals, the issue arises as to whether the various subgroups are sampling domains or were merely created after the fact for purposes of analysis. The consequences for calculating confidence intervals are different in these two cases.
- Domains: If parts of a sample are drawn separately from different and non-overlapping strata, those portions of the sample are usually referred to as "domains." Practically speaking, this is the same as drawing separate samples from each of those portions of the population. An example of domains might be the regions of a country, if the sampling strata do not cross regional boundaries.
- Subclasses: In contrast, the usual division of a sample into subgroups is not based on separate sampling strata but is simply done after the fact for purposes of analysis. For example, respondents are often divided into subgroups by age and gender so that the analyst can generate separate estimates for those subgroups or compare results for the different subgroups. A subgroup of this type is usually called a "subclass." In this case all or most of the sampling strata would contain cases belonging to the various subclasses.
Concretely, this affects the calculation of degrees of freedom used to create the confidence intervals. For the Taylor series method of computing complex standard errors, the degrees of freedom are calculated as the number of clusters minus the number of strata. In carrying out this calculation for a specific domain, the strata without valid cases, and the associated clusters that fall into those strata in the sample as a whole, are excluded from consideration. This means that the degrees of freedom for a particular domain will be fewer than the degrees of freedom for subclasses spread over the sample as a whole. The fewer the degrees of freedom, the greater must be the t-statistic for a particular confidence level. This reduction in the degrees of freedom means that the confidence interval for a domain will be a little wider (for a given standard error) than it would be for a subclass with more degrees of freedom.
The problem is that there is no obvious way for SDA to determine in advance whether a particular subgroup of the sample is a sampling domain or is simply a subclass created for analysis.
SDA therefore uses the following rules to decide:
1. Empty Strata:
  If a particular subgroup of the sample has no cases at all in any of the clusters in a particular stratum, that subgroup is assumed to be a sampling domain, selected in a way that excludes the strata with no cases in that domain. In other words, a judgment is made that the lack of any cases at all in a sampling stratum must be the result of the sample design. Then the calculation of confidence intervals for that subgroup is carried out without counting the strata with no valid cases and the associated clusters that fall into those strata in the sample as a whole.
  Since this exclusion of strata is done separately for each subgroup (or cell of a table), the confidence intervals in different cells of a table may be based on different numbers of strata. The optional table of diagnostic information (available in the MEANS program) reports how many strata and clusters were actually used to generate the degrees of freedom for calculating confidence intervals in each cell or for each comparison.
2. Missing Data on the Design Variables:
  The calculation of complex standard errors depends on the availability of stratum and/or cluster information for each case, depending on the particular design for that study. Cases in the dataset that are missing the design information are treated as being excluded from the population of interest and are excluded from the calculation of confidence intervals.
3. Missing Data on the Dependent Variable:
  For the MEANS program, cases that do not have valid values on the dependent variable are treated as being excluded from the population of interest and are excluded from the calculation of confidence intervals. This means that when many respondents were skipped past the dependent variable of interest in the questionnaire, the confidence intervals are based only on the cases actually in the question path. This exclusion is only relevant for the MEANS program, which uses the Taylor series method for calculating complex standard errors. The regression programs use another method, and this issue does not arise.
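The degrees-of-freedom rule described in this section (clusters minus strata, leaving out any stratum in which the subgroup has no valid cases, together with that stratum's clusters) can be sketched as follows. This reflects one reading of the rules above, in which a retained stratum contributes all of its clusters from the sample as a whole; it is not SDA's implementation, and the function name and input layout are hypothetical.

```python
def subgroup_df(design, subgroup_cases):
    """Degrees of freedom for one subgroup (cell) of the sample.

    `design` maps each stratum to the set of its clusters in the sample
    as a whole. `subgroup_cases` is an iterable of (stratum, cluster)
    pairs for the subgroup's valid cases; cases missing design
    information or the dependent variable are assumed already removed.
    """
    # Strata in which the subgroup has at least one valid case.
    active_strata = {stratum for stratum, _cluster in subgroup_cases}
    n_clusters = sum(len(design[stratum]) for stratum in active_strata)
    return n_clusters - len(active_strata)

design = {1: {1, 2}, 2: {3, 4}, 3: {5, 6, 7}}

# The whole sample: 7 clusters - 3 strata = 4 degrees of freedom.
full_df = subgroup_df(design, [(1, 1), (1, 2), (2, 3), (2, 4), (3, 5), (3, 6), (3, 7)])

# A subgroup with no cases in stratum 2: 5 clusters - 2 strata = 3.
cell_df = subgroup_df(design, [(1, 1), (3, 5), (3, 6)])
```

Because the subgroup is absent from stratum 2, it is treated as a domain and its df falls from 4 to 3, which in turn calls for a slightly larger value of Student's t.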

Differences between SDA and other programs

For cluster-only samples the automatic grouping of clusters into pseudo-strata based on the numeric order of the cluster numbers is a special capability available in SDA. It is often preferable to group adjacent clusters into strata, rather than to leave them in one large stratum. This depends on whether or not the numbering of clusters was carried out with any scheme in mind, relevant to the statistics that will be calculated. The SDA default is to group adjacent clusters into pseudo-strata, but it is also possible to leave them in a single stratum. This choice is made when the dataset is set up in the SDA Web archive.
The way SDA decides whether a subgroup is a domain or a subclass of the population may differ from the way other programs do it. SAS, for example, will usually calculate degrees of freedom once for the sample as a whole and then use that number for confidence intervals in every subclass of the sample. Stata, on the other hand, is similar to SDA, at least if the "subpop" option is specified in certain expected ways. The problem is that it is not obvious to most users what the difference is between a domain and a subclass, and it is not obvious how to specify the corresponding options for SAS and Stata. SDA tries to make an educated guess about the correct specification, but there is no guarantee that the results will always match the results of other programs. If the subgroup is determined to be a sample domain, the degrees of freedom are calculated accordingly. As a result, confidence intervals calculated by SDA for particular subgroups of the sample (that is, cells of a table) may be a little wider (i.e., more conservative) than the confidence intervals generated by SAS and other statistical packages.

References

For a more detailed technical treatment of the standard error specifications for the SDA programs, see the following documents:
