Methods Used by SDA 1.1 for Computing
Standard Errors of Means for Complex Samples


Specifying the Sample Design

The sample design is specified for each study when the SDA dataset is defined in the Hypertext Archive Definition file (HARC file). In that file, the owner of the data archive (or whoever can modify that file) may specify a stratum variable and/or a cluster variable for each dataset. Consequently, there are three possible specifications: (1) stratum and cluster BOTH specified; (2) cluster variable only; and (3) stratum variable only. Each of these three specifications results in standard errors being computed differently.

Methods for Each Design

The standard error calculations for each combination of stratum and cluster variable specifications are summarized in this section.

Stratum and cluster variables BOTH specified

For a stratified cluster sample, both a stratum variable and a cluster (PSU) variable are specified. For this design, the Taylor series approximation method is used. Concretely, the sampling variances are calculated based on the differences in the mean values of the dependent variable between clusters within each stratum. This method of calculation is discussed in Kish, Survey Sampling, pp. 190-193. The actual formula is 6.4.4 on p. 192. The finite population correction (1-f) is ignored.

Each stratum must have valid cases in at least two clusters. For certain smaller subclasses of the sample, strata may have to be collapsed. See the discussion below on collapsing strata.
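
The calculation described above can be sketched in a few lines of Python. This is only an illustration, not SDA's actual code: the function name taylor_se_of_mean and the record layout (stratum, cluster, weight, y) are assumptions made for the example. It treats the mean as a ratio of weighted totals, linearizes it around the estimated mean, and sums the between-cluster variation within each stratum, ignoring the finite population correction.

```python
from collections import defaultdict
from math import sqrt


def taylor_se_of_mean(records):
    """Hypothetical helper: Taylor-series standard error of a weighted mean.

    records: iterable of (stratum, cluster, weight, y) tuples.
    Returns (mean, standard_error). The finite population
    correction (1-f) is ignored, as in the text above.
    """
    # Weighted totals of y and of the weights for each (stratum, cluster).
    totals = defaultdict(lambda: [0.0, 0.0])
    for stratum, cluster, w, y in records:
        cell = totals[(stratum, cluster)]
        cell[0] += w * y
        cell[1] += w

    y_total = sum(t[0] for t in totals.values())
    x_total = sum(t[1] for t in totals.values())
    r = y_total / x_total  # ratio mean

    # Linearized value for each cluster, grouped by stratum.
    by_stratum = defaultdict(list)
    for (stratum, _cluster), (yt, xt) in totals.items():
        by_stratum[stratum].append(yt - r * xt)

    variance = 0.0
    for stratum, z in by_stratum.items():
        a = len(z)  # number of clusters with cases in this stratum
        if a < 2:
            raise ValueError(
                f"stratum {stratum!r} has fewer than two clusters; "
                "collapse strata first"
            )
        z_bar = sum(z) / a
        variance += a / (a - 1) * sum((v - z_bar) ** 2 for v in z)

    variance /= x_total ** 2
    return r, sqrt(variance)
```

For example, with two strata of two single-case clusters each, taylor_se_of_mean([(1, "a", 1.0, 2.0), (1, "b", 1.0, 4.0), (2, "c", 1.0, 6.0), (2, "d", 1.0, 8.0)]) gives a mean of 5.0 and a sampling variance of 0.5, which matches the ordinary stratified formula for that small case.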


Cluster variable (only) specified

A cluster sample is always treated by SDA as a stratified cluster sample, for purposes of calculating standard errors. If the sample was designed without any explicit strata, the clusters are combined into pseudo-strata so that there are cases in each pseudo-stratum from more than one primary sampling unit (cluster).

To create these pseudo-strata, each pair of adjacent clusters (in numeric order) is combined into a stratum. For example, clusters 1 and 2 would be paired as stratum 1, and clusters 3 and 4 as stratum 2. If there is an odd number of clusters, the last cluster is combined with the preceding two clusters to form a stratum with 3 clusters.

It is important to understand that this method of creating strata is done separately for each cell of the table (subclass of the sample). If some cells do not have cases from all of the clusters, the strata in some cells may be quite different from the strata in other cells. For example, if a cell does not have any cases from cluster 2, stratum 1 for that cell would consist of clusters 1 and 3, and stratum 2 would consist of clusters 4 and 5. If there are expected to be substantial differences between the clusters, it may be preferable to create explicit strata yourself, rather than let the program create them automatically.
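
The pairing rule can be illustrated with a short Python sketch. The function name pair_clusters is hypothetical, and the handling of an odd count (the leftover cluster joins the pair immediately before it) is an assumption made for the example. The function is meant to be called once per cell, with only the clusters that have cases in that cell.

```python
def pair_clusters(cluster_ids):
    """Hypothetical helper: map each cluster to a pseudo-stratum number.

    cluster_ids: the cluster codes that have valid cases in one cell.
    Adjacent clusters (in sorted order) are paired; pseudo-strata are
    numbered from 1.
    """
    ordered = sorted(set(cluster_ids))
    if len(ordered) < 2:
        raise ValueError("at least two clusters are needed to form a stratum")

    # Positions 0,1 -> stratum 1; positions 2,3 -> stratum 2; and so on.
    mapping = {c: i // 2 + 1 for i, c in enumerate(ordered)}

    # Assumption: with an odd count, the last cluster joins the
    # preceding pair, giving one stratum with three clusters.
    if len(ordered) % 2 == 1:
        mapping[ordered[-1]] = mapping[ordered[-2]]
    return mapping
```

Calling pair_clusters([1, 3, 4, 5]) for a cell with no cases from cluster 2 returns {1: 1, 3: 1, 4: 2, 5: 2}, which reproduces the example in the paragraph above.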

Once the clusters have been combined into pseudo-strata by the program, the Taylor series method is used to calculate standard errors, just as for the stratified cluster design.

However, if the pairing of clusters into pseudo-strata might result in large differences between clusters that happen to fall into the same pseudo-stratum, it may be preferable to group all the clusters into one large stratum. This will avoid having the clusters paired automatically into pseudo-strata. To do this, create a stratum variable that has the same value for all the cases (the number '1', for instance). Then define that variable as the stratum variable in the HARC file. This procedure will sacrifice any potential gains that might result from the implicit stratification of the clusters (if they have been ordered by some relevant criterion). But it will also avoid the inflation of variance that could result from the pairing up of very different clusters.


Stratum variable (only) specified

If a stratum variable is specified, but no cluster variable, the sample is treated as a stratified element sample. However, the ordinary stratified variance formula cannot be used, since there is no information on the correct distribution of strata within cells of the table.

What this means is that the proportion of cases falling into each stratum, within each cell of the table, can be expected to vary from one sample to another. In other words, the stratum weights are not fixed but rather are random variables computed from each sample.

As a result, the formula for the variance of subclass means in stratified samples is used, instead of the ordinary simpler formula for stratified samples. (See Kish, pp. 132-136; especially formula 4.5.4 on p. 134.) The finite population correction (1-f) is ignored.

If weights are used for each case, the mean within each subclass is a ratio mean, since the weighted sum of cases is not fixed but would vary from one sample to another. Consequently, the Taylor series formula is used to calculate the sampling variance within each subclass. This Taylor series approximation is then used in the formula for the variance of stratified subclass means.


Collapsing of Strata

These methods of computing standard errors require that there be two or more units within each stratum. For cluster samples, this means that there must be at least one case from each of two or more clusters in a stratum. For stratified element samples, this means that there must be two or more cases in a stratum.

In some cells of a table (subclasses of the sample), it is possible that these requirements are not met for some of the strata. In such a case, the program will automatically combine adjacent strata, in order to be able to compute a within-stratum variance.

Since this collapsing of strata is done separately for each cell of the table, standard errors in different cells of the table may be based on different numbers of strata and different stratum definitions. The optional table of diagnostic information reports how many strata were actually used (after collapsing) for calculating standard errors in each cell.
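
One way such collapsing could work is sketched below in Python. The exact rule SDA applies is not spelled out here, so the details are assumptions: the function name collapse_strata is hypothetical, a deficient stratum is merged with the next non-empty stratum in numeric order, and a deficient group left over at the end is folded into the group before it.

```python
def collapse_strata(units_per_stratum, min_units=2):
    """Hypothetical helper: combine adjacent strata within one cell.

    units_per_stratum: dict mapping stratum code to the number of units
    (clusters, or cases for an element sample) with valid data in the cell.
    Returns a list of groups; each group is a list of the original
    stratum codes that were combined.
    """
    groups = []
    current, count = [], 0
    for stratum in sorted(units_per_stratum):
        n = units_per_stratum[stratum]
        if n == 0:
            continue  # stratum is empty in this cell
        current.append(stratum)
        count += n
        if count >= min_units:
            groups.append(current)
            current, count = [], 0

    if current:
        # Assumption: a deficient trailing group joins the previous group.
        if not groups:
            raise ValueError("the cell has fewer than min_units units in total")
        groups[-1].extend(current)
    return groups
```

With collapse_strata({1: 2, 2: 1, 3: 3, 4: 1}), stratum 1 stands alone and strata 2, 3 and 4 are combined, so the cell would be reported as using two strata after collapsing.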

If a stratum variable has been specified, this method of collapsing will preserve some of the gains of stratification, provided that adjacent strata (in numeric order) are more similar to one another than to strata farther removed. In other words, if the strata are part of a broader stratification scheme, it is advantageous to order the strata numerically in accordance with that scheme, so that similar strata are grouped together.

If only a cluster variable has been specified, the strata are always formed by pairing adjacent clusters. If some clusters become empty in small subclasses of the sample, the program will automatically use those clusters that still have cases in them, to create new pairs. This method of creating strata will preserve some of the gains of stratification, provided that adjacent clusters (in numeric order) are more similar to one another than to clusters farther removed. In other words, it is advantageous to order the clusters by some variable that is related to the dependent variable(s), so that the pairs grouped into strata are relatively similar. In that way the implicit stratification, produced by the sorting or ordering of the clusters, will be reflected in the explicit pseudo-strata created in order to calculate standard errors.


Difference between SDA and other standard error programs

The capability of collapsing strata automatically for each subclass of the sample is a special feature of SDA. Most programs will not calculate standard errors at all for any subclasses with strata having only one cluster or element. You must collapse the strata manually before running those programs.

SUDAAN has an option to deal with incomplete strata, but it cannot collapse strata. When there are strata with only one PSU for some cells, by default SUDAAN prints an error message and halts. If you use the MISSUNIT option on the NEST statement, SUDAAN will print the same message (as a warning only), and then estimate the variance contribution of that PSU by using the difference between that PSU's value and the overall mean value for the population.

That SUDAAN solution, however, is not optimal. The overall mean value is not really a good substitute for the stratum-specific values, unless the stratum variable has no relationship to the dependent variable. It is usually preferable to collapse strata and calculate the variance contribution of each PSU by using the difference between that PSU and the other(s) within the collapsed stratum. And this is what SDA is able to do automatically. However, the consequences of combining PSUs into new strata depend on how the strata are ordered numerically. See the section above on collapsing strata for a discussion of those issues.