SDA 4.1 Documentation for MEANS

NAME

means - run tables of means in batch mode

USAGE

means -b filename

DESCRIPTION

MEANS displays the mean value of a dependent variable in a crosstabular format. The means are calculated and displayed within categories defined by the row, column, and control variables. (Only a row variable is necessary.)

Several optional statistics such as medians, percentiles, and standard errors can also be calculated and displayed in each cell of the output table. Note that the standard error option refers only to the mean, not to the median or percentile. Each statistic can be displayed with a specified number of decimal places.

Ordinarily this program is invoked by the Web interface for the SDA programs, and the user does not have to deal with the keywords given in this document. Output from the program is usually in HTML, which is sent to the user's Web browser. However, output can also be produced as a CSV file so that the user can feed the results into other procedures, either for special formatting or for other purposes.
CSV output is produced if 'TYPE = CSV' is specified.

It is also possible to run the program in batch mode by preparing a command file, which specifies the variables to be analyzed and the options to use. This document explains how to prepare such a file. The name of this batch command file is specified to the program after the `-b' option flag.
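
As a rough illustration, a batch command file could be written and the program invoked from a script. The sketch below is in Python and is not part of SDA. The helper names 'command_file_text' and 'means_command' and the file name 'mymeans.txt' are hypothetical, and the sketch assumes the 'means' executable is on the search path.

```python
from pathlib import Path


def command_file_text(specs):
    """Render (keyword, value) pairs as 'keyword = value' lines."""
    return "".join(f"{keyword} = {value}\n" for keyword, value in specs)


def means_command(command_file):
    """Build the argument list for 'means -b filename'."""
    return ["means", "-b", str(command_file)]


specs = [
    ("study", "/archive/nes84"),
    ("dep", "vardep"),
    ("row", "var1"),
    ("savefile", "mymeans.htm"),
]
text = command_file_text(specs)
argv = means_command(Path("mymeans.txt"))

# To actually run the program (assumes 'means' is installed and on the PATH):
#   import subprocess
#   Path("mymeans.txt").write_text(text)
#   subprocess.run(argv, check=True)
```

The argument list mirrors the USAGE line above; everything else in the sketch is illustrative.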

KEYWORDS

The batch file contains specifications for the analysis. The specifications are given in the form "keyword = something" with one keyword per line. Keywords may be given in any order, either in upper or in lower case. The valid keywords are as follows (with significant characters shown in capital letters):

Basic Specifications for the Tables


Keyword       Possible Specification          Default (if no keyword)
_____________________________________________________________________


STUdy=        path of dataset directory       Look for variables in
                                                current directory only
SAvefile=     filename to receive output      Output sent to screen
                (overwrite existing file)       (standard output)


Variable Specifications

DEPendent=    variable name(s)                REQUIRED
               (separated by spaces/commas)
ROWvar=       variable name(s)                REQUIRED
               (separated by spaces/commas)
COLUMNvar=    variable name(s)                No column variable

CONtrolvar=   variable name(s)                No control variable

Weight=       name of weight variable         No weighting

Filter=       name(s) and codes of filter     No filter
                variable(s)

GVARCase=     LOWER or UPPER                  No force to lower/upper case

STRatum=      name of variable giving         No stratification for
                sample stratum                  computing standard errors
              $1: Force one stratum

CLuster=      name of variable giving         No cluster variable for
                sample cluster                  computing standard errors


General Options

COLORcoding=  Yes                             No color coding of cells
                                                or colored headings

LAnguagefile= pathname of file with           English labels on
                non-English labels              output

NOTABle=      Yes (to suppress tables of      Display the tables
                means, confidence intervals,
                and diagnostic information
                but still get other info)

TExt=         Yes                             No text for variables

RUNtitle=     title or comments for run       No title or comments

Statistics in Each Cell

Main statistic to display in each cell

The main statistic to display in each cell of the table can be one of five options: the mean, the total (which is the numerator of the mean), or the transformation of a 0/1 dependent variable into a logit, a probit, or a logit scaled as a probit. The default main statistic is the mean.

Instead of displaying the main statistic directly, it is possible to display the DIFFERENCE from something else, by adding the `difference=' keyword. The difference for each cell can be the difference between the cell mean and either the overall mean, the mean in the same column of a specified row, or the mean in the same row of a specified column. If a row or column difference is requested, you must also specify the BASE CATEGORY to use for the comparison.
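
The three kinds of difference can be sketched in a few lines of Python. This is only an illustration of the arithmetic described above, not the code used by MEANS. The function name 'cell_differences' and the small table of means are invented for the example.

```python
def cell_differences(means, kind, basecat=None, overall=None):
    """Return {(row, col): difference} for a table of cell means.

    means   -- dict mapping (row_code, col_code) to the cell mean
    kind    -- "overall", "row", or "column"
    basecat -- code of the comparison row or column (row/column kinds)
    overall -- overall mean (needed only for kind == "overall")
    """
    diffs = {}
    for (row, col), mean in means.items():
        if kind == "overall":
            base = overall
        elif kind == "row":
            base = means[(basecat, col)]    # same column, base row
        elif kind == "column":
            base = means[(row, basecat)]    # same row, base column
        else:
            raise ValueError(f"unknown kind: {kind}")
        diffs[(row, col)] = mean - base
    return diffs


# Hypothetical 2 x 2 table of means, keyed by (row code, column code).
means = {(1, 1): 2.0, (1, 2): 3.5, (2, 1): 4.0, (2, 2): 6.0}
by_column = cell_differences(means, "column", basecat=1)
by_row = cell_differences(means, "row", basecat=1)
```

Cells in the base category itself always show a difference of zero, which is why a base category must be named for row and column differences.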

For differences from a specified row or column, it is possible to obtain the average of the differences, instead of the difference in the marginal column or row. This option is set in the Global Specifications section for the dataset in the SDA Manager (or in the general section of the HARC file by setting XMEANS=YES).

For each statistic the user can specify the number of desired decimal places (in parentheses, after the name of the statistic). See below for the default number of decimals for each statistic.



Keyword       Possible Specification          Default (if no keyword)
_____________________________________________________________________

MAINstat=     MEANs (ndec)                    Display means, with
              TOTALs (ndec)                     two decimal places
              LOgit  (ndec)
              PRobit (ndec)
              LP (ndec)

DIFference=   Overall (ndec)                  Display main statistic
              Row     (ndec)
              Column  (ndec)

BASEcat=      code for comparison row/column  REQUIRED for row/column
                                                differences

AVGDiffs=     Yes                             No average differences
                                                from a row or column
                                                are displayed

Other statistics in each cell

In addition to the main statistic, one or more of the following optional statistics can be displayed in each cell (with the desired number of decimal places in parentheses if the defaults, listed below, are not satisfactory). Note that the 'OTHERSTats=' keyword can be repeated on subsequent lines if necessary.

If confidence intervals are requested, the upper and lower bounds of the confidence interval for the mean (or total or difference) in each cell are shown. (Confidence intervals for medians and percentiles are not available.) The default level of confidence is the 95 percent level, but the 90 or 99 percent levels can also be specified (in parentheses). The number of decimal places displayed will be the same as requested for the means. If both complex and SRS standard errors have been requested, only the complex standard errors are used for the confidence intervals.
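
For orientation only, the bounds of such an interval can be sketched as the mean plus or minus a multiple of its standard error. The Python sketch below ASSUMES a normal approximation; this document does not state which reference distribution MEANS actually uses, so the numbers may differ from the program's output. The function name 'confidence_interval' is hypothetical.

```python
from statistics import NormalDist


def confidence_interval(mean, std_error, level=95):
    """Lower and upper bounds under a normal approximation (an assumption)."""
    if level not in (90, 95, 99):
        raise ValueError("level must be 90, 95, or 99")
    # Two-sided interval: put (100 - level)/2 percent in each tail.
    z = NormalDist().inv_cdf(0.5 + level / 200)
    return mean - z * std_error, mean + z * std_error


lower, upper = confidence_interval(10.0, 1.0, level=95)
```

The standard error passed in would be the complex standard error when one is available, in line with the rule stated above.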


Keyword       Possible Specification          Default (if no keyword)
_____________________________________________________________________


OTHERSTats=
              CONFidence (level)              No confidence intervals
               (level can be 90, 95, or 99)

              (EITHER medians OR percentiles
               can be specified, but not both)
              MEDIAN (ndec)                   No Median of dep variable
              PERCENTile (nth, ndec)          No nth percentile

              MINimum (ndec)                  No minimum value
              MAXimum (ndec)                  No maximum value

              Ncases                          No unweighted N's
              WNcases (ndec)                  No weighted N's


             (statistics for means only)

              SER (ndec)                      No standard errors for
                                                simple random sample

              ZSTATistic (ndec)               No Z- or T-statistics

              P (ndec)                        No p-value
               (only for differences
                from a row or a col)

              SD (ndec)                       No standard deviations


             (for complex samples only)

              SEC (ndec)                      No standard errors for
                                                complex sample design
              DEFT (ndec)                     No design effect


             (for cluster samples only)


              RHO  (ndec)                     No cluster coefficient

REMEDIAN=     ASNEEDED or ALWAYS              NEVER: No remedian estimates
                                                for medians or percentiles
                                                (see section with additional
                                                 information below)

Optional tables of statistics

Additional tables of statistics can be generated, if desired.

An ANOVA table can be produced. For simple random samples the ANOVA table and an F-test are produced. For complex samples the F-test is omitted and the only output is the eta-squared statistics, which show descriptively the proportion of the variance of the dependent variable that is explained by the row and column variables and their interaction.

For complex samples, a table with diagnostic information in each cell can also be produced.

A multiple classification analysis (MCA) can be carried out. The default number of decimals is 3, but another number of decimal places can be specified.


Keyword       Possible Specification          Default (if no keyword)
_____________________________________________________________________

ANova=        Yes                             No anova table

OTHERTABles=
              DIAGnostics                     No table with diagnostics

              MCA (ndec)                      No Multiple Classification
                                                Analysis

Chart Options

There are several chart options, assuming that the chart generation servlet is running on the server computer. Two of the specifications, 'TBLProperties=' and 'CH_URL=', are required in order to produce charts.

The statistic charted is the statistic specified with the 'MAINSTAT=' keyword (default is MEANS). However, if MEDIANS or PERCENTILES are specified with the 'OTHERSTats=' keyword, the chart can be based on the median or percentile (only one can be specified), by specifying 'PERCENTile' as the chart type.


Keyword       Possible Specification          Default (if no keyword)
_____________________________________________________________________

CHARTtype=     PERCENTile                      Chart the 'MAINSTAT' instead
                                                 of medians or percentiles

TBLProperties= PATHNAME for chart properties   REQUIRED for charts
                file
               Required location for SDA 4 is:
               SDAROOT/tmpdir/xxx.cht
               where 'SDAROOT' is the pathname
                of the SDA installation on
                your server, and
               where 'xxx' is any name.
                (See the last example below)

               (This is a temporary filename,
                to be passed on to the charting
                servlet. The MEANS program
                will generate multiple files
                from the given filename, if
                multiple charts are generated
                because a control variable
                was specified or because
                multiple dependent or row or
                column variables were
                specified.)

CH_URL=         URL of chart-generation        REQUIRED for charts
                 servlet on the server.
                Required URL for SDA 4 is:
                http://SDAURL/sdaweb/charts
                 where 'SDAURL' is the
                 hostname of the SDAWEB
                 application on your server.
                 (See the last example below)

CH_MAXCHarts=   Maximum number of charts to     25
                 create on this run (1-100)

CH_TYPe=        Type of chart to create         bar
                (bar or line)

CH_ORientation= Orientation of BAR charts       vertical
                (vertical or horizontal)

CH_EFfects=     Visual effects for BAR charts   use2D
                (use2D - 2 dimensional;
                 use3D - 3 dimensional)

CH_SHOWMeans=   Yes                             No means or
                 Put means (or the specified     other stats
                 statistic) on the chart         on the chart

CH_FONT=        Font to use in charts           SansSerif

CH_COLor=       Yes (create charts in color)    Greyscale charts

CH_BARcolors=   Path for custom palette file    Standard colors
                 for bar charts
                 (See additional info below)

CH_LINEcolors=  Path for custom palette file    Standard colors
                 for line charts
                 (See additional info below)

CH_WIdth=       Width of chart in pixels        600

CH_HEight=      Height of chart in pixels       400

CSV Output

Instead of regular HTML output, the MEANS program can produce output as a CSV file (with commas separating the values output). If 'TYPE = csv' is included in the batch command file, the output will be produced as a CSV file. The name of the file is specified with the 'SAvefile=' keyword. The file name should ordinarily have a '.csv' suffix.

By default the various statistics generated for each cell of a table (such as means and numbers of cases) are output in separate sections (separate series of rows) in the CSV file.

If CSVCOMBine = yes, all of the statistics in each cell are output in the same section of the CSV file.


Keyword       Possible Specification          Default (if no keyword)
_____________________________________________________________________

TYPE=          csv (produce CSV output)       Standard HTML output

CSVCOMBine=    yes (combine statistics)       Not combined

ADDITIONAL INFORMATION

ABBREVIATIONS FOR KEYWORDS

Keywords can usually be abbreviated down to the number of characters required to differentiate them from other keywords. Sometimes only one character is required. The keyword for the weight variable, for instance, can be given as "weight=" or "wei=" or even "w=". Either upper or lower case may be used. In the list of keywords above, the minimum string of characters required for each specification is shown in capital letters.

Mention of keyword sufficient

The form `keyword=yes' may be shortened to `keyword'. That is, the `=yes' may be omitted for those options which require no further specification. For example, `text=yes' can be shortened to `text'.
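
One plausible reading of the abbreviation rule can be sketched in Python. The sketch is an assumption about the matching logic, not a description of the program's actual code: a word is accepted if it is a prefix of a keyword and is at least as long as the capitalized part shown in the tables. Only a handful of keywords are listed, and the names 'min_length' and 'resolve' are hypothetical.

```python
# A few keywords from the tables above, with the required characters
# in capitals, as in this document.
KEYWORDS = ["STUdy", "SAvefile", "STRatum", "Weight", "Filter", "TExt", "TYPE"]


def min_length(keyword):
    """Count the leading capital letters (the required characters)."""
    count = 0
    for ch in keyword:
        if not ch.isupper():
            break
        count += 1
    return count


def resolve(word):
    """Expand an abbreviation to the full keyword, in lower case."""
    word = word.lower()
    matches = [
        k for k in KEYWORDS
        if len(word) >= min_length(k) and k.lower().startswith(word)
    ]
    if len(matches) != 1:
        raise ValueError(f"unknown or ambiguous keyword: {word}")
    return matches[0].lower()


full = [resolve(w) for w in ("w", "WEI", "weight", "stu", "sa")]
```

Under this reading 'st' is rejected, because both 'STUdy' and 'STRatum' need three characters to be told apart.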

COLORS FOR CHARTS

Each type of chart has a default set of colors that are used for successive bars or lines in the chart. To change the default set of colors, specify the full pathname of a file that specifies, on each line, the three RGB color codes for each successive color to use. This pathname is given after the 'CH_BARcolors=' keyword and/or the 'CH_LINEcolors=' keyword.

COMMENTS

Anything on a line beginning with "#" is ignored by the batch processor and can therefore be used for comments. Blank lines are also ignored.
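
The line-level rules (comments, blank lines, and the shortened `keyword' form) can be sketched as a small reader in Python. This is an illustrative sketch, not the batch processor itself. The function name 'parse_lines' is hypothetical, and the sketch assumes that leading blanks before "#" are allowed.

```python
def parse_lines(lines):
    """Turn batch-file lines into a list of (keyword, value) pairs."""
    specs = []
    for line in lines:
        stripped = line.strip()
        # Blank lines and comment lines are ignored.
        if not stripped or stripped.startswith("#"):
            continue
        # Split at the first "=" only, so values may contain "=".
        keyword, sep, value = stripped.partition("=")
        # A bare keyword is treated as 'keyword=yes'.
        specs.append((keyword.strip().lower(), value.strip() if sep else "yes"))
    return specs


sample = [
    "# Test run",
    "",
    "study = /archive/nes94",
    "TEXT",
    "runtitle= a = b",
]
specs = parse_lines(sample)
```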

DECIMAL PLACES

Each statistic has a default number of decimal places with which it will be printed. To change the default, put the desired number of decimals in parentheses after specifying the statistic. The statistics affected and their defaults are:

For the main statistic: means(2), totals(0), logit(2), probit(2), logit-scaled-as-probit(2). If differences are displayed, the default number of decimal places is the same as for the main statistic.
For the optional statistics: medians(2), percentiles(2), ser(3), zstat or tstat(2), p-value(2), sd(3), sec(3), deft(2), rho(2), min(0), max(0), wncases(0).
Confidence intervals have the same number of decimal places as the means.
For MCA statistics: the default is 3 decimals.

It is not necessary to specify the `mean' statistic unless you want to change the number of decimal places for the mean. Unless otherwise specified, the mean is the main statistic that will be displayed, using the default number of decimals (2).

ORDER OF PROCESSING LISTS

When more than one variable is given for the dependent, row, column, or control variable specifications, the tables are produced in the following order: Tables for EACH of the control variables are produced with the FIRST column variable and the FIRST row variable and the FIRST dependent variable. Then the whole list of control variables is processed again for the SECOND column variable and the FIRST row variable and the FIRST dependent variable; and so on until the whole set of column variables has been processed. Then the whole series is repeated for the SECOND row variable; and so on until all the row variables have been used. Finally, the whole series is repeated for each succeeding dependent variable.

Briefly, the variables will cycle in the following order: control, column, row, dependent. All of the tables will be produced using the same weight, filters, and other options.
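
The cycling order can be pictured as nested loops, with the control variables in the innermost loop. The Python sketch below only illustrates the order described above. The function name 'table_order' and the variable names in the example are hypothetical.

```python
from itertools import product


def table_order(dependent, row, column=(), control=()):
    """List (dep, row, column, control) combinations in processing order."""
    # product() varies its LAST argument fastest, so the slowest-cycling
    # list (dependent) comes first and the fastest (control) comes last.
    # An empty column or control list is treated as a single "none" entry.
    return list(product(dependent, row, column or (None,), control or (None,)))


order = table_order(
    dependent=["dep1", "dep2"],
    row=["r1", "r2"],
    column=["c1", "c2"],
    control=["k1", "k2"],
)
```

With two variables in each list, 16 tables are produced; the first two differ only in the control variable, and the dependent variable changes only after eight tables.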

REMEDIAN ESTIMATES

If a dataset has a very large number of cases, and if a table with many cells is requested, there may not be enough memory available to calculate exact medians or percentiles for every cell in the table. In that case an estimate of the median or percentile, called the remedian, CAN BE computed for some of the cells, as needed. (The 'ALWAYS' option is primarily designed for testing, and 'NEVER' is the default.) An asterisk next to the median or percentile statistic indicates that it was estimated by the remedian algorithm. The remedian statistic will be output with the number of decimal places specified for the median or percentile (default=2 decimals).

For further information on this method of estimating the median or percentile, see Peter J. Rousseeuw and Gilbert W. Bassett, Jr., "The Remedian: A Robust Averaging Method for Large Data Sets." Journal of the American Statistical Association, March 1990, vol. 85, pp. 97-104. Note that SDA uses a base of 101 to calculate the remedian.
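
The general idea of the remedian can be sketched in Python: values are collected in a buffer of fixed size (the base); when the buffer fills, its median is passed up to the next buffer, and so on. The sketch below is a simplified illustration for the median only, not SDA's implementation. The class name 'Remedian', the buffers that grow as needed, and the weighted-median rule used for leftover values are assumptions made for the example.

```python
from statistics import median


class Remedian:
    """Streaming median estimate using buffers of 'base' values each."""

    def __init__(self, base=101):
        self.base = base
        # buffers[i] holds medians that each summarize base**i values.
        self.buffers = [[]]

    def add(self, value):
        level = 0
        self.buffers[0].append(value)
        # When a buffer is full, replace it by its median one level up.
        while len(self.buffers[level]) == self.base:
            mid = median(self.buffers[level])
            self.buffers[level].clear()
            level += 1
            if level == len(self.buffers):
                self.buffers.append([])
            self.buffers[level].append(mid)

    def estimate(self):
        # Weighted median of what is left: a value in buffers[i]
        # stands for base**i original observations.
        weighted = sorted(
            (value, self.base ** i)
            for i, buf in enumerate(self.buffers)
            for value in buf
        )
        if not weighted:
            raise ValueError("no data")
        half = sum(weight for _, weight in weighted) / 2
        running = 0
        for value, weight in weighted:
            running += weight
            if running >= half:
                return value


small = Remedian(base=3)
for x in range(1, 10):
    small.add(x)
estimate = small.estimate()
```

Only a few buffers of 101 values each need to be held in memory, which is the point of the method when exact medians for every cell would not fit.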

REPETITION OF KEYWORDS

If there is not enough room on a line to list all of the desired variables, the keyword can be repeated on a new line, and more variables can be listed. In such a case the second list is appended to the first list, for purposes of generating tables. This appending feature applies to the keywords for specifying the dependent, row, column, control, and filter variables, and also to the `otherstats' and the `othertables' keywords. If other keywords are repeated, the program will print an error message and stop.
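
The appending rule can be sketched in Python as follows. This is an illustration only. The function name 'merge_specs' is hypothetical, and the sketch assumes that keywords have already been expanded to the full names used in the set below.

```python
# Keywords whose repeated lines are appended rather than rejected.
APPENDABLE = {
    "dependent", "rowvar", "columnvar", "controlvar",
    "filter", "otherstats", "othertables",
}


def merge_specs(specs):
    """Combine (keyword, value) pairs, appending where that is allowed."""
    merged = {}
    for keyword, value in specs:
        if keyword not in merged:
            merged[keyword] = value
        elif keyword in APPENDABLE:
            merged[keyword] += " " + value   # second list follows the first
        else:
            raise ValueError(f"keyword given more than once: {keyword}")
    return merged


merged = merge_specs([
    ("rowvar", "var1 var2"),
    ("weight", "wtvar"),
    ("rowvar", "var3"),
])
```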

BACKWARD COMPATIBILITY

Versions prior to SDA 1.2b used 'vertical' and 'horizontal' to specify the 'rowvar' and 'columnvar' variables in the batch command files. Although the older terminology has been superseded, those keywords are still recognized for now as synonymous with the newer 'rowvar' and 'columnvar' specifications.

Confidence intervals were formerly specified as an OTHERTABle and were output as a separate table. In SDA 4.1.2 and later, they are specified as an OTHERSTat and are shown as a row in the main table of results. Batch files using the older syntax will still run, but the confidence intervals will be displayed in the main table.

EXAMPLES OF BATCH FILES

Basic example


     study = /archive/nes84
     dep = vardep
     row = var1
     column = var3

     otherstats = ncases
     anova = yes
     savefile = mymeans.htm

Multiple variables

Specify multiple dependent, row, and column variables, which will generate a table for each combination of the variables.
Also redefine some ranges, and use weight and filter variables.

     study = /archive/nes84
     dep = vardep1 vardep2
     row = var1(1-9) var2 var3(0-9)
     column = var3, var4

     weight= wtvar
     filters= var21(1-3) var30(1)

     otherstats = ser, ncases
     anova
     savefile = mymeans.htm

Differences from means in a specified column

Calculate the differences (with 3 decimal places) from column 1, the standard error and statistical significance of each difference, and request some text options.


     study = /archive/nes94
     dep = vote
     row = party
     column = sex

     difference = col(3)
     basecat = 1

     otherstats =  ser p ncases
     anova

     text
     runtitle= Test run to demonstrate batch mode

     savefile= mymeans.htm

Complex standard errors

Specify stratum and cluster variables, for complex standard errors; also request confidence intervals and a table of diagnostics.

     study = /archive/nes94
     dep = vote
     row = party
     column = sex

     stratum = stratvar
     cluster = psuvar
     otherstats =  sec ser deft rho ncases confidence
     othertables = diagnostics

     savefile= mymeans.htm

Specifying some chart options

In addition to the required two chart specifications, request charts in color (instead of grayscale) with means printed next to each bar.

     study = /sa/sdatest
     dep = vardep
     row = var1
     column = var3

     savefile = mymeans.htm

     tblproperties = /var/www/sda/tmpdir/testing.cht
     ch_url=http://sda.berkeley.edu/sdaweb/charts
     ch_color = yes
     ch_showmeans= yes

CSM, UC Berkeley/ISA
September 10, 2020