SDA 4.0 Documentation for HARC


NAME

HARC - HTML archive specification file

DESCRIPTION

The HTML archive specification file (HARC file) provides information on the data files and procedures available in a data archive. Although Version 4 of SDA now uses the SDA Manager to define such information and no longer uses the HARC file directly, the information in a HARC file can be imported into the Version 4 database by the SDA Manager or by the HARCimport Java program (if you have installed it).

The HARC file specifications still relevant for importing into Version 4 are described in this document. Other specifications (keywords) are ignored. A separate cross-reference document maps the specifications in the SDA Manager to the various elements of a HARC file.


OVERVIEW

The HTML archive specification file (HARC file) lists the codebooks, data files, analysis programs, and data export procedures that are available for online access from an SDA archive, and it indicates where those data files, programs, and other types of files are located. The HARC file also defines labels for the study datasets and files.

Stratum and cluster variables can be specified for a study in the HARC file, to enable the analysis programs to calculate complex standard errors. Weight variables for a study can be specified, so that users can select a weight from a drop-down menu, instead of having to enter the name of a weight variable on the option screen.


HARC FILE LAYOUT

The HARC file is an ASCII file that is laid out in various sections. Each section begins with a section title in square brackets ([ ]). Within each section, specifications are usually given in the form "keyword = something" with one keyword per line. Lines beginning with a pound sign (#) are interpreted as comments. Blank lines are ignored.

There were six possible section headings in the HARC file in SDA version 3. Three of those sections are ignored by the SDA Manager (LABELS, HEADER, and FOOTER). The three sections that are converted are the following:

[GENERAL]
Specification of overall options.
These will be input by the SDA Manager into the ’Global Options’ for a group of datasets.

[PROGRAMS]
Location of analysis programs; and a list of the programs to be made available.
These will also be input by the SDA Manager into the ’Global Options’.

[DATASETS]
Datasets available; names, locations, options.
The SDA Manager will use this information to create and configure each of the defined datasets.

The general layout is as follows:

[GENERAL]
keyword = something
keyword = something

[PROGRAMS]
keyword = something
keyword = something

[DATASETS]
keyword = something
keyword = something
*
keyword = something
keyword = something

The names of sections and the keywords can be given in either upper or lower case, but they may not be abbreviated. The first section should be the [GENERAL] section. The other sections can be given in any order. However, it is a good idea to put the [DATASETS] section last, to facilitate adding datasets to the HARC file.

Keywords within a section can be given in any order. However, within the [DATASETS] section the keywords applicable to a specific study must be grouped together and be separated by an asterisk from the specifications for another study.
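
As a brief illustration of these layout rules, the following sketch shows comment lines, blank lines, lower-case section names and keywords, and the asterisk that separates two studies. The dataset names, labels, paths, and URLs are hypothetical placeholders, not real locations.

```
# Comment lines begin with a pound sign; blank lines are ignored.
[general]

[programs]
sdaprogs = tables means

[datasets]
# Keywords for the first study (hypothetical names and locations)
dataset = study1
datalabel = First Example Study
codebook = http://www.example.org/study1/Doc/s1.htm
sdadata = /data/study1/sda
subgrpinfo = /data/study1/Doc/s1sub.txt
*
# Keywords for the second study
dataset = study2
datalabel = Second Example Study
codebook = http://www.example.org/study2/Doc/s2.htm
sdadata = /data/study2/sda
subgrpinfo = /data/study2/Doc/s2sub.txt
```

The individual keywords used in this sketch are described in the sections that follow.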


SPECIFICATIONS WITHIN EACH SECTION

Each section of the HARC file contains specifications appropriate to that section. Most sections contain specifications in the form of "keyword=something," where "something" is either an option specification or the name of a PATH or a Uniform Resource Locator (URL).

If the specification is a PATH, it must be a full pathname on the server computer such as: /bravo2/bravo3/sda

If the specification is a URL, it must be a complete one such as: http://socrates.berkeley.edu/mydata.html
In principle, a URL can refer to a location on any World Wide Web server. However, the checking for valid URLs done by the SDA debugger will only work within the same domain as the local server.

A slash (/) at the end of a PATH or a URL can be used if the referenced location is a directory (and not a specific file). However, this use of a final slash is optional.
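
For instance, the two forms of the SDADATA line sketched below would refer to the same directory, since the final slash is optional. The path and the URL are hypothetical placeholders.

```
# Full pathname of a directory, written with the optional final slash
SDADATA = /data/study1/sda/

# The same directory without the final slash (shown here as a comment)
# SDADATA = /data/study1/sda

# Complete URL of a specific file, so no final slash applies
CODEBOOK = http://www.example.org/study1/Doc/s1.htm
```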


[GENERAL] Section Keywords

Possible GENERAL keywords are grouped into the following sections: Basic Keywords, Subsetting, and Charts.


Basic Keywords


Keyword       Possible Specification          Default (if no keyword)
_____________________________________________________________________


MAXLISTCASE=  Maximum number of CASES         Limit lists to 500
                to list (for ’listcase’)         cases

MAXLISTVARS=  Maximum number of VARIABLES     Limit lists to 500
                to list in an OUTSTUDY          variables
                before a warning

DUMMYGENMAX=  A number between 1 and 100      Max of 25 dummy vars can be
               (max dummy vars for REGRESS      generated by the "m:" syntax
                and LOGIT)                      for a single categorical var

XMEANS=       YES (to get special output--    No special output in
                average differences)            MEANS program

BATCHSAVEDIR= PATH of directory into which    No batch command files
                to copy the batch command       saved
                files for the analysis
                programs before they are
                deleted

LANGUAGE=     PATH to directory with          Use built-in English
                alternate language files        messages and menus
                for analysis output
                (File named ’langan.txt’
                 will be imported.)


Subsetting


Keyword       Possible Specification          Default (if no keyword)
_____________________________________________________________________

SUBMAXVARS=   Maximum number of variables     Limit subsets to 1000
                to allow in a subset            variables


Charts


Keyword       Possible Specification          Default (if no keyword)
_____________________________________________________________________


MAXCHARTS=    Maximum number of charts per    Maximum of 25 charts
                tables or means request
               (A number between 1 and 100)

CHARTFONT=    Name of font to use in charts   SanSerif
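
As a small sketch, the subsetting and chart limits could be tightened as follows. The values are arbitrary illustrations. 'CHARTFONT=' is left out because the accepted font names are not listed in this document.

```
[GENERAL]

# Limit subsets to 250 variables (the default is 1000)
SUBMAXVARS = 250

# Produce at most 10 charts per tables or means request (the default is 25)
MAXCHARTS = 10
```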



[PROGRAMS] Section Keywords

The ’SDAPATH=’ keyword is ignored in a HARC file by SDA version 4.
When you import global specifications from a HARC file, the SDA Manager will use the default location for the SDA version 4 programs. This location can be changed in the SDA Manager by editing the global specifications, if you wish. But that location cannot be changed by importing a HARC file.

The ’SDAPROGS=’ keyword indicates which SDA analysis programs are to be made available for the datasets specified in this HARC file. The currently available programs are: tables, means, correl, corrtab, regress, logit, listcase, recode, compute, listvars or listvars(delete).

Note the difference between specifying ’listvars’ or ’listvars(delete)’. If you only specify ’listvars’, the user will be able to list the newly created variables but will not be able to delete them. This will protect the created variables from being deleted, but it will also prevent users from deleting variables that were created erroneously.

Since the ’listcase’ program provides access to individual-level data, it may not be appropriate for sensitive datasets. If ’listcase’ is suppressed for a dataset by a disclosure file, it is best to give that dataset global options whose list of available SDA programs does not include ’listcase’. Otherwise, an attempt to use ’listcase’ will generate an error message.
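
A [PROGRAMS] section for sensitive datasets might therefore look like the following sketch, which lists the analysis programs named above but omits ’listcase’.

```
[PROGRAMS]

# 'listcase' is deliberately left out of the list
SDAPROGS = tables means correl corrtab regress logit
```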

For information on the interactive use of each program, see the online help file for analysis programs or the online help file for creating new variables. For information on the batch command files for each program, see the index to the SDA Manual pages.


Keyword       Possible Specification          Default (if no keyword)
_____________________________________________________________________

SDAPROGS=     Analysis programs to provide    REQUIRED
                (tables, means, etc.)

To enable users to create new variables in a dataset, it is necessary, but not sufficient, to mention ’recode’ and ’compute’ in the list of programs. Each dataset for which new variables can be created must also specify where the new variables are to be stored, using the ’OUTSTUDY=’ keyword in the [DATASETS] section.

The availability of the ’subset’ procedure for a particular dataset does not depend on this list of SDA programs. Rather, that availability is assumed, unless you specify ’SUBSET=NO’ for a particular dataset in the datasets section of the HARC file.



[DATASETS] Section Keywords

The following keywords are repeated for each study that will be made available for online browsing, analysis, subsetting or downloading. Keywords for each study are grouped together and separated from other studies by an asterisk (*) on a line by itself.

Possible keywords for each dataset are grouped into the following sections: Basic Keywords, Multiple Codebooks, Weights, Standard Errors, Subsetting, and Downloading.


Basic Keywords


Keyword       Possible Specification          Default (if no keyword)
_____________________________________________________________________

DATASET=      ID or name of the study (one    REQUIRED
                word, only letters or numbers)

DATALABEL=    Label of study to appear on     REQUIRED
                menus (one line)

CODEBOOK=     URL of homepage for HTML        REQUIRED
                codebook or documentation
                (may be repeated)

SDADATA=      PATH of SDA dataset             REQUIRED
                directory
                (may be repeated)

OUTSTUDY=     PATH of SDA dataset for         No recodes or computed
                newly created variables         variables can be stored
                                                (but see below)


The ’VARCASE=’ specification is ignored on Windows servers.

VARCASE=      LOWER or UPPER                  Variable names entered on
                (names of variables entered     option screens must match
                 on option screens will be      the case of the variables
                 converted automatically to     stored in the dataset
                 the specified case)
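
As a sketch, the fragment below assumes a dataset whose variables are stored with lower-case names. The dataset name is hypothetical, and the other required keywords are omitted for brevity.

```
DATASET = study1

# Convert variable names typed on the option screens to lower case,
# so that users may enter names in either case
VARCASE = LOWER
```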

Notes on Basic Keywords


Multiple Codebooks

If there is only one HTML codebook for a study, use the basic ‘CODEBOOK=’ keyword described above.

However, SDA allows each study to have multiple HTML codebooks. For example, a codebook stratified by year or region could be set up, in addition to the basic unstratified codebook. The user can select one of the codebooks to view at a time.

For each codebook, provide the URL of the main codebook HTML file, together with an appropriate label for the codebook (in parentheses). There can be as many codebooks as you wish. Each codebook should be created in a separate directory, in order to avoid filename conflicts.

If no label is provided for a codebook, the label for the first codebook will be ‘Default’, and the label for the others will be ‘Alternative’. Those labels are not very helpful, so it is much better to include more descriptive labels.

See an example of multiple codebook definitions in example 2.


Keyword       Possible Specification          Default (if no keyword)
_____________________________________________________________________

CODEBOOK=     URL for Codebook #1 (label)     Required for codebook
CODEBOOK=     URL for Codebook #2 (label)     Required for codebook


Weights

Appropriate weight variables, and a label for each one, can be specified for a study. If no weights are specified, the user can still enter the name of a weight variable on the option screen.

If more than one weight variable is specified, they are presented to the user as a drop-down list on the option screen for each analysis program. The first one listed in the HARC file is the default weight; but the user may select one of the other available weights from the drop-down list.

One of the weight options listed in the HARC file can be the option NOT to use a weight. This is specified as ‘##none’. An optional label can be given for this option; for example ‘##none(Do not use a weight)’. The default label is ‘(No weight)’.

A set of weight variables is specified in the dataset definition in example 1.

Note that the user can be forced to use a specific weight on every analysis run. If only one weight variable is specified in the HARC file for a study, and if the ‘##none’ option is not provided, the specified weight is used automatically on every analysis run.


Keyword       Possible Specification          Default (if no keyword)
_____________________________________________________________________

WEIGHT=       wtvar1 (label) wtvar2 (label)   Required for drop-down weights
WEIGHT=       ##none (label)

Multiple weight variables and labels can be defined on a single line. Alternatively the ‘WEIGHT=’ keyword can be repeated for additional specifications of weight variables and labels.
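
To illustrate the forced-weight case described above, the following sketch specifies a single weight and no ‘##none’ option. The variable name and label are placeholders.

```
# Only one weight is defined and '##none' is not offered, so
# 'finalwt' is used automatically on every analysis run.
WEIGHT = finalwt(Final weight)
```

Adding ‘##none’ or a second weight variable to this specification would give the user a choice again.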


Standard Errors

If calculations of complex standard errors are to be enabled for a study, the stratum and/or cluster variables must be specified. This is done with a ‘design=’ keyword. The method of calculating standard errors depends on whether a stratum variable only, a cluster variable only, or both variables are specified.


Keyword       Possible Specification          Default (if no keyword)
_____________________________________________________________________

DESIGN=       STRATUM(var1) CLUSTER(var2)     Both variables are defined

DESIGN=       CLUSTER(var2)                   Clusters will be paired into strata

DESIGN=       STRATUM($1)   CLUSTER(var2)     Clusters all in one stratum

DESIGN=       STRATUM(var1)                   Only strata, no clusters

DESIGN=       STRATUM(var1) XREGRESSION       Default is SRS for REGRESS and LOGIT

If only a cluster variable is defined, the default procedure is to combine pairs of consecutive clusters (by cluster number) into strata, for purposes of calculating standard errors. (See example 2.)
Alternatively, you can force all the clusters to remain in a single stratum by specifying the name of the stratum variable as ’$1’.

See the document on calculating standard errors for more details.

Complex standard errors are computed by default for each analysis using the TABLES, MEANS, REGRESS, and LOGIT programs if a stratum and/or a cluster variable is defined for the dataset. The user, however, may force the calculation of SRS standard errors, effectively assuming that the sample is a simple random sample (SRS), by selecting that option on each program option page.

The calculation of complex standard errors can require a substantial amount of computer time when analyzing a large dataset using REGRESS and especially LOGIT. Therefore, the archive can override the usual default for those programs and make SRS the default for REGRESS and LOGIT. To do that, add ’XREGRESSION’ to the specifications after ’DESIGN=’ in the HARC file. Note that users will still be able to request complex standard errors if they wish, but they should not be surprised by delays in receiving results if they do so.
(See example 3.)
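
Two of the forms in the table above do not appear in the numbered examples: clusters kept in a single stratum, and strata without clusters. They can be sketched as follows. The dataset and variable names are placeholders, and the other required keywords are omitted for brevity.

```
DATASET = study1
# Clusters only, all kept in a single stratum
DESIGN = STRATUM($1) CLUSTER(psuvar)

*

DATASET = study2
# Strata only, no clusters
DESIGN = STRATUM(stratvar)
```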


Subsetting

The next keywords are used to enhance or to suppress the customized subsetting of variables and/or cases. The file with information on groups of variables (which is required in SDA Version 4) enables the user to select entire groups of variables, instead of having to specify all desired variables one by one. That group-information file is generated automatically whenever the XCODEBK program produces HTML codebook files. It has the name ‘Xsub.txt’, where ‘X’ is the root name of the HTML codebook files. The default name of that file is ’hcbksub.txt’.

The ’SUBSET=NO’ specification, on the other hand, will suppress the option for creating a customized subset for this particular dataset.


Keyword       Possible Specification          Default (if no keyword)
_____________________________________________________________________

SUBGRPINFO=   PATH of file with info on       REQUIRED
                groups of variables
               (from codebook program)

SUBSET=       NO (to suppress ’subset’ from   Allow subsetting
                the user interface for this
                dataset)

If you want to be absolutely sure that the subset procedure is not available for a particular dataset, you should set up a disclosure file for that dataset and include the specification "subset=no" in that disclosure file. Otherwise, even though the subset option may not be presented to the user, it remains possible to run the subset procedure in batch mode.

Downloading

The next two keywords are used to specify data and documentation files that have been created ahead of time (that is, they are not custom-made on the fly by the ’subset’ procedure) and are available for downloading. These keywords are usually used in pairs -- a heading, followed by the full Pathname (not a URL) of a file available for downloading. The Pathname itself can also be followed by a label in parentheses; that label will appear on the selection screen next to the file name. (See example 3.)

If a ’DLHEADING’ specification immediately precedes a ’DLFILE’ specification, that heading will be imported to SDA version 4.0 as a label for the downloadable file. If BOTH a ’DLHEADING’ and a file label in parentheses are given, both will be imported as the file label.

Note that the file given as a Pathname should preferably have a suffix of ’.txt’, if it is a text file and if users are to be able to view the file in a browser as well as to save it.


Keyword       Possible Specification          Default (if no keyword)
_____________________________________________________________________

DLHEADING=    Heading or label for a file     No heading

DLFILE=       Pathname of file to download    No file available
                (any kind of file)

DLFILE=       Pathname (optional label)       No file available

EXAMPLES OF HARC FILES

  1. Codebooks and Basic Analysis
  2. Multiple Codebooks, Analysis, and Subsetting
  3. Codebooks, Analysis, and Downloading
  4. Allow the Creation and Deletion of New Variables
  5. Different Options for Different Datasets

1. CODEBOOKS AND BASIC ANALYSIS


[GENERAL]


[PROGRAMS]

SDAPROGS = tables means correl corrtab regress logit listcase

[DATASETS]

DATASET = nes92c
DATALABEL = NES 1952-1992 Cumulative Datafile

# For the standard SDA interface to function, the codebook must have
# been created with version 3 (or later) of the ’xcodebk’ program.
CODEBOOK = http://socrates.berkeley.edu/sdadocs/NES92C/n92c.htm

SDADATA = /bravo3/NES/nes52-92.cum/

WEIGHT = sampwt(Sampling wt)
WEIGHT = pswt(Post-stratification wt)
WEIGHT = ##none

DESIGN = STRATUM(stratvar) CLUSTER(psuvar)

*

DATASET = capums1
DATALABEL = 1990 Census - California 1% Sample

SDADATA = /bravo3/capums1/
WEIGHT = houswgt(Household weight) pwgt1(Person weight) ##none(No weight)

2. MULTIPLE CODEBOOKS, ANALYSIS, AND SUBSETTING


[GENERAL]



# Limit subsets to 100 variables
SUBMAXVARS = 100

[PROGRAMS]

SDAPROGS = tables means correl corrtab regress logit listcase

[DATASETS]

# If customized subsetting is to be enabled, include the following:
#
#   SUBGRPINFO= FULL PATHNAME of file with info on groups of variables
#                (produced by codebook program, if there are headings;
#                 use is required in SDA 4.0)
#
# For multiple codebooks, include a URL and (label) for each:
#
#   CODEBOOK=   URL for codebook #1 (label for codebook #1)
#   CODEBOOK=   URL for codebook #2 (label for codebook #2)
#

DATASET = gss
DATALABEL = GSS 1972-2004 Cumulative Datafile
SDADATA = /bravo3/GSS/sda

# Define a cluster variable for this dataset, to calculate
#  complex standard errors

DESIGN = cluster(sampcode)


CODEBOOK = http://sda.berkeley.edu/GSS/Doc/GSS.htm (Standard Codebook)
CODEBOOK = http://sda.berkeley.edu/GSS/Docyr/GSYR.htm (Codebook by Year)

SUBGRPINFO= /bravo3/GSS/Doc/GSSsub.txt

*
DATASET = multi
DATALABEL = 1994 Multi-Investigator Study
CODEBOOK = http://socrates.berkeley.edu/Multi/Doc/mult.htm
SDADATA = /bravo3/Multi/sda

SUBGRPINFO= /bravo3/Multi/Doc/multsub.txt



3. CODEBOOKS, ANALYSIS, AND DOWNLOADING


[GENERAL]


[PROGRAMS]

SDAPROGS = tables means correl corrtab regress logit listcase

[DATASETS]

# If files are to be available for downloading, include the following:
#
#   DLHEADING=  Heading for a file
#
#   DLFILE=     Pathname of file to download, followed by an optional
#                  label (given in parentheses)
#                Note that the Pathname should best have a suffix
#                  of ’.txt’, if users are to be able to view
#                  the file as well as to save it.
#
#               (Many of these headings and Pathnames can be given.)

DATASET = nes2004c
DATALABEL = NES 1952-2004 Cumulative Datafile
CODEBOOK = http://sda.berkeley.edu/NES2004C/n04c.htm
SDADATA = /bravo3/NES2004C/nes52-04.cum/

# Define stratum and cluster variables for this dataset, but SRS is the
#  default for the regression and logit/probit programs

DESIGN = stratum(stratcode) cluster(psucode) xregression


# The following keywords specify files available for downloading.
# Notice the optional labels in parentheses after some Pathnames.

DLHEADING = DATA FILES
DLFILE = /socrates.berkeley.edu/DL/NESdat.txt (Plain ASCII file)
DLFILE = /socrates.berkeley.edu/DL/NESdat.zip (Zipped file for PC’s)
DLHEADING = SAS definition file
DLFILE = /socrates.berkeley.edu/DL/NESsas.txt
DLHEADING = SPSS definition file
DLFILE = /socrates.berkeley.edu/DL/NESspss.txt
DLHEADING = DDL file
DLFILE = /socrates.berkeley.edu/DL/NESddl.txt (Plain ASCII file)
DLHEADING = Microsoft Word Codebook ready to be printed
DLFILE = /socrates.berkeley.edu/DL/NEScdbk.doc
DLHEADING = Set of HTML codebook files
DLFILE = /socrates.berkeley.edu/DL/NEShtml.zip (Zip file)


4. ALLOW THE CREATION AND DELETION OF NEW VARIABLES


[GENERAL]

[PROGRAMS]

SDAPROGS = tables means correl corrtab regress logit listcase

SDAPROGS = recode, compute, listvars(delete)
# To allow variables to be created but not deleted, specify:
# SDAPROGS = recode, compute, listvars

[DATASETS]

DATASET = nes92c
DATALABEL = NES 1952-1992 Cumulative Datafile
CODEBOOK = http://socrates.berkeley.edu/sdadocs/NES92C/n92.htm
WEIGHT = sampwt(Sampling weight) finalwt(Final weight) ##none(No weight)
DESIGN = STRATUM(stratvar) CLUSTER(psuvar)
SDADATA = /bravo3/NES/nes52-92.cum/

OUTSTUDY = /bravo3/NES/nes52-92.cum/newvars

*

DATASET = capums1
DATALABEL = 1990 Census - California 1% Sample
CODEBOOK = http://socrates.berkeley.edu/sdadocs/CENSUS/pums.htm
WEIGHT = houswgt(Household weight) pwgt1(Person weight) ##none(No weight)
SDADATA = /bravo3/capums1/

OUTSTUDY = /bravo3/capums1/newvars


5. DIFFERENT OPTIONS FOR DIFFERENT DATASETS


[GENERAL]

[PROGRAMS]

SDAPROGS = tables means correl corrtab regress logit listcase
SDAPROGS = recode compute listvars(delete)

[DATASETS]

# For this dataset, enable browsing of the codebook, online analysis,
#   and creation of new variables.
# No subsetting or downloading is allowed.

DATASET = gss04
DATALABEL = GSS 1972-2004 Cumulative Datafile
CODEBOOK = http://socrates.berkeley.edu/GSS/HTMLBOOK/gss.htm
SDADATA = /bravo3/docs/GSS

OUTSTUDY = /bravo3/docs/GSS/newvars

*

# For this dataset, allow browsing of the codebook,
#   online analysis and downloading.
#   No subsetting is allowed, because the
#   ’SUBSET=NO’ keyword is included.

DATASET = multi
DATALABEL = 1994 Multi-Investigator Study
CODEBOOK = http://socrates.berkeley.edu/Multi/Doc/mult.htm
SDADATA = /bravo3/docs/Multi
SUBSET = NO

DLHEADING = ALL OF THE FOLLOWING FILES ARE PLAIN ASCII FILES
DLHEADING = Data file (616 K)
DLFILE= /socrates.berkeley.edu/Multi/DL/multidat.txt
DLHEADING = SAS definition file
DLFILE= /socrates.berkeley.edu/Multi/DL/multisas.txt
DLHEADING = SPSS definition file
DLFILE= /socrates.berkeley.edu/Multi/DL/multisps.txt
DLHEADING = DDL definition file
DLFILE= /socrates.berkeley.edu/Multi/DL/multiddl.txt

*

# For this dataset, enable all options:
#  codebook, online analysis with complex standard errors,
#  creation of new variables, pre-defined weights,
#  customized subsetting, and downloading of pre-existing files.

DATASET = natlrace
DATALABEL = 1991 Race and Politics Survey

CODEBOOK = http://socrates.berkeley.edu/Natlrace/Doc/race.htm

SDADATA = /bravo3/docs/Natlrace
DESIGN = stratum(stratvar) cluster(psunum)

OUTSTUDY = /bravo3/docs/Natlrace/newvars

WEIGHT = sampwt(Sampling wt)
WEIGHT = pswt(Post-stratification wt)
WEIGHT = ##none

SUBGRPINFO= /bravo3/docs/Natlrace/Doc/racesub.txt

DLHEADING = ALL OF THE FOLLOWING FILES ARE PLAIN ASCII FILES
DLHEADING = Data file (936 K)
DLFILE= /socrates.berkeley.edu/Natlrace/DL/racedat.txt
DLHEADING = SAS definition file
DLFILE= /socrates.berkeley.edu/Natlrace/DL/racesas.txt
DLHEADING = SPSS definition file
DLFILE= /socrates.berkeley.edu/Natlrace/DL/racespss.txt
DLHEADING = DDL definition file
DLFILE= /socrates.berkeley.edu/Natlrace/DL/raceddl.txt

SEE ALSO

DDL Data Description Language
HARCimport Import HARC file into Version 4 SDA database
internationalization Using Non-English languages in SDA
sdalog Generate a Report of SDA Usage
sdamanager SDA Manager


CSM, UC Berkeley/ISA
January 19, 2016