SDA 3.4 Documentation for DISCLOSURE


NAME

disclosure - Specify disclosure rules to protect confidentiality

DESCRIPTION

All of the analysis programs, including RECODE and COMPUTE, check to see if there is a file named ’disclosure.txt’ in the STUDYINF directory of the SDA datasets. If they find such a file, they enforce the disclosure specifications contained in that file.

If multiple SDA datasets are named in the ’SDADATA=’ specifications in the HARC file, only one of them (usually the main dataset) can have a ’disclosure.txt’ file. The other SDA dataset(s) (usually created to hold recoded and computed variables) must have a file named ’disc-id.txt’ in their STUDYINF directory. This ’disc-id.txt’ file should contain a single ID keyword with the format ’ID=abc’, where ’abc’ is the same ID or name used for this study in the ’disclosure.txt’ file in the main SDA dataset, as described below.


KEYWORDS

The ’disclosure.txt’ file contains specifications for the analysis. These specifications are given in the form "keyword = something" with one keyword per line. Keywords may be given in any order, either in upper or in lower case. All keywords except the ID specification are optional.

The ’disc-id.txt’ file contains only the ’ID=’ keyword, and the specified ID must match the ID in the ’disclosure.txt’ file of the main SDA dataset listed for this study in the HARC file.
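The "keyword = something" format can be illustrated with a minimal Python sketch. This is not the SDA parser: the function name 'parse_disclosure' is hypothetical, and the handling of malformed lines and of a missing ID (raising an error) is an assumption. The comment rule (blank lines and lines beginning with '#' are ignored) is the one stated in the example section of this document.

```python
def parse_disclosure(text):
    """Parse 'keyword = value' lines into a dict keyed by upper-case keyword."""
    specs = {}
    for raw in text.splitlines():
        line = raw.strip()
        # Blank lines and lines beginning with '#' are treated as comments.
        if not line or line.startswith("#"):
            continue
        if "=" not in line:
            # Assumption: a line without '=' is reported as an error.
            raise ValueError(f"not a 'keyword = value' line: {raw!r}")
        keyword, value = line.split("=", 1)
        # Keywords may be given in upper or lower case, so normalize them.
        specs[keyword.strip().upper()] = value.strip()
    # ID is the only required keyword.
    if not specs.get("ID"):
        raise ValueError("missing required ID keyword")
    return specs
```

Because keywords are stored in upper case and in a dictionary, the sketch accepts them in any order and in either case, as described above.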

The valid keywords are as follows:


Keyword       Possible Specification          Default (if no keyword)
_____________________________________________________________________


DISCLOSURE ID FOR THE STUDY

ID=           a unique identifier for the     REQUIRED
                dataset with disclosure rules
                (one word, only letters or
                 numbers)

PREVENT AN ANALYSIS FROM BEING RUN

VAREXCLUDE=   name(s) of variables that       All variables allowed
                cannot be used in analysis

COMBEXCLUDE=  pairs of variables that         All combinations allowed
                cannot be used together in
                the same analysis run
                and cannot be used at all
                to recode or compute new
                variables
                (see notes below)

MAXFILTERS=   maximum number of selection     Any number of filters OK
                filter variables that can
                be used in a single run

CONTROLVAR=   no, if control variables        A control variable is OK
                cannot be used in tables

LISTCASE=     no, if the ’listcase’ program   Listcase run is OK
                is not allowed to run

SUBSET=       no, if the ’subset’ program     Subset run is OK
                is not allowed to run



SUPPRESS THE OUTPUT AFTER RUNNING AN ANALYSIS


MINCELLN=     minimum number of cases in a    No required minimum cell N
                table cell to allow a table
                to be displayed
                (see notes below)

MINCELLWN=    minimum number of WEIGHTED      No required weighted minimum
                cases in a table cell to        cell N
                allow a table to be displayed

AVGCELLMIN=   minimum average cell size to    No required average cell N
                allow a table to be
                displayed (checks both the
                mean and the median cell size,
                excluding cells with no cases)

AVGCELLWMIN=  minimum WEIGHTED average cell   No required weighted average
                size                            cell N

MINCASEBYIVAR= for regressions, minimum ratio No limit on the number of
                of valid observations to the    independent vars
                number of independent vars

MONITORVAR=   varname, (min_values)           No special monitored vars
                (see notes below)



SUPPRESS UNWEIGHTED NUMBER OF CASES IN OUTPUT


UNWEIGHTEDN=  no                              Show unweighted N’s


NOTES ON THE OPTIONS

Disclosure ID for the Study

The disclosure ID for a study is a one-word name, consisting of letters or numbers. It can be the same as the dataset name in the HARC file, or it can be different. This disclosure ID is a mechanism to ensure that the main SDA data file for a study, and its other associated data files (for recoded or computed variables, for example), all observe the same disclosure rules. One (and only one) of the SDA datasets (referred to as the "main" SDA dataset) must have a ’disclosure.txt’ file in its STUDYINF subdirectory. Each of the other associated SDA datasets (listed together in the HARC file for this same study) must have a ’disc-id.txt’ file in its STUDYINF subdirectory, and that file must have the same ’ID=’ value as the ’disclosure.txt’ file in the main SDA dataset.
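An administrator might want to verify this arrangement before publishing a study. The following Python sketch checks a list of STUDYINF directories for exactly one 'disclosure.txt' file and for matching 'disc-id.txt' files elsewhere. The function names are hypothetical, the directory paths are supplied by the caller (the sketch does not read the HARC file), and comparing IDs as exact strings is an assumption.

```python
from pathlib import Path

def read_id(path):
    """Return the value of the ID keyword in a disclosure.txt or disc-id.txt file."""
    for raw in Path(path).read_text().splitlines():
        line = raw.strip()
        if not line or line.startswith("#") or "=" not in line:
            continue
        keyword, value = line.split("=", 1)
        if keyword.strip().upper() == "ID":
            return value.strip()
    return None

def check_study_ids(studyinf_dirs):
    """Return a list of problems found in the STUDYINF directories of one study."""
    mains = [d for d in studyinf_dirs if (Path(d) / "disclosure.txt").exists()]
    if len(mains) != 1:
        return [f"expected exactly one disclosure.txt, found {len(mains)}"]
    main_id = read_id(Path(mains[0]) / "disclosure.txt")
    if not main_id:
        return ["disclosure.txt has no ID keyword"]
    problems = []
    for d in studyinf_dirs:
        if d == mains[0]:
            continue
        id_file = Path(d) / "disc-id.txt"
        if not id_file.exists():
            problems.append(f"{d}: missing disc-id.txt")
        elif read_id(id_file) != main_id:
            # Assumption: IDs are compared as exact strings.
            problems.append(f"{d}: ID does not match {main_id!r}")
    return problems
```

An empty list means that the directories follow the rules described above.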

Variables Excluded from Recoding and Computing

Any variable named in the ’COMBEXCLUDE=’ or the ’VAREXCLUDE=’ specifications cannot be used in the RECODE or COMPUTE programs. This restriction prevents variables from being copied and then used under the new name.
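The difference between the two kinds of check can be sketched in Python. In an analysis run, a ’COMBEXCLUDE=’ entry only blocks the variables when they appear together, but for RECODE and COMPUTE every variable named in either specification is blocked individually. The function names are hypothetical, the list syntax (commas within a combination, ';' between combinations) follows the example file in this document, and treating variable names as case-insensitive is an assumption.

```python
def parse_names(value):
    """Split 'A, B' into ['A', 'B'], upper-cased for comparison."""
    return [name.strip().upper() for name in value.split(",") if name.strip()]

def parse_combinations(value):
    """Split 'A, B; C, D' into [{'A', 'B'}, {'C', 'D'}]."""
    return [set(parse_names(group)) for group in value.split(";") if group.strip()]

def analysis_violations(requested, varexclude, combexclude):
    """Return (excluded variables, excluded combinations) used in an analysis run."""
    used = {name.upper() for name in requested}
    bad_vars = sorted(used & set(parse_names(varexclude)))
    # A combination is violated only if all of its variables are requested.
    bad_combos = [sorted(c) for c in parse_combinations(combexclude) if c <= used]
    return bad_vars, bad_combos

def recode_violations(requested, varexclude, combexclude):
    """Return requested variables that RECODE and COMPUTE may not use."""
    blocked = set(parse_names(varexclude))
    for combo in parse_combinations(combexclude):
        blocked |= combo  # every member of a combination is blocked on its own
    return sorted({name.upper() for name in requested} & blocked)
```

With ’COMBEXCLUDE = RACE, GENDER; AGE, RACE’, a table of RACE by INCOME is allowed in this sketch, but a recode of RACE alone is not.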

Minimum Cell Sizes

The cells examined are the individual table cells produced by the TABLES, MEANS, or CORRTAB programs. The number of cases in each cell is evaluated against the minimum unweighted or weighted cell size specified. Cells with no cases at all are not included in the evaluation of the minimum cell size, or the average cell size, in a table.
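The rule can be expressed as a short Python sketch. The function name 'passes_min_cell' is hypothetical, and the sketch assumes that a table with no non-empty cells passes, which this document does not specify. The same check applies to weighted counts when the weighted minimum is used.

```python
def passes_min_cell(cell_counts, min_cell_n):
    """True if every non-empty cell has at least min_cell_n cases."""
    # Cells with no cases at all are left out of the evaluation.
    non_empty = [n for n in cell_counts if n > 0]
    return all(n >= min_cell_n for n in non_empty)
```

For example, with ’MINCELLN = 5’ a table with cell counts 12, 0, 7 and 5 would be displayed, while a table with counts 12, 0, 7 and 4 would be suppressed.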

Average Cell Size in a Table

The cells examined are the individual table cells produced by the TABLES, MEANS, or CORRTAB programs. If a control variable is used, the cells are examined for each separate category of the control variable. The mean number of cases and the median number of cases in the cells of a table are evaluated against the minimum unweighted or weighted average cell size specified. Cells with no cases at all are not included in the evaluation of the average cell size in a table.
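A Python sketch of this check follows. The function names are hypothetical. The sketch assumes that both the mean and the median must reach the minimum, and that a table with no non-empty cells passes; neither detail is spelled out in this document.

```python
from statistics import mean, median

def passes_avg_cell(cell_counts, avg_cell_min):
    """True if the mean and the median of the non-empty cells reach avg_cell_min."""
    non_empty = [n for n in cell_counts if n > 0]
    if not non_empty:
        return True  # assumption: nothing to evaluate, so nothing to suppress
    return mean(non_empty) >= avg_cell_min and median(non_empty) >= avg_cell_min

def passes_avg_cell_by_control(tables, avg_cell_min):
    """tables maps each control-variable category to its list of cell counts."""
    return all(passes_avg_cell(cells, avg_cell_min) for cells in tables.values())
```

Checking the median as well as the mean catches tables in which one large cell hides several small ones. Cell counts of 30, 2 and 3 have a mean above 10 but a median of 3, so the sketch rejects them under ’AVGCELLMIN = 10’.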

Monitored Variables

The ’MONITORVAR=’ option suppresses analysis results if those results are based on cases or observations that have the same value on one or more sensitive variables. These sensitive variables need not be included in the current analysis run, but their distribution is monitored nevertheless.

For example, you may not want to release analysis results based on cases that are all from the same institution (such as from the same prison). Assuming that there is a variable named ’prison’, you could specify that variable as one to be monitored.

By default the cases must come from at least two distinct categories of the monitored variable(s). However, you can specify a higher required number of categories by giving the desired number of categories in parentheses after the variable name. See the example below.
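The idea can be sketched in Python as a count of distinct values among the cases that contribute to a result. The function name 'passes_monitor' and the data layout (a list of dictionaries, one per case) are hypothetical, and skipping missing values (None) is an assumption.

```python
def passes_monitor(cases, monitored):
    """cases: list of dicts, one per case. monitored: {varname: required distinct values}."""
    for var, required in monitored.items():
        # Collect the distinct non-missing values of the monitored variable.
        distinct = {case[var] for case in cases if case.get(var) is not None}
        if len(distinct) < required:
            return False
    return True
```

The monitored variable does not have to be one of the variables in the table or regression itself. It only has to be available for the cases on which the result is based.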


MESSAGES TO DISPLAY IF AN ANALYSIS IS NOT ALLOWED

If a requested analysis is not run or if analysis output is suppressed, the user receives an explanatory message. The default messages are given below, but they can be modified by inserting revised messages in a language file. It is possible to insert an HTML link in the message, if you want the user to be able to link to some document that explains in more detail what the disclosure rules are and why they have been implemented.

The default messages, following the keyword that would be used in a language file, are as follows. Notice that one or more variable names or a number will sometimes be output after the given message.

DIS_VAREXCLUDE = To preserve confidentiality, analyses are not permitted using the following variable(s):

DIS_COMBEXCLUDE = To preserve confidentiality, analyses are not permitted using the following combination(s) of variables:

DIS_VAREXCLUDE_RECODE = To preserve confidentiality, RECODE and COMPUTE are not permitted using the following variable(s):

DIS_MAXFILTERS = To preserve confidentiality, the number of filter variables cannot be greater than:

DIS_CONTROLVAR = To preserve confidentiality, tables cannot be run with control variables.

DIS_LISTCASE = To preserve confidentiality, the LISTCASE program cannot be used with this dataset.

DIS_SUBSET = To preserve confidentiality, the SUBSET program cannot be used with this dataset.

DIS_AVGCELLMIN = To preserve confidentiality, tables cannot be displayed unless the average number of observations in each cell is at least:

DIS_AVGCELLWMIN = To preserve confidentiality, tables cannot be displayed unless the average weighted number of observations in each cell is at least:

DIS_MINCELLN = To preserve confidentiality, tables cannot be displayed unless the number of observations in each cell is at least:

DIS_MINCELLWN = To preserve confidentiality, tables cannot be displayed unless the weighted number of observations in each cell is at least:

DIS_MINCASEBYIVAR = To preserve confidentiality, regression analyses cannot be shown unless the ratio of valid observations to the number of independent variables is at least:

DIS_MONITORVAR = To preserve confidentiality, analysis results cannot be displayed for any set of observations that has only a very small number of values on certain sensitive variables. In this case the sensitive variable(s) (and the minimum required number of valid values) was:

DIS_UNWEIGHTEDN = To preserve confidentiality, only weighted N’s can be shown.


EXAMPLE OF A DISCLOSURE FILE

In the following example, note that blank lines and lines beginning with ’#’ are treated as comments, and they are ignored by the SDA programs.

# DISCLOSURE SPECIFICATIONS FOR DATA FILE

# ID FOR THIS DATASET
ID = survey25

# A. PREVENTS AN ANALYSIS FROM BEING RUN

# Completely exclude these vars from analysis and recoding/computing
VAREXCLUDE = CASEID, LOCATIONID

# Exclude these combinations of vars (separated by ’;’) from analysis
# Also exclude the individual vars from being used by the ’recode’
#  and ’compute’ programs
COMBEXCLUDE = RACE, GENDER; AGE, RACE

# Maximum number of selection filters allowed in an analysis run
MAXFILTERS = 2

# No tables with a control variable if set equal to ’no’
CONTROLVAR = no

# The LISTCASE program cannot be run if set equal to ’no’
LISTCASE = no

# The SUBSET program cannot be run if set equal to ’no’
SUBSET = no


# B. SUPPRESS ANALYSIS OUTPUT AFTER RUNNING A PROGRAM

# Required average (mean and median) cell sizes - unweighted and weighted
AVGCELLMIN = 10
AVGCELLWMIN = 200

# Required size of smallest cell - unweighted and weighted
MINCELLN = 5
MINCELLWN = 100

# Ratio of cases to number of independent vars in regression
MINCASEBYIVAR = 100

# Check for at least 2 distinct values on the variable ’INSTITUTION’
#  and at least 3 distinct values on ’CBSA’.
MONITORVAR = INSTITUTION CBSA(3)

# Suppress all unweighted N’s if set equal to ’no’
UNWEIGHTEDN = no


EXAMPLE OF A LANGUAGE FILE (’langan’) WITH EMBEDDED LINKS

In the following example, note that blank lines and lines beginning with ’#’ are treated as comments, and they are ignored by the SDA programs.

The words "preserve confidentiality" are set up to link to a file that could explain further the disclosure rules and the reasons for setting them up.

DIS_AVGCELLMIN = To preserve confidentiality, tables cannot be displayed unless the average number of observations in each cell is at least:

DIS_AVGCELLWMIN = To preserve confidentiality, tables cannot be displayed unless the average weighted number of observations in each cell is at least:

DIS_COMBEXCLUDE = To preserve confidentiality, analyses are not permitted using the following combination(s) of variables:

DIS_CONTROLVAR = To preserve confidentiality, tables cannot be run with control variables.

DIS_LISTCASE = To preserve confidentiality, the LISTCASE program cannot be used with this dataset.

DIS_MAXFILTERS = To preserve confidentiality, the number of filter variables cannot be greater than:

DIS_MINCASEBYIVAR = To preserve confidentiality, regression analyses cannot be shown unless the ratio of valid observations to the number of independent variables is at least:

DIS_MINCELLN = To preserve confidentiality, tables cannot be displayed unless the number of observations in each cell is at least:

DIS_MINCELLWN = To preserve confidentiality, tables cannot be displayed unless the weighted number of observations in each cell is at least:

DIS_MONITORVAR = To preserve confidentiality, analysis results cannot be displayed for any set of observations that has only a very small number of values on certain sensitive variables. In this case the sensitive variable(s) (and the minimum required number of valid values) was:

DIS_SUBSET = To preserve confidentiality, the SUBSET program cannot be used with this dataset.

DIS_UNWEIGHTEDN = To preserve confidentiality, only weighted N’s can be shown.

DIS_VAREXCLUDE = To preserve confidentiality, analyses are not permitted using the following variable(s):

DIS_VAREXCLUDE_RECODE = To preserve confidentiality, RECODE and COMPUTE are not permitted using the following variable(s):

SEE ALSO

harc HARC file specifications
language LANGUAGE file specifications


CSM, UC Berkeley
February 22, 2010