Online Help for Customized Subsetting - SDA 4.1

This file contains the online help available from inside the SDA subset program.

Subset Data File Characteristics
- Type of data file
- Order of variables on the file
Codebook for Subset Data
Data Definitions for Statistical Programs
Filter(s) for Selecting Cases
- Numeric filter variables
  - Open-ended Ranges using '*' and '**'
- Character variables as filters
Specifying Variables to Include in the Subset
- Specify individual variable names
- Select variables from groups
  - To see what is in a group
  - To select variables from a group

Data File to Produce

Data File Characteristics

The data file produced by the subsetting procedure is a simple text file (an ASCII file). The file has one line (record) per case, and each variable is placed into a fixed set of columns.

The values of each variable are placed into adjacent columns in the data record. By default, there are no extra blanks or commas between adjacent variables. As an option, the user may specify that adjacent variables be separated either by a blank space or by a comma.

Type of data file

The data file can be one of the following:

- Text file, with no added blanks between variables
- Text file, with a blank inserted between each variable
- CSV (Comma Separated Values) file, with a comma inserted between each variable, and a header record containing the names of the variables

Having a blank or a comma between variables may facilitate reading of the data file by database or spreadsheet programs (like Excel). Note that delimiters do not matter for SAS, SPSS, Stata, SDA, or other statistical packages. They can read a data file in fixed-column format either with or without delimiters between variables.
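If you want to read the data file with your own program rather than a statistical package, the difference between the fixed-column and CSV formats can be sketched as follows in Python. The variable names and column positions in LAYOUT are hypothetical; the real positions are the column locations listed in the codebook for your subset.

```python
import csv
import io

# Hypothetical layout: (name, first column, last column), 1-based and inclusive.
# Replace these with the column locations given in your codebook.
LAYOUT = [("age", 1, 2), ("gender", 3, 3), ("educ", 4, 5)]

def parse_fixed(line, layout=LAYOUT):
    """Slice one fixed-column record (no delimiters) into raw string fields."""
    return {name: line[start - 1:end] for name, start, end in layout}

def parse_csv(text):
    """Read CSV output; the header record supplies the variable names."""
    return list(csv.DictReader(io.StringIO(text)))
```

With the fixed-column format the program must know the layout in advance; with the CSV format the header record carries the names, which is why spreadsheet programs handle it more easily.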

Order of variables on the file

If individual variables have been specified in the text box, those variables will come first, in the order specified. The remaining variables will be output in the same order found in the variable tree and codebook.
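The ordering rule can be illustrated with a short Python sketch. The function and argument names are hypothetical, and dropping a variable that is both named individually and selected from a group is an assumption; the help text does not say how such duplicates are handled.

```python
def output_order(named, codebook_order, selected_from_groups):
    """Named variables first (in the order given), then the rest in codebook order."""
    seen = set()
    ordered = []
    # Individually specified variables keep the order in which they were typed
    for name in named:
        if name not in seen:
            seen.add(name)
            ordered.append(name)
    # Remaining selected variables follow in codebook order
    chosen = set(selected_from_groups)
    for name in codebook_order:
        if name in chosen and name not in seen:
            seen.add(name)
            ordered.append(name)
    return ordered
```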

Codebook for Subset Data

A codebook can be generated that documents the variables included in the subset data file. For each variable, the codebook typically includes:

- the name and the long label of the variable
- the text of the question that produced the variable
- the category codes and labels
- the column location of the variable in the data record
- certain properties of the variable such as the number of decimal places and the missing-data codes (if any)

A table of contents listing the name and long label of each variable is included at the front of the codebook.

The codebook is a text file formatted for printing. It should be printed with a fixed-width (non-proportional) font such as Courier. Since the codebooks generally use between 70 and 80 characters per line, a 10-point font is a good choice.

The codebook file has a new-page character (form-feed) every 60 lines. Note, however, that not all printers will recognize that character and skip to a new page. In such cases you may have to hand-edit the file to produce the page breaks you want.
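As an alternative to hand-editing, the page breaks can be redone with a small script. The following Python sketch removes the existing form-feed characters and inserts a page break of your choice at a fixed interval. It assumes the form-feed is the standard ASCII character (written "\f" in Python); the function name and its defaults are illustrative only.

```python
def repaginate(codebook_text, lines_per_page=60, page_break="\f"):
    """Drop existing form-feeds and insert a page break every lines_per_page lines."""
    # Remove form-feeds first, because splitlines() would treat them as line ends
    lines = codebook_text.replace("\f", "").splitlines()
    pages = [lines[i:i + lines_per_page] for i in range(0, len(lines), lines_per_page)]
    return page_break.join("\n".join(page) + "\n" for page in pages)
```

For a printer that ignores form-feeds, you might pass a different page_break string, such as a line of dashes, and cut the pages by hand.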

Data Definitions for Statistical Programs

SAS data definitions

The SAS data definitions include all the necessary information to create a SAS data library for the subset of variables selected on this run. Before using this file on your computer, however, you must add two names: the name of the subset data file to be used as input, and the name of the directory on your computer for the data library. The place to put those names is clearly indicated near the top of the file of data definitions.

The SAS data definitions include the 'DATA IN' command, with location information, and the 'LABEL' section with variable labels. There is also a 'PROC FORMAT' section with category labels; the corresponding 'FORMAT' command associates each label with a variable. There is also a section of 'IF' statements, which set missing-data and out-of-range values of each variable to the SAS missing-data code '.' (a period). A 'PROC DATASETS' command is added after the data definitions; this will generate a list of variables created by SAS.

SPSS data definitions

The SPSS data definitions include all the necessary information to create an SPSS system file for the subset of variables selected on this run. Before using this file on your computer, however, it will be necessary to add the name of the subset data file on your computer to be used as input to this run. The name is put after the 'DATA LIST' command in the place indicated in the file; replace the 'x' with the name you gave the data file. For example, if you saved the data file as 'mydata.txt' in the directory 'C:\mywork', replace 'x' in the SPSS data definitions with "C:\mywork\mydata.txt". It is best to specify the entire pathname for the data file, and to put it in double quotes, so that SPSS can find it on your computer.

If you will be running SPSS under Windows, there is a simple way to use the file of SPSS syntax commands: First, change the extension of this file from '.txt' (with which it was downloaded) to '.sps'. Then in Windows, simply double-clicking on the '.sps' file will open SPSS and put this file in a syntax window. You can edit this file so that it contains the name of the data file that you saved. Note that you should include the entire pathname for the data file (for example: C:\mywork\mydata.txt) so that SPSS can find it. Then highlight the whole syntax file and click 'run'. This generates the system file. Switch to the data window to view it. If you get a warning about an obsolete specifier or 'set' command, just ignore it. You can then proceed to analyze the data.

The SPSS data definitions generated by the subset procedure include the 'DATA LIST' command, with location information, the 'VARIABLE LABEL' section with variable labels, and the 'VALUE LABELS' section with category labels. There is also a section with 'MISSING VALUES' statements, which may define certain values of each variable as missing-data values. There may also be some 'IF' statements which set out-of-range values of each variable to the SPSS system-missing code.

To create a system file, it will be necessary to specify the name to assign to the system file; that name is put in the 'SAVE' command in the indicated place near the end of the file. Replace 'y' with the file name you want. The 'SAVE' command at the end of the file includes the 'MAP' option, in order to generate a list of variables created by SPSS. The 'COMPRESSED' option is also specified for the system file, since many users prefer to save disk space; that line can be removed, if you prefer to save the file as an uncompressed file.

Note that if you are running SPSS under Windows, you can create a '.sav' file interactively. However, you may find that a file named 'y' has been saved on your disk (probably in the 'C:\Program_Files\SPSS' directory), unless you delete the 'SAVE' command from the SPSS syntax file before you click on 'run'.

Stata data definitions

The Stata data definitions include all the necessary information to create a Stata system file for the subset of variables selected on this run. However, you will have to do two things to use the file generated by the subset routine:

Split the file into two files, using an editor.
- The first part of the file generated by the subset program is a Stata do file, which is a file of commands for Stata to execute. Save this part of the file in a new file with any name you like; it should have the suffix '.do'.
  The 'do file' contains category labels and missing-data codes.
- The second part of the file generated by the subset program is a Stata dictionary file (the part beginning with 'dictionary using Y'). Save this second part of the file in another file with any name you like; it should have the suffix '.dct'.
  The 'dictionary file' contains, for each variable, the type, input format, and label.
Insert actual file names into the files you have created.
- The 'do' file refers to the dictionary file as 'X' on the line containing 'infile using X'. Change the 'X' to the name of the dictionary file you created (from the second part of the file generated by the subset program).
- The first line of the 'dictionary' file contains 'dictionary using Y'. Change the 'Y' to the name of the text (ASCII) data file generated by the subset program; this will be the name you used when you saved the data file.

Once you have broken up the file into two parts and inserted the appropriate file names, you can use the files within Stata. For example, if you gave your do-file the name 'myfile.do', you would start up Stata and give the command 'do myfile'. Stata would then execute the commands in 'myfile.do' and set up the variables as a Stata data file. You could then run any of the available analyses.
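If you prefer not to split the file by hand in an editor, the two steps described above can be approximated with a short Python script. This is only a sketch: it assumes the dictionary part starts at the first line beginning with 'dictionary using', and that the placeholders appear exactly as 'infile using X' and 'dictionary using Y'. The function names are hypothetical.

```python
def split_stata_definitions(text):
    """Split the combined definitions into (do_part, dictionary_part)."""
    lines = text.splitlines(keepends=True)
    for i, line in enumerate(lines):
        # Assumption: the dictionary part begins at this marker line
        if line.lstrip().lower().startswith("dictionary using"):
            return "".join(lines[:i]), "".join(lines[i:])
    raise ValueError("no line starting with 'dictionary using' found")

def fill_in_names(do_part, dct_part, dct_name, data_name):
    """Replace the placeholders X and Y on their marker lines only."""
    do_part = do_part.replace("infile using X", f"infile using {dct_name}", 1)
    dct_part = dct_part.replace("dictionary using Y", f"dictionary using {data_name}", 1)
    return do_part, dct_part
```

You would then write the first result to a file ending in '.do' and the second to a file ending in '.dct', using the same dictionary file name that you passed as dct_name.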

XML definitions for DDI-Codebook

The Data Documentation Initiative (DDI) is a standard for documenting data files using XML. The subset procedure can generate documentation for your subset using the conventions of DDI-Codebook. (This is different from the more complex DDI-Lifecycle version.)

For more information on the DDI, see the main DDI Web site.

DDL definitions for SDA

The SDA metadata format is called DDL (Data Description Language). An SDA DDL file contains information on each variable written in SDA's own metadata syntax. The SDA DDL file, together with the subset data file, can be used to generate an SDA dataset.

A DDL file is produced automatically by the subset procedure, even if you do not request that it be sent to you. The DDL file is the basic source of documentation for the subset. The codebook and the data definitions for SAS, SPSS, Stata, and DDI are all derived from the SDA DDL file.

For more information on the content and format of a DDL file, see the SDA manual page for DDL.

Filter(s) for Selecting Cases

If no filter variables are specified, the subset data file will include a record for every case in the original data file. If you want to include only a subset of the cases in the full data file, you must specify one or more filter variables and indicate which codes of those variables to include.

Numeric variables as selection filters

Basic filter use
The name of each filter variable is followed, in parentheses, by a single value such as 'gender(2)' or a range of codes such as 'age(30-50)', to limit the subset to cases having those codes.

Multiple ranges and codes may be specified.
For example: age(1-17, 25, 95-100)

Multiple filter variables
If you specify more than one filter variable, a case must satisfy ALL of the conditions in order to be included in the subset.
For example: gender(1), age(30-50)
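The logic of these filters can be sketched in Python as follows. This is an illustration of the rules, not the SDA implementation: the function names are hypothetical, negative codes and missing-data codes are not handled, and a case is represented as a dictionary of numeric values.

```python
import re

# A filter is a variable name followed by codes and ranges in parentheses
FILTER = re.compile(r"(\w+)\s*\(([^)]*)\)")

def parse_filters(spec):
    """Turn 'gender(1), age(30-50)' into {'gender': [(1.0, 1.0)], 'age': [(30.0, 50.0)]}."""
    filters = {}
    for name, body in FILTER.findall(spec):
        ranges = []
        for part in body.split(","):
            low, dash, high = part.strip().partition("-")
            # A single code is treated as a range whose ends are equal
            ranges.append((float(low), float(high) if dash else float(low)))
        filters[name] = ranges
    return filters

def passes(case, filters):
    """A case passes only if EVERY filter variable matches one of its ranges."""
    return all(
        any(low <= case[name] <= high for low, high in ranges)
        for name, ranges in filters.items()
    )
```

Within one variable, the codes and ranges are alternatives (any one may match); across variables, all conditions must hold.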

Open-ended Ranges using '*' and '**'
A single asterisk, '*', can be used to specify that all cases with VALID codes for a variable will pass the filter.
For example: age(*) includes all cases with valid data on the variable 'age'.

In a range, the '*' can be used to signify the lowest or highest VALID value. For example: age(*-25,75-*). This filter would include all VALID values less than or equal to 25 and all VALID values greater than or equal to 75. However, any missing-data values within those ranges would still be excluded.

In a range, two asterisks '**' can be used to signify the lowest or highest numeric value, regardless of whether or not the codes are defined as missing data. For example: age(50-**) would include ALL numeric values greater than or equal to 50, including data values like 98 or 99, even if they had been defined as missing-data codes. However, any character missing-data values would still be excluded. Note that '**' cannot be used alone in a filter variable. It can only be used as part of a range.
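The difference between '*' and '**' can be summarized in a small Python sketch. The function name and the missing_codes argument are hypothetical. The sketch also excludes missing-data codes from ranges that contain no asterisk at all, which is an assumption; the text above describes only ranges that use '*' or '**'.

```python
import math

def in_range(value, low, high, missing_codes=frozenset()):
    """Test one numeric value against one range whose bounds may be '*' or '**'.

    '*'  : open-ended bound over VALID values only
    '**' : open-ended bound over all numeric values, missing-data codes included
    """
    def bound(token, default):
        return default if token in ("*", "**") else float(token)

    lo = bound(low, -math.inf)
    hi = bound(high, math.inf)
    if not lo <= value <= hi:
        return False
    if value in missing_codes:
        # A missing-data code gets through only when a '**' bound is present
        return "**" in (low, high)
    return True
```

For example, with 98 and 99 defined as missing-data codes, a value of 99 fails the range 75-* but passes the range 50-**. The filter age(*) corresponds to calling the function with '*' for both bounds.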

Character variables as selection filters

The syntax for specifying character variable filters is similar to the syntax for numeric variables but with a few differences. Like numeric variable filters, character variable filters specify the variable name followed by the filter value(s) in parentheses.
For example: city( Atlanta )

Multiple filter values can be specified, separated by spaces or commas:
city( Chicago,Atlanta Seattle)

Character variable filters are case-insensitive. For example, the following filters are functionally identical:
city( Atlanta )
city( ATLANTA )
city( AtLAnta )

If a filter value contains internal spaces or commas, it must be enclosed in matching quotation marks (either single or double):
city( "New York" )
state("Cal, Calif")

A filter value containing a single quote (apostrophe) can be specified by enclosing it in double quotes:
city( "Knot's Landing" )

Or, conversely, a filter value containing double quotes can be specified by enclosing it in single quotes:
name( 'William "Bill" Smith' )

Leading and trailing spaces, and multiple internal spaces, are NOT significant. The following filters are all functionally equivalent:
city( "New York    " )
city( "New    York" )
city( "   New York    " )

Note that ranges, which are legal for numeric variables, are not allowed for character variables.
The following syntax is NOT legal: city( Atlanta-Seattle)
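The quoting and matching rules for character filters can be sketched in Python as follows. The function names are hypothetical, and the sketch covers only the rules stated above: values separated by spaces or commas, single or double quotes around values with internal spaces or commas, and comparison that ignores case and extra blanks. It does not try to detect illegal range syntax; a value such as Atlanta-Seattle would simply be treated as one literal string.

```python
import re

# One filter value: a double-quoted string, a single-quoted string, or a bare word
VALUE = re.compile(r'"([^"]*)"|\'([^\']*)\'|([^\s,"\']+)')

def normalize(text):
    """Ignore case, leading and trailing blanks, and runs of internal blanks."""
    return " ".join(text.split()).lower()

def parse_values(body):
    """Split the text inside the parentheses into normalized filter values."""
    return [normalize(dq or sq or bare) for dq, sq, bare in VALUE.findall(body)]

def matches(data_value, body):
    """True if the data value equals any filter value after normalization."""
    return normalize(data_value) in parse_values(body)
```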

Specifying Variables to Include in the Subset

Specify INDIVIDUAL variable names

One way that variables can be selected for inclusion in the subset is by entering their names in the text box. Multiple variable names must be separated by spaces or commas. You can now include variables created by RECODE and COMPUTE in a subset, by specifying the individual names of the created variables.

To select more than a few variables, you will probably want to use the group selection procedure. However, the group selection procedure is available only for the original variables set up with the dataset, and is not available for the variables created by RECODE or COMPUTE. Note that individually specified variables and groups of variables can be combined in the same subset.

Variables named individually will be output onto the subset data file first. Then the variables specified by group follow, in the order they are found in the codebook.

Select variables from GROUPS of variables

TO SEE what is in a group

In the variable selection tree there is a little arrow to the left of the name of each group and each subgroup of variables. Click on the arrow to display the contents of each group and subgroup of variables.

TO SELECT variables from a group

- Select ALL variables from a group
  Click on the box next to the group or subgroup name to select ALL the variables in the group or subgroup.
- Select SOME variables from a group
  Click on the box next to a variable name to select that variable. Note that a '-' appears in the box next to the group name (and subgroup name, if there is one) to indicate that some, but not all, variables in the group (and subgroup) have been selected.
- Verify which variables have been selected
  Above the variable selection tree there is a button labeled "Show List of Variables Selected from Tree." Clicking that button will display the names of the variables currently selected from the tree.
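The three possible states of a group's box can be expressed as a short Python sketch. The function name and the state labels 'checked', 'partial', and 'empty' are hypothetical; only the '-' shown for a partial selection is described above.

```python
def group_box_state(group_variables, selected):
    """Return 'checked', 'partial' (displayed as '-'), or 'empty' for a group's box."""
    chosen = sum(1 for name in group_variables if name in selected)
    if chosen == 0:
        return "empty"
    return "checked" if chosen == len(group_variables) else "partial"
```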