
DataSet creates a GCG data library from any set of sequences in GCG format.



A large set of sequences is more compact to store and faster to search if the sequences are assembled into a personal database like the databases we provide with the Wisconsin Package(TM). When sequences are assembled into a personal database, all of the Wisconsin Package database tools will work with them just as they do with the databases we provide (GenEMBL, PIR-Protein, and SWISS-PROT).

DataSet assembles any set of sequences you specify into a personal database. The sequences in the output files from DataSet are meant to be accessed the same way as sequences in other GCG databases. When you answer the prompt "What should I call the database ?", you are giving the database a logical name that is used to refer to it from then on. The command assigning the logical name (globin in the example session below) is written into a file called .datasetrc in your home directory. This causes the logical name to be set automatically every time you initialize the Package. GCG sequence specification syntax (like Globin:h*) can then be used to identify the sequences in the database. (See Chapter 2, "Using Sequence Files and Databases," of the User's Guide for more information about sequence specification.)



Here is a session using DataSet to assemble most of the human globin sequences in GenEMBL into a separate personal database called globin:

% dataset

 Assemble DATASET from what sequence(s) ?  GenEMBL:Humhb*

 What should I call the database ?  globin




 Running DBINDEX to calculate indices for "globin".

 Running SEQCAT to make "globin" available for STRINGSEARCH.

 HomeDir:.datasetrc was modified to assign "globin" and "gl".

 DATASET complete:

        Sequences: 103
     Total length: 215798

 Output files: globin.seq, .header, .ref, .offset, .names, .numbers, .seqcat




DataSet writes seven files in your current working directory. globin.ref contains the documentation for all of the sequences specified by GenEMBL:Humhb*. globin.seq contains the sequences. globin.names, globin.numbers, and globin.offset are index files used by the Wisconsin Package to find individual sequences in the database. globin.header provides the Wisconsin Package with information such as logical names, release dates, and formatting information. globin.seqcat is the definition file that is searched by the StringSearch program.
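If you want to confirm that a run left all of its output files in place, the file list can be checked with a short script. The following is a minimal Python sketch and is not part of the Wisconsin Package: the helper name missing_output_files is hypothetical, and the extensions are copied from the "Output files" line of the example session.

```python
from pathlib import Path

# Extensions copied from the "Output files" line of the example session.
DATASET_EXTENSIONS = (
    ".seq", ".header", ".ref", ".offset", ".names", ".numbers", ".seqcat",
)


def missing_output_files(directory, base_name):
    """Return the names of expected DataSet output files absent from directory."""
    folder = Path(directory)
    expected = [folder / (base_name + ext) for ext in DATASET_EXTENSIONS]
    return [path.name for path in expected if not path.is_file()]


if __name__ == "__main__":
    # Hypothetical usage: look for the "globin" files in the current directory.
    print(missing_output_files(".", "globin"))
```

An empty list means every expected file is present. It does not show that DBIndex and SeqCat completed successfully.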



DataSet accepts as input multiple sequences of the same type. You can specify multiple sequences in a number of ways: by using a list file, for example @project.list; by using an MSF or RSF file, for example project.msf{*}; or by using a sequence specification with an asterisk (*) wildcard, for example GenEMBL:*.

If the input is a list file, DataSet applies any Begin, End, and Strand attributes it finds within that file. However, with one exception, the Command Line qualifiers -BEGin, -END, -REVerse, and -NOREVerse will override any conflicting attributes found in the list file. The single exception is that if an -END qualifier specified on the Command Line is less than a Begin attribute found in the list file, the output sequence will begin and end at the base indicated by the Begin.
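The precedence rule above can be written out as a small function. This is a hedged Python sketch of the rule as described, not DataSet's actual code: the function name resolve_range and its arguments are hypothetical, None stands for "not specified", and it assumes that the -END exception applies only when -BEGin is not also given on the command line.

```python
def resolve_range(list_begin=None, list_end=None, list_reverse=False,
                  cli_begin=None, cli_end=None, cli_reverse=None):
    """Combine list-file attributes with command-line qualifiers.

    None means "not specified". Returns (begin, end, reverse).
    """
    # Command-line qualifiers override conflicting list-file attributes.
    begin = cli_begin if cli_begin is not None else list_begin
    end = cli_end if cli_end is not None else list_end
    reverse = cli_reverse if cli_reverse is not None else list_reverse

    # The documented exception: -END is less than the list file's Begin.
    # Assumption: this only arises when -BEGin was not given as well.
    if (cli_end is not None and cli_begin is None
            and list_begin is not None and cli_end < list_begin):
        end = list_begin  # the output begins and ends at the Begin base

    return begin, end, reverse


if __name__ == "__main__":
    print(resolve_range(list_begin=50, list_end=200, cli_end=20))
```

With a list-file Begin of 50 and -END=20 on the command line, the sketch returns a range that begins and ends at base 50, matching the exception described above.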



Fetch copies GCG sequences or data files from the GCG database into your directory or displays them on your terminal screen.

StringSearch identifies sequences by searching for character patterns such as "globin" or "human" in the sequence documentation. Names identifies GCG data files and sequence entries by name. It can show you what set of sequences is implied by any sequence specification.

GCGToBLAST combines any set of GCG sequences into a database that you can search with BLAST. BLAST searches one or more nucleic acid or protein databases for sequences similar to one or more query sequences of any type. BLAST can produce gapped alignments for the matches it finds.

FastA does a Pearson and Lipman search for similarity between a query sequence and a group of sequences of the same type (nucleic acid or protein). For nucleotide searches, FastA may be more sensitive than BLAST. WordSearch identifies sequences in the database that share large numbers of common words in the same register of comparison with your query sequence. The output of WordSearch can be displayed with Segments.



You cannot move or rename the database files unless you change the database name assignment appropriately in HomeDir:.datasetrc.

Upon completion, DataSet tries to run the Wisconsin Package database utilities DBIndex and SeqCat as spawned processes in order to calculate indices for the new database. These two programs must complete successfully before the sequences in the new database can be used with Wisconsin Package programs.

The .offset, .names, and .numbers files are not ASCII text and cannot be viewed on your terminal screen or modified in any way.

The architecture of GCG data libraries requires that all of the files that make up a data library be in the same directory and share the same base name. These files are differentiated only by their filename extensions. The individual data libraries that make up a database, such as Genbank, need not be located in the same directory. For each data library you may specify a different location in the file

The location of a data library must be specified by a logical name. If you need to define a new logical name for a new location, add that definition to the file GenDBConfigure:dblogicals and run the command xNewDBLogx. This is necessary, for example, if you put the data on a new disk.



The format of GCG database index files changed starting with Version 8.0 of the Wisconsin Package. Personal databases that were created with DataSet prior to Version 8.0 must be converted to the new format. To make a database compatible with the current version of the Wisconsin Package, use the program DBIndex to create new index files for the database.



The sequences in your new personal database are meant to be accessed the same way as any other GCG database sequences. The Wisconsin Package recognizes that a sequence specification like Globin:humhbb is a database sequence specification by examining the logical name globin. If globin is assigned to a complete filename without a filename extension and if there are five files that start with that name and end with the extensions .ref, .seq, .offset, .numbers, and .names, then the Wisconsin Package assumes globin is a database and tries to find the entry humhbb.
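The recognition test described above can be sketched as follows. This Python illustration is not how the Wisconsin Package is implemented: the function names are hypothetical, the logical_names dictionary is a stand-in for the Package's logical name table, logical names are assumed to be compared without regard to case, and the .offset spelling follows the "Output files" line of the example session.

```python
from pathlib import Path

# The five files that must exist for a logical name to be treated as a database.
REQUIRED_EXTENSIONS = (".ref", ".seq", ".offset", ".numbers", ".names")


def split_spec(spec):
    """Split 'Globin:humhbb' into ('globin', 'humhbb')."""
    logical, _, entry = spec.partition(":")
    return logical.lower(), entry


def is_database(base_path):
    """True if all five required files exist for this extensionless base path."""
    base = str(base_path)
    return all(Path(base + ext).is_file() for ext in REQUIRED_EXTENSIONS)


def resolve_spec(spec, logical_names):
    """Return (base_path, entry) if spec names a database entry, else None."""
    logical, entry = split_spec(spec)
    base = logical_names.get(logical)
    if base is None or not is_database(base):
        return None
    return base, entry


if __name__ == "__main__":
    # Hypothetical logical name table with one assignment.
    names = {"globin": "/usr/user/burgess/seq/globin"}
    print(resolve_spec("Globin:humhbb", names))
```

The sketch stops at identifying the database and the entry name. Looking up the entry in the index files is outside its scope.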

Usually, DataSet permanently assigns a name like globin for you when you run the program, but you can assign a database logical name by yourself with a command like this one:

% name -s globin /usr/user/burgess/seq/globin

In the example session, a logical name globin is assigned by adding a command to the file HomeDir:.datasetrc. If that file did not exist before the session, DataSet would create a new one. Whenever you initialize the Wisconsin Package, this name is assigned correctly.



All parameters for this program may be added to the command line. Use -CHEck to view the summary below and to specify parameters before the program executes. In the summary below, the capitalized letters in the parameter names are the letters that you must type in order to use the parameter. Square brackets ([ and ]) enclose parameter values that are optional. For more information, see "Using Program Parameters" in Chapter 3, Using Programs in the User's Guide.

Minimal Syntax: % dataset [-INfile=]GenEMBL:Humhb* [-OUTfile=]Globin -Default

Prompted Parameters:
-BEGin=1 -END=148         sets the range of interest for a single sequence
-REVerse                  uses the back strand of a single sequence

Local Data Files: None

Optional Parameters:

-BEGin=1 -END=148         sets the range of interest for all sequences
-REVerse                  uses reverse strand for all sequences
-IDToken="DEFINITION"     sets the heading's definition line identifier
-TYPe=N                   sets the dataset type
-TOPROtein                translates nucleotide input to protein output
-TRANSlate=translate.txt  specifies file with codon translation table
-LN=Globin                defines the long name
-SN=Gl                    defines the short name
-APPend                   appends data to an existing dataset
-NOMONitor                suppresses screen monitor
-NOSUMmary                suppresses the screen summary
-FASTA                    creates the .seq file in FASTA format
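The capitalization convention used in the summary above (the capitalized letters are the ones you must type) can be illustrated with a short Python sketch. The function names are hypothetical. The accepts helper additionally assumes that any spelling between the minimal form and the full name is accepted and that case is ignored, which this document does not state.

```python
import re


def minimal_form(parameter):
    """Return the shortest spelling: the dash plus the leading capital letters."""
    name = parameter.split("=", 1)[0]  # drop any "=value" part
    match = re.match(r"-([A-Z]+)", name)
    if match is None:
        raise ValueError(f"not a parameter name: {parameter!r}")
    return "-" + match.group(1)


def accepts(typed, parameter):
    """Assumed matching rule: typed text lies between minimal and full spelling."""
    full = parameter.split("=", 1)[0].upper()
    typed = typed.split("=", 1)[0].upper()
    return typed.startswith(minimal_form(parameter).upper()) and full.startswith(typed)


if __name__ == "__main__":
    for parameter in ("-BEGin=1", "-REVerse", "-TOPROtein", "-NOMONitor"):
        print(parameter, "->", minimal_form(parameter))
```

For example, the sketch reduces -REVerse to -REV and -TOPROtein to -TOPRO.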



You can set the parameters listed below from the command line. For more information, see "Using Program Parameters" in Chapter 3, Using Programs in the User's Guide.


-BEGin=1

sets the beginning position for all input sequences. When the beginning position is set from the command line, DataSet ignores beginning positions specified for individual sequences in a list file.


-END=148

sets the ending position for all input sequences. When the ending position is set from the command line, DataSet ignores ending positions specified for individual sequences in a list file.


-REVerse

sets the program to use the reverse strand for each input sequence. When -REVerse or -NOREVerse is on the command line, DataSet ignores any strand designation for individual sequences in a list file.


-IDToken="DEFINITION"

Many Wisconsin Package programs annotate their output with a single line of documentation from each sequence. This is the same line that is searched in the definition search of the StringSearch program. For sequences in GenBank format, this line begins with DEFINITION. For EMBL format, the line begins with DE. You can set this identifier to capture whatever line interests you from the heading of your input sequences. If there is no line that starts with your identification token or if you do not use this parameter, then the first non-blank line in your sequence file is used as the sequence definition.
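The selection rule for the definition line can be sketched in a few lines of Python. This is an illustration of the rule as described, not DataSet's code: the function name definition_line is hypothetical, the sample heading is made up, and whether DataSet keeps or strips the token itself is not stated here, so the sketch returns the whole line.

```python
def definition_line(heading_lines, id_token=None):
    """Pick the definition line from the heading lines of a sequence file."""
    if id_token:
        for line in heading_lines:
            if line.startswith(id_token):
                return line
    # Fallback: no token given, or no line starts with it.
    for line in heading_lines:
        if line.strip():
            return line
    return ""


if __name__ == "__main__":
    # Made-up heading, for illustration only.
    heading = [
        "",
        "LOCUS       EXAMPLE1",
        "DEFINITION  Example definition text.",
    ]
    print(definition_line(heading, "DEFINITION"))
    print(definition_line(heading))
```

With the token DEFINITION the sketch returns the third line of the sample. Without a token it falls back to the first non-blank line.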


-TYPe=N

sets the type of the dataset to N for nucleic acid datasets and P for protein datasets. The type field is stored in the .header file and is used when reading in sequences to set the type of the sequence to protein or nucleic acid.


-TOPROtein

translates all six potential reading frames for each nucleotide entry. Peptide sequences representing translations of the three forward reading frames are designated with the original entry name followed by f1, f2, or f3, while those corresponding to the three reverse reading frames have names containing r1, r2, or r3.
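To make the six-frame naming concrete, here is a minimal Python sketch. It is not DataSet's implementation and rests on several assumptions: it uses the standard genetic code rather than the table DataSet reads, it writes stop codons as * and unrecognized codons as X, and it joins the entry name and frame label with a configurable separator (empty by default), because the exact name format is not given here.

```python
# Standard genetic code, with bases ordered T, C, A, G at each codon position.
BASES = "TCAG"
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    a + b + c: AMINO_ACIDS[16 * i + 4 * j + k]
    for i, a in enumerate(BASES)
    for j, b in enumerate(BASES)
    for k, c in enumerate(BASES)
}
COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(seq):
    """Return the reverse strand, read 5' to 3'."""
    return seq.translate(COMPLEMENT)[::-1]


def translate(seq):
    """Translate complete codons only; unknown codons become X."""
    codons = (seq[i:i + 3] for i in range(0, len(seq) - 2, 3))
    return "".join(CODON_TABLE.get(codon, "X") for codon in codons)


def six_frames(name, seq, separator=""):
    """Map names such as name+f1 ... name+r3 to the six translations."""
    seq = seq.upper()
    frames = {}
    for strand, template in (("f", seq), ("r", reverse_complement(seq))):
        for offset in range(3):
            label = f"{name}{separator}{strand}{offset + 1}"
            frames[label] = translate(template[offset:])
    return frames


if __name__ == "__main__":
    for label, peptide in six_frames("humhbb", "ATGGCC").items():
        print(label, peptide)
```

Each nucleotide entry therefore yields six peptide entries, so a translated database holds six times as many entries as its input.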


-TRANSlate=translate.txt

specifies a file containing the codon translation table.


-LN=Globin

defines the long logical name that is used to refer to data in this dataset.


-SN=Gl

defines the short logical name that is used to refer to data in this dataset.


-APPend

appends data to an existing dataset.


-MONitor

This program normally monitors its progress on your screen. However, when you use -Default to suppress all program interaction, you also suppress the monitor. You can turn it back on with this parameter. If you are running the program in batch, the monitor will appear in the log file.


-SUMmary

writes a summary of the program's work to the screen when you've used -Default to suppress all program interaction. A summary is normally displayed at the end of a program that is run interactively. You can suppress it for an interactive run with -NOSUMmary.

You can also use this parameter to cause a summary of the program's work to be written in the log file of a program run in batch.


-FASTA

writes entries to the .seq file in FASTA format.

Printed: December 9, 1998 16:29 (1162)


Copyright (c) 1982-2001 Genetics Computer Group, Inc. A subsidiary of Pharmacopeia, Inc. All rights reserved.

Licenses and Trademarks: Wisconsin Package is a trademark of Genetics Computer Group, Inc. GCG and the GCG logo are registered trademarks of Genetics Computer Group, Inc.

All other product names mentioned in this documentation may be trademarks, and if so, are trademarks or registered trademarks of their respective holders and are used in this documentation for identification purposes only.

Genetics Computer Group