By naming each reading according to a rigid convention it is possible to encode much information about the reading in its name. Indeed, as it is important that each reading has a unique name, this will generally be the case. By rigidly adhering to a convention it is possible to configure pregap to obtain the answers to many of its questions simply by looking at the name. In some cases this may completely avoid the need for (and the work involved in) creating a database file from which to extract this information.
The key is in choosing a naming convention which is easy to process. As we cannot request that everyone uses the same naming convention, pregap cannot magically know how to extract information from a reading name. However, we provide here an example naming convention and a corresponding `.pregaprc' file. It is hoped that this will serve as a template for those wishing to tailor pregap to their own local conventions.
We start by defining a reading name to consist of the template name followed
by a full stop and an extension encoding the chemistry, reading direction, and
any additional information required to generate unique names. As an example we
may have sequences named xb54b12.f1 and xb54b12.r1. These have
been sequenced from each end of the same insert.
The extension will consist of several characters. The first is f to
indicate a forward reading and r for a reverse reading. We shall use
lower case letters to indicate use of the universal primer and capital letters
to indicate use of a custom primer. The next few characters are optional and
may contain a t to indicate a terminator reaction or L for a
long reading. Finally we add a number so that repeats of the sequencing
reaction from the same template with the same direction and chemistry can be
given unique names.
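To make the convention concrete, the following is a minimal sketch of how a name such as xb54b12.ft1 breaks down into its parts. The function describe_reading is a hypothetical helper written for this illustration only; it is not part of pregap, and it assumes the extension begins after the first full stop in the name.

```shell
#!/bin/sh
# describe_reading: hypothetical helper (not part of pregap) that decodes a
# reading name following the example convention, e.g. xb54b12.ft1
describe_reading() {
    name=$1

    # The convention requires "template.extension".
    case $name in
        *.*) ;;
        *) echo "describe_reading: no extension in '$name'" >&2; return 1 ;;
    esac

    template=${name%%.*}    # text before the first full stop
    ext=${name#*.}          # text after the first full stop

    # First character: direction, with case giving the primer type.
    case $ext in
        f*) direction=forward; primer=universal ;;
        r*) direction=reverse; primer=universal ;;
        F*) direction=forward; primer=custom ;;
        R*) direction=reverse; primer=custom ;;
        *)  direction=unknown; primer=unknown ;;
    esac

    # Optional characters after the first: t (terminator), L (long reading).
    case $ext in ?*t*) terminator=yes ;; *) terminator=no ;; esac
    case $ext in ?*L*) long=yes ;; *) long=no ;; esac

    # Trailing digits: the repeat number.
    repeat=${ext##*[!0-9]}

    printf 'template=%s direction=%s primer=%s terminator=%s long=%s repeat=%s\n' \
        "$template" "$direction" "$primer" "$terminator" "$long" "$repeat"
}

for n in xb54b12.f1 xb54b12.r1 xb54b12.Ft2 xb54b12.RL2; do
    describe_reading "$n"
done
```

The point of the sketch is only that every field can be recovered from the name by simple pattern matching, which is what makes the convention easy to process.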
From this information it should be clear that we can create TN_com,
PR_com and CH_com pregap configurations to extract the template name,
primer type and chemistry from the reading name. These could be coded as
follows.
TN_com='echo $ID | sed "s/\..*//"'
PR_com='echo $ID | sed "s/.*\.\(.\).*/\1/;s/f/1/;s/r/2/;s/F/3/; \
s/R/4/;s/[^1-4]/0/"'
CH_com='echo $ID | sed "s/.*\..*[tT].*/1/;s/.*\..*/0/"'
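The above file makes extensive use of the Unix sed command. This is
just an illustration, so you should feel free to use whichever commands you are
familiar with. All the above configurations echo the reading name into
sed, which deletes or replaces segments as needed. The TN_com
command replaces ("s/") a full stop ("\.") followed by all other
characters (".*") with nothing ("//"). The PR_com command reduces the name
to the first character of the extension and then maps f, r, F and R to 1, 2,
3 and 4 respectively, with any other character becoming 0. The CH_com
command yields 1 if the extension contains a t or T, and 0 otherwise.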
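It can be useful to try such commands on a few reading names before relying on them. The sketch below does this outside pregap. It assumes that pregap sets the ID variable to the reading name before running each command; the run_com function is a hypothetical stand-in for that step and says nothing about how pregap itself invokes the commands.

```shell
#!/bin/sh
# The three example commands, copied from the example .pregaprc.
TN_com='echo $ID | sed "s/\..*//"'
PR_com='echo $ID | sed "s/.*\.\(.\).*/\1/;s/f/1/;s/r/2/;s/F/3/; \
s/R/4/;s/[^1-4]/0/"'
CH_com='echo $ID | sed "s/.*\..*[tT].*/1/;s/.*\..*/0/"'

# run_com NAME COMMAND
# Hypothetical test harness: sets ID to the reading name and evaluates the
# command string (an assumption about how the commands are meant to be run).
run_com() {
    ID=$1
    eval "$2"
}

for name in xb54b12.f1 xb54b12.r1 xb54b12.Ft2; do
    printf '%s TN=%s PR=%s CH=%s\n' "$name" \
        "$(run_com "$name" "$TN_com")" \
        "$(run_com "$name" "$PR_com")" \
        "$(run_com "$name" "$CH_com")"
done
# Prints:
#   xb54b12.f1 TN=xb54b12 PR=1 CH=0
#   xb54b12.r1 TN=xb54b12 PR=2 CH=0
#   xb54b12.Ft2 TN=xb54b12 PR=3 CH=1
```

Note that these commands assume every name contains a full stop; a name without one passes through CH_com unchanged, so local conventions that allow such names would need an extra rule.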