************************************************
************************************************
** **
** MAMOT **
** a program for HMM modelling **
** 2008, Mauro C. Delorenzi **
** **
************************************************
************************************************/
Contact:
Mauro.Delorenzi@isb-sib.ch
or Frederic.Schutz@isb-sib.ch
If you report errors please describe the problem,
indicate the program version and the command line
and provide access to all input files.
**************************************************************************
FUNCTIONS IMPLEMENTED
Generation of random sequences mamot -G
Baum-Welch (BW) LEARNING mamot -B
Viterbi (Vit) LEARNING mamot -V
Forward-backward "posterior" DECODING mamot -D
FB (forward(-backward)) PROBABILITY mamot -P
Viterbi probability and DECODING mamot -Q
*************************************************************************
QUICK START
The file atcg.model in directory INPUT provides a simple model to describe
simulated AT- and CG-rich regions in the genome. You can try the following
commands:
1) Generate 100 (pseudo-)random sequences of maximal length 1000 using the model
and store them in file "sequences"
mamot -Gf -r4 -m INPUT/atcg.model -n 100 -l 1000 >dnasequences
The -f option creates a tab-delimited file called GeneratedSequences.txt
which contains the true state and parameters for each generated symbol.
The -r option sets a seed, by default the same is used and two identical default
runs give the same random sequences.
2) Using the Forward-Backward posterior decoding algorithm, we create a file "PMatrix"
containing the calculated probability for each state and each symbol:
mamot -D -m INPUT/atcg.model dnasequences
This also produces a file "FBprob" with the log Probability of each sequence.
3) Retrain the model using the generated sequences and the Baum-Welch algorithm:
mamot -B -i 20 -w 0.1 -m INPUT/atcg.model dnasequences
The new model will be stored in file "FinalModel".
The initial model will be stored in file "InitialModel" (this is useful to check
if the provided model file has been correctly interpreted by mamot).
in this case the exercice shows the degree of fluctuations to be expected,
which can be high, especially if the number of generated sequences is small.
4) Retrain with the Viterbi method:
mamot -V -m INPUT/atcg.model sequences
This produces files as with Baum-Welch training.
5) Decode with the Viterbi method:
mamot -Q -m INPUT/atcg.model sequences
This produces a file "FBprob" with the log Probability of each sequence.
*************************************************************************
USAGE - MORE EXAMPLES
General options
---------------
-h : display help page
-v : verbose output (mostly sent to cerr)
-m : Specify the file containing the model (default: INPUT/R1.model)
-s : Specify the file containing the sequences (default: SEQUENCES/seqFile)
-f : Write additional information in a file (depending on the command)
-G: Generation of random sequences
----------------------------------
-r n : seed for the random number generator (default 1)
-n n : number of sequences to generate
-l n : maximum length of each sequence (default 1000, 0 for no limit)
Examples:
mamot -G -r5 -m INPUT/atcg.model -n 200 -l 5000 > sequences
mamot -Gf -l500 -m INPUT/casino.model -n 2
mamot -G -l 10 -n 3 > SEQUENCES/seqFile; mamot -P; cat FBprob
-B and -V: Learning a model
---------------------------
-j n : minimal number of EM iterations (but not below 2)
-i n : maximal number of EM iterations
-w n : a multiplicative weight factor for the pseudocounts added to the emission counts
-d n : a multiplicative weight factor for the pseudocounts added to the emission counts (default: 100)
Examples:
mamot -Batp -j 3 -i 8 -w 0.1 -d 500 -m INPUT/atcg.model -s SEQUENCES/dnasequences
mamot -Vtp -j 3 -i 10 -d 50 -w0.01 -m INPUT/TF.model -s SEQUENCES/dnasequences2
-P: Computing the Probability given the model
----------------------------------------------
mamot -P -m modelfile seqFile
Writes log Prob. to the file FBprob
-D and -Q: Decoding
-------------------
mamot -D -m modelfile seqFile
Writes posterior state probabilitie to the file PMatrix
also the same output as -P to the file FBprob
mamot -Q -m modelfile seqFile
Writes posterior state probabilitie to the file yyyyyyyyyyyyyyyyyyyyy
- - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - -
INPUT
seqFile: file with sequencs, name at max 100 chars, multifasta format
modelfile: file with model, name at max 100 chars, please see example file
lines cannot be longer than 1000 chars
***********************************************************************
OPTIONS
-m : specify the file containing the model (default: INPUT/R1.model)
-s : specify the file containing the sequences (default: SEQUENCES/seqFile)
-h : display help page
-v : verbose output
-f : write additional information in a file (depending on the command)
-n : specify nb of sequences to be generated (default: 1)
-l : when generating sequences, max length of each generated sequence
-r : specify the seed for the random generator when generating sequences
-i : maximal number of iterations in Baum Welch (BW) and Viterbi
-j : minimal number of iterations in BW and Viterbi
-d : threshold of absolute value of change of total log likelihood to stop
BW or Vit learning (default is kMINdifftotLogLik)
-b : in Baum Welch and Viterbi Learning do not update transition probabilities
-c : in Baum Welch and Viterbi Learning do not update emission probabilities
-t : in Baum Welch and Viterbi Learning tie emission distributions of states in the same
tie group (pool contributions in Baum Welch)
-p : in Baum Welch and Viterbi Learning tie emission distributions of pairs of
complementary DNA states (pool contributions in Baum Welch)
-w : multiplicative factor for pseudocounts for emissions, 1 for standard pseudocount scheme,
(default is 0: no pseudocounts are added)
-y : number of pseudocounts for transitions (default is 0: no pseudocounts added)
-a : write also intermediate update models to file "UpdateModels" (after each iteration in BW and Vit learning)
-g : limit the additional output of -Q to values above a cutoff
-x : use state weights given in the model file in decoding functions
-z : use the list of state-position combinations given in the file "ext"P as additional
probabilities in decoding.
The program will handle only one sequence. It can be used theoretically
also in learning but the use of one sequence is then a very limiting restriction.
This option is incompatible with special treatment of the two strands of DNA
(options -u and -e)
****************************************************************************
OBSERVATIONS
Pseudocounts
------------
Note that the addition of pseudocounts while regularizing parameters impairs the
monotonous likelihood increase of the BW algorithm and can lead to model instability
and converge problems. By experience we recommend addition of a small number of
pseudocounts when the sample size of the training set is limited and no pseudocounts
when rich data is available for training.
Emission counts:
By default no transition pseudocounts are added.
The number of pseudocounts added is given by the product of the number indicated
under "NULLMODEL" in the model file, the number of letters of the alphabet used
and the weight specified with -w on the command line.
The default weight is 0, so that -w 1 is needed to activate the use of the pseudocounts.
Transition counts:
By default no transition pseudocounts are added.
A number can be specified with the option -y. This number is then added to all transitions
except those starting or ending in the BEGIN or END state.
In many cases the use of pseudocounts for transitions can lead to unexpected consequences
and is not generally recommended. It can also increase model complexity enormously
and slow down learning.
State Weights
------------
Need to activate the use of state weights given in the model file with -x
Tying
------------
In the algorithm there is first a sharing of the counts among the pairs that are
complementary in the sense of DNA.
In a second step the sharing of the counts among all members of a set of tied
states. Therefore if pairs (Aand cA) and (B, cB) are complementary and there
is tying between A and B and between cA and cB, at the end also A and cB respectively
B and cA should results complementary, counts will be effectively shared
among all 4 states, in learning the 4 states will be completely coordinated.
****************************************************************************
LIMITATIONS & TROUBLESHOOTING
One of the most frequent problem arises when a user inadvertently used a model
file that defines a model in which any sequence has probability zero. The mistake
might not be by the program and lead to unpredictable behavior,
possibly including segmentation faults.
Sometimes, the program might not read the model file as it was intended, this can
be checked by running a learning algorithmus and comparing the model defined in
the file "InitialModel" with the expected model (unless the program aborts earlier).
Beware of the use of files from MacIntosh or Windows system, the end-of-line
symbol must be converted to the Unix \n symbol.
MAMOT writes certain output to files with fixed names, so every time the program is
invoked these files are overwritten.
HIDDEN MARKOV MODEL
-------------------
states: any number, any name (within reason) up to 25 chars
letters: max. 30 for now, can handle only capital "latin" letters properly (in readSeq)
emission probabilities must be completely listed in the order given by the alphabet
transition probabilities
CONSTANTS
---------
Most constants used are accessible in file globals.h
(changes must be followed by compilation to get a new executable)
In particular, the maximal length of a sequence might have to be increased for
wide scanning of DNA sequences or reduced on machines with little RAM.
GENERAL LIMITATIONS
-------------------
The symbols that can be used for the Alphabet is currently limited to the 52 single letters
ABCDEFGHIJKLMNOPQRSTUVWXYZABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz
(except for generating sequences)
There is no global checking if the model "makes sense".
For example if a model does not allow reaching of the END state, when
generating rsndom sequences it will never stop, unless a maximal length is provided.
If all sequences have a probability of zero in the model, the bahavior of mamot
is unpredictable, a warning might be issued when using -P to compute the probability
of any sequence.
Need to give a full set of emission probabilities also for the BEGIN and END states,
although these numbers are not used and therefore arbitrary.
To simplify the code in some locations mamot uses arbitrary constants and represents
for example log of zero with a small negative number, the resulting approximation
should be sufficient in almost all situations.
**************************************************************************
OBSOLETE OPTIONS
-k : during first iteration in learning store sequences in memory
and do not read it from file in successive iterations, -k must be followed by -l
-l : when using -k, maximal length of sequences used in training
-u : use both strands of a DNA sequence, apply decoding to both strands one after the other;
respectively use both independently in learning a model (not recommended)
-e : use both strands of a DNA sequence "conjointly" (not recommended)
**************************************************************************
**************************************************************************