/***********************************************
************************************************
**                                            **
**  hmm MAMOT                                 **
**  a program for HMM modelling               **
**  2002, Mauro C. Delorenzi                  **
**                                            **
************************************************
************************************************/

********************************************************************************************

IMPLEMENTED

generation of random sequences:                 mamot -G
Baum-Welch (BW, EM) LEARNING:                   mamot -B
Viterbi LEARNING (untested):                    mamot -V
FB (forward(-backward)) PROBABILITY:            mamot -P
FB forward-backward "posterior" DECODING:       mamot -D
Viterbi probability and DECODING:               mamot -Q

********************************************************************************************

USAGE EXAMPLES

mamot -G -vf -m modelfile -n 200
mamot -Batpv -j 2 -i 2 -w 0.1 -m model.hmm data.seq
mamot -V -j 2 -i 2 -d 55 -m modelfile seqFile > results.txt
mamot -P -m modelfile seqFile > probabilityfile.txt
mamot -D -m modelfile seqFile
mamot -Q -m modelfile seqFile

where mamot is the executable file, for example mamotlinux084

- - - - - - - - - - - - - - - - - - - - - - - - - - - -

INPUT

seqFile:    file with sequences, name at most 100 chars, multifasta format
modelfile:  file with model, name at most 100 chars, please see example file

lines cannot be longer than 1000 chars

********************************************************************************************

INFORMATION

-G => asks for the input of a seed (same seed, same result);
      the request is written to stderr, the seed is read from stdin (terminal);
      the random sequences are written to stdout.
      With option -f also writes to the file GenSeqs in a verbose format
      that includes the state sequence as well.

-P => writes sequences with added log Prob.
      to the file FBprob; additionally writes log Prob and (for now) some "controls" to stdout.
      -P is the default, can be omitted.

-B, -V => writes the new model to the file BWprot
      {should be in a format that can be used as model input file}
      and (for now) some "controls" to stdout.

-D => writes posterior state probabilities to the file FBpostprob
      and also the same output as -P to the file FBprob.

********************************************************************************************

ADDITIONAL OPTIONS

-p  (in Baum Welch only) tie emission distributions of pairs of complementary states
    (pool contributions in Baum Welch)
-t  (in Baum Welch only) tie emission distributions of states in the same tie group
    (pool contributions in Baum Welch)
-u  use both strands of a DNA sequence independently (in Baum Welch only)
-e  use both strands of a DNA sequence conjointly (in Baum Welch only)

-m  filename of the ModelFile
-s  filename of the Sequences File

-d  threshold of absolute value of change of total log likelihood to stop BW (vMINdifftotLogLik)
    default is kMINdifftotLogLik (here a signed number)
-n  nb of sequences to be generated
    default: 1

-a  writes also intermediate (after each round) BW model results to file (alloutput = true)
-f  allows "additional" output to a file (bfileoutput = true)
-g  limits the "additional" output of -f to values above a cutoff
    (vMINProbPrint, with default kMINProbPrint)
-i  maximal number of iterations in BW (vMAXnbITERATIONS)
    default is kMAXnbITERATIONS
-j  minimal number of iterations in BW (vMINnbITERATIONS)
    default is kMINnbITERATIONS
-k  store sequences in memory after the first reading when doing Baum Welch;
    followed by the nb of sequences and by -l:
-l  when using -k, maximal length of sequences that have to be used (for memory assignment)

-b  in Baum Welch and Viterbi Learning do not update transition probabilities
-c  in Baum Welch and Viterbi Learning do not update emission probabilities
-w  number (double) as weight for pseudocounts, 1 for standard pseudocount
    scheme, default is 0: no pseudocounts added

in alphabetical order:

-a  writes also intermediate (after each round) BW model results to file (alloutput = true)
-b  in Baum Welch and Viterbi Learning do not update transition probabilities
-c  in Baum Welch and Viterbi Learning do not update emission probabilities
-d  threshold of absolute value of change of total log likelihood to stop BW (vMINdifftotLogLik)
    default is kMINdifftotLogLik (here a signed number)
-e  use both strands of a DNA sequence conjointly (in Baum Welch only)
-f  allows "additional" output to a file (bfileoutput = true)
-g  limits the "additional" output of -f to values above a cutoff
    (vMINProbPrint, with default kMINProbPrint)
-i  maximal number of iterations in BW (vMAXnbITERATIONS)
    default is kMAXnbITERATIONS
-j  minimal number of iterations in BW (vMINnbITERATIONS)
    default is kMINnbITERATIONS
-k  store sequences in memory after the first reading when doing Baum Welch;
    followed by the nb of sequences and by -l:
-l  when using -k, maximal length of sequences that have to be used (for memory assignment)
-m  filename of the ModelFile
-n  nb of sequences to be generated
    default: 1
-p  (in Baum Welch only) tie emission distributions of pairs of complementary states
    (pool contributions in Baum Welch)
-s  filename of the Sequences File
-t  (in Baum Welch only) tie emission distributions of states in the same tie group
    (pool contributions in Baum Welch)
-u  use both strands of a DNA sequence independently (in Baum Welch only)
-w  number (double) as weight for pseudocounts, 1 for standard pseudocount scheme,
    default is 0: no pseudocounts added

********************************************************************************************

OBSERVATIONS / LIMITATIONS / BUGS

This is a working version and there are some bugs.
The program does not check whether the user respects the assumed conditions,
so segmentation faults can happen if the input specifications are not as expected (see examples).
There is no checking if the model makes sense, for example if it allows for termination
when generating sequences. If not, it will never stop.

HIDDEN MARKOV MODEL

states:   any number, any name (within reason) up to 25 chars
letters:  max. 30 for now, can handle only capital "latin" letters properly

emission probabilities must be completely listed in the order given by the alphabet

WATCH OUT:
cannot handle void lines in model file
cannot handle excessively long sequences

********************************************************************************************
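
SKETCH: HOW -d, -i AND -j MIGHT COMBINE

The options -d, -i and -j listed under ADDITIONAL OPTIONS each say when Baum-Welch
stops, but the text does not say how they interact. The following is a minimal sketch
of one plausible reading, not the MAMOT implementation: it assumes that -i is a hard cap,
that -j forces at least that many rounds, and that between the two the loop stops once the
absolute change of the total log likelihood falls below the -d threshold. The function name
bw_should_stop and all parameter names are hypothetical.

```c
#include <stdbool.h>

// Hypothetical helper, not taken from the MAMOT source.
// Decides whether Baum-Welch training should stop after round `iteration`
// (rounds are counted from 1).
//   min_iter : value given with -j (assumed: at least this many rounds are run)
//   max_iter : value given with -i (assumed: never more rounds than this)
//   min_diff : value given with -d (compared with the absolute change)
bool bw_should_stop(int iteration, double prev_loglik, double cur_loglik,
                    int min_iter, int max_iter, double min_diff)
{
    // Absolute change of the total log likelihood between two rounds.
    double change = cur_loglik - prev_loglik;
    if (change < 0.0) {
        change = -change;
    }

    if (iteration >= max_iter) {
        return true;   // assumption: the maximum always wins
    }
    if (iteration < min_iter) {
        return false;  // assumption: the minimum overrides the threshold
    }
    return change < min_diff;
}
```

Under these assumptions, the usage example with -j 2 -i 2 runs exactly two rounds
whatever value is given with -d.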