A protein coding region is a segment of a DNA sequence that is expressed as a protein product. Identifying protein coding regions is usually the first analysis performed once a DNA sequence has been determined. Many computer programs have been developed for this purpose, and the problem remains one of the major and most productive areas of computational biology.
Protein coding region databases for E. coli, primate, and S. cerevisiae were created from GenBank. The frequencies of all 64 trimers were counted in each of 6 phases (3 for each direction) in these databases, and the trimer frequencies of the three organisms were analyzed. A new protein coding measure called TFD (trimer frequency difference) was devised by subtracting the frequency of a trimer in one phase from its frequency in another phase. Of the 30 possible phase combinations, 5 were selected as protein coding measures: those in which the frequency in each of the other 5 phases is subtracted from the frequency in phase 1, direction 0. The TFDs of the three organisms were analyzed, and the quality of TFD as a protein coding measure was examined.
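The counting and subtraction described above can be illustrated with a minimal sketch. It is not the original implementation: the function names, the use of relative frequencies, the way phases are numbered on the reverse strand, and the order of the five differences are all assumptions made for illustration.

```python
from collections import Counter
from itertools import product

# All 64 trimers over the DNA alphabet.
TRIMERS = ["".join(p) for p in product("ACGT", repeat=3)]
_TRIMER_SET = set(TRIMERS)
_COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(seq):
    """Return the reverse complement of an uppercase DNA string."""
    return seq.translate(_COMPLEMENT)[::-1]


def trimer_frequencies(seqs, direction, phase):
    """Relative frequency of each trimer in one of the 6 phases.

    direction: 0 = strand as given, 1 = reverse complement (assumed convention).
    phase: 1, 2 or 3, the 1-based offset of the first trimer (assumed convention).
    Trimers containing ambiguous bases are skipped.
    """
    counts = Counter()
    for seq in seqs:
        s = seq.upper()
        if direction == 1:
            s = reverse_complement(s)
        for i in range(phase - 1, len(s) - 2, 3):
            trimer = s[i:i + 3]
            if trimer in _TRIMER_SET:
                counts[trimer] += 1
    total = sum(counts.values())
    return {t: (counts[t] / total if total else 0.0) for t in TRIMERS}


def trimer_frequency_differences(seqs):
    """Five difference tables: phase 1, direction 0 minus each other phase.

    The ordering of the five tables is hypothetical; the text does not
    specify which phase corresponds to which numbered TFD.
    """
    reference = trimer_frequencies(seqs, 0, 1)
    other_phases = [(0, 2), (0, 3), (1, 1), (1, 2), (1, 3)]
    tables = []
    for direction, phase in other_phases:
        freqs = trimer_frequencies(seqs, direction, phase)
        tables.append({t: reference[t] - freqs[t] for t in TRIMERS})
    return tables
```

Under these assumptions, a trimer that is common in the reading frame but rare in a shifted or antisense phase receives a positive difference, and the reverse case receives a negative one, which is what would make the difference usable as a coding measure.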
A method for presenting frequency fluctuations, called the NC (normalized cumulative) plot, was devised. Unlike the sliding window method, the NC plot shows frequency fluctuations as they are. Many different applications are possible with the NC plot.
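The text here does not give the formula behind the NC plot, so the following is only one plausible reading of "normalized cumulative", offered as a hypothetical sketch: a running sum of deviations from the overall mean, divided by the sequence length. The function name and the normalization are assumptions, not the original definition.

```python
def normalized_cumulative(values):
    """Cumulative sum of deviations from the mean, scaled by the length.

    values: one numeric score per sequence position (for example, 1 where
    an event of interest occurs and 0 elsewhere). This definition is an
    assumed illustration, not the definition used by the original work.
    """
    n = len(values)
    if n == 0:
        return []
    mean = sum(values) / n
    curve = []
    running = 0.0
    for v in values:
        running += v - mean
        curve.append(running / n)
    return curve
```

With this reading, the curve rises where the score is locally above average and falls where it is below, and no window size has to be chosen, in contrast to a sliding window average.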
By combining TFD and the NC plot, a new computer program for protein coding region identification called DNAClimber was developed. For E. coli, DNAClimber found 96.4% of 319 test protein coding regions. For S. cerevisiae, it found 93.5% of 371 test protein coding regions. DNAClimber overcomes the so-called antisense symmetry problem of protein coding region identification methods by using $TFD_5$. Another use of DNAClimber is detecting sequencing errors. Since current methods of DNA sequence determination are error prone, it is important to have a tool for detecting sequencing