Mathematics behind the identifying CpG islands
Abstract
The main objective of this paper is to review the theory of the hidden Markov model, a very general type of probabilistic model for sequences of symbols. For the hidden Markov model to be usable in real-world applications, three key problems must be addressed. We first go over how to choose the state sequence that best explains an observation sequence, then how to calculate the probability of an observation sequence, and finally how to adjust the model parameters to maximize the probability of the observation sequence. From these three angles, we review the mathematical concepts behind the identification of CpG islands. The whole procedure and the analysis of its outcomes are examined on hypothetical and real DNA sequences side by side. The experiments are carried out using well-known biological sequence analysis servers. For the hypothetical DNA sequence example, analytical and algorithmic approaches are compared.
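As an illustration of the first two problems named in the abstract, the following is a minimal Python sketch and not the paper's own implementation. It assumes a toy two-state model ("I" for CpG island, "N" for non-island). All start, transition and emission probabilities, as well as the names `forward` and `viterbi`, are hypothetical choices made for this example. The third problem (parameter estimation) is left out.

```python
# Toy two-state HMM: "I" = CpG island, "N" = non-island.
# Every probability below is a hypothetical value chosen for illustration.
STATES = ("I", "N")
START = {"I": 0.5, "N": 0.5}
TRANS = {"I": {"I": 0.9, "N": 0.1}, "N": {"I": 0.1, "N": 0.9}}
EMIT = {
    "I": {"A": 0.1, "C": 0.4, "G": 0.4, "T": 0.1},
    "N": {"A": 0.3, "C": 0.2, "G": 0.2, "T": 0.3},
}


def forward(seq):
    """Evaluation problem: P(seq | model), summed over all state paths.

    Assumes seq is a non-empty string over the alphabet A, C, G, T.
    """
    alpha = {s: START[s] * EMIT[s][seq[0]] for s in STATES}
    for symbol in seq[1:]:
        alpha = {
            s: EMIT[s][symbol] * sum(alpha[r] * TRANS[r][s] for r in STATES)
            for s in STATES
        }
    return sum(alpha.values())


def viterbi(seq):
    """Decoding problem: most probable state path and its joint probability."""
    delta = {s: START[s] * EMIT[s][seq[0]] for s in STATES}
    backpointers = []
    for symbol in seq[1:]:
        step, new_delta = {}, {}
        for s in STATES:
            # Best predecessor of state s at this position.
            best_prev = max(STATES, key=lambda r: delta[r] * TRANS[r][s])
            step[s] = best_prev
            new_delta[s] = delta[best_prev] * TRANS[best_prev][s] * EMIT[s][symbol]
        backpointers.append(step)
        delta = new_delta
    # Trace back from the best final state.
    last = max(STATES, key=lambda s: delta[s])
    path = [last]
    for step in reversed(backpointers):
        path.append(step[path[-1]])
    return "".join(reversed(path)), delta[last]
```

With these assumed parameters, `viterbi("CGCG")` labels every position as island and `viterbi("ATAT")` labels every position as non-island. The sketch multiplies raw probabilities, so for long sequences a log-space or scaled variant would be needed to avoid numerical underflow.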
This work is licensed under a Creative Commons Attribution 4.0 International License.