Mathematics behind the identifying CpG islands

Bijan Sarkar
https://orcid.org/0000-0001-8978-4339

Abstract

The main objective of this paper is to review the theory of the hidden Markov model, a very general type of probabilistic model for sequences of symbols. For the hidden Markov model to be usable in real-world applications, three key problems about the model must be addressed. We first go over how to choose the state sequence that best explains an observation sequence, then how to calculate the probability of an observation sequence, and finally how to maximize the probability of the observation sequence. From these three angles, we review the mathematical concepts behind the identification of CpG islands. The entire process and the analysis of the outcomes are carried out by examining hypothetical and real DNA sequences side by side. The experiments use well-known biological sequence analysis servers. For the hypothetical DNA sequence example, the analytical and algorithmic approaches are compared.
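As an illustration of the first two problems named in the abstract, the following is a minimal sketch of the Viterbi algorithm (best state sequence) and the forward algorithm (probability of an observation sequence) on a toy two-state model. The state labels `I` (island) and `B` (background) and every probability below are hypothetical values chosen for illustration; they are not taken from the paper. The third problem, maximizing the probability of the observation sequence, is not sketched here.

```python
import math
from itertools import product

# Hypothetical two-state model: I = CpG island, B = background.
# All numbers are illustrative assumptions, not the paper's parameters.
STATES = ("I", "B")
START = {"I": 0.5, "B": 0.5}
TRANS = {"I": {"I": 0.9, "B": 0.1}, "B": {"I": 0.1, "B": 0.9}}
EMIT = {
    "I": {"A": 0.1, "C": 0.4, "G": 0.4, "T": 0.1},
    "B": {"A": 0.3, "C": 0.2, "G": 0.2, "T": 0.3},
}


def joint_probability(obs, path):
    """P(observations, state path) under the toy model."""
    p = START[path[0]] * EMIT[path[0]][obs[0]]
    for t in range(1, len(obs)):
        p *= TRANS[path[t - 1]][path[t]] * EMIT[path[t]][obs[t]]
    return p


def forward(obs):
    """Probability of the observation sequence, summed over all state paths."""
    alpha = {s: START[s] * EMIT[s][obs[0]] for s in STATES}
    for x in obs[1:]:
        alpha = {
            s: EMIT[s][x] * sum(alpha[r] * TRANS[r][s] for r in STATES)
            for s in STATES
        }
    return sum(alpha.values())


def viterbi(obs):
    """Most probable state path, computed in log space to avoid underflow."""
    score = {s: math.log(START[s]) + math.log(EMIT[s][obs[0]]) for s in STATES}
    back = []  # back[t][s] = best predecessor of state s at step t + 1
    for x in obs[1:]:
        prev, score, pointers = score, {}, {}
        for s in STATES:
            best = max(STATES, key=lambda r: prev[r] + math.log(TRANS[r][s]))
            pointers[s] = best
            score[s] = prev[best] + math.log(TRANS[best][s]) + math.log(EMIT[s][x])
        back.append(pointers)
    path = [max(STATES, key=lambda s: score[s])]
    for pointers in reversed(back):
        path.append(pointers[path[-1]])
    return "".join(reversed(path))


def brute_force_probability(obs):
    """Exhaustive sum over all paths; only feasible for short sequences."""
    return sum(
        joint_probability(obs, "".join(p))
        for p in product(STATES, repeat=len(obs))
    )
```

For a short sequence, `forward(obs)` can be checked against `brute_force_probability(obs)`, which enumerates every state path explicitly; the two should agree up to floating-point error, while the forward recursion scales linearly in the sequence length rather than exponentially.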

How to Cite
Sarkar, B. (2024). Mathematics behind the identifying CpG islands. Brazilian Journal of Biometrics, 42(4), 307–328. https://doi.org/10.28951/bjb.v42i4.704
