Scene Text Detection and Recognition Using Maximally Stable Extremal Region


  • Golda Jeyasheeli P, Department of Computer Science Engineering, Mepco Schlenk Engineering College, Sivakasi, Tamilnadu, India.
  • Athinarayanan B, Department of Computer Science Engineering, Mepco Schlenk Engineering College, Sivakasi, Tamilnadu, India.
  • Manish T, Department of Computer Science Engineering, Mepco Schlenk Engineering College, Sivakasi, Tamilnadu, India.
  • Mohamad Umar M, Department of Computer Science Engineering, Mepco Schlenk Engineering College, Sivakasi, Tamilnadu, India.



MSER, SWT, Text Detection, Text Recognition, Deep Learning, CRNN


In recent years, scene text detection and recognition have become important research areas in computer vision and machine learning. Traditional methods often struggle to detect and recognize text in images with low resolution, complex backgrounds, and varying font sizes. This paper addresses these challenges with a two-stage method that combines classical region detectors with deep learning. In the detection stage, Maximally Stable Extremal Regions (MSER) and the Stroke Width Transform (SWT) extract candidate text regions from the input image, and non-text regions are then eliminated using image-to-image translation. In the recognition stage, a Convolutional Recurrent Neural Network (CRNN) reads the text in the detected regions; its convolutional layers capture spatial features and its recurrent layers capture sequential context. Evaluated on several benchmark datasets, the method achieves an accuracy of 96% and compares favorably with existing methods.
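
The abstract does not describe the detector's implementation, so the following is only a toy sketch of the stability criterion that gives MSER its name (after Matas et al.), not the authors' code. It assumes a small grayscale patch stored as a list of lists and dark text on a bright background. The names `region_area`, `stability`, and `toy` are hypothetical. A region grown from a seed pixel is considered stable at threshold t when its area barely changes between t - delta and t + delta.

```python
from collections import deque


def region_area(img, seed, t):
    """Size of the 4-connected set of pixels with intensity <= t that contains seed."""
    rows, cols = len(img), len(img[0])
    if img[seed[0]][seed[1]] > t:
        return 0  # the seed itself is above the threshold, so there is no region
    seen = {seed}
    queue = deque([seed])
    while queue:
        r, c = queue.popleft()
        for nr, nc in ((r + 1, c), (r - 1, c), (r, c + 1), (r, c - 1)):
            inside = 0 <= nr < rows and 0 <= nc < cols
            if inside and (nr, nc) not in seen and img[nr][nc] <= t:
                seen.add((nr, nc))
                queue.append((nr, nc))
    return len(seen)


def stability(img, seed, t, delta):
    """Relative area change q(t) = (|Q(t+delta)| - |Q(t-delta)|) / |Q(t)|; lower is more stable."""
    area = region_area(img, seed, t)
    if area == 0:
        return float("inf")
    grown = region_area(img, seed, t + delta)
    shrunk = region_area(img, seed, t - delta)
    return (grown - shrunk) / area


# Hypothetical 5x5 patch: a dark ring-shaped stroke (20) with one mid-grey
# pixel (90) inside it, on a bright background (200).
toy = [
    [200, 200, 200, 200, 200],
    [200,  20,  20,  20, 200],
    [200,  20,  90,  20, 200],
    [200,  20,  20,  20, 200],
    [200, 200, 200, 200, 200],
]

# Print threshold, region area, and stability for a few thresholds.
for t in (30, 50, 170):
    print(t, region_area(toy, (1, 1), t), stability(toy, (1, 1), t, delta=30))
```

In this patch the stroke's area is unchanged across the whole window around t = 50, so q is 0 and the region would be kept as a text candidate. Near t = 170 the region is about to merge with the background, so q is large. A full detector tracks every extremal region across all thresholds and applies area limits, which a library implementation would normally handle.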








How to Cite

P, G. J., B, A., T, M., & M, M. U. (2024). Scene Text Detection and Recognition Using Maximally Stable Extremal Region. Journal of Applied Engineering and Technological Science (JAETS), 6(1), 103–114.