Analysis and Detection of Sections of English Pop Song Lyrics Using Transfer Learning from The Longformer Model

Willy Reiji Nurhuda Ekaputra

Abstract


Understanding the structural composition of song lyrics is essential for applications such as music recommendation, summarization, and computational creativity. In this study, we explore automated section classification of song lyrics, that is, identifying parts such as verse, chorus, bridge, and others, using a transformer-based model. We fine-tuned a Longformer model on a dataset of 10,000 English pop lyrics with annotated section labels. The problem was framed as token classification, and the model was trained without global attention, relying solely on local context to capture structural cues. Despite a significantly reduced dataset and limited training resources, the model performed well on the dominant structural classes, reaching F1-scores of 0.78 for verse and 0.77 for chorus. Less frequent sections such as bridge and prechorus showed moderate performance, while more ambiguous categories such as postchorus and other were predicted less accurately. Analysis of the confusion matrix showed that most misclassifications occurred between semantically overlapping sections, particularly among chorus-adjacent types. The results indicate that transformer models can learn lyric structure from text alone, even with constrained data and without musical input. Our findings suggest that such models can serve as a foundation for future lyric analysis systems and that performance could be improved further through dataset expansion, label refinement, and multimodal integration.
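As an illustration of how per-section F1-scores of the kind reported above relate to a confusion matrix, the following minimal sketch computes one F1 value per class from raw counts. The function name `per_class_f1`, the three-label subset, and the counts in `TOY_CONFUSION` are assumptions made up for this example; they are not the study's data or code.

```python
from typing import Dict, List

# Hypothetical subset of section labels, for illustration only.
LABELS = ["verse", "chorus", "postchorus"]

# Made-up counts, NOT results from the study.
# TOY_CONFUSION[i][j] = number of tokens whose true label is LABELS[i]
# and whose predicted label is LABELS[j].
TOY_CONFUSION = [
    [8, 2, 0],  # true verse
    [1, 7, 2],  # true chorus
    [0, 3, 1],  # true postchorus
]


def per_class_f1(confusion: List[List[int]], labels: List[str]) -> Dict[str, float]:
    """Return the F1-score of each class from a square confusion matrix."""
    scores: Dict[str, float] = {}
    for k, name in enumerate(labels):
        tp = confusion[k][k]                          # correctly predicted as k
        fn = sum(confusion[k]) - tp                   # true k, predicted otherwise
        fp = sum(row[k] for row in confusion) - tp    # predicted k, true otherwise
        denom = 2 * tp + fp + fn
        # F1 = 2*TP / (2*TP + FP + FN); defined as 0.0 when the class never occurs
        scores[name] = 2 * tp / denom if denom else 0.0
    return scores


if __name__ == "__main__":
    for label, f1 in per_class_f1(TOY_CONFUSION, LABELS).items():
        print(f"{label}: {f1:.2f}")
```

In this toy matrix most postchorus tokens are predicted as chorus, which pulls down the F1 of both classes. That is the same kind of chorus-adjacent confusion the abstract describes, shown here on invented numbers.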

Keywords: Lyrics Segmentation, Section Classification, Transformer, Longformer, Natural Language Processing, Music Structure Analysis, Chorus Detection, Lyric Modeling.




DOI: https://doi.org/10.62389/bina.v4i1.129



Bina : Jurnal Pembangunan Daerah is licensed under CC BY-NC 4.0