Automated Subject Identification using the Universal Decimal Classification: The ANN Approach

Aditi Roy; Saptarshi Ghosh

doi:10.17821/srels/2023/v60i2/170963

Authors

Aditi Roy
Research Scholar, Department of Library and Information Science, University of North Bengal, Raja Rammohunpur – 734014, West Bengal
Saptarshi Ghosh
Professor, Department of Library and Information Science, University of North Bengal, Raja Rammohunpur – 734014, West Bengal

Keywords:

Automatic Classification, BERT Model, KNIME, Multi-Label Classification, UDC (Universal Decimal Classification)

Abstract

Universal Decimal Classification (UDC) is a widely used controlled vocabulary for representing the subjects of documents. Text categorization determines a text's category, as reflected in the notation and text-label format of UDC. Using machine learning techniques together with UDC, the present work aims to develop a recommender system that helps end users (library professionals) classify documents automatically under the UDC scheme. The proposed work is conceived to determine and construct complex class numbers using UDC syntax. A corpus of documents classified with the UDC scheme serves as the training dataset; the documents were classified by people proficient in classificatory approaches. The study uses the BERT model and the KNIME software: the classified dataset is used to fine-tune the pre-trained BERT model and construct a semi-automatic classification model. The results show that the model was constructed with high accuracy and a high Area Under Curve (AUC) value, although its predictions showed a low accuracy rate. The study suggests that if the model is explicitly trained by annotating each concept, and if the full, licensed version of UDC class numbers becomes available, there is greater potential for developing an automated, freely faceted classification scheme for practical use.
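As a rough illustration of how multi-label predictions might be turned into a complex class number, the following sketch keeps every notation whose score reaches a cut-off and joins the survivors with the UDC relation sign (the colon). It is not the authors' pipeline: the function names, the 0.5 threshold, the example scores, and the plain string sort are all assumptions made for this example.

```python
# Illustrative sketch only. The threshold, the scores, and the joining rule
# are assumptions for this example, not the method used in the study.

def select_labels(scores, threshold=0.5):
    """Keep every UDC notation whose predicted score reaches the threshold.

    The result is sorted as plain strings so the output is deterministic;
    this is not the official UDC filing order.
    """
    return sorted(n for n, s in scores.items() if s >= threshold)


def compose_relation(notations):
    """Join simple class numbers with the UDC relation sign (colon)."""
    return ":".join(notations)


# Hypothetical per-label scores from a multi-label classifier.
# 02 = Librarianship, 004.8 = Artificial intelligence, 51 = Mathematics.
scores = {"02": 0.91, "004.8": 0.77, "51": 0.12}

selected = select_labels(scores)
print(compose_relation(selected))  # prints 004.8:02
```

A real recommender would also need the other UDC connecting signs and the common auxiliaries, which this sketch leaves out, and the suggested number would still be reviewed by a library professional.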
