From Text Corpus to Dewey Number: Designing a Prototype for Automated Classification
DOI:
https://doi.org/10.17821/srels/2024/v61i6/171643
Keywords:
Annif, Automated Indexing, Automatic Classification, DDC, NDCG, Neural Network
Abstract
This research explores the possibilities of an AI/ML-based automated indexing system for book collections in a library. Library classification systems are essentially pre-coordinated indexing approaches. Since the 1980s, researchers have used different techniques for synthesizing classification numbers automatically from a text corpus. With the advent of machine learning techniques in the late 1990s, a more recent approach uses a supervised learning algorithm to train a model on a set of documents that have been manually classified by trained library professionals using classification schemes such as UDC, DDC, or Colon Classification. The trained model (machine learning backend) learns patterns from the training data and then predicts the subject and class number for new documents. In the preliminary phase, we gathered a collection of more than 200,000 MARC 21-formatted bibliographic records from different libraries and curated these datasets to include Tag 082 (DDC Call Number), Tag 245 (Title of Document), Tag 520 (Summary Note), and Tag 650 (Subject Descriptors) for developing the final dataset. This final dataset was then divided into three sections: (i) a training dataset (96% of the final dataset); (ii) a validation dataset (2% of the final dataset); and (iii) a test dataset (2% of the final dataset). We deployed Annif, an open-source AI/ML framework, along with different backends supported by it (Associative group: FastText, Omikuji and SVC; and Ensemble: Simple and Neural Network). In the next stage, the framework was trained using each of these backend algorithms, and finally the results were combined into an ensemble based on a neural network model. To assess the effectiveness of these models, all of the machine-learning backends were compared using two retrieval metrics: F1@5 and NDCG.
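The 96/2/2 split and the two retrieval metrics named above can be illustrated with a minimal Python sketch. It is not the authors' code: the function names, the fixed random seed, and the use of binary relevance (a suggested class number either matches a catalogued one or it does not) are assumptions made for illustration, and Annif's own evaluation may differ in detail. Suggestions are assumed to be ranked best-first and free of duplicates.

```python
import math
import random


def split_dataset(records, train=0.96, validation=0.02, seed=42):
    """Shuffle records and split them into train/validation/test parts.

    The test part receives whatever remains after the first two parts.
    """
    shuffled = list(records)
    random.Random(seed).shuffle(shuffled)
    n_train = round(len(shuffled) * train)
    n_val = round(len(shuffled) * validation)
    return (
        shuffled[:n_train],
        shuffled[n_train:n_train + n_val],
        shuffled[n_train + n_val:],
    )


def f1_at_k(suggested, relevant, k=5):
    """F1 score computed over the top-k suggested class numbers."""
    top_k = suggested[:k]
    hits = sum(1 for s in top_k if s in relevant)
    if hits == 0:
        return 0.0
    precision = hits / len(top_k)
    recall = hits / len(relevant)
    return 2 * precision * recall / (precision + recall)


def ndcg(suggested, relevant, k=None):
    """Normalized discounted cumulative gain with binary relevance."""
    ranked = suggested if k is None else suggested[:k]
    # Each hit contributes 1 / log2(position + 1), with positions starting at 1.
    dcg = sum(
        1.0 / math.log2(rank + 2)
        for rank, s in enumerate(ranked)
        if s in relevant
    )
    # The ideal ranking places every relevant item at the top of the list.
    ideal_hits = min(len(relevant), len(ranked))
    idcg = sum(1.0 / math.log2(rank + 2) for rank in range(ideal_hits))
    return dcg / idcg if idcg > 0 else 0.0
```

Under these assumptions, F1@5 rewards a backend for placing correct class numbers anywhere in its top five suggestions, while NDCG additionally rewards placing them nearer the top of the ranking.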
For automated class number building, we found that the neural network model outperforms all other backends. Moreover, it is quite feasible to adopt these methods and tools for building a real-life automated classification system, as Annif supports REST API-based access for generating suggestions for DDC-based class numbers (along with accuracy scores) from given text corpora. The overall framework is based on open-source software, open datasets, and open standards.
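As a rough illustration of the REST-based access mentioned above, the following Python sketch posts a text to the `suggest` endpoint of Annif's REST API and reads back the suggested notations and scores. The base URL `http://localhost:5000` and the project identifier `ddc-nn` are hypothetical placeholders, not values reported in the article, and the response fields used here (`results`, `notation`, `label`, `score`) should be checked against the documentation of the installed Annif version.

```python
import json
import urllib.parse
import urllib.request


def build_suggest_request(base_url, project_id, text, limit=5):
    """Build a form-encoded POST request for a project's suggest endpoint."""
    url = f"{base_url.rstrip('/')}/v1/projects/{project_id}/suggest"
    body = urllib.parse.urlencode({"text": text, "limit": limit}).encode("utf-8")
    return urllib.request.Request(url, data=body, method="POST")


def parse_suggestions(payload):
    """Extract (notation, label, score) tuples from a JSON response body."""
    results = json.loads(payload).get("results", [])
    return [(r.get("notation"), r.get("label"), r.get("score")) for r in results]


def suggest_ddc(base_url, project_id, text, limit=5):
    """Send the text to a running Annif instance and return its suggestions."""
    request = build_suggest_request(base_url, project_id, text, limit)
    with urllib.request.urlopen(request) as response:
        return parse_suggestions(response.read().decode("utf-8"))
```

With an Annif server running locally, a call such as `suggest_ddc("http://localhost:5000", "ddc-nn", "Title and summary of a book")` would return a ranked list of candidate class numbers, assuming a project with that identifier has been configured and trained.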
License
Copyright (c) 2024 Journal of Information and Knowledge

This work is licensed under a Creative Commons Attribution-NonCommercial-NoDerivatives 4.0 International License.
The copyright of all articles published in the Journal of Information and Knowledge is held by the Publisher. The Sarada Ranganathan Endowment for Library Science (SRELS), as publisher, requires its authors to transfer the copyright prior to publication. This permits SRELS to reproduce, publish, distribute and archive the article in print and electronic form and also to defend against any improper use of the article.