From Text Corpus to Dewey Number: Designing a Prototype for Automated Classification

Authors

  • Department of Library and Information Science, Kalyani University, Kalyani – 741235, West Bengal
  • Department of Library and Information Science, Kalyani University, Kalyani – 741235, West Bengal

DOI:

https://doi.org/10.17821/srels/2024/v61i6/171643

Keywords:

Annif, Automated Indexing, Automatic Classification, DDC, NDCG, Neural Network

Abstract

This research explores the possibilities of an AI/ML-based automated indexing system for book collections in a library. Library classification systems are essentially pre-coordinated indexing approaches. Since the 1980s, researchers have used different techniques for synthesizing classification numbers automatically from a text corpus. Since machine learning techniques emerged in the late 1990s, a more recent approach has been to train a supervised learning model on a set of documents that trained library professionals have classified manually using schemes such as UDC, DDC, or Colon Classification. The trained model (the machine learning backend) learns patterns from the training data and then predicts the subject and class number for new documents. In the preliminary phase, we gathered more than 200,000 MARC 21-formatted bibliographic records from different libraries and curated them to include Tag 082 (DDC Call Number), Tag 245 (Title of Document), Tag 520 (Summary Note), and Tag 650 (Subject Descriptors) for developing the final dataset. This final dataset was then divided into three sections: (i) a training dataset (96% of the final dataset); (ii) a validation dataset (2% of the final dataset); and (iii) a test dataset (2% of the final dataset). We deployed Annif, an open-source AI/ML framework, along with different backends supported by it (Associative group: FastText, Omikuji, and SVC; Ensemble: Simple and Neural Network). In the next stage, the framework was trained using these backend algorithms, and finally the results were combined into an ensemble based on a neural network model. To assess the effectiveness of these models, all the machine-learning backends were compared using two retrieval metrics: F1@5 and NDCG.
For automated class number building, we found that the neural network model outperforms all other backends. Moreover, it is quite feasible to adopt these methods and tools for building a real-life automated classification system, as Annif supports REST API-based access for generating DDC class number suggestions (along with accuracy scores) from given text corpora. The overall framework is based on open-source software, open datasets, and open standards.
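
As an illustration of the 96/2/2 split and the two evaluation metrics named in the abstract, the following Python sketch may help. It is a simplified, binary-relevance reading of F1@5 and NDCG, not necessarily the exact computation that Annif performs; the function names (`split_dataset`, `f1_at_k`, `ndcg`), the shuffling seed, and the sample DDC numbers are assumptions made for this example and do not come from the study.

```python
import math
import random


def split_dataset(records, train=0.96, validation=0.02, seed=42):
    """Shuffle records and split them into train/validation/test parts.

    The 96/2/2 proportions follow the abstract; the seed is an arbitrary
    choice for reproducibility.
    """
    shuffled = list(records)
    random.Random(seed).shuffle(shuffled)
    n_train = round(len(shuffled) * train)
    n_val = round(len(shuffled) * validation)
    return (
        shuffled[:n_train],
        shuffled[n_train:n_train + n_val],
        shuffled[n_train + n_val:],  # whatever remains is the test set
    )


def f1_at_k(suggested, gold, k=5):
    """F1 over the top-k suggested class numbers for one document."""
    top = suggested[:k]
    hits = len(set(top) & set(gold))
    if hits == 0:
        return 0.0
    precision = hits / len(top)
    recall = hits / len(gold)
    return 2 * precision * recall / (precision + recall)


def ndcg(suggested, gold, k=None):
    """NDCG with binary relevance: a suggestion is relevant if it is in gold."""
    ranked = suggested if k is None else suggested[:k]
    # Discounted gain: a hit at 0-based rank r contributes 1 / log2(r + 2).
    dcg = sum(
        1.0 / math.log2(rank + 2)
        for rank, label in enumerate(ranked)
        if label in gold
    )
    # Ideal ordering places every relevant label at the top of the ranking.
    ideal_hits = min(len(gold), len(ranked))
    idcg = sum(1.0 / math.log2(rank + 2) for rank in range(ideal_hits))
    return dcg / idcg if idcg > 0 else 0.0


if __name__ == "__main__":
    train_set, val_set, test_set = split_dataset(range(1000))
    print(len(train_set), len(val_set), len(test_set))

    # Hypothetical ranked suggestions and manually assigned class numbers.
    suggestions = ["004", "005", "006", "510", "020"]
    assigned = ["005", "006.3"]
    print(f1_at_k(suggestions, assigned), ndcg(suggestions, assigned))
```

Averaging the per-document scores over the test set gives one figure per backend, which is one plausible way to compare backends on these metrics.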

References

Ahmed, M. (2023). Automatic indexing for agriculture: Designing a framework by deploying Agrovoc, Agris and Annif. Journal of Information and Knowledge, 60(2), 85-95. https://doi.org/10.17821/srels/2023/v60i2/170966

Ahmed, M., Mukhopadhyay, M., & Mukhopadhyay, P. (2023). Automated knowledge organization: AI/ML based subject indexing system for libraries. DESIDOC Journal of Library & Information Technology, 43(1), 45-54. https://doi.org/10.14429/djlit.43.01.18619

Bianchini, C. (2023). CCLitBox. A wikidata gadget to classify world literature. Journal of Information and Knowledge, 60(3), 133-141. https://doi.org/10.17821/srels/2023/v60i3/171024

Cheng, P. T., & Wu, A. K. (1995). ACS: An automatic classification system. Journal of Information Science, 21(4), 289-299. https://doi.org/10.1177/016555159502100405

Desale, S. K., & Kumbhar, R. M. (2013). Research on automatic classification of documents in library environment: A literature review. Knowledge Organization, 40(5), 295-304. https://doi.org/10.5771/0943-7444-2013-5-295

Golub, K. (2011). Automated subject classification of textual documents in the context of web-based hierarchical browsing. Knowledge Organization, 38(3), 230-244. https://doi.org/10.5771/0943-7444-2011-3-230

Golub, K. (2021). Automated subject indexing: An overview. Cataloging and Classification Quarterly, 59(8), 702-719. https://doi.org/10.1080/01639374.2021.2012311

Golub, K., Suominen, O., Mohammed, A. T., Aagaard, H., & Osterman, O. (2024). Automated Dewey decimal classification of Swedish library metadata using Annif software. Journal of Documentation. (ahead-to-print). https://doi.org/10.1108/JD-01-2022-0026

Gupta, S., Agarwal, M., & Jain, S. (2019). Automated genre classification of books using machine learning and natural language processing. 2019 9th International Conference on Cloud Computing, Data Science and Engineering (Confluence) (pp. 269-272). https://doi.org/10.1109/CONFLUENCE.2019.8776935

Halder, D., & Biswas, M. (2023). Machine-generated Colon class numbers: Automatic classification of Indian literary works in the wikidata environment. Journal of Information and Knowledge, 60(3), 143-149. https://doi.org/10.17821/srels/2023/v60i3/171025

Jenkins, C., Jackson, M., Burden, P., & Wallis, J. (1998). Automatic classification of web resources using Java and Dewey decimal classification. Computer Networks and ISDN Systems, 30(1-7), 646-648.

Junger, U. (2017). Automation first – The subject cataloguing policy of the Deutsche Nationalbibliothek. http://library.ifla.org/id/eprint/2213/

Mitra, R., & Mukhopadhyay, P. (2023). Machine learning applications in digital humanities: Designing a semiautomated subject indexing system for a low-resource domain. DESIDOC Journal of Library and Information Technology, 43(4), 219-225. https://doi.org/10.14429/djlit.43.04.19227

Mukhopadhyay, P. (2023). Machine learning and bibliographic data universe: Assessing efficacy of backend algorithms in Annif through retrieval metrics. Journal of Information and Knowledge, 60(1), 39-48. https://doi.org/10.17821/srels/2023/v60i1/170891

Panigrahi, P., & Prasad, A. R. D. (2007). Facet sequence in analytico synthetic scheme: A study for developing an AI based automatic classification system. Annals of Library and Information Studies, 54(1), 37-43.

Sebastiani, F. (2002). Machine learning in automated text categorization. ACM Computing Surveys, 34(1), 1-47. https://doi.org/10.1145/505282.505283

Suominen, O. (2019). Annif: DIY automated subject indexing using multiple algorithms. LIBER Quarterly: The Journal of the Association of European Research Libraries, 29(1), 1-25. https://doi.org/10.18352/lq.10285

Suominen, O., Inkinen, J., & Lehtinen, M. (2022). Annif and Finto AI: Developing and implementing automated subject indexing. JLIS.It, 13(1), 265-282. https://doi.org/10.4403/jlis.it-12740

Published

2024-12-03

How to Cite

Kerketta, S., & Mukhopadhyay, P. (2024). From Text Corpus to Dewey Number: Designing a Prototype for Automated Classification. Journal of Information and Knowledge, 61(6), 295–302. https://doi.org/10.17821/srels/2024/v61i6/171643
