Automatic Indexing for Agriculture: Designing a Framework by Deploying Agrovoc, Agris and Annif
DOI:
https://doi.org/10.17821/srels/2023/v60i2/170966
Keywords:
Agriculture, Annif, Automatic Subject Indexing, Ensemble, Neural Network, OpenRefine, Subject Indexing
Abstract
Machine learning can automate subject indexing in several ways. One popular strategy is to train a supervised learning model on a set of documents that have been manually indexed with a standard vocabulary. The resulting model can then predict the subjects of new, previously unseen documents by applying patterns learned from the training data. The first step is to gather a large dataset of documents in which each document has been manually assigned a set of subject keywords/descriptors from a controlled vocabulary (e.g., Agrovoc). Next, the dataset (here obtained from Agris) is divided into (i) a training dataset and (ii) a test dataset. The training dataset is used to train the model, while the test dataset is used to evaluate the model's performance. This research is an attempt to apply Annif (http://annif.org/), an open-source AI/ML framework, to autogenerate subject keywords/descriptors for documentary resources in the domain of agriculture. The training dataset is obtained from Agris, which applies the Agrovoc thesaurus as a vocabulary tool (https://www.fao.org/agris/download).
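The workflow described in the abstract (split an indexed dataset, train on one part, evaluate on the other) can be illustrated with a minimal, standard-library Python sketch. This is not Annif and does not use its API: the six records, their descriptors, and all function names (`split_dataset`, `train`, `suggest`, `evaluate`) are hypothetical, and the TF-IDF profile matching is only a simple stand-in for the algorithms a real framework would provide.

```python
import math
import random
from collections import Counter, defaultdict

# Hypothetical toy records: (text, manually assigned descriptors).
# Illustrative only; not taken from Agris, labels merely mimic Agrovoc terms.
DOCS = [
    ("drip irrigation improves water use in rice fields", {"irrigation", "rice"}),
    ("rice yield response to nitrogen fertilizer", {"rice", "fertilizers"}),
    ("sprinkler irrigation scheduling for maize", {"irrigation", "maize"}),
    ("nitrogen fertilizer uptake in maize crops", {"maize", "fertilizers"}),
    ("water saving irrigation methods for wheat", {"irrigation", "wheat"}),
    ("wheat response to phosphate fertilizer", {"wheat", "fertilizers"}),
]


def tokenize(text):
    """Lowercase whitespace tokenizer; a real pipeline would do much more."""
    return text.lower().split()


def split_dataset(docs, test_fraction=0.33, seed=0):
    """Shuffle reproducibly and return (training set, test set)."""
    shuffled = list(docs)
    random.Random(seed).shuffle(shuffled)
    n_test = max(1, round(len(shuffled) * test_fraction))
    return shuffled[n_test:], shuffled[:n_test]


def train(train_docs):
    """Build one TF-IDF weighted term profile per subject descriptor."""
    doc_freq = Counter()
    for text, _ in train_docs:
        doc_freq.update(set(tokenize(text)))
    n_docs = len(train_docs)
    idf = {t: math.log((1 + n_docs) / (1 + c)) + 1 for t, c in doc_freq.items()}
    profiles = defaultdict(Counter)
    for text, subjects in train_docs:
        for term, tf in Counter(tokenize(text)).items():
            for subject in subjects:
                profiles[subject][term] += tf * idf[term]
    return idf, dict(profiles)


def cosine(a, b):
    """Cosine similarity between two sparse term-weight mappings."""
    dot = sum(w * b.get(t, 0.0) for t, w in a.items())
    norm_a = math.sqrt(sum(w * w for w in a.values()))
    norm_b = math.sqrt(sum(w * w for w in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def suggest(model, text, limit=2):
    """Return up to `limit` subjects whose profiles best match the text."""
    idf, profiles = model
    vec = {t: tf * idf.get(t, 0.0) for t, tf in Counter(tokenize(text)).items()}
    scores = {s: cosine(vec, p) for s, p in profiles.items()}
    ranked = sorted(scores.items(), key=lambda kv: (-kv[1], kv[0]))
    return [s for s, score in ranked[:limit] if score > 0]


def evaluate(model, test_docs, limit=2):
    """Micro-averaged precision, recall and F1 against the manual indexing."""
    tp = fp = fn = 0
    for text, gold in test_docs:
        predicted = set(suggest(model, text, limit))
        tp += len(predicted & gold)
        fp += len(predicted - gold)
        fn += len(gold - predicted)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return precision, recall, f1


if __name__ == "__main__":
    train_docs, test_docs = split_dataset(DOCS)
    model = train(train_docs)
    print("suggested:", suggest(model, "irrigation of rice paddies"))
    print("precision, recall, F1:", evaluate(model, test_docs))
```

With only six records the scores are not meaningful; the sketch is intended to show where the training set, the test set, and the controlled-vocabulary descriptors enter the process.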
License
Copyright (c) 2023 Journal of Information and Knowledge

This work is licensed under a Creative Commons Attribution-NonCommercial-NoDerivatives 4.0 International License.
The copyright of all articles published in the Journal of Information and Knowledge is held by the publisher. The Sarada Ranganathan Endowment for Library Science (SRELS), as publisher, requires its authors to transfer the copyright prior to publication. This permits SRELS to reproduce, publish, distribute and archive the article in print and electronic form, and to defend against any improper use of the article.