Designing Question-Answer Based Search System in Libraries: Application of Open Source Retrieval Augmented Generation (RAG) Pipeline
DOI: https://doi.org/10.17821/srels/2024/v61i5/171583

Keywords:
Conversational AI, ChatGPT, Gemini, Generative AI, LangChain, Large Language Models (LLMs), Llama3, LlamaIndex, Mistral, NLP, Retrieval Augmented Generation (RAG)

Abstract
This study aims to build a prototype demonstrating that libraries can develop a low-cost conversational search system using open-source software tools and Large Language Models (LLMs) within a Retrieval-Augmented Generation (RAG) framework. LLMs often hallucinate and return outdated, non-contextualized responses; this experiment shows that they can deliver contextualized, relevant responses when augmented with a set of relevant documents before answer generation, the approach known as retrieval-augmented generation. The methodology involved building a RAG pipeline from tools such as LangChain, vector databases such as ChromaDB, and open-source LLMs such as Llama3 (a 70-billion-parameter model). For the prototype, a dataset of more than 250 relevant documents on the Chandrayaan-3 mission was collected, processed, and ingested into the pipeline. Finally, the study compared responses from standard LLMs with those from RAG-augmented LLMs. Standard LLMs (without RAG) produced confidently incorrect, hallucinated responses to queries about Chandrayaan-3, while RAG-augmented LLMs, supplied with relevant documents before generating a response, consistently provided accurate, informative, and contextualized answers. The study concludes that open-source RAG-based systems offer libraries a cost-effective way to enhance information retrieval and transform themselves into dynamic information services.
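To illustrate the kind of pipeline the abstract describes, the following is a minimal sketch, not the authors' actual code, of how LangChain's community integrations, ChromaDB, and Llama3 (served here through a local Ollama instance, an assumption) can be wired together. The document directory, file format, chunking parameters, and retrieval depth are illustrative assumptions rather than the study's exact configuration.

    # Illustrative RAG pipeline sketch. Paths, the model tag, chunk sizes,
    # and the retrieval depth k are assumptions for demonstration only.
    from langchain_community.document_loaders import DirectoryLoader, TextLoader
    from langchain_text_splitters import RecursiveCharacterTextSplitter
    from langchain_community.embeddings import OllamaEmbeddings
    from langchain_community.vectorstores import Chroma
    from langchain_community.llms import Ollama
    from langchain.chains import RetrievalQA

    # 1. Ingest: load the collected documents and split them into
    #    overlapping chunks suitable for embedding.
    documents = DirectoryLoader("./chandrayaan3_docs", glob="**/*.txt",
                                loader_cls=TextLoader).load()
    chunks = RecursiveCharacterTextSplitter(
        chunk_size=1000, chunk_overlap=100).split_documents(documents)

    # 2. Index: embed the chunks and persist them in a ChromaDB vector store.
    store = Chroma.from_documents(chunks, OllamaEmbeddings(model="llama3"),
                                  persist_directory="./chroma_db")

    # 3. Retrieve and generate: fetch the top-k relevant chunks and insert
    #    them into the prompt (the "stuff" chain type), so the LLM's answer
    #    is grounded in the retrieved context rather than its parameters alone.
    qa = RetrievalQA.from_chain_type(
        llm=Ollama(model="llama3"),
        retriever=store.as_retriever(search_kwargs={"k": 4}),
        chain_type="stuff")

    print(qa.invoke({"query": "When did Chandrayaan-3 land on the Moon?"})["result"])

The key design point is the last step: instead of asking the model to answer from its training data alone (where it may hallucinate), the retriever first pulls the most relevant document chunks from the vector store, and only then is the model asked to generate an answer conditioned on them.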
License
Copyright (c) 2024 Journal of Information and Knowledge
This work is licensed under a Creative Commons Attribution-NonCommercial-NoDerivatives 4.0 International License.
Copyright in all articles published in the Journal of Information and Knowledge is held by the publisher. The Sarada Ranganathan Endowment for Library Science (SRELS), as publisher, requires its authors to transfer copyright prior to publication. This permits SRELS to reproduce, publish, distribute, and archive the article in print and electronic form, and to defend against any improper use of the article.