Behrens, G., Karatzas, K., & Orlowski, C. (2024). Large language modeling method to support the analysis of environmental, social, and governance reporting e-metrics. In EnviroInfo 2024 (pp. 27–34). Gesellschaft für Informatik e.V.
Birti, M., Maurino, A., & Osborne, F. (2025). Optimizing large language models for ESG activity detection in financial texts. In Proceedings of the 6th ACM International Conference on AI in Finance (pp. 856–863).
Bronzini, M., Nicolini, C., Lepri, B., Passerini, A., & Staiano, J. (2024). Glitter or gold? Deriving structured insights from sustainability reports via large language models. EPJ Data Sci., 13(1), 1–41.
Chakraborty, S. (2024). Generative AI in modern education society. arXiv Preprint, arXiv:2412.08666.
Global Reporting Initiative. (n.d.). About GRI. https://www.globalreporting.org/about-gri/
Gupta, T., Goel, T., & Verma, I. (2025). Exploring multimodal language models for sustainability disclosure extraction: A comparative study. In The Sixth Workshop on Insights from Negative Results in NLP (pp. 141–149).
KPMG International. (2024). The move to mandatory reporting: Survey of Sustainability Reporting 2024. https://kpmg.com/xx/en/our-insights/esg/the-move-to-mandatory-reporting.html
Li, S. & Ramakrishnan, N. (2025). Oreo: A plug-in context reconstructor to enhance retrieval-augmented generation. In Proceedings of the 2025 International ACM SIGIR Conference on Innovative Concepts and Theories in Information Retrieval (ICTIR) (pp. 238–253). Padua, Italy.
Maharani, A. M. & Rozzaid, Y. R. (2022). GRI standards-based corporate social responsibility disclosure toward market reaction in energy companies listed in Indonesia stock exchange. Int. Soc. Sci. Humanit., 1(1), 72–79.
Melton, C., Sorokine, A., & Peterson, S. (2025). Evaluating retrieval augmented generative models for document queries in transportation safety. arXiv Preprint, arXiv:2504.07022.
Ni, J., Bingler, J., Colesanti-Senni, C., Kraus, M., Gostlow, G., Schimanski, T., Stammbach, D., Vaghefi, S. A., Wang, Q., Webersinke, N., et al. (2023). CHATREPORT: Democratizing sustainability disclosure analysis through LLM-based tools. arXiv Preprint, arXiv:2307.15770.
Polignano, M., Bellantuono, N., Lagrasta, F. P., Caputo, S., Pontrandolfo, P., & Semeraro, G. (2022). An NLP approach for the analysis of global reporting initiative indexes from corporate sustainability reports. In Proceedings of the first computing social responsibility workshop within the 13th language resources and evaluation conference (pp. 1–8). Marseille, France. https://aclanthology.org/2022.csrnlp-1.1/
Verma, S. (2024). Contextual compression in retrieval-augmented generation for large language models: A survey. arXiv Preprint, arXiv:2409.13385.
Villacampa-Porta, J., Coronado-Vaca, M., & Garrido-Merchán, E. C. (2025). Impact of EU non-financial reporting regulation on Spanish companies’ environmental disclosure: A cutting-edge natural language processing approach. Environ. Sci. Eur., 37(1), 1–33.
Wu, Q., Xiang, X., Hejia, H., Wang, X., Wei Jie, Y., Satapathy, R., Filho, R. S., & Veeravalli, B. (2025). SusGen-GPT: A data-centric LLM for financial NLP and sustainability report generation. In Findings of the Association for Computational Linguistics: NAACL 2025 (pp. 1184–1203). Albuquerque, New Mexico.
Yang, J. Y., Chi, R. H., Wu, C. C., Chen, L. J., Lin, W. M., Hu, H. W., & Cheng, H. R. (2024). EcoSmartGuide: Language learning model and retrieval-augmented generation-based platform for streamlined environmental, social, and governance information access and report generation. In 2024 IEEE 6th Eurasia Conference on Biomedical Engineering, Healthcare and Sustainability (ECBIOS) (pp. 343–347).
Zahid, M., Naqvi, S. U., Jan, A., Rahman, H. U., & Wali, S. (2023). The nexus of environmental, social, and governance practices with the financial performance of banks: A comparative analysis for the pre and COVID-19 periods. Cogent Econ. Financ., 11(1), 2183654.
Zou, Y., Shi, M., Chen, Z., Deng, Z., Lei, Z., Zeng, Z., Yang, S., Tong, H., Xiao, L., & Zhou, W. (2025). ESGReveal: An LLM-based approach for extracting structured data from ESG reports. J. Clean. Prod., 489, 144572.
Research article

The Use of Large Language Models in Sustainability Reporting: Performance Analysis of RAG and LoRA Techniques

Berra Öz 1,*, Ali Hakan Işık 2
1 Department of Computer Technologies and Information Systems, Burdur Mehmet Akif Ersoy University, 15100 Burdur, Turkey
2 Department of Computer Engineering, Faculty of Engineering-Architecture, Burdur Mehmet Akif Ersoy University, 15100 Burdur, Turkey
Challenges in Sustainability | Volume 14, Issue 3, 2026 | Pages 589-604
Received: 01-19-2026,
Revised: 06-01-2026,
Accepted: 06-09-2026,
Available online: N/A

Abstract:

An Environmental, Social, and Governance (ESG) report is an essential information source for evaluating a company’s performance in sustainability practices. Organizations structure their environmental impacts, social responsibilities, and governance practices within a defined framework. This standardization is provided by the Global Reporting Initiative (GRI), which constitutes an internationally recognized guideline for sustainability reporting. Traditional reporting workflows are time-consuming for organizations and prone to data-entry errors, which limits the reliability of disclosed information. In this context, leveraging the capabilities of Large Language Models (LLMs) offers significant time and resource savings. This study uses the Llama-3.1-8B-Instruct model under two scenarios, Retrieval-Augmented Generation (RAG) and Low-Rank Adaptation (LoRA) fine-tuning, to analyze 30 food-sector ESG reports and produce ESG summaries, SWOT analyses, and GRI-aligned recommendations. The two approaches are evaluated on a stratified hold-out set of 6 unseen test reports (24 reports used for training) under a fair, matched-budget setup in which RAG retrieves the target report at inference. On four quality metrics, LoRA achieved higher mean scores than RAG; however, statistically significant differences were observed in only 4 of the 12 task–metric comparisons. Token usage was comparable, whereas RAG was substantially faster at inference. Rather than favoring one approach over the other, these findings reveal a trade-off between output quality and computational efficiency: LoRA yields quality gains on specific metrics, whereas RAG is substantially more efficient at inference. Given the limited size of the held-out test set, these results should be interpreted with caution.

Keywords: Large Language Models, Retrieval-Augmented Generation, Low-Rank Adaptation, Environmental, Social, and Governance reports, Global Reporting Initiative

1. Introduction

The term ESG, an abbreviation of Environmental, Social, and Governance, was introduced in 2004 in a report published under the United Nations Global Compact. Companies publish an ESG report each year in which they provide a detailed account of their activities and performance related to ESG issues. ESG reporting plays a significant role for companies by enhancing corporate credibility and fostering strong relationships among stakeholders. Moreover, it contributes to creating value for businesses by strengthening brand reputation. Today, companies are increasingly developing and implementing ESG-oriented practices (Zahid et al., 2023).

Sustainability reports structured through reporting standards such as the Global Reporting Initiative (GRI), the Sustainability Accounting Standards Board (SASB), and the Corporate Sustainability Reporting Directive (CSRD) enable businesses of all scales to assess and analyze their impacts on the environment, society, and the economy (Maharani & Rozzaid, 2022). The GRI is an international organization that provides standards for sustainability reporting. These standards are utilized by approximately 14,000 organizations across more than 100 countries and, according to KPMG’s 2024 Survey of Sustainability Reporting, represent the most widely adopted sustainability reporting framework worldwide (KPMG International, 2024). The Global Sustainability Standards Board (GSSB) regularly revises and improves the GRI Standards to ensure they remain current and align with the information demands of all stakeholders (Global Reporting Initiative, n.d.).

Large Language Models (LLMs), trained on massive datasets, represent a significant milestone for artificial intelligence by consistently producing coherent outputs across various domains such as information extraction, content generation, text summarization, translation, and text classification (Chakraborty, 2024). The rise of LLMs has driven transformation in various sectors and led to innovative solutions for complex problems (Melton et al., 2025). Sustainability reports have a high volume of data; since manual analysis is costly for many firms, it can only be performed by a limited number of organizations (Ni et al., 2023). Furthermore, the inclusion of diverse formats within sustainability reports, such as images, graphs, and tables, restricts their analytical tractability (Gupta et al., 2025). LLMs enable the transformation of raw and unstructured data into consistent and structured information through information extraction. Owing to these capabilities, they automate reporting processes and facilitate significant savings in both time and resources.

Although the importance of sustainability and its reporting are widely acknowledged by businesses today, the implementation process still faces various challenges and barriers, as noted in the previous paragraph. At this point, LLM-based ESG reporting tools enhance efficiency, making the process significantly easier and more feasible. This study compares two approaches for LLM-assisted analysis of ESG reports in the food sector, namely Retrieval-Augmented Generation (RAG) and Low-Rank Adaptation (LoRA) fine-tuning, both built on the Llama-3.1-8B-Instruct model. Using a corpus of 30 ESG reports, each approach is applied to summarize the reports, derive a SWOT analysis from the summaries, and produce solution recommendations aligned with GRI standards for the identified weaknesses and threats. The two approaches are compared on a held-out test set under matched conditions, using four quality metrics (Faithfulness, Factual Consistency, Answer Correctness, and Answer Relevancy) together with operational measures of token consumption and processing time. Ultimately, the study aims to contribute to the understanding of how LLMs can support the assessment of corporate sustainability performance.

2. Literature Review

2.1 LLMs in ESG and Sustainability Reporting

Birti et al. (2025) focused on optimizing the use of LLMs for classifying and detecting ESG-related activities in financial and corporate documents. Models such as Llama, Gemma, Mistral, and GPT-4o Mini were used in the study. The highest performance was achieved by the Llama2_7B model with an F1 score of 84.9%.

Villacampa-Porta et al. (2025) analyzed the scope and quality of 729 reports prepared under voluntary and mandatory systems between 2015 and 2022. Using Transformer-based ClimateBERT models fine-tuned with the ClimaText database, the authors concluded that the fine-tuned model showed higher precision in climate texts.

Behrens et al. (2024) focused on the use of LLMs in the creation of ESG reports. The Llama-3.1-405b-instruct and Mistral-large-2-instruct models were analyzed to assess their suitability for the reporting process. It was concluded that the Mistral-large-2-instruct model produced predictions that were highly consistent with the reference text.

Polignano et al. (2022) developed a system based on Natural Language Processing (NLP) techniques using Italian and English sustainability reports that complied with GRI Standards. The system identified references to various sustainability topics, their page numbers, the content of each reference, and, through sentiment analysis, whether each reference was positive or negative. The study reported F1 scores ranging from 0.87210 to 0.94365.

2.2 RAG-Based Approaches for ESG and Report Analysis

A system called ESGReveal was introduced to analyze unstructured ESG reports and evaluate corporate performance. The study encompassed 2249 ESG reports from 166 companies across 12 different sectors listed on the Hong Kong Stock Exchange. Utilizing various LLMs such as GPT-3.5, GPT-4, ChatGLM and QWEN, in combination with RAG technology, the study reported the highest information extraction accuracy with GPT-4, achieving 76.9% (Zou et al., 2025).

EcoSmartGuide, an LLM- and RAG-based system, was developed to simplify the ESG reporting process. The system automates the collection and analysis of ESG data and is reported to reflect ESG information with 95% accuracy (Yang et al., 2024).

Bronzini et al. (2024) focused on automatic knowledge extraction from unstructured data in ESG reports. LLMs were used in conjunction with the RAG architecture and the In-Context Learning method.

A system called CHATREPORT was developed for automated analysis of sustainability reports, based on the Task Force on Climate-Related Financial Disclosures (TCFD) reporting framework. ChatGPT was used as the base LLM, and LangChain was used for API operations and vector-based retrieval processes, while OpenAI’s text-embedding-ada-002 model was chosen for the text embedding process. The study concluded that the developed system had a low hallucination rate and that errors were easily detected by users (Ni et al., 2023).

A system called SusGen-GPT was developed to facilitate the creation of sustainability reports based on the TCFD reporting framework. Models were trained on the SusGen-30K dataset, and a report generation system was developed by integrating the RAG technique (Wu et al., 2025).

A Knowledge Graph-Retrieval Augmented Generation (KG-RAG) based system integrated with an LLM was developed to extract information from corporate ESG data and sustainability-focused news content, thus enabling user queries to be answered from these sources (Gupta et al., 2025).

These studies demonstrate the increasing applicability of LLMs in ESG-related text analysis, reporting, and sustainability-oriented document understanding. However, most studies primarily focus on information extraction, classification, or report generation tasks, while comparative analyses between retrieval-based and parameter-efficient adaptation approaches remain limited. In addition, existing studies generally evaluate either RAG-based systems or fine-tuned LLM architectures independently, without providing a direct comparison between these approaches within the same ESG domain and task setting. To address this gap, the present study comparatively evaluates RAG and LoRA-based architectures using food sector ESG reports across multiple ESG-oriented tasks, including summarization, SWOT analysis, and GRI-based recommendation generation.

3. Methodology

3.1 Dataset

The detailed information regarding the ESG report corpus used in this study is presented in Table 1.

Table 1. Details of the Environmental, Social, and Governance (ESG) report corpus used in the study

| Company Code | Country | Reporting Year | Language | File Format | Size (MB) | Tables/Figures Usage | GRI Standards Usage |
|---|---|---|---|---|---|---|---|
| A01 | USA | 2023 | English | PDF | 10.4 | Yes | Yes |
| A02 | Brazil | 2023 | English | PDF | 13.5 | Yes | Yes |
| A03 | Italy | 2023 | English | PDF | 28.5 | Yes | Yes |
| A04 | Switzerland | 2023 | English | PDF | 22.7 | Yes | Yes |
| A05 | USA | 2023 | English | PDF | 12.0 | Yes | Yes |
| A06 | USA | 2023 | English | PDF | 6.1 | Yes | Yes |
| A07 | Switzerland | 2023 | English | PDF | 16.4 | Yes | Yes |
| A08 | USA | 2023 | English | PDF | 13.3 | Yes | Yes |
| A09 | Türkiye | 2023 | Turkish | PDF | 9.3 | Yes | Yes |
| A10 | Türkiye | 2023 | Turkish | PDF | 12.6 | Yes | Yes |
| A11 | Italy | 2023 | English | PDF | 34.3 | Yes | Yes |
| A12 | New Zealand | 2023 | English | PDF | 24.6 | Yes | Yes |
| A13 | USA | 2023 | English | PDF | 18.8 | Yes | Yes |
| A14 | USA | 2023 | English | PDF | 2.8 | Yes | Yes |
| A15 | USA | 2023 | English | PDF | 33.5 | Yes | Yes |
| A16 | Brazil | 2023 | English | PDF | 15.3 | Yes | Yes |
| A17 | Türkiye | 2023 | Turkish | PDF | 13.5 | Yes | Yes |
| A18 | USA | 2023 | English | PDF | 17.3 | Yes | Yes |
| A19 | France | 2023 | English | PDF | 5.8 | Yes | Yes |
| A20 | Canada | 2023 | English | PDF | 9.5 | Yes | Yes |
| A21 | USA | 2023 | English | PDF | 3.7 | Yes | Yes |
| A22 | Türkiye | 2023 | Turkish | PDF | 25.8 | Yes | Yes |
| A23 | Türkiye | 2023 | Turkish | PDF | 14.8 | Yes | Yes |
| A24 | Thailand | 2023 | English | PDF | 21.8 | Yes | Yes |
| A25 | USA | 2023 | English | PDF | 25.7 | Yes | Yes |
| A26 | USA | 2023 | English | PDF | 17.3 | Yes | Yes |
| A27 | Türkiye | 2023 | English | PDF | 51.4 | Yes | Yes |
| A28 | Türkiye | 2023 | Turkish | PDF | 8.0 | Yes | Yes |
| A29 | USA | 2023 | English | PDF | 28.6 | Yes | Yes |
| A30 | Switzerland | 2023 | English | PDF | 18.9 | Yes | Yes |

Note: Global Reporting Initiative (GRI).

Within the scope of this study, the food sector was selected due to its high ESG materiality and the simultaneous presence of ESG dimensions within a single industrial domain. The food sector involves a broad range of ESG-related themes, such as agricultural sustainability, emissions management, labor rights, and supply-chain transparency, making it a suitable domain for context-aware LLM evaluation. In addition, the presence of the GRI 13 (Agriculture, Aquaculture and Fishing Sectors) Sector Standard also enhances the suitability of the food sector for retrieval-based ESG analysis scenarios. Furthermore, the availability of publicly accessible ESG reports in the food sector supported the creation of a comparable evaluation corpus.

The use of 30 ESG reports was considered appropriate because the study primarily aims to perform a methodological comparison between RAG- and LoRA-based architectures rather than to establish a statistical representation of the entire global food sector. Accordingly, the same ESG report collection was employed across both integration scenarios to maintain a consistent and comparable experimental setting. To assess generalization rather than memorization, the 30 ESG reports were partitioned into a stratified holdout split, with 24 reports assigned to the training set and 6 reports reserved for the held-out test set. Instead of applying a fully random partitioning strategy, the dataset was split to preserve the representation of food-sector sub-categories across both the training and held-out test sets.
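A stratified 24/6 hold-out of this kind can be sketched as follows. This is a minimal illustration, not the authors' procedure: the sub-category names, their assignment to company codes, the function name `stratified_holdout`, and the random seed are all hypothetical, since the source does not list the sub-category of each report.

```python
import random
from collections import defaultdict

def stratified_holdout(labels, n_test, seed=0):
    """Split item ids into (train, test) so each stratum keeps its share in test.

    labels: dict mapping item id -> stratum label (hypothetical sub-categories).
    """
    rng = random.Random(seed)
    groups = defaultdict(list)
    for item, label in sorted(labels.items()):
        groups[label].append(item)

    total = len(labels)
    # Ideal (fractional) number of test items per stratum.
    ideal = {g: len(items) * n_test / total for g, items in groups.items()}
    quota = {g: int(ideal[g]) for g in groups}
    # Hand out any remaining slots by largest fractional remainder.
    leftover = n_test - sum(quota.values())
    by_remainder = sorted(groups, key=lambda g: ideal[g] - quota[g], reverse=True)
    for g in by_remainder[:leftover]:
        quota[g] += 1

    test = []
    for g, items in groups.items():
        rng.shuffle(items)
        test.extend(items[:quota[g]])
    test_set = set(test)
    train = [item for item in sorted(labels) if item not in test_set]
    return train, sorted(test)

# Hypothetical labels: the paper does not publish per-report sub-categories.
SUBCATEGORIES = ["dairy", "beverages", "meat", "confectionery", "grains", "retail"]
labels = {f"A{i:02d}": SUBCATEGORIES[(i - 1) % 6] for i in range(1, 31)}
train_ids, test_ids = stratified_holdout(labels, n_test=6)
```

With six equally sized placeholder strata, each contributes exactly one report to the test set, leaving 24 for training.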

A central distinction in this study concerns where leakage must be prevented. Train-test leakage is a concern only for LoRA fine-tuning, where the model must not be trained on the reports on which it is later evaluated. For RAG, retrieving the content of the target report at inference time is the intended use of RAG and does not constitute leakage. Accordingly, the six held-out test reports were excluded only from LoRA fine-tuning: the LoRA instruction–response training pairs were constructed solely from the 24 training reports (together with the GRI standards), and the held-out reports were never seen during fine-tuning.

For RAG, the retrievable knowledge base is constructed per target report at inference time. When a held-out report is analyzed, a FAISS index (using the all-MiniLM-L6-v2 sentence-embedding model) is built from that report’s own text chunks; for the GRI-aligned recommendation task, the GRI standards are additionally indexed and retrieved, with report and GRI passages interleaved so that the prompt contains both the company’s own content and GRI standard text (reference/bibliography sections of the GRI standards were excluded from the GRI corpus to avoid retrieving non-substantive citation lists). The 24 training reports are not part of the retrievable base, since each report is summarized and analyzed from its own content. This ensures that RAG can access the very documents it is expected to summarize, analyze, and use for GRI-aligned recommendations. For the recommendation task, both methods incorporate GRI knowledge, though through different mechanisms: RAG retrieves GRI text into the prompt, whereas the fine-tuned model relies on GRI patterns learned during fine-tuning.
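The interleaving of report and GRI passages can be illustrated with a short sketch. The source does not state the exact ordering rule, so strict alternation (report first) is an assumption here, and the function name `interleave` and the placeholder passages are hypothetical.

```python
from itertools import zip_longest

def interleave(report_passages, gri_passages, limit=None):
    """Alternate report and GRI passages, keeping each list's own ranking order."""
    merged = []
    for report_p, gri_p in zip_longest(report_passages, gri_passages):
        if report_p is not None:
            merged.append(("report", report_p))
        if gri_p is not None:
            merged.append(("gri", gri_p))
    return merged if limit is None else merged[:limit]

# Placeholder passages standing in for retrieved chunks.
context = interleave(["r1", "r2", "r3"], ["g1", "g2"], limit=4)
```

Alternating the two sources keeps both the company's own content and GRI text inside the prompt even when the token cap truncates the tail of the context.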

In summary, the two settings differ only in how each obtains the target report’s content at inference, not in whether or how much they can access it. The fine-tuned (LoRA) model is never fine-tuned on the held-out reports so that its outputs reflect task-level generalization rather than report-specific memorization, yet it receives the content of the target report in-context at inference (the first ~1,200 words, ≈1,700 tokens). The RAG model is allowed to retrieve the target test report during inference, which is the normal use case of RAG, with its prompt capped at 2,048 tokens, comparable to the in-context budget of the fine-tuned model. To keep the comparison fair, both arms use matched generation budgets and identical decoding settings (up to 384 new tokens; temperature 0.4; repetition penalty 1.2; no-repeat-trigram). Both methods, therefore, operate on previously unseen reports with comparable access to the target report’s content when generating summaries, SWOT analyses, and GRI-aligned recommendations, so that observed performance differences reflect the methods themselves rather than unequal access or generation budgets. Finally, practical computational considerations also influenced the dataset size, as both LoRA fine-tuning and RAG-based inference require substantial processing resources and extended evaluation time.
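The matched budgets described above can be written down as a small sketch. The keyword names follow the Hugging Face `generate` convention; `do_sample=True` is an assumption (the source reports a temperature but does not say whether sampling was enabled), and `first_n_words` and the stand-in text are hypothetical.

```python
def first_n_words(text, n=1200):
    """Keep the first n whitespace-separated words of a report."""
    return " ".join(text.split()[:n])

# Decoding settings shared by both arms, as reported in the text.
# `do_sample` is an assumption; the source does not state it.
GENERATION_KWARGS = {
    "max_new_tokens": 384,
    "temperature": 0.4,
    "repetition_penalty": 1.2,
    "no_repeat_ngram_size": 3,
    "do_sample": True,
}

report_text = "word " * 5000                 # stand-in for an extracted report
lora_context = first_n_words(report_text)    # in-context input for the LoRA arm
```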

3.2 System Architecture

Our study focuses on the analysis of 30 distinct ESG reports from the food sector using the Llama-3.1-8B-Instruct model under two different integration scenarios: RAG and LoRA fine-tuning. To assess generalization rather than memorization, the 30 reports are divided into a stratified hold-out split of 24 training reports and 6 held-out test reports; both scenarios are evaluated on the same 6 held-out reports under matched generation budgets and identical decoding settings, so that the two approaches are compared fairly.

In the first scenario, the RAG technique is employed. Rather than building a single static database from all reports, the retrievable knowledge base is constructed per target report at inference time: for each held-out report under analysis, a vector index is built from that report’s own text chunks (and, for the GRI-aligned recommendation task, from the GRI standards). The system semantically retrieves the most relevant text chunks based on the user query and subsequently incorporates these chunks into the prompt, referred to as the model input. Consequently, the generated output is not solely dependent on the training data but is also augmented by the content of the target report retrieved at inference. This approach aims to increase the accuracy of the outputs and minimize the model’s hallucination risk. Because RAG retrieves the very document it is asked to analyze, the held-out reports are available to RAG at inference, the intended use of RAG, while remaining excluded from LoRA fine-tuning.

The second scenario utilizes the LoRA technique. The model is fine-tuned using the LoRA method on the 24 training reports only; the 6 held-out reports are never used in fine-tuning. Crucially, the model’s original parameters are not updated during this process. Instead, the update to each large weight matrix is approximated by the product of two smaller matrices. Only these smaller matrices are updated during training, while the model’s original parameters remain static. At inference, the fine-tuned model receives the content of the target held-out report in-context. This approach aims to enhance the model’s competence in a specific domain, enabling it to produce more consistent and domain-specific outputs.

In both integration scenarios, the ESG reports were summarized, a SWOT analysis was conducted based on the summarized information, and in the final step, GRI Standards-compliant solution proposals were developed for the weaknesses and threats identified from the SWOT analysis. The resulting outputs, produced under matched generation budgets and identical decoding settings, were evaluated on the 6 held-out reports, and a performance comparison was conducted using metrics such as Faithfulness, Factual Consistency, Answer Correctness, and Answer Relevancy, in addition to token consumption and computational time.

The workflow of the RAG and LoRA-enhanced LLM system is shown in Figure 1.

Figure 1. Workflow of the Retrieval-Augmented Generation (RAG)- and Low-Rank Adaptation (LoRA)-enhanced Large Language Models (LLM) system

3.3 Research Questions

1. What differences can be observed in the performance of the Llama-3.1-8B-Instruct models, structured with RAG and LoRA, in the processes of analyzing and summarizing ESG reports from the food sector, conducting SWOT analyses, and generating solution recommendations aligned with GRI standards?

2. Which model approach is more resource-efficient in terms of token usage and computational time?

3. In the evaluation conducted across the companies included in the study, which technique demonstrated higher overall performance?

3.4 Model Llama-3.1-8B-Instruct

Developed by Meta and belonging to the Llama 3.1 model family, the Llama-3.1-8B-Instruct model is an open-source LLM based on the Transformer architecture, consisting of 8 billion parameters and pre-trained on a dataset containing approximately 15 trillion tokens. The model supports a 128,000-token context window, which enables the processing of long texts as a whole and provides a stronger architectural capability compared to its predecessors.

With the use of the Grouped-Query Attention (GQA) technique, memory overhead and computational costs during inference are optimized, enabling more efficient and low-latency usage. Fine-tuned using Supervised Fine-Tuning (SFT) and Direct Preference Optimization (DPO), the model has improved capabilities in following user instructions accurately and producing outputs that directly provide human-oriented value. Architectural improvements minimize the “lost in the middle” problem and ensure high performance in complex reasoning tasks.

3.5 Retrieval-Augmented Generation

RAG is a technique that enables LLMs to produce more accurate and specific responses by augmenting them with external knowledge sources. The adoption of this technique with LLMs has led to a significant shift in how these models access and process information. This integration has significantly reduced both the problem of outdated knowledge and the risk of hallucination (Verma, 2024).

The RAG approach combines the strengths of two main stages: retrieval-based (information fetching) and generation-based (text production). The retrieval technique searches external knowledge sources for relevant chunks pertaining to the user’s query. In the generation technique, the chunks identified in the previous step are provided to the LLM. The model then generates its response to the user by utilizing both its internal training data and the external knowledge sources. In summary, while the retrieval step provides access to various data sources, the generation step creates contextually appropriate content, thus achieving effective results in the field of Natural Language Processing (Li & Ramakrishnan, 2025).

In this study, RAG is configured to operate per target report, so that at inference the model retrieves the very document under analysis. For each held-out report, a FAISS index (IndexFlatL2) is built from that report’s own text chunks (800-word windows, 120-word overlap) encoded with the all-MiniLM-L6-v2 model, and the six most relevant chunks to the task query are retrieved into the prompt. For summarization and SWOT, the retrievable base consists solely of the target report; for the GRI-aligned recommendation task, the GRI standards are additionally indexed, and report and GRI passages are retrieved jointly and interleaved. Reference/bibliography sections of the GRI standards are excluded from the corpus to avoid retrieving non-substantive citation lists. To ensure a fair comparison, the RAG prompt is capped at 2,048 tokens and generation is limited to 384 new tokens under fixed decoding (temperature 0.4, repetition penalty 1.2, no-repeat trigram); the base model is loaded in 4-bit NF4. The training reports are not part of the retrievable base.
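The chunking and retrieval settings above can be sketched in plain Python. This is an illustration under stated assumptions rather than the authors' code: `chunk_words` and `top_k_l2` are hypothetical helpers, the exhaustive squared-L2 search stands in for a FAISS `IndexFlatL2` lookup (which ranks by the same distance), and the toy vectors replace the sentence embeddings that all-MiniLM-L6-v2 would produce.

```python
def chunk_words(text, size=800, overlap=120):
    """Split text into windows of `size` words that overlap by `overlap` words."""
    words = text.split()
    step = size - overlap
    chunks = []
    for start in range(0, len(words), step):
        chunks.append(" ".join(words[start:start + size]))
        if start + size >= len(words):
            break  # the last window already reaches the end of the text
    return chunks

def top_k_l2(query_vec, chunk_vecs, k=6):
    """Exhaustive squared-L2 search; returns indices of the k nearest chunks."""
    scored = [
        (sum((q - c) ** 2 for q, c in zip(query_vec, vec)), idx)
        for idx, vec in enumerate(chunk_vecs)
    ]
    return [idx for _, idx in sorted(scored)[:k]]

text = " ".join(f"w{i}" for i in range(2000))   # stand-in for one report
chunks = chunk_words(text)

# Toy 2-dimensional vectors in place of real sentence embeddings.
nearest = top_k_l2([0.0, 0.0], [[1.0, 1.0], [0.0, 0.0], [3.0, 3.0]], k=2)
```

For a 2,000-word text the windows start at words 0, 680, and 1,360, so consecutive chunks share 120 words.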

3.6 Low-Rank Adaptation

LoRA, an abbreviation for Low-Rank Adaptation, is a technique used to fine-tune LLMs for specific tasks. Unlike traditional fine-tuning methods, LoRA does not require retraining all of the model’s parameters. In conventional approaches, the entire set of parameters is retrained, a process that is both time-consuming and costly.

LoRA represents the newly acquired knowledge during fine-tuning with low-rank matrices and updates only these matrices during the training phase, while the other parameters remain static. Once the training process is complete, the low-rank matrices are integrated with the pre-trained model parameters, customizing the model for the desired task. As noted, performing fine-tuning with LoRA, rather than training the entire model, requires less computational power and reduces costs. Furthermore, since the number of parameters that need to be stored is lower compared to traditional methods, a reduction in memory consumption is also observed.
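In the standard LoRA formulation, the description above corresponds to the following update. The symbols are notation introduced here for illustration: $W_0$ is a frozen pre-trained weight matrix of shape $d \times k$, $A$ and $B$ are the trainable low-rank factors, $r$ is the rank, and $\alpha$ is the scaling hyperparameter.

```latex
W = W_0 + \Delta W = W_0 + \frac{\alpha}{r}\, B A,
\qquad B \in \mathbb{R}^{d \times r},\quad A \in \mathbb{R}^{r \times k},\quad r \ll \min(d, k)
```

Only $A$ and $B$ receive gradients, so each adapted matrix contributes $r(d + k)$ trainable values instead of $dk$. Because $\Delta W$ has the same shape as $W_0$, it can be added into the base weights once training is complete.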

Quantization is an optimization technique that converts the parameters of LLMs from high-precision formats such as float32 or float16 to lower-bit representations such as 8-bit or 4-bit. This significantly reduces bandwidth usage and memory consumption, lowering hardware requirements and enabling faster inference. The Quantized Low-Rank Adaptation (QLoRA) method, developed based on this technique, reduces memory usage by compressing model weights to the 4-bit level, while performing the training process through low-rank LoRA adapters added to these weights. These adapter matrices are updated in FP16/BF16 precision, which allows the model to preserve performance while significantly reducing memory consumption. This structure enables the fine-tuning process to be carried out efficiently without substantial performance degradation.
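A rough back-of-the-envelope calculation shows the scale of the memory saving. The sketch assumes a nominal 8 billion parameters and counts weight storage only; it ignores quantization constants, activations, and optimizer state, and `weight_memory_gb` is a hypothetical helper.

```python
def weight_memory_gb(n_params, bits_per_param):
    """Approximate storage for model weights alone, in gigabytes (10^9 bytes)."""
    return n_params * bits_per_param / 8 / 1e9

N_PARAMS = 8e9  # nominal parameter count of an 8B model (assumption)

fp32 = weight_memory_gb(N_PARAMS, 32)
fp16 = weight_memory_gb(N_PARAMS, 16)
int4 = weight_memory_gb(N_PARAMS, 4)
```

Under these assumptions the weights take about 32 GB in float32, 16 GB in float16, and 4 GB at 4 bits.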

In this study, the model is fine-tuned with LoRA on the 24 training reports only; the six held-out reports are excluded from fine-tuning to prevent leakage. LoRA adapters of rank r = 16 (α = 32, dropout 0.1) are applied to the projection matrices (q/k/v/o, gate/up/down), with the base model in 4-bit NF4 (QLoRA). Training uses a learning rate of 2e-4 for three epochs, batch size 1 with gradient accumulation 16 (effective batch 16), and a max sequence length of 512. At inference, the fine-tuned model is given the first ~1,200 words of the target report in-context and generates up to 384 new tokens under the same decoding settings used for RAG (temperature 0.4, repetition penalty 1.2, no-repeat trigram), so the two approaches are compared under matched generation budgets.
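To give a sense of how small the adapter is, the number of trainable LoRA parameters implied by this configuration can be estimated as below. The layer dimensions are assumptions taken from the publicly released Llama-3.1-8B configuration (hidden size 4096, MLP size 14336, key/value projection width 1024, 32 decoder layers) and are not stated in the source; `lora_params` is a hypothetical helper.

```python
# Layer shapes assumed from the public Llama-3.1-8B configuration.
HIDDEN, INTERMEDIATE, KV_DIM, N_LAYERS = 4096, 14336, 1024, 32
RANK = 16

# (input dim, output dim) of each adapted projection in one decoder layer.
PROJECTIONS = {
    "q_proj": (HIDDEN, HIDDEN),
    "k_proj": (HIDDEN, KV_DIM),
    "v_proj": (HIDDEN, KV_DIM),
    "o_proj": (HIDDEN, HIDDEN),
    "gate_proj": (HIDDEN, INTERMEDIATE),
    "up_proj": (HIDDEN, INTERMEDIATE),
    "down_proj": (INTERMEDIATE, HIDDEN),
}

def lora_params(d_in, d_out, r):
    """A is r x d_in and B is d_out x r, so the pair holds r * (d_in + d_out) values."""
    return r * (d_in + d_out)

per_layer = sum(lora_params(i, o, RANK) for i, o in PROJECTIONS.values())
trainable = per_layer * N_LAYERS

# Per-device batch size x gradient accumulation steps, as reported in the text.
effective_batch = 1 * 16
```

Under these assumptions the adapters hold 41,943,040 trainable values, roughly half a percent of the 8 billion frozen base weights.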

3.7 Implementation Details

The main implementation settings and retrieval-related parameters adopted in the RAG pipeline are summarized in Table 2.

Table 2. Retrieval-Augmented Generation (RAG) Implementation Details

| Parameter | Configuration |
| --- | --- |
| Chunk size | 800 words |
| Chunk overlap | 120 words |
| Embedding model | sentence-transformers/all-MiniLM-L6-v2 |
| Vector database setting | FAISS with IndexFlatL2 |
| Top-k value | 6 |
| Similarity threshold | None (deterministic top-k retrieval; no score cutoff) |
| Prompt template | Retrieved context + task instruction + user query + output formatting instruction |
| Max prompt length | 2,048 tokens |
| Max new tokens | 384 |
| Temperature | 0.4 |
| Repetition penalty | 1.2 |
| No-repeat n-gram size | 3 |
| Quantization setting | 4-bit NF4 (Normalized Float 4) |
| Reranking | Not used |
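The chunking and retrieval settings in Table 2 can be sketched as follows. This is a minimal illustration under stated assumptions: the function names are hypothetical, the embeddings are random stand-ins rather than all-MiniLM-L6-v2 vectors, and the exact brute-force L2 search is written in NumPy instead of FAISS so that the example stays self-contained.

```python
import numpy as np


def chunk_words(text, size=800, overlap=120):
    """Split text into windows of `size` words that share `overlap` words."""
    words = text.split()
    step = size - overlap
    chunks = []
    for start in range(0, len(words), step):
        chunks.append(" ".join(words[start:start + size]))
        if start + size >= len(words):   # last window reached the end
            break
    return chunks


def top_k_l2(query_vec, chunk_vecs, k=6):
    """Exact nearest neighbours by L2 distance, with no score cutoff."""
    dists = ((chunk_vecs - query_vec) ** 2).sum(axis=1)
    order = np.argsort(dists)[:k]
    return order, dists[order]


# Hypothetical 2,000-word document.
text = " ".join(f"w{i}" for i in range(2000))
chunks = chunk_words(text)

# Random stand-in embeddings (one vector per chunk candidate).
rng = np.random.default_rng(0)
chunk_vecs = rng.normal(size=(10, 8))
query_vec = chunk_vecs[3].copy()
idx, dist = top_k_l2(query_vec, chunk_vecs, k=6)
print(len(chunks), idx[0])
```

With these settings each new chunk starts 680 words after the previous one, so consecutive chunks share their boundary 120 words.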

Table 3 presents the prompt template structure and task-specific instruction configurations used within the RAG pipeline.

Table 3. Task-specific prompt configurations used in the RAG pipeline

| Task Type | Instruction |
| --- | --- |
| Summary | Summarize the ESG report within 20% of the original length |
| SWOT | Generate a SWOT analysis based on the retrieved ESG content |
| GRI-based Recommendation | Produce concise GRI-aligned recommendations for weaknesses and threats |

Note: Environmental, Social, and Governance (ESG); Global Reporting Initiative (GRI); Retrieval-Augmented Generation (RAG); Strengths, Weaknesses, Opportunities, Threats (SWOT).

Table 4 presents the LoRA fine-tuning configuration and training parameters used in this study.

Table 4. Low-Rank Adaptation (LoRA) implementation details

| Parameter | Configuration |
| --- | --- |
| Base Large Language Model (LLM) | Llama-3.1-8B-Instruct |
| Quantization setting | 4-bit NF4 (Normalized Float 4) |
| Number of epochs | 3 |
| Rank value | 16 |
| Alpha value | 32 |
| Dropout | 0.1 |
| Learning rate | 2e-4 |
| Per-device train batch size | 1 |
| Gradient accumulation steps | 16 |
| Effective batch size | 16 |
| Training time | 3 hours 3 minutes |
| Fine-tuning corpus | 24 reports |
| Max sequence length | 512 |
| Inference input | first ~1,200 words in-context |
| Max new tokens (output) | 384 |
| Temperature | 0.4 |
| Repetition penalty | 1.2 |
| No-repeat n-gram size | 3 |

3.8 Evaluation Metrics

The four quality metrics are computed automatically and are reference-free: they are evaluated against the source report and the task query rather than against human-authored gold answers. Let O denote a generated output, S the source reference (the first 12,000 characters of the target report, used as the grounding reference), and Q the task instruction (query). Semantic comparisons use a multilingual sentence-embedding model, paraphrase-multilingual-mpnet-base-v2 (768 dimensions), denoted E(·). For any two texts a and b, sim(a, b) is the cosine similarity between their sentence embeddings, clipped to the range [0, 1] (negative values are set to 0):

$\operatorname{sim}(a, b)=\max (0,(E(a) \cdot E(b)) /(\|E(a)\| \cdot\|E(b)\|))$
(1)

Cosine similarity lies in [−1, 1]; negative similarities, which are rare for semantically related ESG text, are mapped to 0, so sim(a, b) ∈ [0, 1].
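Equation (1) can be sketched directly on embedding vectors. In the snippet below the vectors are toy two-dimensional examples rather than outputs of paraphrase-multilingual-mpnet-base-v2, and the function name simply mirrors the notation of the text.

```python
import numpy as np


def sim(u, v):
    """Cosine similarity of two embedding vectors, with negatives mapped to 0."""
    u = np.asarray(u, dtype=float)
    v = np.asarray(v, dtype=float)
    cos = float(u @ v / (np.linalg.norm(u) * np.linalg.norm(v)))
    return max(0.0, cos)


print(sim([1, 0], [1, 0]))    # identical direction
print(sim([1, 0], [-1, 0]))   # opposite direction, clipped to 0
print(sim([1, 1], [1, 0]))    # 45 degrees apart
```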

3.8.1 Faithfulness

The output O and the source S are split into sentences, {o_1, ..., o_m} and {s_1, ..., s_n}. Each output sentence is scored by its best semantic support in the source, and the scores are averaged:

$\operatorname{Faithfulness}(O, S)=\frac{1}{m} \sum_{i=1}^m \max _{1 \leq j \leq n} \operatorname{sim}(o_i, s_j)$
(2)

A low value indicates output sentences that are not supported by any sentence in the source (potential hallucination).
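A minimal sketch of Equation (2), assuming the sentence embeddings have already been computed and stacked row-wise (m output sentences, n source sentences). The two-dimensional vectors are toy values chosen so that the result can be checked by hand; sentence splitting and the embedding model are outside the scope of the snippet.

```python
import numpy as np


def faithfulness(out_emb, src_emb):
    """Mean over output sentences of the best clipped cosine to any source sentence."""
    o = out_emb / np.linalg.norm(out_emb, axis=1, keepdims=True)
    s = src_emb / np.linalg.norm(src_emb, axis=1, keepdims=True)
    sims = np.clip(o @ s.T, 0.0, 1.0)      # shape (m, n)
    return float(sims.max(axis=1).mean())


src = np.array([[1.0, 0.0], [0.0, 1.0]])   # two source sentences
out = np.array([[2.0, 0.0],                # fully supported by the first source sentence
                [-1.0, 0.0]])              # supported by no source sentence
score = faithfulness(out, src)
print(score)
```

Here one output sentence scores 1 and the other 0, so the metric is 0.5, which illustrates how unsupported sentences pull the average down.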

3.8.2 Factual consistency

Let N(·) be the set of numeric tokens in a text (matched by a numeric regular expression) and Ent(·) the set of heuristic named entities (capitalized tokens of length ≥ 3 and 2–5-letter acronyms, after removing a stopword list). Define the numeric and entity overlaps as

num = |N(O) ∩ N(S)| / |N(O)| (and num = 1 if N(O) = ∅), and

ent = |Ent(O) ∩ Ent(S)| / |Ent(O)| (and ent = 1 if Ent(O) = ∅).

Factual Consistency combines these with semantic similarity to the source:

$\text{FactualConsistency} (O, S)=\operatorname{clip}(0.20 \cdot num +0.20 \cdot ent +0.60 \cdot \operatorname{sim}(O, S), 0,1)$
(3)

This rewards outputs whose numbers and named entities are supported by the source report.
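The overlap terms and the weighting in Equation (3) can be sketched as below. The numeric regular expression, the abbreviated stopword list, and the example sentences are assumptions made for illustration (the exact patterns are those of the released code, which are not reproduced here), and the semantic similarity sim(O, S) is passed in as a precomputed number.

```python
import re

NUM_RE = re.compile(r"\d+(?:[.,]\d+)*%?")            # assumed numeric pattern
STOPWORDS = {"The", "This", "That", "And", "For"}    # assumed, abbreviated list


def numeric_tokens(text):
    return set(NUM_RE.findall(text))


def heuristic_entities(text):
    tokens = re.findall(r"[A-Za-z]+", text)
    caps = {t for t in tokens if t[0].isupper() and len(t) >= 3}
    acronyms = {t for t in tokens if t.isupper() and 2 <= len(t) <= 5}
    return (caps | acronyms) - STOPWORDS


def overlap(out_set, src_set):
    """Share of output items found in the source; 1 if the output has none."""
    return 1.0 if not out_set else len(out_set & src_set) / len(out_set)


def factual_consistency(output, source, semantic_sim):
    num = overlap(numeric_tokens(output), numeric_tokens(source))
    ent = overlap(heuristic_entities(output), heuristic_entities(source))
    score = 0.20 * num + 0.20 * ent + 0.60 * semantic_sim
    return min(1.0, max(0.0, score))


source = "Acme reduced emissions by 12% in 2023 under GRI 305."
output = "Acme cut emissions by 15% in 2023."
fc = factual_consistency(output, source, semantic_sim=0.5)
print(round(fc, 3))
```

In this hypothetical case the year is supported but the percentage is not (num = 0.5), the only entity is supported (ent = 1), and the score is 0.20 · 0.5 + 0.20 · 1 + 0.60 · 0.5 = 0.6.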

3.8.3 Answer correctness

This is a composite proxy that combines task-form adherence, structural formatting, length adequacy, and semantic alignment with the source. Let K_task be a small set of task-specific keywords (for summarization: environmental/social/governance terms; for SWOT: strengths/weaknesses/opportunities/threats terms; for recommendations: GRI/recommendation/action terms; full lists are provided in the released code). With w denoting the word count of O, the components are:

kw = min( |{k ∈ K_task : k ∈ O}| / max(0.4·|K_task|, 1), 1 );

struct = 1.0 if O contains any list or structure marker (-, •, *, “1.”, “:”), otherwise 0.4;

len = 1.0 if 40 ≤ w ≤ 700; len = max(0.5, 1 − (w − 700)/1500) if w > 700; len = max(0.2, w/40) if w < 40;

and the metric is

$\text{AnswerCorrectness}(O, S)=\operatorname{clip}(0.20 \cdot \text{kw}+0.15 \cdot \text{struct}+0.15 \cdot \text{len}+0.50 \cdot \operatorname{sim}(O, S), 0,1)$
(4)

We note that this metric does not compare the output against an independent reference answer; it is a reference-free proxy for task-appropriate, source-aligned output.
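The three heuristic components and their combination in Equation (4) can be sketched as follows. Case-insensitive substring matching for keywords and the four-word SWOT keyword list are assumptions for illustration; the full keyword lists are those of the released code. As before, sim(O, S) is supplied as a precomputed value.

```python
def kw_score(output, keywords):
    """Share of task keywords present, saturating at 40% of the list."""
    text = output.lower()
    hits = sum(1 for k in keywords if k.lower() in text)
    return min(hits / max(0.4 * len(keywords), 1), 1.0)


def struct_score(output):
    """1.0 if any list or structure marker occurs in the output, else 0.4."""
    markers = ("-", "•", "*", "1.", ":")
    return 1.0 if any(m in output for m in markers) else 0.4


def len_score(w):
    """Length adequacy for an output of w words."""
    if w < 40:
        return max(0.2, w / 40)
    if w > 700:
        return max(0.5, 1 - (w - 700) / 1500)
    return 1.0


def answer_correctness(output, keywords, semantic_sim):
    w = len(output.split())
    score = (0.20 * kw_score(output, keywords)
             + 0.15 * struct_score(output)
             + 0.15 * len_score(w)
             + 0.50 * semantic_sim)
    return min(1.0, max(0.0, score))


swot_keywords = ["strengths", "weaknesses", "opportunities", "threats"]  # assumed list
print(kw_score("Strengths: brand. Weaknesses: cost.", swot_keywords))
print(len_score(20), len_score(1000))
```

Note that, with the 0.4 factor in the denominator, two of four keywords already saturate the keyword component, and outputs between 40 and 700 words receive the full length score.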

3.8.4 Answer relevancy

This measures how well the output addresses the requested task, computed as the semantic similarity between the output and the task query:

$ \text{AnswerRelevancy} (O, Q)=\operatorname{sim}(O, Q)$
(5)

All composite weights (0.20/0.20/0.60 for Factual Consistency and 0.20/0.15/0.15/0.50 for Answer Correctness) are heuristic. Because the metrics are computed against the source document and the task instruction rather than against expert-annotated reference outputs, and because no human or expert spot-checking was performed in this study, the scores should be interpreted as automated proxies for output quality; this limitation, and the value of complementary human evaluation, are discussed in Section 6.2.

4. Results

Table 5 presents the comparative performance on the six held-out test companies. These reports were excluded only from the LoRA fine-tuning corpus; for RAG, the target report is retrieved at inference, in line with the intended use of RAG.

Table 5. Comparison of RAG and LoRA mean scores across 12 task-metric combinations (Held-out test set, $n$ = 6 companies)

| Task | Metric | RAG_Average | LoRA_Average | Difference |
| --- | --- | --- | --- | --- |
| Summary | Faithfulness | 0.465 | 0.582 | +0.118 |
| Summary | Factual Consistency | 0.324 | 0.387 | +0.063 |
| Summary | Answer Correctness | 0.498 | 0.654 | +0.157 |
| Summary | Answer Relevancy | 0.338 | 0.437 | +0.099 |
| SWOT | Faithfulness | 0.476 | 0.503 | +0.027 |
| SWOT | Factual Consistency | 0.306 | 0.381 | +0.076 |
| SWOT | Answer Correctness | 0.462 | 0.607 | +0.145 |
| SWOT | Answer Relevancy | 0.282 | 0.447 | +0.166 |
| GRI-based Recommendation | Faithfulness | 0.481 | 0.525 | +0.044 |
| GRI-based Recommendation | Factual Consistency | 0.371 | 0.395 | +0.024 |
| GRI-based Recommendation | Answer Correctness | 0.571 | 0.670 | +0.099 |
| GRI-based Recommendation | Answer Relevancy | 0.245 | 0.528 | +0.283 |

Note: Global Reporting Initiative (GRI); Retrieval-Augmented Generation (RAG); Low-Rank Adaptation (LoRA); Strengths, Weaknesses, Opportunities, Threats (SWOT).

All scores are averages over the 6 held-out test companies. These reports were excluded only from LoRA fine-tuning; for RAG, the target report is retrieved at inference, consistent with the intended use of RAG. Quality metrics are computed using embedding-based measures (sentence-transformer cosine similarity for faithfulness and relevancy; weighted numeric + entity + semantic overlap for factual consistency).

The RAG_Average column reports the mean score obtained by the RAG scenario for the corresponding task and metric, across the 6 held-out test companies. The LoRA_Average column reports the corresponding mean for the LoRA fine-tuned model on the same 6 companies. The Difference column reports the mean difference (LoRA_Average - RAG_Average); a positive value indicates that LoRA scores higher than RAG.

Table 6 complements the descriptive results in Table 5. For each of the 12 task–metric combinations, it reports the mean scores and standard deviations (Mean ± SD) over the six held-out companies, together with paired statistical comparisons between LoRA and RAG: paired t-test and Wilcoxon signed-rank p-values (α = 0.05) and paired Cohen’s d, with |d| > 0.8 conventionally interpreted as a large effect. Because n = 6, the smallest attainable two-sided Wilcoxon p-value is 0.031 (obtained when all six companies change in the same direction), and 0.062 is the next attainable level; the test therefore has limited resolution at this sample size. The task-level findings are interpreted in the paragraphs that follow and summarized in the "Taken together" discussion.
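The attainable p-values quoted above can be reproduced by enumerating the exact null distribution of the signed-rank statistic. The sketch below assumes the exact test with no ties and no zero differences; the function name is hypothetical, and a statistics library would normally be used in practice.

```python
from itertools import product


def exact_two_sided_p_levels(n):
    """Attainable two-sided exact Wilcoxon signed-rank p-values for n pairs."""
    total = 2 ** n
    counts = {}
    # Under the null, each rank 1..n carries a positive sign with probability 1/2.
    for signs in product((0, 1), repeat=n):
        w_plus = sum(rank for rank, s in zip(range(1, n + 1), signs) if s)
        counts[w_plus] = counts.get(w_plus, 0) + 1
    max_w = n * (n + 1) // 2
    levels = set()
    for w in range(0, max_w // 2 + 1):
        lower_tail = sum(c for v, c in counts.items() if v <= w)
        levels.add(min(1.0, 2 * lower_tail / total))   # double the smaller tail
    return sorted(levels)


levels = exact_two_sided_p_levels(6)
print(levels[:4])
```

For n = 6 the two smallest levels are 2/64 = 0.03125 and 4/64 = 0.0625, which is why no Wilcoxon p-value in Table 6 falls between 0.031 and 0.062.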

For the summarization task, LoRA obtains higher mean scores than RAG on all four metrics, but the difference is statistically significant for only two: Faithfulness (Δ = +0.118, p = 0.023, d = 1.33) and Answer Correctness (Δ = +0.157, p = 0.021, d = 1.35), both also supported by the Wilcoxon test (p = 0.031). The gains in Factual Consistency (Δ = +0.063, p = 0.247) and Answer Relevancy (Δ = +0.099, p = 0.365) are not significant. These results indicate that LoRA-generated summaries are more closely grounded in the source documents and more structurally aligned with the task instruction, while the two approaches are comparable in factual consistency and query relevance for this task.

Table 6. Per-metric statistical comparison across the 6 held-out companies

| Task | Metric | RAG Mean ± SD | LoRA Mean ± SD | Paired t-Test p | Wilcoxon p | Effect Size (Cohen’s d) |
| --- | --- | --- | --- | --- | --- | --- |
| Summary | Faithfulness | 0.464 ± 0.082 | 0.582 ± 0.141 | 0.023 | 0.031 | 1.33 |
| Summary | Factual Consistency | 0.324 ± 0.108 | 0.387 ± 0.058 | 0.247 | 0.438 | 0.53 |
| Summary | Answer Correctness | 0.498 ± 0.071 | 0.654 ± 0.106 | 0.021 | 0.031 | 1.35 |
| Summary | Answer Relevancy | 0.338 ± 0.129 | 0.437 ± 0.179 | 0.365 | 0.562 | 0.41 |
| SWOT | Faithfulness | 0.476 ± 0.107 | 0.502 ± 0.056 | 0.650 | 1.000 | 0.20 |
| SWOT | Factual Consistency | 0.306 ± 0.137 | 0.381 ± 0.124 | 0.359 | 0.438 | 0.41 |
| SWOT | Answer Correctness | 0.462 ± 0.059 | 0.607 ± 0.137 | 0.037 | 0.062 | 1.15 |
| SWOT | Answer Relevancy | 0.282 ± 0.081 | 0.447 ± 0.129 | 0.058 | 0.094 | 1.00 |
| GRI-based Recommendation | Faithfulness | 0.481 ± 0.096 | 0.525 ± 0.055 | 0.229 | 0.156 | 0.56 |
| GRI-based Recommendation | Factual Consistency | 0.371 ± 0.105 | 0.395 ± 0.090 | 0.761 | 0.438 | 0.13 |
| GRI-based Recommendation | Answer Correctness | 0.571 ± 0.075 | 0.670 ± 0.052 | 0.098 | 0.156 | 0.83 |
| GRI-based Recommendation | Answer Relevancy | 0.245 ± 0.107 | 0.528 ± 0.183 | 0.014 | 0.062 | 1.52 |

Note: Global Reporting Initiative (GRI); Retrieval-Augmented Generation (RAG); Low-Rank Adaptation (LoRA); Strengths, Weaknesses, Opportunities, Threats (SWOT).
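For readers who wish to recompute the paired statistics, the sketch below assumes that paired Cohen’s d is defined as the mean of the per-company differences divided by their sample standard deviation, in which case the paired t statistic equals d · √n. The per-company scores are hypothetical, since only means and standard deviations are reported in Table 6.

```python
import math
from statistics import mean, stdev


def paired_cohens_d(a, b):
    """Mean of the paired differences divided by their sample SD (assumed definition)."""
    diffs = [x - y for x, y in zip(a, b)]
    return mean(diffs) / stdev(diffs)


def paired_t_statistic(a, b):
    """Paired t statistic, which equals d * sqrt(n) under the definition above."""
    return paired_cohens_d(a, b) * math.sqrt(len(a))


# Hypothetical per-company scores for one task-metric combination (n = 6).
lora = [0.60, 0.55, 0.70, 0.50, 0.65, 0.58]
rag = [0.50, 0.50, 0.55, 0.48, 0.52, 0.45]

d = paired_cohens_d(lora, rag)
t = paired_t_statistic(lora, rag)
print(f"d = {d:.2f}, t = {t:.2f} with {len(lora) - 1} degrees of freedom")
```

Under this definition, a reported d of 1.33 with n = 6 corresponds to a t statistic of about 3.26 on 5 degrees of freedom.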

For the SWOT analysis task, LoRA again obtains higher mean scores than RAG on all four metrics, but only Answer Correctness reaches statistical significance under the paired t-test (Δ = +0.145, p = 0.037, d = 1.15), and even this comparison is borderline under the more conservative Wilcoxon test (p = 0.062). The differences in Answer Relevancy (Δ = +0.166, p = 0.058, d = 1.00), Factual Consistency (Δ = +0.076, p = 0.359), and Faithfulness (Δ = +0.027, p = 0.650) did not reach statistical significance, indicating that the two approaches perform comparably on these dimensions. LoRA’s clearest advantage on this task is therefore its stronger adherence to the four-category SWOT structure, as reflected in Answer Correctness.

For the GRI-based recommendation task, LoRA again obtains higher mean scores than RAG on all four metrics, but only Answer Relevancy reaches statistical significance under the paired t-test (Δ = +0.283, p = 0.014, d = 1.52), and as with SWOT Answer Correctness this comparison is borderline under the Wilcoxon test (p = 0.062). The differences in Answer Correctness (Δ = +0.099, p = 0.098), Faithfulness (Δ = +0.044, p = 0.229), and Factual Consistency (Δ = +0.024, p = 0.761) did not reach statistical significance, indicating comparable performance on these dimensions. This pattern suggests that LoRA’s main advantage for this task lies in producing recommendations that are more responsive to the specific weaknesses and threats raised by the query.

Taken together, Table 5 and Table 6 show that the LoRA fine-tuned model obtains higher mean scores than RAG on all 12 task–metric combinations, but only 4 of the 12 paired comparisons reach statistical significance under the paired t-test (p < 0.05): Faithfulness and Answer Correctness for summarization, Answer Correctness for SWOT, and Answer Relevancy for recommendations. Two of these four are corroborated by the Wilcoxon test (both summarization metrics, p = 0.031), whereas the other two are borderline (p = 0.062), reflecting the limited resolution of the Wilcoxon test at n = 6. Where significant, the effect sizes are large (Cohen’s d between 1.15 and 1.52); for the remaining comparisons, the effects are small to moderate and non-significant. Given the small held-out sample, these results indicate a task- and metric-dependent advantage for LoRA rather than a uniform one. Because the scores are obtained on companies that the LoRA model never saw during training, while RAG retrieved each target report at inference, the comparison reflects performance on unseen reports under matched conditions rather than memorization of report-specific wording.

Figure 2 summarises the average performance of the RAG and LoRA scenarios on the held-out test set, while Tables 5 and 6 disaggregate the same comparison across the three task types (Summary, SWOT, and GRI-aligned Recommendations).

At the aggregate level, the LoRA scenario obtains a higher mean score than RAG on all four quality metrics, but the magnitude of the difference varies considerably across metrics. The largest average gaps are observed for Answer Relevancy, where the mean rises from 0.288 under RAG to 0.471 under LoRA (Δ = +0.183), and for Answer Correctness (RAG 0.510, LoRA 0.644; Δ = +0.133). The gaps for Faithfulness (RAG 0.474, LoRA 0.537; Δ = +0.063) and Factual Consistency (RAG 0.333, LoRA 0.388; Δ = +0.054) are comparatively small.
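The aggregate figures above can be approximately recomputed from Table 5, assuming that each aggregate is the unweighted mean of the three task-level means (each task covers the same six companies). Because Table 5 is rounded to three decimals, the recomputed values can differ from the reported ones in the last digit.

```python
# Task-level means from Table 5, ordered Summary, SWOT, GRI-based Recommendation.
TABLE5 = {
    "Faithfulness":        {"RAG": (0.465, 0.476, 0.481), "LoRA": (0.582, 0.503, 0.525)},
    "Factual Consistency": {"RAG": (0.324, 0.306, 0.371), "LoRA": (0.387, 0.381, 0.395)},
    "Answer Correctness":  {"RAG": (0.498, 0.462, 0.571), "LoRA": (0.654, 0.607, 0.670)},
    "Answer Relevancy":    {"RAG": (0.338, 0.282, 0.245), "LoRA": (0.437, 0.447, 0.528)},
}


def aggregate(table):
    """Unweighted mean across tasks for each metric, plus the LoRA - RAG gap."""
    result = {}
    for metric, scores in table.items():
        rag = sum(scores["RAG"]) / len(scores["RAG"])
        lora = sum(scores["LoRA"]) / len(scores["LoRA"])
        result[metric] = (rag, lora, lora - rag)
    return result


agg = aggregate(TABLE5)
for metric, (rag, lora, diff) in agg.items():
    print(f"{metric:<20} RAG {rag:.3f}  LoRA {lora:.3f}  diff {diff:+.3f}")
```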

Figure 2. Average performance scores of the Retrieval-Augmented Generation (RAG) and Low-Rank Adaptation (LoRA) scenarios on the held-out test set across the four quality metrics: Faithfulness, Factual Consistency, Answer Correctness, and Answer Relevancy
Note: Bars show the mean score (range 0-1) per metric; numerical values are reported above each bar.

These aggregate differences should be interpreted in light of the task-level significance tests in Table 6, where only 4 of the 12 task–metric comparisons reach statistical significance. The LoRA advantage is most consistent for Answer Correctness, which is significant for both the Summary and SWOT tasks; Answer Relevancy is significant for the Recommendation task, and Faithfulness for the Summary task. Factual Consistency, although showing a slightly higher mean for LoRA, does not reach significance in any task and exhibits the smallest aggregate gap.

This pattern indicates that, under the matched-budget comparison, the LoRA scenario’s advantage is concentrated in the structural and task-responsiveness dimensions: its outputs are better aligned with the task instruction (Answer Correctness) and more responsive to the specific query (Answer Relevancy). In source grounding, by contrast, the differences are smaller and largely non-significant: Factual Consistency shows no significant difference in any task, and Faithfulness differs significantly only in the Summary task.

Taken together, Figure 2 and Tables 5 and 6 indicate a task- and metric-dependent advantage for LoRA rather than a uniform one. LoRA leads on average across all four metrics, but this advantage is statistically supported in only 4 of the 12 task-level comparisons, namely Answer Correctness (Summary and SWOT), Answer Relevancy (Recommendation), and Faithfulness (Summary), and is absent for Factual Consistency in every task. Given the small held-out sample (n = 6), these patterns should be regarded as indicative rather than conclusive.

The distribution and stability of the four quality metrics across the held-out test set are shown in Figure 3.

Figure 3. Distribution of the four quality metrics across the held-out test set for Retrieval-Augmented Generation (RAG) (blue) and Low-Rank Adaptation (LoRA) (red)
Note: Boxes show the interquartile range (IQR), lines the medians, diamonds the means, and circles the outliers.

Figure 3 reports the full distribution of the four quality metrics across the held-out test set for both scenarios. The central tendency (mean and median) of the LoRA distribution lies above that of RAG for every metric, but the degree of separation varies across metrics. For Answer Correctness and Answer Relevancy, the two distributions show little overlap, with the LoRA box positioned above the RAG box (Answer Correctness: RAG μ = 0.510 vs LoRA μ = 0.644; Answer Relevancy: 0.288 vs 0.471). For Faithfulness (0.474 vs 0.537) and Factual Consistency (0.333 vs 0.388), the gap is smaller and the interquartile ranges overlap substantially, mirroring the smaller and mostly non-significant differences for these two metrics (Table 6).

The spread of the two scenarios differs by metric rather than systematically. RAG distributions are tighter than LoRA on three of the four metrics, Faithfulness, Answer Correctness, and especially Answer Relevancy, where LoRA shows the widest spread (σ = 0.161), whereas for Factual Consistency LoRA is the tighter of the two (σ = 0.089 vs RAG σ = 0.114). LoRA therefore tends to achieve higher absolute scores, but in most cases at the cost of greater variability across reports and tasks, which should be acknowledged when interpreting individual report-level outputs.

A few isolated outliers are visible (for example, a high-Faithfulness LoRA case and both a high and a low Answer Relevancy RAG case), but they do not alter the overall pattern. The distributional view is consistent with the aggregate and task-level results: the LoRA advantage is most apparent for Answer Correctness and Answer Relevancy, where the distributions are more separated, and is marginal for Faithfulness and Factual Consistency, where they largely overlap. Given the small held-out sample (n = 6 companies), these distributional patterns should be regarded as indicative rather than conclusive.

The comparison of the two scenarios with respect to average token consumption and processing time per generation is shown in Figure 4.

Figure 4. Average processing time and token consumption

In terms of token consumption, the two scenarios are comparable: LoRA uses on average 2,075 tokens per generation and RAG 2,420, a difference of approximately 14%. These counts reflect the input actually processed by each model under the matched 2,048-token prompt budget. Thus, although LoRA is marginally more token-efficient, the two approaches do not differ substantially on this dimension.

In terms of processing time, the difference is large and in the opposite direction: RAG completes a generation in 42.91 seconds on average, whereas the LoRA scenario requires 473.16 seconds. This gap arises at inference rather than from any parameter updates (which occur only during training). The RAG pipeline runs the base model under 4-bit quantization, whereas the fine-tuned model in our setup was run in fp16 without 4-bit quantization, contributing to its longer runtime. In addition, LoRA incurs a one-time fine-tuning cost (approximately 2h19min) that is not reflected in the per-generation time but is part of its overall computational budget.

Taken together, Figure 4 indicates that the two approaches involve a computational trade-off rather than a clear efficiency advantage for either. Token consumption is comparable, while RAG is substantially faster at inference. Efficiency should therefore be assessed jointly across token use, inference latency, and the one-time cost of fine-tuning, rather than on token count alone.
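The two efficiency figures can be checked with a few lines of arithmetic. The sketch assumes that the token difference of approximately 14% is expressed relative to the RAG count; the variable names are illustrative.

```python
rag_tokens, lora_tokens = 2420, 2075        # average tokens per generation
rag_seconds, lora_seconds = 42.91, 473.16   # average seconds per generation

token_saving = (rag_tokens - lora_tokens) / rag_tokens   # relative to RAG
latency_ratio = lora_seconds / rag_seconds

print(f"LoRA uses {token_saving:.1%} fewer tokens than RAG")
print(f"LoRA inference takes {latency_ratio:.1f}x as long as RAG")
```

The token gap is about one seventh of the RAG count, whereas the latency gap is roughly a factor of eleven, which is why the two dimensions need to be reported separately.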

5. Conclusions

This study evaluated 30 food-sector ESG reports using two LLM-assisted analysis scenarios, RAG and LoRA, built on the same Llama-3.1-8B-Instruct base model. To prevent train–test leakage, the corpus was partitioned into a stratified hold-out split of 24 training and 6 unseen test reports, with all reported metrics computed on the test set. Leakage prevention was applied to LoRA fine-tuning only; RAG was allowed to retrieve the target report at inference, consistent with its intended use. Both scenarios were compared under matched generation budgets and identical decoding settings. Each scenario was applied in a three-stage pipeline: ESG-report summarization, a SWOT analysis derived from the summary, and GRI-aligned solution recommendations for the identified weaknesses and threats. Outputs were assessed on four quality metrics (Faithfulness, Factual Consistency, Answer Correctness, and Answer Relevancy), together with operational measures of token consumption and processing time.

On the held-out test set, the LoRA-based scenario obtained higher mean scores than RAG across all four quality metrics. However, statistically significant differences were observed in only 4 of the 12 task–metric comparisons reported in Table 6, namely Summary-Faithfulness, Summary-Answer Correctness, SWOT-Answer Correctness, and GRI-based Recommendation-Answer Relevancy. The remaining comparisons did not reach statistical significance, although several exhibited moderate to large effect sizes. These findings suggest that the performance advantage of LoRA was task-dependent rather than uniformly distributed across all evaluation dimensions.

In terms of operational cost, token consumption was comparable between the two scenarios (RAG ≈ 2,420 vs LoRA ≈ 2,075 tokens per generation, a difference of ~14%), whereas RAG was substantially faster at inference (≈ 43 s vs ≈ 473 s per generation, roughly an order of magnitude). We note that this latency gap is partly an implementation choice (fp16 LoRA inference versus 4-bit RAG inference) and is expected to narrow under matched configurations; LoRA additionally incurs a one-time fine-tuning cost not reflected in per-generation time.

Within the scope of this study, the results suggest that LoRA fine-tuning yields outputs that are better aligned with the task instruction and more responsive to the query (Answer Correctness and Answer Relevancy), while the two approaches are comparable in source grounding (Faithfulness and Factual Consistency); RAG, in turn, remains attractive when traceability, rapid corpus updates, and low inference latency are prioritized. We therefore restrict our conclusion to the conditions of this study: 30 food-sector ESG reports, the Llama-3.1-8B-Instruct base model, and reference-free automatic evaluation on a held-out sample of n = 6, and emphasize that further validation on additional industries, base models, and expert-annotated reference outputs is needed before these findings can be generalized. Replicating the experiments on other sectors, with larger held-out samples or stratified k-fold cross-validation, and complementing the automatic metrics with human assessment by ESG/GRI auditors, are the most important directions we identify for future work.

6. Limitations

The findings of this study should be interpreted in light of several limitations, which we group into five categories: scope, evaluation methodology, implementation, baselines and task coverage, and reproducibility.

6.1 Scope and Generalizability

Our experiments are restricted in scope. They cover a single industry, the food sector, using 30 ESG reports drawn from a mix of consumer-packaged-goods, dairy, meat and protein, food-service, B2B-ingredient, and beverage sub-segments. A single base model, Llama-3.1-8B-Instruct, is used throughout. The reports are predominantly in English, with a smaller subset of Turkish reports analyzed jointly rather than evaluated as separate language conditions. The conclusions, therefore, apply under the conditions of this study and should not be interpreted as a general claim about LoRA versus RAG across arbitrary domains. Whether the advantages observed for LoRA on certain metrics transfer to other regulated sectors (e.g., banking, energy, mining, healthcare), to other base models, to non-Latin-script languages, or to longer time-series of disclosures remains to be tested. We view replication across additional sectors and models as the most important direction for future work.

6.2 Evaluation Methodology and Statistical Power

To prevent data leakage between LoRA training and evaluation, the corpus is partitioned into 24 training and 6 held-out test reports, stratified across six sub-segments. Per-company comparisons are therefore based on n = 6, which limits the statistical power and external validity of inferences regarding cross-company variation. Although several of the paired comparisons reach significance with large effect sizes, the majority do not under this small sample, and replication using stratified k-fold cross-validation, for which the framework already provides utilities (src/data_split.py), is required before strong generalization claims can be made.
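A stratified hold-out split of this kind can be sketched with the standard library alone. The snippet is not the project’s src/data_split.py: the function name, the report identifiers, the segment labels, and the assumption of five reports per sub-segment with one held out from each are all hypothetical and serve only to illustrate the 24/6 partition.

```python
import random
from collections import defaultdict


def stratified_holdout(items, n_test_per_stratum=1, seed=0):
    """Hold out a fixed number of items from every stratum."""
    rng = random.Random(seed)
    by_stratum = defaultdict(list)
    for name, stratum in items:
        by_stratum[stratum].append(name)
    train, test = [], []
    for stratum in sorted(by_stratum):
        names = sorted(by_stratum[stratum])   # fixed order before shuffling
        rng.shuffle(names)
        test.extend(names[:n_test_per_stratum])
        train.extend(names[n_test_per_stratum:])
    return train, test


# Hypothetical corpus: six sub-segments with five reports each.
segments = ["CPG", "dairy", "meat", "food-service", "B2B-ingredient", "beverage"]
reports = [(f"{seg}_{i}", seg) for seg in segments for i in range(5)]

train, test = stratified_holdout(reports)
print(len(train), len(test))
```

Repeating the procedure with different held-out items per stratum is the basic building block of the stratified k-fold replication suggested above.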

The four performance metrics (faithfulness, factual consistency, answer correctness, and answer relevancy) are automated, reference-free measures derived primarily from multilingual sentence-embedding similarity (paraphrase-multilingual-mpnet-base-v2), supplemented by lightweight task-specific heuristic checks where appropriate. While these metrics improve robustness to paraphrasing and surface-form variation, they approximate rather than replace human or expert assessment; ESG-domain experts may legitimately disagree with automated scores, particularly for nuanced GRI-compliance judgments. A complementary human evaluation, ideally involving certified sustainability auditors, would strengthen the conclusions.

The LoRA fine-tuning targets themselves are constructed through deterministic heuristic extraction procedures rather than gold-standard human annotation, meaning that the upper bound of LoRA performance is partly dependent on the quality of these heuristics; further gains may be achievable with curated, expert-written training data.

6.3 Implementation and Engineering Choices

The latency comparison is not fully like-for-like: in the current implementation, RAG inference uses 4-bit quantization while LoRA inference uses fp16, contributing to the observed latency differences between the two approaches. Using matched quantization settings for both pipelines could potentially reduce part of this gap; however, this configuration was not evaluated in the present study and is left for future engineering work.

Results are reported from a single random seed and a single training run; run-to-run variance arising from optimizer stochasticity was not quantified. Hyperparameters were not exhaustively tuned: LoRA rank (r = 16), α, dropout, learning rate, and the RAG retrieval depth (top-k = 6), chunk size (800 words), and embedding model (all-MiniLM-L6-v2) were selected from common defaults. Alternative configurations, including hybrid BM25+dense retrieval, rerankers, larger or domain-specific embedding models, and full (rather than parameter-efficient) fine-tuning, were not evaluated.

Document parsing relies on pypdf text extraction; tabular content, figures, multi-column layouts, and scanned pages are not OCR-processed and may be partially lost, which can disadvantage scenarios that depend heavily on numerical or structured content.

6.4 Baselines and Task Coverage

We compare RAG and LoRA but do not include a zero-shot Llama baseline, a fully fine-tuned baseline, or larger reference models (e.g., GPT-4-class systems) due to compute and licensing constraints. The task suite is likewise limited to three ESG analyses (summarization, SWOT generation, and GRI-aligned recommendation generation); other practitioner-relevant tasks such as materiality assessment, quantitative KPI extraction, year-over-year change detection, and supply-chain risk extraction remain outside the scope of this study.

6.5 Reproducibility and Licensing

The base model is hosted in a gated Hugging Face repository requiring authentication, and several licensing constraints apply to its use. The GRI standard documents used as the RAG knowledge base and as part of the LoRA training corpus are subject to copyright and therefore cannot be redistributed. Accordingly, we release code, configurations, processed-derived statistics, and per-document metric outputs, but not the original source PDFs. Researchers seeking to reproduce the pipeline must independently obtain the GRI documents and, where applicable, the ESG reports.

Taken together, these limitations suggest that the reported findings should be interpreted as evidence of the potential of LoRA fine-tuning for specific ESG-analysis dimensions under the experimental configuration studied, rather than as a universal claim of superiority over RAG across all settings. Future work addressing additional sectors, multiple base models, expert-validated evaluation protocols, hybrid RAG-LoRA architectures, and matched latency configurations will be necessary to establish the broader applicability of these findings.

Author Contributions

Conceptualization, B.Ö. and A.H.I.; methodology, B.Ö. and A.H.I.; software, B.Ö.; validation, B.Ö. and A.H.I.; formal analysis, B.Ö.; investigation, B.Ö.; resources, B.Ö.; data curation, B.Ö.; writing—original draft preparation, B.Ö.; writing—review and editing, B.Ö.; visualization, B.Ö.; supervision, A.H.I.; project administration, B.Ö. All authors have read and agreed to the published version of the manuscript.

Data Availability

The data used to support the research findings are available from the corresponding author upon request.

Conflicts of Interest

The authors declare no conflict of interest.

Declaration on the Use of Generative AI and AI-assisted Technologies

During the preparation of this work, the authors utilized generative AI for minor language editing. Afterward, they reviewed and edited the content as necessary and took full responsibility for the publication’s content.


Cite this:
Öz, B. & Işık, A. H. (2026). The Use of Large Language Models in Sustainability Reporting: Performance Analysis of RAG and LoRA Techniques. Chall. Sustain., 14(3), 589–604. https://doi.org/10.56578/cis140310
©2026 by the author(s). Published by Acadlore Publishing Services Limited, Hong Kong. This article is available for free download and can be reused and cited, provided that the original published version is credited, under the CC BY 4.0 license.