The Use of Large Language Models in Sustainability Reporting: Performance Analysis of RAG and LoRA Techniques
Abstract:
An Environmental, Social, and Governance (ESG) report is an essential information source for evaluating a company’s performance in sustainability practices. Organizations structure their environmental impacts, social responsibilities, and governance practices within a defined framework. This standardization is provided by the Global Reporting Initiative (GRI), which constitutes an internationally recognized guideline for sustainability reporting. Traditional reporting workflows are time-consuming for organizations and prone to data-entry errors, which limits the reliability of disclosed information. In this context, leveraging the capabilities of Large Language Models (LLMs) offers significant time and resource savings. This study uses the Llama-3.1-8B-Instruct model under two scenarios, Retrieval-Augmented Generation (RAG) and Low-Rank Adaptation (LoRA) fine-tuning, to analyze 30 food-sector ESG reports and produce ESG summaries, SWOT analyses, and GRI-aligned recommendations. The two approaches are evaluated on a stratified hold-out set of 6 unseen test reports (24 reports used for training) under a fair, matched-budget setup in which RAG retrieves the target report at inference. On four quality metrics, LoRA achieved higher mean scores than RAG; however, statistically significant differences were observed in only 4 of the 12 task–metric comparisons. Token usage was comparable, whereas RAG was substantially faster at inference. Rather than favoring one approach over the other, these findings reveal a trade-off between output quality and computational efficiency: LoRA yields quality gains on specific metrics, whereas RAG is substantially more efficient at inference. Given the limited size of the held-out test set, these results should be interpreted with caution.
1. Introduction
The term ESG, an abbreviation of Environmental, Social, and Governance, was introduced in 2004 in a report published under the United Nations Global Compact. Companies publish an ESG report each year in which they provide a detailed account of their activities and performance related to ESG issues. ESG reporting enhances corporate credibility, fosters strong relationships with stakeholders, and creates value for businesses by strengthening brand reputation. Today, companies are increasingly developing and implementing ESG-oriented practices (Zahid et al., 2023).
Sustainability reports structured through reporting standards such as the Global Reporting Initiative (GRI), the Sustainability Accounting Standards Board (SASB), and the Corporate Sustainability Reporting Directive (CSRD) enable businesses of all scales to assess and analyze their impacts on the environment, society, and the economy (Maharani & Rozzaid, 2022). The GRI is an international organization that provides standards for sustainability reporting. These standards are utilized by approximately 14,000 organizations across more than 100 countries and, according to KPMG’s 2024 Survey of Sustainability Reporting, represent the most widely adopted sustainability reporting framework worldwide (KPMG International, 2024). The Global Sustainability Standards Board (GSSB) regularly revises and improves the GRI Standards to ensure they remain current and align with the information demands of all stakeholders (Global Reporting Initiative, n.d.).
Large Language Models (LLMs), trained on massive datasets, represent a significant milestone for artificial intelligence by consistently producing coherent outputs across various domains such as information extraction, content generation, text summarization, translation, and text classification (Chakraborty, 2024). The rise of LLMs has driven transformation in various sectors and led to innovative solutions for complex problems (Melton et al., 2025). Sustainability reports have a high volume of data; since manual analysis is costly for many firms, it can only be performed by a limited number of organizations (Ni et al., 2023). Furthermore, the inclusion of diverse formats within sustainability reports, such as images, graphs, and tables, restricts their analytical tractability (Gupta et al., 2025). LLMs enable the transformation of raw and unstructured data into consistent and structured information through information extraction. Owing to these capabilities, they automate reporting processes and facilitate significant savings in both time and resources.
Although the importance of sustainability and its reporting is widely acknowledged by businesses today, the implementation process still faces various challenges and barriers, as noted in the previous paragraph. LLM-based ESG reporting tools can improve efficiency and make the process easier and more feasible. This study compares two approaches for LLM-assisted analysis of ESG reports in the food sector, namely Retrieval-Augmented Generation (RAG) and Low-Rank Adaptation (LoRA) fine-tuning, both built on the Llama-3.1-8B-Instruct model. Using a corpus of 30 ESG reports, each approach is applied to summarize the reports, derive a SWOT analysis from the summaries, and produce solution recommendations aligned with GRI standards for the identified weaknesses and threats. The two approaches are compared on a held-out test set under matched conditions, using four quality metrics (Faithfulness, Factual Consistency, Answer Correctness, and Answer Relevancy) together with operational measures of token consumption and processing time. Ultimately, the study aims to contribute to the understanding of how LLMs can support the assessment of corporate sustainability performance.
2. Literature Review
Birti et al. (2025) focused on optimizing the use of LLMs for classifying and detecting ESG-related activities in financial and corporate documents. Models such as Llama, Gemma, Mistral, and GPT-4o Mini were evaluated, and the highest performance was achieved by the Llama2_7B model with an F1 score of 84.9%.
Villacampa-Porta et al. (2025) analyzed the scope and quality of 729 reports prepared under voluntary and mandatory systems between 2015 and 2022. Using Transformer-based ClimateBERT models fine-tuned on the ClimaText database, the authors concluded that the fine-tuned model achieved higher precision on climate-related texts.
Behrens et al. (2024) examined the use of LLMs in the creation of ESG reports. The Llama-3.1-405b-instruct and Mistral-large-2-instruct models were analyzed to assess their suitability for the reporting process, and the Mistral-large-2-instruct model was found to produce predictions that were highly consistent with the reference text.
A system based on Natural Language Processing (NLP) techniques was developed using Italian and English sustainability reports that complied with GRI Standards. The system identified references to various sustainability topics, their page numbers, the content of each reference, and, through sentiment analysis, whether each reference is positive or negative. The study yielded an F1 score ranging from 0.87210 to 0.94365 (Polignano et al., 2022).
A system called ESGReveal was introduced to analyze unstructured ESG reports and evaluate corporate performance. The study encompassed 2249 ESG reports from 166 companies across 12 different sectors listed on the Hong Kong Stock Exchange. Utilizing various LLMs such as GPT-3.5, GPT-4, ChatGLM and QWEN, in combination with RAG technology, the study reported the highest information extraction accuracy with GPT-4, achieving 76.9% (Zou et al., 2025).
The LLM and RAG-based system, EcoSmartGuide, was developed to simplify the ESG reporting process. This system automates the collection and analysis of ESG data and reflects ESG information with 95% accuracy (Yang et al., 2024).
Bronzini et al. (2024) focused on automatic knowledge extraction from unstructured data in ESG reports, using LLMs in conjunction with the RAG architecture and in-context learning.
A system called CHATREPORT was developed for automated analysis of sustainability reports, based on the Task Force on Climate-Related Financial Disclosures (TCFD) reporting framework. ChatGPT was used as the base LLM, and LangChain was used for API operations and vector-based retrieval processes, while OpenAI’s text-embedding-ada-002 model was chosen for the text embedding process. The study concluded that the developed system had a low hallucination rate and that errors were easily detected by users (Ni et al., 2023).
A system called SusGen-GPT was developed to facilitate the creation of sustainability reports based on the TCFD reporting framework. Models were trained on the SusGen-30K database, and a report generation system was developed by integrating the RAG technique (Wu et al., 2025).
Gupta et al. (2025) developed a Knowledge Graph-Retrieval Augmented Generation (KG-RAG) system integrated with an LLM to extract information from corporate ESG data and sustainability-focused news content, enabling user queries to be answered from these sources.
These studies demonstrate the increasing applicability of LLMs in ESG-related text analysis, reporting, and sustainability-oriented document understanding. However, most studies primarily focus on information extraction, classification, or report generation tasks, while comparative analyses between retrieval-based and parameter-efficient adaptation approaches remain limited. In addition, existing studies generally evaluate either RAG-based systems or fine-tuned LLM architectures independently, without providing a direct comparison between these approaches within the same ESG domain and task setting. To address this gap, the present study comparatively evaluates RAG and LoRA-based architectures using food sector ESG reports across multiple ESG-oriented tasks, including summarization, SWOT analysis, and GRI-based recommendation generation.
3. Methodology
The detailed information regarding the ESG report corpus used in this study is presented in Table 1.
Company Code | Country | Reporting Year | Language | File Format | Size (MB) | Tables/Figures Usage | GRI Standards Usage |
A01 | USA | 2023 | English | | 10.4 | Yes | Yes |
A02 | Brazil | 2023 | English | | 13.5 | Yes | Yes |
A03 | Italy | 2023 | English | | 28.5 | Yes | Yes |
A04 | Switzerland | 2023 | English | | 22.7 | Yes | Yes |
A05 | USA | 2023 | English | | 12.0 | Yes | Yes |
A06 | USA | 2023 | English | | 6.1 | Yes | Yes |
A07 | Switzerland | 2023 | English | | 16.4 | Yes | Yes |
A08 | USA | 2023 | English | | 13.3 | Yes | Yes |
A09 | Türkiye | 2023 | Turkish | | 9.3 | Yes | Yes |
A10 | Türkiye | 2023 | Turkish | | 12.6 | Yes | Yes |
A11 | Italy | 2023 | English | | 34.3 | Yes | Yes |
A12 | New Zealand | 2023 | English | | 24.6 | Yes | Yes |
A13 | USA | 2023 | English | | 18.8 | Yes | Yes |
A14 | USA | 2023 | English | | 2.8 | Yes | Yes |
A15 | USA | 2023 | English | | 33.5 | Yes | Yes |
A16 | Brazil | 2023 | English | | 15.3 | Yes | Yes |
A17 | Türkiye | 2023 | Turkish | | 13.5 | Yes | Yes |
A18 | USA | 2023 | English | | 17.3 | Yes | Yes |
A19 | France | 2023 | English | | 5.8 | Yes | Yes |
A20 | Canada | 2023 | English | | 9.5 | Yes | Yes |
A21 | USA | 2023 | English | | 3.7 | Yes | Yes |
A22 | Türkiye | 2023 | Turkish | | 25.8 | Yes | Yes |
A23 | Türkiye | 2023 | Turkish | | 14.8 | Yes | Yes |
A24 | Thailand | 2023 | English | | 21.8 | Yes | Yes |
A25 | USA | 2023 | English | | 25.7 | Yes | Yes |
A26 | USA | 2023 | English | | 17.3 | Yes | Yes |
A27 | Türkiye | 2023 | English | | 51.4 | Yes | Yes |
A28 | Türkiye | 2023 | Turkish | | 8.0 | Yes | Yes |
A29 | USA | 2023 | English | | 28.6 | Yes | Yes |
A30 | Switzerland | 2023 | English | | 18.9 | Yes | Yes |
Within the scope of this study, the food sector was selected due to its high ESG materiality and the simultaneous presence of ESG dimensions within a single industrial domain. The food sector involves a broad range of ESG-related themes, such as agricultural sustainability, emissions management, labor rights, and supply-chain transparency, making it a suitable domain for context-aware LLM evaluation. In addition, the presence of the GRI 13 (Agriculture, Aquaculture and Fishing Sectors) Sector Standard also enhances the suitability of the food sector for retrieval-based ESG analysis scenarios. Furthermore, the availability of publicly accessible ESG reports in the food sector supported the creation of a comparable evaluation corpus.
The use of 30 ESG reports was considered appropriate because the study primarily aims to perform a methodological comparison between RAG- and LoRA-based architectures rather than to establish a statistical representation of the entire global food sector. Accordingly, the same ESG report collection was employed across both integration scenarios to maintain a consistent and comparable experimental setting. To assess generalization rather than memorization, the 30 ESG reports were partitioned into a stratified holdout split, with 24 reports assigned to the training set and 6 reports reserved for the held-out test set. Instead of applying a fully random partitioning strategy, the dataset was split to preserve the representation of food-sector sub-categories across both the training and held-out test sets.
A central distinction in this study concerns where leakage must be prevented. Train-test leakage is a concern only for LoRA fine-tuning, where the model must not be trained on the reports on which it is later evaluated. For RAG, retrieving the content of the target report at inference time is the intended use of RAG and does not constitute leakage. Accordingly, the six held-out test reports were excluded only from LoRA fine-tuning: the LoRA instruction–response training pairs were constructed solely from the 24 training reports (together with the GRI standards), and the held-out reports were never seen during fine-tuning.
For RAG, the retrievable knowledge base is constructed per target report at inference time. When a held-out report is analyzed, a FAISS index (using the all-MiniLM-L6-v2 sentence-embedding model) is built from that report’s own text chunks; for the GRI-aligned recommendation task, the GRI standards are additionally indexed and retrieved, with report and GRI passages interleaved so that the prompt contains both the company’s own content and GRI standard text (reference/bibliography sections of the GRI standards were excluded from the GRI corpus to avoid retrieving non-substantive citation lists). The 24 training reports are not part of the retrievable base, since each report is summarized and analyzed from its own content. This ensures that RAG can access the very documents it is expected to summarize, analyze, and use for GRI-aligned recommendations. For the recommendation task, both methods incorporate GRI knowledge, though through different mechanisms: RAG retrieves GRI text into the prompt, whereas the fine-tuned model relies on GRI patterns learned during fine-tuning.
In summary, the two settings differ only in how each obtains the target report’s content at inference, not in whether or how much they can access it. The fine-tuned (LoRA) model is never fine-tuned on the held-out reports so that its outputs reflect task-level generalization rather than report-specific memorization, yet it receives the content of the target report in-context at inference (the first ~1,200 words, ≈1,700 tokens). The RAG model is allowed to retrieve the target test report during inference, which is the normal use case of RAG, with its prompt capped at 2,048 tokens, comparable to the in-context budget of the fine-tuned model. To keep the comparison fair, both arms use matched generation budgets and identical decoding settings (up to 384 new tokens; temperature 0.4; repetition penalty 1.2; no-repeat-trigram). Both methods, therefore, operate on previously unseen reports with comparable access to the target report’s content when generating summaries, SWOT analyses, and GRI-aligned recommendations, so that observed performance differences reflect the methods themselves rather than unequal access or generation budgets. Finally, practical computational considerations also influenced the dataset size, as both LoRA fine-tuning and RAG-based inference require substantial processing resources and extended evaluation time.
Our study focuses on the analysis of 30 distinct ESG reports from the food sector using the Llama-3.1-8B-Instruct model under two different integration scenarios: RAG and LoRA fine-tuning. To assess generalization rather than memorization, the 30 reports are divided into a stratified hold-out split of 24 training reports and 6 held-out test reports; both scenarios are evaluated on the same 6 held-out reports under matched generation budgets and identical decoding settings, so that the two approaches are compared fairly.
In the first scenario, the RAG technique is employed. Rather than building a single static database from all reports, the retrievable knowledge base is constructed per target report at inference time: for each held-out report under analysis, a vector index is built from that report’s own text chunks (and, for the GRI-aligned recommendation task, from the GRI standards). The system semantically retrieves the text chunks most relevant to the query and incorporates them into the prompt, that is, the model input. Consequently, the generated output does not depend solely on the model’s training data but is also grounded in the content of the target report retrieved at inference. This approach aims to increase the accuracy of the outputs and reduce the risk of hallucination. Because retrieving the document under analysis is the intended use of RAG, the held-out reports are available to RAG at inference while remaining excluded from LoRA fine-tuning.
The second scenario utilizes the LoRA technique. The model is fine-tuned using the LoRA method on the 24 training reports only; the 6 held-out reports are never used in fine-tuning. Crucially, the model’s original parameters are not updated during this process. Instead, the update to each targeted weight matrix is represented as the product of two much smaller low-rank matrices, and only these smaller matrices are trained while the original parameters remain frozen. At inference, the fine-tuned model receives the content of the target held-out report in-context. This approach aims to enhance the model’s competence in a specific domain, enabling it to produce more consistent and domain-specific outputs.
In both integration scenarios, the ESG reports were summarized, a SWOT analysis was conducted based on the summarized information, and in the final step, GRI Standards-compliant solution proposals were developed for the weaknesses and threats identified from the SWOT analysis. The resulting outputs, produced under matched generation budgets and identical decoding settings, were evaluated on the 6 held-out reports, and a performance comparison was conducted using metrics such as Faithfulness, Factual Consistency, Answer Correctness, and Answer Relevancy, in addition to token consumption and computational time.
The workflow of the RAG and LoRA-enhanced LLM system is shown in Figure 1.

1. What differences can be observed in the performance of the Llama-3.1-8B-Instruct models, structured with RAG and LoRA, in the processes of analyzing and summarizing ESG reports from the food sector, conducting SWOT analyses, and generating solution recommendations aligned with GRI standards?
2. Which model approach is more resource-efficient in terms of token usage and computational time?
3. In the evaluation conducted across the companies included in the study, which technique demonstrated higher overall performance?
Developed by Meta and belonging to the Llama 3.1 model family, the Llama-3.1-8B-Instruct model is an open-source LLM based on the Transformer architecture, consisting of 8 billion parameters and pre-trained on a dataset containing approximately 15 trillion tokens. The model supports a 128,000-token context window, which enables the processing of long texts as a whole and provides a stronger architectural capability compared to its predecessors.
With the use of the Grouped-Query Attention (GQA) technique, memory overhead and computational costs during inference are optimized, enabling more efficient and low-latency usage. Fine-tuned using Supervised Fine-Tuning (SFT) and Direct Preference Optimization (DPO), the model has improved capabilities in following user instructions accurately and producing outputs that directly provide human-oriented value. Architectural improvements minimize the “lost in the middle” problem and ensure high performance in complex reasoning tasks.
RAG is a technique that enables LLMs to produce more accurate and specific responses by augmenting them with external knowledge sources. The adoption of this technique with LLMs has led to a significant shift in how these models access and process information. This integration has significantly reduced both the problem of outdated knowledge and the risk of hallucination (Verma, 2024).
The RAG approach combines two main stages: retrieval (information fetching) and generation (text production). In the retrieval stage, external knowledge sources are searched for chunks relevant to the user’s query. In the generation stage, the retrieved chunks are provided to the LLM, which generates its response using both the knowledge acquired during training and the external sources. In summary, the retrieval stage provides access to various data sources, while the generation stage produces contextually appropriate content, which together yield effective results in Natural Language Processing tasks (Li & Ramakrishnan, 2025).
In this study, RAG is configured to operate per target report, so that at inference the model retrieves the very document under analysis. For each held-out report, a FAISS index (IndexFlatL2) is built from that report’s own text chunks (800-word windows, 120-word overlap) encoded with the all-MiniLM-L6-v2 model, and the six most relevant chunks to the task query are retrieved into the prompt. For summarization and SWOT, the retrievable base consists solely of the target report; for the GRI-aligned recommendation task, the GRI standards are additionally indexed, and report and GRI passages are retrieved jointly and interleaved. Reference/bibliography sections of the GRI standards are excluded from the corpus to avoid retrieving non-substantive citation lists. To ensure a fair comparison, the RAG prompt is capped at 2,048 tokens and generation is limited to 384 new tokens under fixed decoding (temperature 0.4, repetition penalty 1.2, no-repeat trigram); the base model is loaded in 4-bit NF4. The training reports are not part of the retrievable base.
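The chunking and retrieval steps described above can be sketched in a few lines. This is a minimal, dependency-free illustration rather than the study's implementation: the function names `chunk_words` and `top_k_l2` are hypothetical, the stride (window size minus overlap) is an assumed reading of "800-word windows, 120-word overlap", and the embeddings are taken as given lists of floats instead of being produced by all-MiniLM-L6-v2. The brute-force search mirrors the exact L2 ranking that a flat index performs.

```python
def chunk_words(text, size=800, overlap=120):
    """Split text into overlapping word windows (defaults follow the stated settings)."""
    words = text.split()
    step = size - overlap  # assumed stride: each window starts 680 words after the previous one
    chunks = []
    for start in range(0, len(words), step):
        chunks.append(" ".join(words[start:start + size]))
        if start + size >= len(words):  # the last window already reaches the end of the text
            break
    return chunks


def top_k_l2(query_vec, chunk_vecs, k=6):
    """Return indices of the k chunk vectors closest to the query (squared L2 distance)."""
    dists = [sum((q - c) ** 2 for q, c in zip(query_vec, vec)) for vec in chunk_vecs]
    return sorted(range(len(chunk_vecs)), key=lambda i: dists[i])[:k]


if __name__ == "__main__":
    report = " ".join(f"w{i}" for i in range(2000))  # toy 2,000-word "report"
    chunks = chunk_words(report)
    print(len(chunks), [len(c.split()) for c in chunks])
    # Toy 2-D "embeddings" stand in for real sentence embeddings.
    print(top_k_l2([0.0, 0.0], [[3.0, 4.0], [1.0, 0.0], [0.0, 2.0]], k=2))
```

Because no similarity threshold is applied, the retrieval always returns the k nearest chunks, however distant they are from the query.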
LoRA, an abbreviation for Low-Rank Adaptation, is a technique used to fine-tune LLMs for specific tasks. Unlike traditional fine-tuning methods, LoRA does not require retraining all of the model’s parameters. In conventional approaches, the entire set of parameters is retrained, a process that is both time-consuming and costly.
LoRA represents the newly acquired knowledge during fine-tuning with low-rank matrices and updates only these matrices during the training phase, while the other parameters remain static. Once the training process is complete, the low-rank matrices are integrated with the pre-trained model parameters, customizing the model for the desired task. As noted, performing fine-tuning with LoRA, rather than training the entire model, requires less computational power and reduces costs. Furthermore, since the number of parameters that need to be stored is lower compared to traditional methods, a reduction in memory consumption is also observed.
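To make the parameter savings concrete, the sketch below counts the trainable parameters introduced by LoRA adapters. It is an illustration under stated assumptions: the layer shapes (hidden size 4096, MLP size 14336, key/value projection size 1024, 32 layers) are taken from the publicly available Llama-3.1-8B configuration rather than from this paper, and the helper names are hypothetical. For a weight matrix of shape d_out × d_in, a rank-r adapter trains two matrices of shapes d_out × r and r × d_in.

```python
def lora_params(d_out, d_in, r):
    """Trainable parameters of one adapter: B is d_out x r and A is r x d_in."""
    return r * (d_out + d_in)


# Assumed Llama-3.1-8B shapes (from the public model config, not from the paper).
HIDDEN, INTERMEDIATE, KV_DIM, LAYERS = 4096, 14336, 1024, 32

# (d_out, d_in) for each adapted projection in one Transformer layer.
SHAPES = {
    "q_proj": (HIDDEN, HIDDEN),
    "k_proj": (KV_DIM, HIDDEN),
    "v_proj": (KV_DIM, HIDDEN),
    "o_proj": (HIDDEN, HIDDEN),
    "gate_proj": (INTERMEDIATE, HIDDEN),
    "up_proj": (INTERMEDIATE, HIDDEN),
    "down_proj": (HIDDEN, INTERMEDIATE),
}


def total_lora_params(r=16):
    """Total adapter parameters when every projection in every layer is adapted."""
    per_layer = sum(lora_params(d_out, d_in, r) for d_out, d_in in SHAPES.values())
    return per_layer * LAYERS


if __name__ == "__main__":
    full = HIDDEN * HIDDEN
    print("full q_proj:", full, "rank-16 adapter:", lora_params(HIDDEN, HIDDEN, 16))
    print("total adapter parameters at r=16:", total_lora_params(16))
```

Under these assumptions, a rank-16 adapter on a 4096 × 4096 projection trains 131,072 parameters instead of 16,777,216, and the adapters across all layers total roughly 42 million parameters, well under 1% of the 8 billion base parameters.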
Quantization is an optimization technique that converts the parameters of LLMs from high-precision formats such as float32 or float16 to lower-bit representations, such as 8-bit or 4-bit. This significantly reduces memory consumption and bandwidth usage, lowering hardware requirements and potentially speeding up inference. The Quantized Low-Rank Adaptation (QLoRA) method builds on this technique: it compresses the frozen model weights to 4 bits and performs training through low-rank LoRA adapters added on top of these weights. The adapter matrices are updated in FP16/BF16 precision, which allows the model to preserve performance while significantly reducing memory consumption. This structure enables fine-tuning to be carried out efficiently without substantial performance degradation.
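A back-of-the-envelope calculation illustrates the memory effect of quantization for a model of this size. The sketch assumes a nominal 8 billion parameters and counts weight storage only; it ignores quantization constants, activations, the KV cache, and optimizer state, so real memory use will be higher. The function name is hypothetical.

```python
def weight_memory_gb(n_params, bits_per_param):
    """Approximate weight storage in gigabytes (10^9 bytes), weights only."""
    return n_params * bits_per_param / 8 / 1e9


if __name__ == "__main__":
    n_params = 8_000_000_000  # nominal parameter count of an 8B model
    for name, bits in [("float32", 32), ("float16", 16), ("8-bit", 8), ("4-bit", 4)]:
        print(f"{name:>8}: {weight_memory_gb(n_params, bits):.1f} GB")
```

Moving from 16-bit to 4-bit weights cuts this estimate by a factor of four, which is what makes fine-tuning an 8B model feasible on a single consumer-grade GPU in many setups.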
In this study, the model is fine-tuned with LoRA on the 24 training reports only; the six held-out reports are excluded from fine-tuning to prevent leakage. LoRA adapters of rank r = 16 (α = 32, dropout 0.1) are applied to the projection matrices (q/k/v/o, gate/up/down), with the base model in 4-bit NF4 (QLoRA). Training uses a learning rate of 2e-4 for three epochs, batch size 1 with gradient accumulation 16 (effective batch 16), and a max sequence length of 512. At inference, the fine-tuned model is given the first ~1,200 words of the target report in-context and generates up to 384 new tokens under the same decoding settings used for RAG (temperature 0.4, repetition penalty 1.2, no-repeat trigram), so the two approaches are compared under matched generation budgets.
The main implementation settings and retrieval-related parameters adopted in the RAG pipeline are summarized in Table 2.
Parameter | Configuration |
Chunk size | 800 words |
Chunk overlap | 120 words |
Embedding model | sentence-transformers/all-MiniLM-L6-v2 |
Vector database setting | FAISS with IndexFlatL2 |
Top-k value | 6 |
Similarity threshold | None (deterministic top-k retrieval; no score cutoff) |
Prompt template | Retrieved context + task instruction + user query + output formatting instruction |
Max prompt length | 2,048 tokens |
Max new tokens | 384 |
Temperature | 0.4 |
Repetition penalty | 1.2 |
No-repeat n-gram size | 3 |
Quantization Setting | 4-bit NF4 (Normalized Float 4) |
Reranking | Not used |
Table 3 presents the prompt template structure and task-specific instruction configurations used within the RAG pipeline.
Task Type | Instruction |
Summary | Summarize the ESG report within 20% of the original length |
SWOT | Generate a SWOT analysis based on the retrieved ESG content |
GRI-based Recommendation | Produce concise GRI-aligned recommendations for weaknesses and threats |
Table 4 presents the LoRA fine-tuning configuration and training parameters used in this study.
Parameter | Configuration |
Base Large Language Model (LLM) | Llama-3.1-8B-Instruct |
Quantization setting | 4-bit NF4 (Normalized Float 4) |
Number of epochs | 3 |
Rank value | 16 |
Alpha value | 32 |
Dropout | 0.1 |
Learning rate | 2e-4 |
Per-device train batch size | 1 |
Gradient accumulation steps | 16 |
Effective batch size | 16 |
Training time | 3 hours 3 minutes |
Fine-tuning corpus | 24 training reports |
Max sequence length | 512 |
Inference input | first ~1,200 words in-context |
Max new tokens (output) | 384 |
Temperature | 0.4 |
Repetition penalty | 1.2 |
No-repeat n-gram size | 3 |
The four quality metrics are computed automatically and are reference-free: they are evaluated against the source report and the task query rather than against human-authored gold answers. Let O denote a generated output, S the source reference (the first 12,000 characters of the target report, used as the grounding reference), and Q the task instruction (query). Semantic comparisons use a multilingual sentence-embedding model, paraphrase-multilingual-mpnet-base-v2 (768 dimensions), denoted E(·). For any two texts a and b, sim(a, b) is the cosine similarity between their sentence embeddings, clipped to the range [0, 1] (negative values are set to 0):
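Written out, the clipped similarity described above takes the following form. This is a reconstruction from the stated definition, with cos denoting the cosine similarity between two embedding vectors:

```latex
\mathrm{sim}(a, b) = \max\bigl(0,\ \cos\bigl(E(a), E(b)\bigr)\bigr),
\qquad
\cos(u, v) = \frac{u \cdot v}{\lVert u \rVert \, \lVert v \rVert}
```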
Since the embeddings are normalized, the cosine similarity reduces to a dot product and lies in [−1, 1]; negative similarities, which are rare for semantically related ESG text, are mapped to 0, so sim(a, b) ∈ [0, 1].
For Faithfulness, the output O and the source S are split into sentences, {o_1, ..., o_m} and {s_1, ..., s_n}. Each output sentence is scored by its best semantic support in the source, and the scores are averaged:
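In symbols, this best-match averaging can be written as follows. The expression is a reconstruction from the verbal definition, using the sentence sets and the similarity sim(·, ·) introduced above:

```latex
\mathrm{Faithfulness}(O, S) = \frac{1}{m} \sum_{i=1}^{m} \max_{1 \le j \le n} \mathrm{sim}(o_i, s_j)
```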
A low value indicates output sentences that are not supported by any sentence in the source (potential hallucination).
For Factual Consistency, let N(·) be the set of numeric tokens in a text (matched by a numeric regular expression) and Ent(·) the set of heuristic named entities (capitalized tokens of length ≥ 3 and 2–5-letter acronyms, after removing a stopword list). Define the numeric and entity overlaps as
num = |N(O) ∩ N(S)| / |N(O)| (and num = 1 if N(O) = ∅), and
ent = |Ent(O) ∩ Ent(S)| / |Ent(O)| (and ent = 1 if Ent(O) = ∅).
Factual Consistency combines these with semantic similarity to the source:
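Assuming the weights 0.20/0.20/0.60 reported for Factual Consistency apply to num, ent, and the semantic similarity in that order, the combination can be written as:

```latex
\mathrm{FC}(O, S) = 0.20\,\mathrm{num} + 0.20\,\mathrm{ent} + 0.60\,\mathrm{sim}(O, S)
```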
This rewards outputs whose numbers and named entities are supported by the source report.
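The numeric and entity overlaps can be illustrated with a short sketch. Several details here are assumptions rather than the released implementation: the regular expression `NUM_RE`, the tiny `STOPWORDS` set, the tokenization, and the assignment of the weights 0.20/0.20/0.60 to num, ent, and semantic similarity in that order. The semantic similarity is passed in as a number so that the example runs without an embedding model.

```python
import re

NUM_RE = re.compile(r"\d+(?:[.,]\d+)*")        # hypothetical numeric pattern
STOPWORDS = {"The", "This", "And", "For"}       # hypothetical stopword list


def numeric_tokens(text):
    """Set of numeric tokens found in the text."""
    return set(NUM_RE.findall(text))


def entity_tokens(text):
    """Heuristic entities: capitalized tokens (length >= 3) and 2-5 letter acronyms."""
    entities = set()
    for token in re.findall(r"[A-Za-z]+", text):
        is_capitalized = token[0].isupper() and len(token) >= 3
        is_acronym = token.isupper() and 2 <= len(token) <= 5
        if (is_capitalized or is_acronym) and token not in STOPWORDS:
            entities.add(token)
    return entities


def overlap(output_set, source_set):
    """Share of output items also present in the source; 1.0 when the output has none."""
    if not output_set:
        return 1.0
    return len(output_set & source_set) / len(output_set)


def factual_consistency(output, source, semantic_sim, weights=(0.20, 0.20, 0.60)):
    """Weighted combination of numeric overlap, entity overlap, and semantic similarity."""
    num = overlap(numeric_tokens(output), numeric_tokens(source))
    ent = overlap(entity_tokens(output), entity_tokens(source))
    return weights[0] * num + weights[1] * ent + weights[2] * semantic_sim


if __name__ == "__main__":
    source = "Acme cut emissions by 12% in 2023 under GRI rules."
    output = "Acme reduced emissions 12% in 2022."
    print(factual_consistency(output, source, semantic_sim=0.5))
```

In this toy case the output reuses one of its two numbers from the source ("12" but not "2022"), so the numeric overlap is 0.5, while its only entity ("Acme") is supported, so the entity overlap is 1.0.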
Answer Correctness is a composite proxy that combines task-form adherence, structural formatting, length adequacy, and semantic alignment with the source. Let K_task be a small set of task-specific keywords (for summarization: environmental/social/governance terms; for SWOT: strengths/weaknesses/opportunities/threats terms; for recommendations: GRI/recommendation/action terms; full lists are provided in the released code). With w denoting the word count of O, the components are:
kw = min( |{k ∈ K_task : k ∈ O}| / max(0.4·|K_task|, 1), 1 );
struct = 1.0 if O contains any list or structure marker (-, •, *, “1.”, “:”), otherwise 0.4;
len = 1.0 if 40 ≤ w ≤ 700; len = max(0.5, 1 − (w − 700)/1500) if w > 700; len = max(0.2, w/40) if w < 40;
and the metric is
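Assuming the weights 0.20/0.15/0.15/0.50 reported for Answer Correctness apply to the components in the order they are introduced (kw, struct, len, and semantic alignment with the source), the composite can be written as:

```latex
\mathrm{AC}(O, S) = 0.20\,\mathrm{kw} + 0.15\,\mathrm{struct} + 0.15\,\mathrm{len} + 0.50\,\mathrm{sim}(O, S)
```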
We note that this metric does not compare the output against an independent reference answer; it is a reference-free proxy for task-appropriate, source-aligned output.
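The three rule-based components of Answer Correctness (kw, struct, len) can be sketched directly from their definitions. The function names are hypothetical, and two details are assumptions: keyword matching is done case-insensitively by substring, and structure markers are detected by simple substring presence, which means that any colon or hyphen in the output counts as a marker.

```python
def kw_score(output, keywords):
    """Keyword coverage: hits divided by 40% of the keyword list, capped at 1."""
    text = output.lower()
    hits = sum(1 for k in keywords if k.lower() in text)
    return min(hits / max(0.4 * len(keywords), 1), 1.0)


def struct_score(output):
    """1.0 if any list or structure marker appears in the output, otherwise 0.4."""
    markers = ("-", "•", "*", "1.", ":")
    return 1.0 if any(marker in output for marker in markers) else 0.4


def len_score(word_count):
    """Length adequacy: full score between 40 and 700 words, penalized outside."""
    if word_count < 40:
        return max(0.2, word_count / 40)
    if word_count > 700:
        return max(0.5, 1 - (word_count - 700) / 1500)
    return 1.0


if __name__ == "__main__":
    swot_keywords = ["strengths", "weaknesses", "opportunities", "threats"]
    draft = "Strengths: strong supplier audits. Weaknesses: limited water data."
    print(kw_score(draft, swot_keywords), struct_score(draft), len_score(len(draft.split())))
```

Because the keyword denominator is 40% of the list, an output only needs to mention a minority of the task keywords to reach the maximum kw score, which makes this component a loose check of task form rather than of content quality.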
Answer Relevancy measures how well the output addresses the requested task, computed as the semantic similarity between the output and the task query:
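With the clipped similarity sim(·, ·) defined earlier, this relevancy score reduces to a single term (a reconstruction from the verbal definition):

```latex
\mathrm{AR}(O, Q) = \mathrm{sim}(O, Q)
```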
All composite weights (0.20/0.20/0.60 for Factual Consistency and 0.20/0.15/0.15/0.50 for Answer Correctness) are heuristic. Because the metrics are computed against the source document and the task instruction rather than against expert-annotated reference outputs, and because no human or expert spot-checking was performed in this study, the scores should be interpreted as automated proxies for output quality; this limitation, and the value of complementary human evaluation, are discussed in Section 6.2.
4. Results
Table 5 presents the comparative performance on the six held-out test companies. These reports were excluded only from the LoRA fine-tuning corpus; for RAG, the target report is retrieved at inference, in line with the intended use of RAG.
Task | Metric | RAG_Average | LoRA_Average | Difference |
Summary | Faithfulness | 0.465 | 0.582 | +0.118 |
Summary | Factual Consistency | 0.324 | 0.387 | +0.063 |
Summary | Answer Correctness | 0.498 | 0.654 | +0.157 |
Summary | Answer Relevancy | 0.338 | 0.437 | +0.099 |
SWOT | Faithfulness | 0.476 | 0.503 | +0.027 |
SWOT | Factual Consistency | 0.306 | 0.381 | +0.076 |
SWOT | Answer Correctness | 0.462 | 0.607 | +0.145 |
SWOT | Answer Relevancy | 0.282 | 0.447 | +0.166 |
GRI-based Recommendation | Faithfulness | 0.481 | 0.525 | +0.044 |
GRI-based Recommendation | Factual Consistency | 0.371 | 0.395 | +0.024 |
GRI-based Recommendation | Answer Correctness | 0.571 | 0.670 | +0.099 |
GRI-based Recommendation | Answer Relevancy | 0.245 | 0.528 | +0.283 |
All scores are averages over the 6 held-out test companies. These reports were excluded only from LoRA fine-tuning; for RAG, the target report is retrieved at inference, consistent with the intended use of RAG. Quality metrics are computed using embedding-based measures (sentence-transformer cosine similarity for faithfulness and relevancy; weighted numeric + entity + semantic overlap for factual consistency).
The RAG_Average column reports the mean score obtained by the RAG scenario for the corresponding task and metric, across the 6 held-out test companies. The LoRA_Average column reports the corresponding mean for the LoRA fine-tuned model on the same 6 companies. The Difference column reports the mean difference (LoRA_Average - RAG_Average); a positive value indicates that LoRA scores higher than RAG.
Table 6 presents the paired statistical comparisons between LoRA and RAG (paired t-test, Wilcoxon signed-rank test, and Cohen’s d) for each of the 12 task–metric combinations across the six held-out test companies, complementing the descriptive results in Table 5.
Table 6 reports, for each of the 12 task–metric combinations, the mean scores and standard deviations (Mean ± SD) over the six held-out companies, together with paired statistical comparisons between LoRA and RAG: paired t-test and Wilcoxon signed-rank p-values (α = 0.05) and paired Cohen’s d, with |d| > 0.8 conventionally interpreted as a large effect. Because n = 6, the smallest attainable two-sided Wilcoxon p-value is 0.031 (obtained when all six companies change in the same direction), and 0.062 represents the next attainable level; the test therefore has limited resolution at this sample size. The task-level findings are interpreted in the accompanying paragraphs (and summarized in the "Taken together" discussion).
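The attainable p-values mentioned here can be checked by enumerating the exact null distribution of the signed-rank statistic. The sketch below assumes no ties and no zero differences among the six paired differences; `attainable_two_sided_p` is an illustrative helper, not part of the study's code.

```python
from itertools import product

def attainable_two_sided_p(n):
    """Exact two-sided Wilcoxon signed-rank p-values for n untied, nonzero differences."""
    total = n * (n + 1) // 2
    counts = {}
    # Under H0 each rank 1..n carries a + or - sign with probability 1/2.
    for signs in product((0, 1), repeat=n):
        w_plus = sum(rank for rank, s in zip(range(1, n + 1), signs) if s)
        w = min(w_plus, total - w_plus)  # test statistic: the smaller rank sum
        counts[w] = counts.get(w, 0) + 1
    p_values, cumulative = {}, 0
    for w in sorted(counts):
        cumulative += counts[w]
        p_values[w] = cumulative / 2 ** n  # P(statistic <= w) under H0
    return p_values

p = attainable_two_sided_p(6)
smallest_two = sorted(p.values())[:2]  # [0.03125, 0.0625]
```

The same enumeration gives 0.09375 and 0.15625 as the next levels, which correspond to the 0.094 and 0.156 entries in Table 6.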
For the summarization task, LoRA obtains higher mean scores than RAG on all four metrics, but the difference is statistically significant for only two: Faithfulness (Δ = +0.118, p = 0.023, d = 1.33) and Answer Correctness (Δ = +0.157, p = 0.021, d = 1.35), both also supported by the Wilcoxon test (p = 0.031). The gains in Factual Consistency (Δ = +0.063, p = 0.247) and Answer Relevancy (Δ = +0.099, p = 0.365) are not significant. These results indicate that LoRA-generated summaries are more closely grounded in the source documents and more structurally aligned with the task instruction, while the two approaches are comparable in factual consistency and query relevance for this task.
Task | Metric | RAG Mean ± SD | LoRA Mean ± SD | Paired t-Test p | Wilcoxon p | Effect Size (Cohen’s d) |
Summary | Faithfulness | 0.464 ± 0.082 | 0.582 ± 0.141 | 0.023 | 0.031 | 1.33 |
Summary | Factual Consistency | 0.324 ± 0.108 | 0.387 ± 0.058 | 0.247 | 0.438 | 0.53 |
Summary | Answer Correctness | 0.498 ± 0.071 | 0.654 ± 0.106 | 0.021 | 0.031 | 1.35 |
Summary | Answer Relevancy | 0.338 ± 0.129 | 0.437 ± 0.179 | 0.365 | 0.562 | 0.41 |
SWOT | Faithfulness | 0.476 ± 0.107 | 0.502 ± 0.056 | 0.650 | 1.000 | 0.20 |
SWOT | Factual Consistency | 0.306 ± 0.137 | 0.381 ± 0.124 | 0.359 | 0.438 | 0.41 |
SWOT | Answer Correctness | 0.462 ± 0.059 | 0.607 ± 0.137 | 0.037 | 0.062 | 1.15 |
SWOT | Answer Relevancy | 0.282 ± 0.081 | 0.447 ± 0.129 | 0.058 | 0.094 | 1.00 |
GRI-based Recommendation | Faithfulness | 0.481 ± 0.096 | 0.525 ± 0.055 | 0.229 | 0.156 | 0.56 |
GRI-based Recommendation | Factual Consistency | 0.371 ± 0.105 | 0.395 ± 0.090 | 0.761 | 0.438 | 0.13 |
GRI-based Recommendation | Answer Correctness | 0.571 ± 0.075 | 0.670 ± 0.052 | 0.098 | 0.156 | 0.83 |
GRI-based Recommendation | Answer Relevancy | 0.245 ± 0.107 | 0.528 ± 0.183 | 0.014 | 0.062 | 1.52 |
For the SWOT analysis task, LoRA again obtains higher mean scores than RAG on all four metrics, but only Answer Correctness reaches statistical significance under the paired t-test (Δ = +0.145, p = 0.037, d = 1.15), and even this comparison is borderline under the more conservative Wilcoxon test (p = 0.062). The differences in Answer Relevancy (Δ = +0.166, p = 0.058, d = 1.00), Factual Consistency (Δ = +0.076, p = 0.359), and Faithfulness (Δ = +0.027, p = 0.650) did not reach statistical significance, indicating that the two approaches perform comparably on these dimensions. LoRA’s clearest advantage on this task is therefore its stronger adherence to the four-category SWOT structure, as reflected in Answer Correctness.
For the GRI-based recommendation task, LoRA again obtains higher mean scores than RAG on all four metrics, but only Answer Relevancy reaches statistical significance under the paired t-test (Δ = +0.283, p = 0.014, d = 1.52), and, as with SWOT Answer Correctness, this comparison is borderline under the Wilcoxon test (p = 0.062). The differences in Answer Correctness (Δ = +0.099, p = 0.098), Faithfulness (Δ = +0.044, p = 0.229), and Factual Consistency (Δ = +0.024, p = 0.761) did not reach statistical significance, indicating comparable performance on these dimensions. This pattern suggests that LoRA’s main advantage for this task lies in producing recommendations that are more responsive to the specific weaknesses and threats raised by the query.
Taken together, Table 5 and Table 6 show that the LoRA fine-tuned model obtains higher mean scores than RAG on all 12 task–metric combinations, but only 4 of the 12 paired comparisons reach statistical significance under the paired t-test (p < 0.05): Faithfulness and Answer Correctness for summarization, Answer Correctness for SWOT, and Answer Relevancy for recommendations. Two of these four are corroborated by the Wilcoxon test (both summarization metrics, p = 0.031), whereas the other two are borderline (p = 0.062), reflecting the limited resolution of the Wilcoxon test at n = 6. Where significant, the effect sizes are large (Cohen’s d between 1.15 and 1.52); for the remaining comparisons, the effects are small to moderate and non-significant. Given the small held-out sample, these results indicate a task- and metric-dependent advantage for LoRA rather than a uniform one. Because the scores are obtained on companies that the LoRA model never saw during training, while RAG retrieved each target report at inference, the comparison reflects performance on unseen reports under matched conditions rather than memorization of report-specific wording.
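The link between effect size and significance at this sample size can be made explicit. The sketch below assumes that the paired Cohen's d in Table 6 is the mean of the paired differences divided by their standard deviation, in which case the paired t statistic equals d·√n. This reading is consistent with the reported p-values but is an assumption here. The critical value 2.5706 is the standard two-sided 5% point of Student's t with 5 degrees of freedom.

```python
import math

N_COMPANIES = 6
# Two-sided 5% critical value of Student's t with n - 1 = 5 degrees of freedom.
T_CRIT_DF5 = 2.5706

# ASSUMPTION: d = mean(differences) / sd(differences), so that t = d * sqrt(n).
d_threshold = T_CRIT_DF5 / math.sqrt(N_COMPANIES)  # about 1.05

# Cohen's d values as reported in Table 6.
reported_d = {
    ("Summary", "Faithfulness"): 1.33,
    ("Summary", "Factual Consistency"): 0.53,
    ("Summary", "Answer Correctness"): 1.35,
    ("Summary", "Answer Relevancy"): 0.41,
    ("SWOT", "Faithfulness"): 0.20,
    ("SWOT", "Factual Consistency"): 0.41,
    ("SWOT", "Answer Correctness"): 1.15,
    ("SWOT", "Answer Relevancy"): 1.00,
    ("GRI", "Faithfulness"): 0.56,
    ("GRI", "Factual Consistency"): 0.13,
    ("GRI", "Answer Correctness"): 0.83,
    ("GRI", "Answer Relevancy"): 1.52,
}

# Comparisons whose effect size clears the n = 6 significance threshold.
significant = [key for key, d in reported_d.items() if abs(d) > d_threshold]
```

Under this reading, only effects larger than roughly d = 1.05 can reach p < 0.05 with six pairs, so SWOT Answer Relevancy (d = 1.00) falls just short, in line with its reported p = 0.058.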
Figure 2 summarizes the average performance of the RAG and LoRA scenarios on the held-out test set, while Tables 5 and 6 disaggregate the same comparison across the three task types (Summary, SWOT, and GRI-aligned Recommendations).
At the aggregate level, the LoRA scenario obtains a higher mean score than RAG on all four quality metrics, but the magnitude of the difference varies considerably across metrics. The largest average gaps are observed for Answer Relevancy, where the mean rises from 0.288 under RAG to 0.471 under LoRA (Δ = +0.183), and for Answer Correctness (RAG 0.510, LoRA 0.644; Δ = +0.133). The gaps for Faithfulness (RAG 0.474, LoRA 0.537; Δ = +0.063) and Factual Consistency (RAG 0.333, LoRA 0.388; Δ = +0.054) are comparatively small.
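These aggregate figures can be approximately recovered from Table 5. The sketch below assumes that each aggregate value is the unweighted mean of the three task-level means, which holds if every task contributes the same number of outputs. Because it starts from the rounded table entries, the results can differ from the values quoted in the text by about 0.001.

```python
# Task-level means from Table 5, ordered as [Summary, SWOT, GRI-based Recommendation].
table5 = {
    "Faithfulness":        {"RAG": [0.465, 0.476, 0.481], "LoRA": [0.582, 0.503, 0.525]},
    "Factual Consistency": {"RAG": [0.324, 0.306, 0.371], "LoRA": [0.387, 0.381, 0.395]},
    "Answer Correctness":  {"RAG": [0.498, 0.462, 0.571], "LoRA": [0.654, 0.607, 0.670]},
    "Answer Relevancy":    {"RAG": [0.338, 0.282, 0.245], "LoRA": [0.437, 0.447, 0.528]},
}

def mean(values):
    return sum(values) / len(values)

# Average each scenario over the three tasks, then take LoRA minus RAG.
aggregate = {
    metric: {scenario: mean(scores) for scenario, scores in by_scenario.items()}
    for metric, by_scenario in table5.items()
}
gaps = {metric: avg["LoRA"] - avg["RAG"] for metric, avg in aggregate.items()}

# Metrics ordered from largest to smallest aggregate gap.
ranking = sorted(gaps, key=gaps.get, reverse=True)
```

The resulting ordering places Answer Relevancy first, followed by Answer Correctness, Faithfulness, and Factual Consistency.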

These aggregate differences should be interpreted in light of the task-level significance tests in Table 6, where only 4 of the 12 task–metric comparisons reach statistical significance. The LoRA advantage is most consistent for Answer Correctness, which is significant for both the Summary and SWOT tasks; Answer Relevancy is significant for the Recommendation task, and Faithfulness for the Summary task. Factual Consistency, although showing a slightly higher mean for LoRA, does not reach significance in any task and exhibits the smallest aggregate gap.
This pattern indicates that, under the fair, matched-budget comparison, the LoRA scenario’s advantage is concentrated in the structural and task-responsiveness dimensions: its outputs are better aligned with the task instruction (Answer Correctness) and more responsive to the specific query (Answer Relevancy). In source grounding, by contrast, the differences are smaller and largely non-significant (Factual Consistency shows no significant difference in any task, and Faithfulness does so only in the Summary task).
Taken together, Figure 2 and Tables 5 and 6 indicate a task- and metric-dependent advantage for LoRA rather than a uniform one: LoRA leads on average across all four metrics, but this advantage is statistically supported in only 4 of the 12 task-level comparisons (Answer Correctness for Summary and SWOT, Answer Relevancy for Recommendation, and Faithfulness for Summary) and is absent for Factual Consistency in every task. Given the small held-out sample (n = 6), these patterns should be regarded as indicative rather than conclusive.
The distribution and stability of the four quality metrics across the held-out test set are shown in Figure 3.

Figure 3 reports the full distribution of the four quality metrics across the held-out test set for both scenarios. The central tendency (mean and median) of the LoRA distribution lies above that of RAG for every metric, but the degree of separation varies across metrics. For Answer Correctness and Answer Relevancy the two distributions show little overlap, with the LoRA box positioned above the RAG box (Answer Correctness: RAG μ = 0.510 vs LoRA μ = 0.644; Answer Relevancy: 0.288 vs 0.471). For Faithfulness (0.474 vs 0.537) and Factual Consistency (0.333 vs 0.388) the gap is smaller and the inter-quartile ranges overlap substantially, mirroring the smaller and mostly non-significant differences for these two metrics (Table 6).
The spread of the two scenarios differs by metric rather than systematically. RAG distributions are tighter than LoRA on three of the four metrics (Faithfulness, Answer Correctness, and especially Answer Relevancy, where LoRA shows the widest spread, σ = 0.161), whereas for Factual Consistency LoRA is the tighter of the two (σ = 0.089 vs RAG σ = 0.114). LoRA therefore tends to achieve higher absolute scores, but in most cases at the cost of greater variability across reports and tasks, which should be acknowledged when interpreting individual report-level outputs.
A few isolated outliers are visible (for example, a high-Faithfulness LoRA case and both a high- and a low-Answer-Relevancy RAG case), but they do not alter the overall pattern. The distributional view is consistent with the aggregate and task-level results: the LoRA advantage is most apparent for Answer Correctness and Answer Relevancy, where the distributions are more separated, and is marginal for Faithfulness and Factual Consistency, where they largely overlap. Given the small held-out sample (n = 6 companies), these distributional patterns should be regarded as indicative rather than conclusive.
The comparison of the two scenarios with respect to average token consumption and processing time per generation is shown in Figure 4.

In terms of token consumption, the two scenarios are comparable: LoRA uses on average 2,075 tokens per generation and RAG 2,420, a difference of approximately 14% relative to RAG. These counts reflect the input actually processed by each model under the matched 2,048-token prompt budget. Thus, although LoRA is marginally more token-efficient, the two approaches do not differ substantially on this dimension.
In terms of processing time, the difference is large and in the opposite direction: RAG completes a generation in 42.91 seconds on average, whereas the LoRA scenario requires 473.16 seconds. This gap arises at inference rather than from any parameter updates (which occur only during training). The RAG pipeline runs the base model under 4-bit quantization, whereas the fine-tuned model in our setup was run in fp16 without quantization, contributing to its longer runtime. In addition, LoRA incurs a one-time fine-tuning cost (approximately 2 h 19 min) that is not reflected in the per-generation time but is part of its overall computational budget.
Taken together, Figure 4 indicates that the two approaches involve a computational trade-off rather than a clear efficiency advantage for either. Token consumption is comparable, while RAG is substantially faster at inference. Efficiency should therefore be assessed jointly across token use, inference latency, and the one-time cost of fine-tuning, rather than on token count alone.
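The arithmetic behind this trade-off can be laid out in a few lines. The per-generation figures and the fine-tuning time are those reported above; the workload of 18 generations (6 reports × 3 tasks) and the helper `total_seconds` are illustrative assumptions, not measurements from the study.

```python
RAG_TOKENS, LORA_TOKENS = 2420, 2075        # mean tokens per generation
RAG_SECONDS, LORA_SECONDS = 42.91, 473.16   # mean seconds per generation
FINE_TUNE_SECONDS = (2 * 60 + 19) * 60      # one-time cost, approx. 2 h 19 min

# Token saving of LoRA expressed relative to RAG, and the latency ratio.
token_saving = (RAG_TOKENS - LORA_TOKENS) / RAG_TOKENS   # about 0.14
latency_ratio = LORA_SECONDS / RAG_SECONDS               # about 11

def total_seconds(n_generations, per_generation, one_time=0.0):
    """Total wall-clock budget: one-time cost plus per-generation cost."""
    return one_time + n_generations * per_generation

# Hypothetical workload: 6 held-out reports x 3 tasks = 18 generations.
n = 18
rag_total = total_seconds(n, RAG_SECONDS)
lora_total = total_seconds(n, LORA_SECONDS, FINE_TUNE_SECONDS)
```

Since LoRA is slower per generation in this configuration, its one-time fine-tuning cost is never amortized against RAG; the gap in total time grows with the number of generations.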
5. Conclusions
This study evaluated 30 food-sector ESG reports using two LLM-assisted analysis scenarios, RAG and LoRA, built on the same Llama-3.1-8B-Instruct base model. To prevent train–test leakage, the corpus was partitioned into a stratified hold-out split of 24 training and 6 unseen test reports, and all reported metrics were computed on the test set. Leakage prevention was applied to LoRA fine-tuning only; RAG was allowed to retrieve the target report at inference, consistent with its intended use. Both scenarios were compared under matched generation budgets and identical decoding. Each scenario was applied in a three-stage pipeline: ESG-report summarization, a SWOT analysis derived from the summary, and GRI-aligned solution recommendations for the identified weaknesses and threats. Outputs were assessed on four quality metrics (Faithfulness, Factual Consistency, Answer Correctness, and Answer Relevancy), together with operational measures of token consumption and processing time.
On the held-out test set, the LoRA-based scenario obtained higher mean scores than RAG across all four quality metrics. However, statistically significant differences were observed in only 4 of the 12 task–metric comparisons reported in Table 6, namely Summary-Faithfulness, Summary-Answer Correctness, SWOT-Answer Correctness, and GRI-based Recommendation-Answer Relevancy. The remaining comparisons did not reach statistical significance, although several exhibited moderate to large effect sizes. These findings suggest that the performance advantage of LoRA was task-dependent rather than uniformly distributed across all evaluation dimensions.
In terms of operational cost, token consumption was comparable between the two scenarios (RAG ≈ 2,420 vs LoRA ≈ 2,075 tokens per generation, a difference of ~14%), whereas RAG was substantially faster at inference (≈ 43 s vs ≈ 473 s per generation, roughly an order of magnitude). We note that this latency gap is partly an implementation choice (fp16 LoRA inference versus 4-bit RAG inference) and is expected to narrow under matched configurations; LoRA additionally incurs a one-time fine-tuning cost not reflected in per-generation time.
Within the scope of this study, the results suggest that LoRA fine-tuning yields outputs that are better aligned with the task instruction and more responsive to the query (Answer Correctness and Answer Relevancy), while the two approaches are comparable in source grounding (Faithfulness and Factual Consistency); RAG, in turn, remains attractive when traceability, rapid corpus updates, and low inference latency are prioritized. We therefore restrict our conclusion to the conditions of this study: 30 food-sector ESG reports, the Llama-3.1-8B-Instruct base model, and reference-free automatic evaluation on a held-out sample of n = 6, and emphasize that further validation on additional industries, base models, and expert-annotated reference outputs is needed before these findings can be generalized. Replicating the experiments on other sectors, with larger held-out samples or stratified k-fold cross-validation, and complementing the automatic metrics with human assessment by ESG/GRI auditors, are the most important directions we identify for future work.
6. Limitations
The findings of this study should be interpreted in light of several limitations, which we group into four categories: scope, evaluation methodology, implementation, and reproducibility.
Our experiments are restricted in scope. They cover a single industry, the food sector, using 30 ESG reports drawn from a mix of consumer-packaged-goods, dairy, meat and protein, food-service, B2B-ingredient, and beverage sub-segments. A single base model, Llama-3.1-8B-Instruct, is used throughout. The reports are predominantly in English, with a smaller subset of Turkish reports analyzed jointly rather than evaluated as separate language conditions. The conclusions, therefore, apply under the conditions of this study and should not be interpreted as a general claim about LoRA versus RAG across arbitrary domains. Whether the advantages observed for LoRA on certain metrics transfer to other regulated sectors (e.g., banking, energy, mining, healthcare), to other base models, to non-Latin-script languages, or to longer time-series of disclosures remains to be tested. We view replication across additional sectors and models as the most important direction for future work.
To prevent data leakage between LoRA training and evaluation, the corpus is partitioned into 24 training and 6 held-out test reports, stratified across six sub-segments. Per-company comparisons are therefore based on n = 6, which limits the statistical power and external validity of inferences regarding cross-company variation. Although several of the paired comparisons reach significance with large effect sizes, the majority do not under this small sample, and replication using stratified k-fold cross-validation, for which the framework already provides utilities (src/data_split.py), is required before strong generalization claims can be made.
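A stratified k-fold split of the kind proposed here can be sketched in plain Python. This is not the content of `src/data_split.py`, which is not reproduced in this paper. The function name, the sub-segment labels, and the assumption of five reports per sub-segment are all hypothetical and serve only to illustrate the idea.

```python
import random
from collections import defaultdict

def stratified_k_fold(labels, k, seed=0):
    """Yield (train_idx, test_idx) pairs with each label spread evenly over k folds."""
    by_label = defaultdict(list)
    for idx, label in enumerate(labels):
        by_label[label].append(idx)
    rng = random.Random(seed)
    folds = [[] for _ in range(k)]
    for label in sorted(by_label):
        members = by_label[label]
        rng.shuffle(members)
        for position, idx in enumerate(members):
            folds[position % k].append(idx)  # deal members out round-robin
    for i in range(k):
        test = sorted(folds[i])
        train = sorted(idx for j, fold in enumerate(folds) if j != i for idx in fold)
        yield train, test

# ASSUMPTION: 30 reports, six sub-segments, five reports each (hypothetical labels).
SEGMENTS = ["cpg", "dairy", "meat_protein", "food_service", "b2b_ingredient", "beverage"]
labels = [segment for segment in SEGMENTS for _ in range(5)]

splits = list(stratified_k_fold(labels, k=5))
```

Under these assumptions each of the five folds holds out six reports, one per sub-segment, and trains on the remaining 24, so every report is evaluated exactly once. The price is one LoRA fine-tuning run per fold.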
The four performance metrics (faithfulness, factual consistency, answer correctness, and answer relevancy) are automated, reference-free measures derived primarily from multilingual sentence-embedding similarity (paraphrase-multilingual-mpnet-base-v2), supplemented by lightweight task-specific heuristic checks where appropriate. While these metrics improve robustness to paraphrasing and surface-form variation, they approximate rather than replace human or expert assessment; ESG-domain experts may legitimately disagree with automated scores, particularly for nuanced GRI-compliance judgments. A complementary human evaluation, ideally involving certified sustainability auditors, would strengthen the conclusions.
The LoRA fine-tuning targets themselves are constructed through deterministic heuristic extraction procedures rather than gold-standard human annotation, meaning that the upper bound of LoRA performance is partly dependent on the quality of these heuristics; further gains may be achievable with curated, expert-written training data.
The latency comparison is not fully like-for-like: in the current implementation, RAG inference uses 4-bit quantization while LoRA inference uses fp16, contributing to the observed latency differences between the two approaches. Using matched quantization settings for both pipelines could potentially reduce part of this gap; however, this configuration was not evaluated in the present study and is left for future engineering work.
Results are reported from a single random seed and a single training run; run-to-run variance arising from optimizer stochasticity was not quantified. Hyperparameters were not exhaustively tuned: LoRA rank (r = 16), α, dropout, learning rate, and the RAG retrieval depth (top-k = 6), chunk size (800 words), and embedding model (all-MiniLM-L6-v2) were selected from common defaults. Alternative configurations, including hybrid BM25+dense retrieval, rerankers, larger or domain-specific embedding models, and full (rather than parameter-efficient) fine-tuning, were not evaluated.
Document parsing relies on pypdf text extraction; tabular content, figures, multi-column layouts, and scanned pages are not OCR-processed and may be partially lost, which can disadvantage scenarios that depend heavily on numerical or structured content.
We compare RAG and LoRA but do not include a zero-shot Llama baseline, a fully fine-tuned baseline, or larger reference models (e.g., GPT-4-class systems) due to compute and licensing constraints. The task suite is likewise limited to three ESG analyses (summarization, SWOT generation, and GRI-aligned recommendation generation); other practitioner-relevant tasks such as materiality assessment, quantitative KPI extraction, year-over-year change detection, and supply-chain risk extraction remain outside the scope of this study.
The base model is hosted in a gated Hugging Face repository requiring authentication, and several licensing constraints apply to its use. The GRI standard documents used as the RAG knowledge base and as part of the LoRA training corpus are subject to copyright and therefore cannot be redistributed. Accordingly, we release code, configurations, processed-derived statistics, and per-document metric outputs, but not the original source PDFs. Researchers seeking to reproduce the pipeline must independently obtain the GRI documents and, where applicable, the ESG reports.
Taken together, these limitations suggest that the reported findings should be interpreted as evidence of the potential of LoRA fine-tuning for specific ESG-analysis dimensions under the experimental configuration studied, rather than as a universal claim of superiority over RAG across all settings. Future work addressing additional sectors, multiple base models, expert-validated evaluation protocols, hybrid RAG-LoRA architectures, and matched latency configurations will be necessary to establish the broader applicability of these findings.
Conceptualization, B.Ö. and A.H.I.; methodology, B.Ö. and A.H.I.; software, B.Ö.; validation, B.Ö. and A.H.I.; formal analysis, B.Ö.; investigation, B.Ö.; resources, B.Ö.; data curation, B.Ö.; writing—original draft preparation, B.Ö.; writing—review and editing, B.Ö.; visualization, B.Ö.; supervision, A.H.I.; project administration, B.Ö. All authors have read and agreed to the published version of the manuscript.
The data used to support the research findings are available from the corresponding author upon request.
The authors declare no conflict of interest.
During the preparation of this work, the authors utilized generative AI for minor language editing. Afterward, they reviewed and edited the content as necessary and took full responsibility for the publication’s content.
