ReguTourRAG: A Multi-Algorithm Collaborative Retrieval Framework for Tourism Regulations and Standards Documents



Open Access

Research article


Zhanhong Wu¹, Xiaoyong Wang¹, Jun Zhang²*, Qian He³

¹ School of Management Engineering, Capital University of Economics and Business, 100070 Beijing, China

² School of Artificial Intelligence, Capital University of Economics and Business, 100070 Beijing, China

³ Department of Safety and Security, Capital University of Economics and Business, 100070 Beijing, China

Information Dynamics and Applications | Volume 5, Issue 1, 2026 | Pages 27-44

https://doi.org/10.56578/ida050103

Received: 02-14-2026, Revised: 03-10-2026, Accepted: 03-23-2026, Available online: 03-28-2026


Abstract:

Effective tourism planning, scenic-area evaluation, and regulatory supervision depend on the accurate interpretation of extensive collections of tourism-related laws, administrative regulations, technical standards, and local normative documents. However, these documents are characterized by heterogeneous structures, frequent revisions, and complex cross-document dependencies, which limit the effectiveness of conventional keyword-based retrieval approaches and increase the risk of unsupported or unverifiable outputs generated by large language models. To address these challenges, a retrieval-augmented generation framework, termed ReguTourRAG, was proposed for intelligent question answering and knowledge access within tourism regulatory and standards corpora. A two-stage retrieval architecture was adopted. In the first stage, broad hybrid recall was performed through the collaborative integration of Best Matching 25 (BM25) lexical retrieval, Elastic Learned Sparse EncodeR (ELSER)-based sparse semantic retrieval, and Hierarchical Navigable Small World (HNSW)-based dense vector retrieval. In the second stage, candidate documents were refined through a cross-encoder reranking model, whereby high-value evidence was prioritized before response generation. Through the explicit separation of coverage-oriented recall and precision-oriented reranking, the traceability, completeness, and reliability of generated responses were enhanced for regulation-driven tourism management tasks. The proposed framework was evaluated using a corpus comprising 970 tourism regulatory and standards documents. Experimental results demonstrated consistent improvements over representative single-strategy retrieval-augmented generation baselines across multiple retrieval and generation metrics, including mean reciprocal rank, normalized discounted cumulative gain, accuracy, Recall-Oriented Understudy for Gisting Evaluation-Longest Common Subsequence (ROUGE-L), and BERTScore. 
The observed gains indicate that the collaborative utilization of lexical, sparse semantic, and dense retrieval signals, together with cross-encoder evidence refinement, provides substantial advantages for regulation-intensive domains in which precise legal terminology, semantic paraphrasing, and cross-document reasoning must be simultaneously accommodated. These findings suggest that ReguTourRAG offers a robust and scalable foundation for regulatory decision support, policy interpretation, compliance assessment, and intelligent knowledge services in tourism governance environments.

Keywords: Multi-algorithm collaboration, Retrieval-augmented generation, Tourism regulations and standards, Document intelligence, Hybrid retrieval, Reranking

1. Introduction

Tourism has become a strategic sector for consumption expansion, regional development, and cultural exchange. According to the Ministry of Culture and Tourism, China recorded 5.615 billion domestic tourist trips and CNY 5.75 trillion in domestic tourism expenditure in 2024 [1]. At the policy level, the State Council’s 14th Five-Year Plan for Tourism Development emphasizes digital transformation, smart scenic-area services, and high-quality tourism governance [2]. These policy and market trends make tourism management increasingly dependent on timely, accurate, and traceable interpretation of regulatory documents. From a legal and administrative perspective, scenic-area rating, rectification, safety supervision, accessibility services, and emergency management involve not only operational guidelines but also administrative licensing, administrative penalties, public safety obligations, and technical standards. The regulatory corpus is multi-level, decentralized, and frequently updated. Manual retrieval therefore raises compliance costs and increases the risk of misapplied provisions, incomplete evidence, and inconsistent rectification plans. Building an intelligent retrieval and question-answering mechanism for tourism regulations is consequently not only a technical improvement but also an institutional safeguard for lawful administration and compliant scenic-area operation.

The rise of smart tourism has created new opportunities for digital transformation in the sector. With the continued progress of big data and artificial intelligence, natural language understanding and natural language generation provide critical support for the efficient management and production of text-based knowledge. Representative large language model families, such as Generative Pre-trained Transformer (GPT)-style models [3], Qwen [4], and Gemini [5], can process large-scale text, produce structured summaries, and answer complex queries. However, domain deployment exposes a persistent gap between linguistic fluency and evidential reliability. In knowledge-intensive settings, model parameters alone cannot guarantee that an answer is current, jurisdiction-specific, or grounded in the applicable clause. Retrieval-augmented generation addresses this gap by retrieving external evidence before generation, allowing the model to align its response with a curated knowledge base. Retrieval-augmented generation has been adopted in domains such as medicine and tourism recommendation [6], [7], and the original formulation by Lewis et al. [8] demonstrated its value for knowledge-intensive natural language processing tasks. For tourism administration, retrieval-augmented generation can support policy interpretation, safety-standard explanation, rectification-plan drafting, and cross-document evidence comparison, provided that retrieval quality is high enough to expose the relevant legal basis.

Despite this promise, tourism regulatory applications present several adaptation challenges. First, domain terminology is dense and often appears in fixed legal expressions, while user questions may be phrased in operational language. This mismatch weakens general-purpose semantic embeddings and can lead to missing evidence [9]. Second, regulatory answers are high-stakes, and unsupported or outdated statements can mislead planning, supervision, or public-service decisions. Therefore, hallucination control must be treated as a core system requirement rather than a peripheral generation issue [10]. Third, tourism standards and local normative documents change over time; models trained once cannot dynamically reflect new provisions, and stale knowledge directly undermines answer reliability [11]. Fourth, many practical questions require cross-document reasoning, such as linking scenic-area quality rating standards with accessibility, emergency response, complaint handling, and environmental protection requirements. A single retrieval strategy is usually insufficient for this setting. Lexical retrieval preserves exact clause terms, sparse semantic retrieval improves synonym and paraphrase matching, dense retrieval captures broader semantic intent, and reranking is needed to suppress noisy candidates before generation. These observations motivate a retrieval framework that explicitly combines complementary algorithms rather than relying on a single index.

Compared with generic hybrid retrieval-augmented generation architectures, ReguTourRAG is distinguished by its regulation-oriented retrieval design and tourism-management application logic. First, the framework treats source provenance, issuing authority, document type, validity status, and clause-level context as core retrieval metadata rather than as auxiliary annotations. This is important for tourism governance because the same operational question may involve national laws, ministerial rules, local regulations, and technical standards with different legal effects. Second, the framework deliberately separates coverage-oriented recall from precision-oriented reranking: Best Matching 25 (BM25) protects exact legal terminology and standard codes, Elastic Learned Sparse EncodeR (ELSER) expands sparse semantic matches for regulatory paraphrases, Hierarchical Navigable Small World (HNSW) captures operational intent, and the cross-encoder reranker suppresses weak evidence before generation. Third, the generation stage is constrained by evidence-grounding and insufficient-evidence behavior, which is essential in regulation-intensive tourism scenarios such as scenic-area rating, safety rectification, accessibility service provision, and complaint handling. These features make ReguTourRAG not merely a combination of existing retrieval modules, but a domain-specific architecture for traceable and auditable tourism regulatory question answering.

To address these challenges, this study introduces the ReguTourRAG framework. The framework integrates complementary retrieval algorithms and a domain knowledge base constructed from tourism laws, administrative regulations, technical standards, local rules, and industry specifications. Its objective is to provide scenic-area planning and management with high-precision, traceable, and operationally useful regulatory information support. The main contributions are as follows:

(i) A two-stage hybrid retrieval framework is designed and implemented. In the first stage, ReguTourRAG combines BM25 lexical retrieval, ELSER learning-based sparse semantic retrieval, and HNSW-based dense semantic retrieval for parallel recall. In the second stage, a cross-encoder reranking module selects high-value evidence for the generation model, improving the balance between recall coverage and contextual precision.

(ii) The study empirically evaluates the complementarity of sparse, dense, and reranking strategies across multiple metrics, including BERTScore, mean reciprocal rank, normalized discounted cumulative gain, accuracy, and Recall-Oriented Understudy for Gisting Evaluation-Longest Common Subsequence (ROUGE-L). The results provide evidence that hybrid retrieval is better suited than single-strategy retrieval for regulation-intensive corpora.

(iii) The study develops and validates a domain-specific solution for tourism regulatory and standards documents. The resulting framework offers a transferable design pattern for other vertical domains where answers must be both semantically relevant and evidence-grounded.

2. Related Work

Retrieval-augmented generation is a paradigm that integrates information retrieval with generative modeling to improve answer accuracy, freshness, and reliability. Conventional retrieval systems can locate relevant documents but do not synthesize task-specific answers, whereas large language models generate fluent text but are constrained by parametric knowledge and may not provide verifiable evidence. Retrieval-augmented generation bridges this gap through a retrieve-then-generate pipeline. As illustrated in Figure 1, a typical retrieval-augmented generation workflow consists of indexing, retrieval, and generation. During indexing, raw documents are transformed into searchable representations. During retrieval, relevant evidence is selected for a user query. During generation, the query and retrieved passages are combined to produce the final answer. This mechanism injects external knowledge into the model context and is effective in knowledge-intensive scenarios [12], [13]. Recent domain studies in clinical and educational settings further illustrate the broader movement toward evidence-grounded large language model applications [14], [15], [16].

Figure 1. Flowchart of the retrieval-augmented generation architecture

Note: LLMs = large language models.

2.1 Indexing Stage

Index construction transforms raw text into structured, searchable representations, forming the foundation for efficient and accurate evidence retrieval in retrieval-augmented generation pipelines [13]. The process comprises data preprocessing, text chunking, vectorization, and index storage. During data preprocessing, heterogeneous sources such as PDF, Word, and web pages are converted to clean text, normalized, and associated with metadata such as issuing authority, document type, release date, and validity status. For regulatory corpora, preserving source provenance is as important as removing noise. In terms of text chunking, documents are segmented into retrievable units that remain semantically coherent at the clause or section level, reducing the risk that a retrieved fragment loses its legal context. In terms of vectorization, chunks are embedded into the vector space using models such as BAAI General Embedding (BGE) [17] or task-specific embedding models. Compared with keyword-only representations, vector representations better capture semantic relations between terms and concepts. As for index storage, chunk-vector pairs and metadata are organized in retrieval indexes. Systems such as Facebook AI Similarity Search (FAISS) [18] and search engines with sparse-vector support enable scalable approximate nearest neighbor or hybrid search, providing the data access layer required for retrieval augmentation.
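The clause-level chunking and metadata attachment described above can be sketched as follows. This is a minimal illustration, not the pipeline used in the study: the `Article <number>` clause marker, the `Chunk` class, and the metadata keys are all assumptions chosen for the example, and real regulatory texts would need a marker pattern suited to their own layout.

```python
import re
from dataclasses import dataclass, field
from typing import Dict, List

# Hypothetical clause marker: a line that begins with "Article <number>".
ARTICLE_START = re.compile(r"^(?=Article \d+)", re.MULTILINE)


@dataclass
class Chunk:
    text: str
    metadata: Dict[str, str] = field(default_factory=dict)


def chunk_by_article(text: str, metadata: Dict[str, str]) -> List[Chunk]:
    """Split a document at clause boundaries and copy provenance to each chunk."""
    # Splitting on a zero-width match keeps each "Article N" heading with its body.
    parts = [p.strip() for p in ARTICLE_START.split(text) if p.strip()]
    return [
        Chunk(text=p, metadata={**metadata, "chunk_index": str(i)})
        for i, p in enumerate(parts)
    ]


if __name__ == "__main__":
    doc = "Preamble\nArticle 1 Scope.\nArticle 2 Terms."
    for chunk in chunk_by_article(doc, {"doc_type": "standard"}):
        print(chunk.metadata["chunk_index"], chunk.text)
```

Copying the document-level metadata onto every chunk means that a retrieved fragment can still be traced to its issuing authority and validity status after it has been separated from the source file.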

2.2 Retrieval Stage

The retrieval stage identifies and extracts the most relevant passages from the index to expand the model’s context for generation. Effectiveness and efficiency at this stage are primary determinants of factual accuracy and contextual relevance in the final response. Retrieval approaches are commonly grouped into sparse retrieval, dense retrieval, and hybrid retrieval methods.

2.2.1 Sparse retrieval

Sparse retrieval relies on lexical matching and sparse vector representations to locate relevant documents. BM25 remains the most widely used baseline, estimating relevance with term frequency-inverse document frequency-style term weighting and a probabilistic ranking function [19]. Its advantages are efficiency, interpretability, and strong performance when queries contain exact regulatory terms. However, legal and standards documents often use specialized expressions, abbreviations, and paraphrases. Learned sparse retrieval methods address part of this limitation by expanding the lexical matching space with weighted semantic terms while retaining the transparency and index efficiency of sparse representations. In tourism regulation retrieval, this property is valuable because exact names such as scenic-area grades, standard codes, and administrative measures must remain recoverable, while semantically related expressions also need to be matched.

2.2.2 Dense retrieval

In contrast to sparse retrieval, dense retrieval maps both queries and documents into a shared high-dimensional vector space. Dense passage retrieval is a representative dual-encoder approach in which one encoder processes the query and another encodes the passage; relevance is determined by inner product or cosine similarity [9]. Later systems extend dense retrieval in different directions: ColBERTv2 introduces late interaction for token-level similarity while preserving efficiency [20], [21]; the retrieval-enhanced Transformer demonstrates the value of retrieval from very large corpora for language modeling [22]; and Contriever uses contrastive learning to improve unsupervised dense retrieval [23]. Dense retrieval is especially useful for operational questions that do not share exact words with the regulatory text. Nevertheless, dense methods require embedding computation, vector storage, and approximate nearest neighbor search, which increase memory and update costs compared with inverted indexes.

These costs grow with corpus size. Sparse methods only update inverted-index postings when documents change, whereas dense retrieval must recompute embeddings whenever documents or the encoder are updated and must run similarity computations in a high-dimensional space. Memory consumption and processing overhead therefore both rise, making dense approaches more resource-intensive than their sparse counterparts, particularly in large-scale deployments.

2.2.3 Hybrid retrieval

Hybrid retrieval integrates the strengths of sparse and dense approaches. Sparse retrieval captures exact terminology and interpretable term-level evidence, while dense retrieval detects implicit semantic associations. The two paradigms can be combined through parallel recall, score fusion, or cascaded reranking. Neural sparse retrieval methods such as the sparse lexical and expansion model show that sparse representations can incorporate semantic expansion while remaining compatible with inverted-index retrieval [24]. In retrieval-augmented generation systems, hybrid retrieval is particularly attractive because generation quality depends not only on whether some relevant passage is retrieved, but also on whether the candidate set covers the diverse evidence required by the question. For tourism regulatory documents, this means retrieving exact standard clauses, semantically related operational guidance, and cross-document supporting provisions in the same candidate pool.

2.3 Generation Stage

Within the retrieval-augmented generation framework, the generation phase is primarily responsible for integrating retrieved external knowledge with large language models to synthesize the final response. The principal objective of this phase is to ensure that the generated outputs maintain accuracy, coherence, and contextual relevance. The overall effectiveness of retrieval-augmented generation is highly contingent on the efficient integration of retrieved evidence into the generation process. A baseline approach is the concatenation strategy, where retrieved passages are directly appended to the user query before being processed by the large language model. While straightforward and easy to implement, this method often results in information redundancy and can exceed the large language model’s context window limitations when a large number of documents are retrieved.

To improve evidence integration, researchers have proposed knowledge aggregation strategies that filter, compress, and rerank retrieved passages before they are passed to the generator. Representative techniques include self-reflective retrieval and critique [25], context compression [26], and retrieve-rerank-generate architectures [27]. These strategies reduce irrelevant context and help the generation model focus on evidence that is both relevant and reliable. Another notable technique is fusion-in-decoder. In this model, each retrieved document is independently encoded, enabling the large language model to dynamically weigh their contributions during response generation [12]. Unlike the simple concatenation method, fusion-in-decoder allows the model to evaluate multiple sources in parallel, thereby producing outputs that are more semantically coherent and information-rich.

Adaptive fusion methods further adjust the influence of retrieved knowledge during decoding or pretraining. The retrieval-augmented language model (REALM), for example, integrates retrieval into language-model pretraining [28]. More generally, adaptive retrieval-augmented generation methods attempt to determine when retrieval is necessary, how much evidence should be used, and how retrieved information should be weighted. For regulatory applications, such adaptivity must be constrained by provenance and citation requirements: the system should prefer a concise, well-supported answer over a fluent answer that cannot be traced to authoritative documents.

3. ReguTourRAG Framework

Retrieval-augmented generation has evolved through several stages. The initial naive form employed a simple pipeline but suffered from limitations such as irrelevant retrievals and restricted context windows. To address these issues, the research community introduced advanced retrieval-augmented generation, which enhances performance through pre-retrieval processes (e.g., data cleaning, index optimization, and improved chunking strategies) and post-retrieval techniques (e.g., reranking and context compression). More recently, modular retrieval-augmented generation has been proposed, decomposing the workflow into pluggable and optimizable modules. This design enables the integration of advanced strategies such as query rewriting, multi-retrieval fusion, and iterative retrieval, providing a flexible foundation for constructing more powerful retrieval-augmented generation systems.

ReguTourRAG is a two-stage retrieval-augmented generation framework designed for complex regulatory documents in the tourism sector. Its core design follows a broad-to-narrow strategy. In the first stage, complementary retrieval methods are executed in parallel to maximize coverage and recall. In the second stage, a reranking model refines the candidate set, improving contextual precision and the signal-to-noise ratio before generation. Through experiments with different retrieval combinations, the framework integrates BM25, ELSER, and HNSW to handle exact regulatory terminology, sparse semantic expansion, and dense semantic similarity. This section presents the detailed implementation of the ReguTourRAG framework, including hybrid recall, reranking, and generation mechanisms. To establish clarity, the notations used throughout the framework are first defined in Table 1.

Table 1. Notation used in the proposed framework

Symbol	Description
$Q$	User-submitted text query
$D$	Document collection, including structured and unstructured texts such as scenic area policies and tourist guidelines
$R_{\mathrm{BM25}}$	Retrieved document set obtained via the Best Matching 25 (BM25) retrieval
$R_{\mathrm{ELSER}}$	Retrieved document set obtained via the Elastic Learned Sparse Encoder (ELSER) retrieval
$R_{\mathrm{HNSW}}$	Retrieved document set obtained via the Hierarchical Navigable Small World (HNSW) retrieval
$R_{\mathrm{merge}}$	Merged document set produced during the hybrid retrieval stage
$S_{\mathrm{rerank}}$	Relevance scores generated by the reranking model
$D_{\mathrm{final}}$	Top-$k$ documents selected after reranking
$A$	Final answer generated by the large language model

3.1 The First-Stage Retrieval

In the first retrieval stage, three recall strategies are executed in parallel to construct the initial candidate set. BM25 performs efficient lexical screening and retrieves fragments with strong term overlap with the user query. This is essential for clauses containing exact legal names, standard codes, and administrative terms. As a classical keyword-based retrieval method, BM25 estimates relevance by computing a matching score between the query and document terms. The core BM25 equation is expressed as follows:

$\operatorname{Score}_{\mathrm{BM25}}(Q, d)=\sum_{t \in Q} \operatorname{IDF}(t) \cdot \frac{f(t, d) \cdot\left(k_1+1\right)}{f(t, d)+k_1 \cdot\left(1-b+b \cdot \frac{|d|}{\text { avgdl }}\right)}$

(1)

where, $Q$ denotes the user query, which consists of a set of terms; $d$ represents a candidate document or chunk in the collection; and $t$ denotes an individual query term. $k_1$ controls the effect of term frequency on the relevance score, and $b$ controls document-length normalization. $f(t, d)$ denotes the frequency of term $t$ in document $d$; $|d|$ denotes the length of document $d$; $\text{avgdl}$ is the average document length of the corpus; and $\operatorname{IDF}(t)$ denotes the inverse document frequency of term $t$, reducing the weight of terms that occur frequently across the corpus. The inverse document frequency is computed as follows:

$\operatorname{IDF}(t)=\log \frac{N-n(t)+0.5}{n(t)+0.5}+1$

(2)

where, $N$ denotes the total number of documents in the corpus, and $n(t)$ represents the number of documents containing term $t$. BM25 adjusts the relevance score by incorporating term frequency $f(t, d)$ and document length, ensuring proper normalization when handling documents of varying lengths. Using Eq. (1), the per-term contributions are summed into a score for each document, and the Top-N documents ranked by descending score are selected to form the set $R_{\mathrm{BM25}}$.
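As a concrete illustration of Eqs. (1) and (2), the scoring and Top-N selection can be sketched in a few lines of Python. The values $k_1 = 1.2$ and $b = 0.75$ are common defaults rather than settings reported in this study, the function names are hypothetical, and the `+1` in Eq. (2) is placed inside the logarithm here, following the Lucene-style variant that keeps the inverse document frequency non-negative; that reading is an assumption.

```python
import math
from collections import Counter
from typing import List, Sequence


def bm25_scores(
    query: Sequence[str],
    docs: Sequence[Sequence[str]],
    k1: float = 1.2,  # assumed default, not reported in the study
    b: float = 0.75,  # assumed default, not reported in the study
) -> List[float]:
    """Return the BM25 score of every tokenized document for the query."""
    n_docs = len(docs)
    avgdl = sum(len(d) for d in docs) / n_docs
    # n(t): number of documents that contain term t at least once.
    doc_freq = Counter(t for d in docs for t in set(d))
    scores = []
    for d in docs:
        tf = Counter(d)
        # Length-normalization part of the denominator in Eq. (1).
        norm = k1 * (1 - b + b * len(d) / avgdl)
        score = 0.0
        for t in query:
            n_t = doc_freq[t]
            idf = math.log((n_docs - n_t + 0.5) / (n_t + 0.5) + 1.0)
            score += idf * tf[t] * (k1 + 1) / (tf[t] + norm)
        scores.append(score)
    return scores


def top_n(scores: Sequence[float], n: int) -> List[int]:
    """Indices of the n highest-scoring documents, best first."""
    return sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)[:n]


if __name__ == "__main__":
    corpus = [
        ["scenic", "area", "rating"],
        ["ticket", "refund", "policy"],
        ["scenic", "area", "safety"],
    ]
    print(top_n(bm25_scores(["rating"], corpus), 2))
```

A document that contains none of the query terms receives a score of exactly zero, which is why lexical recall alone misses paraphrased clauses.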

Sparse semantic retrieval extends BM25 by using sparse neural representations to capture deeper semantic matching while preserving index efficiency. This study employs ELSER, a sparse retrieval model provided by Elasticsearch for semantic search. ELSER transforms a query into a sparse weighted representation, allowing the system to retrieve documents that are semantically related even when the surface wording differs. For a query $Q$, the sparse representation is denoted as follows:

$v_Q=\operatorname{ReLU}\left(W_2 \cdot \operatorname{GELU}\left(W_1 \cdot \operatorname{BERT}(Q)\right)\right)$

(3)

Each document $d$ in the collection $D$ is encoded offline into the same sparse representation space. During retrieval, the index stores only non-zero dimensions and corresponding weights, enabling efficient sparse-vector search. Similarity is computed through an inner product between the query and document sparse vectors as follows:

$\operatorname{Sim}_{\mathrm{ELSER}}(Q,d)=\sum_{i \in \operatorname{NonZero}(v_Q)} v_Q[i] \cdot v_d[i]$

(4)

This design enables real-time retrieval over large document collections, with the Top-M documents selected to form the set $R_{\mathrm{ELSER}}$. The value of the sparse encoder lies in its ability to capture long-tail terms, synonyms, and domain-specific expressions while preserving the interpretability and efficiency of sparse retrieval.
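The inner product in Eq. (4) only touches dimensions that are non-zero in the query vector, which is what makes sparse semantic retrieval cheap. A minimal sketch follows, with sparse vectors stored as token-to-weight dictionaries. The tokens and weights are invented for illustration and are not actual ELSER output; `sparse_dot` is a hypothetical helper name.

```python
from typing import Dict


def sparse_dot(v_q: Dict[str, float], v_d: Dict[str, float]) -> float:
    """Inner product over the non-zero dimensions of the query vector (Eq. 4)."""
    return sum(weight * v_d.get(token, 0.0) for token, weight in v_q.items())


if __name__ == "__main__":
    # Illustrative weights only: the query is expanded with a related term.
    query_vec = {"ticket": 1.0, "reservation": 0.5, "booking": 0.5}
    doc_vec = {"ticket": 2.0, "booking": 1.0, "purchase": 1.0}
    print(sparse_dot(query_vec, doc_vec))
```

In this toy example the document never mentions "reservation", yet it still scores through the expanded term "booking", which is the behavior the sparse encoder is meant to provide.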

Dense retrieval is performed through a dual-encoder architecture. The query encoder $E_Q$ and document encoder $E_D$ are based on the fine-tuned pretrained model Zpoint-large-embedding-zh (ZH). The vector representations of query $Q$ and document $d$ are generated as follows:

$q = E_Q(Q)_{[\mathrm{CLS}]}$

(5)

$d = E_D(d)_{[\mathrm{CLS}]}$

(6)

The pooling operation uses the hidden state of the [CLS] token as the sentence representation. To accelerate retrieval, the HNSW algorithm is adopted to construct an approximate nearest neighbor index. HNSW organizes the vector space through a multilayer graph structure and has been shown to provide efficient and robust approximate nearest neighbor search [20]. During graph construction, each node, representing a text vector, is probabilistically assigned to different hierarchical levels $L$, with the distribution following an exponential decay rule:

$P(L = l)=e^{-\lambda l}$

(7)

where, the parameter $\lambda$ controls the hierarchical distribution, and $L_{\max }$ denotes the maximum number of layers. Most nodes exist only in the lower layers, while a small subset is promoted to higher layers, forming long-range links in the small-world graph. During construction, HNSW connects each node with selected neighbors within its layer, using cosine similarity as the metric for measuring vector similarity. The computation is defined as follows:

$\operatorname{Sim}_{\mathrm{DPR}}(Q, d)=\frac{q \cdot d}{\|q\|\,\|d\|}$

(8)

During query processing, HNSW starts from the top layer and greedily traverses toward neighbors closer to the query vector until it reaches the bottom layer. At the bottom layer, a priority queue is used for local best-first expansion, returning the top-ranked nearest results. In the context of tourism regulatory documents, this structure helps retrieve semantically related evidence for questions such as “AAAAA scenic-area environmental remediation standards” or “emergency response mechanisms for tourism reception facilities,” even when the query wording does not exactly match the original clause.
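Two building blocks of this procedure, the exponentially decaying layer assignment of Eq. (7) and the cosine similarity of Eq. (8), can be sketched as follows. The sketch reads Eq. (7) as a tail probability, $P(L \geq l) = e^{-\lambda l}$, which corresponds to the usual inverse-transform rule for HNSW level sampling; that reading and the helper names are assumptions, and graph construction and search are omitted.

```python
import math
import random
from typing import Sequence


def sample_level(lam: float, rng: random.Random) -> int:
    """Draw a layer index whose tail probability decays as exp(-lam * l)."""
    u = 1.0 - rng.random()  # uniform on (0, 1], so log(u) is always defined
    return int(-math.log(u) / lam)


def cosine(q: Sequence[float], d: Sequence[float]) -> float:
    """Cosine similarity between a query vector and a document vector (Eq. 8)."""
    dot = sum(a * b for a, b in zip(q, d))
    norm_q = math.sqrt(sum(a * a for a in q))
    norm_d = math.sqrt(sum(b * b for b in d))
    return dot / (norm_q * norm_d)


if __name__ == "__main__":
    rng = random.Random(0)
    levels = [sample_level(1.0, rng) for _ in range(10_000)]
    share_bottom = sum(1 for lv in levels if lv == 0) / len(levels)
    print(f"share of nodes that stay in layer 0: {share_bottom:.2f}")
    print(cosine([1.0, 0.0], [1.0, 1.0]))
```

With $\lambda = 1$, roughly $1 - e^{-1}$ of the nodes never leave the bottom layer, which matches the description that only a small subset is promoted to form long-range links.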

The recall sets obtained from BM25, ELSER, and HNSW are merged into a unified candidate set $R_{\mathrm{merge}}$. The advantage of this hybrid retrieval design lies in complementarity. BM25 excels at precise keyword matching, such as retrieving exact names such as “Jiuzhaigou Scenic Area.” ELSER captures sparse semantic expansion and phrase variation, such as aligning “ticket reservation” with related ticket-purchase expressions. Dense retrieval captures broader contextual semantics, such as connecting “senior citizen discount policy” with age thresholds, identity verification, and preferential-service clauses. The merged candidate set therefore improves recall while reducing dependence on any single retrieval paradigm.
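The text does not specify how the three recall sets are merged, so the sketch below assumes the simplest option: a de-duplicated union that remembers which retrievers returned each passage. The function name and the document identifiers are hypothetical.

```python
from typing import Dict, List, Sequence, Tuple


def merge_candidates(*recall_sets: Tuple[str, Sequence[str]]) -> Dict[str, List[str]]:
    """Union of recall sets, mapping each document id to the retrievers that found it."""
    merged: Dict[str, List[str]] = {}
    for source, doc_ids in recall_sets:
        for doc_id in doc_ids:
            # setdefault keeps first-seen order and removes duplicates.
            merged.setdefault(doc_id, []).append(source)
    return merged


if __name__ == "__main__":
    r_merge = merge_candidates(
        ("bm25", ["clause_12", "clause_07"]),
        ("elser", ["clause_07", "clause_31"]),
        ("hnsw", ["clause_31", "clause_44"]),
    )
    for doc_id, sources in r_merge.items():
        print(doc_id, sources)
```

Keeping the list of contributing retrievers is optional, but it makes the complementarity argument observable: passages found by only one method are exactly the ones a single-strategy system would have missed.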

3.2 The Second-Stage Reranking

In the second stage, the retrieved documents are reranked to refine evidence selection. This study employs BGE-reranker-large, which assigns fine-grained relevance scores to candidate documents. The model is implemented as a cross-encoder: the query $Q$ and candidate document $d$ are concatenated and processed together, allowing token-level interactions between the query and the candidate evidence. The input is represented as follows:

$\mathrm{Input} = [\mathrm{CLS}] \oplus Q \oplus [\mathrm{SEP}] \oplus d \oplus [\mathrm{SEP}]$

(9)

where, $\oplus$ denotes the text concatenation operation. The model is built upon the Bidirectional Encoder Representations from Transformers (BERT) architecture, which employs a self-attention mechanism to capture fine-grained interactions between the query and the document. Specifically, the computation in the $l$-th Transformer layer is expressed as follows:

$H_l = \mathrm{LayerNorm}\left(A_l + \mathrm{FFN}(A_l)\right)$

(10)

where, $A_l = \mathrm{MultiHeadAttention}(H_{l-1})$ represents the multi-head attention output, and $\mathrm{FFN}$ denotes the two-layer feed-forward network. The final relevance score is derived from the hidden state of the [CLS] token, denoted as $h_{\mathrm{CLS}}$:

$S_{\text {rerank }}(Q, d)=\sigma\left(\mathbf{w}^{\top} h_{\mathrm{CLS}}+b\right)$

(11)

where, $\sigma$ is the sigmoid function, and $\mathbf{w}$ and $b$ are trainable parameters. This design enables the model to evaluate the overall semantic relevance between the query and the document, rather than relying solely on vector-space distance.
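The scoring head in Eq. (11) and the subsequent Top-$k$ selection can be illustrated without loading the reranker itself. In the sketch below, the hidden state, weight vector, and bias are toy values standing in for what the cross-encoder would produce, and the function names are hypothetical.

```python
import math
from typing import Dict, List, Sequence


def sigmoid(z: float) -> float:
    """Numerically stable logistic function."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)


def rerank_score(h_cls: Sequence[float], w: Sequence[float], b: float) -> float:
    """Relevance score from the [CLS] hidden state, as in Eq. (11)."""
    return sigmoid(sum(wi * hi for wi, hi in zip(w, h_cls)) + b)


def select_top_k(scores: Dict[str, float], k: int) -> List[str]:
    """Document ids of the k highest-scoring candidates, best first."""
    return sorted(scores, key=scores.get, reverse=True)[:k]


if __name__ == "__main__":
    w, b = [0.8, -0.3], 0.1  # toy parameters, not trained values
    hidden = {"clause_07": [2.0, 0.5], "clause_12": [-1.0, 1.0], "clause_31": [0.5, 0.0]}
    scores = {doc_id: rerank_score(h, w, b) for doc_id, h in hidden.items()}
    print(select_top_k(scores, 2))
```

Because the score depends on a hidden state computed jointly over the query and the candidate, it has to be recomputed for every query-document pair, which is why the cross-encoder is applied only to the merged candidate set and not to the whole corpus.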

In the generation stage, the reranked document set $D_{\mathrm{final}}$ and the original query $Q$ are fed into a large language model, such as GPT-4, Enhanced Representation through Knowledge Integration (ERNIE), or Large Language Model Meta AI (LLaMA). The input is structured using a prompt template that instructs the model to answer only from retrieved evidence and to report insufficient evidence when the retrieved context is inadequate. The prompt template is defined as follows:

Prompt = f"""

You are a professional assistant for tourism regulation management. Please answer the question based on the following reference materials. If the information is insufficient, reply with “No relevant information found”:

<Reference Materials>

{context}

"""
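One way to fill this template from the reranked passages is sketched below. The numbered passage labels, the closing `</Reference Materials>` tag, and the trailing `Question:` line are assumptions added for the example; the template above shows only the `{context}` placeholder.

```python
from typing import Sequence


def build_prompt(question: str, passages: Sequence[str]) -> str:
    """Assemble the evidence-grounded prompt from the reranked passages."""
    # Number each passage so that an answer can point back to its source.
    context = "\n\n".join(f"[{i}] {p}" for i, p in enumerate(passages, start=1))
    return (
        "You are a professional assistant for tourism regulation management. "
        "Please answer the question based on the following reference materials. "
        'If the information is insufficient, reply with "No relevant information found":\n'
        "<Reference Materials>\n"
        f"{context}\n"
        "</Reference Materials>\n"  # closing tag is an assumption
        f"Question: {question}"  # question slot is an assumption
    )


if __name__ == "__main__":
    print(build_prompt(
        "What accessibility facilities must a scenic area provide?",
        ["Passage text from the first reranked clause.", "Passage text from the second."],
    ))
```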

The large language model employs a Transformer-based decoder to perform autoregressive generation, with its probability distribution computed as follows:

$P(y_t \mid y_{<t}, x) = \mathrm{Softmax}(W_o h_t)$

(12)

where, $x$ denotes the input prompt containing the query and the retrieved evidence, $y_{<t}$ denotes the tokens generated before step $t$, $h_t$ denotes the decoder hidden state at the $t$-th decoding step, and $W_o$ represents the output projection matrix. During generation, deterministic or low-temperature decoding is preferable for regulatory question answering, because the priority is factual consistency rather than creative variation. When diversity-oriented decoding such as nucleus sampling is used, it should be constrained by evidence-grounding instructions and post-generation verification.
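The effect of temperature on Eq. (12) can be seen in a small numeric sketch. The logits below are arbitrary values standing in for $W_o h_t$, and the function names are hypothetical; dividing the logits by a temperature below 1 sharpens the distribution, and greedy decoding is the deterministic limit.

```python
import math
from typing import List, Sequence


def softmax(logits: Sequence[float], temperature: float = 1.0) -> List[float]:
    """Softmax over logits scaled by 1/temperature, shifted for numerical stability."""
    scaled = [z / temperature for z in logits]
    peak = max(scaled)
    exps = [math.exp(z - peak) for z in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def greedy_token(logits: Sequence[float]) -> int:
    """Deterministic decoding: index of the highest-scoring token."""
    return max(range(len(logits)), key=lambda i: logits[i])


if __name__ == "__main__":
    logits = [1.0, 2.0, 3.0]  # arbitrary stand-in for W_o h_t
    print(softmax(logits))
    print(softmax(logits, temperature=0.1))
    print(greedy_token(logits))
```

At a temperature of 0.1 nearly all probability mass falls on the top token, so sampling behaves almost like greedy decoding, which is the regime recommended above for regulatory answers.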

3.3 ReguTourRAG Workflow

In the domain of tourism management, regulatory texts are heterogeneous in format, issuing authority, validity period, and terminology. This complexity often causes traditional information retrieval systems to return fragments that are lexically similar but legally incomplete. ReguTourRAG addresses this issue through a two-stage hybrid retrieval workflow. BM25, ELSER, and HNSW first provide broad candidate coverage; BGE-reranker-large then acts as a precision filter before generation. This design improves not only answer readability but also evidential traceability, because the final response can be linked to a smaller set of high-confidence source passages. The overall architecture of the proposed framework is illustrated in Figure 2.

Figure 2. Two-stage retrieval algorithm process

Note: BM25 = Best Matching 25; ELSER = Elastic Learned Sparse Encoder; HNSW = Hierarchical Navigable Small World; LLM = Large Language Model; $R_{\mathrm{BM25}}$, $R_{\mathrm{ELSER}}$, and $R_{\mathrm{HNSW}}$ denote the retrieved candidate sets produced by the three first-stage retrieval methods.

The proposed two-stage hybrid retrieval framework leverages the complementary strengths of multiple methods during the recall phase: BM25 ensures efficient keyword matching, ELSER captures long-tail sparse semantic information, and HNSW extends coverage to dense semantic similarity. Building on this candidate pool, the reranking module scores each candidate based on query-evidence interaction and selects the most relevant passages for generation. The framework therefore supplies the language model with higher-quality context and reduces the probability that irrelevant or weakly related passages influence the final answer. The pseudocode of the two-stage hybrid retrieval algorithm is presented in Algorithm 1, with the inference process summarized below.

Algorithm 1 Inference Process of the Two-stage Hybrid Retrieval Algorithm

Require: Document collection $D$

Input: User query $Q$

Output: Generated answer $A$

Hybrid Retrieval Phase

$R_{\mathrm{BM} 25} \leftarrow \mathrm{BM} 25 . \operatorname{retrieve}\left(Q, D, \operatorname{top}_n\right) \quad / / \mathrm{BM} 25$ keyword-based retrieval

$v \leftarrow$ ELSER.encode $(Q)$ // ELSER sparse vector encoding

$R_{\text {ELSER }} \leftarrow \operatorname{ELSER} . \operatorname{search}\left(v, D, \operatorname{top}_m\right)$

$q \leftarrow$ DenseEncoder.encode_query $(Q)$ // Dense query encoding

$R_{\text {HNSW }} \leftarrow$ HNSW.search $\left(q, D, \operatorname{top}_k\right)$ // HNSW approximate nearest neighbor search

$R_{\text {merge }} \leftarrow \operatorname{merge}\left(R_{\mathrm{BM} 25}, R_{\mathrm{ELSER}}, R_{\mathrm{HNSW}}\right) \quad / /$ merged results

Re-ranking Phase

$scores \leftarrow \{\}$

for each $d$ in $R_{\text {merge }}$ do

$input\_pair \leftarrow \mathrm{Concat}(Q, d)$ // Concatenate query with document

$scores[d] \leftarrow R_{\mathrm{BGE}}.\mathrm{predict}(input\_pair)$ // BGE-based relevance scoring

end for

$D_{\text {final }} \leftarrow$ sort_descending $($ scores $) \cdot \operatorname{top}(x) \quad / /$ Select the Top-$x$ documents ($x = 10$ in the experiments)

Generate Phase

$prompt \leftarrow \mathrm{format\_prompt}(Q, D_{\mathrm{final}})$ // Construct prompt template

$A \leftarrow M_{\mathrm{generate}}(\mathrm{prompt}, \mathrm{max\_tokens}=\mathrm{Num})$ // Generate answer using the LLM

return $A$

The detailed steps of the proposed algorithm are as follows:

(i) Hybrid Retrieval Phase

Step 1: Apply BM25 to match the user query $Q$ with the document collection $D$ and retrieve preliminary results based on term-level matching.

Step 2: Encode query $Q$ using ELSER and perform sparse semantic retrieval to supplement documents with related regulatory expressions.

Step 3: Encode query $Q$ with the dense encoder and perform HNSW approximate nearest neighbor search to retrieve semantically proximate documents.

Step 4: Merge and deduplicate the results of the three retrieval approaches to construct a unified candidate set.

(ii) Reranking and Generation Phase

Step 1: Traverse the candidate set, concatenate each candidate document $d$ with query $Q$, and form an input pair.

Step 2: Apply BGE-reranker-large to predict semantic relevance scores for each input pair.

Step 3: Rank the documents by score and select the top-ranked documents (Top-10 in the experiments) to construct $D_{\mathrm{final}}$.

Step 4: Construct a prompt template by combining $Q$ with $D_{\mathrm{final}}$ and explicit evidence-grounding instructions.

Step 5: Input the prompt into the large language model to generate the final answer $A$.
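The steps above can be sketched as follows. This is a minimal illustration under stated assumptions, not the implementation used in the experiments: the three retrievers are stand-in functions that return fixed chunk identifiers, `score_pair` is a token-overlap placeholder for BGE-reranker-large, and all names and texts are hypothetical.

```python
def merge_candidates(*result_lists):
    """Merge ranked lists and drop duplicates, keeping first-seen order."""
    seen, merged = set(), []
    for results in result_lists:
        for chunk_id in results:
            if chunk_id not in seen:
                seen.add(chunk_id)
                merged.append(chunk_id)
    return merged


def two_stage_retrieve(query, retrievers, score_pair, corpus, top_n=50, top_x=10):
    """Hybrid recall followed by pairwise reranking; returns (chunk_id, score) pairs."""
    merged = merge_candidates(*(retrieve(query, top_n) for retrieve in retrievers))
    scored = [(chunk_id, score_pair(query, corpus[chunk_id])) for chunk_id in merged]
    scored.sort(key=lambda item: item[1], reverse=True)  # stable sort keeps tie order
    return scored[:top_x]


# Toy corpus and stand-in retrievers (hypothetical data).
corpus = {
    "c1": "scenic area emergency evacuation plan",
    "c2": "travel agency liability insurance",
    "c3": "barrier-free entrance ramp requirements",
    "c4": "emergency drill records for scenic area",
}
bm25 = lambda q, n: ["c1", "c4"][:n]
elser = lambda q, n: ["c4", "c2"][:n]
hnsw = lambda q, n: ["c1", "c3"][:n]


def overlap_score(query, text):
    """Placeholder scorer: fraction of query tokens that appear in the text."""
    q_tokens = query.split()
    return sum(tok in set(text.split()) for tok in q_tokens) / len(q_tokens)


final = two_stage_retrieve(
    "scenic area emergency evacuation", [bm25, elser, hnsw], overlap_score, corpus, top_x=2
)
print(final)
```

Keeping recall and reranking as separate functions mirrors the separation of coverage-oriented and precision-oriented stages, so that either stage can be replaced without touching the other.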

The proposed algorithm integrates keyword-based, sparse-vector, and dense-vector retrieval, thereby improving candidate coverage and semantic relevance. The reranking module performs fine-grained relevance assessment, and the final generation step converts the selected evidence into an actionable answer. Overall, ReguTourRAG is well suited for tourism-related question-answering scenarios characterized by structural diversity, specialized terminology, and cross-document dependencies.

4. Experiments and Analysis

4.1 Dataset

During algorithm implementation, this study collected standardized documents and reports related to the tourism industry from authoritative sources, including the Ministry of Culture and Tourism, local government portals, industry associations, and published service standards. A total of 970 valid documents were compiled, covering laws, administrative regulations, ministerial rules, local normative documents, rectification reports, and operational standards. This collection provides a domain-specific foundation for constructing a standardized tourism regulatory knowledge base.

After consolidation, the textual characteristics of the corpus were analyzed. The word count per document ranged from approximately 1,000 to more than 70,000, with an average of 13,677 words. The largest document contained 74,705 words, reflecting substantial variation in document granularity and depth. Vocabulary statistics indicate that the dataset contains approximately 4.4 million words and about 50,000 unique terms. This confirms the linguistic and terminological diversity of tourism regulatory texts. A partial overview is presented in Table 2.

Table 2. Examples of tourism regulatory documents (partial)

Document Type	Issuing Authority/Examples
Law	Tourism Law of the People’s Republic of China
Administrative Regulations	Regulation on Travel Agencies, Regulation on Tour Guides, Administrative Measures for Outbound Tourism of Chinese Citizens, etc.
Ministerial Regulations	Measures for the Handling of Tourism Complaints, Regulations on Border Tourism Pilot Management, Measures for the Administration of Travel Agency Liability Insurance, etc.
Normative Documents Issued by the National Tourism Administration	Measures for the Protection of Tourism Resources, Detailed Rules for the Implementation of Border Tourism Pilot Management with Russia, etc.
Local Regulations	Regulations enacted by 31 provinces; provincial and lower-level legislative rules
Local Government Rules	Implementation measures for tourism management issued by local governments
National Standards (GB, GB/T)	Examples: GB 5768-2022 (Road Traffic Signs and Markings), GB/T 51224-2017 (Technical Specifications for Rural Road Engineering), etc.
Industry Standards (Ministry of Culture and Tourism)	Examples: LB/T 034-2014 (Basic Requirements for Rural Tourism), LB/T 067-2023 (Barrier-Free Tourism Service Specifications), etc.
Association/Local Standards	Examples: Yunnan Tourism Standardization System, T/CATS 002-2019 (Quality Standards for Rural Tourism Services), etc.

The experimental queries were generated by domain experts based on the structure of the tourism knowledge base and practical regulatory scenarios. Corresponding reference answers were curated to ensure consistency and evidential grounding. The query set covers core tourism management tasks, including scenic-area service regulations, safety management requirements, complaint handling, accessibility services, emergency response, and implementation of industry standards. This construction process improves domain relevance and practical applicability, although the resulting benchmark remains limited by the size and representativeness of expert-designed questions.

4.2 Models Utilized

This study employed Qwen2.5 [29] and GPT-4 [30] series models as benchmark systems. The Qwen series, developed by Alibaba Cloud, consists of large-scale pretrained language models with multiple parameter sizes and strong Chinese-language capabilities. GPT-4, developed by OpenAI, is a large-scale multimodal model with strong long-context processing and instruction-following ability; because OpenAI has not publicly disclosed its parameter count, this study does not assume a specific model size. For the baseline construction of traditional retrieval-augmented generation methods, the Zpoint-large-embedding-zh model was used during text embedding to convert documents into vector representations. This embedding model supports Chinese and English semantic retrieval and can process text at multiple levels of granularity, making it suitable for tourism regulatory documents of varying length and specificity.

The main implementation settings are specified below. Raw PDF, Word, and web documents were converted into Unicode Transformation Format-8 (UTF-8) text and segmented with a clause-aware sliding-window strategy. Article, section, and appendix boundaries were preserved where available; otherwise, text was split into chunks of approximately 350–500 Chinese characters or 250–350 English tokens, with an overlap of about 80 tokens to preserve cross-clause context. Each chunk retained metadata fields, including source title, issuing authority, document type, release date, validity status, and section identifier. BM25 retrieval used the standard Okapi configuration with $k_1 = 1.2$ and $b = 0.75$, and returned the top 50 candidates. ELSER sparse semantic retrieval and HNSW dense retrieval each returned the top 50 candidates. Dense retrieval used cosine similarity over Zpoint-large-embedding-zh embeddings; the HNSW index was configured with $M=16$, $efConstruction=200$, and $efSearch=100$. Candidate sets from the three recall branches were merged and deduplicated by document identifier and chunk offset. BGE-reranker-large was then applied as a cross-encoder reranker over the merged candidate pool, and the final Top-10 passages were supplied to the generation model. For generation, a deterministic or low-temperature setting was used, with the temperature set to 0.1 and a maximum output length of 1,024 tokens.
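The fallback splitting rule described above can be sketched as a fixed-size sliding window. The sketch below assumes a pre-tokenized input and omits the clause-aware boundary detection and metadata handling; the function name and the default chunk size of 300 tokens (a value inside the reported 250–350 range) are illustrative choices.

```python
def sliding_window_chunks(tokens, chunk_size=300, overlap=80):
    """Split a token list into overlapping windows of at most chunk_size tokens."""
    if not 0 <= overlap < chunk_size:
        raise ValueError("overlap must satisfy 0 <= overlap < chunk_size")
    step = chunk_size - overlap
    chunks = []
    for start in range(0, len(tokens), step):
        chunks.append(tokens[start:start + chunk_size])
        if start + chunk_size >= len(tokens):
            break  # the current window already reaches the end of the text
    return chunks


# Synthetic token sequence used only to show the window arithmetic.
tokens = [f"t{i}" for i in range(700)]
chunks = sliding_window_chunks(tokens)
print([len(chunk) for chunk in chunks])
```

With these settings each window starts 220 tokens after the previous one, so consecutive chunks share 80 tokens of context.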

4.3 Evaluation Metrics

This study employed BERTScore and Generative Evaluation (G-Eval) [31] to assess generation performance. BERTScore measures semantic similarity between generated answers and reference responses by computing cosine similarity over token embeddings. It includes precision, recall, and F1, emphasizing semantic equivalence rather than surface lexical overlap. Because regulatory answers may use different wording while preserving the same legal meaning, embedding-based evaluation is useful as a complementary metric.
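The greedy-matching idea behind BERTScore can be illustrated with a minimal sketch. The two-dimensional vectors below are hypothetical stand-ins for contextual token embeddings, and the sketch omits optional refinements such as inverse-document-frequency weighting and baseline rescaling.

```python
import math


def cosine(u, v):
    """Cosine similarity between two non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    return dot / (math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v)))


def bertscore(cand_emb, ref_emb):
    """Greedy-matching precision, recall, and F1 over token embeddings."""
    # Precision: each candidate token is matched to its most similar reference token.
    precision = sum(max(cosine(c, r) for r in ref_emb) for c in cand_emb) / len(cand_emb)
    # Recall: each reference token is matched to its most similar candidate token.
    recall = sum(max(cosine(r, c) for c in cand_emb) for r in ref_emb) / len(ref_emb)
    total = precision + recall
    f1 = 2 * precision * recall / total if total else 0.0
    return precision, recall, f1


# Toy embeddings (hypothetical): two candidate tokens, three reference tokens.
cand = [[1.0, 0.0], [0.0, 1.0]]
ref = [[1.0, 0.0], [1.0, 1.0], [0.0, 1.0]]
p, r, f1 = bertscore(cand, ref)
print(round(p, 4), round(r, 4), round(f1, 4))
```

In this toy case recall is lower than precision because one reference token has no close counterpart in the candidate, which is the situation an incomplete answer would produce.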

To further assess practical answer quality, a five-dimensional evaluation was conducted, covering faithfulness, relevance, completeness, fluency, and conciseness. Each dimension was assessed using both a binary judgment and a Likert-scale score from 1 to 5, as shown in Table 3. The scoring process jointly considered the query, reference answer, and generated response, with evaluations performed automatically by a language model. The averaged scores provide a multidimensional view of model behavior. Because automatic large language model-based evaluation may introduce evaluator bias, the results are interpreted as comparative evidence rather than absolute proof of answer correctness.

Table 3. Multidimensional scoring definition for Generative Evaluation (G-Eval)

Dimension	Evaluation Criteria	Output Type
Faithfulness	Whether the answer is factually consistent with the reference information	Yes/No and 1–5
Relevance	Whether the answer directly addresses the question	Yes/No and 1–5
Completeness	Whether the answer covers the key information required	Yes/No and 1–5
Fluency	Whether the language is natural and sentences are grammatically correct	Yes/No and 1–5
Conciseness	Whether the answer is concise and avoids redundancy	Yes/No and 1–5

Note: Each dimension is evaluated using both a binary judgment (Yes/No) and a 1–5 Likert-scale rating, where 1 indicates the lowest score and 5 indicates the highest score. A score of 3 or above is regarded as “Yes,” while a score below 3 is regarded as “No.”
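The thresholding rule in the note can be expressed as a short sketch. The function names and the example ratings are hypothetical; only the threshold of 3 and the 1–5 range come from the note above.

```python
def to_binary(score):
    """Map a 1-5 Likert rating to the binary judgment (3 or above counts as Yes)."""
    if not 1 <= score <= 5:
        raise ValueError("Likert score must be between 1 and 5")
    return "Yes" if score >= 3 else "No"


def average_scores(ratings):
    """Average each dimension over a list of per-answer rating dictionaries."""
    dimensions = ratings[0].keys()
    return {d: sum(r[d] for r in ratings) / len(ratings) for d in dimensions}


# Hypothetical ratings for two generated answers on two dimensions.
ratings = [
    {"faithfulness": 5, "relevance": 4},
    {"faithfulness": 4, "relevance": 2},
]
averages = average_scores(ratings)
judgments = [{d: to_binary(s) for d, s in r.items()} for r in ratings]
print(averages, judgments)
```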

The ablation study evaluates the impact of integrating multiple retrieval algorithms into ReguTourRAG using mean reciprocal rank, normalized discounted cumulative gain, accuracy, and ROUGE-L. Mean reciprocal rank averages, over all queries, the reciprocal of the rank at which the first relevant result appears; higher values indicate stronger early-ranking capability. Normalized discounted cumulative gain measures ranking quality while accounting for graded relevance and rank position. Accuracy reflects the proportion of answers judged correct under the reference-answer criterion.
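The two ranking metrics can be sketched as follows. The sketch assumes the common logarithmic discount $1/\log_2(\mathrm{rank}+1)$ and, as a simplification, computes the ideal ordering from the retrieved list only; the exact gain and discount functions used in the experiments are not stated here, and the relevance labels are hypothetical.

```python
import math


def reciprocal_rank_at_k(relevances, k):
    """1/rank of the first relevant item within the top k, or 0 if none appears."""
    for rank, rel in enumerate(relevances[:k], start=1):
        if rel > 0:
            return 1.0 / rank
    return 0.0


def mrr_at_k(all_relevances, k):
    """Average the reciprocal rank over all queries."""
    return sum(reciprocal_rank_at_k(r, k) for r in all_relevances) / len(all_relevances)


def dcg_at_k(relevances, k):
    """Discounted cumulative gain with a log2(rank + 1) discount."""
    return sum(rel / math.log2(rank + 1) for rank, rel in enumerate(relevances[:k], start=1))


def ndcg_at_k(relevances, k):
    """DCG normalized by the DCG of the ideal (descending) ordering of the same list."""
    ideal = dcg_at_k(sorted(relevances, reverse=True), k)
    return dcg_at_k(relevances, k) / ideal if ideal > 0 else 0.0


# Hypothetical relevance labels for three queries (1 = relevant, 0 = not relevant).
runs = [[0, 1, 0], [1, 0, 0], [0, 0, 0]]
mrr3 = mrr_at_k(runs, k=3)
ndcg3 = ndcg_at_k(runs[0], k=3)
print(round(mrr3, 4), round(ndcg3, 4))
```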

ROUGE-L measures the degree of overlap between the candidate and reference texts based on the longest common subsequence. This metric emphasizes word order and evaluates the structural alignment between generated and reference outputs. First, recall and precision are computed using Eq. (13) and Eq. (14), followed by the calculation of ROUGE-L via Eq. (15).

$R=\frac{\operatorname{LCS}(X, Y)}{m}$

(13)

$P=\frac{\operatorname{LCS}(X, Y)}{n}$

(14)

$V_{\mathrm{ROUGE -L}}=\frac{\left(1+\beta^2\right) R P}{R+\beta^2 P}$

(15)

where $X$ denotes the reference answer, $Y$ represents the generated answer, $m$ and $n$ are the lengths of $X$ and $Y$, respectively, $\operatorname{LCS}(X, Y)$ is the length of their longest common subsequence, and $\beta$ is a tunable parameter that weights recall relative to precision.
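Eqs. (13)–(15) can be illustrated with a minimal sketch. The whitespace tokenization, the choice $\beta = 1$, and the two example sentences are assumptions for illustration; Chinese text would first require character-level or word-level segmentation.

```python
def lcs_length(x, y):
    """Length of the longest common subsequence of two token lists."""
    prev = [0] * (len(y) + 1)
    for x_tok in x:
        curr = [0]
        for j, y_tok in enumerate(y, start=1):
            if x_tok == y_tok:
                curr.append(prev[j - 1] + 1)
            else:
                curr.append(max(prev[j], curr[j - 1]))
        prev = curr
    return prev[-1]


def rouge_l(reference, candidate, beta=1.0):
    """ROUGE-L from LCS-based recall (Eq. 13) and precision (Eq. 14) via Eq. (15)."""
    lcs = lcs_length(reference, candidate)
    if lcs == 0:
        return 0.0
    recall = lcs / len(reference)
    precision = lcs / len(candidate)
    return (1 + beta ** 2) * recall * precision / (recall + beta ** 2 * precision)


# Hypothetical sentences, not quoted from any regulation.
reference = "the scenic area shall post evacuation route signs".split()
candidate = "the scenic area must post route signs".split()
score = rouge_l(reference, candidate)  # LCS = 6, R = 6/8, P = 6/7
print(round(score, 4))
```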

4.4 Results

4.4.1 Metric comparison

To validate the effectiveness of the proposed approach, ReguTourRAG was compared against several baseline models, which were categorized into simple and adaptive methods (Table 4). The simple methods included no retrieval and single-step retrieval-augmented generation. The adaptive methods included self-retrieval-augmented generation [25], adaptive retrieval [32], SA-RAG, and the proposed ReguTourRAG. These baselines allow the evaluation to distinguish the effects of retrieval presence, retrieval adaptivity, hybrid recall, and reranking.

Table 4. Average BERTScore across different knowledge bases in the tourism domain

Types	Methods	BERTScore Precision	BERTScore Recall	BERTScore F1
Simple	No retrieval	0.6116	0.7066	0.6542
Simple	Single-step retrieval-augmented generation	0.6254	0.7087	0.6630
Adaptive	Self-retrieval-augmented generation	0.6351	0.7260	0.6759
Adaptive	Adaptive retrieval	0.6244	0.7237	0.6686
Adaptive	SA-RAG	0.6233	0.7258	0.6688
Adaptive	ReguTourRAG	0.6352	0.7290	0.6772

In this experiment, evaluation was conducted using GPT-based assessment to incorporate a degree of human-like judgment. Specifically, GPT-4o was used as the evaluator, and the results are summarized in Table 5. ReguTourRAG achieved the highest faithfulness and relevance scores among the compared methods and tied for the best fluency and conciseness scores. Adaptive retrieval obtained a slightly higher completeness score, indicating that its responses covered marginally more reference content in some cases. Overall, ReguTourRAG remained competitive with the strongest adaptive baseline while showing a more balanced profile across faithfulness, relevance, and answer fluency. This pattern suggests that the proposed hybrid retrieval and reranking design improves evidence alignment without sacrificing readability, although the small score margins indicate that future work should include human expert evaluation for stronger validation.

Table 5. Average Generative Evaluation (G-Eval) comparison across different knowledge bases in the tourism domain

Methods	Faithfulness	Relevance	Completeness	Fluency	Conciseness
No retrieval	4.75	4.92	4.59	5.00	2.83
Single-step approach	4.15	4.47	3.92	4.99	2.50
Self-retrieval-augmented generation	4.69	4.90	4.52	5.00	2.77
Adaptive retrieval	4.81	4.92	4.73	4.99	2.85
ReguTourRAG	4.82	4.93	4.68	5.00	2.85

4.4.2 Ablation study

To validate the effectiveness of ReguTourRAG within retrieval-augmented generation frameworks, comparative experiments were conducted against several baseline approaches, including traditional retrieval-augmented generation, retrieval-augmented generation + BM25, retrieval-augmented generation + ELSER, and retrieval-augmented generation + HNSW. These evaluations were designed to isolate the performance gains associated with hybrid retrieval and reranking. Because the same knowledge base and evaluation protocol were used across variants, the comparison focused on retrieval architecture rather than changes in corpus content. Table 6 and Table 7 show the comparative results of different retrieval models on Mean Reciprocal Rank at $k$ ($\mathrm{MRR@k}$) and Normalized Discounted Cumulative Gain at $k$ ($\mathrm{nDCG@k}$).

The experimental results demonstrate that hybrid retrieval combined with reranking has a clear impact on retrieval performance. The BM25-based retrieval-augmented generation baseline holds a constant mean reciprocal rank of 0.5000 at every retrieval depth, showing stable but limited keyword-matching capability. Retrieval-augmented generation + HNSW performs better than the single sparse retrieval methods, indicating the value of semantic vector search for operationally phrased queries. The full ReguTourRAG model achieves an $\mathrm{MRR@3}$ of 0.6567 and an $\mathrm{nDCG@3}$ of 0.6626, outperforming BM25 and the other single-strategy baselines. As retrieval depth increases, ReguTourRAG maintains its advantage, reaching an $\mathrm{MRR@20}$ of 0.6667 and an $\mathrm{nDCG@20}$ of 0.8099. These results indicate that the proposed architecture improves both early precision and overall ranking quality.

Table 6. Comparative results of different retrieval models on mean reciprocal rank at $k$ ($\mathrm{MRR@k}$)

Model	$\boldsymbol{\mathrm{MRR@3}}$	$\boldsymbol{\mathrm{MRR@6}}$	$\boldsymbol{\mathrm{MRR@9}}$	$\boldsymbol{\mathrm{MRR@12}}$	$\boldsymbol{\mathrm{MRR@20}}$
Retrieval-augmented generation + Best Matching 25 (BM25)	0.5000	0.5000	0.5000	0.5000	0.5000
Retrieval-augmented generation + Elastic Learned Sparse Encoder (ELSER)	0.2800	0.3100	0.3100	0.3100	0.3100
Retrieval-augmented generation + Hierarchical Navigable Small World (HNSW)	0.5367	0.5533	0.5583	0.5583	0.5683
ReguTourRAG without reranking	0.5467	0.4683	0.4667	0.4667	0.5113
ReguTourRAG	0.6567	0.6667	0.6667	0.6667	0.6667

Note: For ReguTourRAG without reranking, the non-monotonic $\mathrm{MRR@k}$ values are due to the independent construction of candidate sets at each cutoff $k$ after hybrid fusion and deduplication of BM25, ELSER, and HNSW outputs. Therefore, the candidate set at a larger $k$ is not necessarily a cumulative extension of that at a smaller $k$, and the $\mathrm{MRR@k}$ values may fluctuate.
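The distinction drawn in the note can be illustrated with a small sketch: for a single fixed ranking, the reciprocal rank at a larger cutoff can never be lower than at a smaller one, whereas rankings rebuilt separately for each cutoff carry no such guarantee. The document identifiers and rankings below are hypothetical.

```python
def rr_at_k(ranked_ids, relevant, k):
    """Reciprocal rank of the first relevant identifier within the top k."""
    for rank, doc_id in enumerate(ranked_ids[:k], start=1):
        if doc_id in relevant:
            return 1.0 / rank
    return 0.0


relevant = {"d1"}

# Case 1: one fixed ranking evaluated at two cutoffs.
fixed_list = ["d7", "d1", "d3", "d4", "d5", "d6"]
fixed = [rr_at_k(fixed_list, relevant, k) for k in (3, 6)]

# Case 2: the candidate list is rebuilt for each cutoff, so the order may change.
list_at_3 = ["d1", "d7", "d3"]
list_at_6 = ["d7", "d3", "d1", "d4", "d5", "d6"]
rebuilt = [rr_at_k(list_at_3, relevant, 3), rr_at_k(list_at_6, relevant, 6)]

print(fixed, rebuilt)
```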

Table 7. Comparative results of different retrieval models on normalized discounted cumulative gain at $k$ ($\mathrm{nDCG@k}$)

Model	$\boldsymbol{\mathrm{nDCG@3}}$	$\boldsymbol{\mathrm{nDCG@6}}$	$\boldsymbol{\mathrm{nDCG@9}}$	$\boldsymbol{\mathrm{nDCG@12}}$	$\boldsymbol{\mathrm{nDCG@20}} $
Retrieval-augmented generation + Best Matching 25 (BM25)	0.5000	0.5000	0.5000	0.5000	0.5000
Retrieval-augmented generation + Elastic Learned Sparse Encoder (ELSER)	0.2926	0.3252	0.3252	0.3252	0.3252
Retrieval-augmented generation + Hierarchical Navigable Small World (HNSW)	0.5552	0.6191	0.6529	0.6756	0.7287
ReguTourRAG without reranking	0.6031	0.5656	0.6123	0.6336	0.7839
ReguTourRAG	0.6626	0.6852	0.7105	0.7231	0.8099

The comparison between ReguTourRAG without reranking and the full ReguTourRAG further confirms the contribution of the reranking stage. The full model achieves an $\mathrm{MRR@12}$ of 0.6667, compared with 0.4667 for the non-reranked variant, reflecting a substantial improvement in early evidence ordering. In terms of overall ranking quality, the $\mathrm{nDCG@20}$ score increases from 0.7839 to 0.8099. These findings suggest that hybrid recall alone is not sufficient; without a precision-oriented reranker, the candidate set may contain relevant documents but fail to place the most useful evidence at the top.

Figure 3 and Figure 4 illustrate the trends of $\mathrm{MRR}@k$ and $\mathrm{nDCG}@k$ across different $k$ values. Overall, ReguTourRAG demonstrates stable performance across retrieval lists of varying lengths, with the strongest normalized discounted cumulative gain performance observed at $\mathrm{nDCG}@20$. This result highlights the method's practical value for complex regulatory queries, where multiple clauses may be needed to support a complete answer. From a system-design perspective, the results suggest that retrieval coverage and reranking precision should be optimized jointly rather than treated as independent components.

Figure 3. Trend of Mean Reciprocal Rank ($\mathrm{MRR}@k$) across different $k$ values

Note: RAG+BM25 = retrieval-augmented generation + Best Matching 25 (BM25); RAG+ELSER = retrieval-augmented generation + Elastic Learned Sparse Encoder (ELSER); RAG+HNSW = retrieval-augmented generation + Hierarchical Navigable Small World (HNSW); ReguTourRAG without reranking = the proposed framework without the reranking module.

Figure 4. Trend of Normalized Discounted Cumulative Gain ($\mathrm{nDCG}@k$) across different $k$ values

Note: RAG+BM25 = retrieval-augmented generation + Best Matching 25 (BM25); RAG+ELSER = retrieval-augmented generation + Elastic Learned Sparse Encoder (ELSER); RAG+HNSW = retrieval-augmented generation + Hierarchical Navigable Small World (HNSW); ReguTourRAG without reranking = the proposed framework without the reranking module.

The generation results in Table 8 show performance differences among retrieval configurations. The baseline retrieval-augmented generation model achieves an accuracy of 0.363 and a ROUGE-L score of 0.354, reflecting the limitations of generation without stronger retrieval enhancement. Among single-strategy retrieval models, retrieval-augmented generation + HNSW performs best, with an accuracy of 0.398 and a ROUGE-L score of 0.385, suggesting that dense semantic retrieval provides useful context for operational queries. The non-reranked ReguTourRAG variant achieves an accuracy of 0.448 and a ROUGE-L score of 0.417, while the full ReguTourRAG model maintains the same accuracy and improves ROUGE-L to 0.439. This indicates that reranking contributes most clearly to answer structure and textual alignment rather than changing the binary accuracy outcome in this experiment. Overall, the results support the effectiveness of the multi-retrieval fusion plus post-retrieval optimization strategy for complex tourism regulatory question answering.

Table 8. Accuracy and ROUGE-L comparison of different retrieval models

Model	Accuracy	ROUGE-L
Retrieval-augmented generation	0.363	0.354
Retrieval-augmented generation + Best Matching 25 (BM25)	0.374	0.367
Retrieval-augmented generation + Elastic Learned Sparse Encoder (ELSER)	0.369	0.362
Retrieval-augmented generation + Hierarchical Navigable Small World (HNSW)	0.398	0.385
ReguTourRAG without reranking	0.448	0.417
ReguTourRAG	0.448	0.439

4.4.3 Case study

As illustrated in the Appendix, a case study was conducted in which the same query with conditional constraints was submitted to both traditional retrieval-augmented generation and the proposed ReguTourRAG framework. The outputs show that traditional retrieval-augmented generation retrieves semantically related passages but struggles to organize evidence around user-specific constraints. In contrast, ReguTourRAG retrieves documents from multiple complementary indexes and reranks the evidence before generation, enabling the final answer to better reflect the applicable regulatory conditions. The case study illustrates the practical value of the framework: for scenic-area managers, the system does not merely return a list of documents, but produces an evidence-grounded response that can support policy interpretation, rectification planning, and compliance review.

To further examine scenario coverage beyond the accessibility example shown in the Appendix, additional representative query-response cases were tested across safety management, accessibility services, and complaint handling. As summarized in Table 9, the retrieved evidence and generated answers remained organized around authoritative clauses and practical management actions, indicating that the framework can support diverse regulatory tasks rather than only a single demonstration query.

Table 9. Representative query-response cases across tourism management scenarios

Scenario	Representative Query	Condensed ReguTourRAG Response	Main Evidence Type
Safety management	What should a scenic area include when preparing a peak-season emergency evacuation and visitor-flow control plan?	The answer identified emergency-response responsibility allocation, visitor-capacity monitoring, evacuation-route signage, risk warnings, drill records, and coordination with local public-security and emergency-management departments.	Safety management rules, scenic-area service standards, and emergency-response provisions
Accessibility services	What accessibility requirements should be checked when renovating entrances, visitor routes, and service facilities in a scenic area?	The answer organized requirements for barrier-free entrances, continuous accessible routes, ramp or level-transition design, accessible toilets, visible guidance signs, and service assistance for elderly or disabled visitors.	Accessibility design codes and tourism service standards
Complaint handling	How should a scenic area or travel service provider handle a tourist complaint about service quality and refund disputes?	The answer summarized complaint acceptance, evidence collection, timely investigation, mediation or correction measures, preservation of records, and escalation to competent authorities when a dispute cannot be resolved internally.	Tourism complaint-handling measures and consumer-rights provisions

4.5 Discussion and Threats to Validity

The empirical results suggest that ReguTourRAG’s main advantage comes from aligning retrieval architecture with the structure of regulatory knowledge. Tourism regulations contain exact legal terms, semantically related service concepts, and cross-document dependencies. BM25, ELSER, HNSW, and reranking each address a different part of this problem. The practical implication is that regulatory question-answering systems should be evaluated not only by final answer fluency, but also by whether the system retrieves authoritative evidence early enough for the generator to use it reliably.

There are also threats to validity. The corpus contains 970 documents, but tourism regulation is continuously updated and differs across jurisdictions. The expert-designed query set improves relevance but may not fully represent real user behavior. BERTScore, ROUGE-L, and GPT-based scoring provide useful comparative signals, yet they cannot replace legal-domain expert review. In addition, the ablation results should be interpreted as retrieval-architecture evidence rather than a complete deployment benchmark because latency, update cost, and human-in-the-loop review were not fully evaluated. For deployment, ReguTourRAG should therefore be integrated with document-version monitoring, source validity checks, and claim-level citation. In regulation-driven scenarios, the system should explicitly return “insufficient evidence” when authoritative support is unavailable. This conservative behavior is essential for reducing hallucination risk and for maintaining institutional trust in artificial intelligence-assisted tourism governance.

Practical deployment also requires lifecycle and governance mechanisms beyond model accuracy. A tourism regulatory question-answering system should include scheduled crawling or manual ingestion for newly issued documents, version comparison for revised or repealed provisions, and metadata-based filtering so that expired documents do not dominate retrieval results. At scale, incremental indexing is preferable to full re-indexing because dense-vector reconstruction can be costly when local standards are updated frequently. The system should also expose source documents, clause identifiers, and retrieval confidence to human reviewers, especially for answers related to administrative penalties, safety obligations, or accessibility compliance. From an engineering perspective, deployment should monitor latency, index freshness, retrieval drift, and user feedback, while preserving audit logs that record which document versions supported each answer. These requirements show that ReguTourRAG is best understood as a continuously maintained decision-support component rather than a one-time static question-answering model.
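The metadata-based filtering mentioned above can be sketched as follows. The field names follow the chunk metadata listed in Section 4.2, but the `Chunk` class, the status labels, and the example records are hypothetical.

```python
from dataclasses import dataclass
from datetime import date


@dataclass(frozen=True)
class Chunk:
    chunk_id: str
    source_title: str
    validity_status: str  # hypothetical labels: "valid", "revised", "repealed"
    release_date: date


def filter_valid(chunks, allowed_statuses=("valid",)):
    """Keep only chunks whose validity status is allowed before retrieval or reranking."""
    return [chunk for chunk in chunks if chunk.validity_status in allowed_statuses]


# Hypothetical records used only to show the filtering behavior.
chunks = [
    Chunk("c1", "Standard A (current edition)", "valid", date(2023, 1, 1)),
    Chunk("c2", "Standard A (previous edition)", "repealed", date(2014, 1, 1)),
    Chunk("c3", "Local measure B", "valid", date(2021, 6, 1)),
]
active = filter_valid(chunks)
print([chunk.chunk_id for chunk in active])
```

Applying the filter before recall keeps repealed provisions out of the candidate pool, while the retained `release_date` field supports the audit logging of document versions.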

5. Conclusion

This study designs and implements ReguTourRAG, a multidimensional retrieval-augmented generation framework for tourism laws, regulations, and standards. By integrating BM25, ELSER, HNSW, and BGE-based reranking, the framework addresses three key requirements of regulatory question answering: exact terminology matching, semantic expansion, and precision-oriented evidence selection. Experimental evaluations demonstrate that ReguTourRAG outperforms traditional and single-strategy retrieval baselines across multiple retrieval and generation metrics, including BERTScore, mean reciprocal rank, normalized discounted cumulative gain, accuracy, and ROUGE-L. These results indicate that multi-algorithm collaboration can substantially improve evidence retrieval and answer quality in tourism regulatory scenarios.

Beyond its technical contribution, ReguTourRAG provides a practical pathway for improving the scientific rigor and compliance support capacity of tourism planning and scenic-area management. The framework is also transferable to other regulation-driven domains such as finance, healthcare, education, and engineering management. However, several limitations remain. First, the knowledge base should be expanded to include a broader range of local standards, international conventions, and updated validity metadata. Second, the current benchmark relies on expert-designed questions and automatic evaluation. Therefore, future work should incorporate larger-scale human expert assessment and real user logs. Third, knowledge freshness requires automated monitoring, document-version control, and incremental re-indexing. Finally, future systems should strengthen answer attribution by linking each generated claim to specific source clauses, thereby further improving transparency and auditability.

Author Contributions

Conceptualization, Z.H.W. and X.Y.W.; methodology, Z.H.W. and X.Y.W.; formal analysis, Z.H.W., X.Y.W., and J.Z.; resources, Q.H.; writing—original draft preparation, X.Y.W.; writing—review and editing, Z.H.W. and X.Y.W.; supervision, J.Z. and Q.H. All authors have read and agreed to the published version of the manuscript.

Funding

This study is funded by the Major Program of National Fund of Philosophy and Social Science of China (Grant No.: 20BGL001) and the 2025 Second-Batch Research Project of the Beijing Higher Education Security Association (Grant No.: Z20260014).

Data Availability

The data used to support the research findings are available from the corresponding author upon request.

Conflicts of Interest

All authors declare that they have no conflicts of interest.

References

1. The State Council of the People’s Republic of China, “China sees domestic travel surge in 2024,” 2025. https://english.www.gov.cn/archive/statistics/202501/22/content_WS6790867bc6d0868f4e8ef0f0.html

2. The State Council of the People’s Republic of China, “China sets out 5-year path for tourism,” 2022. https://english.www.gov.cn/policies/latestreleases/202201/20/content_WS61e9256dc6d09c94e48a3fd2.html

3. A. Radford, K. Narasimhan, T. Salimans, and I. Sutskever, “Improving language understanding by generative pre-training,” OpenAI, Tech. Rep., 2018. [Online]. Available: https://cdn.openai.com/research-covers/language-unsupervised/language_understanding_paper.pdf

4. J. Bai, S. Bai, Y. Chu, Z. Cui, K. Dang, X. Deng, Y. Fan, W. Ge, Y. Han, F. Huang, et al., “Qwen technical report,” arXiv preprint arXiv:2309.16609, 2023.

5. R. Anil, S. Borgeaud, J. B. Alayrac, J. Yu, R. Soricut, J. Schalkwyk, A. M. Dai, A. Hauth, K. Millican, D. Silver, et al., “Gemini: A family of highly capable multimodal models,” arXiv preprint arXiv:2312.11805, 2023.

6. Q. Yang, H. Zuo, R. Su, H. Su, T. Zeng, H. Zhou, R. Wang, J. Chen, and Y. Lin, “Dual retrieving and ranking medical large language model with retrieval augmented generation,” Sci. Rep., vol. 15, p. 18062, 2025.

7. A. Banerjee, A. Satish, and W. Wörndl, “Enhancing tourism recommender systems for sustainable city trips using retrieval-augmented generation,” in Recommender Systems for Sustainability and Social Good, Bari, Italy, 2025.

8. P. Lewis, E. Perez, A. Piktus, F. Petroni, V. Karpukhin, N. Goyal, H. Küttler, M. Lewis, W.-t. Yih, T. Rocktäschel, et al., “Retrieval-augmented generation for knowledge-intensive NLP tasks,” in Advances in Neural Information Processing Systems, 2020, pp. 9459–9474. [Online]. Available: https://proceedings.neurips.cc/paper/2020/hash/6b493230205f780e1bc26945df7481e5-Abstract.html

9. V. Karpukhin, B. Oğuz, S. Min, P. Lewis, L. Wu, S. Edunov, D. Chen, and W.-t. Yih, “Dense passage retrieval for open-domain question answering,” in Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing, 2020, pp. 6769–6781. [Online]. Available: https://aclanthology.org/2020.emnlp-main.550/

10. D. Wan, M. Liu, K. McKeown, M. Dreyer, and M. Bansal, “Faithfulness-aware decoding strategies for abstractive summarization,” in Proceedings of the 17th Conference of the European Chapter of the Association for Computational Linguistics, Dubrovnik, Croatia, 2023, pp. 2864–2880.

11. T. Vu, M. Iyyer, X. Wang, N. Constant, J. Wei, J. Wei, C. Tar, Y. H. Sung, D. Zhou, Q. Le, et al., “FreshLLMs: Refreshing large language models with search engine augmentation,” in Findings of the Association for Computational Linguistics: ACL 2024, Bangkok, Thailand, 2024, pp. 13697–13720. [Online]. Available: https://aclanthology.org/2024.findings-acl.813/

12. G. Izacard and E. Grave, “Leveraging passage retrieval with generative models for open domain question answering,” in Proceedings of the 16th Conference of the European Chapter of the Association for Computational Linguistics, 2021, pp. 874–880. [Online]. Available: https://aclanthology.org/2021.eacl-main.74/

13. Y. Gao, Y. Xiong, X. Gao, K. Jia, J. Pan, Y. Bi, Y. Dai, J. Sun, and H. Wang, “Retrieval-augmented generation for large language models: A survey,” arXiv preprint arXiv:2312.10997, 2023.

14. E. J. Gong, C. S. Bang, J. J. Lee, S. Seo, D. Oh, and D. K. Lee, “The potential clinical utility of the customized large language model in gastroenterology: A pilot study,” Bioengineering, vol. 12, no. 1, p. 1, 2025.

15. K. Taneja, P. Maiti, S. Kakar, and A. K. Goel, “Jill Watson: A virtual teaching assistant powered by ChatGPT,” in Artificial Intelligence in Education, Springer, 2024, pp. 324–337.

16. N. Chondamrongkul, G. Hristov, and P. Temdee, “Addressing technical challenges in large language model-driven educational software system,” IEEE Access, vol. 13, pp. 12846–12858, 2025.

17. S. Xiao, Z. Liu, P. Zhang, N. Muennighoff, D. Lian, and J. Y. Nie, “C-Pack: Packed resources for general Chinese embeddings,” arXiv preprint arXiv:2309.07597, 2023.

18. J. Johnson, M. Douze, and H. Jégou, “Billion-scale similarity search with GPUs,” IEEE Trans. Big Data, vol. 7, no. 3, pp. 535–547, 2019.

19. S. Robertson and H. Zaragoza, “The probabilistic relevance framework: BM25 and beyond,” Found. Trends Inf. Retr., vol. 4, no. 1–2, pp. 1–174, 2009.

20. Y. A. Malkov and D. A. Yashunin, “Efficient and robust approximate nearest neighbor search using hierarchical navigable small world graphs,” IEEE Trans. Pattern Anal. Mach. Intell., vol. 42, no. 4, pp. 824–836, 2020.

21. K. Santhanam, O. Khattab, J. Saad-Falcon, C. Potts, and M. Zaharia, “ColBERTv2: Effective and efficient retrieval via lightweight late interaction,” in Proceedings of the 2022 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Seattle, United States, 2022, pp. 3715–3734. [Online]. Available: https://aclanthology.org/2022.naacl-main.272/

22. S. Borgeaud, A. Mensch, J. Hoffmann, T. Cai, E. Rutherford, K. Millican, G. B. van den Driessche, J. B. Lespiau, B. Damoc, A. Clark, et al., “Improving language models by retrieving from trillions of tokens,” in Proceedings of the 39th International Conference on Machine Learning, Baltimore, Maryland, USA, 2022, pp. 2206–2240. [Online]. Available: https://proceedings.mlr.press/v162/borgeaud22a.html

23. G. Izacard, M. Caron, L. Hosseini, S. Riedel, P. Bojanowski, A. Joulin, and E. Grave, “Unsupervised dense information retrieval with contrastive learning,” arXiv preprint arXiv:2112.09118, 2021.

24. T. Formal, C. Lassance, B. Piwowarski, and S. Clinchant, “Towards effective and efficient sparse neural information retrieval,” ACM Trans. Inf. Syst., vol. 42, no. 5, pp. 1–46, 2024.

25. A. Asai, Z. Wu, Y. Wang, A. Sil, and H. Hajishirzi, “Self-RAG: Learning to retrieve, generate, and critique through self-reflection,” in Proceedings of the International Conference on Learning Representations, Vienna, Austria, 2024. [Online]. Available: https://openreview.net/forum?id=hSyW5go0v8

26. F. Xu, W. Shi, and E. Choi, “RECOMP: Improving retrieval-augmented LMs with context compression and selective augmentation,” in Proceedings of the International Conference on Learning Representations, Vienna, Austria, 2024. [Online]. Available: https://openreview.net/forum?id=mlJLVigNHp

27. M. Glass, G. Rossiello, M. F. M. Chowdhury, A. Naik, P. Cai, and A. Gliozzo, “Re2G: Retrieve, rerank, generate,” in Proceedings of the 2022 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Seattle, Washington, United States, 2022, pp. 2701–2715. [Online]. Available: https://aclanthology.org/2022.naacl-main.194/

28. K. Guu, K. Lee, Z. Tung, P. Pasupat, and M. W. Chang, “Retrieval augmented language model pre-training,” in Proceedings of the 37th International Conference on Machine Learning, 2020, pp. 3929–3938. [Online]. Available: https://proceedings.mlr.press/v119/guu20a.html

29. Qwen Team, “Qwen2.5 technical report,” arXiv preprint arXiv:2412.15115, 2024.

30. OpenAI, J. Achiam, S. Adler, S. Agarwal, L. Ahmad, I. Akkaya, F. L. Aleman, D. Almeida, J. Altenschmidt, S. Altman, et al., “GPT-4 technical report,” arXiv preprint arXiv:2303.08774, 2023.

31. Y. Liu, D. Iter, Y. Xu, S. Wang, R. Xu, and C. Zhu, “G-Eval: NLG evaluation using GPT-4 with better human alignment,” arXiv preprint arXiv:2303.16634, 2023.

32. S. Jeong, J. Baek, S. Cho, S. J. Hwang, and J. C. Park, “Adaptive-RAG: Learning to adapt retrieval-augmented large language models through question complexity,” in Proceedings of the 2024 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Mexico City, Mexico, 2024, pp. 7036–7050. [Online]. Available: https://aclanthology.org/2024.naacl-long.389/

Appendix

Question

What accessibility design requirements should be met for gates in scenic areas?

RAG

The accessibility design of gates should meet the following requirements: Accessible gates should preferably use automatic doors, but swing doors, sliding doors, folding doors, lightweight spring doors, or accessible revolving doors are also acceptable. When revolving doors are adopted, they should be supplemented with an electric swing or sliding door, or an adjacent accessible door should be provided. For glass doors, clear and prominent markings are required. Accessible gates must be clearly identifiable, easy to open and close, and ensure safe passage. No obstacles or thresholds should be present; where level differences exist, the height shall not exceed 15 mm, and ramps with a slope not greater than 1:10 shall be installed for transition.

ReguTourRAG

According to GB 50763-2012 Code for Accessibility Design, the accessibility design of gates shall comply with the following requirements:

Large-resistance spring doors shall not be used, and spring or glass doors are generally discouraged; where glass doors are adopted, clear and prominent markings shall be provided.
The clear passage width of automatic doors after opening shall not be less than 1.00 m.
For swing, sliding, or folding doors, the clear passage width after opening shall not be less than 800 mm, and preferably not less than 900 mm when feasible.
A wheelchair turning space with a diameter of at least 1.50 m shall be reserved inside and outside the doorway.
On the wall adjacent to the handle side of single-leaf swing, sliding, or folding doors, a wall section of not less than 400 mm shall be provided.
Swing, sliding, or folding doors shall be equipped with handles positioned at 900 mm above floor level; observation glass is recommended, and a protective kick plate should be installed within 350 mm of the floor.
The threshold height and ground level difference between the interior and exterior shall not exceed 15 mm, and ramps shall be provided for transition.
Door leaves on accessible passages shall be easy to operate.
Contrasting colors on adjacent walls are recommended to facilitate recognition.

According to GB 55019-2021 General Code for Accessibility in Urban and Municipal Engineering, additional requirements for accessible automatic, manual, and all-glass doors include:

Manual doors meeting accessibility requirements shall comply with: a) for newly constructed or expanded buildings, the clear passage width after opening shall not be less than 900 mm; for renovated or retrofitted buildings, it shall not be less than 800 mm; b) swing doors shall be equipped with handrails on both sides, operable with a closed fist, positioned at a height of 0.85 m–1.00 m above the floor; c) excluding fire doors, the opening force shall not exceed 25 N.
For automatic doors, a) the clear passage width after opening shall not be less than 1.00 m; b) where manual control devices are provided, their operational components shall be positioned at a height of 0.85 m–1.00 m above the floor.
All-glass doors shall comply with: a) safety glass or protective measures must be adopted, along with prominent anti-collision markings; b) where glass partitions exist on both sides of the door, the door must be visually distinct from the partitions, and the partitions must include anti-collision markings; c) anti-collision markings shall extend across the glass door or partition, positioned at a height of 0.85 m–1.50 m above the floor.
Bi-directional accessible doors shall be fitted with observation windows within the visible height range, with the lower edge of the transparent portion positioned no higher than 850 mm above floor level.

Cite this:

Wu, Z. H., Wang, X. Y., Zhang, J., & He, Q. (2026). ReguTourRAG: A Multi-Algorithm Collaborative Retrieval Framework for Tourism Regulations and Standards Documents. Inf. Dyn. Appl., 5(1), 27-44. https://doi.org/10.56578/ida050103

©2026 by the author(s). Published by Acadlore Publishing Services Limited, Hong Kong. This article is available for free download and can be reused and cited, provided that the original published version is credited, under the CC BY 4.0 license.

Figure 1. Flowchart of the retrieval-augmented generation architecture

Table 1. Notation used in the proposed framework
