RB-BERT: A hybrid framework of rule-based weak supervision and BERT for aspect-level sentiment analysis of tourist attractions
Abstract:
Multi-aspect sentiment analysis aims to identify the different aspects and associated sentiments within user-generated reviews. In recent years, bidirectional encoder representations from transformers (BERT) has been widely used for sentiment analysis due to its strong ability to capture contextual information. However, BERT has limitations in explicitly identifying aspect boundaries and aligning sentiments, especially when multiple aspects with different sentiments appear in the same review. To address this issue, we propose a combination of rule-based weak supervision and BERT (RB-BERT). The main idea of RB-BERT is to use domain-specific linguistic rules to automatically generate weak labels for aspect-sentiment pairs, which are then used to fine-tune the pretrained BERT model. A key contribution of this study is addressing BERT's limitations in aspect-based sentiment analysis (ABSA) by enhancing aspect identification and sentiment assignment. The dataset consists of 3,811 user reviews of Sarangan Lake, a popular tourist attraction in East Java, Indonesia, collected from Google Maps. The aspects considered are scenery view, price, and local environment, and the sentiment polarities are positive and negative. We applied four rule levels to enhance the BERT model: the first rule handles aspect extraction, the second addresses sentiment extraction, the third determines the dominant sentiment based on the frequency of positive and negative words, and the fourth combines aspects and sentiments in each review to produce a label. BERT tokenization and BERT embeddings are used for feature extraction, with a fully connected linear layer serving as the classification head. RB-BERT performs best with a precision of 0.9218, a recall of 0.9748, a Micro-F1 of 0.9476, and a Hamming Loss of 0.0132. Thus, RB-BERT offers a fast, low-cost, and well-performing approach to automatic labeling for multilabel classification.
Introduction
Sentiment analysis is an automated text extraction process that aims to identify the polarity of opinions [1], [2]. However, this approach struggles to capture an opinion's aspects. For example, someone might say that the hotel service is good but the location is far from the tourist area. Sentiment analysis would classify this as a negative review, because an annotator highlights the consumer's disappointment with the location and may miss the positive remark about service. Yet good service enhances the hotel's value: the distant location cannot easily be changed, but aspects like service can be improved. To address this issue, sentiment analysis evolved into a more detailed method, aspect-based sentiment analysis (ABSA) [3]. ABSA categorizes reviews by aspect and sentiment. In the previous case, it would produce aspect: service, sentiment polarity: positive and aspect: location, sentiment polarity: negative.
To achieve this level of detail, ABSA uses a multi-stage extraction process. The first stage identifies aspects in the user's review, such as service, location, or price. The second stage extracts sentiment. The third stage determines the target opinion from the detected aspects and sentiments [4]; for example, the target opinion could be "service_positive" and "location_negative". These tasks are commonly addressed within natural language processing (NLP), which enables machines to interpret human language. Recently, pretrained models have become popular and effective for complex NLP tasks, including sentiment analysis and ABSA. One widely used pretrained model is bidirectional encoder representations from transformers (BERT) [5]. BERT was trained by Google on Wikipedia and BookCorpus using masked language modeling and next-sentence prediction objectives [6]. Although BERT captures sentence structure and context, it cannot classify text on its own [7]. Adding a classification head, such as a fully connected linear layer, allows BERT to perform text classification [8]. However, the new layer's parameters must still be trained on labeled data, so BERT's ability to perform classification tasks, including ABSA, depends on the availability of labeled datasets. For ABSA, the required labels include sentiment polarity and aspect information at the word or phrase level, which makes labeling more complex than in conventional sentiment analysis.
The labeling process for the ABSA dataset relies on manual annotation and weak labeling. Manual annotation by linguists provides high accuracy. However, it is time-consuming, costly, and hard to scale across domains or languages with limited resources [9]. This need for scalability highlights the importance of weak labeling to generate supervision data at a lower cost.
Weak labeling methods include unsupervised approaches such as topic modeling (e.g., Latent Dirichlet Allocation) [10] and clustering [11]. These methods extract "topics" as aspect sentiment from a large corpus. However, they have several limitations for ABSA. First, they produce document- or sentence-level topics but do not extract explicit aspect terms [10]. Second, their performance drops on short texts due to data sparsity and rare word co-occurrence, which makes aspect extraction difficult [10]. Third, clusters or topics need manual labeling or human interpretation to generate meaningful categories, which reduces the benefits of automation [11]. Thus, topic modeling and clustering do not provide the label granularity needed to train ABSA models, which require annotation at the token or phrase level. Large language model (LLM)-based weak labeling is another option, but it also has constraints: LLMs can produce inconsistent outputs, suffer from reproducibility issues, and require significant computational resources [12]. These issues highlight the need for weak labeling methods that use explicit linguistic knowledge, such as dependency relations between aspect and opinion words, to label aspects and polarities for ABSA more consistently, reproducibly, and efficiently.
This study proposes a hybrid weakly supervised framework for aspect-level sentiment analysis of tourist attraction reviews. It combines rule-based pseudo-label generation with BERT-based classification. The approach aims to reduce dependency on extensive manual annotation. It also improves classification performance using contextual semantic learning. In addition, this study aims to demonstrate that rule-based supervision can serve as an effective bridge between traditional interpretable methods and modern deep contextual language models. The main contributions of this study are as follows:
1. To develop a rule-based weak supervision mechanism for automatically generating aspect-level sentiment pseudo-labels from unlabeled tourism reviews.
2. To fine-tune a BERT-based classifier using pseudo-labeled data in order to improve contextual understanding and generalization beyond rules.
3. To evaluate the effectiveness of the proposed hybrid framework in supporting annotation-efficient aspect-level sentiment analysis in the tourism domain.
4. To provide empirical evidence that integrating rule-based supervision with contextual language modeling can offer a practical alternative for low-resource sentiment classification tasks.
This paper is organized as follows: Section 2 details the materials and methods; Section 3 presents the results and discussion; and Section 4 presents the conclusion and future work.
2. Methodology
This section describes the research architecture as shown in Figure 1. The figure illustrates data collection, data labeling, and performance evaluation. The following subsections explain these stages in more detail.

This study used a dataset obtained with the Instant Data Scraper extension. This tool automates data retrieval from Google Maps. The scraping process collected usernames and user reviews for Sarangan Lake, a popular tourist spot in East Java, Indonesia. In total, 3,811 reviews were collected.
The primary objective of this labeling process is to categorize user reviews by aspect and sentiment polarity. In this study, we used three aspects: scenery view, price, and local environment. Each aspect is classified into two sentiment polarities: positive and negative. Combining aspect and sentiment results in six classification labels: LP (Positive Environment), LN (Negative Environment), HP (Positive Price), HN (Negative Price), PP (Positive View), and PN (Negative View).
Each labeled dataset is transformed into a binary relevance format [13], as shown in Table 1. This enables effective handling of multilabel classification. Each label is represented as a binary value: 1 means the label is present in a review; 0 means it is absent. This approach is widely used in multilabel modeling. It allows a single review to have multiple labels and enables independent prediction for each label.
| User Reviews (Indonesian) | User Reviews (English) | PP | PN | HP | HN | LP | LN |
|---|---|---|---|---|---|---|---|
| Pemandangannya indah di pagi hari, cuacanya juga dingin. Cocok untuk wisata dengan keluarga dan juga teman. | The scenery is beautiful in the morning, and the weather is also cold. Suitable for tours with family and friends. | 1 | 0 | 0 | 0 | 1 | 0 |
There are four rules required to create the target opinion or the final label for ABSA. A detailed explanation of the four rules is as follows.
• Rule 1 for aspect detection
Given a sentence as a sequence of tokens $S=\left(w_1, w_2, \ldots, w_n\right)$ and a predefined set of aspects $A=\left\{a_1, a_2, \ldots, a_m\right\}$, each aspect $a_j$ is associated with a keyword set $K\left(a_j\right)$. An aspect is considered active in sentence $S$ if at least one token $w_i$ in $S$ matches any keyword in $K\left(a_j\right)$. The matching token positions are defined as:

$$P\left(a_j, S\right)=\left\{i \mid w_i \in K\left(a_j\right)\right\}$$

The aspect activation function is then defined as:

$$\operatorname{Act}\left(a_j, S\right)= \begin{cases}1, & P\left(a_j, S\right) \neq \emptyset \\ 0, & \text{otherwise}\end{cases}$$

The final detected aspect set is:

$$A(S)=\left\{a_j \in A \mid \operatorname{Act}\left(a_j, S\right)=1\right\}$$
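A minimal sketch of Rule 1, assuming small illustrative keyword sets (the study's actual aspect lexicon is not reproduced here):

```python
# Rule 1 sketch: an aspect is active if any sentence token matches its
# keyword set. The keyword sets are hypothetical examples only.

KEYWORDS = {
    "view": {"pemandangan"},                 # scenery view
    "price": {"harga", "tiket"},             # price
    "environment": {"cuaca", "lingkungan"},  # local environment
}

def detect_aspects(tokens):
    """A(S): aspects whose keyword set intersects the token sequence."""
    return {aspect for aspect, kws in KEYWORDS.items()
            if any(t in kws for t in tokens)}

detect_aspects("pemandangan indah dan cuaca dingin".split())
# → {'view', 'environment'}
```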
• Rule 2 for sentiment detection
After aspect detection, sentiment-bearing words are identified using a contextual polarity function. Given a sentence $S=\left(w_1, w_2, \ldots, w_n\right)$, each token $w_i$ is assigned a base polarity:

$$B P\left(w_i\right)= \begin{cases}+1, & w_i \in L^{+} \\ -1, & w_i \in L^{-} \\ 0, & \text{otherwise}\end{cases}$$

where $L^{+}$ and $L^{-}$ denote the positive and negative sentiment lexicons, respectively. To capture contextual modification, the polarity is adjusted using negation and intensifier cues:

$$\operatorname{mod}\left(w_i\right)= \begin{cases}-1, & \text{a negation cue precedes } w_i \\ \lambda, & \text{an intensifier precedes } w_i\ (\lambda>1) \\ 1, & \text{otherwise}\end{cases}$$

Thus, the contextual polarity of token $w_i$ is defined as:

$$C P\left(w_i\right)=\operatorname{mod}\left(w_i\right) \cdot B P\left(w_i\right)$$

For each detected aspect $a_j$, sentiment cues are associated using a nearest-window heuristic. Let $P\left(a_j, S\right)$ denote the set of positions of aspect keywords in sentence $S$, and let $W$ be the window size. A sentiment token $w_i$ is associated with aspect $a_j$ if:

$$\min _{p \in P\left(a_j, S\right)}|i-p| \leq W \quad \text{and} \quad C P\left(w_i\right) \neq 0$$

The sentiment cue set for aspect $a_j$ is therefore defined as:

$$\operatorname{Cue}\left(a_j, S\right)=\left\{w_i \,\middle|\, \min _{p \in P\left(a_j, S\right)}|i-p| \leq W,\ C P\left(w_i\right) \neq 0\right\}$$

The corresponding sentiment score collection is:

$$\operatorname{Score}\left(a_j, S\right)=\left\{C P\left(w_i\right) \mid w_i \in \operatorname{Cue}\left(a_j, S\right)\right\}$$
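Rule 2 can be sketched as follows; the lexicons, negation cues, and window size W = 3 are illustrative assumptions rather than the study's actual resources (intensifier scaling is omitted for brevity):

```python
# Rule 2 sketch: lexicon polarity, sign-flip on a preceding negation cue,
# and nearest-window association of sentiment tokens with an aspect.

POS_LEX = {"indah", "bagus", "murah"}   # L+ (beautiful, good, cheap)
NEG_LEX = {"kotor", "mahal", "buruk"}   # L- (dirty, expensive, bad)
NEGATORS = {"tidak", "bukan"}           # negation cues

def contextual_polarity(tokens, i):
    """CP(w_i): base polarity, flipped if the previous token negates it."""
    base = 1 if tokens[i] in POS_LEX else -1 if tokens[i] in NEG_LEX else 0
    if base and i > 0 and tokens[i - 1] in NEGATORS:
        base = -base
    return base

def aspect_scores(tokens, aspect_positions, window=3):
    """Score(a_j, S): CP of sentiment tokens within `window` of the aspect."""
    return [contextual_polarity(tokens, i) for i in range(len(tokens))
            if contextual_polarity(tokens, i) != 0
            and any(abs(i - p) <= window for p in aspect_positions)]

tokens = "pemandangan tidak indah tapi harga murah".split()
aspect_scores(tokens, aspect_positions=[0])  # → [-1] ("tidak indah")
```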
• Rule 3 for dominant sentiment with contextual word polarity
For each sentence $S=\left(w_1, w_2, \ldots, w_n\right)$, each token $w_i$ is assigned a contextual polarity $C P\left(w_i\right)$ as defined in Rule 2. Let $C$ denote the set of contrastive conjunctions (e.g., but, however, although). The positions of contrastive words are defined as:

$$P_C(S)=\left\{i \mid w_i \in C\right\}$$

If a contrastive conjunction exists, only the clause after the first contrastive word is prioritized:

$$S^{\prime}=\left(w_{c+1}, w_{c+2}, \ldots, w_n\right)$$

where $c=\min P_C(S)$; if $P_C(S)=\emptyset$, then $S^{\prime}=S$.

The positive and negative sentiment scores are then computed as:

$$\operatorname{Pos}(S)=\sum_{w_i \in S^{\prime}} \mathbb{1}\left[C P\left(w_i\right)>0\right], \qquad \operatorname{Neg}(S)=\sum_{w_i \in S^{\prime}} \mathbb{1}\left[C P\left(w_i\right)<0\right]$$

The dominant sentence sentiment is defined as:

$$\operatorname{Dom}(S)= \begin{cases}\text{positive}, & \operatorname{Pos}(S)>\operatorname{Neg}(S) \\ \text{negative}, & \operatorname{Pos}(S)<\operatorname{Neg}(S)\end{cases}$$

In case of equal sentiment scores, the polarity of the last sentiment-bearing word in the sentence is used:

$$\operatorname{Dom}(S)=\operatorname{sign}\left(C P\left(w_{i^*}\right)\right), \quad i^*=\max \left\{i \mid C P\left(w_i\right) \neq 0\right\}$$
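A sketch of Rule 3, with a hypothetical contrastive-conjunction set and a `polarity` function standing in for the contextual polarity CP of Rule 2:

```python
# Rule 3 sketch: score only the clause after the first contrastive word;
# break ties with the last sentiment-bearing word.

CONTRASTIVE = {"tapi", "namun"}  # illustrative contrastive conjunctions

def dominant_sentiment(tokens, polarity):
    """Return 'positive', 'negative', or 'neutral' for the sentence."""
    for c, t in enumerate(tokens):
        if t in CONTRASTIVE:       # prioritize the clause after it
            tokens = tokens[c + 1:]
            break
    scores = [polarity(t) for t in tokens if polarity(t) != 0]
    pos, neg = sum(s > 0 for s in scores), sum(s < 0 for s in scores)
    if pos == neg:                 # tie: last sentiment word decides
        if not scores:
            return "neutral"
        return "positive" if scores[-1] > 0 else "negative"
    return "positive" if pos > neg else "negative"

lex = {"indah": 1, "mahal": -1}
dominant_sentiment("indah tapi mahal".split(), lambda t: lex.get(t, 0))
# → 'negative' (only "mahal", after "tapi", is scored)
```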
• Rule 4 for final aspect sentiment annotations
For each detected aspect $a_j$, the final sentiment label is assigned by aggregating all aspect-related sentiment cues obtained from Rule 2. Let $\operatorname{Score}\left(a_j, S\right)=\left\{s_1, s_2, \ldots, s_k\right\}$ denote the set of contextual sentiment scores associated with aspect $a_j$ in sentence $S$. The total aspect sentiment score is computed as:

$$\operatorname{AspScore}\left(a_j, S\right)=\sum_{r=1}^{k} s_r$$

The final aspect label is then assigned as:

$$\operatorname{Label}\left(a_j, S\right)= \begin{cases}\text{positive}, & \operatorname{AspScore}\left(a_j, S\right)>0 \\ \text{negative}, & \operatorname{AspScore}\left(a_j, S\right)<0\end{cases}$$

When aspect-specific polarity is not identified (i.e., $\operatorname{AspScore}\left(a_j, S\right)=0$), the dominant sentence-level sentiment from Rule 3 is used as the fallback polarity. Thus, the final output of the rule-based labeling process for sentence $S$ is:

$$\operatorname{Output}(S)=\left\{\left(a_j, \operatorname{Label}\left(a_j, S\right)\right) \mid a_j \in A(S)\right\}$$
where, $A(S)$ denotes the set of detected aspects in sentence S. For example, we use the dataset shown in Table 1 to generate ABSA labels according to predefined rules. The results are presented in Figure 2. Rule 1 detects the aspect [scenic view], and Rule 2 detects the sentiment terms [indah] and [dingin]. Rule 3 counts the number of positive terms appearing in the sentence. Because there are two positive terms, the sentiment label is determined as [positive]. Rule 4 combines the detected aspect and sentiment; therefore, the final labels are [PP] and [LP].
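Rule 4 reduces to the sign of the summed cue scores, with the Rule 3 sentence polarity as a fallback; a minimal sketch:

```python
# Rule 4 sketch: aggregate Score(a_j, S); if the sum is zero (no
# aspect-specific polarity), fall back to the sentence-level sentiment.

def aspect_label(scores, fallback):
    total = sum(scores)
    if total > 0:
        return "positive"
    if total < 0:
        return "negative"
    return fallback  # AspScore = 0: use the Rule 3 dominant sentiment

aspect_label([1, 1], fallback="positive")  # → 'positive'
aspect_label([], fallback="negative")      # → 'negative' (fallback used)
```

In the Table 1 example, both detected aspects accumulate positive cue scores, which is how the final [PP] and [LP] labels arise.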

BERT is a pre-trained NLP model developed by Google in 2018. It is built on the Transformer encoder architecture and is designed to capture bidirectional context. This allows the model to consider both the left and right contexts of a word simultaneously. This capability enables a deeper understanding of word meaning within sentences [14]. BERT consists of multiple stacked encoder layers, enabling it to learn complex language representations. Through this deep architecture, BERT can effectively model semantic relationships in text. As a result, it has been widely applied in various NLP tasks such as sentiment analysis, information retrieval, and machine translation. Numerous studies have shown that BERT improves both accuracy and efficiency across a range of language tasks, making it one of the most widely used approaches in modern NLP [15].
In the BERT model, input text is converted into a sequence of tokens. These tokens include the [CLS] token, which indicates the sentence start, and the [SEP] token, which separates sentences. The [PAD] token adds padding, ensuring each sentence matches the model’s fixed length [16]. The PAD token is essential for batch processing, as it ensures all sentences have a consistent length. This allows the model to work efficiently and accurately. Such a transformation improves contextual understanding during learning, enabling BERT to capture a sentence’s semantic details more effectively. The process for representing the dataset in Table 1 is shown in Figure 3. Input sentences are tokenized with the BERT tokenizer, embedded, and processed into BERT outputs. The architecture used in this research is detailed in Table 2.
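The role of the special tokens can be illustrated with a toy padding routine (this is not the real WordPiece tokenizer, just the shape of its output):

```python
# Toy illustration of BERT-style input formatting: [CLS] first, [SEP]
# last, [PAD] up to the fixed length, and a matching attention mask.

def to_fixed_length(tokens, max_len=10):
    seq = ["[CLS]"] + tokens + ["[SEP]"]
    mask = [1] * len(seq) + [0] * (max_len - len(seq))
    seq = seq + ["[PAD]"] * (max_len - len(seq))
    return seq, mask

seq, mask = to_fixed_length(["pemandangan", "indah"])
# seq  → ['[CLS]', 'pemandangan', 'indah', '[SEP]', '[PAD]', ..., '[PAD]']
# mask → [1, 1, 1, 1, 0, 0, 0, 0, 0, 0]
```

The attention mask tells the model which positions are real tokens (1) and which are padding (0), so a batch of mixed-length reviews can share one fixed tensor shape.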

| Layer (Type) | Output Shape (seq_len = 128) | Param ($\boldsymbol{\approx}$) |
|---|---|---|
| Input_ids | (None, 128) | 0 |
| Attention_mask | (None, 128) | 0 |
| Embeddings (token + position) | (B, 128, 768) | $\sim$V $\times$ 768 |
| Encoder $\times$12 (Transformer) | (B, 128, 768) | $\sim$124 M |
| Pooling (CLS) | (B, 768) | 0 |
| Dense 768 $\rightarrow$ 256 | (B, 256) | 196,864 |
| Batch Normalization | (B, 256) | 1,024 |
| Dropout | (B, 256) | 0 |
| Dense 256 $\rightarrow$ 128 | (B, 128) | 32,896 |
| Dropout | (B, 128) | 0 |
| Output Dense 128 $\rightarrow$ 6 | (B, 6) | 774 |
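The parameter counts of the classification head in Table 2 can be verified by hand: a dense layer has in × out weights plus out biases, and a 1-D batch normalization layer stores two learnable vectors plus two running-statistic buffers.

```python
# Checking the classification-head parameter counts reported in Table 2.

def dense_params(n_in, n_out):
    return n_in * n_out + n_out     # weight matrix + bias vector

assert dense_params(768, 256) == 196_864   # Dense 768 -> 256
assert dense_params(256, 128) == 32_896    # Dense 256 -> 128
assert dense_params(128, 6) == 774         # Output Dense 128 -> 6
assert 4 * 256 == 1_024                    # BatchNorm: gamma, beta, mean, var
```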
One challenge in conducting research with multi-aspect datasets is obtaining a balanced distribution. This is because applying balancing techniques, such as synthetic minority over-sampling technique (SMOTE), across multiple aspects can unintentionally increase instances of other labels, making it harder to achieve proper balance. To address this issue, this study applied a text augmentation method specifically to words that influence class formation, following the approach proposed by Santoso et al. [17], with the following steps:
1. Identify the distribution of classes in the dataset and define minority classes.
2. Each sentence in a minority class is then analyzed at the word level to determine its contribution to the sentiment label. This process uses two main approaches: analytical correlation and semantic similarity. Analytical correlation is calculated using Pointwise Mutual Information (PMI), which measures the statistical association between a word and a particular label in the corpus. Semantic similarity is computed using word-embedding-based cosine similarity (Word2Vec) to capture the proximity of meaning between a word and a sentiment label representation. These values are then normalized and used to group words into four categories: meaningful, necessity, reward, and irrelevant. The meaningful category includes words that strongly define the sentiment and should not be altered, while the other categories allow greater flexibility for augmentation.
3. The augmentation process uses these word categories to selectively modify sentences in three sequential steps: selective replacement, selective insertion, and selective deletion. In selective replacement, synonyms of words not categorized as meaningful replace the originals, preserving the term aspect. In selective insertion, synonyms of selected words are added to the sentence to strengthen context without changing meaning. In selective deletion, words identified as irrelevant are removed to reduce noise and better focus on sentiment. These three operations are applied to ensure that, while the wording may vary, the sentence’s sentiment remains consistent.
4. The amount of new data generated from each sentence depends on the augmentation strength ($\alpha$) parameter. This parameter controls how much each sentence is modified. The process is repeated for all minority-class data, producing synthetic datasets with high variation but the same sentiment label. Augmented data is combined with the original dataset to increase the number of minority-class samples.
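The PMI word-label association of step 2 can be sketched as follows, using a tiny hypothetical corpus of (tokens, label) pairs:

```python
# PMI(w, l) = log[ p(w, l) / (p(w) p(l)) ], estimated by counting over
# label-tagged sentences; a positive value means w and l co-occur more
# often than chance would predict.
import math

def pmi(word, label, sentences):
    n = len(sentences)
    n_w = sum(word in toks for toks, _ in sentences)
    n_l = sum(lab == label for _, lab in sentences)
    n_wl = sum(word in toks and lab == label for toks, lab in sentences)
    if not (n_w and n_l and n_wl):
        return 0.0
    return math.log((n_wl / n) / ((n_w / n) * (n_l / n)))

data = [(["indah"], "PP"), (["mahal"], "HN"),
        (["indah"], "PP"), (["kotor"], "LN")]
pmi("indah", "PP", data)  # → log 2 ≈ 0.693: "indah" is associated with PP
```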
Since ABSA involves assigning multiple aspect–sentiment labels to a single instance, its evaluation protocol is aligned with multi-label classification, where metrics such as Hamming Loss, Precision, Recall, and Micro-averaged F1 score are commonly used [18].
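Both metrics are straightforward to express directly over binary relevance vectors; a pure-Python sketch:

```python
# Hamming Loss: fraction of individual label slots predicted incorrectly.
# Micro-F1: pool TP/FP/FN over all labels and instances, then compute F1.

def hamming_loss(y_true, y_pred):
    n = sum(len(t) for t in y_true)
    wrong = sum(ti != pi for t, p in zip(y_true, y_pred)
                for ti, pi in zip(t, p))
    return wrong / n

def micro_f1(y_true, y_pred):
    pairs = [(ti, pi) for t, p in zip(y_true, y_pred)
             for ti, pi in zip(t, p)]
    tp = sum(ti == 1 and pi == 1 for ti, pi in pairs)
    fp = sum(ti == 0 and pi == 1 for ti, pi in pairs)
    fn = sum(ti == 1 and pi == 0 for ti, pi in pairs)
    return 2 * tp / (2 * tp + fp + fn)

y_true = [[1, 0, 0], [1, 1, 0]]
y_pred = [[1, 0, 1], [1, 0, 0]]
hamming_loss(y_true, y_pred)  # → 2/6 (one false positive, one false negative)
micro_f1(y_true, y_pred)      # → 2*2 / (2*2 + 1 + 1) = 2/3
```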
3. Result and Discussion
This study aims to compare rule-based weak supervision with human labeling and evaluate the performance of several deep learning methods. Furthermore, the best-performing model will be examined in more detail through hyperparameter tuning.
In this study, we used 100 reviews to check the validity of rule-based labeling compared to human labeling. The reviews were randomly selected from a total of 3,811 reviews. We used Cohen’s Kappa to assess agreement between the two labeling methods [19]. Cohen’s Kappa is a widely used statistical measure for evaluating the consistency between two annotators [20]. A higher Cohen’s Kappa score indicates a stronger agreement between rule-based and manual labeling. The results show that the distribution of positive labels generated by the rule-based approach is generally close to that of human labeling, as seen in Figure 4. This indicates that rule-based approaches have captured labeling patterns that are relatively similar to human annotations, although some labels still cannot be detected.
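For a single binary label, Cohen's Kappa corrects raw agreement for chance, kappa = (p_o - p_e) / (1 - p_e); a minimal sketch:

```python
# Cohen's Kappa for one binary label: observed agreement corrected by
# the agreement expected from each annotator's label frequencies.

def cohens_kappa(a, b):
    n = len(a)
    p_o = sum(x == y for x, y in zip(a, b)) / n       # observed agreement
    p_e = (sum(a) / n) * (sum(b) / n) \
        + (1 - sum(a) / n) * (1 - sum(b) / n)         # chance agreement
    return 1.0 if p_e == 1 else (p_o - p_e) / (1 - p_e)

cohens_kappa([1, 0, 1, 0], [1, 0, 1, 0])   # → 1.0 (perfect agreement)
cohens_kappa([0] * 99 + [1], [0] * 100)    # → 0.0
```

The second call shows why a rare label can score kappa = 0 despite 99% raw agreement: one annotator never predicts the label, so all the agreement is attributable to chance.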

According to the quantitative results, the rule-based method achieved an overall Cohen’s Kappa of 0.90, indicating almost perfect agreement with manual labeling. In addition, an Exact Match Ratio of 0.89 shows that 89% of the data shares the same label combination between the rule-based and human labeling results. Table 3 shows that most labels achieve good agreement rates. In the review sample, the HP and HN labels achieved a Cohen’s Kappa score of 1.00. It indicates that rule-based and human labeling yielded identical labels. The PP label also showed strong performance, with a Cohen’s Kappa of 0.97, while the LP label achieved 0.81.
| Label | Cohen's Kappa | Agreement Count | Total |
|---|---|---|---|
| LP | 0.81 | 91 | 100 |
| LN | 0.39 | 97 | 100 |
| HP | 1.00 | 100 | 100 |
| HN | 1.00 | 100 | 100 |
| PP | 0.97 | 99 | 100 |
| PN | 0.00 | 99 | 100 |
However, the rule-based approach does not perform well across all labels. The LN label produced a Cohen's Kappa of 0.39, even though 97 of 100 data points agreed, showing that the rule-based approach is weak at recognizing LN labels. The PN label is similar: it earned a Cohen's Kappa of 0.00 despite agreement on 99 of 100 data points. This value indicates that the rule-based approach failed to detect the PN labels identified by human annotators. In other words, the rule-based method tends to agree with human labeling only when PN labels are absent, rather than consistently identifying their presence.
Overall, these results show that the rule-based approach has good validity and is feasible to use as a weak supervision or pseudo-label generator. A rule-based approach has proven particularly effective for labels with more explicit linguistic patterns, such as HP, HN, PP, and LP. However, this approach couldn’t detect labels that appear infrequently, are ambiguous, or are context-dependent, such as LN and PN. Therefore, these findings strengthen the rationale for employing deep learning models in later stages. While rule-based approaches can produce high-quality pseudo-labels, deep learning models can improve generalization, capture complex semantic relationships, and overcome the limitations of rule-based methods, particularly in handling minority or implicit labels.
In text classification research, several deep learning methods are commonly used, including sequential and bidirectional models. In this study, the most frequently used methods are compared: convolutional neural networks (CNN) [21], recurrent neural networks (RNN) [22], long short-term memory (LSTM) [23], bidirectional long short-term memory (Bi-LSTM) [21], and bidirectional gated recurrent units (Bi-GRU) [24]. Their performance is shown in Table 4. The CNN and BERT models achieved the highest precision, both reaching 0.91, meaning these models made few false-positive predictions. In contrast, the RNN had the lowest precision at 0.39, due to the RNN's limitations in capturing long-term word dependencies, which result in more false positives.
BERT also achieved the highest recall (0.95), showing it is effective at identifying most positive instances. On the other hand, Bi-LSTM had the lowest recall at 0.77. The lower recall of the Bi-LSTM model can be attributed to its more selective prediction behavior, which causes some relevant labels to be overlooked and leads to a higher number of false negatives. For the Micro-F1 score, BERT achieved the highest value at 0.93. CNN and Bi-GRU also performed well, with scores of 0.87, while RNN had the lowest score at 0.53. This shows that BERT has the best balance of precision and recall. In terms of Hamming Loss, BERT also achieved the best score of 0.02. Both CNN and Bi-GRU had Hamming Loss values of 0.04, whereas RNN showed a high error rate with a Hamming Loss of 0.23, corresponding to more label prediction errors.
Based on this study, we conclude that RNN performed the lowest due to its limited ability to capture long-term word dependencies. LSTM performed better than RNN because it captures long-term dependencies. CNN performed well, particularly in precision, which is comparable to BERT. It is because CNN can extract local patterns correctly. Bidirectional models such as Bi-LSTM and Bi-GRU generally outperformed unidirectional models. However, BERT still performed best, leading most evaluation metrics, especially recall, Micro-F1, and Hamming Loss. This shows that BERT is the best deep learning model in this study.
In this study, BERT emerged as the highest-performing model compared to the sequential and bidirectional baselines. As mentioned, the best model is tuned to identify the parameter combination that maximizes BERT's performance. The parameters tuned in this study are the number of epochs, the batch size, and the learning rate. The two epoch settings are 5 and 10; we selected these following Devlin et al. [25], who recommended 3–4 epochs for fine-tuning, while prior studies by Dodge et al. [26] and Mosbach et al. [27] indicated that increasing the number of epochs may improve performance on small datasets. To control the granularity of weight updates, we used batch sizes of 16, 32, and 64, following standard fine-tuning practice [25]. The learning rate was varied among four values (2e-5, 3e-5, 4e-5, 5e-5); these values were selected because, according to Devlin et al. [25], the learning rate has a significant impact on model performance in classification tasks.
| Model | Precision | Recall | Micro-F1 | Hamming Loss |
|---|---|---|---|---|
CNNs | 0.91 | 0.83 | 0.87 | 0.04 |
RNNs | 0.39 | 0.84 | 0.53 | 0.23 |
LSTM | 0.77 | 0.78 | 0.78 | 0.06 |
Bi-LSTM | 0.87 | 0.77 | 0.81 | 0.53 |
Bi-GRU | 0.85 | 0.89 | 0.87 | 0.04 |
BERT | 0.91 | 0.95 | 0.93 | 0.02 |
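The tuning grid described above is the Cartesian product of the three settings, 2 × 3 × 4 = 24 configurations; sketched:

```python
# Enumerating the fine-tuning grid: epochs x batch sizes x learning rates.
from itertools import product

EPOCHS = [5, 10]
BATCH_SIZES = [16, 32, 64]
LEARNING_RATES = [2e-5, 3e-5, 4e-5, 5e-5]

grid = list(product(EPOCHS, BATCH_SIZES, LEARNING_RATES))
len(grid)  # → 24 configurations, one per row of Tables 5 and 6
```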
After evaluating combinations of epoch, batch size, and learning rate, we observed differences in BERT’s performance, as shown in Table 5 and Table 6. In parameter epoch 5, the best performance was achieved with a batch size of 32 and a learning rate of 2e-5. This configuration achieved a precision of 0.9355, a recall of 0.9388, Micro-F1 of 0.9372, and a Hamming Loss of 0.0154. However, at epoch 10, the best performance was achieved with a batch size of 64 and a learning rate of 2e-5, producing a precision of 0.9218, a recall of 0.9748, a Micro-F1 of 0.9476, and a Hamming Loss of 0.0132. Increasing the number of epochs from 5 to 10 allows BERT to better understand contextual relationships in the data. As a result, the model becomes better at capturing relevant patterns and identifying true positive labels, and improves overall performance.
Based on the results in Table 5 and Table 6, increasing the number of epochs from 5 to 10 can improve Micro-F1, Hamming Loss, and recall. In epoch 5, the best Micro-F1 is 0.9372, which increases to 0.9476 when using epoch 10. The best Hamming Loss in epoch 5 is 0.0154 and drops to 0.0132. These improvements indicate that increasing the epoch length allows a model to learn the patterns of relationships between labels and text contexts more effectively.
In addition, epoch also affected recall. In epoch 5, the best recall reached 0.9388, and in epoch 10, it reached 0.9748. This means that models could detect labels more accurately and reduce false negatives. In other words, more epochs allow the model to better adjust its weights, leading to a more precise semantic representation. However, increasing the number of epochs does not always guarantee improved performance of BERT. In some configurations, precision decreases when we increase the length of the epoch. It means that the highest epochs can lead to higher recall, but precision may drop. Therefore, the number of epochs must be carefully chosen to help the model capture more labels while maintaining prediction accuracy.
| Epoch | Batch Size | Learning Rate | Precision | Recall | Micro-F1 | Hamming Loss |
|---|---|---|---|---|---|---|
| 5 | 16 | 2e-5 | 0.9234 | 0.9101 | 0.9167 | 0.0203 |
| 5 | 16 | 3e-5 | 0.8963 | 0.9640 | 0.9289 | 0.0181 |
| 5 | 16 | 4e-5 | 0.8654 | 0.9712 | 0.9153 | 0.0220 |
| 5 | 16 | 5e-5 | 0.9190 | 0.9388 | 0.9288 | 0.0176 |
| 5 | 32 | 2e-5 | 0.9355 | 0.9388 | 0.9372 | 0.0154 |
| 5 | 32 | 3e-5 | 0.9144 | 0.9604 | 0.9368 | 0.0159 |
| 5 | 32 | 4e-5 | 0.9196 | 0.9460 | 0.9326 | 0.0168 |
| 5 | 32 | 5e-5 | 0.9214 | 0.9281 | 0.9247 | 0.0185 |
| 5 | 64 | 2e-5 | 0.9326 | 0.8957 | 0.9138 | 0.0207 |
| 5 | 64 | 3e-5 | 0.9164 | 0.9460 | 0.9310 | 0.0172 |
| 5 | 64 | 4e-5 | 0.8851 | 0.9424 | 0.9129 | 0.0220 |
| 5 | 64 | 5e-5 | 0.9228 | 0.9460 | 0.9343 | 0.0163 |
| Epoch | Batch Size | Learning Rate | Precision | Recall | Micro-F1 | Hamming Loss |
|---|---|---|---|---|---|---|
| 10 | 16 | 2e-5 | 0.9138 | 0.9532 | 0.9331 | 0.0168 |
| 10 | 16 | 3e-5 | 0.9146 | 0.9245 | 0.9195 | 0.0198 |
| 10 | 16 | 4e-5 | 0.9027 | 0.9676 | 0.9340 | 0.0168 |
| 10 | 16 | 5e-5 | 0.8935 | 0.9353 | 0.9139 | 0.0216 |
| 10 | 32 | 2e-5 | 0.9135 | 0.9496 | 0.9312 | 0.0172 |
| 10 | 32 | 3e-5 | 0.9281 | 0.9281 | 0.9281 | 0.0176 |
| 10 | 32 | 4e-5 | 0.9296 | 0.9496 | 0.9395 | 0.0150 |
| 10 | 32 | 5e-5 | 0.9345 | 0.9245 | 0.9295 | 0.0172 |
| 10 | 64 | 2e-5 | 0.9218 | 0.9748 | 0.9476 | 0.0132 |
| 10 | 64 | 3e-5 | 0.9412 | 0.9209 | 0.9309 | 0.0168 |
| 10 | 64 | 4e-5 | 0.9091 | 0.9353 | 0.9220 | 0.0194 |
| 10 | 64 | 5e-5 | 0.9097 | 0.9424 | 0.9258 | 0.0185 |
Based on the two tables, the learning rate affects the quality of BERT’s learning. In general, smaller learning rates, such as 2e-5, can yield better results than larger ones, such as 3e-5, 4e-5, and 5e-5. In epoch 5, the best configuration was obtained at a learning rate of 2e-5 and a batch size of 32, which gave the best result in a Micro-F1 of 0.9372 and a Hamming Loss of 0.0154. In epoch 10, the best performance was achieved with a learning rate of 2e-5 and a batch size of 64, resulting in a Micro-F1 of 0.9476 and a Hamming Loss of 0.0132. This indicates that a smaller learning rate allows the BERT fine-tuning process to be more stable. However, if the learning rate is too large, the weight updates are too fast, which risks the model missing important representations learned during pre-training. As a result, model performance decreases, and Hamming Loss tends to increase.
We also observed that learning rates of 4e-5 or 5e-5 can lead to high recall, but are not always accompanied by good precision and low Hamming Loss. For example, in epoch 5, with a batch size of 16 and a learning rate of 4e-5, recall reached 0.9712, but precision dropped to 0.8654, and Hamming Loss increased to 0.0220. This condition shows that a too-large learning rate can make the model assign labels faster, but leads to more false positives.
Besides the epoch and learning rate, batch size also affects model performance. The larger batch sizes can lead to lower Hamming Loss and higher Micro-F1 scores. Increasing the batch size from 16 to 32 could improve BERT’s performance. For example, in epoch 5, when we used a batch size of 16, the Hamming Loss was 0.0203 and dropped to 0.0154 when we increased the batch size to 32. While the Micro-F1 also increased from 0.9167 to 0.9372. In epoch 10, the best performance was also achieved with a maximum batch size of 64, with a Micro-F1 of 0.9476 and a Hamming Loss of 0.0132. The larger batch sizes make the learning process more stable, as gradient estimates are derived from a larger set of samples at each iteration, thereby reducing variance. Larger batch sizes can improve training stability, but not always improve the precision. In our experiments, increasing batch size improved recall but sometimes reduced precision. This suggests that batch size influences the balance between detecting labels and maintaining prediction accuracy.
The results of this study show that the best performance of rule-based and BERT (RB-BERT) is not determined by a single parameter, but by the optimal combination of hyperparameters. Detailed results of BERT's best performance in Hamming Loss are presented in Figure 5 and Figure 6. The combination of epoch 10, batch size 64, and learning rate 2e-5 yielded the best overall results: epoch 10 provides sufficient training time, a learning rate of 2e-5 maintains fine-tuning stability, and a batch size of 64 helps produce more stable gradient updates. In other words, the optimal performance of RB-BERT is achieved when the model is given sufficient training time, less aggressive weight updates, and a large enough batch size to stabilize the learning process. In contrast, an improper parameter combination, such as too large a learning rate or too small a batch size, tends to result in less stable performance and a higher Hamming Loss. Therefore, hyperparameter tuning is a crucial step in optimizing the performance of the BERT model for multi-label classification tasks.


4. Conclusion
Based on the evaluation results, RB-BERT achieved the highest performance, with a precision of 0.9218, a recall of 0.9748, a micro F1 score of 0.9476, and a Hamming loss of 0.0132. These findings confirm that rule-based labeling can be effectively used for automatic annotation for ABSA. This approach can address individual interpretation biases in human labeling, particularly when identifying ABSA. Further studies are needed to integrate semi-supervised learning methods that combine a small set of high-quality human-labeled data with larger rule-based annotations. Additionally, involving multiple annotators and calculating inter-annotator agreement would help validate the quality of manual labeling. Another promising direction is the use of adaptive rule-based systems that incorporate linguistic variations or domain-specific terms, particularly for languages with rich morphology, such as Indonesian, to enhance annotation accuracy and robustness.
Author Contributions
Conceptualization, I. and F.H.R.; methodology, I. and F.H.R.; validation, I.; formal analysis, I.; investigation, I.; resources, B.D.S. and D.F.; data curation, D.F.; writing—original draft preparation, I.; writing—review and editing, S.H., F.D., E.M.S.R., and D.A.D.; project administration, D.F. All authors were actively involved in discussing the findings and refining the final manuscript.
Data Availability
The data used to support the findings of this study are available from the corresponding author upon request.
Conflicts of Interest
The authors declare no conflicts of interest.
