Hybrid Improved Stacking over Tabular Temporal Features with Blockchain-Certified Data: Millet Yield Prediction and Explainability
Abstract:
Accurate crop yield prediction is essential for food security planning in developing countries. However, real-world deployments remain challenging due to limited imagery availability, heterogeneous tabular data, and concerns regarding data reliability. This paper proposes a tabular-only temporal deep learning framework enhanced with a blockchain-based data provenance layer for millet yield prediction in Senegal. The proposed model embeds per-timestep agroecological features using a multilayer perceptron (MLP), captures temporal dependencies through a bidirectional Long Short-Term Memory (BiLSTM) network, and integrates a hybrid improved stacking strategy by incorporating predictions from classical machine learning models, including Random Forest, XGBoost, LightGBM, and CatBoost. Unlike conventional stacking approaches, these predictions are injected directly into the temporal representation at the final timestep, thereby improving generalization and calibration performance. To ensure data integrity and traceability, a blockchain-inspired certification mechanism is introduced. This mechanism relies on canonicalization, SHA-256 hashing, and HMAC-based signatures of zone-year records. Experimental results demonstrate that the proposed approach achieves strong predictive performance (MAE $\approx$ 0.074, RMSE $\approx$ 0.101, R$^2$ $\approx$ 0.946), outperforming baseline models. A comprehensive evaluation framework is employed, including cross-validation, statistical significance testing, and explainability analysis using SHAP, LIME, and gradient-based saliency methods. Results indicate that while performance improvements are significant under static evaluation settings, they are less consistent under temporal cross-validation, highlighting the importance of robust evaluation protocols. 
Overall, the proposed framework provides a practical, auditable, and high-performing solution for yield prediction in data-scarce environments, combining predictive accuracy with data trustworthiness.
1. Introduction
Agriculture in Senegal exhibits strong interannual variability driven by climate uncertainty, resource constraints, and heterogeneous management practices, making reliable and timely yield prediction a critical requirement for food security and strategic planning [1]. In practice, yield forecasting in developing countries relies predominantly on tabular agroecological variables and historical yield records. However, these datasets are often affected by inconsistent collection protocols, limited validation, missing values, and vulnerabilities during transmission or aggregation, which significantly undermine their reliability and downstream usability for artificial intelligence (AI) models [2]. Achieving consensus and trust among diverse agricultural stakeholders further exacerbates this challenge, highlighting the need for auditable and tamper-resistant data pipelines.
Recent advances in AI have promoted hybrid temporal models that combine deep sequential architectures (e.g., Long Short-Term Memory (LSTM), bidirectional Long Short-Term Memory (BiLSTM), and Transformers) with ensemble or stacking strategies to improve predictive performance on complex agroclimatic data. While these approaches often report accuracy gains over single models, the state of the art reveals several persistent limitations. First, most stacking-based studies treat ensemble predictions as a post-hoc aggregation layer, rather than integrating them directly into the temporal representation learning process. Second, many works rely on limited evaluation protocols, often lacking rigorous cross-validation, statistical significance testing (e.g., Student’s $t$-test), or robustness analysis across unseen seasonal conditions. Third, despite increasing demands for trustworthy AI, explainability remains insufficiently addressed: few hybrid or stacked models provide comprehensive interpretability analyses using established XAI techniques such as SHAP, LIME, or saliency methods, and even fewer assess the stability and consistency of feature attributions over time.
In parallel, the integrity and provenance of agricultural data used to train these models remain largely overlooked. AI systems trained on uncertified or manipulable inputs risk producing unreliable predictions, regardless of architectural sophistication. Blockchain technology offers a complementary solution by enabling immutable, verifiable, and auditable data certification, yet its integration with hybrid temporal learning pipelines for yield prediction remains underexplored.
To address these gaps, we introduce a tabular-only temporal deep learning framework with Hybrid Improved Stacking and a blockchain-certified data provenance layer. Per-timestep tabular features are first embedded using a multilayer perceptron (MLP), aggregated through a BiLSTM [3], and refined via a Transformer encoder to capture both local temporal dynamics and long-range dependencies. Unlike conventional stacking approaches, predictions from classical machine learning models—Random Forest, XGBoost, LightGBM, and CatBoost—are injected directly into the hybrid architecture at the harvest-proximate timestep ($t=T$), enriching the deep temporal representation with complementary inductive biases and improving calibration and generalization. To ensure data trustworthiness, records are canonicalized, hashed using SHA-256, and HMAC-signed on a per-zone, per-year basis, providing verifiable integrity without exposing raw measurements.
Our contributions are threefold:
A practical tabular-only hybrid MLP-BiLSTM-Transformer model that incorporates classical model predictions as stacked inputs, yielding significant improvements in R$^2$ and error metrics.
An auditable blockchain-based provenance layer that certifies agricultural datasets through canonicalization, cryptographic digests, and signatures.
A fully reproducible evaluation framework combining cross-validation, statistical significance testing, ablation studies, and comprehensive explainability analyses using SHAP, LIME, and saliency methods.
Empirical results demonstrate that the proposed stacked hybrid model achieves MAE $\approx$ 0.074, RMSE $\approx$ 0.101, and R$^2 \approx$ 0.946, consistently outperforming the base tabular temporal model and individual learners. These results highlight the benefits of integrating stacking, explainability, and data certification within a unified and trustworthy AI pipeline for precision agriculture. The remainder of this paper is structured as follows. Section 2 reviews blockchain fundamentals, agricultural applications, and hybrid temporal architectures. Section 3 describes the dataset and the blockchain-based data certification workflow. Section 4 presents the temporal deep learning backbone. Section 5 introduces the Hybrid Improved Stacking strategy. Section 6 details the end-to-end implementation. Section 7 reports experimental results, including model comparison, statistical validation, ablation studies, pruning analysis, and interpretability. Section 8 discusses the implications of our findings. Section 9 concludes and outlines future work.
2. Fundamentals of Blockchain Technology
Blockchain is a distributed ledger technology in which transactions are stored in cryptographically linked blocks, ensuring immutability and transparency through hash chaining. Consensus mechanisms such as Proof of Work (PoW) [4], Proof of Stake (PoS) [5], and Practical Byzantine Fault Tolerance (PBFT) [6] regulate transaction validation, reflecting trade-offs between security, scalability, and energy efficiency. In resource-constrained environments such as agriculture and Internet of Things (IoT) systems, PoS and PBFT are generally preferred due to their lower computational demands. Blockchain systems also rely on cryptographically secure Random Number Generators (RNGs) to guarantee fairness and unpredictability in decentralized processes [7], [8]. Since deterministic execution limits native randomness, mechanisms such as commit–reveal schemes and Verifiable Random Functions (VRFs) ensure tamper-resistant and publicly verifiable randomization. In this work, blockchain primarily serves as a data integrity and traceability layer supporting trustworthy AI-driven yield prediction.
Blockchain enhances transparency, traceability, and security in agricultural supply chains by providing an immutable ledger supported by consensus mechanisms and smart contracts [9]. It improves food safety, reduces fraud, and strengthens consumer trust through secure transaction records [10]. Systematic reviews further confirm gains in decentralization, cybersecurity, and resilience within agri-food systems [11], [12]. However, scalability constraints, regulatory uncertainty, interoperability challenges, and deployment costs continue to limit large-scale adoption. Smart contracts enable automated and trustless execution of agricultural agreements, including insurance payouts and subsidy distribution [13]. Despite their efficiency, vulnerabilities such as coding errors and reentrancy attacks require lifecycle-based security frameworks [14]. In IoT-driven agricultural environments, AI–blockchain architectures have demonstrated improvements in threat detection, data integrity, and privacy protection [15], [16], [17]. Nevertheless, implementation complexity and energy efficiency remain open challenges. Although blockchain clearly strengthens traceability and system security, its role as a certification layer for datasets used in predictive modeling remains comparatively underexplored.
Hybrid temporal models integrate sequential deep learning architectures—such as BiLSTM networks and Transformer-based attention mechanisms—with MLPs or ensemble stacking strategies to capture both short-term dynamics and long-range dependencies in tabular time-series data. BiLSTM layers model bidirectional temporal relationships, while Transformer blocks enhance global contextual representation through self-attention. MLP components further refine nonlinear feature interactions. This hybridization improves robustness, adaptability, and predictive accuracy in heterogeneous agroclimatic datasets characterized by seasonality and nonlinearity.
Hybrid LSTM-Transformer architectures exploit the complementary strengths of recurrent memory and attention mechanisms. LSTM layers preserve sequential dependencies, whereas Transformers emphasize relevant temporal features across extended sequences. This combination mitigates the limitations of standalone models, where LSTMs may struggle with very long dependencies and Transformers may overlook fine-grained local dynamics.
Recent work has extended these architectures with online learning and knowledge distillation to improve efficiency in real-time contexts. The advanced hybrid LSTM-Transformer model integrates sequential memory with attention-based contextualization, incorporating online adaptation and model compression [18]. Evaluated on engineering systems such as subsurface drilling and stormwater infrastructure, the model outperformed classical statistical baselines (ARIMA, Holt-Winters) and standalone deep learning approaches, achieving higher predictive accuracy with reduced computational cost. These findings demonstrate the scalability of hybrid architectures in dynamic environments requiring continuous learning.
Ensemble stacking strategies further enhance predictive performance by combining multiple base learners (e.g., linear models, tree-based methods, kernel approaches) through a meta-learner. Kale et al. [19] integrated linear regression, Least Absolute Shrinkage and Selection Operator (LASSO), decision trees, Random Forest, XGBoost, and Support Vector Machine (SVM) into a stacked ensemble, reporting improved crop yield prediction (MSE reduced from approximately 8315 to 6408 and accuracy increased from 83% to 85%) compared to individual models. Similarly, Li et al. [20] developed a stacking ensemble model for maize yield forecasting using 34 years of meteorological and yield data across 596 Chinese counties. By combining LightGBM, Bagging, and other base learners through a meta-model, they achieved a mean absolute percentage error (MAPE) of 4.6%, demonstrating strong robustness and generalization. The study also quantified the influence of 27 meteorological variables during crop growth stages, contributing to agronomic insight. Despite these advances, most stacking-based studies emphasize predictive accuracy (e.g., R$^2$, RMSE, MAE, MAPE) without conducting comprehensive statistical validation across temporal splits or uncertainty analysis. Moreover, explicit explainability assessments using SHAP, LIME, or saliency-based attribution methods are rarely integrated alongside stacking performance evaluation. As a result, while ensemble stacking effectively reduces bias and variance relative to single learners, interpretability and temporal generalization remain insufficiently explored in agricultural yield modeling.
The preceding review highlights two major research gaps. First, although hybrid temporal architectures and stacking strategies have demonstrated improved predictive accuracy, existing studies often lack (i) deep integration between classical machine learning predictions and sequential neural representations, (ii) rigorous statistical validation across temporal splits, and (iii) comprehensive multi-level explainability analyses combining SHAP, LIME, and saliency-based attribution. Second, while blockchain has proven valuable for supply-chain transparency and IoT security, its potential as a certification layer for datasets used in predictive modeling remains largely unexplored.
Addressing these limitations requires a unified framework that simultaneously strengthens predictive performance, statistical robustness, interpretability, and data provenance. To this end, we propose a tabular-only hybrid temporal architecture in which predictions from classical regressors (Random Forest, XGBoost, LightGBM, and CatBoost) are not merely compared or stacked at the meta-output level, but explicitly injected as structured meta-features into a deep MLP-BiLSTM-Transformer pipeline at the harvest-proximate timestep. This design enables the model to leverage complementary inductive biases captured by tree-based ensembles while preserving rich temporal representations.
Beyond architectural integration, the proposed framework incorporates rigorous evaluation through cross-validation, ablation analysis, structured pruning assessment, and statistical significance testing using Student’s $t$-test to ensure that reported improvements are not attributable to random variation. Furthermore, interpretability is treated as a core component rather than an auxiliary analysis: SHAP values quantify global feature contributions, LIME provides local instance-level explanations, and saliency maps capture temporal importance patterns within the sequential architecture.
Finally, to guarantee the trustworthiness of the training data itself, an upstream blockchain-based certification layer ensures dataset integrity through canonicalization, cryptographic hashing (SHA-256), and HMAC signatures recorded per zone and year. In this manner, the proposed framework unifies performance enhancement, statistical validation, explainability, and auditable data provenance into a coherent pipeline for trustworthy AI-driven yield prediction.
3. Input Data
This section describes the data sources, certification mechanism, and modeling strategy underlying the proposed framework, entitled Hybrid Improved Stacking over Tabular Temporal Features with Blockchain-Certified Data. We motivate our methodological choices by practical constraints commonly encountered in developing-country contexts, including limited availability of dense satellite imagery, heterogeneous agroecological records, and the lack of a universally trusted yield ground truth. Our objective is therefore twofold: to design an effective tabular-only temporal prediction model, and to ensure traceability and auditability of the data used for training and evaluation.
Accurate yield prediction critically depends on understanding the nature, quality, and variability of the input data. In operational agricultural contexts, especially in Sub-Saharan Africa, available observations are often heterogeneous, partially missing, and reported by multiple institutions without a common validation protocol. This subsection introduces the tabular data used in this study and summarizes their key statistical properties, as described by Table 1 and Table 2. We rely exclusively on tabular agroecological variables and historical millet yield records aggregated at the zone--year level. The features include climatic indicators, vegetation-index derivatives, and synthetic yield signals designed to regularize learning under sparse observations. Descriptive statistics are computed to assess feature scale, dispersion, interannual variability, and missingness patterns.
Table 1. Descriptive statistics of the tabular agroecological features.

| Feature | Mean | Std | Min | Max |
|---|---|---|---|---|
| Year | 2010 | 6.0560 | 2000 | 2020 |
| ndvi_mean_x | 0.6288 | 0.0226 | 0.5636 | 0.6858 |
| ndvi_max_x | 0.7241 | 0.0388 | 0.6110 | 0.8300 |
| ndvi_min_x | 0.4980 | 0.0459 | 0.3660 | 0.6220 |
| ndvi_std_x | 0.0913 | 0.0238 | 0.0217 | 0.1518 |
| evi_mean | 0.4390 | 0.0168 | 0.3816 | 0.4896 |
| evi_max | 0.5134 | 0.0275 | 0.4430 | 0.5920 |
| evi_min | 0.3433 | 0.0343 | 0.2350 | 0.4420 |
| evi_std | 0.0685 | 0.0171 | 0.0271 | 0.1237 |
| savi_mean | 0.3771 | 0.0179 | 0.3380 | 0.4310 |
| savi_max | 0.4421 | 0.0294 | 0.3780 | 0.5260 |
| savi_min | 0.2943 | 0.0367 | 0.1900 | 0.3840 |
| savi_std | 0.0596 | 0.0191 | 0.0139 | 0.1219 |
| ndwi_mean | 0.0497 | 0.0649 | -0.1118 | 0.2132 |
| ndwi_max | 0.2143 | 0.0723 | -0.0280 | 0.3000 |
Table 2. National millet yield statistics per year (2000–2020).

| Year | Mean Yield (kg/ha) | Std (kg/ha) | Regions | Good Years | Droughts |
|---|---|---|---|---|---|
| 2000 | 1049.93 | 271.88 | 14 | 0 | 0 |
| 2001 | 1200.34 | 352.82 | 14 | 0 | 0 |
| 2002 | 807.63 | 228.04 | 14 | 0 | 14 |
| 2003 | 1332.24 | 413.14 | 14 | 14 | 0 |
| 2004 | 1154.70 | 315.35 | 14 | 0 | 0 |
| 2005 | 1132.59 | 399.46 | 14 | 0 | 0 |
| 2006 | 1273.08 | 382.36 | 14 | 0 | 0 |
| 2007 | 735.48 | 212.42 | 14 | 0 | 14 |
| 2008 | 1316.14 | 410.97 | 14 | 14 | 0 |
| 2009 | 1209.97 | 414.77 | 14 | 0 | 0 |
| 2010 | 1528.24 | 481.37 | 14 | 14 | 0 |
| 2011 | 1001.94 | 282.94 | 14 | 0 | 14 |
| 2012 | 1319.00 | 406.44 | 14 | 0 | 0 |
| 2013 | 1291.22 | 397.80 | 14 | 0 | 0 |
| 2014 | 1073.24 | 298.51 | 14 | 0 | 14 |
| 2015 | 1400.97 | 340.76 | 14 | 0 | 0 |
| 2016 | 1602.83 | 510.58 | 14 | 14 | 0 |
| 2017 | 1310.00 | 368.96 | 14 | 0 | 0 |
| 2018 | 1323.75 | 444.59 | 14 | 0 | 0 |
| 2019 | 1720.83 | 489.85 | 14 | 14 | 0 |
| 2020 | 1337.19 | 345.52 | 14 | 0 | 0 |
National millet yields exhibit pronounced interannual variability, with clear yield depressions during drought-affected years (e.g., 2002, 2007, 2011, 2014) and peaks during favorable seasons (e.g., 2010, 2019).
Dispersion tends to increase during high-yield years, reflecting heterogeneous regional responses to agroecological conditions.
These observations motivate the use of temporal models capable of capturing both long-term trends and short-term variability from noisy, incomplete tabular data.
Beyond predictive performance, agricultural decision-support systems must address trust, transparency, and data provenance. Yield statistics are often contested due to inconsistent reporting protocols, delayed collection, or institutional disagreements. To address these challenges, we introduce a lightweight, blockchain-inspired certification layer that secures data integrity without requiring full on-chain storage of raw observations, as shown in Figure 1. The proposed architecture combines tabular data ingestion, cryptographic certification, and temporal learning within a unified pipeline. Rather than storing raw agroecological measurements on-chain, only canonicalized digests and metadata are recorded, preserving privacy while ensuring auditability.

This design deliberately avoids heavy blockchain overhead while providing verifiable provenance guarantees suited to data-scarce environments.
To operationalize provenance, each zone--year data batch is first canonicalized into a deterministic representation, ensuring that identical inputs always produce identical hashes (Table 3). A SHA-256 digest is then computed and recorded alongside minimal metadata.
Certification workflow:
Canonicalize tabular payloads (sorted fields, normalized types).
Compute SHA-256 digest of the canonicalized payload.
Attach metadata (zone, year, data type, timestamp).
Sign the digest using HMAC-SHA256 (prototype) or MSP credentials.
Record digest and metadata in an immutable ledger.
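The steps above can be sketched in Python with the standard `hashlib`, `hmac`, and `json` modules. The field values and the signing key below are illustrative placeholders; a production deployment would replace the shared secret with MSP credentials, as noted in the workflow.

```python
import hashlib
import hmac
import json

SECRET_KEY = b"replace-with-msp-credential"  # hypothetical signing key (prototype only)

def canonicalize(record: dict) -> bytes:
    """Deterministic serialization: sorted fields, compact separators."""
    return json.dumps(record, sort_keys=True, separators=(",", ":")).encode()

def certify(record: dict) -> dict:
    """Compute the SHA-256 digest of a canonicalized zone-year payload
    and sign it with HMAC-SHA256."""
    digest = hashlib.sha256(canonicalize(record)).hexdigest()
    signature = hmac.new(SECRET_KEY, digest.encode(), hashlib.sha256).hexdigest()
    return {"data_hash": digest, "signature": signature,
            "zone": record["zone"], "year": record["year"]}

# Illustrative zone-year record (values are placeholders, not study data)
record = {"zone": "ZoneA", "year": 2019, "ndvi_mean": 0.63, "yield_kg_ha": 1720.0}
entry = certify(record)

# Any post hoc modification produces a digest mismatch:
tampered = dict(record, yield_kg_ha=9999.0)
assert hashlib.sha256(canonicalize(tampered)).hexdigest() != entry["data_hash"]
```

Because canonicalization sorts fields and normalizes separators, identical inputs always hash to the same digest, which is what makes the ledger entry verifiable after the fact.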
Table 3. Metadata fields recorded for each certified zone-year batch.

| Field | Description |
|---|---|
| data_hash | SHA-256 digest of payload |
| timestamp | UTC time of certification |
| zone | Agricultural region |
| data_type | Yield and agroecological data |
| tx_id | Transaction identifier |
| block_no | Ledger block number |
Any post hoc modification of the data results in a digest mismatch, making tampering immediately detectable.
4. Temporal Deep Learning Model
Crop yield is inherently a temporal process driven by cumulative and lagged agroecological effects. Static regression models operating on single-year snapshots are therefore insufficient to capture seasonality, memory effects, and long-term trends. We address this limitation using a hierarchical temporal architecture designed specifically for tabular sequences.
Let a zone-level temporal sequence be defined as:

$$\mathbf{X} = (\mathbf{x}_1, \ldots, \mathbf{x}_T), \qquad \mathbf{x}_t \in \mathbb{R}^{F},$$

where $F$ denotes the number of tabular features.
Each timestep is embedded via a nonlinear projection:

$$\mathbf{e}_t = \mathrm{MLP}(\mathbf{x}_t).$$
Temporal dependencies are captured using a BiLSTM:

$$\mathbf{h}_t = \bigl[\overrightarrow{\mathrm{LSTM}}(\mathbf{e}_{1:t});\; \overleftarrow{\mathrm{LSTM}}(\mathbf{e}_{t:T})\bigr].$$
To model long-range temporal interactions, the hidden states $\mathbf{H} = (\mathbf{h}_1, \ldots, \mathbf{h}_T)$ are refined using a Transformer encoder:

$$\mathbf{S} = \mathrm{TransformerEncoder}(\mathbf{H} + \mathbf{P}),$$

where $\mathbf{P}$ denotes positional encodings.
Predictions are produced at the final timestep. This combination of a BiLSTM followed by a Transformer encoder enables the model to learn both local seasonal patterns and long-term structural dynamics in yield evolution.
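As an illustration of the positional encodings $\mathbf{P}$ added before the Transformer encoder, the standard sinusoidal construction from the Transformer literature [21] can be computed as below; the sequence length and embedding size are chosen for illustration only, and the paper does not state which encoding variant is used.

```python
import numpy as np

def positional_encoding(T: int, d: int) -> np.ndarray:
    """Sinusoidal positional encodings P of shape (T, d):
    sin on even dimensions, cos on odd dimensions."""
    pos = np.arange(T)[:, None]          # timestep index, shape (T, 1)
    i = np.arange(d)[None, :]            # embedding dimension index, shape (1, d)
    angle = pos / np.power(10000.0, (2 * (i // 2)) / d)
    return np.where(i % 2 == 0, np.sin(angle), np.cos(angle))

# 4 timesteps (as in the sequences used here) and a small embedding size
P = positional_encoding(T=4, d=8)
# The BiLSTM hidden states H (shape (T, d)) would then be refined as
# TransformerEncoder(H + P).
```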
The final temporal representation $\mathbf{s}_T$ feeds two output heads:

$$\hat{y} = f_{\mathrm{reg}}(\mathbf{s}_T), \qquad \hat{p} = \sigma\bigl(f_{\mathrm{cls}}(\mathbf{s}_T)\bigr),$$

where $\hat{y}$ is the predicted yield and $\hat{p}$ the predicted probability of a good production year.
The training objective combines regression and classification losses:

$$\mathcal{L} = \alpha\,\mathrm{MSE}(\hat{y}, y) + \beta\,\mathrm{MAE}(\hat{y}, y) + \gamma\,\mathrm{BCE}(\hat{p}, \ell),$$

where $y$ denotes the ground-truth yield, BCE denotes the binary cross-entropy loss, $\ell$ represents the ground-truth label for good-year classification, and $\alpha$, $\beta$, and $\gamma$ control the relative importance of each term.
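Assuming the objective is a weighted sum of MSE, MAE, and BCE terms (the paper names the weights $\alpha$, $\beta$, $\gamma$ but does not fix their values; the defaults below are placeholders), a minimal NumPy sketch is:

```python
import numpy as np

def combined_loss(y_pred, y_true, p_pred, label, alpha=1.0, beta=0.5, gamma=0.5):
    """Weighted sum of regression (MSE, MAE) and classification (BCE) losses.
    alpha, beta, gamma are assumed example weights, not values from the paper."""
    mse = np.mean((y_pred - y_true) ** 2)
    mae = np.mean(np.abs(y_pred - y_true))
    p = np.clip(p_pred, 1e-7, 1 - 1e-7)      # numerical stability for the log terms
    bce = -np.mean(label * np.log(p) + (1 - label) * np.log(1 - p))
    return alpha * mse + beta * mae + gamma * bce

y = np.array([1.0, 2.0])
labels = np.array([1.0, 0.0])
loss_good = combined_loss(y, y, np.array([0.99, 0.01]), labels)  # near-perfect model
loss_bad = combined_loss(y + 1.0, y, np.array([0.99, 0.01]), labels)
```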
5. Hybrid Improved Stacking Strategy
Although deep temporal models are expressive, classical tabular regressors often provide strong inductive biases when trained on engineered features. Rather than treating these approaches as competing paradigms, we combine them through a hybrid stacking strategy.
In addition to the deep temporal backbone, classical machine learning regressors (Random Forest, XGBoost, LightGBM, and CatBoost) are trained on final-timestep tabular features.
Their predictions form a complementary vector:

$$\mathbf{m} = \bigl[\hat{y}_{\mathrm{RF}},\; \hat{y}_{\mathrm{XGB}},\; \hat{y}_{\mathrm{LGBM}},\; \hat{y}_{\mathrm{CB}}\bigr].$$
The stacked representation is defined as:

$$\mathbf{z} = \bigl[\mathbf{s}_T;\; \mathbf{m}\bigr],$$

which is fed to the final prediction layer. This hybrid stacking strategy injects expert-driven meta-features that improve robustness and generalization.
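The concatenation amounts to a few lines of NumPy. The representation size and the four regressor outputs below are placeholders for illustration, not values from the experiments:

```python
import numpy as np

# Final-timestep temporal representation s_T from the deep backbone
# (random values standing in for real activations)
s_T = np.random.default_rng(0).normal(size=64)

# Final-timestep predictions of the four classical regressors
# (hypothetical normalized yields: RF, XGBoost, LightGBM, CatBoost)
m = np.array([0.71, 0.69, 0.73, 0.70])

# Stacked representation z = [s_T ; m], fed to the final prediction layer
z = np.concatenate([s_T, m])
```

Injecting the meta-features at the last timestep, rather than in a separate meta-learner, lets the final layer weigh the tree-ensemble signals jointly with the learned temporal representation.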
6. End-to-End Workflow and Implementation
The complete workflow of our framework, shown in Figure 2, begins with the ingestion of tabular data, grouped into sequences in which each row represents a zone-year-attribute entry; the last year of each sequence is reserved for classical machine learning predictions. Each batch of data is then canonicalized, hashed, and stored on the blockchain (or a prototype ledger) to ensure traceability and tamper-evidence. The sequences are passed through an MLP for feature embedding, followed by a BiLSTM to capture short-term temporal dependencies. The BiLSTM outputs are further refined using a Transformer [21] encoder to model long-range interactions, producing a comprehensive temporal representation. Predictions from classical machine learning models on the final year are concatenated with this representation to form a stacked vector, which is used to predict both crop yield and the probability of a good production year. Model performance is evaluated using MAE, RMSE, and R$^2$, while SHAP values, LIME explanations, and gradient-based saliency maps provide interpretability of feature contributions over time. Finally, outputs are compiled into auditable, zone-level decision-support reports combining predictions, interpretability insights, and blockchain certification information. This integrated approach ensures accurate temporal yield prediction while maintaining data integrity, traceability, and actionable agronomic insights.
The complete workflow consists of:
Data ingestion and temporal sequence construction
Blockchain-based certification of tabular records
Temporal model training and hybrid stacking
Performance evaluation using MAE, RMSE, and R$^2$
Explainability analysis using SHAP and gradient-based methods
Generation of auditable, zone-level decision-support reports

7. Results and Model Evaluation
This section reports the empirical performance of the proposed models. We first compare classical tabular baselines with the temporal deep model, then analyze architectural choices through ablation and tuning, and finally examine the behavior and interpretability of the hybrid improved stacking variant.
We begin with a global comparison of classical tabular machine learning models (Random Forest, XGBoost, LightGBM, CatBoost) in Figure 3. Among these classical models, Random Forest is the most effective. We then compare these models to two deep learning approaches in Figure 4: the proposed temporal deep model and the hybrid stacking variant. This comparison evaluates whether temporal modeling and stacking offer tangible improvements over strong non-temporal baselines.


Overall, classical models provide competitive baselines, confirming that carefully engineered tabular features already encode strong predictive signals. However, the temporal deep model consistently improves error metrics, highlighting the importance of explicitly modeling interannual dependencies. The hybrid stacking variant further refines performance, particularly in terms of calibration and explained variance.
To verify whether the observed performance improvement is statistically significant, we conducted a paired $t$-test (Table 4) between the baseline model and the proposed Hybrid Improved Stacking model across all test observations.
Table 4. Paired $t$-test between the baseline model and the Hybrid Improved Stacking model.

| Statistic | Value |
|---|---|
| Sample size ($n$) | 56,574 |
| Degrees of freedom ($df$) | 56,573 |
| $t$-statistic | -172.99 |
| $p$-value | $<$ 0.001 |
| Mean difference | -0.2595 |
| 95% CI lower bound | -0.2625 |
| 95% CI upper bound | -0.2566 |
| Significance level ($\alpha$) | 0.05 |
As shown in Table 4, the paired $t$-test reveals a highly statistically significant difference between the two models ($t$ = -172.99, $p$ < 0.001). The negative t-statistic indicates that the prediction errors of the Hybrid Improved Stacking model are significantly lower than those of the baseline model. The mean difference of -0.2595 demonstrates a substantial reduction in absolute prediction error.
The 95% confidence interval of the mean difference ([-0.2625, -0.2566]) does not cross zero, further confirming the robustness of the improvement. Given the large sample size ($n$ = 56,574), the result demonstrates not only statistical significance but also stability across observations. The extremely large absolute t-value reflects the consistency of the improvement across the vast majority of samples. These findings provide strong empirical evidence that the hybrid stacking strategy yields a consistent and reliable performance gain over conventional approaches.
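For reference, the paired $t$-statistic in Table 4 is computed from per-sample differences of absolute errors. The sketch below uses synthetic errors (not the study's data) to show the mechanics and the sign convention: a negative $t$ means the second model's errors are larger, i.e., the first model improves on it.

```python
import math
import random

def paired_t(errors_a, errors_b):
    """Paired t-test on per-sample errors of two models.
    Returns (t_statistic, degrees_of_freedom, mean_difference)."""
    d = [a - b for a, b in zip(errors_a, errors_b)]
    n = len(d)
    mean_d = sum(d) / n
    var_d = sum((x - mean_d) ** 2 for x in d) / (n - 1)   # sample variance
    se = math.sqrt(var_d / n)                              # standard error of the mean
    return mean_d / se, n - 1, mean_d

# Synthetic absolute errors: the stacked model (B) consistently lower than baseline (A)
rng = random.Random(42)
err_a = [abs(rng.gauss(0.33, 0.05)) for _ in range(1000)]  # baseline
err_b = [abs(rng.gauss(0.07, 0.03)) for _ in range(1000)]  # stacked model
t, df, md = paired_t(err_b, err_a)  # d = stacked - baseline, so t < 0 here
```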
However, when subjected to dynamic cross-validation, a more rigorous evaluation protocol that accounts for temporal dependencies and tests generalizability across different time periods, the difference between the two models becomes statistically non-significant. This discrepancy highlights a critical nuance: while the improvement is highly detectable in the full-sample test due to the extremely large sample size ($n$ = 56,574) and the resulting statistical power, it is not consistently replicable across temporal folds.
These contrasting results demonstrate that the performance gain from the proposed hybrid stacking strategy may be statistically significant in the aggregate but not practically robust under stricter temporal validation. The improvement appears to be concentrated in specific temporal segments rather than uniformly distributed across the entire dataset. The extremely large $t$-statistic in the full test is primarily driven by the sample size rather than the effect size (mean difference of 0.26), which is relatively modest in practical terms.
This finding underscores the importance of employing dynamic cross-validation in time-series forecasting tasks, as standard paired $t$-tests can overestimate the generalizability of model improvements when applied to large, temporally correlated datasets. The static test suggests a highly significant improvement, while the dynamic validation reveals that this improvement may not hold consistently across different temporal contexts.
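A minimal sketch of such a temporal protocol, assuming expanding-window folds over the 2000–2020 seasons (the paper does not specify the exact fold construction, so `min_train` and the split shape are illustrative):

```python
def temporal_folds(years, min_train=5):
    """Expanding-window splits: train on all seasons up to year y,
    test on the following season. Avoids leaking future data into training."""
    folds = []
    for i in range(min_train, len(years)):
        folds.append((years[:i], [years[i]]))
    return folds

years = list(range(2000, 2021))  # the 21 seasons covered by the dataset
folds = temporal_folds(years)
# First fold: train on 2000-2004, test on 2005; last fold tests on 2020.
```

Evaluating the paired error differences within each such fold, rather than pooling all 56,574 observations, is what exposes the temporal inconsistency discussed above.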
We therefore interpret the improvement as context-dependent rather than universally applicable—a nuance that would have been completely overlooked without dynamic cross-validation. This result does not invalidate the proposed model but rather provides a more honest and nuanced assessment of its strengths and limitations. The hybrid stacking strategy offers measurable improvements, but these gains may be most pronounced under specific temporal conditions rather than representing a universal advancement. To further investigate these discrepancies, we first perform ablation, pruning, and tuning on the simple hybrid model in order to optimize its performance. We then compare this simple hybrid model with the Hybrid Improved Stacking model through statistical analysis of their prediction differences. Finally, we examine model explainability to gain insights into the temporal dynamics driving their respective performances.
To isolate the contribution of each temporal component, we conduct an ablation study comparing three architectural variants: an LSTM-only model, a Transformer-only model, and the combined BiLSTM-Transformer backbone. Table 5 presents the performance metrics for each configuration, while Figure 5 visualizes the comparative results.
Table 5. Ablation study of temporal architecture variants.

| Model Variant | MAE | RMSE | R$^2$ |
|---|---|---|---|
| LSTM-only | 0.301 | 0.374 | 0.250 |
| LSTM+Transformer | 0.692 | 0.738 | -1.928 |
| Transformer-only | 0.785 | 0.866 | -3.025 |

The ablation results reveal several important findings about the temporal architecture design:
LSTM-only configuration achieves the best performance across all metrics, with MAE of 0.301, RMSE of 0.374, and a positive R$^2$ of 0.250. This indicates that the LSTM effectively captures the sequential dependencies in the yield time series, favoring short-term seasonal patterns and local temporal correlations.
Transformer-only configuration shows the poorest performance, with MAE of 0.785, RMSE of 0.866, and a strongly negative R$^2$ of -3.025. This suggests that the Transformer architecture, while powerful for capturing long-range dependencies in natural language processing, may be less suitable for agricultural yield time series where local temporal patterns dominate and the sequence length is relatively short (4 timesteps).
LSTM+Transformer combination yields intermediate performance (MAE = 0.692, RMSE = 0.738, R$^2$ = -1.928), but still substantially worse than the LSTM-only variant. The negative R$^2$ values for both the combined model and Transformer-only indicate that these architectures fail to capture the underlying data structure, performing worse than a simple mean-based predictor.
It is important to contextualize these results within the broader scope of our study. While the LSTM-only configuration outperforms the other temporal variants in this ablation study, its absolute performance (R$^2$ = 0.250) remains substantially below the predictive accuracy achieved by our final hybrid models. This discrepancy arises because the ablation study evaluates raw temporal architectures without the benefit of:
Hyperparameter optimization: The baseline models use default configurations rather than systematically tuned parameters adapted to the specific characteristics of agricultural time series.
Feature engineering: The ablation study operates on raw tabular features without the enhanced representations developed in our complete pipeline.
Regularization techniques: Techniques such as dropout, batch normalization, and weight decay that stabilize training and improve generalization are not fully optimized in this comparison.
These results should therefore be interpreted as a comparative analysis of temporal architectures rather than an indication of the final model's capabilities. The ablation study serves primarily to validate that recurrent dynamics are essential for capturing temporal dependencies in agricultural time series. However, the modest absolute performance of even the best variant (R$^2$ = 0.250) motivates our subsequent development of more sophisticated approaches, including hyperparameter tuning, model pruning, and, most importantly, the hybrid improved stacking strategy presented in the following sections.
The negative R$^2$ values observed for architectures incorporating Transformers suggest that these models, in their default configurations, are poorly suited to the short-sequence, locally-dominated patterns characteristic of agricultural yield data. This finding informs our architectural decisions: rather than simply combining LSTM and Transformer components, we focus on optimizing the hybrid model through careful capacity adjustment and regularization before augmenting it with the stacking strategy that ultimately yields our best-performing model (R$^2$ $\approx$ 0.946).
We analyze the sensitivity of the temporal model to key hyperparameters, including embedding dimension (d_model), LSTM hidden size (lstm_hidden), number of temporal layers (temporal_layers), and learning rate (lr). A random search strategy with three trials was employed to explore the configuration space, and the complete results are presented in Table 6.
| d_model | lstm_hidden | temporal_layers | lr | MAE | RMSE | R$^2$ |
|---|---|---|---|---|---|---|
| 256 | 128 | 2 | 0.0010 | 0.2745 | 0.3378 | 0.3870 |
| 64 | 64 | 1 | 0.0010 | 0.3454 | 0.4167 | 0.0674 |
| 64 | 64 | 2 | 0.0003 | 0.3023 | 0.3624 | 0.2946 |
The tuning results reveal several important insights about the model's hyperparameter sensitivity:
Optimal configuration: The best-performing configuration combines a larger embedding dimension (d_model = 256) with a moderate LSTM hidden size (lstm_hidden = 128), two temporal layers, and a learning rate of 0.001. This configuration achieves the lowest errors (MAE = 0.2745, RMSE = 0.3378) and the highest R$^2$ of 0.3870, indicating that it explains approximately 38.7% of the variance in yield predictions.
Impact of model capacity: The results show a clear progression where increased model capacity (larger d_model and lstm_hidden) leads to improved performance. The best configuration with d_model = 256 substantially outperforms the variants with d_model = 64, suggesting that the additional representational capacity is necessary to capture the complexity of the yield prediction task.
Learning rate sensitivity: The comparison between the two d_model = 64 configurations reveals the importance of learning rate selection. With the same architecture but different learning rates, the configuration with lr = 0.0003 achieves an MAE of 0.3023, while the one with lr = 0.001 achieves an MAE of 0.3454, a relative reduction of roughly 12%. This demonstrates that, even with a fixed architecture, appropriate learning rate selection is crucial.
Temporal depth: The best configuration employs two temporal layers, indicating that moderate depth in the temporal processing helps capture the sequential dependencies in the yield data. However, the relatively small improvement over shallower architectures suggests diminishing returns beyond this depth.
Performance landscape: The tuning results reveal a relatively smooth performance landscape with clear differentiation between configurations. The best configuration achieves an R$^2$ of 0.3870, while the worst (d_model = 64, lstm_hidden = 64, temporal_layers = 1, lr = 0.001) achieves only 0.0674, a more than fivefold difference in explained variance. This moderate sensitivity to individual parameters suggests that, while the optimal configuration yields meaningful improvements, the model can be deployed reasonably well across a range of settings. The best configuration (Table 7) represents a substantial improvement over the default configuration, demonstrating that hyperparameter optimization is worthwhile for this application.
| d_model | lstm_hidden | temporal_layers | lr | MAE | RMSE | R$^2$ |
|---|---|---|---|---|---|---|
| 256 | 128 | 2 | 0.001 | 0.2745 | 0.3378 | 0.3870 |
Interestingly, the results show that model capacity (d_model and lstm_hidden) has the most pronounced impact on performance, followed by learning rate, while the number of temporal layers has a more modest effect within the tested range. This hierarchy of sensitivity provides practical guidance for future deployments: practitioners should prioritize scaling model capacity and tuning learning rate before exploring deeper architectures.
The relatively small number of tuning trials ($n$ = 3) was sufficient to identify a strong configuration due to the clear performance differentiation, suggesting that random search is an efficient strategy for this model class. The optimal configuration (d_model = 256, lstm_hidden = 128, temporal_layers = 2, lr = 0.001) is adopted for all subsequent experiments and analyses.
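The random search procedure can be sketched as follows. The search space mirrors the tuned hyperparameters from Table 6, but the objective function below is an illustrative placeholder, not the actual training and validation loop:

```python
import random

# Search space mirroring the tuned hyperparameters; the candidate
# values shown here are illustrative.
SPACE = {
    "d_model": [64, 128, 256],
    "lstm_hidden": [64, 128],
    "temporal_layers": [1, 2],
    "lr": [1e-3, 3e-4],
}

def sample_config(rng):
    """Draw one configuration uniformly from each axis of the space."""
    return {k: rng.choice(v) for k, v in SPACE.items()}

def random_search(evaluate, n_trials=3, seed=0):
    """Draw n_trials configurations and keep the one with lowest MAE."""
    rng = random.Random(seed)
    best_cfg, best_mae = None, float("inf")
    for _ in range(n_trials):
        cfg = sample_config(rng)
        mae = evaluate(cfg)  # in practice: train + validate the model
        if mae < best_mae:
            best_cfg, best_mae = cfg, mae
    return best_cfg, best_mae

# Stand-in objective: pretend larger capacity and lr = 1e-3 are best,
# loosely mimicking the trend observed in Table 6.
def fake_eval(cfg):
    return 1.0 / cfg["d_model"] + (0.05 if cfg["lr"] != 1e-3 else 0.0)

best_cfg, best_mae = random_search(fake_eval, n_trials=10)
```

With a real `evaluate` that trains the temporal model, the same loop reproduces the three-trial search used in the paper.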
To examine the trade-off between model complexity and generalization, in Table 8 we apply structured pruning to the trained temporal model. The objective is to assess whether moderate sparsification acts as an implicit regularizer.
| Prune ratio | MAE | RMSE | R$^2$ |
|---|---|---|---|
| 0.1 | 0.3393 | 0.3910 | 0.179 |
| 0.2 | 0.2928 | 0.3491 | 0.345 |
| 0.4 | 0.4025 | 0.4593 | -0.133 |
Moderate pruning preserves, and occasionally improves, generalization performance, whereas aggressive pruning degrades accuracy. This indicates mild overparameterization and supports controlled sparsification.
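The study applies structured pruning to the trained temporal model; the sketch below instead uses unstructured magnitude pruning in NumPy, purely to convey the mechanism of zeroing a fraction of the smallest-magnitude weights:

```python
import numpy as np

def magnitude_prune(weights, ratio):
    """Zero out the fraction `ratio` of weights with the smallest
    absolute magnitude (a simplified stand-in for structured pruning
    of a trained network)."""
    w = np.asarray(weights, dtype=float).copy()
    k = int(round(ratio * w.size))
    if k == 0:
        return w
    # k-th smallest absolute value becomes the pruning threshold.
    threshold = np.partition(np.abs(w).ravel(), k - 1)[k - 1]
    w[np.abs(w) <= threshold] = 0.0
    return w

# Illustrative weight vector, not the model's parameters.
w = np.array([0.9, -0.05, 0.4, 0.01, -0.7, 0.1])
pruned = magnitude_prune(w, ratio=0.4)
sparsity = np.mean(pruned == 0.0)
```

At moderate ratios this removes mostly near-zero weights, which is why it can act as an implicit regularizer; at aggressive ratios it starts deleting informative weights, matching the degradation seen at a prune ratio of 0.4.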
We now focus on the hybrid improved stacking variant, which augments the deep temporal representation at the final timestep with predictions from classical regressors. Beyond aggregate metrics, we analyze prediction behavior and interpretability.
The parity plot [22], which compares observed and predicted yields, is used to assess the calibration, consistency, and generalization ability of the proposed model across the full range of yield values. The 1:1 reference line represents perfect agreement between predictions and observations. A close alignment of points along this diagonal indicates unbiased predictions and the absence of systematic over- or underestimation. In contrast, deviations from the diagonal reveal potential systematic biases. Unlike aggregate metrics such as MAE or RMSE, the parity plot provides insight into model performance at low, medium, and high yield levels, including extreme drought or high-production years.
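A parity plot of this kind can be produced with a few lines of matplotlib; the arrays below are illustrative, not the study's yields:

```python
import numpy as np
import matplotlib
matplotlib.use("Agg")  # headless backend for scripted runs
import matplotlib.pyplot as plt

def parity_plot(y_true, y_pred, path="parity.png"):
    """Scatter observed vs. predicted yields with a 1:1 reference line."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    lo = min(y_true.min(), y_pred.min())
    hi = max(y_true.max(), y_pred.max())
    fig, ax = plt.subplots(figsize=(4, 4))
    ax.scatter(y_true, y_pred, s=15, alpha=0.7)
    ax.plot([lo, hi], [lo, hi], "k--", label="1:1 line")
    ax.set_xlabel("Observed yield")
    ax.set_ylabel("Predicted yield")
    ax.legend()
    fig.savefig(path, dpi=150)
    plt.close(fig)
    return path

path = parity_plot([0.5, 0.7, 0.9], [0.52, 0.68, 0.91])
```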
For the Hybrid Improved Stacking model, Figure 6 shows that the predicted yields are tightly concentrated around the 1:1 diagonal, whereas for the baseline hybrid model, Figure 7 shows that most points lie below the diagonal. This indicates that the stacking strategy substantially improves model calibration, while the baseline hybrid model tends to systematically underestimate observed yields, particularly across moderate to high yield levels.


Residual histograms [23] are employed to analyze the distribution, symmetry, and concentration of prediction errors, providing insights into model calibration and robustness.
Figure 8 shows that the Hybrid Improved Stacking model exhibits a pronounced concentration of residuals around zero, markedly tighter than that of the simple hybrid model in Figure 9. This improved calibration and reduced prediction variance suggest that incorporating ensemble-based predictions as auxiliary inputs effectively enhances local yield estimation accuracy.


The QQ-plot [24] checks whether the model errors follow a normal distribution. If the points deviate from the diagonal reference line, the errors are larger for extreme years, such as drought or exceptionally productive years, and do not follow a normal distribution.
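The quantile pairs underlying a QQ-plot can be computed without any plotting library. The sketch below uses the common (i − 0.5)/n plotting-position convention and the standard-library `NormalDist` for the theoretical quantiles:

```python
import numpy as np
from statistics import NormalDist

def qq_points(residuals):
    """Return (theoretical, empirical) quantile pairs for a QQ-plot
    of residuals against the standard normal distribution."""
    r = np.sort(np.asarray(residuals, dtype=float))
    r = (r - r.mean()) / r.std(ddof=1)        # standardize residuals
    n = r.size
    probs = (np.arange(1, n + 1) - 0.5) / n   # plotting positions
    nd = NormalDist()
    theo = np.array([nd.inv_cdf(p) for p in probs])
    return theo, r

# Synthetic Gaussian residuals for illustration only.
rng = np.random.default_rng(0)
theo, emp = qq_points(rng.normal(size=200))
# For well-calibrated (Gaussian) residuals the points hug the
# diagonal, so the quantile-quantile correlation is close to 1.
corr = np.corrcoef(theo, emp)[0, 1]
```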
Figures 10 and 11 show that the residuals of both hybrid models approximately follow a Gaussian distribution, indicating well-calibrated models without systematic error patterns.


The Bland-Altman plot [25] is used to assess the agreement between observed and predicted yields by visualizing the differences (residuals) against the mean of the two measurements. It allows identification of systematic bias, heteroscedasticity, and the spread of prediction errors across the yield range. In this study, our improved hybrid model with stacking shows points closely clustered around the zero-difference line, indicating low bias and consistent predictions (Figure 12). In contrast, Figure 13 shows that the simple hybrid model without stacking has points more dispersed around the line, reflecting higher variability and less precise predictions. Overall, the plot confirms that stacking improves the accuracy and stability of yield predictions and highlights the value of error visualization in model evaluation and comparison.
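The quantities a Bland-Altman plot summarizes, the mean difference (bias) and the 95% limits of agreement (bias ± 1.96·SD of the differences), can be computed directly; the values below are illustrative:

```python
import numpy as np

def bland_altman(y_obs, y_pred):
    """Return the bias, the 95% limits of agreement, and the per-pair
    means (the x-axis of a Bland-Altman plot)."""
    y_obs = np.asarray(y_obs, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    diff = y_pred - y_obs
    mean_pair = (y_pred + y_obs) / 2.0
    bias = diff.mean()
    sd = diff.std(ddof=1)
    return bias, (bias - 1.96 * sd, bias + 1.96 * sd), mean_pair

# Illustrative observed/predicted yields, not the study's data.
bias, (lo, hi), _ = bland_altman([0.5, 0.7, 0.9, 0.6],
                                 [0.52, 0.66, 0.93, 0.60])
```

A well-calibrated model shows a bias near zero and most points inside the limits of agreement; a widening spread at higher means signals heteroscedasticity.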


To better understand the decision-making process of our models, we complement quantitative evaluation with interpretability analyses. We focus on the best-performing Hybrid Improved Stacking model, using SHAP values, LIME, and gradient-based saliency maps to reveal feature contributions over time.
SHAP values [26] provide a unified measure of feature contribution to model predictions based on cooperative game theory. We analyze global feature importance across both the baseline hybrid model and the improved hybrid model with stacking (Figure 14).

The SHAP analyses reveal that the stacking variant emphasizes different features compared to the baseline model. The improved model places higher importance on vegetation indices and agroecological features augmented by classical model predictions, indicating that the stacking mechanism effectively combines expertise from both deep learning and tabular regressors. Key agronomic drivers such as NDVI mean, NDVI max, and EVI are consistently identified as high-impact features, validating the agronomic relevance of the model's decision logic.
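The study computes SHAP values on the full models with the shap library; for intuition about what these attributions mean, the sketch below uses the closed form for a linear model, where under feature independence the SHAP value of feature j is exactly w_j (x_j − E[x_j]). Weights and data here are illustrative:

```python
import numpy as np

# Illustrative linear model f(x) = w.x + b and background dataset.
w = np.array([2.0, -1.0, 0.5])
b = 0.1
X_background = np.array([[0.0, 1.0, 2.0],
                         [1.0, 1.0, 0.0],
                         [2.0, 0.0, 1.0]])

def linear_shap(x, w, X_bg):
    """Exact SHAP values for a linear model with independent features:
    phi_j = w_j * (x_j - E[x_j])."""
    return w * (x - X_bg.mean(axis=0))

x = np.array([1.5, 0.5, 1.0])
phi = linear_shap(x, w, X_background)

# Local accuracy property: attributions sum to f(x) - E[f(X)].
f = lambda X: X @ w + b
assert np.isclose(phi.sum(), f(x) - f(X_background).mean())
```

The local-accuracy check at the end is the property that makes SHAP summaries comparable across the baseline and stacking variants.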
LIME [27] provides local, interpretable explanations for individual predictions by approximating the model's behavior in the neighborhood of a specific prediction. We present representative LIME explanations for both models (Figure 15 and Figure 16) to illustrate local decision-making.






The LIME explanations in Figure 15 and Figure 16 demonstrate that both models rely on agronomically interpretable features. Vegetation indices (NDVI, EVI, SAVI) consistently contribute positively to yield predictions during favorable growing conditions, while negative contributions from adverse climate indicators (e.g., high water stress metrics, drought indices) are observed in poor harvest years. The stacking variant exhibits more stable local explanations, with clearer feature attribution, reflecting the improved calibration already observed in the diagnostic plots.
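The core of a LIME-style explanation, a proximity-weighted linear surrogate fitted to perturbations around one instance, can be sketched as follows. This is a simplified stand-in for the lime library, demonstrated on a toy black-box whose local gradient is known:

```python
import numpy as np

def lime_like_explanation(predict, x, n_samples=500, scale=0.1, seed=0):
    """Fit a weighted linear surrogate around instance x.
    Perturb x with Gaussian noise, weight samples by proximity,
    and solve a weighted least-squares problem; the coefficients
    approximate local feature contributions."""
    rng = np.random.default_rng(seed)
    Z = x + rng.normal(scale=scale, size=(n_samples, x.size))
    y = predict(Z)
    d2 = np.sum((Z - x) ** 2, axis=1)
    sw = np.exp(-d2 / (2 * scale ** 2))           # proximity kernel
    A = np.hstack([Z, np.ones((n_samples, 1))])   # add intercept
    W = np.sqrt(sw)[:, None]
    coef, *_ = np.linalg.lstsq(A * W, y * W.ravel(), rcond=None)
    return coef[:-1]                               # per-feature weights

# Toy black-box with known local gradient [2*x0, 3] at the instance.
predict = lambda Z: Z[:, 0] ** 2 + 3.0 * Z[:, 1]
x = np.array([1.0, 2.0])
coefs = lime_like_explanation(predict, x)
```

The recovered coefficients approximate the local gradient [2, 3], which is exactly the kind of signed, per-feature attribution shown in Figures 15 and 16.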
Gradient-based saliency maps reveal which input features have the strongest local gradient influence on model predictions, providing a temporal perspective on feature importance.
Gradient-based saliency maps are applied exclusively to the temporal component of the hybrid model. This is because the BiLSTM--Transformer backbone is fully differentiable, allowing reliable computation of input gradients with respect to the output. In contrast, the classical machine learning models used in the stacking layer (e.g., Random Forest, XGBoost) are non-differentiable, making gradient-based analysis inapplicable.
The saliency maps (Figure 17) show that the model exhibits peak sensitivity to agroecological features near the harvest period (final timesteps), aligning with practical agricultural knowledge whereby late-season conditions are critical determinants of final yield. Early-season features contribute primarily through temporal aggregation by the LSTM and Transformer components, while late-season variables exert direct influence on final predictions.
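In the paper, saliency is obtained from exact input gradients through the differentiable backbone (e.g., via autograd in PyTorch). The framework-free sketch below approximates the same quantity with central finite differences, on a toy model whose final timestep is weighted most heavily, mirroring the late-season sensitivity described above:

```python
import numpy as np

def saliency(f, X, eps=1e-5):
    """Approximate |df/dX| elementwise via central finite differences.
    X has shape (timesteps, features); the result is a saliency map
    of the same shape."""
    X = np.asarray(X, dtype=float)
    S = np.zeros_like(X)
    for idx in np.ndindex(X.shape):
        Xp, Xm = X.copy(), X.copy()
        Xp[idx] += eps
        Xm[idx] -= eps
        S[idx] = abs(f(Xp) - f(Xm)) / (2 * eps)
    return S

# Toy scalar model over 4 timesteps x 3 features whose per-timestep
# weights peak at the final (harvest) timestep; weights illustrative.
T, F = 4, 3
time_w = np.array([0.1, 0.2, 0.3, 1.0])[:, None]
f = lambda X: float(np.sum(time_w * X))
S = saliency(f, np.zeros((T, F)))
```

For this toy model the saliency map simply recovers the per-timestep weights, with the last row dominating, as in Figure 17.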

Together, SHAP, LIME, and gradient-based analyses consistently indicate that both the baseline and improved hybrid models rely on agronomically meaningful variables and exhibit interpretable decision logic. The improved hybrid model with stacking demonstrates:
More transparent feature attribution via SHAP, with clearer separation between important and peripheral features.
Stable local explanations via LIME, with fewer anomalies or reversals in feature sign compared to the baseline.
These properties collectively support transparency and trust in the proposed decision-support pipeline, ensuring that stakeholders can audit, understand, and rely upon model recommendations based on agronomically grounded reasoning.
8. Discussion
The results highlight that integrating AI models with blockchain-inspired provenance and IoT-style data acquisition improves both predictive accuracy and the trustworthiness of agricultural data pipelines. The hybrid temporal architecture combines BiLSTM and Transformer encoders to capture short- and long-term dynamics, while fusing heterogeneous signals such as agroecological variables and historical yields, enabling robust predictions even under sparse or delayed observations. Auxiliary reconstruction objectives act as implicit regularizers, enhancing generalization on limited data, and interpretability analyses (SHAP, LIME, saliency) provide actionable insights aligned with agronomic knowledge, supporting stakeholder adoption.
However, several limitations remain. Temporal alignment of input data is assumed, and misaligned or inconsistent records can degrade performance. Dependence on the availability and quality of key tabular features remains a concern, and computational overhead may limit deployment in resource-constrained environments. Moreover, the prototype blockchain layer relies on off-chain storage and HMAC-based signatures, lacking full decentralized governance, identity management, and privacy-preserving mechanisms. Domain shifts across regions or sensors may also affect generalization, and model training is centralized, leaving federated or privacy-preserving strategies as future extensions.
Future work will address these limitations by implementing MSP-backed endorsements, secure object storage with immutable references, evaluation in multi-organization scenarios, and exploration of federated learning frameworks, aiming to combine high predictive performance with verifiable data integrity and trust in real-world agricultural applications.
9. Conclusion
This work proposes a unified framework that combines AI-based yield prediction with blockchain-inspired provenance and IoT-style data collection. Rather than focusing solely on predictive accuracy, the approach explicitly addresses data integrity, auditability, and trust, which are critical constraints in many agricultural contexts.
Experimental results demonstrate that the proposed stacked hybrid model achieves MAE $\approx$ 0.074, RMSE $\approx$ 0.101, and R$^2$ $\approx$ 0.946, consistently outperforming the base tabular temporal model and individual learners. These results highlight the benefits of integrating stacking, explainability, and data certification within a unified and trustworthy AI pipeline for millet yield prediction.
The framework is particularly relevant for data-constrained and trust-sensitive environments, where reliable ground truth is difficult to obtain. Future work will target large-scale deployment, cross-institutional governance, privacy-preserving learning, and federated training to support real-world agricultural decision-making systems. To further validate the framework, we plan to extend evaluation across multiple agro-climatic regions, conduct stress-tests under distributional shift scenarios, and integrate distributed training strategies for operational deployment.
Conceptualization, P.E.A.G.; methodology, P.E.A.G.; software, P.E.A.G.; validation, P.E.A.G., C.B.D. and A.B.; formal analysis, P.E.A.G.; investigation, P.E.A.G.; resources, P.E.A.G.; data curation, P.E.A.G.; writing—original draft preparation, P.E.A.G.; writing—review and editing, P.E.A.G.; visualization, P.E.A.G.; supervision, A.B., D.N., and C.B.D.; project administration, P.E.A.G.; funding acquisition, C.B.D. and D.N. All authors have read and agreed to the published version of the manuscript.
This study relies exclusively on aggregated, non-identifiable regional and national statistics. No individual-level or personally identifiable data were collected or processed. The blockchain-style provenance layer is designed to enhance data integrity and auditability without exposing raw data; only cryptographic digests and minimal metadata are recorded.
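As an illustration of the digest-and-signature mechanism described above, a minimal sketch follows; the key, field names, and record values are placeholders, not the deployed implementation:

```python
import hashlib
import hmac
import json

SECRET_KEY = b"demo-key"  # placeholder; a real deployment uses managed keys

def certify_record(record, key=SECRET_KEY):
    """Canonicalize a zone-year record, hash it with SHA-256, and sign
    the digest with HMAC-SHA256. Only the digest and signature (never
    the raw data) need to be recorded for provenance."""
    canonical = json.dumps(record, sort_keys=True, separators=(",", ":"))
    digest = hashlib.sha256(canonical.encode("utf-8")).hexdigest()
    signature = hmac.new(key, digest.encode("utf-8"),
                         hashlib.sha256).hexdigest()
    return digest, signature

def verify_record(record, signature, key=SECRET_KEY):
    """Recompute the signature and compare in constant time."""
    _, expected = certify_record(record, key)
    return hmac.compare_digest(expected, signature)

# Hypothetical zone-year record for illustration.
record = {"zone": "Kaolack", "year": 2018, "yield_t_ha": 0.87}
digest, sig = certify_record(record)
```

Because canonicalization sorts keys and strips whitespace, any tampering with a certified record changes the digest and invalidates the signature.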
Any real-world deployment of the proposed framework should incorporate appropriate access control, organizational endorsement policies, and privacy-preserving mechanisms in compliance with local regulations and ethical guidelines.
The data used to support the research findings are available from the corresponding author upon request.
The authors thank regional agronomic services and domain experts who contributed to data curation, validation, and contextual interpretation of millet yield statistics in Senegal. We also acknowledge colleagues and reviewers for their constructive feedback on the model architecture, evaluation protocol, and blockchain-inspired provenance design.
The authors declare no conflict of interest.
Additional details related to reproducibility and the compute environment are provided below.
Reproducibility was a primary design objective of this work. All source code, experiment configurations, and implementation details are hosted in a private repository on GitHub. In addition, the datasets used in this study are made available through our startup platform ``http://sig-these.vercel.app'', ensuring controlled access and data traceability.
All results reported in this paper are generated from a single, self-contained pipeline that integrates data preparation, model training, evaluation, and reporting. Tabular inputs are loaded from the file ``processed/complete_dataset.csv'', while synthetic national-level millet yields are provided in the file ``synthetic/senegal_mil_yields_2000_2020.csv''. All quantitative results, tables, and figures are automatically produced and stored under the directory ``outputs/reports_tab'', ensuring that no manual post-processing is required.
The main entry script ``millet_pipelin_only_tab.py'' trains the classical tabular models, the temporal deep learning model, and the hybrid stacking variant. To reproduce the reported results, create or activate a Python virtual environment, install all dependencies listed in ``requirements.txt'', and run the tabular-only pipeline to regenerate the reports and figures in ``outputs/reports_tab''.
All experiments were conducted on a Windows-based workstation using a Python virtual environment and MiKTeX for LaTeX compilation. Classical machine learning models were implemented with ``scikit-learn'', while boosted tree methods (XGBoost, LightGBM, and CatBoost) may require platform-specific binaries or build tools.
The temporal deep learning model was implemented in PyTorch and trained using the AdamW optimizer with cosine learning-rate scheduling and gradient clipping. No specialized hardware accelerators were required, reflecting the practical deployment constraints targeted by this study.
