Bayesian Estimation of Hand Kinematics from Spatially Tracked Landmarks
Abstract:
A Bayesian framework for estimating finger joint kinematics from spatially tracked hand landmarks was introduced in this study. Three-dimensional landmark data were constructed by augmenting image-based two-dimensional hand landmarks with calibrated depth information. A hierarchy of coordinate frames was established, beginning with the palm as the root and extending to child frames assigned to each finger, thereby encoding the natural kinematic dependencies of the hand. This hierarchical representation provides the structural foundation for Bayesian estimation. Finger joint parameters were estimated within a maximum likelihood framework that is robust to tracking noise and signal occlusions, which are common in practical hand-tracking scenarios. Unlike data-driven methods, the proposed approach does not rely on pre-collected training datasets but instead leverages the kinematic model and intrinsic physical constraints of the human hand. The estimation problem was formalized as a Gaussian Bayesian Network (GBN), through which joint parameters were inferred using Maximum Likelihood Estimation (MLE). Robustness of the approach was qualitatively demonstrated through reconstructed graphical configurations that illustrate accurate recovery of finger postures under noisy conditions. This method provides a principled framework for hand motion reconstruction and establishes the foundation for future quantitative evaluations against benchmark datasets. The framework is expected to advance applications in human–computer interaction, prosthetic design, virtual reality (VR), and rehabilitation by enabling more reliable and anatomically consistent hand tracking.

1. Introduction
Tracking and kinematic reconstruction of the human hand is an area of study with many potential applications such as robotics, gaming, and human-machine interfaces [1], [2], [3], [4], [5]. For example, in clinical settings, such as rehabilitation centers, accurate tracking of hand or arm movements and their reconstruction in virtual environments enable detailed data collection and visualization, essential for healthcare providers monitoring recovery in patients like injured athletes or elderly individuals. Advanced tracking and reconstruction algorithms, combined with state-of-the-art hardware, can offer clinicians deeper insights into a patient’s motor abilities, providing a foundation for more accurate recovery assessments and aiding in early diagnosis of neurological conditions, including stroke [6], Alzheimer’s disease [7], or amyotrophic lateral sclerosis (ALS) [8]. Hand tracking [5], [9], [10], [11] and kinematic reconstruction still face many challenges due to the high-dimensional configuration space of the hand, the motion and appearance variations, the limitations of cameras, and the interference of cluttered environmental backgrounds.
Designing effective hand tracking and reconstruction systems presents unique challenges. The human hand’s complex structure, with its high degrees of freedom (DOF), makes it difficult to accurately track each joint and bone movement, as it involves estimating numerous parameters. Additional challenges arise from self-occlusions, where parts of the hand obscure others, and from the rapid speed of hand movements. These systems must also process substantial data volumes in real time, remaining robust in uncontrolled environments with noisy backgrounds and varying lighting conditions, all of which demand significant computational power and advanced algorithms.
Methods of hand tracking can be divided into two categories: appearance- and model-based approaches [1]. Methods based on visual appearance try to find a mapping from the input data to the hand pose space by utilizing features such as edges. These methods are computationally less demanding since the learned tracking information is encoded for direct association without searching the whole configuration space of the hand. However, these methods are limited by noise and partial occlusion in the input data and suffer from low accuracy and poor stability. Wang and Payandeh [12] used a Kalman filter to track human hands and leveraged the scale-invariant feature transform to extract features for hand posture recognition. Ge et al. [13] designed Multi-view Convolutional Neural Networks (MVCNNs) to detect hand pose from a single depth image. The main idea is to project the depth image onto three orthogonal planes and use these projections to regress 2D joint heat maps that are further fused to predict 3D hand pose. Zhang et al. [14] presented MediaPipe Hands, which achieves real-time hand tracking using a neural network. It first uses a palm detector to predict a bounding box and then a landmark model to predict the hand skeleton. A feature extractor was trained with extensive hand data, and it can output the 2.5D position of 21 landmarks, the handedness, and the probability of hand presence.
Model-based methods try to fit a hand model to the input data. This process is typically performed through optimization or nearest-neighbor search. The optimization-based methods fine-tune the model parameters of the 3D hand until the model fits the input data well. The methods based on nearest-neighbor search try to find the most similar hand pose in a large pose database. Oikonomidis et al. [15] presented a tracking method that can predict the position, orientation, and full articulation of a hand by using a Kinect sensor. It is a typical model-based method: it formulates tracking as an optimization problem and searches for model parameters that minimize the discrepancy between the hand model and hand observation. To solve this problem, Particle Swarm Optimization (PSO) was used. Tagliasacchi et al. [16] presented a robust Articulated Iterative Closest Point (articulated-ICP) method to achieve real-time hand tracking using a single depth camera. The articulated registration algorithm can integrate data and regularization priors into a unified solver to enable robust tracking without restricting the freedom of motion. Qian et al. [17] presented a real-time hand tracking system on a desktop. This method follows local optimization with initialization by a part detection framework. A simple hand model approximated by a set of spheres was used, and a cost function measuring the distance between the model and a sparse point cloud was designed. Moreover, some works have tried to use multimodal sensors to capture the motion of human hands. Diaz and Payandeh [18] investigated the integration of a multimodal sensing system for exploring limits of vibrotactile haptic feedback when interacting with 3D representations of objects.
Several studies have developed kinematic models for robotic hands, each designed to address specific functional needs and structural requirements. For example, Tarmizi et al. [19] presented a multi-fingered robotic hand model, developed using the Denavit-Hartenberg algorithm and Euler-Lagrange equations, to capture joint configurations and movement dynamics accurately. Similarly, Lee et al. [20] introduced an anthropomorphic robotic hand, the Korea Institute of Industrial Technology Hand (KITECH-Hand), with a detailed kinematic analysis. This model uniquely focuses on the metacarpophalangeal (MCP) joints of the fingers, implementing a roll-pitch configuration to replace conventional yaw-pitch structures for improved mechanical functionality. Hand pose estimation methods leverage deep learning models, including diffusion-based approaches [21], [22], [23], [24]. For instance, Wang et al. [21] modeled hand pose forecasting as a reverse diffusion process, introducing a dual-diffusion mechanism that simultaneously captures local and global hand features using specialized global and local diffusion blocks. Cheng et al. [22] proposed a diffusion-based model that de-noises hand pose predictions by integrating image and point cloud data. Joint-wise and local detail conditions were introduced to recover precise key-point positions. Additionally, Ye et al. [24] focused on hand-object interactions, designing a diffusion network that models the conditional distribution of object geometries based on hand configurations captured from video sequences.
As a more recent comparative literature review, Mangalam et al. [25] targeted realistic and immersive hand-object interaction in virtual reality (VR). Current limitations in VR systems were identified, such as unnatural post-collision behavior and ineffective grasping, and a multi-pronged solution across hand modeling, contact dynamics, and grasp release was proposed. The focus was on enhancing realism through physical simulation, neuroscientific insights, and mesh-based modeling, rather than explicit kinematic parameter estimation which is the main focus of this current study. While both works aim to improve hand representation in digital environments, the contributions of this current study emphasize kinematic inference and model estimation from data, whereas the referenced study addresses interaction realism and dynamic behavior in virtual environments. Joseph Isaac et al. [26] presented a deep learning-based approach that uses a convolutional neural network (CNN) trained on large, annotated datasets (HANDS2017 and MSRA) to directly predict 3D hand poses. Its novelty lies in enforcing anatomical correctness within the network architecture itself, avoiding the need for post-processing filters. The model ensures that predicted poses lie within the biomechanical constraints of the human hand, and even corrects errors found in the “ground truth” annotations by creating new datasets (AEF-HANDS2017 and AEF-MSRA). In contrast, this current study offers a data-efficient, model-driven solution that avoids training entirely, while the Semi-Supervised Cross-Correlation Convolutional Neural Network (SSC-CNN) offers a data-intensive, learning-based solution that corrects anatomical inconsistencies through architectural constraints. The former emphasizes hierarchical kinematic reasoning and interpretability, whereas the latter prioritizes learning anatomical validity within a predictive neural framework. Pavlakos et al. [27] presented a fully data-driven, transformer-based method for recovering full 3D hand meshes from monocular images. It emphasizes scale and learning capacity, combining multiple annotated datasets and utilizing a large Vision Transformer (ViT) to achieve state-of-the-art accuracy. It excels in generalizing to in-the-wild conditions, as shown by performance gains on a new benchmark dataset (HInt) with real-world images and 2D keypoint annotations.
While this current study and the study by Pavlakos et al. [27] both aim to reconstruct the 3D structure of the hand, the former emphasizes interpretable, physically grounded kinematic estimation without the need for training data, whereas the latter focuses on high-capacity, large-scale learning to recover dense hand meshes with superior generalization. The former prioritizes model structure and estimation robustness, while the latter advances end-to-end learning performance and dataset scalability. Dong et al. [28] introduced a deep learning framework that bridges graph neural networks and state-space modeling for improved 3D hand pose and shape estimation from single red–green–blue (RGB) images. It addresses limitations in existing transformer-based models by proposing a Graph-guided State Space (GSS) block, which more efficiently captures spatial relationships between joints using significantly fewer tokens. Combined with a global-local fusion module, the proposed model achieves state-of-the-art performance on public benchmarks. In relation to this current study which offers a structured, estimation-driven method rooted in kinematic modeling and physical constraints, the study by Dong et al. [28] delivers a data-intensive, learning-driven solution that leverages graph-guided neural representations for superior accuracy. The former prioritizes robustness and model transparency, whereas the latter emphasizes scalability and predictive performance through architectural innovation.
Rezaei et al. [29] presented a deep learning-based method tailored for depth image inputs. Its key innovation lies in decomposing 3D pose estimation into two stages: predicting 2D joint locations (UV space) and separately estimating depth using dual attention maps. This decomposition isolates the complexity of depth prediction, improving estimation accuracy. Additionally, it introduces a novel appearance-based data augmentation technique for depth images. This current study emphasizes explicit kinematic modeling, interpretability, and estimation robustness without training, while the study by Rezaei et al. [29] focuses on architectural innovations and data augmentation to enhance learning-based prediction accuracy. The former is designed for model-driven reconstruction under minimal data assumptions using measured red–green–blue–depth (RGB-D) data, while the latter advances data-driven depth image analysis for applications.
Despite these advancements, current models are largely designed for robotic applications and are not readily adaptable to mapping real hand landmark point clouds into kinematic representations for accurate remote control and graphical reconstruction. In this study, hand poses were estimated using a kinematic model that explicitly enforces motion constraints and ensures natural hand configurations. The proposed approach formulates a 3D objective function within a Bayesian tracking framework, enhancing robustness to occlusion and missing data by providing a well-constrained solution space that accurately reflects hand biomechanics. Additionally, deep learning-based hand pose reconstruction methods typically rely on implicit learning of hand structure constraints within the network, which makes them sensitive to unseen hand poses and limits their capacity to generalize across the full range of hand motion. In this study, a hierarchical kinematic model tailored for hand tracking applications was proposed. The proposed model integrates measured data with a multi-layered kinematic structure, enabling precise hand tracking and reconstruction that aligns well with real hand movements in 3D space.
2. Landmark-Based Hand Kinematic Construction
This section presents a short overview of kinematic modeling and construction of the hand based on tracking landmarks.
Kinematic modeling involves creating a mathematical framework to represent the motion of a physical structure, such as the human hand. This representation can be used to determine the relative joint variables that describe the position and orientation of various interconnected segments. Such information can further be utilized to control a physical robotic hand or its virtual counterpart, or as part of a graphical model for visualizing the physical system.
In contrast to physical robots or wearable devices - where each moving joint is equipped with sensors for precise motion tracking - hand movement tracking using ambient sensing modalities such as RGB-D sensors provides only the coordinates of selected landmarks relative to the sensor frame. For example, Figure 1 shows typical hand landmarks defined by MediaPipe, measured using calibrated RGB-D sensing (e.g., an Intel RealSense D435 RGB-D sensor and its SDK), with coordinates defined in the sensor’s coordinate system.
The goal of constructing a kinematic model of the hand is to establish a hierarchical coordinate system that can be used to compute finger joint angles based on these spatial landmark measurements through the inverse kinematic solutions. Conversely, the forward solution refers to the case where the finger joint angles are available and it is required to estimate the spatial location of the landmarks. In this study, the kinematic model of the hand was constructed from 21 spatial landmark measurements (a point cloud with respect to the sensor frame), consisting of one at the root (the wrist) and 20 distributed across the phalanges of the five fingers. In the following kinematic model construction, each link was assigned a corresponding local frame, allowing the description of the position and rotational angles of each link.
A world coordinate frame {W} is defined to locate the position and orientation of the hand, including its palm and connecting fingers. This coordinate frame is collocated with the RGB-D sensor frame. The rotational angle of each finger joint can be described by defining local (or relative) coordinate frames within the hand. In this context, a wrist frame {0} is defined, which is affixed to the wrist, moves with the hand, and is specified relative to the world frame. It also serves as a local coordinate frame with respect to which all other hand coordinates are defined. As shown in Figure 1, the $\hat{Y}_0$ of the wrist frame is defined to be aligned with the general direction of finger flexion, the unit vector $\hat{X}_0$ is oriented from the wrist towards the root of the middle finger, and $\hat{Z}_0$ is perpendicular to the palm plane, pointing towards the back of the hand, assuming that the front of the hand faces the sensor.
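The wrist frame definition above can be sketched numerically. The following is a minimal illustration (not the study's implementation): it assumes the wrist, middle-finger root, and little-finger root landmarks are given as numpy vectors in the world frame, and builds an orthonormal frame with $\hat{X}_0$ toward the middle-finger root, $\hat{Z}_0$ normal to the palm plane, and $\hat{Y}_0$ completing a right-handed triad; the sign of $\hat{Z}_0$ depends on handedness and hand orientation relative to the sensor.

```python
import numpy as np

def wrist_frame(d0, d9, d17):
    """Construct the wrist frame {0} from three landmarks:
    d0 (wrist), d9 (middle-finger MCP), d17 (little-finger MCP).
    Returns a 3x3 rotation matrix whose columns are X0, Y0, Z0
    expressed in the world frame {W}."""
    x0 = d9 - d0                      # X0: from the wrist toward the middle-finger root
    x0 = x0 / np.linalg.norm(x0)
    v = d17 - d0                      # a second in-palm direction
    z0 = np.cross(v, x0)              # Z0: normal to the palm plane
    z0 = z0 / np.linalg.norm(z0)
    y0 = np.cross(z0, x0)             # Y0 completes a right-handed frame
    return np.column_stack((x0, y0, z0))
```

For example, with the wrist at the origin and the two finger roots on the world axes, the returned matrix is orthonormal with determinant +1, i.e., a valid rotation.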

The proposed model employs a hierarchical structure of local frames corresponding to anatomical joints. As shown in Figure 1, the first layer includes local frames $\{1\},\{5\},\{9\},\{13\}$ and $\{17\}$ at the MCP joints, defined with respect to the root frame $\{0\}$. The second layer consists of local frames $\{2\},\{6\},\{10\}$, $\{14\}$ and $\{18\}$ at the proximal interphalangeal (PIP) joints, defined relative to the first layer. Similarly, frames $\{3\},\{7\},\{11\},\{15\}$ and $\{19\}$ at the distal interphalangeal (DIP) joints form the third layer. Finally, the fourth layer comprises frames $\{4\},\{8\},\{12\},\{16\}$ and $\{20\}$ at the fingertips.
Each MCP joint, or finger root joint, possesses two DOF, enabling it to flex and extend (bend up and down) as well as abduct and adduct (move side to side). When the root of a finger is fixed, and the finger is bent up and down, the PIP and DIP joints each exhibit one DOF. Consequently, the workspace (the area the fingertip can reach) of a finger is a 2D plane (Figure 2a). Each of the five fingers can thus be modeled as a kinematic chain comprising three links connected by three revolute joints with parallel rotation axes. The unit vectors $\hat{Y}$ of the local frames of each finger are aligned with the axes of the revolute joints, as shown in Figure 2b. The unit vectors $\hat{\mathrm{X}}$ for each of the three joints point towards the origin of the subsequent frame. For example, along the index finger, $\hat{\mathrm{X}}_5$ points to the origin of local frame $\{6\}$, $\hat{\mathrm{X}}_6$ to the origin of local frame $\{7\}$, and $\hat{\mathrm{X}}_7$ to the origin of local frame $\{8\}$. Note that since $\{8\}$ is located at the fingertip and has no DOF, $\hat{\mathrm{X}}_8$ aligns with $\hat{\mathrm{X}}_7$. The unit vectors $\hat{\mathrm{Z}}$ of the local frames are assigned using the right-hand rule.
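Because the PIP and DIP axes are parallel to the MCP flexion axis, the forward kinematics of a single finger within its own plane reduces to a planar three-link chain with cumulative angles. The following sketch (an illustration, with a sign convention chosen so that positive flexion bends the finger toward $-\hat{Z}$ in the finger plane; it is not the study's implementation) computes the PIP, DIP, and fingertip positions in the MCP frame:

```python
import numpy as np

def finger_fk(lengths, betas):
    """Planar forward kinematics of one finger expressed in its MCP frame.
    lengths: (l1, l2, l3) link lengths; betas: (b_mcp, b_pip, b_dip)
    flexion angles in radians about the parallel Y axes.
    Returns the joint positions (PIP, DIP, tip) in the finger plane."""
    pts, pos, ang = [], np.zeros(2), 0.0
    for l, b in zip(lengths, betas):
        ang += b                                       # flexion angles accumulate along the chain
        pos = pos + l * np.array([np.cos(ang), -np.sin(ang)])
        pts.append(pos.copy())
    return pts
```

With all flexion angles at zero, the fingertip lies at a distance $l_1+l_2+l_3$ along the finger's $\hat{X}$ direction, consistent with a fully extended finger.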


Palm parameters: In this study, the kinematic model of the palm is represented as a planar surface in 3D space. It is assumed that the wrist and the roots of the five finger landmarks lie on a single plane, specifically the $xy$-plane of the local frame $\{0\}$. A set of sixteen parameters is defined, of which the first six are time-varying, while the five length parameters and the five angle parameters are assumed to be constant:
- Position parameters: The position vector $\mathbf{d}_0^w=\left(x_0, y_0, z_0\right)^T$ specifies translational distance between the origins of the wrist frame $\{0\}$ and the world frame $\{\mathrm{W}\}$ along $\hat{\mathrm{X}}_{\mathrm{w}}, \hat{\mathrm{Y}}_{\mathrm{w}}$ and $\hat{\mathrm{Z}}_{\mathrm{w}}$, respectively.
- Orientation parameters: The orientation of the local frame $\{0\}$ is defined by the Euler angles $\left\{\gamma_0, \beta_0\right.$, $\left.\alpha_0\right\}$, and these angles describe the sequential rotations around axes $\hat{Z}_0, \hat{Y}_0$ and $\hat{X}_0$.
- Length parameters: The lengths $\left\{l_{0,1}, l_{0,5}, l_{0,9}, l_{0,13}, l_{0,17}\right\}$ represent the fixed distances between the origin of the local frame $\{0\}$ and the origins of the frames $\{1\},\{5\},\{9\},\{13\}$, and $\{17\}$. These measurements, in centimeters, are spatial relationships between the wrist and finger roots.
- Angle parameters: The angles $\left\{\theta_1, \theta_5, \theta_9, \theta_{13}, \theta_{17}\right\}$ describe the rotational offset between the $\hat{\mathrm{X}}_0$ axis and the axes $\hat{\mathrm{X}}_1, \hat{\mathrm{X}}_5, \hat{\mathrm{X}}_9, \hat{\mathrm{X}}_{13}$, and $\hat{\mathrm{X}}_{17}$, respectively. These angles, measured in radians, are pivotal for capturing the hand's natural articulation around the $\hat{\mathrm{Z}}_0$ axis.
Parameters of fingers: As shown in Figure 2, each finger requires three length parameters (as constraints) and four angle parameters (between 0 and 90 degrees) for defining its pose, amounting to a total of 35 parameters for all five fingers:
- Length parameters: A set of 15 parameters determines the fixed lengths of the three links $l_{i,i+1}$, $l_{i+1,i+2}$ and $l_{i+2,i+3}$ within each finger, where $i=1,5,9,13,17$:
$\left\{\left(l_{1,2}, l_{2,3}, l_{3,4}\right),\left(l_{5,6}, l_{6,7}, l_{7,8}\right),\left(l_{9,10}, l_{10,11}, l_{11,12}\right),\left(l_{13,14}, l_{14,15}, l_{15,16}\right),\left(l_{17,18}, l_{18,19}, l_{19,20}\right)\right\}$
- Angle parameters for finger roots: A set of 10 parameters describes five pairs of rotation angles $\beta_i$ and $\gamma_i$ around axes $\hat{\mathrm{Y}}_i$ and $\hat{\mathrm{Z}}_i$ for the root joints of each finger, where $i=1,5,9,13,17$:
$\left\{\left(\beta_1, \gamma_1\right),\left(\beta_5, \gamma_5\right),\left(\beta_9, \gamma_9\right),\left(\beta_{13}, \gamma_{13}\right),\left(\beta_{17}, \gamma_{17}\right)\right\}$
- Angle parameters for other joints: A set of 10 parameters describes five pairs of rotation angles $\beta_{i+1}$ and $\beta_{i+2}$ for the PIP and DIP joints around the corresponding $\hat{\mathrm{Y}}$ axes of each finger, where $i=1,5,9,13,17$:
$\left\{\left(\beta_2, \beta_3\right),\left(\beta_6, \beta_7\right),\left(\beta_{10}, \beta_{11}\right),\left(\beta_{14}, \beta_{15}\right),\left(\beta_{18}, \beta_{19}\right)\right\}$
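The parameter sets above (16 palm parameters plus 35 finger parameters) can be gathered into a single container. The sketch below is purely illustrative bookkeeping (the names and grouping are assumptions, not taken from the study's implementation), but it makes the 51-parameter budget explicit:

```python
from dataclasses import dataclass, field
import numpy as np

@dataclass
class HandParameters:
    """Illustrative container for the kinematic parameters of Section 2.
    Palm: 6 time-varying pose parameters + 5 lengths + 5 offset angles.
    Fingers: 15 link lengths + 10 root angles + 10 PIP/DIP angles."""
    d0_w: np.ndarray = field(default_factory=lambda: np.zeros(3))          # (x0, y0, z0)
    euler0: np.ndarray = field(default_factory=lambda: np.zeros(3))        # (gamma0, beta0, alpha0)
    palm_lengths: np.ndarray = field(default_factory=lambda: np.zeros(5))  # l_{0,i}, i = 1,5,9,13,17
    palm_angles: np.ndarray = field(default_factory=lambda: np.zeros(5))   # theta_i about Z0
    link_lengths: np.ndarray = field(default_factory=lambda: np.zeros((5, 3)))  # three links per finger
    root_angles: np.ndarray = field(default_factory=lambda: np.zeros((5, 2)))   # (beta_i, gamma_i)
    flex_angles: np.ndarray = field(default_factory=lambda: np.zeros((5, 2)))   # (beta_{i+1}, beta_{i+2})

    def count(self):
        # 6 palm pose + 5 palm lengths + 5 palm angles + 15 link lengths
        # + 10 root angles + 10 PIP/DIP angles
        return 6 + 5 + 5 + 15 + 10 + 10
```

Of these 51 parameters, only the palm pose and the 25 finger angles vary over time; the 20 lengths and 5 palm offset angles act as fixed anatomical constraints.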
To effectively track human hand movements and reconstruct corresponding poses in a virtual environment, it is essential to represent the position and orientation of each joint's local coordinate frame in 3D space, along with the corresponding joint parameters of each finger. A systematic approach was introduced for resolving these parameters through transformations between adjacent coordinate frames. The input parameters are 21 hand landmarks provided by the MediaPipe model, along with depth data captured by an RGB-D sensor. These represent the position parameters of the origin of each local frame $\{\mathrm{i}\}$ with respect to world frame $\{\mathrm{W}\}$, denoted as the vector $\mathbf{d}_i^w=\left(x_i, y_i, z_i\right)^T$ for $i=0,1, \ldots, 20$. Table 1 shows the definitions of these measured 3D spatial coordinates for each joint of the thumb, index, middle, ring, and little fingers.
| Joint | MCP (1st layer) | PIP (2nd layer) | DIP (3rd layer) | Tip (4th layer) |
| --- | --- | --- | --- | --- |
| Wrist | $\mathbf{d}_0^w$ | | | |
| Thumb | $\mathbf{d}_1^w$ | $\mathbf{d}_2^w$ | $\mathbf{d}_3^w$ | $\mathbf{d}_4^w$ |
| Index | $\mathbf{d}_5^w$ | $\mathbf{d}_6^w$ | $\mathbf{d}_7^w$ | $\mathbf{d}_8^w$ |
| Middle | $\mathbf{d}_9^w$ | $\mathbf{d}_{10}^w$ | $\mathbf{d}_{11}^w$ | $\mathbf{d}_{12}^w$ |
| Ring | $\mathbf{d}_{13}^w$ | $\mathbf{d}_{14}^w$ | $\mathbf{d}_{15}^w$ | $\mathbf{d}_{16}^w$ |
| Little | $\mathbf{d}_{17}^w$ | $\mathbf{d}_{18}^w$ | $\mathbf{d}_{19}^w$ | $\mathbf{d}_{20}^w$ |
Table 2 provides an example of the 3D hand landmark data $\mathbf{d}_i^w$ in meters relative to the world frame, with its axes aligned with the axes of the RGB-D sensor frame (i.e., the positive $y$-axis pointing downward and the positive $z$-axis measuring the distance from the hand to the sensor).
| Joint | MCP (1st layer) | PIP (2nd layer) | DIP (3rd layer) | Tip (4th layer) |
| --- | --- | --- | --- | --- |
| Wrist | (-0.21, -0.11, 0.54) | | | |
| Thumb | (-0.19, -0.15, 0.53) | (-0.17, -0.18, 0.54) | (-0.15, -0.20, 0.55) | (-0.14, -0.22, 0.56) |
| Index | (-0.13, -0.16, 0.56) | (-0.10, -0.16, 0.56) | (-0.08, -0.16, 0.55) | (-0.06, -0.16, 0.55) |
| Middle | (-0.13, -0.14, 0.57) | (-0.09, -0.14, 0.57) | (-0.07, -0.14, 0.56) | (-0.05, -0.14, 0.55) |
| Ring | (-0.13, -0.12, 0.57) | (-0.10, -0.12, 0.57) | (-0.07, -0.12, 0.56) | (-0.06, -0.12, 0.56) |
| Little | (-0.14, -0.10, 0.57) | (-0.11, -0.10, 0.57) | (-0.09, -0.10, 0.57) | (-0.07, -0.10, 0.57) |
Figure 3 shows the 2D hand tracking data visualization of the hand landmarks and the plot of the calibrated 3D coordinates of the landmarks. In this figure, the RGB image shows a hand in a front-facing position, where key hand landmarks are tracked using the MediaPipe framework. These landmarks are accurately projected onto the image plane, ensuring comprehensive coverage of all joint locations. On the right, the 3D point cloud of these landmarks is visualized using depth data from an RGB-D sensor (Intel RealSense D435). The depth values associated with each landmark are mapped to their respective $x$ and $y$ coordinates derived from the MediaPipe output.
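The fusion of image-plane landmark coordinates with per-pixel depth can be illustrated with a standard pinhole back-projection. This is a hedged sketch: the intrinsic parameters `fx`, `fy`, `cx`, `cy` below are illustrative placeholders rather than the calibration values used in the study, and the RealSense SDK provides its own calibrated deprojection routine that would be used in practice.

```python
import numpy as np

def deproject(u, v, depth, fx, fy, cx, cy):
    """Back-project a pixel (u, v) with a measured depth (metres) into a
    3D point in the sensor frame, using a simple pinhole camera model.
    fx, fy: focal lengths in pixels; cx, cy: principal point."""
    x = (u - cx) * depth / fx
    y = (v - cy) * depth / fy
    return np.array([x, y, depth])

# Illustrative call: a landmark at the principal point, 0.55 m away,
# maps to a point on the optical axis.
p = deproject(320.0, 240.0, 0.55, 615.0, 615.0, 320.0, 240.0)
```

A pixel at the principal point back-projects to $(0, 0, z)$, which is a quick sanity check on the intrinsics.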

3. Estimation with Gaussian Bayesian Network
The stages of grasping, as captured using RGB-D and MediaPipe, are illustrated in Figure 4. This figure provides example case studies showing how the hand approaches, contacts, partially encloses, and ultimately fully grasps the object. These stages are depicted from two different measurement perspectives: a frontal view (top row) and a side view (bottom row). It is important to note that during all these phases, the positions and orientations of certain landmarks are more difficult to track and kinematically reconstruct accurately due to several factors. Occlusions, caused by parts of the hand being hidden behind the object or by overlapping fingers, can result in missing or unreliable depth values. Additionally, sensor data may be noisy or incomplete, particularly for joints involved in fine-grained grasping [30], such as the PIP and DIP joints. Missing or noisy measurements of these joints pose significant challenges for accurately reconstructing the full hand posture, as even small inaccuracies can lead to incorrect reconstruction of the grasp.
In the context of kinematic hand modeling for grasping objects, accurately estimating parameters such as joint angles for graphical reconstruction requires capturing the complex dependencies between adjacent joints. These natural dependencies are due to the structure of the open-kinematic chain representing each finger which is attached to the palm. In this construction, the movement of one joint influences the movement of others in the process of forming a stable grasp. To effectively model these relationships and account for uncertainties in sensor data, a Bayesian network framework was adopted. A Bayesian network is a probabilistic graphical model (PGM) that represents a set of variables and their conditional dependencies through a directed acyclic graph (DAG). In this framework, each node corresponds to a joint or landmark, and the edges capture the dependencies between joints, making it particularly well-suited for tracking a grasping hand where resolving the values of certain joints is more prone to occlusion.
This section extends this concept to a Gaussian Bayesian Network (GBN) [31], which is also well-suited for continuous domains such as joint positions and angles. The GBN framework allows us to model the uncertainty in sensor data through Gaussian noise while capturing the hierarchical dependencies between joints in the kinematic structure and solutions of the hand. In the GBN framework, each joint position is represented as a continuous random variable, and the joint probability density function is modeled using a Gaussian distribution. The core strength of the Bayesian network lies in its ability to model these conditional dependencies between joints. For instance, given the known position of the wrist, the positions of the MCP joints are conditionally dependent on the position and orientation of the wrist. Similarly, the PIP and DIP joints depend on the state of their parent joints. This hierarchical relationship allows the network to propagate information efficiently through the kinematic model and solutions, making it possible to estimate joint angles even when some data points are missing or noisy.

Figure 4 shows the frontal and side measurement stages, respectively. Each frame overlays a detailed hand skeleton, depicting joint landmarks and tracked kinematic motion to demonstrate dynamic hand-object interaction.
As depicted in Figure 5, the root node in the network corresponds to the 3D spatial position of the origin of the wrist coordinate frame $\{0\}$ with respect to the world frame $\{\mathrm{W}\}$. This is obtained from the measured position $\mathbf{d}_0^w$ of landmark 0, computed by combining MediaPipe's calibrated landmark with its corresponding depth measurement from the RGB-D sensor. Each subsequent node in the network corresponds to the position of the origin of a specific hand joint coordinate frame $\{\mathrm{j}\}$, $j=1, \ldots, 20$, with respect to its parent coordinate frame $\{\mathrm{i}\}$, measured by the corresponding landmark $j$, with edges capturing the conditional dependencies between the coordinate frames of the joints. The network structure reflects the hierarchical coordinate frame definitions of the hand kinematic model. For example, the position of the origin of each child joint coordinate frame (e.g., PIP joints, $\{2\},\{6\},\{10\},\{14\},\{18\}$ in the second layer) is conditioned on the position of the origin of its parent joint coordinate frame (e.g., MCP joints, $\{1\}$, $\{5\},\{9\},\{13\},\{17\}$ in the first layer). This dependency structure ensures that each joint's spatial relationship is computed relative to its parent in the kinematic chain.
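The DAG of Figure 5 can be encoded compactly as a parent map over the 21 landmark indices (a minimal sketch, assuming the MediaPipe landmark numbering shown in Table 1; the helper names are illustrative):

```python
# Parent map encoding the DAG: the wrist (0) is the root; each finger is
# a chain MCP -> PIP -> DIP -> tip, with MCP indices 1, 5, 9, 13, 17.
PARENT = {0: None}
for mcp in (1, 5, 9, 13, 17):
    PARENT[mcp] = 0              # MCP depends on the wrist
    PARENT[mcp + 1] = mcp        # PIP depends on MCP
    PARENT[mcp + 2] = mcp + 1    # DIP depends on PIP
    PARENT[mcp + 3] = mcp + 2    # tip depends on DIP

def ancestors(j):
    """Chain of conditional dependencies from landmark j up to the wrist."""
    chain = []
    while PARENT[j] is not None:
        j = PARENT[j]
        chain.append(j)
    return chain
```

For example, the index fingertip (landmark 8) depends, in order, on its DIP (7), PIP (6), MCP (5), and the wrist (0), which is exactly the path along which the GBN propagates information when a measurement is missing.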

When measured landmark information is available, a forward kinematic solution (or direct kinematics) is applied to estimate and perform kinematic transformations for reconstructing hand joint positions and orientations based on the measured landmark parameters. In cases where detected landmarks are missing or occluded by objects or other parts of the hand, forward estimation alone may yield inaccurate joint angle parameters, resulting in failed graphical reconstructions in Unity. To address such scenarios, an inverse kinematic solution was applied to resolve and identify an optimal set of hand kinematic model parameters. This method corrects and updates the joint angles by refining the forward kinematic model to better fit the available, incomplete data. By leveraging the forward kinematic solution (based on detected hand joint positions and previously estimated joint angles), iterative estimation enables the accurate reconstruction of hand movements and object grasping poses under various conditions. As illustrated in Figure 6, significant outliers typically have unusually large deviations from the other landmarks. A straightforward yet effective method for outlier detection is to compute the median of the 3D coordinates across all 21 landmarks and then measure the distance of each landmark from this median. If the distance exceeds a predefined threshold (e.g., 0.2 meters), the landmark is classified as an outlier.
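The median-distance test described above can be sketched in a few lines (a direct transcription of the stated rule, assuming the landmarks are stacked into a `(21, 3)` array in world coordinates):

```python
import numpy as np

def find_outliers(landmarks, threshold=0.2):
    """Flag landmarks whose distance from the componentwise median of
    the 3D coordinates exceeds `threshold` (metres).
    landmarks: (21, 3) array of world-frame landmark positions.
    Returns the indices of the outlying landmarks."""
    median = np.median(landmarks, axis=0)             # robust centre of the point cloud
    dist = np.linalg.norm(landmarks - median, axis=1) # Euclidean distance to the median
    return np.flatnonzero(dist > threshold)
```

Because the median is insensitive to a small number of grossly wrong points, a single corrupted depth reading does not shift the reference centre, which is what makes this simple test effective in practice.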


The noise model of the sensed measurements of the Intel RealSense D435 has been verified to follow a Gaussian (normal) distribution ${N}\left(0, \sigma^2\right)$ [32], [33], accounting for the uncertainty in the depth sensor's measurements. For a given measured position vector $\mathbf{d}_j^w=\left(x_j, y_j, z_j\right)$ of landmark $j$, obtained from MediaPipe and the depth sensor, the measurement can therefore be interpreted as following a Gaussian distribution, with the mean representing the "kinematic" position of the landmark (the expected or theoretical position of a landmark based on the kinematic chain model of the hand) and the variance $\sigma^2$:

$$\mathbf{d}_j^w \sim \mathcal{N}\!\left(\mathrm{T}_j^i\left(\mathbf{d}_i^w, \Theta_{i, j}, l_{i, j}\right),\ \sigma^2 \mathbf{I}\right)$$
where $\mathrm{T}_j^i\left(\mathbf{d}_i^w, \Theta_{i,j}, l_{i,j}\right)$ is the expected 3D "kinematic" position of landmark $j$, computed through the hierarchical kinematic transformations (defined in the previous section) from its parent joint $\mathbf{d}_i^w$, the joint angle parameter(s) $\Theta_{i,j}$ (e.g., $\Theta_{w,0}=\left(\alpha_0, \beta_0, \gamma_0\right) \in[0, 2\pi]$; for the first-layer joints $i=1,5,9,13,17$, $\Theta_{0,i}=\theta_i$ and $\Theta_{i,i+1}=\left(\beta_i, \gamma_i\right) \in\left[0, \frac{\pi}{2}\right]$; and for the second-layer joints $j=2,6,10,14,18$, $\Theta_{j,j+1}=\beta_j \in\left[0, \frac{\pi}{2}\right]$ and $\Theta_{j+1,j+2}=\beta_{j+1} \in\left[0, \frac{\pi}{2}\right]$), and the fixed link (finger) length $l_{i,j}$, which serves as a constraint defining the feasible range for the angle-parameter estimation in the next section.
These parameters in $\mathrm{T}_j^i\left(\mathbf{d}_i^w, \Theta_{i, j}, l_{i, j}\right)$ define the hand's spatial configuration, including the orientation and physical length of each segment between joints. Through a series of hierarchical transformations, which are based on the principles of forward kinematics, the expected 3D spatial position of landmark $j$ can be calculated.
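As a concrete illustration of one such transformation, the sketch below offsets a parent landmark by the fixed link length along a bone direction rotated by two joint angles. The axis assignment and the reference bone direction `u` are illustrative assumptions; the actual per-joint conventions follow the hierarchy defined in the previous section.

```python
import numpy as np

def rotation(beta, gamma):
    """Rotation about the local y-axis (beta, flexion) followed by the
    z-axis (gamma, abduction). The axis assignment is an illustrative
    assumption, not the paper's exact per-joint convention."""
    Ry = np.array([[np.cos(beta), 0.0, np.sin(beta)],
                   [0.0, 1.0, 0.0],
                   [-np.sin(beta), 0.0, np.cos(beta)]])
    Rz = np.array([[np.cos(gamma), -np.sin(gamma), 0.0],
                   [np.sin(gamma), np.cos(gamma), 0.0],
                   [0.0, 0.0, 1.0]])
    return Rz @ Ry

def kinematic_position(d_parent, beta, gamma, length,
                       u=np.array([1.0, 0.0, 0.0])):
    """Expected child position T_j^i: the parent landmark offset by the
    fixed link length along the rotated bone direction u."""
    return np.asarray(d_parent) + length * (rotation(beta, gamma) @ u)
```

Chaining such calls from the wrist outward along each finger reproduces the hierarchical forward kinematics used to predict every child landmark.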
In contrast, the actual measured position of the landmark from the sensor may deviate from this expected "kinematic" position due to various factors, such as the uncertainty and noise in the depth sensor's measurements or occlusions. By modeling the observed position as a Gaussian distribution centered around the expected "kinematic" position, this uncertainty is accounted for, and the joint angles can be estimated more accurately by comparing measured positions to their corresponding kinematic positions. The probability of observing a 3D position $\mathbf{d}_j^w$, conditioned on the hierarchical kinematic transformation based on its parent landmark $\mathbf{d}_i^w$, the joint angle parameter(s) $\Theta_{i,j}$, and the length $l_{i,j}$, is then represented by the following probability density function:

$$P\left(\mathbf{d}_j^w \mid \mathrm{T}_j^i\left(\mathbf{d}_i^w, \Theta_{i,j}, l_{i,j}\right), \sigma\right)=\frac{1}{\left(2 \pi \sigma^2\right)^{3/2}} \exp \left(-\frac{\left\|\mathbf{d}_j^w-\mathrm{T}_j^i\left(\mathbf{d}_i^w, \Theta_{i,j}, l_{i,j}\right)\right\|^2}{2 \sigma^2}\right)$$
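This density can be evaluated directly (a minimal sketch; the isotropic three-dimensional Gaussian form follows the noise model above, while the function name is ours):

```python
import numpy as np

def landmark_likelihood(d_obs, d_kin, sigma):
    """Isotropic 3D Gaussian density of the measured landmark position
    d_obs around its kinematic prediction d_kin."""
    r2 = float(np.sum((np.asarray(d_obs) - np.asarray(d_kin)) ** 2))
    norm = (2.0 * np.pi * sigma ** 2) ** -1.5   # 3D normalizing constant
    return norm * np.exp(-r2 / (2.0 * sigma ** 2))
```

With, say, $\sigma = 1$ cm, a landmark displaced 20 cm from its kinematic prediction receives an essentially zero density, which is exactly the behavior that exposes heavily corrupted landmarks in the discussion that follows.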
Ideally, in the absence of noise, each measured position $\mathbf{d}_j^w$ would be exactly equal to the kinematic position given by the transformation $\mathrm{T}_j^i\left(\mathbf{d}_i^w, \Theta_{i,j}, l_{i,j}\right)$, leading to a probability of 1. However, sensor noise is inevitable, meaning the measured data will always deviate from the ideal kinematic positions. For instance, as illustrated in Figure 6, the depth values for the third and the fingertip landmarks of the index finger (joints 7 and 8) are particularly inaccurate. Therefore, the computed distance $\left\|\mathbf{d}_7^w-\mathrm{T}_7^6\left(\mathbf{d}_6^w, \beta_6, l_{6,7}\right)\right\|$, where $\mathrm{T}_7^6\left(\mathbf{d}_6^w, \beta_6, l_{6,7}\right)$ is the estimated position of landmark 7 derived from the kinematic model using the measured position of its parent landmark $\mathbf{d}_6^w$, the resolved finger joint angle $\Theta_{6,7}=\beta_6$, and the length constraint $l_{6,7}$, remains large due to noise in the measured position $\mathbf{d}_7^w$. This leads to the following exponential term approaching zero:

$$\exp \left(-\frac{\left\|\mathbf{d}_7^w-\mathrm{T}_7^6\left(\mathbf{d}_6^w, \beta_6, l_{6,7}\right)\right\|^2}{2 \sigma^2}\right) \rightarrow 0$$
As a result, the probability $P\left(\mathbf{d}_7^w \mid \mathrm{T}_7^6\left(\mathbf{d}_6^w, \beta_6, l_{6,7}\right), \sigma\right)$ is close to 0, indicating a poor match between the kinematic model and the measured positions. A similar situation occurs for joint 8, where $P\left(\mathbf{d}_8^w \mid \mathrm{T}_8^7\left(\mathbf{d}_7^w, \beta_7, l_{7,8}\right), \sigma\right)$ is likewise close to 0. In the presence of such anomalies, this study presents an optimal estimation of kinematic model parameters that enables, in particular, the resolution of the joint angle parameters $\Theta_{i,j}$ that best explain (or fit) the measured data corresponding to hand movements.
To estimate the optimal set of parameters $\Theta_{i,j}$ and $\mathbf{d}_i^w$ that explain the observed data, Maximum Likelihood Estimation (MLE) was employed [34]. The likelihood function is the joint probability of observing all the measured landmark positions, assuming the parameters $\Theta_{i,j}$ and the fixed link lengths $l_{i,j}$ are known. Given the measurement noise model, the likelihood function for a single joint observation $\mathbf{d}_j^w$ can be expressed as:

$$L\left(\Theta_{i,j}, \mathbf{d}_i^w \mid \mathbf{d}_j^w\right)=P\left(\mathbf{d}_j^w \mid \mathrm{T}_j^i\left(\mathbf{d}_i^w, \Theta_{i,j}, l_{i,j}\right), \sigma\right)$$
The goal of MLE is to find the parameter values that maximize the likelihood of the observed data, which, in this case, are the measured 3D positions of the hand landmarks $\mathbf{d}_j^w$ obtained from the sensor. Since there are 21 landmark measurements (including landmark 0) defined by MediaPipe and obtained through the Intel D435 RGB-D sensor, this study aims to estimate how well the optimal sets of parameters for the hand kinematic model describe the hand pose during movement. To accomplish this, this study seeks to maximize the overall likelihood, which is the product of the individual probabilities for each observation. Mathematically, this can be represented as the joint likelihood of all 21 landmarks in each frame:

$$L=\prod_{j=0}^{20} P\left(\mathbf{d}_j^w \mid \mathrm{T}_j^i\left(\mathbf{d}_i^w, \Theta_{i,j}, l_{i,j}\right), \sigma\right)$$
In practice, the log-likelihood function, which converts the product of probabilities into a sum of log-probabilities, is maximized to simplify computation. This transformation allows us to more easily manage the complexity of the estimation process:

$$\log L=\sum_{j=0}^{20} \log P\left(\mathbf{d}_j^w \mid \mathrm{T}_j^i\left(\mathbf{d}_i^w, \Theta_{i,j}, l_{i,j}\right), \sigma\right)=-\frac{63}{2} \log \left(2 \pi \sigma^2\right)-\frac{1}{2 \sigma^2} \sum_{j=0}^{20}\left\|\mathbf{d}_j^w-\mathrm{T}_j^i\left(\mathbf{d}_i^w, \Theta_{i,j}, l_{i,j}\right)\right\|^2$$
This study further assumes that $\sigma$ remains constant, so the maximization can focus solely on the second term of the log-likelihood function:

$$\max_{\Theta, \mathbf{d}}\left(-\frac{1}{2 \sigma^2} \sum_{j=0}^{20}\left\|\mathbf{d}_j^w-\mathrm{T}_j^i\left(\mathbf{d}_i^w, \Theta_{i,j}, l_{i,j}\right)\right\|^2\right)$$
This translates into minimizing the sum of squared errors (SSE) between the observed landmark positions $\mathbf{d}_j^w$ and the predicted kinematic positions $\mathrm{T}_j^i\left(\mathbf{d}_i^w, \Theta_{i,j}, l_{i,j}\right)$:

$$\min_{\Theta, \mathbf{d}} \sum_{j=0}^{20}\left\|\mathbf{d}_j^w-\mathrm{T}_j^i\left(\mathbf{d}_i^w, \Theta_{i,j}, l_{i,j}\right)\right\|^2 \qquad (8)$$
To find the optimal set of angle parameters $\Theta$ and position vectors $\mathbf{d}$, the error in the predicted kinematic model is minimized by solving the least squares problem presented in Eq. (8). The objective is to reduce the discrepancy between the measured and model positions, ensuring the estimated parameters fit the observed hand movements as closely as possible. However, as mentioned before, in practical applications, hand landmark detection can be prone to failures or erroneous results due to factors such as occlusions, noise effects, or sensor limitations. These issues may produce outliers in depth measurements, such as zero values (in object-occluded or self-occluded cases) or incorrectly assigned background values (e.g., from a white background wall). Depth measurement inaccuracies directly impact the transformation from 2D pixel coordinates $\mathbf{p}_i=\left(u_i, v_i\right)$ to 3D spatial coordinates $\mathbf{d}_i=\left(x_i, y_i, z_i\right)$, as this conversion relies critically on accurate depth data $d\left(u_i, v_i\right)$ together with the intrinsic camera parameters, specifically the focal lengths $f_x$ and $f_y$ and the principal point offsets $c_x$ and $c_y$:

$$x_i=\frac{\left(u_i-c_x\right) d\left(u_i, v_i\right)}{f_x}, \quad y_i=\frac{\left(v_i-c_y\right) d\left(u_i, v_i\right)}{f_y}, \quad z_i=d\left(u_i, v_i\right)$$
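This 2D-to-3D conversion is the standard pinhole back-projection and can be sketched as follows (the function name is ours):

```python
import numpy as np

def pixel_to_3d(u, v, depth, fx, fy, cx, cy):
    """Back-project a pixel (u, v) with measured depth d(u, v) into 3D
    camera coordinates using the pinhole camera model."""
    x = (u - cx) * depth / fx
    y = (v - cy) * depth / fy
    return np.array([x, y, depth])
```

Note that a zero depth reading maps the landmark to the camera origin regardless of its pixel coordinates, which is precisely how the occlusion-induced all-zero outliers described above arise.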
Minimizing the 3D coordinate error in such cases can lead to suboptimal parameter estimates, especially when multiple outliers are present. To address these outliers, a binary confidence score $o_j \in\{0,1\}$ was introduced for each landmark $j$. This score indicates whether the measured landmark is an outlier, where $o_j=1$ denotes an outlier and $o_j=0$ represents a reliable measurement. By incorporating this confidence score into the proposed optimization process, unreliable landmarks can be selectively excluded from influencing the parameter estimation. The modified objective function to handle these outliers is then expressed as:

$$\min_{\Theta, \mathbf{d}} \sum_{j=0}^{20}\left(1-o_j\right)\left\|\mathbf{d}_j^w-\mathrm{T}_j^i\left(\mathbf{d}_i^w, \Theta_{i,j}, l_{i,j}\right)\right\|^2$$
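One way to realize this confidence-weighted objective is to zero out the contribution of flagged landmarks (a sketch; weighting each residual by $1-o_j$ is our reading of "selectively excluded", and the function name is ours):

```python
import numpy as np

def masked_sse(d_measured, d_kinematic, outlier):
    """Sum of squared landmark errors, excluding landmarks flagged as
    outliers (o_j = 1). Positions are (N, 3) arrays; outlier is a 0/1
    vector of length N."""
    w = 1.0 - np.asarray(outlier, dtype=float)   # weight 0 for outliers
    err = np.sum((np.asarray(d_measured) - np.asarray(d_kinematic)) ** 2,
                 axis=1)
    return float(np.sum(w * err))
```

With all $o_j=0$ this reduces to the plain SSE of Eq. (8), so the outlier mask only removes terms and never distorts the reliable residuals.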
4. Qualitative Experimental Study
This section presents sample qualitative results for two sets of experiments involving tracking and graphical reconstruction of hand movements toward a cup for grasping, with measured observations collected from two different perspectives. These experiments are designed to test the robustness of the proposed estimation approach under varying conditions, including various cases of occlusion. The tracking and reconstruction were initialized using the kinematic model parameters to calculate the initial joint angles and position of the virtual hand model (Figure 7). Initially, the parameters ($\mathbf{d}_i^w$ for $i=0, \ldots, 20$) are captured by MediaPipe and the depth sensor, while the link lengths $l_{i,j}$ are fixed. The joint angle parameters $\Theta_{i,j}$ can either be initialized to zero (if no prior information is available, e.g., the starting point of a hand movement) or set based on prior knowledge (e.g., the estimated hand pose from the previous frame). In the case studies of this research, hand movements were recorded as sequences of RGB-depth images to analyze different motions. For the computation of coordinates, the current parameters ($\mathbf{d}_i^w$, $\Theta_{i,j}$, and $l_{i,j}$) were used for each parent landmark $i$, and the hierarchical transformations $\mathrm{T}_j^i\left(\mathbf{d}_i^w, \Theta_{i,j}, l_{i,j}\right)$ were performed to compute the estimated $\mathbf{d}_j^w$ for each child landmark of the hand. The accuracy of the current parameter estimation was evaluated by comparing the estimated coordinates $\mathbf{d}_j^w$ with the measured coordinates obtained from the MediaPipe hand model and the RGB-D sensor. Finally, the objective function's value (or "loss") was used to optimize the model parameters using a gradient-based method.
The first grasping task case study, in which the hand approaches a cup from above in a front-facing view, is shown in Figure 8 and Figure 9. This study presents the results of hand tracking and reconstruction during the initial approach phase, followed by two intermediate pre-grasping poses, and finally the full grasping phase. A second grasping task case study, shown in Figure 10 and Figure 11, involves the hand moving along a diagonal trajectory from the upper left to the lower right with respect to the sensor. In this case, tracking data is collected from a side view, culminating in a full grasp of the cup handle. Using a similar analysis to the previous experiment, the performance of the estimation method was evaluated along this alternative trajectory and viewpoint, with a focus on the model’s adaptability to changes in perspective and motion path.

Figure 8a, Figure 8c, Figure 8e, Figure 8g show RGB hand motion tracking as the hand approaches the cup: directly facing the sensor, during contact, partial grasp with self-occlusion, and full grasp with heavy occlusion. Figure 8b, Figure 8d, Figure 8f, Figure 8h show the measured (black) and estimated (purple) 3D spatial landmark points.
Unity graphical reconstructions using forward kinematics are shown in Figure 9a, Figure 9c, Figure 9e, Figure 9g, and refined Unity graphical reconstructions with estimation are shown in Figure 9b, Figure 9d, Figure 9f, Figure 9h.
Figure 10a, Figure 10c, Figure 10e, Figure 10g show RGB hand motion tracking from the side view as the hand approaches and grasps the cup handle: the initial configuration, fingertip contact with the handle, partial grasp, and full grasp with object occlusion. Figure 10b, Figure 10d, Figure 10f, Figure 10h show the measured (black) and estimated (purple) 3D spatial landmark points.
Unity graphical reconstructions using forward kinematics are shown in Figure 11a, Figure 11c, Figure 11e, Figure 11g, and refined Unity graphical reconstructions with estimation are shown in Figure 11b, Figure 11d, Figure 11f, Figure 11h.
The following observations concern the case studies presented in Figure 8 and Figure 9 (Case 1) and Figure 10 and Figure 11 (Case 2). For the case of a hand approaching and grasping a cup from the top, with tracking measurements collected by the RGB-D sensor from a frontal view of the hand, the following are observed for the selected sequences in Figure 8. In non-occlusion case 1 ($\alpha$ Instant), at the initial state of motion, three outliers were detected for the fingertip landmarks. In non-occlusion case 1 ($\beta$ Instant), during the pre-grasping phases, nine outliers were detected. In occlusion case 1 ($\gamma$ Instant), self-occlusion occurred, resulting in missing ring finger landmark measurements. In occlusion case 1 ($\delta$ Instant), self-occlusion was detected, with missing landmark readings of the palm. For the case of the hand approaching and grasping the cup by its handle, with RGB-D tracking measurements obtained from the side view of the hand (Figure 10), the following instances are selected and analyzed. In non-occlusion case 2 ($\alpha$ Instant), when the hand is at its initial motion configuration, three landmark outliers were detected. In object-occlusion case 2 ($\beta$ Instant), when the fingertips are touching the cup handle, landmark outliers were detected. In occlusion case 2 ($\gamma$ Instant), when the fingers are grasping the handle, ambiguous fingertip landmark measures were observed. In occlusion case 2 ($\delta$ Instant), when the handle of the cup is fully grasped, ambiguous finger landmark measures were present. The following presents detailed analysis and discussion for two sample instances mentioned above (Figure 8 and Figure 10).
This section presents some additional details associated with the $\gamma$ instance of the hand tracking and reconstruction in the frontal measure case (Figure 8). This case, as stated above, involves self-occlusion of the hand with a missing measure of the ring finger. The MCP, PIP, and DIP joints of the index, middle, and ring fingers appear visually clustered and ambiguous, with minimal differences in their pixel coordinates (Table 3). Consequently, their computed 3D spatial coordinates converge, appearing closely grouped even for joints not identified as outliers. This clustering of ambiguous data is illustrated in Figure 8b.
| Finger | Root/MCP | PIP | DIP | Tip |
|---|---|---|---|---|
| Wrist | p0 = (301, 242) | | | |
| Thumb | p1 = (276, 250) | p2 = (245, 246) | p3 = (216, 250) | p4 = (196, 254) |
| Index | p5 = (266, 224) | p6 = (257, 208) | p7 = (250, 206) | p8 = (241, 209) |
| Middle | p9 = (292, 218) | p10 = (291, 197) | p11 = (287, 192) | p12 = (283, 199) |
| Ring | p13 = (317, 219) | p14 = (323, 200) | p15 = (328, 200) | p16 = (329, 205) |
| Little | p17 = (337, 225) | p18 = (356, 210) | p19 = (371, 210) | p20 = (382, 217) |
As shown in Table 4, five landmarks associated with the joints are detected as missing, namely the DIP joint of the middle finger and all four joints of the ring finger. Additionally, the index finger's PIP, DIP, and tip joints are not detected as outliers due to their near-identical values. This observation further demonstrates the need for the kinematic model constraints within the proposed estimation techniques to distinguish individual joints accurately under occlusion conditions. By constraining finger lengths and estimating missing values, a realistic representation of finger poses can be restored despite the occlusion-induced data loss, thereby maintaining consistency with the hand's natural structure.
Furthermore, it can be seen that in the absence of the root joint for the ring finger, determining the hand palm as a planar reference becomes infeasible. This lack of a reference plane disrupts the kinematic calculations, leading to failures in most parts of the model due to missing values and instances of self-occlusion. In these scenarios, multiple points converge spatially, resulting in nearly indistinguishable joint angles, which hinders accurate angle calculations. As seen in the Unity reconstruction (Figure 8c), these limitations manifest as an unnatural hand orientation, with almost all fingers compressed together. This distortion occurs because the lack of critical reference points and reliable spatial separation between joints leaves the kinematic model with insufficient information to accurately resolve individual finger positions and orientations. Consequently, the constraints typically applied to enforce anatomical structure are ineffective, causing fingers to "collapse" into each other.
| Finger | Root/MCP | PIP | DIP | Tip |
|---|---|---|---|---|
| Wrist | (-0.02, -0.01, 0.57) | | | |
| Thumb | (-0.04, 0.00, 0.57) | (-0.07, -0.00, 0.56) | (-0.09, 0.00, 0.54) | (-0.11, 0.01, 0.53) |
| Index | (-0.05, -0.02, 0.52) | (-0.05, -0.03, 0.46) | (-0.05, -0.03, 0.45) | (-0.06, -0.03, 0.44) |
| Middle | (-0.03, -0.02, 0.51) | (-0.02, -0.04, 0.44) | (0.00, 0.00, 0.00) | (-0.03, -0.03, 0.43) |
| Ring | (0.00, 0.00, 0.00) | (0.00, 0.00, 0.00) | (0.00, 0.00, 0.00) | (0.00, 0.00, 0.00) |
| Little | (0.00, 0.00, 0.00) | (0.03, -0.03, 0.50) | (0.04, -0.03, 0.48) | (0.05, -0.02, 0.47) |
The inset focuses on iterations 5 to 36, showing gradual loss reduction after the initial sharp decline.
To achieve optimal parameter values of $\Theta$, the numerical optimization methods highlighted in the previous section were used to minimize the objective function in Eq. (8), refining the estimated parameters to closely fit the observed hand motions while preserving biomechanical constraints. As illustrated in Figure 7, the iterative optimization loop adjusts the parameters of the kinematic model to reduce estimation errors. Initially, the spatial coordinates ($\mathbf{d}_i^w$ for $i=0, \ldots, 20$) are obtained from MediaPipe and the depth sensor, with the link lengths $l_{i,j}$ fixed. The optimization proceeds by evaluating the objective function (or "loss") and adjusting the model parameters using gradient-based methods. To analyze the performance of the proposed estimation algorithm, the objective function's decreasing trend is plotted in Figure 12. Examining the error reduction trend in Figure 12 reveals substantial decreases in error across key iterations (from iteration 0 to 1, 1 to 5, and 5 to the final iteration at 36), with reductions of 0.35, 0.07, and 0.002, respectively. These reductions indicate clear convergence toward minimizing error as the algorithm iteratively adjusts parameters, refining the estimated poses to better align with the measured data. Through iterative refinement, the optimization process successfully restores anatomical coherence to the hand model by adjusting joint orientations and positions, effectively compensating for the missing data. This outcome underscores the robustness of the estimation method in dealing with occlusions and highlights the value of iterative optimization for achieving realistic motion representation. The estimated 3D spatial coordinates of the missing landmarks, along with a comparative analysis of Unity reconstructions with and without the application of the estimation process, are presented in Table 5, Figure 8c, and Figure 8d.
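The iterative refinement loop can be illustrated on a toy one-joint problem (a sketch under simplifying assumptions: a single flexion angle, a noiseless target, and SciPy's L-BFGS-B standing in for the paper's gradient-based optimizer; the $[0, \pi/2]$ bound mirrors the joint-angle constraint):

```python
import numpy as np
from scipy.optimize import minimize

# Toy setup: one link of fixed length rotated by a single flexion angle beta.
length = 0.04                  # link length l_{i,j} (metres), illustrative
d_parent = np.zeros(3)         # parent landmark position
beta_true = 0.6                # ground-truth flexion angle (radians)
d_measured = d_parent + length * np.array(
    [np.cos(beta_true), 0.0, np.sin(beta_true)])

def predict(beta):
    """Forward kinematic position of the child landmark."""
    return d_parent + length * np.array([np.cos(beta), 0.0, np.sin(beta)])

def loss(theta):
    """Squared error between measured and predicted positions
    (a one-joint analogue of the SSE objective in Eq. (8))."""
    return float(np.sum((d_measured - predict(theta[0])) ** 2))

trace = []                     # loss recorded at each iteration
res = minimize(loss, x0=[0.0], bounds=[(0.0, np.pi / 2)],
               callback=lambda xk: trace.append(loss(xk)))
```

Here `res.x` recovers the flexion angle and `trace` exhibits the decreasing loss trend analogous to Figure 12, with the bound keeping every iterate anatomically feasible.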

| Joints | |||
Finger | Root/MCP
| PIP | DIP | Tip |
Wrist | (-0.02, -0.00, 0.59) | |||
Thumb | (-0.04, -0.01, 0.56) | (-0.07, -0.00, 0.55) | (-0.10, 0.00, 0.54) | (-0.12, 0.01, 0.53) |
Index | (-0.05, -0.02, 0.51) | (-0.05, -0.03, 0.47) | (-0.05, -0.03, 0.45) | (-0.05, -0.03, 0.44) |
Middle | (-0.03, -0.03, 0.51) | (-0.03, -0.04, 0.47) | (-0.03, -0.04, 0.43) | (-0.03, -0.04, 0.43) |
Ring | (-0.01, -0.03, 0.51) | (0.00, -0.03, 0.48) | (0.00, -0.04, 0.46) | (0.01, -0.03, 0.44) |
Little | (0.03, -0.03, 0.52) | (0.04, -0.03, 0.50) | (0.04, -0.03, 0.48) | (0.05, -0.02, 0.47) |
This instant of the hand-grasping phase is shown in Figure 10, corresponding to the last phase of the grasping hand approaching the cup handle while the tracking information is collected from the side using the RGB-D sensor. In this full-grasp case, all four fingers rotate backward in the direct graphical reconstruction, with each joint (MCP, PIP, DIP, and tip) approaching a 90-degree rotation. This scenario presents a challenging case of object occlusion, as the MCP joints of all four fingers are obscured by the cup handle and the other finger segments. Additionally, the measured data from the depth sensor becomes nearly collinear, deviating from the true finger configuration given the imposed joint angle constraints limiting rotations between 0 and 90 degrees. Despite the dominant presence of ambiguities in the landmark measurements, which resulted in infeasible joint angles, the proposed method offers a robust joint angle estimation. By incorporating prior knowledge, kinematic constraints, and initial joint angle measures, it compensates for the collinearity and occlusion-induced errors (Figure 13). The model's optimization process handles the unnatural alignment suggested by the collinear data and adjusts the estimated joint positions accordingly, preserving the realistic articulation of the fingers. This allows a coherent representation of the hand pose to be maintained, even in the presence of object occlusion, as shown in Figure 10d and Figure 14.






5. Conclusion
The proposed estimation method demonstrates considerable robustness in handling occlusions, noise, and incomplete data during hand pose tracking and grasping tasks. By leveraging a Bayesian network model, a structured framework was established that captures the hierarchical dependencies among hand joints, allowing the system to effectively compensate for missing depth values and occluded joints. This approach, combined with kinematic constraints, enables accurate hand pose estimation even in complex scenarios (e.g., occlusions caused by object grasping) where sensor data may become unreliable or collinear. The integration of prior knowledge and structural constraints enables the model to produce reasonable approximations despite challenging data conditions. A key limitation of the current method lies in its computational cost. While the Bayesian network is effective at handling noise and missing data, it depends on an accurate understanding of the kinematic structure and inter-joint dependencies to function optimally. The reliance on MLE and iterative optimization techniques introduces computational overhead, which poses challenges for real-time applications that demand rapid processing.
In the current setup, object detection and MediaPipe hand tracking operate efficiently, with execution times of approximately 10 ms and 60 ms per frame, respectively. This efficiency is largely attributed to the extensive training data and optimized model architecture underlying MediaPipe, enabling robust and fast performance. In contrast, the MLE-based parameter estimation is significantly more time-consuming, requiring between 1.5 and 6 seconds per frame. This delay arises from the complexity of aligning point cloud data with the kinematic model, as well as the iterative optimization required to estimate joint positions accurately. Each step involves fine-tuning parameters to satisfy hierarchical constraints while fitting the observed data, making the process computationally intensive. Furthermore, the RGB-D sensor used in the proposed setup has a frame rate of approximately 15 frames per second (FPS), underscoring the mismatch between data acquisition speed and the processing speed of the proposed estimation pipeline.
To address these limitations, several strategies can be explored to enhance the computational efficiency of the estimation framework. These include the adoption of faster optimization techniques, such as diffusion models, which manage uncertainty through probabilistic distributions and have shown promising performance in 3D generative tasks, including unseen point cloud generation and missing part completion. Incorporating prior knowledge through deep learning methods, such as selective optimization guided by learned affordances (e.g., Affordance Diffusion), may accelerate estimation by filtering out inaccurate landmark detections. In addition, hardware acceleration strategies (e.g., edge computing) can enable parallel computations across distributed nodes, further improving efficiency. Together, these advancements present a promising pathway for overcoming current computational bottlenecks and achieving real-time hand pose estimation, thereby increasing the system’s practicality and enabling further comparative studies with existing methods.
Conceptualization, S.P.; methodology, S.P. and Y.Y.D.; software, Y.Y.D.; validation, Y.Y.D.; resources, S.P.; original draft preparation, S.P. and Y.Y.D.; writing — review and editing, S.P. and Y.Y.D.; visualization, Y.Y.D.; supervision, S.P.; funding acquisition, S.P. All authors have read and agreed to the published version of the manuscript.
The data used to support the research findings are available from the corresponding author upon request.
The authors declare no conflict of interest.
