Self-learning of Robot Arm using a Modified Soft-Actor-Critic (SAC) Approach: Complex 6D Pose Brick Grasping Test

Mehrotra Abhishek; Hwang Yi

doi:10.55785/JCAR.3.2.1

Research Article

Journal of Construction Automation and Robotics. 28 June 2024. 1-10
https://doi.org/10.55785/JCAR.3.2.1

Korean title (translated): Self-learning of an Industrial Robot Arm using a Modified SAC Algorithm: Free-state 6D Brick Gripping Test

Mehrotra Abhishek¹

Hwang Yi²^*

¹Ph. D. Student, Department of Architecture, Korea University

²Corresponding Author, Associate Professor, Department of Architecture, Korea University

*Corresponding Author

License (open-access, http://creativecommons.org/licenses/by-nc/3.0/):

This is an Open Access article distributed under the terms of the Creative Commons Attribution Non-Commercial License (http://creativecommons.org/licenses/by-nc/3.0/) which permits unrestricted non-commercial use, distribution, and reproduction in any medium, provided the original work is properly cited.

ABSTRACT

Previous studies have not addressed robot arm grasping for objects with complex 6D poses using reinforcement learning (RL) when the pose is known beforehand. This is crucial for automation in architecture and related fields. To address this gap, we aimed to achieve precise grasping of objects with complex poses. We employed the Soft Actor-Critic (SAC) baseline and its two variants (using median and mean Q values) to demonstrate the limitations of standard RL algorithms and the effectiveness of modified algorithms for this complex task. The trained models were evaluated using statistical metrics, including average episodic rewards, steps, and success rates. Our results indicate that both SAC variants effectively trained the agent to achieve the target 6D pose, with the mean variant performing slightly better than the median variant. Test rollouts demonstrate that the mean variant of the SAC algorithm exhibits superior performance in attaining positions and poses that are more proximate to the target object compared to the median variant.

Keywords

Robot Arm

Modified SAC

Complex Pose

Grasping

Reinforcement Learning

Architecture

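Industrial robot arms can assist in recognizing, transporting, and installing construction materials of various sizes and shapes. This study tests and analyzes algorithm performance in accurately grasping objects with complex 6D poses using a gripper, with the aim of robot automation through artificial intelligence (AI) reinforcement learning. To overcome the wide variance and training cost of standard reinforcement learning algorithms, the authors developed a modified Soft Actor-Critic (SAC) with two variants (using median and mean Q values) and applied it to grasping 3D bricks. The trained models were evaluated using statistical metrics including average episodic reward, steps, and success rate. The experimental results demonstrated that both SAC variants can effectively train the agent to achieve the target 6D pose, with the mean variant performing better than the median variant.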

MAIN

1. Introduction
1.1 Background
1.2 Literature review
1.3 Research gap and objective
2. Material and method
2.1 Environment description
2.2 RL training
2.3 Metrics for learning
3. Results and Discussion
4. Conclusion

1. Introduction

1.1 Background

Automation in architecture is still in its early stages and requires planning modules to be integrated with design. To advance further, robot arms must also be capable of grasping objects with complex orientations. Robotic grasping is fundamental to many industrial and field tasks. Because it depends on an object's shape, position, and orientation (pose), it has several variants, namely 2D, 3D, and 6D pose estimation and grasping, which are used in stacking, pick-and-place, and other complex robotic tasks. Although previous studies have addressed grasping objects of various shapes and orientations, there is a lack of research on grasping objects with complex poses that combine azimuthal and altitude angles. Most studies have focused solely on azimuthal angles and often involve grasping objects from a fixed direction. Therefore, it is crucial to address this knowledge gap regarding the grasping of objects in complex poses and to explore RL-based solutions in robotic grasping for architectural applications. In our study, we aim to grasp objects with complex poses using a modified SAC algorithm with a six-joint industrial robot arm. We formulated a problem statement involving object grasping with poses generated by random rotations around the x and z axes. Two variants of modified SAC, along with a baseline SAC (Li et al., 2022; Shahid et al., 2022), were applied to train the robot to grasp bricks.

1.2 Literature review

Some studies have explored object grasping with various objectives without using RL to achieve the target pose. Choi et al. (2018) employed a 3D Convolutional Neural Network (CNN) to achieve an 87% success rate in grasping unknown objects with a soft hand. Liang-shan et al. (2019) introduced a multi-stage grasping method using undirected graphs for posture matching and robot motion node generation. Similarly, Deng et al. (2020) developed a self-supervised algorithm for 6D pose estimation of objects through continuous robot interaction and image collection. Fu et al. (2020) utilized point-pair features to estimate grasping poses of rigid objects, demonstrating high success rates in real environments. Cheng et al. (2021) estimated 3-Degrees of Freedom (DOF) grasps for various-sized objects using a single RGB image, training their model on the Cornell Grasp dataset with favorable real-time results. De-Oliveira et al. (2021) presented a 6D grasping method with a region generation algorithm, generalizable for various shapes without retraining. Li et al. (2021) used You Only Look Once 4 (YOLO4) and GrabCut algorithms to improve grasping success for a mobile robot arm by identifying and segmenting objects. Junare et al. (2022) introduced a deep learning-based model for grasping objects with a 5DOF robotic arm. Li et al. (2023) enhanced grasp detection in cluttered environments using a novel sampling method for generating point proposals. Liu et al. (2023) proposed a dense pixel-wise prediction model for industrial parts' pose estimation, showing robustness in cluttered environments through key point detection and the uncertainty Perspective-n-Point (PnP) method. Wang et al. (2023) developed a pose estimation scheme using Convolutional Block Attention Module (CBAM) and Pyramid Pooling Module (PPM) algorithms for differently shaped objects. Finally, Liu et al. (2024) introduced a Bidirectional Deep Residual Fusion Network (BDR6D) for 6D pose estimation, integrating RGB and depth information with the SuperPoint-FPS keypoint algorithm. A significant limitation of these studies is their reliance on pose estimation and object grasping with planar orientations, excluding the use of reinforcement learning.

Several studies have applied reinforcement learning (RL) to robotic arm grasping applications. For instance, Park et al. (2020) developed an intelligent robot gripping system using CAD files and 3D scanned point clouds, employing Proximal Policy Optimization (PPO) as the RL algorithm, and demonstrated strong performance in simulations. Similarly, Al-Shanoon and Lang (2021) presented a technique for grasping unfamiliar objects of various shapes using DenseNet and Deep Q Network (DQN), with their deep reinforcement grasp policy (DRGP) providing Q values and orientation heatmaps for image pixels. Mohammed et al. (2021) introduced the Multi-View Change Observation-Based Approach (MV-COBA) for grasping occluded or tightly packed objects using Q learning, achieving a promising success rate in simulations with images from two cameras. Shukla et al. (2020) addressed intelligent object manipulation by decomposing it into position and orientation learning, utilizing a genetic algorithm, a regression-based method, and a pseudoinverse model for position, and RL for orientation. Li et al. (2022) proposed a data-driven approach to generate contact points on an object's surface using a multi-finger hand gripper, comparing its efficiency with EfficientGrasp and UniGrasp methods. Liu et al. (2022) developed a digital twin-based solution for industrial object grasping with DQN, showing higher success rates with transfer learning compared to direct real environment training. Shahid et al. (2022) compared PPO and SAC RL algorithms for grasping and lifting objects of three shapes, transferring the learned behavior to a real robot using zero-shot transfer for various object positions and configurations. Wang et al. (2022) introduced a hierarchical framework for collision-free grasping with partial point cloud observations, incorporating plan selection and an option classifier within a Hierarchical Grasping Policy (HGP). Zuo et al. (2023) presented a graph-based Deep RL approach for exploring and manipulating invisible objects, using explorer and coordinator modules with feature extractor inputs, achieving commendable success rates in both simulated and real environments. A notable limitation in these studies is their focus on object grasping with simple poses, without utilizing reinforcement learning to achieve the target grasping pose.

1.3 Research gap and objective

Previous studies on computer vision models for pose estimation and grasping primarily used images as input. However, these studies have several gaps. They considered only planar poses of objects involving azimuthal angles, neglecting complex 3D angles. Additionally, RL was rarely applied, and when used, it was limited to either pose estimation or position identification, followed by grasping with inverse kinematics or other methods. Thus, it is crucial to explore how a robotic agent can reach complex object poses and grasp them using RL in the architectural domain.

To address this gap, we examine the efficacy of the deep RL (DRL) algorithm SAC and its modified versions in enabling a robotic arm to reach complex grasping poses when the object pose is already known (via prior vision estimation). We hypothesize that reaching complex object poses, determined by vision, is a challenging task that standard RL algorithms cannot easily accomplish. Our focus is on using self-learning to achieve the target grasping pose. We implemented SAC and two of its variants to evaluate their performance in reaching the target pose. Our main contributions to the existing research are summarized as follows:

• We introduce, for the first time, the self-training of a robot arm to achieve a complex brick pose in an architectural site, showcasing real-world scenarios.

• We introduce a modified SAC with two variants, marking the first application of such variations in architectural contexts, and compare their effectiveness.

2. Material and method

This study focused on grasping after pose estimation, using RL. A KUKA robot arm was employed to grasp bricks with complex poses within a large search volume in a simulation space. RoboDK served as the simulation platform, and Python (version 3.11.7) was used to train the RL agents. The goal was to reach randomly generated brick grasp poses to within a specified threshold distance and angle.

The 6D pose of the object was set randomly in the simulation platform, and the robot arm searched for this pose during RL training by adjusting its joint angles. Python served as the link between RoboDK and agent training in a Jupyter notebook. Fig. 1 shows the framework developed in this study, and Fig. 2 depicts the brick and gripper dimensions. To model the environment, we used the widely adopted 'OpenAI Gym' library (version 0.26.2) and linked it with the RoboDK software (version 5.6.1) for the robotic arm simulation. PyTorch (version 2.0.1) and Stable-Baselines3 (version 2.0.0) were used as the RL libraries, and agents were trained with three variants of the SAC algorithm. In this setup, the agent models were based on continuous observation and action spaces. Fig. 3 presents the search space for the target grasping position and orientation and the search volume of the robotic arm, and also depicts the experimental setup.

https://cdn.apub.kr/journalsite/sites/ksarc/2024-003-02/N0410030201/images/ksarc_03_02_01_F1.jpg

Figure 1.

Developed framework

https://cdn.apub.kr/journalsite/sites/ksarc/2024-003-02/N0410030201/images/ksarc_03_02_01_F2.jpg

Figure 2.

Dimensions of gripper and brick

https://cdn.apub.kr/journalsite/sites/ksarc/2024-003-02/N0410030201/images/ksarc_03_02_01_F3.jpg

Figure 3.

Environment description

Grasping objects in close proximity with a robot arm is essential for most robotic tasks and becomes more complex with increased DOF. Higher DOF increases the likelihood of undesirable postures in Cartesian coordinates, complicating RL in these coordinates. Thus, joint angles are preferable for RL, despite the complexity of managing a larger number of joints. Joint angles are better suited for continuous space problems, where determining precise combinations in discrete space is challenging. In our research, we used a 6-DOF KUKA KR50 R2500 robot arm with a 2500 mm reach and an OnRobot 2FGP20 gripper. The grasp target was considered reached when the end effector was within 10 mm in Cartesian position and 15 degrees in orientation of the target pose. Detailed task descriptions are provided in the following sub-sections.

2.1 Environment description

In this study, we randomly positioned and oriented a brick, fixing one corner as the base position point and treating rotations around the z and x axes as the azimuth and altitude angles, respectively. The target pose was defined as a point 200 mm from the base position, maintaining the same 3D orientation as the brick. Thus, the target pose included both target position and target orientation. The search volume for the target position was within the ranges x: 800-2000 mm, y: 400-1400 mm, and z: 200-600 mm, with azimuth and altitude angles ranging from -90 to 0 degrees and 0 to 90 degrees, respectively. The observation space s ∈ ℝ¹² was a 12-dimensional array comprising the 6 joint angles of the robot arm, the 3 Cartesian coordinates of the target position, and 3 rotational axis values; the action space a ∈ ℝ⁶ was 6-dimensional and comprised the angular change in each joint angle. Table 1 presents the ranges of the observation space and action space variables.

Table 1.

Observation and action space ranges

Observation space ranges
Variable	Low (L)	High (H)
J1	-85.3	-35.2
J2	-97.4	-22.6
J3	86.2	144.55
J4	-170.8	-101.1
J5	26.4	114.85
J6	-157.5	-80.5
Target-x	49	1340
Target-y	1349	1351
Target-z	112	1464
Rot-z	-90	0
Rot-x	0	90
Action space ranges
Variable	Low (L)	High (H)
J1	-2	2
J2	-2	2
J3	-2	2
J4	-2	2
J5	-2	2
J6	-2	2
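As a minimal sketch of how the Table 1 ranges could be used in an environment step, the snippet below clips each joint increment to the action range and checks the resulting joint angles against the observation-space limits. The names `JOINT_LIMITS`, `MAX_DELTA`, `apply_action`, and `within_limits` are hypothetical, and the assumption that out-of-range actions are clipped (rather than rejected) is ours, not stated in the paper.

```python
# Joint-angle limits (low, high) for J1..J6, copied from Table 1.
JOINT_LIMITS = [
    (-85.3, -35.2),
    (-97.4, -22.6),
    (86.2, 144.55),
    (-170.8, -101.1),
    (26.4, 114.85),
    (-157.5, -80.5),
]
MAX_DELTA = 2.0  # action range per joint is [-2, 2] (Table 1)


def apply_action(joints, action):
    """Return new joint angles after adding clipped per-joint increments.

    Assumption: increments outside [-MAX_DELTA, MAX_DELTA] are clipped.
    """
    clipped = [max(-MAX_DELTA, min(MAX_DELTA, a)) for a in action]
    return [q + d for q, d in zip(joints, clipped)]


def within_limits(joints):
    """True if every joint angle lies inside its Table 1 range."""
    return all(lo <= q <= hi for q, (lo, hi) in zip(joints, JOINT_LIMITS))


# Usage: start from the midpoint of each joint range and take one step.
start = [(lo + hi) / 2 for lo, hi in JOINT_LIMITS]
moved = apply_action(start, [5.0, -1.0, 0.0, 0.0, 0.0, 0.0])
```

A joint configuration for which `within_limits` returns `False` is the situation in which the text applies the -1 penalty.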

Reward function: The reward function for this task was designed to depend on both distance and orientation. A reward of +0.2 was given if the end effector was within 10 mm of the target position and 15 degrees of the target orientation. The reward decreased exponentially as the end effector moved further from the target pose. Distance was measured using the Euclidean norm, while rotational distance was calculated from the sum of azimuth and altitude angles between the target and the end effector poses, adjusted via six joint angles in RoboDK software. A high positive reward of +1 was applied if both distance and rotational distance reward were positive. Additionally, if the robot arm exceeded predefined joint angle ranges, a penalty of -1 was applied.

(1)

Reward_{1} = \begin{cases} +0.2 & \text{if } distance \leq 10\ \text{mm} \\ e^{-distance/450} - 1 & \text{else} \end{cases}

(2)

Reward_{2} = \begin{cases} +0.2 & \text{if } rot\_distance \leq 15\ \text{degrees} \\ e^{-rot\_distance/450} - 1 & \text{else} \end{cases}
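The reward terms in Eqs. 1 and 2 can be sketched in a few lines of Python. The two piecewise terms follow the equations directly; how the terms, the +1 bonus, and the -1 joint-limit penalty are combined into a single scalar is not spelled out in the text, so the simple sum in `total_reward` is an assumption, and all function names are hypothetical.

```python
import math


def position_reward(distance_mm):
    """Eq. 1: +0.2 inside the 10 mm threshold, otherwise exp(-d/450) - 1."""
    if distance_mm <= 10.0:
        return 0.2
    return math.exp(-distance_mm / 450.0) - 1.0


def rotation_reward(rot_distance_deg):
    """Eq. 2: +0.2 inside the 15 degree threshold, otherwise exp(-r/450) - 1."""
    if rot_distance_deg <= 15.0:
        return 0.2
    return math.exp(-rot_distance_deg / 450.0) - 1.0


def euclidean_distance(p, q):
    """Euclidean norm between two 3D points (mm)."""
    return math.dist(p, q)


def rotational_distance(az_target, alt_target, az_eff, alt_eff):
    """Sum of the azimuth and altitude angle differences (degrees)."""
    return abs(az_target - az_eff) + abs(alt_target - alt_eff)


def total_reward(distance_mm, rot_distance_deg, joints_in_range=True):
    """Assumed combination: sum of both terms, +1 bonus, -1 penalty."""
    r1 = position_reward(distance_mm)
    r2 = rotation_reward(rot_distance_deg)
    reward = r1 + r2
    if r1 > 0 and r2 > 0:  # both thresholds met
        reward += 1.0
    if not joints_in_range:  # joint-limit violation
        reward -= 1.0
    return reward
```

Because exp(-x/450) - 1 is negative for any x > 0, each term is positive only inside its threshold, so the +1 bonus coincides with the success condition of the task.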

2.2 RL training

Three RL algorithms were used to train the agent for this complex grasping task. Standard SAC served as the baseline; it uses two critic networks and two target critic networks, and the minimum of the two critics' Q values is used when computing the value-network and critic-network losses. In this method, the policy is updated by gradient ascent on its objective, and the critics by gradient descent on the value and Q loss functions. The policy objective function and the critic loss functions are as follows.

(3)

\begin{aligned} J_{\pi}(\phi) &= \frac{1}{K} \sum_{i} \left[ Q_{\theta_{1}}(s_{i}, a) - \alpha \log \pi_{\phi}(a \mid s_{i}) \right] \quad \text{or} \\ J_{\pi}(\phi) &= \frac{1}{K} \sum_{i} \left[ Q_{\theta_{2}}(s_{i}, a) - \alpha \log \pi_{\phi}(a \mid s_{i}) \right] \end{aligned}

(4)

J_{V}(\varphi) = \frac{1}{K} \sum_{i} \left( y_{v_{i}} - V_{\varphi}(s_{i}) \right)^{2}

(5)

J_{Q}(\theta_{j}) = \frac{1}{K} \sum_{i} \left( y_{q_{i}} - Q_{\theta_{j}}(s_{i}, a_{i}) \right)^{2} \quad \text{for } j = 1, 2

(6)

y_{v_{i}} = \min \left( Q_{\theta_{1}}(s_{i}, a_{i}),\ Q_{\theta_{2}}(s_{i}, a_{i}) \right) - \alpha \log \pi_{\phi}(a_{i} \mid s_{i})

(7)

y_{q_{i}} = r + \gamma V_{\varphi'}(s_{i}')

Here, $\varphi$ and $\varphi'$ are the parameters of the main and target value networks, $\theta_{1}$ and $\theta_{2}$ are the parameters of the main Q networks, and $\phi$ is the policy network parameter. K is the minibatch size. $y_{v_{i}}$ and $y_{q_{i}}$ are the targets for the value and the Q value used in the critic loss functions. The action a in Eq. 3 is the output of a parametrized neural network f with sampled noise $\epsilon$, $a = f_{\phi}(\epsilon_{i}, s_{i})$. As in other algorithms, the target network parameters are updated by a soft-update mechanism.

(8)

\varphi' = \tau \varphi + (1 - \tau) \varphi'
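The soft update in Eq. 8 can be illustrated with a small sketch that treats the network parameters as a flat list of floats; real implementations apply the same rule tensor by tensor. The function name `soft_update` is hypothetical, and the default τ = 0.002 is the best value reported in Table 2.

```python
def soft_update(main_params, target_params, tau=0.002):
    """Eq. 8: target <- tau * main + (1 - tau) * target, element-wise."""
    return [tau * m + (1.0 - tau) * t for m, t in zip(main_params, target_params)]


# Usage: the target slowly tracks the main network over repeated updates.
main = [1.0, -2.0]
target = [0.0, 0.0]
for _ in range(1000):
    target = soft_update(main, target)
```

With a small τ the target moves only a fraction of the way toward the main network at each update, which keeps the regression targets in Eq. 7 slowly varying.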

The other two SAC variants used five critic networks and five target critic networks with one actor network, and took the mean or the median of the five Q values from the critic networks for the critic and value-network losses and updates; the rest of the procedure is the same as in the baseline SAC. Eqs. 9 and 10 give the target values for the value network and critic network in the SAC mean version, while Eqs. 11 and 12 give those for the SAC median version.

(9)

y_{v_{i}} = \operatorname{mean} \left( \begin{array}{l} Q_{\theta_{1}}(s_{i}, a_{i}) \\ Q_{\theta_{2}}(s_{i}, a_{i}) \\ Q_{\theta_{3}}(s_{i}, a_{i}) \\ Q_{\theta_{4}}(s_{i}, a_{i}) \\ Q_{\theta_{5}}(s_{i}, a_{i}) \end{array} \right) - \alpha \log \pi_{\phi}(a_{i} \mid s_{i})

(10)

y_{q_{i}} = r + \gamma V_{\varphi'}(s_{i}')

(11)

y_{v_{i}} = \operatorname{median} \left( \begin{array}{l} Q_{\theta_{1}}(s_{i}, a_{i}) \\ Q_{\theta_{2}}(s_{i}, a_{i}) \\ Q_{\theta_{3}}(s_{i}, a_{i}) \\ Q_{\theta_{4}}(s_{i}, a_{i}) \\ Q_{\theta_{5}}(s_{i}, a_{i}) \end{array} \right) - \alpha \log \pi_{\phi}(a_{i} \mid s_{i})

(12)

y_{q_{i}} = r + \gamma V_{\varphi'}(s_{i}')
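The only difference between the baseline and the two variants is how the critics' Q values are aggregated in the value target. The sketch below puts the three choices side by side for a single sample; `value_target` and its arguments are hypothetical names, and using the Table 2 entropy coefficient (0.017) as the default for α is an assumption.

```python
import statistics

# Aggregators for the critics' Q values: baseline (min), mean and median variants.
AGGREGATORS = {
    "min": min,
    "mean": statistics.fmean,
    "median": statistics.median,
}


def value_target(q_values, log_prob, alpha=0.017, mode="mean"):
    """y_v = aggregate(Q_1..Q_n) - alpha * log pi(a|s) for one sample."""
    return AGGREGATORS[mode](q_values) - alpha * log_prob


# Usage: five critic estimates, one of which is an optimistic outlier.
q_estimates = [1.0, 2.0, 3.0, 4.0, 10.0]
targets = {mode: value_target(q_estimates, log_prob=0.0, mode=mode)
           for mode in AGGREGATORS}
```

In this toy example the minimum is the most pessimistic target, the median ignores the outlier, and the mean is pulled toward it, which illustrates the different bias each aggregation introduces.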

In this study, we employed a fixed initial position and a randomly generated target pose for each episode. An episode terminated either when the maximum number of steps was reached or when the target was achieved. We used the Stable-Baselines3 (SB3) library to implement the algorithms and trained the model for 15,000 episodes. Each episode comprised at most T (=384) discrete time-steps, concluding at T or upon reaching the predefined threshold pose. SB3 was used to save model parameters, track model loss, and normalize the environment before training. After training, models were tested using the saved parameters and normalization statistics to evaluate rewards, steps taken, success rates, and positional and rotational distances from the target.
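The termination rule described above (stop at T = 384 steps or as soon as the threshold pose is reached) can be sketched as a plain rollout loop. Everything here is illustrative: `run_episode`, `toy_step`, and the one-dimensional toy state are hypothetical stand-ins for the RoboDK environment, which is not reproduced.

```python
MAX_STEPS = 384  # T, the maximum number of time-steps per episode


def run_episode(step_fn, policy, state):
    """Roll out one episode; return (total_reward, steps_taken, success)."""
    total = 0.0
    for t in range(1, MAX_STEPS + 1):
        action = policy(state)
        state, reward, reached = step_fn(state, action)
        total += reward
        if reached:  # threshold pose reached: terminate early
            return total, t, True
    return total, MAX_STEPS, False  # step budget exhausted


def toy_step(distance, action):
    """Toy 1-D stand-in: the action reduces the distance to the target (mm)."""
    new_distance = distance - action
    reached = new_distance <= 10.0
    reward = 0.2 if reached else -0.01
    return new_distance, reward, reached


# Usage: a policy that always moves 1 mm closer, starting 50 mm away.
ret, steps, success = run_episode(toy_step, lambda s: 1.0, 50.0)
```

The three returned values correspond to the episodic return, episodic steps, and success indicator that are averaged during training.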

Table 2 presents the hyperparameter ranges and the best values for the RL algorithms used. The best hyperparameters were selected through manual searches within the specified ranges. All algorithms employed a decaying learning rate, determined through iterative trials, that decreased from 0.0008 to 0.0003. Additionally, the network structure, consisting of two multilayer perceptrons (MLPs), was established via trial and error.

Table 2.

Hyperparameters range and best values

S.N	Hyperparameters	Range (R), SAC (all versions)	Best (B), SAC (all versions)
1	Batch_size (b)	32, 64, 128	128
2	Gamma (𝛾)	0.98 – 0.995	0.99
3	Entropy_coefficient (∈)	0.01 – 0.02	0.017
4	Buffer_size (B)	1M	1M
5	Learning_starts	1024, 5120	5120
6	Tau (τ)	0.002	0.002
7	Train_freq	64, 128	64
8	Gradient_steps	4, 5, 10	5
9	Network_structure		[180, 360, 360, 360]
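The decaying learning rate mentioned before Table 2 could be expressed as a schedule function. Stable-Baselines3 accepts a callable for `learning_rate` that receives the remaining training progress (1.0 at the start, 0.0 at the end); the linear shape of the decay below is an assumption, since the text gives only the two endpoint values, and `linear_schedule` is a hypothetical helper name.

```python
def linear_schedule(initial=0.0008, final=0.0003):
    """Return a schedule mapping progress_remaining (1 -> 0) to a learning rate.

    Assumption: the decay between the two reported endpoints is linear.
    """
    def schedule(progress_remaining):
        return final + progress_remaining * (initial - final)
    return schedule


# Usage: inspect the rate at the start, middle, and end of training.
lr = linear_schedule()
rates = [lr(p) for p in (1.0, 0.5, 0.0)]
```

The returned function would be passed as the `learning_rate` argument when constructing the model, in place of a constant.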

2.3 Metrics for learning

The assessment of the agents' training performance tracked three primary metrics: average episodic return, average episodic steps, and average success rate. In contrast, the testing rollouts focused on measuring the algorithm's effectiveness in achieving the target pose. We implemented a 25-episode evaluation period, averaging episodic rewards, steps, and success rates as key performance indicators, and tracked training loss for enhanced evaluation. Over the course of training, these three measures show an increase in episode returns, a reduction in the steps required to achieve the target, and an improvement in the success rate, and therefore directly reflect the degree of training success. Using a sliding window averaging approach, we observed these metrics and ran each algorithm five times, presenting the mean and 95% confidence interval across these runs. These indicators revealed the speed and effectiveness of training convergence. For testing, we plotted the distance from the target position and the rotational distance from the target pose to analyze variation and significance. The experiments were run on a PC with an 11th-generation Intel i7 processor, 32 GB of RAM, and a GeForce GTX 1650 GPU, running Windows 11.
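The window averaging and the across-run confidence interval described above can be sketched with the standard library. The function names are hypothetical, and the use of a Student-t interval with 4 degrees of freedom (critical value 2.776 for five runs) is an assumption; the text does not state how the 95% interval was computed.

```python
import math
import statistics


def window_average(values, window=25):
    """Trailing sliding-window mean; early entries use the values seen so far."""
    return [
        statistics.fmean(values[max(0, i - window + 1): i + 1])
        for i in range(len(values))
    ]


def mean_and_ci95(run_values, t_crit=2.776):
    """Mean and 95% interval across runs (t_crit assumes five runs, df = 4)."""
    n = len(run_values)
    mean = statistics.fmean(run_values)
    half_width = t_crit * statistics.stdev(run_values) / math.sqrt(n)
    return mean, mean - half_width, mean + half_width


# Usage: smooth one run's success indicators, then summarize five runs.
smoothed = window_average([1, 2, 3, 4], window=2)
mean, low, high = mean_and_ci95([0.5, 0.6, 0.7, 0.6, 0.6])
```

Applying `mean_and_ci95` at every evaluation point yields the solid line and shaded band of a learning-curve plot.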

3. Results and Discussion

In this study, performance was evaluated using three metrics: episodic rewards, episodic steps, and success rate, which measure variations across training episodes. Additionally, the final distance and rotational distance from the target pose were used to assess the trained agent's end solution.

Fig. 4(a), (b), and (c) show the fluctuations in average episodic returns, average episodic steps, and average success rate over the training period of 15,000 episodes for the target-reaching task. Each plot displays solid lines for episodic mean values and transparent regions for 95% confidence intervals at evaluation points (every 25 episodes). The baseline SAC algorithm converged more slowly than the two SAC variants. The mean and median SAC variants exhibited similar and less variable performance across all three metrics. In contrast, the baseline SAC showed higher variation in success rate and steps, with only minor differences in rewards compared to the SAC variants. Up to 100 evaluation periods, all algorithms performed similarly, but the SAC variants outperformed the baseline SAC thereafter. Average episodic rewards stabilized within 150 evaluation periods. However, steps and success rates continued to vary, indicating that the agent came close to the target but did not consistently reach the success threshold. The baseline SAC required more steps to reach the target, while the SAC variants achieved more successful trajectories with fewer steps.

https://cdn.apub.kr/journalsite/sites/ksarc/2024-003-02/N0410030201/images/ksarc_03_02_01_F4.jpg

Figure 4.

Mean and 95% confidence interval of window averaged (a) episodic returns, (b) episodic steps and (c) success rate

Fig. 5 presents the training period's loss for the SAC median and mean variant algorithms, with actor losses in Fig. 5(a) and critic losses in Fig. 5(b). Initially, both losses increased until 1.2 million time-steps before decreasing, reaching their minimum values around 4 million time-steps. The mean variant demonstrates slightly better performance, achieving faster loss reduction by approximately 0.2 million time-steps compared to the median variant.

https://cdn.apub.kr/journalsite/sites/ksarc/2024-003-02/N0410030201/images/ksarc_03_02_01_F5.jpg

Figure 5.

(a) Actor and (b) critic loss for two variants of SAC

In addition to training outcomes, Fig. 6(a) and (b) depict testing rollouts of the two variants, showing the distance and rotational deviation from the target grasp pose. During testing rollouts, the agent's proximity to the target pose was assessed based on distance (in mm) and rotational deviation (in degrees). These parameters were automatically evaluated at the end of each episode to determine how closely the agent approached the target pose. Both SAC variants achieved similar performance in grasping the object, with the mean version slightly superior. The median variant showed a wider range of grasp poses with values that were more closely clustered. In contrast, the mean variant displayed a right-skewed distribution, with most values falling below 8 mm in distance and 7.5 degrees in angular difference from the target pose.

https://cdn.apub.kr/journalsite/sites/ksarc/2024-003-02/N0410030201/images/ksarc_03_02_01_F6.jpg

Figure 6.

(a) Distance and (b) rotational difference from random target pose in testing rollouts

Because the threshold for the grasp pose was tight, none of the agents completed training within 15,000 episodes. However, the SAC variants reached a favorable success rate of 60% with consistent episodic rewards. Towards the end of training, the SAC variants showed an upward trend in performance, in contrast to the relatively flat trajectory of the SAC baseline. This suggests that with additional episodes, the SAC variants are likely to achieve higher success rates than the baseline. Only successful rollouts were considered for performance evaluation, and these showed that both SAC variants achieved similar end grasp poses with minimal deviation from the target pose. This study serves as a foundational step for future research in learning-based grasping of objects with complex poses using robotic arms. Grasping is a fundamental task in engineering applications such as pick-and-place, stacking, and holding, all of which rely on effective grasping for successful completion, so this implementation can benefit each of these fields. The SAC algorithm, an off-policy actor-critic method, can be improved by incorporating multiple critic networks. This enhancement uses the mean or median of the critics' Q values, instead of the minimum, to evaluate the target value of the value function, particularly in complex pose grasping tasks within a specified architecture.

4. Conclusion

Although grasping objects in large spaces may appear easy once their pose has been determined through vision methods, precise matching in 6D space remains challenging. Our study succeeded in reaching complex target grasp poses with a robot arm within a considerable search volume. We compared two variants of the SAC algorithm against baseline SAC. Results indicated a 35% success rate for baseline SAC, compared with a 60% success rate for the modified versions, supporting our hypothesis. The disparity in success rate becomes evident because of the task complexity; in easier tasks there is minimal distinction between the baseline and modified versions, which makes the difference hard to visualize. Further training episodes could potentially enhance the success rates of the SAC variants, as these were still rising at the end of training while the baseline had flattened. A notable limitation of this study was the exclusion of pose estimation for complete grasping of objects with complex poses, such as a brick. Future research could extend this work by integrating vision-based pose estimation methods to enable comprehensive object grasping.

Acknowledgements

This work was supported by the National Research Foundation of Korea (NRF) grant funded by the Korea government (MSIT). (No. RS-2024-00353461)

References

Al-Shanoon, A., and Lang, H. (2021). Learn to grasp unknown objects in robotic manipulation. Intelligent Service Robotics, 14(4), pp. 571-582.

10.1007/s11370-021-00380-9

Cheng, H., Wang, Y., and Meng, M. Q. H. (2021). Grasp pose detection from a single RGB image, 2021 IEEE/RSJ International Conference on Intelligent Robots and Systems. IROS, pp. 4686-4691.

10.1109/IROS51168.2021.9636511

Choi, C., Schwarting, W., DelPreto, J., and Rus, D. (2018). Learning object grasping for soft robot hands. IEEE Robotics and Automation Letters, 3(3), pp. 2370-2377.

10.1109/LRA.2018.2810544

de Oliveira, D. M., Viturino, C. C. B., and Conceição, A. G. S. (2021). 6D Grasping based on lateral curvatures and geometric primitives. 2021 Latin American Robotics Symposium, LARS, 2021 Brazilian Symposium on Robotics, SBR, and 2021 Workshop on Robotics in Education, WRE, pp. 138-143.

10.1109/LARS/SBR/WRE54079.2021.9605382

Deng, X., Xiang, Y., Mousavian, A., Eppner, C., Bretl, T., and Fox, D. (2020). Self-supervised 6D object pose estimation for robot manipulation. 2020 IEEE International Conference on Robotics and Automation, ICRA, pp. 3665-3671.

10.1109/ICRA40945.2020.9196714

Fu, H., Mei, X., Zhang, Z., Zhao, W., and Yang, J. (2020). Point pair feature based 6D pose estimation for robotic grasping. 2020 IEEE 4th Information Technology, Networking, Electronic and Automation Control Conference, ITNEC, pp. 1803-1808.

10.1109/ITNEC48623.2020.9084720

Junare, P., Deshmukh, M., Kulkarni, M., and Bartakke, P. (2022). Deep learning based end-to-end grasping pipeline on a lowcost 5-DOF robotic arm. 2022 IEEE 19th India Council International Conference, INDICON, pp. 1-6.

10.1109/INDICON56171.2022.10040180

Li, E., Fu, Y., and Feng, H. (2023). 6D grasp pose detection in cluttered environment from a single-view. 2023 IEEE International Conference on Robotics and Biomimetics, ROBIO, pp. 1-6.

10.1109/ROBIO58561.2023.10354876

Li, K., Baron, N., Zhang, X., and Rojas, N. (2022). EfficientGrasp: A unified data-efficient learning to grasp method for multi-fingered robot hands. IEEE Robotics and Automation Letters, 7(4), pp. 8619-8626.

10.1109/LRA.2022.3187875

Li, Z., Xu, B., Wu, D., Zhao, K., Lu, M., and Cong, J. (2021). A mobile robotic arm grasping system with autonomous navigation and object detection. 2021 International Conference on Control, Automation and Information Sciences, ICCAIS, pp. 543-548.

10.1109/ICCAIS52680.2021.9624636

Liang-shan, Z., Jie, L., Xue-min, S., and Jin-song, B. (2019). Research on multi-stage robotic grasping based on object posture. 2019 WRC Symposium on Advanced Robotics and Automation, WRC SARA, pp. 266-271.

10.1109/WRC-SARA.2019.8931943

Liu, P., Zhang, Q., and Cheng, J. (2024). BDR6D: bidirectional deep residual fusion network for 6d pose estimation. IEEE Transactions on Automation Science and Engineering, 21(2), pp. 1793-1804.

10.1109/TASE.2023.3248843

Liu, X., Yuan, X., Zhu, Q., Wang, Y., Zhang, H., Feng, M., Wu, Z., and Tang, Y. (2023). A robust pixel-wise prediction network with applications to industrial robotic grasping. IEEE Transactions on Industrial Electronics, 70(8), pp. 8203-8214.

10.1109/TIE.2022.3212422

Liu, Y., Xu, H., Liu, D., and Wang, L. (2022). A digital twin-based sim-to-real transfer for deep reinforcement learning-enabled industrial robot grasping. Robotics and Computer-Integrated Manufacturing, 78, 102365.

10.1016/j.rcim.2022.102365

Mohammed, M. Q., Kwek, L. C., Chua, S. C., Aljaloud, A. S., Al‐dhaqm, A., Al‐mekhlafi, Z. G., and Mohammed, B. A. (2021). Deep reinforcement learning‐based robotic grasping in clutter and occlusion. Sustainability, 13(24), 13686.

10.3390/su132413686

Park, J., Lee, S., Lee, J., and Um, J. (2020). Gadgetarm-automatic grasp generation and manipulation of 4-dof robot arm for arbitrary objects through reinforcement learning. Sensors, 20(21), pp. 1-16.

10.3390/s20216183; PMID: 33143047; PMCID: PMC7662704

Shahid, A. A., Piga, D., Braghin, F., and Roveda, L. (2022). Continuous control actions learning and adaptation for robotic manipulation through reinforcement learning. Autonomous Robots, 46(3), pp. 483-498.

10.1007/s10514-022-10034-z

Shukla, P., Kumar, H., and Nandi, G. C. (2020). Robotic grasp manipulation using evolutionary computing and deep reinforcement learning. Intelligent Service Robotics, 14(1), pp. 61-77.

10.1007/s11370-020-00342-7

Wang, L., Meng, X., Xiang, Y., and Fox, D. (2022). Hierarchical policies for cluttered-scene grasping with latent plans. IEEE Robotics and Automation Letters, 7(3), pp. 2883-2890.

10.1109/LRA.2022.3143198

Wang, S., Liu, J., Lu, Q., Liu, Z., Zeng, Y., Zhang, D., and Chen, B. (2023). 6D Pose estimation for vision-guided robot grasping based on monocular camera. 2023 6th International Conference on Robotics, Control and Automation Engineering, RCAE, pp. 13-17.

10.1109/RCAE59706.2023.10398793

Zuo, G., Tong, J., Wang, Z., and Gong, D. (2023). A graph-based deep reinforcement learning approach to grasping fully occluded objects. Cognitive Computation, 15(1), pp. 36-49.

10.1007/s12559-022-10047-x

Journal of Construction Automation and Robotics ISSN:2800-0552(Print) 2951-116X(Online) 건설자동화·로보틱스 논문집
