==================== Hey hey hey, I'm back again ====================
Log_num: 1
Date: 2023.03.09
Time: 13:41
Contents:
1). Rename 'ActorNetwork' to 'Actor', 'CriticNetwork' to 'Critic'
2). Move files in ./simulation/PG_based/ to ./simulation/AC_based/
3). Add basic classes 'ProbActor' and 'DualCritic' in ./common/common_cls.py
4). Fix some bugs caused by 1), 2), and 3)
================= Let's go bros, just get it done =================
==================== Hey hey hey, I'm back again ====================
Log_num: 2
Date: 2023.03.10
Time: 00:17
Contents:
1). Add SAC
2). Add SAC-4-CartPole.py in /simulation/AC_based/
================= Let's go bros, just get it done =================
==================== Hey hey hey, I'm back again ====================
Log_num: 3
Date: 2023.03.11
Time: 19:24
Contents:
1). Add a new environment: CartPoleAngleOnly.
This new environment only takes angular information as the RL state.
We do not care about the positional state of the cart-pole (a minimal sketch follows after this list).
2). Add SAC-4-CartPoleAngleOnly.py in /simulation/AC_based/
3). Add TD3-4-CartPoleAngleOnly.py in /simulation/AC_based/
4). Add a well-trained controller for CartPoleAngleOnly using TD3 in /datasave/networks/TD3-CartPoleAngleOnly/parameters
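For clarity, a minimal sketch of the angle-only observation described in 1) (theta and theta_dot are placeholder names I chose, not necessarily the env's attributes):

    import numpy as np

    def angle_only_observation(theta, theta_dot):
        # CartPoleAngleOnly: the RL state keeps only the angular terms;
        # cart position x and velocity x_dot are deliberately excluded
        return np.array([theta, theta_dot], dtype=np.float32)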
================= Let's go bros, just get it done =================
==================== Hey hey hey, I'm back again ====================
Log_num: 4
Date: 2023.03.11
Time: 19:50
Contents:
1). Discard the return values of the function 'step_update(action)' in all envs.
2). Add different directories for different RL algorithms in /simulation/AC_based
================= Let's go bros, just get it done =================
==================== Hey hey hey, I'm back again ====================
Log_num: 4
Date: 2023.03.11
Time: 20:57
Contents:
1). Fix some tiny bugs.
================= Let's go bros, just get it done =================
==================== Hey hey hey, I'm back again ====================
Log_num: 5
Date: 2023.03.13
Time: 20:54
Contents:
1). Add A2C.
2). Add A2C-4-CartPoleDiscreteAngleOnly.py in /simulation/AC_based/A2C/
3). Fix some bugs.
Tips:
The darn thing doesn't work, and I suspect the problem is in my own code. I later found two demos on GitHub and had ChatGPT generate another one; none of the three worked either.
A2C works a bit for controlling gym's cartpole-v1, but that task has only two actions, so even blind guessing gets 50%; hardly worth learning.
My own CartPoleDiscreteAngleOnly environment started with 33 action choices, then 13, and finally 7, and the angle range shrank from 30 degrees to 20 and finally 10 degrees; none of it worked. God...
It's as if I handed it the correct answer on a silver platter, and it still couldn't learn.
TD3, on the other hand, with a continuous action space and a replay buffer of roughly 20,000 transitions, learns a perfect policy in under a minute...
I guess I really do need the replay buffer. >_<
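For reference, a minimal sketch of the shrinking discretization described above (the force limit F_MAX and the even spacing are my assumptions, not values read from the repo):

    import numpy as np

    N_ACTIONS = 7                                    # tried 33 -> 13 -> 7 discrete choices
    F_MAX = 10.0                                     # assumed actuator limit
    FORCES = np.linspace(-F_MAX, F_MAX, N_ACTIONS)   # the discrete action set
    THETA_LIMIT = np.deg2rad(10.0)                   # angle range shrank 30 -> 20 -> 10 degrees

    def action_to_force(action_index):
        return float(FORCES[action_index])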
================= Let's go bros, just get it done =================
==================== Hey hey hey, I'm back again ====================
Log_num: 6
Date: 2023.03.14
Time: 20:54
Contents:
1). Add a controller for CartPole with A2C.
Tips:
This environment is different from the one in gym. There are two main differences.
First, the action of the CartPole in gym is discrete, while mine is continuous.
Second, the training goal of the CartPole in gym is just stabilizing the Pole,
while mine is stabilizing the Pole and keeping the Cart at x=0 simultaneously.
However, the performance is worse than when only stabilizing the Pole,
because the Pole and the Cart have conflicting interests in some scenarios.
I insist that controlling both the Cart and the Pole is more meaningful and more challenging,
and that is what DRL should do.
Currently, my controller can stabilize the Pole, but the Cart keeps fluctuating around x=0.
(This is not a dynamics problem: I added friction, so in theory there is energy dissipation and the Cart can definitely come to rest.)
If anyone happens to see my code and has any idea to improve the performance, please email me.
E-mail: yefeng.yang@connect.polyu.hk
Respect~
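To illustrate the conflict of interest mentioned above, here is a minimal sketch of one way to weight the two objectives in a reward (the quadratic form and the weights are my assumptions, not the repo's actual reward):

    def reward(x, theta, w_pole=1.0, w_cart=0.2):
        # penalize both the pole angle and the cart position;
        # a larger w_cart makes the pole objective harder to satisfy
        return -(w_pole * theta ** 2 + w_cart * x ** 2)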
================= Let's go bros, just get it done =================
==================== Hey hey hey, I'm back again ====================
Log_num: 7
Date: 2023.03.15
Time: 19:48
Contents:
1). Add PPO.
2). Add a controller for CartPoleAngleOnly using PPO.
3). Add a controller for FlightAttitudeSimulator2StateContinuous using PPO.
Tips:
The performance is pretty good.
================= Let's go bros, just get it done =================
==================== Hey hey hey, I'm back again ====================
Log_num: 7
Date: 2023.03.16
Time: 11:28
Contents:
1). Add PPO-4-CartPole.py
2). Add PPO-4-UGVBidirectional.py
Tips:
No well-trained controller for the two envs yet.
================= Let's go bros, just get it done =================
==================== Hey hey hey, I'm back again ====================
Log_num: 8
Date: 2023.03.21
Time: 20:26
Contents:
1). Add DPPO
2). Add DPPO-4-CartPoleAngleOnly.py
3). Add a controller for CartPoleAngleOnly using DPPO.
4). Add DPPO-4-FlightSimulator2State.py
5). Add a controller for FlightSimulator2State using DPPO.
================= Let's go bros, just get it done =================
==================== Hey hey hey, I'm back again ====================
Log_num: 9
Date: 2023.03.23
Time: 21:38
Contents:
1). Add DPPO2 (not yet debugged)
================= Let's go bros, just get it done =================
==================== Hey hey hey, I'm back again ====================
Log_num: 10
Date: 2023.03.28
Time: 22:43
Contents:
Nothing; just a reminder that DPPO2 does not work!!
Tips: I'm going crazy; a whole week wasted. DPPO2 doesn't work, and I can't figure out where the code goes wrong.
DPPO2 uses Python's relatively new shared_memory module as the communication mechanism, which is fast. I compared it with a pipe-based implementation on GitHub, and shared memory is much faster.
However, this conflicts with CUDA multiprocessing, so only the CPU can be used. Yet even when I run only one worker process, training has no effect at all, while PPO converges almost instantly on the same environment.
With DPPO1, since I reduced the learning rate and there are many agents early on, convergence is not easy either, so DPPO1 converges much more slowly than PPO, but it is stable and reaches the same final performance.
So the problem is definitely not the network training in DPPO2; it must be the data transfer somewhere in between (inter-process communication, including the exploration data, the network parameters, and so on).
I couldn't find a shared_memory-based implementation on GitHub (I still think it is the most sensible approach, but mp.shared_memory seems to require Python 3.8+, so perhaps nobody has written one this way yet).
I debugged until I wanted to throw up and still couldn't find the culprit...
If any expert happens to see this, please take a look at /algorithm/policy_base/Distributed_PPO2.py and /simulation/POlicy_based/DPPO2-4-CartPoleAngleOnly.py; the two files correspond to each other.
Many thanks...... >_< Q_Q T_T
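For anyone unfamiliar with the mechanism, a self-contained sketch of the shared_memory idea (the names, shapes, and data are made up for illustration; this is not the repo's actual transfer code):

    # requires Python >= 3.8
    import numpy as np
    from multiprocessing import Process, shared_memory

    def worker(shm_name, shape, dtype):
        # attach to the block created by the trainer and write "rollout data" into it
        shm = shared_memory.SharedMemory(name=shm_name)
        buf = np.ndarray(shape, dtype=dtype, buffer=shm.buf)
        buf[:] = np.random.randn(*shape)
        shm.close()

    if __name__ == "__main__":
        shape, dtype = (128, 4), np.float32
        shm = shared_memory.SharedMemory(create=True, size=int(np.prod(shape)) * np.dtype(dtype).itemsize)
        p = Process(target=worker, args=(shm.name, shape, dtype))
        p.start(); p.join()
        data = np.ndarray(shape, dtype=dtype, buffer=shm.buf).copy()   # the trainer's own copy
        shm.close(); shm.unlink()
        print(data.shape, data.mean())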
================= Let's go bros, just get it done =================
==================== Hey hey hey, I'm back again ====================
Log_num: 10
Date: 2023.03.29
Time: 22:15
Contents:
1). Updated the way all algorithms interact with environments: the model-description files are no longer used (the files themselves remain; they are just not read anymore).
2). Algorithms now load the env internally and can interact with it internally (see the sketch at the end of this entry).
3). Modified part of the structure of DQN, DuelingDQN, and DoubleDQN.
4). Changed the paths of some files.
5). Rewrote README.md.
Tips:
From this point on, the repository ReinforcementLearning will be gradually migrated to ReinforcementLearning_V2, which also means ReinforcementLearning will stop being updated.
All functionality and usage of ReinforcementLearning_V2 are exactly the same as in ReinforcementLearning, and the previously trained controllers can run directly on V2 (if there are no bugs).
Once the debugging of V2 is completely finished, ReinforcementLearning_V2 will be renamed to ReinforcementLearning and replace the current version.
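A minimal sketch of the interaction pattern described in item 2) (all class and method names here are mine, not the repo's):

    import random

    class ToyEnv:
        def reset(self): return 0.0
        def step(self, action): return 0.0, 1.0, True     # next_state, reward, done

    class Algorithm:
        def __init__(self, env):
            self.env = env                                 # the algorithm owns the env
        def choose_action(self, state):
            return random.choice([-1.0, 1.0])
        def learn(self, num_episodes):
            for _ in range(num_episodes):                  # interaction happens inside the algorithm
                state, done = self.env.reset(), False
                while not done:
                    action = self.choose_action(state)
                    state, reward, done = self.env.step(action)

    Algorithm(ToyEnv()).learn(3)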
================= Let's go bros, just get it done =================
==================== Hey hey hey, I'm back again ====================
Log_num: 10
Date: 2023.03.30
Time: 22:59
Contents:
1). Add new env: UGV2.UGVBidirectional
2). Add new env: SecondOrderIntegrationSystem
================= Let's go bros, just get it done =================
==================== Hey hey hey, I'm back again ====================
Log_num: 10
Date: 2023.04.03
Time: 22:59
Contents:
1). Add new controller: a positional controller for SecondOrderIntegration using DPPO.
Tips:
This required iterative retraining: from the networks saved in the first round, pick a relatively good one, use it as the starting point, train a second round, and so on until a good one appears.
The reward function has two parts: a position reward and an angle reward.
Position reward: the negative absolute value of the error plus a positive bias (the effect is the same as a quadratic form).
Angle reward: a reward on the angle between the error vector and the velocity vector, intended to make the trajectory straighter. Its value is -(angle - 45 deg) * gain, so the reward becomes positive once the angle drops below 45 degrees.
It did learn, but only with DPPO, which means that with the same parameters PPO, SAC, or TD3 may not learn well: either there is too little data to finish learning, or the update steps are too large and training oscillates.
It comes down to compute... my computer is terrible. Someday I'd like to get an i9 with more cores (my machine currently handles about 5 worker processes; 10 is possible, but not much faster than 5), or just go straight to Isaac Gym...
Let's go!
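A minimal sketch of the two-part reward described above (the bias, the gain, and the convention that err points from the mass point toward the target are my assumptions):

    import numpy as np

    def reward(err, vel, bias=2.0, gain=0.1):
        # position term: negative absolute error plus a positive bias
        r_pos = -np.sum(np.abs(err)) + bias
        # angle term: angle between the error vector and the velocity vector;
        # it turns positive once the angle drops below 45 degrees, favoring straight trajectories
        cos = np.dot(err, vel) / (np.linalg.norm(err) * np.linalg.norm(vel) + 1e-8)
        angle_deg = np.degrees(np.arccos(np.clip(cos, -1.0, 1.0)))
        r_ang = -(angle_deg - 45.0) * gain
        return r_pos + r_ang

    print(reward(np.array([1.0, 0.5]), np.array([0.8, 0.4])))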
================= Let's go bros, just get it done =================
==================== Hey hey hey, I'm back again ====================
Log_num: 11
Date: 2023.04.12
Time: 22:29
Contents:
1). Add discrete version of PPO: PPO_Discrete
2). Add new controller: a positional controller for SecondOrderIntegration-BangBang using PPO_Discrete.
Tips:
A note on why "BangBang" appears in the name: bang-bang is the form of the time-optimal control solution for a class of systems, i.e., whatever the system dynamics are, the optimal answer is bang-bang.
"Bang-bang" means the controller output can only be the maximum or the minimum, with no intermediate value; this kind of optimal control is also called saturation control. Since the form of the controller is known in advance,
there is no need for a continuous action space from min to max; a discrete action space is sufficient, so PPO_Discrete can solve it. This treatment is entirely reasonable: in RL problems,
the design of the reward function and of the state-action space matters far more than the network structure and parameters, so simplifying the problem sensibly can greatly improve training speed and quality.
Previously, the second-order integration system could only be trained to a barely acceptable level, and only with DPPO; after this simplification, PPO_Discrete solves it completely.
With enough compute this might not matter, but simplifying the problem sensibly and achieving the same result with less compute is very worthwhile.
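A minimal sketch of the bang-bang action set this implies for the 2-D second-order integrator (the saturation level A_MAX is a placeholder I chose):

    import itertools
    import numpy as np

    A_MAX = 1.0                                    # assumed control saturation level
    # bang-bang: each axis only ever outputs -A_MAX or +A_MAX,
    # so 2^2 = 4 discrete actions cover the whole controller family
    ACTIONS = np.array(list(itertools.product((-A_MAX, A_MAX), repeat=2)))

    def action_to_acceleration(index):
        return ACTIONS[index]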
================= Let's go bros, just get it done =================
==================== Hey hey hey, I'm back again ====================
Log_num: 12
Date: 2023.04.19
Time: 16:41
Contents:
1). Adjust networks and some functions for DQN/Double DQN/Dueling DQN to support multi-dimensional actions.
2). Debugged the networks and corresponding demos mentioned above.
Tips:
The original DQN-family networks output the values of a single one-dimensional action and did not support high-dimensional actions. This update changes the output to one vector obtained by flattening the value of every choice in every action dimension;
when selecting an action, the vector is split back per dimension and the choice with the maximum value is taken independently in each dimension, so the DQN-family networks and demos now support multi-dimensional actions.
It also fixes both the long-standing bugs and the newly introduced bugs in the DQN-family demos on the Flight-Attitude-Simulator; these demos can now be trained and tested normally, but the training quality has not been evaluated yet.
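A minimal sketch of my reading of that output layout (how the flattened vector is actually split in the repo may differ):

    import torch

    def select_action(q_values_flat, choices_per_dim):
        # q_values_flat: network output, the values of all choices of all action
        # dimensions flattened into one vector of length sum(choices_per_dim)
        action, start = [], 0
        for n in choices_per_dim:
            action.append(int(torch.argmax(q_values_flat[start:start + n])))
            start += n
        return action

    # e.g. two action dimensions with 7 choices each -> a length-14 output vector
    print(select_action(torch.randn(14), [7, 7]))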
================= Let's go bros, just get it done =================
==================== Hey hey hey, I'm back again ====================
Log_num: 13
Date: 2023.04.19
Time: 22:18
Contents:
1). Added a new controller (stored in ./datasave/network/DQN-FlightAttitudeSimulator)
using multi-dimensional-supported DQN (named DQN) for flight attitude simulator.
2). Fixed some tiny bugs and deleted some redundant functions.
Tips: None
================= Let's go bros, just get it done =================
==================== Hey hey hey, I'm back again ====================
Log_num: 14
Date: 2023.04.20
Time: 0:53
Contents:
1). Added a new controller (stored in ./datasave/network/DoubleDQN-FlightAttitudeSimulator)
using multi-dimensional-supported DoubleDQN (named DoubleDQN) for flight attitude simulator.
2). Added a new controller (stored in ./datasave/network/DuelingDQN-FlightAttitudeSimulator)
using multi-dimensional-supported DuelingDQN (named DuelingDQN) for flight attitude simulator.
3). Fixed some bugs.
Tips: None
================= Let's go bros, just get it done =================
==================== Hey hey hey, I'm back again ====================
Log_num: 15
Date: 2023.04.26
Time: 12:20
Contents:
1). Add a discrete SecondOrderIntegration environment
2). Add DQN for SecondOrderIntegration_Discrete (untrained)
3). Add a merely acceptable controller (in ./datasave/network/DQN-SecondOrderIntegration) for SecondOrderIntegration (BangBang)
4). Fixed some bugs for multi-dimensional-supported DQN, DoubleDQN and DuelingDQN.
5). Add an option to support delayed updating of the critics in the function update_network_parameters() of
TD3 (Twin-Delayed-DDPG)
Tips:
The performance of the SecondOrderIntegration controller trained by DQN is merely acceptable.
I guess this is because I set boundaries in the environment to prevent the mass point from
going too far away during training. However, DQN learnt that the boundaries could be used
to reach the target. Once it succeeded in reaching the target through this rebound mechanism, DQN took that as
positive feedback and reinforced such a policy. The lesson from this training process is that we had better not change the
environment while training. I put the corresponding gif in ./datasave/video/gif and named it "bound-bound
control" for fun~.
================= Let's go bros, just get it done =================
==================== Hey hey hey, I'm back again ====================
Log_num: 16
Date: 2023.05.09
Time: 15:28
Contents:
1). Add a UGVForward_pid environment that uses a PD controller with parameters tuned by RL (in ./environment/env/UGV_PID)
2). Add a TD3 controller (in ./datasave/network/TD3-SecondOrderIntegration) for SecondOrderIntegration
3). Add a TD3 controller (in ./datasave/network/TD3-UGV-Forward_pid) for UGVForward_pid
Tips:
Considering the continuity of the wheels' velocities, we originally planned to train the wheel accelerations
directly with TD3. However, we failed to get an acceptable controller. Therefore, we decided to use two PID
controllers for positional and angular control and used TD3 to tune the PID parameters automatically. The
bridge between the velocity command and the acceleration command is a simple proportional module. The
training result indicates the effectiveness of this second plan, which can be reused in future demos.
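A minimal sketch of the RL-tuned PID scheme as I understand it (the gain layout, the bridge gain k_bridge, and all names are illustrative assumptions):

    class PD:
        def __init__(self, kp, kd):
            self.kp, self.kd, self.prev_err = kp, kd, 0.0
        def __call__(self, err, dt):
            d_err = (err - self.prev_err) / dt
            self.prev_err = err
            return self.kp * err + self.kd * d_err

    def make_controllers(rl_action):
        # TD3's action proposes the PD gains: [kp_pos, kd_pos, kp_yaw, kd_yaw]
        return PD(rl_action[0], rl_action[1]), PD(rl_action[2], rl_action[3])

    def control_step(pd_pos, pd_yaw, pos_err, yaw_err, v_meas, w_meas, dt=0.02, k_bridge=5.0):
        v_cmd, w_cmd = pd_pos(pos_err, dt), pd_yaw(yaw_err, dt)
        # the "bridge": a simple proportional module turns velocity commands into accelerations
        return k_bridge * (v_cmd - v_meas), k_bridge * (w_cmd - w_meas)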
================= Let's go bros, just get it done =================
==================== Hey hey hey, I'm back again ====================
Log_num: 17
Date: 2023.05.16
Time: 19:23
Contents:
1). Add a UGVBidirectional_pid environment that uses a PD controller with parameters tuned by RL (in ./environment/env/UGV_PID)
2). Add a PPO controller (in ./datasave/network/PPO-UGV-Forward_pid) for UGVForward_pid
3). Add a PPO controller (in ./datasave/network/PPO-UGV-Bidirectional_pid) for UGVBidirectional_pid
Tips:
As indicated by the last log, the RL-PID idea is reused in the new bidirectional environment. The difference between
the forward and bidirectional environments lies in how the position error and angular error, the PID inputs, are measured.
To enable backward motion, the position error can be positive or negative depending on the
orientation of the UGV, and the angular error ranges in [-pi/2, pi/2] instead of [-pi, pi] as in the forward environment.
The trained controller performs better in the bidirectional environment thanks to its flexibility in direction.
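A minimal sketch of one way to implement that error convention (this is my reading of the description, not the repo's code):

    import numpy as np

    def bidirectional_errors(dist_to_target, yaw_err):
        # wrap the raw heading error into [-pi, pi)
        yaw_err = (yaw_err + np.pi) % (2 * np.pi) - np.pi
        if abs(yaw_err) > np.pi / 2:
            # target lies behind the UGV: drive backwards instead of turning around,
            # so the position error flips sign and the yaw error folds into [-pi/2, pi/2]
            dist_to_target = -dist_to_target
            yaw_err -= np.sign(yaw_err) * np.pi
        return dist_to_target, yaw_err

    print(bidirectional_errors(1.0, 2.8))   # -> roughly (-1.0, -0.34)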
================= Let's go bros, just get it done =================
==================== Hey hey hey, I'm back again ====================
Log_num: 18
Date: 2023.05.22
Time: 13.50
Contents:
1). Add a DPPO controller (in ./datasave/network/DPPO-UGV-Bidirectional_pid) for UGVBidirectional_pid
2). Add a TwoLinkManipulator environment (in ./environment/env/RobotManipulators)
3). Add a PPO controller (in ./datasave/network/PPO-TwoLinkManipulator) for TwoLinkManipulator
4). Add a DPPO controller (in ./datasave/network/DPPO-4-TwoLinkManipulator) for TwoLinkManipulator
Tips:
The TwoLinkManipulator is fully actuated by torques on each joint axis. We tried to use a TD3 controller for
the manipulator but failed after several attempts. The reason might be that TD3 struggles to converge when the
relation between rewards and actions is complex or not obvious; it might work if the reward
functions were well designed. PPO and DPPO, however, can handle it, presumably because policy-based algorithms can
extract information at a deeper level, but a steady-state error still exists.
================= Let's go bros, just get it done =================