Third Place Solution in 2021 Fall ROAR S1 Series
Written by Tianlun Zhang and Star Li, describing their experience and third-place solution in the Fall 2021 ROAR S1 Series Competition.
In this post, we walk through our solution to the ROAR S1 Competition in Fall 2021, which blends algorithms from deep reinforcement learning (DRL) into traditional controllers and helped us take 3rd place in the competition. All of our code is open source and available at https://github.com/listar2000/ROAR_gym/tree/tianlun; we hope our approach can inspire any individual or team that wishes to incorporate DRL into autonomous racing.
In short, we use a traditional PID controller to maneuver our vehicle and a DQN agent (a DRL algorithm, more details below) to dynamically set the PID coefficients in different regions of the racing track. During training, we reward the agent for its throttle output, for where it crosses waylines, and for driving faster than a certain speed; we penalize it for collisions and for high steering on straight lines, and we apply a constant surviving penalty. The next two sections are dedicated to concepts and design choices related to this core control mechanism.
For throttle control, we adopted the approach of a previous winning solution, which does not implement braking and outputs throttle = exp(-0.07 * |roll|) until the agent reaches a turn. We detect turns by calculating the tangent values of the angles between the agent’s current direction and upcoming waylines. In the current solution, we simply reduce the throttle to 0 until the agent finishes turning.
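As a rough illustration of the roll-based throttle rule, the sketch below evaluates exp(-0.07 * |roll|) and cuts the throttle to zero while a turn is flagged. The function names, the `in_turn` flag, and the assumption that roll is measured in degrees are hypothetical and chosen for this example only.

```python
import math


def roll_throttle(roll_deg: float, k: float = 0.07) -> float:
    """Throttle in (0, 1] that decays exponentially with the magnitude of roll.

    Assumes roll is given in degrees; k = 0.07 is the constant quoted in the text.
    """
    return math.exp(-k * abs(roll_deg))


def throttle_command(roll_deg: float, in_turn: bool) -> float:
    """Return 0 while a (hypothetical) turn detector reports a turn, else the roll rule."""
    return 0.0 if in_turn else roll_throttle(roll_deg)


# On flat ground the rule asks for full throttle; more roll means less throttle.
flat = throttle_command(0.0, in_turn=False)
tilted = throttle_command(10.0, in_turn=False)
turning = throttle_command(0.0, in_turn=True)
```

Because the exponential never reaches zero, this rule alone never stops the car, which is why a separate turn signal is needed to cut the throttle.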
In addition to the RL-supported controller, and for the sake of winning the competition, we also implemented some hacks and components aimed at solving the “edge cases” in autonomous racing. We discuss them in detail in section 4.
Finally, even though we spent a lot of time last semester brainstorming and improving our solution, we realized that there is still plenty of room for improvement, and even for redesigning key aspects of our pipeline. Therefore, the last part of this post covers some next steps for anyone interested in building on top of this solution.
According to Wikipedia, a proportional–integral–derivative (PID) controller is a closed-loop control algorithm that calculates the difference between the desired setpoint (SP) and a measured process variable (PV) and applies a correction based on proportional, integral, and derivative terms (denoted P, I, and D respectively). Our design only requires a lateral (steering) PID controller, while the longitudinal (throttle) control is managed by a roll controller, as will be mentioned later. We simply use the “vanilla” PID version offered in the ROAR codebase because the focus of our solution is not on the controller side. The only change we made to the PID controller is to make sure that its coefficients can be set dynamically, which enables smooth integration with our DRL agent. For more information about our PID controller, such as what each term represents or a more in-depth explanation, we recommend reading the blog posts from previous winners, such as the following post.
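To make the idea of dynamically settable coefficients concrete, here is a minimal sketch of a textbook PID controller whose gains can be replaced between steps. It is not the ROAR codebase's implementation; the class name, the `set_gains` method, and the clipping of the output to [-1, 1] are assumptions made for this example.

```python
from dataclasses import dataclass


@dataclass
class SteeringPID:
    """Textbook PID controller with gains that can be changed at run time."""

    kp: float
    ki: float
    kd: float
    integral: float = 0.0
    prev_error: float = 0.0

    def set_gains(self, kp: float, ki: float, kd: float) -> None:
        # Hypothetical hook: an external agent calls this to swap coefficients.
        self.kp, self.ki, self.kd = kp, ki, kd

    def step(self, error: float, dt: float) -> float:
        """Return a steering command for the current error (setpoint minus measurement)."""
        self.integral += error * dt
        derivative = (error - self.prev_error) / dt
        self.prev_error = error
        output = self.kp * error + self.ki * self.integral + self.kd * derivative
        # Assumption: steering commands are normalized to [-1, 1].
        return max(-1.0, min(1.0, output))


pid = SteeringPID(kp=0.5, ki=0.0, kd=0.0)
first = pid.step(error=0.4, dt=0.1)   # proportional term only
pid.set_gains(1.0, 0.0, 0.0)          # gains swapped mid-run
second = pid.step(error=0.4, dt=0.1)
```

Keeping the integral and previous error across a gain change means the controller state stays continuous when the coefficients are updated.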
Deep Reinforcement Learning
Reinforcement learning (RL) is a branch of modern machine learning that focuses on how intelligent agents ought to take actions in an environment in order to maximize cumulative reward. We decided to center our solution around RL since its objective aligns well with the goal of autonomous racing: enabling an autonomous agent to make decisions (steering and throttle) within an environment (the racing track), with the aim of maximizing some reward (e.g. the shortest single-lap time) given the observations (e.g. the visual inputs from sensors). To be more precise, we built a custom Gym environment and used the `stable-baselines3` package in Python to train our DQN agent. Some important specifications of our environment are:
Observation space: an array of length 13 containing the vehicle’s current speed, location, and direction, and the next waypoint’s location and direction:
[current_speed, vehicle_x_location, vehicle_y_location, vehicle_z_location, vehicle_x_rotation, vehicle_y_rotation, vehicle_z_rotation, next_waypoint_x_location, next_waypoint_y_location, next_waypoint_z_location, next_waypoint_x_rotation, next_waypoint_y_rotation, next_waypoint_z_rotation]
where current_speed ∈ [-220, 220], all location variables ∈ [-1000, 1000], and all rotation variables ∈ [-360, 360]
Action space: an array of length 3 representing the PID values, where each value ranges from 0 to 0.6 in steps of 0.1:
[P, I, D] where P,I,D ∈ range(0, 1, 0.1)
Reward:
1. Constant surviving penalty.
2. Reward proportional to the throttle output, to encourage full throttle.
3. Constant reward whenever the agent is driving faster than a target speed, to encourage high speed.
4. Reward for passing a wayline. The reward ratio is based on the distance between the crossing position and the target waypoint.
5. Negative reward on steering and on changes of steering when the agent is on a straight line.
6. Large negative reward for a collision.
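The reward terms described in this post can be combined into a single per-step function along the following lines. This is only a sketch: every numeric weight, the target speed, the linear falloff used for the wayline ratio, and all names (`StepInfo`, `step_reward`, and the constants) are placeholders invented for illustration, not values from the training code.

```python
from dataclasses import dataclass


@dataclass
class StepInfo:
    """Hypothetical per-step signals the environment would need to expose."""

    throttle: float
    speed: float
    steering: float
    prev_steering: float
    on_straight: bool
    crossed_wayline: bool
    crossing_offset: float  # distance from the crossing point to the target waypoint
    collided: bool


# Placeholder weights; the real values are not given in the text.
SURVIVAL_PENALTY = -0.1
THROTTLE_WEIGHT = 0.5
SPEED_BONUS = 1.0
TARGET_SPEED = 100.0
WAYLINE_REWARD = 10.0
WAYLINE_HALF_WIDTH = 5.0
STEERING_WEIGHT = 1.0
COLLISION_PENALTY = -1000.0


def step_reward(info: StepInfo) -> float:
    reward = SURVIVAL_PENALTY                      # constant surviving penalty
    reward += THROTTLE_WEIGHT * info.throttle      # encourage full throttle
    if info.speed > TARGET_SPEED:                  # bonus above the target speed
        reward += SPEED_BONUS
    if info.crossed_wayline:                       # closer to the waypoint pays more
        ratio = max(0.0, 1.0 - info.crossing_offset / WAYLINE_HALF_WIDTH)
        reward += WAYLINE_REWARD * ratio
    if info.on_straight:                           # discourage steering on straights
        change = abs(info.steering - info.prev_steering)
        reward -= STEERING_WEIGHT * (abs(info.steering) + change)
    if info.collided:                              # dominate everything else
        reward += COLLISION_PENALTY
    return reward


cruising = StepInfo(1.0, 120.0, 0.0, 0.0, True, False, 0.0, False)
cruise_reward = step_reward(cruising)
```

The relative magnitudes matter more than the absolute ones: the collision penalty should outweigh anything the agent can collect over a short horizon, or crashing at full throttle can look attractive.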
Specifically, DRL uses a deep neural network (not necessarily very deep) as an approximator to represent important functions in RL (such as Q-functions). In Deep Q-learning (DQN), the DRL agent needs to learn the Q-function of the given environment, which captures the value (or expected reward) of performing a certain action in a certain state (i.e. observation). For instance, drawing an analogy from a real-world setting, eating an organic salad (an action) should have a higher value than eating French fries for someone on a diet (a state). Instead of coding up every detail of DQN, we take advantage of the out-of-the-box algorithms and interfaces of `stable-baselines3` to construct our agent.
We trained our agent in the above environment for 10 hours using the NVIDIA 3090 GPU in our lab. The training script and configurations can also be found at the GitHub link given above. The end result is a trained DQN agent that takes in the state of our autonomous vehicle and the track at every frame and returns the predicted PID coefficients that best suit our controller in the current state.
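The diet analogy can be turned into a tiny tabular Q-learning example. This sketch uses a lookup table instead of a neural network, so it illustrates only the Q-function idea, not DQN itself; the state and action names, the rewards, and the learning-rate and discount values are made up for the example.

```python
from collections import defaultdict


def q_update(q, state, action, reward, next_state, actions, alpha=0.5, gamma=0.9):
    """One Q-learning step: move Q(s, a) toward r + gamma * max_a' Q(s', a')."""
    best_next = max(q[(next_state, a)] for a in actions)
    td_target = reward + gamma * best_next
    q[(state, action)] += alpha * (td_target - q[(state, action)])


actions = ["salad", "fries"]
q = defaultdict(float)  # every Q-value starts at 0

# Someone on a diet repeatedly tries both meals and observes a made-up reward.
for _ in range(20):
    q_update(q, "on_diet", "salad", reward=1.0, next_state="done", actions=actions)
    q_update(q, "on_diet", "fries", reward=-1.0, next_state="done", actions=actions)

best_action = max(actions, key=lambda a: q[("on_diet", a)])
```

DQN follows the same update rule in spirit but replaces the table `q` with a network that maps an observation to one Q-value per action, which is what allows it to handle a 13-dimensional continuous observation.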
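Since DQN chooses among a finite set of actions, one plausible way to expose the PID grid to the agent is to enumerate every (P, I, D) combination and let the agent output an index. The encoding below is an assumption made for illustration, as are the names `build_pid_grid` and `decode_action`; the repository may organize its action space differently.

```python
from itertools import product


def build_pid_grid(max_gain: float = 0.6, step: float = 0.1):
    """Enumerate all (P, I, D) triples on a uniform grid from 0 to max_gain."""
    n_levels = round(max_gain / step) + 1
    # Rounding removes floating-point noise such as 0.30000000000000004.
    levels = [round(i * step, 10) for i in range(n_levels)]
    return list(product(levels, repeat=3))


PID_GRID = build_pid_grid()  # 7 levels per gain, so 7 ** 3 = 343 actions


def decode_action(index: int):
    """Map a discrete action index chosen by the agent to a (P, I, D) triple."""
    return PID_GRID[index]


kp, ki, kd = decode_action(100)
```

At each frame the decoded triple would be handed to the steering controller before it computes the next command. The grid bounds are parameters here, so a finer or wider grid only changes the number of discrete actions.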
Handling Special Cases
We implemented a turning detection algorithm to slow down the vehicle before it enters the four sharp turns on the map. The algorithm finds the immediate next wayline and the 20th and 40th waylines ahead, and calculates the tangent values of the angles between them from their slopes. The first tangent value tells whether the vehicle is currently turning, and the second tells whether the vehicle will enter a sharp turn soon. In both cases, we set the throttle to 0 if the vehicle is driving faster than a target turning speed, the speed at which we believe the vehicle can finish the turn without collisions, so that the vehicle slows down if it is driving too fast to turn safely.
In the prediction phase, the DQN model turns out to handle the steering well in most cases. However, on a few short stretches of straight track, the agent wobbles and loses speed. For racing purposes, we boxed these areas and applied a known working set of PID values whenever the agent is inside a box. This issue might be solved by future work, which we will talk about in section 6.
Even though our solution has surpassed many traditional, non-RL-based results, we have also noticed considerable room for further improvement. Due to the page limit, we will only present some of the possibilities here.
For the DRL part, there might be better alternatives to the vanilla DQN algorithm. For example, the Double Q-learning (DDQN) algorithm has better stability properties and reduces the overestimation of the Q-function. There are also other branches of DRL, such as policy-gradient-based algorithms (e.g. DDPG), that we had no time to test. The action space we used in DQN is discrete and small, which means it might not be fine-grained enough to contain the optimal values. Even with the same DQN algorithm, it might be worthwhile to spend some time on a more comprehensive search over hyperparameters and reward ratios; it might even be desirable to incorporate automatic hyperparameter tuning techniques such as Bayesian optimization. Also, since the model takes only the agent’s current location and direction, its current speed, and the next waypoint’s location as input, all of which are independent of the overall map, it is possible to train the agent separately on different cases to get more accurate predictions.
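The tangent of the angle between two lines with slopes m1 and m2 is |(m2 - m1) / (1 + m1 * m2)|, which gives a compact way to sketch the detection logic. In the example below, the pairing of waylines (next with 20th, 20th with 40th), the threshold, the target turning speed, and the function names are all assumptions made for illustration.

```python
import math


def tan_between(m1: float, m2: float) -> float:
    """Tangent of the angle between two lines given their slopes."""
    denom = 1.0 + m1 * m2
    if denom == 0.0:
        return math.inf  # the lines are perpendicular
    return abs((m2 - m1) / denom)


def should_cut_throttle(slopes, speed, turn_speed, tan_threshold, near=20, far=40):
    """Decide whether to drop the throttle to 0.

    slopes[i] is the slope of the i-th upcoming wayline (index 0 is the next one).
    """
    turning_now = tan_between(slopes[0], slopes[near]) > tan_threshold
    turn_ahead = tan_between(slopes[near], slopes[far]) > tan_threshold
    return (turning_now or turn_ahead) and speed > turn_speed


straight = [0.0] * 41                  # all waylines parallel
corner = [0.0] * 20 + [1.0] * 21       # heading changes by 45 degrees

cut_on_straight = should_cut_throttle(straight, speed=100.0, turn_speed=60.0, tan_threshold=0.5)
cut_in_corner = should_cut_throttle(corner, speed=100.0, turn_speed=60.0, tan_threshold=0.5)
```

A slope-based test degrades when a wayline is nearly vertical in map coordinates, so a production version would probably compare direction vectors instead of slopes.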
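The boxed override can be pictured as a lookup that runs before the agent's prediction is applied. The sketch below assumes axis-aligned boxes over two ground-plane coordinates; the box coordinates, the gain values, and the names `PIDBox` and `select_gains` are hypothetical.

```python
from typing import NamedTuple, Sequence, Tuple

Gains = Tuple[float, float, float]


class PIDBox(NamedTuple):
    """Axis-aligned region of the track with a fixed, known-good set of PID gains."""

    x_min: float
    x_max: float
    y_min: float
    y_max: float
    gains: Gains


def select_gains(x: float, y: float, agent_gains: Gains, boxes: Sequence[PIDBox]) -> Gains:
    """Use the box's fixed gains inside a boxed area, otherwise the agent's prediction."""
    for box in boxes:
        if box.x_min <= x <= box.x_max and box.y_min <= y <= box.y_max:
            return box.gains
    return agent_gains


# Made-up example: one problematic straight with hand-picked gains.
boxes = [PIDBox(100.0, 200.0, -50.0, 50.0, gains=(0.3, 0.0, 0.1))]
predicted = (0.5, 0.1, 0.2)

inside = select_gains(150.0, 0.0, predicted, boxes)
outside = select_gains(300.0, 0.0, predicted, boxes)
```

This keeps the override separate from the learned policy, so the boxes can be removed one by one if retraining fixes the wobble.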
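The difference between the vanilla DQN target and the Double DQN target fits in a few lines. The sketch below works on plain lists of Q-values for the next state; the function names and the example numbers are illustrative only.

```python
def dqn_target(reward, next_q_target, gamma=0.99, done=False):
    """Vanilla DQN: the target network both selects and evaluates the next action."""
    if done:
        return reward
    return reward + gamma * max(next_q_target)


def double_dqn_target(reward, next_q_online, next_q_target, gamma=0.99, done=False):
    """Double DQN: the online network selects the action, the target network evaluates it."""
    if done:
        return reward
    best = max(range(len(next_q_online)), key=next_q_online.__getitem__)
    return reward + gamma * next_q_target[best]


# Made-up Q-values where the target network overrates action 0.
next_q_online = [1.0, 2.0]
next_q_target = [5.0, 3.0]

vanilla = dqn_target(0.0, next_q_target, gamma=1.0)
double = double_dqn_target(0.0, next_q_online, next_q_target, gamma=1.0)
```

Because the evaluated action is not necessarily the one with the highest target-network value, the Double DQN target can never exceed the vanilla one, which is the mechanism that curbs overestimation.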