After having written two articles about machine translation, I thought it might be time to address the flourishing field of reinforcement learning. To this end, I have designed a Cython-optimized billiard environment that simulates physically accurate (up to torque) game episodes.
To account for the continuous action space, I decided to implement the Deep Deterministic Policy Gradient algorithm [Lillicrap et al., 2015]. Having only solved easier tasks before, I quickly realized that the initial exploration phase in particular isn’t as straightforward as I had assumed. To rule out any other pitfalls, I further tested some components in isolation, which was made possible by artificially generating data.
It subsequently occurred to me that an article about this analysis might be compelling as well. This article revolves around a mechanism that learns the polar coordinates of one point relative to some other (fixed) point.
The code is, as always, available on GitHub and requires Python 3.7 as well as PyTorch 1.0. Furthermore, the code can easily be run on older systems.
Figure: An overlay of two examples
The network is given an image of some pre-defined resolution that contains two points. The former point (i.e. the origin) is always located in the center, while the latter point’s polar coordinate is governed by the random vector $(R, \Phi)$ where $R \sim \mathcal{U}(0, 1)$ resp. $\Phi \sim \mathcal{U}(0, 2\pi)$. The ultimate goal is to predict these polar coordinates with respect to the origin.
The angles are naturally defined on the interval $[0, 2\pi)$ while the radii range from 0 to 1, where 1 corresponds to the maximum radius $r_{\max}$.
As the network is fed with artificial data, we abstain from any attempt at measuring generalization (e.g. by defining a proper data split).
In the remainder of the article, $(r, \varphi)$ will denote a realization of the bivariate random variable $(R, \Phi)$.
Side note: It should be pointed out that by defining $R$ to be $\mathcal{U}(0, 1)$-distributed resp. $\Phi$ to be $\mathcal{U}(0, 2\pi)$-distributed, we do not get a circular uniform distribution on the unit disk. Applying the transformation $R \mapsto \sqrt{R}$ would, however, yield exactly said distribution. Yet we abstain from using it for the sake of simplicity, as the distribution of $R$ makes our analyses easier than those with regard to $\sqrt{R}$.
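To make the setup concrete, a minimal data generator might look as follows. This is only a sketch under assumptions: the actual resolution, margin handling and rendering used in the repository may differ, and `make_sample` is a hypothetical name.

```python
import numpy as np

def make_sample(resolution=32, rng=np.random.default_rng()):
    """Render one training image containing the origin and one random point.

    Hypothetical sketch; the actual generator lives in the GitHub repository.
    Returns the image together with its polar-coordinate label (r, phi).
    """
    r = rng.uniform(0.0, 1.0)            # radius, relative to the maximum radius
    phi = rng.uniform(0.0, 2.0 * np.pi)  # angle in [0, 2π)
    center = resolution // 2
    max_radius = center - 1              # assumed: leave a one-pixel margin

    # Convert the polar coordinate into pixel space.
    x = int(round(center + r * max_radius * np.cos(phi)))
    y = int(round(center + r * max_radius * np.sin(phi)))

    image = np.zeros((resolution, resolution), dtype=np.float32)
    image[center, center] = 1.0  # the fixed origin
    image[y, x] = 1.0            # the random point
    return image, (r, phi)
```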
The network’s architecture consists of two kinds of blocks:
- FC: Fully connected (aka dense) layer
- BN: Batch Normalization [Ioffe & Szegedy, 2015]
Side note: I kept the configuration of the network as flexible as possible. In this fashion, it is possible to specify the number of blocks (per block class) and other parameters (e.g. strides or kernel sizes). Judging by the complexity of the problem, it should, however, be enough to keep the configuration as it is.
Representation of outputs
To get a meaningful output, the network needs to be able to “reach” any value that occurs in the ground truth. As the angles are defined over $[0, 2\pi)$ and the radii over $[0, 1]$, we can leverage some bounded activation function and a linear transformation thereof.
A possible solution using the sigmoid function $\sigma$ could read as follows (for two network outputs $x_1, x_2$):

$$\hat{r} = \sigma(x_1), \qquad \hat{\varphi} = 2\pi \cdot \sigma(x_2)$$
We immediately see that this meets our requirement, as $\hat{r} \in (0, 1)$ and $\hat{\varphi} \in (0, 2\pi)$ are always valid values. Moreover, this scheme guarantees uniqueness with respect to the angles, as one angle cannot be expressed by multiple values. The boundedness further ensures that out-of-range values can never occur.
Nonetheless, it is not clear whether this scheme is optimal. One potential problem is that both the sigmoid function and the hyperbolic tangent saturate very quickly (e.g. $\sigma(x) \approx 1$ already for moderately large $x$ and, vice versa, $\sigma(x) \approx 0$ for moderately negative $x$), causing many values to be either 0 or 1 after initialization.
A better solution could be to leverage a function that saturates more slowly than the sigmoid function. The softsign function $x \mapsto \frac{x}{1 + |x|}$ can be interpreted as a continuous generalization of the sign function. Its scaled version $x \mapsto \frac{1}{2}\left(\frac{x}{1 + |x|} + 1\right)$ shares the same limits as the sigmoid function (i.e. 0 for $x \to -\infty$ and 1 for $x \to \infty$) but does not exhibit exponential saturation.
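The difference in saturation behaviour is easy to verify numerically; the following sketch contrasts the two functions at a moderately large input:

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def scaled_softsign(x):
    # Rescaled from (-1, 1) to (0, 1) so that it shares the sigmoid's limits.
    return 0.5 * (x / (1.0 + np.abs(x)) + 1.0)

# The scaled softsign approaches its limits only polynomially (~1/x), the
# sigmoid exponentially, so far from zero the sigmoid is much closer to
# saturation.
print(1.0 - sigmoid(5.0))          # ≈ 0.0067
print(1.0 - scaled_softsign(5.0))  # ≈ 0.0833
```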
Why is this important? By design, our radii (the same argument holds true, in adapted form, for the angles) follow a unit uniform distribution $\mathcal{U}(0, 1)$. An optimal network $f$, given some input (in this case our images), necessarily needs to parameterize its output distribution in a way that maximizes its similarity to the distribution of the radii (i.e. $\mathcal{U}(0, 1)$).
We can always conceive of our neural network $f$ as being composed of a function $g$ and some activation function $h$ (s.t. $f = h \circ g$); we can then define $X$ as being the random variable $g(I)$, where $I$ denotes the input.
Since the activation functions in question are all odd (after subtracting $\frac{1}{2}$) and given our assumption that the output should be uniform, we can further assume that $X$’s distribution should be akin to the symmetric uniform distribution $\mathcal{U}(-a, a)$, where $a$ is yet to be determined.
The notion of similarity can be made mathematically precise by using a distance measure on probability distributions. One option is the Bhattacharyya index. Despite not being a proper metric in a strict sense (it does not satisfy the triangle inequality), it serves our purpose of expressing similarity as a single number. As we are solely interested in evaluating whether some pair of distributions is more similar than another one (i.e. relative to the unit uniform distribution $\mathcal{U}(0, 1)$), we could also have used the Kullback–Leibler divergence. Both are expressed as an integral over the domain of the random variables involved; I chose the former as it makes the derivations somewhat easier in this case.
Let us also define a parametric family over both functions: $\sigma_\lambda(x) := \sigma(\lambda x)$ and $s_\lambda(x) := \frac{1}{2}\left(\frac{\lambda x}{1 + |\lambda x|} + 1\right)$ for some $\lambda > 0$. This enables us to further account for varying levels of “steepness”.
Figure 2: A comparison of the sigmoid and the scaled-softsign family for several values of $\lambda$. We observe that the respective sigmoid functions saturate considerably faster than their scaled-softsign counterparts.
In the following part, we will thus deduce the probability distributions of $\sigma_\lambda(X)$ and $s_\lambda(X)$, assuming that $X$ follows the symmetric uniform distribution $\mathcal{U}(-a, a)$ for some yet to be determined $a$.
The distribution of $\sigma_\lambda(X)$ can be inferred using the definition of the cumulative distribution function:

$$F_{\sigma_\lambda(X)}(y) = P\left(\sigma_\lambda(X) \le y\right) = P\left(X \le \sigma_\lambda^{-1}(y)\right) = F_X\left(\sigma_\lambda^{-1}(y)\right)$$
Side note: The probability integral transform tells us that any continuous random variable $X$ can be transformed into a random variable that follows $\mathcal{U}(0, 1)$ by applying its own cumulative distribution function to it (i.e. $F_X(X) \sim \mathcal{U}(0, 1)$). In our case this would amount to just taking a linear activation (the CDF of a uniform distribution is linear on its support), which would, however, no longer meet our requirement of being bounded.
As $\sigma_\lambda$ is strictly monotonic, it is also invertible. The inverse is given by $\sigma_\lambda^{-1}(y) = \frac{1}{\lambda} \ln\frac{y}{1 - y}$, which is a scaled version of the so-called logit or log-odds function.
The density function of $X$ is moreover given by $f_X(x) = \frac{1}{2a}$ if $x \in [-a, a]$ and 0 otherwise.
Putting it all together, we get:

$$f_{\sigma_\lambda(X)}(y) = f_X\left(\sigma_\lambda^{-1}(y)\right) \left|\frac{d}{dy}\,\sigma_\lambda^{-1}(y)\right| = \frac{1}{2a\lambda \, y (1 - y)} \quad \text{for } y \in \left(\sigma_\lambda(-a), \sigma_\lambda(a)\right)$$

and 0 otherwise.
Let us now do the same for $s_\lambda(X)$. The inverse function and its derivative are given by:

$$s_\lambda^{-1}(y) = \frac{2y - 1}{\lambda \left(1 - |2y - 1|\right)}, \qquad \frac{d}{dy}\,s_\lambda^{-1}(y) = \frac{2}{\lambda \left(1 - |2y - 1|\right)^2}$$
Analogously, we get:

$$f_{s_\lambda(X)}(y) = \frac{1}{a\lambda \left(1 - |2y - 1|\right)^2} \quad \text{for } y \in \left(s_\lambda(-a), s_\lambda(a)\right)$$

and 0 otherwise.
Figure 3: Density plots of $\sigma_\lambda(X)$ and $s_\lambda(X)$ for fixed values of $a$ and $\lambda$.
How can we pick a $\lambda$ such that the distributions of $\sigma_\lambda(X)$ and $s_\lambda(X)$ become as uniform as possible for some fixed $a$?
To this end, let us derive the Bhattacharyya index for both cases (the density of $\mathcal{U}(0, 1)$ being identically 1, the index reduces to $\int \sqrt{p(y)} \, dy$ over the respective support):

$$\text{BC}_\sigma(a, \lambda) = \frac{2 \arcsin\left(\tanh\frac{a\lambda}{2}\right)}{\sqrt{2a\lambda}}, \qquad \text{BC}_s(a, \lambda) = \frac{\ln(1 + a\lambda)}{\sqrt{a\lambda}}$$
Interestingly, both functions $\text{BC}_\sigma$ and $\text{BC}_s$ solely depend on the product $a\lambda$ without relying on the variables $a$ and $\lambda$ individually. This enables us to optimize both as functions of the single variable $t := a\lambda$.
Numerical optimization yields $a\lambda \approx 3.38$ with $\text{BC} \approx 0.93$ for the sigmoid family and $a\lambda \approx 3.92$ with $\text{BC} \approx 0.80$ for the scaled-softsign family.
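The numerical optimization can be reproduced with a simple grid search over the closed-form coefficients derived above (a sketch; a fully-fledged optimizer would work just as well):

```python
import numpy as np

# Bhattacharyya coefficients w.r.t. U(0, 1) as functions of t = a * lambda,
# obtained by integrating the square roots of the densities derived above.
def bc_sigmoid(t):
    return 2.0 * np.arcsin(np.tanh(t / 2.0)) / np.sqrt(2.0 * t)

def bc_softsign(t):
    return np.log1p(t) / np.sqrt(t)

# Simple grid search instead of a fully-fledged optimizer.
ts = np.linspace(0.1, 10.0, 1_000_000)
t_sig = ts[np.argmax(bc_sigmoid(ts))]
t_soft = ts[np.argmax(bc_softsign(ts))]

print(t_sig, bc_sigmoid(t_sig))    # ≈ 3.38, ≈ 0.93 (sigmoid family)
print(t_soft, bc_softsign(t_soft)) # ≈ 3.92, ≈ 0.80 (scaled-softsign family)

# After batch normalization Var[X] = 1, and Var[U(-a, a)] = a²/3 implies a = √3:
print(t_sig / np.sqrt(3.0))        # ≈ 1.95
```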
Side note: The Bhattacharyya coefficient satisfies $0 \le \text{BC}(p, q) \le 1$ for all density functions $p$, $q$ on some common domain.
We observe that the sigmoid family is better suited, as it can achieve a higher similarity to the uniform distribution defined on the unit interval. Notice that the network can always attain the optimal product $a\lambda$ by adjusting $a$ for some given $\lambda$. Intuitively, this makes sense as a large $\lambda$ makes the corresponding function steeper, causing the network to settle for a smaller $a$ (i.e. the variance of the input random variable $X$ becomes smaller) to make $\sigma_\lambda(X)$ resp. $s_\lambda(X)$ more uniform.
Probability theory tells us that some random variable $X$ with mean $\mu$ and standard deviation $\sigma$ can always be standardized by defining a new variable $Z = \frac{X - \mu}{\sigma}$. Not knowing $\mu$ and $\sigma$ is not much of a problem, as both can easily be estimated using batch statistics. The process of computing these estimates and applying them is called Batch Normalization [Ioffe & Szegedy, 2015].
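Stripped of the learnable affine parameters and the running statistics, the standardization step can be sketched as follows:

```python
import numpy as np

def standardize(batch, eps=1e-5):
    # The core of what a batch-normalization layer does at training time
    # (omitting the learnable scale/shift and the running statistics
    # that a real layer would additionally maintain for inference).
    mu = batch.mean(axis=0)
    sigma = batch.std(axis=0)
    return (batch - mu) / (sigma + eps)

rng = np.random.default_rng(0)
z = standardize(rng.uniform(-5.0, 5.0, size=(90, 2)))  # batch size 90 as in the experiments
print(z.mean(axis=0), z.std(axis=0))  # ≈ [0 0] and ≈ [1 1]
```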
This enables us to derive a value of $\lambda$ that is independent of $a$: standardization implies $\operatorname{Var}[X] = \frac{a^2}{3} = 1$, i.e. $a = \sqrt{3}$, yielding $\lambda \approx 1.95$ according to the aforementioned numerical optimization for the sigmoid function.
Representation of radii/angles
The radii and angles are accordingly represented as:

$$\hat{r} = \sigma(1.95 \, x_1), \qquad \hat{\varphi} = 2\pi \cdot \sigma(1.95 \, x_2)$$

where $x_1$ and $x_2$ are the outputs of the last batch-normalization layer.
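Assuming this representation, the mapping from the batch-normalized outputs to the predicted polar coordinates can be sketched in a few lines (NumPy here for brevity; the PyTorch version is analogous):

```python
import numpy as np

LAMBDA = 1.95  # steepness derived from the Bhattacharyya analysis above

def polar_head(x1, x2):
    """Map the two batch-normalized network outputs to (radius, angle)."""
    sig = lambda x: 1.0 / (1.0 + np.exp(-LAMBDA * x))
    return sig(x1), 2.0 * np.pi * sig(x2)  # radius in (0, 1), angle in (0, 2π)

print(polar_head(0.0, 0.0))  # (0.5, π)
```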
Assume that we would like to quantify the difference between two given angles $\alpha, \beta \in [0, 2\pi)$. A naïve way to do this is to leverage the absolute difference $|\beta - \alpha|$. This works as long as the absolute difference is not greater than $\pi$ (i.e. if and only if $|\beta - \alpha| \le \pi$) and fails otherwise. The latter case can easily be demonstrated by choosing $\alpha = 0$ and $\beta = \frac{3\pi}{2}$, yielding $\frac{3\pi}{2}$ instead of the actual inscribed angle of $\frac{\pi}{2}$.
Of course, this can easily be resolved by formulating a case distinction along these lines:

$$d(\alpha, \beta) = \begin{cases} |\beta - \alpha| & \text{if } |\beta - \alpha| \le \pi \\ 2\pi - |\beta - \alpha| & \text{otherwise} \end{cases}$$
Continuity: Let us express the above equation in terms of $\delta := \beta - \alpha$, s.t. $d(\delta) = |\delta|$ if $|\delta| \le \pi$ and $2\pi - |\delta|$ otherwise. We observe that $\lim_{\delta \to \pi^-} d(\delta) = \pi$ and $\lim_{\delta \to \pi^+} d(\delta) = \pi$ exist and are identical, from which continuity follows. The function is, however, not differentiable at $\delta = \pi$, as the left-hand derivative $1$ is not equal to the right-hand derivative $-1$.
From a theoretical point of view the missing differentiability poses a problem, as we need to be able to compute a gradient for each conceivable combination of $\alpha$ and $\beta$ to make back-propagation work. In practice, however, this isn’t much of an issue: both cases are individually differentiable, and the points where $d$’s gradient does not exist (whenever we have so-called antipodal angles, i.e. $|\beta - \alpha| = \pi$) can be resolved by setting the gradient to one of the one-sided limits (i.e. $1$ or $-1$). As a side note, this is the same strategy that PyTorch or TensorFlow follows to compute the derivative of the ReLU function at 0.
Figure 4: The left contour plot depicts $d(\alpha, \beta)$ as a function of $\alpha$ and $\beta$. We observe that $0 \le d(\alpha, \beta) \le \pi$ holds for any pair of angles. The global maxima are attained if and only if $\alpha$ and $\beta$ are antipodal (i.e. $d(\alpha, \beta) = \pi$). The right plot illustrates $d$ as a function of the single variable $\delta = \beta - \alpha$. The special case $\delta = \pi$ shows that despite $d$’s continuity it is not differentiable everywhere.
How to implement it?
One, admittedly, rather exotic variant to construct $d$ is to derive it using trigonometric functions. Despite its intriguing mathematical beauty and its simple implementation in PyTorch/TensorFlow, I would not recommend using it: trigonometric functions are orders of magnitude slower than primitive instructions such as additions or subtractions.
Assume that $\mathbf{u}$ and $\mathbf{v}$ are arbitrary vectors in $\mathbb{R}^3$ and that $\theta$ is the inscribed angle between $\mathbf{u}$ and $\mathbf{v}$.
Exploiting the definitions of the cross product and the dot product, we get:

$$\|\mathbf{u} \times \mathbf{v}\| = \|\mathbf{u}\| \, \|\mathbf{v}\| \sin\theta, \qquad \mathbf{u} \cdot \mathbf{v} = \|\mathbf{u}\| \, \|\mathbf{v}\| \cos\theta$$
Side note: One could think that either of these two identities already serves our purpose of computing $\theta$. Unfortunately, both equations are numerically ill-conditioned (inverting $\sin$ resp. $\cos$ near $\pm 1$), which would eventually break our training process.
Furthermore, $\tan\theta = \frac{\sin\theta}{\cos\theta} = \frac{\|\mathbf{u} \times \mathbf{v}\|}{\mathbf{u} \cdot \mathbf{v}}$ by the definition of the tangent.
Any angle can be represented by a two-dimensional vector on the unit circle. Let us exploit this by defining $\mathbf{u} := (\cos\alpha, \sin\alpha, 0)^\top$ and $\mathbf{v} := (\cos\beta, \sin\beta, 0)^\top$ (a trailing 0 was added because the cross product is only defined for three dimensions).
We deduce:

$$(\mathbf{u} \times \mathbf{v})_3 = \cos\alpha \sin\beta - \sin\alpha \cos\beta = \sin(\beta - \alpha), \qquad \mathbf{u} \cdot \mathbf{v} = \cos\alpha \cos\beta + \sin\alpha \sin\beta = \cos(\beta - \alpha)$$

where the last results were obtained by applying the sine & cosine addition theorems in reverse.
Finally, we obtain $\theta = \operatorname{atan2}\left(\sin(\beta - \alpha), \cos(\beta - \alpha)\right)$. Special attention has to be paid because $\operatorname{atan2}$ is defined over the interval $(-\pi, \pi]$, which represents an oriented means of measuring the distance between $\alpha$ and $\beta$. Consequently, our desired angle can easily be computed by harnessing the absolute value $\left|\operatorname{atan2}\left(\sin(\beta - \alpha), \cos(\beta - \alpha)\right)\right|$. As a nice side effect, by setting one of the two angles to 0, the function can also be used to normalize some arbitrary angle into $(-\pi, \pi]$.
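The trigonometric variant then boils down to a one-liner (NumPy sketch; the PyTorch version simply swaps `np` for `torch`):

```python
import numpy as np

def great_circle_distance_trig(alpha, beta):
    # |atan2(sin(β − α), cos(β − α))| folds the difference into [0, π];
    # dividing by π rescales the result to [0, 1].
    return np.abs(np.arctan2(np.sin(beta - alpha), np.cos(beta - alpha))) / np.pi

# Agrees with the case-distinction implementation, just considerably slower
# due to the three trigonometric function evaluations.
alpha = np.array([0.0, 0.25 * np.pi, 1.5 * np.pi])
beta = np.array([1.5 * np.pi, 0.75 * np.pi, 0.25 * np.pi])
print(great_circle_distance_trig(alpha, beta))  # [0.5  0.5  0.75]
```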
The recommended way: Using a Condition
The most efficient way to implement the aforementioned case distinction over $|\beta - \alpha|$ is as follows:
```python
import numpy as np
import torch

def great_circle_distance(alpha, beta):
    """
    Computes the unsigned distance between two angles divided by π
    Range of values: [0, 1]

    :param alpha: A PyTorch Tensor that specifies the first angle
    :param beta: A PyTorch Tensor that specifies the second angle
    :return: A new PyTorch Tensor of the same shape and dtype
    """
    abs_diff = torch.abs(beta - alpha)
    return torch.where(abs_diff <= np.pi, abs_diff, 2. * np.pi - abs_diff) / np.pi
```
Notice that I further rescaled the result by dividing it by $\pi$, which ensures that the value will always be between 0 and 1.
The radial loss is essentially the absolute difference between the predicted and the ground-truth radius, stabilized by some small $\epsilon$. Notice that the loss is lower bounded by 0 and upper bounded by 1 and is thus defined over the same range as the angular loss.
Side note: Remember that the radii always satisfy $r \in [0, 1]$, so the absolute difference between two radii is at most 1.
Due to the presence of two losses we need to find a way to balance both. Remember that we scaled both the angular and the radial loss to be within $[0, 1]$. This way, the angular loss attains a maximum value of 1 if the angles to be compared are antipodal, while the radial loss attains the same value whenever one radius is 0 and the other one is equal to 1. Taking the average of both therefore ensures that the resulting loss is on the same scale and that both terms are weighted equally.
Therefore, we get:

$$\mathcal{L}(\hat{y}, y) = \frac{1}{2}\left(\mathcal{L}_{\text{rad}}(\hat{r}, r) + \mathcal{L}_{\text{ang}}(\hat{\varphi}, \varphi)\right)$$

where $\hat{y} = (\hat{r}, \hat{\varphi})$ is the output of the network and $y = (r, \varphi)$ denotes the ground truth.
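A compact sketch of the composite loss (with the plain absolute difference standing in for the radial loss, i.e. ignoring the $\epsilon$ stabilization) could read:

```python
import numpy as np

def composite_loss(pred_radius, pred_angle, true_radius, true_angle):
    """Equally-weighted average of the radial and the angular loss.

    The plain absolute difference stands in for the radial loss here;
    both terms live in [0, 1], so their mean does as well.
    """
    radial = np.abs(pred_radius - true_radius)
    abs_diff = np.abs(pred_angle - true_angle)
    angular = np.where(abs_diff <= np.pi, abs_diff, 2.0 * np.pi - abs_diff) / np.pi
    return 0.5 * (radial + angular)

# Worst case: antipodal angles and maximally wrong radius.
print(composite_loss(0.0, 0.0, 1.0, np.pi))  # 1.0
```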
Let us now denote the radial loss by $\mathcal{L}_{\text{rad}}$, the angular loss by $\mathcal{L}_{\text{ang}}$ and the composite loss by $\mathcal{L}$.
I performed five runs, each associated with a different $\lambda$, with 1.95 being the (theoretically) optimal value. The training was terminated after 2000 steps; the batch size was set to 90 to fully utilize my GPU at the expense of having less stochasticity in the gradients. Moreover, I used the Nesterov-accelerated gradient optimizer [Nesterov, 1983] with an initial learning rate of 0.1 and a momentum value of 0.95. The learning rate was exponentially decayed by a factor of 0.95 every 30 steps.
One batch was set aside, on which I evaluated the loss after each step. Notice, however, that this is not a proper data split, as each test image can eventually also be part of a training batch.
Figure 5: Training-loss components over time. We observe that the optimal value $\lambda = 1.95$ performs better than both lower and higher values, reaching its minimal loss after roughly 300 steps. The curves were lightly smoothed using cubic B-splines.
Histogram of predictions over time
Figure 6: The left plot shows the histograms for the radii while the right plot corresponds to the angles. We observe that the support of the output distribution for lower-than-optimal values of $\lambda$ cannot fully cover the range of the actual ground truth, whereas higher-than-optimal values put too much weight on the boundaries of the distribution.
Figure 7: Predictions for four examples (randomly selected from the “test” batch). There is a noticeable tendency of the predicted points to fall back to a fixed angle (irrespective of the radius), which has to be analyzed in more detail.