TensorFlow - part 3: Automatic differentiation

Automatic differentiation is very handy for running backpropagation when training neural networks.

Let's import the necessary packages:

import numpy as np
import matplotlib.pyplot as plt
import tensorflow as tf

tf.GradientTape is an API for automatic differentiation. To differentiate automatically in the backward pass, it needs to know which operations were executed in the forward pass and in what order.

tensor_x = tf.Variable(2.0) # x: a scalar TensorFlow variable

with tf.GradientTape() as tape: # operations executed inside this block are recorded on the tape
    tensor_y1 = tensor_x**2 # y1 = x^2
    tensor_y2 = tensor_x**3 # y2 = x^3

TensorFlow remembers all the operations executed in the tf.GradientTape context by recording them onto a tape.

To calculate the gradient of a target with respect to a source, use tape.gradient(target, source). For example, the following calculates the derivative of y1 with respect to x.

tensor_dy1_dx = tape.gradient(tensor_y1, tensor_x) # dy = 2x dx
print(tensor_dy1_dx)

Output

tf.Tensor(4.0, shape=(), dtype=float32)

We know that dy1/dx = 2x. At x = 2 this gives dy1/dx = 4, which matches the output.

Now, try again with the differentiation of y2 with respect to x.

tensor_dy2_dx = tape.gradient(tensor_y2, tensor_x) # dy = 3x^2 dx
print(tensor_dy2_dx)

Output

RuntimeError: A non-persistent GradientTape can only be used to compute one set of gradients (or jacobians)

This raises the RuntimeError shown above, because by default GradientTape.gradient() can be called only once: the resources held by the tape are released as soon as that call finishes. A persistent gradient tape is the solution for calling gradient() multiple times. With a persistent tape, TensorFlow releases the resources only when the tape object is garbage-collected.
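
The behaviour of a non-persistent tape can be reproduced in isolation. The following is a minimal sketch that catches the error instead of letting it stop the program; the names x, y1, y2 and second_call_failed are chosen here for illustration only.

```python
import tensorflow as tf

x = tf.Variable(2.0)

# A default (non-persistent) tape records the forward pass.
with tf.GradientTape() as tape:
    y1 = x**2
    y2 = x**3

dy1_dx = tape.gradient(y1, x)  # first call succeeds: 2 * x

second_call_failed = False
try:
    tape.gradient(y2, x)  # second call on the same non-persistent tape
except RuntimeError as err:
    second_call_failed = True
    print("RuntimeError:", err)

print(float(dy1_dx))
```

The first gradient is still available after the failed second call; only the tape itself can no longer be used.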

Persistent tape

To make a tape persistent, set the argument persistent to True (persistent=True). Run the code below to verify.

with tf.GradientTape(persistent=True) as tape:
    tensor_y1 = tensor_x**2
    tensor_y2 = tensor_x**3
    tensor_y3 = 4*tensor_x + 1

# To calculate the gradient of some target with respect to some source
tensor_dy1_dx = tape.gradient(tensor_y1, tensor_x) # dy = 2x dx
tensor_dy2_dx = tape.gradient(tensor_y2, tensor_x) # dy = 3x^2 dx
tensor_dy3_dx = tape.gradient(tensor_y3, tensor_x) # dy = 4 dx
print(tensor_dy1_dx)
print(tensor_dy2_dx)
print(tensor_dy3_dx)

Output

tensor_x was assigned the value 2 above, so the results are:

tf.Tensor(4.0, shape=(), dtype=float32)
tf.Tensor(12.0, shape=(), dtype=float32)
tf.Tensor(4.0, shape=(), dtype=float32)

Remember to delete the persistent tape once you are done with it, so that its resources can be released.

del tape

Suppose we need to calculate a simple linear equation y = x*w + b.

First initialize a tensor_w of shape (4, 3) with random uniform values and a tensor_b of shape (3,).

tf.random.set_seed(1)
tensor_w = tf.Variable(tf.random.uniform((4, 3), minval=-20, maxval=20, dtype=tf.float32), name='w') # w: weight
tensor_b = tf.Variable(tf.ones(3, dtype=tf.float32), name='b') # bias
print(tensor_w)

Output

<tf.Variable 'w:0' shape=(4, 3) dtype=float32, numpy=
array([[-13.394766 ,  16.05925  ,   5.238968 ],
[ -2.6181564,  -8.322439 ,   5.700083 ],
[ 19.031418 ,  -2.5960197,   6.4040756],
[  4.195833 ,   5.4652596,   4.5779514]], dtype=float32)>

Next, we postpone the definition of tensor_x to a later part, because we want to point out some important details by assigning it in several different ways.

with tf.GradientTape(persistent=True) as tape:
    # tape.watch(tensor_x) # Leave this commented out for now; tape.watch() is explained later
    tensor_y = tensor_x @ tensor_w + tensor_b # @ is matrix multiplication
    tensor_loss = tf.reduce_mean(tensor_y**2) # loss = (y1^2 + y2^2 + y3^2)/3, so dloss/dyi = 2*yi/3

print('[+] tensor_y: ', tensor_y)
print('[+] tensor_loss: ', tensor_loss)

All the operations inside the with tf.GradientTape(persistent=True) as tape: are said to be in the GradientTape context.

Next is the gradient calculation part.

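tensor_dy_dx = tape.gradient(tensor_y, tensor_x) # target tensor_y is not a scalar, so this is the gradient of the sum of its elements
tensor_dloss_dy = tape.gradient(tensor_loss, tensor_y) # source tensor_y has 3 elements, so there are 3 derivatives: dloss/dyi = 2*yi/3
tensor_dloss_dx = tape.gradient(tensor_loss, tensor_x)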

print('[+] tensor_dy_dx: ', tensor_dy_dx)
print('[+] tensor_dloss_dy: ', tensor_dloss_dy)
print('[+] tensor_dloss_dx: ', tensor_dloss_dx)

Now it is time to define tensor_x. The code below shows 5 alternative definitions. They differ in whether tf.Variable or tf.constant is used, whether trainable is True or False, whether dtype is int or float, and whether an operation with a constant follows the tf.Variable definition. We will assign tensor_x using each of the 5 definitions in turn and see what happens.

tensor_x = tf.Variable([[1, 2, 3, 4]], dtype=tf.float32, name='x') # Case 1
tensor_x = tf.Variable([[1, 2, 3, 4]], dtype=tf.int32, name='x') # Case 2. Note that in this case, tensor_w and tensor_b are also converted to type tf.int32
tensor_x = tf.Variable([[1, 2, 3, 4]], dtype=tf.float32, trainable=False) # Case 3
tensor_x = tf.constant([[1, 2, 3, 4]], dtype=tf.float32) # Case 4
tensor_x = tf.Variable([[1, 2, 3, 4]], dtype=tf.float32, name='x') + 1.0 # Case 5

Next, we show the outputs for each case of tensor_x:

Output

Try to find the differences between the cases on your own before we explain them below. Hint: several calculations result in None, and for more than one reason.

Case 1:
[+] tensor_y:  tf.Tensor([[56.246506 14.48735  55.163166]], shape=(1, 3), dtype=float32)
[+] tensor_loss:  tf.Tensor(2138.8425, shape=(), dtype=float32)
[+] tensor_dy_dx:  tf.Tensor([[ 7.903452 -5.240513 22.839474 14.239044]], shape=(1, 4), dtype=float32)
[+] tensor_dloss_dy:  tf.Tensor([[37.497673  9.658234 36.775444]], shape=(1, 3), dtype=float32)
[+] tensor_dloss_dx:  tf.Tensor([[-154.50319    31.068237  924.07367   378.47495 ]], shape=(1, 4), dtype=float32)

Case 2:
[+] tensor_y:  tf.Tensor([[ -5 103 -64]], shape=(1, 3), dtype=int32)
[+] tensor_loss:  tf.Tensor(4910, shape=(), dtype=int32)
[+] tensor_dy_dx:  None
[+] tensor_dloss_dy:  None
[+] tensor_dloss_dx:  None

Case 3:
[+] tensor_y:  tf.Tensor([[56.246506 14.48735  55.163166]], shape=(1, 3), dtype=float32)
[+] tensor_loss:  tf.Tensor(2138.8425, shape=(), dtype=float32)
[+] tensor_dy_dx:  None
[+] tensor_dloss_dy:  tf.Tensor([[37.497673  9.658234 36.775444]], shape=(1, 3), dtype=float32)
[+] tensor_dloss_dx:  None

Case 4:
[+] tensor_y:  tf.Tensor([[56.246506 14.48735  55.163166]], shape=(1, 3), dtype=float32)
[+] tensor_loss:  tf.Tensor(2138.8425, shape=(), dtype=float32)
[+] tensor_dy_dx:  None
[+] tensor_dloss_dy:  tf.Tensor([[37.497673  9.658234 36.775444]], shape=(1, 3), dtype=float32)
[+] tensor_dloss_dx:  None

Case 5:
[+] tensor_y:  tf.Tensor([[63.46084  25.093401 77.08424 ]], shape=(1, 3), dtype=float32)
[+] tensor_loss:  tf.Tensor(3532.9792, shape=(), dtype=float32)
[+] tensor_dy_dx:  None
[+] tensor_dloss_dy:  tf.Tensor([[42.307224 16.728935 51.389496]], shape=(1, 3), dtype=float32)
[+] tensor_dloss_dx:  None
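
The Case 1 numbers can be checked by hand with the chain rule. The following is a NumPy sketch, assuming the values of tensor_w printed earlier and tensor_x = [[1, 2, 3, 4]]; the variable names are shortened for illustration and the computation runs in float64, so the last digits can differ slightly from the float32 output above.

```python
import numpy as np

# Values of tensor_w copied from the printed output; x is the Case 1 input.
w = np.array([[-13.394766, 16.05925, 5.238968],
              [-2.6181564, -8.322439, 5.700083],
              [19.031418, -2.5960197, 6.4040756],
              [4.195833, 5.4652596, 4.5779514]])
b = np.ones(3)
x = np.array([[1.0, 2.0, 3.0, 4.0]])

y = x @ w + b                    # forward pass, shape (1, 3)
loss = np.mean(y**2)             # mean of the squared outputs

dloss_dy = 2.0 * y / y.size      # derivative of the mean of squares
dy_dx = np.ones_like(y) @ w.T    # gradient of sum(y) with respect to x
dloss_dx = dloss_dy @ w.T        # chain rule: dloss/dx = dloss/dy @ w^T

print(y, loss)
print(dy_dx)
print(dloss_dy)
print(dloss_dx)
```

Note that dy_dx is the gradient of the sum of the elements of y, because tape.gradient sums over a non-scalar target.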

Do you remember the line tape.watch(tensor_x), which is commented out in the tf.GradientTape context above? Now let's uncomment it and try the 5 cases of tensor_x again:

with tf.GradientTape(persistent=True) as tape:
    tape.watch(tensor_x) # Uncomment this line
    tensor_y = tensor_x @ tensor_w + tensor_b # @ is matrix multiplication
    tensor_loss = tf.reduce_mean(tensor_y**2) # loss = (y1^2 + y2^2 + y3^2)/3, so dloss/dyi = 2*yi/3

The outputs are then:

Output

You can see that there are fewer None values than before. Several gradients that have tensor_x as the source have changed from None to a value. That is because tensor_x is now watched by the GradientTape, so gradients with respect to it can be computed. Case 2 still gives None because its tensors are of integer type.

Case 1:
[+] tensor_y:  tf.Tensor([[56.246506 14.48735  55.163166]], shape=(1, 3), dtype=float32)
[+] tensor_loss:  tf.Tensor(2138.8425, shape=(), dtype=float32)
[+] tensor_dy_dx:  tf.Tensor([[ 7.903452 -5.240513 22.839474 14.239044]], shape=(1, 4), dtype=float32)
[+] tensor_dloss_dy:  tf.Tensor([[37.497673  9.658234 36.775444]], shape=(1, 3), dtype=float32)
[+] tensor_dloss_dx:  tf.Tensor([[-154.50319    31.068237  924.07367   378.47495 ]], shape=(1, 4), dtype=float32)

Case 2:
[+] tensor_y:  tf.Tensor([[ -5 103 -64]], shape=(1, 3), dtype=int32)
[+] tensor_loss:  tf.Tensor(4910, shape=(), dtype=int32)
[+] tensor_dy_dx:  None
[+] tensor_dloss_dy:  None
[+] tensor_dloss_dx:  None

Case 3:
[+] tensor_y:  tf.Tensor([[56.246506 14.48735  55.163166]], shape=(1, 3), dtype=float32)
[+] tensor_loss:  tf.Tensor(2138.8425, shape=(), dtype=float32)
[+] tensor_dy_dx:  tf.Tensor([[ 7.903452 -5.240513 22.839474 14.239044]], shape=(1, 4), dtype=float32)
[+] tensor_dloss_dy:  tf.Tensor([[37.497673  9.658234 36.775444]], shape=(1, 3), dtype=float32)
[+] tensor_dloss_dx:  tf.Tensor([[-154.50319    31.068237  924.07367   378.47495 ]], shape=(1, 4), dtype=float32)

Case 4:
[+] tensor_y:  tf.Tensor([[56.246506 14.48735  55.163166]], shape=(1, 3), dtype=float32)
[+] tensor_loss:  tf.Tensor(2138.8425, shape=(), dtype=float32)
[+] tensor_dy_dx:  tf.Tensor([[ 7.903452 -5.240513 22.839474 14.239044]], shape=(1, 4), dtype=float32)
[+] tensor_dloss_dy:  tf.Tensor([[37.497673  9.658234 36.775444]], shape=(1, 3), dtype=float32)
[+] tensor_dloss_dx:  tf.Tensor([[-154.50319    31.068237  924.07367   378.47495 ]], shape=(1, 4), dtype=float32)

Case 5:
[+] tensor_y:  tf.Tensor([[63.46084  25.093401 77.08424 ]], shape=(1, 3), dtype=float32)
[+] tensor_loss:  tf.Tensor(3532.9792, shape=(), dtype=float32)
[+] tensor_dy_dx:  tf.Tensor([[ 7.903452 -5.240513 22.839474 14.239044]], shape=(1, 4), dtype=float32)
[+] tensor_dloss_dy:  tf.Tensor([[42.307224 16.728935 51.389496]], shape=(1, 3), dtype=float32)
[+] tensor_dloss_dx:  tf.Tensor([[ -28.813324   42.9319   1090.8401    504.20065 ]], shape=(1, 4), dtype=float32)

tape.watch() is needed when tensor_x is a plain tensor rather than a trainable tf.Variable. It tells the tape to trace that tensor.

From the two experiments above, we can infer 3 cases of tensor_x in which tape.watch is needed:

• When a tf.Variable has trainable=False, we need tape.watch to calculate gradients with respect to this Variable.
• When tensor_x is a tensor (not a tf.Variable), we need tape.watch to calculate gradients with respect to this tensor.
• When a tf.Variable is combined with a number or a tensor by an operation (such as the addition in Case 5), the result is a tensor, not a tf.Variable. So, to calculate gradients with respect to it, we also need tape.watch.

Overall, each source in a gradient calculation must be either a trainable tf.Variable, which the tape watches automatically, or a tensor or variable that is explicitly watched with tape.watch (a tf.Variable is itself a special type of tensor, except that its elements can be changed). Moreover, all the tensors involved must be of float type (int and string types will not work).

A useful habit is to check, before running tape.gradient, that each source is a trainable tf.Variable or has been passed to tape.watch.
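
The rules above can be tried on a smaller problem. The following is a minimal sketch; the helper grad_of_square and the tensor names are hypothetical and exist only for this illustration. TensorFlow may log warnings for the integer case.

```python
import tensorflow as tf

def grad_of_square(source, watch=False):
    """Gradient of sum(source**2) w.r.t. source, or None if the tape cannot trace it."""
    with tf.GradientTape() as tape:
        if watch:
            tape.watch(source)
        target = tf.reduce_sum(source**2)
    return tape.gradient(target, source)

trainable_var = tf.Variable([1.0, 2.0])                # watched automatically
frozen_var = tf.Variable([1.0, 2.0], trainable=False)  # not watched automatically
constant = tf.constant([1.0, 2.0])                     # plain tensor
derived = trainable_var + 1.0                          # result of an operation: a plain tensor
int_constant = tf.constant([1, 2])                     # integer dtype

print(grad_of_square(trainable_var))
print(grad_of_square(frozen_var), grad_of_square(frozen_var, watch=True))
print(grad_of_square(constant), grad_of_square(constant, watch=True))
print(grad_of_square(derived), grad_of_square(derived, watch=True))
print(grad_of_square(int_constant, watch=True))
```

Only the trainable variable yields a gradient without tape.watch, and the integer tensor yields None even when it is watched.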

Source as list, dictionary; target as dictionary

Consider Case 1 of tensor_x.

Multiple sources:

The source can also be passed as a list of variables. tape.gradient will calculate the gradient with respect to each of them.

tensor_dloss_dw, tensor_dloss_db = tape.gradient(tensor_loss, [tensor_w, tensor_b])

print('[+] tensor_dloss_dw: ', tensor_dloss_dw)
print('[+] tensor_dloss_db: ', tensor_dloss_db)

Output

[+] tensor_dloss_dw:  tf.Tensor(
[[ 37.497673   9.658234  36.775444]
[ 74.995346  19.316467  73.55089 ]
[112.49302   28.9747   110.32633 ]
[149.99069   38.632935 147.10178 ]], shape=(4, 3), dtype=float32)

[+] tensor_dloss_db:  tf.Tensor([37.497673  9.658234 36.775444], shape=(3,), dtype=float32)
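
These values can be checked by hand. A short derivation, assuming tensor_x is the Case 1 row vector x = [1, 2, 3, 4] and writing y_j for the 3 elements of tensor_y:

```latex
\begin{aligned}
y_j &= \sum_{i=1}^{4} x_i w_{ij} + b_j, \qquad
\mathrm{loss} = \frac{1}{3}\sum_{j=1}^{3} y_j^2, \\
\frac{\partial\,\mathrm{loss}}{\partial y_j} &= \frac{2 y_j}{3}, \\
\frac{\partial\,\mathrm{loss}}{\partial w_{ij}} &= x_i \,\frac{2 y_j}{3}, \qquad
\frac{\partial\,\mathrm{loss}}{\partial b_j} = \frac{2 y_j}{3}.
\end{aligned}
```

So row i of tensor_dloss_dw is x_i times tensor_dloss_db, which agrees with the output: the second row is twice the first, the third row is three times the first, and so on.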

Or a source can even be a dictionary.

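dic_vars = {
    'tensor_w': tensor_w,
    'tensor_b': tensor_b
}

tensor_dloss_dwdb_dic = tape.gradient(tensor_loss, dic_vars) # returns a dictionary with the same keys

print('[+] tensor_dloss_dw_dic: ', tensor_dloss_dwdb_dic['tensor_w'])
print('[+] tensor_dloss_db_dic: ', tensor_dloss_dwdb_dic['tensor_b'])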

Output

[+] tensor_dloss_dw_dic:  tf.Tensor(
[[ 37.497673   9.658234  36.775444]
[ 74.995346  19.316467  73.55089 ]
[112.49302   28.9747   110.32633 ]
[149.99069   38.632935 147.10178 ]], shape=(4, 3), dtype=float32)

[+] tensor_dloss_db_dic:  tf.Tensor([37.497673  9.658234 36.775444], shape=(3,), dtype=float32)

Multiple targets:

tape.gradient also accepts multiple targets. This is not like the case of multiple sources, where tape.gradient calculates a separate gradient with respect to each source. Multiple targets produce a single result.

Here, the gradient of multiple targets = the gradient of the sum of the targets = the sum of the gradients of all targets.

tensor_dydloss_dx = tape.gradient({'tensor_y': tensor_y, 'tensor_loss': tensor_loss}, tensor_x)

print('[+] tensor_dydloss_dx: ', tensor_dydloss_dx)

Output

[+] tensor_dydloss_dx:  tf.Tensor([[-146.5997    25.82772  946.9131   392.71396]], shape=(1, 4), dtype=float32)
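
The summing behaviour is easier to see on a small example. The following is a minimal sketch with hand-checkable values; the names x, y and z are chosen for illustration, and the targets are passed as a list rather than a dictionary.

```python
import tensorflow as tf

x = tf.Variable([1.0, 2.0])

with tf.GradientTape(persistent=True) as tape:
    y = 3.0 * x   # dy/dx = 3 for each element
    z = x**2      # dz/dx = 2 * x

grad_y = tape.gradient(y, x)
grad_z = tape.gradient(z, x)
grad_both = tape.gradient([y, z], x)  # gradient of sum(y) + sum(z)
del tape

print(grad_y.numpy(), grad_z.numpy(), grad_both.numpy())
```

Here grad_both equals grad_y + grad_z element by element, in the same way that tensor_dydloss_dx above is the sum of tensor_dy_dx and tensor_dloss_dx.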

By now, you may have noticed that the gradient tensor has the same shape as the source tensor (in tape.gradient(target, source)). You can verify this with the code below, which takes a few of the gradients as examples.

print(tensor_x.shape) # source
print(tensor_dloss_dx.shape) # gradient
print(tensor_y.shape) # source
print(tensor_dloss_dy.shape) # gradient
print(tensor_w.shape) # source
print(tensor_dloss_dw.shape) # gradient
print(tensor_b.shape) # source
print(tensor_dloss_db.shape) # gradient

Output

(1, 4)
(1, 4)
(1, 3)
(1, 3)
(4, 3)
(4, 3)
(3,)
(3,)

See which variables are watched by tape

The watched variables are listed by tape.watched_variables().

print('[+] watched variables: ', [watched_var.name for watched_var in tape.watched_variables()])

We check this for the 5 cases of tensor_x again, with tape.watch(tensor_x) in use.

Output

Case 1:
[+] watched variables:  ['w:0', 'b:0', 'x:0']

Case 2:
[+] watched variables:  ['w:0', 'b:0', 'x:0']

Case 3:
[+] watched variables:  ['w:0', 'b:0', 'Variable:0'] # tensor_x is not named as "x"

Case 4:
[+] watched variables:  ['w:0', 'b:0']

Case 5:
[+] watched variables:  ['w:0', 'b:0']

In Case 4 and Case 5, tensor_x does not appear in the list because it is a plain tensor rather than a tf.Variable. tape.watch only watches it "locally" in the tf.GradientTape context, and watched_variables() lists variables only.
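
The difference between a watched variable and a watched tensor can be seen in a few lines. The following is a minimal sketch; the names w and c are chosen for illustration.

```python
import tensorflow as tf

w = tf.Variable(3.0, name="w")
c = tf.constant(2.0)

with tf.GradientTape() as tape:
    tape.watch(c)  # watch the plain tensor explicitly
    y = w * c

watched = tape.watched_variables()      # lists variables only
dy_dw, dy_dc = tape.gradient(y, [w, c])

print([v.name for v in watched])
print(float(dy_dw), float(dy_dc))
```

The constant c is missing from watched_variables(), yet the gradient with respect to it is still available because it was watched.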

Checking which operations are stored for the backward propagation

This can also be thought of as checking which variables are trainable. Let's create a simple neural network to check this.

layer1 = tf.keras.layers.Dense(2, activation='relu', name="dense_1")
# The calculation of Dense is: output = activation(dot(input, kernel) + bias)
layer2 = tf.keras.layers.Dense(4, activation='relu', name="dense_2")
# By default, Dense initializes kernel with Glorot uniform and bias with zeros
tensor_x = tf.constant([[1., 2., 3.]])

with tf.GradientTape(persistent=True) as tape:
    # Forward pass
    tensor_h = layer1(tensor_x)
    tensor_y = layer2(tensor_h)
    tensor_loss = tf.reduce_mean(tensor_y**2)

# Calculate gradients with respect to each trainable variable
tensor_dloss_dvars_layer1 = tape.gradient(tensor_loss, layer1.trainable_variables)
tensor_dloss_dvars_layer2 = tape.gradient(tensor_loss, layer2.trainable_variables)

print('[+] The trainable variables of layer1: ', layer1.trainable_variables)
print('[+] The trainable variables of layer2: ', layer2.trainable_variables)

print("Dense 1:")
for var, grad in zip(layer1.trainable_variables, tensor_dloss_dvars_layer1):
    print('[+] variable name, shape: {0}, {1}'.format(var.name, grad.shape)) # The shape of the gradient is the same as the shape of the source

print("Dense 2:")
for var, grad in zip(layer2.trainable_variables, tensor_dloss_dvars_layer2):
    print('[+] variable name, shape: {0}, {1}'.format(var.name, grad.shape)) # The shape of the gradient is the same as the shape of the source

Output

[+] The trainable variables of layer1:  [<tf.Variable 'dense_1/kernel:0' shape=(3, 2) dtype=float32, numpy=
array([[-0.57740456, -0.572319  ],
[ 0.00795567,  0.5992962 ],
[ 0.24269533,  0.8154019 ]], dtype=float32)>, <tf.Variable 'dense_1/bias:0' shape=(2,) dtype=float32, numpy=array([0., 0.], dtype=float32)>]
[+] The trainable variables of layer2:  [<tf.Variable 'dense_2/kernel:0' shape=(2, 4) dtype=float32, numpy=
array([[ 0.46956396, -0.71557474, -0.8732331 , -0.62160015],
[ 0.59634185,  0.40849257, -0.18213058,  0.02854967]],
dtype=float32)>, <tf.Variable 'dense_2/bias:0' shape=(4,) dtype=float32, numpy=array([0., 0., 0., 0.], dtype=float32)>]
Dense 1:
[+] variable name, shape: dense_1/kernel:0, (3, 2)
[+] variable name, shape: dense_1/bias:0, (2,)
Dense 2:
[+] variable name, shape: dense_2/kernel:0, (2, 4)
[+] variable name, shape: dense_2/bias:0, (4,)

Gradient tapes use memory to store all the intermediate results that are required for the backward pass. Some operations, such as ReLU, do not need to keep their intermediate results, so these are pruned during the forward pass.

Control flow: choosing which variable is used with if

We can use if to choose which variable is used to compute a final result. The gradient can then be calculated only with respect to the chosen variable. See the example below for more details.

Consider the code below. Either tensor_x1 or tensor_x2 is used to compute tensor_res_1, depending on the if condition.

tensor_x1 = tf.Variable(3.0)
tensor_x2 = tf.Variable(3.0)

tensor_flag = tf.constant(2.0) # a tensor to use in the if condition; must be float

with tf.GradientTape(persistent=True) as tape:
    tape.watch(tensor_x1)
    tape.watch(tensor_x2)
    tape.watch(tensor_flag)
    if tensor_flag % 2 == 0:
        tensor_res_1 = 4*(tensor_x1**2) + 3
    else:
        tensor_res_1 = tensor_x2**3

tensor_dres1_dx1, tensor_dres1_dx2 = tape.gradient(tensor_res_1, [tensor_x1, tensor_x2])
print('[+] tensor_dres1_dx1: ', tensor_dres1_dx1)
print('[+] tensor_dres1_dx2', tensor_dres1_dx2)

Output

[+] tensor_dres1_dx1:  tf.Tensor(24.0, shape=(), dtype=float32)
[+] tensor_dres1_dx2 None

Because tensor_flag % 2 == 0, tensor_res_1 is assigned 4*(tensor_x1**2) + 3. Therefore, only the gradient with respect to tensor_x1 can be calculated, and the gradient with respect to tensor_x2 is None.
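
If a None result is inconvenient for later code, tape.gradient can be asked to return zeros for unconnected sources instead. The following is a minimal sketch of that option; it is not part of the experiment above, and the names x1, x2 and res are chosen for illustration.

```python
import tensorflow as tf

x1 = tf.Variable(3.0)
x2 = tf.Variable(3.0)

with tf.GradientTape(persistent=True) as tape:
    res = 4 * (x1**2) + 3  # only x1 takes part in the result

# Default behaviour: an unconnected source yields None.
g1, g2 = tape.gradient(res, [x1, x2])

# Optional: request zeros instead of None for unconnected sources.
g1_zero, g2_zero = tape.gradient(
    res, [x1, x2], unconnected_gradients=tf.UnconnectedGradients.ZERO
)
del tape

print(g1, g2)
print(g1_zero, g2_zero)
```

The connected gradient is the same in both calls; only the representation of the unconnected one changes.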

Let's try one more experiment. Here we have one more result, tensor_res_2, whose assignment depends on our previous work with tensor_res_1. If tensor_res_1 was assigned the function 4*(tensor_x1**2) + 3, then tensor_res_2 is assigned the other function, and vice versa.

with tf.GradientTape(persistent=True) as tape2:
    tape2.watch(tensor_x1)
    tape2.watch(tensor_x2)

    if tf.math.equal(tensor_res_1, 4*(tensor_x1**2) + 3).numpy().all():
        tensor_res_2 = tensor_x2**3
    else:
        tensor_res_2 = 4*(tensor_x1**2) + 3

tensor_dres2_dx1, tensor_dres2_dx2 = tape2.gradient(tensor_res_2, [tensor_x1, tensor_x2]) # Use tape2.gradient() here, not tape.gradient(); otherwise the gradients will not be as expected
print('[+] tensor_dres2_dx1: ', tensor_dres2_dx1)
print('[+] tensor_dres2_dx2', tensor_dres2_dx2)

Output

[+] tensor_dres2_dx1:  None
[+] tensor_dres2_dx2 tf.Tensor(27.0, shape=(), dtype=float32)

In this second experiment, we need another name for the tape (tape2) so that it can be distinguished from the one above and used as a completely separate gradient context.

tf.math.equal checks element-wise whether tensor_res_1 and 4*(tensor_x1**2) + 3 are equal. It returns a tensor of boolean values, each of which is the comparison result for one pair of elements. We then use all() to check whether all the boolean values are True.

Let's also check what happens if we use tape.gradient (not tape2.gradient) for tensor_res_2.

tensor_dres2_dx1_tmp, tensor_dres2_dx2_tmp = tape.gradient(tensor_res_2, [tensor_x1, tensor_x2])

print('[+] tensor_dres2_dx1_tmp: ', tensor_dres2_dx1_tmp) # None, because tensor_res_2 was not computed in the context of tape
print('[+] tensor_dres2_dx2_tmp: ', tensor_dres2_dx2_tmp) # None, for the same reason

Output

[+] tensor_dres2_dx1_tmp:  None
[+] tensor_dres2_dx2_tmp: None

Both printed values are None.

• The first None: is that because tensor_x1 is not used to compute tensor_res_2? Not exactly. It is because tensor_res_2 does not exist in the context of tape; it was computed in the context of tape2.
• The second None: the same reason as the first None.

Plot a function and its gradient

To plot a function, we need many values of x. We reuse the code from the previous section with one change: instead of assigning a single value to each of tensor_x1 and tensor_x2, we now assign each of them a range of values using tf.linspace.

tensor_x1 = tf.linspace(-15.0, 15.0, 150+1)
tensor_x2 = tf.linspace(-15.0, 15.0, 150+1)

For each value of tensor_x1 (or tensor_x2), there is a corresponding function value and gradient value. With these values, we can now plot the functions.
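
Before plotting, it can help to confirm that the tape returns one gradient value per grid point. The following is a minimal sketch for the function 4*x**2 + 3 only; the names x, res and max_err are chosen for illustration, and the analytic derivative 8*x is used as the reference.

```python
import numpy as np
import tensorflow as tf

x = tf.linspace(-15.0, 15.0, 151)  # a plain tensor, so it must be watched

with tf.GradientTape() as tape:
    tape.watch(x)
    res = 4 * (x**2) + 3

dres_dx = tape.gradient(res, x)  # one derivative value per grid point

# Compare with the analytic derivative 8 * x.
max_err = np.max(np.abs(dres_dx.numpy() - 8.0 * x.numpy()))
print(dres_dx.shape, max_err)
```

Each element of res depends only on the matching element of x, so the gradient of the summed target is exactly the element-wise derivative.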

Plot the function tensor_res_1 and its gradient with respect to x1, tensor_dres1_dx1:

plt.plot(tensor_x1, tensor_res_1, label='res1')
plt.plot(tensor_x1, tensor_dres1_dx1, label='dres1')
plt.legend()

Plot the function tensor_res_2 and its gradient with respect to x2, tensor_dres2_dx2:

plt.plot(tensor_x2, tensor_res_2, label='res2')
plt.plot(tensor_x2, tensor_dres2_dx2, label='dres2')
plt.legend()

Some more things to notice

According to the TensorFlow autodiff guide, there are 2 more essential things to notice:

• Some TensorFlow operations (tf.Operation) are registered as being non-differentiable, and others have no gradient registered (they are differentiable, but no gradient has been registered for them). In the latter case, if you need to differentiate through this type of operation, there are 2 options: