Normalizing fastai Tabular Inputs
Exploring the effectiveness of normalization on tabular data
Normalizing data is an important aspect of data science that seemed like magic to me for the longest time. I could train a model for 10 epochs on un-normalized data and make little or no progress towards an effective model, then restart and run the same data through my model with normalization applied and have a perfect model. Why was this the case? What was happening that caused such a big difference? Let's explore that question.
import numpy as np
import matplotlib.pyplot as plt
Normalization is the process of rescaling a set of numbers so that the full set has a mean of 0 and a standard deviation of 1. My guess is that for most people reading this, those numbers and terms don't mean a lot, so let's dig in a bit deeper. Normalization puts all of our values on a common scale. Shifting the numbers so that their mean is 0 centers them around zero: the values above zero exactly balance out the values below it. Let's look at a real example.
Let's assume we have a set of numbers: [1, 5, 10, 25, 100, 1000].
example_1 = np.array([1, 5, 10, 25, 100, 1000]); example_1
How would we transform these numbers to have a mean of 0? Let's start by finding the current mean of these numbers.
example_1.mean()
190.16666 is the mean of this set of numbers. To make the mean 0, we should be able to just subtract that amount from each number.
example_1_mean_0 = example_1-example_1.mean(); example_1_mean_0
f"The mean is now: {example_1_mean_0.mean()} (AKA: 0)"
Instead of being distributed around the number 190.1666, the numbers are now distributed around the number 0. With the numbers now balanced correctly, we want to remove scale and units from our dataset. statista.com has the following example to explain standard deviation:
1,000 people were questioned about their monthly phone bill. The mean value is \$40 and the standard deviation \$27. Which means that the average distance of all answers (=values) to the mean is \$27.
Let's consider what it means, then, to divide by the standard deviation. For this example, we would divide by 27 so that the standard deviation becomes 1, and that is going to put us onto the same scale whether we started with a huge range like a monthly mortgage amount or a small one like a monthly phone bill. It strips the original units away from the dataset and puts it onto a scale where 1 is the average distance from the center point. This is helpful because it means that a step in any direction will change the loss by a comparable amount. Another way of putting this is that it smooths the loss landscape so all attributes have the same amount of pull.
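To make the quote concrete, here is a tiny sketch (the \$67 bill is a made-up value, not from the statista example) showing that dividing by the standard deviation turns "dollars above the mean" into "standard deviations above the mean":

```python
# Assumed values from the statista example: mean $40, standard deviation $27
mean_bill, std_bill = 40, 27

# A hypothetical $67 bill is $27 above the mean...
bill = 67
z_score = (bill - mean_bill) / std_bill

# ...which is exactly 1 standard deviation once we divide by 27
print(z_score)  # 1.0
```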
example_1_mean_0_std_1 = example_1_mean_0/example_1_mean_0.std(); example_1_mean_0_std_1
example_1_mean_0_std_1.mean(), example_1_mean_0_std_1.std()
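The three stages (original, mean 0, and mean 0 with std 1) can be plotted side by side with a quick matplotlib sketch to see how the shape of the distribution is preserved:

```python
import numpy as np
import matplotlib.pyplot as plt

example_1 = np.array([1, 5, 10, 25, 100, 1000])
stages = {
    "original": example_1,
    "mean 0": example_1 - example_1.mean(),
    "mean 0, std 1": (example_1 - example_1.mean()) / example_1.std(),
}

fig, axes = plt.subplots(1, 3, figsize=(12, 3))
for ax, (title, values) in zip(axes, stages.items()):
    # Plot each value on a number line so the shape of the distribution is visible
    ax.scatter(values, np.zeros_like(values, dtype=float))
    ax.set_title(title)
plt.show()
```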
Notice that in all three graphs, the distribution looks the same: lots of numbers on the left-hand side and only the one number on the right. But to our neural network, there is a huge difference between taking numbers ranging from -0.52 to 2.23 and taking numbers ranging from 1 to 1000. The weights can make much finer adjustments on the normalized version, and each incoming data point will be on the same scale. That's why, when an image model has been pretrained on ImageNet, you have to keep using ImageNet's mean and standard deviation: by normalizing with those numbers, you are converting all of the pixels in your image to the range the pretrained weights expect.
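The standard ImageNet channel statistics are roughly mean (0.485, 0.456, 0.406) and std (0.229, 0.224, 0.225). Normalizing a new image with them is the same subtract-and-divide we did above, just applied per channel (a plain-tensor sketch, not fastai's actual transform):

```python
import torch

# Standard ImageNet channel statistics, reshaped to broadcast over (C, H, W)
imagenet_mean = torch.tensor([0.485, 0.456, 0.406]).view(3, 1, 1)
imagenet_std = torch.tensor([0.229, 0.224, 0.225]).view(3, 1, 1)

img = torch.rand(3, 224, 224)  # a fake RGB image with pixel values in [0, 1]
img_normalized = (img - imagenet_mean) / imagenet_std
```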
Now that we have an idea of what is happening, let's see if we can prove that it actually works the way we anticipate. To do this, I will create a simple tabular learner and a sample dataloader, first train the model on normalized input, and then train the same model using the un-normalized input. My assumption is that the model with normalized inputs will be much better at mapping the x values to y than the un-normalized version.
y = scale_1*x1 + scale_2*x2 + ... + scale_10*x10
from fastai.tabular.all import *
x = torch.rand((200000,10))
scale_amt = torch.randint(0,100,size=(10,))
df = pd.DataFrame(x, columns=[f"x{n}" for n in range(1,11)])
df['y'] = (df.values*scale_amt.tolist()).sum(axis=1)
splits = RandomSplitter()(df)
df.head()
monitor_parameters is a hook that can be used to monitor values inside of a model. It is a little bit of a hack, but it works really well for getting a better idea of what is happening inside the model.
def monitor_parameters(m, i, o):
    # Snapshot the layer's weights and bias after each backward pass
    m.weight_track.append(list(m.parameters())[0].tolist()[0])
    m.bias_track.append(list(m.parameters())[1].tolist())
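To see the hook in action outside of fastai, here is a minimal sketch that attaches the same function to a bare nn.Linear and runs one backward pass (the layer shape and data here are made up for illustration):

```python
import torch
import torch.nn as nn

def monitor_parameters(m, i, o):
    # Runs after each backward pass; snapshots the layer's weights and bias
    m.weight_track.append(list(m.parameters())[0].tolist()[0])
    m.bias_track.append(list(m.parameters())[1].tolist())

lin = nn.Linear(3, 1)
lin.weight_track, lin.bias_track = [], []
lin.register_full_backward_hook(monitor_parameters)

lin(torch.rand(4, 3)).sum().backward()
print(len(lin.weight_track))  # prints 1: one snapshot per backward pass
```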
to_normalized = TabularPandas(df, cont_names=['x1', 'x2', 'x3', 'x4', 'x5', 'x6', 'x7', 'x8', 'x9', 'x10'], y_names='y', procs=[Normalize], splits=splits)
dls_normalized = to_normalized.dataloaders(verbose=True, shuffle=False)
learn_normalized = tabular_learner(dls_normalized, layers=[], config=tabular_config(use_bn=False, bn_cont=False))
learn_normalized.lr_find(start_lr=1e-3, end_lr=1000000)
learn_normalized.model.layers[0][0].bias_track = []
learn_normalized.model.layers[0][0].weight_track = []
learn_normalized.model.layers[0][0].register_full_backward_hook(monitor_parameters)
learn_normalized.fit_one_cycle(10, 10)
learn_normalized.recorder.plot_loss(skip_start=1000)
learn_normalized.show_results(ds_idx=0, shuffle=False)
plt.plot(learn_normalized.model.layers[0][0].bias_track)
plt.show()
plt.plot(learn_normalized.model.layers[0][0].weight_track)
plt.show()
to_not_normalized = TabularPandas(df, cont_names=['x1', 'x2', 'x3', 'x4', 'x5', 'x6', 'x7', 'x8', 'x9', 'x10'], y_names=['y'], splits=splits)
dls_not_normalized = to_not_normalized.dataloaders(verbose=True, shuffle=False)
dls_not_normalized.one_batch()
learn_not_normalized = tabular_learner(dls_not_normalized, layers=[], config=tabular_config(use_bn=False, bn_cont=False), train_bn=False)
learn_not_normalized.lr_find(start_lr=1e-3, end_lr=1000000)
learn_not_normalized.model.layers[0][0].bias_track = []
learn_not_normalized.model.layers[0][0].weight_track = []
learn_not_normalized.model.layers[0][0].register_full_backward_hook(monitor_parameters)
learn_not_normalized.fit_one_cycle(10, 10)
learn_not_normalized.recorder.plot_loss(skip_start=500)
learn_not_normalized.show_results(ds_idx=0, shuffle=False)
plt.plot(learn_not_normalized.model.layers[0][0].bias_track)
plt.show()
plt.plot(learn_not_normalized.model.layers[0][0].weight_track)
plt.show()
scale_amt
list(learn_not_normalized.model.layers[0][0].parameters())
list(learn_normalized.model.layers[0][0].parameters())
The results ended up matching our theory, but in a slightly different way than I had anticipated. You can see that the non-normalized weights line up much more closely with scale_amt, and after thinking about the reason for that, I think I understand. When the input values are normalized, all of the columns are on the same scale, so the easiest way for gradient descent to reduce the loss is often to just move the bias. But in the non-normalized version, adjusting the bias affects columns that are moving at very different rates, so moving the bias is much less likely to be the best step for the model. Instead, the individual weights are adjusted more during the training process.
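We can sanity-check that explanation outside of fastai with a closed-form least-squares fit in plain numpy (a sketch with made-up scales, not the model above): on raw inputs, the solution recovers the scales themselves, while on normalized inputs each weight absorbs that column's standard deviation, leaving every weight at a similar magnitude.

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.random((1000, 3))
scale = np.array([3.0, 50.0, 700.0])  # made-up per-column scales
y = x @ scale

# Least squares on the raw inputs recovers the scales directly
w_raw, *_ = np.linalg.lstsq(x, y, rcond=None)

# On normalized inputs (plus a bias column), each weight is the original
# scale multiplied by that column's std, so the weights shrink to a common magnitude
x_norm = (x - x.mean(axis=0)) / x.std(axis=0)
w_norm, *_ = np.linalg.lstsq(np.c_[x_norm, np.ones(len(x))], y, rcond=None)

print(np.round(w_raw, 3))   # ≈ [3., 50., 700.]
print(np.round(w_norm[:3], 1))
```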