import numpy as np
Positional embedding
Positional embedding encodes the position of a token in a sentence using sinusoidal functions.
\(PE_{[k, 2i]} = sin(\frac{k}{n^\frac{2i}{d}})\)
\(PE_{[k, 2i + 1]} = cos(\frac{k}{n^\frac{2i}{d}})\)
k is the position of the token in the sentence, n is a constant, and d is the dimension of the positional embedding. i indexes the sine-cosine pairs in the d-dimensional vector and ranges from 0 to \(\frac{d}{2} - 1\).
Each position is therefore represented by a d-dimensional vector built from sine-cosine pairs, so the vector contains d/2 such pairs.
\(PE_k\) = \(\begin{bmatrix}sin(k)\\ cos(k) \\ sin(\frac{k}{n^\frac{2}{d}})\\ cos(\frac{k}{n^\frac{2}{d}}) \\. \\. \\. \\ sin(\frac{k}{n^\frac{d-2}{d}})\\ cos(\frac{k}{n^\frac{d-2}{d}}) \end{bmatrix}\)
The sine function fills the even indices of the d-dimensional vector, while the cosine function fills the odd indices.
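To make the layout concrete, here is a small worked example. The values d = 4, n = 100 and k = 1 are assumptions picked only so the numbers are easy to check by hand; they are not the settings used elsewhere in this post.

```python
import numpy as np

# Illustrative values only (assumptions, not the settings used in this post)
d, n, k = 4, 100, 1

pe = np.zeros(d)
for i in range(d // 2):
    angle = k / n ** (2 * i / d)   # i = 0 -> 1/1, i = 1 -> 1/100**0.5 = 1/10
    pe[2 * i] = np.sin(angle)      # even index holds the sine
    pe[2 * i + 1] = np.cos(angle)  # odd index holds the cosine

# Approximately sin(1), cos(1), sin(0.1), cos(0.1)
# = 0.8415, 0.5403, 0.0998, 0.9950
print(np.round(pe, 4))
```

The second pair changes ten times more slowly with k than the first, which is the pattern the rest of the post builds on.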
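positions = np.arange(100)  # token positions 0..99 (not used below)

def encoding(idx, d, n=1000000):
    """Return the d-dimensional sinusoidal embedding for position idx."""
    embedding_values = []
    for i in range(d // 2):
        # Pair i: sine at index 2i, cosine at index 2i + 1
        embedding_values.append(np.sin(idx / (n ** (2 * i / d))))
        embedding_values.append(np.cos(idx / (n ** (2 * i / d))))
    return np.array(embedding_values)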
def final():
    # Stack the 512-dimensional embeddings of positions 0..99 into a (100, 512) array
    final_embedding = []
    for i in range(100):
        final_embedding.append(encoding(i, 512))
    return np.array(final_embedding)

x = final()
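The same matrix can also be built without Python loops. The following is a minimal sketch using NumPy broadcasting; `positional_encoding` is a hypothetical helper name introduced here for illustration, and it assumes d is even and reuses n = 1000000 from the loop-based code above.

```python
import numpy as np

def positional_encoding(num_positions, d, n=1_000_000):
    """Hypothetical vectorised helper; assumes d is even."""
    k = np.arange(num_positions)[:, None]   # shape (num_positions, 1)
    i = np.arange(d // 2)[None, :]          # shape (1, d/2)
    angles = k / n ** (2 * i / d)           # shape (num_positions, d/2)
    pe = np.empty((num_positions, d))
    pe[:, 0::2] = np.sin(angles)            # even indices: sine
    pe[:, 1::2] = np.cos(angles)            # odd indices: cosine
    return pe

pe = positional_encoding(100, 512)
print(pe.shape)  # (100, 512)
```

Row k of the result is \(PE_k\), so the array can be plotted in the same way as the output of the loop-based version.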
import matplotlib.pyplot as plt

# Rows are token positions, columns are embedding dimensions
cax = plt.matshow(x)
plt.gcf().colorbar(cax)
plt.xlabel('Embedding Dimension')
plt.ylabel('Index of tokens')
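import matplotlib.pyplot as plt

# Transposed view: rows are embedding dimensions, columns are token positions
fig, ax = plt.subplots(figsize=(10, 20))
cax = ax.matshow(x.T)
ax.set_xlabel('Index of tokens')
ax.set_ylabel('Embedding Dimension')
ax.set_ylim(200, 0)  # show only the first 200 embedding dimensions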
\(PE_0\) = \(\begin{bmatrix}0\\ 1 \\ 0\\ 1 \\. \\. \\. \\ 0 \\ 1 \end{bmatrix}\)
\(PE_1\) = \(\begin{bmatrix}sin(1)\\ cos(1) \\ sin(\frac{1}{n^\frac{2}{d}})\\ cos(\frac{1}{n^\frac{2}{d}}) \\. \\. \\. \\ sin(\frac{1}{n^\frac{d-2}{d}})\\ cos(\frac{1}{n^\frac{d-2}{d}}) \end{bmatrix}\)
\(PE_2\) = \(\begin{bmatrix}sin(2)\\ cos(2) \\ sin(\frac{2}{n^\frac{2}{d}})\\ cos(\frac{2}{n^\frac{2}{d}}) \\. \\. \\. \\ sin(\frac{2}{n^\frac{d-2}{d}})\\ cos(\frac{2}{n^\frac{d-2}{d}}) \end{bmatrix}\)
\(\vdots\)
\(PE_{512}\) = \(\begin{bmatrix}sin(512)\\ cos(512) \\ sin(\frac{512}{n^\frac{2}{d}})\\ cos(\frac{512}{n^\frac{2}{d}}) \\. \\. \\. \\ sin(\frac{512}{n^\frac{d-2}{d}})\\ cos(\frac{512}{n^\frac{d-2}{d}}) \end{bmatrix}\)
For all \(PE_{k}\),
index i = 0: the sine and cosine both have angular frequency 1 and wavelength \(2\pi\)
index i = 1: the sine and cosine both have angular frequency \(\frac{1}{n^\frac{2}{d}}\) and wavelength \(2\pi{n^\frac{2}{d}}\)
index i = d/2 - 1: the sine and cosine both have angular frequency \(\frac{1}{n^\frac{d-2}{d}}\) and wavelength \(2\pi{n^\frac{d-2}{d}}\), which approach \(\frac{1}{n}\) and \(2\pi{n}\) for large d
This shows that the frequency of the sine and cosine functions decreases as the index increases along the d-dimensional vector. In the plots, embedding index 0 varies strongly from token to token (fast changes in colour), while the highest embedding indices barely vary at all over the 100 tokens shown (no visible change in colour).
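The wavelengths can be computed directly from the formula \(2\pi n^{\frac{2i}{d}}\). The sketch below assumes d = 512 and n = 1000000, the values used in the code earlier in this post, and prints the wavelength of the first, second and last sine-cosine pair.

```python
import numpy as np

d, n = 512, 1_000_000                       # same d and n as the code earlier in this post
i = np.arange(d // 2)                       # pair index 0 .. d/2 - 1
wavelength = 2 * np.pi * n ** (2 * i / d)   # positions per full cycle for pair i

for idx in (0, 1, d // 2 - 1):
    print(idx, wavelength[idx])

# Pairs whose wavelength is far longer than the 100 plotted tokens
# complete only a tiny fraction of a cycle, so they look constant in the plot.
print((wavelength > 100).sum(), "of", d // 2, "pairs have wavelength > 100")
```

The wavelengths form a geometric progression, starting at \(2\pi\) and growing by a factor of \(n^{\frac{2}{d}}\) from one pair to the next.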
def binary_representation(x):
    rows = []
    for i in x:
        tmp = list(np.binary_repr(i, width=10))  # 10-bit binary string, most significant bit first
        tmp = np.array(tmp)
        tmp = tmp[::-1]                          # reverse so index 0 is the least significant bit
        rows.append(tmp)
    return np.array(rows)

x = binary_representation(range(100))
x = x.astype(float)
cax = plt.matshow(x.T)
plt.gcf().colorbar(cax)
plt.xlabel('Index of tokens')
plt.ylabel('Embedding dimension')
From both plots (sinusoidal and binary), we can see that index 0 of the embedding dimension has the highest frequency, and that the frequency drops as the index grows. This shows that sinusoidal functions can encode position in a way similar to a binary representation.
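The binary side of this comparison can be quantified by counting how often each bit changes between consecutive positions. The following is a minimal sketch that rebuilds the 10-bit representation of positions 0 to 99 with bit shifts rather than `np.binary_repr`; that swap is only a convenience chosen here, and the resulting bits are the same.

```python
import numpy as np

positions = np.arange(100)
# Column j holds bit j of each position (least significant bit first)
bits = (positions[:, None] >> np.arange(10)) & 1   # shape (100, 10)

# Number of changes of each bit between consecutive positions
flips = np.abs(np.diff(bits, axis=0)).sum(axis=0)
print(flips)
```

Bit 0 changes at every step (99 times over 100 positions) and each higher bit changes roughly half as often as the one before it, which mirrors the falling frequencies of the sinusoidal embedding.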
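We prefer the sinusoidal representation because a shift in position acts on it as a linear transformation: for a fixed offset \(\theta\) there is a matrix W, independent of t, such that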
\(W * \begin{bmatrix} sin(t) \\ cos(t) \end{bmatrix} = \begin{bmatrix} sin(t + \theta) \\ cos(t + \theta) \end{bmatrix}\)
\(W * \begin{bmatrix} sin(t) \\ cos(t) \end{bmatrix} = \begin{bmatrix} sin(t)cos(\theta) + cos(t)sin(\theta) \\ cos(t)cos(\theta) - sin(t)sin(\theta) \end{bmatrix}\)
\(\begin{bmatrix} cos(\theta) & sin(\theta) \\ - sin(\theta) & cos(\theta)\end{bmatrix} * \begin{bmatrix} sin(t) \\ cos(t) \end{bmatrix} = \begin{bmatrix} sin(t)cos(\theta) + cos(t)sin(\theta) \\ cos(t)cos(\theta) - sin(t)sin(\theta) \end{bmatrix}\)
\[ W = \begin{bmatrix} cos(\theta) & sin(\theta) \\ - sin(\theta) & cos(\theta)\end{bmatrix}\]
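where \(\theta\) is a constant offset, so W does not depend on t. In the full embedding, the same argument applies to every sine-cosine pair, with pair i rotated by the angle \(\frac{\theta}{n^\frac{2i}{d}}\).
Hence, once \(PE_{t}\) is known, \(PE_{t + \theta}\) can be obtained from it by a linear transformation.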
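The identity can be checked numerically. The sketch below uses arbitrary illustrative values for t and \(\theta\) (assumptions, not values from this post) and verifies that multiplying by W shifts the sine-cosine pair by \(\theta\).

```python
import numpy as np

t, theta = 3.0, 5.0   # arbitrary illustrative values

# Matrix W from the derivation above; it depends only on theta
W = np.array([[np.cos(theta), np.sin(theta)],
              [-np.sin(theta), np.cos(theta)]])

shifted = W @ np.array([np.sin(t), np.cos(t)])
expected = np.array([np.sin(t + theta), np.cos(t + theta)])

print(np.allclose(shifted, expected))  # True
```

Because W is built from \(\theta\) alone, the same matrix works for every starting position t.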