LLaMA in R with Keras and TensorFlow



OpenAI’s ChatGPT has awakened a collective awareness of what Large
Language Models (LLMs) are capable of. With that awakening comes a daily
march of LLM news: new products, new features, new models, new
capabilities (and new worries). It seems we’re in the early stages of a
Cambrian explosion of LLMs and LLM-powered tools; it’s not yet clear how
LLMs will impact and influence our professional and personal lives, but
it seems clear that they will, in some way.

Since LLMs are here to stay, it’s worthwhile to take some time to
understand how these models work from a first-principles perspective.
Starting with the mechanics can help foster durable intuitions that will
inform our usage of these models now and in the future. (Especially if
the future is one where LLMs are a staple of the data scientist’s
toolbox, as common as an lm() function call).

And what better way is there to learn than by doing? So with that
preamble, in this post we’ll walk through an implementation of an LLM,
LLaMA (Touvron et al. 2023) specifically, in TensorFlow and Keras, with
the goal being to develop understanding first, capability second.

Why LLaMA? With the sheer volume of LLM-related content and news out
there, it can seem daunting to know where to get started. Almost weekly
it seems there is a new model announced. Browsing some hubs of LLM
activity (HuggingFace, TFHub, reddit, HackerNews) muddies the waters
even more. How to pick a specific model?

Of the many LLM-related news items in the past months, one that stands
head-and-shoulders above the crowd is the release of LLaMA, a modern,
foundational LLM made available to the public by Meta AI in February
2023. On common benchmarks, LLaMA outperforms OpenAI’s GPT-3, while
being substantially smaller (though still large).

LLaMA is a great starting place because it is a simple and modern
architecture, has excellent performance on benchmarks, and is open. The
model architecture has had just a few new ideas incorporated into it since
the original Transformer architecture first described in “Attention Is
All You Need”, published by Google (Vaswani et al. 2017). Four different
sizes of LLaMA have been released: 7 billion and 13 billion parameter
models trained on 1 trillion tokens, and 33 billion and 65 billion
parameter models trained on 1.4 trillion tokens. That is an enormous
amount of training data: the largest, 65B, model has been trained on
approximately the “Chinchilla compute-optimum” (Hoffmann et al. 2022)
number of tokens, while the smaller LLaMAs are trained substantially
beyond that optimum. In this blog post we’ll focus on the smallest, 7B
parameter LLaMA model, which you can comfortably load locally and run on
CPU with only 64GB of RAM.
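To put rough numbers on those claims, here is a small back-of-the-envelope sketch in base R. Two things in it are assumptions rather than facts from this post: the "about 20 training tokens per parameter" rule of thumb commonly read out of Hoffmann et al. (2022), and the bytes-per-parameter figures used for the memory estimate (4 for float32, 2 for float16). The parameter counts are the rounded nominal sizes.

```r
# Nominal model sizes and training-set sizes quoted in the text.
llama <- data.frame(
  model  = c("7B", "13B", "33B", "65B"),
  params = c(7e9, 13e9, 33e9, 65e9),
  tokens = c(1.0e12, 1.0e12, 1.4e12, 1.4e12)
)

# Training tokens seen per parameter.
llama$tokens_per_param <- llama$tokens / llama$params

# ASSUMPTION: ~20 tokens per parameter as a rough "compute-optimal" ratio.
chinchilla_ratio <- 20
llama$x_chinchilla <- llama$tokens_per_param / chinchilla_ratio

# ASSUMPTION: weights stored as float32 (4 bytes) or float16 (2 bytes).
llama$gb_float32 <- llama$params * 4 / 1e9
llama$gb_float16 <- llama$params * 2 / 1e9

print(llama, digits = 3)
```

Under these assumptions the 65B model lands at roughly 21.5 tokens per parameter, close to the rule of thumb, while the 7B model sees about 143. The 7B weights alone come to about 28 GB in float32, which is why a 64GB machine leaves comfortable headroom.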

While not strictly necessary, to follow along locally you’ll probably
want to acquire the pre-trained LLaMA weights one way or another. Note
that the weights come with their own license, which you can preview
here.

So, without further ado, let’s get started.

Setup

First, we’ll want to install the required R and Python packages, and
configure a virtual environment:

LLaMA uses the SentencePiece tokenizer from Google. SentencePiece is
available as a TensorFlow graph operation through
tf_text.SentencepieceTokenizer, and also as a Keras layer in
keras_nlp.tokenizers.SentencepieceTokenizer. By choice of a coin flip,
we’ll use the lower-level tf_text interface.

A residual connection like x <- x + ... is a common pattern that helps
with model training, and especially with the vanishing gradient problem.
It’s a skip-connection in the otherwise linear sequence of matrix
transformations. It reinjects information (during the forward pass), and
gradients (during back propagation), back into the trunk. You can think
of these residual connections as freeing the learnable layers in-between
(the ... in the pseudo code) from the burden of having to
“pass-through” or “preserve” information in x, allowing the weights to
instead focus on learning transformations that are (in corporatese
vernacular) value-adding.
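A toy illustration of that idea in base R. The two "layers" here are hypothetical stand-ins, not anything from LLaMA: one that has learned nothing at all, and one that contributes a small adjustment.

```r
# A residual (skip) connection: the layer's output is added back onto its input.
residual <- function(x, f) x + f(x)

# Hypothetical stand-in for a layer that has learned "nothing": returns zeros.
zero_layer <- function(x) x * 0

# Hypothetical stand-in for a layer that contributes a small update.
small_update <- function(x) 0.1 * tanh(x)

x <- c(1, -2, 3)

residual(x, zero_layer)    # x passes through unchanged
residual(x, small_update)  # x plus a small adjustment
```

Even when the layer in the middle contributes nothing, the information in x survives intact, so the layer only ever has to learn what to add.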

The next composition pattern to note is the repeating usage of a
normalization layer:
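For intuition about what such a normalization layer computes, here is a minimal base R sketch of root-mean-square normalization over the feature axis. It is a sketch, not the Keras layer used in this post: the function name rms_norm, the eps value, and the all-ones default for the learned scale w are assumptions made for illustration.

```r
# RMS normalization along the last (feature) axis of a matrix.
# x:   [seq_len, n_features] matrix of activations
# w:   length n_features vector of learned scales (ASSUMED all 1 here)
# eps: small constant for numerical stability (value is an ASSUMPTION)
rms_norm <- function(x, w = rep(1, ncol(x)), eps = 1e-6) {
  # reciprocal root-mean-square of each row
  rrms <- 1 / sqrt(rowMeans(x^2) + eps)
  # scale each row by its rrms, then each column by its learned weight
  sweep(x * rrms, 2, w, `*`)
}

x <- matrix(c( 1,  2,  3,  4,
              10, 20, 30, 40), nrow = 2, byrow = TRUE)
y <- rms_norm(x)

sqrt(rowMeans(y^2))  # each row now has a root-mean-square close to 1
```

The two input rows differ only by a factor of 10, and after normalization they are essentially identical: the layer discards overall magnitude and keeps the relative pattern across features.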

The paper by Shazeer (2020) on SwiGLU and other variations on GLU is an
exemplar of the types of explorations and improvements around the
Transformer architecture since its initial publication in 2017; a steady
accretion of enhancements that has brought us to today. The
FeedForward$call() is just a single SwiGLU followed by a linear
projection. In its essence, it’s a clever composition of three (learned)
linear projections, an element-wise multiplication, and a silu()
activation function.
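Written out in base R with plain matrices, that composition might look like the sketch below. The weight names w1, w2, and w3, the toy sizes, and the random initialization are assumptions for illustration; this is not the Keras layer itself.

```r
# silu(x) = x * sigmoid(x)
silu <- function(x) x / (1 + exp(-x))

# x:      [seq_len, n_embd]
# w1, w3: [n_embd, n_hidden]  two parallel projections (gate and value)
# w2:     [n_hidden, n_embd]  final linear projection back to n_embd
feed_forward <- function(x, w1, w2, w3) {
  swiglu <- silu(x %*% w1) * (x %*% w3)  # element-wise multiplication
  swiglu %*% w2
}

set.seed(1)
seq_len <- 3; n_embd <- 4; n_hidden <- 8   # toy sizes (ASSUMED)
x  <- matrix(rnorm(seq_len * n_embd),  seq_len, n_embd)
w1 <- matrix(rnorm(n_embd * n_hidden), n_embd,  n_hidden)
w3 <- matrix(rnorm(n_embd * n_hidden), n_embd,  n_hidden)
w2 <- matrix(rnorm(n_hidden * n_embd), n_hidden, n_embd)

y <- feed_forward(x, w1, w2, w3)
dim(y)  # same shape as x: seq_len x n_embd
```

Note how the three projections show up as three matrix multiplications, with the silu() applied to only one branch before the two are multiplied together.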

Perhaps the most surprising observation to make here is the relative
dearth of activation functions, or even non-linearities, not just in
FeedForward, but overall. The silu() in this feedforward, the
reciprocal-root-mean-square in RMSNorm(), and a softmax() in
Attention() are the only non-linear transformations in the whole
sequence of TransformerBlocks. Everything else is a linear
transformation!

Attention

Finally, let’s turn our attention to Attention().

Attention in LLaMA is similar, but not identical, to the attention
described in the original Transformers paper (and available as a Keras
builtin under keras$layers$MultiHeadAttention()). The core novelty is
the addition of the apply_rotary_embedding() function, which we’ll
describe shortly. The additional novelty is balanced by the simplicity
from the fact that the layer is performing self-attention: we don’t need
to pass in different query, key, and value tensors (or reason about what
that means), since the same input serves all three roles. Note that the
conventional MultiHeadAttention() layer is covered quite thoroughly in
the 2nd Edition of Deep Learning with R, including a full implementation
of attention in base R.

To develop an understanding of the mechanics in a layer like this, it’s
helpful to temporarily unsee some of the minutiae that can act as a fog
obscuring the essence of the operation. In this instance, if we
temporarily strip out the transpose()s and reshape()s (as clever and
vital as they are), this is what’s left:
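In the same spirit, here is a single-head self-attention sketch in base R, where plain matrices make the transposes and reshapes unnecessary. It is a simplification under stated assumptions rather than the layer from this post: one head only, random toy weights, a causal mask (standard for decoder-style models like LLaMA, though not discussed above), and the rotary embedding step left as a comment.

```r
# Row-wise softmax, subtracting each row's max for numerical stability.
softmax_rows <- function(m) {
  e <- exp(m - apply(m, 1, max))
  e / rowSums(e)
}

# x: [seq_len, n_embd]; wq, wk, wv, wo: [n_embd, n_embd] (single head, ASSUMED)
self_attention <- function(x, wq, wk, wv, wo) {
  # the same input serves as the source of queries, keys, and values
  q <- x %*% wq
  k <- x %*% wk
  v <- x %*% wv
  # (a rotary embedding would be applied to q and k at this point)
  scores <- (q %*% t(k)) / sqrt(ncol(q))
  # causal mask: a token attends only to itself and earlier tokens
  scores[upper.tri(scores)] <- -Inf
  weights <- softmax_rows(scores)
  (weights %*% v) %*% wo
}

set.seed(1)
seq_len <- 4; n_embd <- 6   # toy sizes (ASSUMED)
x  <- matrix(rnorm(seq_len * n_embd), seq_len, n_embd)
wq <- matrix(rnorm(n_embd^2), n_embd)
wk <- matrix(rnorm(n_embd^2), n_embd)
wv <- matrix(rnorm(n_embd^2), n_embd)
wo <- matrix(rnorm(n_embd^2), n_embd)

out <- self_attention(x, wq, wk, wv, wo)
dim(out)  # seq_len x n_embd
```

Because of the mask, the first token can only attend to itself, so its output row is just its own value vector passed through wo.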

Rotary position embeddings were introduced by Su et al. (2022) in the
paper titled “RoFormer: Enhanced Transformer with Rotary Position
Embedding”.

Some context:

  • The bare Attention() mechanism doesn’t leave any possibility for a
    token’s position in a sequence to affect the attention scores, since
    only token-pairs are scored. Attention treats its input like a
    bag-of-tokens.

  • The position of a token in a sequence is clearly important, and the
    attention layer should have access to that information.

  • The absolute position of a token in a sequence is less important
    than the relative position between tokens. (Especially so for long
    sequences).

Which leads us into the complex plane. If we imagine the features as
complex numbers, we can rotate them, and we can calculate angles between
them. From the RoFormer paper:

Specifically, incorporating the relative position embedding is
straightforward: simply rotate the affine-transformed word embedding
vector by amount of angle multiples of its position index and thus
interprets the intuition behind Rotary Position Embedding

Expanding slightly: the rotation matrix is designed so that
subsequently, after rotating our q and k token sequence embeddings
the same way, the angle between token features is a function of the
relative distance between those tokens in the token sequence. The
relative angle between two tokens is invariant to the absolute
position of those tokens in the full sequence.

In short, the rotation injects positional information. The meaning or
interpretability of that positional information, or how it is meant to
be used, or even extracted from the result of q %*% k, is left to the
model to learn.
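The relative-position property is easy to check numerically. The base R sketch below leans on R’s native complex numbers; it is an illustration under stated assumptions, not the TensorFlow implementation from this post. The function name rotary, the pairing of adjacent features into one complex number, and the base of 10000 for the rotation frequencies (the value used in the RoFormer paper) are all assumptions here.

```r
# x: [seq_len, head_size] real matrix, head_size even.
# ASSUMPTION: adjacent feature pairs (1,2), (3,4), ... form one complex number.
rotary <- function(x, positions = seq_len(nrow(x)) - 1, theta = 10000) {
  head_size <- ncol(x)
  odd  <- seq(1, head_size, by = 2)
  even <- odd + 1
  z <- matrix(complex(real = x[, odd], imaginary = x[, even]),
              nrow = nrow(x))
  # one rotation frequency per feature pair
  freqs  <- 1 / theta^((odd - 1) / head_size)
  # rotation angle = position * frequency, shape [seq_len, head_size / 2]
  angles <- outer(positions, freqs)
  z <- z * exp(1i * angles)
  # back to real features
  out <- x
  out[, odd]  <- Re(z)
  out[, even] <- Im(z)
  out
}

set.seed(42)
q <- rnorm(8)
k <- rnorm(8)

# Attention score between q placed at pos_q and k placed at pos_k.
score_at <- function(pos_q, pos_k) {
  q_rot <- rotary(matrix(q, nrow = 1), positions = pos_q)
  k_rot <- rotary(matrix(k, nrow = 1), positions = pos_k)
  sum(q_rot * k_rot)
}

score_at(3, 1)      # offset of 2, near the start of a sequence
score_at(103, 101)  # same offset much later: the same score
score_at(3, 2)      # a different offset: a different score
```

Shifting both tokens by the same amount leaves the score unchanged, while changing the distance between them does not, which is exactly the invariance described above.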

Here is the code:

Positional embeddings also show up in other architectures, such as
diffusion models (Falbel and Keydana 2023), so time spent understanding
them better is time well spent. For the purposes of this blog post we’ve
covered the points needed and we’ll move on to tying all pieces
together. To go deeper and develop a more mathematically informed
understanding of RoPE, two excellent starting points are:

  1. The original paper by Su et al. (2022)

  2. This blog post by Biderman et al. (2021)

Tying it all together

With Tokenizer, Embedding, TransformerBlock (RMSNorm,
Attention, FeedForward and apply_rotary_embedding) all covered,
it’s time to tie all the pieces together into a Transformer model. We
could do this using %py_class% like with the other layers above, but
it’s just as easy to move over to using the Keras functional API at this
point.

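There are many strategies for choosing the next token from the model’s
output logits (some are covered in the Deep Learning with R book), but
this blog post is long enough already. So for now, let’s just take the
argmax().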
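As a tiny illustration of what taking the argmax means here, the base R sketch below picks the highest-scoring entry from a made-up vector of logits. The five-token vocabulary and the values are hypothetical, and the conversion to a 0-based id assumes the tokenizer numbers its tokens from zero.

```r
# Hypothetical logits for the last position over a toy 5-token vocabulary.
logits <- c(-1.2, 0.3, 2.5, 0.1, -0.7)

# Greedy decoding: pick the single highest-scoring token.
next_token_r  <- which.max(logits)   # 1-based index, as in R
next_token_id <- next_token_r - 1L   # 0-based id (ASSUMED tokenizer convention)

# softmax is monotonic, so the argmax of the probabilities is the same token
probs <- exp(logits - max(logits)) / sum(exp(logits - max(logits)))
which.max(probs) == next_token_r
```

Always taking the top token makes generation deterministic, which is convenient for checking an implementation even if it is rarely the best choice for text quality.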


That’s all for now. Thanks for reading and happy travels to all
exploring this exciting LLM terrain!

Photo by Sébastien Goldberg on Unsplash

Biderman, Stella, Sid Black, Charles Foster, Leo Gao, Eric Hallahan, Horace He, Ben Wang, and Phil Wang. 2021. “Rotary Embeddings: A Relative Revolution.” https://blog.eleuther.ai/rotary-embeddings/.
Falbel, Daniel, and Sigrid Keydana. 2023. “Posit AI Blog: De-Noising Diffusion with Torch.” https://blogs.rstudio.com/tensorflow/posts/2023-04-13-denoising-diffusion/.
Hoffmann, Jordan, Sebastian Borgeaud, Arthur Mensch, Elena Buchatskaya, Trevor Cai, Eliza Rutherford, Diego de Las Casas, et al. 2022. “Training Compute-Optimal Large Language Models.” https://arxiv.org/abs/2203.15556.
Shazeer, Noam. 2020. “GLU Variants Improve Transformer.” https://arxiv.org/abs/2002.05202.
Su, Jianlin, Yu Lu, Shengfeng Pan, Ahmed Murtadha, Bo Wen, and Yunfeng Liu. 2022. “RoFormer: Enhanced Transformer with Rotary Position Embedding.” https://arxiv.org/abs/2104.09864.
Touvron, Hugo, Thibaut Lavril, Gautier Izacard, Xavier Martinet, Marie-Anne Lachaux, Timothée Lacroix, Baptiste Rozière, et al. 2023. “LLaMA: Open and Efficient Foundation Language Models.” https://doi.org/10.48550/ARXIV.2302.13971.
Vaswani, Ashish, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N. Gomez, Lukasz Kaiser, and Illia Polosukhin. 2017. “Attention Is All You Need.” https://arxiv.org/abs/1706.03762.
