Introduction

Meta Llama 3.1 is a collection of pretrained and instruction-tuned generative models in 8B, 70B, and 405B sizes (text in/text out).

Model Release Date: July 23, 2024.

There are many ways to load and fine-tune LLMs using various tools and libraries. Hugging Face, with its extensive API, offers multiple methods to load and interact with these models, each tailored to different needs. However, this flexibility can also be confusing.

Llama model architecture: Llama 2, 3, and 3.1 share the same model architecture but differ in parameter values, e.g. context length, vocabulary size, etc. Source: https://github.com/hkproj/pytorch-llama-notes

Using Hugging Face’s Pipelines

When you need to get up and running with LLaMA 3.1 as quickly as possible, Hugging Face’s pipeline API is your go-to: it is the simplest and easiest method. It abstracts away most of the complexity, allowing you to focus on the task at hand, whether it’s text generation, sentiment analysis, or translation.

import transformers
import torch

model_id = "meta-llama/Meta-Llama-3.1-8B"

pipeline = transformers.pipeline(
    "text-generation", model=model_id, model_kwargs={"torch_dtype": torch.bfloat16}, device_map="auto"
)

pipeline("The reason why bigbang happend is")
Loading checkpoint shards: 100%|██████████| 4/4 [00:01<00:00,  2.95it/s]
[{'generated_text': 'The reason why bigbang happend is because of the big bang theory. The big bang theory is a theory that explains the origin of the universe. The theory was first proposed by the Belgian priest Georges Lemaître in the 1920s. The big bang theory is based on the idea that the universe began as a small, hot, dense point and then expanded and cooled to form the universe we see today.\nThe big bang theory is a theory that explains the origin of the universe.'}]

The pipeline API automatically handles model loading, tokenization, and inference, making it ideal for developers who need results fast without diving into the nitty-gritty details of model configuration.

  • Class used: transformers.pipeline
  • Ideal for: Quick prototyping, minimal setup, and rapid testing.
  • Limitations: Limited flexibility in customizing model behavior; not suited for fine-tuning or large-scale training.

Hugging Face will automatically download the model weights and everything else necessary; the weights are usually stored in the ~/.cache/huggingface/hub/ directory.
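If you want to see what is already cached locally, the huggingface_hub package (installed alongside transformers) provides a small utility for this; a minimal sketch:

# Minimal sketch: list locally cached models and their sizes.
# scan_cache_dir() walks ~/.cache/huggingface/hub/ by default.
from huggingface_hub import scan_cache_dir

cache_info = scan_cache_dir()
for repo in cache_info.repos:
    print(f"{repo.repo_id}: {repo.size_on_disk / 1e9:.2f} GB")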

Using AutoModelForCausalLM:

When it comes to fine-tuning a pre-trained model like LLaMA 3.1, the AutoModelForCausalLM class from Hugging Face is a good choice. This class is designed to handle causal language models, and it makes loading and configuring the model straightforward. One of the key advantages of AutoModelForCausalLM is that it automatically reads the model’s JSON configuration file and sets up the architecture based on the parameters defined during pre-training.
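You can see what is read from that JSON configuration by loading it on its own with AutoConfig; a quick sketch (the values in the comment are for the 8B model):

# Quick sketch: inspect the configuration that AutoModelForCausalLM will use.
from transformers import AutoConfig

config = AutoConfig.from_pretrained("meta-llama/Meta-Llama-3.1-8B")
print(config.num_hidden_layers, config.num_attention_heads, config.vocab_size)  # 32 32 128256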

Why Use AutoModelForCausalLM for Fine-Tuning?

  • Automatic Configuration: Hugging Face’s AutoModelForCausalLM automatically loads the model’s architecture from the JSON configuration. This means you don’t have to manually specify the model’s layers, attention heads, or other parameters; it’s all taken care of for you.
  • Flexibility for Fine-Tuning: Since the model is loaded with all its pre-trained parameters intact, you can easily fine-tune it on new data with minimal changes. This makes it a great choice when you need to adapt LLaMA 3.1 to a specific task or domain.

Example of Loading with AutoModelForCausalLM

from transformers import AutoModelForCausalLM, AutoTokenizer
import torch

model_id = "meta-llama/Meta-Llama-3.1-8B"
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    torch_dtype=torch.bfloat16,
    attn_implementation="flash_attention_2",
)
tokenizer = AutoTokenizer.from_pretrained(model_id)

# Preparing input text for fine-tuning
input_text = "Fine-tuning LLaMA on new data is straightforward with AutoModelForCausalLM."
input_ids = tokenizer(input_text, return_tensors="pt").input_ids

# Model is now ready for fine-tuning

This model is now ready for fine-tuning on new data. The AutoModelForCausalLM class has loaded the pre-trained LLaMA 3.1 model with the specified configuration, and the tokenizer is set up to process input text for inference or fine-tuning. One can write their own training loop utilizing packages like PEFT and accelerate for efficient model training and fine-tuning.
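As a rough illustration of that last point, here is a minimal LoRA fine-tuning sketch using PEFT; train_dataloader is a placeholder you would build from your own tokenized dataset, and the hyperparameters are only examples:

import torch
from peft import LoraConfig, get_peft_model

# Wrap the already-loaded model with LoRA adapters (only the adapters are trained).
lora_config = LoraConfig(
    r=8,                                   # rank of the LoRA update matrices
    lora_alpha=16,                         # scaling factor
    target_modules=["q_proj", "v_proj"],   # attention projections to adapt
    lora_dropout=0.05,
    task_type="CAUSAL_LM",
)
model = get_peft_model(model, lora_config)
model.print_trainable_parameters()

optimizer = torch.optim.AdamW(model.parameters(), lr=2e-4)
model.train()
for batch in train_dataloader:             # placeholder: batches of tokenized text
    outputs = model(**batch, labels=batch["input_ids"])
    outputs.loss.backward()
    optimizer.step()
    optimizer.zero_grad()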

Quantized Loading with AutoModelForCausalLM and BitsAndBytes: Efficient Inference

from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig
import torch

model_id = "meta-llama/Meta-Llama-3.1-8B"

quant_config = BitsAndBytesConfig(load_in_8bit=True)
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    quantization_config=quant_config,
    torch_dtype=torch.bfloat16,
    attn_implementation="flash_attention_2",
)
`low_cpu_mem_usage` was None, now set to True since model is quantized.
Loading checkpoint shards: 100%|██████████| 4/4 [00:03<00:00,  1.11it/s]

We can print the model architecture.

print(model)
LlamaForCausalLM(
  (model): LlamaModel(
    (embed_tokens): Embedding(128256, 4096)
    (layers): ModuleList(
      (0-31): 32 x LlamaDecoderLayer(
        (self_attn): LlamaFlashAttention2(
          (q_proj): Linear8bitLt(in_features=4096, out_features=4096, bias=False)
          (k_proj): Linear8bitLt(in_features=4096, out_features=1024, bias=False)
          (v_proj): Linear8bitLt(in_features=4096, out_features=1024, bias=False)
          (o_proj): Linear8bitLt(in_features=4096, out_features=4096, bias=False)
          (rotary_emb): LlamaRotaryEmbedding()
        )
        (mlp): LlamaMLP(
          (gate_proj): Linear8bitLt(in_features=4096, out_features=14336, bias=False)
          (up_proj): Linear8bitLt(in_features=4096, out_features=14336, bias=False)
          (down_proj): Linear8bitLt(in_features=14336, out_features=4096, bias=False)
          (act_fn): SiLU()
        )
        (input_layernorm): LlamaRMSNorm((4096,), eps=1e-05)
        (post_attention_layernorm): LlamaRMSNorm((4096,), eps=1e-05)
      )
    )
    (norm): LlamaRMSNorm((4096,), eps=1e-05)
    (rotary_emb): LlamaRotaryEmbedding()
  )
  (lm_head): Linear(in_features=4096, out_features=128256, bias=False)
)

For running inference with the quantized model, we can use similar code as before:

input_text = "The reason why bigbang happend is "
tokenizer = AutoTokenizer.from_pretrained(model_id)
model.generation_config.pad_token_id = tokenizer.pad_token_id

input_ids = tokenizer(input_text, return_tensors="pt").input_ids.to(model.device)
output = model.generate(input_ids, max_length=100, pad_token_id=tokenizer.eos_token_id, do_sample=True, top_k=50, top_p=0.95)

print(tokenizer.decode(output[0], skip_special_tokens=True))
The reason why bigbang happend is 1. gravity 2. mass 3. energy 4. light 5. space 6. time 7. matter 8. dark matter 9. dark energy 10. space-time 11. mass-energy 12. mass-energy-time 13. mass-energy-time-space 14. mass-energy-time-space-dark matter 15. mass-energy-time-space-dark matter-dark energy 16. mass-energy-time-space-dark matter-dark

You can read the documentation for more details on the generate method and its parameters.
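For instance, the same sampling parameters can be bundled into a GenerationConfig object instead of being passed individually; a small sketch:

from transformers import GenerationConfig

# Group generation parameters in one place and reuse them across calls.
gen_config = GenerationConfig(
    max_new_tokens=100,
    do_sample=True,
    top_k=50,
    top_p=0.95,
    pad_token_id=tokenizer.eos_token_id,
)
output = model.generate(input_ids, generation_config=gen_config)
print(tokenizer.decode(output[0], skip_special_tokens=True))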

Since we are using the pre-trained LLaMA 3.1, which was not fine-tuned for anything specific, the output might not be coherent or relevant to the input text.

The quantized model is now ready for efficient inference with reduced memory usage and faster execution times.
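To get a rough idea of the savings, transformers models expose a get_memory_footprint() helper; exact numbers will vary with your setup:

# Approximate memory used by the loaded (8-bit) weights, in GB.
print(f"{model.get_memory_footprint() / 1e9:.2f} GB")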

In these examples we are using the Meta-Llama-3.1-8B model and not the Meta-Llama-3.1-8B-Instruct model. The latter is instruction-tuned with RLHF and is more suitable for answering questions and generating coherent responses; the former is mainly meant for fine-tuning on your own data and tasks, since it is not tuned for any specific task.
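If your goal is question answering or chat-style output, a minimal sketch of prompting the Instruct variant through its chat template looks like this (assuming you have access to meta-llama/Meta-Llama-3.1-8B-Instruct; parameters are illustrative):

from transformers import AutoModelForCausalLM, AutoTokenizer
import torch

instruct_id = "meta-llama/Meta-Llama-3.1-8B-Instruct"
chat_tokenizer = AutoTokenizer.from_pretrained(instruct_id)
chat_model = AutoModelForCausalLM.from_pretrained(
    instruct_id, torch_dtype=torch.bfloat16, device_map="auto"
)

# Build the prompt with the model's chat template and generate a reply.
messages = [{"role": "user", "content": "Why did the Big Bang happen?"}]
input_ids = chat_tokenizer.apply_chat_template(
    messages, add_generation_prompt=True, return_tensors="pt"
).to(chat_model.device)
output = chat_model.generate(input_ids, max_new_tokens=128)
print(chat_tokenizer.decode(output[0][input_ids.shape[-1]:], skip_special_tokens=True))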

Loading LLaMA with local weights:

If you have the model weights available locally, for example because you downloaded the original .pth files from the Meta repo for LLaMA 3.1 and converted them into Hugging Face weights, you can use the from_pretrained method with the local_files_only parameter set to True. This ensures that the model is loaded from the local directory and not downloaded from the Hugging Face model hub.

local_model_path = "~/.cache/huggingface/hub/DirectorytoModelWeights"
local_model = AutoModelForCausalLM.from_pretrained(
    local_model_path,
    local_files_only=True,
    torch_dtype=torch.bfloat16,
)

Using LlamaConfig and LlamaForCausalLM:

LlamaConfig is a configuration class that allows you to define the architecture and parameters of the LLaMA model before loading it. This class gives you the flexibility to customize aspects like the number of layers, attention heads, and more.

Why Use It?

  • Customization: Allows for tailored model architectures and hyperparameters.
  • Research and experimentation: Ideal for exploring different model configurations, or for adding enhancements and modifications to the model architecture.

A well-known example of using LlamaConfig is the YaRN repo, where the authors loaded the LLaMA model with LlamaConfig and LlamaForCausalLM using pretrained weights and replaced the standard RoPE (Rotary Positional Embeddings) scaling with their proposed YaRN (Yet another RoPE extensioN) method to extend the model’s context window.

from transformers import LlamaConfig, LlamaForCausalLM,AutoTokenizer
import torch
params = {
  "attention_bias": False,
  "attention_dropout": 0.0,
  "bos_token_id": 128000,
  "eos_token_id": 128001,
  "hidden_act": "silu",
  "hidden_size": 4096,
  "initializer_range": 0.02,
  "intermediate_size": 14336,
  "max_position_embeddings": 131072,
  "mlp_bias": False,
  "model_type": "llama",
  "num_attention_heads": 16, # 32->16
  "num_hidden_layers": 16, # 32->16
  "num_key_value_heads": 8,
  "pretraining_tp": 1,
  "rms_norm_eps": 1e-05,
  "rope_scaling": {
    "factor": 8.0,
    "low_freq_factor": 1.0,
    "high_freq_factor": 4.0,
    "original_max_position_embeddings": 8192,
    "rope_type": "llama3"
  },
  "rope_theta": 500000.0,
  "tie_word_embeddings": False,
  "torch_dtype": "bfloat16",
  "transformers_version": "4.43.0.dev0",
  "use_cache": True,
  "vocab_size": 128256
}
config_cls = LlamaConfig
model_cls = LlamaForCausalLM
config = config_cls(**params)

model = model_cls(config)
print(model)
local_model_path = "/home/snawar/.cache/huggingface/hub/models--meta-llama--Meta-Llama-3.1-8B/snapshots/48d6d0fc4e02fb1269b36940650a1b7233035cbb"
model = model.from_pretrained(
    local_model_path,
    local_files_only=True,
    torch_dtype=torch.bfloat16,
).to('cuda')



model_id = "meta-llama/Meta-Llama-3.1-8B"

input_text = "The reason why bigbang happend is "
tokenizer = AutoTokenizer.from_pretrained(model_id)
model.generation_config.pad_token_id = tokenizer.pad_token_id

input_ids = tokenizer(input_text, return_tensors="pt").input_ids.to(model.device)
output = model.generate(input_ids, max_length=100, pad_token_id=tokenizer.eos_token_id, do_sample=True, top_k=50, top_p=0.95)
print(tokenizer.decode(output[0], skip_special_tokens=True))
Loading checkpoint shards: 100%|████████████████████████████████████████████████████████████████| 4/4 [00:00<00:00, 24.87it/s]
The attention mask is not set and cannot be inferred from input because pad token is same as eos token. As a consequence, you may observe unexpected behavior. Please pass your input's `attention_mask` to obtain reliable results.
The reason why bigbang happend is 2nd law of thermodynamics.
2nd law of thermodynamics is entropy increase.
Entropy is probability of disorder.
So, there is always probability of disorder.
But, there is no probability of order.
That is, the probability of order is 0.
So, entropy can be 0.
And, when entropy is 0, temperature is infinite.
Infinite temperature is infinite energy.
So, the universe will have infinite energy.
And,

Even though I changed num_hidden_layers and num_attention_heads in the params from 32 to 16, the model weights were still loaded. This is because from_pretrained rebuilds the model from the config.json stored alongside the checkpoint, so the custom LlamaConfig is effectively discarded unless it is passed explicitly via the config argument.
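A quick sanity check confirms this: the config attached to the loaded model reflects the checkpoint, not our custom params.

# The config of the loaded model comes from the checkpoint's config.json.
print(model.config.num_hidden_layers)    # 32, not the 16 set in params
print(model.config.num_attention_heads)  # 32, not 16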

Meta’s Original Implementation: The Full LLaMA Experience

Meta’s original implementation offers the most control, particularly for large-scale distributed training. The original Llama 3.1 architecture is implemented purely in PyTorch.

import os

import torch.distributed as dist
from fairscale.nn.model_parallel.initialize import initialize_model_parallel
from models.llama3_1.api.args import ModelArgs
from models.llama3_1.reference_impl.model import Transformer

# Set environment variables manually if not running in a distributed launcher
os.environ['RANK'] = '0'
os.environ['WORLD_SIZE'] = '1'
os.environ['MASTER_ADDR'] = 'localhost'
os.environ['MASTER_PORT'] = '9300'



# Continue with the rest of the script...

# Initialize the PyTorch distributed environment
dist.init_process_group(backend='nccl', rank=0, world_size=1)

# Initialize the model parallel group
initialize_model_parallel(model_parallel_size_=1)

# Define your configuration
params = {
    "dim": 4096,
    "ffn_dim_multiplier": 1.3,
    "multiple_of": 1024,
    "n_heads": 32,
    "n_kv_heads": 8,
    "n_layers": 32,
    "norm_eps": 1e-05,
    "rope_theta": 500000.0,
    "use_scaled_rope": True,
    "vocab_size": 128256
}
param = ModelArgs(**params)

# Initialize the model
print("loading model ....")
llama_model = Transformer(param)
print(llama_model)

# Load the .pth model weights with llama_model.load_state_dict(torch.load("path/to/your/model.pth"))
# Now you can proceed with your training or inference tasks

This method is geared towards advanced users who are comfortable managing distributed systems and need to train or fine-tune LLaMA 3.1 at scale.

The Llama 3.1 repo itself does not contain scripts for training, fine-tuning, or inference, but there is a separate repo for that called llama-recipes, which contains several useful scripts.

Why Use It?

  • Full control: Best for complex setups requiring model parallelism and distributed training.
  • Large-scale training: Ideal for scenarios where you need to fully leverage your hardware resources.