HuggingFace Ecosystem Complete Guide: Master Transformers, Datasets, PEFT, and Accelerate
Author: Youngju Kim (@fjvbn20031)
Introduction
HuggingFace is one of the most important platforms in today's AI/ML ecosystem. It provides hundreds of thousands of pretrained models, tens of thousands of datasets, and practical libraries — all in one unified ecosystem. This guide will take you from beginner to expert across every major component of HuggingFace.
1. HuggingFace Ecosystem Overview
1.1 HuggingFace Hub
The HuggingFace Hub consists of three core elements.
Model Hub: A repository of hundreds of thousands of pretrained models shared by researchers and developers worldwide. You can find nearly every well-known model including BERT, GPT-2, Llama, Mistral, and Stable Diffusion. Models are stored in PyTorch, TensorFlow, and JAX formats, and each model page provides usage instructions and performance benchmarks.
Dataset Hub: Thousands of datasets spanning NLP, Vision, Audio, and Multimodal domains. Perfectly integrated with the datasets library, allowing you to load any dataset with a single line of code.
Spaces: Free hosting for ML demo apps built with Gradio or Streamlit. Runs on CPU or GPU environments and is the easiest way to share your models with the community.
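For example, a minimal app.py for a Gradio Space can be as small as the sketch below (the model name is just an illustration, reused from the Pipeline section later in this guide):
import gradio as gr
from transformers import pipeline

# Tiny sentiment demo suitable for a CPU Space
classifier = pipeline("text-classification", model="distilbert-base-uncased-finetuned-sst-2-english")

def predict(text):
    return classifier(text)[0]  # e.g. {'label': 'POSITIVE', 'score': 0.99}

demo = gr.Interface(fn=predict, inputs="text", outputs="json")
demo.launch()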
1.2 Key Libraries
| Library | Purpose |
|---|---|
| transformers | Loading and fine-tuning pretrained models |
| datasets | Dataset loading and preprocessing |
| tokenizers | Fast tokenizer implementations |
| peft | Parameter-efficient fine-tuning (LoRA, etc.) |
| accelerate | Multi-GPU/TPU training abstraction |
| trl | RLHF, SFT, DPO fine-tuning |
| diffusers | Image generation models |
| evaluate | Model evaluation metrics |
| optimum | Hardware optimization |
| huggingface_hub | Hub API client |
1.3 Account Setup and API Token
pip install transformers datasets tokenizers peft accelerate trl diffusers evaluate huggingface_hub
from huggingface_hub import login
# Method 1: Direct token
login(token="hf_your_token_here")
# Method 2: Environment variable (recommended)
import os
os.environ["HF_TOKEN"] = "hf_your_token_here"  # huggingface_hub reads the HF_TOKEN variable
# Method 3: CLI
# huggingface-cli login
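To verify that authentication works, you can ask the Hub which account the token belongs to:
from huggingface_hub import whoami
print(whoami()["name"])  # prints your username if the token is valid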
2. Transformers Library Complete Guide
2.1 Pipeline API
The pipeline is the simplest way to use models in HuggingFace. It handles tokenization, model inference, and post-processing internally.
from transformers import pipeline
# Text classification
classifier = pipeline("text-classification", model="distilbert-base-uncased-finetuned-sst-2-english")
result = classifier("This library is incredibly useful!")
print(result)
# [{'label': 'POSITIVE', 'score': 0.9997}]
# Text generation
generator = pipeline(
"text-generation",
model="meta-llama/Meta-Llama-3-8B-Instruct",
torch_dtype="auto",
device_map="auto"
)
output = generator(
"The most famous landmarks in New York are",
max_new_tokens=100,
do_sample=True,
temperature=0.7
)
print(output[0]["generated_text"])
# Extractive QA
qa = pipeline("question-answering", model="deepset/roberta-base-squad2")
result = qa(
question="Where is HuggingFace headquartered?",
context="HuggingFace is headquartered in New York and also has an office in Paris."
)
print(result)
# {'score': 0.99, 'start': 32, 'end': 40, 'answer': 'New York'}
# Translation
translator = pipeline("translation", model="Helsinki-NLP/opus-mt-en-fr")
result = translator("Hello, I am an AI researcher.")
print(result)
# Summarization
summarizer = pipeline("summarization", model="facebook/bart-large-cnn")
long_text = """HuggingFace was founded in 2016 as an AI company..."""
summary = summarizer(long_text, max_length=100, min_length=30)
print(summary)
# Named Entity Recognition
ner = pipeline("ner", model="dbmdz/bert-large-cased-finetuned-conll03-english", aggregation_strategy="simple")
result = ner("Apple is looking at buying a U.K. startup for $1 billion.")
print(result)
2.2 AutoTokenizer and AutoModel
from transformers import AutoTokenizer, AutoModel
import torch
# Load tokenizer
tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
# Encode text
text = "HuggingFace is a really useful library."
inputs = tokenizer(text, return_tensors="pt")
print(inputs)
# {'input_ids': tensor([[...]]), 'attention_mask': tensor([[...]])}
# Decode tokens
decoded = tokenizer.decode(inputs["input_ids"][0])
print(decoded)
# Inspect special tokens
print(f"BOS: {tokenizer.bos_token}")
print(f"EOS: {tokenizer.eos_token}")
print(f"PAD: {tokenizer.pad_token}")
print(f"UNK: {tokenizer.unk_token}")
print(f"Vocab size: {tokenizer.vocab_size}")
# Batch encoding
texts = ["First sentence.", "Second sentence, which is a little longer."]
batch_inputs = tokenizer(
texts,
padding=True,
truncation=True,
max_length=128,
return_tensors="pt"
)
print(batch_inputs["input_ids"].shape) # [2, longest_seq_len_in_batch] — padding=True pads to the longest sequence; max_length=128 only caps truncation
# Load model
model = AutoModel.from_pretrained("bert-base-uncased")
model.eval()
with torch.no_grad():
outputs = model(**batch_inputs)
print(outputs.last_hidden_state.shape) # [batch, seq_len, hidden_dim]
print(outputs.pooler_output.shape) # [batch, hidden_dim]
2.3 Task-Specific Models
from transformers import (
AutoModelForSequenceClassification,
AutoModelForCausalLM,
AutoModelForSeq2SeqLM,
AutoModelForTokenClassification,
AutoModelForQuestionAnswering,
AutoModelForMaskedLM,
)
import torch
# Sequence classification (sentiment analysis, text classification)
clf_model = AutoModelForSequenceClassification.from_pretrained(
"bert-base-uncased",
num_labels=2
)
# Causal LM (GPT-style text generation)
causal_model = AutoModelForCausalLM.from_pretrained(
"meta-llama/Meta-Llama-3-8B",
torch_dtype=torch.bfloat16,
device_map="auto",
attn_implementation="flash_attention_2"
)
# Seq2Seq (translation, summarization)
seq2seq_model = AutoModelForSeq2SeqLM.from_pretrained("facebook/bart-large-cnn")
# Token classification (NER, POS tagging)
ner_model = AutoModelForTokenClassification.from_pretrained(
"bert-base-uncased",
num_labels=9
)
# Extractive QA
qa_model = AutoModelForQuestionAnswering.from_pretrained("deepset/roberta-large-squad2")
# Masked LM (BERT-style mask prediction)
mlm_model = AutoModelForMaskedLM.from_pretrained("bert-base-uncased")
2.4 Detailed Inference Example
from transformers import AutoTokenizer, AutoModelForCausalLM
import torch
def load_model(model_name: str):
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(
model_name,
torch_dtype=torch.bfloat16,
device_map="auto"
)
return tokenizer, model
def generate_text(
model,
tokenizer,
prompt: str,
max_new_tokens: int = 200,
temperature: float = 0.7,
top_p: float = 0.9,
do_sample: bool = True
) -> str:
inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
with torch.no_grad():
outputs = model.generate(
**inputs,
max_new_tokens=max_new_tokens,
temperature=temperature,
top_p=top_p,
do_sample=do_sample,
repetition_penalty=1.1,
eos_token_id=tokenizer.eos_token_id,
pad_token_id=tokenizer.pad_token_id,
)
# Return only the newly generated tokens (exclude input prompt)
generated_ids = outputs[0][inputs["input_ids"].shape[1]:]
return tokenizer.decode(generated_ids, skip_special_tokens=True)
# Usage example
tokenizer, model = load_model("Qwen/Qwen2.5-7B-Instruct")
# Apply chat template
messages = [
{"role": "system", "content": "You are a helpful AI assistant."},
{"role": "user", "content": "Write a Python function to compute Fibonacci numbers."}
]
prompt = tokenizer.apply_chat_template(
messages,
tokenize=False,
add_generation_prompt=True
)
response = generate_text(model, tokenizer, prompt)
print(response)
2.5 Tokenizer Deep Dive
from transformers import AutoTokenizer
tokenizer = AutoTokenizer.from_pretrained("meta-llama/Meta-Llama-3-8B")
text = "Hello! Let's learn HuggingFace."
# Basic encoding
ids = tokenizer.encode(text)
print(f"Token IDs: {ids}")
print(f"Token count: {len(ids)}")
# Inspect tokens
tokens = tokenizer.tokenize(text)
print(f"Tokens: {tokens}")
# With/without special tokens
ids_with_special = tokenizer.encode(text, add_special_tokens=True)
ids_without_special = tokenizer.encode(text, add_special_tokens=False)
print(f"With special tokens: {len(ids_with_special)}")
print(f"Without special tokens: {len(ids_without_special)}")
# Batch processing with padding
batch = [
"Short sentence.",
"This is a much longer sentence that will need padding applied to it."
]
encoded = tokenizer(
batch,
padding="max_length",
max_length=64,
truncation=True,
return_tensors="pt",
return_attention_mask=True
)
print("input_ids shape:", encoded["input_ids"].shape)
print("attention_mask shape:", encoded["attention_mask"].shape)
# Adding special tokens
special_tokens = {"additional_special_tokens": ["[DOMAIN]", "[ENTITY]", "[DATE]"]}
num_added = tokenizer.add_special_tokens(special_tokens)
print(f"Added {num_added} special tokens")
3. Datasets Library
3.1 Loading Datasets
from datasets import load_dataset
# Load from Hub
dataset = load_dataset("glue", "sst2")
print(dataset)
# DatasetDict({train: Dataset({features: [...], num_rows: ...})})
# Load specific splits
train_ds = load_dataset("glue", "sst2", split="train")
val_ds = load_dataset("glue", "sst2", split="validation")
# Load from local files
local_ds = load_dataset("json", data_files={"train": "train.jsonl", "test": "test.jsonl"})
csv_ds = load_dataset("csv", data_files="data.csv")
text_ds = load_dataset("text", data_files="corpus.txt")
# Percentage-based splits
split_ds = load_dataset("glue", "sst2", split="train[:80%]")
val_split = load_dataset("glue", "sst2", split="train[80%:]")
# Inspect dataset
print(train_ds.features)
print(train_ds.column_names)
print(train_ds.num_rows)
print(train_ds[0]) # First sample
print(train_ds[:5]) # First 5 samples
3.2 Dataset Preprocessing
from datasets import load_dataset
from transformers import AutoTokenizer
dataset = load_dataset("glue", "sst2")
tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
def tokenize_function(examples):
return tokenizer(
examples["sentence"],
padding="max_length",
truncation=True,
max_length=128
)
# Batched processing with multiprocessing
tokenized_ds = dataset.map(
tokenize_function,
batched=True,
num_proc=4,
remove_columns=["sentence", "idx"]
)
# filter: keep only samples meeting a condition
long_samples = dataset["train"].filter(
lambda x: len(x["sentence"].split()) > 10
)
print(f"After filter: {len(long_samples)} samples")
# select: index-based selection
small_ds = dataset["train"].select(range(1000))
# sort
sorted_ds = dataset["train"].sort("label", reverse=True)
# shuffle
shuffled_ds = dataset["train"].shuffle(seed=42)
# rename_column
renamed_ds = dataset["train"].rename_column("label", "sentiment")
# Add a column
def add_text_length(example):
example["text_length"] = len(example["sentence"].split())
return example
ds_with_length = dataset["train"].map(add_text_length)
# Save/load dataset
tokenized_ds.save_to_disk("./tokenized_sst2")
from datasets import load_from_disk
loaded_ds = load_from_disk("./tokenized_sst2")
3.3 Creating and Uploading Custom Datasets
from datasets import Dataset, DatasetDict, Features, Value, ClassLabel
import pandas as pd
# Create Dataset from pandas DataFrame
df = pd.DataFrame({
"text": ["Positive sentence.", "Negative sentence.", "Neutral sentence."],
"label": [1, 0, 2]
})
custom_ds = Dataset.from_pandas(df)
# Create Dataset from dictionary
data_dict = {
"text": ["sample 1", "sample 2"],
"score": [0.8, 0.3]
}
ds_from_dict = Dataset.from_dict(data_dict)
# Define Features schema
features = Features({
"text": Value("string"),
"label": ClassLabel(names=["negative", "positive", "neutral"]),
"score": Value("float32")
})
# Push to Hub
from huggingface_hub import login
login()
custom_ds.push_to_hub("your-username/my-custom-dataset")
# Push with train/test splits
dataset_dict = DatasetDict({
"train": custom_ds,
"test": custom_ds
})
dataset_dict.push_to_hub("your-username/my-custom-dataset")
3.4 Streaming Datasets
from datasets import load_dataset
# Streaming mode (no full dataset loaded into memory)
streaming_ds = load_dataset(
"HuggingFaceFW/fineweb",
"sample-10BT",
split="train",
streaming=True
)
# Iterate as an iterator
for i, example in enumerate(streaming_ds):
if i >= 5:
break
print(example["text"][:100])
# map and filter also work with streaming
filtered_stream = streaming_ds.filter(lambda x: len(x["text"]) > 500)
mapped_stream = filtered_stream.map(lambda x: {"text_len": len(x["text"])})
# Batch processing
def process_batch(batch):
return {"processed": [t.lower() for t in batch["text"]]}
batched_stream = streaming_ds.map(process_batch, batched=True, batch_size=16)
# Convert to DataLoader
from torch.utils.data import DataLoader
dataloader = DataLoader(
list(streaming_ds.take(1000)),
batch_size=8,
shuffle=True
)
3.5 DataCollators
from transformers import (
DataCollatorWithPadding,
DataCollatorForSeq2Seq,
DataCollatorForLanguageModeling,
AutoTokenizer
)
tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
# Dynamic padding (pads to max length within each batch)
data_collator = DataCollatorWithPadding(tokenizer=tokenizer)
# Seq2Seq collator (auto-generates decoder inputs)
t5_tokenizer = AutoTokenizer.from_pretrained("t5-base")
seq2seq_collator = DataCollatorForSeq2Seq(
tokenizer=t5_tokenizer,
padding=True,
return_tensors="pt"
)
# Masked LM collator
mlm_collator = DataCollatorForLanguageModeling(
tokenizer=tokenizer,
mlm=True,
mlm_probability=0.15
)
# Causal LM collator
clm_collator = DataCollatorForLanguageModeling(
tokenizer=tokenizer,
mlm=False
)
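A collator plugs in as the collate_fn of a PyTorch DataLoader (or via the Trainer's data_collator argument). A minimal sketch, assuming a tokenized_ds like the one from Section 3.2 but tokenized with truncation only (no fixed-length padding):
from torch.utils.data import DataLoader

loader = DataLoader(
    tokenized_ds["train"],
    batch_size=16,
    shuffle=True,
    collate_fn=data_collator,  # pads each batch to its own longest sequence
)
batch = next(iter(loader))
print(batch["input_ids"].shape)  # [16, longest_seq_len_in_this_batch]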
4. Tokenizers Library
4.1 Fast Tokenizers
The HuggingFace tokenizers library provides very fast tokenizers implemented in Rust, typically orders of magnitude faster than pure-Python tokenizers.
from tokenizers import Tokenizer, models, trainers, pre_tokenizers, decoders, processors
from tokenizers.models import BPE, WordPiece, Unigram
# Train a BPE tokenizer
tokenizer = Tokenizer(BPE(unk_token="[UNK]"))
tokenizer.pre_tokenizer = pre_tokenizers.ByteLevel(add_prefix_space=False)
trainer = trainers.BpeTrainer(
vocab_size=30000,
min_frequency=2,
special_tokens=["[UNK]", "[CLS]", "[SEP]", "[PAD]", "[MASK]"]
)
files = ["corpus.txt"]
tokenizer.train(files, trainer)
# Save and load
tokenizer.save("my_tokenizer.json")
loaded_tokenizer = Tokenizer.from_file("my_tokenizer.json")
# Encoding
output = tokenizer.encode("Hello, HuggingFace!")
print(output.tokens)
print(output.ids)
# Batch encoding
batch_output = tokenizer.encode_batch(["sentence 1", "sentence 2"])
4.2 WordPiece Tokenizer (BERT-style)
from tokenizers import Tokenizer
from tokenizers.models import WordPiece
from tokenizers.trainers import WordPieceTrainer
from tokenizers.pre_tokenizers import Whitespace
from tokenizers.processors import TemplateProcessing
tokenizer = Tokenizer(WordPiece(unk_token="[UNK]"))
tokenizer.pre_tokenizer = Whitespace()
trainer = WordPieceTrainer(
vocab_size=30522,
special_tokens=["[UNK]", "[CLS]", "[SEP]", "[PAD]", "[MASK]"]
)
tokenizer.train(["corpus.txt"], trainer)
# Add BERT-style post-processing
tokenizer.post_processor = TemplateProcessing(
single="[CLS] $A [SEP]",
pair="[CLS] $A [SEP] $B:1 [SEP]:1",
special_tokens=[
("[CLS]", tokenizer.token_to_id("[CLS]")),
("[SEP]", tokenizer.token_to_id("[SEP]")),
],
)
# Integrate with HuggingFace transformers
from transformers import PreTrainedTokenizerFast
fast_tokenizer = PreTrainedTokenizerFast(
tokenizer_object=tokenizer,
unk_token="[UNK]",
pad_token="[PAD]",
cls_token="[CLS]",
sep_token="[SEP]",
mask_token="[MASK]"
)
# Upload to Hub
fast_tokenizer.push_to_hub("your-username/my-tokenizer")
5. PEFT: Parameter-Efficient Fine-Tuning
5.1 LoRA Basics
LoRA (Low-Rank Adaptation) freezes the original model weights and trains only a pair of small low-rank matrices per adapted layer, dramatically reducing the number of trainable parameters.
from peft import LoraConfig, get_peft_model, TaskType
from transformers import AutoModelForCausalLM, AutoTokenizer
import torch
# Load base model
model_name = "meta-llama/Meta-Llama-3-8B"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(
model_name,
torch_dtype=torch.bfloat16,
device_map="auto"
)
# LoRA configuration
lora_config = LoraConfig(
task_type=TaskType.CAUSAL_LM,
r=16, # Rank (lower = fewer parameters)
lora_alpha=32, # Scaling factor
target_modules=[ # Layers to apply LoRA
"q_proj", "v_proj",
"k_proj", "o_proj",
"gate_proj", "up_proj",
"down_proj"
],
lora_dropout=0.05,
bias="none",
inference_mode=False
)
# Apply LoRA
peft_model = get_peft_model(model, lora_config)
peft_model.print_trainable_parameters()
# trainable params: 41,943,040 || all params: 8,072,925,184 || trainable%: 0.5196
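As a sanity check, the 41,943,040 figure can be reproduced from Llama-3-8B's published dimensions (hidden size 4096, intermediate size 14336, 1024-dimensional K/V projections due to grouped-query attention, 32 layers); the sketch below assumes those numbers:
# Back-of-the-envelope LoRA parameter count for r=16 on all seven projection modules
r = 16
hidden, inter, kv = 4096, 14336, 1024
per_layer = (
    r * (hidden + hidden) * 2   # q_proj, o_proj
    + r * (hidden + kv) * 2     # k_proj, v_proj
    + r * (hidden + inter) * 3  # gate_proj, up_proj, down_proj
)
print(per_layer * 32)  # 41943040 — matches print_trainable_parameters()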
5.2 QLoRA: 4-bit Quantization + LoRA
from peft import prepare_model_for_kbit_training, LoraConfig, get_peft_model
from transformers import AutoModelForCausalLM, BitsAndBytesConfig
import torch
# 4-bit quantization config
bnb_config = BitsAndBytesConfig(
load_in_4bit=True,
bnb_4bit_use_double_quant=True,
bnb_4bit_quant_type="nf4",
bnb_4bit_compute_dtype=torch.bfloat16
)
# Load quantized model
model = AutoModelForCausalLM.from_pretrained(
"meta-llama/Meta-Llama-3-8B",
quantization_config=bnb_config,
device_map="auto"
)
# Prepare for kbit training (enables gradient checkpointing)
model = prepare_model_for_kbit_training(model)
# Apply LoRA
lora_config = LoraConfig(
r=64,
lora_alpha=128,
target_modules=["q_proj", "v_proj", "k_proj", "o_proj"],
lora_dropout=0.1,
bias="none",
task_type="CAUSAL_LM"
)
model = get_peft_model(model, lora_config)
model.print_trainable_parameters()
5.3 Complete Fine-Tuning Example (SFTTrainer)
from trl import SFTTrainer, SFTConfig
from peft import LoraConfig
from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig
from datasets import load_dataset
import torch
# 1. Dataset
dataset = load_dataset("HuggingFaceH4/ultrachat_200k", split="train_sft[:10%]")
# 2. Model setup
bnb_config = BitsAndBytesConfig(
load_in_4bit=True,
bnb_4bit_use_double_quant=True,
bnb_4bit_quant_type="nf4",
bnb_4bit_compute_dtype=torch.bfloat16
)
model = AutoModelForCausalLM.from_pretrained(
"meta-llama/Meta-Llama-3-8B",
quantization_config=bnb_config,
device_map="auto"
)
tokenizer = AutoTokenizer.from_pretrained("meta-llama/Meta-Llama-3-8B")
tokenizer.pad_token = tokenizer.eos_token
# 3. LoRA config
peft_config = LoraConfig(
r=32,
lora_alpha=64,
lora_dropout=0.05,
bias="none",
task_type="CAUSAL_LM",
target_modules=["q_proj", "v_proj", "k_proj", "o_proj"]
)
# 4. Training config
sft_config = SFTConfig(
output_dir="./qlora-llama3",
num_train_epochs=3,
per_device_train_batch_size=2,
gradient_accumulation_steps=4,
learning_rate=2e-4,
bf16=True,
logging_steps=10,
save_steps=100,
warmup_ratio=0.03,
lr_scheduler_type="cosine",
max_seq_length=2048,
report_to="wandb"
)
# 5. Train
trainer = SFTTrainer(
model=model,
train_dataset=dataset,
peft_config=peft_config,
args=sft_config,
processing_class=tokenizer,
)
trainer.train()
trainer.save_model("./qlora-llama3-final")
5.4 Saving, Loading, and Merging Adapters
from peft import PeftModel
from transformers import AutoModelForCausalLM, AutoTokenizer
import torch
# Save LoRA adapter only
peft_model.save_pretrained("./lora-adapter")
tokenizer.save_pretrained("./lora-adapter")
# Upload adapter to Hub
peft_model.push_to_hub("your-username/llama3-lora")
# Load adapter (base model + adapter)
base_model = AutoModelForCausalLM.from_pretrained(
"meta-llama/Meta-Llama-3-8B",
torch_dtype=torch.bfloat16,
device_map="auto"
)
peft_model = PeftModel.from_pretrained(base_model, "./lora-adapter")
# Merge adapter into base model (optimizes inference speed)
merged_model = peft_model.merge_and_unload()
merged_model.save_pretrained("./merged-llama3")
tokenizer.save_pretrained("./merged-llama3")
6. Accelerate Library
6.1 Accelerate Configuration
# Interactive setup
accelerate config
# Save the answers to a specific config file
accelerate config --config_file ./accelerate_config.yaml
# Or write accelerate_config.yaml by hand:
compute_environment: LOCAL_MACHINE
distributed_type: MULTI_GPU
num_machines: 1
num_processes: 4
gpu_ids: all
mixed_precision: bf16
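With the config file in place, training scripts are started through accelerate launch (train.py is a placeholder for your own script):
# Launch a training script with the saved config
accelerate launch --config_file ./accelerate_config.yaml train.py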
6.2 Basic Accelerate Training Loop
from accelerate import Accelerator
from transformers import AutoModelForSequenceClassification, AutoTokenizer
from torch.optim import AdamW
from datasets import load_dataset
from torch.utils.data import DataLoader
def training_function():
accelerator = Accelerator(
mixed_precision="bf16",
gradient_accumulation_steps=4,
log_with="wandb"
)
model_name = "bert-base-uncased"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForSequenceClassification.from_pretrained(model_name, num_labels=2)
dataset = load_dataset("glue", "sst2", split="train")
def tokenize(examples):
return tokenizer(examples["sentence"], truncation=True, padding="max_length", max_length=128)
tokenized = dataset.map(tokenize, batched=True)
tokenized = tokenized.rename_column("label", "labels") # the model's forward expects "labels"
tokenized.set_format("torch", columns=["input_ids", "attention_mask", "labels"])
dataloader = DataLoader(tokenized, batch_size=16, shuffle=True)
optimizer = AdamW(model.parameters(), lr=2e-5)
# Prepare everything with Accelerate
model, optimizer, dataloader = accelerator.prepare(model, optimizer, dataloader)
for epoch in range(3):
model.train()
for batch in dataloader:
with accelerator.accumulate(model):
outputs = model(**batch)
loss = outputs.loss
accelerator.backward(loss)
accelerator.clip_grad_norm_(model.parameters(), 1.0)
optimizer.step()
optimizer.zero_grad()
accelerator.print(f"Epoch {epoch} completed")
accelerator.wait_for_everyone()
unwrapped_model = accelerator.unwrap_model(model)
unwrapped_model.save_pretrained("./output", save_function=accelerator.save)
# Run: accelerate launch train.py
training_function()
6.3 DeepSpeed Integration
from accelerate import Accelerator
from accelerate.utils import DeepSpeedPlugin
ds_plugin = DeepSpeedPlugin(
hf_ds_config={
"zero_optimization": {
"stage": 2,
"allgather_partitions": True,
"allgather_bucket_size": 2e8,
"overlap_comm": True,
"reduce_scatter": True,
"reduce_bucket_size": 2e8,
"contiguous_gradients": True
},
"bf16": {"enabled": True},
"optimizer": {
"type": "AdamW",
"params": {
"lr": "auto",
"weight_decay": "auto"
}
}
}
)
accelerator = Accelerator(deepspeed_plugin=ds_plugin)
6.4 Gradient Accumulation and Mixed Precision
from accelerate import Accelerator
accelerator = Accelerator(
mixed_precision="bf16",
gradient_accumulation_steps=8
)
for batch in dataloader:
with accelerator.accumulate(model):
output = model(**batch)
loss = output.loss  # no manual division: accelerator.backward() scales by gradient_accumulation_steps
accelerator.backward(loss)
if accelerator.sync_gradients:
accelerator.clip_grad_norm_(model.parameters(), 1.0)
optimizer.step()
lr_scheduler.step()
optimizer.zero_grad()
7. TRL: Transformer Reinforcement Learning
7.1 SFTTrainer
from trl import SFTTrainer, SFTConfig
from transformers import AutoModelForCausalLM, AutoTokenizer
from datasets import load_dataset
model = AutoModelForCausalLM.from_pretrained("Qwen/Qwen2.5-7B")
tokenizer = AutoTokenizer.from_pretrained("Qwen/Qwen2.5-7B")
dataset = load_dataset("HuggingFaceH4/ultrachat_200k", split="train_sft[:5%]")
training_args = SFTConfig(
output_dir="./sft-qwen",
max_seq_length=2048,
num_train_epochs=1,
per_device_train_batch_size=4,
gradient_accumulation_steps=2,
learning_rate=2e-5,
bf16=True,
save_strategy="epoch",
logging_steps=10,
dataset_text_field="messages",
)
trainer = SFTTrainer(
model=model,
args=training_args,
train_dataset=dataset,
processing_class=tokenizer,
)
trainer.train()
7.2 DPOTrainer (Direct Preference Optimization)
from trl import DPOTrainer, DPOConfig
from transformers import AutoModelForCausalLM, AutoTokenizer
from datasets import load_dataset
model = AutoModelForCausalLM.from_pretrained("your-sft-model")
ref_model = AutoModelForCausalLM.from_pretrained("your-sft-model") # frozen
tokenizer = AutoTokenizer.from_pretrained("your-sft-model")
# DPO dataset needs: prompt, chosen, rejected columns
dpo_dataset = load_dataset("HuggingFaceH4/ultrafeedback_binarized", split="train_prefs")
dpo_config = DPOConfig(
output_dir="./dpo-model",
num_train_epochs=1,
per_device_train_batch_size=2,
gradient_accumulation_steps=4,
learning_rate=5e-7,
beta=0.1, # KL penalty strength
bf16=True,
loss_type="sigmoid",
)
trainer = DPOTrainer(
model=model,
ref_model=ref_model,
args=dpo_config,
train_dataset=dpo_dataset,
processing_class=tokenizer,
)
trainer.train()
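For reference, a hand-built toy preference set with the three expected columns might look like the sketch below (the strings are made-up examples):
from datasets import Dataset

# Toy preference data with the columns DPOTrainer expects (values are illustrative)
toy_prefs = Dataset.from_dict({
    "prompt":   ["Explain what LoRA is in one sentence."],
    "chosen":   ["LoRA freezes the base weights and trains small low-rank matrices on top."],
    "rejected": ["LoRA is a long-range radio protocol."],
})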
7.3 RewardTrainer
from trl import RewardTrainer, RewardConfig
from transformers import AutoModelForSequenceClassification, AutoTokenizer
from datasets import load_dataset
model = AutoModelForSequenceClassification.from_pretrained(
"bert-base-uncased",
num_labels=1 # Scalar reward output
)
tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
dataset = load_dataset("HuggingFaceH4/ultrafeedback_binarized", split="train_prefs")
reward_config = RewardConfig(
output_dir="./reward-model",
num_train_epochs=1,
per_device_train_batch_size=8,
learning_rate=1e-5,
max_length=512,
)
trainer = RewardTrainer(
model=model,
args=reward_config,
train_dataset=dataset,
processing_class=tokenizer,
)
trainer.train()
8. Diffusers Library
8.1 Basic Image Generation
from diffusers import StableDiffusionPipeline, DPMSolverMultistepScheduler
import torch
pipe = StableDiffusionPipeline.from_pretrained(
"runwayml/stable-diffusion-v1-5",
torch_dtype=torch.float16,
use_safetensors=True
)
# Use a faster scheduler
pipe.scheduler = DPMSolverMultistepScheduler.from_config(pipe.scheduler.config)
pipe = pipe.to("cuda")
# Memory optimization
pipe.enable_attention_slicing()
pipe.enable_xformers_memory_efficient_attention()
# Generate image
image = pipe(
prompt="A beautiful mountain landscape at sunset, photorealistic, 8k",
negative_prompt="blurry, low quality, cartoon",
num_inference_steps=20,
guidance_scale=7.5,
width=512,
height=512,
generator=torch.Generator("cuda").manual_seed(42)
).images[0]
image.save("landscape.png")
# SDXL
from diffusers import StableDiffusionXLPipeline
xl_pipe = StableDiffusionXLPipeline.from_pretrained(
"stabilityai/stable-diffusion-xl-base-1.0",
torch_dtype=torch.float16,
variant="fp16"
).to("cuda")
image = xl_pipe(
prompt="A majestic mountain landscape, 4k photo",
num_inference_steps=30,
guidance_scale=5.0
).images[0]
8.2 LoRA for Diffusion
from diffusers import StableDiffusionPipeline
import torch
pipe = StableDiffusionPipeline.from_pretrained(
"runwayml/stable-diffusion-v1-5",
torch_dtype=torch.float16
).to("cuda")
pipe.load_lora_weights("path/to/lora/weights", weight_name="lora.safetensors")
pipe.fuse_lora(lora_scale=0.8)
image = pipe(
"a photo of sks person in a fantasy world",
num_inference_steps=30
).images[0]
9. Evaluate Library
9.1 Basic Metrics
import evaluate
# BLEU (translation quality)
bleu = evaluate.load("bleu")
result = bleu.compute(
predictions=["the cat is on the mat"],
references=[["the cat is on the mat", "there is a cat on the mat"]]
)
print(f"BLEU: {result['bleu']:.4f}")
# ROUGE (summarization quality)
rouge = evaluate.load("rouge")
result = rouge.compute(
predictions=["This is a very useful library."],
references=["This is an extremely useful library."]
)
print(result) # rouge1, rouge2, rougeL, rougeLsum
# BERTScore (semantic similarity)
bertscore = evaluate.load("bertscore")
result = bertscore.compute(
predictions=["AI is changing the world."],
references=["Artificial intelligence is transforming the globe."],
lang="en"
)
print(f"BERTScore F1: {result['f1'][0]:.4f}")
# Accuracy
accuracy = evaluate.load("accuracy")
result = accuracy.compute(predictions=[0, 1, 1, 0], references=[0, 1, 0, 0])
print(f"Accuracy: {result['accuracy']:.4f}")
9.2 Integration with Trainer
from transformers import Trainer, TrainingArguments
import evaluate
import numpy as np
accuracy = evaluate.load("accuracy")
def compute_metrics(eval_pred):
logits, labels = eval_pred
predictions = np.argmax(logits, axis=-1)
return accuracy.compute(predictions=predictions, references=labels)
training_args = TrainingArguments(
output_dir="./model-output",
evaluation_strategy="epoch",
save_strategy="epoch",  # must match the eval strategy when load_best_model_at_end=True
num_train_epochs=3,
per_device_train_batch_size=16,
per_device_eval_batch_size=32,
load_best_model_at_end=True,
metric_for_best_model="accuracy",
)
trainer = Trainer(
model=model,
args=training_args,
train_dataset=train_dataset,
eval_dataset=eval_dataset,
compute_metrics=compute_metrics,
)
10. Hub API and Automation
10.1 huggingface_hub Client
from huggingface_hub import HfApi, hf_hub_download, snapshot_download
from huggingface_hub import create_repo, upload_file, upload_folder
api = HfApi()
# Download a single file
local_path = hf_hub_download(
repo_id="meta-llama/Meta-Llama-3-8B",
filename="config.json"
)
# Download entire repo
snapshot_download(
repo_id="meta-llama/Meta-Llama-3-8B",
local_dir="./llama3-8b",
ignore_patterns=["*.bin", "*.pt"] # Skip .bin/.pt weights, keeping the safetensors files
)
# Create repositories
create_repo("your-username/my-model", private=True)
create_repo("your-username/my-dataset", repo_type="dataset")
create_repo("your-username/my-space", repo_type="space", space_sdk="gradio")
# Upload a file
upload_file(
path_or_fileobj="./my_model.bin",
path_in_repo="pytorch_model.bin",
repo_id="your-username/my-model"
)
# Upload a folder
upload_folder(
folder_path="./output-model",
repo_id="your-username/my-model",
ignore_patterns=["*.log", "__pycache__"]
)
# Search models
models = api.list_models(
filter="text-generation",
language="en",
sort="downloads",
limit=10
)
for m in models:
print(m.id, m.downloads)
10.2 Automatic Model Card Generation
from huggingface_hub import ModelCard, ModelCardData
card_data = ModelCardData(
language="en",
license="apache-2.0",
library_name="transformers",
tags=["llama", "instruction-tuned", "fine-tuned"],
datasets=["HuggingFaceH4/ultrachat_200k"],
base_model="meta-llama/Meta-Llama-3-8B",
metrics=[{"type": "accuracy", "value": 0.95}]
)
card = ModelCard.from_template(
card_data,
template_str="""
---
{{ card_data }}
---
# Llama-3-8B Instruct
This model is a fine-tuned version of Llama-3-8B for instruction following.
## Usage
```python
from transformers import pipeline
generator = pipeline("text-generation", model="your-username/llama3-instruct")
```

## Training Details
- Base model: meta-llama/Meta-Llama-3-8B
- Method: QLoRA (4-bit)
- Dataset: UltraChat 200k
"""
)
card.push_to_hub("your-username/llama3-instruct")
11. Optimum Library
11.1 ONNX Export
from optimum.exporters.onnx import main_export
# Convert model to ONNX format
main_export(
model_name_or_path="bert-base-uncased",
output="./bert-onnx",
task="text-classification"
)
# Load ONNX model for inference
from optimum.onnxruntime import ORTModelForSequenceClassification
from transformers import AutoTokenizer
ort_model = ORTModelForSequenceClassification.from_pretrained("./bert-onnx")
tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
inputs = tokenizer("This movie was fantastic!", return_tensors="pt")
outputs = ort_model(**inputs)
print(outputs.logits)
11.2 BetterTransformer
from optimum.bettertransformer import BetterTransformer
from transformers import AutoModelForSequenceClassification, AutoTokenizer
import torch
model = AutoModelForSequenceClassification.from_pretrained("bert-base-uncased")
tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
# Convert to BetterTransformer (leverages PyTorch 2.0+ optimizations)
model = BetterTransformer.transform(model)
model.eval()
inputs = tokenizer("This movie was fantastic!", return_tensors="pt")
with torch.no_grad():
outputs = model(**inputs)
12. Real-World Project: Fine-Tuning a Custom LLM
12.1 Preparing Your Dataset
from datasets import load_dataset, concatenate_datasets
# Load instruction dataset
alpaca = load_dataset("tatsu-lab/alpaca", split="train")
def format_instruction_dataset(example):
"""Convert Alpaca format to ChatML format"""
instruction = example["instruction"]
inp = example.get("input", "").strip()
output = example["output"]
if inp:
user_content = f"{instruction}\n\n{inp}"
else:
user_content = instruction
messages = [
{"role": "system", "content": "You are a helpful AI assistant."},
{"role": "user", "content": user_content},
{"role": "assistant", "content": output}
]
return {"messages": messages}
formatted_dataset = alpaca.map(format_instruction_dataset)
print(formatted_dataset[0])
12.2 Complete QLoRA Fine-Tuning Pipeline
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig
from peft import LoraConfig, prepare_model_for_kbit_training, get_peft_model
from trl import SFTTrainer, SFTConfig
from datasets import load_dataset
MODEL_NAME = "meta-llama/Meta-Llama-3-8B"
OUTPUT_DIR = "./llama3-qlora"
MAX_SEQ_LEN = 2048
NUM_EPOCHS = 2
BATCH_SIZE = 2
GRAD_ACCUM = 8
LR = 2e-4
bnb_config = BitsAndBytesConfig(
load_in_4bit=True,
bnb_4bit_use_double_quant=True,
bnb_4bit_quant_type="nf4",
bnb_4bit_compute_dtype=torch.bfloat16
)
tokenizer = AutoTokenizer.from_pretrained(MODEL_NAME)
tokenizer.pad_token = tokenizer.eos_token
tokenizer.padding_side = "right"
model = AutoModelForCausalLM.from_pretrained(
MODEL_NAME,
quantization_config=bnb_config,
device_map="auto",
attn_implementation="flash_attention_2"
)
model = prepare_model_for_kbit_training(model)
peft_config = LoraConfig(
r=64,
lora_alpha=128,
target_modules=[
"q_proj", "k_proj", "v_proj", "o_proj",
"gate_proj", "up_proj", "down_proj"
],
lora_dropout=0.05,
bias="none",
task_type="CAUSAL_LM"
)
dataset = load_dataset("tatsu-lab/alpaca", split="train")
def format_sample(example):
messages = [
{"role": "system", "content": "You are a helpful AI assistant."},
{"role": "user", "content": example["instruction"]},
{"role": "assistant", "content": example["output"]}
]
# Note: the base Llama-3-8B tokenizer ships without a chat template;
# set tokenizer.chat_template (or load the tokenizer from the Instruct variant) before this call.
text = tokenizer.apply_chat_template(messages, tokenize=False, add_generation_prompt=False)
return {"text": text}
dataset = dataset.map(format_sample)
training_args = SFTConfig(
output_dir=OUTPUT_DIR,
num_train_epochs=NUM_EPOCHS,
per_device_train_batch_size=BATCH_SIZE,
gradient_accumulation_steps=GRAD_ACCUM,
learning_rate=LR,
bf16=True,
gradient_checkpointing=True,
optim="paged_adamw_8bit",
warmup_ratio=0.05,
lr_scheduler_type="cosine",
save_strategy="epoch",
logging_steps=10,
dataset_text_field="text",
max_seq_length=MAX_SEQ_LEN,
packing=True,
)
trainer = SFTTrainer(
model=model,
train_dataset=dataset,
peft_config=peft_config,
args=training_args,
processing_class=tokenizer,
)
trainer.train()
trainer.model.save_pretrained(f"{OUTPUT_DIR}/adapter")
tokenizer.save_pretrained(f"{OUTPUT_DIR}/adapter")
print("Training complete!")
12.3 Evaluation and Inference
from peft import PeftModel
from transformers import AutoModelForCausalLM, AutoTokenizer
import torch
base_model = AutoModelForCausalLM.from_pretrained(
"meta-llama/Meta-Llama-3-8B",
torch_dtype=torch.bfloat16,
device_map="auto"
)
model = PeftModel.from_pretrained(base_model, "./llama3-qlora/adapter")
tokenizer = AutoTokenizer.from_pretrained("meta-llama/Meta-Llama-3-8B")
def chat(user_message: str, max_new_tokens: int = 512) -> str:
messages = [
{"role": "system", "content": "You are a helpful AI assistant."},
{"role": "user", "content": user_message}
]
prompt = tokenizer.apply_chat_template(
messages,
tokenize=False,
add_generation_prompt=True
)
inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
with torch.no_grad():
outputs = model.generate(
**inputs,
max_new_tokens=max_new_tokens,
do_sample=True,
temperature=0.7,
top_p=0.9,
repetition_penalty=1.1
)
return tokenizer.decode(
outputs[0][inputs["input_ids"].shape[1]:],
skip_special_tokens=True
)
test_questions = [
"Write a Python function to check if a number is prime.",
"Explain the difference between machine learning and deep learning.",
"What are the main benefits of using HuggingFace?",
]
for q in test_questions:
print(f"Q: {q}")
print(f"A: {chat(q)}")
print("-" * 50)
12.4 Deploy with Gradio Spaces
import gradio as gr
from transformers import pipeline
import torch
pipe = pipeline(
"text-generation",
model="./llama3-qlora/merged",
torch_dtype=torch.bfloat16,
device_map="auto"
)
def respond(message, history, system_prompt, max_tokens, temperature):
messages = [{"role": "system", "content": system_prompt}]
for user, assistant in history:
messages.append({"role": "user", "content": user})
messages.append({"role": "assistant", "content": assistant})
messages.append({"role": "user", "content": message})
output = pipe(
messages,
max_new_tokens=max_tokens,
temperature=temperature,
do_sample=True
)
return output[0]["generated_text"][-1]["content"]
demo = gr.ChatInterface(
respond,
additional_inputs=[
gr.Textbox("You are a helpful AI assistant.", label="System Prompt"),
gr.Slider(50, 1000, 256, label="Max Tokens"),
gr.Slider(0.1, 2.0, 0.7, label="Temperature")
],
title="Llama-3 Chatbot",
description="QLoRA fine-tuned LLM"
)
if __name__ == "__main__":
demo.launch()
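To publish this demo as a Space, one option is to reuse the Hub API from Section 10; the repo name and folder layout below are just placeholders:
from huggingface_hub import create_repo, upload_folder

create_repo("your-username/llama3-chat-demo", repo_type="space", space_sdk="gradio", exist_ok=True)
upload_folder(
    folder_path="./spaces-app",  # contains app.py and requirements.txt
    repo_id="your-username/llama3-chat-demo",
    repo_type="space",
)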
Conclusion
In this guide, we covered the core components of the HuggingFace ecosystem:
- transformers: Multiple abstraction levels from pipelines to fine-grained model control
- datasets: Efficient data loading, processing, and sharing
- tokenizers: Ultra-fast Rust-based tokenization
- peft: Fine-tune large models on consumer GPUs with LoRA/QLoRA
- accelerate: Multi-GPU/TPU training without code changes
- trl: Modern fine-tuning techniques — SFT, DPO, RLHF
- diffusers: Standard library for image generation models
- evaluate: Standardized model evaluation
The HuggingFace ecosystem evolves rapidly, so regularly checking the official documentation and blog is highly recommended.
References
- HuggingFace Docs: https://huggingface.co/docs
- Transformers Docs: https://huggingface.co/docs/transformers
- PEFT Docs: https://huggingface.co/docs/peft
- Accelerate Docs: https://huggingface.co/docs/accelerate
- TRL Docs: https://huggingface.co/docs/trl
- Diffusers Docs: https://huggingface.co/docs/diffusers