HuggingFace Usage

Purpose

HuggingFace provides the transformers, datasets, tokenizers, and PEFT libraries that together form the dominant Python ecosystem for working with pretrained language models. This note covers the core usage patterns: loading models from the Hub, inference pipelines, custom training, and dataset manipulation. For fine-tuning strategies see PEFT and LoRA and Finetuning Strategies.

Architecture

HuggingFace Hub (model/dataset registry)
        │
        ▼
transformers (AutoTokenizer, AutoModel, Pipeline, Trainer)
datasets    (Dataset, DatasetDict, streaming)
tokenizers  (fast Rust tokenizers backing transformers)
peft        (LoRA, QLoRA, adapter injection into any model)
trl         (SFTTrainer, DPOTrainer, reward model training)
accelerate  (distributed training wrapper)

Models are identified by model IDs ("meta-llama/Llama-3-8B", "bert-base-uncased") and loaded via from_pretrained. Weights are cached under ~/.cache/huggingface/hub/.
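The cache layout is easy to inspect: `huggingface_hub` names each cache folder by prefixing the repo type and replacing `/` in the repo ID with `--`. A small sketch of that convention (the `cache_folder_name` helper is illustrative, not a `huggingface_hub` API):

```python
from pathlib import Path

def cache_folder_name(repo_id: str, repo_type: str = "model") -> str:
    # Hub cache folders follow "<type>s--<org>--<name>", e.g.
    # "models--bert-base-uncased" or "models--meta-llama--Llama-3-8B"
    return f"{repo_type}s--" + repo_id.replace("/", "--")

cache_root = Path.home() / ".cache" / "huggingface" / "hub"
print(cache_root / cache_folder_name("meta-llama/Llama-3-8B"))
```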

Implementation Notes

Hub and Model Loading

from transformers import AutoTokenizer, AutoModelForCausalLM
import torch
 
model_id = "meta-llama/Llama-3-8B-Instruct"
 
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    torch_dtype=torch.bfloat16,   # use bf16 on Ampere+
    device_map="auto",             # automatically split across available GPUs
    attn_implementation="flash_attention_2",  # optional speed-up
)

Key from_pretrained kwargs:

| Kwarg | Purpose |
| --- | --- |
| torch_dtype=torch.bfloat16 | Half-precision weights (~2× memory reduction) |
| device_map="auto" | Multi-GPU / CPU offload via accelerate device map |
| load_in_8bit=True | INT8 quantisation via bitsandbytes |
| load_in_4bit=True | NF4 quantisation for QLoRA |
| trust_remote_code=True | Allow custom model code from Hub (use carefully) |

Pipeline API — Quick Inference

import torch
from transformers import pipeline
 
# Text generation
gen = pipeline("text-generation", model="meta-llama/Llama-3-8B-Instruct",
               device_map="auto", torch_dtype=torch.bfloat16)
out = gen("Summarise this: ...", max_new_tokens=256, do_sample=True, temperature=0.7)
 
# Classification
clf = pipeline("text-classification", model="distilbert-base-uncased-finetuned-sst-2-english")
print(clf("I really loved the movie!"))
# → [{'label': 'POSITIVE', 'score': 0.9998}]
 
# NER
ner = pipeline("ner", model="Jean-Baptiste/roberta-large-ner-english", aggregation_strategy="simple")
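Pipelines also accept a list of inputs and can batch them internally via batch_size; outputs come back in input order. A small sketch reusing the sentiment model above (downloads the model on first run):

```python
from transformers import pipeline

clf = pipeline("text-classification",
               model="distilbert-base-uncased-finetuned-sst-2-english")

texts = ["I really loved the movie!", "The plot made no sense at all."]
# batch_size controls internal batching; one result dict per input
results = clf(texts, batch_size=2)
for text, res in zip(texts, results):
    print(f"{res['label']:>8} {res['score']:.3f}  {text}")
```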

Chat / Instruction Models

messages = [
    {"role": "system", "content": "You are a helpful assistant."},
    {"role": "user",   "content": "What is the capital of France?"},
]
 
# Apply the model's chat template
input_ids = tokenizer.apply_chat_template(messages, tokenize=True,
                                          add_generation_prompt=True,
                                          return_tensors="pt").to(model.device)
 
outputs = model.generate(input_ids, max_new_tokens=200, do_sample=False)
reply = tokenizer.decode(outputs[0][input_ids.shape[-1]:], skip_special_tokens=True)
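The kwargs passed to generate can also be bundled into a GenerationConfig object, which can be handed to generate(generation_config=...) or saved alongside the model. A brief sketch (repetition_penalty is an illustrative extra, not part of the example above):

```python
from transformers import GenerationConfig

gen_config = GenerationConfig(
    max_new_tokens=200,
    do_sample=False,          # greedy decoding, as in the chat example above
    repetition_penalty=1.1,
)

# Then: outputs = model.generate(input_ids, generation_config=gen_config)
```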

Datasets Library

from datasets import load_dataset, Dataset
 
# Load from Hub
ds = load_dataset("HuggingFaceH4/ultrachat_200k")
train = ds["train_sft"]
 
# Build from in-memory Python objects, or load local files
ds = Dataset.from_dict({"text": texts, "label": labels})
ds = load_dataset("json", data_files="data.jsonl", split="train")
 
# Preprocessing
def tokenize(batch):
    return tokenizer(batch["text"], truncation=True, max_length=512)
 
tokenized = train.map(tokenize, batched=True, remove_columns=["text"])
tokenized = tokenized.with_format("torch")  # returns torch tensors
 
# Streaming (avoids downloading full dataset)
streaming_ds = load_dataset("allenai/c4", "en", split="train", streaming=True)
for sample in streaming_ds.take(10):
    print(sample["text"][:100])

Trainer API

from transformers import TrainingArguments, Trainer, DataCollatorWithPadding
 
args = TrainingArguments(
    output_dir="./output",
    num_train_epochs=3,
    per_device_train_batch_size=8,
    gradient_accumulation_steps=4,
    learning_rate=2e-5,
    fp16=True,                        # or bf16=True on Ampere
    evaluation_strategy="epoch",
    save_strategy="epoch",
    load_best_model_at_end=True,
    logging_steps=50,
    report_to="wandb",                # or "mlflow", "tensorboard"
)
 
trainer = Trainer(
    model=model,
    args=args,
    train_dataset=tokenized_train,
    eval_dataset=tokenized_eval,
    tokenizer=tokenizer,
    data_collator=DataCollatorWithPadding(tokenizer),
)
trainer.train()
trainer.save_model("./best_model")
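Trainer reports eval metrics through an optional compute_metrics callback, which receives an EvalPrediction exposing .predictions (logits) and .label_ids. A minimal accuracy sketch using only NumPy:

```python
import numpy as np

def compute_metrics(eval_pred):
    # eval_pred.predictions: (batch, num_labels) logits
    # eval_pred.label_ids:   (batch,) gold labels
    logits, labels = eval_pred.predictions, eval_pred.label_ids
    preds = np.argmax(logits, axis=-1)
    return {"accuracy": float((preds == labels).mean())}

# Wire it in with: Trainer(..., compute_metrics=compute_metrics)
```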

Pushing to the Hub

from huggingface_hub import HfApi
 
# After training
model.push_to_hub("myorg/my-finetuned-model")
tokenizer.push_to_hub("myorg/my-finetuned-model")
 
# Or via Trainer
trainer.push_to_hub()

Trade-offs

| Pattern | Pro | Con |
| --- | --- | --- |
| device_map="auto" | Trivial multi-GPU/CPU offload | Inference only; training needs accelerate |
| Pipeline API | Simplest interface | Less control; no easy batching customisation |
| Trainer | Batteries included | Opaque; hard to debug custom training logic |
| Streaming datasets | No disk space required | Slower; no random access |
| trust_remote_code=True | Required for some models | Arbitrary code execution risk |

References