HuggingFace Usage

Purpose

HuggingFace provides the transformers, datasets, tokenizers, and PEFT libraries that together form the dominant Python ecosystem for working with pretrained language models. This note covers the core usage patterns: loading models from the Hub, inference pipelines, custom training, and dataset manipulation. For fine-tuning strategies see PEFT and LoRA and Finetuning Strategies.

Architecture

HuggingFace Hub (model/dataset registry)
        │
        ▼
transformers (AutoTokenizer, AutoModel, Pipeline, Trainer)
datasets    (Dataset, DatasetDict, streaming)
tokenizers  (fast Rust tokenizers backing transformers)
peft        (LoRA, QLoRA, adapter injection into any model)
trl         (SFTTrainer, DPOTrainer, reward model training)
accelerate  (distributed training wrapper)

Models are identified by model IDs ("meta-llama/Llama-3-8B", "bert-base-uncased") and loaded via from_pretrained. Weights are cached under ~/.cache/huggingface/hub/.
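The cache layout is easy to inspect: `huggingface_hub` names each cache folder by prefixing the repo type and replacing `/` in the repo ID with `--`. A small sketch of that convention (the `cache_folder_name` helper is illustrative, not a `huggingface_hub` API):

```python
from pathlib import Path

def cache_folder_name(repo_id: str, repo_type: str = "model") -> str:
    # Hub cache folders follow "<type>s--<org>--<name>", e.g.
    # "models--bert-base-uncased" or "models--meta-llama--Llama-3-8B"
    return f"{repo_type}s--" + repo_id.replace("/", "--")

cache_root = Path.home() / ".cache" / "huggingface" / "hub"
print(cache_root / cache_folder_name("meta-llama/Llama-3-8B"))
```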

Implementation Notes

Hub and Model Loading

from transformers import AutoTokenizer, AutoModelForCausalLM
import torch
 
model_id = "meta-llama/Llama-3-8B-Instruct"
 
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    torch_dtype=torch.bfloat16,   # use bf16 on Ampere+
    device_map="auto",             # automatically split across available GPUs
    attn_implementation="flash_attention_2",  # optional speed-up
)

Key from_pretrained kwargs:

| Kwarg | Purpose |
| --- | --- |
| torch_dtype=torch.bfloat16 | Half-precision weights (~2× memory reduction) |
| device_map="auto" | Multi-GPU / CPU offload via accelerate device map |
| load_in_8bit=True | INT8 quantisation via bitsandbytes |
| load_in_4bit=True | NF4 quantisation for QLoRA |
| trust_remote_code=True | Allow custom model code from Hub (use carefully) |

Pipeline API — Quick Inference

import torch
from transformers import pipeline
 
# Text generation
gen = pipeline("text-generation", model="meta-llama/Llama-3-8B-Instruct",
               device_map="auto", torch_dtype=torch.bfloat16)
out = gen("Summarise this: ...", max_new_tokens=256, do_sample=True, temperature=0.7)
 
# Classification
clf = pipeline("text-classification", model="distilbert-base-uncased-finetuned-sst-2-english")
print(clf("I really loved the movie!"))
# → [{'label': 'POSITIVE', 'score': 0.9998}]
 
# NER
ner = pipeline("ner", model="Jean-Baptiste/roberta-large-ner-english", aggregation_strategy="simple")
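Pipelines also accept a list of inputs and can batch them internally via batch_size; outputs come back in input order. A small sketch reusing the sentiment model above (downloads the model on first run):

```python
from transformers import pipeline

clf = pipeline("text-classification",
               model="distilbert-base-uncased-finetuned-sst-2-english")

texts = ["I really loved the movie!", "The plot made no sense at all."]
# batch_size controls internal batching; one result dict per input
results = clf(texts, batch_size=2)
for text, res in zip(texts, results):
    print(f"{res['label']:>8} {res['score']:.3f}  {text}")
```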

Chat / Instruction Models

messages = [
    {"role": "system", "content": "You are a helpful assistant."},
    {"role": "user",   "content": "What is the capital of France?"},
]
 
# Apply the model's chat template
input_ids = tokenizer.apply_chat_template(messages, tokenize=True,
                                          add_generation_prompt=True,
                                          return_tensors="pt").to(model.device)
 
outputs = model.generate(input_ids, max_new_tokens=200, do_sample=False)
reply = tokenizer.decode(outputs[0][input_ids.shape[-1]:], skip_special_tokens=True)
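The kwargs passed to generate can also be bundled into a GenerationConfig object, which can be handed to generate(generation_config=...) or saved alongside the model. A brief sketch (repetition_penalty is an illustrative extra, not part of the example above):

```python
from transformers import GenerationConfig

gen_config = GenerationConfig(
    max_new_tokens=200,
    do_sample=False,          # greedy decoding, as in the chat example above
    repetition_penalty=1.1,
)

# Then: outputs = model.generate(input_ids, generation_config=gen_config)
```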

Datasets Library

from datasets import load_dataset, Dataset
 
# Load from Hub
ds = load_dataset("HuggingFaceH4/ultrachat_200k")
train = ds["train_sft"]
 
# Build from in-memory Python objects, or load local files
ds = Dataset.from_dict({"text": texts, "label": labels})
ds = load_dataset("json", data_files="data.jsonl", split="train")
 
# Preprocessing
def tokenize(batch):
    return tokenizer(batch["text"], truncation=True, max_length=512)
 
tokenized = train.map(tokenize, batched=True, remove_columns=["text"])
tokenized = tokenized.with_format("torch")  # returns torch tensors
 
# Streaming (avoids downloading full dataset)
streaming_ds = load_dataset("allenai/c4", "en", split="train", streaming=True)
for sample in streaming_ds.take(10):
    print(sample["text"][:100])

Trainer API

from transformers import TrainingArguments, Trainer, DataCollatorWithPadding
 
args = TrainingArguments(
    output_dir="./output",
    num_train_epochs=3,
    per_device_train_batch_size=8,
    gradient_accumulation_steps=4,
    learning_rate=2e-5,
    fp16=True,                        # or bf16=True on Ampere
    evaluation_strategy="epoch",
    save_strategy="epoch",
    load_best_model_at_end=True,
    logging_steps=50,
    report_to="wandb",                # or "mlflow", "tensorboard"
)
 
trainer = Trainer(
    model=model,
    args=args,
    train_dataset=tokenized_train,
    eval_dataset=tokenized_eval,
    tokenizer=tokenizer,
    data_collator=DataCollatorWithPadding(tokenizer),
)
trainer.train()
trainer.save_model("./best_model")
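Trainer reports eval metrics through an optional compute_metrics callback, which receives an EvalPrediction exposing .predictions (logits) and .label_ids. A minimal accuracy sketch using only NumPy:

```python
import numpy as np

def compute_metrics(eval_pred):
    # eval_pred.predictions: (batch, num_labels) logits
    # eval_pred.label_ids:   (batch,) gold labels
    logits, labels = eval_pred.predictions, eval_pred.label_ids
    preds = np.argmax(logits, axis=-1)
    return {"accuracy": float((preds == labels).mean())}

# Wire it in with: Trainer(..., compute_metrics=compute_metrics)
```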

Pushing to the Hub

from huggingface_hub import HfApi
 
# After training
model.push_to_hub("myorg/my-finetuned-model")
tokenizer.push_to_hub("myorg/my-finetuned-model")
 
# Or via Trainer
trainer.push_to_hub()

Trade-offs

| Pattern | Pro | Con |
| --- | --- | --- |
| device_map="auto" | Trivial multi-GPU/CPU offload | Inference only; training needs accelerate |
| Pipeline API | Simplest interface | Less control; no easy batching customisation |
| Trainer | Batteries included | Opaque; hard to debug custom training logic |
| Streaming datasets | No disk space required | Slower; no random access |
| trust_remote_code=True | Required for some models | Arbitrary code execution risk |

References