Completetinymodelraven Top [BEST]

import torch
from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig

quant_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_compute_dtype=torch.float16,
    bnb_4bit_use_double_quant=True,
)
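To give a feel for what 4-bit loading saves, here is a rough back-of-the-envelope sketch. The helper name `weight_memory_gb` is hypothetical, and the estimate covers weight storage only: it ignores quantization constants, activations, and the KV cache, so real memory use will be higher.

```python
def weight_memory_gb(num_params: float, bits_per_weight: int) -> float:
    """Approximate weight storage in GB (10^9 bytes), ignoring all overhead."""
    return num_params * bits_per_weight / 8 / 1e9


# Illustrative parameter count: a 1.1B-parameter model.
fp16_gb = weight_memory_gb(1.1e9, 16)
int4_gb = weight_memory_gb(1.1e9, 4)
print(f"fp16: {fp16_gb:.2f} GB, 4-bit: {int4_gb:.2f} GB")
```

Under these assumptions, the weights shrink by a factor of four when going from 16-bit to 4-bit storage.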

| Metric | TinyLlama (1.1B) | Phi-1.5 (1.3B) | |
| :--- | :--- | :--- | :--- |
| HellaSwag (0-shot) | 59.2 | 60.1 | 58.4 |
| PIQA (0-shot) | 73.5 | 74.0 | 72.1 |
| Inference RAM | 2.2 GB | 2.5 GB | 210 MB |
| First Token Latency (CPU) | 1.2 s | 1.4 s | 0.09 s |
| Tokens per second | 12 | 11 | 45 |
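To put the table's RAM and throughput figures side by side, the small sketch below turns them into ratios. The numbers are copied from the table; the label `"third column"` is a placeholder because that column has no header in the source, and the conversion assumes 1 GB = 1000 MB.

```python
# Values copied from the comparison table; the third column is unnamed in the source.
ram_mb = {"TinyLlama (1.1B)": 2200, "Phi-1.5 (1.3B)": 2500, "third column": 210}
tokens_per_second = {"TinyLlama (1.1B)": 12, "Phi-1.5 (1.3B)": 11, "third column": 45}


def ratio_to_baseline(values, baseline):
    """Divide every entry by the baseline entry's value."""
    return {name: value / values[baseline] for name, value in values.items()}


# RAM relative to the smallest footprint, throughput relative to TinyLlama.
ram_ratio = ratio_to_baseline(ram_mb, "third column")
speed_ratio = ratio_to_baseline(tokens_per_second, "TinyLlama (1.1B)")

for name in ram_mb:
    print(f"{name}: {ram_ratio[name]:.1f}x RAM, {speed_ratio[name]:.2f}x throughput")
```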

To fine-tune for a specific domain (e.g., medical Q&A or legal text):

import torch
from transformers import AutoModelForCausalLM

pip install transformers[torch] accelerate bitsandbytes

Here is a standard script to get you started:

model = get_peft_model(model, lora_config)
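To see why wrapping a model with LoRA adapters keeps fine-tuning cheap, here is a minimal sketch of the parameter arithmetic. For one adapted weight matrix of shape `d_out × d_in`, LoRA trains two low-rank matrices of shapes `rank × d_in` and `d_out × rank`. The helper name and the dimensions (a 2048 × 2048 projection with rank 8) are illustrative assumptions, not values taken from the script above.

```python
def lora_trainable_params(d_in: int, d_out: int, rank: int) -> int:
    """Trainable parameters LoRA adds for one d_out x d_in weight matrix."""
    # A has shape (rank, d_in) and B has shape (d_out, rank).
    return rank * d_in + d_out * rank


# Hypothetical layer size and rank, chosen only for illustration.
d_in, d_out, rank = 2048, 2048, 8
full_params = d_in * d_out
lora_params = lora_trainable_params(d_in, d_out, rank)
fraction = lora_params / full_params
print(f"full: {full_params}, LoRA: {lora_params}, fraction: {fraction:.4%}")
```

With these assumed sizes, the adapter trains well under 1% of the parameters in the layer it wraps.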

outputs = model.generate(
    **inputs,
    max_new_tokens=100,
    do_sample=True,
    top_k=50,
    top_p=0.95,
    temperature=0.7,
)
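To build intuition for what `temperature`, `top_k`, and `top_p` do to the next-token distribution, here is a simplified pure-Python sketch. The function `filter_distribution` is hypothetical and is not the library's implementation; it assumes the filters are applied in the order temperature, top-k, top-p, and it operates on a plain list of logits.

```python
import math


def filter_distribution(logits, temperature=0.7, top_k=50, top_p=0.95):
    """Return {token_index: probability} after temperature, top-k, and top-p filtering."""
    # Temperature scaling followed by a numerically stable softmax.
    scaled = [x / temperature for x in logits]
    peak = max(scaled)
    weights = [math.exp(x - peak) for x in scaled]

    # Top-k: keep the k highest-weight tokens, then renormalize.
    ranked = sorted(range(len(weights)), key=lambda i: weights[i], reverse=True)[:top_k]
    total = sum(weights[i] for i in ranked)
    probs = {i: weights[i] / total for i in ranked}

    # Top-p: keep the smallest prefix whose cumulative probability reaches top_p.
    kept, cumulative = [], 0.0
    for i in ranked:
        kept.append(i)
        cumulative += probs[i]
        if cumulative >= top_p:
            break

    # Renormalize so the surviving probabilities sum to 1.
    mass = sum(probs[i] for i in kept)
    return {i: probs[i] / mass for i in kept}


# Toy logits for a five-token vocabulary (made-up values).
dist = filter_distribution([2.0, 1.0, 0.5, -1.0, -3.0], temperature=0.7, top_k=3, top_p=0.9)
print(dist)
```

Lower temperatures sharpen the distribution before filtering, so fewer tokens survive the top-p cutoff. Sampling then draws from the tokens that remain.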