CompleteTinyModelRaven Top

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig

# 4-bit quantization with FP16 compute and double quantization enabled
quant_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_compute_dtype=torch.float16,
    bnb_4bit_use_double_quant=True,
)
```

| Metric | TinyLlama (1.1B) | Phi-1.5 (1.3B) | CompleteTinyModelRaven Top (187M) |
| :--- | :--- | :--- | :--- |
| HellaSwag (0-shot) | 59.2 | 60.1 | 58.4 |
| PIQA (0-shot) | 73.5 | 74.0 | 72.1 |
| Inference RAM | 2.2 GB | 2.5 GB | 210 MB |
| First Token Latency (CPU) | 1.2 s | 1.4 s | 0.09 s |
| Tokens per second | 12 | 11 | 45 |
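The Inference RAM row can be sanity-checked with simple arithmetic: weight storage is roughly the parameter count times the bytes per parameter. The sketch below is a weights-only estimate. The table does not state which precision each figure was measured at, so the precisions paired with each model here are assumptions, and the 187M parameter count is taken from the specification table.

```python
# Rough weights-only memory estimate: parameters x bits per parameter.
# Assumption: ignores activations, KV cache, and framework overhead,
# so measured RAM will be higher than these figures.

BITS_PER_PARAM = {"fp32": 32, "fp16": 16, "int8": 8, "int4": 4}

def weight_memory_mb(num_params: float, precision: str) -> float:
    """Return approximate weight storage in megabytes (1 MB = 10**6 bytes)."""
    bits = BITS_PER_PARAM[precision]
    return num_params * bits / 8 / 1e6

# Precisions below are assumed, not taken from the table.
for name, params, precision in [
    ("TinyLlama (1.1B)", 1.1e9, "fp16"),
    ("Phi-1.5 (1.3B)", 1.3e9, "fp16"),
    ("CompleteTinyModelRaven Top (187M)", 187e6, "int4"),
]:
    print(f"{name} @ {precision}: {weight_memory_mb(params, precision):,.1f} MB")
```

Under these assumptions TinyLlama's FP16 weights come to about 2,200 MB, in line with the 2.2 GB listed, while 187M parameters at INT4 need only about 93.5 MB for weights, which suggests the 210 MB figure includes runtime overhead beyond the weights themselves.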

To fine-tune for a specific domain (e.g., medical Q&A or legal text):

```python
import torch
from transformers import AutoModelForCausalLM
```

```shell
# Quotes keep shells such as zsh from treating the brackets as a glob pattern
pip install "transformers[torch]" accelerate bitsandbytes
```

Here is a standard script to get you started:

```python
from peft import get_peft_model  # `model` and `lora_config` must already be defined
model = get_peft_model(model, lora_config)
```
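To give a sense of why wrapping the model with LoRA keeps fine-tuning cheap, the sketch below counts the trainable parameters that LoRA adapters would add. It uses the 12 layers, 768 hidden size, and 187 million parameters from the specification table. The rank of 8 and the choice of two square 768×768 projections per layer (for example query and value) are illustrative assumptions, since the original `lora_config` is not shown.

```python
def lora_trainable_params(d_in, d_out, rank, matrices_per_layer, num_layers):
    """Count parameters added by LoRA adapters across the whole model."""
    # Each adapted weight W (d_out x d_in) gets two low-rank factors:
    # A with shape (rank x d_in) and B with shape (d_out x rank).
    per_matrix = rank * d_in + d_out * rank
    return per_matrix * matrices_per_layer * num_layers

# Hypothetical adapter settings: rank 8, two projections per layer.
added = lora_trainable_params(d_in=768, d_out=768, rank=8,
                              matrices_per_layer=2, num_layers=12)
total = 187_000_000  # base model size from the specification table

print(f"LoRA trainable parameters: {added:,}")
print(f"Share of base model: {added / total:.3%}")
```

With these assumed settings the adapters add 294,912 trainable parameters, roughly 0.16% of the base model, so only a small fraction of the weights needs gradients and optimizer state.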

| Parameter | Value |
| :--- | :--- |
| Total Parameters | 187 Million |
| Layers | 12 (with Top-layer skipping enabled) |
| Hidden Size | 768 |
| Attention Heads | 12 |
| Context Length | 8,192 tokens |
| Vocabulary Size | 32,000 (Byte-Pair Encoding) |
| Quantization Support | FP32, FP16, INT8, INT4 |
| Inference RAM (INT4) | ~210 MB |
| Max Generation Speed (CPU) | 45 tokens/sec (Apple M2) |

How to Implement the CompleteTinyModelRaven Top

Implementing this model is straightforward, but leveraging the "Top" features requires specific flags.

Step 1: Installation

Ensure you have `transformers` version 4.36.0 or later, as the Raven architecture is not supported in earlier builds.
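A quick way to confirm the version requirement before loading anything is to compare the installed package version against the stated minimum. The following is a minimal sketch using only the standard library. The helper names `release_tuple` and `meets_minimum` are made up for illustration, and pre-release suffixes such as `.dev0` are simply ignored.

```python
import re
from importlib import metadata

MIN_TRANSFORMERS = "4.36.0"  # minimum stated in the installation step

def release_tuple(version: str) -> tuple:
    """Extract the leading numeric release, e.g. '4.36.0.dev0' -> (4, 36, 0)."""
    match = re.match(r"\d+(?:\.\d+)*", version)
    if match is None:
        raise ValueError(f"unrecognized version string: {version!r}")
    return tuple(int(part) for part in match.group().split("."))

def meets_minimum(installed: str, minimum: str = MIN_TRANSFORMERS) -> bool:
    """Return True if `installed` is at least `minimum` (suffixes ignored)."""
    a, b = release_tuple(installed), release_tuple(minimum)
    # Pad the shorter tuple with zeros so '4.36' compares equal to '4.36.0'.
    width = max(len(a), len(b))
    a += (0,) * (width - len(a))
    b += (0,) * (width - len(b))
    return a >= b

try:
    installed = metadata.version("transformers")
except metadata.PackageNotFoundError:
    print("transformers is not installed")
else:
    status = "OK" if meets_minimum(installed) else f"too old, need >= {MIN_TRANSFORMERS}"
    print(f"transformers {installed}: {status}")
```

Running this at the top of a setup script gives a clear message up front instead of an architecture-loading error later.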

```python
outputs = model.generate(
    **inputs,
    max_new_tokens=100,
    do_sample=True,
    top_k=50,
    top_p=0.95,
    temperature=0.7,
)
```
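The three sampling arguments above work together: the logits are typically divided by `temperature` first, then restricted to the `top_k` highest-scoring tokens, then to the smallest set whose cumulative probability reaches `top_p`. The following is a simplified pure-Python sketch of that filtering on a toy distribution. It illustrates the idea rather than reproducing the library's internals, and the logits and the function name `filter_probs` are made up.

```python
import math
import random

def filter_probs(logits, temperature=0.7, top_k=50, top_p=0.95):
    """Return {token_index: probability} after temperature, top-k, and top-p."""
    # 1. Temperature: values below 1.0 sharpen the distribution.
    scaled = [x / temperature for x in logits]
    # 2. Top-k: keep the indices of the k highest-scoring tokens.
    ranked = sorted(range(len(scaled)), key=lambda i: scaled[i], reverse=True)[:top_k]
    # Softmax over the survivors (subtract the max for numerical stability).
    peak = scaled[ranked[0]]
    weights = [math.exp(scaled[i] - peak) for i in ranked]
    total = sum(weights)
    probs = [w / total for w in weights]
    # 3. Top-p: keep the smallest prefix whose cumulative probability reaches top_p.
    kept, cumulative = [], 0.0
    for idx, p in zip(ranked, probs):
        kept.append((idx, p))
        cumulative += p
        if cumulative >= top_p:
            break
    # Renormalize so the kept probabilities sum to 1.
    norm = sum(p for _, p in kept)
    return {idx: p / norm for idx, p in kept}

toy_logits = [2.0, 1.0, 0.5, -1.0, -3.0]  # made-up scores for a 5-token vocabulary
dist = filter_probs(toy_logits, temperature=0.7, top_k=3, top_p=0.95)
next_token = random.choices(list(dist), weights=list(dist.values()), k=1)[0]
print(dist, next_token)
```

Lowering `top_p` or `top_k` shrinks the candidate set and makes output more conservative, while raising `temperature` flattens the distribution and increases variety.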