Hunyuan OCR: End-to-End OCR Expert Vision-Language Model

This guide covers installing and setting up Hunyuan OCR for both vLLM and Transformers deployment paths. Choose the method that best fits your use case and hardware configuration.

Quick Start with Transformers

Installation

Install the required Transformers package. Note that Hunyuan OCR support may require a specific version or branch of Transformers.

pip install git+https://github.com/huggingface/transformers@82a06db03535c49aa987719ed0746a76093b1ec4

Note: Hunyuan OCR support is expected to be merged into the Transformers main branch later; until then, install from the pinned commit above. Check the official repository for the latest installation instructions.
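Because support depends on a specific Transformers build, it can help to confirm that the installed package exposes the model class before loading any weights. The following is a minimal sketch using only the standard library; has_symbol is a hypothetical helper introduced here for illustration, not part of Transformers.

```python
import importlib


def has_symbol(module_name: str, attr: str) -> bool:
    """Return True if `module_name` can be imported and exposes `attr`."""
    try:
        module = importlib.import_module(module_name)
    except ImportError:
        # The package is not installed (or fails to import) in this environment.
        return False
    return hasattr(module, attr)


if __name__ == "__main__":
    # HunYuanVLForConditionalGeneration is the class imported in the inference example.
    ok = has_symbol("transformers", "HunYuanVLForConditionalGeneration")
    print("Hunyuan OCR support available:", ok)
```

If this prints False, the installed Transformers build probably predates Hunyuan OCR support and the pinned commit should be reinstalled.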

Model Inference

Here is a complete example of using Hunyuan OCR with Transformers:

from transformers import AutoProcessor
from transformers import HunYuanVLForConditionalGeneration
from PIL import Image
import torch

def clean_repeated_substrings(text):
    """Clean repeated substrings in text"""
    n = len(text)
    if n < 8000:
        return text
    for length in range(2, n // 10 + 1):
        candidate = text[-length:] 
        count = 0
        i = n - length
        
        while i >= 0 and text[i:i + length] == candidate:
            count += 1
            i -= length
        if count >= 10:
            return text[:n - length * (count - 1)]  
    return text

model_name_or_path = "tencent/HunyuanOCR"
processor = AutoProcessor.from_pretrained(model_name_or_path, use_fast=False)
img_path = "path/to/your/image.jpg"
image_inputs = Image.open(img_path)

messages1 = [
    {"role": "system", "content": ""},
    {
        "role": "user",
        "content": [
            {"type": "image", "image": img_path},
            {"type": "text", "text": (
                "检测并识别图片中的文字，将文本坐标格式化输出。"
            )},
        ],
    }
]

messages = [messages1]

texts = [
    processor.apply_chat_template(msg, tokenize=False, add_generation_prompt=True)
    for msg in messages
]

inputs = processor(
    text=texts,
    images=image_inputs,
    padding=True,
    return_tensors="pt",
)

model = HunYuanVLForConditionalGeneration.from_pretrained(
    model_name_or_path,
    attn_implementation="eager",
    dtype=torch.bfloat16,
    device_map="auto"
)

with torch.no_grad():
    device = next(model.parameters()).device
    inputs = inputs.to(device)
    generated_ids = model.generate(**inputs, max_new_tokens=16384, do_sample=False)

if "input_ids" in inputs:
    input_ids = inputs.input_ids
else:
    print("inputs: # fallback", inputs)
    input_ids = inputs.inputs

generated_ids_trimmed = [
    out_ids[len(in_ids):] for in_ids, out_ids in zip(input_ids, generated_ids)
]

# batch_decode returns one string per input, so clean each string separately
output_texts = [
    clean_repeated_substrings(text)
    for text in processor.batch_decode(
        generated_ids_trimmed, skip_special_tokens=True, clean_up_tokenization_spaces=False
    )
]

print(output_texts)

Quick Start with vLLM

Installation

For vLLM deployment, set up a virtual environment and install vLLM from the nightly build:

uv venv hunyuanocr
source hunyuanocr/bin/activate
uv pip install -U vllm --pre --extra-index-url https://wheels.vllm.ai/nightly

Note: Installing cuda-compat-12-9 is recommended for compatibility:

sudo dpkg -i cuda-compat-12-9_575.57.08-0ubuntu1_amd64.deb
echo 'export LD_LIBRARY_PATH=/usr/local/cuda-12.9/compat:$LD_LIBRARY_PATH' >> ~/.bashrc
source ~/.bashrc

# verify cuda-compat-12-9
ls /usr/local/cuda-12.9/compat

Model Deployment

Deploy the model using vLLM with recommended settings for OCR tasks:

vllm serve tencent/HunyuanOCR \
    --no-enable-prefix-caching \
    --mm-processor-cache-gb 0 \
    --gpu-memory-utilization 0.2

Model Inference

Alternatively, run offline inference in-process with the vLLM Python API (no running server is required for this path):

from vllm import LLM, SamplingParams
from PIL import Image
from transformers import AutoProcessor

def clean_repeated_substrings(text):
    """Clean repeated substrings in text"""
    n = len(text)
    if n < 8000:
        return text
    for length in range(2, n // 10 + 1):
        candidate = text[-length:] 
        count = 0
        i = n - length
        
        while i >= 0 and text[i:i + length] == candidate:
            count += 1
            i -= length
        if count >= 10:
            return text[:n - length * (count - 1)]  
    return text

model_path = "tencent/HunyuanOCR"
llm = LLM(model=model_path, trust_remote_code=True)
processor = AutoProcessor.from_pretrained(model_path)
sampling_params = SamplingParams(temperature=0, max_tokens=16384)

img_path = "/path/to/image.jpg"
img = Image.open(img_path)

messages = [
    {"role": "system", "content": ""},
    {"role": "user", "content": [
        {"type": "image", "image": img_path},
        {"type": "text", "text": "检测并识别图片中的文字，将文本坐标格式化输出。"}
    ]}
]

prompt = processor.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)
inputs = {"prompt": prompt, "multi_modal_data": {"image": [img]}}
output = llm.generate([inputs], sampling_params)[0]

print(clean_repeated_substrings(output.outputs[0].text))

Using OpenAI API Client

Once the model is running under vllm serve (see Model Deployment), query it through the OpenAI-compatible API:

from openai import OpenAI

client = OpenAI(
    api_key="EMPTY",
    base_url="http://localhost:8000/v1",
    timeout=3600
)

messages = [
    {"role": "system", "content": ""},
    {
        "role": "user",
        "content": [
            {
                "type": "image_url",
                "image_url": {
                    "url": "https://example.com/image.png"
                }
            },
            {
                "type": "text",
                "text": (
                    "Extract all information from the main body of the document image "
                    "and represent it in markdown format, ignoring headers and footers. "
                    "Tables should be expressed in HTML format, formulas in the document "
                    "should be represented using LaTeX format, and the parsing should be "
                    "organized according to the reading order."
                )
            }
        ]
    }
]

response = client.chat.completions.create(
    model="tencent/HunyuanOCR",
    messages=messages,
    temperature=0.0,
    extra_body={
        "top_k": 1,
        "repetition_penalty": 1.0
    },
)
print(f"Generated text: {response.choices[0].message.content}")
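The request above points at a remote image URL. For a local file, one common approach is to embed the image as a base64 data URL in the same image_url field. The sketch below assumes the server accepts data URLs, as OpenAI-compatible servers commonly do; image_to_data_url and image_part are hypothetical helper names introduced here.

```python
import base64
import mimetypes
from pathlib import Path


def image_to_data_url(path: str) -> str:
    """Encode a local image file as a base64 data URL."""
    mime, _ = mimetypes.guess_type(path)
    if mime is None or not mime.startswith("image/"):
        raise ValueError(f"cannot infer an image MIME type from {path!r}")
    encoded = base64.b64encode(Path(path).read_bytes()).decode("ascii")
    return f"data:{mime};base64,{encoded}"


def image_part(path: str) -> dict:
    """Build an image_url content part shaped like the one in the request above."""
    return {"type": "image_url", "image_url": {"url": image_to_data_url(path)}}
```

The returned dictionary can replace the image_url entry in the user message; the rest of the request stays the same.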

Configuration Recommendations

Sampling Settings

Use greedy decoding (temperature 0.0) for the best OCR accuracy. Low-temperature sampling also works well; higher temperatures may reduce accuracy.
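To keep these settings consistent across requests, they can be collected in one place. This is a minimal sketch, not an official API: greedy_request_kwargs is a hypothetical helper, and the values mirror the OpenAI client example in this guide (temperature 0.0, top_k 1, repetition_penalty 1.0).

```python
from typing import Optional


def greedy_request_kwargs(max_tokens: Optional[int] = None) -> dict:
    """Keyword arguments for a deterministic (greedy) chat-completion request."""
    kwargs = {
        "temperature": 0.0,
        # top_k and repetition_penalty are passed through extra_body,
        # as in the OpenAI client example in this guide.
        "extra_body": {"top_k": 1, "repetition_penalty": 1.0},
    }
    if max_tokens is not None:
        # Only cap the output length when the caller asks for it.
        kwargs["max_tokens"] = max_tokens
    return kwargs
```

A call would then look like client.chat.completions.create(model="tencent/HunyuanOCR", messages=messages, **greedy_request_kwargs()).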

Caching Configuration

Disable prefix caching (--no-enable-prefix-caching) and the multimodal processor cache (--mm-processor-cache-gb 0) for OCR tasks. These features are designed for chat applications with repeated prompts, but OCR tasks typically process unique documents where caching provides little benefit.

Memory Requirements

For vLLM deployment with 16K-token decoding, 80GB GPU memory is recommended. For smaller GPUs, reduce max_tokens, downsample images, or enable tensor parallelism. Adjust gpu-memory-utilization based on your hardware constraints.
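Downsampling is the easiest of these options to apply on the client side. The sketch below computes a target size that caps the longer image side while preserving aspect ratio. fit_within is a hypothetical helper, and any particular limit (2000 pixels in the usage note) is an illustrative assumption rather than a recommended value.

```python
def fit_within(width: int, height: int, max_side: int) -> tuple[int, int]:
    """Scale (width, height) so the longer side is at most max_side, keeping aspect ratio."""
    if width <= 0 or height <= 0 or max_side <= 0:
        raise ValueError("dimensions must be positive")
    longest = max(width, height)
    if longest <= max_side:
        # Already small enough: never upscale.
        return width, height
    scale = max_side / longest
    return max(1, round(width * scale)), max(1, round(height * scale))
```

With Pillow, the result can be passed straight to resize, for example img = img.resize(fit_within(*img.size, 2000)). Very aggressive downsampling can make small text unreadable, so the limit is worth tuning against real documents.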

Token Limits

Adjust max_num_batched_tokens (the --max-num-batched-tokens flag of vllm serve) based on your hardware capabilities. Larger values improve throughput but require more GPU memory, so start with the default and raise it only while memory allows.
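When experimenting with these settings, it can be convenient to assemble the serve command programmatically so that each run records exactly which flags were used. This is a sketch under stated assumptions: build_serve_command is a hypothetical helper, the defaults mirror the deployment command in this guide, and the 8192 in the usage example is an arbitrary illustrative value.

```python
import shlex
from typing import Optional


def build_serve_command(
    model: str = "tencent/HunyuanOCR",
    gpu_memory_utilization: float = 0.2,
    max_num_batched_tokens: Optional[int] = None,
) -> list[str]:
    """Assemble a `vllm serve` argument list with the OCR-oriented flags from this guide."""
    cmd = [
        "vllm", "serve", model,
        "--no-enable-prefix-caching",
        "--mm-processor-cache-gb", "0",
        "--gpu-memory-utilization", str(gpu_memory_utilization),
    ]
    if max_num_batched_tokens is not None:
        # Leave the flag out entirely to keep vLLM's default.
        cmd += ["--max-num-batched-tokens", str(max_num_batched_tokens)]
    return cmd


if __name__ == "__main__":
    # Print a shell-ready command line; 8192 is an illustrative value only.
    print(shlex.join(build_serve_command(max_num_batched_tokens=8192)))
```

The list form can be passed to subprocess.run directly, which avoids shell quoting issues.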

Post-Processing

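The clean_repeated_substrings function trims runaway repetition at the end of long outputs. It only acts on text of at least 8000 characters: if the text ends with the same substring repeated 10 or more times in a row, all but one copy are removed. Shorter outputs are returned unchanged. This is particularly useful when processing very long documents, where generation can get stuck repeating the same fragment.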
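The behavior is easiest to see on synthetic input. The block below repeats the function from the examples in this guide so that it runs on its own, then applies it to a made-up string that ends in a long run of "ab" and to a short string that falls under the length threshold.

```python
def clean_repeated_substrings(text):
    """Trim a substring repeated 10+ times at the end of a long text to one copy."""
    n = len(text)
    if n < 8000:
        return text  # short outputs are left untouched
    for length in range(2, n // 10 + 1):
        candidate = text[-length:]
        count = 0
        i = n - length
        # Walk backwards while the same chunk keeps repeating.
        while i >= 0 and text[i:i + length] == candidate:
            count += 1
            i -= length
        if count >= 10:
            # Keep exactly one copy of the repeated chunk.
            return text[:n - length * (count - 1)]
    return text


looping = "header " + "ab" * 5000   # 10007 characters, ends in 5000 copies of "ab"
print(clean_repeated_substrings(looping))        # prints "header ab"

short = "ab" * 100                   # 200 characters, below the 8000 threshold
print(clean_repeated_substrings(short) == short)  # prints True
```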

For structured output, define the exact schema in your prompt and validate responses using regex or JSON schema validation. Well-designed prompts with clear output format specifications help ensure structured results.
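As an illustration of that validation step, the sketch below checks a JSON response against a hand-written schema using only the standard library. The schema (a list of objects with a "text" string and a four-number "box") is purely hypothetical and chosen for the example; it is not the model's native output format, and it only makes sense if the prompt asks for exactly this structure.

```python
import json


def parse_ocr_items(raw: str) -> list[dict]:
    """Parse and validate a JSON array of {"text": str, "box": [x1, y1, x2, y2]} items."""
    try:
        data = json.loads(raw)
    except json.JSONDecodeError as exc:
        raise ValueError(f"response is not valid JSON: {exc}") from exc
    if not isinstance(data, list):
        raise ValueError("expected a JSON array at the top level")
    for index, item in enumerate(data):
        if not isinstance(item, dict):
            raise ValueError(f"item {index} is not an object")
        if not isinstance(item.get("text"), str):
            raise ValueError(f"item {index} has no string 'text' field")
        box = item.get("box")
        is_number = lambda v: isinstance(v, (int, float)) and not isinstance(v, bool)
        if not (isinstance(box, list) and len(box) == 4 and all(is_number(v) for v in box)):
            raise ValueError(f"item {index} has no 4-number 'box' field")
    return data
```

Raising ValueError on any mismatch makes it straightforward to retry the request or flag the document for manual review.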

For the latest installation instructions and updates, please refer to the official Hunyuan OCR documentation and repository.
