Installation Guide
This guide covers installing and setting up Hunyuan OCR for both vLLM and Transformers deployment paths. Choose the method that best fits your use case and hardware configuration.
Quick Start with Transformers
Installation
Install the required Transformers package. Note that Hunyuan OCR support may require a specific version or branch of Transformers.
pip install git+https://github.com/huggingface/transformers@82a06db03535c49aa987719ed0746a76093b1ec4Note: This will be merged into the Transformers main branch later. Check the official repository for the latest installation instructions.
Model Inference
Here is a complete example of using Hunyuan OCR with Transformers:
from transformers import AutoProcessor
from transformers import HunYuanVLForConditionalGeneration
from PIL import Image
import torch
def clean_repeated_substrings(text):
"""Clean repeated substrings in text"""
n = len(text)
if n < 8000:
return text
for length in range(2, n // 10 + 1):
candidate = text[-length:]
count = 0
i = n - length
while i >= 0 and text[i:i + length] == candidate:
count += 1
i -= length
if count >= 10:
return text[:n - length * (count - 1)]
return text
model_name_or_path = "tencent/HunyuanOCR"
processor = AutoProcessor.from_pretrained(model_name_or_path, use_fast=False)
img_path = "path/to/your/image.jpg"
image_inputs = Image.open(img_path)
messages1 = [
{"role": "system", "content": ""},
{
"role": "user",
"content": [
{"type": "image", "image": img_path},
{"type": "text", "text": (
"检测并识别图片中的文字,将文本坐标格式化输出。"
)},
],
}
]
messages = [messages1]
texts = [
processor.apply_chat_template(msg, tokenize=False, add_generation_prompt=True)
for msg in messages
]
inputs = processor(
text=texts,
images=image_inputs,
padding=True,
return_tensors="pt",
)
model = HunYuanVLForConditionalGeneration.from_pretrained(
model_name_or_path,
attn_implementation="eager",
dtype=torch.bfloat16,
device_map="auto"
)
with torch.no_grad():
device = next(model.parameters()).device
inputs = inputs.to(device)
generated_ids = model.generate(**inputs, max_new_tokens=16384, do_sample=False)
if "input_ids" in inputs:
input_ids = inputs.input_ids
else:
print("inputs: # fallback", inputs)
input_ids = inputs.inputs
generated_ids_trimmed = [
out_ids[len(in_ids):] for in_ids, out_ids in zip(input_ids, generated_ids)
]
output_texts = clean_repeated_substrings(processor.batch_decode(
generated_ids_trimmed, skip_special_tokens=True, clean_up_tokenization_spaces=False
))
print(output_texts)Quick Start with vLLM
Installation
For vLLM deployment, set up a virtual environment and install vLLM from the nightly build:
uv venv hunyuanocr
source hunyuanocr/bin/activate
uv pip install -U vllm --pre --extra-index-url https://wheels.vllm.ai/nightlyNote: It is suggested to install cuda-compat-12-9 for compatibility:
sudo dpkg -i cuda-compat-12-9_575.57.08-0ubuntu1_amd64.deb
echo 'export LD_LIBRARY_PATH=/usr/local/cuda-12.9/compat:$LD_LIBRARY_PATH' >> ~/.bashrc
source ~/.bashrc
# verify cuda-compat-12-9
ls /usr/local/cuda-12.9/compatModel Deployment
Deploy the model using vLLM with recommended settings for OCR tasks:
vllm serve tencent/HunyuanOCR \
--no-enable-prefix-caching \
--mm-processor-cache-gb 0 \
--gpu-memory-utilization 0.2Model Inference
Query the deployed model using the OpenAI-compatible API:
from vllm import LLM, SamplingParams
from PIL import Image
from transformers import AutoProcessor
def clean_repeated_substrings(text):
"""Clean repeated substrings in text"""
n = len(text)
if n < 8000:
return text
for length in range(2, n // 10 + 1):
candidate = text[-length:]
count = 0
i = n - length
while i >= 0 and text[i:i + length] == candidate:
count += 1
i -= length
if count >= 10:
return text[:n - length * (count - 1)]
return text
model_path = "tencent/HunyuanOCR"
llm = LLM(model=model_path, trust_remote_code=True)
processor = AutoProcessor.from_pretrained(model_path)
sampling_params = SamplingParams(temperature=0, max_tokens=16384)
img_path = "/path/to/image.jpg"
img = Image.open(img_path)
messages = [
{"role": "system", "content": ""},
{"role": "user", "content": [
{"type": "image", "image": img_path},
{"type": "text", "text": "检测并识别图片中的文字,将文本坐标格式化输出。"}
]}
]
prompt = processor.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)
inputs = {"prompt": prompt, "multi_modal_data": {"image": [img]}}
output = llm.generate([inputs], sampling_params)[0]
print(clean_repeated_substrings(output.outputs[0].text))Using OpenAI API Client
When using vLLM, you can also query the model through the OpenAI-compatible API:
from openai import OpenAI
client = OpenAI(
api_key="EMPTY",
base_url="http://localhost:8000/v1",
timeout=3600
)
messages = [
{"role": "system", "content": ""},
{
"role": "user",
"content": [
{
"type": "image_url",
"image_url": {
"url": "https://example.com/image.png"
}
},
{
"type": "text",
"text": (
"Extract all information from the main body of the document image "
"and represent it in markdown format, ignoring headers and footers."
"Tables should be expressed in HTML format, formulas in the document "
"should be represented using LaTeX format, and the parsing should be "
"organized according to the reading order."
)
}
]
}
]
response = client.chat.completions.create(
model="tencent/HunyuanOCR",
messages=messages,
temperature=0.0,
extra_body={
"top_k": 1,
"repetition_penalty": 1.0
},
)
print(f"Generated text: {response.choices[0].message.content}")Configuration Recommendations
Sampling Settings
Use greedy sampling with temperature set to 0.0 for optimal OCR performance. Low-temperature sampling also works well. Higher temperatures may reduce accuracy for OCR tasks.
Caching Configuration
Disable prefix caching and image processor caching for OCR tasks. These features are designed for chat applications with repeated prompts, but OCR tasks typically process unique documents where caching provides little benefit.
Memory Requirements
For vLLM deployment with 16K-token decoding, 80GB GPU memory is recommended. For smaller GPUs, reduce max_tokens, downsample images, or enable tensor parallelism. Adjust gpu-memory-utilization based on your hardware constraints.
Token Limits
Adjust max_num_batched_tokens based on your hardware capabilities. Larger values improve throughput but require more GPU memory. Start with default values and adjust based on your specific hardware and performance requirements.
Post-Processing
The clean_repeated_substrings function helps remove repeated substrings that may appear in long outputs. This is particularly useful for documents with repetitive content or when processing very long documents.
For structured output, define the exact schema in your prompt and validate responses using regex or JSON schema validation. Well-designed prompts with clear output format specifications help ensure structured results.
For the latest installation instructions and updates, please refer to the official Hunyuan OCR documentation and repository.