Qwen 2.5 is a family of multilingual large language models (LLMs): a collection of pre-trained and instruction-tuned generative models in 0.5B, 1.5B, 3B, 7B, 14B, 32B, and 72B sizes (text in/text out). The Qwen 2.5 instruction-tuned, text-only models are optimized for multilingual dialogue use cases and outperform both the previous generation of Qwen models and many publicly available chat models on common industry benchmarks.
Qwen 2.5 is an auto-regressive language model that uses an optimized transformer architecture at its core. The Qwen 2.5 collection supports more than 29 languages and offers improved role-play capabilities and condition-setting for chatbots.
This post outlines how to deploy the Qwen 2.5 family of models on Amazon Elastic Compute Cloud (Amazon EC2) and Amazon SageMaker using Hugging Face Text Generation Inference (TGI) containers. The Qwen2.5 Coder and Qwen2.5 Math variants are also supported.
Preparation
Hugging Face offers two tools that are frequently used with AWS Inferentia and AWS Trainium: Text Generation Inference (TGI) containers, which support the deployment and serving of LLMs, and the Optimum Neuron library, which serves as an interface between the Transformers library and the Inferentia and Trainium chips.
The first time a model is run on Inferentia or Trainium, it must be compiled to produce a version that performs best on Inferentia and Trainium chips. The Hugging Face Optimum Neuron library, together with the Optimum Neuron cache, transparently supplies a compiled model when one is available. If you are using a different model with the Qwen 2.5 architecture, you may need to compile the model before deploying. For more information, see Compiling a model for Inferentia or Trainium.
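If you do need to compile, the following is a minimal sketch of the Optimum Neuron export step; the batch size, sequence length, core count, and data type shown here are representative values you would tune for your model and instance:

# Compile the model for Neuron (run this on an Inferentia or Trainium instance)
optimum-cli export neuron \
    --model Qwen/Qwen2.5-7B-Instruct \
    --batch_size 4 \
    --sequence_length 4096 \
    --num_cores 2 \
    --auto_cast_type bf16 \
    qwen2.5-7b-instruct-neuron/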
You can deploy TGI as a Docker container on an Inferentia or Trainium EC2 instance, or to an Amazon SageMaker endpoint.
Option 1: Deploy TGI on Amazon EC2 Inf2
In this example, we deploy Qwen2.5-7B-Instruct to an inf2.xlarge instance. (See this article for detailed instructions on how to deploy an instance using the Hugging Face DLAMI.)
This option uses SSH to connect to the instance and create a .env file (which defines constants and specifies where the model is cached) and a file named docker-compose.yaml (which defines all the environment parameters the model needs to be deployed for inference). For this use case, you can copy the following files:
- Create a .env file with the following content:
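The following is a minimal sketch of the .env file, assuming the Qwen2.5-7B-Instruct model from this example; the cast type, batch size, and token limits are representative values to tune for your workload:

# Model to deploy and its serving limits (tune for your workload)
MODEL_ID='Qwen/Qwen2.5-7B-Instruct'
HF_AUTO_CAST_TYPE='bf16'
MAX_BATCH_SIZE=4
MAX_INPUT_TOKENS=4000
MAX_TOTAL_TOKENS=4096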
- Create a file named docker-compose.yaml with the following content:
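A minimal docker-compose.yaml sketch follows, assuming the public Hugging Face neuronx-tgi image and the single Neuron device found on an inf2.xlarge (larger instances expose additional /dev/neuronN devices):

services:
  tgi:
    # Hugging Face TGI image built for AWS Neuron (Inferentia/Trainium)
    image: ghcr.io/huggingface/neuronx-tgi:latest
    ports:
      - "8080:80"
    environment:
      # Values are read from the .env file created above
      - PORT=80
      - MODEL_ID=${MODEL_ID}
      - HF_AUTO_CAST_TYPE=${HF_AUTO_CAST_TYPE}
      - MAX_BATCH_SIZE=${MAX_BATCH_SIZE}
      - MAX_INPUT_TOKENS=${MAX_INPUT_TOKENS}
      - MAX_TOTAL_TOKENS=${MAX_TOTAL_TOKENS}
    devices:
      # inf2.xlarge exposes a single Neuron device
      - "/dev/neuron0"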
- Deploy the model using Docker Compose:
docker compose -f docker-compose.yaml --env-file .env up
- To verify that the model deployed correctly, send a test prompt to the model:
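For example, assuming TGI is listening on port 8080 as mapped in the Compose file above, a test request against TGI's generate endpoint looks like this:

# Send a simple English test prompt to the local TGI server
curl 127.0.0.1:8080/generate \
    -X POST \
    -H 'Content-Type: application/json' \
    -d '{"inputs":"What is the capital of France?","parameters":{"max_new_tokens":64}}'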
- To verify that the model can respond in multiple languages, try sending a prompt in Chinese:
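The prompt below, which asks "What is the capital of France?" in Chinese, is one illustrative option:

# Send the same question in Chinese to confirm multilingual output
curl 127.0.0.1:8080/generate \
    -X POST \
    -H 'Content-Type: application/json' \
    -d '{"inputs":"法国的首都是哪个城市？","parameters":{"max_new_tokens":64}}'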
Option 2: Deploy TGI on Amazon SageMaker
You can also use the Hugging Face Optimum Neuron library to quickly deploy models from Amazon SageMaker, following the instructions on the Hugging Face Model Hub.
- From the Qwen 2.5 model card on the Hugging Face Model Hub, choose Deploy, then SageMaker, and finally AWS Inferentia & Trainium.
- Copy the sample code into your SageMaker notebook, then choose Run.
- The copied notebook will look similar to the following:
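The sketch below follows the pattern of the sample code Hugging Face generates for Inferentia-backed SageMaker endpoints; the container version, core count, token limits, and execution role name are representative assumptions, so copy the exact values from the model card:

import boto3
import sagemaker
from sagemaker.huggingface import HuggingFaceModel, get_huggingface_llm_image_uri

# Resolve the SageMaker execution role (the role name here is an
# assumption; use the role configured in your own account)
try:
    role = sagemaker.get_execution_role()
except ValueError:
    iam = boto3.client("iam")
    role = iam.get_role(RoleName="sagemaker_execution_role")["Role"]["Arn"]

# Model configuration passed to the TGI Neuron container
# (representative values; copy the exact ones from the model card)
hub = {
    "HF_MODEL_ID": "Qwen/Qwen2.5-7B-Instruct",
    "HF_NUM_CORES": "2",
    "HF_AUTO_CAST_TYPE": "bf16",
    "MAX_BATCH_SIZE": "4",
    "MAX_INPUT_TOKENS": "3686",
    "MAX_TOTAL_TOKENS": "4096",
}

# TGI container image built for AWS Neuron (version is an assumption)
image_uri = get_huggingface_llm_image_uri("huggingface-neuronx", version="0.0.25")

huggingface_model = HuggingFaceModel(image_uri=image_uri, env=hub, role=role)

# Deploy to a real-time endpoint on an Inferentia2 instance; the long
# health check timeout allows for model compilation or download
predictor = huggingface_model.deploy(
    initial_instance_count=1,
    instance_type="ml.inf2.xlarge",
    container_startup_health_check_timeout=1800,
    volume_size=512,
)

# Send a test request to the endpoint
print(predictor.predict({"inputs": "What is the capital of France?"}))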
Clean up
Terminate your EC2 instance and delete your SageMaker endpoint to avoid ongoing costs.
Terminate the EC2 instance through the AWS Management Console.
Delete the SageMaker endpoint using the console or the following commands:
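Assuming the predictor object from the deployment notebook above is still in scope, the SageMaker Python SDK can remove both the model and the endpoint:

# Delete the model and the endpoint to stop incurring charges
predictor.delete_model()
predictor.delete_endpoint()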
Conclusion
AWS Trainium and AWS Inferentia offer high performance and low cost for deploying Qwen 2.5 models. We look forward to seeing how you build differentiated AI applications using these powerful models and purpose-built AI infrastructure. For more information on how to get started with AWS AI chips, see the AWS Neuron documentation.
About the Authors
Jim Burtoft is a Senior Startup Solutions Architect at AWS and works directly with startups and with the Hugging Face team. Jim is a CISSP, part of the AWS AI/ML Technical Field Community and the Neuron Data Science community, and works with the open source community to enable the use of Inferentia and Trainium. Jim holds a bachelor’s degree in mathematics from Carnegie Mellon University and a master’s degree in economics from the University of Virginia.
Miriam Lebowitz is a Solutions Architect focused on empowering early-stage startups at AWS. She leverages her experience with AI/ML to guide companies in selecting and implementing the technologies that fit their business goals, setting them up for scalable growth and innovation in a competitive startup world.
Rhia Soni is a Startup Solutions Architect at AWS. Rhia specializes in working with early-stage startups and helps customers adopt Inferentia and Trainium. Rhia is also part of the AWS Analytics Technical Field Community and is a subject matter expert in generative BI. Rhia holds a Bachelor of Arts in Information Science from the University of Maryland.
Paul Eat is a Senior Solutions Architect Manager focused on startups at AWS. Paul built a team of AWS Startup Solutions Architects focused on the adoption of Inferentia and Trainium. Paul holds a bachelor’s degree in computer science from Siena College and has multiple cybersecurity certifications.