Part two of the series: Using LLMs in Enterprise AI Pipelines

By Adam Gurary, Senior Associate Product Manager, C3 AI

Vision language models (VLMs) are a major advancement in AI, combining the capabilities of large language models (LLMs) with computer vision to perform tasks such as visual question answering (VQA). VQA models, such as Blip-2 and Llava, answer questions about images, a capability with vast applications, including automated diagram annotation, surveillance, and robotics.

Read Part One of Our Series on Using LLMs in Enterprise AI Pipelines: Efficiently Deploying Fine-Tuned and Open-Source LLMs at Scale

VLMs are powerful new tools for innovation, but using them in production presents unique challenges. Many standard open-source LLMs, such as Llama-3, Mixtral, and Falcon, share a similar decoder-only architecture and handle only strings. VLMs, on the other hand, are less standardized. This creates more overhead for data science and model operations teams in tasks such as dependency management, hardware acceleration, and image data handling.

The C3 AI Platform enables the seamless, enterprise-grade deployment and scaling of VLMs. Our platform supports multiple fault-tolerant VLM deployments, allowing you to deploy, manage, monitor, and scale your VLMs easily and securely. From “one model to rule them all” to dozens of models fine-tuned for their specific use cases, the C3 AI Platform provides the tools and flexibility needed to productionize VLMs in enterprise AI pipelines.


The C3 AI VLM Deployment Process

In part one of this series, we discussed the standard C3 AI LLM deployment process. The process for deploying VLMs is almost the same, with one additional step: Define the VLM.

Defining the VLM ensures that all libraries needed to run the model on the required hardware, such as Hugging Face Accelerate in the case of a PyTorch model, are available. This step also determines the profile of model inputs and outputs. The C3 AI Platform offers the flexibility to accept the image in a variety of formats, from passing the image directly to passing only a reference to the image.
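
As a rough illustration of the two input styles mentioned above, the following minimal sketch models an image input that carries either the image bytes themselves or only a reference to the image. The `ImageInput` class, its field names, and the payload layout are hypothetical and are not part of the C3 AI Platform API; they only show how a single input profile might accommodate both forms.

```python
import base64
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class ImageInput:
    """Hypothetical VLM image input: exactly one of `data` or `uri` is set."""

    data: Optional[bytes] = None  # raw image bytes passed directly
    uri: Optional[str] = None     # reference to an image stored elsewhere

    def __post_init__(self) -> None:
        # Reject inputs that set both fields or neither field.
        if (self.data is None) == (self.uri is None):
            raise ValueError("provide exactly one of data or uri")

    def to_payload(self, question: str) -> dict:
        """Build a JSON-serializable request body for a VQA-style call."""
        if self.data is not None:
            encoded = base64.b64encode(self.data).decode("ascii")
            image = {"kind": "inline", "base64": encoded}
        else:
            image = {"kind": "reference", "uri": self.uri}
        return {"question": question, "image": image}


# Illustrative usage with placeholder bytes and a made-up storage location.
inline = ImageInput(data=b"placeholder-bytes").to_payload("What is shown?")
by_ref = ImageInput(uri="s3://example-bucket/diagram.png").to_payload("What is shown?")
```

Passing a reference keeps request bodies small when images are large or already stored centrally, while passing the bytes inline avoids a second lookup on the serving side.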

Once the VLM is deployed, any authorized application can make requests to it.

The four key features discussed in the previous installment of this series hold for VLMs:

  1. Security: As with LLM deployment, you can keep your proprietary VLM and data secure by deploying within your own environment.
    nodepools = ["llava_nodepool"])

    The user deploys a fine-tuned Llava-1.5-13b model to one of their own node pools, ensuring no proprietary model weights or data leaves the environment.

  2. Highly Optimized Inference: Run models fast with the flexibility to use accelerator libraries to manage the VLM execution across multiple GPUs.

  3. Versioning Enforcement: The C3 AI Model Registry enforces versioning for your VLM deployments, ensuring consistency and control over your models.

    The user registers the VLM to the C3 AI Model Registry in one line of code, specifying a URI to maintain versioning and a description of the VLM.

  4. Scalability: VLMs can be served on a single GPU and can scale independently up to hundreds of GPUs.
    targetNodeCount = 8,
    hardwareProfile = "Nvidia_8xH200")

    The user configures a node pool to deploy a Llava model across eight Nvidia H200 nodes, effectively deploying the model to 64 Nvidia H200 GPUs.
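
The GPU count in the scalability example follows from simple arithmetic: nodes in the pool multiplied by GPUs per node. The sketch below makes that explicit. It assumes that a profile name such as "Nvidia_8xH200" encodes the per-node GPU count in its "<count>x<model>" suffix; that naming rule is inferred from the example above, and the helper functions are hypothetical rather than part of the C3 AI Platform.

```python
import re


def gpus_per_node(hardware_profile: str) -> int:
    """Read the GPU count from a profile name shaped like 'Nvidia_8xH200'.

    Assumption: the '<count>x<model>' suffix encodes GPUs per node.
    """
    match = re.search(r"_(\d+)x\w+$", hardware_profile)
    if match is None:
        raise ValueError(f"unrecognized hardware profile: {hardware_profile!r}")
    return int(match.group(1))


def total_gpus(target_node_count: int, hardware_profile: str) -> int:
    """Total GPUs = nodes in the pool multiplied by GPUs on each node."""
    return target_node_count * gpus_per_node(hardware_profile)


# Eight nodes with eight H200 GPUs each, as in the example above.
print(total_gpus(8, "Nvidia_8xH200"))  # prints 64
```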


Unlocking New Possibilities with VLMs

Deploying VLMs on the C3 AI Platform opens a world of possibilities for enterprise AI applications. Whether it’s helping doctors interpret medical images or enhancing surveillance systems, VLMs are set to transform industries with their dual understanding of text and images.

In the next installment of our series, we discuss the basics of managing LLM, VLM, and other large model deployments.


About the Author

Adam Gurary is a Senior Associate Product Manager at C3 AI, where he manages the roadmap and execution for the platform’s model inference service and machine learning pipelines. Adam and his team focus on building state-of-the-art tooling for hosting and serving open-source large language models and for creating, training, and executing machine learning pipelines. Adam holds a B.S. in Mathematical and Computational Science from Stanford University.