A blog series on how you can take advantage of the power of LLMs in enterprise AI pipelines

By Adam Gurary, Senior Associate Product Manager, C3 AI


Large language models (LLMs) are revolutionizing enterprise AI pipelines. However, deploying and managing LLMs in production poses significant challenges, such as enforcing access control and configuring scaling behavior. These processes require intricate configuration, resource management, and continuous monitoring to ensure reliability and efficiency. Organizations looking to drive value with generative AI risk being bogged down by expensive, time-consuming processes.

The C3 AI Platform can help by simplifying the deployment, management, and scaling of LLMs. Our platform supports multiple fault-tolerant LLM deployments, allowing you to deploy, manage, monitor, and scale your models easily and securely. From one model to dozens, the C3 AI Platform provides the tools and flexibility needed to quickly productionize LLMs in enterprise AI pipelines.

 

What Is an Open-Source LLM and When Should You Use It?

Unlike proprietary models such as GPT-4 or Claude-3 that are accessed through third-party services, open-source LLMs are freely available to modify and deploy. Open-source LLMs often rival the performance of their proprietary counterparts on various benchmarks, and they can be fine-tuned on proprietary data for unparalleled performance on specific use cases, such as domain-specific chat-based copilots.
To take advantage of performance boosts from state-of-the-art models and fine-tuning while maintaining information security, an enterprise can host its own LLMs. With self-hosted LLMs, proprietary models and data never leave the environment.

The C3 AI Platform fully supports the scalable, fault-tolerant, and performance-optimized deployment of open-source LLMs, including Falcon-40B, Llama-2-70B, Mistral-7B, and Mixtral-8x7B, as well as fine-tuned versions of these models.

Read more about our secure, LLM-agnostic approach for C3 Generative AI.

 

The LLM Deployment Process on the C3 AI Platform

It takes five lines of code to go from choosing an LLM to deploying it on the C3 AI Platform.

 

  1. Choose an LLM: Select an LLM from Hugging Face Hub or point to a proprietary fine-tuned LLM in your filesystem of choice.
    # Option 1: reference a model on the Hugging Face Hub by its model ID
    llama_3_70b = c3.VllmPipe(
        modelId = "meta-llama/Meta-Llama-3-70B",
        tensorParallelSize = 8)

    # Option 2: point to fine-tuned model files in your own storage
    llama_3_70b = c3.VllmPipe(
        modelUrl = 'gcs://finetuned_llama3_url_here',
        tensorParallelSize = 8)

    In the first snippet, the user specifies a Hugging Face model ID; in the second, the user specifies a URL to the model files. In both cases, the user sets `tensorParallelSize` to define how many GPUs to run the model on. Defining the LLM is one line of code.

  2. Register the LLM: Ensure version control and consistency across deployments by registering the model to your C3 AI Model Registry.
    # Capture the registered, versioned model (assumed return value) for deployment in step 5
    llama_3_70b_version1 = c3.ModelRegistry.registerMlPipe(llama_3_70b, "llama_3_70b", "Llama-3-70B finetuned")

    The user registers the LLM to the C3 AI Model Registry in one line of code, specifying a URI to maintain versioning and a description of the LLM.

  3. Select hardware: Choose hardware such as Nvidia H100, A100, or L4 depending on your performance requirements.
    hardware_llama_3_70b = c3.HardwareProfile.upsertProfile({
        "name": "llama3_hardware",
        "cpu": 200,
        "memoryMb": 1800000,
        "gpu": 8,
        "gpuKind": "nvidia-h100-80gb-8",
        "gpuVendor": "nvidia",
        "diskGb": 1000
    })

    The user defines the exact hardware needed to run the model, including everything from the number of Nvidia H100 GPUs to the disk space required. This gives the user full control over the hardware that will run their LLM.

  4. Configure scaling: Determine the number of model replicas for performance and high availability — the platform supports easy adjustments.
    c3.app().configureNodePool(
        name = 'llama_3_nodepool',
        targetNodeCount = 2,
        hardwareProfile = hardware_llama_3_70b)

    The user defines how many replicas of the model they would like to deploy. By setting the `targetNodeCount` to 2, they ensure that there will be two replicas of the Llama-3-70B model handling requests, each served on the exact hardware profile defined in step three.

  5. Deploy: Deploy your model with one final line of code. Any authorized application can now use the simple completion API for LLM inference requests (a sketch of such a request appears just after this list).
    c3.ModelInference.deploy(
        llama_3_70b_version1,
        node_pools=["llama_3_nodepool"])

    Finally, the user deploys the Llama-3-70B model to the specified hardware with one line of code.
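
To make the last step concrete, below is a minimal sketch of what a completion request against the deployed model might look like from an application. The endpoint URL, payload fields, and bearer-token authentication are illustrative assumptions for this sketch, not the platform's documented completion API.

    # Illustrative sketch only: the endpoint path, payload fields, and auth scheme
    # are assumptions for this example, not the documented C3 AI completion API.
    import requests

    # Hypothetical endpoint for the deployed Llama-3-70B completion service
    ENDPOINT = "https://your-c3-environment.example.com/inference/llama_3_70b/complete"

    response = requests.post(
        ENDPOINT,
        headers={"Authorization": "Bearer <your-access-token>"},  # placeholder credential
        json={
            "prompt": "Summarize last quarter's maintenance reports.",
            "maxTokens": 256,
            "temperature": 0.2,
        },
        timeout=60,
    )
    response.raise_for_status()
    print(response.json())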

 

Key C3 AI Platform Features That Support LLMs

The C3 AI Platform provides a comprehensive suite of features designed to ensure the secure, efficient, and scalable deployment of LLMs in enterprise environments:

  1. Security: Keep your data and proprietary models secure by deploying models within your own environment, whether on-premises or in a cloud environment such as AWS, Google Cloud, or Azure.
  2. Highly optimized inference: Maximize throughput and minimize latency. We achieve this through various optimization techniques, including the following (see the illustrative sketch after this list):
    • PagedAttention: Boosts GPU memory efficiency.
    • Custom CUDA kernels: Optimize computation by tailoring each model’s execution to the given hardware.
    • Continuous batching: Improves throughput by up to 100X by aggregating requests.
    • Support for quantized models: Reduces model size to accelerate inference.
    • Tensor parallelism: Splits the model across multiple GPUs for enhanced performance.
    • Token streaming: Lets users see the model’s output as it is generated.
    • Decoding algorithms: Improve generation quality with algorithms such as parallel sampling and beam search.
    • Multi-LoRA support: Serves dozens of fine-tuned models efficiently.
  3. Versioning enforcement: The C3 AI Model Registry enforces versioning for your deployments, ensuring consistency and control over your models.
  4. Scalability: Serve your models on a single GPU, or scale up to hundreds, adjusting to the demands of your applications to ensure optimal performance. Scale your LLM deployments independently to ensure high availability of all models with zero resource contention between them.
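
Several of the optimization techniques listed under item 2, including PagedAttention, continuous batching, tensor parallelism, and quantized models, are also exposed by the open-source vLLM engine, which the VllmPipe type in step 1 suggests is used under the hood. The sketch below shows how a few of those techniques surface in that open-source library; it is an illustration of the underlying ideas, not the C3 AI Platform's own API.

    # Sketch using the open-source vLLM library, which exposes several of the
    # techniques listed above; shown for illustration, not as the C3 AI Platform API.
    from vllm import LLM, SamplingParams

    llm = LLM(
        model="meta-llama/Meta-Llama-3-70B",
        tensor_parallel_size=8,   # tensor parallelism: split the model across 8 GPUs
        # quantization="awq",     # optionally serve a quantized checkpoint
    )

    sampling = SamplingParams(
        n=2,                      # parallel sampling: two candidate generations per prompt
        temperature=0.7,
        max_tokens=128,
    )

    # The engine batches incoming requests continuously and manages KV-cache
    # memory with PagedAttention.
    outputs = llm.generate(["Summarize the benefits of predictive maintenance."], sampling)
    for output in outputs:
        print(output.outputs[0].text)  # print the first of the n candidates

On the C3 AI Platform, the corresponding knob shown in step 1 is the tensorParallelSize argument on VllmPipe; the platform applies the remaining optimizations as part of the deployment process described above.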

 

 

Unlocking the Potential of LLMs

The C3 AI Platform simplifies LLM deployment, offering a secure, scalable, and optimized solution for enterprise AI pipelines. Our platform enables you to harness the power of fine-tuned and open-source LLMs with ease, driving innovation and value in your AI initiatives.

Stay tuned for the next installment of our blog series on the benefits of using the C3 AI Platform to deploy LLMs in enterprise AI pipelines — we will discuss the ins and outs of deploying vision language models for advanced visual question-answering use cases.

Learn more about how you can get the most out of your enterprise data with C3 Generative AI and our novel retrieval augmented generation approach.

 


About the Author

Adam Gurary is a Senior Associate Product Manager at C3 AI, where he manages the roadmap and execution for the platform’s model inference service and machine learning pipelines. Adam and his team focus on building state-of-the-art tooling for hosting and serving open-source large language models and for creating, training, and executing machine learning pipelines. Adam holds a B.S. in Mathematical and Computational Science from Stanford University.