The Salesforce AI Platform team develops and manages the services that power large language models (LLMs) and other AI workloads within Salesforce. Its main focus is model onboarding: giving customers a robust infrastructure for hosting a variety of ML models. The team's mission is to streamline model deployment, improve inference performance, and optimize cost efficiency, so that models integrate cleanly into Agentforce and other applications that require inference. To that end, the team integrates state-of-the-art solutions into a unified AI platform and collaborates with leading technology providers, including open source communities and cloud services such as Amazon Web Services (AWS). This helps ensure Salesforce customers receive the most advanced AI technology available while optimizing the cost-performance of the serving infrastructure. In this post, we share how the Salesforce AI Platform team optimized GPU utilization, improved resource efficiency, and achieved cost savings using Amazon SageMaker AI, specifically inference components.
The challenge with hosting models for inference: Optimizing compute and cost-to-serve while maintaining performance
Deploying models efficiently, reliably, and cost-effectively is a critical challenge for organizations of all sizes. The Salesforce AI Platform team is responsible for deploying their proprietary LLMs such as CodeGen and XGen on SageMaker AI and optimizing them for inference. Salesforce has multiple models distributed across single model endpoints (SMEs), supporting a diverse range of model sizes from a few gigabytes (GB) to 30 GB, each with unique performance requirements and infrastructure demands.
The team faced two distinct optimization challenges. Their larger models (20–30 GB) with lower traffic patterns were running on high-performance GPUs, resulting in underutilized multi-GPU instances and inefficient resource allocation. Meanwhile, their medium-sized models (approximately 15 GB) handling high-traffic workloads demanded low-latency, high-throughput processing capabilities. These models often incurred higher costs due to over-provisioning on similar multi-GPU setups. The original post illustrates this with a diagram of Salesforce’s large and medium SageMaker endpoints, showing where resources are underutilized.
Leveraging SageMaker AI Inference Components for Optimization
To address these challenges, the Salesforce team integrated Amazon SageMaker AI inference components. These components provide a streamlined approach to optimizing model deployment, specifically focusing on reducing latency and improving throughput while minimizing costs. By leveraging these components, Salesforce was able to significantly improve GPU utilization and resource efficiency.
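To make the utilization argument concrete, here is a minimal sketch that compares giving every model its own instance with packing models onto shared instances by GPU memory. The model names, the 96 GB instance size, and the first-fit-decreasing heuristic are all assumptions chosen for illustration; they are not Salesforce's actual configuration or SageMaker's placement logic, and real footprints also include activation and cache memory on top of model weights.

```python
from dataclasses import dataclass, field


@dataclass
class Instance:
    """A GPU instance with a fixed pool of accelerator memory (hypothetical model)."""
    gpu_memory_gb: float
    models: list = field(default_factory=list)  # list of (name, size_gb) tuples

    @property
    def used_gb(self) -> float:
        return sum(size for _, size in self.models)

    @property
    def free_gb(self) -> float:
        return self.gpu_memory_gb - self.used_gb


def dedicated_placement(models: dict, gpu_memory_gb: float) -> list:
    """One instance per model, mirroring a single model endpoint per model."""
    instances = []
    for name, size in models.items():
        instance = Instance(gpu_memory_gb)
        instance.models.append((name, size))
        instances.append(instance)
    return instances


def shared_placement(models: dict, gpu_memory_gb: float) -> list:
    """First-fit decreasing: place the largest models first, reusing free space."""
    instances = []
    for name, size in sorted(models.items(), key=lambda kv: kv[1], reverse=True):
        if size > gpu_memory_gb:
            raise ValueError(f"{name} does not fit on a single instance")
        target = next((i for i in instances if i.free_gb >= size), None)
        if target is None:
            target = Instance(gpu_memory_gb)
            instances.append(target)
        target.models.append((name, size))
    return instances


def memory_utilization(instances: list) -> float:
    """Fraction of provisioned GPU memory that is occupied by models."""
    total = sum(i.gpu_memory_gb for i in instances)
    return sum(i.used_gb for i in instances) / total


if __name__ == "__main__":
    # Hypothetical model sizes in GB, loosely following the ranges in the text.
    example_models = {"large-a": 30, "large-b": 20, "medium-a": 15, "medium-b": 15}
    for label, place in (("dedicated", dedicated_placement), ("shared", shared_placement)):
        placed = place(example_models, gpu_memory_gb=96)
        print(f"{label}: {len(placed)} instance(s), "
              f"{memory_utilization(placed):.0%} memory utilization")
```

The sketch considers memory only. In practice, traffic and latency targets also determine how many models can safely share an instance, which is why the high-traffic medium models need separate consideration from the low-traffic large ones.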
Specifically, the team used features like dynamic batching, which automatically adjusts batch sizes based on incoming traffic patterns, and optimized inference kernels tailored for their specific models. This resulted in a substantial reduction in latency and improved throughput compared to traditional deployment methods. SageMaker AI inference components also allowed them to manage GPU resources efficiently across the models previously hosted on separate SMEs, ensuring optimal utilization.
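As a rough illustration of how per-model GPU allocation is expressed, the sketch below builds the request for the SageMaker `CreateInferenceComponent` API as a plain dictionary. The helper function, endpoint name, model names, and resource numbers are hypothetical and do not come from Salesforce's deployment; the field names follow the boto3 `create_inference_component` parameters as we understand them, so check them against the current API reference before use.

```python
def build_inference_component_request(
    endpoint_name: str,
    model_name: str,
    component_name: str,
    accelerators: int,
    min_memory_mb: int,
    copies: int = 1,
    variant_name: str = "AllTraffic",
) -> dict:
    """Build keyword arguments for SageMaker's CreateInferenceComponent call.

    Hypothetical helper: it only assembles the request and makes no AWS call.
    """
    if accelerators < 1 or copies < 1 or min_memory_mb < 1:
        raise ValueError("accelerators, copies, and min_memory_mb must be positive")
    return {
        "InferenceComponentName": component_name,
        "EndpointName": endpoint_name,
        "VariantName": variant_name,
        "Specification": {
            # Name of a model already registered in SageMaker (assumed to exist).
            "ModelName": model_name,
            "ComputeResourceRequirements": {
                # Accelerators and host memory reserved for each copy of the model.
                "NumberOfAcceleratorDevicesRequired": accelerators,
                "MinMemoryRequiredInMb": min_memory_mb,
            },
        },
        # Number of model copies to run on the shared endpoint.
        "RuntimeConfig": {"CopyCount": copies},
    }


# Two hypothetical models sharing one endpoint with different allocations.
requests = [
    build_inference_component_request(
        "shared-llm-endpoint", "large-model", "large-model-ic",
        accelerators=2, min_memory_mb=30 * 1024,
    ),
    build_inference_component_request(
        "shared-llm-endpoint", "medium-model", "medium-model-ic",
        accelerators=1, min_memory_mb=15 * 1024, copies=2,
    ),
]

# With AWS credentials and an existing endpoint, each request could be sent with:
#   import boto3
#   sagemaker = boto3.client("sagemaker")
#   for request in requests:
#       sagemaker.create_inference_component(**request)
```

Keeping the request construction separate from the API call makes the allocation easy to review and test before anything is provisioned.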
Quantifiable Results: Cost Savings and Performance Gains
The implementation of SageMaker AI inference components yielded impressive results. Salesforce reported reducing inference costs by up to 8x compared with previous deployments, along with an average latency reduction of 30%. This combination of lower costs and better performance enabled the team to scale their AI services more effectively and deliver enhanced user experiences. Inference components are now a core element of Salesforce’s AI inference strategy, improving overall efficiency and reducing operational overhead.
This case study demonstrates the value of cloud-based inference solutions like Amazon SageMaker AI for optimizing model deployment. By adopting inference components, organizations can unlock significant cost savings, improve performance, and accelerate their AI initiatives. Salesforce’s experience highlights the importance of optimized infrastructure for modern AI workloads.
Source: Read the original article here.