Scaling AI: Platform best practices


This is a VB Lab Insights article presented by Capital One.


Enterprises are now deeply invested in how they build and continually evolve world-class enterprise platforms that enable AI use cases to be built, deployed, scaled, and evolved over time. Many companies have historically taken a federated approach to platforms as they built capabilities and features to support the bespoke needs of individual areas of their business.

Today, however, advances like generative AI introduce new challenges that require an evolved approach to building and scaling enterprise platforms. This includes factoring in the specialized talent and Graphics Processing Unit (GPU) resource needs for training and hosting large language models, access to huge volumes of high-quality data, close collaboration across many teams to deploy agentic workflows, and a high level of maturity for internal application programming interfaces (APIs) and tooling that multi-agentic workflows require, to name a few. Disparate systems and a lack of standardization hinder companies’ ability to embrace the full potential of AI.

At Capital One, we’ve learned that large enterprises should be guided by a common set of best practices and platform standards to effectively deploy AI at scale. While the details will vary, four common principles help companies do this successfully and unlock value for their business:

1. Everything starts with the user

The goal for any enterprise platform is to empower users — therefore you must start with those users’ needs. You should seek to understand how your users are engaging with your platforms, what problems they’re trying to solve and any friction they’re coming up against.

At Capital One, for instance, a key tenet guiding our AI/ML platform teams is that we obsess over all aspects of the customer experience, even those we don’t directly oversee. For example, we undertook a number of initiatives in recent years to solve the data and access management pain points for our users, even though we rely on other enterprise platforms for these.

As you earn the trust and engagement of your users, you can innovate and reimagine what’s possible with new ideas and by going “further up the stack.” This customer obsession is the foundation for building long-lasting and sustainable platforms.

2. Establishing a multi-tenant platform control plane

Multi-tenancy is essential for any enterprise platform, allowing multiple business lines and distributed teams to use core platform capabilities such as compute, storage, inference services, and workflow orchestration in a shared but well-managed environment. It lets you solve core data access pain points, supports abstraction, enables multiple compute patterns, and simplifies the provisioning and management of compute instances for core services, such as the large fleet of GPUs and Central Processing Units (CPUs) that AI/ML workloads require.

With the right design of a multi-tenant platform control plane, you can integrate both best-in-class open-source and commercial software components, and scale flexibly as the platform evolves over time. At Capital One, we have developed a robust platform control plane with Kubernetes as the foundation. It scales to our large fleet of compute clusters on AWS, which are used by thousands of active AI/ML users across the company.

We routinely experiment with and adopt best-in-class open-source and commercial software components as plug-ins, and develop our own proprietary capabilities where they give us a competitive edge. For the end-user, this enables access to the latest technologies and greater self-service capabilities, empowering teams to build and deploy on our platforms without having to call on our engineering teams for support. 
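
To make the multi-tenancy idea concrete, here is a minimal Python sketch of per-tenant quota enforcement for shared compute. It is an illustration only, not Capital One's control plane: the `ControlPlane` and `TenantQuota` names, the tenant names, and the limits are all hypothetical, and a platform built on Kubernetes would typically rely on the cluster's own quota mechanisms rather than application code like this.

```python
from dataclasses import dataclass


@dataclass
class TenantQuota:
    """Hypothetical per-tenant limits and current usage for shared compute."""
    gpu_limit: int
    cpu_limit: int
    gpu_used: int = 0
    cpu_used: int = 0


class ControlPlane:
    """Toy admission check: each tenant draws only from its own quota."""

    def __init__(self) -> None:
        self._quotas: dict[str, TenantQuota] = {}

    def register_tenant(self, name: str, gpu_limit: int, cpu_limit: int) -> None:
        self._quotas[name] = TenantQuota(gpu_limit, cpu_limit)

    def request(self, tenant: str, gpus: int = 0, cpus: int = 0) -> bool:
        """Grant the request if it fits the tenant's remaining quota."""
        quota = self._quotas.get(tenant)
        if quota is None:
            raise KeyError(f"unknown tenant: {tenant}")
        fits = (
            quota.gpu_used + gpus <= quota.gpu_limit
            and quota.cpu_used + cpus <= quota.cpu_limit
        )
        if fits:
            quota.gpu_used += gpus
            quota.cpu_used += cpus
        return fits

    def release(self, tenant: str, gpus: int = 0, cpus: int = 0) -> None:
        """Return resources to the tenant's quota, never dropping below zero."""
        quota = self._quotas[tenant]
        quota.gpu_used = max(0, quota.gpu_used - gpus)
        quota.cpu_used = max(0, quota.cpu_used - cpus)
```

In this sketch, one team exhausting its GPU quota has no effect on another team's requests, which is the "shared but well-managed" property described above.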

3. Embedding automation and governance

As you build a new platform, it’s critical to have the right mechanisms in place to collect logs and insights on models and features along the end-to-end lifecycle, as they are built, tested and deployed. Enterprises can automate core tasks such as lineage tracking, adherence to enterprise controls, observability, monitoring and detection across various layers of their platforms. By standardizing and automating these tasks, it is possible to cut weeks, and in some cases months, from the time needed to develop and deploy new mission-critical models and AI use cases.

At Capital One, we’ve taken this a step further by building a marketplace of reusable components and software development kits (SDKs) that have built-in observability and governance standards. These empower our associates to find the reusable libraries, workflows and user-contributed code they need to develop AI models and apps with confidence, knowing that the artifacts they are building on enterprise platforms are well-managed under the hood. In fact, at this point in our journey, we consider this level of automation and standardization a competitive advantage.
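
As a rough illustration of what "built-in" governance in a reusable component could look like, the Python sketch below wraps a pipeline step in a decorator that records a lineage entry every time the step runs. Everything here is an assumption for the sake of the example: the `tracked` decorator, the in-memory `LINEAGE_LOG`, and the `normalize` step are hypothetical and do not describe Capital One's SDKs, which would more plausibly write to a central metadata service.

```python
import functools
import hashlib
import json
import time

# Hypothetical in-memory store; a real platform would use a durable service.
LINEAGE_LOG = []


def tracked(step_name, owner):
    """Decorator that records who ran which step and a digest of its output."""
    def decorator(func):
        @functools.wraps(func)
        def wrapper(*args, **kwargs):
            start = time.time()
            result = func(*args, **kwargs)
            # Digest of the JSON-serialized output, so later runs can be compared.
            payload = json.dumps(result, sort_keys=True, default=str).encode()
            LINEAGE_LOG.append({
                "step": step_name,
                "owner": owner,
                "function": func.__name__,
                "duration_s": round(time.time() - start, 6),
                "output_digest": hashlib.sha256(payload).hexdigest(),
            })
            return result
        return wrapper
    return decorator


@tracked("normalize_features", owner="example-team")
def normalize(values):
    """Example step: scale values so they sum to 1."""
    total = sum(values)
    return [v / total for v in values]
```

The point of the design is that the author of `normalize` writes only business logic; the lineage record is produced by the shared component, so the tracking cannot be forgotten or implemented inconsistently.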

4. Investing in talent and effective business routines

Building state-of-the-art AI platforms requires a world-class, cross-functional team. An effective AI platform team must be multidisciplinary and diverse, inclusive of data scientists, engineers, designers, product managers, cyber and model risk experts and more. Each of these team members brings unique skills and experiences and has a key role to play in building and iterating on an AI platform that works for all users and can be extended over time.

At Capital One, we have made it our mission to partner cross-functionally across the company as we build and deploy our AI platform capabilities. As we’ve sought to evolve our organization and build up our AI workforce, we established the Machine Learning Engineer role in 2021 and, more recently, the AI Engineer role to recruit and retain the technical talent that will help us stay at the frontier of AI and solve the most challenging problems in financial services.

Along the way, establishing and communicating well-defined roadmaps and change controls for platform users, and incorporating feedback loops into your planning and software delivery processes, are critical to ensuring your users stay informed, can contribute to what’s coming, and understand the benefits of the platform strategy you’re putting in place.

Future-proofing your foundations for AI

Building or transforming enterprise platforms for the AI era is no small task, but it will set your business up for greater agility and scalability. At Capital One, we’ve seen first-hand how these foundations can power AI/ML at scale to continue to drive value for our business and more than 100 million customers.

By laying the right technical foundations, establishing governance practices from the start, and investing in talent, your users could soon be empowered to leverage AI in well-governed ways across the business.

Abhijit Bose is Senior Vice President, Head of Enterprise AI and ML Platforms at Capital One.


VB Lab Insights content is created in collaboration with a company that is either paying for the post or has a business relationship with VentureBeat, and they’re always clearly marked. For more information, contact sales@venturebeat.com.


