Inside Uber’s Multicloud AI Reality: The Gap Between Data and Compute

Uber, one of the original ride-hailing services, built distributed infrastructure before most enterprises had even considered it. It ran Mesos before moving to Kubernetes three years ago. The company is now moving from on-premises infrastructure to multicloud, which has its pros and cons as Uber wrestles with how to optimize GPU usage across multiple cloud providers, juggle workloads and create a cohesive, converged infrastructure.

In a presentation at the co-located AI Day at KubeCon + CloudNativeCon North America in Atlanta this month, Andrew Leung, a senior staff engineer at Uber, offered a glimpse into what it’s like for a company that runs AI workloads to move from on premises to a multicloud approach.

The Uber story shows the divide between data and compute in enterprise architectures, with the fungibility of GPUs as a primary challenge. How companies adapt to using AI models will depend on their use cases and which cloud providers they use. For Uber, it now means managing multiple cloud providers to optimize workloads, which creates its own trade-offs.

The Challenges of Separating Data and Compute

Uber has used predictive models since 2018, Leung said. It now uses AI models across its enterprise for a car’s estimated time of arrival, pricing, fraud detection and the Uber Eats ranking feed. The company also uses large language models (LLMs) for customer-facing and internal tools, and it has started to use AI applications for merchant storefronts and agentic systems for internal workflows.

Uber uses cloud service providers for different use cases, but separating data and compute has affected its internal infrastructure. How its teams have assembled their Kubernetes stacks reflects how their data and compute are separated, which allows them to optimize for each cloud but makes it challenging to build out converged infrastructure.

The engineering teams, Leung said, maintain a data lake running on a single cloud provider and use a separate cloud provider for inference and other microservices. The next step: bridge the divide between the clouds they use. That in itself poses a challenge, even when it makes business sense, particularly for GPUs and GPU capacity.

Even for Uber, GPUs are scarce and expensive. When they are spread across multiple cloud providers, it becomes challenging to use them to their full potential. Working with GPUs is not exactly a seamless cloud native experience, Leung added, compared to managing CPUs, where portability is more straightforward.

“And so we end up having to think about use cases that can either reside entirely within one cloud provider so that I can put training and serving together, or I need to think about the use cases where it makes sense to actually pull the data from one provider to another, in order to facilitate being able to leverage that compute,” Leung said. “It doesn’t make it quite as seamless as it could be, and you have to be purposeful in how you think about what workloads you’re going to be converging together.”

And as for capacity? Silos and over-indexing present their own set of issues.

“We ended up with a Kubernetes infrastructure focused on batch and a Kubernetes infrastructure focused on microservices,” Leung said. “The distinction has been that the hardware was segregated at the cluster level, so we would have dedicated GPU clusters that were just serving GPU workloads, and a number of CPU-based clusters serving CPU workloads.

“But that’s led to siloing of the actual capacity and ends up sort of over-indexing on a Kubernetes cluster as an abstraction for hardware, rather than leveraging a lot of what we can do internally from Kubernetes itself.”
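Leung’s point about over-indexing on the cluster as a hardware abstraction maps to a familiar Kubernetes pattern: heterogeneous hardware can share one cluster, with node labels, taints and resource requests steering workloads onto the right pools. The Python sketch below, using the official kubernetes client, is purely illustrative and not Uber’s configuration; the pool label, taint, image and namespace names are assumptions.

```python
# Illustrative sketch (not Uber's setup): schedule a GPU workload onto a dedicated
# GPU node pool inside a shared cluster, rather than into a separate GPU-only cluster.
# The "pool=gpu-a100" label, "dedicated=gpu" taint and namespace are assumptions.
from kubernetes import client, config

def gpu_training_pod(name: str, image: str) -> client.V1Pod:
    container = client.V1Container(
        name="trainer",
        image=image,
        resources=client.V1ResourceRequirements(
            limits={"nvidia.com/gpu": "1"},  # request one GPU from the device plugin
        ),
    )
    return client.V1Pod(
        metadata=client.V1ObjectMeta(name=name, labels={"workload-type": "batch-training"}),
        spec=client.V1PodSpec(
            containers=[container],
            restart_policy="Never",
            node_selector={"pool": "gpu-a100"},  # steer the pod onto the GPU node pool
            tolerations=[  # GPU nodes are tainted so CPU-only pods stay off them
                client.V1Toleration(key="dedicated", operator="Equal",
                                    value="gpu", effect="NoSchedule")
            ],
        ),
    )

if __name__ == "__main__":
    config.load_kube_config()  # or load_incluster_config() when running inside a pod
    api = client.CoreV1Api()
    api.create_namespaced_pod(namespace="ml-batch",
                              body=gpu_training_pod("train-job-0", "registry.example/trainer:latest"))
```

Segregating hardware at the cluster boundary instead, as Leung describes, gives up this in-cluster scheduling flexibility and leaves GPU capacity stranded in its own silo.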
Disaster Recovery Overhead

Disaster recovery requires more overhead when running AI workloads, Leung said in response to Madhuri Yechuri, CEO of Elotl, who interviewed him at the co-located AI Day session. Again, the challenge of using GPUs comes into play; the patterns that work for CPUs don’t apply.

“That is increasingly complex for GPU infrastructure, given the cost and scarcity of it,” Leung said. “That’s much harder to stomach when you see how much extra overhead you need to carry for GPU and also, given the fact that GPU workloads aren’t quite as fungible as CPU workloads, where I can’t as easily just dynamically pack eight workloads onto one GPU now, where I could have just squeezed things onto a single CPU.”

Failovers are also problematic. You can’t just move GPU workloads around like CPU workloads.

“This thing is optimized for this particular hardware, with this particular configuration,” Leung said. “If I do a failover, I can’t necessarily easily just move it to a different hardware configuration on the fly.”

The Future of Agentic Workflows

Agentic workflows are a different matter for Uber. The company is building on and fine-tuning existing models through several initiatives to develop agentic systems for its internal tools and to provide LLM-based support across internal systems. But that work still represents a minority of GPU usage.

“But as we increase investment there, it has the potential to become larger scale,” Leung said. “The predictive models that we’ve been training aren’t likely to double in size, for example, because they generally grow along with business growth.”

There may come a day when Uber unlocks some agentic workflow and deploys it everywhere, he said, which would represent a considerable increase in what the teams need to support with GPUs. But what about utilization? That is a question Uber is still reviewing.

“A lot of our investments in those kinds of agentic workflows are often experimental, but still require a nontrivial footprint,” Leung said. “And as an experimental product, they’re not in production necessarily. So they’re not taking constant, always-on, high volumes of traffic, which makes it suspect as to whether or not, in the end, GPUs being provisioned for this thing is really the best use of them.”

Uber uses Ray, a unified framework for scaling AI and Python applications, almost exclusively for training. That infrastructure runs across the company’s regular batch infrastructure, which also runs Spark and other internal batch-processing workloads. A typical AI workflow consists of several steps for data preprocessing and ETL-like tasks; ETL-focused steps may run on Spark, while most training workloads are powered by Ray. Nvidia Triton powers predictive model serving as the adoption of vLLM increases for LLM use cases. Models get optimized through TensorRT for specific hardware, then run through the ONNX runtime.
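For a rough sense of the pattern Leung describes, the sketch below runs CPU-bound preprocessing and a GPU-backed training step as tasks on a shared Ray cluster. It is a generic Ray example, not Uber’s pipeline; the function names, shard paths and resource amounts are assumptions.

```python
# Minimal Ray sketch: CPU preprocessing tasks feeding a GPU training task on a
# shared cluster. Function names, shard paths and resource amounts are
# illustrative assumptions, not Uber's actual workflow.
import ray

ray.init()  # connects to an existing cluster if RAY_ADDRESS is set, else starts locally

@ray.remote(num_cpus=2)
def preprocess(shard_path: str) -> list[float]:
    # Stand-in for an ETL-style step; in practice this might be a Spark job instead.
    return [float(len(shard_path))]  # placeholder feature vector

@ray.remote(num_gpus=1)
def train(features: list[list[float]]) -> dict:
    # Stand-in for a GPU training step; a real job would build and fit a model here.
    return {"num_examples": len(features), "status": "trained"}

if __name__ == "__main__":
    shards = [f"s3://example-bucket/shard-{i}" for i in range(4)]  # hypothetical paths
    feature_refs = [preprocess.remote(p) for p in shards]          # fan out CPU work
    result = ray.get(train.remote(ray.get(feature_refs)))          # single GPU training task
    print(result)
```

Because the training task declares num_gpus=1, Ray will only place it on nodes that actually expose GPUs, which is what lets this kind of work share batch infrastructure with CPU-only jobs.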
An Unsolved Problem: Making GPUs More Fungible

For Uber, making GPUs more fungible remains an unsolved problem, Leung said, and addressing it starts with how the clusters are set up.

“Different GPU types with varying memory configurations and architectures can’t seamlessly substitute for each other,” Leung said. “Today, with how we have clusters set up, the cluster is the boundary. And so we’re moving away from the cluster as kind of its own silo there to help make that more fungible as the first step.”

The Uber teams had been running authentication and other services on underutilized GPUs. By moving those services to CPU clusters, they improved performance and freed up GPU capacity.

The quirks of GPUs also make observability a challenge. Uber uses Nvidia hardware but is considering AMD as well, and each vendor exposes different metrics.

Adapting to New Metrics

Adopting AI early has left Uber with accumulated technical debt. Its teams have revamped their stack, leveled up and modernized. They recently migrated off their old GPU metrics, which were based on cAdvisor and did not support newer models. Leung said Uber’s engineers have had to adapt to the GPUs and to the differences in metrics.

“What had happened is we advertise these low-level metrics, which then other teams began building their dashboards and metrics and systems around,” Leung said. “And then, when we think about, OK, we want to actually migrate this to a different metric set. Well, the whole company is pinned to this small set of metrics.”

The teams are exploring building their own API.

“You’re going to end up with a mix of a variety of different metrics and with nuances about what each of them means and how teams should understand it when they’re trying to think about high-level utilization or cost efficiency or memory usage,” Leung said. “And so what we’ve been doing there is trying to build metrics almost as an API, where we have specific platform metrics that we’re exposing, which we can then potentially source from a variety of vendor-specific metrics. It doesn’t require the user to be as deeply entrenched in any one specific vendor’s model metrics.”
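A platform-metrics layer of the kind Leung sketches could be as simple as a translation table from vendor-specific counters to a small, stable set of platform metric names. The sketch below is hypothetical, not Uber’s API: the vendor field names (DCGM-style counters for Nvidia, assumed names for an AMD exporter) and the platform schema are illustrative assumptions.

```python
# Hypothetical sketch of "metrics as an API": normalize vendor-specific GPU
# counters into a small, stable set of platform metrics. Field names and
# mappings are illustrative assumptions, not Uber's actual schema.
from dataclasses import dataclass

@dataclass
class PlatformGPUMetrics:
    """Vendor-neutral metrics that dashboards and other teams pin to."""
    utilization_pct: float   # 0-100, share of time the GPU was busy
    memory_used_frac: float  # 0.0-1.0, used / total device memory

# Per-vendor mapping from raw counter names to platform fields (assumed names).
VENDOR_FIELD_MAP = {
    "nvidia": {  # DCGM-style counters
        "util": "DCGM_FI_DEV_GPU_UTIL",
        "mem_used": "DCGM_FI_DEV_FB_USED",
        "mem_total": "DCGM_FI_DEV_FB_TOTAL",
    },
    "amd": {     # assumed names for a ROCm-based exporter
        "util": "gpu_use_percent",
        "mem_used": "vram_used",
        "mem_total": "vram_total",
    },
}

def to_platform_metrics(vendor: str, raw: dict[str, float]) -> PlatformGPUMetrics:
    """Translate one vendor's raw sample into the platform schema."""
    fields = VENDOR_FIELD_MAP[vendor]
    return PlatformGPUMetrics(
        utilization_pct=float(raw[fields["util"]]),
        memory_used_frac=raw[fields["mem_used"]] / raw[fields["mem_total"]],
    )

if __name__ == "__main__":
    sample = {"DCGM_FI_DEV_GPU_UTIL": 83, "DCGM_FI_DEV_FB_USED": 30000, "DCGM_FI_DEV_FB_TOTAL": 81920}
    print(to_platform_metrics("nvidia", sample))
```

Downstream dashboards would pin to the platform schema, so adding or swapping a hardware vendor changes only the mapping layer rather than every consumer.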

Source: This article was originally published on The New Stack
