How ThirdAI is Democratizing AI Through CPUs
Dr. Anshumali Shrivastava is an Associate Professor of Computer Science at Rice University. He is the CEO of ThirdAI, a company that builds hash-based processing algorithms that accelerate the training and inference of neural networks. Anshu sat down with Neeraj Hablani at the Neotribe office in Menlo Park.
Introducing Anshumali
Neeraj Hablani (NH): Welcome, Dr. Shrivastava! It’s a privilege to chat with you about AI efficiency, generative AI, large language models, and the state of the industry. As someone who has spent the better part of the last decade on these subjects, you bring a perspective we’re lucky to hear.
Dr. Anshumali Shrivastava (AS): Thanks, Neeraj, for having me. It is always a pleasure. There is so much excitement about these topics, and having been an educator in the field for a while, I can see how these terms can carry altogether different connotations for different people.
A New Epoch of Technology?
NH: When we look back at the history of computing, a dominant enterprise emerges in each era: IBM had a near monopoly on mainframes from the 1950s to the 1970s; Microsoft (operating systems) and Intel (chips) owned the PC market in the 1980s and 90s; the Apple and Google duopoly has consumed the mobile market from the late 2000s to the present; and the hyperscalers own the cloud computing market that AWS initiated in 2006. Do you believe that generative AI represents the beginning of a new epoch of technology, and if so, who might you expect to have a competitive advantage in this industry?
AS: I think a lot of people will agree with that. Even though generative AI still has much maturing to do, the promise it has shown is remarkable. A few years ago, people would show simple lab demos of code completion, question answering, and translation. Now, the pace at which those demonstrations have gone from curated demos into customers' hands is striking. There is no doubt that generative AI represents something we have not seen in the past, and that is where the hope is.
The foundation of generative AI is a mix of both hardware and smart algorithms. However, in an era where Moore's Law has ended, there are limits to how much hardware alone can improve. Software, algorithms, and implementations with the highest level of efficiency matter even more. It is evident to me that the companies with the capability to train, retrain, deploy, fine-tune, and iterate on their foundational AI models will have a competitive advantage.
What are Large Language Models?
NH: Thanks for that overview. I'm excited to go into some of the end use cases that you see with this new wave of technology. As a starting point, let's cover some of the fundamentals – what large language models are, how they are created, and what makes them so transformative.
AS: That's a great question. The term ‘large language model’ has become very popular. To start, let's touch on some history. In the late nineties, Google’s search engine highlighted how powerful information retrieval can be – across documents and webpages we could do impressive keyword-based search. However, language understanding was non-existent. If you searched for ‘water’, you might also mean ‘agua’ or related concepts that go by different keywords. To this end, people initially created rule-based systems; however, it wasn’t until the coupling of natural language processing and deep learning that we made dramatic progress. In the last few years, people realized that if we trained a very large neural model on a very large corpus of data, we could show remarkable improvement in our capability to do generation, translation, or natural language understanding. And this was the birthplace of the term ‘large language models’, or ‘LLMs’. As in, you have a very large model, something like tens of billions to hundreds of billions of parameters. And these models are trained to solve very specific language-related self-supervised tasks. What is most intriguing about these large language models is that problems are being solved not by truly understanding the language, but by memorizing it instead. Literally, this technology is about making a large neural network memorize the internet (or any text data it can get a hold of).
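[To make the "self-supervised task" concrete, here is a minimal sketch of the next-token-prediction objective most LLMs are trained with. It uses plain PyTorch with made-up, tiny sizes; it is not ThirdAI code, and real LLMs are vastly larger.]

```python
# Minimal sketch of self-supervised next-token prediction: the corpus itself
# supplies the labels, so no human annotation is needed. Sizes are illustrative.
import torch
import torch.nn as nn

vocab_size, embed_dim, seq_len = 50_000, 512, 128

embed = nn.Embedding(vocab_size, embed_dim)
encoder = nn.TransformerEncoder(
    nn.TransformerEncoderLayer(d_model=embed_dim, nhead=8, batch_first=True),
    num_layers=2,
)
to_logits = nn.Linear(embed_dim, vocab_size)

tokens = torch.randint(0, vocab_size, (4, seq_len))   # a batch of token ids
inputs, targets = tokens[:, :-1], tokens[:, 1:]       # shift by one: the text labels itself

# Causal mask (-inf above the diagonal) so each position only sees earlier tokens.
L = inputs.size(1)
mask = torch.triu(torch.full((L, L), float("-inf")), diagonal=1)

logits = to_logits(encoder(embed(inputs), mask=mask))  # (batch, seq-1, vocab)
loss = nn.functional.cross_entropy(
    logits.reshape(-1, vocab_size), targets.reshape(-1)
)
loss.backward()   # dense backpropagation touches every parameter
```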
The Importance of Efficiency
NH: These LLMs depend on troves of data, and you are no stranger to large datasets. In your PhD thesis, you wrote: “the unprecedented opportunity provided by learning algorithms in improving our life is bottlenecked by our ability to exploit large amounts of data.” What inspired you to focus on AI efficiency, and what developments in this field excite you the most?
AS: When you start a PhD, you know you are committing to a discipline for the next five years or more. So the first thing you want to do is be honest with yourself about your intuition. When I was starting out in the field, deep learning was gaining popularity and AI efficiency was the only thing that felt real to me. As I gained experience through a couple of internships and conversations with people in the industry, it was evident that model performance improved with larger datasets and larger model sizes. Thus, it became obvious that efficiency would be a bottleneck to scale. Take GPT, for example: GPT-1 was ‘good enough’, GPT-2 was ‘great’, and GPT-3 ‘stunned the world’ – and the difference is just model size and scale. As you bring efficiency into the picture, we will continue to see even more exciting things in the future.
Why did you start ThirdAI?
NH: A few years ago, you decided it was time to start a company. Aside from becoming a tenured professor – congratulations, by the way – why was the timing right to begin your entrepreneurial journey? Talk us through the founding of ThirdAI.
AS: When I started thinking about training large neural networks, I realized that we still use algorithms that were developed in the 80s. And at that time, efficiency and scale were not as relevant – the main concern was validating that the approach could work. So, in 2016 and 2017 we designed a new algorithm centered around a concept called dynamic sparsity. And in 2018 and 2019 we showed that our models on a simple CPU could outperform the very best GPUs. We published results along with collaborators from Intel, and we started getting a lot of interest. But when I looked at the existing software stack for AI and how convenient it was to prototype with, I realized that it was hard for people to break out of their comfort zone and build something fundamentally different. In 2021, my co-founder, Tharun, expressed to me that he thought what we had was unique and too valuable not to go all-in on. Our first four or five engineers were so excited by ThirdAI’s potential that they turned down generous offers from Silicon Valley companies like Databricks, Facebook, Google, and others. And so, ThirdAI was born.
Why is ThirdAI’s breakthrough novel?
NH: That core team demonstrated that hash-based sparsity-inducing algorithms on commodity CPU hardware (be it Intel, AMD, ARM) can outperform state-of-the-art NVIDIA GPUs by more than an order of magnitude for training large neural networks. First, can you define these terms and second, can you explain your approach and the associated impact?
AS: Imagine we have a billion-parameter network and I show it a specific pattern – the model uses a billion operations to consume the pattern and another billion to update the parameters if the model has made any mistakes. These operations are typically overwhelming for commodity hardware like x86 CPUs. So, it was very natural to advocate for specialized infrastructure or hardware that could handle them, and this is where GPUs come into the picture. But any specialized hardware has its own shortcomings – it cannot have a lot of memory, it can only do very specialized forms of computation, and it brings friction. Most of the commodity systems that we talk about – the whole ecosystem of software – sit on CPUs, and that software stack has been optimized on this hardware for, let's say, ten years. However, if I'm an enterprise adopting large AI models, I have to integrate one extra piece of hardware into the system, and that hardware comes with its own proprietary drivers. Now you have to interact with those drivers, and that creates engineering friction. So, if an enterprise wants to go down the AI route, it has to invest a lot of money in buying that specialized hardware, and then invest more in finding people who understand the mixed workload and can optimize it.
I'm a fan of simplicity. To me, this additional hardware requirement for getting AI working is troublesome. So we worked on a sparsity-inducing algorithm: if we could make CPUs better than, or as good as, GPUs, then we could remove this friction. In order to democratize AI, we need to enable AI on computing hardware that everybody has, that is easy to understand, that is not hard to find, and that everybody is comfortable coding for. That is why we focused on standard x86 and other commodity hardware.
An Explanation of Dynamic Sparsity
NH: Your ability to show that commodity hardware can outperform GPUs is impressive on its own. When we consider that more than 80% of server hardware is built on the x86 architecture, the ThirdAI approach is even more compelling. The compatibility, availability, and affordability become tough to match. Talk us through how dynamic sparsity works.
AS: Yeah, the idea here is the crux of ThirdAI. Let's come back to our example where I have a model with a billion parameters, and let's say I show it an image of a cat. There are 1 billion operations to process the image of a cat into a prediction. Let's say the model predicts ‘dog’. The deep learning system will then compute and back-propagate an error – another billion operations – and update the billion parameters to make sure this error has been corrected. So, effectively there are 3 billion operations here. If you look at all the updates, you will realize that, wait a second, even though I am doing 3 billion operations, only about 1,000 of the updates are significant. 99.9% of the updates are either zero or near zero. This wasteful way of doing the computation is the birthplace of dynamic sparsity. In our systems, we only pick the weights that are relevant. We use an information retrieval system inside the neural network to figure out, let's say, the relevant 10,000 parameters for a given pattern, and then only update those 10,000 instead of updating the billions.
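[A quick way to see the waste Anshu describes is to inspect the gradient of a single wide, densely trained layer after one example: almost every entry is zero or vanishingly small. The sketch below is an illustration with made-up sizes and a synthetic sparse input, not ThirdAI's implementation.]

```python
# Illustration: after one training example, count how many weight updates
# in a wide dense layer are actually significant. Sizes are made up.
import torch
import torch.nn as nn

layer = nn.Linear(1024, 50_000)              # ~51M weights standing in for "many concepts"
x = torch.zeros(1, 1024)
x[0, torch.randperm(1024)[:32]] = 1.0        # a sparse input pattern

loss = nn.functional.cross_entropy(layer(x), torch.tensor([17]))
loss.backward()                               # dense backprop computes every entry anyway

grad = layer.weight.grad
significant = (grad.abs() > 1e-4).sum().item()
print(f"{significant:,} of {grad.numel():,} weight updates exceed 1e-4 "
      f"({significant / grad.numel():.6%})")
# Nearly all updates are zero or near zero; dynamic sparsity aims to find and
# compute only the handful that matter, instead of all of them.
```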
Let me offer a better analogy. Imagine the world has millions of concepts – call them eyes, noses, ears, wheels, steering wheels, antennas, windows, roofs, and so on. When you show me as a pattern, I am sparse in those concepts: I have eyes, a nose, and ears, but I don't have antennas, wheels, or windows. When you show a pattern of a bus, the bus is also sparse in those concepts, because the bus does not have eyes, a nose, or ears, but it does have a roof, antennas, and windows. And the same thing happens if you show a house. To me it's very clear that there needs to be a large model, in the sense that you need all the parameters, or concepts. This is where we separate the terms dynamic sparsity and static sparsity – note that sparsity has become an overloaded word.
With static sparsity, you are saying that a lot of my parameters are not important and can be thrown away. We are not saying that. We are saying that all the weights are important, but in order to train and infer efficiently, every input is sparse in that representation. So when you show us a pattern – let's say a face – the face queries the network and only retrieves the parameters relevant to the concepts present in the face. It will probably only retrieve eyes, nose, ears, and mouth, and it will not even touch concepts like wheels, antennas, windows, or roofs, because we don't need them. Then, whatever is picked, you feed forward on it, make a prediction based on it, back-propagate, and update those parameters. But because of the sparsity, the number of operations becomes only a few thousand instead of 3 billion.
And at that point it changes the whole equation, because there are no longer so many operations that you require specialized hardware – you can just as well do it on a CPU. That's more or less what it amounts to. Now the question is: how do I determine which parameters are relevant? That is a standard information retrieval problem. When you show me the pattern of a face, how do I know that a nose and ears are present and other concepts are not? The puzzle is very similar to web search. When you type, let's say, Neotribe or ThirdAI into Google, Google doesn't take that keyword and match it against every web page in the world. Google has indexed the web, so the word Neotribe hits the website of Neotribe directly and ThirdAI hits the website of ThirdAI directly, rather than touching everything. In exactly the same way, given a pattern, we figure out the relevant parameters – very much like an information retrieval system.
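[Below is a rough sketch of that retrieval step in the spirit of ThirdAI's published SLIDE line of work: hash every neuron's weight vector into buckets ahead of time, then hash the incoming pattern and touch only the neurons that land in the same bucket. This is my own simplified illustration with one hash table and made-up sizes, not ThirdAI's implementation; production systems use several tables and more sophisticated hash functions.]

```python
# Sketch of hash-based neuron retrieval (signed random projections), so a
# forward pass touches only a handful of a layer's neurons. Sizes are made up.
from collections import defaultdict
import numpy as np

rng = np.random.default_rng(0)
n_neurons, dim, n_bits = 100_000, 256, 16

W = rng.standard_normal((n_neurons, dim)).astype(np.float32)     # one weight row per neuron
planes = rng.standard_normal((n_bits, dim)).astype(np.float32)   # random hyperplanes
powers = 1 << np.arange(n_bits)

# Index every neuron once, ahead of time: one bit per hyperplane, packed into a code.
codes = ((W @ planes.T) > 0) @ powers
buckets = defaultdict(list)
for neuron_id, code in enumerate(codes):
    buckets[int(code)].append(neuron_id)

# At query time, hash the input and retrieve only the matching neurons.
x = rng.standard_normal(dim).astype(np.float32)
active = buckets.get(int(((planes @ x) > 0) @ powers), [])

activations = W[active] @ x          # feed forward over the retrieved neurons only
print(f"touched {len(active)} of {n_neurons:,} neurons")
# Backpropagation would likewise update only these rows of W. Real systems
# union the results of several hash tables for better recall.
```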
What is the competitive landscape?
NH: Talk us through the competitive landscape. How does your approach differ from other techniques, are there any notable drawbacks to sparsity, and what advantages does ThirdAI have in terms of training and inference?
AS: All the existing techniques that we know of – PyTorch, TensorFlow, CUDA, oneAPI – are based on dense matrix multiplication. They do not use sparsity. In fact, sparsity is a problem for these systems, because they are designed for dense operations. There is a high-performance computing (HPC) term called vectorization, which says that you can do many operations together more efficiently than doing each one independently. With sparsity, you lose that advantage because of random memory access, which cannot leverage vectorization – that is one reason sparsity is essentially orthogonal to vectorized, dense operations. But having said that, look at the difference: in one world I'm doing 3 billion operations and utilizing vectorization, and in the other world I'm getting the same accuracy with only a few thousand operations. I think the difference is drastic enough to argue the case for doing sparse operations.
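[The trade-off can be seen in a few lines: the dense path does vastly more arithmetic but enjoys vectorized matrix multiplication, while the sparse path gathers a small subset of rows with random access. The sketch below just counts FLOPs and wall-clock time for one layer with made-up sizes; the neuron selection is random here, standing in for the hash-based retrieval described above.]

```python
# Dense (vectorized) versus sparse (gathered) forward pass for one wide layer.
import time
import numpy as np

rng = np.random.default_rng(0)
dim, n_neurons, k = 512, 100_000, 1_000          # k = neurons picked per input

W = rng.standard_normal((n_neurons, dim)).astype(np.float32)
x = rng.standard_normal(dim).astype(np.float32)
active = rng.choice(n_neurons, size=k, replace=False)   # stand-in for hash retrieval

t0 = time.perf_counter()
dense = W @ x                  # vectorized, ~2 * n_neurons * dim FLOPs
t1 = time.perf_counter()
sparse = W[active] @ x         # random gather, ~2 * k * dim FLOPs
t2 = time.perf_counter()

print(f"dense : {2 * n_neurons * dim:>12,} FLOPs   {(t1 - t0) * 1e3:6.2f} ms")
print(f"sparse: {2 * k * dim:>12,} FLOPs   {(t2 - t1) * 1e3:6.2f} ms")
```

[The gather forfeits some of the vectorization advantage, but cutting the arithmetic by roughly two orders of magnitude is the point being argued above.]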
We are talking about the capability to train and infer with large language models using dynamic sparsity two to three orders of magnitude faster and cheaper. Remember, we do both the forward propagation and the backward propagation in a sparse way – so we get both inference and training efficiency. And because I'm not relying on hardware to make my operations faster, but instead using smart ideas to reduce the arithmetic itself, I also get an improved energy footprint: I burn less carbon than the alternative dense approaches.
What do customers see when using ThirdAI?
NH: The ThirdAI approach unlocks more accurate models, drives performant solutions, and improves latency in a dramatically more carbon-friendly manner. What benefits do customers see when they implement ThirdAI, and how does ThirdAI improve performance for their users?
AS: Let's talk about the usability perspective first. You won’t notice any difference in how to deploy ThirdAI versus any other solution. We have made sure that our solution is easily integrated – so if you're using any open source version of large language models or if you're using any APIs, you can, in the same way, use our APIs. So from a developer perspective or from an integration perspective, you won't even feel the difference.
But what you will surely feel is a difference in how fast your model can train, how fast you can refresh it, and how fast you can run inference. Every customer we talk to wants an AI specialized for them. Right now we are talking about large language models as one model for everything, but we don't anticipate the future will be like that. The future will be personalized LLMs for enterprise-specific use cases. For example, in a grocery catalog the words Apple and Google should not have any relationship, but models trained over the web, like GPT, will treat the two words as very similar.
I think the future is going to be one where customers want a large language model customized for their needs and personalized for their taste. Every customer wants ownership of the model and the capability to fine-tune, retrain, and refresh it. With ThirdAI this is very easy, because we can train large language models very fast on CPUs. We also offer on-prem solutions and allow customers to own their large language models. You cannot get that done with existing solutions, because you either have to transmit your data or run an open source model on a lot of GPUs. With ThirdAI you can easily validate whether a large language model is making a difference to your business without needing expensive GPUs and a large, specialized data science team.
The Wayfair Case Study
NH: One of your collaborators, Wayfair, published a case study that highlights how Wayfair leveraged ThirdAI’s technology to drive hyper-relevant search results for its customers. Wayfair talked about the evolution of their query classifier model from a set of logistic regression models based on n-grams, to a Convolutional Neural Network (CNN), to a model reliant on multiple filters working in parallel. Their data science team was trying to figure out how to make the third instantiation of the model even more performant. Talk us through that collaboration. How did Wayfair leverage ThirdAI to drive hyper-relevant search results for its customers?
AS: Wayfair published an excellent overview in the blog, and this is a classic case of what e-commerce engines are going to face. Let's say you type “red shirt for a toddler”. Whenever you type a query like that, the search engine wants to save you time and drive search relevance, so a lot of AI models work on understanding the query – what do you really mean when you say toddler? At this point you need a lot of heavy lifting from existing natural language processing and other tools out there. But the problem in e-commerce is latency: most search engines require the AI to compute its answer in less than five to ten milliseconds. If you go to open source models, you’ll notice that these models are expensive to train and will not meet the latency barrier you are looking for. So one story is that the models are expensive to train. But let's say you figured out a way to train them; even after training, when you deploy these models you are running inference. Every time a query is typed into the search engine, it goes through these language models, or whatever neural models you have built, and their latency is typically a few hundred milliseconds. That latency is prohibitive. There is a well-known Amazon finding that every 100 milliseconds of added response time costs about 1% of sales. Now, that's huge.
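[A common way to check the constraint Anshu describes is to measure per-query inference latency against the budget. The sketch below uses a trivial stand-in classifier and a hypothetical 10 ms budget; it is not Wayfair's or ThirdAI's code.]

```python
# Measure per-query latency of a query classifier against a latency budget.
import time
import numpy as np

def p99_latency_ms(predict, queries, warmup=50):
    """Run the model over the queries and report 99th-percentile latency in ms."""
    for q in queries[:warmup]:
        predict(q)                                   # warm caches / lazy init
    timings = []
    for q in queries:
        t0 = time.perf_counter()
        predict(q)
        timings.append((time.perf_counter() - t0) * 1e3)
    return float(np.percentile(timings, 99))

def classify_query(q: str) -> str:
    """Trivial stand-in for a real query-understanding model."""
    return "toddler apparel" if "toddler" in q.lower() else "unknown"

queries = ["red shirt for a toddler"] * 1_000
budget_ms = 10.0
p99 = p99_latency_ms(classify_query, queries)
print(f"p99 latency {p99:.3f} ms (budget {budget_ms} ms): "
      f"{'OK' if p99 <= budget_ms else 'over budget'}")
```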
So, what do you do? You have to innovate. You have to figure out ways to make the model itself faster – maybe with a cheaper, less accurate model, like a smaller model – or you figure out ways to make the model run faster in your system. But with ThirdAI's sparsity-based training and inference, you can get the best AI out there and still meet your latency demands. This was quite a learning experience, even for us, to understand that there is a dire need for efficient AI models in production.
The Cost and Carbon Footprint
NH: This is a clear win for the enterprise, given the increased revenue and reduced cost, as well as a win for the customer, given more accurate search results. In the present macro environment I am sure customers appreciate this value proposition.
Now, the cost to train generative AI models can be prohibitive to innovation and the length of training can range from days to months. Taking two examples: first, according to the CEO of Stability AI, it cost the team roughly $600,000 to train Stable Diffusion with 256 Nvidia A100s; and second, according to ChatGPT itself, “GPT-3 was trained using 3,175 NVIDIA V100 GPUs and 355 TPUs over the course of several months. While OpenAI has not disclosed the exact cost of training GPT-3, it is likely in the millions of dollars.” And these numbers are to train your system one time. Those who aim to re-train or fine-tune their system would need to pay that cost multiple times. Furthermore, complex queries to these systems may require inference on a GPU in the cloud. How does ThirdAI compare to OpenAI in terms of cost both for training and inference?
AS: Yeah, I mean, you are highlighting every customer's dilemma. One customer was looking to search through millions of documents they had in a data center. So they went to OpenAI and did a pricing calculation. They said it would probably cost them tens to hundreds of millions of dollars to have this service up and running. We expect that over the next few months and years the cost will come down, because such a pricing model is prohibitive to innovation. Everybody can look at ChatGPT and see that, well, if I have this in my system, then I am going to derive a lot of value. I can automate away inefficiencies, drive relevance, save call center time, and more. The internet is flooded with use cases.
However, even after you’ve discovered interesting use cases for your business, you won’t be able to take action because of the prohibitive cost. What excites me is that ThirdAI is materializing the promise. We have been working on these challenges for a few years and have the tools to train large neural networks in a very efficient way. As we publish on our website, we are several orders of magnitude cheaper both in terms of cost and carbon footprint compared to traditional ways of training large language models.
[As a fun fact, as per the Microsoft blog, “The supercomputer developed for OpenAI is a single system with more than 285,000 CPU cores, 10,000 GPUs and 400 gigabits per second of network connectivity for each GPU server.” That would rank it among the top five supercomputers in existence.]
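[For a rough sense of where a figure like the $600,000 Stable Diffusion estimate comes from, here is a back-of-envelope calculation. The GPU-hour price and wall-clock time are my own illustrative assumptions, not numbers from the interview.]

```python
# Back-of-envelope cost of one large training run. All inputs are assumptions.
gpus = 256                   # A100s, as in the Stable Diffusion example above
days = 24                    # assumed wall-clock training time
price_per_gpu_hour = 4.10    # assumed on-demand cloud price per A100-hour (USD)

gpu_hours = gpus * days * 24
cost = gpu_hours * price_per_gpu_hour
print(f"{gpu_hours:,} GPU-hours  ->  ~${cost:,.0f} for a single run")
# ~147,000 GPU-hours lands near $600k, and every retrain or major fine-tune
# pays a comparable bill again.
```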
ThirdAI’s Opportunity to Democratize AI
NH: To date, the hyperscalers have had the data and compute advantage. With petabytes of data and near-infinite compute, they’ve been the gatekeepers of scale. Large SaaS companies are reliant on the hyperscalers and pay, on average, an astounding 50% of their revenue in cloud costs. And this trend seems to hold for generative AI, like ChatGPT, too. On the ‘How I Built This’ podcast in September of 2022, Sam Altman said that part of his rationale for the Microsoft partnership was that 90% of the capital he raised was for compute. ThirdAI has the opportunity to break the hyperscalers’ stronghold, democratize AI development, and level the playing field for upstarts and incumbents.
AS: Absolutely correct. Most of the data infrastructure sits on CPUs, and over the years data has developed its own gravity. Now the cloud is on the verge of adding another source of gravity through AI, leading to a two-body problem in which each body has its own pull. As we highlighted earlier, today’s AI requires specific hardware that the hyperscalers provide, but there is friction, because the cloud where the database sits is not GPU-ready. And then, obviously, compute becomes the dominant cost. If you ask who can train something like ChatGPT for enterprises that have terabytes of data, it boils down to a handful of companies. To me, AI can only get perfected when a lot of people are trying it, building it, and refining it. So if we can give this capability to more people, we not only make progress in making AI efficient, we make progress toward better AI itself. AI needs to go through several iterations. I think everybody out there right now will agree that the current versions of large language models are going to go through several iterations and become far more efficient and far more available. But if only a few companies control those iterations, AI progress will be slow and monopolized.
So making AI available to everyone through a technological disruption like sparsity on CPUs is a huge opportunity. It not only advances large language models and puts them in the hands of customers with ownership, but it also advances AI as a whole and drives further innovation in fields like generative AI and ChatGPT.
Looking Ahead
NH: Thanks for sharing your perspective, Anshu. Incredibly insightful and thought-provoking. I can’t wait to see ThirdAI’s continued influence along its mission to democratize access to AI. Any last words you’d like to share?
AS: This is arguably the most exciting time to be working on efficiency and trying to make LLMs efficient. LLMs have stunned us with their capability. Enterprises want to give large language models a chance to tackle their hardest problems. And efficiency is going to play a key role here, because AI is all about scale and cost. We couldn’t be more excited to be at the center of this field, tackling scale and efficiency from a foundational perspective. Thanks, Neeraj. Thanks for the time.