Many companies are hopeful that AI will revolutionize their business, but the sheer cost of training advanced AI systems can quickly dash those hopes. Elon Musk has noted that engineering issues often cause progress to stall, especially when it comes to optimizing hardware like GPUs to efficiently handle the massive computational requirements of training and fine-tuning large language models.
While large technology companies can afford to spend millions, and sometimes billions, of dollars on training and optimization, smaller companies and startups are often left behind due to lack of funding in the short term. In this article, we discuss some strategies that may allow developers with limited resources to train AI models without incurring large costs.
In for a dime, in for a dollar
As we know, the creation and release of AI products, whether foundational models/large language models (LLMs) or fine-tuned downstream applications, rely heavily on specialized AI chips, specifically GPUs. These GPUs are very expensive and hard to obtain, which is why SemiAnalysis coined the terms “GPU rich” and “GPU poor” within the machine learning (ML) community. Training LLMs can be costly, primarily due to expenses associated with the hardware (including both acquisition and maintenance), rather than the ML algorithms or expertise.
Training these models requires extensive computation on powerful clusters, and the larger the model, the longer it takes. For example, training LLaMA 2 70B means exposing 70 billion parameters to 2 trillion tokens, which requires at least 10^24 floating point operations. If you’re GPU poor, should you give up? No.
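To see where that figure comes from, here is a quick back-of-the-envelope check using the common ~6 × parameters × tokens rule of thumb for transformer training compute (a rough approximation, not a figure from the model’s authors):

```python
# Rough training-compute estimate using the common ~6 * N * D
# approximation for transformer FLOPs (a rule of thumb, not an exact count).
params = 70e9   # LLaMA 2 70B: 70 billion parameters
tokens = 2e12   # trained on roughly 2 trillion tokens

flops = 6 * params * tokens
print(f"~{flops:.1e} FLOPs")  # ~8.4e+23, i.e. on the order of 10^24
```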
Alternative strategies
There are several strategies that technology companies are currently leveraging to find alternative solutions, reduce reliance on expensive hardware, and ultimately save costs.
One approach is to tune and streamline training hardware. This approach is still largely experimental and investment intensive, but it holds promise for future optimization of LLM training. Examples of such hardware-related solutions include custom AI chips from Microsoft and Meta, Nvidia and OpenAI’s new semiconductor initiative, Baidu’s single compute cluster, Vast’s rental GPUs, and Etched’s Sohu chips.
While this is an important step forward, this methodology is best suited for larger companies that can afford to invest heavily now to reduce future expenses, not for new entrants with limited funds who want to develop an AI product now.
What to do: Innovative software
With a low budget in mind, there is another way to optimize your LLM training and reduce costs through innovative software. This approach is more affordable and accessible to most ML engineers, whether they are seasoned professionals, AI enthusiasts, or software developers looking to enter the field. Let’s take a closer look at some of these code-based optimization tools.
Mixed Precision Training
What is it?: Imagine your company has 20 employees, but you rent office space for 200. Clearly, this is a waste of resources. A similar inefficiency occurs during model training, where ML frameworks often allocate more memory than is actually needed. Mixed precision training eliminates much of this waste, improving both speed and memory usage.
structure: To achieve this, lower-precision bfloat16 and float16 arithmetic is combined with standard float32 arithmetic so that fewer full-precision operations are performed at a time. This may sound like technical arcana to non-engineers, but it essentially means that AI models can process data faster and require less memory without meaningfully compromising accuracy.
Improvement indicators: This technique can improve execution times by up to 6x on GPUs and 2-3x on TPUs (Google’s Tensor Processing Units). Open-source frameworks such as Nvidia’s APEX and Meta AI’s PyTorch support mixed precision training and are easily available for pipeline integration. By implementing this technique, companies can significantly reduce GPU costs while maintaining acceptable model performance.
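As a minimal sketch of how this can look in practice, the snippet below uses PyTorch’s built-in automatic mixed precision (AMP) utilities; the tiny model and synthetic data are hypothetical placeholders, not part of the original article:

```python
import torch

# Hypothetical placeholder model and synthetic data for illustration.
model = torch.nn.Linear(512, 10).cuda()
optimizer = torch.optim.AdamW(model.parameters(), lr=1e-4)
loader = [(torch.randn(32, 512, device="cuda"),
           torch.randint(0, 10, (32,), device="cuda"))] * 10

scaler = torch.cuda.amp.GradScaler()  # scales the loss to avoid float16 underflow

for inputs, targets in loader:
    optimizer.zero_grad()
    # Run the forward pass in lower precision where it is safe to do so.
    with torch.cuda.amp.autocast():
        loss = torch.nn.functional.cross_entropy(model(inputs), targets)
    scaler.scale(loss).backward()  # backward pass on the scaled loss
    scaler.step(optimizer)         # unscales gradients, then updates weights
    scaler.update()                # adjusts the scale factor for the next step
```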
Activation Checkpointing
What is it?: If you have limited memory but can afford some extra training time, checkpointing might be the right technique. In short, it significantly reduces memory consumption by recomputing certain values instead of storing them, making LLM training possible without hardware upgrades.
structure: The main idea of activation checkpointing is to store a subset of important values while training the model, and recalculate the rest only when necessary. That is, instead of keeping all intermediate data in memory, the system only keeps what is important, freeing up memory space in the process. This is similar to the “cross that bridge when the time comes” principle, meaning that we don’t bother with less urgent issues until they require our attention.
Improvement indicators: In most cases, activation checkpointing reduces memory usage by up to 70%, but also extends the training phase by about 15-25%. This fair tradeoff allows companies to train large AI models on existing hardware without investing additional capital in infrastructure. The aforementioned PyTorch library supports checkpointing, making it easier to implement.
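As a minimal sketch, PyTorch exposes this via torch.utils.checkpoint; the two-layer block below is a hypothetical stand-in for something expensive, such as a transformer layer:

```python
import torch
from torch.utils.checkpoint import checkpoint

# Hypothetical stand-in for an expensive block (e.g. a transformer layer).
block = torch.nn.Sequential(
    torch.nn.Linear(512, 2048), torch.nn.GELU(), torch.nn.Linear(2048, 512)
).cuda()

x = torch.randn(32, 512, device="cuda", requires_grad=True)

# checkpoint() discards the block's intermediate activations after the
# forward pass and recomputes them during backward, trading compute for memory.
y = checkpoint(block, x, use_reentrant=False)
y.sum().backward()  # the backward pass triggers the recomputation
```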
Multi-GPU Training
What is it?: Imagine a small bakery that needs to produce a large number of baguettes quickly. With one baker working alone, it will probably take a long time. With a second baker, the process speeds up. Add a third baker, and it speeds up even more. Multi-GPU training works in much the same way.
structure: Instead of using one GPU, we use multiple GPUs at the same time, so the training of the AI model is distributed across them and they work in parallel. Logically, this is the opposite of the previous method, checkpointing, which reduces hardware costs at the expense of extended execution time. Here, we use more hardware, but utilize it fully, reducing execution time and, in turn, operational cost.
Improvement indicators: Below are three robust tools for training LLMs on a multi-GPU setup, listed in ascending order of efficiency based on experimental results; a minimal code sketch follows the list.
- DeepSpeed: A library specifically designed to train AI models using multiple GPUs, achieving speeds up to 10x faster than traditional training methods.
- FSDP: One of the most popular frameworks for PyTorch, it addresses some of DeepSpeed’s inherent limitations and improves computational efficiency by an additional 15-20%.
- YaFSDP: A recently released enhanced version of FSDP for model training that achieves 10-25% speedup over the original FSDP methodology.
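To make the idea concrete, here is a minimal FSDP sketch, assuming a distributed environment launched with torchrun; the tiny model is a placeholder standing in for a real LLM:

```python
import os
import torch
import torch.distributed as dist
from torch.distributed.fsdp import FullyShardedDataParallel as FSDP

# Assumes launch via `torchrun --nproc_per_node=<num_gpus> train.py`.
dist.init_process_group("nccl")
torch.cuda.set_device(int(os.environ["LOCAL_RANK"]))

# Hypothetical placeholder model; in practice this would be an LLM.
model = torch.nn.Sequential(
    torch.nn.Linear(512, 2048), torch.nn.GELU(), torch.nn.Linear(2048, 512)
).cuda()

# FSDP shards parameters, gradients, and optimizer state across GPUs,
# so each device holds only a fraction of the full model state.
model = FSDP(model)
optimizer = torch.optim.AdamW(model.parameters(), lr=1e-4)

x = torch.randn(32, 512, device="cuda")
loss = model(x).sum()
loss.backward()
optimizer.step()
dist.destroy_process_group()
```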
Conclusion
Using techniques such as mixed precision training, activation checkpointing, and multi-GPU training, even small and medium-sized businesses can make significant advances in AI, both in fine-tuning existing models and in creating new ones. These tools increase computational efficiency, speed up execution times, and reduce overall costs. Additionally, they allow larger models to be trained on existing hardware, reducing the need for expensive upgrades. By democratizing access to advanced AI capabilities, these approaches enable a wider range of technology companies to innovate and compete in this rapidly evolving field.
There is a saying that “AI will never replace you, but someone using AI will replace you.” The time to embrace AI is now, and with the strategies above, you can do so even on a budget.
Ksenia Se is the founder of Turing Post.