What Makes Large Language Models Expensive?
IBM Technology
19 min, 20 sec
The video provides an in-depth analysis of the various cost factors associated with implementing generative AI, specifically large language models (LLMs), in an enterprise setting.
Summary
- The video discusses seven cost factors: use case definition, model size and complexity, pre-training costs, inferencing costs, tuning methods, hosting requirements, and deployment options.
- Use cases dictate the type of generative AI needed; a pilot program is recommended to identify enterprise needs.
- Model size affects pricing; larger models with more parameters require more compute power.
- Pre-training an LLM from scratch is cost-prohibitive; leveraging a pre-trained model is an alternative.
- Inferencing involves the AI responding to prompts, with costs based on token usage.
- Tuning adjusts model parameters for specific tasks; fine-tuning and parameter-efficient fine-tuning are two methods with varying costs.
- Hosting is needed when using fine-tuned models or proprietary models; otherwise, an API can be used for inferencing.
- Deployment can be on the cloud (SaaS) or on-premises, with different cost implications for each.
Chapter 1
Introduction to the complex cost factors of implementing generative AI in enterprises.
- Discusses the need for enterprises to consider the full spectrum of costs beyond simply subscribing to a chatbot service.
- Illustrates the point with a story of a best man using ChatGPT to write a speech, demonstrating the consumer use case.
Chapter 2
Understanding the importance of defining use cases for generative AI.
- Highlights the need for specificity in generative AI applications to determine the appropriate compute resources.
- Recommends participating in a pilot to test and evaluate generative AI's efficacy for an enterprise's specific needs.
Chapter 3
How model size and complexity drive cost.
- Larger models with more parameters offer broader capability but require more compute power, so pricing scales with model size (a rough sizing sketch follows).
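The video does not give sizing formulas; as a rough illustration under my own assumptions, the memory needed just to hold a model's weights scales with parameter count times bytes per parameter:

```python
# Rough, illustrative sizing math (not from the video): weight memory for
# inference is roughly parameter_count * bytes_per_parameter.
def weight_memory_gb(params_billions: float, bytes_per_param: int = 2) -> float:
    """Estimate GPU memory (GB) needed just to hold model weights.

    bytes_per_param: 2 for fp16/bf16, 1 for int8, 4 for fp32.
    Ignores activations, KV cache, and framework overhead.
    """
    return params_billions * 1e9 * bytes_per_param / 1e9

# A 7B model in fp16 needs ~14 GB; a 70B model needs ~140 GB (multiple GPUs).
for size in (7, 70):
    print(f"{size}B params @ fp16 ≈ {weight_memory_gb(size):.0f} GB of weights")
```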
Chapter 4
The cost of pre-training a model from scratch versus starting from an existing one.
- Pre-training an LLM from scratch is cost-prohibitive for most enterprises; leveraging an existing pre-trained model is the practical alternative (a back-of-envelope estimate follows).
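For a sense of scale (my assumptions, not figures from the video), a widely cited rule of thumb puts pre-training compute at roughly 6 FLOPs per parameter per training token:

```python
# Back-of-envelope pre-training compute using the common ~6 * N * D FLOPs
# rule of thumb (N = parameters, D = training tokens). Illustrative only;
# the video quotes no formula, and real costs vary with hardware and efficiency.
def pretrain_cost_usd(params: float, tokens: float,
                      gpu_flops: float = 4e14,    # assumed sustained FLOP/s per GPU
                      usd_per_gpu_hour: float = 2.0) -> float:
    total_flops = 6 * params * tokens
    gpu_hours = total_flops / gpu_flops / 3600
    return gpu_hours * usd_per_gpu_hour

# e.g. a 70B-parameter model on 2T tokens under these assumptions:
print(f"~${pretrain_cost_usd(70e9, 2e12):,.0f}")   # roughly a million dollars
```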
Chapter 5
The costs associated with inferencing and the practice of prompt engineering.
- Inferencing is the process of generating a response from an LLM; providers typically bill per token consumed, as in the sketch after this chapter's bullets.
- Prompt engineering is a cost-effective way to tailor results without extensive model modifications.
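A minimal sketch of how token-based billing adds up, using made-up rates since the video quotes none:

```python
# Illustrative token-based pricing (the rates below are assumed, not IBM's):
# providers typically bill prompt (input) and completion (output) tokens
# separately, per 1,000 tokens.
def inference_cost_usd(prompt_tokens: int, completion_tokens: int,
                       usd_per_1k_in: float = 0.0005,
                       usd_per_1k_out: float = 0.0015) -> float:
    return (prompt_tokens / 1000) * usd_per_1k_in + \
           (completion_tokens / 1000) * usd_per_1k_out

# A long engineered prompt raises input-token cost, but is still far cheaper
# than modifying the model itself.
print(f"${inference_cost_usd(prompt_tokens=800, completion_tokens=300):.4f}")
```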
Chapter 6
Tuning as a method to improve LLM performance and its associated costs.
- Tuning adjusts a model's parameters for a specific task; it is typically billed by the hour, at rates that vary with the model and method.
- Different tuning methods, including fine-tuning and parameter-efficient fine-tuning (PEFT), trade performance against cost; the sketch below shows how PEFT shrinks the trainable parameter count.
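The video weighs these methods by cost rather than mechanics; as one concrete illustration, a LoRA-style adapter (one common PEFT technique, not named in the video) trains only a small fraction of the weights:

```python
# Illustrative comparison (LoRA is one PEFT method; all figures are assumed):
# full fine-tuning updates every parameter, while a low-rank adapter on a
# weight matrix W (d x d) trains only 2 * d * r parameters for rank r.
d, r, n_adapted_matrices = 4096, 8, 4 * 32    # assumed: 4 matrices in each of 32 layers
full_ft_params = 7e9                          # all 7B parameters
lora_params = 2 * d * r * n_adapted_matrices  # adapter parameters only
print(f"PEFT trains {lora_params / full_ft_params:.4%} of the parameters "
      f"({lora_params:,} vs {full_ft_params:,.0f})")
```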
Chapter 7
The necessity of hosting models for certain interactions and its cost implications.
- Hosting is required when using fine-tuned or proprietary models for interaction.
- API inferencing covers prompt engineering and parameter-efficient fine-tuning without any hosting on the enterprise side (a minimal call sketch follows).
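A minimal sketch of API-based inferencing, with a hypothetical endpoint and model name standing in for any provider:

```python
# Minimal sketch of API-based inferencing (no hosting on your side). The
# endpoint, model name, and payload shape here are hypothetical placeholders,
# not a real provider's API.
import requests

resp = requests.post(
    "https://api.example.com/v1/generate",           # hypothetical endpoint
    headers={"Authorization": "Bearer YOUR_API_KEY"},
    json={
        "model": "example-llm",                      # provider-hosted model
        "prompt": "Summarize our Q3 incident report in three bullet points.",
        "max_tokens": 200,
    },
    timeout=30,
)
resp.raise_for_status()
print(resp.json())
```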
Chapter 8
The final cost factor is the deployment environment: SaaS or on-premises.
- SaaS offers a predictable cost structure, shared GPU resources, and no maintenance.
- On-premises deployments grant full control over architecture and data but require significant infrastructure investment (see the break-even sketch below).
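A toy break-even comparison between the two options; every figure below is an assumption for illustration, not from the video:

```python
# Toy break-even comparison (all figures assumed): SaaS bills per token,
# while on-prem amortizes a GPU server plus operating costs over its lifetime.
saas_usd_per_1k_tokens = 0.002
onprem_server_usd = 250_000          # assumed hardware + setup
onprem_monthly_opex_usd = 5_000      # assumed power, space, staff share
lifetime_months = 36

onprem_total = onprem_server_usd + onprem_monthly_opex_usd * lifetime_months
breakeven_tokens = onprem_total / saas_usd_per_1k_tokens * 1000
print(f"On-prem pays off past ~{breakeven_tokens / 1e9:.0f}B tokens "
      f"over {lifetime_months} months")
```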