What Makes Large Language Models Expensive?

IBM Technology

19 min, 20 sec

The video provides an in-depth analysis of the various cost factors associated with implementing generative AI, specifically large language models (LLMs), in an enterprise setting.

Summary

  • The video discusses seven cost factors: use case definition, model size and complexity, pre-training costs, inferencing costs, tuning methods, hosting requirements, and deployment options.
  • Use cases dictate the type of generative AI needed; a pilot program is recommended to identify enterprise needs.
  • Model size affects pricing; larger models with more parameters require more compute power.
  • Pre-training an LLM from scratch is cost-prohibitive; leveraging a pre-trained model is an alternative.
  • Inferencing involves the AI responding to prompts, with costs based on token usage.
  • Tuning adjusts model parameters for specific tasks; fine-tuning and parameter-efficient fine-tuning are two methods with varying costs.
  • Hosting is needed when using fine-tuned models or proprietary models; otherwise, an API can be used for inferencing.
  • Deployment can be on the cloud (SaaS) or on-premises, with different cost implications for each.

Chapter 1

Introduction to Generative AI Costs

0:00 - 55 sec

Introduction to the complex cost factors of implementing generative AI in enterprises.

  • Discusses the need for enterprises to consider the full spectrum of costs beyond simply subscribing to a chatbot service.
  • Illustrates the point with a story of a best man using ChatGPT to write a speech, demonstrating the consumer use case.

Chapter 2

Evaluating Use Cases

0:55 - 2 min, 25 sec

Understanding the importance of defining use cases for generative AI.

  • Highlights the need for specificity in generative AI applications to determine the appropriate compute resources.
  • Recommends participating in a pilot to test and evaluate generative AI's efficacy for an enterprise's specific needs.

Chapter 3

Assessing Model Size

3:20 - 1 min, 39 sec

The impact of model size and complexity on the cost of generative AI.

  • Larger models with more parameters drive up compute and resource costs.
  • Vendors offer pricing tiers based on model size, and different models serve different use cases; the sketch after this list shows how tier pricing compounds at scale.
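
To make the tiering concrete, here is a minimal sketch of how per-token price differences compound over a steady workload. The tiers, parameter counts, and prices are illustrative assumptions, not any vendor's actual rates.

```python
# Hypothetical per-token prices by model tier. Real vendor rates vary,
# but larger models consistently cost more per token served.
PRICE_PER_1K_TOKENS = {
    "small (~3B params)": 0.0005,    # USD per 1K tokens, illustrative
    "medium (~20B params)": 0.002,
    "large (~70B+ params)": 0.01,
}

def monthly_cost(tokens_per_request: int, requests_per_day: int,
                 price_per_1k: float) -> float:
    """Rough monthly spend for a steady workload at a given price tier."""
    tokens_per_month = tokens_per_request * requests_per_day * 30
    return tokens_per_month / 1000 * price_per_1k

for tier, price in PRICE_PER_1K_TOKENS.items():
    cost = monthly_cost(tokens_per_request=1500, requests_per_day=10_000,
                        price_per_1k=price)
    print(f"{tier}: ${cost:,.0f}/month")
```

At 10,000 requests a day, the gap between the smallest and largest assumed tier is already a factor of 20 in monthly spend, which is why matching model size to the use case matters.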

Chapter 4

Understanding Pre-Training Costs

5:00 - 1 min, 8 sec

The prohibitive costs of building and training a large language model from scratch.

  • Pre-training an LLM requires significant compute resources, time, and effort, as the rough estimate after this list suggests.
  • Using a pre-trained model is a more feasible alternative for most enterprises.
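
For a sense of scale, a common back-of-envelope rule estimates training compute as roughly 6 × N × D floating-point operations, where N is the parameter count and D is the number of training tokens. The sketch below applies that rule with assumed hardware throughput, utilization, and cloud pricing; every figure is illustrative.

```python
# Back-of-envelope pre-training estimate using the widely cited
# ~6 * N * D approximation for training FLOPs (N = parameters,
# D = training tokens). Every number below is an assumption.
params = 70e9            # a 70B-parameter model
tokens = 2e12            # 2T training tokens
train_flops = 6 * params * tokens

gpu_peak_flops = 312e12  # approx. A100 bf16 peak throughput
utilization = 0.4        # assumed real-world training efficiency
gpu_hours = train_flops / (gpu_peak_flops * utilization) / 3600

price_per_gpu_hour = 2.0  # assumed cloud rate, USD
print(f"~{gpu_hours:,.0f} GPU-hours, roughly "
      f"${gpu_hours * price_per_gpu_hour:,.0f} in compute")
```

Even under these optimistic assumptions, a 70B-parameter model lands in the millions of dollars for compute alone, before data, staffing, and failed runs, which is why starting from a pre-trained model is usually the practical choice.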

Chapter 5

Inferencing and Prompt Engineering

6:08 - 2 min, 18 sec

The costs associated with inferencing and the practice of prompt engineering.

  • Inferencing is the process of generating a response from an LLM, with costs based on token usage (a per-request cost sketch follows this list).
  • Prompt engineering is a cost-effective way to tailor results without extensive model modifications.
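
Many vendors price input (prompt) tokens and output (completion) tokens separately. Here is a minimal per-request cost sketch under that pattern, with placeholder rates rather than any specific provider's pricing.

```python
# Minimal per-request cost sketch, assuming the common vendor pattern
# of separate input (prompt) and output (completion) token prices.
# Rates are placeholders, not any specific provider's pricing.
INPUT_PRICE_PER_1K = 0.0005    # USD per 1K prompt tokens (assumed)
OUTPUT_PRICE_PER_1K = 0.0015   # USD per 1K generated tokens (assumed)

def request_cost(prompt_tokens: int, completion_tokens: int) -> float:
    """Cost of a single inference call at the assumed rates."""
    return (prompt_tokens / 1000 * INPUT_PRICE_PER_1K
            + completion_tokens / 1000 * OUTPUT_PRICE_PER_1K)

# A richly engineered prompt spends more input tokens per call.
print(f"${request_cost(prompt_tokens=1200, completion_tokens=400):.4f} per request")
```

A heavily engineered prompt spends more input tokens per call, but that is often far cheaper than tuning or hosting a model.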

Chapter 6

The Role of Tuning in Generative AI

8:26 - 2 min, 42 sec

Tuning as a method to improve LLM performance and its associated costs.

  • Tuning involves adjusting model parameters and is measured in hours with varying rates.
  • Different tuning methods, including fine-tuning and parameter-efficient fine-tuning, offer trade-offs between performance and cost (illustrated in the sketch after this list).
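
One way to see the cost gap is to compare trainable parameter counts. Full fine-tuning updates every weight, while parameter-efficient methods such as LoRA-style adapters train only small added matrices. The layer count, hidden size, rank, and adapter placement below are assumptions for illustration.

```python
# Why parameter-efficient fine-tuning (PEFT) is cheaper: it updates far
# fewer weights than full fine-tuning. The LoRA-style adapter sizing
# below is simplified, and all dimensions are assumed for illustration.
hidden_size = 8192
num_layers = 80
full_params = 70e9        # full fine-tuning touches every weight

lora_rank = 16
adapters_per_layer = 4    # e.g., the attention projections (assumed)
# Each adapter adds two low-rank matrices: (hidden x rank) and (rank x hidden).
lora_params = num_layers * adapters_per_layer * 2 * hidden_size * lora_rank

print(f"Full fine-tuning: {full_params:.2e} trainable parameters")
print(f"LoRA-style PEFT:  {lora_params:.2e} trainable parameters "
      f"({lora_params / full_params:.4%} of the model)")
```

Training well under 1% of the weights translates directly into fewer GPU-hours, which is why parameter-efficient fine-tuning sits at the cheaper end of the trade-off.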

Chapter 7

Hosting and Interaction Options

11:08 - 3 min, 52 sec

The necessity of hosting models for certain interactions and its cost implications.

  • Hosting is required when using fine-tuned or proprietary models for interaction.
  • API inference suffices for prompt engineering and parameter-efficient fine-tuning, with no hosting required (a hypothetical call is sketched after this list).
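
This is what API-based inferencing looks like in practice: the vendor hosts the model and bills per token, and the enterprise runs no model infrastructure of its own. The endpoint URL, payload shape, and model identifier below are hypothetical placeholders, not a real provider's API.

```python
# Sketch of API-based inferencing: the vendor hosts the model and bills
# per token; the enterprise runs no model infrastructure. The endpoint,
# payload shape, and model identifier are hypothetical placeholders.
import requests

response = requests.post(
    "https://api.example-llm-vendor.com/v1/generate",  # hypothetical URL
    headers={"Authorization": "Bearer YOUR_API_KEY"},
    json={
        "model": "vendor-medium-20b",  # hypothetical model name
        "prompt": "Summarize our Q3 support tickets in three bullets.",
        "max_tokens": 300,
    },
    timeout=30,
)
print(response.json())
```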

Chapter 8

Deployment Choices and Costs

15:00 - 4 min, 18 sec

The final cost factor considers the deployment environment, whether SaaS or on-premises.

  • SaaS offers a predictable cost structure, shared GPU resources, and no maintenance.
  • On-premises deployments grant full control over architecture and data but require significant infrastructure investment; a rough break-even comparison follows this list.
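
Here is a rough break-even comparison between the two options. Every figure is an assumption for illustration; real subscription rates, hardware quotes, and operating costs vary widely.

```python
# Rough break-even comparison between SaaS and on-premises deployment.
# Every figure is an assumption for illustration; real quotes vary widely.
saas_monthly = 20_000        # subscription + usage, USD (assumed)

onprem_hardware = 500_000    # GPU servers, networking, install (assumed)
onprem_monthly_ops = 8_000   # power, staff, maintenance (assumed)

months = 36                  # planning horizon
saas_total = saas_monthly * months
onprem_total = onprem_hardware + onprem_monthly_ops * months

print(f"SaaS over {months} months:    ${saas_total:,}")
print(f"On-prem over {months} months: ${onprem_total:,}")
```

Under these particular assumptions the two land in the same ballpark over three years; in practice the decision often hinges as much on data control and utilization as on headline price.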
