Why Self-Hosting Small LLMs Is Cheaper Than GPT-4: A Breakdown

Introduction

The world of AI and natural language processing is advancing at an astonishing rate. One of the most exciting developments is the emergence of large language models (LLMs) like GPT-4, which can generate human-like text, answer questions, and perform a wide range of language-related tasks. However, the cost associated with running such models can be prohibitive for many individuals and businesses. In this blog post, we'll explore why self-hosting small LLMs can be a cost-effective alternative to GPT-4, breaking down the economics of both approaches.

The Future of LLMs

With the release of LLaMa-2, the monopoly of big tech companies in the LLM space is diminishing. As the open-source community continues to grow and innovate, it is poised to match or even surpass the quality of API-based models. The speed and cost-effectiveness of self-hosted models have the potential to revolutionize industries that rely on LLMs for production.

ChatGPT and Open-Source LLMs

The journey of large language models started with Google's introduction of the Transformer architecture in 2017, which paved the way for models like BERT, GPT, and BART. OpenAI, in particular, played a pivotal role in the development of these models:

GPT-1 (2018): With 117 million parameters, GPT-1 marked a significant step in generating human-like sentences and paragraphs.
GPT-2 (2019): Boasting 1.5 billion parameters, it generated more meaningful and extended texts.
GPT-3 (2020): A milestone with 175 billion parameters, GPT-3 could handle a wide range of tasks, from text generation to chatbot capabilities.
GPT-3.5 (2022): An upgrade with conversational data training, GPT-3.5 further improved the model's ability to understand, engage, and respond to natural language queries.
GPT-4 (2023): The latest in the series, GPT-4 is a multimodal large language model, available through both a paid ChatGPT subscription and a commercial API.

The success of ChatGPT also spurred several open-source LLMs, such as Meta's LLaMa, whose largest model has 65 billion parameters and which Meta reported as outperforming GPT-3 on most benchmarks. Meta then released LLaMa-2 in partnership with Microsoft, with enhanced performance and features. It comes in 7B, 13B, and 70B parameter sizes, each available as a pre-trained model and as a fine-tuned chat model.

These open source models have paved the way for further developments:

Stanford's Alpaca: The LLaMa-7B model fine-tuned on 52k instruction-following examples.
Vicuna: A 13B-parameter LLaMa model reported to reach about 90% of ChatGPT's quality in a GPT-4-judged evaluation.
Stable Beluga 2: An open-access LLM based on the LLaMa-2 70B model, surpassing GPT-3.5 on some tasks.
Luna AI LLaMa-2 Uncensored: A chatbot fine-tuned on 40,000 chat discussions and claimed to outperform ChatGPT.

Several other models and chatbots fine-tuned on open source LLMs have demonstrated exceptional performance, sometimes exceeding ChatGPT's capabilities.

Pros and Cons of OpenAI LLMs

Pros:

High-quality responses.
Cost-effective for low usage (around 1,000 requests a day).
Faster time to market.
Easy infrastructure setup and deployment.
Minimal staff specialization on LLMs.

Cons:

Risk of exposing data.
Becomes expensive over time with increased request volume.
Vendor lock-in.
Unclear licensing for white-label reselling.

Pros and Cons of Open-Source LLMs

Pros:

Transparent and customizable.
Ideal and cost-effective for high usage.
Fine-tuned models perform better than the GPT series for domain-specific tasks.
Flexible hosting on hardware of your choice.

Cons:

Large infrastructure setup cost.
Model quality may be lower than OpenAI models.
Requires specialized staff to train and maintain LLMs.
Requires an open-source license that permits your intended use.

Significance of LLMs like LLaMa-2

The history of LLMs is a relatively recent but rapidly evolving one. The transformer architecture, introduced in 2017, marked a significant milestone in natural language processing. However, big tech companies dominated the AI landscape, with models like GPT-3 costing millions of dollars to train, leaving independent AI labs and the open-source community trailing behind. This led to a scenario where self-hosted LLMs were not on par with their API-based counterparts, forcing many businesses to rely on models like GPT-3.5 or GPT-4.

Enter LLaMa-2, released by Meta in partnership with Microsoft, which has recently disrupted the status quo. LLaMa-2 is licensed for both research and commercial use, with versions ranging from 7 billion to 70 billion parameters. The 70B version, in particular, is reported to approach API-based models such as GPT-4 in speed and efficiency, ushering in a new era of possibilities.

LLaMa-2 vs GPT Comparison

To understand why self-hosted LLMs are cheaper than GPT-4, we need to examine several key factors:

Quality: Historically, there was a substantial gap between the quality of content generated by self-hosted and API-based models. LLaMa-2 has narrowed this gap considerably, making it comparable to GPT-3.5 on several major benchmarks. Some experiments also suggest that larger LLaMa-2 models, such as the 70B version, can approach GPT-4 in understanding prompts and generating results on certain tasks.
Customization/Fine-Tuning: Smaller LLMs fine-tuned on domain-specific datasets can compete effectively with larger models. Fine-tuning open-source models requires a relatively small budget and dataset, and fine-tuned models already exist for a wide range of use cases. Vicuna, for example, reportedly achieved more than 90% of ChatGPT's quality.
Cost: A major advantage of API-based models is their low upfront cost, especially at low volume. As usage scales up, however, so do the costs, because API-based models like GPT-3 and GPT-4 are priced by the number of tokens processed. For example, at 10,000 queries per day with 450 words per query (roughly 600 tokens, using the common estimate of about 0.75 words per token), you are paying for around 6 million tokens every day.
Development and Maintenance: Historically, API-based models had the upper hand in ease of integration, requiring minimal development and maintenance effort. However, open-source tooling has improved significantly, making self-hosted LLMs nearly as accessible as their API-based counterparts.
Transparency and Control: Self-hosted models offer transparency and control that API-based models can't match. You have the autonomy to maintain and update your model and endpoint, avoiding unexpected issues resulting from updates or changes imposed by API providers.
Data Privacy and Safety: For many businesses, data autonomy and privacy are paramount. Self-hosted LLMs allow you to control where data is stored and sent, mitigating concerns about proprietary data leakage. While private Azure hosting is an option, it comes at a high cost.
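
To make the cost example above concrete, here is a minimal sketch of the daily bill for 10,000 queries of 450 words each. The words-per-token ratio and the even split between prompt and completion tokens are assumptions made for illustration; the per-token prices are the GPT-4 8K rates listed in the GPT-4 cost section of this post.

```python
WORDS_PER_TOKEN = 0.75  # assumption: common rule of thumb for English text


def daily_api_cost(queries_per_day, words_per_query, prompt_share,
                   prompt_price_per_1k, completion_price_per_1k):
    """Estimate the daily spend for a token-priced API."""
    tokens_per_query = words_per_query / WORDS_PER_TOKEN
    prompt_tokens = tokens_per_query * prompt_share
    completion_tokens = tokens_per_query - prompt_tokens
    cost_per_query = (prompt_tokens * prompt_price_per_1k
                      + completion_tokens * completion_price_per_1k) / 1000
    return queries_per_day * cost_per_query


# prompt_share=0.5 is a hypothetical split; adjust it to your workload.
cost = daily_api_cost(10_000, 450, prompt_share=0.5,
                      prompt_price_per_1k=0.03, completion_price_per_1k=0.06)
print(f"~${cost:,.0f} per day, ~${cost * 30:,.0f} per 30 days")
```

Under these assumptions the example comes to roughly $270 per day. Shifting the split toward completion tokens raises the figure, since completions are billed at twice the prompt rate.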

Cost of Initial Setup and Inference

API Access Solution: When using an API access solution like GPT-4, costs vary depending on the provider, the specific use case, and the model chosen. You are billed based on the total number of input tokens, output tokens, and sometimes a per-request fee. For example, Cohere charges $0.2 for every 1,000 classifications.
On-Premises Solution: Hosting open-source models, especially large ones, can be expensive because of the infrastructure needed. Costs are mainly determined by the hardware, with hourly rates ranging from $0.60 for an NVIDIA T4 (16GB) to $45.00 for 8 NVIDIA A100 GPUs (640GB).
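
Hourly rates are easier to compare with an API bill once they are converted to a monthly figure. The sketch below does that for the two rates quoted above, assuming an always-on instance and a 30-day month (both assumptions; a server that is shut down when idle would cost less).

```python
HOURS_PER_MONTH = 24 * 30  # assumption: always-on instance, 30-day month

# Hourly rates quoted in this post, in dollars.
hourly_rates = {
    "1x NVIDIA T4": 0.6,
    "8x NVIDIA A100": 45.0,
}


def monthly_cost(hourly_rate, hours=HOURS_PER_MONTH):
    """Cost of renting one instance for the given number of hours."""
    return hourly_rate * hours


for name, rate in hourly_rates.items():
    print(f"{name}: ${monthly_cost(rate):,.0f} per month")
```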

Cost of Maintenance

API Access Solution: Some providers offer fine-tuning services, covering data upload, model training, and deployment. The pricing for fine-tuning varies among providers and models.
On-Premises Solution: Maintaining open-source models means running your own IT infrastructure to retrain them. The cost is directly proportional to how long retraining takes, so more complex tasks require more resources and cost more.

Other Considerations

CO2 Emissions: The environmental cost of training LLMs is significant due to their increasing size. It's essential to choose an adaptive IT infrastructure to reduce CO2 emissions. Factors that impact CO2 emissions include compute, data center location and efficiency, and hardware.
Expertise: Maintaining open-source solutions often requires a specialized team, increasing costs. API access solutions provide more support from the provider.

The Cost of GPT-4

To begin, let's understand the cost of using GPT-4. It's important to note that the pricing structure for these models can vary, but for the sake of this breakdown, we'll consider a full context window. GPT-4's cost can be roughly estimated as follows:

$0.03 per 1,000 prompt tokens for an 8,192-token context window.
$0.06 per 1,000 completion tokens.

With a full context window, for example roughly 7,000 prompt tokens plus 1,000 completion tokens in a single request, this brings the total cost to just under $0.30 per 1,000 generated tokens. While GPT-4 undoubtedly delivers impressive results, this price point may not be feasible for all use cases.
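
One way to arrive at a figure near $0.30 is to assume a single request that fills the whole 8,192-token window and generates 1,000 completion tokens. That split is an assumption made for this sketch, not something the pricing itself dictates.

```python
CONTEXT_WINDOW = 8192           # tokens, GPT-4 8K variant
PROMPT_PRICE_PER_1K = 0.03      # dollars per 1,000 prompt tokens
COMPLETION_PRICE_PER_1K = 0.06  # dollars per 1,000 completion tokens


def cost_per_1k_generated(completion_tokens=1000, context_window=CONTEXT_WINDOW):
    """Cost per 1,000 generated tokens when the prompt fills the rest of the window."""
    prompt_tokens = context_window - completion_tokens
    request_cost = (prompt_tokens * PROMPT_PRICE_PER_1K
                    + completion_tokens * COMPLETION_PRICE_PER_1K) / 1000
    return request_cost / completion_tokens * 1000


gpt4_cost = cost_per_1k_generated()
print(f"${gpt4_cost:.3f} per 1,000 generated tokens")
```

With these inputs the result is about $0.28, which the post rounds to $0.30. Shorter prompts bring the figure down toward the $0.06 completion price.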

Cost of Self-Hosting

Now, let's shift our focus to self-hosting small LLMs. The primary expense here is the cost of a GPU server. To simplify our calculations, we'll assume the use of a LambdaAPI H100 server priced at $2 per hour.

For the purpose of this comparison, consider the performance of a small LLM like Falcon-7B running at a rate of roughly 44.1 tokens per second with a full context window on a 4090 GPU. While the H100 server is more powerful, we'll use this number as a conservative estimate.

With 44.1 tokens per second, the server generates approximately 158,760 tokens in an hour. Therefore, the cost per 1,000 tokens for self-hosting is approximately ($2 per hour) / (158,760 tokens per hour) = ~$0.013 per 1,000 tokens.

This calculation is a rough estimate and may vary depending on factors like GPU efficiency and LLM performance. However, even with these conservative numbers, self-hosting small LLMs appears significantly more cost-effective.
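
The arithmetic above can be reproduced in a few lines. The sketch assumes the server generates tokens continuously for the whole hour, which is the best case for self-hosting.

```python
HOURLY_RATE = 2.0          # dollars per hour for the GPU server
TOKENS_PER_SECOND = 44.1   # Falcon-7B throughput quoted above

tokens_per_hour = TOKENS_PER_SECOND * 3600
cost = HOURLY_RATE / tokens_per_hour * 1000  # dollars per 1,000 tokens

print(f"{tokens_per_hour:,.0f} tokens per hour")
print(f"${cost:.4f} per 1,000 tokens")
```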

Efficiency Matters

It's essential to highlight that the size and efficiency of the LLM you self-host can significantly impact the cost-effectiveness of this approach. The initial calculation is based on a relatively unoptimized setup, using throughput measured on a slower GPU with limited VRAM. If you can fine-tune a model like Mistral-7B for a specific task, the approach can become even more cost-effective.

Furthermore, self-hosting does require consistent GPU usage to maximize cost savings, because you pay for the server whether or not it is generating tokens. Even at a relatively conservative 10% utilization of the setup above, the cost rises to about $0.13 per 1,000 tokens, which is still only around 40% of the cost of GPT-4.
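
Because the server is billed by the hour regardless of load, the effective cost per token scales inversely with utilization. The sketch below uses the figures from this post ($2 per hour, 44.1 tokens per second, and the $0.30 GPT-4 estimate) to show the cost at a few utilization levels and the utilization at which the two options cost the same. The function names are illustrative.

```python
HOURLY_RATE = 2.0          # dollars per hour, from this post
TOKENS_PER_SECOND = 44.1   # from this post
GPT4_COST_PER_1K = 0.30    # dollars per 1,000 tokens, this post's estimate


def cost_per_1k(utilization):
    """Self-hosting cost per 1,000 tokens when the GPU is busy this fraction of the time."""
    tokens_per_hour = TOKENS_PER_SECOND * 3600 * utilization
    return HOURLY_RATE / tokens_per_hour * 1000


def break_even_utilization():
    """Utilization at which self-hosting costs the same as the GPT-4 estimate."""
    return cost_per_1k(1.0) / GPT4_COST_PER_1K


for u in (1.0, 0.5, 0.1):
    c = cost_per_1k(u)
    print(f"utilization {u:.0%}: ${c:.3f} per 1k tokens "
          f"({c / GPT4_COST_PER_1K:.0%} of GPT-4)")
print(f"break-even utilization: {break_even_utilization():.1%}")
```

With these inputs the break-even point falls in the low single-digit percent range, so the comparison only tips toward the API for a server that sits almost entirely idle.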

Conclusion

In conclusion, the choice between self-hosting small LLMs and using a model like GPT-4 ultimately depends on your specific needs and budget constraints. While GPT-4 offers remarkable capabilities, the cost may be a barrier for many users. Self-hosting, even with a smaller model and less-than-optimal hardware, can provide a cost-effective alternative. If you have a narrow and well-defined task that can be addressed with a self-hosted LLM, the cost savings can be substantial.

Remember that the numbers provided in this breakdown are rough estimates and should be adjusted based on your specific hardware, model, and usage. As technology continues to evolve, self-hosting small LLMs might become an even more attractive option for those looking to leverage the power of AI on a budget.
