Large Language Models have become the talk of the town. Every day brings new progress and innovation: new versions of GPT are being released, and Meta is in the limelight with Llama-2. In parallel, a large number of counterparts and variants are being launched, including in multiple languages. Foundation models can now handle both text and images with ease.
Open-source LLMs are democratizing AI and come with a lot of promise. We can count on them for data privacy, transparency, customization capabilities, and low cost. But the fundamental issue we face with LLMs is their huge size and appetite for computational power. Firms and individuals other than large enterprises may not be able to afford such large infrastructure for their AI models – but there are workarounds. With this constraint in mind, let’s explore some frameworks and methods for LLM serving and inference which can help us use these models seamlessly.
To conduct the following experiments, first sign up on E2E Cloud. Once registered, go to the ‘Compute’ tab on the left and spin up a GPU compute resource.
Once you have launched the GPU node, you can add your SSH keys and get going. Follow the steps below to test out the various approaches.
Hugging Face Endpoints
This may be the simplest way to make use of LLMs. It is particularly useful when building simple systems or testing models. Hugging Face has made it straightforward by providing the necessary documentation on each model's page. Short code snippets are enough to get a model running. Here is an example inference snippet for the Llama-2 chat model.
Install transformers and login to Hugging Face:
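A minimal setup might look like the following; it assumes you have a Hugging Face account and have been granted access to the gated Llama-2 weights:

```shell
# Install the libraries (accelerate enables device_map="auto" placement)
pip install transformers accelerate

# Authenticate so gated models like Llama-2 can be downloaded
huggingface-cli login
```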
Import libraries, load and prompt the model.
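A sketch of the inference code, assuming the 7B chat variant (`meta-llama/Llama-2-7b-chat-hf`) and a CUDA-capable GPU; the prompt is illustrative:

```python
import torch
from transformers import AutoTokenizer, pipeline

model_id = "meta-llama/Llama-2-7b-chat-hf"  # assumed model; requires gated access

tokenizer = AutoTokenizer.from_pretrained(model_id)
generator = pipeline(
    "text-generation",
    model=model_id,
    tokenizer=tokenizer,
    torch_dtype=torch.float16,   # half precision to fit on a single GPU
    device_map="auto",           # let accelerate place the model on available devices
)

output = generator(
    "What are the key benefits of open-source LLMs?",
    max_new_tokens=256,
    do_sample=True,
    temperature=0.7,
)
print(output[0]["generated_text"])
```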
As simple as that!!
- Best option for beginners and research purposes.
- Detailed, high-quality documentation
- With its robust community support and widespread popularity, Hugging Face has become the go-to platform for machine learning developers and researchers.
- Big companies support and use this platform to provide their models as open source.
- It can be easily used with almost all other frameworks.
- It does offer paid cloud API endpoints, but the usual free workflow involves downloading the model to your own machine and running it there. This compromises efficiency and requires fairly powerful hardware.
vLLM
vLLM is a fast and simple framework for LLM inference and serving. It provides high-throughput serving and support for distributed inference, offering seamless integration with Hugging Face models and an OpenAI-compatible API server. Standout features include continuous batching and PagedAttention. As their tagline says, it is indeed easy, fast, and cheap LLM serving for everyone.
Install CUDA on the system if not installed already.
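With CUDA in place, vLLM can be installed from PyPI:

```shell
pip install vllm
```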
To start the server:
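A sketch of launching vLLM's OpenAI-compatible API server; the model name is an assumption and can be swapped for any supported Hugging Face model:

```shell
python -m vllm.entrypoints.openai.api_server \
    --model meta-llama/Llama-2-13b-chat-hf
```

By default the server listens on port 8000.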
Query the hosted model using:
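Since the server speaks the OpenAI completions protocol, a plain curl request works; the prompt and parameters below are illustrative:

```shell
curl http://localhost:8000/v1/completions \
    -H "Content-Type: application/json" \
    -d '{
        "model": "meta-llama/Llama-2-13b-chat-hf",
        "prompt": "San Francisco is a",
        "max_tokens": 64,
        "temperature": 0.7
    }'
```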
We can also use it for offline batched inference. Here is an example for text completion:
Define the sampling parameters, load the model, and prompt it.
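A sketch of offline batched inference with vLLM, assuming the model weights are accessible and a CUDA-capable GPU is available:

```python
from vllm import LLM, SamplingParams

# A batch of prompts processed together in one generate() call
prompts = [
    "Hello, my name is",
    "The capital of France is",
    "The future of AI is",
]
sampling_params = SamplingParams(temperature=0.8, top_p=0.95, max_tokens=64)

llm = LLM(model="meta-llama/Llama-2-13b-chat-hf")  # assumed model
outputs = llm.generate(prompts, sampling_params)

for output in outputs:
    print(f"Prompt: {output.prompt!r} -> {output.outputs[0].text!r}")
```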
To utilize multiple GPUs, we can deploy models to any cloud using SkyPilot.
Create a serving.yaml file as shown below. Here we request an A100 GPU for the Llama-2 13B model we want to deploy.
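A sketch of such a SkyPilot task file; the model name and ports are illustrative:

```yaml
# serving.yaml -- SkyPilot task definition (field values are illustrative)
resources:
  accelerators: A100:1

envs:
  MODEL_NAME: meta-llama/Llama-2-13b-chat-hf

setup: |
  pip install vllm

run: |
  python -m vllm.entrypoints.openai.api_server \
      --model $MODEL_NAME --host 0.0.0.0 --port 8000
```

The task can then be launched with `sky launch serving.yaml`.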
- If inference speed is your priority, vLLM is the best option. Algorithms like PagedAttention accelerate inference like no other framework.
- High-throughput serving
- It has integrations with Hugging Face Transformers and OpenAI API
- For better deployment and scaling using any cloud platform, it has native integration with SkyPilot framework.
- Currently, it does not support quantization. So, it is not efficient in terms of memory usage.
- Currently, it does not support LoRA, QLoRA or other adapters for LLMs.
OpenLLM
OpenLLM is a platform for packaging LLMs for production. It allows seamless integration with leading services like BentoML, Hugging Face, and LangChain. It supports deployment to the cloud or on local machines, and Docker containerization when used with platforms like BentoML. It works with state-of-the-art models like Falcon, Flan-T5, StarCoder, and many more.
For quick inference through the server, install the openllm library:
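```shell
pip install openllm
```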
To start the server:
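A sketch using the OpenLLM CLI; `opt` is an illustrative model family, and the server defaults to port 3000:

```shell
openllm start opt
```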
Alternatively, to specify the model type, enter model-id and parameters:
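For example, to pin a specific checkpoint within a model family (the model ID below is an assumption):

```shell
openllm start opt --model-id facebook/opt-1.3b
```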
Query the hosted model using curl or the built-in Python client.
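A sketch using the built-in Python client, assuming the server from the previous step is listening on the default port 3000:

```python
import openllm

# Connect to the locally running OpenLLM server
client = openllm.client.HTTPClient("http://localhost:3000")

# Send a prompt and print the generated text
response = client.query("Explain the concept of serverless computing.")
print(response)
```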
To get more control over the server, we can write one as a BentoML service.
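A sketch of such a service, based on the OpenLLM/BentoML integration; the model choice and service name are illustrative:

```python
# service.py -- wrap an OpenLLM runner in a BentoML service
import bentoml
import openllm

# Create a runner for an open model (illustrative choice)
llm_runner = openllm.Runner("dolly-v2", model_id="databricks/dolly-v2-3b")

svc = bentoml.Service(name="llm-service", runners=[llm_runner])

@svc.api(input=bentoml.io.Text(), output=bentoml.io.Text())
async def prompt(input_text: str) -> str:
    # Forward the prompt to the runner and return the generated text
    answer = await llm_runner.generate.async_run(input_text)
    return answer
```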
To start the service run:
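Assuming the service above is saved as `service.py`:

```shell
bentoml serve service:svc
```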
If required, we can also containerize and deploy the LLM application using BentoML.
- It offers quantization techniques for LLMs with bitsandbytes and GPTQ.
- There are options to modify the models. Experimental fine-tuning functionalities are available. It also supports plugins like adapters for LLMs.
- Integration with BentoML and BentoCloud offers wonderful options for deployment and scaling. BentoML enables us to dockerize the application.
- It has integrations with Hugging Face Agents framework and LangChain.
- It does not have built-in support for distributed inference.
Ray Serve
Ray is a complete toolkit that includes libraries for building end-to-end ML pipelines. One of these libraries, Ray Serve, is a scalable model-serving tool. It can be used to build inference APIs. It supports complex deep learning models built with TensorFlow or PyTorch, and has special optimizations for LLMs like response streaming, dynamic request batching, and batched inference.
Here is a sample code snippet for serving a Llama 13B model.
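A sketch of such a deployment; the model ID, route, and generation parameters are assumptions:

```python
# serve_llama.py -- Ray Serve deployment for a Llama-2 13B model
from ray import serve
from starlette.requests import Request
from transformers import pipeline

@serve.deployment(ray_actor_options={"num_gpus": 1})
class LlamaDeployment:
    def __init__(self):
        # Load the model once per replica
        self.generator = pipeline(
            "text-generation",
            model="meta-llama/Llama-2-13b-chat-hf",  # assumed model
            device_map="auto",
        )

    async def __call__(self, request: Request) -> str:
        prompt = (await request.json())["prompt"]
        output = self.generator(prompt, max_new_tokens=128)
        return output[0]["generated_text"]

# Expose the deployment at the /llamas route
serve.run(LlamaDeployment.bind(), route_prefix="/llamas")
```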
For inference, query the /llamas endpoint using HTTP requests:
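For example, with the `requests` library, assuming Ray Serve is running locally on its default port 8000:

```python
import requests

resp = requests.post(
    "http://localhost:8000/llamas",
    json={"prompt": "Write a haiku about GPUs."},
)
print(resp.text)
```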
- Dynamically scalable to many machines for adjusting required resources for the model.
- It is optimized for LLMs with features like response streaming, dynamic request batching and multi-GPU serving for computation intensive LLMs.
- It has native integration with LangChain.
- It is not the best option for beginners, as it includes a broad set of complex functionality covering all types of ML models.
More About E2E Cloud Platform
E2E Networks is a leading provider of advanced cloud GPUs in India, offering a high-value infrastructure cloud for building and launching machine learning applications. Our cloud platform supports a variety of workloads across various industry verticals, providing high-performance computing in the cloud. We offer a flagship machine learning platform called Tir, which is built on top of Jupyter Notebook. E2E Networks also provides reliable and cost-effective storage solutions, allowing businesses to store and access their data from anywhere. Our cloud platform is trusted by over 10,000 clients and is designed for numerous real-world use cases in domains such as Data Science, NLP, Computer Vision/Image Processing, HealthTech, ConsumerTech, and more.
Tir: E2E Cloud’s Flagship Machine Learning Platform
Tir is an advanced web-based interactive development environment offered by E2E Cloud, built on top of Jupyter Notebook. It provides users with the latest features and functionalities, including JupyterLab, a cutting-edge interface for working with notebooks, code, and data. Tir allows easy integration with other E2E services to build and deploy a complete pipeline, and offers interfaces to test and decide which model to use for your task.
E2E Cloud Models: E2E Networks offers a wide range of options for machine learning applications. There are deployment-ready models for tasks in Data Science, NLP, Computer Vision/Image Processing, HealthTech, ConsumerTech, and more. Other models include embedding models that convert text to embeddings, which are widely used for search engines, personalization, and recommendation tasks.
- Highly scalable infrastructure
- Reliable for production-level purposes
- It is growing into an end-to-end platform for machine learning models.
The choice of framework is contingent on several factors: the specific task you’re tackling, the Large Language Model you’re using, and the expenses you’re prepared to bear. This is not an exhaustive list, but these are some of the currently available frameworks worth trying first for LLM inference and serving. To sum up: