![](https://private-user-images.githubusercontent.com/2610866/317999426-9ca21b1b-8cc9-4521-a615-aa17109f28cf.png?jwt=eyJhbGciOiJIUzI1NiIsInR5cCI6IkpXVCJ9.eyJpc3MiOiJnaXRodWIuY29tIiwiYXVkIjoicmF3LmdpdGh1YnVzZXJjb250ZW50LmNvbSIsImtleSI6ImtleTUiLCJleHAiOjE3MjIxMDg4NjksIm5iZiI6MTcyMjEwODU2OSwicGF0aCI6Ii8yNjEwODY2LzMxNzk5OTQyNi05Y2EyMWIxYi04Y2M5LTQ1MjEtYTYxNS1hYTE3MTA5ZjI4Y2YucG5nP1gtQW16LUFsZ29yaXRobT1BV1M0LUhNQUMtU0hBMjU2JlgtQW16LUNyZWRlbnRpYWw9QUtJQVZDT0RZTFNBNTNQUUs0WkElMkYyMDI0MDcyNyUyRnVzLWVhc3QtMSUyRnMzJTJGYXdzNF9yZXF1ZXN0JlgtQW16LURhdGU9MjAyNDA3MjdUMTkyOTI5WiZYLUFtei1FeHBpcmVzPTMwMCZYLUFtei1TaWduYXR1cmU9ZjQ0N2RhZDYzYjA4OTNlZjk2ODdiMGNlNzUyOTMwZWRiNzg2N2NmODIwYjhlMzU0NDQyNDc2ODI2NjJkNmIwZCZYLUFtei1TaWduZWRIZWFkZXJzPWhvc3QmYWN0b3JfaWQ9MCZrZXlfaWQ9MCZyZXBvX2lkPTAifQ.Wya-KzauHuSFtrMOdfiRI6NaAjfGO2iFDdu-17OF3U8)
- Adding the queue in the middle introduces an overall latency of around 10-20 ms, which is acceptable for the LLM inference use case since end-to-end inference latency is on the order of seconds.
- The LLM gateway layer lets us compute detailed analytics on incoming requests, including the token distribution across requests. We can also start logging requests to enable fine-tuning later.
- No dependency on one cloud provider
- Cost reduction by using spot instances and cheaper cloud providers
- High reliability by using multiple cloud providers distributed across regions
- We recommend Kubernetes because you get health checks, routing, and failover for free.
- Put the models on a shared volume such as EFS (AWS Elastic File System) and mount it into all the pods, so that model weights are not downloaded repeatedly.
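The shared-volume setup above could look roughly like the following on EKS. This is a sketch, not a definitive configuration: it assumes the `aws-efs-csi-driver` is installed, and the names, storage size, and filesystem ID are placeholders.

```yaml
# Hypothetical PersistentVolume backed by EFS; ReadWriteMany lets
# every inference pod mount the same model directory concurrently.
apiVersion: v1
kind: PersistentVolume
metadata:
  name: model-store
spec:
  capacity:
    storage: 100Gi
  accessModes: [ReadWriteMany]
  persistentVolumeReclaimPolicy: Retain
  storageClassName: efs-sc
  csi:
    driver: efs.csi.aws.com
    volumeHandle: fs-0123456789abcdef0   # placeholder EFS filesystem ID
---
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: model-store-claim
spec:
  accessModes: [ReadWriteMany]
  storageClassName: efs-sc
  resources:
    requests:
      storage: 100Gi
```

Each inference pod would then mount `model-store-claim` at its model path, so only the first download ever hits the network.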
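Coming back to the gateway point above, a minimal sketch of the analytics and request-logging idea is shown below. All names here are hypothetical, and token counts are approximated by whitespace splitting; a real gateway would use the model's actual tokenizer and write to a durable log rather than memory.

```python
import time
from collections import defaultdict


class GatewayAnalytics:
    """Hypothetical in-memory analytics store for an LLM gateway.

    Keeps a full request log (reusable later as fine-tuning data) and
    a running per-model token tally for distribution analytics.
    """

    def __init__(self):
        self.requests = []
        self.tokens_per_model = defaultdict(int)

    def log_request(self, model: str, prompt: str, completion: str) -> dict:
        record = {
            "ts": time.time(),
            "model": model,
            "prompt": prompt,
            "completion": completion,
            # crude stand-in for a real tokenizer
            "prompt_tokens": len(prompt.split()),
            "completion_tokens": len(completion.split()),
        }
        self.requests.append(record)
        self.tokens_per_model[model] += (
            record["prompt_tokens"] + record["completion_tokens"]
        )
        return record

    def token_distribution(self) -> dict:
        """Fraction of total tokens consumed by each model."""
        total = sum(self.tokens_per_model.values()) or 1
        return {m: n / total for m, n in self.tokens_per_model.items()}
```

In practice this logic would run as middleware in the gateway process, with the request log shipped asynchronously so it does not add to the per-request latency budget.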