![](https://private-user-images.githubusercontent.com/2610866/317999426-9ca21b1b-8cc9-4521-a615-aa17109f28cf.png?jwt=eyJhbGciOiJIUzI1NiIsInR5cCI6IkpXVCJ9.eyJpc3MiOiJnaXRodWIuY29tIiwiYXVkIjoicmF3LmdpdGh1YnVzZXJjb250ZW50LmNvbSIsImtleSI6ImtleTUiLCJleHAiOjE3MjIxMDg4NjksIm5iZiI6MTcyMjEwODU2OSwicGF0aCI6Ii8yNjEwODY2LzMxNzk5OTQyNi05Y2EyMWIxYi04Y2M5LTQ1MjEtYTYxNS1hYTE3MTA5ZjI4Y2YucG5nP1gtQW16LUFsZ29yaXRobT1BV1M0LUhNQUMtU0hBMjU2JlgtQW16LUNyZWRlbnRpYWw9QUtJQVZDT0RZTFNBNTNQUUs0WkElMkYyMDI0MDcyNyUyRnVzLWVhc3QtMSUyRnMzJTJGYXdzNF9yZXF1ZXN0JlgtQW16LURhdGU9MjAyNDA3MjdUMTkyOTI5WiZYLUFtei1FeHBpcmVzPTMwMCZYLUFtei1TaWduYXR1cmU9ZjQ0N2RhZDYzYjA4OTNlZjk2ODdiMGNlNzUyOTMwZWRiNzg2N2NmODIwYjhlMzU0NDQyNDc2ODI2NjJkNmIwZCZYLUFtei1TaWduZWRIZWFkZXJzPWhvc3QmYWN0b3JfaWQ9MCZrZXlfaWQ9MCZyZXBvX2lkPTAifQ.Wya-KzauHuSFtrMOdfiRI6NaAjfGO2iFDdu-17OF3U8)
- Adding the queue in the middle introduces an overall latency of around 10-20 ms, which is acceptable for the LLM inference use case since end-to-end inference latency is on the order of seconds.
- The LLM gateway layer lets us compute detailed analytics on incoming requests, including the token distribution across requests. We can also start logging requests to enable fine-tuning later.
- No dependency on one cloud provider
- Cost reduction by using spot instances and cheaper cloud providers
- High reliability by using multiple cloud providers distributed across regions
- We recommend Kubernetes because you get health checks, routing, and failover for free.
- Put the models on a shared volume such as EFS (AWS Elastic File System) and mount it into all the pods, so that model weights are not downloaded repeatedly.
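The shared-volume setup above could look roughly like the following on EKS. This is a sketch, not a definitive configuration: it assumes the `aws-efs-csi-driver` is installed, and the names, storage size, and filesystem ID are placeholders.

```yaml
# Hypothetical PersistentVolume backed by EFS; ReadWriteMany lets
# every inference pod mount the same model directory concurrently.
apiVersion: v1
kind: PersistentVolume
metadata:
  name: model-store
spec:
  capacity:
    storage: 100Gi
  accessModes: [ReadWriteMany]
  persistentVolumeReclaimPolicy: Retain
  storageClassName: efs-sc
  csi:
    driver: efs.csi.aws.com
    volumeHandle: fs-0123456789abcdef0   # placeholder EFS filesystem ID
---
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: model-store-claim
spec:
  accessModes: [ReadWriteMany]
  storageClassName: efs-sc
  resources:
    requests:
      storage: 100Gi
```

Each inference pod would then mount `model-store-claim` at its model path, so only the first download ever hits the network.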
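Coming back to the gateway point above, a minimal sketch of the analytics and request-logging idea is shown below. All names here are hypothetical, and token counts are approximated by whitespace splitting; a real gateway would use the model's actual tokenizer and write to a durable log rather than memory.

```python
import time
from collections import defaultdict


class GatewayAnalytics:
    """Hypothetical in-memory analytics store for an LLM gateway.

    Keeps a full request log (reusable later as fine-tuning data) and
    a running per-model token tally for distribution analytics.
    """

    def __init__(self):
        self.requests = []
        self.tokens_per_model = defaultdict(int)

    def log_request(self, model: str, prompt: str, completion: str) -> dict:
        record = {
            "ts": time.time(),
            "model": model,
            "prompt": prompt,
            "completion": completion,
            # crude stand-in for a real tokenizer
            "prompt_tokens": len(prompt.split()),
            "completion_tokens": len(completion.split()),
        }
        self.requests.append(record)
        self.tokens_per_model[model] += (
            record["prompt_tokens"] + record["completion_tokens"]
        )
        return record

    def token_distribution(self) -> dict:
        """Fraction of total tokens consumed by each model."""
        total = sum(self.tokens_per_model.values()) or 1
        return {m: n / total for m, n in self.tokens_per_model.items()}
```

In practice this logic would run as middleware in the gateway process, with the request log shipped asynchronously so it does not add to the per-request latency budget.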