- Adding the queue in the middle adds an overall latency of around 10-20 ms, which is acceptable for the LLM inference use case since overall inference latency is on the order of seconds.
- The LLM gateway layer lets us compute detailed analytics on incoming requests, including the token distribution across requests. We can also log the requests to enable fine-tuning later.
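A minimal sketch of what such a gateway-side analytics hook could look like. All names here (`GatewayAnalytics`, the model names, the whitespace tokenizer) are illustrative assumptions, not from the original notes; a real gateway would use the model's actual tokenizer and a persistent log store.

```python
import time
from collections import Counter

class GatewayAnalytics:
    """Illustrative per-request analytics collector for an LLM gateway."""

    def __init__(self):
        self.token_counts = Counter()   # total tokens seen per model
        self.request_log = []           # raw requests, kept for fine-tuning later

    def record(self, model: str, prompt: str, completion: str):
        # Crude whitespace tokenization as a stand-in for a real tokenizer.
        prompt_tokens = len(prompt.split())
        completion_tokens = len(completion.split())
        self.token_counts[model] += prompt_tokens + completion_tokens
        self.request_log.append({
            "ts": time.time(),
            "model": model,
            "prompt": prompt,
            "completion": completion,
            "prompt_tokens": prompt_tokens,
            "completion_tokens": completion_tokens,
        })

    def token_distribution(self):
        """Fraction of all observed tokens attributed to each model."""
        total = sum(self.token_counts.values())
        return {m: c / total for m, c in self.token_counts.items()}

analytics = GatewayAnalytics()
analytics.record("llama-3-8b", "What is 2 + 2?", "2 + 2 equals 4.")
analytics.record("llama-3-70b", "Summarize this article.", "The article says ...")
print(analytics.token_distribution())
```

The same `record` call site is a natural place to emit metrics to a monitoring system and to write the prompt/completion pairs to durable storage for later fine-tuning.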
- No dependency on one cloud provider
- Cost reduction by using spot instances and cheaper cloud providers
- High reliability by using multiple cloud providers distributed across regions
- Ideally, use Kubernetes because you get health checks, routing, and failover for free.
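The health checks mentioned above map to Kubernetes liveness/readiness probes. A sketch of how they might look for an inference pod (the `/health` and `/ready` paths, port, and timings are illustrative assumptions):

```yaml
# Probe config fragment for an LLM serving container (values are illustrative).
livenessProbe:
  httpGet:
    path: /health
    port: 8000
  initialDelaySeconds: 60   # model loading can take a while; don't kill the pod early
  periodSeconds: 10
readinessProbe:
  httpGet:
    path: /ready
    port: 8000
  periodSeconds: 5          # only route traffic once the model is loaded
```

With these in place, Kubernetes restarts unhealthy pods and keeps not-yet-ready pods out of the Service's load-balancing rotation, which is the failover behaviour the bullet refers to.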
- Put the models in a shared volume such as EFS (AWS Elastic File System) and mount the volume into all the pods, so the models are not downloaded repeatedly.
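A sketch of mounting a shared, EFS-backed PersistentVolumeClaim into the serving pods (the container name, mount path, and claim name are illustrative assumptions; the PVC would be provisioned via the AWS EFS CSI driver):

```yaml
# Pod spec fragment: every replica mounts the same model store (names illustrative).
spec:
  containers:
    - name: llm-server
      volumeMounts:
        - name: model-store
          mountPath: /models        # server loads weights from here instead of downloading
  volumes:
    - name: model-store
      persistentVolumeClaim:
        claimName: efs-model-pvc    # ReadWriteMany PVC backed by EFS
```

Because EFS supports `ReadWriteMany` access, all replicas can mount the same volume concurrently; the weights are downloaded once and every new pod starts from the shared copy.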
Last active: March 29, 2024 10:41