Truefoundry

  • Adding the queue in the middle adds an overall latency of around 10-20 ms, which is acceptable for the LLM inference use case since the overall inference latency is on the order of seconds.
  • The LLM gateway layer lets us compute detailed analytics about incoming requests, including the token distribution across requests, and start logging requests to enable fine-tuning later.
  • No dependency on a single cloud provider.
  • Cost reduction by using spot instances and other cloud providers.
  • High reliability by using multiple cloud providers distributed across regions.
  • Kubernetes is the recommended orchestrator because you get health checks, routing, and failover for free.
  • Put the model in a shared volume such as EFS (AWS Elastic File System) so that it is not downloaded repeatedly, and mount the volume to all the pods; a minimal manifest sketch follows this list.
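Below is a minimal sketch of how the shared-volume setup could look, assuming the EFS CSI driver and a StorageClass named `efs-sc` are already installed; the resource names, container image, port, and `/models` mount path are placeholders, not from the original notes:

```yaml
# Sketch: an EFS-backed PersistentVolumeClaim shared by all inference pods,
# plus the probes Kubernetes uses for routing and failover.
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: model-cache                    # hypothetical name
spec:
  accessModes: ["ReadWriteMany"]       # EFS lets many pods mount the same volume
  storageClassName: efs-sc             # assumes an EFS CSI StorageClass exists
  resources:
    requests:
      storage: 100Gi
---
apiVersion: apps/v1
kind: Deployment
metadata:
  name: llm-inference                  # hypothetical name
spec:
  replicas: 2
  selector:
    matchLabels:
      app: llm-inference
  template:
    metadata:
      labels:
        app: llm-inference
    spec:
      containers:
        - name: server
          image: my-registry/llm-server:latest   # placeholder image
          volumeMounts:
            - name: model-cache
              mountPath: /models       # server loads weights from the shared volume
          readinessProbe:              # routing: only ready pods receive traffic
            httpGet: { path: /health, port: 8000 }
            initialDelaySeconds: 60
          livenessProbe:               # failover: unhealthy pods are restarted
            httpGet: { path: /health, port: 8000 }
            periodSeconds: 30
      volumes:
        - name: model-cache
          persistentVolumeClaim:
            claimName: model-cache
```

With `ReadWriteMany`, every replica mounts the same EFS volume, so the model weights are fetched once and new pods start without repeating the download; the readiness and liveness probes are what provide the health checks, routing, and failover mentioned above.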

K8S setup using LLM Engine
