Tritonserver tritonserver:23.02-py3 flags
Since this is nowhere to be found, here is the full output of tritonserver --help from the tritonserver:23.02-py3 container.
=============================
== Triton Inference Server ==
=============================
NVIDIA Release 23.02 (build 53616260)
Triton Server Version 2.31.0
Copyright (c) 2018-2023, NVIDIA CORPORATION & AFFILIATES. All rights reserved.
Various files include modifications (c) NVIDIA CORPORATION & AFFILIATES. All rights reserved.
This container image and its contents are governed by the NVIDIA Deep Learning Container License.
By pulling and using the container, you accept the terms and conditions of this license:
https://developer.nvidia.com/ngc/nvidia-deep-learning-container-license
WARNING: The NVIDIA Driver was not detected. GPU functionality will not be available.
Use the NVIDIA Container Toolkit to start this container with GPU support; see
https://docs.nvidia.com/datacenter/cloud-native/ .
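For reference, launching this image with GPU support via the NVIDIA Container Toolkit typically looks like the following (the model path is illustrative; 8000, 8001 and 8002 are Triton's default HTTP, GRPC and metrics ports):
docker run --gpus all --rm -p 8000:8000 -p 8001:8001 -p 8002:8002 \
  -v /path/to/model_repository:/models \
  nvcr.io/nvidia/tritonserver:23.02-py3 \
  tritonserver --model-repository=/models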
root@611a66e9c322:/opt/tritonserver# tritonserver --help
Usage: tritonserver [options]
--help
Print usage
--log-verbose <integer>
Set verbose logging level. Zero (0) disables verbose logging
and values >= 1 enable verbose logging.
--log-info <boolean>
Enable/disable info-level logging.
--log-warning <boolean>
Enable/disable warning-level logging.
--log-error <boolean>
Enable/disable error-level logging.
--log-format <string>
Set the logging format. Options are "default" and "ISO8601".
The default is "default". For "default", the log severity (L) and
timestamp will be logged as "LMMDD hh:mm:ss.ssssss". For "ISO8601",
the log format will be "YYYY-MM-DDThh:mm:ssZ L".
--log-file <string>
Set the name of the log output file. If specified, log
outputs will be saved to this file. If not specified, log outputs will
stream to the console.
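A sketch combining the logging flags above (the log file path is illustrative):
tritonserver --model-repository=/models --log-verbose=1 \
  --log-format=ISO8601 --log-file=/tmp/triton.log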
--id <string>
Identifier for this server.
--model-store <string>
Equivalent to --model-repository.
--model-repository <string>
Path to model repository directory. It may be specified
multiple times to add multiple model repositories. Note that if a model
is not unique across all model repositories at any time, the model
will not be available.
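For example, serving models from two repositories at once (paths are illustrative):
tritonserver --model-repository=/models/team_a --model-repository=/models/team_b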
--exit-on-error <boolean>
Exit the inference server if an error occurs during
initialization.
--disable-auto-complete-config
If set, disables Triton and the backends from auto-
completing model configuration files. Model configuration files must be
provided and all required configuration settings must be specified.
--strict-model-config <boolean>
DEPRECATED: If true model configuration files must be
provided and all required configuration settings must be specified. If
false the model configuration may be absent or only partially specified
and the server will attempt to derive the missing required
configuration.
--strict-readiness <boolean>
If true, the /v2/health/ready endpoint indicates ready if the
server is responsive and all models are available. If false, the
/v2/health/ready endpoint indicates ready if the server is responsive even if
some/all models are unavailable.
--allow-http <boolean>
Allow the server to listen for HTTP requests.
--http-port <integer>
The port for the server to listen on for HTTP requests.
--reuse-http-port <boolean>
Allow multiple servers to listen on the same HTTP port when
every server has this option set. If you plan to use this option as
a way to load balance between different Triton servers, the same
model repository or set of models must be used for every server.
--http-address <string>
The address for the HTTP server to bind to.
--http-thread-count <integer>
Number of threads handling HTTP requests.
--allow-grpc <boolean>
Allow the server to listen for GRPC requests.
--grpc-port <integer>
The port for the server to listen on for GRPC requests.
--reuse-grpc-port <boolean>
Allow multiple servers to listen on the same GRPC port when
every server has this option set. If you plan to use this option as
a way to load balance between different Triton servers, the same
model repository or set of models must be used for every server.
--grpc-address <string>
The address for the GRPC server to bind to.
--grpc-infer-allocation-pool-size <integer>
The maximum number of inference request/response objects
that remain allocated for reuse. As long as the number of in-flight
requests doesn't exceed this value there will be no
allocation/deallocation of request/response objects.
--grpc-use-ssl <boolean>
Use SSL authentication for GRPC requests. Default is false.
--grpc-use-ssl-mutual <boolean>
Use mutual SSL authentication for GRPC requests. Default is
false.
--grpc-server-cert <string>
File holding PEM-encoded server certificate. Ignored unless
--grpc-use-ssl is true.
--grpc-server-key <string>
File holding PEM-encoded server key. Ignored unless
--grpc-use-ssl is true.
--grpc-root-cert <string>
File holding PEM-encoded root certificate. Ignored unless
--grpc-use-ssl is true.
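A sketch of enabling mutual SSL on the GRPC endpoint while keeping the default HTTP port (certificate paths are illustrative):
tritonserver --model-repository=/models --http-port=8000 --grpc-port=8001 \
  --grpc-use-ssl=true --grpc-use-ssl-mutual=true \
  --grpc-server-cert=/certs/server.crt --grpc-server-key=/certs/server.key \
  --grpc-root-cert=/certs/ca.crt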
--grpc-infer-response-compression-level <string>
The compression level to be used while returning the infer
response to the peer. Allowed values are none, low, medium and high.
By default, compression level is selected as none.
--grpc-keepalive-time <integer>
The period (in milliseconds) after which a keepalive ping is
sent on the transport. Default is 7200000 (2 hours).
--grpc-keepalive-timeout <integer>
The period (in milliseconds) the sender of the keepalive
ping waits for an acknowledgement. If it does not receive an
acknowledgment within this time, it will close the connection. Default is
20000 (20 seconds).
--grpc-keepalive-permit-without-calls <boolean>
Allows keepalive pings to be sent even if there are no calls
in flight (0 : false; 1 : true). Default is 0 (false).
--grpc-http2-max-pings-without-data <integer>
The maximum number of pings that can be sent when there is
no data/header frame to be sent. gRPC Core will not continue sending
pings if we run over the limit. Setting it to 0 allows sending pings
without such a restriction. Default is 2.
--grpc-http2-min-recv-ping-interval-without-data <integer>
If there are no data/header frames being sent on the
transport, this channel argument on the server side controls the minimum
time (in milliseconds) that gRPC Core would expect between receiving
successive pings. If the time between successive pings is less than
this time, then the ping will be considered a bad ping from the peer.
Such a ping counts as a ‘ping strike’. Default is 300000 (5
minutes).
--grpc-http2-max-ping-strikes <integer>
Maximum number of bad pings that the server will tolerate
before sending an HTTP2 GOAWAY frame and closing the transport.
Setting it to 0 allows the server to accept any number of bad pings.
Default is 2.
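For example, a more aggressive keepalive configuration than the defaults (all values are illustrative, in milliseconds where applicable):
tritonserver --model-repository=/models --grpc-keepalive-time=60000 \
  --grpc-keepalive-timeout=10000 --grpc-keepalive-permit-without-calls=1 \
  --grpc-http2-max-pings-without-data=0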
--allow-sagemaker <boolean>
Allow the server to listen for Sagemaker requests. Default
is false.
--sagemaker-port <integer>
The port for the server to listen on for Sagemaker requests.
Default is 8080.
--sagemaker-safe-port-range <<integer>-<integer>>
Set the allowed port range for endpoints other than the
SageMaker endpoints.
--sagemaker-thread-count <integer>
Number of threads handling Sagemaker requests. Default is 8.
--allow-vertex-ai <boolean>
Allow the server to listen for Vertex AI requests. Default
is true if AIP_MODE=PREDICTION, false otherwise.
--vertex-ai-port <integer>
The port for the server to listen on for Vertex AI requests.
Default is AIP_HTTP_PORT if set, 8080 otherwise.
--vertex-ai-thread-count <integer>
Number of threads handling Vertex AI requests. Default is 8.
--vertex-ai-default-model <string>
The name of the model to use for single-model inference
requests.
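A hedged example of enabling the Vertex AI endpoint explicitly and naming the default model (the model name is illustrative; AIP_HTTP_PORT would otherwise determine the port):
tritonserver --model-repository=/models --allow-vertex-ai=true \
  --vertex-ai-port=8080 --vertex-ai-default-model=model_a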
--allow-metrics <boolean>
Allow the server to provide prometheus metrics.
--allow-gpu-metrics <boolean>
Allow the server to provide GPU metrics. Ignored unless
--allow-metrics is true.
--allow-cpu-metrics <boolean>
Allow the server to provide CPU metrics. Ignored unless
--allow-metrics is true.
--metrics-port <integer>
The port reporting prometheus metrics.
--metrics-interval-ms <float>
Metrics will be collected once every <metrics-interval-ms>
milliseconds. Default is 2000 milliseconds.
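For example, exposing CPU metrics but not GPU metrics, sampled every second (8002 is the default metrics port):
tritonserver --model-repository=/models --allow-metrics=true \
  --allow-gpu-metrics=false --allow-cpu-metrics=true \
  --metrics-port=8002 --metrics-interval-ms=1000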
--trace-file <string>
Set the file where trace output will be saved. If
--trace-log-frequency is also specified, this argument value will be the
prefix of the files to save the trace output. See --trace-log-frequency
for detail.
--trace-level <string>
Specify a trace level. OFF to disable tracing, TIMESTAMPS to
trace timestamps, TENSORS to trace tensors. It may be specified
multiple times to trace multiple kinds of information. Default is OFF.
--trace-rate <integer>
Set the trace sampling rate. Default is 1000.
--trace-count <integer>
Set the number of traces to be sampled. If the value is -1,
the number of traces to be sampled will not be limited. Default is
-1.
--trace-log-frequency <integer>
Set the trace log frequency. If the value is 0, Triton will
only log the trace output to <trace-file> when shutting down.
Otherwise, Triton will log the trace output to <trace-file>.<idx> when it
collects the specified number of traces. For example, if the log
frequency is 100, when Triton collects the 100th trace, it logs the
traces to file <trace-file>.0, and when it collects the 200th trace,
it logs the 101st to the 200th traces to file <trace-file>.1.
Default is 0.
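A sketch that sets a sampling rate of 100, caps collection at 1000 traces, and writes the output to a new <trace-file>.<idx> file every 500 traces (the file path is illustrative):
tritonserver --model-repository=/models --trace-file=/tmp/trace.json \
  --trace-level=TIMESTAMPS --trace-rate=100 --trace-count=1000 \
  --trace-log-frequency=500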
--model-control-mode <string>
Specify the mode for model management. Options are "none",
"poll" and "explicit". The default is "none". For "none", the server
will load all models in the model repository(s) at startup and will
not make any changes to the loaded models after that. For "poll", the
server will poll the model repository(s) to detect changes and will
load/unload models based on those changes. The poll rate is
controlled by 'repository-poll-secs'. For "explicit", model load and unload
is initiated by using the model control APIs, and only models
specified with --load-model will be loaded at startup.
--repository-poll-secs <integer>
Interval in seconds between each poll of the model
repository to check for changes. Valid only when --model-control-mode=poll is
specified.
--load-model <string>
Name of the model to be loaded on server startup. It may be
specified multiple times to add multiple models. To load ALL models
at startup, specify '*' as the model name with --load-model=* as the
ONLY --load-model argument; this does not imply any pattern
matching. Specifying --load-model=* in conjunction with another
--load-model argument will result in an error. Note that this option only
takes effect if --model-control-mode=explicit is specified.
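For example, explicit model control with two models loaded at startup (model names are illustrative), or polling the repository every 30 seconds:
tritonserver --model-repository=/models --model-control-mode=explicit \
  --load-model=model_a --load-model=model_b
tritonserver --model-repository=/models --model-control-mode=poll \
  --repository-poll-secs=30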
--rate-limit <string>
Specify the mode for rate limiting. Options are
"execution_count" and "off". The default is "off". For "execution_count", the
server will determine the instance using configured priority and the
number of times the instance has been used to run inference. The
inference will finally be executed once the required resources are
available. For "off", the server will ignore any rate limiter config and
run inference as soon as an instance is ready.
--rate-limit-resource <<string>:<integer>:<integer>>
The number of resources available to the server. The format
of this flag is
--rate-limit-resource=<resource_name>:<count>:<device>. The <device> is optional and if not listed will be applied to
every device. If the resource is specified as "GLOBAL" in the model
configuration the resource is considered shared among all the devices
in the system. The <device> property is ignored for such resources.
This flag can be specified multiple times to specify each resource
and its availability. By default, the max across all instances
that list the resource is selected as its availability. The values for
this flag are case-insensitive.
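A sketch of rate limiting with two resources, the second limited to device 0 (resource names and counts are illustrative):
tritonserver --model-repository=/models --rate-limit=execution_count \
  --rate-limit-resource=R1:10 --rate-limit-resource=R2:5:0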
--pinned-memory-pool-byte-size <integer>
The total byte size that can be allocated as pinned system
memory. If GPU support is enabled, the server will allocate pinned
system memory to accelerate data transfer between host and devices
until it exceeds the specified byte size. If 'numa-node' is configured
via --host-policy, the pinned system memory of the pool size will be
allocated on each numa node. This option will not affect the
allocation conducted by the backend frameworks. Default is 256 MB.
--cuda-memory-pool-byte-size <<integer>:<integer>>
The total byte size that can be allocated as CUDA memory for
the GPU device. If GPU support is enabled, the server will allocate
CUDA memory to minimize data transfer between host and devices
until it exceeds the specified byte size. This option will not affect
the allocation conducted by the backend frameworks. The argument
should be 2 integers separated by colons in the format <GPU device
ID>:<pool byte size>. This option can be used multiple times, but only
once per GPU device. Subsequent uses will overwrite previous uses for
the same GPU device. Default is 64 MB.
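For example, a 512 MB pinned pool plus a 256 MB CUDA pool on GPU 0 (sizes are illustrative, given in bytes):
tritonserver --model-repository=/models \
  --pinned-memory-pool-byte-size=536870912 \
  --cuda-memory-pool-byte-size=0:268435456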
--response-cache-byte-size <integer>
The size in bytes to allocate for a request/response cache.
When non-zero, Triton allocates the requested size in CPU memory and
shares the cache across all inference requests and across all
models. For a given model to use request caching, the model must enable
request caching in the model configuration. By default, no model uses
request caching even if the request cache is enabled with the
--response-cache-byte-size flag. Default is 0.
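For example, a 128 MB response cache (each model must additionally enable request caching in its model configuration):
tritonserver --model-repository=/models --response-cache-byte-size=134217728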
--min-supported-compute-capability <float>
The minimum supported CUDA compute capability. GPUs that
don't support this compute capability will not be used by the server.
--exit-timeout-secs <integer>
Timeout (in seconds) when exiting to wait for in-flight
inferences to finish. After the timeout expires the server exits even if
inferences are still in flight.
--backend-directory <string>
The global directory searched for backend shared libraries.
Default is '/opt/tritonserver/backends'.
--repoagent-directory <string>
The global directory searched for repository agent shared
libraries. Default is '/opt/tritonserver/repoagents'.
--buffer-manager-thread-count <integer>
The number of threads used to accelerate copies and other
operations required to manage input and output tensor contents.
Default is 0.
--model-load-thread-count <integer>
The number of threads used to concurrently load models in
model repositories. Default is 2*<num_cpu_cores>.
--backend-config <<string>,<string>=<string>>
Specify a backend-specific configuration setting. The format
of this flag is --backend-config=<backend_name>,<setting>=<value>.
Where <backend_name> is the name of the backend, such as 'tensorrt'.
--host-policy <<string>,<string>=<string>>
Specify a host policy setting associated with a policy name.
The format of this flag is
--host-policy=<policy_name>,<setting>=<value>. Currently supported settings are 'numa-node', 'cpu-cores'.
Note that 'numa-node' setting will affect pinned memory pool behavior,
see --pinned-memory-pool-byte-size for more detail.
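A hedged sketch: the policy name and CPU range are illustrative, and the backend setting shown (python,shm-default-byte-size) is an assumption drawn from the Python backend; setting names differ per backend.
tritonserver --model-repository=/models \
  --backend-config=python,shm-default-byte-size=16777216 \
  --host-policy=gpu_0,numa-node=0 --host-policy=gpu_0,cpu-cores=0-7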
--model-load-gpu-limit <<device_id>:<fraction>>
Specify the limit on GPU memory usage as a fraction. If
model loading on the device is requested and the current memory usage
exceeds the limit, the load will be rejected. If not specified, the
limit will not be set.
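For example, rejecting further model loads on GPU 0 once 80% of its memory is in use (the device ID and fraction are illustrative):
tritonserver --model-repository=/models --model-load-gpu-limit=0:0.8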