@apivovarov
Created June 12, 2024 16:44
./bazel-bin/xla/tools/hlo-opt --help
This tool lets you run a given HloModule from a file (or stdin) and convert it
to expanded HLO, fully optimized HLO, or a binary depending on options.
HLO passes are always run, unless the HLO module is already scheduled (has
is_scheduled=True).
You can also pass in debug option flags for the HloModule.
Usage:
bazel run opt -- --platform=[gpu|cpu|...] path/to/hlo_module
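As a rough sketch of how such an invocation might be assembled from a script, the Python snippet below builds an argument list for the binary shown in this help output. The flags `--platform`, `--stage` and `--o` are taken from the flag list in this output; the helper name `build_hlo_opt_command` and the file names `module.hlo` and `module.ll` are hypothetical and only serve as placeholders.

```python
import shlex

# Binary path as it appears in this help output; adjust to your build tree.
HLO_OPT = "./bazel-bin/xla/tools/hlo-opt"


def build_hlo_opt_command(hlo_path, platform="gpu", stage="hlo", output="-"):
    """Return an argv list for hlo-opt (hypothetical helper, not part of XLA)."""
    return [
        HLO_OPT,
        f"--platform={platform}",
        f"--stage={stage}",
        f"--o={output}",
        hlo_path,
    ]


# Example: request the LLVM IR stage for a hypothetical input file.
cmd = build_hlo_opt_command("module.hlo", stage="llvm", output="module.ll")
print(shlex.join(cmd))

# To actually run it (requires a built hlo-opt binary):
#   import subprocess
#   subprocess.run(cmd, check=True)
```

Building the command as a list rather than a single string avoids shell quoting problems when paths contain spaces; `shlex.join` is used only to print a copy-pasteable form.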
usage: ./bazel-bin/xla/tools/hlo-opt
Flags:
--o="-" string Output filename, or '-' for stdout (default).
--platform="gpu" string The platform for which we perform the translation
--format="" string The format of the input file. By default inferred from the filename. Valid values:
hlo : HLO textual format
pb : xla::HloProto in binary proto format
pbtxt : xla::HloProto in text proto format
--stage="hlo" string Output stage to dump. Valid values depend on the platform, for GPUs:
* hlo : HLO after all optimizations
* llvm : LLVM IR
* ptx : PTX dump
* buffer-assignment: Buffer Assignment
* hlo-backend: HLO after backend passes
--list-stages=false bool Print all supported stages for a given platform and exit
--split-input-file=false bool Splits the input file in pieces based on '// -----' substring, and processes each chunk independently
--xla_cpu_enable_fast_math=false bool Enable unsafe fast-math optimizations in the CPU compiler; this may produce faster code at the expense of some accuracy.
--xla_cpu_fast_math_honor_nans=true bool When xla_cpu_enable_fast_math is true then this controls whether we allow operations to produce NaNs. Ignored when xla_cpu_enable_fast_math is false.
--xla_cpu_fast_math_honor_infs=true bool When xla_cpu_enable_fast_math is true then this controls whether we allow operations to produce infinities. Ignored when xla_cpu_enable_fast_math is false.
--xla_cpu_fast_math_honor_division=true bool When xla_cpu_enable_fast_math is true then this controls whether we forbid using multiplication by the reciprocal instead of division. Ignored when xla_cpu_enable_fast_math is false.
--xla_cpu_fast_math_honor_functions=true bool When xla_cpu_enable_fast_math is true then this controls whether we forbid approximating calculations for functions. Ignored when xla_cpu_enable_fast_math is false.
--xla_cpu_enable_fast_min_max=true bool Enable fast floating point min/max lowering that always propagates NaNs.
--xla_gpu_enable_fast_min_max=false bool Enable fast floating point min/max lowering that does not propagate NaNs.
--xla_llvm_enable_alias_scope_metadata=true bool In LLVM-based backends, enable the emission of !alias.scope metadata in the generated IR.
--xla_llvm_enable_noalias_metadata=true bool In LLVM-based backends, enable the emission of !noalias metadata in the generated IR.
--xla_llvm_enable_invariant_load_metadata=true bool In LLVM-based backends, enable the emission of !invariant.load metadata in the generated IR.
--xla_llvm_disable_expensive_passes=false bool In LLVM-based backends, disable a custom set of expensive optimization passes.
--xla_backend_optimization_level=3 int32 Numerical optimization level for the XLA compiler backend.
--xla_disable_hlo_passes="" string Comma-separated list of hlo passes to be disabled. These names must exactly match the passes' names; no whitespace around commas.
--xla_enable_hlo_passes_only="" string Comma-separated list of hlo passes to be enabled. These names must exactly match the passes' names; no whitespace around commas. The unspecified passes are all disabled.
--xla_disable_all_hlo_passes=false bool Disables all HLO passes. Note that some passes are necessary for correctness and the invariants that must be satisfied by 'fully optimized' HLO are different for different devices and may change over time. The only 'guarantee', such as it is, is that if you compile XLA and dump the optimized HLO for some graph, you should be able to run it again on the same device with the same build of XLA.
--xla_embed_ir_in_executable=false bool Embed the compiler IR as a string in the executable.
--xla_eliminate_hlo_implicit_broadcast=true bool Eliminate implicit broadcasts when lowering user computations to HLO instructions; use explicit broadcast instead.
--xla_cpu_multi_thread_eigen=true bool When generating calls to Eigen in the CPU backend, use multi-threaded Eigen mode.
--xla_gpu_cuda_data_dir="./cuda_sdk_lib" string If non-empty, specifies a local directory containing ptxas and nvvm libdevice files; otherwise we use those from runfile directories.
--xla_gpu_ftz=false bool If true, flush-to-zero semantics are enabled in the code generated for GPUs.
--xla_gpu_ptx_file="" string If non-empty, specifies a file containing ptx to use. The filename prefix must have the same pattern as PTX dumped by XLA. This allows matching one specific module. General workflow: get the generated module ptx from XLA, modify it, then pass it back via this option.
--xla_gpu_llvm_ir_file="" string If non-empty, specifies a file containing textual LLVM IR to use. The filename prefix must have the same pattern as LLVM dumped by XLA (i.e. module_0001.ir-no-opt.ll -> module_0001.MY_NEW_FILE.ll). This allows matching one specific module. General workflow: get the not optimized LLVM IR from XLA, modify it, then pass it back via this option.
--xla_test_all_output_layouts=false bool Let ClientLibraryTestBase::ComputeAndCompare* test all permutations of output layouts. For example, with a 3D shape, all permutations of the set {0, 1, 2} are tried.
--xla_test_all_input_layouts=false bool Let ClientLibraryTestBase::ComputeAndCompare* test all permutations of *input* layouts. For example, for 2 input arguments with 2D shape and 4D shape, the computation will run 2! * 4! times, once for every possible combination of layouts.
--xla_hlo_profile=false bool Instrument the computation to collect per-HLO cycle counts
--xla_backend_extra_options="" string Extra options to pass to a backend; comma-separated list of 'key=val' strings (=val may be omitted); no whitespace around commas.
--xla_cpu_use_mkl_dnn=false bool Generate calls to MKL-DNN in the CPU backend.
--xla_cpu_use_acl=false bool Generate calls to ACL (Arm Compute Library) in the CPU backend.
--xla_cpu_use_xla_runtime=false bool Enable XLA Runtime in the CPU backend.
--xla_cpu_sparse_cuda_threads=0 int32 Sets number of CUDA threads for sparse GPU acceleration in the CPU backend (0 = off).
--xla_gpu_crash_on_verification_failures=false bool Crashes the program on extra verification failures, e.g. cuDNN cross checking failures
--xla_gpu_strict_conv_algorithm_picker=true bool Upgrades warnings to failures when all algorithms fail conv autotuning.
--xla_gpu_autotune_level=4 int32 Set GEMM and Convolution auto-tuning level. 0 = off; 1 = on; 2 = on+init; 3 = on+init+reinit; 4 = on+init+reinit+check.
--xla_force_host_platform_device_count=1 int32 Force the host platform to pretend that there are these many host "devices". All of these host devices are backed by the same threadpool. Setting this to anything other than 1 can increase overhead from context switching but we let the user override this behavior to help run tests on the host that run models in parallel across multiple devices.
--xla_gpu_disable_gpuasm_optimizations=false bool In XLA:GPU run ptxas in -O0 (default is -O3).
--xla_gpu_asm_extra_flags="" string Pass extra parameters to the GPU assembler tool (i.e., ptxas for CUDA). If multiple parameters, separate them by comma.
--xla_fuel="" string Sets compiler fuel, useful for bisecting bugs in passes. Format --xla_fuel=PASS1=NUM1,PASS2=NUM2,...
--xla_dump_to="" string Directory into which debugging data is written. If not specified but another dumping flag is passed, data will be written to stdout. To explicitly write to stdout, set this to "-". The values "sponge" and "test_undeclared_outputs_dir" have a special meaning: They cause us to dump into the directory specified by the environment variable TEST_UNDECLARED_OUTPUTS_DIR.
--xla_dump_hlo_as_text=false bool Dumps HLO modules as text before and after optimizations. debug_options are written to the --xla_dump_to dir, or, if no dir is specified, to stdout.
--xla_dump_hlo_as_long_text=false bool Dumps HLO modules as long text before and after optimizations. debug_options are written to the --xla_dump_to dir, or, if no dir is specified, to stdout. Ignored unless xla_dump_hlo_as_text is true.
--xla_dump_hlo_as_proto=false bool Dumps HLO modules as HloProtos to the directory specified by --xla_dump_to.
--xla_dump_hlo_as_dot=false bool Dumps HLO modules rendered as dot files to the directory specified by --xla_dump_to.
--xla_dump_hlo_as_html=false bool Dumps HLO modules rendered as HTML files to the directory specified by --xla_dump_to.
--xla_dump_hlo_as_url=false bool Tries to dump HLO modules rendered as URLs to stdout (and also to the directory specified by --xla_dump_to). This is not implemented by default; you need to add a plugin which calls RegisterGraphToURLRenderer().
--xla_dump_fusion_visualization=false bool Tries to generate HLO fusion visualization as an HTML page to the directory specified by --xla_dump_to. This is not implemented by default; you need to add a plugin which calls RegisterGraphToURLRenderer(). Generates a file per computation. Currently only implemented for the GPU backend.
--xla_dump_hlo_snapshots=false bool Every time an HLO module is run, dumps an HloSnapshot to the directory specified by --xla_dump_to.
--xla_dump_hlo_module_re="" string Limits dumping only to modules which match this regular expression. Default is to dump all modules.
--xla_dump_hlo_pass_re="" string If specified, dumps HLO before and after optimization passes which match this regular expression, in addition to dumping at the very beginning and end of compilation.
--xla_dump_include_timestamp=false bool If specified, includes a timestamp in the dumped filenames.
--xla_dump_max_hlo_modules=-1 int32 Max number of hlo module dumps in a directory. Set to < 0 for unbounded.
--xla_dump_module_metadata=false bool Dumps HloModuleMetadata as text protos to the directory specified by --xla_dump_to.
--xla_dump_compress_protos=false bool Gzip-compress protos dumped by --xla_dump_hlo_as_proto.
--xla_hlo_graph_addresses=false bool When rendering graphs (--xla_dump_hlo_as_{dot,html,url}), displays the address in memory of each HloInstruction object.
--xla_hlo_graph_sharding_color=false bool Assign colors based on sharding assignments when generating the HLO graphs.
--xla_allow_excess_precision=true bool Allow xla to increase the output precision of an instruction.
--xla_gpu_force_conv_nchw=false bool For cuDNN convolutions, always use NCHW layouts.
--xla_gpu_force_conv_nhwc=false bool For cuDNN convolutions, always use NHWC layouts.
--xla_gpu_algorithm_denylist_path="" string An AlgorithmDenylist text proto file as a denylist of convolutions to avoid using.
--xla_gpu_use_runtime_fusion=false bool For using cuDNN runtime compiled fusion kernels.
--xla_tpu_detect_nan=false bool Trigger error on execution on TPU if a NaN value is detected.
--xla_tpu_detect_inf=false bool Trigger error on execution on TPU if an Inf value is detected.
--xla_cpu_enable_xprof_traceme=false bool If true, XLA CPU generates code to call TraceMe::Activity{Start|End} around HLO operations.
--xla_gpu_unsafe_fallback_to_driver_on_ptxas_not_found=false bool If true, XLA GPU falls back to the driver if ptxas is not found. Note that falling back to the driver can have drawbacks like using more memory and/or other bugs during compilation, so we recommend setting this flag to false.
--xla_multiheap_size_constraint_per_heap=-1 int32 Generates multiple heaps (i.e., temp buffers) with a size constraint on each heap to avoid Out-of-Memory due to memory fragmentation. The constraint is soft, so it works with tensors larger than the given constraint size. -1 corresponds to no constraints.
--xla_gpu_force_compilation_parallelism=0 int32 Overrides normal multi-threaded compilation setting to use this many threads. Setting to 0 (the default value) means no enforcement.
--xla_gpu_enable_llvm_module_compilation_parallelism=false bool Decides whether we can do LLVM module compilation in a parallelised way. If set to false, then it will be single threaded, otherwise the number of threads depends on the --xla_gpu_force_compilation_parallelism flag and the thread pool supplied to GpuCompiler.
--xla_gpu_deterministic_ops=false bool Guarantees run-to-run determinism on GPU.
--xla_gpu_enable_async_collectives=true bool Converts synchronous collective ops into asynchronous.
--xla_gpu_enable_async_all_reduce=true bool Converts synchronous all-reduce ops into asynchronous.
--xla_gpu_enable_async_collective_permute=false bool Converts synchronous collective-permute ops into asynchronous.
--xla_gpu_enable_async_all_gather=false bool Converts synchronous all-gather ops into asynchronous.
--xla_gpu_enable_async_reduce_scatter=false bool Converts synchronous reduce-scatter ops into asynchronous.
--xla_gpu_enable_async_all_to_all=false bool Converts synchronous all-to-all ops into asynchronous.
--xla_gpu_all_reduce_combine_threshold_bytes=31457280 int64 Size threshold (in bytes) for the GPU all-reduce combiner.
--xla_gpu_all_gather_combine_threshold_bytes=31457280 int64 Size threshold (in bytes) for the GPU all-gather combiner.
--xla_gpu_reduce_scatter_combine_threshold_bytes=31457280 int64 Size threshold (in bytes) for the GPU reduce-scatter combiner.
--xla_gpu_enable_all_gather_combine_by_dim=true bool Combine all-gather ops with the same gather dimension or irrespective of their dimension.
--xla_gpu_enable_reduce_scatter_combine_by_dim=true bool Combine reduce-scatter ops with the same dimension or irrespective of their dimension.
--xla_gpu_all_reduce_contiguous=false bool Combine all-reduces into a single operation over a contiguous buffer.
--xla_gpu_all_reduce_blueconnect_num_devices_per_host=0 int32 Number of devices per host for first stage of BlueConnect decomposition pass. The pass will attempt to decompose all-reduces ops into a ReduceScatter-AllReduce-AllGather sequence, with the initial ReduceScatter being performed over all of the devices in the same host. Set to < 1 to disable all-reduce decomposition.
--xla_gpu_enable_while_loop_reduce_scatter_code_motion=false bool Enable hoisting of reduce-scatter outside while loops.
--xla_gpu_collective_inflation_factor=1 int32 Inflation factor for collectives. If set to > 1, each XLA/GPU collective will execute multiple times (will yield incorrect results)
--xla_gpu_enable_reassociation_for_converted_a
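
Several of the flags above take composite string values: `--xla_disable_hlo_passes` and `--xla_enable_hlo_passes_only` expect a comma-separated list with no whitespace around commas, and `--xla_fuel` expects the form `PASS1=NUM1,PASS2=NUM2,...`. The following is a minimal Python sketch of formatting such values, assuming you generate the flags from a script. The helper names `comma_list_flag` and `fuel_flag` and the pass names `pass-a` and `pass-b` are hypothetical placeholders, not real XLA identifiers.

```python
def comma_list_flag(name, items):
    """Format --name=a,b,c for a comma-separated flag value.

    Rejects empty items and items containing whitespace or commas. This is
    stricter than the help text, which only forbids whitespace around commas.
    """
    for item in items:
        if not item or "," in item or any(ch.isspace() for ch in item):
            raise ValueError(f"invalid list item: {item!r}")
    return f"--{name}=" + ",".join(items)


def fuel_flag(fuel):
    """Format --xla_fuel=PASS1=NUM1,PASS2=NUM2 from a {pass_name: count} dict."""
    return comma_list_flag("xla_fuel", [f"{p}={int(n)}" for p, n in fuel.items()])


# Placeholder pass names; substitute names that exactly match real passes.
print(comma_list_flag("xla_disable_hlo_passes", ["pass-a", "pass-b"]))
print(fuel_flag({"pass-a": 3, "pass-b": 10}))
```

Validating the items up front catches a stray space early, which matters because the help text says pass names must match exactly.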