iree-compile help output
OVERVIEW: IREE compilation driver
USAGE: iree-compile [options] <input file or '-' for stdin>
OPTIONS:
CUDA HAL Target:
--iree-hal-cuda-dump-ptx - Dump ptx to the debug stream.
--iree-hal-cuda-llvm-target-arch=<string> - LLVM target chip.
--iree-hal-cuda-llvm-target-feature=<string> - Use to set PTX version.
--iree-hal-cuda-use-ptxas - Uses the ptxas compiler found in the environment to compile the generated PTX code and embeds the resulting cubin binary in the executable. '--iree-hal-cuda-llvm-target-arch' is used as the target GPU. Additional parameters can be passed to ptxas via the parameters flag (e.g. '--iree-hal-cuda-use-ptxas-params=-v').
--iree-hal-cuda-use-ptxas-from=<string> - Uses the ptxas compiler at the provided path to compile the generated PTX code and embeds the resulting cubin binary in the executable. '--iree-hal-cuda-llvm-target-arch' is used as the target GPU. Additional parameters can be passed to ptxas via the parameters flag (e.g. '--iree-hal-cuda-use-ptxas-params=-v').
--iree-hal-cuda-use-ptxas-params=<string> - Passes the given additional parameters to ptxas.
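For example, a CUDA build that post-processes the generated PTX with ptxas might look like the following sketch (the file names and the sm_80 architecture are illustrative placeholders, not taken from this output):
  iree-compile model.mlir -o model.vmfb \
      --iree-hal-target-backends=cuda \
      --iree-hal-cuda-llvm-target-arch=sm_80 \
      --iree-hal-cuda-use-ptxas \
      --iree-hal-cuda-use-ptxas-params=-v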
Color Options:
--color - Use colors in output (default=autodetect)
General options:
--aarch64-neon-syntax=<value> - Choose style of NEON code to emit from AArch64 backend:
=generic - Emit generic NEON assembly
=apple - Emit Apple-style NEON assembly
--aarch64-use-aa - Enable the use of AA during codegen.
--abort-on-max-devirt-iterations-reached - Abort when the max iterations for devirtualization CGSCC repeat pass is reached
--allow-ginsert-as-artifact - Allow G_INSERT to be considered an artifact. Hack around AMDGPU test infinite loops.
--amdgpu-atomic-optimizer-strategy=<value> - Select DPP or Iterative strategy for scan
=DPP - Use DPP operations for scan
=Iterative - Use Iterative approach for scan
=None - Disable atomic optimizer
--amdgpu-bypass-slow-div - Skip 64-bit divide for dynamic 32-bit values
--amdgpu-disable-loop-alignment - Do not align and prefetch loops
--amdgpu-dpp-combine - Enable DPP combiner
--amdgpu-dump-hsa-metadata - Dump AMDGPU HSA Metadata
--amdgpu-enable-merge-m0 - Merge and hoist M0 initializations
--amdgpu-enable-power-sched - Enable scheduling to minimize mAI power bursts
--amdgpu-kernarg-preload-count=<uint> - How many kernel arguments to preload onto SGPRs
--amdgpu-promote-alloca-to-vector-limit=<uint> - Maximum byte size to consider promote alloca to vector
--amdgpu-sdwa-peephole - Enable SDWA peepholer
--amdgpu-use-aa-in-codegen - Enable the use of AA during codegen.
--amdgpu-verify-hsa-metadata - Verify AMDGPU HSA Metadata
--amdgpu-vgpr-index-mode - Use GPR indexing mode instead of movrel for vector indexing
--arm-add-build-attributes -
--arm-implicit-it=<value> - Allow conditional instructions outside of an IT block
=always - Accept in both ISAs, emit implicit ITs in Thumb
=never - Warn in ARM, reject in Thumb
=arm - Accept in ARM, reject in Thumb
=thumb - Warn in ARM, emit implicit ITs in Thumb
--atomic-counter-update-promoted - Do counter update using atomic fetch add for promoted counters only
--atomic-first-counter - Use atomic fetch add for first counter in a function (usually the entry counter)
--bounds-checking-single-trap - Use one trap block per function
--bounds-checking-unique-traps - Always use one trap per check
--cfg-hide-cold-paths=<number> - Hide blocks with relative frequency below the given value
--cfg-hide-deoptimize-paths -
--cfg-hide-unreachable-paths -
--compile-from=<value> - Compilation phase to resume from, starting with the following phase.
=start - Entry point to the compilation pipeline.
=input - Performs input processing and lowering into core IREE input dialects (linalg/etc).
=abi - Adjusts program ABI for the specified execution environment.
=preprocessing - Compiles up to the specified `preprocessing` pipeline.
=global-optimization - Compiles up to global optimization.
=flow - Compiles up to the `flow` dialect.
=stream - Compiles up to the `stream` dialect.
=executable-sources - Compiles up to just before `hal.executable`s are configured, excluding codegen.
=executable-configurations - Compiles up to just before `hal.executable`s are translated, including selection of translation strategies for codegen.
=executable-targets - Compiles up to translated `hal.executable`s, including codegen.
=hal - Compiles up to the `hal` dialect, including codegen.
=vm - Compiles up to the `vm` dialect.
=end - Complete the full compilation pipeline.
--compile-to=<value> - Compilation phase to run up until before emitting output.
=start - Entry point to the compilation pipeline.
=input - Performs input processing and lowering into core IREE input dialects (linalg/etc).
=abi - Adjusts program ABI for the specified execution environment.
=preprocessing - Compiles up to the specified `preprocessing` pipeline.
=global-optimization - Compiles up to global optimization.
=flow - Compiles up to the `flow` dialect.
=stream - Compiles up to the `stream` dialect.
=executable-sources - Compiles up to just before `hal.executable`s are configured, excluding codegen.
=executable-configurations - Compiles up to just before `hal.executable`s are translated, including selection of translation strategies for codegen.
=executable-targets - Compiles up to translated `hal.executable`s, including codegen.
=hal - Compiles up to the `hal` dialect, including codegen.
=vm - Compiles up to the `vm` dialect.
=end - Complete the full compilation pipeline.
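As a sketch of how `--compile-to` and `--compile-from` pair up (file names are placeholders; a target backend is only needed once compilation reaches the HAL phases):
  # Stop after the `flow` phase and emit MLIR.
  iree-compile model.mlir --compile-to=flow -o model.flow.mlir
  # Resume from that phase and run the rest of the pipeline.
  iree-compile model.flow.mlir --compile-from=flow \
      --iree-hal-target-backends=llvm-cpu -o model.vmfb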
--cost-kind=<value> - Target cost kind
=throughput - Reciprocal throughput
=latency - Instruction latency
=code-size - Code size
=size-latency - Code size and latency
--debug-info-correlate - Use debug info to correlate profiles. (Deprecated, use -profile-correlate=debug-info)
--debugify-func-limit=<ulong> - Set max number of processed functions per pass.
--debugify-level=<value> - Kind of debug info to add
=locations - Locations only
=location+variables - Locations and Variables
--debugify-quiet - Suppress verbose debugify output
--disable-auto-upgrade-debug-info - Disable autoupgrade of debug info
--disable-i2p-p2i-opt - Disables inttoptr/ptrtoint roundtrip optimization
--disable-promote-alloca-to-lds - Disable promote alloca to LDS
--disable-promote-alloca-to-vector - Disable promote alloca to vector
--do-counter-promotion - Do counter register promotion
--dot-cfg-mssa=<file name for generated dot file> - file name for generated dot file
--emit-mlir-bytecode - Emit bytecode when generating compile-to or VM MLIR output.
--emit-mlir-bytecode-version=<value> - Use specified bytecode version when generating compile-to or VM MLIR output.
--emscripten-cxx-exceptions-allowed=<string> - The list of function names in which Emscripten-style exception handling is enabled (see emscripten EMSCRIPTEN_CATCHING_ALLOWED options)
--enable-cse-in-irtranslator - Should enable CSE in irtranslator
--enable-cse-in-legalizer - Should enable CSE in Legalizer
--enable-emscripten-cxx-exceptions - WebAssembly Emscripten-style exception handling
--enable-emscripten-sjlj - WebAssembly Emscripten-style setjmp/longjmp handling
--enable-gvn-hoist - Enable the GVN hoisting pass (default = off)
--enable-gvn-memdep -
--enable-gvn-sink - Enable the GVN sinking pass (default = off)
--enable-jump-table-to-switch - Enable JumpTableToSwitch pass (default = off)
--enable-load-in-loop-pre -
--enable-load-pre -
--enable-loop-simplifycfg-term-folding -
--enable-name-compression - Enable name/filename string compression
--enable-pgo-force-function-attrs - Enable pass to set function attributes based on PGO profiles
--enable-split-backedge-in-load-pre -
--enable-split-loopiv-heuristic - Enable loop iv regalloc heuristic
--experimental-debug-variable-locations - Use experimental new value-tracking variable locations
--experimental-debuginfo-iterators - Enable communicating debuginfo positions through iterators, eliminating intrinsics
--force-tail-folding-style=<value> - Force the tail folding style
=none - Disable tail folding
=data - Create lane mask for data only, using active.lane.mask intrinsic
=data-without-lane-mask - Create lane mask with compare/stepvector
=data-and-control - Create lane mask using active.lane.mask intrinsic, and use it for both data and control flow
=data-and-control-without-rt-check - Similar to data-and-control, but remove the runtime check
--fs-profile-debug-bw-threshold=<uint> - Only show debug message if the source branch weight is greater than this value.
--fs-profile-debug-prob-diff-threshold=<uint> - Only show debug message if the branch probability is greater than this value (in percentage).
--generate-merged-base-profiles - When generating nested context-sensitive profiles, always generate extra base profile for function with all its context profiles merged into it.
--hash-based-counter-split - Rename counter variable of a comdat function based on cfg hash
--hot-cold-split - Enable hot-cold splitting pass
--hwasan-percentile-cutoff-hot=<int> - Alternative hot percentile cutoff. By default `-profile-summary-cutoff-hot` is used.
--hwasan-random-skip-rate=<number> - Probability value in the range [0.0, 1.0] to skip instrumentation of a function.
--import-all-index - Import all external functions in index.
--instcombine-code-sinking - Enable code sinking
--instcombine-guard-widening-window=<uint> - How wide an instruction window to bypass looking for another guard
--instcombine-max-num-phis=<uint> - Maximum number of phis to handle in inttoptr/ptrtoint folding
--instcombine-max-sink-users=<uint> - Maximum number of undroppable users for instruction sinking
--instcombine-maxarray-size=<uint> - Maximum array size considered when doing a combine
--instcombine-negator-enabled - Should we attempt to sink negations?
--instcombine-negator-max-depth=<uint> - Maximal lookup depth when checking the viability of negation sinking.
--instrprof-atomic-counter-update-all - Make all profile counter updates atomic (for testing only)
--internalize-public-api-file=<filename> - A file containing list of symbol names to preserve
--internalize-public-api-list=<list> - A list of symbol names to preserve
--iree-codegen-enable-workgroup-specialization - Enable workgroup specialization.
--iree-codegen-gpu-native-math-precision - Skip polynomial lowering for math op natively available on GPU
--iree-codegen-linalg-max-constant-fold-elements=<long> - Maximum number of elements to try to constant fold.
--iree-codegen-llvm-verbose-debug-info - Emit verbose debug information in LLVM IR.
--iree-codegen-llvmgpu-enable-transform-dialect-aligned-matmul - activate the matmul tensorcore strategy for tile aligned shapes
--iree-codegen-llvmgpu-enable-transform-dialect-batch-matmul-strategy - activate the batch matmul strategy, additional configuration flags are shared with matmul
--iree-codegen-llvmgpu-enable-transform-dialect-implicit-gemm-strategy - activate the convolution implicit gemm strategy
--iree-codegen-llvmgpu-enable-transform-dialect-jit - enable the usage of the transform dialect JIT
--iree-codegen-llvmgpu-enable-transform-dialect-matmul-tensorcore-strategy - activate the matmul tensorcore strategy
--iree-codegen-llvmgpu-enable-transform-dialect-pad-strategy - activate the pad strategy
--iree-codegen-llvmgpu-enable-transform-dialect-small-matmul - activate the matmul tensorcore strategy for small shapes (< 16) in at least a dimension
--iree-codegen-llvmgpu-matmul-c-matrix-threshold=<int> - matmul c matrix element count threshold to be considered as small vs. large when deciding MMA schedule
--iree-codegen-llvmgpu-use-mma-sync - force use mma sync instead of wmma ops
--iree-codegen-llvmgpu-use-vector-distribution - enable the usage of the vector distribution pipeline
--iree-codegen-llvmgpu-use-wmma - force use of wmma operations for tensorcore
--iree-codegen-log-swizzle-tile=<uint> - log swizzle tile value
--iree-codegen-mmt4d-use-intrinsics - Whether to use intrinsics when lowering vector contracts generated from mmt4d matmuls (as opposed to inline asm). Not for production use.
--iree-codegen-transform-dialect-library=<string> - File path to a module containing a library of transform dialect strategies. Can be suffixed with the name of a transform sequence within the library to run as preprocessing per executable variant. This is specified as <file-path>@<sequence-name>. If not specified, this will default to `__kernel_config`.
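For instance, pointing codegen at a library file and a named sequence within it uses the <file-path>@<sequence-name> form described above (a sketch; both names below are placeholders):
  iree-compile model.mlir -o model.vmfb \
      --iree-hal-target-backends=llvm-cpu \
      --iree-codegen-transform-dialect-library=strategies.mlir@custom_matmul_config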
--iree-consteval-jit-debug - Prints debugging information to stderr (useful since when consteval has issues, it is often in production on the largest models where we don't want to run a debug compiler).
--iree-consteval-jit-target-device=<string> - Overrides the target device used for JIT'ing.
--iree-experimental-llvmcpu-arm-force-ssve - Controls whether to disable SME tiling when SME+SSVE are enabled with +sme. As a result, IREE will effectively target SSVE instead of SME. This flag is experimental and should only be used for testing.
--iree-flow-break-dispatch=<string> - Enables inserting a break after a specified dispatch. Supports two modes; breaking on the dispatch ordinal before deduplication (@function_name:<index>) and breaking on the dispatch symbol.
--iree-flow-collapse-reduction-dims - Enable collapsing of reduction dims
--iree-flow-dispatch-generate-workload-region - Generate the workload region.
--iree-flow-dispatch-use-transform-dialect=<string> - MLIR file containing a top-level module that specifies the transformations to apply to form dispatch regions.
--iree-flow-dump-dispatch-graph - Dump a dot graph for dispatches.
--iree-flow-dump-dispatch-graph-output-file=<string> - Output file name for a dispatch graph dump.
--iree-flow-element-wise-fuse-multi-reduction - Enable element-wise fusion of multi-reduction loop ops.
--iree-flow-enable-detensoring - Enable changing of tensor operations into scalar operations.
--iree-flow-enable-fuse-padding-into-linalg-consumer-ops - Enable fusing tensor.pad ops into Linalg consumer ops.
--iree-flow-enable-fuse-padding-into-linalg-producer-ops - Enable fusing tensor.pad ops into Linalg producer ops.
--iree-flow-enable-pad-handling - Enable native handling of tensor.pad operations.
--iree-flow-export-benchmark-funcs - Exports one function per original module entry point and unique flow.executable that dispatches with dummy arguments.
--iree-flow-fuse-multi-use - Fuse multi-use ops.
--iree-flow-inline-constants-max-byte-length=<int> - Maximum byte-length of tensor constant that can be inlined into a dispatch region or 0 to disable inlining.
--iree-flow-normalize-input-indexing-map - Enable normalizing input indexing map to identity.
--iree-flow-split-matmul-reduction=<long> - split ratio
--iree-flow-topk-split-reduction=<long> - comma separated list of split ratios
--iree-flow-trace-dispatch=<string> - Enables tracing tensors at specified dispatches. Supports two modes; tracing the dispatch by ordinal before deduplication (@function_name:<index>) and tracing all occurrences of the dispatch symbol.
--iree-flow-trace-dispatch-tensors - Trace runtime input/output tensors for each dispatch function.
--iree-flow-zero-fill-empty-tensors - Zero fill empty tensors instead of leaving them uninitialized.
--iree-global-opt-enable-demote-contraction-inputs-to-bf16 - Demote inputs (LHS, RHS) of linalg matmul-like ops from f32 to bf16.
--iree-global-opt-enable-fuse-silu-horizontal-matmul - Enables fusing specifically structured matmuls (experimental).
--iree-global-opt-enable-quantized-matmul-reassociation - Enables reassociation of quantized matmul ops (experimental).
--iree-global-opt-propagate-transposes - Enables propagation of transpose ops to improve fusion chances.
--iree-hal-benchmark-dispatch-repeat-count=<uint> - The number of times to repeat each hal.command_buffer.dispatch op. This simply duplicates the dispatch op and inserts barriers. It's meant for command buffers having linear dispatch structures.
--iree-hal-executable-object-search-path=<string> - Additional search paths for resolving #hal.executable.object file references.
--iree-hal-instrument-dispatches=<power of two byte size> - Enables dispatch instrumentation with a power-of-two byte size used for storage (16mib, 64mib, 2gib, etc).
--iree-hal-list-target-backends - Lists all registered target backends for executable compilation.
--iree-hal-preprocess-executables-with=<string> - Passes each hal.executable to the given command. Multiple commands may be specified and they will be executed in order. A command may either be a pass pipeline available within the IREE compiler specified as `builtin.module(...)` or a shell tool that consumes a hal.executable MLIR file on stdin and produces a modified hal.executable on stdout. Non-zero exit codes will fail compilation.
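A sketch of the two forms this flag accepts, an in-compiler pass pipeline and an external tool (`canonicalize` is only a stand-in pass and the script name is a placeholder):
  iree-compile model.mlir -o model.vmfb --iree-hal-target-backends=llvm-cpu \
      --iree-hal-preprocess-executables-with='builtin.module(canonicalize)'
  iree-compile model.mlir -o model.vmfb --iree-hal-target-backends=llvm-cpu \
      --iree-hal-preprocess-executables-with='./rewrite_executable.sh'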
--iree-hal-substitute-executable-configuration=<string> - An `executable_name=object_file.xxx` pair specifying a hal.executable symbol name that will be substituted with the configured executable file at the given path. Configured executable paths are relative to those specified on `--iree-hal-executable-object-search-path=`. If a `.mlir` or `.mlirbc` file is specified the entire executable will be replaced with an equivalently named hal.executable in the referenced file and otherwise the executable will be externalized and link the referenced file (`.ptx`/`.spv`/etc).
--iree-hal-substitute-executable-configurations-from=<string> - Substitutes any hal.executable with a file in the given path with the same name ala `--iree-hal-substitute-executable-configuration=`.
--iree-hal-substitute-executable-object=<string> - An `executable_name=object_file.xxx` pair specifying a hal.executable symbol name that will be substituted with the object file at the given path. Object paths are relative to those specified on `--iree-hal-executable-object-search-path=`. If a `.mlir` or `.mlirbc` file is specified the entire executable will be replaced with an equivalently named hal.executable in the referenced file and otherwise the executable will be externalized and link the referenced file (`.ptx`/`.spv`/etc).
--iree-hal-substitute-executable-objects-from=<string> - Substitutes any hal.executable with a file in the given path with the same name ala `--iree-hal-substitute-executable-object=`.
--iree-hal-substitute-executable-source=<string> - An `executable_name=object_file.xxx` pair specifying a hal.executable symbol name that will be substituted with the source object file at the given path. Source object paths are relative to those specified on `--iree-hal-executable-object-search-path=`. If a `.mlir` or `.mlirbc` file is specified the entire executable will be replaced with an equivalently named hal.executable in the referenced file and otherwise the executable will be externalized and link the referenced file (`.ptx`/`.spv`/etc).
--iree-hal-substitute-executable-sources-from=<string> - Substitutes any hal.executable with a file in the given path with the same name ala `--iree-hal-substitute-executable-source=`.
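For example, substituting a single executable with a precompiled object, or pulling replacements for any matching executables from a directory (a sketch; the symbol names, paths, and the vulkan-spirv backend choice are illustrative):
  iree-compile model.mlir -o model.vmfb --iree-hal-target-backends=vulkan-spirv \
      --iree-hal-executable-object-search-path=./objects \
      --iree-hal-substitute-executable-object=main_dispatch_0=main_dispatch_0.spv
  iree-compile model.mlir -o model.vmfb --iree-hal-target-backends=vulkan-spirv \
      --iree-hal-substitute-executable-objects-from=./objects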
--iree-link-bitcode=<string> - Paths of additional bitcode files to load and link. Comma-separated. Any list entry that contains an equals (=) is parsed as `arch=path` and is only linked if `arch` matches the target triple.
--iree-list-plugins - Lists all loaded plugins.
--iree-llvmcpu-check-linalg-vectorization - Runs the pass to check if all the Linalg ops are vectorized
--iree-llvmcpu-disable-distribution - disable thread distribution in codegen
--iree-llvmcpu-disable-vector-peeling - Disable peeling as a pre-processing step for vectorization (only relevant when using compiler heuristics to select the strategy).
--iree-llvmcpu-distribution-size=<int> - default distribution tile size
--iree-llvmcpu-enable-pad-consumer-fusion - Flag to enable the fusion for pad + consumer
--iree-llvmcpu-enable-scalable-vectorization - Enable scalable vectorization if it is supported by the target (e.g., +sve, +sve2 and/or +sme feature flags)
--iree-llvmcpu-enable-transform-dialect-jit - enable the usage of the transform dialect JIT
--iree-llvmcpu-fail-on-out-of-bounds-stack-allocation - fail if the upper bound of dynamic stack allocation cannot be solved
--iree-llvmcpu-general-matmul-tile-bytes=<int> - target distribution tile size for matrix operands of general matmuls, expressed in bytes. Currently only used in data-tiled matmuls (mmt4d).
--iree-llvmcpu-instrument-memory-accesses - Instruments memory accesses in dispatches when dispatch instrumentation is enabled.
--iree-llvmcpu-narrow-matmul-tile-bytes=<int> - target distribution tile size for wide matrix operand of narrow matmuls, expressed in bytes. Currently only used in data-tiled matmuls (mmt4d). Since this is only used for narrow matmuls, which traverse their wide matrix operand once, there is no reuse here and this doesn't have to be sized to fit in some CPU cache. This is more about distributing work to threads.
--iree-llvmcpu-number-of-threads=<int> - number of threads that are used at runtime if codegen thread distribution is enabled
--iree-llvmcpu-reassociate-fp-reductions - Enables reassociation for FP reductions
--iree-llvmcpu-skip-intermediate-roundings - Allow skipping intermediate roundings. For example, in f16 matmul kernels on targets with only f32 arithmetic, we have to perform each multiply-accumulate in f32, and if this flag is false, then we have to round those f32 accumulators to the nearest f16 every time, which is slow.
--iree-llvmcpu-stack-allocation-limit=<int> - maximum allowed stack allocation size in bytes
--iree-llvmcpu-use-decompose-softmax-fuse - Enables inter-pass fusion for the DecomposeSoftmax pass.
--iree-llvmcpu-use-fast-min-max-ops - Use `arith.minf/maxf` instead of `arith.minimumf/maximumf` ops
--iree-llvmcpu-vector-pproc-strategy=<value> - Set the strategy for pre-processing Linalg operation before vectorization:
=peel - Peel iterations from the vector dimensions so that they become multiple of the vector length
=mask - Compute vector dimensions assuming vector masking support. Vector sizes may be rounded up to the nearest power of two and out-of-bounds elements would be masked-out.
=none - Do not apply any vectorization pre-processing transformation.
--iree-llvmgpu-enable-prefetch - Enable prefetch in the vector distribute pipeline
--iree-llvmgpu-shared-memory-limit=<long> - specify the maximum amount of shared memory allowed to be allocated for the given target
--iree-rocm-index-bits=<int> - Set the bit width of indices in ROCm.
--iree-spirv-enable-transform-dialect-jit - Enable the usage of the transform dialect JIT
--iree-spirv-index-bits=<int> - Set the bit width of indices in SPIR-V.
--iree-stream-emulate-memset - Emulate all memset types with dispatches.
--iree-stream-external-resources-mappable - Allocates external resources as host-visible and mappable. This can degrade performance and introduce allocation overhead; staging buffers for readback on the host should be managed by the calling application instead.
--iree-stream-partitioning-favor=<value> - Default stream partitioning favor configuration.
=debug - Force debug partitioning (no concurrency or pipelining).
=min-peak-memory - Favor minimizing memory consumption at the cost of additional concurrency.
=max-concurrency - Favor maximizing concurrency at the cost of additional memory consumption.
--iree-stream-resource-alias-mutable-bindings - Fuses bindings that are mutable instead of leaving them split.
--iree-stream-resource-index-bits=<uint> - Bit width of indices used to reference resource offsets.
--iree-stream-resource-max-allocation-size=<ulong> - Maximum size of an individual memory allocation.
--iree-stream-resource-max-range=<ulong> - Maximum range of a resource binding; may be less than the max allocation size.
--iree-stream-resource-memory-model=<value> - Memory model used for host-device resource memory access.
=unified - Host and device memory are unified and there's (practically) no performance cost for cross-access.
=discrete - Host and device memory are discrete and cross-access is expensive.
--iree-stream-resource-min-offset-alignment=<ulong> - Minimum required alignment in bytes for resource offsets.
--iree-util-hoist-into-globals-print-constexpr-dotgraph-to=<filename> - Prints a dot graph representing the const-expr analysis. The red nodes represent roots and the green nodes represent hoisted values.
--iree-util-zero-fill-elided-attrs - Fills elided attributes with zeros when serializing.
--iree-vm-c-module-optimize - Optimizes the VM module with CSE/inlining/etc prior to serialization
--iree-vm-c-module-output-format=<value> - Output format used to write the C module
=code - C Code file
=mlir-text - MLIR module file in the VM and EmitC dialects
--iree-vm-c-module-strip-debug-ops - Strips debug-only ops from the module
--iree-vmvx-enable-ukernels-decompose-linalg-generic - Enables decomposition of linalg.generic ops when ukernels are enabled (experimental)
--iree-vmvx-skip-intermediate-roundings - Allow skipping intermediate roundings. For example, in f16 matmul kernels on targets with only f32 arithmetic, we have to perform each multiply-accumulate in f32, and if this flag is false, then we have to round those f32 accumulators to the nearest f16 every time, which is slow.
--iree-vulkan-experimental-indirect-bindings - Force indirect bindings for all generated dispatches.
--iree-vulkan-target-env=<string> - Vulkan target environment as #vk.target_env attribute assembly.
--iree-vulkan-target-triple=<string> - Vulkan target triple controlling the SPIR-V environment.
--iterative-counter-promotion - Allow counter promotion across the whole loop nest.
--lint-abort-on-error - In the Lint pass, abort on errors.
--load=<pluginfilename> - Load the specified plugin
--lto-embed-bitcode=<value> - Embed LLVM bitcode in object files produced by LTO
=none - Do not embed
=optimized - Embed after all optimization passes
=post-merge-pre-opt - Embed post merge, but before optimizations
--matrix-default-layout=<value> - Sets the default matrix layout
=column-major - Use column-major layout
=row-major - Use row-major layout
--matrix-print-after-transpose-opt -
--max-counter-promotions=<int> - Max number of allowed counter promotions
--max-counter-promotions-per-loop=<uint> - Max number of counter promotions per loop to avoid increasing register pressure too much
--mir-strip-debugify-only - Should mir-strip-debug only strip debug info from debugified modules by default
--misexpect-tolerance=<uint> - Prevents emitting diagnostics when profile counts are within N% of the threshold.
--mlir-disable-threading - Disable multi-threading within MLIR, overrides any further call to MLIRContext::enableMultiThreading()
--mlir-elide-elementsattrs-if-larger=<uint> - Elide ElementsAttrs with "..." that have more elements than the given upper limit
--mlir-elide-resource-strings-if-larger=<uint> - Elide printing value of resources if string is too long in chars.
--mlir-pass-pipeline-crash-reproducer=<string> - Generate a .mlir reproducer file at the given output path if the pass manager crashes or fails
--mlir-pass-pipeline-local-reproducer - When generating a crash reproducer, attempt to generate a reproducer with the smallest pipeline.
--mlir-pass-statistics - Display the statistics of each pass
--mlir-pass-statistics-display=<value> - Display method for pass statistics
=list - display the results in a merged list sorted by pass name
=pipeline - display the results with a nested pipeline view
--mlir-pretty-debuginfo - Print pretty debug info in MLIR output
--mlir-print-debuginfo - Print debug info in MLIR output
--mlir-print-elementsattrs-with-hex-if-larger=<long> - Print DenseElementsAttrs that have more elements than the given upper limit as a hex string (use -1 to disable)
--mlir-print-ir-after=<pass-arg> - Print IR after specified passes
--mlir-print-ir-after-all - Print IR after each pass
--mlir-print-ir-after-change - When printing the IR after a pass, only print if the IR changed
--mlir-print-ir-after-failure - When printing the IR after a pass, only print if the pass failed
--mlir-print-ir-before=<pass-arg> - Print IR before specified passes
--mlir-print-ir-before-all - Print IR before each pass
--mlir-print-ir-module-scope - When printing IR for print-ir-[before|after]{-all} always print the top-level operation
--mlir-print-local-scope - Print with local scope and inline information (eliding aliases for attributes, types, and locations)
--mlir-print-op-on-diagnostic - When a diagnostic is emitted on an operation, also print the operation as an attached note
--mlir-print-skip-regions - Skip regions when printing ops.
--mlir-print-stacktrace-on-diagnostic - When a diagnostic is emitted, also print the stack trace as an attached note
--mlir-print-value-users - Print users of operation results and block arguments as a comment
--mlir-timing - Display execution times
--mlir-timing-display=<value> - Display method for timing data
=list - display the results in a list sorted by total time
=tree - display the results with a nested tree view
--no-discriminators - Disable generation of discriminator information.
--nvptx-sched4reg - NVPTX Specific: schedule for register pressure
--object-size-offset-visitor-max-visit-instructions=<uint> - Maximum number of instructions for ObjectSizeOffsetVisitor to look at
--pgo-block-coverage - Use this option to enable basic block coverage instrumentation
--pgo-temporal-instrumentation - Use this option to enable temporal instrumentation
--pgo-view-block-coverage-graph - Create a dot file of CFGs with block coverage inference information
--poison-checking-function-local - Check that returns are non-poison (for testing)
--print-pipeline-passes - Print a '-passes' compatible string describing the pipeline (best-effort only).
--profile-correlate=<value> - Use debug info or binary file to correlate profiles.
=<empty> - No profile correlation
=debug-info - Use debug info to correlate
=binary - Use binary to correlate
--promote-alloca-vector-loop-user-weight=<uint> - The bonus weight of users of allocas within loop when sorting profitable allocas
--r600-ir-structurize - Use StructurizeCFG IR pass
--remove-traps-percentile-cutoff-hot=<int> - Alternative hot percentile cutoff. By default `-profile-summary-cutoff-hot` is used.
--remove-traps-random-rate=<number> - Probability value in the range [0.0, 1.0] of unconditional pseudo-random checks removal.
--riscv-add-build-attributes -
--riscv-use-aa - Enable the use of AA during codegen.
--runtime-counter-relocation - Enable relocating counters at runtime.
--safepoint-ir-verifier-print-only -
--sample-profile-check-record-coverage=<N> - Emit a warning if less than N% of records in the input profile are matched to the IR.
--sample-profile-check-sample-coverage=<N> - Emit a warning if less than N% of samples in the input profile are matched to the IR.
--sample-profile-max-propagate-iterations=<uint> - Maximum number of iterations to go through when propagating sample block/edge weights through the CFG.
--skip-ret-exit-block - Suppress counter promotion if exit blocks contain ret.
--speculative-counter-promotion-max-exiting=<uint> - The max number of exiting blocks of a loop to allow speculative counter promotion
--speculative-counter-promotion-to-loop - When the option is false, if the target block is in a loop, the promotion will be disallowed unless the promoted counter update can be further/iteratively promoted into an acyclic region.
--split-input-file - Split the input file into pieces and process each chunk independently.
--summary-file=<string> - The summary file to use for function importing.
--sve-tail-folding=<string> - Control the use of vectorisation using tail-folding for SVE where the option is specified in the form (Initial)[+(Flag1|Flag2|...)]:
disabled (Initial) No loop types will vectorize using tail-folding
default (Initial) Uses the default tail-folding settings for the target CPU
all (Initial) All legal loop types will vectorize using tail-folding
simple (Initial) Use tail-folding for simple loops (not reductions or recurrences)
reductions Use tail-folding for loops containing reductions
noreductions Inverse of above
recurrences Use tail-folding for loops containing fixed order recurrences
norecurrences Inverse of above
reverse Use tail-folding for loops requiring reversed predicates
noreverse Inverse of above
--tail-predication=<value> - MVE tail-predication pass options
=disabled - Don't tail-predicate loops
=enabled-no-reductions - Enable tail-predication, but not for reduction loops
=enabled - Enable tail-predication, including reduction loops
=force-enabled-no-reductions - Enable tail-predication, but not for reduction loops, and force this which might be unsafe
=force-enabled - Enable tail-predication, including reduction loops, and force this which might be unsafe
--td-matmul-strategy-blk-sizes=<long> - block tile size for dims (x,y,z) for the transform dialect matmul strategy
--td-matmul-strategy-num-threads=<long> - number of threads for dims (x,y,z) for the transform dialect matmul strategy
--td-matmul-strategy-num-warps=<long> - number of warps for dims (x,y,z) for the transform dialect matmul strategy
--td-matmul-strategy-peel-pipeline-epilogue - whether to peel the pipeline epilogue for the transform dialect matmul strategy
--td-matmul-strategy-pipeline-depth=<long> - pipeline depth for the transform dialect matmul strategy
--td-matmul-strategy-reduc-size=<long> - reduction tile size for the transform dialect matmul strategy
--td-matmul-strategy-use-async-copies - use asynchronous copies for the transform dialect matmul strategy
--td-matmul-strategy-use-fma - use fma for the transform dialect matmul strategy
--td-matmul-strategy-use-mma-sync - use mma sync for the transform dialect matmul strategy
--td-matmul-strategy-use-wmma - use wmma for the transform dialect matmul strategy
--td-pad-strategy-blk-sizes=<long> - block tile sizes for dims (x,y,z) for the transform dialect pad strategy
--td-pad-strategy-num-threads=<long> - number of threads for dims (x,y,z) for the transform dialect pad strategy
--td-pad-strategy-use-async-copies - use async copies through shared memory for the pad strategy
--td-pad-strategy-vector-size=<long> - vector size for the transform dialect pad strategy
--thinlto-assume-merged - Assume the input has already undergone ThinLTO function importing and the other pre-optimization pipeline changes.
--type-based-intrinsic-cost - Calculate intrinsics cost based only on argument types
--verify - Verifies the IR for correctness throughout compilation.
--verify-legalizer-debug-locs=<value> - Verify that debug locations are handled
=none - No verification
=legalizations - Verify legalizations
=legalizations+artifactcombiners - Verify legalizations and artifact combines
--verify-region-info - Verify region info (time consuming)
--vp-counters-per-site=<number> - The average number of profile counters allocated per value profiling site.
--vp-static-alloc - Do static counter allocation for value profiler
--wasm-enable-eh - WebAssembly exception handling
--wasm-enable-sjlj - WebAssembly setjmp/longjmp handling
--write-experimental-debuginfo - Write debug info in the new non-intrinsic format
--x86-align-branch=<string> - Specify types of branches to align (plus separated list of types):
jcc indicates conditional jumps
fused indicates fused conditional jumps
jmp indicates direct unconditional jumps
call indicates direct and indirect calls
ret indicates rets
indirect indicates indirect unconditional jumps
--x86-align-branch-boundary=<uint> - Control how the assembler should align branches with NOP. If the boundary's size is not 0, it should be a power of 2 and no less than 32. Branches will be aligned to prevent from being across or against the boundary of specified size. The default value 0 does not align branches.
--x86-branches-within-32B-boundaries - Align selected instructions to mitigate negative performance impact of Intel's microcode update for errata skx102. May break assumptions about labels corresponding to particular instructions, and should be used with caution.
--x86-pad-max-prefix-size=<uint> - Maximum number of prefixes to use for padding
Generic Options:
--help - Display available options (--help-hidden for more)
--help-list - Display list of available options (--help-list-hidden for more)
--version - Display the version of this program
IREE Example Plugin:
--iree-example-flag - Dummy flag for the example plugin
IREE HAL executable target options:
--iree-hal-dump-executable-benchmarks-to=<string> - Path to write standalone hal.executable benchmarks into (- for stdout).
--iree-hal-dump-executable-binaries-to=<string> - Path to write translated and serialized executable binaries into.
--iree-hal-dump-executable-configurations-to=<string> - Path to write individual hal.executable input source listings into, after translation strategy selection and before starting translation (- for stdout).
--iree-hal-dump-executable-files-to=<string> - Meta flag for all iree-hal-dump-executable-* options. Path to write executable files (sources, benchmarks, intermediates, binaries) to.
--iree-hal-dump-executable-intermediates-to=<string> - Path to write translated executable intermediates (.bc, .o, etc) into.
--iree-hal-dump-executable-sources-to=<string> - Path to write individual hal.executable input source listings into (- for stdout).
--iree-hal-executable-debug-level=<int> - Debug level for executable translation (0-3).
--iree-hal-target-backends=<string> - Target backends for executable compilation.
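For example, compiling for one backend while dumping all executable artifacts for inspection (a sketch; the llvm-cpu backend name and the paths are illustrative):
  iree-compile model.mlir -o model.vmfb \
      --iree-hal-target-backends=llvm-cpu \
      --iree-hal-dump-executable-files-to=./dump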
IREE Main Options:
--compile-mode=<value> - IREE compilation mode
=std - Standard compilation
=vm - Compile from VM IR
=hal-executable - Compile an MLIR module containing a single hal.executable into a target-specific binary form (such as an ELF file or a flatbuffer containing a SPIR-V blob)
=precompile - Precompilation pipeline which does input conversion and global optimizations.
-o <filename> - Output filename
--output-format=<value> - Format of compiled output
=vm-bytecode - IREE VM Bytecode (default)
=vm-c - C source module
=vm-asm - IREE VM MLIR Assembly
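Putting the main options together, a minimal sketch of a standard compile and a variant that emits VM MLIR assembly instead of bytecode (file names are placeholders; the llvm-cpu backend is assumed):
  iree-compile model.mlir --iree-hal-target-backends=llvm-cpu -o model.vmfb
  iree-compile model.mlir --iree-hal-target-backends=llvm-cpu \
      --output-format=vm-asm -o model.vm.mlir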
IREE VM bytecode options:
--iree-vm-bytecode-module-optimize - Optimizes the VM module with CSE/inlining/etc prior to serialization
--iree-vm-bytecode-module-output-format=<value> - Output format the bytecode module is written in
=flatbuffer-binary - Binary FlatBuffer file
=flatbuffer-text - Text FlatBuffer file, debug-only
=mlir-text - MLIR module file in the VM dialect
=annotated-mlir-text - MLIR module file in the VM dialect with annotations
--iree-vm-bytecode-module-strip-debug-ops - Strips debug-only ops from the module
--iree-vm-bytecode-module-strip-source-map - Strips the source map from the module
--iree-vm-bytecode-source-listing=<string> - Dump a VM MLIR file and annotate source locations with it
--iree-vm-emit-polyglot-zip - Enables output files to be viewed as zip files for debugging (only applies to binary targets)
IREE VM target options:
--iree-vm-target-extension-f32 - Support f32 target opcode extensions.
--iree-vm-target-extension-f64 - Support f64 target opcode extensions.
--iree-vm-target-index-bits=<int> - Bit width of index types.
--iree-vm-target-optimize-for-stack-size - Prefer optimizations that reduce VM stack usage over performance.
--iree-vm-target-truncate-unsupported-floats - Truncate f64 to f32 when unsupported.
IREE compiler plugin options:
--iree-plugin=<string> - Plugins to activate
--iree-print-plugin-info - Prints available and activated plugin info to stderr
IREE options for apply custom preprocessing before normal IREE compilation flow:
--iree-preprocessing-pass-pipeline=<string> - Textual description of the pass pipeline to run before running normal IREE compilation pipelines
--iree-preprocessing-transform-spec-filename=<string> - File name of a transform dialect spec to use for preprocessing
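A sketch of both preprocessing hooks (`canonicalize` is only a stand-in pass and the spec file name is a placeholder):
  iree-compile model.mlir -o model.vmfb --iree-hal-target-backends=llvm-cpu \
      --iree-preprocessing-pass-pipeline='builtin.module(canonicalize)'
  iree-compile model.mlir -o model.vmfb --iree-hal-target-backends=llvm-cpu \
      --iree-preprocessing-transform-spec-filename=preprocess_spec.mlir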
IREE options for controlling global optimizations.:
--iree-opt-aggressively-propagate-transposes - Propagates transposes to named ops even when the resulting op will be a linalg.generic
--iree-opt-const-eval - Enables eager evaluation of constants using the full compiler and runtime (on by default).
--iree-opt-const-expr-hoisting - Hoists the results of latent constant expressions into immutable global initializers for evaluation at program load.
--iree-opt-data-tiling - Enables data tiling path.
--iree-opt-demote-f32-to-f16 - Converts all f32 ops and values into f16 counterparts unconditionally before main global optimizations.
--iree-opt-demote-f64-to-f32 - Converts all f64 ops and values into f32 counterparts unconditionally before main global optimizations.
--iree-opt-demote-i64-to-i32 - Converts all i64 ops and values into i32 counterparts unconditionally before main global optimizations.
--iree-opt-minimum-parameter-export-size=<long> - Minimum size of constants to export to the archive created in `iree-opt-parameter-archive-export-file`.
--iree-opt-numeric-precision-reduction - Reduces numeric precision to lower bit depths where possible.
--iree-opt-outer-dim-concat - Transposes all concatenations to happen along the outermost dimension.
--iree-opt-parameter-archive-export-file=<string> - File path to create a parameter archive using any inline global constants.
--iree-opt-parameter-archive-export-scope=<string> - Scope for parameters in the archive created in `iree-opt-parameter-archive-export-file`.
--iree-opt-promote-bf16-to-f32 - Converts all bf16 ops and values into f32 counterparts unconditionally before main global optimizations.
--iree-opt-promote-f16-to-f32 - Converts all f16 ops and values into f32 counterparts unconditionally before main global optimizations.
--iree-opt-splat-parameter-archive-export-file=<string> - File path to create a parameter archive of splat values out of all parameter backed globals.
--iree-opt-strip-assertions - Strips debug assertions after any useful information has been extracted.
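These are mostly boolean toggles; assuming the usual LLVM-style `--flag=false` syntax for disabling them, a sketch of turning one optimization off and another on might look like (file name is a placeholder):
  iree-compile model.mlir -o model.vmfb --iree-hal-target-backends=llvm-cpu \
      --iree-opt-const-eval=false \
      --iree-opt-strip-assertions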
IREE options for controlling host/device scheduling.:
--iree-execution-model=<value> - Specifies the execution model used for scheduling tensor compute operations.
=host-only - Host-local code only that does not need execution scheduling.
=async-internal - Full HAL using asynchronous host/device execution internally but exporting functions as if synchronous.
=async-external - Full HAL using asynchronous host/device execution both internally and externally.
=inline-static - Inline host-local in-process execution with executable code statically linked into the host program.
=inline-dynamic - Inline host-local in-process execution using dynamic executables.
--iree-scheduling-dump-statistics-file=<string> - File path to write statistics to; an empty string writes to stderr and `-` writes to stdout.
--iree-scheduling-dump-statistics-format=<value> - Dumps statistics in the specified output format.
=pretty - Human-readable pretty printed output.
=verbose - Pretty printed output with additional IR.
=csv - Comma separated values.
=json - JSON output with structures for data exchange
--iree-scheduling-optimize-bindings - Enables binding fusion and dispatch site specialization.
IREE options for controlling the input transformations to apply.:
--iree-input-type=<string> - Specifies the input program representation:
=none - No input dialect transformation.
=auto - Analyze the input program to choose conversion.
=* - An extensible input type defined in a plugin.
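For example, forcing automatic front-end detection versus skipping input conversion entirely (a sketch; the file name is a placeholder and the llvm-cpu backend is assumed):
  iree-compile model.mlir -o model.vmfb --iree-hal-target-backends=llvm-cpu --iree-input-type=auto
  iree-compile model.mlir -o model.vmfb --iree-hal-target-backends=llvm-cpu --iree-input-type=none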
IREE translation binding support options.:
--iree-native-bindings-support - Include runtime support for native IREE ABI-compatible bindings.
--iree-tflite-bindings-support - Include runtime support for the IREE TFLite compatibility bindings.
LLVMCPU HAL Target:
--iree-llvmcpu-debug-symbols - Generate and embed debug information (DWARF, PDB, etc)
--iree-llvmcpu-embedded-linker-path=<string> - Tool used to link embedded ELFs produced by IREE (for --iree-llvmcpu-link-embedded=true).
--iree-llvmcpu-enable-ukernels=<string> - Enables ukernels in the llvmcpu backend. May be `default`, `none`, `all`, or a comma-separated list of specific unprefixed ukernels to enable, e.g. `mmt4d`.
--iree-llvmcpu-keep-linker-artifacts - Keep LLVM linker target artifacts (.so/.dll/etc)
--iree-llvmcpu-link-embedded - Links binaries into a platform-agnostic ELF to be loaded by the embedded IREE ELF loader.
--iree-llvmcpu-link-static - Links system libraries into binaries statically to isolate them from platform dependencies needed at runtime
--iree-llvmcpu-link-ukernel-bitcode - Link ukernel bitcode libraries into generated executables
--iree-llvmcpu-list-targets - Lists all registered targets that the LLVM backend can generate code for.
--iree-llvmcpu-loop-interleaving - Enable LLVM loop interleaving opt
--iree-llvmcpu-loop-unrolling - Enable LLVM loop unrolling opt
--iree-llvmcpu-loop-vectorization - Enable LLVM loop vectorization opt
--iree-llvmcpu-sanitize=<value> - Apply LLVM sanitize feature
=address - Address sanitizer support
=thread - Thread sanitizer support
--iree-llvmcpu-slp-vectorization - Enable LLVM SLP Vectorization opt
--iree-llvmcpu-static-library-output-path=<string> - Path to output static object (EX: '/path/to/static-library.o'). This will produce the static library at the specified path along with a similarly named '.h' file for static linking.
--iree-llvmcpu-system-linker-path=<string> - Tool used to link system shared libraries produced by IREE (for --iree-llvmcpu-link-embedded=false).
--iree-llvmcpu-target-abi=<string> - LLVM target machine ABI; specify for -mabi
--iree-llvmcpu-target-cpu=<string> - LLVM target machine CPU; use 'host' for your host native CPU.
--iree-llvmcpu-target-cpu-features=<string> - LLVM target machine CPU features; use 'host' for your host native CPU.
--iree-llvmcpu-target-data-layout=<string> - LLVM target machine data layout override.
--iree-llvmcpu-target-float-abi=<value> - LLVM target codegen enables soft float abi e.g -mfloat-abi=softfp
=default - Default (softfp)
=soft - Software floating-point emulation
=hard - Hardware floating-point instructions
--iree-llvmcpu-target-triple=<string> - LLVM target machine triple.
--iree-llvmcpu-target-vector-width-in-bytes=<uint> - Overrides the native vector register width (in bytes) of the target.
--iree-llvmcpu-wasm-linker-path=<string> - Tool used to link WebAssembly modules produced by IREE (for --iree-llvmcpu-target-triple=wasm32-*).
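A sketch of a native build versus a cross-compile with this backend (the triple is an ordinary LLVM target triple, shown only as an example):
  # Build for the host CPU.
  iree-compile model.mlir -o model.vmfb --iree-hal-target-backends=llvm-cpu \
      --iree-llvmcpu-target-cpu=host
  # Cross-compile for an AArch64 Linux target.
  iree-compile model.mlir -o model_arm64.vmfb --iree-hal-target-backends=llvm-cpu \
      --iree-llvmcpu-target-triple=aarch64-unknown-linux-gnu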
MetalSPIRV HAL Target:
--iree-metal-compile-to-metallib - Compile to .metallib and embed in IREE deployable flatbuffer if true; otherwise stop at and embed MSL source code
--iree-metal-target-platform=<value> - Apple platform to target
=macos - macOS platform
=ios - iOS platform
=ios-simulator - iOS simulator platform
ROCM HAL Target:
--iree-rocm-bc-dir=<string> - Directory of ROCM Bitcode.
--iree-rocm-enable-ukernels=<string> - Enables microkernels in the ROCm backend. May be `default`, `none`, `all`, or a comma-separated list of specific unprefixed microkernels to enable, e.g. `mmt4d`.
--iree-rocm-target-chip=<string> - ROCm target chip.
--iree-rocm-waves-per-eu=<int> - Optimization hint specifying minimum number of waves per execution unit.
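For example, targeting a specific AMD GPU (a sketch; gfx90a is a common ROCm architecture name, used here purely as an illustration):
  iree-compile model.mlir -o model.vmfb \
      --iree-hal-target-backends=rocm \
      --iree-rocm-target-chip=gfx90a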
StableHLO Input:
--iree-input-demote-f64-to-f32 - Converts all f64 ops and values into f32 counterparts.
--iree-input-demote-i64-to-i32 - Converts all i64 ops and values into i32 counterparts.
--iree-input-promote-bf16-to-f32 - Converts all bf16 ops and values into f32 counterparts.
Torch Input:
--iree-torch-use-strict-symbolic-shapes - Forces dynamic shapes to be treated as strict
VMVX HAL Target:
--iree-vmvx-enable-microkernels - Enables microkernel lowering for vmvx (experimental)