jmpnz/mike-pall-lua-list-2009

## mike-pall-lua-list-2009
This was copied verbatim from the Lua mailing list see ref:1 and is kept here for posterity reasons.

- ref:1 http://lua-users.org/lists/lua-l/2009-11/msg00089.html

## On LuaJIT 2.0 IP and Research Opportunities

It has been brought to my attention that it might be advantageous
for some parts of the research community and the open source
community, that I make a public statement about the intellectual
property (IP) contained in LuaJIT 2.0 and earlier versions:

  I hereby declare any and all of my own inventions contained in
  LuaJIT to be in the public domain and up for free use by anyone
  without payment of any royalties whatsoever.

  [Note that the source code itself is licensed under a permissive
  license and is not placed in the public domain. But this is an
  orthogonal issue.]

I cannot guarantee it to be free of third-party IP however. In
fact nobody can. Writing software has become a minefield and any
moderately complex piece of software is probably (unknowingly to
the author) encumbered by hundreds of dubious patents. This
especially applies to compilers. The curent IP system is broken
and software patents must be abolished. Ceterum censeo.

The usual form of disclosure is to write papers and publish them.
I'm sorry, but I don't have the time for this right now. But I
would consider publishing open source software as a form of
disclosure.

In the interest of anyone doing research on virtual machines,
compilers and interpreters, I've compiled  a list of some of the
new aspects to be found in LuaJIT 2.0. I do not claim all of them
are original (I cannot possibly know all of the literature), but
my research indicates that many of them are quite innovative.

This also presents some research opportunities for 3rd parties.
I have little use for academic merits myself -- I'm more interested
in coding than writing papers. Anyone is welcome to dig out any
aspects, explore them in detail and publish them (giving due credit).

Design aspects of the VM:

- NaN-tagging: 64 bit tagged values are used for stack slots and
  table slots. Unboxed floating-point numbers (doubles) are
  overlayed with tagged object references. The latter can be
  distinguished from numbers via the use of special NaNs as tags.
  It's a remote descendant of pointer-tagging.

  [The idea dates back to 2006, but I haven't disclosed it before
  2008. Special NaNs have been used to overlay pointers before.
  Others have used it for tagging later on. The specific layout is
  of my own devising.]

- Low-overhead call frames: The linear, growable stack implicitly
  holds the frame structure. The tags for the base function of
  each call frame hold a linked structure of frames, using no
  extra space. Calls/returns are faster due to lower memory
  traffic. This also allows installing exception handlers at zero
  cost (it's a special bit pattern in the frame link).

Design of the IR (intermediate representation) used by the compiler:

- Linear, pointer-free IR: The typed IR is SSA-based and highly
  orthogonal. An instruction takes up only 64 bits. It has up to
  two operands which are 16 bit references. It's implemented with
  a bidirectionally growable array. No trees, no pointers, no cry.
  Heavily optimized for minimal D-cache impact, too.

- Skip-list chains: The IR is threaded with segregated, per-opcode
  skip-list chains. The links are stored in a multi-purpose 16 bit
  field in the instruction. This facilitates low-overhead lookup
  for CSE, DSE and alias analysis. Back-linking enables short-cut
  searches (average overhead is less than 1 lookup). Incremental
  build-up is trivial. No hashes, no sets, no complex updates.

- IR references: Specially crafted IR references allow fast const
  vs. non-const decisions. The trace recorder uses type-tagged
  references (a form of caching) internally for low-overhead
  type-based dispatch.

- High-level IR: A single, uniform high-level IR is used across
  all stages of the compiler. This reduces overall complexity.
  Careful instruction design avoids any impact on low-level CSE
  opportunities. It also allows cheap and effective high-level
  semantic disambiguation for memory references.

Design of the compiler pipeline:

- Rule-based FOLD engine: The FOLD engine is primarily used for
  constant folding, algebraic simplifications and reassociation.
  Most traditional compilers have an evolutionary grown set of
  implicit rules, spread over thousands of hand-coded tiny
  conditionals.

  The rule-based FOLD engine uses a declarative approach to
  combine the first and second level of lookup. It allows wildcard
  lookup with masked keys, too. A pre-processor generates a
  semi-perfect hash table for constant-time rule lookup. It's able
  to deal with thousands of rules in a uniform manner without
  performance degradation. A declarative approach is also much
  easier to maintain.

- Unified stage dispatch: The FOLD engine is the first stage in
  the compiler pipeline. Wildcard rules are used to dispatch
  specific instructions or instruction types (loads, stores,
  allocations etc.) to later optimization stages (load forwarding,
  DSE etc.). Unmatched instructions are passed on to CSE.

  Unified stage dispatch facilitates modular and pluggable
  optimizations with only local knowledge. It's also faster than
  doing multiple dispatches in every stage.

Trace compiler:

- NLF region-selection: The trace heuristics use a natural-loop
  first (NLF) region-selection mechanism to come up with a
  close-to optimal set of (looping) root traces. Only special
  bytecode instructions trigger new root traces -- regular
  conditionals never do this. Root traces that leave the loop are
  aborted and retried later. This also gives outer loops a chance
  to inline inner loops with a low trip count.

  NLF usually generates a superior set of root traces than the
  MRET/NET (next-executing tail) and LEI (last-executed iteration)
  region-selection mechanisms known from the literature.

- Hashed profile counters: Bytecode instructions to trigger the
  start of a hot trace use low-overhead hashed profiling counters.
  The profile is imprecise because collisions are ignored. The
  hash table is kept very small to reduce D-cache impact (only two
  hot cache lines). Since NLF weeds out most false positives, this
  doesn't deteriorate hot trace detection.

  [Neither using hashed profile counters, nor imprecise profiling,
  nor using profiling to detect hot loops is new. But the specific
  combination may be original.]

- Code sinking via snapshots: The VM must be in a consistent state
  when a trace exits. This means that all updates (stores) to the
  state (stack or objects) must track the original language
  semantics.

  Naive trace compilers achieve this by forcing a full update of
  the state to memory before every exit. This causes many on-trace
  stores and seriously diminishes code quality.

  A better approach is to sink these stores to compensation code,
  which is only executed if the trace exits are actually taken.
  A common solution is to emit actual code for these stores. But
  this causes code cache bloat and the information often needs to
  be stored redundantly, for linking of side traces.

  Code sinking via snapshots allows sinking of arbitrary code
  without the overhead of the other approaches. A snapshot stores
  a consistent view of all updates to the state before an exit. If
  an exit is taken the on-trace machine state (registers and spill
  slots) and the snapshot can be used to restore the VM state.

  State restoration using this data-driven approach is slow of
  course. But repeatedly taken side exits quickly trigger the
  generation of side traces. The snapshot is used to initialize
  the IR of the side trace with the necessary state using
  pseudo-loads. These can be optimized together with the remainder
  of the side trace. The pseudo-loads are unified with the machine
  state of the parent trace by the backend to enable zero-cost
  linking to side traces.

  [Currently snapshots only allow store sinking of scalars. It's
  planned to extend this to allow arbitrary store and allocation
  sinking, which together with store forwarding would be a unique
  way to achieve scalar-replacement of aggregates.]

- Sparse snapshots: Taking a full snapshot of all state updates
  before every exit would need a considerable amount of storage.
  Since all scalar stores are sunk, it's feasible to reduce the
  snapshot density. The basic idea is that it doesn't matter which
  state is restored on a taken exit, as long as it's consistent.

  This is a form of transactional state management. Every snapshot
  is a commit; a taken exit causes a rollback to the last commit.
  The on-trace state may advance beyond the last commit as long as
  this doesn't affect the possibility of a rollback. In practice
  this means that all on-trace updates to the state (non-scalar
  stores that are not sunk) need to force a new snapshot for the
  next exit.

  Otherwise the trace recorder only generates a snapshot after
  control-flow constructs that are present in the source, too.
  Guards that have a low probability of being wrongly predicted do
  not cause snapshots (e.g. function dispatch). This further
  reduces the snapshot density. Sparse snapshots also improve
  on-trace code quality, because they reduce the live range of the
  results of intermediate computations. Scheduling decisions can
  be made over a longer stream of instructions, too.

  [It's planned to switch to compressed snapshots. 2D-compression
  across snapshots may be able to remove even more redundancy.]

Optimizations:

- Hash slot specialization: Hash table lookup for constant keys is
  specialized to the predicted hash slot. This avoids a loop to
  follow the hash chain. Pseudocode:

    HREFK:  if (hash[17].key != key) goto exit
    HLOAD:  x = hash[17].value
    -or-
    HSTORE: hash[17].value = x

  HREFK is shared by multiple HLOADs/HSTOREs and may be hoisted
  independently. The verification of the prediction (HREFK) is
  moved out of the dependency chain by a super-scalar CPU. This
  makes hash lookup as cheap as array lookup with minimal complexity.

  It also avoids all the complications (cache invalidation,
  ordering constraints, shape mismatches) associated with hidden
  classes (V8) or shape inference/property caching (TraceMonkey).

- Code hoisting via unrolling and copy-substitution (LOOP):
  Traditional loop-invariant code motion (LICM) is mostly useless
  for the IR resulting from dynamic languages. The IR has many
  guards and most subsequent instructions are control-dependent on
  them. The first non-hoistable guard would effectively prevent
  hoisting of all subsequent instructions.

  The LOOP pass does synthetic unrolling of the recorded IR,
  combining copy-substitution with redundancy elimination to
  achieve code hoisting. The unrolled and copy-substituted
  instructions are simply fed back into the compiler pipeline,
  which allows reuse of all optimizations for redundancy
  elimination. Loop recurrences are detected on-the-fly and a
  minimized set of PHIs is generated.

- Narrowing of numbers to integers: Predictive narrowing is used
  for induction variables. Demand-driven narrowing is used for
  index expressions using a backpropagation algorithm.

  This avoids the complexity associated with speculative, eager
  narrowing, which also causes excessive control-flow dependencies
  due to the many overflow checks. Selective narrowing is better
  at exploiting the combined bandwidth of the FP and integer units
  of the CPU and avoids clogging up the branch unit.

Register allocation:

- Blended cost-model for R-LSRA: The reverse-linear-scan register
  allocator uses a blended cost model for its spill decisions.
  This takes into account multiple factors (e.g. PHI weight) and
  benefits from the special layout of IR references (constants
  before invariant instructions, before variant instructions).

- Register hints: The register allocation heuristics take into
  account register hints, e.g. for loop recurrences or calling
  conventions. This is very cheap to implement, but improves the
  allocation decisions considerably. It reduces register shuffling
  and prevents unnecessary spills.

- x86-specific improvements: Special heuristics for move vs.
  rename produce close to optimal code for two-operand machine
  code instructions.

  Fusion of memory operands into instructions is required to
  generate high-quality x86 code. Late fusion in the backend
  allows better, local decisions, based on actual register
  pressure, rather than estimates of prior stages.

Ok, that's it! Sorry for the length of this posting, but I hope it
was at least informative to someone out there.

--Mike
	This was copied verbatim from the Lua mailing list see ref:1 and is kept here for posterity reasons.

	- ref:1 http://lua-users.org/lists/lua-l/2009-11/msg00089.html

	## On LuaJIT 2.0 IP and Research Opportunities

	It has been brought to my attention that it might be advantageous
	for some parts of the research community and the open source
	community, that I make a public statement about the intellectual
	property (IP) contained in LuaJIT 2.0 and earlier versions:

	I hereby declare any and all of my own inventions contained in
	LuaJIT to be in the public domain and up for free use by anyone
	without payment of any royalties whatsoever.

	[Note that the source code itself is licensed under a permissive
	license and is not placed in the public domain. But this is an
	orthogonal issue.]

	I cannot guarantee it to be free of third-party IP however. In
	fact nobody can. Writing software has become a minefield and any
	moderately complex piece of software is probably (unknowingly to
	the author) encumbered by hundreds of dubious patents. This
	especially applies to compilers. The curent IP system is broken
	and software patents must be abolished. Ceterum censeo.

	The usual form of disclosure is to write papers and publish them.
	I'm sorry, but I don't have the time for this right now. But I
	would consider publishing open source software as a form of
	disclosure.

	In the interest of anyone doing research on virtual machines,
	compilers and interpreters, I've compiled a list of some of the
	new aspects to be found in LuaJIT 2.0. I do not claim all of them
	are original (I cannot possibly know all of the literature), but
	my research indicates that many of them are quite innovative.

	This also presents some research opportunities for 3rd parties.
	I have little use for academic merits myself -- I'm more interested
	in coding than writing papers. Anyone is welcome to dig out any
	aspects, explore them in detail and publish them (giving due credit).

	Design aspects of the VM:

	- NaN-tagging: 64 bit tagged values are used for stack slots and
	table slots. Unboxed floating-point numbers (doubles) are
	overlayed with tagged object references. The latter can be
	distinguished from numbers via the use of special NaNs as tags.
	It's a remote descendant of pointer-tagging.

	[The idea dates back to 2006, but I haven't disclosed it before
	2008. Special NaNs have been used to overlay pointers before.
	Others have used it for tagging later on. The specific layout is
	of my own devising.]

	- Low-overhead call frames: The linear, growable stack implicitly
	holds the frame structure. The tags for the base function of
	each call frame hold a linked structure of frames, using no
	extra space. Calls/returns are faster due to lower memory
	traffic. This also allows installing exception handlers at zero
	cost (it's a special bit pattern in the frame link).

	Design of the IR (intermediate representation) used by the compiler:

	- Linear, pointer-free IR: The typed IR is SSA-based and highly
	orthogonal. An instruction takes up only 64 bits. It has up to
	two operands which are 16 bit references. It's implemented with
	a bidirectionally growable array. No trees, no pointers, no cry.
	Heavily optimized for minimal D-cache impact, too.

	- Skip-list chains: The IR is threaded with segregated, per-opcode
	skip-list chains. The links are stored in a multi-purpose 16 bit
	field in the instruction. This facilitates low-overhead lookup
	for CSE, DSE and alias analysis. Back-linking enables short-cut
	searches (average overhead is less than 1 lookup). Incremental
	build-up is trivial. No hashes, no sets, no complex updates.

	- IR references: Specially crafted IR references allow fast const
	vs. non-const decisions. The trace recorder uses type-tagged
	references (a form of caching) internally for low-overhead
	type-based dispatch.

	- High-level IR: A single, uniform high-level IR is used across
	all stages of the compiler. This reduces overall complexity.
	Careful instruction design avoids any impact on low-level CSE
	opportunities. It also allows cheap and effective high-level
	semantic disambiguation for memory references.

	Design of the compiler pipeline:

	- Rule-based FOLD engine: The FOLD engine is primarily used for
	constant folding, algebraic simplifications and reassociation.
	Most traditional compilers have an evolutionary grown set of
	implicit rules, spread over thousands of hand-coded tiny
	conditionals.

	The rule-based FOLD engine uses a declarative approach to
	combine the first and second level of lookup. It allows wildcard
	lookup with masked keys, too. A pre-processor generates a
	semi-perfect hash table for constant-time rule lookup. It's able
	to deal with thousands of rules in a uniform manner without
	performance degradation. A declarative approach is also much
	easier to maintain.

	- Unified stage dispatch: The FOLD engine is the first stage in
	the compiler pipeline. Wildcard rules are used to dispatch
	specific instructions or instruction types (loads, stores,
	allocations etc.) to later optimization stages (load forwarding,
	DSE etc.). Unmatched instructions are passed on to CSE.

	Unified stage dispatch facilitates modular and pluggable
	optimizations with only local knowledge. It's also faster than
	doing multiple dispatches in every stage.

	Trace compiler:

	- NLF region-selection: The trace heuristics use a natural-loop
	first (NLF) region-selection mechanism to come up with a
	close-to optimal set of (looping) root traces. Only special
	bytecode instructions trigger new root traces -- regular
	conditionals never do this. Root traces that leave the loop are
	aborted and retried later. This also gives outer loops a chance
	to inline inner loops with a low trip count.

	NLF usually generates a superior set of root traces than the
	MRET/NET (next-executing tail) and LEI (last-executed iteration)
	region-selection mechanisms known from the literature.

	- Hashed profile counters: Bytecode instructions to trigger the
	start of a hot trace use low-overhead hashed profiling counters.
	The profile is imprecise because collisions are ignored. The
	hash table is kept very small to reduce D-cache impact (only two
	hot cache lines). Since NLF weeds out most false positives, this
	doesn't deteriorate hot trace detection.

	[Neither using hashed profile counters, nor imprecise profiling,
	nor using profiling to detect hot loops is new. But the specific
	combination may be original.]

	- Code sinking via snapshots: The VM must be in a consistent state
	when a trace exits. This means that all updates (stores) to the
	state (stack or objects) must track the original language
	semantics.

	Naive trace compilers achieve this by forcing a full update of
	the state to memory before every exit. This causes many on-trace
	stores and seriously diminishes code quality.

	A better approach is to sink these stores to compensation code,
	which is only executed if the trace exits are actually taken.
	A common solution is to emit actual code for these stores. But
	this causes code cache bloat and the information often needs to
	be stored redundantly, for linking of side traces.

	Code sinking via snapshots allows sinking of arbitrary code
	without the overhead of the other approaches. A snapshot stores
	a consistent view of all updates to the state before an exit. If
	an exit is taken the on-trace machine state (registers and spill
	slots) and the snapshot can be used to restore the VM state.

	State restoration using this data-driven approach is slow of
	course. But repeatedly taken side exits quickly trigger the
	generation of side traces. The snapshot is used to initialize
	the IR of the side trace with the necessary state using
	pseudo-loads. These can be optimized together with the remainder
	of the side trace. The pseudo-loads are unified with the machine
	state of the parent trace by the backend to enable zero-cost
	linking to side traces.

	[Currently snapshots only allow store sinking of scalars. It's
	planned to extend this to allow arbitrary store and allocation
	sinking, which together with store forwarding would be a unique
	way to achieve scalar-replacement of aggregates.]

	- Sparse snapshots: Taking a full snapshot of all state updates
	before every exit would need a considerable amount of storage.
	Since all scalar stores are sunk, it's feasible to reduce the
	snapshot density. The basic idea is that it doesn't matter which
	state is restored on a taken exit, as long as it's consistent.

	This is a form of transactional state management. Every snapshot
	is a commit; a taken exit causes a rollback to the last commit.
	The on-trace state may advance beyond the last commit as long as
	this doesn't affect the possibility of a rollback. In practice
	this means that all on-trace updates to the state (non-scalar
	stores that are not sunk) need to force a new snapshot for the
	next exit.

	Otherwise the trace recorder only generates a snapshot after
	control-flow constructs that are present in the source, too.
	Guards that have a low probability of being wrongly predicted do
	not cause snapshots (e.g. function dispatch). This further
	reduces the snapshot density. Sparse snapshots also improve
	on-trace code quality, because they reduce the live range of the
	results of intermediate computations. Scheduling decisions can
	be made over a longer stream of instructions, too.

	[It's planned to switch to compressed snapshots. 2D-compression
	across snapshots may be able to remove even more redundancy.]

	Optimizations:

	- Hash slot specialization: Hash table lookup for constant keys is
	specialized to the predicted hash slot. This avoids a loop to
	follow the hash chain. Pseudocode:

	HREFK: if (hash[17].key != key) goto exit
	HLOAD: x = hash[17].value
	-or-
	HSTORE: hash[17].value = x

	HREFK is shared by multiple HLOADs/HSTOREs and may be hoisted
	independently. The verification of the prediction (HREFK) is
	moved out of the dependency chain by a super-scalar CPU. This
	makes hash lookup as cheap as array lookup with minimal complexity.

	It also avoids all the complications (cache invalidation,
	ordering constraints, shape mismatches) associated with hidden
	classes (V8) or shape inference/property caching (TraceMonkey).

	- Code hoisting via unrolling and copy-substitution (LOOP):
	Traditional loop-invariant code motion (LICM) is mostly useless
	for the IR resulting from dynamic languages. The IR has many
	guards and most subsequent instructions are control-dependent on
	them. The first non-hoistable guard would effectively prevent
	hoisting of all subsequent instructions.

	The LOOP pass does synthetic unrolling of the recorded IR,
	combining copy-substitution with redundancy elimination to
	achieve code hoisting. The unrolled and copy-substituted
	instructions are simply fed back into the compiler pipeline,
	which allows reuse of all optimizations for redundancy
	elimination. Loop recurrences are detected on-the-fly and a
	minimized set of PHIs is generated.

	- Narrowing of numbers to integers: Predictive narrowing is used
	for induction variables. Demand-driven narrowing is used for
	index expressions using a backpropagation algorithm.

	This avoids the complexity associated with speculative, eager
	narrowing, which also causes excessive control-flow dependencies
	due to the many overflow checks. Selective narrowing is better
	at exploiting the combined bandwidth of the FP and integer units
	of the CPU and avoids clogging up the branch unit.

	Register allocation:

	- Blended cost-model for R-LSRA: The reverse-linear-scan register
	allocator uses a blended cost model for its spill decisions.
	This takes into account multiple factors (e.g. PHI weight) and
	benefits from the special layout of IR references (constants
	before invariant instructions, before variant instructions).

	- Register hints: The register allocation heuristics take into
	account register hints, e.g. for loop recurrences or calling
	conventions. This is very cheap to implement, but improves the
	allocation decisions considerably. It reduces register shuffling
	and prevents unnecessary spills.

	- x86-specific improvements: Special heuristics for move vs.
	rename produce close to optimal code for two-operand machine
	code instructions.

	Fusion of memory operands into instructions is required to
	generate high-quality x86 code. Late fusion in the backend
	allows better, local decisions, based on actual register
	pressure, rather than estimates of prior stages.

	Ok, that's it! Sorry for the length of this posting, but I hope it
	was at least informative to someone out there.

	--Mike