Skip to content

Instantly share code, notes, and snippets.

@faultyserver
Last active February 5, 2018 19:44
Show Gist options
  • Save faultyserver/5939d791ddd4d20a58a91da69fcda304 to your computer and use it in GitHub Desktop.
Save faultyserver/5939d791ddd4d20a58a91da69fcda304 to your computer and use it in GitHub Desktop.
An in-depth introduction to the Interpreter class for the Myst programming language: http://myst-lang.org.

Myst Interpreter Walkthrough

This guide attempts to walk through the structure of the Interpreter class of the Myst language.

The source code for the Interpreter is split into multiple files. The main file is src/myst/interpreter.cr, but most of the actual implementation lives under the src/myst/interpreter directory.

This guide is split into various sections to help gradually build a full understanding, and to simplify explanations further on in the guide. In order, we'll cover:

  1. Primitives - basic value types and the differences between them.
  2. Scopes - value containers, closures, and nesting.
  3. Instance properties of the interpreter - purpose and general use cases.
  4. Pattern Matching - the recursive pattern matching implementation.
  5. Invocations - persistant representations of a function call.
  6. Utilities - useful functions for simplifying complex behavior.
  7. Visitors - walking a program's syntax tree.
  8. Kernel + Native library - defining types, modules, and methods that are bundled with the myst executable.

This guide will not go through the full details of all of these components, but will hopefully cover enough of the basics for you to go and discover those details on your own.

Primitives

Primitives are the basis of any object that exists in Myst. The primitive types in Myst are:

  • Boolean
  • Integer
  • Float
  • String
  • Symbol
  • Nil
  • List
  • Map
  • Functor
  • FunctorDef
  • NativeDef
  • Module
  • Type
  • Instance

The definitions for all of these types are given in src/myst/interpreter/value.cr. The class names for all concrete types (types that can be instantiated) are prefixed with T to avoid colliding with Crystal's own type names (e.g., the type name for integer values is TInteger).

All value types inherit from the the abstract base class Value, which defines three things:

  • a truthy? method that returns the truthiness of the Value. In every case except nil and the boolean value false, this will return true.
  • a type_name method that returns the name of the type as it would be referred to in Myst. For example, TFloat#type_name returns "Float".
  • an ivars property that contains the instance variables for the value.

Value also provides a generic method for creating a new value object from a Literal AST node (integers, strings, etc.).

Container Types

Types and Modules both inherit from an intermediate value class called ContainerType. ContainerTypes have an included_modules containing the list of modules that have been included (using include in the language) into the type/module.

This list of included modules is used in the ancestors method to recursively determine the list of ancestors for a given type/module. For example:

defmodule Foo
end

defmodule Bar
  include Foo
end

deftype Baz
  include Bar
end

In this case, the Bar type would only have the Bar module in its included_modules list, but ancestors will return [Bar, Foo], because Bar also includes the Foo module. We'll see how ancestors is used within the interpreter later on in the "Visitors" section.

Both modules and types define a scope property to contain the values defined within them. Types also define an instance_scope property to distinguish between static and instance properties. We'll see more about scopes in a bit.

Lastly, types also define an extended_modules property - the list of modules that the type extends, meaning they apply to the type itself, rather than instances of the type.

Instances

Instances of a Myst Type are represented by the TInstance value type. Instances only keep a reference to their Type and a scope property to ensure that each instance has its own version of variables and such.

By keeping a reference to the Type, rather than copying the instance_scope of the type into each instance, TInstance types remain small and incredibly fast to duplicate.

Functors

Functors are the value objects that represent functions in Myst. A Functor can be created with def/defstatic, a block argument to a Call, or an anonymous function definition (using fn...end).

Functors have a few special properties beyond what most value types include:

  • clauses is the list of clauses that are defined for the functor.
  • lexical_scope is (obviously) the lexical scope at the point where the function was defined. For functions with multiple clauses, this will be the lexical scope of the first clause of the function.
  • closure? is a boolean representing whether the functor is a closure.
  • closed_self is the value of self at the point where the functor was defined. This property is only used if the functor is a closure.

To expand on the clauses property: functors in Myst are defined by a set of clauses which provide the parameter definitions and bodies of the function. For more information about what this looks like in the language itself, check out the "Multi-clause functions" section of the Get Started guide.

Clauses can be defined in two ways. Most commonly, a clause will be defined in Myst code, which would become a TFunctorDef. In some cases (such as the Native Library, which we'll cover later on), the clause can be defined as a Crystal method with a specific interface. This interface is defined by the TNativeDef alias as a method taking a receiver value, arguments to the function, and a block argument, and returning a Value. Any method that follows this structure can be used as a TNativeDef.

Scopes

We talked a little bit about Scopes in the previous section, but here we'll cover what exactly a scope is, how they implement closures, and how they can be nested to allow access to variables defined outside of the scope itself.

The normal Scope class is defined in src/myst/interpreter/scope.cr, and defines two properties:

  • a values hash that maps names to values.
  • an optional parent scope.

Scopes provide two different access methods: recursive and direct. The access operators ([], []?, and []=) are all recursive, meaning they will also check their parent scope (and the parent's parent, etc.) if the requested name does not exist in the current scope.

Direct access will only check the current Scope object for the name, and is done using has_key? and assign. Getting the value for a name is not currently implemented with a direct access method. Instead, use has_key? to check if the Scope directly contains the name, then use [] to get the value.

ClosureScopes

ClosureScope is a subclass of Scope that implements closures. It is defined in src/myst/interpreter/closure_scope.cr.

The only new property of ClosureScope is a closed_scope value that is the scope to be closed over.

ClosureScopes re-implement both the recursive and direct access methods of Scope to work in the same way: first checking the closed scope for a name, then falling back to the scope itself.

ClosureScopes will never access the parent scope. Instead, it uses recursive access on the scope that it closes over.

Instance properties

The Interpreter defines a bunch of properties that track the current state of a program and provide a simple interface for performing operations. All of these properties are defined in src/myst/interpreter.cr.

The Myst interpreter is a basic Stack Machine, meaning that all operations the interpreter can perform are implemented on top of a single array of Value objects. However, there are many other properties used to store state information that can affect the behavior of the implementations in various ways.

The main properties of the Interpreter are:

  • stack - The stack of values that the interpreter performs operations on.
  • self_stack - A stack of values, where at any point in the program, the top value represents the current value of self.
  • callstack - A stack of "frames", representing all of the calls that happened to get to the current point in the program.
  • scope_stack - A stack of Scopes representing the current lexical scope of the program.
  • kernel - The top-level module, exposing all of the primitive types to the language (more on this later).

All other properties of the interpreter (input, output, errput, and warnings) are infrastructural properties that don't affect the interpreter itself, but how the program that invokes the interpreter can interact with it. These are not important for understanding how the interpreter works.

Stack

There's not too much more to explain about the stack property. I would recommend reading up on how Stack Machines work to get a better idea of how this will be used by the rest of the Interpreter.

Self stack

The self_stack is modified whenever the program does something that changes the value of self. This can happen in a few ways:

  • When entering a deftype or defmodule, the type/module will be pushed to the top of this stack.
  • When invoking a Call with a receiver, the receiver will be pushed as the value of self.
  • When invoking a Call on a function that is a closure, the closed_self property of the function will be pushed on top of the receiver as the value of self.

At any point in the Interpreter, the current value of self is available via the current_self method.

Callstack

Callstack is a basic wrapper around an Array of AST nodes, tracking the execution path of the interpreter to show how a program reached its current point. It is defined in src/myst/interpreter/callstack.cr.

At this point, the callstack is only used when printing out backtrace information when an error is raised and not captured by the program.

Scope stack

The scope stack is the most complex stack in the Interpreter. New scopes are pushed to this stack in a few cases:

  • Every function clause will push a copy of the functions scope when attempting to match and invoke it.
  • Exception handlers push completely new scopes to avoid leaking variables outside of the handler.

push_scope_override automatically sets the parent of new scopes to the previous current_scope, unless the new scope already had a parent (e.g., functions from within a module have that module as a parent).

This may seem simple, but the current scope is actually not determined by these operations in most cases. The two cases described above are how scope overrides are pushed to the scope stack. In most cases, there is no scope override (e.g., within a deftype or defmodule, or at the top level). Instead, the current scope is determined by the scope of the current value of self, as determined by the __scopeof utility function.

This behavior is all abstracted by the Interpreter#current_scope function, so any method in the interpreter can just call current_scope and not worry about these semantics.

Kernel

The kernel is just the top level module of the program. It is inserted as the base value in the self_stack to allow variable and method definitions at the top level of the program, and to expose the primitive types and native library to the program.

Pattern Matching

Recursive pattern matching is implemented as part of the Interpreter, and is defined in the src/myst/interpreter/matcher.cr file. For more information on what pattern matching in Myst looks like, check out the "Pattern Matching" chapter of the Get Started guide.

The interface to the matcher is the match method, that takes a pattern and a value to try to match. This method looks at the type of the pattern (an AST node) and dispatches to the appropriate matching implementation for that type. These methods are then able to recursively call back to the match method to perform deep pattern matching on values.

Whenever any part of a match fails, the matcher will raise a MatchError - a RuntimeError that can be rescued from within the language.

match_value

match_value is the primitive matching operation, used to match the value or type of a value directly against a value pattern. This method is called whenever the pattern is a simple Literal node (not a List or a Map literal), a value interpolation, or a Constant.

This method interprets the pattern to get the value it represents, then simply checks its equality with the originally-given value.

The special case here is when the pattern resolves to a Type or Module, where the match is then performed against the type of the value, rather than the value itself. This match will succeed as long as the type of the given value or any of its ancestors match the type specified by the pattern.

match_list

match_list is called whenever the pattern for the match is a ListLiteral. Right off the bat, this method asserts that the given value is also a List. If it is not, a MatchError is raised.

If that check succeeds, the method will then "chunk" the list pattern into three parts by calling chunk_list_pattern. The result of this function is three components: an array of elements that appear before the Splat argument in the list, the Splat element itself, and then an array of elements appearing after the Splat. If the list pattern does not contain a Splat, only the first component will have any values.

With the chunked pattern, match_list then recursively calls match for each pattern component on the left side of the splat, pairing it with the element at the same index from the value being matched. The same is done backwards from the right for elements after the splat.

Finally, the splat is matched with any remaining elements in the value.

Importantly, match_list is an exhaustive match, meaning that all elements from the value list must be matched. If any elements of the value are not matched, and there is no splat to collect them, the match will fail.

match_map

Similar to match_list, match_map also checks that the value being matched is a Map, but the actual matching implementation is much simpler.

The method iterates the keys of the MapLiteral pattern, checking the entry for that value in the Map value given to the match. If the key does not exist in the Map value, the match fails.

If the value does exist, the method recursively calls match with the corresponding value from the pattern.

Unlike match_list, this match is not exhaustive. The value may define other keys beyond what the pattern requires and the match will still succeed.

Variable matching

For Vars, Underscores, and IVars, matching does not require any work. These matches simply assign whatever value is provided into the variable represented by the pattern.

Invocations

An Invocation is a binding of an interpreter instance, a functor, and arguments for the functor together to represent a Call. The definition of Invocation lives in src/myst/interpreter/invocation.cr.

In essence, an Invocation is the step between the Call node in the AST and the result of calling the function. Functor resolution happens before creating an Invocation, meaning they can be passed around without concern for lexical scoping or the like. However, the entire functor is carried with an Invocation; matching to a specific clause does not happen until the Invocation is actually invoked.

After creating an Invocation, the Call is executed by calling Invocation#invoke. Because Invocations bind all of the necessary components needed to invoke the Call, this can happen immediately after the Invocation is created, or at any point in the future if the Invocation is stored. However, if the state of the interpreter changes between these two times, the results of invoking the Invocation may vary.

Invoking the Invocation

When Invocation#invoke is called, a few things happen before the actual Call execution:

  • The Invocation stores the size of the self_stack before the Call is invoked so that it can be restored afterward.
  • If the Call has a receiver, that value is pushed to the self_stack.
  • If the Call is a closure, the closed_self property of the functor is pushed to the self_stack. This is intentionally after pushing the receiver, as the closure takes precedence.
  • The clauses of the functor are matched, in order, with the arguments provided to the Invocation. This uses the bound interpreter's matcher as described above, but with some extra wrapping behavior, the details of which can be found in Invocation#clause_matches?.
  • Once a clause that matches all of the Invocation's arguments is found, iteration of the clauses stops.

To ensure that each attempt at matching clause parameters does not have any effects on outside code, each attempt pushes a new scope override as described in the Scopes section. This scope is then popped if the match fails, or after the Call executes if the match succeeds.

When a match succeeds, do_call is called with the matching clause. do_call has three definitions based on the type of the clause:

  1. For a TFunctorDef (a clause defined in Myst code), the Invocation tells the interpreter to visit the clause's body and the resulting value is returned.
  2. For a TNativeDef (a native Crystal clause definition), the clause called directly with the receiver, arguments, and block argument passed in (see the Primitives section for how TNativeDef is defined).
  3. For any other value, a RuntimeError is raised, as the clause is not a callable object. This should never happen, and would actually be an Interpreter bug if it were to be encountered.

After do_call returns, the Call execution has finished, and so the Invocation resets the value of self to what it was before the Invocation (using the index it stored earlier), then returns the value that resulted from performing the Call.

Alternatively, if the Invocation failed to find a matching clause for the provided arguments, it will raise a RuntimeError saying that no clause matched.

Flow Control

Invocations are also responsible for implementing the flow control operations (break, next and return).

The Interpreter implements these using Crystal's native exception handling, and each of the operations has a corresponding Exception subclass to represent it.

For BreakExceptions, the semantics involve immediately exiting from both the function being called and the parent function that called said function. To implement this, BreakExceptions have a caught property to indicate whether this is the first or second function it is breaking from.

For next and return, the implementation simply pops the last value from the stack and returns it immediately.

Utilities

The Myst Interpreter defines a few useful utility functions for common operations used across the Interpreter codebase. Most of these methods are prefixed with dunderscores (two underscores, __) to explicitly indicate that they are utilities coming from src/myst/interpreter/util.cr.

__typeof

__typeof is easily the most commonly used utility. It provides an easy way of resolving the type of any object. For Types and Modules, this returns the value itself. For instances, it returns the type of the instance, and for any primitive value, it will return the Myst Type from the Kernel that is associated with that primitive type.

Some example use cases of __typeof include the type matching described in the Matcher section above and performing recursive lookup of variables/functions. However, the most common use case is providing type information for Native Library functions, beyond just primitive types. This will be covered a little bit in the Native Library section.

__scopeof

__scopeof is similar to __typeof, but returns the primary scope of the given value, rather than the entire type. For object instances, this is the cloned scope they created when they were instantiated. For containers, this is the scope property (the static scope for TType, the only scope on TModule). For any other value type (e.g., primitives), this method gets the TType representing the primitive type of the value, then returns it's instance scope.

__scopeof is most commonly used for recursive lookup (alongside __typeof) of names on an object. It allows the implementation to abstract away all of the special cases of scoping and just ask for a value directly through a single interface.

lookup and recursive_lookup

As their names imply, lookup and recursive_lookup are responsible for searching through scopes and finding a value for the given name.

lookup only checks the current scope (using the recursive access notation described in the Scopes section), and will raise an error if no value is found for that name.

recursive_lookup instead takes a receiver value as well and will recursively search through the scopes of the receiver and it's appropriate ancestors to look for the requested name (for TTypes, this is the extended_ancestors, ancestors for everything else).

__raise_not_found

If a value is expected to be found, but cannot be for whatever reason, __raise_not_found is the common interface for raising the appropriate error. This method is responsible for everyone's favorite message: "No variable or method foo for bar:Bar".

__raise_runtime_error

If the Interpreter needs to raise a RuntimeError for whatever reason, it should do so using __raise_runtime_error. Not only does this method abstract away the boilerplate of creating a RuntimeError object, it also allows for hooking and other debugging methods while developing on the interpreter (i.e., checking callstacks to see how the error was reached).

This is most commonly used in the Native Library to raise exceptions when users provide bad arguments or an operation can otherwise not be performed.

Visitors

This is the part I'm sure you've all been waiting for. Visitors are what make the interpreter actually run and perform the operations that a program defines.

Rather than going into detail here about the Visitor pattern itself, I would recommend reading the Wikipedia article on the pattern.

The gist of the implementation here is that there is a visit method that accepts a node argument, which is an AST node object from the program tree. Each node subclass (e.g., Def, ValueInterpolation, or Expressions) then has its own overload of the visit method to define how that node should be handled. These methods are responsible for calling visit with any of their children nodes (as appropriate).

With all of this in place, executing a program is as simple as calling Interpreter#visit with the root node, and the program will just execute. Note that #visit is actually wrapped by #run to handle some other behavior.

The rest of this section will be covering some of the more complicated visit overloads to explain them a bit better, with references to the rest of this guide. These will be covered alphabetically for simplicity.

All visit overloads for the Intepreter are defined in the src/myst/interpreter/nodes/ folder, and each node type's overload generally lives in a file with the same name.

Call

Most of the interesting work around Calls was covered in the Invocation section, but Call still does a fair amount of work before delegating to Invocation.

First, Call calls lookup_call. As of Myst 0.4.0, Calls can either refer to a function by a name String, or they can define an expression that resolves to a functor. In the first case, lookup_call will delegate to the recursive_lookup we covered earlier. In the latter, it simply interprets the expression and returns the result.

Also, Myst's syntax means that sometimes Call nodes won't refer to a function at all, instead referring to a non-local variable. In this case the visitor just pushes the resolved value onto the stack to transparently avoid an error.

Def

The only really interesting thing here is how existing functor objects are found vs creating new ones.

The visitor first checks the current value of self, and whether the Def is static or not to determine which scope the functor lookup/insertion should be done in. Then, if that scope directly has a functor with the same name, it is picked up. If not, a new functor is created. Only then is the new clause added to the functor object.

Also, if the Def is actually a Block node (which is a subclass of Def), the functor will not be added to the current scope, as blocks are anonymous and always unique.

ExceptionHandler

Any code block that has a rescue or ensure clause gets wrapped in an ExceptionHandler. Similar to an Invocation the visitor for this node will remember the self_stack and scope_stack sizes at entry to restore them after the exception handler finishes.

Most of this implementation is copied from Invocation, just looking at the rescues property of the node rather than functor clauses.

InterpolatedStringLiteral

Dealing with interpolations within strings is a little more complicated than it may seem. The syntax means that the representation of an InterpolatedStringLiteral is made up of various StringPieces.

Each piece can either be a real StringLiteral or an arbitrary Node. In the latter case, the visitor recurses through the Node, then tries to call whatever to_s method the resulting value defines to get the string representation.

The end result is created by joining all of the resulting Strings together and wrapping that in a TString value object.

MagicConst

Magic constants are special values that introspect the program source code. __FILE__, __LINE__, and __DIR__ all look at the location information of the node to create a value dynamically.

OpAssign

OpAssigns are syntactically rewrites to their expanded form (a op= b -> a = a op b).

At this point, this is the only visitor that dynamically generates new nodes to recurse into. Something to note here is that this will probably be moved into the SemanticAnalysis phase to improve runtime performance.

Require

Require is easily the most complicated visitor. The premise is to resolve the given file name into an absolute file path, load that file through a new Parser, then recurse into the resulting node tree.

Most of the code here relates to resolving file paths across the file system, as the given path may already be absolute, or it could project relative, or relative to any directory on the $MYST_PATH environment variable.

Kernel + Native Library

The Native Library is the set of types, modules, and methods that are compiled into the myst executable and made available to any program. These methods are all implemented in Crystal as TNativeDefs, and added to the Interpreter via the create_kernel method, defined in src/myst/interpreter/kernel.cr. The init_* methods that it calls out to are all defined in the files under the src/myst/interpreter/native_lib directory.

Rather than cover what is defined in the native library, this section will cover how these methods are defined and the tools used to do so.

The NativeLib module defined in src/myst/interpreter/native_lib.cr defines a bunch of additional utilities to simplify writing native function implementations and abstract out most of the boilerplate for common operations.

call_func and call_func_by_name

The call_func utility provides a clean interface to creating and invoking an Invocation object, given a function and some arguments.

call_func_by_name makes this even easier by allowing the code to simplify provide the name of a function to call, and it will take care of resolving the appropriate functor object from that name.

The most common usecase for these methods are to call to_s or similar methods on Myst values for easier manipulation within the native method.

instantiate

instantiate is a simple wrapper around creating a new instance of a Type and calling the initialize method on it. This does not currently have many uses, but is useful for cleanly creating new objects dynamically inside of a native function.

method

The method macro is a convenience macro for defining Crystal methods that can be used as TNativeDefs and added as clauses to a functor. Every method in the native library is created using this method.

The arguments to this macro are the name of the generated method, the expected type of the self object when inside of this method, the parameters for the method, and the method definition itself, given as a block.g

This method takes care of casting the given arguments into the parameter types and casting the return value into a Value type as expected by the TNativeDef alias.

def_method and def_instance_method

To actually add Crystal methods created with NativeLib#method to an object in the interpreter, use def_method or def_instance_method.

def_method is used for defining methods on modules, or static methods on types, while def_instance_method is only used to define instance methods on a type.

Both of these methods take three arguments: the TType object to add the method to, the name to use for the method in Myst, and the name of the Crystal method that defines the implementation (the name used for NativeLib#method).

These methods are used in all of the init_* methods described earlier to populate the kernel types with their functions.

Conclusion

Hopefully this guide has been helpful and provided enough information for you to form a full understanding of the structure of the Myst Interpreter. This guide has not covered the Lexer, Parser, or Semantic Analyzer, so I will likely end up expanding this guide with other guides for those components as well. They are all much simpler, though, so I don't expect the guides will be as long.

I also imagine I've probably missed something or explained things poorly somehwere in this guide, so please feel free to ask me questions on Discord. You can join the Myst Community Discord and ask in the #learning channel, or PM me directly (faulty#7958).

@Jens0512
Copy link

Jens0512 commented Feb 4, 2018

Awesome! Thank you very much for this great guide :D

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment