@soh-cah-toa
Created August 9, 2011 00:56
PODDS: Parrot Opcode Debug Data Serialization format
As of 3.6.0, the debug segment in Parrot bytecode is severely lacking to say the least,
containing merely a line number to opcode mapping that is very unreliable. What we need is a
standardized data debugging format. One of the challenges in this is how to describe the
relationship between the generated bytecode and the original source with enough detail so that a
debugger can provide the user with detailed information. Additionally, this description must be
short and sweet so that it doesn't eat up a lot of space or take too much processing time to
interpret.
There already exist several debugging formats: COFF, stabs, DWARF, etc. However, these are
generally used in object files and executables and not virtual machine bytecode (with the
exception of LLVM since it is a more low-level virtual machine). Despite this, I think using
these as a model for designing our own debugging format would prove to be invaluable. Considering
the popularity of the DWARF format, I will be using it as a general guideline.
A majority of modern high-level programming languages are block structured. That is, each entity
- whether it be a function or class - is contained within another entity. This creates a lexical
scope where symbols are only known within the scope in which they were defined. Therefore,
the best approach for a debug format is to use a model that is also block structured. This makes
it much easier to describe the static structure of a source file, show the members of a
structure or class, etc. Each description entity (excluding the root which describes the source
file) is contained within a parent entity. It may also contain child entries and siblings. This
design creates a tree-like structure similar to the abstract syntax tree created by the
compiler. In the interest of keeping this concise, only the minimum amount of information that
is needed to describe a program object is provided. Additionally, this makes the format
extensible enough to describe nearly any procedural or object-oriented language. A debugger like
HBDB could recognize or ignore certain extensions created by various HLL's.
For now, I'd like to call the format PODDS: Parrot Opcode Debug Data Serialization format. This
specification is still an early draft so the name (among other things) is certainly subject to
change.
The full PODDS specification is quite exhaustive. If you aren't interested in all the gory
details, please see "Introduction to PODDS" at https://gist.github.com/1180094.
NOTE:
I cannot take full credit for the design of this debug data format. This specification is
essentially just a watered-down version of DWARF 1.1 with a few name changes here and there that I
think are better. Since full DWARF compliance would definitely not be possible for Parrot, I've
pretty much just picked out the bits and pieces that I think are Parrot-able. Each section listed
below roughly corresponds with a section in the DWARF 1.1 specification so I will list the page
numbers that it is taken from. The DWARF 1.1 specification can be found at
<http://dwarfstd.org/doc/dwarf_1_1_0.pdf>.
== DATA DESCRIPTION ENTITY (p. 5) ==
The most basic entity in PODDS is called a "Data Description Entity" or DDE. A DDE consists of a
"class" that indicates what it describes and a list of "properties" that further describe the
specific characteristics of the entity. Excluding the topmost DDE, a DDE will always be owned by
a parent DDE and may or may not have any child or sibling DDE's.
Examples of class names:
CLASS_array_type
CLASS_class_type
CLASS_compile_unit
CLASS_enum_type
CLASS_global_sub
CLASS_global_var
CLASS_inline_sub
CLASS_label
CLASS_lex_block
CLASS_local_var
CLASS_member
CLASS_module
CLASS_padding
CLASS_param
CLASS_ptr_type
CLASS_ref_type
CLASS_src_file
CLASS_str_type
CLASS_sub
== PROPERTIES (p. 5) ==
Properties always form a name/value pair. A value will always have one of the following forms:
* address - points to some location in the program's address space
* reference - refers to another DDE in the debug segment
* constant - uninterpreted numerical data
* block - uninterpreted data
* string - a null-terminated series of zero or more bytes
Examples of property names:
PT_end_pc
PT_fund_type
PT_inline
PT_is_optional
PT_lang
PT_location
PT_program
PT_sibling
PT_start_pc
PT_start_scope
PT_str_len
PT_user_def_type
There is no restriction on the order in which properties appear. To eliminate ambiguity, each
property is unique and no more than one property of a given name may appear in a DDE.
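As a rough illustration of the two sections above, a DDE can be modeled as a class tag plus a
mapping of property names to values. The sketch below is in Python purely for readability. The
`DDE` type and its `add_property` helper are hypothetical names that are not part of PODDS, and
the example values are made up.

```python
from dataclasses import dataclass, field


@dataclass
class DDE:
    """Hypothetical in-memory model of a Data Description Entity."""
    dde_class: str                                  # e.g. "CLASS_local_var"
    properties: dict = field(default_factory=dict)  # property name -> value

    def add_property(self, name, value):
        # No more than one property of a given name may appear in a DDE.
        if name in self.properties:
            raise ValueError(f"duplicate property {name} in {self.dde_class}")
        self.properties[name] = value


var = DDE("CLASS_local_var")
var.add_property("PT_name", "counter")      # string form
var.add_property("PT_fund_type", "FT_int")  # constant form, shown symbolically
```

Since the order of properties carries no meaning, a plain mapping is enough here. Adding a second
`PT_name` to `var` would raise `ValueError`.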
== DDE OWNERSHIP (p. 6) ==
The ownership of DDE entries is represented by their physical ordering and use of the
`PT_sibling` property. The value of this property is a reference to another DDE. If the DDE
referred to is null, it represents the end of the sibling chain. Except for `CLASS_padding`, all
DDE's are required to have the `PT_sibling` property. A DDE is owned by its physical predecessor
(called the "parent") unless it is referenced by that physical predecessor with the `PT_sibling`
property. You can think of this DDE as the first child of the predecessor. Children derived from
a DDE form a chain of siblings.
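The ownership rule can be sketched as a small tree builder. This is only a model under stated
assumptions: each entry is a dictionary with hypothetical keys `offset`, `name`, and `sibling`,
and the end of a sibling chain is represented by `None` rather than by a reference to a null DDE.

```python
def build_tree(entries):
    """Rebuild ownership from physical order and sibling references.

    `entries` is a list of dicts in physical order. Each dict has the
    hypothetical keys "offset", "name" and "sibling" (the offset of the
    next sibling, or None at the end of a sibling chain).
    """
    def parse(i, end_offset):
        # Collect one sibling chain starting at index i, stopping at end_offset.
        nodes = []
        while i < len(entries) and (
            end_offset is None or entries[i]["offset"] < end_offset
        ):
            entry = entries[i]
            sibling = entry["sibling"]
            # Everything between an entry and its sibling is owned by that entry.
            # The last sibling owns everything up to the end of its parent.
            child_end = sibling if sibling is not None else end_offset
            children, i = parse(i + 1, child_end)
            nodes.append((entry["name"], children))
        return nodes, i

    tree, _ = parse(0, None)
    return tree


entries = [
    {"offset": 0,  "name": "compile_unit", "sibling": None},
    {"offset": 10, "name": "sub main",     "sibling": 40},
    {"offset": 20, "name": "param argc",   "sibling": 30},
    {"offset": 30, "name": "local_var i",  "sibling": None},
    {"offset": 40, "name": "global_var g", "sibling": None},
]
tree = build_tree(entries)
```

Here `param argc` is not the sibling of `sub main`, so it becomes its first child, while
`global_var g` is referenced by the sibling property of `sub main` and therefore sits next to it
under `compile_unit`.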
== LOCATION DESCRIPTORS (p. 7) ==
DDE's are required to provide a description of how to determine run-time values of program
objects such as variables. These "location descriptors" are provided by forming variable length
descriptions using a few simple building blocks called "location iotas." They are:
* OP_reg(register) - the program object is in the register given by "register"
* OP_addr(address) - the program object is in the address given by "address"
* OP_const(number) - the program object is constant
{{{ TODO Location iota for describing continuations }}}
A location descriptor can consist of a single location iota or a series of them. If the location
description is a series of iotas, they should be ordered as if they were operators in a postfix
expression. If a location descriptor doesn't contain any iotas, it represents a program object
that exists in the source code but doesn't exist in the bytecode (possibly due to optimizations
such as dead code elimination). When evaluated, the expression represented by a location
description evaluates to the run-time address of the value of that program object except in the
case of `OP_reg` where it evaluates to the register containing the value of the program object.
== TYPE PROPERTIES (p. 9) ==
Certain properties describe the data type of the entity. There are four basic type properties:
fundamental types, user-defined types, modified fundamental types, and modified user-defined
types.
A fundamental type is any data type defined by the HLL, a.k.a. "built-in types." It is
represented by the `PT_fund_type` property. This property contains one of the following constant
values:
FT_bool
FT_char
FT_float
FT_int
FT_label
FT_long
FT_ptr
FT_short
FT_sign_*
FT_usign_*
Additionally, the programmer may define new types such as structures, classes, enums, etc. They
are described in their own separate DDE containing the `PT_user_def_type` property which
references the DDE for the new type.
== MODIFIED TYPES (p. 9) ==
Some user-defined types are created by applying certain modifiers to other types, both
fundamental and user-defined. The "pointer to" modifier means that the value should be interpreted
as an address, the "reference to" modifier indicates that the value is a C++ reference, the "const"
modifier means that the value is immutable, and so on. Note that these modifiers are mostly seen
in the C world, but other HLL's have similar constructs that differ in their details but serve
the same purpose.
Examples of type modifiers:
MOD_const
MOD_ptr_to
MOD_ref_to
The type modifiers appear in order as if they were part of a right-associative expression. When
applied to fundamental types, the `PT_mod_fund_type` property contains the value of the
contiguous block of bytes where the type is stored. When applied to a user-defined type, the
`PT_mod_u_d_type` property also contains the value of the contiguous block of bytes where the
type is stored.
Consider the following examples in C, each followed by its PODDS description:
    const char *ch;
        MOD_ptr_to MOD_const FT_char
    int * const i;
        MOD_const MOD_ptr_to FT_int
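To make the ordering concrete, here is a minimal sketch that reads a modifier list from left to
right and spells it out in words. The function name `describe_type` and the wording table are my
own illustration, not part of the format.

```python
# Hypothetical wording for each modifier, for illustration only.
MODIFIER_WORDS = {
    "MOD_const": "const",
    "MOD_ptr_to": "pointer to",
    "MOD_ref_to": "reference to",
}


def describe_type(tokens):
    """Read a modifier list left to right, ending in the base type."""
    *modifiers, base = tokens
    words = [MODIFIER_WORDS[m] for m in modifiers]
    # Strip the FT_ prefix to get a readable base type name.
    words.append(base[len("FT_"):] if base.startswith("FT_") else base)
    return " ".join(words)


print(describe_type(["MOD_ptr_to", "MOD_const", "FT_char"]))  # pointer to const char
print(describe_type(["MOD_const", "MOD_ptr_to", "FT_int"]))   # const pointer to int
```

Each modifier applies to everything on its right, which is why the two C declarations above
produce different orderings.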
== ACCESS MODIFIERS (p. 10) ==
In most object-oriented languages, access level modifiers are used to control access to certain
members of a class. Although the semantics vary from language to language, they generally
include 'public', 'private', and 'protected'. PODDS uses the `PT_public`, `PT_private`, and
`PT_protected` properties for each of these respectively. The value of each of these properties is
a zero-length string consisting of only the null byte. The mere presence of these properties is
enough to describe the accessibility of the DDE.
== COMPILATION UNITS (p. 11) ==
It's possible for Parrot bytecode to be derived from one or more sources or "compilation units".
This may be the case in C when all the #include directives are expanded or in Perl when all the
'use'd modules have been imported. Each compilation unit DDE owns the DDE's that describe the
declarations in the corresponding compilation unit. Each compilation unit will be described using
the `CLASS_compile_unit` class.
A `CLASS_compile_unit` class may have any of the following properties:
* PT_sibling
A reference to the DDE that appears right after the last DDE for that compilation unit. Note
that a DDE may not actually exist at the specified offset. This is the case when that offset
is greater than or equal to the size of the debug segment.
* PT_start_pc
The address of the first opcode generated for that compilation unit. This address may be
beyond the last valid opcode.
* PT_end_pc
The address of the first location past the last opcode generated for that compilation unit.
This address may be beyond the last valid opcode.
* PT_name
A null-terminated string containing either the full or relative path of the source file from
which the compilation unit was derived.
* PT_lang
A constant value (preferably an unsigned integer) representing the source language of the
compilation unit.
* PT_stmt_list
A reference to a record in the line number table (mentioned later).
* PT_comp_dir
A null-terminated string containing the current working directory of the command that
generated the compilation unit.
* PT_compiler
A null-terminated string containing information about the compiler that generated the
compilation unit. The actual contents of the string are left up to the compiler, though it's
recommended that it begin with the name of the vendor or at least some other identifiable
string that will avoid confusion with other compilers.
== SUBROUTINES (p. 12) ==
DDE's describing subroutines must include any one of the following three classes:
CLASS_global_sub
CLASS_sub
CLASS_inline_sub
A subroutine DDE uses the `PT_name` property that contains a null-terminated string representing
its name as it appears in the original source file.
At this point it is important to mention that all names used to describe program objects (in this
case, subroutines) should represent the object's name AS IT APPEARS IN THE ORIGINAL SOURCE FILE.
This is because some compilers use name mangling to encode extra information in the name like in
the case of function or operator overloading. HLL's that do use name mangling techniques should
always use the unmangled name in the `PT_name` property of subroutines and other program objects.
In object-oriented HLL's, the subroutine DDE for a method a.k.a. "member function" should use
the `PT_member` property. The value of this property is a reference to the type definition of
the class. The presence of this property makes identifier resolution through methods possible.
In HLL's that distinguish between ordinary subroutines and "main subroutines," the DDE for such a
subroutine should contain the `PT_main` property. The value of which is a zero-length string
consisting of only the null byte. The mere presence of this property is enough to identify the
subroutine as the "entry point" of the program.
If a subroutine returns a value, then its DDE should have one of the four type properties:
fundamental type, modified fundamental type, user-defined type, and modified user-defined type.
Subroutines (formally known as procedures) that do not return a value should not have a type
property.
In HLL's where all subroutines must be pre-declared with a prototype (only if its name is used
before it's defined), the subroutine's DDE should contain the `PT_proto` property. The value of
which is a zero-length string consisting of only the null byte. The mere presence of this
property is enough to indicate that the subroutine has a prototype.
The `PT_start_pc` and `PT_end_pc` properties of a subroutine represent the address of the first
and last opcode respectively generated for the subroutine.
The declarations in a subroutine are described by the entries owned by the subroutine's DDE.
DDE's that describe the parameters of a subroutine will appear in the same order as they do in
the original source file.
There are no limitations on the order of declaration DDE's (that don't represent its parameters)
that are children of a subroutine DDE.
For HLL's that support subroutines with indefinite arity (variadic functions in C), the
unspecified parameters in a variadic parameter list are described with the `CLASS_unspec_params`
class.
A subroutine may include the `PT_ret_cont` property. The value of this property is a location
descriptor representing the continuation where the subroutine will return to.
Inline subroutines are described in two portions: the declaration instance and each inline
instance. The declaration instance, if any, is described with a regular subroutine DDE. These
entries must include the `PT_inline` property whose value is a zero-length string consisting of
only the null byte. The mere presence of this property is enough to indicate that the subroutine
is inlined. If an HLL does not require an "out-of-line" declaration, then the subroutine DDE
will not have any `PT_start_pc` and `PT_end_pc` properties. Furthermore, if such a DDE exists
but also has children describing its parameters (such as pointy blocks in Perl 6), its children
will not have location descriptors.
Each DDE for inline subroutines must use the `CLASS_inline_sub` class. It will also contain the
`PT_spec` property which is a reference to the DDE describing the declaration or "specification."
Each DDE for an inline subroutine owns the DDE entries describing its parameters (if any) and its
local/lexical variables.
== LEXICAL BLOCKS (p. 14) ==
A lexical block is described with a DDE using the `CLASS_lex_block` class. This entity has
both a `PT_start_pc` and `PT_end_pc` property that represent the address of the first and last
opcode respectively generated for the lexical block.
The name of the lexical block (if any) is described using the `PT_name` property.
The declarations in a lexical block are described by the entries owned by the block's DDE. There
exists one DDE for each declaration within the lexical block.
== LABELS (p. 15) ==
A label is described with a DDE using the `CLASS_label` class. The entity for a label is
owned by the DDE describing the scope in which the label can be referenced.
A label DDE has a `PT_start_pc` property representing the address of the opcode generated as the
first statement that immediately follows the label.
The name of the label is described using the `PT_name` property.
== VARIABLES (p. 15) ==
All variables whether they're global, local, or parameters are described with a DDE containing
the `CLASS_global_var`, `CLASS_local_var`, and `CLASS_param` classes respectively. The entity may
contain the following properties:
* PT_name
A null-terminated string representing the variable name as it appears in the original source
file.
* PT_location
The location descriptor of the variable. If this property has a null value or is not used at
all, it is assumed that the variable exists only in the source code but doesn't exist in the
bytecode.
* Any one of the four type properties.
* PT_member
A reference to the structure or class type of which the variable is a member, if any.
* PT_opt_param
If the variable is an optional parameter, this property is a zero-length string consisting
of only the null byte. The mere presence of this property is enough to indicate that the
variable is an optional parameter.
* PT_def_val
If the variable is a parameter that has a default value, this property may be any constant
value (including strings) that appropriately represents the actual default value of the
parameter.
* PT_const_val
If the variable is a constant value, this property may be any constant value (including
strings) that appropriately represents the variable's actual value.
* PT_start_scope
If the variable's scope begins after the value of `PT_start_pc` for the closest enclosing
scope of the variable, this property represents the offset of the beginning of the scope for
the variable from the `PT_start_pc` value of the DDE that defines its scope. This is used by
HLL's that allow the scope of a variable to begin in the middle of a lexical block or allow
one declaration to change the scope of a subsequent declaration.
== TYPEDEFS (p. 17) ==
Any type defined via a typedef is described with a DDE using the `CLASS_typedef` class. This
entity has a `PT_name` property representing its name. The entity also contains one of the four
type properties.
== POINTERS (p. 17) ==
Pointers and references are described with a DDE containing the `CLASS_ptr_type` and
`CLASS_ref_type` classes respectively. If the pointer or reference is named, then it will contain
the `PT_name` property representing its name. The entity also contains one of the four type
properties which describes the type pointed to or referenced.
== ARRAYS (p. 17) ==
Arrays are described with a DDE using the `CLASS_array_type` class. If the array is named,
then it will contain the `PT_name` property representing its name. An array DDE describing a
multidimensional array may include the `PT_ordering` property whose value is a constant that
describes the ordering (row-major or column-major) of the array's elements. If the `PT_ordering`
property exists, its value must be either `ORD_col_major` or `ORD_row_major`. If it doesn't
exist, the default ordering of the language (given in the `PT_lang` property) is assumed.
The subscripts and element data type of the array are described with the `PT_subscr_data`
property. The value of this property is stored in the contiguous block of memory containing the
array. A "data item" describes each dimension and element type of the array. The data items that
describe the dimensions are ordered by the appearance of the dimensions in the original source
file. The last data item in the `PT_subscr_data` property describes the element type.
A data item that describes a dimension is split into four parts in the following order:
1. A format specifier that describes the information following it.
2. The subscript index type which may be either a fundamental or user-defined type.
3. Information that describes the lower bound of the dimension. This may take the form of either
a constant value or location descriptor. If it's a location descriptor, its value is the
address of the lowest element of the dimension. If the lower bound is not specified, it is
described with a zero-length block.
4. Information that describes the upper bound of the dimension. Similar to the lower bound, its
value may be either a constant value or location descriptor. If the upper bound is not
specified, it is described with a zero-length block.
The first data item for a dimension consists of a format specifier preceded by one of the four
type properties. This determines how the data items following it should be interpreted. This is
much more efficient than using specific properties to describe the type of the subscript index
and upper/lower bounds. There are nine possible format specifiers:
* FMT_ft_c_c
A fundamental type followed by a constant followed by a constant.
* FMT_ft_c_d
A fundamental type followed by a constant followed by a location descriptor.
* FMT_ft_d_c
A fundamental type followed by a location descriptor followed by a constant.
* FMT_ft_d_d
A fundamental type followed by a location descriptor followed by a location descriptor.
* FMT_ut_c_c
A reference to a user-defined type followed by a constant followed by a constant.
* FMT_ut_c_d
A reference to a user-defined type followed by a constant followed by a location descriptor.
* FMT_ut_d_c
A reference to a user-defined type followed by a location descriptor followed by a constant.
* FMT_ut_d_d
A reference to a user-defined type followed by a location descriptor followed by a location
descriptor.
* FMT_et
A type property describing the element type.
If it is possible to determine the size of the array at compile time, the array DDE may use the
`PT_static_size` property. The value of this property is a constant representing the total size
in bytes.
== CLASSES AND STRUCTURES (p. 19) ==
Classes and structures are described with DDE's using the `CLASS_class_type` and
`CLASS_struct_type` classes respectively. If the class or structure is named, then it will
contain the `PT_name` property representing its name. If it is possible to determine the size of
the class or structure at compile time, its DDE may use the `PT_static_size` property. The value
of this property is a constant representing the total size in bytes.
The members of a class or structure are described by the DDE's owned by the corresponding
entities for the class/structure and appear in the same order as they do in the original source
file.
If the definition of a member of a class or structure appears outside the class/structure
definition, it will have a DDE containing the `PT_member` property which is a reference to the
class declaration containing that member. If the definition of a member appears inside the
class/structure definition, it will contain the `PT_location` property describing the location of
that member relative to the base address of the class/structure that most closely encloses it.
A class that inherits from another class owns the DDE that describes the class it inherits from.
This is indicated by a DDE using the `CLASS_inherit` class.
A DDE for an inherited class has the `PT_user_def_type` property which is a reference to the DDE
describing the class from which the parent is derived. It also has a location property describing
the location of the members inherited by the class relative to the beginning of the members of
the entire class.
As described earlier, a DDE may contain one of the three accessibility properties: `PT_public`,
`PT_private`, and `PT_protected`.
== ENUMERATIONS (p. 23) ==
An enumeration is described with DDE's using the `CLASS_enum_type` class. If the enumeration is
named, then it will contain the `PT_name` property representing its name. An enumeration entity
also has a `PT_byte_size` property which is a constant value representing the number of bytes
needed for an instance of this enumeration.
== DECREASING ACCESS TIME (p. 26) ==
A symbolic debugger has to access PODDS data very frequently. Therefore, it is very important to
consider how to decrease the amount of time needed to read and interpret debug data. This
becomes quite difficult when a program object is defined outside the compilation unit where the
debuggee is stopped. To find the DDE associated with a program object, a debugger would have to
run a very aggressive search through every DDE at the highest scope in each compilation unit.
This can severely cripple the performance of the debugger.
To combat this problem, a compiler has the option of providing two separate types of tables that
provide information about the DDE's owned by a particular compilation unit: the public name table
and the public address table.
== PUBLIC NAME TABLE (p. 26) ==
The "public name table" is a subsection of the debug segment consisting of records that contain
variable-length entries. Each record describes the names of program objects described by the
DDE's that are owned by a single compilation unit. Each record starts with a header that contains
three important values: 1) the (non-inclusive) length of the entries for that record, 2) the
offset of the compilation unit's DDE from the start of the debug segment, and 3) the size in
bytes of the DDE describing that particular compilation unit. Following the header is a variable
number of offset/name pairs. Each pair contains the offset of the DDE for the given program
object, measured from the start of the compilation unit DDE that corresponds to the current
record, followed by a string representing the object's name as found in its `PT_name` property.
Each record is
terminated by a null pair. In this way, a debugger can rapidly determine which compilation unit
to search in order to find the DDE for a program object with a given name.
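A lookup over this table might look roughly like the following. This is a sketch over an
already-decoded table: the dictionary keys `cu_offset` and `pairs` are hypothetical, the offsets
are invented, and the null pair that terminates each record is assumed to have been dropped during
decoding.

```python
def find_by_name(name_table, name):
    """Return the debug-segment offset of the DDE named `name`, or None."""
    for record in name_table:
        for dde_offset, dde_name in record["pairs"]:
            if dde_name == name:
                # Pair offsets are relative to the compilation unit's DDE,
                # whose own offset is relative to the debug segment.
                return record["cu_offset"] + dde_offset
    return None


name_table = [
    {"cu_offset": 0,   "pairs": [(24, "main"), (96, "helper")]},
    {"cu_offset": 512, "pairs": [(24, "parse")]},
]
print(find_by_name(name_table, "parse"))  # 536
```

The debugger never has to walk the DDE tree of the first compilation unit to resolve `parse`,
which is the point of the table.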
== PUBLIC ADDRESS TABLE (p. 27) ==
The "public address table" is a subsection of the debug segment consisting of records that
contain variable-length entries. Each record describes the section of the program's address
space that contains the compilation unit. Each record starts with a header that contains two
important values: the (non-inclusive) length of the entries for that record and the offset of
the compilation unit's DDE from the start of the debug segment. Following the header is a
variable number of pairs of "address range descriptors." Each one contains the starting address
of the range followed by its length. Each record is terminated by a null pair. In this way, a
debugger can rapidly determine which compilation unit to search in order to find the DDE for a
program object with a given address.
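The corresponding lookup by address can be sketched the same way. Again this operates on a
decoded table with hypothetical keys (`cu_offset`, `ranges`) and invented addresses. Each range
is a (start, length) pair as described above.

```python
def find_compile_unit(address_table, pc):
    """Return the offset of the compilation unit DDE covering `pc`, or None."""
    for record in address_table:
        for start, length in record["ranges"]:
            # A range covers the half-open interval [start, start + length).
            if start <= pc < start + length:
                return record["cu_offset"]
    return None


address_table = [
    {"cu_offset": 0,   "ranges": [(0, 100), (300, 50)]},
    {"cu_offset": 512, "ranges": [(100, 200)]},
]
print(find_compile_unit(address_table, 320))  # 0
print(find_compile_unit(address_table, 150))  # 512
```

A compilation unit may own several disjoint ranges, which is why each record carries a list
rather than a single start/length pair.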
== LINE NUMBER TABLE (p. 27) ==
Associating source-level line numbers with their respective generated opcodes makes it possible
for a debugger user to specify addresses in relation to source statements. This makes single
stepping much easier.
Each compilation unit DDE in the debug segment references a corresponding record in the line
number table that describes its respective source statement. The first record in the table
includes the length of the table in bytes and is followed by the address of the first opcode
generated for the compilation unit. The rest of the table consists of a list of source statement
records. A source statement record consists of three parts: 1) a line number, 2) a position
within the source line, and 3) an opcode address. The line numbers are ordered starting with 1
from the beginning of the compilation unit.
The compiler has two ways to represent the position within the source line. It can either use the
number of characters from the beginning of the line to the beginning of the source statement or
use the special value `SRC_NO_POS` to indicate that the record refers to the entire line. This
feature is necessary for HLL's that allow multiple statements in a single line.
The address in each record describes the address of the first opcode generated for that source
statement minus the address of the first opcode generated for the compilation unit. That is, it
represents the offset into the compilation unit.
Some HLL's allow statements to extend over multiple lines. The record in such a case will refer
to the line containing the start of that particular statement.
There is no limitation on the order in which the records appear. They do not necessarily represent
the exact order in which the statements appear in the original source file. Additionally, it is
not required to have a record in the line number table for every single source statement in the
original source file.
To terminate the line number table, PODDS uses a record whose line number is 0 and whose address
describes the first opcode of the next compilation unit. This allows the debugger to understand
which opcodes are associated with the last statement in a compilation unit; a useful feature for
stepping out of functions.
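As a sketch of how a debugger might use these records, the function below maps an opcode offset
back to a source line. The records are (line, position, address) tuples with made-up values, the
numeric value chosen for `SRC_NO_POS` is a placeholder, and I am assuming that the terminating
record's address is an offset like all the others.

```python
SRC_NO_POS = -1  # placeholder value; the real encoding is not defined here


def line_for_offset(records, pc_offset):
    """Return the source line whose statement covers `pc_offset`, or None."""
    best = None
    end = None
    for line, _pos, addr in records:
        if line == 0:
            # Terminating record: marks the first opcode past this unit.
            end = addr
            continue
        # Records may appear in any order, so keep the closest one at or
        # before the requested offset.
        if addr <= pc_offset and (best is None or addr > best[1]):
            best = (line, addr)
    if best is None or (end is not None and pc_offset >= end):
        return None
    return best[0]


records = [
    (1, SRC_NO_POS, 0),
    (3, SRC_NO_POS, 12),
    (2, SRC_NO_POS, 4),
    (0, SRC_NO_POS, 20),  # terminator
]
print(line_for_offset(records, 5))   # 2
print(line_for_offset(records, 12))  # 3
```

Because the terminator bounds the last statement, offsets 12 through 19 all resolve to line 3 and
offset 20 resolves to nothing in this unit.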
== EXTENSIONS (p. 29) ==
Special labels are reserved for compiler-specific extensions. To denote the start and end of a
range used for such extensions, the labels will use the normal prefix (CLASS, PT, FT, OP, MOD,
LANG, etc.) followed by the `_start_user` or `_end_user` suffix. This prevents extensions from
polluting the PODDS namespace.
Furthermore, compiler-specific extensions should take the form `prefix_compiler_version` where
`compiler` is the name of the compiler and `version` is the extension version (not the compiler
version).
== ERROR VALUES (p. 29) ==
When encoded (described in the next section), the value 0 is reserved to represent some unknown
value or error in the property names or forms, fundamental types, type modifiers, location iotas,
etc.
== ENCODING (p. 29) ==
{{{ XXX The values in this section are the same ones used by DWARF. I really don't see a reason
to make them anything different considering this system is proven to work. }}}
Each DDE consists of a 4-byte (inclusive) length, a 2-byte class, and a series of properties.
The 4-byte length is an unsigned integer that represents the total number of bytes used by the
DDE. The 2-byte class value determines the DDE's "classification" and is encoded as follows:
CLASS_padding = 0x0000
CLASS_array_type = 0x0001
CLASS_class_type = 0x0002
CLASS_enum_type = 0x0003
CLASS_param = 0x0004
CLASS_global_sub = 0x0005
CLASS_global_var = 0x0006
CLASS_label = 0x0007
CLASS_lex_block = 0x0008
CLASS_local_var = 0x0009
CLASS_member = 0x000a
CLASS_ptr_type = 0x000b
CLASS_ref_type = 0x000c
CLASS_compile_unit = 0x000d
CLASS_src_file = 0x000e
CLASS_str_type = 0x000f
CLASS_struct_type = 0x0010
CLASS_sub = 0x0011
CLASS_sub_type = 0x0012
CLASS_typedef = 0x0013
CLASS_unspec_params = 0x0014
CLASS_inherit = 0x0015
CLASS_inline_sub = 0x0016
CLASS_start_user = 0x4080
CLASS_end_user = 0xffff
`CLASS_padding` DDE's are distinct from null DDE's in that `CLASS_padding` entities have a 4-byte
size (greater than or equal to 8) and a 2-byte class that's followed by the appropriate number
of padding bytes. On the other hand, null entities contain between 1 and 7 zero bytes.
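Reading the fixed part of a DDE could be sketched as follows. The byte order is not specified in
this draft, so little-endian is purely an assumption here, and `read_dde_header` is a
hypothetical helper name.

```python
import struct

CLASS_SUB = 0x0011  # CLASS_sub from the table above


def read_dde_header(buf, offset=0):
    """Return (length, class_code) for the DDE starting at `offset`."""
    # "<IH": 4-byte unsigned length, then 2-byte unsigned class.
    # Little-endian is an assumption, not something the draft specifies.
    length, class_code = struct.unpack_from("<IH", buf, offset)
    return length, class_code


# A CLASS_sub DDE holding a single 2-byte property name (0x0012) and a
# 4-byte value: 4 + 2 + 2 + 4 = 12 bytes in total.
raw = struct.pack("<IHHI", 12, CLASS_SUB, 0x0012, 0)
print(read_dde_header(raw))  # (12, 17)
```

Since the length is inclusive, the next DDE in physical order starts at `offset + length`.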
== PROPERTY TYPES (p. 30) ==
Each property is encoded using a 2-byte field for its name followed by the appropriate value. The
value's form is encoded into the property's name using a bitmask. Possible forms include:
* address
Represents an address as `FORM_ADDR`.
* reference
A 4-byte value represented as `FORM_REF`. Its value is the offset (in bytes) relative to the
start of the debug segment.
* constant
Constants may take any of three forms: a 2-byte value `FORM_DATA2`, a 4-byte value
`FORM_DATA4`, or an 8-byte value `FORM_DATA8`.
* block
Blocks are represented in two ways. The first contains a 2-byte length that's followed by 0 -
65,535 contiguous bytes `FORM_BLOCK2`. The second contains a 4-byte length that's followed by
0 - 4,294,967,295 contiguous bytes `FORM_BLOCK4`. The bytes may contain any combination of
addresses, references, or data types.
* string
A null-terminated string `FORM_STRING`.
The forms encoded into the property name have the following values:
FORM_ADDR = 0x1
FORM_REF = 0x2
FORM_BLOCK2 = 0x3
FORM_BLOCK4 = 0x4
FORM_DATA2 = 0x5
FORM_DATA4 = 0x6
FORM_DATA8 = 0x7
FORM_STRING = 0x8
Properties are encoded as follows:
PT_sibling = 0x0010 | FORM_REF
PT_location = 0x0020 | FORM_BLOCK2
PT_name = 0x0030 | FORM_STRING
PT_fund_type = 0x0050 | FORM_DATA2
PT_mod_fund_type = 0x0060 | FORM_BLOCK2
PT_user_def_type = 0x0070 | FORM_REF
PT_mod_u_d_type = 0x0080 | FORM_BLOCK2
PT_subscr_data = 0x00a0 | FORM_BLOCK2
PT_byte_size = 0x00b0 | FORM_DATA4
PT_stmt_list = 0x0100 | FORM_DATA4
PT_start_pc = 0x0110 | FORM_ADDR
PT_end_pc = 0x0120 | FORM_ADDR
PT_lang = 0x0130 | FORM_DATA4
PT_member = 0x0140 | FORM_REF
PT_str_len = 0x0190 | FORM_BLOCK2
PT_comp_dir = 0x01b0 | FORM_STRING
PT_const_val = 0x01c0 | FORM_STRING
PT_const_val = 0x01c0 | FORM_DATA2
PT_const_val = 0x01c0 | FORM_DATA4
PT_const_val = 0x01c0 | FORM_DATA8
PT_const_val = 0x01c0 | FORM_BLOCK2
PT_const_val = 0x01c0 | FORM_BLOCK4
PT_def_val = 0x01e0 | FORM_ADDR
PT_def_val = 0x01e0 | FORM_DATA2
PT_def_val = 0x01e0 | FORM_DATA8
PT_def_val = 0x01e0 | FORM_STRING
PT_inline = 0x0200 | FORM_STRING
PT_is_opt = 0x0210 | FORM_STRING
PT_low_bound = 0x0220 | FORM_REF
PT_low_bound = 0x0220 | FORM_DATA2
PT_low_bound = 0x0220 | FORM_DATA4
PT_low_bound = 0x0220 | FORM_DATA8
PT_program = 0x0230 | FORM_STRING
PT_private = 0x0240 | FORM_STRING
PT_compiler = 0x0250 | FORM_STRING
PT_protected = 0x0260 | FORM_STRING
PT_proto = 0x0270 | FORM_STRING
PT_public = 0x0280 | FORM_STRING
PT_ret_cont = 0x02a0 | FORM_BLOCK2
PT_spec = 0x02b0 | FORM_REF
PT_start_scope = 0x02c0 | FORM_DATA4
PT_up_bound = 0x02f0 | FORM_REF
PT_up_bound = 0x02f0 | FORM_DATA2
PT_up_bound = 0x02f0 | FORM_DATA4
PT_up_bound = 0x02f0 | FORM_DATA8
PT_start_user = 0x2000
PT_end_user = 0x3ff0
{{{ XXX These values are intentionally out of order, since the bitmask would otherwise conflict
with the form "flag" that is set in each. }}}
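To show how a consumer might use the form bits to decide how many bytes to read, here is a minimal
reader for a single property. It is a sketch under several assumptions: little-endian byte order,
4-byte addresses, and the form stored in the low four bits of the name. `read_property` is a
hypothetical helper, not part of any existing Parrot API.

```python
import struct

FORM_ADDR, FORM_REF, FORM_BLOCK2, FORM_BLOCK4 = 0x1, 0x2, 0x3, 0x4
FORM_DATA2, FORM_DATA4, FORM_DATA8, FORM_STRING = 0x5, 0x6, 0x7, 0x8

# Fixed-size forms and the struct format used to read them.
# The 4-byte address size is an assumption.
FIXED = {FORM_ADDR: "<I", FORM_REF: "<I",
         FORM_DATA2: "<H", FORM_DATA4: "<I", FORM_DATA8: "<Q"}
# Block forms and the struct format of their length prefix.
BLOCK_LEN = {FORM_BLOCK2: "<H", FORM_BLOCK4: "<I"}

def read_property(buf: bytes, pos: int):
    """Read one property at pos; return (base name, form, value, next position)."""
    (encoded,) = struct.unpack_from("<H", buf, pos)
    pos += 2
    name, form = encoded & 0xFFF0, encoded & 0x000F
    if form in FIXED:
        fmt = FIXED[form]
        (value,) = struct.unpack_from(fmt, buf, pos)
        pos += struct.calcsize(fmt)
    elif form in BLOCK_LEN:
        fmt = BLOCK_LEN[form]
        (length,) = struct.unpack_from(fmt, buf, pos)
        pos += struct.calcsize(fmt)
        value = bytes(buf[pos:pos + length])
        pos += length
    elif form == FORM_STRING:
        end = buf.index(b"\x00", pos)  # strings are null-terminated
        value = bytes(buf[pos:end])
        pos = end + 1
    else:
        raise ValueError(f"unknown form {form:#x}")
    return name, form, value, pos
```

Calling `read_property` repeatedly, feeding each returned position back in, walks the property list
of a DDE one entry at a time.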
== LOCATION IOTAS (p. 32) ==
Each location iota has a 1-byte identification code which is interpreted to mean `reg(register)`,
`addr(address)`, or `const(number)`. For an iota that takes a number, the identifying byte is
followed by a 4-byte value. For an iota that takes an address, the value is large enough to
represent any address.
A location descriptor is the value of a location property and is stored as a block with a 2-byte
length (`FORM_BLOCK2`).
Location iotas are encoded as follows:
OP_reg = 0x01
OP_addr = 0x02
OP_const = 0x03
OP_start_user = 0xe0
OP_end_user = 0xff
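A location descriptor can then be decoded by walking its block one iota at a time. The sketch below
assumes little-endian byte order and a 4-byte operand for every iota, including registers and
addresses; `parse_location` is a hypothetical helper.

```python
import struct

OP_REG, OP_ADDR, OP_CONST = 0x01, 0x02, 0x03
OP_NAMES = {OP_REG: "reg", OP_ADDR: "addr", OP_CONST: "const"}

def parse_location(block: bytes):
    """Decode the contents of a location block into (iota name, operand) pairs."""
    iotas = []
    pos = 0
    while pos < len(block):
        op = block[pos]
        pos += 1
        if op not in OP_NAMES:
            raise ValueError(f"unknown location iota {op:#04x}")
        # Assumption: every operand, addresses included, is 4 bytes wide.
        (operand,) = struct.unpack_from("<I", block, pos)
        pos += 4
        iotas.append((OP_NAMES[op], operand))
    return iotas
```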
== FUNDAMENTAL TYPES (p. 33) ==
For values falling in the range from `FT_start_user` through `FT_end_user`, the low order byte
of the type's code contains the byte-size of program objects that are the specified type only if
the size is constant, otherwise the low order byte is 0.
Fundamental types are encoded as follows:
FT_char = 0x0001
FT_sign_char = 0x0002
FT_usign_char = 0x0003
FT_short = 0x0004
FT_sign_short = 0x0005
FT_usign_short = 0x0006
FT_int = 0x0007
FT_sign_int = 0x0008
FT_usign_int = 0x0009
FT_long = 0x000a
FT_sign_long = 0x000b
FT_usign_long = 0x000c
FT_ptr = 0x000d
FT_float = 0x000e
FT_dbl_float = 0x000f
FT_ext_float = 0x0010
FT_complex = 0x0011
FT_dbl_complex = 0x0012
FT_void = 0x0014
FT_bool = 0x0015
FT_ext_complex = 0x0016
FT_label = 0x0017
FT_start_user = 0x8000
FT_end_user = 0xffff
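The rule for user-defined fundamental types, where the low-order byte carries the byte-size, can be
sketched as follows. The `user_type_size` helper and the example code `0x8104` are made up for
illustration.

```python
FT_START_USER = 0x8000
FT_END_USER = 0xFFFF

def user_type_size(code: int):
    """Return the byte-size stored in a user-range type code, or None if it is not constant."""
    if not FT_START_USER <= code <= FT_END_USER:
        raise ValueError("not a user-defined fundamental type")
    size = code & 0xFF  # the low-order byte holds the size
    return size or None  # a low-order byte of 0 means the size is not constant

assert user_type_size(0x8104) == 4
assert user_type_size(0x8100) is None
```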
== TYPE MODIFIERS (p. 34) ==
Type modifiers are encoded as 1-byte values as follows:
MOD_ptr_to = 0x01
MOD_ref_to = 0x02
MOD_const = 0x03
MOD_start_user = 0x80
MOD_end_user = 0xff
== SOURCE LANGUAGES (p. 34) ==
Source languages are encoded as 4-byte constant values. To include a type for every known
dynamic language would be exhausting and pointless. For now, only a short list of the most
commonly used and developed HLL's is necessary. Languages can be added in the future as the need
arises.
LANG_perl6 = 0x00000001
LANG_nqp = 0x00000002
LANG_winxed = 0x00000003
LANG_partcl = 0x00000004
LANG_lua = 0x00000005
LANG_cardinal = 0x00000006
LANG_start_user = 0x00008000
LANG_end_user = 0x0000ffff
== ARRAY ORDERING (p. 35) ==
The order properties of arrays are encoded as follows:
ORD_row_major = 0x0
ORD_col_major = 0x1
== ARRAY SUBSCRIPTS (p. 35) ==
The entire array subscript entry must be less than 65,536 bytes. This may seem overly large but
it allows for future implementations of C on Parrot. Such an implementation would only allow
5,957 dimensions in an array and would require 11 bytes per dimension plus the 5 bytes (at
least) for the element type description.
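The dimension limit quoted above follows directly from the size budget. A quick check of the
arithmetic, using only the figures given in the text:

```python
MAX_ENTRY_BYTES = 65535   # the entry must be less than 65,536 bytes
BYTES_PER_DIMENSION = 11  # cost of one dimension, as stated above
ELEMENT_TYPE_BYTES = 5    # minimum cost of the element type description

max_dimensions = (MAX_ENTRY_BYTES - ELEMENT_TYPE_BYTES) // BYTES_PER_DIMENSION
assert max_dimensions == 5957
```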
Array subscript data contains six components that are encoded as follows:
* Format Specifier
1-byte constant.
* Fundamental Type
2-byte constant.
* User-Defined Type
4-byte reference.
* Subscript Bound Index
4-byte constant.
* Subscript Bound Location
2-byte data block.
* Element Type
Any of the four type properties preceded by the corresponding 2-byte CLASS_* class.
The format specifiers in the array subscript entry are encoded as follows:
FMT_ft_c_c = 0x1
FMT_ft_c_d = 0x2
FMT_ft_d_c = 0x3
FMT_ft_d_d = 0x4
FMT_ut_c_c = 0x5
FMT_ut_c_d = 0x6
FMT_ut_d_c = 0x7
FMT_ut_d_d = 0x8
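One way to read a single subscript entry is sketched below. It rests on an interpretation that the
text does not spell out: in `FMT_<type>_<low>_<high>`, `ft`/`ut` selects a fundamental or
user-defined index type, `c` means the bound is a 4-byte constant, and `d` means the bound is given
by a location block with a 2-byte length. Little-endian byte order is also assumed, and the helper
names are hypothetical.

```python
import struct

def decode_format(fmt: int):
    """Split a format specifier (0x1 to 0x8) into its three yes/no choices."""
    if not 1 <= fmt <= 8:
        raise ValueError(f"unknown format specifier {fmt:#x}")
    idx = fmt - 1
    return {"user_type": idx >= 4,          # ut_* rather than ft_*
            "low_dynamic": bool(idx & 2),   # *_d_* : low bound is a location
            "high_dynamic": bool(idx & 1)}  # *_*_d : high bound is a location

def read_bound(buf: bytes, pos: int, dynamic: bool):
    """Read one bound; return (value, next position)."""
    if dynamic:
        (length,) = struct.unpack_from("<H", buf, pos)
        start = pos + 2
        return bytes(buf[start:start + length]), start + length
    (value,) = struct.unpack_from("<I", buf, pos)
    return value, pos + 4

def read_subscript(buf: bytes, pos: int = 0):
    """Read one dimension; return (format info, index type, low, high, next position)."""
    info = decode_format(buf[pos])
    pos += 1
    if info["user_type"]:
        (index_type,) = struct.unpack_from("<I", buf, pos)  # 4-byte reference
        pos += 4
    else:
        (index_type,) = struct.unpack_from("<H", buf, pos)  # 2-byte fundamental type
        pos += 2
    low, pos = read_bound(buf, pos, info["low_dynamic"])
    high, pos = read_bound(buf, pos, info["high_dynamic"])
    return info, index_type, low, high, pos
```

Under this reading, a dimension with a fundamental index type and two constant bounds takes
1 + 2 + 4 + 4 = 11 bytes.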
== PUBLIC NAME TABLE (p. 36) ==
Each record in the public name table starts with a header containing three values: a 4-byte
(non-inclusive) value representing the length of the set of entries for the compilation unit, a
4-byte offset of the compilation unit's DDE from the start of the debug segment, and a 4-byte
value containing the byte-size of the DDE describing that particular compilation unit. The
header is followed by a series of pairs. Each pair contains a 4-byte offset followed by a
null-terminated string. Each set is terminated by a 4-byte value of 0. Four bytes might seem
excessive just to represent 0, but the length is consistent with the alignment.
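A reader for one set in the public name table might look like the sketch below. Little-endian byte
order is assumed, names are kept as raw bytes because no string encoding is specified, and
`parse_pubnames` is a hypothetical helper.

```python
import struct

def parse_pubnames(buf: bytes, pos: int = 0):
    """Parse one public name set; return (record dict, next position)."""
    # Header: set length, offset of the compilation unit's DDE, size of that DDE.
    length, cu_offset, cu_size = struct.unpack_from("<III", buf, pos)
    pos += 12
    names = {}
    while True:
        (offset,) = struct.unpack_from("<I", buf, pos)
        pos += 4
        if offset == 0:  # a 4-byte zero terminates the set
            break
        end = buf.index(b"\x00", pos)  # the name is null-terminated
        names[bytes(buf[pos:end])] = offset
        pos = end + 1
    record = {"length": length, "cu_offset": cu_offset,
              "cu_size": cu_size, "names": names}
    return record, pos
```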
== ADDRESS TABLE (p. 36) ==
Each record in the address table starts with a header containing two values: a 4-byte
(non-inclusive) value representing the length of the set of entries for the compilation unit and
a 4-byte offset into the debug segment. The header is followed by a series of pairs. Each pair
contains an address and a 4-byte constant length. Each set is terminated by a 4-byte value of 0.
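Going the other direction, a writer for one address table set could be sketched as follows. The
assumptions here are little-endian byte order, 4-byte addresses, and a non-inclusive length that
counts every byte after the length field itself; `build_address_table` is a hypothetical helper.

```python
import struct

def build_address_table(cu_offset: int, ranges) -> bytes:
    """Serialize one address table set from (address, length) pairs."""
    body = struct.pack("<I", cu_offset)
    for address, length in ranges:
        body += struct.pack("<II", address, length)
    body += struct.pack("<I", 0)  # 4-byte terminator
    # Assumption: the non-inclusive length excludes the length field itself.
    return struct.pack("<I", len(body)) + body
```

One thing worth keeping in mind with this layout: a reader that detects the end of the set by
looking for a 4-byte zero cannot tell the terminator apart from a range that starts at address 0,
so it may be safer for a reader to rely on the length field in the header instead.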
== LINE NUMBER TABLE (p. 36) ==
The source statement information for a compilation unit begins with a 4-byte (inclusive) length
followed by an address. This is followed by a series of source statement records. The 4-byte
(inclusive) length represents the number of bytes used by the statement information for the
compilation unit. The address represents the address of the first opcode generated for that
compilation unit.
Each record contains an unsigned 4-byte integer representing the source line number, an unsigned
2-byte integer representing the statement's position within the corresponding line, and an
unsigned 4-byte integer representing the address. The special position `SRC_NO_POS` has the
value 0xffff, which indicates that the record refers to the entire line.
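The layout above maps onto a fixed 10-byte record, so the table for one compilation unit can be
read with a short loop. This sketch assumes little-endian byte order, a 4-byte base address, and an
inclusive length that counts its own 4 bytes; `parse_line_table` is a hypothetical helper.

```python
import struct

SRC_NO_POS = 0xFFFF
RECORD = struct.Struct("<IHI")  # line number, position in line, address (10 bytes)

def parse_line_table(buf: bytes, pos: int = 0):
    """Parse one compilation unit's line table; return (base address, records)."""
    length, base_address = struct.unpack_from("<II", buf, pos)
    end = pos + length  # assumption: the length includes the length field itself
    pos += 8
    records = []
    while pos + RECORD.size <= end:
        line, column, address = RECORD.unpack_from(buf, pos)
        # SRC_NO_POS means the record refers to the entire line.
        records.append((line, None if column == SRC_NO_POS else column, address))
        pos += RECORD.size
    return base_address, records
```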
== APPLICABLE PROPERTIES (p. 41) ==
This list describes all the properties that a given class can have in its DDE. It is important
to note that these are merely the applicable properties and a DDE is not required to specify
every single one of them.
To save space, the `PT_fund_type`, `PT_mod_fund_type`, `PT_user_def_type`, and `PT_mod_u_d_type`
properties will be abbreviated as FT, MFT, UDT, and MUDT respectively.
ELEMENT NAME PROPERTY NAME
-----------------------------------------
CLASS_array_type PT_byte_size
PT_name
PT_ordering
PT_private
PT_protected
PT_public
PT_sibling
PT_start_scope
PT_subscr_data
-----------------------------------------
CLASS_class_type PT_byte_size
PT_name
PT_private
PT_protected
PT_public
PT_sibling
PT_start_scope
-----------------------------------------
CLASS_compile_unit PT_comp_dir
PT_compiler
PT_end_pc
PT_lang
PT_name
PT_sibling
PT_start_pc
PT_stmt_list
-----------------------------------------
CLASS_enum_type PT_byte_size
PT_name
PT_private
PT_protected
PT_public
PT_sibling
PT_start_scope
-----------------------------------------
CLASS_global_sub FT/MFT/UDT/MUDT
PT_end_pc
PT_inline
PT_location
PT_member
PT_name
PT_private
PT_program
PT_protected
PT_proto
PT_public
PT_ret_cont
PT_sibling
PT_start_pc
PT_start_scope
-----------------------------------------
CLASS_global_var FT/MFT/UDT/MUDT
PT_const_val
PT_location
PT_member
PT_name
PT_private
PT_protected
PT_public
PT_sibling
PT_start_scope
-----------------------------------------
CLASS_inherit PT_location
PT_private
PT_protected
PT_public
PT_sibling
PT_user_def_type
-----------------------------------------
CLASS_inline_sub PT_end_pc
PT_sibling
PT_spec
PT_start_pc
-----------------------------------------
CLASS_label PT_name
PT_start_pc
PT_start_scope
PT_sibling
-----------------------------------------
CLASS_lex_block PT_end_pc
PT_name
PT_sibling
PT_start_pc
-----------------------------------------
CLASS_local_var FT/MFT/UDT/MUDT
PT_const_val
PT_location
PT_name
PT_private
PT_protected
PT_public
PT_sibling
PT_start_scope
-----------------------------------------
CLASS_member FT/MFT/UDT/MUDT
PT_byte_size
PT_location
PT_name
PT_private
PT_protected
PT_public
PT_sibling
-----------------------------------------
CLASS_padding
-----------------------------------------
CLASS_param FT/MFT/UDT/MUDT
PT_def_val
PT_is_opt
PT_location
PT_name
PT_sibling
-----------------------------------------
CLASS_ptr_type FT/MFT/UDT/MUDT
PT_name
PT_private
PT_protected
PT_public
PT_start_scope
PT_sibling
-----------------------------------------
CLASS_ref_type FT/MFT/UDT/MUDT
PT_name
PT_private
PT_protected
PT_public
PT_start_scope
PT_sibling
-----------------------------------------
CLASS_str_type PT_byte_size
PT_name
PT_private
PT_protected
PT_public
PT_sibling
PT_start_scope
PT_str_len
-----------------------------------------
CLASS_struct_type PT_byte_size
PT_name
PT_private
PT_protected
PT_public
PT_sibling
PT_start_scope
-----------------------------------------
CLASS_sub FT/MFT/UDT/MUDT
PT_end_pc
PT_inline
PT_member
PT_name
PT_private
PT_protected
PT_proto
PT_public
PT_ret_cont
PT_start_pc
PT_start_scope
PT_sibling
-----------------------------------------
CLASS_sub_type FT/MFT/UDT/MUDT
PT_name
PT_private
PT_protected
PT_proto
PT_public
PT_sibling
PT_start_scope
-----------------------------------------
CLASS_typedef FT/MFT/UDT/MUDT
PT_name
PT_private
PT_protected
PT_public
PT_sibling
PT_start_scope
-----------------------------------------
CLASS_unspec_params PT_sibling
-----------------------------------------
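A producer or validator could use this list as a lookup table to flag properties that do not apply
to a class. The sketch below copies only three of the classes from the list for brevity; the
`APPLICABLE` dictionary and the `inapplicable` helper are hypothetical names.

```python
# A small excerpt of the list above; a full table would cover every class.
APPLICABLE = {
    "CLASS_inline_sub": {"PT_end_pc", "PT_sibling", "PT_spec", "PT_start_pc"},
    "CLASS_label": {"PT_name", "PT_start_pc", "PT_start_scope", "PT_sibling"},
    "CLASS_lex_block": {"PT_end_pc", "PT_name", "PT_sibling", "PT_start_pc"},
}

def inapplicable(class_name: str, properties):
    """Return the properties (sorted) that are not applicable to the given class."""
    allowed = APPLICABLE[class_name]
    return sorted(set(properties) - allowed)

# A DDE may omit applicable properties, but it should not carry inapplicable ones.
assert inapplicable("CLASS_lex_block", ["PT_name"]) == []
assert inapplicable("CLASS_label", ["PT_name", "PT_end_pc"]) == ["PT_end_pc"]
```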
@particle

the definition of a string, as found on line 70, is dangerous. that is:

  • string - a null-terminated series of zero or more bytes

while it has long been standard practice due to C, this design decision is now considered to be a mistake. anyone who's ever experienced damage due to buffer overruns will tell you so.

please consider changing the definition of a string to include a length field and drop the terminator.

@soh-cah-toa
Author

@particle

I'm not averse to changing the string representation but is it really necessary? I don't think it's possible to cause a buffer overflow since the debug segment isn't actual executable code. Considering this, does using a length field really provide any additional security?

@cotto

cotto commented Aug 20, 2011

How much thought did you put into adapting this to Parrot and pbc? There's a lot in there that makes perfect sense for a C-like language but doesn't really have an analog in Parrot. There are many examples, but a subset of what jumps out is typedefs (which don't exist in PIR or pbc), arrays (which are just a PMC) and fundamental types (which only have meaning in NCI). My criticism isn't about those three aspects of the proposal, but primarily that it seems like you haven't put enough thought in adapting only the relevant bits of the DWARF spec. I greatly appreciate that you waded through the spec in the first place and look forward to helping you further refine this proposal. Modeling debug segments on a widely-used design is a good plan, but only if we make sure to adapt it carefully.

The general strategy I'd use would be to try to figure out the motivation for the various parts of DWARF and see how those could be applied to Parrot. This will obviously be more work, but it'll also mean that we'll have a better understanding of why DWARF does what it does and how we can translate its solutions to parrot.

Additionally, my brain started to get sore before I got 1/3 through the proposal. I'd much prefer that you present a very general outline that's easy to understand and comment on, then once people are happy to expand that out to something more exhaustive. Laziness as a virtue applies to hackers too.

@soh-cah-toa
Author

@cotto

That's exactly where I'm at right now. I need to figure out how to Parrot-ize DWARF. I'm going to need a bit of help from some of the Parrot veterans who know those kind of things without having to even think about it. Of course, any suggestions you may have are always welcome.

I also agree that writing a smaller outline would be beneficial. I can get started on that soon.

All are very valid points and thanks. :)
