As of 3.6.0, the debug segment in Parrot bytecode is severely lacking to say the least, | |
containing merely a line-number-to-opcode mapping that is very unreliable. What we need is a
standardized debugging data format. One of the challenges in this is how to describe the
relationship between the generated bytecode and the original source with enough detail so that a | |
debugger can provide the user with detailed information. Additionally, this description must be | |
short and sweet so that it doesn't eat up a lot of space or take too much processing time to | |
interpret. | |
There already exist several debugging formats: COFF, stabs, DWARF, etc. However, these are
generally used in object files and executables and not virtual machine bytecode (with the | |
exception of LLVM since it is a more low-level virtual machine). Despite this, I think using | |
these as a model for designing our own debugging format would prove to be invaluable. Considering | |
the popularity of the DWARF format, I will be using it as a general guideline. | |
A majority of modern high-level programming languages are block structured. That is, each entity | |
- whether it be a function or class - is contained within another entity. This creates a lexical | |
scope where symbols are only known within the scope in which they were defined. Therefore, | |
the best approach for a debug format is to use a model that is also block structured. This makes | |
it much easier to describe the static structure of a source file, show the members of a | |
structure or class, etc. Each description entity (excluding the root which describes the source | |
file) is contained within a parent entity. It may also contain child entries and siblings. This | |
design creates a tree-like structure similar to the abstract syntax tree created by the | |
compiler. In the interest of keeping this concise, only the minimum amount of information that | |
is needed to describe a program object is provided. Additionally, this makes the format | |
extensible enough to describe nearly any procedural or object-oriented language. A debugger like | |
HBDB could recognize or ignore certain extensions created by various HLL's. | |
For now, I'd like to call the format PODDS: Parrot Opcode Debug Data Serialization format. This | |
specification is still an early draft so the name (among other things) is certainly subject to
change. | |
The full PODDS specification is quite exhaustive. If you aren't interested in all the gory | |
details, please see "Introduction to PODDS" at https://gist.github.com/1180094. | |
NOTE: | |
I cannot take full credit for the design of this debug data format. This specification is | |
essentially just a watered-down version of DWARF 1.1 with a few name changes here and there that I | |
think are better. Since full DWARF compliance would definitely not be possible for Parrot, I've | |
pretty much just picked out the bits and pieces that I think are Parrot-able. Each section listed | |
below roughly corresponds with a section in the DWARF 1.1 specification so I will list the page | |
numbers that it is taken from. The DWARF 1.1 specification can be found at | |
<http://dwarfstd.org/doc/dwarf_1_1_0.pdf>. | |
== DATA DESCRIPTION ENTITY (p. 5) == | |
The most basic entity in PODDS is called a "Data Description Entity" or DDE. A DDE consists of a | |
"class" that indicates what it describes and a list of "properties" that further describe the | |
specific characteristics of the entity. Excluding the topmost DDE, a DDE will always be owned by | |
a parent DDE and may or may not have any child or sibling DDE's. | |
Examples of class names: | |
CLASS_array_type | |
CLASS_class_type | |
CLASS_compile_unit | |
CLASS_enum_type | |
CLASS_global_sub | |
CLASS_global_var | |
CLASS_inline_sub | |
CLASS_label | |
CLASS_lex_block | |
CLASS_local_var | |
CLASS_member | |
CLASS_module | |
CLASS_padding | |
CLASS_param | |
CLASS_ptr_type | |
CLASS_ref_type | |
CLASS_src_file | |
CLASS_str_type | |
CLASS_sub | |
== PROPERTIES (p. 5) == | |
Properties always form a name/value pair. A value will always have one of the following forms: | |
* address - points to some location in the program's address space | |
* reference - refers to another DDE in the debug segment | |
* constant - uninterpreted numerical data | |
* block - uninterpreted data | |
* string - a null-terminated series of zero or more bytes | |
Examples of property names: | |
PT_end_pc | |
PT_fund_type | |
PT_inline | |
PT_is_optional | |
PT_lang | |
PT_location | |
PT_program | |
PT_sibling | |
PT_start_pc | |
PT_start_scope | |
PT_str_len | |
PT_user_def_type | |
There is no restriction on the order in which properties appear. To eliminate ambiguity, each | |
property is unique and no more than one property of a given name may appear in a DDE. | |
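
As a rough illustration of the two sections above, a DDE can be modelled in memory as a class
name plus a mapping of property names to values. The sketch below is Python; the `DDE` type and
its `add_property` method are hypothetical names of my own choosing, not part of PODDS. All it
shows is the rule that a property name may appear at most once per DDE.

```python
from dataclasses import dataclass, field


@dataclass
class DDE:
    """Hypothetical in-memory model of a Data Description Entity."""
    dde_class: str                                   # e.g. "CLASS_sub"
    properties: dict = field(default_factory=dict)   # property name -> value

    def add_property(self, name, value):
        # PODDS allows no more than one property of a given name per DDE.
        if name in self.properties:
            raise ValueError(f"duplicate property {name} in {self.dde_class}")
        self.properties[name] = value


sub = DDE("CLASS_sub")
sub.add_property("PT_name", "main")
sub.add_property("PT_start_pc", 0x10)
```

Because the order of properties is unrestricted, a plain mapping is enough here; nothing in the
sketch depends on insertion order.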
== DDE OWNERSHIP (p. 6) == | |
The ownership of DDE entries is represented by their physical ordering and use of the | |
`PT_sibling` property. The value of this property is a reference to another DDE. If the DDE | |
referred to is null, it represents the end of the sibling chain. Except for `CLASS_padding`, all | |
DDE's are required to have the `PT_sibling` property. A DDE is owned by its physical predecessor | |
(called the "parent") unless it is referenced by that physical predecessor with the `PT_sibling` | |
property. You can think of this DDE as the first child of the predecessor. Children derived from | |
a DDE form a chain of siblings. | |
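
The ownership rule can be sketched as a small recursive pass over the DDE's in physical order.
This is only a sketch under two assumptions of mine: each DDE has already been decoded into an
`(offset, label, sibling_offset)` tuple, and every sibling chain is physically terminated by a
null DDE, modelled here as `(offset, None, None)`. The function name `build_tree` and the
offsets in the sample data are made up for the example.

```python
def build_tree(entries, i=0):
    """Parse one sibling chain starting at index i.

    Returns (nodes, index_of_first_entry_after_the_chain).
    """
    nodes = []
    while i < len(entries):
        offset, label, sibling = entries[i]
        i += 1
        if label is None:
            # A null DDE marks the end of the current sibling chain.
            break
        node = {"label": label, "children": []}
        nodes.append(node)
        # If the physical successor is not the sibling, it is the first child.
        if i < len(entries) and entries[i][0] != sibling:
            node["children"], i = build_tree(entries, i)
    return nodes, i


entries = [
    (0,  "compile_unit", 60),   # sibling offset lies past the end of the data
    (10, "sub f",        40),
    (20, "param x",      30),
    (30, "local y",      38),
    (38, None,           None),  # ends the children of "sub f"
    (40, "global g",     50),
    (50, None,           None),  # ends the children of "compile_unit"
]
tree, _ = build_tree(entries)
```

Here `tree` holds a single root ("compile_unit") that owns "sub f" and "global g", and "sub f"
in turn owns "param x" and "local y".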
== LOCATION DESCRIPTORS (p. 7) == | |
DDE's are required to provide a description of how to determine run-time values of program | |
objects such as variables. These "location descriptors" are provided by forming variable length | |
descriptions using a few simple building blocks called "location iotas." They are: | |
* OP_reg(register) - the program object is in the register given by "register" | |
* OP_addr(address) - the program object is in the address given by "address" | |
* OP_const(number) - the program object is constant | |
{{{ TODO Location iota for describing continuations }}} | |
A location descriptor can consist of a single location iota or a series of them. If the location | |
description is a series of iotas, they should be ordered as if they were operators in a postfix | |
expression. If a location descriptor doesn't contain any iotas, it represents a program object | |
that exists in the source code but doesn't exist in the bytecode (possibly due to optimizations | |
such as dead code elimination). When evaluated, the expression represented by a location | |
description evaluates to the run-time address of the value of that program object except in the | |
case of `OP_reg` where it evaluates to the register containing the value of the program object. | |
== TYPE PROPERTIES (p. 9) == | |
Certain properties describe the data type of the entity. There are four basic type properties: | |
fundamental types, user-defined types, modified fundamental types, and modified user-defined | |
types. | |
A fundamental type is any data type defined by the HLL, a.k.a. "built-in types." It is | |
represented by the `PT_fund_type` property. This property contains one of the following constant | |
values: | |
FT_bool | |
FT_char | |
FT_float | |
FT_int | |
FT_label | |
FT_long | |
FT_ptr | |
FT_short | |
FT_sign_* | |
FT_usign_* | |
Additionally, the programmer may define new types such as structures, classes, enums, etc. They | |
are described in their own separate DDE containing the `PT_user_def_type` property which | |
references the DDE for the new type. | |
== MODIFIED TYPES (p. 9) == | |
Some user-defined types are created by applying certain modifiers to other types, both
fundamental and user-defined. The "pointer to" modifier means that the value should be interpreted
as an address, the "reference to" modifier indicates that the value is a C++ reference, the "const"
modifier means that the value is immutable, and so on. Note that these modifiers are mostly seen
in the C world, but other HLL's have similar constructs that use different syntax but have the
same meaning.
Examples of type modifiers: | |
MOD_const | |
MOD_ptr_to | |
MOD_ref_to | |
The type modifiers appear in order as if they were part of a right-associative expression. When
applied to fundamental types, the `PT_mod_fund_type` property contains the value of the | |
contiguous block of bytes where the type is stored. When applied to a user-defined type, the | |
`PT_mod_u_d_type` property also contains the value of the contiguous block of bytes where the | |
type is stored. | |
Consider the following examples in C: | |
const char *ch; | |
MOD_ptr_to MOD_const FT_char | |
int * const i; | |
MOD_const MOD_ptr_to FT_int | |
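
To make the right-associative reading concrete, here is a small Python sketch that turns a
modifier sequence into an English description by reading it left to right. The helper name
`describe` and the `PHRASES` table are hypothetical; the sketch works on symbolic names rather
than encoded bytes.

```python
# Hypothetical mapping from PODDS type modifiers to English phrases.
PHRASES = {
    "MOD_const": "const",
    "MOD_ptr_to": "pointer to",
    "MOD_ref_to": "reference to",
}


def describe(tokens):
    """Describe a modified type given as [modifier, ..., base type]."""
    *modifiers, base = tokens
    words = [PHRASES[m] for m in modifiers]
    words.append(base[len("FT_"):])   # strip the "FT_" prefix from the base type
    return " ".join(words)


print(describe(["MOD_ptr_to", "MOD_const", "FT_char"]))  # pointer to const char
print(describe(["MOD_const", "MOD_ptr_to", "FT_int"]))   # const pointer to int
```

Each modifier applies to everything to its right, which is why the two C declarations above
produce different sequences.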
== ACCESS MODIFIERS (p. 10) == | |
In most object-oriented languages, access level modifiers are used to control access to certain | |
members of a class. Although the semantics vary from language to language, they generally | |
include 'public', 'private', and 'protected'. PODDS uses the `PT_public`, `PT_private`, and | |
`PT_protected` properties for each of these respectively. The value of each of these properties is
a zero-length string consisting of only the null byte. The mere presence of these properties is
enough to describe the accessibility of the DDE. | |
== COMPILATION UNITS (p. 11) ==
It's possible for Parrot bytecode to be derived from one or more sources or "compilation units". | |
This may be the case in C when all the #include directives are expanded or in Perl when all the | |
'use'd modules have been imported. Each compilation unit DDE owns the DDE's that describe the | |
declarations in the corresponding compilation unit. Each compilation unit will be described using | |
the `CLASS_compile_unit` class. | |
A `CLASS_compile_unit` DDE may have any of the following properties:
* PT_sibling | |
A reference to the DDE that appears right after the last DDE for that compilation unit. Note | |
that a DDE may not actually exist at the specified offset. This is the case when that offset | |
is greater than or equal to the size of the debug segment. | |
* PT_start_pc | |
The address of the first opcode generated for that compilation unit. This address may be | |
beyond the last valid opcode. | |
* PT_end_pc | |
The address of the first location past the last opcode generated for that compilation unit. | |
This address may be beyond the last valid opcode. | |
* PT_name | |
A null-terminated string containing either the full or relative path of the source file from | |
which the compilation unit was derived. | |
* PT_lang | |
A constant value (preferably an unsigned integer) representing the source language of the | |
compilation unit. | |
* PT_stmt_list | |
A reference to a record in the line number table (mentioned later). | |
* PT_comp_dir | |
A null-terminated string containing the current working directory of the command that | |
generated the compilation unit. | |
* PT_compiler | |
A null-terminated string containing information about the compiler that generated the | |
compilation unit. The actual contents of the string are left up to the compiler though it's
recommended that it begin with the name of the vendor or at least some other identifiable | |
string that will avoid confusion with other compilers. | |
== SUBROUTINES (p. 12) == | |
DDE's describing subroutines must include any one of the following three classes: | |
CLASS_global_sub | |
CLASS_sub | |
CLASS_inline_sub | |
A subroutine DDE uses the `PT_name` property that contains a null-terminated string representing | |
its name as it appears in the original source file. | |
At this point it is important to mention that all names used to describe program objects (in this | |
case, subroutines) should represent the object's name AS IT APPEARS IN THE ORIGINAL SOURCE FILE. | |
This is because some compilers use name mangling to encode extra information in the name like in | |
the case of function or operator overloading. HLL's that do use name mangling techniques should | |
always use the unmangled name in the `PT_name` property of subroutines and other program objects.
In object-oriented HLL's, the subroutine DDE for a method a.k.a. "member function" should use | |
the `PT_member` property. The value of this property is a reference to the type definition of | |
the class. The presence of this property makes identifier resolution through methods possible. | |
In HLL's that distinguish between ordinary subroutines and "main subroutines," the DDE for such a | |
subroutine should contain the `PT_main` property, the value of which is a zero-length string
consisting of only the null byte. The mere presence of this property is enough to identify the | |
subroutine as the "entry point" of the program. | |
If a subroutine returns a value, then its DDE should have one of the four type properties: | |
fundamental type, modified fundamental type, user-defined type, and modified user-defined type. | |
Subroutines (formally known as procedures) that do not return a value should not have a type | |
property. | |
In HLL's where a subroutine must be pre-declared with a prototype if its name is used before it
is defined, the subroutine's DDE should contain the `PT_proto` property, the value of which is a
zero-length string consisting of only the null byte. The mere presence of this property is enough
to indicate that the subroutine has a prototype.
The `PT_start_pc` and `PT_end_pc` properties of a subroutine represent the address of the first | |
and last opcode respectively generated for the subroutine. | |
The declarations in a subroutine are described by the entries owned by the subroutine's DDE. | |
DDE's that describe the parameters of a subroutine will appear in the same order as they do in | |
the original source file. | |
There are no limitations on the order of declaration DDE's (that don't represent its parameters)
that are children of a subroutine DDE. | |
For HLL's that support subroutines with indefinite arity (variadic functions in C), the | |
unspecified parameters in a variadic parameter list are described with the `CLASS_unspec_param` | |
class. | |
A subroutine may include the `PT_ret_cont` property. The value of this property is a location | |
descriptor representing the continuation where the subroutine will return to. | |
Inline subroutines are described in two portions: the declaration instance and each inline | |
instance. The declaration instance, if any, is described with a regular subroutine DDE. These | |
entries must include the `PT_inline` property whose value is a zero-length string consisting of | |
only the null byte. The mere presence of this property is enough to indicate that the subroutine | |
is inlined. If an HLL does not require an "out-of-line" declaration, then the subroutine DDE | |
will not have any `PT_start_pc` and `PT_end_pc` properties. Furthermore, if such a DDE exists | |
but also has children describing its parameters (such as pointy blocks in Perl 6), its children | |
will not have location descriptors. | |
Each DDE for inline subroutines must use the `CLASS_inline_sub` class. It will also contain the | |
`PT_spec` property which is a reference to the DDE describing the declaration or "specification." | |
Each DDE for an inline subroutine owns the DDE entries describing its parameters (if any) and its | |
local/lexical variables. | |
== LEXICAL BLOCKS (p. 14) == | |
A lexical block is described with a DDE using the `CLASS_lex_block` class. This entity has | |
both a `PT_start_pc` and `PT_end_pc` property that represent the address of the first and last | |
opcode respectively generated for the lexical block. | |
The name of the lexical block (if any) is described using the `PT_name` property. | |
The declarations in a lexical block are described by the entries owned by the block's DDE. There | |
exists one DDE for each declaration within the lexical block. | |
== LABELS (p. 15) == | |
A label is described with a DDE using the `CLASS_label` class. The entity for a label is | |
owned by the DDE describing the scope in which the label can be referenced. | |
A label DDE has a `PT_start_pc` property representing the address of the opcode generated as the | |
first statement that immediately follows the label. | |
The name of the label is described using the `PT_name` property. | |
== VARIABLES (p. 15) == | |
All variables whether they're global, local, or parameters are described with a DDE containing | |
the `CLASS_global_var`, `CLASS_local_var`, and `CLASS_param` classes respectively. The entity may | |
contain the following properties: | |
* PT_name | |
A null-terminated string representing the variable name as it appears in the original source | |
file. | |
* PT_location | |
The location descriptor of the variable. If this property has a null value or is not used at | |
all, it is assumed that the variable exists only in the source code but doesn't exist in the | |
bytecode. | |
* Any one of the four type properties. | |
* PT_member | |
A reference to the structure or class type of which the variable is a member, if any.
* PT_opt_param | |
If the variable is an optional parameter, this property is a zero-length string consisting | |
of only the null byte. The mere presence of this property is enough to indicate that the | |
variable is an optional parameter. | |
* PT_def_val | |
If the variable is a parameter that has a default value, this property may be any constant | |
value (including strings) that appropriately represents the actual default value of the | |
parameter. | |
* PT_const_val | |
If the variable is a constant value, this property may be any constant value (including | |
strings) that appropriately represents the variable's actual value. | |
* PT_start_scope | |
If the variable's scope begins after the value of `PT_start_pc` for the closest enclosing | |
scope of the variable, this property represents the offset of the beginning of the scope for | |
the variable from the `PT_start_pc` value of the DDE that defines its scope. This is used by | |
HLL's that allow the scope of a variable to begin in the middle of a lexical block or allow | |
one declaration to change the scope of a subsequent declaration. | |
== TYPEDEFS (p. 17) == | |
Any type defined via a typedef is described with a DDE using the `CLASS_typedef` class. This | |
entity has a `PT_name` property representing its name. The entity also contains one of the four | |
type properties. | |
== POINTERS (p. 17) == | |
Pointers and references are described with a DDE containing the `CLASS_ptr_type` and | |
`CLASS_ref_type` classes respectively. If the pointer or reference is named, then it will contain | |
the `PT_name` property representing its name. The entity also contains one of the four type | |
properties which describes the type pointed to or referenced. | |
== ARRAYS (p. 17) == | |
Arrays are described with a DDE using the `CLASS_array_type` class. If the array is named, | |
then it will contain the `PT_name` property representing its name. An array DDE describing a | |
multidimensional array may include the `PT_ordering` property whose value is a constant that | |
describes the ordering (row-major or column-major) of the array's elements. If the `PT_ordering`
property exists, its value must be either `ORD_col_major` or `ORD_row_major`. If it doesn't exist,
the default ordering of the language (given in the `PT_lang` property) is assumed.
The subscripts and element data type of the array are described with the `PT_subscr_data` | |
property. The value of this property is stored in the contiguous block of memory containing the | |
array. A "data item" describes each dimension and element type of the array. The data items that | |
describe the dimensions are ordered by the appearance of the dimensions in the original source | |
file. The last data item in the `PT_subscr_data` property describes the element type. | |
A data item that describes a dimension is split into four parts in the following order: | |
1. A format specifier that describes the information following it. | |
2. The subscript index type which may be either a fundamental or user-defined type. | |
3. Information that describes the lower bound of the dimension. This may take the form of either | |
a constant value or location descriptor. If it's a location descriptor, its value is the | |
address of the lowest element of the dimension. If the lower bound is not specified, it is | |
described with a zero-length block. | |
4. Information that describes the upper bound of the dimension. Similar to the lower bound, its | |
value may be either a constant value or location descriptor. If the upper bound is not
specified, it is described with a zero-length block. | |
The first data item for a dimension consists of a format specifier followed by one of the four
type properties. This determines how the data items following it should be interpreted. This is | |
much more efficient than using specific properties to describe the type of the subscript index | |
and upper/lower bounds. There are nine possible format specifiers: | |
* FMT_ft_c_c | |
A fundamental type followed by a constant followed by a constant. | |
* FMT_ft_c_d | |
A fundamental type followed by a constant followed by a location descriptor. | |
* FMT_ft_d_c | |
A fundamental type followed by a location descriptor followed by a constant. | |
* FMT_ft_d_d | |
A fundamental type followed by a location descriptor followed by a location descriptor. | |
* FMT_ut_c_c | |
A reference to a user-defined type followed by a constant followed by a constant. | |
* FMT_ut_c_d | |
A reference to a user-defined type followed by a constant followed by a location descriptor. | |
* FMT_ut_d_c | |
A reference to a user-defined type followed by a location descriptor followed by a constant. | |
* FMT_ut_d_d | |
A reference to a user-defined type followed by a location descriptor followed by a location | |
descriptor. | |
* FMT_et | |
A type property describing the element type. | |
If it is possible to determine the size of the array at compile time, the array DDE may use the | |
`PT_static_size` property. The value of this property is a constant representing the total size | |
in bytes. | |
== CLASSES AND STRUCTURES (p. 19) == | |
Classes and structures are described with DDE's using the `CLASS_class_type` and | |
`CLASS_struct_type` classes respectively. If the class or structure is named, then it will | |
contain the `PT_name` property representing its name. If it is possible to determine the size of | |
the class or structure at compile time, its DDE may use the `PT_static_size` property. The value | |
of this property is a constant representing the total size in bytes. | |
The members of a class or structure are described by the DDE's owned by the corresponding | |
entities for the class/structure and appear in the same order as they do in the original source | |
file. | |
If the definition of a member of a class or structure appears outside the class/structure | |
definition, it will have a DDE containing the `PT_member` property which is a reference to the | |
class declaration containing that member. If the definition of a member appears inside the
class/structure definition, it will contain the `PT_location` property describing the location of
that member relative to the base address of the class/structure that most closely encloses it.
A class that inherits from another class owns the DDE that describes the class it inherits from. | |
This is indicated by a DDE using the `CLASS_inherit` class. | |
A DDE for an inherited class has the `PT_user_def_type` property which is a reference to the DDE | |
describing the class from which the parent is derived. It also has a location property describing | |
the location of the members inherited by the class relative to the beginning of the members of | |
the entire class. | |
As described earlier, a DDE may contain one of the three accessibility properties: `PT_public`, | |
`PT_private`, and `PT_protected`. | |
== ENUMERATIONS (p. 23) == | |
An enumeration is described with DDE's using the `CLASS_enum_type` class. If the enumeration is | |
named, then it will contain the `PT_name` property representing its name. An enumeration entity | |
also has a `PT_byte_size` property which is a constant value representing the number of bytes | |
needed for an instance of this enumeration. | |
== DECREASING ACCESS TIME (p. 26) == | |
A symbolic debugger has to access PODDS data very frequently. Therefore, it is very important to | |
consider how to decrease the amount of time needed to read and interpret debug data. This | |
becomes quite difficult when a program object is defined outside the compilation unit where the | |
debuggee is stopped. To find the DDE associated with a program object, a debugger would have to
run an exhaustive search through every DDE at the highest scope in each compilation unit.
This can severely cripple the performance of the debugger.
To combat this problem, a compiler has the option of providing two separate types of tables that | |
provide information about the DDE's owned by a particular compilation unit: the public name table | |
and the public address table. | |
== PUBLIC NAME TABLE (p. 26) == | |
The "public name table" is a subsection of the debug segment consisting of records that contain | |
variable-length entries. Each record describes the names of program objects described by the | |
DDE's that are owned by a single compilation unit. Each record starts with a header that contains | |
three important values: 1) the (non-inclusive) length of the entries for that record, 2) the | |
offset of the compilation unit's DDE from the start of the debug segment, and 3) the size in | |
bytes of the DDE describing that particular compilation unit. Following the header is a variable | |
number of offset/name pairs. Each pair contains the offset from the start of the compilation unit | |
entry that corresponds with the current record for the DDE for the given program object, followed | |
by a string representing the object's name as found in its `PT_name` property. Each record is | |
terminated by a null pair. In this way, a debugger can rapidly determine which compilation unit | |
to search in order to find the DDE for a program object with a given name. | |
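
A lookup through the public name table might look roughly like the following Python sketch. It
assumes the table has already been decoded into `(cu_offset, pairs)` records, where `pairs` is a
list of `(offset_within_cu, name)` tuples and the terminating null pair has been dropped. The
function name `find_by_name` and the sample offsets are invented for the example.

```python
def find_by_name(records, name):
    """Return the offset of the DDE for `name` from the start of the
    debug segment, or None if no record lists that name."""
    for cu_offset, pairs in records:
        for rel_offset, entry_name in pairs:
            if entry_name == name:
                # Pair offsets are relative to the compilation unit's DDE.
                return cu_offset + rel_offset
    return None


records = [
    (0,   [(24, "main"), (96, "helper")]),
    (400, [(24, "parse")]),
]
print(find_by_name(records, "parse"))   # 424
```

The point is that only the small name table is scanned; the DDE's of unrelated compilation units
are never touched.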
== PUBLIC ADDRESS TABLE (p. 27) == | |
The "public address table" is a subsection of the debug segment consisting of records that | |
contain variable-length entries. Each record describes the section of the program's address | |
space that contains the compilation unit. Each record starts with a header that contains two | |
important values: the (non-inclusive) length of the entries for that record and the offset of | |
the compilation unit's DDE from the start of the debug segment. Following the header is a | |
variable number of pairs of "address range descriptors." Each one contains the starting address | |
of the range followed by its length. Each record is terminated by a null pair. In this way, a | |
debugger can rapidly determine which compilation unit to search in order to find the DDE for a | |
program object with a given address. | |
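
The corresponding address lookup can be sketched the same way. Again this assumes the table is
already decoded, this time into `(cu_offset, ranges)` records with `ranges` being a list of
`(start_address, length)` tuples. The name `find_cu_for_address` and the sample numbers are
hypothetical.

```python
def find_cu_for_address(records, address):
    """Return the debug-segment offset of the compilation unit DDE whose
    address ranges contain `address`, or None if there is no match."""
    for cu_offset, ranges in records:
        for start, length in ranges:
            if start <= address < start + length:
                return cu_offset
    return None


records = [
    (0,   [(0x000, 0x40), (0x100, 0x20)]),   # one unit may own several ranges
    (400, [(0x040, 0xC0)]),
]
print(find_cu_for_address(records, 0x110))   # 0
```

Each range is treated as half-open, so an address equal to `start + length` belongs to whatever
range begins there, not to the one that just ended.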
== LINE NUMBER TABLE (p. 27) == | |
Associating source-level line numbers with their respective generated opcodes makes it possible
for a debugger user to specify addresses in relation to source statements. This makes single
stepping much easier.
Each compilation unit DDE in the debug segment references a corresponding record in the line | |
number table that describes its respective source statement. The first record in the table | |
includes the length of the table in bytes and is followed by the address of the first opcode | |
generated for the compilation unit. The rest of the table consists of a list of source statement | |
records. A source statement record consists of three parts: 1) a line number, 2) a position | |
within the source line, and 3) an opcode address. The line numbers are ordered starting with 1 | |
from the beginning of the compilation unit. | |
The compiler has two ways to represent the position within the source line. It can either use the | |
number of characters from the beginning of the line to the beginning of the source statement or | |
use the special value `SRC_NO_POS` to indicate that the record refers to the entire line. This | |
feature is necessary for HLL's that allow multiple statements in a single line. | |
The address in each record describes the address of the first opcode generated for that source | |
statement minus the address of the first opcode generated for the compilation unit. That is, it | |
represents the offset into the compilation unit. | |
Some HLL's allow statements to extend over multiple lines. The record in such a case will refer | |
to the line containing the start of that particular statement. | |
There is no limitation on the order in which the records appear. They do not necessarily represent | |
the exact order in which the statements appear in the original source file. Additionally, it is | |
not required to have a record in the line number table for every single source statement in the | |
original source file. | |
To terminate the line number table, PODDS uses a record whose line number is 0 and whose address | |
describes the first opcode of the next compilation unit. This allows the debugger to understand | |
which opcodes are associated with the last statement in a compilation unit; a useful feature for | |
stepping out of functions. | |
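
As a sketch of how a debugger might use this table, the Python below maps an opcode address back
to a source line. It assumes the records have been decoded into `(line, position, offset)` tuples
and makes no assumption about their order, since the table does not guarantee one. The value
given to `SRC_NO_POS` is a placeholder of mine (this draft does not assign one), and
`line_for_pc` is a hypothetical name.

```python
SRC_NO_POS = 0xFFFF   # placeholder value; not defined by this draft


def line_for_pc(base, records, pc):
    """Return the source line whose statement covers address `pc`.

    `base` is the address of the first opcode of the compilation unit.
    Returns None if `pc` lies outside the unit.
    """
    offset = pc - base
    best = None
    for line, position, addr in records:
        # Keep the record with the greatest offset that is still <= pc.
        if addr <= offset and (best is None or addr > best[2]):
            best = (line, position, addr)
    if best is None or best[0] == 0:
        # Before the first statement, or at/after the line-0 terminator.
        return None
    return best[0]


records = [
    (1, SRC_NO_POS, 0),
    (3, SRC_NO_POS, 8),
    (2, SRC_NO_POS, 4),    # records need not be sorted
    (0, SRC_NO_POS, 12),   # terminator: first opcode of the next unit
]
print(line_for_pc(100, records, 105))   # 2
```

The line-0 terminator is what lets the last real statement (line 3 here) have a well-defined
end: any address at or past offset 12 is reported as outside the unit.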
== EXTENSIONS (p. 29) == | |
Special labels are reserved for compiler-specific extensions. To denote the start and end of a | |
range used for such extensions, the labels will use the normal prefix (CLASS, PT, FT, OP, MOD,
LANG, etc.) followed by the `_start_user` or `_end_user` suffix. This prevents extensions from | |
polluting the PODDS namespace. | |
Furthermore, compiler-specific extensions should take the form `prefix_compiler_version` where | |
`compiler` is the name of the compiler and `version` is the extension version (not the compiler | |
version). | |
== ERROR VALUES (p. 29) == | |
When encoded (described in the next section), the value 0 is reserved to represent some unknown | |
value or error in the property names or forms, fundamental types, type modifiers, location iotas, | |
etc. | |
== ENCODING (p. 29) == | |
{{{ XXX The values in this section are the same ones used by DWARF. I really don't see a reason | |
to make them anything different considering this system is proven to work. }}} | |
Each DDE consists of a 4-byte (inclusive) length, a 2-byte class, and a series of properties. | |
The 4-byte length is an unsigned integer that represents the total number of bytes used by the | |
DDE. The 2-byte class value determines the DDE's "classification" and is encoded as follows: | |
CLASS_padding = 0x0000 | |
CLASS_array_type = 0x0001 | |
CLASS_class_type = 0x0002 | |
CLASS_enum_type = 0x0003
CLASS_param = 0x0004 | |
CLASS_global_sub = 0x0005 | |
CLASS_global_var = 0x0006 | |
CLASS_label = 0x0007 | |
CLASS_lex_block = 0x0008 | |
CLASS_local_var = 0x0009 | |
CLASS_member = 0x000a | |
CLASS_ptr_type = 0x000b | |
CLASS_ref_type = 0x000c | |
CLASS_compile_unit = 0x000d | |
CLASS_src_file = 0x000e | |
CLASS_str_type = 0x000f | |
CLASS_struct_type = 0x0010 | |
CLASS_sub = 0x0011 | |
CLASS_sub_type = 0x0012 | |
CLASS_typedef = 0x0013 | |
CLASS_unspec_params = 0x0014 | |
CLASS_inherit = 0x0015 | |
CLASS_inline_sub = 0x0016 | |
CLASS_start_user = 0x4080 | |
CLASS_end_user = 0xffff | |
`CLASS_padding` DDEs are distinct from null DDEs in that `CLASS_padding` entities have a 4-byte
size (greater than or equal to 8) and a 2-byte class that's followed by the appropriate number
of padding bytes. Null entities, on the other hand, contain between 1 and 7 zero bytes.
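To make the layout concrete, here is a minimal sketch of packing a DDE header, a padding DDE, and a null DDE. It assumes little-endian byte order and zero-valued padding bytes, neither of which this proposal specifies, and the helper names (`pack_dde`, `pack_padding`, `pack_null`) are made up for illustration.

```python
import struct

CLASS_padding = 0x0000
CLASS_label = 0x0007

# Assumption: little-endian. 4-byte inclusive length, then 2-byte class.
HEADER = struct.Struct("<IH")


def pack_dde(class_code: int, properties: bytes = b"") -> bytes:
    """Pack one DDE; the length is inclusive, so it counts its own 4 bytes."""
    length = HEADER.size + len(properties)
    return HEADER.pack(length, class_code) + properties


def pack_padding(total_size: int) -> bytes:
    """Pack a CLASS_padding DDE that occupies exactly total_size bytes."""
    if total_size < 8:
        raise ValueError("padding DDEs are at least 8 bytes; use a null DDE")
    # Assumption: the padding bytes themselves are zero.
    return pack_dde(CLASS_padding, b"\x00" * (total_size - HEADER.size))


def pack_null(total_size: int) -> bytes:
    """Pack a null DDE: between 1 and 7 zero bytes, with no header at all."""
    if not 1 <= total_size <= 7:
        raise ValueError("null DDEs contain between 1 and 7 bytes")
    return b"\x00" * total_size
```

A reader can tell the two apart because anything shorter than 8 bytes cannot hold a header plus padding, so it must be a null entity.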
== PROPERTY TYPES (p. 30) == | |
Each property is encoded as a 2-byte name followed by the appropriate value. The
value's form is encoded into the property's name using a bitmask. Possible forms include:
* address | |
Represents an address as `FORM_ADDR`. | |
* reference | |
A 4-byte value represented as `FORM_REF`. Its value is the offset (in bytes) relative to the | |
start of the debug segment. | |
* constant | |
Constants may take any of three forms: a 2-byte value `FORM_DATA2`, a 4-byte value
`FORM_DATA4`, or an 8-byte value `FORM_DATA8`.
* block | |
Blocks are represented in two ways. The first contains a 2-byte length that's followed by 0 - | |
65,535 contiguous bytes `FORM_BLOCK2`. The second contains a 4-byte length that's followed by | |
0 - 4,294,967,295 contiguous bytes `FORM_BLOCK4`. The bytes may contain any combination of | |
addresses, references, or data types. | |
* string | |
A null-terminated string `FORM_STRING`. | |
The forms encoded into the property name have the following values: | |
FORM_ADDR = 0x1
FORM_REF = 0x2
FORM_BLOCK2 = 0x3
FORM_BLOCK4 = 0x4
FORM_DATA2 = 0x5
FORM_DATA4 = 0x6
FORM_DATA8 = 0x7
FORM_STRING = 0x8
Properties are encoded as follows: | |
PT_sibling = 0x0010 | FORM_REF | |
PT_location = 0x0020 | FORM_BLOCK2 | |
PT_name = 0x0030 | FORM_STRING | |
PT_fund_type = 0x0050 | FORM_DATA2 | |
PT_mod_fund_type = 0x0060 | FORM_BLOCK2 | |
PT_user_def_type = 0x0070 | FORM_REF | |
PT_mod_u_d_type = 0x0080 | FORM_BLOCK2 | |
PT_subscr_data = 0x00a0 | FORM_BLOCK2 | |
PT_byte_size = 0x00b0 | FORM_DATA4 | |
PT_stmt_list = 0x0100 | FORM_DATA4 | |
PT_start_pc = 0x0110 | FORM_ADDR | |
PT_end_pc = 0x0120 | FORM_ADDR | |
PT_lang = 0x0130 | FORM_DATA4 | |
PT_member = 0x0140 | FORM_REF | |
PT_str_len = 0x0190 | FORM_BLOCK2 | |
PT_comp_dir = 0x01b0 | FORM_STRING | |
PT_const_val = 0x01c0 | FORM_STRING | |
PT_const_val = 0x01c0 | FORM_DATA2 | |
PT_const_val = 0x01c0 | FORM_DATA4 | |
PT_const_val = 0x01c0 | FORM_DATA8 | |
PT_const_val = 0x01c0 | FORM_BLOCK2 | |
PT_const_val = 0x01c0 | FORM_BLOCK4 | |
PT_def_val = 0x01e0 | FORM_ADDR | |
PT_def_val = 0x01e0 | FORM_DATA2 | |
PT_def_val = 0x01e0 | FORM_DATA8 | |
PT_def_val = 0x01e0 | FORM_STRING | |
PT_inline = 0x0200 | FORM_STRING | |
PT_is_opt = 0x0210 | FORM_STRING | |
PT_low_bound = 0x0220 | FORM_REF | |
PT_low_bound = 0x0220 | FORM_DATA2 | |
PT_low_bound = 0x0220 | FORM_DATA4 | |
PT_low_bound = 0x0220 | FORM_DATA8 | |
PT_program = 0x0230 | FORM_STRING | |
PT_private = 0x0240 | FORM_STRING | |
PT_compiler = 0x0250 | FORM_STRING | |
PT_protected = 0x0260 | FORM_STRING | |
PT_proto = 0x0270 | FORM_STRING | |
PT_public = 0x0280 | FORM_STRING | |
PT_ret_cont = 0x02a0 | FORM_BLOCK2 | |
PT_spec = 0x02b0 | FORM_REF | |
PT_start_scope = 0x02c0 | FORM_DATA4 | |
PT_up_bound = 0x02f0 | FORM_REF | |
PT_up_bound = 0x02f0 | FORM_DATA2 | |
PT_up_bound = 0x02f0 | FORM_DATA4 | |
PT_up_bound = 0x02f0 | FORM_DATA8 | |
PT_start_user = 0x2000 | |
PT_end_user = 0x3ff0 | |
{{{ XXX These values are intentionally out of order since the bitmask would otherwise conflict
with the form "flag" being set in each }}}
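Since every property value above is a multiple of 0x10 and every form fits in 0x1 through 0x8, a reader can recover both halves of a 2-byte property name with a mask. The sketch below assumes the form lives in the low four bits, which is my reading of the tables rather than something stated outright. `split_property_name` is a hypothetical helper.

```python
FORM_MASK = 0x000F  # assumption: the form occupies the low four bits

FORMS = {
    0x1: "FORM_ADDR",
    0x2: "FORM_REF",
    0x3: "FORM_BLOCK2",
    0x4: "FORM_BLOCK4",
    0x5: "FORM_DATA2",
    0x6: "FORM_DATA4",
    0x7: "FORM_DATA8",
    0x8: "FORM_STRING",
}


def split_property_name(name: int) -> tuple[int, str]:
    """Split an encoded 2-byte property name into (property, form name)."""
    form = name & FORM_MASK
    if form not in FORMS:
        # 0 is reserved for unknown or erroneous values; 0x9-0xf are unassigned.
        raise ValueError(f"unknown form {form:#x} in property name {name:#06x}")
    return name & 0xFFF0, FORMS[form]


# PT_name is 0x0030 | FORM_STRING, so its encoded name is 0x0038.
prop, form = split_property_name(0x0030 | 0x8)
```

This is also why properties such as `PT_const_val` can be listed several times: the property bits stay the same and only the form bits change.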
== LOCATION IOTAS (p. 32) == | |
Each location iota has a 1-byte identification code which is interpreted to mean `reg(register)`, | |
`addr(address)`, or `const(number)`. For an iota that takes a number, the identifying byte is | |
followed by a 4-byte value. For an iota that takes an address, the value is of a size that can
appropriately represent any address. | |
A location descriptor is the value of a location property and is stored as a block with a 2-byte length (`FORM_BLOCK2`).
Location iotas are encoded as follows: | |
OP_reg = 0x01 | |
OP_addr = 0x02 | |
OP_const = 0x03 | |
OP_start_user = 0xe0 | |
OP_end_user = 0xff | |
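As an illustration, a location descriptor could be assembled as below. This is only a sketch: it assumes little-endian byte order, a 4-byte address size, and that a register number counts as a "number" and therefore takes a 4-byte operand. None of those are fixed by this proposal, and the helper names are invented.

```python
import struct

OP_reg = 0x01
OP_addr = 0x02
OP_const = 0x03


def iota_reg(register: int) -> bytes:
    """reg(register): 1-byte code plus an (assumed) 4-byte register number."""
    return struct.pack("<BI", OP_reg, register)


def iota_const(number: int) -> bytes:
    """const(number): 1-byte code plus a 4-byte value."""
    return struct.pack("<BI", OP_const, number)


def iota_addr(address: int, addr_size: int = 4) -> bytes:
    """addr(address): 1-byte code plus an address of the assumed size."""
    return bytes([OP_addr]) + address.to_bytes(addr_size, "little")


def location_descriptor(*iotas: bytes) -> bytes:
    """Wrap the iotas in a block with a 2-byte length prefix."""
    body = b"".join(iotas)
    if len(body) > 0xFFFF:
        raise ValueError("location descriptor does not fit in a 2-byte length")
    return struct.pack("<H", len(body)) + body
```

For example, `location_descriptor(iota_reg(3))` yields a 2-byte length of 5 followed by the 5-byte iota.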
== FUNDAMENTAL TYPES (p. 33) == | |
For values falling in the range from `FT_start_user` through `FT_end_user`, the low order byte | |
of the type's code contains the byte-size of program objects that are the specified type only if | |
the size is constant, otherwise the low order byte is 0. | |
Fundamental types are encoded as follows: | |
FT_char = 0x0001 | |
FT_sign_char = 0x0002 | |
FT_usign_char = 0x0003 | |
FT_short = 0x0004 | |
FT_sign_short = 0x0005 | |
FT_usign_short = 0x0006 | |
FT_int = 0x0007 | |
FT_sign_int = 0x0008 | |
FT_usign_int = 0x0009 | |
FT_long = 0x000a | |
FT_sign_long = 0x000b | |
FT_usign_long = 0x000c | |
FT_ptr = 0x000d | |
FT_float = 0x000e | |
FT_dbl_float = 0x000f | |
FT_ext_float = 0x0010 | |
FT_complex = 0x0011 | |
FT_dbl_complex = 0x0012 | |
FT_void = 0x0014 | |
FT_bool = 0x0015 | |
FT_ext_complex = 0x0016 | |
FT_label = 0x0017 | |
FT_start_user = 0x8000 | |
FT_end_user = 0xffff | |
== TYPE MODIFIERS (p. 34) == | |
Type modifiers are encoded as 1-byte values as follows: | |
MOD_ptr_to = 0x01 | |
MOD_ref_to = 0x02 | |
MOD_const = 0x03 | |
MOD_start_user = 0x80 | |
MOD_end_user = 0xff | |
== SOURCE LANGUAGES (p. 34) == | |
Source languages are encoded as 4-byte constant values. To include a type for every known
dynamic language would be exhausting and pointless. For now, only a short list of the most
commonly used and developed HLLs is necessary. Languages can be added in the future as the need
arises.
LANG_perl6 = 0x00000001 | |
LANG_nqp = 0x00000002 | |
LANG_winxed = 0x00000003 | |
LANG_partcl = 0x00000004 | |
LANG_lua = 0x00000005 | |
LANG_cardinal = 0x00000006 | |
LANG_start_user = 0x00008000 | |
LANG_end_user = 0x0000ffff | |
== ARRAY ORDERING (p. 35) == | |
The order properties of arrays are encoded as follows: | |
ORD_row_major = 0x0 | |
ORD_col_major = 0x1 | |
== ARRAY SUBSCRIPTS (p. 35) == | |
The entire array subscript entry must be less than 65,536 bytes. This may seem overly large but
it allows for future implementations of C on Parrot. Such an implementation would allow at most
5,957 dimensions in an array, since it would require 11 bytes per dimension plus the 5 bytes (at
least) for the element type description.
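The 5,957 figure follows directly from the numbers above. A quick check of the arithmetic, taking the largest permitted entry as 65,535 bytes:

```python
MAX_ENTRY_BYTES = 65_535     # entry must be less than 65,536 bytes
BYTES_PER_DIMENSION = 11     # per-dimension cost quoted above
ELEMENT_TYPE_BYTES = 5       # minimum size of the element type description

# Whatever is left after the element type is shared among the dimensions.
max_dimensions = (MAX_ENTRY_BYTES - ELEMENT_TYPE_BYTES) // BYTES_PER_DIMENSION
unused_bytes = (MAX_ENTRY_BYTES - ELEMENT_TYPE_BYTES) % BYTES_PER_DIMENSION

print(max_dimensions, unused_bytes)
```

That is, 65,530 bytes divided by 11 gives 5,957 whole dimensions with 3 bytes to spare.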
Array subscript data contains six components that are encoded as follows: | |
* Format Specifier | |
1-byte constant. | |
* Fundamental Type | |
2-byte constant. | |
* User-Defined Type | |
4-byte reference. | |
* Subscript Bound Index | |
4-byte constant. | |
* Subscript Bound Location | |
2-byte data block. | |
* Element Type | |
Any of the four type properties preceded by the corresponding 2-byte CLASS_* class. | |
The format specifiers in the array subscript entry are encoded as follows: | |
FMT_ft_c_c = 0x1 | |
FMT_ft_c_d = 0x2 | |
FMT_ft_d_c = 0x3 | |
FMT_ft_d_d = 0x4 | |
FMT_ut_c_c = 0x5 | |
FMT_ut_c_d = 0x6 | |
FMT_ut_d_c = 0x7 | |
FMT_ut_d_d = 0x8 | |
== PUBLIC NAME TABLE (p. 36) == | |
Each record in the public name table starts with a header containing three values: a 4-byte
(non-inclusive) value representing the length of the set of entries for the compilation unit, a
4-byte offset of the compilation unit's DDE from the start of the debug segment, and a 4-byte
value containing the byte-size of the DDE describing that particular compilation unit. The
header is followed by a series of pairs. Each pair contains a 4-byte offset followed by a
null-terminated string. Each set is terminated by a 4-byte value of 0. Four bytes might seem
excessive just to represent 0, but the length is consistent with the alignment.
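A reader for one set in the public name table might look like the sketch below. It assumes little-endian byte order and UTF-8 names, neither of which is specified here. Because it is not spelled out exactly which bytes the non-inclusive length covers, the sketch simply walks the pairs until it reaches the 4-byte zero terminator. `parse_pubnames_set` is a hypothetical name.

```python
import struct


def parse_pubnames_set(buf: bytes, pos: int = 0) -> tuple[dict, int]:
    """Parse one public-name set; return (parsed set, offset just past it)."""
    length, cu_offset, cu_size = struct.unpack_from("<III", buf, pos)
    pos += 12
    names = []
    while True:
        (offset,) = struct.unpack_from("<I", buf, pos)
        pos += 4
        if offset == 0:  # 4-byte terminator ends the set
            break
        end = buf.index(b"\x00", pos)  # names are null-terminated
        names.append((offset, buf[pos:end].decode("utf-8")))
        pos = end + 1
    parsed = {
        "length": length,
        "cu_offset": cu_offset,
        "cu_size": cu_size,
        "names": names,
    }
    return parsed, pos
```

Returning the end offset lets a caller parse consecutive sets by feeding it back in as `pos`.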
== ADDRESS TABLE (p. 36) == | |
Each record in the address table starts with a header containing two values: a 4-byte
(non-inclusive) value representing the length of the set of entries for the compilation unit and
a 4-byte offset into the debug segment. The header is followed by a series of pairs. Each pair
contains an address and a 4-byte constant length. Each set is terminated by a 4-byte value of 0.
== LINE NUMBER TABLE (p. 36) == | |
The source statement information for a compilation unit consists of a 4-byte (inclusive) length
and is followed by an address. This is followed by a series of source statement records. The
4-byte (inclusive) length represents the number of bytes used by the statement information for
the compilation unit. The address represents the address of the first opcode generated for that
compilation unit.
Each record contains an unsigned 4-byte integer representing the source line number, an unsigned | |
2-byte integer representing the statement's position within the corresponding line, and an | |
unsigned 4-byte integer representing the address. The special position `SRC_NO_POS` has the | |
value 0xffff which indicates that the record refers to the entire line. | |
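Putting the header and record layout together, a line number table for one compilation unit could be read roughly as follows. The sketch assumes little-endian byte order and a 4-byte address in the header (the address size is left open above). `parse_line_table` is a hypothetical name, and mapping `SRC_NO_POS` to `None` is my own convention.

```python
import struct

SRC_NO_POS = 0xFFFF

HEADER = struct.Struct("<II")   # inclusive length, (assumed 4-byte) base address
RECORD = struct.Struct("<IHI")  # line number, position in line, address


def parse_line_table(buf: bytes, pos: int = 0):
    """Return (base address, [(line, position or None, address), ...])."""
    length, base_addr = HEADER.unpack_from(buf, pos)
    end = pos + length  # the length is inclusive, so it is measured from pos
    pos += HEADER.size
    if (end - pos) % RECORD.size != 0:
        raise ValueError("statement records do not fill the table exactly")
    records = []
    while pos < end:
        line, position, addr = RECORD.unpack_from(buf, pos)
        # SRC_NO_POS means the record refers to the entire line.
        records.append((line, None if position == SRC_NO_POS else position, addr))
        pos += RECORD.size
    return base_addr, records
```

Each record is 10 bytes under these assumptions, so a table with two records has an inclusive length of 28.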
== APPLICABLE PROPERTIES (p. 41) == | |
This list describes all the properties that a given class can have in its DDE. It is important | |
to note that these are merely the applicable properties and a DDE is not required to specify | |
every single one of them. | |
To save space, the `PT_fund_type`, `PT_mod_fund_type`, `PT_user_def_type`, and `PT_mod_u_d_type` | |
properties will be abbreviated as FT, MFT, UDT, and MUDT respectively.
ELEMENT NAME PROPERTY NAME | |
----------------------------------------- | |
CLASS_array_type PT_byte_size | |
PT_name | |
PT_ordering | |
PT_private | |
PT_protected | |
PT_public | |
PT_sibling | |
PT_start_scope | |
PT_subscr_data | |
----------------------------------------- | |
CLASS_class_type PT_byte_size | |
PT_name | |
PT_private | |
PT_protected | |
PT_public | |
PT_sibling | |
PT_start_scope | |
----------------------------------------- | |
CLASS_compile_unit PT_comp_dir | |
PT_compiler | |
PT_end_pc | |
PT_lang | |
PT_name | |
PT_sibling | |
PT_start_pc | |
PT_stmt_list | |
----------------------------------------- | |
CLASS_enum PT_byte_size
PT_name | |
PT_private | |
PT_protected | |
PT_public | |
PT_sibling | |
PT_start_scope | |
----------------------------------------- | |
CLASS_global_sub FT/MFT/UDT/MUDT | |
PT_end_pc | |
PT_inline | |
PT_location | |
PT_member | |
PT_name | |
PT_private | |
PT_program | |
PT_protected | |
PT_proto | |
PT_public | |
PT_ret_cont | |
PT_sibling | |
PT_start_pc | |
PT_start_scope | |
----------------------------------------- | |
CLASS_global_var FT/MFT/UDT/MUDT | |
PT_const_val | |
PT_location | |
PT_member | |
PT_name | |
PT_private | |
PT_protected | |
PT_public | |
PT_sibling | |
PT_start_scope | |
----------------------------------------- | |
CLASS_inherit PT_location | |
PT_private | |
PT_protected | |
PT_public | |
PT_sibling | |
PT_user_def_type | |
----------------------------------------- | |
CLASS_inline_sub PT_end_pc | |
PT_sibling | |
PT_spec | |
PT_start_pc | |
----------------------------------------- | |
CLASS_label PT_name | |
PT_start_pc | |
PT_start_scope | |
PT_sibling | |
----------------------------------------- | |
CLASS_lex_block PT_end_pc | |
PT_name | |
PT_sibling | |
PT_start_pc | |
----------------------------------------- | |
CLASS_local_var FT/MFT/UDT/MUDT
PT_const_val | |
PT_location | |
PT_name | |
PT_private | |
PT_protected | |
PT_public | |
PT_sibling | |
PT_start_scope | |
----------------------------------------- | |
CLASS_member FT/MFT/UDT/MUDT | |
PT_byte_size | |
PT_location | |
PT_name | |
PT_private | |
PT_protected | |
PT_public | |
PT_sibling | |
----------------------------------------- | |
CLASS_padding | |
----------------------------------------- | |
CLASS_param FT/MFT/UDT/MUDT | |
PT_def_val | |
PT_is_opt | |
PT_location | |
PT_name | |
PT_sibling | |
----------------------------------------- | |
CLASS_ptr_type FT/MFT/UDT/MUDT | |
PT_name | |
PT_private | |
PT_protected | |
PT_public | |
PT_start_scope | |
PT_sibling | |
----------------------------------------- | |
CLASS_ref_type FT/MFT/UDT/MUDT | |
PT_name | |
PT_private | |
PT_protected | |
PT_public | |
PT_start_scope | |
PT_sibling | |
----------------------------------------- | |
CLASS_str_type PT_byte_size | |
PT_name | |
PT_private | |
PT_protected | |
PT_public | |
PT_sibling | |
PT_start_scope | |
PT_str_len
----------------------------------------- | |
CLASS_struct_type PT_byte_size | |
PT_name | |
PT_private | |
PT_protected | |
PT_public | |
PT_sibling | |
PT_start_scope | |
----------------------------------------- | |
CLASS_sub FT/MFT/UDT/MUDT | |
PT_end_pc | |
PT_inline | |
PT_member | |
PT_name | |
PT_private | |
PT_protected | |
PT_proto | |
PT_public | |
PT_ret_cont | |
PT_start_pc | |
PT_start_scope | |
PT_sibling | |
----------------------------------------- | |
CLASS_sub_type FT/MFT/UDT/MUDT | |
PT_name | |
PT_private | |
PT_protected | |
PT_proto | |
PT_public | |
PT_sibling | |
PT_start_scope | |
----------------------------------------- | |
CLASS_typedef FT/MFT/UDT/MUDT | |
PT_name | |
PT_private | |
PT_protected | |
PT_public | |
PT_sibling | |
PT_start_scope | |
----------------------------------------- | |
CLASS_unspec_params PT_sibling
----------------------------------------- |
I'm not averse to changing the string representation but is it really necessary? I don't think it's possible to cause a buffer overflow since the debug segment isn't actual executable code. Considering this, does using a length field really provide any additional security?
How much thought did you put into adapting this to Parrot and pbc? There's a lot in there that makes perfect sense for a C-like language but doesn't really have an analog in Parrot. There are many examples, but a subset of what jumps out is typedefs (which don't exist in PIR or pbc), arrays (which are just a PMC) and fundamental types (which only have meaning in NCI). My criticism isn't about those three aspects of the proposal, but primarily that it seems like you haven't put enough thought in adapting only the relevant bits of the DWARF spec. I greatly appreciate that you waded through the spec in the first place and look forward to helping you further refine this proposal. Modeling debug segments on a widely-used design is a good plan, but only if we make sure to adapt it carefully.
The general strategy I'd use would be to try to figure out the motivation for the various parts of DWARF and see how those could be applied to Parrot. This will obviously be more work, but it'll also mean that we'll have a better understanding of why DWARF does what it does and how we can translate its solutions to Parrot.
Additionally, my brain started to get sore before I got 1/3 through the proposal. I'd much prefer that you present a very general outline that's easy to understand and comment on, then, once people are happy, expand that out to something more exhaustive. Laziness as a virtue applies to hackers too.
That's exactly where I'm at right now. I need to figure out how to Parrot-ize DWARF. I'm going to need a bit of help from some of the Parrot veterans who know those kinds of things without even having to think about it. Of course, any suggestions you may have are always welcome.
I also agree that writing a smaller outline would be beneficial. I can get started on that soon.
All are very valid points and thanks. :)
the definition of a string, as found on line 70, is dangerous. that is:
while it has long been standard practice due to C, this design decision is now considered to be a mistake. anyone who's ever experienced damage due to buffer overruns will tell you so.
please consider changing the definition of a string to include a length field and drop the terminator.