rocky/issue-format.md

## issue-format.md

      
    Raw
  

              issue-format.md
            
          
Table of Contents

Issue Format V2

Introduction
Goals
Smart Contract Weakness ID (SWC)
Descriptive text

Descriptive text breakout


Severity
Location(s)

src
srcmap and list of src
Other ways to indicate a location


Application-driven selection of output fields
Detailed ANTL Grammar
Input specification for Issue Format


Issue Format V2

Introduction

Issues give how MythOS reports findings from analyses back to applications that requested service.
These are the parts of a single issue:

Smart Contract Weakness ID (SWC)
Descriptive text
Severity
Location(s)

We'll go over each in more detail after the next section. At the end we give a detailed ANTLR grammar.
Goals

The goal of the design is a to allow a simple, flexible, way of communicating issues from MythOS back to applications in a way that is most appropriate for that application.
We want to cover a wide variety of applications such as:

IDEs that have a wealth of source-code information, ASTs and have libraries for working with these
Applications that have limited access to libraries
Applications that work outside of the Solidity/EVM space (Vyper, Wasm)
Applications that don't have access to source code (e.g. something based on etherscan)

Another goal is to a complete formal description of the issues format for 3rd-party Mythril developers, while also motivating the thought behind its design.
Smart Contract Weakness ID (SWC)

We have proposed to the Ethereum Foundation a scheme for classifying a security weakness, similar to CWE used in other security domains.
We call this a Smart Contract Weakness Classification (SWC). See https://github.com/SmartContractSecurity/SWC-registry. The field name returned in
JSON is SWC-ID.
Since classification of a particular weakness may require approval, or there may be uncertainty as to which number to use, there should be a number to indicate "unassigned." I suggest 000.
Descriptive text

This should give information specific to the problem found. It generally should not be a repetition of the corresponding SWC text.
As is common practice in informational messages, ideally it would give a little bit of information as to what was found and what was expected.
For example, consider a recently improved Mythril message:
The operation can overflow

This was changed to:
A subtraction can overflow

which is better because it has been customized to some particular instance of the general SWC classification
Or consider this message:
The version of the compiler specified in the Pragma is outdated. It is recommended to use a more recent one (>=0.4.23).

The set of versions that is expected is given. This is good, but the first sentence doesn't give the version number.
Descriptive text breakout

Some applications are very space limited in their output. The eslint "stylish" format is an example. So, messages which are over 50 characters should be broken into a lead description followed by the remaining description, like a newspaper story..
The field names in the JSON output should be description-lead and description-rest. If the client declares that it can handle long descriptions, then a single field description is fine. More about client declarations of acceptable formats below.
Severity

This gives some measure of the cause for concern of an issue. This is perhaps a little bit subjective, but this kind of measure
is a standard practice in tools that report problems. Some common severity levels are:

informational
warning
error
fatal (can't process further)

However eslint, a popular linter for JavaScript, boils this down to "warning" and "error".
I plan on implementing in clients a way whereby a user can remap severity levels of particular SWC's. Therefore MythOS doesn't need to worry about pleasing everyone. But giving some rough estimate would be great.
The field name returned in JSON is severity. This is the same name used by eslint. And following that 1 is used for a warning and 2 is used for error. This suggests that if "informational" and "fatal" are used, then these would naturally be 0 and 3
Under "application request," we can also give the long, descriptive name..
Location(s)

This is the most difficult to describe because it depends on the context. The kinds of objects that
MythOS might analyze include:

A string containing some language (e.g. Solidity, Vyper) source code
A JSON string containing collection of Solidity files
The AST's (in some format) of the above
Contract or deployed EVM Bytecode
Instructions in something other than EVM bytecode, e.g. wasm

In light of this variation, the location returned should refer to one of the above.
src

solc has a well thought out mechanism for reporting locations. This is embodied in its src and srcmap entries. See https://solidity.readthedocs.io/en/v0.4.25/miscellaneous.html?source-mappings .
We will come to srcmap later. Here let us focus just on src, which is a compact way to describe a location.
A src entry is a string value that looks like this s:l:f with the following meaning

s is the some kind of pointer to the start of the object mentioned,
l is the length of the object entries, and
f is an index into a table of the objects that comprise the input.

Note: the above has been generalized a little bit from the solc definition.
In the most common case, where we are talking about character-input source code. s would be a character offset,
l the number of characters, and f an index into an array of filenames (of some sort).
However if we were working with only with bytecode or wasm where there is no source available, then s would be a byte offset for
bytecode and possibly an instruction number for wasm. Usually, when working with bytecode or assembly l would be 1 for instruction or opcode. But you might give a length greater than 1 if you wanted to convey the full instruction associated with the bytecode. And we leave open the possibility that an analysis can refer to a contiguous sequence of instructions.
Most of the time, input consists of a single file or object, so use 0 here. If there is no associated object or the object is not known, use -1.
srcmap and list of src

In some cases you might want an issue to refer to several locations. For example, if you want to point out that there are multiple sends in a transaction, you might want to give the location for each send.
srcmap is what solc outputs when it wants to give a list of locations.
The use of srcmap in solc is expected to cover the entire sequence of bytecodes, even those bytecodes that aren't on instruction boundaries. There is an implied index association between the position in the srcmap and the position in bytecode.
In our situation where we just need to get a list of locations, this index association isn't useful. However the solc typographic convention for indicating a list can be used. So in the places where we want another location indicated, we separate distinct locations with a semicolon ;
Other ways to indicate a location

Another kind of location that I think is useful when giving a solc-style AST, is the a node or "id" number in the tree.
(Somewhere else I should describe what a solc-style AST is).
Let me explain why.
Text ranges can cover a couple of concepts in different layers. Suppose I have:
    x = i + j
	~~~~~

The above underlined text could refer to:

a right-hand side expression, or
an arbitrary expression, or
an arbitrary binary expression, or
a binary addition operation

The name of the AST node will tell you which of these is meant.
But this,  as well as the idea of srcmap, will seem strange to those who haven't encountered this kind of thing before.
And there may be applications which prefer to have locations described in terms of
"line number", "column number" and "filename" when referring to source text, or "bytecode offset" when referring to evm wasm.
Presumably there will be libraries for converting back and forth between these concepts. Indeed these already exist to some extent in say (javascript) remix libraries.
This brings us to the topic of allowing an application to specify the kind of format it would like to receive.
Application-driven selection of output fields

Here we have been focusing on the output side of MythOS (to the application). In another document, I'll describe the input side, but input can impinge on output, and I'll mention that here.
On the input side, as proposed recently by Sergey, we would like to have a flexible system where we can add alternative kinds of inputs that are most convenient from the application side. For example truffle already has an extensive set of compile artifacts that are trivial to send over to MythOS.
To lessen the burden on the application side, we can add a field to indicate how the output should be returned. Applications that don't like src and srcmap could request line numbers and column numbers in issues. Some applications might want various kinds of descriptive text that is associated with SWC-IDs rather than looking these things up from the Internet.
Detailed ANTL Grammar

I will probably turn this into an openapi spec since that is easier to read and can be more verbose in describing fields.
However to get going quickly here is a detailed grammar for the JSON output.
issues : '[' issue (',' issue)* ']'
issue  : '{' swc ',' severity, ',' description ',' locations '}'

swc    : '{'     'SWC-ID' ':'  STRING
          ( ','  'SWC-title:' STRING ) ?
          ( ',', 'SWC-description:' STRING ) ?
         '}'
	 
severity : [0-4]
         | 'informational'
	 | 'warning' 
	 | 'error'
	 | 'fatal'

description : 'description:' string
            | 'description-lead': string ( ',' 'description-rest' ) ?

locations : srcmap
          | offsets
          | ids
          | line_column_length_files  

offsets : 'offset:' '[' offset (',' offset ) * ']'
        | 'offset:' offset

offset : NUMBER

ids : 'id:' '[' id (',' id ) * ']'
    | 'id:' id

id: NUMBER

line_column_lengths_files : '[' line_column_length_file (',' line_column_length_file ) * ']'
                          | line_column_length_file

line_column_length_file  : 'line:' NUMBER
                          ( ',' 'column:' NUMBER ) ?
  			  ( ',' 'length:' NUMBER ) ?
			    ',' 'filename:' string )


STRING: '"" ~('"')+ '"'
NUMBER [0-9]+

Input specification for Issue Format

TODO: make this a real grammar. For shorthand, below I am using
foo: { a | b } *  to mean

foo: a, or
foo: b,  or
foo: ['a', 'b']

issue_format: 'issue-format: {'
	{ swc | description | location } *

swc : 'swc: ' { 'swc-title | 'swc-description' } *
description: 'description: ' { 'description-lead' | 'description' '}' *
location : 'location: ' { 'offset' | 'id' | 'line_column_length_file' } *
severity : 'severity: ' { 'name | 'number' } *

Examples:
At one end of the spectrum where we don't have AST's or libraries and
no access to the SWC registry, but lots of space to print messages:
{...
  "issue_format": {
    "swc" : ["swc-title", "swc-description"],
    "severity": "name",
    "description": "description",
    "location": "line-column-length-file
}...

and at the other end of the spectrum where
we have AST's, and lots of libraries, but
limited space
{...
  "issue_format": {
     "description": "description-lead",
     "severity": "number",
     "location": "ids"
}...