PType proposal

What are Physical Types?

Physical types are the classes that create, or generate code to create, the in-memory representations of Hail Types (Virtual Types). They serve as the implementations of Virtual Types, which are interfaces.

Where possible, Physical Type behavior should follow Python type behavior.

This proposal deals with the architectural goals of the PType implementation for 2020 Q1.

Motivation:

  • Improve performance by building specialized memory representations for data (improve developer velocity / enable performance optimizations in the future).

Project technical goals:

  • Abstract PType interfaces define code-generation and interpretation primitives (for example, PCanonicalArray concretely implements the PArray interface).
  • Remove requiredness from virtual types
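
The split between an abstract PType interface and its concrete implementations can be sketched roughly as follows. This is a minimal illustration, not Hail's actual class definitions: VType, VInt32, VArray, virtualType and byteSize are names assumed here for the example. It shows requiredness living only on the physical side, so two physical types that differ in requiredness share one virtual type.

```scala
object PTypeSketch {
  // Virtual types say what a value is; they carry no layout and no requiredness.
  sealed trait VType
  case object VInt32 extends VType
  final case class VArray(element: VType) extends VType

  // Physical types say how a value is laid out in memory.
  trait PType {
    def virtualType: VType
    def required: Boolean
    def byteSize: Long
  }

  final case class PInt32(required: Boolean = false) extends PType {
    val virtualType: VType = VInt32
    val byteSize: Long = 4L
  }

  // Abstract interface: what every array representation must provide.
  trait PArray extends PType {
    def elementType: PType
    def virtualType: VType = VArray(elementType.virtualType)
  }

  // One concrete layout; other layouts could implement PArray as well.
  final case class PCanonicalArray(elementType: PType, required: Boolean = false) extends PArray {
    val byteSize: Long = 8L // in this sketch an array is referenced through an 8-byte address
  }
}
```

Code written against PArray would not need to change if a second, non-canonical array layout were added later.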

Future directions:

  • Introduce the following invariant in the codebase: All region methods / Memory methods are used only in the ptypes hierarchy when dealing with values of Hail types.

Conceptual class hierarchy:

PType

PArray

PSet

PDict

PNDArray

  • Specialized implementations (canonical/non)

PTuple

PStruct

PLocus

PCall

  • Specialized implementations (canonical/non)

PInterval

  • Specialized implementations (canonical/non)

PFloat32

PFloat64

PInt32

PInt64

PString

PBinary

PVoid


PType

Utility methods

Code Methods

def store(destinationAddress: Long, destinationType: PType, value: Long, valueType: PType): Unit
def store(destinationAddress: Code[Long], destinationType: PType, value: Code[Long], valueType: PType): Code[Unit]

PArray

An abstract class for immutable ordered collections where all elements are of a single type. Does not contain the value constructor (e.g. allocate).

Core Methods

(Each method has a staged version)

def loadLength(arrayAddress: Long): Long = ...
def loadLength(arrayAddress: Code[Long]): Code[Long] = ...

def isElementMissing(arrayAddress: Long, index: Int): Boolean = ...
def isElementMissing(arrayAddress: Code[Long], index: Code[Int]): Code[Boolean] = ...

def loadElementAddress(arrayAddress: Long, index: Int): Long = ...
def loadElementAddress(arrayAddress: Code[Long], index: Code[Int]): Code[Long] = ...
  • Renamed from loadElement because this function only returns the address of the element, not the element itself
  • Does not take a region instance because memory addresses are valid across regions. The current loadElement signatures that take a region instance never use it. A region instance would be needed only if loadElement had to allocate memory off-heap, but that is semantically inconsistent with loading; it would be value construction, which happens in allocate.
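
As a rough illustration of the three unstaged accessors, the sketch below reads from a java.nio.ByteBuffer standing in for off-heap memory, where an "address" is just a byte offset. The layout (a 4-byte length, then one missing bit per element, then padding, then Int32 elements) is an assumption made for this example, not a statement of the canonical layout, and elementsOffset is a hypothetical helper. Note that no region is needed to compute an element address.

```scala
import java.nio.ByteBuffer

object ArrayLoadSketch {
  // Stand-in for off-heap memory; an address is a byte offset into this buffer.
  val memory: ByteBuffer = ByteBuffer.allocate(1024)

  val elementSize = 4L // Int32 elements

  // Assumed layout: [length: 4 bytes][missing bits: ceil(n / 8) bytes][padding][elements]
  def loadLength(arrayAddress: Long): Long =
    memory.getInt(arrayAddress.toInt).toLong

  def isElementMissing(arrayAddress: Long, index: Int): Boolean = {
    val byte = memory.get((arrayAddress + 4 + index / 8).toInt)
    (byte & (1 << (index % 8))) != 0
  }

  // Offset of the first element from the array start, rounded up to element alignment.
  private def elementsOffset(length: Long): Long = {
    val unaligned = 4 + (length + 7) / 8
    (unaligned + elementSize - 1) / elementSize * elementSize
  }

  // Pure address arithmetic: returns where the element lives, not the element itself.
  def loadElementAddress(arrayAddress: Long, index: Int): Long =
    arrayAddress + elementsOffset(loadLength(arrayAddress)) + index * elementSize
}
```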

Questions:

  1. Do we want to allow a range of maximum array lengths (not just 32-bit)?
  2. Do we want a loadElement that returns the actual data stored at that address? Currently the caller always needs to perform a second step, at the cost of more allocations (and an address takes more bytes than any primitive besides Long and Double).
  3. Do we want a loadElements that returns an iterable? This would save the caller boilerplate: currently they need to store the length of the array, maintain an index variable, manually construct a while loop, and check whether each element is missing (callers typically iterate over non-missing elements only).
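
To make question 3 concrete, here is a hedged sketch contrasting the loop callers write today with a possible loadElements. Int32ArrayAccess, loadInt, sumDefined and the exact shape of loadElements are hypothetical names introduced only for this example; the three accessor signatures follow the ones listed above.

```scala
object LoadElementsSketch {
  // The three accessors from the proposal, plus a hypothetical primitive read.
  trait Int32ArrayAccess {
    def loadLength(arrayAddress: Long): Long
    def isElementMissing(arrayAddress: Long, index: Int): Boolean
    def loadElementAddress(arrayAddress: Long, index: Int): Long
    def loadInt(elementAddress: Long): Int
  }

  // What callers write today: explicit length, index, loop, and missingness check.
  def sumDefined(a: Int32ArrayAccess, arrayAddress: Long): Long = {
    val length = a.loadLength(arrayAddress)
    var i = 0
    var sum = 0L
    while (i < length) {
      if (!a.isElementMissing(arrayAddress, i))
        sum += a.loadInt(a.loadElementAddress(arrayAddress, i))
      i += 1
    }
    sum
  }

  // Hypothetical loadElements: the same traversal packaged as an iterator.
  def loadElements(a: Int32ArrayAccess, arrayAddress: Long): Iterator[Option[Int]] =
    Iterator.range(0, a.loadLength(arrayAddress).toInt).map { i =>
      if (a.isElementMissing(arrayAddress, i)) None
      else Some(a.loadInt(a.loadElementAddress(arrayAddress, i)))
    }
}
```

With this shape the caller's loop shrinks to something like loadElements(a, addr).flatten.sum. The trade-off is that the unstaged iterator boxes each element in an Option, so a staged version would likely need a different encoding of missingness.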

PArray concrete implementations

PCanonicalArray

Signature

PCanonicalArray(elementType: PType, required: Boolean = false)

Core methods

def allocate(region: Region, length: Long): Long = ...
def allocate(region: Code[Region], length: Code[Long]): Code[Long] = ...
  • Allocates the array value (e.g. code-generates the allocation) and returns the memory address of the start of the array.
def setElement(arrayAddress: Long, index: Int, value: Annotation): Unit = ...
def setElement(arrayAddress: Code[Long], index: Code[Int], value: Code[Annotation]): Code[Unit] = ...
  • Sets the value at the given index. Assumes the array has already been allocated. Does not track whether the value has already been set.
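
The allocate-then-setElement sequence might look like the sketch below. Several things here are assumptions for illustration: SimpleRegion is a toy bump allocator, not Hail's Region API; the layout is a 4-byte length followed by required Int32 elements with no missing bits; and setElement takes the region only because the simulated memory lives inside it, whereas the proposed signature needs just the address.

```scala
import java.nio.ByteBuffer

object ArrayBuildSketch {
  // Stand-in for a Region: a bump allocator over one buffer.
  final class SimpleRegion(capacity: Int) {
    val memory: ByteBuffer = ByteBuffer.allocate(capacity)
    private var next = 8L // keep address 0 unused
    def allocate(alignment: Long, bytes: Long): Long = {
      val start = (next + alignment - 1) / alignment * alignment
      next = start + bytes
      require(next <= capacity, "region exhausted")
      start
    }
  }

  // Assumed layout: [length: 4 bytes][elements: 4 bytes each]
  def allocate(region: SimpleRegion, length: Int): Long = {
    val address = region.allocate(4L, 4L + 4L * length)
    region.memory.putInt(address.toInt, length)
    address
  }

  // Overwrites whatever is in the slot; assumes allocate has already run.
  def setElement(region: SimpleRegion, arrayAddress: Long, index: Int, value: Int): Unit =
    region.memory.putInt((arrayAddress + 4 + 4L * index).toInt, value)
}
```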

PSet

An abstract class for immutable (potentially unordered) collections of values where all values are unique and of one type. Does not contain the value constructor (e.g. allocate).

  • TODO: Not sure what the intended semantics of our sets are, besides uniqueness. Should we be able to access them by index? Similar question about dictionary ptypes.

Core Methods

(Each method has a staged version)

def loadLength(arrayAddress: Long): Long = ...
def loadLength(arrayAddress: Code[Long]): Code[Long] = ...
  • Returns the array length

def isElementMissing(arrayAddress: Long, index: Int): Boolean
def isElementMissing(arrayAddress: Code[Long], index: Code[Int]): Code[Boolean]

def loadElementAddress(arrayAddress: Long, index: Int): Long
def loadElementAddress(arrayAddress: Code[Long], index: Code[Int]): Code[Long]

Questions

  1. Why shouldn't loadElementAddress take a hashable value here? Code gen for figuring out the address of an unordered set by value seems like PType domain.
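
One way a by-value lookup could live inside the PType is sketched below. It assumes, purely for illustration, that the set is stored as a sorted array of 4-byte Int32 elements, so the lookup is a binary search rather than a hash probe. findElementAddress and the loadInt parameter are hypothetical names.

```scala
object SetLookupSketch {
  // Assumption: elements are sorted Int32 values, 4 bytes each, starting at
  // elementsAddress. loadInt is a hypothetical primitive read at an address.
  def findElementAddress(
    elementsAddress: Long,
    length: Int,
    loadInt: Long => Int,
    item: Int
  ): Option[Long] = {
    var lo = 0
    var hi = length - 1
    while (lo <= hi) {
      val mid = lo + (hi - lo) / 2
      val address = elementsAddress + 4L * mid
      val v = loadInt(address)
      if (v == item) return Some(address)
      else if (v < item) lo = mid + 1
      else hi = mid - 1
    }
    None // item is not in the set
  }
}
```

Because the search depends on how the elements are ordered and laid out, it is the kind of logic that a concrete set implementation is well placed to own.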

PSet concrete implementations

PCanonicalSet

Signature

PCanonicalSet(elementType: PType, required: Boolean = false)

Core methods

def allocate(region: Region, length: Long): Long = ...
def allocate(region: Code[Region], length: Code[Long]): Code[Long] = ...
  • Allocate the value array (e.g code-generate allocation) and returns the memory address of [the start of] the array
def setElement(arrayAddress: Long, index: Int, value: Annotation): Unit = ...
def setElement(arrayAddress: Code[Long], index: Code[Int], value: Code[Annotation]): Code[Unit] = ...
  • Insert a value at the index

PDict

An abstract class for immutable unordered collections of key:value pairs where keys are unique. Keys must all be of the same type, and values must all be of the same type (which may differ from the key type). Does not contain the value constructor (e.g. allocate).

Core Methods

(Each method has a staged version)

def loadLength(arrayAddress: Long): Long = ...
def loadLength(arrayAddress: Code[Long]): Code[Long] = ...

def loadElementAddress(arrayAddress: Long, index: Int): Long = ...
def loadElementAddress(arrayAddress: Code[Long], index: Code[Int]): Code[Long] = ...

PDict concrete implementations

PCanonicalDict

Signature

PCanonicalDict(keyType: PType, valueType: PType, required: Boolean = false)

Core methods

def allocate(region: Region, length: Int): Long = ...
def allocate(region: Code[Region], length: Code[Int]): Code[Long] = ...
  • Returns the address to the start of the dictionary

PTuple

An abstract class for immutable ordered collections of values that may be of different types. Does not contain the value constructor (e.g. allocate).

Core Methods

(Each method has a staged version)

def loadLength(arrayAddress: Long): Long = ...
def loadLength(arrayAddress: Code[Long]): Code[Long] = ...
  • Returns the array length

def loadElement(arrayAddress: Long, index: Long): Option[AnyVal]
def loadElement(arrayAddress: Code[Long], index: Long): Code[Optional[AnyVal]] = ...

PTuple concrete implementations

PCanonicalTuple

Signature

PCanonicalTuple(fields: IndexedSeq[PType], required: Boolean = false)

Core methods

def allocate(region: Region, length: Long): Long = ...
def allocate(region: Code[Region], length: Code[Long]): Code[Long] = ...

PStruct

An abstract class for immutable collections of (key, value) pairs of (potentially) different types. Keys are always strings. Values are looked up by key only. Does not contain the value constructor (e.g. allocate).

Core Methods

(Each method has a staged version)

def loadLength(arrayAddress: Long): Long = ...
def loadLength(arrayAddress: Code[Long]): Code[Long] = ...
  • Returns the array length

def loadElement(arrayAddress: Long, fieldName: String): Option[AnyVal]
def loadElement(arrayAddress: Code[Long], fieldName: String): Code[Optional[AnyVal]] = ...
  • Same return value semantics as PArray with regard to missingness
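
Lookup by field name can be resolved to a fixed byte offset once, when the struct type is built, so that each load is only address arithmetic. The sketch below illustrates that idea under stated assumptions: Field, StructLayout, offsets and fieldAddress are hypothetical names, a field's byte size stands in for its full PType, each field is aligned to its own size, and missing bits are ignored.

```scala
object StructFieldSketch {
  // A field's byte size stands in for a full PType in this sketch.
  final case class Field(name: String, byteSize: Long)

  final class StructLayout(fields: IndexedSeq[Field]) {
    // Byte offset of each field from the struct's start address, computed once.
    val offsets: IndexedSeq[Long] = {
      var next = 0L
      fields.map { f =>
        val start = (next + f.byteSize - 1) / f.byteSize * f.byteSize
        next = start + f.byteSize
        start
      }
    }

    private val indexByName: Map[String, Int] = fields.map(_.name).zipWithIndex.toMap

    // None when the struct has no field with that name.
    def fieldAddress(structAddress: Long, fieldName: String): Option[Long] =
      indexByName.get(fieldName).map(i => structAddress + offsets(i))
  }
}
```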

PStruct concrete implementations

PCanonicalStruct

Signature

PCanonicalStruct(fields: Seq[(String, PType)], required: Boolean = false)

Core methods

def allocate(region: Region, length: Long): Long = ...
def allocate(region: Code[Region], length: Code[Long]): Code[Long] = ...

PLocus

A representation of a chromosomal locus, encapsulating the reference genome, chromosome (called contig in our documentation), and position.

Core Methods

(Each method has a staged version)

def reference(arrayAddress: Long): String = ...
def reference(arrayAddress: Code[Long]): Code[String] = ...

def chromosome(arrayAddress: Long): String = ...
def chromosome(arrayAddress: Code[Long]): Code[String] = ...

def position(arrayAddress: Long): Long = ...
def position(arrayAddress: Code[Long]): Code[Long] = ...

PLocus concrete implementations

PCanonicalLocus

Signature

PCanonicalLocus(reference: PString, chromosome: PString, position: PInt64)

Questions:

  1. Do we need to have a value constructor for PLocus? TODO: need some construction method.

PSet

An abstract class for immutable (potentially unordered) collections of values where all values are unique and of one type. Does not contain the value constructor (e.g. allocate).

  • TODO: This is wrong I think. Not sure what the intended semantics of our sets is, besides uniqueness. Should we be able to access them by index?

Core Methods

(Each method has a staged version)

def loadLength(arrayAddress: Long): Long = ...
def loadLength(arrayAddress: Code[Long]): Code[Long] = ...
  • Returns the array length

def loadElement(arrayAddress: Long, item: AnyVal): Option[AnyVal]
def loadElement(arrayAddress: Code[Long], item: Hashable): Code[Optional[AnyVal]] = ...
  • The return semantics for PSet's loadElement instance method are identical to PCanonicalArray's loadElement instance method

PSet concrete implementations

PCanonicalSet

Signature

PCanonicalSet(elementType: PType, required: Boolean = false)

Core methods

def allocate(region: Region, length: Long): Long = ...
def allocate(region: Code[Region], length: Code[Long]): Code[Long] = ...
  • Construct the value array (e.g code-generate allocation and insertion) and returns the memory address of [the start of] the array

PDict

An abstract class for immutable unordered collections of key:value pairs where keys are unique. Keys must all be of the same type, and values must all be of the same type (which may differ from the key type). Does not contain the value constructor (e.g. allocate).

Core Methods

(Each method has a staged version)

def loadLength(arrayAddress: Long): Long = ...
def loadLength(arrayAddress: Code[Long]): Code[Long] = ...
  • Returns the array length

def loadElement(arrayAddress: Long, key: Hashable): Option[AnyVal]
def loadElement(arrayAddress: Code[Long], key: Hashable): Code[Optional[AnyVal]] = ...
  • Returns the key's corresponding value, if present. The interpreted version uses Scala's Option and requires matching on Some/None. The staged version uses Java's Optional semantics: match on v.isNull, just like PArray's loadElement.

PDict concrete implementations

PCanonicalDict

Signature

PCanonicalDict(keyType: PType, valueType: PType, required: Boolean = false)

Core methods

def allocate(region: Region, length: Long): Long = ...
def allocate(region: Code[Region], length: Code[Long]): Code[Long] = ...
