@wickman
Created February 19, 2014 19:18
pystachio 1.0 designdoc
Pystachio 1.0 redesign
1. Documents:
Essentially a set of hierarchical key/value pairs a la a JSON document.
Documents may contain:
1. Leaves (string or numeric [integer, long, float, etc])
2. Iterables (can contain values of 1, 2, 3; will always be coerced to
tuples, regardless of underlying type e.g. set, so set
properties will not be preserved when coerced to Document)
3. Maps (keys /must/ be coercible to 1, values can be any of 1, 2, 3)
The Document itself is just a Map (#3). String leaves are always coerced to
either a Ref or Fragment as described in the next section. Maps are always
coerced to Documents (in other words, Documents are recursive data types.)
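The coercion rules above can be sketched as a small recursive function. This is only an illustration: coerce and coerce_key are hypothetical names, plain dicts stand in for Documents, and the Ref/Fragment coercion of string leaves is left out.

```python
from collections.abc import Mapping

LEAF_TYPES = (str, int, float)

def coerce_key(key):
    # Map keys must themselves be leaves.
    if not isinstance(key, LEAF_TYPES):
        raise TypeError('map keys must be leaves, got %r' % (key,))
    return key

def coerce(value):
    """Hypothetical helper: normalize a plain Python value into document form."""
    if isinstance(value, LEAF_TYPES):
        return value  # leaf (string leaves would later become Ref or Fragment)
    if isinstance(value, Mapping):
        # Map: values are coerced recursively.
        return {coerce_key(k): coerce(v) for k, v in value.items()}
    try:
        items = iter(value)
    except TypeError:
        raise TypeError('cannot coerce %r into a document value' % (value,))
    # Any other iterable becomes a tuple, so set semantics are lost.
    return tuple(coerce(v) for v in items)

doc = coerce({'name': 'hello', 'ports': [80, 443], 'tags': {'only'}})
```

Here both the list and the set come back as tuples, which is the loss of set properties described above.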
There are two important methods on documents:
.bind(*args, **kw)
.__call__(*args, **kw)
These used to be different, but in Pystachio 1.0 they are aliased together.
Each argument in args must be either another document or a dict which is to
be merged with this document. **kw should be passed to a Document and
merged in as well. Merge is performed by dict .update().
Documents should provide a .raw_items() iterator analogous to .items() that
iterates over the raw, unresolved contents of the document. .bind() and
.__call__() should merge .raw_items() from Documents, but plain .items()
from dicts. .items() should iterate over *resolved* versions of the
dictionary contents in the case of Documents.
NOTE: .bind and .__call__ functions return *new* Documents -- the original
document is immutable and unchanged, so:
>>> d = Document(hello = 'world')
>>> d(hello = 'universe')
leaves 'd' unchanged and instead returns a new Document(hello =
'universe').
Furthermore, since Documents are like dictionaries, they can be accessed via
__getitem__. Use of __getattr__ is discouraged but just delegates to
__getitem__.
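A minimal sketch of the immutable bind/call behaviour, assuming a flat document with no resolution. SketchDocument is a hypothetical stand-in, not the proposed Document class.

```python
class SketchDocument(object):
    """Hypothetical stand-in for Document; mustache resolution is omitted."""

    def __init__(self, *args, **kw):
        self._raw = {}
        for arg in args + (kw,):
            # Merge raw contents from documents, plain items from dicts.
            if isinstance(arg, SketchDocument):
                self._raw.update(arg.raw_items())
            else:
                self._raw.update(arg.items())

    def raw_items(self):
        return iter(list(self._raw.items()))

    def bind(self, *args, **kw):
        # Build a new document; self._raw is never mutated.
        return type(self)(self, *args, **kw)

    __call__ = bind  # the two methods are aliases in this design

    def __getitem__(self, key):
        return self._raw[key]

    def __getattr__(self, name):
        # Discouraged, but simply delegates to __getitem__.
        if name.startswith('_'):
            raise AttributeError(name)
        try:
            return self[name]
        except KeyError:
            raise AttributeError(name)

d = SketchDocument(hello='world')
d2 = d(hello='universe')
```

After this runs, d still maps hello to 'world' while d2 maps it to 'universe'.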
2. Mustaches:
We do not implement the Mustache document format (i.e. no loops or
conditionals.) We just use mustaches '{{}}' to denote 'pointer.'
There are two kinds of indirect objects:
1. Refs:
A single mustache instance, e.g.
{{foo}}
{{foo.bar}}
{{foo.bar[baz]}}
{{foo[`baz`]}}
2. Fragments
A sequence of [... str, ref, str, ref, str ...] representing a string.
A fragment may neither start with a Ref nor end with a Ref. A fragment
cannot be [Ref(...)]; instead it will just be a Ref.
It is possible to have the following case:
{{foo.bar[{{baz}}]}}
This will be parsed as:
Fragment(['{{foo.bar[', Ref('baz'), ']}}'])
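One way to get this parse is to match only innermost mustaches, i.e. braces with no braces inside. The sketch below assumes that approach; Ref is modelled as a namedtuple, parse is a hypothetical name, and & escaping is not handled.

```python
import re
from collections import namedtuple

Ref = namedtuple('Ref', 'path')

# Matches an innermost mustache: braces with no further braces inside.
_MUSTACHE = re.compile(r'\{\{([^{}]+)\}\}')

def parse(text):
    """Return a Ref for a lone mustache, else a list of alternating str and Ref."""
    parts, pos = [], 0
    for match in _MUSTACHE.finditer(text):
        parts.append(text[pos:match.start()])  # literal text before the ref
        parts.append(Ref(match.group(1)))
        pos = match.end()
    parts.append(text[pos:])  # trailing literal text, possibly empty
    if len(parts) == 3 and parts[0] == '' and parts[2] == '':
        return parts[1]  # [Ref(...)] collapses to a bare Ref
    return parts
```

Because the list always begins and ends with a (possibly empty) string, a fragment never starts or ends with a Ref.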
The distinction between Refs and Fragments exists because of how they
are resolved.
Ref resolution is simply "find the reference and substitute it in place of
this value." The process of finding the associated reference value is
described in the next section.
Fragment resolution on the other hand is an iterative process. If there are
any Refs within a fragment that cannot be found within the Document, raise
NotFound as it cannot be resolved. If all Refs can be resolved,
''.join(resulting strings) and reparse. There are three possible outcomes:
1. A single Ref: perform ref resolution (i.e., substitution)
2. Fragment with no Refs: return fragment.components()[0] (its string representation)
3. Fragment with Refs: repeat iteration
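The iteration can be sketched against a flat dict of bindings. This is a simplification under stated assumptions: names are looked up verbatim rather than dereferenced, a single Ref whose value is itself a string keeps resolving, and max_rounds is a hypothetical guard against cycles that the design above does not mention.

```python
import re

class NotFound(Exception):
    pass

_MUSTACHE = re.compile(r'\{\{([^{}]+)\}\}')

def lookup(name, bindings):
    if name not in bindings:
        raise NotFound(name)
    return bindings[name]

def resolve(text, bindings, max_rounds=32):
    """Iteratively resolve mustaches in text against a flat dict of bindings."""
    for _ in range(max_rounds):
        whole = _MUSTACHE.fullmatch(text)
        if whole:
            # Outcome 1: a single Ref, substitute the bound value in place.
            value = lookup(whole.group(1), bindings)
            if not isinstance(value, str):
                return value
            text = value
            continue
        if not _MUSTACHE.search(text):
            return text  # Outcome 2: no Refs remain.
        # Outcome 3: substitute every Ref, join, and reparse on the next round.
        text = _MUSTACHE.sub(lambda m: str(lookup(m.group(1), bindings)), text)
    raise ValueError('gave up after %d rounds (possible cycle)' % max_rounds)

bindings = {'greeting': 'hello {{who}}', 'who': '{{name}}', 'name': 'world', 'port': 80}
message = resolve('{{greeting}}!', bindings)
```

With these bindings, message goes through 'hello {{who}}!' and 'hello {{name}}!' before settling on 'hello world!'.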
3. Evaluating mustache refs
There are three forms of refs:
{{name}}
{{name1.name2}}
{{name1[name2]}}
It is possible to escape a mustache using &:
{{&name}}
will always result in the string {{name}} rather than Ref('name'). The &
should be stripped as late as possible.
Implicitly, {{name}} is equivalent to {{.name}}, which on a document means
__getitem__('name'). Furthermore, these can be composed together, e.g.
{{name1.name2[name3]}}. It should also be possible to escape names using
back-ticks (this was not possible in Pystachio 0.x):
{{`foo bar`}}
{{name1.`foo bar`}}
{{name1[`foo bar`]}}
As such, back-ticks are not allowed within keys in Documents. Names must be
escaped if they do not conform to C-style variable names, i.e. [a-zA-Z_][a-zA-Z0-9_]*
This includes common team names such as `aurora-team`.
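A sketch of splitting the inside of a mustache into dereference steps. parse_path is a hypothetical name, and allowing bare digits inside [] (for indices such as processes[0]) is an assumption rather than something stated above.

```python
import re

_STEP = re.compile(
    r'(?P<dot>\.)?(?:`(?P<quoted>[^`]+)`|(?P<plain>[A-Za-z_][A-Za-z0-9_]*))'
    r'|\[(?:`(?P<iquoted>[^`]+)`|(?P<iplain>[A-Za-z0-9_]+))\]')

def parse_path(path):
    """Split a ref path into ('.', name) and ('[]', name) steps."""
    steps, pos = [], 0
    while pos < len(path):
        match = _STEP.match(path, pos)
        if match is None:
            raise ValueError('malformed ref path: %r' % path)
        if match.group(0).startswith('['):
            steps.append(('[]', match.group('iquoted') or match.group('iplain')))
        else:
            # Only the first name may omit its leading dot.
            if pos > 0 and match.group('dot') is None:
                raise ValueError('missing "." in ref path: %r' % path)
            steps.append(('.', match.group('quoted') or match.group('plain')))
        pos = match.end()
    return steps
```

An unescaped name such as aurora-team is rejected, while `aurora-team` in back-ticks parses as a single step.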
In order to find a reference value, each of the 3 primary Document types
must understand finding via . and []:
1. Leaves: Dereference is not supported -- raise NotFound
2. Iterables
a) .-dereference: Not supported
b) []-dereference: The dereference is coerced to integer (if not
coercible, raise NotFound type error) and indexed.
3. Maps
a) .-dereference: Equivalent to __getitem__
b) []-dereference: Equivalent to __getitem__
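These rules amount to a dispatch on the type of the value being dereferenced. A minimal sketch, assuming dicts stand in for Maps and tuples for Iterables; treating an out-of-range index as NotFound is also an assumption.

```python
from collections.abc import Mapping

class NotFound(Exception):
    pass

def dereference(value, op, key):
    """Apply one '.' or '[]' step to a leaf, iterable (tuple), or map."""
    if isinstance(value, Mapping):
        # Maps: both forms are equivalent to __getitem__.
        if key not in value:
            raise NotFound(key)
        return value[key]
    if isinstance(value, tuple):
        if op == '.':
            raise NotFound('"." is not supported on iterables')
        try:
            return value[int(key)]  # the key is coerced to an integer index
        except (ValueError, IndexError):
            raise NotFound(key)
    raise NotFound('leaves cannot be dereferenced')

doc = {'nicknames': ('b', 'bibby', 'wickyman'), 'profile': {'last': 'wickman'}}
nick = dereference(dereference(doc, '.', 'nicknames'), '[]', '1')
```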
*scoping rules*
Since Documents may be contained within Documents, there are scoping rules to
be aware of. Consider the following document
{
    'name': '{{profile.first}} {{profile.last}}',
    'profile': {
        'title': 'mr.',
        'first': '{{title}} brian',
        'last': 'wickman',
    }
}
Resolving {{profile.last}} is unambiguous: dereference .profile, which
results in the document {'title': 'mr.', 'first': '{{title}} brian', 'last': 'wickman'}.
Then dereference .last from said document resulting in 'wickman'.
Resolving {{profile.first}} is slightly more nuanced. You begin with
resolving {{profile}}, then next you must resolve {{.first}}. In order to
resolve {{.first}}, we must resolve '{{title}} brian'. '{{title}}' is
scoped to the 'profile' document. In this case, it's simple, as it resolves
directly to 'mr.'. However, should {{title}} not be found
within 'profile', all enclosing documents must be searched *in stack order*.
In other words, the top level document containing 'name' and 'profile' must be
used to attempt to resolve {{title}}. (If it were nested multiple levels,
this would continue until no more documents are reached, at which point
NotFound is raised.)
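Stack-order lookup can be sketched with a list of dicts, innermost last. The example document here is a hypothetical variant of the one above in which 'title' lives only at the top level, so that the lookup has to fall through to the enclosing document.

```python
class NotFound(Exception):
    pass

def lookup(name, scopes):
    """Find name in a stack of dicts, searching the innermost (last) scope first."""
    for scope in reversed(scopes):
        if name in scope:
            return scope[name]
    raise NotFound(name)

top = {
    'title': 'dr.',
    'profile': {'first': '{{title}} brian', 'last': 'wickman'},
}
# While resolving inside 'profile', the stack is [top, top['profile']].
stack = [top, top['profile']]
title = lookup('title', stack)
```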
*special names*
There are two special names in the dereferencing algorithm that alter
behavior of resolution.
1. self: Restricts resolution of the variable to within the current document
and will not delegate to parent documents. This won't ever change the
resolution result (as it's always done locally) but it will change error
handling. For example, the difference between {{name}} and {{self.name}}
is that parent documents should never be able to provide the value for
{{self.name}} should it not be provided within the scope of that document.
2. super: Restricts resolution to the parent document. For example, if you
want explicit inheritance:
task = Document(name = '{{super.name}}', attribute = 'value')
job = Document(name = 'the job', task = task)
job.task.name will be 'the job'
This can be used multiple times, e.g. {{super.super.cpu}}.
This does complicate things slightly as Documents must retain the context in
which they were evaluated:
job.task is a Document with the parent document of job
This is necessary in order for job.task.name to be properly evaluated when
{{name}} is being resolved from "job".
There are two approaches to implementing this functionality:
i) If a Ref returns a Document 'd', yield d(super=self), but make Document
aware that 'super' ought to be hidden from most introspection e.g.
items(), raw_items(), and __str__. This means that you must be very
careful about doing resolution in that you do not lose the contents of
self['super'].
ii) Maintain a hidden _super attribute set by the parent and treat it
specially.
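A rough sketch of approach (ii), with a hidden _super link set by the parent. Scope and find are hypothetical names, only a single name after the self/super prefix is handled, and having super stop delegation to further ancestors is one reading of "restricts resolution to the parent document".

```python
class NotFound(Exception):
    pass

class Scope(object):
    """Hypothetical document node that remembers its parent (approach ii)."""

    def __init__(self, values, parent=None):
        self._values = dict(values)
        self._super = parent  # hidden link, set by the enclosing document

    def find(self, names):
        names = list(names)
        scope, delegate = self, True
        if names and names[0] == 'self':
            names.pop(0)
            delegate = False  # never consult parents
        while names and names[0] == 'super':
            names.pop(0)
            delegate = False  # look only in the selected ancestor
            if scope._super is None:
                raise NotFound('super')
            scope = scope._super
        if not names:
            raise NotFound('empty reference')
        name = names[0]
        while scope is not None:
            if name in scope._values:
                return scope._values[name]
            scope = scope._super if delegate else None
        raise NotFound(name)

job = Scope({'name': 'the job', 'cpu': 4})
task = Scope({'name': '{{super.name}}', 'attribute': 'value'}, parent=job)
inherited = task.find(['super', 'name'])
```

Here task.find(['cpu']) succeeds by delegating to job, while task.find(['self', 'cpu']) raises NotFound.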
**Illustration 1**
d = Document({
    'name': '{{profile.first}} "{{nicknames[{{profile.nick_index}}]}}" {{profile.last}}',
    'yob': '{{profile.yob}}',
    'nicknames': ['b', 'bibby', 'wickyman'],
    'profile': {
        'title': 'mr.',
        'first': '{{title}} brian',
        'last': 'wickman',
        'yob': 1981,
        'occupation': 'engineer',
        'nick_index': '1',
    },
})
assert d['name'] == d.name == 'mr. brian "bibby" wickman'
assert d['yob'] == d.yob == 1981
assert d['nicknames'][0] == d.nicknames[0] == 'b'
assert dict(d.items()) == {
    'name': 'mr. brian "bibby" wickman',
    'yob': 1981,
    'nicknames': ('b', 'bibby', 'wickyman'),
    'profile': {
        'title': 'mr.',
        'first': 'mr. brian',
        'last': 'wickman',
        'yob': 1981,
        'occupation': 'engineer',
        'nick_index': '1',
    },
}
N.B. These should always return copies of the underlying structure, so that
"d['nicknames'][0] = 23" has no effect on the document: d['nicknames'] is
merely a copy of the original.
4. Traits
Once a document has been acquired, information must be extracted from that
document. Traits are effectively schemas that dictate how to extract and
optionally serialize content from documents.
In an ideal world, they would be completely separate from documents, but in
order to maintain backwards compatibility with Pystachio 0.x, they must be
slightly conflated with documents in that they must subclass documents.
In other words, ideally trait expression and extraction would appear like:
class Process(Trait):
    name = Required(String)
    cmdline = Required(String)
    daemon = Default(Boolean, False)
    ephemeral = Default(Boolean, False)
    max_failures = Default(Integer, 1)
d = Document.from_json('process.json')
process = d.extract_trait(Process)  # extracts trait and type checks
subprocess.call(process.cmdline.split())
Instead Trait's metaclass must inject Document as a parent class and
behave like so in order to maintain backwards compatibility:
process = Process.from_json('process.json')
process.check()
subprocess.call(process.cmdline.split())
However, they should not require _any_ shared methods with Documents, so
they can be tested in isolation.
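To show that extraction needs nothing from Document, here is a minimal sketch in which a schema is a plain dict and the document is any mapping. Required and Default are simplified stand-ins, extract is a hypothetical function, and Python builtin types are used in place of String, Boolean and Integer.

```python
class Required(object):
    def __init__(self, kind):
        self.kind = kind

class Default(object):
    def __init__(self, kind, value):
        self.kind, self.value = kind, value

def extract(schema, document):
    """Pull the fields named in schema out of a mapping and type check them."""
    result = {}
    for name, spec in schema.items():
        if name in document:
            value = document[name]
        elif isinstance(spec, Default):
            value = spec.value  # fall back to the declared default
        else:
            raise ValueError('missing required field %r' % name)
        if not isinstance(value, spec.kind):
            raise TypeError('%s: expected %s, got %r' % (name, spec.kind.__name__, value))
        result[name] = value
    return result

PROCESS = {
    'name': Required(str),
    'cmdline': Required(str),
    'daemon': Default(bool, False),
    'max_failures': Default(int, 1),
}
process = extract(PROCESS, {'name': 'hello', 'cmdline': 'echo hello world'})
```

Since extract only uses "in" and [] on the document, it can be tested with plain dicts.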
This separation of concerns between Documents and Traits should make it
simpler to extract Traits from other IDLs such as Thrift, for example using
a library like ptsd:
Process = ThriftTrait('thermos.thrift', 'Process')
process = Process.from_json('process.json')
process.check()
process.serialize() # serialize to thrift byte stream
where thermos.thrift may look like:
struct Process {
    1: required string name
    2: required string cmdline
    3: optional bool daemon = false
    4: optional bool ephemeral = false
    5: optional i16 max_failures = 1
}
In practice, we'll alias Struct = Trait to maintain backwards compatibility,
but implicitly you can think of Structs as Documents with an implied Trait.
5. Trait representation and extraction
XXX(finishme)
TBD. Represent base types:
Boolean .coerce
Integer .coerce
Float .coerce
String .coerce
Enum .coerce
Container types:
List .coerce
Map .coerce
Trait.coerce(document) ? Seems reasonable -- should also be compatible with
ThriftTrait too.
Requirements:
Required
Default
Then Trait has a class attribute:
_TYPE_MAP { name => ??? }
_TRAIT_MAP { name => Value }
6. Trait merging
It should be possible to merge certain traits together and/or monkeypatch
existing traits. Traits should support an .extend() method that accepts new
traits and merges them together to produce new ones.
For example,
class Job(Trait):
    name = Required(String)
    task = Required(Task)
class Announceable(Trait):
    announce = Announce
AnnounceableJob = Job.extend(Announceable)
Then the Pystachio Loader (which should remain essentially unchanged from
version 0.x to version 1.x) can do things like:
Job = Job.extend(AnnounceableJob)
so that it is possible to create organization-specific configuration on
top of jobs and tasks.
Unfortunately this only works elegantly for top-most declarations, so it
may be worth considering something along the lines of:
Job = Job.extend_attribute('task', Task.extend(HealthCheckable))
which could correspondingly be chained:
Job = Job.extend_attribute('task',
Task.extend_attribute('process', Process.extend(HealthCheckable)))
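One possible shape for extend and extend_attribute, sketched under the assumption that each trait keeps its fields in a class-level dict. The FIELDS attribute and the health_check_port field are hypothetical, and builtin types stand in for the real field types.

```python
class Trait(object):
    FIELDS = {}

    @classmethod
    def extend(cls, other):
        """Return a new trait whose fields are cls's fields updated with other's."""
        merged = dict(cls.FIELDS)
        merged.update(other.FIELDS)
        return type(cls.__name__, (cls,), {'FIELDS': merged})

    @classmethod
    def extend_attribute(cls, name, replacement):
        """Return a new trait with the field called name swapped for replacement."""
        if name not in cls.FIELDS:
            raise KeyError(name)
        merged = dict(cls.FIELDS)
        merged[name] = replacement
        return type(cls.__name__, (cls,), {'FIELDS': merged})

class Task(Trait):
    FIELDS = {'name': str}

class HealthCheckable(Trait):
    FIELDS = {'health_check_port': int}

class Job(Trait):
    FIELDS = {'name': str, 'task': Task}

PatchedJob = Job.extend_attribute('task', Task.extend(HealthCheckable))
```

Both methods build new classes, so the original Job and Task are left untouched.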
7. Lambdas
Just joking, I'm not proposing implementing Lambdas for Pystachio.
However, there are certain use-cases where it makes sense to provide a
smarter document. Consider a typical schema:
class Resources(Trait):
    cpu = Default(Float, 1.0)
    ram = Default(Integer, 1 * GB)
    disk = Default(Integer, 1 * GB)
class Process(Trait):
    name = Required(String)
    cmdline = Required(String)
class Task(Trait):
    name = Default(String, '{{processes[0].name}}')
    processes = Required(List(Process))
    resources = Default(Resources, Resources())
Now you may construct a task in the following manner:
task = Task(
    processes = [Process(name = 'hello_world', cmdline = 'echo hello world')],
    resources = Resources(ram = 8 * GB),
)
Consider the case where we want to run a JVM but must set things like -Xmx
properly. We may not want to set -Xmx blindly to -Xmx{{super.resources.ram}}
but instead perform arithmetic on the values provided.
Documents accept both dicts and Documents as Mappings, but explicitly only
coerce dicts to Documents. Therefore, it is perfectly acceptable to
subclass Document to do smarter things.
import math

class JavaProcess(Document):
    def __init__(self, jar_name, args, **kw):
        self._jar_name = jar_name
        self._args = args
        super(JavaProcess, self).__init__(**kw)
    def __produce_cmdline(self):
        # cpu comes from the enclosing task's resources
        cpu = self._resolve('{{super.resources.cpu}}')
        assert cpu >= 1, 'cpu cannot be less than 1'
        if cpu <= 8:
            gc_threads = cpu
        else:
            # beyond 8 cpus, add 5 GC threads per 8 additional cpus
            gc_threads = int(math.ceil(8 + (cpu - 8) * 5/8))
        return 'java -jar %s -XX:ParallelGCThreads=%s %s' % (
            self._jar_name, gc_threads, ' '.join(self._args))
    def __getitem__(self, name):
        # only resolve a unique cmdline:
        if name == 'cmdline':
            return self.__produce_cmdline()
        return super(JavaProcess, self).__getitem__(name)
This way it is possible to do:
task = Task(
    processes = [JavaProcess('foo.jar', ['-httpPort', '80'], name='foo')],
    resources = '{{profile.resources}}',
)(profile = {'resources': {'cpu': 16}})
and have the following hold true:
assert task.processes[0].cmdline == 'java -jar foo.jar -XX:ParallelGCThreads=13 -httpPort 80'
whereas before it was not possible to add logic to template evaluation.
The downside of course is that it is no longer possible to do d =
Document.from_json('task.json') and have any way to express that a process
should be evaluated as a JavaProcess. However, task.to_json() would work
correctly in the above situation.
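The thread-count arithmetic in JavaProcess can be checked on its own. The function below lifts that rule out of the class as a standalone sketch; the name gc_threads as a free function is hypothetical.

```python
import math

def gc_threads(cpu):
    """Mirror the ParallelGCThreads rule from JavaProcess."""
    assert cpu >= 1, 'cpu cannot be less than 1'
    if cpu <= 8:
        return cpu  # one GC thread per cpu up to 8
    # Beyond 8 cpus, add 5 threads per 8 additional cpus, rounded up.
    return int(math.ceil(8 + (cpu - 8) * 5 / 8))

threads_for_16 = gc_threads(16)
```

For cpu = 16 this is 8 + 8 * 5/8 = 13, which is the value that appears in the resolved cmdline.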
8. Dynamic Documents
In the same spirit as section 7, it is possible to consider dynamic
documents. There are specific use-cases where we need this at
Twitter:
1) Resolving package locations
2) Resolving jenkins artifacts
3) Resolving build artifacts
For example,
{{artifactory[`wickman-cache`][`org.apache.aurora.scheduler`][`0.5.0`]}}
This might dynamically resolve this artifact and replace it with an https
URL that can be curled.
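A sketch of how such a dynamic map could behave: each [] step accumulates one coordinate, and the lookup only happens once all three are present. ArtifactIndex, fake_fetch, and the placeholder host are all hypothetical; a real resolver would query the artifact store instead.

```python
class ArtifactIndex(object):
    """Hypothetical dynamic map: each [] step narrows the artifact coordinates."""

    def __init__(self, fetch_url, coordinates=()):
        self._fetch_url = fetch_url  # callable(repo, group, version) -> str
        self._coordinates = coordinates

    def __getitem__(self, key):
        coordinates = self._coordinates + (key,)
        if len(coordinates) == 3:
            return self._fetch_url(*coordinates)  # resolved lazily, on access
        return ArtifactIndex(self._fetch_url, coordinates)

def fake_fetch(repo, group, version):
    # Stand-in for a real lookup; the host is a placeholder.
    return 'https://artifacts.example.invalid/%s/%s/%s' % (repo, group, version)

artifactory = ArtifactIndex(fake_fetch)
url = artifactory['wickman-cache']['org.apache.aurora.scheduler']['0.5.0']
```

Binding artifactory into a document would then let the ref above resolve through the ordinary []-dereference path.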