At the lowest layer, the wire transport takes the form of Tokens. These all take the shape of header/type-byte/body, where the type byte has its high bit set to distinguish it from the header bytes that precede it. Some tokens have bodies, some do not, and the lengths of all bodies are determined by data in the header. The bodies are composed of arbitrary bytes. For most tokens, the header is a base-128 number, but for some it may be a UTF-7 encoded string. (?? maybe just strict ASCII. The high bit of all bytes in the header must be zero.)
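To make the header/type-byte framing concrete, here is a minimal sketch (not the actual banana.py code) that encodes an INT token. It assumes the base-128 header is sent low-order 7 bits first; the helper names are invented.

```python
INT = 0x81  # type byte for a positive short integer (high bit set)

def encode_header(value):
    """Encode a non-negative integer as base-128 bytes, low 7 bits
    first, every byte with the high bit clear."""
    out = [value & 0x7F]
    value >>= 7
    while value:
        out.append(value & 0x7F)
        value >>= 7
    return bytes(out)

def encode_int_token(value):
    # an INT token carries its value in the header and has no body
    return encode_header(value) + bytes([INT])
```

Note that every header byte stays below 0x80, so the type byte is unambiguous wherever it appears.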
Tokens are described below as [header-TOKEN-body], where either header or body may be empty. For example, [len-LIST-empty] indicates that the length is put into the header, LIST is the token being used, and the body is empty.
The possible Token types are:
0x80: LIST (old): [len-LIST-empty]
This token marks the beginning of a list with LEN elements. It acts as the open parenthesis, and the matching close parenthesis is implicit, based upon the length of the list. It will be followed by LEN things, which may be tokens like INTs or STRINGs, or which may be sublists. Banana keeps a list stack to handle nested sublists.
This token (and the notion of length-prefixed lists in general) is from oldbanana. In newbanana it is only used during the initial dialect negotiation (so that oldbanana peers can be detected). Newbanana requires that LIST(old) tokens be followed exclusively by strings and have a rather limited allowable length (say, 640 dialects long).
0x81: INT: [value-INT-empty]
This token defines a single positive short integer. The range is defined as whatever fits in a Python IntType, as opposed to a LongType.
0x82: STRING: [len-STRING-chars]
This token defines a string. To be precise, it defines a sequence of bytes. The length is a base-128-encoded integer. The type byte is followed by LEN bytes of data which make up the string. LEN is required to be shorter than 640k: this is intended to reduce the amount of memory that can be consumed on the receiving end before user code gets to decide whether to accept the data or not.
0x83: NEG: [value-NEG-empty]
This token defines a negative short integer.
0x84: FLOAT: [empty-FLOAT-value]
This token defines a floating-point number. There is no header, and the type byte is followed by 8 bytes which are a 64-bit IEEE double, as defined by struct.pack("!d", num).
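A hedged sketch of that wiring (the type-byte value comes from the list above; the helper name is invented):

```python
import struct

FLOAT = 0x84  # type byte; FLOAT has an empty header and an 8-byte body

def encode_float_token(num):
    # no header bytes, then the type byte, then the big-endian double
    return bytes([FLOAT]) + struct.pack("!d", num)
```

The resulting token is always exactly 9 bytes long, so no length information is needed.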
0x85: LONGINT: [value-LONGINT-empty]
0x86: LONGNEG: [value-LONGNEG-empty]
These define positive and negative long integers, corresponding to Python's long integer type.
0x87: VOCAB: [index-VOCAB-empty]
This defines a tokenized string. Banana keeps a mapping of common strings; each one is assigned a small integer. These strings can be sent compressed as a two-byte (index, VOCAB) sequence. They are delivered to Jelly as plain strings with no indication that they were compressed for transit.
The strings in this mapping are fixed and defined in banana.py, but the intention is for dialect negotiation to allow other strings to be added to the table.
0x88: OPEN: [[num]-OPEN-empty]
0x89: CLOSE: [[num]-CLOSE-empty]
These tokens are the newbanana parenthesis markers. They carry an optional number in their header: if present, the number counts the appearance of OPEN tokens in the stream, starting at 0 for the first OPEN used for a given connection and incrementing by 1 for each subsequent OPEN. The matching CLOSE token must contain an identical number. These numbers are solely for debugging and may be omitted. They may be removed from the protocol once development has been completed.
In contrast to oldbanana (with the LIST token), newbanana does not use length-prefixed lists. Instead it relies upon the Banana layer to track OPEN/CLOSE tokens.
The token which follows an OPEN marker must be a string: either a STRING token or a VOCAB token. This string indicates what kind of new sub-expression is being started.
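To make the parenthesis structure concrete, here is a hypothetical symbolic token stream for a two-element list, plus a small checker that verifies OPEN/CLOSE debug numbers pair up. None of this is the real wire encoding; token names stand in for bytes.

```python
# Hypothetical symbolic stream for the list ["hi", 3]: an OPEN marker
# (with its optional debug number), the open-type string, the children,
# and the matching CLOSE.
stream = [
    ("OPEN", 0),         # debug number 0: first OPEN on the connection
    ("STRING", "list"),  # the string that says what is being opened
    ("STRING", "hi"),
    ("INT", 3),
    ("CLOSE", 0),        # must carry the same number as its OPEN
]

def matched(stream):
    """Check that OPEN/CLOSE debug numbers nest and pair up properly."""
    opens = []
    for tok, val in stream:
        if tok == "OPEN":
            opens.append(val)
        elif tok == "CLOSE":
            if not opens or opens.pop() != val:
                return False
    return not opens
```

A receiver doing this kind of matching is what lets the debug numbers catch a Slicer that forgot to emit its CLOSE.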
0x8A: ABORT: [[num]-ABORT-empty]
This token indicates that something has gone wrong on the sender side, and that the resulting object must not be handed upwards in the unslicer stack. It may be impossible or inconvenient for the sender to stop sending the tokens associated with the unfortunate object, so the receiver must be prepared to silently drop all further tokens up to the matching CLOSE marker.
The number, if present, will be the same one used by the OPEN token.
TODO: add a token to re-set the list of abbreviated strings. This should basically contain a dictionary that replaces the previous VOCAB mapping.
When serializing an object, it is useful to view it as a directed graph. The root object is the one you start with, any objects it refers to are children of that root. Those children may point back to other objects that have already been serialized, or which will be serialized later.
Banana, like pickle and other serialization schemes, does a depth-first traversal of this graph. Serialization is begun on each node before going down into the child nodes. Banana tracks previously-handled nodes and replaces them with numbered reference tokens to break loops in the graph.
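The cycle-breaking idea can be sketched with a toy serializer (not the real Banana code): previously-seen nodes are looked up in an id-keyed table and replaced by a numbered reference.

```python
def serialize(obj, seen=None):
    """Toy depth-first serializer: lists become ("list", children),
    everything else a ("leaf", value); repeat visits to the same list
    become ("reference", n) where n is its assigned number."""
    seen = {} if seen is None else seen
    if id(obj) in seen:
        return ("reference", seen[id(obj)])
    if isinstance(obj, list):
        seen[id(obj)] = len(seen)   # number nodes in visit order
        return ("list", [serialize(child, seen) for child in obj])
    return ("leaf", obj)

loop = [1]
loop.append(loop)                   # a list that refers to itself
kind, body = serialize(loop)        # the cycle becomes ("reference", 0)
```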
A Banana Slicer is responsible for serializing a single user object: it slices that object into a series of Banana tokens. On the receiving end, there is a corresponding Banana Unslicer which accepts the incoming tokens and re-creates the user object. There are different kinds of Slicers and Unslicers for lists, tuples, dictionaries, and instances. Classes can provide their own Slicers if they want more control over the serialization process.
The serialization context is stored in a Banana object (names are still being decided). This holds a stack of Banana Slicers, one per object currently being serialized (i.e. one per node in the path from the root object to the one currently being serialized).
For example, suppose a class instance is being serialized, and this class chose to use a dictionary to hold its instance state. That dictionary holds a list of numbers in one of its values. The Banana Stack would hold the root slicer, an InstanceSlicer, a DictSlicer, and finally a ListSlicer.
(note: it might be possible to move the functionality of the Banana object entirely into the root slicer).
Upon unserialization, the Unbanana object holds this context. A stack of Unslicer objects handles incoming tokens. The Unbanana is responsible for tracking OPEN and CLOSE tokens, making sure a failure in an Unslicer doesn't cause a loss of synchronization. Unslicer methods may raise exceptions: these are caught by the Unbanana and cause the object currently being unserialized to fail: its parent gets an UnbananaFailure instead of the dict or list or instance that it would normally have received.
The stack is used to determine three things:
The default case puts a Parent slice at the bottom of the stack. This can also be interpreted as a root object, if you imagine that any given user object being serialized is somehow a child of the overall serialization context. In PB, for example, the root object would be related to the connection.
In addition, the stack can be queried to find out what path leads from the root object to the one currently being serialized. If something goes wrong in the serialization process (an exception is thrown), this path can make it much easier to find out when the trouble happened, as opposed to merely where. Knowing which method of your FooObject failed during serialization isn't very useful when you have 500 of them inside your data structure and you need to know whether it was bar.thisfoo or bar.thatfoo which caused the problem. To this end, each Slicer has a .describe method which is supposed to return a short string that explains how to get to the child node currently being processed. When an error occurs, these strings are concatenated together and put into the failure object.
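A minimal sketch of how those .describe strings might be joined into a path (the class and helper names here are invented for illustration):

```python
class FakeSlicer:
    """Stand-in for a Slicer that knows its position in the graph."""
    def __init__(self, desc):
        self.desc = desc

    def describe(self):
        return self.desc

def describe_path(stack):
    # concatenate each Slicer's contribution, root first
    return "".join(s.describe() for s in stack)

# a stack for the bar.thisfoo example above
stack = [FakeSlicer("<root>"), FakeSlicer(".bar"), FakeSlicer(".thisfoo")]
```

Joining the fragments yields a path like "<root>.bar.thisfoo", which pinpoints the failing node.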
The Parent slice is meant to provide the default behavior for the stack. The default class currently does the following:
TODO: The idea is to let other serialization contexts do other things. The tokens should probably go to the parent slice for handling: turning into bytes and sending over a wire, saving to a file, etc. Having the whole stack participate in the Tasting process means that objects can put restrictions on what is sent on their behalf: objects could refuse to let certain classes be sent as part of their instance state.
Serialization starts with the Parent Slicer being asked to serialize the given object. The Parent gives the object to Banana. Banana starts by walking the stack (which, of course, has only the Parent on it at that point), calling the .taste method for each Slicer there. If any of them have a problem with the object being serialized, they express it by raising an exception (TODO: which one? InsecureBanana?).
If the Taster stack passes the object, Banana's next job is to find a new Slicer to handle the object. It does this by walking the stack, calling .newSlicer on each slice. The first one that returns an object ends the search. In most cases, this is the Parent slice, which just looks up the type() of the object in the SlicerRegistry. A type which does not have a Slicer registered for it will cause an exception to be raised here.
The new Slicer is pushed onto the stack. It is then sent three methods in succession: .start, .slice, and .finish. start defaults to registering the object with setRefID and sending an appropriate OPEN token. slice is defined on a per-Slicer basis to send all the necessary tokens. finish sends the CLOSE token.
Banana keeps strict track of the nesting level. For safety, each OPEN gets a sequence number so it can be matched with its CLOSE token. If a Slicer's .finish method fails to send the CLOSE token, very bad things will happen (in general, all further objects will become children of the one that didn't CLOSE properly). The sequence numbers are an attempt to minimize the damage.
The Unbanana object has a corresponding stack of Banana Unslicer objects. Each one receives the tokens emitted by the matching Slicer on the sending side. The whole stack is used to determine new Unslicer objects, perform Tasting of incoming tokens, and manage object references.
OPEN tokens have a string to indicate what kind of object is being started. This is looked up in the UnbananaRegistry just like object types are looked up in the BananaRegistry. The new Unslicer is pushed onto the stack.
ABORT tokens indicate that something went wrong on the sending side and that the current object is to be aborted. They cause the receiver to ignore all tokens until the CLOSE token which closes the current node. This is implemented by replacing the top-most Unslicer with a DiscardUnslicer.
CLOSE tokens finish the current node. The Unslicer will pass its completed object up to the childFinished method of its parent.
Types and classes are roughly classified into containers and non-containers. The containers are further divided into mutable and immutable. Some examples of immutable containers are tuples and bound methods. Lists and dicts are mutable containers. Ints and strings are non-containers.
During unserialization, objects are in one of three states: uncreated, referenceable (but not complete), and complete. Only mutable containers can be referenceable but not complete: immutable containers have no intermediate referenceable state.
Mutable containers (like lists) are referenceable but not complete during traversal of their child nodes. This means those children can reference the list without trouble.
Immutable containers (like tuples) present challenges when unserializing. The object cannot be created until all its components are referenceable. While it is guaranteed that these component objects will be complete before the graph traversal exits the current node, the child nodes are allowed to reference the current node during that traversal. To handle this, the TupleUnslicer installs a Deferred into the object table when it begins unserializing (in the .start method). When the tuple is finally complete, the object table is updated and the Deferred is fired with the new tuple.
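The Deferred-placeholder trick can be sketched with a stand-in class (not the Twisted Deferred API): a placeholder goes into the reference table when the tuple is opened, and its callbacks fire once the real tuple exists.

```python
class Placeholder:
    """Minimal Deferred-like object: callbacks run when fire() is
    called, or immediately if the value is already available."""
    def __init__(self):
        self.callbacks = []
        self.value = None
        self.fired = False

    def addCallback(self, cb):
        if self.fired:
            cb(self.value)
        else:
            self.callbacks.append(cb)

    def fire(self, value):
        self.fired, self.value = True, value
        for cb in self.callbacks:
            cb(value)

refs = {}                              # per-connection reference table
refs[0] = Placeholder()                # tuple opened: placeholder installed
resolved = []
refs[0].addCallback(resolved.append)   # a child referencing the tuple
refs[0].fire((1, 2, 3))                # children complete: real tuple built
refs[0] = (1, 2, 3)                    # table updated with the real object
```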
Containers (both mutable and immutable) are required to pay attention to the types of their incoming children, since a child may arrive as a Deferred (standing in for an immutable container that is not yet complete) rather than a finished object. These containers are not complete (in the sense described above) until those Deferreds have been replaced with referenceable objects. When the container receives a Deferred, it should attach a callback to it which will perform the replacement. In addition, immutable containers should check after each update to see if all the Deferreds have been cleared, and if so, complete the object (and fire their own Deferreds so any containers they are a child of may be updated and/or completed).
Having the whole Slicer stack participate in Tasting on the sending side seems to make a lot of sense. It might be better to have a way to push new Taster objects onto a separate stack. This would certainly help with performance, as the common case (where most Slicers ignore .taste) does a pointless method call to every Slicer for every object sent. The trick is to make sure that exception cases don't leave a taster stranded on the stack when the object that put it there has gone away.
On the receiving side, each object has a corresponding .taste method, which receives tokens instead of complete objects. This makes sense, because you want to catch the dangerous data before it gets turned into an object, but tokens are a pretty low-level place to do security checks. It might be more useful to have some kind of instance taster stack, with tasters that are asked specifically about (class, state) pairs and whether they should be turned into objects or not.
Because the Unslicers receive their data one token at a time, things like InstanceUnslicer can perform security checks one attribute at a time. traits-style attribute constraints (see the Chaco project or the PyCon-2003 presentation for details) can be implemented by having a per-class dictionary of tests that attribute values must pass before they will be accepted. The instance will only be created if all attributes fit the constraints. The idea is to catch violations before any code is run on the receiving side. Typical checks would be things like ".foo must be a number", ".bar must not be an instance", ".baz must implement the IBazzer interface".
Using the stack instead of a single Taster object means that the rules can be changed depending upon the context of the object being processed. A class that is valid as the first argument to a method call may not be valid as the second argument, or inside a list provided as the first argument. The PBMethodArgumentsUnslicer could change the way its .taste method behaves as its state machine progresses through the argument list.
There are several different ways to implement this Taster stack:
One of them is to allow "I'm sure this is safe" classes to override higher-level paranoia.
Of course, all this holds true for the sending side as well. A Slicer could enforce a policy that no objects of type Foo will be sent while it is on the stack.
It is anticipated that something like the current Jellyable/Unjellyable classes will be created to offer control over the Slicers/Unslicers used to handle instances of that class.
One eventual goal is to allow PB to implement E-like argument constraints.
The big change from the old Jelly scheme is that now serialization/unserialization is done in a more streaming format. Individual tokens are the basic unit of information. The basic tokens are just numbers and strings: anything more complicated (starting at lists) involves composites of other tokens.
The serialization side will be reworked to be a bit more producer-oriented. Objects should be able to defer their serialization temporarily (TODO: really??) like twisted.web resources can do NOT_DONE_YET right now. The big goal here is that large objects which can't fit into the socket buffers should not consume lots of memory, sitting around in a serialized state with nowhere to go. This must be balanced against the confusion caused by time-distributed serialization. PB method calls must retain their current in-order execution, and it must not be possible to interleave serialized state (big mess).
Another goal of the Jelly+Banana->JustBanana change is the hope of writing Slicers and Unslicers in C. The CBanana module should have C objects (structs with function pointers) that can be looked up in a registry table and run to turn python objects into tokens and vice versa. This ought to be faster than running python code to implement the slices, at the cost of less flexibility. It would be nice if the resulting tokens could be sent directly to the socket at the C level without surfacing into python; barring this it is probably a good idea to accumulate the tokens into a large buffer so the code can do a few large writes instead of a gazillion small ones.
It ought to be possible to mix C and Python slices here: if the C code doesn't find the slice in the table, it can fall back to calling a python method that does a lookup in an extensible registry.
Random notes and wild speculations: take everything beyond here with two grains of salt
The oldbanana usage model has the layer above banana written in one of two ways. The simple form is to use the banana.encode and banana.decode functions to turn an object into a bytestream. This is used by twisted.spread.publish. The more flexible model is to subclass Banana. The largest example of this technique is, of course, twisted.spread.pb.Broker, but others which use it are twisted.trial.remote and twisted.scripts.conch (which appears to use it over unix-domain sockets).
Banana itself is a Protocol. The Banana subclass would generally override the expressionReceived method, which receives s-expressions (lists of lists). These are processed to figure out what method should be called, etc. (processing which only has to deal with strings, numbers, and lists). Then the serialized arguments are sent through Unjelly to produce actual objects.
On output, the subclass usually calls self.sendEncoded with some set of objects. In the case of PB, the arguments to the remote method are turned into s-expressions with jelly, then combined with the method meta-data (object ID, method name, etc), then the whole request is sent to sendEncoded.
Newbanana moves the Jelly functionality into a stack of Banana Slicers, and the lowest-level token-to-bytestream conversion into the new Banana object. Instead of overriding expressionReceived, users could push a different root Unslicer to get more control over the receive process.
Currently, Slicers call Banana.sendOpen/sendToken/sendClose/sendAbort, which then create bytes and do transport.write. To move this into C, the transport should get to call CUnbanana.receiveToken. There should be CBananaUnslicers. Probably a parent.addMe(self) instead of banana.stack.append(self), maybe addMeC for the C unslicer.
The Banana object is a Protocol, and has a dataReceived method. (maybe in
some C form, data could move directly from a CTransport to a CProtocol). It
parses tokens and hands them to its Unslicer stack. The root Unslicer is
probably created at connectionEstablished time. Subclasses of Banana could
use different RootUnslicer objects, or the users might be responsible for
setting up the root unslicer.
The Banana object is also created with a RootSlicer. Banana.writeToken serializes the token and does transport.write. (A C form could have CSlicer objects which hand tokens to a little CBanana which then hands bytes off to a CTransport.)
Doing the bytestream-to-Token conversion in C loses a lot of utility when
the conversion is done token at a time. It made more sense when a whole mess
of s-lists were converted at once.
All Slicers currently have a Banana pointer. Maybe they should have a transport pointer instead? The Banana pointer is needed to get to the top of the stack.
We want to be able to unserialize lists/tuples/dicts/strings/ints (basic types) without surfacing into Python, and to deliver the completed object to a Python function.