The Banana Protocol

NOTE! This is all preliminary and is more an exercise in semiconscious protocol design than anything else. Do not believe this document. This sentence is lying. So there.

Banana tokens

At the lowest layer, the wire transport takes the form of Tokens. These all take the shape of header/type-byte/body, where the type byte has its high bit set, to distinguish it from the header bytes that precede it. Some tokens have bodies, some do not, and the length of every body is determined by data in the header. The bodies are composed of arbitrary bytes. For most tokens, the header is a base-128 number, but for some it may be a UTF-7 encoded string. (?? Maybe just strict ASCII: the high bit of all bytes in the header must be zero.)

Tokens are described below as [header-TOKEN-body], where either header or body may be empty. For example, [len-LIST-empty] indicates that the length is put into the header, LIST is the token being used, and the body is empty.
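As a rough illustration of the header/type-byte/body layout, the sketch below encodes a numeric header in base 128 and appends a type byte with its high bit set. The digit order (least-significant first), the type byte values 0x80 and 0x82, and the function names are all assumptions made for this example; nothing above fixes them.

```python
LIST = 0x80     # hypothetical type byte for a LIST token
STRING = 0x82   # hypothetical type byte for a STRING token


def encode_header(value):
    """Encode a non-negative integer as base-128 digits, high bit clear.

    Least-significant digit first is an assumption of this sketch.
    """
    if value < 0:
        raise ValueError("header must be non-negative")
    digits = bytearray()
    while True:
        digits.append(value & 0x7F)
        value >>= 7
        if not value:
            return bytes(digits)


def encode_token(header, type_byte, body=b""):
    """Build header bytes + type byte + body bytes."""
    if not type_byte & 0x80:
        raise ValueError("type byte must have its high bit set")
    return encode_header(header) + bytes([type_byte]) + body


def split_token(data):
    """Return (header value, type byte, remaining bytes)."""
    value = 0
    for index, byte in enumerate(data):
        if byte & 0x80:              # first high-bit byte ends the header
            return value, byte, data[index + 1:]
        value |= byte << (7 * index)
    raise ValueError("no type byte found")
```

Under these assumptions, [len-LIST-empty] for a three-element list would be encode_token(3, LIST), and a five-byte string would be encode_token(5, STRING, b"hello").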

The possible Token types are:

TODO: add a token to re-set the list of abbreviated strings. This should basically contain a dictionary that replaces the previous VOCAB mapping.

Object graph serialization

When serializing an object, it is useful to view it as a directed graph. The root object is the one you start with; any objects it refers to are children of that root. Those children may point back to other objects that have already been serialized, or which will be serialized later.

Banana, like pickle and other serialization schemes, does a depth-first traversal of this graph. Serialization is begun on each node before going down into the child nodes. Banana tracks previously-handled nodes and replaces them with numbered reference tokens to break loops in the graph.
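The traversal described above might be sketched as follows. The token names ("open", "leaf", "reference", "close") and the flatten function are invented for this example, and only lists and dicts are treated as containers; the point is merely that a node is registered before its children are visited, so a loop turns into a numbered reference.

```python
def flatten(obj, seen=None, out=None):
    """Depth-first walk that replaces already-seen containers with references."""
    if seen is None:
        seen, out = {}, []
    if isinstance(obj, (list, dict)):
        if id(obj) in seen:
            # Already started (or finished) earlier: emit a reference instead.
            out.append(("reference", seen[id(obj)]))
            return out
        refid = len(seen)
        seen[id(obj)] = refid          # register before descending
        out.append(("open", type(obj).__name__, refid))
        if isinstance(obj, list):
            children = obj
        else:
            children = [x for pair in obj.items() for x in pair]
        for child in children:
            flatten(child, seen, out)
        out.append(("close", refid))
    else:
        out.append(("leaf", obj))
    return out


loop = [1]
loop.append(loop)                      # a list that contains itself
print(flatten(loop))
# [('open', 'list', 0), ('leaf', 1), ('reference', 0), ('close', 0)]
```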

Banana Slices

A Banana Slicer is responsible for serializing a single user object: it slices that object into a series of Banana tokens. On the receiving end, there is a corresponding Banana Unslicer which accepts the incoming tokens and re-creates the user object. There are different kinds of Slicers and Unslicers for lists, tuples, dictionaries, and instances. Classes can provide their own Slicers if they want more control over the serialization process.

The Banana Stack

The serialization context is stored in a Banana object (names are still being decided). This holds a stack of Banana Slicers, one per object currently being serialized (i.e. one per node in the path from the root object to the one currently being serialized).

For example, suppose a class instance is being serialized, and this class chose to use a dictionary to hold its instance state. That dictionary holds a list of numbers in one of its values. The Banana Stack would hold the root slicer, an InstanceSlicer, a DictSlicer, and finally a ListSlicer.

(note: it might be possible to move the functionality of the Banana object entirely into the root slice).

Upon unserialization, the Unbanana object holds this context. A stack of Unslicer objects handles incoming tokens. The Unbanana is responsible for tracking OPEN and CLOSE tokens, making sure a failure in an Unslicer doesn't cause a loss of synchronization. Unslicer methods may raise exceptions: these are caught by the Unbanana and cause the object currently being unserialized to fail: its parent gets an UnbananaFailure instead of the dict or list or instance that it would normally have received.

The stack is used to determine three things:

The default case puts a Parent slice at the bottom of the stack. This can also be interpreted as a root object, if you imagine that any given user object being serialized is somehow a child of the overall serialization context. In PB, for example, the root object would be related to the connection.

In addition, the stack can be queried to find out what path leads from the root object to the one currently being serialized. If something goes wrong in the serialization process (an exception is thrown), this path can make it much easier to find out when the trouble happened, as opposed to merely where. Knowing which method of your FooObject failed during serialization isn't very useful when you have 500 of them inside your data structure and you need to know whether it was bar.thisfoo or bar.thatfoo which caused the problem. To this end, each Slicer has a .describe method which is supposed to return a short string that explains how to get to the child node currently being processed. When an error occurs, these strings are concatenated together and put into the failure object.
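A minimal sketch of how the .describe strings could be joined into a path. The Slicer stand-in, the SerializationFailure class, and the particular step strings are hypothetical; only the idea of concatenating one short string per stack entry comes from the text.

```python
class SerializationFailure(Exception):
    """Hypothetical failure object that carries the path to the bad node."""

    def __init__(self, path, reason):
        super().__init__("%s: %s" % (path, reason))
        self.path = path


class Slicer:
    """Minimal stand-in for a Slicer; `step` names the child in progress."""

    def __init__(self, step):
        self.step = step

    def describe(self):
        return self.step


def describe_path(stack):
    """Concatenate each Slicer's description, root first."""
    return "".join(slicer.describe() for slicer in stack)


stack = [Slicer("<root>"), Slicer(".bar"), Slicer(".thisfoo"), Slicer("[3]")]
failure = SerializationFailure(describe_path(stack), "cannot serialize socket")
print(failure)   # <root>.bar.thisfoo[3]: cannot serialize socket
```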

The Parent slice is meant to provide the default behavior for the stack. The default class currently does the following:

TODO: The idea is to let other serialization contexts do other things. The tokens should probably go to the parent slice for handling: turning into bytes and sending over a wire, saving to a file, etc. Having the whole stack participate in the Tasting process means that objects can put restrictions on what is sent on their behalf: objects could refuse to let certain classes be sent as part of their instance state.

Bananaing

Serialization starts with the Parent Slicer being asked to serialize the given object. The Parent gives the object to Banana. Banana starts by walking the stack (which, of course, has only the Parent on it at that point), calling the .taste method for each Slicer there. If any of them have a problem with the object being serialized, they express it by raising an exception (TODO: which one? InsecureBanana?).

If the Taster stack passes the object, Banana's next job is to find a new Slicer to handle the object. It does this by walking the stack, calling .newSlicer on each slice. The first one that returns an object ends the search. In most cases, this is the Parent slice, which just looks up the type() of the object in the SlicerRegistry. A type which does not have a Slicer registered for it will cause an exception to be raised here.

The new Slicer is pushed onto the stack. Three methods are then called on it in succession: .start, .slice, and .finish. start defaults to registering the object with setRefID and sending an appropriate OPEN token. slice is defined on a per-Slicer basis to send all the necessary tokens. finish sends the CLOSE token.
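The taste / newSlicer / start-slice-finish sequence can be sketched roughly as below. Several things here are assumptions rather than part of the design: the Violation exception name, walking the stack top-down when looking for a new Slicer, sending ints and strings directly as tokens, and the send/serialize method names. Replacing repeated objects with reference tokens is left out, so this sketch would recurse forever on a cyclic structure.

```python
class Violation(Exception):
    """Hypothetical exception used to veto or reject an object."""


class ListSlicer:
    def __init__(self, banana):
        self.banana = banana

    def taste(self, obj):
        pass                           # no objection to anything

    def newSlicer(self, obj):
        return None                    # defer to Slicers lower on the stack

    def start(self, obj):
        self.banana.setRefID(obj)
        self.banana.send(("OPEN", "list"))

    def slice(self, obj):
        for item in obj:
            self.banana.serialize(item)

    def finish(self, obj):
        self.banana.send(("CLOSE",))


class ParentSlicer:
    def __init__(self, banana, registry):
        self.banana, self.registry = banana, registry

    def taste(self, obj):
        pass

    def newSlicer(self, obj):
        factory = self.registry.get(type(obj))
        if factory is None:
            raise Violation("no Slicer registered for %r" % type(obj))
        return factory(self.banana)


class Banana:
    def __init__(self, registry):
        self.tokens, self.refs = [], {}
        self.stack = [ParentSlicer(self, registry)]

    def send(self, token):
        self.tokens.append(token)

    def setRefID(self, obj):
        self.refs.setdefault(id(obj), len(self.refs))

    def serialize(self, obj):
        for slicer in self.stack:              # every Slicer may veto
            slicer.taste(obj)
        if isinstance(obj, (int, str)):        # basic tokens need no Slicer
            self.send(("TOKEN", obj))
            return
        new = None
        for slicer in reversed(self.stack):    # first answer wins
            new = slicer.newSlicer(obj)
            if new is not None:
                break
        self.stack.append(new)
        try:
            new.start(obj)
            new.slice(obj)
            new.finish(obj)
        finally:
            self.stack.pop()
```

With a registry of {list: ListSlicer}, serializing [1, [2]] leaves OPEN, 1, OPEN, 2, CLOSE, CLOSE in banana.tokens, and serializing a dict raises Violation from the Parent slice.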

Banana keeps strict track of the nesting level. For safety, each OPEN gets a sequence number so it can be matched with its CLOSE token. If a Slicer's .finish method fails to send the CLOSE token, very bad things will happen (in general, all further objects will become children of the one that didn't CLOSE properly). The sequence numbers are an attempt to minimize the damage.
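One way the sequence numbers could limit the damage is sketched below. The recovery policy shown (when a CLOSE arrives for an outer node, every node opened after it is treated as having lost its CLOSE) is an assumption of this example, as are the function name and the event labels.

```python
def match_closes(tokens):
    """Pair OPEN and CLOSE tokens by sequence number.

    `tokens` is an iterable of (kind, seq) pairs. Returns a list of
    ("closed", seq) and ("lost-close", seq) events.
    """
    open_stack = []
    events = []
    for kind, seq in tokens:
        if kind == "OPEN":
            open_stack.append(seq)
        elif kind == "CLOSE":
            if seq not in open_stack:
                raise ValueError("CLOSE %d has no matching OPEN" % seq)
            # Anything opened after `seq` never sent its own CLOSE.
            while open_stack[-1] != seq:
                events.append(("lost-close", open_stack.pop()))
            open_stack.pop()
            events.append(("closed", seq))
    return events


print(match_closes([("OPEN", 0), ("OPEN", 1), ("CLOSE", 0)]))
# [('lost-close', 1), ('closed', 0)]
```

Without the sequence number, the single CLOSE above would be taken as closing node 1, and node 0 would silently stay open.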

Unbananaing

The Unbanana object has a corresponding stack of Banana Unslicer objects. Each one receives the tokens emitted by the matching Slicer on the sending side. The whole stack is used to determine new Unslicer objects, perform Tasting of incoming tokens, and manage object references.

OPEN tokens have a string to indicate what kind of object is being started. This is looked up in the UnbananaRegistry just like object types are looked up in the BananaRegistry. The new Unslicer is pushed onto the stack.

ABORT tokens indicate that something went wrong on the sending side and that the current object is to be aborted. It causes the receiver to ignore all tokens until the CLOSE token which closes the current node. This is implemented by replacing the top-most slice with a DiscardUnslicer.

CLOSE tokens finish the current node. The slice will pass its completed object up to the childFinished method of its parent.
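The OPEN/ABORT/CLOSE handling above might look roughly like this. The receiveToken and result method names, the string keys in the registry, and the DISCARDED marker handed to the parent of an aborted node are all hypothetical; so is the choice to push another DiscardUnslicer for any OPEN that arrives while discarding.

```python
DISCARDED = object()    # hypothetical marker given to the parent on abort


class ListUnslicer:
    def __init__(self):
        self.items = []

    def receiveToken(self, token):
        self.items.append(token)

    def childFinished(self, obj):
        self.items.append(obj)

    def result(self):
        return self.items


class DiscardUnslicer:
    """Swallows everything until the CLOSE for the aborted node."""

    def receiveToken(self, token):
        pass

    def childFinished(self, obj):
        pass

    def result(self):
        return DISCARDED


class RootUnslicer(ListUnslicer):
    """Collects the finished top-level objects."""


class Unbanana:
    def __init__(self, registry):
        self.registry = registry
        self.root = RootUnslicer()
        self.stack = [self.root]

    def handle(self, kind, value=None):
        top = self.stack[-1]
        if kind == "OPEN":
            if isinstance(top, DiscardUnslicer):
                self.stack.append(DiscardUnslicer())   # discard children too
            else:
                self.stack.append(self.registry[value]())
        elif kind == "ABORT":
            if len(self.stack) == 1:
                raise ValueError("ABORT outside of any open node")
            self.stack[-1] = DiscardUnslicer()
        elif kind == "CLOSE":
            if len(self.stack) == 1:
                raise ValueError("CLOSE without a matching OPEN")
            finished = self.stack.pop()
            self.stack[-1].childFinished(finished.result())
        else:
            top.receiveToken(value)
```

Feeding OPEN list, 1, OPEN list, 2, ABORT, 3, CLOSE, 4, CLOSE leaves [1, DISCARDED, 4] as the single object in unbanana.root.items.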

Other Issues

Deferred Object Recreation: The Trouble With Tuples

Types and classes are roughly classified into containers and non-containers. The containers are further divided into mutable and immutable. Some examples of immutable containers are tuples and bound methods. Lists and dicts are mutable containers. Ints and strings are non-containers.

During unserialization, objects are in one of three states: uncreated, referenceable (but not complete), and complete. Only mutable containers can be referenceable but not complete: immutable containers have no intermediate referenceable state.

Mutable containers (like lists) are referenceable but not complete during traversal of their child nodes. This means those children can reference the list without trouble.

Immutable containers (like tuples) present challenges when unserializing. The object cannot be created until all its components are referenceable. While it is guaranteed that these component objects will be complete before the graph traversal exits the current node, the child nodes are allowed to reference the current node during that traversal. To handle this, the TupleUnslicer installs a Deferred into the object table when it begins unserializing (in the .start method). When the tuple is finally complete, the object table is updated and the Deferred is fired with the new tuple.

Containers (both mutable and immutable) are required to pay attention to the types of their incoming children, and in particular to notice when a child arrives as a Deferred instead of a normal object. These containers are not complete (in the sense described above) until those Deferreds have been replaced with referenceable objects. When the container receives the Deferred, it should attach a callback to it which will perform the replacement. In addition, immutable containers should check after each update to see if all the Deferreds have been cleared, and if so, complete the object (and fire their own Deferreds so any containers they are a child of may be updated and/or completed).
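The sketch below walks through the tuple case for a tuple that holds a list which in turn refers back to the tuple. To stay self-contained it uses a tiny callback-only stand-in rather than the real twisted.internet.defer.Deferred; the receiveChild and close method names and the dict used as the object table are likewise assumptions.

```python
class Deferred:
    """Tiny stand-in for a Twisted Deferred: callbacks only, fired once."""

    def __init__(self):
        self.callbacks = []

    def addCallback(self, fn):
        self.callbacks.append(fn)

    def callback(self, result):
        for fn in self.callbacks:
            fn(result)


class ListUnslicer:
    def __init__(self):
        self.list = []                 # referenceable as soon as it exists

    def receiveChild(self, child):
        if isinstance(child, Deferred):
            index = len(self.list)
            self.list.append(None)     # placeholder until the child exists
            child.addCallback(lambda obj: self.list.__setitem__(index, obj))
        else:
            self.list.append(child)


class TupleUnslicer:
    def __init__(self, table, refid):
        self.table, self.refid = table, refid
        self.children, self.pending, self.closed = [], 0, False
        self.deferred = Deferred()
        table[refid] = self.deferred   # the text does this in .start

    def receiveChild(self, child):
        index = len(self.children)
        self.children.append(child)
        if isinstance(child, Deferred):
            self.pending += 1
            child.addCallback(lambda obj: self._update(index, obj))

    def _update(self, index, obj):
        self.children[index] = obj
        self.pending -= 1
        self._maybe_complete()

    def close(self):
        self.closed = True
        self._maybe_complete()

    def _maybe_complete(self):
        if self.closed and self.pending == 0:
            result = tuple(self.children)
            self.table[self.refid] = result    # replace the Deferred
            self.deferred.callback(result)     # update waiting parents


table = {}
tup = TupleUnslicer(table, 0)
lst = ListUnslicer()
table[1] = lst.list
lst.receiveChild(table[0])     # reference to the not-yet-created tuple
tup.receiveChild(lst.list)     # the list itself is already referenceable
tup.close()
```

After tup.close(), table[0] is the finished tuple and the placeholder inside the list has been replaced by that same tuple, so the cycle is restored.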

Security Model

Having the whole Slicer stack participate in Tasting on the sending side seems to make a lot of sense. It might be better to have a way to push new Taster objects onto a separate stack. This would certainly help with performance, as the common case (where most Slicers ignore .taste) does a pointless method call to every Slice for every object sent. The trick is to make sure that exception cases don't leave a taster stranded on the stack when the object that put it there has gone away.

On the receiving side, each object has a corresponding .taste method, which receives tokens instead of complete objects. This makes sense, because you want to catch the dangerous data before it gets turned into an object, but tokens are a pretty low-level place to do security checks. It might be more useful to have some kind of instance taster stack, with tasters that are asked specifically about (class,state) pairs and whether they should be turned into objects or not.

Because the Unslicers receive their data one token at a time, things like InstanceUnslicer can perform security checks one attribute at a time. traits-style attribute constraints (see the Chaco project or the PyCon-2003 presentation for details) can be implemented by having a per-class dictionary of tests that attribute values must pass before they will be accepted. The instance will only be created if all attributes fit the constraints. The idea is to catch violations before any code is run on the receiving side. Typical checks would be things like .foo must be a number, .bar must not be an instance, .baz must implement the IBazzer interface.
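A per-class dictionary of attribute tests could be sketched like this. The Violation exception, the receiveAttribute and close method names, and the example Point class with its constraints are hypothetical; the instance is built with __new__ so that no class code runs before every attribute has passed its test.

```python
class Violation(Exception):
    """Hypothetical exception for a rejected attribute."""


def is_number(value):
    return isinstance(value, (int, float)) and not isinstance(value, bool)


def is_not_instance(value):
    # Only plain built-in values are accepted by this illustrative test.
    return isinstance(value, (int, float, str, bytes, type(None)))


class Point:
    pass


CONSTRAINTS = {Point: {"x": is_number, "y": is_number, "label": is_not_instance}}


class InstanceUnslicer:
    def __init__(self, cls, constraints=CONSTRAINTS):
        self.cls = cls
        self.tests = constraints[cls]
        self.state = {}

    def receiveAttribute(self, name, value):
        test = self.tests.get(name)
        if test is None:
            raise Violation("unexpected attribute .%s" % name)
        if not test(value):
            raise Violation("bad value for .%s: %r" % (name, value))
        self.state[name] = value

    def close(self):
        # Reached only if every attribute passed; __init__ is not called.
        obj = self.cls.__new__(self.cls)
        obj.__dict__.update(self.state)
        return obj
```

An attribute that fails its test raises Violation as soon as it arrives, so close() is never reached and no Point is created.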

Using the stack instead of a single Taster object means that the rules can be changed depending upon the context of the object being processed. A class that is valid as the first argument to a method call may not be valid as the second argument, or inside a list provided as the first argument. The PBMethodArgumentsUnslicer could change the way its .taste method behaves as its state machine progresses through the argument list.
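As a sketch of context-dependent rules, the unslicer below changes what its .taste method accepts as it advances through an argument list. For brevity it tastes finished values rather than tokens, and the class name, the Violation exception, and the per-position tuples of allowed types are all assumptions.

```python
class Violation(Exception):
    """Hypothetical exception for a rejected argument."""


class ArgumentsUnslicer:
    """Accepts a different set of types for each argument position."""

    def __init__(self, allowed_per_argument):
        self.allowed = allowed_per_argument   # one tuple of types per position
        self.index = 0
        self.args = []

    def taste(self, value):
        if self.index >= len(self.allowed):
            raise Violation("too many arguments")
        if not isinstance(value, self.allowed[self.index]):
            raise Violation("argument %d may not be %s"
                            % (self.index, type(value).__name__))

    def receiveChild(self, value):
        self.taste(value)
        self.args.append(value)
        self.index += 1                       # next position, next rule


unslicer = ArgumentsUnslicer([(str,), (int, float)])
unslicer.receiveChild("name")    # a str is fine as the first argument
unslicer.receiveChild(3)         # but only numbers are fine as the second
```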

There are several different ways to implement this Taster stack:

Of course, all this holds true for the sending side as well. A Slicer could enforce a policy that no objects of type Foo will be sent while it is on the stack.

It is anticipated that something like the current Jellyable/Unjellyable classes will be created to offer control over the Slicer/Unslicers used to handle instances of that class.

One eventual goal is to allow PB to implement E-like argument constraints.

Streaming Slices

The big change from the old Jelly scheme is that now serialization/unserialization is done in a more streaming format. Individual tokens are the basic unit of information. The basic tokens are just numbers and strings: anything more complicated (starting at lists) involves composites of other tokens.

The serialization side will be reworked to be a bit more producer-oriented. Objects should be able to defer their serialization temporarily (TODO: really??) like twisted.web resources can do NOT_DONE_YET right now. The big goal here is that large objects which can't fit into the socket buffers should not consume lots of memory, sitting around in a serialized state with nowhere to go. This must be balanced against the confusion caused by time-distributed serialization. PB method calls must retain their current in-order execution, and it must not be possible to interleave serialized state (big mess).

CBanana, CBananaRun, RunBananaRun

Another goal of the Jelly+Banana->JustBanana change is the hope of writing Slicers and Unslicers in C. The CBanana module should have C objects (structs with function pointers) that can be looked up in a registry table and run to turn python objects into tokens and vice versa. This ought to be faster than running python code to implement the slices, at the cost of less flexibility. It would be nice if the resulting tokens could be sent directly to the socket at the C level without surfacing into python; barring this it is probably a good idea to accumulate the tokens into a large buffer so the code can do a few large writes instead of a gazillion small ones.
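The buffering fallback mentioned above (a few large writes instead of many small ones) is easy to sketch in Python, even though the real thing would presumably live in C. The TokenBuffer name, the threshold parameter, and its 4096-byte default are assumptions; `write` stands in for something like transport.write.

```python
class TokenBuffer:
    """Accumulate serialized tokens and hand them on in large chunks."""

    def __init__(self, write, threshold=4096):
        self.write = write            # callable that accepts bytes
        self.threshold = threshold
        self.chunks = []
        self.size = 0

    def add(self, data):
        self.chunks.append(data)
        self.size += len(data)
        if self.size >= self.threshold:
            self.flush()

    def flush(self):
        """Write everything accumulated so far as a single chunk."""
        if self.chunks:
            self.write(b"".join(self.chunks))
            self.chunks = []
            self.size = 0
```

The caller still has to call flush() once the top-level object is complete, otherwise the tail of the message would sit in the buffer.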

It ought to be possible to mix C and Python slices here: if the C code doesn't find the slice in the table, it can fall back to calling a python method that does a lookup in an extensible registry.

Beyond Banana

Random notes and wild speculations: take everything beyond here with two grains of salt

Oldbanana usage

The oldbanana usage model has the layer above banana written in one of two ways. The simple form is to use the banana.encode and banana.decode functions to turn an object into a bytestream. This is used by twisted.spread.publish. The more flexible model is to subclass Banana. The largest example of this technique is, of course, twisted.spread.pb.Broker, but others which use it are twisted.trial.remote and twisted.scripts.conch (which appears to use it over unix-domain sockets).

Banana itself is a Protocol. The Banana subclass would generally override the expressionReceived method, which receives s-expressions (lists of lists). These are processed to figure out what method should be called, etc (processing which only has to deal with strings, numbers, and lists). Then the serialized arguments are sent through Unjelly to produce actual objects.

On output, the subclass usually calls self.sendEncoded with some set of objects. In the case of PB, the arguments to the remote method are turned into s-expressions with jelly, then combined with the method meta-data (object ID, method name, etc), then the whole request is sent to sendEncoded.

Newbanana

Newbanana moves the Jelly functionality into a stack of Banana Slices, and the lowest-level token-to-bytestream conversion into the new Banana object. Instead of overriding expressionReceived, users could push a different root Unslicer to get more control over the receive process.

Currently, Slicers call Banana.sendOpen/sendToken/sendClose/sendAbort, which then create bytes and call transport.write. To move this into C, the transport should get to call CUnbanana.receiveToken. There should be CBananaUnslicers. Probably a parent.addMe(self) instead of banana.stack.append(self), maybe addMeC for the C unslicer.

The Banana object is a Protocol, and has a dataReceived method (maybe in some C form, data could move directly from a CTransport to a CProtocol). It parses tokens and hands them to its Unslicer stack. The root Unslicer is probably created at connectionEstablished time. Subclasses of Banana could use different RootUnslicer objects, or the users might be responsible for setting up the root unslicer.

The Banana object is also created with a RootSlicer. Banana.writeToken serializes the token and calls transport.write (a C form could have CSlicer objects which hand tokens to a little CBanana which then hands bytes off to a CTransport).

Doing the bytestream-to-Token conversion in C loses a lot of utility when the conversion is done one token at a time. It made more sense when a whole mess of s-lists were converted at once.

All Slicers currently have a Banana pointer. Maybe they should have a transport pointer instead? The Banana pointer is needed to get to the top of the stack.

We want to be able to unserialize lists/tuples/dicts/strings/ints (basic types) without surfacing into Python, and to deliver the completed object to a Python function.