Quotient Design

Introduction

This is our erstwhile "design document" for the Quotient system as a whole. My goal is to describe the as-yet-unwritten code that I know has to be written eventually so that we can distribute the load of implementing all this stuff as widely as possible. Take nothing in this document as gospel: XP is "design amenable to change", so if you have a better idea I will be glad to throw out what I have here.

Scalability

Our eventual scalability goal is ambitious: to have a bank of servers which provide computation, communication, and storage as commodities. We want to be able to add more horsepower to the system by dropping in a new machine and ghosting the drive with a fresh quotient+linux image.

This is an incredibly difficult goal, however, and is impossible with our current architecture. Of course, there are many stops on the way there.

One significant architectural barrier to scalability is synchrony. Effectively, any object accessed without going through a Deferred must be in the same Python memory image as the object accessing it. This is directly at odds with efficiency, however: creating a Deferred is very expensive compared to a regular method call, so we should only use them where there is some real benefit to separating different aspects of the system.
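
To make the trade-off concrete, here is a minimal sketch of the two access styles; the store and its getMessage method are invented names, not existing Quotient code. Both behave the same while everything shares one process, but only the Deferred-returning form survives the store being moved behind a process or network boundary.

    from twisted.internet import defer

    class InProcessReader:
        """Synchronous style: cheap, but welds this object to the store's process."""
        def __init__(self, store):
            self.store = store

        def subjectOf(self, messageID):
            # plain method call; only works if the store shares our memory image
            return self.store.getMessage(messageID).subject

    class DeferredReader:
        """Asynchronous style: costs a Deferred per call, but the store can move."""
        def __init__(self, store):
            self.store = store

        def subjectOf(self, messageID):
            # getMessage may return a value now or a Deferred later
            d = defer.maybeDeferred(self.store.getMessage, messageID)
            d.addCallback(lambda message: message.subject)
            return d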

Another potential scalability trap is the need for storage to be synchronous (or effectively synchronous) to the computation. It is possible to offload objects to a separate, more complex "storage" layer such as an RDBMS, but only if the objects in question are either (A) direct wrappers around the storage, with no state of their own, or, (B) loadable in fairly large batches at one time. The Twisted team's experience with Popsicle indicates that attempting to have fine-grained Deferred storage is a losing battle both in terms of complexity and efficiency.

Therefore, our system should be broken apart on logical "fault lines"; for example, an avatar and its messages should be on the same machine and the same filesystem, but a separate avatar could possibly be on a different system. This means that all messages between avatars need to be asynchronous (through callRemote), which is slightly inconvenient; however, this is good design in any event because Items cannot be shared between different ItemStore instances.
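
For example, cross-avatar messaging through PB might look roughly like the following. SharedMailbox and its method are hypothetical stand-ins, but the pb.Referenceable / callRemote pattern is the standard Twisted one.

    from twisted.spread import pb

    class SharedMailbox(pb.Referenceable):
        """Exposed by one avatar; other avatars only ever hold a remote reference."""
        def __init__(self):
            self.messages = []

        def remote_deliver(self, messageText):
            # reached via callRemote, possibly from another machine entirely
            self.messages.append(messageText)
            return len(self.messages)

    def sendGreeting(mailboxRef):
        # mailboxRef is a pb.RemoteReference obtained through whatever sharing
        # mechanism granted it; the call is asynchronous wherever it lands
        return mailboxRef.callRemote("deliver", "hello from a neighboring avatar")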

Asynchronous avatar access also implies that the UI will have to be at least partially asynchronous, because we need one avatar to be able to browse another's public storage. The optimal way to handle this, I think, will be to do the UI rendering local to the objects, but to have the actual HTTP protocol be spoken by a separate object. Considering that the HTTP object is assumed to be remote from all UI objects with which it is communicating, the actual HTTP load can be distributed in a straightforward round-robin-DNS manner.

Security

Rule 1, of course: Don't use C.

The general security design is capability-oriented. Whenever you have to answer the question "am I allowed to do X?", translate it to "how do I get a reference to X?". In Python we don't have any restrictions, but when a user-interface event asks for a reference to a secured object, there should be some verification associated with that request. In general the verification will be performed ahead of time: the user is presented with a reference in a trusted context, and can thereafter request it without any checks.
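
Here is a toy sketch of the "reference is permission" idea, with invented class names: the avatar is handed references ahead of time, and lookup never performs an authorization check.

    class Document:
        def __init__(self, title, body):
            self.title = title
            self.body = body

        def read(self):
            return self.body

    class Avatar:
        """Holds only the references it has been granted; there is no "may I?" check."""
        def __init__(self):
            self.grants = {}

        def grant(self, name, obj):
            # performed ahead of time, in a context where the grantor is trusted
            self.grants[name] = obj

        def lookup(self, name):
            # the only question is "do I have a reference?", never "am I allowed?"
            return self.grants[name]

    alice = Avatar()
    alice.grant("handbook", Document("Handbook", "..."))
    print(alice.lookup("handbook").read())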

More specifically, there are three levels of protection.

  1. "private" data, which is what we implemented first: this data is just items that sit in your store. You can request any item from your store.
  2. "shared" data, which is made accessible from one individual to another. This should be made possible by dropping a "foreign reference" object (which acts exactly like a pb.Referenceable, for the reasons explained above) into the receiving user's pool. That object can be deleted by either user: the granting user retains a record of their grant which can be revoked, or the granted user can delete the item.
  3. "group" data, which is made accessible to a user by their presence in a particular group. This works similarly to shared data: presence in a group is effectively a remote reference reference to a particular pool, which by proxy gives remote references to all items contained within it. The chief difference between "group" and "shared" is that whereas "shared" data is merely a read-only reference presented to a different user, group data is stored separately and the reference objects are intelligent facades that encode some information about he role the user is performing with respect to that group. So, a user may have totally separate "normal participant" and "administrator" facades onto the group so that an offhand remark as a user should not be mistaken as an official notice from an admin, and administrative powers cannot be accidentally invoked.

"Security" is a bad word for this aspect of the design, since the property of "security" is really just that the system functions as it's expected to and unauthorized actions are truly not allowed. In addition to "Don't use C", that will really be enforced by "New PB (TM)" and Itamar's jelly-pattern-matching system (described in sandbox/itamar/regex.py).

Responsiveness

Rule 1, of course: Use C. ;-)

It's hard to say exactly where "efficiency" diverges from "scalability", so I will put concerns about responsiveness into this section and other questions about raw efficiency into the scalability section.

Quotient is a fairly large and, by design, arbitrarily complex system. However, as in many such systems, it does appear that the severe performance hogs break down into a few predictable categories.

FilePile

FilePile makes for impressive, debuggable prototype code as long as you're using UNIX, you don't have too many items to worry about, and all of your items are about 8k each. However, it makes for lousy production code once you start hitting certain edge cases, which our use of it for email most certainly hits.

We will be replacing all uses of FilePile for indexing: pools will have index *files* rather than index directories. FilePile is still a useful abstraction for iterating the store itself, so that will remain, though it will likely be further optimized for direct lookups. Currently we have to create all kinds of intermediary structures, and in some cases list directories, just to do an atKey, which should be at most 2 filesystem syscalls in the common case.
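
As a sketch of the intended flavor (this is not FilePile's real interface, just an assumption about how an index file could behave), atKey becomes a single positioned read on an already-open descriptor:

    import os, struct

    RECORD = struct.Struct("!Q")   # one 8-byte offset per key, fixed width

    class IndexFile:
        """Hypothetical index *file*: atKey is one pread on an open descriptor,
        instead of listing directories and building intermediate structures."""
        def __init__(self, path):
            self.fd = os.open(path, os.O_RDONLY)

        def atKey(self, key):
            data = os.pread(self.fd, RECORD.size, key * RECORD.size)
            if len(data) != RECORD.size:
                raise KeyError(key)
            return RECORD.unpack(data)[0]   # offset of the item in the store

        def close(self):
            os.close(self.fd)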

Text Indexing

It's well known that Lupy simply needs to be cleaned up and optimized. One of the chief problems with doing such a cleanup is our inability to delineate between phases of message processing.

Processing Phases

The biggest problem with responsiveness is atomicity. Currently what little failure-tolerance is implemented is written in terms of an all-or-nothing processing of the message, ending with a final "commit" that consists of a filesystem "move".
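
Schematically, the current approach amounts to something like the following, relying on rename() being atomic within a single POSIX filesystem; the function and paths are placeholders, not the actual pipeline code.

    import os

    def processAndCommit(rawMessage, scratchPath, finalPath):
        # all-or-nothing: do every processing step against a scratch file...
        with open(scratchPath, "wb") as f:
            f.write(rawMessage)          # stand-in for the real processing steps
            f.flush()
            os.fsync(f.fileno())
        # ...then "commit" with a single atomic rename.  If anything above fails,
        # finalPath is untouched and the whole message is reprocessed from scratch.
        os.rename(scratchPath, finalPath)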

We need a more robust message-processing pipeline. It must:

  1. track dependencies between parts of message processing so that messages can be processed by independent systems in parallel.
  2. break up individual long-running message-processing phases (sketched below) by either
    1. running them within a thread, or
    2. turning them into generators that are invoked with iterateInReactor
  3. keep message processing in order, and persistently track which message has been processed by which subsystem, in order to allow processing to resume after a failure without dropping or re-processing messages.
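
Here is a rough sketch of what points 2.1 and 2.2 could look like; the phase functions are placeholders, and Twisted's task.coiterate stands in for the iterateInReactor idea of interleaving a generator with the reactor.

    from twisted.internet import task, threads

    # placeholder phases; the real ones would be Quotient's parsing/indexing steps
    def expensiveIndexingPhase(message):
        return message
    def parseHeaders(message):
        pass
    def extractParts(message):
        pass
    def fileIntoPools(message):
        pass

    def processMessage(message):
        # option (2.1): push a genuinely blocking phase into a thread
        d = threads.deferToThread(expensiveIndexingPhase, message)

        # option (2.2): run the remaining phases as a generator, one small step
        # per reactor turn, so long-running work never freezes the process
        def incrementalPhases():
            for phase in (parseHeaders, extractParts, fileIntoPools):
                phase(message)
                yield None

        d.addCallback(lambda _: task.coiterate(incrementalPhases()))
        return d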

Backup

For a backup system, we have slightly different requirements from normal backup processes. Unlike a traditional backup system, which attempts to restore a previous consistent state and then has users manually re-do previous work, we should strive never to drop an email. It is acceptable to require our users to re-file a day or two's worth of email manually, or to re-run our processing should a failure occur, but dropped emails can't reasonably be manually restored.

In order to do this, we need redundancy at many levels. We need a full log of all messages sent to the system, structured in a repeatable way and replicated to every other server on the network. We also need checkpoint backups which freeze the whole state of the system at a particular point in time, in order to preserve the manual work that users have put in.
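
As a sketch of the first requirement (an append-only, replayable message log), something along these lines would do locally, with replication layered on top; the record format and names here are assumptions, not a settled design.

    import json, os, time

    class MessageLog:
        """Hypothetical append-only log: every inbound message is recorded, fsync'd,
        and can be replayed in order to rebuild processing after a failure."""
        def __init__(self, path):
            self.f = open(path, "ab")

        def append(self, sender, recipient, rawMessage):
            record = json.dumps({
                "when": time.time(),
                "from": sender,
                "to": recipient,
                "message": rawMessage,
            })
            self.f.write(record.encode("utf-8") + b"\n")
            self.f.flush()
            os.fsync(self.f.fileno())   # survive a crash before replication catches up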