This is our working "design document" for the Quotient system as a whole. My goal is to describe the as-yet-unwritten code that I know will have to be written eventually, so that we can distribute the load of implementing all this stuff as widely as possible. Take nothing in this document as gospel: XP is "design amenable to change", so if you have a better idea I will be glad to throw out what I have here.
Our eventual scalability goal is ambitious: to have a bank of servers which provide computation, communication, and storage as commodities. We want to be able to add more horsepower to the system by dropping in a new machine and ghosting its drive with a fresh quotient+linux image.
This is an incredibly difficult goal, however, and is impossible with our current architecture. Of course, there are many stops on the way there.
One significant architectural barrier to scalability is synchrony. Effectively, any object accessed without going through a Deferred must live in the same Python memory image as the object accessing it. However, making everything asynchronous is directly at odds with efficiency: creating a Deferred is very expensive compared to a regular method call, so we should only use one where there is some real benefit to separating different aspects of the system.
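To make the boundary concrete, here is a minimal sketch (the index classes and their methods are made-up names, not anything in Quotient today): same-image callers get a plain return value, while anything that might someday live elsewhere hands back a Deferred.

    from twisted.internet.defer import succeed

    class LocalIndex:
        # Same-memory-image access: a plain, cheap, synchronous call.
        def __init__(self, items):
            self.items = items

        def count(self):
            return len(self.items)

    class RemoteIndex:
        # Anything that may move to another process must hide behind a
        # Deferred; succeed() stands in for an eventual callRemote.
        def __init__(self, items):
            self.items = items

        def count(self):
            return succeed(len(self.items))

    def printCount(deferredCount):
        # Callers pay the Deferred overhead even when the object happens
        # to be local -- hence: only split where there is a real benefit.
        deferredCount.addCallback(lambda n: print("items:", n))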
Another potential scalability trap is the need for storage to be synchronous (or effectively synchronous) with the computation. It is possible to offload objects to a separate, more complex "storage" layer such as an RDBMS, but only if the objects in question are either (A) direct wrappers around the storage, with no state of their own, or (B) loadable in fairly large batches at one time. The Twisted team's experience with Popsicle indicates that attempting fine-grained Deferred storage is a losing battle, both in terms of complexity and of efficiency.
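A sketch of what option (B) might look like (BatchStore, fetchRows, and makeItem are illustrative names only, not real interfaces): the asynchronous cost is paid once per batch, and everything after that is plain in-memory objects.

    from twisted.internet.defer import maybeDeferred

    class BatchStore:
        # Option (B): items are loaded in coarse batches, so we pay for
        # one asynchronous round trip per batch rather than one Deferred
        # per object (the trap fine-grained Popsicle storage fell into).
        def __init__(self, backend):
            self.backend = backend        # e.g. a thin RDBMS adapter

        def loadBatch(self, pool, start, count):
            # fetchRows is assumed to hand back a list of row tuples for
            # the whole batch; after the callback fires, everything is a
            # plain in-memory object again.
            d = maybeDeferred(self.backend.fetchRows, pool, start, count)
            d.addCallback(lambda rows: [self.makeItem(r) for r in rows])
            return d

        def makeItem(self, row):
            return row    # placeholder for real Item construction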
Therefore, our system should be broken apart on logical "fault lines"; for example, an avatar and its messages should be on the same machine and the same filesystem, but a separate avatar could possibly be on a different system. This means that all messages between avatars need to be asynchronous (through callRemote), which is slightly inconvenient; however, this is good design in any event because Items cannot be shared between different ItemStore instances.
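In PB terms, that might look something like this (MailAvatar, remote_deliver, and the store attribute are hypothetical; Referenceable and callRemote are the standard PB machinery):

    from twisted.spread import pb

    class MailAvatar(pb.Referenceable):
        # An avatar and its Items live together in one ItemStore on one
        # machine; other avatars only ever hold a remote reference to it.
        def remote_deliver(self, messageText):
            self.store.newMessage(messageText)    # hypothetical store API

    def sendMessage(recipientRef, text):
        # recipientRef may point at an avatar on another machine, so the
        # result is always a Deferred -- never the recipient's Item itself.
        return recipientRef.callRemote("deliver", text)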
Asynchronous avatar access also implies that the UI will have to be at least partially asynchronous, because we need one avatar to be able to browse another's public storage. The optimal way to handle this, I think, will be to do the UI rendering local to the objects, but to have the actual HTTP protocol spoken by a separate object. Since the HTTP object is assumed to be remote from all the UI objects it communicates with, the actual HTTP load can be distributed in a straightforward round-robin-DNS manner.
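Here is a rough sketch of that split (FrontendPage and the renderPage remote method are invented for illustration): the HTTP object speaks the protocol locally but relays all rendering through callRemote, so any number of identical frontends can sit behind round-robin DNS.

    from twisted.web import resource, server

    class FrontendPage(resource.Resource):
        # HTTP is spoken here; the page body comes from a renderer that
        # is assumed to be remote, so this process can be replicated
        # freely behind round-robin DNS.
        isLeaf = True

        def __init__(self, rendererRef):
            resource.Resource.__init__(self)
            self.rendererRef = rendererRef    # a PB RemoteReference

        def render_GET(self, request):
            d = self.rendererRef.callRemote("renderPage", request.uri)
            def done(body):
                request.write(body)
                request.finish()
            d.addCallback(done)
            return server.NOT_DONE_YET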
Rule 1, of course: Don't use C.
The general security design is capability-oriented. Whenever you have to answer the question "am I allowed to do X?", translate it to "how do I get a reference to X?". In Python we don't have any real restrictions, but when a user-interface event asks for a reference to a secured object, there should be some verification associated with that. In general the verification will be performed before-the-fact, by presenting a reference to the user in a context where they can request it without any further checks.
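A minimal sketch of the idea (all names here are hypothetical): there is no checkPermission() anywhere in the call path; holding a reference under a name *is* the authorization.

    class Workspace:
        # A user-facing namespace of capabilities: if a reference was
        # granted to you, you may use it; if not, you cannot even name it.
        def __init__(self):
            self._granted = {}

        def grant(self, name, obj):
            # Verification happens once, before-the-fact, when the
            # reference is first presented in the UI.
            self._granted[name] = obj

        def lookup(self, name):
            # "Am I allowed to do X?" becomes "do I hold a reference to X?"
            return self._granted[name]    # KeyError means "not authorized"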
More specifically, there are 4 levels of protection.
"Security" is a bad word for this aspect of the design, since the property of "security" is really just that the system functions as it's expected to and unauthorized actions are truly not allowed. In addition to "Don't use C", that will really be enforced by "New PB (TM)" and Itamar's jelly-pattern-matching system (described in sandbox/itamar/regex.py).
Rule 1, of course: Use C. ;-)
It's hard to say exactly where "efficiency" diverges from "scalability", so I will put concerns about responsiveness into this section and other questions about raw efficiency into the scalability section.
Quotient is a fairly large and, by design, arbitrarily complex system. However, as in many such systems, it does appear that the severe performance hogs break down into a few predictable categories.
FilePile makes for impressive, debuggable prototype code as long as you're using UNIX, you don't have too many items to worry about, and all of your items are about 8k each. However, it makes for lousy production code once you start hitting certain edge cases, which our use of it for email certainly does.
We will be replacing all uses of FilePile for indexing: pools will have index *files* rather than index directories. FilePile is still a useful abstraction for iterating the store itself, so that will remain, though it will likely be further optimized for direct lookups. Currently we have to create all kinds of intermediary structures, and in some cases list directories, just to do an atKey, which should take at most 2 filesystem syscalls in the common case.
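For illustration, an atKey against an index file might look roughly like this (the record layout and names are made up, not a spec): one positioned read on the index, one on the data file.

    import os, struct

    RECORD = struct.Struct("!QQ")    # hypothetical: (offset, length) per item

    class IndexFile:
        # An index *file* instead of FilePile's index directories: atKey
        # becomes positioned reads instead of directory listings and
        # intermediary structures.
        def __init__(self, indexPath, dataPath):
            self.indexFd = os.open(indexPath, os.O_RDONLY)
            self.dataFd = os.open(dataPath, os.O_RDONLY)

        def atKey(self, n):
            # Syscall 1: read the fixed-width index record for item n.
            rec = os.pread(self.indexFd, RECORD.size, n * RECORD.size)
            offset, length = RECORD.unpack(rec)
            # Syscall 2: read the item itself out of the data file.
            return os.pread(self.dataFd, length, offset)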
It's well known that Lupy needs to be cleaned out and optimized. One of the chief problems with doing such a cleanup is our inability to delineate the phases of message processing.
The biggest problem with responsiveness is atomicity. Currently what little failure-tolerance is implemented is written in terms of an all-or-nothing processing of the message, ending with a final "commit" that consists of a filesystem "move".
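The commit itself is the usual write-fsync-rename dance; a minimal sketch (commitMessage is an invented helper, not current code):

    import os

    def commitMessage(messageBytes, finalPath):
        # Process into a temporary file first; nothing is visible yet.
        tmpPath = finalPath + ".tmp"
        fd = os.open(tmpPath, os.O_WRONLY | os.O_CREAT | os.O_TRUNC, 0o600)
        try:
            os.write(fd, messageBytes)
            os.fsync(fd)              # make sure the bytes reach the disk
        finally:
            os.close(fd)
        # The "commit" is one atomic rename: readers see either the old
        # state or the complete new message, never a partial write.
        os.rename(tmpPath, finalPath)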
We need a more robust message-processing pipeline. It must:
Our backup requirements differ slightly from those of a normal backup process. A traditional backup system attempts to restore a previous consistent state and then has users manually re-do their subsequent work; we, by contrast, should strive never to drop an email. It is acceptable to require our users to manually re-file a day or two's worth of email, or to re-run our processing should a failure occur, but dropped emails cannot reasonably be restored by hand.
In order to do this, we need redundancy at many levels. We need a full log of all messages sent to the system, structured in a repeatable way and replicated to every other server on the network. We also need checkpoint backups which freeze the whole state of the system at a particular point in time, in order to preserve the manual work that users have put in.
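A sketch of what the message log could look like (MessageLog and its record format are hypothetical, not a design decision): an append-only file of self-describing records, replayable in order and cheap to ship to other servers.

    import json, os, time

    class MessageLog:
        # Append-only log of every message the system accepts.  Each
        # record is self-describing JSON on its own line, so the log can
        # be replayed in order or shipped verbatim to another server.
        def __init__(self, path):
            flags = os.O_WRONLY | os.O_APPEND | os.O_CREAT
            self.fd = os.open(path, flags, 0o600)

        def append(self, recipient, rawMessage):
            record = json.dumps({
                "when": time.time(),
                "recipient": recipient,
                "message": rawMessage,
            })
            os.write(self.fd, (record + "\n").encode("utf-8"))
            os.fsync(self.fd)         # "never drop an email"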