Cooperative Data Flows

Background

Handling more than one Request

When creating services which handle more than one client at a time, the current state of each request must be managed. A typical way to keep this state is the call stack: a stack of function calls and their arguments. Unfortunately, a single call stack does not work well when more than one request is in progress at once.

One approach to this problem is preemptive multi-tasking, where each request is handled by a new process or thread which has its own call stack. From the programmer's perspective, the main difference between processes and threads is how objects common to multiple requests are shared: between processes one must use sockets or shared memory, while between threads the programmer must take care to lock objects which could be touched by other threads. In either case, each request is given its own call stack, making things easy for the programmer. However, the programmer must take extra care with information that is not kept on the stack and is shared between requests.
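The thread-per-request idea can be sketched in a few lines. The following is a minimal illustration in modern Python using the standard threading module (the handler and tally names are hypothetical, not part of any Twisted API): each handler keeps its working state in local variables on its own private call stack, and only the state shared between requests needs a lock.

```python
import threading

# State shared between requests must be guarded by a lock;
# everything else lives on each thread's private call stack.
tally = {"count": 0}
lock = threading.Lock()

def handle_request(n):
    # 'subtotal' is a local variable, private to this thread's stack
    subtotal = 0
    for _ in range(n):
        subtotal += 1
    with lock:                      # protect the shared tally
        tally["count"] += subtotal

# simulate five simultaneous requests, each on its own thread
threads = [threading.Thread(target=handle_request, args=(10,))
           for _ in range(5)]
for t in threads:
    t.start()
for t in threads:
    t.join()

print(tally["count"])   # 50
```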

Another approach is event-driven programming, where each request is broken down into distinct stages of work, and each stage, when finished, schedules the next stage to be run. This is called cooperative multi-tasking, and it is the approach the Twisted framework primarily uses to handle multiple requests. In Twisted, the execution of each stage is typically scheduled with reactor.callLater, and the result of each stage is reported through a callback, usually managed by a twisted.internet.defer.Deferred object.
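To make the stage-scheduling idea concrete, here is a toy cooperative scheduler in plain Python, with no Twisted dependency. The call_later function and the two stages are made-up names for illustration; a real reactor does far more, but the shape is the same: each stage finishes quickly and enqueues the next one, so two requests interleave instead of one blocking the other.

```python
from collections import deque

ready = deque()    # the queue of scheduled stages
results = []

def call_later(fn, *args):
    # hypothetical stand-in for reactor.callLater: just enqueue the stage
    ready.append((fn, args))

def stage_fetch(request):
    data = "data-for-%s" % request
    call_later(stage_render, data)      # schedule the next stage

def stage_render(data):
    results.append(data.upper())

# two simultaneous "requests"
call_later(stage_fetch, "req1")
call_later(stage_fetch, "req2")

# the event loop: run stages until none remain
while ready:
    fn, args = ready.popleft()
    fn(*args)

print(results)   # ['DATA-FOR-REQ1', 'DATA-FOR-REQ2']
```

Note that both requests advance a stage at a time; neither monopolizes the single call stack.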

While this Deferred approach works well, it starts to have problems when the flow of a request is not a simple linear sequence of stages, or when results must flow between stages incrementally. An alternative cooperative multi-tasking approach is to view a data flow as a hierarchy of stages, where the last stage in the flow pulls from previous stages, and so on. This approach allows more granular control of the flow, giving incremental results and also allowing the next stage to be chosen more dynamically. Unfortunately, building iterators by hand is not easy. Luckily, in version 2.2 and up, Python has generators, which are, in short, syntactic sugar for making iterators.

A further complication to this iterator-based approach is that every once in a while, an iterator in the flow (perhaps one nested several layers deep) may have to block on a resource. In this case, the flow must be paused so that other flows in the system have a chance to produce results. So, rather than blocking, the entire state of the iterator chain must be saved so that it can be resumed later. Furthermore, traditional exception handling doesn't work, since the call stack for a given request may be paused indefinitely. Helping the programmer manage these sorts of things is what the flow module does. While flow does not depend on generators, they are very useful for using it effectively.
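The key property that makes pausing possible is that an iterator chain's entire state is held in ordinary object references, not on the call stack. A minimal sketch in modern Python (the producer/consumer names are illustrative only): partially consume a nested generator chain, set it aside, and resume it later exactly where it left off.

```python
def producer():
    for i in (3, 2, 1):
        yield i

def consumer():
    # a nested chain: consumer pulls from producer
    for value in producer():
        yield value * 10

it = consumer()        # the whole chain's state lives in 'it'
first = next(it)       # run one step...
# ...the chain can now sit idle while other flows run...
rest = list(it)        # ...then be resumed right where it left off

print(first, rest)     # 30 [20, 10]
```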

Iterators and generators

An iterator is basically an object which produces a sequence of values. Python's iterators are simply objects with an __iter__() method returning an object (usually itself) which has a next() method. The next() method is then invoked until it raises a StopIteration exception. In Python 2.2, the for syntax knows about iterators, making them very nice to use.


from twisted.python.compat import iter, StopIteration

class Counter:
    def __init__(self, count):
        self.count = count
    def __iter__(self):
        return self
    def next(self):
        ret = self.count
        self.count -= 1
        if ret: return ret
        raise StopIteration

def list(it):
    ret = []
    import sys
    if sys.version_info >= (2,2):
        for x in it:
            ret.append(x)
    else:
        it = iter(it)
        try:
            while 1:
                ret.append(it.next())
        except StopIteration: pass
    return ret

print list(Counter(3))

# prints: [3, 2, 1]

State pattern

Often it is useful for an iterator to change state while producing values. This can be done nicely with the 'state' pattern: simply store in the iterator the next function to be run.


class States:
    def __iter__(self):
        self.state = self.next_initial
        return self
    def next_initial(self):
        self.state = self.next_middle
        return "one"
    def next_middle(self):
        self.state = self.next_final
        return "two"
    def next_final(self):
        raise StopIteration
    def next(self):
        return self.state()

print list(States())

# prints: ['one', 'two']

Generators

With Python 2.2 comes a wonderful bit of syntactic sugar for creating iterators: generators. When a generator function is called, an iterator is returned, and from there on, each invocation of next() gives the subsequent value produced by a yield statement. With generators, the two iterators above become very easy to express.

from __future__ import generators   # <-- first line of file

def Counter(count):
    while count > 0:
        yield count
        count -= 1

def States():
    yield "one"
    yield "two"


print list(Counter(3))
print list(States())

# prints:
#    [3, 2, 1]
#    ['one', 'two']

An important detail here is that code which uses both iterators and generators (like the list function above) can be expressed in a manner which works with Python 2.1, and thus can be included in Twisted's code base. One technical difference between iterators and generators is that raising an exception from a generator permanently halts the generator, while raising an exception from an iterator's next() method does not necessarily stop the iterator; one could call the next() method again and possibly get results. From here on, we use the generator syntax for expressing iterators.
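This difference is easy to demonstrate. The sketch below uses modern Python spelling (the next() built-in and __next__ rather than the 2.2-era next() method); the Flaky class is a made-up example, not from Twisted. Once the generator raises, it is finished for good, while the hand-written iterator recovers and produces more values.

```python
def gen():
    yield 1
    raise ValueError("halts the generator for good")
    yield 2    # never reached

class Flaky:
    """A hand-written iterator that raises once, then keeps going."""
    def __init__(self):
        self.n = 0
    def __iter__(self):
        return self
    def __next__(self):
        self.n += 1
        if self.n == 2:
            raise ValueError("transient error")
        if self.n > 3:
            raise StopIteration
        return self.n

# A generator is permanently halted once it raises...
g = gen()
assert next(g) == 1
try:
    next(g)
except ValueError:
    pass
try:
    next(g)
    generator_resumed = True
except StopIteration:
    generator_resumed = False    # subsequent next() just ends it

# ...while a plain iterator may keep producing after an exception.
f = Flaky()
assert next(f) == 1
try:
    next(f)
except ValueError:
    pass
assert next(f) == 3              # the iterator carried on

print(generator_resumed)   # False
```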

Introducing Flow

It is possible, and often useful, to view a data flow as a chain of nested iterators. In this view, the 'last' iterator in the chain 'pulls' data from the previous iterators in the data flow. If you wish, you may call the last iterator a consumer and the first iterator a producer. In the following example, we use the Counter generator defined above as our producer.

from __future__ import generators 

def Counter(count):
    while count > 0:
        yield count
        count -= 1

def Consumer():
    for result in Counter(3):
        if 2 != result:
            yield result

print list(Consumer())

# prints: [3, 1]

The problem with this approach in a cooperative multi-tasking environment is that a producer could potentially block, and if it did, the entire process would stop servicing all other requests. Thus, some mechanism for pausing the flow and resuming it later is required.

Flow Basics

The flow module provides this ability to cooperate with other tasks by placing a control mechanism between each stage of a flow. This is accomplished in code by creating a wrapper object for each iterable, which one should yield before every call to next(), implicit or otherwise. At each such yield, the control mechanism can take over to support the underlying cooperative multi-tasking mechanism.

from __future__ import generators
import flow

def Counter(count):
    while count > 0:
        yield count
        count -= 1

def Consumer():
    producer = flow.wrap(Counter(3))
    yield producer
    for result in producer:
        if 2 != result:
            yield result
        yield producer

print list(flow.Block(Consumer))

# prints: [3, 1]

In the code above, producer.next() is implicitly called by the for loop. It does several things, such as checking for the end of the iterator and scanning for failures. Its behavior can best be described with the more verbose version below, where Counter has been replaced with a simple list for brevity.

from __future__ import generators
import flow

def Consumer():
    producer = flow.wrap([3,2,1])
    while 1:
        yield producer
        if producer.stop: break
        if producer.isFailure():
            raise producer.result
        if 2 != producer.result:
            yield producer.result

print list(flow.Block(Consumer))

# prints: [3, 1]

Handling failures

Another difference between a plain old iterable and one wrapped with the flow module is that exceptions raised must be delayed for later delivery. This is done with twisted.python.failure.Failure. Within a for loop, a failure is raised unless its exception type was passed to wrap() via the trap argument, in which case it appears as a result instead. Furthermore, Failure must also be used to send an exception back from a generator if the error is recoverable.

from __future__ import generators
import flow

def Producer():
    yield 1
    yield flow.Failure(IOError("recoverable"))
    yield 2
    assert 0, "asserting"
    yield 3

def Consumer():
    producer = flow.wrap(Producer(), trap=IOError)
    yield producer
    try:
        for result in producer:
            if result is IOError:
                # handle recoverable error
                pass
            else:
                yield result
            yield producer
    except flow.Failure, fail:
        # pass other failures up the stack
        fail.trap(AssertionError)
        # handle non-recoverable error
        yield str(fail.value)

print list(flow.Block(Consumer))

# prints: [1, 2, 'asserting']

Cooperate

This seems like quite a bit of effort: wrapping each iterator and then altering the calling sequence. Why? The answer is that it allows a flow.Cooperate object to be returned. When this happens, the entire call chain can be paused so that other flows can use the call stack. For flow.Block (which blocks), the implementation of Cooperate simply puts the call chain to sleep for the given number of seconds.

import flow

lst = ['1','2', flow.Cooperate(4), '3']
print list(flow.Block(lst))

# prints: ['1', '2', '3']

Merging iterators

An application of Cooperate can be demonstrated with Merge. This stage joins the results of two or more wrapped iterators into a single stream, without letting one block the other. In the example below, the States iterator isn't held up while the Counter iterator cooperates.

from __future__ import generators
import flow

def States():
    yield "one"
    yield "two"

def Counter(count):
    while count > 0:
        if not count % 2:
            yield flow.Cooperate()
        yield count
        count -= 1

mrg = flow.Merge(Counter(3), States())
print list(flow.Block(mrg))

# prints: [3, 'one', 'two', 2, 1]

Deferred Flow

The real value of flow comes not from stand-alone use; in the examples so far, Cooperate does very little, and the overhead imposed by flow isn't offset by added functionality. However, when flow is combined with Twisted's reactor.callLater and twisted.internet.defer.Deferred mechanisms, things get very cosy. In the example below, the first two items in the list are produced (although they are not delivered yet), other events in the reactor are allowed to proceed for one second, and then the last item in the list is produced.

from __future__ import generators
from twisted.internet import reactor
import flow

def prn(x): 
    print x
d = flow.Deferred([1,2,flow.Cooperate(1),3])
d.addCallback(prn)
reactor.callLater(2, reactor.stop)
reactor.run()

# prints
#   [1,2,3]

Dealing with Threads

While the flow module allows multiple cooperative tasks to work in a single thread, sometimes it is necessary to have the output of another thread consumed within a flow. This can be done with flow.Threaded. In the example below, the Count iterator blocks within a thread by calling sleep.

from __future__ import generators
from twisted.internet import reactor
import flow

class Count:
    def __init__(self, count):
        self.count = count
    def __iter__(self):
        return self
    def next(self): # this is run in a separate thread
        from time import sleep
        sleep(.2)
        val = self.count
        if not(val): raise flow.StopIteration
        self.count -= 1
        print "producing", val
        return val

d = flow.Deferred(flow.Threaded(Count(5)))
def prn(x):
    print "results", x
d.addCallback(prn)
reactor.callLater(4, reactor.stop)
reactor.run()

# results:
#   producing 5
#   producing 4
#   producing 3
#   producing 2
#   producing 1
#   results [5, 4, 3, 2, 1]

Using database connections

Since most standard database drivers are thread-based, the flow module builds on Threaded by providing a QueryIterator, which takes a ConnectionPool and an SQL query.

from __future__         import generators
from twisted.enterprise import adbapi
from twisted.internet   import reactor
import flow

dbpool = adbapi.ConnectionPool("SomeDriver",host='localhost', 
             db='Database',user='User',passwd='Password')

sql = """
  (SELECT 'one')
UNION ALL
  (SELECT 'two')
UNION ALL
  (SELECT 'three')
"""

def consumer():
    query = flow.Threaded(flow.QueryIterator(dbpool, sql))
    while 1:
        yield query
        if query.stop: break
        print "Processed result : ", query.result

def finish(result): 
    print "Deferred Complete : ", result
f = flow.Deferred(consumer())
f.addBoth(finish)
reactor.callLater(1,reactor.stop)
reactor.run()

# prints
# Processed result :  ('one',)
# Processed result :  ('two',)
# Processed result :  ('three',)
# Deferred Complete :  []