Cayuga: A General Purpose Event Monitoring System
Demers, Gehrke, Panda, Riedewald, Sharma, White
event correlation databases in-network continuous query stream processing
@conference{demers:cidr-2007,
title={Cayuga: A General Purpose Event Monitoring System},
author={Demers, A. and Gehrke, J. and Panda, B. and
Riedewald, M. and Sharma, V. and White, W.M.},
booktitle={Conference on Innovative Data Systems Research},
year={2007}
}
Event monitoring applications differ from previous stream processing work
- Many monitoring queries must find correlations between events and patterns in the stream
- One important class of patterns is "safety conditions" which stipulate events that must not happen between other events
- Such queries are hard to write and hard to optimize in previous stream languages
- Event monitoring also requires supporting a larger number of concurrent queries
Major additions to the query language (CEL) are the NEXT and FOLD operators
- The NEXT operator selects the next tuple matching a given filter that occurs after a given event
- The FOLD operator iterates over tuples matching a given filter until a tuple matching a stop condition is received, with aggregation permitted along the way
- Example: "Suppose we want to find a monotonically increasing run of prices for a single company, where the run lasts for at least 10 stock quotes, and the first quote has a volume greater than 10000. We can formulate this query in CEL as follows."
SELECT *
FROM FILTER{cnt > 10} (
(SELECT *, 1 AS cnt FROM
FILTER{Volume > 10000}(Stock))
FOLD{$2.Name = $.Name, $2.Price > $.Price,
$.cnt+1 AS cnt} Stock)
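A rough procedural reading of what this query computes, as a sanity check (this is not CEL or Cayuga code; the Quote struct, its field names, and the MonotoneRunDetector class are made up for illustration, and the greedy per-company loop tracks only one candidate run at a time, unlike the nondeterministic automaton Cayuga would build):

#include <cstdint>
#include <string>
#include <unordered_map>

// Illustrative stand-in for a Stock tuple; field names mirror the CEL example.
struct Quote {
    std::string name;
    double price;
    int64_t volume;
};

// Per-company fold state: length of the current increasing run and its last price.
struct RunState {
    int cnt = 0;
    double lastPrice = 0.0;
};

class MonotoneRunDetector {
public:
    // Returns true once the run for this company satisfies FILTER{cnt > 10}.
    bool onQuote(const Quote& q) {
        RunState& s = state_[q.name];
        if (s.cnt > 0 && q.price > s.lastPrice) {
            ++s.cnt;                  // FOLD iteration: run continues
        } else if (q.volume > 10000) {
            s.cnt = 1;                // inner FILTER: a new run may start here
        } else {
            s.cnt = 0;                // run broken, and this quote cannot start one
        }
        s.lastPrice = q.price;
        return s.cnt > 10;            // outer FILTER{cnt > 10}
    }

private:
    std::unordered_map<std::string, RunState> state_;
};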
Cayuga queries are expressible as slightly generalized NFAs (a sketch of such an automaton follows this list)
- Input is arbitrary relational streams
- State transitions controlled using predicates
- Data from earlier events may be stored in the automaton state, and selection predicates may compare the current event to previous events
- The internal query processor works off a detailed, explicit XML encoding of the automata
- This makes it easier to plug other interfaces on top of the engine, e.g. a new high level query interface
- Predicates are compiled into bytecodes executed by an interpreter
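A minimal sketch of what such a generalized NFA might look like, assuming events can be modeled as attribute/value maps (the Event, Transition, State, and Instance types below are illustrative, not Cayuga's internal classes):

#include <functional>
#include <string>
#include <unordered_map>
#include <vector>

// An event modeled as a flat attribute -> value map (illustration only).
using Event = std::unordered_map<std::string, double>;

// A transition fires when its predicate over (stored bindings, incoming event) holds.
struct Transition {
    int target;   // destination state id
    std::function<bool(const Event& bound, const Event& incoming)> predicate;
};

struct State {
    std::vector<Transition> transitions;
    bool accepting = false;
};

// One nondeterministic "instance" of the automaton: a current state plus the
// data bound from events seen so far on this partial match.
struct Instance {
    int state;
    Event bound;
};

// Advance all live instances on one incoming event; an instance may fork into
// several successors, which is what makes the automaton nondeterministic.
std::vector<Instance> step(const std::vector<State>& states,
                           const std::vector<Instance>& live,
                           const Event& e) {
    std::vector<Instance> next;
    for (const Instance& inst : live) {
        for (const Transition& t : states[inst.state].transitions) {
            if (t.predicate(inst.bound, e)) {
                Instance forked = inst;   // shallow copy of the partial match
                forked.state = t.target;
                // A real engine would also rebind selected attributes from e here.
                next.push_back(std::move(forked));
            }
        }
    }
    return next;
}

The generalization shows up in Transition::predicate: unlike a classical NFA, whether a transition fires depends on data bound by earlier events, not just on the current input symbol.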
Time synchronization is a big deal as events must be processed in order
- It must correct for clock skew between data sources, network delay & reordering, etc
- Incoming events are placed in a priority queue ordered by timestamp, so they can be processed in timestamp order even if they arrive slightly out of order
- With clock skew bounded by T, events are buffered for T before being processed (a sketch of this buffering follows this list)
- Note that other events, received later but carrying earlier timestamps, may have taken priority by this point
- Seems like a bad actor could cause this to halt permanently, by constantly sending falsely timestamped tuples just before the interval expires
- High T affects latency but not throughput
- Events with earlier timestamps that arrive after later events have already been processed (i.e. T was too small) are discarded
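A minimal sketch of this buffering scheme, assuming a fixed skew bound T, a caller-supplied notion of "now", and made-up type names (this is the idea as I read it, not Cayuga's actual scheduler):

#include <cstdint>
#include <queue>
#include <vector>

struct TimedEvent {
    int64_t timestamp;   // source-assigned timestamp
    int payload;         // stand-in for the actual tuple
};

// Comparator yielding a min-heap ordered by timestamp.
struct LaterTimestamp {
    bool operator()(const TimedEvent& a, const TimedEvent& b) const {
        return a.timestamp > b.timestamp;
    }
};

class ReorderBuffer {
public:
    explicit ReorderBuffer(int64_t skewBound) : T_(skewBound) {}

    // Buffer an arriving event unless it is already too late to order correctly.
    void onArrival(const TimedEvent& e) {
        if (e.timestamp < watermark_) {
            return;   // its timestamp slot has already been processed: discard
        }
        heap_.push(e);
    }

    // Release, in timestamp order, every buffered event that has waited out the
    // skew bound, i.e. anything timestamped at or before now - T.
    std::vector<TimedEvent> release(int64_t now) {
        std::vector<TimedEvent> ready;
        while (!heap_.empty() && heap_.top().timestamp <= now - T_) {
            ready.push_back(heap_.top());
            heap_.pop();
        }
        if (!ready.empty()) {
            watermark_ = ready.back().timestamp;   // later arrivals before this are too late
        }
        return ready;
    }

private:
    int64_t T_;
    int64_t watermark_ = INT64_MIN;
    std::priority_queue<TimedEvent, std::vector<TimedEvent>, LaterTimestamp> heap_;
};

The discard branch in onArrival corresponds to the bullet above: once events up to the watermark have been released, a later arrival carrying an earlier timestamp is dropped rather than processed out of order.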
Related work
- Pub/sub systems have limited expressiveness; their subscriptions don't span multiple events
- Stream databases are awkward for relating multiple events, and may have scalability issues
- Complex event systems are similar, but not as general or well defined as Cayuga
Future work
- Functions
- Hierarchy of servers performing event processing at various levels of abstraction
Of note:
- Makes heavy use of garbage collection and shallow copies to manage the large number of objects instantiated and destroyed during processing
- Interesting bimodal distribution of object lifetimes: some die very fast (filtered tuples), others last for quite some time, e.g. operator objects
- Copying collection?
- Uses an internal pool of canonical (interned) strings (see the sketch at the end of these notes)
- Reduces memory footprint and speeds up equality checks once strings are canonicalized (a simple index/pointer comparison suffices)
- Note that this is not as straightforward as a plain hash table: the pool's entries must not pin the strings, so that unused strings can still be garbage collected
- Uses "weak references," which do not block garbage collection
- Demers et al, Towards expressive publish/subscribe systems. EDBT, 2006
- Fabret et al, Filtering algorithms and implementation for very fast publish/subscribe. SIGMOD, 2001
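A minimal sketch of the weak-reference idea behind the string pool, using C++ shared_ptr/weak_ptr as a stand-in for whatever reference scheme Cayuga actually uses (StringPool, intern, and sweep are names made up for illustration): the pool's entries never keep a string alive by themselves, so canonicalized strings can still be reclaimed.

#include <memory>
#include <string>
#include <unordered_map>

// Canonical string pool: equal strings share one allocation, so equality checks
// on interned handles reduce to a pointer comparison.
class StringPool {
public:
    std::shared_ptr<const std::string> intern(const std::string& s) {
        auto it = pool_.find(s);
        if (it != pool_.end()) {
            // The pool holds only weak references, so the entry may have
            // expired; lock() tells us whether the string is still alive.
            if (auto alive = it->second.lock()) {
                return alive;
            }
        }
        auto fresh = std::make_shared<const std::string>(s);
        pool_[s] = fresh;   // weak entry: does not keep `fresh` alive on its own
        return fresh;
    }

    // Drop expired entries; in a collected system this bookkeeping would be
    // done by, or alongside, the garbage collector.
    void sweep() {
        for (auto it = pool_.begin(); it != pool_.end();) {
            it = it->second.expired() ? pool_.erase(it) : ++it;
        }
    }

private:
    std::unordered_map<std::string, std::weak_ptr<const std::string>> pool_;
};

Usage: interning "IBM" twice yields handles that compare equal as pointers, so repeated name comparisons during matching avoid a full string compare.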