Towards an Internet-Scale XML Dissemination Service
Diao, Rizvi, Franklin
publish/subscribe pub/sub xml routing dissemination
@inproceedings{diao:vldb-2004,
title={Towards an {Internet}-Scale {XML} Dissemination Service},
author={Diao, Y. and Rizvi, S. and Franklin, M.J.},
booktitle={Proceedings of the International Conference on Very Large Data Bases ({VLDB})},
pages={612--623},
year={2004}
}
From old notes:
Towards an Internet-Scale XML Dissemination Service. Diao, Rizvi, Franklin. VLDB 2004.
This paper presents a distributed publish/subscribe system based on XML messages and querying. The XPath-based query language and engine are fairly expressive, particularly compared to most publish/subscribe engines. Much of the paper focuses on optimizing the evaluation of XML messages so that as much work as possible is shared among router components as well as among queries. It also includes a scheme for propagating query generalizations and aggregations up a broadcast tree to minimize downstream traffic.
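To make the idea of propagating a generalized query upstream concrete, here is a minimal sketch. It is not the paper's algorithm: it assumes queries are simple absolute paths using only child steps, and it generalizes a set of them to their longest common prefix. The function name `generalize` is hypothetical.

```python
def generalize(paths):
    """Return the longest common step prefix of simple absolute paths.

    Assumption: each path looks like "/a/b/c" (child axis only, no
    predicates or wildcards). Real XPath generalization is far richer.
    """
    split = [p.strip("/").split("/") for p in paths]
    prefix = []
    # Walk the paths step by step and stop at the first disagreement.
    for steps in zip(*split):
        if all(s == steps[0] for s in steps):
            prefix.append(steps[0])
        else:
            break
    return "/" + "/".join(prefix)


# A router holding these three downstream queries would forward only
# the single, more general query to its parent in the broadcast tree.
downstream = ["/news/sports/score", "/news/sports/team", "/news/finance"]
upstream_query = generalize(downstream)
```

The generalized query matches a superset of what the downstream queries match, so the parent may forward some messages that the child later discards. That is the trade-off between upstream query state and downstream traffic.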
The paper neither focuses on nor gives definite details about the construction of the underlying broadcast tree. Linking of data sources and users is discussed, but not how the router overlay is created. This is arguably outside the scope of the paper, but it is an important component of such a system. In the absence of other evidence, the assumptions here seem to lean toward a fairly static, managed, infrastructure-type environment.
A scheme is presented for attaching data sources and users, as opposed to routers. However, the given scheme is fairly centralized. Although the registration service handles less traffic than the message routers, its load could still be substantial in a truly Internet-scale system. This is particularly true if users and data sources are frequently connecting and disconnecting, which is quite likely for many applications of interest (e.g. users reading/monitoring streams during work breaks). Registration is also a single point of failure.
Although it is not discussed, it seems possible that there could be redundant/alternate registration services around the network. Given the current decision making on attaching sources and users, as discussed in the next several points, these decisions do not need to be coordinated. The major factor that might have to be coordinated is load balancing across the system, but this is not currently addressed at all. Coordination of ID ranges could be done a priori. It might therefore be straightforward to introduce multiple, fairly autonomous registration points into the network.
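The a priori coordination of ID ranges suggested above could look roughly like the following sketch. Everything here is an assumption for illustration: the paper does not describe multiple registration points, and the names `partition_id_space` and `Registrar` are hypothetical.

```python
def partition_id_space(total_ids, num_registrars):
    """Split IDs 0..total_ids-1 into contiguous, disjoint ranges."""
    base, extra = divmod(total_ids, num_registrars)
    ranges = []
    start = 0
    for i in range(num_registrars):
        # The first `extra` registrars each take one additional ID.
        size = base + (1 if i < extra else 0)
        ranges.append(range(start, start + size))
        start += size
    return ranges


class Registrar:
    """Hands out IDs from its own range with no runtime coordination."""

    def __init__(self, id_range):
        self._ids = iter(id_range)

    def next_id(self):
        # Raises StopIteration once this registrar's range is exhausted.
        return next(self._ids)


ranges = partition_id_space(10, 3)
registrars = [Registrar(r) for r in ranges]
```

Because the ranges are disjoint by construction, two registrars can never issue the same ID, which is what lets them operate autonomously. Load balancing across routers would still need some shared view of the system.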
The decision making in attaching sources and users to the network also seems somewhat simplistic. Sources are assigned to routers based on data rates, router bandwidth, and topological distance. This assignment happens once, when the source joins, and is based on advertised data rates and bandwidth. Changes in the system, as well as malicious or simply incorrect advertisements, might have significant effects on the system. It's also not clear how (network) topological distance is determined. Similarly, it's not clear how criteria for users, such as general location in the routing fabric, are determined.
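A one-shot assignment of the kind described above might be sketched as follows. This is an illustration under stated assumptions, not the paper's procedure: the `Router` fields, the hop-count distance, and the tie-breaking rule are all hypothetical, and the distance is simply taken as a given input since the paper does not say how it is obtained.

```python
from dataclasses import dataclass


@dataclass
class Router:
    name: str
    spare_bandwidth: float  # advertised, same units as the source's data rate
    distance: int           # hypothetical hop count from the joining source


def assign_source(data_rate, routers):
    """Pick the nearest router whose advertised spare bandwidth fits.

    Returns None when no router can carry the advertised data rate.
    """
    feasible = [r for r in routers if r.spare_bandwidth >= data_rate]
    if not feasible:
        return None
    # Nearest first; among equally near routers prefer the most headroom.
    chosen = min(feasible, key=lambda r: (r.distance, -r.spare_bandwidth))
    # The decision is made once, from advertised values only.
    chosen.spare_bandwidth -= data_rate
    return chosen


routers = [Router("a", 5, 1), Router("b", 20, 3), Router("c", 20, 2)]
first = assign_source(10, routers)
```

The sketch makes the weakness visible: nothing revisits the choice if the source's actual rate differs from what it advertised, or if conditions change after the join.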
A good bit of time is also spent on techniques for aggregating user queries and partitioning them across the router fabric. This is a centralized process in this paper. That introduces another single point of failure and another heavily loaded service, and it hinders the system's ability to adapt to changing requests and network environments. However, it is singled out as future work to be followed up on.
Another issue that doesn't seem to be addressed by either the partitioning or registration schemes is the "Britney Spears" effect, where a few very popular sources or queries attract a disproportionate share of the load. Partitioning is based solely on the queries and data schemas, without taking into account network and processor load balancing. Registration is noted as taking into account available bandwidth and topology, not load. Although it may use a source's published data rate and the routers' published bandwidth capacities when a source is incorporated, this does not consider the source's effect on that region of the network, nor the user demands placed upon that flow. Nothing is presented to specifically spread the work for large data sources and queries across the system.