On the Feasibility of Peer-to-Peer Web Indexing and Search
Li et al
distributed hashtables p2p search keywords approximate matching
@article{li:p2p-2003,
title={On the Feasibility of Peer-to-Peer Web Indexing and Search},
author={Li, J. and Loo, B. and Hellerstein, J. and Kaashoek, M. and
Karger, D. and Morris, R.},
journal={Peer-to-Peer Systems II},
pages={207--215},
year={2003},
publisher={Springer-Verlag}
}
Notably in this context, core DHT mechanics are based around flat
labels: Content identifiers consist of exact numeric or bit string
identifiers. Those identifiers are generated either from some natural
key associated with content, e.g., a URI, or other exact
specification, e.g., hashing the content itself. DHT mechanisms rely
on having exact content identification and do not directly provide
more expressive addressing and querying models, e.g., keyword search
for content.
The most natural mechanism to provide search in a DHT stores content
at its identifier as usual. A set of extracted features of the
content, such as keywords, are then also mapped to identifiers, with
lists of associated content identifiers kept by the controlling nodes
of each. Search proceeds by mapping search terms to identifiers,
fetching the lists associated with edge, applying any query logic to
them, e.g., intersecting results, and fetching from the remaining
object identifiers. This approach, however, may introduce significant
network or storage resource consumption~\cite{li:p2p-2003} from the
replicated content pointers, and becomes significantly worse once
practical issues such as substrings and misspellings are considered.
Alternative DHT structures address these latter two points, such as
approximate matching based on metric
keyspaces~\cite{wong:hotos-2007,wong:cubit-2008}, but the basic
paradigm of replicating object pointers out to controlling nodes for
each search feature remains.