The advent of the World Wide Web in combination with efficient search engines like Google and Yahoo! has made an enormous amount of information easily available for everybody. It is generally accepted nowadays that current web search technologies work well and have made access to huge information corpora feasible. They also provide efficient ranking, thus easing separation of important results. Nevertheless, this shining image of centralized web indexing and searching is starting to blur and a number of issues question their future applicability: 1) current search engines only cover a small fraction of the documents on the Web, for example recent studies have shown that Google currently indexes less than 1 % of the total (static and dynamic) Web , 2) many web pages are rarely accessed by search engines, resulting in outdated search index contents, 3) web search providers possess the control over information, what should be indexed, how it is presented, the ranking of indexed pages and the ability to filter out material that is deemed controversial (censoring), and 4) web search is still based on exact matching techniques, calling for large scale and credible use of semantics.
These aforementioned issues can be solved by employing a P2P network of web servers, where participants are willing to share data and computing cycles, and where the control of information is completely distributed among the participating peers. In this paper, we present SOWES, a scalable approach to P2P web searching. Previous approaches to P2P web search have relied on the use of structured P2P networks, storing indexing information in DHTs [2,4]. However, scalability and support for semantics can be difficult in such systems, so instead we base our approach on unstructured P2P networks. Such systems in their basic form suffer very high search costs in terms of both consumed bandwidth and latency, so in order to be useful for real applications, more sophisticated search mechanisms are required. We solve this problem by employing semantic overlay networks (SONs) , where peers containing related information are connected together in separate overlay networks. If SONs have been created, queries can be forwarded to only those peers containing documents that satisfy the constraints of the query context, for example based on topic, user bookmarks/profiles or features extracted from previous queries.
We will in the following describe the SOWES approach to web search, based on a scalable approach for SONs creation.
A main feature of SOWES is the unsupervised, distributed and decentralized generation of SONs. Peers that contain semantically related information become part of a logical cluster (i.e., SON). Connections are created between the peers within a cluster. In this way, a query relevant to a cluster can be forwarded only to peers that belong to this cluster. Each cluster is represented by one or more special peers, henceforth named cluster representatives, which maintain connections to other cluster representatives and they are responsible for forwarding queries to the most relevant clusters, avoiding the excessive cost of flooding the entire network. Two important challenges in a P2P system employing SONs are 1) creation of SONs and 2) SON-based searching.
Creation of semantic overlays in SOWES is a multi-phase distributed process. We assume that peers store documents and have already organized themselves in an unstructured P2P network.
In the first phase, clustering takes place on local data on each peer. Clustering is performed on feature vectors (a vector of tuples where each tuple contains a feature and a weight ) extracted from documents using a feature extraction process that includes tokenization, stemming, stop-word removal and finally keeping the top- features based on their local TF/IDF values. Feature vectors are also used to represent clusters. In the subsequent phases, first 1) local zones of peers are created, by randomly selected peers (initiators), then 2) semantic clusters are formed, based on data stored on the peers in a zone. In the next steps, zones and clusters are merged recursively until global clusters are obtained.
An example of SOWES creation is illustrated in Figure 1. In the first step, local clustering is performed, for simplicity we assume in the example that only one cluster is formed at each peer and use a zone size of peers. The zone initiators in each step are identified by underlined identifiers, and the zones at each step are illustrated by lines around the peers in the zone. In the second step, a certain percentage of the peers are selected to be initiators and each creates a zone consisting of an average of peers. At this point, clusters that are similar are merged and links created between the clusters. Note that in the figure only one link is used between cluster elements (peers or existing clusters), but in general () such links will be created, to ensure fault-tolerance. Subsequently, the merged clusters will be treated as one cluster, i.e., represented by one feature vector and a list of representative peers. In the third step, a percentage of the existing zone initiators are selected to be zone initiators for the next level, and create new zones where each new zone consists of an average of zones. Again similar clusters are merged, in this example the clusters containing A and B are merged into a new cluster, the H/N-cluster is merged with the J cluster, and M with the I cluster. In the fourth step, new zones are created, but none of the clusters are similar enough to be merged. In the fifth and final step, the zone covers the entire network, F is merged with K, and G is merged with the H/N/J cluster. Thus the final result is the following clusters: A/B, C/D, K/F, E, L, G/N/J/H, and M/I. The connections between the clusters, created at the final step, are not shown in the figure.
A query for web documents originates from a querying peer . Routing is performed by directing and evaluating the query to appropriate peers and then returning matching results. In our context, query processing is performed by first determining which clusters might contain relevant data (inter-cluster routing), followed by searching one or more of these clusters (intra-cluster routing).
Inter-cluster routing refers to query routing at cluster representative level, which aims to identify similar cluster descriptions to the query. In order to limit the costly intra-cluster searching, the inter-cluster routing is performed in two steps. In the first step, a search for appropriate clusters is performed, using the links created among cluster representatives to route queries. In the second step, determines the most appropriate clusters (based on the results of step ), and selectively forwards the query for intra-cluster searching. In order to reduce the cost of subsequent searches, the cluster representatives cache the result of queries, in order to reduce the cost of intra-cluster searching for frequent queries.
In order to demonstrate the feasibility and efficiency of SOWES, we have developed a simulation environment that covers all phases of SOWES generation and searching. In the experiments a number of peers (up to 20,000) were interconnected using a topology created by the GT-ITM topology generator. We used the Reuters text collection and we generated random queries from the terms. In Figure 2, the reduction of the average number of (intra-cluster routing) messages is depicted, when caching of query results is employed for various cache sizes. Also the associated average recall is presented for different values of and top-, for a network of 8,000 peers. These premature results exhibit small recall, however we believe that better distributed clustering algorithms will improve the search results and we intend to focus on this issue in our future work. For more details on the experiments we refer to .
The main contributions of our approach are 1) an architecture for P2P web search based on an unstructured P2P network, 2) algorithms for unsupervised, distributed and decentralized construction of semantic overlays, and 3) appropriate algorithms for web search based on the use of SONs. Our work differs from super-peer approaches, since cluster representatives do not have a constant role and any peer can take their place. Furthermore, when compared to gossiping protocols, our approach guarantees that a query will reach a thematically relevant peer, no matter how many hops are required, since the created overlays cover the entire network, grouping similar peers together. In our future work, we intend to improve the quality and efficiency of SOWES.