In this paper, we propose a novel link-based similarity measure, called PageSim. Contrast to SimRank, our method can measure similarity between any two web pages, whereas SimRank cannot in some cases.

SimRank is a fixed point of the recursive definition: *two
pages are similar if they are linked to by similar pages*.
Numerically, for any web page and
, this is specified by defining
and

(1) |

for and , where denotes the set of inlink pages of , denotes the cardinality of the set. If or is empty, then is zero by definition. The SimRank iteration starts with for and for . The

Unfortunately, the result of SimRank is not convincing in some
cases. In one case, if one of two web pages has no inlink, then
the SimRank score of them is zero by definition, which means they
are not similar. However, this is not always true. For example,
in Figure 1
of section 4, has no inlink, but it is
clear that both and
have some similarity with it
for they are linked to by . In
another case (also in Figure 1), SimRank
concludes that and are not similar. In fact, obviously and indeed have some similarity, for they *link to each
other*.

PageSim can be considered as an extension of cocitation
algorithm, in which the similarity score between two web pages is
defined by the number of inlink neighbors that they have in
common. Actually, on the Web, not all links are equally
important. For example, if the only common neighbor of page
and
is the Yahoo home page [1],
whereas page and have several common neighbors from obscure places,
then which page is more similar to page , page or page ? As we know, hyperlink from web page *u* to
*v* can be considered as a recommendation of page *v*
by page *u* [2], and the
more important a web page is, the more important its
recommendation is. Evidently, the reasonable answer should be
page , since the Yahoo home page is
much more ``important". In another perspective, the action of
recommendation can be considered that page *u*
*propagates* some kind of similarity to page *v*, and
the more pages it links to, the less similarity it should
propagate to each of these pages. Therefore, it is also
reasonable to think that the Yahoo home page has some kind of
similarity with both page and page
.

Since PageRank [5] is one
of the most prominent ranking algorithm which assigns global
ranking scores to all pages on the Web, we take the PageRank
score of a web page as the *importance* (*weight* or
*similarity score*) of it in the PageSim method. The
intuitions to PageSim model is described as follows, and the
mathematical definitions will be given later.

At the beginning, each web page only contains its own
similarity score, and then each web page propagates its own
similarity score to its outlink neighbors, receiving and
propagating the similarity scores of others at the same time.
After the propagation, each page contains its own similarity
score as well as the similarity scores of others. These scores
are stored in a vector called the *similarity vector* of
this page. Then we can calculate the **PageSim score** of each
pair of pages by *summing their common similarity scores
up*.

We model the Web as a directed graph with vertices
representing web pages
and directed
edges representing hyperlinks between
web pages. Let and denote the set of *inlink* pages and
*outlink* pages of
respectively, for any . Let
denotes a sequence
of vertices
such that
and
are distinct, it is called a
**path** from to . Let denotes the **length** of path , define
. Let
denotes the set of all
possible paths from page to
.

A good evaluation of PageSim is difficult without performing extensive user studies or having a reliable external measure of similarity to compare against. In this section, we give a simple example in which PageSim is compared with SimRank to illustrate the performance of PageSim.

For a given graph , where (see Figure 1). Let . We have

The *PageSim score matrix* is

.

Let denotes the top
similar pages to page
(excluding ). Let , we
have

The *SimRank score matrix* of graph is

.

Thus, we have

We can see that the results of PageSim and SimRank are
different. First, SimRank shows that there's no page similar to
. While PageSim shows that
is most similar to
, which is more reasonable.
Because the fact that links to
implies ``considers" has
some level of similarity with it. Secondly, SimRank shows
is not similar to , while PageSim shows that is
not true. Obviously, and
are similar, for they *link
to each other*. Moreover, PageSim considers that is most similar to
. SimRank shows is most similar to
, for they have a common
inlink page . We believe PageSim is
the winner in this situation because the ``link to each other"
relationship really implies stronger similarity than that of the
``common inlink" relationship.

This paper introduces PageSim, a novel link-based similarity
measure. Based on the strategy of *PageRank score
propagation*, PageSim is capable of measuring similarity
between any two web pages.

There are numbers of avenues for future work. Foremost, we
must address the efficiency issue. For example, the computing
time of PageSim is expected to be greatly reduced by limiting the
radius of propagation, i.e., the path length of propagation.
Especially, when the radius is reduced to 1, PageSim becomes a
``*weighted* cocitation algorithm". Finally, extensive
evaluations of PageSim are needed.

[1] http://www.yahoo.com.

[2] A. Arasu, J. Cho, H. Garcia-Molina, A. Paepcke, and S. Raghavan. Searching the Web. *ACM Transactions on Internet Technology*, 1(1):2-43, 2001.

[3] J. Dean and M. R. Henzinger. Finding related pages in the World Wide Web. *Computer Networks (Amsterdam, Netherlands: 1999)*, 31(11-16):1467-1479, 1999.

[4] G. Jeh and J. Widom. SimRank: a measure of structural-context similarity. In *KDD '02: Proceedings of the eighth ACM SIGKDD*, pages 538-543, New York, NY, USA, 2002. ACM Press.

[5] L. Page, S. Brin, R. Motwani, and T. Winograd. The PageRank citation ranking: bringing order to the Web. Technical report, Stanford Digital Library Technologies Project, 1998.