(DRAFT)
Random notes on some elements of the PageRank algorithm.
by
Sergio Caldarella
PageRank is, in short, a sampling algorithm central to search engine ranking (and hence to search engine optimization, SEO) that operates by calculating averages (a normalized sum) and assigning a value to web pages; it does not contain any non-standard logical approach.
Sergey Brin and Lawrence Page, in the now classic paper The Anatomy of a Large-Scale Hypertextual Web Search Engine (1998), describe “the PageRank of a page A” (i.e., a probability distribution over web pages), giving the algorithm:
PR(A) = (1 - d) + d (PR(T1)/C(T1) + ... + PR(Tn)/C(Tn)) [1.1]
where:
• PR(A) = PageRank of page A
• PR(Ti) = PageRank of pages Ti which link to page A
• C(Ti) = number of outbound links on page Ti
• d = damping factor, which can be set between 0 and 1.
Assuming that “page A has pages T1...Tn which point to it (i.e., are citations). The parameter d is a damping factor, which can be set between 0 and 1. We usually set d to 0.85. There are more details about d in the next section. Also C(A) is defined as the number of links going out of page A.” (Brin and Page).
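To make the iteration concrete, here is a minimal sketch in Python of equation [1.1], run on a hypothetical three-page web (the page names and link structure are invented purely for illustration; d = 0.85 as in the paper):

# Minimal sketch of equation [1.1] on a hypothetical three-page web:
# A links to B and C, B links to C, C links to A. d = 0.85 as in the paper.
links_to = {"A": ["C"], "B": ["A"], "C": ["A", "B"]}  # pages Ti pointing to each page
out_count = {"A": 2, "B": 1, "C": 1}                  # C(Ti): outbound links of Ti
d = 0.85

pr = {page: 1.0 for page in links_to}                 # arbitrary starting values
for _ in range(50):                                   # iterate [1.1] until it settles
    pr = {page: (1 - d) + d * sum(pr[t] / out_count[t] for t in links_to[page])
          for page in links_to}
print(pr)                                             # C collects the most rank here

Note that each pass uses the previous pass’s values for every page at once; the values stabilize after a few dozen iterations.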
We can rewrite [1.1] in matrix form as follows: pT = pT(αS + (1 – α)E) [1.1a]
Let’s reinterpret [1.1] with PR(A) → f(x) and d = 17/20:
f(x) = 17/20 + ⟨d1, d2, d3, …, dn⟩ + C [1.1b]
with PR(pi; 0) = 1/N for t = 0.
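A companion sketch of the matrix form [1.1a] with the uniform initialization PR(pi; 0) = 1/N (numpy; the 3×3 row-stochastic matrix S below is hypothetical, and α = 0.85 is assumed):

import numpy as np

# Sketch of [1.1a]: p^T = p^T (alpha*S + (1 - alpha)*E), with a hypothetical
# row-stochastic link matrix S and a uniform teleportation matrix E.
alpha = 0.85
S = np.array([[0.0, 0.5, 0.5],   # page 1 links to pages 2 and 3
              [0.0, 0.0, 1.0],   # page 2 links to page 3
              [1.0, 0.0, 0.0]])  # page 3 links to page 1
E = np.full((3, 3), 1 / 3)       # uniform teleportation: every entry 1/N

p = np.full(3, 1 / 3)            # PR(p_i; 0) = 1/N at t = 0
for _ in range(100):             # repeated multiplication converges to the fixed point
    p = p @ (alpha * S + (1 - alpha) * E)
print(p, p.sum())                # a probability distribution: entries sum to 1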
Assuming C(A) = A, the PageRank of a page A is given as follows. The basic structure of PageRank is set to a model of user behavior in which a “random surfer” is “given a web page at random and keeps clicking on links, never hitting ‘back’ but eventually gets bored and starts on another random page. The probability that the random surfer visits a page is its PageRank. And, the d damping factor is the probability at each page the ‘random surfer’ will get bored and request another random page. One important variation is to only add the damping factor d to a single page, or a group of pages. This allows for personalization and can make it nearly impossible to deliberately mislead the system in order to get a higher ranking. We have several other extensions to PageRank, again see [Page 98].” This set the scalability target “near term to a goal of 100 million web pages”. A value larger than that will not be computed.
PR(A) = (1 - d) + d (PR(T1)/C(T1) + ... + PR(Tn)/C(Tn)) [1.2]
“PageRank or PR(A) can be calculated using a simple iterative algorithm, and corresponds to the principal eigenvector of the normalized link matrix of the web. Also, a PageRank for 26 million web pages can be computed in a few hours on a medium size workstation.”
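That “principal eigenvector” claim can be checked numerically; the sketch below (same hypothetical 3-page matrix as above) extracts the dominant left eigenvector directly and normalizes it:

import numpy as np

# The PageRank vector is the principal LEFT eigenvector of the Google
# matrix G (eigenvalue 1). Same hypothetical 3-page web as above.
alpha = 0.85
S = np.array([[0.0, 0.5, 0.5],
              [0.0, 0.0, 1.0],
              [1.0, 0.0, 0.0]])
G = alpha * S + (1 - alpha) * np.full((3, 3), 1 / 3)

vals, vecs = np.linalg.eig(G.T)   # left eigenvectors of G = right eigenvectors of G^T
top = np.argmax(np.real(vals))    # dominant eigenvalue of a stochastic matrix is 1
p = np.real(vecs[:, top])
p = p / p.sum()                   # rescale the eigenvector to a distribution
print(np.real(vals[top]), p)      # prints 1.0 and the same vector as the iteration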
As pointed out by Ian Rogers (The Google Pagerank Algorithm and How It Works), equation [1.1] has five main components, two of which, (1) and (2), define an interval:
(1) PR(Tn) – the computation defines an interval PR(T1) → PR(Tn).
(2) C(Tn) – assigns a value in an interval C(T1) → C(Tn).
(3) PR(Tn)/C(Tn) – defines a backlink range, a backlink being any link received by a “web node”, a higher hierarchical set containing web pages, directories, websites, etc. as subsets (Rogers puts it as: “If page A links out to page B, then page B is said to have a ‘backlink’ from page A.”).
(4) The factor d = 0.85 – used as a stop factor (most likely there is another stop factor, or a simple timer, not shown in the algorithm; that is why, if we perform the calculations using this rough algorithm, results may vary widely, although they are always contained in the mathematical interval defined by (1) + (2)).
(5) (1 - d) – the normalization (damping) factor (set between 0 and 1) that multiplies Ti.
Following those steps, PageRank can generate a calculated value for every URL, making it possible to determine a value for the page relevance (“page rank”) in the search engine results page (SERP). [See more at: http://www.marketingignite.com/blog/how-does-google-pagerank-work/]
From [1.1a] it follows that: p(k+1)T = p(k)T E [1.1a1], with E = ⟨d1, d2, d3, …, dn⟩ + C.
For an increment in the results from an S matrix {p1, p2, p3, …, pn}, the “noise” increases accordingly: S = E + a(1/n eT). Brin and Page added a so-called “Google matrix” to adjust the results: G = αS + (1 – α)(1/n)eeT [see 1.1a], with α = 6/10. [1.1a1] can then be rewritten as: p(k+1)T = p(k)T G [1.1a2]; applying a different α of 9/10 produces a viable implementation.
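A numerical sketch of how these pieces fit together (numpy; the 3-page link matrix and the dangling-page pattern are hypothetical, and, following the standard presentation, the raw link matrix is written H where the text writes E; a is the dangling-row indicator and e the all-ones vector; the two α values, 6/10 and 9/10, are the ones used in the text):

import numpy as np

# Hypothetical 3-page web where page 3 is dangling (no outbound links).
n = 3
H = np.array([[0.0, 0.5, 0.5],
              [0.0, 0.0, 1.0],
              [0.0, 0.0, 0.0]])            # dangling row: all zeros
a = np.array([0.0, 0.0, 1.0])              # marks the dangling page
e = np.ones(n)

S = H + np.outer(a, e) / n                 # S = H + a(1/n eT): patch dangling rows

for alpha in (6 / 10, 9 / 10):             # the two alpha values in the text
    G = alpha * S + (1 - alpha) * np.outer(e, e) / n   # G = aS + (1 - a)(1/n)eeT
    p = np.full(n, 1 / n)
    for _ in range(500):                   # [1.1a2]: p(k+1)T = p(k)T G
        p = p @ G
    print(alpha, p.round(4))               # larger alpha -> link structure dominates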
Eigenvalues are essential for the calculation, and what is fundamental here are the corresponding eigenvectors.
There are many possible ways to graphically represent the operations performed by PageRank; a matrix model is the most efficient.
To set a limit on response time, once a number x of matching entries (at the moment x = 40,000) is found, the searcher goes back to a previous step until there is a document that matches all the search terms.
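A loose sketch of that response-time cap (everything here except the x = 40,000 figure is an assumption; the document scan and the fallback step are invented for illustration):

# Hypothetical sketch of the response-time cap described above: stop
# scanning once x matching entries are found, then fall back to the
# stricter all-terms match among what was collected.
MATCH_CAP = 40_000   # x: "at the moment x = 40,000"

def capped_search(documents, terms):
    matches = []
    for doc in documents:                   # scan candidate documents
        if any(term in doc for term in terms):
            matches.append(doc)
            if len(matches) >= MATCH_CAP:   # cap reached: stop early
                break
    # fallback: prefer documents matching ALL search terms (assumed behavior)
    full_matches = [d for d in matches if all(term in d for term in terms)]
    return full_matches or matches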
---------
A few other points to be noted (from: The Basics of PageRank: What Does It Measure & How Does It Work?, February 14th, 2009):
“• PageRank values are not arithmetic. Nobody outside of Google’s upper echelons knows the formulae, but it is generally agreed that the scale is logarithmic. In other words, it takes a lot more to advance from PR4 to PR5 than it takes to advance from PR1 to PR2.
• A site’s total PageRank (the combined PageRanks of each of its component pages) is also an important measurement. More than anything else, this is determined by the number of unique pages within a site, clearly benefiting larger sites. New pages (“orphans”) should be directly linked to existing pages in order to yield any benefit for the PageRank of the overall site.
• Links to pages with no outbound links of their own (or pages that Google has not indexed) are known as “dangling” links and have little or no value.
• There are many experts who agree that outbound links that are not reciprocated can be a drain on a site’s total PageRank.
Because the Internet is constantly growing, the logarithmic scales that determine PageRank, by definition, must be continually evolving.”
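To illustrate the logarithmic-scale point numerically (the base of the scale is not public; base 8 below is purely an assumed value):

# Illustration only: with a logarithmic toolbar scale of ASSUMED base 8,
# each PR step requires roughly 8 times more "raw" PageRank than the last.
def toolbar_pr(raw_pr, base=8):
    pr = 0
    while raw_pr >= base:    # count how many factors of `base` fit in raw_pr
        raw_pr /= base
        pr += 1
    return pr

for raw in (1, 8, 64, 512, 4096):
    print(raw, toolbar_pr(raw))   # 0, 1, 2, 3, 4: PR4 -> PR5 costs far more than PR1 -> PR2

Under this assumption, advancing one toolbar step multiplies the required raw score by the base, which is exactly why the jump from PR4 to PR5 is so much harder than the jump from PR1 to PR2.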