Thursday, March 1, 2018

Notes about the "Google algorithm".

(DRAFT)
Random notes on some elements of the PageRank algorithm.
by
Sergio Caldarella

PageRank is, in short, a sampling algorithm for search engine optimization (SEO) that operates by calculating averages (normalized sum) and assigning a value to web pages and does not contain any non-standard logic approach.
Sergey Brin and Lawrence Page in the now classic paper The Anatomy of a Large-Scale Hypertextual
Web Search Engine (1998) describes “The PageRank of a page A” (i.e. the probability distribution over web pages), giving the algorithm:

PR(A) = (1-d) + d (PR(T1)/C(T1) + ... + PR(Tn)/C(Tn))                         [1.1]
where,
• PR(A) = Page Rank of page A
• PR(Ti) = Page Rank of pages Ti which link to page A
• C(Ti) = number of outbound links on page Ti
• d = damping factor which can be set between 0 and 1.

Assuming that “page A has pages T1...Tn which point to it (i.e., are citations). The parameter d is a damping factor, which can be set between 0 and 1. We usually set d to 0.85. There are more details about d in the next section. Also C(A) is defined as the number of links going out of page A.” (Brin and Lawrence).
We can rewrite [1.1] as follow: pT = pT(αS + (1 – α)E)                                        [1.1a]
Let’s reinterpret [1.1] with:
             PR(A) → f(x) d = 17/20 + d1, d2, d3, dn + C                                            [1.1b]
                                                                        with PR(pi;0) = 1/N for t = 0.
Assuming C(A) = A. The = 20 PageRank of a page A is given as follows: = 20. The basic structure of PageRank is set to a = model of=20 user behavior where “"random surfer" who is given a web = page at=20 random and keeps clicking on links, never hitting "back" but eventually = gets=20 bored and starts on another random page. The probability that the random = surfer=20 visits a page is its PageRank. And, the d damping factor is the=20 probability at each page the "random surfer" will get bored and request = another=20 random page. One important variation is to only add the damping factor = d=20 to a single page, or a group of pages. This allows for personalization = and can=20 make it nearly impossible to deliberately mislead the system in order to = get a=20 higher ranking. We have several other extensions to PageRank, again see = [Page=20 98].=20”. This set the scalability to be “=20 near term to a goal of 100 million web pages”. A value that is larger than that will not be computed.
PR(A) = 3D (1-d) + d (PR(T1)/C(T1) + ... + PR(Tn)/C(Tn)) = 20                                 [1.2]
“PageRank=20 or PR(A) can be calculated using a simple iterative algorithm, = and=20 corresponds to the principal eigenvector of the normalized link matrix = of the=20 web. Also, a PageRank for 26 million web pages can be computed in a few = hours on=20 a medium size workstation.”
As pointed out by Ian Rogers (The Google Pagerank Algorithm and How It Works), equation [1.1] has 5 main components two of which (1) & (2) define an interval:
(1) PR(Tn)               - The computation define an interval PR(T1) -> PR(Tn)
(2) C(Tn)                 - Assign a value in an interval C(T1) -> C(Tn)
(3) PR(Tn)/C(Tn)    - Defines a backlink range. Being backlinks any link received by a “web node” a higher hierarchical set containing web pages, directories, websites, etc. as subsets (Rogers put it as: “If page A links out to page B, then page B is said to have a “backlink” from page A.”)
(4) The factor d = 0.85 – Is used as stop factor (most likely there is another stop factor or a simple timer not showed in the algorithm – that’s why if we perform the calculations using this rough algorithm results may vary extremely although they are always contained in the mathematical interval that could be defined by (1) + (2)).
(5) (1 - d)               - This is the normalization (damping factor) factor (set between 0 and 1) that multiplies Ti.
Following those steps PageRank can generate a value calculated for every url, allowing to determine a value for the page relevance (“page rank”) in the search engine results page (SERP). [- See more at: http://www.marketingignite.com/blog/how-does-google-pagerank-work/#sthash.A2W6bu5y.dpuf.]
From [1.1a] follows: p(k+1)T = p(k)T E [1.1.a1] with E = d1, d2, d3, dn + C.
For an increment in the results from a S matrix {p1, p2, p3, pn} “noise” increase accordingly: S = E + a(1/neT). Brin and Page added a so called “Google matrix” to adjust the results: G= αS + (1 – α)1/neeT [see 1.1a] with α = 6/10.
[1.1a1] can be then rewritten as: p(k+1)T = p(k)T G [1.1a2], applying a different α of 9/10, producing a visible implementation.
For the calculation are eigenvalues essential and what is here fundamental are the arbitrary eigenvectors.
There are many possible ways to graphically represent the operations performed by PageRanks. A matrix model is the more efficient.
To set a limit on response time, once an x number (at the moment x = 40,000) of matching entries are found, the searcher will go to a previous step until there is a document that matches all the search terms.
---------
A few other points to be noted (from: The Basics of PageRank: What Does It Measure & How Does It Work? February 14th, 2009):
 “•PageRank values are not arithmetic. Nobody outside of Google’s upper echelons knows the formulae, but it is generally agreed that the scale is logarithmic. In other words, it takes a lot more to advance from PR4 to PR5 than it takes to advance from PR1 to PR2.
 •A site’s total PageRank (the combined PageRanks of each of its component pages) is also an important measurement. More than anything else, this is determined by the number of unique pages within a site, clearly benefiting larger sites. New pages (“orphans”) should be directly linked to existing pages in order to yield any benefit for the PageRank of the overall site.
 •Links to pages with no outbound links of their own (or pages that Google has not indexed) are known as “dangling” links and have little or no value.
 •There are many experts who agree that outbound links that are not reciprocated can be a drain on a site’s total PageRank.

Because the Internet is constantly growing, the logarithmic scales that determine PageRank, by definition, must be continually evolving.”