There are millions of HTML pages on the Net and most of them contain links (no news there). These links have a direction (obvious, too).
Some pages receive much attention and have many other pages pointing to them: software repositories, international organizations, pages written by acknowledged experts in various fields and so on. Conversely, there are pages with many outgoing links, such as resource lists.
Some clever people realized that they could build better search engines by relating the two types: a page pointed to by many good "hubs" (such as resource lists) is a good "authority", and a page pointing to many good authorities is a good hub.

This is a circular definition, but it works: iterative algorithms can take ordinary text-search results, sort things out and produce the "authorities". This is the idea behind the IBM CLEVER project, and Google does something similar.
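A minimal sketch of such an iterative scoring scheme, in the spirit of the hubs-and-authorities idea; the tiny link graph and page names below are invented for illustration and are not taken from CLEVER or Google:

```python
# Iteratively score pages as "hubs" (good pointers) and "authorities"
# (good targets). The link graph below is invented for illustration.
links = {
    "resource_list": ["repo", "standards_org", "expert_page"],
    "blog":          ["repo", "expert_page"],
    "repo":          [],
    "standards_org": ["repo"],
    "expert_page":   [],
}

hub = {p: 1.0 for p in links}
auth = {p: 1.0 for p in links}

for _ in range(50):  # a few dozen iterations are plenty for a graph this small
    # authority score: sum of the hub scores of the pages pointing to you
    auth = {p: sum(hub[q] for q in links if p in links[q]) for p in links}
    # hub score: sum of the authority scores of the pages you point to
    hub = {p: sum(auth[q] for q in links[p]) for p in links}
    # normalise so the scores do not blow up
    a_norm = sum(v * v for v in auth.values()) ** 0.5 or 1.0
    h_norm = sum(v * v for v in hub.values()) ** 0.5 or 1.0
    auth = {p: v / a_norm for p, v in auth.items()}
    hub = {p: v / h_norm for p, v in hub.items()}

print(sorted(auth.items(), key=lambda kv: -kv[1]))  # "repo" ranks highest
```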
Interestingly, these ideas can be pushed further: many concepts and applications of Graph Theory show up in the map of the Web. A directed graph is a collection of nodes (vertices) together with arcs joining some of these vertices. The Web is a huge directed graph G=(V,E), where the set V of vertices is the set of all pages and the set E of arcs corresponds to all the pointers between them.
In the figure above, page B points to C and D and is pointed to by pages A and C. Vertex B has both outdegree and indegree equal to 2.
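The same small graph can be stored as an adjacency list, from which both degrees are easy to compute; a sketch, with the page names taken from the figure:

```python
# Each page maps to the list of pages it points to.
arcs = {
    "A": ["B"],
    "B": ["C", "D"],
    "C": ["B"],
    "D": [],
}

outdegree = {v: len(targets) for v, targets in arcs.items()}
indegree = {v: 0 for v in arcs}
for targets in arcs.values():
    for t in targets:
        indegree[t] += 1

print(outdegree["B"], indegree["B"])  # 2 2
```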
The Web is certainly not a complete graph (if it were, there would be a link between any two pages!). In fact, it is a very sparse graph: the number of pointers is far smaller than the maximum possible.
The Web is not a connected graph either. Although every page is accessible to anybody by the very "definition of the Web", this does not mean that everything is reachable from everywhere just by following links (even in the reverse direction). There are isolated groups of pages with no links to or from the outside world. In fact, there are also isolated vertices, neither pointing to nor pointed to by anything, for example personal pages with just a CV and no pointers, accessible only from their known URL.
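Both properties are easy to see on a toy subgraph: compute the density (arcs present versus arcs possible) and group pages that hang together when link direction is ignored. The subgraph below is invented for illustration:

```python
from collections import deque

def weak_components(arcs):
    """Group pages that are connected when the direction of links is ignored."""
    neighbours = {v: set() for v in arcs}   # undirected view of the link graph
    for v, targets in arcs.items():
        for t in targets:
            neighbours[v].add(t)
            neighbours[t].add(v)
    seen, components = set(), []
    for start in arcs:
        if start in seen:
            continue
        comp, queue = {start}, deque([start])
        seen.add(start)
        while queue:
            v = queue.popleft()
            for w in neighbours[v]:
                if w not in seen:
                    seen.add(w)
                    comp.add(w)
                    queue.append(w)
        components.append(comp)
    return components

# an invented subgraph with an isolated CV page and a separate small cluster
arcs = {"A": ["B"], "B": ["C"], "C": ["A"], "cv_page": [], "X": ["Y"], "Y": []}
n = len(arcs)
m = sum(len(t) for t in arcs.values())
print("density:", m / (n * (n - 1)))        # far below 1: a sparse graph
print("components:", weak_components(arcs))  # three separate groups
```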
From now on, let us stay within a subgraph G' of the Web, that is, some subset of the Web pages together with the arcs that have both endpoints in this subset. This subset could be a "community", say all pages on a particular topic, all commercial pages in some domain and so on. There are complete subgraphs in the Web. A simple example is 3 pages, each pointing to or pointed to by the others (I wonder what is the maximum order found in the Web).
A nice case is when there is a rule "if I point to you and you point to X, then I will also put a link to X". This creates complete subgraphs called tournaments, with a characteristic pattern in the in- and outdegrees of the vertices (in a 3-page example, the outdegrees are 2, 1 and 0 and the indegrees 0, 1 and 2).
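A small sketch of checking that property, assuming an invented 3-page community that follows the rule above:

```python
from itertools import combinations

def is_tournament(pages, arcs):
    """True if between every pair of pages there is exactly one arc:
    one points to the other, but not both ways and not neither."""
    for u, v in combinations(pages, 2):
        forward = v in arcs.get(u, [])
        backward = u in arcs.get(v, [])
        if forward == backward:   # either no arc at all or arcs both ways
            return False
    return True

# invented 3-page example built by the "if I point to you..." rule
arcs = {"A": ["B", "C"], "B": ["C"], "C": []}
print(is_tournament(["A", "B", "C"], arcs))   # True; outdegrees 2, 1, 0
```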
A Minimum Dominating Set of G' is a smallest subset of vertices from which every other vertex in G' can be reached in a single step. This would be an economical collection of bookmarks: each page would either be in the bookmarks or be pointed to by a bookmarked page. The number of vertices in the minimum dominating set is called the domination number of G', 3 in the example below.
The Set Covering Problem (SCP) is exactly this problem. If, in addition, we do not want overlaps in the coverage (no page "covered" by more than one member of the "base"), we are facing a version called the Set Partitioning Problem (SPP). SPP would typically lead to a larger bookmarks list because it has more constraints. Put differently, some of the SPP constraints are relaxed (made looser) in SCP, so the latter can only give equal or better results.
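A common heuristic for the Set Covering view of the bookmarks problem is the greedy one: keep bookmarking the page that covers the most still-uncovered pages. A sketch on an invented 7-page community (it finds a dominating set, not necessarily a minimum one):

```python
def greedy_bookmarks(arcs):
    """Greedy Set Covering heuristic: repeatedly bookmark the page that covers
    the most not-yet-covered pages (itself plus the pages it points to)."""
    uncovered = set(arcs)
    bookmarks = []
    while uncovered:
        best = max(arcs, key=lambda p: len(({p} | set(arcs[p])) & uncovered))
        bookmarks.append(best)
        uncovered -= {best} | set(arcs[best])
    return bookmarks

# invented community of 7 pages
arcs = {
    "A": ["B", "C"], "B": [], "C": [],
    "D": ["E"], "E": [],
    "F": ["G"], "G": [],
}
print(greedy_bookmarks(arcs))   # ['A', 'D', 'F']: three bookmarks suffice
```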
What is the Shortest Path between two pages A and B in the same community that are totally unknown to each other? In a specialized community with "extravert" members, it is highly probable that there are various sequences of pages, each pointing to the next one, starting at A and ending at B.
If we take the "length" of every arc to be 1, the Shortest Path is the sequence with the minimum number of steps. In some cases this can be surprisingly short (just like in the "small-world" paradox). Arcs can also have varying lengths, say the "distance" between the hosting servers (geographical distance, average connection time etc.). In that case, the Shortest Path solution would approximate the fastest way to locate page B starting from page A.
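With unit-length arcs, breadth-first search finds such a shortest chain of links; for varying arc lengths one would use Dijkstra's algorithm instead. A sketch on an invented community:

```python
from collections import deque

def shortest_path(arcs, start, goal):
    """Shortest chain of links (each arc counts as one step) from `start`
    to `goal`, or None if `goal` is unreachable from `start`."""
    previous = {start: None}
    queue = deque([start])
    while queue:
        page = queue.popleft()
        if page == goal:
            path = []
            while page is not None:      # walk back to the start
                path.append(page)
                page = previous[page]
            return path[::-1]
        for nxt in arcs.get(page, []):
            if nxt not in previous:
                previous[nxt] = page
                queue.append(nxt)
    return None

# invented community: A reaches B in two clicks through a hub page
arcs = {"A": ["hub"], "hub": ["B", "C"], "B": [], "C": ["A"]}
print(shortest_path(arcs, "A", "B"))   # ['A', 'hub', 'B']
```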
The Shortest Spanning Tree: from all the arcs in G', select some so as to cover all vertices in a tree-like fashion, at minimum total "length". There is then a unique way from every page to every other, and there is also a root.
The root is not necessarily a page of particular importance; it is just the best place to start building the optimal tree. It is not necessarily a resource list either (it may point to only one other page, for example). Notice that there are no cycles in this collection of arcs, i.e. you cannot get back to the page you started from (this is part of the definition of a tree).
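A sketch of building such a tree with Prim's algorithm, treating the links as non-directed and using invented "lengths"; a properly directed version would need a spanning arborescence instead:

```python
import heapq

def spanning_tree(lengths, root):
    """Prim's algorithm on an undirected view of the links: grow a tree from
    `root`, always adding the cheapest edge that reaches a new page."""
    tree, seen = [], {root}
    frontier = [(d, root, v) for (u, v), d in lengths.items() if u == root]
    heapq.heapify(frontier)
    while frontier:
        d, u, v = heapq.heappop(frontier)
        if v in seen:
            continue
        seen.add(v)
        tree.append((u, v, d))
        for (a, b), w in lengths.items():
            if a == v and b not in seen:
                heapq.heappush(frontier, (w, v, b))
    return tree

# invented symmetric "distances" between the servers hosting the pages
pairs = {("A", "B"): 3, ("A", "C"): 1, ("C", "B"): 1, ("B", "D"): 2}
lengths = dict(pairs)
lengths.update({(b, a): d for (a, b), d in pairs.items()})  # undirected view
print(spanning_tree(lengths, "A"))  # [('A','C',1), ('C','B',1), ('B','D',2)]
```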
Here is another way to build a bookmarks list, this time of fixed length, say p. Every non-bookmarked page has some "distance" from the list (again in steps, or any connection-based measure): the distance to the nearest bookmarked page. If we want, again, to cover the whole community in a way that minimizes the sum of all these "distances", we are solving a minisum problem, the p-Median Problem. This problem is best understood as a way to locate facilities optimally for their customers. A practical application would be the location of mirror sites.
The minimax version is the p-Centres Problem, where we seek to minimize the maximum "distance". Again it is closely related to facility location problems, but the emphasis is more on avoiding extreme delays in serving individual customers rather than the community as a whole (for example, emergency centres).
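Both problems fit the same brute-force sketch: try every set of p bookmarks and score it either by the sum of distances (p-Median) or by the worst distance (p-Centres). The distance table below is invented, and brute force is of course only workable for tiny communities:

```python
from itertools import combinations

def best_bookmarks(dist, pages, p, objective):
    """Try every set of p bookmark pages and keep the best one under the
    given objective (sum -> p-Median, max -> p-Centres)."""
    def score(bookmarks):
        # each page is served by its nearest bookmark
        return objective(min(dist[b][q] for b in bookmarks) for q in pages)
    return min(combinations(pages, p), key=score)

# invented distance table (steps between pages) for a 5-page community
pages = ["A", "B", "C", "D", "E"]
dist = {
    "A": {"A": 0, "B": 1, "C": 2, "D": 4, "E": 5},
    "B": {"A": 1, "B": 0, "C": 1, "D": 3, "E": 4},
    "C": {"A": 2, "B": 1, "C": 0, "D": 2, "E": 3},
    "D": {"A": 4, "B": 3, "C": 2, "D": 0, "E": 1},
    "E": {"A": 5, "B": 4, "C": 3, "D": 1, "E": 0},
}
print(best_bookmarks(dist, pages, 2, sum))   # p-Median: minimise total distance
print(best_bookmarks(dist, pages, 2, max))   # p-Centres: minimise worst distance
```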
Graphs can also carry flows. An obvious analogue is network traffic between servers/vertices, with the connecting (non-directed) lines having fixed capacities. The Maximum Flow Problem finds the maximum possible traffic between two given servers; there are also special algorithms to find the maximum flow between every pair of servers/vertices. Conversely, if the traffic between two servers is set (less than or equal to the maximum one), there are various flow patterns that achieve it. Given also costs per unit of flow on the lines, we can find the Minimum Cost Flow between the two servers. These techniques are actually used in the design of networks.
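A sketch of the Maximum Flow computation using the Edmonds-Karp augmenting-path method, on an invented four-server network with symmetric line capacities:

```python
from collections import deque

def max_flow(capacity, source, sink):
    """Edmonds-Karp: repeatedly find a shortest augmenting path by BFS and
    push as much traffic as the bottleneck line allows."""
    # residual capacities; symmetric to start with because lines are non-directed
    residual = {u: dict(adj) for u, adj in capacity.items()}
    total = 0
    while True:
        # BFS for an augmenting path from source to sink
        previous = {source: None}
        queue = deque([source])
        while queue and sink not in previous:
            u = queue.popleft()
            for v, cap in residual[u].items():
                if cap > 0 and v not in previous:
                    previous[v] = u
                    queue.append(v)
        if sink not in previous:
            return total
        # find the bottleneck along the path, then update residual capacities
        path, v = [], sink
        while previous[v] is not None:
            path.append((previous[v], v))
            v = previous[v]
        bottleneck = min(residual[u][v] for u, v in path)
        for u, v in path:
            residual[u][v] -= bottleneck
            residual[v][u] = residual[v].get(u, 0) + bottleneck
        total += bottleneck

# invented server network with line capacities (traffic units)
capacity = {
    "s1": {"s2": 10, "s3": 5},
    "s2": {"s1": 10, "s4": 4},
    "s3": {"s1": 5, "s4": 8},
    "s4": {"s2": 4, "s3": 8},
}
print(max_flow(capacity, "s1", "s4"))   # 9, limited by the s1-s3 and s2-s4 lines
```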
The Graph Partitioning Problem divides the pages of a community into clusters with minimum interaction between them. There are many variations of this problem, depending on the constraints we impose: maximum cluster size, fixed number of clusters, fixed number and sizes of clusters, equal-size clusters (equipartitions) and so on. The simplest case is a partition into two clusters. If the two clusters are not required to be of equal size, this would clearly divide certain communities into "opposite opinion" groups (it is not frequent for, say, an environmental protection site to point to a hunters' club).
For the commercial sites of a specific industry in some country, meaningful clusters (i.e. clusters without much interaction between them) would be numerous and would correspond to groups of affiliated or otherwise related companies; nobody points to competitors!
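A sketch of the simplest case, finding the two-cluster split with the fewest crossing links by brute force; the tiny "community" below is invented to mimic the environmentalists-versus-hunters example:

```python
from itertools import combinations

def best_two_way_split(pages, links):
    """Try every way of choosing one cluster and keep the split with the
    fewest links crossing between the two clusters (small graphs only)."""
    def crossing(cluster):
        return sum(1 for u in links for v in links[u]
                   if (u in cluster) != (v in cluster))
    best = None
    for size in range(1, len(pages)):
        for combo in combinations(pages, size):
            cut = crossing(set(combo))
            if best is None or cut < best[0]:
                best = (cut, set(combo))
    return best

# invented community: two tightly knit groups with a single link between them
links = {
    "eco1": ["eco2"], "eco2": ["eco3"], "eco3": ["eco1"],
    "hunt1": ["hunt2", "eco1"], "hunt2": ["hunt1"],
}
print(best_two_way_split(list(links), links))  # cut of 1: the eco/hunt split
```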