Six Degrees of Wikipedia
If you've ever heard of "Six Degrees of Kevin Bacon", this is a similar concept. The goal is to link two chosen articles on Wikipedia together by following links to other articles on Wikipedia. This is slightly different since not every link is bi-directional, i.e. if article A links to article B, there isn't necessarily a link from article B to article A. Thus there needs to be a "start" article and a "goal" article.
"SDOW" is a program that searches Wikipedia for a path from a starting article to a goal article using a basic breadth first search with a few restrictions. Like any good breadth first search algorithm, it first explores all the articles linked from the start article, then link depth 1, then link depth 2, etc until it finds a link to the goal article. The path is then constructed from the found links.
Version 1 is written in Ruby and is quite slow because it crawls Wikipedia as it searches, requiring a full download of each article every time it needs to explore one. It begins by adding the start article to a queue, then repeatedly removes the first article from the queue and explores it until the queue is empty (which is unlikely to happen, since the English Wikipedia currently contains almost 1.7 million articles).
To explore an article, it is downloaded with a Ruby HTTP request, all links are extracted from the text using regular expressions, and certain links are filtered out (namely links that are not Wikipedia articles located at en.wikipedia.org, as well as a few others).
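The extract-and-filter step might look something like the following. The regular expressions and the namespace list are illustrative guesses, not the original program's:

```ruby
# Hypothetical link extraction from a downloaded page's HTML.
# Capture the article title from internal links, dropping any #anchor.
ARTICLE_HREF = %r{href="/wiki/([^"#]+)}
# Filter out non-article namespaces (the exact list is an assumption).
NON_ARTICLE  = /\A(?:File|Image|Special|Help|Talk|Wikipedia|Category|Template|Portal):/

def extract_links(html)
  html.scan(ARTICLE_HREF).flatten
      .reject { |title| title =~ NON_ARTICLE }
      .uniq
end

html = '<a href="/wiki/Kevin_Bacon">Kevin Bacon</a> ' \
       '<a href="/wiki/File:Bacon.jpg">photo</a> ' \
       '<a href="/wiki/Footloose#Plot">Footloose</a>'
extract_links(html)  # => ["Kevin_Bacon", "Footloose"]
```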
The resulting links are then compared against the set of already-checked articles (both sides of the comparison are normalized by removing any #anchor and lowercasing all characters; however, if a link is actually a redirect to the goal article, it will be missed). Links that haven't been checked yet are added to the end of the queue, as well as to the checked set. As described above, version 1 should be complete (i.e. if a path exists it will eventually find it) and optimal (i.e. the first path found is a shortest path). I suspect that for most pairs of articles there is a path between them in both directions. The path isn't always obvious, but if it exists the search will find it (within reason). There are currently around 1.7 million articles in the English version of Wikipedia. It took about 9 hours to check 60,000 articles (a depth of about 5), a rate of roughly 6,500 articles per hour.
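The normalization used for the checked-set comparison can be sketched like this (the function name and exact behavior are assumptions about the original):

```ruby
require "set"

# Hypothetical normalization applied before the checked-set comparison:
# drop any #anchor and lowercase so "Foo#History" and "foo" compare equal.
def normalize(title)
  title.split("#", 2).first.to_s.downcase
end

checked = Set.new
["Kevin_Bacon", "kevin_bacon#Career", "Kevin_Bacon#Early_life"].each do |t|
  checked.add(normalize(t))
end
checked.size  # => 1 (all three collapse to the same key)
```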
At that rate it would take about 10 days to check every article on en.wikipedia.org (assuming no duplicates arise from redirects, etc.), though of course this depends on a number of things like your internet connection speed. This version also places a further restriction on which articles are explored, which eliminates the completeness and optimality guarantees: all articles that contain more than a certain number of links are ignored. It isn't clear whether this actually speeds up the search. The Ruby source code to Version 1 is available here.
Version 2 is very similar to Version 1 but with one huge difference. Rather than crawling Wikipedia via HTTP requests, it uses a database containing all required information. A database dump of all Wikipedia article metadata, including links between articles, is stored into a MySQL database, and SQL queries are used to obtain article links. This version also eliminates the redirect issue. If an article link is a redirect, the redirects are followed until the actual article is reached.
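Redirect resolution amounts to following a chain of redirect entries until a real article is reached. Here is a sketch using a plain hash in place of the MySQL redirect table; the hash contents and hop limit are illustrative:

```ruby
# Stand-in for the redirect table: redirecting title => target title.
REDIRECTS = {
  "USA" => "United_States_of_America",
  "United_States_of_America" => "United_States"
}

# Follow redirect chains until an actual article is reached,
# with a hop limit to guard against redirect loops in the data.
def resolve(title, redirects, max_hops = 10)
  max_hops.times do
    target = redirects[title]
    return title if target.nil?  # not a redirect: a real article
    title = target
  end
  title
end

resolve("USA", REDIRECTS)  # => "United_States"
```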
Wikipedia provides a variety of data dumps that are useful to us. In particular, we need the "Base per-page data" (page.sql.gz), the "Wiki page-to-page link records" (pagelinks.sql.gz), and the "Redirect list" (redirect.sql.gz). Additionally, we need to add several indices on these three tables in order to get decent performance. Ideally we would combine all three tables into one simple table and eliminate all non-article data; however, combined, the three tables are over 4 GB, so that processing would take some time.
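The indices might look something like the following. The index names are made up, and the column names follow the MediaWiki dump schema of the time, so treat the details as assumptions rather than the exact statements used:

```sql
-- Hypothetical indices to speed up title lookups and link queries.
CREATE INDEX idx_page_title ON page (page_namespace, page_title);
CREATE INDEX idx_pl_from    ON pagelinks (pl_from);
CREATE INDEX idx_rd_from    ON redirect (rd_from);
```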
Importing the dumps and adding the indices already takes many hours on a normal personal computer. Version 2 also adds more statistics, because really, who doesn't like statistics? The Ruby source code to Version 2 is available here.

Some ideas for future versions:
- check the article title once the page has been downloaded, since the link may be a redirect (version 1 only)
- implement some sort of heuristic (e.g. if the goal article's title appears in the current page's text, increase that page's value)
- implement bidirectional search (start from both the starting article and the goal article, work towards the middle)
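The bidirectional-search idea can be sketched on a toy graph. Searching backward from the goal requires a reversed link table (who links *to* each article), which the pagelinks data would support. This is a simplified sketch that returns only the path length; a production version would need more care to guarantee the first meeting point yields a shortest path:

```ruby
# Toy forward link graph (article => articles it links to).
FORWARD = { "A" => ["B"], "B" => ["C"], "C" => ["D"], "D" => [] }

# Reverse adjacency: who links *to* each article.
BACKWARD = Hash.new { |h, k| h[k] = [] }
FORWARD.each { |from, tos| tos.each { |to| BACKWARD[to] << from } }

# Run one BFS forward from start and one backward from goal, expanding the
# smaller frontier first, and stop when the two frontiers meet.
def bidirectional_depth(forward, backward, start, goal)
  return 0 if start == goal
  front = { start => 0 }   # article => distance from start
  back  = { goal  => 0 }   # article => distance to goal
  f_queue = [start]
  b_queue = [goal]
  until f_queue.empty? && b_queue.empty?
    if !f_queue.empty? && (b_queue.empty? || f_queue.size <= b_queue.size)
      node = f_queue.shift
      (forward[node] || []).each do |nxt|
        return front[node] + 1 + back[nxt] if back.key?(nxt)
        next if front.key?(nxt)
        front[nxt] = front[node] + 1
        f_queue.push(nxt)
      end
    else
      node = b_queue.shift
      (backward[node] || []).each do |prev|
        return back[node] + 1 + front[prev] if front.key?(prev)
        next if back.key?(prev)
        back[prev] = back[node] + 1
        b_queue.push(prev)
      end
    end
  end
  nil
end

bidirectional_depth(FORWARD, BACKWARD, "A", "D")  # => 3
```

The payoff is that two searches of depth d/2 touch far fewer articles than one search of depth d, since the number of articles grows rapidly with depth.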