A Simple Ruby sitemap.xml Generator


Yesterday I finished writing a simple Ruby CLI tool called SiteMapper, whose sole purpose is to output a sitemap.xml file, a standard format understood by most popular search engines. It is available at https://github.com/okulik/sitemapper. While running the first smoke tests I realized I would benefit from some sort of visualization (beyond space-indented text logs), so I also added a way to generate a dot file, which can be converted to a .png image with the graphviz tool.

SiteMapper is a very simple explorer of static web page hierarchies. It starts from an arbitrary page you provide and descends into the site by following links until it has either traversed all available content or reached a predefined traversal depth.

Link Normalization

The main challenge in traversing links was recognizing whether a link had already been seen. Without a reliable mechanism for this, infinite traversal would be possible and we might jump from page to page, or in circles, forever, wherever the links take us. This is solved by taking raw URLs and normalizing them: each href value is expanded to a full path, fragments are discarded, and query parameters are sorted alphabetically. Let's examine the Ruby code responsible for that; a short usage sketch follows the step-by-step notes below.

get_normalized_url method
def self.get_normalized_url(host_url, resource_url)
  host_url = Addressable::URI.parse(host_url)
  resource_url = Addressable::URI.parse(resource_url)

  m = {}
  m[:scheme] = host_url.scheme unless resource_url.scheme
  unless resource_url.host
    m[:host] = host_url.host
    m[:port] = host_url.port
  end
  resource_url.merge!(m) unless m.empty?
  return nil unless SUPPORTED_SCHEMAS.include?(resource_url.scheme)
  return nil unless PublicSuffix.valid?(resource_url.host)
  resource_url.omit!(:fragment)
  unless resource_url.query.nil? || resource_url.query.empty?
    resource_url.query = resource_url.query.split("&").map(&:strip).sort.join("&")
  end

  return Addressable::URI.encode(resource_url, ::Addressable::URI).normalize
rescue Addressable::URI::InvalidURIError, TypeError
  nil
end
  1. We parse the URL string and convert it to an Addressable::URI object (addressable is a Ruby gem that serves as a replacement for the URI implementation in Ruby's standard library).
  2. The host parameter is created from the starting URL, the one we chose as the starting point of our web site quest. It, too, is converted to an Addressable::URI.
  3. If a URL is given without a scheme, often in the form //www.nisdom.com/a-simple-ruby-sitemap-xml-generator/, we take the missing scheme from the host URL. By calling merge!, we also ensure that relative URLs like /a-simple-ruby-sitemap-xml-generator get the host name and port as well.
  4. Check whether the host part of the URL is valid using the PublicSuffix gem. Since HTML can contain any kind of text, we want to separate the wheat from the chaff and keep the content we scrape as clean as possible.
  5. Remove everything to the right of the # mark (i.e. the fragment), since in most cases the page serves the same HTML content regardless. Of course, if we were dealing with the routing features of single page apps written with e.g. AngularJS, different fragments might yield different content (and different content might mean more URLs to crawl). But, as previously mentioned, SiteMapper is simple and deals only with static content.
  6. Sort the query parameters alphabetically. We don't support JavaScript, forms and whatnot, but we do handle query parameters as they are rather easy (and I get to use that nice Ruby one-liner).
  7. Finally, we encode any spaces and other characters that are not URL-safe. Addressable to the rescue once again.
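
To make these rules concrete, here is roughly how calls to the method behave. This is only a sketch: it assumes get_normalized_url lives in the UrlHelper module referenced later and that SUPPORTED_SCHEMAS covers only http and https; the results in the comments follow from the rules above rather than from captured output.

host = "https://www.nisdom.com/"

# Scheme-relative URL: the missing scheme is taken from the host,
# the fragment is dropped and the query parameters are sorted.
UrlHelper.get_normalized_url(host, "//www.nisdom.com/about?b=2&a=1#top").to_s
# => "https://www.nisdom.com/about?a=1&b=2"

# Relative path: the host name is merged in.
UrlHelper.get_normalized_url(host, "/a-simple-ruby-sitemap-xml-generator").to_s
# => "https://www.nisdom.com/a-simple-ruby-sitemap-xml-generator"

# Unsupported scheme (or an invalid host): nil is returned.
UrlHelper.get_normalized_url(host, "mailto:someone@example.com")
# => nil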

There are a couple more interesting places, and Crawler#should_crawl_page? is one of them:

should_crawl_page? method
def should_crawl_page?(host, page, depth)
  unless UrlHelper.is_url_same_domain?(host, page.path)
  ...
  if @robots && @robots.disallowed?(page.path.to_s)
  ...
  if depth >= @opts[:max_page_depth].to_i
  ...
end

When traversing from page to page, should_crawl_page? is called for each newly encountered link. It checks whether the link belongs to the same domain as the one we started with, whether the link is allowed by the robots.txt file, and whether we have reached the maximum traversal depth. A complete sketch of these checks follows the next snippet. is_url_same_domain? itself is dead simple:

is_url_same_domain? method
def self.is_url_same_domain?(host_url, resource_url)
  ...
  host_url.host == resource_url.host
end
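
For completeness, here is a minimal sketch of how the three checks might fit together in should_crawl_page?. It mirrors the elided excerpt above; the early-return style and the absence of logging are my assumptions, not necessarily how the repository implements it.

def should_crawl_page?(host, page, depth)
  # Stay within the domain we started from.
  return false unless UrlHelper.is_url_same_domain?(host, page.path)

  # Respect the site's robots.txt, if one was loaded.
  return false if @robots && @robots.disallowed?(page.path.to_s)

  # Stop once the configured traversal depth has been reached.
  return false if depth >= @opts[:max_page_depth].to_i

  true
end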

One more interesting method is is_url_already_seen?, which, once a URL has been normalized, tries to match it against previously seen URLs. If the URL has already been seen, we simply ignore that path. A filled-in sketch follows the excerpt.

is_url_already_seen? method
def is_url_already_seen?(url, depth)
  if @seen_urls[Digest::MurmurHash64B.hexdigest(url.omit(:scheme).to_s)]
  ...
end
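
Filled in, the check might look roughly like this. The MurmurHash key and the @seen_urls hash come from the excerpt above; marking the URL as seen inside the same method and the debug message are assumptions on my part.

# Digest::MurmurHash64B is provided by the digest-murmurhash gem
# (require "digest/murmurhash").
def is_url_already_seen?(url, depth)
  key = Digest::MurmurHash64B.hexdigest(url.omit(:scheme).to_s)
  if @seen_urls[key]
    LOGGER.debug "skipping #{url}, already seen (depth #{depth})"
    return true
  end
  @seen_urls[key] = true
  false
end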

Concurrent Downloads

Another point worth explaining is how pages are concurrently downloaded and processed. Since downloading pages over HTTP is mostly I/O bound, it is worthwhile to create multiple threads and hand the downloads over to them, even with MRI. For that purpose, I used the producer-consumer concurrency pattern. What follows is a step-by-step explanation of the process. The first set of code lines is extracted from the Core#start method (see https://github.com/okulik/sitemapper/blob/master/core.rb#start), which represents the main thread of execution:

concurrent downloading
urls_queue = Queue.new
pages_queue = Queue.new
seen_urls = {}
threads = []
root = nil

Thread.abort_on_exception = true
(1..@opts.scraper_threads.to_i).each_with_index do |index|
  threads << Thread.new { Scraper.new(seen_urls, urls_queue, pages_queue, index, @opts, @robots).run }
end

urls_queue.push(host: host, url: start_url, depth: 0, parent: root)
loop do
  msg = pages_queue.pop
  if msg[:page]
    msg[:page].anchors.each do |anchor|
      urls_queue.push(host: host, url: anchor, depth: msg[:depth] + 1, parent: msg[:page])
    end
    ...
  end
  ...

Here we create two queues and a number of scraper threads. The main thread communicates with the scraper threads via those two queues: when a page needs to be retrieved, it sends a message to urls_queue, and it retrieves finished page objects (created and assembled by the scraper threads) from pages_queue.

concurrent downloading
  ...
  if urls_queue.empty? && pages_queue.empty?
    until urls_queue.num_waiting == threads.size
      Thread.pass
    end
    if pages_queue.empty?
      threads.size.times { urls_queue << nil }
      break
    end
  end
end

threads.each { |thread| thread.join }

Here we try to detect whether we're done. If both queues are empty but some threads are still processing pages (i.e. not all scraper threads are blocked waiting on urls_queue), we call Thread.pass in a loop to hint to the scheduler that we're giving up our time slice, Ruby's version of sleep(0). Once all scraper threads are idle, we check once more whether there are pages waiting to be processed. If there are, we loop back to the beginning of the main loop. If there are no more pages, we send as many nil messages to urls_queue as there are scraper threads and wait for them all to finish.
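
The same shutdown idea can be demonstrated in isolation with a few lines that have nothing to do with SiteMapper's classes: the producer waits until every worker is blocked on the empty queue, then pushes one nil per worker as a finish message.

queue = Queue.new
workers = 4.times.map do
  Thread.new do
    while (job = queue.pop)   # a nil message means "we're done"
      # ... process job ...
    end
  end
end

10.times { |n| queue.push(n) }

# Wait until all workers are blocked on the empty queue, then stop them.
Thread.pass until queue.empty? && queue.num_waiting == workers.size
workers.size.times { queue.push(nil) }
workers.each(&:join)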

The scraper threads' main method is pretty simple. It pops messages containing the page URL to be processed and calls the create_page method, which fetches the HTML, parses it (using the excellent Nokogiri gem) and finally creates a page object. That object is then pushed onto pages_queue, where the main thread takes over and inserts it into the directed graph of pages. A sketch of what create_page might look like follows the loop below.

scraper thread method
loop do
  msg = @urls_queue.pop
  unless msg
    LOGGER.debug "scraper #{@index} received finish message"
    break
  end

  page = create_page(msg)

  @pages_queue.push(page: page, url: msg[:url], depth: msg[:depth], parent: msg[:parent])
end
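
create_page itself is not shown in the post, so here is a minimal sketch of what it might look like, assuming a plain Net::HTTP fetch and a hypothetical Page struct with path and anchors attributes; the actual implementation in the repository is more involved.

require "net/http"
require "nokogiri"

# Hypothetical page object; the real SiteMapper class carries more state.
Page = Struct.new(:path, :anchors)

def create_page(msg)
  url = msg[:url]
  html = Net::HTTP.get(URI(url.to_s))
  doc = Nokogiri::HTML(html)
  # Collect the href values of all anchors so the main thread can keep traversing.
  anchors = doc.css("a[href]").map { |a| a["href"] }
  Page.new(url, anchors)
rescue StandardError => e
  LOGGER.debug "failed to fetch #{url}: #{e.message}"
  nil
end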
