Apr 12, 2014

A simple Ruby sitemap.xml generator

2014-04-12 17:17 +0000

Yesterday, I completed a simple Ruby CLI tool that I’ve named SiteMapper. Its main purpose is to generate a sitemap.xml file, a format widely recognized by many popular search engines. You can find the tool at this GitHub link: https://github.com/okulik/lame-sitemapper.

During my initial tests, I realized that a visual representation would be much nicer than relying solely on space-indented text logs. As a result, I added a feature that generates a .dot file, which can then be converted into a .png image using Graphviz.
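To give a feel for the .dot output, here is a minimal sketch that is not taken from the tool itself. The to_dot helper and the links hash (page path mapped to the paths it links to) are hypothetical names used only for illustration.

```ruby
# Hypothetical helper: turns a hash of page => linked pages into DOT text.
def to_dot(links)
  lines = ["digraph sitemap {"]
  links.each do |from, targets|
    # One directed edge per link found on the page.
    targets.each { |to| lines << %(  "#{from}" -> "#{to}";) }
  end
  lines << "}"
  lines.join("\n")
end

puts to_dot("/" => ["/about", "/blog"], "/blog" => ["/blog/post-1"])
```

Saving that output as sitemap.dot and running `dot -Tpng sitemap.dot -o sitemap.png` would then render the image.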

SiteMapper essentially serves as a straightforward, static web page hierarchy explorer. It starts from a page of your choice and navigates through the web site’s structure by following links. It will continue until it has traversed all the available content or until it reaches a predefined depth limit.
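The traversal idea can be sketched on an in-memory site, with no HTTP involved. This is only an illustration: the explore method and the site hash are made up for the example, and it assumes that pages at the depth limit are recorded but their links are not followed.

```ruby
# Hypothetical depth-limited, breadth-first walk over an in-memory "site".
# site maps each page path to the paths it links to.
def explore(site, start, max_depth)
  seen = { start => 0 }          # page => depth at which it was first found
  queue = [[start, 0]]
  until queue.empty?
    url, depth = queue.shift
    next if depth >= max_depth   # depth limit reached: do not follow links
    site.fetch(url, []).each do |link|
      next if seen.key?(link)    # already discovered through another path
      seen[link] = depth + 1
      queue << [link, depth + 1]
    end
  end
  seen
end

site = { "/" => ["/a", "/b"], "/a" => ["/c"], "/b" => ["/"], "/c" => ["/d"] }
p explore(site, "/", 2)
```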

Links Normalization

The primary challenge in traversing links was determining whether a link had been visited before. Without a reliable mechanism, the crawler could get stuck in a loop, jumping from one page to another indefinitely. To tackle this issue, I implemented a method for normalizing raw URLs. This involves expanding each ‘href’ value to its full absolute URL, removing any fragment, and sorting query parameters alphabetically. Let’s take a look at the Ruby code responsible for this process.

def self.get_normalized_url(host_url, resource_url)
  host_url = Addressable::URI.parse(host_url)
  resource_url = Addressable::URI.parse(resource_url)
 
  m = {}
  m[:scheme] = host_url.scheme unless resource_url.scheme
  unless resource_url.host
    m[:host] = host_url.host
    m[:port] = host_url.port
  end
  resource_url.merge!(m) unless m.empty?
  return nil unless SUPPORTED_SCHEMAS.include?(resource_url.scheme)
  return nil unless PublicSuffix.valid?(resource_url.host)
  resource_url.omit!(:fragment)
  unless resource_url.query.nil? || resource_url.query.empty?
    resource_url.query = resource_url.query.split("&").map(&:strip).sort.join("&")
  end
 
  return Addressable::URI.encode(resource_url, ::Addressable::URI).normalize
rescue Addressable::URI::InvalidURIError, TypeError
  nil
end

We parse the URL string and convert it to an Addressable::URI object (addressable is a Ruby gem that serves as a replacement for the URI implementation in Ruby’s standard library).
The host parameter is the starting URL, the one we chose as the starting point of our web site quest. It is also converted to an Addressable::URI here.
If a URL is given without a scheme, often in the form of //www.nisdom.com/a-simple-ruby-sitemap-xml-generator/, we take the scheme from the host URL. If the host is missing too, as in /a-simple-ruby-sitemap-xml-generator, we copy the host name and port as well, so after the merge! call every URL is absolute.
We check whether the host part of the URL is valid using the PublicSuffix gem. Since HTML can contain any kind of text, we want to separate the wheat from the chaff and keep the content we scrape as clean as possible.
We remove everything to the right of the # mark (i.e. the fragment), since in most cases it points to the same HTML content. Of course, if we are dealing with the routing features of single page apps written with e.g. AngularJS, different fragments might yield different content (and different content might mean more URLs to crawl). But, as previously mentioned, SiteMapper is simple and deals only with static content.
We sort query parameters alphabetically. We don’t support JavaScript, forms and whatnot, but we do handle query parameters as they are rather easy (and I get to use that nice Ruby one-liner).
Finally, we encode any spaces and other non-URL-compatible characters. Addressable to the rescue once again.
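For readers who want to try the steps above without installing any gems, here is a rough approximation that uses only Ruby’s standard URI library. It is a sketch, not the tool’s code: normalize_url and SUPPORTED_SCHEMES are names invented for the example, and the PublicSuffix host check is left out entirely.

```ruby
require "uri"

# Assumption for this sketch: only plain web links are worth crawling.
SUPPORTED_SCHEMES = %w[http https].freeze

# Hypothetical stdlib-only counterpart of get_normalized_url.
def normalize_url(base_url, href)
  uri = URI.join(base_url, href)          # fills in missing scheme/host/port
  return nil unless SUPPORTED_SCHEMES.include?(uri.scheme)

  uri.fragment = nil                      # drop everything after '#'
  if uri.query && !uri.query.empty?
    uri.query = uri.query.split("&").map(&:strip).sort.join("&")
  end
  uri.normalize.to_s                      # lowercases scheme and host
rescue URI::Error
  nil                                     # unparsable links are ignored
end

puts normalize_url("http://example.com/blog/", "/about?b=2&a=1#team")
```

With this normalization, two links that differ only in fragment or parameter order map to the same string, which is what makes the later "already seen" check reliable.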

There are a couple more interesting places in the code, and Crawler#should_crawl_page? is one of them:

def should_crawl_page?(host, page, depth)
  unless UrlHelper.is_url_same_domain?(host, page.path)
  ...
  if @robots && @robots.disallowed?(page.path.to_s)
  ...
  if depth >= @opts[:max_page_depth].to_i
  ...
end

When traversing from page to page, should_crawl_page? is called for each newly encountered link. It checks whether the link belongs to the same domain as the one we started with, whether the link is allowed by the robots.txt file, and whether we have reached the maximum traversal depth. is_url_same_domain? is dead simple:

def self.is_url_same_domain?(host_url, resource_url)
  ...
  host_url.host == resource_url.host
end
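The three checks can be put together in a small, self-contained sketch. It is only an approximation under stated assumptions: same_host? and should_crawl? are hypothetical names, the standard URI library stands in for Addressable, and a plain list of disallowed path prefixes stands in for real robots.txt handling.

```ruby
require "uri"

# Hypothetical helper: compares only the host parts of two absolute URLs.
def same_host?(host_url, resource_url)
  URI.parse(host_url).host == URI.parse(resource_url).host
end

# Hypothetical predicate combining the three checks described above.
# disallowed_prefixes is a simplified stand-in for robots.txt rules.
def should_crawl?(host_url, page_url, depth, max_depth:, disallowed_prefixes: [])
  return false unless same_host?(host_url, page_url)

  path = URI.parse(page_url).path
  return false if disallowed_prefixes.any? { |prefix| path.start_with?(prefix) }

  depth < max_depth
end

p should_crawl?("http://example.com/", "http://example.com/blog", 1, max_depth: 3)
```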

One more interesting method is is_url_already_seen?, which takes a normalized URL and tries to match it against previously seen URLs. If the URL was already seen, we simply ignore that path.

def is_url_already_seen?(url, depth)
  if @seen_urls[Digest::MurmurHash64B.hexdigest(url.omit(:scheme).to_s)]
  ...
end
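The same idea can be sketched with the standard library alone. Note the swapped-in pieces: Digest::MD5 replaces the MurmurHash gem used above, a regular expression strips the scheme instead of Addressable’s omit, and the SeenUrls class is a hypothetical wrapper invented for this example.

```ruby
require "digest/md5"

# Hypothetical registry of visited URLs, keyed by a hash of the URL
# without its scheme, so http:// and https:// variants count as one page.
class SeenUrls
  def initialize
    @seen = {}
  end

  # Returns true if the URL was seen before; otherwise records it
  # and returns false.
  def seen?(url)
    key = Digest::MD5.hexdigest(url.sub(/\A[a-z][a-z0-9+.\-]*:/i, ""))
    return true if @seen[key]

    @seen[key] = true
    false
  end
end

registry = SeenUrls.new
p registry.seen?("http://example.com/a")
p registry.seen?("https://example.com/a")
```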

Concurrent Downloads

Another intriguing aspect worth exploring is how pages are downloaded and processed concurrently. Given that downloading pages via HTTP is predominantly I/O-bound, it’s ok to create multiple threads and delegate downloads to them, even within MRI. To accomplish this, I implemented a producer-consumer concurrency pattern. Let’s go through the process step by step. The following code snippets are extracted from the Core#start method, which runs on the main thread of execution.

urls_queue = Queue.new
pages_queue = Queue.new
seen_urls = {}
threads = []
root = nil
 
Thread.abort_on_exception = true
(1..@opts.scraper_threads.to_i).each do |index|
  threads << Thread.new { Scraper.new(seen_urls, urls_queue, pages_queue, index, @opts, @robots).run }
end
 
urls_queue.push(host: host, url: start_url, depth: 0, parent: root)
loop do
  msg = pages_queue.pop
  if msg[:page]
    msg[:page].anchors.each do |anchor|
      urls_queue.push(host: host, url: anchor, depth: msg[:depth] + 1, parent: msg[:page])
    end
    ...
  end
  ...

Here we create two queues and a set of scraper threads. The main thread interacts with the scraper threads through these two queues. When there’s a need to fetch a particular page, a message is sent to the urls_queue, and the completed page objects, which are created and assembled by the scraper threads, are obtained from the pages_queue.

  ...
  if urls_queue.empty? && pages_queue.empty?
    until urls_queue.num_waiting == threads.size
      Thread.pass
    end
    if pages_queue.empty?
      threads.size.times { urls_queue << nil }
      break
    end
  end
end
 
threads.each { |thread| thread.join }

Here we try to determine whether the crawl is complete. If both queues are empty but some scraper threads are still busy processing pages (i.e., not all of them are blocked waiting on the urls_queue), we call Thread.pass in a loop to tell the scheduler that we’re yielding our time slice - roughly Ruby’s equivalent of sleep(0). Once every scraper thread is idle, we check whether any pages arrived in the pages_queue in the meantime. If so, we go back to the top of the main loop to process them. If not, we push one nil message per scraper thread onto the urls_queue, break out of the loop, and wait for all threads to finish.
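To see the termination logic work end to end, here is a self-contained sketch that crawls an in-memory hash instead of a real web site. It is a simplification rather than the tool’s code: the crawl method and the site hash are made up for the example, and the "already seen" check lives in the main thread here to keep the example short.

```ruby
# Hypothetical producer-consumer crawl over an in-memory site.
# site maps each page path to the paths it links to.
def crawl(site, start, worker_count: 3)
  urls_queue = Queue.new
  pages_queue = Queue.new
  seen = { start => true }
  visited = []

  workers = Array.new(worker_count) do
    Thread.new do
      # A nil message tells the worker to stop.
      while (url = urls_queue.pop)
        pages_queue.push(url: url, links: site.fetch(url, []))
      end
    end
  end

  urls_queue.push(start)
  loop do
    msg = pages_queue.pop
    visited << msg[:url]
    msg[:links].each do |link|
      next if seen[link]
      seen[link] = true
      urls_queue.push(link)
    end

    next unless urls_queue.empty? && pages_queue.empty?
    # Wait until every worker is blocked on urls_queue.pop.
    Thread.pass until urls_queue.num_waiting == workers.size
    next unless pages_queue.empty?   # a page arrived in the meantime

    workers.size.times { urls_queue.push(nil) }
    break
  end

  workers.each(&:join)
  visited.sort
end

site = { "/" => ["/a", "/b"], "/a" => ["/b", "/c"], "/b" => ["/"], "/c" => [] }
p crawl(site, "/")
```

The second pages_queue.empty? check matters: a worker may push its last page between the first check and the moment it goes back to waiting, so the main thread must look again before declaring the crawl finished.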

The main method of the scraper threads is quite simple. It dequeues messages containing page URLs to be processed and invokes the create_page method, which fetches the HTML, parses it (using the excellent Nokogiri gem), and ultimately generates a page object. This object is then pushed back into the pages_queue, from where the main thread takes charge and integrates it into the directed graph of pages.

loop do
  msg = @urls_queue.pop
  unless msg
    LOGGER.debug "scraper #{@index} received finish message"
    break
  end
 
  page = create_page(msg)
 
  @pages_queue.push(page: page, url: msg[:url], depth: msg[:depth], parent: msg[:parent])
end
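The worker side can be exercised in isolation with a stubbed fetcher. This is a sketch under assumptions: the Worker class and the fetcher lambda are hypothetical stand-ins for Scraper and create_page, and no HTTP request or Nokogiri parsing takes place.

```ruby
# Hypothetical stand-in for a scraper thread's main loop.
class Worker
  def initialize(urls_queue, pages_queue, fetcher)
    @urls_queue = urls_queue
    @pages_queue = pages_queue
    @fetcher = fetcher   # callable that turns a URL into a "page"
  end

  def run
    # pop blocks until a message arrives; nil is the finish message.
    while (msg = @urls_queue.pop)
      page = @fetcher.call(msg[:url])
      @pages_queue.push(page: page, url: msg[:url], depth: msg[:depth])
    end
  end
end

urls = Queue.new
pages = Queue.new
fetcher = ->(url) { "<html>#{url}</html>" }   # stub instead of a download

thread = Thread.new { Worker.new(urls, pages, fetcher).run }
urls.push(url: "/a", depth: 0)
urls.push(nil)
thread.join

result = pages.pop
p result
```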

Conclusion

In a nutshell, the SiteMapper Ruby CLI tool makes it simple to generate sitemap.xml files. It not only simplifies web page hierarchy exploration but also offers a nice visual representation, making the process more intuitive. Here I provided a sneak peek into its inner workings, from URL normalization to concurrent downloads, which will hopefully make it a handy tool for web developers.