<?xml version="1.0" encoding="utf-8" standalone="yes"?>
<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom" xmlns:media="http://search.yahoo.com/mrss/">
	<channel>
		<title>Posts on Nisdom. As opposed to Wisdom.</title>
		<link>https://nisdom.com/posts/</link>
		<description>Recent content in Posts on Nisdom. As opposed to Wisdom.</description>
		<generator>Hugo -- 0.156.0</generator>
		<language>en-us</language>
		<lastBuildDate>Sun, 01 Feb 2026 16:31:51 +0100</lastBuildDate>
		<atom:link href="https://nisdom.com/posts/index.xml" rel="self" type="application/rss+xml" />
		
		
		<item>
			<title>Casey on AI</title>
			<link>https://nisdom.com/posts/2026-02-01-casey-on-ai/</link>
			<pubDate>Sun, 01 Feb 2026 16:31:51 +0100</pubDate><guid>https://nisdom.com/posts/2026-02-01-casey-on-ai/</guid>
			<description><![CDATA[&lt;no value&gt;]]></description><content type="text/html" mode="escaped"><![CDATA[<h2 id="do-i-hate-ai">Do I Hate AI?<a href="#do-i-hate-ai" class="anchor" aria-hidden="true"><svg xmlns="http://www.w3.org/2000/svg" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2"
      stroke-linecap="round" stroke-linejoin="round">
      <path d="M15 7h3a5 5 0 0 1 5 5 5 5 0 0 1-5 5h-3m-6 0H6a5 5 0 0 1-5-5 5 5 0 0 1 5-5h3"></path>
      <line x1="8" y1="12" x2="16" y2="12"></line>
   </svg></a></h2>
<p>Do I hate AI? No, I don&rsquo;t think I do. I think AI companies have been behaving very badly, but it&rsquo;s important to separate the different things involved. There is so much bundled together that it&rsquo;s easy to conflate unrelated issues.</p>
<p>There&rsquo;s the question of whether I want to use AI myself. In many cases, I don&rsquo;t. I don&rsquo;t like using AI, so I don&rsquo;t use it. That is not the same thing as hating it. I do hate the way AI companies have been behaving, but that is also not the same as hating AI as a technology.</p>
<p>One of the core problems with discussions about AI is that too many things are wrapped up together. It becomes very easy to paint everything with a broad brush. I try to retain some subtlety there. Since I&rsquo;ve become a semi-public personality and appear on podcasts and similar venues, I get asked about this a lot, and I&rsquo;ve always tried to give fairly nuanced answers.</p>
<p>There are many issues worth discussing, and it&rsquo;s important not to lump everything together. It&rsquo;s not helpful to say AI is all good or AI is all bad, that it&rsquo;s revolutionary or that it&rsquo;s trash. All of these questions matter and need to be considered individually.</p>
<hr>
<h2 id="asking-the-right-questions-about-ai">Asking the Right Questions About AI<a href="#asking-the-right-questions-about-ai" class="anchor" aria-hidden="true"><svg xmlns="http://www.w3.org/2000/svg" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2"
      stroke-linecap="round" stroke-linejoin="round">
      <path d="M15 7h3a5 5 0 0 1 5 5 5 5 0 0 1-5 5h-3m-6 0H6a5 5 0 0 1-5-5 5 5 0 0 1 5-5h3"></path>
      <line x1="8" y1="12" x2="16" y2="12"></line>
   </svg></a></h2>
<p>What are the things AI can do for us? What are the dangers of it? Are companies doing things that are criminal? Are companies doing things that are immoral?</p>
<p>These are all separate questions, and they should be answered separately. The actions you want to see taken depend on the answers. That&rsquo;s why it matters to talk about them individually.</p>
<p>There are plenty of people screaming in one direction or the other. Some insist everyone needs to use AI and that it&rsquo;s the future. Others insist AI is terrible and ruining the world. You don&rsquo;t really need me to repeat either of those positions. There are already enough voices doing that.</p>
<p>What I try to do instead is talk about specific things. I think it helps people realize that there are many distinct aspects here. You want to focus on bad behavior and call it out. You also want to recognize things that are probably unambiguously good.</p>
<hr>
<h2 id="an-unambiguously-good-and-bad-use">An Unambiguously Good and Bad Use<a href="#an-unambiguously-good-and-bad-use" class="anchor" aria-hidden="true"><svg xmlns="http://www.w3.org/2000/svg" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2"
      stroke-linecap="round" stroke-linejoin="round">
      <path d="M15 7h3a5 5 0 0 1 5 5 5 5 0 0 1-5 5h-3m-6 0H6a5 5 0 0 1-5-5 5 5 0 0 1 5-5h3"></path>
      <line x1="8" y1="12" x2="16" y2="12"></line>
   </svg></a></h2>
<p>Consider a hypothetical example of something that is clearly good.</p>
<p>Suppose I have a passive collection of cameras mounted in cars, like a Tesla. I don&rsquo;t use it to track customers. I don&rsquo;t sell the data. I don&rsquo;t do anything nefarious with it. I use it only to train an AI system. That AI becomes very good at emergency braking, preventing a customer from getting into a fatal car accident.</p>
<p>To me, that is an unambiguously good use of AI. Nothing was stolen. No one was manipulated. The data was used strictly to train a system that saves lives. That&rsquo;s it. We can imagine uses like this that are simply good, end of story.</p>
<p>On the other end of the spectrum, there are things that are obviously bad.</p>
<p>Some AI companies have, on the record, literally pirated people&rsquo;s materials. They didn&rsquo;t even pay for the originals used in their training data. To me, that is completely unambiguous. If a consumer did that, they would go to jail. These companies should go to jail too.</p>
<p>So at one end of the spectrum are people who should be in jail for doing thing A, and at the other end are people doing an unalloyed good thing B. Between those extremes is a huge middle ground where we need to talk about many other cases. All of those deserve discussion.</p>
<hr>
<h2 id="why-i-personally-dont-use-ai">Why I Personally Don&rsquo;t Use AI<a href="#why-i-personally-dont-use-ai" class="anchor" aria-hidden="true"><svg xmlns="http://www.w3.org/2000/svg" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2"
      stroke-linecap="round" stroke-linejoin="round">
      <path d="M15 7h3a5 5 0 0 1 5 5 5 5 0 0 1-5 5h-3m-6 0H6a5 5 0 0 1-5-5 5 5 0 0 1 5-5h3"></path>
      <line x1="8" y1="12" x2="16" y2="12"></line>
   </svg></a></h2>
<p>Personally, I don&rsquo;t currently use AI. I don&rsquo;t use it for personal reasons. The satisfaction I derive from my work comes from doing the thing myself.</p>
<p>This isn&rsquo;t unique to AI. It&rsquo;s also why I don&rsquo;t have subordinates. I don&rsquo;t manage other programmers because I don&rsquo;t derive satisfaction from telling someone else to write a program. Similarly, I don&rsquo;t enjoy having someone else do my programming for me.</p>
<p>I like participating in the discussion because it&rsquo;s important, but it affects me only tangentially. My reasons for not using AI aren&rsquo;t really about whether it works well, whether it&rsquo;s moral, or whether companies stole data. Those discussions matter for society at large, and I care about them in that sense, but they don&rsquo;t matter much to my day-to-day life because I&rsquo;m simply not going to use AI.</p>
<hr>
<h2 id="different-reasons-people-avoid-ai">Different Reasons People Avoid AI<a href="#different-reasons-people-avoid-ai" class="anchor" aria-hidden="true"><svg xmlns="http://www.w3.org/2000/svg" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2"
      stroke-linecap="round" stroke-linejoin="round">
      <path d="M15 7h3a5 5 0 0 1 5 5 5 5 0 0 1-5 5h-3m-6 0H6a5 5 0 0 1-5-5 5 5 0 0 1 5-5h3"></path>
      <line x1="8" y1="12" x2="16" y2="12"></line>
   </svg></a></h2>
<p>There are many valid reasons someone might avoid AI.</p>
<p>Some people avoid it because of the immorality of data theft. They might want AI companies prosecuted or large settlements paid to authors. Others might avoid AI simply because they think it&rsquo;s not very good. They look at the code it produces and decide it&rsquo;s not up to their standards.</p>
<p>These are very different concerns. As AI improves, people in the second group might change their minds once the output crosses a quality threshold. That threshold will probably never matter to me, because I&rsquo;m not judging AI on output quality. I don&rsquo;t enjoy using it, and if I don&rsquo;t enjoy using it, it doesn&rsquo;t matter how good the code is.</p>
<p>For me, much of this discussion just passes over my head. I don&rsquo;t really care how good the output is. I just don&rsquo;t enjoy it.</p>
<hr>
<h2 id="ai-as-an-advanced-search-and-recombination-engine">AI as an Advanced Search and Recombination Engine<a href="#ai-as-an-advanced-search-and-recombination-engine" class="anchor" aria-hidden="true"><svg xmlns="http://www.w3.org/2000/svg" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2"
      stroke-linecap="round" stroke-linejoin="round">
      <path d="M15 7h3a5 5 0 0 1 5 5 5 5 0 0 1-5 5h-3m-6 0H6a5 5 0 0 1-5-5 5 5 0 0 1 5-5h3"></path>
      <line x1="8" y1="12" x2="16" y2="12"></line>
   </svg></a></h2>
<p>AI feels to me like an advanced search and recombination engine. I understand why people are excited about it, because for many programmers, that is effectively their job. They are asked to find pieces, combine them, and make something work.</p>
<p>AI shows promise in doing that. It can search for how a service does a particular JavaScript thing, find relevant snippets, and combine them. It may not do a perfect job, but it often does enough that it&rsquo;s faster than manually searching, copying, and pasting.</p>
<p>That is genuinely useful. I understand the excitement.</p>
<p>But I don&rsquo;t want to do that job. I already didn&rsquo;t want to do that kind of work, which is another reason AI is less exciting to me personally.</p>
<hr>
<h2 id="the-workforce-risk-hollowing-out-the-pipeline">The Workforce Risk: Hollowing Out the Pipeline<a href="#the-workforce-risk-hollowing-out-the-pipeline" class="anchor" aria-hidden="true"><svg xmlns="http://www.w3.org/2000/svg" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2"
      stroke-linecap="round" stroke-linejoin="round">
      <path d="M15 7h3a5 5 0 0 1 5 5 5 5 0 0 1-5 5h-3m-6 0H6a5 5 0 0 1-5-5 5 5 0 0 1 5-5h3"></path>
      <line x1="8" y1="12" x2="16" y2="12"></line>
   </svg></a></h2>
<p>My real worry about AI is its impact on the workforce. It might outcompete junior developers without improving enough to justify its cost. Companies could then fail, leaving a gap of many years.</p>
<p>This is a very well-founded concern. If AI becomes as good as a great programmer, then we don&rsquo;t have much to worry about. We tell the AI what to do, it writes the software, and the software works.</p>
<p>The scarier world is one where it never gets there.</p>
<p>If AI becomes good enough to replace junior or intermediate programmers, but never good enough to replace experts, we end up in a dangerous situation. There are no entry-level jobs. Juniors never become experts. The existing experts age out and retire, and the pipeline is hollowed out.</p>
<p>That&rsquo;s how you get a great software crash. This is my biggest fear about AI. Not that it&rsquo;s too good, but that it&rsquo;s not quite good enough.</p>
<hr>
<h2 id="a-plausible-and-dangerous-future">A Plausible and Dangerous Future<a href="#a-plausible-and-dangerous-future" class="anchor" aria-hidden="true"><svg xmlns="http://www.w3.org/2000/svg" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2"
      stroke-linecap="round" stroke-linejoin="round">
      <path d="M15 7h3a5 5 0 0 1 5 5 5 5 0 0 1-5 5h-3m-6 0H6a5 5 0 0 1-5-5 5 5 0 0 1 5-5h3"></path>
      <line x1="8" y1="12" x2="16" y2="12"></line>
   </svg></a></h2>
<p>What makes this especially scary is that AI companies already seem comfortable flooding the market with low-quality tools. If they can make money doing that, there may be no forcing function to do the really hard remaining work - the last 10% required to make AI as good as the best programmers, not just mediocre ones.</p>
<p>If they never do that work because they don&rsquo;t have to, we&rsquo;re in serious trouble.</p>
<p>This scenario feels more plausible to me than some of the more dramatic AGI hypotheticals. A world where companies replace 50% of the workforce, extract enormous value, and then stop innovating sounds very much like patterns we&rsquo;ve already seen in Silicon Valley. It doesn&rsquo;t require artificial general intelligence or a massive breakthrough.</p>
<p>It only requires doing just enough.</p>
<p>You can imagine AI tools hollowing out junior programming, collecting money on a per-token basis, allowing companies to hire fewer engineers, and gradually weakening the entire system. Unfortunately, that sounds plausible.</p>
<p>The future is impossible to predict. I don&rsquo;t know what the most likely outcome is. But that particular scenario does sound worryingly realistic to me.</p>
]]></content>
		</item>
		
		<item>
			<title>Live Coding Exercises in the Age of Generative AI</title>
			<link>https://nisdom.com/posts/2025-01-30-ai-live-coding/</link>
			<pubDate>Thu, 30 Jan 2025 18:38:12 +0200</pubDate><guid>https://nisdom.com/posts/2025-01-30-ai-live-coding/</guid>
			<description><![CDATA[&lt;no value&gt;]]></description><content type="text/html" mode="escaped"><![CDATA[<p>Here&rsquo;s a pro-tip: if you are looking for work as a software developer, or you think you might be looking for a job soon, I suggest you turn off any AI-assisted code generation tools and do things as you used to do, at least a couple of weeks ahead.</p>
<p>Almost none of the companies hiring right now allow you to use them during live-coding sessions, and if you have been using such tools for a few months already, there&rsquo;s a high chance that your muscle memory has deteriorated. You have probably become a bit lazier with typing and increasingly ignorant of the programming languages you are using and their APIs. None of those things will do you any good in situations where you need to demonstrate you&rsquo;re a seasoned code combatant.</p>
<p>There&rsquo;s a lot of talk (and hype) around these tools, from the ones that act as much better auto-completers, to the ones claiming to increase your productivity by 30%, to those hoping to take away your job completely. Against that backdrop, the fact that these tools promote laziness over learning skills is worrying.</p>
<p>There are too many demos of senior engineers doing AI-supported &ldquo;coding&rdquo; in programming languages they are not familiar with, and spreading the message that &ldquo;you don&rsquo;t need to care what these tools produce, as long as the task gets done.&rdquo; This is quite irresponsible toward the younger generations and the whole future of software engineering. As all seasoned software professionals well know, our industry has been trying really hard to build tools, processes, communication, and cultural patterns to create environments where building increasingly reliable software becomes more and more possible.</p>
<p>The key issue here is not the tools themselves, but how we use them. When developers rely too heavily on AI assistance without understanding the underlying concepts, they risk building systems they can&rsquo;t properly maintain or debug. This creates technical debt that becomes increasingly difficult to manage as projects grow in complexity.</p>
<p>However, there&rsquo;s still a lot of value in AI-assisted coding tools, especially when they are used as learning tools. They can be used very successfully for explaining, summarizing, and documenting legacy code or any other existing, unfamiliar code. They can also help you learn how to create skeleton/POC code in a language or framework you have never used before. Imagine having a StackOverflow, but with a use case tailored just for you! You can ask about distinct features of the programming language you are using, get boilerplate code explained, or explore other, more advanced usage patterns.</p>
]]></content>
		</item>
		
		<item>
			<title>Move to fly.io</title>
			<link>https://nisdom.com/posts/2023-10-14-move-to-flyio/</link>
			<pubDate>Sat, 14 Oct 2023 22:08:32 +0200</pubDate><guid>https://nisdom.com/posts/2023-10-14-move-to-flyio/</guid>
			<description><![CDATA[&lt;no value&gt;]]></description><content type="text/html" mode="escaped"><![CDATA[<p>After more than 10 years of hosting my <a href="https://gohugo.io">hugo</a>-based blog on one of DigitalOcean&rsquo;s machines, I decided to move to a different hosting environment. For a decade, I ran this site on a DigitalOcean Ubuntu droplet with tightly secured nginx. This droplet, beyond hosting a simple static website, served as my experimentation platform — running 24/7, connected to the internet with a decent connection, and costing me only $6 a month. Over the years, it was a place for this software engineer to experiment and play. I believe that every software engineer should have the skills to create a website, serve it, set up certificates, configure DNS, and email, at a bare minimum.</p>
<p>Finally, I chose <a href="https://fly.io">fly.io</a>, a platform that utilizes <a href="https://firecracker-microvm.github.io">Firecracker</a> virtualization. For those who might not be up-to-date on this, Firecracker is a virtual machine monitor (VMM) developed by Amazon Web Services (AWS). It was initially created to replace QEMU (or a derivative they&rsquo;ve used) and power AWS Lambda and Fargate products more efficiently. Later, they open-sourced it. Firecracker uses the Linux Kernel-based Virtual Machine (KVM) to create and manage microVMs. With this tool, you can host your services using resource-efficient containers, like Docker.</p>
<p>One of the very cool aspects of fly.io is its ability to put apps to sleep when you&rsquo;re not using them. Even more impressively, it can spin them up in under a second, or even a few hundred milliseconds, even with hobby instances having just 1 shared CPU and 256 MiB of RAM!</p>
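<p>As a rough sketch of how that sleep-and-wake behavior is usually switched on, the fragment below shows the relevant part of a <code>fly.toml</code>. The app name is a placeholder and the port is an assumption based on Caddy serving plain HTTP on port 80. The accepted values for these settings have changed between flyctl versions, so check the current fly.io reference before copying it.</p>
<pre tabindex="0"><code># fly.toml (sketch; the app name is hypothetical)
app = &#34;my-hugo-blog&#34;

[http_service]
  internal_port = 80            # assumed: Caddy listening on plain HTTP inside the VM
  force_https = true            # fly.io terminates TLS at its edge
  auto_stop_machines = true     # stop the machine when there is no traffic
  auto_start_machines = true    # start it again on the next request
  min_machines_running = 0      # allow the app to scale down to zero
</code></pre>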
<p>Instead of spinning up nginx inside my container, I opted for <a href="https://caddyserver.com">Caddy</a> as it seemed like a simpler solution. Notably, Caddy has no libc dependency, and configuring it is much simpler compared to nginx. To illustrate, here&rsquo;s a Caddy configuration file for this site:</p>
<pre tabindex="0"><code>{
	auto_https off
}

http://nisdom.com {
	root * /usr/share/caddy
	file_server
}
</code></pre><p>I&rsquo;ve turned off HTTPS in the Caddy configuration since fly.io handles that for you. Even without fly.io, Caddy is capable of automatic TLS certificate renewals, eliminating the need for manual cronjobs to generate Let&rsquo;s Encrypt certificates.</p>
<p>The Dockerfile is equally simple (the <code>public</code> folder is where Hugo generates your static website, and the Caddyfile consists of the seven lines above):</p>
<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-Dockerfile" data-lang="Dockerfile"><span class="line"><span class="cl"><span class="k">FROM</span><span class="w"> </span><span class="s">caddy:2.7.5</span><span class="err">
</span></span></span><span class="line"><span class="cl"><span class="err">
</span></span></span><span class="line"><span class="cl"><span class="k">COPY</span> ./public/ /usr/share/caddy/<span class="err">
</span></span></span><span class="line"><span class="cl"><span class="k">COPY</span> ./Caddyfile /etc/caddy/Caddyfile<span class="err">
</span></span></span></code></pre></div>]]></content>
		</item>
		
		<item>
			<title>Installing mysql2 Ruby gem on MacOS</title>
			<link>https://nisdom.com/posts/2019-05-19-ruby-install-mysql2-gem/</link>
			<pubDate>Sun, 19 May 2019 19:19:51 +0200</pubDate><guid>https://nisdom.com/posts/2019-05-19-ruby-install-mysql2-gem/</guid>
			<description><![CDATA[&lt;no value&gt;]]></description><content type="text/html" mode="escaped"><![CDATA[<p>The other day I was installing the mysql2 gem on macOS for Ruby 2.6.2, something that was supposed to be a walk in the park. I knew I would most likely hit some hiccups when compiling the gem&rsquo;s native extension, but that usually and rather unglamorously boils down to finding the correct MySQL dev libraries. However, there were unexpected twists and turns.</p>
<h2 id="installing-the-mysql2-gem">Installing the mysql2 gem<a href="#installing-the-mysql2-gem" class="anchor" aria-hidden="true"><svg xmlns="http://www.w3.org/2000/svg" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2"
      stroke-linecap="round" stroke-linejoin="round">
      <path d="M15 7h3a5 5 0 0 1 5 5 5 5 0 0 1-5 5h-3m-6 0H6a5 5 0 0 1-5-5 5 5 0 0 1 5-5h3"></path>
      <line x1="8" y1="12" x2="16" y2="12"></line>
   </svg></a></h2>
<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-bash" data-lang="bash"><span class="line"><span class="cl">$ asdf shell ruby 2.6.2
</span></span><span class="line"><span class="cl">$ gem install mysql2
</span></span><span class="line"><span class="cl">Building native extensions. This could take a <span class="k">while</span>...
</span></span><span class="line"><span class="cl">ERROR:  Error installing mysql2:
</span></span><span class="line"><span class="cl">	ERROR: Failed to build gem native extension.
</span></span><span class="line"><span class="cl">...
</span></span><span class="line"><span class="cl">mysql client is missing. You may need to <span class="s1">&#39;brew install mysql&#39;</span> or <span class="s1">&#39;port install mysql&#39;</span>, and try again.
</span></span><span class="line"><span class="cl">...
</span></span></code></pre></div><p>OK, so the gem&rsquo;s build script is politely telling me that I need to first install MySQL using the Homebrew package manager. Since I don&rsquo;t need the whole database server but only the development libraries (I&rsquo;ll be using MySQL from a Docker container), I tried installing the usual <code>mysql-devel</code>.</p>
<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-bash" data-lang="bash"><span class="line"><span class="cl">$ brew install mysql-devel
</span></span><span class="line"><span class="cl">...
</span></span><span class="line"><span class="cl">Error: No available formula with the name <span class="s2">&#34;mysql-devel&#34;</span>
</span></span></code></pre></div><p>After some googling, I figured the MySQL client library was available at <a href="https://dev.mysql.com/downloads/connector/c/">this</a> mysql.com page. Luckily, there&rsquo;s already a ready-made homebrew formula <a href="https://github.com/Homebrew/homebrew-core/blob/master/Formula/mysql-connector-c.rb"><code>mysql-connector-c</code></a>.</p>
<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-bash" data-lang="bash"><span class="line"><span class="cl">$ brew install mysql-connector-c
</span></span><span class="line"><span class="cl">...
</span></span><span class="line"><span class="cl">🍺 /usr/local/Cellar/mysql-connector-c/6.1.11: <span class="m">79</span> files, 15.3MB
</span></span><span class="line"><span class="cl">$ gem install mysql2
</span></span><span class="line"><span class="cl">Building native extensions. This could take a <span class="k">while</span>...
</span></span><span class="line"><span class="cl">ERROR:  Error installing mysql2:
</span></span><span class="line"><span class="cl">	ERROR: Failed to build gem native extension.
</span></span><span class="line"><span class="cl">...
</span></span><span class="line"><span class="cl">compiling statement.c
</span></span><span class="line"><span class="cl">linking shared-object mysql2/mysql2.bundle
</span></span><span class="line"><span class="cl">ld: library not found <span class="k">for</span> -l-Wno-atomic-implicit-seq-cst
</span></span><span class="line"><span class="cl">clang: error: linker <span class="nb">command</span> failed with <span class="nb">exit</span> code <span class="m">1</span> <span class="o">(</span>use -v to see invocation<span class="o">)</span>
</span></span><span class="line"><span class="cl">make: *** <span class="o">[</span>mysql2.bundle<span class="o">]</span> Error <span class="m">1</span>
</span></span><span class="line"><span class="cl">
</span></span><span class="line"><span class="cl">make failed, <span class="nb">exit</span> code <span class="m">2</span>
</span></span></code></pre></div><p>Dafuq?!</p>
<h2 id="linker-trouble">Linker trouble<a href="#linker-trouble" class="anchor" aria-hidden="true"><svg xmlns="http://www.w3.org/2000/svg" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2"
      stroke-linecap="round" stroke-linejoin="round">
      <path d="M15 7h3a5 5 0 0 1 5 5 5 5 0 0 1-5 5h-3m-6 0H6a5 5 0 0 1-5-5 5 5 0 0 1 5-5h3"></path>
      <line x1="8" y1="12" x2="16" y2="12"></line>
   </svg></a></h2>
<p>The linker coldly reported a <code>library not found</code> error, so I had to take a longer look at the build log. I finally managed to find one rather interesting line towards the beginning of it: <code>Using mysql_config at /usr/local/bin/mysql_config</code>.</p>
<p>After reading up on how the mysql2 gem builds its <a href="https://github.com/brianmario/mysql2/blob/master/ext/mysql2/extconf.rb">native extension</a> and checking the contents of the aforementioned <a href="https://dev.mysql.com/doc/refman/5.7/en/mysql-config.html">mysql_config script</a>, I suspected something might be wrong there: the linker flag &ldquo;-l-Wno-atomic-implicit-seq-cst&rdquo; just didn&rsquo;t make any sense.</p>
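<p>If you want to check this on your own machine, the script can be asked directly which flags it reports. A quick sketch, assuming <code>mysql_config</code> is on your <code>PATH</code>; the output depends on the installed connector version, so none is shown here:</p>
<pre tabindex="0"><code>$ which mysql_config      # confirm which copy the build will pick up
$ mysql_config --libs     # linker flags the script reports
$ mysql_config --cflags   # compiler flags, for comparison
</code></pre>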
<h2 id="broken-mysql_config">Broken mysql_config<a href="#broken-mysql_config" class="anchor" aria-hidden="true"><svg xmlns="http://www.w3.org/2000/svg" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2"
      stroke-linecap="round" stroke-linejoin="round">
      <path d="M15 7h3a5 5 0 0 1 5 5 5 5 0 0 1-5 5h-3m-6 0H6a5 5 0 0 1-5-5 5 5 0 0 1 5-5h3"></path>
      <line x1="8" y1="12" x2="16" y2="12"></line>
   </svg></a></h2>
<p>A more detailed look at that config script revealed the culprit:</p>
<pre tabindex="0"><code># /usr/local/bin/mysql_config
...
# Create options 
libs=&#34;-L$pkglibdir&#34;
libs=&#34;$libs -l &#34;
embedded_libs=&#34;-L$pkglibdir&#34;
embedded_libs=&#34;$embedded_libs -l &#34;
...
</code></pre><p>The lines setting <code>libs</code> and <code>embedded_libs</code> contained errors: they were cut off after the <code>-l</code> parameter. I quickly tracked the error to the package at <a href="https://dev.mysql.com/downloads/connector/c">https://dev.mysql.com/downloads/connector/c</a>, where the problematic file was present (Homebrew&rsquo;s formula pulls the package from there). After some tinkering, I managed to produce a correct version of the <code>mysql_config</code> file:</p>
<pre tabindex="0"><code># Create options 
libs=&#34;-L$pkglibdir&#34;
libs=&#34;$libs -lmysqlclient -lcrypto -lssl&#34;
embedded_libs=&#34;-L$pkglibdir&#34;
embedded_libs=&#34;$embedded_libs -lmysqlclient -lcrypto -lssl&#34;
</code></pre><h2 id="almost-done">Almost done<a href="#almost-done" class="anchor" aria-hidden="true"><svg xmlns="http://www.w3.org/2000/svg" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2"
      stroke-linecap="round" stroke-linejoin="round">
      <path d="M15 7h3a5 5 0 0 1 5 5 5 5 0 0 1-5 5h-3m-6 0H6a5 5 0 0 1-5-5 5 5 0 0 1 5-5h3"></path>
      <line x1="8" y1="12" x2="16" y2="12"></line>
   </svg></a></h2>
<p>Trying to install the gem now ends with yet another error:</p>
<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-bash" data-lang="bash"><span class="line"><span class="cl">$ gem install mysql2
</span></span><span class="line"><span class="cl">...
</span></span><span class="line"><span class="cl">ld: library not found <span class="k">for</span> -lcrypto
</span></span></code></pre></div><p>This one is easy. We just need to tell the linker where to find the necessary <code>libcrypto</code> (part of <code>openssl</code>).</p>
<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-bash" data-lang="bash"><span class="line"><span class="cl">$ brew install openssl <span class="c1"># if you haven&#39;t done that already</span>
</span></span><span class="line"><span class="cl">$ gem install mysql2 -- --with-ldflags<span class="o">=</span>-L/usr/local/opt/openssl/lib
</span></span></code></pre></div><h2 id="alternative-solution-tldr">Alternative solution (TL;DR)<a href="#alternative-solution-tldr" class="anchor" aria-hidden="true"><svg xmlns="http://www.w3.org/2000/svg" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2"
      stroke-linecap="round" stroke-linejoin="round">
      <path d="M15 7h3a5 5 0 0 1 5 5 5 5 0 0 1-5 5h-3m-6 0H6a5 5 0 0 1-5-5 5 5 0 0 1 5-5h3"></path>
      <line x1="8" y1="12" x2="16" y2="12"></line>
   </svg></a></h2>
<p>Another solution that doesn&rsquo;t require fixing <code>mysql_config</code> is to pass all the folders to the native extension&rsquo;s build command. The <code>mysql2</code> gem will then use those settings and not the ones provided by the <code>mysql_config</code> script:</p>
<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-bash" data-lang="bash"><span class="line"><span class="cl">$ gem install mysql2 -- <span class="se">\
</span></span></span><span class="line"><span class="cl">  --with-ldflags<span class="o">=</span>-L/usr/local/opt/openssl/lib <span class="se">\
</span></span></span><span class="line"><span class="cl">                 -L/usr/local/opt/mysql-connector-c/lib -lmysqlclient -lcrypto -lssl <span class="se">\
</span></span></span><span class="line"><span class="cl">  --with-cppflags<span class="o">=</span>-I/usr/local/opt/mysql-connector-c/include
</span></span></code></pre></div>]]></content>
		</item>
		
		<item>
			<title>Aws Lambda Primer With Ruby using the RedShift, Secrets Manager and S3</title>
			<link>https://nisdom.com/posts/2019-05-14-aws-lambda-primer-with-ruby-redshift-s3/</link>
			<pubDate>Tue, 14 May 2019 21:55:22 +0200</pubDate><guid>https://nisdom.com/posts/2019-05-14-aws-lambda-primer-with-ruby-redshift-s3/</guid>
			<description><![CDATA[&lt;no value&gt;]]></description><content type="text/html" mode="escaped"><![CDATA[<p>The last time I wrote about AWS Lambda was more than four years ago, and <a href="https://nisdom.com/posts/2015-03-04-a-matter-of-logs/">that story</a> involved some batch processing with a very rough cost estimate of custom code processing vs. AWS Lambda.</p>
<p>This time I am writing about my AWS Lambda experience with the Ruby runtime, hopefully sharing a not-so-obvious thing or two.</p>
<h2 id="1-basic-scaffolding">1. Basic scaffolding<a href="#1-basic-scaffolding" class="anchor" aria-hidden="true"><svg xmlns="http://www.w3.org/2000/svg" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2"
      stroke-linecap="round" stroke-linejoin="round">
      <path d="M15 7h3a5 5 0 0 1 5 5 5 5 0 0 1-5 5h-3m-6 0H6a5 5 0 0 1-5-5 5 5 0 0 1 5-5h3"></path>
      <line x1="8" y1="12" x2="16" y2="12"></line>
   </svg></a></h2>
<p>Writing AWS Lambda functions requires you to define a static handler method. I decided to keep a <code>lambda_handler.rb</code> file in the root folder, with everything else going inside the <code>lib</code> folder. Don&rsquo;t forget to set your Lambda handler in the AWS console to <code>lambda_handler.LambdaHandler.call</code>.</p>
<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-ruby" data-lang="ruby"><span class="line"><span class="cl"><span class="c1"># frozen_string_literal: true</span>
</span></span><span class="line"><span class="cl">
</span></span><span class="line"><span class="cl"><span class="nb">require</span> <span class="s2">&#34;honeybadger&#34;</span>
</span></span><span class="line"><span class="cl"><span class="nb">require</span> <span class="s2">&#34;pg&#34;</span>
</span></span><span class="line"><span class="cl">
</span></span><span class="line"><span class="cl"><span class="n">require_relative</span> <span class="s2">&#34;utils&#34;</span>
</span></span><span class="line"><span class="cl">
</span></span><span class="line"><span class="cl"><span class="no">Honeybadger</span><span class="o">.</span><span class="n">context</span> <span class="p">\</span>
</span></span><span class="line"><span class="cl">  <span class="ss">tags</span><span class="p">:</span> <span class="s2">&#34;lambda, </span><span class="si">#{</span><span class="no">Utils</span><span class="o">.</span><span class="n">lambda_name</span><span class="si">}</span><span class="s2">&#34;</span>
</span></span><span class="line"><span class="cl">
</span></span><span class="line"><span class="cl"><span class="k">class</span> <span class="nc">LambdaHandler</span>
</span></span><span class="line"><span class="cl">  <span class="k">class</span> <span class="o">&lt;&lt;</span> <span class="nb">self</span>
</span></span><span class="line"><span class="cl">    <span class="k">def</span> <span class="nf">call</span><span class="p">(</span><span class="ss">event</span><span class="p">:,</span> <span class="ss">context</span><span class="p">:)</span>
</span></span><span class="line"><span class="cl">      <span class="o">...</span>
</span></span><span class="line"><span class="cl">    <span class="k">rescue</span> <span class="no">StandardError</span> <span class="o">=&gt;</span> <span class="n">e</span>
</span></span><span class="line"><span class="cl">      <span class="no">Honeybadger</span><span class="o">.</span><span class="n">notify</span> <span class="p">\</span>
</span></span><span class="line"><span class="cl">        <span class="n">e</span><span class="p">,</span>
</span></span><span class="line"><span class="cl">        <span class="ss">sync</span><span class="p">:</span> <span class="kp">true</span><span class="p">,</span>
</span></span><span class="line"><span class="cl">        <span class="ss">context</span><span class="p">:</span> <span class="no">Utils</span><span class="o">.</span><span class="n">lambda_to_hb_context</span><span class="p">(</span><span class="n">context</span><span class="p">)</span>
</span></span><span class="line"><span class="cl">      <span class="k">raise</span>
</span></span><span class="line"><span class="cl">    <span class="k">end</span>
</span></span><span class="line"><span class="cl">  <span class="k">end</span>
</span></span><span class="line"><span class="cl"><span class="k">end</span>
</span></span></code></pre></div><p>Already in this example, there&rsquo;s a small lesson to learn about error reporting with the <a href="https://www.honeybadger.io/">Honeybadger</a> gem. Honeybadger is smart enough to realize when it&rsquo;s being used from Rails or Sinatra. In those environments, it won&rsquo;t do anything special about executing its async notifications. In all other cases (like being used from a Ruby CLI app) it will install a so-called <a href="https://github.com/honeybadger-io/honeybadger-ruby/blob/v4.2.2/lib/honeybadger/singleton.rb#L66"><code>at_exit</code></a> hook to guarantee that all its async code is waited upon until it properly finishes. This, however, doesn&rsquo;t work with AWS Lambda. I quickly realized that regular Honeybadger notifications are sent asynchronously and were not being delivered properly from within Lambda. Luckily, <code>sync: true</code> comes to the rescue.</p>
<blockquote>
<h2 id="pro-tip-use-honeybadgernotify-sync-true--when-sending-notifications-from-aws-lambda">Pro tip: Use <code>Honeybadger.notify(..., sync: true, ...)</code> when sending notifications from AWS Lambda.<a href="#pro-tip-use-honeybadgernotify-sync-true--when-sending-notifications-from-aws-lambda" class="anchor" aria-hidden="true"><svg xmlns="http://www.w3.org/2000/svg" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2"
      stroke-linecap="round" stroke-linejoin="round">
      <path d="M15 7h3a5 5 0 0 1 5 5 5 5 0 0 1-5 5h-3m-6 0H6a5 5 0 0 1-5-5 5 5 0 0 1 5-5h3"></path>
      <line x1="8" y1="12" x2="16" y2="12"></line>
   </svg></a></h2>
</blockquote>
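<p>To make the difference concrete, here is a minimal sketch of a notifier that either queues a message or delivers it right away. <code>TinyNotifier</code> and its methods are hypothetical stand-ins written for this illustration, not Honeybadger&rsquo;s actual implementation. The assumption is that in a long-lived Lambda worker nothing triggers the final flush after each invocation, so queued messages can sit there undelivered.</p>

```ruby
# Hypothetical stand-in for an error reporter; this is NOT the Honeybadger API.
class TinyNotifier
  attr_reader :delivered

  def initialize
    @queue = Queue.new
    @delivered = []
  end

  # With sync: true the message is delivered before notify returns.
  # Otherwise it is only queued and something has to call #flush later.
  def notify(message, sync: false)
    if sync
      deliver(message)
    else
      @queue << message
    end
  end

  # The kind of work an at_exit hook would normally trigger.
  def flush
    deliver(@queue.pop) until @queue.empty?
  end

  private

  def deliver(message)
    @delivered << message
  end
end

notifier = TinyNotifier.new
notifier.notify("queued only")
notifier.notify("sent right away", sync: true)

# Only the synchronous message has been delivered at this point.
puts notifier.delivered.inspect
```

<p>Calling <code>notifier.flush</code> afterwards delivers the queued message as well, which is roughly the step that never gets a chance to run when the process is not exiting.</p>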
<h2 id="2-connecting-to-a-redshift">2. Connecting to a RedShift<a href="#2-connecting-to-a-redshift" class="anchor" aria-hidden="true"><svg xmlns="http://www.w3.org/2000/svg" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2"
      stroke-linecap="round" stroke-linejoin="round">
      <path d="M15 7h3a5 5 0 0 1 5 5 5 5 0 0 1-5 5h-3m-6 0H6a5 5 0 0 1-5-5 5 5 0 0 1 5-5h3"></path>
      <line x1="8" y1="12" x2="16" y2="12"></line>
   </svg></a></h2>
<p>RedShift is based on PostgreSQL 8.0.2 and in order to access it from Ruby, you should probably head straight for the <a href="https://rubygems.org/gems/pg/">pg</a> gem. The first problem I bumped into was that the <code>pg</code> gem&rsquo;s native extension didn&rsquo;t want to compile. My build environment uses the <code>lambci/lambda:build-ruby2.5</code> docker image from the <a href="https://github.com/lambci/docker-lambda">lambci project</a>, so fixing that was rather easy:</p>
<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-make" data-lang="make"><span class="line"><span class="cl"><span class="c"># my package build Makefile
</span></span></span><span class="line"><span class="cl"><span class="nf">docker run -v $$PWD</span><span class="o">:</span>/<span class="n">var</span>/<span class="n">task</span> -<span class="n">it</span> --<span class="n">rm</span> <span class="n">lambci</span>/<span class="n">lambda</span>:<span class="n">build</span>-<span class="n">ruby</span>2.5 \
</span></span><span class="line"><span class="cl">  /<span class="n">bin</span>/<span class="n">bash</span> -<span class="n">c</span> &#39;<span class="n">yum</span> -<span class="n">q</span> -<span class="n">y</span> <span class="n">install</span> <span class="n">postgresql</span>-<span class="n">devel</span> &amp;&amp; ...&#39;
</span></span></code></pre></div><p>However, once I uploaded the zipped package to AWS and ran a test, I got a rather funny-looking error:</p>
<blockquote>
<p>libpq.so.5: cannot open shared object file: No such file or directory - /var/task/vendor/bundle/ruby/2.5.0/extensions/x86_64-linux/2.5.0-static/pg-1.1.4/pg_ext.so</p>
</blockquote>
<p>It seems that our <code>pg</code> native extension requires yet another shared object library, namely <code>libpq.so.5</code>. In order to fetch it, I went into that docker container:</p>
<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-bash" data-lang="bash"><span class="line"><span class="cl">docker run -v <span class="sb">`</span><span class="nb">pwd</span><span class="sb">`</span>:/var/task -it --rm lambci/lambda:build-ruby2.5 /bin/bash
</span></span></code></pre></div><p>From there I installed the required PostgreSQL dev libraries, built the required dependencies and checked the extension&rsquo;s dependencies:</p>
<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-bash" data-lang="bash"><span class="line"><span class="cl">yum -y install postgresql-devel
</span></span><span class="line"><span class="cl">bundle install --without development <span class="nb">test</span> --path vendor/bundle
</span></span><span class="line"><span class="cl">readelf -d vendor/bundle/ruby/2.5.0/extensions/x86_64-linux/2.5.0-static/pg-1.1.4/pg_ext.so
</span></span><span class="line"><span class="cl">Dynamic section at offset 0x2e3f0 contains <span class="m">31</span> entries:
</span></span><span class="line"><span class="cl">  Tag        Type                         Name/Value
</span></span><span class="line"><span class="cl"> 0x0000000000000001 <span class="o">(</span>NEEDED<span class="o">)</span>             Shared library: <span class="o">[</span>libpq.so.5<span class="o">]</span>
</span></span><span class="line"><span class="cl"> 0x0000000000000001 <span class="o">(</span>NEEDED<span class="o">)</span>             Shared library: <span class="o">[</span>libpthread.so.0<span class="o">]</span>
</span></span><span class="line"><span class="cl"> 0x0000000000000001 <span class="o">(</span>NEEDED<span class="o">)</span>             Shared library: <span class="o">[</span>libgmp.so.10<span class="o">]</span>
</span></span><span class="line"><span class="cl"> 0x0000000000000001 <span class="o">(</span>NEEDED<span class="o">)</span>             Shared library: <span class="o">[</span>libdl.so.2<span class="o">]</span>
</span></span><span class="line"><span class="cl"> 0x0000000000000001 <span class="o">(</span>NEEDED<span class="o">)</span>             Shared library: <span class="o">[</span>libcrypt.so.1<span class="o">]</span>
</span></span><span class="line"><span class="cl"> 0x0000000000000001 <span class="o">(</span>NEEDED<span class="o">)</span>             Shared library: <span class="o">[</span>libm.so.6<span class="o">]</span>
</span></span><span class="line"><span class="cl"> 0x0000000000000001 <span class="o">(</span>NEEDED<span class="o">)</span>             Shared library: <span class="o">[</span>libc.so.6<span class="o">]</span>
</span></span></code></pre></div><p>Let&rsquo;s find the location of that first dependency:</p>
<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-bash" data-lang="bash"><span class="line"><span class="cl">find / -name libpq.so.5
</span></span><span class="line"><span class="cl">/usr/lib64/libpq.so.5
</span></span></code></pre></div><p>So then I had to figure out how to package that file into my Lambda and make sure the path to it is included in the <code>LD_LIBRARY_PATH</code> environment variable. Luckily, Amazon made that quite easy and there are multiple options for it. Let&rsquo;s first check some env vars from that docker image:</p>
<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-bash" data-lang="bash"><span class="line"><span class="cl"><span class="nb">echo</span> <span class="nv">$LD_LIBRARY_PATH</span>
</span></span><span class="line"><span class="cl">/var/lang/lib:/lib64:/usr/lib64:/var/runtime:/var/runtime/lib:/var/task:/var/task/lib:/opt/lib
</span></span><span class="line"><span class="cl"><span class="nb">echo</span> <span class="nv">$PATH</span>
</span></span><span class="line"><span class="cl">/var/lang/bin:/usr/local/sbin:/usr/local/bin:/usr/sbin:/usr/bin:/sbin:/bin:/opt/bin
</span></span></code></pre></div><p>It seems there are a number of places where we can put our shared objects and binaries. The easiest options are either putting <code>libpq.so.5</code> in the Ruby project&rsquo;s root folder or creating a <code>lib</code> folder in the same place and slipping it in there. If you are a bit more ambitious, you can create a separate zip package and attach it to your lambda function as an <a href="https://docs.aws.amazon.com/lambda/latest/dg/configuration-layers.html">AWS Lambda Layer</a>. Just make sure your zip file structure looks something like this:</p>
<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-bash" data-lang="bash"><span class="line"><span class="cl"><span class="c1"># layer.zip</span>
</span></span><span class="line"><span class="cl">+-- lib
</span></span><span class="line"><span class="cl">  +-- libpq.so.5
</span></span></code></pre></div><blockquote>
<h2 id="pro-tip-package-libpqso5-with-your-lambda-code-or-have-it-in-a-layer">Pro tip: package <code>libpq.so.5</code> with your lambda code or have it in a layer.<a href="#pro-tip-package-libpqso5-with-your-lambda-code-or-have-it-in-a-layer" class="anchor" aria-hidden="true"><svg xmlns="http://www.w3.org/2000/svg" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2"
      stroke-linecap="round" stroke-linejoin="round">
      <path d="M15 7h3a5 5 0 0 1 5 5 5 5 0 0 1-5 5h-3m-6 0H6a5 5 0 0 1-5-5 5 5 0 0 1 5-5h3"></path>
      <line x1="8" y1="12" x2="16" y2="12"></line>
   </svg></a></h2>
</blockquote>
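<p>A quick way to verify that the library really ended up somewhere the loader will look is to walk <code>LD_LIBRARY_PATH</code> from Ruby. This is a minimal sketch: <code>find_shared_library</code> is a hypothetical helper name, and it only checks the directories listed in <code>LD_LIBRARY_PATH</code>, not the system default locations the dynamic linker also searches.</p>

```ruby
# Returns the first full path where +name+ exists in a colon-separated
# search path, or nil when it is missing from every listed directory.
# Hypothetical helper written for this post, not part of any gem.
def find_shared_library(name, search_path = ENV.fetch("LD_LIBRARY_PATH", ""))
  search_path.split(File::PATH_SEPARATOR)
             .reject(&:empty?)
             .map { |dir| File.join(dir, name) }
             .find { |candidate| File.exist?(candidate) }
end

path = find_shared_library("libpq.so.5")
puts(path ? "libpq.so.5 found at #{path}" : "libpq.so.5 is not on LD_LIBRARY_PATH")
```

<p>Running it inside the build container after unpacking your zip into <code>/var/task</code> (or the layer into <code>/opt</code>) should tell you whether the packaging step did what you expected.</p>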
<p>The last part of the RedShift puzzle is putting your Lambda into the VPC so it can access the database. Later on this move will turn out to be a bit of a problem, but for now, all it takes is making sure the Lambda function is in the same VPC as RedShift, with subnets and security groups that allow access to port 5439 and, of course, a suitable execution role for the lambda. Here&rsquo;s what the JSON policy should look like:</p>
<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-json" data-lang="json"><span class="line"><span class="cl"><span class="p">{</span>
</span></span><span class="line"><span class="cl">    <span class="nt">&#34;Version&#34;</span><span class="p">:</span> <span class="s2">&#34;2012-10-17&#34;</span><span class="p">,</span>
</span></span><span class="line"><span class="cl">    <span class="nt">&#34;Statement&#34;</span><span class="p">:</span> <span class="p">[</span>
</span></span><span class="line"><span class="cl">        <span class="p">{</span>
</span></span><span class="line"><span class="cl">            <span class="nt">&#34;Effect&#34;</span><span class="p">:</span> <span class="s2">&#34;Allow&#34;</span><span class="p">,</span>
</span></span><span class="line"><span class="cl">            <span class="nt">&#34;Action&#34;</span><span class="p">:</span> <span class="p">[</span>
</span></span><span class="line"><span class="cl">                <span class="s2">&#34;logs:CreateLogGroup&#34;</span><span class="p">,</span>
</span></span><span class="line"><span class="cl">                <span class="s2">&#34;logs:CreateLogStream&#34;</span><span class="p">,</span>
</span></span><span class="line"><span class="cl">                <span class="s2">&#34;logs:PutLogEvents&#34;</span>
</span></span><span class="line"><span class="cl">            <span class="p">],</span>
</span></span><span class="line"><span class="cl">            <span class="nt">&#34;Resource&#34;</span><span class="p">:</span> <span class="s2">&#34;arn:aws:logs:*:*:*&#34;</span>
</span></span><span class="line"><span class="cl">        <span class="p">},</span>
</span></span><span class="line"><span class="cl">        <span class="p">{</span>
</span></span><span class="line"><span class="cl">            <span class="nt">&#34;Effect&#34;</span><span class="p">:</span> <span class="s2">&#34;Allow&#34;</span><span class="p">,</span>
</span></span><span class="line"><span class="cl">            <span class="nt">&#34;Action&#34;</span><span class="p">:</span> <span class="p">[</span>
</span></span><span class="line"><span class="cl">                <span class="s2">&#34;ec2:CreateNetworkInterface&#34;</span><span class="p">,</span>
</span></span><span class="line"><span class="cl">                <span class="s2">&#34;ec2:DescribeNetworkInterfaces&#34;</span><span class="p">,</span>
</span></span><span class="line"><span class="cl">                <span class="s2">&#34;ec2:DeleteNetworkInterface&#34;</span><span class="p">,</span>
</span></span><span class="line"><span class="cl">                <span class="s2">&#34;ec2:DescribeSecurityGroups&#34;</span><span class="p">,</span>
</span></span><span class="line"><span class="cl">                <span class="s2">&#34;ec2:DescribeSubnets&#34;</span><span class="p">,</span>
</span></span><span class="line"><span class="cl">                <span class="s2">&#34;ec2:DescribeVpcs&#34;</span>
</span></span><span class="line"><span class="cl">            <span class="p">],</span>
</span></span><span class="line"><span class="cl">            <span class="nt">&#34;Resource&#34;</span><span class="p">:</span> <span class="s2">&#34;*&#34;</span>
</span></span><span class="line"><span class="cl">        <span class="p">}</span>
</span></span><span class="line"><span class="cl">    <span class="p">]</span>
</span></span><span class="line"><span class="cl"><span class="p">}</span>
</span></span></code></pre></div><h2 id="3-accessing-the-secrets-manager-and-s3">3. Accessing the Secrets Manager and S3<a href="#3-accessing-the-secrets-manager-and-s3" class="anchor" aria-hidden="true"><svg xmlns="http://www.w3.org/2000/svg" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2"
      stroke-linecap="round" stroke-linejoin="round">
      <path d="M15 7h3a5 5 0 0 1 5 5 5 5 0 0 1-5 5h-3m-6 0H6a5 5 0 0 1-5-5 5 5 0 0 1 5-5h3"></path>
      <line x1="8" y1="12" x2="16" y2="12"></line>
   </svg></a></h2>
<p>You will probably want to keep your RedShift credentials in some encrypted storage rather than hardcoding them in your lambda code (GitHub) or keeping them in environment variables (also GitHub, via e.g. a terraforming script). A good place to keep those RedShift credentials is AWS Secrets Manager, so let&rsquo;s see what that code might look like:</p>
<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-ruby" data-lang="ruby"><span class="line"><span class="cl"><span class="c1"># frozen_string_literal: true</span>
</span></span><span class="line"><span class="cl">
</span></span><span class="line"><span class="cl"><span class="nb">require</span> <span class="s2">&#34;json&#34;</span>
</span></span><span class="line"><span class="cl"><span class="nb">require</span> <span class="s2">&#34;aws-sdk-secretsmanager&#34;</span>
</span></span><span class="line"><span class="cl">
</span></span><span class="line"><span class="cl"><span class="k">class</span> <span class="nc">SecretsManager</span>
</span></span><span class="line"><span class="cl">  <span class="k">class</span> <span class="o">&lt;&lt;</span> <span class="nb">self</span>
</span></span><span class="line"><span class="cl">    <span class="kp">attr_reader</span> <span class="ss">:db_config</span>
</span></span><span class="line"><span class="cl">
</span></span><span class="line"><span class="cl">    <span class="k">def</span> <span class="nf">init_secrets</span>
</span></span><span class="line"><span class="cl">      <span class="n">honeybadger_id</span> <span class="o">=</span> <span class="no">ENV</span><span class="o">.</span><span class="n">fetch</span><span class="p">(</span><span class="s2">&#34;HONEYBADGER_ID&#34;</span><span class="p">)</span>
</span></span><span class="line"><span class="cl">      <span class="n">hb_secret</span> <span class="o">=</span>
</span></span><span class="line"><span class="cl">        <span class="n">client</span><span class="o">.</span><span class="n">get_secret_value</span><span class="p">(</span><span class="ss">secret_id</span><span class="p">:</span> <span class="n">honeybadger_id</span><span class="p">)</span>
</span></span><span class="line"><span class="cl">      <span class="n">hb_config</span> <span class="o">=</span> <span class="no">JSON</span><span class="o">.</span><span class="n">parse</span><span class="p">(</span><span class="n">hb_secret</span><span class="o">.</span><span class="n">secret_string</span><span class="p">)</span>
</span></span><span class="line"><span class="cl">      <span class="no">ENV</span><span class="o">[</span><span class="s2">&#34;HONEYBADGER_API_KEY&#34;</span><span class="o">]</span> <span class="o">=</span> <span class="n">hb_config</span><span class="o">[</span><span class="s2">&#34;api_key&#34;</span><span class="o">]</span>
</span></span><span class="line"><span class="cl">
</span></span><span class="line"><span class="cl">      <span class="n">redshift_id</span> <span class="o">=</span> <span class="no">ENV</span><span class="o">.</span><span class="n">fetch</span><span class="p">(</span><span class="s2">&#34;REDSHIFT_CREDENTIALS_SECRET&#34;</span><span class="p">)</span>
</span></span><span class="line"><span class="cl">      <span class="n">redshift_secret</span> <span class="o">=</span>
</span></span><span class="line"><span class="cl">        <span class="n">client</span><span class="o">.</span><span class="n">get_secret_value</span><span class="p">(</span><span class="ss">secret_id</span><span class="p">:</span> <span class="n">redshift_id</span><span class="p">)</span>
</span></span><span class="line"><span class="cl">      <span class="vi">@db_config</span> <span class="o">=</span>
</span></span><span class="line"><span class="cl">        <span class="no">JSON</span><span class="o">.</span><span class="n">parse</span><span class="p">(</span><span class="n">redshift_secret</span><span class="o">.</span><span class="n">secret_string</span><span class="p">,</span> <span class="ss">symbolize_names</span><span class="p">:</span> <span class="kp">true</span><span class="p">)</span>
</span></span><span class="line"><span class="cl">    <span class="k">end</span>
</span></span><span class="line"><span class="cl">    
</span></span><span class="line"><span class="cl">    <span class="k">def</span> <span class="nf">client</span>
</span></span><span class="line"><span class="cl">      <span class="vi">@client</span> <span class="o">||=</span>
</span></span><span class="line"><span class="cl">        <span class="no">Aws</span><span class="o">::</span><span class="no">SecretsManager</span><span class="o">::</span><span class="no">Client</span><span class="o">.</span><span class="n">new</span> <span class="p">\</span>
</span></span><span class="line"><span class="cl">          <span class="ss">region</span><span class="p">:</span> <span class="no">ENV</span><span class="o">.</span><span class="n">fetch</span><span class="p">(</span><span class="s2">&#34;AWS_REGION&#34;</span><span class="p">,</span> <span class="s2">&#34;us-east-1&#34;</span><span class="p">)</span>
</span></span><span class="line"><span class="cl">    <span class="k">end</span>
</span></span><span class="line"><span class="cl">  <span class="k">end</span>
</span></span><span class="line"><span class="cl"><span class="k">end</span>
</span></span></code></pre></div><p>However, I quickly realized that once I ran the code above, my lambda started to time out.</p>
<p>Long story short (in reality it was a very long and painful debugging session), once I decided to put my lambda within the VPC, it lost access to the internet. Since AWS Secrets Manager is accessed via the internet, and my VPC didn&rsquo;t have a NAT Gateway associated with it, I was in trouble.</p>
<p>Luckily, there is a workaround for this called an interface endpoint, which can be found under the VPC settings. Check <a href="https://aws.amazon.com/blogs/compute/sharing-secrets-with-aws-lambda-using-aws-systems-manager-parameter-store/">this article</a> for further details.</p>
<p>Once I got the AWS Secrets Manager code running, I ran into the same issue when accessing S3. The S3 service is also not accessible from within the VPC unless you either have a NAT Gateway or define another endpoint, this time of the <a href="https://docs.aws.amazon.com/vpc/latest/userguide/vpce-gateway.html">gateway type</a>.</p>
<blockquote>
<h2 id="pro-tip-accessing-secrets-manager-requires-a-nat-gateway-using-public-internet-or-interface-endpoint-preferable-once-you-put-lambda-inside-the-vpc">Pro tip: accessing Secrets Manager requires a NAT Gateway (using public internet) or interface Endpoint (preferable) once you put lambda inside the VPC<a href="#pro-tip-accessing-secrets-manager-requires-a-nat-gateway-using-public-internet-or-interface-endpoint-preferable-once-you-put-lambda-inside-the-vpc" class="anchor" aria-hidden="true"><svg xmlns="http://www.w3.org/2000/svg" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2"
      stroke-linecap="round" stroke-linejoin="round">
      <path d="M15 7h3a5 5 0 0 1 5 5 5 5 0 0 1-5 5h-3m-6 0H6a5 5 0 0 1-5-5 5 5 0 0 1 5-5h3"></path>
      <line x1="8" y1="12" x2="16" y2="12"></line>
   </svg></a></h2>
</blockquote>
<p>Check <a href="https://docs.aws.amazon.com/lambda/latest/dg/vpc.html">Lambda VPC docs</a> for some more sensible bits of advice on the subject.</p>
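<p>In hindsight, a tiny connectivity probe would have shortened that debugging session considerably. The sketch below uses only Ruby&rsquo;s standard library; <code>reachable?</code> and <code>aws_endpoint_host</code> are hypothetical helper names, and the host name pattern assumes the standard regional endpoints. The demonstration at the bottom probes a local port so it runs anywhere without network access.</p>

```ruby
require "socket"

# Returns true when a TCP connection to host:port opens within +timeout+
# seconds, false when it is refused, unreachable, unresolvable or times out.
# Hypothetical helper written for this post.
def reachable?(host, port = 443, timeout: 3)
  Socket.tcp(host, port, connect_timeout: timeout) { true }
rescue SocketError, SystemCallError
  false
end

# Regional endpoint host name, e.g. "secretsmanager.us-east-1.amazonaws.com".
# Assumes the standard public endpoint naming.
def aws_endpoint_host(service, region = ENV.fetch("AWS_REGION", "us-east-1"))
  "#{service}.#{region}.amazonaws.com"
end

# Local demonstration that needs no network:
# probe a listening port, then the same port after it has been closed.
server = TCPServer.new("127.0.0.1", 0)
port = server.addr[1]
puts "listening port reachable: #{reachable?('127.0.0.1', port, timeout: 1)}"
server.close
puts "closed port reachable:    #{reachable?('127.0.0.1', port, timeout: 1)}"
```

<p>Inside the lambda you would call something like <code>reachable?(aws_endpoint_host("secretsmanager"))</code> and log the result, so that a missing endpoint or NAT Gateway shows up as a clear log line instead of a generic timeout.</p>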
<h2 id="4-reusing-the-database-connection">4. Reusing the database connection<a href="#4-reusing-the-database-connection" class="anchor" aria-hidden="true"><svg xmlns="http://www.w3.org/2000/svg" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2"
      stroke-linecap="round" stroke-linejoin="round">
      <path d="M15 7h3a5 5 0 0 1 5 5 5 5 0 0 1-5 5h-3m-6 0H6a5 5 0 0 1-5-5 5 5 0 0 1 5-5h3"></path>
      <line x1="8" y1="12" x2="16" y2="12"></line>
   </svg></a></h2>
<p>Reuse that single database connection between different lambda handler invocations. The Lambda Ruby runtime calls your handler in a loop, synchronously and never in parallel, so there&rsquo;s no need for any connection pooling; just make sure to reuse that one connection properly. Here&rsquo;s an example of how to do it:</p>
<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-ruby" data-lang="ruby"><span class="line"><span class="cl"><span class="c1"># frozen_string_literal: true</span>
</span></span><span class="line"><span class="cl">
</span></span><span class="line"><span class="cl"><span class="nb">require</span> <span class="s2">&#34;pg&#34;</span>
</span></span><span class="line"><span class="cl"><span class="nb">require</span> <span class="s2">&#34;retryable&#34;</span>
</span></span><span class="line"><span class="cl">
</span></span><span class="line"><span class="cl"><span class="k">class</span> <span class="nc">DatabaseHelper</span>
</span></span><span class="line"><span class="cl">  <span class="o">...</span>
</span></span><span class="line"><span class="cl">  <span class="k">def</span> <span class="nf">run</span>
</span></span><span class="line"><span class="cl">    <span class="no">Retryable</span><span class="o">.</span><span class="n">retryable</span><span class="p">(</span>
</span></span><span class="line"><span class="cl">      <span class="ss">tries</span><span class="p">:</span> <span class="mi">3</span><span class="p">,</span> <span class="ss">on</span><span class="p">:</span> <span class="no">PG</span><span class="o">::</span><span class="no">ConnectionBad</span>
</span></span><span class="line"><span class="cl">    <span class="p">)</span> <span class="k">do</span> <span class="o">|</span><span class="n">retries</span><span class="p">,</span> <span class="n">_</span><span class="o">|</span>
</span></span><span class="line"><span class="cl">      <span class="nb">puts</span> <span class="s2">&#34;db connection error, retry </span><span class="si">#{</span><span class="n">retries</span><span class="si">}</span><span class="s2">&#34;</span> <span class="k">if</span> <span class="n">retries</span><span class="o">.</span><span class="n">positive?</span>
</span></span><span class="line"><span class="cl">      <span class="n">db_conn</span> <span class="o">=</span> <span class="nb">self</span><span class="o">.</span><span class="n">class</span><span class="o">.</span><span class="n">connection</span><span class="p">(</span><span class="ss">force</span><span class="p">:</span> <span class="n">retries</span><span class="o">.</span><span class="n">positive?</span><span class="p">)</span>
</span></span><span class="line"><span class="cl">      <span class="n">db_conn</span><span class="o">.</span><span class="n">transaction</span> <span class="k">do</span> <span class="o">|</span><span class="n">conn</span><span class="o">|</span>
</span></span><span class="line"><span class="cl">        <span class="n">do_crazy_stuff</span><span class="p">(</span><span class="n">conn</span><span class="p">)</span>
</span></span><span class="line"><span class="cl">      <span class="k">end</span>
</span></span><span class="line"><span class="cl">    <span class="k">end</span>
</span></span><span class="line"><span class="cl">  <span class="k">rescue</span> <span class="no">PG</span><span class="o">::</span><span class="no">Error</span> <span class="o">=&gt;</span> <span class="n">e</span>
</span></span><span class="line"><span class="cl">    <span class="nb">puts</span> <span class="s2">&#34;failed to do crazy stuff: </span><span class="si">#{</span><span class="n">e</span><span class="o">.</span><span class="n">class</span><span class="si">}</span><span class="s2">, </span><span class="si">#{</span><span class="n">e</span><span class="o">.</span><span class="n">message</span><span class="si">}</span><span class="s2">&#34;</span>
</span></span><span class="line"><span class="cl">  <span class="k">end</span>
</span></span><span class="line"><span class="cl">  <span class="o">...</span>
</span></span><span class="line"><span class="cl">  <span class="k">class</span> <span class="o">&lt;&lt;</span> <span class="nb">self</span>
</span></span><span class="line"><span class="cl">    <span class="k">def</span> <span class="nf">connection</span><span class="p">(</span><span class="ss">force</span><span class="p">:</span> <span class="kp">false</span><span class="p">)</span>
</span></span><span class="line"><span class="cl">      <span class="vi">@connection</span> <span class="o">=</span>
</span></span><span class="line"><span class="cl">        <span class="k">if</span> <span class="n">force</span>
</span></span><span class="line"><span class="cl">          <span class="n">connect_to_db</span>
</span></span><span class="line"><span class="cl">        <span class="k">else</span>
</span></span><span class="line"><span class="cl">          <span class="vi">@connection</span> <span class="o">||</span> <span class="n">connect_to_db</span>
</span></span><span class="line"><span class="cl">        <span class="k">end</span>
</span></span><span class="line"><span class="cl">    <span class="k">end</span>
</span></span><span class="line"><span class="cl">
</span></span><span class="line"><span class="cl">    <span class="k">def</span> <span class="nf">connect_to_db</span>
</span></span><span class="line"><span class="cl">      <span class="no">PG</span><span class="o">.</span><span class="n">connect</span><span class="p">(</span><span class="no">SecretsManager</span><span class="o">.</span><span class="n">db_config</span><span class="p">)</span>
</span></span><span class="line"><span class="cl">    <span class="k">end</span>
</span></span><span class="line"><span class="cl">  <span class="k">end</span>
</span></span><span class="line"><span class="cl">
</span></span><span class="line"><span class="cl"><span class="k">end</span>
</span></span></code></pre></div><p>However, if there are more events to handle than a single lambda worker is able to process, the lambda scheduler will spawn more lambda instances, and these will work in parallel. This behavior is governed by the lambda concurrency setting, and it is best to cap it at the maximum number of connections your database (or any other shared resource you might be accessing in a similar way) can accept.</p>
<blockquote>
<h2 id="pro-tip-on-a-single-box-lambda-runtime-executes-your-handler-code-in-a-loop-synchronously">Pro tip: on a single box lambda runtime executes your handler code in a loop, synchronously.<a href="#pro-tip-on-a-single-box-lambda-runtime-executes-your-handler-code-in-a-loop-synchronously" class="anchor" aria-hidden="true"><svg xmlns="http://www.w3.org/2000/svg" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2"
      stroke-linecap="round" stroke-linejoin="round">
      <path d="M15 7h3a5 5 0 0 1 5 5 5 5 0 0 1-5 5h-3m-6 0H6a5 5 0 0 1-5-5 5 5 0 0 1 5-5h3"></path>
      <line x1="8" y1="12" x2="16" y2="12"></line>
   </svg></a></h2>
</blockquote>
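<p>If the reuse part feels abstract, the following self-contained sketch simulates one worker handling several events. <code>FakeConnection</code> and <code>CachedHandler</code> are hypothetical names made up for this illustration, with <code>FakeConnection.open</code> standing in for <code>PG.connect</code>; the point is only that class-level state survives between handler calls within the same process.</p>

```ruby
# Stand-in for PG.connect that counts how many connections were opened.
# Hypothetical class, used only for this illustration.
class FakeConnection
  @opened = 0

  class << self
    attr_reader :opened

    def open
      @opened += 1
      new
    end
  end
end

class CachedHandler
  class << self
    # Class-level state lives as long as the runtime process does,
    # so every invocation handled by this worker shares it.
    def connection
      @connection ||= FakeConnection.open
    end

    def call(event:, context:)
      connection # reused after the first invocation
      "handled #{event}"
    end
  end
end

# Simulate the runtime loop feeding three events to a single worker.
3.times { |i| CachedHandler.call(event: i, context: nil) }
puts "connections opened: #{FakeConnection.opened}"
```

<p>Each additional lambda instance spawned by the scheduler is a separate process with its own copy of that state, which is why the concurrency limit translates fairly directly into the number of open database connections.</p>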
<p>Check AWS docs on the <a href="https://docs.aws.amazon.com/lambda/latest/dg/concurrent-executions.html">lambda concurrency</a> or <a href="https://github.com/lambci/docker-lambda">lambci&rsquo;s GitHub repo</a> for even more details.</p>
<p>While at it, you might take a quick look at my <a href="https://github.com/okulik/lambcli-ruby/blob/master/runtime/lib/runtime.rb#L27">lambcli-ruby repo</a> to get an idea of what that lambda runtime loop looks like. I copied the <code>/var/runtime</code> folder off the <code>lambci/lambda:build-ruby2.5</code> docker image for easy inspection.</p>
<h2 id="5-zip-package-liposuction">5. Zip package liposuction<a href="#5-zip-package-liposuction" class="anchor" aria-hidden="true"><svg xmlns="http://www.w3.org/2000/svg" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2"
      stroke-linecap="round" stroke-linejoin="round">
      <path d="M15 7h3a5 5 0 0 1 5 5 5 5 0 0 1-5 5h-3m-6 0H6a5 5 0 0 1-5-5 5 5 0 0 1 5-5h3"></path>
      <line x1="8" y1="12" x2="16" y2="12"></line>
   </svg></a></h2>
<p>The suggested way to make your zip package containing lambda code smaller is to move all your dependencies, shared object libraries and binaries into a separate layer.</p>
<p>However, I found there&rsquo;s an even simpler way to trim down your zip archive by carefully inspecting what ends up inside the <code>vendor/bundle</code> folder:</p>
<ol>
<li>exclude all your specs and native extension build artifacts,</li>
<li>remove all extra copies of the <code>pg_ext.so</code> file (it&rsquo;s 1 MB in size and can be found in three different places, two of which are redundant).</li>
</ol>
<p>Here&rsquo;s what my bash packaging command looks like:</p>
<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-bash" data-lang="bash"><span class="line"><span class="cl">zip -rq -9 <span class="s2">&#34;</span><span class="k">$(</span>BASE<span class="k">)</span><span class="s2">/</span><span class="k">$(</span>PROJECT_NAME<span class="k">)</span><span class="s2">.zip&#34;</span> . <span class="se">\
</span></span></span><span class="line"><span class="cl">  -x <span class="s2">&#34;spec/*&#34;</span> <span class="se">\
</span></span></span><span class="line"><span class="cl">     <span class="s2">&#34;**/spec/*&#34;</span> <span class="se">\
</span></span></span><span class="line"><span class="cl">     <span class="s2">&#34;vendor/bundle/ruby/2.5.0/gems/pg-1.1.4/lib/pg_ext.so&#34;</span> <span class="se">\
</span></span></span><span class="line"><span class="cl">     <span class="s2">&#34;vendor/bundle/ruby/2.5.0/gems/pg-1.1.4/ext/*&#34;</span>
</span></span></code></pre></div><blockquote>
<h2 id="pro-tip-know-what-goes-into-your-lambda-package">Pro tip: know what goes into your lambda package!<a href="#pro-tip-know-what-goes-into-your-lambda-package" class="anchor" aria-hidden="true"><svg xmlns="http://www.w3.org/2000/svg" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2"
      stroke-linecap="round" stroke-linejoin="round">
      <path d="M15 7h3a5 5 0 0 1 5 5 5 5 0 0 1-5 5h-3m-6 0H6a5 5 0 0 1-5-5 5 5 0 0 1 5-5h3"></path>
      <line x1="8" y1="12" x2="16" y2="12"></line>
   </svg></a></h2>
</blockquote>
<p>Making your code and package smaller makes deployment faster and code editing/testing inside the Cloud9 editor much more enjoyable.</p>
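<p>One way to find out what goes into the package is to look for files that show up more than once. The following is a minimal sketch, assuming the gems live under <code>vendor/bundle</code> relative to the current directory; <code>duplicate_files</code> is a made-up helper name, not part of any tool mentioned here.</p>

```ruby
# Groups files under +root+ that match +pattern+ by their basename and
# returns only the names that occur more than once (for example several
# copies of pg_ext.so).
def duplicate_files(root, pattern = '*.so')
  paths = Dir.glob(File.join(root, '**', pattern)).select { |path| File.file?(path) }
  paths.group_by { |path| File.basename(path) }
       .reject { |_name, copies| copies.size == 1 }
end

# Prints nothing when vendor/bundle is missing or holds no duplicates.
duplicate_files('vendor/bundle').each do |name, copies|
  puts "#{name}: #{copies.size} copies"
  copies.sort.each { |path| puts format('  %8d bytes  %s', File.size(path), path) }
end
```

<p>Run it from the project root before zipping; every path it prints is a candidate for another <code>-x</code> pattern, as long as you keep the copy that is actually loaded at runtime.</p>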
<p>That&rsquo;s it for now, until next time!</p>
]]></content>
		</item>
		
		<item>
			<title>Minimalistic logging from Docker containers</title>
			<link>https://nisdom.com/posts/2015-04-10-minimalistic-logging-from-docker-containers/</link>
			<pubDate>Fri, 10 Apr 2015 12:53:00 +0200</pubDate><guid>https://nisdom.com/posts/2015-04-10-minimalistic-logging-from-docker-containers/</guid>
			<description><![CDATA[&lt;no value&gt;]]></description><content type="text/html" mode="escaped"><![CDATA[<p>Reading logs from Docker container can be done using <code>docker logs container_id</code>. This simply fetches logs present at the time of execution from container&rsquo;s STDOUT and STDERR streams. If you want to, however, transform those logs, and send them to a central repository using e.g. logstash, there are a number of options to choose from. Here I&rsquo;ll be describing the simplest case of writing logs to /dev/log socket.</p>
<h2 id="writers-will-write">Writers will write<a href="#writers-will-write" class="anchor" aria-hidden="true"><svg xmlns="http://www.w3.org/2000/svg" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2"
      stroke-linecap="round" stroke-linejoin="round">
      <path d="M15 7h3a5 5 0 0 1 5 5 5 5 0 0 1-5 5h-3m-6 0H6a5 5 0 0 1-5-5 5 5 0 0 1 5-5h3"></path>
      <line x1="8" y1="12" x2="16" y2="12"></line>
   </svg></a></h2>
<p>The minimalistic scenario for collecting logs expects that your software, running inside the Docker container, writes to syslog through the <a href="http://en.wikipedia.org/wiki/Unix_domain_socket">Unix domain socket</a> /dev/log. With a Ruby app running inside the Docker container you can use the <a href="http://ruby-doc.org/stdlib-2.0/libdoc/syslog/rdoc/Syslog/Logger.html">Syslog::Logger</a> class like this:</p>
<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-ruby" data-lang="ruby"><span class="line"><span class="cl"><span class="nb">require</span> <span class="s1">&#39;syslog/logger&#39;</span>
</span></span><span class="line"><span class="cl">
</span></span><span class="line"><span class="cl"><span class="n">log</span> <span class="o">=</span> <span class="no">Syslog</span><span class="o">::</span><span class="no">Logger</span><span class="o">.</span><span class="n">new</span><span class="p">(</span><span class="s1">&#39;my-awesome-app&#39;</span><span class="p">)</span>
</span></span><span class="line"><span class="cl"><span class="n">log</span><span class="o">.</span><span class="n">info</span><span class="p">(</span><span class="s1">&#39;say something nice&#39;</span><span class="p">)</span>
</span></span></code></pre></div><p>With a Node.js app and some help from the <a href="https://github.com/phuesler/ain">ain</a> package (required as <code>ain2</code>) you might end up with:</p>
<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-javascript" data-lang="javascript"><span class="line"><span class="cl"><span class="kd">var</span> <span class="nx">SysLogger</span> <span class="o">=</span> <span class="nx">require</span><span class="p">(</span><span class="s1">&#39;ain2&#39;</span><span class="p">)</span>
</span></span><span class="line"><span class="cl"><span class="kd">var</span> <span class="nx">log</span> <span class="o">=</span> <span class="k">new</span> <span class="nx">SysLogger</span><span class="p">({</span><span class="nx">tag</span><span class="o">:</span> <span class="s1">&#39;my-cool-app&#39;</span><span class="p">,</span> <span class="nx">path</span><span class="o">:</span> <span class="s1">&#39;/dev/log&#39;</span><span class="p">});</span>
</span></span><span class="line"><span class="cl"><span class="nx">log</span><span class="p">.</span><span class="nx">setTransport</span><span class="p">(</span><span class="s1">&#39;unix_dgram&#39;</span><span class="p">);</span>
</span></span><span class="line"><span class="cl"><span class="nx">log</span><span class="p">.</span><span class="nx">info</span><span class="p">(</span><span class="s1">&#39;say something sweet&#39;</span><span class="p">);</span>
</span></span></code></pre></div><p>If you have a properly configured and running (r)syslog daemon, in both cases you will get something like this in /var/log/messages:</p>
<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-bash" data-lang="bash"><span class="line"><span class="cl">Apr <span class="m">10</span> 18:31:25 ip-123-21-31-41 my-awesome-app<span class="o">[</span>1101<span class="o">]</span>: say something nice
</span></span><span class="line"><span class="cl">Apr <span class="m">10</span> 18:31:25 ip-123-21-31-41 my-cool-app<span class="o">[</span>1105<span class="o">]</span>: say something sweet
</span></span></code></pre></div><p>The same effect can be achieved with the <a href="https://www.freebsd.org/cgi/man.cgi?query=logger%281%29&amp;sektion=">logger</a> tool. Both libraries mentioned above, as well as logger, end up writing to the /dev/log socket (if available), typically through the <a href="https://www.freebsd.org/cgi/man.cgi?query=syslog&amp;sektion=3">syslog(3)</a> API call.</p>
<h2 id="and-readers-will-read">And readers will read<a href="#and-readers-will-read" class="anchor" aria-hidden="true"><svg xmlns="http://www.w3.org/2000/svg" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2"
      stroke-linecap="round" stroke-linejoin="round">
      <path d="M15 7h3a5 5 0 0 1 5 5 5 5 0 0 1-5 5h-3m-6 0H6a5 5 0 0 1-5-5 5 5 0 0 1 5-5h3"></path>
      <line x1="8" y1="12" x2="16" y2="12"></line>
   </svg></a></h2>
<p>Now comes the fun part. I promised to explain how this logging stuff and syslog play with Docker. To be able to send logs to an IPC socket, somebody has to create it first. Let that somebody be an rsyslog daemon running on the Docker host. For this to work we need to have the following line in <code>/etc/rsyslog.conf</code> uncommented:</p>
<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-bash" data-lang="bash"><span class="line"><span class="cl"><span class="nv">$ModLoad</span> imuxsock
</span></span></code></pre></div><p>The last thing to do is to bind mount the /dev/log socket on <code>docker run</code> using something like:</p>
<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-bash" data-lang="bash"><span class="line"><span class="cl">docker run -v /dev/log:/dev/log my-image
</span></span></code></pre></div><p>One important thing to notice is that once we restart the rsyslog daemon, for whatever reason, the apps running inside the Docker containers won&rsquo;t be able to write to syslog anymore. The reason is dead simple: the socket to which all the apps were bound is now gone and has been replaced by another one. If we want to write logs to that new socket, we should probably restart our apps.</p>
]]></content>
		</item>
		
		<item>
			<title>A matter of logs</title>
			<link>https://nisdom.com/posts/2015-03-04-a-matter-of-logs/</link>
			<pubDate>Wed, 04 Mar 2015 14:15:57 +0100</pubDate><guid>https://nisdom.com/posts/2015-03-04-a-matter-of-logs/</guid>
			<description><![CDATA[&lt;no value&gt;]]></description><content type="text/html" mode="escaped"><![CDATA[<p>Logs are very important part of any serious software system. They provide invaluable insight in the current and past state of the system. Simply saving them to a disk or persisting them in any other crude way might probably deprive you from discovering anything interesting in it. The purpose of this article was to describe one such offline processing logs collection system I created years ago and to sketch possible real-time solutions using technologies available today.</p>
<h2 id="problem-description-and-motivation">Problem description and motivation<a href="#problem-description-and-motivation" class="anchor" aria-hidden="true"><svg xmlns="http://www.w3.org/2000/svg" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2"
      stroke-linecap="round" stroke-linejoin="round">
      <path d="M15 7h3a5 5 0 0 1 5 5 5 5 0 0 1-5 5h-3m-6 0H6a5 5 0 0 1-5-5 5 5 0 0 1 5-5h3"></path>
      <line x1="8" y1="12" x2="16" y2="12"></line>
   </svg></a></h2>
<p>The story begins a couple of years ago when I was working on some server-side code that needed to process a number of logs streaming from a desktop app. These logs contained various time-stamped events and, since at the time I was using Heroku to run web services, I had to be extra careful about the running costs. I also didn&rsquo;t want to spend too much time administering server software running on EC2 instances. Luckily, business requirements at the time didn&rsquo;t call for a real-time solution, so in the end I decided to go with an offline one.</p>
<h2 id="simpledb-solution">Simple(db) solution<a href="#simpledb-solution" class="anchor" aria-hidden="true"><svg xmlns="http://www.w3.org/2000/svg" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2"
      stroke-linecap="round" stroke-linejoin="round">
      <path d="M15 7h3a5 5 0 0 1 5 5 5 5 0 0 1-5 5h-3m-6 0H6a5 5 0 0 1-5-5 5 5 0 0 1 5-5h3"></path>
      <line x1="8" y1="12" x2="16" y2="12"></line>
   </svg></a></h2>
<p>Since there were a number of users running the desktop app concurrently, a relatively large number of events were generated, something close to 5K per second. From my experience, that number of concurrent calls to an HTTP endpoint wouldn&rsquo;t even work on Heroku (for comparison, <a href="http://highscalability.com/blog/2014/7/21/stackoverflow-update-560m-pageviews-a-month-25-servers-and-i.html">StackOverflow</a> had around 3000 req/s in 2014). Since this was a desktop app, the decision was made to upload compressed batches of events (serialized as JSON data) directly to S3. When the upload of a single batch was finished, the app would still call a Heroku web service to store a timestamp and a pointer to the uploaded S3 file in <a href="http://en.wikipedia.org/wiki/Amazon_SimpleDB">SimpleDB</a>. Batching helped cut requests down to fewer than 100 per second, and writing metadata to SimpleDB was done out-of-band with the help of a queue and some background workers. In the end this solution was still calling a web service hosted on Heroku, but a much leaner one than it could have been.</p>
<blockquote>
<p>At the time, the new-object-created event wasn&rsquo;t available on S3 and even <a href="http://en.wikipedia.org/wiki/Amazon_DynamoDB">DynamoDB</a> wasn&rsquo;t there. SimpleDB was the only hosted columnar data store, with a very reasonable price tag and bearable constraints for offline processing. Had such an <a href="http://docs.aws.amazon.com/AmazonS3/latest/dev/NotificationHowTo.html">S3 event</a> existed, we could have skipped Heroku completely.</p>
</blockquote>
<p>The next thing that needed to be done was offline processing of those events. For this purpose I created a daily cron job (running at night) that spawned some Ruby code. First it queried SimpleDB, grouping events by timestamp for the previous day. Then it pushed those events to an SQS queue served by an arbitrarily large set of listeners. The listeners pulled the related blobs of data from S3, did some transformations and finally updated various counters in MySQL.</p>
<p>Here&rsquo;s a diagram of the whole scaffolding:</p>
<p><img src="/images/simpledb.png" alt="SimpleDB solution"></p>
<p>I hope I managed to clearly describe how the previous system was built. Now I am fast forwarding to see how I could build a similar, real-time system with current technologies.</p>
<h2 id="fast-forward-today">Fast forward today<a href="#fast-forward-today" class="anchor" aria-hidden="true"><svg xmlns="http://www.w3.org/2000/svg" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2"
      stroke-linecap="round" stroke-linejoin="round">
      <path d="M15 7h3a5 5 0 0 1 5 5 5 5 0 0 1-5 5h-3m-6 0H6a5 5 0 0 1-5-5 5 5 0 0 1 5-5h3"></path>
      <line x1="8" y1="12" x2="16" y2="12"></line>
   </svg></a></h2>
<p>Every such journey should start with a little research. You don&rsquo;t want to be a system architect stuck with a hammer and a saw; you had better upgrade your toolbelt occasionally. After a relatively short research on the subject, I was amazed at how enormous the real-time logs/events processing area was and how many software products existed in this space. And by products I don&rsquo;t mean the traditional ones like <a href="http://en.wikipedia.org/wiki/Rsync">rsync</a> or the syslog-based <a href="http://en.wikipedia.org/wiki/Rsyslog">rsyslog</a> or <a href="http://en.wikipedia.org/wiki/Syslog-ng">syslog-ng</a>. I confess, it took me more than a day to grasp all the existing software products, what they actually represented and how they fitted inside their respective puzzles.</p>
<h2 id="producers">Producers<a href="#producers" class="anchor" aria-hidden="true"><svg xmlns="http://www.w3.org/2000/svg" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2"
      stroke-linecap="round" stroke-linejoin="round">
      <path d="M15 7h3a5 5 0 0 1 5 5 5 5 0 0 1-5 5h-3m-6 0H6a5 5 0 0 1-5-5 5 5 0 0 1 5-5h3"></path>
      <line x1="8" y1="12" x2="16" y2="12"></line>
   </svg></a></h2>
<p>If I want to handle my logs in real time, I obviously have to forget about uploading compressed batches to S3 and all that offline processing.</p>
<blockquote>
<p>I learned that I don&rsquo;t deal with logs, but events, and the thing I would be doing is real-time events ingestion and processing. One useful acronym is <a href="http://en.wikipedia.org/wiki/Extract,_transform,_load">ETL</a> which stands for extract, transform and load, a typical thing which event consumers do.</p>
</blockquote>
<p>We are dealing with roughly 5K events per second, so what comes to mind is that the desktop apps could push events to some messaging (i.e. queueing) system. The usual suspects are RabbitMQ, 0MQ, Redis etc. They could all handle that much traffic without blinking, and if we needed more, we could always put a reverse proxy in front and happily continue. I would personally go with Redis since it&rsquo;s very easy to configure and there&rsquo;s a brilliant reverse proxy, <a href="https://github.com/twitter/twemproxy">twemproxy</a> (aka nutcracker), that supports the Redis protocol, should I ever need to create a Redis cluster. The reasoning behind such messaging systems is to isolate message producers (in our case desktop apps) from message consumers (our Ruby scripts running on EC2 instances). I previously used the highly available S3 service and the SimpleDB service (unfortunately, not so highly available) to achieve a similar sort of isolation.</p>
<p>But I discovered there are even cooler toys out there called <a href="http://en.wikipedia.org/wiki/Apache_Kafka">Apache Kafka</a> and <a href="http://aws.amazon.com/kinesis/">Amazon Kinesis</a>. The main difference between Kafka/Kinesis and those more traditional messaging systems, according to their documentation, is that they are built from the ground up with distribution in mind. This usually means seamless horizontal scaling to much higher loads.</p>
<p>It seems that Kinesis is less flexible than Kafka, but Kinesis has some other advantages that matter to me even more. If I wanted a highly available Kafka cluster, I would need to maintain a number of EC2 instances running Kafka and a separate ZooKeeper instance used by Kafka for coordination among the nodes. With Kinesis I don&rsquo;t need to worry about any of that cluster maintenance. It can even scale endlessly with almost no administrative burden. So I am perfectly happy to continue with the hassle-free Kinesis and write events directly from the desktop app to the Kinesis stream.</p>
<h2 id="consumers">Consumers<a href="#consumers" class="anchor" aria-hidden="true"><svg xmlns="http://www.w3.org/2000/svg" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2"
      stroke-linecap="round" stroke-linejoin="round">
      <path d="M15 7h3a5 5 0 0 1 5 5 5 5 0 0 1-5 5h-3m-6 0H6a5 5 0 0 1-5-5 5 5 0 0 1 5-5h3"></path>
      <line x1="8" y1="12" x2="16" y2="12"></line>
   </svg></a></h2>
<p>The second part of the equation is consumption of those messages. I need to ingest each message, transform it a bit and then store it somewhere safe for later access. If the data represents some counter, I might update its value in the database, and if it&rsquo;s some text that I need to search later, I could store it in ElasticSearch. As I said, previously I used a Ruby script whose execution was triggered once a day by a cron job. I could use that same Ruby script here as well, but this time it wouldn&rsquo;t be started from a cron job, but from some other code listening to events arriving from the Kinesis stream. Amazon even provides a server implementation that works on top of the Kinesis Client Library, called <a href="https://github.com/awslabs/amazon-kinesis-client">MultiLangDaemon</a>, which simplifies development of Kinesis record processors in languages other than Java. But I have my eyes set on something else.</p>
<p>As with messaging products, there are a number of choices in the logs/events collectors/processors arena, at least enough to make my head spin once more: Apache Storm, Flume, logstash, fluentd, Amazon Lambda etc. Although these products differ in many ways, they are similar enough that I could use any of them for what I&rsquo;m trying to achieve. Apache Storm seems to be very powerful and quite well supported by Amazon. On the other hand, there&rsquo;s a brand new Amazon offering called <a href="http://aws.amazon.com/lambda/">Amazon Lambda</a>, the holy grail of no-hassle solutions (which I have always preferred, being a developer-first person). Lambdas would even relieve me of having EC2 instances for event processing. So all I need to do is rewrite my Ruby transformations in JavaScript (Amazon uses Node.js behind the curtains) and unleash the magical power of Lambda. Sweet!</p>
<h2 id="cost-estimate">Cost estimate<a href="#cost-estimate" class="anchor" aria-hidden="true"><svg xmlns="http://www.w3.org/2000/svg" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2"
      stroke-linecap="round" stroke-linejoin="round">
      <path d="M15 7h3a5 5 0 0 1 5 5 5 5 0 0 1-5 5h-3m-6 0H6a5 5 0 0 1-5-5 5 5 0 0 1 5-5h3"></path>
      <line x1="8" y1="12" x2="16" y2="12"></line>
   </svg></a></h2>
<p>It seems that I managed to put all those different pieces together and to at least imagine how I would turn my offline event processing into a real-time analytics solution, and all that using Amazon&rsquo;s hosted solutions. The only remaining thing to do is to get a rough estimate of the costs. I figured I would calculate only how much I would pay monthly for the use of Kinesis and Lambda. My original ETL code was transferring data to MySQL (RDS) and S3 in the &ldquo;L&rdquo; phase of ETL. This is something I would still be doing with the Kinesis/Lambda solution. The only saving I would be able to achieve is the removal of $500/month worth of EC2 instances crunching the events, now replaced with Lambda.</p>
<h3 id="kinesis-shard-hour-cost">Kinesis shard-hour cost<a href="#kinesis-shard-hour-cost" class="anchor" aria-hidden="true"><svg xmlns="http://www.w3.org/2000/svg" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2"
      stroke-linecap="round" stroke-linejoin="round">
      <path d="M15 7h3a5 5 0 0 1 5 5 5 5 0 0 1-5 5h-3m-6 0H6a5 5 0 0 1-5-5 5 5 0 0 1 5-5h3"></path>
      <line x1="8" y1="12" x2="16" y2="12"></line>
   </svg></a></h3>
<p>I already said that we produce around 5K events every second. Each such event carries around 1K of payload, which makes 5 MB/s of data input. Since one shard in a Kinesis stream has a capacity of 1 MB/s, I would need 5 such shards. This is roughly $55.80 per month.</p>
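<p>The shard arithmetic can be replayed in a few lines of Ruby. The shard-hour price of $0.015 is an assumption: it is not quoted in the text but is the rate implied by the $55.80 total for 5 shards over a 31-day month.</p>

```ruby
events_per_second = 5_000
event_size_kb     = 1       # roughly 1K of payload per event
shard_capacity_kb = 1_000   # 1 MB/s of input per shard
shard_hour_price  = 0.015   # USD, assumed (implied by the $55.80 total)

# Round up, because a fraction of a shard cannot be provisioned.
shards       = (events_per_second * event_size_kb / shard_capacity_kb.to_f).ceil
monthly_cost = shards * shard_hour_price * 24 * 31

puts "shards needed: #{shards}"
puts format('shard-hour cost per month: $%.2f', monthly_cost)
```

<p>With these inputs the sketch arrives at 5 shards and $55.80 per month.</p>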
<h3 id="kinesis-put-record-cost">Kinesis PUT record cost<a href="#kinesis-put-record-cost" class="anchor" aria-hidden="true"><svg xmlns="http://www.w3.org/2000/svg" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2"
      stroke-linecap="round" stroke-linejoin="round">
      <path d="M15 7h3a5 5 0 0 1 5 5 5 5 0 0 1-5 5h-3m-6 0H6a5 5 0 0 1-5-5 5 5 0 0 1 5-5h3"></path>
      <line x1="8" y1="12" x2="16" y2="12"></line>
   </svg></a></h3>
<p>The next cost is related to PUT records. The number of events per month is 5000 * 60 * 60 * 24 * 31, i.e. 13,392,000,000. A million PUT records cost $0.028, so we end up with an additional $375 per month. Since Kinesis messages can hold up to 50K, we might once again batch our events and write e.g. 10 events at once. This would bring the number of PUT records down to 500 per second and the system would still behave as a real-time one. So instead of adding $375, we would have an extra $37.50 per month. Notice that the shard-hour cost doesn&rsquo;t change with batching.</p>
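<p>To see how batching changes the PUT bill, here is a small sketch using only the figures quoted above; <code>monthly_put_cost</code> is a made-up helper name.</p>

```ruby
SECONDS_PER_MONTH      = 60 * 60 * 24 * 31
PRICE_PER_MILLION_PUTS = 0.028 # USD, as quoted in the text

# Monthly PUT cost when batch_size events are packed into one record.
def monthly_put_cost(events_per_second, batch_size)
  puts_per_month = events_per_second * SECONDS_PER_MONTH / batch_size
  puts_per_month / 1_000_000.0 * PRICE_PER_MILLION_PUTS
end

[1, 10].each do |batch_size|
  cost = monthly_put_cost(5_000, batch_size)
  puts format('batch of %2d: $%.2f per month', batch_size, cost)
end
```

<p>The two lines it prints correspond to the unbatched (about $375) and batched (about $37.50) cases.</p>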
<h3 id="lambda-requests-count-cost">Lambda requests count cost<a href="#lambda-requests-count-cost" class="anchor" aria-hidden="true"><svg xmlns="http://www.w3.org/2000/svg" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2"
      stroke-linecap="round" stroke-linejoin="round">
      <path d="M15 7h3a5 5 0 0 1 5 5 5 5 0 0 1-5 5h-3m-6 0H6a5 5 0 0 1-5-5 5 5 0 0 1 5-5h3"></path>
      <line x1="8" y1="12" x2="16" y2="12"></line>
   </svg></a></h3>
<p>Since I decided to batch the events, I end up with 1,339,200,000 lambda requests per month. The first 1,000,000 requests are free and each subsequent million costs $0.20. Add another $268.</p>
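<p>As a quick check of the request count and its price, assuming 500 batched invocations per second over a 31-day month:</p>

```ruby
requests_per_month = 5_000 / 10 * 60 * 60 * 24 * 31  # 500 batches per second
free_requests      = 1_000_000
price_per_million  = 0.20                            # USD, as quoted in the text

# Only the requests beyond the free allowance are billed.
billable = [requests_per_month - free_requests, 0].max
cost     = billable / 1_000_000.0 * price_per_million

puts "requests per month: #{requests_per_month}"
puts format('request cost: $%.2f', cost)
```

<p>This gives $267.64, which rounds to the $268 used above.</p>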
<h3 id="lambda-duration-cost">Lambda duration cost<a href="#lambda-duration-cost" class="anchor" aria-hidden="true"><svg xmlns="http://www.w3.org/2000/svg" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2"
      stroke-linecap="round" stroke-linejoin="round">
      <path d="M15 7h3a5 5 0 0 1 5 5 5 5 0 0 1-5 5h-3m-6 0H6a5 5 0 0 1-5-5 5 5 0 0 1 5-5h3"></path>
      <line x1="8" y1="12" x2="16" y2="12"></line>
   </svg></a></h3>
<p>Now things become a little bit harder regarding the cost estimation. I would need to know upfront how much memory my code would need on Lambda and how long it would execute. This is, of course, impossible without really trying it out. Arriving at this point also makes it painfully obvious that I&rsquo;ll still need to pay what I thought I had saved by batching those events. I will make a really modest estimate here and suppose I would need only 128 MB of memory (the cheapest Amazon Lambda tier) and that my code would need 150 ms to process each single event, i.e. 1.5 seconds for the whole batch. This makes a total of 2,008,800,000 seconds of work per month (the first 3,200,000 seconds are free). Since the price per 100 ms is $0.000000208, we end up with roughly $4178 of additional monthly cost before the free seconds are deducted, or about $4172 after. Oops.</p>
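<p>The duration estimate is the easiest one to get wrong, so here it is replayed step by step with the figures quoted above. The sketch prints the cost both before and after the free seconds are deducted.</p>

```ruby
invocations_per_month = 1_339_200_000  # batched lambda requests per month
seconds_per_batch     = 1.5            # 10 events at 150 ms each
free_seconds          = 3_200_000      # free allowance at 128 MB
price_per_100ms       = 0.000000208    # USD at 128 MB

total_seconds = invocations_per_month * seconds_per_batch
# Every second of work is billed as ten 100 ms units.
gross_cost = total_seconds * 10 * price_per_100ms
net_cost   = (total_seconds - free_seconds) * 10 * price_per_100ms

puts format('seconds of work per month: %d', total_seconds)
puts format('before free seconds: $%.2f', gross_cost)
puts format('after free seconds:  $%.2f', net_cost)
```

<p>Either way the duration charge dwarfs every other item in the estimate, and the free allowance barely dents it.</p>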
<h3 id="kinesislambda-costs-recap">Kinesis/Lambda costs recap<a href="#kinesislambda-costs-recap" class="anchor" aria-hidden="true"><svg xmlns="http://www.w3.org/2000/svg" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2"
      stroke-linecap="round" stroke-linejoin="round">
      <path d="M15 7h3a5 5 0 0 1 5 5 5 5 0 0 1-5 5h-3m-6 0H6a5 5 0 0 1-5-5 5 5 0 0 1 5-5h3"></path>
      <line x1="8" y1="12" x2="16" y2="12"></line>
   </svg></a></h3>
<p>A cost of roughly $100 per month for Kinesis turned out to be a real bargain. It saves me from running a Redis cluster of at least two nodes plus an extra reverse proxy instance, all just to achieve HA properties at least modestly comparable to those of Kinesis. Lambda, however, turned out to be too pricey for my budget, even when estimating with the cheapest tier.</p>
<h2 id="summary">Summary<a href="#summary" class="anchor" aria-hidden="true"><svg xmlns="http://www.w3.org/2000/svg" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2"
      stroke-linecap="round" stroke-linejoin="round">
      <path d="M15 7h3a5 5 0 0 1 5 5 5 5 0 0 1-5 5h-3m-6 0H6a5 5 0 0 1-5-5 5 5 0 0 1 5-5h3"></path>
      <line x1="8" y1="12" x2="16" y2="12"></line>
   </svg></a></h2>
<p>To recap, I would definitely be pushing my data to an Amazon Kinesis stream, but instead of Lambda I would be running e.g. a single c4.2xlarge instance ($345 monthly cost) with MultiLangDaemon and my slightly modified Ruby code. My guess is that this single machine would be able to process all 5 shards concurrently.</p>
<p>The new solution manages to replace storing data in S3 and to remove most of the offline log-processing EC2 instances, with the costs remaining roughly the same. And yes, I managed to replace my poxy 24-hours-later analytics with a real-time solution. How cool is that?!</p>
<p>It seems that there are some new and shiny toys to play with on AWS. And once again, they come to the rescue from the gruesome maintenance tasks of running software on EC2s, at least for the average back-end developer. But not all of them are for everyone, and there is a hefty price tag attached to that Unbearable Lightness of Lambda.</p>
<h2 id="an-honorable-mention-to-elasticsearch-elk-stack">An honorable mention to ElasticSearch ELK stack<a href="#an-honorable-mention-to-elasticsearch-elk-stack" class="anchor" aria-hidden="true"><svg xmlns="http://www.w3.org/2000/svg" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2"
      stroke-linecap="round" stroke-linejoin="round">
      <path d="M15 7h3a5 5 0 0 1 5 5 5 5 0 0 1-5 5h-3m-6 0H6a5 5 0 0 1-5-5 5 5 0 0 1 5-5h3"></path>
      <line x1="8" y1="12" x2="16" y2="12"></line>
   </svg></a></h2>
<p>Although I love ElasticSearch and its whole ELK stack, <a href="http://logstash.net/">logstash</a> (which is, btw., the &ldquo;L&rdquo; in ELK and a very, very cool product in its own right) would be more appropriate if we were dealing with raw logs instead of events. In order to use logstash I would need to write a plugin to deal with the events sent by the desktop apps (some boilerplate plus the existing Ruby code). This all seems like overkill compared to Amazon&rsquo;s solution. In any other case where I need to ingest more structured logs (like stuff coming from web servers), make them available for full-text search and even visualise them, logstash is the way to go (make sure to check <a href="https://www.youtube.com/watch?v=RuUFnog29M4">Jordan Sissel&rsquo;s video</a>).</p>
]]></content>
		</item>
		
		<item>
			<title>Replacing Wordpress with Octopress</title>
			<link>https://nisdom.com/posts/2015-02-22-replacing-wordpress-with-octopress/</link>
			<pubDate>Sun, 22 Feb 2015 19:18:44 +0100</pubDate><guid>https://nisdom.com/posts/2015-02-22-replacing-wordpress-with-octopress/</guid>
			<description><![CDATA[&lt;no value&gt;]]></description><content type="text/html" mode="escaped"><![CDATA[<p>About a month ago, I embarked on an adventure to replace my WordPress blogging platform with a more Ruby-friendly alternative. I say &lsquo;once again&rsquo; because the last time I attempted this, after about 30 minutes of somewhat futile searching, I simply gave up. This time, the urge was stronger, and I was lucky to have better results. But before diving into the details of my quest, let me explain my motivation for making this change.</p>
<p><img src="/images/wordpress_to_octopress.png" alt="Octopress rules!"></p>
<h2 id="the-motivation">The motivation<a href="#the-motivation" class="anchor" aria-hidden="true"><svg xmlns="http://www.w3.org/2000/svg" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2"
      stroke-linecap="round" stroke-linejoin="round">
      <path d="M15 7h3a5 5 0 0 1 5 5 5 5 0 0 1-5 5h-3m-6 0H6a5 5 0 0 1-5-5 5 5 0 0 1 5-5h3"></path>
      <line x1="8" y1="12" x2="16" y2="12"></line>
   </svg></a></h2>
<p>Most people would likely agree that WordPress is one of the most powerful blogging platforms worldwide. It&rsquo;s highly customizable and, even more importantly, has a huge number of powerful add-ons that cover everything from sitemap generation and Google Analytics to backups, Dropbox integration, themes, and more. After all, when you want to publish content on the internet, your primary concern is the &lsquo;what&rsquo; rather than the &lsquo;how,&rsquo; isn&rsquo;t it? When I started writing short pieces for this site, my main concern was practicing writing, so WordPress seemed like a suitable tool.</p>
<p>Perhaps that&rsquo;s the right choice for most people, but I had an itch that needed scratching. As a professional developer, WordPress didn&rsquo;t quite satisfy me. I also wanted a more Ruby-friendly solution since I&rsquo;m a Ruby developer and thought it might allow me to build something on top of it one day.</p>
<p>Budget considerations also played a role in my decision. I run this blog on DigitalOcean&rsquo;s smallest instance with 512MB of RAM, and I&rsquo;ve experienced issues where my blog went down due to MySQL consuming too much RAM. Although I enabled Linux swap once I identified the problem, the idea that WordPress was overkill for my needs stuck in my mind.</p>
<h2 id="enter-the-octopress">Enter The Octopress!<a href="#enter-the-octopress" class="anchor" aria-hidden="true"><svg xmlns="http://www.w3.org/2000/svg" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2"
      stroke-linecap="round" stroke-linejoin="round">
      <path d="M15 7h3a5 5 0 0 1 5 5 5 5 0 0 1-5 5h-3m-6 0H6a5 5 0 0 1-5-5 5 5 0 0 1 5-5h3"></path>
      <line x1="8" y1="12" x2="16" y2="12"></line>
   </svg></a></h2>
<p>In short, I discovered <a href="https://github.com/imathis/octopress">Octopress</a>, which allows me to create posts using Markdown and then generates static HTML pages served by nginx. This means no database, no server-side code, no swapping, amd no complicated backup/restore add-ons - just a code repository!</p>
<p>It works similarly to tools like the <a href="https://docs.angularjs.org/tutorial#get-started">AngularJS toolchain</a>, with templates on one end, a templating engine in between, and raw HTML, CSS and JavaScript on the other. The key difference is that there&rsquo;s no Node.js, Grunt.js, Bower or other tooling to manage - just some Ruby code. Sweet!</p>
]]></content>
		</item>
		
		<item>
			<title>CORS font issues with Rails, Heroku, CloudFront and Passenger</title>
			<link>https://nisdom.com/posts/2014-09-13-cors-font-issues-with-rails/</link>
			<pubDate>Sat, 13 Sep 2014 17:41:41 +0100</pubDate><guid>https://nisdom.com/posts/2014-09-13-cors-font-issues-with-rails/</guid>
			<description><![CDATA[&lt;no value&gt;]]></description><content type="text/html" mode="escaped"><![CDATA[<p>Ever seen a message in your browser’s console saying that some resources, like web fonts, could not be loaded because Access-Control-Allow-Origin headers were missing? Did you think “oh, this should be easy” and then spend hours searching through various misleading articles, and even more hours applying that advice and still failing? Well, I sure did, and here’s the story of how I finally won.</p>
<h2 id="tldr">TL;DR;<a href="#tldr" class="anchor" aria-hidden="true"><svg xmlns="http://www.w3.org/2000/svg" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2"
      stroke-linecap="round" stroke-linejoin="round">
      <path d="M15 7h3a5 5 0 0 1 5 5 5 5 0 0 1-5 5h-3m-6 0H6a5 5 0 0 1-5-5 5 5 0 0 1 5-5h3"></path>
      <line x1="8" y1="12" x2="16" y2="12"></line>
   </svg></a></h2>
<p>You need to get Passenger’s nginx template, modify it to attach CORS headers and use it instead of the default one.</p>
<h2 id="the-setup">The Setup<a href="#the-setup" class="anchor" aria-hidden="true"><svg xmlns="http://www.w3.org/2000/svg" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2"
      stroke-linecap="round" stroke-linejoin="round">
      <path d="M15 7h3a5 5 0 0 1 5 5 5 5 0 0 1-5 5h-3m-6 0H6a5 5 0 0 1-5-5 5 5 0 0 1 5-5h3"></path>
      <line x1="8" y1="12" x2="16" y2="12"></line>
   </svg></a></h2>
<p>My Rails app is hosted on Heroku and assets are served from a CloudFront distribution that has a custom origin pointing back to the Rails app. Heroku precompiles my assets during slug compilation and stores them under the public/assets folder (check the assets and CloudFront Heroku documents for details). All that is powered by standalone Passenger, to which I just recently switched from Unicorn.</p>
<p>My config/environments/production.rb file contains something like this:</p>
<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-ruby" data-lang="ruby"><span class="line"><span class="cl"><span class="n">config</span><span class="o">.</span><span class="n">serve_static_assets</span> <span class="o">=</span> <span class="kp">true</span>
</span></span><span class="line"><span class="cl"><span class="n">config</span><span class="o">.</span><span class="n">action_controller</span><span class="o">.</span><span class="n">asset_host</span> <span class="o">=</span> <span class="s2">&#34;//something.cloudfront.net&#34;</span>
</span></span></code></pre></div><p>The first line means that my app’s assets will be served by the Rails app and not by nginx. Rails injects a special middleware here (previously Rack::Static and more recently ActionDispatch::Static) that serves all files from the public folder. So whenever some resource is requested from the web app, the request is first inspected by the middleware. If the file is found, it is served directly from the file system. If not, the request travels through the usual Rails routing and controllers. This is useful if we want to control custom headers for those resources.</p>
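<p>A minimal sketch of what controlling headers at that level could look like: a tiny Rack middleware that adds a CORS header to font responses. The FontCorsHeaders name and the stand-in app below are made up for illustration; this is not code from my app.</p>
<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-ruby" data-lang="ruby"># Hypothetical Rack middleware: adds a CORS header to font responses only.
class FontCorsHeaders
  FONT_PATH = /\.(eot|svg|ttf|otf|woff)\z/

  def initialize(app)
    @app = app
  end

  def call(env)
    status, headers, body = @app.call(env)
    # Only touch responses for font files; everything else passes through.
    headers["Access-Control-Allow-Origin"] = "*" if env["PATH_INFO"].to_s =~ FONT_PATH
    [status, headers, body]
  end
end

# Stand-in for the rest of the stack, so the sketch runs without Rails.
inner_app = lambda { |_env| [200, { "Content-Type" =&gt; "application/octet-stream" }, ["..."]] }
stack = FontCorsHeaders.new(inner_app)

_, font_headers, _ = stack.call("PATH_INFO" =&gt; "/assets/icons.woff")
_, js_headers, _ = stack.call("PATH_INFO" =&gt; "/assets/app.js")
puts font_headers["Access-Control-Allow-Origin"].inspect # "*"
puts js_headers["Access-Control-Allow-Origin"].inspect   # nil
</code></pre></div>
<p>In a Rails app, a class like this would typically be registered with config.middleware.insert_before so that it wraps the static file middleware.</p>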
<h2 id="i-know-what-im-doing">I know what I’m doing…<a href="#i-know-what-im-doing" class="anchor" aria-hidden="true"><svg xmlns="http://www.w3.org/2000/svg" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2"
      stroke-linecap="round" stroke-linejoin="round">
      <path d="M15 7h3a5 5 0 0 1 5 5 5 5 0 0 1-5 5h-3m-6 0H6a5 5 0 0 1-5-5 5 5 0 0 1 5-5h3"></path>
      <line x1="8" y1="12" x2="16" y2="12"></line>
   </svg></a></h2>
<p>The issue of missing CORS headers for web fonts was, I thought initially, a walk in the park. First I would manually inject those CORS headers with some middleware magic or, even better, use the font_assets gem. Then I would invalidate the font assets in CloudFront to force a cache refresh and get proper CORS headers. Unfortunately, it didn’t work. Whatever I tried, CORS headers were nowhere to be seen.</p>
<h2 id="sobering-up">Sobering up<a href="#sobering-up" class="anchor" aria-hidden="true"><svg xmlns="http://www.w3.org/2000/svg" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2"
      stroke-linecap="round" stroke-linejoin="round">
      <path d="M15 7h3a5 5 0 0 1 5 5 5 5 0 0 1-5 5h-3m-6 0H6a5 5 0 0 1-5-5 5 5 0 0 1 5-5h3"></path>
      <line x1="8" y1="12" x2="16" y2="12"></line>
   </svg></a></h2>
<p>Of course, the real breakthrough came only when I started paying much closer attention to what was being returned from those requests. If I requested a valid resource that existed in the public folder, I got
<code>Server: nginx/1.6.1</code>
But if I requested some file that didn’t exist, I got
<code>Server: nginx/1.6.1 + Phusion Passenger 4.0.50</code>
along with all the CORS headers I could ever have hoped for (and a 404 error too). This meant that my serve_static_assets setting had no effect; nginx was somehow instructed to serve my static assets, without my consent.</p>
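<p>Here’s roughly how that check could be scripted in Ruby instead of eyeballing the browser tools. Both helper names are my own invention for this sketch, and response_headers makes a real HTTP request, so it is only defined here, not called.</p>
<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-ruby" data-lang="ruby">require "net/http"
require "uri"

# Guess who answered, based on the two Server header values shown above.
def served_by(server_header)
  server_header.to_s.include?("Phusion Passenger") ? :passenger : :nginx
end

# Fetch only the response headers of a URL (performs a real HEAD request).
def response_headers(url)
  uri = URI.parse(url)
  Net::HTTP.start(uri.host, uri.port, use_ssl: uri.scheme == "https") do |http|
    http.head(uri.request_uri).to_hash
  end
end

puts served_by("nginx/1.6.1")                            # nginx
puts served_by("nginx/1.6.1 + Phusion Passenger 4.0.50") # passenger
</code></pre></div>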
<p>It turned out that the Passenger standalone gem installs its own nginx configuration file, unlike the much simpler Unicorn gem I had previously. Here’s what it looked like in the original config.erb:</p>
<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-nginx" data-lang="nginx"><span class="line"><span class="cl"><span class="c1"># Rails asset pipeline support.
</span></span></span><span class="line"><span class="cl"><span class="k">location</span> <span class="p">~</span> <span class="sr">&#34;^/assets/.+-[0-9a-f]</span><span class="p">{</span><span class="kn">32}\..+&#34;</span> <span class="p">{</span>
</span></span><span class="line"><span class="cl">    <span class="kn">error_page</span> <span class="mi">490</span> <span class="p">=</span> <span class="s">@static_asset</span><span class="p">;</span>
</span></span><span class="line"><span class="cl">    <span class="kn">error_page</span> <span class="mi">491</span> <span class="p">=</span> <span class="s">@dynamic_request</span><span class="p">;</span>
</span></span><span class="line"><span class="cl">    <span class="kn">recursive_error_pages</span> <span class="no">on</span><span class="p">;</span>
</span></span><span class="line"><span class="cl">
</span></span><span class="line"><span class="cl">    <span class="kn">if</span> <span class="s">(-f</span> <span class="nv">$request_filename</span><span class="s">)</span> <span class="p">{</span>
</span></span><span class="line"><span class="cl">        <span class="kn">return</span> <span class="mi">490</span><span class="p">;</span>
</span></span><span class="line"><span class="cl">    <span class="p">}</span>
</span></span><span class="line"><span class="cl">    <span class="kn">if</span> <span class="s">(!-f</span> <span class="nv">$request_filename</span><span class="s">)</span> <span class="p">{</span>
</span></span><span class="line"><span class="cl">        <span class="kn">return</span> <span class="mi">491</span><span class="p">;</span>
</span></span><span class="line"><span class="cl">    <span class="p">}</span>
</span></span><span class="line"><span class="cl"><span class="p">}</span>
</span></span><span class="line"><span class="cl"><span class="kn">location</span> <span class="s">@static_asset</span> <span class="p">{</span>
</span></span><span class="line"><span class="cl">    <span class="kn">gzip_static</span> <span class="no">on</span><span class="p">;</span>
</span></span><span class="line"><span class="cl">    <span class="kn">expires</span> <span class="s">max</span><span class="p">;</span>
</span></span><span class="line"><span class="cl">    <span class="kn">add_header</span> <span class="s">Cache-Control</span> <span class="s">public</span><span class="p">;</span>
</span></span><span class="line"><span class="cl">    <span class="kn">add_header</span> <span class="s">ETag</span> <span class="s">&#34;&#34;</span><span class="p">;</span>
</span></span><span class="line"><span class="cl"><span class="p">}</span>
</span></span><span class="line"><span class="cl"><span class="kn">location</span> <span class="s">@dynamic_request</span> <span class="p">{</span>
</span></span><span class="line"><span class="cl">    <span class="kn">passenger_enabled</span> <span class="no">on</span><span class="p">;</span>
</span></span><span class="line"><span class="cl"><span class="p">}</span>
</span></span></code></pre></div><p>Let me explain those couple of lines:</p>
<ul>
<li>If a resource contains assets in its path, contains a digest in its name and actually exists on the file system, it will be treated as a static resource and served by the location @static_asset block.</li>
<li>If such a resource’s file doesn’t exist on the file system, it will use location @dynamic_request, i.e. go to the Rails app via Passenger.</li>
<li>If a resource doesn’t contain assets in its path and/or doesn’t contain a 32-character digest in its name, it will always be treated as static content with the usual location @static_asset code. My web fonts were such resources.</li>
</ul>
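<p>To see why my fonts never matched that first location, here is the same regular expression translated to Ruby. It’s only a sketch: the file names are made-up examples, and \A stands in for nginx’s ^ anchor.</p>
<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-ruby" data-lang="ruby"># Ruby translation of the "Rails asset pipeline support" location regex.
DIGEST_ASSET = %r{\A/assets/.+-[0-9a-f]{32}\..+}

def digest_asset?(path)
  DIGEST_ASSET.match?(path)
end

# A fingerprinted stylesheet matches (hypothetical file name).
puts digest_asset?("/assets/application-d41d8cd98f00b204e9800998ecf8427e.css") # true
# A font without a digest in its name does not.
puts digest_asset?("/assets/fontawesome-webfont.woff")                         # false
</code></pre></div>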
<h2 id="the-solution">The Solution<a href="#the-solution" class="anchor" aria-hidden="true"><svg xmlns="http://www.w3.org/2000/svg" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2"
      stroke-linecap="round" stroke-linejoin="round">
      <path d="M15 7h3a5 5 0 0 1 5 5 5 5 0 0 1-5 5h-3m-6 0H6a5 5 0 0 1-5-5 5 5 0 0 1 5-5h3"></path>
      <line x1="8" y1="12" x2="16" y2="12"></line>
   </svg></a></h2>
<p>What I did then was pretty straightforward: I copied that whole template, stored it in my config folder and modified it a bit. Here’s what I changed to serve CORS headers (but only for web fonts):</p>
<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-nginx" data-lang="nginx"><span class="line"><span class="cl"><span class="c1"># Rails asset pipeline support.
</span></span></span><span class="line"><span class="cl"><span class="k">location</span> <span class="p">~</span> <span class="sr">&#34;^/assets/.+-[0-9a-f]</span><span class="p">{</span><span class="kn">32}\..+&#34;</span> <span class="p">{</span>
</span></span><span class="line"><span class="cl">    <span class="kn">error_page</span> <span class="mi">490</span> <span class="p">=</span> <span class="s">@static_asset</span><span class="p">;</span>
</span></span><span class="line"><span class="cl">    <span class="kn">error_page</span> <span class="mi">491</span> <span class="p">=</span> <span class="s">@dynamic_request</span><span class="p">;</span>
</span></span><span class="line"><span class="cl">    <span class="kn">recursive_error_pages</span> <span class="no">on</span><span class="p">;</span>
</span></span><span class="line"><span class="cl">
</span></span><span class="line"><span class="cl">    <span class="kn">if</span> <span class="s">(-f</span> <span class="nv">$request_filename</span><span class="s">)</span> <span class="p">{</span>
</span></span><span class="line"><span class="cl">        <span class="kn">return</span> <span class="mi">490</span><span class="p">;</span>
</span></span><span class="line"><span class="cl">    <span class="p">}</span>
</span></span><span class="line"><span class="cl">    <span class="kn">if</span> <span class="s">(!-f</span> <span class="nv">$request_filename</span><span class="s">)</span> <span class="p">{</span>
</span></span><span class="line"><span class="cl">        <span class="kn">return</span> <span class="mi">491</span><span class="p">;</span>
</span></span><span class="line"><span class="cl">    <span class="p">}</span>
</span></span><span class="line"><span class="cl"><span class="p">}</span>
</span></span><span class="line"><span class="cl"><span class="c1"># Fonts in assets that don&#39;t contain digest in file name.
</span></span></span><span class="line"><span class="cl"><span class="kn">location</span> <span class="p">~</span> <span class="sr">&#34;^/assets/.+\.(eot|svg|ttf|otf|woff)&#34;</span> <span class="p">{</span>
</span></span><span class="line"><span class="cl">    <span class="kn">error_page</span> <span class="mi">490</span> <span class="p">=</span> <span class="s">@static_asset_fonts</span><span class="p">;</span>
</span></span><span class="line"><span class="cl">    <span class="kn">error_page</span> <span class="mi">491</span> <span class="p">=</span> <span class="s">@dynamic_request</span><span class="p">;</span>
</span></span><span class="line"><span class="cl">    <span class="kn">recursive_error_pages</span> <span class="no">on</span><span class="p">;</span>
</span></span><span class="line"><span class="cl">
</span></span><span class="line"><span class="cl">    <span class="kn">if</span> <span class="s">(-f</span> <span class="nv">$request_filename</span><span class="s">)</span> <span class="p">{</span>
</span></span><span class="line"><span class="cl">        <span class="kn">return</span> <span class="mi">490</span><span class="p">;</span>
</span></span><span class="line"><span class="cl">    <span class="p">}</span>
</span></span><span class="line"><span class="cl">    <span class="kn">if</span> <span class="s">(!-f</span> <span class="nv">$request_filename</span><span class="s">)</span> <span class="p">{</span>
</span></span><span class="line"><span class="cl">        <span class="kn">return</span> <span class="mi">491</span><span class="p">;</span>
</span></span><span class="line"><span class="cl">    <span class="p">}</span>
</span></span><span class="line"><span class="cl"><span class="p">}</span>
</span></span><span class="line"><span class="cl"><span class="kn">location</span> <span class="s">@static_asset</span> <span class="p">{</span>
</span></span><span class="line"><span class="cl">    <span class="kn">gzip_static</span> <span class="no">on</span><span class="p">;</span>
</span></span><span class="line"><span class="cl">    <span class="kn">expires</span> <span class="s">max</span><span class="p">;</span>
</span></span><span class="line"><span class="cl">    <span class="kn">add_header</span> <span class="s">Cache-Control</span> <span class="s">public</span><span class="p">;</span>
</span></span><span class="line"><span class="cl">    <span class="kn">add_header</span> <span class="s">ETag</span> <span class="s">&#34;&#34;</span><span class="p">;</span>
</span></span><span class="line"><span class="cl"><span class="p">}</span>
</span></span><span class="line"><span class="cl"><span class="kn">location</span> <span class="s">@static_asset_fonts</span> <span class="p">{</span>
</span></span><span class="line"><span class="cl">    <span class="kn">gzip_static</span> <span class="no">on</span><span class="p">;</span>
</span></span><span class="line"><span class="cl">    <span class="kn">expires</span> <span class="s">max</span><span class="p">;</span>
</span></span><span class="line"><span class="cl">    <span class="kn">add_header</span> <span class="s">Cache-Control</span> <span class="s">public</span><span class="p">;</span>
</span></span><span class="line"><span class="cl">    <span class="kn">add_header</span> <span class="s">ETag</span> <span class="s">&#34;&#34;</span><span class="p">;</span>
</span></span><span class="line"><span class="cl">    <span class="kn">add_header</span> <span class="s">&#39;Access-Control-Allow-Origin&#39;</span> <span class="s">&#39;*&#39;</span><span class="p">;</span>
</span></span><span class="line"><span class="cl">    <span class="kn">add_header</span> <span class="s">&#39;Access-Control-Allow-Methods&#39;</span> <span class="s">&#39;GET,</span> <span class="s">HEAD,</span> <span class="s">OPTIONS&#39;</span><span class="p">;</span>
</span></span><span class="line"><span class="cl">    <span class="kn">add_header</span> <span class="s">&#39;Access-Control-Allow-Headers&#39;</span> <span class="s">&#39;*&#39;</span><span class="p">;</span>
</span></span><span class="line"><span class="cl">    <span class="kn">add_header</span> <span class="s">&#39;Access-Control-Max-Age&#39;</span> <span class="mi">3628800</span><span class="p">;</span>
</span></span><span class="line"><span class="cl"><span class="p">}</span>
</span></span><span class="line"><span class="cl"><span class="kn">location</span> <span class="s">@dynamic_request</span> <span class="p">{</span>
</span></span><span class="line"><span class="cl">    <span class="kn">passenger_enabled</span> <span class="no">on</span><span class="p">;</span>
</span></span><span class="line"><span class="cl"><span class="p">}</span>
</span></span></code></pre></div><p>Besides modifying the nginx template, I needed to add the --nginx-config-template parameter to the Procfile, along with the path to my copy of the template (for that parameter to work you need Passenger &gt;= 4.0.39).</p>
<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-yaml" data-lang="yaml"><span class="line"><span class="cl"><span class="nt">web</span><span class="p">:</span><span class="w">
</span></span></span><span class="line"><span class="cl"><span class="w">  </span><span class="l">bundle exec passenger start -p $PORT \</span><span class="w">
</span></span></span><span class="line"><span class="cl"><span class="w">  </span>--<span class="l">max-pool-size ${WEB_CONCURRENCY:-3} \</span><span class="w">
</span></span></span><span class="line"><span class="cl"><span class="w">  </span>--<span class="l">nginx-config-template ./config/passenger_config.erb</span><span class="w">
</span></span></span></code></pre></div><p>The only remaining thing is to remember to merge Passenger’s latest nginx template with my changes whenever I decide to update that gem.</p>
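<p>One thing worth keeping in mind with the modified template: nginx tries regex locations in the order they appear in the file and stops at the first match. The rough Ruby model below mirrors only the two regexes and assumes the requested file exists on disk; the file names and the named_location_for helper are made up for this sketch.</p>
<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-ruby" data-lang="ruby"># The two regex locations from the modified template, in file order,
# paired with the named location they hand existing files to.
LOCATIONS = [
  [%r{\A/assets/.+-[0-9a-f]{32}\..+},        :static_asset],
  [%r{\A/assets/.+\.(eot|svg|ttf|otf|woff)}, :static_asset_fonts]
].freeze

# First matching regex wins, just like nginx's regex location lookup.
def named_location_for(path)
  match = LOCATIONS.find { |pattern, _name| pattern.match?(path) }
  match ? match.last : :no_regex_match
end

digest = "d41d8cd98f00b204e9800998ecf8427e"
puts named_location_for("/assets/fontawesome-webfont.woff")           # static_asset_fonts
puts named_location_for("/assets/application-#{digest}.css")          # static_asset
puts named_location_for("/assets/fontawesome-webfont-#{digest}.woff") # static_asset
</code></pre></div>
<p>The last line is the catch: a font that does carry a digest in its name is picked up by the first location and gets no CORS headers, so the extra block only helps fonts without a digest, which is what I had.</p>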
]]></content>
		</item>
		
		<item>
			<title>A simple Ruby sitemap.xml generator</title>
			<link>https://nisdom.com/posts/2014-04-12-a-simple-ruby-sitemap-dot-xml-generator/</link>
			<pubDate>Sat, 12 Apr 2014 18:17:02 +0100</pubDate><guid>https://nisdom.com/posts/2014-04-12-a-simple-ruby-sitemap-dot-xml-generator/</guid>
			<description><![CDATA[&lt;no value&gt;]]></description><content type="text/html" mode="escaped"><![CDATA[<p>Yesterday, I completed a simple Ruby CLI tool that I&rsquo;ve named SiteMapper. Its main purpose is to generate a sitemap.xml file, a format widely recognized by many popular search engines. You can find the tool at this GitHub link: <a href="https://github.com/okulik/lame-sitemapper">https://github.com/okulik/lame-sitemapper</a>.</p>
<p>During my initial tests, I realized that having a visual representation would be quite cool, rather than relying solely on space-indented text logs. As a result, I added a feature to generate a .dot file, which can then be converted into a .png image using the graphviz tool.</p>
<p>SiteMapper essentially serves as a straightforward, static web page hierarchy explorer. It starts from a page of your choice and navigates through the web site&rsquo;s structure by following links. It will continue until it has traversed all the available content or until it reaches a predefined depth limit.</p>
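<p>The traversal itself can be sketched in a few lines. This is a simplified, single-threaded illustration rather than SiteMapper&rsquo;s actual code: the LINKS hash is a made-up stand-in for pages fetched over HTTP, and crawl is a hypothetical helper.</p>
<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-ruby" data-lang="ruby"># Stand-in for a real site: page path =&gt; links found on that page.
LINKS = {
  "/"             =&gt; ["/posts", "/about"],
  "/posts"        =&gt; ["/posts/a", "/"],
  "/about"        =&gt; [],
  "/posts/a"      =&gt; ["/posts/a/deep"],
  "/posts/a/deep" =&gt; []
}.freeze

# Breadth-first walk from the start page; returns page =&gt; depth.
def crawl(links, start, max_depth)
  seen = { start =&gt; 0 }
  queue = [start]
  until queue.empty?
    page = queue.shift
    depth = seen[page]
    next if depth &gt;= max_depth          # depth limit reached, don't follow links
    links.fetch(page, []).each do |link|
      next if seen.key?(link)           # already visited, avoid loops
      seen[link] = depth + 1
      queue.push(link)
    end
  end
  seen
end

p crawl(LINKS, "/", 2).keys # ["/", "/posts", "/about", "/posts/a"]
</code></pre></div>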
<h2 id="links-normalization">Links Normalization<a href="#links-normalization" class="anchor" aria-hidden="true"><svg xmlns="http://www.w3.org/2000/svg" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2"
      stroke-linecap="round" stroke-linejoin="round">
      <path d="M15 7h3a5 5 0 0 1 5 5 5 5 0 0 1-5 5h-3m-6 0H6a5 5 0 0 1-5-5 5 5 0 0 1 5-5h3"></path>
      <line x1="8" y1="12" x2="16" y2="12"></line>
   </svg></a></h2>
<p>The primary challenge in traversing links was determining whether a link had been visited before or not. Without a reliable mechanism, there would be a risk of endlessly navigating through pages, potentially stuck in a loop and jumping from one page to another indefinitely. To tackle this issue, I implemented a method for normalizing raw URLs. This involved expanding each &lsquo;href&rsquo; value to its full path, removing any fragments, and sorting query parameters alphabetically. Let&rsquo;s take a look at some of the Ruby code responsible for this process.</p>
<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-ruby" data-lang="ruby"><span class="line"><span class="cl"><span class="k">def</span> <span class="nc">self</span><span class="o">.</span><span class="nf">get_normalized_url</span><span class="p">(</span><span class="n">host_url</span><span class="p">,</span> <span class="n">resource_url</span><span class="p">)</span>
</span></span><span class="line"><span class="cl">  <span class="n">host_url</span> <span class="o">=</span> <span class="no">Addressable</span><span class="o">::</span><span class="no">URI</span><span class="o">.</span><span class="n">parse</span><span class="p">(</span><span class="n">host_url</span><span class="p">)</span>
</span></span><span class="line"><span class="cl">  <span class="n">resource_url</span> <span class="o">=</span> <span class="no">Addressable</span><span class="o">::</span><span class="no">URI</span><span class="o">.</span><span class="n">parse</span><span class="p">(</span><span class="n">resource_url</span><span class="p">)</span>
</span></span><span class="line"><span class="cl"> 
</span></span><span class="line"><span class="cl">  <span class="n">m</span> <span class="o">=</span> <span class="p">{}</span>
</span></span><span class="line"><span class="cl">  <span class="n">m</span><span class="o">[</span><span class="ss">:scheme</span><span class="o">]</span> <span class="o">=</span> <span class="n">host_url</span><span class="o">.</span><span class="n">scheme</span> <span class="k">unless</span> <span class="n">resource_url</span><span class="o">.</span><span class="n">scheme</span>
</span></span><span class="line"><span class="cl">  <span class="k">unless</span> <span class="n">resource_url</span><span class="o">.</span><span class="n">host</span>
</span></span><span class="line"><span class="cl">    <span class="n">m</span><span class="o">[</span><span class="ss">:host</span><span class="o">]</span> <span class="o">=</span> <span class="n">host_url</span><span class="o">.</span><span class="n">host</span>
</span></span><span class="line"><span class="cl">    <span class="n">m</span><span class="o">[</span><span class="ss">:port</span><span class="o">]</span> <span class="o">=</span> <span class="n">host_url</span><span class="o">.</span><span class="n">port</span>
</span></span><span class="line"><span class="cl">  <span class="k">end</span>
</span></span><span class="line"><span class="cl">  <span class="n">resource_url</span><span class="o">.</span><span class="n">merge!</span><span class="p">(</span><span class="n">m</span><span class="p">)</span> <span class="k">unless</span> <span class="n">m</span><span class="o">.</span><span class="n">empty?</span>
</span></span><span class="line"><span class="cl">  <span class="k">return</span> <span class="kp">nil</span> <span class="k">unless</span> <span class="no">SUPPORTED_SCHEMAS</span><span class="o">.</span><span class="n">include?</span><span class="p">(</span><span class="n">resource_url</span><span class="o">.</span><span class="n">scheme</span><span class="p">)</span>
</span></span><span class="line"><span class="cl">  <span class="k">return</span> <span class="kp">nil</span> <span class="k">unless</span> <span class="no">PublicSuffix</span><span class="o">.</span><span class="n">valid?</span><span class="p">(</span><span class="n">resource_url</span><span class="o">.</span><span class="n">host</span><span class="p">)</span>
</span></span><span class="line"><span class="cl">  <span class="n">resource_url</span><span class="o">.</span><span class="n">omit!</span><span class="p">(</span><span class="ss">:fragment</span><span class="p">)</span>
</span></span><span class="line"><span class="cl">  <span class="n">resource_url</span><span class="o">.</span><span class="n">query</span> <span class="o">=</span> <span class="n">resource_url</span><span class="o">.</span><span class="n">query</span><span class="o">.</span><span class="n">split</span><span class="p">(</span><span class="s2">&#34;&amp;&#34;</span><span class="p">)</span><span class="o">.</span><span class="n">map</span><span class="p">(</span><span class="o">&amp;</span><span class="ss">:strip</span><span class="p">)</span><span class="o">.</span><span class="n">sort</span><span class="o">.</span><span class="n">join</span><span class="p">(</span><span class="s2">&#34;&amp;&#34;</span><span class="p">)</span> <span class="p">\</span>
</span></span><span class="line"><span class="cl">    <span class="k">unless</span> <span class="n">resource_url</span><span class="o">.</span><span class="n">query</span><span class="o">.</span><span class="n">nil?</span> <span class="o">||</span> <span class="n">resource_url</span><span class="o">.</span><span class="n">query</span><span class="o">.</span><span class="n">empty?</span>
</span></span><span class="line"><span class="cl"> 
</span></span><span class="line"><span class="cl">  <span class="k">return</span> <span class="no">Addressable</span><span class="o">::</span><span class="no">URI</span><span class="o">.</span><span class="n">encode</span><span class="p">(</span><span class="n">resource_url</span><span class="p">,</span> <span class="o">::</span><span class="no">Addressable</span><span class="o">::</span><span class="no">URI</span><span class="p">)</span><span class="o">.</span><span class="n">normalize</span>
</span></span><span class="line"><span class="cl"><span class="k">rescue</span> <span class="no">Addressable</span><span class="o">::</span><span class="no">URI</span><span class="o">::</span><span class="no">InvalidURIError</span><span class="p">,</span> <span class="no">TypeError</span>
</span></span><span class="line"><span class="cl">  <span class="kp">nil</span>
</span></span><span class="line"><span class="cl"><span class="k">end</span>
</span></span></code></pre></div><ol>
<li>We parse the URL string and convert it to an Addressable::URI object (addressable is a Ruby gem that serves as a replacement for the URI implementation that is part of Ruby’s standard library).</li>
<li>The host parameter is created from the starting URL, the one we chose as the starting point of our web site quest. Here it is also converted to Addressable::URI.</li>
<li>If a URL is given without a scheme, often in the form of //www.nisdom.com/a-simple-ruby-sitemap-xml-generator/, we take the scheme from the host URL. By calling merge!, we also ensure that URLs without a host, like /a-simple-ruby-sitemap-xml-generator, end up with the host name and port too.</li>
<li>We check whether the host part of our URL is valid with the PublicSuffix gem. Since HTML can contain any kind of text, we want to separate the wheat from the chaff and make the content we scrape as good as possible.</li>
<li>We remove everything to the right of the # mark (i.e. the fragment), since in most cases it leads to the same HTML content. Of course, if we are dealing with the routing features of single-page apps written with e.g. AngularJS, we might get different content with different fragments (and different content might mean more URLs to crawl). But, as previously mentioned, SiteMapper is simple and deals only with static content.</li>
<li>We sort query parameters alphabetically. We don’t support JavaScript, forms and whatnot, but we do handle query parameters, as they are rather easy (and I get to use that nice Ruby one-liner).</li>
<li>Finally, we encode any spaces and other characters that are not URL-compatible. Addressable to the rescue once again.</li>
</ol>
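<p>To make the effect of these steps tangible, here is a simplified sketch that uses only Ruby&rsquo;s standard URI library instead of Addressable and skips the PublicSuffix check and the final encoding step, so it is not the tool&rsquo;s actual method. The normalize_url helper and the sample hrefs are made up for illustration.</p>
<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-ruby" data-lang="ruby">require "uri"
require "set"

SUPPORTED_SCHEMES = %w[http https].freeze

# Simplified normalization: resolve against the host URL, drop the
# fragment and sort the query parameters.
def normalize_url(host_url, resource_url)
  uri = URI.join(host_url, resource_url) # fills in a missing scheme/host/port
  return nil unless SUPPORTED_SCHEMES.include?(uri.scheme)
  uri.fragment = nil
  unless uri.query.nil? || uri.query.empty?
    uri.query = uri.query.split("&amp;").map(&amp;:strip).sort.join("&amp;")
  end
  uri.to_s
rescue URI::InvalidURIError
  nil
end

seen = Set.new
hrefs = ["/about?b=2&amp;a=1#top", "//nisdom.com/about?a=1&amp;b=2", "mailto:me@example.com"]
hrefs.each do |href|
  url = normalize_url("https://nisdom.com/posts/", href)
  next if url.nil? # unsupported scheme or unparsable href
  puts "#{url} (#{seen.add?(url) ? 'new' : 'already seen'})"
end
</code></pre></div>
<p>The first two hrefs look different but normalize to the same URL, so the second one is reported as already seen; the mailto link is dropped because of its scheme.</p>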
<p>There are a couple more interesting places, and Crawler#should_crawl_page? is one of them:</p>
<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-ruby" data-lang="ruby"><span class="line"><span class="cl"><span class="k">def</span> <span class="nf">should_crawl_page?</span><span class="p">(</span><span class="n">host</span><span class="p">,</span> <span class="n">page</span><span class="p">,</span> <span class="n">depth</span><span class="p">)</span>
</span></span><span class="line"><span class="cl">  <span class="k">unless</span> <span class="no">UrlHelper</span><span class="o">.</span><span class="n">is_url_same_domain?</span><span class="p">(</span><span class="n">host</span><span class="p">,</span> <span class="n">page</span><span class="o">.</span><span class="n">path</span><span class="p">)</span>
</span></span><span class="line"><span class="cl">  <span class="o">...</span>
</span></span><span class="line"><span class="cl">  <span class="k">if</span> <span class="vi">@robots</span> <span class="o">&amp;&amp;</span> <span class="vi">@robots</span><span class="o">.</span><span class="n">disallowed?</span><span class="p">(</span><span class="n">page</span><span class="o">.</span><span class="n">path</span><span class="o">.</span><span class="n">to_s</span><span class="p">)</span>
</span></span><span class="line"><span class="cl">  <span class="o">...</span>
</span></span><span class="line"><span class="cl">  <span class="k">if</span> <span class="n">depth</span> <span class="o">&gt;=</span> <span class="vi">@opts</span><span class="o">[</span><span class="ss">:max_page_depth</span><span class="o">].</span><span class="n">to_i</span>
</span></span><span class="line"><span class="cl">  <span class="o">...</span>
</span></span><span class="line"><span class="cl"><span class="k">end</span>
</span></span></code></pre></div><p>When traversing from page to page, should_crawl_page? is called for each newly encountered link. It checks whether the link belongs to the same domain as the one we started with, whether the link is allowed by the robots.txt file and whether we have reached the maximum traversal depth. is_url_same_domain? is dead simple:</p>
<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-ruby" data-lang="ruby"><span class="line"><span class="cl"><span class="k">def</span> <span class="nc">self</span><span class="o">.</span><span class="nf">is_url_same_domain?</span><span class="p">(</span><span class="n">host_url</span><span class="p">,</span> <span class="n">resource_url</span><span class="p">)</span>
</span></span><span class="line"><span class="cl">  <span class="o">...</span>
</span></span><span class="line"><span class="cl">  <span class="n">host_url</span><span class="o">.</span><span class="n">host</span> <span class="o">==</span> <span class="n">resource_url</span><span class="o">.</span><span class="n">host</span>
</span></span><span class="line"><span class="cl"><span class="k">end</span>
</span></span></code></pre></div><p>One more interesting method is is_url_already_seen?, which, once a URL is normalized, tries to match it against previously seen URLs. If the URL was already seen, we simply ignore that path.</p>
<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-ruby" data-lang="ruby"><span class="line"><span class="cl"><span class="k">def</span> <span class="nf">is_url_already_seen?</span><span class="p">(</span><span class="n">url</span><span class="p">,</span> <span class="n">depth</span><span class="p">)</span>
</span></span><span class="line"><span class="cl">  <span class="k">if</span> <span class="vi">@seen_urls</span><span class="o">[</span><span class="no">Digest</span><span class="o">::</span><span class="no">MurmurHash64B</span><span class="o">.</span><span class="n">hexdigest</span><span class="p">(</span><span class="n">url</span><span class="o">.</span><span class="n">omit</span><span class="p">(</span><span class="ss">:scheme</span><span class="p">)</span><span class="o">.</span><span class="n">to_s</span><span class="p">)</span><span class="o">]</span>
</span></span><span class="line"><span class="cl">  <span class="o">...</span>
</span></span><span class="line"><span class="cl"><span class="k">end</span>
</span></span></code></pre></div><h2 id="concurrent-downloads">Concurrent Downloads<a href="#concurrent-downloads" class="anchor" aria-hidden="true"><svg xmlns="http://www.w3.org/2000/svg" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2"
      stroke-linecap="round" stroke-linejoin="round">
      <path d="M15 7h3a5 5 0 0 1 5 5 5 5 0 0 1-5 5h-3m-6 0H6a5 5 0 0 1-5-5 5 5 0 0 1 5-5h3"></path>
      <line x1="8" y1="12" x2="16" y2="12"></line>
   </svg></a></h2>
<p>Another aspect worth exploring is how pages are downloaded and processed concurrently. Since downloading pages over HTTP is predominantly I/O-bound, it&rsquo;s fine to create multiple threads and delegate downloads to them, even on MRI, whose global interpreter lock is released while a thread waits on I/O. To accomplish this, I implemented a producer-consumer pattern. Let&rsquo;s walk through it step by step. The following code snippets are extracted from the <a href="https://github.com/okulik/lame-sitemapper/blob/master/core.rb#start">Core#start</a> method, which represents the main thread of execution.</p>
<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-ruby" data-lang="ruby"><span class="line"><span class="cl"><span class="n">urls_queue</span> <span class="o">=</span> <span class="no">Queue</span><span class="o">.</span><span class="n">new</span>
</span></span><span class="line"><span class="cl"><span class="n">pages_queue</span> <span class="o">=</span> <span class="no">Queue</span><span class="o">.</span><span class="n">new</span>
</span></span><span class="line"><span class="cl"><span class="n">seen_urls</span> <span class="o">=</span> <span class="p">{}</span>
</span></span><span class="line"><span class="cl"><span class="n">threads</span> <span class="o">=</span> <span class="o">[]</span>
</span></span><span class="line"><span class="cl"><span class="n">root</span> <span class="o">=</span> <span class="kp">nil</span>
</span></span><span class="line"><span class="cl"> 
</span></span><span class="line"><span class="cl"><span class="no">Thread</span><span class="o">.</span><span class="n">abort_on_exception</span> <span class="o">=</span> <span class="kp">true</span>
</span></span><span class="line"><span class="cl"><span class="p">(</span><span class="mi">1</span><span class="o">..</span><span class="vi">@opts</span><span class="o">.</span><span class="n">scraper_threads</span><span class="o">.</span><span class="n">to_i</span><span class="p">)</span><span class="o">.</span><span class="n">each</span> <span class="k">do</span> <span class="o">|</span><span class="n">index</span><span class="o">|</span>
</span></span><span class="line"><span class="cl">  <span class="n">threads</span> <span class="o">&lt;&lt;</span> <span class="no">Thread</span><span class="o">.</span><span class="n">new</span> <span class="p">{</span> <span class="no">Scraper</span><span class="o">.</span><span class="n">new</span><span class="p">(</span><span class="n">seen_urls</span><span class="p">,</span> <span class="n">urls_queue</span><span class="p">,</span> <span class="n">pages_queue</span><span class="p">,</span> <span class="n">index</span><span class="p">,</span> <span class="vi">@opts</span><span class="p">,</span> <span class="vi">@robots</span><span class="p">)</span><span class="o">.</span><span class="n">run</span> <span class="p">}</span>
</span></span><span class="line"><span class="cl"><span class="k">end</span>
</span></span><span class="line"><span class="cl"> 
</span></span><span class="line"><span class="cl"><span class="n">urls_queue</span><span class="o">.</span><span class="n">push</span><span class="p">(</span><span class="ss">host</span><span class="p">:</span> <span class="n">host</span><span class="p">,</span> <span class="ss">url</span><span class="p">:</span> <span class="n">start_url</span><span class="p">,</span> <span class="ss">depth</span><span class="p">:</span> <span class="mi">0</span><span class="p">,</span> <span class="ss">parent</span><span class="p">:</span> <span class="n">root</span><span class="p">)</span>
</span></span><span class="line"><span class="cl"><span class="kp">loop</span> <span class="k">do</span>
</span></span><span class="line"><span class="cl">  <span class="n">msg</span> <span class="o">=</span> <span class="n">pages_queue</span><span class="o">.</span><span class="n">pop</span>
</span></span><span class="line"><span class="cl">  <span class="k">if</span> <span class="n">msg</span><span class="o">[</span><span class="ss">:page</span><span class="o">]</span>
</span></span><span class="line"><span class="cl">    <span class="n">msg</span><span class="o">[</span><span class="ss">:page</span><span class="o">].</span><span class="n">anchors</span><span class="o">.</span><span class="n">each</span> <span class="k">do</span> <span class="o">|</span><span class="n">anchor</span><span class="o">|</span>
</span></span><span class="line"><span class="cl">      <span class="n">urls_queue</span><span class="o">.</span><span class="n">push</span><span class="p">(</span><span class="ss">host</span><span class="p">:</span> <span class="n">host</span><span class="p">,</span> <span class="ss">url</span><span class="p">:</span> <span class="n">anchor</span><span class="p">,</span> <span class="ss">depth</span><span class="p">:</span> <span class="n">msg</span><span class="o">[</span><span class="ss">:depth</span><span class="o">]</span> <span class="o">+</span> <span class="mi">1</span><span class="p">,</span> <span class="ss">parent</span><span class="p">:</span> <span class="n">msg</span><span class="o">[</span><span class="ss">:page</span><span class="o">]</span><span class="p">)</span>
</span></span><span class="line"><span class="cl">    <span class="k">end</span>
</span></span><span class="line"><span class="cl">    <span class="o">...</span>
</span></span><span class="line"><span class="cl">  <span class="k">end</span>
</span></span><span class="line"><span class="cl">  <span class="o">...</span>
</span></span></code></pre></div><p>Here we create two queues and a set of scraper threads. The main thread communicates with the scraper threads only through these two queues. To have a page fetched, it pushes a message onto <code>urls_queue</code>; the finished page objects, assembled by the scraper threads, come back through <code>pages_queue</code>.</p>
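<p>The message flow between the two queues can be tried out in isolation. The following is a minimal sketch, not the code from lame-sitemapper: <code>FAKE_PAGES</code> and <code>fetch_all</code> are hypothetical names, and a hash lookup stands in for the HTTP download. It assumes the number of requests is known up front, so the main thread can simply pop one reply per request.</p>

```ruby
# Hypothetical page bodies standing in for HTTP responses (assumption).
FAKE_PAGES = {
  "/" => "home",
  "/about" => "about us",
  "/contact" => "get in touch"
}.freeze

# Hypothetical helper: sends each path to the workers and collects replies.
def fetch_all(paths, worker_count)
  urls_queue = Queue.new   # main thread to workers
  pages_queue = Queue.new  # workers to main thread

  workers = Array.new(worker_count) do
    Thread.new do
      # A nil message tells the worker to stop.
      while (msg = urls_queue.pop)
        body = FAKE_PAGES.fetch(msg[:url], "")
        pages_queue.push({ url: msg[:url], size: body.length })
      end
    end
  end

  paths.each { |path| urls_queue.push({ url: path }) }

  # One reply per request, so pop exactly paths.size times.
  results = Array.new(paths.size) { pages_queue.pop }

  # Send one finish message per worker, then wait for all of them.
  workers.size.times { urls_queue.push(nil) }
  workers.each { |worker| worker.join }

  results.sort_by { |reply| reply[:url] }
end

sizes = fetch_all(FAKE_PAGES.keys, 2)
puts sizes.inspect
```

<p>Because every request yields exactly one reply, the main thread knows when it is done. A crawler discovers new links as it goes, so it needs a different way to detect completion.</p>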
<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-ruby" data-lang="ruby"><span class="line"><span class="cl">  <span class="o">...</span>
</span></span><span class="line"><span class="cl">  <span class="k">if</span> <span class="n">urls_queue</span><span class="o">.</span><span class="n">empty?</span> <span class="o">&amp;&amp;</span> <span class="n">pages_queue</span><span class="o">.</span><span class="n">empty?</span>
</span></span><span class="line"><span class="cl">    <span class="k">until</span> <span class="n">urls_queue</span><span class="o">.</span><span class="n">num_waiting</span> <span class="o">==</span> <span class="n">threads</span><span class="o">.</span><span class="n">size</span>
</span></span><span class="line"><span class="cl">      <span class="no">Thread</span><span class="o">.</span><span class="n">pass</span>
</span></span><span class="line"><span class="cl">    <span class="k">end</span>
</span></span><span class="line"><span class="cl">    <span class="k">if</span> <span class="n">pages_queue</span><span class="o">.</span><span class="n">empty?</span>
</span></span><span class="line"><span class="cl">      <span class="n">threads</span><span class="o">.</span><span class="n">size</span><span class="o">.</span><span class="n">times</span> <span class="p">{</span> <span class="n">urls_queue</span> <span class="o">&lt;&lt;</span> <span class="kp">nil</span> <span class="p">}</span>
</span></span><span class="line"><span class="cl">      <span class="k">break</span>
</span></span><span class="line"><span class="cl">    <span class="k">end</span>
</span></span><span class="line"><span class="cl">  <span class="k">end</span>
</span></span><span class="line"><span class="cl"><span class="k">end</span>
</span></span><span class="line"><span class="cl"> 
</span></span><span class="line"><span class="cl"><span class="n">threads</span><span class="o">.</span><span class="n">each</span> <span class="p">{</span> <span class="o">|</span><span class="n">thread</span><span class="o">|</span> <span class="n">thread</span><span class="o">.</span><span class="n">join</span> <span class="p">}</span>
</span></span></code></pre></div><p>Here we determine whether the crawl is complete. If both queues are empty but some scraper threads are still processing pages (i.e., not all of them are blocked waiting on <code>urls_queue</code>), we call <code>Thread.pass</code> in a loop to tell the scheduler that we&rsquo;re yielding our time slice; this is Ruby&rsquo;s equivalent of <code>sleep(0)</code>. Once all scraper threads are idle, we check whether any pages are still waiting to be processed. If there are, we go back to the top of the main loop. If there are none, we send as many <code>nil</code> messages to <code>urls_queue</code> as we have scraper threads and then wait for all of them to finish.</p>
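<p>The completion check described above can be sketched on a toy link graph. This is an illustration under stated assumptions rather than the actual code of the tool: <code>LINKS</code> and <code>crawl</code> are hypothetical, a hash lookup replaces the download, and duplicate URLs are filtered in the main thread with a <code>Set</code> to keep the sketch free of shared mutable state.</p>

```ruby
require "set"

# Hypothetical in-memory site: each path maps to the paths it links to.
LINKS = {
  "/" => ["/a", "/b"],
  "/a" => ["/b", "/c"],
  "/b" => ["/"],
  "/c" => []
}.freeze

# Hypothetical helper: crawls the graph from "/" and returns visited paths.
def crawl(links, worker_count)
  urls_queue = Queue.new
  pages_queue = Queue.new
  seen = Set.new
  visited = []

  workers = Array.new(worker_count) do
    Thread.new do
      # nil is the finish message, as in the scraper loop.
      while (msg = urls_queue.pop)
        # Stand-in for an HTTP fetch: look up the outgoing links.
        anchors = links.fetch(msg[:url], [])
        pages_queue.push({ url: msg[:url], anchors: anchors })
      end
    end
  end

  seen.add("/")
  urls_queue.push({ url: "/" })

  loop do
    msg = pages_queue.pop
    visited.push(msg[:url])
    msg[:anchors].each do |anchor|
      # Set#add? returns nil when the element is already present.
      urls_queue.push({ url: anchor }) if seen.add?(anchor)
    end

    next unless urls_queue.empty? && pages_queue.empty?

    # Yield until every worker is blocked in urls_queue.pop.
    Thread.pass until urls_queue.num_waiting == workers.size
    # A worker may have delivered a page just before going idle.
    next unless pages_queue.empty?

    workers.size.times { urls_queue.push(nil) }
    break
  end

  workers.each { |worker| worker.join }
  visited.sort
end

result = crawl(LINKS, 3)
puts result.inspect
```

<p>The check relies on the fact that only the main thread pushes to <code>urls_queue</code>: once every worker is blocked on it and <code>pages_queue</code> is empty, no further pages can appear.</p>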
<p>The main method of the scraper threads is quite simple. It dequeues messages containing page URLs to be processed and invokes the <code>create_page</code> method, which fetches the HTML, parses it (using the excellent Nokogiri gem), and ultimately generates a page object. This object is then pushed onto <code>pages_queue</code>, where the main thread picks it up and integrates it into the directed graph of pages.</p>
<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-ruby" data-lang="ruby"><span class="line"><span class="cl"><span class="kp">loop</span> <span class="k">do</span>
</span></span><span class="line"><span class="cl">  <span class="n">msg</span> <span class="o">=</span> <span class="vi">@urls_queue</span><span class="o">.</span><span class="n">pop</span>
</span></span><span class="line"><span class="cl">  <span class="k">unless</span> <span class="n">msg</span>
</span></span><span class="line"><span class="cl">    <span class="no">LOGGER</span><span class="o">.</span><span class="n">debug</span> <span class="s2">&#34;scraper </span><span class="si">#{</span><span class="vi">@index</span><span class="si">}</span><span class="s2"> received finish message&#34;</span>
</span></span><span class="line"><span class="cl">    <span class="k">break</span>
</span></span><span class="line"><span class="cl">  <span class="k">end</span>
</span></span><span class="line"><span class="cl"> 
</span></span><span class="line"><span class="cl">  <span class="n">page</span> <span class="o">=</span> <span class="n">create_page</span><span class="p">(</span><span class="n">msg</span><span class="p">)</span>
</span></span><span class="line"><span class="cl"> 
</span></span><span class="line"><span class="cl">  <span class="vi">@pages_queue</span><span class="o">.</span><span class="n">push</span><span class="p">(</span><span class="ss">page</span><span class="p">:</span> <span class="n">page</span><span class="p">,</span> <span class="ss">url</span><span class="p">:</span> <span class="n">msg</span><span class="o">[</span><span class="ss">:url</span><span class="o">]</span><span class="p">,</span> <span class="ss">depth</span><span class="p">:</span> <span class="n">msg</span><span class="o">[</span><span class="ss">:depth</span><span class="o">]</span><span class="p">,</span> <span class="ss">parent</span><span class="p">:</span> <span class="n">msg</span><span class="o">[</span><span class="ss">:parent</span><span class="o">]</span><span class="p">)</span>
</span></span><span class="line"><span class="cl"><span class="k">end</span>
</span></span></code></pre></div><h2 id="conclusion">Conclusion<a href="#conclusion" class="anchor" aria-hidden="true"><svg xmlns="http://www.w3.org/2000/svg" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2"
      stroke-linecap="round" stroke-linejoin="round">
      <path d="M15 7h3a5 5 0 0 1 5 5 5 5 0 0 1-5 5h-3m-6 0H6a5 5 0 0 1-5-5 5 5 0 0 1 5-5h3"></path>
      <line x1="8" y1="12" x2="16" y2="12"></line>
   </svg></a></h2>
<p>In a nutshell, SiteMapper is a Ruby CLI tool that simplifies generating sitemap.xml files. It not only makes exploring a site&rsquo;s page hierarchy easier but also offers a nice visual representation of it. This post gave a sneak peek into its inner workings, from URL normalization to concurrent downloads; perhaps it will prove a handy tool for web developers.</p>
]]></content>
		</item>
		
	</channel>
</rss>
