Since Hubot ships with a Redis Brain by default, I decided to use this as an opportunity to learn some Redis. While reading through the Redis documentation, I came across Redis Mass Insertion, which sparked an odd curiosity (twinkle twinkle). The crux of Redis Mass Insertion is a recommendation to write large data sets to a Redis instance using the Redis protocol with redis-cli --pipe, rather than pushing data through a Redis client. The benefits are maximized throughput, better assurance of data consistency, and a nice validation message:

All data transferred. Waiting for the last reply...
Last reply received from server.
errors: 0, replies: 1000

The Redis Mass Insertion documentation includes a couple of short code snippets for generating test data, along with example commands for pushing data to Redis. From these snippets, I cobbled together a Ruby script that writes an arbitrary number of K/V pairs to STDOUT:

#!/usr/bin/ruby

def int_check(val)
  pass = Integer(val) rescue nil
  if pass
    val.to_i
  else
    STDERR.puts "Argument must be an integer."
    exit 1
  end
end

def gen_redis_proto(*cmd)
  proto = ""
  proto << "*"+cmd.length.to_s+"\r\n"
  cmd.each{|arg|
    proto << "$"+arg.to_s.bytesize.to_s+"\r\n"
    proto << arg.to_s+"\r\n"
  }
  proto
end

def generate_data(val)
  (0...val).each do |n|
    STDOUT.write(gen_redis_proto("SET", "Key#{n}", "Value#{n}"))
  end
end

generate_data(int_check(ARGV[0]))

The above script can be called as ruby redis-pipe.rb 10000000 >> ./proto.txt to generate a file containing ten million key:value pairs.
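The format gen_redis_proto emits is the Redis protocol: an array header (*3 for a three-part command), then each argument prefixed by its byte length. As a sanity check, here's a condensed re-implementation of the encoder with the output it produces for a single SET:

```ruby
# Condensed version of gen_redis_proto from the script above:
# array header, then "$<bytesize>\r\n<arg>\r\n" per argument.
def gen_redis_proto(*cmd)
  proto = "*#{cmd.length}\r\n"
  cmd.each do |arg|
    arg = arg.to_s
    proto << "$#{arg.bytesize}\r\n#{arg}\r\n"
  end
  proto
end

puts gen_redis_proto("SET", "Key0", "Value0").inspect
# prints "*3\r\n$3\r\nSET\r\n$4\r\nKey0\r\n$6\r\nValue0\r\n"
```

Using bytesize rather than length matters for multi-byte values; the length prefix must count bytes, not characters.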

From here I figured it might be fun to do a few benchmarks of redis-cli --pipe versus netcat HOST PORT, as well as protocol versus flat commands. I created a bash one-liner to generate the same data set from above as a flat list of Redis SET commands without the extra protocol markup:

i=0 ; while [[ ${i} -lt 10000000 ]] ; do echo " SET Key${i} Value${i}" ; i=$((i + 1)) ; done >> flat.txt

Here's how the resulting files look:

$ du -a *.txt
274M  flat.txt
464M  proto.txt

$ head -7 proto.txt
*3
$3
SET
$4
Key0
$6
Value0

$ head -1 flat.txt
 SET Key0 Value0

With data in hand, we just need a Redis instance to test against. I set up an Automated Build through Docker Hub with the current latest Redis version. I then deployed this container locally (OS X) via boot2docker: docker pull nhoag/redis && docker run --name redis -p 6379 -d nhoag/redis. Next I installed Redis locally with brew install redis to facilitate accessing the Redis container.

As a small test, we can connect to the container and SET and GET. But first we need the connection specs for the Redis container:

$ docker ps -a
CONTAINER ID        IMAGE                COMMAND                CREATED             STATUS              PORTS                     NAMES
ca48d4ff024e        nhoag/redis:latest   "redis-server /etc/r   2 seconds ago       Up 1 seconds        0.0.0.0:49156->6379/tcp   redis

$ boot2docker ip
192.3.4.5

Using the above information, we can connect with Redis as follows:

redis-cli -h 192.3.4.5 -p 49156
192.3.4.5:49156> SET a b
OK
192.3.4.5:49156> GET a
"b"
192.3.4.5:49156> FLUSHDB
OK
192.3.4.5:49156>

It works! On to mass inserts. As you can see above, I opted to pre-generate data to standardize the insertion process. This means we can run inserts as follows:

# Redis Protocol
$ cat proto.txt | redis-cli --pipe -h 192.3.4.5 -p 49156 > /dev/null
$ cat proto.txt | nc 192.3.4.5 49156 > /dev/null
$ cat proto.txt | socat - TCP:192.3.4.5:49156 > /dev/null

# Flat Commands
$ cat flat.txt | redis-cli --pipe -h 192.3.4.5 -p 49156 > /dev/null
$ cat flat.txt | nc 192.3.4.5 49156 > /dev/null
$ cat flat.txt | socat - TCP:192.3.4.5:49156 > /dev/null

Rinse and repeat after each iteration:

  1. redis-cli -h 192.3.4.5 -p 49156
  2. DBSIZE - should be 10,000,000
  3. FLUSHDB

I introduced socat into the equation because my version of netcat doesn't auto-recognize EOF. Some versions of netcat have -c or -q0, but not mine :( This means netcat will hang after the data has been fully processed until it's manually told to stop. socat automatically hangs up on EOF by default, which is attractive because it allows simple benchmarking with time. But notice I haven't included any time statistics. As you'll see, I found a better alternative to time, and then kept the socat data since it was already in the mix.
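The EOF behavior in question is plain TCP semantics, and can be demonstrated without Redis at all. A toy sketch (a throwaway TCP server stands in for the Redis instance): the reader returns as soon as the sender closes its end, which is the automatic hang-up socat gives us and some netcat builds lack.

```ruby
require 'socket'

# Toy stand-in for a Redis server: read until the sender closes (EOF).
server = TCPServer.new("127.0.0.1", 0)
port = server.addr[1]

reader = Thread.new do
  conn = server.accept
  data = conn.read   # returns only once the sender closes the connection
  conn.close
  data
end

# Stream the payload, then close the socket. Without the close, the
# reader (like a hanging netcat) would wait indefinitely.
client = TCPSocket.new("127.0.0.1", port)
client.write("SET Key0 Value0\r\n")
client.close

received = reader.value
puts received.bytesize  # prints 17
```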

There is a very fun project for monitoring Redis called redis-stat. Using redis-stat --server=8282 192.3.4.5:49156 1, we get updates every second from the command line as well as in the browser at localhost:8282.

redis-stat command line output

redis-stat browser

When Commands per Second and CPU Usage drop, and when Memory Usage levels off, we know it's safe to shut down netcat. In the command line output we also get a rich dataset about how the insert performed that can be further parsed and analyzed. And the browser display provides a nice high-level overview.

In addition to redis-stat, I set up an ssh session running htop for an added lens into performance. This turned out to be very helpful in cases where the VM would unexpectedly hit capacity and start swapping, queuing, and backgrounding tasks. This didn't happen often, but when it did, inserts slowed down massively.

The below data is from "clean" runs where the above-mentioned tipping point did not occur. Of course it would be better to run these inserts hundreds of times and aggregate the results. The data presented below is a semi-curated set of what seem to be typical responses for each insert method.

To generate the below data, I started with the raw redis-stat cli output. I parsed all of the rows that show insert activity, and then removed the first and last rows since these were typically inconsistent with the rest of the data set. Here is an example of generating an average for inserts per millisecond from a prepared data-set:

$ cat stats.txt |
    tail -n +2 |        # Remove the first line
    sed '$d' |          # Remove the last line
    awk '{print $9}' |  # Print the cmd/s column
    tr -d k |           # Remove the 'k' suffix
    awk '{ sum += $1 } END { if (NR > 0) print sum / NR }'
161.148

161.148k inserts/s = 161,148 inserts/s ÷ 1,000 ms/s ≈ 161 inserts/ms
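The same aggregation can be sketched in Ruby; the cmd/s samples below are made up, but use the same 'k'-suffixed format redis-stat reports:

```ruby
# Hypothetical redis-stat cmd/s readings, one per second (e.g. "160.2k").
rows = ["158.9k", "160.2k", "164.3k"]

# Strip the 'k' suffix and average, mirroring the tr/awk pipeline above.
avg_k = rows.map { |r| r.delete("k").to_f }.sum / rows.size

# 1k commands/s is numerically 1 insert/ms, so the average in 'k' units
# already reads as inserts per millisecond.
puts avg_k.round  # prints 161
```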

Redis Protocol

Command      Time (s)   Agg. Inserts/ms   Avg. Inserts/ms
netcat       63         159               161
redis-cli    57         175               169
socat        62         161               160

Flat Redis Commands

Command      Time (s)   Agg. Inserts/ms   Avg. Inserts/ms
netcat       60         167               161
redis-cli    66         152               147
socat        66         152               148

A few takeaways:

  1. redis-cli --pipe with the Redis protocol shows a slight performance edge
  2. netcat with flat commands was the runner-up, and netcat with the Redis protocol was only slightly slower
  3. socat with the Redis protocol was comparable to netcat
  4. socat and redis-cli --pipe with flat commands were the slowest

TLDR: Use redis-cli --pipe with the Redis protocol for mass inserts and save on the order of 10+ minutes per billion K/V pairs ;)

In deploying Hubot for the first time, you may encounter the following error:

ERROR ReferenceError: fillAddress is not defined
  at TextListener.callback (/path/to/bot/node_modules/hubot-maps/src/maps.coffee:58:16, <js>:57:18)
  ...

At the time of this writing, running a grep in the Hubot Maps source code shows a single call to the function and no function definition:

grep -rn fillAddress .
./src/maps.coffee:58:    location = fillAddress(msg.match[3])

Stepping back a level and grepping all of Hubot and its scripts yields the same result as above.

Running a Google search for hubot maps fillAddress gives one promising hit. Looking at that code, we can see that fillAddress() is indeed defined:

  fillAddress = (address) ->
    if (address.match(/borderlands/i))
      return '1109 Pebblewood Way, San Mateo, CA'
    else if (address.match(/hhh/i))
      return '516 Chesterton Ave, Belmont, CA'
    else if (address.match(/airbnb/i))
      return '888 Brannan St, San Francisco, CA'

    return address

But do we need this function defined, or do we need to remove the reference? All the function does is provide a way to alias human-friendly shortcuts to particular addresses. So typing hubot map me hhh should actually return the coordinates for '516 Chesterton Ave, Belmont, CA'. This is a nice idea, but definitely not necessary for my purposes.
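For illustration, the aliasing logic translated to Ruby (the shortcuts and addresses are the ones hard-coded in the upstream CoffeeScript; anything unmatched passes through untouched):

```ruby
# Ruby sketch of the CoffeeScript fillAddress above: map nickname
# patterns to full street addresses, falling back to the input.
def fill_address(address)
  case address
  when /borderlands/i then '1109 Pebblewood Way, San Mateo, CA'
  when /hhh/i         then '516 Chesterton Ave, Belmont, CA'
  when /airbnb/i      then '888 Brannan St, San Francisco, CA'
  else address
  end
end

puts fill_address('hhh')     # prints 516 Chesterton Ave, Belmont, CA
puts fill_address('Boston')  # prints Boston
```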

Back to Hubot Maps, there is an open pull request that addresses this issue by removing the remaining fillAddress() reference that's causing the error (while adding support for Japanese characters). This PR hasn't yet been merged into Hubot Maps, but we can still benefit from the fix.

Here's one way to deploy a patched script to Hubot:

  1. Fork hubot-maps
  2. Deploy the patch to your fork
  3. Tag your fork (e.g. 0.0.1p) - git tag 0.0.1p && git push --tags
  4. Reference your forked version of hubot-maps in package.json:
...
    "hubot-maps": "https://github.com/USERNAME/hubot-maps/archive/0.0.1p.tar.gz",
...

Now when you run npm install, npm will pull in the patched script.
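For context, the override lives in the dependencies block of package.json alongside the other Hubot packages; a minimal hypothetical sketch (the surrounding package names and versions are illustrative, and USERNAME is your GitHub account):

```json
{
  "dependencies": {
    "hubot": "^2.11.0",
    "hubot-slack": "^3.3.0",
    "hubot-maps": "https://github.com/USERNAME/hubot-maps/archive/0.0.1p.tar.gz"
  }
}
```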

And subsequently firing up Hubot and asking for a map shows success:

./bin/hubot
# No error! \o/

Hubot map me Boston

A little over a year ago, I wrote a quick blog post about deploying Hubot with Docker. A lot has changed with Hubot and Docker since that time, so I decided to revisit the build.

The new implementation I whipped up consists of three main components:

  1. Yeoman-generated Hubot
  2. Base Docker image
  3. Dockerfile for configuring Hubot

The Hubot 'Getting Started' instructions walk us through generating a deployable Hubot with Yeoman. Once generated, the code can be stashed away somewhere until we're ready to pull it into a Docker image. In this case I committed the code to GitHub.

Now that we have a bot defined, we can build a new Docker image to deploy and run the bot. The base Docker image (below) installs Node.js, pulls in our bot repo, and runs npm install, but notice we're not deploying any configuration yet:

# DOCKER-VERSION  1.3.2

FROM ubuntu:14.04
MAINTAINER Nathaniel Hoag, info@nathanielhoag.com

ENV BOTDIR /opt/bot

RUN apt-get update && \
  apt-get install -y wget && \
  wget -q -O - https://deb.nodesource.com/setup | bash - && \
  apt-get install -y git build-essential nodejs && \
  rm -rf /var/lib/apt/lists/* && \
  git clone --depth=1 https://github.com/nhoag/bot.git ${BOTDIR}

WORKDIR ${BOTDIR}

RUN npm install

Anyone can use or modify the build to create their own Docker images:

git clone git@github.com:nhoag/doc-bot.git
# Optionally edit ./doc-bot/Dockerfile
docker build -t="id/hubot:tag" ./doc-bot/
docker push id/hubot

At this point, we have an image ready to go and just need to sprinkle in some configuration to make sure our bot is talking to the right resources. This is where we'll make use of the bot-cfg repo, which contains yet another Dockerfile:

# DOCKER-VERSION        1.3.2

FROM nhoag/hubot
MAINTAINER Nathaniel Hoag, info@nathanielhoag.com

ENV HUBOT_PORT 8080
ENV HUBOT_ADAPTER slack
ENV HUBOT_NAME bot-name
ENV HUBOT_GOOGLE_API_KEY xxxxxxxxxxxxxxxxxxxxxx
ENV HUBOT_SLACK_TOKEN xxxxxxxxxxxxxxxxxxxxx
ENV HUBOT_SLACK_TEAM team-name
ENV HUBOT_SLACK_BOTNAME ${HUBOT_NAME}
ENV PORT ${HUBOT_PORT}

EXPOSE ${HUBOT_PORT}

WORKDIR /opt/bot

CMD bin/hubot

Here we're extending the public nhoag/hubot image created earlier by adding our private credentials as environment variables. Once this is populated with real data, the last steps are to build and run the updated image.

Below is the full deployment process that should give you a new Slack-integrated Hubot:

  1. docker pull nhoag/hubot
  2. git clone git@github.com:nhoag/bot-cfg.git
  3. vi ./bot-cfg/Dockerfile (configure ENVs)
  4. docker build -t="nhoag/hubot:live" ./bot-cfg/
  5. docker run -d -p 45678:8080 nhoag/hubot:live
  6. Add the public Hubot address to your Slack Hubot Integration (e.g. http://2.2.2.2:45678/)

Happy chatting!

Update -- 2014-12-08

Small optimization to bot-cfg to remove command arguments in favor of environment variables.


Update: It turns out I have multiple accounts and was reviewing a secondary account I'd forgotten about and had barely used. Still, it's a useful exercise to consider the worst case of data loss in a blackbox cloud system. Digging deeper into the topic of efficient and distributed notes, I found that Brett Terpstra has put an incredible amount of time and effort into evolving this space. Nothing yet feels fully baked, but tools such as Popclip (with awesome extensions), nvalt, Bullseye, and GistBox provide a lot of interesting avenues.


A few weeks ago, it seemed my Evernote account was unexpectedly truncated - it went from hundreds of notes to a mere handful. Turns out I was looking at the wrong account - D'oh! Without realizing the mistake, I was suddenly very motivated to find a transparent and robust system for keeping notes.

My personal notes are mostly excerpts from daily tech ramblings - passages and one-liners from projects, emails, chat transcripts, and the Web. I leverage notes to recall information sources, as fodder for blogging, to remember tricky tech solutions and problems, and to share information with friends and colleagues at opportune moments. All the regular stuff.

The (presumed) data loss provided motivation to investigate alternatives. I can't yet say that my search is anywhere near complete, but following are some thoughts about Smallest Federated Wiki and IPython Notebook, along with musings around a simpler alternative.

Smallest Federated Wiki


Update: I recently listened to the Javascript Jabber episode on Federated Wiki (no longer 'Smallest'). It's worth a listen if you're interested in distributed information systems.


Smallest Federated Wiki is an impressive distributed information system with a lot of potential to revolutionize the wiki-sphere. The major blocker for me is the investment required to learn how to use it correctly. It has a dense UI and is as amorphous as they come. Up front, it reads as a deep investment of time that may only possibly get my needs met.

IPython Notebook

IPython Notebook has so far been easy and fun to set up and use. It maps to my expectations pretty well, is very pluggable, and can provide functionality and an experience very similar to Evernote's. It also has the capability to execute code, which is an awesome bonus. IPython lacks some features that, if addressed, could make it even better for this use case, but as I'm discovering with additional use, IPython is full of fun surprises.

IPython Notebook doesn't have built-in functionality for creating local directories, comprehensive search, or note sharing. But it's easy enough to add or make up for these missing features with plugins, in-notebook code execution, and straight bash.

One thing I really miss from Evernote is the ability to embed an HTML snapshot of a webpage in a note. IPython provides embedded iframes, but this doesn't protect against a page going away.

There are implementations of IPython for Ruby and PHP, which add further power to in-notebook computation.

For backups, I set up an S3 bucket with syncing courtesy of the AWS CLI, à la:

aws --profile=profile-id s3 sync . s3://bucket-id --delete

Simpler Alternative

Getting back to basics, there are plenty of alternatives for assembling a simple notebook repository. The most straightforward approach would be to write/paste locally in $editor and commit to a $vcs repository. This is stable, can be backed up anywhere, and is version controlled. For sharing, specific notes can be piped to GitHub Gist:

gist -c -p < path/to/file

For increased compatibility and consistency across mediums (notebook, Gist, static website), it's probably not a bad idea to compose notes in Markdown. There are a few tools for automating conversion to Markdown, but it'll take some investigation to identify whether any of them are good.

The Fast 404 Drupal contributed module project page provides a lot of context for why 404s are expensive in Drupal:

... On an 'average' site with an 'average' module load, you can be looking at 60-100MB of memory being consumed on your server to deliver a 404. Consider a page with a bad .gif link and a missing .css file. That page will generate 2 404s along with the actual load of the page. You are most likely looking at 180MB of memory to server that page rather than the 60MB it should take.

The explanation continues to describe how Drupal 7 has a rudimentary implementation for reducing the impact of 404s. You may have seen the below code while reviewing settings.php:

<?php
$conf['404_fast_paths_exclude'] = '/\/(?:styles)\//';
$conf['404_fast_paths'] = '/\.(?:txt|png|gif|jpe?g|css|js|ico|swf|flv|cgi|bat|pl|dll|exe|asp)$/i';
$conf['404_fast_html'] = '<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML+RDFa 1.0//EN" "http://www.w3.org/MarkUp/DTD/xhtml-rdfa-1.dtd"><html xmlns="http://www.w3.org/1999/xhtml"><head><title>404 Not Found</title></head><body><h1>Not Found</h1><p>The requested URL "@path" was not found on this server.</p></body></html>';

But wouldn't it be nice to actually see the difference between all of these implementations? Thanks to the magic of open source, we can!

Nearly a month ago now, Mark Sonnabaum posted a Gist with instructions for generating flame graphs from XHProf-captured Drupal stacks. The technique converts XHProf samples to a format that can be read and interpreted by Brendan Gregg's excellent FlameGraph tool.

I set up a local Drupal site in three 404 configurations (unmitigated, default 404, Fast 404) and tested them one at a time. One difficulty with testing is that the default XHProf sample rate is 0.1 seconds. This was plenty for unmitigated 404s, but I had to make a lot of requests to capture a stack with the Fast 404 module in place.
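Making lots of requests is easy to script. A toy sketch in Ruby (a throwaway TCP server that 404s everything stands in for the local Drupal site, so the snippet is self-contained):

```ruby
require 'socket'
require 'net/http'

# Throwaway server that answers 404 to every request -- a stand-in for
# a local site with a missing asset.
server = TCPServer.new("127.0.0.1", 0)
port = server.addr[1]

responder = Thread.new do
  3.times do
    conn = server.accept
    # Drain the request head (request line and headers, up to blank line).
    while (line = conn.gets) && line.chomp != ""; end
    conn.write("HTTP/1.1 404 Not Found\r\nContent-Length: 0\r\nConnection: close\r\n\r\n")
    conn.close
  end
end

# Hammer a missing asset repeatedly, as when coaxing the 0.1 s XHProf
# sampler into catching the short-lived Fast 404 code path.
codes = 3.times.map do
  Net::HTTP.get_response(URI("http://127.0.0.1:#{port}/missing.png")).code
end
responder.join

puts codes.uniq.inspect  # prints ["404"]
```

Against a real site, the loop body would simply point at the site's hostname; the request count just needs to be high enough for the sampler to land inside the 404 path.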

The flame graph screenshots below corroborate what we would expect, with unmitigated 404s being the tallest stack of the bunch, the Drupal core 404 implementation showing a favorable reduction, and Fast 404 showing the shortest stack. We can also extrapolate that adding various contrib modules will push the stacks even higher with unmitigated 404s.

Click each image below for an interactive flame graph.

Unmitigated 404s

Unmitigated 404 Flame Graph

Drupal Core 404 Implementation

Drupal 404 Flame Graph

Fast 404

Fast 404 Flame Graph