Over the past few months, I’ve found and created a bunch of fun new scripting tricks and tools. Below are two somewhat related items that helped to unlock new possibilities for me in remote bash automation. The first is a Perl one-liner that allows filtering access logs by start and end times. The second is a method for executing complex commands remotely via ssh without all those intricate escapes.

As context for the Perl log filter, my team at work regularly performs Load Test Analyses. A customer will run a Load Test, provide us with the start and end times for the test window, and then we run a comprehensive analysis to determine whether any problems were recorded. Prior to automation, we would hand-craft grep time filters with Regular Expressions (e.g. grep -E '15/Oct/2015:0(4:(3[4-9]|[4-5][0-9])|5:([0-1][0-9]|2[0-3]))'), and then run a bunch of analyses on the results. This is not so bad, but it requires training in Regular Expressions, is prone to human error, and demands some careful thought.

In developing a more human/beginner-friendly solution, I wanted people to be able to enter start and stop times in the following format:

  • YYYY-MM-DD:HH:mm

This part is pretty easy: the entered date can be converted to a Unix timestamp for easy comparison and then passed along to another function for the actual filtering.
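Here is a minimal sketch of that conversion, assuming GNU date (the function name and example times are illustrative):

# Convert "YYYY-MM-DD:HH:mm" to epoch seconds (GNU date assumed).
function to_timestamp() {
  local input=$1            # e.g. 2015-10-15:04:34
  local day=${input%%:*}    # 2015-10-15
  local time=${input#*:}    # 04:34
  date -d "${day} ${time}" +%s
}

START_TIMESTAMP=$(to_timestamp "2015-10-15:04:34")
END_TIMESTAMP=$(to_timestamp "2015-10-15:05:23")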

I first built a filter using awk, but found that the version of awk on my local machine is more feature-rich than the mawk available on the platform. Most crucially, mawk lacks the time functions that would enable the following:

awk -v starttime="$STARTTIME" -v endtime="$ENDTIME" '
BEGIN{
  m = split("Jan|Feb|Mar|Apr|May|Jun|Jul|Aug|Sep|Oct|Nov|Dec", d, "|")
  for(o=1; o<=m; o++) {
    months[d[o]] = sprintf("%02d", o)
  }
}
{
  ts = $4                  # $4 holds "[dd/Mon/yyyy:HH:MM:SS" in a combined-format access log
  gsub(/\[/, "", ts)       # strip the leading "["
  split(ts, hm, ":")       # hm[1] = dd/Mon/yyyy, hm[2] = HH, hm[3] = mm
  split(hm[1], dmy, "/")
  date = (dmy[3] " " months[dmy[2]] " " dmy[1] " " hm[2] " " hm[3] " 0")
  logtime = mktime(date)
  if (starttime <= logtime && logtime <= endtime) print $0
}'

Since it’s not likely that gawk will be available any time soon on the platform, the next alternative I considered was Perl. After some deliberation, I came up with the Perl one-liner below (here wrapped in a bash function):

function time_filter() {
  PERL_COLUMN=${1:-3}
  echo "perl -MPOSIX -sane '
    {
      @months = split(\",\", \"Jan,Feb,Mar,Apr,May,Jun,Jul,Aug,Sep,Oct,Nov,Dec\");
      \$i = 0;
      foreach \$mon (@months) {
        \$m{\$mon} = \$i;
        \$i++
      };
      @hm = split(\":\", \$F[$PERL_COLUMN]);
      @dmy = split(\"/\", \$hm[0]);
      \$dmy[0] =~ s/\[//;
      \$logtime = mktime(0, \$hm[2], \$hm[1], \$dmy[0], \$m{\$dmy[1]}, \$dmy[2]-1900);
      if (\$startTime <= \$logtime && \$logtime <= \$endTime) {
        print
      }
    };' -- -startTime=${START_TIMESTAMP} -endTime=${END_TIMESTAMP}"
}

To start off, the function takes an optional argument for the position of the date field in a log line. Next, the POSIX module is loaded along with a few Perl options (-s for switches, -a for autosplit, -n for the implicit input loop, -e to supply the script). Diving into the script logic: a hash mapping 3-letter month abbreviations to month numbers is built, the date/time value from the log line is parsed into pieces, and a timestamp is created via mktime. Lastly, if the converted log time falls between the specified start and end times, the full log line is printed. startTime and endTime are fed into the Perl script from the corresponding bash variables. There is a little bit of escaping for the double quotation marks and dollar signs, but nothing beyond what’s required to run this locally.
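For local testing, the emitted command can be evaluated in place. Here’s a quick sketch, reusing the hypothetical to_timestamp helper from earlier:

START_TIMESTAMP=$(to_timestamp "2015-10-15:04:34")
END_TIMESTAMP=$(to_timestamp "2015-10-15:05:23")
eval "cat access.log | $(time_filter)"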

Next up, I needed the ability to execute this remotely over ssh. I initially attempted to add even more escaping so the command could be passed directly to ssh. This proved quite challenging, and greatly impacted the human-friendliness of the code. Thankfully, there is a nice alternative - base64 encoding. This approach gets a bad rap as a common hacker technique, but I can attest that it works wonders (it’s not the tool but the intent!).

Here’s a sample implementation:

function ssh_cmd() {
  echo "ssh -q -t -F ${HOME}/.ssh/config \
    -o StrictHostKeyChecking=no \
    -o UserKnownHostsFile=/dev/null \
    -o PreferredAuthentications=publickey \
    ${1}"
}

TIME_FILTER=$(time_filter)
REMOTE_CMD=$(echo "cat access.log | ${TIME_FILTER} >> /path/to/filtered/access.log" \
  | base64)
$(ssh_cmd server-id) "echo ${REMOTE_CMD} | base64 -d | sudo bash" < /dev/null

First, we define a general-purpose ssh command. Then the Perl logic is loaded into a variable and concatenated into a full remote command, which is base64-encoded. Lastly, we assemble the full ssh invocation, echoing the base64-encoded logic on the remote side, where it is decoded and piped into sudo bash.

There are alternatives to this approach such as using rsync to pass a file with the above Perl script to a remote server ahead of remote execution, but I really like the simplicity that’s achievable with base64.
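For reference, the rsync route might look roughly like this (the script name and remote path are placeholders for illustration):

# Ship a standalone copy of the filter script, then invoke it remotely.
rsync -az time_filter.pl server-id:/tmp/time_filter.pl
$(ssh_cmd server-id) "perl /tmp/time_filter.pl access.log" < /dev/null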

The last six months have been very full with the arrival of our first baby and all the prep work and new responsibilities that go with being a new parent. Here and there I’ve managed to squeeze in little hobby projects. Much to my astonishment, I also won the !!Con attendance lottery(!!), and had an amazing few days in NYC.

!!Con

!!Con was an extremely fun conference, and I consider myself so lucky to have won the attendance lottery. The topics were diverse and quirky, but consistently deep and informative. Enough cannot be said for the specialness of !!Con and the focus on inclusivity, accessibility, and openness. I was going to try to highlight a few presentations, but reviewing the conference program I’m reminded of so many awesome talks. So once this year’s videos are available, maybe a random !!Con video selector will be in order :)

Nodejs Twitter Bots

Before the baby arrived, I feverishly built up a fleet of Twitter bots that pull notable bird sightings from the eBird API and post them to Twitter. This has been a fun project with some interesting challenges.

Below is a list of source repos for the bird bots:

The bots are built on Nodejs backed by Redis, all in Docker containers on a single Digital Ocean droplet. Each bot is associated with a particular state in the US. I started out naively running a couple of bots as concurrent Nodejs containers. This worked great for a while, but as the bot population increased, system stability took a nose dive. I considered a hardware upsize, but since this is a hobby project I opted to reduce the memory requirements of the fleet instead.

I noticed that some of the bots were more active, and would store a greater amount of data. My first idea was to slim down the data for some of the active bots by reducing the window of time they’ll consider a sighting as valid. This helped a bit, but didn’t get to the heart of the issue - lots of concurrent Nodejs instances consuming all the available memory.

The really big gain came from moving from persistent to ephemeral Nodejs containers. I started out with the Nodejs app keeping track of the time interval between queries to the eBird API, which meant every bot kept a long-running Nodejs process crowding up memory. I could have stayed with this model and rebuilt the app to run multiple bots per Nodejs instance, but there is a simpler approach.

Rather than managing time intervals with Nodejs, the whole system can be run from crontab at a much lower cost. Following is an example cronjob that will fire up an image, query eBird, process the data, post to Twitter, update Redis, log errors to a persistent file on the host, and then remove the container on exit:

*/30 * * * * sudo /usr/bin/docker run --name nbirds-id --rm=true -v /data/nbirds/ID:/var/log/nbirds --link redis:redis nbirds:id

The script completes in 1-2 seconds, which means there’s now a lot of idle time on the server, and headroom for a whole lot more bots! One remaining item to address with continued scaling will be splitting Redis out to a separate server instance, as the data will eventually outgrow the available memory even after slimming the tracked time range for each bot.
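When that time comes, the cron entry could point each bot at a remote Redis rather than a linked local container. A hypothetical sketch (it assumes the bot reads its connection details from a REDIS_URL environment variable, which may not match the actual nbirds configuration):

*/30 * * * * sudo /usr/bin/docker run --name nbirds-id --rm=true -v /data/nbirds/ID:/var/log/nbirds -e REDIS_URL=redis://redis.example.com:6379 nbirds:id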

Provisioning for Productivity

Another recent project is a Development Base container image. The idea is that the image holds all of my favorite tools and configurations for developing software, so starting up a new project on an unfamiliar machine is extremely fast (assuming Docker is installed). I also recently started using Boxen, which facilitates automated provisioning of a Mac. At first, I was dismissive of Boxen compared to the speed of deploying containers, but digging into it a bit brought me around to a new perspective. Given the current landscape of provisioning options, Boxen is a great resource for getting a Mac into a desired base state with apps and configuration. That said, while it may be excellent for big teams or small elite teams, I wouldn’t recommend Boxen where there is a lack of dedicated resources for maintenance and/or mentorship.
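Coming back to the Development Base image, here’s a sketch of the workflow I’m after (the image name below is a placeholder, not a published image):

# Drop into a disposable dev environment with the current project mounted.
docker run -it --rm -v "$(pwd)":/workspace -w /workspace my-dev-base bash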

Tying Boxen back to Docker, you’ll want to have a look at the following pull request to use boot2docker on OSX:

Docker Drupal Onbuild

Another project I’ve been piecing together is an onbuild image based on the official Drupal Docker image. ‘Onbuild’ has become a standard variant for many official images but doesn’t yet exist for Drupal. Beyond relying on drush site-install, another good option would be to work up a code volume share between the host and guest machines, but there are as-yet unresolved issues with that approach.
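For context, the onbuild pattern itself is small. Here is a rough sketch of how a Drupal flavor might be assembled (the ONBUILD step is illustrative, not the contents of any official image):

# Build a hypothetical drupal:onbuild image that copies the downstream
# project's code into the document root when a child image is built.
cat > Dockerfile <<'EOF'
FROM drupal
ONBUILD COPY . /var/www/html
EOF
docker build -t drupal:onbuild .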

That’s all for this installment!

Hello, Hubot.

I’ve written previously about deploying Hubot on Docker, deploying patched Hubot scripts, and benchmarking mass inserts with a Redis Docker container. In this post, I’ll cover how to link a Hubot Docker container to a Redis Docker container to equip Hubot with persistent memory.

As an overview we’re going to:

  1. Spin up a Redis Docker container with a host directory mounted as a data volume
  2. Spin up a linked Hubot Docker container that will use Redis as a persistent brain

For my most recent post on Redis mass inserts, I created a basic Redis Docker image that satisfies all of the requirements to be used as a Hubot Redis brain. We’ll bring this up now:

$ docker run --name redis -v /host/dir:/data -d nhoag/redis
3fc0b9888d5428f01ad13a26d1390904d58b9a34a76ce9965469694307781e3b
$ docker ps -a
CONTAINER ID        IMAGE                COMMAND                CREATED             STATUS              PORTS               NAMES
3fc0b9888d54        nhoag/redis:latest   "redis-server /etc/r   8 seconds ago       Up 7 seconds        6379/tcp            redis

In the above docker run, the Redis container is named ‘redis’, and a directory from the host is mounted as a volume in the Docker container. By mounting a host directory as a volume on the guest, we can retain Redis backup data through a container failure or reprovision. You can get a sense for this by adding a file on the host, editing it from the guest, and viewing the change back on the host:

$ echo "Host" > /host/dir/test.txt
$ docker exec -it redis sh -c 'echo "Guest" > /data/test.txt'
$ cat /host/dir/test.txt
Guest

Next up, we need a suitable Hubot Docker image. I previously assembled a Hubot Docker image that almost meets our requirements. As stated on the Hubot Redis Brain page on NPM:

hubot-redis-brain requires a redis server to work. It uses the REDIS_URL environment variable for determining where to connect to. The default is on localhost, port 6379 (ie the redis default).

The following attributes can be set using the REDIS_URL

  • authentication
  • hostname
  • port
  • key prefix

For example, export REDIS_URL=redis://passwd@192.168.0.1:16379/prefix would authenticate with password, connecting to 192.168.0.1 on port 16379, and store data using the prefix:storage key.

Let’s spin up the old Hubot image without any modifications to scout out what needs to change. I’m using the same build process outlined in my previous post, A Dockerized and Slack-integrated Hubot, where I’ve defined a base Hubot image into which I’ll sprinkle some additional configuration in order to connect to various services:

$ git clone git@github.com:nhoag/bot-cfg.git && cd bot-cfg
# Add credentials to ./Dockerfile
$ docker build -t="my-bot" .
$ docker run -d -p 45678:8080 --name bot --link redis:redis my-bot
$ docker exec -it bot env | grep "^REDIS_PORT"
REDIS_PORT=tcp://172.1.2.3:6379
REDIS_PORT_6379_TCP=tcp://172.1.2.3:6379
REDIS_PORT_6379_TCP_ADDR=172.1.2.3
REDIS_PORT_6379_TCP_PORT=6379
REDIS_PORT_6379_TCP_PROTO=tcp

The above environment variables give us several ways to define a connection to the Redis container, but the easiest is REDIS_PORT, since it has everything we need and can be used as-is. With one new line added to the bot repo (which gets pulled into the Docker image defined here), we have a Hubot that can automatically talk to Redis on start-up.

Here is the addition to bin/hubot for reference:

export REDIS_URL="${REDIS_PORT}"

After rebuilding the base Hubot image and my-bot, we now have a suitable Hubot Docker image to auto-connect to our running Redis container.

Let’s spin up the updated Hubot:

# Don't forget to rebuild the my-bot image from the updated Hubot
$ docker run -d -p 45678:8080 --name bot --link redis:redis my-bot

To verify that Hubot is connected, let’s attach to the running Redis container and review with redis-cli:

$ docker exec -it redis redis-cli -h 3fc0b9888d54
3fc0b9888d54:6379> SCAN 0
1) "0"
2) 1) "hubot:storage"

Success!!

Since Hubot ships with a Redis brain by default, I decided to use this as an opportunity to learn some Redis. While reading through the Redis documentation, I came across Redis Mass Insertion, which sparked an odd curiosity (twinkle twinkle). The crux of Redis Mass Insertion is a recommendation to write large data sets to a Redis instance using the Redis protocol with redis-cli --pipe rather than pushing data through a Redis client. The benefits are maximized throughput, better assurance of data consistency, and a nice validation message:

All data transferred. Waiting for the last reply...
Last reply received from server.
errors: 0, replies: 1000

The Redis Mass Insertion documentation includes a couple of short code snippets for generating test data and example commands for pushing data to Redis. From these snippets, I cobbled together a ruby script to generate an arbitrary number of K/V pairs to STDOUT:

#!/usr/bin/ruby

def int_check(val)
  pass = Integer(val) rescue nil
  if pass
    val.to_i
  else
    STDERR.puts "Argument must be an integer."
    exit
  end
end

def gen_redis_proto(*cmd)
  proto = ""
  proto << "*"+cmd.length.to_s+"\r\n"
  cmd.each{|arg|
    proto << "$"+arg.to_s.bytesize.to_s+"\r\n"
    proto << arg.to_s+"\r\n"
  }
  proto
end

def generate_data(val)
  (0...val).each do |n|
    STDOUT.write(gen_redis_proto("SET", "Key#{n}", "Value#{n}"))
  end
end

generate_data(int_check(ARGV[0]))

The above script can be called as ruby redis-pipe.rb 10000000 >> ./proto.txt to generate a file containing ten million key:value pairs.

From here I figured it might be fun to do a few benchmarks of redis-cli --pipe versus netcat HOST PORT, as well as protocol versus flat commands. I created a bash one-liner to generate the same data set from above as a flat list of Redis SET commands without the extra protocol markup:

i=0 ; while [[ ${i} -lt 10000000 ]] ; do echo " SET Key${i} Value${i}" ; i=$((i + 1)) ; done >> flat.txt

Here’s how the resulting files look:

$ du -a *.txt
274M  flat.txt
464M  proto.txt

$ head -7 proto.txt
*3
$3
SET
$4
Key0
$6
Value0

$ head -1 flat.txt
 SET Key0 Value0

With data in hand, we just need a Redis instance to test against. I set up an Automated Build through Docker Hub with the current latest Redis version. I then deployed this container locally (OSX) via boot2docker: docker pull nhoag/redis && docker run --name redis -p 6379 -d nhoag/redis. Next I installed Redis locally with brew install redis to facilitate accessing the Redis container.

As a small test, we can connect to the container and SET and GET. But first we need the connection specs for the Redis container:

$ docker ps -a
CONTAINER ID        IMAGE                COMMAND                CREATED             STATUS              PORTS                     NAMES
ca48d4ff024e        nhoag/redis:latest   "redis-server /etc/r   2 seconds ago       Up 1 seconds        0.0.0.0:49156->6379/tcp   redis

$ boot2docker ip
192.3.4.5

Using the above information, we can connect with Redis as follows:

redis-cli -h 192.3.4.5 -p 49156
192.3.4.5:49156> SET a b
OK
192.3.4.5:49156> GET a
"b"
192.3.4.5:49156> FLUSHDB
OK
192.3.4.5:49156>

It works! On to mass inserts. As you can see above, I opted to pre-generate data to standardize the insertion process. This means we can run inserts as follows:

# Redis Protocol
$ cat proto.txt | redis-cli --pipe -h 192.3.4.5 -p 49156 > /dev/null
$ cat proto.txt | nc 192.3.4.5 49156 > /dev/null
$ cat proto.txt | socat - TCP:192.3.4.5:49156 > /dev/null

# Flat Commands
$ cat flat.txt | redis-cli --pipe -h 192.3.4.5 -p 49156 > /dev/null
$ cat flat.txt | nc 192.3.4.5 49156 > /dev/null
$ cat flat.txt | socat - TCP:192.3.4.5:49156 > /dev/null

Rinse and repeat after each iteration:

  1. redis-cli -h 192.3.4.5 -p 49156
  2. DBSIZE - should be 10,000,000
  3. FLUSHDB

I introduced socat into the equation because my version of netcat doesn’t auto-recognize EOF. Some versions of netcat have -c or -q 0, but not mine :( This means netcat will hang after the data has been fully processed until it’s manually told to stop. socat, by contrast, hangs up on EOF by default, which is attractive because it allows simple benchmarking with time. But notice that I haven’t included any time statistics. As you’ll see, I found a better alternative to time, and kept the socat data since it was already in the mix.
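For reference, that simple timing approach would have looked like this (bash’s time keyword covers the whole pipeline):

time cat proto.txt | socat - TCP:192.3.4.5:49156 > /dev/null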

There is a very fun project for monitoring Redis called redis-stat. Using redis-stat --server=8282 192.3.4.5:49156 1, we get updates every second from the command line as well as in the browser at localhost:8282.

[Image: redis-stat command line output]

[Image: redis-stat browser view]

When Commands per Second and CPU Usage drop, and when Memory Usage levels off, we know it’s safe to shut down netcat. In the command line output we also get a rich dataset about how the insert performed that can be further parsed and analyzed. And the browser display provides a nice high-level overview.

In addition to redis-stat, I set up an ssh session running htop for an added lens into performance. This turned out to be very helpful in cases where the VM would unexpectedly hit capacity and start swapping, queuing, and backgrounding tasks. This didn’t happen often, but caused a massive slowdown for inserts.

The below data is from “clean” runs where the above-mentioned tipping point did not occur. Of course it would be better to run these inserts hundreds of times and aggregate the results. The data presented below is a semi-curated set of what seem to be typical responses for each insert method.

To generate the below data, I started with the raw redis-stat cli output. I parsed all of the rows that show insert activity, and then removed the first and last rows since these were typically inconsistent with the rest of the data set. Here is an example of generating an average for inserts per millisecond from a prepared data-set:

$ cat stats.txt |
    tail -n +2 |          # Remove the first line
    sed '$d' |            # Remove the last line
    awk '{print $9}' |    # Print the cmd/s column
    tr -d k |             # Strip the 'k' suffix
    awk '{ sum += $1 } END { if (NR > 0) print sum / NR }'
161.148

161.148k commands/s = 161,148 inserts/s ≈ 161 inserts/ms

The ‘Agg. Inserts/ms’ column below is simply the total insert count divided by wall time, e.g. 10,000,000 inserts / 63,000 ms ≈ 159 inserts/ms for netcat with the Redis protocol.

Redis Protocol

Command     Time (s)   Agg. Inserts/ms   Avg. Inserts/ms
netcat      63         159               161
redis-cli   57         175               169
socat       62         161               160

Flat Redis Commands

Command     Time (s)   Agg. Inserts/ms   Avg. Inserts/ms
netcat      60         167               161
redis-cli   66         152               147
socat       66         152               148

  1. redis-cli --pipe with the Redis protocol shows a slight performance edge
  2. netcat with flat commands was the runner-up, and netcat with the Redis protocol was only slightly slower
  3. socat with the Redis protocol was comparable to netcat
  4. socat and redis-cli --pipe with flat commands were slower

TLDR: Use redis-cli --pipe with the Redis protocol for mass inserts and save on the order of 10+ minutes per billion K/V pairs ;)

In deploying Hubot for the first time, you may encounter the following error:

ERROR ReferenceError: fillAddress is not defined
  at TextListener.callback (/path/to/bot/node_modules/hubot-maps/src/maps.coffee:58:16, <js>:57:18)
  ...

At the time of this writing, running a grep in the Hubot Maps source code shows a single instance of the function and no function definition:

grep -rn fillAddress .
./src/maps.coffee:58:    location = fillAddress(msg.match[3])

Stepping back a level to grep all of Hubot and scripts yields the same result as above.

Running a Google search for hubot maps fillAddress gives one promising hit. Looking at this code, we can see that fillAddress() is actually defined:

  fillAddress = (address) ->
    if (address.match(/borderlands/i))
      return '1109 Pebblewood Way, San Mateo, CA'
    else if (address.match(/hhh/i))
      return '516 Chesterton Ave, Belmont, CA'
    else if (address.match(/airbnb/i))
      return '888 Brannan St, San Francisco, CA'

    return address

But do we need this function defined, or do we need to remove the reference? All the function does is provide a way to alias particular addresses to a human-friendly keyword. So by typing hubot map me hhh, we should actually get back the coordinates for ‘516 Chesterton Ave, Belmont, CA’. This is a nice idea, but definitely not necessary for my purposes.

Back to Hubot Maps, there is an open pull request that addresses this issue by removing the remaining fillAddress() reference that’s causing the error (while adding support for Japanese characters). This PR hasn’t yet been merged into Hubot Maps, but we can still benefit from the fix.

Here’s one way to deploy a patched script to Hubot:

  1. Fork hubot-maps
  2. Deploy the patch to your fork
  3. Tag your fork (e.g. 0.0.1p) - git tag 0.0.1p && git push --tags
  4. Reference your forked version of hubot-maps in package.json:
...
    "hubot-maps": "https://github.com/USERNAME/hubot-maps/archive/0.0.1p.tar.gz",
...

Now when you run npm install, npm will pull in the patched script.

And subsequently firing up Hubot and asking for a map shows success:

./bin/hubot
# No error! \o/

Hubot map me Boston