I've written previously about trimming the fat on bloated Git repositories. Here I'll present a convenient method for listing the largest objects in a human-friendly format.

The first step in finding the largest objects is to list all objects. We can do this with verify-pack:

$ git verify-pack -v .git/objects/pack/pack-{HASH}.idx
8306276cde4f3cff8cbe5598fd98bf10146da262 commit 254 170 9725
4677b7c11924cefa62393f0e3e7db6c06787815e tree   30 63 9895 1 08ce30d6550bed7725c399026d91fce24c86a79f
5062bde76952965ff5c473a7de0ae102b4d2c9f3 tree   1122 944 9958
1c1ef555c77ee527c95ca093f251313a6418c158 blob   10 19 10902
non delta: 15175 objects
chain length = 1: 1672 objects
chain length = 31: 10 objects
chain length = 32: 4 objects
.git/objects/pack/pack-d59d9ffc33fbbf297076d5ab7abc07ce2cd8eae0.pack: ok

The above is a highly curated result from an actual repo. Here are column IDs for reference:

SHA-1 type size size-in-packfile offset-in-packfile depth base-SHA-1

What we care about are columns 1 and 3, corresponding to SHA-1 object ID and size in bytes. We can get the info we want for the 20 largest objects by adding a few pipes:

$ git verify-pack -v .git/objects/pack/pack-{HASH}.idx \
  | sort -k 3 -rn \     # sort descending by the size column
  | head -20 \          # return the first 20 items from the sorted list
  | awk '{print $1,$3}' # return columns 1 and 3 for each row
67c7d98775171c7e91aafac8e9905ec204194c30 881498661
447ed6a08656ef9e7047189523d7907bed891ce4 881494950
078659b9e1aed95600fe046871dbb2ab385e093d 46903069
a78bb70f7d351bd3789859bb2e047a6f01430e7f 37732234
432c2dad0b7869c6df11876c0fe9f478c15fb261 30695043

The next step is typically to run git rev-list and to grep for specific hashes from above:

$ git rev-list --objects --all \
  | grep 67c7d98775171c7e91aafac8e9905ec204194c30
67c7d98775171c7e91aafac8e9905ec204194c30 path/to/archive.tar.gz

Performing this next step manually is repetitive and tedious. xargs could be employed, but for longer lists of hashes and large repos this would involve a lot of extra overhead, since the full rev-list would be processed once per hash.
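For reference, the per-hash approach might look like the following sketch (my own naming; hashes.txt is a hypothetical file holding one hash ID per line):

```shell
# Hypothetical xargs approach: one full rev-list traversal per hash.
# "$1" names a file of candidate hash IDs, one per line.
list_paths_for_hashes() {
  xargs -I{} sh -c 'git rev-list --objects --all | grep {}' < "$1"
}
```

Each invocation of the inner shell re-walks the entire history, which is exactly the overhead we want to avoid.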

One way to speed this up AND eliminate the manual repetition is to construct a single grep regex containing all of the hash IDs we want, so that we can process them all with a single call to rev-list. This means we'll need variables to track hashes, file sizes, and file paths.

Let's start with data from verify-pack:

HASHES_SIZES=$(git verify-pack -v .git/objects/pack/pack-*.idx \
  | sort -k 3 -rn \
  | head -20 \
  | awk '{print $1,$3}' \
  | sort)

Nothing much new here, but you might notice a couple of new features:

  • Using a wildcard for the idx file(s)
  • Sorting the final result by hash ID (you'll see why in a bit)

Now to put the hashes in a form that we can pass to grep:

HASHES=$(echo "$HASHES_SIZES" \
  | awk '{printf $1"|"}' \
  | sed 's/|$//')

This gives us a string of pipe-separated hashes like so:

  • hash01|hash02|…|hashN

Which we can use to get a list of files from rev-list in one go:

HASHES_FILES=$(git rev-list --objects --all \
  | \grep -E "($HASHES)" \
  | sort)

Here again we're sorting the result by hash ID. This facilitates the final step, which is to assemble the gathered data together into a human-friendly format:

paste <(echo "$HASHES_SIZES") <(echo "$HASHES_FILES") \
  | sort -k 2 -rn \
  | awk '{
      size=$2; $1="";
      split( "B KB MB GB" , v ); s=1;
      while( size>1024 ){
        size/=1024; s++
      } print int(size) v[s]"\t"$0
    }' \
  | column -ts $'\t'

We start by merging the data from the 'SIZES' and 'FILES' variables. Then we re-sort by file size before converting the size field to a human-friendly format with awk.

The final result is a simple list of files preceded by size:

44MB     docroot/images/video1.wmv
35MB     docroot/images/video1.mp4
29MB     docroot/images/video2.wmv
7MB      docroot/images/video3.wmv
3MB      docroot/images/image1.JPG
3MB      docroot/images/image2.JPG

Overall this is still an expensive operation, but most of the cost is associated with the initial verify-pack. Otherwise this is easy to use and to read.
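Putting the pieces together, the whole flow can be wrapped in one function (a sketch with my own naming; the Gist remains the canonical version):

```shell
# Sketch assembling the steps above: print the 20 largest packed objects
# with human-friendly sizes and file paths.
largest_git_objects() {
  local HASHES_SIZES HASHES HASHES_FILES
  HASHES_SIZES=$(git verify-pack -v .git/objects/pack/pack-*.idx \
    | sort -k 3 -rn | head -20 | awk '{print $1,$3}' | sort)
  HASHES=$(echo "$HASHES_SIZES" | awk '{printf $1"|"}' | sed 's/|$//')
  HASHES_FILES=$(git rev-list --objects --all | \grep -E "($HASHES)" | sort)
  paste <(echo "$HASHES_SIZES") <(echo "$HASHES_FILES") \
    | sort -k 2 -rn \
    | awk '{
        size=$2; $1="";
        split( "B KB MB GB" , v ); s=1;
        while( size>1024 ){
          size/=1024; s++
        } print int(size) v[s]"\t"$0
      }' \
    | column -ts $'\t'
}
```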

The complete script is available at the following Gist:

I routinely inspect live SSL certificates to validate domain coverage. While working directly with openssl is not necessarily painful, I wanted a tool that could be used to return a simple list of domains without the extra output and without the terminal hang. Below is an example of retrieving the SSL cert for google.com with openssl s_client:

$ openssl s_client -showcerts -connect google.com:443
depth=2 /C=US/O=GeoTrust Inc./CN=GeoTrust Global CA
verify error:num=20:unable to get local issuer certificate
verify return:0
Certificate chain
 0 s:/C=US/ST=California/L=Mountain View/O=Google Inc/CN=*.google.com
     i:/C=US/O=Google Inc/CN=Google Internet Authority G2
 1 s:/C=US/O=Google Inc/CN=Google Internet Authority G2
     i:/C=US/O=GeoTrust Inc./CN=GeoTrust Global CA
 2 s:/C=US/O=GeoTrust Inc./CN=GeoTrust Global CA
     i:/C=US/O=Equifax/OU=Equifax Secure Certificate Authority
Server certificate
subject=/C=US/ST=California/L=Mountain View/O=Google Inc/CN=*.google.com
issuer=/C=US/O=Google Inc/CN=Google Internet Authority G2
No client certificate CA names sent
SSL handshake has read 4021 bytes and written 456 bytes
New, TLSv1/SSLv3, Cipher is AES128-SHA
Server public key is 2048 bit
Secure Renegotiation IS supported
Compression: NONE
Expansion: NONE
        Protocol  : TLSv1
        Cipher    : AES128-SHA
        Session-ID: C01EDE7DB78D6343DA2344F1FB9DDC962F39E92F8A1A98216E75F5C0F2285A2E
        Master-Key: F94716D18028CB0245582EECE632F956CF0B0FA208F6F4D66DD1BB78FF4B19AA6CA064E21811671D0082E33C1E6ECCB6
        Key-Arg   : None
        Start Time: 1445825624
        Timeout   : 300 (sec)
        Verify return code: 0 (ok)
# Terminal hangs here until CTRL-C

First, we can get rid of the terminal hang by updating the command as follows:

$ openssl s_client -showcerts -connect google.com:443 </dev/null

Next, we can reveal the certificate contents in human-readable form by piping to x509:

$ openssl s_client -showcerts -connect google.com:443 </dev/null \
  | openssl x509 -text
depth=2 /C=US/O=GeoTrust Inc./CN=GeoTrust Global CA
verify error:num=20:unable to get local issuer certificate
verify return:0
                Version: 3 (0x2)
                Serial Number:
                Signature Algorithm: sha256WithRSAEncryption
                Issuer: C=US, O=Google Inc, CN=Google Internet Authority G2
                        Not Before: Oct 28 18:49:32 2015 GMT
                        Not After : Jan 26 00:00:00 2016 GMT
                Subject: C=US, ST=California, L=Mountain View, O=Google Inc, CN=*.google.com
                Subject Public Key Info:
                        Public Key Algorithm: rsaEncryption
                        RSA Public Key: (2048 bit)
                                Modulus (2048 bit):
                                Exponent: 65537 (0x10001)
                X509v3 extensions:
                        X509v3 Extended Key Usage:
                                TLS Web Server Authentication, TLS Web Client Authentication
                        X509v3 Subject Alternative Name:
                                DNS:*.google.com, DNS:*.android.com, DNS:*.appengine.google.com, DNS:*.cloud.google.com, DNS:*.google-analytics.com, DNS:*.google.ca, DNS:*.google.cl, DNS:*.google.co.in, DNS:*.google.co.jp, DNS:*.google.co.uk, DNS:*.google.com.ar, DNS:*.google.com.au, DNS:*.google.com.br, DNS:*.google.com.co, DNS:*.google.com.mx, DNS:*.google.com.tr, DNS:*.google.com.vn, DNS:*.google.de, DNS:*.google.es, DNS:*.google.fr, DNS:*.google.hu, DNS:*.google.it, DNS:*.google.nl, DNS:*.google.pl, DNS:*.google.pt, DNS:*.googleadapis.com, DNS:*.googleapis.cn, DNS:*.googlecommerce.com, DNS:*.googlevideo.com, DNS:*.gstatic.cn, DNS:*.gstatic.com, DNS:*.gvt1.com, DNS:*.gvt2.com, DNS:*.metric.gstatic.com, DNS:*.urchin.com, DNS:*.url.google.com, DNS:*.youtube-nocookie.com, DNS:*.youtube.com, DNS:*.youtubeeducation.com, DNS:*.ytimg.com, DNS:android.clients.google.com, DNS:android.com, DNS:g.co, DNS:goo.gl, DNS:google-analytics.com, DNS:google.com, DNS:googlecommerce.com, DNS:urchin.com, DNS:youtu.be, DNS:youtube.com, DNS:youtubeeducation.com
                        Authority Information Access:
                                CA Issuers - URI:http://pki.google.com/GIAG2.crt
                                OCSP - URI:http://clients1.google.com/ocsp

                        X509v3 Subject Key Identifier:
                        X509v3 Basic Constraints: critical
                        X509v3 Authority Key Identifier:

                        X509v3 Certificate Policies:

                        X509v3 CRL Distribution Points:

        Signature Algorithm: sha256WithRSAEncryption

Now that we have the cert contents, the next thing we can do is filter out the list of domains:

$ openssl s_client -showcerts -connect google.com:443 </dev/null \
  | openssl x509 -text \
  | grep DNS \
  | tr ',' '\n' \
  | cut -d':' -f2
depth=2 /C=US/O=GeoTrust Inc./CN=GeoTrust Global CA
verify error:num=20:unable to get local issuer certificate
verify return:0

At this point (or maybe much earlier), you might notice that there is some extra data printed to STDERR:

depth=2 /C=US/O=GeoTrust Inc./CN=GeoTrust Global CA
verify error:num=20:unable to get local issuer certificate
verify return:0

The above is indicating that the cert is not properly validated against a known root certificate. Let's get that validated!

First let's get the CA cert bundle from curl.haxx.se:

$ curl -O http://curl.haxx.se/ca/cacert.pem

Now we can reference the CA with s_client:

$ openssl s_client -showcerts \
    -CAfile /path/to/cacert.pem \
    -connect google.com:443 </dev/null \
  | openssl x509 -text \
  | grep DNS \
  | tr ',' '\n' \
  | cut -d':' -f2
depth=3 /C=US/O=Equifax/OU=Equifax Secure Certificate Authority
verify return:1
depth=2 /C=US/O=GeoTrust Inc./CN=GeoTrust Global CA
verify return:1
depth=1 /C=US/O=Google Inc/CN=Google Internet Authority G2
verify return:1
depth=0 /C=US/ST=California/L=Mountain View/O=Google Inc/CN=*.google.com
verify return:1

Now we're properly validated, but the validation data is still printing to STDERR. Of course we can keep this as-is for proper assurance, but for now let's get rid of the clutter by sending STDERR to STDOUT.

$ openssl s_client -showcerts \
    -CAfile /path/to/cacert.pem -connect google.com:443 </dev/null 2>&1 \
  | openssl x509 -text \
  | grep DNS \
  | tr ',' '\n' \
  | cut -d':' -f2

Lastly, we can add an enhancement that alerts on failed validation and otherwise provides the list of domains:

LIVE_CERT=$(openssl s_client -showcerts \
  -CAfile /path/to/cacert.pem -connect google.com:443 </dev/null 2>&1)
VALIDATION=$(echo "$LIVE_CERT" | grep -c -E '^verify error')
(( VALIDATION > 0 )) && >&2 echo 'failed cert validation' \
  || echo "$LIVE_CERT" \
    | openssl x509 -text \
    | grep DNS \
    | tr ',' '\n' \
    | cut -d':' -f2

Wrap that in a bash function or executable with an easy-to-remember name, and you've got a very convenient tool for listing the domains covered by an SSL cert.
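Such a wrapper might look like the following sketch (`cert_domains` and the cacert.pem location are my own naming; the hostname is taken as an argument):

```shell
# Hypothetical wrapper; assumes the CA bundle was saved to ~/certs/cacert.pem
function cert_domains() {
  openssl s_client -showcerts \
    -CAfile "${HOME}/certs/cacert.pem" \
    -connect "${1}:443" </dev/null 2>&1 \
    | openssl x509 -text \
    | grep DNS \
    | tr ',' '\n' \
    | cut -d':' -f2
}
```

Then `cert_domains google.com` prints one covered domain per line.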

Over the past few months, I've found and created a bunch of fun new scripting tricks and tools. Below are two somewhat related items that helped to unlock new possibilities for me in remote bash automation. The first is a Perl one-liner that allows filtering access logs by start and end times. The second is a method for executing complex commands remotely via ssh without all those intricate escapes.

As context for the Perl log filter, my team at work regularly performs Load Test Analyses. A customer will run a Load Test, provide us with the start and end times for the test window, and then we run a comprehensive analysis to determine whether any problems were recorded. Prior to automation, we would develop grep time filters with Regular Expressions (e.g. grep -E '15/Oct/2015:0(4:(3[4-9]|[4-5][0-9])|5:([0-1][0-9]|2[0-3]))'), and then run a bunch of analyses on the results. This is not so bad, but it involves training in Regular Expressions, is prone to human error, and requires some careful thought.

In developing a more human/beginner-friendly solution, I wanted people to be able to enter start and stop times in the following format:

  • YYYY-MM-DD:HH:mm

This part is pretty easy, since the entered date can be converted to a timestamp and then passed along to another function for the comparison/computation.
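As a sketch, that conversion might look like the following (assumes GNU date; `to_timestamp` is my own naming, and BSD date would need `-j -f` instead):

```shell
# Convert a "YYYY-MM-DD:HH:mm" input to an epoch timestamp (GNU date).
# The first colon separates the date from the time, so swap it for a space.
to_timestamp() {
  date -d "$(echo "$1" | sed 's/:/ /')" +%s
}

START_TIMESTAMP=$(to_timestamp "2015-10-15:04:34")
END_TIMESTAMP=$(to_timestamp "2015-10-15:05:23")
```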

I first built a filter using awk, but found that the version of awk on my local machine is more feature-rich than the mawk available on the platform. Most crucially, mawk lacks the time functions (such as mktime) that would enable the following:

awk -v starttime=$STARTTIME -v endtime=$ENDTIME '{
  m = split("Jan|Feb|Mar|Apr|May|Jun|Jul|Aug|Sep|Oct|Nov|Dec", d, "|")
  for(o=1; o<=m; o++) {
    months[d[o]] = sprintf("%02d", o)
  }
  gsub(/\[/, "", $0)
  split($0, hm, ":")
  split(hm[1], dmy, "/")
  date = (dmy[3] " " months[dmy[2]] " " dmy[1] " " hm[2] " " hm[3] " 0")
  logtime = mktime(date)
  if (starttime <= logtime && logtime <= endtime) print $0
}'

Since it's not likely that gawk will be available any time soon on the platform, the next alternative I considered was Perl. With some deliberation, I came up with the Perl one-liner below (here wrapped in a bash function):

function time_filter() {
  echo "perl -MPOSIX -sane '
      @months = split(\",\", \"Jan,Feb,Mar,Apr,May,Jun,Jul,Aug,Sep,Oct,Nov,Dec\");
      \$i = 0;
      foreach \$mon (@months) {
        \$m{\$mon} = \$i++;
      }
      @hm = split(\":\", \$F[$PERL_COLUMN]);
      @dmy = split(\"/\", \$hm[0]);
      \$dmy[0] =~ s/\[//;
      \$logtime = mktime(0, \$hm[2], \$hm[1], \$dmy[0], \$m{\$dmy[1]}, \$dmy[2]-1900);
      if (\$startTime <= \$logtime && \$logtime <= \$endTime) {
        print;
      }' -- -startTime=${START_TIMESTAMP} -endTime=${END_TIMESTAMP}"
}

To start off, the function relies on a PERL_COLUMN variable specifying the location of the date string in a log line. Next, the POSIX module is loaded along with Perl options (-s for switch parsing, -a for autosplit, -n for line-by-line looping, -e for inline execution). Diving into the script logic: the 3-letter month abbreviations are built into a lookup hash, and the date/time value from the log line is converted into a form that the mktime function can turn into a timestamp. Lastly, if the converted log time falls between the specified start and end times, the full log line is printed. startTime and endTime are fed into the Perl script from corresponding bash variables. You can see that there is a little bit of escaping for the double quotation marks and dollar signs, but nothing beyond what's required to run this locally.

Next up, I needed the ability to execute this remotely over ssh. I initially attempted to insert additional escapes so the command could be passed directly to ssh. This proved quite challenging and greatly impacted the human-friendliness of the code. Thankfully, there is a nice alternative - base64 encoding. This approach gets a bad rap as a common hacker technique, but I can attest that it works wonders (it's not the tool but the intent!).

Here's a sample implementation:

function ssh_cmd() {
  # ${1} is the target host (e.g. server-id)
  echo "ssh -q -t -F ${HOME}/.ssh/config \
    -o StrictHostKeyChecking=no \
    -o UserKnownHostsFile=/dev/null \
    -o PreferredAuthentications=publickey \
    ${1}"
}

REMOTE_CMD=$(echo "cat access.log | ${TIME_FILTER} >> /path/to/filtered/access.log" \
  | base64)
$(ssh_cmd server-id) "echo ${REMOTE_CMD} | base64 -d | sudo bash" < /dev/null

Firstly, we define a general-purpose ssh command. Then the Perl logic is loaded into a variable, concatenated into a full remote command, and base64-encoded. Lastly, we assemble the full remote ssh invocation by piping the encoded logic through a remote base64 -d into sudo bash.
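The encode/decode round trip is easy to sanity-check locally (a minimal sketch substituting a harmless echo for the log filter, and skipping ssh and sudo):

```shell
# Encode a command string, then decode and execute it, mimicking the
# remote `base64 -d | bash` step.
cmd='echo "hello from base64"'
encoded=$(printf '%s' "$cmd" | base64)
printf '%s' "$encoded" | base64 -d | bash
# -> hello from base64
```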

There are alternatives to this approach such as using rsync to pass a file with the above Perl script to a remote server ahead of remote execution, but I really like the simplicity that's achievable with base64.

The last six months have been very full with the arrival of our first baby and all the prep work and new responsibilities that go with being a new parent. Here and there I've managed to squeeze in little hobby projects. Much to my astonishment, I also won the !!Con attendance lottery(!!), and had an amazing few days in NYC.


!!Con was an extremely fun conference, and I consider myself so lucky to have won the attendance lottery. The topics were diverse and quirky, but consistently deep and informative. Enough cannot be said for the specialness of !!Con and the focus on inclusivity, accessibility, and openness. I was going to try to highlight a few presentations, but reviewing the conference program I'm reminded of so many awesome talks. So once this year's videos are available, maybe a random !!Con video selector will be in order :)

Nodejs Twitter Bots

Before the baby arrived, I feverishly built up a fleet of Twitter bots that pull notable bird sightings from the eBird API and post them to Twitter. This has been a fun project with some interesting challenges.

Below is a list of source repos for the bird bots:

The bots are built on Nodejs backed by Redis, all in Docker containers on a single Digital Ocean droplet. Each bot is associated with a particular state in the US. I started out naively running a couple of bots as concurrent Nodejs containers. This worked great for a while, but as the bot population increased, system stability took a nose dive. I considered a hardware upsize, but since this is a hobby project I opted to reduce the memory requirements of the fleet instead.

I noticed that some of the bots were more active, and would store a greater amount of data. My first idea was to slim down the data for some of the active bots by reducing the window of time they'll consider a sighting as valid. This helped a bit, but didn't get to the heart of the issue - lots of concurrent Nodejs instances consuming all the available memory.

The really big gain came with moving from persistent to ephemeral Nodejs containers. I started out with the Nodejs app keeping track of the time interval between queries to the eBird API. This meant each bot was crowding up the memory space with Nodejs. I could have stayed with this model, and rebuilt the app to run multiple bots per Nodejs instance, but there is a simpler approach.

Rather than managing time intervals with Nodejs, the whole system can be run from crontab at a much lower cost. Following is an example cronjob that will fire up an image, query eBird, process the data, post to Twitter, update Redis, log errors to a persistent file on the host, and then remove the container on exit:

*/30 * * * * sudo /usr/bin/docker run --name nbirds-id --rm=true -v /data/nbirds/ID:/var/log/nbirds --link redis:redis nbirds:id

The script completes in 1-2 seconds, which means there's now a lot of idle time on the server, and headroom for a whole lot more bots! One remaining item to address with continued scaling will be to split Redis out to a separate server instance, as the data will eventually outgrow the available memory even with slimming the tracked time range for bots.

Provisioning for Productivity

Another recent project is a Development Base container image. The idea is that the image will have all of my favorite tools and configurations for developing software, so starting up a new project on an unfamiliar machine will be extremely fast (assuming Docker). I also recently started using Boxen, which facilitates automated provisioning of a Mac computer. At first, I was dismissive of Boxen in comparison to the speed of deploying containers. But in digging into Boxen a bit, I've come around to a new perspective. Given the current landscape of provisioning options, Boxen is a great resource for getting a Mac into a desired base state with apps and configuration. While it may be excellent for big teams or small elite teams, I wouldn't recommend Boxen where there is a lack of dedicated resources for maintenance and/or mentorship.

Tying Boxen back to Docker, you'll want to have a look at the following pull request to use boot2docker on OSX:

Docker Drupal Onbuild

Another project I've been piecing together is an onbuild image based on the official Drupal Docker image. 'Onbuild' has become a standard option for many official images but doesn't yet exist for Drupal. Beyond relying on drush site-install, another good option would be to work up a code volume share between the host and guest machines, but there are as yet unresolved issues with this approach:

That's all for this installment!

Hello, Hubot.

I've written previously about deploying Hubot on Docker, deploying patched Hubot scripts, and benchmarking mass inserts with a Redis Docker container. In this post, I'll cover how to link a Hubot Docker container to a Redis Docker container to equip Hubot with persistent memory.

As an overview we're going to:

  1. Spin up a Redis Docker container with a host directory mounted as a data volume
  2. Spin up a linked Hubot Docker container that will use Redis as a persistent brain

For my most recent post on Redis mass inserts, I created a basic Redis Docker image that satisfies all of the requirements to be used as a Hubot Redis brain. We'll bring this up now:

$ docker run --name redis -v /host/dir:/data -d nhoag/redis
$ docker ps -a
CONTAINER ID        IMAGE                COMMAND                CREATED             STATUS              PORTS               NAMES
3fc0b9888d54        nhoag/redis:latest   "redis-server /etc/r   8 seconds ago       Up 7 seconds        6379/tcp            redis

In the above docker run, the Redis container is named 'redis', and a directory from the host is mounted as a data volume in the container. By mounting a host directory as a volume on the guest, we can retain Redis backup data through a container failure or reprovision. You can get a sense for this by adding a file on the host, editing it from the guest, and viewing the change back on the host:

$ echo "Host" > /host/dir/test.txt
$ docker exec -it redis sh -c 'echo "Guest" > /data/test.txt'
$ cat /host/dir/test.txt
Guest

Next up, we need a suitable Hubot Docker image. I previously assembled a Hubot Docker image that almost meets our requirements. As stated on the Hubot Redis Brain page on NPM:

hubot-redis-brain requires a redis server to work. It uses the REDIS_URL environment variable for determining where to connect to. The default is on localhost, port 6379 (ie the redis default).

The following attributes can be set using the REDIS_URL

  • authentication
  • hostname
  • port
  • key prefix

For example, export REDIS_URL=redis://passwd@ would authenticate with passwd, connect on port 16379, and store data using the prefix:storage key.

Let's spin up the old Hubot image without any modifications to scout out what needs to change. I'm using the same build process outlined in my previous post, A Dockerized and Slack-integrated Hubot, where I've defined a base Hubot image into which I'll sprinkle some additional configuration in order to connect to various services:

$ git clone git@github.com:nhoag/bot-cfg.git && cd bot-cfg
# Add credentials to ./Dockerfile
$ docker build -t="my-bot" .
$ docker run -d -p 45678:8080 --name bot --link redis:redis my-bot
$ docker exec -it bot env | grep "^REDIS_PORT"

From the above environment variables, there are a lot of options for defining a connection to the Redis container, but the easiest option is to use REDIS_PORT since it has everything we need and can be used as-is. With one new line added to the bot repo (which gets pulled into the Docker image defined here), we have a Hubot that can automatically talk to Redis on start-up.

Here is the addition to bin/hubot for reference:


After rebuilding the base Hubot image and my-bot, we now have a suitable Hubot Docker image to auto-connect to our running Redis container.

Let's spin up the updated Hubot:

# Don't forget to rebuild the my-bot image from the updated Hubot
$ docker run -d -p 45678:8080 --name bot --link redis:redis my-bot

To verify that Hubot is connected, let's attach to the running Redis container and review with redis-cli:

$ docker exec -it redis redis-cli -h 3fc0b9888d54
3fc0b9888d54:6379> SCAN 0
1) "0"
2) 1) "hubot:storage"