Recently I’ve been working with PHP applications in varying states of modernity/antiquity, and very regularly have to switch the current PHP version to accommodate the PHP flavors of the day. Homebrew’s unlink/link commands are super simple, and at this point all I want is to be able to switch the PHP version with a single command. With a little bit of searching, I found a script that comes pretty close to what I need:

I made a fork, and then removed Apache stuff and unused code, converted conditionals to guard clauses, ran the remainder through shellcheck, and addressed a couple of minor consistency/grammar issues to achieve the following result:
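At its core, the switch boils down to Homebrew’s unlink/link pair. A minimal sketch of that mechanic (the version formula names are illustrative; this is not the full script):

$ brew unlink php@8.2                      # deactivate the currently linked PHP
$ brew link --force --overwrite php@7.4    # activate the desired version
$ php -v                                   # confirm the switch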

Since Hubot ships with a Redis Brain by default, I decided to use this as an opportunity to learn some Redis. While reading through the Redis documentation, I came across Redis Mass Insertion, which sparked an odd curiosity (twinkle twinkle). The crux of Redis Mass Insertion is a recommendation to write large data sets to a Redis instance using the Redis protocol with redis-cli --pipe rather than pushing data through a Redis client. The benefits are maximized throughput, better assurance of data consistency, and a nice validation message:

All data transferred. Waiting for the last reply...
Last reply received from server.
errors: 0, replies: 1000

The Redis Mass Insertion documentation includes a couple of short code snippets for generating test data and example commands for pushing data to Redis. From these snippets, I cobbled together a ruby script to generate an arbitrary number of K/V pairs to STDOUT:

#!/usr/bin/ruby

# Validate that the CLI argument is an integer; bail out otherwise.
def int_check(val)
  pass = Integer(val) rescue nil
  if pass
    val.to_i
  else
    STDERR.puts "Argument must be an integer."
    exit 1
  end
end

# Frame a command in the Redis protocol: "*<arg count>\r\n" followed by
# "$<byte length>\r\n<arg>\r\n" for each argument.
def gen_redis_proto(*cmd)
  proto = ""
  proto << "*"+cmd.length.to_s+"\r\n"
  cmd.each{|arg|
    proto << "$"+arg.to_s.bytesize.to_s+"\r\n"
    proto << arg.to_s+"\r\n"
  }
  proto
end

# Write the requested number of SET commands to STDOUT.
def generate_data(val)
  (0...val).each do |n|
    STDOUT.write(gen_redis_proto("SET", "Key#{n}", "Value#{n}"))
  end
end

generate_data(int_check(ARGV[0]))

The above script can be called as ruby redis-pipe.rb 10000000 >> ./proto.txt to generate a file containing ten million key/value pairs.

From here I figured it might be fun to do a few benchmarks of redis-cli --pipe versus netcat HOST PORT, as well as protocol versus flat commands. I created a bash one-liner to generate the same data set from above as a flat list of Redis SET commands without the extra protocol markup:

i=0 ; while [[ ${i} -lt 10000000 ]] ; do echo " SET Key${i} Value${i}" ; i=$((i + 1)) ; done >> flat.txt

Here’s how the resulting files look:

$ du -a *.txt
274M  flat.txt
464M  proto.txt

$ head -7 proto.txt
*3
$3
SET
$4
Key0
$6
Value0

$ head -1 flat.txt
 SET Key0 Value0

With data in hand, we just need a Redis instance to test against. I set up an Automated Build through Docker Hub with the latest Redis version, then deployed the container locally (OS X) via boot2docker: docker pull nhoag/redis && docker run --name redis -p 6379 -d nhoag/redis. Next I installed Redis locally with brew install redis to facilitate accessing the Redis container.

As a small test, we can connect to the container and SET and GET. But first we need the connection specs for the Redis container:

$ docker ps -a
CONTAINER ID        IMAGE                COMMAND                CREATED             STATUS              PORTS                     NAMES
ca48d4ff024e        nhoag/redis:latest   "redis-server /etc/r   2 seconds ago       Up 1 seconds        0.0.0.0:49156->6379/tcp   redis

$ boot2docker ip
192.3.4.5

Using the above information, we can connect with Redis as follows:

redis-cli -h 192.3.4.5 -p 49156
192.3.4.5:49156> SET a b
OK
192.3.4.5:49156> GET a
"b"
192.3.4.5:49156> FLUSHDB
OK
192.3.4.5:49156>

It works! On to mass inserts. As you can see above, I opted to pre-generate data to standardize the insertion process. This means we can run inserts as follows:

# Redis Protocol
$ cat proto.txt | redis-cli --pipe -h 192.3.4.5 -p 49156 > /dev/null
$ cat proto.txt | nc 192.3.4.5 49156 > /dev/null
$ cat proto.txt | socat - TCP:192.3.4.5:49156 > /dev/null

# Flat Commands
$ cat flat.txt | redis-cli --pipe -h 192.3.4.5 -p 49156 > /dev/null
$ cat flat.txt | nc 192.3.4.5 49156 > /dev/null
$ cat flat.txt | socat - TCP:192.3.4.5:49156 > /dev/null

Rinse and repeat after each iteration:

  1. redis-cli -h 192.3.4.5 -p 49156
  2. DBSIZE - should be 10,000,000
  3. FLUSHDB
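Since redis-cli also accepts a command as trailing arguments, the check-and-reset can be done non-interactively as well; a minimal sketch, using the same host and port as above:

$ redis-cli -h 192.3.4.5 -p 49156 DBSIZE
(integer) 10000000
$ redis-cli -h 192.3.4.5 -p 49156 FLUSHDB
OK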

I introduced socat into the equation because my version of netcat doesn’t auto-recognize EOF. Some versions of netcat have -c or -q0, but not mine :( This means netcat will hang after the data has been fully processed until it’s manually told to stop. socat will automatically hang up on EOF by default, which is attractive as it allows simple benchmarking with time. But notice I haven’t included any time statistics. As you’ll see, I found a better alternative to time, and then kept the socat data since it was already in the mix.
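For reference, the simple time-based approach that socat makes possible looks something like the following (shown only as a sketch; it isn’t the source of the numbers below):

$ time (cat proto.txt | socat - TCP:192.3.4.5:49156 > /dev/null)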

There is a very fun project for monitoring Redis called redis-stat. Using redis-stat --server=8282 192.3.4.5:49156 1, we get updates every second from the command line as well as in the browser at localhost:8282.

[Screenshot: redis-stat command line output]

[Screenshot: redis-stat browser view]

When Commands per Second and CPU Usage drop, and when Memory Usage levels off, we know it’s safe to shut down netcat. In the command line output we also get a rich dataset about how the insert performed that can be further parsed and analyzed. And the browser display provides a nice high-level overview.

In addition to redis-stat, I set up an ssh session running htop for an added lens into performance. This turned out to be very helpful in cases where the VM would unexpectedly hit capacity and start swapping, queuing, and backgrounding tasks. This didn’t happen often, but caused a massive slowdown for inserts.

The below data is from “clean” runs where the above-mentioned tipping point did not occur. It would of course be better to run these inserts hundreds of times and aggregate the results; what follows is a semi-curated set of what seem to be typical results for each insert method.

To generate the below data, I started with the raw redis-stat CLI output. I parsed out all of the rows that show insert activity, and then removed the first and last rows since these were typically inconsistent with the rest of the data set. Here is an example of generating an average for inserts per millisecond from a prepared data set:

$ cat stats.txt |
    tail -n +2 |          # Remove the first line
    sed '$d' |            # Remove the last line
    awk '{print $9}' |    # Print the cmd/s column
    tr -d k |             # Strip the trailing 'k'
    awk '{ sum += $1 } END { if (NR > 0) print sum / NR }'
161.148

161.148k inserts/s = 161,148 inserts/s ÷ 1,000 ms/s ≈ 161 inserts/ms

Redis Protocol

| Command   | Time (s) | Agg. Inserts/ms | Avg. Inserts/ms |
|-----------|----------|-----------------|-----------------|
| netcat    | 63       | 159             | 161             |
| redis-cli | 57       | 175             | 169             |
| socat     | 62       | 161             | 160             |

Flat Redis Commands

| Command   | Time (s) | Agg. Inserts/ms | Avg. Inserts/ms |
|-----------|----------|-----------------|-----------------|
| netcat    | 60       | 167             | 161             |
| redis-cli | 66       | 152             | 147             |
| socat     | 66       | 152             | 148             |

A few observations:

  1. redis-cli --pipe with the Redis protocol shows a slight performance edge
  2. netcat with flat commands was the runner-up, and netcat with the Redis protocol was only slightly slower
  3. socat with the Redis protocol was comparable to netcat
  4. socat and redis-cli --pipe with flat commands were the slowest

TLDR: Use redis-cli --pipe with the Redis protocol for mass inserts and save on the order of 10+ minutes per billion K/V pairs (at ~169 inserts/ms a billion pairs takes roughly 99 minutes, versus roughly 113 minutes at ~147 inserts/ms) ;)

Part of the process of migrating new customers to Acquia Hosting involves adding (or verifying the presence of) three Drupal modules: acquia_connector, fast_404, and memcache.

Manual?! Aw, shucks…

Verifying, adding, and committing these modules manually generally takes about five to ten minutes and can be error-prone. I don’t usually stand a site up for this task, but just clone the repo locally, download the modules and move them into place with rsync. This means I can’t lean on Drupal to make the right decisions for me. Mistakes are not a huge deal at this phase, but can add many minutes to an otherwise quick task (assuming we actually catch the mistake!). Mistakes might include adding D7 modules to a D6 site, putting modules in the wrong location, or adding a slightly older version of a module (perhaps with a known security flaw!). Once a mistake has been introduced, we now have to verify the mistake, maybe perform an interactive Git rebase on bad commits, and generally do more work.

In order to ease some of the human error factor of the above scenario, and since this is repetitive and script-worthy, I decided to cobble together a bash script to automate the process. Now the whole task is much less error-prone, and takes all of 5-10 seconds to complete!

The Brainstorm

Below is the basic plan I brainstormed for how I initially thought the script should operate:

get drupal version from prompt
check if acquia_connector, fast_404, memcache already exist in the repo
check contrib modules path - contrib|community
download modules that don't exist and move into place
git add and commit each downloaded module individually
git push to origin

You’ll notice that not all of the above was actually implemented/needed, but it gave a good starting point for setting up the basic mechanics of the script, and served as an anchor when I needed to reset my focus.

Gotta drush That

To simplify the process of downloading and adding the latest version of each module for the correct version of Drupal core, I decided to lean on drush, and particularly drush’s ability to build out a codebase via the make command.

A few important points:

  • in Drupal 6/7, shared contributed modules are generally located at ‘sites/all/modules[/contrib]’
  • drush make receives build instructions via a make file
  • since each project is evaluated individually, we need a make file for each project
  • since make files are static, we need a different set of make files for each version of Drupal, and for each contrib module path

Looking back through the repo history, you’ll see that my initial approach was to generate static make files for each Drupal version, project, and project path. You’ll also see that I included a secondary script to generate a new set of make files for those rare times when a codebase is using a contrib path such as ‘sites/all/modules/community’ or other. Fortunately, there is a better way!

A Better Way

In bash, we can define dynamic make files as heredocs. By making this shift, I was able to trim 12+ static make files along with the secondary bash script down to two heredocs:

function makefile() {
  if [[ $3 == 'modules' ]]; then
    cat <<EOF
core = $1.x
api = 2

; Modules
projects[] = $2
EOF
  else
    cat <<EOF
core = $1.x
api = 2

; Modules
projects[$2][subdir] = $3
EOF
  fi
}

In order to support the shift to heredocs, I also had to convert the drush command from referencing static files to processing make files via STDIN. Thanks to this comment, I ended up with the following:

for i in "${DIFF[@]}"; do
  makefile $VERSION $i $CONTRIB \
    | $DRUSH make --no-core -y php://stdin
done
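For example, makefile 7 fast_404 contrib pipes the following make file into drush:

core = 7.x
api = 2

; Modules
projects[fast_404][subdir] = contrib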

And with that, we have a powerful and dynamic bash script that will save lots of time, and can be easily expanded or improved to handle additional modules and use cases. I also set the repo up to be a collection of helpful scripts, and I very much look forward to automating away additional complexities.

I made a bunch of big changes to the GoAccess shell plugin, including improved argument handling, greater agnosticism, and more configurable time filtering. I converted the plugin from a sourced script to a regular bash script. This meant the script components had to be reordered to run sequentially, but it also makes the script more portable and easier to fire up.

The script options now support both short and long forms, courtesy of Stack Overflow. The log file location can be designated with either -l or --location, and the same is true of each option. I also made the options more intuitive to use by ensuring that flags such as --report and --open can be called without having to pass an associated extra value such as ‘1’ or ‘yes’.
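The pattern boils down to a while/case loop over the arguments. A minimal sketch, not the plugin’s exact code (the short forms for --report and --open are assumptions):

while [[ $# -gt 0 ]]; do
  case "$1" in
    -l|--location) LOCATION="$2" ; shift 2 ;;   # option that takes a value
    -r|--report)   REPORT=true   ; shift   ;;   # flag, no extra value needed
    -o|--open)     OPEN=true     ; shift   ;;
    *) echo "Unknown option: $1" >&2 ; exit 1 ;;
  esac
done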

The time filter defaults to showing results for the past hour, but this can be altered in various ways. You could specify that you want results from 3 hours ago with -b3. By default, the end value is ‘now’, but this can also be customized by passing -d10M to specify that results should span a 10-minute period following the start time. Time units and integer values are parsed with regex and sed, respectively.
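As a rough illustration of that parsing (hypothetical variable names, not the plugin’s actual implementation), splitting a value like 10M into its integer and unit parts might look like:

value="10M"
amount=$(echo "$value" | sed 's/[^0-9]//g')      # integer portion -> 10
unit=$(echo "$value" | grep -oE '[A-Za-z]+')     # unit portion    -> M
case "$unit" in
  M) seconds=$((amount * 60))   ;;  # minutes
  *) seconds=$((amount * 3600)) ;;  # treat anything else as hours
esac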

Many of the big changes were made over the course of several hours at a location where I was unable to iteratively test. It was a bit scary to diverge so far from a working copy of the script, but in the end I think it allowed me to be more adventurous with the direction of the script. The subsequent debugging wasn’t as involved as I had anticipated.

The remaining TODOs are to add support for parallel remote connections, compression support for gzipped log files, and support for filtering by arbitrary UTC (HH:MM) time values.

When analyzing data sets and performing repetitive (and tedious) tasks, piping streams in the terminal can save a ridiculous amount of manual labor and time. Following are a few examples that I use in diagnosing and resolving various types of problems with web applications.

Assuming log entries in the following format:

1.1.1.1 - - [24/Aug/2013:12:55:30 +0000] "GET / HTTP/1.0" 200 2227 "-" "USER AGENT STRING" vhost=vhost.example.com host=example.com hosting_site=docroot request_time=5640229

### Number of Requests

``` bash Number of requests in a given half hour in 10-minute increments
$ for i in {0..2} ; do echo "-- 10:$i""x UTC --" ; grep -c "2013:10:$i" access.log ; echo -e "\n" ; done
```

__Results:__
<pre>
-- 10:0x UTC --
494

-- 10:1x UTC --
458

-- 10:2x UTC --
446
</pre>

### HTTP Response Codes

``` bash Sort HTTP response codes in a given half hour in 10-minute increments
$ for i in {0..2} ; do echo "-- 10:$i""x UTC --" ; grep "2013:10:$i" access.log | cut -d\" -f3 | awk '{print $1}' | sort | uniq -c | sort -nr ; echo -e "\n" ; done
```

__Results:__
<pre>
-- 10:0x UTC --
 337 200
 108 302
  31 304
  11 301
   7 404

-- 10:1x UTC --
 283 200
 110 302
  39 301
  21 304
   5 403

-- 10:2x UTC --
 280 200
 116 302
  46 301
   2 404
   2 403
</pre>

### Most Requested URIs

``` bash Most-requested URIs in a given half hour in 10-minute increments
for i in {0..2} ; do echo "-- 10:$i""x UTC --" ; grep "2013:10:$i" access.log | cut -d\" -f2 | awk '{print $2}' | cut -d\? -f1 | sort | uniq -c | sort -nr | head -n 5 ; echo -e "\n" ; done
```

__Results:__
<pre>
-- 10:0x UTC --
  64 /example/one
  48 /example/two
  32 /example/three
  16 /example/four
   9 /example/five

-- 10:1x UTC --
  62 /example/two
  55 /example/three
  30 /example/six
  27 /example/one
  20 /example/four

-- 10:2x UTC --
  58 /example/one
  34 /example/three
  33 /search
  31 /rss.xml
  27 /example/two
</pre>

### Most Active IPs

``` bash Most active IPs in a given half hour in 10-minute increments
for i in {0..2} ; do echo "-- 10:$i""x UTC --" ; grep "2013:10:$i" access.log | awk '{print $1}' | sort | uniq -c | sort -nr | head -n 5 ; echo -e "\n" ; done
```

__Results:__
<pre>
-- 10:0x UTC --
 145 1.1.1.1
  73 2.2.2.2
  29 3.3.3.3
  25 4.4.4.4
  13 5.5.5.5

-- 10:1x UTC --
 153 1.1.1.1
  76 3.3.3.3
  32 5.5.5.5
  29 2.2.2.2
  18 4.4.4.4

-- 10:2x UTC --
 131 2.2.2.2
  61 4.4.4.4
  38 5.5.5.5
  34 3.3.3.3
  10 1.1.1.1
</pre>