Rickshaw Graph

Search for interesting online datasets and one of the first you'll encounter is the Death Row data from the Texas Department of Criminal Justice. There are several datasets listed on the Death Row Information page, including Executed Offenders.

To convert this information into an infographic, the first step is converting the HTML table to a format that's easy to load and parse. Toward this end I used the Ruby gem, Wombat, to scrape the Executed Offenders table and convert it to JSON.

The following code builds a hash from the Executed Offenders table:

table = Wombat.crawl do
  base_url "http://www.tdcj.state.tx.us"
  path "/death_row/dr_executed_offenders.html"

  headline xpath: "//h1"

  people 'xpath=//table/tbody/tr', :iterator do
    execution "xpath=./td[1]"
    first_name "xpath=./td[5]"
    last_name "xpath=./td[4]"
    race "xpath=./td[9]"
    age "xpath=./td[7]"
    date "xpath=./td[8]"
    county "xpath=./td[10]"
    statement "xpath=./td[3]/a/@href"
  end
end

The above code iterates through all of the table rows (tr) and grabs the data cells (td) we're interested in.

As far as parsing goes, we could stop with the above code, but at the time I wrote this script I was also thinking about analyzing final statements. Since last statements live on separate pages linked from the Executed Offenders table, the next code section scrapes each last statement and replaces the statement link from the table with the statement text itself.

data = {}
table['people'].each do |x|
  last_words = Wombat.crawl do
    base_url "http://www.tdcj.state.tx.us"
    path "/death_row/#{x['statement']}"
    statement xpath: "//p[text()='Last Statement:']/following-sibling::p"
  end
  x['statement'] = last_words['statement']
  x['gender'] = 'male'
  unless x['execution'].nil?
    data[x['execution']] = x
    data[x['execution']].delete('execution')
  end
end

At the tail end of the above code block is a bit of cleanup to remove duplicate data and to slightly shrink the hash.

The next code block writes the hash to disk as JSON:

File.open('dr.json', 'w') do |f|
  f.puts data.to_json
end
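
If you want a quick look at the structure that ends up in dr.json, a one-liner like the following works. The output shown is illustrative only: field values are placeholders, and (per the code above) records are keyed by execution number with the 'execution' field itself removed.

ruby -rjson -rpp -e 'pp JSON.parse(File.read("dr.json")).first'
# => ["500",
#      {"first_name"=>"...", "last_name"=>"...", "race"=>"...", "age"=>"...",
#       "date"=>"...", "county"=>"...", "statement"=>"...", "gender"=>"male"}]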

Now that we have the data in an easily digestible format, the next step is to generate a display. I used the Rickshaw JavaScript toolkit, a D3.js wrapper, to convert the data into an infographic.

I repurposed much of the Rickshaw Interactive Real-Time Data example. The crux of this project was parsing the JSON data into the format Rickshaw expects.

I used CoffeeScript to define and compile JS assets. A limitation of Rickshaw is an inability to cope with unset values (I initially expected these might be treated as zeros). With this in mind, the first step was to populate every possible x-axis value with zeros to avoid errors.

Below are three functions that initialize all of the data points with zeros:

timeAxis = ->
  time = {}
  for year in [1982..2014]
    time[year] = 0
  time

races = ->
  ['White', 'Black', 'Hispanic', 'Other']

preload = ->
  time = {}
  for t in races()
    time[t] = timeAxis()
  time

The last steps are to add the real data and build up the chart components. I read in the JSON file with jQuery ($.getJSON file, (data) ->) and ran the data through a couple of quick conversions before building the chart:

pop = (data) ->
  struct = preload()
  for k, v of data
    yr = /\d{4}$/.exec v['date']
    struct[v['race']][yr[0]]++
  struct

tally = (arg) ->
  count = {}
  for t in races()
    count[t] = []
  for a, b of arg
    for r, s of b
      z = new Date(r)
      m = z.getTime()/1000
      h = { x: m, y: s }
      count[a].push h
  count

I'll spare you the chart code here since it's fairly lengthy and well documented, but the full code for this project is available here, and the final product can be viewed here. Note that the default chart zoom shows the year '2000' twice. I haven't looked into this much yet, but the correct year (2010) replaces the second '2000' label when zooming in.

Overall, I found Rickshaw to be a fun library with an excellent API. It does have limitations, but it's a good choice for representing time series data. If you need more chart types, see NVD3.js or straight D3.js.

Over the last couple of weekends I converted my blog from Octopress to straight Jekyll (still hosting on S3). There wasn't any particular reason behind the move, but I was curious to know more about the differences between the two platforms, wanted to try out a new theme, and just generally enjoy these types of migrations.

Overall, there aren't many differences between the two platforms. As many have stated before, the major difference is that Octopress comes with more functionality out of the box, but at the cost of increased complexity. Octopress is a great way to get into static sites, but after gaining some experience I really enjoyed paring down and digging into Jekyll.

There were a couple of fun tasks in the conversion, mostly around setting up redirects and deprecating XML feeds.

Redirects

One of the main differences between the old and new versions of my site is the way tags are handled. On the old site they lived at blog/categories/{category}; on the new site they are at tags/#{category}. There are several plugins for generating redirects; in my case I wanted to automate the process and set up redirects for a defined set of paths. The Jekyll Pageless Redirects plugin got me most of the way there.

The Jekyll Pageless Redirects plugin introduced a couple of issues, with the resolution detailed in an issue thread on the project.

Basically, you'll want to apply the changes from that thread, and ensure that paths are specified as follows (note the leading forward slash):

/origin : /destination

With a working redirect implementation, the next step was to define the list of redirects. Here's the one-liner bash script I used:

ls -1 path/to/categories | while read tag ; do echo "/blog/categories/$tag : /tags/#$tag" >> _redirects.yml ; done

The above script lists all of the category directories in the original site and pipes them one-by-one into a while loop, which echoes each value in the desired YAML format and appends the result to the _redirects.yml file.
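
Assuming the old site had tags like 'ruby' and 'jekyll' (hypothetical names here), the generated _redirects.yml ends up with entries along these lines:

/blog/categories/ruby : /tags/#ruby
/blog/categories/jekyll : /tags/#jekyll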

Deprecating Feeds

As with the category redirects, the old site generated an XML feed for each category. These feeds are not included in the new Jekyll site, and I don't see a need for them to continue. Rather than let them silently disappear, it's easy enough to hijack the redirect plugin to perform the small additional task of deprecating each of these feeds. That way most feed readers will understand that the resource is officially gone.

I added the below code to the plugin and was good to go:

# snip

retire_xml = 'atom.xml'

# snip

retire_xml_path = File.join(alias_dir, retire_xml)

# snip

File.open(File.join(fs_path_to_dir, retire_xml), 'w') do |file|
  file.write(xml_template)
end

# snip

(retire_xml_path.split('/').size + 1).times do |sections|
  @site.static_files << PagelessRedirectFile.new(@site, @site.dest, retire_xml_path.split('/')[1, sections + 1].join('/'), '')
end

# snip

def xml_template()
  <<-EOF
<?xml version="1.0"?>
<redirect>
  <newLocation />
</redirect>
  EOF
end

# snip

And with that I pushed up to S3 and am fully on Jekyll, with the small exception that I still use the Octopress gem to generate new posts.

I recently pieced together a bash wrapper script to create, add, and delete an ssh key for temporary use when performing remote tasks over ssh. The process outlined here assumes cURL is available and that the remote service you wish to connect to has API methods for ssh key handling.

Automating the process of generating and deleting a local ssh key is the easy part. Here's one way to create a key:

ssh-keygen -q -b 4096 -t rsa -N "" -f ./scripted.key

Options rundown:

  • -b bit strength -> higher than the default 2048
  • -f filename -> easier to subsequently delete the keypair
  • -N passphrase -> empty string
  • -q quiet mode -> no need to review output
  • -t key type -> specify rsa
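
If you'd like a quick sanity check that the key was generated as expected, ssh-keygen can also print the new key's fingerprint (optional):

ssh-keygen -l -f ./scripted.key.pub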

And now to delete the newly created key-pair:

rm ./scripted.key*

Next we'll set up the API calls to register and delete a public key on remote services, in this case Github and Acquia.

Github

To automate ssh key deployment on Github, you'll first want to generate a personal access token under 'Account settings' > 'Applications'. We'll set a variable to the value of the token for easy re-use:

TOKEN='your-token'

Per the docs, note that the DELETE operation we will eventually employ requires a special admin:public_key permission.

In addition to the token, we'll set another variable to the value of the ssh public key as follows:

PUBKEY=`cat ./scripted.key.pub`

Now we can cURL the Github API using the TOKEN and PUBKEY variables. Since we're setting up for a procedural operation and to reduce the number of network requests, we'll capture the Github API response (which contains the key ID):

RESPONSE=`curl -s -H "Authorization: token ${TOKEN}" \
  -X POST --data-binary "{\"title\":\"nr@blackhole\",\"key\":\"${PUBKEY}\"}" \
  https://api.github.com/user/keys`

And now to extract the key ID:

KEYID=`echo $RESPONSE \
  | grep -o '\"id.*' \
  | grep -o "[0-9]*" \
  | grep -m 1 "[0-9]*"`

Note that the above is more than we strictly need to parse the Github response. With the Acquia example (coming up next), we'll see a good reason for setting up the extraction this way.

Only one step left, but you may want to add a 10-second sleep to the script to give an opportunity to verify that the key was added before it is deleted.
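
A minimal version of that pause might look like this (the 10-second value is arbitrary):

# optional: give yourself time to confirm the key appears in the account's ssh key list
sleep 10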

And now for the delete:

curl -s -H "Authorization: token ${TOKEN}" -X DELETE \
   https://api.github.com/user/keys/${KEYID} \
  -o /dev/null

Here we're sending the result to /dev/null to ensure the script stays quiet.

Acquia

Performing this task with Acquia's Cloud API is much the same, but with a couple of notable differences.

First, we need to set a couple of additional variables:

CLOUDAPI_ID='id'
CLOUDAPI_KEY='key'
DOCROOT="docroot"
CREDS="${CLOUDAPI_ID}:${CLOUDAPI_KEY}"

Variables set, here's the cURL command to add the key:

RESPONSE=`curl -s -u $CREDS \
  -X POST --data-binary "{\"ssh_pub_key\":\"${PUBKEY}\"}" \
  https://cloudapi.acquia.com/v1/sites/"${DOCROOT}"/sshkeys.json?nickname=script`

In this case, we're going to extract two pieces of data from the response: the task ID, to track the status of adding the key, and the key ID (as with the Github example), so that we can later delete the key:

TASKID=`echo $RESPONSE \
  | grep -o '\"id.*' \
  | grep -o "[0-9]*" \
  | grep -m 1 "[0-9]*"`

KEYID=`echo $RESPONSE \
  | grep -o "sshkeyid.*" \
  | grep -o "[0-9]*" \
  | grep -m 1 "[0-9]*"`

This is where the extra bash logic comes in handy: the Acquia response is condensed JSON, whereas the Github response is pretty-printed. Since we don't have things nicely separated into lines, and since we want to minimize dependencies (this is where I'd otherwise recommend jq), the above gives us what we need with fairly low overhead.
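
For comparison, if jq were an acceptable dependency, the Github key ID extraction above could be collapsed to something like the following (assuming the id field sits at the top level of the response, as the grep chain implies):

KEYID=`echo "$RESPONSE" | jq -r '.id'`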

Now to query the task ID so we know when we can start using our key:

STATUS='null'
until [[ $STATUS =~ ^(error|done)$ ]]; do
  STATUS=`curl -s -u $CREDS \
  https://cloudapi.acquia.com/v1/sites/"${DOCROOT}"/tasks/"${TASKID}".json \
  | grep -o 'state.*' \
  | grep -o '[a-z]*' \
  | sed -n 2p`
  echo "ADDING SSH KEY: ${STATUS}"
  sleep 5
done

And finally, here's the delete:

curl -s -u $CREDS -X DELETE \
  https://cloudapi.acquia.com/v1/sites/"${DOCROOT}"/sshkeys/"${KEYID}".json \
  -o /dev/null

For reference, I set up a Gist that contains complete bash scripts for both services covered above.

Enjoy!

After building a bash script to automate Drupal module deployments, I figured it might be worthwhile to convert the script over to Ruby. I decided to spin up the new version as a Ruby gem leveraging the Thor CLI Framework.

Having already worked out many of the mechanics of deploying Drupal contrib modules in the previous bash script, I was able to dive right into coding. I started by fleshing out the command options and then moved into scripting the functionality. Thor makes it really easy to set up the command interface, though formatting long descriptions can be a little tricky.

In building the script, I wanted to keep as much of the logic in Ruby as possible. The result was many opportunities to explore Ruby and make some interesting discoveries. The two areas where I was most tempted to shell out were identifying and downloading the "best" version of a contributed Drupal module (drush), and performing version control activities (Git).

In the first case, Nokogiri was an obvious choice for parsing Drupal contrib XML feeds. Fortunately, drupal.org exposes uniform project feeds in the following format:

http://updates.drupal.org/release-history/{project}/{core-version}
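
As a quick illustration, here's how you might eyeball one of these feeds with curl (the 'views' project and Drupal 7 are used purely as an example):

curl -s http://updates.drupal.org/release-history/views/7.x | head -n 20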

Reviewing several project feeds, it wasn't immediately obvious how to parse a feed to select the "best" release, so I referenced the drush source code for pointers:

function updatexml_best_release_found($releases) {
  // If there are releases found, let's try first to fetch one with no
  // 'version_extra'. Otherwise, use all.

The above comment says it all. In the Ruby script, you can see this basic logic is reproduced in contrib.rb (dl method):

    def dl
      doc = Nokogiri::XML(open(@feed).read)
      releases = {}
      doc.xpath('//releases//release').each do |item|
        if !item.at_xpath('version_extra')
          releases[item.at_xpath('mdhash').content] = item.at_xpath('download_link').content
        end
      end
      if releases.empty?
        doc.xpath('//releases//release').each do |item|
          releases[item.at_xpath('mdhash').content] = item.at_xpath('download_link').content
        end
      end
      return releases.first
    end

For downloads of both XML documents and project archives, I wanted to avoid getting myself (or others) blacklisted by unintentionally DoS'ing drupal.org with lots of requests, so I decided to lean on a small OpenURI extension called open-uri-cached. The way this is implemented is a bit hacky, but it gets the job done for now. For locating cached project archives, you'll see that I replicated a small bit of logic from open-uri-cached to find and extract the archives:

uri = URI.parse(archive)
targz = "/tmp/open-uri-503" + [ @path, uri.host, Digest::SHA1.hexdigest(archive) ].join('/')

Addressing the Git functionality was initially not so straightforward. Following the Git breadcrumbs from Ruby Toolbox, the most obvious place to start is Grit, which "is no longer maintained. Check out rugged." Rugged was initially promising, but in the end failed to yield a working git push. That left ruby-git as the next logical choice. Fortunately, ruby-git did the trick without much fuss:

    def update
      prj_root = Pathname.new(docroot)
      workdir = prj_root.parent.to_s
      project = File.basename(path)

      g = Git.open(workdir)
      g.branch('master').checkout

      changes = []
      g.status.changed.keys.each { |x| changes.push x }
      g.status.deleted.keys.each { |x| changes.push x }
      g.status.untracked.keys.each { |x| changes.push x }

      if changes.empty?
        puts "No changes to commit for #{project}"
      else
        g.add(path, :all=>true)
        g.commit("Adds #{project}")
        g.push
      end
    end

There are many improvements left to be made with this script, but so far I'm very happy with the result. Using classes and objects is a breath of fresh air compared to procedural bash, and having this rolled into a gem makes it very easy to share with the team.

Part of the process of migrating new customers to Acquia Hosting involves adding (or verifying the presence of) three Drupal modules:

  • acquia_connector
  • fast_404
  • memcache

Manual?! Aw, Shucks...

Verifying, adding, and committing these modules manually generally takes about five to ten minutes and can be error-prone. I don't usually stand a site up for this task; I just clone the repo locally, download the modules, and move them into place with rsync. This means I can't lean on Drupal to make the right decisions for me. Mistakes are not a huge deal at this phase, but they can add many minutes to an otherwise quick task (assuming we actually catch the mistake!). Mistakes might include adding D7 modules to a D6 site, putting modules in the wrong location, or adding a slightly older version of a module (perhaps with a known security flaw!). Once a mistake has been introduced, we have to verify it, maybe perform an interactive Git rebase on the bad commits, and generally do more work.

In order to ease some of the human error factor of the above scenario, and since this is repetitive and script-worthy, I decided to cobble together a bash script to automate the process. Now the whole task is much less error-prone, and takes all of 5-10 seconds to complete!

The Brainstorm

Below is the basic plan I brainstormed for how I initially thought the script should operate:

get drupal version from prompt
check if acquia_connector, fast_404, memcache already exist in the repo
check contrib modules path - contrib|community
download modules that don't exist and move into place
git add and commit each downloaded module individually
git push to origin

You'll notice that not all of the above was actually implemented/needed, but it gave a good starting point for setting up the basic mechanics of the script, and served as an anchor when I needed to reset my focus.

Gotta drush That

To simplify the process of downloading and adding the latest version of each module for the correct version of Drupal core, I decided to lean on drush, and particularly drush's ability to build out a codebase via the make command.

A few important points:

  • in Drupal 6/7, shared contributed modules are generally located at 'sites/all/modules[/contrib]'
  • drush make receives build instructions via a make file
  • since each project is evaluated individually, we need a make file for each project
  • since make files are static, we need a different set of make files for each version of Drupal, and for each contrib module path

Looking back through the repo history, you'll see that my initial approach was to generate static make files for each Drupal version, project, and project path. You'll also see that I included a secondary script to generate a new set of make files for those rare times when a codebase uses a contrib path such as 'sites/all/modules/community'. Fortunately, there is a better way!

A Better Way

In bash, we can define dynamic make files as heredocs. By making this shift, I was able to trim 12+ static make files, along with the secondary bash script, down to two heredocs:

function makefile() {
  if [[ $3 == 'modules' ]]; then
    cat <<EOF
core = $1.x
api = 2

; Modules
projects[] = $2
EOF
  else
    cat <<EOF
core = $1.x
api = 2

; Modules
projects[$2][subdir] = $3
EOF
  fi
}
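
For example, calling makefile 7 memcache contrib emits a make file targeting Drupal 7 core with the module placed in a 'contrib' subdirectory:

core = 7.x
api = 2

; Modules
projects[memcache][subdir] = contrib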

In order to support the shift to heredocs, I also had to convert the drush command from referencing static files to processing make files via STDIN. Thanks to this comment, I ended up with the following:

for i in "${DIFF[@]}"; do
  makefile $VERSION $i $CONTRIB \
    | $DRUSH make --no-core -y php://stdin
done

And with that, we have a powerful and dynamic bash script that will save lots of time, and can be easily expanded or improved to handle additional modules and use cases. I also set the repo up to be a collection of helpful scripts, and I very much look forward to automating away additional complexities.