The CASE of the Conditional Records

I’m working on a change request for a web service tied to an application we support at work, and I ran into one requirement (out of many) which I didn’t immediately know how to resolve. Maybe it was just the way it was worded in the spec, but it had me scratching my head for a good few hours before I got to the heart of the matter.

This post is mostly a reminder to myself to take a step back and work through a problem in a notebook or similar; go back and forth a few times, from different angles, until the puzzle is clear and you can find a way to approach it. Once I worked through it, and figured out exactly what was being asked for, the problem itself turned out to be significantly easier than I had first imagined.

The Problem

I have two linked tables of journey information: one for the journey record, and one holding a record for each leg of that journey. For each leg, there can be up to two records: the planned leg (P) and the actual leg (A). When we return the results of the query, we want a row combining the journey information and the leg information, for every leg. The kicker was that we never want to return a P record for a leg when an A record exists for that leg. But we don’t know whether an A record exists before we run the query.

[Side note – I am massively simplifying the data structure, and some of the problem for clarity and confidentiality]

[code language="text" light="true"]
+------------+--------------+   +------------+------------+----------+
| journey_id | journey_date |   | journey_id | leg_number | leg_type |
+------------+--------------+   +------------+------------+----------+
| 1          | 01/07/2015   |   | 1          | 1          | P        |
| 2          | 01/07/2015   |   | 1          | 1          | A        |
| 3          | 01/07/2015   |   | 1          | 2          | P        |
+------------+--------------+   +------------+------------+----------+
[/code]

So from the above data we want to return the following:

[code language="text" light="true"]
+------------+--------------+------------+----------+
| journey_id | journey_date | leg_number | leg_type |
+------------+--------------+------------+----------+
| 1          | 01/07/2015   | 1          | A        |
| 1          | 01/07/2015   | 2          | P        |
+------------+--------------+------------+----------+
[/code]

And not this, which is what a straightforward JOIN would produce:

[code language="text" light="true"]
+------------+--------------+------------+----------+
| journey_id | journey_date | leg_number | leg_type |
+------------+--------------+------------+----------+
| 1          | 01/07/2015   | 1          | P        |
| 1          | 01/07/2015   | 1          | A        |
| 1          | 01/07/2015   | 2          | P        |
+------------+--------------+------------+----------+
[/code]

Obviously we need a WHERE clause in there, but like I said, we don’t know what records exist for each leg, so the most common clauses aren’t going to give desirable results.
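To see why, using the simplified tables above (this throwaway query is purely for illustration): filtering on leg_type alone always loses rows we need.

[code language="sql"]
-- Keeping only actual legs drops leg 2 of journey 1, which has no A record:
SELECT j.journey_id, j.journey_date, jl.leg_number, jl.leg_type
FROM journies j, journey_legs jl
WHERE j.journey_id = jl.journey_id
AND jl.leg_type = 'A';
-- Keeping only planned legs has the opposite problem: it hides the
-- A record for leg 1 and returns the stale P record instead.
[/code]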

Effectively, we are conditionally returning a record only if a certain other, nearly identical record in the same table does not exist; otherwise we want to return that other record instead.

Additional Constraints

  1. Everything has to come from running a single SQL query against a table/set of tables. We can’t utilise Oracle packages or procedures. The query is embedded in the application and called as a prepared statement (to feed in the “real” filtering criteria later). We can use functions, but should be mindful of the performance implications.
  2. We can’t do any “post-processing” of the data in the web service application code. Well, we could, but it’d be costly and clunky to do in the current framework.
  3. Everything should (obviously) be as performant as possible.

Solution

First Attempt

An infrequently used, but powerful, part of SQL is the CASE construct:

[code language="sql"]
CASE
WHEN foo THEN bar
ELSE baz
END
[/code]

By utilising a CASE statement and some subqueries, I could check if a row was in the actual legs of a journey, or whether it was a planned leg and there was no matching actual leg:

[code language="sql"]
SELECT j.journey_id, j.journey_date,
       jl.leg_number, jl.leg_type
FROM journies j, journey_legs jl
WHERE j.journey_id = jl.journey_id
AND (CASE
       WHEN (jl.journey_id, jl.leg_number, jl.leg_type) IN
            (SELECT journey_id, leg_number, leg_type
             FROM journey_legs
             WHERE journey_id = jl.journey_id
             AND leg_number = jl.leg_number
             AND leg_type = 'A') THEN
         (1)
       WHEN ((jl.journey_id, jl.leg_number) NOT IN
            (SELECT journey_id, leg_number
             FROM journey_legs
             WHERE journey_id = jl.journey_id
             AND leg_number = jl.leg_number
             AND leg_type = 'A') AND jl.leg_type = 'P') THEN
         (1)
       ELSE
         (0)
     END) = 1;

-- Execution time: 1.8 seconds
[/code]

Second Attempt

A nice start: the above query returns what we want, but it’s a bit bloated. Do we need the first subquery at all? No, it turns out:

[code language="sql"]
SELECT j.journey_id, j.journey_date,
       jl.leg_number, jl.leg_type
FROM journies j, journey_legs jl
WHERE j.journey_id = jl.journey_id
AND (CASE
       WHEN (jl.leg_type = 'A') THEN
         (1)
       WHEN ((jl.journey_id, jl.leg_number) NOT IN
            (SELECT journey_id, leg_number
             FROM journey_legs
             WHERE journey_id = jl.journey_id
             AND leg_number = jl.leg_number
             AND leg_type = 'A') AND jl.leg_type = 'P') THEN
         (1)
       ELSE
         (0)
     END) = 1;

-- Execution time: 1 second
[/code]

Final Attempt

Awesome, we are getting the data we want, and the query is running at an acceptable pace for the number of records we have… but it’s kinda clunky looking. We want to try to simplify it if possible, so it’s easier for anyone who comes after us to maintain. Is it possible to rewrite this to take out the CASE and make it more readable? Of course:

[code language="sql"]
SELECT j.journey_id, j.journey_date,
       jl.leg_number, jl.leg_type
FROM journies j, journey_legs jl
WHERE j.journey_id = jl.journey_id
AND (jl.leg_type = 'A' OR
     (jl.leg_type = 'P' AND
      (jl.journey_id, jl.leg_number) NOT IN
      (SELECT journey_id, leg_number
       FROM journey_legs
       WHERE journey_id = jl.journey_id
       AND leg_number = jl.leg_number
       AND leg_type = 'A')));

-- Execution time: 0.85 seconds
[/code]

That’s a bit easier to follow for anyone who isn’t familiar with CASE, and it has a bit of hierarchy to it, so you can mostly see at a glance what it’s doing. I don’t think we can get rid of the subquery, but sometimes these things are unavoidable.
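For what it’s worth, the same filter could also be written with NOT EXISTS, which some people find reads more naturally still; a sketch of that idea, untested against the real schema:

[code language="sql"]
SELECT j.journey_id, j.journey_date,
       jl.leg_number, jl.leg_type
FROM journies j, journey_legs jl
WHERE j.journey_id = jl.journey_id
AND (jl.leg_type = 'A' OR
     (jl.leg_type = 'P' AND NOT EXISTS
      (SELECT 1
       FROM journey_legs a
       WHERE a.journey_id = jl.journey_id
       AND a.leg_number = jl.leg_number
       AND a.leg_type = 'A')));
[/code]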

Synchronising GitHub and an Internal Git server

Note: I found this mini How-To while having a clean-up of my GitHub repositories. I figured it would be worth sharing on my blog. Hopefully it is of use to someone. If you want to play around with the steps, but don’t want to use one of your existing projects, you can use this repository.


The Problem

  1. I have my repository hosted on GitHub
  2. I have an internal Git server used for deployments
  3. I want to keep these synchronised using my normal workflow

Getting Started

Both methods I’ll describe need a “bare” version of the GitHub repository on your internal server. This worked best for me:

[code lang=bash]
cd ~/projects/repo-sync-test/
scp -r .git git@internal-server:/path/to/sync.git
[/code]

Here, I’m changing to my local working directory, then using scp to copy the .git folder to the internal server over ssh.

More information and examples can be found in the online Git Book:

4.2 Git on the Server – Getting Git on a Server
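Incidentally, the book’s approach uses git clone --bare rather than copying the .git folder directly; a quick sketch of that, with the paths as placeholders:

[code lang=bash]
# Make a bare copy of the local repository, then ship it to the server
git clone --bare ~/projects/repo-sync-test repo-sync-test.git
scp -r repo-sync-test.git git@internal-server:/path/to/sync.git
[/code]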

Once the internal server version of the repository is ready, we can begin!

The Easy, Safe, But Manual Method:

[code lang=text]
+--------+        +----------+      /------>
| GitHub |        | internal |  -- deploy -->
+--------+        +----------+      \------>
    ^                   ^
    |                   |
    |     +-----+       |
    \-----| ME! |-------/
          +-----+
[/code]

This one I have used before, and it is the least complex. It needs the least setup, but it doesn’t sync the two repositories automatically. Essentially, we are going to add a second Git Remote to the local copy, and push to both servers in our workflow:

In your own local copy of the repository, checked out from GitHub, add a new remote a bit like this:

[code lang=bash]
git remote add internal git@internal-server:/path/to/sync.git
[/code]

This guide on help.github.com has a bit more information about adding Remotes.

You can change the remote name of “internal” to whatever you want. You could also rename the remote which points to GitHub (“origin”) to something else, so it’s clearer where it is pushing to:

[code lang=bash]
git remote rename origin github
[/code]

With your remotes ready, to keep the servers in sync you push to both of them, one after the other:

[code lang=bash]
git push github master
git push internal master
[/code]

  • Pros: Really simple
  • Cons: It’s a little more typing when pushing changes
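As an aside, Git can also attach two push URLs to a single remote, so one command pushes to both servers; a sketch, with both URLs as placeholders:

[code lang=bash]
# Register both servers as push URLs on the "github" remote
git remote set-url --add --push github git@github.com:user/repo-sync-test.git
git remote set-url --add --push github git@internal-server:/path/to/sync.git

# A single push now updates GitHub and the internal server
git push github master
[/code]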

The Automated Way:

[code lang=text]
+--------+         +----------+      /------>
| GitHub | ======> | internal |  -- deploy -->
+--------+         +----------+      \------>
    ^
    |
    |      +-----+
    \------| ME! |
           +-----+
[/code]

The previous method is simple and reliable, but it doesn’t really scale that well. Wouldn’t it be nice if the internal server did the extra work?

The main thing to be aware of with this method is that you wouldn’t be able to push directly to your internal server – if you did, then the changes would be overwritten by the process I’ll describe.
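If you wanted to guard against accidental pushes, one option (my own addition here, not part of the original setup) is a pre-receive hook on the internal repository that rejects them outright; remember to mark the hook executable:

[code lang=bash]
#!/bin/sh
# /path/to/sync.git/hooks/pre-receive -- reject all direct pushes to the mirror
echo "This repository mirrors GitHub and is read-only; push to GitHub instead." >&2
exit 1
[/code]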

Anyway:

One problem I had in setting this up initially is that the local repositories on my PC are cloned from GitHub over SSH, which would require a lot more setup to allow the server to fetch from GitHub without any interaction. So what I did was remove the existing remote, and add a new one pointing to the HTTPS link:

[code lang=bash]
# on the internal server
cd /path/to/repository.git
git remote rm origin
git remote add origin https://github.com/chrismcabz/repo-syncing-test.git
git fetch origin
[/code]

You might not have to do this, but I did, so best to mention it!

At this point, you can test everything is working OK. Create or modify a file in your local copy, and push it to GitHub. On your internal server, do a git fetch origin to sync the change down to the server repository. Now, if you were to try a normal git merge origin at this point, it would fail, because we’re in a “bare” repository with no working tree. If we were to clone the server repository to another machine right now, it would still reflect the previous commit.
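Concretely, the test looks something like this (the file name is just an example):

[code lang=bash]
# on your local machine
echo "a test change" >> testfile1.txt
git commit -am "Test the sync"
git push github master

# on the internal server
cd /path/to/sync.git
git fetch origin
[/code]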

Instead, to see our changes reflected, we can use git reset (I’ve included example output messages):

[code lang=bash]
git reset refs/remotes/origin/master

Unstaged changes after reset:
M LICENSE
M README.md
M testfile1.txt
M testfile2.txt
M testfile3.txt
[/code]

Now if we were to clone the internal server’s repository, it would be fully up to date with the repository on GitHub. Great! But so far it’s still a manual process, so let’s add a cron task to remove the need for human intervention.

In my case, adding a new file to /etc/cron.d/, with the contents below was enough:

[code lang=bash]
*/30 * * * * user cd /path/to/sync.git && git fetch origin && git reset refs/remotes/origin/master > /dev/null
[/code]

What this does is tell cron that every 30 minutes it should run our command as the user user. Stepping through the command, we’re asking to:

  1. cd to our repository
  2. git fetch from GitHub
  3. git reset like we did in our test above, while sending the messages to /dev/null

That should be all we need to do! Our internal server will keep itself up-to-date with our GitHub repository automatically.

  • Pros: It’s automated; only need to push changes to one server.
  • Cons: If someone mistakenly pushes to the internal server, their changes will be overwritten


Setting Up Chef

I just finished setting up Chef, to have a play around with this DevOps stuff I keep hearing about. While Chef is quite well documented, I found myself struggling in places where things weren’t quite clear enough. So naturally, I’m posting how I got myself up and running.

[Note: I haven’t actually done anything with this setup yet, other than get it working.]

Step One: Get A Server

There are two parts to a Chef install: client and server. You can run both on one machine, but given how much Chef slows down my Joyent VM, I’d suggest keeping it off of your day-to-day workstation.

I used my Joyent credit to set up a new Ubuntu 12.04 64-bit server. Chef server only supports Ubuntu or RedHat/CentOS 64-bit. Once the server was provisioned, I followed this 5-minute guide to lock down the server enough for my needs (this being just an experiment and all…)

Step Two: Set the Server FQDN

Once the server is prepared, make sure it has a resolvable, fully qualified domain name before going any further. While the Chef docs do mention this, they do so after the rest of the setup instructions. This was one area I was banging my head against for ages, wondering why the built-in Nginx server wasn’t working.

Setting the hostname on my Joyent VM was a case of running:

[code language="bash"]
$ sudo hostname 'chef.example.com'
$ echo "chef.example.com" | sudo tee /etc/hostname
[/code]

As I wasn’t on the same network as my Chef server, I added a DNS A record to match the server FQDN.
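To check the name resolves before carrying on, something like dig works (the address below is just a placeholder):

[code language="bash"]
$ dig +short chef.example.com
203.0.113.10
[/code]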

Step Three: Install Chef Server

This bit was really easy, probably the easiest part of the whole setup. In short: download the latest Chef Server package for your platform, install the package, run the reconfigure tool. In my case, this was:

[code language="bash"]
$ wget https://opscode-omnibus-packages.s3.amazonaws.com/ubuntu/12.04/x86_64/chef-server_11.0.10-1.ubuntu.12.04_amd64.deb
$ sudo dpkg -i chef-server_11.0.10-1.ubuntu.12.04_amd64.deb
$ sudo chef-server-ctl reconfigure
[/code]

The Chef installer will whirr away, using Chef to set up your new installation automatically. How cool is that?

Step Four: Copy Server Certificates to Your Workstation

This wasn’t mentioned anywhere I could see, but I figured it out from some snippets written around the web. To successfully setup the Chef client, you need some security certificates from your new server. I used SCP from my local PC:

[code language="bash"]
$ scp user@chef.example.com:/etc/chef-server/admin.pem ~/tmp/
$ scp user@chef.example.com:/etc/chef-server/chef-validator.pem ~/tmp/
[/code]

If you find you don’t have permission to copy directly from their default location, SSH to the server and sudo copy them to somewhere you can.
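Something along these lines should do it (the user name and destination are assumptions):

[code language="bash"]
$ ssh user@chef.example.com
$ sudo cp /etc/chef-server/admin.pem /etc/chef-server/chef-validator.pem /home/user/
$ sudo chown user: /home/user/*.pem
[/code]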

Step Five: Install the Chef Client

Now we should be armed with everything we need to install the client tools. I’m using the Debian-derived Crunchbang, but any *NIX-based OS should be roughly the same as below. If you’re on Windows, I’m afraid you’re on your own.

Run the “omnibus installer” for Chef:

[code language="bash"]
$ curl -L https://www.opscode.com/chef/install.sh | sudo bash
[/code]

Create a .chef folder in your home directory, and add the certificates copied from the server:

[code language="bash"]
$ mkdir ~/.chef
$ cp ~/tmp/*.pem ~/.chef
[/code]

Configure Knife (the main Chef CLI utility):

[code language="bash"]
$ knife configure --initial
WARNING: No knife configuration file found
Where should I put the config file? [/home/chris/.chef/knife.rb] /home/chris/.chef/knife.rb
Please enter the chef server URL: [https://localhost:443] https://chef.example.com:443
Please enter a name for the new user: [chris]
Please enter the existing admin name: [admin]
Please enter the location of the existing admin's private key: [/etc/chef-server/admin.pem] /home/chris/.chef/admin.pem
Please enter the validation clientname: [chef-validator]
Please enter the location of the validation key: [/etc/chef-server/chef-validator.pem] /home/chris/.chef/chef-validator.pem
Please enter the path to a chef repository (or leave blank):
Creating initial API user…
Please enter a password for the new user:
Created user[chris]
Configuration file written to /home/chris/.chef/knife.rb
[/code]

Test Knife by listing all users:

[code language="bash"]
$ knife user list
admin
chris
[/code]

Wrap Up

That’s it! You now have a working Chef installation. Or at least, I do. Steps two and four are the steps I had to hunt out and piece together myself to get Chef up and running. Everything else is more or less as documented.

All that’s left to do now is figure out how to use Chef!

Coda 2 Coming May 24th

Coda 2 coming May 24th – it’s about time! Coda is one of the reasons I keep coming back to the Mac platform. It’s one of those apps that is a joy to use. Espresso overtook it for a while, but this new version looks like a very worthy upgrade – check out the Coda Tour video.