A tale of woe, a tale of SQL woe. This is what the past two days have been all about. At work we have our master database, which contains all the demographic and biographic information that allows us to do our jobs. It tracks donors, constituents, alumni, individuals; technically, everyone.
We have arranged for extended services with an external vendor called iModules, and part of my job is to help ‘bridge the gap’ between these two systems, at least data-wise. So the primary question is: how do we update data on iModules from Millennium, Millennium being our CRM database? With SQL, of course! Silly! That was the albatross I spoke of on Twitter yesterday. Mostly the construction of update scripts isn’t terribly difficult, only time-consuming; it involves a lot of typing and a lot of checking up on column names. Our addrline1 is their Address_1, that sort of thing.
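To give a flavor of that mapping busywork, here is a minimal sketch of the shape of one of those extracts. Everything here except addrline1 and Address_1 is a made-up name for illustration, not our actual schema:

```sql
-- A minimal, hypothetical Millennium-to-iModules field mapping.
-- Only addrline1/Address_1 come from the real systems; the rest
-- of the names are invented for this sketch.
SELECT
    a.addrline1 AS Address_1,   -- our column name on the left, theirs on the right
    a.addrline2 AS Address_2,
    a.city      AS City,
    a.zipcode   AS Postal_Code
FROM address a
WHERE a.preferred = 'Y';        -- only send the preferred address
```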
Before I can send updates there are two attributes that need to be added to various records to mark them as “ripe to be sent to iModules,” and that’s what has had me stuck for the past two days. Our system has two distinct and not clearly compatible ways of extracting data. There is Millennium Internet Reporter, called MIR, which the report-writing people use to extract data from the database, and then there is little old me with Aqua Data Studio. My report-writer coworker, Lisa, handed me a copy of the criteria that MIR uses to decide which people get these attributes added to their records. It’s a pretty straightforward presentation: this from that, never in that, with these values, so on and so forth.

Almost always these queries start with a very simple SELECT block and grow from there, and almost always I end up using JOIN or LEFT OUTER JOIN to collect the right data. Turns out that in this case, JOIN was exactly NOT what I needed. Lisa gave me a magic goal number: the count of records my query, if correct, should pull. That number is 687. When I started I got 287,433. I adjusted the query and dropped to 133,496. Over time I bounced all over, from 115 million all the way down to 20, never really hitting that magic number.

There are a lot of little gotchas that you have to be aware of and code properly. The sense the query depends on is that we want to select people with certain values OR people with no values at all, but exclude people with one specific kind of value after that. I wandered around trying various methods of attack: pulling the criteria out into temporary tables was one, switching all my joins to left outer joins was another (that led to the 115 million, oops), and then I thought I had it, and was really clever, when I enriched my joins with subqueries that used the IN predicate. Even then I couldn’t get below 22,500 records pulled. Remember, the goal number is 687. There were some more little gotchas; for example, I forgot to remove the dead from the list, and fixing that got me down to about 2,700.

Then I started to read about some of these predicates, and I had a passing familiarity with SQL’s three-valued logic. In SQL there is true, there is false, and there is UNKNOWN: 1=1 is true, 1=0 is false, 1=NULL is unknown, and NULL=NULL is also unknown. Turns out my problem was all wadded up with the IN predicate that I had used. IN was inappropriate for this use, as it utterly ignored all the NULL cases, the very ones I wanted selected. Turns out there is a predicate I had never used, called EXISTS. This predicate changes the sense of what is selected, and when I reorganized my query to use EXISTS I went from 2,700 to 1,686. But 1,686 is still greater than my magic goal of 687, so there was something else I wasn’t seeing. I had removed the dead, and the logic looked spot on; as I read from the criteria page that Lisa had given me, it matched spot-for-spot, “bang on” correct. Every time I ran the query it dutifully spit out 1,686 records. So, what the hell was I missing?
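Before I get to what I was missing, here is a boiled-down sketch of the IN-versus-EXISTS business, with made-up table names standing in for the real Millennium schema (people and codes are not our actual tables). The key is that a WHERE clause only keeps rows that evaluate to true, and someone with no code rows at all can never come out true of an IN test:

```sql
-- Illustrative only: "people" and "codes" are stand-ins, not our schema.

-- The IN form can only select people who HAVE a qualifying code row;
-- the "no values at all" folks I wanted simply never match:
SELECT p.person_id
FROM people p
WHERE p.person_id IN (SELECT c.person_id
                      FROM codes c
                      WHERE c.code IN ('ALUM', 'DONOR'));

-- EXISTS changes the sense of the question, so I can ask for people
-- who have one of the wanted codes OR have no codes whatsoever:
SELECT p.person_id
FROM people p
WHERE EXISTS (SELECT 1
              FROM codes c
              WHERE c.person_id = p.person_id
                AND c.code IN ('ALUM', 'DONOR'))
   OR NOT EXISTS (SELECT 1
                  FROM codes c
                  WHERE c.person_id = p.person_id);
```

My real criteria were much hairier than this, but that OR NOT EXISTS branch is the shape of the thing that IN could never express.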
Computers do exactly what you tell them to do, nothing more and nothing less, unless you find a bug. There weren’t any bugs in my way, so it was a failure with my query, somewhere. I listed out all the selected records and started to look at them in the database, seeing if I could spot something in the selected group that shouldn’t be there, something I had missed in my query logic. The first record I brought up was utterly incorrect: the “title bar” had the word “Duplicate” in it, and my query clearly states NOT LIKE ‘%dup%’, so why the hell was it still selecting records with Duplicate in the title bar?

Yeah, case. That’s what screwed me. Case. SQL Server very dutifully stripped out all the places where the LIKE clause found the text fragment ‘dup’. But not ‘Dup’, or ‘dUp’, or ‘duP’. Or, agonizingly, ‘Duplicate’. A scan for ‘dup’ will never be true when given ‘Duplicate’ to look at, because d <> D. Once I wrapped the title-bar column name in the lower() function and re-ran the data query, SQL dutifully spit out 687 records. My magic number.
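For the record, the shape of that last fix, again with stand-in names, and assuming the case-sensitive collation our server evidently uses:

```sql
-- Stand-in table and column names. Under a case-sensitive collation
-- this happily keeps 'Duplicate' rows, because '%dup%' never matches
-- a capital D:
SELECT p.person_id, p.title_bar
FROM people p
WHERE p.title_bar NOT LIKE '%dup%';

-- Folding the column to lower case first catches Dup, dUp, duP and,
-- agonizingly, Duplicate:
SELECT p.person_id, p.title_bar
FROM people p
WHERE lower(p.title_bar) NOT LIKE '%dup%';
```

(Forcing a case-insensitive collation on the comparison itself would also work, but lower() was the quick, portable fix.)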
So I won, god damn it. It took switching from IN to EXISTS, pitching JOIN overboard, taking out the dead people, and forcing lower-case reckoning. Now the damned thing is done and I can move on with my life!