Tuesday, December 29, 2009

Thinking in Sets

Everyone who programs a database has heard many times that the SQL language is optimized for set based solutions rather than procedural solutions, but examples are seldom provided with that advice. Consequently many beginning SQL programmers don’t have a clear understanding of what ‘set based’ means in terms of the code they need to write to solve a specific problem.

Even for those who have an understanding of the concept, there are many programming problems for which a set based solution seems impossible. Sometimes that is true. It is not always possible to find a set based solution, but most of the time we can find one by using a bit of creative thinking. A good SQL programmer must develop the mental discipline to explore the set based possibilities thoroughly before falling back on the intuitive procedural solution.

In this article we will use a relatively simple example as an illustration of thinking in a set based way about a problem that also has an intuitive procedural solution.

The Business Case

When you visit the doctor’s office, usually the first thing the nurse does is put you on a scale, weigh you and then check your height. Checking your weight makes sense from a medical point of view, but have you ever wondered why the nurse records your height each time? Unless you are very young, your height hasn’t changed since your last visit and is not likely to ever change again.

The reason they always check your height is to guard against identity theft. Healthcare providers want to make sure that the services they provide are going to the person who gets the bill, not to an imposter with a forged identity card.

This kind of identity theft happens more frequently than you might think. HIPAA (Health Insurance Portability and Accountability Act) regulations now require an audit of changes in permanent physical characteristics in a patient’s history that might suggest identity theft.

Querying this kind of information provides us a good example for comparing procedural thinking and set based thinking when programming in SQL.

The Problem Statement

The generic programming problem is that the solution depends on the order of rows and requires the comparison of current row values with values in previous rows. This is a type of problem in which the procedural solution is intuitive but the set based solution is not so obvious.

In this particular problem we are looking for rows where the previous record for the same patient has a value for height that is different than the height on the current record. We want to return the patient’s unique medical record number, the date the change occurred, what the height was changed from and what the height was changed to. We do not want to return any records that do not mark a change in height for a patient.

A Procedural Approach

The intuitive, procedural way to attack this problem is to loop through the records for each patient one row at a time. We query the first record for the patient and save the patient’s original height in a variable. Then we loop through subsequent records for the patient, comparing height values. If we find the height is different on a subsequent record, we write an audit record, update the height variable with the current value and continue looping through the rows. Then we move to the next patient, perhaps using an outer cursor to provide the patient id to the inner cursor.

Here is code that would implement a procedural, looping method:

create table #HeightAudit -- output table(MedRecNumber varchar(20), CreateDate datetime, Height varchar(10), CurrentHeight varchar(10))

–- get all the patients

declare MRN_cur Cursor for select distinct MedRecNumber
from PatientInfo.dbo.PatientInfo
Open MRN_cur;Fetch next from MRN_Cur into @MedRecNumber

While @@FETCH_STATUS = 0
Begin -- outer loop
-- do this for every visit for the medical record number

declare Visit_cur Cursor for
select CreateDate , Height
from PatientInfo
where MedRecNumber = @MedRecNumber
order by CreateDate

open Visit_cur;

fetch next from Visit_cur into @CreateDate, @Height;

–- establish the starting height for each patient

Set @CurrentHeight = @Height

While @@FETCH_STATUS = 0
Begin -- inner loop

If @Height <> @CurrentHeight
begin

insert #HeightAudit values (@MedRecNumber, @CreateDate, @Height, @CurrentHeight)

Set @CurrentHeight = @Height end

fetch next from Visit_cur into @CreateDate, @Height;

end -- inner loop

close Visit_cur;
deallocate Visit_cur;

Fetch next from MRN_Cur into @MedRecNumber

End -- outer loop

close MRN_Cur;
deallocate MRN_Cur

Select * from #HeightAudit
drop table #HeightAudit

This works, but it is an inefficient method. It could be a serious performance problem when working with a large number of rows. How can we do this in a set based and presumably more efficient way?

A Set Based Approach

The difference between a procedural and set based solution comes down to the way you define the problem. Stated in its simplest form, the change we are interested in involves only two records, two consecutive visits by the same patient. Everything else is irrelevant.

We might start by ordering the data by the patient’s unique Medical Record Number (MRN) and then the visit date. In that way, the records of consequtive visits by the same patient are adjacent to each other. The problem is then reduced to finding a way to join consecutive records from this set.

When we understand the problem in that way the solution is not so difficult to discover. We need to create a sequence number for the sorted rows that can be used to join one record with the next in a self-join.

We can create a temporary table with the patient data and a column populated with an identity function to create a sequence id that will correspond to the order of rows in the table.
After a little thought we come up with a self-join condition like this to join each row with it’s adjacent row:

... from #tmptable t1
join #tmptable t2 on t1.SequenceNumber = (t2.SequenceNumber + 1)

The resulting set of records represents every possible opportunity for the value of the patient’s height to change i.e. a set of records such that each contains the data from each set of 2 consequtive records. At this point filtering out the records that do not represent a change is trivial. We simply review our statement of the problem: To qualify as a record of interest, the patient must be the same in consecutive visits and the two heights must be different. The code might look like this:

-- Code that creates the temporary table

select IDENTITY(bigint, 1, 1) AS SequenceNumber, PatientID, VisitDate as ChangeDate, Heightinto #OrderedInfo
into #OrderedInfo
from TblPatientAudit
where VisitDate between @StartDate
and @EndDate
order by PatientID, VisitDate;

-- query that returns a record for each change in height

select ord.PatientID
, ord.ChangeDate
, ord.Height as 'HeightChangedTo'
,ord2.Height as 'HeightChangedFrom'
from #OrderedInfo ord
join #OrderedInfo ord2 on ord.SequenceNumber = (ord2.SequenceNumber + 1)
Where (ord.Height<> ord2.Height –- this is a change in height
and ord.PatientID = ord2.PatientID) –- for the same patientorder by ord.PatientID, ord.ChangeDate, ord.SequenceNumber;

Relative performance of the two methods

For testing, we ran both the cursor and the set based code against a table containing 21,000 patient visits which contained 3200 changes of a patient’s height. The cursor based code ran in 4 minutes, 7 seconds. The set based code ran in less than one second. Both returned the same results.

The auditing requirements for a large healthcare provider can easily generate a million rows per day in the audit table. So, even if you run your audit reports for only a single day’s data, you will have a lot of rows to process, far too many for a cursor or other looping mechanism to handle efficiently.

Set Based Thinking

Note that both steps of this solution operate on sets of data, not on individual rows. compare this to the cursor solution where most operations are repeated for each row in a set. However, keep in mind as you search for set based solutions that a set can contain 1 or even zero rows. Operating on a set containing only one row however is not the same as iterating through the rows of a set. There is a subtle but important difference.

Nothing in this simple example is rocket science. You will encounter SQL problems that are much more difficult to solve in a set based way and some that are impossible. However even this example requires a significant mental adjustment for programmers who are new to SQL programming. It requires a conscious effort to pull yourself out of your comfort zone and think in an unaccustomed manner. Even in the most difficult situations, don’t give up on a set based solution until you have given it a fair amount of thought.

Back in the Saddle Again

I haven't touched this blog for several months. Most of my available writing time has been absorbed by the launch of our free monthly SQL Server newsletter and by an article I wrote for SQL Server magazine (to be published Feb 2010). However, it's resolution time and my resolution this year is to make posts on a more regular basis.

if you would like to get the free monthly SQL newsletter emailed to you, follow the link in the left panel of this blog and sign up. Your email address will only be used to send you the newsletter once per month. We won't spam you and we will not give your email address to anyone else.