Tuesday, December 29, 2009

Thinking in Sets

Everyone who programs a database has heard many times that the SQL language is optimized for set based solutions rather than procedural solutions, but examples are seldom provided with that advice. Consequently many beginning SQL programmers don’t have a clear understanding of what ‘set based’ means in terms of the code they need to write to solve a specific problem.

Even for those who have an understanding of the concept, there are many programming problems for which a set based solution seems impossible. Sometimes that is true. It is not always possible to find a set based solution, but most of the time we can find one by using a bit of creative thinking. A good SQL programmer must develop the mental discipline to explore the set based possibilities thoroughly before falling back on the intuitive procedural solution.

In this article we will use a relatively simple example as an illustration of thinking in a set based way about a problem that also has an intuitive procedural solution.

The Business Case

When you visit the doctor’s office, usually the first thing the nurse does is put you on a scale, weigh you and then check your height. Checking your weight makes sense from a medical point of view, but have you ever wondered why the nurse records your height each time? Unless you are very young, your height hasn’t changed since your last visit and is not likely to ever change again.

The reason they always check your height is to guard against identity theft. Healthcare providers want to make sure that the services they provide are going to the person who gets the bill, not to an imposter with a forged identity card.

This kind of identity theft happens more frequently than you might think. HIPAA (Health Insurance Portability and Accountability Act) regulations now require an audit of changes in permanent physical characteristics in a patient’s history that might suggest identity theft.

Querying this kind of information provides us a good example for comparing procedural thinking and set based thinking when programming in SQL.

The Problem Statement

The generic programming problem is that the solution depends on the order of rows and requires the comparison of current row values with values in previous rows. This is a type of problem in which the procedural solution is intuitive but the set based solution is not so obvious.

In this particular problem we are looking for rows where the previous record for the same patient has a value for height that is different than the height on the current record. We want to return the patient’s unique medical record number, the date the change occurred, what the height was changed from and what the height was changed to. We do not want to return any records that do not mark a change in height for a patient.

A Procedural Approach

The intuitive, procedural way to attack this problem is to loop through the records for each patient one row at a time. We query the first record for the patient and save the patient’s original height in a variable. Then we loop through subsequent records for the patient, comparing height values. If we find the height is different on a subsequent record, we write an audit record, update the height variable with the current value and continue looping through the rows. Then we move to the next patient, perhaps using an outer cursor to provide the patient id to the inner cursor.

Here is code that would implement a procedural, looping method:

declare @MedRecNumber varchar(20)
declare @CreateDate datetime
declare @Height varchar(10)
declare @CurrentHeight varchar(10)

create table #HeightAudit -- output table
(MedRecNumber varchar(20), CreateDate datetime, Height varchar(10), CurrentHeight varchar(10))

-- get all the patients

declare MRN_cur Cursor for
select distinct MedRecNumber
from PatientInfo

Open MRN_cur;
Fetch next from MRN_cur into @MedRecNumber

While @@FETCH_STATUS = 0
Begin -- outer loop

-- do this for every visit for the medical record number

declare Visit_cur Cursor for
select CreateDate, Height
from PatientInfo
where MedRecNumber = @MedRecNumber
order by CreateDate

open Visit_cur;

fetch next from Visit_cur into @CreateDate, @Height;

-- establish the starting height for each patient

Set @CurrentHeight = @Height

While @@FETCH_STATUS = 0
Begin -- inner loop

If @Height <> @CurrentHeight
Begin
insert #HeightAudit values (@MedRecNumber, @CreateDate, @Height, @CurrentHeight)
Set @CurrentHeight = @Height
End

fetch next from Visit_cur into @CreateDate, @Height;

end -- inner loop

close Visit_cur;
deallocate Visit_cur;

Fetch next from MRN_cur into @MedRecNumber

End -- outer loop

close MRN_cur;
deallocate MRN_cur;

Select * from #HeightAudit
drop table #HeightAudit

This works, but it is an inefficient method. It could be a serious performance problem when working with a large number of rows. How can we do this in a set based and presumably more efficient way?

A Set Based Approach

The difference between a procedural and set based solution comes down to the way you define the problem. Stated in its simplest form, the change we are interested in involves only two records, two consecutive visits by the same patient. Everything else is irrelevant.

We might start by ordering the data by the patient’s unique Medical Record Number (MRN) and then the visit date. In that way, the records of consecutive visits by the same patient are adjacent to each other. The problem is then reduced to finding a way to join consecutive records from this set.

When we understand the problem in that way the solution is not so difficult to discover. We need to create a sequence number for the sorted rows that can be used to join one record with the next in a self-join.

We can create a temporary table with the patient data and a column populated with an identity function to create a sequence id that will correspond to the order of rows in the table.
After a little thought we come up with a self-join condition like this to join each row with its adjacent row:

... from #tmptable t1
join #tmptable t2 on t1.SequenceNumber = (t2.SequenceNumber + 1)

The resulting set of records represents every possible opportunity for the value of the patient’s height to change; that is, each row in the result combines the data from two consecutive visit records. At this point, filtering out the records that do not represent a change is trivial. We simply return to our statement of the problem: to qualify as a record of interest, the patient must be the same in both consecutive visits, and the two heights must be different. The code might look like this:

-- Code that creates the temporary table
-- (@StartDate and @EndDate define the audit period)

select IDENTITY(bigint, 1, 1) as SequenceNumber
, PatientID
, VisitDate as ChangeDate
, Height
into #OrderedInfo
from TblPatientAudit
where VisitDate between @StartDate and @EndDate
order by PatientID, VisitDate;

-- query that returns a record for each change in height

select ord.PatientID
, ord.ChangeDate
, ord.Height as 'HeightChangedTo'
, ord2.Height as 'HeightChangedFrom'
from #OrderedInfo ord
join #OrderedInfo ord2 on ord.SequenceNumber = (ord2.SequenceNumber + 1)
where ord.Height <> ord2.Height -- this is a change in height
and ord.PatientID = ord2.PatientID -- for the same patient
order by ord.PatientID, ord.ChangeDate, ord.SequenceNumber;
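As an aside: on SQL Server 2005 and later, the same result can be had without the temporary table by generating the sequence number with the ROW_NUMBER() function, and partitioning by patient removes the need to compare PatientID in the join. This is only a sketch, untested here, built on the same table and date parameters as above:

```sql
-- Sketch: ROW_NUMBER() replaces the IDENTITY/temp-table step (SQL 2005+)
with OrderedInfo as
(
    select PatientID
    , VisitDate as ChangeDate
    , Height
    -- number each patient's visits in date order
    , row_number() over (partition by PatientID order by VisitDate) as SequenceNumber
    from TblPatientAudit
    where VisitDate between @StartDate and @EndDate
)
select ord.PatientID
, ord.ChangeDate
, ord.Height as 'HeightChangedTo'
, ord2.Height as 'HeightChangedFrom'
from OrderedInfo ord
join OrderedInfo ord2 on ord.PatientID = ord2.PatientID
    and ord.SequenceNumber = (ord2.SequenceNumber + 1)
where ord.Height <> ord2.Height -- a change in height
order by ord.PatientID, ord.ChangeDate;
```

Because the numbering restarts for each patient, the join condition on PatientID guarantees we never pair the last visit of one patient with the first visit of the next.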

Relative Performance of the Two Methods

For testing, we ran both the cursor and the set based code against a table containing 21,000 patient visits which contained 3200 changes of a patient’s height. The cursor based code ran in 4 minutes, 7 seconds. The set based code ran in less than one second. Both returned the same results.

The auditing requirements for a large healthcare provider can easily generate a million rows per day in the audit table. So, even if you run your audit reports for only a single day’s data, you will have a lot of rows to process, far too many for a cursor or other looping mechanism to handle efficiently.

Set Based Thinking

Note that both steps of this solution operate on sets of data, not on individual rows. Compare this to the cursor solution, where most operations are repeated for each row in a set. However, keep in mind as you search for set based solutions that a set can contain one or even zero rows. Operating on a set containing only one row, however, is not the same as iterating through the rows of a set. There is a subtle but important difference.

Nothing in this simple example is rocket science. You will encounter SQL problems that are much more difficult to solve in a set based way and some that are impossible. However even this example requires a significant mental adjustment for programmers who are new to SQL programming. It requires a conscious effort to pull yourself out of your comfort zone and think in an unaccustomed manner. Even in the most difficult situations, don’t give up on a set based solution until you have given it a fair amount of thought.

Back in the Saddle Again

I haven't touched this blog for several months. Most of my available writing time has been absorbed by the launch of our free monthly SQL Server newsletter and by an article I wrote for SQL Server magazine (to be published Feb 2010). However, it's resolution time and my resolution this year is to make posts on a more regular basis.

If you would like to get the free monthly SQL newsletter emailed to you, follow the link in the left panel of this blog and sign up. Your email address will only be used to send you the newsletter once per month. We won't spam you and we will not give your email address to anyone else.

Saturday, July 11, 2009

TSQL change in SQL 2008 can break legacy code

As of the 2008 version, SQL Server no longer supports the antiquated legacy syntax for outer joins. It now requires the newer JOIN keyword syntax. While it is a good idea to get rid of this flawed syntax, there is a considerable amount of legacy code that uses it. Before upgrading to SQL 2008, you should search your TSQL code for the old outer join operators (*=, =*) to see whether this change will affect your application.

Here is an example of the old syntax using the AdventureWorks sample database:

SELECT SalesTerritory.Name, SalesPerson.SalesPersonID
FROM Sales.SalesTerritory, Sales.SalesPerson
WHERE Sales.SalesPerson.TerritoryID *= Sales.SalesTerritory.TerritoryID

This syntax will generate an error in SQL Server 2008. To get around this you need to change the code to the accepted ANSI syntax or else set the database to compatibility level 80. Here is an example of the supported syntax:

SELECT SalesTerritory.Name, SalesPerson.SalesPersonID
FROM Sales.SalesTerritory
LEFT JOIN Sales.SalesPerson on Sales.SalesPerson.TerritoryID = Sales.SalesTerritory.TerritoryID

So far Microsoft has only de-supported the outer join form of this syntax. The inner join form will still work. However, it should not be used in new code. I suspect it will be de-supported in a future release as it should be. Specifying join conditions in the WHERE clause makes it far too easy to leave a join condition out of your WHERE clause, especially in queries that join many tables.
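For comparison, here is what the legacy inner join form looks like, sketched against the same AdventureWorks tables, along with its ANSI rewrite:

```sql
-- Legacy inner join: the join condition is buried in the WHERE clause
SELECT SalesTerritory.Name, SalesPerson.SalesPersonID
FROM Sales.SalesTerritory, Sales.SalesPerson
WHERE Sales.SalesPerson.TerritoryID = Sales.SalesTerritory.TerritoryID

-- ANSI rewrite: omitting the ON clause is a syntax error,
-- not a silent cross join
SELECT SalesTerritory.Name, SalesPerson.SalesPersonID
FROM Sales.SalesTerritory
JOIN Sales.SalesPerson ON Sales.SalesPerson.TerritoryID = Sales.SalesTerritory.TerritoryID
```

With the legacy form, dropping the WHERE line silently produces a cross join of every territory with every salesperson; with the ANSI form, the parser catches the mistake.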

Saturday, May 2, 2009

The Perils of XML Over-Use

Like all technologies, XML has its strong and weak points. It is useful in many situations. However it is often used in places where it is not appropriate. The performance cost of XML mis-use or over-use can be enormous.

I recently looked at the results of a performance comparison between data stored in two relational tables and the same data stored in a single XML-format table. The data consisted of approximately 100,000 orders and 1.2 million order line items. In the relational model they were stored as an Orders table and an OrderDetail table. The XML data was created in element format in one case and in attribute format in another. Both relational and XML data had similar indexes.

Below is a table summarizing the results. A cold execution means the query was executed after the buffers were flushed and no data was in cache. A hot execution was done after the data was in cache:

Data Format         Cold Execution (ms)   Hot Execution (ms)   Storage (MB)
XML Element         59,373                37,054               397
XML Attribute       3,190                 31                   407
Relational Tables   250                   1                    214

As you can see there is a HUGE performance degradation using element-style XML. It is on the order of 37,000% slower when hot execution times are compared with relational data retrieval. Attribute style XML fares better but it is still many times less efficient than data retrieval from relational tables.

As well as having performance issues, both forms of XML take almost twice the disk space of relational data, as can be seen in the Storage column of the table above.

XML has many legitimate uses in a database environment. It is a universally recognized data interchange format. It is self-documenting. It is ideal for exchanging small packets of information, storing configuration settings, etc. But high performance manipulation of large datasets is not something it does very efficiently.

Here is a URL to the actual test parameters and detailed results:


Wednesday, March 25, 2009

Get table sizes painlessly

Here is a handy script to get a quick look at table sizes for all tables in your database without putting unneeded load on a production server. It has no detrimental effect on performance. It gets row and page counts from the sysindexes table without having to touch your user tables. These counts are not exact but they are accurate enough for most purposes. The data is updated each time your statistics are updated and is usually within a fraction of a percent of the current values.

I use this query for a number of purposes. When I take a first look at a database, it gives me an idea of the size of the major tables. When performance tuning, I sort it by table name so I can quickly find the relative sizes of the tables in the query I am trying to optimize. It is easy to locate tables that were created for some purpose but never used. You can also use it to record database growth by taking snapshots at regular intervals.

I did not write this code as a stored procedure because in my work as a consultant, I often cannot create objects on the client’s production SQL Server. Scripts are handier and easier to modify for particular situations. You, however, are free to enclose it in a stored procedure if that works better for you.

There is nothing clever about the code, but it is useful. The query returns the table name, the number of rows it contains, and the number of 8kb pages it contains, including pages used to store binary large objects like ntext and image data. It will run on all versions of SQL Server.

If Microsoft ever makes good on its threat to eliminate direct access to the system tables, then we will have to create a version that selects from the sys.objects and sys.indexes views. But right now it is simpler to have a single query that works on all versions.

-- GetTableSizes.sql

select o.name
, sum(case when indid < 2 then i.rows else 0 end) as rows
, sum(case when indid < 2 then i.dpages
when indid = 255 then i.used
else 0 end) as allpages
from sysobjects o
join sysindexes i on o.id = i.id
where (indid < 2 or indid = 255)
and o.name not like 'sys%'
group by o.name
order by rows desc
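Should Microsoft ever carry out that threat, a catalog-view version for SQL Server 2005 and later might look something like the sketch below. It uses sys.dm_db_partition_stats rather than sys.indexes, because the newer catalog views do not expose the dpages count directly. Note that this is untested here, counts reserved rather than used pages, and requires VIEW DATABASE STATE permission:

```sql
-- Sketch: approximate equivalent using the 2005+ catalog/DMV views
select o.name
, sum(case when ps.index_id < 2 then ps.row_count else 0 end) as rows
, sum(ps.reserved_page_count) as allpages -- includes LOB and row-overflow pages
from sys.objects o
join sys.dm_db_partition_stats ps on o.object_id = ps.object_id
where o.type = 'U' -- user tables only
group by o.name
order by rows desc;
```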

Tuesday, February 10, 2009

My head in the Cloud

Microsoft’s announced future direction is for SQL Server to become a “data platform” encompassing not only traditionally structured database data, but also unstructured information such as text documents, XML documents, spreadsheets, streaming media, etc. Furthermore, they foresee that sooner rather than later, data management and storage will be a web service enshrouded in the Cloud, which has become a techno-chic synonym for the internet. The idea is that you don’t need to know where the data is stored or where your database management system is located. You just request data from a URL somewhere (probably China, where the technicians work very cheaply). When you need it, it appears out of the cloud.

I have my doubts that this will find quick acceptance. The internet is by nature unreliable and insecure. To date, the hackers are staying comfortably ahead of the hacker police.

I could be wrong and often am. No doubt such a scheme could be a great deal less expensive than maintaining your own database infrastructure. I am curious to know how many of you would delegate the safety and accessibility of your mission-critical data to the Cloud? If you have a minute, could you email me your thoughts on this?


Sunday, January 11, 2009

TOP and ORDER BY operators in SQL 2005/8

Changes to the TSQL language beginning with SQL Server 2005 can change the order of results returned from a query that selects from a view. That, in turn, has the potential to break some application functionality. Most sites will be unaffected, but some may find that certain code is broken. If your application depends on the sort order of results returned from a view that was defined with an ORDER BY clause, you might be affected by this change.
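A view of the affected shape combines TOP 100 PERCENT with ORDER BY, something along these lines (table and view names are purely illustrative):

```sql
-- In SQL 2005/8, this view no longer guarantees ordered results
CREATE VIEW dbo.vSortedOrders
AS
SELECT TOP 100 PERCENT OrderID, OrderDate
FROM dbo.Orders
ORDER BY OrderDate;
```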

Simply put, a view created with an ORDER BY clause is no longer guaranteed to return rows in the specified order. To guarantee the order of the result set in versions newer than SQL Server 2000, you must sort the result set in the query that selects from the view, not in the view definition.
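For example, given a hypothetical view dbo.vSortedOrders whose definition contains an ORDER BY on an OrderDate column, the safe pattern is:

```sql
-- Guaranteed order: the sort belongs in the outer query
SELECT OrderID, OrderDate
FROM dbo.vSortedOrders
ORDER BY OrderDate;
```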

If you are planning an upgrade from SQL 2000, you certainly want to be aware of this so you can test your existing code. If you have already upgraded you may want to make sure that subtle changes in application functionality have not gone unnoticed.

Views that are created with a TOP operator and an ORDER BY clause are still guaranteed to return the same rows, but the rows are not guaranteed to be in the sequence defined by the ORDER BY clause. In most cases the order will be correct, but not in all cases.

This is not a bug; it is a deliberate change made to bring TSQL into closer compliance with the ANSI SQL language standard. Views were never intended to return sorted results. But because sorting in a view was possible in previous versions of SQL Server, there is a significant amount of code in production systems that will be affected by this change. Microsoft has a hotfix for later SQL versions that restores the default SQL 2000 behavior. They describe it as a temporary fix designed to give you time to rewrite any code that is broken, so it is not likely to be around for very long.

If you are interested in the technical details of why this change was made, I recommend this article by Itzik Ben-Gan in SQL Server magazine, January 2009: