Friday, January 25, 2013

There is really no point in having a blog if I am going to ignore it for years at a time. However, most of my writing time is taken up by my monthly SQL newsletter (Get your FREE subscription here) and the occasional article for SQL Server Pro magazine.

I have an archive of useful SQL Articles here.

I am also putting up a new site, sqlperformancetuning.com. It will be online soon.

Kurt

Tuesday, March 16, 2010

Smoke and Mirrors in SQL Server 2008 R2

Microsoft is touting its new Release 2 of SQL Server 2008. If they weren't planning to charge for it, they would be honest and call it a service pack, not a release. When you consider that SQL Reporting Services arrived in a free service pack for SQL Server 2000, what R2 offers as a full release is a bit underwhelming.

ENTERPRISE EDITION

R2 has one flagship feature, self-service business intelligence in PowerPivot for Excel, but it is not available in Standard Edition. I have not used the feature yet, but I have my doubts that it will bring business intelligence to the masses as promised.

My opinion, based on my experience, is that the masses don’t want business intelligence. A two dimensional grid like a spreadsheet is about as far as many business users are willing to go.

There is only so much simplification that can be done to allow rank and file users to create and browse multi-dimensional data. (How many business users do you know who can even create and use an Excel pivot table?) Real BI may always be beyond the capacity of the casual business user.

STANDARD EDITION

For the Standard Edition user, there is even less that would drive you to upgrade. The backup compression feature of SQL Server 2008 Enterprise Edition has been added to the Standard Edition of R2. That’s about it.
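For reference, using that feature is a one-clause change to an ordinary backup command. A minimal sketch, with the database name and file path as placeholders:

-- backup compression: Enterprise-only in SQL 2008, added to Standard in R2
BACKUP DATABASE AdventureWorks
TO DISK = 'D:\Backups\AdventureWorks.bak'
WITH COMPRESSION, INIT;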

NEW EDITIONS

R2 introduces two more SQL Server editions, Parallel Data Warehouse and Datacenter. They are dubbed Premium Editions because they come at a distinctly premium price: either one will set you back $57,498 per processor (list price). However, Microsoft is the only database vendor that counts processor sockets rather than cores in its licensing scheme, so an eight-core processor costs the same to license as a single-core chip. This makes the premium editions an attractive alternative to Oracle, where you would need to pay for eight licenses.

Oddly enough, these two new expensive powerhouse editions may be the most compelling reasons to upgrade, but only for those who need to deal with multi-terabyte databases and huge processing loads. These editions are designed to close the gap with Oracle at the high end of the database spectrum with attractive pricing. If you live on the upper edge of the spectrum you might want to look into what they offer.

Saturday, January 9, 2010

A Tool for Herding Cats

Managing developers is often like herding cats. Many development teams lack formal change control procedures and in many teams everyone has a sysadmin password. Sometimes in smaller companies there is not an adequate test environment and developers acquire the habit of making changes in production. Often they do not communicate those changes to the team lead or the other members of the team.

As we all know, this is a terribly dangerous practice, but we have all done it. Sometimes we have no choice.

If you are responsible for managing a situation like that and the bean-counters are not likely to approve the cost of a real change control platform, here is a script that will make your life a little easier. It will quickly show you the important things that have changed in a database since a point in time that you specify.



declare @cutoffdate datetime
, @days smallint;
set @days = -7; -- change this as necessary
set @cutoffdate = DATEADD(dd, @days,getdate());
select db_name() as DBName
, name
, type_desc
, case when create_date > @cutoffdate then 'New' else 'Changed' end as [Status]
, create_date
, modify_date
from sys.objects
where (modify_date > @cutoffdate or create_date > @cutoffdate)
and is_ms_shipped = 0 -- exclude system objects
and is_published = 0 -- exclude repl tables that will always show change
order by DB_NAME(), type_desc, name;
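If you want to sweep every database on the server instead of just the current one, the same query can be wrapped in the undocumented (but widely used) sp_MSforeachdb procedure. A rough sketch; since sp_MSforeachdb is unsupported by Microsoft, test it before relying on it:

exec sp_MSforeachdb '
use [?];
declare @cutoffdate datetime;
set @cutoffdate = DATEADD(dd, -7, getdate()); -- change this as necessary
select db_name() as DBName, name, type_desc
, case when create_date > @cutoffdate then ''New'' else ''Changed'' end as [Status]
, create_date, modify_date
from sys.objects
where (modify_date > @cutoffdate or create_date > @cutoffdate)
and is_ms_shipped = 0
and is_published = 0;';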


This is a handy tool for a quick look, but it has a lot of limitations. Its redeeming feature is that it is simple and puts no load on the database.

It doesn't show you what change was made or who made it. It doesn't allow you to roll back changes. It doesn't include all types of database objects (indexes, for example, are missing). Worst of all, it doesn't tell you what has been dropped or deleted.

However, it does alert you to newly created objects and changes for object types that are most relevant to developers: stored procedures, triggers, tables, etc. This will at least allow you to ask around to find out why the change was made and who made it. Most importantly, if the wheels suddenly come off of your website, you can quickly see what, if anything, was changed in the database at the same time the problem started.
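If you need the "who" and the "what" that this query cannot give you, a database-level DDL trigger can capture both as changes happen. Below is a minimal sketch; the log table and trigger names are my own invention, so test it before trusting it in production:

-- table to hold a record of each DDL event
create table DDLChangeLog
(EventTime datetime default getdate()
, LoginName nvarchar(128)
, EventType nvarchar(100)
, ObjectName nvarchar(256)
, TSQLCommand nvarchar(max));
go

create trigger trgLogDDL on database
for DDL_DATABASE_LEVEL_EVENTS
as
begin
declare @e xml;
set @e = EVENTDATA(); -- event details arrive as XML
insert DDLChangeLog (LoginName, EventType, ObjectName, TSQLCommand)
select @e.value('(/EVENT_INSTANCE/LoginName)[1]', 'nvarchar(128)')
, @e.value('(/EVENT_INSTANCE/EventType)[1]', 'nvarchar(100)')
, @e.value('(/EVENT_INSTANCE/ObjectName)[1]', 'nvarchar(256)')
, @e.value('(/EVENT_INSTANCE/TSQLCommand/CommandText)[1]', 'nvarchar(max)');
end;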

Tuesday, December 29, 2009

Thinking in Sets

Everyone who programs a database has heard many times that the SQL language is optimized for set based solutions rather than procedural solutions, but examples are seldom provided with that advice. Consequently many beginning SQL programmers don’t have a clear understanding of what ‘set based’ means in terms of the code they need to write to solve a specific problem.

Even for those who have an understanding of the concept, there are many programming problems for which a set based solution seems impossible. Sometimes that is true. It is not always possible to find a set based solution, but most of the time we can find one by using a bit of creative thinking. A good SQL programmer must develop the mental discipline to explore the set based possibilities thoroughly before falling back on the intuitive procedural solution.

In this article we will use a relatively simple example as an illustration of thinking in a set based way about a problem that also has an intuitive procedural solution.

The Business Case

When you visit the doctor’s office, usually the first thing the nurse does is put you on a scale, weigh you and then check your height. Checking your weight makes sense from a medical point of view, but have you ever wondered why the nurse records your height each time? Unless you are very young, your height hasn’t changed since your last visit and is not likely to ever change again.

The reason they always check your height is to guard against identity theft. Healthcare providers want to make sure that the services they provide are going to the person who gets the bill, not to an imposter with a forged identity card.

This kind of identity theft happens more frequently than you might think. HIPAA (Health Insurance Portability and Accountability Act) regulations now require an audit of changes in permanent physical characteristics in a patient’s history that might suggest identity theft.

Querying this kind of information provides us a good example for comparing procedural thinking and set based thinking when programming in SQL.

The Problem Statement

The generic programming problem is that the solution depends on the order of rows and requires the comparison of current row values with values in previous rows. This is a type of problem in which the procedural solution is intuitive but the set based solution is not so obvious.

In this particular problem we are looking for rows where the previous record for the same patient has a value for height that is different than the height on the current record. We want to return the patient’s unique medical record number, the date the change occurred, what the height was changed from and what the height was changed to. We do not want to return any records that do not mark a change in height for a patient.

A Procedural Approach

The intuitive, procedural way to attack this problem is to loop through the records for each patient one row at a time. We query the first record for the patient and save the patient’s original height in a variable. Then we loop through subsequent records for the patient, comparing height values. If we find the height is different on a subsequent record, we write an audit record, update the height variable with the current value and continue looping through the rows. Then we move to the next patient, perhaps using an outer cursor to provide the patient id to the inner cursor.

Here is code that would implement a procedural, looping method:

-- output table
create table #HeightAudit
(MedRecNumber varchar(20)
, CreateDate datetime
, Height varchar(10)
, CurrentHeight varchar(10));

-- loop variables
declare @MedRecNumber varchar(20)
, @CreateDate datetime
, @Height varchar(10)
, @CurrentHeight varchar(10);

-- get all the patients
declare MRN_cur cursor for
select distinct MedRecNumber
from PatientInfo.dbo.PatientInfo;

open MRN_cur;
fetch next from MRN_cur into @MedRecNumber;

while @@FETCH_STATUS = 0
begin -- outer loop

    -- do this for every visit for the medical record number
    declare Visit_cur cursor for
    select CreateDate, Height
    from PatientInfo
    where MedRecNumber = @MedRecNumber
    order by CreateDate;

    open Visit_cur;

    fetch next from Visit_cur into @CreateDate, @Height;

    -- establish the starting height for each patient
    set @CurrentHeight = @Height;

    while @@FETCH_STATUS = 0
    begin -- inner loop

        if @Height <> @CurrentHeight
        begin
            insert #HeightAudit values (@MedRecNumber, @CreateDate, @Height, @CurrentHeight);
            set @CurrentHeight = @Height;
        end

        fetch next from Visit_cur into @CreateDate, @Height;

    end -- inner loop

    close Visit_cur;
    deallocate Visit_cur;

    fetch next from MRN_cur into @MedRecNumber;

end -- outer loop

close MRN_cur;
deallocate MRN_cur;

select * from #HeightAudit;
drop table #HeightAudit;

This works, but it is an inefficient method. It could be a serious performance problem when working with a large number of rows. How can we do this in a set based and presumably more efficient way?

A Set Based Approach

The difference between a procedural and set based solution comes down to the way you define the problem. Stated in its simplest form, the change we are interested in involves only two records, two consecutive visits by the same patient. Everything else is irrelevant.

We might start by ordering the data by the patient's unique Medical Record Number (MRN) and then by the visit date. That way, the records of consecutive visits by the same patient are adjacent to each other. The problem is then reduced to finding a way to join consecutive records from this set.

When we understand the problem in that way the solution is not so difficult to discover. We need to create a sequence number for the sorted rows that can be used to join one record with the next in a self-join.

We can create a temporary table holding the patient data plus a column populated by the IDENTITY function, giving each row a sequence id that corresponds to its position in the sorted order.
After a little thought we come up with a self-join condition like this to join each row with its adjacent row:

... from #tmptable t1
join #tmptable t2 on t1.SequenceNumber = (t2.SequenceNumber + 1)

The resulting set of records represents every possible opportunity for the value of the patient's height to change; that is, each record contains the data from two consecutive visits. At this point, filtering out the records that do not represent a change is trivial. We simply review our statement of the problem: to qualify as a record of interest, the patient must be the same in consecutive visits and the two heights must be different. The code might look like this:

-- Code that creates the temporary table

declare @StartDate datetime
, @EndDate datetime; -- set these to bound the audit period

select IDENTITY(bigint, 1, 1) as SequenceNumber
, PatientID
, VisitDate as ChangeDate
, Height
into #OrderedInfo
from TblPatientAudit
where VisitDate between @StartDate and @EndDate
order by PatientID, VisitDate;

-- query that returns a record for each change in height

select ord.PatientID
, ord.ChangeDate
, ord.Height as 'HeightChangedTo'
, ord2.Height as 'HeightChangedFrom'
from #OrderedInfo ord
join #OrderedInfo ord2 on ord.SequenceNumber = (ord2.SequenceNumber + 1)
where ord.Height <> ord2.Height -- this is a change in height
and ord.PatientID = ord2.PatientID -- for the same patient
order by ord.PatientID, ord.ChangeDate, ord.SequenceNumber;
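As an aside, on SQL Server 2005 and later you can skip the temporary table entirely and let ROW_NUMBER() generate the sequence inside a common table expression. A minimal sketch of that variation, using the same table and date variables as above:

declare @StartDate datetime
, @EndDate datetime; -- set these to bound the audit period

with OrderedInfo as
(
    select PatientID
    , VisitDate as ChangeDate
    , Height
    -- sequence number assigned in patient/visit-date order
    , ROW_NUMBER() over (order by PatientID, VisitDate) as SequenceNumber
    from TblPatientAudit
    where VisitDate between @StartDate and @EndDate
)
select ord.PatientID
, ord.ChangeDate
, ord.Height as 'HeightChangedTo'
, ord2.Height as 'HeightChangedFrom'
from OrderedInfo ord
join OrderedInfo ord2 on ord.SequenceNumber = ord2.SequenceNumber + 1
where ord.Height <> ord2.Height
and ord.PatientID = ord2.PatientID
order by ord.PatientID, ord.ChangeDate;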

Relative performance of the two methods

For testing, we ran both the cursor and the set based code against a table containing 21,000 patient visits which contained 3200 changes of a patient’s height. The cursor based code ran in 4 minutes, 7 seconds. The set based code ran in less than one second. Both returned the same results.

The auditing requirements for a large healthcare provider can easily generate a million rows per day in the audit table. So, even if you run your audit reports for only a single day’s data, you will have a lot of rows to process, far too many for a cursor or other looping mechanism to handle efficiently.

Set Based Thinking

Note that both steps of this solution operate on sets of data, not on individual rows. Compare this to the cursor solution, where most operations are repeated for each row in a set. However, keep in mind as you search for set based solutions that a set can contain one or even zero rows. Operating on a set containing only one row, however, is not the same as iterating through the rows of a set. There is a subtle but important difference.

Nothing in this simple example is rocket science. You will encounter SQL problems that are much more difficult to solve in a set based way, and some that are impossible. However, even this example requires a significant mental adjustment for programmers who are new to SQL programming. It takes a conscious effort to pull yourself out of your comfort zone and think in an unaccustomed manner. Even in the most difficult situations, don't give up on a set based solution until you have given it a fair amount of thought.

Back in the Saddle Again

I haven't touched this blog for several months. Most of my available writing time has been absorbed by the launch of our free monthly SQL Server newsletter and by an article I wrote for SQL Server magazine (to be published Feb 2010). However, it's resolution time and my resolution this year is to make posts on a more regular basis.

If you would like to get the free monthly SQL newsletter emailed to you, follow the link in the left panel of this blog and sign up. Your email address will only be used to send you the newsletter once per month. We won't spam you and we will not give your email address to anyone else.

Saturday, July 11, 2009

TSQL change in SQL 2008 can break legacy code

As of the 2008 version, SQL Server no longer supports the antiquated syntax for outer joins; it now requires the newer JOIN keyword syntax. While it is a good idea to get rid of this flawed syntax, there is a considerable amount of legacy code that uses it. Before upgrading to SQL 2008, you should search your TSQL code for the old outer join operators (*=, =*) to see whether this is going to affect your application.
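One quick way to run that search inside the database itself is to scan the stored module definitions. A rough sketch using sys.sql_modules (SQL Server 2005 and later); note that it will also flag comments or string literals that happen to contain those characters:

-- find procedures, views, functions, and triggers that still use *= or =*
select OBJECT_SCHEMA_NAME(object_id) as SchemaName
, OBJECT_NAME(object_id) as ObjectName
from sys.sql_modules
where definition like '%*=%'
or definition like '%=*%';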

Here is an example of the old syntax using the AdventureWorks sample database:

SELECT SalesTerritory.Name, SalesPerson.SalesPersonID
FROM Sales.SalesTerritory, Sales.SalesPerson
WHERE Sales.SalesPerson.TerritoryID *= Sales.SalesTerritory.TerritoryID


This syntax will generate an error in SQL Server 2008. To get around it you need to change the code to the accepted ANSI syntax or else set the database to compatibility level 80. Here is an example of the supported syntax:

SELECT SalesTerritory.Name, SalesPerson.SalesPersonID
FROM Sales.SalesPerson
LEFT JOIN Sales.SalesTerritory ON Sales.SalesPerson.TerritoryID = Sales.SalesTerritory.TerritoryID
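If the code cannot be fixed right away, the compatibility level fallback mentioned above is a one-line change (the database name is a placeholder):

-- SQL Server 2008 syntax; on earlier versions use sp_dbcmptlevel instead
ALTER DATABASE AdventureWorks SET COMPATIBILITY_LEVEL = 80;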


So far Microsoft has only de-supported the outer join form of this syntax; the inner join form will still work. However, it should not be used in new code, and I suspect it will be de-supported in a future release, as it should be. Specifying join conditions in the WHERE clause makes it far too easy to leave one out, especially in queries that join many tables.
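To see how easily that happens, consider a contrived three-table example in the old inner join style (the table names are invented for illustration). Omit one predicate and the query still runs, silently returning a partial cross join:

SELECT o.OrderID, c.CustomerName, p.ProductName
FROM Orders o, Customers c, Products p
WHERE o.CustomerID = c.CustomerID
-- the missing "AND o.ProductID = p.ProductID" raises no error;
-- every order is simply paired with every product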

Saturday, May 2, 2009

The Perils of XML Over-Use

Like all technologies, XML has its strong points and its weak points. It is useful in many situations, but it is often used in places where it is not appropriate, and the performance cost of XML misuse or over-use can be enormous.

I recently looked at the results of a performance comparison between data stored in two relational tables and the same data stored in a single table in XML format. The data consisted of approximately 100,000 orders and 1.2 million order line items. In the relational model they were stored as an Orders table and an OrderDetail table. The XML data was created in element format in one case and in attribute format in another. Both the relational and XML data had similar indexes.
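For readers unfamiliar with the two XML storage styles, here is roughly what one order line looks like in each (the element and attribute names are invented for illustration):

declare @elementStyle xml
, @attributeStyle xml;

-- element-centric: every value wrapped in its own element
set @elementStyle =
'<Order>
  <OrderID>1001</OrderID>
  <LineItem>
    <ProductID>52</ProductID>
    <Qty>3</Qty>
  </LineItem>
</Order>';

-- attribute-centric: values stored as attributes, with much less markup
set @attributeStyle =
'<Order OrderID="1001">
  <LineItem ProductID="52" Qty="3" />
</Order>';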

Below is a table summarizing the results. A cold execution means the query was executed after the buffers were flushed and no data was in cache. A hot execution was done after the data was in cache:


Data Format         Cold Execution (ms)    Hot Execution (ms)    Storage (MB)
XML Element         59,373                 37,054                397
XML Attribute       3,190                  31                    407
Relational Tables   250                    1                     214


As you can see, there is a HUGE performance degradation using element-style XML. It is on the order of 37,000 times slower than relational data retrieval when hot execution times are compared. Attribute-style XML fares better, but it is still many times less efficient than data retrieval from relational tables.

As well as having performance issues, both forms of XML take almost twice the disk space of the relational data, as can be seen in the Storage column of the table above.

XML has many legitimate uses in a database environment. It is a universally recognized data interchange format. It is self-documenting. It is ideal for exchanging small packets of information, storing configuration settings, etc. But high performance manipulation of large datasets is not something it does very efficiently.

Here is a URL to the actual test parameters and detailed results:

http://www.sqlsolutions.com/articles/articles/Performance%20of%20indexing%20of%20XML%20in%20MS%20SQL%202005.html