Becky Edwards
asked on
Dealing with NULL values returns no data
I want to filter out patients who have a 1 or a 2 in the following field, leaving patients who either have a NULL value or any of the other values.
PATIENT_TYPE_C  NAME
1  Anonymous
2  Confidential
3  Employee
4  Dismissed
5  Verified with no Medicare
6  Prisoner
7  VIP

When I put in this filter
pt.PATIENT_TYPE_C not in (1,2)
it doesn't return anyone, because the patients currently all have NULL values in that field.
How can I write my filter to account for the NULL value?
ASKER CERTIFIED SOLUTION
>How can I write my filter to account for the NULL value?
The above expert answers are correct, but just to add some context: NULL cannot be compared to anything, so NULL NOT IN (1,2) does not evaluate to TRUE, which is why your query returns no rows.
So to pull this off you'll have to either test for the NULL explicitly or convert it, every single time, to a non-NULL value that can participate in comparisons.
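A quick way to see that three-valued logic in action, sketched here with Python's bundled sqlite3 (SQLite follows the same NULL rules as SQL Server for this; the table and data are invented to mirror the question):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE patient (name TEXT, patient_type_c INTEGER)")
conn.executemany(
    "INSERT INTO patient VALUES (?, ?)",
    [("A", 1), ("B", 2), ("C", 3), ("D", None), ("E", None)],
)

# For rows D and E the predicate is NULL NOT IN (1, 2), which evaluates
# to UNKNOWN rather than TRUE, so the WHERE clause filters them out.
rows = conn.execute(
    "SELECT name FROM patient WHERE patient_type_c NOT IN (1, 2)"
).fetchall()
print(rows)  # [('C',)] -- the NULL rows silently disappear
```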
NULL and NOT IN do not get along well. When you see NOT IN against a nullable field, expect trouble. The workarounds above are reasonable -- I use them both. Ideally, one would design the tables to not use NULLs where they are not required and avoid all the issues that NULL can introduce.
Another workaround is to create a non-nullable persisted calculated column that mirrors the original column but replaces NULL with an otherwise invalid value such as zero. Queries that were written for the original column still work, and you can query against the new one when you need to. If the table is huge, however, you might want to think twice before doing that -- you do need to know what you are doing, and be aware of the potential for performance impact and fragmentation.
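A sketch of that mirrored-column idea, again in sqlite3. SQLite's STORED generated columns (available from SQLite 3.31) play the role of SQL Server's PERSISTED computed columns; in T-SQL you would add the column with something like `ALTER TABLE ... ADD col AS (ISNULL(...)) PERSISTED`. All names here are invented:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# patient_type_c_nn mirrors patient_type_c but maps NULL to the
# otherwise-invalid value 0, so the mirrored column is never NULL.
conn.execute("""
    CREATE TABLE patient (
        name TEXT,
        patient_type_c INTEGER,
        patient_type_c_nn INTEGER
            GENERATED ALWAYS AS (COALESCE(patient_type_c, 0)) STORED
    )
""")
conn.executemany(
    "INSERT INTO patient (name, patient_type_c) VALUES (?, ?)",
    [("A", 1), ("B", 2), ("C", 3), ("D", None)],
)

# NOT IN now behaves intuitively: 0 NOT IN (1, 2) is plain TRUE.
kept = conn.execute(
    "SELECT name FROM patient WHERE patient_type_c_nn NOT IN (1, 2)"
).fetchall()
print(kept)  # [('C',), ('D',)]
```

Queries written against the original column are unaffected; only queries that opt in to the mirrored column see the substituted value.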
Don't use function(s) against a column in the WHERE clause (unless absolutely required) because it can hurt performance.
You must check for NULL directly, using IS NULL (or IS NOT NULL):
(pt.PATIENT_TYPE_C is null OR pt.PATIENT_TYPE_C not in (1,2) )
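The same toy sqlite3 setup as above, showing that the explicit IS NULL test brings the NULL rows back (table and values are invented):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE patient (name TEXT, patient_type_c INTEGER)")
conn.executemany(
    "INSERT INTO patient VALUES (?, ?)",
    [("A", 1), ("B", 2), ("C", 3), ("D", None)],
)

# IS NULL is the one test that treats NULL as a definite answer,
# so ORing it in catches the rows that NOT IN can never see.
rows = conn.execute("""
    SELECT name FROM patient
    WHERE patient_type_c IS NULL OR patient_type_c NOT IN (1, 2)
""").fetchall()
print(rows)  # [('C',), ('D',)]
```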
The OR tends to impact performance about as much as wrapping the column in a function, if there is otherwise a suitable index for the predicate. If there is no such index, it probably isn't going to matter anyway. If there is a suitable index, SQL Server might do a partial scan (whether using OR or the function, either way), which can work out fine, sometimes. As always, "it depends."
An index against a non-persisted computed column that copies the original column but eliminates the NULL, along with any other columns needed, is another possibility, if all the restrictions on doing that can be met. It has the advantage of not messing with the clustered index rows. (Disclaimer: I haven't actually done this - I always ran into restrictions.) Not available before SQL 2012, I don't believe.
>The OR tends to impact performance about as much as wrapping the column in a function
Actually it doesn't. That's easy to prove for yourself by building a reasonably large sample table, say 1M rows, and testing column IS NULL vs ISNULL(column, ...).
It's not so easy, because it can depend on so many factors. I tested on a 30,000,000 row table with a 2-predicate ANDed search chosen to return just 4-6 rows, after adding a matching index to the table, and it actually read 4-6 index rows, sometimes. Switching to OR (the predicate was chosen to return few enough rows in this case to not run out of memory) caused it to read 30,000,000 rows. Changing back to AND and wrapping column 1 (in the index) with a do-nothing function - COALESCE(xxx, NULL) - had the same effect (30 million).
Wrapping the column matching column 2 of the index had less effect due to a partial index scan, although the size of the scan (up to 19,000 rows) varied depending on the exact value of the 2nd test due, apparently, to the fact that the second column was nullable. I thought that was a little weird, but not unusual.
The 30,000,000 row reads correspond to non-SARGable predicates which prevent use of the index. A non-SARGable predicate that still allows some of the index columns to be used may have less effect. If table sizes are small, none of this may matter at all.
On my machine, the 30 million reads took about the same real time as the 4 reads. Apparently, the index fit entirely in the cache, and there was no noticeable performance degradation.
I didn't try the exact scenarios above because I didn't have a database sitting around that contained large amounts of the right data to do that. It is a different case and it could behave quite differently because IN is at least partially SARGable. Which is interesting, because it is a form of OR, and there may be other situations where OR can use an index.
Performance is also, obviously, highly dependent upon which indexes have been provided. If you think it might work, run it, look at the execution plan, and if it does work then for today, at least for a little while, it works. If it doesn't, add or change indexes and repeat, trying not to break anything else.
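SQLite's planner makes the same basic distinction, and its EXPLAIN QUERY PLAN output gives a cheap way to watch SARGability at work. This is only a sketch (SQL Server execution plans are the real tool here), and the index name is invented:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE patient (name TEXT, patient_type_c INTEGER)")
conn.execute("CREATE INDEX ix_patient_type ON patient (patient_type_c)")
conn.executemany(
    "INSERT INTO patient VALUES (?, ?)",
    [("A", 1), ("B", 2), ("C", 3), ("D", None)],
)

def plan(where_clause):
    """Return the query plan text for a simple filtered SELECT."""
    steps = conn.execute(
        "EXPLAIN QUERY PLAN SELECT name FROM patient WHERE " + where_clause
    ).fetchall()
    return " ".join(step[-1] for step in steps)

# A bare column comparison is SARGable: the planner can seek the index.
print(plan("patient_type_c = 3"))               # SEARCH ... USING INDEX ...
# Wrapping the column in a function hides it from the index: full scan.
print(plan("COALESCE(patient_type_c, 0) = 3"))  # SCAN ...
```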
Best approach: use the below, an anti-join via OUTER APPLY. Note the outer WHERE tests r.a, so rows with no match in (1,2), including the NULL ones, are the rows that are kept:
SELECT [all_columns_required]
FROM [yourtable] x
OUTER APPLY
(
    SELECT t.a
    FROM
    (
        SELECT 1 AS a
        UNION ALL
        SELECT 2
    ) t
    WHERE x.PATIENT_TYPE_C = t.a
) r
WHERE r.a IS NULL
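SQLite has no OUTER APPLY, but the same anti-join shape can be sketched with a LEFT JOIN against the (1, 2) derived table; rows with no match, including the NULL ones, survive the final IS NULL test:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE patient (name TEXT, patient_type_c INTEGER)")
conn.executemany(
    "INSERT INTO patient VALUES (?, ?)",
    [("A", 1), ("B", 2), ("C", 3), ("D", None)],
)

# LEFT JOIN keeps every patient row; t.a is NULL exactly when the
# patient's type failed to match 1 or 2 (NULL types never match).
rows = conn.execute("""
    SELECT p.name
    FROM patient p
    LEFT JOIN (SELECT 1 AS a UNION ALL SELECT 2) t
           ON p.patient_type_c = t.a
    WHERE t.a IS NULL
""").fetchall()
print(rows)  # [('C',), ('D',)]
```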
In a nutshell ...
NOT IN is pathetic in terms of performance,...
ASKER
Thanks everyone. I will try these when I get to a computer.
@Megan Brooks
You need to provide the specific code and test(s) you ran. An "IS NULL OR" condition does not prevent index seeks.
ASKER
Jim, you are correct. I am not worried so much about performance issues for this one because the database has thousands of tiny tables rather than large ones. I typically have to join 15 to 20 tables to get any decent data. Pretty sure this table is not indexed either.
ASKER
This worked perfectly. I am now getting data. Thank you Eric, and everyone else for your great comments and assistance! I know a WHOLE lot more about NULL values than ever.... :)
If performance isn't an issue, just use IS NULL OR, or ISNULL(<column>, <something not used>), to keep NOT IN from seeing the NULL.
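The ISNULL wrap, sketched once more in sqlite3 (SQLite spells it COALESCE; -1 stands in here for the "something not used" value):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE patient (name TEXT, patient_type_c INTEGER)")
conn.executemany(
    "INSERT INTO patient VALUES (?, ?)",
    [("A", 1), ("B", 2), ("C", 3), ("D", None)],
)

# COALESCE substitutes -1 for NULL before NOT IN ever sees it,
# so the comparison always yields TRUE or FALSE, never UNKNOWN.
rows = conn.execute("""
    SELECT name FROM patient
    WHERE COALESCE(patient_type_c, -1) NOT IN (1, 2)
""").fetchall()
print(rows)  # [('C',), ('D',)]
```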
If you do a lot of reporting on the database then you might want to consider extracting to a separate reporting database that is customized for your reporting requirements. Writing reports against a highly normalized or highly hacked-on database can be a lot of trouble, especially if the schema changes periodically, as might be the case if you are querying against a vendor-supplied system, and in particular any EHR system I have seen.