Robb Hill asked:

Help with SQL Query

I have a table that I am trying to pull some data out of. Here is what I am trying to do.

I have a table...

There is a primary key of int. There is a foreign key called docid; docid can be repeated. And there is a date-created field.

docid is a varchar and is in the following format:

it always begins with DOC + some integer;
the integer will be either 7 or 8 digits.

so doc1234567

or doc12345678

Let's say I had in the table, on record 10, doc1000000, which could translate to the millionth document.

Then the next day a record was created whose doc value was doc2000000, which would mean the number incremented by a million documents.

This basically would mean an error in how the inserts happened.

I need a query that can detect this type of situation, and I am talking about millions of rows.

So I would want to capture the date of when any jump like this happened, and what the before and after values were, or how much it incremented by.

It may also be valuable to know a count of all the duplicate docids in this table, as that number could indicate an issue. I know how to get this but am not sure how to put it all together in one big query.

a count of dups would be simple..

select count(docid) as cntdocid, docid from [table]
group by docid
having count(docid) > 1
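
For reference, a windowed variant attaches the same duplicate count to every row instead of collapsing them (a sketch against the placeholder table above; the combined query further down the thread uses the same idea):

select *
     , count(*) over (partition by docid) as cntdocid  -- > 1 means the docid is duplicated
from [table];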


Please help
PortletPaul

Does docid get updated for any reason?
Can rows be deleted from this table?

As there are millions of rows, please identify what indexes exist on that table.
Robb Hill (ASKER)

Yes, there could be deletes and updates.

An update would not be normal; it would be due to some data-cleanup type scenario. I can't think of any reason an update happening would matter for the purpose of this query.

Deletes could happen, so a gap from a delete would be expected.

I can't get into indexing at this time as it might be different from one version of this table to another.

Can we just assume indexes are as they should be?

I would assume the obvious on the primary key, and a nonclustered index on the docid. I doubt date is indexed.
My concern with the indexing is that you really may have performance issues. Having no index on that date column may be especially bad.

My advice would be to have a small but representative table available, which mimics the indexing on the actual table, so you can inspect execution plans and do testing without waiting and waiting.
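
If that date column really is unindexed, something along these lines might help the ORDER BY inside the window function below (a sketch; the index name is illustrative, and the table/column names are the ones from the test script):

CREATE NONCLUSTERED INDEX IX_Table1_date_created
    ON Table1 (date_created, ID)   -- matches the OVER(ORDER BY date_created, id) used below
    INCLUDE (docid);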

Here is one query approach, which requires use of LAG() OVER() [and hence my concern over the indexing]:

CREATE TABLE Table1
    (   ID int NOT NULL IDENTITY (1,1) PRIMARY KEY
      , [docid] varchar(11)
      , [date_created] datetime
      , INDEX IX_docid NONCLUSTERED (docid)
    )
;
    
INSERT INTO Table1
    ([docid], [date_created])
VALUES
    ('doc1234567', '2015-01-01 00:00:00'),
    ('doc1234568', '2015-02-01 00:00:00'),
    ('doc1000000', '2015-03-01 00:00:00')
;


select
    *
from (
    -- expected_docid = the previous row's docid + 1, carried forward by LAG()
    select
        id, date_created, docid
      , lag(next_docid) over(order by date_created, id) expected_docid
    from (
        -- next_docid = this row's numeric part + 1, re-prefixed with 'doc'
        select
            *, 'doc' + convert(varchar(20), cj.doc_int) next_docid
        from Table1
        cross apply (select try_convert(int, replace(docid, 'doc', '')) + 1 doc_int) cj
        ) x
    ) xx
where docid <> expected_docid   -- keep rows where the actual docid differs from the expected one
order by id
;

+----+----+---------------------+------------+----------------+
|    | id |    date_created     |   docid    | expected_docid |
+----+----+---------------------+------------+----------------+
|  1 |  3 | 01.03.2015 00:00:00 | doc1000000 | doc1234569     |
+----+----+---------------------+------------+----------------+


see it working here: http://rextester.com/TCFN69548
So here is the output I am getting; the query performs quite well, so indexing looks good.

But I am not completely following the logic or output. Here are the first 20 rows returned.
I renamed some of the column names; ndocument would be id, and the others should be self-explanatory.

ndocument   created                   documentid   expected_documentid
2050        2013-08-13 15:47:28.000   DOC0001755   doc1755
2051        2013-08-13 15:47:29.000   DOC0001756   doc1756
2052        2013-08-13 15:48:47.000   DOC0001757   doc1757
2053        2013-08-15 16:00:52.000   DOC0001758   doc1269649
7324        2013-02-18 00:00:00.000   DOC0007419   doc6163463
7325        2013-02-18 00:00:00.000   DOC0007420   doc7420
7326        2013-02-18 00:00:00.000   DOC0007421   doc7421
7327        2013-02-18 00:00:00.000   DOC0007422   doc7422
7328        2013-02-18 00:00:00.000   DOC0007423   doc7423
Due to the fact that there has been some data cleanup, here are some other assumptions to take into account, or assumptions to remove.

The primary key would be indicative of the order in which the file was originally input into the system.
The docid will not always line up in the order in which it was created, due to data manipulations; this is just due to maintenance over time. Not best db practice, but I cannot change that now.

With that being said, you could have a very new document with a lower docid...

The main thing to catch here would be jumps in the docid.

Once again, they will be doc + 7 or 8 integers.

If there are large gaps between these integers I need to identify them and the date,

as that could be an indication of someone doing some type of insert (the wrong way).

I need to catch this.

I hope this helps.
I have not used that function before, by the way. Pretty cool.
OK, I didn't know the docid was zero-padded, which is why you got so many returned rows.

As performance is OK, try this, which also tests for duplicates (if you need this in the same query):
select
    *
from (
    select
        *
      , lead(docid) over(order by id) next_docid          -- docid of the next row, in id order
      , count(*) over(partition by docid) is_dup_ifover1  -- how many rows share this docid
    from Table1
    ) d
cross apply (
    select
        try_convert(int, replace(docid, 'doc', '')) docint
      , try_convert(int, replace(next_docid, 'doc', '')) next_docint
    ) cj
where abs(next_docint - docint) > 6   -- adjacent docids differ by more than the threshold
or is_dup_ifover1 > 1                 -- or the docid appears more than once
;


You can alter that 6 in the where clause to suit.
Can you explain how this is working? I'm not sure I'm following how this gets the info.

I probably need to take out the select *; there are a lot of columns on this table.

The query is still running at 8 minutes; it took about 5 minutes before it returned data.

But outside of that, can you please explain?
I wouldn't want it to filter out a duplicate from the result, just a count of the number of dups. Having dups is normal here, but it can also be a sign that something is wrong.

It's not as important, just so long as we are not filtering out any dups.
Robb, hopefully you have run the query and figured out what it is showing.

Whenever I use "select *" it is because I have no idea what you want to see, and hence that is up to you.

The most recent query I proposed does 2 things:
a. looks for "jumps" in the docid, where in adjacent rows the docid numbers differ by more than 6
b. if a docid is referenced in more than 1 row, it will list each of those rows

b. is optional; just remove the "count(*) over(partition by docid)" line and the "or is_dup_ifover1 > 1" condition from my most recent query.
OK, I changed the 6 before running, not realizing what it did; that is why I got so many returns.

So the 6 is the number I want to define my jump by, so to speak.

So if I wanted to only see numbers that jumped by 1000, that is where I would change the number.
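
For example, presumably only the threshold in the predicate changes (a sketch against the query above):

where abs(next_docint - docint) > 1000   -- flag only jumps of more than 1000
or is_dup_ifover1 > 1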


One other annoying thing, but nothing I can do about it: the SQL I am running has to use compatibility level 100.

These databases are between 2008 and 2016, but our code is only certified at that compatibility level.

Meaning the try_convert is not usable.

I was really excited about that one..lol.....and then I was told I could not raise the level at this time.

Can we get the same results another way?

This has been a good exercise for me, as I have never used lag or lead before. Not sure that I ever needed to. So glad something new came into this question.

Thanks again for your help.
ASKER CERTIFIED SOLUTION
PortletPaul
(The accepted solution text is available to Experts Exchange members only.)
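
Judging by the fragment PortletPaul quotes a few posts below, the accepted query presumably swaps each TRY_CONVERT for an ISNUMERIC-guarded CONVERT so it can run at compatibility level 100. A sketch, assuming the same table and column names as the query above:

select
    *
from (
    select
        *
      , lead(docid) over(order by id) next_docid
      , count(*) over(partition by docid) is_dup_ifover1
    from Table1
    ) d
cross apply (
    select
        case when isnumeric(replace(docid, 'doc', '')) = 1
             then convert(int, replace(docid, 'doc', '')) else null end docint
      , case when isnumeric(replace(next_docid, 'doc', '')) = 1
             then convert(int, replace(next_docid, 'doc', '')) else null end next_docint
    ) cj
where abs(next_docint - docint) > 6
or is_dup_ifover1 > 1
;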
This version is giving the following error:

Conversion failed when converting the varchar value 'DOC0000112' to data type int.
As I warned might happen, I am unable to guarantee convert() or isnumeric() will always work.

Try_convert() is more reliable in that regard.
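
Some classic ISNUMERIC false positives illustrate why (a sketch; each expression returns 1, yet a plain CONVERT to int on the same string fails):

SELECT ISNUMERIC('1e4')   AS sci_notation  -- 1, but CONVERT(int, '1e4') errors
     , ISNUMERIC('$')     AS lone_currency -- 1, a currency symbol on its own
     , ISNUMERIC('12,34') AS with_comma;   -- 1, but CONVERT(int, '12,34') errors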


Not sure why it failed on that data. Is the collation case sensitive or case insensitive? Perhaps the replace did not work?

As I cannot access your data, I can't do much more...
Well, we are doing a test on a varchar...

so it's testing for isnumeric; not sure how that would work, considering the data type.

isnumeric would be false, I would think.

Then we are comparing on a convert to integer, which is not an integer if it has DOC concatenated; it's varchar.

The data in that column is always the following:

It's varchar, and it will be a varchar(20), sometimes varchar(10).

Either way, the value in the field will be DOC123456 or DOC1234567.
Robb,

IF you are going to report errors then I can only assume you are using the most recent query I proposed, without changes. If there are any changes then please add the changed query along with the error.

In my most recent query I use ISNUMERIC after a replace, so DOC12345 after replace of DOC should be 12345, and ISNUMERIC(12345) is true.

This occurs inside the CROSS APPLY (see below):
cross apply (
    select
        case when isnumeric(replace(docid,'doc','')) = 1 then  convert(int,replace(docid,'doc','')) else NULL end docint
      , case when isnumeric(replace(next_docid,'doc',''))=1 then convert(int,replace(next_docid,'doc','')) else NULL end next_docint
    ) cj

So try using uppercase DOC instead of lowercase; if that makes a difference then your collation is probably case sensitive.
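
A quick way to see what a case-sensitive collation does to the replace (a sketch using the common Latin1_General collations):

SELECT REPLACE('DOC123' COLLATE Latin1_General_CS_AS, 'doc', '') AS cs_result  -- 'DOC123': 'doc' never matches
     , REPLACE('DOC123' COLLATE Latin1_General_CI_AS, 'doc', '') AS ci_result; -- '123'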
I am thinking it's the leading zeros...

doc or DOC doesn't matter in this case.

The error message is:
Conversion failed when converting the varchar value 'DOC0000001' to data type int.

I'm thinking it's not evaluating 1 but 00000001.
Robb,

select ISNUMERIC('DOC000001'), ISNUMERIC(replace('DOC000001','DOC',''))

that returns 0, 1

Leading zeros do NOT invalidate the isnumeric test.

Something else is the problem, most likely the REPLACE.

Did you try UPPER case as I discussed before?

Please paste in the EXACT query you are using that produces the error.
My fault...

I had the math wrong:
where abs(next_docint - docint) > 6

I had accidentally changed next_docint to a varchar value, so the math here was failing.
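
That matches the error: with one side of the subtraction still a raw varchar, SQL Server implicitly converts it to int (int outranks varchar in type precedence) and the 'DOC...' string blows up. A minimal reproduction (a sketch):

SELECT ABS('DOC0000112' - 1);
-- Msg 245: Conversion failed when converting the varchar value 'DOC0000112' to data type int.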
:) Yep, that would do it...

So we are good to go now?
I think so.  

Thanks for your help!!
Awesome!
A closed question is a good question (hint).
Thanks so much!