Link to home
Start Free TrialLog in
Avatar of jay-are
jay-areFlag for United States of America

asked on

Sql table missing data HELP!

Hello Experts!

One of my sql tables is doing something strange and I need help cleaning it up.  This would be an easy fix in excel if the whole table would fit!  This entire table has somewhere around 1.4 million rows I think.  Here's what the data looks like:

SRC     Date     AcctNum     RefNum     Amt
  9      4-3-05     1001             &          500
  9      4-3-05     1001             &          -500
  9      4-3-05     1001             &          100
  9      4-3-05     1001          12345       300

So the entire table goes in blocks like this.  In excel to fix this (usually the table is very small because I only import current month) I would simply reverse the tables order so the correct refnum is first, then I would write a simple formula to copy the refnum down and overwrite the "&"'s until it hits a new refnum.

Now due to the table size I can't clean this data up in excel.  So I'm kinda lost in ms sql on cleaning this.  I don't even know how to import it into sql and reverse the table order so the correct refnum starts first.  

Can this be done with a stored procedure?  I've been looking at the sql commands and I can't seem to find anything on filling down based on some constraints.  The thing is that the src and date will always be the same per "group" of data.  This table is just a journal of all accounting postings.  So the refnum should be the same but the system that uses this data knows how to copy down the refnum for each line.  
I'm lost at this point and I have to get this cleaned up as I can't link any of my reports based on refnum.

Help!
Avatar of Brian Crowe
Brian Crowe
Flag of United States of America image

Please explain the logici of how RefNum is filled again?  In the example provided is RefNum '12345' for all instances of '9'/'4-3-05'/'1001' or speaking generally is RefNum the same for all each DISTINCT SRC, Date, AcctNum?

Gonna take a stab...

UPDATE myTable
SET RefNum = tblRefNum.MaxRefNum
FROM myTable
INNER JOIN (SELECT SRC, Date, AcctNum, Max(RefNum) MaxRefNum FROM myTable) AS tblRefNum
     ON myTable.SRC = tblRefNum.SRC AND myTAble.Date = tblRefNum.Date AND myTable.AcctNum = tblRefNum.AcctNum
correction...

UPDATE myTable
SET RefNum = tblRefNum.MaxRefNum
FROM myTable
INNER JOIN (SELECT SRC, Date, AcctNum, Max(RefNum) MaxRefNum FROM myTable GROUP BY SRC, Date, AcctNum) AS tblRefNum
     ON myTable.SRC = tblRefNum.SRC AND myTAble.Date = tblRefNum.Date AND myTable.AcctNum = tblRefNum.AcctNum
Avatar of jay-are

ASKER

For some clarification the data itself.  The import I do pulls the data in just as you see it above.  So for each block of postings you'll have 1 constant date and src.  The only thing that can change is the acctnum, amount.
Here is a copy/paste from enterprise manager:

      1111      -19500      233758      PAYMENT      &      5      3/1/2004
      1111      -12385      &      PAYMENT      &      5      3/1/2004
      1111      -8685      &      PAYMENT      &      5      3/1/2004
      1111      -1885      &      PAYMENT      &      5      3/1/2004
      1111      -16585      &      PAYMENT      &      5      3/1/2004
      1111      -2885      &      PAYMENT      &      5      3/1/2004
      1111      -6085      &      PAYMENT      &      5      3/1/2004
      1111      -15085      77678      PAYMENT      &      5      3/1/2004
      1111      -16425      500521      PAYMENT      030104      5      3/1/2004
      1111      100      102792      TOWN ANI      KS012076      2      3/1/2004
      2429      -725             PARTS PHYSE0001-0104      13      1/1/2004
      1111      16885      &      GEORGIA DEALERS AUTO AUCTION      2LA03355      2      3/2/2004
      1111      6535      &      GEORGIA DEALERS AUTO AUCTION      9UJ26183      2      3/2/2004
      1111      6985      &      GEORGIA DEALERS AUTO AUCTION      YE057758      2      3/2/2004
      1111      17585      &      GEORGIA DEALERS AUTO AUCTION      3Z265973      2      3/2/2004
      1111      8285      &      GEORGIA DEALERS AUTO AUCTION      WKA76860      2      3/2/2004
      1111      9485      &      GEORGIA DEALERS AUTO AUCTION      2J038393      2      3/2/2004
      1111      10685      &      GEORGIA DEALERS AUTO AUCTION      &      2      3/2/2004
      1111      -10685      &      GEORGIA DEALERS AUTO AUCTION      9NB60332      2      3/2/2004
      1111      10685      &      GEORGIA DEALERS AUTO AUCTION      &      2      3/2/2004
      1111      -10685      77678      GEORGIA DEALERS AUTO AUCTION      9NB60332A      2      3/2/2004
      1111      -6000      &      WHSL PAYMENT W1151615      &      5      3/3/2004
      1111      -500      &      WHSL PAYMENT LU261994      &      5      3/3/2004
      1111      -800      &      WHSL PAYMENT SK183502      &      5      3/3/2004
      1111      -1700      &      WHSL PAYMENT NU086652      &      5      3/3/2004
      1111      -800      &      WHSL PAYMENT NZ214085      &      5      3/3/2004
      1111      -50      &      WHSL PAYMENT KY652201      &      5      3/3/2004
      1111      -1700      &      WHSL PAYMENT VKC44054      &      5      3/3/2004
      1111      -400      &      WHSL PAYMENT MB129744      &      5      3/3/2004
      1111      -8000      102792      WHSL PAYMENT YEA99084      030304      5      3/3/2004

Sorry it doesn't come out as well as I hoped but you get the idea of what is happening.  Usually I'd just export this table into excel, reverse it's order (just my preference so I'm working down instead of up) and write this formula to clean it up:  =IF(I11="&",L10,I11).  So I create a whole new colum called "cleanref" or something like that and fill it with that formula.  So when I come to a record that has multiple lines of "&"'s I look at the previous cleanref data for the answer.  Hopefully that makes sense!  :)

What I failed to mention in my first post was the control number column which is doing the same thing as the refnum.  
UPDATE myTable
SET RefNum = tblFiller.MaxRefNum, [control number] = tblFiller.MaxControlNumber
FROM myTable
INNER JOIN (SELECT AcctNum, Max(RefNum) MaxRefNum, Max([control number]) AS MaxControlNumber
     FROM myTable GROUP BY AcctNum) AS tblFiller
     ON myTable.AcctNum = tblFiller.AcctNum
my bad...that won't work.  It's starting to look like you're going to need a CURSOR because there is no logical means to group the records with '&' with the corresponding record with a value
Do you have an IDENTITY column on your current table?  If not, create a new table the same as the current one except that it includes an IDENTITY column:

SELECT IDENTITY(INT, 1, 1) AS ident, oldTable.*
INTO newTable
FROM oldTable

ALTER TABLE newTable
ADD CONSTRAINT newTable_Ix1
UNIQUE CLUSTERED (ident)


Then you can update the missing values like so:

UPDATE newTable
SET RefNum = (
    SELECT Refnum
    FROM newTable nt2
    WHERE nt2.ident = (
        SELECT MIN(ident)
        FROM newTable nt3
        WHERE nt3.ident > nt.ident
            AND Refnum <> '&'
        )
    )
WHERE refNum = '&'

And again, substituting AcctNum for RefNum.
Finally, if the original table has no indexes/fks/checks/constraints/etc. that are difficult to re-create, simply drop the original table, drop the ident column on the new table, then re-name the new table to the old table name.

If the orig table has too many indexes/fks/checks/constraints/etc., then just TRUNCATE TABLE the original table (if possible; use DELETE FROM if TRUNCATE not allowed), then insert all rows in the newTable to the oldTable, then delete the new table.


NOTE: Technically there needs to be an IDENTITY column on the original table when it is loaded; otherwise you can't be sure be sure that SQL will later read/copy the rows in the same order as originally loaded.
Avatar of jay-are

ASKER

Sorry, I do have an identity column called IDS.ID, it's incremental.  
ASKER CERTIFIED SOLUTION
Avatar of Scott Pletcher
Scott Pletcher
Flag of United States of America image

Link to home
membership
This solution is only available to members.
To access this solution, you must be a member of Experts Exchange.
Start Free Trial
Avatar of jay-are

ASKER

Scott:

Looks good, I'm running this in query analyzer right now to test it.  So far it's up to 14 minutes.  Will let you know once it finishes!
Avatar of jay-are

ASKER

Ok this is close.  I had to kill the update that was running on the main table because it was up to 55 minutes and it was taking up 100% cpu usage!  So I shortened the table to just a few entries to test on.  Here is the test data before the update:

account      amount      contnum      desc      refnum      src      date      ids.id
1111      -19500      233758      PAYMENT      &      5      3/1/2004      
1111      -12385      &      PAYMENT      &      5      3/1/2004      
1111      -8685      &      PAYMENT      &      5      3/1/2004      
1111      -1885      &      PAYMENT      &      5      3/1/2004      
1111      -16585      &      PAYMENT      &      5      3/1/2004      
1111      -2885      &      PAYMENT      &      5      3/1/2004      
1111      -6085      &      PAYMENT      &      5      3/1/2004      
1111      -15085      77678      PAYMENT      &      5      3/1/2004      
1111      -16425      500521      PAYMENT      30104      5      3/1/2004      
1111      100      102792      TOWN AN KS012076      2      3/1/2004      
2429      -725             PARTS PH      SE0001-0104      13      1/1/2004      
1111      16885      &      GEORGIA       2LA03355      2      3/2/2004      
1111      6535      &      GEORGIA       9UJ26183      2      3/2/2004      
1111      6985      &      GEORGIA       YE057758      2      3/2/2004      
1111      17585      &      GEORGIA       3Z265973      2      3/2/2004      
1111      8285      &      GEORGIA       WKA76860      2      3/2/2004
1111      9485      &      GEORGIA       2J038393      2      3/2/2004
1111      10685      &      GEORGIA       &      2      3/2/2004
1111      -10685      &      GEORGIA       9NB60332      2      3/2/2004
1111      10685      &      GEORGIA       &      2      3/2/2004
1111      -10685      77678      GEORGIA       9NB60332A      2      3/2/2004
1111      -6000      &      WHSL PAY      &      5      3/3/2004
1111      -500      &      WHSL PAY      &      5      3/3/2004
1111      -800      &      WHSL PAY      &      5      3/3/2004
1111      -1700      &      WHSL PAY      &      5      3/3/2004
1111      -800      &      WHSL PAY      &      5      3/3/2004
1111      -50      &      WHSL PAY &      5      3/3/2004
1111      -1700      &      WHSL PAY &      5      3/3/2004
1111      -400      &      WHSL PAY &      5      3/3/2004
1111      -8000      102792      WHSL PAY 30304      5      3/3/2004

After running this update:

UPDATE jrnlhistory_test
SET RefNum = (
    SELECT Refnum
    FROM jrnlhistory_test yt2
    WHERE yt2.[ids.id] = (
        SELECT MIN([ids.id])
        FROM jrnlhistory_test yt3
        WHERE yt3.[ids.id] > jrnlhistory_test.[ids.id]
            AND yt3.Refnum <> '&'
        )
    )
WHERE refNum = '&'

I got this date:

acct      amt      ctrlnum      desc      refnum      src      date      id
1111      -19500      NULL      PAYMENT      KS012076      5      3/1/2004 0:00      1
1111      -12385      &      PAYMENT      KS012076      5      3/1/2004 0:00      2
1111      -8685      &      PAYMENT      KS012076      5      3/1/2004 0:00      3
1111      -1885      &      PAYMENT      KS012076      5      3/1/2004 0:00      4
1111      -16585      &      PAYMENT      KS012076      5      3/1/2004 0:00      5
1111      -2885      &      PAYMENT      KS012076      5      3/1/2004 0:00      6
1111      -6085      &      PAYMENT      KS012076      5      3/1/2004 0:00      7
1111      -15085      NULL      PAYMENT      KS012076      5      3/1/2004 0:00      8
1111      -16425      NULL      PAYMENT      NULL      5      3/1/2004 0:00      9
1111      100      NULL      TOWN AN      KS012076      2      3/1/2004 0:00      10
2429      -725             PARTS P      SE0001-0104      13      1/1/2004 0:00      11
1111      16885      &      GEORGIA      2LA03355      2      3/2/2004 0:00      12
1111      6535      &      GEORGIA      9UJ26183      2      3/2/2004 0:00      13
1111      6985      &      GEORGIA      YE057758      2      3/2/2004 0:00      14
1111      17585      &      GEORGIA      3Z265973      2      3/2/2004 0:00      15
1111      8285      &      GEORGIA      WKA76860      2      3/2/2004 0:00      16
1111      9485      &      GEORGIA      2J038393      2      3/2/2004 0:00      17
1111      10685      &      GEORGIA      9NB60332      2      3/2/2004 0:00      18
1111      -10685      &      GEORGIA      9NB60332      2      3/2/2004 0:00      19
1111      10685      &      GEORGIA      9NB60332A      2      3/2/2004 0:00      20
1111      -10685      NULL      GEORGIA      9NB60332A      2      3/2/2004 0:00      21
1111      -6000      &      WHSL PA      NULL      5      3/3/2004 0:00      22
1111      -500      &      WHSL PA      NULL      5      3/3/2004 0:00      23
1111      -800      &      WHSL PA      NULL      5      3/3/2004 0:00      24
1111      -1700      &      WHSL PA      NULL      5      3/3/2004 0:00      25
1111      -800      &      WHSL PA      NULL      5      3/3/2004 0:00      26
1111      -50      &      WHSL PA      NULL      5      3/3/2004 0:00      27
1111      -1700      &      WHSL PA      NULL      5      3/3/2004 0:00      28
1111      -400      &      WHSL PA      NULL      5      3/3/2004 0:00      29
1111      -8000      NULL      WHSL PA      NULL      5      3/3/2004 0:00      30

Did I mess up the update?  
Avatar of jay-are

ASKER

Wait, that was my fault on the import.  I put the data in an excel sheet then imported it into sql and it was right aligning some rows which caused something to mess up in sql.

I did it via csv with the data aligned the same way and it works fine!

Thanks!!
An index on id (or perhaps (id, refnum), although that will involve a lot of index updates) might help the performance considerably, if SQL uses the index.

Also, since you will be replacing "&" with a longer string, you should make the id a clustered index and set the FILLFACTOR to, say, about 70% to allow free space for the rows to expand without having to be moved to another page.  That will also help tremendously with performance.  When done with all updates, change the fillfactor to 90/95 and rebuild the clustered index, to compact the rows.
Avatar of jay-are

ASKER

One last question about this update.

I only have to run this once when I initially import this history file.
I will run this same update for both the refnum column and the ctrlnum column.  The table is over 1.4 million records long!  

When I first ran the update I ran it on the entire table and it got up to 55 minutes or so and I had to just kill the query because it basically shut the sql server down.  No one could access the server period while it was running this query.
After I stopped the query I took a look at the table and it looked like no rows had been affected.  How long will it take to actually complete this update?  Should I just start it at 2 a.m. and hope it finishes before everyone arrives in the morning?  Anyway to speed it up?  Maybe lower the cpu usage sql gets?

Thanks!
Use the index and FILLFACTOR as noted above.  That should speed it up.

Btw, when you cancel it, you won't see any of the changes because SQL will roll them back out :-)
Avatar of jay-are

ASKER

Just a follow up with this solution:

Scott, I indexed based on just the ID field.  I also set it's fillfactor to 90.

BIG help!  At first the update would simply run forever.  Now it cleans that whole table in 13 seconds!  :)

Thanks for your help!
Thanks for the follow up!  And I'm very glad that you finally got a good result.