• Status: Solved
  • Priority: Medium
  • Security: Public
  • Views: 548
  • Last Modified:

25 million row table problem for a select

We have a table here with 25 million rows or more.

We'd like to extract one month's worth of data from the table into another table that can be worked on without impacting the live system.

The current process is to bcp out the whole table and bcp back into a copy of the table, then run an insert into a temp table using a sproc to get one month's worth of data back only.
This copy of the table has a nonclustered index built on it once the main table's data has been refreshed into it.
Obviously long-winded.

What suggestions in SQL code are there to do it a better way?
The table has the following format.

The large table has the nonclustered index shown below.
No indexes can be created on this application table.
CREATE NONCLUSTERED INDEX [IX_DOCHSTRY_NUMVER] ON [MHGROUP].[DOCHISTORY]
(
      [DOCNUM] ASC,
      [VERSION] ASC
)WITH (SORT_IN_TEMPDB = OFF, DROP_EXISTING = OFF, IGNORE_DUP_KEY = OFF, ONLINE = OFF) ON [PRIMARY]

I've included the table definition below; we are only interested in the fields
docuser
docnum
duration

Any suggestions please, as complete as possible.
Please ask for clarification.





CREATE TABLE [MHGROUP].[DOCHISTORY](
	[DOCNUM] [float] NULL,
	[VERSION] [int] NULL,
	[ACTIVITY] [varchar](254) COLLATE Latin1_General_CI_AS NULL,
	[ACTIVITY_CODE] [int] NULL,
	[ACTIVITY_DATETIME] [datetime] NULL,
	[DURATION] [int] NULL,
	[PAGES_PRINTED] [int] NULL,
	[NUM1] [float] NULL,
	[NUM2] [float] NULL,
	[NUM3] [float] NULL,
	[DATA1] [varchar](254) COLLATE Latin1_General_CI_AS NULL,
	[DATA2] [varchar](254) COLLATE Latin1_General_CI_AS NULL,
	[DOCUSER] [varchar](64) COLLATE Latin1_General_CI_AS NULL,
	[APPNAME] [varchar](32) COLLATE Latin1_General_CI_AS NULL,
	[LOCATION] [varchar](32) COLLATE Latin1_General_CI_AS NULL,
	[COMMENTS] [text] COLLATE Latin1_General_CI_AS NULL
) ON [PRIMARY] TEXTIMAGE_ON [PRIMARY]

Asked by: mooriginal

1 Solution
 
chapmandewCommented:
On your big table, does the date field that you use to segment the data have an index on it? You could just do something like this:

insert into OneMonthsWorth (fieldvalues)
select fieldvalues
from bigtable with (nolock)
where datefield >= '20080101'
  and datefield < '20080201'
 
Cedric_DCommented:
I have a few questions:

1) Why bcp out/bcp in, instead of a simple insert into BACKUP_SERVER.DB..TBL select ... from LIVE_SERVER.DB..TBL with (nolock)?

2) It seems very strange to use a float type for an identity-style column. Any operation will cause costly conversions using the FPU.

3) There is no PRIMARY KEY. This is acceptable, but not good technique; it can lead to accidental insertion of two identical rows.

4) I cannot find a DATETIME column. How do you determine the last month's data? Or did I miss something?

5) Since this is a HISTORY-like table and you mostly retrieve data by time, it would be wise to create a clustered index on the date - then retrieving any small time segment will be fast, however large the table.
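On the copy table (where indexes are allowed), that last point might look like the sketch below; the index name is illustrative, not from the thread:

```sql
-- Sketch only: cluster the working copy on the datetime column so a
-- one-month range scan reads a contiguous block of pages.
CREATE CLUSTERED INDEX [IX_DOCHISTORY_COPY_ACTDT]
ON [dbo].[DOCHISTORY_Copy] ([ACTIVITY_DATETIME] ASC);
```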
 
chapmandewCommented:
[ACTIVITY_DATETIME] [datetime] NULL,

Clustered index on the date field is the best approach, I agree.
 
Mark WillsTopic AdvisorCommented:


On the big table, what about a "covering" index using Activity_Datetime (assuming that is where you will get the "one month"), and then have included in that index:
docuser
docnum
duration

That way, all your access should be at the index level without having to go to the data rows themselves; it avoids any locking conflict, and should make it quicker to then do a direct select from the "big" table into the monthly one, like chapmandew's first post.

The clustered index will take some time to establish, and if it is not overly unique there is a lot of hashing going on - unlike chapmandew's last post :)
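A sketch of the covering index described here - keyed on the range column, with the three report columns carried in the leaf via INCLUDE (the index name is made up for illustration):

```sql
-- Sketch: the query can be answered entirely from the index leaf pages,
-- never touching the base table's data rows.
CREATE NONCLUSTERED INDEX [IX_DOCHISTORY_ACTDT_COVER]
ON [MHGROUP].[DOCHISTORY] ([ACTIVITY_DATETIME] ASC)
INCLUDE ([DOCUSER], [DOCNUM], [DURATION]);
```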
 
chapmandewCommented:
>> I've included the table below and we are only interested in fields
docuser
docnum
duration

The covering index approach would work if and only if these are the only fields used at all in the query.
 
Cedric_DCommented:
However, you would be better making [ACTIVITY_DATETIME] NOT NULL.
 
Mark WillsTopic AdvisorCommented:
Well, the query would still work, but nowhere near as efficiently as just using the index pages rather than the data rows. And I do apologise for just blasting in on this thread... I'm not discrediting the clustered index as such, I just think there is a "lighter" method considering the very restricted output requirement...

If there was going to be a clustered index, I would have thought that docnum, or the introduction of an identity field, would be the likely candidate (there can only be one), because they would "seem" less likely to change, unique, and a natural primary key. With that in place, you can then build other indexes, which use the primary key as a row pointer - that would improve overall performance. But I am guessing we are not seeing the full structure...

Oh, and I agree that float is not a good identity field - but I do not see one being used; in fact I do not see an identity column anywhere. Decimal would be better if big numbers are needed, but at 25 million rows it won't be overly quick to rebuild.
 
mooriginalAuthor Commented:
I'll try and answer as I get through the posts.

1. Can't add any indexes or make any changes at all to the table, as it is an application table and doing so would invalidate the support we get from the vendor...
I can add any indexes I like, or do whatever, to the temp or copy tables generated as part of the solution.

2. Doing an insert into another table straight from the live table was taking too long - so the route of taking all processing out to another table was chosen, to lessen any impact on the live table, even if done as out-of-hours processing.
[Of course this can be changed if considered unnecessary.]

3. DOCHISTORY_COPY - the full bcp copy of DOCHISTORY - has the index below on it.
This index is used to speed up the sproc below, which does the processing into a smaller table for running reports:
SELECT
    DocUser       = DOCUSER,
    Docnum        = DOCNUM,
    Duration      = SUM(DURATION),
    Activity_Date = ACTIVITY_DATETIME
FROM DBA.dbo.DOCHISTORY_COPY
WHERE ACTIVITY = 'checkin'
  AND ACTIVITY_DATETIME >= CONVERT(DATETIME, CONVERT(VARCHAR(8), DATEADD(MONTH, -1, GETDATE()), 120) + '01', 120)
  AND ACTIVITY_DATETIME <  CONVERT(DATETIME, CONVERT(VARCHAR(8), GETDATE(), 120) + '01', 120)
GROUP BY DOCNUM, ACTIVITY_DATETIME, DOCUSER
ORDER BY Activity_Date

CREATE NONCLUSTERED INDEX [Index_MA1] ON [dbo].[DOCHISTORY_Copy] 
(
	[DOCUSER] ASC,
	[ACTIVITY] ASC,
	[ACTIVITY_DATETIME] ASC,
	[DOCNUM] ASC
)
INCLUDE ( [DURATION]) WITH (SORT_IN_TEMPDB = OFF, DROP_EXISTING = OFF, IGNORE_DUP_KEY = OFF, ONLINE = OFF) ON [PRIMARY]
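As an aside, the string-conversion date arithmetic in that WHERE clause can be expressed without any VARCHAR round-trip. A sketch (0 here is the datetime epoch, 1900-01-01):

```sql
-- Month boundaries via integer month arithmetic rather than string conversion.
DECLARE @StartDate datetime
DECLARE @EndDate   datetime
-- Whole months from the epoch to now, minus one, re-anchored to the epoch:
-- the first day of the previous month.
SET @StartDate = DATEADD(MONTH, DATEDIFF(MONTH, 0, GETDATE()) - 1, 0)
-- First day of the current month: use as an exclusive upper bound.
SET @EndDate   = DATEADD(MONTH, DATEDIFF(MONTH, 0, GETDATE()), 0)
-- Then filter with a half-open range:
--   WHERE ACTIVITY_DATETIME >= @StartDate AND ACTIVITY_DATETIME < @EndDate
```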

 
Mark WillsTopic AdvisorCommented:
Point 1 is just a small snag then...

I find it a bit strange that an "insert (...) select a couple of columns from bigtable" was taking so long... considering the alternative is exporting the entire table, loading it, then extracting. Or was that insert going to something other than the "temp" or "copy" tables?

Indexes should match the ordinal position of the query's predicates - if selecting activity = 'checkin' is the first port of call, then move it up in the index... in fact, you could simply move docuser down the list.

But I'm just a bit confused now... Is there a speed/performance problem with the temp/copy table, or with getting the data from the original table? Or are you not worried about the original table, just the copy?
 
mooriginalAuthor Commented:
The SQL below is being executed on the DOCHISTORY_COPY table, because when it was being run on the live table - i.e. without all these temp tables etc. - it was using up a lot of resources.

So the decision was to do it some other way, where the insert into the worktable Secretary_Report would not be taken from live if possible...

The idea then is to suggest either:
1. nope, just do it from the live table, it will work -
or
2. some other way than bcp'ing the whole table.

CREATE PROCEDURE [dbo].[usp_Sec_Report]
AS
INSERT INTO Secretary_Report (DocUser, Docnum, Duration)
SELECT
    DocUser  = DOCUSER,
    Docnum   = DOCNUM,
    Duration = SUM(DURATION)
FROM DBA.dbo.DOCHISTORY_COPY
WHERE ACTIVITY = 'checkin'
  AND ACTIVITY_DATETIME >= CONVERT(DATETIME, CONVERT(VARCHAR(8), DATEADD(MONTH, -1, GETDATE()), 120) + '01', 120)
  AND ACTIVITY_DATETIME <  CONVERT(DATETIME, CONVERT(VARCHAR(8), GETDATE(), 120) + '01', 120)
GROUP BY DOCNUM, DOCUSER

 
Cedric_DCommented:
Well, test adding the following clustered index on DOCHISTORY_COPY:

CREATE CLUSTERED INDEX [Index_C] ON [dbo].[DOCHISTORY_Copy]
(
        [ACTIVITY_DATETIME] ASC,
        [ACTIVITY] ASC,
        [DOCNUM] ASC,
        [DOCUSER] ASC
)
WITH (SORT_IN_TEMPDB = OFF, DROP_EXISTING = OFF, IGNORE_DUP_KEY = OFF, ONLINE = OFF) ON [PRIMARY]

(Note: a clustered index cannot take an INCLUDE clause - it already carries every column, [DURATION] included, at the leaf level.)

and/or following nonclustered:

CREATE NONCLUSTERED INDEX [Index_MA2] ON [dbo].[DOCHISTORY_Copy]
(
        [ACTIVITY] ASC,
        [DOCNUM] ASC,
        [DOCUSER] ASC,
        [ACTIVITY_DATETIME] ASC
)
INCLUDE ( [DURATION]) WITH (SORT_IN_TEMPDB = OFF, DROP_EXISTING = OFF, IGNORE_DUP_KEY = OFF, ONLINE = OFF) ON [PRIMARY]


Which index SQL Server will select depends on the actual data.

I repeat: for history-like tables, a clustered index that begins with the datetime column is almost always the most efficient way to retrieve a specific small period of time from the whole history.
Clustered indices are effective on ranges, while nonclustered ones are effective on specific filters. If filtering by Activity is very selective (for example, it excludes >90% of the data, i.e. many other Activity types exist), then my second, nonclustered index should be chosen. (And it will be chosen automatically by SQL Server.)
 
Mark WillsTopic AdvisorCommented:
Would like to think that the live table would work OK. The GROUP BY might be a bit heavy, though - possibly easier to do that on the result set, not the input...

So, how about something like:

CREATE PROCEDURE [dbo].[usp_Sec_Report]
AS

DECLARE @StartDate datetime
DECLARE @EndDate datetime
-- first day of the previous month
set @startdate = convert(varchar(6),dateadd(mm,-1,getdate()),112)+'01'
-- first day of the current month, used as an exclusive upper bound
set @enddate = convert(datetime,convert(varchar(6),getdate(),112)+'01')

SELECT DocUser, Docnum, Duration
INTO   #tmp
FROM   dochistory --DBA.dbo.DOCHISTORY_COPY
WHERE  ACTIVITY = 'checkin'
  AND  ACTIVITY_DATETIME >= @startdate
  AND  ACTIVITY_DATETIME <  @enddate
 

INSERT INTO Secretary_Report (DocUser, Docnum, Duration)
select docuser,docnum,sum(duration)  from #tmp
GROUP BY DOCNUM, DOCUSER

GO
 
Mark WillsTopic AdvisorCommented:
Oh, and that FROM DOCHISTORY should be FROM DOCHISTORY WITH (NOLOCK)

and you might need to check for dupes on insert or updates into secretary_report depending on when that table is created...
 
mooriginalAuthor Commented:
Thanks guys for the ideas.
I'll give this a quick try later on today and see how long it takes to come back.
I may schedule this to run against the live table this evening and see how long it takes...

 
Mark WillsTopic AdvisorCommented:
yep, sounds like a plan...
 
mooriginalAuthor Commented:
That worked OK and got results back.
I think I'll bite the bullet and run it out of hours, direct from live.
 
Mark WillsTopic AdvisorCommented:
Oh-oh... is that a good thing? Just kidding - thought you were going to run it against live...

And as far as third-party packages are concerned, adding non-restrictive indexes should not be a problem. I would submit the script to their support department and see if they object, and if so why. So long as you have the script, and are in no way dependent on them maintaining the index, you could even create it inside the stored procedure and drop it afterwards.

The index would be :

CREATE INDEX [IDX_DOCHISTORY_ACTIVITY_ACTIVITY_DATETIME] ON [MHGROUP].[DOCHISTORY]
(
        [ACTIVITY],
        [ACTIVITY_DATETIME]
)
INCLUDE ( [DOCNUM], [DOCUSER], [DURATION] ) ;
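In procedure form, that create-then-drop pattern might be sketched like this (the surrounding procedure body is only indicated, not reproduced):

```sql
-- Sketch: build the index just for the duration of the extract, then
-- drop it so the vendor's schema is left exactly as it was found.
CREATE INDEX [IDX_DOCHISTORY_ACTIVITY_ACTIVITY_DATETIME] ON [MHGROUP].[DOCHISTORY]
    ([ACTIVITY], [ACTIVITY_DATETIME])
    INCLUDE ([DOCNUM], [DOCUSER], [DURATION]);

-- ... run the INSERT INTO Secretary_Report ... SELECT here ...

DROP INDEX [IDX_DOCHISTORY_ACTIVITY_ACTIVITY_DATETIME] ON [MHGROUP].[DOCHISTORY];
```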


 
mooriginalAuthor Commented:
Cheers.
 
Mark WillsTopic AdvisorCommented:
Just saw you accepted an answer while I was typing up the above. I was expecting to hear back after tonight's after-hours run... So, thank you very much!
