asked on

Fastest way to copy rows from one table to another

I have a database with a table that has 8 million rows in it, table was badly designed with no primary key not an identity column, I have some dat values that are entered into a varchar column, I need to query this table for records between certain years. The entries in the db at this point for the datatime are entered as a varchar with the following format 'yyyyMMddhhmm: no seconds and no seperators '200909301725' instead of '2009-09-30 17:25",
My question is how can I insert a primart key, and an indentity column and convert the values of what is supposed to be datetime ti an actual datetime and what do I need to index inorder to have the query exccecute fast enough when having a condition on a year where date>'2009-01-01T00:00:00' , pleasde note that each year has about 500,000 records that will be retrieved.

Thank you

the_bill

i would use DTS to import to a new table that has an identity column, then use a function to convert the string to a date

xav056

ASKER

should I not convert the strring to date before importing, I need to change the column type from varchar(20) to datetime, I do not thing a simple alter function would do the trick as the data in the column is not properly formatted,

ASKER CERTIFIED SOLUTION

dqmq

membership

This solution is only available to members.

To access this solution, you must be a member of Experts Exchange.

Start Free Trial

SOLUTION

the_bill

membership

This solution is only available to members.

To access this solution, you must be a member of Experts Exchange.

Start Free Trial

dqmq

>should I not convert the strring to date before importing, I need to change the column type from varchar(20) to datetime, I do not thing a simple alter function would do the trick as the data in the column is not properly formatted

You can convert from string to date WHILE importing or updating the new datetime column. Alter will not work.

The conversion from varchar to date requires the varchar to first be reformatted to one of the acceptable date-string formats. See BOL for the CONVERT function. But given the format you are starting with, that is not too difficult. (Now is time for prayers that your date formatting is consistent across all rows. In my experience, there's a snowball's chance in hell that all your dates are expressed properly).

xav056

ASKER

dqmq:
I need to add the identity colimn to use it as a primary key, Ii do not really have a godd primary key to use in my table(not even a combination of columns)
the qusetion here how can I add the identity column the fastest way I have 8 million records and it seems that it is taking for ever to do,
same thing for populatinf the date, whats the fastest way to do it

Thank you

xav056

ASKER

the_bill:
that was what I initially though off, but again its taking too much time to insert 8 million records using insert into table 1 (col1,col2)
select (col1,col2) from table 2

thank you

_agx_

> I have 8 million records and it seems that it is taking for ever to do

Is there some reason you need to do this all in one pass? It's not going to be instantaneous no matter how you
slice it. When I work with extremely large tables, I break things up into smaller chunks. So the logs, etc..
don't get bogged down. I set up a loop and inside it process chunks of data for some data determined range
(1mo, 6mo, ... whatever makes the most sense). Each iteration processes the current chunk, then repeats until
everything is processed. Not very elegant, but it's worked well for me in the past with this type of one-time
process.

xav056

ASKER

K I managed to transfer the data it actually took 110 minutes to copy the 8 million records from one table without an indentity column to one with an identity column, I also created new datecolumn as suggested by dqmq and did the conversion from varchar to date, that took 12 minutes to complete for the table,
Now I need help with setting up the indexes for querying the data fast,
I would like to select
SElect col1,col2,col3,col4,col5,col6,col7
from table
where date is between 'date1' and 'date2'

I added non clustered index on date assuming that date is col7
If I do select col7
from table
where date netween 'date1' and date2
I get result for 500,000 in 5 seconds which is not too bad
but If i include in my select statement more than the date col, 7 columns as in the sample above the quesry takes about 3 minutes, which is too much
How can I set up my indexes to get the query to run fast

Thanks

dqmq

Adding more columns in the select clause does NOT ordinarily impact performance (unless those columns are also used in the where clause). I'd start by adding one column at a time until you find the culprit. Then explore the nature of that column.

Also, you can show the execution plan to see what indexes are used by your query.

dqmq

One more thing (a rant, I suppose).

A relation (table is a close approximation) is supposed to represent business assertions. In fact, every row should represent exactly one business assertion. That business assertion must be uniquely identifiable by a primary key. If not, you are in BIG doo.

Now, I think you know that, because of your initiatlve to add the identity column. However, the identity column is totally meaningless in business terms and therefore really can't improve the business assertion represented by the row and really can't serve the intended purpose of the primary key. There are some good reasons for identity columns, but needing a primary key is usually not one of them. After more than 15 years in the business, I cannot remember a situation where a unique key other than an identity column wasn't important.

In other words, if the rows in your table are not unique, then the table does not represent a business relation. Arbtrarily tossing in a meaningless column to make the rows unqiue does not address that problem.

Now, I'll get off the soap box.

xav056

ASKER

Not quite sure why the identity column is a bad idea,
Lets say I want to have foreign constraints
One table has 8 columns where 4 columsn combined together could be used as a primary key, When I want to make sure that another table would lookup against a certain primary key, my intuition would tell me to add a foreighn key constraint but if i have a coposite primary key IO would have to repeat the 4 columns from table 1 and insert them in table 2, would it not be better in such a scenario to insert an identity column in table 1 and reference that as foreign key in table 2?
Thank you

dqmq

Let's be clear: the debate is not about identity columns, it's about surrogate keys. The pros/cons of assigning the identity property to a surrogate key is another discussion entirely.

I didn't mean to imply that surrogate keys are bad; for sure they have their place. One good reason for a surrogate key is if the natural key is to wide. By that I mean, one column with too many characters or a composite key with too many columns. One the one hand, joins are simpler and more efficient when they involve a single column. On the other, joins are often unnecessary when columns of the parent table propogate to the child.

But more importantly, natural keys convey a business rule. If you replace a natural key with a surrogate key, then you lose that business rule. I use surrogate keys when they make sense, but I NEVER drop the natural key when I do so. Why? Because the natural key is important to the integrity of the data and helps the database faithfully represent the business. In your case, absence of a natural key suggests the table design does not represent a well-defined, identifiable "thing" to the business. Just throwing on a surrogate key does not change that.

I'm not opposing the surrogate key so much as encouraging you to discover and assert an alternate natural key. If you do not do that, you are likely to eventually have bad data. Either multiple rows for the same thing, or one row for multiple things, or a right row and a wrong row and the inability to tell them apart.

When you mentioned that your table did not have a set of columns that uniquely identify a row, red flags went up. That's a concern indeed, but it's naive to think adding a surrogate key addresses it.

Do I make sense?