Speeding up joining massive tables (4m+ rows) that take 6mins+

glenramsey
I have my main table (Table1) which contains approx 500,000 rows - my queries on this table perform perfectly.

I have another table (Table2) which has additional breakdown information of the data held in Table1.  This table, Table2, contains approx 4 million rows.

When I join Table2 to Table1 my queries go from taking <1 sec to 6 mins+.

Both tables are indexed.  There are no key constraints.  The foreign key in Table2, which relates to one of the index keys in Table1, is made up of other information (a bit of a blob I'm afraid), e.g. "I\123" might be typical contents of the key in Table2, whereas in Table1 it would just be "123".

Other than that, the only thing I can think of that is slowing the query is the size of Table2.

I have also used SQL's Estimated Execution Plan, which highlights that one operation takes up 44% of the cost - very high compared to all the other operations, none of which rises above 1%.  The operation taking up the 44% is a "Bookmark Lookup".
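From what I have read, a bookmark lookup means the index that finds the rows does not contain all the columns the query needs, so every matching row costs an extra fetch from the data pages.  Would a composite index on the columns I filter on help here?  Something like this, perhaps (untested, and the column list is only my guess):

     CREATE INDEX IX_Table1Invoices_filter
     ON Table1Invoices (company_no, account_no, transaction_stat, transaction_type)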

Hope someone can help
Can you post the query and also post which columns you have indexed.
Also, try turning SHOWPLAN_TEXT on to see how the query is being executed.  You can also use show execution plan in Query Analyzer.

See http://www.quest.com/whitepapers/tuning_article_4.pdf for some interpretation of the SHOWPLAN_TEXT output.
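For example, in Query Analyzer (from memory - check Books Online for the exact syntax):

     SET SHOWPLAN_TEXT ON
     GO
     -- paste the slow query here; it is compiled but not executed,
     -- and the plan comes back as rows of text
     SELECT 1   -- placeholder for the real query
     GO
     SET SHOWPLAN_TEXT OFF
     GO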

adw
This is not too bad a timing for printing entire tables.  Unless you mean that it takes 6 min to select one record, I would say look at the physical schema.

Commented:
I think you have to show us the query!!

Author

Commented:
This is a very long post, apologies.  I have included as much info as I can that I think will help.

I cover: a description of the theory behind the join, the SQL query, and the field layouts of the two main tables including their indexes.

The SQL query is at the end of this blurb:

Table1 is the table with 500,000 rows and Table2InvoicesDetails is the table with 4 million+ rows.

I imagine a lot of the problems come down to the fact that the data has come from a C-ISAM database and the field types are a bit loose, to say the least - therefore I have had to convert a lot on the fly.  Once the data is imported into MS-SQL I can't then change the field type to, for example, date, because there are some non-date entries.  Because the import runs every night it would be impractical to keep changing the field type prior to each import, executing a query to remove the non-date entries, and then having the field converted to a date field.

In addition, some of the indexes (for example on date fields) make logical sense, but I wonder: because these fields are still type varchar, does the index actually make much difference, given that it is indexing text?
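I also suspect that wrapping the column in CONVERT(datetime, ...) in my WHERE clause stops SQL Server using any index on it at all.  If I understand correctly, had the dates been stored as 'yyyymmdd' text, a plain string comparison would sort correctly and could still use the index - something like this (just my guess):

     -- only valid if payment_date were stored as 'yyyymmdd' text
     WHERE payment_date <= '20021122'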

Neither table has a truly unique field, so I created one on Friday (simply an ID field, type int, with auto-number), which brought the query down to 4 minutes - a little better.

The theory behind joining Table1 and Table2 is that Table1 has the majority of the info required; Table2 gives two more bits of information about each invoice.  Table2 contains at least 2 rows for each invoice in Table1, and usually more depending on what items are on the invoice.  I have considered having a query run each night to generate a new table with the combined information, restricted to 'sensible' date ranges - while that query might take several hours, at least I would end up with data from a realistically usable date range, and hopefully the queries would improve both because of that and because no join between Table1 and Table2 would occur at query time (sketched below).
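The nightly build would be something along these lines - a rough sketch only; CombinedInvoices is just a name I have made up, and the real column list would mirror the query below:

     SELECT t1.invoice_no, t1.company_no, t1.payment_date,
            t2.sam_code_n, t2.cost_code, t2.gcf_value, t2.entry_seq
     INTO   CombinedInvoices       -- pre-joined table, rebuilt nightly
     FROM   Table1Invoices t1
     JOIN   Table2InvoicesDetails t2
       ON   t1.company_no = t2.company_no   -- COLLATE clauses omitted; the real join needs them as below
      AND   t2.source_ref LIKE 'I%' + t1.invoice_no + '%'
     WHERE  CONVERT(datetime, t1.payment_date) >= '2002-01-01'   -- the 'sensible' date range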

(I have included a copy of the fields from both Table1 and Table2 at the end of this post)

SELECT
     Table1Invoices.invoice_no,
     Table1Invoices.company_no,
     Table3Regions.job_area,
     LTRIM(Table1Invoices.site_no) AS site_no,
     Table1Invoices.account_no,
     Table1Invoices.payment_date,
     Table1Invoices.transaction_type,
     Table1Invoices.transaction_stat,
     Table1Invoices.keybtx2,
     Table1Invoices.keybtx3,
     CONVERT(real, Table1Invoices.goods_value) goods_value,
     CONVERT(real, Table1Invoices.vat1_value) vat1_value,
     CONVERT(real, Table1Invoices.total_value) total_value,
     Table2InvoicesDetails.sam_code_n AS sam_code,
     Table2InvoicesDetails.cost_code,
     Table2InvoicesDetails.gcf_value,
     Table2InvoicesDetails.entry_seq
FROM Table1Invoices
LEFT JOIN Table3Regions
     ON LTRIM(Table1Invoices.site_no) = LTRIM(Table3Regions.site_no)
     AND Table1Invoices.company_no = Table3Regions.company_no
JOIN Table2InvoicesDetails
     ON Table1Invoices.company_no COLLATE Latin1_General_CI_AS = Table2InvoicesDetails.company_no
     AND Table2InvoicesDetails.source_ref COLLATE Latin1_General_CI_AS LIKE 'I%' + Table1Invoices.invoice_no + '%'
WHERE Table1Invoices.account_no = 'ACC1'
     AND Table1Invoices.company_no = '1'
     AND Table1Invoices.transaction_stat = 'P'
     AND NOT Table1Invoices.transaction_type = 'CASH'
     AND CONVERT(datetime, Table1Invoices.payment_date) <= '2002-11-22'
     AND ledger_type_psx = 'PUR'
     AND Table2InvoicesDetails.entry_seq <= 601
     AND NOT (LTRIM(Table2InvoicesDetails.dept_code) = '220')
     AND Table2InvoicesDetails.system_reg = 'PUR'
ORDER BY
     Table1Invoices.company_no,
     Table1Invoices.account_no,
     Table3Regions.job_area,
     Table1Invoices.site_no,
     Table1Invoices.payment_date,
     Table1Invoices.transaction_stat,
     Table1Invoices.keybtx2,
     Table1Invoices.keybtx3,
     Table1Invoices.transaction_type,
     Table1Invoices.invoice_no

Data fields are as follows; each line shows the column name, type, length, and nullable flag.  (You will notice that most fields are varchar due to the poor quality of the original data - data is imported nightly into MS-SQL.)

Table1: (indexes: id, account_no, invoice_date, payment_date, due_date)

id      int      4      0
ACCOUNT_NO      varchar      50      1
BATCH_REF      varchar      50      1
CHEQUE_NO      varchar      50      1
COMPANY_NO      varchar      50      1
COUNT_REG      varchar      50      1
DATE_REG      varchar      50      1
DISCOS      varchar      50      1
DISCOUNT_OVERRIDE      varchar      50      1
DISCOUNT_TYPE      varchar      50      1
DISCOUNT_VALUE      varchar      50      1
DISCTD      varchar      50      1
DUE_DATE      varchar      50      1
ENTRY_SEQ      varchar      50      1
GOODS_VALUE      varchar      50      1
INVOICE_DATE      varchar      50      1
INVOICE_NO      varchar      50      1
KEYBTX2      varchar      50      1
KEYBTX3      varchar      50      1
LEDGER_TYPE_PSX      varchar      50      1
OUTSTANDING      varchar      50      1
PAID      varchar      50      1
PAYMENT_DATE      varchar      50      1
PAYMENT_DISCOUNT      varchar      50      1
PAYMENT_PERIOD      varchar      50      1
PAYMENT_REF      varchar      50      1
PAYMENT_STATUS      varchar      50      1
PAYMENT_VALUE      varchar      50      1
PAY_CODE      varchar      50      1
PAY_CODE_ACTION      varchar      50      1
PAY_CODE_DESC      varchar      50      1
PERIOD_END_DATE      varchar      50      1
PERIOD_TABLE      varchar      50      1
SITE_NO      varchar      50      1
SUPPLIER_REF      varchar      50      1
SYSTEM_REG      varchar      50      1
TOTAL_VALUE      varchar      50      1
TRANSACTION_STAT      varchar      50      1
TRANSACTION_TYPE      varchar      50      1
VAT1_CODE      varchar      50      1
VAT1_VALUE      varchar      50      1

Table2: (indexes: ID, COMPANY_NO, SOURCE_REF) - SOURCE_REF is the 'blob' that contains the invoice_no that joins to Table1
id      int      4      0
COMPANY_NO      varchar      2      1
COST_CODE      varchar      3      1
COUNT_REG      varchar      254      1
DATE_REG      varchar      50      1
DEPT_CODE      varchar      6      1
ENTRY_SEQ      varchar      254      1
GCF_COMMENT      varchar      30      1
GCF_DATE      varchar      50      1
GCF_OUTPUT      float      8      0
GCF_QUANTITY      float      8      0
GCF_VALUE      float      8      0
GROUP_CODE      varchar      3      1
LEDGER_TYPE_CNP      varchar      3      1
PERIOD_END_DATE_C      varchar      50      1
PERIOD_END_DATE_N      varchar      50      1
PERIOD_TABLE_C      varchar      2      1
PERIOD_TABLE_N      varchar      2      1
PREV_YEAR_END      varchar      50      1
SAM_CODE_C      varchar      5      1
SAM_CODE_N      varchar      5      1
SITE_NO      varchar      6      1
SOURCE_CLASS      varchar      3      1
SOURCE_REF      varchar      15      1
SOURCE_TYPE      varchar      3      1
SOURCE_TYPE_REF      varchar      18      1
SYSTEM_REG      varchar      3      1
TRANSACTION_STAT      varchar      1      1
WEEK_END_DATE      varchar      50      1
Could you please time this query:
-----------------------------------------
select
       ti.company_no
     , ti.invoice_no
     , ti.account_no
     , ti.transaction_stat
     , ti.transaction_type
     , ti.payment_date
     , ti.ledger_type_psx
     , td.entry_seq
     , td.dept_code
     , td.system_reg
from
     Table1Invoices as ti
inner join
     Table2InvoicesDetails as td
on
     ti.company_no COLLATE Latin1_General_CI_AS = td.company_no
     AND
          td.source_ref COLLATE Latin1_General_CI_AS LIKE 'I%'+ti.invoice_no+'%'
where
     ti.account_no='ACC1'
     AND
          ti.company_no='1'
     AND
          ti.transaction_stat='P'
     AND NOT
          ti.transaction_type='CASH'
     AND
          CONVERT(datetime,ti.payment_date) <= '2002-11-22'
     AND
          ti.ledger_type_psx='PUR'
     AND
          td.entry_seq <= 601
     AND NOT
          LTRIM(td.dept_code) = '220'
     AND
          td.system_reg='PUR'

-- cheers

Author

Commented:
Cheers miron, nah - that took exactly the same length of time.  All I noticed you'd done is the aliasing of the table names; I can see why you thought that might make a difference, but unfortunately not.

cheers
Nope - what I did is I removed all the columns except the ones used in the inner join.
The timing indicates that it is purely the join that is slow, or at least slower than you want it to be.  The I/O generated by the additional columns is not a factor in query execution.
You are still not answering one question I posted: how selective is this query - does it select 1% of the records in Table1Invoices, less than 1%, or more than 1%?

Would be nice to know the SQL Server version as well.

Author

Commented:
Sorry miron.  The number of rows returned is around 1,000 and the total row count in Table1 is around 500,000, so it is less than 1%.
thanks
Thank you - knowing the SQL Server version helps a lot to determine what instrumentation we have for a solution.

Author

Commented:
I'm not doing so well here, am I :)  MS-SQL 2000 - I got so bogged down in describing the problem I forgot to give details, sorry :)

As I said, using Query Analyzer I have performed the 'show execution plan', which does offer some help in that a large percentage of the process is taken up by the Table2 join - unfortunately I couldn't get Analyzer to produce that output to post here.

Author

Commented:
In light of the results from this topic I have decided to go through and ensure I have tables with true data types, thus making querying the data more efficient, especially when creating indexes.

I am currently in the process of writing DTS import queries that selectively bring the data over from C-ISAM to MS-SQL where the data conforms to the data type it should be.

I am having some difficulty detecting for the data type int, as I cannot successfully use ISNUMERIC() - it does not check for just ints (it allows decimals and basically any numeric value).

Does anyone know of a way to only select data that is of type int?
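The closest I have got is a LIKE pattern test - a sketch only, and I'm not sure it is bulletproof (it ignores minus signs, for instance):

     -- keep only rows whose entry_seq is purely digits, so a later
     -- CAST(entry_seq AS int) should succeed; ISNUMERIC() is no good
     -- here because it also passes decimals, money symbols, etc.
     SELECT *
     FROM Table2InvoicesDetails
     WHERE entry_seq NOT LIKE '%[^0-9]%'        -- digits only
       AND LEN(entry_seq) BETWEEN 1 AND 9       -- crude guard against int overflow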

thanks

Author

Commented:
Further developments.

I have had success transporting the data into a format that has correct types, although it is a little bit of a bodge as I still cannot successfully detect int data types (see the post above).

However, I have reduced the row count in Table2 for testing by taking a subset of only the data the query looks at.  This reduces the range to 250,000 rows in Table2 (still 500,000 in Table1).  I notice that the query execution time is only 30 seconds faster, which I think shows that it is more than just the row count causing my problems.

My next thought is that the joining key that Table1 and Table2 are joined on is not 'true'.  Table1 has its key as '123' whereas Table2 has its key blobbed like 'I/123/G', so I have to append strings either side of the invoice number to get a match.  This might be the reason for the slow speed.  I am also having difficulties copying the data out using an INSERT INTO from a SELECT statement that removes the strings to create a data type of int.

If anyone is still reading this post I would be interested in their thoughts :)
thanks

Author

Commented:
More...

It would appear that true data typing makes the biggest difference.

After cleaning the data to enable me to use true types I have ended up with a snapshot of the 4 million+ rows, currently 130,000 - this is obviously a lot lower, however my queries are bringing back results in 3 seconds.  I believe once I have completed the data cleaning for all 4 million rows I will see acceptable times.

I consider the time of 3 seconds reasonable for the join between 500,000 and 130,000 rows, and obviously expect this time to increase with larger tables, at which point I would again see it as reasonable for me to have to increase processing power.
This should help to retrieve the numeric key from a key that is "blobbed" like 'I/123/G':


select
     -- take the text between the first and second '/' delimiters
     substring(
          source_ref,
          charindex( '/', source_ref ) + 1,
          charindex( '/',
                right( source_ref, len(source_ref) - charindex( '/', source_ref ) )
          ) - 1
     )
from
     Table2InvoicesDetails
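To land that straight in an int column you could wrap the same expression, e.g. (Table2Clean is just an illustrative name for a staging table with proper types):

     insert into Table2Clean ( invoice_no, company_no )
     select
          cast(
               substring(
                    source_ref,
                    charindex( '/', source_ref ) + 1,
                    charindex( '/',
                          right( source_ref, len(source_ref) - charindex( '/', source_ref ) )
                    ) - 1
               )
          as int ),
          company_no
     from
          Table2InvoicesDetails
     where
          -- guard: only rows that follow the I/nnn/x pattern;
          -- a stricter digits-only test may still be needed before the cast
          source_ref like '_/%/%'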
Also, any chance of changing:

* the dates in the field Table1Invoices.payment_date to the datetime data type?
* Table2InvoicesDetails.dept_code so that it does not need to be trimmed at the time of the join?

Both changes would help SQL Server to use an index on those fields.
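For example - a sketch, assuming the nightly import has already been cleaned up enough for the conversions to succeed:

     -- blank out anything that will not convert, then change the type
     -- (any index on payment_date must be dropped and recreated around this)
     update Table1Invoices set payment_date = NULL
     where isdate( payment_date ) = 0

     alter table Table1Invoices alter column payment_date datetime

     -- trim once at load time instead of LTRIM() in every query
     update Table2InvoicesDetails set dept_code = ltrim( dept_code )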

Author

Commented:
miron, thanks again.  Yes, I am currently in the process of removing any extraneous SQL, such as LTRIM.  The dates I have already converted - I pretty much now have tables with true data types, and it's made a massive difference.  When first posting this topic I did think this would be the case, but still thought 6 minutes to perform the query was too slow.  It's been a good learning process.

Many thanks for the SQL to grab the invoice number - I had used SUBSTRING but on a fixed character count, which has its obvious problems if the number length increases (at present it is fixed, however software vendors have a habit of changing stuff without telling anyone :), so your code will come in very useful.

I will be attempting to perform the query with the full 4 million+ rows today, although MS-SQL has started barking on about log file size, which is very annoying - I have had this problem once before, grrrrrr.
