
Join via int vs varchar(200)

Last Modified: 2019-02-25
How much does performance differ when doing a SQL join on a key that is a standard int vs a varchar(200), if one table has 6 million records and the other has 4 million? I'm using MS SQL Server 2016, and I almost always use standard int columns for my primary and foreign key relationships, but we have a specific business case that needs to join via a varchar(200), and I just want to know what to reasonably expect.

Thanks!

Russ Suter, Senior Software Developer
CERTIFIED EXPERT

Commented:
It really depends on what indexes are available. Assuming you're not using any functions in your JOIN condition and tight indexes are available, there's really no significant performance difference. All other things being equal you'll get slightly better performance out of an INT because each index page can hold more keys. Fewer index pages = faster lookups.

If you have a VARCHAR(200) column that is indexed, the join will be faster than using an INT column that is not indexed. Viewing the query execution plan can help you determine the best index to create to speed up your query.
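For illustration, a minimal sketch of indexing the varchar join key in both tables and then checking the plan; the table and column names are made up, not from the question:

-- Hypothetical names: FactTable, DimTable and BusinessKey are placeholders.
-- Plain nonclustered indexes on the varchar(200) join key in both tables:
CREATE NONCLUSTERED INDEX IX_FactTable_BusinessKey ON dbo.FactTable (BusinessKey);
CREATE NONCLUSTERED INDEX IX_DimTable_BusinessKey  ON dbo.DimTable  (BusinessKey);

-- Then compare the actual execution plan of the join with and without the indexes:
SELECT f.SomeMeasure, d.SomeAttribute
FROM dbo.FactTable AS f
JOIN dbo.DimTable  AS d
  ON d.BusinessKey = f.BusinessKey;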
Scott Pletcher, Senior DBA
CERTIFIED EXPERT
Most Valuable Expert 2018
Distinguished Expert 2019

Commented:
Above all, you need to reduce the total number of rows that have to be considered, through proper indexing and, in particular, the best clustering of the table.

That said, if the varchar column really is longer than, say, 30-40 bytes, you will get a noticeable performance hit when joining large numbers of rows.
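A sketch of what that clustering could look like, assuming the dimension does not already have a clustered index on another column (names are placeholders):

-- Clustering the large dimension on the join key keeps matching rows physically
-- together, so a large join touches far fewer pages.
CREATE CLUSTERED INDEX CIX_DimTable_BusinessKey ON dbo.DimTable (BusinessKey);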
skullnobrains
CERTIFIED EXPERT

Commented:
Assuming the varchar column is properly indexed, and given the small number of rows, there is a good chance the index lives in memory, so the performance hit would be negligible.

Longer answer: the comparison itself does not matter much. Most of the performance hit comes from the time required to load the index pages into memory. With HUGE tables, the hit can be quite noticeable. It is sometimes efficient to index only the first few bytes of the varchar, or to add a non-unique computed column to both tables holding, for example, the CRC32 of the varchar, and use that for the join.
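A rough T-SQL sketch of that hash-column idea. SQL Server has no built-in CRC32, so CHECKSUM() stands in here, and the varchar is still compared to resolve hash collisions (all names are made up):

-- Persisted computed hash of the wide business key, in both tables:
ALTER TABLE dbo.FactTable ADD BusinessKeyHash AS CHECKSUM(BusinessKey) PERSISTED;
ALTER TABLE dbo.DimTable  ADD BusinessKeyHash AS CHECKSUM(BusinessKey) PERSISTED;

CREATE NONCLUSTERED INDEX IX_FactTable_KeyHash ON dbo.FactTable (BusinessKeyHash);
CREATE NONCLUSTERED INDEX IX_DimTable_KeyHash  ON dbo.DimTable  (BusinessKeyHash);

-- Join on the narrow 4-byte hash, keeping the original varchar as a collision tiebreaker:
SELECT f.SomeMeasure, d.SomeAttribute
FROM dbo.FactTable AS f
JOIN dbo.DimTable  AS d
  ON  d.BusinessKeyHash = f.BusinessKeyHash
  AND d.BusinessKey     = f.BusinessKey;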
Scott Pletcher, Senior DBA
CERTIFIED EXPERT
Most Valuable Expert 2018
Distinguished Expert 2019

Commented:
I don't agree that 6 million rows joining to 4 million rows is a "small number of rows". That's a fairly large join, especially if it can't be done efficiently.

You also have to keep in mind that SQL has to do a case-insensitive match on char data, which it does not have to worry about for int data.
skullnobrains
CERTIFIED EXPERT

Commented:
Which is why I mentioned "properly indexed".

Additionally, even if there is no index and a bad execution plan, the difference will still hardly be noticeable. Feel free to run a benchmark.

I see no reason why the comparison would be case insensitive. That would add a small additional CPU load, likely consuming about 0.01 seconds of CPU time if the indexes are in memory. If they are not, the implied I/O toll would hinder performance far more than whatever CPU cost the case insensitivity adds.

Also note that in most use cases there would be a WHERE clause that selects fewer rows on the left side of the join. So again, this is entirely I/O bound.
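One way to run that benchmark is with SQL Server's built-in statistics output; a sketch, with placeholder names and an assumed date filter on the fact table:

-- Shows elapsed/CPU time and logical vs. physical reads for each statement,
-- so the int-key and varchar-key versions of the join can be compared directly.
SET STATISTICS IO ON;
SET STATISTICS TIME ON;

SELECT COUNT(*)
FROM dbo.FactTable AS f
JOIN dbo.DimTable  AS d
  ON d.BusinessKey = f.BusinessKey
WHERE f.LoadDate >= '2019-02-01';   -- hypothetical filter on the left side

SET STATISTICS IO OFF;
SET STATISTICS TIME OFF;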
Scott Pletcher, Senior DBA
CERTIFIED EXPERT
Most Valuable Expert 2018
Distinguished Expert 2019

Commented:
"I see no reason why the comparison would be case insensitive."

Because the underlying collations in SQL Server are typically case insensitive by default. Are you saying you expect SQL to do a case-sensitive search on columns with a case-insensitive collation?

I also can't imagine why you'd expect 10M total rows to already be in memory.  That would be extremely unusual unless they were extraordinarily active tables.
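If in doubt, the collation that actually applies to the join column can be checked directly (object and column names are placeholders):

-- Column-level collation of the join key, plus the server default:
SELECT c.name, c.collation_name
FROM sys.columns AS c
WHERE c.object_id = OBJECT_ID('dbo.DimTable')
  AND c.name = 'BusinessKey';

SELECT SERVERPROPERTY('Collation') AS ServerDefaultCollation;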
skullnobrains
CERTIFIED EXPERT

Commented:
My bad regarding the case-insensitivity assumption: I missed the SQL Server topic and was answering for the general SQL case. That said, it's still only a small CPU cost compared with the implied I/O.

I imagine the INDEXES would be in memory: a varchar(20) would require about 22 bytes per row, tops, probably half that in a real-life scenario. Times 10M rows, that produces a few hundred megabytes. I'd assume a modern, mildly busy server with gigabytes, or rather tens of gigabytes, of RAM would keep 200-300 MB worth of indexes cached.

Again, 10M rows is really small. Assuming there is an index, my laptop would handle the join before I release the Enter key, as long as I do not write the query results to disk.

If the query needs to fetch data from disk, the extra CPU cost would be next to nothing when the query already requires a few seconds to complete due to I/O anyway.

That said, I do not believe such joins are good practice.
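To check the "index lives in memory" assumption rather than guess, the buffer pool can be inspected; a sketch using standard DMVs (the table name is a placeholder):

-- Approximate amount of each index of the table currently cached in the buffer pool.
-- Each cached page is 8 KB.
SELECT i.name AS index_name,
       COUNT(*) * 8 / 1024 AS cached_mb
FROM sys.dm_os_buffer_descriptors AS bd
JOIN sys.allocation_units AS au ON au.allocation_unit_id = bd.allocation_unit_id
JOIN sys.partitions       AS p  ON p.hobt_id = au.container_id
JOIN sys.indexes          AS i  ON i.object_id = p.object_id AND i.index_id = p.index_id
WHERE bd.database_id = DB_ID()
  AND p.object_id = OBJECT_ID('dbo.DimTable')
GROUP BY i.name;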
Scott Pletcher, Senior DBA
CERTIFIED EXPERT
Most Valuable Expert 2018
Distinguished Expert 2019

Commented:
I don't follow. I can't see SQL matching 10M rows on a single-column index and then having to do 10M lookups.

"a varchar(20) would require about 22 bytes per row, tops"
Even so, those index rows won't stay in cache forever. If this is a daily process, say, I'd never expect those rows to still be buffered from the previous day.

OP explicitly stated varchar(200). And row overhead is always a minimum of 7 bytes.

Best for matching all rows would be for the tables to be clustered on the join key, which means the rows will be wider, possibly quite wide (especially if they are developer-"designed" tables, i.e., with virtually no true normalization done).
skullnobrains
CERTIFIED EXPERT

Commented:
If this is a daily process, I second that thought.

Best to benchmark it in a real-life scenario.

In a batch use case, assuming all the data on the left side of the join is read and the varchar really is 200 bytes rather than 20, the difference would be quite noticeable: a few seconds, which is the time required to read a few GB from disk rather than a few tens of MB.
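Rough arithmetic behind those numbers, assuming the keys really fill the declared length: 6,000,000 rows x ~200 bytes per key is roughly 1.2 GB of key data to read, versus 6,000,000 rows x ~22 bytes, roughly 130 MB, if only about 20 characters are actually stored. What matters is the stored length of the varchar values, not the declared 200.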
Scott Pletcher, Senior DBA
CERTIFIED EXPERT
Most Valuable Expert 2018
Distinguished Expert 2019

Commented:
Again, you can't read just an index with only the key column. You'll need to read additional columns, which will cause additional I/O.

There is just no way that reading 6M+ and 4M+ rows of at least 300 bytes each will be "small data" that can stay in buffers, not likely for even an hour on a reasonably busy server.
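If the query only needs a handful of columns from the dimension, one common way to limit that extra I/O is a covering index; a sketch with placeholder column names:

-- The INCLUDEd columns are carried in the index leaf pages, so the join can be
-- answered from the index alone, without per-row lookups into the base table.
CREATE NONCLUSTERED INDEX IX_DimTable_BusinessKey_Covering
  ON dbo.DimTable (BusinessKey)
  INCLUDE (DimAttribute1, DimAttribute2);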
skullnobrains
CERTIFIED EXPERT

Commented:
I see no reason to assume the use case is actually reading through all of the 6M rows.

Again, I would not do it or recommend it either, especially on a 200-byte field without specific optimizations.

I guess we're actually seeing the same thing, maybe just thinking of different use cases. With 200 bytes and a regular index, I can think of a few where the difference would simply kill the server.

@eviglotti: would you mind giving a few hints regarding the use case: the number of rows that would actually be looked up, how often this runs, how scattered the rows would be on the right-hand side, ...
skullnobrains
CERTIFIED EXPERT

Commented:
@scott

"Again, you can't read just an index with only the key column. You'll need to read additional columns, which will cause additional I/O."

I'm the one who does not follow now. You'd have to read those other columns whatever the join is based on... if you have to read through all the table's data anyway, the difference will be negligible.

eviglotti (Author)

Commented:
I have been following along on all of the comments, so thank you.

A few thoughts:

1. I'm not a DBA, so I don't know too much about clustering and indexes, but I can mention that the varchar(200) column on both the fact and the dimension is indexed, just with the normal MS SQL Server GUI index, which is a non-clustered, non-unique index.

2. Technically not all 200 characters are filled; realistically these are typically only 20 or so characters long. We just kept them at 200 to be safe. Does that matter at all?

3. In actual practice, the fact table is typically filtered to maybe a couple hundred thousand records and then joined to the various dimensions. Some of the dimensions are super small - say only a couple hundred records, but this one large dimension is the one with 4mil or so records.

4. We typically may only do this a few times a day and most all of the data warehouse tables are truncated and reloaded each and every day.

5. Not entirely sure how scattered the rows are, since I'm not typically diving into the low-level technical details of how the data is stored, fragmented, etc. in SQL Server.

I'm not sure that helps too much, but based on that at least, what's the general consensus: would a join of this type take twice as long to run with varchar(200) keys vs. standard int keys, or are we generally thinking that the speed difference would be fairly negligible? I know I could obviously just try it and time it, but I'm trying to be more objective and have a deeper understanding of this. I do know that @skullnobrains was saying this isn't good practice, which was my general intuition on this whole thing as well, but I can't quite put my finger on just why.

Thanks everyone!
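For reference, the usual star-schema answer to "why is the varchar join not good practice" is to keep the varchar(200) as a business key but join on an int surrogate key instead; a sketch using the same made-up names as above:

-- Give the dimension a narrow int surrogate key and carry it into the fact table
-- during the daily load, so everyday joins never touch the wide varchar.
ALTER TABLE dbo.DimTable  ADD DimKey int IDENTITY(1,1) NOT NULL;
ALTER TABLE dbo.FactTable ADD DimKey int NULL;

-- Resolve the surrogate key once per load, via the varchar business key:
UPDATE f
SET f.DimKey = d.DimKey
FROM dbo.FactTable AS f
JOIN dbo.DimTable  AS d
  ON d.BusinessKey = f.BusinessKey;

-- All subsequent reporting queries join on the 4-byte int:
SELECT f.SomeMeasure, d.SomeAttribute
FROM dbo.FactTable AS f
JOIN dbo.DimTable  AS d
  ON d.DimKey = f.DimKey;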

eviglotti (Author)

Commented:
Thank you everyone!
