Join via int vs varchar(200)

eviglotti

Asked:
How much does performance differ when doing a SQL join on a key that is a standard int vs. a varchar(200), if one table has 6 million records and the other has 4 million? I'm using MS SQL Server 2016, and I almost always use standard int columns for my primary and foreign key relationships, but we have a specific business case that needs to join via a varchar(200), and I just want to know what to reasonably expect.

Thanks!
Russ Suter, Senior Software Developer

Commented:
It really depends on what indexes are available. Assuming you're not using any functions in your JOIN predicate and tight indexes are available, there's really no significant performance difference. All other things being equal, you'll get slightly better performance out of an INT because each index page can hold more keys. Fewer index pages = fewer page reads = faster lookups.

If you have a VARCHAR(200) column that is indexed, the join will be faster than using an INT column that is not indexed. Viewing the query execution plan can help you determine the best index to create to speed up your query.
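As a rough sketch of that advice (the table and column names below are placeholders, not the asker's actual schema), indexing the varchar join column on both sides and then comparing plans might look something like this:

-- Placeholder names: FactSales / DimCustomer / BusinessKey (the varchar(200) join column).
CREATE NONCLUSTERED INDEX IX_FactSales_BusinessKey
    ON dbo.FactSales (BusinessKey);

CREATE NONCLUSTERED INDEX IX_DimCustomer_BusinessKey
    ON dbo.DimCustomer (BusinessKey);

-- Turn on "Include Actual Execution Plan" in SSMS (or SET SHOWPLAN_XML ON for the
-- estimated plan) and run the join both ways to see which indexes are actually used.
SELECT COUNT(*)
FROM dbo.FactSales AS f
JOIN dbo.DimCustomer AS d
    ON d.BusinessKey = f.BusinessKey;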
Scott Pletcher, Senior DBA
Most Valuable Expert 2018
Top Expert 2014

Commented:
In particular, you need to reduce the total number of rows that have to be considered, through proper indexing and especially the best clustering of the table.

That said, if the varchar column really is longer than, say, 30-40 bytes, you will get a noticeable performance hit when joining large numbers of rows.
skullnobrains

Commented:
Assuming the varchar column is properly indexed, and given the small number of rows, there is a good chance the index lives in memory, so the performance hit would be negligible.

Longer answer: the comparison itself does not matter much. Most of the performance hit comes from the time required to load the index pages into memory. With HUGE tables, that hit can be quite noticeable. It is sometimes efficient to index only the first few bytes of the varchar, or to use a non-unique computed column in both tables (for example a CRC32 of the varchar) and join on that instead, as sketched below.
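To make that last suggestion concrete in SQL Server terms (CHECKSUM stands in for CRC32 here, and the table/column names are invented), a persisted computed hash column could be added and indexed on both sides; the original varchar stays in the join to resolve hash collisions:

-- Sketch only; FactSales / DimCustomer / BusinessKey are placeholder names.
ALTER TABLE dbo.FactSales   ADD BusinessKeyHash AS CHECKSUM(BusinessKey) PERSISTED;
ALTER TABLE dbo.DimCustomer ADD BusinessKeyHash AS CHECKSUM(BusinessKey) PERSISTED;

CREATE NONCLUSTERED INDEX IX_FactSales_BusinessKeyHash   ON dbo.FactSales (BusinessKeyHash);
CREATE NONCLUSTERED INDEX IX_DimCustomer_BusinessKeyHash ON dbo.DimCustomer (BusinessKeyHash);

SELECT COUNT(*)
FROM dbo.FactSales AS f
JOIN dbo.DimCustomer AS d
    ON d.BusinessKeyHash = f.BusinessKeyHash   -- narrow, int-sized comparison does most of the work
   AND d.BusinessKey     = f.BusinessKey;      -- CHECKSUM is not collision-free, so keep the real key too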
Scott Pletcher, Senior DBA
Most Valuable Expert 2018
Top Expert 2014

Commented:
I don't agree that 6 million rows joining to 4 million rows is a "small number of rows". That's a fairly large join, especially if it can't be done efficiently.

You also have to keep in mind that SQL has to do a case-insensitive match on char data, which it does not have to worry about for int data.
skullnobrains

Commented:
Which is why I mentioned "properly indexed".

Additionally, even with no index and a bad execution plan, the difference will still hardly be noticeable. Feel free to run a benchmark.

I see no reason why the comparison would be case-insensitive. That would add a small additional CPU load, likely consuming about 0.01 seconds of CPU time if the indexes are in memory. If they are not, the implied I/O toll would hinder performance far more than whatever CPU cost the case-insensitivity adds.

Also note that in most use cases there would be a WHERE clause selecting far fewer rows on the left side of the join. So again, this is entirely I/O bound.
Scott Pletcher, Senior DBA
Most Valuable Expert 2018
Top Expert 2014

Commented:
"I see no reason why the comparison would be case insensitive."

Because the underlying collations in SQL Server are typically case-insensitive by default. Are you saying you expect SQL to do a case-sensitive search on a column with a case-insensitive collation?

I also can't imagine why you'd expect 10M total rows to already be in memory.  That would be extremely unusual unless they were extraordinarily active tables.
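For anyone wanting to verify rather than assume, the collation actually in effect on the join columns can be checked with a query like this (table and column names are placeholders):

SELECT OBJECT_NAME(c.object_id) AS table_name,
       c.name                   AS column_name,
       c.collation_name
FROM sys.columns AS c
WHERE c.object_id IN (OBJECT_ID('dbo.FactSales'), OBJECT_ID('dbo.DimCustomer'))
  AND c.name = 'BusinessKey';
-- A collation name containing CI (e.g. SQL_Latin1_General_CP1_CI_AS) is case-insensitive.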
skullnobrains

Commented:
My bad regarding the case-insensitivity assumption: I missed that the question was about SQL Server and was answering for the general SQL case. That said, it's still only a little CPU compared with the implied I/O.

I expect the INDEXES to be in memory: a varchar(20) would require about 22 bytes per row, tops, probably half that in a real-life scenario. Times 10M rows, that produces a few hundred megabytes. I'd assume a modern, mildly busy server, with gigabytes or rather tens of gigabytes of RAM, would keep 200-300 MB worth of indexes cached.

Again, 10M rows is really small. Assuming there is an index, my laptop would handle the join before I release the Enter key, as long as I don't write the query results to disk.

If the query needs to fetch data from disk, the extra CPU cost would be next to nothing when the query already takes a few seconds to complete due to I/O anyway.

That said, I do not believe such joins are good practice.
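Whether those index pages really do sit in the buffer pool is something that can be measured rather than guessed; a query along these lines against the standard DMVs (placeholder table name again) shows how much of each index is currently cached:

SELECT i.name AS index_name,
       COUNT(*) * 8 / 1024 AS cached_mb       -- buffer pages are 8 KB each
FROM sys.dm_os_buffer_descriptors AS bd
JOIN sys.allocation_units AS au
    ON au.allocation_unit_id = bd.allocation_unit_id
   AND au.type IN (1, 3)                      -- in-row and row-overflow data
JOIN sys.partitions AS p
    ON p.hobt_id = au.container_id
JOIN sys.indexes AS i
    ON i.object_id = p.object_id AND i.index_id = p.index_id
WHERE bd.database_id = DB_ID()
  AND p.object_id = OBJECT_ID('dbo.DimCustomer')
GROUP BY i.name
ORDER BY cached_mb DESC;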
Scott Pletcher, Senior DBA
Most Valuable Expert 2018
Top Expert 2014

Commented:
I don't follow.  I can't see SQL  matching 10M rows on a single-column index and then having to do 10M lookups.  

"varchar 20 would require about 22 bytes per row tops"
Even then, even those index rows won't stay in cache forever.  If this is a daily process, say, I'd never expect those rows to still be buffered from the previous day.

OP explicitly stated varchar(200).  And row overhead is always a minimum of 7 bytes.

Best for matching all rows would be for the tables to be clustered on the join key, which means the rows being read will be wider, possibly quite wide (especially if they are developer-"designed" tables, i.e., with virtually no true normalization done).
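A sketch of that clustering suggestion, with invented names (note that building a clustered index on tables this size is an expensive operation, and any existing clustered index or clustered primary key would have to be dropped or rebuilt first, so treat this as a design illustration rather than a script to run on a live warehouse):

CREATE CLUSTERED INDEX CIX_DimCustomer_BusinessKey
    ON dbo.DimCustomer (BusinessKey);

CREATE CLUSTERED INDEX CIX_FactSales_BusinessKey
    ON dbo.FactSales (BusinessKey);
-- With both tables physically ordered on the key, the optimizer can use a merge join and
-- read each table once in key order instead of doing millions of random lookups.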
skullnobrains

Commented:
If this is a daily process, I second that thought.

Best to benchmark it in a real-life scenario.

In a batch use case, assuming all the data on the left side of the join is read and the varchar is 200 bytes rather than 20, the difference would be quite noticeable: a few seconds, i.e. the time required to read a few GB from disk rather than a few tens of MB.
Scott Pletcher, Senior DBA
Most Valuable Expert 2018
Top Expert 2014

Commented:
Again, you can't read just an index with only the key column.  You'll need to read additional columns, which will cause additional I/O.

There is just no way that reading 6M+ and 4M+ rows of at least 300 bytes each will be "small data" that can stay in buffers, not likely for even an hour on a reasonably busy server.
skullnobrains

Commented:
I see no reason to assume the use case actually reads through all of the 6M rows.

Again, I would not do or recommend it either, especially on a 200-byte field without specific optimizations.

I guess we're actually saying the same thing, maybe just thinking of different use cases. With 200 bytes and a regular index, I can think of a few cases where the difference would simply kill the server.

@eviglotti: would you mind giving a few hints regarding the use case: the number of rows that would actually be looked up, how often this runs, how scattered the matching rows would be on the right-hand side, ...

@scott

"Again, you can't read just an index with only the key column. You'll need to read additional columns, which will cause additional I/O."

I'm the one who does not follow now. You'd have to read those other columns whatever the join is based on... and if you have to read through all the table's data anyway, the difference will be negligible.

Author

Commented:
I have been following along on all of the comments, so thank you.

A few thoughts:

1. I'm not a DBA, so I don't know too much about clustering and indexes, but I can mention that the varchar(200) on both the fact and the dimension is indexed, just with a normal index created through the MS SQL Server GUI, which is a non-clustered, non-unique index.

2. Technically, not all 200 characters are used; realistically these values are typically only 20 or so characters long. We just kept the columns at 200 to be safe. Does that matter at all?

3. In actual practice, the fact table is typically filtered to maybe a couple hundred thousand records and then joined to the various dimensions. Some of the dimensions are super small - say only a couple hundred records, but this one large dimension is the one with 4mil or so records.

4. We typically may only do this a few times a day and most all of the data warehouse tables are truncated and reloaded each and every day.

5. Not entirely sure about how scattered the rows are since I'm not typically diving into the low level technical details on how the data is stored, fragmented, etc in SQL server.

I'm not sure that helps too much, but based on that at least, what's the general consensus, would a join of this type take twice as long to run with varchar(200) keys vs just standard int keys or are we generally thinking that the speed difference would be fairly negligible? I know obviously I could just try it and time it, but I'm just trying to be more objective and have a deeper understanding of this all. I do know that @skullnobrains was saying this isn't good practice, which was my general intuition on this whole thing as well, but I can't quite put my finger on just why.

Thanks everyone!
skullnobrains

Commented:
2) It matters a lot. Keeping them declared as 200 can also make things a tiny bit slower.
Integers will likely use up 4 bytes (it can be less or more depending on the range, but a small number anyway).
Varchars store the length of the data and then the data itself, so the per-row storage required would be a couple of bytes of overhead plus the average data length.
Assuming the above is correct, that would produce an index roughly 5 or 6 times bigger overall: about 16 MB vs. roughly 100 MB, plus whatever overhead the index structure itself adds.
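Rather than relying on that back-of-the-envelope estimate, the actual index sizes and row counts can be read straight from the DMVs (placeholder table name):

SELECT i.name AS index_name,
       SUM(ps.used_page_count) * 8 / 1024 AS used_mb,
       SUM(ps.row_count)                  AS row_cnt
FROM sys.dm_db_partition_stats AS ps
JOIN sys.indexes AS i
    ON i.object_id = ps.object_id AND i.index_id = ps.index_id
WHERE ps.object_id = OBJECT_ID('dbo.DimCustomer')
GROUP BY i.name;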

3) The join is only performed on the filtered rows (after applying whatever filters can be applied on the left-hand side).
SQL Server will either read all the rows or use an index to grab those ~200k rows, and then perform the join, i.e. lookups into the index of the right-hand table.
Most likely you'll load the whole index into memory at that point. Sometimes SQL Server is able to optimize so that the lookups happen in order.

4) You'll probably lose whatever time it takes to load ~100 MB into memory, which should be about 1-2 seconds, and use a little more CPU during the actual join operation, which should be negligible.

The above is based on what you said plus various conjectures. I'm probably off by a number of factors, and it is quite possible others would disagree with this analysis. @scott seems to have quite some knowledge, and I'm honestly not sure my analysis and conjectures are more spot-on than his.

If you really want a deep understanding, you'd need to both time the queries and look at their execution plans to figure out what SQL Server does in your specific case. Obviously, once you have run a query, subsequent runs will be faster because more things will be cached in memory.
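One simple way to do that comparison on a test copy (placeholder names; DBCC DROPCLEANBUFFERS empties the buffer cache so the second run isn't flattered by pages the first run just loaded, and it should only be used on a non-production server):

SET STATISTICS IO ON;
SET STATISTICS TIME ON;

DBCC DROPCLEANBUFFERS;                                            -- cold cache, test server only
SELECT COUNT(*)
FROM dbo.FactSales AS f
JOIN dbo.DimCustomer AS d ON d.DimCustomerID = f.DimCustomerID;   -- int surrogate key

DBCC DROPCLEANBUFFERS;
SELECT COUNT(*)
FROM dbo.FactSales AS f
JOIN dbo.DimCustomer AS d ON d.BusinessKey = f.BusinessKey;       -- varchar(200) key

SET STATISTICS IO OFF;
SET STATISTICS TIME OFF;
-- Compare the logical/physical reads and the CPU vs. elapsed times reported for each run,
-- alongside the actual execution plans.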

A few other thoughts:
- If both are fast and take the same time, everything is already in the cache.
- If both are fast and there is a slight overhead, my conjectures are probably correct.
- If the int IDs are fast and the varchar is much slower, the server is probably reaching a limit where everything will soon get slower.
- If the int IDs are slow and the varchar is much slower, the server lacks RAM and is already paging in and out to perform a scattered join, and things will quickly get worse.
- If both are slow and take the same time, something else is slowing down the query.

Author

Commented:
Thank you everyone!
