Join via int vs varchar(200)

eviglotti

Asked:
How much does performance differ when doing a SQL join on a key that is a standard int vs. a varchar(200), if one table has 6 million records and the other has 4 million? I'm using MS SQL Server 2016, and I almost always use standard int columns for my primary and foreign key relationships, but we have a specific business case that needs to join via a varchar(200), and I just want to know what to reasonably expect.

Thanks!
Russ Suter, Senior Software Developer

Commented:
It really depends on what indexes are available. Assuming you're not using any functions in your JOIN predicate and tight indexes are available, there's really no significant performance difference. All other things being equal, you'll get slightly better performance out of an INT because each index page can hold more keys. Fewer index pages = fewer page reads = faster lookups.

If you have a VARCHAR(200) column that is indexed, the join will be faster than using an INT column that is not indexed. Viewing the query execution plan can help you determine the best index to create to speed up your query.
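As a rough sketch of that advice (the table and column names below are placeholders, not the asker's actual schema), indexing the varchar join column on both sides and then comparing plans might look something like this:

-- Placeholder names: FactSales / DimCustomer / BusinessKey (the varchar(200) join column).
CREATE NONCLUSTERED INDEX IX_FactSales_BusinessKey
    ON dbo.FactSales (BusinessKey);

CREATE NONCLUSTERED INDEX IX_DimCustomer_BusinessKey
    ON dbo.DimCustomer (BusinessKey);

-- Turn on "Include Actual Execution Plan" in SSMS (or SET SHOWPLAN_XML ON for the
-- estimated plan) and run the join both ways to see which indexes are actually used.
SELECT COUNT(*)
FROM dbo.FactSales AS f
JOIN dbo.DimCustomer AS d
    ON d.BusinessKey = f.BusinessKey;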
Scott Pletcher, Senior DBA
Most Valuable Expert 2018
Top Expert 2014

Commented:
In particular, you need to reduce the total number of rows that have to be considered, through proper indexing and especially the best clustering of the table.

That said, if the varchar column really is longer than, say, 30-40 bytes, you will get a noticeable performance hit when joining large numbers of rows.
skullnobrains

Commented:
Assuming the varchar column is properly indexed, and given the small number of rows, there is a good chance the index lives in memory, so the performance hit would be negligible.

Longer answer: the comparison itself does not matter much. Most of the performance hit comes from the time required to load the index pages into memory. With HUGE tables, that hit can be quite noticeable. It is sometimes efficient to index only the first few bytes of the varchar, or to use a non-unique computed column in both tables (for example a CRC32 of the varchar) and join on that instead, as sketched below.
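To make that last suggestion concrete in SQL Server terms (CHECKSUM stands in for CRC32 here, and the table/column names are invented), a persisted computed hash column could be added and indexed on both sides; the original varchar stays in the join to resolve hash collisions:

-- Sketch only; FactSales / DimCustomer / BusinessKey are placeholder names.
ALTER TABLE dbo.FactSales   ADD BusinessKeyHash AS CHECKSUM(BusinessKey) PERSISTED;
ALTER TABLE dbo.DimCustomer ADD BusinessKeyHash AS CHECKSUM(BusinessKey) PERSISTED;

CREATE NONCLUSTERED INDEX IX_FactSales_BusinessKeyHash   ON dbo.FactSales (BusinessKeyHash);
CREATE NONCLUSTERED INDEX IX_DimCustomer_BusinessKeyHash ON dbo.DimCustomer (BusinessKeyHash);

SELECT COUNT(*)
FROM dbo.FactSales AS f
JOIN dbo.DimCustomer AS d
    ON d.BusinessKeyHash = f.BusinessKeyHash   -- narrow, int-sized comparison does most of the work
   AND d.BusinessKey     = f.BusinessKey;      -- CHECKSUM is not collision-free, so keep the real key too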
Scott Pletcher, Senior DBA
Most Valuable Expert 2018
Top Expert 2014

Commented:
I don't agree that 6 million rows joining to 4 million rows is a "small number of rows". That's a fairly large join, especially if it can't be done efficiently.

You also have to keep in mind that SQL has to do a case-insensitive match on char data, which it does not have to worry about for int data.
skullnobrains

Commented:
Which is why I mentioned "properly indexed".

Additionally, even with no index and a bad execution plan, the difference will still hardly be noticeable. Feel free to run a benchmark.

I see no reason why the comparison would be case-insensitive. That would add a small additional CPU load, likely consuming about 0.01 seconds of CPU time if the indexes are in memory. If they are not, the implied I/O toll would hinder performance far more than whatever CPU cost the case-insensitivity adds.

Also note that in most use cases there would be a WHERE clause selecting far fewer rows on the left side of the join. So again, this is entirely I/O bound.
Scott Pletcher, Senior DBA
Most Valuable Expert 2018
Top Expert 2014

Commented:
"I see no reason why the comparison would be case insensitive."

Because the underlying collations in SQL Server are typically case-insensitive by default. Are you saying you expect SQL to do a case-sensitive search on a column with a case-insensitive collation?

I also can't imagine why you'd expect 10M total rows to already be in memory.  That would be extremely unusual unless they were extraordinarily active tables.
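For anyone wanting to verify rather than assume, the collation actually in effect on the join columns can be checked with a query like this (table and column names are placeholders):

SELECT OBJECT_NAME(c.object_id) AS table_name,
       c.name                   AS column_name,
       c.collation_name
FROM sys.columns AS c
WHERE c.object_id IN (OBJECT_ID('dbo.FactSales'), OBJECT_ID('dbo.DimCustomer'))
  AND c.name = 'BusinessKey';
-- A collation name containing CI (e.g. SQL_Latin1_General_CP1_CI_AS) is case-insensitive.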
skullnobrains

Commented:
My bad regarding the case-insensitivity assumption: I missed that the question was about SQL Server and was answering for the general SQL case. That said, it's still only a little CPU compared with the implied I/O.

I expect the INDEXES to be in memory: a varchar(20) would require about 22 bytes per row, tops, probably half that in a real-life scenario. Times 10M rows, that produces a few hundred megabytes. I'd assume a modern, mildly busy server, with gigabytes or rather tens of gigabytes of RAM, would keep 200-300 MB worth of indexes cached.

Again, 10M rows is really small. Assuming there is an index, my laptop would handle the join before I release the Enter key, as long as I don't write the query results to disk.

If the query needs to fetch data from disk, the extra CPU cost would be next to nothing when the query already takes a few seconds to complete due to I/O anyway.

That said, I do not believe such joins are good practice.
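Whether those index pages really do sit in the buffer pool is something that can be measured rather than guessed; a query along these lines against the standard DMVs (placeholder table name again) shows how much of each index is currently cached:

SELECT i.name AS index_name,
       COUNT(*) * 8 / 1024 AS cached_mb       -- buffer pages are 8 KB each
FROM sys.dm_os_buffer_descriptors AS bd
JOIN sys.allocation_units AS au
    ON au.allocation_unit_id = bd.allocation_unit_id
   AND au.type IN (1, 3)                      -- in-row and row-overflow data
JOIN sys.partitions AS p
    ON p.hobt_id = au.container_id
JOIN sys.indexes AS i
    ON i.object_id = p.object_id AND i.index_id = p.index_id
WHERE bd.database_id = DB_ID()
  AND p.object_id = OBJECT_ID('dbo.DimCustomer')
GROUP BY i.name
ORDER BY cached_mb DESC;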
Scott Pletcher, Senior DBA
Most Valuable Expert 2018
Top Expert 2014

Commented:
I don't follow.  I can't see SQL  matching 10M rows on a single-column index and then having to do 10M lookups.  

"varchar 20 would require about 22 bytes per row tops"
Even then, even those index rows won't stay in cache forever.  If this is a daily process, say, I'd never expect those rows to still be buffered from the previous day.

OP explicitly stated varchar(200).  And row overhead is always a minimum of 7 bytes.

Best for matching all rows would be for the tables to be clustered on the join key, which means the rows being read will be wider, possibly quite wide (especially if they are developer-"designed" tables, i.e., with virtually no true normalization done).
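A sketch of that clustering suggestion, with invented names (note that building a clustered index on tables this size is an expensive operation, and any existing clustered index or clustered primary key would have to be dropped or rebuilt first, so treat this as a design illustration rather than a script to run on a live warehouse):

CREATE CLUSTERED INDEX CIX_DimCustomer_BusinessKey
    ON dbo.DimCustomer (BusinessKey);

CREATE CLUSTERED INDEX CIX_FactSales_BusinessKey
    ON dbo.FactSales (BusinessKey);
-- With both tables physically ordered on the key, the optimizer can use a merge join and
-- read each table once in key order instead of doing millions of random lookups.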
skullnobrains

Commented:
If this is a daily process, I second that thought.

Best to benchmark it in a real-life scenario.

In a batch use case, assuming all the data on the left side of the join is read and the varchar is 200 bytes rather than 20, the difference would be quite noticeable: a few seconds, i.e. the time required to read a few GB from disk rather than a few tens of MB.
Scott Pletcher, Senior DBA
Most Valuable Expert 2018
Top Expert 2014

Commented:
Again, you can't read just an index with only the key column.  You'll need to read additional columns, which will cause additional I/O.

There is just no way that reading 6M+ and 4M+ rows of at least 300 bytes each will be "small data" that can stay in buffers, not likely for even an hour on a reasonably busy server.
skullnobrains

Commented:
I see no reason to assume the use case actually reads through all of the 6M rows.

Again, I would not do or recommend it either, especially on a 200-byte field without specific optimizations.

I guess we're actually saying the same thing, maybe just thinking of different use cases. With 200 bytes and a regular index, I can think of a few cases where the difference would simply kill the server.

@eviglotti: would you mind giving a few hints regarding the use case: the number of rows that would actually be looked up, how often this runs, how scattered the matching rows would be on the right-hand side, ...

@scott

"Again, you can't read just an index with only the key column. You'll need to read additional columns, which will cause additional I/O."

I'm the one who does not follow now. You'd have to read those other columns whatever the join is based on... and if you have to read through all the table's data anyway, the difference will be negligible.

Author

Commented:
I have been following along on all of the comments, so thank you.

A few thoughts:

1. I'm not a DBA, so I don't know too much about clustering and indexes, but I can mention that the varchar(200) on both the fact and the dimension is indexed, just with a normal index created through the MS SQL Server GUI, which is a non-clustered, non-unique index.

2. Technically, not all 200 characters are used; realistically these values are typically only 20 or so characters long. We just kept the columns at 200 to be safe. Does that matter at all?

3. In actual practice, the fact table is typically filtered to maybe a couple hundred thousand records and then joined to the various dimensions. Some of the dimensions are super small - say only a couple hundred records, but this one large dimension is the one with 4mil or so records.

4. We typically may only do this a few times a day and most all of the data warehouse tables are truncated and reloaded each and every day.

5. Not entirely sure about how scattered the rows are since I'm not typically diving into the low level technical details on how the data is stored, fragmented, etc in SQL server.

I'm not sure that helps too much, but based on that at least, what's the general consensus, would a join of this type take twice as long to run with varchar(200) keys vs just standard int keys or are we generally thinking that the speed difference would be fairly negligible? I know obviously I could just try it and time it, but I'm just trying to be more objective and have a deeper understanding of this all. I do know that @skullnobrains was saying this isn't good practice, which was my general intuition on this whole thing as well, but I can't quite put my finger on just why.

Thanks everyone!
skullnobrains

Commented:
2) It matters a lot. Keeping them declared as 200 can also make things a tiny bit slower.
Integers will likely use up 4 bytes (it can be less or more depending on the range, but a small number anyway).
Varchars store the length of the data and then the data itself, so the per-row storage required would be a couple of bytes of overhead plus the average data length.
Assuming the above is correct, that would produce an index roughly 5 or 6 times bigger overall: about 16 MB vs. roughly 100 MB, plus whatever overhead the index structure itself adds.
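Rather than relying on that back-of-the-envelope estimate, the actual index sizes and row counts can be read straight from the DMVs (placeholder table name):

SELECT i.name AS index_name,
       SUM(ps.used_page_count) * 8 / 1024 AS used_mb,
       SUM(ps.row_count)                  AS row_cnt
FROM sys.dm_db_partition_stats AS ps
JOIN sys.indexes AS i
    ON i.object_id = ps.object_id AND i.index_id = ps.index_id
WHERE ps.object_id = OBJECT_ID('dbo.DimCustomer')
GROUP BY i.name;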

3) The join is only performed on the filtered rows (after applying whatever filters can be applied on the left-hand side).
SQL Server will either read all the rows or use an index to grab those ~200k rows, and then perform the join, i.e. lookups into the index of the right-hand table.
Most likely you'll load the whole index into memory at that point. Sometimes SQL Server is able to optimize so that the lookups happen in order.

4) You'll probably lose whatever time it takes to load ~100 MB into memory, which should be about 1-2 seconds, and use a little more CPU during the actual join operation, which should be negligible.

The above is based on what you said plus various conjectures. I'm probably off by a number of factors, and it is quite possible others would disagree with this analysis. @scott seems to have quite some knowledge, and I'm honestly not sure my analysis and conjectures are more spot-on than his.

If you really want a deep understanding, you'd need to both time the queries and look at their execution plans to figure out what SQL Server does in your specific case. Obviously, once you have run a query, subsequent runs will be faster because more things will be cached in memory.
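One simple way to do that comparison on a test copy (placeholder names; DBCC DROPCLEANBUFFERS empties the buffer cache so the second run isn't flattered by pages the first run just loaded, and it should only be used on a non-production server):

SET STATISTICS IO ON;
SET STATISTICS TIME ON;

DBCC DROPCLEANBUFFERS;                                            -- cold cache, test server only
SELECT COUNT(*)
FROM dbo.FactSales AS f
JOIN dbo.DimCustomer AS d ON d.DimCustomerID = f.DimCustomerID;   -- int surrogate key

DBCC DROPCLEANBUFFERS;
SELECT COUNT(*)
FROM dbo.FactSales AS f
JOIN dbo.DimCustomer AS d ON d.BusinessKey = f.BusinessKey;       -- varchar(200) key

SET STATISTICS IO OFF;
SET STATISTICS TIME OFF;
-- Compare the logical/physical reads and the CPU vs. elapsed times reported for each run,
-- alongside the actual execution plans.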

A few other thoughts:
- If both are fast and take the same time, everything is already in the cache.
- If both are fast and there is a slight overhead, my conjectures are probably correct.
- If the int IDs are fast and the varchar is much slower, the server is probably reaching a limit where everything will soon get slower.
- If the int IDs are slow and the varchar is much slower, the server lacks RAM and is already paging in and out to perform a scattered join, and things will quickly get worse.
- If both are slow and take the same time, something else is slowing down the query.

Author

Commented:
Thank you everyone!
