Microsoft SQL Server
--
Questions
--
Followers
Top Experts
Thanks!
Zero AI Policy
We believe in human intelligence. Our moderation policy strictly prohibits the use of LLM content in our Q&A threads.
If you have a VARCHAR(200) column that is indexed the join will be faster than using an INT column that is not indexed. Viewing the query execution plan can help you determine what is the best index to create to speed up your query.
That said, if the varchar column really is longer than, say, 30-40 bytes, you will get a noticeable performance hit when joining large numbers of rows.
longer answer : the comparison does not matter much. most of the performance hit comes from the time that is required to load the index pages in memory. with HUGE tables, the hit can be quite noticeable. it is sometimes efficient to only index the first few bytes of the varchar. or use a non unique calculated column with for example the crc32 of the varchar in both tables, and use that for the join.






EARN REWARDS FOR ASKING, ANSWERING, AND MORE.
Earn free swag for participating on the platform.
You also have to keep in mind that SQL has to do a case-insensitive match on char data, which it does have to worry about for int data.
Additionally, if there is no index and a bad exec plan, the difference will still hardly be noticeable. Feel free to run a benchmark.
I see no reason why the comparison would be case insensitive. That would add a small additional cpu load, likely consuming about 0.01 seconds of cpu time if the indexes are in memory. If they are not, the implied i/o toll would hinder performance waaaay more than whatever cpu power is added by the case insensivity.
Also note that in most use cases, there would be a where clause that would select fewer rows on the left side of the join. So again, this is entirely i/o bound
I see no reason why the comparison would be case insensitive.
Because the underlying collations are in SQL Server typically case insensitive by default. Are you saying you expect SQL to do a case-sensitive search on a column(s) with a case-insensitive collation??
I also can't imagine why you'd expect 10M total rows to already be in memory. That would be extremely unusual unless they were extraordinarily active tables.

Get a FREE t-shirt when you ask your first question.
We believe in human intelligence. Our moderation policy strictly prohibits the use of LLM content in our Q&A threads.
i imagine the INDEXES to be in memory : varchar 20 would require about 22 bytes per row tops. probably half in a real life scenario. times 10M produces a few hundred megs. i'd assume a modern mildly busy server with most likely gigabyte or rather tens on gigabytes would keep 2-300Ms worth of indexes in ram.
again 10M rows is really small. assuming there is an index, my laptop would handle the join before i get to release the enter key if i do not write the resulting query results on disk.
if the query needs to fetch on disk data, the extra cpu power whould be next to nothing if the query requires a few seconds to complete due to i/o anyway
that said, i do not believe such joins are good practice
varchar 20 would require about 22 bytes per row topsEven then, even those index rows won't stay in cache forever. If this is a daily process, say, I'd never expect those rows to still be buffered from the previous day.
OP explicitly stated varchar(200). And row overhead is always a minimum of 7 bytes.
Best for matching all rows would be for the tables to be clustered on the join key, which means the rows will be wider, possibly quite wide (esp. if they developer-"designed" tables, i.e., with virtually no true normalization done).
best benchmark it in a real life scenario.
in a batch use case, assuming all the data on the left join is read, and the varchar is 200 bytes rather than 20, the difference would be quite noticeable, by a few seconds which would be the required time to read a few GB on disk. rather than a few tens of MB






EARN REWARDS FOR ASKING, ANSWERING, AND MORE.
Earn free swag for participating on the platform.
There is just no way that reading 6M+ and 4M+ rows of at least 300 bytes each will be "small data" than can stay in buffers, not likely for even an hour on a reasonably busy server.
again, i would not do or recommend it either, espetially on a 200Bytes field without specific optimisations.
i guess we're actually seeing the same thing. maybe just thinking of different use cases. with 200 bytes, and a regular index, i can think of a few where the difference would simply kill the server.
@eviglotti : would you mind giving a few hints regarding the use cases : numer of rows that would actually be looked for, how often that runs, how scattered the rows would be on the right hand side, ...
Again, you can't read just an index with only the key column. You'll need to read additional columns, which will cause additional I/O.
i'm the one who does not follow, now. you'd have to read these other columns whatever the join is based on... if you do have to read through all the table's data anyway, the difference will be neglectible.

Get a FREE t-shirt when you ask your first question.
We believe in human intelligence. Our moderation policy strictly prohibits the use of LLM content in our Q&A threads.
A few thoughts:
1. I'm not a DBA, so I don't know too much about clustering and indexes, but I can mention that the varchar(200) on both the fact and dimension are indexed, but just with the normal MS SQL Server GUI index, which is a non-clustered, non-unique index.
2. Technically not all 200 length is filled, realistically these are typically only 20 or so characters long, we just kept them at 200 to be safe. Does that matter at all?
3. In actual practice, the fact table is typically filtered to maybe a couple hundred thousand records and then joined to the various dimensions. Some of the dimensions are super small - say only a couple hundred records, but this one large dimension is the one with 4mil or so records.
4. We typically may only do this a few times a day and most all of the data warehouse tables are truncated and reloaded each and every day.
5. Not entirely sure about how scattered the rows are since I'm not typically diving into the low level technical details on how the data is stored, fragmented, etc in SQL server.
I'm not sure that helps too much, but based on that at least, what's the general consensus, would a join of this type take twice as long to run with varchar(200) keys vs just standard int keys or are we generally thinking that the speed difference would be fairly negligible? I know obviously I could just try it and time it, but I'm just trying to be more objective and have a deeper understanding of this all. I do know that @skullnobrains was saying this isn't good practice, which was my general intuition on this whole thing as well, but I can't quite put my finger on just why.
Thanks everyone!
Microsoft SQL Server
--
Questions
--
Followers
Top Experts
Microsoft SQL Server is a suite of relational database management system (RDBMS) products providing multi-user database access functionality.SQL Server is available in multiple versions, typically identified by release year, and versions are subdivided into editions to distinguish between product functionality. Component services include integration (SSIS), reporting (SSRS), analysis (SSAS), data quality, master data, T-SQL and performance tuning.