First off, I'm a SQL Server newbie and I may have gotten in over my head a little here, but you gotta learn somehow.
I have read everything I can get my hands on about the new Fuzzy Grouping feature in SSIS, and I have created a package that looks for duplicates in one of my DB tables. The table has 6 fields and about half a million rows. I need the package to use Fuzzy Grouping to look for "near duplicates" in the table and copy the "duplicates" to another table, where they can be reviewed and eventually have their IDs resolved so that only one entry for each actual "entity" exists in the table.
The package I created works great in my test environment (much smaller table), but when it is run on the production server against the large table, it takes almost a day to finish, and the last time I ran it the combined size of the TempDB files was several hundred gigabytes!
I read on MSDN that the size of TempDB can become "quite large", but that's about as descriptive as they get. I'm sure there is some basic step I am missing that would keep TempDB from growing out of control, but like I said, I'm new at this stuff, and I may have tried to "run before I really knew how to walk", so to speak. Regardless, I need to make this work somehow, and if anyone can offer some advice I would greatly appreciate it.