Phreak3eb
asked on
Process Large Volume of Records
I'm looking for a way to improve the performance of inserting/updating a large batch of records in a SQL database. I can't use any native SQL tools like DTS or SSIS to get the data in there, so I'm stuck using C# to read in a CSV file, compare each record with what is in the SQL table, and insert it if it's not there. My code works, but it's far from efficient. I've tried a couple of things, such as changing one of my foreach loops to a for loop, and I'm now using an array of objects instead of a List<>.
Right now the CSV file gets parsed into a DataTable with over 150,000 records in it. The SQL table doesn't have anything in it at this point, so all of these records will be inserted, which calls for over 150,000 INSERT statements. Once they're all in there, the app will get a new CSV file each week, compare all the records again, update any record where ANY property has changed, and add anything new, which should only be 1,000 or so records.
The problem is the amount of memory this process uses. Because of all the SQL calls, SQL Server keeps climbing in memory usage; I let the app run for about 10 minutes yesterday and it got up to about 600 MB of RAM and kept climbing. I've even tried calling GC.Collect() after every 500 records or so, and that doesn't seem to help either.
I've attached the code here to see if anyone can shed some light on optimizations or techniques I can use for processing this many records.
private void ParseMovieData(DataTable dataTable)
{
    Movie[] movies = repository.FindAll().ToArray();
    int countCreated = 0;
    int countUpdated = 0;
    int countTotal = 0;
    int movieCount = movies.Length;
    Movie movie = null;
    DataRow row;
    bool update;
    do
    {
        row = dataTable.Rows[0];
        update = false;
        for (int i = 0; i < movieCount; i++)
        {
            movie = movies[i];
            if (string.IsNullOrEmpty(row.ItemArray[0].ToString()) ||
                row.ItemArray[0].ToString().ToUpper() != movie.Name.ToUpper()) continue;
            update = true;
            break;
        }
        if (update)
        {
            UpdateMovie(repository, row, movie);
            countUpdated++;
            Console.Write("{0}: Added: {1}. Updated: {2}.\r", DateTime.Now, countCreated, countUpdated);
        }
        else
        {
            AddMovie(repository, row, new Movie());
            countCreated++;
            Console.Write("{0}: Added: {1}. Updated: {2}.\r", DateTime.Now, countCreated, countUpdated);
        }
        dataTable.Rows.Remove(row);
        countTotal++; // was never incremented in the original, so the GC.Collect block below never ran
        if (countTotal > 500)
        {
            GC.Collect();
            countTotal = 0;
        }
    } while (dataTable.Rows.Count > 0);
    UpdateLog(true, countCreated, countUpdated, "Update success.");
}
private static void AddMovie(IRepository<Movie> repository, DataRow row, Movie movie)
{
    movie.Name = row.ItemArray[0].ToString();
    movie.Studio = row.ItemArray[1].ToString();
    if (!string.IsNullOrEmpty(row.ItemArray[2].ToString()))
    {
        movie.Released = DateTime.Parse(row.ItemArray[2].ToString());
    }
    else
    {
        movie.Released = null;
    }
    movie.Status = row.ItemArray[3].ToString();
    movie.Sound = row.ItemArray[4].ToString();
    movie.Versions = row.ItemArray[5].ToString();
    if (row.ItemArray[6] != null && !string.IsNullOrEmpty(row.ItemArray[6].ToString()))
    {
        movie.Price = Convert.ToDouble(row.ItemArray[6]);
    }
    else
    {
        movie.Price = null;
    }
    movie.Rating = row.ItemArray[7].ToString();
    movie.Year = row.ItemArray[8].ToString();
    movie.Genre = row.ItemArray[9].ToString();
    movie.Aspect = row.ItemArray[10].ToString();
    if (row.ItemArray[11] != null && !string.IsNullOrEmpty(row.ItemArray[11].ToString()))
    {
        movie.UPC = Convert.ToInt64(row.ItemArray[11]);
    }
    else
    {
        movie.UPC = null;
    }
    if (!string.IsNullOrEmpty(row.ItemArray[12].ToString()))
    {
        movie.DVDReleaseDate = DateTime.Parse(row.ItemArray[12].ToString());
    }
    else
    {
        movie.DVDReleaseDate = null;
    }
    if (row.ItemArray[13] != null && !string.IsNullOrEmpty(row.ItemArray[13].ToString()))
    {
        movie.Dbid = Convert.ToInt64(row.ItemArray[13]);
    }
    else
    {
        movie.Dbid = null;
    }
    if (!string.IsNullOrEmpty(row.ItemArray[14].ToString()))
    {
        movie.TimeStamp = DateTime.Parse(row.ItemArray[14].ToString());
    }
    else
    {
        movie.TimeStamp = null;
    }
    repository.Create(movie);
}
private static void UpdateMovie(IRepository<Movie> repository, DataRow row, Movie movie)
{
    bool update = false;
    #region Set Properties
    #region Set Name
    if ((!string.IsNullOrEmpty((string)row.ItemArray[0])) && movie.Name != (string)row.ItemArray[0])
    {
        movie.Name = (string)row.ItemArray[0];
        update = true;
    }
    #endregion
    #region Set Studio
    if (movie.Studio != GetStringToUpdate(row.ItemArray[1], movie.Studio))
    {
        movie.Studio = GetStringToUpdate(row.ItemArray[1], movie.Studio);
        update = true;
    }
    #endregion
    #region Set Released Date
    if (movie.Released != GetDateToUpdate(row.ItemArray[2], movie.Released))
    {
        movie.Released = GetDateToUpdate(row.ItemArray[2], movie.Released);
        update = true;
    }
    #endregion
    #region Set Status
    if (movie.Status != GetStringToUpdate(row.ItemArray[3], movie.Status))
    {
        movie.Status = GetStringToUpdate(row.ItemArray[3], movie.Status);
        update = true;
    }
    #endregion
    #region Set Sound
    if (movie.Sound != GetStringToUpdate(row.ItemArray[4], movie.Sound))
    {
        movie.Sound = GetStringToUpdate(row.ItemArray[4], movie.Sound);
        update = true;
    }
    #endregion
    #region Set Versions
    if (movie.Versions != GetStringToUpdate(row.ItemArray[5], movie.Versions))
    {
        movie.Versions = GetStringToUpdate(row.ItemArray[5], movie.Versions);
        update = true;
    }
    #endregion
    #region Set Price
    if (movie.Price != GetDoubleToUpdate(row.ItemArray[6], movie.Price))
    {
        movie.Price = GetDoubleToUpdate(row.ItemArray[6], movie.Price);
        update = true;
    }
    #endregion
    #region Set Rating
    if (movie.Rating != GetStringToUpdate(row.ItemArray[7], movie.Rating))
    {
        movie.Rating = GetStringToUpdate(row.ItemArray[7], movie.Rating);
        update = true;
    }
    #endregion
    #region Set Year
    if (movie.Year != GetStringToUpdate(row.ItemArray[8], movie.Year))
    {
        movie.Year = GetStringToUpdate(row.ItemArray[8], movie.Year);
        update = true;
    }
    #endregion
    #region Set Genre
    if (movie.Genre != GetStringToUpdate(row.ItemArray[9], movie.Genre))
    {
        movie.Genre = GetStringToUpdate(row.ItemArray[9], movie.Genre);
        update = true;
    }
    #endregion
    #region Set Aspect
    if (movie.Aspect != GetStringToUpdate(row.ItemArray[10], movie.Aspect))
    {
        movie.Aspect = GetStringToUpdate(row.ItemArray[10], movie.Aspect);
        update = true;
    }
    #endregion
    #region Set UPC
    if (movie.UPC != GetLongToUpdate(row.ItemArray[11], movie.UPC))
    {
        movie.UPC = GetLongToUpdate(row.ItemArray[11], movie.UPC);
        update = true;
    }
    #endregion
    #region Set DVDReleaseDate
    if (movie.DVDReleaseDate != GetDateToUpdate(row.ItemArray[12], movie.DVDReleaseDate))
    {
        movie.DVDReleaseDate = GetDateToUpdate(row.ItemArray[12], movie.DVDReleaseDate);
        update = true;
    }
    #endregion
    #region Set DBID
    if (movie.Dbid != GetLongToUpdate(row.ItemArray[13], movie.Dbid))
    {
        movie.Dbid = GetLongToUpdate(row.ItemArray[13], movie.Dbid);
        update = true;
    }
    #endregion
    #region Set TimeStamp
    if (movie.TimeStamp != GetDateToUpdate(row.ItemArray[14], movie.TimeStamp))
    {
        movie.TimeStamp = GetDateToUpdate(row.ItemArray[14], movie.TimeStamp);
        update = true;
    }
    #endregion
    #endregion
    if (update)
    {
        repository.Update(movie);
    }
}
private static long? GetLongToUpdate(object item, long? property)
{
    if (item != null)
    {
        if (property != Convert.ToInt64(item))
        {
            return Convert.ToInt64(item);
        }
    }
    else
    {
        return null;
    }
    return property;
}
private static double? GetDoubleToUpdate(object item, double? property)
{
    if (item != null)
    {
        if (property != Convert.ToDouble(item))
        {
            return Convert.ToDouble(item);
        }
    }
    else
    {
        return null;
    }
    return property;
}
private static string GetStringToUpdate(object item, string property)
{
    if (item != null)
    {
        if (property != (string)item)
        {
            return (string)item;
        }
    }
    else
    {
        return null;
    }
    return property;
}
private static DateTime? GetDateToUpdate(object item, DateTime? property)
{
    if (item != null && !string.IsNullOrEmpty(item.ToString()))
    {
        if (property != null)
        {
            if (property != DateTime.Parse(item.ToString()))
            {
                return DateTime.Parse(item.ToString());
            }
        }
        else
        {
            return DateTime.Parse(item.ToString());
        }
    }
    else
    {
        return null;
    }
    return property;
}
Well, I'd need more details to give a really sophisticated answer, but this should help, as long as the file is line-delimited (I know you said CSV, but CSVs are usually line-delimited anyway).
private void LoadValuesExample(System.Data.IDbConnection myConnection)
{
    using (System.Data.IDbCommand cmd = myConnection.CreateCommand())
    using (System.IO.StreamReader reader = new System.IO.StreamReader(System.IO.File.OpenRead("Path.csv")))
    {
        // Parameterize the values rather than formatting them into the SQL text
        // (avoids quoting problems and SQL injection). Add one parameter per field.
        cmd.CommandText = "INSERT INTO tableName VALUES(@field0, @field1);";
        System.Data.IDbDataParameter p0 = cmd.CreateParameter();
        p0.ParameterName = "@field0";
        cmd.Parameters.Add(p0);
        System.Data.IDbDataParameter p1 = cmd.CreateParameter();
        p1.ParameterName = "@field1";
        cmd.Parameters.Add(p1);
        while (!reader.EndOfStream)
        {
            // Make sure the ReadLine() calls match the number of parameters
            p0.Value = reader.ReadLine();
            p1.Value = reader.ReadLine();
            cmd.ExecuteNonQuery();
        }
    }
}
Can you use BULK INSERT for this? 150,000 rows is not large... unless you try to do silly things like inserting it one row at a time from C#!
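For reference, a minimal BULK INSERT sketch (the table name, file path, and a header row are assumptions for illustration, not from this thread):

```sql
-- Load the weekly CSV straight into a staging table server-side.
-- Requires the file to be readable by the SQL Server service account.
BULK INSERT MoviesStaging
FROM 'C:\data\movies.csv'
WITH (
    FIELDTERMINATOR = ',',  -- CSV column separator
    ROWTERMINATOR = '\n',   -- one record per line
    FIRSTROW = 2            -- skip the header row, if there is one
);
```

Note this only works for simple CSVs; quoted fields containing commas need a format file or a different loader.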
ASKER
gyuksel:
Could you explain what you mean by or how to do a "BulkCopy"?
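For context: "BulkCopy" refers to the SqlBulkCopy class in System.Data.SqlClient, which streams an entire DataTable into a SQL Server table in one bulk operation. A minimal sketch (the connection string and staging table name are placeholders, not from this thread):

```csharp
using System.Data;
using System.Data.SqlClient;

static void BulkLoad(DataTable dataTable)
{
    // Push an already-parsed DataTable into a staging table in one shot,
    // instead of issuing 150,000 individual INSERT statements.
    using (SqlConnection connection = new SqlConnection(
        "Data Source=.;Initial Catalog=MovieDb;Integrated Security=True"))
    {
        connection.Open();
        using (SqlBulkCopy bulkCopy = new SqlBulkCopy(connection))
        {
            bulkCopy.DestinationTableName = "MoviesStaging";
            bulkCopy.BatchSize = 5000; // commit in chunks rather than one huge batch
            bulkCopy.WriteToServer(dataTable);
        }
    }
}
```

Column order must match the destination table, or you can add explicit ColumnMappings.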
ASKER
Ok. I've been able to get the data from the CSV file into a staging SQL table using OPENROWSET as suggested by dportas and raqi0017. It works really well. Now I'm just left with the problem of updating the records in the main table based on changes to fields in the staging table AND inserting new records from staging table into main table.
Can you guys give me a quick example of how I could use a stored proc that would compare the fields and update as needed and insert records from staging that are new into the main table?
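One common shape for this (a sketch only; the table and column names are guesses based on the columns mentioned elsewhere in the thread) is an UPDATE joined to the staging table, followed by an INSERT of the rows that don't exist yet:

```sql
-- Update matched rows, then insert new ones, keyed on Dbid.
UPDATE m
SET m.Name   = s.Name,
    m.Studio = s.Studio,
    m.Price  = s.Price      -- ...and so on for the remaining columns
FROM Movies AS m
INNER JOIN MoviesStaging AS s ON m.Dbid = s.Dbid;

INSERT INTO Movies (Dbid, Name, Studio, Price)  -- list all columns here
SELECT s.Dbid, s.Name, s.Studio, s.Price
FROM MoviesStaging AS s
WHERE NOT EXISTS (SELECT 1 FROM Movies AS m WHERE m.Dbid = s.Dbid);
```

On SQL Server 2008 and later, a single MERGE statement can do both steps at once.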
ASKER
Ok. I'm trying to use MERGE, but when I use the attached code in a query, I'm getting:
Msg 102, Level 15, State 1, Line 4
Incorrect syntax near 'MERGE'.
Msg 137, Level 15, State 2, Line 5
Must declare the scalar variable "@Dbid".
And this was taken directly off the TechNet link you gave me, so I'm not sure why I'm getting a syntax error.
BEGIN
SET NOCOUNT ON;
MERGE Movies.Dbid AS target
USING (SELECT @Dbid, @Status, @Studio, @Released, @Sound, @Versions, @Price, @Rating, @Year, @Genre, @Aspect, @UPC, @DVDReleaseDate, @TimeStamp, @name) AS source (Dbid, Status, Studio, Released, Sound, Versions, Price, Rating, Year, Genre, Aspect, UPC, DVDReleaseDate, TimeStamp, Name)
ON (target.Dbid = source.Dbid)
WHEN MATCHED THEN
UPDATE SET Name = source.Name,
Dbid = source.Dbid,
Status = source.Status,
Released = source.Released,
Sound = source.Sound,
Versions = source.Versions,
Price = source.Price,
Rating = source.Rating,
Year = source.Year,
Genre = source.Genre,
Aspect = source.Aspect,
UPC = source.UPC,
DVDReleaseDate = source.DVDReleaseDate,
TimeStamp = source.TimeStamp
WHEN NOT MATCHED THEN
INSERT (Dbid, Status, Studio, Released, Sound, Versions, Price, Rating, Year, Genre, Aspect, UPC, DVDReleaseDate, TimeStamp, Name)
VALUES (source.Dbid, source.Status, source.Studio, source.Released, source.Sound, source.Versions, source.Price, source.Rating, source.Year, source.Genre, source.Aspect, source.UPC, source.DVDReleaseDate, source.TimeStamp, source.Name)
--OUTPUT deleted.*, $action, inserted.* INTO #MyTempTable;
END;
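(The "Must declare the scalar variable" error means @Dbid and the other variables only exist as stored-procedure parameters or local DECLAREs; a trimmed sketch of the procedure shell such a MERGE assumes, with the parameter list shortened and names illustrative:)

```sql
-- Sketch: a MERGE using @variables needs them declared as proc parameters.
-- Requires SQL Server 2008 or later.
CREATE PROCEDURE dbo.UpsertMovie
    @Dbid bigint,
    @Name nvarchar(255)    -- ...remaining parameters omitted for brevity
AS
BEGIN
    SET NOCOUNT ON;
    MERGE Movies AS target
    USING (SELECT @Dbid, @Name) AS source (Dbid, Name)
    ON (target.Dbid = source.Dbid)
    WHEN MATCHED THEN
        UPDATE SET Name = source.Name
    WHEN NOT MATCHED THEN
        INSERT (Dbid, Name) VALUES (source.Dbid, source.Name);
END;
```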
You posted in the SQL Server 2008 zone, are you sure you are using SQL Server 2008? MERGE isn't supported in earlier versions.
It also seems you haven't declared all your variables.
ASKER
Yeah. I realized that my dev box is using SQL 2005. I'll try the 2008 box later.
I would just achieve the same functionality using SELECT statements, something like this:
INSERT INTO tablename (colnames........)
SELECT colnames.....
FROM temptable
WHERE <condition to exclude existing records and find the unique ones>
ASKER
Ok. I got the MERGE statement to work... kind of. If there aren't rows in the target table that match to the source table, it inserts them as it should. However, if I change the Studio field for instance in the source table, it won't update it in the target table. Here is the exact MERGE statement I'm using. (Slightly modified from the one I made in a previous post).
MERGE INTO [Movies] As [target]
USING (SELECT [Dbid], [Status], Studio, Released, Sound, Versions, Price, Rating, [Year], Genre, Aspect, UPC, DVDReleaseDate, [TimeStamp], Name FROM MoviesStaging) AS source
ON ([target].[Dbid] = source.[Dbid])
WHEN MATCHED THEN
UPDATE SET Name = source.Name,
[Dbid] = source.[Dbid],
[Status] = source.[Status],
Released = source.Released,
Sound = source.Sound,
Versions = source.Versions,
Price = source.Price,
Rating = source.Rating,
[Year] = source.[Year],
Genre = source.Genre,
Aspect = source.Aspect,
UPC = source.UPC,
DVDReleaseDate = source.DVDReleaseDate,
[TimeStamp] = source.[TimeStamp]
WHEN NOT MATCHED THEN
INSERT ([Dbid], [Status], Studio, Released, Sound, Versions, Price, Rating, [Year], Genre, Aspect, UPC, DVDReleaseDate, [TimeStamp], Name)
VALUES (source.[Dbid], source.[Status], source.Studio, source.Released, source.Sound, source.Versions, source.Price, source.Rating, source.[Year], source.Genre, source.Aspect, source.UPC, source.DVDReleaseDate, source.[TimeStamp], source.Name);
ASKER
Nevermind. I just realized I didn't have the Studio field in my update statement. Everything is working fine. Thank you for your help.
ASKER
50 to raqi0017 for providing links to examples I used to get the final OPENROWSET statement to work correctly.
I do something similar in an application I created. To make it efficient and really fast, I grab the CSV, parse it into a DataTable object, then do a BulkCopy into SQL Server into a temporary table. From there, it is fairly easy to write a stored procedure that checks whether each record exists or requires updating. You can cut your processing time and memory usage by about 80-90%.
Cheers,
G.