Phreak3eb
asked on
Process Large Volume of Records
I'm looking for a way to improve the performance of inserting/updating a large batch of records in a SQL database. I can't use any native SQL tools like DTS or SSIS to get the data in there, so I'm stuck using C# to read in a CSV file, compare each record with what is in the SQL table, and insert it if it's not there. My code works, but it's far from efficient. I've tried a couple of things, such as changing one of my foreach loops to a for loop, and I'm now using an array of objects instead of a List<>.
Right now the CSV file gets parsed into a DataTable with over 150,000 records in it. The SQL table doesn't have anything in it at this point, so all of these records will be inserted, which calls for over 150,000 INSERT statements. Once they're all in there, the app will get a new CSV file each week, compare all the records again, update any record where ANY property has changed, and add anything new, which should only be 1,000 or so records.
The problem is the amount of memory this process uses. Because of all the SQL calls, SQL Server keeps climbing in memory usage; I let the app run for about 10 minutes yesterday and it got up to about 600 MB of RAM and kept climbing. I've even tried calling GC.Collect() after every 500 records or so, and that doesn't seem to help either.
I've attached the code here to see if anyone can shed some light on optimizations or techniques I can use for processing this many records.
private void ParseMovieData(DataTable dataTable)
{
    Movie[] movies = repository.FindAll().ToArray();
    int countCreated = 0;
    int countUpdated = 0;
    int countTotal = 0;
    int movieCount = movies.Length;
    Movie movie = null;
    DataRow row;
    bool update;
    do
    {
        row = dataTable.Rows[0];
        update = false;
        for (int i = 0; i < movieCount; i++)
        {
            movie = movies[i];
            if (string.IsNullOrEmpty(row.ItemArray[0].ToString()) ||
                row.ItemArray[0].ToString().ToUpper() != movie.Name.ToUpper()) continue;
            update = true;
            break;
        }
        if (update)
        {
            UpdateMovie(repository, row, movie);
            countUpdated++;
            Console.Write("{0}: Added: {1}. Updated: {2}.\r", DateTime.Now, countCreated, countUpdated);
        }
        else
        {
            AddMovie(repository, row, new Movie());
            countCreated++;
            Console.Write("{0}: Added: {1}. Updated: {2}.\r", DateTime.Now, countCreated, countUpdated);
        }
        dataTable.Rows.Remove(row);
        countTotal++; // was never incremented in the original, so the GC.Collect block below never ran
        if (countTotal > 500)
        {
            GC.Collect();
            countTotal = 0;
        }
    } while (dataTable.Rows.Count > 0);
    UpdateLog(true, countCreated, countUpdated, "Update success.");
}
private static void AddMovie(IRepository<Movie> repository, DataRow row, Movie movie)
{
    movie.Name = row.ItemArray[0].ToString();
    movie.Studio = row.ItemArray[1].ToString();
    if (!string.IsNullOrEmpty(row.ItemArray[2].ToString()))
    {
        movie.Released = DateTime.Parse(row.ItemArray[2].ToString());
    }
    else
    {
        movie.Released = null;
    }
    movie.Status = row.ItemArray[3].ToString();
    movie.Sound = row.ItemArray[4].ToString();
    movie.Versions = row.ItemArray[5].ToString();
    if (row.ItemArray[6] != null && !string.IsNullOrEmpty(row.ItemArray[6].ToString()))
    {
        movie.Price = Convert.ToDouble(row.ItemArray[6]);
    }
    else
    {
        movie.Price = null;
    }
    movie.Rating = row.ItemArray[7].ToString();
    movie.Year = row.ItemArray[8].ToString();
    movie.Genre = row.ItemArray[9].ToString();
    movie.Aspect = row.ItemArray[10].ToString();
    if (row.ItemArray[11] != null && !string.IsNullOrEmpty(row.ItemArray[11].ToString()))
    {
        movie.UPC = Convert.ToInt64(row.ItemArray[11]);
    }
    else
    {
        movie.UPC = null;
    }
    if (!string.IsNullOrEmpty(row.ItemArray[12].ToString()))
    {
        movie.DVDReleaseDate = DateTime.Parse(row.ItemArray[12].ToString());
    }
    else
    {
        movie.DVDReleaseDate = null;
    }
    if (row.ItemArray[13] != null && !string.IsNullOrEmpty(row.ItemArray[13].ToString()))
    {
        movie.Dbid = Convert.ToInt64(row.ItemArray[13]);
    }
    else
    {
        movie.Dbid = null;
    }
    if (!string.IsNullOrEmpty(row.ItemArray[14].ToString()))
    {
        movie.TimeStamp = DateTime.Parse(row.ItemArray[14].ToString());
    }
    else
    {
        movie.TimeStamp = null;
    }
    repository.Create(movie);
}
private static void UpdateMovie(IRepository<Movie> repository, DataRow row, Movie movie)
{
    bool update = false;
    #region Set Properties
    #region Set Name
    if ((!string.IsNullOrEmpty((string)row.ItemArray[0])) && movie.Name != (string)row.ItemArray[0])
    {
        movie.Name = (string)row.ItemArray[0];
        update = true;
    }
    #endregion
    #region Set Studio
    if (movie.Studio != GetStringToUpdate(row.ItemArray[1], movie.Studio))
    {
        movie.Studio = GetStringToUpdate(row.ItemArray[1], movie.Studio);
        update = true;
    }
    #endregion
    #region Set Released Date
    if (movie.Released != GetDateToUpdate(row.ItemArray[2], movie.Released))
    {
        movie.Released = GetDateToUpdate(row.ItemArray[2], movie.Released);
        update = true;
    }
    #endregion
    #region Set Status
    if (movie.Status != GetStringToUpdate(row.ItemArray[3], movie.Status))
    {
        movie.Status = GetStringToUpdate(row.ItemArray[3], movie.Status);
        update = true;
    }
    #endregion
    #region Set Sound
    if (movie.Sound != GetStringToUpdate(row.ItemArray[4], movie.Sound))
    {
        movie.Sound = GetStringToUpdate(row.ItemArray[4], movie.Sound);
        update = true;
    }
    #endregion
    #region Set Versions
    if (movie.Versions != GetStringToUpdate(row.ItemArray[5], movie.Versions))
    {
        movie.Versions = GetStringToUpdate(row.ItemArray[5], movie.Versions);
        update = true;
    }
    #endregion
    #region Set Price
    if (movie.Price != GetDoubleToUpdate(row.ItemArray[6], movie.Price))
    {
        movie.Price = GetDoubleToUpdate(row.ItemArray[6], movie.Price);
        update = true;
    }
    #endregion
    #region Set Rating
    if (movie.Rating != GetStringToUpdate(row.ItemArray[7], movie.Rating))
    {
        movie.Rating = GetStringToUpdate(row.ItemArray[7], movie.Rating);
        update = true;
    }
    #endregion
    #region Set Year
    if (movie.Year != GetStringToUpdate(row.ItemArray[8], movie.Year))
    {
        movie.Year = GetStringToUpdate(row.ItemArray[8], movie.Year);
        update = true;
    }
    #endregion
    #region Set Genre
    if (movie.Genre != GetStringToUpdate(row.ItemArray[9], movie.Genre))
    {
        movie.Genre = GetStringToUpdate(row.ItemArray[9], movie.Genre);
        update = true;
    }
    #endregion
    #region Set Aspect
    if (movie.Aspect != GetStringToUpdate(row.ItemArray[10], movie.Aspect))
    {
        movie.Aspect = GetStringToUpdate(row.ItemArray[10], movie.Aspect);
        update = true;
    }
    #endregion
    #region Set UPC
    if (movie.UPC != GetLongToUpdate(row.ItemArray[11], movie.UPC))
    {
        movie.UPC = GetLongToUpdate(row.ItemArray[11], movie.UPC);
        update = true;
    }
    #endregion
    #region Set DVDReleaseDate
    if (movie.DVDReleaseDate != GetDateToUpdate(row.ItemArray[12], movie.DVDReleaseDate))
    {
        movie.DVDReleaseDate = GetDateToUpdate(row.ItemArray[12], movie.DVDReleaseDate);
        update = true;
    }
    #endregion
    #region Set DBID
    if (movie.Dbid != GetLongToUpdate(row.ItemArray[13], movie.Dbid))
    {
        movie.Dbid = GetLongToUpdate(row.ItemArray[13], movie.Dbid);
        update = true;
    }
    #endregion
    #region Set TimeStamp
    if (movie.TimeStamp != GetDateToUpdate(row.ItemArray[14], movie.TimeStamp))
    {
        movie.TimeStamp = GetDateToUpdate(row.ItemArray[14], movie.TimeStamp);
        update = true;
    }
    #endregion
    #endregion
    if (update)
    {
        repository.Update(movie);
    }
}
private static long? GetLongToUpdate(object item, long? property)
{
    if (item != null)
    {
        if (property != Convert.ToInt64(item))
        {
            return Convert.ToInt64(item);
        }
    }
    else
    {
        return null;
    }
    return property;
}
private static double? GetDoubleToUpdate(object item, double? property)
{
    if (item != null)
    {
        if (property != Convert.ToDouble(item))
        {
            return Convert.ToDouble(item);
        }
    }
    else
    {
        return null;
    }
    return property;
}
private static string GetStringToUpdate(object item, string property)
{
    if (item != null)
    {
        if (property != (string)item)
        {
            return (string)item;
        }
    }
    else
    {
        return null;
    }
    return property;
}
private static DateTime? GetDateToUpdate(object item, DateTime? property)
{
    if (item != null && !string.IsNullOrEmpty(item.ToString()))
    {
        if (property != null)
        {
            if (property != DateTime.Parse(item.ToString()))
            {
                return DateTime.Parse(item.ToString());
            }
        }
        else
        {
            return DateTime.Parse(item.ToString());
        }
    }
    else
    {
        return null;
    }
    return property;
}
Well, I'd need more details to give a really sophisticated answer, but this should help, as long as the file is line-delimited (I know you said CSV, but CSVs are usually line-delimited anyway).
private void LoadValuesExample(System.Data.IDbConnection myConnection)
{
    using (System.Data.IDbCommand cmd = myConnection.CreateCommand())
    using (System.IO.StreamReader reader = new System.IO.StreamReader(System.IO.File.OpenRead("Path.csv")))
    {
        // Parameterize the values rather than formatting them into the SQL text
        // (avoids quoting problems and SQL injection). Add one parameter per field.
        cmd.CommandText = "INSERT INTO tableName VALUES(@field0, @field1);";
        System.Data.IDbDataParameter p0 = cmd.CreateParameter();
        p0.ParameterName = "@field0";
        cmd.Parameters.Add(p0);
        System.Data.IDbDataParameter p1 = cmd.CreateParameter();
        p1.ParameterName = "@field1";
        cmd.Parameters.Add(p1);
        while (!reader.EndOfStream)
        {
            // Make sure the ReadLine() calls match the number of parameters
            p0.Value = reader.ReadLine();
            p1.Value = reader.ReadLine();
            cmd.ExecuteNonQuery();
        }
    }
}
Can you use BULK INSERT for this? 150,000 rows is not large... unless you try to do silly things like inserting it one row at a time from C#!
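For reference, a minimal BULK INSERT sketch (the table name, file path, and a header row are assumptions for illustration, not from this thread):

```sql
-- Load the weekly CSV straight into a staging table server-side.
-- Requires the file to be readable by the SQL Server service account.
BULK INSERT MoviesStaging
FROM 'C:\data\movies.csv'
WITH (
    FIELDTERMINATOR = ',',  -- CSV column separator
    ROWTERMINATOR = '\n',   -- one record per line
    FIRSTROW = 2            -- skip the header row, if there is one
);
```

Note this only works for simple CSVs; quoted fields containing commas need a format file or a different loader.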
ASKER
gyuksel:
Could you explain what you mean by or how to do a "BulkCopy"?
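For context: "BulkCopy" refers to the SqlBulkCopy class in System.Data.SqlClient, which streams an entire DataTable into a SQL Server table in one bulk operation. A minimal sketch (the connection string and staging table name are placeholders, not from this thread):

```csharp
using System.Data;
using System.Data.SqlClient;

static void BulkLoad(DataTable dataTable)
{
    // Push an already-parsed DataTable into a staging table in one shot,
    // instead of issuing 150,000 individual INSERT statements.
    using (SqlConnection connection = new SqlConnection(
        "Data Source=.;Initial Catalog=MovieDb;Integrated Security=True"))
    {
        connection.Open();
        using (SqlBulkCopy bulkCopy = new SqlBulkCopy(connection))
        {
            bulkCopy.DestinationTableName = "MoviesStaging";
            bulkCopy.BatchSize = 5000; // commit in chunks rather than one huge batch
            bulkCopy.WriteToServer(dataTable);
        }
    }
}
```

Column order must match the destination table, or you can add explicit ColumnMappings.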
ASKER
Ok. I've been able to get the data from the CSV file into a staging SQL table using OPENROWSET as suggested by dportas and raqi0017. It works really well. Now I'm just left with the problem of updating the records in the main table based on changes to fields in the staging table AND inserting new records from staging table into main table.
Can you guys give me a quick example of how I could use a stored proc that would compare the fields and update as needed and insert records from staging that are new into the main table?
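One common shape for this (a sketch only; the table and column names are guesses based on the columns mentioned elsewhere in the thread) is an UPDATE joined to the staging table, followed by an INSERT of the rows that don't exist yet:

```sql
-- Update matched rows, then insert new ones, keyed on Dbid.
UPDATE m
SET m.Name   = s.Name,
    m.Studio = s.Studio,
    m.Price  = s.Price      -- ...and so on for the remaining columns
FROM Movies AS m
INNER JOIN MoviesStaging AS s ON m.Dbid = s.Dbid;

INSERT INTO Movies (Dbid, Name, Studio, Price)  -- list all columns here
SELECT s.Dbid, s.Name, s.Studio, s.Price
FROM MoviesStaging AS s
WHERE NOT EXISTS (SELECT 1 FROM Movies AS m WHERE m.Dbid = s.Dbid);
```

On SQL Server 2008 and later, a single MERGE statement can do both steps at once.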
ASKER
Ok. I'm trying to use MERGE, but when I use the attached code in a query, I'm getting:
Msg 102, Level 15, State 1, Line 4
Incorrect syntax near 'MERGE'.
Msg 137, Level 15, State 2, Line 5
Must declare the scalar variable "@Dbid".
And this was taken directly off the TechNet link you gave me, so I'm not sure why I'm getting a syntax error.
BEGIN
SET NOCOUNT ON;
MERGE Movies.Dbid AS target
USING (SELECT @Dbid, @Status, @Studio, @Released, @Sound, @Versions, @Price, @Rating, @Year, @Genre, @Aspect, @UPC, @DVDReleaseDate, @TimeStamp, @name) AS source (Dbid, Status, Studio, Released, Sound, Versions, Price, Rating, Year, Genre, Aspect, UPC, DVDReleaseDate, TimeStamp, Name)
ON (target.Dbid = source.Dbid)
WHEN MATCHED THEN
UPDATE SET Name = source.Name,
Dbid = source.Dbid,
Status = source.Status,
Released = source.Released,
Sound = source.Sound,
Versions = source.Versions,
Price = source.Price,
Rating = source.Rating,
Year = source.Year,
Genre = source.Genre,
Aspect = source.Aspect,
UPC = source.UPC,
DVDReleaseDate = source.DVDReleaseDate,
TimeStamp = source.TimeStamp
WHEN NOT MATCHED THEN
INSERT (Dbid, Status, Studio, Released, Sound, Versions, Price, Rating, Year, Genre, Aspect, UPC, DVDReleaseDate, TimeStamp, Name)
VALUES (source.Dbid, source.Status, source.Studio, source.Released, source.Sound, source.Versions, source.Price, source.Rating, source.Year, source.Genre, source.Aspect, source.UPC, source.DVDReleaseDate, source.TimeStamp, source.Name)
--OUTPUT deleted.*, $action, inserted.* INTO #MyTempTable;
END;
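(The "Must declare the scalar variable" error means @Dbid and the other variables only exist as stored-procedure parameters or local DECLAREs; a trimmed sketch of the procedure shell such a MERGE assumes, with the parameter list shortened and names illustrative:)

```sql
-- Sketch: a MERGE using @variables needs them declared as proc parameters.
-- Requires SQL Server 2008 or later.
CREATE PROCEDURE dbo.UpsertMovie
    @Dbid bigint,
    @Name nvarchar(255)    -- ...remaining parameters omitted for brevity
AS
BEGIN
    SET NOCOUNT ON;
    MERGE Movies AS target
    USING (SELECT @Dbid, @Name) AS source (Dbid, Name)
    ON (target.Dbid = source.Dbid)
    WHEN MATCHED THEN
        UPDATE SET Name = source.Name
    WHEN NOT MATCHED THEN
        INSERT (Dbid, Name) VALUES (source.Dbid, source.Name);
END;
```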
You posted in the SQL Server 2008 zone, are you sure you are using SQL Server 2008? MERGE isn't supported in earlier versions.
It also seems you haven't declared all your variables.
ASKER
Yeah. I realized that my dev box is using SQL 2005. I'll try the 2008 box later.
I would just achieve the same functionality using SELECT statements, something like this:
INSERT INTO tablename (colnames........)
SELECT colnames.....
FROM temptable
WHERE <condition to exclude existing records and find the unique ones>
ASKER
Ok. I got the MERGE statement to work... kind of. If there aren't rows in the target table that match to the source table, it inserts them as it should. However, if I change the Studio field for instance in the source table, it won't update it in the target table. Here is the exact MERGE statement I'm using. (Slightly modified from the one I made in a previous post).
MERGE INTO [Movies] As [target]
USING (SELECT [Dbid], [Status], Studio, Released, Sound, Versions, Price, Rating, [Year], Genre, Aspect, UPC, DVDReleaseDate, [TimeStamp], Name FROM MoviesStaging) AS source
ON ([target].[Dbid] = source.[Dbid])
WHEN MATCHED THEN
UPDATE SET Name = source.Name,
[Dbid] = source.[Dbid],
[Status] = source.[Status],
Released = source.Released,
Sound = source.Sound,
Versions = source.Versions,
Price = source.Price,
Rating = source.Rating,
[Year] = source.[Year],
Genre = source.Genre,
Aspect = source.Aspect,
UPC = source.UPC,
DVDReleaseDate = source.DVDReleaseDate,
[TimeStamp] = source.[TimeStamp]
WHEN NOT MATCHED THEN
INSERT ([Dbid], [Status], Studio, Released, Sound, Versions, Price, Rating, [Year], Genre, Aspect, UPC, DVDReleaseDate, [TimeStamp], Name)
VALUES (source.[Dbid], source.[Status], source.Studio, source.Released, source.Sound, source.Versions, source.Price, source.Rating, source.[Year], source.Genre, source.Aspect, source.UPC, source.DVDReleaseDate, source.[TimeStamp], source.Name);
ASKER
Nevermind. I just realized I didn't have the Studio field in my update statement. Everything is working fine. Thank you for your help.
ASKER
50 to raqi0017 for providing links to examples I used to get the final OPENROWSET statement to work correctly.
I do something similar in an application I created. To make it efficient and really fast, I grab the CSV, parse it into a DataTable object, then do a BulkCopy into SQL Server into a temporary table. From there, it is fairly easy to write a stored procedure that checks whether each record exists or requires updating. You can cut your processing time and memory usage by about 80-90%.
Cheers,
G.