Solved

How do I import a malformed CSV file into SQLEXPRESS/Server 2008 using BULK INSERT with error tolerance

Posted on 2011-02-17
642 Views
Last Modified: 2012-05-11
I am a total beginner at SQL, and although I understand the concepts, I do not understand the syntax. I have searched long enough to make my eyes bleed, so now I just need to ask:

How do you read a CSV file into an SQLEXPRESS/Server 2008 table when the CSV file is not well formed?

The data file is from a data recorder, and the first line in the file is always the column headers. The data that follows may have sync data inserted between the lines of sensor data to indicate that something happened while recording and that the data may have a gap in the readings. Obviously these sync lines are not predictable, and there is also a possibility that the data may simply be malformed. The sync lines contain the MESSAGE ID and a WORD COUNT of 0 followed by a comma; the comma is an artifact of the way the data recorder fills out incomplete fields (i.e., a bug we can't fix).

Here are a few lines from a sensor data file:

MESSAGE ID, WORD COUNT, BUFFER NUMBER, TIME MSP (SECONDS), TIME LSP (SECONDS), NUMBER OF MEAS, MEAS 1 CODE, MEAS 1 VEHICLE ID, MEAS 1 FREQ, MEAS 1 VALIDITY, MEAS 1 RANGE (METERS)
7159,     0,
7159,     0,
7159,    75,    0, 5.8982400000000000e+005, 3.0441000000000004e+003,   18,    1,   18,    0,    0,-4.2217182187500000e+006
7159,    75,    0, 5.8982400000000000e+005, 3.0442000000000003e+003,   18,    1,   18,    0,    1,-4.2217274531250000e+006
7159,    75,    0, 5.8982400000000000e+005, 3.0443000000000002e+003,   18,    1,   18,    0,    1,-4.2217365625000000e+006
7159,    75,    0, 5.8982400000000000e+005, 3.0444000000000001e+003,   18,    1,   18,    0,    1,-4.2217456250000000e+006
7159,    75,    0, 5.8982400000000000e+005, 3.0445000000000000e+003,   18,    1,   18,    0,    1,-4.2217548437500000e+006



This is the code I have written so far to read the data into a temporary table and then parse it into the permanent table.

-- Start with a clean slate and drop the RawSensorData table if it 
-- already exists 
IF OBJECT_ID ('tempdb..#RawSensorData', 'U') IS NOT NULL
	DROP TABLE #RawSensorData

-- Create a temporary table that we can read the raw sensor data into
CREATE TABLE #RawSensorData ([RawSensorData] VARCHAR(MAX))

-- Read in the raw sensor data as a string
BULK INSERT #RawSensorData 
	FROM 'Sensor.csv'
	WITH (
	DATAFILETYPE='CHAR',
	FIELDTERMINATOR='\r\n'
) 

-- Drop the SensorData table if it already exists
IF OBJECT_ID ('[dbo].[SensorData]', 'U') IS NOT NULL 
	DROP TABLE [dbo].[SensorData]

-- Create the SensorData table
IF OBJECT_ID ('[dbo].[SensorData]', 'U') IS NULL
	CREATE TABLE [dbo].[SensorData] 
	(
		[ MESSAGE ID] smallint,
		[ WORD COUNT] smallint,
		[ BUFFER NUMBER] smallint,
		[ TIME MSP (SECONDS)] float,
		[ TIME LSP (SECONDS)] float,
		[ NUMBER OF MEAS] smallint,
		[ MEAS 1 CODE] smallint,
		[ MEAS 1 VEHICLE ID] smallint,
		[ MEAS 1 FREQ] smallint,
		[ MEAS 1 VALIDITY] smallint,
		[ MEAS 1 RANGE (METERS)] float
	)

-- Insert the data from the RawSensorData table into
-- the parsed SensorData table
INSERT INTO
	[dbo].[SensorData] 
	(
		[ MESSAGE ID],
		[ WORD COUNT],
		[ BUFFER NUMBER],
		[ TIME MSP (SECONDS)],
		[ TIME LSP (SECONDS)],
		[ NUMBER OF MEAS],
		[ MEAS 1 CODE],
		[ MEAS 1 VEHICLE ID],
		[ MEAS 1 FREQ],
		[ MEAS 1 VALIDITY],
		[ MEAS 1 RANGE (METERS)]
	)
SELECT

-- What do I do here to take the single column data from
-- the RawSensorData table and convert to the multi-column
-- SensorData table?

FROM #RawSensorData

-- Delete the temporary table
DROP TABLE #RawSensorData



So I have two questions:

1. What do I need to put in the SELECT statement to take the single column data from the RawSensorData table and convert to the multi-column SensorData table?

2. How do I count how many sync lines and error lines are in the original data?

And... Business Rules dictate that I cannot use BCP, DTS, SSIS, OPENROWSET, or anything else that either calls an external program or would require changing the Surface Area Configuration.

Capturing the lines with errors would be a bonus but is not necessary.

As you can see, the data is captured with 100-nanosecond accuracy and generates files that run to several million lines, so efficient code is a major plus.

And since I am new to this, I am looking for example code; the more specific the better. General suggestions would not be helpful at this point.

Thanks

<<< Everything is easier when you know how. >>>




Question by: Mr_Difficult

10 Comments
 
Assisted Solution by: wdosanjos (earned 500 total points)
I think you can try the following to load your #RawSensorData table (I did not test it). Then you can use CAST to insert the values from #RawSensorData into the SensorData table.

-- Create a temporary table that we can read the raw sensor data into
	CREATE TABLE #RawSensorData  
	(
		[ MESSAGE ID] varchar(50),
		[ WORD COUNT] varchar(50),
		[ BUFFER NUMBER] varchar(50),
		[ TIME MSP (SECONDS)] varchar(50),
		[ TIME LSP (SECONDS)] varchar(50),
		[ NUMBER OF MEAS] varchar(50),
		[ MEAS 1 CODE] varchar(50),
		[ MEAS 1 VEHICLE ID] varchar(50),
		[ MEAS 1 FREQ] varchar(50),
		[ MEAS 1 VALIDITY] varchar(50),
		[ MEAS 1 RANGE (METERS)] varchar(50)
	)



-- Read in the raw sensor data as a string
BULK INSERT #RawSensorData 
	FROM 'Sensor.csv'
	WITH (
	DATAFILETYPE='CHAR',
	FIRSTROW=2,
	FIELDTERMINATOR=',',
	ROWTERMINATOR='\r\n',
	MAXERRORS=999999999,
	ERRORFILE='Sensor.error.csv'
) 
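A rough, untested sketch of that CAST step might look like this (it assumes #RawSensorData loads cleanly with the varchar layout above):

-- Untested sketch: convert the staged varchar values into the typed
-- SensorData columns.  CAST copes with the leading spaces and the
-- scientific-notation floats shown in the sample data.
INSERT INTO [dbo].[SensorData]
SELECT
	CAST([ MESSAGE ID] AS smallint),
	CAST([ WORD COUNT] AS smallint),
	CAST([ BUFFER NUMBER] AS smallint),
	CAST([ TIME MSP (SECONDS)] AS float),
	CAST([ TIME LSP (SECONDS)] AS float),
	CAST([ NUMBER OF MEAS] AS smallint),
	CAST([ MEAS 1 CODE] AS smallint),
	CAST([ MEAS 1 VEHICLE ID] AS smallint),
	CAST([ MEAS 1 FREQ] AS smallint),
	CAST([ MEAS 1 VALIDITY] AS smallint),
	CAST([ MEAS 1 RANGE (METERS)] AS float)
FROM #RawSensorData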



More on the BULK INSERT here:
http://msdn.microsoft.com/en-us/library/ms188365.aspx
Author Comment by: Mr_Difficult
Interesting first attempt. After testing the code I find that it reads in the header line but nothing else. I am assuming that setting FIRSTROW=2 is an attempt to skip the first two lines of sync data, but keep in mind that the sync lines can occur anywhere within the data, as many times as needed, to indicate that the data recorder had a problem and that the data has a gap.

I have done some research before asking for help here, and the consensus was that BULK INSERT works very well when the data is well formed. In my case BULK INSERT seems to have a particular problem with the sync data lines, where a field terminator immediately precedes a row terminator; this causes it to throw out not only lines two and three as having errors but also line four. The BULK INSERT parser seems to get a little confused. I have also read that reading the whole file in as rows of data that are later parsed line by line is the way to handle these problem files; that was my attempt in the code I submitted. This is where my very limited knowledge ends: how to parse the data out of the staging table.

So I am anxious to see what comes next. Remember I need to meet the criteria in the original post for me to be successful.

Thanks
Author Comment by: Mr_Difficult
Bah, I couldn't understand what my debugger was telling me: the code reads zero rows, not the header row. What I was seeing was the column names from the CREATE TABLE call.

Author Comment by: Mr_Difficult
I haven't seen any activity for a bit now, so how about this: how would I use one of the many split/parse functions I have seen on the SQL blog sites? Almost all of them take in a string and a delimiter and return a table of the parsed elements in one column and many rows. What would be the calling syntax to go from the #RawSensorData table to the SensorData table? Assume, for example, that this is the split function (many are similar to this one):

CREATE FUNCTION dbo.Split (@String varchar(8000), @Delimiter char(1))
RETURNS @temptable TABLE (items varchar(8000))
AS
BEGIN
	DECLARE @idx int
	DECLARE @slice varchar(8000)

	SELECT @idx = 1
	IF LEN(@String) < 1 OR @String IS NULL RETURN

	WHILE @idx != 0
	BEGIN
		-- Find the next delimiter; the slice is everything before it,
		-- or the remainder of the string if no delimiter is left
		SET @idx = CHARINDEX(@Delimiter, @String)
		IF @idx != 0
			SET @slice = LEFT(@String, @idx - 1)
		ELSE
			SET @slice = @String

		IF (LEN(@slice) > 0)
			INSERT INTO @temptable (items) VALUES (@slice)

		-- Drop the piece just handled and continue with the rest
		SET @String = RIGHT(@String, LEN(@String) - @idx)
		IF LEN(@String) = 0 BREAK
	END
	RETURN
END
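For context, the calling pattern I imagine would be something along these lines (an untested sketch): CROSS APPLY runs the function once per staged row. Note that as written the function returns only the pieces, with no ordinal column, so turning them back into the eleven SensorData columns would additionally require adding a position column (for example an IDENTITY on @temptable) and pivoting on it.

-- Untested sketch of the calling syntax against the staging table
SELECT r.[RawSensorData], s.items
FROM #RawSensorData AS r
CROSS APPLY dbo.Split(r.[RawSensorData], ',') AS s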



I may be totally off base with this approach, and it may take forever to run, who knows, but I am looking at it until someone has something better that works for me.

Thanks
Author Comment by: Mr_Difficult
Ignore the last approach; it will produce a large steaming pile of fail. Iterating through the table, passing in each row as a string, and parsing it will take forever. One of my smaller files has 1.5 million lines of data; even if each conversion took only 6 ms it would take two and a half hours to process the entire file, whereas if BULK INSERT worked just a little more tolerantly it would do it in moments. Not a good trade-off to parse each line.

So still looking for suggestions.
Expert Comment by: wdosanjos
I tried different approaches, but none satisfy all the posted requirements.

I think there is no simple solution. Probably the alternative that will provide the best performance is to pre-process the Sensor.csv file outside SQL Server to make it compliant with the BULK INSERT operation (i.e., add the missing commas to the sync records). I assume there is a copy process that moves the file from the recorder to a folder accessible by SQL Server; you could tap into that process (or the process that executes the T-SQL script) to fix the file.

The C# code to fix the file would be something like this:
using System;
using System.Collections.Generic;
using System.Linq;
using System.Text;

namespace FixSensorFile
{
    class Program
    {
        static void Main(string[] args)
        {
            int layoutFieldCount = -1;

            for (string record = Console.In.ReadLine(); record != null; record = Console.In.ReadLine())
            {
                int fieldCount = record.Count(c => c == ','); // commas in this record

                if (layoutFieldCount < 0)
                {
                    layoutFieldCount = fieldCount; // header row sets the expected comma count
                }
                else
                {
                    if (fieldCount < layoutFieldCount)
                    {
                        int missingCommas = layoutFieldCount - fieldCount;

                        record = record.PadRight(record.Length + missingCommas, ',');
                    }
                }

                Console.Out.WriteLine(record);
            }
        }
    }
}



To execute from the command line, it will be something like this:
FixSensorFile.exe <C:\Temp\Sensor.csv >C:\Temp\SensorFixed.csv
Accepted Solution by: Mr_Difficult (earned 0 total points)
Although that would probably be the easiest way to precondition the data, it would violate one of my business rules against calling an external program to manipulate the source data. I appreciate that you are sticking with this, and I will keep working on a solution as well.

Thanks
Author Comment by: Mr_Difficult
I do not think it is possible to do what I want, so I have settled on reading the data into a table as whole-line strings and deleting all of the lines that do not contain enough fields to be valid data. I can count how many lines were imported and how many were deleted, so that gets me past one of my initial problems. This is the code I developed to do that:

-- If a staging TABLE already exists it might not be ours so we need
-- to remove it from the database.
IF OBJECT_ID ('StagingTable') IS NOT NULL 
DROP TABLE StagingTable

-- CREATE our staging TABLE with an IDENTITY column.
CREATE TABLE StagingTable 
(
	ID INT IDENTITY NOT NULL PRIMARY KEY,	-- tracks original data order
	[RawData] VARCHAR(8000)		-- the entire line of data
)

-- When doing a BULK INSERT into a TABLE the number of columns of data 
-- MUST match the number of columns in the TABLE. Since we are reading 
-- all of the data as a single column and we have created an IDENTITY 
-- column BULK INSERT into a TABLE will not work. However, we can create
-- a VIEW into the TABLE and BULK INSERT into the VIEW instead.

-- If the staging VIEW already exists it might not be ours so we need
-- to remove it from the database.
IF OBJECT_ID ('dbo.StagingView', 'V') IS NOT NULL
DROP VIEW dbo.StagingView

GO

-- CREATE our staging VIEW with just the column for the raw data
CREATE VIEW StagingView
AS SELECT [RawData]
FROM  StagingTable

GO

-- BULK INSERT the data file into the VIEW
BULK INSERT StagingView
FROM 'Sensor.csv'
WITH (
  DATAFILETYPE = 'CHAR'
, FIELDTERMINATOR = ''
, TABLOCK
) 

PRINT N'Total rows imported ' + CAST( @@ROWCOUNT AS VARCHAR(10) )

-- Count the number of fields in the header row
DECLARE @nFieldCount INT

SELECT 
	@nFieldCount = LEN ([RawData]) - LEN (REPLACE ([RawData], ',', ''))
FROM 
	[dbo].[StagingTable]
WHERE 
	[dbo].[StagingTable].[ID] = 1

-- Delete the rows that have malformed data
DELETE
FROM
	[dbo].[StagingTable]
WHERE
	LEN ([RawData]) - LEN (REPLACE ([RawData], ',', '')) <> @nFieldCount

PRINT N'Total rows deleted  ' + CAST( @@ROWCOUNT AS VARCHAR(10) )
PRINT N'Total fields        ' + CAST( @nFieldCount + 1 AS VARCHAR(10) )



This works pretty well: it can import a million-and-a-half-row file and remove all of the malformed data in about 15 seconds.

This still only gets me halfway to solving the total problem: how to get the data in the RawData column split into separate columns. One solution I tried took almost 30 minutes, which is totally out of line with what I need.
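For reference, this untested sketch shows the general shape of the set-based split I am after, using nested CHARINDEX calls and shown for the first two fields only; whether the full eleven-field version would be fast enough is something I have not measured.

-- Untested sketch: cut the first two fields out of RawData with nested
-- CHARINDEX calls, with no row-by-row loop.  The same pattern would repeat
-- out to field eleven.
;WITH Positions AS
(
	SELECT
		ID,
		[RawData],
		CHARINDEX(',', [RawData]) AS p1,
		CHARINDEX(',', [RawData], CHARINDEX(',', [RawData]) + 1) AS p2
	FROM StagingTable
	WHERE ID > 1	-- row 1 is the header
)
SELECT
	CAST(LEFT([RawData], p1 - 1) AS smallint) AS [MESSAGE ID],
	CAST(SUBSTRING([RawData], p1 + 1, p2 - p1 - 1) AS smallint) AS [WORD COUNT]
FROM Positions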

So I think this thread is done and I will need to look into different alternatives to solve my problem.
Author Comment by: Mr_Difficult
Points were given because the expert provided a possible solution that could work if the problem constraints were loosened.
Author Closing Comment by: Mr_Difficult
The question appears to be unsolvable given the constraints, so my conclusion is the real answer.