Did you know that ransomware is the most widespread, destructive malware in the world today? It accounts for 39% of all security breaches, with ransomware gangsters projected to make $11.5B in profits from online extortion by 2019.
A client recently asked me to develop an application which will read data from CSV files they receive each month from other companies they work with. Unfortunately, there is no approved EDI (Electronic Data Interchange) file format for this data, so what they get are fifteen different file formats with varying numbers of columns and inconsistent column names. Initially, I saw this as relatively simple process as I've created similar custom import specification applications before; then I received the first of 5 files with greater than 255 columns and realized my options had changed, as Access would only allow 255 data columns in a single table.
The data in these files pertains to natural gas processing as it is reported by the processor back to the producer. The data contains up to about 50 columns (4 of which uniquely define each record) which are related to a specific meter that the gas is downloaded from. The remainder of the columns relate to 5 different one-to-many tables, each of which contains from 4 to 10 columns, and from 3 to 10 records. In these files, the data from these 1-to-many tables has all been denormalized. As an example, assume that one of the "many" tables contains ten columns (A - J) and ten rows (1 -10). The resulting set of 100 columns in this CSV file contains column names similar to [1_A], [1_B], ... [10_I], [10_J]; and the data providers did this with up to five different 1-to-many tables.
As I mulled over my options, I considered:
1. Break the file into multiple parts. I rejected this option because I needed to be able to read data into a single staging table in order to facilitate the process of importing it from the staging table into the production tables.
2. Open the file as a text file and read each record and column one at a time. This is the method that was recommended to me by a bunch of my peers and MVPs, but because of the variability in the number and names of the columns in the CSV files I balked at this idea as well.
3. Import the data directly into a SQL Server table using the OpenRowset command. I've done this before, although not with more than 255 columns at a time, so I initially selected this method as the solution to my problem.
Importing the data into SQL Server
In the past, the method I've used to import data from Excel into SQL Server is the OpenRowset command. To accomplish this task, I created a stored procedure (SP) which used the OpenRowset command to import the data directly from the CSV file into a SQL table. It took a while to get this working because there are multiple settings required to get the OpenRowset command to work, and I was working on a server which had not previously been configured to perform this function. I also had to modify my previous procedure to read from a CSV file instead of Excel. After over 8 hours of research and testing, I eventually got this process working on one of the narrow files (< 255 columns) but when I ran the procedure against one of the wide files (>255 columns) I only found 255 columns in the resulting table. In retrospect, I should have known that using the Microsoft.ACE.OLEDB.12.0 provider in the OpenRowset syntax would have this result, as it is basically the same driver Access uses when importing data into Access.
So I returned to the drawing board, and after a little more research, I found a reference to the SQL Server Bulk Insert command and realized that with a little effort I could use that command to read the data. The effort involved:
1. Creating the staging table (I didn't have to do this with the OpenRowset command)
2. Defining all of the fields associated with the data file (columns in the destination table must match those in the source file)
3. Executing the Bulk Insert process to pull the data from the CSV file into the SQL table.
Creating the staging table
Since the queries I would eventually use to read the data from the staging table into the production tables would be written dynamically, at run time, I needed to use the same table for each of the 15+ CSV files.
To do this, I developed a stored procedure which deleted the staging table if it already existed and then recreated it with a single ID column. In the code below, I could have hard-coded the table name but chose to make this procedure a bit more flexible by requiring that the table name be passed to the stored procedure.
CREATE PROCEDURE [dbo].[df_Temp_Statement_01_Drop_and_Create_Table] @Tablename nvarchar(100) AS BEGIN SET NOCOUNT ON; declare @SQL as nvarchar(max) set @SQL = 'if Object_ID('''+ @Tablename + ''', ''U'') IS NOT NULL DROP TABLE ' + @Tablename + ';' Exec (@SQL) set @sql = 'CREATE TABLE [dbo].[' + @Tablename + ']([ID] [int])' Exec (@SQL) END
Defining the fields
One of the requirements of the Bulk Insert command is that the destination table must exist and must have the same number of fields as the source file. In order to do this, I created an Access procedure which opened the CSV file in Excel, looped through each of the columns defined in the first row of the file, determined the data type for each field, and then inserted each field into the staging table. The tricky part of this was determining the data type to use for each of the fields, so I created a loop to review the data in the first 25 rows (if there were that many) to determine the type of data in each field based on the following logic:
1. Is data in the cell numeric?
a. If it contains a decimal, make it a 'float' type
b. If it fits in a variable defined as integer, then use SQL (small int), otherwise define it as an integer
2. If it is in a format which can be interpreted as a date (contains two '/' or two "-"), define it as a datetime2
3. Otherwise make it a text (nvarchar) value
After determining the data type to be used, I ran a pass-through query using the "Alter Table Add" syntax to append each new column to the staging table. This worked great, allowing me to create a staging table from any of the 15 data file formats I had received.
I then attempted to implement the Bulk Insert command to import the data from the raw CSV file into the staging table, but received the error message which indicated I had the wrong number of columns. I realized that when I created the table, I had created it with an ID column and failed to subequently remove that column.
After resolving this problem, I ran the Bulk Insert code again and received another error message, this one indicated I had the wrong data type for one of the columns. I dug deeper into the raw data, the Bulk Insert documentation and some internet searches and found that the Bulk Insert command will baulk when you attempt to import an integer or long integer value (a number not containing a decimal point) into a float column. Most of the internet posts I read about this error simply recommended defining all of the fields in the staging table as nvarchar to facilitate the import process, so I replaced all of the code that determined the data type with a single line of code which created the SQL string with nvarchar(75) data type, and called this stored procedure.
ALTER PROCEDURE [dbo].[df_SS_Statement_02_Add_Fields_to_Table] @Tablename nvarchar(max), @Fieldname nvarchar(100), @FieldType nvarchar(50) AS BEGIN -- SET NOCOUNT ON added to prevent extra result sets from -- interfering with SELECT statements. SET NOCOUNT ON; declare @SQL nvarchar(max) set @sql = 'ALTER TABLE [' + @Tablename + '] ADD [' + @Fieldname + '] ' + @FieldType Exec (@SQL) END
Import the CSV file into the SQL table
The final step was to create a code which would accept the name of the file to be imported and would perform the bulk insert. This VBA code looked like:
Dim qdf as DAO.QueryDef Set qdf = currentdb.QueryDefs("qry_SQL_Passthru_Does_Not_Return_Records") strSQL = "Bulk insert MyDb.dbo.temp_Staging " _ & "FROM '" & Sourcefile & "' " _ & "WITH (FirstRow=2, Fieldterminator = ',')" qdf.SQL = strSQL qdf.Execute dbfailonerror
This process appeared to work well, but when I started reviewing the data stored in some of the fields towards the end of each record I found numeric data in fields that should have contained text, and visa versa. Additional research indicated that the bulk insert command will split the record on a user defined character (the comma defined as the Fieldterminator indicated the code segment above), but that it does not differentiate between commas that separate fields and those embedded in text strings. This resulted in the bulk insert command splitting some my data into more columns than it should. Further review of the raw CSV files showed that if the data in a field contained a comma, the entire field would be encapsulated in quotes, but if there was no embedded comma, the text would not be surrounded by quotes.
In order to resolve this problem, I developed a custom string parsing routine to read the CSV, modify those columns with embedded commas to replace the comma with a space, concatenate the data back together again, and then export it to a new CSV file.
ReplaceWhat = "," ReplaceWith = " " SourcefileNum = FreeFile() Open Sourcefile For Input As #SourcefileNum DestFileNum = FreeFile() Open DestFileName For Output As #DestFileNum 'Loop through the source file, parse it, and reconstruct it, one line at a time While Not EOF(SourcefileNum) Line Input #SourcefileNum, strLine aryLine = Split(strLine, ",") strLineOut = "" For intLoop = LBound(aryLine) To UBound(aryLine) If Left(aryLine(intLoop), 1) = Chr$(34) And _ Right(aryLine(intLoop), 1) = Chr$(34) Then aryLine(intLoop) = Mid(aryLine(intLoop), 2) aryLine(intLoop) = Left(aryLine(intLoop), Len(aryLine(intLoop)) - 1) aryLine(intLoop) = Replace(aryLine(intLoop), ReplaceWhat, ReplaceWith) strLineOut = strLineOut & aryLine(intLoop) & ReplaceWhat ElseIf Left(aryLine(intLoop), 1) = Chr$(34) And _ Right(aryLine(intLoop), 1) <> Chr$(34) Then aryLine(intLoop) = Mid(aryLine(intLoop), 2) strLineOut = strLineOut & Replace(aryLine(intLoop), ReplaceWhat, ReplaceWith) ElseIf Right(aryLine(intLoop), 1) = Chr$(34) And _ Left(aryLine(intLoop), 1) <> Chr$(34) Then aryLine(intLoop) = Left(aryLine(intLoop), Len(aryLine(intLoop)) - 1) strLineOut = strLineOut _ & Replace(aryLine(intLoop), ReplaceWhat, ReplaceWith) _ & ReplaceWhat Else strLineOut = strLineOut _ & Replace(aryLine(intLoop), ReplaceWhat, ReplaceWith) _ & ReplaceWhat End If Next Print #DestFileNum, strLineOut intLineCount = intLineCount + 1 Wend
The Split() function in Access disregards quoted fields, the same way the Bulk Import process does. Because of this, as I rebuilt each record, I had to determine whether the first character of each array element contained a quote, and if so, I had to drop the quote, append the remainder of the array element to the output string, and then append the space character (to replace the original comma). The loop causes this algorithm to be processed repeatedly for each element of the array. This resulted in the transformation of a record from something like:
a,1.5,"Now is the, time",2,for all good men
a, 1.5,Now is the time,2,for all good men
This final modification, when inserted at the beginning of the process allowed me to read any of the CSV files, convert it so that columns in the data which contained text did not contain any embedded commas, and read the data into SQL Server.
In my next article, I'll describe how I created the custom import specification which allows my client to map the fields from these CSV files into the properly normalized data structure in their production database.