Problem parsing tab delimted csv/txt file - some columns contains carriage returns and some columns are empty

I've developed an app that allows a user to upload a TAB DELIMTED CSV file (of job listings), and which is then turned into a query object on the front end of his website (against which searches are conducted).  

I instructed the client to export his data as a TAB DELIMTED TEXT file, and to then re-name the file so that it has a CSV extension.  This is the file that gets uploaded.

When the file gets uploaded, I am using the ColdFusion Replace function to strip out quotation marks wherever they exist in the file, and then the modified data gets re-written to another, final CSV file.  What we end up with is a TAB DELIMTED CSV file that doesn't contain any quotation marks (which are sometimes automatically inserted as text qualifiers when there is a comma in the column data).  

Everything works fine when all of the following criteria are met:

1) All of the columns contain data
2) There are no carriage returns in any of the column data
3) There are no TAB characters in any of the column data

However, .. I've recently learned that the column data in the uploaded files will often contain carriage returns, and that there will often be columns that contain no data. Obviously, I need to re-think my approach and devise some kind of alternate solution for this.

So -- how can I work around this?  Is there some other, more compatible text-based file format that I should have the client export his data as? I suspect that using quotation marks as text quailfiers might actually be necessary in this case, though I don't want to be to hasty about making the switch just yet.  

Also, -- is there any way that I can replace empty columns with, .. say, ... 'N/A' prior to re-saving the data as a TAB-delimited CSV file?  

Lastly, -- how on earth should I handle those additional carriage returns in the column data?  If I use the replace function to strip them out -- then I would effectively corrupt the import file, since there would be no more expected carriage returns at the end of each row of data.  

Note that the file upload routines I already have in place could easily be adapted so that they "clean up" the file in whatever way that's necessary.  I just need some tips on how to handle the unwanted characters.

Any tips or suggestions would be greatly appreciated.  

Thanks!
- Yvan









IDEASDesignAsked:
Who is Participating?

[Webinar] Streamline your web hosting managementRegister Today

x
 
dgrafxConnect With a Mentor Commented:
Hi,
The reason you can't read a xls file as simply as a text file is because of it's internal formatting - it's a win32 app not just text.
And you can't read doing <cfquery dbtype="query"> either.

>>would require that I first set up a datasource in CFAdmin.  Am I correct?  
Yes

should work with any recent version of cf
doing in the fashion I described would yes be more efficient, for one thing - the method you're using now doesn't work.

the output will give you columns that may be null or have incorrect data, where you can then "fix"
the method you are using will not even give you a column if data there is null - as I'm sure you realized - and is one of your problems.

the clincher is the carriage returns within a column's value - when you are using a cr as delimiter - absolutely will not fly.

The concept though is not to setup a datasource for the excel file, but to read the excel file with sql server.
You do need to setup the sql server datasource and it doesn't matter what it's pointed to as long as it's a valid sql server db.
sql server then actually uses access drivers (as you see in the code) to read the file.

but since you don't use a datasource then why not use a custom tag.
Here is one that I have used in a case where sql server didn't have correct permissions to read an excel file and the host wouldn't make the changes.
CFX_Excel2Query
http://www.emerle.net/programming/display.cfm/t/cfx_excel2query
after thinking about it - this is really the way to go and you won't need to setup a datasource.
You will need to install this tag and if you are hosting where you don't have full control of the server - you'll need your host to set it up for you.

good luck
0
 
dgrafxCommented:
You've gotten yourself into a bad situation.
Your problems need to be taken care of at the source (where the csv is created)
1) I think you should have your client export data (from where?) to an excel sheet instead of csv or txt.
2) You can then read the excel sheet and replace whatever is needed.
3) Inserting the data should then be fairly easy.
0
 
IDEASDesignAuthor Commented:
Ok, .. that's what I've concluded as well - that I need to implement some kind of custom macro in Excel that would prepare the data for importing and then export it to the needed format.  The ideal Excel macro would do all of the following:

1) replace all tab characters in any of the columns with spaces
2) remove all quotes from any of the columns that contain them
3) export the data as a TAB DELIMTED CSV file, -- using quote symbols as text qualifiers, -- and that exports all of the empty columns as "" along with all the other columns that contains data

You wouldn't happen to know how to program excel macros, would you?  Because I think that's what I'll be needing at this point.  The above specifications would most certainly do the trick, I think.

Thanks!
- Yvan

P.S. I don't believe it's possible to read an XLS file via CHTTP like you are able to do with a CSV or TXT file.  Or am I misken about that?  If so, how would I extract the data from the XLS file (without having to set up a new datasource)?

0
Free Tool: IP Lookup

Get more info about an IP address or domain name, such as organization, abuse contacts and geolocation.

One of a set of tools we are providing to everyone as a way of saying thank you for being a part of the community.

 
dgrafxCommented:
When they create the xls I don't believe you need to do any processing on it - just make sure the columns are in place.
Then have them upload it via a form you create.
Then once you have the xls on the server you can:
<cfquery datasource="#Request.Datasource#" name="XLSData">
SELECT desired columns
FROM OPENROWSET('Microsoft.Jet.OLEDB.4.0', 'Excel 8.0;Database=#xlsfile#;HDR=YES', 'SELECT * FROM [Sheet1$]')
</cfquery>
#xlsfile# is the absolute path to xls sheet "D:\web\text.xls"
Where it says Sheet1 - this needs to be your sheet name in your file - Sheet1 is the default, but it can be renamed
This is using sql server
Your query XLSData is now available to do whatever with the data
In your output query you can then strip whatever or add whatever to get desired results - then insert to table.

If you're using access - I believe you can simply open access and click the "Open" icon selecting files of type "*.xls"
this will import your sheet into an access table where you can do whatever with the data
This would be a manual approach though.

I hope you're using sql server

0
 
IDEASDesignAuthor Commented:
No, I am not using any kind of database at all.  I merely read the contents of the uploaded CSV file using CHTTP and then turn it into a query object using <cfquery dbtype="query">.  

The method you've described, it seems, would require that I first set up a datasource in CFAdmin.  Am I correct?  

If the setting up of a ColdFusion datasource is NOT needed, .. then will this approach that you've outlined work using ColdFusion MX 6.1?  Or does it only work using ColdFusion MX  7 ?

Also, why would interfacing with an XLS file be any more efficient than the method I'm using now?  Would it be more forgiving of the missing column data and of the illegal characters in the individual columns?  I am open to using this approach, .. but of course, I would prefer not to bother with it if it'll just put me right back where I started.

If it's necessary that I first set up a ColdFusion datasource for the Excel file, ... what's the process for doing that in CFAdmin?  In other words, .. which driver do I need to choose, and are there any special settings that I need to apply?

Thanks!
- Yvan

0
 
IDEASDesignAuthor Commented:
Ok, .. I've tried the method you outlined where you create an MSSQL datasource, set it up in CFadmin, and then use the sql drivers to query the uploaded xls file:

-----------------------------------------------------------

<cfset xlsfile = '#application.server_path#job_listings\joblistings.xls'>

<cfquery datasource="#application.datasource#" name="XLSData">
      SELECT JobIdentifier
      FROM OPENROWSET('Microsoft.Jet.OLEDB.4.0', 'Excel 8.0;Database=#xlsfile#;HDR=YES', 'SELECT * FROM [Sheet1$]')
</cfquery>

<cfoutput query="XLSData">#JobIdentifier#<br></cfoutput>

-----------------------------------------------------------

However, .. I am getting this error, and I'm not sure why:

[Macromedia][SQLServer JDBC Driver][SQLServer]OLE DB provider 'Microsoft.Jet.OLEDB.4.0' reported an error. The provider did not give any information about the error.

VENDORERRORCODE          7399
SQLSTATE         HY000

I am open to experimenting with the custom tag yu highlighted (thank you for that), .. but want to explore the cost-free option first.  Any idea why I'm getting these errors? The mssql database that I've set up as a coldfusion datasource DOES exist, and I set it up in CFAdmin using the MSSQL driver.  

Thanks,
- Yvan


 



0
 
dgrafxCommented:
First - do you control this server?
i.e. can you modify the registry?
What account is sql server logged on as? - look in services / mssql$servername
0
 
IDEASDesignAuthor Commented:
Yes, I have full admin access to the server , ... and yes I can modify the registry.

I don't know what you mean when you ask me to look in services / mssql$servername though.  

I am looking in services, and all I see listed is MSSQLSERVER and MSSQLServerADHelper

I believe, however, that sql server is logged on as 'sa', as that is what I used to configure my DSN in CFAdmin.

- Yvan


0
 
IDEASDesignAuthor Commented:
Actually -- nevermind.  I just got approval from my employer for us to purchase the excel2qery custom tag.  Thanks so much for your help.  I am awarding you the 500 pts.

Cheers,
 - Yvan
0
 
dgrafxCommented:
Ok - when you first setup sql server - you can name it - you must not have so it appears as just MSSQLSERVER - no prob

sa should be good to go, but I did take a look and on our setup it's logged on as local system account.
if you need to look at that ever it would be in Services / MSSQLSERVER / log on tab

look for this registry key:
HKEY_LOCAL_MACHINE\SOFTWARE\Microsoft\MSSQLServer\Providers\SQLOLEDB\DisallowAdhocAccess

or search for DisallowAdhocAccess - set it's value to 0 wherever you find it.
try that
0
 
dgrafxCommented:
ah - I see you posted again and awarded me the points - thank you!
yes the tag works great and considering it's cost it's probably cheaper than the time it takes to get the "free" method setup.

good luck ...
0
All Courses

From novice to tech pro — start learning today.