Solved

Detecting illegal characters in CSV file using a ColdFusion Regular Expression

Posted on 2014-03-12
4
1,034 Views
Last Modified: 2014-03-12
I'm building a ColdFusion Application that allows a user to upload a CSV file, which then gets imported into a MySQL database.  

The properties of the uploaded CSV file are as follows:

1) is comma separated
2) has quotation mark qualifiers
3) contains some empty columns
4) has no column headers
5) contains only a single row of data
6) contains columns that sometimes have commas in them
7) is plain text and can be viewed easily using a text editor

I'm reading the file using CFFILE, and would like to check it for illegal characters.  More specifically, I only want to allow for letters (both upper and lower case), numbers, punctuation, carriage returns and line feeds -- you know -- the kinds of characters you would expect to find in a comma separated CSV file.  

There is one offending character that I constantly encounter in these uploaded CSV files.  I'm not exactly sure what it is, but you can clearly see an example of it you refer to the CSV file that I've attached to this post.  It's a "VT" that's enclosed in a black circle, .. and I need to alert the user regarding the presence of this unwanted character in the CSV file BEFORE they attempt to import it into the database ... since it always results in a ColdFusion error.  

Currently, I am using the following code:

<cffile action="read" file="C:\wwwroot\InspectionRequest_1.csv" variable="csvData">
					
<cfif refind("[^a-zA-Z0-9 ,.&'$()\-+*=/]",csvData)>
					
     <p class="errortext">AN ERROR HAS OCCURRED!!!</p>
					
</cfif>

Open in new window


How can I extend this regex so that it's forgiving of all the types of characters that you'd expect to find in a comma separated CSV file -- yet unforgiving of all other illegal characters?  

Thanks,
- Yvan
InspectionRequest-1.csv
0
Comment
Question by:egoselfaxis
[X]
Welcome to Experts Exchange

Add your voice to the tech community where 5M+ people just like you are talking about what matters.

  • Help others & share knowledge
  • Earn cash & points
  • Learn & ask questions
  • 2
  • 2
4 Comments
 
LVL 35

Accepted Solution

by:
Dan Craciun earned 500 total points
ID: 39924129
0x0B is the vertical tab.

With that out of the way, I would do it this way:
\w - all word characters and numbers. Will accept characters from other languages
\s - all space characters, like space, tab, etc
"' - all text fields will be enclosed in "" or ''
,.?!:; - punctuation signs
@~#$%^&*(){}-+=\[\] - accepted symbols, add to the list as needed. you need to escape []

For a test expression like this:
[\w\s"',.?!:;@~#$%^&*(){}-+=\[\]]

Open in new window


HTH,
Dan
0
 

Author Closing Comment

by:egoselfaxis
ID: 39924410
Thank you!  

Since you identified the mysterious character for me, ..  I was actually able to simplify my conditional statement as follows:

<cfif csvData CONTAINS chr(11)>

<p>Error!</p>

</cfif>

Cheers!
- Yvan
0
 
LVL 35

Expert Comment

by:Dan Craciun
ID: 39924426
Glad I could help!

Just as a curiosity: you're sure that is the only illegal character?
0
 

Author Comment

by:egoselfaxis
ID: 39924449
Now, I am not 100% sure just yet .. but this particular vertical tab character has been known to be particularly problematic for us for the last couple of years, so we've decided to handle each offending character on a case-by-case basis until we've identified all of them.

Thanks again!
- yg
0

Featured Post

[Webinar] Code, Load, and Grow

Managing multiple websites, servers, applications, and security on a daily basis? Join us for a webinar on May 25th to learn how to simplify administration and management of virtual hosts for IT admins, create a secure environment, and deploy code more effectively and frequently.

Question has a verified solution.

If you are experiencing a similar issue, please ask a related question

Sometimes databases have MILLIONS of records and we need a way to quickly query that table to return the results me need. Sure you could use CFQUERY but it takes too long when there are millions of records. That is why SOLR was invented. Please …
In Part II of this series, I will discuss how to identify all open instances of Excel and enumerate the workbooks, spreadsheets, and named ranges within each of those instances.
The viewer will learn how to create a normally distributed random variable in Excel, use a normal distribution to simulate the return on an investment over a period of years, Create a Monte Carlo simulation using a normal random variable, and calcul…
Graphs within dashboards are meant to be dynamic, representing data from a period of time that will change each time the dashboard is updated with new data. Rather than update each graph to point to a different set within a static set of data, t…

734 members asked questions and received personalized solutions in the past 7 days.

Join the community of 500,000 technology professionals and ask your questions.

Join & Ask a Question