Want to protect your cyber security and still get fast solutions? Ask a secure question today.Go Premium

x
  • Status: Solved
  • Priority: Medium
  • Security: Public
  • Views: 421
  • Last Modified:

Import CSV or TXT to database - theoretical questions

Hey All,

I am building an e-mail list manager that should allow the user the ability to import a CSV or TXT file they exported from Excel or their e-mail client. I am using ASPUpload to upload the file to the server and then I use FileSystemObject to open the file, Split() to put the contents into an array and then ADO to get the contents into the database. The problem I can forsee, is with error validation on the CSV or TXT file. I have been researching quite a bit of code and they all validate different things. Some replace quotes, others replace spaces, others replace line breaks, some remove column headers, etc. The CSV file I am using to mock up the application has one easy to follow format, but that is not real-world.

Here are my requirements:

The file should have only one column. The column should contain all e-mail addresses and I need to make sure I don't import the column header if the file contains one. Each e-mail address can have no trailing spaces, line breaks, etc.

Should I use a regular expression to do all this validation? And if so, what are all the things I should check for? In what linear fashion should I do the validation?

Please ask questions if need be. Thanks in advance for all your help.

Brian
0
bjrcreations
Asked:
bjrcreations
1 Solution
 
stefriCommented:
What is an email address (see RFC 822) ?
Left-hand-side: A string which can be made of:
characters from A-z and 0-9, special characteers such as: - (dash) or . (dot)
An @ sign followed by right-hand-side of the same kind of sequence as left hand side of the email address

You will check first if you find an @ sign: if not found ignore the string as email address
If found, check left hand side
Compare each character against valid characters using MID or regex expression
If LHS is valid, check right hand side
If RHS is valid, import the string as a valid email address.

Some code as follow may help you for a starting point

atSignPos = instr(1, email, "@")

if check(email, 1, atSignPos) then
   if check(email, atSignPos, len(email)) then
      ' Good Email Address
   end if
end if
.....
....

function check(str,fromHere, toThere)
dim i
check = false


if toThere > 1 then
    check =true
    for i = fromHere to toThere
      theChar= (mid(str, i,1)
      if asc(theChar) = asc("-") or _
         asc(theChar) = asc(".") or _
         (asc(theChar) >= asc("A") and _
         (asc(theChar) <= asc("z")) or _
         (asc(theChar) >= asc("0") and _
         (asc(theChar) <= asc("9"))

         check = check and true
       else
         check = false
         exit for
       end if
    next
end if
end function
0
 
iozturkCommented:
If I dont misunderstand your question this may help you.

email= Replace(email , chr(34), "") 'for double quotas
email= Replace(email , " ", "") 'for spaces
email= Replace(email , vbcrlf, "") 'for linebreaks
email= Replace(email , "'", "") 'for single quotas
0
 
whammyCommented:
Hmm, this is probably specific to how your list manager works.

What I would do (and I DO use something like this to import data into tables) is to write a script that parses the file one line at a time.

Since you don't want any line breaks or spaces, I'd check to see that:

- the line isn't empty
- the email address is loosely validated (but of course with a regular expression), i.e.:

Function IsEmail(str)
   Dim ieRegEx
   Set ieRegEx = New RegExp
   ieRegEx.Pattern = "^[\w\+\'\.-]+@[\w\'\.-]+\.[a-zA-Z0-9]{2,}$"
   IsEmail = ieRegEx.Test(str)
End Function

Of course some valid email addresses might not pass this test (although that will allow for even some relatively obscure valid emails with "+" in them), but I cannot understand the relevant RFC at all (and I have been looking for a solid reference for months that actually makes sense), so I just loosely validate email addresses, and the above regex seems to work pretty darn well...

Once you do that, I'd have the script you're creating split the line you're reading by spaces into an array... if there are any spaces (an array was created), then you don't include that line in the new file you are creating to load into your list manager, you might want to write it to a "bad" text file instead so you can double-check the data initially, at least.

I'd try something like that with a test list and see how it works.

As for questions, I'd want to know if your clients were sending things in comma-delimited format with a double quote text identifier, etc.

You can do a lot of this stuff with a simple VBScript, but with huge lists, it won't be very fast.

I'm working on a .NET program (.exe) that will (hopefully) quickly and easily format different types of files into the correct format to load into an email database - unfortunately, even using regular expressions, I don't see a clear-cut way to determine if a file is in a certain format and doesn't contain errors, so it does require some user intervention to change formats, but so far it's very fast compared to VBScript (seconds instead of minutes)!

0
Concerto's Cloud Advisory Services

Want to avoid the missteps to gaining all the benefits of the cloud? Learn more about the different assessment options from our Cloud Advisory team.

 
whammyCommented:
Actually I should say you can do ALL of this stuff with a simple VBScript file, not alot of it. ;-)

It's slow, but you can have a VBScript parse files at your leisure while you're doing other stuff (although they do take up system memory on your local computer).

An .exe is better for this kind of thing, but in a pinch you can always use VBScript:

http://www.w3schools.com/asp/asp_ref_filesystem.asp

That link helped me a LOT regarding different kinds of file manipulation (since none of our clients use XML, especially!).
0
 
bjrcreationsAuthor Commented:
Whammy, in your regular expression example above you use:

^[\w\+\'\.-]+@[\w\'\.-]+\.[a-zA-Z0-9]{2,}$

as your pattern. Does this pattern include IP addresses as the domain? I want to use the most accurate regular expression I can find for e-mail validation. Thanks.

Brian
0
 
whammyCommented:
Yeah, that's what the 0-9 is for.

That's probably not the most accurate (since like I said, I have been unable to find a human readable reference, the appropriate RFC doesn't fit that category, to me anyway ;-)), but it has been working on a site that gets hundreds of emails a day with no problems for months...

Althought it's loose, it still catches a lot of invalid email addresses...

There are a great number of email address regex's here that claim to adhere directly to the specs:

http://www.regexlib.com

...however most that I tried there didn't allow for some perfectly valid email addresses, like:

Jim.O'Brien+internal@someplace-else.com

However, this one submitted by Jason Howat _seems_ to work rather well from my initial testing:

^(([^<>;()[\]\\.,;:@"]+(\.[^<>()[\]\\.,;:@"]+)*)|(".+"))@((([a-z]([-a-z0-9]*[a-z0-9])?)|(#[0-9]+)|(\[((([01]?[0-9]{0,2})|(2(([0-4][0-9])|(5[0-5]))))\.){3}(([01]?[0-9]{0,2})|(2(([0-4][0-9])|(5[0-5]))))\]))\.)*(([a-z]([-a-z0-9]*[a-z0-9])?)|(#[0-9]+)|(\[((([01]?[0-9]{0,2})|(2(([0-4][0-9])|(5[0-5]))))\.){3}(([01]?[0-9]{0,2})|(2(([0-4][0-9])|(5[0-5]))))\]))$

:-|
0
 
bjrcreationsAuthor Commented:
Thanks for your input Whammy, it helped me get started on the right track.
0
 
whammyCommented:
No problem, let me know if you run into any specific problems with this, as I do this kind of stuff all the time. ;)
0

Featured Post

Free Tool: Site Down Detector

Helpful to verify reports of your own downtime, or to double check a downed website you are trying to access.

One of a set of tools we are providing to everyone as a way of saying thank you for being a part of the community.

Tackle projects and never again get stuck behind a technical roadblock.
Join Now