Import CSV or TXT to database - theoretical questions

Hey All,

I am building an e-mail list manager that should allow the user the ability to import a CSV or TXT file they exported from Excel or their e-mail client. I am using ASPUpload to upload the file to the server and then I use FileSystemObject to open the file, Split() to put the contents into an array and then ADO to get the contents into the database. The problem I can forsee, is with error validation on the CSV or TXT file. I have been researching quite a bit of code and they all validate different things. Some replace quotes, others replace spaces, others replace line breaks, some remove column headers, etc. The CSV file I am using to mock up the application has one easy to follow format, but that is not real-world.

Here are my requirements:

The file should have only one column. The column should contain all e-mail addresses and I need to make sure I don't import the column header if the file contains one. Each e-mail address can have no trailing spaces, line breaks, etc.

Should I use a regular expression to do all this validation? And if so, what are all the things I should check for? In what linear fashion should I do the validation?

Please ask questions if need be. Thanks in advance for all your help.

Brian
LVL 3
bjrcreationsAsked:
Who is Participating?
I wear a lot of hats...

"The solutions and answers provided on Experts Exchange have been extremely helpful to me over the last few years. I wear a lot of hats - Developer, Database Administrator, Help Desk, etc., so I know a lot of things but not a lot about one thing. Experts Exchange gives me answers from people who do know a lot about one thing, in a easy to use platform." -Todd S.

stefriCommented:
What is an email address (see RFC 822) ?
Left-hand-side: A string which can be made of:
characters from A-z and 0-9, special characteers such as: - (dash) or . (dot)
An @ sign followed by right-hand-side of the same kind of sequence as left hand side of the email address

You will check first if you find an @ sign: if not found ignore the string as email address
If found, check left hand side
Compare each character against valid characters using MID or regex expression
If LHS is valid, check right hand side
If RHS is valid, import the string as a valid email address.

Some code as follow may help you for a starting point

atSignPos = instr(1, email, "@")

if check(email, 1, atSignPos) then
   if check(email, atSignPos, len(email)) then
      ' Good Email Address
   end if
end if
.....
....

function check(str,fromHere, toThere)
dim i
check = false


if toThere > 1 then
    check =true
    for i = fromHere to toThere
      theChar= (mid(str, i,1)
      if asc(theChar) = asc("-") or _
         asc(theChar) = asc(".") or _
         (asc(theChar) >= asc("A") and _
         (asc(theChar) <= asc("z")) or _
         (asc(theChar) >= asc("0") and _
         (asc(theChar) <= asc("9"))

         check = check and true
       else
         check = false
         exit for
       end if
    next
end if
end function
0
iozturkCommented:
If I dont misunderstand your question this may help you.

email= Replace(email , chr(34), "") 'for double quotas
email= Replace(email , " ", "") 'for spaces
email= Replace(email , vbcrlf, "") 'for linebreaks
email= Replace(email , "'", "") 'for single quotas
0
whammyCommented:
Hmm, this is probably specific to how your list manager works.

What I would do (and I DO use something like this to import data into tables) is to write a script that parses the file one line at a time.

Since you don't want any line breaks or spaces, I'd check to see that:

- the line isn't empty
- the email address is loosely validated (but of course with a regular expression), i.e.:

Function IsEmail(str)
   Dim ieRegEx
   Set ieRegEx = New RegExp
   ieRegEx.Pattern = "^[\w\+\'\.-]+@[\w\'\.-]+\.[a-zA-Z0-9]{2,}$"
   IsEmail = ieRegEx.Test(str)
End Function

Of course some valid email addresses might not pass this test (although that will allow for even some relatively obscure valid emails with "+" in them), but I cannot understand the relevant RFC at all (and I have been looking for a solid reference for months that actually makes sense), so I just loosely validate email addresses, and the above regex seems to work pretty darn well...

Once you do that, I'd have the script you're creating split the line you're reading by spaces into an array... if there are any spaces (an array was created), then you don't include that line in the new file you are creating to load into your list manager, you might want to write it to a "bad" text file instead so you can double-check the data initially, at least.

I'd try something like that with a test list and see how it works.

As for questions, I'd want to know if your clients were sending things in comma-delimited format with a double quote text identifier, etc.

You can do a lot of this stuff with a simple VBScript, but with huge lists, it won't be very fast.

I'm working on a .NET program (.exe) that will (hopefully) quickly and easily format different types of files into the correct format to load into an email database - unfortunately, even using regular expressions, I don't see a clear-cut way to determine if a file is in a certain format and doesn't contain errors, so it does require some user intervention to change formats, but so far it's very fast compared to VBScript (seconds instead of minutes)!

0

Experts Exchange Solution brought to you by

Your issues matter to us.

Facing a tech roadblock? Get the help and guidance you need from experienced professionals who care. Ask your question anytime, anywhere, with no hassle.

Start your 7-day free trial
Upgrade your Question Security!

Your question, your audience. Choose who sees your identity—and your question—with question security.

whammyCommented:
Actually I should say you can do ALL of this stuff with a simple VBScript file, not alot of it. ;-)

It's slow, but you can have a VBScript parse files at your leisure while you're doing other stuff (although they do take up system memory on your local computer).

An .exe is better for this kind of thing, but in a pinch you can always use VBScript:

http://www.w3schools.com/asp/asp_ref_filesystem.asp

That link helped me a LOT regarding different kinds of file manipulation (since none of our clients use XML, especially!).
0
bjrcreationsAuthor Commented:
Whammy, in your regular expression example above you use:

^[\w\+\'\.-]+@[\w\'\.-]+\.[a-zA-Z0-9]{2,}$

as your pattern. Does this pattern include IP addresses as the domain? I want to use the most accurate regular expression I can find for e-mail validation. Thanks.

Brian
0
whammyCommented:
Yeah, that's what the 0-9 is for.

That's probably not the most accurate (since like I said, I have been unable to find a human readable reference, the appropriate RFC doesn't fit that category, to me anyway ;-)), but it has been working on a site that gets hundreds of emails a day with no problems for months...

Althought it's loose, it still catches a lot of invalid email addresses...

There are a great number of email address regex's here that claim to adhere directly to the specs:

http://www.regexlib.com

...however most that I tried there didn't allow for some perfectly valid email addresses, like:

Jim.O'Brien+internal@someplace-else.com

However, this one submitted by Jason Howat _seems_ to work rather well from my initial testing:

^(([^<>;()[\]\\.,;:@"]+(\.[^<>()[\]\\.,;:@"]+)*)|(".+"))@((([a-z]([-a-z0-9]*[a-z0-9])?)|(#[0-9]+)|(\[((([01]?[0-9]{0,2})|(2(([0-4][0-9])|(5[0-5]))))\.){3}(([01]?[0-9]{0,2})|(2(([0-4][0-9])|(5[0-5]))))\]))\.)*(([a-z]([-a-z0-9]*[a-z0-9])?)|(#[0-9]+)|(\[((([01]?[0-9]{0,2})|(2(([0-4][0-9])|(5[0-5]))))\.){3}(([01]?[0-9]{0,2})|(2(([0-4][0-9])|(5[0-5]))))\]))$

:-|
0
bjrcreationsAuthor Commented:
Thanks for your input Whammy, it helped me get started on the right track.
0
whammyCommented:
No problem, let me know if you run into any specific problems with this, as I do this kind of stuff all the time. ;)
0
It's more than this solution.Get answers and train to solve all your tech problems - anytime, anywhere.Try it for free Edge Out The Competitionfor your dream job with proven skills and certifications.Get started today Stand Outas the employee with proven skills.Start learning today for free Move Your Career Forwardwith certification training in the latest technologies.Start your trial today
ASP

From novice to tech pro — start learning today.

Question has a verified solution.

Are you are experiencing a similar issue? Get a personalized answer when you ask a related question.

Have a better answer? Share it in a comment.