Solved

Extract email list from a csv file.

Posted on 2011-09-18
Last Modified: 2012-05-12
Hello,

I have hundreds of contacts in my Gmail account. I exported all of them to a CSV file using Gmail's built-in export function. The contacts fall into several categories, such as WTCDE New All_2. The CSV file's format is messy, but each contact occupies one line in the file.
For example:

Roger Sune,Roger,,Sune,,,,,,,,,,,,,,,,,,,,,,,WTCDE New All_2 ::: WTCDE Bible ::: WTCDE attendants ::: WTCDE formal members ::: sunday worship coworkers ::: * My Contacts,* ,rogerpkSune@aol.com,,,,,,,,,,,,,,,,,,,,,,,,,
Rollin Burwers,Rollin,,Burwers,,,,,,,,,,,,,,,,,,,,,,,,* ,burwers@mindspring.com,,,,,,,,,,,,,,,,,,,,,,,,,
aking@abcd.us,aking@abcd.us,,,,,,,,,,,,,,,,,,,,,,,,,,* ,aking@abcd.us,,,,,,,,,,,,,,,,,,,,,,,,,


Now I want to extract all of the email addresses to a file.
The output file should look like this:
rogerpkSune@aol.com,burwers@mindspring.com,aking@abcd.us


The list may contain duplicates; I want only the unique addresses in the result.
Thanks for your help.
Question by:zhshqzyc
LVL 23

Expert Comment

by:Jens Fiederer
ID: 36557391
Getting the input could be as easy as

var input = from line in (((new StreamReader(filename)).ReadToEnd()).Split(' ')) select  line[20];  // or maybe not 20, didn't feel like counting out the commas!  Whatever.

By duplicates, do you mean exact duplicates?
 
LVL 23

Expert Comment

by:Jens Fiederer
ID: 36557420
OK, here is the actual detailed code (I forgot to split out the separate lines in the above). It assumes a "duplicate" is an exact duplicate of the whole string and that your file is at c:/doc/input.txt.
// Read the file line by line and split each line on commas.
var input = from line in File.ReadLines("c:/doc/input.txt") select line.Split(',');
// Field 28 is assumed to be the email column; skip lines with too few fields.
var processed = from fields in input where fields.Length > 28 select fields[28];
// Distinct() removes exact duplicates.
var unique = processed.Distinct();
foreach (var x in unique)
{
    Console.WriteLine(x);
}

 
LVL 23

Expert Comment

by:Jens Fiederer
ID: 36557424
(Added the using directives and the surrounding class, since you need those to compile.)
using System;
using System.Collections.Generic;
using System.Linq;
using System.Text;
using System.IO;

namespace uniquecsv
{
    class Program
    {
        static void Main(string[] args)
        {
            // Read the file line by line and split each line on commas.
            var input = from line in File.ReadLines("c:/doc/input.txt") select line.Split(',');
            // Field 28 is assumed to be the email column; skip lines with too few fields.
            var processed = from fields in input where fields.Length > 28 select fields[28];
            // Distinct() removes exact duplicates.
            var unique = processed.Distinct();
            foreach (var x in unique)
            {
                Console.WriteLine(x);
            }

        }
    }
}


Author Comment

by:zhshqzyc
ID: 36557743
That is 99.5% correct. A few are still wrong; I think using fields[28] is the cause.
Is there any way to extract those as well? Perhaps by looking for the @ symbol.
Or a regular expression for email addresses?
 
LVL 23

Accepted Solution

by:
Jens Fiederer earned 500 total points
ID: 36570463
I can't say what is up with the other 0.5% without actually seeing the offending data.

It is likely that some of your fields themselves contain commas or even newlines. You could handle that with fairly elaborate parsing that makes exceptions for special characters in certain places, but it is probably easier to just iterate through the fields and pick out only those that contain "@" (you can check a field with field.IndexOf("@") != -1) or, as you point out, to match each field against a regex. http://msdn.microsoft.com/en-us/library/ff650303.aspx suggests

 ^(?("")("".+?""@)|(([0-9a-zA-Z]((\.(?!\.))|[-!#\$%&'\*\+/=\?\^`\{\}\|~\w])*)(?<=[0-9a-zA-Z])@))(?(\[)(\[(\d{1,3}\.){3}\d{1,3}\])|(([0-9a-zA-Z][-\w]*[0-9a-zA-Z]\.)+[a-zA-Z]{2,6}))$

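for that. The doubled quotes ("") in the pattern are how a single double quote is written inside a C# verbatim string literal (@"...").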
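As a rough sketch of the "just look for @" idea, something like the following might do. The class and method names (EmailFieldScan, ExtractEmails) are made up for illustration, and treating addresses as case-insensitive when removing duplicates is my assumption, not something stated in the question. It still splits naively on commas, which should be harmless here because an email address itself will not normally contain a comma.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

static class EmailFieldScan
{
    // Hypothetical helper: returns the unique fields that contain '@'.
    public static List<string> ExtractEmails(IEnumerable<string> lines)
    {
        return lines
            .SelectMany(line => line.Split(','))          // naive split on commas
            .Select(field => field.Trim().Trim('"'))      // drop stray whitespace and quotes
            .Where(field => field.IndexOf('@') != -1)     // keep anything containing '@'
            .Distinct(StringComparer.OrdinalIgnoreCase)   // assumption: ignore case for duplicates
            .ToList();
    }
}
```

To get the comma-separated output file from the question, you could call it as EmailFieldScan.ExtractEmails(File.ReadLines("c:/doc/input.txt")) and write string.Join(",", result) with File.WriteAllText (both need using System.IO).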
 
LVL 23

Expert Comment

by:Jens Fiederer
ID: 36570480
Note that if you want to do proper parsing, you need specific details about the CSV format; not all CSV formats are created equal.

http://en.wikipedia.org/wiki/Comma-separated_values explains:

"Simple CSV implementations will not allow field values that contain a comma or other special characters such as newlines. More sophisticated CSV implementations permit commas and other special characters in a field value. Many implementations use " (double quote) characters around values that contain reserved characters (such as commas, double quotes, or newlines); embedded double quote characters may be represented by a pair of consecutive double quotes. (Creativyst 2010) Some CSV implementations may use an escape character such as a backslash to encode reserved characters as an escape sequence, such as Sybase Central."
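For the double-quote convention described in that Wikipedia passage, a minimal field splitter could look like the sketch below. QuotedCsv and SplitRecord are hypothetical names, and I am assuming the quoting rules from the passage (quotes around values with reserved characters, a doubled quote for an embedded quote); I have not checked that Gmail's export follows them exactly. It works on one complete record, so a value with an embedded newline would have to be joined back into a single string before being passed in.

```csharp
using System;
using System.Collections.Generic;
using System.Text;

static class QuotedCsv
{
    // Hypothetical helper: splits one CSV record into fields, treating commas
    // inside double-quoted values as data and "" inside quotes as a literal quote.
    public static List<string> SplitRecord(string record)
    {
        var fields = new List<string>();
        var current = new StringBuilder();
        bool inQuotes = false;

        for (int i = 0; i < record.Length; i++)
        {
            char c = record[i];
            if (inQuotes)
            {
                if (c == '"' && i + 1 < record.Length && record[i + 1] == '"')
                {
                    current.Append('"');        // doubled quote becomes one literal quote
                    i++;                        // skip the second quote of the pair
                }
                else if (c == '"')
                {
                    inQuotes = false;           // closing quote
                }
                else
                {
                    current.Append(c);          // ordinary character inside quotes
                }
            }
            else if (c == '"')
            {
                inQuotes = true;                // opening quote
            }
            else if (c == ',')
            {
                fields.Add(current.ToString()); // unquoted comma ends the field
                current.Clear();
            }
            else
            {
                current.Append(c);
            }
        }

        fields.Add(current.ToString());         // last field has no trailing comma
        return fields;
    }
}
```

With this in place of line.Split(','), a quoted name such as "Sune, Roger" stays in one field, so the field positions no longer shift.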
 
LVL 23

Expert Comment

by:Jens Fiederer
ID: 36570496
While the coding is a bit awkward, if your CSV format is a fair match for some Microsoft format, you might be able to use one of the Data Providers to parse the file for you, as in:

http://www.switchonthecode.com/tutorials/csharp-tutorial-using-the-built-in-oledb-csv-parser
 

Author Comment

by:zhshqzyc
ID: 36570613
I got a solution on the MSDN forum. The person there used a regular expression, which really impressed me.

Thanks for your input anyway. The points are yours.
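The MSDN forum solution is not reproduced in this thread, so the following is only a guess at what a regex-based version might look like, not that poster's code. It runs a deliberately loose pattern over the whole file text instead of splitting into fields; the pattern, the names EmailRegexScan and ExtractEmails, and the case-insensitive duplicate check are all my own assumptions, and the pattern will not accept every valid address.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;
using System.Text.RegularExpressions;

static class EmailRegexScan
{
    // Loose pattern (an assumption, much simpler than the MSDN one quoted earlier):
    // a local part, '@', then a domain with at least one dot.
    private static readonly Regex EmailPattern =
        new Regex(@"[A-Za-z0-9._%+-]+@[A-Za-z0-9-]+(\.[A-Za-z0-9-]+)+");

    // Hypothetical helper: returns the unique matches joined by commas.
    public static string ExtractEmails(string csvText)
    {
        IEnumerable<string> unique = EmailPattern.Matches(csvText)
            .Cast<Match>()                                // MatchCollection is non-generic
            .Select(m => m.Value)
            .Distinct(StringComparer.OrdinalIgnoreCase);  // assumption: ignore case for duplicates
        return string.Join(",", unique);
    }
}
```

Usage would be along the lines of File.WriteAllText("c:/doc/emails.txt", EmailRegexScan.ExtractEmails(File.ReadAllText("c:/doc/input.txt"))), where the output path is just an example.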