Solved

Extract email list from a csv file.

Posted on 2011-09-18
Last Modified: 2012-05-12
Hello,

I have hundreds of contacts in Gmail. I exported all of them to a CSV file using Gmail's built-in export function. The contacts fall into several groups, such as WTCDE New All_2. The CSV file's format is messy; each contact occupies one line in the file.
For example:

Roger Sune,Roger,,Sune,,,,,,,,,,,,,,,,,,,,,,,WTCDE New All_2 ::: WTCDE Bible ::: WTCDE attendants ::: WTCDE formal members ::: sunday worship coworkers ::: * My Contacts,* ,rogerpkSune@aol.com,,,,,,,,,,,,,,,,,,,,,,,,,
Rollin Burwers,Rollin,,Burwers,,,,,,,,,,,,,,,,,,,,,,,,* ,burwers@mindspring.com,,,,,,,,,,,,,,,,,,,,,,,,,
aking@abcd.us,aking@abcd.us,,,,,,,,,,,,,,,,,,,,,,,,,,* ,aking@abcd.us,,,,,,,,,,,,,,,,,,,,,,,,,

Now I want to extract all the email addresses to a file.
The output file's format should look like this:
rogerpkSune@aol.com,burwers@mindspring.com,aking@abcd.us

The list may contain duplicates; I want only the unique addresses.
Thanks for your help.
Question by:zhshqzyc
Expert Comment

by:Jens Fiederer
ID: 36557391
Getting the input could be as easy as

var input = from line in (((new StreamReader(filename)).ReadToEnd()).Split(' ')) select  line[20];  // or maybe not 20, didn't feel like counting out the commas!  Whatever.

By duplicates, do you mean complete duplicates?
Expert Comment

by:Jens Fiederer
ID: 36557420
OK, actual detailed code here (I forgot to split out the separate lines in the above), assuming a "duplicate" is a complete duplicate of the whole string and your file is in c:/doc/input.txt:
// Read the whole file, split it into lines, and split each line into fields.
var input = from line in (((new StreamReader("c:/doc/input.txt")).ReadToEnd().Split('\n'))) select line.Split(',');
// Field 28 holds the email address in this export; skip lines that are too short.
var processed = from fields in input where fields.Length > 28 select fields[28];
// Group identical strings to keep one copy of each.
var unique = from item in processed group item by item into g select g.Key;
foreach (var x in unique)
{
    Console.WriteLine(x);
}

Expert Comment

by:Jens Fiederer
ID: 36557424
(Added the using directives and the class wrapper, since you need those for it to compile.)
using System;
using System.Collections.Generic;
using System.Linq;
using System.Text;
using System.IO;

namespace uniquecsv
{
    class Program
    {
        static void Main(string[] args)
        {
            var input = from line in (((new StreamReader("c:/doc/input.txt")).ReadToEnd().Split('\n'))) select line.Split(',');
            var processed = from fields in input where fields.Length > 28 select fields[28];
            var unique = from item in processed group item by item into g select g.Key;
            foreach (var x in unique)
            {
                Console.WriteLine(x);
            }

        }
    }
}

Author Comment

by:zhshqzyc
ID: 36557743
99.5% correct. A few are still wrong; I think using fields[28] causes it.
Is there any way to extract them? Please consider the @ symbol.
Maybe a regular expression for email?
Accepted Solution

by:
Jens Fiederer earned 500 total points
ID: 36570463
I can't say what is up with the other 0.5% without actually seeing the offending data.

It is likely that some of your fields themselves contain commas or even newlines, which shifts the field positions. This can be avoided by fairly elaborate parsing that makes exceptions for special characters in certain places, but it is probably easier to iterate through the fields and pick out only those that contain "@" (you can check a field for that with field.IndexOf("@") != -1) or, as you point out, to match each field against a regex. For that, http://msdn.microsoft.com/en-us/library/ff650303.aspx suggests:

 ^(?("")("".+?""@)|(([0-9a-zA-Z]((\.(?!\.))|[-!#\$%&'\*\+/=\?\^`\{\}\|~\w])*)(?<=[0-9a-zA-Z])@))(?(\[)(\[(\d{1,3}\.){3}\d{1,3}\])|(([0-9a-zA-Z][-\w]*[0-9a-zA-Z]\.)+[a-zA-Z]{2,6}))$
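
The "pick out anything containing @" idea above can be sketched roughly as follows. This is only a minimal illustration, not the solution from the MSDN forum mentioned later in the thread: the class name EmailExtractor and its deliberately loose pattern are assumptions made for this example, and the pattern is far less strict than the MSDN regex quoted above. It scans each whole line, so it does not depend on the address being in field 28.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;
using System.Text.RegularExpressions;

// Hypothetical helper for illustration; not part of the original thread.
public static class EmailExtractor
{
    // Loose pattern (an assumption): a run of non-delimiter characters,
    // an "@", then a domain that contains at least one dot.
    private static readonly Regex EmailPattern =
        new Regex(@"[^\s,""<>:;@]+@[^\s,""<>:;@]+\.[^\s,""<>:;@]+");

    // Scans every line for address-like substrings and returns each one once,
    // treating addresses that differ only in letter case as duplicates.
    public static List<string> Extract(IEnumerable<string> lines)
    {
        return lines
            .SelectMany(line => EmailPattern.Matches(line).Cast<Match>())
            .Select(match => match.Value)
            .Distinct(StringComparer.OrdinalIgnoreCase)
            .ToList();
    }
}
```

Something like string.Join(",", EmailExtractor.Extract(File.ReadLines("c:/doc/input.txt"))) should then give the single comma-separated line asked for in the question. Comparing case-insensitively is a design choice for this sketch; drop the comparer if addresses that differ only in case should count as distinct.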
Expert Comment

by:Jens Fiederer
ID: 36570480
Note that if you want to do proper parsing, you need specific details about the CSV format; not all CSV formats are created equal.

http://en.wikipedia.org/wiki/Comma-separated_values explains:

"Simple CSV implementations will not allow field values that contain a comma or other special characters such as newlines. More sophisticated CSV implementations permit commas and other special characters in a field value. Many implementations use " (double quote) characters around values that contain reserved characters (such as commas, double quotes, or newlines); embedded double quote characters may be represented by a pair of consecutive double quotes. (Creativyst 2010) Some CSV implementations may use an escape character such as a backslash to encode reserved characters as an escape sequence, such as Sybase Central."
 
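To make the quoting convention from the Wikipedia passage concrete, here is a rough sketch of a splitter that treats commas inside double quotes as data and a doubled quote as an escaped quote. It assumes that particular convention only; whether the Gmail export follows it was not confirmed in this thread. The class name CsvLine is made up for this example, and the caller is assumed to pass one complete record (a field with an embedded newline would have to be joined first).

```csharp
using System;
using System.Collections.Generic;
using System.Text;

// Hypothetical helper for illustration; handles only the double-quote convention.
public static class CsvLine
{
    // Splits one complete CSV record into its fields.
    public static List<string> Split(string record)
    {
        var fields = new List<string>();
        var current = new StringBuilder();
        bool inQuotes = false;

        for (int i = 0; i < record.Length; i++)
        {
            char c = record[i];
            if (inQuotes)
            {
                if (c == '"' && i + 1 < record.Length && record[i + 1] == '"')
                {
                    current.Append('"'); // doubled quote becomes one literal quote
                    i++;
                }
                else if (c == '"')
                {
                    inQuotes = false;    // closing quote
                }
                else
                {
                    current.Append(c);   // commas in here are ordinary data
                }
            }
            else if (c == '"')
            {
                inQuotes = true;         // opening quote
            }
            else if (c == ',')
            {
                fields.Add(current.ToString()); // unquoted comma ends the field
                current.Clear();
            }
            else
            {
                current.Append(c);
            }
        }

        fields.Add(current.ToString()); // last field has no trailing comma
        return fields;
    }
}
```

With a splitter like this, a quoted field containing commas no longer shifts the positions of the fields after it, which is the kind of shift that would make a fixed index such as fields[28] miss.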
Expert Comment

by:Jens Fiederer
ID: 36570496
While the coding is a bit awkward, if your CSV format is a fair match for some Microsoft format, you might be able to use one of the Data Providers to parse the file for you, as in:

http://www.switchonthecode.com/tutorials/csharp-tutorial-using-the-built-in-oledb-csv-parser

Author Comment

by:zhshqzyc
ID: 36570613
I got a solution on the MSDN forum. The regular expression the guy used really impressed me.

Thanks for your input anyway. The points are yours for your effort.