Extract email list from a csv file.

Hello,

I have hundreds of contacts in Gmail. I exported all of them to a CSV file using Gmail's built-in export function. The contacts fall into several categories, such as WTCDE New All_2. The CSV file's format is messy; each contact occupies one line in the file.
For example:

Roger Sune,Roger,,Sune,,,,,,,,,,,,,,,,,,,,,,,WTCDE New All_2 ::: WTCDE Bible ::: WTCDE attendants ::: WTCDE formal members ::: sunday worship coworkers ::: * My Contacts,* ,rogerpkSune@aol.com,,,,,,,,,,,,,,,,,,,,,,,,,
Rollin Burwers,Rollin,,Burwers,,,,,,,,,,,,,,,,,,,,,,,,* ,burwers@mindspring.com,,,,,,,,,,,,,,,,,,,,,,,,,
aking@abcd.us,aking@abcd.us,,,,,,,,,,,,,,,,,,,,,,,,,,* ,aking@abcd.us,,,,,,,,,,,,,,,,,,,,,,,,,

Now I want to extract all the email addresses to a file.
The output file's format should look like this:
rogerpkSune@aol.com,burwers@mindspring.com,aking@abcd.us

There may be duplicates; I want only the unique addresses.
Thanks for your help.
Asked by: zhshqzyc
1 Solution
 
Jens Fiederer commented:
Getting the input could be as easy as

var input = from line in (((new StreamReader(filename)).ReadToEnd()).Split(' ')) select  line[20];  // or maybe not 20, didn't feel like counting out the commas!  Whatever.

By duplicates, do you mean complete duplicates?
 
Jens Fiederer commented:
OK, here is the actual detailed code (I forgot to split out the separate lines above). It assumes a "duplicate" is a complete duplicate of the whole string and that your file is at c:/doc/input.txt.
// Read the file line by line and split each line into comma-separated fields.
var input = from line in File.ReadLines("c:/doc/input.txt") select line.Split(',');
// Keep field 28 (assumed to hold the e-mail address in this export) from rows long enough to have it.
var processed = from fields in input where fields.Length > 28 select fields[28];
// Drop exact duplicates.
var unique = processed.Distinct();
foreach (var x in unique)
{
    Console.WriteLine(x);
}

 
Jens Fiederer commented:
(Added the using directives and the class wrapper, since you need them to compile.)
using System;
using System.Collections.Generic;
using System.Linq;
using System.Text;
using System.IO;

namespace uniquecsv
{
    class Program
    {
        static void Main(string[] args)
        {
            // Read the file line by line and split each line into comma-separated fields.
            var input = from line in File.ReadLines("c:/doc/input.txt") select line.Split(',');
            // Keep field 28 (assumed to hold the e-mail address in this export) from rows long enough to have it.
            var processed = from fields in input where fields.Length > 28 select fields[28];
            // Drop exact duplicates.
            var unique = processed.Distinct();
            foreach (var x in unique)
            {
                Console.WriteLine(x);
            }

        }
    }
}

 
zhshqzyc (author) commented:
It is 99.5% correct. A few results are still wrong; I think using fields[28] causes it.
Is there any way to extract those as well? Please consider the @ symbol.
Maybe a regular expression for email addresses?
 
Jens Fiederer commented:
I can't say what is wrong with the other 0.5% without actually seeing the offending data.

Most likely some of your fields themselves contain commas or even newlines, which shifts the field positions. You could handle that with fairly elaborate parsing that makes exceptions for special characters in certain places, but it is probably easier to iterate through the fields and pick out only those that contain "@" (you can check a field with field.IndexOf("@") != -1) or, as you point out, to match each field against a regex. http://msdn.microsoft.com/en-us/library/ff650303.aspx suggests this pattern:

 ^(?("")("".+?""@)|(([0-9a-zA-Z]((\.(?!\.))|[-!#\$%&'\*\+/=\?\^`\{\}\|~\w])*)(?<=[0-9a-zA-Z])@))(?(\[)(\[(\d{1,3}\.){3}\d{1,3}\])|(([0-9a-zA-Z][-\w]*[0-9a-zA-Z]\.)+[a-zA-Z]{2,6}))$
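The field-scanning idea can be sketched roughly as follows. This is only an illustration: the class and method names (EmailFieldScan, ExtractEmails) are made up for this example, and it assumes that any field containing "@" is an address and that fields are split on plain commas.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

static class EmailFieldScan
{
    // Returns every distinct field that contains '@', in first-seen order.
    // Assumption: a field with '@' in it is an e-mail address.
    public static List<string> ExtractEmails(IEnumerable<string> lines)
    {
        return lines
            .SelectMany(line => line.Split(','))       // look at every field, not just one index
            .Select(field => field.Trim())              // drop stray whitespace and '\r'
            .Where(field => field.IndexOf('@') != -1)   // keep only fields containing '@'
            .Distinct()                                 // exact-match de-duplication
            .ToList();
    }
}
```

To get the single comma-separated line asked for in the question, the result could be written with something like File.WriteAllText("c:/doc/output.txt", string.Join(",", EmailFieldScan.ExtractEmails(File.ReadLines("c:/doc/input.txt")))), where the output path is a placeholder. Replacing the IndexOf test with Regex.IsMatch would give the stricter regex-based check.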
 
Jens Fiederer commented:
Note that if you want to do proper parsing, you need specific details about the CSV format; not all CSV formats are created equal.

http://en.wikipedia.org/wiki/Comma-separated_values explains:

"Simple CSV implementations will not allow field values that contain a comma or other special characters such as newlines. More sophisticated CSV implementations permit commas and other special characters in a field value. Many implementations use " (double quote) characters around values that contain reserved characters (such as commas, double quotes, or newlines); embedded double quote characters may be represented by a pair of consecutive double quotes. (Creativyst 2010) Some CSV implementations may use an escape character such as a backslash to encode reserved characters as an escape sequence, such as Sybase Central."
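As a rough illustration of the quoting convention described in that excerpt, here is a minimal sketch of a quote-aware splitter. The names QuotedCsv and SplitLine are hypothetical, and it assumes double-quoted fields with doubled quotes as the escape and no newlines inside a field; whether the Gmail export follows exactly these rules has not been checked here.

```csharp
using System;
using System.Collections.Generic;
using System.Text;

static class QuotedCsv
{
    // Splits one CSV record into fields, honoring double-quoted fields
    // and treating "" inside a quoted field as a literal quote.
    // Assumption: the record contains no embedded newlines.
    public static List<string> SplitLine(string line)
    {
        var fields = new List<string>();
        var current = new StringBuilder();
        bool inQuotes = false;

        for (int i = 0; i < line.Length; i++)
        {
            char c = line[i];
            if (inQuotes)
            {
                if (c == '"' && i + 1 < line.Length && line[i + 1] == '"')
                {
                    current.Append('"'); // doubled quote becomes one literal quote
                    i++;
                }
                else if (c == '"')
                {
                    inQuotes = false;    // closing quote
                }
                else
                {
                    current.Append(c);   // commas are ordinary text inside quotes
                }
            }
            else if (c == '"')
            {
                inQuotes = true;         // opening quote
            }
            else if (c == ',')
            {
                fields.Add(current.ToString()); // unquoted comma ends the field
                current.Clear();
            }
            else
            {
                current.Append(c);
            }
        }
        fields.Add(current.ToString());  // last field has no trailing comma
        return fields;
    }
}
```

Using QuotedCsv.SplitLine(line) in place of line.Split(',') keeps a quoted value such as "Sune, Roger" in a single field, so a fixed field index is less likely to drift.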
 
Jens Fiederer commented:
While the coding is a bit awkward, if your CSV format is a fair match for some Microsoft format, you might be able to use one of the Data Providers to parse the file for you, as in:

http://www.switchonthecode.com/tutorials/csharp-tutorial-using-the-built-in-oledb-csv-parser
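For comparison, here is a sketch of a different built-in parser. This is not the OleDb approach from the linked tutorial; it uses TextFieldParser from the Microsoft.VisualBasic.FileIO namespace, so it assumes your project can reference the Microsoft.VisualBasic assembly. The class and method names (BuiltInCsvParser, ReadRows) are invented for the example.

```csharp
using System.Collections.Generic;
using System.IO;
using Microsoft.VisualBasic.FileIO;

static class BuiltInCsvParser
{
    // Reads all records from a comma-delimited source and returns
    // each record as an array of fields. Quoted fields may contain commas.
    public static List<string[]> ReadRows(TextReader reader)
    {
        var rows = new List<string[]>();
        using (var parser = new TextFieldParser(reader))
        {
            parser.TextFieldType = FieldType.Delimited;
            parser.SetDelimiters(",");
            parser.HasFieldsEnclosedInQuotes = true;

            while (!parser.EndOfData)
            {
                rows.Add(parser.ReadFields());
            }
        }
        return rows;
    }
}
```

Calling BuiltInCsvParser.ReadRows(new StreamReader("c:/doc/input.txt")) would give one string array per contact, which could then be filtered for fields containing "@" as discussed earlier in the thread.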
 
zhshqzyc (author) commented:
I got a solution on the MSDN forum. The regular expression the person used really impressed me.

Thanks for your input anyway. Points for your fun anyway.