Solved

Extract email list from a csv file.

Posted on 2011-09-18
Last Modified: 2012-05-12
Hello,

I have hundreds of contacts in Gmail. I exported them all to a CSV file using Gmail's built-in export function. The contacts fall into several groups, such as "WTCDE New All_2". The CSV file's format is messy; each contact occupies one line in the file.
For example:

Roger Sune,Roger,,Sune,,,,,,,,,,,,,,,,,,,,,,,WTCDE New All_2 ::: WTCDE Bible ::: WTCDE attendants ::: WTCDE formal members ::: sunday worship coworkers ::: * My Contacts,* ,rogerpkSune@aol.com,,,,,,,,,,,,,,,,,,,,,,,,,
Rollin Burwers,Rollin,,Burwers,,,,,,,,,,,,,,,,,,,,,,,,* ,burwers@mindspring.com,,,,,,,,,,,,,,,,,,,,,,,,,
aking@abcd.us,aking@abcd.us,,,,,,,,,,,,,,,,,,,,,,,,,,* ,aking@abcd.us,,,,,,,,,,,,,,,,,,,,,,,,,


Now I want to extract all the email addresses to a file.
The output file's format should look like this:
rogerpkSune@aol.com,burwers@mindspring.com,aking@abcd.us


The addresses may include duplicates; I want a unique result.
Thanks for your help.
Question by:zhshqzyc
Expert Comment

by:Jens Fiederer
ID: 36557391
Getting the input could be as easy as

var input = from line in (((new StreamReader(filename)).ReadToEnd()).Split(' ')) select  line[20];  // or maybe not 20, didn't feel like counting out the commas!  Whatever.

By duplicates, do you mean exact duplicates?

Expert Comment

by:Jens Fiederer
ID: 36557420
OK, actual detailed code here (I forgot to split out the separate lines in the above), assuming a "duplicate" is an exact duplicate of the whole string and your file is in c:/doc/input.txt:
// File.ReadLines copes with both \n and \r\n line endings and closes the file when done.
var input = from line in File.ReadLines("c:/doc/input.txt") select line.Split(',');
// Field 28 is assumed to hold the e-mail address, as in the sample rows.
var processed = from fields in input where fields.Length > 28 select fields[28];
// Keep each exact string once.
var unique = processed.Distinct();
foreach (var x in unique)
{
    Console.WriteLine(x);
}


Expert Comment

by:Jens Fiederer
ID: 36557424
(Added the using directives and the surrounding class, since you need those for the code to compile.)
using System;
using System.Collections.Generic;
using System.Linq;
using System.Text;
using System.IO;

namespace uniquecsv
{
    class Program
    {
        static void Main(string[] args)
        {
            // File.ReadLines copes with both \n and \r\n line endings and closes the file when done.
            var input = from line in File.ReadLines("c:/doc/input.txt") select line.Split(',');
            // Field 28 is assumed to hold the e-mail address, as in the sample rows.
            var processed = from fields in input where fields.Length > 28 select fields[28];
            // Keep each exact string once.
            var unique = processed.Distinct();
            foreach (var x in unique)
            {
                Console.WriteLine(x);
            }

        }
    }
}


Author Comment

by:zhshqzyc
ID: 36557743
99.5% correct. A few are still wrong; I think using fields[28] causes it.
Is there any way to extract those as well? Please consider the @ symbol.
A regular expression for email, maybe?

Accepted Solution

by:
Jens Fiederer earned 500 total points
ID: 36570463
I can't say what is up with the other 0.5% without actually seeing the offending data.

It is likely that some of your fields themselves contain commas or even newlines. This can be handled by fairly elaborate parsing that makes exceptions for special characters in certain places, but it is probably easier to just iterate through the fields and pick out only those that contain "@" (you can check a field for that by testing field.IndexOf("@") != -1) or, as you point out, to match against a regex. http://msdn.microsoft.com/en-us/library/ff650303.aspx suggests

 ^(?("")("".+?""@)|(([0-9a-zA-Z]((\.(?!\.))|[-!#\$%&'\*\+/=\?\^`\{\}\|~\w])*)(?<=[0-9a-zA-Z])@))(?(\[)(\[(\d{1,3}\.){3}\d{1,3}\])|(([0-9a-zA-Z][-\w]*[0-9a-zA-Z]\.)+[a-zA-Z]{2,6}))$

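for that (each doubled "" stands for a single double quote, which is how it is written inside a C# verbatim string, @"...").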
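The simpler "@" check mentioned above could be sketched roughly as follows. This is only an illustration: the class and method names (EmailFieldScanner, ExtractEmails) are made up for this example, and it assumes that any field containing "@" is an address and that addresses which differ only in letter case count as duplicates.

```csharp
using System;
using System.Collections.Generic;

static class EmailFieldScanner
{
    // Returns every distinct field that contains "@", in first-seen order.
    // Assumption: comparison ignores case, so "A@x.com" and "a@x.com" are one address.
    public static List<string> ExtractEmails(IEnumerable<string> lines)
    {
        var seen = new HashSet<string>(StringComparer.OrdinalIgnoreCase);
        var result = new List<string>();
        foreach (var line in lines)
        {
            // Naive split: a quoted field that contains a comma is cut into pieces,
            // but the piece holding the address still contains "@".
            foreach (var field in line.Split(','))
            {
                var candidate = field.Trim();
                if (candidate.IndexOf('@') != -1 && seen.Add(candidate))
                {
                    result.Add(candidate);
                }
            }
        }
        return result;
    }
}
```

To get the single comma-separated line asked for in the question, the result could be joined and written out, for instance File.WriteAllText("c:/doc/output.txt", string.Join(",", EmailFieldScanner.ExtractEmails(File.ReadLines("c:/doc/input.txt")))), where output.txt is just a placeholder name. Note that a name or note field that happens to contain "@" would be picked up too.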

Expert Comment

by:Jens Fiederer
ID: 36570480
Note that if you want to do proper parsing, you need specific details about the CSV format; not all CSV formats are created equal.

http://en.wikipedia.org/wiki/Comma-separated_values explains:

"Simple CSV implementations will not allow field values that contain a comma or other special characters such as newlines. More sophisticated CSV implementations permit commas and other special characters in a field value. Many implementations use " (double quote) characters around values that contain reserved characters (such as commas, double quotes, or newlines); embedded double quote characters may be represented by a pair of consecutive double quotes. (Creativyst 2010) Some CSV implementations may use an escape character such as a backslash to encode reserved characters as an escape sequence, such as Sybase Central."
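As a rough illustration of the double-quote convention described in that passage, a splitter for a single record might look like the sketch below. The names QuotedCsv and SplitRecord are invented for this example, and it assumes the file follows that convention (fields optionally wrapped in double quotes, with "" inside quotes meaning a literal quote); whether this particular export does so has not been confirmed in this thread.

```csharp
using System.Collections.Generic;
using System.Text;

static class QuotedCsv
{
    // Splits one CSV record into fields. Commas inside double-quoted fields are kept,
    // and a pair of double quotes inside a quoted field yields one literal quote.
    public static List<string> SplitRecord(string record)
    {
        var fields = new List<string>();
        var current = new StringBuilder();
        bool inQuotes = false;
        for (int i = 0; i < record.Length; i++)
        {
            char c = record[i];
            if (inQuotes)
            {
                if (c == '"' && i + 1 < record.Length && record[i + 1] == '"')
                {
                    current.Append('"'); // escaped quote
                    i++;                 // skip the second quote of the pair
                }
                else if (c == '"')
                {
                    inQuotes = false;    // closing quote
                }
                else
                {
                    current.Append(c);
                }
            }
            else if (c == '"')
            {
                inQuotes = true;         // opening quote
            }
            else if (c == ',')
            {
                fields.Add(current.ToString());
                current.Clear();
            }
            else
            {
                current.Append(c);
            }
        }
        fields.Add(current.ToString());  // last field has no trailing comma
        return fields;
    }
}
```

This works on one complete record at a time. If a quoted field contains a newline, the physical lines that make up the record have to be joined before calling it, which this sketch does not attempt.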

Expert Comment

by:Jens Fiederer
ID: 36570496
While the coding is a bit awkward, if your CSV format is a fair match for some Microsoft format, you might be able to use one of the Data Providers to parse the file for you, as in:

http://www.switchonthecode.com/tutorials/csharp-tutorial-using-the-built-in-oledb-csv-parser

Author Comment

by:zhshqzyc
ID: 36570613
I got a solution on the MSDN forum. The person there used a regular expression, which really impressed me.

Thanks for your input anyway. You get the points for your effort.
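The MSDN forum answer is not reproduced in this thread, so the following is not that solution. It is only a sketch of what a regex-based approach could look like: it ignores the CSV structure entirely and scans the whole text for address-like strings. The pattern is deliberately loose and far simpler than the MSDN one quoted earlier, and the names EmailRegexScanner and ExtractEmails are made up for the example.

```csharp
using System;
using System.Collections.Generic;
using System.Text.RegularExpressions;

static class EmailRegexScanner
{
    // Loose pattern (an assumption, not a full validator):
    // local part, "@", one or more dotted labels, then a TLD of 2+ letters.
    private static readonly Regex EmailPattern =
        new Regex(@"[A-Za-z0-9._%+-]+@[A-Za-z0-9-]+(\.[A-Za-z0-9-]+)*\.[A-Za-z]{2,}");

    // Returns the distinct matches in first-seen order, ignoring case when comparing.
    public static List<string> ExtractEmails(string text)
    {
        var seen = new HashSet<string>(StringComparer.OrdinalIgnoreCase);
        var result = new List<string>();
        foreach (Match m in EmailPattern.Matches(text))
        {
            if (seen.Add(m.Value))
            {
                result.Add(m.Value);
            }
        }
        return result;
    }
}
```

Calling string.Join(",", EmailRegexScanner.ExtractEmails(File.ReadAllText("c:/doc/input.txt"))) would then produce the single comma-separated line from the question. Because the scan does not depend on field positions, stray commas or newlines inside other fields do not affect it.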