Solved

How can I parse this lot into a standard format? Variable CSV files into fixed .

Posted on 2012-03-18
20
641 Views
Last Modified: 2012-04-21
originally here

I receive multiple csv files by email for each of the following csv examples and many more besides.

I need some way of collecting them all in to one table, that means columns have to be sorted into the right order and have gap columns where data is not present.
Not all data is required, but the following would be useful if present.
BANK NAME, DATE, ACCOUNT NAME, ACCCOUNT NUMBER, REF/TYPE, DESCRIPTION, AMOUNT,BALANCE

Typically the source name (i.e. the bank) or credit card) will be missing)
This can be viewed as a "standard", but subject to change.


MBNA CEDIT CARD
Transaction Date,MCC,Description,Amount
10/07/2010,    ,"PAYMENT DIRECT DEBIT - THANK YOU          ",25.82
12/07/2010,    ,"INTEREST CHARGED                                ",-22.87


PAYPAL
Date, Time, Time Zone, Name, Type, Status, Currency, Gross, Fee, Net, From Email Address, To Email Address, Transaction ID, Counterparty Status, Postal Address, Address Status, Item Title, Item ID, Postage and Packing, Insurance Amount, VAT, Option 1 Name, Option 1 Value, Option 2 Name, Option 2 Value, Auction Site, Buyer ID, Item URL, Closing Date, Escrow ID, Invoice ID, Reference Txn ID, Invoice Number, Custom Number, Receipt ID, Contact Phone Number, 

AMEX 
No headings
Example transaction:
03/11/2010,"Reference: AT104560035000010045213"," 244.55","PAYPAL *SPSERVICES SG Singapore SNG 7894","PAYPAL *SPSERVICES SG Singapore SNG 7894 Process Date 04/11/2010 0044568001                             PAYPAL *SPSERVICES SG Singapore SNG 7894",

ALLIANCE & LEICESTER CREDIT CARD
Transaction Date,MCC,Description,Amount
02/09/2010,    ,"PAYMENT DIRECT DEBIT - THANK YOU          ",122.16
03/09/2010,    ,"INTEREST CHARGED                                ",-116.95


RBS
Date, Type, Description, Value, Balance, Account Name, Account Number
NO TRANSACTIONS SINCE 31/3/2010
BALANCE IS NIL
AVAILABILITY IS NIL


RBS
Date, Type, Description, Value, Balance, Account Name, Account Number
08/01/2010,D/D,"'HMRC - NI DD",-12.00,20.59,"'Rev GEORGE Freeson","'262016-12107845",

BARCLAYS
Number,Date,Account,Amount,Subcategory,Memo
 ,14/03/2011,20-66-88 20637894,3000,DIRECTDEP,Mr Customer                    BGC
 ,07/03/2011,20-66-88 20637894,2400,DIRECTDEP,Mr Customer                    BGC
 ,06/03/2011,20-66-88 20627089,-1195,FT ,204455 34563456       HMRC VAT           FT 

Open in new window


Sometimes the headings are just missing.

I need everything sorted in exactly the order sent, BUT in ascending date order. What that means if they arrive newest first, I want the order exactly reversed: this means items on the same date retain their neighbours, but in reverse. Without this balance business controls do not work.

Time period is usually 12 months, so can be 12 x Barclays csv files and same for all the others - which can be concatenated on receipt in a "hot" folder.

That's just the start of trying to get some order.
It is tempting to adopt say the standard used by the QIF type of layout (the open standard one) rather than invent my own.

Having said that, how hard can it be with a bit of help from say vba or msquery of powerpivot?

For 2011 Mac, 2010 PC.
PC only solutions definitely considered.

Anthony
0
Comment
Question by:anthonymellorfca
  • 8
  • 4
  • 3
  • +3
20 Comments
 
LVL 84
ID: 37734876
You're going to have to parse this line-by-line, and figure out where and how to put the data into your tables. You can very likely import this into an Excel sheet or Access temporary table directly, but you'd have to test that first to see what the results will look like. Once you get it into the table, you could then query the table to determine what to do with each row, and how/where to move it over.

You might be better off using something like .NET to handle this, which makes working with oddly formatted text files like this easier.
0
 
LVL 16

Expert Comment

by:kmslogic
ID: 37735024
I think a solution to parsing this file is beyond the scope of a question here on EE...  Actually I'm pretty certain we don't even have all the information we'd need to parse the file.  As LSM said you need to go line by line and have handling for:

Ability to handle No Header (Bank Name?)
Mapping of all the possible field names they give you to the fields you want to store.  For example some of your providers might say Acct. Number and some might say Account Number but you'll need to map those to the same output column of your table.
Ability to handle all possible special messages like BALANCE IS NIL
For your ordering issue you can process each section in an array and then output it to your table forward or backwards after testing whether the dates are being sent in reversed order.

In any case this is not a trivial amount of work and would change over time as new providers introduce new formats.
0
 
LVL 9

Author Comment

by:anthonymellorfca
ID: 37735127
oh dear, perhaps I have not made it more clearly apparent, the above are examples of separate csv files.

Excel will directly open any one of them.

The manual process is:

open csv file

paste columns into a master sheet in the appropriate order.

that's pretty much it. With the exception of the one with no headers a lookup table would work to match existing headers with desired columns.

I wasn't imagining anything quite so complex, more a variation of this existing EE Solution

Anthony
0
 
LVL 16

Expert Comment

by:kmslogic
ID: 37735356
It doesn't matter you still need to do everything I said above even if you are processing them one by one.
0
 
LVL 9

Author Comment

by:anthonymellorfca
ID: 37736505
well, that sets me thinking for myself, no bad thing.

No headers I can do by hand or add headers myself, I can remove zero transaction files manually.

I can open each file  into a worksheet and collect the content by reference to a vlookup table into the master sheet having replaced presented headers with my standard ones.

I can import and use text to columns, but there is no header parsing facility and anyway can't see any advantage over opening the files direct.

I can import using MS Query and/or Powerpivot, both of which permit external data import and field selection and ordering, saving the query for each client for use each year.

I was hoping for some alternative ideas or suggestions as to which route will prove to be the best over time.

In fact I could manually edit each csv and

replace the headings with my own, especially if I keep a note following the first time ready for future years (these tend to be once per year processes).

concatenate the files using dos or macosx and import all as one... though that misses out how to add a column specifying the source of each.

All in all these thoughts amount to not using Excel until the last moment.

I am not thinking we need every possible field name, something else I should have made clear. I am being selective and any beyond the standard requirements can be allocated to some sort of memo/spare columns, or even excluded.

I suppose what might be of interest would be

how to reverse the order of any given text file, except line 1 or not
how to add
a line count column
a source name column
a filename column.

all things I do by hand, none of which takes long which is why finding a more efficient method is a challenge.

In fact Edlin springs to mind, not used that in a while. I could use Lotus 123.. that's an idea.

All the above are separate EE questions I will address separately.

Thanks - sounds like you are saying this can't be done within the terms of reference of Experts Exchange, is that correct? If so I need to consider how to limit my questions and be more clear.

Thank you for your input.

Anthony
0
 
LVL 84
ID: 37736992
I don't think we're saying this cannot be done, but that it cannot be done easily. What is simple for the human mind to process is often very, very complex to put down in a logical code segment. By my count you have 8 or so different transactions, all of which you wish to normalize and process. That's not too hard for humans to do - we just group them together, pick out the data we need (making decisions along the way as to which data to use, and which data to ignore) and then put them into a table. We then move to the next group, and apply a different set of criteria to end up with the same result - moving the transaction into a table.

You'd have to mimic that activity, and while the computer can do this, the real sticking point is the "making decisions" portion of that, since the root criteria changes based on which "group" you're working with.

If you know you're going to have those "types" of transactions and no more, then you could come up with some sort of decision making matrix to effectively move the data into your data store. For example, you could somehow identify an MBNA transaction (perhaps it always starts with the Date and Name), and then apply your "MBNA Criiteria" to that group. The next import may be a PayPal transaction, and you'd apply your PayPal Criteria to that line, and so on.

The trouble, as I said, is (a) determining which Criteria should be used for each line and (b) determining exactly what data to move into your data store after that.
0
 
LVL 9

Accepted Solution

by:
anthonymellorfca earned 0 total points
ID: 37737115
I agree with all that. However, I need to make it clear that any one file is 100% MBNA or 100% Barclays or whatever. There isn't any sorting of that sort..

What I am looking at is turning the csv headers into standard headers
A two line hlookup would do that for any given file, or a looped search and replace,: but probably as quick to type the standard heads in by hand. as you say, the human brain works faster, which is a nice recognition of what my question is truly about; whether it is time to let the machine do this, or not.

Having done the replacement I could then concatenate different source files mbna/barc etc, but i want further data (columns) appended before doing that, as a basis for pivot table reporting, again always done by hand to date because range filling is so quick it has always made sense to do it in excel, manually.

So there isn't a question of line by line criteria. Maybe i should have given only one example because the multiple examples seem to lead you to think every separate example is a separate line in the same file, which is not the case. Each example may represent say 250 transactions all laid out according to the header for that discrete file.

I suppose it can be said I am trying to concatenate all the files based on similar but different column headings, into one file, like the iphone contacts problem i linked to above.

Any criteria, such as column headings can be looked up by reference to the source, such as mbna etc Having said that parsing programs have been doing this since the year dot: cross allocating data based on existing headings to new ones, but Excel doesn't, as far as I can see so far, offer that facility or a way to save it, except maybe if i use ms query or powerpivot, where replacing the source files could run the same query refreshed on the following year's files.

edit: I come from an age when we used to hand transcribe every transaction, so you can imagine that having csv files at all is great, as is Excel (etc).

Here's what the parsing column order looks like:
Layout of csv columns versus standard required
My current thinking is to use numbers for column headings and then sort by column number using the more modern ability to sort across columns in addition to down rows.
all i have to do is add the numbers of blank columns and after sort the blanks and data content will be ordered as desired, for any given example file.

in fact named array constants come to mind for use as lookup tables, each one being the name of the bank/card.

So =vlookup(barclaysheading, barclaysarray,2,false)

where barclaysarrayconstant is:

{Number,999;Date,1;Account,4;Amount,17;Subcategory,13;Memo,12}


edit: the named array constant idea works quite nicely. Trying to figure out how to add the unused column numbers other than manually.
0
 
LVL 74

Expert Comment

by:Jeffrey Coachman
ID: 37738352
The Previous Experts have given you an honest evaluation of what you are up against here.

To be sure,  perhaps it would be easier if you simply posted an example of a few of these CSV files, then posted the *Exact* output you are expecting, based on the data in the sample CSV files.

Just bear in mind what the other Experts posted...
This is a volunteer forum (All Experts are unpaid), so complex, or otherwise time intensive questions, may be shunned by most experts here.

JeffCoachman
0
 
LVL 1

Expert Comment

by:ifpugnacious
ID: 37739303
Here's what I usually do with ETL problems like this:

1. Develop an Access table and Saved Import for each CSV file type. The Access table should have an AutoNumber ID to let me get back to the original record sequence at any time.

2. If the built-in Saved Import gives me any trouble, I switch to ADO data access (better portability if needed later) in VBA to write my table loaders. VBA also lets me put in source file name, timestamp, etc. that Saved Imports can't do. If these are essential, then VBA is a must; even if not essential, they can be really helpful in sorting out the inevitable problems later.

The goal is to get the source data into relational tables in as straighforward a manner as possible, because one you've done that, you can use the power of SQL queries to sort, filter, merge, etc. You're also past the possibility of Excel adding stray commas, the Access importers doing funny stuff with field and string delimiters, etc.

The amount of grief you will have depends on where the CSVs came from. If they came from a system that enforces data cleanliness on its inputs, then you have a good chance of being able to automate the whole thing. If the CSVs came from manually edited Excel worksheets with no cell controls, and turned to CSVs with a Save As, then you are in for a really bad time. Chances are you're somewhere in the middle, but that's why they call it "work," right?

3. Once I have a working query set to clobber the raw source tables, the VBA to load them, and subsequent queries to do my filtering and merging, I then write a macro to drive them.

4. If the macro gives me any trouble (error handling not that great, etc), I switch to VBA, as with the Saved Imports.

I always try to break the operations up into baby steps, and get them working manually first, with a good sample of data. Depending upon the data rate, complexity, and volume, there may be some things that just aren't worth automating. But you still have to have a virtually foolproof way of catching those cases, reliably, and yanking them out of the automated stream before they can do any harm.

I also write automated tests for each step that's even remotely complex, e.g. given this initial CSV record, here's what I should have in my staging table, and then at each step of the transforms. Particularly when you start splitting columns or have any kind of logic (if the row looks like this way, process it like this, otherwise like that), these will save your bacon (and a lot of time besides) in the long run. (I have a VBA framework that works with input and result tables that I really should clean up and publish to GitHub someday).

It hasn't happened in a while because I'm usually picking up the crumbs of the ETL world, but if you really need power tools, switch to an industrial-grade database (I know Microsoft SQL Server the best) so you can also use stored procedures and transactions with commit-rollback. There aren't many things worse than having an ETL process chain that mostly worked, and no way to put things back the way they were.

Happy data-doggin'!
0
 
LVL 74

Expert Comment

by:Jeffrey Coachman
ID: 37739336
...?

So what *specifically* do you need help with..?
0
What Security Threats Are You Missing?

Enhance your security with threat intelligence from the web. Get trending threat insights on hackers, exploits, and suspicious IP addresses delivered to your inbox with our free Cyber Daily.

 
LVL 9

Author Comment

by:anthonymellorfca
ID: 37739717
specifics were in the op.
and my question was a requested follow on.
however, it seems my country practice methods will have to do for another decade until I look  at this next time.

Thanks for providing me with insight to the current state of technology, clearly too much for now.

I guess I should withdraw the question given the consistent comments about this being unpaid forum etc, my apologies for over stepping the mark and I will try to do better next time - especially with examples. I did try but was unable to make myself understood I fear.

See you in the next question.

Anthony
0
 
LVL 84
ID: 37740356
This wasn't answered - it should be deleted.
0
 
LVL 17

Expert Comment

by:nepaluz
ID: 37743675
Anthony, I missed this discussion (from your last) as I've been away.
On the whole, I suppose the comments made here are accurate to a T with regard to utilising Excel. I am not acquainted with using the other tools you refer to, however, I know for a fact that your requirements can easily (and I use the word advisedly) be met by a .NET program (or java etc) similar to the one I recomended in your previous question.

You can PM me if this would be a consideration and we can take it from there, or indeed, if you ask a question in a relevant forum where I'd pickit up , or indeed the multitude of experts in the .NET field.
0
 
LVL 9

Author Comment

by:anthonymellorfca
ID: 37749247
I think the general non familiarity with the other tools I mention, which are all main stream Excel as far as I am concerned and what I was looking for experienced responses about, has lead to responses based on programming languages which is absolutely not what I was looking for - in addition to that my choice of topics may have been mistaken in that regard, attracting full blown programmers instead of Excel super users, for example.

Someone even asks for examples, which are included in the op as is an example of the output table heads.

Asking for the voice of experience with the tools is not at all the same as asking for someone to write a program.

I also think that many have trouble with csv files in Access, often created in Excel and this leads to a jaundiced view of the issue. My csv files are from banks, as is stated, not from Excel.

In any case I think this is best abandoned and we all move on to another day.

Anthony
0
 
LVL 17

Expert Comment

by:nepaluz
ID: 37749457
Very well then. Abandoned it is!
0
 
LVL 16

Expert Comment

by:kmslogic
ID: 37750189
Should be deleted.
0
 
LVL 9

Author Closing Comment

by:anthonymellorfca
ID: 37760300
lost the will to live.
0
 
LVL 9

Author Comment

by:anthonymellorfca
ID: 37875171
I think it should be put up as an example of how not to ask questions - amongst other things.
0
 
LVL 16

Expert Comment

by:kmslogic
ID: 37875407
You are right.  Because you could say "How do I write a General Ledger program, it needs to 1. xxx, 2. xxx, 3.xxx (and so on" and then say you've asked a specific question, but it's not the sort of question that you can ask here.  

If your question was is there some common trick or feature of, say, msquery that people use to do what you asked then the answer is "No."

As far as I can see you've figured out how to do everything you want to do so my question to you is "Why aren't you doing it?"  When you are in the actual process of making your solution, if you run into a wall and can't figure out how to do something particular then that is exactly the sort of question you'll get a good response from here.

Best of luck

Kelly
0
 
LVL 9

Author Comment

by:anthonymellorfca
ID: 37875551
Yes i was hoping for some broad suggestions based on experience other than my own
0

Featured Post

How your wiki can always stay up-to-date

Quip doubles as a “living” wiki and a project management tool that evolves with your organization. As you finish projects in Quip, the work remains, easily accessible to all team members, new and old.
- Increase transparency
- Onboard new hires faster
- Access from mobile/offline

Join & Write a Comment

This is about my first experience with programming Arduino.
This code takes an Excel list of URL’s and adds a header titled “URL List”. It then searches through all URL’s in column “A”, looking for duplicates. When a duplicate is found, it is moved to the top of the list. The duplicate URL’s are then highlig…
Graphs within dashboards are meant to be dynamic, representing data from a period of time that will change each time the dashboard is updated with new data. Rather than update each graph to point to a different set within a static set of data, t…
Access reports are powerful and flexible. Learn how to create a query and then a grouped report using the wizard. Modify the report design after the wizard is done to make it look better. There will be another video to explain how to put the final p…

758 members asked questions and received personalized solutions in the past 7 days.

Join the community of 500,000 technology professionals and ask your questions.

Join & Ask a Question

Need Help in Real-Time?

Connect with top rated Experts

21 Experts available now in Live!

Get 1:1 Help Now