Anthony Mellor (United Kingdom) asked:

How can I parse this lot into a standard format? Variable CSV files into a fixed layout.


I receive multiple CSV files by email, matching each of the following examples and many more besides.

I need some way of collecting them all into one table; that means columns have to be sorted into the right order, with gap columns where data is not present.
Not all data is required, but the following would be useful if present.

Typically the source name (i.e. the bank or credit card) will be missing.
This can be viewed as a "standard", but subject to change.

Transaction Date,MCC,Description,Amount
10/07/2010,    ,"PAYMENT DIRECT DEBIT - THANK YOU          ",25.82
12/07/2010,    ,"INTEREST CHARGED                                ",-22.87

Date, Time, Time Zone, Name, Type, Status, Currency, Gross, Fee, Net, From Email Address, To Email Address, Transaction ID, Counterparty Status, Postal Address, Address Status, Item Title, Item ID, Postage and Packing, Insurance Amount, VAT, Option 1 Name, Option 1 Value, Option 2 Name, Option 2 Value, Auction Site, Buyer ID, Item URL, Closing Date, Escrow ID, Invoice ID, Reference Txn ID, Invoice Number, Custom Number, Receipt ID, Contact Phone Number, 

No headings
Example transaction:
03/11/2010,"Reference: AT104560035000010045213"," 244.55","PAYPAL *SPSERVICES SG Singapore SNG 7894","PAYPAL *SPSERVICES SG Singapore SNG 7894 Process Date 04/11/2010 0044568001                             PAYPAL *SPSERVICES SG Singapore SNG 7894",

Transaction Date,MCC,Description,Amount
02/09/2010,    ,"PAYMENT DIRECT DEBIT - THANK YOU          ",122.16
03/09/2010,    ,"INTEREST CHARGED                                ",-116.95

Date, Type, Description, Value, Balance, Account Name, Account Number

Date, Type, Description, Value, Balance, Account Name, Account Number
08/01/2010,D/D,"'HMRC - NI DD",-12.00,20.59,"'Rev GEORGE Freeson","'262016-12107845",

 ,14/03/2011,20-66-88 20637894,3000,DIRECTDEP,Mr Customer                    BGC
 ,07/03/2011,20-66-88 20637894,2400,DIRECTDEP,Mr Customer                    BGC
 ,06/03/2011,20-66-88 20627089,-1195,FT ,204455 34563456       HMRC VAT           FT 


Sometimes the headings are just missing.

I need everything sorted in exactly the order sent, BUT in ascending date order. What that means is: if they arrive newest first, I want the order exactly reversed. This way, items on the same date retain their neighbours, but in reverse. Without this, balance business controls do not work.
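That reversal rule can be sketched in a few lines - Python here purely for illustration, assuming the rows have already been read into a list (the sample rows are made up from the examples above; the same logic ports to VBA):

```python
from datetime import datetime

def to_ascending(rows, date_index=0, fmt="%d/%m/%Y"):
    """Return rows in ascending date order.

    If the file arrived newest-first, reverse the whole list: rows
    sharing a date stay neighbours (in reversed order), which is what
    running-balance checks need.
    """
    dates = [datetime.strptime(r[date_index], fmt) for r in rows]
    if dates and dates[0] > dates[-1]:   # newest first: reverse wholesale
        return rows[::-1]
    return rows

rows = [
    ["12/07/2010", "INTEREST CHARGED", "-22.87"],
    ["10/07/2010", "PAYMENT A", "25.82"],
    ["10/07/2010", "PAYMENT B", "10.00"],
]
print(to_ascending(rows))
```

Note that sorting by date alone would not do: a sort can shuffle same-date rows, while a wholesale reversal cannot.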

Time period is usually 12 months, so there can be 12 Barclays CSV files, and the same for all the others - which can be concatenated on receipt in a "hot" folder.

That's just the start of trying to get some order.
It is tempting to adopt say the standard used by the QIF type of layout (the open standard one) rather than invent my own.

Having said that, how hard can it be with a bit of help from, say, VBA, MS Query or PowerPivot?

For 2011 Mac, 2010 PC.
PC only solutions definitely considered.

Scott McDaniel (EE MVE, United States):

You're going to have to parse this line-by-line, and figure out where and how to put the data into your tables. You can very likely import this into an Excel sheet or Access temporary table directly, but you'd have to test that first to see what the results will look like. Once you get it into the table, you could then query the table to determine what to do with each row, and how/where to move it over.

You might be better off using something like .NET to handle this, which makes working with oddly formatted text files like this easier.
I think a solution to parsing this file is beyond the scope of a question here on EE...  Actually I'm pretty certain we don't even have all the information we'd need to parse the file.  As LSM said you need to go line by line and have handling for:

Ability to handle No Header (Bank Name?)
Mapping of all the possible field names they give you to the fields you want to store.  For example some of your providers might say Acct. Number and some might say Account Number but you'll need to map those to the same output column of your table.
Ability to handle all possible special messages like BALANCE IS NIL
For your ordering issue you can process each section in an array and then output it to your table forward or backwards after testing whether the dates are being sent in reversed order.
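The field-name mapping described above is really just a lookup table. A minimal sketch in Python (the alias names here are examples only, not an exhaustive list - in Excel the same thing is a VLOOKUP table, in VBA a Dictionary):

```python
# Map each provider's header variant (lower-cased) to one canonical
# output column name. Extend this as new providers appear.
HEADER_MAP = {
    "acct. number": "Account Number",
    "account number": "Account Number",
    "transaction date": "Date",
    "date": "Date",
    "value": "Amount",
    "amount": "Amount",
    "description": "Description",
}

def canonical_headers(headers):
    """Translate one file's header row into the standard column names.
    Unknown headers fall through unchanged, so they can be parked in a
    memo/spare column later rather than lost."""
    return [HEADER_MAP.get(h.strip().lower(), h.strip()) for h in headers]

print(canonical_headers(["Transaction Date", "MCC", "Description", "Amount"]))
```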

In any case this is not a trivial amount of work and would change over time as new providers introduce new formats.
Anthony Mellor:


Oh dear, perhaps I have not made it clearly apparent: the above are examples of separate CSV files.

Excel will directly open any one of them.

The manual process is:

open csv file

paste columns into a master sheet in the appropriate order.

that's pretty much it. With the exception of the one with no headers, a lookup table would work to match existing headers with desired columns.

I wasn't imagining anything quite so complex, more a variation of this existing EE Solution

It doesn't matter; you still need to do everything I said above, even if you are processing them one by one.
well, that sets me thinking for myself, no bad thing.

No headers I can do by hand or add headers myself, I can remove zero transaction files manually.

I can open each file into a worksheet and collect the content by reference to a VLOOKUP table into the master sheet, having replaced the presented headers with my standard ones.

I can import and use Text to Columns, but there is no header-parsing facility, and anyway I can't see any advantage over opening the files directly.

I can import using MS Query and/or Powerpivot, both of which permit external data import and field selection and ordering, saving the query for each client for use each year.

I was hoping for some alternative ideas or suggestions as to which route will prove to be the best over time.

In fact I could manually edit each csv and

replace the headings with my own, especially if I keep a note following the first time ready for future years (these tend to be once per year processes).

concatenate the files using dos or macosx and import all as one... though that misses out how to add a column specifying the source of each.
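The concatenation step need not lose the source: any scripting tool can append a file-name column as it merges. A sketch, assuming Python rather than DOS/macOS shell (file names are hypothetical):

```python
import csv
import glob
import os

def concatenate(pattern, out_path):
    """Concatenate every CSV matching `pattern` into one output file,
    appending a final column naming the source file of each row."""
    paths = sorted(glob.glob(pattern))   # resolve before creating output
    with open(out_path, "w", newline="") as out:
        writer = csv.writer(out)
        for path in paths:
            with open(path, newline="") as f:
                for row in csv.reader(f):
                    writer.writerow(row + [os.path.basename(path)])
```

A call like `concatenate("hotfolder/*.csv", "combined.txt")` would then merge a whole "hot" folder in one go.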

All in all these thoughts amount to not using Excel until the last moment.

I am not thinking we need every possible field name, something else I should have made clear. I am being selective and any beyond the standard requirements can be allocated to some sort of memo/spare columns, or even excluded.

I suppose what might be of interest would be

how to reverse the order of any given text file, except line 1 or not
how to add
a line count column
a source name column
a filename column.

all things I do by hand, none of which takes long, which is why finding a more efficient method is a challenge.
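For what it's worth, all four of those operations fit in one short routine. A sketch in Python (illustrative only; the source name and paths are hypothetical):

```python
import csv
import os

def augment(in_path, out_path, source_name, has_header=True, reverse=True):
    """Reverse a CSV's data rows (keeping line 1 in place when it is a
    header) and append line-count, source-name and file-name columns."""
    with open(in_path, newline="") as f:
        rows = list(csv.reader(f))
    header, data = (rows[:1], rows[1:]) if has_header else ([], rows)
    if reverse:
        data = data[::-1]
    fname = os.path.basename(in_path)
    with open(out_path, "w", newline="") as f:
        w = csv.writer(f)
        if header:
            w.writerow(header[0] + ["Line", "Source", "File"])
        for i, row in enumerate(data, start=1):
            w.writerow(row + [str(i), source_name, fname])
```

The `has_header` flag covers the "except line 1 or not" case; the line count is assigned after any reversal, so it records the final (ascending) order.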

In fact Edlin springs to mind - not used that in a while. I could use Lotus 1-2-3... that's an idea.

All the above are separate EE questions I will address separately.

Thanks - sounds like you are saying this can't be done within the terms of reference of Experts Exchange, is that correct? If so I need to consider how to limit my questions and be more clear.

Thank you for your input.

I don't think we're saying this cannot be done, but that it cannot be done easily. What is simple for the human mind to process is often very, very complex to put down in a logical code segment. By my count you have 8 or so different transaction formats, all of which you wish to normalize and process. That's not too hard for humans to do - we just group them together, pick out the data we need (making decisions along the way as to which data to use, and which data to ignore) and then put them into a table. We then move to the next group, and apply a different set of criteria to end up with the same result - moving the transaction into a table.

You'd have to mimic that activity, and while the computer can do this, the real sticking point is the "making decisions" portion of that, since the root criteria changes based on which "group" you're working with.

If you know you're going to have those "types" of transactions and no more, then you could come up with some sort of decision-making matrix to effectively move the data into your data store. For example, you could somehow identify an MBNA transaction (perhaps it always starts with the Date and Name), and then apply your "MBNA Criteria" to that group. The next import may be a PayPal transaction, and you'd apply your PayPal Criteria to that line, and so on.

The trouble, as I said, is (a) determining which Criteria should be used for each line and (b) determining exactly what data to move into your data store after that.
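One simple form that decision-making matrix could take is matching the first line against known header signatures. A sketch (the signatures are copied from the examples in the question; the tags on the right are hypothetical handler names):

```python
# First-line "signatures" for the formats shown in the question.
SIGNATURES = {
    "Transaction Date,MCC,Description,Amount": "card_statement",
    "Date, Type, Description, Value, Balance, Account Name, Account Number":
        "bank_statement",
}

def detect_format(first_line):
    """Return a format tag for a recognised header row, or None for a
    headerless file, which then needs a row-shape heuristic instead."""
    return SIGNATURES.get(first_line.strip())
```

The `None` case is exactly the hard part discussed above: headerless files force you to guess from the shape of the data rows themselves.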
Anthony Mellor (United Kingdom):

The Previous Experts have given you an honest evaluation of what you are up against here.

To be sure,  perhaps it would be easier if you simply posted an example of a few of these CSV files, then posted the *Exact* output you are expecting, based on the data in the sample CSV files.

Just bear in mind what the other Experts posted...
This is a volunteer forum (all Experts are unpaid), so complex or otherwise time-intensive questions may be shunned by most Experts here.

Here's what I usually do with ETL problems like this:

1. Develop an Access table and Saved Import for each CSV file type. The Access table should have an AutoNumber ID to let me get back to the original record sequence at any time.

2. If the built-in Saved Import gives me any trouble, I switch to ADO data access (better portability if needed later) in VBA to write my table loaders. VBA also lets me put in source file name, timestamp, etc. that Saved Imports can't do. If these are essential, then VBA is a must; even if not essential, they can be really helpful in sorting out the inevitable problems later.

The goal is to get the source data into relational tables in as straightforward a manner as possible, because once you've done that, you can use the power of SQL queries to sort, filter, merge, etc. You're also past the possibility of Excel adding stray commas, the Access importers doing funny stuff with field and string delimiters, etc.
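The staging-table idea can be prototyped outside Access with any SQL engine. A sketch using Python's built-in sqlite3 as a stand-in (table and column names are illustrative, and the AutoNumber becomes an AUTOINCREMENT key):

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.execute("""
    CREATE TABLE staging (
        id INTEGER PRIMARY KEY AUTOINCREMENT,  -- preserves arrival order
        txn_date TEXT, description TEXT, amount REAL, source TEXT)
""")

rows = [("10/07/2010", "PAYMENT DIRECT DEBIT", 25.82, "mbna.csv"),
        ("12/07/2010", "INTEREST CHARGED", -22.87, "mbna.csv")]
con.executemany(
    "INSERT INTO staging (txn_date, description, amount, source)"
    " VALUES (?, ?, ?, ?)", rows)

# The AutoNumber-style id lets you recover the original record
# sequence at any time, whatever later sorting and filtering you do.
out = con.execute(
    "SELECT txn_date, amount FROM staging ORDER BY id").fetchall()
print(out)
```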

The amount of grief you will have depends on where the CSVs came from. If they came from a system that enforces data cleanliness on its inputs, then you have a good chance of being able to automate the whole thing. If the CSVs came from manually edited Excel worksheets with no cell controls, and turned to CSVs with a Save As, then you are in for a really bad time. Chances are you're somewhere in the middle, but that's why they call it "work," right?

3. Once I have a working query set to clobber the raw source tables, the VBA to load them, and subsequent queries to do my filtering and merging, I then write a macro to drive them.

4. If the macro gives me any trouble (error handling not that great, etc), I switch to VBA, as with the Saved Imports.

I always try to break the operations up into baby steps, and get them working manually first, with a good sample of data. Depending upon the data rate, complexity, and volume, there may be some things that just aren't worth automating. But you still have to have a virtually foolproof way of catching those cases, reliably, and yanking them out of the automated stream before they can do any harm.

I also write automated tests for each step that's even remotely complex, e.g. given this initial CSV record, here's what I should have in my staging table, and then at each step of the transforms. Particularly when you start splitting columns or have any kind of logic (if the row looks like this way, process it like this, otherwise like that), these will save your bacon (and a lot of time besides) in the long run. (I have a VBA framework that works with input and result tables that I really should clean up and publish to GitHub someday).
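Those per-step tests need nothing heavier than plain assertions. For example, to pin down a column-splitting transform (the function below is a hypothetical stand-in, using the sort-code field from the sample data):

```python
def split_sort_code(field):
    """Hypothetical transform: split a field like '20-66-88 20637894'
    into (sort code, account number)."""
    sort_code, _, account = field.partition(" ")
    return sort_code, account

# Given this source value, here is what the staging table should hold:
assert split_sort_code("20-66-88 20637894") == ("20-66-88", "20637894")
```

Run the assertions after every change to the transform; a failure tells you immediately which step regressed.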

It hasn't happened in a while because I'm usually picking up the crumbs of the ETL world, but if you really need power tools, switch to an industrial-grade database (I know Microsoft SQL Server the best) so you can also use stored procedures and transactions with commit-rollback. There aren't many things worse than having an ETL process chain that mostly worked, and no way to put things back the way they were.

Happy data-doggin'!

So what *specifically* do you need help with?
Specifics were in the OP, and my question was a requested follow-on.
However, it seems my country practice methods will have to do for another decade, until I look at this next time.

Thanks for providing me with insight to the current state of technology, clearly too much for now.

I guess I should withdraw the question, given the consistent comments about this being an unpaid forum etc. My apologies for overstepping the mark; I will try to do better next time - especially with examples. I did try, but was unable to make myself understood, I fear.

See you in the next question.

This wasn't answered - it should be deleted.
Anthony, I missed this discussion (from your last) as I've been away.
On the whole, I suppose the comments made here are accurate to a T with regard to utilising Excel. I am not acquainted with using the other tools you refer to; however, I know for a fact that your requirements can easily (and I use the word advisedly) be met by a .NET program (or Java etc.) similar to the one I recommended in your previous question.

You can PM me if this would be a consideration and we can take it from there; or indeed, if you ask a question in a relevant forum, I'd pick it up - as would the multitude of experts in the .NET field.
I think the general non-familiarity with the other tools I mention - which are all mainstream Excel as far as I am concerned, and what I was looking for experienced responses about - has led to responses based on programming languages, which is absolutely not what I was looking for. In addition, my choice of topics may have been mistaken in that regard, attracting full-blown programmers instead of Excel super-users, for example.

Someone even asks for examples, which are included in the OP, as is an example of the output table headings.

Asking for the voice of experience with the tools is not at all the same as asking for someone to write a program.

I also think that many have trouble with CSV files in Access - often ones created in Excel - and this leads to a jaundiced view of the issue. My CSV files are from banks, as is stated, not from Excel.

In any case I think this is best abandoned and we all move on to another day.

Very well then. Abandoned it is!
Should be deleted.
lost the will to live.
I think it should be put up as an example of how not to ask questions - amongst other things.
You are right. Because you could say "How do I write a General Ledger program; it needs to 1. xxx, 2. xxx, and so on" and then say you've asked a specific question, but it's not the sort of question that you can ask here.

If your question was is there some common trick or feature of, say, msquery that people use to do what you asked then the answer is "No."

As far as I can see you've figured out how to do everything you want to do so my question to you is "Why aren't you doing it?"  When you are in the actual process of making your solution, if you run into a wall and can't figure out how to do something particular then that is exactly the sort of question you'll get a good response from here.

Best of luck

Yes, I was hoping for some broad suggestions based on experience other than my own.