Want to protect your cyber security and still get fast solutions? Ask a secure question today.Go Premium

x
  • Status: Solved
  • Priority: Medium
  • Security: Public
  • Views: 1002
  • Last Modified:

Extracting data from csv file with multiple header rows

Hi all, I'm just looking for a jolt in the right direction in regards to extracting data from a csv file where there are multiple header lines. Python is the only language available to me at this stage. Below is an example of a source csv file from which data needs to be extracted:

Date                          INTERCONNECTORID              Run Number                Metred Flow
1/28/2009 0:30               N-Q-MNSP1                               1                             -70
1/28/2009 0:30               NSW1-QLD1                              1                             190
1/28/2009 0:30               T-V-MNSP1                                1                              34
1/28/2009 0:30               V-S-MNSP1                               1                              234
1/28/2009 0:30               V-SA                                         1                              36
1/28/2009 0:30               VIC1-NSW1                                                             78
Date                               RegionID                            Period Number           Total Demand
1/28/2009 0:30               Vic                                               1                          5436
1/28/2009 0:30               NSW                                            1                          45785
1/28/2009 0:30               SA                                               1                          32457
1/28/2009 0:30               QLD                                             1                          546
1/28/2009 0:30               WA                                               1                         57844

And the format the resulting csv needs to be in:

Date   N-QMNSP1   NSW1-QLD1   T-V-MNSP1   V-S-MNSP1   V-SA   VIC1-NSW1    Vic Demand    NSW Demand    SA Demand    QLD Demand      WA Demand
1/28/2009 0:30 -70     190               34               234           36         78                  5436              45785            32457                546                  57844

Etc...

So basically the column headers need to be pivoted to ensure the headings line up across the top of the sheet. A few columns must also be skipped in the extraction.

Now I'm not necessarily looking for a solution to this problem - I'm treating this as a learning exercise. I'd just like to know where to start in terms of Python. My guess is that I'll need to read up on regular expressions and the CSV module.

Thanks in advance.

Paul

0
paulkramer
Asked:
paulkramer
  • 4
  • 4
1 Solution
 
Roger BaklundCommented:
This does not look like a CSV file. A CSV file has a field separator, usually a comma (CSV=Comma Separated Values).

Can you post a larger set of test records? Preferably as an attachment.

Do you want the output to be a CSV file, or space delimited rows like the one you show above?
0
 
paulkramerAuthor Commented:
Sorry I should have included a better example. The source file is a csv file with a comma delimiter. I've attached one of the source files. Additionally, I've attached a csv file which I would like the resultant csv to be based around.

Public

Thanks,

Paul
PUBLIC-TRADINGIS-200901280030-00.txt
results.txt
0
 
ramromconsultant Commented:
It is very hard for me to guess the relationship between the source records and the result file. Please elaborate.
0
Industry Leaders: We Want Your Opinion!

We value your feedback.

Take our survey and automatically be enter to win anyone of the following:
Yeti Cooler, Amazon eGift Card, and Movie eGift Card!

 
paulkramerAuthor Commented:
The files are just an example of the functional csv's. Please refer to the original post for a better understanding of what I'm aiming for.
0
 
Roger BaklundCommented:
Like ramrom said, it's hard to understand the relationships between input and output. For instance, I did not find anything related to "snowy" in the input. I found some columns though, this should help getting you started:
import csv
 
infile = 'PUBLIC-TRADINGIS-200901280030-00.txt'
outfile = 'PUBLIC-TRADINGIS-result.csv'
 
reader = csv.reader(open(infile))
 
headers = ['I','Date','Time',' Trading Period',
  'NSW disp Pool Price ($)','NSW disp Total Demand (MW)',
  'QLD disp Pool Price ($)','QLD disp Total Demand (MW)',
  'SA disp Pool Price ($)','SA disp Total Demand (MW)',
  'Snowy disp Pool Price ($)','Snowy disp Total Demand (MW)',
  'VIC disp Pool Price ($)','VIC disp Total Demand (MW)',
  'TAS disp Pool Price ($)','TAS disp Total Demand (MW)',
  'VIC/SA disp Flow (mw)','VIC/Snowy disp Flow (mw)',
  'Snowy/NSW disp Flow (mw)','DL N->Q disp Flow (mw)',
  'NSW/QLD disp Flow (mw)','Murraylink disp Flow (mw)',
  'Basslink disp Flow (mw)']
 
m = {}  # mapping
 
for i in reader:
    if i[0] == 'D':  # only look at data rows
        if not m:    # only first iteration, same on all data rows
            date,time = i[4].split()
            m['Date'] = date
            m['Time'] = time 
            m['Trading Period'] = i[7]
        m[i[2]+'-'+i[6]] = i[8:]   # col 2&6 identifies the row, data is in index 8 and up
 
result = ['D',m['Date'],m['Time'],m['Trading Period'],
  m['PRICE-NSW1'][0],m['REGIONSUM-NSW1'][0],
  m['PRICE-QLD1'][0],m['REGIONSUM-QLD1'][0],
  m['PRICE-SA1'][0], m['REGIONSUM-SA1'][0],
  m['PRICE-TAS1'][0],m['REGIONSUM-TAS1'][0],
  m['PRICE-VIC1'][0],m['REGIONSUM-VIC1'][0]
  ]
 
writer = csv.writer(open(outfile, 'w'), delimiter='\t')
writer.writerow(headers)
writer.writerow(result)

Open in new window

0
 
paulkramerAuthor Commented:
The snowy header is complicated to explain, but yes that code does exactly what is needed. Understanding the code is another question though! What purpose does the mapping list m={} achieve?

To be honest, I'm a bit stumped as to what the following chunk of code is achieving:

for i in reader:
    if i[0] == 'D':  # only look at data rows
        if not m:    # only first iteration, same on all data rows
            date,time = i[4].split()
            m['Date'] = date
            m['Time'] = time
            m['Trading Period'] = i[7]
        m[i[2]+'-'+i[6]] = i[8:]   # col 2&6 identifies the row, data is in index 8 and up

I'm assuming "i" is a variable assigned to each row being iterated through. Would  =='D' skip empty space within a csv file?

Apologies for my ignorance: I haven't been coding for very long in Python and am still frequently referring to the inbuilt docs.

Thanks for the solution.

Paul
0
 
Roger BaklundCommented:
"for i in reader" is the main loop, reading one row from the input for each iteration. "i" represent one input row, yes. The test if i[0]=='D' is to make sure you only consider data rows, i[0] is 'I' for header rows and 'C' for comments (first & last row). The m variable is a dictionary, it is used to store the data temporarily, while it is being read. Then it is used in the result list, for instance m['PRICE-NSW1'][0] stores the 9th column in the PRICE row with REGIONID=NSW1, m['PRICE-NSW1'][1] stores the 10th, m['PRICE-NSW1'][2] stores the 11th and so on. (index 8 == 9th column, this is zero based).
0
 
paulkramerAuthor Commented:
Okay that makes things a hell of a lot clearer. I still don;t quite understand the "if not m" statement. Is this checking whether m is empty?
0
 
Roger BaklundCommented:
Yes. A dictionary evaluates as true if it has any data, false if it is empty.
0

Featured Post

Concerto Cloud for Software Providers & ISVs

Can Concerto Cloud Services help you focus on evolving your application offerings, while delivering the best cloud experience to your customers? From DevOps to revenue models and customer support, the answer is yes!

Learn how Concerto can help you.

  • 4
  • 4
Tackle projects and never again get stuck behind a technical roadblock.
Join Now