Lambda function to merge several CSVs with new column names

Hi,

I'm capturing AWS CloudWatch metrics for our EC2 and RDS servers in CSV format using two Lambda functions.
Every day at 10 PM, the Lambda functions run and create a folder in an S3 bucket, for example:

Bucket/EC2/2018-02-26 14:29/APP-1CPUUtilization.csv
So each new folder created every day will have several CSV files such as APP-1CPUUtilization.csv, APP-1MEMUtilization.csv, APP-2CPUUtilization.csv, APP-2MEMUtilization.csv and so on.
If I add a new server, say APP-10, two files will automatically get created: APP-10CPUUtilization.csv and APP-10MEMUtilization.csv.

Each of these files has two columns: | Date | CPU Utilization % | for a CPUUtilization file and | Date | MEM Utilization % | for a MEMUtilization file.
The date format in the Date column (for example 2018-02-28 07:36:00+00:00), the number of rows (289 per day) and the column names are the same for every file. The file name is what differentiates each file.

So I want to create one single file for all the instances per day, with column names | Date | APP1-CPU Utilization % | APP1-MEM Utilization % | APP2-CPU Utilization % | APP2-MEM Utilization % | ... and so on.
The column name (e.g. APP1-CPU Utilization %) should be derived from the source file name.

Now let's say I add a new server called APP-10 for 2 days, so 2 new files would be created for each of those days, i.e. APP-10CPUUtilization.csv and APP-10MEMUtilization.csv. The Lambda which runs at 11:00 PM to concatenate all the files into one should then add new columns to the file (for those 2 days): | Date | APP1-CPU Utilization % | APP1-MEM Utilization % | APP2-CPU Utilization % | APP2-MEM Utilization % | APP10-CPU Utilization % | APP10-MEM Utilization % |
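
Roughly, this is the kind of merge I have in mind. Just a rough sketch using pandas (I'm not sure pandas is even an option inside the Lambda), not tested:

import glob
import os
import re

import pandas as pd  # assuming pandas could be packaged with the Lambda

FNAME_RE = re.compile(r"(?P<server>.+?)(?P<metric>CPU|MEM)Utilization\.csv$")

def merge_day(folder):
    """Merge all per-server CSVs in one day's folder into a single wide table."""
    merged = None
    for path in sorted(glob.glob(os.path.join(folder, "*.csv"))):
        m = FNAME_RE.match(os.path.basename(path))
        if not m:
            continue
        # e.g. 'APP-1' + 'CPU' -> 'APP-1-CPU Utilization %' (naming scheme illustrative)
        col = "%s-%s Utilization %%" % (m.group("server"), m.group("metric"))
        df = pd.read_csv(path)
        df = df.rename(columns={df.columns[1]: col})[["Date", col]]
        merged = df if merged is None else merged.merge(df, on="Date", how="outer")
    return merged.sort_values("Date") if merged is not None else None

# merge_day("Bucket/EC2/2018-02-26 14:29").to_csv("all-servers.csv", index=False)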

Is this possible, please?
End goal is to ingest this file into Redshift table.

Thanks

Downloads.zip
Clement P asked:

Clement P (Author) commented:
Hi all,

Any inputs on this, please?

Thanks,
gelonida commented:
My suggestion is to split the problem into two parts.

First, scan the directory and try to identify files that belong together.

I noticed that, at least in your zip file, some of the files have a basename that ends with ' (1).csv', ' (12).csv', etc.

Will this happen in your real directory as well?

If yes, can you assure that there are not multiple files for the same server? And if there can be, shall I take the one with the biggest number?

Here is the code to create an inventory: a dict with one entry per server containing the file name and the type (MEM or CPU). If multiple files exist for the same server and the same csv_type, the one with the highest counter is kept.

import os
import re

FNAME_REX = re.compile(r"""
    (?P<servername>.*)        # server name
    (?P<csv_type>CPU|MEM)     # file type (CPU or MEM)
    Utilization\s*            # the word 'Utilization', optionally followed by whitespace
    (?:|\((?P<counter>\d+)\)) # optional download counter, e.g. '(1)'
    \.csv                     # csv suffix
    $                         # end of name
    """, re.X)


def make_inventory(basedir):
    inventory = {}
    filenames = sorted(os.listdir(basedir))

    # go through all file names and keep the matching ones.
    for fname in filenames:
        #print("FN: %s" % fname)
        match = FNAME_REX.match(fname)
        if match:
            vals = match.groupdict()
            servername = vals['servername']
            csv_type = vals['csv_type']
            counter = int(vals['counter'] or '0')
            print("%s matches (server: %s type: %s"
                  % (fname, servername, csv_type))
            entry = inventory.get(servername) or {}
            inventory[servername] = entry
            # if multiple for same server / type, take the most recent one
            #   (the one with the highest counter)
            if csv_type in entry:
                if counter < entry[csv_type]['counter']:
                    continue
            entry[csv_type] = { 'fname': fname, 'counter':counter }

    # keep only the ones for which you have both csv_types
    inventory = dict((key, val) for (key, val) in inventory.items() if len(val) == 2)

    return inventory



The result of make_inventory() for your example files would be:

{
 "APP-FILE-SERVER": {
  "CPU": {
   "counter": 1,
   "fname": "APP-FILE-SERVERCPUUtilization (1).csv"
  },
  "MEM": {
   "counter": 2,
   "fname": "APP-FILE-SERVERMEMUtilization (2).csv"
  }
 },
 "APP-SERVER-3": {
  "CPU": {
   "counter": 0,
   "fname": "APP-SERVER-3CPUUtilization.csv"
  },
  "MEM": {
   "counter": 0,
   "fname": "APP-SERVER-3MEMUtilization.csv"
  }
 },
 "APP-DF": {
  "CPU": {
   "counter": 12,
   "fname": "APP-DFCPUUtilization (12).csv"
  },
  "MEM": {
   "counter": 7,
   "fname": "APP-DFMEMUtilization (7).csv"
  }
 }
}


gelonida commented:
For combining the CSV files I have a few questions:

What should happen if one file contains more or different dates than the other one?

Are the corresponding times really always identical, or could they occasionally (high server load or something) be off by one or more seconds?
If this could happen, it would be better to strip off the seconds and always set them to :00, even if they were :xx in the file.

I noticed that the lines within a CSV file are not ordered by time, but seem to be randomly ordered.
So how should the result file be ordered? By time? Identical to the MEM file order? Identical to the CPU file order?

Can these files be huge? By huge I mean: can Python read and parse the entire file in RAM?
I'm asking because I know of some scripts that tried to analyze log files and broke when the log file was not rotated and contained a few years of data.
Clement P (Author) commented:
Hi Gelonida,

Thank you for your inputs.
Yes, all files get a base name such as APP-DF or APP-FILE-SERVER, followed by the metric captured.
The file base name is the server name.
Each day there would be two files for each server, for example APP-DFMEMUtilization and APP-DFCPUUtilization.
If I spin up a new server called APP-XYZ, then I would have two new files added to the folder: APP-XYZMEMUtilization and APP-XYZCPUUtilization.

You are right, and apologies, I was wrong.
I just realized that the dates aren't the same at all times, i.e. for a given day, one server may have 10 entries for 3rd June in the format 2018-06-03 13:42:00+00:00 while another may have only 8, and they may also differ by seconds.
And they aren't in order either.
I'm now wondering; it's getting complicated :(

Also, to answer the file size question: the average file size right now is 15-20 KB, but it went up to 60 KB in the past.
Pic1.JPG
Pic2.JPG
gelonida commented:
Hi Clement,

OK, so it's possible that one file contains more dates than another.
If files are 80 KB or less, this is not a memory issue; I was wondering more whether they could reach a size of one or more gigabytes. So size won't be an issue.

What behaviour would you expect if dates exist in one file but not in the other?
Have one value and a '-' if only the first exists for a date?
Have a '-' and the value if only the second exists?
Have both values if both exist?

Do you want the entries to be sorted by date in the resulting CSV? (Perhaps the best option.)

This complicates things a little but is no real issue.
What's most important is what behaviour you would expect for the kind of analysis / report that you want to make.
So tell me, and I'll start setting up a code snippet.

Did you already check the code that I posted? So far it's not creating any files, but it should create a dict that knows which files belong together and which server names exist.


About the times not being identical:

Would it be OK if I round to the minute and group values together that have the same minute?

Again, it's most important for you to look at the real data and all exceptions and know what you would expect as output.

In my opinion, one of the most difficult parts of coding is identifying all possible situations / exceptions and deciding what you would expect in these situations.
Clement P (Author) commented:
Hi Gelonida,

Firstly, thank you for taking the time to help. I really appreciate it.

Please find my answers to your questions below.

What behaviour would you expect if dates exist in one file but not in the other?
[CP] Have both values if both exist, please.

Do you want the entries to be sorted by date in the resulting CSV?
[CP] Yes please.

What's most important is what behaviour you would expect for the kind of analysis / report that you want to make.
[CP] Generally I would be checking the percentage of RAM and CPU usage on a per-month basis.

Would it be OK if I round to the minute and group values together that have the same minute?
[CP] Actually the data points are collected every 5 minutes, like this:
2018-06-05 21:57:00+00:00
2018-06-05 21:52:00+00:00
2018-06-05 21:47:00+00:00
2018-06-05 21:42:00+00:00

So I was wondering whether it would help if we round it up to an hour.
For example:
2018-06-05 21:00
2018-06-05 22:00

Thanks,
Clement
gelonida commented:
Do you want the entries to be sorted by date in the resulting CSV?
[CP] Yes please.

That's easy to do and not ambiguous.

[CP] Actually the data points are collected every 5 minutes, like this.
Then the results should be rounded down to 5 minutes (or rounded to one minute if both files have the same offsets).

If I round to one hour, I would have multiple values for the same time.
In that case, what should I display?
The average, the min value, the max value, all of them, or just the first value of each hour?
Quite a few options, and as often it depends on what you want to see.
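
To show what I mean by "rounding down", here is a tiny sketch (the function name is just illustrative, it is not part of the inventory code above); it drops the '+00:00' offset and rounds a timestamp down to the nearest 5 minutes:

from datetime import datetime

def round_down_5min(ts):
    # '2018-06-05 21:57:23+00:00' -> datetime(2018, 6, 5, 21, 55)
    # the '+00:00' offset is assumed to always be UTC and is simply dropped
    dt = datetime.strptime(ts[:19], "%Y-%m-%d %H:%M:%S")
    return dt.replace(minute=dt.minute - dt.minute % 5, second=0)

print(round_down_5min("2018-06-05 21:57:23+00:00"))  # 2018-06-05 21:55:00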

Are the date values in UTC, or might the time offset change due to daylight saving changes in winter and summer?

In other words: will the time offset always be '+00:00', so that I can just ignore it?


For combining the CSV files:
What behaviour would you expect if dates exist in one file but not in the other?
[CP] Have both values if both exist.

OK: so if for one time you have, for example, only CPU but no RAM, you want no entry at all in the resulting CSV file?

You can look at the following solution, which combines the contents of two CSV files and groups them together by a common field:

question 29088374
That code keeps all rows and completes the missing ones with an N/A value.

The code that you need is similar. I copied the code from the above answer and adapted it by hard-coding the field names, since you know them.


You create a dict to store the results.
For each row in file1 (CPU), you use 'Date' as key and add its values to the dict.

For each row in file2 (RAM), you use 'Date' as key and complement the entry in the dict if it exists; otherwise you do nothing, as you want to ignore entries for which you don't have both values.

OK, here is a function reading a CSV file into a dict ('Date' will be the key):

from csv import DictReader, DictWriter

def read_file_to_dict(fname):
    """ reads a csv file into a dict and returns it ('Date' is the key)

        The 'Date' field will be rounded down to the full minute.
        The column without a title (the one containing the word 'percent')
        will be removed.

        Each file must contain at least one line with
        the names of the columns.
    """
    rows = {}

    with open(fname) as fin:
        reader = DictReader(fin)
        for row in reader:
            # round the date down to the full minute
            row['Date'] = row['Date'][:17] + "00+00:00"
            # remove the untitled 'percent' column (DictReader stores extra values under the key None)
            row.pop(None, None)
            rows[row['Date']] = row

    return rows



The next function will read two CSV files and combine them.
I hard-coded the field names:

def combine_csv(fname1, fname2, rslt_fname):
    """ reads rows from two csv files and combines them by 
        the Date Field
    """

    rslt_fields = ('Date', 'Mem Used (%)', 'CPU Used (%)')

    file1_rows = read_file_to_dict(fname1)
    file2_rows = read_file_to_dict(fname2)

    # These are the fields of the destination file
    print("fields to write to result file", rslt_fields)

    rows = {} # dict storing the combined result

    # process all rows from file1 and keep only rows, that
    # exist also in file2
    for key, row in file1_rows.items():
        if key in file2_rows: # key exists in both files?
            row2 = dict(file2_rows[key])
            print("merge", row, "and", row2)
            row.update(row2) 
            rows[key] = row

    # create result file ordered by 'Date' (lexicographic sort works for this date format)
    with open(rslt_fname, "w", newline="") as fout:
        writer = DictWriter(fout, fieldnames=rslt_fields)
        writer.writeheader()
        for key, row in sorted(rows.items()):
            print("writing", row)
            writer.writerow(row)




Now combining everything:
1.) Read the list of files in a directory and group them in the inventory dict.
2.) Iterate through the inventory and combine the matching CPU / MEM files into a combined file.

basedir = '.'
inventory = make_inventory(basedir)

# next line just for debugging
import json; print(json.dumps(inventory, indent=1))

for app, entry in inventory.items():
    combine_csv(entry['MEM']['fname'],
                entry['CPU']['fname'],
                app + "combined.csv",
                )



Full file here:
csv_join.py
Clement P (Author) commented:
Hi Gelonida,

Thank you for your time and efforts to help me.

Following are some more facts about the files which I inadvertently forgot to share with you:

1) All files have a 3rd column with NO column name; its value is actually a unit like "Percent" or "Count". Can this column and its data be ignored completely when consolidating the files, please?
2) In some files, after the last row (i.e. the last date entry), there will be 3 or 4 more rows like "Datapoints", "ResponseMetadata" and "Label". Can we ignore them as well, please? I've attached a CSV file with some sample data for your reference.
3) Lastly, some files are created for the day but, because of some issues, are blank, with no columns or data except the rows "Datapoints", "ResponseMetadata" and "Label". Will these types of files cause a problem? (A rough sketch of what I mean by skipping these rows follows below.)
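
Just to make the point concrete, a rough sketch of how those quirks might be filtered out when a file is read (the function name is made up; this is not from the script above):

import csv
import re

# rows that start with a date like '2018-06-03 13:42:00+00:00'
DATE_ROW = re.compile(r"^\d{4}-\d{2}-\d{2} \d{2}:\d{2}")

def read_metric_rows(fname):
    # keep only Date + value; drop the unnamed 3rd column ('Percent'/'Count')
    # and skip the trailing 'Datapoints'/'ResponseMetadata'/'Label' rows
    rows = []
    with open(fname) as fin:
        for row in csv.reader(fin):
            if row and DATE_ROW.match(row[0]):
                rows.append(row[:2])
    return rows  # an empty list for blank / metadata-only files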

Following are my answers to your questions above:

Then the results should be rounded down to 5 minutes (or rounded to one minute if both files have the same offsets).
If I round to one hour, I would have multiple values for the same time.
In that case, what should I display?

[CP] - Whatever you think is appropriate to round to, please. Or can we just add all the UNIQUE dates from all the files to the consolidated file in sorted order, together with the values for those dates? Something like the Sample.csv file I attached.
Wherever a value is missing for a particular date, can we replace the BLANK values with 0 rather than N/A, please?

[CP] - Any time we are rounding the dates, I would like to show the average value, please.

In other words, will the time offset always be '+00:00', so that I can just ignore it?
[CP] - Yes, we can ignore it, please.

I'm really sorry for adding more complexities.

In the attached file, I'm just trying to lay out how my source files will look, what they contain, and what my destination file should look like.

Thanks,
Clement P (Author) commented:
Attachment with the same data for reference:
Sample.xlsx
gelonida commented:
Did you already try the script that I posted?

It's not doing everything you need, but I'd like to know whether I'm going in the right direction.

Clement P (Author) commented:
Hi Gelonida,

Apologies for this delayed response.
I had to go on some scheduled management training, so I passed the script you gave me on to a Dev colleague of mine, who created a final Lambda function which is working great.
Following is the code he used. I really appreciate all your time and willingness to help.
You were the best, and I would love to challenge you with something new that I need help with.
Sorry, I'm not that familiar with the Experts Exchange ticket system, hence I selected your comment as the best answer, which closes the case.

Regards,
 

import boto3
import csv
from io import StringIO
from datetime import datetime, timedelta
from dateutil import tz
import os.path
import re

src_bucket_name = 'cloudwatch-metrics-store'
src_key_prefix = 'CloudFront/, EC2/, ELB/, RDS/, S3/'
src_key_suffix = '.csv'
dest_bucket_name = src_bucket_name
dest_key_prefix = 'Processed/all-'
dest_key_suffix = '.csv'

from_tz = tz.gettz('UTC')
to_tz = tz.gettz('Europe/London')
extra_cols = ['CPUUtilization', 'MEMUtilization', 'DBConnections', 'MEMFree']

regex = re.compile(r"^\d{4}-\d{2}-\d{2} \d{2}:\d{2}")
#ts = '2018-06-09'
theday = datetime.now() - timedelta(days=1)
ts = "{:%Y-%m-%d}".format(theday)

s3_client = boto3.client('s3')

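# Parse a timestamp like '2018-06-03 13:42:00+00:00': the numeric offset is applied
# manually and the result is returned as an aware datetime in UTC.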
def dt_parse(t):
    ret = datetime.strptime(t[0:19], '%Y-%m-%d %H:%M:%S')
    if t[19] == '+':
        ret -= timedelta(hours = int(t[20:22]), minutes = int(t[23:]))
    elif t[19] == '-':
        ret += timedelta(hours = int(t[20:22]), minutes = int(t[23:]))
    return ret.replace(tzinfo=from_tz)

def dt_format(dt):
    local = dt.astimezone(to_tz)
    return '{:%Y-%m-%d %H:%M:%S}'.format(local)

def get_objects(bucket, prefix='', suffix=''):
    paginator = s3_client.get_paginator('list_objects_v2')
    for result in paginator.paginate(Bucket=bucket, Prefix=prefix):
        for o in result.get('Contents', []):
            key = o.get('Key')
            if key.endswith(suffix):
                if not len(extra_cols):
                    yield {'key': key}
                else:
                    basename = os.path.basename(key).split('.')[0]
                    for ecol in extra_cols:
                        if basename.endswith(ecol):
                            yield {'key': key, 'instance_name': basename.replace(ecol, ''), 'type': ecol}
                            break
               
def get_object(bucket, obj):
    buf = StringIO()
    result = s3_client.get_object(Bucket=bucket, Key=obj['key'])
    content = result.get('Body').read().rstrip().decode('utf-8')

    lines = content.splitlines()
    reader = csv.reader(lines, delimiter=',')
    for row in reader:
        dt = row[0]
        if regex.match(dt):
            tmp_row = []
            tmp_row.append(dt_format(dt_parse(dt)))
            tmp_row.append(row[1])
           
            if 'instance_name' in obj:
                tmp_row.append(obj['instance_name'])
            else:
                tmp_row.append('')  # join() below needs strings, not None

            if 'type' in obj:
                tmp_row.append(obj['type'])
            else:
                tmp_row.append('')
           
            buf.write(','.join(tmp_row))
            buf.write('\n')
       
    return buf.getvalue()
   
def put_object(bucket, key, data):
    s3_client.put_object(Bucket=bucket, Key=key, Body=data)
    return

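# Collect yesterday's per-service CSVs from S3, keep only rows that start with a date,
# convert the timestamp to Europe/London time, append the instance name and metric type,
# and write one combined CSV back to S3 under the 'Processed/' prefix.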
def lambda_handler(event, context):
    buf = StringIO()
    prefixes = src_key_prefix.split(',')
    for prefix in prefixes:
        objs = get_objects(src_bucket_name, prefix.strip() + ts, src_key_suffix)
        for obj in objs:
            content = get_object(src_bucket_name, obj).strip()
            if content:
                buf.write(content)
                buf.write('\n')
   
    data = buf.getvalue().rstrip('\n').strip()
    if data:
        put_object(dest_bucket_name, dest_key_prefix + ts + dest_key_suffix, data)

    return "OK"