Solved

Joining numpy arrays in Python

Posted on 2010-09-21
4
1,233 Views
Last Modified: 2012-05-10
Greetings all:

Imagine I have two 2-dim numpy arrays:

xs = array([
    [59, 24.5, 25.5, 26.5, 4],
    [1727, 21.5, 22.5, 23.5, 9],
    [1840, 21.5, 22.5, 23.5, 9],
    [2252, 22.0, 23.0, 24.0, 4],
    [2445, 22.0, 23.0, 24.0, 4]
])

[
    [x11, x12, x13, x14, x15],
    [x21, x22, x23, x24, x25],
    [x31, x32, x33, x34, x35],
    [x41, x42, x43, x44, x45],
    [x51, x52, x53, x54, x55]
]

ys= array([
    [159, 124.5, 125.5, 126.5],
    [1227, 121.5, 122.5, 123.5],
    [1340, 121.5, 122.5, 123.5],
    [1452, 122.0, 123.0, 124.0],
    [2945, 122.0, 123.0, 124.0]
])

[
    [y11, y12, y13, y14, y15],
    [y21, y22, y23, y24, y25],
    [y31, y32, y33, y34, y35],
    [y41, y42, y43, y44, y45],
    [y51, y52, y53, y54, y55]
]

Assume the first column of each matrix is a timestamp which can be compared to each other (e.g. TS0 < TS1 == True). I'm using the minutes and seconds only for sake of brevety (e.g. 59 is 59 seconds after the hour/minute; 1727 is 17 minutes 27 seconds after the hour, etc)

The return value must be a third matrix, nm, where the first timestamp column will survive and the remaining columns will be joined in one of two ways:

1. First way is to join the arrays if either element has changed.

So the result should look like this:

([
    [159, 24.5, 25.5, 26.5, 4, 124.5, 125.5, 126.5],
    [1227, 24.5, 25.5, 26.5, 4, 121.5, 122.5, 123.5],
    [1452, 24.5, 25.5, 26.5, 4, 122.0, 123.0, 124.0],
    [1727, 21.5, 22.5, 23.5, 9, 122.0, 123.0, 124.0],
    [2252, 22.0, 23.0, 24.0, 4, 122.0, 123.0, 124.0]
])
[
    [y11, x12, x13, x14, x15, y12, y13, y14], # element in y changes; x does not
    [y21, x12, x13, x14, x15, y22, y23, y24], # element in y changes; x does not
    [y41, x12, x13, x14, x15, y42, y43, y44], # element in y changes; x does not
    [x21, x22, x23, x24, x25, y42, y43, y44], # element in y does not change; x does
    [x41, x42, x43, x44, x45, y42, y43, y44], # element in y does not change; x does
]

2. Second way is to join the arrays if both elements have changed.

So the result should look like this:

([
    [159, 24.5, 25.5, 26.5, 4, 124.5, 125.5],
    [1727, 21.5, 22.5, 23.5, 9, 122.0, 123.0]
])

([
    [y11, x12, x13, x14, x15, y12, y13, y14],
    [x21, x22, x23, x24, x25, y42, y43, y44]
])

I've made far too many attempts to post code here, but essentially I've been able to return a matrix to (kind of) match case 1:

[
    [159 124.5 125.5 126.5],
    [1227 121.5 122.5 123.5],
    [1452 122.0 123.0 124.0],
    [1727 21.5 22.5 23.5 9],
    [2252 22.0 23.0 24.0 4]
]

but this is not complete of course because it does not actually "join the arrays".
0
Comment
Question by:strimp101
[X]
Welcome to Experts Exchange

Add your voice to the tech community where 5M+ people just like you are talking about what matters.

  • Help others & share knowledge
  • Earn cash & points
  • Learn & ask questions
  • 2
  • 2
4 Comments
 
LVL 9

Expert Comment

by:zaghaghi
ID: 33731924
I think that it's not well defined! Please make it clear.
0
 

Author Comment

by:strimp101
ID: 33734148
Thanks for the response - this is a bit difficult for me to explain, but here goes:

The arrays represent price series for derivative (credit default swaps) securities. The goal of the project is to use the joined matrix to run robust regressions (OLS rejection, winsor, ORD, etc.).

The securities trade infrequenty without a "closing price" like stocks so we cannot compare them arbitrarily by lining them in up in series. So the two methods I outlined are ways to join the series in ways suitable for regression analysis.

For sake of argument, use the first two columns of the matricies - column 1 representing a timestamp and column 2 representing the price.

X starts at price 24.5 at time 59 (first row) and does not change until it changes to price 21.5 at time 1727 (second row).

In the meantime, price Y changes from 124.5 at time 159 (first row) to 121.5 at time 1227 (second row).

So at this point, nm should have two rows:

159, 24.5, 25.5, 26.5, 4, 124.5, 125.5, 126.5;
1227, 24.5, 25.5, 26.5, 4, 121.5, 122.5, 123.5;

Column 1 is the timestamp from matrix y.
Columns 2 - 5 are the values from matrix x (note the value is the same in both rows because the value did not change between time 159 and 1227).
Columns 6 - 8 are the values from matrix y (note the values differ in each row because the y values change at time 159 and 1227).

Y does not change again until time 1452 to price 122.

So Y changes three total times.

While Y is changing from time 159, x is not changing.

So if we were to put time times in order (column 1 of both matricies), the times would be:

59 from x < the x value from this timestamp persists at each y value change below until at least the next x change at time 1840
159 from y
1227 from y
1340 from y
1452 from y
1840 from x
2252 from x
2445 from x
2945 from y

The only time we record a price is when one or the other change from the previous value.
0
 
LVL 9

Accepted Solution

by:
zaghaghi earned 500 total points
ID: 33736545
Hi,

After an hour, i think that i understand what you want and you can find out the script needed for the first case.

#import what we need for array operations
from numpy import array,  concatenate,  insert,  delete,  allclose

#first array
xs = array([
    [59, 24.5, 25.5, 26.5, 4],
    [1727, 21.5, 22.5, 23.5, 9],
    [1840, 21.5, 22.5, 23.5, 9],
    [2252, 22.0, 23.0, 24.0, 4],
    [2445, 22.0, 23.0, 24.0, 4]
])

#second array
ys = array([
    [159, 124.5, 125.5, 126.5],
    [1227, 121.5, 122.5, 123.5], 
    [1340, 121.5, 122.5, 123.5],
    [1452, 122.0, 123.0, 124.0], 
    [2945, 122.0, 123.0, 124.0]
])

# we walk on time, start from min timestamp to max timestamp
#x array contains last xs[xi] that we pass its time
x = array([])
#x array contains last ys[yi] that we pass its time
y = array([])
#contains un filtered result, i.e. contanins additional rows that are not needed
result = array([[0., 0., 0., 0., 0., 0., 0., 0.]])
#xi and yi are indices for xs and xy respectively
xi = 0
yi = 0
#here we start, until one of xs or ys array completely  checked
while xi < len(xs) and yi < len(ys):
    #if time stamp of xs is smaller than ys, then we update result array by inserting a new row to it containing timestamp, x and y
    if xs[xi][0] < ys[yi][0]:
        x = xs[xi][1:]
        #this if checks that y is not empty(i.e for the first time)
        if len(y) > 0:
            # make new row with smallest timestamp x and y
            r = concatenate(([xs[xi][0]] ,  x ,  y))
            #insert the new row to the result array
            result = insert(result, len(result),   r,  axis=0)
        #increase xi by one
        xi = xi + 1
    else:
        y = ys[yi][1:]
        #this if checks that x is not empty(i.e for the first time)
        if len(x) > 0:
            # make new row with smallest timestamp x and y
            r = concatenate(([ys[yi][0]] ,  x ,  y))
            #insert the new row to the result array
            result = insert(result, len(result),   r,  axis=0)
        #increase yi by one
        yi = yi + 1

#if xs does not completely checked, this while help us to check remaining items in it
while xi < len(xs):
        x = xs[xi][1:]
        r = concatenate(([xs[xi][0]] ,  x ,  y))
        result = insert(result, len(result),   r,  axis=0)
        xi = xi + 1
#and if ys does not completely checked, this while help us to check remaining items in it, note that only one of these two whiles executed!
while yi < len(ys):
        y = ys[yi][1:]
        r = concatenate(([ys[yi][0]] ,  x ,  y))
        result = insert(result, len(result),   r,  axis=0)
        yi = yi + 1

#remove row number i if its x and y is the same as row number i-1.
for i in reversed(range(1, len(result))):
    if allclose(result[i][1:], result[i-1][1:]):
        result = delete(result,  i,  axis=0)

#delete the first row i.e. [0.,0.,0.,0.....
result = delete(result,  0,  axis=0)

print(result)

Open in new window

0
 

Author Closing Comment

by:strimp101
ID: 33771684
Nice.
0

Featured Post

Free Tool: Subnet Calculator

The subnet calculator helps you design networks by taking an IP address and network mask and returning information such as network, broadcast address, and host range.

One of a set of tools we're offering as a way of saying thank you for being a part of the community.

Question has a verified solution.

If you are experiencing a similar issue, please ask a related question

A short article about a problem I had getting the GPS LocationListener working.
In this post we will learn how to connect and configure Android Device (Smartphone etc.) with Android Studio. After that we will run a simple Hello World Program.
Learn the basics of strings in Python: declaration, operations, indices, and slicing. Strings are declared with quotations; for example: s = "string": Strings are immutable.: Strings may be concatenated or multiplied using the addition and multiplic…

739 members asked questions and received personalized solutions in the past 7 days.

Join the community of 500,000 technology professionals and ask your questions.

Join & Ask a Question