asked on

Python lists join/matching

I have 2 lists:

[(1, 'a','xy'), (2, 'b','vw'),(3,'c','pq')]
[(1,'#$'), (3,'**')]

Can I do a join to pair the list elements based on the first value in the inside touples?

I want to pair:
(1, 'a','xy') with (1,'#$')
(2, 'b','vw') - has no match
(3,'c','pq') with (3,'**')]

What I want is to retrieve 2 rowsets from database and then pair them in python based on the value of 1 column.

farzanj

Do you want something like:

a = [(1, 'a','xy'), (2, 'b','vw'),(3,'c','pq')]
b = [(1,'#$'), (3,'**'),(5, 'adfs')]


def merge(l1, l2):
    l3 = []
    for i in l1:
        found = False
        for j in l2:
            if i[0] == j[0]:
                l3.append((i,j))
                found = True

        if not found:
            l3.append(i)
    return l3

c = merge(a, b)
print c

Open in new window

Zberteoc

ASKER

Something like that but I want to iterate in parallel if possible. I know that in both lists the first elements from tuples are ordered and unique. This way I have the minimum iterations in the end.

So if I get:

a-1 and b-1 - stop b if they are equal, don't go to the end
a-2 and b - 3 - stop b and go next a
a-3 and b-3 stop if they are equal

and so forth.

pepr

Based on your experience with SQL, try the following:

#!python3

# Your lists resemble SQL tables, and the wanted result resembles
# the LEFT OUTER JOIN of the tables.
lst1 = [(1, 'a','xy'), (2, 'b','vw'),(3,'c','pq')]
lst2 = [(1,'#$'), (3,'**')]

# The efficient JOIN in SQL requires using indexes. When simplified,
# the indexes help to use the tables as look-up tables, i.e. they help
# to find faster the row based on the key. In Python (an also in other
# programming languages, dictionary (or a hash table, or lookup table)
# were designed for similar purpose.
#
# Let's take the second list and construct a structure that resembles
# the SQL table with index. Here the "dictionary comprehension"
# construction is used. Basically, the lst is iterated to get the tuples,
# the first element of the tuple is used as a key, the whole tuple is
# used as a value. (You can modify the code to store only the rest
# of the tuple -- depending on the problem you want to solve.)
d2 = { row[0]: row for row in lst2 }
print(d2)

# Now you can search easily for the values based on the key.
print(1, '-->', d2[1])
print(3, '-->', d2[3])

# However, some keys may not be in the dictionary. Then you want
# to get some default value -- here the (None, None) tuple. For the
# purpose you can use the dict.get() method instead of the indexing
# operator dict[]. It does the same but it allows you to define some
# default value. It can be used, of course, also for the existing keys.
# This implements the OUTER feature of the JOIN.
default = (None, None)
print(2, '-->', d2.get(2, default))

# Now you can iterate through the first list (this is the LEFT table),
# get its first element that is used as a key for JOIN, and get the row
# from the second table (via index, hence the dictionary d2).
# Here the rows are printed one after the other...
print('-' * 50)
for t in lst1:
    print(t, d2.get(t[0], default))

# However, you can use the + operator to join two tuples (rows) and produce
# the new, joined one. Here the key is not repeated -- that is done
# via slicing [1:] (i.e. from the element on index 1 to the end).
print('-' * 50)
for t in lst1:
    print(t + d2.get(t[0], default)[1:])

Open in new window

It prints on my console:

c:\tmp\_Python\Zberteoc\Q_28237876>py a.py
{1: (1, '#$'), 3: (3, '**')}
1 --> (1, '#$')
3 --> (3, '**')
2 --> (None, None)
--------------------------------------------------
(1, 'a', 'xy') (1, '#$')
(2, 'b', 'vw') (None, None)
(3, 'c', 'pq') (3, '**')
--------------------------------------------------
(1, 'a', 'xy', '#$')
(2, 'b', 'vw', None)
(3, 'c', 'pq', '**')

Open in new window

Zberteoc

ASKER

That looks neat. I only have one issue. If you have relatively small lists this would be fine but what if you deal with 2 lists with millions of elements. To build 2 dictionaries you will have to iterate through both of them and on top of that it will practically double the data in the dictionaries.

How do you deal with this situation in Python?

TommySzalapski

I'm not really good at python yet, (I saw this from the Misc Programming topic) but I can give you an algorithm.
If you know they are both sorted, then it's really easy.

i = 0
j = 0
while i < list1.size and j < list2.size
  if list1[i].value == list2[j].value:
    # do your combining logic
    i = i + 1 # now move to the next in both lists
    j = j + 1
  else if list1[i].value < list2[j].value:
    i = i + 1 # list1[i] has no match
  else
    j = j + 1 # list2[j] has no match

Open in new window

Zberteoc

ASKER

Have you tried this code?

pepr

@Zberteoc: For the case of exremely large data, it is always better to use the database operation directly (from Python if you need it in Python). The more complex the original SQL is, the more difficult will be to simulate the SQL Server functionality and efficiency.

Basically, the problem is already with capturing the data to the Python lists. It is better to use some SQL engine wrapper and iterate through the result of the SQL command (also in Python).

For the concrete example, I will try later to write the code without the dictionary, using only the lists.

What SQL engine do you use? How do you get the data inserted to the lists?

Zberteoc

ASKER

I am using MS-SQL server.

A weird thing that happens is that I am building a relatively large SQL script in python and then I apply it to a database. I also save the SQL script in a log file before executing it. It is supposed to insert 1000 rows but uses 1000 MERGE statements not simply inserts.

The code I wrote in Python works but when SQL script is applied from it only about half of the rows get inserted(ie, 467) while if I copy the SQL script from the log file and execute it directly from Management Studio, it works correctly and inserts all 1000 rows.

Thing is that I don't get any error reported, everything looks fine. Is there a default limit for the variables size? I build the SQL script in a string variable. But again, when I save the variable content in the log file, I get all 1000 MERGE statements, which then work in SQL directly.

Zberteoc

ASKER

I control that 1000 number from a variable and I reduced it to 500, still only 463 rows inserted, and then to 200 and finally all 200 were inserted. It seems that there is some limitation when the variable is executed with:

sql_cursor.execute(sql_script)

The length of the original script with 1000 rows was 2,234,991. It may be a limitation when you execute a query.

TommySzalapski

Have you tried this code?

I gave you pseudo-code, not code. The idea is you can use that algorithm to iterate through both lists in parallel.
Have I tested that idea? Yes. Many times.

Start both lists at the top.
If they match, increment both.
If they don't match, increment the smaller one.

Simple and fast.

Zberteoc

ASKER

Oh, I thought you posted Python code. :o).

I got the idea and I actually I implemented something similar but I also am novice in Python and I am looking for more elegant and efficient ways if possible. The data structures in Python, lists, sets, tuples, dictionaries, are kind of new thing and is not that easy to figure out by yourself how they can be used.

TommySzalapski

As far as theoretical time complexity goes, this is O(N) and is really as efficient as it gets.
However, since python is a scripted language, it is possible that the built in routines will perform faster than iterating through manually.
I don't know of any fancy python ways of doing this with built in routines. We'll see what everyone else comes up with and then if you are curious, you can time them and see how much faster it can be (if at all).

pepr

Tommy's code is for inner join and it is perfecly suitable for Python if the sorted lists are prepared in advance. After the Python-syntax corrections, the code looks very similar:

#!python3

lst1 = [(1, 'a','xy'), (2, 'b','vw'), (3,'c','pq'), (5,'e','aq'), (6,'f','xa')]
lst2 = [(1,'#$'), (3,'**'), (4,'+-'), (6,'%%'), (7,'@@'), (8, '&*')]

i = 0
j = 0
while i < len(lst1) and j < len(lst2):
    if lst1[i][0] == lst2[j][0]:
        # do your combining logic
        print(lst1[i], lst2[j])    # here side by side
        i = i + 1 # now move to the next in both lists
        j = j + 1
    elif lst1[i][0] < lst2[j][0]:
        i = i + 1 # list1[i] has no match
    else:
        j = j + 1 # list2[j] has no match
        

# Exactly the same code only with different way of combining the rows.        
print('-' * 50)        
i = 0
j = 0
while i < len(lst1) and j < len(lst2):
    if lst1[i][0] == lst2[j][0]:
        # do your combining logic
        print(lst1[i] + lst2[j][1:])  # here as one tuple without repeating the key
        i = i + 1 # now move to the next in both lists
        j = j + 1
    elif lst1[i][0] < lst2[j][0]:
        i = i + 1 # list1[i] has no match
    else:
        j = j + 1 # list2[j] has no match

Open in new window

It prints on the console:

c:\tmp\_Python\Zberteoc\Q_28237876>py b.py
(1, 'a', 'xy') (1, '#$')
(3, 'c', 'pq') (3, '**')
(6, 'f', 'xa') (6, '%%')
--------------------------------------------------
(1, 'a', 'xy', '#$')
(3, 'c', 'pq', '**')
(6, 'f', 'xa', '%%')

Open in new window

In Python, you can also iterate through the sequence, i.e. without building the lists first, on the fly. The StopIteration exception is the standard way of how the sequence generator say it just finished:

#!python3

# Here the lists exist, but let's iterate through them without indexing
# and without using the len() function.
lst1 = [(1, 'a','xy'), (2, 'b','vw'), (3,'c','pq'), (5,'e','aq'), (6,'f','xa')]
lst2 = [(1,'#$'), (3,'**'), (4,'+-'), (6,'%%'), (7,'@@'), (8, '&*')]

it1 = iter(lst1)
it2 = iter(lst2)
try:
    e1 = next(it1)
    e2 = next(it2)
    while True:
        if e1[0] == e2[0]:
            # do your combining logic
            print(e1, e2)    # here side by side
            e1 = next(it1)   # now move to the next in both lists
            e2 = next(it2)
        elif e1[0] < e2[0]:
            e1 = next(it1)   # list1[i] has no match
        else:
            e2 = next(it2)   # list2[j] has no match
except StopIteration:
    pass        

# Exactly the same code only with different way of combining the rows.        
print('-' * 50)        
it1 = iter(lst1)
it2 = iter(lst2)
try:
    e1 = next(it1)
    e2 = next(it2)
    while True:
        if e1[0] == e2[0]:
            # do your combining logic
            print(e1 + e2[1:]) # one tuple without repeating the key
            e1 = next(it1)     # now move to the next in both lists
            e2 = next(it2)
        elif e1[0] < e2[0]:
            e1 = next(it1)     # list1[i] has no match
        else:
            e2 = next(it2)     # list2[j] has no match
except StopIteration:
    pass

Open in new window

The produced output is the same as for the solution with list indexing. The major difference is that the sequences can be huge. They are not stored in memory.

Zberteoc

ASKER

@pepr:

Sorry of being lazy :) bu you gave an "inner join" solution, I need a "left join". It works and definitely the version that uses the least memory is preferable.

It would be nice to build a function that does both, or maybe 2 function for each situation.

pepr

Only a minor modification makes it LEFT OUTER:

#!python3

# Here the lists exist, but let's iterate through them without indexing
# and without using the len() function.
lst1 = [(1, 'a','xy'), (2, 'b','vw'), (3,'c','pq'), (5,'e','aq'), (6,'f','xa')]
lst2 = [(1,'#$'), (3,'**'), (4,'+-'), (6,'%%'), (7,'@@'), (8, '&*')]


it1 = iter(lst1)
it2 = iter(lst2)
try:
    e1 = next(it1)
    e2 = next(it2)
    while True:
        if e1[0] == e2[0]:
            # do your combining logic for the INNER part of the join
            print(e1 + e2[1:]) # one tuple without repeating the key
            e1 = next(it1)     # now move to the next in both lists
            e2 = next(it2)
        elif e1[0] < e2[0]:
            # Here the key of the element of the first sequence is smaller
            # which means that the second sequence value should be
            # replaced by the default. This is the LEFT OUTER extra
            # functionality (when compared with the INNER).
            # Let the None is the default. To avoid slicing a constant,
            # we create a tuple with a single element. The comma inside
            # the parentheses says it is not a parenthesized expression
            # but the tuple. (The operator+ needs two tuples.)
            print(e1 + (None,) )
            e1 = next(it1)     
        else:
            e2 = next(it2)     # list2[j] has no match
except StopIteration:
    pass

Open in new window

It displays

c:\tmp\_Python\Zberteoc\Q_28237876>py d.py
(1, 'a', 'xy', '#$')
(2, 'b', 'vw', None)
(3, 'c', 'pq', '**')
(5, 'e', 'aq', None)
(6, 'f', 'xa', '%%')

Open in new window

When saving memory, the important part of the solution will be to avoid building lst1 and lst2. You should get the iterators instead. I do not know your code, but it is likely you can get them like it1 = iter(cursor) when cursor.execute('SELECT ...') was called earlier.

ASKER CERTIFIED SOLUTION

pepr

membership

This solution is only available to members.

To access this solution, you must be a member of Experts Exchange.

Start Free Trial

Zberteoc

ASKER

@pepr

Thank you for you awesome code bits. The previous post was already enough. Python turns out to be such a cool language. I wonder why isn't even more popular than already is.

pepr

You are welcome. Possibly, the points should be split next time ;)

Actually, there are more of wonderful languages -- like C++.