Solved

# Python lists join/matching

Posted on 2013-09-12
Medium Priority
673 Views
I have 2 lists:

[(1, 'a','xy'), (2, 'b','vw'),(3,'c','pq')]
[(1,'#\$'), (3,'**')]

Can I do a join to pair the list elements based on the first value in the inside touples?

I want to pair:
(1, 'a','xy') with (1,'#\$')
(2, 'b','vw') - has no match
(3,'c','pq') with (3,'**')]

What I want is to retrieve 2 rowsets from database and then pair them in python based on the value of 1 column.
0
Question by:Zberteoc
• 8
• 6
• 3
• +1

LVL 31

Expert Comment

ID: 39488149
Do you want something like:

``````a = [(1, 'a','xy'), (2, 'b','vw'),(3,'c','pq')]

def merge(l1, l2):
l3 = []
for i in l1:
found = False
for j in l2:
if i[0] == j[0]:
l3.append((i,j))
found = True

l3.append(i)
return l3

c = merge(a, b)
print c
``````
0

LVL 27

Author Comment

ID: 39488393
Something like that but I want to iterate in parallel if possible. I know that in both lists the first elements from tuples are ordered and unique. This way I have the minimum iterations in the end.

So if I get:

a-1 and b-1 - stop b if they are equal, don't go to the end
a-2 and b - 3 - stop b and go next a
a-3 and b-3 stop if they are equal

and so forth.
0

LVL 29

Expert Comment

ID: 39488554
Based on your experience with SQL, try the following:
``````#!python3

# Your lists resemble SQL tables, and the wanted result resembles
# the LEFT OUTER JOIN of the tables.
lst1 = [(1, 'a','xy'), (2, 'b','vw'),(3,'c','pq')]
lst2 = [(1,'#\$'), (3,'**')]

# The efficient JOIN in SQL requires using indexes. When simplified,
# the indexes help to use the tables as look-up tables, i.e. they help
# to find faster the row based on the key. In Python (an also in other
# programming languages, dictionary (or a hash table, or lookup table)
# were designed for similar purpose.
#
# Let's take the second list and construct a structure that resembles
# the SQL table with index. Here the "dictionary comprehension"
# construction is used. Basically, the lst is iterated to get the tuples,
# the first element of the tuple is used as a key, the whole tuple is
# used as a value. (You can modify the code to store only the rest
# of the tuple -- depending on the problem you want to solve.)
d2 = { row[0]: row for row in lst2 }
print(d2)

# Now you can search easily for the values based on the key.
print(1, '-->', d2[1])
print(3, '-->', d2[3])

# However, some keys may not be in the dictionary. Then you want
# to get some default value -- here the (None, None) tuple. For the
# purpose you can use the dict.get() method instead of the indexing
# operator dict[]. It does the same but it allows you to define some
# default value. It can be used, of course, also for the existing keys.
# This implements the OUTER feature of the JOIN.
default = (None, None)
print(2, '-->', d2.get(2, default))

# Now you can iterate through the first list (this is the LEFT table),
# get its first element that is used as a key for JOIN, and get the row
# from the second table (via index, hence the dictionary d2).
# Here the rows are printed one after the other...
print('-' * 50)
for t in lst1:
print(t, d2.get(t[0], default))

# However, you can use the + operator to join two tuples (rows) and produce
# the new, joined one. Here the key is not repeated -- that is done
# via slicing [1:] (i.e. from the element on index 1 to the end).
print('-' * 50)
for t in lst1:
print(t + d2.get(t[0], default)[1:])
``````
It prints on my console:
``````c:\tmp\_Python\Zberteoc\Q_28237876>py a.py
{1: (1, '#\$'), 3: (3, '**')}
1 --> (1, '#\$')
3 --> (3, '**')
2 --> (None, None)
--------------------------------------------------
(1, 'a', 'xy') (1, '#\$')
(2, 'b', 'vw') (None, None)
(3, 'c', 'pq') (3, '**')
--------------------------------------------------
(1, 'a', 'xy', '#\$')
(2, 'b', 'vw', None)
(3, 'c', 'pq', '**')
``````
0

LVL 27

Author Comment

ID: 39490279
That looks neat. I only have one issue. If you have relatively small lists this would be fine but what if you deal with 2 lists with millions of elements. To build 2 dictionaries you will have to iterate through both of them and on top of that it will practically double the data in the dictionaries.

How do you deal with this situation in Python?
0

LVL 37

Expert Comment

ID: 39490317
I'm not really good at python yet, (I saw this from the Misc Programming topic) but I can give you an algorithm.
If you know they are both sorted, then it's really easy.
``````i = 0
j = 0
while i < list1.size and j < list2.size
if list1[i].value == list2[j].value:
i = i + 1 # now move to the next in both lists
j = j + 1
else if list1[i].value < list2[j].value:
i = i + 1 # list1[i] has no match
else
j = j + 1 # list2[j] has no match
``````
0

LVL 27

Author Comment

ID: 39490376
Have you tried this code?
0

LVL 29

Expert Comment

ID: 39490510
@Zberteoc: For the case of exremely large data, it is always better to use the database operation directly (from Python if you need it in Python). The more complex the original SQL is, the more difficult will be to simulate the SQL Server functionality and efficiency.

Basically, the problem is already with capturing the data to the Python lists. It is better to use some SQL engine wrapper and iterate through the result of the SQL command (also in Python).

For the concrete example, I will try later to write the code without the dictionary, using only the lists.

What SQL engine do you use? How do you get the data inserted to the lists?
0

LVL 27

Author Comment

ID: 39490544
I am using MS-SQL server.

A weird thing that happens is that I am building a relatively large SQL script in python and then I apply it to a database. I also save the SQL script in a log file before executing it. It is supposed to insert 1000 rows but uses 1000 MERGE statements not simply inserts.

The code I wrote in Python works but when SQL script is applied from it only about half of the rows get inserted(ie, 467) while if I copy the SQL script from the log file and execute it directly from Management Studio, it works correctly and inserts all 1000 rows.

Thing is that I don't get any error reported, everything looks fine. Is there a default limit for the variables size? I build the SQL script in a string variable. But again, when I save the variable content in the log file, I get all 1000 MERGE statements, which then work in SQL directly.
0

LVL 27

Author Comment

ID: 39490595
I control that 1000 number from a variable and I reduced it to 500, still only 463 rows inserted, and then to 200 and finally all 200 were inserted. It seems that there is some limitation when the variable is executed with:

sql_cursor.execute(sql_script)

The length of the original script with 1000 rows was 2,234,991. It may be a limitation when you execute a query.
0

LVL 37

Expert Comment

ID: 39490668
Have you tried this code?
I gave you pseudo-code, not code. The idea is you can use that algorithm to iterate through both lists in parallel.
Have I tested that idea? Yes. Many times.

Start both lists at the top.
If they match, increment both.
If they don't match, increment the smaller one.

Simple and fast.
0

LVL 27

Author Comment

ID: 39490732
Oh, I thought you posted Python code. :o).

I got the idea and I actually I implemented something similar but I also am novice in Python and I am looking for more elegant and efficient ways if possible. The data structures in Python, lists, sets, tuples, dictionaries, are kind of new thing and is not that easy to figure out by yourself how they can be used.
0

LVL 37

Expert Comment

ID: 39491201
As far as theoretical time complexity goes, this is O(N) and is really as efficient as it gets.
However, since python is a scripted language, it is possible that the built in routines will perform faster than iterating through manually.
I don't know of any fancy python ways of doing this with built in routines. We'll see what everyone else comes up with and then if you are curious, you can time them and see how much faster it can be (if at all).
0

LVL 29

Expert Comment

ID: 39491636
Tommy's code is for inner join and it is perfecly suitable for Python if the sorted lists are prepared in advance. After the Python-syntax corrections, the code looks very similar:
``````#!python3

lst1 = [(1, 'a','xy'), (2, 'b','vw'), (3,'c','pq'), (5,'e','aq'), (6,'f','xa')]
lst2 = [(1,'#\$'), (3,'**'), (4,'+-'), (6,'%%'), (7,'@@'), (8, '&*')]

i = 0
j = 0
while i < len(lst1) and j < len(lst2):
if lst1[i][0] == lst2[j][0]:
print(lst1[i], lst2[j])    # here side by side
i = i + 1 # now move to the next in both lists
j = j + 1
elif lst1[i][0] < lst2[j][0]:
i = i + 1 # list1[i] has no match
else:
j = j + 1 # list2[j] has no match

# Exactly the same code only with different way of combining the rows.
print('-' * 50)
i = 0
j = 0
while i < len(lst1) and j < len(lst2):
if lst1[i][0] == lst2[j][0]:
print(lst1[i] + lst2[j][1:])  # here as one tuple without repeating the key
i = i + 1 # now move to the next in both lists
j = j + 1
elif lst1[i][0] < lst2[j][0]:
i = i + 1 # list1[i] has no match
else:
j = j + 1 # list2[j] has no match
``````
It prints on the console:
``````c:\tmp\_Python\Zberteoc\Q_28237876>py b.py
(1, 'a', 'xy') (1, '#\$')
(3, 'c', 'pq') (3, '**')
(6, 'f', 'xa') (6, '%%')
--------------------------------------------------
(1, 'a', 'xy', '#\$')
(3, 'c', 'pq', '**')
(6, 'f', 'xa', '%%')
``````
In Python, you can also iterate through the sequence, i.e. without building the lists first, on the fly. The StopIteration exception is the standard way of how the sequence generator say it just finished:
``````#!python3

# Here the lists exist, but let's iterate through them without indexing
# and without using the len() function.
lst1 = [(1, 'a','xy'), (2, 'b','vw'), (3,'c','pq'), (5,'e','aq'), (6,'f','xa')]
lst2 = [(1,'#\$'), (3,'**'), (4,'+-'), (6,'%%'), (7,'@@'), (8, '&*')]

it1 = iter(lst1)
it2 = iter(lst2)
try:
e1 = next(it1)
e2 = next(it2)
while True:
if e1[0] == e2[0]:
print(e1, e2)    # here side by side
e1 = next(it1)   # now move to the next in both lists
e2 = next(it2)
elif e1[0] < e2[0]:
e1 = next(it1)   # list1[i] has no match
else:
e2 = next(it2)   # list2[j] has no match
except StopIteration:
pass

# Exactly the same code only with different way of combining the rows.
print('-' * 50)
it1 = iter(lst1)
it2 = iter(lst2)
try:
e1 = next(it1)
e2 = next(it2)
while True:
if e1[0] == e2[0]:
print(e1 + e2[1:]) # one tuple without repeating the key
e1 = next(it1)     # now move to the next in both lists
e2 = next(it2)
elif e1[0] < e2[0]:
e1 = next(it1)     # list1[i] has no match
else:
e2 = next(it2)     # list2[j] has no match
except StopIteration:
pass
``````
The produced output is the same as for the solution with list indexing. The major difference is that the sequences can be huge. They are not stored in memory.
0

LVL 27

Author Comment

ID: 39491976
@pepr:

Sorry of being lazy :) bu you gave an "inner join" solution, I need a "left join". It works and definitely the version that uses the least memory is preferable.

It would be nice to build a function that does both, or maybe 2 function for each situation.
0

LVL 29

Expert Comment

ID: 39492590
Only a minor modification makes it LEFT OUTER:
``````#!python3

# Here the lists exist, but let's iterate through them without indexing
# and without using the len() function.
lst1 = [(1, 'a','xy'), (2, 'b','vw'), (3,'c','pq'), (5,'e','aq'), (6,'f','xa')]
lst2 = [(1,'#\$'), (3,'**'), (4,'+-'), (6,'%%'), (7,'@@'), (8, '&*')]

it1 = iter(lst1)
it2 = iter(lst2)
try:
e1 = next(it1)
e2 = next(it2)
while True:
if e1[0] == e2[0]:
# do your combining logic for the INNER part of the join
print(e1 + e2[1:]) # one tuple without repeating the key
e1 = next(it1)     # now move to the next in both lists
e2 = next(it2)
elif e1[0] < e2[0]:
# Here the key of the element of the first sequence is smaller
# which means that the second sequence value should be
# replaced by the default. This is the LEFT OUTER extra
# functionality (when compared with the INNER).
# Let the None is the default. To avoid slicing a constant,
# we create a tuple with a single element. The comma inside
# the parentheses says it is not a parenthesized expression
# but the tuple. (The operator+ needs two tuples.)
print(e1 + (None,) )
e1 = next(it1)
else:
e2 = next(it2)     # list2[j] has no match
except StopIteration:
pass
``````
It displays
``````c:\tmp\_Python\Zberteoc\Q_28237876>py d.py
(1, 'a', 'xy', '#\$')
(2, 'b', 'vw', None)
(3, 'c', 'pq', '**')
(5, 'e', 'aq', None)
(6, 'f', 'xa', '%%')
``````
When saving memory, the important part of the solution will be to avoid building lst1 and lst2. You should get the iterators instead. I do not know your code, but it is likely you can get them like it1 = iter(cursor) when cursor.execute('SELECT ...') was called earlier.
0

LVL 29

Accepted Solution

pepr earned 2000 total points
ID: 39493283
``````#!python3

# Here is your wanted function. A it is always a good idea not to use
# print() inside such functions, I have given it the form of a generator.
# It differs from a usual function. It uses the 'yield' command instead
# of 'return'. The generator, when called, actually returns an iterator
# object that stores the state when executing the 'yield'. The next
# iteration continues just after the last 'yield' command. See the example
# of usage below.
def leftOuterJoin(seq1, seq2):
it1 = iter(seq1)
it2 = iter(seq2)
try:
row1 = next(it1)
row2 = next(it2)
while True:
if row1[0] == row2[0]:
yield row1 + row2[1:]
row1 = next(it1)     # now move to the next in both
row2 = next(it2)
elif row1[0] < row2[0]:
yield row1 + (None,) # default for the second
row1 = next(it1)
else:
row2 = next(it2)     # skip the second
except StopIteration:
pass

# If you may want to use the stript as a module later, it is a good
# idea to execute the script body conditionaly. The following condition
# holds only when the .py file was executed as a script.
if __name__ == '__main__':
# Let's use the lists in the example for demonstration. The real
# code may use whatever iterable sequences that contain the tuples
# with expected content.
lst1 = [(1, 'a','xy'), (2, 'b','vw'), (3,'c','pq'), (5,'e','aq'), (6,'f','xa')]
lst2 = [(1,'#\$'), (3,'**'), (4,'+-'), (6,'%%'), (7,'@@'), (8, '&*')]

# Now the function/generator can be used for joining -- here for
# the lists as lists support iteration. The generator returns
# (actually yields) the joined data as tuples -- one by one.
# The generator does NOT eat the memory for whole sequences. Instead,
# it iterates through them step by step.
for row in leftOuterJoin(lst1, lst2):
print(row)
``````
Notice the for loop at the end. It only iterates through the sequence of tuples returned by the generator. The printed result...
``````c:\tmp\_Python\Zberteoc\Q_28237876>py e.py
(1, 'a', 'xy', '#\$')
(2, 'b', 'vw', None)
(3, 'c', 'pq', '**')
(5, 'e', 'aq', None)
(6, 'f', 'xa', '%%')
``````
0

LVL 27

Author Comment

ID: 39493989
@pepr

Thank you for you awesome code bits. The previous post was already enough. Python turns out to be such a cool language. I wonder why isn't even more popular than already is.
0

LVL 29

Expert Comment

ID: 39494091
You are welcome. Possibly, the points should be split next time ;)

Actually, there are more of wonderful languages -- like C++.
0

## Featured Post

Question has a verified solution.

Are you are experiencing a similar issue? Get a personalized answer when you ask a related question.

Have a better answer? Share it in a comment.