We help IT Professionals succeed at work.
Research Question

Python parse with lambda

Leo Torres
Leo Torres asked
on
50 Views
Last Modified: 2020-11-02
I have a field that has 3 different values in any order I need to separate data into its own column not sure how to do this. In SQL Server, I would use charindex to find anchor location and extract what I need in Python I am not sure.
Here is a sample set:(all this data is in 1 column goal is to split it into its own column with just integer value)
3 bike, 4 cars, 2 mopeds
3 cars, 4 mopeds, 6 bike
1 mopeds, 2 bike, 3 cars

This is what I have so far
df['transport'].astype(str)

df['bike'] = df['transport'].apply(lambda row: row[row.index('bike')-2 : row.index('bike')-1])
I changed transport to type string since by default its an object.

I get this error
ValueError: substring not found
Could this be because there are some cells with no value?






Comment
Watch Question

CERTIFIED EXPERT

Commented:
Is lambda a part of the task? Or, could the solution be different?
#!python3

lines = [
    '3 bike, 4 cars, 2 mopeds',
    '3 cars, 4 mopeds, 6 bike',
    '1 mopeds, 2 bike, 3 cars'
]

cols = ['bike', 'cars', 'mopeds']

# Processing each line separately, display the number in its column.
print(', '.join(cols))  # header
for line in lines:
    d = {}
    for couple in line.split(','):
        v, col = couple.split()
        ##print(f'col={col!r}, v={v!r}')
        d[col] = int(v)
    ##print(d)
    row = []
    for col in cols:
        row.append(d[col])
    ##print(row)
    print(', '.join(str(v) for v in row))   # row of values converted to strings and joined
It prints
d:\___Python\LeoTorres\ee29199115>py a.py
bike, cars, mopeds
3, 4, 2
6, 3, 4
2, 3, 1
Leo TorresSQL Developer
CERTIFIED EXPERT

Author

Commented:
I will test this some more I thought lambda because I need the result to be added back to the dataframe as a column.
Leo TorresSQL Developer
CERTIFIED EXPERT

Author

Commented:
There are cases when the row value may not have all three values. I guess that is what is happening here. So I got the error below.
ValueError                                Traceback (most recent call last)
c:\Users\user\Documents\EDA1.py in 
      185     d = {}
     186     for couple in line.split(','):
---> 187         v, col = couple.split()
     188         ##print(f'col={col!r}, v={v!r}')
     189         d[col] = int(v)

ValueError: not enough values to unpack (expected 2, got 1)
There are instances where there is more than 3 or no value. Thank you for the help this still a neat trick I though using lambda it may be less complex code.
aikimarkSocial distance; Wear a mask; Don't touch your face; Wash your hands for 20 seconds
CERTIFIED EXPERT
Top Expert 2014

Commented:
import re
rgx = re.compile(r'(\d+) +bike')

df['bike'] = rgx.search(df['transport']).group(1)

# if you want the 'bike' column to be numeric, use
df['bike'] = int( rgx.search(df['transport']).group(1) )

Open in new window

NorieAnalyst Assistant
CERTIFIED EXPERT

Commented:
How about this?
import pandas as pd

lines = [
    '3 bike, 4 cars, 2 mopeds',
    '3 cars, 4 mopeds, 6 bike',
    '1 mopeds, 2 bike, 3 cars',
    '1 mopeds, 2 bike'
]

data = {}
for line in lines:
  items = line.split(',')
  for item in items:
    key = item.strip().split(' ')[1]
    value = item.strip().split(' ')[0]

    if key in data.keys():
      data[key] = data[key] + [value]
    else:
      data[key] = [value]

df = pd.DataFrame.from_dict(data, orient='index').transpose()

print(df)

Open in new window

Leo TorresSQL Developer
CERTIFIED EXPERT

Author

Commented:
Thank you Norie, I think your on to something here. But there are some cells that have NaN value and when that happens it errors out
here is the error
IndexError                                Traceback (most recent call last)
c:\Users\user\Documents\EDA1.py in 
      198   print(items)
      199   for item in items:
----> 200     key = item.strip().split(' ')[1]
      201     value = item.strip().split(' ')[0]
      202     if key in data.keys():

IndexError: list index out of range
if no value exists then we need the bike cars Moped value to just be set to 0 and continue to next row.

aikimark Thank you for your help
running this line
rgx.search(df['transport']).group(1)
produces this error
TypeError: expected string or bytes-like object
transport is of type object. If that helps
NorieAnalyst Assistant
CERTIFIED EXPERT

Commented:
Leo

Not sure what you mean.

The code I posted won't produce any errors, though I do realise there will be NaN/None in the dataframe.

This code, with addition of fillna will replace NaN/None with 0.
lines = [
    '3 bike, 4 cars, 2 mopeds',
    '3 cars, 4 mopeds, 6 bike',
    '1 mopeds, 2 bike, 3 cars',
    '1 mopeds, 2 bike'
]

data = {}
for line in lines:
  items = line.split(',')

  for item in items:
    key = item.strip().split(' ')[1]
    value = item.strip().split(' ')[0]
    if key in data.keys():
      data[key] = data[key] + [value]
    else:
      data[key] = [value]

df = pd.DataFrame.from_dict(data, orient='index').transpose().fillna(value=0)

print(df)

Open in new window

CERTIFIED EXPERT

Commented:
Error at the line 187
v, col = couple.split()
happens because some of the elements does not contain or the number or the column name.
For the lines with less elements or even empty, the d.get(key, default) can be used instead of d[key]. In the case of a missing key, the default value is returned. So, the modified script can look like this:
#!python3

lines = [
    '3 bike, 4 cars, 2 mopeds',
    '3 cars, 4 mopeds, 6 bike',
    '1 mopeds, bike, 3 cars',
    '4 mopeds, 6 bike',
    '5 mopeds',
    ''  # empty line
]

cols = ['bike', 'cars', 'mopeds']

# Processing each line separately, display the number in its column.
print(', '.join(cols))  # header
for line in lines:
    d = {}
    for couple in line.split(','):
        t = couple.split()
        if len(t) != 2:
            print(f'Unexpected element {t!r} in line {line!r} -- ignored.')
            continue
        v, col = t
        d[col] = int(v)

    row = []
    for col in cols:
        row.append(d.get(col, 0))

    print(', '.join(str(v) for v in row))   # row of values converted to strings and joined 
It prints the following...
d:\___Python\LeoTorres\ee29199115>py b.py
bike, cars, mopeds
3, 4, 2
6, 3, 4
Unexpected element ['bike'] in line '1 mopeds, bike, 3 cars' -- ignored.
0, 3, 1
6, 0, 4
0, 0, 5
Unexpected element [] in line '' -- ignored.
0, 0, 0
NorieAnalyst Assistant
CERTIFIED EXPERT

Commented:
pepr

What if the column names are unknown?
CERTIFIED EXPERT

Commented:
@Norie: It depends what Leo needs. For example, they can be collected on the fly, appending the new name to the end. However, if one needs to produce all columns for all rows, then one extra pass has to be done to discover the columns first.
Leo TorresSQL Developer
CERTIFIED EXPERT

Author

Commented:
I am testing the scripts thank you gentlemen. As for the value if its nan it should just return 0 as for a column name just give it any name. dummy column with a 0 would be great.
Leo TorresSQL Developer
CERTIFIED EXPERT

Author

Commented:
@pepr your code no longer errors out but i need this data in dataframe as rows and column. The printout shows me your code works but need a dataframe result also there are some data points that i may not know about this is data it is not always clean so entering column names not always possible. in these case there are 4 more names that appear 20 or fewer times with in 20000 rows so I spoted them and added them to your list

@Norie your code errors in the same spot
key = item.strip().split(' ')[1]
Error
IndexError: list index out of range

Need sometning like this
key = item.strip().split(' ')[1].fillna(value='dummy')

But I got this error
AttributeError: 'str' object has no attribute 'fillna'

NorieAnalyst Assistant
CERTIFIED EXPERT

Commented:
Leo

How are you running the code that's been posted?
Leo TorresSQL Developer
CERTIFIED EXPERT

Author

Commented:
I pass the values from column into lines variable
lines = df['transport']


data = {}
for line in lines:
  items = line.split(',')
  print(items)
  for item in items:
    key = item.strip().split(' ')[1]
   # print(key)
    value = item.strip().split(' ')[0]
   # print(value)

    if key in data.keys():
      data[key] = data[key] + [value]
    else:
      data[key] = [value]

df = pd.DataFrame.from_dict(data, orient='index').transpose().fillna(value=0)




once it reaches nan it fails
IndexError                                Traceback (most recent call last)
c:\Users\user\Documents\EDA1.py in 
      198   print(items)
      199   for item in items:
----> 200     key = item.strip().split(' ')[1]
      201    # print(key)
      202     value = item.strip().split(' ')[0]

IndexError: list index out of range

Also tried to fill lines variable. like so
lines = df_plays['personnelD'].fillna(value='000')

got error
  198   print(items)
      199   for item in items:
----> 200     key = item.strip().split(' ')[1]
      201    # print(key)
      202     value = item.strip().split(' ')[0]

IndexError: list index out of range

aikimarkSocial distance; Wear a mask; Don't touch your face; Wash your hands for 20 seconds
CERTIFIED EXPERT
Top Expert 2014

Commented:
@Leo

Please test this:
import re
rgx = re.compile(r'(\d+) +bike')

df['bike'] = rgx.search(df['transport'].value).group(1)

# if you want the 'bike' column to be numeric, use
df['bike'] = int( rgx.search(df['transport'].value).group(1), 0 )

Open in new window

Leo TorresSQL Developer
CERTIFIED EXPERT

Author

Commented:
@aikimark

---------------------------------------------------------------------------
TypeError                                 Traceback (most recent call last)
c:\Users\user\Documents\EDA1.py in 
      236 rgx = re.compile(r'(\d+) +bike')
----> 237 df['bike'] = rgx.search(df['transport']).group(1)

TypeError: expected string or bytes-like object
aikimarkSocial distance; Wear a mask; Don't touch your face; Wash your hands for 20 seconds
CERTIFIED EXPERT
Top Expert 2014

Commented:
Leo

Your code is missing the .value that I posted
Leo TorresSQL Developer
CERTIFIED EXPERT

Author

Commented:
@aikimark my apologies your correct. I fixed my error. Reran and now i get a different error. see below

AttributeError                            Traceback (most recent call last)
c:\Users\user\Documents\EDA1.py in 
      235 import re
      236 rgx = re.compile(r'(\d+) +bike')
----> 237 df['bike'] = rgx.search(df['transport'].value).group(1)

G:\ProgramFiles\Anaconda37\envs\tensorflow222\lib\site-packages\pandas\core\generic.py in __getattr__(self, name)
   5272             if self._info_axis._can_hold_identifiers_and_holds_name(name):
   5273                 return self[name]
-> 5274             return object.__getattribute__(self, name)
   5275 
   5276     def __setattr__(self, name: str, value) -> None:

AttributeError: 'Series' object has no attribute 'value'
CERTIFIED EXPERT
Commented:
This problem has been solved!
(Unlock this solution with a 7-day Free Trial)
UNLOCK SOLUTION
Leo TorresSQL Developer
CERTIFIED EXPERT

Author

Commented:
@pepr testing your code it looks like this works I want to validate numbers. Question, I have read that using lambda makes it easier and less codie so you dont have to write such complex and lengthy code. Why didnt anyone choose to use lambda to iterate through all the rows. I am good with this answer if it works but I want to understand the reasoning here for future purposes. Thank you to all!!

CERTIFIED EXPERT

Commented:
Lambda is just an unnamed function. What is actually meant by using lambda is using a functional approach based on lambda calculus. Functions are applied to the lists. The very basic functions (like map()) are used to compose mor complex functions and the resulting functions are called lambda functions... functions that need not to be given a name -- because you do not need them later.

Some programmers were trained using mostly functional languages, I was trained using mostly procedural approach. Sometimes, the functions are passed as arguments to do the transfomation prescribed by the functions "from outside". It all depends on the problem being solved, on the mindset of the programmer, on the design of the framework...
Leo TorresSQL Developer
CERTIFIED EXPERT

Author

Commented:
@pepr Thank you for your work and knowledge share. I tried to use this code on a similar column but it didnt work because is a semicolon some times. Here it is similar to this
https://www.experts-exchange.com/questions/29199233/Python-parsing-a-column-with-semicolon-sometimes.html#questionAdd 
CERTIFIED EXPERT

Commented:
Does it mean that semicolon is used instead of comma sometimes, or is the semicolon used to separate value from the key? Can you show some example of such line? If the semicolon and comma are used alternatively for the same purpose, then probably the most straightforward fix is to replace the semicolon by comma in the line:
#!python3

import pandas

lines = [
    '3 bike, 4 cars, 2 mopeds',
    '3 cars; 4 mopeds; 6 bike',
    '1 mopeds; bike, 3 cars',
    '4 mopeds, 6 bike',
    '5 mopeds',
    ''  # empty line
    '7 trucks'
]


# Processing each line separately, display the number in its column.
data = []
for n, line in enumerate(lines, 1):
    line = line.replace(';', ',')     # <------------ here is the fix
    row = {}
    for couple in line.split(','):
        t = couple.split()
        if len(t) != 2:
            print(f'Unexpected element {t!r} at line no. {n}: {line!r} -- ignored.')
            continue
        v, col = t
        row[col] = v

    data.append(row)  # new record added

df = pandas.DataFrame.from_records(data).fillna(0)
print(df)
If the semicolon is used to separate the value and the key, then the semicolon should be replaced by space at the very same line.