How to apply a function that splits multiple numbers to the fields of a column in a dataframe in Python?

Elisa L. Published at Dev

Elisa L.

I need to apply a function that splits multiple numbers from the fields of a dataframe.

In this dataframe there a all the kids' measurements that are needed for a school: Name, Height, Weight, and Unique Code, and their dream career.

The name is only formed of alpha-characters. But some kids might have both first name and middle name. (e.g. Vivien Ester)
The height is known to be >= 100 cm for every child.
The weight is known to be < 70 kg for every child.
The unique code is known to be any number, but it is associated with the letters 'AX', for every child. But the AX may not always be stick to the number (e.g. 7771AX), it might be a space next to it. (e.g. 100 AX)
Every kid has its dream career

They could appear in any order, but they always follow the rules from above. However, for some kids some measurements could not appear (e.g.: height or unique code or both are missing or all are missing).

So the dataframe is this:

data = { 'Dream Career': ['Scientist', 'Astronaut', 'Software Engineer', 'Doctor', 'Fashion Designer', 'Teacher', 'Architect'],
    'Measurements': ['Rachel 24.3 100.25 100 AX', '100.5 Tuuli 30.1', 'Michael 28.0 7771AX 113.75', 'Vivien Ester 40AX 115.20', 'Oliver 40.5', 'Julien 35.1 678 AX 111.1', 'Bee 20.0 100.80 88AX']
       }

df = pd.DataFrame (data, columns = ['Dream Career','Measurements'])

And it looks like this:

        Dream Career                Measurements
0          Scientist   Rachel 24.3 100.25 100 AX
1          Astronaut            100.5 Tuuli 30.1
2  Software Engineer  Michael 28.0 7771AX 113.75
3             Doctor    Vivien Ester 40AX 115.20
4   Fashion Designer                 Oliver 40.5
5            Teacher    Julien 35.1 678 AX 111.1
6          Architect        Bee 20.0 100.80 88AX

I try to split all of these measurements into different columns, based on the specified rules.

So the final dataframe should look like this:

       Dream Career         Names  Weight  Height Unique Code
0          Scientist       Rachael    24.3  100.25       100AX
1          Astronaut         Tuuli    30.1  100.50         NaN
2  Software Engineer       Michael    28.0  113.75      7771AX
3             Doctor  Vivien Ester     NaN  115.20        40AX
4   Fashion Designer        Oliver    40.5     NaN         NaN
5            Teacher        Julien    35.1  111.10       678AX
6          Architect           Bee    10.0  100.80        88AX

I tried this code and it works very well, but only on single strings. And I need to do this while in the dataframe and still keep every's kid's associate dream career (so the order is not lost).

num_rx = r'[-+]?\.?\d+(?:,\d{3})*\.?\d*(?:[eE][-+]?\d+)?'
def get_weight_height(s):
    nums = re.findall(num_rx, s)
    height = np.nan
    weight = np.nan
    if (len(nums) == 0):
        height = np.nan
        weight = np.nan
    elif (len(nums) == 1):
        if float(nums[0]) >= 100:
            height = nums[0]
            weight = np.nan
        else:
            weight = nums[0]
            height = np.nan
    elif (len(nums) == 2):
        if float(nums[0]) >= 100:
            height = nums[0]
            weight = nums[1]
        else:
            height = nums[1]
            weight = nums[0]
    return height, weight

class_code = {'Small': 'AX', 'Mid': 'BX', 'High': 'CX'}

def hasNumbers(inputString):
    return any(char.isdigit() for char in inputString)

def extract_measurements(string, substring_name):
    height = np.nan
    weight = np.nan
    unique_code = np.nan
    name = ''
    if hasNumbers(string):
        num_rx = r'[-+]?\.?\d+(?:,\d{3})*\.?\d*(?:[eE][-+]?\d+)?'
        nums = re.findall(num_rx, string)
        if (substring_name in string):
            special_match = re.search(rf'{num_rx}(?=\s*{substring_name}\b)', string)
            if special_match:
                unique_code = special_match.group()
                string = string.replace(unique_code, '')
                unique_code = unique_code + substring_name
            if len(nums) >= 2 & len(nums) <= 3:
                height, weight = get_weight_height(string)
        else:
            height, weight = get_weight_height(string)
    name = " ".join(re.findall("[a-zA-Z]+", string))
    name = name.replace(substring_name,'')
    return format(float(height), '.2f'), float(weight), unique_code, name

And I apply it like this:

string = 'Anya 101.30 23 4546AX'
height, weight, unique_code, name = extract_measurements(string, class_code['Small'])        
print( 'name is: ', name, '\nh is: ', height, '\nw is: ', weight, '\nunique code is: ', unique_code)

The results are very good.

I tried to apply the function on the dataframe, but I don't know how, I tried this as I got inspired from this and this and this... but they are all different than my problem:

df['height'], df['weight'], df['unique_code'], df['name'] = extract_measurements(df['Measurements'], class_code['Small'])

I cannot figure out how to apply it on my dataframe. Please help me.

I am at the very beginning, I highly appreciate all the help if you could possibly help me!

piterbarg

Use apply for rows (axis=1) and choose 'expand' option. Then rename columns and concat to the original df:

pd.concat([df,(df.apply(lambda row : extract_measurements(row['Measurements'], class_code['Small']), axis = 1, result_type='expand')
   .rename(columns = {0:'height', 1:'weight', 2:'unique_code', 3:'name'})
)], axis = 1)

output:

    Dream Career       Measurements                  height    weight  unique_code    name
--  -----------------  --------------------------  --------  --------  -------------  ------------
 0  Scientist          Rachel 24.3 100.25 100 AX        100       100  100AX          Rachel
 1  Astronaut          100.5 Tuuli 30.1                 100       100  nan            Tuuli
 2  Software Engineer  Michael 28.0 7771AX 113.75       100       100  7771AX         Michael
 3  Doctor             Vivien Ester 40AX 115.20         100       100  40AX           Vivien Ester
 4  Fashion Designer   Oliver 40.5                      100       100  nan            Oliver
 5  Teacher            Julien 35.1 678 AX 111.1         100       100  678AX          Julien
 6  Architect          Bee 20.0 100.80 88AX             100       100  88AX           Bee

(note I stubbed def get_weight_height(string) function because your coded did not include it, to always return 100,100)

Collected from the Internet

Please contact [email protected] to delete if infringement.

edited at2021-06-4

Comments

0 comments

TOP Ranking

Article

How to apply a function that splits multiple numbers to the fields of a column in a dataframe in Python?

How to apply a function that splits multiple numbers to the fields of a column in a dataframe in Python?

Loopback Error: connect ECONNREFUSED 127.0.0.1:3306 (MAMP)

Can't pre-populate phone number and message body in SMS link on iPhones when SMS app is not running in the background

pump.io port in URL

How to import an asset in swift using Bundle.main.path() in a react-native native module

Failed to listen on localhost:8000 (reason: Cannot assign requested address)

Spring Boot JPA PostgreSQL Web App - Internal Authentication Error

Emulator wrong screen resolution in Android Studio 1.3

3D Touch Peek Swipe Like Mail

Double spacing in rmarkdown pdf

Svchost high CPU from Microsoft.BingWeather app errors

How to how increase/decrease compared to adjacent cell

Using Response.Redirect with Friendly URLS in ASP.NET

java.lang.NullPointerException: Cannot read the array length because "<local3>" is null

BigQuery - concatenate ignoring NULL

How to fix "pickle_module.load(f, **pickle_load_args) _pickle.UnpicklingError: invalid load key, '<'" using YOLOv3?

ngClass error (Can't bind ngClass since it isn't a known property of div) in Angular 11.0.3

Can a 32-bit antivirus program protect you from 64-bit threats

Make a B+ Tree concurrent thread safe

Bootstrap 5 Static Modal Still Closes when I Click Outside

Vector input in shiny R and then use it

Assembly definition can't resolve namespaces from external packages