Regular Expression to extract quantity with dimensions from text in Python

VIPUL VAIBHAV

I'm trying to extract dimensions and units from text.

The data could look like any of the following:

53 inch x 45 inch

10 in by 5 in

53" W x 74" L x 15" H

53 inch W x 74 inch L x 15 inch H

Existing posts cover the first two cases, but I could not work out how to handle cases 3 and 4.

This is what I tried for the basic cases, but it doesn't work:

import re
regex = r"(?<!\S)\d+(?:,\d+)?\s*(?:inch|in| in|\")* ?x ?\d+(?:,\d+)?(?: ?x ?\d+(?:,\d+)?)*\s*(?:inch| inch|in| in|\")*"
test_str = ("15 mm x 2 mm x 3")
result = re.findall(regex, test_str)  
print(result)

Also, I only want to extract these dimension expressions, because I already use Quantulum to extract other numeric values and it fails in this case. Any guidance on how to make the two work together would be much appreciated.

Thank you for your help.

Wiktor Stribiżew

You can use

(?<!\S)(\d+(?:,\d+)?) *(?:(?:in(?:ch)?|")(?: +W)?)? ?(?:x|by) ?(\d+(?:,\d+)?)(?: ?x ?\d+(?:,\d+)?)* *(?:(?:in(?:ch)?|")(?: +L)?)?(?: ?x ?(\d+(?:,\d+)?))* *(?:(?:in(?:ch)?|")(?: +H)?)?

See the regex demo.

Using \s instead of literal spaces in the pattern is more robust, since it matches any whitespace character:

(?<!\S)(\d+(?:,\d+)?)\s*(?:(?:in(?:ch)?|")(?:\s+W)?)?\s?(?:x|by)\s?(\d+(?:,\d+)?)(?:\s?x\s?\d+(?:,\d+)?)*\s*(?:(?:in(?:ch)?|")(?:\s+L)?)?(?:\s?x\s?(\d+(?:,\d+)?))*\s*(?:(?:in(?:ch)?|")(?:\s+H)?)?
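As a quick check, the whitespace-tolerant pattern can be run against the four sample strings from the question. This is only a minimal sketch: the names DIMENSIONS and extract are made up for illustration, and the pattern is split across three string literals purely for readability.

```python
import re

# Whitespace-tolerant pattern from the answer, copied unchanged
# (adjacent string literals are concatenated by Python).
DIMENSIONS = re.compile(
    r'(?<!\S)(\d+(?:,\d+)?)\s*(?:(?:in(?:ch)?|")(?:\s+W)?)?\s?(?:x|by)\s?'
    r'(\d+(?:,\d+)?)(?:\s?x\s?\d+(?:,\d+)?)*\s*(?:(?:in(?:ch)?|")(?:\s+L)?)?'
    r'(?:\s?x\s?(\d+(?:,\d+)?))*\s*(?:(?:in(?:ch)?|")(?:\s+H)?)?'
)

samples = [
    '53 inch x 45 inch',
    '10 in by 5 in',
    '53" W x 74" L x 15" H',
    '53 inch W x 74 inch L x 15 inch H',
]


def extract(text):
    """Return (matched_text, groups) for the first dimension expression, or None."""
    m = DIMENSIONS.search(text)
    return (m.group(0), m.groups()) if m else None


for s in samples:
    print(s, '->', extract(s))
```

Group 3 is None when there is no third value. Note that the pattern only knows in, inch and " as units, so a string such as "15 mm x 2 mm x 3" from the question does not match.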

Details:

  • (?<!\S) - a left-hand whitespace boundary
  • (\d+(?:,\d+)?) - Group 1: a number, i.e. one or more digits optionally followed by a comma and more digits
  •  * (a space, then *) - zero or more spaces
  • (?:(?:in(?:ch)?|")(?: +W)?)? - an optional in, inch or ", optionally followed by one or more spaces and W
  •  ? (a space, then ?) - an optional space
  • (?:x|by) - x or by
  •  ? (a space, then ?) - an optional space
  • (\d+(?:,\d+)?)(?: ?x ?\d+(?:,\d+)?)* *(?:(?:in(?:ch)?|")(?: +L)?)?(?: ?x ?(\d+(?:,\d+)?))* *(?:(?:in(?:ch)?|")(?: +H)?)? - the same kind of sequence twice more, with L and H in place of W: the second value is required and captured into Group 2, the third value is optional and captured into Group 3.
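The breakdown above can also be kept next to the pattern itself by compiling it with re.VERBOSE, which ignores whitespace in the pattern and allows # comments. The sketch below is one possible way to write that, assuming Python 3.6+ for the f-string; NUMBER, UNIT, DIMENSIONS_VERBOSE and dimensions are illustrative names, not part of the original answer.

```python
import re

NUMBER = r'\d+(?:,\d+)?'   # digits, optionally a comma and more digits
UNIT = r'(?:in(?:ch)?|")'  # in, inch or "

# Same structure as the \s version of the pattern, one commented piece per line.
DIMENSIONS_VERBOSE = re.compile(rf'''
    (?<!\S)                   # left-hand whitespace boundary
    ({NUMBER})                # Group 1: first value
    \s* (?:{UNIT}(?:\s+W)?)?  # optional unit, optionally followed by W
    \s? (?:x|by) \s?          # separator: x or by
    ({NUMBER})                # Group 2: second value
    (?:\s?x\s?{NUMBER})*      # further unit-less values such as "x 3" (not captured)
    \s* (?:{UNIT}(?:\s+L)?)?  # optional unit, optionally followed by L
    (?:\s?x\s?({NUMBER}))*    # Group 3: third value, if present
    \s* (?:{UNIT}(?:\s+H)?)?  # optional unit, optionally followed by H
''', re.VERBOSE)


def dimensions(text):
    """Return the captured values of the first match as a list of strings."""
    m = DIMENSIONS_VERBOSE.search(text)
    return [g for g in m.groups() if g is not None] if m else []


print(dimensions('53" W x 74" L x 15" H'))
```

Pulling the number and unit fragments into their own variables also makes it easier to add further units later, for example by extending UNIT.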
