import re
import pdfplumber

lines = []
total_check = 0
with pdfplumber.open(file) as pdf:
    pages = pdf.pages
    for page in pdf.pages:
        text = page.extract_text()
        for line in text.split('\n'):
            print(line)
output data:
Totaalbedrag excl. btw € 25,00
When I try to retrieve VAT from data:
KVK_re = re.compile(r'(excl. btw .+)')
KVK_re.search(data).group(0)
output: AttributeError: 'NoneType' object has no attribute 'group'
KVK_re = re.compile(r'(excl. btw .+)')
KVK_re.search(r'excl. btw € 25,00').group(0)
output: 'excl. btw € 25,00'
How is it possible that the search finds € 25,00 when I paste the literal output into it, but fails when I pass the data variable?
Please help me!
In most cases, when a pattern contains a literal space and there is no match, the cause is invisible characters or non-breaking spaces in the input.
When you have non-breaking spaces (\xA0), you can simply replace the literal spaces with \s to match any whitespace, or with [ \xA0] to match either kind of space.
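As a minimal sketch of the non-breaking space case: the `data` string below is a hypothetical stand-in for the extracted PDF text, assuming the spaces around "btw" are \xA0 (the real text may differ). Printing `repr(data)` is a quick way to see which characters are actually there.

```python
import re

# Hypothetical sample: the spaces around "btw" are non-breaking (\xa0).
data = "Totaalbedrag excl.\xa0btw\xa0€ 25,00"

# repr() shows the hidden characters as escape sequences such as \xa0.
print(repr(data))

literal = re.compile(r"excl\. btw .+")           # literal spaces only
any_ws = re.compile(r"excl\.\sbtw\s.+")          # any whitespace
either = re.compile(r"excl\.[ \xA0]btw[ \xA0].+")  # space or non-breaking space

print(literal.search(data))          # no match: a plain space is not \xa0
print(any_ws.search(data).group(0))
print(either.search(data).group(0))
```

In Python 3, \s in a str pattern covers Unicode whitespace, which includes \xA0, so both alternatives match here.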
It appears there may be a combination of both spaces and some invisible chars in this case, so you may use \W to match any non-word chars instead of a literal space:
r'excl\.\W+btw\W.+'
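A short sketch of why \W+ can succeed where \s does not. The sample string is an assumption (a zero-width space followed by a non-breaking space after "excl."), and `find_or_none` is a hypothetical helper that avoids the AttributeError when nothing matches.

```python
import re

# Hypothetical sample: a zero-width space (\u200b) plus a non-breaking space
# (\xa0) after "excl.", and a non-breaking space after "btw".
data = "Totaalbedrag excl.\u200b\xa0btw\xa0€ 25,00"

whitespace_only = re.compile(r"excl\.\s+btw\s.+")
non_word = re.compile(r"excl\.\W+btw\W.+")


def find_or_none(pattern, text):
    """Return the matched text, or None instead of raising AttributeError."""
    match = pattern.search(text)
    return match.group(0) if match else None


# \s does not match the zero-width space, so this pattern finds nothing.
print(find_or_none(whitespace_only, data))

# \W matches any non-word character, including both invisible characters.
print(ascii(find_or_none(non_word, data)))
```

Checking the result of `search` before calling `.group(0)` makes a failed match show up as None rather than as an exception.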