I cannot extract the text from an element using ElementTree

Luis

A snippet of my document and my code are as follows:

import xml.etree.ElementTree as ET
obj = ET.fromstring("""
   <tab>
    <infos><bounds left="7947" top="88607" width="10086" height="1184" bottom="89790" right="18032" mbFixSize="false" mbFrameAreaPositionValid="true" mbFrameAreaSizeValid="true" mbFramePrintAreaValid="true"/>     <prtBounds left="115" top="0" width="9300" height="1169" bottom="1168" right="9414"/> </infos>
    <row > <infos> <bounds left="8062" top="88607" width="9300" height="524" bottom="89130" right="17361" mbFixSize="false" mbFrameAreaPositionValid="true" mbFrameAreaSizeValid="true" mbFramePrintAreaValid="true"/>      <prtBounds left="0" top="0" width="9300" height="524" bottom="523" right="9299"/>      </infos>
     <cell ptr="000002232E644270" id="199" symbol="class SwCellFrame" next="202" upper="198" lower="200" rowspan="1"> <infos> <bounds left="8062" top="88607" width="546" height="524" bottom="89130" right="8607" mbFixSize="false" mbFrameAreaPositionValid="true" mbFrameAreaSizeValid="true" mbFramePrintAreaValid="true"/>        <prtBounds left="7" top="15" width="532" height="509" bottom="523" right="538"/>  </infos>
      <txt> <infos> <bounds left="8069" top="88622" width="532" height="187" bottom="88808" right="8600" mbFixSize="false" mbFrameAreaPositionValid="true" mbFrameAreaSizeValid="false" mbFramePrintAreaValid="true"/> <prtBounds left="0" top="3" width="532" height="184" bottom="186" right="531"/>        </infos>
       <Finish/>
      </txt>
      <txt> <infos> <bounds left="8069" top="88809" width="532" height="149" bottom="88957" right="8600" mbFixSize="false" mbFrameAreaPositionValid="true" mbFrameAreaSizeValid="false" mbFramePrintAreaValid="true"/> <prtBounds left="136" top="0" width="396" height="149" bottom="148" right="531"/> </infos>
UDA       <Finish/>
      </txt>
     </cell>
     <cell ptr="000002232E642E40" id="202" symbol="class SwCellFrame" next="205" prev="199" upper="198" lower="203" rowspan="1"> <infos> <bounds left="8608" top="88607" width="3283" height="524" bottom="89130" right="11890" mbFixSize="false" mbFrameAreaPositionValid="true" mbFrameAreaSizeValid="true" mbFramePrintAreaValid="true"/> <prtBounds left="7" top="15" width="3269" height="509" bottom="523" right="3275"/> </infos>
      <txt>
       <infos> <bounds left="8615" top="88622" width="3269" height="180" bottom="88801" right="11883" mbFixSize="false" mbFrameAreaPositionValid="true" mbFrameAreaSizeValid="false" mbFramePrintAreaValid="true"/> <prtBounds left="0" top="7" width="3269" height="173" bottom="179" right="3268"/> </infos> <Finish/>
      </txt>
      <txt> <infos> <bounds left="8615" top="88802" width="3269" height="149" bottom="88950" right="11883" mbFixSize="false" mbFrameAreaPositionValid="true" mbFrameAreaSizeValid="false" mbFramePrintAreaValid="true"/> <prtBounds left="58" top="0" width="3170" height="149" bottom="148" right="3227"/> </infos>
Nombre       <Finish/>
      </txt>
     </cell>
    </row>
  </tab>
""")
a = obj.findall('./row/cell/txt')
for i, item in enumerate(a):
    print(i, item.text.strip())

But if I simplify the document, I do manage to extract the text:

obj = ET.fromstring("""
   <tab>
    <row>
     <cell > 
      <txt > <Finish/> </txt>
      <txt > UDA <Finish/> </txt>
     </cell>
     <cell >
      <txt > <Finish/> </txt>
      <txt > Nombre       <Finish/> </txt>
     </cell>
   </row>
  </tab>
""")

a = obj.findall('./row/cell/txt')
for i, item in enumerate(a):
    print(i, item.text.strip())

Output:

0 
1 UDA
2 
3 Nombre

I don't know how to solve this problem, because my working document is very large and I can't simplify it as I have done in this example.

mzjn

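The "UDA" and "Nombre" strings are not in the text of the txt elements; they are in the tail of the infos child elements. An element's text attribute holds only what comes before its first child, and anything that follows a child's closing tag is stored in that child's tail. In the simplified document there is no infos child, so the strings end up in txt.text. The easiest way to get the wanted output is to use itertext(), which walks every text and tail string inside each txt element: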

a = obj.findall('./row/cell/txt')
for i, item in enumerate(a):
    text = "".join([s.strip() for s in item.itertext()])
    print(i, text)
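
To make the text/tail distinction concrete, here is a minimal sketch that reads the tail of infos directly instead of joining everything with itertext(). The sample XML is a hypothetical, trimmed-down stand-in for the document in the question, and txt_strings is an illustrative helper name, not part of ElementTree. It assumes each txt has at most one infos child and that the wanted string immediately follows it.

```python
import xml.etree.ElementTree as ET

# Hypothetical sample with the same shape as the question's data:
# the visible string follows the closing </infos> tag.
sample = ET.fromstring(
    "<cell>"
    "<txt><infos><bounds left='1'/></infos>UDA<Finish/></txt>"
    "<txt><infos><bounds left='2'/></infos><Finish/></txt>"
    "</cell>"
)

def txt_strings(cell):
    """Return one string per <txt>, read from the tail of its <infos> child."""
    results = []
    for txt in cell.findall("txt"):
        infos = txt.find("infos")
        # txt.text is only what precedes the first child; text after
        # </infos> is stored as infos.tail (None when there is none).
        tail = infos.tail if infos is not None else txt.text
        results.append((tail or "").strip())
    return results

first = sample.find("txt")
print(repr(first.text))                # None: nothing precedes <infos>
print(repr(first.find("infos").tail))  # 'UDA'
print(txt_strings(sample))             # ['UDA', '']
```

The `(tail or "")` guard matters because text and tail are None when empty, so calling .strip() on them directly can raise AttributeError. itertext() remains the simpler choice when a txt element may contain several pieces of text.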
