Remove CDATA from XML with regex in Windows CMD (PowerShell)

Francis Mescudi

I am working with some XML data and I am stuck trying to remove CDATA sections from it. I tried many approaches, and the simplest seems to be replacing every pattern like

hey <![CDATA[mate - number 1]]> what's up

by

hey mate - number 1 what's up

The regex that matches the whole expression is (\<\!\[CDATA\[)(.*)(\]\]\>), so in Perl (PCRE) I would just replace it with \2.

Based on this, and taking advantage of PowerShell, I am running this in CMD:

powershell -Command "(gc Desktop\test_in.xml) -replace '(\<\!\[CDATA\[)(.*)(\]\]\>)', '\2' | Out-File Desktop\test_out.xml")

However, every match is replaced by the literal string \2 instead of mate - number 1 as in the example.

Instead of \2, I also tried (?<=(\<\!\[CDATA\[))(.*?)(?=(\]\]\>)) as the replacement, since that pattern matches the inner part I want to keep, but the result is just as frustrating: it is again inserted literally.

Any guess?

Thank you!

PS. If anyone knows how to handle this in R, that would be useful as well.

Parfait

Any XSLT that runs the identity transform (i.e., copies the input document as-is) will drop the CDATA section markers and keep their content as plain text. Consider running it with R's xslt package or with PowerShell:

library(xml2)
library(xslt)

txt <- "<root>
              <data>hey <![CDATA[mate - number 1]]> what's up</data>
       </root>"    
doc <- read_xml(txt)

txt <- '<xsl:stylesheet version="1.0" xmlns:xsl="http://www.w3.org/1999/XSL/Transform">
            <xsl:output indent="yes"/>
            <xsl:strip-space elements="*"/>

            <xsl:template match="@*|node()">
              <xsl:copy>
                 <xsl:apply-templates select="@*|node()"/>
              </xsl:copy>
            </xsl:template>

         </xsl:stylesheet>'    
style <- read_xml(txt)

new_xml <- xml_xslt(doc, style)

# Output
cat(as.character(new_xml))

# <?xml version="1.0" encoding="UTF-8"?>
# <root>
#    <data>hey mate - number 1 what's up</data>
# </root>

PowerShell

$xslt = New-Object System.Xml.Xsl.XslCompiledTransform;

$xslt.Load("C:\Path\To\Identity_Transform\Script.xsl");
$xslt.Transform("C:\Path\To\Input.xml", "C:\Path\To\Output.xml");
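As for the regex attempt in the question: PowerShell's -replace operator uses .NET regex syntax, where a captured group is referenced in the replacement string as $1, $2, and so on, not \1 or \2. That is why \2 ends up in the output literally. A minimal sketch of the corrected call, assuming each CDATA section starts and ends on the same line (the $line and $result variable names are only for illustration):

```powershell
# Sample line from the question
$line = "hey <![CDATA[mate - number 1]]> what's up"

# Single quotes stop PowerShell from expanding $1 as a variable before the regex engine sees it.
# (.*?) is lazy, so two CDATA sections on the same line are matched separately.
$result = $line -replace '<!\[CDATA\[(.*?)\]\]>', '$1'

$result   # hey mate - number 1 what's up
```

The same pattern and '$1' replacement should drop into the one-liner from the question in place of the original pattern and '\2'. Note that this only strips the markers: if a CDATA section contains characters such as < or &, the result is no longer well-formed XML, which is one reason to prefer the XSLT route.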

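The PowerShell XSLT call above expects the identity transform in a separate Script.xsl file. If keeping an extra file is inconvenient, the stylesheet can probably be loaded from a string instead. The following is only a sketch under that assumption: it reuses the sample document from the R example, and the variable names ($xml, $xsl, $output, and so on) are made up for illustration.

```powershell
# Sample input, taken from the R example
$xml = @'
<root>
  <data>hey <![CDATA[mate - number 1]]> what's up</data>
</root>
'@

# Identity transform: copies every node and attribute as-is
$xsl = @'
<xsl:stylesheet version="1.0" xmlns:xsl="http://www.w3.org/1999/XSL/Transform">
  <xsl:template match="@*|node()">
    <xsl:copy>
      <xsl:apply-templates select="@*|node()"/>
    </xsl:copy>
  </xsl:template>
</xsl:stylesheet>
'@

# Load the stylesheet from the string instead of a file
$xslReader = [System.Xml.XmlReader]::Create((New-Object System.IO.StringReader -ArgumentList $xsl))
$xslt = New-Object System.Xml.Xsl.XslCompiledTransform
$xslt.Load($xslReader)

# Transform the in-memory document and capture the result as text
$xmlReader = [System.Xml.XmlReader]::Create((New-Object System.IO.StringReader -ArgumentList $xml))
$writer = New-Object System.IO.StringWriter
$xslt.Transform($xmlReader, $null, $writer)

$output = $writer.ToString()
$output
```

The CDATA markers should no longer appear in $output, while the text inside them is kept. Because the result is written to a StringWriter, the XML declaration will report utf-16; writing to a file path, as in the original call, avoids that.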