How can I convert literal escape sequences in a string to the corresponding bytes?

Rafael Almeida

I have a string from an external source that contains the characters \xc3\x85lesund (literal backslash, literal "x", literal "c", etc.), which spell out UTF-8 encoded bytes.

Printing it outputs the following:

\xc3\x85lesund

I want to convert it to a bytes variable:

b'\xc3\x85lesund'

So that I can decode it to:

'Ålesund'

How can I do this? I'm using Python 3.4.

ThisSuitIsBlackNot

Using `unicode_escape`

TL;DR: You can decode bytes using the unicode_escape encoding to convert \xXX and \uXXXX escape sequences to the corresponding characters, then encode the result as Latin-1 to get bytes back:

>>> r'\xc3\x85lesund'.encode('utf-8').decode('unicode_escape').encode('latin-1')
b'\xc3\x85lesund'

First, encode the string to bytes so it can be decoded:

>>> r'\xc3\x85あ'.encode('utf-8')
b'\\xc3\\x85\xe3\x81\x82'

(I changed the string to show that this process works even for characters outside of Latin-1.)

Here's how each character is encoded (note that あ is encoded into multiple bytes):

\ (U+005C) -> 0x5c
x (U+0078) -> 0x78
c (U+0063) -> 0x63
3 (U+0033) -> 0x33
\ (U+005C) -> 0x5c
x (U+0078) -> 0x78
8 (U+0038) -> 0x38
5 (U+0035) -> 0x35
あ (U+3042) -> 0xe3, 0x81, 0x82
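The byte values in this list can be checked directly. The following is a minimal sketch; the names `sample` and `per_char` are only illustrative:

```python
# Pair every character of the sample string with its UTF-8 encoding.
sample = r'\xc3\x85あ'

per_char = [(ch, ch.encode('utf-8')) for ch in sample]
for ch, encoded in per_char:
    # Show the code point and the hex value of the encoded byte(s).
    print(f'{ch} (U+{ord(ch):04X}) -> {encoded.hex()}')
```

The ASCII characters each produce one byte, while あ produces three.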

Next, decode the bytes as unicode_escape to replace each escape sequence with its corresponding character:

>>> r'\xc3\x85あ'.encode('utf-8').decode('unicode_escape')
'Ã\x85ã\x81\x82'

Each escape sequence is converted to a separate character; each byte that is not part of an escape sequence is converted to the character with the corresponding ordinal value:

\\xc3 -> U+00C3
\\x85 -> U+0085
\xe3 -> U+00E3
\x81 -> U+0081
\x82 -> U+0082
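To see this mapping for yourself, you can list the code points of the decoded string. This is a small sketch with illustrative variable names:

```python
# Decode the escaped bytes, then inspect the code point of each resulting character.
decoded = r'\xc3\x85あ'.encode('utf-8').decode('unicode_escape')

code_points = [f'U+{ord(ch):04X}' for ch in decoded]
print(code_points)
```

The two escape sequences and the three raw bytes of あ each become exactly one character, so the list has five entries.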

Finally, encode the string to bytes again:

>>> r'\xc3\x85あ'.encode('utf-8').decode('unicode_escape').encode('latin-1')
b'\xc3\x85\xe3\x81\x82'

Encoding as Latin-1 simply converts each character to its ordinal value:

U+00C3 -> 0xc3
U+0085 -> 0x85
U+00E3 -> 0xe3
U+0081 -> 0x81
U+0082 -> 0x82
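This works because Latin-1 maps the code points U+0000 through U+00FF one-to-one onto the byte values 0x00 through 0xFF. A quick sketch to check that, and to show what happens outside that range (variable names are illustrative):

```python
# Every code point from U+0000 to U+00FF encodes to the single byte with the same value.
one_to_one = all(chr(i).encode('latin-1') == bytes([i]) for i in range(256))
print(one_to_one)

# Code points above U+00FF have no Latin-1 byte, so encoding them raises an error.
try:
    'あ'.encode('latin-1')
    outside_range_fails = False
except UnicodeEncodeError:
    outside_range_fails = True
print(outside_range_fails)
```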

And voilà, we have the byte sequence you're looking for.
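If you need this in more than one place, the three steps can be wrapped in a small helper. This is only a sketch: `unescape_to_bytes` is a hypothetical name, not something from the standard library.

```python
def unescape_to_bytes(text):
    # 1. Encode to bytes so the unicode_escape codec can decode them.
    # 2. Decode with unicode_escape to interpret the escape sequences.
    # 3. Encode as Latin-1 to turn each character back into one byte.
    return text.encode('utf-8').decode('unicode_escape').encode('latin-1')

raw = unescape_to_bytes(r'\xc3\x85lesund')
print(raw)                  # the bytes object from the question
print(raw.decode('utf-8'))  # the readable place name
```

Note that the last step assumes every decoded character is at most U+00FF. An input containing an escape such as \u3042 would decode to a character that Latin-1 cannot represent, and the final encode would raise UnicodeEncodeError.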

Using `codecs.escape_decode`

As an alternative, you can use the codecs.escape_decode function to interpret escape sequences in a bytes-to-bytes conversion, as user19087 posted in an answer to a similar question:

>>> import codecs
>>> codecs.escape_decode(r'\xc3\x85lesund'.encode('utf-8'))[0]
b'\xc3\x85lesund'

However, codecs.escape_decode is undocumented, so I wouldn't recommend using it.
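If you want to convince yourself that the two approaches agree on your data before choosing one, a quick comparison might look like this. It is a sketch with hypothetical helper names, and agreement on these two samples does not prove the approaches agree on every input:

```python
import codecs

def via_unicode_escape(text):
    # The documented route: unicode_escape decode, then Latin-1 encode.
    return text.encode('utf-8').decode('unicode_escape').encode('latin-1')

def via_escape_decode(text):
    # codecs.escape_decode is undocumented; it returns a tuple
    # whose first item is the decoded bytes.
    return codecs.escape_decode(text.encode('utf-8'))[0]

samples = [r'\xc3\x85lesund', r'\xc3\x85あ']
results = [(via_unicode_escape(s), via_escape_decode(s)) for s in samples]
for a, b in results:
    print(a, a == b)
```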

Edited at 2020-10-25
