I have a UTF-8 encoded string that comes from somewhere else that contains the characters \xc3\x85lesund
(a literal backslash, a literal "x", a literal "c", etc.).
Printing it outputs the following:
\xc3\x85lesund
I want to convert it to a bytes variable:
b'\xc3\x85lesund'
so that I can decode it to:
'Ålesund'
How can I do this? I'm using Python 3.4.
unicode_escape
TL;DR: You can decode bytes using the unicode_escape encoding to convert \xXX and \uXXXX escape sequences to the corresponding characters:
>>> r'\xc3\x85lesund'.encode('utf-8').decode('unicode_escape').encode('latin-1')
b'\xc3\x85lesund'
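To get all the way to the text the question asks for, the same chain can be wrapped in a small helper with a final UTF-8 decode. This is only a minimal sketch: the name unescape_utf8 is made up for illustration, and it assumes the escape sequences spell out valid UTF-8 bytes.

```python
def unescape_utf8(escaped):
    """Turn text holding literal \\xNN escapes into the str those bytes encode.

    Hypothetical helper: assumes the escaped bytes form valid UTF-8.
    """
    # str -> bytes, so that there is something to decode
    raw = escaped.encode('utf-8')
    # Interpret the escapes; every other byte becomes the character
    # whose code point equals the byte value
    chars = raw.decode('unicode_escape')
    # Map code points U+0000..U+00FF back to single bytes
    data = chars.encode('latin-1')
    # Decode the real UTF-8 bytes into text
    return data.decode('utf-8')


print(unescape_utf8(r'\xc3\x85lesund'))  # Ålesund
```

Note that unicode_escape also interprets \uXXXX escapes. If one of those produces a code point above U+00FF, the Latin-1 step raises UnicodeEncodeError, so this helper only suits input whose escapes describe bytes.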
First, encode the string to bytes so it can be decoded:
>>> r'\xc3\x85あ'.encode('utf-8')
b'\\xc3\\x85\xe3\x81\x82'
(I changed the string to show that this process works even for characters outside of Latin-1.)
Here's how each character is encoded (note that あ is encoded into multiple bytes):
\ (U+005C) -> 0x5c
x (U+0078) -> 0x78
c (U+0063) -> 0x63
3 (U+0033) -> 0x33
\ (U+005C) -> 0x5c
x (U+0078) -> 0x78
8 (U+0038) -> 0x38
5 (U+0035) -> 0x35
あ (U+3042) -> 0xe3, 0x81, 0x82
Next, decode the bytes as unicode_escape to replace each escape sequence with its corresponding character:
>>> r'\xc3\x85あ'.encode('utf-8').decode('unicode_escape')
'Ã\x85ã\x81\x82'
Each escape sequence is converted to a separate character; each byte that is not part of an escape sequence is converted to the character with the corresponding ordinal value:
\\xc3 -> U+00C3
\\x85 -> U+0085
\xe3 -> U+00E3
\x81 -> U+0081
\x82 -> U+0082
Finally, encode the string to bytes again:
>>> r'\xc3\x85あ'.encode('utf-8').decode('unicode_escape').encode('latin-1')
b'\xc3\x85\xe3\x81\x82'
Encoding as Latin-1 simply converts each character to the byte with the same ordinal value:
U+00C3 -> 0xc3
U+0085 -> 0x85
U+00E3 -> 0xe3
U+0081 -> 0x81
U+0082 -> 0x82
And voilà, we have the byte sequence you're looking for.
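The Latin-1 step loses nothing here because Latin-1 maps every code point in the range U+0000 to U+00FF to the single byte with the same value. A short check of that, using only built-ins (the variable names are just for illustration):

```python
escaped = r'\xc3\x85あ'

# After the unicode_escape step: one character per target byte
chars = escaped.encode('utf-8').decode('unicode_escape')
code_points = [ord(ch) for ch in chars]

# After the Latin-1 step: each code point below 256 becomes the byte of equal value
data = chars.encode('latin-1')

# Iterating over bytes yields integers, so the two sequences can be compared directly
assert code_points == list(data)

# The bytes are now real UTF-8 and decode to the intended text
assert data.decode('utf-8') == 'Åあ'
```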
codecs.escape_decode
As an alternative, you can use the codecs.escape_decode function to interpret escape sequences in a bytes-to-bytes conversion, as user19087 posted in an answer to a similar question:
>>> import codecs
>>> codecs.escape_decode(r'\xc3\x85lesund'.encode('utf-8'))[0]
b'\xc3\x85lesund'
However, codecs.escape_decode is undocumented, so I wouldn't recommend using it.
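If you do experiment with it, a quick cross-check against the documented route can catch surprises. The following is a sketch that assumes a CPython build where codecs.escape_decode happens to be present (it is not a guaranteed API); via_unicode_escape is an illustrative name, not something from the standard library.

```python
import codecs


def via_unicode_escape(escaped):
    # Documented codecs only: unicode_escape, then Latin-1
    return escaped.encode('utf-8').decode('unicode_escape').encode('latin-1')


sample = r'\xc3\x85lesund'
expected = via_unicode_escape(sample)

# escape_decode is undocumented, so check that it exists before calling it
escape_decode = getattr(codecs, 'escape_decode', None)
if escape_decode is not None:
    # It returns a tuple: (decoded bytes, number of input bytes consumed)
    decoded, consumed = escape_decode(sample.encode('utf-8'))
    assert decoded == expected
```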