I am writing compressed data as a bytes type to a black-box API (i.e. I cannot change what happens under the hood). When I get that data back, it is returned as a string type, which I cannot decompress using the generic Python modules (zlib, bz2, etc.).
In more detail, part of the problem is that this string includes the leading 'b', e.g.
b'x\x9c\xabV*HL\xd1\xcd\xccK\xcbW\xb2RPJ\xcb\xcfOJ,R\xaa\x05\x00T\x83\x07b'
(this is a string type).
When I compare this to the original binary representation, it is identical apart from the quotes and the leading b.
If I try to simply convert back to bytes (e.g. using the bytes function), it wraps the whole thing and escapes the backslashes, and I get something like the following:
b"b'x\\x9c\\xabV*HL\\xd1\\xcd\\xccK\\xcbW\\xb2RPJ\\xcb\\xcfOJ,R\\xaa\\x05\\x00T\\x83\\x07b'"
The question is: is it possible to convert this back to a bytes type so I can decompress it? If so, how?
I've seen a few different examples (e.g. How to cast a string to bytes without encoding) that don't quite work for what I'm trying to do.
UPDATE:
Lots of good answers, thanks folks! I wish I could accept more than one of them. And yes, as many of you noted, it is zlib-compressed. This is by design, as we have extremely limited space to work with and would like to stay with JSON if possible (zlib was chosen arbitrarily just to work out the quirks of binary data, and may not be the final choice).
Assuming your original string is of type str, you have the following raw string (each escape is four literal characters, such as \x9c, not an actual escape code representing one byte):
s = r"b'x\x9c\xabV*HL\xd1\xcd\xccK\xcbW\xb2RPJ\xcb\xcfOJ,R\xaa\x05\x00T\x83\x07b'"
If you remove the leading b' and the trailing ', you can use the latin1 encoding to convert to bytes. latin1 is a 1:1 mapping of Unicode code points to byte values, because the first 256 Unicode code points represent the latin1 character set:
>>> b = s[2:-1].encode('latin1')
>>> b
b'x\\x9c\\xabV*HL\\xd1\\xcd\\xccK\\xcbW\\xb2RPJ\\xcb\\xcfOJ,R\\xaa\\x05\\x00T\\x83\\x07b'
This is now a byte string, but it contains literal escape codes. Now decode it with the unicode_escape codec to translate back to a str of the actual code points:
>>> s2 = b.decode('unicode_escape')
>>> s2
'x\x9c«V*HLÑÍÌKËW²RPJËÏOJ,Rª\x05\x00T\x83\x07b'
This is now a Unicode string of code points, but we still need a byte string. Encode with latin1 again:
>>> b2 = s2.encode('latin1')
>>> b2
b'x\x9c\xabV*HL\xd1\xcd\xccK\xcbW\xb2RPJ\xcb\xcfOJ,R\xaa\x05\x00T\x83\x07b'
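The 1:1 property of latin1 that both encode steps rely on can be checked directly. This is only a small sanity check, and the variable names are illustrative:

```python
# Build a str containing every code point from 0 to 255.
all_code_points = ''.join(chr(i) for i in range(256))

# latin1 encodes each of those code points to the single byte of the same value.
encoded = all_code_points.encode('latin1')

print(encoded == bytes(range(256)))                 # True
print(encoded.decode('latin1') == all_code_points)  # True
```

Any code point above 255 would raise a UnicodeEncodeError here, which is why this approach only suits strings that came from a bytes repr.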
In one step:
>>> s = r"b'x\x9c\xabV*HL\xd1\xcd\xccK\xcbW\xb2RPJ\xcb\xcfOJ,R\xaa\x05\x00T\x83\x07b'"
>>> b = s[2:-1].encode('latin1').decode('unicode_escape').encode('latin1')
>>> b
b'x\x9c\xabV*HL\xd1\xcd\xccK\xcbW\xb2RPJ\xcb\xcfOJ,R\xaa\x05\x00T\x83\x07b'
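If this conversion is needed in more than one place, the chain could be wrapped in a small function. The sketch below is only one way to do it: the name repr_str_to_bytes is hypothetical, and it assumes the input always looks like the repr of a bytes object (b'...' or b"...").

```python
def repr_str_to_bytes(text):
    """Recover bytes from a str holding the repr of a bytes object.

    Hypothetical helper; assumes text has the form b'...' or b"...".
    """
    looks_like_repr = (
        len(text) >= 3
        and text[0] == 'b'
        and text[1] in ('"', "'")
        and text[-1] == text[1]
    )
    if not looks_like_repr:
        raise ValueError("expected a string of the form b'...'")

    body = text[2:-1]  # strip the leading b' and the trailing '
    # str -> bytes that still hold literal backslash escapes
    #     -> str of real code points -> final bytes
    return body.encode('latin1').decode('unicode_escape').encode('latin1')


s = r"b'x\x9c\xabV*HL\xd1\xcd\xccK\xcbW\xb2RPJ\xcb\xcfOJ,R\xaa\x05\x00T\x83\x07b'"
recovered = repr_str_to_bytes(s)
print(type(recovered))  # <class 'bytes'>
```

The prefix check is there so that an unexpected input fails loudly rather than being silently sliced.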
It appears this sample data is a zlib-compressed JSON string:
>>> import zlib, json
>>> json.loads(zlib.decompress(b))
{'pad-info': 'foobar'}
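To see the whole round trip in one place, here is a sketch that simulates the black-box API. It assumes the API returns the equivalent of repr() of the bytes that were sent, which matches the sample string in the question but is not confirmed for the real service:

```python
import json
import zlib

payload = {'pad-info': 'foobar'}
compressed = zlib.compress(json.dumps(payload).encode('utf-8'))

# Assumption: the API hands back the text form of the bytes, like repr() does.
returned = repr(compressed)

# Strip b'...' and undo the escaping with the three-step conversion.
recovered = returned[2:-1].encode('latin1').decode('unicode_escape').encode('latin1')

print(recovered == compressed)                 # True
print(json.loads(zlib.decompress(recovered)))  # {'pad-info': 'foobar'}
```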