I am performing a simple string comparison between two Chinese characters, both of which I believe are properly decoded from UTF-8, yet they compare as unequal and I haven't been able to figure out why. One character is read from an input file and the other comes from a decoded EPUB book.
What I've tried:
The Code
Read in the file where I get the character to compare:
with open(input_file_name, encoding="utf-8") as input_file:
    word = input_file.read().strip()
In this case, the file is a single line with the character: 子
Read in the ebook and then try to find the character:
book = epub.read_epub(args.ebook_path)
for doc in book.get_items_of_type(ebooklib.ITEM_DOCUMENT):
    content = doc.content.decode('utf-8')
    print(content)
    if word in content:
        print("MATCH FOUND")
        break
From the code above you can see I'm printing the content of each item in the book. Part of that output includes:
<td class="b_cell1" width="90%"><p class="p_index_">zǐ 子</p>
where the character clearly appears.
What I Expected
I expected the two characters to match. However, if I change the code to:
word = '子'
for doc in book.get_items_of_type(ebooklib.ITEM_DOCUMENT):
    content = doc.content.decode('utf-8')
    print(content)
    if word in content:
        print("MATCH FOUND")
        break
it prints MATCH FOUND and finds the character as expected. Inspecting the binary values of the character read from the file against the hard-coded word above shows that the file version carries three extra leading bytes.
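The mismatch can be reproduced with a short sketch (the file name `sample.txt` here is just a stand-in; the point is that the file was saved with a BOM):

```python
# Write a file with a UTF-8 BOM, the way Notepad and some other
# Windows editors do, then read it back with the plain "utf-8" codec.
with open("sample.txt", "w", encoding="utf-8-sig") as f:
    f.write("子\n")

with open("sample.txt", encoding="utf-8") as f:
    word = f.read().strip()

print(repr(word))            # '\ufeff子' — the BOM survives as U+FEFF
print(word == "子")          # False
print(word.encode("utf-8"))  # b'\xef\xbb\xbf\xe5\xad\x90'
```

Note that `strip()` does not remove U+FEFF, since it is not considered whitespace, so the comparison quietly fails.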
The problem was the byte order mark (BOM). Those extra three bytes (\xef\xbb\xbf) at the start of my variable are the UTF-8 encoding of the BOM character U+FEFF, which some editors (notably on Windows) prepend to UTF-8 files.
From this post.
Simply use the "utf-8-sig" codec:
fp = open("file.txt")
s = fp.read()
u = s.decode("utf-8-sig")
That gives you a unicode string without the BOM. You can then use
s = u.encode("utf-8")
to get a normal UTF-8 encoded string back in s (s refers to the original post's variable).
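The quoted snippet is Python 2 style (reading bytes and decoding by hand). In Python 3 the codec can be passed straight to open(). A minimal sketch, writing a BOM-prefixed file first so the example is self-contained (`input.txt` is a hypothetical name):

```python
# "utf-8-sig" strips a leading BOM if one is present and behaves
# like plain "utf-8" if it is not, so it is safe as a default here.
with open("input.txt", "w", encoding="utf-8-sig") as f:  # create a BOM-prefixed file
    f.write("子\n")

with open("input.txt", encoding="utf-8-sig") as f:
    word = f.read().strip()

print(word == "子")  # True — the BOM is gone
```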