Chinese character comparison returning false when it should return true

Grant Curell

I am performing a simple string comparison between two Chinese characters, both of which are (I think) properly decoded from UTF-8. However, the comparison still fails and I haven't been able to figure out why. One character is read from an input file and the other comes from a decoded EPUB book.

What I've tried:

  • I decode both the input file and the EPUB book's content as UTF-8.
  • I read a number of posts about similar problems, but everything I could find boiled down to people not knowing how to decode the string correctly.

The Code

Read in the file where I get the character to compare:

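with open(input_file_name, encoding="utf-8") as input_file:
    word = input_file.readline().strip()  # assumed body (omitted in the post); word is the name used below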

In this case, the file is a single line with the character: 子

Read in the ebook and then try to find the character:

import ebooklib
from ebooklib import epub

book = epub.read_epub(args.ebook_path)

for doc in book.get_items_of_type(ebooklib.ITEM_DOCUMENT):
    content = doc.content.decode('utf-8')
    print(content)
    if word in content:
        print("MATCH FOUND")
        break

From the code above you can see I'm printing the content of each item in the book. Part of that output includes:

<td class="b_cell1" width="90%"><p class="p_index_">zǐ 子</p>

where the character clearly appears.

What I Expected

I expected the two characters to match, but no match is found. If I instead hard-code the character:

word = '子'

for doc in book.get_items_of_type(ebooklib.ITEM_DOCUMENT):
    content = doc.content.decode('utf-8')
    print(content)
    if word in content:
        print("MATCH FOUND")
        break

it prints MATCH FOUND, so the character is found. If I inspect the bytes of the character read from the file and of the hard-coded word shown above, they differ:

  • Value of 子 from my file: b'\xef\xbb\xbf\xe5\xad\x90'
  • Value of 子 as word shown in the code snippet above: b'\xe5\xad\x90'
Grant Curell

The problem was the byte order mark (BOM). Those three extra bytes (\xef\xbb\xbf) at the start of my variable are the UTF-8 encoding of the BOM, which the plain "utf-8" codec does not strip when decoding.
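
To see the effect in isolation, here is a minimal sketch that uses the byte values from the question. The content string is a shortened stand-in for the EPUB markup, and the variable names are my own.

```python
import codecs

# Bytes of the character as read from a file saved as "UTF-8 with BOM".
raw = b'\xef\xbb\xbf\xe5\xad\x90'

# The first three bytes are exactly the UTF-8 byte order mark.
assert raw.startswith(codecs.BOM_UTF8)

plain = raw.decode("utf-8")         # keeps the BOM as the character U+FEFF
stripped = raw.decode("utf-8-sig")  # skips a leading BOM if one is present

# Shortened stand-in for one document of the EPUB.
content = '<p class="p_index_">zǐ 子</p>'

print(len(plain), len(stripped))              # 2 1
print(plain in content, stripped in content)  # False True
```

The BOM is invisible when printed, which is why the two strings look identical even though one of them is two characters long.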

From this post.


Simply use the "utf-8-sig" codec:

fp = open("file.txt", "rb")  # binary mode, so read() returns bytes in Python 3
s = fp.read()
u = s.decode("utf-8-sig")  # decode as UTF-8 and drop a leading BOM, if any

That gives you a Unicode string without the BOM. You can then use

s = u.encode("utf-8")

to get normal UTF-8 encoded bytes back in s [a reference to the original post's variable].
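
In Python 3, where open() does the decoding, the same codec can be passed as the encoding argument instead. The following is a minimal sketch under that assumption: read_word and the temporary demo file are hypothetical names for illustration, not part of my original script.

```python
import os
import tempfile


def read_word(path):
    """Return the first line of a text file, ignoring a leading UTF-8 BOM."""
    # "utf-8-sig" decodes like "utf-8" but skips a BOM at the start of the file.
    with open(path, encoding="utf-8-sig") as input_file:
        return input_file.readline().strip()


# Demo only: create a file the way a BOM-writing editor would save it.
with tempfile.NamedTemporaryFile(delete=False, suffix=".txt") as tmp:
    tmp.write(b'\xef\xbb\xbf\xe5\xad\x90\n')
    demo_path = tmp.name

word = read_word(demo_path)
os.remove(demo_path)

print(word == '子')           # True
print(word.encode("utf-8"))  # b'\xe5\xad\x90'
```

Files without a BOM are read unchanged by "utf-8-sig", so it is a reasonable default when the input may come from editors that add one.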
