I am performing a simple string comparison between two Chinese characters, both of which I believe are properly decoded from UTF-8, yet they compare as unequal and I haven't been able to figure out why. One character is read from an input file and the other comes from a decoded EPUB book.
What I've tried:
The Code
Read in the file where I get the character to compare:
with open(input_file_name, encoding="utf-8") as input_file:
    word = input_file.read().strip()
In this case, the file is a single line with the character: 子
Read in the ebook and then try to find the character:
book = epub.read_epub(args.ebook_path)
for doc in book.get_items_of_type(ebooklib.ITEM_DOCUMENT):
    content = doc.content.decode('utf-8')
    print(content)
    if word in content:
        print("MATCH FOUND")
        break
From the code above you can see I'm printing the content of each item in the book. Part of that output includes:
<td class="b_cell1" width="90%"><p class="p_index_">zǐ 子</p>
where the character clearly appears.
What I Expected
I expected the two characters to match. However, if I change the code to:
word = '子'
for doc in book.get_items_of_type(ebooklib.ITEM_DOCUMENT):
    content = doc.content.decode('utf-8')
    print(content)
    if word in content:
        print("MATCH FOUND")
        break
it prints MATCH FOUND and finds the character as expected. Inspecting the binary values of the character read from the file against the hard-coded word above shows that the file version carries three extra leading bytes.
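The mismatch can be reproduced with a short sketch (the file name `sample.txt` here is just a stand-in; the point is that the file was saved with a BOM):

```python
# Write a file with a UTF-8 BOM, the way Notepad and some other
# Windows editors do, then read it back with the plain "utf-8" codec.
with open("sample.txt", "w", encoding="utf-8-sig") as f:
    f.write("子\n")

with open("sample.txt", encoding="utf-8") as f:
    word = f.read().strip()

print(repr(word))            # '\ufeff子' — the BOM survives as U+FEFF
print(word == "子")          # False
print(word.encode("utf-8"))  # b'\xef\xbb\xbf\xe5\xad\x90'
```

Note that `strip()` does not remove U+FEFF, since it is not considered whitespace, so the comparison quietly fails.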
The problem was the byte order mark (BOM). Those extra three bytes (\xef\xbb\xbf) at the start of my variable are the UTF-8 encoding of the BOM character U+FEFF, which some editors (notably on Windows) prepend to UTF-8 files.
From this post.
Simply use the "utf-8-sig" codec:
fp = open("file.txt")
s = fp.read()
u = s.decode("utf-8-sig")
That gives you a unicode string without the BOM. You can then use
s = u.encode("utf-8")
to get a normal UTF-8 encoded string back in s (s refers to the original post's variable).
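The quoted snippet is Python 2 style (reading bytes and decoding by hand). In Python 3 the codec can be passed straight to open(). A minimal sketch, writing a BOM-prefixed file first so the example is self-contained (`input.txt` is a hypothetical name):

```python
# "utf-8-sig" strips a leading BOM if one is present and behaves
# like plain "utf-8" if it is not, so it is safe as a default here.
with open("input.txt", "w", encoding="utf-8-sig") as f:  # create a BOM-prefixed file
    f.write("子\n")

with open("input.txt", encoding="utf-8-sig") as f:
    word = f.read().strip()

print(word == "子")  # True — the BOM is gone
```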