What is the proper way to use codecs' encoding in Python?

kormak

I have an HTML file encoded in UTF-8. I want to output it to a text file, also encoded in UTF-8. Here's the code I'm using:

import codecs
import re
IN = codecs.open("E2P3.html","r",encoding="utf-8")
codehtml = IN.read()

#codehtml = codehtml.decode("utf-8") 

texte = re.sub("<br>","\n",codehtml)

#texte = texte.encode("utf-8") 

OUT = codecs.open("E2P3.txt","w",encoding="utf-8")
OUT.write(texte)

IN.close()
OUT.close()

As you can see, I've tried using both 'decode' and 'codecs'. Neither of these works: my output text file defaults to Occidental (Windows-1252) and some entities become gibberish. What am I doing wrong here?

Tim Pietzcker

When opening a UTF-8 file with the codecs module, as you did, the contents of the file are automatically decoded into Unicode strings, so you must not try to decode them again.

The same is true when writing the file; if you write it using the codecs module, the Unicode string you're passing will automatically be encoded to whatever encoding you specified.
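As a minimal sketch of that round trip: the helper name convert_br_to_newlines and the sample file names below are made up for illustration, and the demo writes its own small input file to a temporary directory instead of using E2P3.html.

```python
import codecs
import os
import tempfile


def convert_br_to_newlines(src_path, dst_path):
    """Read UTF-8 HTML, turn <br> into newlines, write UTF-8 text."""
    # codecs.open decodes on read: `html` is already a Unicode string,
    # so no extra .decode("utf-8") call is needed (or wanted).
    with codecs.open(src_path, "r", encoding="utf-8") as src:
        html = src.read()

    text = html.replace(u"<br>", u"\n")

    # codecs.open encodes on write: pass the Unicode string as it is,
    # without calling .encode("utf-8") first.
    with codecs.open(dst_path, "w", encoding="utf-8") as dst:
        dst.write(text)
    return text


# Demo in a throwaway directory (hypothetical file names).
workdir = tempfile.mkdtemp()
src_path = os.path.join(workdir, "sample.html")
dst_path = os.path.join(workdir, "sample.txt")

# Write raw UTF-8 bytes so the input is a genuinely encoded file.
with open(src_path, "wb") as f:
    f.write(u"caf\u00e9<br>na\u00efve".encode("utf-8"))

result = convert_br_to_newlines(src_path, dst_path)

# Read the output back as bytes to inspect the encoding on disk.
with open(dst_path, "rb") as f:
    raw = f.read()
```

Here result is a Unicode string, while raw holds the UTF-8 bytes that ended up in the file. In Python 3, the built-in open() accepts the same encoding argument.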

To make it explicit that you're dealing with Unicode strings, it might be a better idea to use Unicode literals, as in

texte = re.sub(u"<br>", u"\n",codehtml)

although it doesn't really matter in this case (which could also be written as

texte = codehtml.replace(u"<br>", u"\n")

since you're not actually using a regular expression).
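A quick way to check that the two forms agree is sketched below. The sample string is invented, and the last part assumes the HTML might also contain variants such as <BR/> or <br />, which is the kind of case where a real regular expression starts to pay off.

```python
import re

# Invented sample input with three spellings of the line-break tag.
html = u"one<br>two<BR/>three<br />four"

# For the fixed string "<br>", re.sub and str.replace behave the same.
via_regex = re.sub(u"<br>", u"\n", html)
via_replace = html.replace(u"<br>", u"\n")
same = via_regex == via_replace

# Assumption: if variants of the tag can occur, a pattern is needed.
# Optional whitespace, optional slash, any letter case.
BR_TAG = re.compile(r"<br\s*/?>", re.IGNORECASE)
all_breaks = BR_TAG.sub(u"\n", html)
```

With this input, the plain replacement only touches the first tag, while the compiled pattern converts all three.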

If the application doesn't recognize the file as UTF-8, it might help to save it with a BOM (Byte Order Mark). This is generally discouraged, but if the application can't recognize a UTF-8 file otherwise, it's worth a try:

OUT = codecs.open("E2P3.txt","w",encoding="utf-8-sig")
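To see what "utf-8-sig" actually changes, the following sketch writes a short string to a temporary file (the file name is hypothetical) and inspects the raw bytes. It only illustrates the codec's behaviour; whether a given application then detects the file correctly is something you would have to test.

```python
import codecs
import os
import tempfile

# Hypothetical output file in a throwaway directory.
path = os.path.join(tempfile.mkdtemp(), "with_bom.txt")

# "utf-8-sig" writes the three-byte BOM before the first chunk of text.
with codecs.open(path, "w", encoding="utf-8-sig") as out:
    out.write(u"caf\u00e9\n")

# Inspect the bytes on disk.
with open(path, "rb") as f:
    raw = f.read()
has_bom = raw.startswith(codecs.BOM_UTF8)

# Reading with "utf-8-sig" strips the BOM again, so the text round-trips.
with codecs.open(path, "r", encoding="utf-8-sig") as f:
    text = f.read()
```

The file starts with the bytes EF BB BF, followed by the ordinary UTF-8 encoding of the text.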
