I want to replace the ISO-8859-1 characters in the file below so that it becomes valid UTF-8.
<HTML>
<HEAD>
<meta http-equiv="Content-Type" content="text/html; charset=iso-8859-1" />
</HEAD>
<BODY>
<A NAME="top"></A>
<TABLE border=0 width=609 cellspacing=0 cellpadding=0>
<TR><td rowspan=2><img src="http://www.example.com" width=10></td>
<TD width=609 valign=top>
<p>'</p>
<p>*</p>
<p>–</p>
<p>—</p>
<p>§</p>
<p>«</p>
<p>»</p>
<p>¿</p>
<p>Á</p>
</TD>
</TR>
</TABLE>
</body>
</html>
Doing some research I found that the issue is related to the locale settings, and I was able to build this awk program, but it only replaces the first two characters (' and *):
LC_ALL=ISO_8859-1 awk '{
gsub(/charset=iso-8859-1/, "charset=UTF-8" , $0)
gsub(/\047/, "\\'" , $0)
gsub(/*/, "\\*" , $0)
gsub(/–/, "\\–" , $0)
gsub(/—/, "\\—" , $0)
gsub(/§/, "\\§" , $0)
gsub(/«/, "\\«" , $0)
gsub(/»/, "\\»" , $0)
gsub(/¿/, "\\¿" , $0)
gsub(/Á/, "\\Á" , $0)
print
}' t.html | iconv -f ISO_8859-1 -t UTF-8
This is the current output (partial; only the lines affected by the program are shown):
<p>'</p>
<p>*</p>
<p>-</p>
<p>-</p>
<p>§</p>
<p>«</p>
<p>»</p>
<p>¿</p>
<p>Á</p>
and the expected output is:
<p>*</p>
<p>–</p>
<p>—</p>
<p>§</p>
<p>«</p>
<p>»</p>
<p>¿</p>
<p>Á</p>
I've already tried similar code using sed, but ran into the same issue.
How can I fix this?
My locale config (Ubuntu 18.04.1 LTS):
$ locale
LANG=C.UTF-8
LANGUAGE=
LC_CTYPE="C.UTF-8"
LC_NUMERIC="C.UTF-8"
LC_TIME="C.UTF-8"
LC_COLLATE="C.UTF-8"
LC_MONETARY="C.UTF-8"
LC_MESSAGES="C.UTF-8"
LC_PAPER="C.UTF-8"
LC_NAME="C.UTF-8"
LC_ADDRESS="C.UTF-8"
LC_TELEPHONE="C.UTF-8"
LC_MEASUREMENT="C.UTF-8"
LC_IDENTIFICATION="C.UTF-8"
LC_ALL=
This issue is likely due to an encoding mismatch between the input file and the awk script.
First, please note that there is probably a (very common) confusion between ISO-8859-1 and Windows-1252 here. The HTML sample in the original post contains em/en dash characters, which are not part of the ISO-8859-1 layout, so the file certainly uses another encoding. It is probably Windows-1252 (a superset of ISO-8859-1 that includes the dash characters), since the OP reported using Ubuntu through the Windows subsystem layer.
I'll then assume that the HTML input file is indeed encoded in Windows-1252, so that non-ASCII characters (code points ≥ 128) each use a single byte.
If the awk program is loaded from a file encoded in UTF-8, or typed directly into a terminal window that uses the UTF-8 encoding, then the regular expressions and literal strings embedded in the program are also encoded in UTF-8, so non-ASCII characters use multiple bytes.
For example, the character § (code point 167 = 0xA7) is represented by the single byte A7 in Windows-1252 and by the byte sequence C2 A7 in UTF-8. If you use gsub(/§/, "S") in your UTF-8 encoded awk program, awk looks for the sequence C2 A7 in the input file, which only contains A7. It will not match, unless you are (un)lucky enough to have a character Â (code point 194 = 0xC2) sitting just before your §.
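You can check these byte sequences yourself. A quick sketch, assuming iconv and od are available and the terminal (or script file) is UTF-8 encoded:

```shell
# '§' as typed in a UTF-8 terminal: two bytes, c2 a7
printf '§' | od -An -tx1

# the same character transcoded to Windows-1252: the single byte a7
printf '§' | iconv -f UTF-8 -t WINDOWS-1252 | od -An -tx1
```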
Changing the locale does not help here, because it only tells awk how to parse its input (both data and program), whereas what you need is to transcode either the data or the regular expressions. For that you would have to specify the locale of the data independently of the locale of the program, which is not supported.
So, assuming that your system is set up with a UTF-8 locale and that your awk script uses this locale (no matter whether it is loaded from a file or typed in a terminal), here are several methods you can use to align the input file and the regular expressions on the same encoding, so that gsub works as expected.
Please note that these suggestions stick to your first awk command, since it is the source of the issue. The final pipe to iconv is needed only if you intentionally do not transform all the special characters you may have in the input; otherwise the output of awk is plain ASCII, hence already UTF-8 compliant, and no further iconv step is needed.
The first option is to transcode the input data on the fly, so that the data matches the UTF-8 regular expressions:
iconv -f WINDOWS-1252 t.html | awk '{
gsub(/charset=iso-8859-1/, "charset=UTF-8")
gsub(/\047/, "\\'")
gsub(/\*/, "\\*")
gsub(/–/, "\\–")
gsub(/—/, "\\—")
gsub(/§/, "\\§")
gsub(/«/, "\\«")
gsub(/»/, "\\»")
gsub(/¿/, "\\¿")
gsub(/Á/, "\\Á")
print
}'
Alternatively, you can transcode the awk program itself instead of the data (the program may want to have fun too), using process substitution:
awk -f <(iconv -t WINDOWS-1252 <<'EOS'
{
gsub(/charset=iso-8859-1/, "charset=UTF-8")
gsub(/'/, "\\'")
gsub(/\*/, "\\*")
gsub(/–/, "\\–")
gsub(/—/, "\\—")
gsub(/§/, "\\§")
gsub(/«/, "\\«")
gsub(/»/, "\\»")
gsub(/¿/, "\\¿")
gsub(/Á/, "\\Á")
print
}
EOS
) t.html
Yet another option is to save the awk program in a file encoded in Windows-1252, converted with your favorite tool.
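For example, with iconv itself. This is a sketch; replace.awk is a hypothetical script file saved as UTF-8, and replace-1252.awk is the name chosen here for the converted copy:

```shell
# one-time conversion of the UTF-8 awk source to Windows-1252,
# so that its regexes use the same single-byte encoding as the data
iconv -f UTF-8 -t WINDOWS-1252 replace.awk > replace-1252.awk
awk -f replace-1252.awk t.html
```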
You can also switch your terminal's character encoding to Windows-1252, in case you type or paste the awk command directly in a terminal. Note that this is different from setting the locale (LC_CTYPE); I'm not aware of a way to do this programmatically, so if somebody knows, feel free to contribute.
Finally, you can keep the program pure ASCII by writing the special characters as octal escape sequences in the regular expressions, which sounds like good practice anyway:
awk '{
gsub(/charset=iso-8859-1/, "charset=UTF-8")
gsub(/\047/, "\\'")
gsub(/\*/, "\\*")
gsub(/\226/, "\\–")
gsub(/\227/, "\\—")
gsub(/\247/, "\\§")
gsub(/\253/, "\\«")
gsub(/\273/, "\\»")
gsub(/\277/, "\\¿")
gsub(/\301/, "\\Á")
print
}' t.html
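To find the right octal value for each character, you can ask iconv and od. For example for § (0xA7 = octal 247) and the en dash (0x96 = octal 226) in Windows-1252:

```shell
# byte value of '§' in Windows-1252, printed in octal
printf '§' | iconv -f UTF-8 -t WINDOWS-1252 | od -An -to1

# same for the en dash '–'
printf '–' | iconv -f UTF-8 -t WINDOWS-1252 | od -An -to1
```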