I want to replace the ISO-8859-1 characters in the file below so that it becomes valid UTF-8.
<HTML>
<HEAD>
<meta http-equiv="Content-Type" content="text/html; charset=iso-8859-1" />
</HEAD>
<BODY>
<A NAME="top"></A>
<TABLE border=0 width=609 cellspacing=0 cellpadding=0>
<TR><td rowspan=2><img src="http://www.example.com" width=10></td>
<TD width=609 valign=top>
<p>'</p>
<p>*</p>
<p>–</p>
<p>—</p>
<p>§</p>
<p>«</p>
<p>»</p>
<p>¿</p>
<p>Á</p>
</TD>
</TR>
</TABLE>
</body>
</html>
Doing some research I found that the issue is related to the locale settings, and I was able to build this awk program, but it only replaces the first two characters (' and *):
LC_ALL=ISO_8859-1 awk '{
gsub(/charset=iso-8859-1/, "charset=UTF-8" , $0)
gsub(/\047/, "\\'" , $0)
gsub(/*/, "\\*" , $0)
gsub(/–/, "\\–" , $0)
gsub(/—/, "\\—" , $0)
gsub(/§/, "\\§" , $0)
gsub(/«/, "\\«" , $0)
gsub(/»/, "\\»" , $0)
gsub(/¿/, "\\¿" , $0)
gsub(/Á/, "\\Á" , $0)
print
}' t.html | iconv -f ISO_8859-1 -t UTF-8
This is the current output (partial; only the lines affected by the program are shown):
<p>'</p>
<p>*</p>
<p>-</p>
<p>-</p>
<p>§</p>
<p>«</p>
<p>»</p>
<p>¿</p>
<p>Á</p>
and the expected output is:
<p>*</p>
<p>–</p>
<p>—</p>
<p>§</p>
<p>«</p>
<p>»</p>
<p>¿</p>
<p>Á</p>
I've already tried similar code using sed, but ran into the same issue.
How can I fix this?
My locale config (Ubuntu 18.04.1 LTS):
$ locale
LANG=C.UTF-8
LANGUAGE=
LC_CTYPE="C.UTF-8"
LC_NUMERIC="C.UTF-8"
LC_TIME="C.UTF-8"
LC_COLLATE="C.UTF-8"
LC_MONETARY="C.UTF-8"
LC_MESSAGES="C.UTF-8"
LC_PAPER="C.UTF-8"
LC_NAME="C.UTF-8"
LC_ADDRESS="C.UTF-8"
LC_TELEPHONE="C.UTF-8"
LC_MEASUREMENT="C.UTF-8"
LC_IDENTIFICATION="C.UTF-8"
LC_ALL=
This issue is likely due to an encoding mismatch between the input file and the awk script.
First, please note that there is probably a (very common) confusion between ISO-8859-1 and Windows-1252 here. The HTML sample in the original post contains em/en dash characters, which are not part of the ISO-8859-1 layout, so the file certainly uses another encoding. It is probably Windows-1252 (a superset of ISO-8859-1 that includes the dash characters), since the OP reported using Ubuntu through the Windows subsystem layer.
I'll then assume that the HTML input file is indeed encoded in Windows-1252, so that non-ASCII characters (code points ≥ 128) each use a single byte.
If the awk program is loaded from a file encoded in UTF-8, or typed directly into a terminal window that uses the UTF-8 encoding, then the regular expressions and literal strings embedded in the program are also encoded in UTF-8, so non-ASCII characters use multiple bytes.
For example, the character § (code point 167 = 0xA7) is represented by the single byte A7 in Windows-1252 and by the byte sequence C2 A7 in UTF-8. If you use gsub(/§/, "S") in your UTF-8 encoded awk program, awk looks for the sequence C2 A7 in the input file, which only contains A7. It will not match, unless you are (un)lucky enough to have a character Â (code point 194 = 0xC2) sitting just before your §.
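You can check these byte sequences yourself. A quick sketch, assuming iconv and od are available and the terminal (or script file) is UTF-8 encoded:

```shell
# '§' as typed in a UTF-8 terminal: two bytes, c2 a7
printf '§' | od -An -tx1

# the same character transcoded to Windows-1252: the single byte a7
printf '§' | iconv -f UTF-8 -t WINDOWS-1252 | od -An -tx1
```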
Changing the locale does not help here, because it only tells awk how to parse its input (both data and program), whereas what you need is to transcode either the data or the regular expressions. For that you would have to specify the locale of the data independently of the locale of the program, which is not supported.
So, assuming that your system is set up with a UTF-8 locale and that your awk script uses this locale (no matter whether it is loaded from a file or typed in a terminal), here are several methods you can use to align the input file and the regular expressions on the same encoding, so that gsub works as expected.
Please note that these suggestions stick to your first awk command, since it is the source of the issue. The final pipe to iconv is needed only if you intentionally do not transform all the special characters you may have in the input; otherwise the output of awk is plain ASCII, hence already UTF-8 compliant, and no further iconv step is needed.
The first option is to transcode the input data on the fly, so that the data matches the UTF-8 regular expressions:
iconv -f WINDOWS-1252 t.html | awk '{
gsub(/charset=iso-8859-1/, "charset=UTF-8")
gsub(/\047/, "\\'")
gsub(/\*/, "\\*")
gsub(/–/, "\\–")
gsub(/—/, "\\—")
gsub(/§/, "\\§")
gsub(/«/, "\\«")
gsub(/»/, "\\»")
gsub(/¿/, "\\¿")
gsub(/Á/, "\\Á")
print
}'
Alternatively, you can transcode the awk program itself instead of the data (the program may want to have fun too), using process substitution:
awk -f <(iconv -t WINDOWS-1252 <<'EOS'
{
gsub(/charset=iso-8859-1/, "charset=UTF-8")
gsub(/'/, "\\'")
gsub(/\*/, "\\*")
gsub(/–/, "\\–")
gsub(/—/, "\\—")
gsub(/§/, "\\§")
gsub(/«/, "\\«")
gsub(/»/, "\\»")
gsub(/¿/, "\\¿")
gsub(/Á/, "\\Á")
print
}
EOS
) t.html
Yet another option is to save the awk program in a file encoded in Windows-1252, converted with your favorite tool.
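For example, with iconv itself. This is a sketch; replace.awk is a hypothetical script file saved as UTF-8, and replace-1252.awk is the name chosen here for the converted copy:

```shell
# one-time conversion of the UTF-8 awk source to Windows-1252,
# so that its regexes use the same single-byte encoding as the data
iconv -f UTF-8 -t WINDOWS-1252 replace.awk > replace-1252.awk
awk -f replace-1252.awk t.html
```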
You can also switch your terminal's character encoding to Windows-1252, in case you type or paste the awk command directly in a terminal. Note that this is different from setting the locale (LC_CTYPE); I'm not aware of a way to do this programmatically, so if somebody knows, feel free to contribute.
Finally, you can keep the program pure ASCII by writing the special characters as octal escape sequences in the regular expressions, which sounds like good practice anyway:
awk '{
gsub(/charset=iso-8859-1/, "charset=UTF-8")
gsub(/\047/, "\\'")
gsub(/\*/, "\\*")
gsub(/\226/, "\\–")
gsub(/\227/, "\\—")
gsub(/\247/, "\\§")
gsub(/\253/, "\\«")
gsub(/\273/, "\\»")
gsub(/\277/, "\\¿")
gsub(/\301/, "\\Á")
print
}' t.html
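To find the right octal value for each character, you can ask iconv and od. For example for § (0xA7 = octal 247) and the en dash (0x96 = octal 226) in Windows-1252:

```shell
# byte value of '§' in Windows-1252, printed in octal
printf '§' | iconv -f UTF-8 -t WINDOWS-1252 | od -An -to1

# same for the en dash '–'
printf '–' | iconv -f UTF-8 -t WINDOWS-1252 | od -An -to1
```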