Links 2.21 is a fantastic text-based browser that can output formatted text from URLs:
links -dump "https://example.com/page.html" > output.txt
As is, output.txt contains only the text of each link. For example, if the HTML source has a link like this:
<a href="/some/link/example.html">Some Text</a>
then output.txt will simply have "Some Text" but nothing from the href attribute.
What I'd like is to have the link targets included in the output, for example like this:
[Some Text|https://example.com/some/link/example.html]
or anything similar. Is this possible? The browser clearly has this info because when it renders the page, the links are "clickable" (actually selectable by keys in text mode) and it correctly follows all links.
Or is there another way of converting a web page to plain text but including all the info about <a ...> tags in a structured way?
Note that I'm fully aware of the many tools that extract links from web pages and the many tools that convert web pages to text, but I have found nothing that does both at the same time.
If it is acceptable to have the link addresses listed at the end of the dump, you can do:
links -html-numbered-links 1 -dump "https://example.com/"
The result will look something like this:
Example Domain
This domain is for use in illustrative examples in documents. You may use
this domain in literature without prior coordination or asking for
permission.
[1]More information...
Links:
1. https://www.iana.org/domains/example
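If the addresses need to sit inline next to the link text, in the [Some Text|URL] form from the question, one option outside of links is a small script. The following is only a minimal sketch using Python's standard library; the names LinkTextParser and html_to_text_with_links are made up for illustration, and it does not try to reproduce the page layout that links produces.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin


class LinkTextParser(HTMLParser):
    """Collect page text, rewriting each <a href> as [text|absolute-url]."""

    SKIP = {"script", "style"}  # elements whose content is not page text

    def __init__(self, base_url):
        super().__init__(convert_charrefs=True)
        self.base_url = base_url
        self.parts = []        # finished output fragments
        self.link_href = None  # resolved href of the <a> currently open
        self.link_text = []    # text collected inside that <a>
        self.skip_depth = 0    # > 0 while inside <script> or <style>

    def handle_starttag(self, tag, attrs):
        if tag in self.SKIP:
            self.skip_depth += 1
        elif tag == "a":
            href = dict(attrs).get("href")
            if href:
                # Resolve relative links against the page URL.
                self.link_href = urljoin(self.base_url, href)
                self.link_text = []

    def handle_endtag(self, tag):
        if tag in self.SKIP and self.skip_depth:
            self.skip_depth -= 1
        elif tag == "a" and self.link_href is not None:
            # Collapse whitespace inside the link text.
            text = " ".join("".join(self.link_text).split())
            self.parts.append(f"[{text}|{self.link_href}]")
            self.link_href = None

    def handle_data(self, data):
        if self.skip_depth:
            return
        if self.link_href is not None:
            self.link_text.append(data)
        else:
            self.parts.append(data)


def html_to_text_with_links(html, base_url):
    """Return the text of `html` with links written as [text|url]."""
    parser = LinkTextParser(base_url)
    parser.feed(html)
    parser.close()  # flush any text still buffered by the parser
    return "".join(parser.parts)
```

Called with the page source and the URL it came from, for instance html_to_text_with_links(source, "https://example.com/page.html"), the relative href from the question is resolved against that URL. Whitespace and block structure are passed through as they appear in the HTML, so the result usually needs further cleanup before it resembles a links dump.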