I'm trying to write a script that downloads a website's source code, saves all the HTML in a file, and then extracts some information.
For the moment I've done the first part: I've saved all the HTML into a text file.
Now I have to extract the relevant information and then save it in another text file.
But I'm having problems with encoding, and I also don't know very well how to extract the text in Python.
Parsing a website:
import urllib.request

# file name to store the data
file_name = r'D:\scripts\datos.txt'

# I want to get the text that goes after this tag and before this other one
tag_starts_with = '<p class="item-description">'
tag_ends_with = '</p>'

# I get the website code and save it into a text file
with urllib.request.urlopen("http://www.website.com/") as response, open(file_name, 'wb') as out_file:
    data = response.read()
    out_file.write(data)

print(out_file)
# First question: how can I print the file? It gives me an error; I can't print bytes.
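On the first question: data is a bytes object, so it has to be decoded to a string before it can be printed as text. A minimal sketch, assuming the page is UTF-8 encoded (the actual charset is whatever the site declares in its response headers):

text = data.decode("utf-8")  # bytes -> str; "utf-8" is an assumption, check the page's charset
print(text)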
The file is now full of HTML text, so I want to open it and process it:
file_for_results = open(r'D:\scripts\datos.txt', encoding="utf8")
Extract information from the file:
Second question: how do I take a substring of the lines in the file and get the text between <p class="item-description"> and </p>, so I can store it in file_for_results?
Here is the pseudocode that I haven't been able to turn into code:
for line in file_to_filter:
    if line contains tag_starts_with:
        copy into file_for_results until you find </p>
Thanks in advance for your help.
I am assuming this is an assignment of some sort where you need to parse the HTML with a given algorithm; if not, just use Beautiful Soup, as in the sketch below.
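A minimal Beautiful Soup sketch for the non-assignment case, assuming the bs4 package is installed (the file names are placeholders, and the class name comes from the question):

from bs4 import BeautifulSoup

# parse the saved HTML and write every matching description to the output file
with open("file.html", encoding="utf8") as f:
    soup = BeautifulSoup(f, "html.parser")

with open("text_output", "w", encoding="utf8") as out_file:
    for p in soup.find_all("p", class_="item-description"):
        print(p.get_text(strip=True), file=out_file)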
Otherwise, the pseudocode actually translates to Python quite easily:
file_to_filter = open("file.html", 'r', encoding="utf8")
out_file = open("text_output", 'w')

for line in file_to_filter:
    if tag_starts_with in line:
        print(line, end='', file=out_file)  # store the matching line in another file
    if tag_ends_with in line:
        break
And of course you need to close the files, make sure you remove the tags, and so on, but this is roughly what your code should look like given this algorithm.
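For illustration, a fuller sketch along the same lines that handles both of those points, assuming the description may span several lines (the file names are the same placeholders as above):

tag_starts_with = '<p class="item-description">'  # the tags from the question
tag_ends_with = '</p>'

with open("file.html", encoding="utf8") as file_to_filter, \
        open("text_output", 'w', encoding="utf8") as out_file:
    copying = False
    for line in file_to_filter:
        if tag_starts_with in line:
            copying = True  # start copying at the opening tag
        if copying:
            # strip the tags themselves, keep the text between them
            cleaned = line.replace(tag_starts_with, "").replace(tag_ends_with, "")
            print(cleaned, end='', file=out_file)
        if copying and tag_ends_with in line:
            break  # stop after the closing tag

Like the original loop, this stops after the first description; to collect all of them, reset copying to False at the closing tag instead of breaking.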