How can I extract text from a bytes file using Python?

Juan Sepulveda

I'm trying to write a script that downloads the HTML of a website, saves it all to a file, and then extracts some information from it.

So far I've done the first part: I've saved all the HTML into a text file.

Now I have to extract the relevant information and then save it in another text file.

But I'm having problems with encoding, and I also don't know very well how to extract the text in Python.

Parsing a website:


import urllib.request

... file name to store the data

file_name = r'D:\scripts\datos.txt'

I want to get the text that goes after this tag and before this other one

tag_starts_with = '<p class="item-description">'
tag_ends_with = '</p>'

I get the website code and I save it into a text file

with urllib.request.urlopen("http://www.website.com/") as response, open(file_name, 'wb') as out_file:
    data = response.read() 
    out_file.write(data)

print(out_file)  # First question: how can I print the file? It gives me an error, I can't print bytes

The file is now full of HTML text, so I want to open it and process it

file_for_results = open(r'D:\scripts\datos.txt',encoding="utf8")

Extract information from the file

Second question: how do I take a substring of the lines in the file and get the text between <p class="item-description"> and </p>, so I can store it in file_for_results?

Here is the pseudocode that I'm not able to turn into real code.

for line in file_to_filter:
    if line contains word_starts_with
      copy in file_for_results until you find </p>

Thanks in advance for your help.

Jacob H

I am assuming this is an assignment of some sort where you need to parse the HTML with a given algorithm. If not, just use Beautiful Soup.
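If installing Beautiful Soup is not an option, the standard library's html.parser module can do a similar job. What follows is only a minimal sketch of that alternative, not Beautiful Soup itself: the class name ItemDescriptionParser and the sample HTML are made up for illustration, and the class attribute is matched exactly (a paragraph with several classes would be skipped).

```python
from html.parser import HTMLParser


class ItemDescriptionParser(HTMLParser):
    """Collect the text of every <p class="item-description"> element."""

    def __init__(self):
        super().__init__()
        self.inside = False      # True while we are inside a matching <p>
        self.current = []        # text pieces of the paragraph being read
        self.descriptions = []   # finished paragraphs

    def handle_starttag(self, tag, attrs):
        # attrs is a list of (name, value) pairs
        if tag == "p" and ("class", "item-description") in attrs:
            self.inside = True
            self.current = []

    def handle_endtag(self, tag):
        if tag == "p" and self.inside:
            self.inside = False
            self.descriptions.append("".join(self.current).strip())

    def handle_data(self, data):
        if self.inside:
            self.current.append(data)


# Made-up sample HTML, standing in for the page you downloaded
sample_html = """
<div>
  <p class="item-description">First item,
     spread over two lines</p>
  <p class="other">Not wanted</p>
  <p class="item-description">Second <b>item</b></p>
</div>
"""

parser = ItemDescriptionParser()
parser.feed(sample_html)
parser.close()
print(parser.descriptions)
```

Because the parser works on tags rather than on lines, it does not matter whether a paragraph sits on one line or is spread over several.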

The pseudocode actually translates to Python code quite easily:

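tag_starts_with = '<p class="item-description">'
tag_ends_with = '</p>'
copying = False  # True while we are between the two tags

with open("file.html", 'r', encoding="utf8") as file_to_filter, \
        open("text_output", 'w', encoding="utf8") as out_file:
    for line in file_to_filter:
        if tag_starts_with in line:
            copying = True
        if copying:
            print(line, end='', file=out_file)  # Store data in another file
            if tag_ends_with in line:
                copying = False  # Stop copying until the next opening tag

The with statement closes both files for you. You still need to remove the tags themselves from the output and so on, but this is roughly what your code should be given this algorithm.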
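On the encoding problem and the first question in the post: response.read() returns bytes, and print(out_file) prints the file object rather than its contents. Decoding the bytes gives you a str that you can print and search. Here is a rough sketch that also drops the tags and copes with several paragraphs on one line. It assumes the page is UTF-8 (check the page's charset if it is not), and extract_between is a hypothetical helper name, not a library function.

```python
def extract_between(text, start, end):
    """Return every piece of text found between `start` and `end`."""
    results = []
    pos = 0
    while True:
        begin = text.find(start, pos)
        if begin == -1:                # no more opening tags
            break
        begin += len(start)            # skip past the opening tag
        finish = text.find(end, begin)
        if finish == -1:               # opening tag without a closing one
            break
        results.append(text[begin:finish].strip())
        pos = finish + len(end)        # continue after the closing tag
    return results


# Stand-in for `data = response.read()`, which returns bytes
data = b'<p class="item-description">Blue widget</p><p>skip me</p>'

html = data.decode("utf8")   # assumes the page is UTF-8
print(html)                  # a str prints without trouble

for text in extract_between(html, '<p class="item-description">', '</p>'):
    print(text)
```

To store the results, open the output file in text mode with encoding="utf8" and write each extracted string to it.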

