I need to check at least 20k URLs to see whether each one is up, and save some data in a database.
I already know how to check whether a URL is online and how to save data in the database. But without concurrency it will take ages to check all the URLs, so what's the fastest way to check thousands of URLs?
I am following this tutorial: https://realpython.com/python-concurrency/ and it seems that the "CPU-Bound multiprocessing Version" is the fastest approach, but I want to know whether it really is the fastest or whether there are better options.
Edit:
Based on the replies, I will update the post with a comparison of multiprocessing and multithreading.
Example 1: Print "Hello!" 40 times
Threading
Multiprocessing with 8 cores:
With 8 threads, threading performs better than multiprocessing.
Example 2: the problem posed in my question
After several tests, I found that threading is faster once you use more than 12 threads. For example, if you want to test 40 URLs and you use threading with 40 threads, it will be 50% faster than multiprocessing with 8 cores.
Thanks for your help
It is incorrect to say that multiprocessing is always the best choice; multiprocessing is best only for heavy computations!
For work that does not require heavy computation but only input/output operations, such as database requests or requests to a remote web app API, the best choice is the threading module. Threading can be faster than multiprocessing because multiprocessing needs to serialize data to send it to a child process, whereas threads share the same memory space.
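To make the I/O-bound argument concrete, here is a minimal sketch that uses `time.sleep` as a stand-in for a network request (an assumption; a real request would behave similarly because the thread mostly waits). The names `fake_io`, `run_sequential`, and `run_threaded` are hypothetical helpers for illustration only.

```python
import threading
import time


def fake_io(delay=0.2):
    """Stand-in for a network call: the thread simply waits."""
    time.sleep(delay)


def run_sequential(n, delay=0.2):
    """Run n simulated requests one after another; return elapsed seconds."""
    start = time.perf_counter()
    for _ in range(n):
        fake_io(delay)
    return time.perf_counter() - start


def run_threaded(n, delay=0.2):
    """Run n simulated requests in n threads; return elapsed seconds."""
    start = time.perf_counter()
    threads = [threading.Thread(target=fake_io, args=(delay,)) for _ in range(n)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return time.perf_counter() - start
```

Calling `run_sequential(8)` should take roughly eight times the delay, while `run_threaded(8)` should take roughly one delay, because waiting threads do not block each other.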
A typical approach in this case is to create an input queue.Queue, put the tasks in it (the URLs, in your case), and create several workers that take tasks from the queue:
import threading as thr
from queue import Queue


def work(input_q):
    """Take tasks from input_q and print a result (or return one, with some code changes)."""
    while True:
        item = input_q.get()
        if item == "STOP":
            break
        # else do some work here
        print("some result")


if __name__ == "__main__":
    input_q = Queue()
    urls = [...]
    threads_number = 8
    workers = [thr.Thread(target=work, args=(input_q,)) for i in range(threads_number)]

    # start workers here (note the parentheses: w.start alone does nothing)
    for w in workers:
        w.start()

    # start delivering tasks to workers
    for task in urls:
        input_q.put(task)

    # "poison pill" for each worker to stop it:
    for i in range(threads_number):
        input_q.put("STOP")

    # join all workers to the main thread here:
    for w in workers:
        w.join()

    # show that the main thread can continue
    print("Job is done.")