What's the fastest way to check thousands of urls?

redunicorn

I need to check at least 20k urls to check if the url is up and save some data in a database.

I already know how to check if an url is online and how to save some data in the database. But without concurrency it will take ages to check all urls so whats the fastest way to check thousands of urls?

I am following this tutorial: https://realpython.com/python-concurrency/ and it seems that the "CPU-Bound multiprocessing Version" is the fastest way to do, but I want to know if that it is fastest way or if there are better options.

Edit:

Based on the replies I will update the post comparing Multiprocessing and Multithreading

Example 1: Print "Hello!" 40 times

Threading

  • With 1 thread: 20.152419090270996 seconds
  • With 2 threads: 10.061403036117554 seconds
  • With 4 threads: 5.040558815002441 seconds
  • With 8 threads: 2.515489101409912 seconds

Multiprocessing with 8 cores:

  • It took 3.1343798637390137 seconds

If you use 8 threads it will be better the threading

Example 2, the problem propounded in my question:

After several tests if you use more than 12 threads the threading will be faster. For example, if you want to test 40 urls and you use threading with 40 threads it will be 50% faster than multiprocessing with 8 cores

Thanks for your help

Artiom Kozyrev

To say that multiprocessing is always the best choice is incorrect, multiprocessing is best only for heavy computations!

The best choice for actions which do not require heavy computations, but only IN/OUT operations like database requets or requests of remote webapp api, is module threading. Threading can be faster than multiprocessing since multiprocessing need to serialize data to send it to child process, meanwhile trheads use the same memory stack.

Threading module

Typical activity in the case is to create input queue.Queue and put task (urls in you case in it) and create several workers to take tasks from the Queue:

import threading as thr
from queue import Queue


def work(input_q):
    """the function take task from input_q and print or return with some code changes (if you want)"""
    while True:
        item = input_q.get()
        if item == "STOP":
            break

        # else do some work here
        print("some result")


if __name__ == "__main__":
    input_q = Queue()
    urls = [...]
    threads_number = 8
    workers = [thr.Thread(target=work, args=(input_q,),) for i in range(threads_number)]
    # start workers here
    for w in workers:
        w.start

    # start delivering tasks to workers 
    for task in urls:
        input_q.put(task)

    # "poison pillow" for all workers to stop them:

    for i in range(threads_number):
        input_q.put("STOP")

    # join all workers to main thread here:

    for w in workers:
        w.join

    # show that main thread can continue

    print("Job is done.")

Collected from the Internet

Please contact [email protected] to delete if infringement.

edited at
0

Comments

0 comments
Login to comment

Related

Fastest way to check thousands of gzip files

What's the fastest way to read images from urls?

What's the fastest way to check if a value exists in a MySQL JSON array?

What's the fastest way to check if a username is available with a huge dataset?

What is the fastest way to check a pandas dataframe for elements?

What's the fastest way to reinitialize a vector?

What's the fastest way to acces a Pandas DataFrame?

What's the fastest way to redirect to a different URL?

What's the fastest way to threshold a numpy array?

What's the fastest way to obtain inverse of a float

What's the fastest way to square a number in JavaScript?

What is the best and the fastest way to delete large directory containing thousands of files (in ubuntu)

Fastest way to apply javascript to html for thousands of documents

Fastest way to grep through thousands of gz files?

What is the fastest way to check the leading characters in a char array?

What is the fastest way to check if a class has a function defined?

What is the fastest way to check whether a directory is empty in Python

What Is The Fastest Way To Check For a Spacebar Press using JQuery?

What is the fastest and efficient way to check Deep Equal for two java objects?

What is the fastest way to check for oddness & perform divisions by powers of 2

What is the fastest way to check if a string contains only white spaces in JavaScript?

What is the fastest way to perform a HTTP request and check for 404?

What is the fastest way to check an internet connection using C or C++?

What is the fastest and most efficient way to check database for new entry?

What is the fastest way to check bits in variable using bitwise operations?

What is the fastest way to check if a matrix is an element of an 3d array?

What is the fastest way to check if value is exists in std::map?

What is the fastest way to 302 a link to it's final URL?

What's the best and fastest way to save Map in Android?