Simple division of work across threads does not reduce the time taken

Posted by code_fodder on Dev


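I have been trying to improve the computation time of a project by splitting the work into tasks/threads, and it has not been working out very well. So I decided to make a simple test project to see whether I could get it working in a very simple case, and that did not work as I expected either.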

What I attempted to do was:

Do a task X times in one thread and check the time taken.
Do a task X / Y times in each of Y threads and check the time taken.

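So if 1 thread takes T seconds for 100'000'000 iterations of "work", then I would expect:

2 threads doing 50'000'000 iterations each to take ~ T / 2 seconds
3 threads doing 33'333'333 iterations each to take ~ T / 3 seconds

and so on, until I hit some threading limit (the number of cores or whatever).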
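The expectation above can be written down as a small calculation. This is only a sketch of the ideal case: `iterations_per_thread`, `ideal_seconds` and the `num_cores` parameter are illustrative names, and the model assumes fully independent work with no thread start-up cost.

```cpp
#include <algorithm>
#include <cstdint>

// Iterations each thread receives when the work is split evenly.
// Integer division drops any remainder, e.g. 100'000'000 / 3 = 33'333'333.
std::uint32_t iterations_per_thread(std::uint32_t total_work, std::uint32_t num_threads)
{
    return total_work / num_threads;
}

// Ideal wall-clock time: T / min(threads, cores).
// Assumption: threads never wait on each other and cost nothing to start.
double ideal_seconds(double single_thread_seconds, std::uint32_t num_threads, std::uint32_t num_cores)
{
    const std::uint32_t parallel = std::min(num_threads, num_cores);
    return single_thread_seconds / static_cast<double>(parallel);
}
```

Under this model, adding threads beyond the core count gives no further speed-up, which is the "threading limit" mentioned above.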

So I wrote the code and tested it on my 8-core system (AMD Ryzen) with more than 16 GB of RAM, which was otherwise idle at the time.

1 thread took: ~6.5s
2 threads took: ~6.7s
3 threads took: ~13.3s
8 threads took: ~16.2s

So clearly something is not right here!

I ported the code to Godbolt and I see similar results. Godbolt only allows 3 threads, and for 1, 2 or 3 threads the run takes ~8s (this varies by about 1s). Here is the live Godbolt code: https://godbolt.org/z/6eWKWr

Finally, here is the code for reference:

#include <chrono>
#include <cmath>
#include <cstdint>
#include <cstdlib>
#include <iostream>
#include <thread>
#include <vector>

#define randf() (((double) rand()) / ((double) (RAND_MAX)))

void thread_func(uint32_t iterations, uint32_t thread_id)
{
    // Print the thread id / workload
    std::cout << "starting thread: " << thread_id << " workload: " << iterations << std::endl;
    // Get the start time
    auto start = std::chrono::high_resolution_clock::now();
    // do some work for the required number of iterations
    for (auto i = 0u; i < iterations; i++)
    {
        double value = randf();
        double calc = std::atan(value);
        (void) calc;
    }
    // Get the time taken
    auto total_time = std::chrono::high_resolution_clock::now() - start;
    // Print it out
    std::cout << "thread: " << thread_id << " finished after: "
              << std::chrono::duration_cast<std::chrono::milliseconds>(total_time).count()
              << "ms" << std::endl;
}

int main()
{
    // Note: these numbers vary by about 1s, probably due to Godbolt server load (?)
    // 1 Threads takes: ~8s
    // 2 Threads takes: ~8s
    // 3 Threads takes: ~8s
    uint32_t num_threads = 3; // Max 3 in godbolt
    uint32_t total_work = 100'000'000;

    // Seed rand
    std::srand(static_cast<unsigned int>(std::chrono::steady_clock::now().time_since_epoch().count()));

    // Store the start time
    auto overall_start = std::chrono::high_resolution_clock::now();

    // Start all the threads doing work
    std::vector<std::thread> task_list;
    for (uint32_t thread_id = 1; thread_id <= num_threads; thread_id++)
    {
        task_list.emplace_back(std::thread([=](){ thread_func(total_work / num_threads, thread_id); }));
    }

    // Wait for the threads to finish
    for (auto &task : task_list)
    {
        task.join();
    }

    // Get the end time and print it
    auto overall_total_time = std::chrono::high_resolution_clock::now() - overall_start;
    std::cout << "\n==========================\n"
              << "thread overall_total_time time: "
              << std::chrono::duration_cast<std::chrono::milliseconds>(overall_total_time).count()
              << "ms" << std::endl;
    return 0;
}
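One variation that may help narrow this down is to take `rand()` out of the loop, since it keeps a single hidden state shared by every thread in the process. The sketch below is an assumption about where to look, not a measured fix: `local_rng_work` and its `seed` parameter are hypothetical names that do not appear in the code above, and `std::mt19937` stands in for `rand()` so that each thread owns its generator.

```cpp
#include <cmath>
#include <cstdint>
#include <random>

// Hypothetical variant of the loop in thread_func: each call owns its
// generator, so no random-number state is shared between threads.
// The sum is returned so the compiler cannot discard the atan work.
double local_rng_work(std::uint32_t iterations, std::uint32_t seed)
{
    std::mt19937 engine(seed);                             // per-call engine
    std::uniform_real_distribution<double> dist(0.0, 1.0); // values in [0, 1)
    double sum = 0.0;
    for (std::uint32_t i = 0; i < iterations; ++i)
    {
        sum += std::atan(dist(engine));
    }
    return sum;
}
```

Each thread would call it with its own seed, for example `local_rng_work(total_work / num_threads, thread_id)`, in place of the loop in `thread_func`. If the timings then scale with the thread count, the shared generator was the likely bottleneck; if not, the cause lies elsewhere.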

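Note: I tried using std::async as well, with no difference (not that I was expecting one). I also tried compiling in release mode, which made no difference.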

I have read questions such as "Why does using more threads make it slower than using less threads", and I can't see an obvious (to me) bottleneck: