(Minor edit: the point of this question is not about the algorithm itself. I am fully aware of the fast iterative solutions using 3 local variables or a 3-element array; in fact, I deliberately kept both tests at the lowest complexity I could think of. What I'd like to know is whether implementing an algorithm identical to the recursive one, attacking the same problem with a custom stack and iteration, can improve performance!)
When we learn programming in school, we are usually told that iteration is generally more efficient than recursion, unless recursion offers a particularly elegant way to solve the problem.
So recently I decided to put this to a simple test. Since function calls are handled by the call stack under the hood, it is possible to implement a custom stack to hold the needed local variables and convert a recursive implementation into an iterative one. Below are my recursion-based Fibonacci number calculator and a supposedly equivalent algorithm using iteration, both written in C.
Recursive implementation (fibon_recu):
uint64_t calls = 0;

/* Calculate Fibonacci number using recursive algorithm. */
uint64_t fibonacci(uint8_t idx)
{
    calls++;
    return (idx <= 1) ? idx : (fibonacci(idx - 1) + fibonacci(idx - 2));
}
Iterative implementation (fibon_iter):
#define ARR_STACK_SIZE 256  /* size of the simulated stack; not shown in the original snippet */

uint64_t loop_count;

/* Calculate Fibonacci number using stack-based method derived from recursive algorithm. */
uint64_t fibonacci(uint8_t idx)
{
    uint64_t ret = 0;
    uint8_t stack_val[ARR_STACK_SIZE], cache;
    uint16_t stack_size;

    loop_count = 0;

    // Push index onto simulated stack
    stack_size = 1;
    *stack_val = idx;

    while (stack_size)
    {
        // Pop simulated stack top
        stack_size -= 1;
        cache = *(stack_val + stack_size);

        if (cache > 1)
        {
            // Push <index - 1> and <index - 2> onto simulated stack
            *(stack_val + stack_size) = cache - 1;
            *(stack_val + stack_size + 1) = cache - 2;
            stack_size += 2;
        }
        else
        {
            ret += cache;
        }
        loop_count++;
    }
    return ret;
}
It is worth mentioning that templatetypedef's answer says:
Depending on the implementation of the stack, that stack might end up needing to dynamically allocate memory on the heap, which is typically more expensive than setting up and tearing down stack frames. As always, if you have two algorithms and want to compare their empirical performance, it's probably best to just run the two against one another and see which one wins.
This was indeed what I observed on my test device, which is why I decided to simulate the stack with a static array: the array itself lives in the stack frame rather than on the heap. In that setup, accessing variables on the heap degraded performance by roughly 20-30x on my device (the corresponding figure is not shown here).
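To see that heap-allocation cost directly, the same stack-based algorithm can be run with a malloc'd buffer in place of the static array. A minimal sketch, assuming a hypothetical fibonacci_heap_stack name and an ARR_STACK_SIZE of 256 (large enough for any uint8_t input, since the simulated stack depth stays well below the initial index):

```c
#include <stdint.h>
#include <stdlib.h>

#define ARR_STACK_SIZE 256

/* Same sum-of-leaves algorithm as the iterative version, but the
   simulated stack is heap-allocated instead of a local array, so the
   malloc/free overhead and heap access cost can be timed separately. */
uint64_t fibonacci_heap_stack(uint8_t idx)
{
    uint64_t ret = 0;
    uint8_t *stack_val = malloc(ARR_STACK_SIZE);
    uint16_t stack_size = 1;

    if (!stack_val)
        return 0;

    stack_val[0] = idx;
    while (stack_size)
    {
        uint8_t cache = stack_val[--stack_size];    /* pop */
        if (cache > 1)
        {
            stack_val[stack_size++] = cache - 1;    /* push idx - 1 */
            stack_val[stack_size++] = cache - 2;    /* push idx - 2 */
        }
        else
        {
            ret += cache;
        }
    }
    free(stack_val);
    return ret;
}
```

Timing this variant against the static-array version isolates the heap-versus-stack-frame cost that templatetypedef's answer refers to.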
Nevertheless, this answer by Nathaniel Ford suggests that an iterative implementation with a custom stack can sometimes improve performance, which I find quite interesting.
Also, to keep the comparison as fair as possible, I compiled both pieces of code with the -O0 flag to disable compiler optimizations.
where fibon_recu stands for the recursive test and fibon_iter for the iterative test. Just in case this information is relevant, I put both the gcc version and the basic device information in the figure.
<Edit: Sep 7th 2021; 11:00 UTC>
Thanks to Ian Abbott for pointing out that using pointer access instead of indexing improves performance when the -O0 flag is present. With that change, the execution time of the iterative test is only slightly longer than that of the recursive test.
<Edit: Sep 7th 2021; 22:45 UTC>
Thanks to all the generous developers here for the insights and even detailed tests! I made some updates while reading the answers and marked the question as solved. Please do read all the answers below, though, as they each make different but important points!
First, Eugene pointed out that even with the -O0 flag, the function calls themselves can still be optimized, while the home-brew stack implementation is not. Hence, as Lundin implied, -O0 is actually still not a fair test.
Second, Jacon pointed out that tail recursion may not be eliminated by the compiler at -O1, while John measured a significant performance boost for the stack-based iterative test at -O1.
The assembly outputs and the results suggested by those answers are listed below:
-O1: (assembly output and timing figure)
-O2: (assembly output and timing figure)
It turns out the compiler behavior mentioned by Eugene does hold in my test environment: the -O1 flag preserves all the recursive bl calls, while the -O2 flag optimizes away the tailing bl call. The improvement observed by John can also be reproduced on my device, as shown in the results above: the -O1 flag makes iteration perform much better than recursion.
Thus, I'd guess the test results show that the complexity of the instructions and the recursive calls are competing with each other in terms of efficiency and performance. The -O0 flag gives no optimization for the home-brew stack at all, and the resulting extra complexity cancels out the advantages of iteration. The -O1 flag retains the recursive calls while optimizing the iterative implementation, granting the latter a performance boost. The -O2 flag removes the tail recursion and makes the recursive implementation run better than it does under -O1, making the competing factors apparent again.
Why does the seemingly equivalent iterative implementation still perform no better than the recursive one? Did I get anything obviously wrong, or is something actually hidden in plain sight?
Thanks!
What I'd like to know about is that if implementing an identical algorithm to the recursive one attacking the same problem in custom stacks and iteration can improve the performance!
This assumes the algorithms are actually identical. Your stack-based algorithm isn't identical to the recursive version - for one thing, you're performing a lot more assignments and tests in the stack-based version than the recursive version, which kills any gains you make by not calling the function recursively. Secondly, the recursive version is (most likely) passing arguments via registers, not the stack.
On my system, turning on level 1 optimization significantly improves the runtime performance of the stack-based version relative to the recursive version (to where the stack-based version takes much less time than the recursive version). Going from -O0 to -O1 speeds up the recursive version by ~26% and the stack-based version by ~67%.
And for what it's worth, this is a naive iterative implementation of fibonacci:
unsigned long fib_it( int n )
{
    /**
     * Circular buffer to store the
     * intermediate values
     */
    unsigned long f[3] = { 0ul, 1ul, 1ul };

    if ( n <= 2 )
        return f[n];

    int i;
    for ( i = 3; i <= n; i++ )
        f[i % 3] = f[(i-1) % 3] + f[(i-2) % 3];

    return f[(i+2) % 3];
}
which absolutely spanks the recursive and stack-based versions in terms of both run time and memory usage as n gets large (you basically just have to worry about overflow). As others have shown there are faster algorithms that minimize the number of intermediate calculations, and there is a closed-form version that's faster yet.
If a recursive algorithm is executing too slowly for you, you don't speed it up by re-implementing the recursiveness of it with your own stacks - you speed it up by using a non-recursive approach. As a definition of the Fibonacci function, Fn = Fn-1 + Fn-2 is compact and easy to understand, but as an implementation of the function it is horribly inefficient because it computes the same values multiple times. The iterative version above computes each value exactly once.
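Another way to remove the repeated work while keeping the recursive shape is memoization: cache each F(i) the first time it is computed. A sketch (fib_memo and MEMO_SIZE are hypothetical names; using 0 as the "not yet computed" marker is safe here because F(i) > 0 for i >= 1 and the i <= 1 cases are handled before the cache lookup):

```c
#include <stdint.h>

#define MEMO_SIZE 94  /* F(93) is the largest Fibonacci number fitting in uint64_t */

static uint64_t memo[MEMO_SIZE];  /* zero-initialized: 0 means "not computed" */

/* Memoized recursion: same definition as the naive recursive version,
   but each F(i) is computed exactly once, so the call count drops
   from exponential to linear in idx. */
uint64_t fib_memo(uint8_t idx)
{
    if (idx <= 1)
        return idx;
    if (memo[idx] == 0)
        memo[idx] = fib_memo(idx - 1) + fib_memo(idx - 2);
    return memo[idx];
}
```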
What makes recursive functions like quicksort fast is that they dramatically reduce the size of the problem space with each call. Recursive fibonacci doesn't do that.
Edit
Some numbers. My test harness takes two parameters: the Nth number to calculate, and the algorithm to use — s for the stack-based version, r for the recursive one, and i for my iterative version above. For small N the differences aren't too remarkable (compiled with -O1):
$ ./fib -n 5 -r
i F(i) clocks seconds
- ---- ------ -------
5 5 1 0.000001
$ ./fib -n 5 -s
i F(i) clocks seconds
- ---- ------ -------
5 5 2 0.000002
$ ./fib -n 5 -i
i F(i) clocks seconds
- ---- ------ -------
5 5 2 0.000002
However, as N gets larger, the differences show up:
$ ./fib -n 40 -r
i F(i) clocks seconds
- ---- ------ -------
40 102334155 543262 0.543262
$ ./fib -n 40 -s
i F(i) clocks seconds
- ---- ------ -------
40 102334155 330824 0.330824
$ ./fib -n 40 -i
i F(i) clocks seconds
- ---- ------ -------
40 102334155 2 0.000002
Once N gets up around 50, the run times of the recursive and stack-based versions become prohibitively long, while the run time of the iterative version doesn't change:
$ ./fib -n 50 -r
i F(i) clocks seconds
- ---- ------ -------
50 12586269025 66602759 66.602759
$ ./fib -n 50 -s
i F(i) clocks seconds
- ---- ------ -------
50 12586269025 39850376 39.850376
$ ./fib -n 50 -i
i F(i) clocks seconds
- ---- ------ -------
50 12586269025 2 0.000002
Now, this is how things behave on my system, and it's only a zeroth-order profile using very crude instrumentation (a couple of calls to clock).