In GHC, why is the lazy version of this small program so much faster than the loop-based variant?

jamshidh

These two programs do the same thing, but one runs 10x faster.

This takes approx. 10 seconds on my machine:

import Control.Monad
import qualified Data.ByteString as B
import qualified Data.ByteString.Lazy as BL

theValueOne=B.singleton 1

main = replicateM_ 100000000 $ B.putStr theValueOne

The second version uses lazy I/O for output. It finishes in about 1 second (as fast as C):

import qualified Data.ByteString.Lazy as BL

main = BL.putStr $ BL.pack $ replicate 100000000 1

Question: Why is the non-lazy version so slow? More importantly, how can I make it fast? (I've tried recursion, forM, and modifying the output buffer with hSetBuffering, but nothing has made a difference.)


Note: This is more than just an academic question. The non-lazy version is an extremely simplified version of an executable my company uses in production, which is also slow in the same way. It would be nearly impossible to re-architect the larger program around the analogous lazy solution.

K. A. Buhr

Update: added a possible source of the problem and a solution.

I don't think it has anything to do with lazy I/O. If you rewrite the strict I/O version to write two bytes at once:

theValueOne = B.singleton 1
main = replicateM_ 50000000 $ B.putStr (theValueOne <> theValueOne)

that halves the time. Write ten bytes at once:

theValueOne = B.singleton 1
main = replicateM_ 10000000 $ B.putStr (foldMap id (replicate 10 theValueOne))

and it's already faster than the lazy I/O version.
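Pushing that batching idea to its limit, the writes can be grouped into full 4 KB chunks. This is a sketch of my own, not from the original answer; the `chunkCounts` helper and the 4096-byte chunk size are illustrative choices:

```haskell
import Control.Monad (replicateM_)
import qualified Data.ByteString as B

-- Split a total byte count into (number of full chunks, leftover bytes).
chunkCounts :: Int -> Int -> (Int, Int)
chunkCounts total chunkSize = total `divMod` chunkSize

main :: IO ()
main = do
  let (full, rest) = chunkCounts 100000000 4096
  -- Write 4096-byte chunks instead of single bytes, amortizing the
  -- per-call overhead of B.putStr across many bytes.
  replicateM_ full $ B.putStr (B.replicate 4096 1)
  B.putStr (B.replicate rest 1)
```

Each `B.putStr` call now covers 4096 bytes, so the per-call cost is paid roughly 24 thousand times instead of 100 million.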

The issue is that a B.hPutStr call carries a fair amount of overhead, much more than a C fwrite call, and it's simply not an efficient way to write a single byte.

A good chunk of the overhead comes from the fact that Haskell I/O buffers have immutable metadata. Even though the buffer content itself is mutable, the pointers to valid data within the buffer are immutable, so writing a single byte requires a heap allocation of a new GHC.IO.Buffer.Buffer structure, which GHC can't optimize away.
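For completeness, the bytestring package's own answer to this kind of problem is Data.ByteString.Builder, which accumulates bytes into large chunks before any handle I/O happens. A minimal sketch (my addition, not from the original answer; whether it beats the hand-rolled buffer below would need benchmarking):

```haskell
import Data.ByteString.Builder (hPutBuilder, word8)
import System.IO

main :: IO ()
main = do
  hSetBinaryMode stdout True
  -- hPutBuilder works best with block buffering on the handle.
  hSetBuffering stdout (BlockBuffering Nothing)
  -- mconcat fuses the 100 million single-byte builders into large
  -- chunks, so the per-byte cost is a poke rather than a Handle call.
  hPutBuilder stdout (mconcat (replicate 100000000 (word8 1)))
```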

One solution is to use a hand-crafted buffering structure with a mutable pointer. The following works, and it's about twice as fast as the lazy I/O version in the original question.

{-# LANGUAGE RecordWildCards #-}
{-# OPTIONS_GHC -Wall #-}

import Control.Monad
import Data.IORef
import Data.Word
import Foreign.ForeignPtr
import Foreign.Ptr
import Foreign.Storable
import System.IO

data WriteBuffer = WriteBuffer
  { handle :: !Handle
  , capacity :: !Int
  , used :: !(IORef Int)
  , content :: !(ForeignPtr Word8)
  }

newBuffer :: Handle -> IO WriteBuffer
newBuffer h = do
  hSetBinaryMode h True
  hSetBuffering h NoBuffering
  WriteBuffer h cap <$> newIORef 0 <*> mallocForeignPtrBytes cap
  where cap = 4096

flushBuffer :: WriteBuffer -> IO ()
flushBuffer WriteBuffer{..} = do
  n <- readIORef used
  withForeignPtr content $ \p -> hPutBuf handle p n
  writeIORef used 0

writeByte :: Word8 -> WriteBuffer -> IO ()
writeByte w buf@(WriteBuffer{..}) = do
  n <- readIORef used
  withForeignPtr content $ \p -> poke (plusPtr p n) w
  let n' = n + 1
  writeIORef used n'
  when (n' == capacity) $
    flushBuffer buf

main :: IO ()
main = do
  b <- newBuffer stdout
  replicateM_ 100000000 (writeByte 1 b)
  flushBuffer b

Somewhat ironically, converting this to a version that uses an immutable counter and passes the WriteBuffer as state through foldM doubles the speed again, making it about 4 times as fast as the lazy I/O version in the original question:

{-# LANGUAGE RecordWildCards #-}
{-# OPTIONS_GHC -Wall #-}

import Control.Monad
import Data.Word
import Foreign.ForeignPtr
import Foreign.Ptr
import Foreign.Storable
import System.IO

data WriteBuffer = WriteBuffer
  { handle :: !Handle
  , capacity :: !Int
  , used :: !Int
  , content :: !(ForeignPtr Word8)
  }

newBuffer :: Handle -> IO WriteBuffer
newBuffer h = do
  hSetBinaryMode h True
  hSetBuffering h NoBuffering
  WriteBuffer h cap 0 <$> mallocForeignPtrBytes cap
  where cap = 4096

flushBuffer :: WriteBuffer -> IO WriteBuffer
flushBuffer buf@WriteBuffer{..} = do
  withForeignPtr content $ \p -> hPutBuf handle p used
  return $ buf { used = 0 }

writeByte :: Word8 -> WriteBuffer -> IO WriteBuffer
writeByte w buf@(WriteBuffer{..}) = do
  withForeignPtr content $ \p -> poke (plusPtr p used) w
  let used' = used + 1
      buf' = buf { used = used' }
  if (used' == capacity)
    then flushBuffer buf'
    else return buf'

main :: IO ()
main = do
  b <- newBuffer stdout
  b' <- foldM (\s _ -> writeByte 1 s) b [(1::Int)..100000000]
  void (flushBuffer b')

The reason this one is so fast seems to be that GHC is able to optimize away the WriteBuffer constructor entirely from the fold and just pass around unboxed pointers and integers in the loop. My guess is that if I modified the mutable version above to avoid boxing and unboxing the integer in the used IORef, it would be similarly fast.
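One way to act on that guess is to move the counter out of the IORef and into foreign memory, where it is stored as a plain unboxed Int via Storable. This is an untested sketch of my own, not from the original answer, and whether it actually matches the foldM version's speed would need measuring:

```haskell
{-# LANGUAGE RecordWildCards #-}

import Control.Monad
import Data.Word
import Foreign.ForeignPtr
import Foreign.Ptr
import Foreign.Storable
import System.IO

data WriteBuffer = WriteBuffer
  { handle   :: !Handle
  , capacity :: !Int
  , used     :: !(ForeignPtr Int)   -- counter kept unboxed in foreign memory
  , content  :: !(ForeignPtr Word8)
  }

newBuffer :: Handle -> IO WriteBuffer
newBuffer h = do
  hSetBinaryMode h True
  hSetBuffering h NoBuffering
  usedPtr <- mallocForeignPtr
  withForeignPtr usedPtr (`poke` 0)
  WriteBuffer h cap usedPtr <$> mallocForeignPtrBytes cap
  where cap = 4096

flushBuffer :: WriteBuffer -> IO ()
flushBuffer WriteBuffer{..} =
  withForeignPtr used $ \u -> do
    n <- peek u
    withForeignPtr content $ \p -> hPutBuf handle p n
    poke u 0

writeByte :: Word8 -> WriteBuffer -> IO ()
writeByte w buf@WriteBuffer{..} = do
  -- peek/poke on a Storable Int avoids the boxed Int inside an IORef
  n <- withForeignPtr used peek
  withForeignPtr content $ \p -> poke (plusPtr p n) w
  let n' = n + 1
  withForeignPtr used (`poke` n')
  when (n' == capacity) $ flushBuffer buf

main :: IO ()
main = do
  b <- newBuffer stdout
  replicateM_ 100000000 (writeByte 1 b)
  flushBuffer b
```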
