TPL Dataflow vs plain Semaphore

BornToCode

I have a requirement to make a scalable process. The process has mainly I/O operations with some minor CPU operations (mainly deserializing strings). The process query the database for a list of urls, then fetches data from these urls, deserialize the downloaded data to objects, then persist some of the data into crm dynamics and also to another database. Afterwards I need to update the first database which urls were processed. Part of the requirement is to make the parallelism degree configurable.

Initially I thought to implement it via a sequence of tasks with await and limit the parallelism using Semaphore - quite simple. Then I read a few posts and answers here of @Stephen Cleary which recommends using TPL Dataflow and I thought it could be a good candidate. However I want to make sure I'm "complicating" the code by using Dataflow for a worthy cause. I also got a suggestion to use a ForEachAsync extension method which is also simple to use, however I'm not sure if it won't cause a memory overhead because of the way it partitions the collection.

Is TPL Dataflow a good option for such a scenario? How is it better than a Semaphore or the ForEachAsync method - what benefits will I actually gain if I implement it via TPL DataFlow over each of the other options (Semaphore/ForEachASync)?

Stephen Cleary

The process has mainly IO operations with some minor CPU operations (mainly deserializing strings).

That's pretty much just I/O. Unless those strings are huge, the deserialization won't be worth parallelizing. The kind of CPU work you're doing will be lost in the noise.

So, you'll want to focus on concurrent asynchrony.

SemaphoreSlim is the standard pattern for this, as you've found.
TPL Dataflow can also do concurrency (both asynchronous and parallel forms).

ForEachAsync can take several forms; note that in the blog post you referenced, there are 5 different implementations of this method, each of which are valid. "[T]here are many different semantics possible for iteration, and each will result in different design choices and implementations." For your purposes (not wanting CPU parallelization), you shouldn't consider the ones using Task.Run or partitioning. In an asynchronous concurrency world, any ForEachAsync implementation is just going to be syntactic sugar that hides which semantics it implements, which is why I tend to avoid it.

This leaves you with SemaphoreSlim vs. ActionBlock. I generally recommend people start with SemaphoreSlim first, and consider moving to TPL Dataflow if their needs become more complex (in a way that seems like they would benefit from a dataflow pipeline).

E.g., "Part of the requirement is to make the parallelism degree configurable."

You may start off with allowing a degree of concurrency - where the thing being throttled is a single whole operation (fetch data from url, deserialize the downloaded data to objects, persist into crm dynamics and to another database, and update the first database). This is where SemaphoreSlim would be a perfect solution.

But you may decide you want to have multiple knobs: say, one degree of concurrency for how many urls you're downloading, and a separate degree of concurrency for persisting, and a separate degree of concurrency for updating the original database. And then you'd also need to limit the "queues" in-between these points: only so many deserialized objects in-memory, etc. - to ensure that fast urls with slow databases don't cause problems with your app using too much memory. If these are useful semantics, then you have started approaching the problem from a dataflow perspective, and that's the point that you may be better served with a library like TPL Dataflow.

Collected from the Internet

Please contact [email protected] to delete if infringement.

edited at2020-10-30

Comments

0 comments

TOP Ranking

Article

TPL Dataflow vs plain Semaphore

TPL Dataflow vs plain Semaphore

pump.io port in URL

Loopback Error: connect ECONNREFUSED 127.0.0.1:3306 (MAMP)

Can't pre-populate phone number and message body in SMS link on iPhones when SMS app is not running in the background

How to import an asset in swift using Bundle.main.path() in a react-native native module

Failed to listen on localhost:8000 (reason: Cannot assign requested address)

Spring Boot JPA PostgreSQL Web App - Internal Authentication Error

ngClass error (Can't bind ngClass since it isn't a known property of div) in Angular 11.0.3

Using Response.Redirect with Friendly URLS in ASP.NET

Can a 32-bit antivirus program protect you from 64-bit threats

Double spacing in rmarkdown pdf

How to fix "pickle_module.load(f, **pickle_load_args) _pickle.UnpicklingError: invalid load key, '<'" using YOLOv3?

3D Touch Peek Swipe Like Mail

Bootstrap 5 Static Modal Still Closes when I Click Outside

Assembly definition can't resolve namespaces from external packages

Vector input in shiny R and then use it

Emulator wrong screen resolution in Android Studio 1.3

Svchost high CPU from Microsoft.BingWeather app errors

Graphics Context misaligned on first paint

Python connect to firebird docker database

Is this docker-for-mac password dialog legit?

How to save models trained locally in Amazon SageMaker?