Efficient storage / search algorithm for date range availability

orokusaki

Background

I'm providing a map-based search for paddle boards, including searching by date - think "check in" / "check out" dates in a hotel search, where only hotels available for that entire date range are included in the results.

I have availability data for a couple years, (updated daily) for each paddle board. A paddle board is either available on a day (for 12 noon pick-up) or not available (meaning somebody is picking it up at 12 noon and will drop it off the next morning).

There are hundreds of thousands of paddle boards, so storing separate records for each day's availability for each is untenable (e.g., 350k paddle boards w/ 2 years availability data results in over 250 million rows).

Technology

I'm using Postgres 15, and I can add extensions, if necessary. I also have Redis available in the stack, in case the best solution involves non-RDBMS.

The app is very read-heavy, so large / expensive indexes are OK, within reason.

Initial Idea

My initial idea for storage and efficient searching involves storing availability data in a char field, for each paddle board, in a format like this: AAANNAAAANA (730 chars long, where A represents "available" and N represents "not available", with a static starting date (e.g., Jan 1, 2023)). Given a search date range (e.g., pick up on Jan 20 / drop off on the morning of Jan 25), I can construct a regexp search (e.g., ^.{19}A{5}) and achieve decent performance... but... I feel like there's something more ideal.

For example, if I was working with a nested tree structure and wanted to get all offspring nodes of a given node, I could use an algorithm, like Modified Preorder Tree Traversal, to make easy work of an otherwise nightmarishly slow problem. Is there no such concept / mathematical solution to the kind of problem I'm describing?

Question

My question is, aside from the regexp solution I've come up with, are there any algorithmic solutions that are widely used to solve problems like this? For example, I looked into segment trees and some other data structures, not completely diving into any particular one, but I couldn't find anything that I deemed to be at least a semi-ideal solution to the problem. I'm looking for a solution that wouldn't involve really low-level hacking; something that's relatively expressive / maintainable, and where the main filtering work takes place in the database (vs in application space).

Bergi

My initial idea for storage and efficient searching involves storing availability data in a char field, for each paddle board, in a format like this: AAANNAAAANA (730 chars long, where A represents "available" and N represents "not available", with a static starting date)

Using a character string for this is rather unwieldy. Nicest to work with, and following the best practice of database normalisation, would be to store the availability data in a separate table - with one entry per availability, per non-availability, or per date for which the status is known, depending on your modeling.

However, for your use case - in particular, never updating individual availabilities but always the whole series, always reading the whole (or at least a large part of the) series, and not ever needing to store additional data per checkout - a non-atomic representation might be ok and likely even faster.

One approach would be to use a bit string, which is smaller and has more appropriate operators available than the character string. Afaics, it does not provide suitable indexing capabilities.
Next would be using an array of dates, i.e. date[], each representing a non-availability (which I guess are less common than availabilities). This probably takes much more space, but you can build a GIN index to search for arrays containing specific dates or overlapping with specific date arrays.
Finally, Postgres 14 introduced multiranges, representing a set of non-overlapping ranges (or, if you want, a range with gaps). Using a datemultirange would be fitting especially if most checkouts of boards are for multiple sequential days, not just a single day. They offer similarly useful operators, and can be indexed to search for containment and overlaps with other (multi)ranges.

Collected from the Internet

Please contact [email protected] to delete if infringement.

edited at2023-06-2

Comments

0 comments

TOP Ranking

Article