Get a fast reduce function of a big stream

Hey guys,

Ich have a big stream of strings (1M,10M,100M)
and want to make a reduce.

If I use following very easy solution, then I need much memory but it is quite fast (1M needs 100ms on my machine)

 def simpleLoop:String = {
      val sb = new StringBuilder
      for (i <- 1 to end){
        if (sb.isEmpty){
          sb.append(i.toString)
        }
        else{
          sb.append("." +i.toString)
        }
      }
      sb.toString()
  }

If I use a very simple solution with akka streams, then it is very slow.

 Source(List.range( 1,1_000_000))
      .map(x => x.toString)
      .runFold("")((x,y) => concat(x,y,"."))

  def concat(s1:String,s2:String,d:String) = if (s1 == "") s2 else s1 + d + s2

Of course I can group that, then I need about 3s for 1M on my machine.

 Source(List.range( 1,1_000_000))
      .map(x => x.toString)
      .grouped(1000)
      .mapAsync(1000)(l => Source(l).runFold("")((x,y) => concat(x,y,".")))
      .runFold("")((x,y) => concat(x,y,"."))

  def concat(s1:String,s2:String,d:String) = if (s1 == "") s2 else s1 + d + s2

But again thats 30x slower. Is there any trick to make it better?

Best
Sigurd

Hi @sigurd,

the question is why you would like to use streams for that particular use case. This seems like a pretty basic operation that wouldn’t gain anything from using streams. Streams introduce some overhead and if the operations itself are very simple that overhead might be significant. In particular, operations that go from memory to memory will likely not see any benefit from using streams.

Also the comparison is somewhat unfair because you use a single mutable StringBuilder (with amortized linear append cost and memory allocations) in the simple loop but use immutable String in the streams versions (probably more like O(n^2) append costs and memory allocations because of all the string copies involved).

Hope that explains what you are seeing.

A use case would be to write big files (csv or so) but also a line could be very big.
Let say you have 10M incoming entries. And you have to group a subset of them, so in output you have only 1M but some of them contains a whole group of maybe 10k per line.
The question is: is ist better to create one line as a synchronous process (just a string builder) or with a stream (just a fold).

The main question is if your workload is a streaming workload, i.e. can you process the input one by one, so that you can benefit from not loading all of the data into memory. In that case, akka-stream probably applies.

  • Concatenating data in memory is not a streaming use case
  • Writing data to a file sequentially is a streaming use case
  • Grouping data one by one element without doing aggregation per group and writing these out somewhere is a streaming use case
  • Grouping data but then doing an aggregation that needs all data per group requires all data in memory, so streaming won’t help a lot.
  • Sorting data is generally not a streaming use case

Hard to say from your description if your grouping pattern would benefit from streaming or not. It sounds as if it could work if you write one file per group?

3 Likes

Thank you very much for this answer.
I want to know whether there is a better solution
but it makes totally sense what you explain.

In the end I need one file (however it is stored). It is possible to store it in chunks and read it chunk by chunk if its requested (by a HTTP call).