val textFile = spark.textFile("hdfs://...")
val counts = textFile.flatMap(line => line.split(" "))
                 .map(word => (word, 1))
                 .reduceByKey(_ + _)
counts.saveAsTextFile("hdfs://...")
This magic is powered by functional programming. Scala, the language Spark is written in, has built-in map and reduce functions, and its collections also define map, flatMap and mapValues. If you only want a single-node word count, you don't have to run Spark or Hadoop. Scala's functional collections are enough.
1) Create several lines of words.
scala> val lines = List("Hello world", "This is a hello world example for word count")
lines: List[String] = List(Hello world, This is a hello world example for word count)
2) Use flatMap to split the lines into a single list of words.
scala> val words = lines.flatMap(_.split(" "))
words: List[String] = List(Hello, world, This, is, a, hello, world, example, for, word, count)
3) Group the words into a map.
scala> words.groupBy((word:String) => word)
res9: scala.collection.immutable.Map[String,List[String]] = Map(for -> List(for), count -> List(count), is -> List(is), This -> List(This), world -> List(world, world), a -> List(a), Hello -> List(Hello), hello -> List(hello), example -> List(example), word -> List(word))
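The structure of each map entry is word -> List of that word's occurrences. Note that "Hello" and "hello" are separate keys, because the grouping is case-sensitive.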
4) Count the length of each word list.
scala> words.groupBy((word:String) => word).mapValues(_.length)
res12: scala.collection.immutable.Map[String,Int] = Map(for -> 1, count -> 1, is -> 1, This -> 1, world -> 2, a -> 1, Hello -> 1, hello -> 1, example -> 1, word -> 1)
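The four steps above can be combined into one function. This is only a minimal sketch, assuming plain Scala collections and no Spark; the names LocalWordCount and wordCount are made up for illustration. It uses map with a pattern match instead of mapValues, because mapValues on a Map is deprecated as of Scala 2.13 (in favor of .view.mapValues), while the form below should work on older and newer versions alike.

```scala
object LocalWordCount {
  // Hypothetical helper: split each line on single spaces,
  // group identical words, then count the size of each group.
  def wordCount(lines: List[String]): Map[String, Int] =
    lines
      .flatMap(_.split(" "))   // lines -> one flat list of words
      .groupBy(identity)       // word -> List of its occurrences
      .map { case (word, occurrences) => word -> occurrences.length }

  def main(args: Array[String]): Unit = {
    val lines = List("Hello world", "This is a hello world example for word count")
    val counts = wordCount(lines)
    println(counts("world")) // prints 2
  }
}
```

As in the REPL session, the grouping is case-sensitive, so "Hello" and "hello" are counted separately. Lower-casing each word before groupBy would merge them, if that is what you want.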
Scala does not have a built-in reduceByKey function like Spark, which is why I use groupBy. Here is a discussion about why reduceByKey works much better than groupByKey on large datasets.
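To make the difference concrete, here is a rough sketch of what a local reduceByKey could look like on ordinary Scala collections. The object LocalReduceByKey and its reduceByKey helper are hypothetical names of my own, not part of the Scala standard library or of Spark. The idea is to combine values for a key as they arrive, instead of first building a full list per key the way groupBy does.

```scala
object LocalReduceByKey {
  // Hypothetical helper: fold (key, value) pairs into a Map,
  // merging each new value into the running result for its key.
  def reduceByKey[K, V](pairs: Iterable[(K, V)])(combine: (V, V) => V): Map[K, V] =
    pairs.foldLeft(Map.empty[K, V]) { case (acc, (key, value)) =>
      // First value for a key is stored as-is; later ones are combined.
      val merged = acc.get(key).map(combine(_, value)).getOrElse(value)
      acc.updated(key, merged)
    }

  def main(args: Array[String]): Unit = {
    val words = List("Hello", "world", "hello", "world")
    val counts = reduceByKey(words.map(word => (word, 1)))(_ + _)
    println(counts("world")) // prints 2
  }
}
```

On Scala 2.13 or later, words.groupMapReduce(identity)(_ => 1)(_ + _) should give the same counts in a single call. This only mirrors the shape of Spark's reduceByKey on one machine; it says nothing about the shuffle behavior that makes the distributed version faster.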