Olologin's answer is almost there, but I believe what you want is to group the RDD into 3-tuples rather than to split the RDD into three groups. To do that, try the following:
rdd = sc.parallelize(["e1", "e2", "e3", "e4", "e5", "e6", "e7", "e8", "e9", "e10"])
transformed = rdd.zipWithIndex() \
    .groupBy(lambda pair: pair[1] // 3) \
    .map(lambda kv: tuple(elem[0] for elem in kv[1]))
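The chunking logic itself does not depend on Spark: each element's index is integer-divided by 3 to pick its group. As a minimal plain-Python sketch of the same idea (the helper name `chunk_into_tuples` is my own, not part of the answer's API):

```python
def chunk_into_tuples(elems, size=3):
    """Group elements into tuples of `size` by integer-dividing their index,
    mirroring the zipWithIndex/groupBy pattern used above."""
    groups = {}
    for i, elem in enumerate(elems):
        groups.setdefault(i // size, []).append(elem)
    # Sort by group key so the output preserves the original order.
    return [tuple(groups[k]) for k in sorted(groups)]

print(chunk_into_tuples(["e1", "e2", "e3", "e4", "e5", "e6", "e7", "e8", "e9", "e10"]))
# [('e1', 'e2', 'e3'), ('e4', 'e5', 'e6'), ('e7', 'e8', 'e9'), ('e10',)]
```

Unlike this local version, the Spark version gives no ordering guarantee across partitions, which is why the tuples below print out of order.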
Running this in pyspark, I get the following:
>>> from __future__ import print_function
>>> rdd = sc.parallelize(["e1", "e2", "e3", "e4", "e5", "e6", "e7", "e8", "e9", "e10"])
>>> transformed = rdd.zipWithIndex().groupBy(lambda pair: pair[1] // 3).map(lambda kv: tuple(elem[0] for elem in kv[1]))
>>> transformed.foreach(print)
...
('e4', 'e5', 'e6')
('e10',)
('e7', 'e8', 'e9')
('e1', 'e2', 'e3')