我有一个带有“周”和“年”列的数据框,需要计算以下月份的月份:
输入:
+----+----+
|Week|Year|
+----+----+
| 50|2012|
| 50|2012|
| 50|2012|
预期产量:
+----+----+-----+
|Week|Year|Month|
+----+----+-----+
| 50|2012|12 |
| 50|2012|12 |
| 50|2012|12 |
任何帮助,将不胜感激。谢谢
感谢@ zero323,他向我指出了sqlContext.sql查询,我将查询转换为以下内容:
import org.apache.spark.SparkConf;
import org.apache.spark.api.java.JavaRDD;
import org.apache.spark.api.java.JavaSparkContext;
import org.apache.spark.sql.DataFrame;
import org.apache.spark.sql.RowFactory;
import org.apache.spark.sql.SQLContext;
import org.apache.spark.sql.types.DataTypes;
import org.apache.spark.sql.types.StructField;
import org.apache.spark.sql.types.StructType;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import static org.apache.spark.sql.functions.*;
public class MonthFromWeekSparkSQL {
public static void main(String[] args) {
SparkConf conf = new SparkConf().setAppName("MonthFromWeekSparkSQL").setMaster("local");
JavaSparkContext sc = new JavaSparkContext(conf);
SQLContext sqlContext = new org.apache.spark.sql.SQLContext(sc);
List myList = Arrays.asList(RowFactory.create(50, 2012), RowFactory.create(50, 2012), RowFactory.create(50, 2012));
JavaRDD myRDD = sc.parallelize(myList);
List<StructField> structFields = new ArrayList<StructField>();
// Create StructFields
StructField structField1 = DataTypes.createStructField("week", DataTypes.IntegerType, true);
StructField structField2 = DataTypes.createStructField("year", DataTypes.IntegerType, true);
// Add StructFields into list
structFields.add(structField1);
structFields.add(structField2);
// Create StructType from StructFields. This will be used to create DataFrame
StructType schema = DataTypes.createStructType(structFields);
DataFrame df = sqlContext.createDataFrame(myRDD, schema);
DataFrame df2 = df.withColumn("yearAndWeek", concat(col("year"), lit(" "), col("week")))
.withColumn("month", month(unix_timestamp(col("yearAndWeek"), "yyyy w").cast(("timestamp")))).drop("yearAndWeek");
df2.show();
}
}
您实际上创建了一个新的年份和星期格式为“ yyyy w”的列,然后使用unix_timestamp对其进行转换,您可以从中拉出您所看到的月份。
PS:似乎在spark 1.5中投放行为不正确-https: //issues.apache.org/jira/browse/SPARK-11724
因此,在这种情况下,更一般的做法是 .cast("double").cast("timestamp")
本文收集自互联网,转载请注明来源。
如有侵权,请联系 [email protected] 删除。
我来说两句