PySpark Notes


pyspark.sql.SQLContext
Top funcs:
1. createDataFrame(data, schema): create a DataFrame
2. Read data from multiple data sources via sqlContext.read
3. User-defined registration: registerDataFrameAsTable(df, tableName); registerFunction(name, f, returnType=StringType)
4. Run SQL statements (HiveQL) with sqlContext.sql
(see the sketch after this list)
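A minimal sketch of these four entry points, assuming a Spark 1.x-style setup where sc is an existing SparkContext; the sample rows, the table name "people", and the UDF "to_upper" are illustrative assumptions:

from pyspark.sql import SQLContext
from pyspark.sql.types import StringType

sqlContext = SQLContext(sc)  # sc: an existing SparkContext (assumed)

# 1. Create a DataFrame from local data plus a schema (column names here).
df = sqlContext.createDataFrame([(1, "alice"), (2, "bob")], ["id", "name"])

# 2. Read from an external data source (JSON chosen as an example format).
# other = sqlContext.read.format("json").load("/path/to/file.json")

# 3. Register the DataFrame as a temp table, and register a Python UDF.
sqlContext.registerDataFrameAsTable(df, "people")
sqlContext.registerFunction("to_upper", lambda s: s.upper(), StringType())

# 4. Run a SQL (HiveQL-style) statement against the registered table.
sqlContext.sql("SELECT to_upper(name) AS name FROM people").show()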
pyspark.sql.DataFrame
1. collect
2. Row actions: filter (= where), distinct, dropDuplicates (= drop_duplicates), dropna
3. Column operations: select; withColumn (add a column); withColumnRenamed (rename a column); Col.cast(Type) (convert the type); Col.between(l, u); inSet (superseded by isin); F.when().otherwise() (conditionals); Col.substr(startPos, len); alias (give a column an alias)
4. groupBy operations: agg
5. Map-style operations: map, flatMap, foreach
6. join, intersect, unionAll, subtract, orderBy
7. Two ways to read with DataFrameReader: sqlContext.read.format('FORMAT').load('PATH') or sqlContext.read.load(path, format, schema, **options)
8. Two ways to write with DataFrameWriter: df.write.format('FORMAT').save(path) or df.save(path, format, mode, **options); also saveAsTable
(a combined sketch follows this list)
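A combined sketch of the operations above, reusing sqlContext and df from the previous snippet; all derived column names, the paths, and the parquet format are illustrative assumptions:

from pyspark.sql import functions as F

# Row actions: filter/where, dropDuplicates, dropna.
rows = df.filter(df.id > 0).dropDuplicates().dropna()

# Column operations: withColumn, withColumnRenamed, cast, between,
# when/otherwise, substr, alias, select.
out = (rows
       .withColumn("id_str", rows.id.cast("string"))
       .withColumnRenamed("name", "full_name")
       .withColumn("initial", F.col("full_name").substr(1, 1))
       .withColumn("small_id", F.when(F.col("id").between(1, 10), "yes")
                                .otherwise("no"))
       .select(F.col("full_name").alias("who"), "id", "small_id"))

# groupBy + agg.
counts = out.groupBy("small_id").agg(F.count("id").alias("n"))

# join / orderBy (intersect, unionAll, subtract follow the same pattern).
joined = out.join(counts, on="small_id").orderBy("id")

# DataFrameReader / DataFrameWriter, both styles from the notes.
# df2 = sqlContext.read.format("parquet").load("/tmp/in")
# df3 = sqlContext.read.load("/tmp/in", format="parquet")
# joined.write.format("parquet").save("/tmp/out")
# joined.write.saveAsTable("my_table")  # persist as a managed table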
pyspark.sql.types
from pyspark.sql.types import *
StringType
BooleanType
DataType
TimestampType
DecimalType
DoubleType
FloatType
IntegerType
StructType
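A short sketch of building an explicit schema from these types and passing it to createDataFrame; StructField (not listed above) is needed to compose a StructType, and the field names and sample row are made-up assumptions:

import datetime
from pyspark.sql.types import (StructType, StructField, StringType,
                               IntegerType, DoubleType, BooleanType,
                               TimestampType)

schema = StructType([
    StructField("name", StringType(), True),    # True = nullable
    StructField("age", IntegerType(), True),
    StructField("score", DoubleType(), True),
    StructField("active", BooleanType(), True),
    StructField("ts", TimestampType(), True),
])

typed = sqlContext.createDataFrame(
    [("alice", 30, 9.5, True, datetime.datetime(2016, 1, 1))],
    schema)
typed.printSchema()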

pyspark.sql.functions
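The conventional import for this module, plus the when()/otherwise() pattern referenced in the DataFrame notes above (the column names are assumptions):

from pyspark.sql import functions as F

# Conditional column, as in the column-operations list above.
labeled = df.withColumn("big_id",
                        F.when(F.col("id") > 1, "yes").otherwise("no"))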