[Spark] Key Spark DataFrame Methods (4): withColumn

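The withColumn method

Summary

Use withColumn to add a new column, update an existing column's values, or change its type.

withColumn('new or existing column name', new or updated value)

When the new or updated value is derived from an existing column, the column name in the first argument is given as a plain string, while the existing column must be referenced through col('column name').

New columns can also be added with the select() method.
Use the withColumnRenamed() method to rename a column.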

a. Basic usage

# Load the library
from pyspark.sql.functions import col

# Copy: Spark DataFrames have no .copy() method, so copy with select('*')
# (titanic_sdf is assumed to be the Titanic DataFrame loaded earlier)
titanic_sdf_copied = titanic_sdf.select('*')

# Add a new column
titanic_sdf_copied = titanic_sdf_copied.withColumn('Extra_Fare', col('Fare') * 10)

# Update an existing column
titanic_sdf_copied = titanic_sdf_copied.withColumn('Fare', col('Fare') + 20)

# Change the type of an existing column
titanic_sdf_copied = titanic_sdf_copied.withColumn('Fare', col('Fare').cast('integer'))

# Apply several withColumn() calls at once (the parentheses let the chain span lines)
titanic_sdf_copied = (
    titanic_sdf_copied
    .withColumn('Fare', col('Fare') + 20)
    .withColumn('Fare', col('Fare').cast('integer'))
)

b. Constant columns: using lit()

The new or updated value must be a Column, so a constant value has to be wrapped in lit().

# Updating with a bare constant, as below, raises an error: the value must be a Column!
# (left commented out so that the rest of the block runs)
# titanic_sdf_copied = titanic_sdf_copied.withColumn('Extra_Fare', 10)

# Import lit
from pyspark.sql.functions import lit

# Use lit() when updating with a constant value
titanic_sdf_copied = titanic_sdf_copied.withColumn('Extra_Fare', lit(10))

# Use lit() as well when creating a new column from a constant value
titanic_sdf_copied = titanic_sdf_copied.withColumn('New_Name', lit('Test_Name'))

c. Adding new columns with select()

# Load the library
from pyspark.sql.functions import col, split, substring

# Add new columns with select()
titanic_sdf_copied = titanic_sdf_copied.select('*', col('Sex').alias('Gender'))
# substring() positions are 1-based
titanic_sdf_copied = titanic_sdf_copied.select('*', substring('Cabin', 1, 1).alias('First'))

# The same with withColumn
titanic_sdf_copied = (
    titanic_sdf_copied
    .withColumn('Gender', col('Sex'))
    .withColumn('Cabin_First', substring('Cabin', 1, 1))
)

# Use split(): getItem(0) is the part before the comma, getItem(1) the part after it
titanic_sdf_copied = titanic_sdf_copied.withColumn('Sp', split(col('Name'), ',').getItem(0))
# Note: reusing the name 'Sp' overwrites the column created on the previous line
titanic_sdf_copied = titanic_sdf_copied.withColumn('Sp', split(col('Name'), ',').getItem(1))

d. Renaming columns: withColumnRenamed()

withColumnRenamed('existing column name', 'new column name') renames a column.

# Rename a column
titanic_sdf_copied = titanic_sdf_copied.withColumnRenamed('Gender', 'Gender_Renamed')

# Caution: no error is raised even if the column to rename does not exist
titanic_sdf_copied = titanic_sdf_copied.withColumnRenamed('Gender_X', 'Gender_Renamed')

Reference

More information on this topic ([Spark] Key Spark DataFrame Methods (4): withColumn) can be found here: https://velog.io/@baekdata/sparkwithcolumn

This text may be freely shared or copied, provided that this document's URL is kept as a reference URL.
