How do I create a new column in a DataFrame based on values contained in two other lists?

알로 프란

I have a PySpark DataFrame like this:

+--------------------+--------------------+
|               label|           sentences|
+--------------------+--------------------+
|[things, we, eati...|<p>I am construct...|
|[elephants, nordi...|<p><strong>Edited...|
|[bee, cross-entro...|<p>I have a data ...|
|[milking, markers...|<p>There is an Ma...|
|[elephants, tease...|<p>I have Score d...|
|[references, gene...|<p>I'm looking fo...|
|[machines, exitin...|<p>I applied SVM ...|
+--------------------+--------------------+

And a list, top_ten, like this:

['bee', 'references', 'milking', 'expert', 'bombardier', 'borscht', 'distributions', 'wires', 'keyboard', 'correlation']

I need to create a new_label column that is 1.0 if at least one of the label values exists in the top_ten list (for each row, of course).

The logic makes sense to me, but I lack experience with the syntax. Surely there is a short answer to this problem?

I've tried:

temp = train_df.withColumn('label', F.when(lambda x: x.isin(top_ten), 1.0).otherwise(0.0))

and this:

def matching_top_ten(top_ten, labels):
    for label in labels:
        if label.isin(top_ten):
            return 1.0
        else:
            return 0.0

After this last attempt I learned that functions like these can't be mapped over a DataFrame. I suppose I could convert the column to an RDD, map it, and then .join() it back, but that sounds unnecessarily tedious.

**Update:** I tried the function above as a UDF, but had no luck either...

from pyspark.sql.functions import udf
from pyspark.sql.types import FloatType
matching_udf = udf(matching_top_ten, FloatType())
temp = train_df.select('label', matching_udf(top_ten, 'label').alias('new_labels'))
----
TypeError: Invalid argument, not a string or column: [...top_ten list values...] of type <class 'list'>. For column literals, use 'lit', 'array', 'struct' or 'create_map' function.

I have found other similar questions, but none of them involve checking one list against another list (at best, a single value against a list).

파울리

You don't need to use a udf, and you can avoid the expense of explode + agg.
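For comparison, the TypeError in the question arises because the Python list top_ten is passed as a UDF argument, and UDF arguments must be columns or column names. The original function also calls .isin on a plain Python string and returns after inspecting only the first label. A minimal sketch of a working UDF, assuming the list is captured in a closure instead (make_matcher and add_new_label_udf are hypothetical names, and the pyspark import is deferred so the pure-Python part can be tried without Spark):

```python
def make_matcher(top_ten):
    """Return a function mapping a list of labels to 1.0 or 0.0."""
    wanted = set(top_ten)  # set membership is cheap per label

    def matching_top_ten(labels):
        # A null array column arrives as None in a Python UDF.
        if labels is None:
            return 0.0
        # 1.0 as soon as any label matches; 0.0 only after checking all of them.
        return 1.0 if any(label in wanted for label in labels) else 0.0

    return matching_top_ten


def add_new_label_udf(df, top_ten):
    """Add 'new_label' via a UDF; the list lives in the closure, not in the call."""
    from pyspark.sql.functions import udf
    from pyspark.sql.types import DoubleType

    matching_udf = udf(make_matcher(top_ten), DoubleType())
    # Only the column name is passed as an argument.
    return df.withColumn('new_label', matching_udf('label'))
```

This works, but a Python UDF moves every row through the Python interpreter, which is why the built-in functions below are preferable.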

Spark version 2.4+

You can use pyspark.sql.functions.arrays_overlap:

import pyspark.sql.functions as F

top_ten_array = F.array(*[F.lit(val) for val in top_ten])

temp = train_df.withColumn(
    'new_label', 
    F.when(F.arrays_overlap('label', top_ten_array), 1.0).otherwise(0.0)
)

Or pyspark.sql.functions.array_intersect():

temp = train_df.withColumn(
    'new_label', 
    F.when(
        F.size(F.array_intersect('label', top_ten_array)) > 0, 1.0
    ).otherwise(0.0)
)

Both of these check whether the intersection of label and top_ten is non-empty.
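As a rough plain-Python model of what both expressions compute per row (an illustration only, not how Spark evaluates them; expected_new_label is a hypothetical helper, the sample rows are made up, and Spark's null handling is not modeled):

```python
def expected_new_label(labels, top_ten):
    """1.0 if labels and top_ten share at least one value, else 0.0."""
    common = set(labels) & set(top_ten)
    return 1.0 if len(common) > 0 else 0.0


top_ten = ['bee', 'references', 'milking']
# Made-up rows standing in for the 'label' column.
rows = [['things', 'we'], ['bee', 'elephants'], ['milking', 'machines']]
results = [expected_new_label(labels, top_ten) for labels in rows]
print(results)  # [0.0, 1.0, 1.0]
```

A helper like this can be handy for spot-checking a few collected rows against the new_label column.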

For Spark 1.5 to 2.3, you can use array_contains in a loop over top_ten:

from operator import or_
from functools import reduce

temp = train_df.withColumn(
    'new_label',
    F.when(
        reduce(or_, [F.array_contains('label', val) for val in top_ten]),
        1.0
    ).otherwise(0.0)
)

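This tests whether label contains each of the values in top_ten and reduces the results with a bitwise OR. The condition is True only if at least one of the values in top_ten is contained in label.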
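To see what reduce(or_, ...) does here, a small plain-Python sketch with booleans standing in for the Column expressions (with Columns, or_ builds a combined `|` expression instead of evaluating anything; contains_any is a hypothetical name and the sample labels are made up):

```python
from functools import reduce
from operator import or_


def contains_any(labels, top_ten):
    # One membership test per top_ten value, mirroring the array_contains list.
    checks = [val in labels for val in top_ten]
    # reduce folds the list pairwise: ((c0 | c1) | c2) | ...
    return reduce(or_, checks)


top_ten = ['bee', 'references', 'milking']
print(contains_any(['elephants', 'bee'], top_ten))  # True
print(contains_any(['things', 'we'], top_ten))      # False
```

Note that reduce without an initial value raises a TypeError on an empty list, so this pattern assumes top_ten is non-empty.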
