Calculating the cosine similarity between all the rows of a DataFrame in PySpark

Abhinav Choudhury

I have a dataset containing workers with their demographic information (age, gender, address, etc.) and their work locations. I created an RDD from the dataset and converted it into a DataFrame.

There are multiple entries for each ID, so I created a DataFrame that contains only the ID of each worker and the various office locations where he/she has worked:

    +----+----------------------------+
    | ID | Office_Loc                 |
    +----+----------------------------+
    | 1  | Delhi, Mumbai, Gandhinagar |
    | 2  | Delhi, Mandi               |
    | 3  | Hyderbad, Jaipur           |
    +----+----------------------------+

I want to calculate the cosine similarity between each worker and every other worker, based on their office locations.

So I iterated through the rows of the DataFrame, first retrieving a single row:

    myIndex = 1
    values = (ID_place_df.rdd.zipWithIndex()
              .filter(lambda x: x[1] == myIndex)   # x is ((ID, Office_Loc), index)
              .map(lambda x: x[0])                 # keep (ID, Office_Loc)
              .collect())

and then using map

    cos_weight = ID_place_df.select("ID", "office_location").rdd\
        .map(lambda x: get_cosine(values, x[0], x[1]))

to calculate the cosine similarity between the extracted row and the whole DataFrame.
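
(The helper get_cosine is not shown in the question; a minimal sketch of what such a helper might look like, assuming each row's Office_Loc has already been split into a list of location strings, could be:)

    import math

    # Hypothetical sketch of get_cosine (the original helper is not shown):
    # treat each location list as a binary vector, so the cosine similarity
    # is |intersection| / sqrt(|set1| * |set2|).
    def get_cosine(ref_values, worker_id, locations):
        ref_locs = set(ref_values[0][1])   # ref_values is [(ID, locations)] from collect()
        locs = set(locations)
        if not ref_locs or not locs:
            return worker_id, 0.0
        overlap = len(ref_locs & locs)
        return worker_id, overlap / math.sqrt(len(ref_locs) * len(locs))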

I do not think my approach is a good one: iterating through the rows of the DataFrame defeats the whole purpose of using Spark. Is there a better way to do it in PySpark? Kindly advise.

MaFF

You can use the mllib package to compute the TF-IDF of every row and normalize each row vector to unit L2 norm. Multiplying the resulting matrix by its own transpose then gives the pairwise dot products of the normalized rows, which are exactly the cosine similarities.
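
The key fact: once each row vector is scaled to unit L2 norm, the dot product of two rows equals their cosine similarity. A quick standalone numpy check of this identity (not part of the Spark pipeline):

    import numpy as np

    a = np.array([1.0, 2.0, 0.0])
    b = np.array([2.0, 1.0, 1.0])

    # cosine similarity computed directly from the definition
    direct = a.dot(b) / (np.linalg.norm(a) * np.linalg.norm(b))

    # the same value, as the dot product of the L2-normalized vectors
    normalized = (a / np.linalg.norm(a)).dot(b / np.linalg.norm(b))

    assert np.isclose(direct, normalized)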

1. RDD

    rdd = sc.parallelize([[1, "Delhi, Mumbai, Gandhinagar"], [2, " Delhi, Mandi"], [3, "Hyderbad, Jaipur"]])

  • Compute TF-IDF:

    documents = rdd.map(lambda l: l[1].replace(" ", "").split(","))
    
    from pyspark.mllib.feature import HashingTF, IDF
    hashingTF = HashingTF()
    tf = hashingTF.transform(documents)
    

You can specify the number of features in HashingTF to make the feature matrix smaller (fewer columns); see the sketch after this step.

    tf.cache()
    idf = IDF().fit(tf)
    tfidf = idf.transform(tf)
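
For example, a smaller feature space (a sketch; 1024 buckets is an arbitrary choice here, and too few buckets increases hash collisions between distinct locations):

    # Assumption: 1024 hash buckets are enough for the distinct locations here.
    hashingTF = HashingTF(numFeatures=1024)
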
  • Compute the L2 norm:

    from pyspark.mllib.feature import Normalizer
    labels = rdd.map(lambda l: l[0])
    features = tfidf
    
    normalizer = Normalizer()
    data = labels.zip(normalizer.transform(features))
    
  • Compute the cosine similarities by multiplying the matrix by its transpose:

    from pyspark.mllib.linalg.distributed import IndexedRowMatrix
    mat = IndexedRowMatrix(data).toBlockMatrix()
    dot = mat.multiply(mat.transpose())
    dot.toLocalMatrix().toArray()
    
        array([[ 0.        ,  0.        ,  0.        ,  0.        ],
               [ 0.        ,  1.        ,  0.10794634,  0.        ],
               [ 0.        ,  0.10794634,  1.        ,  0.        ],
               [ 0.        ,  0.        ,  0.        ,  1.        ]])

    The extra first row and column of zeros correspond to index 0, since the worker IDs start at 1.

    OR: using a Cartesian product and the dot function on the numpy arrays:

    data.cartesian(data)\
        .map(lambda l: ((l[0][0], l[1][0]), l[0][1].dot(l[1][1])))\
        .sortByKey()\
        .collect()
    
        [((1, 1), 1.0),
         ((1, 2), 0.10794633570596117),
         ((1, 3), 0.0),
         ((2, 1), 0.10794633570596117),
         ((2, 2), 1.0),
         ((2, 3), 0.0),
         ((3, 1), 0.0),
         ((3, 2), 0.0),
         ((3, 3), 1.0)]
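
    Note that cartesian materializes all n² pairs, including both orderings and the self-pairs, so its cost grows quadratically with the number of workers.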
    

2. DataFrame

Since you seem to be using DataFrames, you can use the spark ml package instead:

    import pyspark.sql.functions as psf
    df = rdd.toDF(["ID", "Office_Loc"])\
        .withColumn("Office_Loc", psf.split(psf.regexp_replace("Office_Loc", " ", ""), ','))

  • Compute TF-IDF:

    from pyspark.ml.feature import HashingTF, IDF
    hashingTF = HashingTF(inputCol="Office_Loc", outputCol="tf")
    tf = hashingTF.transform(df)
    
    idf = IDF(inputCol="tf", outputCol="feature").fit(tf)
    tfidf = idf.transform(tf)
    
  • Compute the L2 norm:

    from pyspark.ml.feature import Normalizer
    normalizer = Normalizer(inputCol="feature", outputCol="norm")
    data = normalizer.transform(tfidf)
    
  • Compute matrix product:

    from pyspark.mllib.linalg.distributed import IndexedRow, IndexedRowMatrix
    mat = IndexedRowMatrix(
        data.select("ID", "norm")\
            .rdd.map(lambda row: IndexedRow(row.ID, row.norm.toArray()))).toBlockMatrix()
    dot = mat.multiply(mat.transpose())
    dot.toLocalMatrix().toArray()
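
    This produces the same 4x4 similarity matrix as in the RDD version above.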
    

    OR: using a join and a UDF for the dot product:

    from pyspark.sql.types import DoubleType
    dot_udf = psf.udf(lambda x, y: float(x.dot(y)), DoubleType())
    data.alias("i").join(data.alias("j"), psf.col("i.ID") < psf.col("j.ID"))\
        .select(
            psf.col("i.ID").alias("i"), 
            psf.col("j.ID").alias("j"), 
            dot_udf("i.norm", "j.norm").alias("dot"))\
        .sort("i", "j")\
        .show()
    
        +---+---+-------------------+
        |  i|  j|                dot|
        +---+---+-------------------+
        |  1|  2|0.10794633570596117|
        |  1|  3|                0.0|
        |  2|  3|                0.0|
        +---+---+-------------------+
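
    The join condition i.ID < j.ID keeps each unordered pair only once and drops the self-pairs, whose similarity is always 1 for these unit-norm rows.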
    

This tutorial lists different methods for multiplying large-scale matrices: https://labs.yodas.com/large-scale-matrix-multiplication-with-pyspark-or-how-to-match-two-large-datasets-of-company-1be4b1b2871e

