Reading a large JSON file with pandas

AhmedKamal2021 Published at Dev

AhmedKamal2021

I am trying to load the book reviews from this page: https://nijianmo.github.io/amazon/index.html. I downloaded and extracted the file, but when I try to read it with pandas I get a memory error:

pd.read_json('path/Books_5.json',lines=True)

Smaller files from the same page load fine. I am doing sentiment analysis, and I need 250k reviews with scores of 4 or 5 and 250k reviews with scores of 1 or 2.

I tried the following to check each score and collect the review text into lists, so that I could build a DataFrame from them later:

with pd.read_json('path/Books_5.json', lines=True, chunksize=1) as reader:
    for chunk in reader:
        if chunk[chunk['overall'] > 3]:
            pos_revs.append(chunk['reviewText'])
        elif chunk[chunk['overall'] < 3]:
            neg_revs.append(chunk['reviewText'])
        if (len(pos_revs) == 250000) & (len(neg_revs) == 250000):
            break

but I got this error:

ValueError: The truth value of a DataFrame is ambiguous. Use a.empty, a.bool(), a.item(), a.any() or a.all().

I looked at similar questions, but their JSON files do not look like mine.

BeRT2me

Loading lines one at a time into DataFrames just to check their rating is incredibly inefficient. It's better to treat each record as a dictionary and build the Series at the end.

import gzip
import json
import pandas as pd

TARGET = 250000  # reviews wanted per class

def parse(path):
    # Stream the gzipped file one line at a time; each line is one JSON record.
    with gzip.open(path, 'rt', encoding='utf-8') as g:
        for l in g:
            yield json.loads(l)

pos_revs = []
neg_revs = []
for line in parse('Books_5.json.gz'):
    rating = line['overall']
    review = line.get('reviewText')  # some records have no review text
    if review:
        if rating > 3 and len(pos_revs) < TARGET:
            pos_revs.append(review)
        elif rating < 3 and len(neg_revs) < TARGET:
            neg_revs.append(review)
    # Stop as soon as both lists are full.
    if len(pos_revs) >= TARGET and len(neg_revs) >= TARGET:
        break

pos_revs = pd.Series(pos_revs)
neg_revs = pd.Series(neg_revs)

print(pos_revs)
print(neg_revs)

Output:

0         The King, the Mice and the Cheese by Nancy Gur...
1                                        The kids loved it!
2         My students (3 & 4 year olds) loved this book!...
3                                                   LOVE IT
4                                                    Great!
                                ...
249995               Great read. Dis t want to put it down.
249996                                     Love this series
249997    So I am one of those people who absolutely lov...
249998    I learned a great deal from this book. The Fre...
249999    Having already read Tuchman's book on the outb...
Length: 250000, dtype: object

0         Looking for a Louis Untermeyer book  from the ...
1         Completly boring!!! Yes it's a childerns book ...
2         I don't like Hillerman novels.  It was chosen ...
3         I have read many of the Hillerman books and en...
4         I really love Hillerman's books.  He is one of...
                                ...
249995    When I first started reading SUSPECT, I though...
249996    I really despised this book.  Sure it portrays...
249997    This is a bleak novel. The mindless violence t...
249998    Like the title says, this is not as good as Ch...
249999    Great concept. Predictable, poorly written sto...
Length: 250000, dtype: object
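
Since the goal in the question is a single DataFrame for sentiment analysis, the two Series can be combined into one labeled frame. This is only a sketch: the tiny Series stand in for the real `pos_revs` and `neg_revs`, and the `label` column name and the 1/0 encoding are my own assumptions, not something from the dataset.

```python
import pandas as pd

# Tiny stand-ins for the pos_revs / neg_revs Series built above.
pos_revs = pd.Series(["Loved it", "Pretty good"])
neg_revs = pd.Series(["Awful", "Not for me"])

# One row per review; label 1 = positive (4-5 stars), 0 = negative (1-2 stars).
# The "label" column name is an assumption made for this sketch.
reviews = pd.concat(
    [
        pd.DataFrame({"reviewText": pos_revs, "label": 1}),
        pd.DataFrame({"reviewText": neg_revs, "label": 0}),
    ],
    ignore_index=True,
)

# Shuffle so the two classes are interleaved; random_state is arbitrary.
reviews = reviews.sample(frac=1, random_state=0).reset_index(drop=True)
print(reviews["label"].value_counts())
```

Shuffling is optional, but it avoids feeding a model all positive reviews followed by all negative ones.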

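Alternatively, a pure pandas version could look something like this and might be faster. Note that it collects the 'summary' column rather than 'reviewText', and because whole chunks are appended it can overshoot 250000 rows: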

import pandas as pd

# Read the gzipped JSON Lines file 100000 records at a time.
reader = pd.read_json('Books_5.json.gz', lines=True, chunksize=100000)
pos_revs = pd.DataFrame()
neg_revs = pd.DataFrame()
for chunk in reader:
    # The walrus operator (:=) needs Python 3.8+.
    # pos / neg record whether each frame still needed rows for this chunk.
    if pos := (len(pos_revs) < 250000):
        temp_pos = chunk[chunk['overall'].gt(3)][['summary']]
        pos_revs = pd.concat([pos_revs, temp_pos], ignore_index=True)
        # OPTIONAL:
        pos_revs.drop_duplicates(inplace=True, ignore_index=True)
    if neg := (len(neg_revs) < 250000):
        temp_neg = chunk[chunk['overall'].lt(3)][['summary']]
        neg_revs = pd.concat([neg_revs, temp_neg], ignore_index=True)
        # OPTIONAL:
        neg_revs.drop_duplicates(inplace=True, ignore_index=True)
    # Stop once both frames were already full when this chunk was checked.
    if not neg and not pos:
        break

print(pos_revs)
print(neg_revs)

Output:

                                                  summary
0               A story children will love and learn from
1                                              Five Stars
2                                           Not Nice Mice
3                        One of my favorite kids' stories
4                   One of our families favorite books!!!
...                                                   ...
294397  PRATCHETT ON TOP FORM WITH THIS BRILLIANT NEW ...
294398        Pratchett's aphorisms get better and better
294399  Thief of Time - John Deakins for ABSOLUTE MAGN...
294400                           An absolute masterpiece!
294401                     Beautiful, Engaging, A Classic

[294402 rows x 1 columns]

                                                  summary
0                                               Two Stars
1                                  Don't waste your money
2                                    Tony missed the mark
3                              Don't Start with This One!
4                                         Nothing special
...                                                   ...
253377                         Not the caliber of "Naked"
253378                 "Weak Stories" b/w "One Ace Essay"
253379      Fast service - product smells of mildew/mold.
253380  Nothing new, classical narration, good against...
253381  This is a magnificent book-hardcover version b...

[253382 rows x 1 columns]
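
Both frames above end up with more than 250000 rows. If exactly 250000 are needed, they can be trimmed afterwards. A minimal sketch, assuming a small stand-in frame and a made-up `TARGET` of 3 in place of 250000:

```python
import pandas as pd

TARGET = 3  # stands in for 250000

# Stand-in for a frame that overshot the target, as in the output above.
pos_revs = pd.DataFrame({"summary": ["a", "b", "c", "d", "e"]})

# Keep the first TARGET rows in file order...
first_n = pos_revs.head(TARGET)

# ...or draw TARGET rows at random (random_state is arbitrary).
random_n = pos_revs.sample(n=TARGET, random_state=0)

print(len(first_n), len(random_n))
```

`head` preserves file order, while `sample` avoids taking only reviews from the start of the file; which one is preferable depends on the analysis.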

I'm not sure which method is faster, but both take less than a minute to run on the 6GB file.
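
As for the ValueError in the question: `chunk[chunk['overall'] > 3]` is itself a DataFrame, and pandas refuses to treat a DataFrame as True or False inside an `if`. The sketch below reproduces the error on a one-row stand-in chunk (made-up data) and shows two explicit checks that work instead.

```python
import pandas as pd

# One-row stand-in for a chunk read with chunksize=1 (made-up data).
chunk = pd.DataFrame({"overall": [5.0], "reviewText": ["Loved it"]})

filtered = chunk[chunk["overall"] > 3]  # a DataFrame, not a bool

error_message = None
try:
    if filtered:  # this is what the question's code does
        pass
except ValueError as exc:
    error_message = str(exc)
    print(error_message)

# Explicit alternatives that return a plain True/False:
has_rows = not filtered.empty            # did any row survive the filter?
any_high = (chunk["overall"] > 3).any()  # does any rating exceed 3?
print(has_rows, any_high)
```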


edited at 2022-07-3
