Saving scraped results to a Pandas DataFrame

欧比万科托比

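I am using the code below to scrape information from several Google Scholar result pages with Selenium and Beautiful Soup.

I can print all of the scraped information, but I cannot get the results into a single DataFrame for export.

How can I save the fields of each search result (title, authors, link, abstract)?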

# Initialize the DataFrame
data = {
        "Titel": [],
        "Link" :[],
        "Authoren" : [] ,
        "Veröffentlichungsjahr" : [],
        "Abstract" :[] 
        }
df = pd.DataFrame(data)


# Location of the local chromedriver binary
PATH = '/Applications/chromedriver'

driver = webdriver.Chrome(PATH)

# Open the URL
driver.get('https://scholar.google.de/')
time.sleep(5)

# Find the search bar and fill it in
search = driver.find_element_by_id('gs_hdr_tsi')
search.send_keys('"circular economy"AND "Dlt" AND "Germany" AND "Sweden"')
time.sleep(5)
search.send_keys(Keys.RETURN)

## Number of results --> divided by 10 this gives the number of clicks on "Weiter" (next)
Anzahl = driver.find_element_by_id('gs_ab_md').text
x=re.findall(r'\d+', Anzahl)[0]
# Click "Weiter" y times
y = int(int(x)/10)+1
print("Seitenanzahl:", y)
i=0

for i in range(2): #y
    # Add a barrier so that Selenium pauses until the results have loaded

    page_source = driver.page_source
    soup = BeautifulSoup(page_source, 'lxml')
        
    for item in soup.select('[data-lid]'): 
            try: 
                print('----------------------------------------') 
       # print(item) 
                print(item.select('h3')[0].get_text()) 
                title = item.select('h3')[0].get_text()

                print(item.select('a')[0]['href'])
                link = item.select('a')[0]['href']

                print(item.select('.gs_a')[0].get_text()) 
                author = item.select('.gs_a')[0].get_text()

                txt = item.select('.gs_a')[0].get_text()
                print(re.findall(r'\d+', txt)[0])

                year = re.findall(r'\d+', txt)[0]
                print(item.select('.gs_rs')[0].get_text()) 

                abstract = item.select('.gs_rs')[0].get_text()

                data_2 = {
                "Titel" : title,
                "Link" : link,
                "Authoren" : author,
                "Veröffentlichungsjahr" : year,
                "Abstract" : abstract
                }
                
                df_new = pd.DataFrame(data_2)

                df = df.append(df_new, ignore_index=True)

                print('----------------------------------------') 
            except Exception as e: 
                #raise e
                print('---')
    # Random wait (1-14 seconds) before the next page is requested, to avoid IP blocks
    
    w = random.randint(1,14)
    time.sleep(w)
    try:
        driver.find_element_by_link_text('Weiter').click()
    except:    
        driver.quit()

    i+=1

科拉连

Don't build the DataFrame inside the loop. Instead, collect the records in a list of dictionaries and create the DataFrame once at the end.
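To see why the original loop never fills the DataFrame: `pd.DataFrame(data_2)` receives a dictionary whose values are all scalars, which pandas rejects with a `ValueError` unless an index is supplied, and the surrounding `except Exception` swallows that error and only prints `---`. The sketch below reproduces this with made-up sample values (the record contents are placeholders, not real scraped data) and then shows the list-of-dictionaries pattern.

```python
import pandas as pd

# Placeholder record standing in for one scraped search result
record = {
    "Titel": "Example title",
    "Link": "https://example.org",
    "Veröffentlichungsjahr": "2021",
}

# A dict of plain scalars cannot be turned into a DataFrame directly
try:
    pd.DataFrame(record)
    scalar_error = None
except ValueError as err:
    scalar_error = err
print("Error from scalar dict:", scalar_error)

# Collect one dict per result, then build the DataFrame once at the end
records = []
for n in range(3):
    records.append({**record, "Titel": f"Example title {n}"})

df = pd.DataFrame(records)
print(df.shape)
```

Building the frame once also avoids `DataFrame.append`, which was deprecated in pandas 1.4 and removed in pandas 2.0.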

New code (search for # <- HERE):

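import random
import re
import time

import pandas as pd
from bs4 import BeautifulSoup
from selenium import webdriver
from selenium.webdriver.common.keys import Keys

# Location of the local chromedriver binary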
PATH = '/Applications/chromedriver'

driver = webdriver.Chrome(PATH)

# Open the URL
driver.get('https://scholar.google.de/')
time.sleep(5)

# Find the search bar and fill it in
search = driver.find_element_by_id('gs_hdr_tsi')
search.send_keys('"circular economy"AND "Dlt" AND "Germany" AND "Sweden"')
time.sleep(5)
search.send_keys(Keys.RETURN)

## Number of results --> divided by 10 this gives the number of clicks on "Weiter" (next)
Anzahl = driver.find_element_by_id('gs_ab_md').text
x=re.findall(r'\d+', Anzahl)[0]
# Click "Weiter" y times
y = int(int(x)/10)+1
print("Seitenanzahl:", y)

records = []  # <- HERE
for i in range(2): #y
    # Add a barrier so that Selenium pauses until the results have loaded

    page_source = driver.page_source
    soup = BeautifulSoup(page_source, 'lxml')
        
    for item in soup.select('[data-lid]'): 
            try: 
                print('----------------------------------------') 
       # print(item) 
                print(item.select('h3')[0].get_text()) 
                title = item.select('h3')[0].get_text()

                print(item.select('a')[0]['href'])
                link = item.select('a')[0]['href']

                print(item.select('.gs_a')[0].get_text()) 
                author = item.select('.gs_a')[0].get_text()

                txt = item.select('.gs_a')[0].get_text()
                print(re.findall(r'\d+', txt)[0])

                year = re.findall(r'\d+', txt)[0]
                print(item.select('.gs_rs')[0].get_text()) 

                abstract = item.select('.gs_rs')[0].get_text()

                records.append({  # <- HERE
                "Titel" : title,
                "Link" : link,
                "Authoren" : author,
                "Veröffentlichungsjahr" : year,
                "Abstract" : abstract
                })

                print('----------------------------------------') 
            except Exception as e: 
                #raise e
                print('---')
    # Random wait (1-14 seconds) before the next page is requested, to avoid IP blocks
    
    w = random.randint(1,14)
    time.sleep(w)
    try:
        driver.find_element_by_link_text('Weiter').click()
    except Exception:
        # No "Weiter" (next) link found: close the browser and stop paging
        driver.quit()
        break

df = pd.DataFrame(records)  # <- HERE

Output:

>>> df
                                               Titel  ...                                           Abstract
0  Shifting infrastructure landscapes in a circul...  ...  … [Google Scholar] [CrossRef]; Kirchherr, J.; ...
1  Demand-supply matching through auctioning for ...  ...  … 12, 76131 Karlsruhe, Germany cPolitecnico di...
2  Using internet of things and distributed ledge...  ...  … The authors were able to show how a combinat...
3  The impact of Blockchain Technology on the Tra...  ...  … In the broader sense, Blockchain is a Distri...
4  Assessing the role of triple helix system inte...  ...  … depends upon the successful diffusion of sev...
5  Circular Digital Built Environment: An Emergin...  ...  … For example, when searching for articles rel...
6  [PDF][PDF] Phillip Bendix (Wuppertal Institute...  ...  … Stadtreinigung Hamburg (Germany): AI image …...
7  Waste Management–A Case Study of Producer Resp...  ...  … A similar study in Germany reported an inter...
8  Blockchain in the built environment and constr...  ...  … changes in regulation can facilitate industr...

[9 rows x 5 columns]

>>> df.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 9 entries, 0 to 8
Data columns (total 5 columns):
 #   Column                 Non-Null Count  Dtype
---  ------                 --------------  -----
 0   Titel                  9 non-null      object
 1   Link                   9 non-null      object
 2   Authoren               9 non-null      object
 3   Veröffentlichungsjahr  9 non-null      object
 4   Abstract               9 non-null      object
dtypes: object(5)
memory usage: 488.0+ bytes
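
The `df.info()` output shows every column as `object`, including `Veröffentlichungsjahr`, because the regex match is a string. If the year is needed as a number, one option is `pd.to_numeric`. The sketch below uses placeholder values and assumes that missing or unparsable years should become `<NA>` rather than raise an error.

```python
import pandas as pd

# Placeholder values; "n/a" stands in for a result without a usable year
df = pd.DataFrame({"Veröffentlichungsjahr": ["2021", "2019", "n/a"]})

# errors="coerce" turns unparsable entries into NaN;
# the nullable "Int64" dtype keeps the years as integers alongside <NA>
df["Veröffentlichungsjahr"] = pd.to_numeric(
    df["Veröffentlichungsjahr"], errors="coerce"
).astype("Int64")

print(df.dtypes)
```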

You can now export your data with df.to_csv(...).
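
A minimal export sketch, assuming a CSV file is the target. The file name `scholar_results.csv`, the temporary directory, and the sample row are placeholders; `index=False` leaves out the row index, and `encoding="utf-8-sig"` is one choice that tends to help spreadsheet programs display non-ASCII column names such as `Veröffentlichungsjahr` correctly.

```python
import os
import tempfile

import pandas as pd

# Placeholder row standing in for the scraped records
df = pd.DataFrame([{"Titel": "Example title", "Veröffentlichungsjahr": "2021"}])

# Write to a temporary directory so the sketch leaves no file in the working directory
out_path = os.path.join(tempfile.mkdtemp(), "scholar_results.csv")
df.to_csv(out_path, index=False, encoding="utf-8-sig")

# Read the file back as strings to check the round trip
restored = pd.read_csv(out_path, dtype=str, encoding="utf-8-sig")
print(restored)
```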
