Scraping text between non-descriptive tags

Trev

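In some cases, the text I want sits between tags whose values and attributes are vague and occur many times throughout the file (for example, "" is reused).

Ultimately, I want to extract "Prev Close:" and "565.07" and put that information into something like a string or a list (suggestions welcome).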


Relevant part of the HTML source:

<div class="yui-u first yfi-start-content"><div class="yfi_quote_summary"><div id="yfi_quote_summary_data" class="rtq_table"><table id="table1"><tr><th scope="row" width="48%">Prev Close:</th><td class="yfnc_tabledata1">565.07</td></tr>

My code (Python 3.4.1):

soup = BeautifulSoup(data) # data contains the HTML source

FirstTable_tag = soup.find('div', attrs={'class': '"yui-u first yfi-start-content"'})
# Should the keys (attributes) in the "findNextSibling parameters below be filled in or left empty???
next_FirstTable_tag = FirstTable_tag.findNextSibling('div', attrs={'class': '"yfi_quote_summary"'})     
next_next_FirstTable_tag = next_FirstTable_tag.findNextSibling('div', attrs={'id': '"yfi_quote_sumary_data"', 'class': '"rtq_table"'})
next_next_next_FirstTable_tag = next_next_FirstTable_tag.findNextSibling('table', attrs={'id': '"table1"'})
data = next_next_next_FirstTable_tag.get_text()

SelectSoup = BeautifulSoup(data)
print("SelectSoup:" + SelectSoup + "(should be:  Prev Close)")

Error

Traceback (most recent call last):
    next_FirstTable_tag = FirstTable_tag.findNextSibling          
AttributeError: 'NoneType' object has no attribute 'findNextSibling'
<<< Process finished. (Exit code 1)
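A note on the traceback above, offered as a sketch rather than a definitive diagnosis: `soup.find` returns `None` when nothing matches, and the extra double quotes inside `'"yui-u first yfi-start-content"'` become part of the value being searched for. Separately, in the HTML sample the `div` and `table` elements are nested inside one another, so they are descendants rather than siblings, and `findNextSibling` would not reach them. The snippet below assumes the HTML fragment from the question (with closing tags added so it is self-contained) and the standard `html.parser` backend.

```python
from bs4 import BeautifulSoup

# The fragment from the question, with closing tags added for completeness.
html = (
    '<div class="yui-u first yfi-start-content"><div class="yfi_quote_summary">'
    '<div id="yfi_quote_summary_data" class="rtq_table"><table id="table1"><tr>'
    '<th scope="row" width="48%">Prev Close:</th>'
    '<td class="yfnc_tabledata1">565.07</td></tr></table></div></div></div>'
)
soup = BeautifulSoup(html, "html.parser")

# The embedded double quotes are treated as part of the class name: no match.
broken = soup.find("div", attrs={"class": '"yfi_quote_summary"'})
print(broken)  # None

# Without them the div is found; the table is nested inside it, so search downward.
summary = soup.find("div", attrs={"class": "yfi_quote_summary"})
table = summary.find("table", attrs={"id": "table1"})
label = table.find("th").get_text()
value = table.find("td").get_text()
print([label, value])  # ['Prev Close:', '565.07']
```

Calling `find` on a tag searches only that tag's descendants, which is what replaces the chain of sibling lookups here.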

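Edit

Here is the original, complete source as requested

Although I have since started using Yahoo's API, which is clearly the better approach, I am still working through this out of curiosity with the help of @scandinavian_.

I have updated the code above, but I still get the same error.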


Edit 2

From here on, this post focuses on the solution that @scandinavian_ is helping to develop:

import sys
import urllib.request
url = "http://finance.yahoo.com/q?s=GOOG"
urlRunner = urllib.request.urlopen(url)
data = urlRunner.read()

from bs4 import BeautifulSoup
soup = BeautifulSoup(data)

import re
tables = soup.findAll("table", id = re.compile('^table'))
result = {}
for table in tables:
    for th, td in zip(table.findAll("th"), table.findAll("td")):
        result[th.text] = td.text
print(result)

Result:

{'52wk Range:': '502.80 - 604.83', 'Market Cap:': '381.04B', 'Next Earnings Date:': 'N/A', 'P/E (ttm):': '29.52', 'Avg Vol (3m):': '1,701,610', 'EPS (ttm):': '19.09', '1y Target Est:': 'N/A', 'Volume:': '561,384', 'Ask:': '563.98 x 100', 'Div & Yield:': 'N/A (N/A) ', 'Bid:': '563.56 x 100', 'Beta:': '1.144', 'Open:': '568.00', "Day's Range:": '562.53 - 569.77', 'Prev Close:': '566.37'}
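To get back to the original goal of holding "Prev Close:" and its value in a string or list, the dictionary can simply be indexed by label. A minimal sketch, using a three-entry subset of the result above; `to_number` is a hypothetical helper, not part of any library, and it only handles plain numbers with thousands separators (not values such as '381.04B' or 'N/A').

```python
# Subset of the dictionary printed above.
result = {'Prev Close:': '566.37', 'Open:': '568.00', 'Volume:': '561,384'}

def to_number(text):
    """Hypothetical helper: drop thousands separators and parse as float."""
    return float(text.replace(",", ""))

# Label and value together as a list of strings.
pair = ['Prev Close:', result['Prev Close:']]
print(pair)  # ['Prev Close:', '566.37']

# The same value as a number, for arithmetic.
prev_close = to_number(result['Prev Close:'])
volume = to_number(result['Volume:'])
print(prev_close, volume)  # 566.37 561384.0
```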

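scandinavian_

This is based on what I think you want, but it is hard to say without a proper data sample; I can't guess its structure. From your description it sounds as though the data is irregular, which doesn't show in your sample.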

from bs4 import BeautifulSoup

html = """<div class="yui-u first yfi-start-content">
    <div class="yfi_quote_summary">
        <div id="yfi_quote_summary_data" class="rtq_table">
            <table id="table1">
                <tr>
                    <th scope="row" width="48%">Target Point:</th>
                    <td class="yfnc_tabledata1">200.22</td>
                </tr>
                <tr>
                    <th scope="row" width="48%">Target Point:</th>
                    <td class="yfnc_tabledata1">200.22</td>
                </tr>
                <tr>
                    <th scope="row" width="48%">Target Point:</th>
                    <td class="yfnc_tabledata1">200.22</td>
                </tr>
            </table>
        </div>
    </div>
</div>"""

# Name the parser explicitly so the result does not depend on what is installed.
bs = BeautifulSoup(html, "html.parser")

# Collect all header cells and all data cells in document order.
ths = bs.findAll("th")
tds = bs.findAll("td")

# Pair each header with the data cell at the same position.
# The built-in zip replaces itertools.izip, which does not exist in Python 3.
result = []

for x, y in zip(ths, tds):
    result.append((x.text, y.text))

print(result)
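The code above pairs every `th` with every `td` purely by position, which works as long as each row has exactly one of each. If the data really is irregular, as the question suggests, pairing within each `tr` may be safer. A hedged sketch: the table below is an invented sample (the "Notes:" row without a value is hypothetical), parsed with the standard `html.parser` backend.

```python
from bs4 import BeautifulSoup

# Invented sample: the middle row has a header cell but no data cell.
html = """<table id="table1">
<tr><th scope="row">Prev Close:</th><td class="yfnc_tabledata1">565.07</td></tr>
<tr><th scope="row">Notes:</th></tr>
<tr><th scope="row">Open:</th><td class="yfnc_tabledata1">561.78</td></tr>
</table>"""

soup = BeautifulSoup(html, "html.parser")

rows = []
for tr in soup.find_all("tr"):
    th = tr.find("th")
    td = tr.find("td")
    # Keep only rows that contain both a label and a value.
    if th is not None and td is not None:
        rows.append((th.get_text(strip=True), td.get_text(strip=True)))

print(rows)  # [('Prev Close:', '565.07'), ('Open:', '561.78')]
```

With position-based pairing, the incomplete row would shift every later label onto the wrong value; pairing per row keeps each label with the value in its own row.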

Edit:

Here is a Yahoo API solution; consider using this approach instead:

import requests

URL = "https://query.yahooapis.com/v1/public/yql"

query = 'select * from yahoo.finance.quotes where symbol in ("GOOG")'

params = {
    "q": query,
    "format": "json",
    "env": "store://datatables.org/alltableswithkeys"
}

data = requests.get(URL, params=params).json()

# print() with parentheses works on both Python 2 and Python 3.
print(data['query']['results']['quote']['PreviousClose'])
print(data['query']['results']['quote']['Open'])

This prints:

565.07
561.78
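The chained lookups above raise an exception if any level of the response is missing. As a rough offline sketch, the snippet below reads the same fields defensively. The response shape is an assumption inferred only from the key path used above, and `get_fields` is a hypothetical helper, not part of `requests` or of the Yahoo API.

```python
import json

# Assumed response shape, inferred from the key path in the code above.
sample = json.loads(
    '{"query": {"results": {"quote": '
    '{"PreviousClose": "565.07", "Open": "561.78"}}}}'
)

def get_fields(payload, names):
    """Hypothetical helper: return the requested quote fields, or None where absent."""
    query = payload.get("query") or {}
    results = query.get("results") or {}
    quote = results.get("quote") or {}
    return {name: quote.get(name) for name in names}

fields = get_fields(sample, ["PreviousClose", "Open"])
print(fields["PreviousClose"])
print(fields["Open"])

# A payload with no results yields None values instead of raising.
empty = get_fields({"query": {"results": None}}, ["PreviousClose"])
print(empty)
```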

Here is the data available for a stock:

AfterHoursChangeRealtime
AnnualizedGain
Ask
AskRealtime
AverageDailyVolume
Bid
BidRealtime
BookValue
Change
Change_PercentChange
ChangeFromFiftydayMovingAverage
ChangeFromTwoHundreddayMovingAverage
ChangeFromYearHigh
ChangeFromYearLow
ChangeinPercent
ChangePercentRealtime
ChangeRealtime
Commission
Currency
DaysHigh
DaysLow
DaysRange
DaysRangeRealtime
DaysValueChange
DaysValueChangeRealtime
DividendPayDate
DividendShare
DividendYield
EarningsShare
EBITDA
EPSEstimateCurrentYear
EPSEstimateNextQuarter
EPSEstimateNextYear
ErrorIndicationreturnedforsymbolchangedinvalid
ExDividendDate
FiftydayMovingAverage
HighLimit
HoldingsGain
HoldingsGainPercent
HoldingsGainPercentRealtime
HoldingsGainRealtime
HoldingsValue
HoldingsValueRealtime
LastTradeDate
LastTradePriceOnly
LastTradeRealtimeWithTime
LastTradeTime
LastTradeWithTime
LowLimit
MarketCapitalization
MarketCapRealtime
MoreInfo
Name
Notes
OneyrTargetPrice
Open
OrderBookRealtime
PEGRatio
PERatio
PERatioRealtime
PercebtChangeFromYearHigh
PercentChange
PercentChangeFromFiftydayMovingAverage
PercentChangeFromTwoHundreddayMovingAverage
PercentChangeFromYearLow
PreviousClose
PriceBook
PriceEPSEstimateCurrentYear
PriceEPSEstimateNextYear
PricePaid
PriceSales
SharesOwned
ShortRatio
StockExchange
symbol
Symbol
TickerTrend
TradeDate
TwoHundreddayMovingAverage
Volume
YearHigh
YearLow
YearRange
