Extracting data from HTML and formatting the output

White Glove

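Introduction

I am currently teaching myself web scraping, both to pick up new skills and purely as a hobby.

So far, using code I wrote in Java with the Jsoup library, I have been able to extract data from a website (after studying its structure a little).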

//To input the html file
   File inputFile = new File("test2.html");
   Document doc = Jsoup.parse(inputFile, "Unicode");

   //To grab the part we are working with (knowing the website for sure)
   Element content = doc.getElementById("mainContent");
   Elements tds = doc.select("[class=nowrap]");
   System.out.println(tds.text());


    (Note that I am working from an HTML file.)

So far, I get this "desired" output:

 <td align="right" class="nowrap"> <a href="website" onclick="return 
 doWindow(this, 700, 500);" class="popup">0</a> </td>
 <td align="right" class="nowrap"><a href="website" 
 onclick="doWindow(this.href, '1024', '768'); return false;">10 000</a> [10 
 000]</td>
 <td align="right" class="nowrap">10 000</td>
 <td align="right" class="nowrap">20.48</td>
 <td align="right" class="nowrap">0.00</td>
 <td align="right" class="nowrap">$28.65</td>
 <td align="right" class="nowrap">0.00 %</td>
 <td align="right" class="nowrap">$894.69</td>
 <td align="right" class="nowrap">10.11</td>
 <td align="right" class="nowrap">0.21</td>
 <td align="right" class="nowrap"> <a href="website" onclick="return 
  doWindow(this, 700, 500);" class="popup">0</a> </td>
 <td align="right" class="nowrap"><a href="website" 
  onclick="doWindow(this.href, '1024', '768'); return false;">10 000</a> [10 
  000]</td>
  <td align="right" class="nowrap">10 000</td>
  <td align="right" class="nowrap">46.21</td>
  <td align="right" class="nowrap">0.00</td>
  <td align="right" class="nowrap">$53.82</td>
  <td align="right" class="nowrap">0.00 %</td>
  <td align="right" class="nowrap">$1 151.78</td>
  <td align="right" class="nowrap">8.01</td>
  <td align="right" class="nowrap">0.00</td>
  <td align="right" class="nowrap"> <a href="website" onclick="return 
 doWindow(this, 700, 500);" class="popup">0</a> </td>
 <td align="right" class="nowrap"><a href="website" 
  onclick="doWindow(this.href, '1024', '768'); return false;">5 000</a> [5 
  000]</td>
  <td align="right" class="nowrap">5 000</td>
  <td align="right" class="nowrap">22.51</td>
  <td align="right" class="nowrap">0.00</td>
  <td align="right" class="nowrap">$222.53</td>
  <td align="right" class="nowrap">0.00 %</td>
  <td align="right" class="nowrap">$2 399.92</td>
  <td align="right" class="nowrap">5.94</td>
  <td align="right" class="nowrap">0.01</td>

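Problem

I am mostly interested in the text that contains the exact numbers (as strings), because I want to do some math on them afterwards.

So I kept reading the Jsoup documentation and found that I can use .text() to get rid of the HTML markup, which gives me one long string of numbers from the HTML file: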

0 10 000 [10 000] 10 000 20.48 0.00 $28.65 0.00 % $894.69 10.11 0.21 0 10 
000 [10 000] 10 000 46.21 0.00 $53.82 0.00 % $1 151.78 8.01 0.00 0 5 000 [5 
000] 5 000 22.51 0.00 $222.53 0.00 % $2 399.92 5.94 0.01

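How can I split this into 3 strings and be able to work with the numbers?

I have seen in other questions that one approach might be RegEx, but I still could not get the result I want.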
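Before any math can be done, each cell's text has to become a number. The following is only a minimal sketch of that step, not part of the original code: the class name CellNumbers and the method toNumber are hypothetical, and it assumes one number per cell, a '.' decimal separator, and spaces used only as thousands separators (as in the sample output). A cell such as "10 000 [10 000]" holds two numbers and would need to be split first.

```java
import java.math.BigDecimal;

public class CellNumbers {

    // Hypothetical helper: converts cell text such as "$1 151.78" or "0.00 %"
    // into a BigDecimal by removing every character that is not a digit,
    // a decimal point, or a minus sign.
    // Assumption: the cell contains exactly one number.
    public static BigDecimal toNumber(String cellText) {
        String cleaned = cellText.replaceAll("[^0-9.\\-]", "");
        return new BigDecimal(cleaned);
    }

    public static void main(String[] args) {
        System.out.println(toNumber("10 000"));     // prints 10000
        System.out.println(toNumber("$1 151.78"));  // prints 1151.78
        System.out.println(toNumber("0.00 %"));     // prints 0.00
    }
}
```

BigDecimal is used here rather than double so that currency values such as $894.69 keep their exact decimal digits.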

Edit: made some progress

After some research, I found a way to convert to text and access the data I need:

tds.get(key).text();

where key is an integer index giving the position of the value within the output obtained above.
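Once the cell texts are available one by one, they can be grouped into rows. This is a rough sketch under assumptions of my own: the class RowGrouper and the method toRows are hypothetical names, and the row width of 10 is simply counted from the sample output (each record there has 10 nowrap cells). The cell texts are typed in by hand so the example runs without Jsoup.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

public class RowGrouper {

    // Splits a flat list of cell texts into consecutive rows of rowSize cells.
    // A trailing partial row, if there is one, is kept as a shorter row.
    public static List<List<String>> toRows(List<String> cells, int rowSize) {
        if (rowSize <= 0) {
            throw new IllegalArgumentException("rowSize must be positive");
        }
        List<List<String>> rows = new ArrayList<>();
        for (int start = 0; start < cells.size(); start += rowSize) {
            int end = Math.min(start + rowSize, cells.size());
            rows.add(new ArrayList<>(cells.subList(start, end)));
        }
        return rows;
    }

    public static void main(String[] args) {
        // First two records of the sample output, one entry per cell.
        List<String> cells = Arrays.asList(
                "0", "10 000 [10 000]", "10 000", "20.48", "0.00",
                "$28.65", "0.00 %", "$894.69", "10.11", "0.21",
                "0", "10 000 [10 000]", "10 000", "46.21", "0.00",
                "$53.82", "0.00 %", "$1 151.78", "8.01", "0.00");
        for (List<String> row : toRows(cells, 10)) {
            System.out.println(row);
        }
    }
}
```

With the full sample (30 cells) this grouping would yield the 3 rows asked about in the question.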

This solves part of my problem, but one value I need is only available as an attribute in the HTML:

<td align="center">
        <input type="text" tabindex="2" name="productData[price]       
        [{33013477}]" size="10" value="3000.00">    
</td>

The value I need is in the attribute value="3000.00".

Thank you for taking the time to look at this question.

Demon

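To extract data from HTML source, I use a small method named getBetween() to do the job. The data I personally want always seems to sit between two strings of some kind: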

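import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

/**
 * Retrieves any string data located between the supplied string leftString
 * parameter and the supplied string rightString parameter.<br><br>
 * <p>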
 * This method will return all instances of a substring located between the
 * supplied Left String and the supplied Right String which may be found
 * within the supplied Input String.<br>
 *
 * @param inputString (String) The string to look for substring(s) in.
 *
 * @param leftString  (String) What may be to the Left side of the substring
 *                    we want within the main input string. Sometimes the
 *                    substring you want may be contained at the very
 *                    beginning of a string and therefore there is no
 *                    Left-String available. In this case you would simply
 *                    pass an empty string ("") to this parameter which
 *                    basically informs the method of this fact. Null can
 *                    not be supplied and will ultimately generate a
 *                    NullPointerException.
 *
 * @param rightString (String) What may be to the Right side of the
 *                    substring we want within the main input string.
 *                    Sometimes the substring you want may be contained at
 *                    the very end of a string and therefore there is no
 *                    Right-String available. In this case you would simply
 *                    pass an empty string ("") to this parameter which
 *                    basically informs the method of this fact. Null can
 *                    not be supplied and will ultimately generate a
 *                    NullPointerException.
 *
 * @param options     (Optional - Boolean - 2 Parameters):<pre>
 *
 *      ignoreLetterCase    - Default is false. This option works against the
 *                            string supplied within the leftString parameter
 *                            and the string supplied within the rightString
 *                            parameter. If set to true then letter case is
 *                            ignored when searching for strings supplied in
 *                            these two parameters. If left at default false
 *                            then letter case is not ignored.
 *
 *      trimFound           - Default is true. By default this method will trim
 *                            off leading and trailing white-spaces from found
 *                            sub-string items. General sentences which obviously
 *                            contain spaces will almost always give you a white-
 *                            space within an extracted sub-string. By setting
 *                            this parameter to false, leading and trailing white-
 *                            spaces are not trimmed off before they are placed
 *                            into the returned Array.</pre>
 *
 * @return (1D String Array) Returns a Single Dimensional String Array
 *         containing all the sub-strings found within the supplied Input
 *         String which are between the supplied Left String and supplied
 *         Right String. You can shorten this method up a little by
 *         returning a List&lt;String&gt; ArrayList and removing the 'List
 *         to 1D Array' conversion code at the end of this method. This
 *         method initially stores its findings within a List object
 *         anyways.
 */
public String[] getBetween(String inputString, String leftString, 
                    String rightString, boolean... options) {
    // Return nothing if nothing was supplied.
    if (inputString.equals("") || (leftString.equals("") && rightString.equals(""))) {
        return null;
    }

    // Prepare optional parameters if any supplied.
    // If none supplied then use Defaults...
    boolean ignoreCase = false; // Default.
    boolean trimFound = true;   // Default.
    if (options.length >= 1) {
        ignoreCase = options[0];
    }
    if (options.length >= 2) {
        trimFound = options[1];
    }

    // Remove any ASCII control characters from the
    // supplied string (if they exist).
    String modString = inputString.replaceAll("\\p{Cntrl}", "");

    // Establish a List String Array Object to hold
    // our found substrings between the supplied Left
    // String and supplied Right String.
    List<String> list = new ArrayList<>();

    // Use Pattern Matching to locate our possible
    // substrings within the supplied Input String.
    String regEx = Pattern.quote(leftString)
            + (!rightString.equals("") ? "(.*?)" : "(.*)?")
            + Pattern.quote(rightString);
    if (ignoreCase) {
        regEx = "(?i)" + regEx;
    }
    Pattern pattern = Pattern.compile(regEx);
    Matcher matcher = pattern.matcher(modString);
    while (matcher.find()) {
        // Add the found substrings into the List.
        String found = matcher.group(1);
        if (trimFound) {
            found = found.trim();
        }
        list.add(found);
    }

    String[] res;
    // Convert the ArrayList to a 1D String Array.
    // If the List contains something then convert
    if (list.size() > 0) {
        res = new String[list.size()];
        res = list.toArray(res);
    } // Otherwise return Null.
    else {
        res = null;
    }
    // Return the String Array.
    return res;
}
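The Javadoc above mentions that the method could be shortened by returning a List instead of an array. A compact sketch of that idea follows; it is not the method above, just a trimmed-down illustration. The names BetweenSketch and between are hypothetical, matching is case-sensitive, results are always trimmed, and both delimiters are assumed to be non-empty. An empty list (rather than null) signals that nothing was found.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class BetweenSketch {

    // Returns every trimmed substring located between left and right.
    // Assumption: left and right are both non-empty.
    public static List<String> between(String input, String left, String right) {
        List<String> found = new ArrayList<>();
        // Pattern.quote makes the delimiters literal, so characters such as
        // '[' or '$' in them are not treated as regex syntax.
        Pattern pattern = Pattern.compile(
                Pattern.quote(left) + "(.*?)" + Pattern.quote(right));
        Matcher matcher = pattern.matcher(input);
        while (matcher.find()) {
            found.add(matcher.group(1).trim());
        }
        return found;
    }

    public static void main(String[] args) {
        String cells = "<td align=\"right\" class=\"nowrap\">20.48</td>"
                     + "<td align=\"right\" class=\"nowrap\">$28.65</td>";
        // prints [20.48, $28.65]
        System.out.println(between(cells, "class=\"nowrap\">", "</td>"));

        String inputTag = "[{33013477}]\" size=\"10\" value=\"3000.00\">";
        // prints [3000.00]
        System.out.println(between(inputTag, "value=\"", "\""));
    }
}
```

Returning an empty list spares callers a null check before indexing or looping over the result.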

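Getting the HTML source of a web page is easy. To pull the desired numeric values out of the "desired output" you originally posted (shown below):

HTML source: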

 <td align="right" class="nowrap"> <a href="website" onclick="return 
 doWindow(this, 700, 500);" class="popup">0</a> </td>
 <td align="right" class="nowrap"><a href="website" 
 onclick="doWindow(this.href, '1024', '768'); return false;">10 000</a> [10 
 000]</td>
 <td align="right" class="nowrap">10 000</td>
 <td align="right" class="nowrap">20.48</td>
 <td align="right" class="nowrap">0.00</td>
 <td align="right" class="nowrap">$28.65</td>
 <td align="right" class="nowrap">0.00 %</td>
 <td align="right" class="nowrap">$894.69</td>
 <td align="right" class="nowrap">10.11</td>
 <td align="right" class="nowrap">0.21</td>
 <td align="right" class="nowrap"> <a href="website" onclick="return 
  doWindow(this, 700, 500);" class="popup">0</a> </td>
 <td align="right" class="nowrap"><a href="website" 
 onclick="doWindow(this.href, '1024', '768'); return false;">10 000</a> [10 
 000]</td>
 <td align="right" class="nowrap">10 000</td>
 <td align="right" class="nowrap">46.21</td>
 <td align="right" class="nowrap">0.00</td>
 <td align="right" class="nowrap">$53.82</td>
 <td align="right" class="nowrap">0.00 %</td>
 <td align="right" class="nowrap">$1 151.78</td>
 <td align="right" class="nowrap">8.01</td>
 <td align="right" class="nowrap">0.00</td>
 <td align="right" class="nowrap"> <a href="website" onclick="return 
 doWindow(this, 700, 500);" class="popup">0</a> </td>
 <td align="right" class="nowrap"><a href="website" 
 onclick="doWindow(this.href, '1024', '768'); return false;">5 000</a> [5 
 000]</td>
 <td align="right" class="nowrap">5 000</td>
 <td align="right" class="nowrap">22.51</td>
 <td align="right" class="nowrap">0.00</td>
 <td align="right" class="nowrap">$222.53</td>
 <td align="right" class="nowrap">0.00 %</td>
 <td align="right" class="nowrap">$2 399.92</td>
 <td align="right" class="nowrap">5.94</td>
 <td align="right" class="nowrap">0.01</td>
 <td align="right" class="nowrap"> <a href="website" onclick="return
 <td align="center">
     <input type="text" tabindex="2" name="productData[price]       
     [{33013477}]" size="10" value="3000.00">    
 </td>

I would use the getBetween() method along these lines:

// Let's assume the "desired output" you acquired 
// is contained within a Text file named "HtmlData.txt".

// Hold our scraped data in a 2D List interface.
List<List<String>> list = new ArrayList<>();

// Read File using BufferedReader in a Try With Resources block...
try (BufferedReader reader = new BufferedReader(new FileReader("HtmlData.txt"))) {
    String line;
    List<String> numbers = null;
    while ((line = reader.readLine()) != null) {
        numbers = new ArrayList<>();
        line = line.trim();
        if (line.equals("")) {
            continue;
        }
        if (line.startsWith("onclick=\"doWindow(this.href,")) {
            while ((line = reader.readLine()) != null) {
                line = line.trim();
                if (line.endsWith("return")) {
                    list.add(numbers);
                    break;
                }
                if (line.equals("")) {
                    continue;
                }
                if (line.startsWith("<td align=\"right\" class=\"nowrap\">")) {
                    numbers.add(getBetween(line, "<td align=\"right\" class=\"nowrap\">", "</td>", true, true)[0]);
                }
            }
        }
        // line is null here if the inner loop above hit end-of-file.
        if (line != null && line.contains("name=\"productData[price]")) {
            while ((line = reader.readLine()) != null) {
                line = line.trim();
                if (line.equals("")) {
                    continue;
                }
                if (line.startsWith("[{33013477}]")) {
                    numbers.add("Product Price: " + getBetween(line, "value=\"", "\">", true, true)[0]);
                    list.add(numbers);
                    break;  // DONE
                }
            }
        }
    }
    if (numbers != null && !numbers.isEmpty()) {
        list.add(numbers);
    }
}
catch (IOException ex) {
    ex.printStackTrace();
}

// Display our findings to the Console Window in a 
// table style format:
for (int i = 0; i < list.size(); i++) {
    for (int j = 0; j < list.get(i).size(); j++) {
        System.out.printf("%-10s ", list.get(i).get(j));
    }
    System.out.println("");
}

In case you did not notice, the other piece you wanted to find, from the lines:

<td align="center">
    <input type="text" tabindex="2" name="productData[price]       
    [{33013477}]" size="10" value="3000.00">    
</td>

is also included in the file data. After running the code, you will see the following in the console window:

10 000     20.48      0.00       $28.65     0.00 %     $894.69    10.11      0.21       
10 000     46.21      0.00       $53.82     0.00 %     $1 151.78  8.01       0.00       
5 000      22.51      0.00       $222.53    0.00 %     $2 399.92  5.94       0.01       
Product Price: 3000.00
