PCRE does not match UTF-8 characters

Hugh Darling

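I am compiling a PCRE pattern with the UTF-8 flag enabled and trying to match a UTF-8 char* string against it, but it does not match and pcre_exec returns a negative number. I pass a subject length of 65 to pcre_exec, which is the number of characters in the string. I believe it expects the number of bytes, so I also tried increasing the argument to 70, but I still get the same result. I don't know what else could be making the match fail. Please help before I lose my mind.

If I try without the PCRE_UTF8 flag, however, it does match, but offset vector[1] is 30, which is the index of the character just before the Unicode character in the input string.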

#include "stdafx.h"
#include <pcre.h>               /* PCRE lib        NONE  */
#include <stdio.h>              /* I/O lib         C89   */
#include <stdlib.h>             /* Standard Lib    C89   */
#include <string.h>             /* Strings         C89   */
#include <iostream>

int main(int argc, char *argv[])
{
    pcre *reCompiled;

    int pcreExecRet;
    int subStrVec[30];
    const char *pcreErrorStr;
    int pcreErrorOffset;
    const char *aStrRegex = "(\\?\\w+\\?\\s*=)?\\s*(call|exec|execute)\\s+(?<spName>\\w+)("
                            // params can be an empty pair of parentheses or have parameters inside them as well.
                            "\\(\\s*(?<params>[?\\w,]+)\\s*\\)"
                            // paramList along with its parentheses is optional below so a SP call can be just "exec sp_name" for a stored proc call without any parameters.
                            ")?";
    reCompiled = pcre_compile(aStrRegex, 0, &pcreErrorStr, &pcreErrorOffset, NULL);
    if (reCompiled == NULL) {
        printf("ERROR: Could not compile '%s': %s\n", aStrRegex, pcreErrorStr);
        exit(1);
    }

    const char *line = "?rt?=call SqlTxFunctionTesting(?înFîéld?,?outField?,?inOutField?)";
    pcreExecRet = pcre_exec(reCompiled,
                            NULL,
                            line,
                            65,                     // length of string
                            0,                      // Start looking at this point
                            0,                      // OPTIONS
                            subStrVec,
                            30);                    // Length of subStrVec

    printf("\nret=%d", pcreExecRet);

    //int substrLen = pcre_get_substring(line, subStrVec, pcreExecRet, 1, &mantissa);

}
Dima Kurilo

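1) Check what the compiler actually stores for the non-ASCII character:

const char *q = "î";
printf("%d, %s", q[0], q);

Output:
63

63 is the ASCII code of '?', so the î in the literal does not reach the program as UTF-8 bytes. Most likely the source file is not saved as UTF-8 and the compiler substituted '?' for the character.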

2) You have to rebuild PCRE with PCRE_BUILD_PCRE16 (or 32) and PCRE_SUPPORT_UTF, and link against pcre16.lib and/or pcre16.dll. Then you can try the following code:

  pcre16 *reCompiled;
  int pcreExecRet;
  int subStrVec[30];
  const char *pcreErrorStr;
  int pcreErrorOffset;
  // Assumes a 16-bit wchar_t (as on Windows), so wide strings can be passed as PCRE_SPTR16.
  const wchar_t *aStrRegex = L"(\\?\\w+\\?\\s*=)?\\s*(call|exec|execute)\\s+(?<spName>\\w+)("
                             // params can be an empty pair of parentheses or have parameters inside them as well.
                             L"\\(\\s*(?<params>[?,\\w\\p{L}]+)\\s*\\)"
                             // paramList along with its parentheses is optional below so a SP call can be just "exec sp_name" for a stored proc call without any parameters.
                             L")?";
  // PCRE_UTF16 is the 16-bit library's name for the UTF option (same bit value as PCRE_UTF8).
  reCompiled = pcre16_compile((PCRE_SPTR16)aStrRegex, PCRE_UTF16, &pcreErrorStr, &pcreErrorOffset, NULL);
  if (reCompiled == NULL) {
    // %ls prints the wide pattern; the error message itself is a narrow string.
    printf("ERROR: Could not compile '%ls': %s\n", aStrRegex, pcreErrorStr);
    exit(1);
  }

  const wchar_t *line = L"?rt?=call SqlTxFunctionTesting(  ?inField?,?outField?,?inOutField?,?fd?  )";
  PCRE_SPTR16 mantissa = NULL;                      // Set by pcre16_get_substring
  pcreExecRet = pcre16_exec(reCompiled,
                            NULL,
                            (PCRE_SPTR16)line,
                            (int)wcslen(line),      // Length of string in 16-bit units
                            0,                      // Start looking at this point
                            0,                      // OPTIONS
                            subStrVec,
                            30);                    // Length of subStrVec

  printf("\nret=%d", pcreExecRet);
  for (int i = 0; i < pcreExecRet; i++) {
    // pcre16_get_substring allocates the copy itself; release it with pcre16_free_substring.
    int substrLen = pcre16_get_substring((PCRE_SPTR16)line, subStrVec, pcreExecRet, i, &mantissa);
    if (substrLen >= 0) {
      wprintf(L"\nret string=%ls, length=%i\n", (const wchar_t *)mantissa, substrLen);
      pcre16_free_substring(mantissa);
    }
  }

3) \w = [0-9A-Z_a-z]. It does not include Unicode characters.
4) This may really help you: http://answers.oreilly.com/topic/215-how-to-use-unicode-code-points-properties-blocks-and-scripts-in-regular-expressions/
5) PCRE 8.33 source (pcre_exec.c:2251):

/* Find out if the previous and current characters are "word" characters.
It takes a bit more work in UTF-8 mode. Characters > 255 are assumed to
be "non-word" characters. Remember the earliest consulted character for
partial matching. */
