Why do LF and CRLF behave differently with the /^\s*$/gm regex?

zavr posted on Dev


I've been seeing this issue on Windows. When I try to clear any whitespace on each line on Unix:

const input =
`===

HELLO

WOLRD

===`
console.log(input.replace(/^\s+$/gm, ''))

This produces what I expect:

===

HELLO

WOLRD

===

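That is, if there were spaces on the empty lines, they would get removed. On Windows, on the other hand, the regex removes the WHOLE empty line. To illustrate: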

const input =
`===

HELLO

WOLRD

===`.replace(/\r?\n/g, '\r\n')
console.log(input.replace(/^\s+$/gm, ''))

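(Template literals in JS always produce just \n, so I had to replace it with \r\n to emulate Windows. The ? after \r is only there to reassure anyone who doesn't believe that.) The result is: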

===
HELLO
WOLRD
===

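The whole line is gone! But my regex has ^ and $ with the m flag set, so it's sort of /^-to-$/m. What is the difference between \n and \r\n, then, that makes it produce different results?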

When I add some logging:

console.log(input.replace(/^\s*$/gm, (m) => {
  console.log('matched')
  return ''
}))

I see this with \r\n:

matched
matched
matched
matched
matched
matched
===
HELLO
WOLRD
===

and this with \n only:

matched
matched
matched
===

HELLO

WOLRD

===

VLAZ

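TL;DR: a pattern that includes whitespace and line breaks will also match characters that are part of a \r\n sequence, if you allow it to.

First of all, let's actually examine which characters are and aren't there around the replacement. Starting with a string that only uses line feeds: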

const inputLF =
`===

HELLO

WOLRD

===`.replace(/\r?\n/g, "\n");

console.log('------------ INPUT ')
console.log(inputLF);
console.log('------------')

debugPrint(inputLF, 2);
debugPrint(inputLF, 3);
debugPrint(inputLF, 4);
debugPrint(inputLF, 5);

const replaceLF = inputLF.replace(/^\s+$/gm, '');

console.log('------------ REPLACEMENT')
console.log(replaceLF);
console.log('------------')

debugPrint(replaceLF, 2);
debugPrint(replaceLF, 3);
debugPrint(replaceLF, 4);
debugPrint(replaceLF, 5);

console.log(`charcode ${replaceLF.charCodeAt(2)} : ${replaceLF.charAt(2)}`);
console.log(`charcode ${replaceLF.charCodeAt(3)} : ${replaceLF.charAt(3)}`);
console.log(`charcode ${replaceLF.charCodeAt(4)} : ${replaceLF.charAt(4)}`);
console.log(`charcode ${replaceLF.charCodeAt(5)} : ${replaceLF.charAt(5)}`);

console.log('------------')
console.log('inputLF === replaceLF :', inputLF === replaceLF)

function debugPrint(str, charIndex) {
  console.log(`index: ${charIndex}
   charcode: ${str.charCodeAt(charIndex)}
   character: ${str.charAt(charIndex)}`
 );
}

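Each line ends with character code 10, which is the line feed (LF) character, written as \n in a string literal. Before and after the replacement the two strings are the same: they don't just look the same, they are actually equal to each other, so the replacement did nothing.

Now let's examine the other case: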

const inputCRLF =
`===

HELLO

WOLRD

===`.replace(/\r?\n/g, "\r\n")
console.log('------------ INPUT ')
console.log(inputCRLF);
console.log('------------')

debugPrint(inputCRLF, 2);
debugPrint(inputCRLF, 3);
debugPrint(inputCRLF, 4);
debugPrint(inputCRLF, 5);
debugPrint(inputCRLF, 6);
debugPrint(inputCRLF, 7);

const replaceCRLF = inputCRLF.replace(/^\s+$/gm, '');

console.log('------------ REPLACEMENT')
console.log(replaceCRLF);
console.log('------------')

debugPrint(replaceCRLF, 2);
debugPrint(replaceCRLF, 3);
debugPrint(replaceCRLF, 4);
debugPrint(replaceCRLF, 5);

function debugPrint(str, charIndex) {
  console.log(`index: ${charIndex}
   charcode: ${str.charCodeAt(charIndex)}
   character: ${str.charAt(charIndex)}`
 );
}

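This time every line ends with character code 13, which is the carriage return (CR) character, written as \r in a string literal, and the LF follows it. After the replacement, instead of the sequence =\r\n\r\nH there is only =\r\nH. Let's see why.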
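Because CR and LF are invisible in console output, it can help to print the strings in escaped form. The following is a minimal sketch on a shortened version of the input; the names shortCRLF and strippedCRLF are made up for this illustration, and JSON.stringify is used only to make \r and \n visible.

```javascript
// Shortened CRLF input: "===", an empty line, then "HELLO".
const shortCRLF = "===\r\n\r\nHELLO";

// Same replacement as in the question.
const strippedCRLF = shortCRLF.replace(/^\s+$/gm, "");

// JSON.stringify escapes control characters so they become visible.
console.log(JSON.stringify(shortCRLF));    // "===\r\n\r\nHELLO"
console.log(JSON.stringify(strippedCRLF)); // "===\r\nHELLO"
```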

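This is what MDN says about the metacharacter ^:

Matches the beginning of input. If the multiline flag is set to true, also matches immediately after a line break character.

This is what MDN says about the metacharacter $:

Matches the end of input. If the multiline flag is set to true, also matches immediately before a line break character.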

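So they match after and before a line break character. By that, MDN means either an LF or a CR. This can be seen if we test strings that contain different line breaks: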

const stringLF = "hello\nworld";
const stringCRLF = "hello\r\nworld";

const regexStart = /^\s/m;
const regexEnd = /\s$/m;

console.log(regexStart.exec(stringLF));
console.log(regexStart.exec(stringCRLF));

console.log(regexEnd.exec(stringLF));
console.log(regexEnd.exec(stringCRLF));

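If we try to match whitespace next to a line end, nothing matches when there is only an LF, but with CRLF something does match: \s$ picks up the CR and ^\s picks up the LF. So in that case the matches are here: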

"hello\r\nworld"
        ^^ what `^\s` matches

"hello\r\nworld"
      ^^ what `\s$` matches
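One way to see every position where ^ and $ match is to insert a visible marker at each zero-width match. This is only a sketch; the helper name markAnchors and the | marker are choices made for this illustration.

```javascript
// Insert "|" at every position where the given zero-width anchor matches,
// then escape the result so CR and LF are visible.
function markAnchors(str, anchor) {
  return JSON.stringify(str.replace(anchor, "|"));
}

const lf = "hello\nworld";
const crlf = "hello\r\nworld";

console.log(markAnchors(lf, /^/gm));   // "|hello\n|world"
console.log(markAnchors(crlf, /^/gm)); // "|hello\r|\n|world"

console.log(markAnchors(lf, /$/gm));   // "hello|\nworld|"
console.log(markAnchors(crlf, /$/gm)); // "hello|\r|\nworld|"
```

With CRLF there is both a line start and a line end between the \r and the \n, which is what lets \s sit directly next to an anchor.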

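So both ^ and $ recognise either character of the CRLF sequence as a line end. This makes a difference when you do a search and replace. Since your regex specifies ^\s+$, it does match when you have a line that consists entirely of \r\n. But for a non-obvious reason: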

const re = /^\s+$/m;

const stringLF = "hello\n\nworld";
const stringCRLF = "hello\r\n\r\nworld";

// null for LF; for CRLF the match is "\n\r"
console.log(re.exec(stringLF));
console.log(re.exec(stringCRLF));

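So the regex does not match the \r\n that sits between the two other line breaks, but rather a \n\r (two whitespace characters). That's because + is greedy and will consume as much of the character sequence as it can get away with. Here is what the regex engine will try, somewhat simplified for brevity: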

input = "hello\r\n\r\nworld"
regex = /^\s+$/m
(brackets mark the text consumed so far)

Step 1
hello\r[\n]\r\nworld
    `^` matches right after `\r` and `\n` matches `\s` -> continue matching to satisfy `+` quantifier

Step 2
hello\r[\n\r]\nworld
    matches `^\s+` -> continue matching to satisfy `+` quantifier

Step 3
hello\r[\n\r\n]world
    matches `^\s+` -> continue matching to satisfy `+` quantifier

Step 4
hello\r[\n\r\nw]orld
    `w` does not match `\s` -> backtrack

Step 5
hello\r[\n\r\n]world
    matches `^\s+`, quantifier satisfied -> continue to next symbol in regex

Step 6
hello\r[\n\r\n]world
    `$` does not match before `w` -> backtrack, give up one character

Step 7
hello\r[\n\r]\nworld
    still matches `^\s+` -> try `$` again

Step 8
hello\r[\n\r]\nworld
    `$` matches before `\n`, last symbol satisfied -> finish, the match is "\n\r"
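The outcome of these steps can be checked by asking the engine for the position and contents of the match. The sketch below uses the same sample string; it also runs the \s* variant from the question, since the same mechanics would explain why the logging there reports twice as many matches on CRLF input. The helper name describeMatches is invented for this example.

```javascript
const sample = "hello\r\n\r\nworld";

// Single match with the `+` quantifier.
const found = /^\s+$/m.exec(sample);
console.log(found.index, JSON.stringify(found[0])); // 6 "\n\r"

// List every match as [index, escaped text].
function describeMatches(str, regex) {
  return [...str.matchAll(regex)].map((m) => [m.index, JSON.stringify(m[0])]);
}

// CRLF: the "\n\r" pair, then an empty match just before the final "\n".
console.log(describeMatches(sample, /^\s*$/gm));

// LF: a single empty match on the blank line.
console.log(describeMatches("hello\n\nworld", /^\s*$/gm));
```

Two matches per blank line with CRLF against one with LF lines up with the six versus three "matched" lines logged in the question, which has three blank lines.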

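Finally, there is something subtle here: it matters that you are matching whitespace. \s behaves differently from most other symbols in that it explicitly matches line break characters, whereas . does not:

Matches any single character except line terminators

So, if you specify \s$, it will match the CR of \r\n, because the regex engine is forced to find a match for both \s and $, and it finds the \r right before the \n. This does not happen with many other patterns, though, because $ is usually satisfied before the CR (or at the end of the string).

The same goes for ^\s: it explicitly looks for a whitespace character after a line break, and the LF of CRLF satisfies that. However, if you are not looking for whitespace, the pattern will happily match after the LF: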

const stringLF = "hello\nworld";
const stringCRLF = "hello\r\nworld";

const regexStartAll = /^./mg;
const regexEndAll = /.$/gm;

console.log(stringLF.match(regexStartAll));
console.log(stringCRLF.match(regexStartAll));

console.log(stringLF.match(regexEndAll));
console.log(stringCRLF.match(regexEndAll));

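So all of this means that ^\s+$ has some unintuitive behaviour, yet it is perfectly coherent once you understand that the regex engine matches exactly what you tell it to.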
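If the goal is only to strip spaces and tabs from otherwise blank lines while leaving the line breaks alone, one option is to exclude CR and LF from the whitespace class. This is a suggestion layered on top of the explanation above, not something taken from the question, and the function name stripBlankLineWhitespace is hypothetical.

```javascript
// [^\S\r\n] reads as "whitespace, but not CR or LF":
// a negated class of (non-whitespace, CR, LF).
function stripBlankLineWhitespace(text) {
  return text.replace(/^[^\S\r\n]+$/gm, "");
}

const messyLF = "===\n  \nHELLO";
const messyCRLF = "===\r\n \t\r\nHELLO";

console.log(JSON.stringify(stripBlankLineWhitespace(messyLF)));   // "===\n\nHELLO"
console.log(JSON.stringify(stripBlankLineWhitespace(messyCRLF))); // "===\r\n\r\nHELLO"
```

Since the class can no longer consume \r or \n, the line breaks survive with either line ending style.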


Edited on 2021-01-23
