What is the fastest way to check whether a string matches a regular expression in Ruby?
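My problem is that I have to "egrep" through a huge list of strings, looking for the ones that match a regexp given at runtime. I only care about whether a string matches the regexp, not where it matches or what the matching groups contain. I hope this assumption can be used to reduce the time my code spends matching regexps.
I load the regexp with: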
pattern = Regexp.new(ptx).freeze
I have noticed that string =~ pattern is slightly faster than string.match(pattern).
Are there other tricks or shortcuts that can be used to make this test even faster?
Starting with Ruby 2.4.0, you can use Regexp#match?:
pattern.match?(string)
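Regexp#match? is explicitly listed as a performance enhancement in the release notes for 2.4.0, because it avoids the object allocations performed by other methods such as Regexp#match and =~:
Regexp#match?: Added Regexp#match?, which executes a regexp match without creating a back reference object and changing $~ to reduce object allocation.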
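As a minimal sketch of the difference described above (the helper names and sample strings are made up for illustration), the following filters a list with Regexp#match? and shows that $~ stays untouched after match?, whereas =~ sets it:

```ruby
# Hypothetical helper: keep only the strings that match, without caring where.
def filter_matching(strings, pattern)
  strings.select { |s| pattern.match?(s) }
end

# Hypothetical helper showing which calls populate the last-match global $~.
def last_match_after
  "x" =~ /y/                      # failed match: $~ is now nil
  /\d+/.match?("abc 7")           # boolean only
  after_predicate = $~            # still nil
  /\d+/ =~ "abc 7"                # returns the index and sets $~
  after_operator = $~             # MatchData for "7"
  [after_predicate, after_operator]
end

lines = ["foo 123", "bar", "baz 42", "qux"]   # made-up sample data
p filter_matching(lines, /\d+/)               # strings containing a digit
p last_match_after
```

This requires Ruby 2.4.0 or later; on older versions the same filter can be written with =~.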
Here is a simple benchmark:
require 'benchmark'
"test123" =~ /1/
=> 4
Benchmark.measure{ 1000000.times { "test123" =~ /1/ } }
=> 0.610000 0.000000 0.610000 ( 0.578133)
"test123"[/1/]
=> "1"
Benchmark.measure{ 1000000.times { "test123"[/1/] } }
=> 0.718000 0.000000 0.718000 ( 0.750010)
"test123".match(/1/)
=> #<MatchData "1">
Benchmark.measure{ 1000000.times { "test123".match(/1/) } }
=> 1.703000 0.000000 1.703000 ( 1.578146)
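So =~ is faster, but it depends on what you want as the return value. If you only want to check whether the text matches a regex or not, use =~.
This is the benchmark I ran after finding some articles around the net.
With 2.4.0 the winner is re.match?(str) (as suggested by @wiktor-stribiżew). On previous versions, re =~ str seems to be the fastest, although str =~ re is almost as fast.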
#!/usr/bin/env ruby
require 'benchmark'
str = "aacaabc"
re = Regexp.new('a+b').freeze
N = 4_000_000
Benchmark.bm do |b|
b.report("str.match re\t") { N.times { str.match re } }
b.report("str =~ re\t") { N.times { str =~ re } }
b.report("str[re] \t") { N.times { str[re] } }
b.report("re =~ str\t") { N.times { re =~ str } }
b.report("re.match str\t") { N.times { re.match str } }
if re.respond_to?(:match?)
b.report("re.match? str\t") { N.times { re.match? str } }
end
end
Results MRI 1.9.3-p551:
$ ./bench-re.rb | sort -t $'\t' -k 2
user system total real
re =~ str 2.390000 0.000000 2.390000 ( 2.397331)
str =~ re 2.450000 0.000000 2.450000 ( 2.446893)
str[re] 2.940000 0.010000 2.950000 ( 2.941666)
re.match str 3.620000 0.000000 3.620000 ( 3.619922)
str.match re 4.180000 0.000000 4.180000 ( 4.180083)
Results MRI 2.1.5:
$ ./bench-re.rb | sort -t $'\t' -k 2
user system total real
re =~ str 1.150000 0.000000 1.150000 ( 1.144880)
str =~ re 1.160000 0.000000 1.160000 ( 1.150691)
str[re] 1.330000 0.000000 1.330000 ( 1.337064)
re.match str 2.250000 0.000000 2.250000 ( 2.255142)
str.match re 2.270000 0.000000 2.270000 ( 2.270948)
Results MRI 2.3.3 (there is a regression in regex matching, it seems):
$ ./bench-re.rb | sort -t $'\t' -k 2
user system total real
re =~ str 3.540000 0.000000 3.540000 ( 3.535881)
str =~ re 3.560000 0.000000 3.560000 ( 3.560657)
str[re] 4.300000 0.000000 4.300000 ( 4.299403)
re.match str 5.210000 0.010000 5.220000 ( 5.213041)
str.match re 6.000000 0.000000 6.000000 ( 6.000465)
Results MRI 2.4.0:
$ ./bench-re.rb | sort -t $'\t' -k 2
user system total real
re.match? str 0.690000 0.010000 0.700000 ( 0.682934)
re =~ str 1.040000 0.000000 1.040000 ( 1.035863)
str =~ re 1.040000 0.000000 1.040000 ( 1.042963)
str[re] 1.340000 0.000000 1.340000 ( 1.339704)
re.match str 2.040000 0.000000 2.040000 ( 2.046464)
str.match re 2.180000 0.000000 2.180000 ( 2.174691)
What about re === str (case compare)?
Since it evaluates to true or false and has no need to store matches or return the match index, I wonder whether it would be an even faster way of matching than =~.
Ok, I tested this. =~ is still faster, even if you have multiple capture groups; however, === is faster than the other options.
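For reference, a small sketch of what === does with a regexp (the classify helper is made up for illustration). It returns true or false and is what case/when calls on each branch. Like =~, it still sets $~ and the $1-style captures, so it does not skip that work:

```ruby
# Hypothetical helper: case/when calls Regexp#=== on each branch.
def classify(str)
  case str
  when /\A\d+\z/ then :number
  when /\A\w+\z/ then :word
  else :other
  end
end

p(/a+b/ === "aacaabc")   # true
p(/a+b/ === "xyz")       # false
p classify("12345")      # :number
p classify("hello")      # :word
p classify("a b")        # :other

# === populates the last-match data, just like =~ does.
/(\d+)/ === "id 42"
p $1                     # "42"
```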
BTW, what good is freeze? I couldn't measure any performance boost from it.
Depending on how complicated your regular expression is, you could possibly just use simple string slicing. I'm not sure about the practicality of this for your application or whether or not it would actually offer any speed improvements.
'testsentence'['stsen']
=> 'stsen' # evaluates to true
'testsentence'['koala']
=> nil # evaluates to false
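If the pattern really is a fixed substring, String#include? expresses the same check and returns a boolean directly. A minimal sketch follows; literal_pattern? and build_matcher are hypothetical helpers, and the round trip through Regexp.escape is only one conservative way to detect a pattern with no metacharacters (it also treats spaces and hyphens as special, which merely falls back to the regexp path):

```ruby
# Hypothetical helper: true when the pattern source contains nothing that
# Regexp.escape would change, i.e. it can be treated as a plain substring.
def literal_pattern?(source)
  Regexp.escape(source) == source
end

# Build the matcher once, picking a plain substring test when possible.
def build_matcher(source)
  if literal_pattern?(source)
    ->(str) { str.include?(source) }
  else
    re = Regexp.new(source)
    ->(str) { re.match?(str) }
  end
end

p build_matcher('stsen').call('testsentence')   # true  (substring path)
p build_matcher('koala').call('testsentence')   # false (substring path)
p build_matcher('st.en').call('testsentence')   # true  (regexp path)
```

Whether this is actually faster for a given workload is something to measure rather than assume.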
What I am wondering is if there is any strange way to make this check even faster, maybe exploiting some strange method in Regexp or some weird construct.
Regexp engines vary in how they implement searches, but, in general, anchor your patterns for speed, and avoid greedy matches, especially when searching long strings.
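A rough sketch of how the effect of an anchor could be measured (the input string, pattern, and iteration count are made up for illustration). Note that the two patterns do not mean the same thing, so anchoring only applies when a match is expected at a fixed position:

```ruby
require 'benchmark'

# Made-up input: a long string that does not contain the target at all.
HAYSTACK = ("x" * 10_000).freeze

UNANCHORED = /CVE-\d{4}-\d{4,}/     # may match anywhere in the string
ANCHORED   = /\ACVE-\d{4}-\d{4,}/   # can only match at the very start

n = 2_000   # arbitrary iteration count
t_unanchored = Benchmark.realtime { n.times { UNANCHORED.match?(HAYSTACK) } }
t_anchored   = Benchmark.realtime { n.times { ANCHORED.match?(HAYSTACK) } }

puts format("unanchored: %.4fs", t_unanchored)
puts format("anchored:   %.4fs", t_anchored)
```

The numbers depend on the engine's own optimizations, so the sketch prints timings without assuming which one wins.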
The best thing to do, until you're familiar with how a particular engine works, is to do benchmarks and add/remove anchors, try limiting searches, use wildcards vs. explicit matches, etc.
The Fruity gem is very useful for quickly benchmarking things, because it's smart. Ruby's built-in Benchmark code is also useful, though you can write tests that fool you by not being careful.
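One way to be a little more careful with the built-in Benchmark module is sketched below. It uses Benchmark.bmbm, which runs a rehearsal pass before the measured pass, and first checks that every candidate gives the same yes/no answer. The constant names and iteration count are made up for illustration, and wrapping each candidate in a lambda adds the same call overhead to all of them:

```ruby
require 'benchmark'

STR = "aacaabc"
RE  = /a+b/
ITERATIONS = 200_000   # arbitrary; raise it for more stable numbers

# Candidates as lambdas so they can be checked for agreement before timing.
CANDIDATES = {
  "re.match? str" => -> { RE.match?(STR) },
  "re =~ str"     => -> { RE =~ STR },
  "str[re]"       => -> { STR[RE] },
}

# Guard against comparing things that do not answer the same question.
answers = CANDIDATES.values.map { |c| !!c.call }.uniq
raise "candidates disagree" unless answers.size == 1

# bmbm runs a rehearsal pass first, then the measured pass.
Benchmark.bmbm(14) do |x|
  CANDIDATES.each do |label, candidate|
    x.report(label) { ITERATIONS.times { candidate.call } }
  end
end
```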
I've used both in many answers here on Stack Overflow, so you can search through my answers and will see lots of little tricks and results to give you ideas of how to write faster code.
The biggest thing to remember is, it's bad to prematurely optimize your code before you know where the slowdowns occur.
To complete Wiktor Stribiżew's and Dougui's answers, I would say that /regex/.match?("string") is about as fast as "string".match?(/regex/).
Ruby 2.4.0 (10 000 000 ~2 sec)
2.4.0 > require 'benchmark'
=> true
2.4.0 > Benchmark.measure{ 10000000.times { /^CVE-[0-9]{4}-[0-9]{4,}$/.match?("CVE-2018-1589") } }
=> #<Benchmark::Tms:0x005563da1b1c80 @label="", @real=2.2060338060000504, @cstime=0.0, @cutime=0.0, @stime=0.04000000000000001, @utime=2.17, @total=2.21>
2.4.0 > Benchmark.measure{ 10000000.times { "CVE-2018-1589".match?(/^CVE-[0-9]{4}-[0-9]{4,}$/) } }
=> #<Benchmark::Tms:0x005563da139eb0 @label="", @real=2.260814556000696, @cstime=0.0, @cutime=0.0, @stime=0.010000000000000009, @utime=2.2500000000000004, @total=2.2600000000000007>
Ruby 2.6.2 (100 000 000 ~20 sec)
irb(main):001:0> require 'benchmark'
=> true
irb(main):005:0> Benchmark.measure{ 100000000.times { /^CVE-[0-9]{4}-[0-9]{4,}$/.match?("CVE-2018-1589") } }
=> #<Benchmark::Tms:0x0000562bc83e3768 @label="", @real=24.60139879199778, @cstime=0.0, @cutime=0.0, @stime=0.010000999999999996, @utime=24.565644999999996, @total=24.575645999999995>
irb(main):004:0> Benchmark.measure{ 100000000.times { "CVE-2018-1589".match?(/^CVE-[0-9]{4}-[0-9]{4,}$/) } }
=> #<Benchmark::Tms:0x0000562bc846aee8 @label="", @real=24.634255946999474, @cstime=0.0, @cutime=0.0, @stime=0.010046, @utime=24.598276, @total=24.608321999999998>
Note: times vary. Sometimes /regex/.match?("string") is faster and sometimes "string".match?(/regex/) is; the differences may be due only to machine activity.