CVE-2021-43854: Resolved serious ReDoS in PunktSentenceTokenizer by tomaarsen · Pull Request #2869 · nltk/nltk
NLTK (Natural Language Toolkit) is a suite of open source Python modules, data sets, and tutorials supporting research and development in Natural Language Processing. Versions prior to 3.6.5 are vulnerable to regular expression denial of service (ReDoS) attacks. The vulnerability is present in `PunktSentenceTokenizer`, `sent_tokenize` and `word_tokenize`. Any users of this class, or these two functions, are vulnerable to the ReDoS attack. In short, a specifically crafted long input to any of these vulnerable functions will cause them to take a significant amount of execution time. If your program relies on any of the vulnerable functions for tokenizing unpredictable user input, then we would strongly recommend upgrading to a version of NLTK without the vulnerability. For users unable to upgrade, the execution time can be bounded by limiting the maximum length of an input to any of the vulnerable functions. Our recommendation is to implement such a limit.
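For users who cannot upgrade, a minimal sketch of such a limit could look as follows (the `MAX_CHARS` value and the wrapper name are illustrative, not part of NLTK):

```python
from nltk.tokenize import sent_tokenize

MAX_CHARS = 10_000  # hypothetical cap; tune it to your own latency budget

def bounded_sent_tokenize(text):
    # Mitigation for NLTK < 3.6.5: refuse unpredictably long user input
    # before it reaches the vulnerable tokenizer.
    if len(text) > MAX_CHARS:
        raise ValueError(f"Input exceeds {MAX_CHARS} characters; refusing to tokenize")
    return sent_tokenize(text)
```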
Resolves #2866
Hello!
Pull request overview
- Resolved serious Regular Expression Denial of Service (ReDoS) issue in PunktSentenceTokenizer.
- This has severe consequences for the commonly used `sent_tokenize` and `word_tokenize`.
The consequences of the ReDoS
The ReDoS is described in #2866, where an example of a malicious string is given. The consequences can be seen with this script:
```python
import time

from nltk.tokenize import word_tokenize

n = 8
for length in [10**i for i in range(2, n)]:
    text = "a" * length
    start_t = time.time()
    word_tokenize(text)
    print(f"A length of {length:<{n}} takes {time.time() - start_t:.4f}s")
```
Which outputs:
```
A length of 100      takes 0.0060s
A length of 1000     takes 0.0060s
A length of 10000    takes 0.6320s
A length of 100000   takes 56.3322s
...
```
At this point I canceled the execution of the program, after it had been running for several hours.
Potential malicious inputs
Any long sequence of non-whitespace characters other than `.`, `?` or `!` is a malicious input for this regex. A quick example is simply a string of 100,000 `"a"` characters, as in the script above: it takes `word_tokenize` 56.3322 seconds to tokenize this input.
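Any such sequence works, not just repeated `"a"`s; a hypothetical variant (timings will differ per machine):

```python
from nltk.tokenize import word_tokenize

# Roughly 100,000 non-whitespace characters with no '.', '?' or '!':
malicious = "abc123#$%" * 11_000
word_tokenize(malicious)  # triggers the same runaway backtracking as "a" * 100_000
```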
The details of the ReDoS
The ReDoS is caused by this regex:
```python
_period_context_fmt = r"""
    \S*                          # some word material
    %(SentEndChars)s             # a potential sentence ending
    (?=(?P<after_tok>
        %(NonWord)s              # either other punctuation
        |
        \s+(?P<next_tok>\S+)     # or whitespace and some other token
    ))"""
"""Format of a regular expression to find contexts including possible
sentence boundaries. Matches token which the possible sentence boundary
ends, and matches the following token within a lookahead expression."""
```
After being formatted with the default values, this becomes:
```python
_period_context_fmt = r"""
    \S*                          # some word material
    [\.\?!]                      # a potential sentence ending
    (?=(?P<after_tok>
        (?:[)\";}\]\*:@\'\({\[!\?])  # either other punctuation
        |
        \s+(?P<next_tok>\S+)     # or whitespace and some other token
    ))"""
```
The vulnerability is already at the very start of the regex. The following regex is also vulnerable:
```python
_period_context_fmt = r"""
    \S*                          # some word material
    [\.\?!]                      # a potential sentence ending
    """
```
The issue is that the Python regex engine will start matching tokens with `\S*` (i.e. any non-whitespace), and only recognize that it failed to find a `.`, `?` or `!` when it reaches either a whitespace character or the end of the string. So, with an input of a thousand `"a"` characters, the regex engine will start matching from the very first `"a"`, and only recognize that it won't be able to match this way after consuming all 1000 `"a"`'s. Then, it will backtrack all the way back to the first `"a"`, skip it, and try the entire thing again from the second `"a"`, now matching 999 characters before realising this won't work. This continues until the engine finally concludes that the regex does not match the string at all.
Given an input string of length `n`, this takes `(n^2 + n) / 2` steps. In other words, `word_tokenize` has at least O(n^2) time complexity, while O(n) should be reachable for most regex problems.
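This quadratic behaviour can be reproduced with plain `re` outside of NLTK; a minimal sketch using just the vulnerable prefix of the pattern (timings are machine-dependent):

```python
import re
import time

# Just the vulnerable part of the Punkt regex: \S* followed by a sentence-ending character.
vulnerable = re.compile(r"\S*[\.\?!]")

for length in [10_000, 20_000, 40_000, 80_000]:
    text = "a" * length  # contains no '.', '?' or '!', so the regex can never match
    start_t = time.time()
    vulnerable.search(text)
    # Doubling the input length roughly quadruples the runtime: O(n^2) backtracking.
    print(f"A length of {length:<6} takes {time.time() - start_t:.4f}s")
```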
The fix
The fix seems elementary:
- Remove `\S*` from the regex; then the regex engine will only start trying to match when it actually finds a potential end of sentence token (i.e. `.`, `!` or `?` by default).
- Then, use a simple regex like `\S*$` to match the final word before the end of sentence token (see the small example below).
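As a small illustration of that second step (a hypothetical snippet, not the code added in this PR): once a sentence-ending character has been matched, the word in front of it can be recovered by anchoring `\S*` to the end of the preceding text:

```python
import re

text = "Very bad acting!!! I promise."
# Suppose the reduced regex matched the last '!' of "acting!!!", which ends at index 18;
# anchoring \S* to the end of the prefix recovers the word in front of it:
word = re.search(r"\S*$", text[:18]).group()
print(word)  # -> acting!!!
```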
And this works very well, with one small but crucial detail: `re.finditer` will find all non-overlapping matches, and we've now reduced the size of each match. See the consequences with this example:
On the `develop` branch:
```python
>>> pst = PunktSentenceTokenizer()
>>> text = "Very bad acting!!! I promise."
>>> list(pst._lang_vars.period_context_re().finditer(text))  # doctest: +SKIP
[<re.Match object; span=(9, 18), match='acting!!!'>]
```
After removing `\S*` from the regex:
```python
>>> pst = PunktSentenceTokenizer()
>>> text = "Very bad acting!!! I promise."
>>> list(pst._lang_vars.period_context_re().finditer(text))  # doctest: +NORMALIZE_WHITESPACE
[<re.Match object; span=(15, 16), match='!'>,
 <re.Match object; span=(16, 17), match='!'>,
 <re.Match object; span=(17, 18), match='!'>]
```
In short, we need to manually remove these overlapping cases. This is what the new `_match_potential_end_contexts` method is for. I can't come up with a regex-only solution for this problem.
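To illustrate the de-overlapping idea (a simplified sketch under the default English settings, not the actual `_match_potential_end_contexts` implementation): run the reduced regex, re-attach the word material with `\S*$`, and keep only one context per word, keyed by where that word starts.

```python
import re

# The default period-context regex with the leading \S* removed (simplified).
reduced_re = re.compile(
    r"""[\.\?!]                  # a potential sentence ending
    (?=(?P<after_tok>
        [)\";}\]\*:@\'\({\[!\?]  # either other punctuation
        |
        \s+(?P<next_tok>\S+)     # or whitespace and some other token
    ))""",
    re.VERBOSE,
)

def potential_end_contexts(text):
    contexts = {}
    for match in reduced_re.finditer(text):
        # Recover the word material that \S* used to capture.
        word = re.search(r"\S*$", text[:match.end()]).group()
        word_start = match.end() - len(word)
        # The three '!' matches of "acting!!!" all share word_start == 9,
        # so only the last (longest) context per word survives.
        contexts[word_start] = text[word_start:match.end()]
    return list(contexts.values())

print(potential_end_contexts("Very bad acting!!! I promise."))
# ['acting!!!']
```

Keying on the word start collapses the three overlapping `!` matches back into the single `acting!!!` context that the original regex produced on the `develop` branch.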
Results
Upon running the script again, but now with the fix in place, we get:
```
A length of 100      takes 0.0070s
A length of 1000     takes 0.0010s
A length of 10000    takes 0.0060s
A length of 100000   takes 0.0400s
A length of 1000000  takes 0.3520s
A length of 10000000 takes 3.4641s
```
As you can see, the performance is significantly better, and more in tune with the expected O(n).
Beyond that, the performance of normal sentences (i.e. with dots) is in line with the original performance.
If all is well, then only the performance is (positively) impacted.
Note
I believe that this PR does not require re-training of the Punkt models.
- Tom Aarsen