Security
Headlines
HeadlinesLatestCVEs

Headline

CVE-2023-39631: Warn that evaluate() should not be used on user input · Issue #442 · pydata/numexpr

An issue in LanChain-ai Langchain v.0.0.245 allows a remote attacker to execute arbitrary code via the evaluate function in the numexpr library.

CVE
#debian#git

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Open

corgeman opened this issue

Jul 7, 2023

· 35 comments

Comments

The evaluate() function eventually calls eval() on the data provided. eval() is extremely dangerous when supplied with user input and to my knowledge it isn’t mentioned that the function does this. I would add a warning in the documentation about this. As a proof-of-concept, the following code should execute the command ‘echo verybad’ on your computer when ran:

import numexpr s = “"” (lambda fc=( lambda n: [ c for c in ().__class__.__bases__[0].__subclasses__() if c.__name__ == n ][0] ): fc(“function”)( fc(“Popen”)("echo verybad",shell=True),{} )() )() “"” numexpr.evaluate(s)

Yeah developers seem to be using NumExpr more and more as a parser for handling input from a UI. It would be much safer if we could use ast.parse instead. I think adding a warning to the docs is fine, but it would also make sense to me to add an optional sanitizer. We could search the string for various Python keywords that have no business being in an expression, such as:

  1. import
  2. lambda
  3. eval
  4. __ (dunder)
  5. locals
  6. globals

Sanitization could be enabled by default but give the user the means to turn it off.

Alright I implemented a check in 4b2d89c that forbids certain operators used in the expression. Namely: ;, :, [, and __. As far as I can tell this defeats all of the commonly cited attack vectors against eval(). I.e. it bans multiple expressions, lambdas, indexing, and all the dunders. If we want to be very safe, we would want to ban . as a character. However, this is difficult because . is both the reference operator and the decimal operator.

If we want to ban . we could do so if we require it to be followed by a numeral. E.g. 0.1 would pass the check, os.remove would not pass the check. However, this would ban valid Python code such as a * 5. which is a shorthand for casting to double. Perhaps there is someone out their with better regex-foo than myself and can suggest a solution for a regex that bans the Python . operator but doesn’t cause issues for decimal points?

Regardless I think this is a good step forward and we’re at the point where we should do a new release.

Ok, made some more improvements. I can ban attribute access to everything but .real and .imag, via

forbidden_re = re.compile('[;[:]||.[abcdefghjklmnopqstuvwxyzA-Z]')

I also strip the string of all whitespace beforehand.

Closing with release of 2.8.5.

Hi @robbmcleod, just wanted to point out that you can still access other attributes because Python translates some utf chars into ascii chars automatically (mainly greek alphabet used in math), thus:

numexpr.evaluate("(3+1).ᵇit_count()")

Results in:

It might be better to whitelist real and imag rather than blacklist chars.

@jan-kubena ok thanks for the heads up. The trouble here is that decimal needs to work, and numbers can appear in variable names, but not at the start. Do you happen to know where the documentation for which character sets are mangled is?

The really proper way to do it would be to back-port the ast parser work I did in for my attempt at making 3.0.

Apparently the pandas group is using dunders as private variables in some of their NumExpr calls.

pandas-dev/pandas#54449

I’m of two minds about this. I could regex against "[a-zA-Z0-9_]+" instead of just "__". But for the security conscious, it does feel to me that NumExpr shouldn’t be able to access private variables in the locals and globals dictionaries.

Hi folks,

I have a simpler example which also breaks in the new version:

import pandas as pd

name = “Mass (kg)" df = pd.DataFrame({name: [200, 300, 400]}) df.query(f"`{name}` < 300”)

ValueError: Expression (BACKTICK_QUOTED_STRING_Mass__LPAR_kg_RPAR_) < (300) has forbidden control characters.

I understand Mass (kg) is a reasonable column name, so the validation definitely needs more tuning.

Hi folks,
I’m not sure if it is the same issue here, we are using numexpr for calculations in a pandas dataframe and we are using double underscore in some of our calculations (for instance, a__b) but now it fails in 2.8.5 (working in 2.8.4).
I have a minimal example without involving pandas:

import numexpr numexpr.evaluate('a__b / 2’, {’a__b’: 4})

Or a oneliner:

python -c "import numexpr; print(numexpr.evaluate('a__b / 2’, {’a__b’: 4}))"

In numexpr==2.8.4 that one works perfectly.
In numexpr==2.8.5 we have the following:
ValueError: Expression a__b / 2 has forbidden control characters.

Is it an intentional feature or something that can be workedaround easily (sending some kwarg to ignore the error)?
We will pin our numexpr requeriments to “<2.8.5” in the meanwhile.

Many thanks!

@nicoddemus and @zorion, just waiting to hear back from Pandas’ devs on the original report: pandas-dev/pandas#54449 before I do anything. I’ve tried in the past to establish some sort of line of communication with them and gotten crickets.

Hi, thanks for your reply.

I think we are having a legit use of dunder but it is not allowed in a breaking change from 2.8.4 to 2.8.5 so we have to fix our version to 2.8.4 (or lower) and this is an ok workaround for now.
On the other hand, if we had a way to flag “evaluate” that we trust our input we may remove this version restriction. Is it possible to do so?

Many thanks in advance!

@nicoddemus and @zorion, just waiting to hear back from Pandas’ devs on the original report: pandas-dev/pandas#54449 before I do anything. I’ve tried in the past to establish some sort of line of communication with them and gotten crickets.

Hi, one of the pandas devs here.

I don’t maintain the eval code (I don’t think anyone still here does anymore), so not really the best person to comment on this.

Is there a way to gate this stricter checking behind an option?
(For pandas, we have a warning in the docs saying that eval will let users run arbritary code)

Thanks,
Thomas

@lithomas1 and company,

I see a few approaches here.

  1. We can deprecate the implementation of the __ filter for a immediate release and put it back in for a future release.
  2. We can put in an option to disable the security check. However, I do think it should default to be on.

In order to avoid forcing everyone to do an emergency release, we probably need to do both, assuming the option defaults to sanitize.

@lithomas1 and company,

I see a few approaches here.

  1. We can deprecate the implementation of the __ filter for a immediate release and put it back in for a future release.
  2. We can put in an option to disable the security check. However, I do think it should default to be on.

In order to avoid forcing everyone to do an emergency release, we probably need to do both, assuming the option defaults to sanitize.

Thanks, this sounds good to me.
2 is probably good enough for pandas.
(We are going to be releasing soon anyways in a couple weeks. I think users can be fine pinning numexpr for now).

It would be nice to actually fix pandas, however, I’m not too sure what the future of eval/query will be given its current “zombie”-like state.

There’s actually two changes in numexpr 2.8.5 that fail pandas tests - this, and a change to integer overflow behaviour (possibly a result of the negative-powers changes, but I’m not sure yet) pandas-dev/pandas#54546

@rebecca-palmer You can make a new issue if you want, but NumExpr could never cover 2**100 as that’s way in excess of 64-bit integers.

@zorion, @nicoddemus, @lithomas1 I made a push in 397cc98 that should hopefully fix these issues, if you could please test?

  1. I improved the blacklisting. It should better match the funny unicode coercion that can be done for the attribute access attack.
  2. It only blocks dunders and not a single double underscore.
  3. You can disable it by calling validate(…, sanitize=False) or equivalently evaluate(…, sanitize=False).

If you have a chance please test and provide me with any feedback you may have.

Hi @robbmcleod,

Thanks for attempting a fix.

My original example now passes, but this one breaks (reduced from the actual code):

import pandas as pd

name = “II (MM)" df = pd.DataFrame({name: [200, 300, 400]}) df.query(f"`{name}` >= 3.1e-05”)

ValueError: Expression (BACKTICK_QUOTED_STRING_II__LPAR_MM_RPAR_) >= (3.1e-05) has forbidden control characters.

The problem in this case seems to be the scientific notation, if I change 3.1e-05 to something that does not format to scientific (say 0.31) then it no longer errors out. If I use the actual number in decimal notation (0.000031) it still fails because it seems internally it formats back to scientific, because it generates the exact same message as above (with 3.1e-05):

import pandas as pd

name = “II (MM)" df = pd.DataFrame({name: [200, 300, 400]}) df.query(f"`{name}` >= 0.000031”)

ValueError: Expression (BACKTICK_QUOTED_STRING_II__LPAR_MM_RPAR_) >= (3.1e-05) has forbidden control characters.

sanitize=False would be great for us, however we call DataFrame.query which does not have that argument yet.

@nicoddemus @robbmcleod I think changing _attr_pat to r’.\b(?!(real|imag|\d+e?)\b)' (i.e. adding ‘e?’) fixes that, but I haven’t actually tested it.

Perhaps a more reliable approach would be to use ast.parse, and reject the tree if we find statements, lambdas, dunder import, etc?

@robbmcleod I’ve added some comments in the commit - do those notify anyone when it’s already on the main branch?

It needs to match [eE]?[±]?. I.e. either ‘e’ or ‘E’ can denote scientific notation and then it can be '-' or ‘+’ exponents. It’s not a big deal, I just haven’t had time to sit down and do it yet. Please be patient.

I definitely don’t get notifications on commits.

@nicoddemus I did use ast.parse for the NumExpr-3.0 branch. This is not a trivial fix to backport it. NumExpr 2 has an home-brew AST.

I think that most of the comments are in 397cc98

@robbmcleod Sorry if that looked like a demand to hurry - it was not intended as such, but as a guess that comments that don’t notify anyone might never get seen.

Debian now has my version, but it hasn’t been through their CI yet.

I’m not sure if I should create a whole new issue for this, but the changes in this commit (which made it into 2.8.5) raise DeprecationWarning: invalid escape sequence ‘;’ due to the usage ; here, which turns into errors in some CI setups. I left a review comment in 397cc98 to that effect, as well. I can surface this as a bug report of its own if you like.

9b380ae switched to _attr_pat = r’.\b(?!(real|imag|[eE]?[±]?\d+)\b)', which probably won’t do what you want - I suggest _attr_pat = r’.\b(?!(real|imag|\d+([eE][±]?\d+)?)\b)' but I haven’t tested that

(The tests don’t notice because they use test values without a . (2e-5 not 2.1e-5), which never triggered this issue to begin with. We should maybe change that.)

@rebecca-palmer I don’t understand, the point is to disable attribute access. Variable/member names in Python cannot start with a number. So 2.1e-5 should never match the regex because a variable cannot start with 1, for example.

@robbmcleod (?!..) means "does not match this", so the items inside the (?!..) (.real, .imag, and numbers with a decimal point) are allowed while other uses of . (other attribute access) are blocked.

kozalosev added a commit to kozalosev/textUtilsBot that referenced this issue

Aug 25, 2023

If no one can think of adding any more tests to this I’ll prepare another release?

I’ll try and test locally against Pandas as well.

This is out of topic for this issue, but related: is there a link to a discussion to the decision to introduce this check? I ask because using a regex for this seems fragile (as this issue proves), plus the rationale itself seems a bit misguided: if it uses eval behind the covers, I think it would be best to warn users of that fact instead of trying to check that nothing fancy is going on – because no matter how much the check is well done, it might contain flaws, which will eventually will be passed on to eval, which is a security concern. Perhaps it would make more sense to just let users know that the string is passed on to eval, and the same security concerns apply here.

Ah sorry @rebecca-palmer, seem to have missed it.

I see the suggestion of adding a warning to the docs was given, and then decided to implemented a sanitizer. OK, thanks.

Related news

GHSA-f73w-4m7g-ch9x: Langchain vulnerable to arbitrary code execution via the evaluate function in the numexpr library

An issue in LanChain-ai Langchain v.0.0.245 allows a remote attacker to execute arbitrary code via the evaluate function in the numexpr library.

CVE: Latest News

CVE-2023-50976: Transactions API Authorization by oleiman · Pull Request #14969 · redpanda-data/redpanda