Data Scientists Dial Back Use of Open Source Code Due to Security Worries
Data scientists, who often choose open source packages without considering security, increasingly face concerns over the unvetted use of those components, a new study shows.
Vulnerabilities in open source components — such as the widespread flaws revealed 10 months ago in Log4j 2.0 — have forced data scientists to reevaluate the open source code frequently used in analysis and the creation of machine learning models.
According to a report by Anaconda, a data-science platform firm, in the past year, 40% of surveyed data scientists, business analysts, and students have scaled back their use of open source components, while a third remained steady, and only 7% incorporated more open source code into their projects. Most of those surveyed do not report to the information technology department (only 18% do); nearly half (47%) work within their own data science or research and development group, according to Anaconda’s “2022 State of Data Science” report, released last week.
While software developers and IT have already started vetting code for security, concern over the security of open source software is a relatively new trend for the data science world, says Peter Wang, co-founder and CEO of Anaconda.
“We see a tremendous portion of people who are at organizations where IT has created a very strict posture around open source and Python,” he says. “These are not expert developers. … They are data scientists and machine learning people who may not be very seasoned developers at all, using whatever they could download to do their analysis, and then they handed that over to IT.”
The security of open source components, and of the software supply chain in general, has become a primary consideration among software developers, businesses, and national governments over the past two years. In May, for example, the US National Institute of Standards and Technology (NIST) issued guidance for addressing software supply chain risks. In addition, a growing number of software vendors have joined the Linux Foundation’s Open Source Security Foundation (OpenSSF).
While many data science teams scan open source components for vulnerabilities, many create their own software instead. Source: Anaconda’s “2022 State of Data Science” report.
Overall, the maturity of organizations’ security efforts has improved. About half of firms have an open source security policy in place, which leads to better performance in measures of security readiness, according to a June survey. In addition, efforts to control open source risk have jumped by 51% in the past 12 months, according to a study of security maturity released on Sept. 21.
“[W]ith the attention placed on software supply chains, most enterprise organizations are taking a risk-based approach to application security,” Jason Schmitt, general manager of the Synopsys Software Integrity Group, said in a statement announcing the study. “Such an approach recognizes that security isn’t limited to the codebase; it includes the process of software development where security reviews and testing ‘shift everywhere’ to continuously improve security outcomes.”
**Devs Expand Use of Open Source**
Software companies are not seeing any sort of decrease in open source usage, according to other data. Instead, development organizations are focusing on improving the security of open source software and using security as a primary guide in selecting components.
In the “2021 State of the Software Supply Chain” report, for example, Sonatype found that the top four open source ecosystems — the Maven Central Repository (Java), Node.js (JavaScript), the Python Package Index (Python), and the NuGet gallery (.NET) — housed 37 million open source projects and components, an increase of 20% year-over-year. The demand for those components is likewise increasing: More than 2.2 trillion components were downloaded, a 73% annual increase.
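As a rough arithmetic check on those growth figures, the prior-year totals they imply can be backed out by dividing each current figure by one plus its growth rate. The sketch below uses only the numbers quoted above; the function name is hypothetical, and the results are approximations derived from rounded figures, not numbers taken from the Sonatype report.

```python
def implied_prior_year(current: float, growth_pct: float) -> float:
    """Return the prior-year value implied by a current value
    and a year-over-year percentage increase."""
    return current / (1 + growth_pct / 100)


# Figures as quoted in the article (rounded by the source).
projects_now = 37e6        # open source projects and components
downloads_now = 2.2e12     # components downloaded

projects_prior = implied_prior_year(projects_now, 20)
downloads_prior = implied_prior_year(downloads_now, 73)

print(f"Implied prior-year components: {projects_prior / 1e6:.1f} million")
print(f"Implied prior-year downloads: {downloads_prior / 1e12:.2f} trillion")
```

Under these assumptions, the quoted growth rates imply roughly 30.8 million components and about 1.27 trillion downloads a year earlier.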
A self-reported move away from open source packages by the data science community likely reflects greater awareness of security issues rather than a jettisoning of open source components in development, says Tracy Miranda, head of open source at Chainguard.
While data science teams and development teams may have reacted differently to major security issues, such as Log4j 2.0, companies moving away from one open source package have little recourse other than to adopt a different package whose maintainers have put a greater emphasis on security, she says.
“Companies leverage open source as a way to increase their velocity so if they are scaling back, what are they scaling back to? Writing code in-house? Using third-party versions packaged up?” Miranda says, adding that instead, “I do think we can expect to see companies be more discerning about the quality of the open source they use, especially related to security features.”
**Data Scientists Are Playing Catch-up**
The disconnect between the two sides is likely due to the different audiences in the various surveys. Anaconda’s survey focused on data science professionals, as can be seen from its respondents’ choice of programming languages: 58% used Python and 42% used SQL, while only 26% used JavaScript.
A better measure of software developer sentiments is Stack Overflow’s “2022 Developer Survey,” which found that while 58% of ‘people learning to code’ use Python, only 44% of professional developers code in that language. On the other hand, 68% of professional developers use JavaScript, according to Stack Overflow’s survey.
In addition, while data science professionals work at companies that overwhelmingly (87%) allow open source software, about a quarter (26%) have minimal oversight by the IT department of their open source choices, the Anaconda report stated. In another 18% of companies, the IT department only specifies about half of the available open source components.
The maintainers of the most critical projects — of which there are hundreds, if not thousands — need to use secure dependencies, test their own code, and validate the trustworthiness of contributors. The maintainers should also publish a security scorecard — a Google-created initiative now managed by the Open Source Security Foundation (OpenSSF), which gives a security grade to a project based on nearly 20 different criteria.
While awareness is likely increasing, there is no quick solution, Miranda says.
“The reality is that the more secure options have not previously existed,” she says. “Trimming unnecessary dependencies to reduce attack surface is sensible, but it’s hard to do once the dependency tree has grown large.”
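To illustrate why trimming gets harder as the tree grows, here is a minimal sketch that counts the transitive dependencies pulled in by a project. The package names and the graph are entirely hypothetical, invented for illustration, and are not drawn from any real project or from the reports cited above.

```python
from typing import Dict, List, Set

# Hypothetical dependency graph: package -> its direct dependencies.
DEPS: Dict[str, List[str]] = {
    "analysis-app": ["plotlib", "framekit"],
    "plotlib": ["imgcore", "fontutil"],
    "framekit": ["numcore", "datefmt"],
    "imgcore": ["numcore"],
    "fontutil": [],
    "numcore": [],
    "datefmt": [],
}


def transitive_deps(package: str, graph: Dict[str, List[str]]) -> Set[str]:
    """Collect every package reachable from `package` (direct and indirect)."""
    seen: Set[str] = set()
    stack = list(graph.get(package, []))
    while stack:
        dep = stack.pop()
        if dep not in seen:
            seen.add(dep)
            stack.extend(graph.get(dep, []))
    return seen


all_deps = transitive_deps("analysis-app", DEPS)
print(len(all_deps), sorted(all_deps))
```

In this toy graph, two direct dependencies expand to six packages in total. Dropping `plotlib` would remove `imgcore` and `fontutil`, but `numcore` would remain because `framekit` still depends on it, which is the kind of entanglement that makes trimming difficult in a large tree.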