
DSE measures and improves DevOps

DSE wants to make sure that its DevSecOps vision is working across the company, and to demonstrate, quantifiably, that the changes are having the desired effect. Let’s take a look at how the company stacks up against four key DevOps metrics.

Red Hat Blog

Davie Street Enterprises is our fictional Red Hat customer that is working its way through real-world digital transformation problems, and this time around, it’s tackling measurement.

Newly-promoted Director of Security Engineering Zachary L. Tureaud knew that Monique Wallace, Davie Street Enterprises’ (DSE) CIO, was impressed with the work he had led on solution design, but he still needed to ensure his DevSecOps vision was working. It wasn’t enough to just point to all the new tooling his team had put together.

He needed to be able to demonstrate, quantifiably, that the changes were having the desired effect. Let’s see how he led DSE toward a DevSecOps practice using measured approaches from Google’s DevOps Research and Assessment team.

How is DevOps performing at DSE?

In a hallway conversation he had with Susan Chin, Senior Director of Development, and Ranbir Ahuja, Senior Director IT Operations, Tureaud learned that one of the guiding principles they had established was to define and measure goals throughout the DevOps life cycle—exactly what he was looking for.

Borrowing from the four key DevOps metrics that Google’s DevOps Research and Assessment (DORA) team had established, Chin and Ahuja had been measuring the team’s performance. They shared their initial findings with Tureaud, taken before the team led by Andres Martinez had guided the development teams into adopting DevOps:

| Metric | Performance level | Notes |
| --- | --- | --- |
| Deployment frequency | Low | Quarterly deployments were cause for celebration; anything faster than that was shot down by the change advisory board (CAB). |
| Lead time for changes | Low | A heavy backlog of high-priority fixes kept pushing new feature changes further and further toward the bottom of the priority pile. |
| Change failure rate | Low | The team routinely experienced a high failure rate, primarily due to security issues discovered after deployment. |
| Time to restore service | Low | Don’t even get Wallace started on this one. Spaghettified code, a result of the heavy pressure on the dev team to add features, and the resulting hypercomplex web of dependencies meant the team often needed four to six hours just to identify a root cause. |
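As a rough illustration of how figures like these could be derived, here is a minimal sketch that computes the four metrics from a list of deployment records. The `Deployment` record and its field names are hypothetical and are not taken from DSE's tooling; the sketch also assumes at least one deployment in the period.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta
from statistics import mean
from typing import Optional


@dataclass
class Deployment:
    """Hypothetical record of one production deployment."""
    committed_at: datetime                  # when the change was committed
    deployed_at: datetime                   # when it reached production
    failed: bool = False                    # did it cause a production failure?
    restored_at: Optional[datetime] = None  # when service came back, if it failed


def dora_metrics(deployments: list[Deployment], period_days: int) -> dict[str, float]:
    """Summarize the four key metrics over one reporting period."""
    failures = [d for d in deployments if d.failed]
    restored = [d for d in failures if d.restored_at is not None]
    return {
        # Deployment frequency, expressed as deployments per week.
        "deployments_per_week": len(deployments) / (period_days / 7),
        # Lead time for changes: commit to production, in days.
        "lead_time_days": mean(
            (d.deployed_at - d.committed_at) / timedelta(days=1) for d in deployments
        ),
        # Change failure rate: share of deployments that caused a failure.
        "change_failure_rate": len(failures) / len(deployments),
        # Time to restore service, in hours, over failures that were restored.
        "hours_to_restore": mean(
            (d.restored_at - d.deployed_at) / timedelta(hours=1) for d in restored
        ) if restored else 0.0,
    }
```

Keeping the raw numbers separate from the Low/Medium/High/Elite labels makes it easier to see movement within a level from one period to the next.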

The days of sneakernet code transfers (or worse, dealing with Subversion merges) were thankfully in the past after Martinez guided the adoption of GitLab for continuous integration/continuous delivery (CI/CD) and GitOps. The ephemeral review environments in particular were a major contribution toward improvements in deployment frequency and change lead time.

The streamlined CI/CD process gave the teams confidence in their configuration changes and deployments, but it also revealed some previously unknown weaknesses. Martinez’s team had continued to encounter issues related to “dependency creep,” a rather pernicious problem caused by too many developers using too many different versions of too many different dependencies.
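Dependency creep of this kind can be made visible with a simple inventory. The sketch below is illustrative only: it assumes each project's pinned dependencies have already been collected into a plain mapping (the project and package names in the usage example are made up), and it reports every dependency that is pinned at more than one version.

```python
from collections import defaultdict


def find_version_drift(
    projects: dict[str, dict[str, str]],
) -> dict[str, dict[str, list[str]]]:
    """Return dependencies pinned at more than one version across projects.

    `projects` maps a project name to its {dependency: version} pins.
    The result maps each drifting dependency to {version: [projects using it]}.
    """
    usage: dict[str, dict[str, list[str]]] = defaultdict(lambda: defaultdict(list))
    for project, pins in projects.items():
        for dependency, version in pins.items():
            usage[dependency][version].append(project)
    # Keep only dependencies that appear with two or more distinct versions.
    return {
        dep: {ver: sorted(users) for ver, users in versions.items()}
        for dep, versions in usage.items()
        if len(versions) > 1
    }


# Usage with made-up projects: only "requests" drifts, "pyyaml" is consistent.
drift = find_version_drift({
    "billing": {"requests": "2.31.0", "pyyaml": "6.0"},
    "portal": {"requests": "2.25.1", "pyyaml": "6.0"},
})
```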

To solve this, Martinez implemented a repository governance strategy that leveraged JFrog’s Artifactory. Working with Tureaud’s security team ensured that the repositories Martinez’s development teams were using had been vetted and determined to be secure and compliant. It had the added benefit of exposing his developers to packages that some teams were already using but others hadn’t encountered, and those packages were often much more performant.

The centralized dependency catalog’s caching via Artifactory’s remote repositories took things even further, providing not only enhanced stability but also reduced latency, which in turn meant build times were greatly reduced.

Figure 1.

Here is the result of the cross-functional team’s DevOps efforts:

| Metric | Performance level | Notes |
| --- | --- | --- |
| Deployment frequency | Medium | On average, the team deployed about every two weeks. |
| Lead time for changes | Medium | Getting from code commit to production took about a month, on average. |
| Change failure rate | Medium | The team’s failure rate had improved, but security issues discovered after deployment were still causing a non-trivial number of change failures. |
| Time to restore service | Medium | This metric was high on Wallace’s list to improve. On average, the team needed up to eight hours to restore service. |

Security will only slow us down, right?

Having been at DSE since before the shift to DevOps, Tureaud knew security’s reputation for slowing development to a crawl. He also knew that one of the reasons he had been promoted was a couple of projects he led to improve speed through security automation. Tureaud knew he could apply the same principles to improve Chin and Ahuja’s DORA metrics.

Reduce application security issues discovered in test and production

One of the first metrics Tureaud wanted to tackle was change failure rate, because of the number of security issues discovered after deployment. Using Synopsys and Sysdig, Tureaud implemented security controls in the build to ensure applications were secure before deployment.

Leveraging the work he did with the J.A.R.V.I.S project, Tureaud built automated application security scanning into the CI/CD pipeline, with Synopsys Intelligent Orchestration determining when to run Coverity SAST and Black Duck SCA on each build. One of the build checks Tureaud implemented as a Black Duck policy was to ensure that no high or critical vulnerabilities were identified in any of the application’s dependencies. This directly lowered the change failure rate, because these security issues were found by the development teams before deployment instead of by the security team afterward.
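The logic of such a build check can be sketched in a few lines. This is a generic illustration, not Black Duck's policy engine or API: the `Finding` record, the severity labels, and the `gate` function are all hypothetical, and the findings are assumed to have been exported from whatever scanner is in use.

```python
import sys
from dataclasses import dataclass

# Severities that should stop a build, mirroring the "no high or critical" rule.
BLOCKING_SEVERITIES = {"HIGH", "CRITICAL"}


@dataclass(frozen=True)
class Finding:
    """Hypothetical scanner finding for one dependency."""
    dependency: str
    vulnerability_id: str
    severity: str


def blocking_findings(findings: list[Finding]) -> list[Finding]:
    """Return only the findings severe enough to fail the build."""
    return [f for f in findings if f.severity.upper() in BLOCKING_SEVERITIES]


def gate(findings: list[Finding]) -> int:
    """Return a process exit code: 0 lets the build pass, 1 fails it."""
    blocked = blocking_findings(findings)
    for f in blocked:
        print(
            f"BLOCKED: {f.dependency} {f.vulnerability_id} ({f.severity})",
            file=sys.stderr,
        )
    return 1 if blocked else 0
```

A pipeline step would typically end with `sys.exit(gate(findings))`, so that a non-zero exit code stops the build before anything is deployed.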

Tureaud then used Sysdig’s OpenShift pipeline example to scan the image created in the build process for malware and sensitive content, and to ensure it adhered to DSE’s compliance standards. Scanning the image before it reached the container registry cut down significantly on registry costs, because the build would fail if the image wasn’t safe. Again, this lowered the change failure rate because unsafe images weren’t being deployed to production!

Percentage of deployments stopped due to failed policies

However, if left isolated, these CI/CD gates would actually start to hurt deployment frequency and lead time for changes. Why?

Well, once these gates were enabled, most of the builds would start to fail because the developers didn’t have the tools and knowledge they needed to submit secure applications and containers.

Tureaud knew that even before he implemented the CI/CD changes, he needed to shift the security controls left in the development process and hold a series of lunch-and-learns with the development teams on secure coding practices.

One of the key integrations Tureaud implemented for the developers was Synopsys Code Sight, an IDE plugin that brought Coverity SAST and Black Duck SCA functionality right to the developer’s desktop. This allowed developers to resolve application security issues before submitting to CI/CD, where those issues would otherwise break the build at Tureaud’s new security gates.
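The metric in this section's heading is straightforward to track once each build records whether a policy gate stopped it. The following sketch assumes a hypothetical list of `(team, was_stopped)` pairs, which is not a format any of the tools above is known to emit, and breaks the percentage down per team so it is clear where training or tooling is most needed.

```python
from collections import defaultdict


def stopped_by_policy(builds: list[tuple[str, bool]]) -> dict[str, float]:
    """Return, per team, the percentage of builds stopped by a failed policy."""
    totals: dict[str, int] = defaultdict(int)
    stopped: dict[str, int] = defaultdict(int)
    for team, was_stopped in builds:
        totals[team] += 1
        if was_stopped:
            stopped[team] += 1
    return {team: 100.0 * stopped[team] / totals[team] for team in totals}


# Usage with made-up team names.
rates = stopped_by_policy([
    ("payments", True),
    ("payments", False),
    ("portal", False),
    ("portal", False),
])
```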

Time to fix security issues

The last metric Tureaud had to improve was time to restore service. This was high on Wallace’s list of improvements with the goal of getting to an Elite level (restoring service in less than an hour).

Tureaud’s DevSecOps design seemed to level out at the Medium level, restoring service within eight hours on average. Tureaud discovered that the main challenge in resolving security issues wasn’t the tooling itself, but the response teams’ ability to react and to find all the great data within these tools quickly. For example, it typically took the response team 30 minutes to find who owned the affected application, and then another hour or so to do the research and give the development teams all the information they needed to apply the fix.

Tureaud turned to Splunk, which had recently partnered with Red Hat to deliver real-time insights across all stages of the application and software delivery life cycle. Splunk can aggregate and correlate data from Red Hat OpenShift and all the partner technologies in DSE’s DevSecOps architecture.

Figure 2.

Now when security incidents occur, information on the affected apps, the fixes, and who can apply the fixes is automatically gathered and correlated, allowing the right people to work on the right fixes right away. This reduced the time to resolve security issues from several hours to an average of one hour.
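The correlation described here amounts to joining an alert with ownership and remediation data. The sketch below shows that join in plain Python; it does not use Splunk's APIs, and the `IncidentBrief` record, the dictionary keys, and the lookup tables are all hypothetical stand-ins for data that would come from the tools in the architecture.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class IncidentBrief:
    """Everything a responder needs to start on a fix (hypothetical shape)."""
    application: str
    owner: str
    vulnerability_id: str
    fix: str


def build_briefs(
    alerts: list[dict[str, str]],
    owners: dict[str, str],
    fixes: dict[str, str],
) -> list[IncidentBrief]:
    """Join each alert with the owning team and the known fix."""
    briefs = []
    for alert in alerts:
        app = alert["application"]
        vuln = alert["vulnerability_id"]
        briefs.append(
            IncidentBrief(
                application=app,
                # Unknown owners are flagged rather than silently dropped.
                owner=owners.get(app, "unassigned"),
                vulnerability_id=vuln,
                fix=fixes.get(vuln, "no known fix"),
            )
        )
    return briefs
```

Doing this lookup automatically is what removes the 30 minutes spent finding an owner and the hour of research from each incident.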

DSE DevSecOps improves DORA

DevSecOps at DSE wasn’t built in a day, but Chin, Ahuja, and Tureaud’s persistence and great work in improving the team’s DORA metrics have materialized into the numbers shown below.

| Metric | Performance level | Notes |
| --- | --- | --- |
| Deployment frequency | High | On average, the team now deploys about once every five days. |
| Lead time for changes | Medium | Getting from code commit to production now takes about two weeks, and it is getting better! |
| Change failure rate | High | Hotfixes and rollbacks are very rare at DSE. |
| Time to restore service | Elite | With the comprehensive DevSecOps solution, DSE is able to restore service within an hour. |

The improvement in these four simple metrics provided concrete evidence that the development, operations, and security teams are no longer siloed.

Their collaboration paved the way for improved team and company dynamics. More importantly, these three teams (who rarely worked together) now consider themselves all part of the same cohesive team—building, deploying and maintaining secure applications for Davie Street Enterprises.

Learn more about how Red Hat solutions can help in your enterprise’s DevSecOps journey.
