[Review]: Deploying AI Systems Securely
Best Practices for Deploying Secure and Resilient AI Systems from NSA, GCHQ, CSEC, ASD and NCSC
When the tech spooks from all “Five Eyes” countries release guidance on AI system security, you had better take a look. The question is how much of it applies to your systems and how you can implement it cheaply and simply.
Original document: https://www.ncsc.govt.nz/news/deploying-ai-systems-securely/
I skip the “Executive Summary”, “Scope and Audience” and “Introduction” sections and jump straight to the main part. Even if you are not in the target audience, it won’t hurt to read the summary.
The document is divided into three major parts: secure deployment environment, continuous protection, and secure operation and maintenance. The first is concerned with the infrastructure, the second with the AI system itself, and the third with updates and handling change.
Secure the deployment environment
Manage deployment environment governance:
The main point here is to establish an organisational structure that can handle the risks, threats and mitigations of the AI system within the general IT (DevOps) context.
The document requires the AI system deployment team to provide a threat model. Mind you, this is _not_ the threat model of the AI system _itself_. Not that the latter is unimportant, I am just clarifying.
Ensure a robust deployment environment architecture:
Part of this section is standard IT security: Zero Trust frameworks, two-person control and integrity checks.
They correctly address the fact that not only code but also data and weights (binaries) are part of the system, and the same rules apply to them. This is true for both training and inference, and it is especially important if multiple (external) parties are involved (a relatively rare occurrence for startups, but it still needs to be addressed).
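To make this concrete, here is a minimal sketch (in Python, with made-up file and manifest names) of what “the same rules apply to data and weights” can mean in practice: pin a digest for every artifact and refuse to deploy anything that doesn’t match.

```python
# Minimal sketch: treat data and weights like code and verify their integrity
# before use. The manifest path and artifact names are made up for illustration.
import hashlib
import json
from pathlib import Path

def sha256(path: Path) -> str:
    """Compute the SHA-256 digest of a file in chunks."""
    h = hashlib.sha256()
    with path.open("rb") as f:
        for chunk in iter(lambda: f.read(1 << 20), b""):
            h.update(chunk)
    return h.hexdigest()

def verify_artifacts(manifest_file: str = "artifact-manifest.json") -> None:
    """Compare every artifact (weights, datasets, configs) against its pinned digest."""
    manifest = json.loads(Path(manifest_file).read_text())
    for name, expected in manifest.items():  # e.g. {"model.safetensors": "ab12..."}
        actual = sha256(Path(name))
        if actual != expected:
            raise RuntimeError(f"Integrity check failed for {name}")
    print("All artifacts match their pinned digests")

if __name__ == "__main__":
    verify_artifacts()
```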
Harden deployment environment configurations:
This is again standard IT security, with the important notion of keeping hardware-specific software up to date (GPU drivers, AI/ML packages, etc.) and the problems that come with it (these updates can themselves be a security risk).
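A small illustration of that idea: audit the installed AI/ML packages against an approved allowlist before deployment. The package names and pinned versions below are purely illustrative.

```python
# Minimal sketch: audit installed AI/ML packages against an approved, pinned list
# before deployment. Package names and versions are illustrative only.
from importlib import metadata

APPROVED = {          # hypothetical allowlist maintained by the platform team
    "torch": "2.2.2",
    "numpy": "1.26.4",
}

def audit_packages(approved: dict[str, str]) -> list[str]:
    """Return packages whose installed version differs from the allowlist."""
    problems = []
    for package, pinned in approved.items():
        try:
            installed = metadata.version(package)
        except metadata.PackageNotFoundError:
            problems.append(f"{package}: not installed (expected {pinned})")
            continue
        if installed != pinned:
            problems.append(f"{package}: {installed} installed, {pinned} approved")
    return problems

if __name__ == "__main__":
    for issue in audit_packages(APPROVED):
        print("WARNING:", issue)
```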
Protect deployment networks from threats:
Using tools to detect and block threats immediately should be a core feature in every IT system, AI or otherwise.
The majority of this section is similar to standard IT security recommendations, with the notable extension to data and weights, which are particularly important to AI systems.
In well-designed AI systems, where the system itself is defined as code (IaC, Infrastructure as Code), this should be relatively easy to achieve, and the focus should be on external dependencies.
Continuously protect the AI system
Validate the AI system before and during use:
This section is a speedrun of the last couple of years’ MLOps practices. Essentially, you need to make sure that whatever gets into production is _exactly_ what you think it is. You could write books on this; here they are mostly concerned with ensuring that the components (data, code, weights) can’t be tampered with while the AI system is being updated (see the sketch after this sub-point).
There are funny handwavy sentences like: “Thoroughly test the AI model for robustness, accuracy, and potential vulnerabilities after modification.”
Cheers, I guess…
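For what it’s worth, here is a minimal sketch of such a post-modification check: a promotion gate that re-evaluates the candidate model on a held-out set and refuses to promote it if it regresses. The `model.predict` interface, the tolerance and the accuracy metric are placeholders for whatever your stack uses.

```python
# Minimal sketch of a promotion gate: after any modification, re-evaluate the
# candidate model on a held-out set and refuse to promote it if it regresses.
def evaluate_accuracy(model, examples) -> float:
    """Fraction of held-out (x, y) examples the model predicts correctly."""
    correct = sum(1 for x, y in examples if model.predict(x) == y)
    return correct / len(examples)

def promote_if_valid(candidate, holdout, baseline_accuracy: float,
                     tolerance: float = 0.01) -> bool:
    """Promote only if the candidate is no worse than the baseline (within tolerance)."""
    accuracy = evaluate_accuracy(candidate, holdout)
    if accuracy + tolerance < baseline_accuracy:
        print(f"Rejected: accuracy {accuracy:.3f} below baseline {baseline_accuracy:.3f}")
        return False
    print(f"Promoted: accuracy {accuracy:.3f}")
    return True
```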
Secure exposed APIs:
This is again standard IT: AI systems play a similar role to backend services and must be treated the same way.
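As a tiny illustration of “treat it like a backend service”: authenticate every request to the model endpoint. The key store and client IDs below are made up; in reality the keys would live in a secret manager, not in source code.

```python
# Minimal sketch: authenticate every request to the model endpoint.
import hmac

API_KEYS = {"team-a": "s3cr3t-token"}  # hypothetical; use a secret manager in practice

def is_authorised(client_id: str, presented_key: str) -> bool:
    """Constant-time comparison of the presented API key against the stored one."""
    expected = API_KEYS.get(client_id)
    if expected is None:
        return False
    return hmac.compare_digest(expected, presented_key)

# Example: is_authorised("team-a", request_headers.get("x-api-key", ""))
```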
Actively monitor model behaviour:
The document here refers to malicious attacks, not actual model performance. While security practices will make sure that what the data scientists intended is what ends up in production, these changes must also be checked independently. But I think that is a business issue rather than a security one.
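A crude sketch of what “actively monitor” can look like on the security side: flag clients that hammer the endpoint with unusually many queries in a short window, which is one weak signal of extraction or probing attempts. The window size and threshold are arbitrary illustrative values.

```python
# Minimal sketch: flag clients that exceed a query-rate threshold in a sliding window.
import time
from collections import defaultdict, deque

WINDOW_SECONDS = 60
MAX_QUERIES_PER_WINDOW = 500

_recent: dict[str, deque] = defaultdict(deque)

def record_query(client_id: str, now: float | None = None) -> bool:
    """Record a query; return True if the client exceeds the rate threshold."""
    now = time.time() if now is None else now
    window = _recent[client_id]
    window.append(now)
    while window and now - window[0] > WINDOW_SECONDS:
        window.popleft()
    return len(window) > MAX_QUERIES_PER_WINDOW
```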
Protect model weights:
This is interesting because they treat weights as an extra-high-value asset, but I don’t see how this is different from, for example, a standard binary executable. It is interesting anyway that they felt it important to promote this to its own bullet.
The core point here is general MLOps principles: lineage and provenance, observability and recoverability. They don’t mention DORA metrics, but they are a useful mental model, as AI systems change more often and therefore require a higher level of agility than standard IT ones. I wrote about this here:
https://laszlo.substack.com/p/dora-metrics-simplified
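To show what lineage and provenance can mean in the smallest possible form, here is a sketch that records which code, data and weights produced a deployed model. The field names and values are made up; use whatever your model registry or experiment tracker expects.

```python
# Minimal sketch: record lineage/provenance so you can always answer
# "which code, data and weights produced this model?". Field names are illustrative.
import json
import time
from dataclasses import dataclass, asdict

@dataclass
class ModelLineage:
    model_name: str
    weights_sha256: str        # digest of the deployed weights file
    training_data_sha256: str  # digest of the training dataset snapshot
    git_commit: str            # code revision that produced the weights
    created_at: float

def write_lineage(record: ModelLineage, path: str = "lineage.json") -> None:
    with open(path, "w") as f:
        json.dump(asdict(record), f, indent=2)

# Example usage with made-up values:
write_lineage(ModelLineage(
    model_name="churn-classifier",
    weights_sha256="ab12...",
    training_data_sha256="cd34...",
    git_commit="9f8e7d6",
    created_at=time.time(),
))
```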
Secure AI operation and maintenance
Enforce strict access controls:
Emphasis on role-based access control (RBAC) and MFA: important but standard stuff.
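A toy sketch of RBAC for model operations, with made-up roles, users and permissions:

```python
# Minimal sketch: role-based access control for model operations.
ROLE_PERMISSIONS = {
    "ml-engineer": {"deploy_model", "read_logs"},
    "analyst": {"read_logs"},
}

USER_ROLES = {"alice": "ml-engineer", "bob": "analyst"}  # hypothetical users

def check_permission(user: str, action: str) -> bool:
    """Return True only if the user's role grants the requested action."""
    role = USER_ROLES.get(user)
    return action in ROLE_PERMISSIONS.get(role, set())

assert check_permission("alice", "deploy_model")
assert not check_permission("bob", "deploy_model")
```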
Ensure user awareness and training:
Same as before: basic IT security.
Conduct audits and penetration testing:
I guess this is really hard in a small startup setup, but regularly reviewing your systems from an adversarial point of view doesn’t cost too much, and you can identify problems that can be fixed (typically things that were overlooked in the usual hurry).
Implement robust logging and monitoring:
This is absolutely essential. Logging is usually thought of as an afterthought, but it should be implemented as a first-class citizen. You will thank yourself later.
Without logging you are running your system blind, and AI is a long-term statistical concept. To find out whether you are doing it right, you need to collect a long history. The same is true if you think you are doing it wrong…
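A minimal sketch of what “first-class” logging can look like for an AI system: one structured JSON record per prediction, so the long history is there when you need it. The field names are illustrative, and hashing the features instead of logging them raw is just one possible choice to avoid leaking data into logs.

```python
# Minimal sketch: emit one structured record per prediction and ship it to your log store.
import json
import logging
import time

logger = logging.getLogger("predictions")
logging.basicConfig(level=logging.INFO)

def log_prediction(request_id: str, model_version: str,
                   features_hash: str, prediction) -> None:
    """Log a single prediction as structured JSON."""
    logger.info(json.dumps({
        "ts": time.time(),
        "request_id": request_id,
        "model_version": model_version,
        "features_hash": features_hash,  # hash, not raw features, to avoid leaking data
        "prediction": prediction,
    }))
```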
Interestingly, they mix data drift (an AI modelling concept) with access issues (an IT concept); I would be conscious of treating these separately. Repetitive usage is indeed a security issue, but if it causes data drift, then we are talking about a modelling concept.
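If you do want to treat drift as its own (modelling) concern, a minimal sketch could be a two-sample Kolmogorov–Smirnov test between a reference window and live traffic. This assumes scipy is available, and the significance level is an arbitrary illustrative choice.

```python
# Minimal sketch: detect data drift on a single feature with a two-sample KS test.
from scipy.stats import ks_2samp

def has_drifted(reference_values, live_values, alpha: float = 0.01) -> bool:
    """True if the live feature distribution differs significantly from the reference."""
    statistic, p_value = ks_2samp(reference_values, live_values)
    return p_value < alpha
```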
Update and patch regularly:
Ugh, I am not sure if you are up to date on the xz backdoor https://news.ycombinator.com/item?id=40017310 but this problem just got much harder. I guess a bit of paranoia doesn’t hurt in IT security.
Patching is not just about updating version numbers and restarting services; the malicious change can potentially come from the update itself. So you _have_ to do it and you _have_ to be careful about it…
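One hedged example of “being careful about it”: verify the digest of a downloaded update against a value obtained from a separate, trusted channel before installing anything. The file name and digest are placeholders.

```python
# Minimal sketch: refuse to apply an update whose digest does not match a trusted value.
import hashlib
from pathlib import Path

def verify_update(artifact: str, trusted_sha256: str) -> None:
    """Raise if the artifact's SHA-256 digest does not match the trusted value."""
    digest = hashlib.sha256(Path(artifact).read_bytes()).hexdigest()
    if digest != trusted_sha256:
        raise RuntimeError(f"Refusing to apply {artifact}: digest mismatch")

# verify_update("model-v2.tar.gz", "ab12...")  # unpack/install only after this passes
```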
Prepare for high availability (HA) and disaster recovery (DR):
One of the benefits of IaC is that it makes it easier to migrate your systems, which helps with HA and DR. These should be planned for from the start.
Plan secure delete capabilities:
This is an interesting area. I guess in regulated environments you do sometimes need to throw things away for good. But you need strict policies to make sure you only drop what really needs to be dropped.
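A sketch of what such a policy could look like in code: gate every deletion behind an explicit, policy-approved list and log the removal. The names and paths are made up.

```python
# Minimal sketch: delete only artifacts the retention policy has explicitly approved,
# and log every removal.
import logging
from pathlib import Path

logger = logging.getLogger("secure-delete")
logging.basicConfig(level=logging.INFO)

APPROVED_FOR_DELETION = {"datasets/customer-extract-2021.parquet"}  # hypothetical policy output

def secure_delete(path: str) -> None:
    """Delete only artifacts that the retention policy has explicitly approved."""
    if path not in APPROVED_FOR_DELETION:
        raise PermissionError(f"{path} is not approved for deletion")
    Path(path).unlink()
    logger.info("Deleted %s under retention policy", path)
```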
The business-as-usual part of the AI lifecycle is probably the most important one, as successful AI projects spend most of their lifetime in this state, so it is the largest security surface.
Summary
Good engineering concepts (IaC, RBAC, etc.) and good MLOps concepts (lineage, provenance, observability, DORA) will take you a long way on this path. Reviewing the other security issues and forming the right processes and documents will help you conceptually strengthen your AI systems. By comparing these to your current practices, you can identify a list of TODOs to eliminate issues and reinforce your systems.