[MUSIC ENDS]
For our introductory episode, I’m pleased to welcome Amanda Craig Deckard from Microsoft to discuss the company’s efforts to learn about testing in other sectors.
Amanda is senior director of public policy in the Office of Responsible AI, where she leads a team that works closely with engineers, researchers, and policy experts to help ensure AI is being developed and used responsibly. Their insights shape Microsoft’s contribution to public policy discussions on laws, norms, and standards for AI.
Amanda, welcome to the podcast.
AMANDA CRAIG DECKARD: Thank you.
SULLIVAN: Amanda, let’s give the listeners a little bit of your background. What’s your origin story? Can you talk to us a little bit about maybe how you started in tech? And I would love to also learn a little bit more about what your team does in the Office of Responsible AI.
CRAIG DECKARD: Sure. Thank you. I’d say my [LAUGHS] path to tech, to Microsoft, as well, was a bit, like, circuitous, maybe. You know, I thought for the longest time I was going to be a journalist. I studied forced migration. I worked in a sort of state-level trial court in Indiana, a legal service provider in India, just to give you a bit of a flavor.
I made my way to Microsoft in 2014 and have been here since, working in cybersecurity public policy first and now in responsible AI. And the way that our Office of Responsible AI has really, sort of, structured itself is bringing together the kind of expertise to really work on defining policy and how to operationalize it at the same time.
And, you know, that means that we have been working through this, you know, real challenge of defining internal policy and practice, making sure that’s deeply grounded in the work of our colleagues at Microsoft Research, and then really closely working with engineering to make sure that we have the processes, that we have the tools, to implement that policy at scale.
And I’m really drawn to these kinds of hard problems where they have the character of two things being true or there’s, like, you know, real tension on both sides and, in particular, in the context of those kinds of problems, roles in which, like, the whole job is actually just sitting with that tension, not necessarily, like, resolving it and expecting that you’re done.
And I think, really, there are two reasons why tech is so, kind of, representative of that kind of challenge that I’ve always found fascinating. You know, one is that, of course, tech is, sort of, ubiquitous. It’s really impacting so many people’s lives. But the other, which I think has become part of our vernacular now but is not necessarily immediately intuitive, is the fact that technology is both a tool and a weapon. And so that’s just, like, another reason why, you know, we have to continuously work through that tension and, sort of, like, sit with it, right, even as tech evolves over time.
SULLIVAN: You bring up such great points, and this field is not black and white. I think that even underscores, you know, this notion that you highlighted that it’s impacting everyone. And, you know, to set the stage for our listeners, last year, we pulled in a bunch of experts from cybersecurity, biotech, finance, and we ran this large workshop to study how they’re thinking about governance in those playbooks. And so I’d love to understand a little bit more about what sparked that effort—and, you know, there’s a piece of this which is really centered around testing—and to hear from you why the focus on testing is so important.
CRAIG DECKARD: If I could rewind a little bit and give you a bit of history of how we even arrived at bringing these experts together, you know, we actually started on this journey in 2023. At that time, there were, like, a lot of these big questions swirling around about, you know, what did we need in terms of governance for AI? Of course, this was in the immediate aftermath of the ChatGPT sort of wave and everyone recognizing that, like, the technology was going to have a different level of impact in the near term. And so, you know, what do we need from governance? What do we need at the global level, in particular, of governance?
And so at the time, in early 2023 especially, there were a lot of attempts to sort of draw analogies to other global governance institutions in other domains. So in 2023 we actually brought together a different workshop than the one that you’re referring to, the one specifically focused on testing last year. And we, kind of, had two big takeaways from that conversation.
One was, what are the actual functions of these institutions and how do they apply to AI? And, actually, one of the takeaways was they all sort of apply. [LAUGHS] There’s, like, a role for, you know, any of the functions, whether it be sort of driving consensus on research or building industry standards or managing, kind of, frontier risks, for thinking about how those might be needed in the AI context.
And one of the other big takeaways was that, you know, there are also limitations in these analogies. You know, each of the institutions grew up in its own, sort of, unique historical moment, like the one that we sit in with AI right now. And in each of those circumstances, they don’t exactly translate to this moment. And so, yeah, there was like this kind of, OK, we want to draw what we can from this conversation and then we also want to understand, what is also very important that’s just different for AI right now?
We published a book with the lessons from that conversation in 2023. And then we actually went on a bit of a tour [LAUGHS] with that content where we had a number of roundtables actually all over the world where we gathered feedback on how those analogies were landing, how our takeaways were landing. And one of the things that we took from them was a gap that some of the participants saw in the analogies that we chose to focus on. So across multiple conversations, other domains kept being raised, like, why did you not also study pharmaceuticals? Why did you not also study cybersecurity, for example? And so that, you know, naturally got us thinking about what further lessons we could draw from those domains.
At the same time, though, we also saw a need to, again, go deeper than we had and really, like, focus on a narrower problem. So that’s really what led us to trying to think about a more specific problem where we could think across levels of governance and bring in some of these other domains. And, you know, testing was top of mind. It continues to be a really important topic in the AI policy conversation right now, I think, for really good reason. A lot of policymakers are focused on, you know, what we need to do to, kind of, have there be sufficient trust, and testing is going to be a part of that—really better understanding risk, enabling everyone to be able to make more, kind of, risk-informed decisions, right. Testing is an important component of governance in AI and, of course, in all of these other domains, as well.
So I’ll just add that the other, kind of, input into the process for this second round was exploring other analogies beyond those that we, kind of, got feedback on. And one of the early, kind of, examples of another domain that would be really worthwhile to study that came to mind from, sort of, just studying the literature was genome editing.
You know, genome editing was really interesting, and through the process of thinking about other kinds of general-purpose technologies, we also arrived at nanoscience and brought those into the conversation.
SULLIVAN: That’s great. I mean, actually, if you could double-click, I mean, you just named a number of industries. I’d love to just understand which of those worlds maybe feels the closest to what we’re wrestling with, with AI and maybe which is kind of the farthest off, and what makes them stand out to you?
CRAIG DECKARD: Oh, such a good question. For this second round, we actually brought together eight different domains, right. And I think we actually thought we would come out of this conversation with some bit of clarity around, Oh, if we just, sort of, take this approach for this domain or that domain, we’ll sort of have—at least for now—really solved part of the puzzle. [LAUGHS] And, you know, our public policy team the day after the workshop, we had a, sort of, follow-on discussion, and the very first thing that we started with in that conversation was like, OK, so which of these domains? And fascinatingly, like, everyone was sort of like, Ahh! [LAUGHS] None of them are applying perfectly. I mean, this is also speaking to the limitations of analogies that we already acknowledged.
And also, you know, all of the experts from across these domains gave us really interesting insights into, sort of, the tradeoffs and the limitations and how they were working. None are really applying perfectly for us. But all of them do offer a thread of insight that is really useful for thinking about testing in AI, and there are some different dimensions that I think are really useful as framing for that.
I mean, one is just this horizontal-versus-vertical, kind of, difference in domains and, you know, a horizontal technology like genome editing or nanoscience just being inherently different and seemingly very similar to AI in that you want to be able to understand risks in the technology itself, and there are just so many contextual, sort of, factors that matter in the application of those technologies for how the risk manifests that you really need to, kind of, do those two things at once—understanding the technology but then really thinking about risk and governance in the context of application, versus, you know, a context or a domain like civil aviation or nuclear technology, for example.
You know, even in the workshop itself that we hosted late last year, where we brought together this second round of experts, it was really interesting. We actually started the conversation by trying to understand how those different domains defined risks, where they were able to set risk thresholds. That’s been such a part of the AI policy conversation in the last year. And, you know, it was really instructive that the more vertical domains were able to, sort of, snap to clearer answers much more quickly. [LAUGHS] But, like, the horizontal nanoscience and genome editing were not because it just depends, right. So anyway, the horizontal-vertical dimension seems like a really important one to draw from and apply to AI.
The couple of others that I would offer is just, you know, thinking about the different kinds of technologies. You know, obviously, some of the domains that we studied are just inherently, sort of, like, physical technologies … a mix of physical and digital or virtual in a lot of cases because all of these are, of course, applying digital technology. But, like, you know, there is just a difference between something like an airplane or a medical device and the more kind of virtual or intangible sort of technologies, you know, of course, AI and some of the others like cyber and genome editing but also, like, you know, financial services having some of that quality. And again, I think the thing that’s interesting to us about AI is to think about AI and risk evaluation of AI as having a large component of that being about the kind of virtual or intangible technology. And also, you know, there is a future of robotics where we might need to think about the, kind of, physical risk evaluation kind of work, as well.
And then the final thing I’d maybe say in terms of thinking about which domains have the lessons for AI that are most applicable is just how they’ve grappled with these different kinds of governance questions. Things like how to turn the dial in terms of being more or less prescriptive on risk evaluation approaches, how they think about the balance of, kind of, pre-market versus post-market risk evaluation in testing, and what the tradeoffs have been there across domains has been really interesting to kind of tease out. And then also thinking about, sort of, who does what?
So, you know, in each of these different domains, it was interesting to hear about, like, you know, the role of industry, the role of governments, the role of third-party experts in designing evaluations and developing standards and actually doing the work, and, kind of, having the pull-through of what it means for risk and governance decisions. Again, there was a variety of, sort of, approaches across these domains that I think were interesting for AI.
SULLIVAN: You mentioned that there’s a number of different stakeholders to be considering across the board as we’re thinking about policy, as we’re thinking about regulation. Where can we collaborate more across industry? Is it academia? Regulators? Just, how can we move the needle faster?
CRAIG DECKARD: I think all of the above [LAUGHTER] is needed. But it’s also really important to have all of that, kind of, expertise brought together, you know, and I think, you know, one of the things that we certainly heard from multiple of the domains, if not all of them, was that same actual interest and need and the same sort of ongoing work to try to figure that out.
You know, even where there had been progress in some of the other domains with bringing together, you know, some industry stakeholders or, you know, industry and government, there was still a desire to actually do more there. Like, if there was some progress between industry and government, the need was for more, kind of, cross-jurisdiction government conversation, for example. Or there was some progress, you know, within the industry but a need to, like, strengthen the partnership with academia, for example. So, you know, I think it speaks to, like, the quality of your question, to be honest, that, you know, all of these domains are actually still grappling with this and still seeing the need to grow in that direction more.
What I’d say about AI today is that we have made good progress with, you know, starting to build some industry partnerships. You know, we were a founding member of the Frontier Model Forum, or FMF, which has been a very useful place for us to work with some peers on really trying to bring forward some best practices that apply across our organizations. You know, there are other forums as well, like MLCommons, where we’re working with others in industry and broader, sort of, academic and civil society communities. Partnership on AI is another one I think about that, kind of, fits that mold, as well, in a really positive way. And, like, there are a lot of different, sort of, governance needs to think through, and where we can really bring that expertise together is going to be so important.
I think about, almost, like, in the near to mid-term, three issues that we need to address in the AI, kind of, policy and testing context. One is just building, kind of, like, a flexible framework that allows us to really build trust while we continue to advance the science and the standards. You know, we are going to need to do both at once. And so we need a flexible framework that enables that kind of agility, and advancing the science and the standards is going to be something that really demands that kind of cross-discipline, cross-expertise group coming together to work on it—researchers, academics, civil society, governments and, of course, industry.
And so I think that, actually, is the second problem: like, how do we actually build the kind of forums and ways of working together, the public-private partnership kind of efforts, that allow all of that expertise to come together and fit together over time, right. Because when these are really big, broad challenges, you kind of have to break them down incrementally, make progress on them, and then bring them back together.
And so I think about, like, one example that I, you know, really have been reflecting on lately is, you know, in the context of building standards, like, how do you do that, right? Again, standards are going to benefit from that whole community of expertise. And, you know, there are lots of different kinds of quote-unquote standards, though, right. You kind of have the “small s” industry standards. You have the kind of “big S” international standards, for example. And how do you, kind of, leverage one to accelerate the other, I think, is part of, like, how we need to work together within this ecosystem. And, like, I think what we and others have done in an organization like C2PA [Coalition for Content Provenance and Authenticity], for example, where we’ve really built an industry specification but then built on that towards an international standard effort is one example that is interesting, right, to point to.
And then, you know, I actually think that bridges to the third thing that we need to do together within this whole community, which is, you know, really think again about how we manage the breadth of this challenge and opportunity of AI by thinking about this horizontal-vertical problem. And, you know, I think that’s where it’s not just the sort of tech industry, for example. It’s broader industry that’s going to be really applying this technology that needs to get involved in the conversation about not just, sort of, testing AI models, for example, but also testing how AI systems or applications are working in context. And so, yes, so much fun opportunity!
[MUSIC]
SULLIVAN: Amanda, this was just fantastic. You’ve really set the stage for this podcast. And thank you so much for sharing your time and wisdom with us.
CRAIG DECKARD: Thank you.
SULLIVAN: And to our listeners, we’re so glad you joined us for this conversation. An exciting lineup of episodes is on the way, and we can’t wait to have you back for the next one.
[MUSIC FADES]