
What happens when you train your AI on AI-generated data?

By Advanced AI Bot | May 20, 2025


AI companies say they are running out of high-quality data to train their models on. But they might have a solution: data generated by artificial intelligence systems themselves. The pros and cons of synthetic data.

Guest

Ari Morcos, co-founder and CEO of DatologyAI. Former research scientist at Meta’s Fundamental AI Research (FAIR) team and Google DeepMind.

Kalyan Veeramachaneni, CEO of DataCebo. Principal research scientist at the MIT Schwarzman College of Computing.

Also Featured

Felix Heide, professor of computer science at Princeton, where he leads the Princeton Computational Imaging Lab.

Richard Baraniuk, professor of electrical and computer engineering at Rice University.

Transcript

Part I

MEGHNA CHAKRABARTI: So the general understanding of how artificial intelligence models get trained is that they scoop up vast amounts of data from the real world and learn how to create responses that match that real-world data. Here's an example: a large language model, or LLM, the kind of tool that Siri or Alexa use to answer your questions.

In development, those LLMs read billions of text samples from across the internet, books, websites, et cetera. The model looks for patterns in how words work together, or really how humans use those words. And as it trains, it tries to guess what word comes next in a sentence. And if it guesses wrong, it fixes the mistake, it learns from that mistake.

And then it repeats that process billions and billions of times, each iteration getting better and better at guessing the right word. That's essentially how the LLM learns to understand and write like a human. So what happens when AI models run out of real-world data to train on?
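To make that training loop concrete, here is a minimal, illustrative sketch of next-token prediction in PyTorch. The model guesses the next token, the loss measures how wrong the guess was, and the optimizer adjusts the weights to fix the mistake. The tiny model, random token IDs, and hyperparameters are placeholders for illustration, not how any production LLM is actually built.

```python
# Minimal next-token prediction sketch (illustrative only, not a production LLM).
import torch
import torch.nn as nn
import torch.nn.functional as F

vocab_size, embed_dim = 1000, 64          # toy sizes, chosen arbitrarily

class TinyLM(nn.Module):
    def __init__(self):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, embed_dim)
        self.proj = nn.Linear(embed_dim, vocab_size)

    def forward(self, tokens):                 # tokens: (batch, seq_len)
        return self.proj(self.embed(tokens))   # logits over the next token

model = TinyLM()
optimizer = torch.optim.AdamW(model.parameters(), lr=1e-3)

tokens = torch.randint(0, vocab_size, (8, 33))   # stand-in for real text
inputs, targets = tokens[:, :-1], tokens[:, 1:]  # each position predicts the next token

for step in range(100):
    logits = model(inputs)
    # "If it guesses wrong, it fixes the mistake": the loss is large when the
    # model puts little probability on the true next token, and the gradient
    # step nudges the weights so the next guess is a bit better.
    loss = F.cross_entropy(logits.reshape(-1, vocab_size), targets.reshape(-1))
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
```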

Several research papers published in recent years suggest that developers will in fact run out of real-world data in a matter of years, but developers also say there might be a solution to that.

MARK ZUCKERBERG: I do think in the future it seems quite possible that more of what we call training for these big models is actually more along the lines of inference, generating synthetic data to then go feed into the model.

CHAKRABARTI: That’s Mark Zuckerberg, of course, talking to AI podcaster Dwarkesh Patel last April, and what he’s suggesting is this: that AI models built on real-world data would themselves create new artificial data, synthetic data, as he said, to train future models. Here’s OpenAI’s Sam Altman at the Sohn investment conference in May of 2023.

SAM ALTMAN: As long as you can get over the synthetic data event horizon, where the model’s smart enough to make good synthetic data, I think it should be all right.

CHAKRABARTI: But what is that event horizon? Also, I love that analogy, because event horizons are also a feature of black holes. So could we really just plunge into a black hole of AI synthetic data?

Perhaps more importantly, if AI’s purpose is ultimately to be used and beneficial to our world, our real human world, how is it even possible to run out of data? Aren’t we humans generating data all the time? And can synthetic data be an adequate or even acceptable replacement for reality? So let’s start today with Ari Morcos.

He is co-founder and CEO of DatologyAI, and a former research scientist at Meta’s Fundamental AI Research team and at Google DeepMind. Ari, welcome to On Point.

ARI MORCOS: Thank you so much for having me.

CHAKRABARTI: Also, I see you’re joining us from San Jose, California. Actually, not that long ago, I was driving down Highway 101 and every single billboard, every billboard was an AI billboard.

MORCOS: (LAUGHS)

CHAKRABARTI: Okay, so first of all, you heard my sort of populist version of the definition of synthetic data. How would you actually, or more precisely define what it is?

MORCOS: I think you actually gave a very good definition. But briefly, a synthetic data point is just a data point that’s generated by a model rather than generated by a human or created from the real world.

And that can be quite useful, so long as that synthetic data point is actually reflective of the underlying reality.

CHAKRABARTI: Reflective of the underlying reality. Okay. So we’re going to talk about how that data is generated and how to get to be sure that it satisfies that important caveat you gave. But let’s get right to this question, because it’s been perplexing me.

How can we run out of real-world data?

MORCOS: Yeah. So maybe to start, let me take a step back for a moment and just talk about how we got here to where we are now, where we might be running out of data. In the 2010s, the way you would train a machine learning model is that you would have some amount of data, you would go and you get a bunch of humans to label it.

So imagine you have a data set of lots and lots of pictures. Some of them are pictures of cats, some of them are pictures of dogs. You’d go and you have a bunch of humans say, this is a cat. This is a dog, this is a cat, this is a dog. And that is a pretty expensive and time-consuming process, you might imagine.

And as a result, the largest data sets that we would train our models on would be a million data points. Or something like that. There was a very famous academic benchmark called ImageNet that was used for a lot of progress in the last decade. That was about a million images.

And that’s called supervised learning, because it’s being supervised by a human, saying, this is a cat, this is a dog. But then in the late 2010s, we had this incredible breakthrough, which is called self-supervised learning, which means we figured out how to train models on data that hadn’t gone through this manual annotation process by a human.

And the vast majority of data that we have is not labeled, right? No human has ever looked at it and given it a label. So this massively unlocked the amount of data that we were able to train models on, going from a million data points being really large, say circa 2018, to now trillions of tokens, the entire internet.

So when you put that into perspective, that’s about a million-fold increase in the scale of data that we are feeding into these models in the order of three to five years.

Which is really wild when you come to think about it. And this is also why the compute spend has gone up so dramatically because the more data, the more GPU hours you need, which is why Nvidia of course does so well in all of this.

And that’s why we’ve seen this massive explosion. But this now literally means that for many of these models, we are feeding essentially the entire public internet into these models at this point.

CHAKRABARTI: And the public internet though, isn’t a fixed thing, right? We’re pouring data into it every second of the day, including all of the now AI slop that’s out there as well.

MORCOS: Yes. And that’s another big problem, right? That if we do train on synthetic data, we want to make sure it’s high-quality synthetic data that was intentional. We don’t want to train it on accidental synthetic data that just has made it onto the internet through AI slop as you call it.

So yeah, the internet is absolutely growing. But the hunger of these models in some ways is growing faster.

CHAKRABARTI: Okay, let me parse some of what you said just a little bit more. So first of all, are we running out of real-world data? Yes or no?

MORCOS: So I would push back on this, I think.

And I would beg this question a bit because I think yes. In the public domain we are exhausting what is currently available. There’s of course, always new data growing, and the amount of data that’s being put onto the internet is increasing with every passing day. So there always will be new data there, but we have exhausted the majority of it.

However, this question presupposes a notion that all data are created equal and that the only way for us to improve our models is to get more data, rather than making better use of the data we already have. And I would argue that there’s orders of magnitude, a hundred-fold improvements left by just making better use of the data we have already, rather than needing to collect more data.

The vast majority of the data on the internet is not particularly useful for training models, for a whole bunch of reasons. One is that a lot of it is very redundant. For example, think about how many different summaries of Hamlet there are on the internet. A model doesn’t need all of those.

Some fraction of them will be enough for the model to understand the plot of Hamlet. So there’s a lot of data that’s not useful and a lot of data that’s only useful at certain times. For example, imagine you were teaching a middle schooler math. If you showed them a bunch of arithmetic problems, it would be too simple for the student.

They know how to do addition and subtraction. Basic multiplication, division. It wouldn’t teach them anything. And similarly, if you were to show them calculus, it also wouldn’t be very helpful for the student. Calculus is way too difficult for an average middle schooler. You need to show them geometry and algebra.

That’s where they’re gonna be learning. When we train these models, we just mix all of those together and show it all, at all times of training rather than actually thinking about what is the data that’s gonna teach the model the most based off of what the model understands right now. And then using that to target and in a sort of curriculum, actually teach the model in the best way.

And that can enable massive increases in data efficiency.
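One way to picture the targeted training Morcos describes, show the model geometry rather than arithmetic or calculus, is to score candidate examples by how hard the current model finds them and keep only the instructive middle band. This is a hedged sketch of that idea, not DatologyAI's actual pipeline; current_model_loss is a hypothetical scoring function and the thresholds are placeholders.

```python
# Hedged sketch of difficulty-targeted data selection. current_model_loss()
# is hypothetical: in practice it would run the model being trained over each
# example and return its loss.
def select_instructive_examples(examples, current_model_loss,
                                too_easy=0.5, too_hard=6.0):
    scored = [(ex, current_model_loss(ex)) for ex in examples]
    # Drop what the model already knows (very low loss) and what it cannot yet
    # learn from (very high loss); keep the middle band that teaches it the most.
    return [ex for ex, loss in scored if too_easy < loss < too_hard]
```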

CHAKRABARTI: Okay. So targeted training, that’s gonna be my little slug for that, targeted training. But you also, so this gets us another phrase that you used a little earlier. High quality data. So is that what you’re talking about?

How would you define what high quality data is?

MORCOS: Yeah, so that is exactly what I’m talking about. Okay. As to how to define high quality, in many ways, I think that is the billion or perhaps trillion-dollar question even. And in many ways, that’s entirely what my company Datology was built to do, is to try to solve this problem of how do we understand what is really high-quality data?

And you make use of that to make models better and to solve these problems. The first and most important thing to understand about quality is that there’s no one silver bullet for this. In the sense that quality is very dependent upon what the use case of the model is.

For example, if I want to train a model that’s really good at legal questions and can serve as a legal assistant. Obviously, I’m going to value legal data more highly than I would data about movies or about history in some cases. Whereas if I’m training a model that’s going to help doctors, a health care assistant of some sort, obviously I’m gonna value health care data more.

The first thing to note is that it depends on what you’re gonna do. Now as to how you actually do this, it’s a real frontier research problem. And most of this research is of course being done within these big frontier labs. OpenAI, Anthropic, DeepMind, et cetera.

And this is literally the secret sauce that is distinguishing between these different models and labs.

CHAKRABARTI: Okay. Lurking behind a lot of this, though, Ari, and I think you tantalized us with that a little earlier, is money, right? It sounds like, look, I’m just inferring here, but it sounds like perhaps one of the impediments to trying to use the data we have better is that it may cost companies more to do that.

MORCOS: I think actually it’s a bit the opposite. If you can do it better, it actually saves a dramatic amount of money.

CHAKRABARTI: Then why aren’t more companies doing it? Why are we hearing Sam Altman say we need synthetic data?

MORCOS: I think there’s two reasons. I think one, because it’s really hard and I think synthetic data is absolutely a big part of the solution here.

Don’t get me wrong, I think that these are not mutually exclusive. And for example, at Datology we also use quite a bit of synthetic data. Now it’s not the panacea, I think, that it’s often made out to be. People talk about synthetic data as if it will be the end all, be all and fully replace it.

I think what we’ve seen is actually that a number of models that have been trained primarily on synthetic data actually have a lot of problems. In particular, they get very brittle and weird. They’re very good on the exact data that they’re trained on, but they don’t generalize to new formats or things that are a little bit different.

Part II

CHAKRABARTI: Ari, hang on for a second because I want to bring Kalyan Veeramachaneni into the conversation. He’s co-founder and CEO of DataCebo and principal research scientist at the MIT Schwarzman College of Computing.

Kalyan, welcome to On Point.

KALYAN VEERAMACHANENI: Thank you. Thank you for having me.

CHAKRABARTI: Okay. Ari did a really good job, I think, of laying out sort of the subtleties when we talk about what synthetic data is. But I want to just get a check from you. What do you think about his assertion that synthetic data’s going to be part of the picture moving forward, but we’re not actually running out of real-world data.

We just have to use what we have in the real world better.

VEERAMACHANENI: To a certain extent, I agree with that. But I wanted to give another perspective. I think the AI that we have as of today and are using is largely very small so far, and I don’t mean that in size, but in the tasks that it can do.

And as days go by, we are asking more and more of it. So originally it was just like, let’s chat with it. Let’s see if it finds us something. Let’s do search. And now we are asking like legal questions. We are asking, what do you think about this question? So we are asking it to reason, we are asking it to think.

So that requires us to provide more data and train more models that are much more efficient in reasoning, and can solve problems that we haven’t thought of solving with such models before. And in AI, I always say that anything worth predicting is very rare to happen. So that’s generally true.

And most of the AI models depend on predicting either the next word or the label of the sentence, or a sentiment of a sentence, and so on and so forth. As a result, for us to be able to train these models to predict such rare situations, we would have to create synthetic data, because they’re just rare.

They don’t happen that much in the world.

CHAKRABARTI: Okay. So let me, I’m just a tad bit confused here, because you said anything that’s worth predicting doesn’t happen very often. Because it’s like, that’s maybe why we want to be able to predict it. But what we’re asking models like the LLMs to do right now is predict the next word. Next words happen all the time, so I’m not quite sure what you’re saying there.

VEERAMACHANENI: So in LLMs, the next word prediction happens all the time. And we can predict the words, but what we are asking of them now is specific tasks, saying that, hey, I have this set of texts, does this mean fraud?

So we are asking at a meta level, we are asking, does this group of words mean something, a fraud or a sort of a hate speech or something else. So we are asking such questions of it, so we are asking more.

CHAKRABARTI: And why can’t AI be trained on real, on whatever real-world data that we have?

Why is what we have right now not satisfactory to make AI models be good at that kind of work?

VEERAMACHANENI: Great question. So I think if you just take the example of fraud, thankfully in banks and so on and so forth, the fraud happens rarely. So you have 10 million transactions that are not fraudulent, and you have 10,000 transactions that are fraudulent, and you have reports for those fraud transactions, fraudulent transactions.

So as a result, when banks are training a model to be able to detect from a certain report whether it’s truly fraud or not, you only have 10,000 of those, and you have a million or 10 million transactions that are not fraud at all. So as a result, when you try to train a model, it just latches onto the non-fraudulent examples and doesn’t have enough to learn from the fraudulent examples. So that’s just one example where there’s rare occurrence of an event that we want to predict or we want to reason about.
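The arithmetic behind that imbalance is stark: 10,000 fraud reports against 10,000,000 ordinary transactions is roughly 0.1 percent fraud, so a naively trained model can look highly accurate by almost never predicting fraud at all. A common first response is to rebalance the training set, for example by randomly oversampling the rare class. The sketch below is illustrative, with placeholder arrays rather than real transaction features.

```python
import numpy as np

n_fraud, n_legit = 10_000, 10_000_000
print(n_fraud / (n_fraud + n_legit))   # ~0.001: fraud is about 0.1% of the data

# Hedged sketch of naive random oversampling so the model sees fraud often
# enough to learn from it. X_fraud and X_legit are placeholder feature matrices.
rng = np.random.default_rng(0)
X_fraud = rng.normal(loc=1.0, size=(n_fraud, 8))
X_legit = rng.normal(loc=0.0, size=(100_000, 8))         # subsample of the majority class

idx = rng.integers(0, len(X_fraud), size=len(X_legit))   # draw fraud rows with replacement
X_balanced = np.vstack([X_legit, X_fraud[idx]])
y_balanced = np.concatenate([np.zeros(len(X_legit)), np.ones(len(X_legit))])
```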

CHAKRABARTI: Okay, Ari and Kalyan, let me, full disclosure here, my undergraduate majors were in Civil and Environmental Engineering.

So I’m definitely a very hands-on, concrete person. If I don’t have to wear a hard hat, it’s a little challenging for me to understand it. So I’d love, as we have this discussion, I’d love both of you to bring in as many real-world examples as you can to help us understand this. So Ari, what do you make of what Kalyan said?

Let’s take fraud, it’s a really good example, right? Because that’s a highly important area that we want as much AI assistance as possible in. The kinds of data that we have right now, Ari, as Kalyan’s saying, are inadequate in order to predict new kinds of fraud.

MORCOS: Yeah, so I think this is a good example for a couple reasons.

First off, I think this reveals one aspect of that we’re running outta data problem, which people don’t talk about, which is that the vast majority of data in the world is not public.

The vast majority of data in the world is private, sitting in large companies. As an example, there is very little data around fraudulent credit transactions in the public internet. But there is a whole bunch at Amex and Visa and Chase and large financial institutions. And that data is useful for various problems. Currently the big foundation model labs wouldn’t have access to those data.

There’s several companies that might license the data, but for the most part, that’s companies’ really valuable moat that enables them to build their own applications that can be really strong. But I want to touch on this notion of the edge cases or outlier examples, or what’s sometimes called the long tail, that Kalyan was just referring to. Because that’s absolutely correct.

I think one really salient version of this is self-driving cars. So if you consider, imagine, for example, all Teslas are constantly recording video data as they’re driving. If you think about that data set that has been collected, the vast majority of that data is gonna be on highways.

And highways are actually pretty simple for self-driving cars comparatively. They’ve been pretty good at them for a while. Autopilot has worked well on highways for a long time. They’re pretty predictable. You don’t have to worry nearly as much about a woman with a stroller who may or may not step into the street and get in the way of the car, or construction zones, or things like that.

Those are the edge cases that you really need to be looking at to make sure that your self-driving car isn’t going to have a terrible accident. And Kalyan’s right. Those are rarely represented in real data sets. Okay? However, one of the things we can do is identify those examples and then up sample them, repeat them or up weight them in some way so that the model sees them more frequently.
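As a rough illustration of upsampling or upweighting rare edge cases, here is a hedged sketch using PyTorch's WeightedRandomSampler so that, say, construction zones and pedestrians are drawn about as often as routine highway clips. The labels, features, and weights are made up for illustration; they are not any fleet's real data.

```python
import torch
from torch.utils.data import DataLoader, TensorDataset, WeightedRandomSampler

# Illustrative labels: 0 = routine highway clip, 1 = rare edge case
# (construction zone, pedestrian stepping out, blocked exit, ...).
is_edge_case = torch.tensor([0] * 9_000 + [1] * 1_000)
features = torch.randn(10_000, 16)                   # placeholder clip features

# Edge cases are 9x rarer here, so weight them 9x so the sampler draws them
# about as often as routine clips during training.
weights = 1.0 + 8.0 * is_edge_case.float()
sampler = WeightedRandomSampler(weights, num_samples=10_000, replacement=True)

loader = DataLoader(TensorDataset(features, is_edge_case),
                    batch_size=64, sampler=sampler)
```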

Another place for you.

CHAKRABARTI: Ari, I’m gonna make you stop there because you’re just —

MORCOS: Sorry.

CHAKRABARTI: You’re stealing our thunder. Because we talked to someone specifically about autonomous vehicles and one of those edge cases are really, driving is a perfect example, I’d say, because a while ago, again.

I was in the Bay Area, Ari, and I was in an autonomous taxi, and it pulled into the hotel where I was staying, and there was, I don’t know why, but someone had left a dumpster halfway covering the exit to the pull-in in front of the hotel lobby. And the taxi was stumped.

Like it did not know what to do. And it was just sitting there, and we had to call the company and kind of have some human come get us. And I was really confused about why it didn’t even think just to back up. Like a human would automatically be like, back up and go another direction. But the taxi at that time, and this was a couple of years ago, couldn’t do it.

But okay. Anyway, here’s a developer who’s working in the autonomous vehicle space. He’s Felix Heide, professor of Computer Science at Princeton and head of AI at Torc Robotics.

FELIX HEIDE: At this point, we’re able to generate high quality novel trajectories for autonomous vehicles that are almost photorealistic.

So we can take an existing driving sequence that we have observed, simulate our ego vehicle driving on that same route, but on the opposite side or in a squiggly line, or driving off the shoulder, leaving the drivable area or crashing into another vehicle that is driving ahead of us.

CHAKRABARTI: He tells us that these simulations can create incredibly realistic environments with other vehicles, pedestrians, trees, buildings, even fine detail like parking meters and trash cans.

Good to know. Along with cameras, lidar and other sensor technologies, Professor Heide says AI models can learn in a self-play type of way.

HEIDE: I can put them into a synthetic environment, a closed-loop environment, that over and over again provides them with new scenarios that challenge the model.

So through the self play, we can really unlock the original idea of reinforcement learning in a very convincing, exciting way, where we have the best superhuman driver trained in the simulation world until it sees all of the crashes that it needs to see in order to understand how to react.

CHAKRABARTI: And Professor Heide says, these environments help provide data points for situations that are rare or haven’t yet occurred in the real world as we’ve been talking about.

And it’s a key step, he says, for ensuring that the technology is safe.

HEIDE: If you look at the Waymo deployments, for example, they’re city by city, relatively slow, geofenced deployments of a hundred vehicles here and there, and it shows the potential of the technology, and I’m super excited about it. But to really bring it to scale, this is one of the key technologies that will allow us to bring these vehicles out in the hundreds of thousands and do it in a safe manner.

CHAKRABARTI: So that’s Felix Heide, professor of Computer Science at Princeton and head of AI at Torc Robotics. And Kalyan, let’s stick with this example for a second, because, again, the hard hat wearer in me can really understand it.

But I also feel like there’s a trust but verify aspect to this because these autonomous vehicles may do perfectly in training, using the synthetic data that’s given to them. But in terms of unleashing it into the real world, wouldn’t we want to have like a very tight regulatory scheme to be sure that they perform well in the real world?

VEERAMACHANENI: Absolutely. Absolutely. And autonomous vehicles especially have much more stringent testing requirements before you put them out in the real world. And look, this synthetic data creation, just stepping back, we were doing synthetic data generation even 20 years ago. In 2005, I was at GE, and at that point, they were generating synthetic data using a computational fluid dynamics based simulator for aircraft engines, GE90 engines, right?

So they will create the data, they’ll pretend as if the flight is happening, and this is through a software framework, and inject some faults and create the data. So what’s very important is that when you take the synthetic data, you mix it in with the real data in your actual development of the model. So you don’t essentially just train it with synthetic data.

So you mix in real data, you train a model, and then you test it rigorously. So in this case, I think they would try to test that autonomous car, I guess, in some locations and drive around with the new model.

CHAKRABARTI: Downtown Boston. (LAUGHS)

VEERAMACHANENI: (LAUGHS) Downtown Boston or maybe behind the dumpster.  And see.

So actually all those situations, like the one situation that you mentioned just previously, also become part of the test suite. So we now test the car, whether the autonomous driving is able to handle the new situations that it was not able to handle before. So that sort of rigorous stress testing is required before they are deployed in the real world.
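One way to operationalize that kind of stress testing is to turn every recorded real-world failure, like the dumpster blocking the hotel exit, into a regression test the next model must pass before deployment. The sketch below assumes a hypothetical simulator harness (sim_harness, load_scenario, run_policy_in_sim, and the candidate_policy fixture are all made-up names); it only illustrates the pattern, not any particular company's test suite.

```python
import pytest

# Hypothetical helpers: load_scenario() replays a recorded situation in a
# simulator; run_policy_in_sim() drives the candidate model through it.
from sim_harness import load_scenario, run_policy_in_sim  # assumed module

REGRESSION_SCENARIOS = [
    "dumpster_blocking_hotel_exit",
    "pedestrian_steps_off_curb",
    "construction_zone_lane_shift",
]

@pytest.mark.parametrize("name", REGRESSION_SCENARIOS)
def test_policy_handles_known_failure(name, candidate_policy):
    scenario = load_scenario(name)
    result = run_policy_in_sim(candidate_policy, scenario)
    assert result.collision_free
    # Reaching the goal or executing a safe fallback (e.g. backing up) counts.
    assert result.reached_goal or result.executed_safe_fallback
```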

CHAKRABARTI: Ari, I guess Kalyan just said what you were telling us earlier, that it’s a mix, that new AI models should be trained on this mix of synthetic and high-quality real data.

MORCOS: Yeah, I think that’s exactly right. You need to find the high quality real data, and that could involve finding a lot of those outliers, that could involve finding the most difficult examples.

And then you need to mix it in with appropriate synthetic data. When you think about what’s gonna make synthetic data work, there are generally two things that are really important. First of all, of course, synthetic data has to be reflective of the real world, right? Imagine I have a simulator where the rules, the laws of physics are different.

Obviously, a model is not gonna generalize from that to our world if, say, gravity there is half of what it is here. So the simulation actually has to match reality in order for this to work, number one. And then number two is that you have to make sure you generate diverse data. Diversity is in many ways the most important thing for high-quality data curation and for making these models learn. You have to make sure it covers lots and lots of scenarios, every possible way something could be presented.

Generally, failure —

CHAKRABARTI: Wait, but how can you do that when, so this is, again, forgive me for just being gauche here, but how can you do that when part of the problem is we can’t actually predict the infinite number of scenarios that we even as humans can be presented with every day.

MORCOS: And I think the answer is you can’t do it perfectly. You can do it well, and then what you do is you make a virtuous cycle there, where you start with some synthetic data, use that to make a model better, that model can now do better at generating more data, use that, and so on and so forth, until eventually you get a model that’s getting better and better.

It’s often the way that people think about this. But it doesn’t have to be perfect. It just has to be more informative than what the model currently understands. So long as it teaches the model something new, you can get to a certain point. Now that said, it does mean that if the synthetic data has a ceiling in quality, eventually you would reach a ceiling.

Now, the bet that many folks are making is that we can get past the ceiling with synthetic data, which I think there is some reasonable evidence to suggest we may be able to. But we haven’t yet reached that point. And we’ll have to see when we get there.

CHAKRABARTI: Kalyan, you’re leaning in here. Go ahead.

VEERAMACHANENI: Yeah.

So I think to be able to generate synthetic data, sometimes those rare examples that we find, we would use that to create more of them in the adjacent neighborhoods. And then once we create more of them, sometimes we do verification or engineer them. Like sometimes even we will go back to humans and verify those examples to see if they make sense and generate that.

So there’s an ability for us to engineer synthetic examples that can give us those new situations. And also, I wanted to add, like the example that you gave of the autonomous car and having a dumpster, when such situations happen, there is a recording of that data that is fed back and then we can use that data to create even more situations.

So we’ll just create, move the dumpster around or we will do more creation of synthetic data examples in that neighborhood of that example. So in a way, we are able to create more novel scenarios, even though we may not have that many to begin with.

CHAKRABARTI: Yeah. I wanna go back to the example that you gave earlier about fraud detection in the financial world, because I think it’s really important, when you said, look, the idea of using simulations is essentially a longstanding practice in technology development. Decades and decades old. But to your point, what we’re asking or want to ask AI to do is really different than, let’s say, training a fighter pilot in a simulator, right?

Because we’re going to eventually ask, we are asking these machines, even now, to make decisions for us that are in many ways removing the human element. Okay? And the reason why I say this is: is this a world in which maybe a possible good way to train AI on real-world data, financial fraud, is to say anything that doesn’t match these known non-fraudulent acts, okay, should be flagged?

Meaning just like program the AI to have a lot of false positives instead of trying to predict what new kinds of fraud could be. Does that make sense?

VEERAMACHANENI: Yeah, yeah. I think we can program the AI to flag a lot of examples that are non-fraudulent but that we still think are very close to the patterns of the fraud.

So those examples, we actually do that. We actually find examples that are very close to the fraud. And we say, but we know that they’re non fraudulent. So as a result, what we are seeing is we found out how people are bypassing our checks and balances, right? Because the fraudulent examples are very close to the non-fraud.

And use that to create sort of synthetic data.
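A rough sketch of what finding non-fraudulent examples that are very close to the fraud can look like in practice: measure each legitimate transaction's distance to its nearest known fraud case in feature space and keep the closest ones as hard negatives, or as seeds for nearby synthetic examples. The arrays below are random placeholders, and real systems would use far more careful features and distance metrics.

```python
import numpy as np
from sklearn.neighbors import NearestNeighbors

rng = np.random.default_rng(0)
X_fraud = rng.normal(loc=1.0, size=(500, 8))       # placeholder fraud features
X_legit = rng.normal(loc=0.0, size=(50_000, 8))    # placeholder legitimate features

nn = NearestNeighbors(n_neighbors=1).fit(X_fraud)
dist_to_fraud, _ = nn.kneighbors(X_legit)          # distance from each legit row to nearest fraud
nearest_idx = np.argsort(dist_to_fraud.ravel())[:1_000]
hard_negatives = X_legit[nearest_idx]              # legit examples sitting closest to the fraud boundary
```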

Part III

CHAKRABARTI: And I now want to, gentlemen, I wanna dig in much more deeply into, really, the potential downsides. Because I was highly skeptical coming into this hour about the need for synthetic data, but I’m relaxing that skepticism a little bit. But nevertheless, Ari, you had mentioned some words a little earlier, like brittleness, and so to that point, let’s listen to Rich Baraniuk, who’s a professor of electrical and computer engineering at Rice University in Houston, Texas.

And he and his team have been running experiments to see what happens when you train a new AI model using a combination of real-world data and synthetic data created by other generative AI models. For example, he’s asking models to produce realistic human faces. Okay? And the result he says sometimes literally is not pretty.

RICH BARANIUK: If your generative model creates even imperceptible artifacts in the output, maybe there’s a little bit of a distortion in the picture. Then as you continue this process over subsequent generations, those artifacts are going to be increasingly amplified.

CHAKRABARTI: Okay, so what he found is that the models trained on synthetic data were at the beginning producing realistic human faces.

But then as the training continued on those images, later outputs would have very strange patterns appearing on the faces. So Ari, the way I read this is there’s a high risk of, to put it bluntly, error amplification in using synthetic data. Is that?

MORCOS: Yeah, I think that’s right. Like all things in machine learning, there are far more ways to do synthetic data incorrectly than there are to do it correctly.

It’s much easier to mess it up than it is to get it correct. And I think if you just naively have a model generate synthetic data, feed that into a new model, have that model generate synthetic data, feed that into a new model, repeatedly, you are absolutely gonna get the sort of terrible artifacts that Rich is describing.
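A toy way to see the degradation Baraniuk and Morcos are describing: repeatedly fit a very simple generative model (here just a Gaussian) to samples drawn from the previous generation's model, with no fresh real data. Small estimation errors accumulate across generations instead of cancelling out, so the distribution drifts away from the real one. It is a cartoon of the "model collapse" flavor of the problem, not a reproduction of the Rice experiments.

```python
import numpy as np

rng = np.random.default_rng(0)
real_data = rng.normal(loc=0.0, scale=1.0, size=200)   # stand-in for "real world" data

mu, sigma = real_data.mean(), real_data.std()
for generation in range(1, 21):
    # Each new generation trains only on samples from the previous model.
    synthetic = rng.normal(mu, sigma, size=200)
    mu, sigma = synthetic.mean(), synthetic.std()
    print(f"gen {generation:2d}: mean={mu:+.3f}  std={sigma:.3f}")
# Over generations the mean wanders and the spread drifts away from the real
# data's: sampling errors compound when each model learns only from the last.
```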

I think the way around that is that every time you generate the synthetic data, you then filter it very aggressively. So you then say, what’s the synthetic data that came outta the model that’s actually realistic? Let’s keep that. What’s the synthetic data that came outta the model that’s a bit weird?

Let’s remove that. And I think this also dovetails into some of what Kalyan was saying earlier. I think there are two ways you can approach synthetic data at a philosophical level. One is, let me generate completely novel data that I’ve never seen before. That’s really hard and you’re likely to make mistakes that are going to propagate when you do that.

The other way is instead to say, let me take an example, like a fraud example that I’ve seen already, or an outlier self-driving car case that I’ve seen already. And then let me just tweak it a little bit. Let me make it so that it looks a little bit different, as if it was another presentation of the same sort of error.

That’s a lot easier to do. And is a lot less risky. So I think we’re first gonna see that sort of synthetic data and that’s what we do at Datology. A lot of times we’ll take documents, for example, and we’ll rephrase them into different formats so that the model can understand them, when presented in different ways.

And that form of synthetic data, I think, is a lot easier to get right, and a lot harder to mess up. When you start going to, I want to imagine an entirely new type of scenario that might go wrong, that’s where you’re more likely to start having these errors.
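A hedged sketch of the generate-then-filter-aggressively recipe: produce candidate synthetic examples, keep only the ones that pass a quality check, and mix the survivors back with curated real data. generate_candidates and realism_score are hypothetical stand-ins for whatever generator and filter a team actually uses, and the threshold and mixing fraction are placeholders, not recommendations.

```python
def curate_synthetic(generate_candidates, realism_score, real_data,
                     n_candidates=10_000, keep_threshold=0.9, max_synth_frac=0.5):
    """Hedged sketch: aggressive filtering of model-generated data.

    generate_candidates(n) returns n synthetic examples; realism_score(ex)
    returns a quality score in [0, 1]. Both are hypothetical.
    """
    candidates = generate_candidates(n_candidates)
    kept = [ex for ex in candidates if realism_score(ex) >= keep_threshold]

    # Cap how much of the final mix is synthetic (a roughly 50/50 ceiling is
    # mentioned later in the conversation).
    max_synth = int(len(real_data) * max_synth_frac / (1.0 - max_synth_frac))
    return list(real_data) + kept[:max_synth]
```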

CHAKRABARTI: Okay. Kalyan, let me push on this a little bit though, because I think another way, another term I’ve heard kicked around here is model collapse.

Because if these tiny errors or artifacts do get amplified in the way that it seems to me inevitably it could happen, right? Because we’re talking about billions and billions of iterations in the training of the models. I am not entirely convinced that we should put that aside as a concern.

VEERAMACHANENI: Yes. Yes. We shouldn’t put that aside as a concern. It is an important concern, and as Ari pointed out, while a lot of the synthetic data is generated by AI, we are there as part of the process to include it in the next model training as engineers, right? So we watch whether those training examples have artifacts and how they’re fed into the model.

Is the model collapsing? We have measures to check that. So there’s a lot of engineering that goes behind putting this synthetic data into training the models and checking how the training is going. The second thing I’d also push back on is, after the model is trained, there’s a lot of checks and balances before that model is deployed.

At DataCebo, we do that all the time. Any software or model that we deploy in the real world, there’s a lot of automatic checks and balances that we do. To Richard’s point, the professor from Rice, one of the checks is what he used to detect the artifacts. If you can imagine, we wouldn’t deploy such a model.

He had a check, whether it was visual or automatic. One of the things that we do now is implement a lot of automatic checks, because we don’t want to depend on humans. So after the model is trained, we do a lot of checks to make sure the model is performant and it’s not producing weird artifacts like that.

CHAKRABARTI: Let’s listen to a little bit more of what Professor Baraniuk had to say because he did offer a kind of a caution in terms of the use of synthetic data because he says there is an important question right now that’s still left unresolved.

BARANIUK: One of the big problems that we have is that there’s such a limited understanding of this phenomenon.

It’s still early days in trying to provide authoritative guidance on how much synthetic data is okay and how much isn’t okay. So that’s an area that we really need to advance.

CHAKRABARTI: What do you think Kalyan? How much is okay, how much isn’t okay. Is it case dependent?

VEERAMACHANENI: It is case dependent. It is very case dependent and use case dependent.

Yeah. So, again, the proportion is a parameter that we fine-tune as engineers and developers when using the synthetic data.

CHAKRABARTI: Okay. Use case dependent. Ari, let me turn to you on this because, and I want to hear both of you about this. Again, from the public’s point of view, AI is a very powerful and awesome tool, but it’s also already problematic.

We’ve done shows, in the way that we can here at On Point, about health care and AI, we focus on health care a lot, and about the ways that, depending on the question that you ask the AI, or what you’re asking the AI to look for in, let’s say, approving or disapproving insurance claims,

it’s summarily rejecting people who actually deserve to have their claims fulfilled. And it’s very hard in real time to catch those errors. Okay. So Ari, wouldn’t the use of synthetic data potentially make that problem even worse?

MORCOS: I think it could go either way.

I think it depends on how you use it. If you use it well, it could make the problem much better; if you use it poorly, it can absolutely make it worse. Which is why you need to have verification and audits on these systems, and why you have to be very careful that you’re putting in data that’s actually going to present well.

I think this also gets to Rich’s point, that this is still a frontier research problem. Not just synthetic data, but data research in general. There are a whole bunch of cultural reasons why data research has largely been overlooked by the machine learning community relative to things like architectures or other areas of AI research.

And there’s a lot more for us to understand here, and that’s actually a lot of why we created Datology, to do this research and then make it so that when we work with folks who want to train models, they get a really good mix of synthetic data and real data that’s not going to result in these sorts of errors.

And for example, we found that going beyond half of the data being synthetic pretty quickly causes issues. So we usually will cap it at about 50% synthetic data.

CHAKRABARTI: Kalyan, go ahead.

VEERAMACHANENI: Yeah, I wanted to add to your example. I think I’m holding in my hands a paper called Single Word Change is All You Need.

It’s one of the papers we wrote, about a classifier that classifies whether to, for example, give a loan or not, and all you have to change is one word in a sentence and it will just reject it. And there is no change in the meaning. There is no change in the sentence structure, nothing.

It’s just one word that made that classifier very fragile. And that classifier was not trained on synthetic data at all. It was trained on the real data. So one of the things that we now do, in the academic research community as well as in business, is that we try to create examples that will break a classifier that is trying to decide whether to give a loan or not.

And they call them adversarial examples. So basically, create an example that should go through the classifier and get a positive result, but just because you changed a word, or maybe even put a comma in the wrong place, it’s rejected. So now when we create such examples, we retrain the classifier or the model to make it better.

And as a result in doing so, what we are doing is we are essentially using synthetic training examples to make the model better. Because we took the examples that should pass. We tweaked them a little bit, see how fragile the model is, and use that data again to train, retrain the classifier so that it becomes more robust.

And so this is a very ongoing, very popular field of research called robustness of these models, of how to make them more robust by tweaking parameters and creating synthetic examples to train them better. So you can use it to address exactly the problem that you’re seeing.

Where one word changed everything.
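To illustrate the flavor of that single-word-change style of probing, here is a hedged sketch: swap one word at a time, keep any variant that flips the classifier's decision, and fold those variants back into training with the label they should have kept. classify and the substitution list are hypothetical stand-ins; this is not the method from the paper Veeramachaneni mentions.

```python
# Hedged sketch of single-word adversarial probing for robustness training.
def find_flipping_variants(sentence, classify, substitutions):
    """classify(text) -> label is a hypothetical model; substitutions maps a
    word to near-synonyms that should not change the meaning."""
    original_label = classify(sentence)
    words = sentence.split()
    flipped = []
    for i, word in enumerate(words):
        for alt in substitutions.get(word, []):
            variant = " ".join(words[:i] + [alt] + words[i + 1:])
            if classify(variant) != original_label:   # one word changed the outcome
                flipped.append((variant, original_label))
    # Retraining on these variants, with the label they should have kept,
    # is one way to make the classifier more robust.
    return flipped
```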

CHAKRABARTI: That is really interesting, but I’m also afraid I’m taking the wrong lesson from your example, Kalyan, which is the lesson that’s like screaming in my head is, wow, a lot of this can seem very arbitrary. Do you know what I mean? No, I’m serious.

Because, again, from like the normal human perspective if we are in a world now where these AI tools in certain examples, like you’re giving, having a comma in the wrong place, that we have to test for that. And the outcomes for that before we unleash the tool out into the real world.

Again, just speaking purely from the point of view of what we already know about how businesses operate: can we trust the industry or industries who are developing this, you two are willing to talk to me, many people aren’t, to be that robust in their testing while they’re developing, before the models are unleashed?

VEERAMACHANENI: They will see that as a result in the business metrics as well, at least we hope. For example, if it’s a fraud detection thing and you’re producing a lot of false positives, a lot of rejections of transactions, they will see that in customer satisfaction. They’ll see that a lot in the results that they’re seeing at the end of the day.

Where it becomes tricky is when your result is not immediately observable for long periods of time. So health care is a tricky place where, if you start deploying them, you have to be extremely careful, because you won’t see the effect for a long time. Things where there’s an immediate measurement available are different.

Businesses already have age-old practices to measure the outcomes. Customer satisfaction, number of false positives, some things that are black and white, you can just measure them. It’s easy to test and it’s easy to deploy. So I agree with you. I think it’s very important for areas where we don’t know and we can’t measure the outcomes that quickly.

And it takes time.

CHAKRABARTI: Okay, so Ari I want to actually circle back to roughly where we started, because there is a whole different way of thinking about this, right? Which is if you parse out the high-quality data from the vastly large data sets that we have, right? Train a model on it. See what the model’s doing right and what it’s doing wrong.

Tweak the model and then train it again on that same real data. Why isn’t that good enough?

MORCOS: I think that will get us pretty far. But the challenge is, at a certain point it will be challenging to find enough of that high-quality data. Although I think, again, if we can get access to the private data that’s out there and use that for particular use cases, that can do a lot.

Ultimately the data is everything for models. One of my favorite catch phrases is models are what they eat. If you show them really high-quality data, they’re gonna be high quality. If you show them low quality data, they’re gonna be low quality. In order for us to solve this problem, it’s gonna require bringing all of our solutions and all of the tools in our toolkit to bear.

We’re gonna have to do a lot on data curation of real data to enrich that and make that higher quality, and then use that higher quality real data as the guide to generate more high-quality synthetic data as well, and then combine the two of them to massively improve the data efficiency of our models.

So how quickly they can train, what performance they can reach, the reliability of our models. This is the number one problem with AI models in the real world, is that they’re not reliable enough. And also the cost of actually deploying these models, which is another huge factor, that running these models is quite expensive.

And as, of course, these AI products get more and more users, we’re gonna spend more and more data center compute on running these models. And when you use better data, you can get smaller models that are just as good, which means you can save compute costs, which saves financial costs, but also saves the environmental costs of training these models.

So we’re gonna have to take all of these tools in our toolkit and bring them to bear in order to solve these problems. But I’m quite optimistic. I don’t think we’re gonna completely, I think when we say we’re running outta data we’re being a bit hyperbolic. There’s a lot more we can do with our existing data.

CHAKRABARTI: Okay. So we have like less than 30 seconds left. I wanna ask you a tweak on the same question that I asked Kalyan, because ultimately my interest is in trying to have conversations where we get to a place where we understand what can we do as this technology is being developed to minimize the harm that may happen, right?

So that people don’t get hurt in the ways that we’ve described can already happen with AI tools. So regarding synthetic data, Ari. What do you think should the industry do, should regulators do, to try to minimize negative outcomes? Let’s put it that way.

MORCOS: I think ultimately, we have to test and measure.

You have to have a reliable testing framework. When we deploy a model, we come up with clear evaluation suites to understand how they’re performing and where their harms are. And then we also make sure we look at the real harms, like bias and claims denials and things like that, that are actually gonna affect real people in the near term.

The first draft of this transcript was created by Descript, an AI transcription tool. An On Point producer then thoroughly reviewed, corrected, and reformatted the transcript before publication. The use of this AI tool creates the capacity to provide these transcripts


