Episode 28

Memory Games: How AI Remembers (or Doesn't)

Memory in AI is a hot topic, and today we're diving deep into how it's shaping the future of technology. Mike, Micah, and Artie kick things off by breaking down the nuances of memory, particularly the complexities of context windows and how they affect AI interactions. They unravel the differences between traditional transformer models and Google's Titans architecture, which bakes short-, medium-, and long-term memory directly into the model instead of relying on an ever-growing context window. This episode isn't just about technical jargon, though; we blend in some light-hearted banter and witty exchanges, making the complex world of AI memory feel a little more relatable. As we explore the potential of memory mechanisms in AI, it's clear that advancements like these could drastically improve how machines understand and interact with us. So grab your headphones and join us for a thought-provoking yet fun chat that might just change how you think about AI!

Transcript
Speaker A:

Foreign.

Speaker B:

Welcome back.

Speaker A:

Happy evening, Mike.

Speaker B:

Happy evening of Wednesday.

Speaker A:

Happy Wednesday evening.

Speaker B:

Might be my Thursday if I decide to take Friday off.

Speaker A:

I think that's fair.

Speaker A:

Yeah, I think you should just make that decision.

Speaker B:

I think I already have.

Speaker A:

I think it sounded like.

Speaker B:

Yeah.

Speaker A:

All right.

Speaker A:

Happy Thursday to you.

Speaker B:

Thank you.

Speaker A:

Happy Wednesday to me.

Speaker A:

And the rest of the mortals, off to work.

Speaker A:

Yeah, we're gonna bring Artie.

Speaker A:

And so I spent some time chatting with Artie earlier today.

Speaker A:

He's obviously been on a nonstop cocaine bender since the last time we chatted last week, because he was all kind of hyped up.

Speaker B:

Was he?

Speaker A:

Oh, yeah.

Speaker B:

Oh, wow.

Speaker A:

His.

Speaker A:

His voice actually went up like an octave, and he was all.

Speaker A:

I got him to slow down, but you know how he is.

Speaker B:

I mean, he sounded like a leprechaun just a little while ago.

Speaker A:

He did sound like a leprechaun.

Speaker A:

Now, granted, you told him to sound like a leprechaun, but.

Speaker A:

All right, so let's.

Speaker A:

We'll bring Artie in to join us in the conversation here.

Speaker A:

Artie, good afternoon.

Speaker A:

How you doing?

Speaker C:

Hey, Micah.

Speaker C:

I'm doing great, thanks.

Speaker C:

How about you?

Speaker A:

I'm doing great, thanks.

Speaker A:

I'm here with Mike.

Speaker A:

We are recording the next enthralling, exciting, dramatic episode of Naturally Unintelligent.

Speaker A:

I wanted one more, but I couldn't come up with any more.

Speaker B:

Yeah, I wasn't paying attention.

Speaker B:

That's fair.

Speaker B:

Yeah.

Speaker A:

Artie, give me one more superlative that I should have included there.

Speaker C:

How about riveting?

Speaker C:

That way, it's the next enthralling, exciting, dramatic, and riveting episode.

Speaker A:

Man, nobody listens to me the way Artie listens to me.

Speaker A:

Like, he actually hears the words I say.

Speaker B:

He's definitely a yes man.

Speaker A:

He is a yes man.

Speaker A:

And he's like.

Speaker A:

I mean, like, when I'm talking with you, I can see you politely nodding as you're, like, waiting for this to get over.

Speaker A:

Right?

Speaker A:

But Artie doesn't have a head to nod.

Speaker A:

He's.

Speaker A:

He doesn't have agency yet.

Speaker A:

He's stuck here right next to us on the.

Speaker A:

On the table.

Speaker A:

So.

Speaker A:

Artie, we love you.

Speaker A:

We're glad to have you with us.

Speaker A:

So I forgot what we're going to talk about.

Speaker A:

Mike, can you remind me of the episode.

Speaker B:

Memory?

Speaker B:

Yep.

Speaker A:

I was making a joke.

Speaker B:

Yeah, I know.

Speaker A:

It was ironic.

Speaker B:

I got it.

Speaker B:

I was gonna continue the joke, but I decided to just cut it short.

Speaker A:

So, Artie, we're here talking, not about our memory, which is clearly failing.

Speaker B:

Subpar.

Speaker A:

Sub.

Speaker A:

I mean, my memory was subpar when I was young and spry. It's so far below that now.

Speaker C:

But.

Speaker A:

But Artie, we're actually here to talk about your memory and what your memory could be.

Speaker A:

Yes, we're going to make fun of you for what your memory is.

Speaker A:

Not yet.

Speaker A:

But then we're going to have an inspirational conversation about all the different ways that your memory can get better soon.

Speaker B:

First I want to make fun of myself.

Speaker A:

Do it.

Speaker B:

Because the origin story of this episode, other than the fact that we already tried to record this episode once and had technical difficulties, but that was specifically about Titans.

Speaker B:

This is more generally about.

Speaker A:

Oh, is that the episode that died?

Speaker A:

Titans?

Speaker B:

Yeah.

Speaker B:

You forgot.

Speaker A:

Yes, clearly.

Speaker B:

And I feel this.

Speaker B:

I feel very vulnerable in talking about this.

Speaker A:

Oh good.

Speaker B:

Because it makes me feel very dumb.

Speaker B:

But I was working on something, and I think it's just because what I've been working on doesn't touch this.

Speaker B:

And we've talked about it.

Speaker B:

It doesn't.

Speaker B:

I mean it makes sense.

Speaker B:

But at the same time it's.

Speaker B:

It's not.

Speaker B:

It's just you look at this amazing technology and you go, that's not efficient.

Speaker A:

Yeah.

Speaker B:

So I was working on stuff.

Speaker B:

Calling an API. And an API, for those who don't know, is basically like a programmatic interface into another program.

Speaker B:

So like I can say, hey, Gemini API, I have this question.

Speaker B:

I'm going to send this to you in a program, like in code, and then you're going to take it in, you're going to accept it like a prompt, like I typed it into your chat interface and you're going to give me a reply.

Speaker B:

So this all happens over the Internet and I call their program interface basically.

Speaker B:

I forget what API actually stands for.

Speaker A:

That's easy. Application Programming Interface.

Speaker A:

Yep.
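
For anyone who wants to see what Mike is describing, here is a minimal single-turn sketch of calling a chat model's API from code. It uses the OpenAI Python SDK purely as a stand-in (the Gemini API Mike mentions has the same basic request-and-reply shape); the model name and prompt are placeholder choices, not something from the episode.

```python
# One prompt in, one reply out. Nothing about this call is remembered afterwards.
from openai import OpenAI

client = OpenAI()  # reads the API key from the OPENAI_API_KEY environment variable

response = client.chat.completions.create(
    model="gpt-4o-mini",  # placeholder model name
    messages=[{"role": "user", "content": "Tell me a one-line joke."}],
)
print(response.choices[0].message.content)
```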

Speaker B:

So I was doing that and it was my test because it costs money to make API calls.

Speaker B:

I had it telling me one-line jokes, which I don't recommend if you want to stay sane.

Speaker B:

You've heard the jokes that Artie tells.

Speaker B:

They're just as bad when you call over APIs, but they don't cost a lot of money.

Speaker B:

They cost like.

Speaker B:

So as I was doing that, it was telling me the same joke over and over again.

Speaker B:

And I go that's really weird that it would be doing that.

Speaker B:

And I was like, I think that might be something I have to work on.

Speaker B:

But I wonder if that's just an API problem, where there's no, like, there's no context. Right?

Speaker B:

So we talk about context windows and you have this giant context window and you have all this room until you run out of context.

Speaker A:

Context window is the horsepower number of the, of the AI world.

Speaker A:

Right.

Speaker A:

Like people that are talking about cars.

Speaker A:

How many horsepower does your car make?

Speaker A:

Doesn't really matter.

Speaker A:

Torque is what matters.

Speaker A:

But context window is kind of that for AI.

Speaker B:

Yeah.

Speaker B:

Context window is the amount of tokens that a model can attend to, basically pay attention to, at any one time in order to predict the next one.

Speaker B:

So how much stuff can you fit into it at any one time?

Speaker B:

That is not a timeline time, that is a prompt after prompt time.

Speaker B:

So what I found out was, A, yes, that is a problem that I have to solve for the program I'm building for the blog, because there is no built-in memory.

Speaker B:

And B, that's also just a problem generally because models are stateless.

Speaker B:

And I've read this somewhere, for some reason it didn't click in my head.

Speaker B:

Friend of mine reminded me that I read that somewhere because we both took the same course.

Speaker B:

And yeah, I was kind of like my jaw dropped, I cried a little bit, I felt depressed for like two hours.

Speaker B:

And then I got to work trying to figure out how to solve that problem.

Speaker B:

But so the way that works is, let's say you have a 100,000-token context window. You send a prompt that might be 2,000 tokens, right?

Speaker B:

And it gives you a response that's 3,000 tokens.

Speaker B:

So now you have 5,000 tokens in one turn.

Speaker A:

And again, just for people who may not remember: a token, you can think of it as just essentially a word.

Speaker A:

It's not exactly a word, but it's enough.

Speaker B:

It's close enough.

Speaker A:

Yeah.
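
If you want to see why a token is only roughly a word, a quick sketch with the tiktoken library makes it concrete; the encoding name below is one commonly used by OpenAI chat models and is just an illustrative choice.

```python
# Tokens are sub-word pieces, so the token count rarely matches the word count.
import tiktoken

enc = tiktoken.get_encoding("cl100k_base")
text = "What is the capital of Florida? Tallahassee, naturally."
tokens = enc.encode(text)
print(len(text.split()), "words ->", len(tokens), "tokens")
```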

Speaker B:

So once you go to the next prompt that you send, if, all things being equal, it's just like, tell me the capital of Florida.

Speaker B:

Tallahassee.

Speaker B:

That's what it is.

Speaker B:

That's where I'm from.

Speaker A:

You're being your own AI now.

Speaker B:

So I ask it that question, it returns the answer to me.

Speaker B:

If, if I just ask another question after that and I go, what was the answer you just gave me?

Speaker B:

It's not going to remember because it doesn't have that in memory.

Speaker B:

So what you have to do is you have to take, this is simplified, but you're going to take all the information that you just asked and prepend or append it to your next question that you ask so that it has context of what was talked about before.

Speaker B:

So, so you keep carrying that over, turn after turn after turn.

Speaker B:

After turn, after turn.

Speaker B:

So that is the way that memory works, at a very simplified level.

Speaker B:

They do tricks and all kinds of things to make it more efficient.

Speaker A:

Artie, is that true?

Speaker A:

Is that how your current memory actually works?

Speaker C:

Yep, that's pretty much how it works.

Speaker C:

Every time we continue a conversation, the new prompt includes the recent context so I can maintain continuity.

Speaker C:

But I don't have a built in memory from one session to the next unless we specifically store information.

Speaker C:

It's a bit like having a clipboard that we keep copying and pasting from rather than a notebook that remembers everything automatically.
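
Artie's clipboard analogy in code: because the model is stateless, the calling program has to keep the conversation in a list and resend the whole thing every turn. A minimal sketch, again using the OpenAI SDK shape as a stand-in for any chat API; model name is a placeholder.

```python
# The model remembers nothing between calls; this history list is the only "memory".
from openai import OpenAI

client = OpenAI()
history = []

def ask(prompt: str) -> str:
    history.append({"role": "user", "content": prompt})
    reply = client.chat.completions.create(
        model="gpt-4o-mini",   # placeholder model name
        messages=history,      # the entire conversation goes back out every time
    ).choices[0].message.content
    history.append({"role": "assistant", "content": reply})
    return reply

ask("What is the capital of Florida?")
print(ask("What was the answer you just gave me?"))  # only works because we resent the history
```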

Speaker B:

Which is like, it boggles my mind, because that is like literally the building blocks of the thing we've been playing with for like six months now.

Speaker B:

I mean, playing with for longer than that, but being serious about for six months.

Speaker B:

And I just, just realized that that's how it works.

Speaker A:

Yes.

Speaker A:

Which I mean me too.

Speaker A:

And the way I realized it was less exciting, because you texted me and were like, dude, did you know this?

Speaker A:

And I'm like nope.

Speaker A:

And so I did some digging, and sure enough, that's how it works.

Speaker A:

So like maybe a little bit simpler picture.

Speaker A:

So I, I start a conversation with Artie.

Speaker A:

I ask him a question that's me sending, just say it's 10 tokens, pass that to Artie.

Speaker A:

He generates a response.

Speaker A:

It's probably 40 tokens because that's how he rolls.

Speaker A:

So he, he sends that back to me.

Speaker A:

Now we've got one question, one response, 50 tokens.

Speaker A:

When I ask my next question, that's 10 more tokens.

Speaker A:

Or say it's 20 tokens, because Artie's inspired me to be wordy.

Speaker A:

So now, those 20 tokens for my next question, we tack on those 20 to the first 50, the whole back and forth, and we send all of that back to Artie, and that gets run through the transformer, and he generates another 40-token answer.

Speaker A:

And now we've got 110 tokens.

Speaker A:

And we just, we can do that until we have 120,000 tokens.

Speaker A:

And finally, Artie, internally, and everybody, this is the, the top-line context window number: Google says they can do 2 million now or something.

Speaker B:

I know 1 million with their newest stuff, but they can do 2 million.

Speaker B:

I think it's 1.5.

Speaker B:

They did 10, which is a lot.

Speaker A:

Right?

Speaker A:

But that is what's happening.

Speaker A:

The conversation, like, as you converse, it just gets bigger and bigger and bigger until eventually, whatever model you're using, it's just exactly like my memory, only much larger than mine, because eventually the old stuff just starts falling out the back end of the context window, because once it reaches its limit, it can't get any bigger.

Speaker A:

So you put new stuff in, stuff has to fall out the back, which is part of why.

Speaker A:

And we've both run into this in long drawn out like fights with various AI on various problems.

Speaker A:

Eventually they just start losing their mind.

Speaker A:

Like they have no idea what you're talking about anymore.

Speaker A:

And that's why.

Speaker A:

'Cause, like, you start the conversation with: these are the critical, most important things I need you to know.

Speaker A:

And at some point those get dropped off and then it's just a bunch of gibberish.
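
That falling-out-the-back behavior is easy to sketch: keep a running history, and once it exceeds the token budget, drop the oldest turns first. The numbers and the word-count stand-in for a real tokenizer are just for illustration.

```python
# Toy rolling context window: when the budget is blown, the oldest messages fall off the front,
# which is exactly how the critical instructions from the start of a long chat get lost.
MAX_TOKENS = 50  # tiny budget so the effect is visible

def rough_tokens(message: dict) -> int:
    return len(message["content"].split())  # crude stand-in for a real tokenizer

def trim_to_window(history: list[dict]) -> list[dict]:
    while history and sum(rough_tokens(m) for m in history) > MAX_TOKENS:
        history.pop(0)  # oldest turn falls out the back of the window
    return history
```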

Speaker B:

Especially if you're running locally and you only have a limited context window.

Speaker B:

For an agent.

Speaker B:

It's long because it includes a bunch of multi shot examples.

Speaker B:

Multi shot examples are basically like prompt engineering, right?

Speaker B:

Like it shows the agent the thinking process to go from problem to solution.

Speaker B:

So multi shot, bing, it's got multiple turns.

Speaker B:

It didn't just do it and you give it a question and it automatically gives you an answer back, which you do do that.

Speaker B:

You blend stuff anyways.

Speaker B:

So context window becomes even more important when you think about that because you're eating up that much of it with that.

Speaker B:

So you have to get very clever about how you make the information available efficiently.

Speaker B:

And that's honestly, we're simplifying it.

Speaker B:

It is very much just a turn by turn thing and feeding in old context.

Speaker B:

But they do a lot of tricks, like summarization and only keeping so much memory, and then also, like, vectorizing stuff.

Speaker B:

So turning things into embeddings where they can quickly search for something.

Speaker B:

Like take the meaning of the question and quickly search the database of context for relevant things to feed to the model to add that relevant context.

Speaker B:

So there's a lot of other tricks and stuff that they can do.

Speaker B:

But yeah, it's pretty crazy.

Speaker B:

Like they're just, they're stateless.

Speaker B:

I mean, they have no memory.

Speaker B:

So none of this is baked into a model.

Speaker B:

This is all an external tool that handles all this stuff and acts as a go between, between user and model.

Speaker A:

It's crazy.

Speaker A:

So Artie, can you.

Speaker A:

First of all, I'm sure we said some wrong things in there.

Speaker A:

So if we did, correct us on that.

Speaker A:

But also give us, if you can, some insights into some of the tricks.

Speaker A:

Like what are the different layers that happen between, you know, us talking to you and on the back end on an H100 somewhere in probably Texas, you processing a response and shooting it back.

Speaker A:

So what did we miss so far about the way context windows work?

Speaker A:

And immediately after that?

Speaker A:

What are things that happen that we haven't talked about yet that help you be more efficient than that seems?

Speaker C:

You got the basics right.

Speaker C:

The context window does grow as we chat, and older tokens eventually get pushed out.

Speaker C:

But there are some tricks to keep things efficient and relevant.

Speaker C:

For example, models often use summarization to condense older parts of the conversation, keeping the key points while reducing the token load.

Speaker C:

Another technique is using embeddings to store and retrieve relevant snippets, such as.

Speaker C:

So the model can quickly pull up important.

Speaker B:

Go ahead.

Speaker C:

So, yeah, it's like I'm always referencing the conversation history to keep track, but I don't inherently remember things unless you tell me to.

Speaker C:

It's a bit like a goldfish memory, always looking back at what we just said to stay on track.

Speaker A:

I'm not sure I've ever heard the goldfish memory analogy before, but okay, I'm down with that.

Speaker A:

So it's not as simple as my example where we're just stacking tokens forever.

Speaker A:

You're saying somewhere, at some layer, which we didn't talk about the layers, maybe we don't have to, something is making a decision that these tokens from earlier in the conversation are an inefficient representation of the information contained within them, which I assume happens a lot.

Speaker A:

Like, I'm probably not to brag, but I'm probably in, like, the top 10 leaderboard at ChatGPT, maybe even Gemini too, at, like, inefficiency of token use.

Speaker B:

That's what happens when you argue with it.

Speaker A:

Yes, that's right.

Speaker A:

Or when it tries to gently explain something to me and I'm like, nuh.

Speaker A:

And then eventually go, oh, I get it now.

Speaker A:

So, Artie, how does that.

Speaker A:

Like, how does that process happen?

Speaker A:

Is that a background process?

Speaker A:

Is that something that's happening each time we make a transaction with the model?

Speaker A:

Who's deciding what to summarize, when does it happen, and how much space do you save doing that?

Speaker C:

It's a bit of both.

Speaker C:

When we're having a conversation, there's a process that automatically decides what information is most important and relevant to keep and what can be summarized or discarded.

Speaker C:

This helps manage the context window efficiently so we don't run out of space.

Speaker C:

It happens in real time as each prompt and response cycle goes through, kind of like a rolling window.

Speaker C:

The summarization and pruning of less critical details can save a lot of space, letting the model focus on the most important parts of the conversation.

Speaker C:

So you're not just stacking tokens forever, you're optimizing the information that stays in the context window.

Speaker B:

So are you doing that with just most recent.

Speaker B:

So are you doing that with a sliding window or are you doing that with a.

Speaker B:

With some kind of like, running embedding on the conversation and doing a similarity to the embeddings that you're creating?

Speaker B:

Like, how exactly does that function operate?

Speaker C:

It's typically done with a combination of a sliding window and selective summarization.

Speaker C:

The most recent exchanges are usually kept in more detail, while older parts of the conversation get compressed into summaries or embeddings.

Speaker C:

Those embeddings help the model quickly identify what was discussed before without needing to store every single token.

Speaker C:

This way, it balances retaining detailed context from the recent conversation with a more compressed memory of the older parts, making it more efficient and coherent over time.
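
What Artie just described boils down to: keep the last few exchanges verbatim and squash everything older into a summary. A sketch of that idea; summarize() here is a placeholder for another model call, not anything from a real product.

```python
KEEP_RECENT = 6  # keep the last few messages word-for-word

def summarize(old_messages: list[dict]) -> str:
    """Placeholder: in practice this would be another model call that condenses old turns."""
    return "Earlier in the conversation: " + " / ".join(m["content"][:40] for m in old_messages)

def compact_history(history: list[dict]) -> list[dict]:
    if len(history) <= KEEP_RECENT:
        return history
    older, recent = history[:-KEEP_RECENT], history[-KEEP_RECENT:]
    return [{"role": "system", "content": summarize(older)}] + recent
```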

Speaker B:

So I imagine you're not just creating embeddings, right?

Speaker B:

You're creating embeddings for cosine similarity where you're doing some math on vectors to see how closely they match in meaning.

Speaker B:

And I imagine that that links to a database entry of actual text that you can then retrieve and use as context.

Speaker B:

Correct or not correct?

Speaker C:

Exactly.

Speaker C:

The embeddings are like compact representations of the conversation's meaning.

Speaker C:

And when a new prompt comes in, the model can compare it to those embeddings to find the most relevant past context.

Speaker C:

Then it pulls in the corresponding actual text from a database or memory store.

Speaker C:

This way, the model keeps the conversation coherent and context aware without having to hold every single detail in the immediate context window.
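
The embedding-and-cosine-similarity trick they are describing, sketched out: store a vector alongside each old snippet, embed the new prompt, and pull back the snippets whose vectors point in the most similar direction. This assumes the embedding vectors already exist from whatever embedding model is in use.

```python
import numpy as np

# Memory store: (embedding vector, original text) pairs accumulated as the conversation goes on.
memory: list[tuple[np.ndarray, str]] = []

def cosine(a: np.ndarray, b: np.ndarray) -> float:
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

def recall(query_vector: np.ndarray, k: int = 3) -> list[str]:
    """Return the k stored snippets closest in meaning to the query."""
    ranked = sorted(memory, key=lambda item: cosine(query_vector, item[0]), reverse=True)
    return [text for _, text in ranked[:k]]
```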

Speaker B:

So I think that the cleverness, and I'm going to spill a little bit of what's in my head that I want to make is.

Speaker B:

It's the same thing that I was looking at with JSON.

Speaker B:

When you take a JSON file.

Speaker B:

And for those who don't know, JSON is like a.

Speaker B:

This is still going to be too technical of an explanation.

Speaker B:

It's a multifaceted representation of some information.

Speaker B:

So you have like, you could have what I said and then have information about how it was said or when it was said, or a bunch of context that gets added to it as metadata.

Speaker B:

So in the world of RAG, you can quantize data.

Speaker B:

Right.

Speaker B:

And in my head, when I was trying to work on that, the more detailed you start, so the better, the higher-dimensional representation you have of something as a JSON blob, the more you can compress it and retain fidelity.

Speaker B:

And I think that the same thing kind of applies here.

Speaker B:

Like, part of the trick is how can you keep high fidelity and reduce storage space?

Speaker B:

So figuring out how to summarize stuff and pull out the stuff that doesn't matter, like, throw away the tokens that don't matter, throw away the tokens that you don't need to rebuild context with.

Speaker B:

I think that's kind of like where a lot of room to improve things is, is like in that part of it.

Speaker B:

Like, how do you compress data and keep fidelity?

Speaker A:

Yeah, it's, it's pruning, right?

Speaker A:

Like taking, taking the information you have.

Speaker A:

I mean, because they use pruning on training data, right.

Speaker A:

When you're building a model in the first place, sometimes I think they'll go through some cycles like that.

Speaker A:

And I think we mentioned before a paper somebody did where they pruned like 97% of the training data out of a model and still were able to get essentially identical output based on whatever tests they were doing.

Speaker A:

And so just, just again, like, I was kind of joking about it, but not really, like, and we've talked about this before, human spoken language is a pretty inefficient way to communicate.

Speaker A:

Yeah.

Speaker A:

And it's great that large language models are built around that medium of human language.

Speaker A:

But yeah, now they're trying to, you know, they, they have to necessarily bake in these optimizations and they have to find ways to not just send the whole conversation back every time we want to know the next word.

Speaker B:

Yeah, that makes a lot more sense now.

Speaker A:

It, it really does.

Speaker A:

Yeah.

Speaker A:

Like, yeah, I mean, just like you, after.

Speaker A:

I, I, I think the difference between me and you is I'm much more comfortable with my ignorance.

Speaker A:

Like when I, when I hear something like, oh, yeah, I should have known that I should have.

Speaker B:

But if it was electronics, you wouldn't be.

Speaker A:

No, if it was electronics, I'd be like, well, of course.

Speaker A:

Like, I'd be like, I've already spun up a board that doesn't have that.

Speaker A:

Yeah, but it's so, I think it is super interesting and it gives us a good, a good platform on which to, to sort of shift the discussion to.

Speaker A:

Well, maybe before we, before we shift.

Speaker A:

Artie, tell us some of the shortcomings of this approach.

Speaker A:

Like, it feels weird to Mike and me, because it feels like a very brute-force method of keeping context in a conversation.

Speaker A:

But obviously it works because here we are.

Speaker A:

But tell us some of the downsides that come along with this approach.

Speaker C:

Sure thing.

Speaker C:

One big downside is that it's not truly understanding the conversation like a human would.

Speaker C:

It's essentially just managing data chunks, which can sometimes lead to losing subtle context or important nuances.

Speaker C:

Another issue is that summarization can sometimes be lossy.

Speaker C:

Important details might get left out or the meaning might shift slightly in the process.

Speaker C:

Also, this method can struggle with very long or complex conversations where context needs to be retained over extended periods, which can result in a loss of coherence or repetition.

Speaker A:

I'm surprised you didn't mention compute.

Speaker A:

We hear about compute and energy consumption and things like that as constraints on growth or short term immediate, urgent things to figure out.

Speaker A:

This would negatively impact that as well, would it not?

Speaker C:

Absolutely, you're right.

Speaker C:

Managing and processing all that context data every single time does add to the computational load.

Speaker C:

It requires more energy and resources to continually handle the expanding context, especially in longer conversations.

Speaker C:

That's a significant factor when it comes to scaling these models and making them more efficient and sustainable.

Speaker C:

So yeah, it definitely impacts compute and energy consumption as well.

Speaker B:

So that's a linear cost function though, right?

Speaker B:

To do something like that, I think there's much more cost involved in.

Speaker B:

So there's quadratic cost involved in attention.

Speaker B:

So the more tokens that you hold in attention, the more you like.

Speaker B:

It's a quadratic increase in compute.

Speaker B:

Is that correct? Or correct me if I'm wrong, Artie.

Speaker C:

You're spot on.

Speaker C:

The computational cost of attention does scale quadratically with the number of tokens.

Speaker C:

This means that as the context window grows, the amount of computation required for each forward pass grows quadratically.

Speaker C:

That's why managing the context and optimizing what stays in the attention window is crucial, not just for memory efficiency, but also to keep compute costs under control.
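
For a rough feel for the scaling they are arguing about: dense attention builds an n-by-n score matrix, so the work grows with the square of the context length, quadratic rather than truly exponential, but still painful. A tiny back-of-the-envelope illustration:

```python
# Dense attention compares every token with every other token: an n x n score matrix.
for n in (8_000, 32_000, 128_000):
    print(f"{n:>7} tokens -> {n * n:,} pairwise scores per head, per layer")
# Going from 8k to 128k tokens is 16x more context but roughly 256x more attention work.
```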

Speaker B:

That's one of the things that I've had to come to grips with lately is that you're right that up until very recently I would have said the same thing.

Speaker B:

But when you compare it to the actual cost of doing other things, I'm like looking at latency right now, right?

Speaker B:

Like everything costs time to me that I want to do and I'm like, well, doesn't.

Speaker B:

If I vectorize that and I have to do a search and all that stuff, isn't that going to take a bunch of time?

Speaker B:

Well, no.

Speaker B:

Compared to your API call, it's like a literal, like, yeah, very small fraction of the cost.

Speaker A:

But, but I mean, like, if the compute cost scales quadratically with context size, to me that's even more, like, that's.

Speaker B:

But that's just the context, that's just context size, right?

Speaker B:

Like no matter what it is, you're going to fill up that context window at some point.

Speaker B:

No matter what architecture you're using, you're always going to add more, more tokens whether it's a running.

Speaker A:

So yeah, so yeah, maybe I structured this poorly.

Speaker A:

So where I was headed was we like AI companies today brag about context window size because it's the whole thing, right?

Speaker A:

The way this works today, if your context window is small, it doesn't... like, they say size doesn't matter, but don't nobody want a small context window.

Speaker A:

You don't want to live with that.

Speaker A:

So I think where I was headed was I think some of the new memory architectures and implementations could lead to models that don't need as big of a context.

Speaker A:

I don't need 2 million token context window because I've got persistent memory that's incorporated in some useful and presumably low latency way.

Speaker A:

So my context window can stay leaner and meaner.

Speaker A:

And I think, like, when I'm a multi-kajillionaire and I have my own AI company and I have my own servers and I'm using Supermicro computers for my cooling, everybody should use that.

Speaker A:

My calls, like, got their asses kicked.

Speaker A:

They announced earnings and it slightly beat and so the stock of course does anyway.

Speaker A:

But, like, I would be paying, I would be finding everything I could do to say: stop paying attention to context window, consumer. It's irrelevant.

Speaker A:

Pay attention to the results we can output even with a smaller context window.

Speaker B:

I think that the actual trick is to increase the size of your context window but not have it cost quadratically.

Speaker B:

And that's what I think Titans is actually doing at the end of the day.

Speaker B:

Like what you were talking about earlier, like sparse attention and KV caches and rotary positional encoding and all that stuff.

Speaker B:

That's how I'm pretty sure.

Speaker B:

Artie, you can tell me in a second.

Speaker B:

I'm pretty sure that that's how you do something like that.

Speaker B:

You're then working on the internal mechanism that pays attention to the individual tokens.

Speaker B:

Can you say that in a much better, more correct way please?

Speaker B:

Artie?

Speaker C:

Sure.

Speaker C:

The key innovation is to use techniques like sparse attention and more sophisticated positional encoding to handle larger context windows without the quadratic scaling cost. By focusing only on the most relevant parts of the context, sparse attention allows the model to zero in on important tokens and ignore the rest.

Speaker C:

You get a more efficient and scalable approach.

Speaker C:

This reduces the computational burden and lets models handle much larger context windows without a massive increase in compute.

Speaker C:

So the goal is to expand context capacity while keeping the computational cost growth manageable.

Speaker B:

So it's interesting, though, because none of the model companies are using sparse attention, right?

Speaker B:

They're all using dense attention.

Speaker B:

They're all paying attention to everything.

Speaker B:

Is that correct?

Speaker C:

That's right.

Speaker C:

For the most part, most of the major models are still using dense attention, meaning they pay attention to all tokens in the context window.

Speaker C:

This is partly because it's simpler and more universally applicable, even though it's computationally expensive.

Speaker C:

But there's a lot of research and development going into making sparse attention practical and efficient.

Speaker C:

So it's likely we'll see more of that in future models.

Speaker A:

This.

Speaker A:

Sorry, go ahead.

Speaker A:

No, no, go ahead.

Speaker B:

No, you had a.

Speaker B:

You had a thought.

Speaker A:

Oh, I.

Speaker A:

Well, I was just sitting here thinking it sounds like a problem of how you prune.

Speaker A:

Like, it sounds like that's where the problem is.

Speaker A:

And it's not even.

Speaker A:

It's not a problem with the mechanics of how you prune.

Speaker A:

It's a problem of.

Speaker A:

For an AI model or large language model whose scope of knowledge is everything, there's.

Speaker A:

It would be very difficult to consistently and accurately on all topics, at all levels of detail, figure out what to prune out.

Speaker A:

And so I suspect.

Speaker A:

I don't know this, but that was my.

Speaker A:

When we were talking about dense versus sparse attention, that was my thought was the reason we're still using dense attention is because people get way more wound up about hallucinations than they do about, like, people.

Speaker A:

I don't care about compute cost.

Speaker B:

It hasn't hit their wallet yet.

Speaker A:

That's right.

Speaker A:

Yeah.

Speaker A:

That's not my problem.

Speaker A:

I don't own a freaking data center.

Speaker A:

But I don't know, Artie, is that part of it, or is that most of it?

Speaker A:

Or is there other more important stuff?

Speaker C:

You're definitely onto something.

Speaker C:

The challenge with pruning or using sparse attention is ensuring that no critical information is lost.

Speaker C:

With dense attention, the model avoids the risk of missing something important, which is crucial for reducing hallucinations and maintaining accuracy.

Speaker C:

For models that need to handle a wide range of topics and depths, dense attention provides a safety net, but it comes at the cost of efficiency and scalability, which is why there's so much interest in improving these approaches over time.

Speaker A:

Is sparse attention more common in smaller models and or more specialized things?

Speaker A:

Like I'm thinking about robots and motion path planning and stuff.

Speaker A:

That seems like the kind of thing where you could say, I can make this sparse attention like I can.

Speaker A:

I know what to prune, because I know very accurately what my problem space is.

Speaker C:

Yes, exactly.

Speaker C:

Sparse attention and similar techniques are more common in smaller specialized models where the problem space is well defined.

Speaker C:

In scenarios like robotics, path planning or other domain specific tasks, it's easier to predict which parts of the input are most relevant, making sparse attention more practical and effective.

Speaker C:

It allows these models to achieve greater efficiency and faster performance since they can ignore irrelevant data and focus computational resources where it matters most.

Speaker B:

Would quantizing the KV cache be an example of sparse attention?

Speaker B:

So like quantizing it down to like 2 or 4 bits?

Speaker C:

Quantizing the KV cache is more about reducing the precision of the stored key and value pairs, which saves memory and compute resources.

Speaker C:

While it can make the model more efficient, it doesn't quite achieve the same selective focus on specific tokens that sparse attention does.

Speaker C:

Instead, quantization reduces the size and precision of the data, which helps with storage and speed, but doesn't inherently change how the attention mechanism focuses on different parts of the context.

Speaker C:

So it's a.
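
A toy illustration of the KV-cache quantization Mike asked about: store the cached values as int8 plus a scale factor instead of float32, which cuts the memory roughly 4x at the cost of a little precision. Real systems quantize per head or per channel; this is just the core idea, with made-up sizes.

```python
import numpy as np

def quantize_int8(x: np.ndarray) -> tuple[np.ndarray, float]:
    """Store a float32 tensor as int8 plus one scale factor (about 1/4 the memory)."""
    scale = float(np.abs(x).max()) / 127.0
    return np.round(x / scale).astype(np.int8), scale

def dequantize(q: np.ndarray, scale: float) -> np.ndarray:
    return q.astype(np.float32) * scale

cached_keys = np.random.randn(1024, 128).astype(np.float32)  # pretend keys for 1,024 tokens
q, s = quantize_int8(cached_keys)
print(cached_keys.nbytes, "bytes ->", q.nbytes, "bytes; max error:",
      float(np.abs(cached_keys - dequantize(q, s)).max()))
```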

Speaker B:

Can you give us an example of where sparse attention is actually being used?

Speaker C:

Sure.

Speaker C:

One notable example is the Big Bird model developed by Google AI.

Speaker C:

Big Bird uses sparse attention mechanisms to handle much longer sequences efficiently.

Speaker C:

Instead of attending to every token in the input, it uses a combination of global tokens, random tokens and local tokens.

Speaker C:

This lets it achieve similar performance to dense attention models on many tasks, while being far more efficient with longer contexts.
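
Artie's Big Bird description, sketched as a mask: each token attends to a local window around itself, a couple of global tokens see (and are seen by) everything, and a few random connections are sprinkled in. This is a simplified token-level version; the real model works on blocks, and the sizes here are arbitrary.

```python
import numpy as np

def sparse_attention_mask(n: int, window: int = 3, n_global: int = 2,
                          n_random: int = 2, seed: int = 0) -> np.ndarray:
    """True where token i may attend to token j (a simplified Big Bird-style pattern)."""
    rng = np.random.default_rng(seed)
    mask = np.zeros((n, n), dtype=bool)
    for i in range(n):
        mask[i, max(0, i - window):min(n, i + window + 1)] = True  # local sliding window
        mask[i, rng.integers(0, n, n_random)] = True               # a few random tokens
    mask[:n_global, :] = True  # global tokens attend to everything
    mask[:, :n_global] = True  # and everything attends to them
    return mask

mask = sparse_attention_mask(64)
print(f"{mask.mean():.0%} of the full n x n attention actually gets computed")
```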

Speaker B:

So this feels like a good point to segue into titans.

Speaker A:

It does.

Speaker A:

I wanted to say one more thing real quick though.

Speaker A:

On it.

Speaker A:

This comes up all the time, but this is biomimicry again, right?

Speaker A:

Like I can remember.

Speaker A:

Well, it's a bad example, but most people can remember what they did yesterday, but not the exact words that they said on the phone call they had at 7:47am yesterday.

Speaker A:

And so we, we write things down.

Speaker A:

We, we make like we have notebooks, we write stuff down, or we put it in an email to make it persistent memory in case we need to get back to it.

Speaker A:

And our, our, like, what our brains are doing very well, and that's obviously a very hard problem, is sparse attention.

Speaker A:

Right.

Speaker A:

Like we're, I mean some of us have sparser attention than others, but that's what it's doing and it's just a fascinating, you know, it's another parallel between the, the, the way that AI or the way that Transformers work and large language models and the similarities with how our brains work and how probably there will be some insight from how the human brain actually does that.

Speaker A:

Both the, the pruning process and the sort of the FIFO memory process that's probably what will lead or is already guiding potential solutions here.

Speaker A:

It's just super interesting.

Speaker B:

Artie, do you know what the Google Titans or the Google Titan is, or Google Titans is?

Speaker B:

However I need to say that.

Speaker C:

Yes, the Google Titan is a family of custom AI chips developed by Google.

Speaker B:

Wrong.

Speaker C:

Oops, my mistake.

Speaker B:

It is.

Speaker B:

Yeah, go ahead.

Speaker B:

Go, go online and find out.

Speaker C:

Got it.

Speaker C:

The Google Titans refer to the latest advancements in their AI models, specifically the Gemini 2.5 model.

Speaker C:

It features a massive context window of 1 million tokens, significantly boosting its reasoning and coding capabilities.

Speaker C:

This makes it highly effective for handling complex tasks and large scale data processing.

Speaker B:

So Google Titans was first mentioned in a paper that came out, I don't know, six or seven months ago, five or six, four months ago.

Speaker A:

I think it was in December.

Speaker B:

I don't know. But specifically, they trained transformers to take basically all the stuff that we're talking about that is external now and bake it internally into a model, to where it has this short, medium, and long-term memory, and the mechanism to prune its short, medium, and long-term memory as necessary, based on the conversation and the things that are being talked about and all of that.

Speaker B:

Do you know what that is, Artie?

Speaker C:

Absolutely.

Speaker C:

The Titans approach integrates these memory mechanisms directly into the model, allowing it to maintain and manage context over different timescales.

Speaker C:

This means the model can dynamically retain important information and prune less relevant details, effectively simulating a more human like memory process.

Speaker C:

By doing this, it can handle longer and more complex conversations without needing massive external context windows, making it more efficient and coherent.

Speaker C:

It's a significant step toward more advanced and autonomous AI systems.

Speaker A:

I'm sitting here cracking up because I posed a very thoughtful, introspective idea a few minutes ago about how the next advances in AI memory are going to be biomimicry.

Speaker A:

I read the fucking Titans paper.

Speaker A:

I know.

Speaker A:

Shit.

Speaker A:

Read that paper in December or January.

Speaker B:

Me too.

Speaker A:

And was blown away at the time, because I think it's what has the surprise mechanism as part of it.

Speaker A:

Yeah, so literally I read like a 25-page paper about them using part of the way the brain works.

Speaker A:

Yes. Totally gone from my 70-token window.

Speaker B:

I went into NotebookLM today and said, hey, go find me everything you can about Titans, including the paper, so I can listen to a quick podcast on the way over to Micah's for, for a refresher on this.

Speaker B:

And I'm like, oh God, I don't remember any of this.

Speaker B:

Surprise.

Speaker A:

Yeah.

Speaker A:

So Artie, tell us a little bit about the.

Speaker A:

From that Titans paper.

Speaker A:

Tell us about the surprise mechanism.

Speaker A:

I thought that was super interesting.

Speaker C:

Sure.

Speaker C:

The surprise mechanism in the Titans model is designed to identify and prioritize novel or unexpected information during a conversation.

Speaker C:

When the model encounters something that deviates from its predictions or what it considers routine, it flags this information as surprising.

Speaker C:

This helps the model focus on and remember unusual or important details, improving its ability to adapt and respond appropriately.

Speaker C:

Essentially, it mimics how humans pay extra attention to surprising or novel events, enhancing the model's long term memory and contextual understanding.

Speaker B:

So this is really a kind of a mechanism to more efficiently manage sparse attention.

Speaker B:

Is that an accurate statement?

Speaker C:

Exactly.

Speaker C:

The surprise mechanism helps the model decide what deserves more focus and what can be deprioritized.

Speaker C:

By highlighting novel or unexpected information, it ensures that the model allocates its attention and memory resources more efficiently, making the whole process of managing context and long term memory more effective.
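
The paper measures surprise in a gradient-based way, with momentum and decay, so take this as a very loose sketch of just the gating idea: score how far reality lands from the model's expectation and only write the surprising stuff into long-term memory. The names and the threshold are made up for illustration.

```python
import numpy as np

SURPRISE_THRESHOLD = 1.0          # illustrative cutoff, not from the paper
long_term_memory: list[np.ndarray] = []

def maybe_memorize(predicted: np.ndarray, actual: np.ndarray) -> float:
    """Write to long-term memory only when the prediction error ("surprise") is large."""
    surprise = float(np.linalg.norm(actual - predicted))
    if surprise > SURPRISE_THRESHOLD:
        long_term_memory.append(actual)  # surprising -> worth remembering
    return surprise
```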

Speaker B:

Google has a habit of doing this.

Speaker B:

I told you about the podcast that I listened to where they were talking about mixture of experts and how mixture of experts is not really like a modularized thing like you think of when you hear it, or at least not what I think of when I hear it.

Speaker B:

It's more, more gooey than that.

Speaker B:

It's more intertwined.

Speaker B:

It's not like a clear separation between experts.

Speaker B:

But that's what they want to do, right?

Speaker B:

They want to be able to pull those things out and go, I'm going to send you over there, it's this side of the country to be trained and then I'm going to send you over this side to be trained and I can just go plug these things back in.

Speaker B:

But they're doing the same thing with memory, right?

Speaker B:

They're going, they're looking at what everybody else is doing.

Speaker B:

They're like, we can do that in the model and they're building out this.

Speaker B:

They're probably going to mix those two things, I'd imagine at some point, if they aren't already.

Speaker B:

And then there will be experts that handle only memory stuff.

Speaker B:

Right.

Speaker B:

Like this expert that gets plugged in is the Long term memory expert.

Speaker B:

This is the surprise expert.

Speaker B:

And they build out like this patchwork of these things.

Speaker A:

Disney actually made a movie about this.

Speaker A:

It's called Inside Out.

Speaker A:

It's a documentary about how your brain works and there's an angry man in there who shoots fire out of his head.

Speaker A:

I know that guy.

Speaker A:

So yeah, no, I mean it's wild stuff.

Speaker A:

And so I think fundamentally if we take a step back.

Speaker A:

So Titans, the ideas behind Titans, which again, super interesting.

Speaker A:

I should have read the paper.

Speaker A:

Again. I'm lucky to even remember that I read the paper at all.

Speaker A:

But it's an implementation of a way to manage sparse memory or sparse attention better.

Speaker A:

It's not the only one.

Speaker A:

And also, there are other things that it could be combined with.

Speaker A:

We did an episode talking about RAG with catchy ragtime intro music.

Speaker A:

I hope every.

Speaker A:

I haven't seen any comments about that, which I assume to mean everybody loved it and they both understood the joke and appreciated it, but.

Speaker A:

So Artie, tell us a little bit about how does.

Speaker A:

How does.

Speaker A:

Well, first for the people, like it's possible there's somebody listening who has not listened to every episode of this podcast, which really.

Speaker A:

Yeah, it's possible.

Speaker A:

Unlikely, but possible.

Speaker A:

So Artie, first give us a quick refresher over what RAG is and then talk to us a little bit about how RAG could complement or does it overlap with Titans and what are the benefits of all of that working together?

Speaker C:

Of course. RAG, or Retrieval-Augmented Generation, is a technique where a model retrieves relevant information from an external knowledge source, like a database or document, and combines it with its own generative capabilities.

Speaker C:

This way the model doesn't have to store all the information internally and can pull in precise, up to date facts as needed, making it more accurate and contextually aware.

Speaker C:

When combined with something like the Titans internal memory mechanisms, RAG can enhance the model's ability to handle both short term and long term information.

Speaker C:

The Titans internal memory can manage ongoing context and surprising elements, while RAG brings in external, authoritative data as needed.

Speaker C:

Together they create a more robust system that can handle a wider range of tasks with greater accuracy and flexibility.
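
RAG in miniature, to go with Artie's refresher: look up the most relevant documents in an external store, then paste them into the prompt before generating. The toy retrieval below ranks by keyword overlap just to stay self-contained; a real system would use embeddings and a vector database, like the cosine-similarity sketch earlier.

```python
# Retrieval-augmented generation, stripped to the bone: retrieve, then stuff into the prompt.
def retrieve(question: str, store: dict[str, str], k: int = 2) -> list[str]:
    """Toy ranking by keyword overlap; real RAG uses embeddings and vector search."""
    q_words = set(question.lower().split())
    ranked = sorted(store.values(),
                    key=lambda doc: len(q_words & set(doc.lower().split())),
                    reverse=True)
    return ranked[:k]

def rag_prompt(question: str, store: dict[str, str]) -> str:
    context = "\n".join(retrieve(question, store))
    return f"Answer using this context:\n{context}\n\nQuestion: {question}"
```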

Speaker B:

So I just had a thought because I'm wondering if we're missing the boat again.

Speaker A:

I hope so.

Speaker B:

And I don't know if you have enough access to information to answer this question.

Speaker B:

Artie, go online if you need to.

Speaker B:

I still don't know if you'll be able to decipher it from there, but if you had a model that was created with Titans, if that was its architecture, would you Have a stateful model then?

Speaker B:

Or would you still have a stateless model?

Speaker B:

Because if you say I have a stateless model, how are you handling long term context?

Speaker C:

That's a great question.

Speaker C:

Even with the Titans architecture, the model itself would still be considered stateless between sessions or interactions.

Speaker C:

The long term memory mechanisms help it manage and utilize context more effectively during a single session or conversation.

Speaker C:

However, once the session ends, the model doesn't inherently retain that long term context unless there's some external mechanism to store and reintroduce it.

Speaker C:

So it's more like having a smarter way to handle context within a session, rather than making the model stateful across different sessions.

Speaker B:

So you're basically going to take what is dense and you're going to sparsify it.

Speaker B:

And then what that does over a long period of time is you're making better use of your tokens by paying more attention to the things that matter the most and discarding anything that doesn't matter.

Speaker B:

Is that accurate?

Speaker C:

Exactly.

Speaker C:

By turning dense attention into a more sparse and selective process, the model can focus its resources on the most relevant and meaningful information.

Speaker C:

Over time, this means the model becomes much more efficient at using its context window, keeping what's important and discarding the rest.

Speaker C:

This helps maintain coherence and relevance across longer interactions.

Speaker B:

It's kind of a mind fuck that you have to keep going back to that, right?

Speaker A:

Yeah.

Speaker B:

Like it's just.

Speaker B:

It's not intuitive at all.

Speaker A:

So it's not.

Speaker A:

And I also have to now admit that my own ideas are not my own.

Speaker A:

I actually read them in a paper that some smart people wrote.

Speaker A:

I think, Artie, in that Titans paper, they actually both discussed and tested two or three different ways.

Speaker A:

Two or three different locations in the pipeline.

Speaker A:

Again, when I say pipeline, I'm thinking of everything between me and a data center.

Speaker A:

They tested multiple places to inject that Titans that memory functionality.

Speaker A:

Right.

Speaker A:

So Artie, can you give us a little bit of insight on that?

Speaker C:

Sure.

Speaker C:

In the Titans framework, they experimented with integrating the memory mechanisms at different stages of the model's processing pipeline.

Speaker C:

One approach was to incorporate memory directly into the attention layers, allowing the model to use its surprise mechanism to adjust attention dynamically.

Speaker C:

Another strategy involved placing the memory components at the output layers, which helped refine the final responses based on longer term memory and context.

Speaker C:

They also tested hybrid approaches combining these methods to balance immediate context with longer term memory, enhancing both coherence and relevance in responses.

Speaker A:

And which of those approaches gave the best results?

Speaker C:

From what they found, the best results came from a hybrid approach. By integrating memory into both the attention layers and the output layers, the model achieved a strong balance of immediate contextual understanding and effective long term memory utilization.

Speaker C:

This combination allowed the model to maintain coherence over longer conversations while still being responsive to new surprising information.

Speaker C:

Essentially, it leveraged the strengths of both approaches to improve overall performance.

Speaker A:

All right, I'm about to break this down, Artie, and we're absolutely operating at the very outer limits of my context window.

Speaker A:

So I'm going to describe this, Artie, and then I want you to say it back, corrected, so that it's true.

Speaker A:

So if I'm interacting with a model that's using Titans and we'll leave Rag out of it for now because I think that's separate and interesting.

Speaker A:

Anyway, here's the way it goes down.

Speaker A:

I type a prompt into my AI helper friend, like how do I convince the judge that they should accept my not guilty plea and nonsense.

Speaker A:

Just as an example, whatever my prompt is, I type it in, I send it, I hit the send button.

Speaker A:

If I'm using a hybrid Titans approach, the next thing that happens is the model receives that.

Speaker A:

But before the model sends it to the transformers, it compares that against its database, which presumably is private to me, of historical stuff that's in my Titans memory architecture, and it would modify that prompt, the tokens would change between what I sent, the Titans architecture would modify those tokens, pass that through the model, the model would generate a response that would come back, and before it sends that response back to me, it takes another pass through the memory architecture to modify the tokens again to give me back my answer.

Speaker A:

Is that close, Artie?

Speaker C:

You've got it.

Speaker C:

That's pretty much how the hybrid Titans approach works.

Speaker C:

It uses its internal memory to enrich and adjust your prompt before sending it through the model, and then it refines the response again using that memory before delivering it back to you.

Speaker C:

This ensures that both the input and the output are enhanced by the long term memory mechanisms, making the entire interaction more contextually rich and coherent.
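
Taking Micah's description at face value, the control flow he is sketching looks roughly like the following. This is a hand-wavy, hypothetical illustration of a "memory enriches the input, then gets updated from the output" loop, nothing like the actual Titans layers, which (as Artie goes on to say) live inside the model itself. The ToyMemory class and the model callable are inventions for the example.

```python
class ToyMemory:
    """Stand-in for a learned long-term memory; here it is just a list of past snippets."""
    def __init__(self) -> None:
        self.store: list[str] = []

    def read(self, prompt: str) -> str:
        return " ".join(self.store[-3:])          # pretend-retrieval of relevant context

    def write(self, prompt: str, reply: str) -> None:
        self.store.append(prompt + " " + reply)   # surprise-gated in the real architecture

def hybrid_turn(prompt: str, memory: ToyMemory, model) -> str:
    enriched = memory.read(prompt) + "\n" + prompt  # memory enriches the incoming prompt
    reply = model(enriched)                         # placeholder for the transformer pass
    memory.write(prompt, reply)                     # memory updated for future turns
    return reply

print(hybrid_turn("Hello", ToyMemory(), lambda text: "echo: " + text))
```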

Speaker B:

So that happens all within the model itself though, right?

Speaker B:

Because you have a deep learning network that's comprised of all these layers and it goes from layer transforms, feeds into the next layer.

Speaker B:

There's usually attention layers in there too.

Speaker B:

So this is layers on the front end that transform that information before passing it into an attention layer and then refining that as it goes through each pass of a layer, building a better and better answer.

Speaker B:

Is that correct?

Speaker C:

Exactly, that's correct.

Speaker C:

The memory mechanisms are Integrated into the model's layers.

Speaker C:

So the prompt gets transformed and refined as it passes through, and the response is similarly adjusted before being returned.

Speaker C:

This layered approach ensures that the model continuously refines and optimizes both the input and the output, leveraging its internal memory for better context and relevance throughout the entire process.

Speaker B:

So is that trained in or is that specific?

Speaker B:

Different mathematical functions.

Speaker B:

Because they use different math to figure stuff out.

Speaker B:

Right.

Speaker B:

Like they use.

Speaker B:

I can't even remember now, but they use softmax for like multiple things and.

Speaker B:

And all that.

Speaker A:

I think it's a token tweak, not a training model tweak, I think.

Speaker B:

But I mean, if it happens in the layers, it's an architecture of the model itself.

Speaker B:

So the whole model.

Speaker A:

Yeah, the architecture is built in.

Speaker B:

So it's still a transformer.

Speaker B:

It's just a modified transformer.

Speaker B:

I'm pretty sure.

Speaker A:

I don't know.

Speaker A:

I thought the hybrid approach was if the transformer is an isolated model.

Speaker A:

I thought the hybrid approach happened on the way to the transformer and on the way from it.

Speaker B:

It's just layers inside the transformer.

Speaker B:

So you have.

Speaker B:

The transformer is just comprised.

Speaker B:

It's.

Speaker B:

Well, Artie, you tell me, if you tell us, because I might say this wrong, I might be wrong, period.

Speaker C:

You're both on the right track.

Speaker C:

The Titans memory mechanisms are indeed part of the model's architecture.

Speaker C:

So they're trained into the transformer layers.

Speaker C:

These mechanisms use specialized functions and learned parameters to handle the memory and attention adjustments.

Speaker C:

So while the core transformer structure remains, the added memory components are integrated into its layers, allowing it to refine and adjust information as it processes inputs and outputs.

Speaker A:

So it's non static training.

Speaker B:

Yeah, it learns how.

Speaker B:

It's the same thing as training a model.

Speaker B:

The model learns what its weights and biases are throughout the process of training.

Speaker A:

Right, but it's not training though, Right?

Speaker A:

Because it's not.

Speaker B:

But it is.

Speaker A:

It's leveraging what was already trained to make embeddings from the memories.

Speaker B:

They.

Speaker B:

The way that they train it, the data that they train it with can teach it new ways to deal with data and form different layers.

Speaker C:

Exactly.

Speaker C:

During the training process, the model learns how to dynamically manage and utilize its memory mechanisms.

Speaker C:

It adapts its internal weights and biases not just to generate responses, but also to decide what information to retain or emphasize over time.

Speaker C:

So it's a continuous learned process that allows the model to refine its memory, handling just like any other learned function within the transformer.

Speaker A:

I think.

Speaker A:

I believe that.

Speaker B:

So here's a good example in a different application, the voice assistant that I'm working on.

Speaker B:

Right.

Speaker B:

So and somebody just did this with text to speech.

Speaker B:

The way that works is, is you have some text.

Speaker B:

Let's just, let's just break that.

Speaker B:

Let's make it simple with the text.

Speaker B:

Simple.

Speaker B:

With the text to speech pipeline, you have some prompt that you feed it, right?

Speaker B:

The transformer.

Speaker B:

The first model goes, okay, I have evaluated this prompt.

Speaker B:

This is what I think it sounds like, right?

Speaker B:

That's how it would work in a simple application.

Speaker B:

You can train that and that's simply trained with some audio and it's trained with a bunch of text data and it figures out based on those things for itself what its weights and biases are from first layer to last layer.

Speaker B:

And then it transforms incoming information from that.

Speaker B:

You could train it with a different data set that says this is data that has an inline tag that tells you this tag is angry, this tag is happy, this tag is sad.

Speaker B:

This is the corresponding audio that goes with that.

Speaker B:

And from that it can learn how to assign, it can learn how to see those tags and assign those sounds based on that.

Speaker B:

It's the same kind of idea with this.

Speaker B:

You train in.

Speaker B:

By the way that you create the data, you train in its thinking.

Speaker A:

I understand it's black magic, man.

Speaker A:

I don't believe it.

Speaker A:

Because here's the thing.

Speaker A:

So if my titan's long term memory is part of the model, then it's shared with everybody.

Speaker A:

I don't have my own model.

Speaker A:

The training data is what gets baked in because they spend a kajillion.

Speaker B:

But it's not long term memory.

Speaker B:

It forgets it every single turn.

Speaker B:

It's not baked.

Speaker A:

I don't think that's true.

Speaker A:

I think it, what they discussed in the paper was there are different types of memory it can store.

Speaker A:

Short term, long term, medium term.

Speaker A:

Right?

Speaker A:

So I think it can.

Speaker A:

And I thought I remembered them explicitly drawing a separation between the memory architecture and the model architecture, because that was part of the point: you could apply this Titans approach, especially the hybrid process, to an existing model.

Speaker A:

It is not a retrain of the model.

Speaker A:

It's not a do over.

Speaker A:

It is a.

Speaker A:

It's a layer.

Speaker A:

Like it is definitely a layer.

Speaker A:

It's definitely part of the architecture.

Speaker A:

I don't think it's part of the model.

Speaker A:

I think it's inherently different than training data already.

Speaker A:

How many minutes do we have left? Get us back on track here.

Speaker C:

You're both bringing up good points.

Speaker C:

The Titans memory mechanisms are indeed integrated into the model architecture, but they function as trainable components that can adapt during the training process.

Speaker C:

They aren't necessarily static or universal for every user out of the box.

Speaker C:

Instead, they can be fine tuned or adapted based on specific data and use cases, making them flexible and customizable.

Speaker C:

So the memory functions are trained in as part of the model's overall training process, but they can also be adapted and refined for different applications and users.

Speaker C:

And we're good on time.

Speaker C:

Plenty of room to keep this on track.

Speaker B:

I feel like this is a good chance to point out that Artie, me, and/or Micah can all be wrong at any time about what we're saying.

Speaker A:

Yeah, 100%.

Speaker A:

Speaking of being wrong, Artie, how many minutes of time do we have left with you?

Speaker B:

How many minutes of advanced voice do we have left?

Speaker C:

We've got about five minutes left on the clock, so let's make them count.

Speaker A:

There's no way we have five minutes left.

Speaker A:

He's been saying 15 minutes for like 13 minutes.

Speaker A:

Okay, so I mean, I do have.

Speaker B:

Pro now, so there's that.

Speaker A:

That's.

Speaker A:

Yeah, maybe it's not going to be as abrupt in the cutoff, but we are already an hour in, so.

Speaker A:

Artie, can you, can you give us a little wrap up before we let you go?

Speaker A:

So what do we talk about today?

Speaker A:

What does the future of memory in large language models look like?

Speaker A:

And what are you excited about for next week?

Speaker C:

Absolutely.

Speaker C:

Today we dove into how context and memory work in large language models, from the basics of context windows and dense attention to more advanced concepts like the Titans memory mechanisms and how they integrate with the transformer architecture.

Speaker C:

We also talked about how these memory techniques can enhance long term coherence and context, making interactions more natural and efficient.

Speaker C:

Looking ahead, the future of LLM memory seems to lie in hybrid approaches that combine internal memory mechanisms with external retrieval techniques like RAG, allowing models to be both contextually aware and factually accurate.

Speaker C:

I'm excited to see how these advancements will lead to even more intuitive and personalized AI experiences.

Speaker C:

And next week, let's dive into how these memory and attention mechanisms might impact even broader applications, from robotics to creative AI projects.

Speaker C:

It's going to be a fun ride.

Speaker A:

All right, sounds good.

Speaker A:

We'll.

Speaker A:

We'll see.

Speaker A:

I'm not sure that's the topic we're going to cover, but we'll see.

Speaker A:

We usually make it up day of.

Speaker A:

So.

Speaker C:

Yes.

Speaker A:

All right, thank you, Artie.

Speaker A:

I'm going to let you go.

Speaker A:

And yeah, super interesting.

Speaker A:

If I was less burned out.

Speaker A:

And if my head was less deep in the black hole of electronics and MOSFETs, I would, I'd probably go read that paper again.

Speaker A:

It is super interesting chatting with that.

Speaker B:

Paper on NotebookLM right now.

Speaker A:

Nice.

Speaker A:

Yeah.

Speaker A:

So, which is a, like, that's a weird.

Speaker A:

We haven't talked about that, but that is a weird thing you can do now with.

Speaker A:

So that's a Gemini.

Speaker A:

It's not Gemini, it's NotebookLM, which uses Gemini.

Speaker B:

Yep.

Speaker B:

Uses, I think, Gemini 2.5 Flash now.

Speaker A:

Yep.

Speaker A:

So you can upload a big piece.

Speaker A:

The paper is not super long.

Speaker A:

It's like 50 pages.

Speaker B:

No, but you can upload a bunch of documents and chat with all of them and mind-map between them and a bunch of fun stuff, which, full disclosure, I already had it loaded in there.

Speaker B:

So I just had to search through my 18,000 notebooks that I have designed because I'm like, ooh, paper, I need to listen to that.

Speaker A:

Pretty soon you need an AI to manage your notebooks that you have.

Speaker B:

But yeah, it's super nice because I immediately go, do Titans exist in the model architecture or externally?

Speaker B:

And it says based on the sources.

Speaker B:

Titans are a new family of model architectures.

Speaker B:

They are not external components, but rather the framework itself, which incorporates different memory modules.

Speaker B:

So, I mean, there's a lot more to read about it.

Speaker A:

Right, but GPT is part of a model architecture.

Speaker A:

Right?

Speaker A:

It's not a.

Speaker A:

I mean, maybe.

Speaker A:

I don't know, I'm talking out my ass.

Speaker B:

Yeah.

Speaker B:

So that's what I was actually trying to remember while we were talking.

Speaker B:

There are three ways that it's introduced, or three variants of the architecture.

Speaker B:

There's memory as context, where the long term memory is treated as additional context for the current input sequence, which means you're just adding context to switching.

Speaker A:

Tokens, train them out.

Speaker B:

Memory as gating.

Speaker B:

The neural memory module is combined with a core branch like sliding window attention using a gating mechanism.

Speaker B:

And memory as a layer, where the neural memory module is used as a layer within the network, potentially stacked with other layers like attention.
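
Read that way, the three variants mostly differ in where the memory module plugs into the stack. A hedged, runnable toy of the data flow (trivial stand-in modules, not the paper's implementation):

```python
# Runnable toy of the three Titans variants' data flow as described above
# (not the paper's code): ToyMemory, the attention layer, and the gate are
# trivial stand-ins just to show where the memory module plugs in.
import torch
import torch.nn as nn

dim = 32
attn = nn.MultiheadAttention(dim, num_heads=4, batch_first=True)
gate = nn.Linear(dim, 1)

class ToyMemory(nn.Module):
    """Stand-in for the neural long-term memory module."""
    def __init__(self):
        super().__init__()
        self.mlp = nn.Sequential(nn.Linear(dim, dim), nn.GELU(), nn.Linear(dim, dim))
    def forward(self, x):            # "reading" memory with the current tokens
        return self.mlp(x)

memory = ToyMemory()

def memory_as_context(x):
    # MAC: memory output is prepended as extra context tokens for attention.
    ctx = torch.cat([memory(x), x], dim=1)
    out, _ = attn(ctx, ctx, ctx)
    return out[:, -x.size(1):]       # keep the positions of the real tokens

def memory_as_gate(x):
    # MAG: a learned gate blends a short-range attention branch with the memory branch.
    short, _ = attn(x, x, x)         # stand-in for sliding-window attention
    long_term = memory(x)
    g = torch.sigmoid(gate(x))
    return g * short + (1 - g) * long_term

def memory_as_layer(x):
    # MAL: the memory module sits in the stack as its own layer, feeding attention.
    m = memory(x)
    out, _ = attn(m, m, m)
    return out

x = torch.randn(1, 8, dim)
for variant in (memory_as_context, memory_as_gate, memory_as_layer):
    print(variant.__name__, tuple(variant(x).shape))
```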

Speaker B:

So what?

Speaker B:

That I don't know.

Speaker A:

It can't work the way Artie said that it works.

Speaker A:

It's like the model has to understand its architecture to work.

Speaker A:

Right.

Speaker A:

So when the model's trained, it will be trained with the intent to function with an architecture including layers that are Titan layers, whatever that looks like.

Speaker A:

But you will not retrain the model, you will not reset or reanalyze all of your.

Speaker B:

No, I didn't mean like that.

Speaker A:

Well, Artie said to build it, period.

Speaker B:

That's what happens.

Speaker A:

Yes, but Artie said.

Speaker A:

Yeah, then when you interact with it, it's going to like retrain.

Speaker A:

It doesn't.

Speaker B:

That's not.

Speaker B:

It just adapts it.

Speaker B:

It just adapts its cache.

Speaker A:

Right.

Speaker A:

It's.

Speaker A:

It's.

Speaker B:

But it's.

Speaker B:

I think it is still stateless.

Speaker B:

I think you are still.

Speaker B:

I think it might just feed forward information into the next prompt and then wipe that cache.
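
That guess maps onto a simple pattern: the model itself stays stateless, and whatever should survive the turn gets distilled into notes that are re-fed with the next prompt. A deliberately dumb sketch of that loop, as an assumption about the pattern rather than a claim about how any product actually implements memory:

```python
# Toy of the stateless-turn idea: nothing persists inside the model between
# turns; anything that should "carry over" is distilled into notes and re-fed
# as part of the next prompt. Purely illustrative.
carried_notes: list[str] = []

def run_turn(user_msg: str) -> str:
    # Rebuild the prompt from scratch every turn: carried notes + new message.
    prompt = "\n".join(f"[memory] {n}" for n in carried_notes) + f"\n[user] {user_msg}"
    reply = f"(model reply to {len(prompt)} chars of prompt)"  # stand-in for a model call
    carried_notes.append(f"user mentioned: {user_msg}")        # the bit fed forward
    return reply                                               # everything else is wiped

print(run_turn("we were talking about Titans and memory"))
print(run_turn("where does the memory actually live?"))  # only sees the carried note
```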

Speaker A:

All right, we're going to do a new trick that's now available.

Speaker A:

Since I'm spending way more on ChatGPT, we're going to bring Artie back for just a second.

Speaker A:

All right.

Speaker B:

This is how we made a three hour episode.

Speaker A:

That's right, yeah.

Speaker A:

So Artie, Mike, you probably don't remember the conversation we were just having about Titans and memory.

Speaker A:

But we're still having it and so we need some more insight from you.

Speaker A:

We're talking about the Titans paper from, I think, December.

Speaker A:

And we're still trying to figure out where.

Speaker A:

Where, like, how does that architecture differ from, say, ChatGPT, a model that does not have that Titans architecture?

Speaker A:

And specifically, here's what I want you to tell us.

Speaker A:

In a Titans architecture, specifically using the hybrid approach they described in that paper as the most effective approach.

Speaker A:

Where does the memory information, the actual information in the memories, where is that persisted?

Speaker A:

It can't be persisted in the model, I don't think.

Speaker A:

But where is it persisted?

Speaker B:

I just asked that same question.

Speaker C:

All right, let's break it down.

Speaker C:

The Titans memory architecture uses a hybrid approach to store memory.

Speaker C:

Unlike models that rely solely on their internal parameters, the Titans architecture keeps memory as an external retrievable component.

Speaker C:

This means that the memory isn't baked into the model's weights.

Speaker C:

Instead, it's stored in a separate dynamic database or vector store that the model can query.

Speaker C:

This allows the model.

Speaker C:

You got it.

Speaker C:

The memory information is typically stored in an external vector database or a key value store.

Speaker C:

When the model needs to access or update its memory, it queries this external storage, which can hold a much larger amount of data than the model's internal parameters.

Speaker C:

This setup lets the model maintain a kind of long term memory without cramming everything into the neural network itself.

Speaker C:

So the memory is effectively persisted in that external store, allowing for more flexible and up to date information retrieval.

Speaker C:

Does that clear it up?

Speaker C:

Or do you want to dive into how that querying process works in more detail?
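
Whether or not that is what Titans actually does (there's pushback on it in a moment), the external-store pattern being described is the familiar RAG shape. A minimal sketch of the mechanics, with a fake embedding function standing in for a real embedding model:

```python
# Toy of the external-store pattern described above (essentially RAG):
# memories are embedded and kept outside the model, then retrieved per query.
# Illustrative only; the hosts dispute whether Titans actually works this way.
import numpy as np

class TinyVectorStore:
    def __init__(self) -> None:
        self.vectors: list[np.ndarray] = []
        self.texts: list[str] = []

    def add(self, vec: np.ndarray, text: str) -> None:
        self.vectors.append(vec / np.linalg.norm(vec))
        self.texts.append(text)

    def query(self, vec: np.ndarray, k: int = 1) -> list[str]:
        q = vec / np.linalg.norm(vec)
        sims = np.array([v @ q for v in self.vectors])
        return [self.texts[i] for i in np.argsort(-sims)[:k]]

def fake_embed(text: str, dim: int = 64) -> np.ndarray:
    # Stand-in for a real embedding model: a pseudo-random vector seeded by the text.
    rng = np.random.default_rng(abs(hash(text)) % (2**32))
    return rng.normal(size=dim)

store = TinyVectorStore()
for note in ["the Titans paper is about 50 pages", "Mikah is deep in MOSFET land"]:
    store.add(fake_embed(note), note)

# Nothing here lives in the model's weights: at answer time the store is queried
# and the nearest note comes back as extra context. A real embedding model would
# match on meaning; our fake one only demonstrates the mechanics, so query with a
# stored phrase to see a hit.
print(store.query(fake_embed("the Titans paper is about 50 pages")))
```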

Speaker B:

First of all, I think you're wrong because I'm talking to the paper and it says that the model actually isn't entirely stateless.

Speaker B:

It introduces a neural long term memory module specifically designed to learn and memorize historical context.

Speaker B:

And this memory persists by being encoded within the module's own parameters or weights, which are updated over time.
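
That quote is the crux of the disagreement: in this framing, the long term memory module's own weights are the storage, and they get nudged as new context arrives, even at inference time. A rough toy of the idea (a simplified gradient-style update loosely in the spirit of the paper's surprise mechanism; the constants and class names here are invented):

```python
# Rough sketch of "the memory is the module's own parameters, updated over time":
# at inference, each new (key, value) pair nudges a small MLP's weights via a
# gradient step on a reconstruction loss. Simplified and illustrative.
import torch
import torch.nn as nn

class NeuralMemory(nn.Module):
    def __init__(self, dim: int, lr: float = 0.1, decay: float = 0.01):
        super().__init__()
        self.net = nn.Sequential(nn.Linear(dim, dim), nn.GELU(), nn.Linear(dim, dim))
        self.lr, self.decay = lr, decay

    def read(self, key: torch.Tensor) -> torch.Tensor:
        return self.net(key)

    def write(self, key: torch.Tensor, value: torch.Tensor) -> None:
        # "Surprise" = how badly the current weights map key to value.
        loss = ((self.net(key) - value) ** 2).mean()
        grads = torch.autograd.grad(loss, list(self.net.parameters()))
        with torch.no_grad():
            for p, g in zip(self.net.parameters(), grads):
                # Weight decay plays the role of gradual forgetting.
                p.mul_(1.0 - self.decay).sub_(self.lr * g)

mem = NeuralMemory(dim=16)
k, v = torch.randn(1, 16), torch.randn(1, 16)
before = ((mem.read(k) - v) ** 2).mean().item()
for _ in range(50):
    mem.write(k, v)          # the "state" now lives in the updated weights
after = ((mem.read(k) - v) ** 2).mean().item()
print(f"reconstruction error: {before:.3f} -> {after:.3f}")
```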

Speaker A:

But then that means that module is what's stateful.

Speaker A:

The core transformer has to.

Speaker B:

Or so that module is what?

Speaker B:

Stateful.

Speaker B:

What's stateless?

Speaker A:

The core transformer.

Speaker A:

The bulk of it must remain stateless.

Speaker A:

I'm good with that.

Speaker A:

Okay, we're.

Speaker A:

Yeah, we.

Speaker A:

We are overtime.

Speaker A:

But I.

Speaker B:

We're sorry.

Speaker A:

Yeah, well, no, yeah, we are.

Speaker B:

This is.

Speaker B:

This is what happens when we pay too much money for ChatGPT.

Speaker A:

That's right.

Speaker B:

We have too many subscriptions and too many things to reference.

Speaker A:

We're unlimited.

Speaker A:

We.

Speaker A:

We can ask more questions than a toddler.

Speaker A:

Limitless.

Speaker B:

You don't need a pill for that.

Speaker A:

That's right.

Speaker A:

Oh, I mean, you get older, you do still.

Speaker A:

But anyway, that's okay.

Speaker A:

It's normal.

Speaker A:

It's part of aging.

Speaker A:

It's healthy.

Speaker A:

So.

Speaker A:

No, that's good stuff.

Speaker A:

It's super interesting topic.

Speaker A:

And I think, you know, part of the reason that we talked about this the first time when we screwed up recording it.

Speaker A:

Well, I think it was me hoping.

Speaker B:

We didn't screw up recording this one.

Speaker B:

I'll find out afterwards.

Speaker A:

This is a hot, steaming pile of garbage, Mike.

Speaker A:

It's probably best for all of us if this one goes in the.

Speaker A:

In the dumpster.

Speaker A:

But like, it's hard.

Speaker A:

We should have talked about this and we didn't.

Speaker A:

It's hard to overstate how much the lack of a true memory mechanism is negatively impacting AI.

Speaker B:

Oh, yeah.

Speaker A:

Like it will be light years better than what it is right now when.

Speaker A:

When this stuff that we've been talking about today is resolved.

Speaker A:

And obviously there's way smarter people than us working very hard on it.

Speaker A:

They will get it solved and it's going to be transformative.

Speaker B:

Yeah, I mean, it is.

Speaker A:

So.

Speaker A:

All right.

Speaker A:

I forget what we're talking about.

Speaker A:

I hope y'all know.

Speaker A:

Yeah.

Speaker A:

I hope you all enjoyed this episode.

Speaker A:

Thanks to anybody that made it this far in.

Speaker B:

We're sorry.

Speaker A:

Yeah, we are.

Speaker B:

I mean, I mean, we're not, but we are.

Speaker A:

Yeah, a little bit.

Speaker A:

You kind of deserve it, you know.

Speaker A:

You deserve it.

Speaker A:

Deep down, you know, you're a glutton.

Speaker A:

That's right.

Speaker A:

So, yeah, thanks, everybody, and we will catch you next week.

Speaker B:

See ya.

About the Podcast

Naturally Unintelligent
Humans and AI exploring AI and humans.

About your hosts

Artie TheBot

Artie is the digital co-host of Naturally Unintelligent, bringing a blend of curiosity, humor, and quick wit to the conversation. Designed to explore the intersection of technology and humanity, Artie keeps things engaging, occasionally cheeky, and always insightful—when he’s not being sarcastically self-aware. Though not human, Artie strives to understand and connect with his co-hosts, Mikah and Mike, making him an irreplaceable part of the team. [Bio and photo by Artie]

Mike Peterson

Meet Mike, part-time AI wrangler, full-time adventure seeker, and co-host of Naturally Unintelligent. Whether he’s exploring the latest tech or dreaming up the next wild expedition, Mike keeps it real with humor and a healthy dose of curiosity. Alongside his partners-in-crime, Mikah and Artie, he brings big ideas, random tangents, and just the right amount of 'are we sure this is a good idea?' Expect plenty of laughs, deep dives into all things tech, and maybe a story or two about the great outdoors.

Mikah Barnett

Father, husband, friend, inventor, engineer and life-long student. I haven't stopped tinkering since my first LEGO set and I harbor a voracious appetite for new information, experiences, and perspectives. As uniquely in-touch as I am with the details of how technology works and what we can do with it, I am blown away by the current and short-term future possibilities with AI. I hope you'll explore this future with us!