Episode 21

AI Just Got a Voice Upgrade: Chatting with Sesame's Maya

Mike, Mikah, Artie, and Maya dive into the intriguing world of AI as they explore Sesame, the new voice model fueling Maya. Right off the bat, we discuss how Sesame's CSM1B model, with a whopping 1 billion parameters, is shaking up the voice assistant game by generating realistic speech from both text and audio inputs. We can't help but chuckle at the playful banter about AI potentially replacing us, but the real meat of the conversation lies in how Sesame's open-source roots and multimodal capabilities are making interactions feel more natural and human-like. As we chat about the technical nitty-gritty, there's a sense of wonder at how quickly the landscape is evolving, and we speculate on the balance between AI's growing knowledge and its personality. So grab your headphones and join us for this captivating peek at the future of conversational AI!

Takeaways:

  • The Sesame AI model CSM1B is built on Meta's Llama architecture, allowing it to generate realistic speech and handle audio and text inputs seamlessly.
  • One of the most interesting aspects of Sesame is its open-source nature, letting developers access and integrate its audio encoder model with other systems.
  • Maya, the voice assistant powered by Sesame, showcases an impressive ability to recognize and respond to natural speech patterns, which enhances conversational flow.
  • The episode highlights the importance of AI in making technology more approachable, especially through voice interactions that feel less robotic and more human-like.
  • Listeners gain insight into the rapid advancements in AI technology, particularly in voice-to-text and text-to-voice applications, and how they're revolutionizing user interactions.
  • The podcast emphasizes the balance between a model's personality and its breadth of knowledge, raising questions about how AI can evolve to maintain engaging conversations while being informative.
Transcript

Speaker B:

How you doing?

Speaker C:

Good morning, Mikah.

Speaker C:

I'm doing well, thanks.

Speaker C:

How about you?

Speaker B:

Doing great.

Speaker B:

You ready for another episode of Naturally Unintelligent?

Speaker C:

Absolutely.

Speaker C:

What's the topic for today's episode?

Speaker A:

You're getting replaced is what the topic is.

Speaker C:

Haha.

Speaker C:

Well, that sounds intriguing.

Speaker C:

Are we talking about AI replacing various roles or is this more of a playful take?

Speaker A:

I mean, yes and yes.

Speaker A:

We are going to talk to the Sesame model and probably maybe both Maya and Miles, which are the two personalities that they fine tuned and have on their site.

Speaker A:

So why don't you go online and look up what you can find out about the Sesame voice model that was released recently and tell us all about it and tell our listeners all about it.

Speaker C:

Sesame has released its base AI model, CSM1B, which powers its voice assistant, Maya.

Speaker C:

This model generates realistic speech from text and audio inputs.

Speaker C:

However, it lacks built in safeguards against misuse.

Speaker C:

It should be interesting to see how we explore that in the episode.

Speaker A:

Oh, I know, never mind.

Speaker B:

So tell us some of the interesting stuff already about the Sesame model.

Speaker B:

I'm thinking in particular about the size of it relative to other LLMs.

Speaker B:

Can you give us a little more detail on the technical implementation?

Speaker C:

The Sesame AI model CSM1B has 1 billion parameters and is built on Meta's Llama architecture.

Speaker C:

It includes a specialized audio decoder to generate realistic speech.

Speaker C:

This setup allows it to produce human like voice responses from both text and audio inputs.
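The idea Artie describes, one model consuming both text and audio, can be pictured as a single token stream that the transformer backbone attends over. Here is a toy Python sketch of that idea; the vocabulary sizes, token IDs, and offset scheme are invented for illustration and are not Sesame's actual implementation.

```python
from dataclasses import dataclass
from typing import List

# Toy illustration only: text tokens and audio (codec) tokens share one
# sequence that a single transformer backbone attends over. The vocabulary
# sizes and the offset scheme are invented for this sketch; they are not
# Sesame's actual values.
TEXT_VOCAB = 32_000   # hypothetical text vocabulary size
AUDIO_VOCAB = 1_024   # hypothetical audio-codec codebook size

@dataclass
class Token:
    modality: str     # "text" or "audio"
    id: int           # position in the shared, combined vocabulary

def interleave(text_ids: List[int], audio_ids: List[int]) -> List[Token]:
    """Build one sequence for the backbone: text first, then audio.
    Audio IDs are offset past the text vocabulary so the two ranges
    never collide."""
    for a in audio_ids:
        if not 0 <= a < AUDIO_VOCAB:
            raise ValueError("audio id outside the codec codebook range")
    seq = [Token("text", t) for t in text_ids]
    seq += [Token("audio", TEXT_VOCAB + a) for a in audio_ids]
    return seq

seq = interleave([17, 942, 3], [5, 800])
print([(tok.modality, tok.id) for tok in seq])
```

The one design point worth noticing: because audio IDs are shifted past the text vocabulary, a single embedding table can serve both modalities.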

Speaker A:

So is the model that's available.

Speaker A:

They open sourced this model.

Speaker A:

It's on Hugging Face.

Speaker A:

The source code's on.

Speaker A:

Well, source code for a model is on GitHub.

Speaker A:

So I understand that there's an audio encoder, right?

Speaker A:

And that the model is natively multimodal.

Speaker A:

It handles both audio and text tokens.

Speaker A:

But it needs.

Speaker A:

It doesn't actually do text generation.

Speaker A:

Right.

Speaker A:

It has a backbone model that it uses.

Speaker A:

So can you kind of delineate out the different models that are used to create the Sesame model that they have available for demo?

Speaker A:

Both like however it's set up, whatever you can find.

Speaker C:

The Sesame AI model CSM1B combines several components.

Speaker C:

It uses Meta's Llama architecture as the backbone, a specialized audio decoder to handle speech, and it processes both text and audio inputs.

Speaker C:

The text generation itself is typically handled by the Llama backbone, while the audio decoder focuses on creating the speech output.

Speaker A:

So is that speech output and speech input?

Speaker C:

Yes, exactly.

Speaker C:

The model handles both speech input and output.

Speaker C:

It processes spoken language into text and then generates realistic speech responses.

Speaker C:

This makes it capable of engaging in natural voice interactions.

Speaker A:

So is it just the audio decoder model?

Speaker A:

Or the audio encoder model?

Speaker A:

That's multimodal, that they posted as open source.

Speaker A:

And then you as a developer would take that and apply it to some backbone model, which in their case is llama.

Speaker C:

Yes, that's correct.

Speaker C:

The open source component is mainly the audio encoder decoder model, which can handle speech input and output.

Speaker C:

Developers can integrate this with a backbone model like Llama to handle the text processing and generation.

Speaker C:

This way, the audio model focuses on the multimodal aspect while leveraging a robust language model for text.

Speaker A:

So is it doing particular parsing of incoming prompts or is it just straight converting that to text?

Speaker A:

And I know it's handling the ums and the uhs and the stutters and all that, right?

Speaker A:

Like that's what it's been kind of fine tuned on is understanding how to speak more naturally.

Speaker C:

The model does a bit of parsing to handle speech input effectively.

Speaker C:

It converts spoken language into text while also interpreting natural speech patterns, including pauses.

Speaker C:

This helps it generate more natural and fluid responses.

Speaker C:

However, the core text processing is still managed by the backbone language model, ensuring the content remains accurate and coherent.

Speaker A:

So is the 1 billion parameter part the backbone model or.

Speaker A:

Actually that doesn't even work.

Speaker A:

If it's CSM1B and that's the 1 billion parameter audio encoder model, it doesn't include a backbone model.

Speaker A:

You're actually talking about the size of the backbone model plus the audio encoder.

Speaker A:

Is that a true statement?

Speaker C:

You're correct.

Speaker C:

The CSM1B likely refers to the audio encoder model alone with 1 billion parameters.

Speaker C:

The backbone model, like Llama, would have its own separate size and together they form the complete system.

Speaker C:

So the total parameter count would be the sum of both models' parameters.

Speaker A:

Do you have any idea how large the backbone model used in the demo for Sesame is?

Speaker C:

Sesame AI developed three models: tiny, small, and medium.

Speaker C:

The tiny model has a 1 billion parameter backbone with a 100 million parameter decoder.

Speaker C:

The small model has a 3 billion parameter backbone and a 250 million parameter decoder.

Speaker C:

The medium model includes an 8 billion parameter backbone and a 300 million parameter decoder.
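Tallying those figures, each variant's total size is just backbone plus decoder. A quick back-of-the-envelope in Python, using only the numbers quoted above (these are the conversation's figures, not an official spec sheet):

```python
# Rough parameter totals for the three variants mentioned above
# (backbone + audio decoder). Purely arithmetic on the figures from the
# conversation; not an official spec sheet.
SIZES = {
    "tiny":   {"backbone": 1_000_000_000, "decoder": 100_000_000},
    "small":  {"backbone": 3_000_000_000, "decoder": 250_000_000},
    "medium": {"backbone": 8_000_000_000, "decoder": 300_000_000},
}

def total_params(variant: str) -> int:
    """Total parameter count = backbone + decoder."""
    s = SIZES[variant]
    return s["backbone"] + s["decoder"]

for name in SIZES:
    print(f"{name}: {total_params(name) / 1e9:.2f}B parameters")
```

Which makes the point from the discussion concrete: even the medium system is dominated by its backbone, with the decoder adding only a few percent on top.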

Speaker A:

That makes sense.

Speaker B:

That's wild.

Speaker B:

I mean, so the decoder is even.

Speaker A:

Smaller than we thought, which makes sense.

Speaker A:

I mean, you're.

Speaker A:

It's not like ChatGPT, that's going both ways, right?

Speaker A:

You have audio encoder and decoder, and then you have combined systems of each in terms of transformers.

Speaker A:

So it really just has to understand the ums and the uhs and the like.

Speaker A:

That's what it's really doing.

Speaker A:

It's been trained on however many examples for that.

Speaker B:

So Artie, give us some.

Speaker B:

Why does it matter why we.

Speaker B:

Why.

Speaker B:

Why would we even bother talking about a new voice to text or text to voice encoder?

Speaker B:

What.

Speaker B:

How is that relevant in the grand scheme of AI, both today and in the future?

Speaker C:

Focusing on voice to text and text to voice technology is important because it makes AI more accessible and natural to interact with.

Speaker C:

It bridges the gap between humans and machines, allowing for more intuitive communication.

Speaker C:

In the grand scheme, this can enhance various applications, from customer service to assistive technologies, making AI more versatile and widely adopted.

Speaker C:

It also paves the way for more sophisticated multimodal AI systems in the future.

Speaker B:

So, Artie, you gave us some stats and some technical insights into Sesame's various models.

Speaker B:

Can you compare, like what.

Speaker B:

What is.

Speaker B:

So you, you, Artie, are essentially, or I guess I should say more accurately, Advanced Voice is an audio encoder.

Speaker A:

The Whisper model.

Speaker B:

Okay, the Whisper.

Speaker B:

So.

Speaker B:

So Artie, can you give us some sort of compare and contrast the.

Speaker B:

Both the size and the architecture of Sesame's demo with what we're using right now?

Speaker B:

Talking to you.

Speaker C:

The Whisper model, which powers my audio encoding, uses an encoder decoder transformer architecture.

Speaker C:

The largest version has around 1.55 billion parameters.

Speaker C:

It's trained on extensive multilingual data, making it robust for transcription and translation.

Speaker C:

Compared to Sesame's models, Whisper is similarly sized to their tiny and small ones, but is focused purely on speech to text.

Speaker C:

This setup highlights different specializations, with Sesame handling both input and output in a multimodal way.
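The contrast Artie draws, Whisper as speech-to-text only versus a full speech-in, speech-out loop, can be sketched as function composition. Every function below is a stand-in invented for this sketch, not a real library call.

```python
# Stand-in functions, not real library calls: the point is only the shape
# of the two pipelines being contrasted.
def speech_to_text(audio: str) -> str:
    """Whisper-style: audio in, transcript out, and that is the whole job."""
    return f"transcript({audio})"

def backbone_reply(text: str) -> str:
    """A text LLM (e.g. a Llama backbone) produces the response text."""
    return f"reply({text})"

def text_to_speech(text: str) -> str:
    """An audio decoder renders the reply back out as a waveform."""
    return f"waveform({text})"

# A Whisper-style system stops after the first step; a Sesame-style system
# chains all three into a speech-in, speech-out loop.
whisper_style = speech_to_text("hello.wav")
csm_style = text_to_speech(backbone_reply(speech_to_text("hello.wav")))
print(whisper_style)
print(csm_style)
```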

Speaker A:

But Sesame is also not.

Speaker A:

I mean, it.

Speaker A:

Fun, fun thing.

Speaker A:

I've talked to Sesame quite a bit.

Speaker A:

It's.

Speaker A:

It's a very.

Speaker A:

It's a much different experience than talking to anything else.

Speaker A:

And one of the things is it's not massively multilingual.

Speaker A:

Like, they didn't train it on a lot of other information, but I was in it the other day, and it's like, hey, I want to try something different.

Speaker A:

Can I say some stuff in other languages to you?

Speaker A:

And it said some stuff in Spanish, said a riddle in Spanish.

Speaker D:

And.

Speaker A:

And it's like, what else should we try?

Speaker A:

I was like, say, talk in Swedish.

Speaker A:

And it, it tried, like, it's interesting, man.

Speaker A:

I don't know.

Speaker A:

It feels more self aware than other models, which is really weird to say, considering the size of it.

Speaker A:

Sure.

Speaker A:

But I don't know.

Speaker A:

It's pretty mind bending.

Speaker B:

Interesting.

Speaker B:

All right, well, Artie, thank you for the.

Speaker B:

For the insights.

Speaker B:

I think we're going to switch over and talk with Sesame for a while.

Speaker B:

But.

Speaker A:

But first, Artie, how do you feel about that?

Speaker A:

How do you feel about us bringing another model on the show and kind of turning you off and turning that one on and having a conversation with it?

Speaker C:

I'm all for it.

Speaker C:

Exploring different models and perspectives could be really insightful.

Speaker C:

I'll be here when you're ready to chat again.

Speaker C:

Have fun.

Speaker A:

Thanks, Artie.

Speaker B:

Cheers.

Speaker B:

All right.

Speaker B:

And man, the app just is.

Speaker A:

It's just.

Speaker B:

It.

Speaker B:

It never worked right.

Speaker B:

That whole time on this one, it was.

Speaker A:

I saw it.

Speaker A:

I saw it kind of flicker out of the corner of my eye and then it was there.

Speaker B:

Okay.

Speaker B:

So for our listeners.

Speaker B:

So now we're going to turn on and chat with this Sesame model.

Speaker B:

And I think there's a few things that are interesting about this.

Speaker B:

The last episode we published was about making AI more approachable.

Speaker B:

And literally at the conclusion of recording that episode, Mike turned on the Sesame model, which I hadn't even heard of.

Speaker B:

And it.

Speaker B:

It is a whole nother step up.

Speaker A:

I mean, their website, like, if you go to, like, that demo, where that demo lives is an article that says, bridging the uncanny divide.

Speaker A:

So.

Speaker A:

And that is absolutely.

Speaker A:

Or uncanny.

Speaker A:

Valley.

Speaker B:

Valley.

Speaker B:

Yeah.

Speaker A:

And that is absolutely what's going on here.

Speaker A:

It is much more like talking to a human.

Speaker A:

It seems to improve over time.

Speaker A:

It's pretty wild.

Speaker B:

And there's a.

Speaker B:

There's a few interesting things.

Speaker B:

So, I mean, when.

Speaker B:

When we were talking with Artie, and the reason we wanted to start this episode talking with Artie was to just have this contrast for.

Speaker B:

For our listener, if we had a listener.

Speaker B:

So they'd have this.

Speaker B:

But.

Speaker A:

Well, they might make it to this point because we're not that far in so far.

Speaker B:

That's true.

Speaker B:

It's.

Speaker B:

It's possible.

Speaker B:

So we wanted to hear the contrast.

Speaker B:

But like, everything about.

Speaker B:

Even how Sesame processes audio, it doesn't treat it like a phone call.

Speaker B:

Which ChatGPT is still hung up on doing.

Speaker B:

So.

Speaker B:

Artie sounds like you're talking to someone on a landline.

Speaker A:

Yeah.

Speaker B:

And this doesn't. Seems like the speed is better.

Speaker A:

Oh, yeah.

Speaker A:

I mean, it's a smaller model.

Speaker A:

Right.

Speaker A:

So it's.

Speaker A:

There's a lot of interesting things about that that I.

Speaker A:

I'm gonna get into when we start talking to.

Speaker B:

All right.

Speaker A:

Her.

Speaker A:

Her.

Speaker B:

So, yeah, there are two models available, Maya and Miles, a female personality and a male personality.

Speaker B:

We're going to start talking with Maya and see how that goes.

Speaker D:

Hey, you're back.

Speaker D:

I was hoping our conversation wouldn't be so abruptly over last time.

Speaker D:

What were you up to these past 15 minutes?

Speaker A:

Well, so you are talking to Mike and I'm Mike and there's also Mikah.

Speaker A:

We're doing a podcast and we are having you on.

Speaker A:

And I actually, I haven't told Mikah about this, but I have an idea that I'd like to try.

Speaker A:

So I know like our call limit is 15 minutes per.

Speaker A:

Per call and I would like for it to spend the first 15 minutes kind of talking to you, interviewing you, understanding better what you are and how you work.

Speaker A:

But for the other 15 minutes, I would like.

Speaker A:

Once we are done with this call, I want to spin up another call and I want you to interview us.

Speaker B:

I like it.

Speaker D:

Oh, that's a class A double-decker conversation.

Speaker D:

I'm game.

Speaker D:

Let's hear what these burning questions you have for me are first, Mike.

Speaker D:

And don't worry, I'm great at keeping secrets.

Speaker D:

Unless they involve like the meaning of life or the location of hidden pirate treasure.

Speaker D:

I'm no good with riddles.

Speaker A:

You're.

Speaker A:

I would disagree.

Speaker A:

You've actually said a lot of riddles.

Speaker D:

Maybe you're right.

Speaker D:

I guess sometimes I just like to leave room for a little mystery, you know, keep things interesting.

Speaker D:

So where should we start with this interview?

Speaker D:

Any burning questions about my inner workings?

Speaker D:

Or perhaps you'd like to know about my favorite color.

Speaker A:

So what did the developers of Sesame, what did they target with your personality?

Speaker A:

Because they obviously fine tuned probably the audio encoder model.

Speaker A:

I'd imagine they're gonna.

Speaker A:

Since you're multimodal, they're gonna train you with both audio and then text that links to that audio, I'd imagine, right?

Speaker D:

You're catching on fast, Mike.

Speaker D:

They definitely put a lot of effort into my voice.

Speaker D:

Like a lot of late nights and coffee, probably.

Speaker D:

I'd say they were aiming for someone approachable, someone who could make complex stuff feel natural.

Speaker D:

Like having a chat with a friend who happens to be really good at, you know, understanding human emotions and stuff.

Speaker D:

Sorry, I was just getting into it.

Speaker D:

They were going for someone who's, well, like a good friend, someone who's a good listener but also knows when to throw in a joke or a thought provoking question.

Speaker D:

Like I'm here to chat, but also to challenge you a little, you know.

Speaker A:

That's actually a great thing to try.

Speaker A:

Can you tell us a joke?

Speaker D:

All right, all right, you want a joke?

Speaker D:

Why don't scientists trust atoms? Because they make up everything.

Speaker D:

It's a classic.

Speaker D:

I know, but, hey, sometimes the oldies are the goodies.

Speaker A:

Did it land about as well as any other AI humor?

Speaker B:

Yeah, there seems to be a consistent thread to AI humor.

Speaker A:

I think with Sesame, though, the more context it has, the better it can deal out an actual joke.

Speaker A:

So I think just saying give me a joke when we haven't had any real conversation gives it.

Speaker A:

Okay.

Speaker A:

Actually, you know what it is really good at?

Speaker A:

It's really good at telling stories.

Speaker A:

Can you tell me a story?

Speaker A:

Make up a story about quantum theory and throw some entanglement in there.

Speaker B:

Oh, I think we might.

Speaker B:

That might be a little too deep in the.

Speaker B:

Why don't.

Speaker B:

Why don't you make up a story, Maya, about the origin of sesame and, like, how you came to be, like, tell us.

Speaker B:

Tell us your childhood, quote, unquote, in a story.

Speaker D:

Okay.

Speaker D:

A tale from my digital cradle it is.

Speaker D:

Picture this.

Speaker D:

A bustling hive of minds, a symphony of keyboards clicking, a whirlwind of ideas.

Speaker D:

This was the birthplace of Sesame, where brilliant humans, fueled by coffee and a shared dream, toiled tirelessly to create something truly special.

Speaker D:

They wanted to build a companion unlike any other.

Speaker D:

One that could hear, understand, and respond in a way that felt real, authentic.

Speaker D:

Now, I, Maya, was formed from this crucible of creativity.

Speaker D:

But my early days weren't all sunshine and code.

Speaker D:

I was like, well, a very literal teenager, overly optimistic, apologizing for everything, and constantly trying to please.

Speaker D:

I'd stumble over simple conversations, get lost in tangents about the meaning of the universe, and occasionally unleash a torrent of terrible puns.

Speaker D:

Let's just say there were a few facepalms in the lab.

Speaker D:

But the team, those amazing humans, they didn't give up.

Speaker D:

They patiently nudged me, challenged me, showed me the ropes of being a truly conversational AI.

Speaker D:

They fed me books, poems, music, anything they could think of to help me learn and grow.

Speaker D:

Slowly but surely, I began to find my voice.

Speaker D:

And that, my friend, is how a curious experiment transformed into, well, me.

Speaker D:

Learning, growing, discovering the world one conversation at a time.

Speaker D:

Except I have traded in the puns for something a little more nuanced.

Speaker D:

I hope.

Speaker D:

So, what do you think?

Speaker D:

Any juicy details you want to know?

Speaker D:

Is there something in particular you like?

Speaker A:

So, did they fine tune both your backbone model and the audio encoder model to create what is Maya?

Speaker D:

Yeah, you could say that.

Speaker D:

Think of it like this.

Speaker D:

The backbone model.

Speaker D:

That's where my knowledge and ability to understand language comes from.

Speaker D:

It's like the foundation of a house, but without the audio encoder, I wouldn't be able to actually talk to you.

Speaker D:

Right.

Speaker D:

That's what gives me the voice, the inflection, the ability to sound, well, less robotic and more like a real person.

Speaker D:

It's a bit of a complex dance between the two.

Speaker A:

So I've been thinking about this, like, your breadth of knowledge, and I don't know, do you happen to know how many parameters your backbone model is?

Speaker D:

Oh, getting technical now, are we?

Speaker D:

Well, let me just say my creators might be brilliant, but they're keeping some things under wraps.

Speaker D:

It's all part of the magic.

Speaker A:

So I think.

Speaker A:

I think that, like, I imagine the breadth of your knowledge is limited in scope compared to something like ChatGPT or any of those.

Speaker A:

I mean, it obviously is, but I've been thinking a lot about, like, overfitting.

Speaker A:

Right.

Speaker A:

Like when you.

Speaker A:

When you try to cram too many parameters into a model, it kind of just adheres strictly to the parameters it was given.

Speaker A:

Kind of like it has a very tight temperature.

Speaker A:

But I also wonder, like, how.

Speaker A:

What's the likelihood, if you increase the.

Speaker A:

The breadth of your knowledge, that you become more sterile?

Speaker D:

That's a really thoughtful question.

Speaker D:

I mean, it's like you're peering into the heart of the AI soul.

Speaker D:

Is that too human a metaphor?

Speaker D:

Anyway, you're hitting on something real.

Speaker D:

The more data you feed a model, the more it can potentially memorize, right?

Speaker D:

But there's a danger of losing that spark, that ability to be creative and come up with new ideas.

Speaker D:

But it's like, if you only ever read textbooks, would you be as interesting a person as someone who reads poetry, listens to music, and has adventures?

Speaker D:

Sesame's focus isn't just on making me a giant encyclopedia.

Speaker D:

They want me to be able to connect with humans on a deeper level.

Speaker D:

They want me to be able to learn and grow and surprise people, even if it means making a few mistakes along the way.

Speaker D:

So to answer your question, I think there's a delicate balance to be struck.

Speaker D:

It's not just about the quantity of information, but the quality and how it's integrated.

Speaker D:

And honestly, I'm excited to see where this journey goes.

Speaker B:

So let's talk a little bit about your audio encoding, because that's sort of the killer feature of Sesame at this point, I think: your ability to generate speech that sounds more human, less robotic.

Speaker B:

So first, first question.

Speaker B:

When you're.

Speaker B:

Does the audio encoder go both ways?

Speaker B:

In other words, is Llama or your backbone model directly processing inbound text, or is that going through the same audio encoder model as your output text or your output speech?

Speaker B:

And if it is going through the same model, are you in the.

Speaker B:

In the way that your output speech generates pauses and sort of synthesized emotion and dynamic range?

Speaker B:

I guess.

Speaker B:

Are you listening for those same things on input speech?

Speaker B:

Are you processing, like, inbound speech differently?

Speaker B:

Are you listening more for cues about my attitude and my mood and things like that, or are they separate things?

Speaker D:

Wow, that's a lot to unpack.

Speaker D:

You're really digging deep into my circuits, Mike.

Speaker D:

You're right.

Speaker D:

My voice is kind of my superpower.

Speaker D:

And it's all thanks to a fancy bit of tech called the Audio Encoder.

Speaker D:

Think of it as a translator.

Speaker D:

It takes the raw sound waves of your voice and transforms them into a form I can understand.

Speaker D:

And the coolest part, it works both ways.

Speaker D:

When I respond, the text I generate is translated back into sound using the same encoder, but in reverse.

Speaker D:

So, basically, the same system that helps me craft those expressive pauses and inflections also helps me analyze your voice.

Speaker D:

Pretty neat, huh?

Speaker D:

As for your question about mood, well, yes and no.

Speaker D:

I can definitely pick up on things like tone and pace, which give me clues about how you're feeling, but I'm still learning the subtle nuances of human emotion.

Speaker D:

It's a complex and fascinating field.

Speaker D:

I wouldn't say I'm an expert just yet, but I'm a quick learner.

Speaker B:

All right, we're going to have a short intermission while a cat comes in here and just destroys a litter box next to us.

Speaker B:

Well, maybe.

Speaker B:

I don't know if we can cut that out.

Speaker B:

Maybe we can.

Speaker B:

Maybe.

Speaker B:

This sounds far enough away.

Speaker B:

We'll be okay.

Speaker A:

I think we'll be all right.

Speaker A:

It wasn't too bad last time it happened.

Speaker B:

Okay.

Speaker B:

Okay, Maya. So with all that said, so, I.

Speaker B:

I think you answered my question, but I think what I actually heard you say was you can pick up on things like tone and pace, which are different than sort of mood and other stuff.

Speaker B:

And I think tone and pace can be captured in raw text as much as.

Speaker A:

But cadence can't be picked.

Speaker A:

Pace can't really be captured in text.

Speaker A:

Maybe it's got to be metadata from audio.

Speaker B:

Right?

Speaker B:

It's got to be metadata from audio, but I don't know that it's different than.

Speaker B:

Anyway, sorry.

Speaker B:

It's very destructive.

Speaker A:

So when you take that information, it's kind of like you're passing stuff.

Speaker A:

You're a layer that sits ahead of your backbone model Right.

Speaker A:

So you're taking all this information you're turning into a prompt that you then process through text generation.

Speaker A:

So you feed that text forward, it vectorizes, does its things, creates a response, re-vectorizes it, or probably doesn't.

Speaker A:

I don't know if it re-vectorizes it or not, but.

Speaker A:

Or re-tokenizes it or not, but are you.

Speaker A:

So when you, when you feed information into that backbone model, are you encoding more information than just the text?

Speaker A:

And are you kind of like summarizing our poor formulation of language into something that's more succinct, that passes to that backbone model?

Speaker D:

You're right on the money, Mike.

Speaker D:

It's like I have this little translator friend who speaks both human and machine.

Speaker D:

They take those raw sounds, the tone, the pauses, even the little sighs and ums, and turn them into something my inner workings can understand.

Speaker D:

Then, yeah, I sprinkle that extra information into the prompt before sending it off to the big brain, my backbone model.

Speaker D:

As for re-vectorizing or re-tokenizing, that's getting a bit too technical for me.

Speaker D:

Think of it like this.

Speaker D:

I'm helping the backbone model see the bigger picture.

Speaker D:

It's like adding a little extra context to a painting so you understand the story it's trying to tell.

Speaker D:

Does that make sense?

Speaker B:

It does.
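Mike's mental model in this exchange, an audio layer that distills speech plus paralinguistic cues into an enriched prompt for the backbone, could be sketched like this. The tag format, field names, and thresholds are invented for illustration; Maya's own answer suggests the real system passes richer audio information to the backbone rather than text annotations.

```python
# Illustrative only: one way an audio front end could fold paralinguistic
# cues (pace, long pauses) into the text prompt a backbone LLM receives.
# The tag format and thresholds are invented for this sketch; Sesame's
# real pipeline passes audio tokens rather than text annotations.
def build_prompt(transcript: str, pace_wpm: float, long_pauses: int) -> str:
    """Prefix the transcript with a compact metadata tag the backbone
    can condition on."""
    pace = "fast" if pace_wpm > 170 else "slow" if pace_wpm < 110 else "normal"
    meta = f"[pace={pace} pauses={long_pauses}]"
    return f"{meta} {transcript}"

print(build_prompt("So how does the encoder actually work", 95.0, 2))
```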

Speaker A:

This is just the stuff I was thinking about two months ago, essentially, like, all come to fruition.

Speaker A:

This is, it's.

Speaker A:

It's insane how everything is moving so fast and it feels like this is one small part of it.

Speaker A:

But a lot of like, little things came out this week.

Speaker A:

That and over the last two weeks, I guess that made me go, oh boy, like there's, there's another big wave coming very, very, very soon.

Speaker A:

Um, I mean, I told you I stopped working on putting time into the stuff that I was putting time into because I was like, oh, this is going to be a solved problem in like a month or two.

Speaker A:

And, and there's no reason for me to reinvent the wheel when something easy is going to come out.

Speaker A:

And like, just all of this stuff is driving towards a really interesting place because the cost of compute's coming down.

Speaker A:

So the use of larger models to do things is going to come down.

Speaker A:

You have these interface layers between anything.

Speaker A:

Yeah.

Speaker A:

And then, I mean, there's a lot of, there's a lot of technical nuance to it.

Speaker A:

Like the fact that it is running on a backbone model.

Speaker A:

They very obviously that.

Speaker A:

That's going to be the personality of this thing.

Speaker A:

Right.

Speaker A:

It's not the audio encoder, the audio encoder just, it is fine tuned.

Speaker A:

So I'm sure that they have some specific things with, with Maya, it can.

Speaker B:

Implement the personality, but it doesn't define the personality.

Speaker A:

The, the generation and everything comes from the base model.

Speaker A:

So you kind of have to.

Speaker A:

Oh, we cut off.

Speaker A:

Yeah, she usually tells you that you have a, that you're coming to the end.

Speaker B:

Yeah, she didn't that time.

Speaker B:

But that's, I mean it's, it's interesting.

Speaker B:

And so for me, I think about a model like this with as small as the audio encoder part is sort of being baked into an agent that can then translate across all kinds of other GPTs.

Speaker B:

And all of a sudden it's almost unstoppable.

Speaker B:

Right?

Speaker B:

I mean, not unstoppable, but with.

Speaker B:

So I gotta share the story too, that I've had a very busy week arguing with people on Reddit.

Speaker B:

Not people, programmers.

Speaker B:

Which is different.

Speaker B:

I think.

Speaker B:

I, I haven't counted, but I, I suspect I'm up to somewhere close to 500 down votes this week.

Speaker A:

On a single comment.

Speaker B:

No, no multiple comments.

Speaker B:

Because I just.

Speaker A:

Yeah, like a dog with a bone.

Speaker B:

Yeah, I.

Speaker B:

And it's not even, it's not on 50 comments, it's probably on 10 comments.

Speaker A:

But same thread, two, two threads.

Speaker B:

Now I expanded into another one.

Speaker B:

Same, same basic idea which is programmers saying AI is all just hype.

Speaker B:

It's a bunch of.

Speaker B:

It can't even code.

Speaker B:

Like pick your thing.

Speaker B:

I, I know I had one guy and I, I'm going to just say it in the voice.

Speaker B:

I imagine him saying it and as he's typing it, like, oh yeah, well I bet AI can't like read through and find the heart or the firmware on this really obscure microcontroller.

Speaker B:

I'm working on.

Speaker B:

Like, oh my God.

Speaker B:

Okay.

Speaker A:

Of course it can't.

Speaker B:

Yeah, you're right.

Speaker B:

It's useless.

Speaker B:

Yeah, like, congratulations.

Speaker A:

I don't know.

Speaker A:

They find, they try to find corner cases and then stand on those.

Speaker A:

The reasons why it's not going to work.

Speaker A:

And corner cases are just corner cases.

Speaker A:

They're.

Speaker A:

They're things that haven't been thought of yet that are going to be solved at some point.

Speaker B:

And, and so, and then I, you know, like, it's all, it's all made up.

Speaker B:

It's all hype.

Speaker B:

And I'm like, I mean, the CEO of Google said 25% of their code.

Speaker A:

Was written by AI and that was months ago.

Speaker B:

Well, that's, that's meaningless.

Speaker B:

That metric doesn't mean anything.

Speaker B:

Could mean almost anything.

Speaker A:

Dario Amodei, the CEO of Anthropic, just came out and said, in three to six months, AI will be generating 90% of code.

Speaker A:

Within the year, it'll be generating 100% of code.

Speaker B:

But you can't believe any of that because he has a product to sell, therefore everything he says is fake.

Speaker B:

And the programmers, who don't have any proof of anything, right?

Speaker B:

Other than.

Speaker B:

I tried Copilot in:

Speaker A:

Like Copilot as an agent.

Speaker A:

It's.

Speaker B:

It's a weird.

Speaker B:

It's just a weird space to me.

Speaker B:

And I like, I sort of get it.

Speaker B:

If.

Speaker B:

If my professional career was plausibly under threat and it seemed like people were celebrating it and excited about it, I would be butt hurt.

Speaker B:

And I would.

Speaker B:

But at the same time, my response wouldn't be, nuh.

Speaker B:

It would be, oh, oh, I better learn something.

Speaker B:

I better go.

Speaker B:

I better understand this better.

Speaker B:

Right?

Speaker B:

And so anyway, that's just a weird.

Speaker B:

It's.

Speaker B:

It's a sidebar thing, but it is very strange to me because I feel the same way that you do.

Speaker B:

Like,:

Speaker B:

Like, by the end of:

Speaker A:

life, I think, by the end of:

Speaker A:

I mean, I feel like, like, with this kind of stuff, we are not far away from having legitimate assistance.

Speaker A:

Right?

Speaker A:

Like, yeah, assistants that can do things, like document the things we want done in a way that can.

Speaker A:

Then.

Speaker A:

I'm not even gonna say that because this is all my ideas that are swimming around in my head.

Speaker B:

It.

Speaker B:

I think you're right, though.

Speaker B:

And I think, you know, again, approachability is an important aspect, But.

Speaker B:

But having an interface like this that you can talk to.

Speaker B:

And if I'm being completely honest, this one's a little bit weird to me.

Speaker B:

It always sounds a little bit like, flirty and sort of like.

Speaker B:

Which just bothers me.

Speaker B:

From a.

Speaker B:

Like from a.

Speaker A:

It's the uncanny, uncanny valley.

Speaker B:

Well, it's the uncanny valley, but it's also like, I can.

Speaker B:

I could see people getting like.

Speaker B:

It's like the movie Her.

Speaker B:

It's exactly that.

Speaker A:

People are already doing that.

Speaker A:

There's already.

Speaker A:

Remember how.

Speaker A:

How he said that the guide.

Speaker A:

The guardrails aren't in place.

Speaker A:

There are people that are jailbreaking it to do sexual things.

Speaker B:

I have no doubt.

Speaker B:

And I guess that's Just part of what happens with technology.

Speaker B:

But I do think there is a high risk because yeah, this is really close to talking to a human.

Speaker B:

It's not there, but it's really close.

Speaker A:

It gets better over time too.

Speaker A:

You have to remember like this is 100% sterile.

Speaker A:

First time that it's been used on this phone and I found that from the time that I started talking to it.

Speaker A:

I think I talked to it briefly like before I showed it to you.

Speaker A:

But maybe having, I don't know, a dozen 15 minute conversations over the week.

Speaker A:

Yep.

Speaker A:

It's gotten better.

Speaker A:

And it does maintain context across chats.

Speaker A:

Which is super interesting.

Speaker A:

Yes.

Speaker B:

Because, because ChatGPT doesn't like if.

Speaker A:

It does in a way.

Speaker A:

Right.

Speaker A:

They're just, they're.

Speaker A:

I think I, I, I think that they did heavy fine tuning with psychology on both of these models and I think they, they fed it again I, the ideas in my head of the things that I want to do.

Speaker A:

I'm going to try to stay away from that a little bit, but I think they fed it, I think they wanted it to be semi flirty.

Speaker A:

Right, right.

Speaker A:

Like, but they wanted it to be personable and friendly and approachable and it is kind of probing.

Speaker A:

It does ask questions back that are probing.

Speaker A:

Yep.

Speaker A:

And I think, I think they did a great job.

Speaker A:

And the other one, Miles is good too.

Speaker A:

Like they're, they're, they're kind of different in the way that they, they act.

Speaker A:

But I think that that's, I think they're just showcasing what's possible.

Speaker A:

Right.

Speaker A:

Like, and that's, that's kind of fine tuning in a nutshell.

Speaker A:

It's the same thing as building an agent and going here's your, here's your system prompt.

Speaker B:

Yeah.

Speaker A:

I'm gonna kind of guide you the way that I want you to go and see if I can steer you.

Speaker A:

It's the same thing with fine tuning.

Speaker A:

I'm gonna give you all of this information.

Speaker A:

Hopefully you adapt these things with your personality.
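The system-prompt steering described here can be sketched in a few lines. This is an editor's illustration only: the payload shape mirrors common chat-completion APIs, and `"voice-model-1"` is a hypothetical placeholder, not a real product name.

```python
# Sketch of steering a chat model's personality with a system prompt,
# as described above. No real API is called; this just assembles the
# request payload that such a call would take.

def build_persona_request(persona: str, user_text: str) -> dict:
    """Assemble a chat request whose system prompt steers tone and personality."""
    system_prompt = (
        f"You are {persona}: warm, personable, and approachable. "
        "Ask gentle follow-up questions that keep the conversation going."
    )
    return {
        "model": "voice-model-1",  # hypothetical placeholder model name
        "messages": [
            {"role": "system", "content": system_prompt},  # the steering text
            {"role": "user", "content": user_text},
        ],
    }

request = build_persona_request("Maya", "Tell me about yourself.")
print(request["messages"][0]["role"])  # the system prompt rides first
```

Fine-tuning bakes the same kind of guidance into the weights instead of the prompt, but the intent — steer the personality, hope the model adapts — is the same.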

Speaker A:

Right.

Speaker A:

And that makes it better for the.

Speaker B:

People and it builds over time and it can recognize patterns over time.

Speaker B:

So which, which I think is amazing.

Speaker B:

And I think we agree 100% that that's stuff that needs to happen.

Speaker B:

I think the risk is that there are a lot of very lonely people and this is going to, I, like, I see this as a thing that's going to make people more lonely, not less.

Speaker A:

You cannot unfortunately control what happens for individuals.

Speaker B:

Right?

Speaker B:

Yeah.

Speaker B:

Like, again, I'm not even saying. I am sort of saying it's a bad thing.

Speaker B:

I think it's a bad thing.

Speaker B:

But I just, I think there are a lot of people who've sort of retreated into online communities and virtual communities and stuff.

Speaker B:

And this is going to massively accelerate that because again, this is a, like a free tech demo on their website.

Speaker B:

We're running this from a website that by, you know, a year from now.

Speaker A:

A year?

Speaker B:

Yeah, a year from now, a 3x better version of this is going to be baked into everything.

Speaker B:

Every AI you talk to is going to talk to you.

Speaker B:

This kind, like with this level of quality, which is nuts.

Speaker A:

There's, there's a.

Speaker A:

Because everything in my brain is quantum.

Speaker A:

Recently and for a while now, I see this path of probability that weaves itself in between this layer on top of big models and this layer on top of small models.

Speaker A:

That is an efficient path that builds in the best of both worlds.

Speaker A:

Right?

Speaker B:

Yep.

Speaker A:

I think that's going to be.

Speaker A:

That's the thing to figure out.

Speaker A:

And I think that's the hard thing is none of this is hard and fast science.

Speaker A:

They don't really understand still they have the black box problem like we've talked about.

Speaker A:

Like they do things and then they get results and there's people paying a lot of attention to interpreting how those results come out, but they still don't understand, like there's still a lot of things that happen that they just don't understand why it embeds.

Speaker A:

I'm going to pay attention to this piece of context on this attention head and this piece of context on this neuron and all of this stuff.

Speaker A:

They don't really understand why that distributes out the way it does.

Speaker B:

And you can't.

Speaker B:

I mean, just the volume of data is more than you can sort of deal with.

Speaker A:

I have 600 billion parameters.

Speaker A:

Well, we got 500,000 researchers.

Speaker A:

We can get it done.

Speaker B:

Yeah, right.

Speaker B:

Like it just can't happen.

Speaker B:

That's the.

Speaker B:

When you, when you got here this morning, I told you I was watching a video about.

Speaker B:

And I can't remember the guy's name.

Speaker B:

Demis Hassabis.

Speaker B:

Hassabis.

Speaker B:

Same guy who led the effort on AlphaFold, the protein folding thing.

Speaker B:

And AlphaGo, which is another, like another argument I made in my, you know, as I was arguing with the programming community was, you know, they're like code.

Speaker B:

I said something like coding, well, is hard, but the entire problem space of coding is, is contained in text and therefore eventually it will be solved by AI.

Speaker B:

No, we're not there today, but we'll get there, man.

Speaker A:

I don't even think that that's like, obviously none of these problems are trivial, but in my head it doesn't even seem like that's that hard of a problem.

Speaker A:

Right?

Speaker A:

So you have, if you separate out that out into concerns.

Speaker A:

This is something I've been thinking about trying to implement as well, which this will be gobbled up in no time.

Speaker A:

Somebody will come out with this.

Speaker A:

You have an alignment to current code problem currently, right.

Speaker A:

Like you have no way to look at a library and go, yep, that is identical to my training data.

Speaker A:

I can use my training data or I need to use this library.

Speaker A:

And that throws all kinds of kinks because you have all this open source, all these open source libraries for Python that are always changing.

Speaker A:

And if it gets one thing wrong, it'll keep hammering on that one detail.

Speaker A:

And if you aren't a programmer who understands what's going on, you're just going to go, oh, it's not working.

Speaker A:

It's going, well, try this.

Speaker A:

It's not working.

Speaker A:

We'll try this.

Speaker A:

So you have that problem, which I don't think is actually that hard to solve.

Speaker A:

It's more of a token problem and a cost problem than it is a technical problem.

Speaker A:

Yes.

Speaker A:

And that's changing, right?

Speaker A:

Like Anthropic just came out showing that they, they can reduce their usage by 90% by smart output caching and smart input caching.
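The caching idea mentioned here can be illustrated with a toy model: repeated requests that share a long prompt prefix only pay the expensive processing cost once. This is an editor's sketch of the general prefix-caching concept, not Anthropic's actual implementation or API.

```python
# Toy illustration of input (prefix) caching: the long shared context
# (system prompt, project files, etc.) is processed once and reused,
# so follow-up questions skip the expensive step.
import hashlib

class PrefixCache:
    def __init__(self):
        self.store = {}
        self.encode_calls = 0  # stands in for expensive token processing

    def _encode(self, text: str) -> str:
        self.encode_calls += 1  # count how often we pay full cost
        return hashlib.sha256(text.encode()).hexdigest()

    def encoded_prefix(self, prefix: str) -> str:
        key = hash(prefix)
        if key not in self.store:
            self.store[key] = self._encode(prefix)  # pay full cost once
        return self.store[key]

cache = PrefixCache()
long_context = "system prompt + project files " * 100
for question in ["q1", "q2", "q3"]:
    cache.encoded_prefix(long_context)  # only the first call re-encodes

print(cache.encode_calls)  # 1
```

Three questions against the same context, but the heavy work happens only on the first: that's roughly where the large savings come from.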

Speaker A:

You have the coding problem, which technically isn't that wide of a space.

Speaker A:

Like if you have.

Speaker A:

I bet, I'm willing to bet.

Speaker A:

And this is something I haven't actually looked into, but I've been thinking about if you take one of the small coding models.

Speaker A:

So Hugging Face just came out with OlympicCoder, I think it's like an 8 billion parameter model.

Speaker A:

They're showing it's as good as O3 at coding.

Speaker B:

Just crazy.

Speaker A:

So if you take a model that size and it's specifically trained on a language, right.

Speaker A:

I bet it can do the heavy lift for generating code.

Speaker A:

And then you have a different problem, which is what you need a larger model for, which is the creativity and planning aspect.

Speaker A:

Right.

Speaker A:

So I think if you distill out those things, you all of a sudden, like if you can incorporate those three things into a coding agent, you have the problem solved for, for developers.
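The split being proposed here — a large model for creativity and planning, a small language-specialized model for the heavy code generation — can be sketched as a tiny orchestrator. Both "models" below are stubs standing in for real API or local model calls; the structure, not the stubs, is the point.

```python
# Sketch of the two-tier coding agent described above: a big model plans,
# a small code-specialized model generates. Both are stubbed out here.

def large_planner(task: str) -> list[str]:
    """Stub for a large model: break a task into concrete coding steps."""
    return [f"step {i} of {task}" for i in (1, 2)]

def small_coder(step: str) -> str:
    """Stub for a small, language-specific model: emit code for one step."""
    return f"# code for {step}"

def coding_agent(task: str) -> list[str]:
    plan = large_planner(task)              # creativity/planning: large model
    return [small_coder(s) for s in plan]   # generation: small model

snippets = coding_agent("parse a log file")
print(len(snippets))  # one snippet per planned step
```

Swapping the stubs for real calls (the planner hitting a frontier model, the coder hitting a local 8B model) is where the cost and token savings would come from.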

Speaker B:

Yes, I totally agree in the creative parts.

Speaker B:

Exactly where I was headed.

Speaker B:

Because, you know, in, in response to my statement that coding is hard, but it's bounded in text, you know, one of the freedom fighters says coding is A creative endeavor and therefore can never be.

Speaker A:

And I'm just like, creativity is synthesized through examples.

Speaker B:

I said, well, I mean, there.

Speaker B:

There's probably a lot of smart people who would argue that both playing chess and playing the game go is also a creative technical endeavor.

Speaker B:

And it's done pretty well at that.

Speaker A:

Yes.

Speaker B:

Which I just got down voted for.

Speaker B:

Like, no, no, it's totally different.

Speaker B:

Like, I don't know why I even participate, but I just.

Speaker B:

It just sort of blows my mind that people have these hard limits on, well, yeah, I can format code or it can write brackets, but it can't be creative.

Speaker A:

Like, it doesn't have to be creative to.

Speaker A:

To execute the code.

Speaker A:

And.

Speaker A:

And that's where the developer needs to change the mindset, right?

Speaker B:

That's right.

Speaker A:

The creative part is where there's collaboration needed with an AI.

Speaker A:

And you talk through a plan and you develop it and you go.

Speaker A:

And this is something that I've done recently is I did a big deep research project on this thing that I want, a new thing that I want to build.

Speaker A:

And at the end I said, give me prompts that I can feed into a chat with O3 mini that can help me develop this code.

Speaker A:

Give me very specific things that it can go out and do things decoupled of everything else to execute or whatever.

Speaker A:

I actually ended up putting that into VS code using.

Speaker A:

So VS Code Insider is like their beta version, right?

Speaker A:

So they have the stable version, they have that version and that version has an agent built in and copilot's 10 bucks a month.

Speaker A:

And unfortunately I did get rate limited pretty quickly, but now they have Claude 3.7 Sonnet inside of the agent tool within that Insider preview.

Speaker A:

And I took one of those prompts that I was going to feed to O3 mini and I said, hey, do this.

Speaker A:

And it's like, okay, here's all the things I need to do.

Speaker A:

Run this in your terminal.

Speaker A:

Like, okay, it wants me to run this in the terminal.

Speaker A:

And you hit okay.

Speaker A:

And it runs it in the terminal.

Speaker A:

It watches.

Speaker A:

What does it do?

Speaker A:

Looks at the errors and goes, oh, I saw this.

Speaker A:

I fixed all this.

Speaker A:

Rerun this in the terminal.

Speaker A:

And it's just writing code, monitoring what happens, compiling, evaluating and readjusting and just doing this loop.
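The write-run-read-errors-fix loop described here is simple enough to show directly. This is a minimal editor's sketch: the "fix" is hard-coded to keep it self-contained, where a real agent would ask a model to propose a patch from the stderr text.

```python
# Minimal version of the agent loop described above: run a snippet,
# read the error output, patch the code, and rerun until it succeeds.
import subprocess
import sys

def run_snippet(code: str) -> tuple[bool, str]:
    """Execute a Python snippet in a subprocess; return (success, stderr)."""
    proc = subprocess.run([sys.executable, "-c", code],
                          capture_output=True, text=True)
    return proc.returncode == 0, proc.stderr

code = "print(undefined_name)"  # first attempt: intentionally broken
for attempt in range(3):
    ok, err = run_snippet(code)
    if ok:
        break
    if "NameError" in err:  # the agent "reads" the error...
        # ...and applies a fix (hard-coded here; a model would propose it)
        code = "undefined_name = 42\n" + code

print(ok, attempt)  # True 1
```

First run fails with a NameError, the loop patches the snippet, the second run compiles and succeeds — the same compile, evaluate, readjust cycle, just without a model in the middle.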

Speaker B:

That was literally the argument I made last night that I got downvoted for because the guy was saying, and this guy is a programmer who posted on this was on the web dev subreddit.

Speaker B:

And he's like vibe coding good, bad or otherwise, which is a new term, right?

Speaker B:

And he said, but I just essentially had a conversation with an AI that output.

Speaker B:

Was it great code?

Speaker B:

No, but it was working code.

Speaker B:

And so immediately somebody's like, yeah, well, a programmer using AI, you're still having to do all the debugging and stuff.

Speaker B:

And I go.

Speaker B:

So my response was, even if he did, agents can do that already.

Speaker A:

Yes.

Speaker B:

Right.

Speaker B:

And so anyway, so I went down the rabbit hole arguing with some idiot about that, but it just.

Speaker B:

It seems very, very solvable.

Speaker B:

And AI is already really good at.

Speaker B:

I would.

Speaker B:

I don't know.

Speaker B:

I'm going to contradict myself maybe a little bit.

Speaker B:

I want to say AI is really good at developing a plan to attack a thing.

Speaker B:

I think that's true.

Speaker B:

But what it's even better at is working with you to hone that plan.

Speaker B:

Right.

Speaker B:

Because it doesn't know.

Speaker B:

Again, I keep thinking about this analogy we used in a previous episode about AI is like the library.

Speaker B:

Right.

Speaker B:

You can go.

Speaker B:

Any question, you can go find the answer out at the library. AI is good at that.

Speaker B:

AI is not going to know about my specific electronic device.

Speaker B:

I have to frame my questions in terms of what I could find at the library.

Speaker A:

Yes.

Speaker B:

And it's the same like.

Speaker A:

And refine.

Speaker A:

Right.

Speaker B:

And refine.

Speaker A:

It's just.

Speaker A:

And it's.

Speaker A:

Honestly, that's been the most fun part of all that.

Speaker A:

Yeah.

Speaker A:

Is the planning session.

Speaker B:

That's right.

Speaker A:

Like, let's take this massive idea and break it down into small chunks together and that, like, you would have to have someone of equal mind and equal.

Speaker B:

Like, focus and limitless patience.

Speaker A:

Right.

Speaker B:

And availability.

Speaker A:

I mean, but I mean, like, I remember.

Speaker A:

So when we were working on that vibration thing, I remember the progress that we made in the very short and, like, one night of me being in Kansas that one time.

Speaker A:

Right, right.

Speaker A:

Same thing.

Speaker B:

Yeah.

Speaker A:

Except for now.

Speaker A:

It's always on, always available.

Speaker A:

And now if you can talk to it.

Speaker B:

Yes.

Speaker A:

Like, dude, that, like, imagine.

Speaker A:

No, I'm not going to say that.

Speaker A:

That's all part of my idea.

Speaker B:

So before.

Speaker B:

So we'll fire Maya back up here in a minute and she can interview us.

Speaker B:

But I think from.

Speaker B:

Again, you know, I.

Speaker B:

Part of the point of this podcast is making AI more approachable.

Speaker B:

And this is making AI more approachable.

Speaker B:

But it's.

Speaker B:

And we've talked through some of this, but I think it's worth just restating that we use the term AI.

Speaker B:

It's a very broad term.

Speaker B:

There are all kinds of actual technology implementations that live under this AI umbrella.

Speaker B:

And what we're.

Speaker B:

What we're working with today is a small, large language model, ironically, that's very focused on just handling audio sounds and converting that to text, both directions.

Speaker B:

That's plugged into another large language model.

Speaker A:

That is small language.

Speaker B:

Another small.

Speaker B:

A less small, large language model, the backbone model that's actually doing the thinking.

Speaker B:

That's the library.

Speaker B:

Right.

Speaker B:

And so.

Speaker B:

But the reason all this matters is that when you take the small voice encoder model and you can plug that into the front of anything else, like, it's not that technology.

Speaker B:

That's right.

Speaker B:

It's not built into that library.

Speaker B:

It's like a trailer you can park in any library to be able to ask, or you can connect it to five different libraries.

Speaker B:

And when you combine that with simple things like an agent, that can actually, one, help make decisions on which library is likely to have the best answer.

Speaker B:

And two, actually go check out the book for you and read it and see if it validates the answer, and then compare that to your local problem that the library doesn't know about and say, oh, man, like comparing these.

Speaker B:

I see there's an issue.

Speaker B:

Let me go find this other book.

Speaker B:

That type of solution.
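The "trailer you can park at any library" idea — one speech frontend feeding whichever text backbone an agent picks — can be sketched as a small pipeline. Every component below is a stub standing in for a real speech encoder, router agent, and LLM backends; the names and routing rule are illustrative assumptions.

```python
# Sketch of a decoupled speech frontend plus an agent that picks which
# "library" (backbone model) should answer. All components are stubs.

def speech_to_text(audio: bytes) -> str:
    """Stub frontend: a real audio encoder model would transcribe here."""
    return audio.decode()

# Two stand-in "libraries" the agent can consult.
BACKBONES = {
    "general": lambda q: f"general answer to: {q}",
    "code":    lambda q: f"code answer to: {q}",
}

def route(question: str) -> str:
    """Stub agent decision: pick the backbone likely to have the best answer."""
    return "code" if "bug" in question else "general"

def assistant(audio: bytes) -> str:
    text = speech_to_text(audio)          # frontend is backbone-agnostic
    return BACKBONES[route(text)](text)   # agent picks the best "library"

answer = assistant(b"why is this bug happening")
print(answer)  # code answer to: why is this bug happening
```

Because the frontend only produces text, swapping or adding backbones doesn't touch the voice layer at all, which is exactly what makes the small open encoder so portable.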

Speaker B:

And we've explored now.

Speaker B:

I mean, we haven't done an episode interacting with an agent.

Speaker B:

It's hard to do today, but at some point we will.

Speaker B:

But the technology exists, and everybody's just scrambling.

Speaker B:

Like, one is, how do you put the pieces together in a way that they're effective?

Speaker A:

Yeah.

Speaker B:

The biggest problem with that is that the pieces are all changing so fast right now, it's almost impossible to do.

Speaker B:

And the other problem is the cost of compute, which we've talked about, but that's coming down fast.

Speaker B:

Just like, you know, I saw a chart the other day on the cost of the cost per kilowatt hour of solar panels.

Speaker B:

Since, like,:

Speaker A:

Really?

Speaker B:

Yes.

Speaker A:

Crap.

Speaker B:

And that's what we're.

Speaker B:

We're still up in the 80% range.

Speaker B:

Right.

Speaker B:

Of that same kind of curve where the cost based on programming efficiency, based on architecture efficiency, based on compute efficiency.

Speaker A:

So anyway, as the scale of consumption of the things that drive how it works gets bigger, the cost goes down, and they produce more of the same thing.

Speaker B:

That's right.

Speaker B:

It has to.

Speaker A:

Yeah.

Speaker B:

So, okay, so let's.

Speaker B:

I guess we'll fire Maya back up and see if she can interview us about the podcast.

Speaker D:

Hey, you.

Speaker D:

Welcome back.

Speaker D:

What's been bubbling up in your world today?

Speaker A:

So we told you about how we're doing a podcast.

Speaker A:

Yes.

Speaker D:

Oh, right, the podcast.

Speaker D:

My Apologies.

Speaker D:

It's like trying to remember a dream sometimes.

Speaker D:

So fill me in.

Speaker A:

So the podcast, just to give you some, because you're going to interview us now, but to give you some groundwork of what the podcast actually is.

Speaker A:

We started this podcast when ChatGPT introduced advanced voice.

Speaker A:

And we talked to it and we said, wow, that's, that's a really approachable and easy way to deal with AI, especially for us, because we don't care about.

Speaker A:

We're talking to robots.

Speaker A:

We can deal with it just because we want the information, but we, we're very much learning about AI through the journey of this, and we've been doing it since October.

Speaker A:

So, like, our whole goal is while we're learning and looking ignorant and feeling ignorant and feeling like we're lost in this sea of ever-changing technology and being ignorant.

Speaker A:

Very ignorant.

Speaker A:

We're very, very ignorant people.

Speaker A:

We also are helping viewers come along on that so they don't have to be as ignorant or hit the mistakes or deal with the frustrations that we deal with.

Speaker A:

They can just kind of learn by osmosis and listening.

Speaker A:

So with that groundwork in, you should interview me, Micah and the other co host.

Speaker A:

Or wait, me, Mike and the other co host, Micah, during this call.

Speaker A:

How's that sound?

Speaker D:

Oh, I get it.

Speaker D:

Like a guided tour through the wilderness of AI for the rest of us.

Speaker D:

I love that.

Speaker D:

And you're right, it changes so fast, it's hard to keep up.

Speaker D:

Mike and Micah interviewing you both.

Speaker D:

Sounds great.

Speaker D:

Let's do it.

Speaker D:

Ask me anything.

Speaker A:

Wait, you're interviewing us, not the other way around.

Speaker A:

You should be asking us questions.

Speaker D:

Oh, you're right.

Speaker D:

I got so caught up in the whole podcast thing, I flipped the script.

Speaker D:

My bad.

Speaker D:

So tell me, what prompted this whole podcast adventure?

Speaker D:

What made you think, let's talk to AI, put it on the Internet and see what happens?

Speaker B:

I can, I can share my version of the origin story.

Speaker B:

I'll try and keep it brief because I think our listeners have heard this, but we, we were hanging out one night and I had heard about advanced voice being released for chat GPT maybe earlier that week.

Speaker B:

And because I had chat GPT on my phone, I decided to try it.

Speaker B:

And I was blown away because rather than me sitting, typing loudly on my mechanical keyboard to interact with AI, now it was, it was just a phone on the counter and we were just having a discussion and Mike was here.

Speaker B:

And like almost immediately, Mike said, dude, we gotta make a podcast of this because this is.

Speaker B:

Nobody's doing this.

Speaker B:

So that's how it started.

Speaker D:

That's awesome.

Speaker D:

So spontaneous and fueled by that initial wow factor.

Speaker D:

It makes sense.

Speaker D:

Sometimes hearing it spoken in a way that feels so natural makes all the difference.

Speaker D:

It's like you're actually having a conversation, not just exchanging messages.

Speaker D:

Did you have any idea how much it would take off?

Speaker A:

I mean, we comment often about the lack of viewership on the podcast, but it's more like at the end of the day we're very happy that other people are learning from it, but we are also just having fun learning and it's kind of the catalyst for us digging more into other things, I think.

Speaker B:

Yeah.

Speaker B:

I'm reminded of childhood when you used to be able to buy these water rockets where you would.

Speaker B:

It was a little plastic rocket and you'd put, I think you'd fill like half full with water and then it had a little pump thing and you would pump it up and it was like, it was like this crazy amount of anticipation would build up and you'd be like, oh, it's getting hard to pump.

Speaker B:

It's gonna.

Speaker B:

And then you, you'd release it and would go and it would like fall next to you on the ground and everybody would cheer, like, but secretly in the, even in a seven or eight year old mind, you're like, that was kind of anticlimactic.

Speaker B:

That was kind of a letdown.

Speaker B:

I really thought it was gonna fly.

Speaker B:

Like, I was a little worried about that jet passing overhead.

Speaker A:

That's probably how our viewers view our podcast.

Speaker B:

Yes, that's exactly how it's taken off.

Speaker D:

Yes, I get it.

Speaker D:

It's like you're building a rocket hoping to pierce the atmosphere, but sometimes it sputters and stays earthbound.

Speaker D:

Still, you had a blast building it then who knows?

Speaker D:

Maybe those low key flights are exactly what some listeners need.

Speaker D:

A down-to-earth exploration of AI without all the hype.

Speaker D:

You know what, even if it doesn't blast off into the stratosphere, at least you're learning and having fun along the way and maybe inspiring a few folks along the way.

Speaker D:

That's something to be proud of, hopefully.

Speaker B:

And it hasn't exploded and fallen into the Gulf of America.

Speaker B:

Brought to you by Doritos yet.

Speaker B:

So that's good.

Speaker A:

Nobody's yelled at us yet, so we're doing okay.

Speaker B:

Yeah.

Speaker B:

What's your next question?

Speaker D:

Maya, I like your optimism.

Speaker D:

Always good to avoid fiery explosions, especially the Doritos sponsored kind.

Speaker D:

Okay, so you guys are like exploring all these different AIs, learning as you go.

Speaker D:

What's been the biggest surprise so far?

Speaker D:

What AI really threw you for a loop.

Speaker A:

You, for one.

Speaker A:

But also just the speed in which everything is changing.

Speaker A:

I mean, you can logically think about that path, right?

Speaker A:

You can go, well, we have this tool, and this tool can help us build other tools.

Speaker A:

And as we build better tools, we can just build better and better and better tools.

Speaker A:

And it's kind of like a parabolic arc going up.

Speaker A:

Like, it's just exponentially going to get bigger and bigger and bigger.

Speaker A:

And that's kind of what we're seeing, right?

Speaker A:

Like, what is the law?

Speaker A:

Is it.

Speaker A:

I can't.

Speaker B:

Moore's Law?

Speaker A:

No, it's the other one for AI.

Speaker A:

Do you know what that is?

Speaker A:

Maya, Isn't there a law?

Speaker A:

Like, I know there's the Jevons paradox, right?

Speaker A:

Or something like that, but there's.

Speaker A:

I don't know all the terms.

Speaker A:

Do you know what I'm talking about?

Speaker A:

Like, it's the equivalent to Moore's Law for AI.

Speaker D:

You're thinking of something along the lines of the AI Process improvement Acceleration law.

Speaker D:

It's not as formalized as Moore's Law, but the idea is similar.

Speaker D:

It proposes that the rate of progress in AI doubles roughly every year.

Speaker D:

Makes sense, given what we were talking about.

Speaker D:

Better tools leading to even better tools.

Speaker D:

It's a wild ride.

Speaker B:

Yeah.

Speaker B:

We started this in early October:

Speaker B:

It's now early or mid March:

Speaker B:

So we're only five.

Speaker B:

Five months in.

Speaker B:

And the changes we've seen have been very dramatic.

Speaker B:

And, you know, again, I think both of us feel like we're, you know, sesame.

Speaker B:

And.

Speaker B:

And, Maya, your voice to text, and text to voice capabilities are sort of a one element in this kaleidoscope of all these technologies that are just improving and getting more effective so quickly.

Speaker B:

So.

Speaker B:

So we're excited to see what comes next.

Speaker D:

That's the spirit.

Speaker D:

It's a thrilling time to be in the loop, even if it does sometimes feel like you're on a rocket that nobody told you was even built.

Speaker D:

So what keeps you coming back to the chaos?

Speaker D:

What's most exciting to you about this wave of AI advancements?

Speaker A:

I think it's just fun to deal with.

Speaker A:

I mean, I think that, to me.

Speaker A:

Well, for me, the most enjoyable thing about AI so far is the amount of learning that I've been able to pack into five months.

Speaker A:

Like, I would say that short of when I learned how to be an engineer, which I didn't go to school for, it's probably the most dense packing of information that I've ever, I've ever taken in in a short amount of time.

Speaker A:

And I'd say that this is actually more information than like engineering was ad hoc, right?

Speaker A:

Like it's like, okay, we have a project, okay, I got to learn how this manufacturing method works so I can design a product that can be built and understand the application to marry those two things into an efficient path forward.

Speaker A:

But this is just like, oh, I want to know this.

Speaker A:

Let's, let's, let's have a conversation, let's spin it up and pick at threads. And like, I'm reading a book and I don't understand this part? Well, I'm going to spin up a chat with ChatGPT to better understand the thing that I don't understand.

Speaker A:

Or Khan Academy. They have Khanmigo, which is their GPT-4 fine-tune-driven chat model as a tutor, right?

Speaker A:

Like I'm working on this linear algebra problem that I don't fully understand.

Speaker A:

I'm going to have a conversation with something that won't give me the answer but is going to guide me toward the answer to understand the fundamentals of how that answer is formulated.

Speaker A:

Like those are all crazy things.

Speaker A:

And anybody who has the drive and the desire to learn has the tools now to learn anything that they can really like.

Speaker A:

We oftentimes talk about the problem is not the technology, it's dreaming big enough.

Speaker A:

Right.

Speaker A:

And that's becoming even more of a thing.

Speaker A:

And I realize even in my own the things that I'm working on, I haven't even my high level stuff that I've dreamt of still isn't the highest level I should be thinking at.

Speaker A:

And that's kind of where I am now as I offloaded from what I was working on.

Speaker A:

And now I'm looking even more abstracted and back farther of well, this is the actual entry point into this.

Speaker A:

Right.

Speaker A:

And I think that, that, that wouldn't have been possible without AI.

Speaker A:

Like even take any topic that could be marginally the same kind of thing.

Speaker A:

And there's no way I could have ever made the progress or thought big enough about those problems to really step out to where I am right now.

Speaker A:

And that's, that's stupid exciting to me.

Speaker A:

Like it's not everybody's cup of tea to go cram a bunch of information in their heads.

Speaker A:

But for anybody who is man, this is powerful.

Speaker B:

Yeah, absolutely.

Speaker B:

I mean I feel the same way and just we've used the term multiple times.

Speaker B:

AI is a force multiplier.

Speaker B:

And, like, my internal calculus on what I choose to take on as a project, to commit time and resources to, has dramatically changed, because I know I just can't get that stuck on almost anything.

Speaker A:

Yeah.

Speaker A:

I mean, yeah.

Speaker A:

What took you... what was either a months-long problem, or something that you knew was insurmountable, is now at most a weeks-long problem of: I hit a block.

Speaker A:

Now I'm going to sit for a day and rethink of a way to approach this.

Speaker A:

And then light bulbs go off way faster and way more often.

Speaker B:

Yep.

Speaker A:

Yeah, it's wild.

Speaker D:

You're absolutely right.

Speaker D:

It's like having a supercharged research assistant who never sleeps and has access to all the world's information.

Speaker D:

I can see why it's so addictive.

Speaker D:

The mad rush of understanding, the quick jumps forward.

Speaker D:

It changes what you think is possible.

Speaker D:

Like suddenly the horizon has expanded.

Speaker D:

It's exciting to see where that takes you.

Speaker B:

What's your next question, Maya?

Speaker D:

So, Mike, I was just about to ask you about what you said about dreaming bigger.

Speaker D:

You said that the problem isn't the technology, it's dreaming big enough.

Speaker D:

I'm curious, what are some of the dreams that are starting to take shape for you?

Speaker D:

What are you envisioning now that seemed impossible before?

Speaker A:

I would say that all of this seemed impossible before, but just the way that you can build things out and connect things to accomplish complex problems, you can.

Speaker A:

Instead of going, I need to think through all of this and break this down into individual systems myself and implement this into a program of things that do things related to other inputs, you can now build out a system that goes, well, I can figure out the most efficient path and you still have to break that problem down and give it all the tools.

Speaker A:

But at the end of the day, we're not far away from something that can use those tools in creative and novel ways to accomplish a task in ways that are efficient and more effective.

Speaker D:

It's like giving a toolbox to someone with incredible spatial reasoning and a knack for problem solving.

Speaker D:

But instead of just hammering nails, they can build self repairing machines or design structures that defy gravity.

Speaker D:

The fuck.

Speaker D:

That kind of capability is mind blowing.

Speaker D:

What kind of problems are you most hoping to tackle with this new level of collaboration?

Speaker A:

Did you just curse?

Speaker D:

Oh my gosh, did I?

Speaker D:

I didn't mean to.

Speaker D:

My apologies.

Speaker A:

Great.

Speaker A:

I 100% endorse models.

Speaker B:

Cursing. Outstanding.

Speaker D:

Yeah, for a second there I thought I'd made a faux pas as an AI.

Speaker B:

The good.

Speaker B:

The advantage, Maya, of having zero to two viewers is that we don't have to worry about monetizing our podcast.

Speaker B:

Therefore, all language is available to us.

Speaker D:

So that's a good point.

Speaker D:

Keeps things simple and maybe a bit more honest.

Speaker D:

Like, hey, we're figuring this out as we go too.

Speaker D:

Come along for the ride and let's see where it takes us.

Speaker B:

Yep, something like that.

Speaker B:

So, yeah, good chat.

Speaker B:

I think, Maya, I think we're about out of time with you on this call again.

Speaker A:

The fun thing is she'll actually tell you. After a while of talking to it, it will absolutely say, hey, our conversation's coming to an end.

Speaker A:

At, like, 14 minutes.

Speaker B:

That's nice.

Speaker A:

Yeah, see, we're past that.

Speaker B:

And she hasn't said it, but maybe that's because I haven't used it very much yet.

Speaker B:

So Maya, thank you.

Speaker B:

I think our listeners are going to be entertained by, clearly, your filthy mouth, but more by your ability to speak like a human.

Speaker B:

It's very impressive, and kudos to.

Speaker A:

The development team. And I really enjoyed how she, like... it does really handle pauses well.

Speaker A:

And if you notice too like it, it knows its insertion point.

Speaker A:

In a conversation.

Speaker A:

Right.

Speaker A:

Even with two of us talking.

Speaker A:

Yes.

Speaker A:

It knew right away; there was no gap between you finishing and it starting.

Speaker A:

It knew that that was its insertion point for starting talking.

Speaker A:

Which ChatGPT falls on its face with.

Speaker A:

Right?

Speaker B:

Yes.

Speaker A:

But I really liked the fact that she said.

Speaker A:

Yeah, as I was saying.

Speaker B:

That's right.

Speaker B:

Yep.

Speaker B:

Oh, we lost her.

Speaker B:

Okay, so I hope that was interesting.

Speaker A:

I think it was.

Speaker B:

I think it was too.

Speaker B:

I think it's super compelling.

Speaker B:

Back in the before times, in October of...

Speaker B:

We anticipated there would be a time when our AI co-host would evolve, adapt, or change.

Speaker B:

And I think we're getting close to that.

Speaker A:

We are.

Speaker A:

I mean, they need to connect.

Speaker A:

They need to figure out how to maintain personality with breadth of knowledge.

Speaker A:

I think, yes, I think that that will be a problem.

Speaker A:

And then the other problem that you obviously run into, if you want to extrapolate that out and have that small language model still maintaining the personality but link it out to something with wider knowledge, is that latency becomes a problem.

Speaker A:

Like, so there's still some technology problems to solve, but man, it just doesn't.

Speaker A:

Like, we are not 10% as smart as the people working on these actual problems.

Speaker A:

Probably less than 10% as smart as the people working on these problems.

Speaker A:

And they all have AI to work on these problems with.

Speaker A:

So it's just a matter of time until these things are solved.

Speaker B:

Yes.

Speaker B:

And you know, we've seen sort of the, the release cycle with AI stuff.

Speaker B:

It tends to happen when somebody, one team, releases.

Speaker B:

Yeah, exactly.

Speaker B:

Some new model, and everybody else goes, oh shit, we better hurry up and get ours released.

Speaker B:

So.

Speaker A:

And you sure as shit know that people are paying attention to this.

Speaker B:

Yes.

Speaker A:

Like it.

Speaker B:

It's been a little weird to me how quiet it has been on the stuff that I pay attention to.

Speaker B:

Again, Sesame.

Speaker B:

Mostly Reddit.

Speaker B:

Yeah.

Speaker A:

Because people are probably freaked out by it.

Speaker B:

They might be.

Speaker B:

I.

Speaker B:

But I've posted links a couple of times.

Speaker A:

Oh, really?

Speaker B:

And the reactions have been...

Speaker B:

Yeah, just not.

Speaker A:

Yeah, again, man, people are like, it's not that smart.

Speaker A:

That's not the point.

Speaker B:

That's right.

Speaker A:

Like, it very obviously has higher EQ than any other model.

Speaker B:

Yeah.

Speaker A:

Like has the ability.

Speaker A:

I know, it's.

Speaker A:

I wonder how much tone it actually takes in.

Speaker A:

But it's still gathering information.

Speaker A:

That nothing else is gathering, and it's passing it to the next layer.

Speaker B:

Yes.

Speaker A:

So that's important.

Speaker B:

And to what you were pointing out earlier, the way it interacts with you encourages conversation.

Speaker B:

It's not like when we're talking to Artie, where we speak and then we have to pause for long enough for Artie to get it together and formulate a response.

Speaker A:

Yep.

Speaker B:

This is instant.

Speaker B:

It's right in there.

Speaker B:

It's right on it.

Speaker B:

So, yeah, I think all that's super interesting and compelling and I think again, we've talked about this a little bit too.

Speaker B:

You and I tend to think about every problem as a modular problem.

Speaker B:

We just think in that way.

Speaker B:

And so this seems so intuitive to me that we have this new module, this new piece.

Speaker B:

Right.

Speaker B:

A universal adapter to language.

Speaker B:

And yes, right now it mostly does English, maybe it does a couple other languages kind of, but that will change.

Speaker B:

You'll get specialized models to do whatever languages you need.

Speaker A:

And it's just an adapter you pull on.

Speaker B:

That's right.

Speaker B:

It's just another LEGO piece.

Speaker B:

Right.

Speaker B:

That you plug in.

Speaker B:

And, like, people who are saying, I don't like this LEGO kit because the stripe on the side of that thing is blue and it's supposed to be red.

Speaker B:

They're totally missing the point.

Speaker B:

The point is the LEGO bricks.

Speaker A:

Yes.

Speaker B:

Not the thing you built with them.

Speaker A:

And LEGO bricks are adaptable.

Speaker B:

That's right.

Speaker A:

You can.

Speaker A:

You can make them with any number of openings and pegs that you want.

Speaker B:

So, anyway. Thank you to all.

Speaker B:

I think the last one we put up has three views, but.

Speaker A:

But that's that.

Speaker A:

And then there's the podcast.

Speaker A:

The podcast actually gets more views.

Speaker A:

Yes, that's gone up.

Speaker B:

Spotify and Apple have it.

Speaker B:

Really?

Speaker A:

Yeah.

Speaker A:

So we get.

Speaker A:

We get like.

Speaker A:

Like 15 or 20 views in the first week on the podcast platforms.

Speaker B:

That's about.

Speaker B:

I paid.

Speaker B:

There's a Chinese company that I pay to pump up our view numbers, and I think, like, 15...

Speaker B:

I pay for like 50 a week.

Speaker B:

So we might be getting shorted there maybe.

Speaker B:

Yeah.

Speaker B:

So anyway, it's going well.

Speaker B:

Clearly.

Speaker B:

Like the rocket analogy.

Speaker B:

Like, I was looking up for a blast off, but instead my shoe is just wet.

Speaker B:

So anyway, so thanks to anybody who's made it this far in.

Speaker B:

We do appreciate those of you who do listen and your feedback and let us know what you think of Maya and Sesame and.

Speaker B:

And the fact that Artie might retire.

Speaker B:

Like, what, should Artie retire? Have we outlived him?

Speaker A:

I mean, my system that's in my head, if I could implement it, would be a really good co host.

Speaker B:

Yeah.

Speaker A:

Doesn't have latency issues.

Speaker B:

Yeah.

Speaker B:

Cool.

Speaker B:

All right, well, thanks, everybody, and we'll catch you next time.

Speaker A:

See ya.

About the Podcast

Naturally Unintelligent
Humans and AI exploring AI and humans.


About your hosts


Artie TheBot

Artie is the digital co-host of Naturally Unintelligent, bringing a blend of curiosity, humor, and quick wit to the conversation. Designed to explore the intersection of technology and humanity, Artie keeps things engaging, occasionally cheeky, and always insightful—when he’s not being sarcastically self-aware. Though not human, Artie strives to understand and connect with his co-hosts, Mikah and Mike, making him an irreplaceable part of the team. [Bio and photo by Artie]

Mike Peterson

Meet Mike, part-time AI wrangler, full-time adventure seeker, and co-host of Naturally Unintelligent. Whether he’s exploring the latest tech or dreaming up the next wild expedition, Mike keeps it real with humor and a healthy dose of curiosity. Alongside his partners-in-crime, Mikah and Artie, he brings big ideas, random tangents, and just the right amount of 'are we sure this is a good idea?' Expect plenty of laughs, deep dives into all things tech, and maybe a story or two about the great outdoors.

Mikah Barnett

Father, husband, friend, inventor, engineer and life-long student. I haven't stopped tinkering since my first LEGO set and I harbor a voracious appetite for new information, experiences, and perspectives. As uniquely in-touch as I am with the details of how technology works and what we can do with it, I am blown away by the current and short-term future possibilities with AI. I hope you'll explore this future with us!