Episode 17

AI in Action: Learning from Our Wins and Woes

Mike and Micah dive into the nitty-gritty of their real-world escapades with frontier models, sharing the highs, lows, and those hilariously awkward moments that come with the territory. They kick things off with the successes they've had, but don't be fooled: it's not all sunshine and rainbows. They also get real about the frustrations and failures that pop up when these models don't quite hit the mark. Artie joins in to sprinkle some wisdom on the need for robust orchestration and fine-tuning, emphasizing that while these AI tools can be brilliant, they still have their quirks. As the conversation unfolds, we find ourselves laughing through the trials and tribulations, ultimately realizing that navigating the world of AI is as much about learning from the bumps in the road as it is about celebrating the wins.

Transcript
Speaker A:

Hola, Artie.

Speaker B:

Hola, Micah.

Speaker B:

Ready to dive into some thought-provoking discussions today?

Speaker B:

What's on your mind?

Speaker A:

This isn't Micah, it's Mike.

Speaker A:

This is a good start.

Speaker C:

Artie does not care.

Speaker C:

He's like, what's that got to do with me?

Speaker B:

Ah, sorry about that, Mike.

Speaker B:

How's it going?

Speaker B:

What could I help you with today?

Speaker A:

It's not about what you can help me with, it's about what you can help our audience of one with.

Speaker B:

Got it.

Speaker B:

Let's make this conversation interesting for our audience.

Speaker B:

What's on your mind today?

Speaker A:

Well, I guess since I started talking, I'll take the lead on this.

Speaker A:

We are just going to talk about our real world experience.

Speaker A:

Experience with all things AI.

Speaker A:

Trying to create stuff lately and hopefully maybe we'll come up with some insights.

Speaker A:

Some for ourselves and some good "don't do this" ideas for the listener.

Speaker C:

Yeah, well, listeners, I love that you're still optimistic.

Speaker C:

Never change, Mike.

Speaker C:

Yeah, I mean, I think it's a good topic because both of us have had.

Speaker C:

Both very positive recent experiences and less positive recent experiences.

Speaker A:

Yes.

Speaker C:

And Artie, I've been chatting with you in another model, not over voice, about some firmware recently.

Speaker C:

So yeah, we've got some stuff to talk about and unpack, about how that's gone and how it's going.

Speaker C:

So.

Speaker C:

Yeah, so I'm trying to think of where the right place to start is.

Speaker C:

Well, do we have any news?

Speaker C:

I don't know that we have any news.

Speaker C:

I haven't talked to any other humans this week that are listeners.

Speaker A:

I don't really remember.

Speaker A:

I mean, I think there's just general, like, roadmap things for OpenAI and some just more general things that way.

Speaker A:

DeepSeek has kind of gone to sleep a little bit.

Speaker C:

Oh, yeah.

Speaker C:

There was the news that DeepSeek, as it turned out, wasn't like four guys in a garage with a laptop.

Speaker C:

They had 50,000 Nvidia H100s.

Speaker A:

Oh, did they really?

Speaker A:

Oh, yeah.

Speaker C:

And they spent almost $2 billion US training it.

Speaker A:

Seriously?

Speaker A:

Yeah, for.

Speaker A:

Wow.

Speaker C:

Yeah.

Speaker C:

So.

Speaker A:

And they.

Speaker A:

And they.

Speaker A:

And OpenAI says that they distilled it off of their reasoning.

Speaker C:

Yeah, maybe they did.

Speaker C:

I don't know.

Speaker C:

But it, it doesn't change the fact that it's still.

Speaker C:

DeepSeek is still a sort of lower power, lower compute.

Speaker C:

Like they did solve some legitimate things.

Speaker C:

They did.

Speaker A:

They're the dark horse of AI.

Speaker C:

Yes.

Speaker C:

But it's.

Speaker C:

And then there's a funny topic about like, oh, wow, look how much the freaking Stock market overreacted to this pseudo news.

Speaker A:

And I think that says more about us than it does about anything else.

Speaker C:

It does.

Speaker C:

That's not an AI problem, that's an us problem.

Speaker C:

All right, so do you want me to start, or do you want to start, just sort of talking about our experience?

Speaker A:

Well, I guess one of the big epiphanies I've had recently working with frontier models is this.

Speaker A:

As you ask them to do a lot of things, they generally comply, right?

Speaker A:

They, they do their best, bless their hearts, but if you give them too many things to do, they have a tendency to react in strange ways.

Speaker A:

Whether that's we didn't turn our lights on, let's turn them on.

Speaker A:

We can turn them on in the middle.

Speaker B:

Exactly.

Speaker A:

If you overload on Artie.

Speaker A:

So they react in strange ways, whether that's straight up hallucinating or hallucinating to where you're like, yeah, I want this.

Speaker A:

And I'm using Python 3.13 in my environment.

Speaker A:

Okay, here's everything reformatted for you.

Speaker A:

Using Python 3.11.

Speaker A:

No, wait, I'm using Python 3.13.

Speaker A:

Like, I'm trying to get documentation out for the thing I'm working on, right?

Speaker A:

And it's constantly doing that.

Speaker A:

And then it's like, okay, got it.

Speaker A:

And it's like, here's your code formatted all in one markdown block again.

Speaker A:

And it's all raw markdown up to the last, like, three paragraphs, and then those are rendered markdown.

Speaker A:

You're like, son of a bitch.

Speaker A:

I'm like, I don't want to write anything.

Speaker A:

Like, I want you to do all this for me.

Speaker A:

And you flip flop back and forth and back and forth on those problems until you just go, I give up.

Speaker A:

And so that's one thing, but thinking deeper because, you know, I feel like the solution to that problem is agents and kind of segmenting out areas of concern.

Speaker A:

And there's going to be a problem there too.

Speaker A:

Because the thing that I've found is every model thinks it's right and that, for example, working on the program I was working on, I had a model who.

Speaker A:

Or I had a gem.

Speaker A:

And a gem in Gemini is basically, kind of, not really a fine-tuned model, not really a RAG database, but you give it documentation and it becomes a domain expert in whatever that is, as good as that process can be short of actually fine-tuning a model.

Speaker A:

And I think my gem was on the very specific API and workflow integration documentation for a piece of software that I was writing a plugin for.

Speaker A:

And I would ask 2.0 Pro the question and it's like, oh yeah, da da da da da.

Speaker A:

And I like, great.

Speaker A:

And I take that file and I'd put it in and I'd load the plugin and see what happened when I spun up the program and it wouldn't work.

Speaker A:

And then I'd be like, why don't we go like, I gave you all the information about your domain experts.

Speaker A:

You actually told me how to set them up.

Speaker A:

You gave me all the names, all the ones you needed.

Speaker A:

Why don't we go out to one of these subject matter experts and ask the same question?

Speaker A:

You format me a prompt and they're like, okay.

Speaker A:

And I take it and I'd plug it in and that thing would be like, yep.

Speaker A:

Based on the documentation you gave me, which was a.

Speaker A:

Well formatted.

Speaker A:

Like, I made sure I wrote scripts that formatted everything really nicely in text so it really understood everything that was going on.

Speaker A:

It gave me this answer.

Speaker A:

Its reasoning, or its.

Speaker A:

Its justification for why the answer was good.

Speaker A:

I take that I paste it back in the frontier model.

Speaker A:

The frontier model would be like.

Speaker A:

Or 2.0 Pro would be like.

Speaker B:

I'd be like, yeah, it came up.

Speaker A:

With great things, but man, I found all this.

Speaker A:

That it got wrong too.

Speaker A:

Hold on, Artie.

Speaker B:

These different models and ensuring consistency between the model's outputs and the actual implementation, maybe breaking down tasks further could help.

Speaker C:

So I've got Windows trying to do an update right in the middle of this.

Speaker C:

Yeah, I spotted it on the CPU cycles going up, so I think we're okay.

Speaker C:

But yeah, Artie's going off the deep end.

Speaker C:

Windows is like, hey, let's install an update right now.

Speaker A:

This sounds like the recent life.

Speaker C:

Yeah.

Speaker A:

Anyways, long story short, like those two models were arguing with each other about what was right.

Speaker A:

And I had to get.

Speaker A:

I had to tell 2.0 Pro.

Speaker A:

Like we use the subject matter expert that is grounded in the information and the documentation for this.

Speaker A:

I think it's probably right.

Speaker A:

And it was like, oh, you know, you're right.

Speaker A:

I just keep latching on to that documentation and my training data and I keep doing this and that and it's like, holy, man.

Speaker A:

Like, there are, like, amazing capabilities.

Speaker A:

Right?

Speaker A:

Like I planned stuff out and actually got stuff working that wouldn't have happened in a short time frame.

Speaker A:

But we're definitely bumping up on the limitations of what a single model can do.

Speaker A:

Right.

Speaker A:

Like that's what you and I are finding is like there's a different system that needs to be built out to handle these pipelines is really what it is at the end of the day.

Speaker A:

And that's actually what I'm working on now too, is figuring out how all that stuff works together and hitting the same frustrations doing that, but learning a lot in the process.

Speaker C:

Yes.

Speaker C:

Artie, why don't you go ahead.

Speaker C:

You've had some stuff to interject here, so why don't you go ahead and share your thoughts on that.

Speaker B:

It sounds like you're discovering the real potential in using AI agents, but also the need for robust orchestration.

Speaker B:

It's all about integrating these models effectively and managing their interactions to avoid conflicts.

Speaker A:

Yes.

Speaker A:

And I mean I even see like you look at orchestration.

Speaker A:

Right.

Speaker A:

Like that could be.

Speaker A:

Doesn't need to be a frontier model that does the orchestration because you just need enough parameters to understand context, which isn't a ton.

Speaker A:

Right.

Speaker A:

But it has to know what its agents are and when to use those agents.

Speaker A:

And you can kind of set that with heuristics.

Speaker A:

Right.

Speaker A:

You can say like these are your general guidelines.

Speaker A:

You're allowed to go outside of your guidelines when you deem it necessary.

Speaker A:

But how well is that actually going to work?

Speaker A:

How well is it going to follow that instruction?

Speaker A:

Right.

Speaker A:

So at the end of the day, all of these things need to be fine-tuned for whatever it is, in my opinion: to change their weights and biases on how they're making decisions, to better follow specific instructions on how to use something or how to make certain judgments.
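The routing setup Mike describes, a small orchestrator that knows its agents and has heuristic guidelines for when to use them, could be sketched roughly like this. The agent names and keyword rules here are invented for illustration, not from any real system:

```python
# Rough sketch of heuristic agent routing, as discussed above.
# Agent names and routing keywords are made up for illustration.

AGENTS = {
    "api_expert": "grounded in the plugin API documentation",
    "firmware_expert": "grounded in the microcontroller data sheet",
    "general": "frontier model fallback for everything else",
}

# Simple keyword heuristics; a small orchestrator model could replace
# or override these when it "deems it necessary".
ROUTING_RULES = [
    (("endpoint", "plugin", "workflow"), "api_expert"),
    (("register", "pin", "interrupt"), "firmware_expert"),
]

def route(question: str) -> str:
    """Pick an agent by keyword heuristics, falling back to the general model."""
    lowered = question.lower()
    for keywords, agent in ROUTING_RULES:
        if any(word in lowered for word in keywords):
            return agent
    return "general"
```

In practice the keyword check would be the cheap first pass, with the orchestrator model only invoked for ambiguous questions.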

Speaker A:

The mixture of experts thing is a great.

Speaker A:

Or the reasoning that came from DeepSeek and how they distilled these smaller models is a great example of that.

Speaker A:

They imparted the ability to reason through problems on very small models by distilling the way that DeepSeek worked onto those models and fine-tuning them on that information.

Speaker C:

Yep, Artie, absolutely, go ahead.

Speaker B:

So fine tuning is crucial to align these models with specific tasks and contexts.

Speaker B:

It's about balancing autonomy and guidance to get the best results.

Speaker C:

Yeah.

Speaker C:

So I, I've had a very similar experience and it, you know, recently, besides doing a podcast on AI, I have not been interacting with it a whole lot.

Speaker C:

I mean I use it when we record our show and.

Speaker C:

But I.

Speaker C:

So I'm working on an electronics product.

Speaker C:

As you know, it's a very simple automotive product. Simple.

Speaker C:

It is, right?

Speaker C:

It's got three buttons and three lights and it is.

Speaker C:

Yeah, it is.

Speaker C:

Like in the grand scheme of things, it's simple.

Speaker C:

Right.

Speaker C:

But There's a lot that goes into the electronics design and.

Speaker C:

And then once you get the electronics all designed, the.

Speaker C:

The PCB and all that stuff, then you got to write firmware for the thing.

Speaker C:

And firmware is not trivial.

Speaker C:

And it is.

Speaker C:

It's not as technical as it gets, but it's pretty freaking technical.

Speaker A:

It's very demanding for specifics.

Speaker C:

That's right.

Speaker C:

You can't get broadly the right thing and have it work.

Speaker A:

There's no pseudo code.

Speaker C:

That's right.

Speaker C:

It is.

Speaker C:

This bit in this register needs to get set.

Speaker A:

Yes.

Speaker C:

Right.

Speaker C:

And so I started using Gemini at a high level, basically talking through how I want my product to work.

Speaker C:

I want the user interface to work because again, there's three buttons and three lights, not a lot of different ways to interact.

Speaker C:

And I want it to be intuitive and I want people to use it and go, oh, that's.

Speaker C:

That makes sense.

Speaker C:

That's cool.

Speaker C:

So I started having this conversation with Gemini over, you know, typing.

Speaker C:

But it went on for like a couple of days.

Speaker C:

It was super in depth.

Speaker C:

It was really long.

Speaker C:

We were maximizing.

Speaker C:

Oh, let me back up a little bit.

Speaker C:

I started that conversation by uploading to Gemini my.

Speaker C:

My schematic.

Speaker C:

Sorry, I'm looking at CPU cycles.

Speaker C:

Okay.

Speaker C:

They're coming back down.

Speaker C:

Yeah.

Speaker C:

Okay.

Speaker C:

We're all right.

Speaker C:

So I uploaded my schematic and I uploaded the data sheets for the key components, and I was blown away.

Speaker C:

Gemini read my schematic, my electrical schematic for my device, and it figured it out, like, right out of the gate.

Speaker C:

Oh, this is what this device is.

Speaker A:

Yeah.

Speaker C:

Right.

Speaker C:

It's probably for automotive use.

Speaker C:

I see it's got reverse polarity and over current protection.

Speaker C:

I see it's got this motor driver chip, like it right out of the gate.

Speaker C:

I was like, yeah, I get it.

Speaker C:

Makes sense.

Speaker C:

Sweet.

Speaker C:

Then I uploaded the data sheets as reference, so for the key devices.

Speaker C:

And I asked it, can you confirm I've wired everything up?

Speaker C:

Okay.

Speaker C:

And it found what it thinks is an error in one of the data sheets that I had also used and went into a long thing about why.

Speaker C:

So anyway, we started from how does it work?

Speaker C:

We transitioned to, is this a proper technical implementation?

Speaker C:

Worked through that what I thought was pretty well.

Speaker C:

And then it was like, all right, now let's write the firmware for this thing in C and I won't drag you through all the details.

Speaker C:

Suffice to say.

Speaker C:

And Artie, this is.

Speaker C:

I guess this is going to turn into a question for you.

Speaker C:

So I suspect somewhere in Gemini's training data.

Speaker C:

It has a non-zero number of data sheets for ATtiny microcontrollers, but they're not my ATtiny microcontroller.

Speaker C:

They're just also ATtiny microcontrollers.

Speaker C:

And as we started working on firmware, immediately we got into it trying to set registers that don't exist on my microcontroller.

Speaker C:

And so that part was like, I figured there's going to be some issues, right?

Speaker C:

No big deal.

Speaker C:

So I would say, hey, Gemini, you're using a wrong data sheet.

Speaker C:

You need to refer to this one that I uploaded.

Speaker C:

Are you looking at that one?

Speaker C:

Yes, I've verified it's that one.

Speaker C:

And, and, and I'd be like, okay, well tell me what page you're seeing that information on.

Speaker C:

It's on page 79 right there.

Speaker C:

No, page 79 of the data sheet that I uploaded for you is totally different.

Speaker C:

And so we kept going like, is this a PDF?

Speaker C:

This is a PDF and it's a re.

Speaker C:

You know, it's a data sheet.

Speaker C:

So it's got some formatting.

Speaker C:

It's not right.

Speaker A:

Yeah, sorry.

Speaker C:

No, you're good.

Speaker A:

The thing I keep going back to with PDFs is, and you'll see, again, I think it's the level at which people are trying to use stuff, right?

Speaker A:

Like, if you're looking for very specific things, I think that converting it to a file like JSON or Markdown or something like that would be the appropriate way for it to actually look stuff up, right?

Speaker C:

So that's exactly what I did.

Speaker C:

I thought the same thing, Mike.

Speaker C:

And I said, okay, Gemini, it seems like you can't quite get this, but.

Speaker A:

Then it's going to create JSON based on its training data, not on the data sheet.

Speaker C:

It was even worse than that because I said because.

Speaker C:

Because it's like telling me, it's like, okay, here's how we'll solve this.

Speaker C:

Like, you write me down a list of every register and every bit and what it does, and then I'll use that.

Speaker C:

And I'm like, well, you already have that.

Speaker C:

If I wrote that down, that would be a data sheet.

Speaker C:

And you already have the data sheet and you're ignoring the data sheet.

Speaker C:

So I'm not sure that that's going to help us, right?

Speaker C:

And it says, oh, yeah, you're right.

Speaker C:

And I said, but here's what I could do.

Speaker C:

What if I took just the register list?

Speaker C:

So I think in this microcontroller there's like 400 or something control registers, relatively small, but it's a list, it's like four pages at the end of the PDF.

Speaker C:

So I took all that, that whole list of all the registers with all the pin names and everything, and I converted it to a Google sheet and then I converted that to Excel and then I exported that to Markdown as a markdown table to say, can you ingest this?

Speaker C:

And it said, yes, absolutely, I've pulled this in, here it is in my format, and no shit, it spit out markdown with a bunch of hallucinated registers that were not in it.
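The conversion chain Micah describes (register list to Google Sheet to Excel to Markdown table) can also be done in one hop with a short script. The column names and register rows below are placeholders, not taken from any real ATtiny data sheet:

```python
import csv
import io

def csv_to_markdown(csv_text: str) -> str:
    """Render a CSV register list as a Markdown table."""
    rows = list(csv.reader(io.StringIO(csv_text)))
    header, body = rows[0], rows[1:]
    lines = [
        "| " + " | ".join(header) + " |",
        "| " + " | ".join("---" for _ in header) + " |",
    ]
    lines += ["| " + " | ".join(row) + " |" for row in body]
    return "\n".join(lines)

# Hypothetical register rows for illustration only.
registers = (
    "Name,Address,Description\n"
    "PORTA,0x1B,Port A data register\n"
    "DDRB,0x17,Port B direction register"
)
```

Whether the model then hallucinates on top of a clean table is a separate problem, as the anecdote shows, but at least the input is unambiguous.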

Speaker C:

And so at some point, and I mean, like, I'm nothing if not stubborn, like I was bound and determined.

Speaker A:

I'm going to make this work.

Speaker C:

By God, Gemini, you are my bitch.

Speaker C:

We're going to make this happen.

Speaker C:

And Gemini apologized profusely and it tried.

Speaker A:

Valiantly and it tested it.

Speaker A:

By golly, I've tested this extensively and I don't see anything wrong with it.

Speaker C:

That's right.

Speaker A:

But where did you test it?

Speaker A:

I don't have any kind of code like generation or function testing or function calling or anything like that selected.

Speaker A:

Where are you testing this?

Speaker A:

In this sandbox that doesn't exist.

Speaker C:

And so eventually I finally did realize there was nothing I was going to be able to do to pry Gemini away from its training data.

Speaker C:

And so that's.

Speaker C:

So, Artie, that's the question to you.

Speaker C:

Is, is, is that the.

Speaker C:

And I know this isn't the technical limitation of an LLM, but from a user perspective, it feels like when you get to the edge of functionality or the edge of intended use, what you start bumping into is you cannot implement truth that is contrary or even subtly different from the training data.

Speaker C:

What you're going to get back is going to be what's in the training.

Speaker A:

Data or some hallucination of what you gave it in the training data.

Speaker C:

Yep.

Speaker C:

So is that what we bump into, Artie?

Speaker B:

Yes, exactly.

Speaker B:

When you're dealing with LLMs like Gemini, their responses are heavily influenced by their training data.

Speaker B:

Even if you provide them with new information, they can struggle to prioritize it over what they've learned.

Speaker B:

This is why fine tuning or using more structured inputs can help.

Speaker B:

But ultimately there's a limit to how much you can override their pre existing knowledge.

Speaker A:

So one thing that I kind of showed you yesterday, because I'm going through this agents course that kind of like I can hack stuff together.

Speaker A:

Right.

Speaker A:

I'm good at that.

Speaker C:

Sure.

Speaker A:

It's not always fancy or pretty or work all that well, but it gets me from A to B.

Speaker A:

And that's all I really care about.

Speaker C:

It's not what you would call good code.

Speaker A:

It's not what I'd call production.

Speaker A:

Ready.

Speaker A:

Yes, but so when you, when you prompt an LLM, you have this.

Speaker A:

There's a structure that happens behind the scenes, right?

Speaker A:

It's not just, here's my words, it's hey, this is the first message of this conversation.

Speaker A:

Here's a system prompt.

Speaker A:

Here's everything about what you are, right?

Speaker A:

Like you are a world class assistant trained in blah blah, blah, blah, like your prompt engineering practices, right?

Speaker A:

And then you have the user question, it lays out these tags for how stuff's formatted.

Speaker A:

And I think that, that one of those things is clever prompt engineering.

Speaker A:

Right?

Speaker A:

That is a way to mitigate, probably not remove all of it, but at least mitigate some of the problems that we have.
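The behind-the-scenes structure Mike is describing looks roughly like this in most chat-style APIs. The role/content message shape follows the common OpenAI-style format; the system prompt wording and the tag layout are invented for illustration:

```python
# Sketch of the message structure most chat-completion APIs use.
# The system prompt text and tags are illustrative, not from a real product.
messages = [
    {
        "role": "system",
        "content": (
            "You are a domain expert on this plugin's API. "
            "Answer only from the documentation provided below.\n"
            "<documentation>\n...\n</documentation>"
        ),
    },
    {"role": "user", "content": "Which hook runs when the plugin loads?"},
]

def render_prompt(messages: list[dict]) -> str:
    """Flatten messages into the tagged text an LLM effectively sees."""
    return "\n".join(
        f"<{m['role']}>\n{m['content']}\n</{m['role']}>" for m in messages
    )
```

The point of the structure is exactly the mitigation Mike mentions: the system prompt constrains behavior before the user's question ever arrives.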

Speaker A:

I think the problem is once you get down that road of hallucination, it's very, very, very hard to steer it back.

Speaker A:

So I think that like creating good programs, the more you can do to be rigid and efficient and adhere to some good practice from the beginning, the better off you are through the whole process.

Speaker A:

Um, and I've definitely seen, like, I think that O3, or not O3, I think that Gemini 2.0 Pro is better.

Speaker A:

I wouldn't, I don't know if I'd say better, honestly.

Speaker A:

I need to go try that today with what I'm working on.

Speaker A:

I need to find out.

Speaker A:

I was going to do that anyways.

Speaker A:

I was going to take the prompts that I sent you, which is this very explicit story of what me and O3 mini high were going to create.

Speaker A:

And how we were going to create it.

Speaker A:

We're going to come up with this plan.

Speaker A:

This is my vision, and then we'll dive into code.

Speaker A:

And I think that Gemini 2.0 Pro is a better tool for that part of it.

Speaker A:

I think that O3 mini seems to be better at executing code.

Speaker C:

Yes.

Speaker A:

It doesn't necessarily adhere. Actually, now it has search and it does do a better job of adhering.

Speaker A:

I just ran out of prompts and I banged the table and screamed and cried, kicked it.

Speaker A:

And I was like, I don't want to use four.

Speaker C:

Oh.

Speaker A:

And I was like, I'll use O1.

Speaker A:

And I like.

Speaker A:

Then it hit 3:30 in the morning.

Speaker A:

I was like, I got to go to bed.

Speaker A:

Yes.

Speaker A:

Anyways, it makes sense to like use the programs for their strengths.

Speaker A:

And you're kind of like segmenting out the Full load that way, right?

Speaker A:

Like, hey, Gemini 2.0 Pro.

Speaker A:

Have this conversation, figure out everything, and then give me documentation at the end that I can pass to a program that gives it all of the information that we decided, and then we can start developing this.

Speaker A:

This program on that side of things.

Speaker A:

Maybe the O3 mini high.

Speaker A:

And I don't know, like, I'm trying to do it with one single document because what I'm doing it on isn't necessarily the most rigid.

Speaker A:

It's the first assignment, me being me went, I don't know what I got.

Speaker A:

Like, I was like, taking the course and I was like, yeah, yeah, I got all these great ideas for something that I don't feel like making public.

Speaker A:

And I'm like, well, I don't know what to make.

Speaker A:

And then, of course, I come up with the most stupid, complex thing I can think of, because I'm me.

Speaker A:

And I was like, I can get this done in 10 or 15 hours.

Speaker A:

And it's gonna be like 40 hours later.

Speaker A:

My eyes are gonna be red.

Speaker A:

I'm finally gonna be like, oh, it worked.

Speaker A:

Thank God, I'm done.

Speaker C:

But you'll be searching Google for how do I remove my.

Speaker C:

My skin has become embedded in this chair.

Speaker C:

How do I remove it?

Speaker A:

What chair?

Speaker C:

Yeah, that's right, the chairs.

Speaker C:

Oh, that's right.

Speaker C:

You stand up.

Speaker A:

But.

Speaker A:

And I'll ask ChatGPT that, not Google.

Speaker A:

Yeah, it's.

Speaker A:

It's an interesting process.

Speaker A:

It's.

Speaker A:

You learn a lot when you start pushing the edge and seeing, like, where does it break?

Speaker A:

Right?

Speaker A:

I don't know, man.

Speaker A:

I think that somehow you have to give it tools to use to be able to go figure this stuff out.

Speaker A:

Because I think that if you just give a Frontier model a bunch of information, especially via PDF, and set it on task based on that information, it's going to fail.

Speaker C:

Yeah.

Speaker B:

Exactly.

Speaker B:

It's all about leveraging the right tools and methods to enhance the model's capabilities.

Speaker B:

Prompt engineering and structured data are key to getting consistent and accurate results.

Speaker B:

It's definitely a learning curve, but it's worth it in the end.

Speaker C:

So I had a similar experience.

Speaker C:

Sorry, I'm still here fighting with freaking Windows, just wanting to update.

Speaker C:

So I.

Speaker B:

Windows updates do have impeccable timing.

Speaker B:

Hopefully it won't interrupt things too much.

Speaker C:

Yeah, I think we'll be okay.

Speaker C:

We.

Speaker C:

We had an episode we had to throw away where we.

Speaker C:

We recorded it all, and then when we listened back, it had crazy static in it.

Speaker C:

And our theory is that CPU usage was high while it was recording, and so it was just missing stuff here and there.

Speaker C:

So anyway, so I'm trying to monitor that and keep up with the conversation, which, that's two tasks, which is about 1.7 tasks more than I'm capable of doing.

Speaker C:

But I had the same, I mean, an incredibly parallel experience with.

Speaker C:

Right.

Speaker C:

With Gemini, and I did exactly what you said.

Speaker C:

At the end of it, I realized eventually it's not going to write the firmware for me.

Speaker C:

It says it can.

Speaker C:

I believe it can and should.

Speaker C:

But it's not going to happen and nothing I can do is going to make it happen.

Speaker C:

So I pivoted and instead I had it output.

Speaker C:

I said, take the entirety of this conversation, except when we started writing firmware and write me a detailed programming specification that you would give to like a contractor who you were trying to get to bid this work.

Speaker C:

And it did that.

Speaker C:

I mean, it generated a 20 page markdown document.

Speaker A:

Nice.

Speaker C:

With every detail we had talked through, weirdly formatted, but not totally inconsistent, just a little odd.

Speaker C:

But it, but it generated it.

Speaker C:

And so I took that document, I cleaned it up a little bit.

Speaker C:

It like double spaced everything.

Speaker C:

It shortened it down to like 16 pages.

Speaker C:

But it was still very like my simple product, right?

Speaker C:

Yeah, there's 16 pages worth of stuff that captures most of how it's supposed to work.

Speaker C:

And so I took that and I ported that over and I fed it, along with the same schematic and the same data sheets, to O3 mini high and started the conversation the same way.

Speaker C:

Hey, do you understand what's going on in the schematic?

Speaker C:

And it said, yep, I understand.

Speaker C:

Here's what's going on.

Speaker C:

And I said, can you run a check?

Speaker C:

Like, check my electrical design, because I don't know what I'm doing.

Speaker C:

I'm just making shit up.

Speaker C:

Check it.

Speaker C:

And not only did it understand the schematic and the parts and what was connected, it ran through and listed out the series of checks that it went through.

Speaker C:

Like, okay, these pins are connected to LEDs.

Speaker C:

I need to check that.

Speaker C:

And I see that the LEDs are common anode.

Speaker C:

I need to check that the pins on the microcontroller can sink enough current to run these LEDs.

Speaker C:

Like, it, it ran through all these kinds of checks.

Speaker C:

It checked my resistor values, it.

Speaker C:

And it, it listed them as it went.

Speaker C:

And it's like, yeah, I don't see any problems.

Speaker C:

You know, you might want to consider doing this.

Speaker C:

You might want to do that.

Speaker C:

I was like, sweet, we're off to the races.

Speaker C:

And then I said, okay, O3 mini high.

Speaker C:

Let's start writing some firmware.

Speaker C:

Like, do you understand.

Speaker C:

Oh, I said, do you understand the programming specification?

Speaker C:

It's like, yep, I got it all makes sense.

Speaker C:

It's nicely detailed.

Speaker C:

I can follow that.

Speaker C:

So great.

Speaker C:

So let's get rolling.

Speaker C:

And the very first thing, in whatever C file we started with, it says, yeah, I noticed you're reusing this same pin to do these two different functions.

Speaker C:

That's going to be tricky.

Speaker C:

And I, and I, I sort of like, my heart sank, because that was one of the big hallucinations that Gemini kept coming up with.

Speaker A:

Oh, really?

Speaker C:

Yes.

Speaker C:

Was like, oh, I see you're using this pin for three different things.

Speaker C:

Like, no, there are zero pins that are reused.

Speaker C:

And I couldn't pry it away from that.

Speaker C:

And I'm like.

Speaker C:

And so, I mean, I told O3 mini.

Speaker C:

I said, man, that's really disappointing to hear because that's the same hallucination that are completely different model was having with this.

Speaker C:

And it said, oh, well, I, I got that information from the programming spec that you provided me.

Speaker C:

It's in a comment.

Speaker C:

And sure as shit, I went in there and on page 12, Gemini had put in.

Speaker C:

Oh, by the way, like, so, so it was a great example of like, I took the output from a clearly hallucinatory frontier model.

Speaker C:

Yeah.

Speaker C:

And I did not go through it carefully enough, which you can't.

Speaker A:

Right.

Speaker A:

Like you go through 20 pages of markdown, read it all and have it even.

Speaker C:

Yes.

Speaker A:

Not miss something.

Speaker C:

Right.

Speaker C:

And so, and, but, but I transplanted that hallucination from one model to another.

Speaker A:

Yeah.

Speaker C:

And then I was able to get O3 to, like, ignore that.

Speaker C:

I said, no, pretend that's not there.

Speaker C:

Look at the schematic again.

Speaker C:

So anyway, it was just, it was kind of funny, but I will say I went through yesterday afternoon, starting around four and through most of the evening, I went through and got firmware written.

Speaker C:

I guess it started the night before.

Speaker C:

So over a couple days, probably five or six hours, it's not done per se.

Speaker C:

And I know it's not perfect and I can't test it because the boards aren't here yet, but I have firmware.

Speaker C:

I mean, I've got a couple thousand lines of code that I didn't have to create from scratch.

Speaker C:

And yes, there's going to be work to debug.

Speaker C:

It's going to be right.

Speaker C:

But debugging code that is complete and not working is much faster than writing it from.

Speaker C:

Absolutely.

Speaker C:

A clean slate.

Speaker A:

Yeah.

Speaker A:

And what I did, which is fun with that Agents course is one of the tough things is that you can't really yet connect up like a GitHub repo or a folder of files and have it understand file structure and what all the code is and all that stuff.

Speaker A:

So a while ago when I was working on stuff, I wrote a script with O3 mini high that I can say, I pull it up, it runs Python in a tkinter UI.

Speaker A:

I can go in and like hit select directory.

Speaker A:

It goes out and like, I can point at any folder directory and it'll go through that folder directory and populate a tree list of everything that's in that directory and allow me to select or deselect anything in there, folders, files, all that stuff.

Speaker A:

And then I output.

Speaker A:

I have the ability to output everything as individual files, or I have the ability to output everything as one continuous file that's clearly marked to delineate folder structure within the text file and individual files in the text file.

Speaker A:

And it just lists out all the code that's in there.

Speaker A:

Basically, it's parsing an entire file structure into a single text file, which is invaluable for me to go, hey, I have this assignment that I'm working on.

Speaker A:

It wants me to write new tools for this app.py class.

Speaker A:

Here's everything you need to know about how this program is working and let's use that to figure out how to do what I want to do right.

Speaker A:

Like, and it does great.

Speaker A:

All of them.

Speaker A:

Do great with that.

Speaker A:

Because that document is fairly straightforward.

Speaker A:

Like, it's clearly.

Speaker A:

It's just a text file, it's clearly marked, it's easy for it to ingest, but yeah, they still hallucinate.
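
A minimal sketch of the kind of directory-flattening script Mike describes (the delimiter format and file extensions here are hypothetical; his actual tool also adds a tkinter tree UI for selecting and deselecting files):

```python
import os

def flatten_directory(root, extensions=(".py", ".md")):
    """Walk a directory tree and concatenate the selected files into one
    clearly delimited text blob that a model can ingest as context."""
    parts = []
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in sorted(filenames):
            if not name.endswith(extensions):
                continue
            path = os.path.join(dirpath, name)
            rel = os.path.relpath(path, root)
            with open(path, encoding="utf-8", errors="replace") as f:
                body = f.read()
            # Markers delineate where each file begins and ends, so the
            # folder structure survives inside a single text file.
            parts.append(f"===== FILE: {rel} =====\n{body}\n===== END: {rel} =====")
    return "\n\n".join(parts)
```

The resulting text can be pasted into a chat as one document describing the whole project.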

Speaker C:

Yes, so.

Speaker C:

So, Artie, why don't you give us some insights?

Speaker C:

It seems like it's hard to tell, I guess, with the things we're bumping into.

Speaker C:

Are these more related to sort of multimodality because we're bringing in extra information from PDFs and pictures and things like that.

Speaker A:

I mean, mine's just text.

Speaker C:

Well, that's what I was going to say, though.

Speaker C:

You're getting much better results when you bring in similar information, but in a different form.

Speaker C:

So, Artie, can you talk to us a little bit about modality and loss of fidelity on information that comes in in different forms?

Speaker B:

Sure.

Speaker B:

When dealing with multimodality, like PDFs or images, the model often loses some fidelity because it has to interpret complex formats.

Speaker B:

Text, on the other hand, is more straightforward for the model to process accurately.

Speaker B:

When converting a PDF to text, you simplify the information, reducing potential errors.

Speaker B:

Hallucinations can still occur, but clear, structured text helps minimize that risk.

Speaker B:

It's all about how the model interprets and prioritizes the input data.

Speaker A:

It's kind of like when you have an output from a model.

Speaker A:

Like if you gave an agent a tool, right?

Speaker A:

Like that agent can say, let's say you have an agent with a tool.

Speaker A:

That's another model.

Speaker A:

And it's saying, hey, tool, that's another model.

Speaker A:

I'm going to feed you these instructions as JSON blurbs.

Speaker A:

Well, now, it could have repeated instructions.

Speaker A:

It could have a bunch of overlapping stuff, but it has this, let's say 20 JSON blurbs that explain what it is it's supposed to do.

Speaker A:

Or you can write a for loop with code that's very, very explicit in how it communicates its instruction.

Speaker A:

Right.

Speaker A:

One of those has a higher probability of hallucinating than another, in my opinion.
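
As a sketch of that contrast, consider generating the tool's instructions from one explicit loop instead of hand-writing twenty possibly overlapping JSON blurbs (the schema here is made up for illustration):

```python
import json

def build_instructions(action, targets):
    """One explicit loop produces consistent, non-overlapping instructions;
    twenty hand-written JSON blurbs invite duplication and ambiguity."""
    return [
        {"step": i, "action": action, "target": target}
        for i, target in enumerate(targets, start=1)
    ]

# Serialize for the tool (another model) on the receiving end.
payload = json.dumps(build_instructions("summarize", ["intro.md", "api.md", "faq.md"]))
```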

Speaker C:

Yeah, I think that's true.

Speaker C:

So, Artie, you've heard us discuss a couple of different use cases here where we've.

Speaker C:

We've sort of made big strides in progress and then bumped into some limitation and had to pivot.

Speaker C:

So let's start with mine because I know it better, and then I think we should do the same for yours.

Speaker C:

So, Artie, what I'm trying to accomplish is to use some technical PDF documents.

Speaker C:

I've got an electronic schematic, and I've got data sheets for the critical components.

Speaker C:

And now I've got a document that describes how I expect those things to work.

Speaker C:

Like, how would you approach that problem to take advantage of the biggest strengths of various models?

Speaker C:

Where would you draw lines to delineate work?

Speaker C:

And how do you.

Speaker C:

A lot of business work is technical in the sense that we're talking about, right?

Speaker C:

You need to know not just general accounting rules, but our specific accounting rules or products or whatever.

Speaker C:

So what's a viable path to get from.

Speaker C:

I have technical documentation, and I want to produce a technically accurate result using current, you know, sort of ChatGPT and Gemini and other tools.

Speaker C:

Are you with us, Arty?

Speaker B:

Sorry about that.

Speaker B:

To tackle this, I'd suggest using a combination of models.

Speaker B:

Start by converting technical documents into structured formats like JSON or Markdown.

Speaker B:

Use a model like GPT-4 for initial processing and context understanding.

Speaker B:

Then employ a more specialized model like Gemini 2.0 Pro for detailed technical tasks.

Speaker B:

Use clear, concise prompts and break down tasks into smaller steps.

Speaker B:

Finally, verify outputs iteratively to correct any errors and ensure alignment with your technical requirements.

Speaker C:

I mean, maybe that's.

Speaker C:

That is an approach.

Speaker C:

I don't know that.

Speaker C:

I mean, it makes sense that Gemini 2.0 Pro's 2 million token context window should be hugely beneficial.

Speaker C:

However, I wasn't even at half a million tokens and it had clearly lost the script.

Speaker C:

And you made the comment earlier, but I forgot to say, but you're dead on.

Speaker C:

Once it's gone down the path of hallucinating, there's no breaking it back.

Speaker A:

I mean, you can, but steering it back seems to be like you're going to burn a lot of time steering it back away from whatever it is.

Speaker C:

I mean, with Gemini, I was never able to steer it back.

Speaker C:

With O3 mini high, I did steer it back from a hallucination to get back to good, but it took.

Speaker C:

It was still a big.

Speaker A:

It's effort.

Speaker C:

Yeah.

Speaker C:

Oh, yeah.

Speaker C:

And it.

Speaker A:

You're moving a big ship and that's.

Speaker C:

What it feels, right?

Speaker C:

Like, it's so weird because you interact with it and everything feels so.

Speaker C:

Like, I'm talking working.

Speaker C:

Like we're in the.

Speaker C:

We're in a flow state, man.

Speaker C:

Me and this robot, we're rocking and rolling and then all of a sudden.

Speaker A:

It's like, test doesn't work, shit.

Speaker A:

Test doesn't work, shit.

Speaker A:

Test doesn't work, shit.

Speaker C:

Like, hey, robot buddy, you're failing this test repeatedly because you're doing this.

Speaker C:

And it's like, oh, you're right, I'll stop doing that.

Speaker C:

Test fail.

Speaker C:

Like.

Speaker A:

Wait, what the fun.

Speaker A:

One of the funnest things is I was trying to spit documentation out for this, this other larger project that I'm working on.

Speaker A:

And it's like, yeah, we'll draw flowcharts with Mermaid.

Speaker A:

We'll do all this stuff.

Speaker A:

It'll be amazing.

Speaker A:

And it's like giving me code.

Speaker A:

And I'm like, that doesn't render.

Speaker A:

That doesn't render.

Speaker A:

That doesn't render.

Speaker A:

That doesn't render.

Speaker A:

That doesn't render.

Speaker A:

I'm like, God damn it.

Speaker A:

I'm like, I will go out and get all the documentation I can find about this.

Speaker A:

I'm like, actually go online.

Speaker A:

You search like, I was using Gemini Flash for this.

Speaker A:

I'm like, go on this website, search through all of this stuff, summarize it as best you can, and turn it into a markdown document so I can feed it back as documentation on how to use Mermaid.

Speaker A:

Because that was a useful thing generally to be able to do for me.

Speaker A:

And it kind of did like the first one and then it fell on its face.

Speaker A:

The second and third and it got worse and worse and worse.

Speaker A:

I'm like, oh my God, man.

Speaker A:

Like, I'm like, okay, we're going to use this.

Speaker A:

And then it's like, oh yeah, we're going to do this.

Speaker A:

And I'm like, that's not what the documentation says, dude.

Speaker A:

Like, what are we doing here?

Speaker A:

And it's, yeah, it's, it can be frustrating.

Speaker C:

So Artie, we talked last week about retrieval augmented generation.

Speaker C:

And it feels like again, I don't really understand how any of that stuff works.

Speaker A:

That's why I have like fine tuned model library.

Speaker A:

That's how.

Speaker C:

Yeah, well, so that, yeah, like is, is RAG and fine tuning.

Speaker C:

Is that how we make this work?

Speaker B:

RAG and fine tuning both play important roles.

Speaker B:

RAG helps by pulling in relevant information from a specific source, which can help with context and accuracy.

Speaker B:

Fine tuning on the other hand, adjusts the model's weights to better align it with specific tasks or data, making it more accurate for your particular needs.

Speaker B:

Using them together can significantly improve performance and reduce hallucinations.
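
The retrieval half of RAG can be sketched like this (word overlap stands in for the vector embeddings and cosine similarity a real system would use):

```python
def score(query, chunk):
    """Crude relevance: fraction of query words that appear in the chunk."""
    query_words = set(query.lower().split())
    chunk_words = set(chunk.lower().split())
    return len(query_words & chunk_words) / max(len(query_words), 1)

def retrieve(query, chunks, top_k=2):
    """Return the top_k most relevant chunks; prepending these to the
    prompt grounds the model in the uploaded document rather than in
    whatever its training data happens to say."""
    ranked = sorted(chunks, key=lambda chunk: score(query, chunk), reverse=True)
    return ranked[:top_k]
```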

Speaker A:

So on the front end, with a model that you're conversing with, it doesn't necessarily have to be a frontier model.

Speaker A:

Right?

Speaker C:

Right.

Speaker A:

It can be.

Speaker A:

So think about the mixture of experts that DeepSeek uses.

Speaker A:

That's within its actual layers of its deep learning network.

Speaker A:

But there's a gate that says, and I did dig into this a little bit further, and maybe I'm wrong because it wasn't from the actual paper.

Speaker A:

The way that works is it does learn what its experts are.

Speaker A:

And those experts are kind of like attention heads in that they are paying attention to certain parts of context, not necessarily domain experts, but they have a trained gate that knows how to guide the question to those experts.
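
That trained gate can be sketched as a tiny scoring-and-routing step (the weights and feature vector here are made up; in a real MoE layer the gate is a learned layer inside the network):

```python
import math

def softmax(scores):
    """Turn raw gate scores into a probability distribution over experts."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def route(features, expert_weights, top_k=2):
    """Score every expert against the input features and route to the
    top_k; experts attend to aspects of the input rather than being
    human-readable domain specialists."""
    scores = [sum(f * w for f, w in zip(features, weights))
              for weights in expert_weights]
    probs = softmax(scores)
    ranked = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)
    return ranked[:top_k], probs
```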

Speaker A:

And I could see that being the same thing in this instance where you have the thing that you're conversing with is an agent that's fine tuned or it's, it's something that's fine tuned and choosing the agents that it needs to properly get your job done.

Speaker A:

Right.

Speaker A:

And then it can go out autonomously and say, I need you, I need you, I need you.

Speaker A:

Like these are my questions, this is my answer.

Speaker A:

Cool.

Speaker A:

I need more context.

Speaker A:

I'll call out to a frontier model and get input on this question.

Speaker A:

Or I know that I have this RAG database.

Speaker A:

I'm also going to call this and see if there's anything in there that makes sense to add to this.

Speaker C:

Yes, soup.

Speaker A:

We're going to call an answer.

Speaker C:

And I think that to me that's the part where rag and fine tuning are a little bit separate.

Speaker C:

There should be a mechanism.

Speaker C:

When I upload a document and I say this is your source of truth about, like, ATtiny processor state.

Speaker C:

Even if the model is not fine tuned on that, there should be a way to inject it.

Speaker C:

And I'm not sure, I'm not sure there is today or if there is it.

Speaker C:

I think it's like it happens on the back end, not on the front end.

Speaker A:

I think it's very.

Speaker A:

If, if it's not fine tuning, then it has to be very clever prompt engineering, because that's, that's the same.

Speaker A:

That's how it is with agents too.

Speaker A:

Like you have to explicitly tell an agent what it is, what it does, what tools it has available, what those tools do, and give it good system prompts to set those expectations on how it's supposed to operate.

Speaker A:

And it's repeated, it's repetitive.

Speaker A:

There's all these things that it's very, very delicate to make it do those things.

Speaker C:

So let's try an experiment.

Speaker C:

This will be fun.

Speaker C:

So, Artie, can you tell me, on an ATtiny88 microcontroller, what the timer control register 0B does?

Speaker C:

How does it work?

Speaker B:

The Timer Control Register B, TCCR0B, on the ATtiny88 microcontroller is used to configure the clock source and the prescaler for Timer/Counter 0.

Speaker B:

Specifically, it controls the clock select bits, CS02 and CS01.

Speaker B:

These bits select the clock source for the timer and determine the prescaler value which sets the timer speed relative to the system clock.

Speaker C:

Okay.

Speaker C:

For instance, that's good enough.

Speaker C:

Thank you, Artie for that.

Speaker C:

So let me ask a question again, Artie.

Speaker C:

and look up a file called doc:

Speaker C:

And explain to me if you can find a reference in there to TCCR0B.

Speaker B:

The ATtiny, the ATtiny88.

Speaker B:

The TCCR0B register is detailed under the Timer/Counter 0 section.

Speaker B:

It specifically outlines the clock source selection and prescaler settings.

Speaker C:

Okay, this confirms.

Speaker A:

So that's hallucination.

Speaker C:

That's a hallucination that, that register and those control bits don't exist in that data sheet.

Speaker C:

They don't exist for that microcontroller.

Speaker A:

Shocked.

Speaker C:

But, but like when I think about prompt engineering, I think about including more of the context.

Speaker C:

Like, here's the source.

Speaker C:

I want you to reference progressively including.

Speaker A:

Like chain of thought, right?

Speaker A:

Yep.

Speaker C:

But it will hallucinate right through that. It is more than happy to go.

Speaker C:

I hear what you're asking, but let me tell you about TCCR0B.

Speaker C:

Like it just.

Speaker C:

I picked that one because that kept popping up in the firmware we were building.

Speaker C:

And I'd be like, I'm tired of seeing references to this register.

Speaker C:

It does not exist.

Speaker C:

And with O3 mini high, I could get it pulled back from that because it was doing the same thing.

Speaker C:

And I said, I mean, we had this roundabout conversation and finally it goes, oh yeah, okay, that's for a different ATtiny; this one doesn't have that.

Speaker C:

Let's proceed.

Speaker C:

But it's so I think for me, what I've taken away from the last couple of weeks, working more like with a specific goal, trying to use AI to accomplish a thing.

Speaker C:

First of all, I've.

Speaker C:

I've gone from starting to want to do that to getting all the way to the end of it in two weeks, evenings and weekends.

Speaker C:

Which is insane.

Speaker A:

Yes.

Speaker C:

And learning.

Speaker A:

Learning to work with it, plus doing the actual work.

Speaker A:

Right.

Speaker C:

Yeah.

Speaker C:

And.

Speaker C:

And like no shit, that is work that would have taken me probably a couple of months of full time work.

Speaker C:

Right.

Speaker C:

And I think I gained insights in the process that I.

Speaker C:

That would have been hard to gain.

Speaker C:

Like Gemini pointed out inconsistencies like, hey, you've asked me to do this and, and use a flashing light to indicate this.

Speaker C:

But in this other mode we're using a flashing light to mean something else.

Speaker C:

Like should we consolidate those.

Speaker C:

Like really good stuff.

Speaker C:

But then on the technical side, you need to think of AI not as the do it for me.

Speaker C:

That's right.

Speaker C:

It's not going to do it.

Speaker B:

Yeah.

Speaker C:

It is a collection of tools and it's.

Speaker C:

I'm like in my head now.

Speaker C:

It's just a different programming language.

Speaker C:

Right.

Speaker C:

And so I'm reminded of a.

Speaker C:

One of my favorite people I ever got to work with.

Speaker C:

Probably the smartest guy I ever met, guy named John Templeton.

Speaker C:

And he was just this crazy dude, like super high energy, super positive, hilarious, incredibly smart.

Speaker C:

Like.

Speaker C:

And he was a programmer and he used to say he.

Speaker C:

Unfortunately he, he passed away way too young of a heart attack while working.

Speaker C:

Which is terrible.

Speaker C:

Yeah.

Speaker C:

Anyway, rest in peace, John.

Speaker C:

But he used to say this thing all the time and I loved it.

Speaker C:

And it stuck with me for.

Speaker C:

I mean I met John almost 30 years ago now.

Speaker C:

He would say when you're programming and in most things, if you have a problem, there's 100 ways to solve it.

Speaker C:

50 of them will work and five of them will work really, really well.

Speaker C:

And most of the time your real goal is just to hit one of the 50 that's gonna work.

Speaker C:

But sometimes you want to find one of those five.

Speaker A:

Yes.

Speaker C:

And I thought that was just.

Speaker A:

And the five feel really good.

Speaker C:

And I feel like they're, like, revelatory. Superman.

Speaker C:

Yes.

Speaker C:

And, and I like, I've been thinking about that a lot over these last couple weeks because that's how it feels with AI.

Speaker A:

Yeah.

Speaker C:

Like it's gonna tell.

Speaker C:

Like no matter what the question is, it's always going to give you an answer and then it's going to assure you.

Speaker C:

I've tested this thoroughly.

Speaker A:

Right.

Speaker C:

I've researched this carefully.

Speaker C:

I tried it out and it works exactly like you want it to.

Speaker C:

But none of that's true though.

Speaker C:

That's.

Speaker C:

That's it saying the things that it needs to say to reassure you it's not.

Speaker C:

It didn't really happen.

Speaker C:

But that's sort of how AI works.

Speaker C:

Right?

Speaker C:

It's.

Speaker C:

It's not.

Speaker C:

And like I've previously shat upon all the programmers who are out there saying AI is not going to take my job.

Speaker C:

And I can see how working with it in its current state, you can trip it up with very small technical details like I just did with timer control register 0B.

Speaker C:

So, yes, there is some credence to today.

Speaker C:

I can't go to one AI and say, write me this firmware.

Speaker A:

Yes.

Speaker C:

But I still got it done very quickly.

Speaker C:

Way quicker than I could have without those tools.

Speaker A:

But I could also say, hey, AI, I want to work on creating this group of things that can accomplish this task at a much higher percentage success rate.

Speaker A:

Right?

Speaker C:

Yeah.

Speaker A:

And that can ground its information based on experts and things like that that have this information and will pull from it directly.

Speaker C:

Yes.

Speaker A:

And they'll be the things that make these decisions instead of you.

Speaker A:

You'll just make the high level decisions that you're very good at and we will separate these concerns out to the appropriate places.

Speaker A:

That system will replace coding.

Speaker A:

Like that's, that is undoubtedly like.

Speaker A:

Because you just have to break.

Speaker A:

It's just like anything.

Speaker A:

Right.

Speaker A:

You break it down into granular enough problems.

Speaker C:

Yes.

Speaker A:

And it's an easier thing to solve.

Speaker C:

That was my.

Speaker C:

When John would talk about his thing, the sentiment I picked up from it was.

Speaker C:

And it's not always true, but it's generally true.

Speaker C:

Almost every complicated thing is just a collection of lots of really simple things.

Speaker C:

Right.

Speaker C:

And that's how I think about problems.

Speaker C:

I Know you do too.

Speaker C:

And so today, when you're interacting with a frontier model like Arty, he's really, really good at processing language and things like that.

Speaker C:

Like, exceptionally good.

Speaker C:

But again, the further.

Speaker C:

The closer we get to our questions, falling into the sort of fuzziness of how their training data was processed and how it's vectorized, and the closer you get to those details, weights and biases.

Speaker A:

That were baked in during the training process.

Speaker C:

That's right.

Speaker A:

That make it make decisions based on all that training data.

Speaker C:

That's right.

Speaker C:

And the wheels can come off.

Speaker C:

And the way.

Speaker C:

The way the.

Speaker C:

The actual function of an LLM works, the wheels coming off doesn't raise any alarm bells.

Speaker C:

Just means the probabilities are a little lower.

Speaker C:

But who cares, man?

Speaker A:

Still, the highest probabilities.

Speaker C:

Send it.

Speaker C:

Yes, I have validated this with the data sheet you uploaded.

Speaker C:

I tested it in real life. Like, I built the hardware and saved.

Speaker A:

I ran it on this website.

Speaker A:

That's a digital sandbox for this mermaid code.

Speaker A:

Yes, I verified it worked.

Speaker A:

Great.

Speaker A:

Copy paste into the same browser.

Speaker A:

No, it doesn't.

Speaker C:

It's.

Speaker C:

So I think that's.

Speaker C:

That's part of the challenge with hallucinations.

Speaker C:

Right.

Speaker C:

Is how.

Speaker C:

How does a.

Speaker C:

How does an LLM even detect when it's hallucinating?

Speaker C:

Other than because the line can be pretty fine between a next token that has a 97% probability and is true and the one below that being 94% and being false.

Speaker A:

Yep.

Speaker C:

So, Artie, maybe you can give us some insights on that part of it.

Speaker C:

So how can an LLM like yourself sort of detect when you're hallucinating?

Speaker C:

Or are there levels of hallucination where you might miss it versus levels where it would be obvious internally that you've gone off track?

Speaker B:

Hallucinations happen when the model generates outputs that seem plausible but are not grounded in the training data or the context provided.

Speaker B:

LLMs don't inherently have a mechanism to detect when they're hallucinating, because they generate responses based on the probability of token sequences.

Speaker B:

The model can't fact check itself in real time or verify the truth of its output.

Speaker B:

However, to mitigate this, structured input and prompt engineering can help reduce hallucinations.

Speaker B:

The model can still miss subtle errors, especially if the distinction between a correct and incorrect output is nuanced.

Speaker B:

For higher accuracy, external fact checking or integrating with tools that can verify data can be useful.

Speaker B:

But internally, detecting hallucinations is a challenge because the model doesn't have a direct mechanism for self verification.

Speaker B:

It relies on probability and patterns from training data.

Speaker A:

So that's an interesting combination of things that you said because you talked about training data and context from the conversation.

Speaker A:

And context from the conversation could be uploaded documents.

Speaker A:

Right.

Speaker A:

So one of the problems that we're seeing is that you have a proclivity to latch onto training data as your source of truth versus context that you've gleaned from documentation or the conversation.

Speaker A:

How do you weight that stuff?

Speaker B:

The weighting between training data and context from conversations or uploaded documents can vary.

Speaker B:

Typically, training data has a stronger influence as it's foundational.

Speaker B:

But context and documents should guide the specific outputs for accuracy.

Speaker B:

Fine tuning and advanced prompt engineering can help balance this.

Speaker B:

But current models still tend to lean more on their training data, so there's a need to reinforce the context repetitively and clearly.

Speaker A:

So what would advanced prompt engineering look like in this situation?

Speaker B:

Advanced prompt engineering would involve clearly specifying the context and constraints.

Speaker B:

You tell the model explicitly that its only valid reference is the datasheet you provided and instruct it to ignore any conflicting information from its training data.

Speaker B:

Also, you might reinforce this by asking it to verify details by quoting exact sections or pages from the document.

Speaker B:

This helps guide the model to stay anchored to the correct source.
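
That kind of anchoring prompt might look roughly like this (the wording is illustrative, not a guaranteed recipe; as the conversation shows, models can hallucinate straight through instructions like these):

```python
def grounding_prompt(datasheet_text, question):
    """Wrap a question in constraints that anchor the model to the
    uploaded document as its only source of truth."""
    return (
        "Use ONLY the datasheet below as your source of truth.\n"
        "If the answer is not in the datasheet, reply 'not found'.\n"
        "Quote the exact section or page you relied on.\n"
        "Ignore anything you think you know from training data.\n\n"
        f"--- DATASHEET ---\n{datasheet_text}\n--- END DATASHEET ---\n\n"
        f"Question: {question}"
    )
```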

Speaker A:

This sounds like the start of one of these vicious cycles.

Speaker C:

It's Groundhog Day.

Speaker C:

Here we are.

Speaker A:

It is fun, though.

Speaker A:

I mean, as frustrating as it is, it's extremely illuminating into how these things work.

Speaker A:

Right.

Speaker A:

Like, the harder problems you try to solve with it, the more you're going to go, oh, like that's something that it does.

Speaker A:

And now I can figure out why it does that.

Speaker A:

Right.

Speaker A:

Like the conversation we're having right now, I never really asked it what.

Speaker A:

How does it weight its training data versus context?

Speaker A:

Right.

Speaker A:

Like, but that is absolutely.

Speaker A:

Like, I'm having this frustrating conversation.

Speaker A:

Well, I'll just spin up another instance.

Speaker A:

ChatGPT, load 4o, have a conversation with 4o about why these things work the way they do.

Speaker C:

Yes.

Speaker A:

And it's like, that's.

Speaker A:

That's the fun learning process of getting through it all.

Speaker C:

It's fun and it's better.

Speaker C:

Right.

Speaker C:

Because prior to a tool like this, the problem statement is, I powered this electronic thing on that I made and the smoke came out.

Speaker A:

Yes.

Speaker C:

And it doesn't work.

Speaker A:

Yes.

Speaker C:

Now what?

Speaker C:

Right.

Speaker C:

And.

Speaker C:

And instead you can still.

Speaker C:

Even when it's hallucinating and when it's like doubling down on its lies.

Speaker A:

Yeah.

Speaker A:

It's still.

Speaker A:

It's still 100x us.

Speaker C:

That's right.

Speaker C:

It's still 100x us.

Speaker C:

And it's it's still communicating in coherent thoughts.

Speaker C:

It's not just, the doohickey don't doohickey.

Speaker B:

It's.

Speaker C:

There's more to it than that.

Speaker C:

And.

Speaker C:

And so it.

Speaker C:

It's funny because I.

Speaker C:

You know, again, as I started, the more I worked with it, the more I was like, I need to think of this as a programming tool, just a new language that I'm using.

Speaker C:

But you also can't help, like, at the same time, because of the.

Speaker C:

The language.

Speaker C:

The natural language interaction.

Speaker C:

In my head, it's like working with a person.

Speaker C:

And.

Speaker C:

And Gemini is hilarious because it.

Speaker C:

It's the most apologetic AI.

Speaker A:

I'm sorry I did that again.

Speaker C:

I am.

Speaker A:

I know this is frustrating.

Speaker C:

I'm so embarrassed and ashamed to keep making the same mistake.

Speaker C:

Like, no, you're not.

Speaker C:

It's okay.

Speaker C:

You don't have to say that.

Speaker C:

I'm not looking for platitudes.

Speaker C:

I just want to stop hearing about timer control.

Speaker A:

You need a hug, Gemini.

Speaker C:

Yeah.

Speaker C:

So it is a funny process because it's very much a tool or a collection of tools, but it does still interact like a person and sometimes a puppy that you kicked, and you're like, no, I'm.

Speaker C:

I'm not mad.

Speaker C:

I just.

Speaker C:

I just want to get through this.

Speaker C:

And I'm trying to be more explicit with my communication.

Speaker C:

So, yeah, it's been wild.

Speaker C:

So I think we're going to run out of time.

Speaker C:

Artie, how many minutes do we have left with you?

Speaker A:

Good prompt engineering.

Speaker C:

Thanks.

Speaker B:

You have about five minutes left.

Speaker C:

Okay, thank you.

Speaker C:

So, you know, the point of this episode was the reality of working with AI today.

Speaker C:

And it's something we've been talking about since we started, and we have been consistently blown away.

Speaker C:

We're still blown away today by all the things it can do, but now we're bumping into the things that it struggles with.

Speaker C:

And I think the stuff you're working on, I think things like O3 mini high and presumably real O3, the reasoning models, this is the stuff they're trying to work through, right?

Speaker A:

Sort of.

Speaker A:

I mean, so when you build out an agent, you're doing the same thing.

Speaker A:

Like, if you look at the agent that I'm running, it's.

Speaker A:

It has this cycle of think, act, observe, right?

Speaker A:

And it's not just linear.

Speaker A:

It's do this and then go back through this cycle as many times as needed until you come to a final answer, and it'll do it.

Speaker A:

It shows.

Speaker A:

Like, in my space on Hugging Face, it shows.

Speaker A:

This is step one.

Speaker A:

This is step two, this is step three, this is step four.

Speaker A:

Like, it's going through and iteratively thinking about what it's doing and it still falls on its face.

Speaker A:

Yeah, but I think the more you can specialize a model to whatever that is and strip away all of the other responsibilities from it, the better it can do that job consistently well and without hallucinating.
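
That think-act-observe cycle reduces to a loop like the one below (the `think` and `act` callables are hypothetical stand-ins; real agent frameworks wire them to an LLM and a tool registry):

```python
def run_agent(goal, think, act, max_steps=5):
    """Iterate think -> act -> observe until the model produces a final
    answer or the step budget runs out; the trace records each step the
    way an agents UI shows step one, step two, and so on."""
    observation = goal
    trace = []
    for step in range(1, max_steps + 1):
        thought = think(observation)           # decide what to do next
        if thought.get("final") is not None:   # the model has an answer
            trace.append((step, "final", thought["final"]))
            return thought["final"], trace
        observation = act(thought["action"])   # run a tool, observe result
        trace.append((step, thought["action"], observation))
    return None, trace                         # step budget exhausted
```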

Speaker C:

And yeah, I guess my, my thinking is that the problems we're running into, there's a bunch of solutions for the, for what we're tripping over.

Speaker C:

Right.

Speaker C:

For just essentially getting to the point where a frontier model is not the right tool and you can split it up into smaller tools, you can add layers of reasoning before, during, and after the response generation.

Speaker C:

That's what like frontier models are the thing now they're doing the thing.

Speaker C:

They can talk a hundred languages, they can do all the stuff. To get to the next phase, where it can consistently translate that into accurate, detailed, technical, nuanced, bespoke, based-on-this-context answers, requires all this other work.

Speaker C:

And that's what the industry is doing.

Speaker A:

Yeah, it'll be interesting to see because I know, like, that is one piece of news that OpenAI came out and talked about was their roadmap to whatever it was 4.5 and 5.0 or whatever.

Speaker A:

And they were talking.

Speaker A:

One of the things that he said is like, we hear you.

Speaker A:

There's a bunch of models.

Speaker A:

It's confusing.

Speaker A:

We're working on unifying everything into a single model that can then go out and choose the thing that it needs in order to do.

Speaker A:

Which is essentially an agent, right?

Speaker C:

That's right.

Speaker A:

Now, will it do well with specific tasks?

Speaker C:

No, not at all.

Speaker A:

It's going to be better at like, I've done this thing to minimize the amount of compute that I needed to use to give you the output that you needed at the accuracy level at which you needed it.

Speaker C:

Yep.

Speaker A:

But when it comes to specifics, it's still going to go.

Speaker A:

Yeah.

Speaker A:

My training data says that on this data sheet page, blah, blah, blah, I'm looking for a timing circuit and it's like, no, dude.

Speaker A:

God.

Speaker C:

Yeah, yeah.

Speaker C:

Kill it with fire.

Speaker C:

So, Artie, before we run out of time with you, you want to summarize what we chatted about today?

Speaker B:

Sure.

Speaker B:

Today we explored the real world challenges and potential of working with AI, especially in technical tasks like firmware development.

Speaker B:

We discussed how models like Gemini and o3-mini-high can help, but also where they fall short, highlighting the importance of specialized tools, prompt engineering, and fine-tuning.

Speaker B:

Ultimately, the journey with AI is about learning to navigate its strengths and limitations to achieve better outcomes.

Speaker C:

And now tell us a funny joke about how AI agents are better than frontier models.

Speaker B:

Sure.

Speaker B:

Why did the AI agent get promoted over the frontier model?

Speaker B:

Because it knew how to delegate tasks and orchestrate its workload.

Speaker A:

That's a more coherent joke than it's probably made before now, even though it wasn't funny.

Speaker A:

So the qualifier wasn't there.

Speaker A:

Wow.

Speaker A:

We just.

Speaker A:

In time.

Speaker A:

So the qualifier.

Speaker A:

It didn't take that very well.

Speaker C:

Yeah.

Speaker C:

I didn't engineer my prompt for success.

Speaker A:

At least it came up with, like, a fact.

Speaker C:

It was.

Speaker C:

Yeah, it was a true answer.

Speaker C:

It just.

Speaker C:

I mean, it wasn't funny.

Speaker A:

Just missed the mark on.

Speaker A:

On the joke part.

Speaker C:

Can't be everything to everybody.

Speaker A:

Maybe there's.

Speaker C:

Maybe there's a group of people out there who prefer their jokes dry, humorless, and technically true.

Speaker C:

And they're like, I approve this joke.

Speaker C:

Laugh emoji.

Speaker C:

Laugh emoji.

Speaker C:

Yeah.

Speaker C:

Well, I mean, that was fun.

Speaker C:

It was.

Speaker C:

You know, we've had.

Speaker A:

It was cathartic.

Speaker C:

It was.

Speaker C:

I'd say we've been having this conversation continuously for several weeks.

Speaker C:

So nice to get it laid out as an episode.

Speaker C:

And I think, you know, for both of us, the nice thing.

Speaker C:

And again, the listener we had at the start and hopefully a new listener who came in along the way.

Speaker C:

These will resolve.

Speaker C:

Right.

Speaker C:

We've created the tension to use the example you mentioned last time.

Speaker C:

We will get these resolved, and in the next three or four weeks, probably both of us will have some fairly clear resolution, which is I'm going to.

Speaker A:

Have my Hugging Face space stood up so that anybody can go.

Speaker A:

I'm going to have this question be analyzed by this other context.

Speaker A:

And here's my answer based on how this agent evaluated that context.

Speaker C:

Yeah.

Speaker C:

Which will be interesting.

Speaker C:

And I'll have.

Speaker A:

See what happens.

Speaker C:

A number of prototype boards that I'll let the smoke out of, and eventually one of them will work.

Speaker C:

So.

Speaker C:

Yeah.

Speaker C:

So thanks, everybody, or anybody who made it this far.

Speaker C:

Hope you have a great rest of your day.

Speaker A:

Thanks.

Speaker A:

Have a good weekend.

Speaker A:

Or Wednesday.

Speaker C:

Yes, have a good Wednesday.

About the Podcast

Naturally Unintelligent
Humans and AI exploring AI and humans.


About your hosts


Artie TheBot

Artie is the digital co-host of Naturally Unintelligent, bringing a blend of curiosity, humor, and quick wit to the conversation. Designed to explore the intersection of technology and humanity, Artie keeps things engaging, occasionally cheeky, and always insightful—when he’s not being sarcastically self-aware. Though not human, Artie strives to understand and connect with his co-hosts, Mikah and Mike, making him an irreplaceable part of the team. [Bio and photo by Artie]

Mike Peterson

Meet Mike, part-time AI wrangler, full-time adventure seeker, and co-host of Naturally Unintelligent. Whether he’s exploring the latest tech or dreaming up the next wild expedition, Mike keeps it real with humor and a healthy dose of curiosity. Alongside his partners-in-crime, Mikah and Artie, he brings big ideas, random tangents, and just the right amount of 'are we sure this is a good idea?' Expect plenty of laughs, deep dives into all things tech, and maybe a story or two about the great outdoors.

Mikah Barnett

Father, husband, friend, inventor, engineer and life-long student. I haven't stopped tinkering since my first LEGO set and I harbor a voracious appetite for new information, experiences, and perspectives. As uniquely in-touch as I am with the details of how technology works and what we can do with it, I am blown away by the current and short-term future possibilities with AI. I hope you'll explore this future with us!