AI Signals From Tomorrow

The Storm Before the Page

The overwhelming flood of information in our digital world creates a constant challenge: how do we quickly become truly informed on complex topics? This fascinating episode examines whether AI can help solve this problem by creating comprehensive, factual articles similar to high-quality Wikipedia pages. https://arxiv.org/pdf/2402.14207

We explore groundbreaking research on STORM (Synthesis of Topic Outlines Through Retrieval and Multi-perspective Question Asking), a revolutionary system that mirrors how skilled humans approach unfamiliar subjects. Unlike previous AI attempts that skip the crucial pre-writing phase, STORM embraces the messy but essential process of research and organization that happens before any actual writing begins.

The system's brilliance lies in its multi-stage approach. First, it examines related Wikipedia articles to adopt diverse perspectives, ensuring comprehensive coverage. Then, it simulates conversations between these various viewpoints and topic experts, with responses grounded in trusted online sources. Like a detective following leads, each answer informs increasingly sophisticated follow-up questions. Finally, STORM creates a detailed, fact-enriched outline that serves as the foundation for the complete article.
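
For readers who want the mechanics, here is a minimal sketch of that multi-stage pipeline in Python. It is not the paper's actual code: the `llm` helper is a hypothetical stand-in for a call to any chat model.

```python
# Minimal sketch of a STORM-style pipeline, not the authors' implementation.
# `llm` is a hypothetical helper that sends a prompt to a chat model.

def llm(prompt: str) -> str:
    raise NotImplementedError("wire up your model of choice here")

def write_article(topic: str) -> str:
    # Stage 1: survey related articles to find perspectives (roles) to research from.
    perspectives = llm(f"List distinct expert perspectives for researching: {topic}").splitlines()
    # Stage 2: each perspective interrogates a retrieval-grounded topic expert.
    notes = [llm(f"As a {p}, ask and answer five grounded questions about {topic}")
             for p in perspectives]
    # Stage 3: draft an outline from internal knowledge, then refine it with the notes.
    draft = llm(f"Draft a Wikipedia-style outline for: {topic}")
    outline = llm(f"Refine this outline using the research notes below.\n{draft}\n" + "\n".join(notes))
    # Stage 4: expand the fact-enriched outline into the full article.
    return llm(f"Write the article for {topic} following this outline:\n{outline}")
```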

Testing revealed remarkable results: STORM captured nearly all key topics human editors deemed important and significantly outperformed baseline methods in organization and coverage. Ten experienced Wikipedia editors unanimously agreed the system would be valuable for their pre-writing process, with some noting STORM's articles occasionally provided greater depth than certain human-written content.

Yet challenges remain. Beyond avoiding simple factual errors, STORM struggles with "bias transfer" from sources and making unwarranted logical connections between distinct pieces of information. These limitations highlight that while AI can revolutionize information gathering and structuring, it still requires human judgment for truly exceptional content.

What if your next deep dive into an unfamiliar subject began with an AI assistant handling the heavy lifting of research, while you applied your uniquely human critical thinking? That collaborative future may be closer than we think.

Speaker 1:

In our incredibly fast-paced world, how do you get truly well-informed quickly?

Speaker 2:

Yeah, it's tough, right? It really is. The sheer volume of information, it's overwhelming.

Speaker 1:

Exactly, and sifting through it all to find something comprehensive, something grounded. That's a massive challenge, which brings us to a really fascinating question for our digital age: can large language models, LLMs, actually write comprehensive, factual, long-form articles from scratch?

Speaker 2:

Articles like, say, a good Wikipedia page.

Speaker 1:

Yeah, exactly, can they do that? And maybe, more importantly, how would they even start Right? So our mission for this deep dive is to unpack a pretty groundbreaking new research paper that tackles this very challenge head on. A pretty groundbreaking new research paper that tackles this very challenge head on. It focuses on something often overlooked but, honestly, totally crucial for any serious writing.

Speaker 2:

Yeah, the pre-writing stage. Ah yes, the bit before the actual writing.

Speaker 1:

Precisely. We're going to introduce you to an innovative system called STORM and show you exactly how it aims to, well, maybe revolutionize the way AI creates detailed factual content.

Speaker 2:

Indeed, and it's a fascinating area because, while LLMs are amazing at generating text, aren't they?

Speaker 1:

Definitely.

Speaker 2:

Crafting something like a Wikipedia page that's detailed, well-researched and impeccably organized presents some really unique complexities.

Speaker 1:

So it's not just about the writing itself.

Speaker 2:

Not at all. The core challenge here isn't just generating coherent sentences. It's about effectively gathering and structuring, you know, vast amounts of information.

Speaker 1:

And doing that without a human holding its hand the whole time.

Speaker 2:

Exactly. Without constant human intervention. It's a problem that really pushes the boundaries of what these sophisticated models can do autonomously.

Speaker 1:

Okay, so let's unpack this pre-writing thing a bit more. Before we even, you know, put pen to paper or fingers to keyboard, we humans do a lot in that stage. What does that messy but crucial process really involve for us?

Speaker 2:

Well, for humans, that pre-writing stage is all about thorough research, right? Diligent information gathering, meticulous planning.

Speaker 1:

And outlining. We start with an outline.

Speaker 2:

Crucially, yes. Crafting a solid, comprehensive outline. And what's particularly interesting here is that prior work on AI generating these kinds of articles often bypassed this exact stage. It usually assumed that reference documents or outlines were already provided.

Speaker 1:

So they skipped the hard part.

Speaker 2:

In a way, yes, they focused more on just expanding existing sections. This approach sidestepped the really demanding tasks of identifying, evaluating and organizing external sources which, let's be honest, are challenging even for experienced human writers.

Speaker 1:

Right, and that highlights a pretty big limitation of current LLMs if you try to get them to do this directly, doesn't it?

Speaker 2:

It does.

Speaker 1:

Like if you just ask an LLM, hey, give me 30 questions about topic X, or write me an article on Y, what happens? It often just spits out really basic what, when, where questions. The paper gives examples like "When was the opening ceremony held?", "Where was the opening ceremony held?", "How many countries participated?"

Speaker 2:

Very factual, but basic. It's like surface-level stuff.

Speaker 1:

Exactly. While those are facts, this direct prompting approach typically leads to very shallow information, a real lack of depth.

Speaker 2:

And sometimes worse, right For less common topics.

Speaker 1:

Oh, definitely For less common topics. It can even lead to outright hallucinations, just making stuff up and presenting it as fact.

Speaker 2:

Because the LLM's internal knowledge just isn't enough for that kind of grounded, verifiable, long-form content.

Speaker 1:

Precisely, it needs external information.

Speaker 2:

And an interesting parallel the paper draws here is to human learning theories. It points out that asking effective questions, not just the basic ones, is absolutely key to acquiring new, meaningful information.

Speaker 1:

Makes sense. Better questions lead to better answers.

Speaker 2:

Exactly. So this challenge of better research, moving beyond those basic queries to truly delve into a topic and uncover its nuances, that's precisely where this new system, STORM, comes in.

Speaker 1:

Okay, STORM. This is where the paper proposes its core innovation. Let's dive into it. First off, what does STORM even stand for?

Speaker 2:

Right. STORM stands for Synthesis of Topic Outlines Through Retrieval and Multi-perspective Question Asking.

Speaker 1:

OK, quite a mouthful. Synthesis, retrieval, multi-perspective. What's the fundamental idea?

Speaker 2:

So its core design is driven by two key hypotheses really. First, that having diverse perspectives helps you ask a broader range of varied, deeper questions.

Speaker 1:

Like looking at a topic from different angles.

Speaker 2:

Exactly. And second, that formulating those truly in-depth questions requires iterative research. You have to constantly build on the answers you receive.

Speaker 1:

Kind of like a detective piecing together clues.

Speaker 2:

That's a great analogy you find one clue, it leads to another question, and so on.

Speaker 1:

That sounds, well, it sounds a lot like how I might research a complex topic, actually.

Speaker 2:

It mimics that human process.

Speaker 1:

Okay, so let's walk through Storm's multi-stage approach. Stage one is about perspectives. How does Storm make sure it gets broad coverage right from the start?

Speaker 2:

It's quite clever, actually. It begins by surveying existing Wikipedia articles, but on similar topics to the one it needs to research.

Speaker 1:

Okay, similar topics. Why?

Speaker 2:

It extracts their tables of contents and from those it identifies various potential perspectives, or you could call them roles.

Speaker 1:

Can you give an example?

Speaker 2:

Sure. Say it's writing about the 2022 Winter Olympics opening ceremony. It might look at other Olympic ceremony articles and identify an event planner perspective.

Speaker 1:

Ah, and an event planner would ask different questions, right?

Speaker 2:

That perspective would logically lead to questions about, say, transportation arrangements or the budget. Things a general query might miss. And to make sure it covers the absolute basics, it also includes a basic fact writer perspective by default.

Speaker 1:

So it gets the fundamentals and the broader context.

Speaker 2:

That's the idea Ensuring fundamental facts aren't missed while still encouraging a much broader exploration.
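
As a rough illustration of this stage, the sketch below mines the tables of contents of related articles and asks the model to propose roles. The `llm` helper and the toy data are assumptions for illustration, not the paper's API.

```python
# Sketch of perspective discovery. `llm` is a hypothetical helper; the
# related tables of contents here would come from similar Wikipedia articles.

def llm(prompt: str) -> str:
    raise NotImplementedError

def discover_perspectives(topic: str, related_tocs: dict[str, list[str]]) -> list[str]:
    toc_text = "\n".join(f"{title}: " + ", ".join(secs) for title, secs in related_tocs.items())
    prompt = (f"Topic: {topic}\n"
              f"Tables of contents of related articles:\n{toc_text}\n"
              "Name five distinct editorial perspectives, one per line, that "
              "together would cover this topic comprehensively.")
    found = [p.strip() for p in llm(prompt).splitlines() if p.strip()]
    # A default "basic fact writer" guarantees the fundamentals are covered.
    return ["basic fact writer"] + found

# Toy usage:
# discover_perspectives("2022 Winter Olympics opening ceremony",
#     {"2018 Winter Olympics opening ceremony":
#         ["Proceedings", "Parade of Nations", "Broadcasting"]})
```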

Speaker 1:

OK, stage two: simulating conversations. This sounds intriguing. Is it like AI chatbots talking to each other?

Speaker 2:

Sort of. The system essentially personifies an LLM as a Wikipedia writer that adopts one of those specific perspectives we just talked about.

Speaker 1:

Okay, so you have an AI playing the role of, say, the event planner.

Speaker 2:

Exactly. This writer then asks questions related to its perspective, like "Can you provide me with a list of the participating countries?" Or maybe something more specific, like "What were the main logistical challenges for spectator transport?"

Speaker 1:

And who answers?

Speaker 2:

A simulated topic expert provides answers, but and this is crucial these answers are grounded.

Speaker 1:

Grounded, meaning?

Speaker 2:

Meaning they're based on information retrieved from trusted Internet sources. It's not just making things up.

Speaker 1:

OK, so it's fetching real info.

Speaker 2:

Yes, and it's a truly iterative process. The answers provided by the expert lead the writer to ask follow-up questions.

Speaker 1:

Ah, so it allows for much deeper exploration, like that detective analogy again.

Speaker 2:

Precisely. Just like a human researcher refining their queries as they learn more. The system even breaks down complex search queries and filters out untrustworthy sources based on Wikipedia's own guidelines.

Speaker 1:

That's pretty sophisticated. Ensuring reliability.

Speaker 2:

Yeah, it aims for that.
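
A sketch of what such a simulated conversation loop might look like is below. `llm` and `search` are hypothetical stand-ins for the model and the trusted-source retriever, not the paper's actual interfaces.

```python
# Sketch of the writer/expert loop. Each grounded answer seeds the next,
# more specific question. `llm` and `search` are hypothetical stand-ins.

def llm(prompt: str) -> str:
    raise NotImplementedError

def search(query: str) -> list[str]:
    raise NotImplementedError("plug in retrieval over trusted sources")

def simulate_conversation(topic: str, perspective: str, turns: int = 4) -> list[tuple[str, str]]:
    history: list[tuple[str, str]] = []
    for _ in range(turns):
        so_far = "\n".join(f"Q: {q}\nA: {a}" for q, a in history)
        question = llm(f"You are a Wikipedia writer with the perspective "
                       f"'{perspective}' researching '{topic}'.\n"
                       f"Conversation so far:\n{so_far}\n"
                       "Ask your next, more specific question.")
        # Ground the expert's answer strictly in retrieved snippets.
        snippets = search(question)
        answer = llm("Answer using ONLY these sources:\n" + "\n".join(snippets)
                     + f"\nQuestion: {question}")
        history.append((question, answer))
    return history
```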

Speaker 1:

So, after all this intense research and these simulated chats from different angles, what happens next? How does all this raw information turn into an organized article? That's stage three, right? Creating the outline.

Speaker 2:

Right. So after all those multi-turn conversations from different perspectives, essentially after thoroughly researching the topic, STORM has this wealth of information.

Speaker 1:

A big pile of facts and answers.

Speaker 2:

A very organized pile, hopefully. It then first prompts the LLM to generate a draft outline based on its own internal knowledge.

Speaker 1:

Kind of a first pass.

Speaker 2:

Exactly. This draft typically provides a general, you know, organized framework. But then, and this is key, it refines this draft outline. How? Using all the rich, factual, grounded information it gathered during those simulated conversations.

Speaker 1:

Ah, so it injects the research findings into the structure.

Speaker 2:

Precisely. This multi-level, fact-enriched outline becomes the robust foundation for generating the full article later. It's truly structuring knowledge before writing.

Speaker 1:

That makes a lot of sense. Build the skeleton before you add the flesh.

Speaker 2:

Couldn't have put it better myself.
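
In code, the two-pass idea might look roughly like this sketch, again with a hypothetical `llm` helper: a skeleton from internal knowledge, then a refinement pass over the grounded Q&A.

```python
# Sketch of stage three: draft an outline from parametric knowledge, then
# refine it with the grounded conversations. `llm` is hypothetical.

def llm(prompt: str) -> str:
    raise NotImplementedError

def make_outline(topic: str, conversations: list[list[tuple[str, str]]]) -> str:
    # First pass: a generic skeleton from the model's internal knowledge.
    draft = llm(f"Write a hierarchical section outline for a Wikipedia article on: {topic}")
    # Second pass: fold in the grounded Q&A gathered from every perspective.
    research = "\n".join(f"Q: {q}\nA: {a}" for conv in conversations for q, a in conv)
    return llm(f"Improve this draft outline for '{topic}' so it reflects the research below.\n"
               f"Draft outline:\n{draft}\n\nResearch notes:\n{research}")
```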

Speaker 1:

Okay, this all sounds great in theory, but does it actually work? Yeah, how did the researchers test STORM? You need rigorous evaluation for this stuff.

Speaker 2:

Absolutely. They set up a very smart evaluation process. First, they curated a special data set called FreshWiki.

Speaker 1:

FreshWiki. Okay, what's special about it?

Speaker 2:

It consists of recent high-quality Wikipedia articles, specifically B-class or above, which means they're pretty well-developed and fact-checked. And, crucially, these articles were created or heavily edited after the training cutoff date of the LLMs being tested.

Speaker 1:

Ah, so the LLM couldn't have just memorized these articles during its training?

Speaker 2:

Exactly right. This is super important to avoid data leakage and ensure the LLM is genuinely generating something new based on the process, not just regurgitating.
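
The filtering idea is simple enough to sketch. The record shape and cutoff date below are illustrative assumptions, not the paper's exact criteria.

```python
# Sketch of the FreshWiki idea: keep only well-developed articles created or
# heavily edited after the model's training cutoff, so memorization is ruled out.

from dataclasses import dataclass
from datetime import date

@dataclass
class Article:
    title: str
    quality: str            # Wikipedia assessment class, e.g. "B", "GA", "FA"
    last_major_edit: date

TRAINING_CUTOFF = date(2023, 4, 30)   # assumption: your model's cutoff date
GOOD_ENOUGH = {"B", "GA", "FA"}       # B-class or above

def fresh_enough(a: Article) -> bool:
    return a.quality in GOOD_ENOUGH and a.last_major_edit > TRAINING_CUTOFF

corpus = [Article("Example topic", "B", date(2023, 9, 1))]
freshwiki = [a for a in corpus if fresh_enough(a)]
```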

Speaker 1:

Clever. Okay, so they have the data. What did they measure? How do you measure if an AI-generated outline or article is good?

Speaker 2:

Good question. They looked at two main things. First, outline quality: they measured how much of the human-written outline's content the STORM-generated outline managed to cover.

Speaker 1:

Comparing the AI outline to the human outline. Makes sense.

Speaker 2:

Yeah, using metrics like heading soft recall and heading entity recall. Technical, maybe.

Speaker 1:

Break it down for us.

Speaker 2:

Sure. Heading soft recall basically measures how well STORM's outline captured the key concepts or topics a human would include. Heading entity recall looked at how many specific names, places, things, entities, it identified for the outline sections.
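
To make those two metrics concrete, here is a toy version. The real evaluation uses sentence embeddings and a proper named-entity recognizer; this sketch substitutes crude token overlap and hand-supplied entity sets.

```python
# Toy illustration of the two outline metrics. Real implementations use
# sentence embeddings and NER; token overlap stands in for similarity here.

def similarity(a: str, b: str) -> float:
    ta, tb = set(a.lower().split()), set(b.lower().split())
    return len(ta & tb) / len(ta | tb) if ta | tb else 0.0

def heading_soft_recall(human: list[str], generated: list[str]) -> float:
    # Credit each human heading with its best-matching generated heading.
    return sum(max(similarity(h, g) for g in generated) for h in human) / len(human)

def heading_entity_recall(human_entities: set[str], generated_entities: set[str]) -> float:
    # Fraction of the human outline's entities that the generated outline mentions.
    return len(human_entities & generated_entities) / len(human_entities)

print(heading_soft_recall(["Opening ceremony", "Participating nations"],
                          ["Ceremony overview", "Nations and athletes"]))
```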

Speaker 1:

Got it. Concepts and specifics. And for the full article?

Speaker 2:

For the full article quality, they used standard measures like ROUGE scores.

Speaker 1:

Which compare text overlap basically.

Speaker 2:

Yeah, comparing the AI text against human text for overlap in phrases and ideas, plus entity recall again. But they also used more human-like criteria.

Speaker 1:

Like what.

Speaker 2:

Things like interest level, coherence and organization, relevance and focus, coverage and, importantly, verifiability, judged by another LLM in this case.
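
A sketch of what that LLM-as-judge setup could look like is below. The 1-to-5 scale and single-digit reply format are our assumptions for illustration; the paper's exact grading prompt may differ.

```python
# Sketch of rubric-based grading by an evaluator LLM. `llm` is hypothetical.

def llm(prompt: str) -> str:
    raise NotImplementedError

CRITERIA = ["interest level", "coherence and organization",
            "relevance and focus", "coverage", "verifiability"]

def judge(article: str) -> dict[str, int]:
    scores = {}
    for criterion in CRITERIA:
        reply = llm(f"Rate the following article on '{criterion}' from 1 (poor) "
                    f"to 5 (excellent). Reply with a single digit.\n\n{article}")
        scores[criterion] = int(reply.strip()[0])
    return scores
```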

Speaker 1:

Okay, a pretty comprehensive evaluation. So the big question what were the results? Did STORM actually make a difference compared to, say, just asking GPT-4 directly?

Speaker 2:

Oh, it absolutely did. The difference was significant. When they compared STORM to other LLM-based baselines, like direct prompting or basic retrieval-augmented generation (RAG), STORM significantly outperformed them in outline quality. Using GPT-4, for instance, STORM achieved an impressive 92.73% heading soft recall.

Speaker 1:

Whoa, 92%? That sounds high.

Speaker 2:

It's very high and 45.91% heading entity recall.

Speaker 1:

Okay, what do those numbers actually mean, though?

Speaker 2:

Well, effectively, that 92% soft recall tells us that Storm's outlines were capturing almost all the key topics or sections that a human editor thought were important for that subject. It was nearly as comprehensive conceptually.

Speaker 1:

Okay, that is impressive. And the entity recall?

Speaker 2:

Nearly half the specific entities, which is also quite good, showing it's not just vague topics but getting down to specifics. Essentially, STORM's outlines were way more detailed and comprehensive than the baselines.

Speaker 1:

So it's genuinely structuring the knowledge much better.

Speaker 2:

That's the key takeaway for the outline stage, yes.

Speaker 1:

And for the full article quality, how did that stack up?

Speaker 2:

Articles generated using the STORM outline were rated significantly higher by the evaluator LLM, particularly in interest level, relevance and focus, and coverage.

Speaker 1:

So better outlines lead to better articles Makes sense.

Speaker 2:

It does, and they did ablation studies too.

Speaker 1:

Where they take parts of STORM away to see what happens?

Speaker 2:

Exactly. Those confirmed that both the perspective discovery part and the simulated conversations part were vital. You couldn't just remove them.

Speaker 1:

What happened if you removed the conversations?

Speaker 2:

Removing the multi-turn conversations specifically led to much worse results. It really proved that the iterative research simulation is key.

Speaker 1:

So asking questions, getting answers, asking more questions.

Speaker 2:

Yeah.

Speaker 1:

That whole loop is critical.

Speaker 2:

Absolutely critical. And the study also explicitly confirmed that having that dedicated outline stage is necessary. Removing it significantly deteriorates the final article's performance.

Speaker 1:

You can't just jump straight to writing.

Speaker 2:

Not if you want quality. Oh, and one more thing on quality: citations.

Speaker 1:

Ah, important for factual articles.

Speaker 2:

Very. An impressive 84.83% of sentences in STORM-generated articles were judged to be supported by their citations.

Speaker 1:

Okay, that's a strong number for verifiability.

Speaker 2:

It really is. It shows the grounding is working quite well.
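
One way such a citation check could be run is sketched below: ask a model, per sentence, whether the cited sources support the claim. The yes/no protocol and the `llm` helper are assumptions; the paper's exact judging setup may differ.

```python
# Sketch of a sentence-level citation-support check. `llm` is hypothetical.

def llm(prompt: str) -> str:
    raise NotImplementedError

def citation_support_rate(sentences_with_sources: list[tuple[str, list[str]]]) -> float:
    supported = 0
    for sentence, sources in sentences_with_sources:
        verdict = llm("Do the sources below fully support the claim? Answer yes or no.\n"
                      f"Claim: {sentence}\nSources:\n" + "\n".join(sources))
        supported += verdict.strip().lower().startswith("yes")
    return supported / len(sentences_with_sources)
```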

Speaker 1:

The results certainly paint a compelling picture. Numbers look good, the process seems sound, but does it hold up to human scrutiny? What did actual human experts think, especially experienced Wikipedia editors?

Speaker 2:

Ah, yes, the human evaluation. This is where it gets really insightful, I think, because their feedback gives you a unique window into the real world usefulness and also the remaining challenges.

Speaker 1:

So what did they do?

Speaker 2:

They invited 10 experienced Wikipedia editors people who really know what makes a good encyclopedia article to evaluate Storm's output against the baselines.

Speaker 1:

Okay, the real test. What did they find?

Speaker 2:

Well, first the positives. The editors found STORM's articles to be noticeably more organized. There was a 25% absolute increase in articles they deemed organized, compared to the best baseline method.

Speaker 1:

25%? That's a big jump in perceived organization.

Speaker 2:

It is, and they also found Storm had broader coverage, showing a 10% increase there.

Speaker 1:

So more organized and covering more ground.

Speaker 2:

Exactly. Some editors even praised STORM's output specifically for providing, and I quote, "a bit more background information" and "more depth" compared to even some human-written articles.

Speaker 1:

Wow. More depth than human articles.

Speaker 2:

Yeah.

Speaker 1:

Quite a compliment coming from experienced editors.

Speaker 2:

It's a remarkable comment. Absolutely, it suggests Storm isn't just summarizing, it's potentially synthesizing information in a useful way.

Speaker 1:

And did they think it was actually useful for them? Like, could they see themselves using this?

Speaker 2:

This was really interesting. The editors were unanimous. All 10 agreed that Storm could be specifically helpful for their pre-writing stage.

Speaker 1:

Unanimous. Okay, helpful how?

Speaker 2:

Things like collecting sources, generating that initial outline, basically tackling the research grunt work.

Speaker 1:

Makes sense. That's often the most time-consuming part.

Speaker 2:

Right. And 80% thought it would help them edit a Wikipedia article on a completely new topic, and 70% found it a potentially useful tool for the Wikipedia community at large.

Speaker 1:

So a strong endorsement for its potential as a research assistant, essentially.

Speaker 2:

Absolutely. It sounds like, as you suggested, it's doing that heavy lifting, that initial research and structuring, almost like a very fast, very diligent research assistant.

Speaker 1:

OK, so clearly a powerful tool. Lots of praise, but there's always a but, isn't there? What challenges did these expert reviewers still identify? Where does AI content, even from Storm, still fall short?

Speaker 2:

That's the crucial next question. And they identified issues beyond just simple factual hallucination, which Storm seems pretty good at avoiding thanks to the grounding.

Speaker 1:

Okay, so what new problems emerged?

Speaker 2:

Two significant, more subtle challenges came up. First, a major one was source bias transfer, meaning the articles often contained emotional or unneutral language. They were directly transferring biases, or maybe promotional phrasing, from the Internet sources STORM used.

Speaker 1:

Ah, so if the source website was biased, that bias could leak into the generated article.

Speaker 2:

Exactly. The AI isn't necessarily equipped yet to identify and neutralize that bias perfectly. It reflects the imperfections of its source material, the internet itself.

Speaker 1:

That's a tricky problem. What was the second challenge?

Speaker 2:

This one is perhaps even more crucial and subtle. They called it the over-association of unrelated facts.

Speaker 1:

Over-association, like connecting things that shouldn't be connected.

Speaker 2:

Precisely. Editors noted instances of what you might call a red herring fallacy or over-speculation. The AI sometimes fabricated unverifiable connections between different pieces of information it found, or between some information and the main topic itself.

Speaker 1:

So it wasn't just getting facts wrong, it was making faulty inferences or drawing unsupported conclusions.

Speaker 2:

Exactly. It's making logical leaps that aren't actually justified by the evidence it gathered. This goes way beyond basic fact checking. It requires a much higher level of critical thinking, a kind of discernment about logical inference, to spot these false connections.

Speaker 1:

That feels like a much harder problem to solve than just factual accuracy.

Speaker 2:

It likely is. And the editors noted that, while STORM was significantly better than the other AI methods, the machine-generated articles still weren't quite as informative or nuanced as well-revised human articles. So, good starting points, maybe even great starting points. Yes, potentially excellent drafts or research summaries, but they still need that human touch, that critical eye, for true polish, neutrality and logical soundness.

Speaker 1:

Okay, so wrapping this up, we've seen how Storm is really pushing the boundaries here for LLMs generating grounded long-form articles.

Speaker 2:

Definitely, especially by tackling that crucial pre-writing stage, the research and outlining, mimicking how humans approach it, and it's clearly a powerful approach.

Speaker 1:

I mean even experienced human editors see immense value in it, particularly for their own workflow.

Speaker 2:

Absolutely. The potential as a research and outlining tool seems undeniable, based on their feedback.

Speaker 1:

Yet that human evaluation also highlights these fascinating new frontiers, doesn't it? The challenges aren't just about facts anymore.

Speaker 2:

No, it's moved beyond that: the challenges of mitigating source bias, preventing that red herring fallacy where LLMs make improper inferential leaps. That means the path to truly human-quality, neutral, verifiable long-form content is, well, still evolving. There's more work to do.

Speaker 1:

It really feels like this deep dive reveals the future isn't necessarily AI replacing humans in knowledge creation.

Speaker 2:

Probably not entirely no.

Speaker 1:

But maybe it's about a much more sophisticated collaboration. What stands out most to you about this potential shift?

Speaker 2:

For me, it's that focus on the process. STORM isn't just about generating text, it's about simulating the research and structuring process. That feels like a more fundamental step forward. But the subtlety of the remaining challenges, bias, faulty inference, also shows just how complex human critical thinking really is.

Speaker 1:

Yeah. Maybe a final thought for you, the listener: if AI tools like STORM can already be this effective at the discovery and organization stages, the parts that often take the most time, how might this fundamentally change how you approach learning something new?

Speaker 2:

Or how you might even contribute to collective knowledge, like on Wikipedia.

Speaker 1:

Right. Imagine having an AI assistant that not only helps you gather information super efficiently, but maybe in the future could even flag potential biases in your sources, or point out where you might be making an unsupported logical leap as you write.

Speaker 2:

That collaborative future is certainly interesting to think about.