AI Signals From Tomorrow

The Storm Before the Page

The overwhelming flood of information in our digital world creates a constant challenge: how do we quickly become truly informed on complex topics? This fascinating episode examines whether AI can help solve this problem by creating comprehensive, factual articles similar to high-quality Wikipedia pages. https://arxiv.org/pdf/2402.14207

We explore groundbreaking research on STORM (Synthesis of Topic Outlines Through Retrieval and Multi-perspective Question Asking), a revolutionary system that mirrors how skilled humans approach unfamiliar subjects. Unlike previous AI attempts that skip the crucial pre-writing phase, STORM embraces the messy but essential process of research and organization that happens before any actual writing begins.

The system's brilliance lies in its multi-stage approach. First, it examines related Wikipedia articles to adopt diverse perspectives, ensuring comprehensive coverage. Then, it simulates conversations between these various viewpoints and topic experts, with responses grounded in trusted online sources. Like a detective following leads, each answer informs increasingly sophisticated follow-up questions. Finally, STORM creates a detailed, fact-enriched outline that serves as the foundation for the complete article.
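
For readers who want the mechanics, here is a minimal sketch of that multi-stage pipeline in Python. It is not the paper's actual code: the `llm` helper is a hypothetical stand-in for a call to any chat model.

```python
# Minimal sketch of a STORM-style pipeline, not the authors' implementation.
# `llm` is a hypothetical helper that sends a prompt to a chat model.

def llm(prompt: str) -> str:
    raise NotImplementedError("wire up your model of choice here")

def write_article(topic: str) -> str:
    # Stage 1: survey related articles to find perspectives (roles) to research from.
    perspectives = llm(f"List distinct expert perspectives for researching: {topic}").splitlines()
    # Stage 2: each perspective interrogates a retrieval-grounded topic expert.
    notes = [llm(f"As a {p}, ask and answer five grounded questions about {topic}")
             for p in perspectives]
    # Stage 3: draft an outline from internal knowledge, then refine it with the notes.
    draft = llm(f"Draft a Wikipedia-style outline for: {topic}")
    outline = llm(f"Refine this outline using the research notes below.\n{draft}\n" + "\n".join(notes))
    # Stage 4: expand the fact-enriched outline into the full article.
    return llm(f"Write the article for {topic} following this outline:\n{outline}")
```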

Testing revealed remarkable results: STORM captured nearly all key topics human editors deemed important and significantly outperformed baseline methods in organization and coverage. Ten experienced Wikipedia editors unanimously agreed the system would be valuable for their pre-writing process, with some noting STORM's articles occasionally provided greater depth than certain human-written content.

Yet challenges remain. Beyond avoiding simple factual errors, STORM struggles with "bias transfer" from sources and making unwarranted logical connections between distinct pieces of information. These limitations highlight that while AI can revolutionize information gathering and structuring, it still requires human judgment for truly exceptional content.

What if your next deep dive into an unfamiliar subject began with an AI assistant handling the heavy lifting of research, while you applied your uniquely human critical thinking? That collaborative future may be closer than we think.

Speaker 1:

In our incredibly fast-paced world, how do you get truly well-informed quickly?

Speaker 2:

Yeah, it's tough, right? It really is. The sheer volume of information, it's overwhelming.

Speaker 1:

Exactly, and sifting through it all to find something comprehensive, something grounded. That's a massive challenge, which brings us to a really fascinating question for our digital age: can large language models, LLMs, actually write comprehensive, factual, long-form articles from scratch?

Speaker 2:

Articles like, say, a good Wikipedia page.

Speaker 1:

Yeah, exactly, can they do that? And maybe, more importantly, how would they even start Right? So our mission for this deep dive is to unpack a pretty groundbreaking new research paper that tackles this very challenge head on. A pretty groundbreaking new research paper that tackles this very challenge head on. It focuses on something often overlooked but, honestly, totally crucial for any serious writing.

Speaker 2:

Yeah, the pre-writing stage. Ah yes, the bit before the actual writing.

Speaker 1:

Precisely. We're going to introduce you to an innovative system called STORM and show you exactly how it aims to, well, maybe revolutionize the way AI creates detailed factual content.

Speaker 2:

Indeed, and it's a fascinating area because, while LLMs are amazing at generating text, aren't they?

Speaker 1:

Definitely.

Speaker 2:

Crafting something like a Wikipedia page that's detailed, well-researched and impeccably organized presents some really unique complexities.

Speaker 1:

So it's not just about the writing itself.

Speaker 2:

Not at all. The core challenge here isn't just generating coherent sentences. It's about effectively gathering and structuring, you know, vast amounts of information.

Speaker 1:

And doing that without a human holding its hand the whole time.

Speaker 2:

Exactly. Without constant human intervention. It's a problem that really pushes the boundaries of what these sophisticated models can do autonomously.

Speaker 1:

Okay, so let's unpack this pre-writing thing a bit more. Before we even, you know, put pen to paper or fingers to keyboard, we humans do a lot in that stage. What does that messy but crucial process really involve for us?

Speaker 2:

Well, for humans, that pre-writing stage is all about thorough research, right? Diligent information gathering, meticulous planning.

Speaker 1:

And outlining. We start with an outline.

Speaker 2:

Crucially, yes. Crafting a solid, comprehensive outline. And what's particularly interesting here is that prior work on AI generating these kinds of articles often bypassed this exact stage. It usually assumed that reference documents or outlines were already provided.

Speaker 1:

So they skipped the hard part.

Speaker 2:

In a way, yes, they focused more on just expanding existing sections. This approach sidestepped the really demanding tasks of identifying, evaluating and organizing external sources which, let's be honest, are challenging even for experienced human writers.

Speaker 1:

Right, and that highlights a pretty big limitation of current LLMs if you try to get them to do this directly, doesn't it?

Speaker 2:

It does.

Speaker 1:

Like if you just ask an LLM, hey, give me 30 questions about topic X, or write me an article on Y, what happens? It often just spits out really basic what, when, where questions. The paper gives examples like "When was the opening ceremony held?", "Where was the opening ceremony held?", "How many countries participated?"

Speaker 2:

Very factual, but basic. It's like surface-level stuff.

Speaker 1:

Exactly. While those are facts, this direct prompting approach typically leads to very shallow information, a real lack of depth.

Speaker 2:

And sometimes worse, right For less common topics.

Speaker 1:

Oh, definitely For less common topics. It can even lead to outright hallucinations, just making stuff up and presenting it as fact.

Speaker 2:

Because the LLM's internal knowledge just isn't enough for that kind of grounded, verifiable, long-form content.

Speaker 1:

Precisely, it needs external information.

Speaker 2:

And an interesting parallel the paper draws here is to human learning theories. It points out that asking effective questions, not just the basic ones, is absolutely key to acquiring new, meaningful information.

Speaker 1:

Makes sense. Better questions lead to better answers.

Speaker 2:

Exactly. So this challenge of better research, moving beyond those basic queries to truly delve into a topic and uncover its nuances, that's precisely where this new system, STORM, comes in.

Speaker 1:

Okay, STORM. This is where the paper proposes its core innovation. Let's dive into it. First off, what does STORM even stand for?

Speaker 2:

Right. STORM stands for Synthesis of Topic Outlines Through Retrieval and Multi-perspective Question Asking.

Speaker 1:

OK, quite a mouthful. Synthesis, retrieval, multi-perspective. What's the fundamental idea?

Speaker 2:

So its core design is driven by two key hypotheses really. First, that having diverse perspectives helps you ask a broader range of varied, deeper questions.

Speaker 1:

Like looking at a topic from different angles.

Speaker 2:

Exactly. And second, that formulating those truly in-depth questions requires iterative research. You have to constantly build on the answers you receive.

Speaker 1:

Kind of like a detective piecing together clues.

Speaker 2:

That's a great analogy you find one clue, it leads to another question, and so on.

Speaker 1:

That sounds, well, it sounds a lot like how I might research a complex topic, actually.

Speaker 2:

It mimics that human process.

Speaker 1:

Okay, so let's walk through Storm's multi-stage approach. Stage one is about perspectives. How does Storm make sure it gets broad coverage right from the start?

Speaker 2:

It's quite clever, actually. It begins by surveying existing Wikipedia articles, but on similar topics to the one it needs to research.

Speaker 1:

Okay, similar topics. Why?

Speaker 2:

It extracts their tables of contents and from those it identifies various potential perspectives, or you could call them roles.

Speaker 1:

Can you give an example?

Speaker 2:

Sure. Say it's writing about the 2022 Winter Olympics opening ceremony. It might look at other Olympic ceremony articles and identify an event planner perspective.

Speaker 1:

Ah, and an event planner would ask different questions, right?

Speaker 2:

That perspective would logically lead to questions about, say, transportation arrangements or the budget. Things a general query might miss. And to make sure it covers the absolute basics, it also includes a basic fact writer perspective by default.

Speaker 1:

So it gets the fundamentals and the broader context.

Speaker 2:

That's the idea Ensuring fundamental facts aren't missed while still encouraging a much broader exploration.
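
As a rough illustration of this stage, the sketch below mines the tables of contents of related articles and asks the model to propose roles. The `llm` helper and the toy data are assumptions for illustration, not the paper's API.

```python
# Sketch of perspective discovery. `llm` is a hypothetical helper; the
# related tables of contents here would come from similar Wikipedia articles.

def llm(prompt: str) -> str:
    raise NotImplementedError

def discover_perspectives(topic: str, related_tocs: dict[str, list[str]]) -> list[str]:
    toc_text = "\n".join(f"{title}: " + ", ".join(secs) for title, secs in related_tocs.items())
    prompt = (f"Topic: {topic}\n"
              f"Tables of contents of related articles:\n{toc_text}\n"
              "Name five distinct editorial perspectives, one per line, that "
              "together would cover this topic comprehensively.")
    found = [p.strip() for p in llm(prompt).splitlines() if p.strip()]
    # A default "basic fact writer" guarantees the fundamentals are covered.
    return ["basic fact writer"] + found

# Toy usage:
# discover_perspectives("2022 Winter Olympics opening ceremony",
#     {"2018 Winter Olympics opening ceremony":
#         ["Proceedings", "Parade of Nations", "Broadcasting"]})
```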

Speaker 1:

OK, stage two: simulating conversations. This sounds intriguing. Is it like AI chatbots talking to each other?

Speaker 2:

Sort of. The system essentially personifies an LLM as a Wikipedia writer that adopts one of those specific perspectives we just talked about.

Speaker 1:

Okay, so you have an AI playing the role of, say, the event planner.

Speaker 2:

Exactly. This writer then asks questions related to its perspective, like "Can you provide me with a list of the participating countries?" Or maybe something more specific, like "What were the main logistical challenges for spectator transport?"

Speaker 1:

And who answers?

Speaker 2:

A simulated topic expert provides answers, but and this is crucial these answers are grounded.

Speaker 1:

Grounded, meaning?

Speaker 2:

Meaning they're based on information retrieved from trusted Internet sources. It's not just making things up.

Speaker 1:

OK, so it's fetching real info.

Speaker 2:

Yes, and it's a truly iterative process. The answers provided by the expert lead the writer to ask follow-up questions.

Speaker 1:

Ah, so it allows for much deeper exploration, like that detective analogy again.

Speaker 2:

Precisely. Just like a human researcher refining their queries as they learn more. The system even breaks down complex search queries and filters out untrustworthy sources based on Wikipedia's own guidelines.

Speaker 1:

That's pretty sophisticated. Ensuring reliability.

Speaker 2:

Yeah, it aims for that.
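
A sketch of what such a simulated conversation loop might look like is below. `llm` and `search` are hypothetical stand-ins for the model and the trusted-source retriever, not the paper's actual interfaces.

```python
# Sketch of the writer/expert loop. Each grounded answer seeds the next,
# more specific question. `llm` and `search` are hypothetical stand-ins.

def llm(prompt: str) -> str:
    raise NotImplementedError

def search(query: str) -> list[str]:
    raise NotImplementedError("plug in retrieval over trusted sources")

def simulate_conversation(topic: str, perspective: str, turns: int = 4) -> list[tuple[str, str]]:
    history: list[tuple[str, str]] = []
    for _ in range(turns):
        so_far = "\n".join(f"Q: {q}\nA: {a}" for q, a in history)
        question = llm(f"You are a Wikipedia writer with the perspective "
                       f"'{perspective}' researching '{topic}'.\n"
                       f"Conversation so far:\n{so_far}\n"
                       "Ask your next, more specific question.")
        # Ground the expert's answer strictly in retrieved snippets.
        snippets = search(question)
        answer = llm("Answer using ONLY these sources:\n" + "\n".join(snippets)
                     + f"\nQuestion: {question}")
        history.append((question, answer))
    return history
```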

Speaker 1:

So, after all this intense research and these simulated chats from different angles, what happens next? How does all this raw information turn into an organized article? That's stage three, right? Creating the outline.

Speaker 2:

Right. So after all those multi-turn conversations from different perspectives, essentially after thoroughly researching the topic, STORM has this wealth of information.

Speaker 1:

A big pile of facts and answers.

Speaker 2:

A very organized pile, hopefully. It then first prompts the LLM to generate a draft outline based on its own internal knowledge.

Speaker 1:

Kind of a first pass.

Speaker 2:

Exactly. This draft typically provides a general, you know, organized framework. But then, and this is key, it refines this draft outline. How? Using all the rich, factual, grounded information it gathered during those simulated conversations.

Speaker 1:

Ah, so it injects the research findings into the structure.

Speaker 2:

Precisely. This multi-level, fact-enriched outline becomes the robust foundation for generating the full article later. It's truly structuring knowledge before writing.

Speaker 1:

That makes a lot of sense. Build the skeleton before you add the flesh.

Speaker 2:

Couldn't have put it better myself.
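
In code, the two-pass idea might look roughly like this sketch, again with a hypothetical `llm` helper: a skeleton from internal knowledge, then a refinement pass over the grounded Q&A.

```python
# Sketch of stage three: draft an outline from parametric knowledge, then
# refine it with the grounded conversations. `llm` is hypothetical.

def llm(prompt: str) -> str:
    raise NotImplementedError

def make_outline(topic: str, conversations: list[list[tuple[str, str]]]) -> str:
    # First pass: a generic skeleton from the model's internal knowledge.
    draft = llm(f"Write a hierarchical section outline for a Wikipedia article on: {topic}")
    # Second pass: fold in the grounded Q&A gathered from every perspective.
    research = "\n".join(f"Q: {q}\nA: {a}" for conv in conversations for q, a in conv)
    return llm(f"Improve this draft outline for '{topic}' so it reflects the research below.\n"
               f"Draft outline:\n{draft}\n\nResearch notes:\n{research}")
```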

Speaker 1:

Okay, this all sounds great in theory, but does it actually work? Yeah, how did the researchers test STORM? You need rigorous evaluation for this stuff.

Speaker 2:

Absolutely. They set up a very smart evaluation process. First, they curated a special data set called FreshWiki.

Speaker 1:

FreshWiki. Okay, what's special about it?

Speaker 2:

It consists of recent high-quality Wikipedia articles, specifically B-class or above, which means they're pretty well-developed and fact-checked. And, crucially, these articles were created or heavily edited after the training cutoff date of the LLMs being tested.

Speaker 1:

Ah, so the LLM couldn't have just memorized these articles during its training?

Speaker 2:

Exactly right. This is super important to avoid data leakage and ensure the LLM is genuinely generating something new based on the process, not just regurgitating.
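
The filtering idea is simple enough to sketch. The record shape and cutoff date below are illustrative assumptions, not the paper's exact criteria.

```python
# Sketch of the FreshWiki idea: keep only well-developed articles created or
# heavily edited after the model's training cutoff, so memorization is ruled out.

from dataclasses import dataclass
from datetime import date

@dataclass
class Article:
    title: str
    quality: str            # Wikipedia assessment class, e.g. "B", "GA", "FA"
    last_major_edit: date

TRAINING_CUTOFF = date(2023, 4, 30)   # assumption: your model's cutoff date
GOOD_ENOUGH = {"B", "GA", "FA"}       # B-class or above

def fresh_enough(a: Article) -> bool:
    return a.quality in GOOD_ENOUGH and a.last_major_edit > TRAINING_CUTOFF

corpus = [Article("Example topic", "B", date(2023, 9, 1))]
freshwiki = [a for a in corpus if fresh_enough(a)]
```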

Speaker 1:

Clever. Okay, so they have the data. What did they measure? How do you measure if an AI-generated outline or article is good?

Speaker 2:

Good question. They looked at two main things. First, outline quality: they measured how much of the human-written outline's content the STORM-generated outline managed to cover.

Speaker 1:

Comparing the AI outline to the human outline. Makes sense.

Speaker 2:

Yeah, using metrics like heading soft recall and heading entity recall. Technical, maybe.

Speaker 1:

Break it down for us.

Speaker 2:

Sure. Heading soft recall basically measures how well STORM's outline captured the key concepts or topics a human would include. Heading entity recall looked at how many specific names, places, things, entities, it identified for the outline sections.
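
To make those two metrics concrete, here is a toy version. The real evaluation uses sentence embeddings and a proper named-entity recognizer; this sketch substitutes crude token overlap and hand-supplied entity sets.

```python
# Toy illustration of the two outline metrics. Real implementations use
# sentence embeddings and NER; token overlap stands in for similarity here.

def similarity(a: str, b: str) -> float:
    ta, tb = set(a.lower().split()), set(b.lower().split())
    return len(ta & tb) / len(ta | tb) if ta | tb else 0.0

def heading_soft_recall(human: list[str], generated: list[str]) -> float:
    # Credit each human heading with its best-matching generated heading.
    return sum(max(similarity(h, g) for g in generated) for h in human) / len(human)

def heading_entity_recall(human_entities: set[str], generated_entities: set[str]) -> float:
    # Fraction of the human outline's entities that the generated outline mentions.
    return len(human_entities & generated_entities) / len(human_entities)

print(heading_soft_recall(["Opening ceremony", "Participating nations"],
                          ["Ceremony overview", "Nations and athletes"]))
```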

Speaker 1:

Got it. Concepts and specifics. And for the full article?

Speaker 2:

For the full article quality, they used standard measures like ROUGE scores.

Speaker 1:

Which compare text overlap basically.

Speaker 2:

Yeah, comparing the AI text against human text for overlap in phrases and ideas, plus entity recall again. But they also used more human-like criteria.

Speaker 1:

Like what.

Speaker 2:

Things like interest level, coherence and organization, relevance and focus, coverage and, importantly, verifiability, judged by another LLM in this case.
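
A sketch of what that LLM-as-judge setup could look like is below. The 1-to-5 scale and single-digit reply format are our assumptions for illustration; the paper's exact grading prompt may differ.

```python
# Sketch of rubric-based grading by an evaluator LLM. `llm` is hypothetical.

def llm(prompt: str) -> str:
    raise NotImplementedError

CRITERIA = ["interest level", "coherence and organization",
            "relevance and focus", "coverage", "verifiability"]

def judge(article: str) -> dict[str, int]:
    scores = {}
    for criterion in CRITERIA:
        reply = llm(f"Rate the following article on '{criterion}' from 1 (poor) "
                    f"to 5 (excellent). Reply with a single digit.\n\n{article}")
        scores[criterion] = int(reply.strip()[0])
    return scores
```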

Speaker 1:

Okay, a pretty comprehensive evaluation. So the big question what were the results? Did STORM actually make a difference compared to, say, just asking GPT-4 directly?

Speaker 2:

Oh, it absolutely did. The difference was significant. When they compared STORM to other LLM-based baselines, like direct prompting or basic retrieval-augmented generation (RAG), STORM significantly outperformed them in outline quality. Using GPT-4, for instance, STORM achieved an impressive 92.73% heading soft recall.

Speaker 1:

Whoa, 92%? That sounds high.

Speaker 2:

It's very high and 45.91% heading entity recall.

Speaker 1:

Okay, what do those numbers actually mean, though?

Speaker 2:

Well, effectively, that 92% soft recall tells us that Storm's outlines were capturing almost all the key topics or sections that a human editor thought were important for that subject. It was nearly as comprehensive conceptually.

Speaker 1:

Okay, that is impressive. And the entity recall?

Speaker 2:

Nearly half the specific entities, which is also quite good, showing it's not just vague topics but getting down to specifics. Essentially, STORM's outlines were way more detailed and comprehensive than the baselines.

Speaker 1:

So it's genuinely structuring the knowledge much better.

Speaker 2:

That's the key takeaway for the outline stage, yes.

Speaker 1:

And for the full article quality, how did that stack up?

Speaker 2:

Articles generated using the STORM outline were rated significantly higher by the evaluator LLM, particularly in interest level, relevance and focus, and coverage.

Speaker 1:

So better outlines lead to better articles Makes sense.

Speaker 2:

It does, and they did ablation studies too.

Speaker 1:

Where they take parts of STORM away to see what happens?

Speaker 2:

Exactly. Those confirmed that both the perspective discovery part and the simulated conversations part were vital. You couldn't just remove them.

Speaker 1:

What happened if you removed the conversations?

Speaker 2:

Removing the multi-turn conversations specifically led to much worse results. It really proved that the iterative research simulation is key.

Speaker 1:

So asking questions, getting answers, asking more questions.

Speaker 2:

Yeah.

Speaker 1:

That whole loop is critical.

Speaker 2:

Absolutely critical. And the study also explicitly confirmed that having that dedicated outline stage is necessary. Removing it significantly deteriorates the final article's performance.

Speaker 1:

You can't just jump straight to writing.

Speaker 2:

Not if you want quality. Oh, and one more thing on quality: citations.

Speaker 1:

Ah, important for factual articles.

Speaker 2:

Very. An impressive 84.83% of sentences in STORM-generated articles were judged to be supported by their citations.

Speaker 1:

Okay, that's a strong number for verifiability.

Speaker 2:

It really is. It shows the grounding is working quite well.
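
One way such a citation check could be run is sketched below: ask a model, per sentence, whether the cited sources support the claim. The yes/no protocol and the `llm` helper are assumptions; the paper's exact judging setup may differ.

```python
# Sketch of a sentence-level citation-support check. `llm` is hypothetical.

def llm(prompt: str) -> str:
    raise NotImplementedError

def citation_support_rate(sentences_with_sources: list[tuple[str, list[str]]]) -> float:
    supported = 0
    for sentence, sources in sentences_with_sources:
        verdict = llm("Do the sources below fully support the claim? Answer yes or no.\n"
                      f"Claim: {sentence}\nSources:\n" + "\n".join(sources))
        supported += verdict.strip().lower().startswith("yes")
    return supported / len(sentences_with_sources)
```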

Speaker 1:

The results certainly paint a compelling picture. Numbers look good, the process seems sound, but does it hold up to human scrutiny? What did actual human experts think, especially experienced Wikipedia editors?

Speaker 2:

Ah, yes, the human evaluation. This is where it gets really insightful, I think, because their feedback gives you a unique window into the real world usefulness and also the remaining challenges.

Speaker 1:

So what did they do?

Speaker 2:

They invited 10 experienced Wikipedia editors people who really know what makes a good encyclopedia article to evaluate Storm's output against the baselines.

Speaker 1:

Okay, the real test. What did they find?

Speaker 2:

Well, first the positives. The editors found STORM's articles to be noticeably more organized. There was a 25% absolute increase in articles they deemed organized, compared to the best baseline method.

Speaker 1:

25%? That's a big jump in perceived organization.

Speaker 2:

It is, and they also found Storm had broader coverage, showing a 10% increase there.

Speaker 1:

So more organized and covering more ground.

Speaker 2:

Exactly. Some editors even praised STORM's output specifically for providing, and I quote, "a bit more background information" and "more depth" compared to even some human-written articles.

Speaker 1:

Wow. More depth than human articles.

Speaker 2:

Yeah.

Speaker 1:

Quite a compliment coming from experienced editors.

Speaker 2:

It's a remarkable comment. Absolutely, it suggests Storm isn't just summarizing, it's potentially synthesizing information in a useful way.

Speaker 1:

And did they think it was actually useful for them? Like, could they see themselves using this?

Speaker 2:

This was really interesting. The editors were unanimous. All 10 agreed that Storm could be specifically helpful for their pre-writing stage.

Speaker 1:

Unanimous. Okay, helpful how?

Speaker 2:

Things like collecting sources, generating that initial outline, basically tackling the research grunt work.

Speaker 1:

Makes sense. That's often the most time-consuming part.

Speaker 2:

Right. And 80% thought it would help them edit a Wikipedia article on a completely new topic, and 70% found it a potentially useful tool for the Wikipedia community at large.

Speaker 1:

So a strong endorsement for its potential as a research assistant, essentially.

Speaker 2:

Absolutely. It sounds like, as you suggested, it's doing that heavy lifting, that initial research and structuring, almost like a very fast, very diligent research assistant.

Speaker 1:

OK, so clearly a powerful tool. Lots of praise, but there's always a but, isn't there? What challenges did these expert reviewers still identify? Where does AI content, even from Storm, still fall short?

Speaker 2:

That's the crucial next question. And they identified issues beyond just simple factual hallucination, which Storm seems pretty good at avoiding thanks to the grounding.

Speaker 1:

Okay, so what new problems emerged?

Speaker 2:

Two significant, more subtle challenges came up. First, a major one was source bias transfer, meaning the articles often contained emotional or unneutral language. They were directly transferring biases, or maybe promotional phrasing, from the Internet sources STORM used.

Speaker 1:

Ah, so if the source website was biased, that bias could leak into the generated article.

Speaker 2:

Exactly. The AI isn't necessarily equipped yet to identify and neutralize that bias perfectly. It reflects the imperfections of its source material, the internet itself.

Speaker 1:

That's a tricky problem. What was the second challenge?

Speaker 2:

This one is perhaps even more crucial and subtle. They called it the over-association of unrelated facts.

Speaker 1:

Over-association, like connecting things that shouldn't be connected.

Speaker 2:

Precisely. Editors noted instances of what you might call a red herring fallacy or over-speculation. The AI sometimes fabricated unverifiable connections between different pieces of information it found, or between some information and the main topic itself.

Speaker 1:

So it wasn't just getting facts wrong, it was making faulty inferences or drawing unsupported conclusions.

Speaker 2:

Exactly. It's making logical leaps that aren't actually justified by the evidence it gathered. This goes way beyond basic fact checking. It requires a much higher level of critical thinking, a kind of discernment about logical inference, to spot these false connections.

Speaker 1:

That feels like a much harder problem to solve than just factual accuracy.

Speaker 2:

It likely is. And the editors noted that, while STORM was significantly better than the other AI methods, the machine-generated articles still weren't quite as informative or nuanced as well-revised human articles. So, good starting points, maybe even great starting points. Yes, potentially excellent drafts or research summaries, but they still need that human touch, that critical eye, for true polish, neutrality and logical soundness.

Speaker 1:

Okay, so wrapping this up, we've seen how Storm is really pushing the boundaries here for LLMs generating grounded long-form articles.

Speaker 2:

Definitely, especially by tackling that crucial pre-writing stage, the research and outlining, mimicking how humans approach it, and it's clearly a powerful approach.

Speaker 1:

I mean even experienced human editors see immense value in it, particularly for their own workflow.

Speaker 2:

Absolutely. The potential as a research and outlining tool seems undeniable, based on their feedback.

Speaker 1:

Yet that human evaluation also highlights these fascinating new frontiers, doesn't it? The challenges aren't just about facts anymore.

Speaker 2:

No, it's moved beyond that: the challenges of mitigating source bias, preventing that red herring fallacy where LLMs make improper inferential leaps. That means the path to truly human-quality, neutral, verifiable long-form content is, well, still evolving. There's more work to do.

Speaker 1:

It really feels like this deep dive reveals the future isn't necessarily AI replacing humans in knowledge creation.

Speaker 2:

Probably not entirely no.

Speaker 1:

But maybe it's about a much more sophisticated collaboration. What stands out most to you about this potential shift?

Speaker 2:

For me, it's that focus on the process. STORM isn't just about generating text, it's about simulating the research and structuring process. That feels like a more fundamental step forward. But the subtlety of the remaining challenges, bias, faulty inference, also shows just how complex human critical thinking really is.

Speaker 1:

Yeah. Maybe a final thought for you, the listener: if AI tools like STORM can already be this effective at the discovery and organization stages, the parts that often take the most time, how might this fundamentally change how you approach learning something new?

Speaker 2:

Or how you might even contribute to collective knowledge, like on Wikipedia.

Speaker 1:

Right. Imagine having an AI assistant that not only helps you gather information super efficiently, but maybe in the future could even flag potential biases in your sources, or point out where you might be making an unsupported logical leap as you write.

Speaker 2:

That collaborative future is certainly interesting to think about.