ChatGPT, Research Outputs and Writing for the Web
The reasons are new. The recommendations, not so much.
By this point, pretty much everyone has written about what ChatGPT means for the research sector. (Here’s On Think Tanks founder Enrique Mendizabal on why it may be a game-changer for the knowledge industry.) If you’ve somehow not yet heard of ChatGPT, then I recommend giving Louise Ball’s getting-started primer a read.
If you’ve read anything I’ve written before, then you’ll know that I’m pretty interested in how we publish research outputs on the web. (I wrote a whole book about it!)
So how do you prepare your content for the world of ChatGPT-style AI?
Semantics and modularity.
I’ll explain both. But a bit of context will help.
The flavors of artificial intelligence
Expert systems
Research into artificial intelligence has been going on for a long time. For much of that time, researchers focused on providing detailed rulebooks that would help computers draw conclusions.
That’s a huge project. It means providing definitions for everything and then defining relationships between all those things.
Those definitions and relationships are called ontologies and AI researchers have been creating them for years. (Several of my fellow alumni from UVA’s philosophy PhD program went to work building ontologies for AI startups all the way back in the late ‘90s.)
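To make that concrete, here’s a toy sketch (in Python, with made-up concepts) of what ontology-style definitions and relationships might look like: just things, what they mean, and how they relate to one another.

```python
# A toy sketch of ontology-style knowledge: concepts, definitions, and
# typed relationships between them. All names here are invented.
ontology = {
    "greenhouse_gas": {
        "definition": "A gas that traps heat in the atmosphere",
        "is_a": ["gas"],
    },
    "carbon_dioxide": {
        "definition": "A colorless gas produced by burning carbon",
        "is_a": ["greenhouse_gas"],
    },
}

def is_a(concept: str, category: str) -> bool:
    """Follow 'is_a' links to decide whether a concept falls under a category."""
    parents = ontology.get(concept, {}).get("is_a", [])
    return category in parents or any(is_a(p, category) for p in parents)

print(is_a("carbon_dioxide", "gas"))  # True: CO2 is a greenhouse gas, which is a gas
```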
Modeling all information in this sort of formal way is a monumentally huge task. So most of the AI systems built through this method tend to be single-purpose systems (often called “expert systems”). Deep Blue—the chess-playing system that won a match against then-world champion Garry Kasparov—is probably the best-known example.
The trouble with expert systems is that they don’t scale. There aren’t enough computers in all the land to string together expert systems into anything approaching general intelligence.
So AI researchers hit on another method.
Machine learning
Machine learning systems use an iterative process to gradually “learn” answers. That means feeding in huge sets of data, asking questions, and then correcting wrong answers.
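If you want a feel for that loop, here’s a deliberately tiny sketch in Python. The “model” is a single number, which is nothing like a real system, but the rhythm is the same: guess, check the answer, adjust, repeat.

```python
# A minimal sketch of the machine-learning loop: make a guess, compare it
# to the right answer, nudge the model, repeat. (Toy single-weight model.)
examples = [(1.0, 2.0), (2.0, 4.0), (3.0, 6.0)]  # inputs and correct answers
weight = 0.0
learning_rate = 0.1

for _ in range(100):                         # many passes over the data
    for x, correct in examples:
        guess = weight * x                   # ask the question
        error = guess - correct              # how wrong were we?
        weight -= learning_rate * error * x  # correct the model a little

print(round(weight, 2))  # converges toward 2.0, the rule hiding in the data
```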
Machine learning is what powers much of the AI we were already familiar with — Google’s search algorithms and voice assistants like Siri or Alexa. But those systems still make pretty significant use of all the ontologies that data scientists and information architects have been creating for years. Those are what help machines learn.
Large language models
The newest breed of AI (and, yes, we’re finally coming back to ChatGPT) is a little different. It still uses the basic principles of machine learning. But instead of ontologies, these programs are fed huge quantities of natural language. (You’ll sometimes see the term “large language model” to describe the concept.)
These systems aren’t learning what concepts mean. They’re learning to predict what word comes next in a sentence.
That’s why they’re able to generate new text, rather than just parroting back existing text.
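Here’s that idea reduced to a toy example: count which word tends to follow which in a scrap of text, then predict the most common continuation. Real large language models are vastly more sophisticated, but the job description is the same.

```python
# A toy sketch of "predict the next word": count which word follows which,
# then pick the most common continuation. The corpus is obviously invented.
from collections import Counter, defaultdict

corpus = "the cat sat on the mat and the cat slept by the door".split()

next_words = defaultdict(Counter)
for current, following in zip(corpus, corpus[1:]):
    next_words[current][following] += 1

def predict(word: str) -> str:
    """Return the word most often seen after `word` in the corpus."""
    return next_words[word].most_common(1)[0][0]

print(predict("the"))  # 'cat': it follows 'the' more often than any other word here
```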
The limits of large language models
If you’ve followed much of the conversation about ChatGPT, you may have heard of its tendency to “hallucinate.”
I wanted to write a whole piece about ChatGPT and Harry Frankfurt’s little book On Bullshit, but (a) Clive Thompson beat me to it and (b) it’s better than what I would have written. You should go read it.
But the tl;dr is that LLM-based systems don’t understand anything. They’re just predicting words. They will bullshit (that is, say stuff without knowing whether it’s true) because they don’t know whether anything is true. They are stuck in Searle’s Chinese Room.
What does any of this mean for research content?
This brings us back to semantics and modularity.
The obvious next stage for AI is to pair the LLM-derived ability to generate new text with the semantics-driven variety (that is, the kind that uses ontologies to “understand” things and their relationships to other things).
That means we’ll need to tag our content with semantic meaning.
Unfortunately, it’s not enough just to apply a few taxonomy terms to a 50-page document and call it a day. ChatGPT is reading the stuff inside that 50-page document. And it’s providing answers based on specific chunks of content — individual paragraphs or even specific sentences within a larger paragraph.
Providing better semantics means tagging content at a more granular level.
Of course, we can’t tag content at a granular level unless we’re somehow storing it at a granular level. That means breaking our content into smaller chunks that can then be tagged. Yes, it’s the very same modular content that I’ve been arguing for for *checks notes* half a decade now.
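What might a tagged chunk look like? Here’s one illustrative sketch: a single finding stored as its own object, with machine-readable tags attached directly to it. The field names are loosely modeled on schema.org/JSON-LD conventions, and every value here is invented for the example.

```python
# An illustrative example of a modular, tagged content chunk: one finding
# stored on its own, with semantic metadata attached at the chunk level.
# Field names loosely follow schema.org / JSON-LD conventions; all values
# are placeholders.
import json

tagged_chunk = {
    "@type": "Claim",
    "text": "A single key finding, pulled out of a much longer report.",
    "about": ["climate change", "surface temperature"],
    "sameAs": ["https://www.wikidata.org/wiki/Q7942"],  # illustrative Wikidata link
    "isPartOf": "Chapter 2, Section 2.1 of the full report",
}

print(json.dumps(tagged_chunk, indent=2))
```

Store a report as hundreds of chunks like this, each with its own tags, and a machine has far more to work with than a single PDF ever gives it.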
There are, of course, ways of tagging chunks of content after the fact. A group of volunteers at #SemanticClimate is working on adding semantic markup to the IPCC reports. The work requires a series of steps (one of which is sketched in code after this list):
- converting PDFs into HTML
- running automated searches for acronyms and standard climate-related terms
- creating dictionaries for the terms
- linking those terms to a universal standard (Wikidata) where terms/definitions already exist
- extending Wikidata where terms do not yet exist
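As a rough sketch of what one of those steps might involve, here are a few lines of Python that scan a snippet of HTML for dictionary terms and wrap each match in a link to a Wikidata entry. The terms and Wikidata IDs below are invented for illustration; the project’s real dictionaries are far larger.

```python
# A rough sketch of one step in that workflow: scan HTML for terms from a
# dictionary and wrap each match in a link to its Wikidata entry.
# The terms and Wikidata IDs below are invented for illustration.
import re

term_dictionary = {
    "GHG": "Q000001",   # illustrative Wikidata ID
    "IPCC": "Q000002",  # illustrative Wikidata ID
}

def link_terms(html: str) -> str:
    """Wrap each dictionary term in a link to its Wikidata item."""
    for term, qid in term_dictionary.items():
        url = f"https://www.wikidata.org/wiki/{qid}"
        html = re.sub(rf"\b{re.escape(term)}\b",
                      f'<a href="{url}">{term}</a>', html)
    return html

print(link_terms("<p>The IPCC tracks GHG emissions.</p>"))
```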
It’s a lot of work, featuring a veritable army of folks. (Volunteers are always wanted!)
It’d be far more efficient to do all this work from the start.
That’s going to take a different sort of workflow and a very different kind of website.
The ChatGPT revolution is here. The best time to get your content ready for it was two years ago. The second best time is right now. Get in touch if you’re ready to start!
Joe Miller is principal at Fountain Digital Consulting, where he specializes in helping researchers and research organizations produce content for the brave new digital world.