A Practical Guide to Trustworthy AI for Social & Media Monitoring

5 July 2023 | AMEC Innovation Series | Ben Sigerson, Converseon

As exciting as recent developments in AI have been, it’s no secret that distrust in AI and confusion over how it works have kept pace. The current furor surrounding generative large language models (LLMs) and their tendency to “hallucinate facts” is just the most recent in a long line of ethical and data quality concerns surrounding AI. And yet there are few solutions aimed at solving these problems, at least in the social listening and media monitoring industries.

It’s easy to understand why. Almost everyone can align on the exciting potential of AI to make our lives and work more efficient. For example, at Converseon we’ve seen our NLP models cut the time our clients spend on manual QA and data cleaning by 80%. That improved accuracy has also enabled us to use our model outputs to predict business outcomes such as revenue or share price, directly demonstrating the value of unsolicited social media and news data for attribution and better business decision-making. But it’s much harder to quantify the nature, scope and long-term ramifications of AI’s ethical and data quality problems, perhaps because these vary so much depending on where and for what purpose AI is being used.

Converseon has been in the field of AI-powered natural language processing (NLP) for over a decade, specifically focused on its application to social media, mainstream media and voice-of-customer text (referred to henceforth as “conversation data”). We built our first AI-powered NLP model (or “classifier”) in 2008, initially as a way to automate the time-consuming and costly manual data labeling we were doing for our social insights reporting. As early as 2012 we began integrating our models with monitoring platforms and applying them to large, streaming feeds of conversation data.

With that shift to continuous, “real-time” feeds of conversation data, we encountered a now well-known challenge of using AI for social and media monitoring: as conversation data changes, model accuracy will change and often degrade, potentially to the point of making the model’s classifications unusable. More than ten years later, this remains the status quo for most AI-powered NLP in social listening and media monitoring. (If you don’t believe me, just ask your social/media monitoring users how much they trust the out-of-box sentiment analysis in the platforms they use.)

So, how do we get AI-powered social and media monitoring that is accurate out of the box and stays accurate over time? If we can solve this, we may have addressed a core driver of distrust in AI in our industry.

The solution starts with good training data governance: carefully curated and labeled training data is critical for building high-performance NLP models, especially when using state-of-the-art LLMs and transformer-based learning architectures. The charts below show this clearly. Interestingly, there is not much payoff in upgrading to LLMs when your use-case-specific training data isn’t carefully labeled and curated. Only with LLMs tuned on high-quality data (“strict training data governance”) do we see a very significant lift in model performance. The chart on the left also hints at why, despite LLMs existing for half a decade, we haven’t seen a widespread increase in the accuracy of out-of-box AI-powered NLP in social and media monitoring: lax training data governance is still widespread.
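To make “strict training data governance” a little more concrete, here is a minimal sketch of the kind of automated gate that curated, human-labeled data might pass through before it is allowed into a fine-tuning set. This is illustrative only, not Converseon’s pipeline: the sentiment label schema, the field names and the unanimity rule are all assumptions made for the example.

```python
# A minimal sketch of a training-data governance gate (illustrative assumptions
# only): every record must have non-empty text, at least two independent human
# labels, labels drawn from the agreed schema, and full labeler agreement.
from collections import Counter
from dataclasses import dataclass

ALLOWED_LABELS = {"positive", "negative", "neutral"}  # assumed label schema


@dataclass
class TrainingRecord:
    text: str
    labels: list[str]  # independent labels from different human annotators


def passes_governance(record: TrainingRecord, min_annotators: int = 2) -> bool:
    """Return True only if the record is fit to enter a fine-tuning set."""
    if not record.text.strip():
        return False                                # no empty documents
    if len(record.labels) < min_annotators:
        return False                                # require multiple annotators
    if any(lbl not in ALLOWED_LABELS for lbl in record.labels):
        return False                                # labels must match the schema
    _, count = Counter(record.labels).most_common(1)[0]
    return count == len(record.labels)              # keep only unanimous records


raw = [
    TrainingRecord("Great crew, terrible delays.", ["negative", "negative"]),
    TrainingRecord("Flight was fine.", ["neutral", "positive"]),   # disagreement: dropped
    TrainingRecord("   ", ["neutral", "neutral"]),                 # empty text: dropped
]
curated = [r for r in raw if passes_governance(r)]
print(f"Kept {len(curated)} of {len(raw)} records for fine-tuning.")
```

Real governance policies go much further than this, of course, but even a simple, enforced gate like the one sketched above is more than many training pipelines apply today.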
The MIT Technology Review put it plainly in a recent article about the negative repercussions of lax data governance (or the absence of any governance at all): “Tech companies don’t document how they collect or annotate AI training data and don’t even tend to know what’s in the data set, says Nithya Sambasivan, a former research scientist at Google and an entrepreneur, whose 2021 paper laid out the ways the AI industry undervalues data. […] That’s because tons of work has gone into developing cutting-edge techniques for AI models, while data collection methods have barely changed in the past decade. In the AI community, work on AI models is overemphasized at the expense of everything else, says Sambasivan. ‘Culturally, there’s this issue in machine learning where working on data is seen as silly work and working on models is seen as real work.’ […] ‘As a whole, data work needs significantly more legitimacy,’ Sambasivan says.”

It’s a strange state of affairs. A quick Google search of the phrase “garbage in, garbage out” shows that it has its own Wikipedia page and has been in use at least since 1957. We all understand intuitively that mediocre inputs will yield mediocre outputs. So why is training data governance such an afterthought?

It might be because good training data governance has historically been costly, time-consuming and very difficult to achieve. Without the right tools and deep human expertise, these efforts can easily amount to a set of guidelines, policies and manual workflows that are not adhered to, or cannot be put into practice at scale.

The good news: this is a surmountable problem, and there are now AI providers who are successfully implementing AI-powered NLP at enterprise scale with good training data governance at their foundation. But picking these vendors out of the crowd will require a more educated buyer. Next time you’re approached by an AI vendor or hear claims about high-performance AI, I hope you’ll ask about their training data governance. Do they have humans labeling their training data? Do they vet their human labelers? How do they isolate the “raw” data samples that will be passed to the human labelers? Is that process scalable? Do they outsource the labeling to third-party labeling services, or do they run it all in-house? Do they have clear definitions for all labeling tasks, and if so, how do they ensure that labelers understand and adhere to them? Successful training data governance policies require strategies for addressing all of these questions and more.
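On that last question, one concrete check (an illustrative technique, not a description of any particular vendor’s process) is to have two labelers code the same overlap sample against the same task definition and measure their chance-corrected agreement with Cohen’s kappa. The labels below are invented for the example.

```python
# Measuring inter-annotator agreement on an overlap sample with Cohen's kappa.
# Low agreement usually means the labeling definition, not the labelers, needs work.
from sklearn.metrics import cohen_kappa_score

labeler_a = ["positive", "negative", "neutral", "negative", "positive", "neutral"]
labeler_b = ["positive", "negative", "neutral", "positive", "positive", "neutral"]

kappa = cohen_kappa_score(labeler_a, labeler_b)
print(f"Cohen's kappa on the overlap sample: {kappa:.2f}")

# A common (and debatable) rule of thumb treats kappa above ~0.8 as strong
# agreement; much lower is a signal to clarify the guidelines or re-train the
# labelers before their labels are allowed into the training set.
if kappa < 0.8:
    print("Agreement is low: review the label definitions with the labeling team.")
```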
But good training data governance isn’t enough on its own. To build trustworthy AI for social and media monitoring, we need NLP models that are accurate out of the box and easy to continually maintain. As mentioned above, even the best NLP models will likely suffer performance degradation as conversation data changes over time. At its core, ongoing model maintenance requires a scalable workflow for ongoing performance testing, where your tests are statistically representative of the model’s performance on your data. It’s not enough to test a single model once on a generic set of test data and then retrain based on that test. You need a way to test dozens or even hundreds of models on different datasets, at regular time intervals.

At Converseon, the process looks something like this: models are deployed to live data feeds, statistically representative test samples are drawn at regular intervals, humans label those samples, performance metrics are calculated from the labels, and models are retrained where needed.

This may look like it couldn’t scale, involving dozens of service hours per client, per time interval. And if your current model testing workflow involves deploying models from custom scripts only your ML engineer can run, sharing data via email or chat, human labeling in Excel sheets, and custom macros or scripts for calculating performance metrics, that would be true. Each step of this process, from deploying models, to generating representative test samples, to detecting semantic change (“drift”), to labeling those test samples, to calculating and reporting performance metrics based on those labels, to using labeled test data to retrain the models, must be as platform-supported, automated and “push-button” as possible.

But there is a bigger issue to address than platform-supported automation. We all know that the potential “scale killer” in this process is human labeling. At Converseon, we’ve developed a patent-pending algorithm that massively simplifies the process of human labeling, enabling us to turn ongoing model performance testing into a highly scalable workflow. As Converseon’s team labels auto-generated test samples, the algorithm doesn’t just compute the model’s performance metrics; it also computes the statistical certainty that the model’s performance on the sample represents its performance on the full data population. This certainty estimation updates on the fly, as test data is being labeled, enabling us to label the minimum sample size required for a highly confident (95%) estimation of the model’s performance. Without this algorithm, labeling is an inefficient “bottomless pit”, where labelers never know how much labeling is enough to confidently ascertain a model’s current performance.

Let’s look at an example. In the screenshot below, we see a classification workflow in Converseon’s Conversus platform. In a nutshell, NLP models are being deployed to specific sets of conversation data about a few airlines in a social listening platform (Brandwatch). During that process, a statistically representative sample of the processed data is collected at user-specified intervals (in this case, every month). From there, the sample is pushed to a labeling interface, where human labelers supply “ground truth” labels against which the model’s classifications will be compared. These labels are the basis for ascertaining the model’s accuracy and other performance metrics: the better the model mirrors the “ground truth” labels assigned by the human labelers, the more accurate it is.

As described above, Conversus’ certainty estimation algorithm lets us rapidly ascertain the model’s performance from a very small set of human-labeled test data. In the screenshot below, we see that by labeling just 101 records from the sample, we can say with 95% confidence that the model’s accuracy is between 94% and 100%, and its F1 score is between 0.92 and 1.
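Converseon’s certainty estimation algorithm is patent-pending and its details aren’t public, so the sketch below only illustrates the general statistical idea behind this kind of workflow: update a 95% confidence interval on accuracy as each human label arrives, and stop labeling once the interval is tight enough. It uses a standard Wilson score interval on a simulated labeling session; the simulated model’s accuracy, the stopping threshold and the names are all assumptions made for the example.

```python
# Sequentially updated confidence interval on accuracy, used as a stopping rule
# for human labeling. Illustrative only; not the patent-pending algorithm
# described in the article.
import math
import random

Z = 1.96  # two-sided ~95% confidence


def wilson_interval(correct: int, n: int, z: float = Z) -> tuple[float, float]:
    """95% Wilson score interval for a proportion (here: classification accuracy)."""
    if n == 0:
        return 0.0, 1.0
    p = correct / n
    denom = 1 + z**2 / n
    center = (p + z**2 / (2 * n)) / denom
    half_width = z * math.sqrt(p * (1 - p) / n + z**2 / (4 * n**2)) / denom
    return max(0.0, center - half_width), min(1.0, center + half_width)


random.seed(7)
correct = n = 0
low, high = 0.0, 1.0
# Simulate a labeling session against a model whose true accuracy is ~96%;
# stop as soon as the interval is "tight enough" (threshold chosen arbitrarily).
while (high - low) > 0.08 and n < 1000:
    label_matches_model = random.random() < 0.96  # stands in for a human label check
    n += 1
    correct += label_matches_model
    low, high = wilson_interval(correct, n)

print(f"Labeled {n} records: accuracy is in [{low:.0%}, {high:.0%}] at ~95% confidence.")
# Note: adaptively stopping on a fixed-sample interval is a simplification; a
# production workflow would need to account for the effects of sequential testing.
```

Even this simplified version shows the practical payoff: a confident read on model accuracy from on the order of a hundred labeled records rather than thousands, which is what makes monthly testing across many models and datasets feasible.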
Just as with good training data governance, pushing the social and media monitoring industries to adopt model maintenance best practices will likely come down to more educated buyers making more educated demands of vendors. We all know AI isn’t magic, and yet the lack of insistence on training data and model maintenance transparency implies that we’re content to treat it like a wondrous black box. Of the RFPs Converseon has received so far in 2023, almost all have asked whether we use AI-powered NLP techniques, but none have asked how we manage our training data or ensure high-quality model performance over time.

That needs to change if we want trustworthy AI for social and media monitoring. To drive this shift in buyer and vendor thinking, perhaps the power of AI as a driver of business value needs to become clearer. At Converseon, our approach to delivering trustworthy and transparent AI is the result of more than a decade of R&D, but that R&D has always been guided by the knowledge that AI can deliver highly relevant insights, with clear ties to business outcomes, at a scale and speed that humans can’t match. As briefly mentioned at the outset, we’ve been able to show clear statistical relationships between conversation data and revenue. AI’s ability to accurately pick relevant conversations out of the noise has been central to this and will continue to be. But for AI to play such a clear role in generating business insight, good training data governance and ongoing model maintenance are key.

By Ben Sigerson, VP of Solutions, Converseon

About the author: Ben Sigerson is VP of Solutions at Converseon, a leading AI-powered NLP provider that uses news, social and VoC data to predict business outcomes and drive better decision-making. Ben has been with Converseon for over a decade and has played many key roles, helping to lead the company’s transition from a successful social media consultancy and early AI innovator to an award-winning AI-powered NLP provider. In his current role, Ben combines strategic vision with a deep practical understanding of AI tools to provide both clients and partners with the solutions they need.