AMEC | The Innovation Hub Series

Topic analysis with classification models and its use with generative AI

by Hanz Saenz Gomez, Technological Development and Innovation Manager, buho Media

In the information age we live in, there is an overwhelming amount of content available online. Sources such as Kaggle and the sklearn library offer individual and collaborative projects that apply AI models to real and synthetic data, freely available for use. The ability to automatically classify and label that content is a valuable tool for organizing and analyzing data. One of the tools we can use for this process is the classification model for topic prediction.

These models are machine learning algorithms used to classify data into different categories. In topic prediction, they are used to identify the themes discussed in a specific text. For example, in the world of corporate reputation measurement, we can take a list of news articles and classify them according to the topics they each cover: politics, economics, sustainability, and so on. Building such a model requires a dataset with pre-existing classifications, which is used to train it to recognize the patterns and features that distinguish each category.
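As a minimal sketch of this idea, a topic classifier can be trained with scikit-learn on a small labeled dataset. The texts and labels below are invented for illustration; a real model like the ones described here would be trained on a much larger corpus.

```python
# Illustrative sketch: train a topic classifier on a small labeled dataset.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

texts = [
    "The central bank raised interest rates to curb inflation",
    "Parliament passed the new electoral reform bill",
    "The company cut its carbon emissions by thirty percent",
    "Stock markets rallied after the quarterly earnings report",
    "The senator announced her campaign for the presidency",
    "A new recycling programme aims to reduce plastic waste",
]
labels = ["economics", "politics", "sustainability",
          "economics", "politics", "sustainability"]

# TF-IDF turns each article into a weighted word-frequency vector;
# logistic regression then learns one decision boundary per topic.
model = make_pipeline(TfidfVectorizer(), LogisticRegression(max_iter=1000))
model.fit(texts, labels)

# A new, unclassified article is assigned the topic the model finds most likely.
print(model.predict(["Inflation and interest rates dominated the markets"]))
```

The same fitted pipeline can then label any number of incoming articles with a single `predict` call.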

Achieving High Accuracy with Topic Classification Models

At buho, we have 15 years of data already classified with these labels, spanning different industries and crises and including tonality classification. This has helped the models we developed internally for topic classification reach an accuracy of 80%, surpassing the 76% threshold proposed in a buho research study, “Defining the accuracy percentage for variables according to label prediction models.” That research is based on the analysis of eight text and label classification models published on arXiv, the free, open-access distribution service and archive maintained by Cornell University, which hosts over 2,104,827 academic articles in the fields of physics, mathematics, computer science, quantitative biology, quantitative finance, statistics, electrical engineering and systems science, and economics.

The accuracy percentage is a key metric in topic prediction models, as it measures how well the model can correctly predict topics based on its training. However, the exact definition of this metric may vary depending on the context and the variables being evaluated.

One of the variables considered in this research is the distribution of labels in the dataset. In particular, when labels are imbalanced, meaning there are many more instances of one label than another, accuracy can be misleading and not adequately reflect the quality of the model. In these cases, other metrics such as precision, recall, or F1-score can be used, which take label imbalance into account.
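A small invented example makes the imbalance problem concrete: a model that always predicts the majority label scores high accuracy while completely missing the minority class, which recall and F1 expose immediately.

```python
# Sketch: why accuracy misleads on imbalanced labels (invented toy data).
from sklearn.metrics import accuracy_score, f1_score, recall_score

# 90 articles about "economics", 10 about "politics" -- and a lazy model
# that predicts "economics" for everything.
y_true = ["economics"] * 90 + ["politics"] * 10
y_pred = ["economics"] * 100

# Accuracy looks strong because the majority class dominates.
print(accuracy_score(y_true, y_pred))                      # 0.9

# Recall and F1 for the minority class reveal the model never finds it.
print(recall_score(y_true, y_pred, pos_label="politics"))  # 0.0
print(f1_score(y_true, y_pred, pos_label="politics", zero_division=0))  # 0.0
```

This is why, on imbalanced datasets, per-class recall or a macro-averaged F1-score is a more honest summary than raw accuracy.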

Another important variable discussed in this research is the presence of errors in training labels. If the training dataset contains incorrect or misclassified labels, the model’s accuracy may be low even if it is capable of making accurate predictions. In these cases, label correction techniques or semi-supervised training can be used to improve label quality and, in turn, improve model accuracy.

Integration of Classification Models into Generative AI and its Applications

Semi-supervised training is a technique used to automatically label unlabeled data. Instead of manually labeling the entire dataset, a smaller set of already labeled data is used for initial training. A machine learning model then predicts labels for the unlabeled examples, and the most confident of those predictions are used to retrain it. The process is repeated several times until an acceptable level of accuracy is achieved in predicting labels for the unlabeled data.

Once the model has been trained, we can automatically classify new data. Returning to the example: if we have a news article that has not yet been classified, it can be fed into the classification model and will automatically be assigned a category based on the model’s prior learning. Our analysis system at buho performs this process automatically from the moment news articles are entered. It is not the only model implemented; we also have tonality classification, entity extraction, and others, which allow our data to be analyzed more deeply and provide more value to our clients.

In addition to measurement, classification models for topic prediction can be used in a variety of fields. For example, in the field of medicine, they could be used to classify a patient’s symptoms into different categories to assist in diagnosis. In the field of marketing, they could be used to classify customer feedback into different topics to better understand their needs and preferences.

Currently, these classification models are already integrated into the growing field of generative artificial intelligence, an exciting and constantly evolving branch. Among them we find the GPT-3 model developed by OpenAI, trained on a massive amount of data and known for producing high-quality text. Built on this model is ChatGPT which, as its name suggests, is a chat interface where we can ask the model about various topics: writing texts, proposing activities, answering inquiries, drafting articles, and even generating code in different programming languages. Similarly, we can find tools like Jasper, PerplexityAI, YouChat, and ChatSonic, among others, created for different purposes.

These models have proven to be a useful tool for topic analysis due to their ability to identify patterns and semantic relationships in large text datasets. These models identify topics that may not be evident to humans, allowing for a deeper understanding of text content. Additionally, generative models are efficient in processing large amounts of text, enabling topic analysis in large datasets.

Classification models for topic prediction are a valuable tool for organizing and analyzing data, and they are already integrated into generative AI models like the aforementioned ChatGPT, the most widely used. In the first quarter of 2023, we have seen how these technologies are changing the way companies work and think, leading them to automate and improve response times.

In measurement, technology allows data analysts to obtain relevant and useful information in record time, which can be a significant competitive advantage in an increasingly competitive and demanding market. If used responsibly and ethically, generative artificial intelligence can be a powerful tool for improving the quality and efficiency of work in the field of communication and corporate reputation measurement. It is important to note that when using generative AI, we do not seek to replace human creativity, but rather to amplify and enrich it. We use technology to be more human, to explore our creative capacity, and to find new forms of artistic expression, expanding our creative reach in the digital world. This leads us to conclude that human supervision will always be present.

One way to begin this journey of incorporating generative AI in corporate reputation measurement is:

  1. Define measurement objectives: It is important to have clear objectives and the metrics to be used.
  2. Identify relevant data: AI can process large volumes of data, so it is essential to identify the sources available and which data are relevant. This can include social media, news, forums, blogs, etc.
  3. Select the appropriate technology for the right use: There are various generative AI tools available, so it is crucial to choose the one that suits the objectives and needs.
  4. Train the model: With the technology and data, we must train the model to ensure accurate and relevant results.
  5. Generate customized reports: With the results, customized reports can be created, such as sentiment analysis, identification of relevant topics, among others.
  6. Adjust the model: It is important to evaluate and adjust the model to ensure its accuracy.
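The six steps above can be sketched end to end in a few lines. The brand mentions, sentiment labels, and report format below are all invented for illustration; a real deployment would draw on the data sources listed in step 2.

```python
# A minimal sketch of the six-step workflow, with invented example data.
from collections import Counter
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

# 1. Objective: classify brand mentions by sentiment.
# 2. Relevant data: here, a handful of invented social-media mentions.
mentions = [
    "Great service, the support team solved my issue fast",
    "Terrible experience, my order arrived late and damaged",
    "Love the new product line, well done",
    "Very disappointed, the app keeps crashing",
]
labels = ["positive", "negative", "positive", "negative"]

# 3-4. Technology and training: a TF-IDF + logistic regression baseline.
model = make_pipeline(TfidfVectorizer(), LogisticRegression())
model.fit(mentions, labels)

# 5. Report: aggregate predicted sentiment over new, unlabeled mentions.
new_mentions = [
    "The support team was great and fast",
    "My order was damaged again, terrible",
]
report = Counter(model.predict(new_mentions))
print(dict(report))

# 6. Adjust: compare predictions against a human-reviewed sample and retrain.
```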

It is essential to keep in mind that generative AI is not a one-size-fits-all solution for corporate reputation measurement and should be used alongside other communication and marketing tools and strategies.

As we embark on this journey through the digital landscape, imagine standing on the edge of a vast ocean of data, where each wave represents a piece of information, a story, or an opinion. It’s the era of information, and we need a trustworthy companion to help us navigate these waters. Generative AI models, and ML technologies, are like skilled sailors, guiding us to make sense of the vast ocean, uncovering hidden treasures, and providing valuable insights. By responsibly and ethically harnessing and controlling the power of these AI models, we can enhance human creativity, enrich our understanding, and sail towards a brighter future in the realm of communication and corporate reputation measurement. Together, we chart a course through the ever-changing tides of information, embracing the opportunities and challenges that await us on this uncertain and exciting voyage.

About Hanz Saenz Gomez, Technological Development and Innovation Manager, buho Media

Hanz has worked on integrating Machine Learning models applied to communication and reputation measurement, thus expanding his knowledge in predictive and classification models as part of his education in Artificial Intelligence. He has participated in business innovation projects presented to the Ministry of Science and Technology of Colombia in collaboration with buho, which focus on process innovation through the application of customized Machine Learning models for reputation and corporate sustainability products.

His professional growth and achievements have been demonstrated at buho, where he started as a developer in 2016 and currently holds the position of Manager, showcasing his ability to adapt to change and his leadership skills throughout this period.

Hanz holds a Master’s degree in Big Data from Esden Business School in Madrid, Spain. He is passionate about technology and all tools that can help automate processes and add more value to human work.

This article is part of The Innovation Hub Series.