This is a guest post by Candelario A. Gutierrez. Thanks for sharing your CrateDB experience with us. You can also watch Candelario's talk on the same topic, which he gave at our CrateDB Community Day.
My name is Candelario A. Gutierrez, and in this post I will talk about how we leverage CrateDB in our platform for analyzing social-platform data. We have reported on this work in our paper “Analyzing and visualizing Twitter conversations” (see https://dl.acm.org/doi/abs/10.5555/3507788.3507791). Here, I provide more detail on the technical aspects of our platform, aimed at software developers who deal with similar data and are interested in exploring CrateDB.
Conversations about public interest issues have shifted from mediated traditional media sources to unmediated online media, blogs, and webpages.
As a result, there is substantial interest in developing a variety of computational methods for analyzing social-platform data (Sapountzi & Psannis, 2018). Our own work has been motivated by the need to develop a robust and scalable platform, able (a) to deal with multiple and large data sets, and (b) to easily integrate new analyses. In its current version, the platform analyzes data at the lexical level (extracting words, hashtags, URLs, emojis, mentions, and media); integrates three dictionary-based semantic analyses, i.e., personal values, sentiment (of the tweet text and images), and humor expressed; extracts domain-specific aspects (specified as phrases) and their associated sentiments; and quantifies engagement in terms of a variety of alternative metrics.
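To give a feel for the lexical-level step, the snippet below is a minimal illustration (not the production code of the platform) of how hashtags, mentions, and URLs can be pulled out of a tweet's text with plain regular expressions; the patterns and the function name are made up for this example.

```python
import re

# Illustrative patterns; a production pipeline would likely use a dedicated tweet tokenizer.
HASHTAG_RE = re.compile(r"#\w+")
MENTION_RE = re.compile(r"@\w+")
URL_RE = re.compile(r"https?://\S+")

def extract_lexical_features(text: str) -> dict:
    """Return the hashtags, mentions, and URLs found in a tweet's text."""
    return {
        "hashtags": HASHTAG_RE.findall(text),
        "mentions": MENTION_RE.findall(text),
        "urls": URL_RE.findall(text),
    }

print(extract_lexical_features("Great panel on #energyeast with @example https://example.com/article"))
```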
Social media is a prime example of semi-structured data, consisting primarily of free-form texts, called posts, containing references to users (e.g., "at mentions" on Twitter), topics of interest (e.g., hashtags), or other posts (e.g., "re-tweets" on Twitter). This type of data can easily grow to terabytes. Therefore, for a system to be capable of processing, analyzing, and visualizing medium to big data, it needs to fulfill a set of requirements:
Thanks to open-source platforms and tools, we managed to fulfill all those requirements as pictured below.
There are three key concepts that this platform brings to the table: automation, scalability, and responsiveness. Let me explain how they are reflected in our components:
In case external services need to consult the processed data, or any other type of data, we developed a set of FastAPI endpoints that allow different types of SELECT queries against CrateDB:
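For example, a minimal sketch of one such endpoint, using the crate Python client, might look like the following; the table and column names (doc.tweets, tweet_text, hashtags) are hypothetical and only meant to illustrate the pattern, not our exact schema.

```python
from crate import client  # pip install crate
from fastapi import FastAPI

app = FastAPI()
CRATE_HOSTS = ["localhost:4200"]  # assumption: a locally reachable CrateDB node

@app.get("/tweets/{hashtag}")
def tweets_by_hashtag(hashtag: str):
    """Return analyzed tweets containing a given hashtag (illustrative schema)."""
    connection = client.connect(CRATE_HOSTS)
    cursor = connection.cursor()
    cursor.execute(
        "SELECT id, tweet_text, sentiment FROM doc.tweets "
        "WHERE ? = ANY(hashtags) ORDER BY created_at DESC LIMIT 100",
        (hashtag,),
    )
    rows = cursor.fetchall()
    cursor.close()
    connection.close()
    return [{"id": r[0], "text": r[1], "sentiment": r[2]} for r in rows]
```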
Now, let me introduce you to the internal components of the data processing and analysis section. As depicted in the image below, we have a set of different analysis pipelines.
Dictionary-Based Lexicon Look-Up
Aspect-Based Sentiment Analysis
We apply dependency parsing to reveal the relationships between words that modify the meaning of other words. When multiple aspects are found in the text, they are separated in the dependency structure and their specific modifiers can later be analyzed for sentiment independently.
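As a rough sketch of this idea (not necessarily how our pipeline is implemented), spaCy's dependency parser can be used to collect, for each aspect term, the words that modify it in the parse; the model name, aspect terms, and example sentence below are assumptions for illustration.

```python
import spacy

# Assumes the small English model is installed: python -m spacy download en_core_web_sm
nlp = spacy.load("en_core_web_sm")

def modifiers_per_aspect(text: str, aspects: list[str]) -> dict:
    """Map each aspect term to the words that modify it in the dependency parse."""
    doc = nlp(text)
    result = {aspect: [] for aspect in aspects}
    for token in doc:
        if token.lemma_.lower() in result:
            # Keep children such as adjectival modifiers ("amod") attached to this aspect.
            result[token.lemma_.lower()].extend(
                child.text for child in token.children if child.dep_ in ("amod", "compound")
            )
    return result

# Each aspect gets its own modifiers, which can then be scored for sentiment independently.
print(modifiers_per_aspect("The risky pipeline will create valuable jobs.", ["pipeline", "job"]))
```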
Engagement Calculation (Public Metrics)
We apply standard formulas to compute engagement and/or influence, extended reach, and post impressions, based on the metadata that we can extract from the query search.
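The exact formulas are not reproduced here; as one common illustration, an engagement rate can be computed from Twitter's public_metrics fields divided by the author's follower count, as in this sketch.

```python
def engagement_rate(public_metrics: dict, follower_count: int) -> float:
    """Illustrative engagement formula: interactions divided by the author's followers.

    The formulas used in the platform may differ; this only shows the general shape
    of such a metric, based on Twitter's public_metrics fields.
    """
    interactions = (
        public_metrics.get("like_count", 0)
        + public_metrics.get("retweet_count", 0)
        + public_metrics.get("reply_count", 0)
        + public_metrics.get("quote_count", 0)
    )
    return interactions / follower_count if follower_count else 0.0

print(engagement_rate({"like_count": 12, "retweet_count": 5, "reply_count": 3, "quote_count": 1}, 2500))
```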
We have recently incorporated an API into our system that is capable of processing individual images or batches of images. Once an image is received, a number of analyses are applied to it, including:
The results of these analyses are later written to CrateDB as a single table, keyed by their respective post ID.
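As an illustration of that write path (not our exact code), the crate Python client supports bulk inserts via executemany; the table doc.image_analysis and its columns below are hypothetical.

```python
from crate import client  # pip install crate

# Assumption: a locally reachable CrateDB node and a hypothetical image-analysis table.
connection = client.connect(["localhost:4200"])
cursor = connection.cursor()

cursor.execute(
    """
    CREATE TABLE IF NOT EXISTS doc.image_analysis (
        post_id TEXT,
        image_url TEXT,
        sentiment TEXT,
        labels ARRAY(TEXT)
    )
    """
)

# Bulk insert one row per analyzed image, keyed by the post ID.
rows = [
    ("1450000000000000001", "https://example.com/a.jpg", "positive", ["pipeline", "field"]),
    ("1450000000000000002", "https://example.com/b.jpg", "neutral", ["crowd", "protest"]),
]
cursor.executemany(
    "INSERT INTO doc.image_analysis (post_id, image_url, sentiment, labels) VALUES (?, ?, ?, ?)",
    rows,
)
cursor.close()
connection.close()
```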
Custom curated visualizations were created in order to interpret all of the text and metadata analyses. Two groups of visualizations are available to contextualize (a) the frequency of personal values, sentiment, and humor in tweets, and (b) the mention of terms associated with an aspect and which combinations of them are popular in tweets. The added value of such visualizations lies in the dictionary multi-select, the aspect filter, and the time control bar: functionalities that allow the user to get a more in-depth view of the data.
To validate the architecture of our data platform and the usability of our analysis pipelines, we performed a case study based on the “#energyeast” hashtag query.
The characteristics of this dataset were the following:
Moving forward, the respective pipelines for text, engagement, metadata, and URL processing were executed on a Linux PC with 16 cores and 56 GB of RAM, which also hosted a dockerized three-node deployment of CrateDB v4.6.3. For the image processing, we used a Linux PC with 4 cores, 22 GB of RAM, and an NVIDIA vGPU with 8 GB of memory. These were the approximate execution wall times of our pipelines:
Based on the final results, we identified two bottlenecks: URL expansion and image processing. Both make asynchronous API calls to fetch data from external resources, and for this we depend on two factors:
To conclude, no matter the number of columns our tables needed, we experienced very good write execution times. Also, serving our visualizations by selecting all 148,317 rows in batches of 2,500 results takes approximately 17 seconds. Hence, we found that CrateDB was a good fit for the requirements of this specific study and for this type of data, with really low latency.
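As a sketch of what such batched selection can look like (the actual serving code differs, and the table name and columns are hypothetical), the rows can be paged out of CrateDB in chunks of 2,500 with LIMIT/OFFSET:

```python
from crate import client  # pip install crate

# Assumption: a locally reachable CrateDB node and a hypothetical tweets table.
connection = client.connect(["localhost:4200"])
cursor = connection.cursor()

batch_size = 2500
offset = 0
total = 0
while True:
    # The integers come from the loop itself, so inlining them here is safe.
    cursor.execute(
        f"SELECT id, tweet_text, sentiment FROM doc.tweets "
        f"ORDER BY created_at LIMIT {batch_size} OFFSET {offset}"
    )
    batch = cursor.fetchall()
    if not batch:
        break
    total += len(batch)
    offset += batch_size
    # ... hand the batch over to the visualization layer ...

print(f"Fetched {total} rows in batches of {batch_size}")
cursor.close()
connection.close()
```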
If you wish to take a look at the project, you can visit https://ai4buzz.ca/. Further development of this platform will include topic and language modeling, evolution analysis, and more visualizations.
This project was supported by:
Lefsrud, L., Westbury, C., Keith, J., & Hollis, G. (2015). A basis for genuine dialogue: Developing a science-based understanding of public/industry communication. Phase I Report Prepared for the Alberta Chamber of Resources.
Mechler, R. (2016). Reviewing estimates of the economic efficiency of disaster risk management: opportunities and limitations of using risk-based cost–benefit analysis. Natural Hazards, 81(3), 2121-2147.
Oberg, A., Lefsrud, L., & Meyer, R. E. (2021). Organizational (issue) field perspective on climate change. Economic Sociology: The European Electronic Newsletter, 22(3), 21-29.
Sapountzi, A., & Psannis, K. E. (2018). Social networking data analysis tools & challenges. Future Generation Computer Systems, 86, 893-913.