Taraxa Echo: Analytics Pipeline Walkthrough

Steven Pu

Published in

Taraxa Project

12 min read · Nov 21, 2022

In our previous article, we outlined the decentralized architecture of Taraxa Echo 👇👇.

Taraxa Echo: a Decentralized Social Data Network


Here, we’ll focus on a specific and critical component of the Echo Node: the Data Analytics Pipeline. We’ll walk step by step through the pipeline with actual social data collected from Telegram.

The data set used in this walkthrough can be downloaded from the link below. This sample covers,

One-week period: 2022-11-03 to 2022-11-10
3,030,387 Telegram messages
4,665 Telegram groups
138,175 users

All data files mentioned in this article can be found here: https://drive.google.com/drive/folders/1qMtQll0Ffy33sm9p1A3Uzlq0GO25BJrD

📣 Taraxa Hype: example target use case

We’re going to use Taraxa Hype, an app we’re developing to help web3 projects run high-ROI social campaigns, as the target use case. Hype makes use of the data & analytics from the Echo platform.

The goal is to be able to automatically identify and reward people in the crypto ecosystem who help spread the word around a project’s social campaigns.

🔄 Pre-Processing

Pre-processing is the first step of the analytics pipeline that removes information that’s unnecessary or unhelpful for follow-up analyses. Specifically, the following are removed,

Non-text: data such as images, videos, and emojis are removed, since the analytics algorithms focus on text only
Non-English: messages are removed, since we decided to focus on English social data first
Very short messages: those shorter than 6 words are removed, a simple heuristic to filter out messages that convey little meaning (e.g., “Hi”, “gm”)
Links: are removed, since the links themselves aren’t meaningful and the information contained at their destinations is beyond the scope of this analysis
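The four filters above can be sketched in a few lines of Python. This is a minimal illustration rather than the Echo implementation: the function names are hypothetical, the ASCII-only check is a crude stand-in for a real language-identification model, and stripping non-ASCII characters is a crude stand-in for emoji removal.

```python
import re
from typing import Optional

URL_RE = re.compile(r"https?://\S+")
MIN_WORDS = 6  # messages shorter than 6 words are dropped, per the heuristic above

def strip_links(text: str) -> str:
    """Remove URLs and collapse the whitespace they leave behind."""
    return " ".join(URL_RE.sub(" ", text).split())

def looks_english(text: str) -> bool:
    """Hypothetical placeholder: treat text as English if all its letters are ASCII."""
    letters = [c for c in text if c.isalpha()]
    return bool(letters) and all(c.isascii() for c in letters)

def strip_non_text(text: str) -> str:
    """Crude emoji removal: drop every non-ASCII character."""
    return text.encode("ascii", "ignore").decode("ascii")

def preprocess(raw: str) -> Optional[str]:
    """Return the cleaned message, or None if the message should be dropped."""
    no_links = strip_links(raw)
    if not looks_english(no_links):
        return None
    cleaned = " ".join(strip_non_text(no_links).split())
    if len(cleaned.split()) < MIN_WORDS:
        return None
    return cleaned
```

Note that the language check runs before non-ASCII characters are stripped; otherwise every message would trivially pass it.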

In our sample data set,

We began with 3,030,387 raw social messages totaling 1,896,069,635 bytes
Pre-processing eliminated 1,871,757 messages (687,438,064 bytes), leaving 1,158,630 messages (1,217,900,627 bytes)

You can find the raw social data set in file raw_data_2022–11–03_2022–11–10__0.csv and the data after pre-processing in file clean_messages_2022–11–03_2022–11–10__1.csv .

🚫 Anti-Spam(mers)

The biggest challenge social campaigns face today is that the vast majority of their funds aren’t going to actual community supporters or influencers, but are wasted on spam bots or bounty hunters that, at best, add nothing to your community and, at worst, tarnish your project’s reputation.

In Echo, spam is defined as a message or its near-identical variants being posted at a significantly higher frequency than the average in a specific group.

Here is how spam is detected,

Clustering near-duplicate messages: by using locality-sensitive hashing (LSH). This helps the pipeline identify messages that are either identical or very close to one another and group them together. We often notice that spammers or bots slightly adjust a spam message by changing a few words or reordering sentences to escape generic frequency filters, but our algorithms are able to identify slight and not-so-slight variations and count them all as the same message.
Measuring the clusters’ posting frequencies: by looking at how often these messages are being posted. A series of identical or nearly identical messages posted infrequently (e.g., once a day) doesn’t count as spam, since users aren’t seeing it too often. There are also good reasons to post the same message more than once in a social group: messages scroll up very quickly, and not every user sees a message when it’s first posted. A message is only spam if it’s being posted “too frequently”.
Comparing the clusters’ posting frequencies vs. the average: posting frequency in the group. If a cluster’s posting frequency within a group is more than 5x the average per-user posting frequency in that group, the cluster is deemed spam.
Labeling and filtering out spam and spammers: by not only removing clusters of messages labeled as spam, but also removing all other messages posted by the authors of those messages. Basically, we ban both spam and spammers.
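To make the clustering step more concrete, here is a small sketch of one common LSH scheme: MinHash signatures over word shingles, bucketed by bands. The article does not say which LSH variant Echo uses, so the scheme itself, the parameter values (NUM_HASHES, BANDS, SHINGLE), and all function names are assumptions for illustration only.

```python
import hashlib
from collections import defaultdict

NUM_HASHES = 32  # signature length (assumed)
BANDS = 8        # LSH bands, 4 signature rows per band (assumed)
SHINGLE = 3      # word n-gram size (assumed)

def shingles(text, k=SHINGLE):
    """Set of lowercase k-word shingles; short texts yield a single shingle."""
    words = text.lower().split()
    if len(words) < k:
        return {" ".join(words)}
    return {" ".join(words[i:i + k]) for i in range(len(words) - k + 1)}

def _hash(seed, token):
    """Deterministic 64-bit hash of a token under a given seed."""
    data = f"{seed}:{token}".encode("utf-8")
    return int.from_bytes(hashlib.blake2b(data, digest_size=8).digest(), "big")

def minhash(text):
    """MinHash signature: the minimum hash of the shingle set under each seed."""
    sh = shingles(text)
    return tuple(min(_hash(seed, s) for s in sh) for seed in range(NUM_HASHES))

def cluster_near_duplicates(messages):
    """Group message indices whose signatures collide in at least one band."""
    parent = list(range(len(messages)))

    def find(i):
        # Union-find lookup with path halving.
        while parent[i] != i:
            parent[i] = parent[parent[i]]
            i = parent[i]
        return i

    rows = NUM_HASHES // BANDS
    buckets = {}
    for idx, text in enumerate(messages):
        sig = minhash(text)
        for b in range(BANDS):
            key = (b, sig[b * rows:(b + 1) * rows])
            if key in buckets:
                parent[find(idx)] = find(buckets[key])
            else:
                buckets[key] = idx

    clusters = defaultdict(list)
    for idx in range(len(messages)):
        clusters[find(idx)].append(idx)
    return list(clusters.values())
```

Messages that share most of their shingles are likely to agree on at least one band and land in the same cluster, while unrelated messages almost never do. More bands with fewer rows each makes the matching looser; fewer bands with more rows makes it stricter.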

For our sample data set,

We start with 1,158,630 messages after the initial Pre-processing
Within these, we identified 19,217 clusters of near-duplicate messages at a group level, accounting for 805,549 messages total
Out of these, 8,340 clusters exceeded 5x the average per-user posting frequency in their respective groups and were labeled as spam and removed, accounting for a total of 436,305 messages
From these 436,305 spam messages, we identified 6,320 authors (spammers). As a consequence, all of these spammers’ messages across all groups were also removed, accounting for another 114,724 messages
In total, 607,601 spam and spammer-posted messages were removed, leaving us with 551,029 messages going into the next stage
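The labeling and banning logic behind these numbers can be sketched as follows. This is an illustrative reading of the rule, not the production code: it assumes every message already carries a near-duplicate cluster id, and because all messages fall in the same one-week window it compares raw message counts instead of rates. The Message type and filter_spam function are hypothetical names.

```python
from collections import Counter, defaultdict
from typing import NamedTuple, Optional

SPAM_RATIO = 5.0  # the 5x threshold described in the article

class Message(NamedTuple):
    group: str
    author: str
    cluster: Optional[int]  # near-duplicate cluster id, None if unclustered

def filter_spam(messages):
    """Return (kept_messages, spam_clusters, spammers)."""
    per_group_total = Counter(m.group for m in messages)
    per_group_authors = defaultdict(set)
    cluster_counts = Counter()
    for m in messages:
        per_group_authors[m.group].add(m.author)
        if m.cluster is not None:
            cluster_counts[(m.group, m.cluster)] += 1

    # A cluster is spam if it was posted more than 5x as often as the
    # average user posts in that group.
    spam_clusters = set()
    for (group, cluster), count in cluster_counts.items():
        avg_per_user = per_group_total[group] / len(per_group_authors[group])
        if count > SPAM_RATIO * avg_per_user:
            spam_clusters.add((group, cluster))

    # Ban the authors of spam: drop all of their messages, in every group.
    spammers = {m.author for m in messages if (m.group, m.cluster) in spam_clusters}
    kept = [m for m in messages if m.author not in spammers]
    return kept, spam_clusters, spammers
```

Dropping every message from a flagged author removes the spam itself as well as the author's other messages, which matches the two-stage removal described above.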

Here are a few examples of the spam clusters, and 3 raw (pre-cleaning) sample messages from each cluster. Quick note, the \n characters are newlines.

Spam Cluster 309198848 (10047 messages total)


" Doggie Coin offical pancakeswap links🐕🦴\n\n📈Bogged charts: https://charts.bogged.finance/?token=0x4CB2d8974E20025b6D85Af3a754A922bb27bBAFc\n📈Poo charts: https://poocoin.app/tokens/0x4CB2d8974E20025b6D85Af3a754A922bb27bBAFc\n📝Contract: 0x4CB2d8974E20025b6D85Af3a754A922bb27bBAFc\n🥞Pancakeswap: https://exchange.pancakeswap.finance/#/swap?outputCurrency=0x4CB2d8974E20025b6D85Af3a754A922bb27bBAFc\n\n\n🌐 Website https://www.doggiecoin.org/\n💬 Telegram https://t.me/doggi_coin\n🐥 Twitter https://twitter.com/doggie_coins"

" Doggie Coin offical pancakeswap links🐕🦴\n\n📈Bogged charts: https://charts.bogged.finance/?token=0x4CB2d8974E20025b6D85Af3a754A922bb27bBAFc\n📈Poo charts: https://poocoin.app/tokens/0x4CB2d8974E20025b6D85Af3a754A922bb27bBAFc\n📝Contract: 0x4CB2d8974E20025b6D85Af3a754A922bb27bBAFc\n🥞Pancakeswap: https://exchange.pancakeswap.finance/#/swap?outputCurrency=0x4CB2d8974E20025b6D85Af3a754A922bb27bBAFc\n\n\n🌐 Website https://www.doggiecoin.org/\n💬 Telegram https://t.me/doggi_coin\n🐥 Twitter https://twitter.com/doggie_coins "

" Doggie Coin offical pancakeswap links🐕🦴\n\n📈Bogged charts: https://charts.bogged.finance/?token=0x4CB2d8974E20025b6D85Af3a754A922bb27bBAFc\n📈Poo charts: https://poocoin.app/tokens/0x4CB2d8974E20025b6D85Af3a754A922bb27bBAFc\n📝Contract: 0x4CB2d8974E20025b6D85Af3a754A922bb27bBAFc\n🥞Pancakeswap: https://exchange.pancakeswap.finance/#/swap?outputCurrency=0x4CB2d8974E20025b6D85Af3a754A922bb27bBAFc\n\n\n🌐 Website https://www.doggiecoin.org/\n💬 Telegram https://t.me/doggi_coin\n🐥 Twitter https://twitter.com/doggie_coins "

Spam Cluster 307101696 (1516 messages total)


" Risu (RISU)\nPrice:0.001149𝑈𝑆𝐷\nPrice:0.000000057𝐵𝑇𝐶\nPrice:0.0000007562𝐸𝑇𝐻\n1ℎ𝑟𝐶ℎ𝑎𝑛𝑔𝑒:−0.2174,201.77\nFully Diluted Market Cap: $1,149,272.27\n\n🚀 View on CoinMarketCap "

" OmniaVerse (OMNIA)\nPrice:0.001824𝑈𝑆𝐷\nPrice:0.0000000905𝐵𝑇𝐶\nPrice:0.000001201𝐸𝑇𝐻\n1ℎ𝑟𝐶ℎ𝑎𝑛𝑔𝑒:0.23501,580.14\nFully Diluted Market Cap: $1,824,287.75\n\n🚀 View on CoinMarketCap "

" Metakings (MTK)\nPrice:0.0004762𝑈𝑆𝐷\nPrice:0.00000002347𝐵𝑇𝐶\nPrice:0.0000003089𝐸𝑇𝐻\n1ℎ𝑟𝐶ℎ𝑎𝑛𝑔𝑒:0.06706.30\nFully Diluted Market Cap: $476,177.78\n\n🚀 View on CoinMarketCap "

Spam Cluster 306479111 (4178 messages total)


" Help me get more floors and be one of the top 7 lakh teams this round https://gpay.app.goo.gl/5mXSct "

" Now visit mine \nHelp me get more floors and be one of the top 7 lakh teams this round https://gpay.app.goo.gl/RsH4Jm\nHelp me get more floors and be one of the top 7 lakh teams this round https://gpay.app.goo.gl/RsH4Jm\nHelp me get more floors and be one of the top 7 lakh teams this round https://gpay.app.goo.gl/RsH4Jm\nHelp me get more floors and be one of the top 7 lakh teams this round https://gpay.app.goo.gl/RsH4Jm\nHelp me get more floors and be one of the top 7 lakh teams this round https://gpay.app.goo.gl/RsH4Jm\nHelp me get more floors and be one of the top 7 lakh teams this round https://gpay.app.goo.gl/RsH4Jm\nHelp me get more floors and be one of the top 7 lakh teams this round https://gpay.app.goo.gl/RsH4Jm\nHelp me get more floors and be one of the top 7 lakh teams this round https://gpay.app.goo.gl/RsH4Jm\nHelp me get more floors and be one of the top 7 lakh teams this round https://gpay.app.goo.gl/RsH4Jm\nHelp me get more floors and be one of the top 7 lakh teams th... "

" Help me get more floors and be one of the top 7 lakh teams this round https://gpay.app.goo.gl/Ebur1J \n\n\n\n\n\n\nPlzz visit "

As expected, many of the messages that are clustered together aren’t identical but are near-duplicates. Clustering them helps identify a much wider range of spam messages that would otherwise escape classification.

You can find the spam clusters in file spams_week=2022–11–03 00/00/00__2.csv and the post-spam filtered messages in folder clean_minus_spams_week=2022–11–03 00/00/00__3.csv.

💛 Relevance

After the spam and spammers are removed, we now move into identifying how relevant each message is to the projects and their campaigns.

Projects running social campaigns only want to reward authors of messages that help them raise awareness, so we need to identify messages that are highly relevant to the project and its specific social campaign.

In our sample we included just one project, Avalanche, for analysis.

Relevance is determined with the following steps,

Identify messages containing the project’s keywords: such as the project’s name (or variations) or its token name. Without the proper context, it’s impossible to tell if a message is relevant to a specific project unless the right keywords appear inside the message.
Identify messages that also contain the social campaign’s keywords: in addition to the project’s keywords. For example, if Taraxa were running a social campaign about its testnet launch, then you’d expect to see the words “Taraxa” and “testnet” in the message. This helps to further identify messages that are not only related to the project, but specifically related to the social campaign.
Compute relevance of the message to the project: by turning both the messages containing the right keywords and a general description of the project into vectors, and then calculating the cosine similarity (the cosine of the angle) between the vectors. This makes use of a well-known class of NLP models called sentence transformers. In our first version, we make use of open-source, pre-trained models such as this one.
Compute relevance of the message to the campaign: by using the same method, but this time computing the cosine similarity between the message and the social campaign’s description.
Filter & score: the messages according to their general relevance to the project and their relevance to the project’s specific social campaign. The pre-trained models used are just good enough to serve as an exclusionary measure at the low end, i.e., any message with a similarity to the project below 0.3 (0 being lowest, 1 highest) is cut out, and those with a similarity to the specific campaign greater than 0.7 are weighted for additional rewards.
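The keyword check, the cosine similarity, and the two thresholds can be sketched in plain Python. The embedding step is deliberately left out: the vectors passed to cosine_similarity are assumed to come from a sentence-transformer model, and the function names and the "dropped" / "kept" / "bonus" labels are hypothetical.

```python
import math
import re

PROJECT_MIN = 0.3     # below this similarity to the project, a message is cut
CAMPAIGN_BONUS = 0.7  # above this similarity to the campaign, extra weight applies

def has_keyword(text, keywords):
    """True if any single-word keyword appears as a whole word (case-insensitive)."""
    words = set(re.findall(r"[a-z0-9]+", text.lower()))
    return any(k.lower() in words for k in keywords)

def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0

def classify(project_sim, campaign_sim):
    """Apply the exclusion and bonus thresholds to a message's two scores."""
    if project_sim < PROJECT_MIN:
        return "dropped"
    if campaign_sim > CAMPAIGN_BONUS:
        return "bonus"
    return "kept"
```

In a full pipeline, has_keyword would run first so that only keyword-matching messages are embedded and scored, which keeps the more expensive model calls to a small fraction of the data.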

The exact reward mechanisms are to be determined pending real-world testing. We’ll run a few Taraxa Hype Pools to see what incentivized social behaviors look like, since the data set we’re using is generic social chatter that was not driven by this specific form of incentive structure.

For our example, we’ll use the project Avalanche, a well-known project that has plenty of natural social mentions on Telegram.

We start with 551,029 messages that survived the spam & spammer filter
Of these, 258 messages contained either the word “Avalanche” or “AVAX”
Of these, 226 messages had a cosine similarity ≥ 0.3 to Avalanche’s project introduction (the first 2 paragraphs on their documentation site)

While we tested many hypothetical social campaigns, their results were not meaningful absent any actual incentives, hence they’re not covered in this article. The algorithms used to identify project relevance vs. social campaign relevance are identical.

Here are a few example messages that had very high relevance to Avalanche’s project description,

Messages with cosine similarity scores > 0.6


"- Avalanche with no subnets is a Layer 1, it's a blockchain\n- Avalanche + scaling capabilities provided by the Subnet tech, is Layer 0, because you can create new blockchains, totally new blockchains with their validators set, rules, blockspace, and so forth..."

"I did hear they'll be deploying on Avalanche soon as a partnership has already been made"

"And if we assume this is the definition for Layer0, so yup, Avalanche provides all the building blocks to build new blockchains, to scale, to run those blockchains via Subnets and in a fully customized way"

Here are a few example messages that didn’t make the cut,

Messages with cosine similarity scores < 0.3


"🔺Colony is giving to its Community tge unique ability to directly participate in Seed / Private sales of new projects on Avalanche and also to trade the vested tokens acquired trough its new Early Stage Feature (something never done before in crypto). Users will be able to act like VCs, and will learn the VC way of investing, that is to fund a lot of projects knowing that a lot of them will fail, but that the ones that will succeed will cover for the losses and grant a huge profit, even when the percentage are that 90 projects out of 100 will fail and only 10 will be absolutely killers. Colony of course won't give financial advices, users will have to make their own analysis and due diligences, knowing that they're playing a high risk / high reward game, and that not all projects will succeed. Seed and Private sales, however, are also among the most profitable deals when a project succeed, for example Platypus had a Seed price of 0.08  𝑎𝑛𝑑𝑤𝑒𝑛𝑡𝑢𝑝𝑡𝑜12  🔥\n\n🔺The new Early Stage... NaN"

"- an average of 1,800 DAU (daily active users)\n- over $1B in total trade volume on Polygon and Avalanche within six months"

"Crypto Newsletter\n\nBlood Moon Party - The market had a terrible day yesterday when the FTT token of the world's 3rd largest exchange FTX collapsed since CZ's announcement of the sale of FTT. Following FTT, Bitcoin and other altcoins also plummeted, creating a new bottom in 2022 at17,166...\n\n✍️𝑀𝑎𝑖𝑛𝑛𝑒𝑤𝑠𝑜𝑓𝑡ℎ𝑒𝑑𝑎𝑦\n\n📌𝐶𝑟𝑦𝑝𝑡𝑜𝑀𝑎𝑟𝑘𝑒𝑡𝑃𝑙𝑢𝑛𝑔𝑒𝑠−𝐵𝑖𝑡𝑐𝑜𝑖𝑛𝐻𝑒𝑎𝑑𝑠𝑇𝑜𝑤𝑎𝑟𝑑𝑠17,500 Support\n\n📌 Avalanche Foundation launches $4 million liquidity stimulus package for GMX\n\n📌 The market was in chaos when there was news that FTX stopped withdrawal \n\n📌 EU announces new plan for digital EUR"

Messages that contained the “Avalanche” or “AVAX” keyword but scored low on relevance are typically longer messages that merely mention Avalanche in passing and are not focused on the project itself. In this respect, the relevance measure is pretty good.

You can find the post-relevance filtered messages in file relevant_scored_week=2022–11–03 00/00/00_AVAX__4.csv .

👀 Impressions

The final step towards calculating rewards is to measure the impact of each message. Our first iteration of impact measurement is simply an estimate of the number of people who have plausibly seen the message, or impressions.

We measure impressions for each message with two metrics,

Number of people who have explicitly spoken: in the group during the hour around when the message was posted. This is the most accurate measure of how many people have seen the message, since these people were talking at around the same time.
Number of people online: in the group during the hour when the message was posted. This is a much rougher measure, as we can’t be sure if these people were paying any attention to the message or the chat where the message appears.

We believe a combination of the two measures gets us closer to the real measurement. Our current measure of impressions for each message is the number of people who have explicitly spoken, multiplied by the base-10 logarithm of the difference between the number of people online and the number of people who’ve spoken (with that difference floored at 10, so the multiplier is never below 1), something like this,

message_impressions = num_people_spoken * LOG₁₀ ( MAX [ 10, num_people_online - num_people_spoken ] )

Here we are heavily discounting those who are simply “online” vs. those who have actually spoken, using their count as a simple “amplification” factor on the number of people who have explicitly spoken. Moreover, the larger the difference between those who are online and those who have spoken, the more heavily that number is discounted. We’ve all been in Telegram groups where you see just 3 people talking while the group shows 10k people online, making you suspect that those 10k people are mostly bots or aren’t paying attention. The logarithm reflects the discount.
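The formula above translates directly into a short Python function. The function and parameter names are illustrative; the floor of 10 and the base-10 logarithm come from the formula itself.

```python
import math

def message_impressions(num_people_spoken: int, num_people_online: int) -> float:
    """Speakers, amplified by the log of the silent online audience."""
    # Floor the silent audience at 10 so the multiplier is never below 1.
    silent = max(10, num_people_online - num_people_spoken)
    return num_people_spoken * math.log10(silent)
```

For example, 3 speakers in a group with 10,003 people online gives 3 × log₁₀(10,000) = 12 impressions, so roughly ten thousand silent members add only 9 impressions on top of the 3 people who actually spoke.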

The Hype app allows the pool initiator / social campaign runner to specify a price for every 1,000 impressions. All we have to do is sum message impressions per Telegram user, divide by 1,000, and multiply by the price, which gives the reward for each eligible user.
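As a small illustration of that calculation, the sketch below aggregates impressions per user and applies the per-1,000 price. The function name and the (user, impressions) input shape are assumptions made for the example.

```python
from collections import defaultdict

def rewards_per_user(message_impressions, price_per_thousand):
    """message_impressions: iterable of (user, impressions) pairs, one per message."""
    totals = defaultdict(float)
    for user, impressions in message_impressions:
        totals[user] += impressions
    # Reward = total impressions / 1,000 * price per 1,000 impressions.
    return {user: total / 1000 * price_per_thousand for user, total in totals.items()}
```

With a price of 10 per 1,000 impressions, a user whose messages gathered 1,500 and 500 impressions would earn 20, and a user with a single 500-impression message would earn 5.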

For our sample data set,

Across the 258 messages that made it this far, we counted a total of 302 impressions

You can find the impression counts in file relevant_plus_impressions_week=2022–11–03 00/00/00_AVAX__5.csv .

🤖 A few words on machine-learning

We tested a large number of off-the-shelf NLP models for a variety of tasks, but to truly get extremely accurate outcomes, models will need to be specifically trained for our task at hand.

We looked specifically at pre-trained NLP models hosted on Hugging Face. A few key takeaways about all the open-source, pre-trained NLP models we’ve encountered,

Not trained on social text: all seemed to be trained on long-form, well-written “book” type inputs. Social text is fragmented, often grammatically incorrect, often written by non-native English speakers, and contains a ton of acronyms, slang, misspellings, and of course 😸.
Not trained on crypto topics: which isn’t surprising, as everything we’ve encountered was trained on generic English-language data sets. This especially limits vectorization models, as they fail to capture nuanced differences within the crypto-sphere, making all crypto-related text seem extremely similar.
Not trained in adversarial contexts: all of these models seem to be intended for deployment in use cases where the users are well-intentioned. In the Hype use case, incentives drive users to try to game the system and receive rewards for doing very little. For example, models designed to find grammatical errors (e.g., those trained on CoLA) do a great job at detecting nuanced mistakes, but fail miserably when given complete gibberish. In addition to the steps outlined in the pipeline, we’ve also deployed a series of simple filters specifically targeting adversarial & gaming behaviors. We’ll continue to refine these as Echo is being deployed in the field.

Training models takes time and a great deal of resources, which isn’t a good fit for our current goal of demonstrating the usefulness of decentralized social data & analytics. We’ll be on the lookout for partners, open-source efforts, or incentivized crowd-sourced labeling as Taraxa evolves.

Stay Tuned!
