Since we explained in our first post why we switched from CouchDB to Riak, several readers asked us about what we are doing at Linkfluence, what kind of tools we use and what’s our global architecture.
We harvest social data on the web
As explained on our main website, Linkfluence is a research company, specialized in the social web, which sells marketing and opinion research studies to advertisers and communication agencies.
Unlike traditionnal research institutes which work with polls and classical surveys, at Linkfluence we focus on spontaneous speech on the social web (i.e. blogs, forums, social networks and so on).
Our main goal at Linkfluence labs is to harvest data from the social web. To make this possible, we have developed during the last four years several systems to retrieve data on the web and extract intelligence from it.
Main Architecture
To achieve this we have designed a modular system composed of workers communicating with each other through a centralized message queue system. We have workers which:
- fetch syndication feeds and store the extracted metadata to Riak;
- resolve and canonize URLs;
- fetch and store the raw HTML pages inside Riak;
- extract qualified data and links from the raw page harvested and store them inside Riak (with the original document) and inside MongoDB (for graph manipulation);
- create a mask in order to extract qualified data from the HTML;
- etc.
PostgreSQL is our database of choice. Each PostreSQL database is wrapped inside a Catalyst instance which provides RESTful APIs. Workers are written in Perl too, and they use
SPORE to connect to the various APIs. All the data fetched and extracted is stored in Riak.
We make platforms and visualizations
Ultimately, we make country and/or language centric engines.
We have a specialized worker which pushes specific data into dedicated business unit:
- a Solr index which stores textual posts and external links;
- a MongoDB which stores the graph of internal links;
- a Redis which allows to compute dynamic indicators for ranking.
All these tools are driven by Perl code which handles all data and provides front APIs.
These APIs are comsumed by our frontend CMS and Flash expert visualizations which are then used by researchers to write their studies.
Today, we cover 6 countries (France, Germany, USA, UK, Italy and the Netherlands) which represent over 60k sources (websites) and we have more than one year and a half of archive available. We limit ourselves to 60k sources for some specific reasons (they are not technical).
Evolution
I have described our current system architecture. Today, when we want to add data from new source types, we just have to write new workers to retrieve their content and insert it in the system. In the same way, we have built other business modules to manage twitter data using MongoDB and ElasticSearch. Why we are considering switching from Solr to ElasticSearch for this purpose will be the topic of another post. But I can say, among other problems, that Twitter can’t be qualified in a country centric method, which makes it harder to design a good way to shard Solr instances. The way ElasticSearch scales matches better for now the way in which we plug Twitter to our system.