Alex Rodrigues

Data science, distributed systems and big data

Probe Your System With Synthesized Realistic Data

I wonder how ancient civilizations that built incredible structures knew, once they were built, that they would remain standing. I’m not talking about mysterious pyramids but about more recent landmarks left to us by the Romans, such as their aqueducts. How confident were they in the resilience of these structures to the forces of nature, strong winds and tough winters? Well… the answer seems easy: with lots of theoretical work, designs and calculations.

While this is true, much was also achieved through experimentation, both on the structural side and in the application of materials. That knowledge was vital for establishing good practices on what to use in each situation, and for predicting maintenance needs and the consequences of extreme conditions. The same applies to the design of fault-tolerant, reliable systems that process large volumes of data.

The usual approach starts with reasonable assumptions about data volume, ingestion rate and document size, but other factors, such as processing time and indexing time, are hard to predict. It really helps to know how components behave under pressure and to calculate the limits of the current design.

Many (self-styled) architects like to treat the design process as cooking up a big blend of hype-driven technologies, and it’s very easy to get burnt. Key factors such as SLAs, peak times and hardware limitations greatly affect which components you choose to put in the pan and how you mix them together.

The only way to know how certain software components behave is to exercise them with a volume of data similar to the one they will process once in production. That data can simply be fed from the production servers, if it exists and the infrastructure allows it. More often than not, though, the data is not yet available in the target formats, and you need to experiment with generated data in the chosen formats.
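A generator for such data can be very small. The sketch below emits synthetic ad-impression-like events as JSON lines; the field names, value ranges and country list are hypothetical placeholders, and you would tune them to match the formats and distributions your production data is expected to have.

```java
import java.util.Locale;
import java.util.Random;
import java.util.UUID;

// Minimal sketch of a synthetic event generator. All fields here
// (id, ts, country, price) are illustrative assumptions, not a real schema.
public class SyntheticEventGenerator {
    private static final String[] COUNTRIES = {"US", "GB", "DE", "PT", "BR"};
    private final Random random = new Random();

    public String nextEvent(long timestampMillis) {
        // Locale.ROOT keeps the decimal separator a dot regardless of system locale.
        return String.format(Locale.ROOT,
                "{\"id\":\"%s\",\"ts\":%d,\"country\":\"%s\",\"price\":%.4f}",
                UUID.randomUUID(), timestampMillis,
                COUNTRIES[random.nextInt(COUNTRIES.length)],
                random.nextDouble());
    }

    public static void main(String[] args) {
        SyntheticEventGenerator gen = new SyntheticEventGenerator();
        long now = System.currentTimeMillis();
        for (int i = 0; i < 5; i++) {
            System.out.println(gen.nextEvent(now + i));
        }
    }
}
```

Piping the output of a generator like this into the ingestion path lets you probe throughput and sizing assumptions before any real traffic exists.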

Computing Approximate Histograms in Parallel

Today I’m going to write a little about approximate histograms and how they can be used to gain more insight into streamed big data feeds. I also provide a simple Java implementation and explain some parts of it.

Most common aggregation operations, like counting and summing, can be performed in parallel, as long as there is a reduce phase where the results from each node are combined. However, this is not trivial for histograms, since we need all the data along one dimension in order to represent it in a histogram.
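To see why counting and summing parallelize so easily, note that they are associative: each node can reduce its own shard, and the partial results combine in a final reduce step. A minimal sketch:

```java
import java.util.Arrays;
import java.util.List;

// Summing parallelizes because it is associative: reduce each shard
// independently, then combine the partial sums in a final reduce phase.
public class ParallelSum {
    public static void main(String[] args) {
        // Three "nodes", each holding a shard of the data.
        List<int[]> shards = Arrays.asList(
                new int[]{1, 2, 3}, new int[]{4, 5}, new int[]{6, 7, 8, 9});
        long total = shards.parallelStream()
                .mapToLong(shard -> Arrays.stream(shard).sum()) // per-node partial sum
                .sum();                                         // final reduce
        System.out.println(total); // 45
    }
}
```

A histogram has no such cheap combine step for exact results, which is where the approximate approach below comes in.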

With the data being processed by multiple nodes, each node can only construct a histogram of the partial data it receives. Ben-Haim and Tom-Tov presented a solution that uses a heap-based data structure to represent the data, together with a merge algorithm that combines the data structures computed on different nodes into an approximate histogram of the whole dataset.
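The core idea can be sketched in a few dozen lines. This is a simplified version of the Ben-Haim/Tom-Tov structure (a sorted list of bins rather than their heap-backed implementation): each bin is a (centroid, count) pair, adding a value may exceed the bin budget, and then the two closest centroids are collapsed into their weighted average. Merging two histograms from different nodes is just concatenating their bins and compressing again.

```java
import java.util.ArrayList;
import java.util.List;

// Simplified sketch of a Ben-Haim/Tom-Tov streaming histogram:
// a bounded, sorted list of {centroid, count} bins.
public class StreamingHistogram {
    private final int maxBins;
    private final List<double[]> bins = new ArrayList<>(); // {centroid, count}

    public StreamingHistogram(int maxBins) { this.maxBins = maxBins; }

    public void add(double value) {
        insertSorted(value, 1);
        compress();
    }

    // Merge another node's histogram into this one: concatenate, then compress.
    public void merge(StreamingHistogram other) {
        for (double[] b : other.bins) insertSorted(b[0], b[1]);
        compress();
    }

    public long totalCount() {
        long total = 0;
        for (double[] b : bins) total += (long) b[1];
        return total;
    }

    private void insertSorted(double centroid, double count) {
        int i = 0;
        while (i < bins.size() && bins.get(i)[0] < centroid) i++;
        bins.add(i, new double[]{centroid, count});
    }

    // While over budget, merge the two closest adjacent centroids
    // into their count-weighted average.
    private void compress() {
        while (bins.size() > maxBins) {
            int best = 0;
            double bestGap = Double.MAX_VALUE;
            for (int i = 0; i + 1 < bins.size(); i++) {
                double gap = bins.get(i + 1)[0] - bins.get(i)[0];
                if (gap < bestGap) { bestGap = gap; best = i; }
            }
            double[] a = bins.get(best), b = bins.get(best + 1);
            double count = a[1] + b[1];
            double centroid = (a[0] * a[1] + b[0] * b[1]) / count;
            bins.set(best, new double[]{centroid, count});
            bins.remove(best + 1);
        }
    }
}
```

Because `merge` only concatenates and compresses, the reduce phase works exactly like the one for sums: each node builds its partial histogram and a single node folds them together.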

This technique has been applied by MetaMarkets with good accuracy for most of what a histogram can tell us about the data distribution: calculating the average, estimating the quartiles and counting the total number of data points/events.

Going Event-driven With Kafka in Two Weeks - Part II

In the first part, I described the motivations and requirements for the transition to an event-driven architecture. In this post, I am going to talk about how to perform distributed counting and how Kafka’s partitioning comes in handy for this kind of task.

Online group-by operations

As mentioned in part one, each consumer receives messages from a set of partitions, in such a way that each partition has only one consumer. The subscriber process can then run one consumer stream per thread and cooperate with instances running on other machines by making every consumer belong to the same consumer group. There is no gain in having more threads in the consumer cluster than there are partitions, as each partition communicates through at most one consumer stream.
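This partitioning is what makes distributed group-by counting cheap. The sketch below uses no real Kafka client; it just simulates the routing property that matters: messages with the same key land in the same partition, each partition is consumed by exactly one stream, so every consumer can keep purely local counters for its keys with no cross-node coordination. The key-hashing scheme here mirrors Kafka's default partitioner but is an illustrative assumption, not Kafka code.

```java
import java.util.HashMap;
import java.util.Map;

// Illustrative sketch (no Kafka client): same key -> same partition ->
// same consumer, so per-key counters can stay local to each consumer.
public class PartitionedCounter {
    // Hash of the key modulo partition count, masked to stay non-negative.
    static int partitionFor(String key, int numPartitions) {
        return (key.hashCode() & 0x7fffffff) % numPartitions;
    }

    public static void main(String[] args) {
        int numPartitions = 4;
        // One local counter map per consumer stream (one stream per partition).
        @SuppressWarnings("unchecked")
        Map<String, Long>[] counters = new HashMap[numPartitions];
        for (int p = 0; p < numPartitions; p++) counters[p] = new HashMap<>();

        // Hypothetical event keys (e.g. country of an ad impression).
        String[] events = {"US", "PT", "US", "DE", "PT", "US"};
        for (String country : events) {
            int p = partitionFor(country, numPartitions);
            counters[p].merge(country, 1L, Long::sum);
        }
        // Every occurrence of a given key was counted by a single consumer.
        for (int p = 0; p < numPartitions; p++) {
            System.out.println("partition " + p + ": " + counters[p]);
        }
    }
}
```

Because each key is wholly owned by one consumer, the per-consumer maps are already the final group-by result; no merge step across consumers is required.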

Going Event-driven With Kafka in Two Weeks - Part I

During the last couple of weeks I’ve been working on a project that involves transforming a batch-based data pipeline into an event-driven one.

I work in the real-time advertising industry, where most of the reports generated require processing huge amounts of data. The reports essentially count events, for traffic estimation or classification. However, counting a thousand events per second in a distributed environment might not be as trivial as it seems.

The old process ingested hourly logs captured from a few dozen nodes. Each node uploaded a compressed bulk of log files to AWS S3, and Hadoop jobs triggered every hour had to load them back into HDFS. The time spent in this IO-bound ETL process was considerable, and the growing amount of data was delaying reporting metrics by several hours.

In real-time bidding, reporting data from a few hours ago can become totally useless for current bidding decisions, as the market and the inventory opportunities vary a lot depending on the countries where you serve ads and on the time of day.

The latency and the time-sensitive utility of reporting data motivated the change to an event-driven pipeline.