Remi Mollicone is the Chairman of CFAR-m. CFAR-m is an original method of aggregation based on neural networks which can objectively summarize the information contained in a large number of complex variables. CFAR-m addresses the major problem of fixing the subjective importance of each variable in an aggregation: it avoids resorting to equal weighting or to weightings based on exogenous criteria.
The Big Data phenomenon stems from the fact that data is growing at a rate that exceeds Moore’s law, and the volume of data being processed is outstripping the capability of our hardware. Big Data does not only refer to the volume of data but, more importantly, to the complexity of the data.
By the complexity of the data, I refer to the interactions existing between data and groups of data. Complexity makes it inappropriate to study a dataset part by part and then agglomerate the results: because the parts interact with one another, a change inside one part has impacts elsewhere, so studying parts individually has no meaning.
To make a comparison, consider this quote by Michael Pidd:
“One of the greatest mistakes that can be made when dealing with a mess is to carve off part of the mess, treat it as a problem and then solve it as a puzzle — ignoring its links with other aspects of the mess.”
As such, to understand the complexity of data, one has to observe the interactions between data within a dataset, as these can explain how groups of data are linked and reveal the structures governing the data.
Tackling the problem of complexity in Big Data:
This complexity can be uncovered by several types of clustering, according to the kind of problem being faced. What we want, among other things, are techniques like topological data analysis, which allow us to show the relations between one group of variables and other groups (a minimal sketch of this grouping step follows the two points below). However, two important points ought to be mentioned here:
1) What do I want to get, and what are the right models and theoretical backgrounds I should use to get these results? This point is really important: if you don’t use the right models, you won’t be able to get the right results.
2) The saying “garbage in, garbage out” is important when tackling the complexity of Big Data. Even if you have the right models, if you don’t feed them with the right data you won’t get the right results.
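To make the discovery step concrete, here is a minimal, hypothetical sketch, not any particular product’s method: it groups variables by the strength of their correlations, so that linked groups of data become visible, in the spirit of the clustering and topological techniques mentioned above. The dataset, column names and two-cluster choice are all invented for illustration.

```python
# Hypothetical sketch: group variables by how strongly they interact.
# The dataset and column names are invented; real data replaces them.
import numpy as np
import pandas as pd
from scipy.cluster.hierarchy import linkage, fcluster

rng = np.random.default_rng(0)
n = 500
factor_a = rng.normal(size=n)   # latent factor driving group A
factor_b = rng.normal(size=n)   # latent factor driving group B
data = pd.DataFrame({
    "a1": factor_a + 0.1 * rng.normal(size=n),
    "a2": factor_a + 0.1 * rng.normal(size=n),
    "b1": factor_b + 0.1 * rng.normal(size=n),
    "b2": factor_b + 0.1 * rng.normal(size=n),
})

# Distance between variables: 1 - |correlation|, so strongly linked
# variables are close together and fall into the same cluster.
dist = 1.0 - data.corr().abs().values
condensed = dist[np.triu_indices_from(dist, k=1)]
labels = fcluster(linkage(condensed, method="average"),
                  t=2, criterion="maxclust")

for col, lab in zip(data.columns, labels):
    print(f"{col} -> cluster {lab}")
```

On this toy data the two latent groups are recovered; on real data the same pattern applies, with specialists interpreting the clusters afterwards.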
What is the problem?
Generally, algorithms pay no attention to the meaning of the data; they just search for links between data and groups of data. The remaining problem is to understand what the results of the discovery mean.
If we want to solve this problem, we must consider how the insights provided represent reality. This is a separate consideration from the interpretation of the results.
Why are understanding and representing important?
Understanding and representing are crucial when it comes to Big Data because the aim of data analysis is to get results in order to understand a problem and manage it effectively. If you can’t understand or represent the problem, you cannot get what you want.
When you can represent a problem within a theoretical framework, it means you can use all the theoretical work previously done by the academic community and then build a better model to solve your problems. A model must describe a phenomenon and predict it; then you can run simulations to support decision-making. This is why it is important, when possible, to use the right theoretical backgrounds when approaching data.
This way of working allows one to build robust models that are descriptive, predictive and prescriptive.
The collaboration of scientists and specialists in the field is needed for the following reasons:
1) Only the specialists of each field involved can decipher which sectors are most important in the discovery phase, and they know the relevant dimensions and variables.
2) Only scientists at the appropriate level are able to understand which theoretical background to use when modelling and solving the data problem.
Once these points are addressed, it is possible to realise and deliver an application that will help managers work efficiently.
To realise this work there are some cornerstones (a sketch of the first two follows the list):
1) Discovery: clustering, topological data analysis …
2) Relations between variables and groups of variables: regression, correlation.
3) What is the relative weight of each group and each variable?
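As a hedged illustration of the first two cornerstones, the sketch below computes one score per group of variables and then measures the relation between groups by correlation and a simple least-squares regression. The group compositions are invented, and a cross-group link is planted artificially so there is something to detect.

```python
# Hypothetical sketch of cornerstones 1 and 2: given groups of variables,
# measure how the groups relate via correlation and a simple regression.
import numpy as np
import pandas as pd

rng = np.random.default_rng(1)
n = 300
df = pd.DataFrame(rng.normal(size=(n, 4)), columns=["a1", "a2", "b1", "b2"])
df["b1"] += 0.8 * df["a1"]           # plant a cross-group link

groups = {"A": ["a1", "a2"], "B": ["b1", "b2"]}

# One score per group: the mean of its standardized variables.
scores = pd.DataFrame({
    g: df[cols].apply(lambda c: (c - c.mean()) / c.std()).mean(axis=1)
    for g, cols in groups.items()
})

print(scores.corr())                 # correlation between groups

# Least-squares regression of group B's score on group A's score.
slope, intercept = np.polyfit(scores["A"], scores["B"], deg=1)
print(f"B ~ {slope:.2f} * A + {intercept:.2f}")
```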
There are many methods that help to solve these problems, and all have their pros and cons. One of them is CFAR-m.
CFAR-m
In summary, CFAR-m helps to describe a complex reality with many interacting fields, to deliver metrics, to run simulations, and to build powerful models. Extracting more objective information from datasets provides better analysis and understanding, and forms the foundation for building advanced solutions to problems, issues or situations.
It can be applied in known fields or used to investigate clusters coming from pattern-discovery tools.
Main features of CFAR-m
1) Automatic extraction of weightings; purely data-driven (a loose stand-in sketch appears below)
2) No reduction of variables (it accounts for all the variables when processing the calculation and delivering the results)
3) Each item has its own vector of weights
4) It shows the contribution of each variable (or group of variables) to the ranking (sensitivity)
5) By taking all the variables into consideration, without exception, CFAR-m is able to determine the level of influence (strong, moderate, or none) of each one. The influential variables are then used to build simplified models that work in real time. That said, in an ever-changing world one needs to be able to detect and anticipate change. This is why CFAR-m is re-run periodically, to check that no major changes have occurred and that the influence of previously non-influential variables has not grown sharply in the interim. If it has, those variables are integrated into the simplified model.
6) Whilst CFAR-m can be used for aggregation and ranking purposes, it is better seen as a tool for building advanced and sophisticated applications.
CFAR-m can model complex relationships without needing any a priori assumptions about the distribution of variables (a major constraint of conventional statistical techniques).
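CFAR-m’s neural-network weighting itself is not published here, so the following sketch is only a loose stand-in: it uses the classic entropy weight method, which shares the headline properties listed above (weights extracted purely from the data, no equal weighting, no exogenous criteria, no variable discarded), though unlike CFAR-m it produces a single global weight vector rather than one per item. The dataset is invented.

```python
# Stand-in sketch (NOT CFAR-m's algorithm): entropy weight method,
# a classic purely data-driven weighting for aggregation and ranking.
import numpy as np

X = np.array([                      # rows = items, columns = variables
    [0.2, 30.0, 5.0],
    [0.8, 10.0, 7.0],
    [0.5, 20.0, 9.0],
    [0.9, 25.0, 4.0],
])

# Min-max normalise each variable to [0, 1] (all assumed "higher is better").
norm = (X - X.min(axis=0)) / (X.max(axis=0) - X.min(axis=0))

# Entropy of each variable: low entropy = more discriminating = more weight.
p = norm / norm.sum(axis=0)
p = np.where(p == 0, 1e-12, p)      # avoid log(0)
entropy = -(p * np.log(p)).sum(axis=0) / np.log(len(X))
weights = (1 - entropy) / (1 - entropy).sum()

scores = norm @ weights             # aggregate, then rank
print("weights:", np.round(weights, 3))
print("ranking (best first):", np.argsort(-scores))
```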
___________________
Different stages:
1) What do we want to measure (risk, governance, etc.)? This determines which theoretical framework is relevant. Even though CFAR-m is a powerful tool, it must be used correctly; if you are not able to describe accurately what you want, specialists may be required or even R&D conducted.
2) What are the “dimensions” to take into account to get the results you want? If you don’t know, specialists may be required or even R&D conducted.
3) What are the variables that represent each dimension? If you don’t know, specialists may be required or even R&D conducted.
4) Information to provide in order to use CFAR-m: the sense of contribution of each variable to the ranking. That means that, for each variable, we have to clarify whether a higher value pushes the item towards the first ranking position (positive contribution) or has the opposite effect. CFAR-m delivers rankings, and to rank we must be able to clarify the contribution of each variable. If we do not know it, further investigation or R&D will be needed (a sketch of such a specification follows this list).
5) Output: what results do you want to obtain? Rankings, contributions of variables, an index, partial indices, weightings, and/or simulations?
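As a small illustration of stage 4, the sketch below declares the sense of contribution of each variable as a signed vector and orients the data so that “higher is better” holds everywhere before any aggregation method is applied. The variable names, directions and placeholder aggregation are all invented for illustration.

```python
# Hypothetical sketch of stage 4: declaring each variable's sense of
# contribution before ranking. Names and directions are invented;
# +1 means "a higher value pushes the item towards the top", -1 the opposite.
import numpy as np

variables = ["revenue", "cost", "defect_rate"]
direction = np.array([+1, -1, -1])

for v, d in zip(variables, direction):
    print(v, "contributes", "positively" if d > 0 else "negatively")

X = np.array([                       # rows = items, columns = variables
    [100.0, 40.0, 0.02],
    [ 80.0, 20.0, 0.01],
    [120.0, 90.0, 0.05],
])

# Flip negatively contributing variables so "higher is better" holds
# for every column before any aggregation method is applied.
oriented = X * direction

# Trivial placeholder for the aggregation step: rank items by the mean
# of the min-max normalised oriented columns.
norm = (oriented - oriented.min(axis=0)) / (oriented.max(axis=0) - oriented.min(axis=0))
print("ranking (best first):", np.argsort(-norm.mean(axis=1)))
```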
Conclusion:
CFAR-m can be a precious data-driven solution for understanding Big Data: it shows what happens inside each cluster at the level of each variable and dimension, and gives the contribution to the ranking (sensitivity) of each variable.
Qualitative and quantitative aspects are two faces of the same coin. Combining the qualitative aspect of some techniques with the quantitative aspect of CFAR-m delivers a unique and powerful solution to Big Data: those techniques deal with the clusters, while CFAR-m operates inside each cluster.
CFAR-m can be combined with other technologies to take into account the following aspects of Big Data, which are also very important:
– complexity as an indicator of disruption, since beyond a certain level complexity introduces instability;
– uncertainty.
As CFAR-m can aggregate many different dimensions without any presupposition, it can be considered a fairly holistic tool.