big data applications – Dataconomy
https://dataconomy.ru

Why Outsourcing Social Media Data Access is a Good Thing
https://dataconomy.ru/2018/11/15/outsourcing-social-media-data/ (15 Nov 2018)

It’s no news that unstructured data has been a highly sought-after source since its inception, first for determining public topical insights and now for training machine learning algorithms. The critical question is whether or not you should outsource its collection to overcome business challenges. Mike Madarasz explains why it could be worth it!

Welcome to the new world of analytics! Brands are hiring data scientists to overcome the challenges and shortcomings of single-point solutions. However, better algorithms and better data are not enough. There is no magic button that drives actionable insights to solve business challenges. At least not yet. Combine the right technology with experienced operators and subject-matter expertise, and now you have something of substance. Technical skills alone are not enough to satisfy market expectations. We need data scientists, engineers and designers: people who understand data, who are creative with data, and who have empathy for the insight-challenged community that is now represented and shaped by experienced practitioners. The tide is turning toward making data and sophisticated algorithms king.

Social media data, for instance, is truly a global phenomenon. Everyone uses a few popular sites, but there are also many regionally dominant platforms. Even setting different apps aside, each culture and region has its own social media nuances, and you have to account for that when analyzing the data. Incorporating social data also adds substantial depth to a variety of use cases: practically any research question can be informed by at least one of these data sets.

But which sources are relevant? Twitter or Facebook? Reddit or YouTube? Or maybe various forums, blogs and reviews? Maybe it’s just one source, or maybe it’s all of them. Many factors contribute to the relevance of a data source for a specific use case. However, whether you are using these datasets for research or for training your machine learning algorithms, they can be invaluable.

The data analytics community essentially builds cars, and cars require gas; data is the gasoline that fuels these sophisticated engines. So the million-dollar question is: “how can I access it?” Understanding how to access unstructured data sources like online conversation is an integral, yet tricky, part of the equation. With today’s compliance and access standards more scrutinized than ever before, knowing how to prepare from a licensing and technical perspective is essential to maximizing the chances of successful analytics. Social media today looks nothing like it did 15 years ago. Data has become more complex, more global, and has more uses than anyone could have predicted. We are talking about hundreds of millions of data points from millions of sources. According to market experts, more data has been created in the past two years alone than in the entire previous history of the human race, and within five years there will be over 50 billion smart connected devices in the world, all developed to collect, analyze and share data. Just accessing the raw data isn’t enough anymore; it is so nuanced that you need to have it sifted and parsed to make any real use of it.


Which raises the question: buy or build?

Data aggregation is time-consuming, costly and very difficult to do effectively. I’d compare it to renting an office. Do you want to have to find your own source of water? Electricity? Of course not; gathering those things is not in your wheelhouse, and your energy is best spent elsewhere. The insights from social media data today are right up there with water and power when it comes to keeping a business functioning. You need them, but just as you don’t want to run your own pipes and your own power plant, you shouldn’t have to go out and find the social media data relevant to your requirements.

There are viable options that identify, index and make unstructured data available in a structured way, enabling access to social media content from a vast array of sources through near-real-time to real-time delivery mechanisms. Standardized data sets support business intelligence algorithms and predictive modelling by providing on-demand access to years of historical data.
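As a rough illustration of what “making unstructured data available in a structured way” can look like, the sketch below (Python; every field name and payload shape is invented for the example) maps raw posts from two hypothetical sources onto one standardized record, so that downstream analytics and models see a uniform schema.

```python
from dataclasses import dataclass, asdict
from datetime import datetime, timezone

@dataclass
class StandardPost:
    # Uniform schema assumed purely for illustration; real licensed feeds
    # define their own fields and delivery formats.
    source: str
    author_id: str
    created_at: str
    text: str
    language: str

def normalize_tweet(raw: dict) -> StandardPost:
    """Map a hypothetical Twitter-style payload onto the standard schema."""
    return StandardPost(
        source="twitter",
        author_id=str(raw["user"]["id"]),
        created_at=raw["created_at"],
        text=raw["text"],
        language=raw.get("lang", "und"),
    )

def normalize_forum_post(raw: dict) -> StandardPost:
    """Map a hypothetical forum export onto the same schema."""
    return StandardPost(
        source="forum",
        author_id=raw["username"],
        created_at=datetime.fromtimestamp(raw["ts"], tz=timezone.utc).isoformat(),
        text=raw["body"],
        language=raw.get("language", "und"),
    )

if __name__ == "__main__":
    tweet = {"user": {"id": 42}, "created_at": "2018-11-15T13:12:04Z",
             "text": "Outsource the plumbing, keep the analysis.", "lang": "en"}
    print(asdict(normalize_tweet(tweet)))
```

Once every source lands in the same schema, the same enrichment, indexing and modelling pipeline can run over all of them, which is the practical payoff of buying standardized data rather than building per-platform plumbing.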

Social data is a basic necessity and should be delivered to a business already mined and sifted so that analytics systems can do their work. Even if you run a truly huge business and want full vertical integration, you could do your own social media mining, but it’s still probably not worth it. At the end of the day, if you are a data scientist or analyst, your time is best spent on your core competency rather than on tedious data collection.

With the rise of machine learning, the ability to analyze data is only getting better. We’ll be able to look at information much more quickly and organize it more precisely, which is just as well, because there will be far more useful data coming in. Our whole world is online, and IoT alone is going to load up systems worldwide with more data than we can conceive of. Better machine learning and heuristics are necessary just to keep up with the flood of information we’re expecting to see, and the companies best positioned to thrive in this digital tomorrow are the ones that make the best use of all this information. To do that, you need the best, smoothest access to well-organized data.

 

 

CFAR-m: Simplifying The Complexity of Big Data
https://dataconomy.ru/2014/05/08/cfar-m-simplifying-the-complexity-of-big-data/ (8 May 2014)

Remi Mollicone is the Chairman of CFAR-m. CFAR-m is an original method of aggregation based on neural networks which can objectively summarize the information contained in a large number of complex variables. CFAR-m solves the major problem of fixing the subjective importance of each variable in the aggregation: it avoids adopting an equal weighting or a weighting based on exogenous criteria.


The Big Data phenomenon stems from the fact that data is growing at a rate that exceeds Moore’s law, and the volume of data being processed is becoming greater than the capability of our hardware. Big Data does not refer only to the volume of data but, more importantly, to its complexity.

By the complexity of the data, I refer to the interactions existing between data and groups of data. Complexity makes it inappropriate to study a dataset part by part and then agglomerate the results. This is primarily because, as parts interact with other parts, a change inside one part has impacts elsewhere, so studying them individually has no meaning.

To make a comparison, consider this quote from Michael Pidd:

“One of the greatest mistakes that can be made when dealing with a mess is to carve off part of the mess, treat it as a problem and then solve it as a puzzle — ignoring its links with other aspects of the mess.”

As such, to understand the complexity of data, one has to observe the interaction between data within a dataset, as this can explain how groups of data are linked, and also the structures governing the data.

Tackling the problem of complexity in Big Data:

This complexity can be uncovered by several types of clustering, according to the kind of problem being faced. What we want to see, among other things, are techniques like topological data analysis, which allow us to show the relations between one group of variables and other groups. However, two important points ought to be mentioned here:

1) What do I want to get, and what are the right models and theoretical backgrounds that I should use to get these results? This point is really important. If you don’t use the right models you won’t be able to get the right results.

2) The saying “garbage in, garbage out” matters when tackling the complexity of Big Data: even if you have the right models, if you don’t feed them the right data you won’t get the right results.
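To make the discovery step above a little more concrete, here is a minimal clustering sketch (synthetic data, standard NumPy/SciPy tooling, not tied to any particular product): variables are grouped by how strongly they move together, which is one simple way of exposing structure hidden in a complex dataset.

```python
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster
from scipy.spatial.distance import squareform

# Synthetic data purely for illustration: 200 observations of 6 variables,
# where variables 0-2 and 3-5 form two correlated groups.
rng = np.random.default_rng(0)
base_a, base_b = rng.normal(size=(2, 200))
X = np.column_stack([base_a + 0.1 * rng.normal(size=200) for _ in range(3)] +
                    [base_b + 0.1 * rng.normal(size=200) for _ in range(3)])

# Use 1 - |correlation| as a distance between variables, then cluster
# hierarchically to reveal which variables move together.
corr = np.corrcoef(X, rowvar=False)
dist = 1.0 - np.abs(corr)
dist = (dist + dist.T) / 2.0          # enforce exact symmetry for squareform
np.fill_diagonal(dist, 0.0)
labels = fcluster(linkage(squareform(dist), method="average"),
                  t=2, criterion="maxclust")
print("variable group labels:", labels)  # e.g. [1 1 1 2 2 2]
```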

What is the problem?

Generally the algorithms don’t pay attention to the meaning of data; they just focus on searching for links between data and groups of data. The remaining problem is to understand what the results of the discovery mean.

If we want to solve this problem we must consider how reality is represented based on the insights provided. This is a separate consideration from the interpretation of the results.

Why are understanding and representation important?

Understanding and representation are crucial when it comes to big data because the aim of data analysis is to get results that help you understand a problem and manage it effectively. If you can’t understand or represent the problem, you cannot get what you want.

When you can represent a problem within a theoretical framework, you can draw on all the theoretical work previously done by the academic community and build a better model to solve your problem. A model must both describe a phenomenon and predict it; then you can run simulations to support decision-making. This is why it is important, when possible, to use the right theoretical background when approaching data.

This way of working allows one to build robust models that are descriptive, predictive and prescriptive.

The collaboration of scientists and specialists in the field is needed for the following reasons:

1) Only the specialists of each field involved can decipher which sectors are most important in the discovery phase, and which dimensions and variables to use.

2) Only scientists at the appropriate level are able to understand which theoretical background to use when modelling and solving the data problem.

Once these points are addressed, it is possible to realise and deliver an application that will help managers work efficiently.

To realise this work, there are some cornerstones:

1) Discovery: clustering, topological data analysis …

2) Relations between variables and groups of variables: regression, correlation.

3) What is the relative weight of each group and each variable?

There are lots of methods that help to solve these problems; all have their pros and cons. One of them is CFAR-m.
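Before turning to CFAR-m itself, a conventional baseline for cornerstones 2 and 3 might look like the sketch below: simple correlations between each variable and an outcome, plus standardized regression coefficients used as crude relative weights. The data is synthetic, and this is just one of the “lots of methods” with well-known limitations, not CFAR-m.

```python
import numpy as np

rng = np.random.default_rng(1)
n = 500
x1, x2, x3 = rng.normal(size=(3, n))
y = 2.0 * x1 + 0.5 * x2 + rng.normal(scale=0.3, size=n)   # x3 is irrelevant
X = np.column_stack([x1, x2, x3])

# Cornerstone 2: relations between variables and the outcome.
print("correlation of each variable with y:",
      np.round([np.corrcoef(X[:, j], y)[0, 1] for j in range(3)], 2))

# Cornerstone 3: standardize, then solve ordinary least squares; coefficient
# magnitudes act as relative weights under strong (often unrealistic) assumptions.
Xs = (X - X.mean(axis=0)) / X.std(axis=0)
ys = (y - y.mean()) / y.std()
beta, *_ = np.linalg.lstsq(Xs, ys, rcond=None)
print("standardized coefficients:", np.round(beta, 2))
```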

CFAR-m

In summary, CFAR-m helps to describe a complex reality with many interacting fields, to deliver metrics, run simulations, and build powerful models. Extracting more objective information from datasets provides better analysis and understanding, and forms the foundation for building advanced solutions to problems, issues or situations.

It can be applied in known fields or used to investigate clusters coming from pattern-discovery tools.

Main features of CFAR-m

1) Automatic extraction of weightings: purely data-driven

2) No reduction of variables (it accounts for all the variables when performing the calculation and delivering the results)

3) Each item has its own vector of weights

4) It shows the contribution of each variable (or group of variables) to the ranking (sensitivity)

5) By taking into consideration all the variables without exception, CFAR-m is able to determine the level of influence (a lot, some, or none) of each one. Variables are used to build simplified models that work in real time. That said, in an ever-changing world, one needs to be able to detect and anticipate changes. This is why CFAR-m is re-run periodically to check that no major changes have occurred and that the influence of any previously non-influential variable has not grown dramatically in the interim. If it has, those variables are integrated into the simplified model.

6) Whilst CFAR-m can be used for aggregation and ranking purposes, it is better as a tool for building advanced and sophisticated applications.

CFAR-m can model complex relationships without needing any a priori assumptions about the distribution of variables (a major constraint of conventional statistical techniques).
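CFAR-m’s own aggregation algorithm is not described in this article, so the sketch below is only a generic illustration of the kind of behaviour listed above: a small neural network is fit to synthetic data, and per-item input gradients are read off as item-specific weight vectors, giving both a weight vector per item and an average contribution per variable. It should not be read as CFAR-m’s actual method.

```python
# Not CFAR-m: its aggregation algorithm is proprietary. This is only a generic
# illustration of "data-driven, per-item weights" using input gradients of a
# small neural network trained on hypothetical synthetic data.
import torch
import torch.nn as nn

torch.manual_seed(0)
n_items, n_vars = 256, 5
X = torch.rand(n_items, n_vars)
# Synthetic target: variables 0 and 3 matter most, with an interaction term.
y = 2.0 * X[:, 0] + 1.0 * X[:, 3] + 0.5 * X[:, 0] * X[:, 3]

model = nn.Sequential(nn.Linear(n_vars, 16), nn.Tanh(), nn.Linear(16, 1))
opt = torch.optim.Adam(model.parameters(), lr=0.01)
loss_fn = nn.MSELoss()
for _ in range(500):
    opt.zero_grad()
    loss = loss_fn(model(X).squeeze(-1), y)
    loss.backward()
    opt.step()

# Per-item sensitivity: gradient of the model output with respect to each input.
X_req = X.clone().requires_grad_(True)
model(X_req).sum().backward()
per_item_weights = X_req.grad.abs()          # one weight vector per item
print(per_item_weights[:3])
print("average contribution per variable:", per_item_weights.mean(dim=0))
```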

___________________

Different stages:

1) What do we want to measure (risk, governance, etc.)? This determines which theoretical framework is relevant. Even though CFAR-m is a powerful tool, it must be used correctly. If you are not able to accurately describe what you want, specialists may be required or even R&D conducted.

2) What are the “dimensions” to take into account to get the results you want? If you don’t know, specialists may be required or even R&D conducted.

3) What are the variables that represent each dimension? If you don’t know, specialists may be required or even R&D conducted.

4) Information to provide in order to use CFAR-m: the sense of contribution of each variable to the ranking. That means that for each variable we have to clarify whether a higher value will push the item towards the top of the ranking (positive contribution) or have the opposite effect. CFAR-m delivers a ranking, and to rank we must be able to clarify the contribution of each variable (a toy example follows this list). If we do not know, further investigation will be needed or R&D must be conducted.

5) Output: What results do you want to obtain? Ranking, contribution of variables, index, partial index, weightings, and/or simulations?
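As a toy illustration of stages 4 and 5 (not CFAR-m, whose weights are extracted from the data rather than fixed by hand), the sketch below declares the sense of contribution of each variable, normalizes the data, and produces a simple weighted ranking. All items, values and weights are invented.

```python
import numpy as np

items = ["A", "B", "C", "D"]
X = np.array([[0.80, 120.0, 3.0],
              [0.55,  90.0, 1.0],
              [0.95, 200.0, 4.0],
              [0.40,  60.0, 2.0]])
# +1: a higher value pushes the item toward the top; -1: the opposite effect.
direction = np.array([+1, -1, +1])
weights = np.array([0.5, 0.3, 0.2])   # fixed by hand here; CFAR-m extracts weights from the data

# Min-max normalise each variable, then flip the ones with negative contribution.
X_norm = (X - X.min(axis=0)) / (X.max(axis=0) - X.min(axis=0))
X_norm = np.where(direction > 0, X_norm, 1.0 - X_norm)

scores = X_norm @ weights
for rank, idx in enumerate(np.argsort(-scores), start=1):
    print(rank, items[idx], round(float(scores[idx]), 3))
```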

Conclusion:

CFAR-m can be a valuable, data-driven solution for understanding Big Data and what happens inside each cluster at the level of each variable and dimension, and for getting the contribution to the ranking (sensitivity) of each variable.

Qualitative and quantitative aspects are two sides of the same coin. Combining the qualitative aspect of some techniques with the quantitative aspect of CFAR-m delivers a unique and powerful solution: those techniques deal with the clusters, while CFAR-m operates inside each cluster.

CFAR-m can be combined with other technologies to take into account the following aspects of big data, which are also very important:

– complexity as an indicator of disruption, since a certain level of complexity introduces instability;

– uncertainty.

As CFAR-m can aggregate many different dimensions without any presupposition, it can be considered a fairly holistic tool.

Big Data to Make You Healthier
https://dataconomy.ru/2014/04/25/big-data-make-healthier-2/ (25 Apr 2014)

Yesterday IBM announced that it was making new investments in its US Federal Healthcare Practice to provide for the ever-increasing technology needs of health institutions. Big data solutions will be used, among other things, to improve the level of care, lower costs, and combine and then analyse information. For the first time, the data being examined will include previously ignored unstructured data, such as physicians’ notes.

According to a statement by Anne Altman, General Manager, IBM US Federal: “IBM has a proven track record in delivering transformational, value-based healthcare solutions that can increase the quality of care and lower costs in both the public and private sector. … Government leaders recognise that there is a tremendous opportunity to combine new and existing data sources with advancements in technology to find innovative ways to build a sustainable and affordable healthcare system.”

One pilot project taking place at Carilion Clinic involved IBM trawling through more than 2 million patients’ records with their Advanced Care Insights solution to identify those at risk for congestive heart failure.  This work with hospitals and health care providers can contribute to identifying and intervening earlier to help patients receive the most beneficial preemptive care.  In the pilot project, at-risk patients were identified correctly by the predictive models 85% of the time.
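IBM’s Advanced Care Insights models are not public, so the following is only a hedged, minimal sketch of the general idea of a risk classifier trained on patient-record features; everything here, including the synthetic features and their relationship to risk, is invented for illustration.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score

rng = np.random.default_rng(7)
n = 5000
age = rng.normal(60, 12, n)
bmi = rng.normal(28, 5, n)
notes_mentions = rng.poisson(1.5, n)       # e.g. count of symptom mentions mined from notes
# Synthetic "true" risk of the outcome, then simulated labels.
risk = 1 / (1 + np.exp(-(0.05 * (age - 60) + 0.08 * (bmi - 28) + 0.6 * notes_mentions - 1.0)))
y = rng.binomial(1, risk)
X = np.column_stack([age, bmi, notes_mentions])

X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.2, random_state=0)
clf = LogisticRegression(max_iter=1000).fit(X_tr, y_tr)
print("held-out accuracy:", round(accuracy_score(y_te, clf.predict(X_te)), 3))
```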

Big data innovations from IBM, in conjunction with various other players in the healthcare industry, have led to more interconnected systems, enabling improved care with fewer avoidable errors. Disease prediction and medical innovation are among the other transformations of medicine that IBM and big data are helping to facilitate.

 

Read more here

(Image Credit:  Life Mental Health)

LAPD Using Big Data to Fight Crime
https://dataconomy.ru/2014/04/24/lapd-using-big-data-fight-crime-2/ (24 Apr 2014)

Predicting future crime has left the realm of futuristic science fiction and is now part of everyday reality, at least for the officers of the LAPD. These officers are now using the power of big data to determine hotspots for crime, and as a result are lowering crime across the Los Angeles metropolitan area. What started with a mathematical program to predict earthquakes has now contributed to a reduction of 33% in burglaries, 21% in violent crime, and 12% in property crime in the areas where the algorithms are being applied.

Because criminal activity often produces ‘aftershocks’ much like earthquakes, the mathematical model developed by Assistant Professor George Mohler was fed huge amounts of past crime data – 13 million crimes spanning the last eight decades – and now allows the LAPD to predict the areas that will be hotspots of crime in the near future. The algorithm and its results were further fine-tuned in collaboration with the University of California and PredPol, whose stated goal is to “place officers at the right time and location to give them the best chance of preventing crime”.
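The article does not give the model’s details, but the ‘aftershock’ idea can be sketched generically with a self-exciting point process, in which each past crime temporarily raises the expected rate of new ones. The snippet below is a hypothetical one-dimensional illustration, not PredPol’s actual model.

```python
import numpy as np

def intensity(t, past_times, mu=0.2, alpha=0.8, beta=1.5):
    """Expected event rate at time t (days): a constant background rate mu plus
    an exponentially decaying boost from every past event before t."""
    past = np.asarray([s for s in past_times if s < t])
    return mu + alpha * beta * np.exp(-beta * (t - past)).sum()

past_burglaries = [1.0, 1.2, 3.5]          # hypothetical event times, in days
for t in [1.3, 2.0, 5.0, 10.0]:
    print(f"day {t:4.1f}: expected rate = {intensity(t, past_burglaries):.3f}")
```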

When the pilot project was launched, police officers were hesitant to start using the program and resistant to the idea that a data program could tell them where and how to do their job better than years of experience on the force. But using the program and monitoring the results in real time has swayed opinions – as have the stellar results mentioned earlier. The model is now also being updated in real time with crime data as it comes in, honing the predictive power of the big data even further.

Finally, the power of big data analytics does not end with predicting crime. Big data is now also being used to fight insurance fraud and to put the crimes that occur into the context of the larger picture in which they take place. Correlations can now also be found, as Shaun Hipgrave, a former police officer and current security consultant for IBM, illustrates: “When you use big data you can see the relationships between one family and another troubled family and you see the absences from school.” The stellar results across the board from this trial program have now resulted in the PredPol software being rolled out for trials in over 150 American cities, and it is likely to expand and spread further across the country.

 

Read more here

(Image Credit:  jondoeforty1)

Big Data Now Environmental Activist
https://dataconomy.ru/2014/04/17/big-data-now-environmental-activist-4/ (17 Apr 2014)

The day has come when big data is part of environmental activism fighting deforestation. Powered by Google Earth Engine, Global Forest Watch was launched in February by the World Resources Institute to facilitate a global fight against deforestation. Users can look back at trends starting in 2000, and with parts of the map being updated roughly every two weeks, hot spots in Indonesia and Brazil can be tracked.

“They log on, access all the data, and run their own algorithms,” David Thau, senior developer advocate for Google Earth Engine, says of the project. With data drawn from a number of satellites, the program enables scientists to access information directly. Having grown out of a variety of other projects, the researchers’ algorithms are what generate the extraordinary map tracking deforestation, and they have now been scaled up to do this globally at an unprecedented scale.

According to Thau, Google Earth Engine eases and aids this big-data research. With normal cloud computing, researchers distribute computing tasks across the network themselves. Working with this program, researchers instead use a programming interface to enter queries, which get ‘parallelized’ automatically.
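For a flavour of what such a query can look like from the client side, here is a hedged sketch using the Earth Engine Python API. It requires an authenticated Earth Engine account, and the Hansen Global Forest Change asset ID shown is an assumption that may not match the currently published version.

```python
import ee

ee.Initialize()  # assumes prior authentication with an Earth Engine account
# Asset ID is an assumption; newer versions of the Hansen dataset exist.
loss = ee.Image("UMD/hansen/global_forest_change_2015").select("loss")
region = ee.Geometry.Rectangle([112.0, -4.0, 116.0, 0.0])   # a patch of Borneo

# Count lost-forest pixels in the region; the computation is distributed
# server-side by Earth Engine, not by the client.
stats = loss.reduceRegion(reducer=ee.Reducer.sum(),
                          geometry=region, scale=30, maxPixels=1e9)
print(stats.getInfo())
```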

The World Resources Institute is also giving the public access to this information, which may be particularly useful in Indonesia. With the public’s support, agribusiness campaigner Gemma Tillack is asking a number of large companies to only grow new plantations that are sustainable and do not perpetuate forest loss. What larger changes can be pushed forward when the public is given free access to the information remains to be seen.

(Image Credit:  Wagner T. Cassimiro)

MBN Big Data Survey Findings
https://dataconomy.ru/2014/04/17/mbn-big-data-survey-findings-4/ (17 Apr 2014)

A big data survey conducted by MBN in 2013 has revealed some of the shifting attitudes toward big data in the industry. Conducted with upwards of 135 companies, the survey suggests a trend towards integration of big data projects and initiatives. “Surprisingly, some 93 percent of businesses with dedicated business analysts do not consider data analysts as part of their IT headcount. When explored as a whole, these findings point to an ever-increasing emphasis by businesses to embed data management and analysis throughout the business – getting data into the hands of many more users.”

Even though a vast majority of respondents were already using business insight tools of some form, over half of them indicated that they would be further increasing their investment in this area. The added expenditure seems to be less on technology and more on hiring. This augmented financial backing also comes with an added expectation of return on the investment, which companies expect to see in the near future. As the survey indicated: “The vast majority of companies (92 percent) with dedicated data scientists and analysts have turned data into revenue. Among companies without dedicated analysts, less than 30 percent have successfully converted data into revenue.”

Aside from generating a revenue stream, the attitudes and implementation suggest there are other added benefits to adopting some form of big data strategy in a company.  According to the survey, “93 percent of respondents reported that using data has helped them make informed business decisions.”

This use of big data to analyze trends and draw more information-based conclusions about the best course of action for a business is only going to increase in the future. As the current obstacles and flaws with regard to storing and effectively analyzing the massive quantities of data on hand are removed, big data will become the crux of all business insights and decision-making processes.

(Image Credit:  Marja van Bochove)
