How to use ElasticSearch for Natural Language Processing and Text Mining — Part 2
Dataconomy (https://dataconomy.ru), May 24, 2017

Welcome to Part 2 of How to use Elasticsearch for Natural Language Processing and Text Mining. It’s been some time since Part 1, so you might want to brush up on the basics before getting started.

This time we'll focus on one very important type of query for Text Mining. Depending on your data, it can solve at least two different kinds of problems. The magical query I'm referring to is the More Like This (MLT) query.

GET /_search
{
    "query": {
        "more_like_this" : {
            "fields" : ["title", "content", "category", "tags"],
            "like" : "Once upon a time",
            "max_query_terms" : 12
        }
    }
}

Let's have a look at the basic parameters.

As input ("like"), the query takes either a set of document IDs from the index you're querying or free text / an external document.

Now the question: what do you want to compare it to? You can list all the fields that are interesting. Let's assume your dataset consists of news articles; the relevant fields would be, for example: title, content, category and tags.

What happens when that query is fired?

It will analyse your input text, which comes either from the documents in the index or directly from the like text. It then extracts the most important keywords from that text and runs a boolean should query with all those keywords.
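Conceptually, the MLT query from the top of this post gets rewritten into something roughly like the following (the extracted terms here are invented for illustration; the real terms depend on your corpus):

```json
GET /_search
{
    "query": {
        "bool": {
            "should": [
                { "match": { "title":   "time" } },
                { "match": { "content": "time" } },
                { "match": { "content": "once" } }
            ]
        }
    }
}
```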

How does it know what a keyword is?

Keywords can be determined with a formula, given a set of documents: you compare a subset of the documents against the entire corpus based on term frequencies. The formula is called TF-IDF (term frequency, inverse document frequency), and despite several attempts to find something newer and fancier, it is still one of the most important formulas in Text Mining.

It assigns a score to each term in the subset relative to the entire corpus of documents.

A high score indicates that a term characterizes the current subset of documents and clearly distinguishes it from all other documents.
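To make the idea concrete, here is a minimal plain-Python sketch of TF-IDF scoring (illustrative only; Elasticsearch/Lucene use their own, more refined similarity internally, and the tiny corpus below is made up):

```python
import math
from collections import Counter

def tf_idf_scores(subset_docs, all_docs):
    """Score each term of a document subset against the full corpus.

    tf  = how often the term occurs in the subset,
    idf = log(N / df), where df is the number of corpus documents
          containing the term. Terms that are frequent in the subset
          but rare in the corpus get the highest scores.
    """
    n = len(all_docs)
    df = Counter()
    for doc in all_docs:
        df.update(set(doc.split()))        # document frequency
    tf = Counter()
    for doc in subset_docs:
        tf.update(doc.split())             # term frequency in the subset
    return {term: count * math.log(n / df[term]) for term, count in tf.items()}

corpus = [
    "goal keeper match stadium",           # sports
    "election parliament vote",            # politics
    "goal scored in the match",            # sports
    "stock market vote of confidence",     # business
]
scores = tf_idf_scores(corpus[:1], corpus)
# "keeper" (rare in the corpus) outscores "goal" (appears in two docs).
```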

If you have a very clean dataset of, let's continue with the example, news articles, you should easily be able to extract keywords that describe each section: Sports, Culture, Politics and Business.

But if you have to solve a real-world Big Data problem, you will probably have a lot of noise in your data: links, words from other languages, tags, etc. If that "garbage" is not equally distributed, you will have a problem: TF-IDF will assign very high scores to all those rare "mistakes" in your dataset, because they look very unique to the algorithm.

So you need to be aware of this and clean up your dataset.

Anyway, this is the logic used under the hood when running a More Like This query.

You can further configure the maximum number of query terms and some frequency cutoffs that can also help you with cleaning up the input.

The MLT query will return results most of the time, as long as your document corpus (index) is large enough.

If you don’t trust the “magic” query or want to understand why it returns certain hits you can activate highlighting.

So you will be able to see the query terms that matched the documents.

That’s the best you can get. There is no option to return all the generated keywords from the input document.

To enable highlighting with the More Like This query you need to configure your mapping for the fields you want to be highlighted.

Just add this to the properties of the field:

"term_vector": "with_positions_offsets"
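For example, a mapping for a hypothetical news index could look like this (index, type and field names are illustrative; the type-in-mapping syntax matches pre-7.x Elasticsearch, current when this was written):

```json
PUT /news
{
    "mappings": {
        "article": {
            "properties": {
                "content": {
                    "type": "text",
                    "term_vector": "with_positions_offsets"
                }
            }
        }
    }
}
```

With that in place, you can add a regular highlight section to your MLT search request and the matched query terms will be marked up in the results.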

We talked a lot about the MLT query and maybe you already have a few applications in mind.

3. Recommendation Engine

The most basic TextMining application for the MLT query is a recommendation engine.

There are usually two types of recommendation engines: social and content-based. A social recommendation engine is also referred to as "Collaborative Filtering", best known from Amazon's "People who bought this product also bought…"


This works on the assumption that a user will be interested in what other users with a similar taste liked. You need quite a lot of interaction data for this to work well.

The other type is called an "item-based recommendation engine". It tries to group the entries based on their properties. Think of novels or scientific papers as an example.

With Elasticsearch you can easily build an item based recommendation engine.

You just configure the MLT query template based on your data and that’s it. You will use the actual item ID as a starting point and recommend the most similar documents from your index.

You can add custom logic by combining the More Like This query with a function score query (inside a bool query, if needed) to boost by popularity or recency on top of the similarity matching.
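As a sketch (not from the original post: the index name, document ID and the popularity/published_at fields are assumptions), such a combination could look like this:

```json
GET /news/_search
{
    "query": {
        "function_score": {
            "query": {
                "more_like_this": {
                    "fields": ["title", "content", "category", "tags"],
                    "like": [{ "_index": "news", "_id": "12345" }],
                    "max_query_terms": 25
                }
            },
            "functions": [
                { "field_value_factor": { "field": "popularity", "modifier": "log1p" } },
                { "gauss": { "published_at": { "origin": "now", "scale": "30d" } } }
            ]
        }
    }
}
```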

4. Duplicate Detection

Depending on your dataset, that same MLT query will return all duplicates. If you have data from several sources (news, affiliate ads, etc.), you are quite likely to run into duplicates. For most end-user applications this is unwanted behaviour.

But for an expert system you could use this technique to clean up your dataset.

How does it work?

There are always two big problems with duplicate detection:

1. You need to compare all documents pairwise (O(n²)).
2. The first inspected element is kept and all others are discarded.

So you need a lot of custom logic to choose the first document to look at. It should be the best one.

As the complexity is very high you might not want to detect duplicates offline in a batch process but online as they are needed.

The industry-standard algorithms for duplicate detection are SimHash and MinHash (used by Google and Twitter, for example).

They generate hashes for all documents, store them in an extra datastore and compare them with a similarity function. All pairs of documents that exceed a certain threshold are considered duplicates.
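Here is a toy MinHash sketch in plain Python to illustrate the principle (the salting scheme via MD5 is a simplification chosen for this example; production implementations use proper hash families and locality-sensitive bucketing):

```python
import hashlib

def minhash_signature(tokens, num_hashes=64):
    """Simulate num_hashes hash functions by salting MD5; keep the
    minimum hash value per salt. Similar token sets produce
    signatures that agree in many positions."""
    return [
        min(int(hashlib.md5(f"{salt}:{t}".encode()).hexdigest(), 16)
            for t in tokens)
        for salt in range(num_hashes)
    ]

def estimated_jaccard(sig_a, sig_b):
    """The fraction of agreeing signature positions estimates the
    Jaccard similarity of the underlying token sets."""
    return sum(a == b for a, b in zip(sig_a, sig_b)) / len(sig_a)

a = minhash_signature("the quick brown fox jumps over the lazy dog".split())
b = minhash_signature("the quick brown fox jumped over a lazy dog".split())
# A high estimate means likely near-duplicates; declare a duplicate
# once the estimate exceeds a chosen threshold.
```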

For very short documents you can work with the Levenshtein distance (minimum edit distance). But for longer documents you might want to rely on a token-based solution.
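For completeness, a compact Levenshtein distance implementation (standard dynamic programming, keeping only two rows at a time):

```python
def levenshtein(a, b):
    """Minimum number of single-character edits (insertions,
    deletions, substitutions) needed to turn string a into b."""
    prev = list(range(len(b) + 1))          # distances for empty prefix of a
    for i, ca in enumerate(a, 1):
        curr = [i]
        for j, cb in enumerate(b, 1):
            curr.append(min(
                prev[j] + 1,                # deletion
                curr[j - 1] + 1,            # insertion
                prev[j - 1] + (ca != cb),   # substitution (free on match)
            ))
        prev = curr
    return prev[-1]

levenshtein("kitten", "sitting")  # -> 3
```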

The more like this query can help you here.

I have the next blog post in the works, but don't worry: you'll have enough time to let all of the knowledge in Part 1 and Part 2 sink in. For now, stay tuned for Part 3 of the How to use Elasticsearch for NLP and Text Mining series, where we'll tackle Text Classification and Clustering.

 

How to use ElasticSearch for Natural Language Processing and Text Mining — Part 1
Dataconomy (https://dataconomy.ru), December 30, 2016

Elasticsearch is a search engine and an analytics platform, but it also offers many features that are useful for standard Natural Language Processing and Text Mining tasks.

1. Preprocessing (Normalization)

Have you ever used the _analyze endpoint?

As you know, Elasticsearch has over 20 language analyzers built in. What does an analyzer do? Tokenization, stemming and stopword removal.

That is very often all you need for preprocessing for higher level tasks such as Machine Learning, Language Modelling etc.

You basically just need a running instance of Elasticsearch, without any configuration or setup. Then you can use the _analyze endpoint as a REST API for NLP preprocessing.

curl -XGET "http://localhost:9200/_analyze?analyzer=english" -d'
{
  "text" : "This is a test."
}'

{
  "tokens": [
    {
      "token": "test",
      "start_offset": 10,
      "end_offset": 14,
      "type": "<ALPHANUM>",
      "position": 3
    }
  ]
}

A list of all available built-in language analyzers can be found in the Elasticsearch documentation.

2. Language Detection

Detecting languages is a so-called "solved" NLP problem. You just need a character n-gram language model, derived from a relatively small plain-text corpus, for each language you want to distinguish.

So no need to reinvent the wheel over and over.
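To see why this is considered solved, here is a toy character-bigram language detector in plain Python (the two one-sentence "corpora" are obviously far too small for real use; real models are trained on much more text):

```python
import math
from collections import Counter

def char_ngrams(text, n=2):
    """Character n-grams of the lowercased, space-padded text."""
    text = f" {text.lower()} "
    return [text[i:i + n] for i in range(len(text) - n + 1)]

def train(samples):
    """Character-bigram counts per language."""
    return {lang: Counter(char_ngrams(t)) for lang, t in samples.items()}

def detect(text, models):
    """Pick the language whose bigram model gives the text the
    highest add-one-smoothed log-likelihood."""
    def score(counts):
        total = sum(counts.values()) + len(counts) + 1
        return sum(math.log((counts[g] + 1) / total) for g in char_ngrams(text))
    return max(models, key=lambda lang: score(models[lang]))

models = train({
    "en": "the quick brown fox jumps over the lazy dog and this is a test",
    "de": "der schnelle braune fuchs springt ueber den faulen hund und das ist ein test",
})
detect("this is another test", models)  # expected: "en"
```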

When you already have Elasticsearch up and running, you can simply install another plugin, for example one that exposes a _langdetect endpoint:


curl -XPOST 'localhost:9200/_langdetect?pretty' -d 'This is a test'
{
  "profile" : "/langdetect/",
  "languages" : [ {
    "language" : "en",
    "probability" : 0.9999971603535163
  } ]
}

That’s it. It’s open source, free to use and super simple.

"How to use ElasticSearch for Text Mining" appeared originally on the textminers.io blog.


 


Image: born1945, CC 2.0

Travel Startup Hopper Uses Big Data to Find Cheaper Travel Options for Users
Dataconomy (https://dataconomy.ru), December 9, 2014

Travel startup innovator Hopper is mining Big Data to help users to find the cheapest available travel options from across the web.

Chief Data Scientist Patrick Surry explains: "Every time you check a price, it's different, it changes day-to-day, and people have no idea whether they're getting a good deal. What we're trying to do is bring some transparency to that, and the way we're doing that is by working with billions of flight prices that we collect every day."

The research and data team at Hopper uses a combination of Apache HBase, Apache Hive and Elasticsearch to harness the vast datasets available from sites like Expedia and Priceline, from which it produces large-scale analyses that make sense of travel news and trends. With custom-built tools it carries out flexible searches and offers interactive, information-rich results. "We provide a wealth of data about pricing, scheduling, and airlines for every origin-destination combination," according to their website.

Based in Boston and Montreal, Hopper has landed funding several times; however, it is not currently looking for another round. A mobile application is in the pipeline, reports CRN.

Read more here.


(Image credit: Hopper)

What You Don't Know About Apache Lucene
Dataconomy (https://dataconomy.ru), July 10, 2014

According to his LinkedIn profile, Robert Muir is a Mongolia-based Ghostbuster for Elasticsearch. Any activities involving the elimination of supernatural entities aside, what we do know is that his work at Elasticsearch involves implementing and improving the reliability of Apache Lucene. He's also an Apache Lucene committer; Lucene is a high-performance, full-featured text search engine library written entirely in Java. It is a technology suitable for nearly any application that requires full-text search, especially cross-platform. We caught up with Robert at Berlin Buzzwords to discuss his work, how people are using Lucene and what we can expect from Lucene in the future. Sadly, there was no talk of ghosts.


Tell us a little bit about yourself and your work.

My name is Robert Muir and I've been an Apache Lucene committer for five years now. I work for Elasticsearch; I'm a developer there and I mostly work on Lucene.

The talk you gave today was on the new features of Apache Lucene. Do you want to give us a brief overview of that?

Essentially Lucene has grown a lot since Lucene 4. It's more than just a core indexing library. We have features that people expect of search engines, driven by Google, like auto-suggest and highlighting and faceting. So in Lucene 4, we have all of this other stuff you need around search as part of the library. The idea of this talk is that you can go to the store and buy Lucene in Action and get a good description that's maybe three or four years out of date, but it won't tell you about all these cool features that you need to deal with. Like you need auto-suggest. All users expect it. So the talk was to give people an idea of how it works in Lucene 4. A sort of up-to-date overview.

So there have been a lot of talks about search engines this year; search seems to be the buzzword of this year's Berlin Buzzwords. What is it about Lucene that you think makes it stand out?

I think the first thing that people are attracted to when they use Lucene is that it's fast, much faster than you would expect. Maybe because it's Java code, they expect it to be much slower than it is. It's usually much faster than a database for a lot of the types of queries that users want to do these days. I think a part of dealing with lots of data is that you can't deal with it all at once. So search is more natural here, because you're just saying: I want to look at the most relevant stuff, because I can't look at all of it.

One of the things that stood out to me during the talk was how customizable Lucene is. How important is the customization when you are developing Lucene? Is that one of the main priorities?

Lucene always began as an API, which is different than say an Oracle database, where you have a server. Because of that, I think customization has always been a high priority. It’s built for just that. It’s built if you want to embed search somewhere to do something custom. If you want to have something more out-of-the-box, you can get Solr or Elasticsearch, which are the server version. We just make the customisable low-level engine and people use it in radically different ways for different purposes. So it’s definitely a huge priority.

Are there any particular use cases of Lucene that you find particularly interesting?

At Elasticsearch, we see a lot of people using it for log analysis. We see a lot of people doing stuff that’s more like analytics. And I think it’s really interesting because I just never thought about using Lucene for that, but it works pretty well and it solves a lot of real-world needs. I mean, I think we could probably make some improvements- we see these use cases and as developers, we haven’t tuned in for that or thought about it. So it’s cool for that reason.

Can you tell us a little bit more as well about your work with Elasticsearch?

I just started working there about a month or two ago, and basically I work on Lucene. The first thing we did is we worked on improving the reliability of Lucene. Lucene didn't have bugs, but we just didn't have features that you would expect a data store to have. And these features are things like adding detection of errors to improve reliability. And you've got systems like Solr and Elasticsearch taking Lucene indexes and sending them around on the network, so we need to detect when something goes wrong. So we added file checksumming, for example, to Lucene. That's one of the first things I did. I think we improved the robustness a lot just with that change. It's changes like that which make working on Lucene exciting.

What are you working on for the future of Lucene?

I can tell you what we’re working on right now, because we don’t really have a good idea of what’s coming- it’s open source, so it’s all up in the air. Currently I’m working on improving the way queries execute. And long-term, hopefully the way they work with positions to have more power, more flexibility and greater speed. So hopefully this is something we’ll fix this year.

Big data has gained a huge amount of momentum and hype over the past couple of years- where do you think this is headed?

There's more and more information and we're getting overloaded by it. I think search has an important role here as it allows you to sift through everything and find the needle in a haystack. As we're drowning in data, I think improving the quality, performance and usability of search is really important.



Elasticsearch is a real-time search server based on Lucene, with high availability and multi-tenancy. Together with Logstash and Kibana, it forms an end-to-end "ELK" stack that delivers actionable insights in real time from almost any type of structured or unstructured data source.


 

(Image credit: Apache Lucene)

Berlin Buzzwords is Back: Our Pick of the Events
Dataconomy (https://dataconomy.ru), May 23, 2014

Berlin Buzzwords, 'Germany's most exciting conference on storing, processing and searching large amounts of digital data', is back for a fifth year. The conference will take place on May 25-28 at Kulturbrauerei Berlin. It will feature a range of presentations on large-scale computing projects, ranging from beginner-friendly talks to in-depth technical presentations about various technologies. Here is our pick of some of Buzzwords' events:

Hitfox (and Dataconomy) meet Berlin Buzzwords
An obvious highlight of Berlin Buzzwords will be the meetup organised by us and held in our Dataconomy HQ.
“Peter Grosskopf, Chief Development Officer at HitFox Group, is going to welcome everyone and show the way we approach Big Data at HitFox. Thorsten Bleich, Chief Technology Officer at HitFox’s mobile targeting venture, Datamonk, will follow with his talk on how Datamonk provides targeting solutions in mobile real-time advertising by collecting, transforming, and feeding data into the mobile ecosystem with their own cutting-edge technology.”
See you there!

SHARK ATTACK on SQL-on-Hadoop
This talk gives a quick intro to Apache Spark and its SQL query engine, Shark. Additionally, Shark is compared to other SQL-on-Hadoop tools from the ecosystem, like Impala and Hive, including a live "usage demo".

Staying ahead of Users & Time – two use cases of scaling data with Elasticsearch
People typically choose Elasticsearch for its horizontal scaling capabilities and ease of use. We will architect two solutions that both scale well and do so in a way that still allows for change, whether in data, growth rates or resources.
The talk will be accessible both to people who know Elasticsearch (fairly) well and to those who have never used it. If you know Elasticsearch and have used it before, you will learn how to put some of its more advanced data management APIs to good use. If you have only heard of Elasticsearch (or even if you haven't), you will get an impression of why people choose it to index ever-growing amounts of data.

What's new in MongoDB 2.6
A short talk about what's new in our biggest release ever! We've changed up to 80% of our codebase and added major value to the database.

Modern Cassandra
Cassandra continues to be the weapon of choice for developers dealing with performance at scale. Whether in social networking (Instagram), scientific computing (SPring-8), or retail (eBay), Cassandra continues to deliver. This talk will look at new features in Cassandra 2.x and the upcoming 3.0, such as lightweight transactions, virtual nodes, a new data model and query language, and more.
