Google BigQuery – Dataconomy (https://dataconomy.ru)

400,000 GitHub repositories, 1 billion files, 14 terabytes of code: Tabs or spaces?
https://dataconomy.ru/2017/01/04/felipe-hoffa-spaces-or-tabs/ (Wed, 04 Jan 2017)

Using tabs or spaces when writing a new line of code has been one of the fiercest battles ever fought among coders. Because we don’t live in a perfect world where everybody indents and aligns according to the same standards, the debate is ultimately reduced to how source-code is displayed in editing software. Conflict arises because code doesn’t display nicely if there is no consistency throughout the file, which can get especially tricky when several coders are involved in the same project.

The debate has been around for so long that it has become polarizing enough that coders now proudly self-classify as "tab people" or "space people."

Tabs or spaces – A battle in BigQuery

Our long-time collaborator and Google developer advocate Felipe Hoffa set out to find an answer. Felipe's data came from GitHub files stored in BigQuery. He considered the top 400,000 repositories, sorted by the number of stars received on GitHub between January and May 2016. He only considered files with more than 10 lines of code, and excluded duplicates. In cases where files used both tabs and spaces, he gave the vote to whichever method was used more frequently.

Below is his original post on Medium, where he parses a billion files among 14 programming languages to decide which one is on top – tabs, or spaces.


 

400,000 GitHub repositories, 1 billion files, 14 terabytes of code: Tabs or spaces?

The rules:

  • Data source: GitHub files stored in BigQuery.
  • Stars matter: We’ll only consider the top 400,000 repositories — by number of stars they got on GitHub during the period Jan-May 2016.
  • No small files: Files need to have at least 10 lines that start with a space or a tab.
  • No duplicates: Duplicate files only have one vote, regardless of how many repos they live in.
  • One vote per file: Some files use a mix of spaces and tabs. Each file votes for whichever method it uses more.
  • Top languages: We’ll look into files with the extensions (.java, .h, .js, .c, .php, .html, .cs, .json, .py, .cpp, .xml, .rb, .cc, .go).
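The per-file voting rule above can be sketched as a standalone Python function. This is an illustrative re-implementation of the rules, not code from the post (Hoffa's actual analysis runs in BigQuery SQL):

```python
def classify_indentation(content, min_lines=10):
    """Vote 'space' or 'tab' for one file, mirroring the rules:
    only lines that start with a space or tab count, files with fewer
    than min_lines such lines abstain, and a mixed file votes for
    whichever character begins more of its lines (ties go to tab,
    matching the IF(...) in the SQL below)."""
    spaces = tabs = 0
    for line in content.split("\n"):
        if line.startswith(" "):
            spaces += 1
        elif line.startswith("\t"):
            tabs += 1
    if spaces + tabs < min_lines:
        return None  # too small to vote
    return "space" if spaces > tabs else "tab"
```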

Numbers

(Results chart from the original post.)

How-to

I used the existing [bigquery-public-data:github_repos.sample_files] table, which lists the files of the top 400,000 repositories. From there I extracted the contents of all files with the top languages' extensions:

SELECT a.id id, size, content, binary, copies, sample_repo_name , sample_path
FROM (
  SELECT id, FIRST(path) sample_path, FIRST(repo_name) sample_repo_name 
  FROM [bigquery-public-data:github_repos.sample_files] 
  WHERE REGEXP_EXTRACT(path, r'\.([^\.]*)$') IN ('java','h','js','c','php','html','cs','json','py','cpp','xml','rb','cc','go')
  GROUP BY id
) a
JOIN [bigquery-public-data:github_repos.contents] b
ON a.id=b.id
864.6s elapsed, 1.60 TB processed

That query took a relatively long time, since it involved joining a 190-million-row table with a 70-million-row one and scanning over 1.6 terabytes of contents. But don't worry about having to run it: I left the result publicly available at [fh-bigquery:github_extracts.contents_top_repos_top_langs].

In the [contents] table, each unique file is represented only once. To see the total number of files and bytes they represent:

SELECT SUM(copies) total_files, SUM(copies*size) total_size
FROM [fh-bigquery:github_extracts.contents_top_repos_top_langs]

1 billion files, 14 terabytes of code

Then it was time to run the ranking according to the previously established rules:

SELECT ext, tabs, spaces, countext, LOG((spaces+1)/(tabs+1)) lratio
FROM (
  SELECT REGEXP_EXTRACT(sample_path, r'\.([^\.]*)$') ext, 
         SUM(best='tab') tabs, SUM(best='space') spaces, 
         COUNT(*) countext
  FROM (
    SELECT sample_path, sample_repo_name, IF(SUM(line=' ')>SUM(line='\t'), 'space', 'tab') WITHIN RECORD best,
           COUNT(line) WITHIN RECORD c
    FROM (
      SELECT LEFT(SPLIT(content, '\n'), 1) line, sample_path, sample_repo_name 
      FROM [fh-bigquery:github_extracts.contents_top_repos_top_langs]
      HAVING REGEXP_MATCH(line, r'[ \t]')
    )
    HAVING c>10 # at least 10 lines that start with space or tab
  )
  GROUP BY ext
)
ORDER BY countext DESC
LIMIT 100
16.0s elapsed, 133 GB processed

Analyzing each line of 133 GBs of code in 16 seconds? That’s why I love BigQuery.
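For readers without BigQuery access, the aggregation step of that query (tallying per-file votes by extension and computing the smoothed log ratio) can be mirrored in plain Python. This is an illustrative sketch, not the author's code:

```python
import math
from collections import defaultdict

def rank_extensions(files):
    """Aggregate per-file votes into the per-extension ranking the
    query produces: tab count, space count, total files, and
    log((spaces+1)/(tabs+1)) as a smoothed tilt indicator
    (positive means spaces win, negative means tabs win)."""
    counts = defaultdict(lambda: {"tabs": 0, "spaces": 0, "countext": 0})
    for ext, best in files:  # one (extension, 'tab' or 'space') pair per file
        row = counts[ext]
        row["countext"] += 1
        row[best + "s"] += 1  # 'tab' -> 'tabs', 'space' -> 'spaces'
    return {
        ext: {**row, "lratio": math.log((row["spaces"] + 1) / (row["tabs"] + 1))}
        for ext, row in counts.items()
    }
```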


Space people, rejoice

According to the data, the winner is spaces. C is the only major language whose most popular files on GitHub favor tabs over spaces. This might be as close as we will ever get to discovering whether tabs or spaces are more popular.

The reason for the popularity of spaces might be that they display consistently across all hardware and text viewing software. However, it is still unclear whether spaces are objectively, qualitatively better than tabs.

 

Like this article? Subscribe to our weekly newsletter to never miss out!

Mapping the Future of Mankind Using Historical Datasets
https://dataconomy.ru/2014/09/10/mapping-the-future-of-mankind-using-historical-datasets/ (Wed, 10 Sep 2014)


(Conflict (red) and cooperation (green) in Nigeria Jan-March 2014; source)

For centuries, society has been captivated by the idea of the eternal return. From Indian philosophy and ancient Egyptian thought, from Pythagoras through to Nietzsche, the idea that history is cyclical and events will recur again and again infinitely has always captured the human imagination.

This idea was most recently put to the test by the team behind the Global Database of Events, Language, and Tone (GDELT), a quarter-billion-record database of events in human history. They wanted to see if they could find correlations between current world events and historical events, to map the possible outcomes of current political unrest. Their experiment has been likened to Asimov's Psychohistory, in which history, sociology, and mathematical statistics are combined to make general predictions about the future behavior of very large groups of people. Although the experiment worked astoundingly well, GDELT's founder Kalev Leetaru was keen to point out to Dataconomy that "to me the story here is not about forecasting, but rather about the 'big data' world having reached a point where we can now run an analysis of this magnitude with just a single line of code and get back the results in just a few minutes. We've reached a point where we can truly begin to interactively explore our world at scale."

Leetaru took some time out whilst working on this project to discuss his work with Dataconomy: what such a database means for our understanding of history, sociology, and the capabilities of technology.


Talk us through the genesis of GDELT.

I did a very famous paper in 2011 called Culturomics 2.0, where I used the tone- the sentiment of global media coverage- to show that you can build models that allow you to forecast whole-country collapse. There was a huge amount of potential in that project, but the problem was that the only thing I could look at was whole-country collapse. There simply weren't datasets, from today or from that time period, that gave you riots, protests and military attacks down to the sub-national level, over the entire world, going back over time.

Now we have a quarter of a billion records in there, which is about 400 gigabytes on disc. You might say that's nothing compared to some of the massive datasets being discussed today- Facebook talks about 500 terabytes of new data added to their servers every day. But in these instances, you're really accessing the data through a key store- it's simply fetching an object- basically, it's a filing system. Whereas with GDELT, what you have is a fully relational dataset where all of these fields are interrelated, and it's really the number of columns times the number of rows that makes it so complex to work with. Before BigQuery, there was no real way of interacting with this, because traditional database systems called for indexing, and even a lot of these newer, more powerful database platforms still require you to basically give them advice as to how people are going to interact with the data.

What we are finding is that essentially almost every query is different from the one before. BigQuery really allows you to set aside the indexing, and that has transformed our capabilities. It allows us to look at this dataset holistically.

In terms of use, the holy grail from the beginning was the ability to take something- say, data from the last three months in Ukraine- and find the most similar periods in time from any country: the periods most similar to what Ukraine is going through right now. You pick the five most similar periods from any country, look at what happened after each of those situations, and see if that gives us any insight into what might happen in Ukraine in the future. We're digging into this now, and the results are absolutely fascinating- the combinations and correlations and connections are astounding.
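GDELT hasn't published the similarity algorithm Leetaru describes, but one minimal way to operationalize "find the five most similar periods" is to represent each country-period as a vector of event-category counts and rank by cosine similarity. The sketch below assumes that vector representation; the labels and data are invented:

```python
import math

def cosine(a, b):
    """Cosine similarity between two event-count vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(x * x for x in b))
    return dot / (na * nb) if na and nb else 0.0

def most_similar_periods(query_vec, history, k=5):
    """Rank historical (label, vector) periods by similarity to the
    query period, e.g. ('Egypt 2011-Q1', [riots, protests, ...])."""
    ranked = sorted(history, key=lambda lv: cosine(query_vec, lv[1]), reverse=True)
    return [label for label, _ in ranked[:k]]
```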

I wouldn't go to that length myself, but many people liken the GDELT project to Isaac Asimov's Psychohistory from the Foundation series. And I will agree that there are certain parallels, in that psychohistory was this notion of gathering up all the world's open public datasets- not the datasets that the NSA is dealing with, but public information, like news media- and then processing them by computer to understand these big, broad patterns in society. And that's where we're up to at this second.

GDELT is a database of events, but also language and tone. Could you delve a little deeper into how it records these elements?

GDELT really is two different parallel data sets. One obviously is the event database, which is a quarter of a billion records in over 300 categories that are physical activities- from riots and protests to peace pledges and diplomatic exchanges.

But the missing link, where we are today, is forecasting. When you think about Egypt, it wasn’t that people were happy and then one day everyone woke up and decided they should protest. What you are looking for are essentially these deeper, latent dimensions to language, so things like the semantic and emotional undercurrents.

I did a study maybe a month or two ago where I looked at global media coverage of Assad, the president of Syria. When I looked at how negative or positive it was, I found that he was actually in free fall towards negativity- the world was darkening on him prior to the Qusair attack. When the US failed to react to that, the entire world essentially said, "Wow, if he can do that- if he can kill 1000 people with no reaction- he's won now." They weren't positive about him, but what we saw was what's called military-superiority language: language that suggested he was now invulnerable, and that the rebels were now going to lose. And sure enough, you saw the news coverage reflect that. So that's actually what I'm most interested in- how do we measure all of this?


(Rwanda 1979-2014; source)
How do you think sentiment and tone analysis will impact the future?

That is a very interesting and nuanced dimension. Let's think about Iran, when Rouhani was elected president. CNN set up a TV crew in the middle of a square in Tehran and asked passers-by, "What do you think of the new president?" You saw these two interesting-looking characters just off screen, sitting there just close enough to listen to everything that was said. Obviously they were his security people, so when people were interviewed on TV, they weren't giving their real opinion. They all realised that if someone stares at the camera and says "I think he's an idiot," that's not going to be very good for their health.

But, and this is the interesting part, there are two dimensions to tone. There is explicit tone, when someone says "I love my iPhone" or "I hate eating vegetables". There are also whole undercurrents of semantic tone embedded in what people say. Take, for example, someone tweeting "It is a beautiful day outside, I am doing the laundry." Oftentimes that would be discarded as a noise tweet, but it actually tells us quite a lot about the person's situation, because they're not posturing; they're not even aware of the emotions they are expressing. This tweet suggests security and safety. If someone has just lost their job, has a huge mortgage and doesn't know where they are going to get food tomorrow, they probably wouldn't be cheerfully tweeting about their laundry. So the ability to capture all of these dimensions gives us a much broader picture.

GDELT is available to download. Have you had any particularly interesting public use cases yet, or are there any particular applications you’d like to see in the future?

There's a ton of things being done with it right now, and it's being used by many NGOs around the world. I've heard of some successful applications, but they're not being made public as much as I would like. One thing I'm working on this year is building tools to help organisations leverage this dataset. But it's already becoming very widely used- the first week it was available on the cloud server, it had over 30,000 downloads.

There's some phenomenal interest in it, and I think, as with any dataset, it takes a while before it starts growing and expanding out there. It obviously has huge implications for forecasting, and so the moment someone comes up with a good forecasting algorithm that works, they're likely to put it out there. But this is the general aim for GDELT: an open platform for computing on the entire world.


3 Lessons on the Notability Gender Gap in Freebase
https://dataconomy.ru/2014/06/02/3-lessons-notability-gender-gap/ (Mon, 02 Jun 2014)

Whether you like it or not, the gender gap is still a pressing social issue. We live in an age moving towards gender equality, but one in which it is still not realised. Many say that the need for feminism has passed, and that gender equality is a reality in our society. Yet there are still considerably fewer women in top-tier jobs, and they're still paid less than their male counterparts. In their talk at Berlin Buzzwords this week, Ewa Gasperowicz and Felipe Hoffa explored the gender gap using Freebase and Google BigQuery.

Freebase is an open database of structured data, housing over 1 billion facts about 42.9 million entities. It contains 2.4 billion 'triples' about these entities: facts composed of subject-predicate-object (for example, Daft Punk - appears in - Tron). Anyone can add data to Freebase, provided it's less than 1% duplicated or conflated with existing data. As a dataset, Freebase is obviously massively skewed towards celebrities, but analysing it gives us a good picture of the men and women at the forefront of public consciousness. Gasperowicz and Hoffa used Google BigQuery to analyse and explore this 88GB dataset. Here are some of their findings.
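To make the triple structure concrete, here is a toy in-memory triple store queried the way the talk effectively queried Freebase for gender. The predicate path `/people/person/gender` is modeled on Freebase's schema naming, and the data is invented:

```python
def count_genders(triples):
    """Tally the objects of gender triples in a list of
    (subject, predicate, object) tuples, ignoring other facts."""
    counts = {}
    for subj, pred, obj in triples:
        if pred == "/people/person/gender":
            counts[obj] = counts.get(obj, 0) + 1
    return counts
```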

1. The Gender Gap is Immediately Noticeable

(Chart: Exploring Notability Gender - Gender Count)

Here's the gender breakdown of the people notable enough to be included in Freebase:

  • Male: 1,521,700
  • Female: 511,361
  • Other: 230

So there are almost three times as many notable males as notable females. Right from the off, we can see a gender disparity, one that widens when we start looking into specific professions and locations.

Interestingly, they also showed the most searched-for women on Wikipedia in real time. The top three most searched for women were:
1. Marine Le Pen
2. Kim Kardashian
3. Shakira

Le Pen's presence can be attributed to the European elections held that day. Other than that, we have a pop star and a woman famous for being famous. I'll leave it up to you to decide whether these are the ideal female role models to be dominating public consciousness.

2. The Gender Gap Widens in Certain Professions

Gasperowicz & Hoffa used Google Maps to visualise the male:female ratio in professions around the world. Red indicates professions dominated by females, blue those dominated by males, and purple indicates balance; the intensity of the colour reflects the amount of data available. To demonstrate how profound the gender gap can be between professions, consider the visualisation for models:

(Map: Notability Gender Gap in Freebase - Models)
Compared to the visualisation for politicians:

(Map: Notability Gender Gap in Freebase - Politicians)
And the map for business people presents a gender skew too:

(Map: Notability Gender Gap in Freebase - Businesspeople)
Out of the 117 German notable business people listed on Freebase, only 8 are female. There is greater equality in some creative professions such as writer, author and novelist, but the map for every media production profession (director, producer, screenwriter…) is almost entirely blue. Sports as well are almost unanimously dominated by males. You can see the stats for yourself here.
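The talk didn't publish the code behind its red-purple-blue maps, but the encoding described above can be sketched as a simple colour-blending function. The cap of 50 profiles for full intensity is an illustrative assumption, not a value from the talk:

```python
def ratio_to_rgb(female, male):
    """Blend red (all-female) to blue (all-male) through purple,
    scaling intensity by how much data backs the point. Returns an
    (r, g, b) tuple in 0-255; the 50-profile saturation cap is an
    assumed parameter."""
    total = female + male
    if total == 0:
        return (0, 0, 0)
    intensity = min(total / 50, 1.0)  # more data -> stronger colour
    r = round(255 * (female / total) * intensity)
    b = round(255 * (male / total) * intensity)
    return (r, 0, b)
```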

3. The Gender Gap Grows- and Shrinks- at Certain Ages

(Chart: Exploring Notability Gender Gap - Age Distribution)

Gasperowicz & Hoffa also explored the ages of the notable males and females of Freebase. There was an obvious rift between 20-year-old males and 20-year-old females. This was attributed to the dominance of athletes in this age category, particularly in college sports in the US.

(Chart: Exploring Notability Gender Gap - Age Distribution, Females 20-40)

However, looking forward, there is some good news: the gender gap decreases with age. They found significantly more notable females in most professions at age 40 than at age 20. It will be fascinating to see how the data changes when the current 20-year-olds hit 40 themselves.

An audience member asked if they had a map for data scientists. A cursory glance around the conference would suggest that map would be fairly blue. Hoffa replied that they did not, but that Google was committed to giving opportunities to female data scientists. Let's hope that more companies gain an awareness of the gender gap, and that in years to come all of these maps start looking less blue and more purple.

View the presentation slides here.



Eileen McNulty-Holmes – Editor


Eileen has five years’ experience in journalism and editing for a range of online publications. She has a degree in English Literature from the University of Exeter, and is particularly interested in big data’s application in humanities. She is a native of Shropshire, United Kingdom.

Email: eileen@dataconomy.ru


Interested in more content like this? Sign up to our newsletter, and you won't miss a thing!


 
