Databricks valued at $62B after massive $10B funding round
https://dataconomy.ru/2024/12/18/databricks-valued-at-62b-after-massive-10b-funding-round/ (December 18, 2024)

Databricks has raised $10 billion in its latest funding round, raising its valuation to $62 billion. This significant investment comes from major venture capital firms such as Thrive Capital, Andreessen Horowitz, Insight Partners, and Iconiq Growth. The funding aims to bolster Databricks’ capacity to compete for top AI talent and finance stock options for employees.

Databricks secures massive funding for growth

The Series J funding round was noted as substantially oversubscribed, with Databricks having already raised $8.6 billion towards its $10 billion target. Ali Ghodsi, the CEO of Databricks, emphasized the competitive landscape for AI talent, remarking, “The talent war for AI is like no other time before.” This funding will help alleviate pressure on employees who face substantial tax liabilities as a result of stock unit vesting. Vince Hankes of Thrive Capital highlighted that a significant portion of the funds will enable employees to cash out their options and manage related taxes, comparing the deal to Stripe’s previous large funding round.

As a member of Silicon Valley’s elite, Thrive Capital invested “at least” $1 billion and plans to support Databricks in building what they envision as the next $1 trillion infrastructure company. The remaining funds will be allocated for the development of new AI products, acquisitions, and expansion into international markets.

Databricks has experienced remarkable growth recently, anticipating annualized revenue to hit $3 billion by the end of next month. This figure reflects a year-over-year revenue increase of over 60% in the most recent quarter. The company’s revenue growth has contributed to a 44% hike in its valuation, up from $43 billion in September 2023. Ghodsi noted that the company is not in a hurry to go public, stating that “the absolute earliest we would go public is next year, but we have flexibility now.”

The funding attracted participation from other prominent investors, including Singapore’s GIC and early investors in Twitter and Facebook, such as Yuri Milner’s DST Global. New participants like MGX, a UAE fund focused on AI, further solidified the investment’s global appeal. The deepening investment landscape places Databricks amidst a growing cohort of AI and tech companies vying for significant market share and talent.

For context, Databricks has successfully positioned itself as a significant player among giants such as OpenAI and SpaceX, currently trailing only a few US firms in valuation. Its revenue trajectory also positions it favorably against its largest competitor, Snowflake, which has a market capitalization of $57 billion.

Meet DBRX, a new open-source LLM that could make you stop using ChatGPT
https://dataconomy.ru/2024/03/28/databricks-dbrx-ai-llm/ (March 28, 2024)

Databricks presents DBRX, an innovative open-source language model poised to revolutionize language understanding. Built on an advanced architecture, DBRX shows remarkable progress in tasks like coding and solving math problems, outperforming some models you might have assumed were the best on the LLM market!

But what sets DBRX apart from the rest? Let’s delve deeper into its development process and explore the exciting capabilities it offers.

What is DBRX?

DBRX is an open-source large language model (LLM) developed by Databricks, aiming to offer a competitive alternative in the rapidly evolving landscape of artificial intelligence. Built upon a fine-grained mixture-of-experts (MoE) architecture, DBRX demonstrates notable advancements in language understanding, particularly in programming and mathematical reasoning tasks. Notably, it outperforms some established models like GPT-3.5 and competes reasonably with closed models such as Gemini 1.0 Pro.
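
For intuition about what “fine-grained mixture-of-experts” means: instead of one monolithic feed-forward block, the model keeps many small expert networks and routes each token to only a few of them (Databricks describes DBRX as using 16 experts with 4 active per token). The numpy sketch below illustrates only the general top-k routing idea; the dimensions are made up and this is not DBRX’s actual code:

```python
# Illustrative top-k mixture-of-experts routing for a single token.
# All sizes are invented for readability; this is not DBRX's configuration.
import numpy as np

rng = np.random.default_rng(0)
d_model, n_experts, top_k = 8, 16, 4

x = rng.standard_normal(d_model)                    # one token's hidden state
router = rng.standard_normal((d_model, n_experts))  # learned routing weights

logits = x @ router                      # one score per expert
chosen = np.argsort(logits)[-top_k:]     # indices of the top-k experts
weights = np.exp(logits[chosen])
weights /= weights.sum()                 # softmax over the chosen experts only

# Each "expert" here is just a random matrix standing in for a small MLP.
experts = rng.standard_normal((n_experts, d_model, d_model))
output = sum(w * (experts[e] @ x) for w, e in zip(weights, chosen))
print(output.shape)  # (8,) -- same shape as the input token
```

Only 4 of the 16 experts do any work per token, which is how an MoE model can carry a very large total parameter count while keeping per-token compute modest.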

DBRX vs popular LLMs (Image credit)

DBRX was developed through an intensive three-month process, building on months of prior research and experimentation. Training ran on a powerful infrastructure of 3,072 NVIDIA H100 GPUs connected by 3.2 Tbps InfiniBand. The team leveraged Databricks’ suite of tools: Unity Catalog for data governance, Lilac AI for data exploration, Apache Spark and Databricks notebooks for data processing, and optimized training libraries such as MegaBlocks and LLM Foundry. DBRX was trained and fine-tuned across thousands of GPUs using the Mosaic AI Training service, with results logged in MLflow and human feedback for quality improvement collected through Mosaic AI Model Serving and Inference Tables.

So, what can DBRX do? Quite a lot: it can answer questions, write code, solve math problems, and analyze data. It can also help with writing by correcting grammar and suggesting improvements, understand different languages, and even analyze sentiment in text. Think of it as a smart assistant for a wide range of language and data tasks, and one that users can customize for specific needs. DBRX might surprise you with its performance compared to other popular LLMs.

Exploring DBRX’s versatile capabilities (Image credit)

Accessible to developers and enterprises, DBRX provides both the base model and fine-tuned versions under an open license, encouraging collaborative exploration and innovation. Its emphasis on efficiency in both training and inference, alongside its manageable size, makes it a potentially cost-effective solution for various AI applications.

How to use DBRX

Using DBRX is made accessible through various means provided by Databricks:

  • Foundation Model APIs: Databricks offers Foundation Model APIs, which allow users to interact with DBRX through a simple interface. Users can leverage these APIs to integrate DBRX into their applications and workflows (a rough sketch of such a call follows this list).
  • AI Playground Chat Interface: For quick experimentation and testing, users can access DBRX through the AI Playground chat interface. This interface provides a user-friendly environment for interacting with the model and exploring its capabilities.
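
As a concrete illustration of the first option: Databricks’ Foundation Model APIs expose an OpenAI-compatible interface, so a DBRX chat call can be sketched roughly as below. Treat the workspace URL, token, and the databricks-dbrx-instruct endpoint name as assumptions to verify against Databricks’ documentation.

```python
# Hedged sketch: calling DBRX through Databricks Foundation Model APIs,
# which expose an OpenAI-compatible interface. The workspace URL, token,
# and endpoint name are placeholders, not verified values.
from openai import OpenAI

client = OpenAI(
    api_key="<DATABRICKS_TOKEN>",
    base_url="https://<your-workspace>.cloud.databricks.com/serving-endpoints",
)

response = client.chat.completions.create(
    model="databricks-dbrx-instruct",  # assumed endpoint name for DBRX
    messages=[{"role": "user", "content": "Explain mixture-of-experts in two sentences."}],
    max_tokens=200,
)
print(response.choices[0].message.content)
```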

Overall, by leveraging the tools and resources provided by Databricks, users can easily incorporate DBRX into their workflows and harness its capabilities for a wide range of applications in natural language processing and AI.


Featured image credit: Databricks

Driving Value with Data Science
https://dataconomy.ru/2016/10/19/driving-value-data-science/ (October 19, 2016)

Fighting fraud, reducing customer churn, improving the bottom line – these are just a few of the promises of data science. Today, we have more data to work with than ever before, thanks to new data-generating technologies like smart meters, vehicle telemetry, RFID, and intelligent sensors.

But with all that data, are we driving equivalent value? Many data scientists say they spend most of their time as “data janitors” combining data from many sources, dealing with complex formats, and cleaning up dirty data.  

Data scientists also say they spend a lot of time serving as “plumbers” – handling DevOps and managing the analytics infrastructure. Time devoted to data wrangling and DevOps is a dead loss; it reduces the amount of time data scientists can spend delivering real value to clients.

The Challenge for Data Scientists

Data scientists face four key challenges today:

Small data tools. Data analytics software introduced before 2012 runs on single machines only; this includes most commercial software for analytics as well as open source R and Python. When the volume of data exceeds the capacity of the computer, runtime performance degrades or jobs fail. Data scientists working with these tools must invest time in workarounds, such as sampling, filtering or aggregating. In addition to taking time, these techniques reduce the amount of data available for analysis, which affects quality.

Complex and diverse data sources. Organizations use a wide variety of data management platforms to manage the flood of Big Data, including relational databases; Hadoop; NoSQL data stores; cloud storage; and many others. These platforms are often “siloed” from one another. The data in those platforms can be structured, semi-structured and unstructured; static and streaming; cleansed and uncleansed. Legacy analytic software is not designed to handle complex data; the user must use other tools, such as Hive or Pig, or write custom code.

Single-threaded software. Legacy software scales up, not out. If you want more computing power, you’ll have to buy a bigger machine. In addition to limiting the amount of data you can analyze, it also means that tasks run serially, one after the other. For a complex task, that can take days or even weeks.

Complex infrastructure. Jeff Magnusson, Director of Algorithms Platform at online retailer Stitch Fix, notes that data science teams typically include groups of engineers who spend most of their time keeping the infrastructure running. Data science teams often manage their own platforms because clients have urgent needs, the technology is increasingly sophisticated, and corporate IT budgets are lean.

What Data Scientists Need

It doesn’t make sense to hire highly paid employees with skills in advanced analytics, then put them to work cleaning up data and managing clusters. Visionary data scientists seek tools and platforms that are scalable; interoperable with Big Data platforms; distributed; and elastic.

Scalability. Some academics question the value of working with large datasets. For data scientists, however, the question is moot; you can’t escape using large datasets even if you agree with the academics. Why? Because the data you need for your analysis comes from a growing universe of data; and, if you build a predictive model, your organization will need to score large volumes of data. You don’t have a choice; large datasets are a fact of life, and your tools must reflect this reality.

Integrated with Big Data platforms. As a data scientist, you may have little or no control over the structure of the data you need to analyze or the platforms your organization uses to manage data. Instead, you must be able to work with data regardless of its location or condition. You may not even know where the data resides until you need it. Thus, your data science software must be able to work natively with the widest possible selection of data platforms, sources, and formats.

Distributed. When you work with large data sets, you need software that scales out and distributes the workload over many machines. But that is not the only reason to choose a scale-out or distributed architecture; you can divide many complex data science operations into smaller tasks and run them in parallel. Examples include:

  • Preprocessing operations, such as data cleansing and feature extraction
  • Predictive model tuning experiments
  • Iterations in Monte Carlo simulation
  • Store-level forecasts for a retailer with thousands of stores
  • Model scoring

In each case, running the analysis sequentially on a single machine can take days or even weeks. Spreading the work over many machines running in parallel radically reduces runtime.
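
As a minimal sketch of the Monte Carlo item from the list above, the PySpark snippet below scatters independent simulation tasks across a cluster and aggregates the results. The sample and partition counts are arbitrary illustration values:

```python
# Minimal PySpark sketch: a Monte Carlo simulation (estimating pi) split
# into many small tasks that run in parallel across the cluster.
import random
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("monte-carlo-pi").getOrCreate()
sc = spark.sparkContext

n = 10_000_000  # total random samples, spread over the cluster

def inside(_):
    x, y = random.random(), random.random()
    return 1 if x * x + y * y <= 1.0 else 0

count = sc.parallelize(range(n), numSlices=100).map(inside).reduce(lambda a, b: a + b)
print(f"pi is roughly {4.0 * count / n}")
spark.stop()
```

The same map-and-aggregate shape applies to model scoring and per-store forecasts: partition the inputs, apply the function everywhere at once, and collect or write out the results.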

Elastic. Data science workloads are like the stock market – they fluctuate. Today, you may need a hundred machines to train a deep learning model; tomorrow, you don’t need those servers. Last week, your team had ten projects; this week, the workload is light. If you provision enough servers to support your largest project, those machines will sit idle most of the time.

Data science platforms must be all of these things, and they must be easy to manage, so the team spends less time managing infrastructure and more time delivering value.

The Ideal Modern Data Science Platform

To reduce the amount of time you and your team spend “wrangling” data, standardize your analysis on a modern data science platform built on open source Apache Spark. Apache Spark is a powerful computing engine for high-performance advanced analytics. It supports the complete data science workflow: SQL processing, streaming analytics, machine learning and graph analytics. Spark offers APIs for the most popular open source languages among data scientists, including Python, R, Java and Scala.

Apache Spark’s native data platform interfaces, flexible development tools, and high-performance processing make it an ideal tool for integrating data from complex and diverse sources. Spark works with traditional relational databases and with data stores in the Hadoop ecosystem, including HDFS files and standard storage formats (CSV, Parquet, Avro, RC, ORC and Sequence files). It works with NoSQL data stores like HBase, Cassandra, MongoDB, SequoiaDB, Cloudant, Couchbase and Redis; cloud storage such as S3; mainframe files; and many others. With Spark Streaming, data scientists can subscribe to streaming data sources such as Kafka, Camel, RabbitMQ, and JMS.
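
As a rough sketch of that breadth, the snippet below reads from several of the sources named above in one Spark session. All paths, hosts, and credentials are placeholders, and the JDBC driver and Kafka connector packages are assumed to be installed on the cluster:

```python
# Sketch: one Spark session pulling from several of the sources named above.
# Every path, URL, and credential is a placeholder.
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("diverse-sources").getOrCreate()

# Flat files on HDFS and columnar files on S3.
events = spark.read.option("header", "true").csv("hdfs:///data/events.csv")
orders = spark.read.parquet("s3a://my-bucket/warehouse/orders/")

# A relational database over JDBC.
customers = (spark.read.format("jdbc")
             .option("url", "jdbc:mysql://db-host:3306/sales")
             .option("dbtable", "customers")
             .option("user", "analyst")
             .option("password", "<password>")
             .load())

# A streaming subscription to a Kafka topic.
clicks = (spark.readStream.format("kafka")
          .option("kafka.bootstrap.servers", "kafka-host:9092")
          .option("subscribe", "clickstream")
          .load())
```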

Spark algorithms run in a distributed framework so that they can scale out to arbitrarily large quantities of data. Moreover, data scientists can use Spark to run operations in parallel for radically faster execution. The distributed tasks aren’t limited to native Spark capabilities; Spark’s parallelism also benefits other packages, such as R, Python or TensorFlow.
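
For example, here is a hedged sketch of using Spark purely as a task scheduler for another package: each task trains and scores one scikit-learn model from a tuning grid. The dataset and grid are invented, and scikit-learn is assumed to be installed on every executor:

```python
# Sketch: fanning out scikit-learn model-tuning runs over Spark executors.
# The data and parameter grid are illustrative only.
from pyspark.sql import SparkSession
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score

spark = SparkSession.builder.appName("parallel-tuning").getOrCreate()
sc = spark.sparkContext

X, y = make_classification(n_samples=2000, n_features=20, random_state=0)
bX, by = sc.broadcast(X), sc.broadcast(y)  # ship the data to every executor once

def evaluate(n_trees):
    model = RandomForestClassifier(n_estimators=n_trees, random_state=0)
    return n_trees, cross_val_score(model, bX.value, by.value, cv=3).mean()

grid = [10, 50, 100, 200, 400]
results = sc.parallelize(grid, numSlices=len(grid)).map(evaluate).collect()
print(max(results, key=lambda r: r[1]))  # best (n_trees, accuracy) pair
spark.stop()
```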

For elastic provisioning and low maintenance, choose a cloud-based fully-managed Spark service. Several vendors offer managed services for Spark, but the offerings are not all the same. Look for three things:

  • Depth of Spark experience. Several providers have jumped into the market as Spark’s popularity has soared. A vendor with strong Spark experience has the skills needed to support your team’s work.
  • Self-service provisioning. An elastic computing environment isn’t much good if it’s too hard to expand or contract your cluster, if you have to spend valuable time managing the environment, or if you need to call an administrator every time you want to make a change. Your provider should offer self-service tools to provision and manage the environment.
  • Collaborative development environment. Most data scientists use development tools or notebooks to work with Spark. Your data platform provider should offer a development environment that supports collaboration between data scientists and business users, and interfaces natively with Spark.

Apache Spark provides scalability, integration with Big Data platforms and a distributed architecture; that means data scientists spend less time wrangling data. A cloud-based managed service for Spark contributes elasticity and zero maintenance, so data scientists spend less time on DevOps – and more time fighting fraud, reducing customer churn, and driving business value with data science.

 


Image: Tyler Merbler

Lightning Fast and Enterprise-Class: Datastax Enterprise 4.5
https://dataconomy.ru/2014/07/15/lightning-fast-enterprise-class-datastax-enterprise-4-5/ (July 15, 2014)

Datastax, the leading enterprise Cassandra provider, recently unveiled Datastax Enterprise 4.5. DSE 4.5 focuses on making it easier than ever to develop and deploy, as well as on increased performance, supported by integration with Apache Spark and a partnership with Databricks. Dataconomy recently spoke to Robin Schumacher, Datastax’s VP of Products, about the latest developments, and about his response to Cassandra naysayers.


Tell us more about the integrations within Datastax Enterprise 4.5.
To set the context for you a little bit, the next set of releases that you’ll see from Datastax are really performance focused. We’ve concentrated a lot in the past couple of years on building up Datastax Enterprise to be an enterprise-class NoSQL database platform. We’ve added a lot of things to make that happen; now that that’s been accomplished, we’re really focusing on performance. So you’ll be seeing performance enhancements in open-source Cassandra and in Datastax Enterprise, including this release. Back in February, when we announced Datastax Enterprise 4, we brought out the first version of our in-memory database for transactional workloads. Now, in this release, what we’re focused on is analytics performance.

One of the things that we’re using to make that happen is integration with Apache Spark. Spark is really about being able to run analytics across a shared-nothing architecture, and that can happen in Hadoop with HDFS. But the good news is it also enables us to run the same type of analytics on Cassandra, so we have a formal partnership with Databricks, which is the company behind Apache Spark. So we will be delivering a near-real-time analytics capability that uses Spark, and this allows us to have both in-memory and disk-based styles of running analytics on Cassandra data. Really, the added benefit to the customers is much faster response times for analytic queries on Cassandra than they’ve been able to have in the past.

The other thing readers might be interested in knowing is that because Spark has an in-memory component, it can be married to our in-memory transactional option in Cassandra. So, for the use cases it applies to, you can now have a full in-memory solution inside Datastax Enterprise for transactional and analytic workloads. Keeping things in memory makes read operations and analytic operations very, very fast.
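
For readers who want a concrete picture of “Spark on Cassandra,” here is a rough sketch using the open-source spark-cassandra-connector rather than DSE-specific tooling. The host, keyspace, and table names are invented, and the connector package is assumed to be on the Spark classpath:

```python
# Sketch: reading a Cassandra table into Spark for analytics via the
# open-source spark-cassandra-connector. Names and hosts are invented.
from pyspark.sql import SparkSession

spark = (SparkSession.builder
         .appName("spark-on-cassandra")
         .config("spark.cassandra.connection.host", "cassandra-host")
         .getOrCreate())

events = (spark.read.format("org.apache.spark.sql.cassandra")
          .options(table="user_events", keyspace="analytics")
          .load())

# An aggregate query that would be awkward in CQL alone: top pages by views.
events.groupBy("page_id").count().orderBy("count", ascending=False).show(10)
```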

What is exclusive to your enterprise solution?
One of the main things DSE contains over and above open source is enterprise manageability, through things like our automated management services and OpsCenter, which make things push-button easy. The second is enabling developers: giving them the drivers, the tools, and the utilities they need to create their applications as fast as possible. And then lastly, being able to satisfy our key use cases, things like fraud detection, Internet of Things, messaging, and recommendation engines; these are the things I’m really focused on on the commercial product side.

Can you tell us more about analytics capabilities on Datastax Enterprise 4.5?
With 4.5, we’re actually bringing out two new analytics options for our customers. The first is the Spark integration, and the second is integration with external Hadoop deployments, external Hadoop data warehouses. What we want to do here is better enable our customers to link the hot operational data they have in Datastax Enterprise and Cassandra with the historic information they keep in Hadoop. 4.5 makes this very easy: we can connect both platforms together, query data on both platforms at the same time, and return that data to the customer, who can either keep it on our platform or ship it off to their external Hadoop platform.

When we spoke to Jonathan [Ellis, Datastax co-founder], he said Cassandra 2.1 might be production-ready by the end of June. Is there a set release date yet?
I believe the new target is for around the middle of July, where 2.1 is concerned.

When the initial announcement was made about the Datastax and Databricks partnership, the COO of ScaleOut Software said that ‘Spark doesn’t handle real time state changes to individual data items in memory. It can only stream data and change the whole data set,’ and that ‘Cassandra has similar limitations because you can’t update data on Cassandra, all you can do is delete it and create a new copy.’ What do you make of these comments?
The latter has to do with how Cassandra writes data, and Cassandra is probably the most efficient platform for writing data that you’re going to find, because of how it writes data. The matter he is describing doesn’t really impact what customers experience, because, again, the data is written very, very quickly; it is done behind the scenes, asynchronously, in a very fast and efficient manner. It’s one of the reasons why Cassandra is used in so many Internet of Things applications and other write-intensive environments. Our customers don’t complain about that at all.

And as for the former, I think really the only thing you’re looking at is the end result that customers experience. Comparing the batch analytics we’ve offered in the past with this new Spark integration, we’ve seen, even in some of the smallest cases, a 50% increase in performance, with some workloads getting up to a 30x speed-up. We’ve got some Hive queries, as an example, that took five minutes to run on the prior platform and now take one second to complete. I think that’s all the customer really cares about in the end.

We’re very proud of how Cassandra writes data; it’s completely durable, your data is completely safe, and it’s written faster than literally any other relational engine.

A quick company update is that we’re up to about 300 employees and we’re still growing like crazy, and we just passed our 500 customer mark.


Datastax is the leading enterprise solution focused around Apache Cassandra, an Apache Foundation open-source project. Apache Cassandra is a NoSQL database technology featuring scalability, always-on availability and fault tolerance. The Datastax Enterprise solution offers Cassandra with added security, search, analytics and management features.


(Image credit: Silicon Angle)



 

Databricks Raises $33 Million and Introduces Cloud Platform for Processing Big Data
https://dataconomy.ru/2014/07/01/databricks-raises-33-million-and-introduces-cloud-platform-for-processing-big-data/ (July 1, 2014)

Databricks, a startup that builds software around the popular open-source project Apache Spark, announced on Monday at this year’s Spark Summit in San Francisco that it has raised $33 million in Series B funding. The announcement also included the launch of a new cloud computing service on Amazon Web Services, which aims to reduce the time and cost of setting up and analysing big data by simplifying the data pipeline in the cloud. New Enterprise Associates led the funding round with participation from existing investor Andreessen Horowitz.

“Getting the full value out of their Big Data investments is still very difficult for organizations,” said Databricks’ CEO, Ion Stoica. “Clusters are difficult to set up and manage, and extracting value from your data requires you to integrate a hodgepodge of disparate tools, which are themselves hard to use. Our vision at Databricks is to dramatically simplify big data processing and free users to focus on turning data into value.”

Although Spark is currently deployable on AWS, the announcement of Databricks Cloud means that Spark will be a managed service that will be supported directly by Databricks. Data will be stored in AWS by default, but can also be stored in HDFS if customers already have Hadoop clusters running in AWS. Moreover, as GigaOM reports, “Databricks cloud can read data from, and export data to, MongoDB, MySQL and Amazon Redshift.”

While Databricks Cloud will run exclusively on AWS for now, Stoica said the company has plans to explore other options like Google Compute Cloud and Microsoft Azure. Databricks Cloud is currently in beta but is expected to be ready on Amazon by this fall, starting at “a couple of hundred dollars” per user, per month.

Databricks is the company behind Spark, currently the most active open-source project in big data, and its recent funding has brought the total investment in the company to $47 million.



(Image Credit: Flickr)

 

Spark Meets Cassandra: Databricks and Datastax Announce Partnership
https://dataconomy.ru/2014/06/23/spark-meets-cassandra-databricks-datastax-announce-partnership/ (June 23, 2014)

Databricks and Datastax have announced a partnership centring around a supported open-source integration of their Cassandra and Spark products respectively. This follows announcements of Databricks integrations with Cloudera, MapR and Alpine Data Labs.

Kelvin Chu, compute and data team lead at Ooyala, a video analytics platform that built its own integration between Spark and Cassandra, had this to say about the new partnership: “With Cassandra as the data store and Spark for data crunching, these new analytic capabilities are making the processing of large data volumes a breeze. Spark on Cassandra is giving us the power to act on things in real time, which means faster decisions and faster results.”

Cassandra will be integrated with the Spark Core Engine, meaning it can take advantage of all types of analysis on the framework. The benefit will be in-memory analytics for things like personalisation & recommendation, fraud detection and sensor-data analysis. “Analytics on real-time data is important because people want to look at what the customer is looking at or buying right now, do a quick analysis against historical or location data, and offer them something different,” Martin Van Ryswyk, Datastax’s Executive VP of Engineering, stated. “This also happens in fraud scenarios when you need to stop a transaction right now, not two hours from now once you’ve seen all that data in batch mode on Hadoop or a data warehouse.”

However, not everyone is convinced that the combined power of Cassandra and Spark is something to get excited about. David Brinker, COO of ScaleOut Software, has voiced concerns about Spark’s lack of handling for real-time state changes, and Cassandra’s eventual consistency model. “Spark does not handle real-time state changes to individual data items in memory; it can only stream data and change the whole data set,” he says. “Cassandra has a similar limitation because you can’t update data in Cassandra; all you can do is delete it and create a new copy.”

Databricks and Datastax hope to have the integration ready over the summer. Only time will tell if this will prove to be a game-changing technology.




