Data is the Next Major Layer of the Cloud & A Major Victory for Startups

Posted on Dec 9, 2010 | 71 comments

Today Factual announced that it raised $25 million from Andreessen Horowitz & Index Ventures.  I believe that this is a major new area of growth & innovation for the Internet as Cloud Services start to form deeper & richer layers.  Let me explain.

For decades the “layering” of technology has allowed us to develop IT systems and networks in a specialized way. It lets best-of-breed technology solutions emerge at each layer of the stack and lets people with different skill sets specialize in key areas without needing competence in every technology arena.

One obvious example is the OSI model, in which we have seven layers ranging from the physical layer at the bottom of the stack (e.g. dealing with how digital or analog signals are actually transmitted from point A to point B), through the network layer in the middle that deals with routing packets of information, to the presentation and application layers at the top end.

You can think of even your PC as a stack in which the hardware manufacturers handled physical layers, Microsoft handled the OS layer and application companies built higher up in the stack.

I mention this because I believe the layered metaphor for technology development has served our industry well and even if you aren’t technical it’s an important concept for you to grasp.
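For readers who do write code, the layering idea above can be sketched in a few lines of Python. This is a deliberately simplified illustration with three made-up layers rather than the seven of the OSI model, and the function names and the message format are assumptions for the example only. The point is that each layer wraps what it is handed from above and never needs to look inside it.

```python
def application_layer(message):
    # The application only knows about its own message format.
    return f"APP[{message}]"

def network_layer(payload, destination):
    # The network layer adds routing info without inspecting the payload.
    return f"NET[to={destination}|{payload}]"

def physical_layer(frame):
    # The physical layer turns whatever it is given into bytes for the wire.
    return frame.encode("utf-8")

# Each layer is called only by the one above it (hypothetical address).
wire = physical_layer(network_layer(application_layer("hello"), "10.0.0.2"))
print(wire)
```

Swapping out any one function (say, a different physical layer) leaves the other two untouched, which is the specialization benefit described above.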

Over the past 5 years the Internet Cloud has started to form into layers and this is a great thing for innovation.  I’m not covering the actual layers of the Internet (under the OSI model) but rather the Cloud Services layer.  For every layer, if I mention companies, please don’t assume that I’m suggesting there aren’t other players in that category.  I’ve just listed who I perceive as the market leaders.

When I started my first company in 1999 we spent more than $2 million on technology infrastructure including Sun servers & Solaris operating system, Oracle databases, EMC storage, load balancers, app servers, back-up devices, disk mirrors and on and on.  That is before writing a single line of code or paying any salaries.  No wonder people had to raise $5 million just to get started back then.  We raised $16.5 million in our A round.  Hardware ate just over 10% of the round.

We put all of this infrastructure in an Exodus web hosting facility and had to pay for rack space, bandwidth and some management services if a disk failed, for example.

When I started my second company in 2005 we decided to do everything differently.  By then the open-source movement had really developed.  We were able to use an open source database (Postgres), open source search (Lucene) and a host of other free components including Apache Tomcat and JBoss.  We still bought our own physical infrastructure: horizontally scalable application servers, load balancers, etc.  So I still had to lay out $50-80k for hardware costs.  Even so, we only had to raise $500,000 to get going, and again hardware ate just over 10% of the round.

And then came the debate about storage.  Our chief architect, Ryan Lissack, wanted to store our data in Amazon’s new (at the time) storage product called S3 that enabled us to store all our data in their facility and we’d pay by the MBs uploaded / downloaded.  I was dead set against it.  I had been selling large content management systems and storing documents for industrial-scale customers.  Many of the biggest customers wanted to be able to physically walk through our data center – how could I give up something so strategic?  Especially to a company that sells books!

Ryan is both smart & persuasive.  I trusted his judgment.  He convinced me that the storage infrastructure was stable, reliable & secure.  We had a data redundancy plan and the ability to bring it in-house if it wasn’t working.  Work it did.  To this day I’m astounded that IBM, Google, Sun, Microsoft and others didn’t offer this service and Amazon did.  I guess the “stack ‘em high and sell ‘em cheap” mentality convinced this retailer that they could do the same with cloud services.

It performed incredibly well and allowed us to grow our costs incrementally as our business grew as well as to massively reduce our overall storage costs – it’s a shared infrastructure in the way that electricity or water is.  If you’re not knowledgeable on the topic of this big IT migration to the cloud I’d suggest reading Nicholas Carr’s book, “The Big Switch.”

I used to recommend that companies only keep their non-core data on S3; I now recommend it whole-heartedly even for mission-critical applications.  I have seen some compelling arguments for or against – but mostly on costs.  From a reliability & performance perspective for most applications it will perform beautifully.

During my second startup we never considered using cloud computer processing for our real-time processes but we did run some batch processes there.  At the time we viewed Amazon’s offering, EC2, as too nascent.  How on Earth could I rely on Amazon to guarantee me performance so that I didn’t risk slow response times for my customers?

But sure enough over time Amazon was able to prove that they could reliably meet performance targets and so many startups bet their whole infrastructure on Amazon.  Think about it.  Imagine that you can develop software on your local computer but the entire service is delivered virtually through a partner in the same way people consume energy, with all of the scale benefits that go with that.  They deal with energy management, security, physical device failures, etc.

This has allowed people to get started for $50,000 and spend just $5,000 on hardware – again around 10%.
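The back-of-the-envelope numbers from the three eras above can be checked with a few lines of Python. The dollar figures are the ones quoted in this post (the 1999 figure was "more than $2 million", so that share is a floor, and the 2005 hardware cost is a $50-80k range). The function name, variable names and labels are just for illustration.

```python
def hardware_share(hardware_cost, round_size):
    """Return hardware spend as a percentage of the money raised."""
    return 100.0 * hardware_cost / round_size

# (label, hardware cost in dollars, money raised in dollars), as quoted in the post
eras = [
    ("1999, owned hardware and licensed software", 2_000_000, 16_500_000),
    ("2005, open source, low end of range", 50_000, 500_000),
    ("2005, open source, high end of range", 80_000, 500_000),
    ("cloud era", 5_000, 50_000),
]

shares = {label: round(hardware_share(hw, raised), 1) for label, hw, raised in eras}

for label, pct in shares.items():
    print(f"{label}: {pct}% of the round")
```

The share stays in roughly the 10-16% band throughout, while the absolute hardware bill drops from $2 million to $5,000, a factor of about 400.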

As companies (startups or business units of bigger companies) started betting their businesses on the stability of cloud services a host of other issues started to arise.  First, how did I handle things like unexpected surges in traffic to my site (let’s say after a major press release) and then the subsequent flattening of traffic as the crowd subsided?

In the on-premise world you just had to have extra compute capacity and the ability to expand your bandwidth even if you exceeded the contractual limits with your telecom provider.  But if cloud was to be more economical than this there had to be a better solution.

And in stepped new entrants at a layer above storage & processing that I would call “management services.”  An early star in this category has been RightScale.  They built in a feature called “auto scaling” that monitored for traffic spikes and automatically provisioned new servers on demand and decommissioned them if your traffic surge subsided.
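To make the idea concrete, here is a minimal sketch of the kind of decision an auto-scaling rule makes. It is not RightScale's actual implementation or API. The function name, the thresholds, the per-server capacity figure and the traffic numbers are all assumptions chosen for illustration.

```python
import math

def desired_servers(requests_per_sec, capacity_per_server=100,
                    min_servers=2, max_servers=20, headroom=0.25):
    """Return how many servers to run for the current load.

    All parameters are hypothetical: capacity_per_server is the load one
    server can handle, headroom is spare capacity kept for sudden spikes.
    """
    needed = math.ceil(requests_per_sec * (1 + headroom) / capacity_per_server)
    # Never drop below the floor or grow past the ceiling.
    return max(min_servers, min(max_servers, needed))

# A press-release spike followed by the crowd subsiding (made-up traffic).
traffic = [150, 400, 1200, 3000, 900, 200]
plan = [desired_servers(rps) for rps in traffic]
print(plan)
```

A real service would also have to decide how often to check, how long to wait before decommissioning a server, and how to cap spend, but the core loop is this simple: measure load, compute the target, and provision or release the difference.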

Another key feature of RightScale was to let you manage services across multiple clouds and abstract your management from one individual player.  Unfortunately for all of us there aren’t robust competitors for the core AWS offering.

While Amazon continues to move “up the stack” and offer some of these services on their own, RightScale continues to innovate by creating better tools for deployment, monitoring and other functions.

Another big innovator in helping manage cloud implementations is Okta, founded by Todd McKinnon, the former VP of Engineering at (who knows a thing or two about cloud services) and Freddy Kerrest who was senior in biz dev & sales at Salesforce and was there from 2002-07.  They realized that as enterprises were increasingly using many different Cloud-based applications they didn’t have good cross-platform tools for deployment, monitoring and decommissioning.  Okta solves this and more.

And of course there’s a ton of other companies one could include in this area who are taking services that today are managed mostly in-house and moving them to the cloud allowing cost reductions and standardization of non-standard management technologies.  An example of this would be Mashery, who created cloud-based API services.  In a world where most technology products are launched as web services having an API layer in the cloud seems an obvious trend.

Business Logic
So far in the stack we’ve only spoken about infrastructure.  But the cloud-based stuff that we use every day as consumers (websites, Twitter, Facebook, Zynga) or as businesses (Dropbox, Gmail, Yammer, GoToMeeting) all relies on business logic created by application companies.  If you look at any graduating class of Y Combinator they’re filled with application companies launching new, experimental services that change the way we work and live.

This is where the rubber hits the road for us as users.  It’s the input screens where we enter data or search requests.  It’s the screens that pop up our restaurant locations, calculate our exercise outputs or show us our bank balances.  This is the top layer in the Cloud Stack.

Proprietary Databases
But here’s the problem.  As you can see from the depiction above there is still too much of a gap between our business logic and our underlying infrastructure.  What it means is that there’s either a huge cost for us to license a proprietary database or a huge time lag for us to build one on our own.

Let’s take some examples.  Let’s say you wanted to launch Yelp today.  You’d need to start with a list of all of the restaurants, hotels and other businesses in the country (not to mention internationally).  FourSquare faced the same issue when it launched.  Remember the early days when we as trailblazing users had to enter in a bunch of restaurants ourselves?

If anybody remembers using DailyBurn (monitors calorie consumption and exercise outputs) in the early days they had a core set of data from the USDA for standard foods but then the rest of us had to help them build out their databases to say how many calories were in a PinkBerry yogurt or a grande latte at Starbucks.  Each of these types of businesses has scores of related companies trying to launch and either licensing or creating the exact same data sets.

I see the same again with the entertainment industry.  We all know about IMDB.  But everybody is trying to get access to data on stars, movies, release dates, box office data, etc.  It’s needed for Fandango, RottenTomatoes,, and scores of other companies.

Same with university data.  Healthcare data.  Drug data.  What about financial services information?  Public stock market trades by senior executives of corporations.  Annual accounting statements by companies.  What about court records?  Weather data.  Criminal records, credit scores, locations of cell towers. And on and on.

Most great application businesses are built on data or create data.  And historically this data has been very expensive to buy or create in the same way that servers and storage once were.

Cloud Data Platform
Enter the world of “data as a service” where businesses can consume data in the same way that they now consume Amazon’s storage services or processing.  This is what Factual provides and they just raised a whopping $25 million (disclosure: my firm, GRP, is a shareholder.  As this was a whopping round we didn’t lead it so my commentary in this post is 90% as an excited industry observer and only 10% as a proud investor).  Factual was created in 2007 by Gil Elbaz, the founder of Applied Semantics.  In case you don’t know Applied Semantics, it’s Google AdSense.  Google bought Gil’s company in 2003 (pre IPO) for $100+ million and this business now represents about 30% of all of Google’s revenue.  Wow!

What I love about Factual is that it democratizes data and makes it an order of magnitude cheaper, more available and higher quality than the historical approach.  I’ve had this conversation so many times over the past year that I know it’s not immediately intuitive.

Let me say it this way.  Imagine the world before Wikipedia.  It was heresy to suggest that crowd-sourced information could beat Encarta let alone the Encyclopaedia Britannica.  Yet now it’s laughable the other way.  Physically printed books or CDs with editors, reviewers and a centralized system are inherently slower and in many cases not even more accurate.  And Wikipedia is deflationary, meaning it takes the costs of production to almost zero.

The story of the Internet has been deflationary from Amazon to Craigslist to iTunes.  And so too will be Factual.  They have built algorithms that automatically crawl the web for the world’s best structured data and use heuristic techniques to ensure the quality of the data.  They have built tools to store the data but also to allow 3rd-party developers to rapidly consume or even write data to their tables.
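As a rough illustration of what consuming and writing to a shared table might look like from a developer's side, here is a toy in-memory sketch. It does not use Factual's real API, which this post does not describe. The SharedTable class, its methods and the sample rows are all hypothetical.

```python
class SharedTable:
    """Toy stand-in for a hosted data table that many apps read and write."""

    def __init__(self, rows=None):
        self._rows = list(rows or [])

    def query(self, **filters):
        """Return rows whose fields match every keyword filter."""
        return [r for r in self._rows
                if all(r.get(k) == v for k, v in filters.items())]

    def contribute(self, row):
        """Add a row supplied by a third-party app, skipping exact duplicates."""
        if row not in self._rows:
            self._rows.append(row)

# Hypothetical sample rows; real data would come from the provider.
places = SharedTable([
    {"name": "Cafe Uno", "city": "Los Angeles", "category": "restaurant"},
    {"name": "Hotel Dos", "city": "Los Angeles", "category": "hotel"},
])

# One app contributes a place, and every other app can then query it.
places.contribute({"name": "Taqueria Tres", "city": "Los Angeles",
                   "category": "restaurant"})
restaurants = places.query(city="Los Angeles", category="restaurant")
print([r["name"] for r in restaurants])
```

The interesting part is what the sketch leaves out: crawling, de-duplication and quality checks are where a data provider would earn its keep, so that the application developer only ever sees `query` and `contribute`.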

Imagine if 3 years ago Factual existed – would FourSquare, GoWalla, Booyah and every other application that relies on location data need to build or license their own?  Imagine if all of their resources could have been focused on the user experience and not the underlying data that is mostly a commodity.  It will take time for companies to understand that much (not all) of the world’s data is a commodity in the same way it took years for us to migrate to the cloud.  But when they all do – imagine the importance of Cloud Data.

And what really excites me and what is such a win for startups is the potential to massively speed up innovation and make it cheaper.  What if every YCombinator and TechStars company had access to the Factual dataset and when they created their concepts it was with a large corpus of data?  What if we could publish large pools of drug data and allow hackers to create databases of drug interactions that reduce problems with prescriptions?  Imagine if you could have developers building financial services apps that created more transparency of trades.

I predict that data over time will become the next major layer of the Internet supporting both consumer and business applications.

Other Layers
I have talked in the past about other layers that are emerging particularly in social networking and mobile applications.  An obvious one is the mapping layer where SimpleGeo has a great start.  Many mobile applications being built today are incorporating LBS (location based services) into the user experience which often means plotting results on to a map.

And what about our social graphs?  Wouldn’t it be nice if that could be managed as a Cloud Layer and then let services be created that incorporate not only our personal relationships but those of two or three degrees of separation?

I can’t dream up all the new layers that may be created in the next 10 years.  But I’m pretty convinced that horizontal specialization will be a big win for many companies and for the tech ecosystem in general.

  • Christophe

    Martin, Colin,

    Just a quick question on the SSO discussion. For a business user willing to have a single access to multiple web-based business apps (i.e. Google apps + CRM + accounting…) from a single workspace what would be the best solution ? Would need to be universal, non-proprietary and free! Twitter and/or FB SSO would not be appropriate here.

    You mention “having to create a new account and remember a password for each new web app is, and that has been sufficiently solved for both the user and the app provider.” How has this been solved?

    Would be great to hear your feedback/view.

  • Allen Graber

    Hi Mark

    Excellent post–thanks for taking the time to write this. Inspiring and thought provoking.

    Allen Graber (an emerging “data as a service” business!)

  • John

    This was 1000 times better than the Techcrunch post on Factual. Now I actually understand where it's going. Very interesting indeed.

    My big question is whether Factual will mostly be a service provider for the “funded startups” or whether it will commoditize the information to the point that the unfunded startup can afford their service. Your example of Ycombinator and Techstars is a good one. There are hundreds of other little guys that could benefit from it too. It will be interesting if the payment from the “funded startups” and other companies that want the data is so good that Factual ignores the little guy. Or if they'll be able to build a pricing plan that handles both.

    Of course, if Factual doesn't provide the data to the little guys, I'm guessing another company will come in and do that.

  • Evan Kaplan


    This is a great post — thanks for laying it out

  • Martin Wawrusch

    The single sign on scenario can be achieved with OpenId and oauth. OpenId is a solution that allows you to have a user id at one place (for example google, microsoft, flickr, yahoo, or your own) and use that to log onto websites from other vendors. More and more business web apps are supporting this as most users do not want to use facebook and to a lesser extent twitter for business logins.

    Just to clarify: This does not mean that Google has access to your data in the CRM, it just tells the web app that you are who you claim to be.

    OAuth is a protocol that is used to allow applications to work together without revealing a user's password to the opposite party. It is used by a lot of major players, including Facebook and Twitter, and whenever you use Twitter to log into another website you are actually going through the oauth process.

    When a web app supports both of these technologies you can use one single user id (for example your google id) to log into all your apps and you can access your data in all those apps as well.

    Visually this is not a single workspace but a set of apps. If you want to provide corporate users a common starting point you can use something as simple as .

    There are also container technologies out there like OpenSocial that allow you to really combine different apps into one workspace, although as far as I can tell it will take a bit till business web app providers will see the advantages of that.

  • Raphael (Rafi) EPstein


    The concept is indeed intriguing, but somewhat over simplified.

    One major issue with “data as a service” is the quality of the “service”. Quite intuitively, data quality is a highly subjective term and greatly depends on the application. More specifically, one should consider the impact of an application making the wrong decision because of wrong data.

    For example, if you get the wrong address of a restaurant, no big deal. If you get the wrong opening hours, you may be pissed off a little. But what happens when low quality data drive more financially significant decisions? What happens when the impact is more than financial?

    Who is liable in case something like that happens?

    In my previous company, we provided Network Design platform for leading vendors and Service Providers. The solution was based on the same concept of what we called “Knowledge as a service”, covering hundreds of thousands of items and millions of design rules. The number one concern was how to maintain and measure the quality.

    One small mistake could easily cost the System Integrator $50,000 or $100,000.
    The problem intensifies exponentially when you start dealing with community contribution, similar to open source code.

    Because of the massive amounts of data, data quality will be, IMHO, one of the biggest hurdles to overcome



  • Martin Wawrusch

    Premature standards and standards hijacked by big players certainly are.

    I believe that any attempt to standardize Social and Identity at the top level at this point is very dangerous for all parties but the established players. Most of social is simply not understood yet. Just to put it in perspective: It took tcp/ip about 20 years to reach ubiquity and to become the one standard, and that is a very low level protocol that does not interact with people.
    Social as we know it is about 2 – 3 years old (I think the launch of Facebook Connect can be considered the birth of Social but that's something for a different discussion)

    One goal of larger companies is to create barriers to limit competition from below, startups that will eventually become real competitors. A good way to do that is to create standards that make it harder for newcomers to enter or disrupt a market. This can be as simple as setting a high price point for being able to use the standard or to steer it in a technical direction that forces a specific way of “doing things”.

    Now any single top level (what you coined OS) standard in social and identity will be hijacked by the big players. There is simply too much money involved and they need to limit competition (the thing an ex startup is most afraid of is the next startup).

    I am really a bit of a purist here: HTML5 and open APIs created by the community to solve problems as they arise. The ability to switch to a new paradigm whenever I please without having to ask some standards body for permission.

  • Colin Hawkett

    What Martin said, plus it is worth noting that your company's google apps installation is an openID provider (unique to your company/domain & backed by google) and your userid is an openID. The intention of this setup is to solve the problem you have described. Applications on the web need to specifically accept google apps openID logins, separately to plain google openID logins though. Put another way, an app not only has to support openID, but also the openID provider you wish to use. Business systems (like CRM and accounting) are good candidates to support google apps openID providers, but you would need to check each app. Cheers,


  • Colin Hawkett

    I think we're sitting on either end of a see-saw here :) Both arguments hold – i.e. you necessarily need the chaos from lack of standards to get an idea of what is trending to ubiquity, and you need the standard platform on which that chaos can thrive. For example, there's no doubt the ubiquity of HTML, CSS, HTTP, DNS, etc. is what is enabling the explosion of innovation in web applications – without the platform it can't happen.

    For me, if we aren't pushing stuff into the ubiquitous stack for fear that the big companies will corrupt it, then the system is broken – it's like supporting anarchy because we don't trust 'the machine', when in reality a machine is always the result of anarchy – it is necessarily a temporary state. Corporate ownership/corruption of a standard or governance process is the wrong model, which I think we agree on, but if we don't standardise, then that is what we will get. Tim Berners-Lee warning about the threat of erosion of the internet's ubiquity and openness is pretty relevant here (http://www.scientificamerican….).

    I can't think of any IT standard that you need to ask permission in order not to use it. You're pretty much free to do whatever you want. There probably are examples of where an IT standard has been used to stifle competition – but I'm struggling to think of one – they almost universally *lower* barriers. I'm pretty sure that in the end we are discussing the point at which something should become ubiquitous, rather than for or against standardisation. Wherever that point is, Mark's hardware stack looks like a great place to stick an OS :)

  • bernardlunn

    What a superb post. This crystallized my thinking and filled in some gaps in my knowledge. This is a “keeper”; I will come back and refer to it frequently.

  • Muneeb

    Hey Mark,

    I generally enjoy your posts, but found this one somewhat confusing. Also, got “stuck” on a few points in the post and am going to try and articulate why. This is just a warning that my comment might sound somewhat critical :-)

    First of all, you used the OSI model as an example, which is kind of a bad example to look up to as the OSI model exists only in text books and is not really followed in academia or industry. For all practical purposes there are 4 layers of the networking stack (not 7). Even when people have tried to build networks that look quite different from the Internet (e.g., sensor networks) they usually end up with 4 layers; physical (how to deal with the physical hardware), network (how to wire different machines together), transport (how to send data over the network), and application (how the applications use the underlying functions). The reason why the OSI model is a bad choice is that it introduced layers that no one really needed or in other words layers which were not fundamental to the design and evolution of the networking stack. They only added needless complexity. The takeaway point is that layering is not always good, it can add needless complexity as well.

    A better example, in my view, could have been the “narrow waist” of the Internet i.e., TCP/IP. TCP/IP allowed hardware and applications to evolve independently of each other while using the same interface. Said differently, when 802.11 (WiFi) came along you could just plug it in under TCP/IP, and applications running on top of TCP/IP didn't need to be changed. This is the fundamental benefit of layers – allowing technologies to evolve independently of each other.

    Now talking about the “cloud” when you talk about the “data layer”, it leaves me confused. What Factual is doing seems like a major exercise in data mining and then providing an API to the data that they gather. The “data as a service” concept makes sense to me i.e., you talk to an API and pull data off it, much like what Google maps or Twitter let you do today. Factual, as far I understand, wants to come up with sophisticated data mining methods to mine for and “clean up” different types of data and then make it available through their API. This is “data as a service” sure, but data as a layer? What does that mean? For me a layer needs to make use of the underlying layer and should export some abstraction to the thing running on top. How does the data layer use the management layer in your example?

    There were also a few small points that I got “stuck” on. Like when you said that by 2006 the open source movement had really developed and then give the example of Postgres (Stonebraker's work from the 80s). Even back in 1999 you can get a Linux machine and run Apache webserver on it, just that it wasn't trendy in the Enterprise market to do so. Folks who worked with UNIX/Linux back then still knew that it was more stable and efficient than most proprietary solutions coming out of large Enterprises. I don't know what changed between 1999 and 2006, if I have to guess I'll say thanks to the Bubble everyone had to cut costs so people started looking at all these alternate technologies that were there all along.

    While I'm at it, let me add my two cents about the entire “cloud” concept anyway. I personally prefer the term “datacenter”, that still paints the mental image that there is a physical facility containing stacks of real machines running software. I don't think virtualization gets enough credit. It was the single largest technology jump of this decade. Amazon's services are all but virtual machines running on their physical hardware. When we are thinking about layers for the “cloud”, it might make sense to sit down with a clear picture of how things actually run on physical machines and then work our way up to the application, thinking hard about why a layer is included, how it uses the layers below it, and how will it allow technologies to evolve independent of each other. We should be careful about not going the OSI route and adding layers that were not really layers.

    Again, apologies for the critical comment. As I said earlier I generally really like your posts (and maybe should comment on things I like as well!)

    - Muneeb

  • paramendra

    This post beats your TechCrunch posts, although those were great too. DataCloud

  • TJGodel

    I've had the “data as a service” vision for at least 5 years now even as I observed and used Cloud Services to build out my vision. We are building “data as a service” for one vertical industry. Our business model is to add value by building services on top of data that we have cleaned and to provide the cleaned-up data to anyone who wants it. Yes most data is a commodity and application developers should have access to it in a usable form. We will look at Factual as a distribution channel for our data.

  • Jan Schultink

    Great post.

    We are heading towards the digital mirror (cliche) of the real world. Maybe we get to the stage where we can specify 10 examples of “something” (things, people, places, etc.) to a cloud service that in turn delivers an entire database with the universe of this category to us (including the proper record structure).

  • Christophe

    Martin and Colin, thanks for your responses. OpenID + oauth is the solution we were looking at. By the way I like the eblizz idea, I have signed up to the private Beta. Cheers

  • maxthelion

    Hi Martin,

    Thanks for mentioning us ( As you say, there are a whole class of services like this being created.

    While the layers Mark talks about are horizontal, I believe we are going to see an influx of services based on 'vertical' slices of the cloud. These will be designed to perform a single function extremely well. We do this for realtime updates, our sister company does it for video encoding, and other companies such as do it for email.

    Part of Heroku's value to must have been due to its addon marketplace, which makes this sort of vertical product extremely accessible.


  • Martin Wawrusch

    You are welcome. I really like pusherapp, it makes a complicated subject very easy, it is startup friendly with the free intro plan, has a well designed website (Best pricing page I encountered in a long time) and good interactive support.

    I think services like video encoding, email, etc can be considered app infrastructure services, basic building blocks that every app needs. I see pusherapp a little bit outside of this, I think realtime services, combined with presence and real time location/social proximity will form a layer of its own.

    I did notice the same thing about the influx of services. People are creating single purpose apps for every aspect of web app design, and they are able to do so much faster than even 6 months ago.

    Fully agree on Heroku. What a smart move for Salesforce. Amazon IAAS + Heroku PAAS + all the app services (which almost all run on Amazon) will capture a large part of the startup web app market. You have to have very specific reasons not to go that route for a new web app as a developer these days.

  • maxthelion

    Thanks very much for the kind words. Great to hear that you like our service.

    I think that giving developers a toolbox of cloud-based components brings about extremely exciting prospects for the future of web development. This is certainly a great space to be exploring at the moment!

  • raycote

    That was an interesting dialectic discussion on the tipping point at which we should push universally commoditized API-functions down into the internet’s cloud-computing engine room without risk of stifling innovation.

    The universe is made up of a multitude of layered platforms. Sub-atomic particles, atoms, molecules, complex molecules, life molecules, cells, organisms, social structure, technology, computing APIs, cloud computing APIs, social graph APIs, organic community APIs, organic democracy APIs. Each layer is a form of software that sits atop the layer below it. As you move up these stacked layers each layer is a form of software that algorithmically recombines the components of the stabilized layer below it as if those components were fixed reusable firmware components. One layer’s software is the next layer’s firmware components, all the way up and down this great God-Head cosmic stack feast. As we move towards the top layers of that stack the spirit of living organic dynamism mischievously destabilizes the separation between the layers. The more environmentally dynamic software layers are able to reach down via interlayer feedback to dynamically alter their own substrate firmware layer. This creates an endlessly complex echo chamber between cause and effect across all the biological, social and computer-extended-social layers of this cosmic reality stack.

    All that to say, that at this stage of our emerging cloud-API driven noosphere we are now firmly under the control of a particularly powerful strange attractor, the teleology of self-selecting, self re-enforcing organic complexity.

    In other words, in this cosmically complex show, we are in the cheap seats and from this vantage point it is impossible to discern the optimal tipping point between cloud-API software and cloud-API firmware. We may be able to develop a conscious methodology for rolling with the organic punches by using nature’s cheat sheet to extract and distill the universal, reusable, organic schema used so successfully by biological systems and consciously apply those evolutionary schema to our own distributed cloud-API eco-system.

  • Glenn D.

    Excellent post Mark!

    Do you see this taking place in the public sector on a broad scale?

  • Glenn D.

    Great post Mark!

    Do you see governmental agencies, (Fed, State, or Local) moving in this direction of leveraging the cloud for storage, processing, data, etc? if so, how and when?