AI Needs Data (Lots of Data)

The following is an exclusive extract from The Intelligence Revolution by Bernard Marr.

Intelligent machines are very data-hungry, which means that without data, we wouldn’t have AI as we know it. Many of the latest breakthroughs in machine learning (see Chapter 2) came from data – or, more specifically, the fact that we have more data than ever before.

Therefore, in the intelligence revolution, data has become a vital business asset. For some businesses, it’s the most important asset they have. Yet data presents a challenge for many businesses – what sort of data do you need and how can you access or generate that data? In this chapter, we’ll explore AI’s need for huge datasets (basically, collections of data) and how to get your hands on the data you need to make your business more intelligent.

Data isn’t exactly the liveliest subject. It can often be dry and overly technical, but don’t be tempted to skip this chapter. Here, I aim to make the subject as accessible and engaging as possible – ensuring you don’t need a background in data science to understand and get the most out of data.

Revisiting the incredible growth in data

The fact that AI algorithms are so data-hungry can seem daunting, but business leaders can take heart from the fact that they already have so much more data than they’ve ever had before. What’s more, having a limited dataset may be much less of an obstacle than you think…

The acceleration of data

The vast majority of data we have today was created very recently. In fact, 90 per cent of the data available in the world today was generated in the last two years. We’re also doubling the amount of data we have available every two years. Big data is getting bigger, essentially. So much so that market intelligence company IDC estimates that the amount of data in the world could grow from 33 zettabytes in 2018 to 175 zettabytes in 2025. That’s a lot of data. Try to store 175 zettabytes on DVDs and you’d have yourself a stack of DVDs so high it could encircle Earth 222 times.
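To put the IDC estimate in perspective, the arithmetic behind it can be sketched in a few lines of Python. This is only an illustration: the two data points (33 zettabytes in 2018, 175 zettabytes in 2025) come from the text above, while the function names and the assumption of a constant yearly growth rate are mine.

```python
import math

# Figures quoted in the text (IDC estimate); everything else is derived.
START_YEAR, START_ZB = 2018, 33.0
END_YEAR, END_ZB = 2025, 175.0


def implied_annual_growth(start, end, years):
    """Constant yearly growth rate that turns `start` into `end` over `years`."""
    return (end / start) ** (1 / years) - 1


def doubling_time(annual_growth):
    """Years needed to double at a constant yearly growth rate."""
    return math.log(2) / math.log(1 + annual_growth)


years = END_YEAR - START_YEAR
growth = implied_annual_growth(START_ZB, END_ZB, years)

print(f"Overall growth factor: {END_ZB / START_ZB:.1f}x over {years} years")
print(f"Implied annual growth: {growth:.1%}")
print(f"Implied doubling time: {doubling_time(growth):.1f} years")
```

On these two figures alone, the world's data would grow by roughly 27 per cent a year, which works out at a doubling about every three years. Treat 'doubling every two years' as a rule of thumb rather than a precise forecast.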

What really excites me is that we’re currently only analysing a tiny fraction of the data that’s available to us. AI makes the process of analysing even complex, unwieldy data (such as video data) much easier and quicker. So as AI gets smarter, we’ll be able to capitalize even more on the massive amounts and different types of data being generated.

But where is all this data coming from? In Chapter 2 I outlined how even simple everyday activities are producing data – even something as analogue as going for a walk generates data, if you’re carrying your mobile phone, taking pictures on your walk, or wearing a fitness tracker. When you think about the increasing digitization of our lives, it’s perhaps not so surprising that the amount of data we’re generating is doubling every two years.

In just one minute on the internet in 2019:

  • 1 million people logged into Facebook;
  • Google received 3.8 million search requests;
  • 188 million emails were sent;
  • 5 million YouTube videos were watched;
  • over 40 million messages were sent on WhatsApp and Messenger.

And that’s just in one minute.

Why we might not need so much data in the future

Interestingly, although AI gives us greater opportunities to make sense of data, it may also mean that we need less data in future. Confused? As AI becomes more intelligent, it’ll come to rely on fewer data samples.

Currently, AI needs massive datasets to learn from (this is known as training data). But, over time, AIs will get better at general reasoning and at understanding concepts from smaller amounts of training data, just as humans do.

Here’s an example: let’s say you show someone a picture of a domesticated cat for the first time, then show them a picture of a lynx. They’ll most likely recognize the lynx as a type of cat, without needing to be told all the different types of cats in existence.

Humans’ innate intelligence and reasoning mean we can apply the general concept of a cat to other types of cats (and solve other, more pressing problems, of course). Compare that to today’s AIs, which need to be trained on masses of data to be accurate and can be quite easily thrown by less familiar situations. (One brilliant example is the iPhone X’s facial recognition system failing to recognize ‘morning faces’ – that puffy, tired look many of us sport when we first get up.) But, over time, AIs will get better at the sort of general reasoning that humans excel at – which, in turn, will reduce the need for massive training datasets.

Two developments will play a major role in the reduced need for data: reinforcement learning and generative adversarial networks (GANs).

Reinforcement learning essentially means letting AIs learn for themselves through a process of trial and error, rather than being taught by human programmers, which allows AIs to come up with previously unimagined solutions to problems (see also Chapter 2). And GANs, in very simple terms, involve pairing up two networks that compete against each other to enhance their understanding.

For example, when it comes to recognizing cats in pictures, one network could be working to separate fake cat pictures from real cat pictures, while a ‘competing’ network could be creating images that look like cats but aren’t, in an attempt to fool the first network. Through this process, both networks become better at understanding the general concept of a cat – and because the system is generating its own believable pictures of cats, it doesn’t need as much ‘real-world’ data to learn from.

Therefore, in the future, we’re likely to see enhanced reasoning and common sense in machines, allowing them to generalize from fewer examples. In other words, right now AI is nothing without data. But that won’t always be the case, as artificial intelligence becomes more like, well, real intelligence. And this reduced need for data will hopefully make AI even more accessible for businesses.