The three facets of data science

When people talk about big data, they nearly
always talk about data science, and data scientists, as well. Now, just as the definition of big data is
still debated, the same is true for the definition of data science. To some people, the term is just a fancier way of saying statistics and statisticians. Others argue that data science is a distinct field, with different training, techniques, tools, and goals than statistics typically has. Now, that’s what we’re going to talk about
in this presentation. The first thing we want to do is look at what’s called the data science Venn diagram. This is a chart that was created by Drew Conway
in 2010, and what he’s arguing here is that data science involves a combination of three
different skills. The first is statistics, at the top right. The second, on the bottom, is domain knowledge, meaning you actually know something about, for instance, management or advertising. And the one on the top left is coding, or being able to program computers. He’s arguing that, to do data science,
a person needs to be able to do all three of these. So, we’re going to talk about each of these
facets one at a time and in combination with each other. The first component of data science is statistics,
which shouldn’t be surprising because we’re talking about data science. The trick here is that a lot of things that
go into statistics and mathematics can be really counterintuitive, and if you don’t
have the specific formal training, you can make some really big mistakes. The second element of data science is domain
knowledge, and the idea here is that a researcher should know about the topic area that they’re
working in, so if you’re working in, say for instance, marketing, you need to understand
how marketing works. That gives you more insight and lets you better direct your analyses and procedures to match the questions you might have. For instance, there’s a wonderful blog post
by Svetlana Sicular of Gartner Inc, where she writes, “Organizations already have
people who know their own data better than mystical data scientists – this is a key. The internal people already gained experience
and ability to model, research, and analyze. Learning Hadoop, a common software framework
for dealing with big data, is easier than learning a company’s business.” And that really underscores the importance
of domain knowledge in data science. The third element of Drew Conway’s data
science diagram is coding, and this refers to computer programming ability. Now, it doesn’t need to be complicated. You don’t have to have a PhD in computer
science. A little bit of Python programming can go
a very long way. That’s because coding allows for the creative exploration and manipulation of data sets, especially when you consider the variety
of data that’s part of big data. The ability to combine data that comes in
different formats can be a really important thing, and that often requires some coding
ability. It also helps you develop algorithmic thinking, or working through a problem in clear, linear steps.
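As a small, hypothetical sketch of what that looks like in practice, here is one way a few lines of Python might combine records that arrive in different formats; the file names and field names are invented for illustration, not taken from any particular project.

import csv
import json

# Hypothetical example: combine customer records that arrive as CSV and as JSON.
# File names and field names are made up for illustration.
records = []

# Read one data source stored as CSV.
with open("store_sales.csv", newline="") as f:
    for row in csv.DictReader(f):
        records.append({"customer": row["customer_id"],
                        "amount": float(row["amount"])})

# Read another data source stored as JSON.
with open("web_sales.json") as f:
    for item in json.load(f):
        records.append({"customer": item["customerId"],
                        "amount": float(item["total"])})

# Once both sources share one format, simple questions become easy to answer.
total_by_customer = {}
for r in records:
    total_by_customer[r["customer"]] = total_by_customer.get(r["customer"], 0) + r["amount"]

print(total_by_customer)

The point is not the specific code, but that a small amount of programming lets you pull differently shaped data into one shape you can actually analyze.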
Next, we’ll talk about combinations of two of these elements at a time. The first one is statistics and domain knowledge
without coding. Now, this is what Conway calls traditional
research, and this is where a researcher works within their field of expertise and uses common
tools for working with familiar data formats. It’s extremely productive, and nearly all
existing research has been conducted this way. The second combination is statistics and coding
without substantive expertise. This is what Conway refers to as machine learning,
and now, that’s not to be confused with data mining. Machine learning is where an algorithm or
a program updates itself and evolves to perform a specific analytical task. The most familiar example of this is spam
filters in email, in which the user, or a whole group of users, identifies messages as spam or not spam, and the formula the program uses to decide whether something is spam updates with each new piece of information, becoming more accurate the more you use it.
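To make that concrete, here is a minimal sketch of the idea in Python. It uses a crude word-count-based spam score rather than the algorithm any real email provider uses, and the messages and labels are invented for illustration.

from collections import defaultdict

# Minimal sketch of a self-updating spam filter (not a production algorithm).
# Each labeled message updates the word counts, so the score adapts as more
# examples arrive -- the "learning" in machine learning.
spam_counts = defaultdict(int)   # how often each word appears in spam
ham_counts = defaultdict(int)    # how often each word appears in non-spam

def learn(message, is_spam):
    counts = spam_counts if is_spam else ham_counts
    for word in message.lower().split():
        counts[word] += 1

def spam_score(message):
    # Crude score: fraction of words seen more often in spam than in non-spam.
    words = message.lower().split()
    if not words:
        return 0.0
    spammy = sum(1 for w in words if spam_counts[w] > ham_counts[w])
    return spammy / len(words)

# Invented training examples, just for illustration.
learn("win a free prize now", is_spam=True)
learn("meeting agenda for tomorrow", is_spam=False)

print(spam_score("claim your free prize"))   # higher score means more spam-like

Every new labeled message changes the counts, which changes the score the filter gives to future messages; that feedback loop is what distinguishes machine learning from a fixed, hand-written rule.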
The third combination here is domain knowledge and coding without statistics. Now, Conway labels this a danger zone, with
the idea being that you have enough knowledge to be dangerous. While there are problems with this, I’ll
mention two things. First, Conway himself mentions that it seems
very unlikely that a person could develop both programming expertise and substantive
knowledge without also learning some math and statistics, so he says it would be a sparsely
populated category, and I believe that’s true. On the other hand, there are some really important
data science contributions that come out of this combination, including, for instance,
what are called word counts, which we’ll talk about later. It’s simple stuff. These are procedures that do not require sophisticated
statistics. You’re just counting how often things occur,
and you can get important insights out of that, so I wouldn’t write this one off completely.
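As a quick illustration of how simple that can be, here is a word count in a few lines of Python over a made-up snippet of text; no statistics beyond counting are involved.

from collections import Counter

# Word count: no sophisticated statistics, just counting how often words occur.
text = """The quarterly report mentions churn, churn risk, and customer churn
far more often than growth."""  # made-up snippet, just for illustration

words = [w.strip(".,").lower() for w in text.split()]
counts = Counter(words)

# The most frequent words already hint at what the document is about.
print(counts.most_common(5))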
I would say, though, that, like Conway, I think it’s unlikely that a person could develop expertise in both coding and their domain without getting
the math and statistics as well. And finally, of course, there’s all three
of these things, statistics, domain knowledge, and coding, at once, and that is the most
common definition of data science.
