The first step is to import your data into R. This typically means that you take data stored in a file, database, or web application programming interface (API), and load it into a data frame in R. If you can’t get your data into R, you can’t do data science on it!

Once you’ve imported your data, it is a good idea to tidy it. Tidying your data means storing it in a consistent form that matches the semantics of the dataset with the way it is stored. In brief, when your data is tidy, each column is a variable, and each row is an observation. Tidy data is important because the consistent structure lets you focus your struggle on questions about the data, not on fighting to get the data into the right form for different functions.

Once you have tidy data, a common first step is to transform it. Transformation includes narrowing in on observations of interest (like all people in one city, or all data from the last year), creating new variables that are functions of existing variables (like computing speed from distance and time), and calculating a set of summary statistics (like counts or means). Together, tidying and transforming are called wrangling, because getting your data in a form that’s natural to work with often feels like a fight!

Once you have tidy data with the variables you need, there are two main engines of knowledge generation: visualisation and modelling. These have complementary strengths and weaknesses, so any real analysis will iterate between them many times.

Visualisation is a fundamentally human activity. A good visualisation will show you things that you did not expect, or raise new questions about the data. A good visualisation might also hint that you’re asking the wrong question, or that you need to collect different data. Visualisations can surprise you, but they don’t scale particularly well because they require a human to interpret them.

Models are complementary tools to visualisation. Once you have made your questions sufficiently precise, you can use a model to answer them. Models are fundamentally mathematical or computational tools, so they generally scale well. Even when they don’t, it’s usually cheaper to buy more computers than it is to buy more brains! But every model makes assumptions, and by its very nature a model cannot question its own assumptions. That means a model cannot fundamentally surprise you.

The last step of data science is communication, an absolutely critical part of any data analysis project. It doesn’t matter how well your models and visualisation have led you to understand the data unless you can also communicate your results to others.

Surrounding all these tools is programming. Programming is a cross-cutting tool that you use in every part of the project. You don’t need to be an expert programmer to be a data scientist, but learning more about programming pays off, because becoming a better programmer allows you to automate common tasks and solve new problems with greater ease.

You’ll use these tools in every data science project, but for most projects they’re not enough. There’s a rough 80-20 rule at play: you can tackle about 80% of every project using the tools that you’ll learn in this book, but you’ll need other tools to tackle the remaining 20%. Throughout this book we’ll point you to resources where you can learn more.

The previous description of the tools of data science is organised roughly according to the order in which you use them in an analysis (although of course you’ll iterate through them multiple times). In our experience, however, this is not the best way to learn them:

Starting with data ingest and tidying is sub-optimal, because 80% of the time it’s routine and boring, and the other 20% of the time it’s weird and frustrating. That’s a bad place to start learning a new subject! Instead, we’ll start with visualisation and transformation of data that’s already been imported and tidied. That way, when you ingest and tidy your own data, your motivation will stay high because you know the pain is worth it.

Some topics are best explained with other tools. For example, it’s easier to understand how models work if you already know about visualisation, tidy data, and programming.

Programming tools are not necessarily interesting in their own right, but they do allow you to tackle considerably more challenging problems. We’ll give you a selection of programming tools in the middle of the book, and then you’ll see how they can combine with the data science tools to tackle interesting modelling problems.

Within each chapter, we try and stick to a similar pattern: start with some motivating examples so you can see the bigger picture, and then dive into the details. Each section of the book is paired with exercises to help you practice what you’ve learned. While it’s tempting to skip the exercises, there’s no better way to learn than practicing on real problems. This book proudly focuses on small, in-memory datasets.
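As a concrete sketch of the import step described earlier, here is how a CSV file might be loaded with the readr package (part of the tidyverse). The file name `heights.csv` is hypothetical, used only for illustration:

```r
# Import: read a CSV file into a data frame (a tibble).
# "heights.csv" is a made-up file name; substitute your own path.
library(readr)

heights <- read_csv("heights.csv")
```

`read_csv()` also works with URLs and compressed files, and reports the column types it guessed so you can check them.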
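The tidy-data rule described earlier (each column is a variable, each row is an observation) can be illustrated with the tidyr package. The `scores` table below is invented for illustration; its year variable is spread across column headers, which tidying fixes:

```r
# Tidy: one variable (year) is hidden in the column names y2023/y2024,
# so pivot_longer() reshapes the data to one row per observation.
library(tidyr)

scores <- data.frame(
  student = c("ann", "bob"),
  y2023   = c(90, 85),
  y2024   = c(95, 80)
)

tidy_scores <- pivot_longer(
  scores,
  cols      = c(y2023, y2024),
  names_to  = "year",
  values_to = "score"
)
# tidy_scores now has columns student, year, score,
# with one row per student-year combination.
```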
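The three kinds of transformation described earlier (narrowing in on observations, creating new variables, and summarising) correspond to the dplyr verbs `filter()`, `mutate()`, and `summarise()`. The `trips` data frame below is made up for illustration:

```r
# Transform with dplyr: filter rows, derive a new variable, summarise.
library(dplyr)

trips <- data.frame(
  city     = c("york", "york", "leeds"),
  distance = c(100, 250, 80),   # km
  time     = c(2, 5, 1)         # hours
)

trips %>%
  filter(city == "york") %>%            # narrow to observations of interest
  mutate(speed = distance / time) %>%   # new variable from existing ones
  summarise(                            # summary statistics
    mean_speed = mean(speed),
    n          = n()
  )
```

Chaining the verbs with the pipe keeps each step readable in the order it happens.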