The Data Pipeline is an approach to working with data from beginning to end.
Credit: School of Data
Data-driven projects always have a “define the problem you’re trying to solve” component. It’s at this stage that you start asking questions and home in on the issues that will matter in the end. Defining your problem means going from a theme (e.g. air pollution) to one or more specific questions (has bikesharing reduced air pollution?). Being specific forces you to formulate your question in a way that hints at what kind of data will be needed, which in turn helps you scope your project: is the data you need easily available, or does it sound like some key datasets will be hard to get?
While the problem definition phase hints at what data is needed, finding that data is a separate step of varying difficulty. There are many tools and techniques for it, ranging from a simple question posted to your social networks, to a search engine’s advanced tools (such as Google search operators), to open data portals. This phase can make or break your project: you can’t do much if you can’t find the data!
But this is also where creativity can make a difference: search in non-obvious locations, and don’t give up too soon!
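Open data portals can even be searched programmatically. Many government portals (catalog.data.gov among them) run on CKAN, which exposes a search API. Here is a minimal sketch in Python, assuming a CKAN portal; the search term is illustrative:

```python
# Sketch: searching a CKAN-based open data portal for datasets.
# catalog.data.gov runs CKAN; the query "air quality" is illustrative.
import requests

BASE = "https://catalog.data.gov/api/3/action/package_search"
resp = requests.get(BASE, params={"q": "air quality", "rows": 5}, timeout=30)
resp.raise_for_status()

# CKAN returns JSON; matching datasets live under result -> results.
for dataset in resp.json()["result"]["results"]:
    print(dataset["title"])
    for res in dataset.get("resources", []):
        print("  ", res.get("format"), res.get("url"))
```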
Getting the data from its initial location onto your computer can be short and easy or long and painful. Luckily, there are plenty of ways to do it: you can crowdsource using online forms, you can perform offline data collection, you can use some crazy web scraping skills, or you can simply download the datasets from government sites, using their data portals.
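In the simplest case, “getting” the data is one HTTP download. A sketch with Python’s requests library; the URL below is a placeholder, not a real dataset:

```python
# Sketch: downloading a dataset over HTTP and saving it locally.
# The URL is hypothetical; swap in a real link found on a data portal.
import requests

url = "https://example.org/data/air_quality.csv"  # placeholder URL
response = requests.get(url, timeout=60)
response.raise_for_status()  # fail loudly on 4xx/5xx errors

with open("air_quality.csv", "wb") as f:
    f.write(response.content)
print(f"Saved {len(response.content)} bytes")
```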
We’ve got our hands on the data, but that doesn’t mean it’s the data we need. We have to check whether the details are valid: the metadata, the methodology of collection, whether we know who organised the dataset and whether it’s a credible source. We heard a joke once that’s only funny because it’s true: all data is bad, we just need to find out how bad it is!
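A quick programmatic first look helps answer “how bad is it?”. A minimal pandas sketch, assuming the hypothetical CSV downloaded above:

```python
# Sketch: a first sanity check of a freshly downloaded dataset.
import pandas as pd

df = pd.read_csv("air_quality.csv")

print(df.shape)               # rows, columns: is anything obviously missing?
print(df.dtypes)              # did numbers load as text?
print(df.isna().sum())        # missing values per column
print(df.duplicated().sum())  # duplicate rows hint at collection problems
```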
It’s often the case that the data we get and validate is messy: duplicated rows, column names that don’t match the records, values containing characters that make them difficult for a computer to process, and so on. In this step, we need skills and tools that will help us get the data into a machine-readable format so that we can analyse it. We’re talking about tools like OpenRefine or LibreOffice Calc and concepts like relational databases.
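The same kinds of fixes OpenRefine offers can also be scripted. A pandas sketch under stated assumptions: the column name pm25 and the thousands-separator problem are illustrative, not from a real dataset:

```python
# Sketch: common clean-up steps in pandas (hypothetical columns).
import pandas as pd

df = pd.read_csv("air_quality.csv")

# Drop exact duplicate rows.
df = df.drop_duplicates()

# Normalise column names: strip whitespace, lower-case, no spaces.
df.columns = df.columns.str.strip().str.lower().str.replace(" ", "_")

# Strip characters that stop a column parsing as a number, e.g. "1,234".
df["pm25"] = (
    df["pm25"].astype(str).str.replace(",", "", regex=False).astype(float)
)

df.to_csv("air_quality_clean.csv", index=False)
```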
This is it! Here is where we get insights about the problem we defined in the beginning. We’re gonna use our mad mathematical and statistical skills to interview the dataset like any good journalist, only without a recorder and a notebook. We can analyse datasets using many, many skills and tools: visualisations to get insights into different variables, programming-language packages such as pandas (Python), languages built for the job such as R, spreadsheet applications such as LibreOffice Calc, or even statistical suites like PSPP.
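For instance, a few lines of pandas can already start the interview. The column names (year, station, pm25) are hypothetical, carried over from the cleaning sketch:

```python
# Sketch: asking a cleaned dataset some first questions.
import pandas as pd

df = pd.read_csv("air_quality_clean.csv")

print(df["pm25"].describe())  # overall distribution of the measurements

# Has pollution changed over time? Average per year:
print(df.groupby("year")["pm25"].mean())

# Which measuring stations are the worst?
print(df.groupby("station")["pm25"].mean().sort_values(ascending=False).head())
```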
And, of course, you will need to present your data. Presenting it is all about thinking about your audience, the questions you set out to answer, and the medium you choose to convey your message or start your conversation.
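To make that concrete, one possible presentation is a single chart that answers the opening question. A matplotlib sketch, reusing the hypothetical columns from above:

```python
# Sketch: one chart answering "has air pollution gone down over time?"
# (hypothetical columns, as before).
import pandas as pd
import matplotlib.pyplot as plt

df = pd.read_csv("air_quality_clean.csv")
yearly = df.groupby("year")["pm25"].mean()

fig, ax = plt.subplots()
yearly.plot(ax=ax, marker="o")
ax.set_title("Average PM2.5 by year")
ax.set_xlabel("Year")
ax.set_ylabel("PM2.5 (µg/m³)")
fig.savefig("pm25_by_year.png", dpi=150)
```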