Data Science Pipelines for Complete Beginners. A Brief Intro

Welcome, guys, to another piece from me. It's my second blog write-up, and I hope you can learn as much from my content as I do from you. I'll be talking about useful pipelines in data science projects, not production pipelines. Without further ado, let's get to it!

I'd like to define data pipelines as channels through which data questions are answered. They help you collect, clean, explore, and model your dataset, and then present your findings or apply them to your problem statement, through a set of repeatable, and ideally scalable, steps.

Why would you want to use a pipeline anyway, you might ask? First off, pipelines reduce your chances of making errors while coding. They also automate tasks that you may want to carry out across different datasets. It is often said that data scientists spend 80% of their time cleaning and preparing their data before any prediction or problem-solving happens. Creating a pipeline can save you some of that time and surface key insights into your dataset, so you can focus on model building.

Fig: Pipelines

These pipelines, or function chains as some call them, are collections of helper functions, each created to perform one of the tasks needed in a data science workflow. Collectively, they help you solve a problem or answer a set of questions that bring insight into your dataset. Each helper function can be short and simple, yet effective (a small piece of the pie), and together they make up a pipeline that achieves a complex result, as the sketch below shows.
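For instance, here is a minimal sketch of what such a function chain might look like. The helper names (load_data, drop_nulls, summarize) and the CSV input are hypothetical placeholders, not a fixed recipe:

import pandas as pd

def load_data(path):
    """Read a CSV file into a DataFrame."""
    return pd.read_csv(path)

def drop_nulls(df):
    """Remove rows that contain missing values."""
    return df.dropna()

def summarize(df):
    """Print a quick statistical overview, then pass the data along."""
    print(df.describe())
    return df

def run_pipeline(path):
    """Chain the helpers: each small step feeds the next."""
    return summarize(drop_nulls(load_data(path)))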

This article focuses on the helper functions that make up pipelines, and on possible functions you might want to write for your next data science project. In the long run, they can make your hackathon notebooks neater and faster to write. In a sense, a pipeline is a composition of all the helper functions we write while building it.

Across hackathons, we usually end up with several processes that apply to almost all of our data science projects: obtaining and loading the data, dropping null values, inspecting missing and unique values, checking the distribution of the data, and perhaps visualizing several columns in the dataset. Pipelines come in handy for automating this data science or machine learning workflow; a reusable inspection helper along these lines is sketched right below.
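As one example, a helper like the hypothetical inspect_columns below could report missing and unique values for any DataFrame you hand it; this is just a sketch of one way to write such a check:

import pandas as pd

def inspect_columns(df):
    """Summarize missing and unique values per column -- the kind of
    check repeated in almost every project."""
    report = pd.DataFrame({
        'missing': df.isnull().sum(),
        'missing_pct': df.isnull().mean() * 100,
        'unique': df.nunique(),
        'dtype': df.dtypes,
    })
    # Put the most incomplete columns first
    return report.sort_values('missing_pct', ascending=False)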

Check the code below for some helpful functions you might want to write when creating a pipeline for your next data science project or hackathon:

import pandas as pd

class noob_pipe():

    def __init__(self):
        pass

    def load_data(self, path):
        """
        Load the dataset
        """
        # Enter some lines of code, e.g. read a CSV from disk
        df = pd.read_csv(path)

        # display() is available in Jupyter/IPython notebooks
        display(df.head())

        return df

    def missing_data(self, df, threshold=0.5):
        """
        List or visualize the missing data in the dataset; drop columns above some threshold too
        """
        # Enter some lines of code, e.g. drop columns that are mostly empty
        df = df.loc[:, df.isnull().mean() <= threshold]

        return df

    def encode_data(self, df, columns, encoder='label'):
        """
        Encode the dataset either by one-hot encoding or by label encoding
        """
        # Enter some lines of code (see the sketch after this class for one possible body)
        encoded_data = df.copy()

        return encoded_data

    # Feel free to write other helper functions to help model and visualize your data
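To make the skeleton concrete, here is one way the encode_data body might be filled in, using pandas and scikit-learn. Treat it as a sketch under those assumptions, not the only correct implementation:

import pandas as pd
from sklearn.preprocessing import LabelEncoder

def encode_data(df, columns, encoder='label'):
    """One possible body for the encode_data helper above."""
    encoded_data = df.copy()
    if encoder == 'label':
        # Map each category in each column to an integer
        for col in columns:
            encoded_data[col] = LabelEncoder().fit_transform(encoded_data[col].astype(str))
    elif encoder == 'onehot':
        # Expand each column into one binary column per category
        encoded_data = pd.get_dummies(encoded_data, columns=columns)
    else:
        raise ValueError("encoder must be 'label' or 'onehot'")
    return encoded_data

Once the methods are filled in, the whole pipeline reads as just a few lines in a notebook (the file name and column names here are made up for illustration):

pipe = noob_pipe()
df = pipe.load_data('train.csv')
df = pipe.missing_data(df, threshold=0.5)
df = pipe.encode_data(df, columns=['gender', 'city'], encoder='onehot')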

If you found this article helpful, think it was a good read, or have some thoughts to share, feel free to show some love in the comment section. Thanks for your time.