    Pandas for Machine learning

    By Return Script | November 23, 2021 | Updated: May 2, 2022

    pandas is one of the most popular data manipulation libraries in Python. It offers data structures and operations for manipulating numerical tables and time series data, and the official pandas documentation describes the full API. The two primary components of pandas are the Series and the DataFrame. A Series is a single column, while a DataFrame is a two-dimensional table made up of a collection of Series. The two are similar enough that many operations work on both, for example filling null values or calculating the mean. In this article you will learn how to analyze data by performing different operations with the pandas library.

    How to install Pandas

    If you already have Python and pip installed, installing pandas is simple. Run the following command to install pandas on your computer.

    pip install pandas

    If you have not installed Python and PyCharm on your machine yet, set them up first before continuing.

    Creating a data frame from scratch

    Creating a data frame is an important first step because the methods and functions of pandas are applied to it. There are many ways to create a data frame, but the easiest is to use a dictionary. A data frame created from a dictionary gets a default numeric index starting at 0, but we can also supply our own index when we create it. The following code shows how to create a data frame from a dictionary.

    import pandas as pd

    # each dictionary key becomes a column name
    data = {
        'apples': [3, 2, 0, 1],
        'oranges': [0, 3, 7, 2]
    }
    df=pd.DataFrame(data)
    df

    We can define the index values ourselves in the following way.

    # one label per row, passed as a flat list
    df=pd.DataFrame(data,index=['sudhan', 'pitamber', 'manish', 'kiran'])
    df

    Read in data

    Reading data means loading it from a file into your program. Data comes in many file formats, but the most popular one in the field of data science is .csv, which can be read with the read_csv() function. We pass the file name as an argument to the function. If the file is located in a different directory, we need to provide its path, otherwise pandas raises an error because it cannot find the file.
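
As a minimal sketch of read_csv(), the example below reads a tiny made-up CSV held in memory with io.StringIO, so it runs without any file on disk. The column names and values are illustrative assumptions, not part of the Titanic data.

```python
import io
import pandas as pd

# Hypothetical CSV content; in practice you would pass a file path instead.
csv_text = "name,age\nsudhan,25\nkiran,31\n"

# read_csv accepts a path or any file-like object.
people = pd.read_csv(io.StringIO(csv_text))
print(people)
```

With a real file, the StringIO object is replaced by the file name or path, for example pd.read_csv('titanic_train.csv').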

    Converting back to CSV

    After extensive work on your raw dataset, you will usually want to save the result on your local device. The .to_csv() method saves your data in CSV format. Pass the file name, including the extension, as an argument.

    df.to_csv("target_file_name.csv")
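
By default to_csv() also writes the index as an extra first column, which then reappears as an unnamed column when the file is read back. The sketch below, which assumes a throwaway temporary directory and a hypothetical file name, passes index=False to avoid that and reads the file back to check the round trip.

```python
import os
import tempfile
import pandas as pd

df = pd.DataFrame({'apples': [3, 2, 0, 1], 'oranges': [0, 3, 7, 2]})

with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, "fruit.csv")  # hypothetical file name
    df.to_csv(path, index=False)   # index=False leaves out the row labels
    restored = pd.read_csv(path)   # read the file back while it still exists

print(restored)
```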

    Some data frame operations

    Data Viewing

    The first thing to do after importing a data set is to view it. You can do this with the .head() method, which returns the first five rows by default. If you want a different number of rows, pass that number as an argument. Similarly, the .tail() method returns the last five rows. The following code shows .head() in use. In this article we use the Titanic data set, loaded from titanic_train.csv.

    dataframe=pd.read_csv('titanic_train.csv')
    dataframe.head()
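
To illustrate the optional row count, here is a small sketch on a made-up ten-row data frame (the data is an assumption for illustration, not the Titanic set).

```python
import pandas as pd

numbers = pd.DataFrame({'value': range(10)})  # made-up data: 0 to 9

first_three = numbers.head(3)   # first 3 rows instead of the default 5
last_two = numbers.tail(2)      # last 2 rows
print(first_three)
print(last_two)
```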

    Getting information about the dataset

    info() provides the essential details about your dataset, such as the number of rows and columns, the number of non-null values, what type of data is in each column, and how much memory your DataFrame is using. Another fast and useful attribute is .shape, which outputs just a tuple of (rows, columns).

    dataframe.info()
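
The difference between the two is that .shape is an attribute while info() is a method. A short sketch on the small fruit data frame from earlier in the article:

```python
import pandas as pd

fruit = pd.DataFrame({'apples': [3, 2, 0, 1], 'oranges': [0, 3, 7, 2]})

rows, columns = fruit.shape   # shape is an attribute, so no parentheses
print(rows, columns)
fruit.info()                  # prints column names, non-null counts and dtypes
```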

    Handling duplicates

    Handling duplicate data is one of the more troublesome tasks in data analysis. One of the simplest ways to remove duplicates is the drop_duplicates() method, which returns a copy of your dataset with the duplicate rows removed. Its keep argument controls which of the duplicated rows survive and accepts the following values.

    • first: (default) Drop duplicates except for the first occurrence.
    • last: Drop duplicates except for the last occurrence.
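    • False: Drop all duplicates.

    # stack the data frame on itself so that every row appears twice
    # (DataFrame.append was removed in pandas 2.0, so pd.concat is used)
    dataframe=pd.concat([dataframe, dataframe])
    print(dataframe.shape)
    dataframe=dataframe.drop_duplicates()
    print(dataframe.shape)
    dataframe.head()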
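
The three values of keep can be compared side by side. This is a minimal sketch on made-up data in which one row appears twice.

```python
import pandas as pd

# Made-up data: the row ('a', 1) appears at index 0 and again at index 2.
records = pd.DataFrame({'key': ['a', 'b', 'a'], 'value': [1, 2, 1]})

keep_first = records.drop_duplicates(keep='first')  # keeps index 0 and 1
keep_last = records.drop_duplicates(keep='last')    # keeps index 1 and 2
keep_none = records.drop_duplicates(keep=False)     # keeps only index 1

print(keep_first.index.tolist())
print(keep_last.index.tolist())
print(keep_none.index.tolist())
```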

    How to work with missing values

    When exploring data you will encounter missing or null values, which are essentially placeholders for non-existent values. Most commonly you will see Python's None or NumPy's np.nan. To handle null values you first have to find them, which you can do with the isnull() method. It returns True or False for each cell depending on whether the cell is null. Chaining sum() then gives the number of null values in each column.

    dataframe.isnull().sum()

    Removing null values

    Removing null values is easy with the dropna() method, which drops every row that contains at least one null value. It returns a new data frame without altering the original one. Instead of rows, you can also drop the columns that contain null values by setting axis=1.

    dataframe.dropna()
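
A small sketch on made-up data shows both directions and confirms that the original data frame is left untouched.

```python
import numpy as np
import pandas as pd

# Made-up data with one missing value in column 'b'.
sample = pd.DataFrame({'a': [1, 2, 3], 'b': [4.0, np.nan, 6.0]})

without_rows = sample.dropna()          # drops the row that contains NaN
without_cols = sample.dropna(axis=1)    # drops the column that contains NaN

# sample itself still has all 3 rows and 2 columns
print(sample.shape, without_rows.shape, without_cols.shape)
```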

    Imputation

    Imputation is a conventional feature engineering technique used to keep valuable data that have null values. There may be instances where dropping every row with a null value removes too big a chunk from your dataset, so instead we can impute that null with another value, usually the mean or the median of that column.

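    # count the missing ages before imputing
    dataframe['age'].isnull().sum()
    age_mean=dataframe['age'].mean()
    # assign the result back so the data frame itself is updated
    dataframe['age']=dataframe['age'].fillna(age_mean)
    # count again: this should now be 0
    dataframe['age'].isnull().sum()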
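
The median mentioned above can be used in the same way and is less affected by extreme values than the mean. A minimal sketch on a made-up Series of ages:

```python
import numpy as np
import pandas as pd

ages = pd.Series([20.0, np.nan, 30.0, 70.0], name='age')  # made-up data

age_median = ages.median()          # NaN is skipped when computing the median
filled = ages.fillna(age_median)    # returns a new Series, ages is unchanged

print(filled.tolist())
```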

    Summary of data

    The describe() method summarizes the distribution of the continuous (numeric) variables, reporting the count, mean, standard deviation, minimum, quartiles and maximum of each column. The following code prints this summary for the dataset.

    dataframe.describe()

    Data frame slicing, selecting and extracting

    We can select columns by passing their names as a list. For selecting rows we have two options: loc, which locates rows by index label, and iloc, which locates rows by numerical position.

    subset = dataframe[['pid', 'age']]
    subset.head()
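
Row selection with loc and iloc can be sketched on the small fruit data frame with named rows used earlier in the article.

```python
import pandas as pd

fruit = pd.DataFrame(
    {'apples': [3, 2, 0, 1], 'oranges': [0, 3, 7, 2]},
    index=['sudhan', 'pitamber', 'manish', 'kiran'],
)

by_label = fruit.loc['manish']     # row whose index label is 'manish'
by_position = fruit.iloc[2]        # third row (positions start at 0)

label_slice = fruit.loc['pitamber':'manish']  # label slices include both ends
position_slice = fruit.iloc[1:3]              # position slices exclude the end

print(by_label)
print(position_slice)
```

Both slices return the same two rows here; the difference is only in how the rows are addressed.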

    Conditional selection

    Extracting the rows that satisfy a condition is easy: put the condition inside the square brackets. The following code selects only the rows where sex is 'male'.

    male=dataframe[dataframe['sex']=='male']
    male.head(2)

    Apply functions

    It is possible to iterate over a DataFrame or Series as you would with a list, but doing so, especially on large datasets, is very slow. An efficient alternative is to apply() a function to the dataset.

    def age_category(x):
        # under 15 -> child, 15 to 49 -> young, 50 and above -> old
        if x < 15:
            return "child"
        elif x < 50:
            return "young"
        else:
            return 'old'
    dataframe['agecategory']=dataframe['age'].apply(age_category)
    dataframe.head()
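
For simple binning like this, pandas also offers pd.cut, which is a different technique from apply() and avoids calling a Python function once per row. The sketch below assumes the intended bands are below 15, 15 to 49, and 50 or older, and uses made-up ages. Unlike a hand-written function, pd.cut leaves missing ages as missing.

```python
import numpy as np
import pandas as pd

ages = pd.Series([4, 15, 30, 50, 80])  # made-up data

# right=False makes each bin include its left edge:
# [-inf, 15), [15, 50), [50, inf)
categories = pd.cut(
    ages,
    bins=[-np.inf, 15, 50, np.inf],
    labels=['child', 'young', 'old'],
    right=False,
)
print(categories.tolist())
```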

    Conclusion

    This is the end of the article, and I hope you learned something useful from it. I apologize for any mistakes in wording. Thank you for your attention. Stay tuned for the next article. If you have any questions regarding this article, please feel free to comment below.
