pandas is one of the most popular data manipulation libraries in Python. It offers data structures and operations for manipulating numerical tables and time series data; you can get the pandas documentation from here. The primary components of pandas are the Series and the DataFrame: a Series is a single column, while a DataFrame is a multi-dimensional table made up of a collection of Series. These two components behave very similarly, so many operations, such as filling null values or calculating the mean, work on both. In this article you will learn how to analyze data by performing different operations with the pandas library.
How to install Pandas
If you already have Python and pip installed, installing pandas is very simple. Just run this command and pandas will be installed on your machine.
pip install pandas
If you have not installed Python and PyCharm on your machine, here is a link that will help you.
Creating a data frame from scratch
Creating a DataFrame is an important first step because it lets us apply pandas methods and functions to our data. There are many ways to create a DataFrame, but the easiest is from a dictionary. When we create a DataFrame from a dictionary, pandas assigns a default numeric index, but we can also initialize our own index. The following code shows how to create a DataFrame from a dictionary.
import pandas as pd

data = {
'apples': [3, 2, 0, 1],
'oranges': [0, 3, 7, 2]
}
df=pd.DataFrame(data)
df

We can define the index values ourselves in the following way.
df = pd.DataFrame(data, index=['sudhan', 'pitamber', 'manish', 'kiran'])
df
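With a labeled index in place, rows can be looked up by name. A minimal sketch, reusing the fruit data from above:

```python
import pandas as pd

data = {
    'apples': [3, 2, 0, 1],
    'oranges': [0, 3, 7, 2]
}
df = pd.DataFrame(data, index=['sudhan', 'pitamber', 'manish', 'kiran'])

# Look up a single row by its index label
print(df.loc['pitamber'])  # apples: 2, oranges: 3
```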

Read in data
Reading data means importing it into your program. Files come in many formats, but the most popular in the field of data science is .csv, which can be read with the read_csv() function. Pass your file name as an argument to the function; if the file is located somewhere other than the working directory, you need to provide its path, otherwise you will get an error.
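A minimal sketch of read_csv() (the file name people.csv and its columns are made up for illustration; the example writes a tiny CSV first so it is self-contained):

```python
import pandas as pd

# Create a small CSV file so the example can run on its own
pd.DataFrame({'name': ['Alice', 'Bob'], 'age': [30, 25]}).to_csv('people.csv', index=False)

# read_csv() loads the file into a DataFrame; pass a full path
# if the file lives outside the working directory
df = pd.read_csv('people.csv')
print(df.shape)  # (2, 2)
```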
Converting back to CSV
After extensive work on your raw dataset, you may want to save the result to your local machine. The .to_csv() method saves your data in CSV format; pass the target file name, including the extension, as an argument.
df.to_csv("target_file_name.csv")
Some data frame operations
Data Viewing
The first thing to do after importing a dataset is to view it. You can accomplish this with the .head() method, which prints the first five rows; if you want a different number of rows, pass that number as an argument. Similarly, the .tail() method prints the last five rows. The following code illustrates how these two methods work. In this article we use the Titanic dataset, which you can get from here.
dataframe=pd.read_csv('titanic_train.csv')
dataframe.head()

Getting Information about dataset
info() provides the essential details about your dataset, such as the number of rows and columns, the number of non-null values, what type of data is in each column, and how much memory your DataFrame is using. Another fast and useful attribute is .shape, which outputs just a tuple of (rows, columns).
dataframe.info()
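A small self-contained sketch of both (the toy columns are illustrative); note that .shape is an attribute, not a method, so it takes no parentheses:

```python
import pandas as pd

df = pd.DataFrame({'a': [1, 2, 3], 'b': [4.0, None, 6.0]})

# .info() prints dtypes, non-null counts and memory usage
df.info()

# .shape returns just a (rows, columns) tuple
print(df.shape)  # (3, 2)
```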

Handling duplicates
Handling duplicate data is one of the more tedious tasks when performing data analysis. The simplest way to remove duplicates is the drop_duplicates() method, which returns a copy of your DataFrame with the duplicated rows removed. Its keep argument controls which occurrences are dropped:
- first: (default) Drop duplicates except for the first occurrence.
- last: Drop duplicates except for the last occurrence.
- False: Drop all duplicates.
dataframe = pd.concat([dataframe, dataframe])  # DataFrame.append was removed in pandas 2.0
print(dataframe.shape)
dataframe=dataframe.drop_duplicates()
print(dataframe.shape)
dataframe.head()
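The three keep options can be seen side by side on a toy DataFrame (the columns x and y are made up for illustration):

```python
import pandas as pd

# Rows 0 and 1 are duplicates of each other
df = pd.DataFrame({'x': [1, 1, 2], 'y': ['a', 'a', 'b']})

print(df.drop_duplicates().shape)             # keep='first' (default) -> (2, 2)
print(df.drop_duplicates(keep='last').shape)  # keeps the last copy    -> (2, 2)
print(df.drop_duplicates(keep=False).shape)   # drops every duplicate  -> (1, 2)
```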
How to work with missing values
When exploring data you will encounter missing or null values, which are essentially placeholders for non-existent values. Most commonly you will see Python's None or NumPy's np.nan. To handle null values, first check where data is missing with the isnull() method, which returns True or False for each cell depending on its null status. Chaining the sum() method then returns the number of null values in each column.
dataframe.isnull().sum()

Removing null values
Removing null values is very easy: just use the dropna() method. Note that it returns a new DataFrame without altering the original. Besides dropping rows, you can also drop columns that contain null values by setting axis=1.
dataframe.dropna()
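The difference between dropping rows and dropping columns can be sketched on a toy DataFrame (column names are illustrative):

```python
import pandas as pd
import numpy as np

df = pd.DataFrame({'a': [1, np.nan, 3], 'b': [4, 5, 6]})

print(df.dropna().shape)        # drops the row containing NaN    -> (2, 2)
print(df.dropna(axis=1).shape)  # drops the column containing NaN -> (3, 1)
```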

Imputation
Imputation is a conventional feature engineering technique used to keep valuable data that have null values. There may be instances where dropping every row with a null value removes too big a chunk from your dataset, so instead we can impute that null with another value, usually the mean or the median of that column.
data_age = dataframe['age']
data_age.isnull().sum()
age_mean = data_age.mean()
# Assign the result back instead of using inplace=True on a column slice
dataframe['age'] = data_age.fillna(age_mean)
dataframe['age'].isnull().sum()
Summary of data
The describe() method gives distribution statistics for the continuous variables. The following code prints a summary of the dataset.
dataframe.describe()

Data frame slicing, selecting and extracting
We can select columns by passing their names as a list. For selecting rows we have two options: loc, which selects by label, and iloc, which selects by numerical index.
subset = dataframe[['pid', 'age']]
subset.head()
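The loc/iloc distinction can be sketched on a toy DataFrame (the pid and age columns and p1–p3 labels are made up for illustration):

```python
import pandas as pd

df = pd.DataFrame({'pid': [1, 2, 3], 'age': [22, 38, 26]},
                  index=['p1', 'p2', 'p3'])

# loc selects by label, iloc by integer position
print(df.loc['p2', 'age'])   # 38
print(df.iloc[1, 1])         # 38 (same cell, addressed by position)
print(df.loc[['p1', 'p3']])  # several rows at once, by label
```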

Conditional selection
Applying a condition to extract specific rows is very easy. You can use the following code for conditional selection.
male=dataframe[dataframe['sex']=='male']
male.head(2)

Apply functions
It is possible to iterate over a DataFrame or Series as you would with a list, but doing so — especially on large datasets — is very slow. An efficient alternative is to apply() a function to the dataset.
def age_category(x):
    if x < 15:
        return "child"
    elif x < 50:
        return "young"
    else:
        return "old"
dataframe['agecategory']=dataframe['age'].apply(age_category)
dataframe.head()

Conclusion
This is the end of the article; I hope you got a good lesson from it. Thank you for your attention, and stay tuned for the next article. If you are searching for a free Python course, here is a link. If you have any questions about this article, please feel free to comment below.