# Introduction to Pandas for Beginners

Pandas is a powerful Python library for data manipulation and analysis. It provides easy-to-use data structures and analysis tools for working with numerical tables and time series data. This guide is intended for beginners with no prior experience in data analysis or Pandas.

---

## Installation

To use Pandas, you must first install it by running the following command in a terminal:

```bash
pip install pandas
```
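
Once the install finishes, one quick way to confirm that Pandas can be imported is to print its version. This is only a minimal check, and the version string you see depends on your environment.

```python
import pandas as pd

# Print the installed version; the exact value depends on your environment
print(pd.__version__)
```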

---

## Data Structures

Pandas has two main data structures: `Series` and `DataFrame`.

A `Series` is a one-dimensional, labeled array that can hold any data type. It is similar to a column in a spreadsheet or a vector in R. Here's an example of creating a `Series`:

```{.python .numberLines}
import pandas as pd

data = [1, 2, 3, 4]
s = pd.Series(data)
print(s)
```
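
A `Series` also carries an index of labels alongside its values. The sketch below assumes some made-up labels (`'a'` to `'d'`), chosen purely for illustration, to show access by label and by position.

```python
import pandas as pd

# Hypothetical labels chosen for illustration only
s = pd.Series([1, 2, 3, 4], index=['a', 'b', 'c', 'd'])

print(s['b'])     # access a value by its label
print(s.iloc[0])  # access a value by its integer position
print(s.dtype)    # data type of the stored values
```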

. . .

A `DataFrame` is a two-dimensional table of data with labeled rows and columns. It is similar to a spreadsheet or an SQL table.

---

Here's an example of creating a `DataFrame`:

```{.python .numberLines}
import pandas as pd

data = {'name': ['John', 'Jane', 'Sam'],
        'age': [30, 25, 35],
        'city': ['New York', 'San Francisco', 'Los Angeles']}
df = pd.DataFrame(data)
print(df)
```
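
After building a `DataFrame`, it often helps to check its dimensions and column types before doing anything else. The following sketch reuses the same sample data and relies only on the standard `shape`, `columns`, and `dtypes` attributes and the `head()` method.

```python
import pandas as pd

data = {'name': ['John', 'Jane', 'Sam'],
        'age': [30, 25, 35],
        'city': ['New York', 'San Francisco', 'Los Angeles']}
df = pd.DataFrame(data)

print(df.shape)          # (number of rows, number of columns)
print(list(df.columns))  # column labels
print(df.dtypes)         # data type of each column
print(df.head(2))        # first two rows
```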

---

## Data Analysis

Pandas provides a variety of useful tools for data analysis. Here are a few examples:

---

### Selection: Selecting specific columns or rows from a DataFrame.

```{.python .numberLines}
import pandas as pd

data = {'name': ['John', 'Jane', 'Sam'],
        'age': [30, 25, 35],
        'city': ['New York', 'San Francisco', 'Los Angeles']}
df = pd.DataFrame(data)

# select a specific column (returns a Series)
print(df['name'])

# select a single row by its index label
print(df.loc[1])
```
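
Selection can also target several columns, a single cell, or a range of rows by position. The sketch below uses the same sample data; the variable names `subset` and `first_two` are illustrative only.

```python
import pandas as pd

data = {'name': ['John', 'Jane', 'Sam'],
        'age': [30, 25, 35],
        'city': ['New York', 'San Francisco', 'Los Angeles']}
df = pd.DataFrame(data)

# select several columns by passing a list of column names
subset = df[['name', 'city']]
print(subset)

# select a single value by row label and column name
print(df.loc[1, 'city'])

# select the first two rows by integer position
first_two = df.iloc[0:2]
print(first_two)
```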

---

### Filtering: Filtering rows based on a condition.

```{.python .numberLines}
import pandas as pd

data = {'name': ['John', 'Jane', 'Sam'],
        'age': [30, 25, 35],
        'city': ['New York', 'San Francisco', 'Los Angeles']}
df = pd.DataFrame(data)

# filter rows where age is greater than 30
print(df[df['age'] > 30])
```
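
Conditions can be combined with `&` (and) and `|` (or), or replaced by a membership test with `isin`. The sketch below assumes the same sample data. Each comparison is wrapped in parentheses because `&` and `|` bind more tightly than comparison operators in Python.

```python
import pandas as pd

data = {'name': ['John', 'Jane', 'Sam'],
        'age': [30, 25, 35],
        'city': ['New York', 'San Francisco', 'Los Angeles']}
df = pd.DataFrame(data)

# rows where age is at least 30 AND city is New York
both = df[(df['age'] >= 30) & (df['city'] == 'New York')]
print(both)

# rows where city is one of several values
in_cities = df[df['city'].isin(['New York', 'Los Angeles'])]
print(in_cities)
```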

---

### Groupby: Grouping rows based on a column and applying a function to each group.

```{.python .numberLines}
import pandas as pd

data = {'name': ['John', 'Jane', 'Sam', 'John', 'Jane'],
        'age': [30, 25, 35, 40, 22],
        'city': ['New York', 'San Francisco', 'Los Angeles', 'New York', 'San Francisco']}
df = pd.DataFrame(data)

# group by city and calculate the mean age for each group;
# selecting 'age' first avoids averaging the non-numeric 'name' column
print(df.groupby('city')['age'].mean())
```
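
Grouping is not limited to a single statistic. As a sketch, `agg` can compute several summaries per group in one call; the variable name `summary` is illustrative only.

```python
import pandas as pd

data = {'name': ['John', 'Jane', 'Sam', 'John', 'Jane'],
        'age': [30, 25, 35, 40, 22],
        'city': ['New York', 'San Francisco', 'Los Angeles', 'New York', 'San Francisco']}
df = pd.DataFrame(data)

# compute several statistics of 'age' for each city at once
summary = df.groupby('city')['age'].agg(['mean', 'min', 'max', 'count'])
print(summary)
```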

These are just a few examples of the many things you can do with Pandas. Some other useful functionality includes:

---

### Merging: Merging multiple DataFrames together on specific columns.

```{.python .numberLines}
import pandas as pd

data1 = {'name': ['John', 'Jane', 'Sam'],
         'age': [30, 25, 35],
         'city': ['New York', 'San Francisco', 'Los Angeles']}
data2 = {'name': ['Sam', 'Jane', 'John'],
         'gender': ['M', 'F', 'M']}

df1 = pd.DataFrame(data1)
df2 = pd.DataFrame(data2)

# merge the two DataFrames on the 'name' column
merged_df = pd.merge(df1, df2, on='name')
print(merged_df)
```
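
By default `pd.merge` performs an inner join, so rows without a match in both tables are dropped. The sketch below assumes a hypothetical second table that has no entry for Sam, to show how `how='left'` keeps every row of the first table.

```python
import pandas as pd

df1 = pd.DataFrame({'name': ['John', 'Jane', 'Sam'],
                    'age': [30, 25, 35]})
# Hypothetical second table with no row for Sam
df2 = pd.DataFrame({'name': ['John', 'Jane'],
                    'gender': ['M', 'F']})

# inner join (the default): only names present in both tables
inner = pd.merge(df1, df2, on='name')
print(inner)

# left join: keep every row of df1; missing matches become NaN
left = pd.merge(df1, df2, on='name', how='left')
print(left)
```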

---

### Sorting: Sorting a DataFrame by one or multiple columns.

```{.python .numberLines}
import pandas as pd

data = {'name': ['John', 'Jane', 'Sam'],
        'age': [30, 25, 35],
        'city': ['New York', 'San Francisco', 'Los Angeles']}
df = pd.DataFrame(data)

# sort by age in ascending order; sort_values returns a new DataFrame
sorted_df = df.sort_values(by='age', ascending=True)
print(sorted_df)
```
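
Sorting by multiple columns takes a list of column names and, optionally, a matching list of sort directions. The sketch below adds a hypothetical fourth person ('Ann') so that two rows share the same age and the second sort key matters.

```python
import pandas as pd

# 'Ann' is a hypothetical extra row added to create a tie on age
data = {'name': ['John', 'Jane', 'Sam', 'Ann'],
        'age': [30, 25, 35, 30],
        'city': ['New York', 'San Francisco', 'Los Angeles', 'Boston']}
df = pd.DataFrame(data)

# sort by age (descending), then by name (ascending) to break ties
sorted_df = df.sort_values(by=['age', 'name'], ascending=[False, True])
print(sorted_df)
```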

---

### Data Cleaning: Handling missing values and duplicates.

```{.python .numberLines}
import pandas as pd

data = {'name': ['John', 'Jane', 'Sam', None],
        'age': [30, 25, 35, None],
        'city': ['New York', 'San Francisco', 'Los Angeles', 'New York']}
df = pd.DataFrame(data)

# drop rows with missing values (returns a new DataFrame)
cleaned = df.dropna()
print(cleaned)

# drop duplicate rows (returns a new DataFrame)
deduplicated = df.drop_duplicates()
print(deduplicated)
```
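
Dropping rows is not the only option. As a sketch, missing values can be counted with `isna` and filled with `fillna`, and repeated rows can be flagged with `duplicated`. The sample data below is hypothetical and includes a repeated 'Jane' row so that there is a duplicate to find; the fill values are arbitrary choices for illustration.

```python
import pandas as pd

# Hypothetical data: row 2 repeats row 1, and row 3 has missing values
data = {'name': ['John', 'Jane', 'Jane', None],
        'age': [30, 25, 25, None],
        'city': ['New York', 'San Francisco', 'San Francisco', 'New York']}
df = pd.DataFrame(data)

# count missing values in each column
print(df.isna().sum())

# fill missing values column by column (fill values chosen for illustration)
filled = df.fillna({'name': 'Unknown', 'age': df['age'].mean()})
print(filled)

# flag rows that repeat an earlier row, then drop them
print(df.duplicated())
deduplicated = filled.drop_duplicates()
print(deduplicated)
```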

---

### Renaming: Renaming one or more columns in a DataFrame.

```{.python .numberLines}
import pandas as pd

# Create a sample DataFrame
data = {'name': ['John', 'Jane', 'Sam'],
        'age': [30, 25, 35],
        'city': ['New York', 'San Francisco', 'Los Angeles']}
df = pd.DataFrame(data)

# Print the original DataFrame
print(df)

# Rename the 'name' column to 'username'
df.rename(columns={'name': 'username'}, inplace=True)

# Print the DataFrame after renaming
print(df)
```

You can also rename multiple columns at once by passing a dictionary that maps old column names to new ones.

```{.python .numberLines}
df.rename(columns={'age': 'Age', 'city': 'City'}, inplace=True)
```

The `inplace=True` argument modifies the existing DataFrame directly instead of returning a new one. If you want to keep the original DataFrame unchanged and get a new DataFrame with the changes, set `inplace=False` or omit the argument, since `False` is the default.
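
The difference between the two modes can be checked directly. The sketch below uses a small hypothetical DataFrame and compares the column names before and after each call.

```python
import pandas as pd

# Small hypothetical DataFrame for illustration
df = pd.DataFrame({'name': ['John', 'Jane'], 'age': [30, 25]})

# Without inplace, rename returns a new DataFrame and leaves df unchanged
renamed = df.rename(columns={'name': 'username'})
print(list(df.columns))       # original column names are untouched
print(list(renamed.columns))  # the returned copy has the new name

# With inplace=True, df itself is modified
df.rename(columns={'age': 'Age'}, inplace=True)
print(list(df.columns))
```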

With this library, you can handle and analyze large datasets with ease. The official Pandas documentation is a great resource for learning more about what the library can do.