
# Introduction to Pandas for Beginners

Pandas is a Python library for data manipulation and analysis. It provides data structures and tools for working with tabular data and time series. This guide is intended for beginners with no prior experience in data analysis or Pandas.

---

## Installation

To use Pandas, you must first install it. You can do this by running the following command in your command line:

```bash
pip install pandas
```
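
To confirm that the installation worked, one quick check is to import the library and print its version. This is only a sanity check; the version string you see depends on your environment.

```python
import pandas as pd

# Print the installed Pandas version (the value varies by environment)
print(pd.__version__)
```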

---

## Data Structures

Pandas has two main data structures: `Series` and `DataFrame`.

A `Series` is a one-dimensional, array-like object that can hold any data type. Each value has an associated label, called its index. A `Series` is similar to a single column in a spreadsheet or a vector in R. Here's an example of creating a `Series`:

```{.python .numberLines}
import pandas as pd

data = [1, 2, 3, 4]
s = pd.Series(data)
print(s)
```
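
A `Series` does not have to use the default integer index. The sketch below, with illustrative labels and values that are not part of the original example, shows a `Series` with string labels and a `Series` holding text instead of numbers.

```python
import pandas as pd

# A Series with string labels instead of the default 0, 1, 2, ... index
# (labels and values are illustrative)
scores = pd.Series([90, 85, 77], index=['math', 'physics', 'chemistry'])

# Look up a value by its label
print(scores['physics'])

# A Series can also hold non-numeric data, such as strings
names = pd.Series(['John', 'Jane', 'Sam'])
print(names)
```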

. . .

A `DataFrame` is a two-dimensional table of data with rows and columns. It is similar to a spreadsheet or SQL table.

---

Here's an example of creating a `DataFrame`:

```{.python .numberLines}
import pandas as pd

data = {'name': ['John', 'Jane', 'Sam'],
        'age': [30, 25, 35],
        'city': ['New York', 'San Francisco', 'Los Angeles']}
df = pd.DataFrame(data)
print(df)
```
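
Once a `DataFrame` exists, a few attributes and methods help you check what it contains. The following is a minimal sketch that reuses the same sample data.

```python
import pandas as pd

data = {'name': ['John', 'Jane', 'Sam'],
        'age': [30, 25, 35],
        'city': ['New York', 'San Francisco', 'Los Angeles']}
df = pd.DataFrame(data)

# Number of rows and columns, as a (rows, columns) tuple
print(df.shape)

# Column labels
print(list(df.columns))

# First two rows only
print(df.head(2))
```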

---

## Data Analysis

Pandas provides a variety of useful tools for data analysis. Here are a few examples:

---

### Selection: Selecting specific columns or rows from a DataFrame.

```{.python .numberLines}
import pandas as pd

data = {'name': ['John', 'Jane', 'Sam'],
        'age': [30, 25, 35],
        'city': ['New York', 'San Francisco', 'Los Angeles']}
df = pd.DataFrame(data)

# select a specific column
print(df['name'])

# select a single row by its index label
print(df.loc[1])
```
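
Selection by label (`loc`) and selection by position (`iloc`) are easy to confuse when the index is the default 0, 1, 2. The sketch below uses a hypothetical string index (`'a'`, `'b'`, `'c'`) to make the difference visible, and also shows how to select several columns at once.

```python
import pandas as pd

data = {'name': ['John', 'Jane', 'Sam'],
        'age': [30, 25, 35],
        'city': ['New York', 'San Francisco', 'Los Angeles']}
# The string index is an assumption made for illustration
df = pd.DataFrame(data, index=['a', 'b', 'c'])

# loc selects by index label
row_by_label = df.loc['b']

# iloc selects by integer position (0-based)
row_by_position = df.iloc[1]

# Pass a list of column names to select several columns
subset = df[['name', 'city']]

# loc also accepts a row label and a column name together
age_of_b = df.loc['b', 'age']

print(row_by_label)
print(subset)
print(age_of_b)
```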

---

### Filtering: Filtering rows based on a condition.

```{.python .numberLines}
import pandas as pd

data = {'name': ['John', 'Jane', 'Sam'],
        'age': [30, 25, 35],
        'city': ['New York', 'San Francisco', 'Los Angeles']}
df = pd.DataFrame(data)

# filter rows where age is greater than 30
print(df[df['age'] > 30])
```
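
Conditions can be combined with `&` (and) and `|` (or). Each condition needs its own parentheses because of Python's operator precedence. The sketch below also shows `isin`, which matches against a list of allowed values; the specific conditions are chosen only for illustration.

```python
import pandas as pd

data = {'name': ['John', 'Jane', 'Sam'],
        'age': [30, 25, 35],
        'city': ['New York', 'San Francisco', 'Los Angeles']}
df = pd.DataFrame(data)

# Rows where age is at least 30 AND city is not Los Angeles
both = df[(df['age'] >= 30) & (df['city'] != 'Los Angeles')]
print(both)

# Rows where city is one of the listed values
in_cities = df[df['city'].isin(['New York', 'San Francisco'])]
print(in_cities)
```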

---

### Groupby: Grouping rows based on a column and applying a function to each group.

```{.python .numberLines}
import pandas as pd

data = {'name': ['John', 'Jane', 'Sam', 'John', 'Jane'],
        'age': [30, 25, 35, 40, 22],
        'city': ['New York', 'San Francisco', 'Los Angeles','New York', 'San Francisco']}
df = pd.DataFrame(data)

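# group by city and calculate the mean age for each group
# (select the numeric 'age' column first; averaging the text 'name' column fails in recent Pandas versions)
print(df.groupby('city')['age'].mean())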
```
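
A group can also be summarized with several statistics at once. The sketch below selects the numeric `age` column and passes a list of function names to `agg`; the choice of statistics is illustrative.

```python
import pandas as pd

data = {'name': ['John', 'Jane', 'Sam', 'John', 'Jane'],
        'age': [30, 25, 35, 40, 22],
        'city': ['New York', 'San Francisco', 'Los Angeles', 'New York', 'San Francisco']}
df = pd.DataFrame(data)

# One row per city, one column per statistic
summary = df.groupby('city')['age'].agg(['mean', 'min', 'max', 'count'])
print(summary)
```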

These are just a few examples of the many things you can do with Pandas. Some other useful functionality includes:

---

### Merging: Merging multiple DataFrames together on specific columns.

```{.python .numberLines}
import pandas as pd

data1 = {'name': ['John', 'Jane', 'Sam'],
        'age': [30, 25, 35],
        'city': ['New York', 'San Francisco', 'Los Angeles']}
data2 = {'name': ['Sam', 'Jane', 'John'],
        'gender': ['M', 'F', 'M']}

df1 = pd.DataFrame(data1)
df2 = pd.DataFrame(data2)

# merge two dataframes on name column
merged_df = pd.merge(df1, df2, on='name')
print(merged_df)
```
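
By default, `pd.merge` keeps only rows whose key appears in both DataFrames (an inner join). The `how` parameter changes this. The sketch below uses hypothetical data in which the two tables do not match exactly (`'Alex'` is an invented name added for illustration), so the difference between join types is visible.

```python
import pandas as pd

people = pd.DataFrame({'name': ['John', 'Jane', 'Sam'],
                       'age': [30, 25, 35]})
# 'John' is missing here and 'Alex' is extra (illustrative data)
genders = pd.DataFrame({'name': ['Jane', 'Sam', 'Alex'],
                        'gender': ['F', 'M', 'M']})

# inner (default): only names present in both DataFrames
inner = pd.merge(people, genders, on='name')

# left: every row from people; missing matches become NaN
left = pd.merge(people, genders, on='name', how='left')

# outer: every name from either DataFrame
outer = pd.merge(people, genders, on='name', how='outer')

print(inner)
print(left)
print(outer)
```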

---

### Sorting: Sorting a DataFrame by one or multiple columns.

```{.python .numberLines}
import pandas as pd

data = {'name': ['John', 'Jane', 'Sam'],
        'age': [30, 25, 35],
        'city': ['New York', 'San Francisco', 'Los Angeles']}
df = pd.DataFrame(data)

# sort dataframe by age in ascending order
# (sort_values returns a new DataFrame, so print or assign the result)
print(df.sort_values(by='age', ascending=True))
```
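
To sort by multiple columns, pass a list to `by`, and optionally a matching list to `ascending`. The sketch below reuses the five-row sample from the groupby example and sorts by city alphabetically, then by age from oldest to youngest within each city.

```python
import pandas as pd

data = {'name': ['John', 'Jane', 'Sam', 'John', 'Jane'],
        'age': [30, 25, 35, 40, 22],
        'city': ['New York', 'San Francisco', 'Los Angeles', 'New York', 'San Francisco']}
df = pd.DataFrame(data)

# city ascending, then age descending within each city
sorted_df = df.sort_values(by=['city', 'age'], ascending=[True, False])
print(sorted_df)

# The original DataFrame keeps its original order
print(df)
```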

---

### Data Cleaning: Handling missing values and duplicates.

```{.python .numberLines}
import pandas as pd

data = {'name': ['John', 'Jane', 'Sam', None],
        'age': [30, 25, 35, None],
        'city': ['New York', 'San Francisco', 'Los Angeles', 'New York']}
df = pd.DataFrame(data)

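# drop rows with missing values (returns a new DataFrame)
cleaned = df.dropna()
print(cleaned)

# drop duplicate rows (returns a new DataFrame)
deduped = df.drop_duplicates()
print(deduped)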
```
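
Dropping rows is not the only option. The sketch below uses illustrative data that contains both a missing value and a true duplicate row. It counts missing values per column, fills missing ages with the column mean (one possible strategy, not a recommendation for every dataset), and drops rows only when a specific column is missing.

```python
import pandas as pd

# Rows 1 and 2 are identical; row 3 has missing values (illustrative data)
data = {'name': ['John', 'Jane', 'Jane', None],
        'age': [30, 25, 25, None],
        'city': ['New York', 'San Francisco', 'San Francisco', 'New York']}
df = pd.DataFrame(data)

# Count missing values in each column
missing_counts = df.isna().sum()
print(missing_counts)

# Fill missing ages with the mean of the known ages
filled = df.fillna({'age': df['age'].mean()})

# Drop rows only where 'name' is missing
named = df.dropna(subset=['name'])

# Remove the repeated row
deduped = df.drop_duplicates()
print(deduped)
```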

---

### Renaming: Renaming one or more columns in a DataFrame.

Here's a simple example of renaming a single column:

```{.python .numberLines}
import pandas as pd

# Create a sample dataframe
data = {'name': ['John', 'Jane', 'Sam'],
        'age': [30, 25, 35],
        'city': ['New York', 'San Francisco', 'Los Angeles']}
df = pd.DataFrame(data)

# Print the original dataframe
print(df)

# Rename 'name' column to 'username'
df.rename(columns={'name': 'username'}, inplace=True)

# Print the dataframe after renaming
print(df)
```

You can also rename multiple columns at once by passing a dictionary of old to new column names.

```{.python .numberLines}
df.rename(columns={'age': 'Age','city': 'City'}, inplace=True)
```

The `inplace=True` argument modifies the existing DataFrame directly, and the method then returns `None`. To leave the original DataFrame untouched and get back a new DataFrame with the changes, set `inplace=False` or omit the argument, since `False` is the default.
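
The difference between the two modes can be seen in a small sketch. The two-row DataFrame here is a shortened version of the sample data, used only for illustration.

```python
import pandas as pd

df = pd.DataFrame({'name': ['John', 'Jane'], 'age': [30, 25]})

# Without inplace: a new DataFrame is returned, df is unchanged
renamed = df.rename(columns={'name': 'username'})
print(list(df.columns))
print(list(renamed.columns))

# With inplace=True: df itself is modified and the call returns None
result = df.rename(columns={'age': 'Age'}, inplace=True)
print(result)
print(list(df.columns))
```

Assigning the returned DataFrame to a variable, as in the first call, tends to make code easier to follow because each step produces an explicit result.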

With these building blocks, you can clean, reshape, and analyze tabular datasets in a few lines of code. The official Pandas documentation is a good resource for learning more about the library's capabilities.