# Introduction to Pandas for Beginners

Pandas is a powerful library for data manipulation and analysis in Python. It provides easy-to-use data structures and data analysis tools for handling and manipulating numerical tables and time series data. This guide is intended for beginners with no prior experience in data analysis or Pandas.

---

## Installation

To use Pandas, you must first install it. You can do this by running the following command in your terminal:

```bash
pip install pandas
```

---

## Data Structures

Pandas has two main data structures: `Series` and `DataFrame`.

A `Series` is a one-dimensional array-like object that can hold any data type. It is similar to a column in a spreadsheet or a vector in R. Here's an example of creating a `Series`:

```{.python .numberLines}
import pandas as pd

data = [1, 2, 3, 4]
s = pd.Series(data)
print(s)
```

. . .

A `DataFrame` is a two-dimensional table of data with rows and columns. It is similar to a spreadsheet or SQL table.

---

Here's an example of creating a `DataFrame`:

```{.python .numberLines}
import pandas as pd

data = {'name': ['John', 'Jane', 'Sam'],
        'age': [30, 25, 35],
        'city': ['New York', 'San Francisco', 'Los Angeles']}
df = pd.DataFrame(data)
print(df)
```

---

## Data Analysis

Pandas provides a variety of useful tools for data analysis. Here are a few examples:

---

### Selection:

Selecting specific columns or rows from a DataFrame.

```{.python .numberLines}
import pandas as pd

data = {'name': ['John', 'Jane', 'Sam'],
        'age': [30, 25, 35],
        'city': ['New York', 'San Francisco', 'Los Angeles']}
df = pd.DataFrame(data)

# select a specific column (returns a Series)
print(df['name'])

# select a row by its index label
print(df.loc[1])
```

---

### Filtering:

Filtering rows based on a condition.

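
Filtering relies on a boolean mask: a comparison on a column produces a `Series` of `True`/`False` values, and indexing the DataFrame with that mask keeps only the `True` rows. The following is a minimal sketch of that idea using the same sample data as the rest of this guide; the variable names `mask` and `young_or_ny` are illustrative only.

```python
import pandas as pd

df = pd.DataFrame({
    'name': ['John', 'Jane', 'Sam'],
    'age': [30, 25, 35],
    'city': ['New York', 'San Francisco', 'Los Angeles'],
})

# The comparison alone yields a boolean Series, one True/False per row
mask = df['age'] > 30
print(mask)

# Conditions can be combined with & (and) and | (or);
# each condition needs its own parentheses
young_or_ny = df[(df['age'] < 30) | (df['city'] == 'New York')]
print(young_or_ny)
```

The example that follows applies a mask of this kind directly inside the square brackets.
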
```{.python .numberLines}
import pandas as pd

data = {'name': ['John', 'Jane', 'Sam'],
        'age': [30, 25, 35],
        'city': ['New York', 'San Francisco', 'Los Angeles']}
df = pd.DataFrame(data)

# filter rows where age is greater than 30
print(df[df['age'] > 30])
```

---

### Groupby:

Grouping rows based on a column and applying a function to each group.

```{.python .numberLines}
import pandas as pd

data = {'name': ['John', 'Jane', 'Sam', 'John', 'Jane'],
        'age': [30, 25, 35, 40, 22],
        'city': ['New York', 'San Francisco', 'Los Angeles',
                 'New York', 'San Francisco']}
df = pd.DataFrame(data)

# group by city and calculate the mean age for each group
# (select 'age' first: the text column 'name' cannot be averaged)
print(df.groupby('city')['age'].mean())
```

These are just a few examples of the many things you can do with Pandas. Some other useful functionality includes:

---

### Merging:

Merging multiple DataFrames together on specific columns.

```{.python .numberLines}
import pandas as pd

data1 = {'name': ['John', 'Jane', 'Sam'],
         'age': [30, 25, 35],
         'city': ['New York', 'San Francisco', 'Los Angeles']}
data2 = {'name': ['Sam', 'Jane', 'John'],
         'gender': ['M', 'F', 'M']}
df1 = pd.DataFrame(data1)
df2 = pd.DataFrame(data2)

# merge the two dataframes on the shared 'name' column
merged_df = pd.merge(df1, df2, on='name')
print(merged_df)
```

---

### Sorting:

Sorting a DataFrame by one or multiple columns.

```{.python .numberLines}
import pandas as pd

data = {'name': ['John', 'Jane', 'Sam'],
        'age': [30, 25, 35],
        'city': ['New York', 'San Francisco', 'Los Angeles']}
df = pd.DataFrame(data)

# sort by age in ascending order; sort_values returns a new DataFrame
sorted_df = df.sort_values(by='age', ascending=True)
print(sorted_df)
```

---

### Data Cleaning:

Handling missing values and duplicates.

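
Before removing anything, it can help to see how much data is actually missing or duplicated. The following is a minimal sketch of such an inspection step, assuming the same sample data used in the cleaning example; the names `missing_counts` and `duplicate_count` are illustrative only.

```python
import pandas as pd

df = pd.DataFrame({
    'name': ['John', 'Jane', 'Sam', None],
    'age': [30, 25, 35, None],
    'city': ['New York', 'San Francisco', 'Los Angeles', 'New York'],
})

# count missing values in each column
missing_counts = df.isna().sum()
print(missing_counts)

# count rows that are exact copies of an earlier row
duplicate_count = df.duplicated().sum()
print(duplicate_count)
```

With the counts in hand, the rows can then be dropped as shown next.
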
```{.python .numberLines}
import pandas as pd

data = {'name': ['John', 'Jane', 'Sam', None],
        'age': [30, 25, 35, None],
        'city': ['New York', 'San Francisco', 'Los Angeles', 'New York']}
df = pd.DataFrame(data)

# drop rows with missing values (returns a new DataFrame)
cleaned_df = df.dropna()
print(cleaned_df)

# drop duplicate rows (returns a new DataFrame)
deduped_df = df.drop_duplicates()
print(deduped_df)
```

---

### Renaming:

Here's a simple example of how to rename a column in a Pandas DataFrame:

```{.python .numberLines}
import pandas as pd

# Create a sample dataframe
data = {'name': ['John', 'Jane', 'Sam'],
        'age': [30, 25, 35],
        'city': ['New York', 'San Francisco', 'Los Angeles']}
df = pd.DataFrame(data)

# Print the original dataframe
print(df)

# Rename 'name' column to 'username'
df.rename(columns={'name': 'username'}, inplace=True)

# Print the dataframe after renaming
print(df)
```

You can also rename multiple columns at once by passing a dictionary of old to new column names.

```{.python .numberLines}
# continues from the previous example, where df is already defined
df.rename(columns={'age': 'Age', 'city': 'City'}, inplace=True)
```

The `inplace=True` argument modifies the existing DataFrame directly instead of returning a new one. If you don't want to modify the original DataFrame, set `inplace=False` or omit the argument (it is the default); `rename` then returns a new DataFrame with the changes.

With these building blocks, you can start exploring and analyzing your own datasets. The official Pandas documentation is a great resource for learning more about what the library can do.
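
To illustrate the `inplace` behaviour described above, here is a minimal sketch of renaming without `inplace=True`. It assumes a smaller version of the sample data, and the variable name `renamed` is illustrative only.

```python
import pandas as pd

df = pd.DataFrame({
    'name': ['John', 'Jane', 'Sam'],
    'age': [30, 25, 35],
})

# Without inplace=True, rename returns a new DataFrame
renamed = df.rename(columns={'name': 'username'})

print(list(df.columns))       # the original columns are unchanged
print(list(renamed.columns))  # the returned copy has the new column name
```

Assigning the result to a new variable keeps the original DataFrame available, which can make longer analyses easier to follow.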