Pandas Groupby Quantile Count

6 min read Oct 15, 2024

Exploring Data with Pandas: Grouping, Quantiles, and Counts

Pandas is a powerful Python library for data manipulation and analysis. One of its most useful features is the ability to group data, calculate quantiles within those groups, and count the occurrences of values. This combination allows you to gain valuable insights into the distribution of data within different categories.

Let's delve into how to effectively use Pandas for grouping, quantiles, and counts.

What are Quantiles?

Quantiles are cut points that divide a sorted dataset into equal-sized portions. Common quantiles include:

Quartiles: Three cut points (25%, 50%, 75%) that divide the data into four equal parts.
Deciles: Nine cut points (10%, 20%, ..., 90%) that divide the data into ten equal parts.
Percentiles: Ninety-nine cut points (1%, 2%, ..., 99%) that divide the data into one hundred equal parts.
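As a minimal sketch of these definitions, `Series.quantile` returns the cut points for whatever fractions you pass in. The eleven values below are made up for illustration, chosen so the results are easy to check by hand; pandas interpolates linearly between neighboring values by default.

```python
import pandas as pd

# Hypothetical values: 0, 10, 20, ..., 100
values = pd.Series(range(0, 101, 10))

# Quartiles: the 25%, 50%, and 75% cut points
quartiles = values.quantile([0.25, 0.5, 0.75])

# A few deciles: the 10%, 20%, and 90% cut points
deciles = values.quantile([0.1, 0.2, 0.9])

print(quartiles)
print(deciles)
```

The result is a Series indexed by the requested fractions, so `quartiles.loc[0.5]` is the median.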

Why Use GroupBy, Quantiles, and Counts?

Imagine you have a dataset of customer purchases. You might want to analyze:

Average purchase amount per customer segment: Group by customer demographics (e.g., age, location) and calculate the average purchase amount within each group.
Distribution of purchase amounts within a segment: Use quantiles to understand the spread of purchase values within a specific customer segment.
Number of purchases within each price range: Group by price range and count the number of purchases in each range.

This type of analysis can reveal important trends and patterns within your data.
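The first and third questions can be sketched roughly as follows. The `purchases` table, its segment labels, and the price-range boundaries are all invented for illustration; `pd.cut` is one way to turn a numeric column into ranges that can then be grouped and counted.

```python
import pandas as pd

# Hypothetical purchase records; segments and amounts are invented
purchases = pd.DataFrame({
    "segment": ["18-25", "18-25", "26-40", "26-40", "26-40", "41+"],
    "amount": [12.0, 48.0, 75.0, 20.0, 130.0, 60.0],
})

# Average purchase amount per customer segment
avg_by_segment = purchases.groupby("segment")["amount"].mean()

# Bucket amounts into (assumed) price ranges: (0, 25], (25, 100], (100, inf]
bins = [0, 25, 100, float("inf")]
labels = ["low", "mid", "high"]
purchases["price_range"] = pd.cut(purchases["amount"], bins=bins, labels=labels)

# Number of purchases in each price range
count_by_range = purchases.groupby("price_range", observed=False)["amount"].count()

print(avg_by_segment)
print(count_by_range)
```

Passing `observed=False` keeps every range in the result, including ranges that happen to contain no purchases.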

A Practical Example

Let's consider a sample dataset of product sales:

Product	Price	Category
Widget A	10	Electronics
Widget B	20	Electronics
Gadget X	30	Electronics
Gadget Y	40	Electronics
Book 1	15	Books
Book 2	25	Books
Book 3	35	Books

Goal: Determine the average price, the distribution of prices, and the number of products within each category.

import pandas as pd

data = {'Product': ['Widget A', 'Widget B', 'Gadget X', 'Gadget Y', 'Book 1', 'Book 2', 'Book 3'],
        'Price': [10, 20, 30, 40, 15, 25, 35],
        'Category': ['Electronics', 'Electronics', 'Electronics', 'Electronics', 'Books', 'Books', 'Books']}

df = pd.DataFrame(data)

# Group by Category
grouped = df.groupby('Category')

# Calculate average price per category
average_prices = grouped['Price'].mean()
print("Average Prices per Category:")
print(average_prices)

# Calculate quantiles for prices within each category
quantiles = grouped['Price'].quantile([0.25, 0.5, 0.75])
print("\nQuantiles of Prices per Category:")
print(quantiles)

# Count the products in each category (count() ignores missing values)
counts = grouped['Product'].count()
print("\nNumber of Products per Category:")
print(counts)

Output:

Average Prices per Category:
Category
Books          25.0
Electronics    25.0
Name: Price, dtype: float64

Quantiles of Prices per Category:
Category         
Books        0.25    20.0
             0.50    25.0
             0.75    30.0
Electronics  0.25    17.5
             0.50    25.0
             0.75    32.5
Name: Price, dtype: float64

Number of Products per Category:
Category
Books          3
Electronics    4
Name: Product, dtype: int64

Interpretation:

The average price for both "Electronics" and "Books" is $25.
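The median (50th percentile) is $25 in both categories, but the quartiles differ: the 25th and 75th percentiles are $20 and $30 for "Books" versus $17.50 and $32.50 for "Electronics". Both distributions are centered on the same price, with "Electronics" spread more widely (an interquartile range of $15 versus $10). Pandas interpolates linearly between neighboring values by default, which is why a quantile such as $17.50 does not appear in the data itself.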
There are 3 products in the "Books" category and 4 products in the "Electronics" category.
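To compare spreads directly, it can help to reshape the quantiles into one row per category. The grouped quantile result is a Series with a two-level index (category, quantile), so one possible approach is to `unstack` it and subtract columns. The `IQR` column name is an illustrative choice, not part of the original example.

```python
import pandas as pd

df = pd.DataFrame({
    "Product": ["Widget A", "Widget B", "Gadget X", "Gadget Y",
                "Book 1", "Book 2", "Book 3"],
    "Price": [10, 20, 30, 40, 15, 25, 35],
    "Category": ["Electronics"] * 4 + ["Books"] * 3,
})

quantiles = df.groupby("Category")["Price"].quantile([0.25, 0.5, 0.75])

# Move the quantile level from the row index into columns
wide = quantiles.unstack()

# Interquartile range: 75th percentile minus 25th percentile
wide["IQR"] = wide[0.75] - wide[0.25]
print(wide)
```

Each row of `wide` now holds the three quartiles and the interquartile range for one category.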

Additional Considerations

Applying Multiple Aggregations: You can combine multiple aggregation functions within a single groupby operation. For example:

# Several built-in aggregations in one call; the result is a DataFrame
summary = df.groupby('Category')['Price'].agg(['mean', 'std', 'min', 'max'])
print(summary)
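Quantiles and counts can go into the same summary table. One way to do this, sketched here with pandas' named aggregation, is to pass each output column as a keyword argument. The column names (`n`, `q25`, `q75`, and so on) are arbitrary labels chosen for this example.

```python
import pandas as pd

df = pd.DataFrame({
    "Product": ["Widget A", "Widget B", "Gadget X", "Gadget Y",
                "Book 1", "Book 2", "Book 3"],
    "Price": [10, 20, 30, 40, 15, 25, 35],
    "Category": ["Electronics"] * 4 + ["Books"] * 3,
})

# Each keyword becomes a column in the result
summary = df.groupby("Category")["Price"].agg(
    mean="mean",
    n="count",
    q25=lambda s: s.quantile(0.25),
    median="median",
    q75=lambda s: s.quantile(0.75),
)
print(summary)
```

This yields one row per category, which is often easier to scan than three separate results.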

Custom Functions: Define your own functions to apply during the grouping process.

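def price_spread(prices):
    # Difference between the highest and lowest price in the group
    return prices.max() - prices.min()

# The custom function's name becomes the column label in the result
summary = df.groupby('Category')['Price'].agg(['mean', price_spread])
print(summary)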
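Counting with Missing Values: `count()` counts only non-missing entries in the selected column, while `size()` counts every row in the group. The small table below is hypothetical and includes one missing product name to make the difference visible.

```python
import pandas as pd

# Hypothetical data with one missing product name
df_missing = pd.DataFrame({
    "Product": ["Widget A", None, "Book 1", "Book 2"],
    "Price": [10, 20, 15, 25],
    "Category": ["Electronics", "Electronics", "Books", "Books"],
})

grouped = df_missing.groupby("Category")

non_null = grouped["Product"].count()  # non-missing Product values per group
rows = grouped.size()                  # total rows per group

print(non_null)
print(rows)
```

If the two results disagree for a group, that group contains missing values in the counted column.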

Conclusion

Pandas' groupby, quantile, and count operations combine into a compact workflow for summarizing data by group. Together they show how values are distributed within each category, how the categories compare with one another, and how many observations stand behind each summary.