A histogram (see the matplotlib documentation) visualizes the distribution of data. It divides continuous data into bins and draws one bar per bin, whose height is the number of data points (the frequency) in that bin. It is commonly used in statistical analysis, for example to check the shape of a distribution or to spot outliers.
Basic usage
Use the matplotlib.pyplot.hist() function. A typical call with the main parameters looks like this:
plt.hist(data, bins=30, color="skyblue", edgecolor="black", alpha=0.6)
Histogram code
import matplotlib.pyplot as plt
import numpy as np
np.random.seed(10)  # fix the seed so the plot is reproducible
data = np.random.randn(1000)  # 1,000 samples from a standard normal distribution
# data: the values to plot
# bins=30: number of bins in the histogram
# alpha=0.75: opacity of the bars (0 = fully transparent, 1 = opaque)
# edgecolor='black': outline color of the bars
plt.hist(data, bins=30, alpha=0.75, edgecolor='black')
plt.xlabel('Value')
plt.ylabel('Frequency')
plt.title('Histogram Example')
plt.grid(True) # Add grid for better visualization
plt.show()
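plt.hist() also returns what it drew, which can be handy for checking the plot numerically. The following is a minimal sketch, assuming the same kind of random data as above; the names counts, edges, and patches are only illustrative.

```python
import matplotlib.pyplot as plt
import numpy as np

rng = np.random.default_rng(10)      # seeded generator for reproducibility
data = rng.standard_normal(1000)     # 1,000 standard-normal samples

# hist() returns the per-bin counts, the bin edges, and the drawn bars
counts, edges, patches = plt.hist(data, bins=30, edgecolor="black")

print(len(counts))        # one count per bin
print(len(edges))         # one more edge than there are bins
print(int(counts.sum()))  # with the default range, every sample falls in some bin
# plt.show()  # uncomment to display the figure
```

Because the default bin range runs from the smallest to the largest value, the counts add up to the number of samples.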
Notes
- Choice of bins: Too few bins hide the shape of the distribution; too many make the plot noisy.
- Normalization: Setting density=True rescales the bar heights into a probability density, so that the total area of the bars equals 1.
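Both notes can be checked with a short sketch. It assumes standard-normal sample data, and the bin counts 5, 30, and 300 are arbitrary choices meant to show the "too few" and "too many" cases rather than recommended values.

```python
import matplotlib.pyplot as plt
import numpy as np

rng = np.random.default_rng(10)
data = rng.standard_normal(1000)

# Same data with three bin counts: too coarse, reasonable, too fine
fig, axes = plt.subplots(1, 3, figsize=(12, 3), sharex=True)
for ax, n_bins in zip(axes, (5, 30, 300)):
    ax.hist(data, bins=n_bins, edgecolor="black")
    ax.set_title(f"bins={n_bins}")

# With density=True the bar heights are densities, not counts
fig2, ax2 = plt.subplots()
heights, edges, _ = ax2.hist(data, bins=30, density=True, edgecolor="black")
area = np.sum(heights * np.diff(edges))  # sum of height * width over all bars
print(f"total area = {area:.6f}")
# plt.show()  # uncomment to display the figures
```

If picking a number by hand is inconvenient, bins also accepts a string such as "auto", which lets NumPy estimate the bin edges from the data.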
Use cases
- Distribution analysis: Check whether the data look normally or uniformly distributed, and get a feel for skewness and kurtosis.
- Outlier detection: Look for isolated bars far out in the tails.
- Multi-dataset comparison: Overlay the histograms of several datasets to compare their distributions.
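The multi-dataset case can be sketched as follows. The two datasets group_a and group_b are made-up random samples, and the 40 shared bins are an arbitrary choice. Using the same bin edges for both calls keeps the bars aligned, and a lower alpha lets the overlapping region show through.

```python
import matplotlib.pyplot as plt
import numpy as np

rng = np.random.default_rng(0)
group_a = rng.normal(loc=0.0, scale=1.0, size=1000)  # made-up dataset A
group_b = rng.normal(loc=1.5, scale=0.7, size=1000)  # made-up dataset B

# Shared bin edges so the bars of both datasets line up
combined = np.concatenate([group_a, group_b])
bins = np.linspace(combined.min(), combined.max(), 41)  # 41 edges -> 40 bins

fig, ax = plt.subplots()
counts_a, _, _ = ax.hist(group_a, bins=bins, alpha=0.5, label="group_a")
counts_b, _, _ = ax.hist(group_b, bins=bins, alpha=0.5, label="group_b")
ax.set_xlabel("Value")
ax.set_ylabel("Frequency")
ax.legend()
# plt.show()  # uncomment to display the figure
```

If the datasets differ a lot in size, passing density=True to both calls makes the shapes easier to compare than raw counts.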
Histograms and bar charts may look similar, but they serve different purposes and work with different types of data. Their interpretation also differs.
To be honest, I used them interchangeably for a while without realizing the differences, so here’s a quick refresher.
| Histograms | Bar charts |
|---|---|
| Deal with continuous data (e.g., age, temperature, time). | Deal with categorical data (e.g., fruit type, region, gender). |
| Divide the data into bins and count the frequency in each bin. | Compare the value of each category directly. |
| Show the distribution of the data (e.g., normal, skewed, outliers). | Show the differences in value between categories (e.g., sales volume, poll results). |
| “Distribution of height data for 100 people” → x-axis: 150-160 cm, 160-170 cm, … (bins) → y-axis: number of people in each bin | “Compare sales by fruit” → x-axis: apples, bananas, oranges (categories) → y-axis: sales of each fruit |
It’s important not to confuse the two and to pick the one that matches the type of data you have.
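As a rough side-by-side sketch of the two examples in the table: the heights and the sales figures below are invented purely for illustration, and the 10 cm bin width is an assumption taken from the "150-160 cm, 160-170 cm" wording.

```python
import matplotlib.pyplot as plt
import numpy as np

rng = np.random.default_rng(1)
heights_cm = rng.normal(loc=170, scale=7, size=100)  # made-up heights of 100 people
fruits = ["Apples", "Bananas", "Oranges"]
sales = [120, 90, 150]                               # made-up sales figures

fig, (ax_hist, ax_bar) = plt.subplots(1, 2, figsize=(10, 4))

# Histogram: continuous values are grouped into 10 cm bins and counted
bin_edges = np.arange(140, 211, 10)  # 140, 150, ..., 210
counts, _, _ = ax_hist.hist(heights_cm, bins=bin_edges, edgecolor="black")
ax_hist.set_xlabel("Height (cm)")
ax_hist.set_ylabel("Number of people")
ax_hist.set_title("Histogram")

# Bar chart: one bar per category, with the height given directly by the value
bars = ax_bar.bar(fruits, sales)
ax_bar.set_xlabel("Fruit")
ax_bar.set_ylabel("Sales")
ax_bar.set_title("Bar chart")

fig.tight_layout()
# plt.show()  # uncomment to display the figure
```

In the histogram the bar heights are computed from the data by hist(), while in the bar chart they are the values passed to bar() as they are.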