Stacked bar plot disconnected

Laurent

The data comes from this Kaggle dataset: https://www.kaggle.com/kemical/kickstarter-projects

My stacked bar plot is disconnected, and I have no idea what is going on. None of my data contains any null values. The values of the series are frequencies. Has anyone encountered this? I just want the bars to be connected.

fig, ax = plt.subplots(nrows=1, figsize=(15,5))
x = clean_df['main_category'].value_counts().index


print("Number of unique main categories:", clean_df['main_category'].nunique())


for year in [2010, 2011, 2012, 2013, 2014, 2015, 2016]:    
    y = clean_df[clean_df['launched'].dt.year == year]['main_category'].value_counts()
    if year > 2010:
        bottom = clean_df[clean_df['launched'].dt.year <= year-1]['main_category'].value_counts()
    else:
        bottom = 0
        
    ax.set_xlabel("Main Catagories", fontsize=14)
    ax.set_ylabel("Frequency/Count", fontsize=14)
    ax.bar(x=x, height=y, width=0.9, bottom=bottom, label=str(year))
    ax.yaxis.grid(linestyle='-', linewidth=0.7)
    ax.set_xticklabels(x, rotation=45, ha='right')
    ax.legend(loc='upper right')
plt.tight_layout();


JohanC

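The main problem is that clean_df[...]['main_category'].value_counts() returns the counts sorted from largest to smallest, and this order can differ from year to year. As a result, the heights and bottoms passed to ax.bar no longer line up with the categories in x.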
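To make the ordering issue concrete, here is a minimal sketch on made-up data (the category names and counts are hypothetical, not taken from the Kaggle dataset). It shows that value_counts() orders each year differently, and that pairing those counts with x by position assigns them to the wrong categories:

```python
import pandas as pd

# Hypothetical toy data: category popularity differs between the two years.
df = pd.DataFrame({
    "main_category": ["Film", "Film", "Music", "Music", "Music", "Art"],
    "year":          [2010,   2010,   2010,    2011,    2011,    2011],
})

# Overall order used for the x axis: Music (3), Film (2), Art (1).
x = df["main_category"].value_counts().index

counts_2010 = df[df["year"] == 2010]["main_category"].value_counts()
counts_2011 = df[df["year"] == 2011]["main_category"].value_counts()

# Each year comes back in its own largest-to-smallest order.
order_2010 = list(counts_2010.index)
order_2011 = list(counts_2011.index)
print(order_2010)
print(order_2011)

# Pairing the 2010 counts with x by position, ignoring the labels,
# gives Music the count that actually belongs to Film.
mismatched = dict(zip(x, counts_2010.values))
print(mismatched)
```

In 2010 the true counts are Film 2 and Music 1, so the positional pairing above is wrong for both categories.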

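Appending [x] to y solves the problem: it reorders y to follow the desired index x. (In recent pandas versions, indexing with [x] raises a KeyError when a category is missing for that year; .reindex(x) reorders in the same way and yields NaN for missing categories instead.)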

To calculate the bottom of the bars, it is easier to accumulate the heights at the end of each loop iteration. Initializing bottom = 0 works because adding a pandas Series to a scalar gives a Series, and adding two Series aligns them by index, so bottom += y sums the desired values. The one catch: if a year has no value for some category, that category becomes NaN, and the NaN would carry through every later sum. Using fillna(0) after y has been reordered by x prevents this.
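The accumulation and the NaN problem can be sketched on two small, made-up Series (the names x, y_2010 and y_2011 and their values are hypothetical). This version uses reindex(x) for the reordering step, which is an assumption about how you would write it in a recent pandas version:

```python
import pandas as pd

x = pd.Index(["Film", "Music", "Art"])        # desired bar order
y_2010 = pd.Series({"Film": 2, "Music": 1})   # no "Art" projects in 2010
y_2011 = pd.Series({"Music": 2, "Art": 1})    # no "Film" projects in 2011

# Adding the raw Series aligns them by label; labels present in only
# one of the two Series become NaN.
naive = y_2010 + y_2011
print(naive)

# Reordering by x and filling the gaps with 0 keeps the running total clean.
bottom = 0
for y in (y_2010, y_2011):
    y = y.reindex(x).fillna(0)
    bottom = bottom + y   # bottom for the next layer of bars
print(bottom)
```

After the loop, bottom holds the total height per category in the order of x, which is what the next layer of bars would sit on.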

A simplified example:

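import matplotlib.pyplot as plt
import numpy as np
import pandas as pd

# Random stand-in data: N projects, each with a category and a launch year
N = 100
clean_df = pd.DataFrame({'main_category': np.random.choice(list('abcdef'), N),
                         'year': np.random.randint(2010, 2017, N)})
# Fixed category order (most frequent overall first), shared by every year
x = clean_df['main_category'].value_counts().index

fig, ax = plt.subplots(nrows=1, figsize=(15, 5))
bottom = 0
for year in [2010, 2011, 2012, 2013, 2014, 2015, 2016]:
    # reindex(x) orders the counts like x; unlike [x], it gives NaN instead of
    # raising a KeyError in recent pandas when a category is absent that year
    y = clean_df[clean_df['year'] == year]['main_category'].value_counts().reindex(x).fillna(0)
    ax.bar(x=x, height=y, width=0.9, bottom=bottom, label=str(year), alpha=0.8)
    bottom += y  # the next year's bars start on top of this year's

# Axis decoration only needs to happen once, after the loop
ax.set_xlabel("Main Categories", fontsize=14)
ax.set_ylabel("Frequency/Count", fontsize=14)
ax.yaxis.grid(linestyle='-', linewidth=0.7)
plt.setp(ax.get_xticklabels(), rotation=45, ha='right')
ax.legend(loc='upper right')
plt.tight_layout()
plt.show()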

[Image: resulting plot]

PS: To create this plot with pandas:

df_plot = clean_df.groupby(['year', 'main_category']).size().reset_index().pivot(columns='year', index='main_category', values=0)
df_plot['total'] = df_plot.sum(axis=1)
df_plot.sort_values('total', ascending=False, inplace=True)
df_plot[df_plot.columns[:-1]].plot(kind='bar', stacked=True, rot=45)

Note that you might need to create a new column in clean_df containing only the year.
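As a sketch of that last step, assuming the launched column holds timestamps (the question accesses it with .dt.year, so it is presumably already a datetime column), the year column could be derived as follows. The toy rows are hypothetical, and pd.crosstab is shown only as one possible alternative to the groupby/pivot chain for building the category-by-year table, not as the method used above:

```python
import pandas as pd

# Hypothetical stand-in for the Kaggle data.
clean_df = pd.DataFrame({
    "main_category": ["Film", "Music", "Film", "Art"],
    "launched": ["2010-05-01 12:00:00", "2011-03-15 08:30:00",
                 "2011-07-04 19:45:00", "2012-01-20 10:00:00"],
})

# Parse the timestamps (a no-op if the column is already datetime),
# then keep only the year in a new column.
clean_df["launched"] = pd.to_datetime(clean_df["launched"])
clean_df["year"] = clean_df["launched"].dt.year

# Count projects per category and year; combinations that never occur
# are filled with 0.
df_plot = pd.crosstab(clean_df["main_category"], clean_df["year"])
print(df_plot)
```

The resulting table has one row per category and one column per year, so it could be passed to .plot(kind='bar', stacked=True) in the same way as df_plot above.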
