Project Overview
This project analyzes comprehensive layoff data from companies worldwide, providing insights into workforce dynamics, industry trends, and economic patterns. The analysis covers 3,282 companies across various industries and geographic locations, offering a detailed view of global employment trends.
What I Built
Data Analysis Pipeline
- Data Collection: Layoff dataset covering 3,282 companies
- Data Cleaning: Handling missing values, outliers, and data inconsistencies
- Statistical Analysis: Advanced analytics including outlier detection and correlation analysis
- Visualization: Interactive charts and graphs for trend analysis
Key Features
- Multi-dimensional Analysis: Company, industry, geographic, and temporal analysis
- Outlier Detection: Statistical methods to identify and handle outliers
- Trend Analysis: Time-series analysis of layoff patterns
- Interactive Visualizations: Dynamic charts for better data exploration
Technical Implementation
Data Structure
The dataset contains 12 columns with comprehensive information:
- Company Information: Name, location, industry, stage
- Layoff Data: Number of layoffs, percentage of workforce
- Financial Data: Funds raised, company stage
- Temporal Data: Date of layoffs, date added to dataset
- Geographic Data: Country and headquarters location
Data Processing Pipeline
import pandas as pd

# Data Loading and Initial Analysis
df = pd.read_csv('layoffs_data.csv')
print(f"Dataset Shape: {df.shape}")
print(f"Columns: {df.columns.tolist()}")

# Missing Value Analysis
missing_values = (df.isna().mean() * 100).round(1)
print("Missing Values Percentage:")
print(missing_values)

# Data Cleaning
# Drop unused columns before dropna() so that gaps in columns we discard
# anyway do not remove otherwise complete rows.
df.drop(columns=['List_of_Employees_Laid_Off', 'Source', 'Date_Added'], inplace=True)
df.rename(columns={'Laid_Off_Count': 'Layoffs'}, inplace=True)
df.dropna(inplace=True)
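The order of the cleaning steps affects how many rows survive, because `dropna()` removes any row with a missing value in any column, including columns that are about to be discarded. The sketch below illustrates this on a made-up three-row frame; the values and the `Company` column name are invented for illustration and are not taken from the dataset.

```python
import numpy as np
import pandas as pd

# Toy frame (invented values): 'Source' is sparse, the count is mostly complete.
toy = pd.DataFrame({
    "Company": ["A", "B", "C"],
    "Laid_Off_Count": [100.0, 50.0, np.nan],
    "Source": [np.nan, "https://example.com", np.nan],
})

# Order 1: drop incomplete rows first, then the unused column.
rows_first = toy.dropna().drop(columns=["Source"])

# Order 2: drop the unused column first, then incomplete rows.
cols_first = toy.drop(columns=["Source"]).dropna()

print(len(rows_first), len(cols_first))  # prints: 1 2
```

In this toy case, dropping the sparse column first keeps company A, which is only missing a value in the column that gets discarded.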
Statistical Analysis
import numpy as np
import pandas as pd

def detect_outliers(df, column_names):
    """Summarize IQR-based outliers for each column in column_names."""
    outlier_data = []
    for column_name in column_names:
        data = df[column_name]
        # Calculate quartiles and IQR (assumes the column has no missing values)
        q1, q3 = np.percentile(data, [25, 75])
        iqr = q3 - q1
        # Define outlier limits
        low_lim, upp_lim = q1 - 1.5 * iqr, q3 + 1.5 * iqr
        # Find outliers
        outliers = data[(data < low_lim) | (data > upp_lim)]
        num_outliers = len(outliers)
        percent_outliers = round(num_outliers / len(df) * 100, 1)
        outlier_data.append([column_name, num_outliers, percent_outliers,
                             round(low_lim, 1), round(upp_lim, 1)])
    return pd.DataFrame(outlier_data, columns=['Column', 'Number of Outliers',
                                               '% Outliers', 'Lower Limit', 'Upper Limit'])
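To make the IQR rule concrete, here is a small worked example on six invented values (not taken from the dataset). It applies the same quartile and 1.5 × IQR limits as the function above to a plain NumPy array.

```python
import numpy as np

# Invented sample: five small values and one extreme value.
values = np.array([1, 2, 3, 4, 5, 100])

# Quartiles with NumPy's default linear interpolation.
q1, q3 = np.percentile(values, [25, 75])   # 2.25 and 4.75
iqr = q3 - q1                              # 2.5

# Anything outside [q1 - 1.5*IQR, q3 + 1.5*IQR] is flagged.
low_lim, upp_lim = q1 - 1.5 * iqr, q3 + 1.5 * iqr   # -1.5 and 8.5
outliers = values[(values < low_lim) | (values > upp_lim)]

print(low_lim, upp_lim, outliers)  # prints: -1.5 8.5 [100]
```

Only the value 100 falls outside the limits, so it is the single flagged outlier.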
Challenges & Solutions
Challenge 1: Data Quality Issues
Problem: Missing values, inconsistent data formats, and outliers
Solution:
- Implemented comprehensive data cleaning pipeline
- Used statistical methods for outlier detection
- Applied appropriate data transformations
Challenge 2: Complex Visualizations
Problem: Creating meaningful visualizations for multi-dimensional data
Solution:
- Used multiple visualization libraries (Seaborn, Matplotlib, Plotly)
- Created interactive charts for better exploration
- Implemented subplot arrangements for comprehensive analysis
Challenge 3: Statistical Accuracy
Problem: Ensuring statistical validity of analysis
Solution:
- Applied proper outlier detection methods (IQR-based)
- Used appropriate statistical measures
- Implemented data validation checks
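The write-up does not show the validation checks themselves, so the following is only a sketch of what such checks might look like. The helper name `validate_layoffs` is hypothetical, and the assumption that `Percentage` is stored as a fraction between 0 and 1 may not match the actual data; adjust `max_percentage` if it is stored as 0 to 100.

```python
import pandas as pd

def validate_layoffs(df, max_percentage=1.0):
    """Return a list of problems found in df (hypothetical helper)."""
    problems = []
    if df.isna().any().any():
        problems.append("missing values remain")
    if (df["Layoffs"] < 0).any():
        problems.append("negative layoff counts")
    # Assumption: Percentage is a fraction in [0, max_percentage].
    if not df["Percentage"].between(0, max_percentage).all():
        problems.append("percentage outside expected range")
    if df.duplicated().any():
        problems.append("duplicate rows")
    return problems

# Invented rows for illustration only.
good = pd.DataFrame({"Layoffs": [10, 200], "Percentage": [0.05, 0.5]})
bad = pd.DataFrame({"Layoffs": [10, -5], "Percentage": [0.05, 1.5]})

print(validate_layoffs(good))  # prints: []
print(validate_layoffs(bad))
```

Returning a list of messages rather than raising on the first failure makes it easier to see every issue in one pass.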
Results & Insights
Key Findings
- Industry Trends: Technology and retail sectors showed highest layoff rates
- Geographic Patterns: Certain countries experienced more layoffs than others
- Company Size Impact: Larger companies tended to have more layoffs
- Temporal Patterns: Seasonal and cyclical patterns in layoff data
- Financial Correlation: Relationship between funds raised and layoff percentages
Visualizations Created
- Layoff Percentage Distribution: Box plots and histograms showing distribution
- Geographic Analysis: Country-wise layoff patterns
- Company Analysis: Companies with highest layoff counts
- Temporal Trends: Layoff patterns over time
- Industry Analysis: Layoff trends across different industries
- Correlation Analysis: Relationships between different variables
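The temporal and industry views listed above rest on simple aggregations. A minimal sketch, assuming a `Date` column (a hypothetical name, since the write-up does not give the exact column name) alongside the `Industry` and `Layoffs` columns; all values are invented:

```python
import pandas as pd

# Invented rows for illustration only.
toy = pd.DataFrame({
    "Date": ["2023-01-05", "2023-01-20", "2023-02-10"],
    "Industry": ["Retail", "Tech", "Tech"],
    "Layoffs": [100, 250, 50],
})
toy["Date"] = pd.to_datetime(toy["Date"])

# Total layoffs per calendar month.
monthly = toy.groupby(toy["Date"].dt.to_period("M"))["Layoffs"].sum()

# Total layoffs per industry, largest first.
by_industry = toy.groupby("Industry")["Layoffs"].sum().sort_values(ascending=False)

print(monthly.tolist())      # prints: [350, 50]
print(by_industry.to_dict())  # prints: {'Tech': 300, 'Retail': 100}
```

The resulting series can be passed directly to a line or bar chart.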
What I Learned
Technical Skills
- Data Analysis: Comprehensive data exploration and statistical analysis
- Data Visualization: Creating meaningful and interactive visualizations
- Statistical Methods: Outlier detection, correlation analysis, and trend analysis
- Python Libraries: Advanced usage of Pandas, NumPy, Seaborn, and Plotly
- Jupyter Notebooks: Effective data science workflow
Analytical Skills
- Data Cleaning: Handling real-world data quality issues
- Statistical Thinking: Applying statistical concepts to business problems
- Insight Generation: Extracting meaningful insights from complex datasets
- Storytelling: Communicating findings through visualizations
Code Snippets
Outlier Detection and Visualization
import matplotlib.pyplot as plt
import numpy as np
import seaborn as sns

def detect_outliers(df, column_names):
    col_len = len(column_names)
    num_columns = min(col_len, 3)
    # Each column gets two stacked panels: a boxplot above a histogram
    num_rows = 2 * ((col_len + num_columns - 1) // num_columns)
    # squeeze=False keeps axes two-dimensional even for a single column
    fig, axes = plt.subplots(num_rows, num_columns,
                             figsize=(5 * num_columns, 3 * num_rows),
                             squeeze=False)

    for i, column_name in enumerate(column_names):
        data = df[column_name]
        q1, q3 = np.percentile(data, [25, 75])
        iqr = q3 - q1
        low_lim, upp_lim = q1 - 1.5 * iqr, q3 + 1.5 * iqr

        # Each grid row spans two subplot rows (boxplot, then histogram)
        grid_row, col_index = divmod(i, num_columns)
        row_index = 2 * grid_row
        ax_box, ax_hist = axes[row_index, col_index], axes[row_index + 1, col_index]

        # Boxplot with the IQR limits marked
        sns.boxplot(x=data, ax=ax_box)
        ax_box.axvline(low_lim, color='brown', linestyle='--', label=f'Lower: {low_lim:.1f}')
        ax_box.axvline(upp_lim, color='brown', linestyle='--', label=f'Upper: {upp_lim:.1f}')
        ax_box.legend()

        # Histogram on a log scale so rare extreme values stay visible
        sns.histplot(data, bins=20, ax=ax_hist, color='purple')
        ax_hist.set_yscale('log')

    plt.tight_layout()
    plt.savefig('layoff_percent.png')
    plt.show()
Interactive Visualization
import plotly.express as px

# clean_df is the cleaned DataFrame from the data processing steps
fig_industries_box = px.box(
    clean_df,
    x='Industry',
    y='Layoffs',
    title='Layoffs Distribution by Industry',
    color='Industry'
)
fig_industries_box.update_layout(
    xaxis_title="Industry",
    yaxis_title="Number of Layoffs",
    showlegend=False
)
fig_industries_box.show()
Correlation Analysis
import matplotlib.pyplot as plt
import seaborn as sns

# Correlation matrix (Pearson, the pandas default) of the numeric columns
correlation_matrix = clean_df[['Layoffs', 'Percentage', 'Funds_Raised']].corr()
plt.figure(figsize=(10, 8))
sns.heatmap(correlation_matrix, annot=True, cmap='coolwarm', center=0,
            square=True, linewidths=0.5)
plt.title('Correlation Matrix of Key Variables')
plt.tight_layout()
plt.show()
Future Improvements
- Real-time Data: Integrate with live data sources for current trends
- Predictive Modeling: Implement ML models to predict layoff trends
- Interactive Dashboard: Create web-based dashboard for exploration
- Geographic Mapping: Add interactive maps for geographic analysis
- Industry Deep-dive: Detailed analysis of specific industries
Project Impact
This project demonstrates my ability to:
- Data Analysis: Handle complex, real-world datasets effectively
- Statistical Analysis: Apply proper statistical methods and validation
- Data Visualization: Create meaningful and informative visualizations
- Insight Generation: Extract actionable insights from data
- Technical Communication: Present findings clearly and effectively
The project showcases practical application of data science techniques and demonstrates my proficiency in data analysis, visualization, and statistical thinking, making it a valuable addition to my portfolio.