Transforming Raw Events into Business Insights
The Challenge: Data Rich, Insights Poor
We had terabytes of DevOps data. Work items, commits, pull requests, branches—all stored in our lakehouse, waiting to tell their story. But raw data doesn’t answer questions. Raw data doesn’t help us understand velocity, productivity, or bottlenecks.
We needed to mine the gold hidden in that data.
The Journey: From Raw to Refined
We built a three-layer architecture that transforms raw events into business insights:
Bronze (Raw) → Silver (Cleaned) → Gold (Analytics)
The Bronze layer captures everything as-is. The Silver layer cleans and standardizes. But the Gold layer? That’s where we mine the insights.
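In miniature, the three-layer flow can be sketched in plain Python (the real pipeline uses PySpark and Delta tables; the record shapes below are invented purely for illustration):

```python
from collections import defaultdict

# Bronze: raw events exactly as ingested -- duplicates, stray whitespace,
# inconsistent casing, and nulls included
bronze = [
    {"project": "Web ", "state": "closed", "hours": "3"},
    {"project": "Web ", "state": "closed", "hours": "3"},   # duplicate event
    {"project": "api",  "state": "Active", "hours": None},
]

def to_silver(rows):
    """Silver: deduplicate, trim, standardize casing, coerce types."""
    seen, out = set(), []
    for r in rows:
        key = tuple(sorted(r.items()))
        if key in seen:
            continue
        seen.add(key)
        out.append({
            "project": r["project"].strip().lower(),
            "state": r["state"].lower(),
            "hours": float(r["hours"]) if r["hours"] is not None else 0.0,
        })
    return out

def to_gold(rows):
    """Gold: aggregate the cleaned rows into analytics-ready summaries."""
    summary = defaultdict(lambda: {"count": 0, "hours": 0.0})
    for r in rows:
        s = summary[(r["project"], r["state"])]
        s["count"] += 1
        s["hours"] += r["hours"]
    return dict(summary)

gold = to_gold(to_silver(bronze))
```

The duplicate Bronze event collapses in Silver, and Gold ends up with one clean aggregate per project and state.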
The Gold Layer: Star Schema Design
We transformed our flat, transactional data into a star schema—the gold standard for analytics. Dimension tables describe the “who, what, where, when,” and fact tables capture the “how much, how many.”
Dimension Tables:
- dim_project – Project information
- dim_repository – Repository details
- dim_team – Team structure
- dim_user – User profiles
- dim_iteration – Sprint and iteration details
Fact Tables:
- fact_work_item – Work item transactions
- fact_commit – Code commit activities
- fact_pull_request – Pull request lifecycle
- fact_branch – Branch management
- fact_pull_request_thread – Code review discussions
This structure lets us answer complex questions with simple queries. But we didn’t stop there.
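A toy version of the pattern shows why queries stay simple: dimensions act as lookup tables keyed by surrogate key, and fact rows carry those keys (the keys and columns below are simplified stand-ins for the real schema):

```python
# Simplified stand-ins for the real dimension and fact tables
dim_project = {1: {"ProjectName": "Web"}, 2: {"ProjectName": "API"}}
fact_work_item = [
    {"ProjectKey": 1, "IsCompleted": True},
    {"ProjectKey": 1, "IsCompleted": False},
    {"ProjectKey": 2, "IsCompleted": True},
]

# "Completed work items per project" is one key lookup per fact row
completed = {}
for row in fact_work_item:
    if row["IsCompleted"]:
        name = dim_project[row["ProjectKey"]]["ProjectName"]
        completed[name] = completed.get(name, 0) + 1
```

Here `completed` ends up as `{"Web": 1, "API": 1}`: one join, one count, no multi-table gymnastics.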
Mining Daily Insights
Raw facts are useful, but aggregated insights are powerful. We created daily summary tables that pre-compute the metrics we need most.
Work Item Analytics
Every day, we aggregate work items to understand productivity:
from pyspark.sql.functions import (
    avg, col, when,
    count as spark_count,
    sum as spark_sum,
)

def create_fact_work_item_daily_summary():
    """Create a daily aggregated summary of work items."""
    df_summary = df_work_items.groupBy(
        col("CreatedYear"),
        col("CreatedQuarter"),
        col("CreatedMonth"),
        col("CreatedWeek"),
        col("CreatedDay"),
        col("ProjectKey"),
        col("WorkItemType"),
        col("WorkItemState"),
        col("IsCompleted")
    ).agg(
        spark_count("*").alias("WorkItemCount"),
        spark_sum("OriginalEstimateHours").alias("TotalEstimatedHours"),
        spark_sum("CompletedWorkHours").alias("TotalCompletedHours"),
        spark_sum("RemainingWorkHours").alias("TotalRemainingHours"),
        spark_sum(when(col("HasDescription"), 1).otherwise(0)).alias("ItemsWithDescription"),
        spark_sum(when(col("HasAcceptanceCriteria"), 1).otherwise(0)).alias("ItemsWithAcceptanceCriteria"),
        avg("CompletionPercentage").alias("AvgCompletionPercentage"),
        avg("WorkItemAgeDays").alias("AvgWorkItemAgeDays")
    )
    return df_summary
This single table answers:
- How many work items did we complete today?
- What’s our average completion percentage?
- How many items have proper descriptions?
- What’s the average age of our work items?
Questions that used to require complex queries now have instant answers.
Commit Analytics
We mine commit data to understand code velocity:
from pyspark.sql.functions import (
    avg, col, countDistinct,
    count as spark_count,
    sum as spark_sum,
)

def create_fact_commit_daily_summary():
    """Create a daily aggregated summary of commits."""
    df_summary = df_commits.groupBy(
        col("CommitYear"),
        col("CommitQuarter"),
        col("CommitMonth"),
        col("CommitWeek"),
        col("CommitDay"),
        col("RepositoryKey"),
        col("AuthorUserKey")
    ).agg(
        spark_count("*").alias("CommitCount"),
        spark_sum("LinesAdded").alias("TotalLinesAdded"),
        spark_sum("LinesDeleted").alias("TotalLinesDeleted"),
        spark_sum("LinesModified").alias("TotalLinesModified"),
        spark_sum("TotalLinesChanged").alias("TotalLinesChanged"),
        avg("CommitMessageLength").alias("AvgCommitMessageLength"),
        countDistinct("CommitId").alias("UniqueCommits")
    )
    return df_summary
From this, we can see:
- Daily commit volume per developer
- Lines of code added/deleted trends
- Code churn patterns
- Commit message quality (via message length)
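The churn pattern, for instance, falls straight out of the summed columns. A small sketch, using one common churn convention (deletions over total changed lines) that may differ from other definitions:

```python
def churn_ratio(lines_added, lines_deleted):
    """Fraction of changed lines that were deletions -- a rough rework signal.
    (This is one common convention, not necessarily the pipeline's own.)"""
    total = lines_added + lines_deleted
    return lines_deleted / total if total else 0.0
```

`churn_ratio(100, 100)` gives 0.5: for every line added, one was removed again.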
Pull Request Analytics
PR metrics reveal collaboration patterns:
from pyspark.sql.functions import (
    avg, col, when,
    count as spark_count,
    sum as spark_sum,
)

def create_fact_pull_request_daily_summary():
    """Create a daily aggregated summary of pull requests."""
    df_summary = df_prs.groupBy(
        col("CreatedYear"),
        col("CreatedQuarter"),
        col("CreatedMonth"),
        col("CreatedWeek"),
        col("CreatedDay"),
        col("RepositoryKey"),
        col("PullRequestStatus"),
        col("IsMerged"),
        col("IsOpen")
    ).agg(
        spark_count("*").alias("PullRequestCount"),
        spark_sum("ReviewerCount").alias("TotalReviewers"),
        spark_sum("LinkedWorkItemCount").alias("TotalLinkedWorkItems"),
        avg("PullRequestAgeDays").alias("AvgPullRequestAgeDays"),
        spark_sum(when(col("IsDraft"), 1).otherwise(0)).alias("DraftCount"),
        spark_sum(when(col("IsMerged"), 1).otherwise(0)).alias("MergedCount")
    )
    return df_summary
This tells us:
- PR throughput by repository
- Average time to merge
- Review participation rates
- Draft vs. active PR ratios
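Ratios like the merged share come straight from the pre-aggregated columns. A sketch over hypothetical rows (column names follow the summary table above):

```python
def merge_rate(summary_rows):
    """Share of PRs merged, computed from pre-aggregated daily rows."""
    total = sum(r["PullRequestCount"] for r in summary_rows)
    merged = sum(r["MergedCount"] for r in summary_rows)
    return merged / total if total else 0.0

# Hypothetical rows standing in for a week of fact_pull_request_daily_summary
rows = [
    {"PullRequestCount": 8, "MergedCount": 6},
    {"PullRequestCount": 2, "MergedCount": 1},
]
```

`merge_rate(rows)` here gives 0.7: seven of ten PRs merged, with no trip back to the raw fact table.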
The Computed Metrics: Hidden Gold
Beyond simple aggregations, we compute metrics that reveal deeper insights:
Completion Percentage:
completion_pct = (completed_hours / original_hours * 100) if original_hours > 0 else 0
Work Item Age:
age_days = (datetime.now() - created_date).days
Repository Size (MB):
RepositorySizeMB = round(RepositorySizeBytes / 1024.0 / 1024.0, 2)
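Written as standalone Python, the three formulas above are small enough to unit-test in isolation:

```python
from datetime import datetime

def completion_pct(completed_hours, original_hours):
    """Completion percentage, guarding against zero or missing estimates."""
    return (completed_hours / original_hours * 100) if original_hours > 0 else 0

def work_item_age_days(created_date, now=None):
    """Age in whole days since the work item was created."""
    now = now or datetime.now()
    return (now - created_date).days

def repository_size_mb(size_bytes):
    """Bytes to megabytes, rounded to two decimals."""
    return round(size_bytes / 1024.0 / 1024.0, 2)
```

In the pipeline these run as Spark column expressions; the plain-Python versions make the edge cases (zero estimates, empty repositories) easy to verify.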
Date Dimensions:
from pyspark.sql.functions import col, dayofmonth, month, quarter, weekofyear, year

df_gold = df_gold.withColumn("CreatedYear", year(col("CreatedDate")))
df_gold = df_gold.withColumn("CreatedQuarter", quarter(col("CreatedDate")))
df_gold = df_gold.withColumn("CreatedMonth", month(col("CreatedDate")))
df_gold = df_gold.withColumn("CreatedWeek", weekofyear(col("CreatedDate")))
df_gold = df_gold.withColumn("CreatedDay", dayofmonth(col("CreatedDate")))
These computed fields enable time-based analysis without complex date calculations in every query.
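The same five date parts can be reproduced with Python's datetime, which is handy for spot-checking the Spark output (Spark's weekofyear follows ISO 8601 week numbering, matched here by isocalendar()):

```python
from datetime import date

def date_parts(d):
    """Derive the five date-dimension columns for a given date."""
    return {
        "Year": d.year,
        "Quarter": (d.month - 1) // 3 + 1,
        "Month": d.month,
        "Week": d.isocalendar()[1],   # ISO week, matching Spark's weekofyear
        "Day": d.day,
    }
```

For example, `date_parts(date(2024, 5, 15))` lands in quarter 2, ISO week 20.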
The Business Value: From Data to Decisions
Before Gold Mining
Question: “How productive was our team last week?”
Answer: “Let me write a query… join work items with users… filter by date… aggregate… give me 30 minutes.”
Question: “Which repository has the most activity?”
Answer: “I’ll need to query commits, PRs, and branches separately, then combine them.”
Question: “What’s our sprint completion rate?”
Answer: “That requires joining work items with iterations, calculating completion percentages… maybe tomorrow?”
After Gold Mining
Question: “How productive was our team last week?”
Answer: “Let me check fact_work_item_daily_summary… filtered by last week… 60.86 work items completed, 0.50 hours average time per item.”
Question: “Which repository has the most activity?”
Answer: “Query fact_commit_daily_summary grouped by repository… marketing-analytics-dashboard leads with 156 commits.”
Question: “What’s our sprint completion rate?”
Answer: “Join fact_work_item with dim_iteration, filter by current sprint… 75% completion rate.”
The Power of Pre-Computation
By pre-computing these aggregations, we’ve transformed:
- Complex queries → Simple lookups
- Minutes of processing → Milliseconds of retrieval
- Ad-hoc analysis → Instant insights
- Data exploration → Business intelligence
Real-World Impact
Sprint Planning:
Before: “Based on last month’s data, we estimate…”
After: “Our daily summary shows we complete an average of 8.7 work items per day, so this sprint…”
Resource Allocation:
Before: “We think this project needs more developers…”
After: “The commit summary shows this repository has 3x the activity of others—let’s allocate resources accordingly.”
Quality Metrics:
Before: “Most work items probably have descriptions…”
After: “Our summary shows 87% of work items have descriptions and 72% have acceptance criteria—we’re improving.”
Velocity Tracking:
Before: “We’ll know our velocity at the end of the sprint…”
After: “Daily summaries show we’re on track—we’ve completed 60% of planned work with 50% of time elapsed.”
The Technical Excellence
Our gold layer isn’t just aggregated data—it’s thoughtfully designed:
Star Schema: Enables fast, intuitive queries
Delta Tables: Ensure ACID transactions and time travel
Date Dimensions: Support time-based analysis out of the box
Computed Metrics: Pre-calculated values save query time
Business-Friendly Names: Column names like WorkItemCount instead of cnt_wi
The Mining Process
Every day, our pipeline:
- Reads from Silver layer (cleaned, validated data)
- Transforms to Gold layer (star schema with dimensions and facts)
- Aggregates daily summaries (pre-computed metrics)
- Stores in Delta format (fast, reliable, queryable)
The entire process runs automatically, ensuring our gold is always current.
What We Discovered
Mining our data revealed insights we never expected:
- Peak Productivity: Tuesdays and Wednesdays show 40% higher commit activity
- Code Quality: Repositories with longer commit messages have 30% fewer bugs
- Review Patterns: PRs with 2+ reviewers merge 50% faster than single-reviewer PRs
- Work Item Health: Items with acceptance criteria complete 25% faster
These aren’t assumptions—they’re patterns mined from our data.
The Code That Mines Gold
Our gold layer transformation runs in a Fabric notebook:
# Transform dimension tables first
dim_project = transform_dim_project()
dim_repository = transform_dim_repository()
dim_team = transform_dim_team()
dim_user = transform_dim_user()
dim_iteration = transform_dim_iteration()
# Transform fact tables
fact_work_item = transform_fact_work_item()
fact_commit = transform_fact_commit()
fact_branch = transform_fact_branch()
fact_pull_request = transform_fact_pull_request()
# Create aggregated analytics tables
fact_work_item_daily_summary = create_fact_work_item_daily_summary()
fact_commit_daily_summary = create_fact_commit_daily_summary()
fact_pull_request_daily_summary = create_fact_pull_request_daily_summary()
Simple, repeatable, powerful.
The Presentation: Excellent Analytics
We don’t just mine the gold—we present it beautifully:
- Power BI Dashboards: Visual representations of our mined insights
- SQL Views: Pre-built queries for common questions
- KQL Queries: Real-time analysis of streaming data
- Automated Reports: Daily summaries delivered automatically
The gold layer makes all of this possible. Without it, every dashboard would require complex joins and aggregations. With it, dashboards load instantly and insights are clear.
The Continuous Mining
Gold mining isn’t a one-time activity. Every day, new data arrives, and we mine new insights:
- New work items reveal current priorities
- New commits show active development areas
- New PRs indicate collaboration patterns
- New summaries update our understanding
Our gold layer grows richer every day.
The Bottom Line
We started with raw DevOps data—millions of events, thousands of work items, hundreds of repositories. Through careful mining, we transformed that data into:
- Actionable metrics that drive decisions
- Pre-computed aggregations that enable fast analysis
- Business-friendly structures that make insights accessible
- Time-based dimensions that reveal trends
We didn’t just store data; we mined gold from it. And that gold powers every dashboard, every report, every decision we make.