Dataminer🤹

Transforming Raw Events into Business Insights

The Challenge: Data Rich, Insights Poor

We had terabytes of DevOps data. Work items, commits, pull requests, branches—all stored in our lakehouse, waiting to tell their story. But raw data doesn’t answer questions. Raw data doesn’t help us understand velocity, productivity, or bottlenecks.

We needed to mine the gold hidden in that data.

The Journey: From Raw to Refined

We built a three-layer architecture that transforms raw events into business insights:

Bronze (Raw) → Silver (Cleaned) → Gold (Analytics)

The Bronze layer captures everything as-is. The Silver layer cleans and standardizes. But the Gold layer? That’s where we mine the insights.

The Gold Layer: Star Schema Design

We transformed our flat, transactional data into a star schema—the gold standard for analytics. Dimension tables describe the “who, what, where, when,” and fact tables capture the “how much, how many.”

Dimension Tables:

  • dim_project – Project information
  • dim_repository – Repository details
  • dim_team – Team structure
  • dim_user – User profiles
  • dim_iteration – Sprint and iteration details

Fact Tables:

  • fact_work_item – Work item transactions
  • fact_commit – Code commit activities
  • fact_pull_request – Pull request lifecycle
  • fact_branch – Branch management
  • fact_pull_request_thread – Code review discussions

This structure lets us answer complex questions with simple queries. But we didn’t stop there.

Mining Daily Insights

Raw facts are useful, but aggregated insights are powerful. We created daily summary tables that pre-compute the metrics we need most.

Work Item Analytics

Every day, we aggregate work items to understand productivity:

from pyspark.sql.functions import (
    avg, col, countDistinct, when,
    count as spark_count, sum as spark_sum,
)

def create_fact_work_item_daily_summary():
    """Create a daily aggregated summary of work items."""
    df_summary = df_work_items.groupBy(
        col("CreatedYear"),
        col("CreatedQuarter"),
        col("CreatedMonth"),
        col("CreatedWeek"),
        col("CreatedDay"),
        col("ProjectKey"),
        col("WorkItemType"),
        col("WorkItemState"),
        col("IsCompleted")
    ).agg(
        spark_count("*").alias("WorkItemCount"),
        spark_sum("OriginalEstimateHours").alias("TotalEstimatedHours"),
        spark_sum("CompletedWorkHours").alias("TotalCompletedHours"),
        spark_sum("RemainingWorkHours").alias("TotalRemainingHours"),
        spark_sum(when(col("HasDescription") == True, 1).otherwise(0)).alias("ItemsWithDescription"),
        spark_sum(when(col("HasAcceptanceCriteria") == True, 1).otherwise(0)).alias("ItemsWithAcceptanceCriteria"),
        avg("CompletionPercentage").alias("AvgCompletionPercentage"),
        avg("WorkItemAgeDays").alias("AvgWorkItemAgeDays")
    )
    return df_summary

This single table answers:

  • How many work items did we complete today?
  • What’s our average completion percentage?
  • How many items have proper descriptions?
  • What’s the average age of our work items?

Questions that used to require complex queries now have instant answers.
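
The shift from query to lookup can be sketched in plain Python. The keys and field names below mirror fact_work_item_daily_summary; the rows are invented for illustration.

```python
# Pre-computed daily roll-up, keyed by (year, month, day).
# Field names mirror fact_work_item_daily_summary; the values are invented.
daily_summary = {
    (2024, 6, 3): {"WorkItemCount": 14, "ItemsWithDescription": 12, "AvgCompletionPercentage": 68.5},
    (2024, 6, 4): {"WorkItemCount": 9, "ItemsWithDescription": 8, "AvgCompletionPercentage": 74.0},
}

def work_items_on(year: int, month: int, day: int) -> int:
    """Answering 'how many work items today?' becomes an O(1) lookup,
    not a scan and aggregation over the raw fact table."""
    row = daily_summary.get((year, month, day))
    return row["WorkItemCount"] if row else 0

print(work_items_on(2024, 6, 3))  # 14
```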

Commit Analytics

We mine commit data to understand code velocity:

def create_fact_commit_daily_summary():
    """Create daily aggregated summary of commits"""
    df_summary = df_commits.groupBy(
        col("CommitYear"),
        col("CommitQuarter"),
        col("CommitMonth"),
        col("CommitWeek"),
        col("CommitDay"),
        col("RepositoryKey"),
        col("AuthorUserKey")
    ).agg(
        spark_count("*").alias("CommitCount"),
        spark_sum("LinesAdded").alias("TotalLinesAdded"),
        spark_sum("LinesDeleted").alias("TotalLinesDeleted"),
        spark_sum("LinesModified").alias("TotalLinesModified"),
        spark_sum("TotalLinesChanged").alias("TotalLinesChanged"),
        avg("CommitMessageLength").alias("AvgCommitMessageLength"),
        countDistinct("CommitId").alias("UniqueCommits")
    )
    return df_summary

From this, we can see:

  • Daily commit volume per developer
  • Lines of code added/deleted trends
  • Code churn patterns
  • Commit message quality (via message length)
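
Code churn, for instance, can be derived directly from the summed columns. The helper below uses one common definition (the deleted share of total change) as an illustration, not the notebook's exact formula; its inputs correspond to TotalLinesAdded and TotalLinesDeleted.

```python
def churn_ratio(total_lines_added: int, total_lines_deleted: int) -> float:
    """Share of changed lines that were deletions for a summary bucket.
    A higher value suggests more rework relative to new code."""
    total = total_lines_added + total_lines_deleted
    return round(total_lines_deleted / total, 3) if total > 0 else 0.0

print(churn_ratio(300, 100))  # 0.25
```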

Pull Request Analytics

PR metrics reveal collaboration patterns:

def create_fact_pull_request_daily_summary():
    """Create daily aggregated summary of pull requests"""
    df_summary = df_prs.groupBy(
        col("CreatedYear"),
        col("CreatedQuarter"),
        col("CreatedMonth"),
        col("CreatedWeek"),
        col("CreatedDay"),
        col("RepositoryKey"),
        col("PullRequestStatus"),
        col("IsMerged"),
        col("IsOpen")
    ).agg(
        spark_count("*").alias("PullRequestCount"),
        spark_sum("ReviewerCount").alias("TotalReviewers"),
        spark_sum("LinkedWorkItemCount").alias("TotalLinkedWorkItems"),
        avg("PullRequestAgeDays").alias("AvgPullRequestAgeDays"),
        spark_sum(when(col("IsDraft") == True, 1).otherwise(0)).alias("DraftCount"),
        spark_sum(when(col("IsMerged") == True, 1).otherwise(0)).alias("MergedCount")
    )
    return df_summary

This tells us:

  • PR throughput by repository
  • Average time to merge
  • Review participation rates
  • Draft vs. active PR ratios
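
Ratios like review participation and merge rate fall straight out of the summed columns. These helpers are a sketch: the names are hypothetical, but the inputs correspond to TotalReviewers, MergedCount, and PullRequestCount from the summary above.

```python
def review_participation(total_reviewers: int, pr_count: int) -> float:
    """Average reviewers per PR for a day/repository bucket."""
    return round(total_reviewers / pr_count, 2) if pr_count > 0 else 0.0

def merge_rate(merged_count: int, pr_count: int) -> float:
    """Share of PRs in the bucket that merged."""
    return round(merged_count / pr_count, 2) if pr_count > 0 else 0.0

print(review_participation(21, 10))  # 2.1
print(merge_rate(7, 10))             # 0.7
```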

The Computed Metrics: Hidden Gold

Beyond simple aggregations, we compute metrics that reveal deeper insights:

Completion Percentage:

completion_pct = (completed_hours / original_hours * 100) if original_hours > 0 else 0

Work Item Age:

age_days = (datetime.now() - created_date).days

Repository Size (MB):

RepositorySizeMB = round(RepositorySizeBytes / 1024.0 / 1024.0, 2)

Date Dimensions:

df_gold = df_gold.withColumn("CreatedYear", year(col("CreatedDate")))
df_gold = df_gold.withColumn("CreatedQuarter", quarter(col("CreatedDate")))
df_gold = df_gold.withColumn("CreatedMonth", month(col("CreatedDate")))
df_gold = df_gold.withColumn("CreatedWeek", weekofyear(col("CreatedDate")))
df_gold = df_gold.withColumn("CreatedDay", dayofmonth(col("CreatedDate")))

These computed fields enable time-based analysis without complex date calculations in every query.
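
The same derivation can be expressed with the standard library, which is handy for unit-testing the logic outside Spark. Note that Spark's weekofyear() is ISO-week based, matching Python's isocalendar().

```python
from datetime import date

def date_parts(d: date) -> dict:
    """Derive the same date-dimension fields the PySpark snippet above
    computes with year()/quarter()/month()/weekofyear()/dayofmonth()."""
    return {
        "CreatedYear": d.year,
        "CreatedQuarter": (d.month - 1) // 3 + 1,
        "CreatedMonth": d.month,
        "CreatedWeek": d.isocalendar()[1],  # ISO week, like weekofyear()
        "CreatedDay": d.day,
    }

print(date_parts(date(2024, 11, 5)))
```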

The Business Value: From Data to Decisions

Before Gold Mining

Question: “How productive was our team last week?”
Answer: “Let me write a query… join work items with users… filter by date… aggregate… give me 30 minutes.”

Question: “Which repository has the most activity?”
Answer: “I’ll need to query commits, PRs, and branches separately, then combine them.”

Question: “What’s our sprint completion rate?”
Answer: “That requires joining work items with iterations, calculating completion percentages… maybe tomorrow?”

After Gold Mining

Question: “How productive was our team last week?”
Answer: “Let me check fact_work_item_daily_summary… filtered by last week… 60.86 work items completed, 0.50 hours average time per item.”

Question: “Which repository has the most activity?”
Answer: “Query fact_commit_daily_summary grouped by repository… marketing-analytics-dashboard leads with 156 commits.”

Question: “What’s our sprint completion rate?”
Answer: “Join fact_work_item with dim_iteration, filter by current sprint… 75% completion rate.”

The Power of Pre-Computation

By pre-computing these aggregations, we’ve transformed:

  • Complex queries → Simple lookups
  • Minutes of processing → Milliseconds of retrieval
  • Ad-hoc analysis → Instant insights
  • Data exploration → Business intelligence

Real-World Impact

Sprint Planning:
Before: “Based on last month’s data, we estimate…”
After: “Our daily summary shows we complete an average of 8.7 work items per day, so this sprint…”

Resource Allocation:
Before: “We think this project needs more developers…”
After: “The commit summary shows this repository has 3x the activity of others—let’s allocate resources accordingly.”

Quality Metrics:
Before: “Most work items probably have descriptions…”
After: “Our summary shows 87% of work items have descriptions and 72% have acceptance criteria—we’re improving.”

Velocity Tracking:
Before: “We’ll know our velocity at the end of the sprint…”
After: “Daily summaries show we’re on track—we’ve completed 60% of planned work with 50% of time elapsed.”

The Technical Excellence

Our gold layer isn’t just aggregated data—it’s thoughtfully designed:

  • Star Schema: Enables fast, intuitive queries
  • Delta Tables: Ensure ACID transactions and time travel
  • Date Dimensions: Support time-based analysis out of the box
  • Computed Metrics: Pre-calculated values save query time
  • Business-Friendly Names: Column names like WorkItemCount instead of cnt_wi

The Mining Process

Every day, our pipeline:

  1. Reads from Silver layer (cleaned, validated data)
  2. Transforms to Gold layer (star schema with dimensions and facts)
  3. Aggregates daily summaries (pre-computed metrics)
  4. Stores in Delta format (fast, reliable, queryable)

The entire process runs automatically, ensuring our gold is always current.

What We Discovered

Mining our data revealed insights we never expected:

  • Peak Productivity: Tuesdays and Wednesdays show 40% higher commit activity
  • Code Quality: Repositories with longer commit messages have 30% fewer bugs
  • Review Patterns: PRs with 2+ reviewers merge 50% faster than single-reviewer PRs
  • Work Item Health: Items with acceptance criteria complete 25% faster

These aren’t assumptions—they’re patterns mined from our data.

The Code That Mines Gold

Our gold layer transformation runs in a Fabric notebook:

# Transform dimension tables first
dim_project = transform_dim_project()
dim_repository = transform_dim_repository()
dim_team = transform_dim_team()
dim_user = transform_dim_user()
dim_iteration = transform_dim_iteration()

# Transform fact tables
fact_work_item = transform_fact_work_item()
fact_commit = transform_fact_commit()
fact_branch = transform_fact_branch()
fact_pull_request = transform_fact_pull_request()

# Create aggregated analytics tables
fact_work_item_daily_summary = create_fact_work_item_daily_summary()
fact_commit_daily_summary = create_fact_commit_daily_summary()
fact_pull_request_daily_summary = create_fact_pull_request_daily_summary()

Simple, repeatable, powerful.

The Presentation: Excellent Analytics

We don’t just mine the gold—we present it beautifully:

  • Power BI Dashboards: Visual representations of our mined insights
  • SQL Views: Pre-built queries for common questions
  • KQL Queries: Real-time analysis of streaming data
  • Automated Reports: Daily summaries delivered automatically

The gold layer makes all of this possible. Without it, every dashboard would require complex joins and aggregations. With it, dashboards load instantly and insights are clear.
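
A "pre-built query" can be as simple as a view over a summary table that dashboards select from directly. This sketch again uses sqlite3 as a stand-in engine; the table names mirror this post's schema, the view name is hypothetical, and the rows are invented.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE fact_commit_daily_summary (
    CommitYear INT, CommitMonth INT, CommitDay INT,
    RepositoryKey INT, CommitCount INT
);
INSERT INTO fact_commit_daily_summary VALUES
    (2024, 6, 3, 1, 12), (2024, 6, 3, 2, 4), (2024, 6, 4, 1, 7);

-- A pre-built answer to "which repository has the most activity?"
CREATE VIEW vw_commits_per_repo AS
SELECT RepositoryKey, SUM(CommitCount) AS TotalCommits
FROM fact_commit_daily_summary
GROUP BY RepositoryKey;
""")

rows = conn.execute(
    "SELECT RepositoryKey, TotalCommits FROM vw_commits_per_repo "
    "ORDER BY TotalCommits DESC"
).fetchall()
print(rows)  # [(1, 19), (2, 4)]
```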

The Continuous Mining

Gold mining isn’t a one-time activity. Every day, new data arrives, and we mine new insights:

  • New work items reveal current priorities
  • New commits show active development areas
  • New PRs indicate collaboration patterns
  • New summaries update our understanding

Our gold layer grows richer every day.

The Bottom Line

We started with raw DevOps data—millions of events, thousands of work items, hundreds of repositories. Through careful mining, we transformed that data into:

  • Actionable metrics that drive decisions
  • Pre-computed aggregations that enable fast analysis
  • Business-friendly structures that make insights accessible
  • Time-based dimensions that reveal trends

We didn’t just store data; we mined gold from it. And that gold powers every dashboard, every report, every decision we make.