Intro to Software Engineering – 4/12/03


A PRIMER ON EMPIRICAL STUDIES (Powerpoint presentation from binder)


Model of Science

-          start with something about the world that is a problem, a puzzle

-          make some kind of conjecture, hypothesis (conjecture=guess)

-          build a model to reflect fact and phenomena

-          converge on theory, some fundamental truth that explains the data

-          cycle between induction and deduction (or we should… in most cases, iteration ends after 2-3 attempts, and then people move on)


Technology Transfer

-          Three levels

o        Science – things that are interesting

o        Technology – pushing it to see how far it can go

o        Practical Need & Use – create technology and sometimes science to accomplish something

-          Sometimes, interactions between these levels are difficult, unexpected

o        E.g. Nylon was in use by engineers (tech) long before scientists understood how it worked (science)



-          The problem (in software engineering) is that the general research process is slow when there is no balance between the science/theory part and the experimental part… experiments are weak

o        Knowledge is encoded slowly

o        Silly, unproductive research is not pruned early –

§         lack the necessary quantitative metrics to determine what should be done, to evaluate what is good and bad

§         judgments tend to be fairly subjective

-          Credibility is difficult to establish between science and engineering, which impedes tech transfer


Empirical Software Engineering Studies

-          Tend to be well understood techniques from psychology and statistics – certain amount of credibility – trying to understand what makes good software programmers

-          Studies with large populations, where social factors are an unknown, are not well established

-          What we need: a spectrum of empirical techniques (drawing on psychology and sociology) that are robust to the large variances from social factors, so that we can figure out how people use technology and what differentiates one technology from another


How do we make progress

-          Better empirical studies

o        Ask an important question – the cost of doing experiments is like the cost of building the system…. If you don’t have the system answering a user need, you’re wasting time

o        Establish principles from the study

o        Make experiments rich enough to generate new questions – this typically happens, but sometimes the results are simply not interesting

o        Cost-effective - use resources wisely and effectively… e.g. using students versus live developers (pizza versus salary)

o        Repeatable – build credibility in the metrics over a span of time – builds reliability, generalities

-          Credible interpretations

o        Validity – construct, internal, and external

o        Test hypothesis

o        Removal of alternative explanations – to identify critical causalities

§         E.g. – size of feet is correlated to ability to spell.  Babies can’t spell at all.  Incorrect explanation, spelling ability has to do with age

§         E.g. – 4GL (4th generation language – graphical interface to an existing programming language) improved productivity immensely.  Yes, it did, but not in the way that the 4GL vendor could claim credit for.  The 4GL tool required new (better) hardware, new (better) programmers, and the feature that was supposed to improve productivity had been turned off.

o        Adequate precision

o        Available to public
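The foot-size example above can be made concrete with a small sketch. The data below is entirely made up for illustration (nothing here comes from the lecture): both foot size and spelling score are driven by age, plus noise that is deliberately uncorrelated within each age group. The names `pearson`, `NOISE`, and `children` are assumptions of this sketch.

```python
from math import sqrt


def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / sqrt(var_x * var_y)


# Synthetic (hypothetical) noise pairs, balanced so that foot noise and
# spelling noise are uncorrelated inside every age group.
NOISE = [(-1, -1), (-1, 1), (1, -1), (1, 1)]

# Each child: (age, foot size, spelling score). Both depend only on age.
children = [
    (age, 1.5 * age + foot_noise, 10 * age + spell_noise)
    for age in range(2, 11)
    for foot_noise, spell_noise in NOISE
]

# Correlation over all children, ignoring age.
overall = pearson([c[1] for c in children], [c[2] for c in children])

# Correlation within each age group, i.e. with the alternative
# explanation (age) held constant.
within = {
    age: pearson(
        [c[1] for c in children if c[0] == age],
        [c[2] for c in children if c[0] == age],
    )
    for age in range(2, 11)
}

print(f"overall correlation: {overall:.2f}")
print("within-age correlations:", {a: round(r, 2) for a, r in within.items()})
```

In this synthetic data the overall correlation is strong, while the correlation within every age group is zero, which is what "removing the alternative explanation" looks like in practice.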


(skip a slide)




Reconciling Theory with Reality

-          True State of Nature (Reality) – where we get our observations, check our experimental deductions.  We get this data through some kind of “lens” or instrumentation (rectangle on diagram) that could generate noise.

-          Data is obtained and used, iteratively, to reconcile theory with real world

-          In software, 5 factors of reviews (from yesterday’s lecture) can be viewed as data that can be predictably tweaked with known effects.



-          An empirical study – a study to reconcile theory and reality…

-          Three types

o        Anecdotal – asking people about their experiences, guesses, opinions – yields insight, but no theory.  In SWE, most claims are based almost solely on anecdotal evidence.

o        Case studies – deeper, not as broad

o        Experiment – very narrow, very deep


Recipe for an Empirical Study (or paper)


Research Context

-          What is the problem, what are you trying to resolve?

-          Indicate historical context in terms of previous research

-          Locate what you’re doing in that context (related work)



-          Two different ways to think of this

o        Abstract – “all good musicians are good programmers”

§         Reflection of world

§         What defines a good musician, a good programmer?

o        Concrete – a way of determining what exactly is going on at the operational level


What Is Experimental Design

-          Input – independent variables

-          Function – manipulation of variables

-          Output – Dependent variables


What is Validity?

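-          Construct – do the metrics actually measure the concept being studied?

-          Internal – are the changes in the dependent variables really caused by the manipulated independent variables?

-          External – how well do the results generalize beyond the study setting?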


A Note on Choosing Experimental Design – two engineers example



Spectrum of Empirical Studies


Other Considerations

-          Ethics

o        not as relevant to software; experiments are not life-threatening, nor are they life-changing

o        privacy tends to be important, but can be resolved by maintaining complete anonymity

o        participants need to be protected from stigma upon dropping out of an experiment

-          Retrospective versus Prospective

o        Companies have a lot of available data, it should be used, but it’s not… many reasons, such as competitive advantage, or divulges proprietary technology or techniques

-          In Situ versus In Vitro – field experiment versus laboratory experiment

o        Typically, student studies are not given much credibility (lab experiments)

o        Results are wanted from real programmers in real context, because of the number of factors that cannot be controlled or simulated

o        More external validity in real studies, more internal validity in lab studies


Significance & Hypothesis Testing : Neyman-Pearson testing theory (Quantitative analysis)

  • State hypotheses
    • H0 – Null theory – specific – means there is a very specific way of proving it false
    • H1 – Alternative theory – more general – more difficult to test or prove
  • Set significance α of observations. 
    • If α = 0.05, we accept a 5% chance of rejecting H0 when it is actually true (i.e. we test at the 95% confidence level)
  • Use observations and significance to accept/reject hypothesis
  • Errors:
    • Type 1 – rejecting H0 when it’s true (can lead to non-productive research)
    • Type 2 – accepting H0 when it’s false (can lead to lacking research)
  • A stricter (smaller) α guards against Type 1 errors, but makes Type 2 errors more likely.
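The accept/reject step can be sketched with an exact one-sided binomial test. The scenario and numbers are hypothetical, not from the lecture: in 20 paired comparisons, a new technique "wins" some number of times, and H0 says it wins only half the time (p0 = 0.5). The function names `binomial_p_value` and `decide` are assumptions of this sketch.

```python
from math import comb


def binomial_p_value(successes: int, trials: int, p0: float = 0.5) -> float:
    """One-sided p-value: P(X >= successes) if H0 (success rate p0) is true."""
    return sum(
        comb(trials, k) * p0**k * (1 - p0) ** (trials - k)
        for k in range(successes, trials + 1)
    )


def decide(p_value: float, alpha: float = 0.05) -> str:
    """Reject H0 only when the data would be rarer than alpha under H0."""
    return "reject H0" if p_value <= alpha else "fail to reject H0"


# Hypothetical outcomes: 16 wins out of 20, and 13 wins out of 20.
strong = binomial_p_value(16, 20)
weak = binomial_p_value(13, 20)

print(f"16/20 wins: p = {strong:.4f} -> {decide(strong)}")
print(f"13/20 wins: p = {weak:.4f} -> {decide(weak)}")
```

With α = 0.05, 16 wins out of 20 leads to rejecting H0, while 13 out of 20 does not. Tightening α makes rejection harder, which is the Type 1 versus Type 2 trade-off noted above.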


Power of an Experiment

  • N = number of subjects
  • The wider the deviations (variance in the data), the more likely a Type 2 error becomes for a fixed N, because α already fixes the Type 1 rate; a larger N offsets this
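Power is conventionally defined as 1 − β, the probability of rejecting H0 when it is in fact false. The sketch below is one way to see how it depends on N and on the spread of the data. It assumes a one-sided z-test with known standard deviation, which the notes do not specify; `power_one_sided_z`, `effect`, and `sigma` are names introduced here for illustration.

```python
from statistics import NormalDist


def power_one_sided_z(effect: float, sigma: float, n: int, alpha: float = 0.05) -> float:
    """Power of a one-sided z-test of H0: mean = 0 when the true mean is `effect`.

    sigma is the (assumed known) standard deviation of one observation,
    n is the number of subjects.
    """
    std_normal = NormalDist()
    critical = std_normal.inv_cdf(1 - alpha)  # rejection threshold in z units
    shift = effect * n**0.5 / sigma           # true effect in z units
    return 1 - std_normal.cdf(critical - shift)


# Hypothetical effect of 0.5 units; vary the number of subjects and the spread.
for n in (10, 40):
    for sigma in (1.0, 2.0):
        power = power_one_sided_z(effect=0.5, sigma=sigma, n=n)
        print(f"N={n:3d} sigma={sigma:.1f} power={power:.2f}")
```

Under these assumptions, power rises with N and falls as the deviations widen, so a noisy study with few subjects is the one most likely to miss a real effect.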


Grounded Theory (Qualitative Analysis)

-          systematic set of techniques

o        comparative analysis

o        theoretical sampling

o        constructing formal theory

o        clarifying/assessing comparative studies


Drawing Conclusions

  • fundamentals versus non-fundamentals:  most of the work in empirical studies has been done in the non-fundamental areas




Extra credit – Jonathan Aldrich – 11am ACES 2.302 (auditorium) – “Using Types To Enforce Architectural Designs”







Evaluation Outline

  • Review of own studies – easier to tear apart than to publicly tear up colleagues’ work


Review Papers

  • Software Faults – Perry and Stieg, “Software Faults in Evolving a Large Real-Time System” (in handouts)
  • Time Study – two papers
    • Bradac, “Prototyping a Process Monitoring Experiment” (in handouts)
    • Perry, “People, Organizations and Process Improvement” (in handouts)


Experimental Site

  • C programming language
  • UNIX environment – shared team development environment
  • SCCS for change management


Research Context

  • most error studies in the past were done on “green field” systems (systems written from scratch, not established, evolving systems)


Software Faults – Research Question


Software Faults – Experimental Design

  • Two phase study
    • Entire set of faults
    • Subset of faults (design/implementation)
  • Owners of faults were responsible for capturing data
    • Member of team helped develop surveys
    • Volunteers reviewed/pre-tested the surveys
  • Management imposed limitations
    • Volunteers only
    • Completely anonymous
    • Completely non-intrusive (this affected results)


Software Faults – Phase 1

  • Problem categories (what kind of faults)
  • Test phases (when they were found)


Software Faults – Phase 1 Results

  • response rate of 68% (out of 5000 people) – this paper was not accepted for a 1990 conference because response rate was not 100%, even though 68% is quite impressive
  • statistics


Software Faults – Phase 1 Summary

  • faults were found all throughout entire software lifecycle, even post-deployment
  • majority of faults were found in test and late in testing
  • 1/3 overhead


Software Faults – Phase 2

  • Fault types (design and coding)


Software Faults – Phase 2

  • Cost Information
    • Ease of finding/reproducing fault
    • Ease of fixing fault


Software Faults – Phase 2

  • Root cause and solution
    • Underlying causes:  none given (I was stupid), submitted under duress (I was under deadline), others
    • Means of prevention


Software Faults – Analyses

  • Test for pair-wise independence – Chi-square test
    • Observing the interdependences between some variables
  • Example – find and fix data
    • Fix (easy + moderate, difficult + very difficult) = (784,216)
    • Find (easy + moderate, difficult + very difficult) = (909, 91)


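                                     Fix (easy + medium)   Fix (difficult + very difficult)   Total
Find (easy + medium)                 713 (725)             196 (184)                          909
Find (difficult + very difficult)    71 (59)               20 (32)                            91
Total                                784                   216                                1000

Each cell: the count expected if find and fix were independent (row total × column total / 1000), then the observed count in parentheses.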


  • All relationships were somewhat dependent
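The find/fix example can be checked with a short chi-square computation. This is a sketch, not the paper's own analysis: it assumes the parenthesized counts in the notes are the observed ones (the unparenthesized counts match row total × column total / N, i.e. the values expected under independence). The function name `chi_square_2x2` is introduced here for illustration.

```python
from math import erfc, sqrt


def chi_square_2x2(observed):
    """Chi-square test of independence for a 2x2 table of observed counts.

    Returns (expected counts, chi-square statistic, p-value).
    """
    row_totals = [sum(row) for row in observed]
    col_totals = [sum(col) for col in zip(*observed)]
    n = sum(row_totals)
    # Expected count for a cell if the two variables were independent.
    expected = [[r * c / n for c in col_totals] for r in row_totals]
    statistic = sum(
        (o - e) ** 2 / e
        for obs_row, exp_row in zip(observed, expected)
        for o, e in zip(obs_row, exp_row)
    )
    # A 2x2 table has 1 degree of freedom, for which this p-value is exact.
    p_value = erfc(sqrt(statistic / 2))
    return expected, statistic, p_value


# Rows: find (easy + medium), find (difficult + very difficult).
# Columns: fix (easy + medium), fix (difficult + very difficult).
observed = [[725, 184], [59, 32]]

expected, statistic, p_value = chi_square_2x2(observed)
print("expected:", [[round(e) for e in row] for row in expected])
print(f"chi-square = {statistic:.2f}, p = {p_value:.4f}")
```

The statistic comes out at about 10.9, well above the 3.84 critical value for α = 0.05 with one degree of freedom, so independence of find and fix difficulty is rejected. Faults that were hard to find were also hard to fix more often than chance would predict.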


Software Faults – Analyses

  • findings on correlated variables


Table 2, 7 in Perry-Stieg


Software Faults – Results

  • 68% response
  • Variables were correlated
  • Lack of information tended to dominate underlying causes (so a lesson is to hire people who have knowledge of the domain)
  • Informal means of prevention preferred over formal (US versus Europe)


Software Faults – Evaluation (Better Empirical Studies)

  • Answers a question? Yes
  • Establishes principles?  Yes
  • Enables generating and refining hypotheses?  Yes
  • Cost effective?  Inexpensive survey, expensive analysis (people intensive)
  • Repeatable?  Yes, have re-used experiment design; similar correlations, not same results


Software Faults – Evaluation (Credible Interpretations – strengths)

  • CV = construct validity
  • IV = internal validity
  • EV = external validity


Software Faults – Evaluation (Credible Interpretation – weaknesses)

  • CV: Find/fix interpretation was not identical – “easy” vs “difficult” scale
  • CV: Fault categories were unstructured (too many categories – fixed in later experiments by organizing in a tree)
  • IV: No post-survey validation – no way of telling how reliable the data was.  Couldn’t do a post-survey because responses were completely anonymous, non-intrusive
  • IV: Lapse between problem resolution and survey (1 year)
  • EV: 32% responses missing
  • EV:  single case – single system, system domain


Software Faults – Evaluation

  • Test hypothesis – yes
  • Adequate precision – yes over 2/3 response, correlation, dependence/independence
  • Available to public – no, lacked absolute numbers (because AT&T refused publication of # of faults), journal only had summary data


Software Faults – Summary

  • Useful – answered questions
  • Done within limitations of constraints
  • Had effect – new measure put in place for inspection and review techniques
  • Research implications
  • Identified weakness in survey instrument
  • Questions about generalizability


Time Studies

  • Three studies – iterations


Time Studies

  • Research context
  • Research question (hypotheses)
    • How does a developer spend his or her time in the context of team development, as part of a large system development
    • Inter-team/personal dependencies
    • How much time is spent in:  communication, relevant processes, lost


Time Studies – Phase 1

  • Specific null hypothesis
    • A person is 100% effective (race time=lapse time) in the context of teams in large-scale SW dev
    • Race time = how much actual time does something take
    • Lapse time = interval, from start to finish
    • (These MAY be on the final)
  • Experimental Design
    • Project notebooks and personal diaries of single developer – used to reconstruct 32 months
    • Time categories – specific process activities
    • Time categories – specific waiting activities


Time Studies – Phase 1 Data – Time Spent Early In Development Process

  • Lots of waiting, mostly within a couple of categories


Time Studies – Phase 1 Data – Time Spent Later In Development Process

  • More working time, less waiting; much jumping around between tasks


Time Studies - Phase 1 Results

  • Race/lapse = 0.4 - proved H0 wrong
  • Blocking is significant
  • Process phenomenology
  • Provides important basis on what to look at with deeper study


Time Studies – Phase 2

  • Research context – refined phase 1, multiple developers and developments
  • Research questions – is blocking a problem with multiple devs, was the Phase 1 developer a representative subject, is blocking as significant?
  • Experimental Design – self-reporting instrument – journal with specific format for more detail, time precision in half-hours instead of hours


Time Studies – Phase 2 Results

  • Race/lapse = 0.4 again
  • Phase 1 verified
  • Blocked = Context switching
  • Clarifies how developers spend their time
  • Variance of self-reporting


Time Studies – Phase 3

  • most expensive phase
  • Research context
    • Follow-on to self-reporting
    • More detailed than Phase 2
  • Research questions
    • How valid was self-reporting
    • Time resolution smaller than 30 minutes
  • Experimental design
    • Full-day observations (unplanned random)
    • Compared observations to self-reports


Time Studies – Phase 3 Data – Self-Report Fidelity

  • 1A and 1B were very close in agreement (agreement close to 1.0)
  • Others did not report as accurately


Time Studies – Phase 3 Data – Unique Contacts Per Day

  • Contacts = number of people with whom the developer interacted, informally


Time Studies – Phase 3 Data – Number of Messages Per Day (unplanned interaction)

  • Audix (voicemail, phone messages)
  • Email
    • very high number received, meeting notices, minutes, project plans, etc
    • very low number sent, tech-nerds paranoid of passing information electronically
  • Phone – ok
  • In Person – preferred method of interaction


Time Studies – Phase 3 Results

  • Self-consistent (observed versus self-reported), but not uniform
  • Clarifies how developers spend their time      
    • 75 minutes per day in informal communication
    • Importance of oral communication… much less written
  • Informal communication – important to development process


Time Studies – Evaluation

  • Important questions?  Yes
  • Establishes principles?  Yes, race/lapse time = 0.4 (the 2.5x fudge factor for manhour estimation), importance of informal communication
  • Enables generating/refining hypothesis?  Yes
  • Cost effective?  Varied, last phase was expensive, first phase was cheap
  • Repeatable?  Useful design, expect similar correlations
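The link between the 0.4 ratio and the 2.5x factor is simple arithmetic: if only 40% of the lapse time is race time, then lapse time = race time / 0.4 = 2.5 × race time. A minimal sketch follows; the function names and the sample hours are hypothetical, chosen only to illustrate the calculation.

```python
def race_lapse_ratio(race_hours: float, lapse_hours: float) -> float:
    """Fraction of the start-to-finish interval spent actually working."""
    if lapse_hours <= 0:
        raise ValueError("lapse_hours must be positive")
    return race_hours / lapse_hours


def estimate_lapse_hours(race_hours: float, ratio: float = 0.4) -> float:
    """Scale an effort estimate (race time) up to expected calendar time."""
    if not 0 < ratio <= 1:
        raise ValueError("ratio must be in (0, 1]")
    return race_hours / ratio


# Hypothetical task: 16 hours of actual work spread over a 40-hour interval.
ratio = race_lapse_ratio(16, 40)
print(f"race/lapse = {ratio:.1f}")

# A task estimated at 100 working hours, using the 0.4 ratio from the studies.
print(f"expected lapse time = {estimate_lapse_hours(100):.0f} hours")
```

The null hypothesis in Phase 1 corresponds to `ratio == 1.0`, where no scaling would be needed.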


Time Studies – Evaluation – Credible Interpretations (Strengths)

  • CV:  Complete data source over complete development
  • CV:  Retrospective, self-reporting, and observed data
  • CV:  Established process vs state in process (categorized times)
  • IV:  All three phases agreed
  • IV:  Proved consistency between self-report and observed
  • IV:  Varying degrees of resolution (1 hour vs 30 minute)
  • EV: Large-scale SW dev, team
  • EV: Covered whole life-cycle
  • EV:  Common language and dev environment

Time Studies – Evaluation – Credible Interpretations (Weaknesses)

  • CV:  Blocked, context switching ambiguity
  • IV:  Loss of detail, due to time passed (“what did I do all day?  I know I was busy…”)
  • IV:  Inaccuracy of self-reporting
  • IV:  Observation effects (people being watched worked more conscientiously)
  • EV:  Application domain
  • EV:  Bell Labs


Time Studies – Evaluation

  • Test hypothesis – yes, refuted race/lapse = 1
  • Removal of alternative explanations – yes, exposed the critical problems
  • Adequate precision – yes, 1-hour down to 30-minute time blocks
  • Available to public – yes


Time Studies – Summary







(not from paper)


Forming Teams

  • Success Factors
    • Do effective planning
    • Self-managing
    • Get the work done


  • Factors – 4 Stages
    • Inception and acceptance of project goals
    • Agreeing on solution of technical issues – agreeing on goals
    • Resolving Conflicts and political issues
    • Attaining goals – execution i.e. “doing it”


  • Contributing Factors to Team Success
    • Management strategy
      • Initially – context for self-management (or so they claim – some of the guys on the project said, “we didn’t have time to pay attention to you”)
    • Shared vision among team leaders (two who had experience, complete agreement)
    • Worked on critical issues ahead of time – testing environment built way ahead of time
    • Optimized critical measures using “Divide + Conquer”
    • Proactive
      • Solved all critical issues
      • Process improvement


  • Critical Factors for Teams
    • You need time to build trust
    • Must be face-to-face to build relationships… especially important for geographically distributed developers – all trips were planned.  After relationships are established, then web, e-mail, Netmeeting etc can be used for interaction
    • Clear agreement – on team goals
    • Match the communication bandwidth to the inter-site interfaces


Unrelated:  “I wish I could get a job as a failed CEO… 90 million dollars, work for 6 months.  I could do it well too.  I can fail better than anyone else, with such panache…”




(Herbsleb, Grinter, Perry)


Basis for Geographically Distributed Development Projects

  • Functional areas of expertise
  • Product structure i.e. the architecture
  • Process steps
  • Customization


Unrelated:  “How do American companies make light beer?  They move the goat further upstream…”


  • Distance – causes disruption of Ad Hoc communication (from previous paper about informal communication time)





Perry and Kaiser


Software Development Environments made of three components:

  • Policies
  • Mechanisms
  • Structures


Four Classes of Models


  • Individual – issue is construction, dominated by mechanism
  • Family – issue is coordination, dominated by structures
  • City – issue is cooperation, dominated by policies
  • State – issue is commonality, dominated by higher-order policies



EXAM – will be comprehensive, it will look like the midterm, but with different questions


Papers that will not be covered: 

  • Lehman Chapter 19 section 3-end
  • Using Events for Data capture
  • A Study in Software Data Capture and Analysis