Program Evaluation Guide

The following material provides guidance on planning and implementing evaluations of motivational interviewing programs and training efforts. Most, but not all, of the examples are drawn from addictions settings. For additional information on program evaluation, please see the section on Relevant Resources.


Guide Topics Include

What is program evaluation?

Why do program evaluation?

Using program evaluation results.

Selecting an evaluation team.

Determining what to evaluate.

Selecting an outcome variable.

Selecting a design.

Analyzing findings.

Applying findings.



What is program evaluation?


Program evaluation involves a process of systematically gathering, interpreting, and applying information to answer questions about a specific clinical or training project, policy, or program. The purpose of evaluation is to determine whether these activities are successful in meeting their objectives. Evaluation can provide information to help improve programs by identifying specific strengths and weaknesses or areas for staff training and development. Lastly, evaluation results can be used to increase the competitiveness of a program for funding or grant support or to contribute to generalizable knowledge about the effective treatment of specific problems or populations.








Why do program evaluation?


Wouldn’t I know if my program is effective just by looking at it? Not necessarily. Systematically evaluating programs can reveal processes and data that are not easily observable at first glance, combine data from large groups of clients or multiple sources into a digestible format, and reduce some forms of bias. For these reasons, evaluations that rely solely on anecdotal or testimonial evidence may not be compelling to funders and granting agencies and may not yield the optimal services for clients.


Throughout the field of addictions, as well as in other areas of medicine and behavioral health care, there is an increasing emphasis on evidence-based practice. While motivational interviewing is considered an evidence-based practice (see SAMHSA’s National Registry of Evidence-Based Practices and Programs), much of the research on MI has been conducted in the form of randomized controlled trials. These trials are often done in tightly controlled research settings that differ in important ways from real-world clinical settings. Many such trials have been criticized because the extensive and costly training and oversight of providers is often not feasible in front-line programs. Program evaluation can provide feedback that will help to ensure that the way in which MI is being implemented in your setting is yielding results that are worth the investment of time and resources you are committing, taking into account the unique characteristics of your clients, staff, and community.









Using Program Evaluation Results


Program evaluation results can be used to inform best practice at various stages in the life of a clinical or training program. The following are just a few ways in which program evaluation results can be applied:


During program development: Thinking about evaluation “up front” can help in developing better clinical or training projects, policies, or programs in the first place. If something is not clear cut enough to be measured, it is likely not concrete enough to be understood and reliably implemented by trainers or providers.


When choosing among alternatives: If you are seeking to address a specific clinical or training issue, program evaluation can be used to compare competing models before selecting one for implementation.



Example: A program is trying to decide whether it’s more effective to bring in an outside training consultant to conduct in-house motivational interviewing trainings or send staff out to community trainings. The program hires an outside training consultant to train the staff at one site, and sends staff from another site to a nearby community training. A cost evaluation is conducted and 10 clinicians from each site are randomly selected to provide session practice samples that are rated and compared for MI proficiency.


Prior to implementing: Before implementing a large-scale program, it can be helpful to run a smaller-scale pilot version to troubleshoot.


Shortly after implementing: If you have recently implemented a new program, evaluation can be used to determine whether the program is meeting its stated goals and objectives. Evaluation can also help to determine if a program is being implemented as intended, as the actual delivery of services can sometimes vary widely from what was developed in the planning process.


Long after implementing: If you have been using a program for some time, evaluation can be helpful to determine whether it should remain intact, be modified, or be stopped. Evaluations can be used to identify ways in which a program could be made more efficient or less costly. You may also conduct evaluation when you need proof that a program is effective for funders, public relations, or marketing. Outcome evaluations are increasingly required by government and non-profit funders to verify that programs are helping clients.










Selecting an Evaluation Team


The first step in conducting a program evaluation involves selecting an evaluation team. Many agencies will have sufficient resources within the organization to select an internal team. Using an internal team will typically be cheaper than hiring an outside consultant. Additionally, an internal team will have the advantage of understanding the nuances of the system they are evaluating and have a leg up on developing the rapport that is needed to get buy-in from all staff involved.


Outside consultants can also be hired to conduct or assist in program evaluation. Ideally, consultant evaluators of motivational interviewing programs will have both general program evaluation training and experience, as well as specific expertise in motivational interviewing. Outside consultants may come to the table with more extensive evaluation training and experience than internal staff and can have the added benefit of providing objective or new perspectives. The downsides of hiring outside consultants include financial cost and their more limited perspective of agency culture.


Ideally, all stakeholders should be involved in planning the evaluation. Unfortunately, the individuals who are often most affected by an evaluation, including clinicians, clients, and the community, are not represented on evaluation teams. Having representation of the views and knowledge of everyone affected by the program will result in evaluations that are better planned and more feasible to implement. A partnership approach to evaluation, in which stakeholders at all levels are involved in planning the evaluation and in gathering, analyzing, interpreting, and applying its results, is known as participatory evaluation.










Determining What to Evaluate


Prioritize what you need to know. When starting an evaluation, it can be tempting to want to learn everything there is to know about your program. However, program evaluation can require significant time and resources. Careful planning should be conducted to identify those issues that are most important to address and focus on those, particularly if you or your organization are new to program evaluation.

Sometimes the topic of an evaluation is obvious, either because it has been mandated from within or outside the organization, there is a specific problem or issue to address, or there has been a recent change in policy or practice. In other cases, organizations are hoping to initiate or continue an evaluation feedback loop to ensure that they are continually providing optimal services, and the exact focus of the evaluation requires more forethought and planning. In many ways, conducting a formal or informal needs assessment is the first step to conducting such an evaluation.










Selecting an Outcome Variable


After deciding what to evaluate, identifying the major outcomes that you want to examine or measure is the next important step in program evaluation. Outcome (or dependent) variables are those measures of addiction, mental health, or health that the program is expected to change. It can be challenging to convert program goals to measurable outcomes as this can sometimes involve transforming broad, intangible concepts into concrete, observable, measurable constructs.


Example: Part of the stated mission of one substance abuse aftercare program is to help clients maintain long-term sobriety. To determine the effectiveness of the program at meeting this goal, evaluators could ask permission to call participants 3, 6, and 12 months after discharge to administer several substance abuse questionnaires that measure substance use quantity, frequency, symptoms of abuse or dependence, and negative consequences.


The Process and Outcomes Measures section of this website provides a menu of developed treatment outcome and training outcome measures, many of which have been validated in the literature. When a relevant outcome measure is not available, existing measures can be modified, or tailored outcome instruments can be created by the evaluation team. When possible, standardized measurement instruments are preferable to self-created instruments, as more inferences can be drawn from instruments with established reliability, validity, or normative data.


When selecting outcome measures, both quantitative and qualitative options are available. The instruments available in the outcomes measures section of this website focus primarily on quantitative evaluation data, which is collected in the form of numbers such as the quantity or frequency of specific behaviors (number of drinks per day, percent days abstinent), test scores (Beck Depression Inventory), survey results (Satisfaction Ratings), or numbers or percentages of people with a specific characteristic in a population (number hospitalized, percent no longer depressed). Qualitative data involves descriptions that cannot be captured numerically. Qualitative data could be collected in the form of interviews, writing samples, focus groups, or session notes (for more information see the Relevant Resources section on Qualitative Evaluation).


When selecting outcome measures, there are additional methodological and practical issues that should be considered. These include:


Literacy: The literacy level of the individuals who will be responding should be considered when selecting outcome measures. If there is high variability in literacy within your population, the evaluation team should identify strategies for assessing whether literacy is sufficient, along with contingency plans, such as staff administration, for cases in which minimum literacy requirements are not met.


Language: If you serve a multi-lingual population it will be helpful to identify instruments which are already available in the languages of the people you serve. Translating self-developed instruments can be a challenge, but can be easier with bilingual speakers on staff.


Validity: The reported validity of instruments should be taken into account during the selection process. Validity refers to the degree to which the instrument measures what it is supposed to measure. Common forms of validity include predictive validity, which is the degree to which an instrument predicts future behavior, and concurrent validity, which is the degree to which an instrument is related to other instruments measuring the same construct.


Reliability: Instrument reliability refers to the demonstrated consistency of the instrument. Test-retest reliability measures the degree to which the instrument will yield the same results when given to the same respondents in two separate administrations.


Sensitivity to Change: This refers to the ability to detect meaningful changes in the outcome being measured. This can be particularly important when comparing groups or following individuals across time.
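

As an illustration of the reliability criterion above, the following is a minimal sketch of how test-retest reliability could be estimated as the correlation between two administrations of the same instrument. The use of a Pearson correlation, the function name, and the scores are assumptions made for illustration; they are not taken from any instrument described in this guide.

```python
from math import sqrt


def pearson_r(first, second):
    """Pearson correlation between two equal-length lists of scores."""
    if len(first) != len(second) or len(first) < 2:
        raise ValueError("need two equal-length lists with at least 2 scores")
    n = len(first)
    mean_1 = sum(first) / n
    mean_2 = sum(second) / n
    # Sum of cross-products and sums of squares around each mean.
    cross = sum((a - mean_1) * (b - mean_2) for a, b in zip(first, second))
    ss_1 = sum((a - mean_1) ** 2 for a in first)
    ss_2 = sum((b - mean_2) ** 2 for b in second)
    if ss_1 == 0 or ss_2 == 0:
        raise ValueError("scores must vary within each administration")
    return cross / sqrt(ss_1 * ss_2)


# Hypothetical scores for the same six respondents, two weeks apart.
time_1 = [12, 18, 25, 9, 30, 21]
time_2 = [14, 17, 27, 10, 28, 22]

test_retest_r = pearson_r(time_1, time_2)
print(f"Test-retest correlation: {test_retest_r:.2f}")
```

Values closer to 1 suggest that respondents keep roughly the same relative standing across administrations. What counts as acceptable depends on the instrument and the interval between administrations.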








Selecting a Design


After selecting an outcome variable, the next step in program evaluation is to decide on a method to answer your evaluation questions. Selecting an evaluation design often requires a balance between rigor/credibility and convenience/feasibility. Below are several commonly used designs in evaluation research.


No Comparison Group: Some evaluation studies choose to look at only one program of interest and do not include a comparison group. No comparison group studies may be the most convenient or only feasible option in some settings, but have important limitations, as it can be difficult to draw inferences about the causal impact of an intervention without another group to compare it to. Several no comparison group designs are available:


Post-design: In this design, outcomes are measured at the end of treatment or training. Such designs can tell evaluators whether program or training goals are being met, but can leave some uncertainty regarding the causes of program success or failure.


Pre-post designs: Pre-post designs measure changes that may occur across time by looking at participants before and after the program or training. This involves measuring whatever you’re interested in for a group (e.g. MI proficiency), applying the intervention to that group, and then measuring again. While such designs can demonstrate that important changes have or have not occurred, lack of a control group still leaves many threats to internal validity. In addition, such designs cannot yield important information about the process of change between pre- and post-measurements.


Simple Time Series Designs: These designs track the trend of change across time, minimizing some threats to internal validity. This design involves taking repeated measurements or observations, implementing the program or training, and then taking more repeated measurements. This design can also involve the addition of multiple components of a program across time, where different features can be added or taken away.
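

To make the pre-post design described above more concrete, here is a minimal sketch of how change scores could be summarized for one group. The function name, the rating scale, and the scores are hypothetical and are used only for illustration.

```python
from statistics import mean


def pre_post_change(pre, post):
    """Summarize change for participants measured before and after a program."""
    if len(pre) != len(post) or not pre:
        raise ValueError("need matched, non-empty pre and post scores")
    # One change score per participant: post minus pre.
    changes = [after - before for before, after in zip(pre, post)]
    return {
        "mean_pre": mean(pre),
        "mean_post": mean(post),
        "mean_change": mean(changes),
        "n_improved": sum(1 for c in changes if c > 0),
    }


# Hypothetical MI proficiency ratings for five clinicians (higher is better).
pre_scores = [2, 3, 2, 4, 3]
post_scores = [4, 4, 3, 4, 5]

summary = pre_post_change(pre_scores, post_scores)
print(summary)
```

A positive mean change shows that scores rose after training. Without a comparison group, it does not show that the training caused the rise.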


Comparison Group: In these designs, one group receives the program or intervention of interest, while another group receives no intervention, a different type of intervention, or “treatment as usual”. When feasible, having a comparison group can expand the conclusions that can be accurately drawn from evaluations. There are several types of comparison groups and respective designs that can affect the feasibility of the study and also the validity of the results.


Randomized Design: This is the “gold standard” for comparison group designs. By randomly assigning participants to groups, you improve your chances of having groups that are equivalent on important characteristics and can have increased confidence that the only important difference between groups is the intervention or training of interest. For example, a randomized design could involve randomly assigning patients requesting substance abuse treatment services to a 1-session motivational interviewing based assessment or to a standard intake assessment. For such a study see: MI (Motivational Interviewing) to Improve Treatment Engagement and Outcome in Subjects Seeking Treatment for Substance Abuse. Randomization can be determined by flipping a coin, drawing from a hat, or using a random number generator.


Non-Randomized Design: While randomized designs are ideal from a methodological perspective, practical and ethical constraints often make this impossible. If randomization is not possible, you can utilize an intact comparison group that is similar to the intervention group. For example, you could implement an intervention at one clinic site but not another, or in some therapy groups but not others. Creating groups that are equivalent helps improve the causal inferences that can be drawn from an evaluation.
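

For evaluations that can use a randomized design, the following is a minimal sketch of random assignment with a random number generator. The function name, group labels, participant IDs, and seed are hypothetical. Shuffling and splitting the list is one simple approach among several, and it keeps group sizes within one participant of each other.

```python
import random


def randomize(participant_ids, seed=None):
    """Randomly split participants into two groups of near-equal size."""
    rng = random.Random(seed)  # a fixed seed makes the assignment reproducible
    shuffled = list(participant_ids)
    rng.shuffle(shuffled)
    half = len(shuffled) // 2
    return {
        "mi_assessment": shuffled[:half],
        "standard_intake": shuffled[half:],
    }


# Hypothetical participant IDs.
ids = ["P01", "P02", "P03", "P04", "P05", "P06", "P07", "P08"]
groups = randomize(ids, seed=2024)

for name, members in groups.items():
    print(name, sorted(members))
```

Recording the seed along with the assignment list allows the evaluation team to document exactly how groups were formed.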


When choosing a design, it can be helpful to keep two important constructs in mind: internal and external validity.


Internal Validity refers to the degree to which a causal relationship between two variables (e.g. MI and outcome) can be demonstrated. There are several important threats to internal validity. First, if a comparison group is not used, or participants are not randomly assigned to intervention or training groups, it can be difficult to infer what may have caused the changes that are observed. For example, imagine an outpatient detoxification clinic that is interested in evaluating the impact of an initial session of motivational interviewing prior to the standard intake on treatment retention and alcohol use. If the MI session is given to only one group or to all incoming patients, even if outcomes are favorable or a pre-post change is observed, there could be several causes of change in addition to MI. External events, such as a new tax on alcohol, could have caused the change. Similarly, the changes may be a result of maturation or regression to the mean. Having a control group could resolve many of these issues.


External Validity is the degree to which results of evaluations can be generalized to the target population (e.g. the entire clinic or agency). Several threats to external validity also exist. First, when selecting samples for evaluation designs, it is important that they are representative of the larger population. For example, if the evaluation discussed above only accepts patients with alcohol use and excludes patients with recent substance use, the results of the evaluation may not be meaningful or relevant to the clinic as a whole. When choosing an evaluation design and thinking through methods, careful consideration should be given to the selection of participants to ensure that inclusion is as broad as possible and that inferences are not drawn beyond the specific sample that was included in the evaluation.









Analyzing Findings


The way in which evaluation results are analyzed is an important aspect of planning. Analysis of evaluation results can tell you whether an effect was found and, if so, its magnitude and meaning. When analyzing data, it can be helpful to refer back to your evaluation objectives, or the reasons you conducted the evaluation in the first place.


Data should first be organized in a way that makes it more easily digestible. Qualitative data can be transcribed and read through, and then organized into relevant themes or categories. Quantitative data can be entered into an Excel spreadsheet or statistical program. These programs can allow for the quick and efficient summarization of data. For example, the mean age of participants can be calculated, or the number of patients who report finding a program ‘highly satisfactory’ can be counted. Such tabulations can be very helpful for answering evaluation questions that are primarily descriptive in nature.
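

The same kind of descriptive tabulation can also be scripted. The following is a minimal sketch, assuming each participant's data is stored as a simple record; the field names and values are hypothetical.

```python
from collections import Counter
from statistics import mean


def summarize(records):
    """Compute mean age and count 'highly satisfactory' ratings."""
    if not records:
        raise ValueError("no records to summarize")
    ages = [r["age"] for r in records]
    ratings = Counter(r["satisfaction"] for r in records)
    n_high = ratings["highly satisfactory"]
    return {
        "n": len(records),
        "mean_age": mean(ages),
        "n_highly_satisfactory": n_high,
        "pct_highly_satisfactory": 100 * n_high / len(records),
    }


# Hypothetical participant records.
records = [
    {"age": 34, "satisfaction": "highly satisfactory"},
    {"age": 41, "satisfaction": "satisfactory"},
    {"age": 29, "satisfaction": "highly satisfactory"},
    {"age": 52, "satisfaction": "unsatisfactory"},
    {"age": 44, "satisfaction": "highly satisfactory"},
]

summary_stats = summarize(records)
print(summary_stats)
```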


Other evaluation questions, particularly those that focus on comparisons across time or between groups, may require more sophisticated statistical skills or software. Inferential statistics allow us to determine the probability of observing differences as large as or larger than the ones observed due to chance alone. Analyses that yield statistically significant results can give evaluators increased confidence in the meaning of observed differences. One drawback of statistical significance testing is its sensitivity to the sample size included in the evaluation. Evaluations with a small number of participants may struggle to reach levels of statistical significance, even when meaningful changes exist. Effect sizes are an alternative way of capturing between-group differences that is less sensitive to sample size.
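

As one illustration of an effect size, the sketch below computes Cohen's d, the difference between two group means divided by their pooled standard deviation. Cohen's d is chosen here as a common example and is an assumption of this sketch, since the text above does not name a specific effect size measure. The scores are hypothetical.

```python
from math import sqrt
from statistics import mean, variance


def cohens_d(group_a, group_b):
    """Standardized mean difference between two independent groups."""
    n_a, n_b = len(group_a), len(group_b)
    if n_a < 2 or n_b < 2:
        raise ValueError("each group needs at least 2 scores")
    # Pool the two sample variances, weighting each by its degrees of freedom.
    pooled_var = (
        (n_a - 1) * variance(group_a) + (n_b - 1) * variance(group_b)
    ) / (n_a + n_b - 2)
    if pooled_var == 0:
        raise ValueError("scores must vary in at least one group")
    return (mean(group_a) - mean(group_b)) / sqrt(pooled_var)


# Hypothetical percent-days-abstinent scores for two small groups.
mi_group = [70, 85, 60, 90, 75]
comparison_group = [55, 65, 50, 80, 60]

d = cohens_d(mi_group, comparison_group)
print(f"Cohen's d: {d:.2f}")
```

Because d is expressed in standard deviation units, it can be compared across evaluations that use different sample sizes or measures. Estimates from very small samples remain imprecise.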










Applying Findings


After analyzing findings, results must be interpreted and applied. Results should be compared with what you expected or with performance benchmarks in your specific field. Clear recommendations should be drawn based on the results, even if those include gathering more information. Conclusions and recommendations can be documented in a report that describes the evaluation methods and findings. The level of detail and sophistication of results should be determined based on the intended audience (e.g. funders, staff, clients, public). As with all stages of the evaluation, involvement from key stakeholders in the interpretation and reporting of findings is key.




