This three-paper dissertation systematically investigates methodological issues in using observational systems to observe, analyze, and interpret teachers' reformed teaching practices within the K-12 classroom context. Each paper contributes a distinct perspective on evaluating instructional practices accurately, consistently, and effectively in science classrooms. The first paper offers a systematic literature review of classroom observation protocols (OPs) used for evaluating science teaching practices. In this study, I proposed an analytical framework covering four crucial aspects of OP studies: the research objectives, design, data collection strategies, and data analysis and interpretation. This framework serves as the cornerstone for analyzing 37 distinct studies that have used OPs to evaluate the instructional practices of K-12 science teachers. The study underscores the need for transparent reporting of rater training procedures and sampling choices, and for advanced statistical techniques that address rater-associated variance and ultimately enhance the reliability and validity of the teacher measures. The study also suggests continued development of OPs that are aligned with current and future science education standards. In the second paper, we developed a comprehensive classroom observational system that uses the Rasch model to validate the instrument, guide rater training, and analyze observational data. We implemented this system with 321 full-length high school chemistry classroom observation videos. We proposed this system because relying exclusively on the development of new OPs may not lead to valid and reliable measurements of instructional practices: ratings are often affected by sources of construct-irrelevant variance such as rater bias and data analysis strategies.
This study shifts the focus from the OPs alone to a classroom observation system that includes instrument validation, rater training, and data analysis strategies. The psychometric evidence in this study supported the feasibility of using this system to yield reliable measures of instructional practices. The third paper examines rater effects by applying a Partial Credit Many-Facet Rasch Measurement (PC-MFRM) model in a science classroom observation context. As shown in the first two papers, classroom observation studies employing OPs are mediated by human raters, who are susceptible to errors. These errors introduce construct-irrelevant variance, which can adversely affect the validity and reliability of measures of instructional practices. This study aims to identify and control this variance, focusing on three rater effects: rater severity, central tendency, and the halo effect. I applied PC-MFRM to examine the three rater effects concurrently, along with their impact on the measurement of instructional practices. The MFRM results indicated significant discrepancies in rater severity and identified the specific raters who showed central tendency and halo effects. The findings suggest that researchers can incorporate MFRM diagnostic information throughout the rater training and calibration process to reduce potential halo and central tendency effects. To address rater severity, researchers should attend to intra-rater consistency as well as inter-rater reliability. This study extends the application of MFRM to classroom observation by offering a multi-dimensional understanding of rater effects; it thereby contributes to improving the rater training process, which can in turn enhance the reliability and validity of measures of teaching practice. The validity and reliability of classroom observations for evaluating teacher practices hinge considerably on rater effects and on the systematic evaluation of their impact.
While observation protocols (OPs) provide a structured framework for such evaluations, the literature indicates that rater behavior introduces construct-irrelevant variance in the form of rater severity, halo effect, and central tendency. Previous studies have employed the Many-Facet Rasch Model (MFRM) to examine some of these effects, but these efforts have predominantly focused on single-dimensional analyses. To address this research gap, the current study applies MFRM to concurrently investigate three critical rater effects (rater severity, central tendency, and halo effect) in the context of a longitudinal efficacy study of the Connected Chemistry Curriculum (CCC). Specifically, we aim to (1) assess the degree to which MFRM can identify the presence of these rater effects, and (2) evaluate the implications of using MFRM for enhancing rater training programs for classroom observations. [The dissertation citations contained here are published with the permission of ProQuest LLC. Further reproduction is prohibited without permission. Copies of dissertations may be obtained by telephone at 1-800-521-0600. Web page: http://www.proquest.com/en-US/products/dissertations/individuals.shtml.]