Almost a year ago, I wrote an article titled On-Screen Marking: A Practical Guide, offering a hands-on overview of the e-marking landscape. Since then, practical developments and real-world applications have progressed, particularly in automated writing evaluation, which includes both scoring and feedback elements. This period of evolution has prompted me to take a closer look at these advancements and develop a new piece aimed at decision-makers at the policy and institutional levels who are seeking optimal solutions for their scoring and feedback operations.
Additionally, as many may now be inclined to turn to Microsoft Copilot or similar GPT-based solutions when making decisions about automated writing scoring and feedback generation, it may be useful to summarize the trade-offs involved in implementing automated writing evaluation solutions in practice.
This article offers an orientation for those exploring the available options, comparing Copilot and similar GPT-based products with specialized alternatives, so that decision-makers in educational assessment contexts can judge which resources are the better fit for their organizations.
To start with, student-constructed responses are elicited by stimuli (the prompts shown to students) that target the higher levels of Bloom's Taxonomy (1956): analyzing, evaluating, and creating. Student-constructed responses are typically used to answer open-ended questions and vary in length, from short answers to extended essays. In this context, the term "writing" refers to all forms of student responses, including short-answer responses and essays.
Evaluating student responses, especially longer texts such as essays, carries a higher risk of subjectivity than selected-response questions (such as multiple-choice, drag-and-drop, or matching). Although large language models are trained on human texts, which inherently contain biases and subjectivity, the use of technology in marking can help mitigate bias and subjectivity concerns. Additionally, evaluating writing without technological support demands significant resources. These two factors, reducing subjectivity and improving operational efficiency, are the key reasons behind the growing interest in using technology for automated writing evaluation.
The evolution of automated writing evaluation began with automated essay scoring, which did not yet include a feedback component, pioneered by Dr. Ellis Batten Page (later of Duke University) in the 1960s through Project Essay Grade (PEG). Later, transformative practical applications emerged in large-scale online education through MIT's collaboration on the edX platform, launched in 2012, advancing the use of machine learning for essay marking in open-access courses and increasing the broader industry's focus on automated feedback generation, which in turn led to growing market demand. Today, AI-driven developments in the educational assessment industry, specifically in scoring and feedback reporting, such as Microsoft Copilot or OpenAI GPT-based tools, are being considered as integrated assistants to meet marking needs.
Alternatively, some assessment organizations may prefer to go beyond integrated assistants and instead opt for specialized automated writing evaluation solutions available in the K-12 market, designed specifically for this purpose and built on fully controlled evaluation algorithms. The rationale for such organizations is to maintain complete control over the algorithms, evaluation models, and data processing, as well as to achieve other operational efficiency gains. Unlike Microsoft Copilot or similar GPT-based tools, which are built on proprietary models, specialized systems provide direct access to the machine learning algorithms and core model architecture, allowing targeted specialization, greater customization, and increased transparency and privacy, features that are especially important to government bodies operating in high-stakes assessment contexts.
As a practical example, with proprietary models the ownership of the model remains with the third party (e.g., Microsoft/OpenAI). Assessment organizations can fine-tune these models and access their fine-tuned versions, but only through the third party's systems; they do not have full ownership of the fine-tuned model, even though it is trained on their own data.
To provide a high-level analysis that decision-makers can easily digest, this comparative review of trends in the educational assessment and technology industry uses the SOAR (Strengths, Opportunities, Aspirations, Results) framework. SOAR's focus on innovation, growth, and aspirations aligns well with assessment organizations that prioritize the modernization of large-scale assessments.
The following comparative analysis of the two options, Microsoft Copilot or similar GPT-based integrated assistants embedded in a primary AWE system on one side and specialized AWE systems on the other, is presented using the SOAR framework:
| SOAR Element | Microsoft Copilot or Similar GPT-based Supportive Systems | Specialized AWE System |
| --- | --- | --- |
| Strengths | Highly flexible natural language processing capabilities, adaptable across platforms, and fine-tuned for specific tasks. | Complete control, full customization, data privacy, and secure data management. |
| Opportunities | Capable of handling nuanced language tasks like contextual feedback and idea generation, complementing existing automated evaluation systems. | Tailored AWE models and specialized marking practices aligned with institutional goals. |
| Aspirations | Aims to offer creative, adaptive feedback and can evolve to provide more personalized support in subject-specific contexts. | Aims to achieve full transparency and control over evaluation processes, with customizable algorithms that meet evolving standards. |
| Results | Immediate, scalable support for diverse feedback needs, with capabilities for deeper content analysis and rephrasing. | Fully customized AWE solutions that offer long-term cost efficiency and alignment with institutional assessment goals. |
In the end, the comparative overview in the table above, along with the technical considerations that follow, may help assessment organizations weigh the trade-offs between speed and scalability (Copilot/GPT) and control, privacy, and customization (specialized systems) based on their specific needs. The explanatory overview below explores the technical considerations involved in choosing to build specialized AWE systems.
Training a model with data is an important technical step when developing specialized Automated Writing Evaluation (AWE) systems, especially when data security matters. Using an open-source platform for this training process can help keep data fully secure, as it avoids third-party storage: data remains managed entirely within the organization's own environment, whether in its cloud or on in-house servers. Alternatively, data scientists may choose to work with platforms like Kaggle, which provide a wide range of publicly available datasets for experimentation, supporting projects such as analyzing student responses without requiring the use of sensitive internal data. The next section explains in more detail the processes and methods involved in training a model on data for either experimentation or real-case use.
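Before moving on, a minimal sketch of this in-house approach is shown below. It assumes a hypothetical local export called responses.csv with response_text and human_score columns; nothing here leaves the organization's own environment.

```python
# A minimal sketch of keeping AWE experimentation data in-house. The file name
# and column names (responses.csv, response_text, human_score) are illustrative
# assumptions; the data stays on the organization's own servers or cloud.
import pandas as pd

df = pd.read_csv("responses.csv")  # read from a local, organization-managed location

# Basic sanity checks before any model training
print("Number of responses:", len(df))
print(df[["response_text", "human_score"]].head())
print("Score distribution:\n", df["human_score"].value_counts().sort_index())
```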
Training and adjusting a language model (e.g., GPT/BERT/Llama) can be compared to how a child learns language: starting with letters, understanding their meaning, and using them in basic conversations. In practice, training a model typically involves using resources such as a large corpus (Google Books or web pages) to build a foundational vocabulary, along with initial historical data used to train the model, followed by further fine-tuning (adjusting) the model for its specific purpose.
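As a rough sketch of what "further fine-tuning for the specific purpose" can look like, the example below adapts an open-source pre-trained encoder (BERT, via the Hugging Face transformers and datasets libraries) to predict scores. It reuses the hypothetical df from the previous sketch; the model choice, hyperparameters, and column names are assumptions rather than a prescribed recipe.

```python
# A minimal sketch: fine-tuning a pre-trained encoder to predict essay scores.
# Assumes the hypothetical DataFrame `df` (response_text, human_score) from the
# earlier sketch; model choice and hyperparameters are illustrative only.
from datasets import Dataset
from transformers import (AutoModelForSequenceClassification, AutoTokenizer,
                          Trainer, TrainingArguments)

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
model = AutoModelForSequenceClassification.from_pretrained(
    "bert-base-uncased", num_labels=1, problem_type="regression")  # one numeric score

df["labels"] = df["human_score"].astype("float32")  # regression targets must be floats

def tokenize(batch):
    return tokenizer(batch["response_text"], truncation=True,
                     padding="max_length", max_length=512)

train_data = Dataset.from_pandas(df[["response_text", "labels"]]).map(tokenize, batched=True)

trainer = Trainer(
    model=model,
    args=TrainingArguments(output_dir="awe_model", num_train_epochs=3,
                           per_device_train_batch_size=8),
    train_dataset=train_data,
)
trainer.train()  # the fine-tuned weights remain fully under the organization's control
```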
Another important task is to decide which training method to employ: a generalized approach, which builds on principles or rules such as grading rubrics and is less data-dependent, or a data-driven approach, which relies on real data for learning patterns and is more commonly used in natural language processing because of its higher predictive power. These two approaches are not mutually exclusive; in fact, combining rubrics with student responses is often preferable, as it creates a model that benefits from both foundational principles and real-world data patterns. The next section further elaborates on balancing the various factors that define the quality of a trained model.
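Before that, the short sketch below illustrates the hybrid idea: rubric criteria (the principle-guided component) are prepended to each student response (the data-driven component) so that one training input carries both. The rubric wording is a made-up example, and df is the hypothetical DataFrame from the earlier sketches.

```python
# A minimal sketch of combining a rubric (principles) with real responses (data).
# The rubric text is an illustrative assumption.
RUBRIC = ("Score 0-5. Criteria: addresses the prompt, organizes ideas logically, "
          "supports claims with evidence, uses accurate grammar and vocabulary.")

df["model_input"] = RUBRIC + " [SEP] " + df["response_text"]
# `model_input` can replace `response_text` in the fine-tuning sketch above, so the
# model learns from both the grading principles and real-world response patterns.
```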
To achieve an agentic (or one-shot learning) large language model for assessing student responses, both the overall quality of the model and the quality of its connection points (how well components within the model work together) are important evaluation criteria. Additionally, a trade-off between good format and accurate marking often arises: a model optimized for format might prioritize structural and linguistic accuracy over content depth, while one focused on accurate marking may overlook smaller formatting issues in pursuit of content-based precision. Although most large language models trained for automated scoring aim only to predict scores rather than also produce a formatted response, as GPT models do, a well-designed agentic large language model must balance both by giving strong consideration to the connection points that allow format and content assessment to work harmoniously. In line with these quality criteria, implementing either option requires important decisions about resource allocation and the strategic gains discussed further below.
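To make the format-versus-content trade-off concrete, the sketch below blends a content-accuracy score and a format score with explicit, adjustable weights so that neither dimension silently dominates; the 0-5 scale and the default 70/30 split are illustrative assumptions, not recommended settings.

```python
# A minimal sketch of balancing content accuracy against format quality with
# explicit weights. The 0-5 scale and default 70/30 split are assumptions.
def combined_score(content_score: float, format_score: float,
                   content_weight: float = 0.7, format_weight: float = 0.3) -> float:
    """Weighted blend of content-based marking and format/structure marking."""
    return content_weight * content_score + format_weight * format_score

print(combined_score(content_score=4.5, format_score=3.0))  # 4.05 on a 0-5 scale
```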
Assessment organizations are increasingly challenged to decide whether to invest in building and maintaining specialized AWE systems tailored to specific writing evaluation requirements or to opt for systems with integrated assistants, which offer limited control and independent scalability. Feeding only the rubrics into a large language model may be insufficient for selecting an exemplar (an ideal, high-quality student response that practically illustrates benchmark standards), a task that can be burdensome for teachers when grading student work and therefore benefits from technological support.
Additionally, it is very important for assessment organizations to carefully decide, as they move forward on their modernization path, how much they will rely on machine scores versus human scores. For example, the marking engine may present human scorers with a suggested score and an associated confidence level, and the human scorers may then decide whether to adopt the score. Another key decision to consider alongside this is whether the solution would be cloud-agnostic, allowing institutions to switch between region-specific cloud providers for data residency, which provides flexibility and supports compliance with local data regulations.
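One way such a reliance strategy can be expressed is as a simple routing rule, sketched below under the assumption that the marking engine returns a suggested score together with a confidence value; the threshold is an illustrative policy choice, not a recommendation.

```python
# A minimal sketch of a human-in-the-loop routing rule. Assumes the marking
# engine returns a suggested score plus a confidence value; the threshold is an
# illustrative policy choice.
CONFIDENCE_THRESHOLD = 0.85

def route_response(suggested_score: float, confidence: float) -> dict:
    if confidence >= CONFIDENCE_THRESHOLD:
        # High confidence: adopt the machine score, subject to human spot checks
        return {"final_score": suggested_score, "route": "machine score adopted (spot-checked)"}
    # Low confidence: defer to a human scorer, keeping the machine suggestion visible
    return {"final_score": None, "route": "sent to human scorer",
            "machine_suggestion": suggested_score}

print(route_response(suggested_score=4.0, confidence=0.92))
print(route_response(suggested_score=2.5, confidence=0.61))
```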
In summary, Microsoft Copilot or similar GPT-based integrated assistant systems are excellent for institutions seeking quick, scalable solutions, although they offer a less specialized approach. In contrast, specialized AWE systems provide full control, flexibility, and alignment with institutional assessment goals, along with improved data privacy and transparency, though they require more time and resources to develop and maintain with a long-term vision in mind.
The evolution of e-marking presents assessment organizations with multiple pathways for integrating automated writing evaluation systems into their processes. Assessment institutions are therefore tasked with balancing speed, scalability, and ease of integration, which solutions like Microsoft Copilot or similar GPT-based systems offer, against the need for control, privacy, customized specialization, and data security, which specialized AWE systems provide. While quick-to-deploy options may be appealing for their convenience, investing in fully customized, specialized systems ensures closer alignment with institutional goals and offers greater transparency. Ultimately, the right path depends on each organization's specific needs, resources, and long-term vision for modernizing its assessment practices. As they build an e-marking culture, assessment organizations will need to strategically weigh the trade-offs between rapid implementation and the flexibility to shape their scoring processes to meet future demands.
Through this article, I have aimed to encourage discussions around automated writing evaluation practices and solutions. To continue this conversation, I will be arranging a virtual discussion on this topic to share best practices that are followed internationally. If you are interested in participating in this upcoming discussion and hearing from our peers in the assessment industry, feel free to message me on LinkedIn. I will provide you with the webinar details.
Vali Huseyn is an educational assessment specialist, recognized for his expertise in development projects across various aspects of the assessment cycle. His ability to advise on improving assessment delivery models, administering assessments at different levels, innovating within data analytics, and creating quick, secure reporting techniques sets him apart in the field. His work, expanded through collaborations with leading assessment technology firms and certification bodies, has greatly advanced his community's assessment practices. At The State Examination Centre of Azerbaijan, Vali contributed significantly to the transformation of local assessments and led key regional projects, such as a unified registration and tracking platform for international testing programs, reviews of CEFR-aligned language assessments, PISA-supported assessment literacy trainings, and an institutional audit project, all aimed at improving the assessment culture across the country and the former USSR region.
Vali has received two prestigious scholarships for his studies: he completed an MA in Education Policy Planning and Administration at Boston University on a Fulbright Scholarship and also studied Educational Assessment at Durham University on a Chevening Scholarship.
Discover guided practices in modernizing assessments and gain insights into the future of educational assessments by connecting with Vali on LinkedIn.