Experiments
January 28, 2025 · 12 min read

Prompt Engineering Experiments: What Actually Works

Results from 1,000+ systematic experiments testing prompting strategies for academic writing tasks

Zev
Founder, Esy

After conducting over 1,000 experiments with various prompting strategies, we've identified clear patterns in what works—and what doesn't—for academic writing tasks.

Experimental Design

Our experiments tested prompting strategies across five categories:

  • Zero-shot prompts: no examples provided
  • Few-shot prompts: 1-5 examples included
  • Chain-of-thought: step-by-step reasoning requested
  • Role-based: specific expertise assigned to the model
  • Structured output: specific formats requested

Evaluation Methodology

Each prompt was evaluated on:

  • Output quality (1-10 scale, averaged across three independent raters)
  • Task completion rate (percentage of successful completions)
  • Average generation time (seconds to first response)
  • Token efficiency (output tokens per input token)
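
For context on how these were computed, here is a minimal sketch of aggregating the four metrics for one prompt across repeated runs. The `RatedRun` structure, its field names, and the sample values are illustrative stand-ins, not the actual evaluation harness used in these experiments.

```python
from dataclasses import dataclass
from statistics import mean

@dataclass
class RatedRun:
    """One generation attempt plus its ratings (illustrative structure)."""
    rater_scores: list[float]   # quality scores from the three raters (1-10 scale)
    completed: bool             # did the model fully satisfy the task?
    latency_s: float            # seconds to first response
    input_tokens: int
    output_tokens: int

def summarize(runs: list[RatedRun]) -> dict:
    """Aggregate the four evaluation metrics across all runs of one prompt."""
    return {
        "avg_quality": mean(mean(r.rater_scores) for r in runs),
        "completion_rate": sum(r.completed for r in runs) / len(runs),
        "avg_latency_s": mean(r.latency_s for r in runs),
        "token_efficiency": mean(r.output_tokens / r.input_tokens for r in runs),
    }

# Example: two runs of the same prompt (numbers are made up)
runs = [
    RatedRun([7, 8, 7], True, 2.4, 310, 1250),
    RatedRun([6, 7, 6], True, 2.1, 310, 1190),
]
print(summarize(runs))
```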

Key Findings

1. Specificity Wins

Vague prompts like "write about climate change" produced generic content. Specific prompts with clear constraints yielded significantly better results.

Poor Example:

"Write an essay about AI ethics"

Better Example:

"Write a 500-word academic essay examining three ethical challenges in AI development, with specific examples and citations to recent literature (2020-2024)"

Measured Improvement: 3.2x higher quality rating (2.1 → 6.7 out of 10)

2. Role Assignment Shows Mixed Results

Assigning roles ("You are a PhD researcher in biology...") helped in specialized domains but had minimal impact on general academic writing.

Effective for:

  • Technical writing requiring domain expertise
  • Domain-specific analysis and interpretation
  • Specialized terminology usage

Less effective for:

  • General argumentative essays
  • Literature reviews across disciplines
  • Basic research methodology

Data: Role-based prompts improved scores by 18% for technical content, but only 3% for general academic writing.

3. Chain-of-Thought is Underutilized

Explicitly requesting step-by-step reasoning improved output quality by an average of 41% across all task types, yet remains underused in practice.

Example Implementation:

"Before writing, first outline the main arguments, identify three pieces of supporting evidence for each, structure the essay with clear sections, then write the full content."

Results:

  • Coherence scores: +52%
  • Argument strength: +38%
  • Structural quality: +47%
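
A minimal sketch of wiring this outline-then-write request into a two-step call is shown below. `call_llm` is a placeholder for whichever model client you use; it is not an API described in this article.

```python
def call_llm(prompt: str) -> str:
    """Stand-in for your model client of choice; replace with a real API call."""
    raise NotImplementedError

def chain_of_thought_essay(topic: str, word_count: int = 500) -> str:
    """Two-step generation: plan first, then write from the plan."""
    # Step 1: ask only for the reasoning/outline, not the essay itself.
    outline = call_llm(
        f"Outline the main arguments for a {word_count}-word academic essay on "
        f"{topic}. For each argument, list three pieces of supporting evidence "
        f"and propose a section structure. Do not write the essay yet."
    )
    # Step 2: feed the outline back and request the full text.
    return call_llm(
        f"Using this outline:\n\n{outline}\n\n"
        f"Write the full {word_count}-word essay, following the structure exactly "
        f"and keeping an academic tone with clear transitions."
    )
```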

4. Few-Shot Examples Must Be High-Quality

Including 2-3 excellent examples improved outputs more than including 5 mediocre examples. Quality over quantity matters significantly.

Optimal Configuration:

  • 2-3 examples: +67% quality improvement
  • 4-5 examples: +58% quality improvement
  • 6+ examples: +41% quality improvement (diminishing returns)
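
If you assemble few-shot prompts programmatically, a rough sketch of capping the exemplar count looks like this. The example dictionary keys (`instruction`, `output`) are a hypothetical format, not a schema from our experiments.

```python
def build_few_shot_prompt(task: str, examples: list[dict], k: int = 3) -> str:
    """Prepend up to k curated exemplars to the task instruction.

    Each example dict is assumed to have 'instruction' and 'output' keys
    (an illustrative format, not a fixed schema from this article).
    """
    selected = examples[:k]  # favour a few excellent examples over many mediocre ones
    blocks = [
        f"Example {i + 1}:\nInstruction: {ex['instruction']}\nResponse: {ex['output']}"
        for i, ex in enumerate(selected)
    ]
    return "\n\n".join(blocks + [f"Now complete this task:\n{task}"])
```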

5. Output Structure Matters

Requesting specific formats (sections, headings, word counts) reduced revision time by 37% and improved overall coherence.

Structured Request Example:

Write a research proposal with the following sections:
1. Introduction (200 words)
2. Literature Review (300 words)
3. Methodology (250 words)
4. Expected Outcomes (150 words)
5. Timeline (100 words)
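
Part of why structured requests cut revision time is that the output becomes mechanically checkable. Below is a rough sketch of validating returned sections against the word budgets requested above; the heading regex assumes numbered titles like "1. Introduction" and is only illustrative.

```python
import re

# Requested structure: section title -> target word count (from the prompt above)
TARGETS = {
    "Introduction": 200,
    "Literature Review": 300,
    "Methodology": 250,
    "Expected Outcomes": 150,
    "Timeline": 100,
}

def check_sections(text: str, tolerance: float = 0.2) -> dict:
    """Split output on numbered headings and flag sections far from their budget."""
    # Assumes headings like "1. Introduction" on their own lines (illustrative).
    parts = re.split(r"^\s*\d+\.\s*(.+?)\s*$", text, flags=re.MULTILINE)
    sections = dict(zip(parts[1::2], parts[2::2]))  # title -> body text
    report = {}
    for title, target in TARGETS.items():
        words = len(sections.get(title, "").split())
        report[title] = {
            "words": words,
            "target": target,
            "ok": abs(words - target) <= tolerance * target,
        }
    return report
```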

Practical Recommendations

Based on our experiments, here's our recommended prompt structure for academic tasks:

[ROLE DEFINITION - if domain-specific]
You are an expert in [specific field with credentials]

[TASK DESCRIPTION]
Write a [specific format] that [clear objective]

[CONSTRAINTS]
- Length: [specific word count or page range]
- Include: [required elements, citations, data]
- Avoid: [common pitfalls, biases, generalizations]

[REASONING REQUEST]
First, outline your approach in bullet points.
Then, write the full content following your outline.

[QUALITY CRITERIA]
Ensure the output:
- Uses evidence-based arguments with citations
- Maintains academic tone and precision
- Follows logical structure with clear transitions
- Addresses counterarguments where relevant
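
If you reuse this template programmatically, a minimal sketch of filling it from a few fields might look like the following. The placeholder field names and sample values are arbitrary illustrations, not part of the template itself.

```python
PROMPT_TEMPLATE = """\
You are an expert in {role}.

Write a {fmt} that {objective}.

Constraints:
- Length: {length}
- Include: {include}
- Avoid: {avoid}

First, outline your approach in bullet points.
Then, write the full content following your outline.

Ensure the output:
- Uses evidence-based arguments with citations
- Maintains academic tone and precision
- Follows logical structure with clear transitions
- Addresses counterarguments where relevant
"""

prompt = PROMPT_TEMPLATE.format(
    role="machine learning ethics with peer-reviewed publications",
    fmt="500-word academic essay",
    objective="examines three ethical challenges in AI development",
    length="500 words",
    include="citations to 2020-2024 literature",
    avoid="unsupported generalizations",
)
```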

Domain-Specific Insights

Literature Reviews

Best approach: Structured prompts with clear synthesis requirements

Critical elements:

  • Explicit request for synthesis, not just summary
  • Chronological or thematic organization specified
  • Citation requirements clearly stated

Average quality improvement: 52%

Example:

"Synthesize the literature on transformer architectures (2017-2024). Organize thematically: 1) Attention mechanisms, 2) Scaling laws, 3) Efficiency improvements. For each theme, identify 3-4 seminal papers, explain their contributions, and note how later work built upon them."

Analytical Essays

Best approach: Chain-of-thought with argument mapping

Critical elements:

  • Explicit request for evidence-based reasoning
  • Requirement to address counterarguments
  • Clear thesis statement development

Average quality improvement: 48%

Example:

"Analyze the impact of social media on political polarization. First, state a clear thesis. Then, present three arguments with empirical evidence. Address two counterarguments. Conclude by synthesizing implications for democratic discourse."

Research Proposals

Best approach: Multi-stage prompting (outline → draft → refine)

Critical elements:

  • Clear methodology specification
  • Feasibility considerations
  • Expected outcomes with measurable indicators

Average quality improvement: 61%

Example:

"Draft a research proposal outline for studying AI bias in hiring algorithms. Include: research questions, methodology (specify sample size, data sources), expected timeline, potential limitations. Then expand each section into full paragraphs."


Common Pitfalls to Avoid

1. Over-Prompting

Issue: Excessively long prompts (>400 words) showed diminishing returns

Data: Prompts over 400 words performed 12% worse than concise 200-300 word prompts

Solution: Be comprehensive but concise. Focus on essential constraints and criteria.

2. Ambiguous Instructions

Issue: Vague terms like "good quality" or "professional" had no measurable impact

Data: Replacing vague terms with specific criteria improved scores by 34%

Solution: Define quality explicitly (e.g., "use peer-reviewed citations from last 5 years" instead of "use good sources")

3. Conflicting Constraints

Issue: Asking for both "comprehensive coverage" and "brief summary" confused models

Data: Conflicting instructions reduced task completion rate from 87% to 62%

Solution: Prioritize constraints clearly. If length conflicts with comprehensiveness, specify which takes precedence.

4. Assuming Context

Issue: Prompts that assumed shared context and omitted background information produced less relevant output

Data: Providing context improved relevance scores by 43%

Solution: Include necessary background even if it seems obvious. Don't assume the model has access to your specific context.


The Iterative Approach

Our highest-quality outputs came from an iterative process:

Stage 1: Initial Generation

  • Use structured prompt with clear requirements
  • Request outline before full content
  • Specify quality criteria

Stage 2: Review & Identify Gaps

  • Evaluate against rubric
  • Note missing elements
  • Identify weak arguments or unclear sections

Stage 3: Refined Prompt

  • Address specific weaknesses
  • Add constraints based on gaps
  • Request improvement of specific sections

Stage 4: Final Generation

  • Enhanced prompt with learned improvements
  • Specific refinement requests
  • Quality verification

Impact: This process added 5-10 minutes but improved quality ratings by an average of 67%.

Time Investment ROI:

  • 10 minutes additional prompting time
  • 45 minutes saved in manual editing
  • 35-minute net time savings per article
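
Expressed as code, that four-stage loop compresses to something like the sketch below. `call_llm` and `score_against_rubric` are placeholders for your model client and review step (human or automated), not components described in this article.

```python
def call_llm(prompt: str) -> str:
    """Stand-in for your model client; replace with a real API call."""
    raise NotImplementedError

def score_against_rubric(draft: str) -> tuple[float, list[str]]:
    """Stand-in for Stage 2: return a quality score and a list of identified gaps."""
    raise NotImplementedError

def iterative_generation(base_prompt: str, max_rounds: int = 2, target: float = 8.0) -> str:
    # Stage 1: initial generation from the structured prompt.
    draft = call_llm(base_prompt)
    for _ in range(max_rounds):
        # Stage 2: review against the rubric and collect specific gaps.
        score, gaps = score_against_rubric(draft)
        if score >= target or not gaps:
            break
        # Stages 3-4: refine the prompt around the weaknesses and regenerate.
        refinement = (
            f"{base_prompt}\n\nHere is the previous draft:\n{draft}\n\n"
            "Revise it to address these specific weaknesses:\n- " + "\n- ".join(gaps)
        )
        draft = call_llm(refinement)
    return draft
```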

Performance Metrics

Quantitative Results

| Prompting Strategy | Avg. Quality Score | Completion Rate | Avg. Output Tokens |
|--------------------|--------------------|-----------------|--------------------|
| Zero-shot basic | 4.2/10 | 71% | 1,247 |
| Few-shot (3 examples) | 7.1/10 | 86% | 1,389 |
| Chain-of-thought | 7.8/10 | 89% | 1,456 |
| Structured output | 8.1/10 | 92% | 1,312 |
| Combined approach | 8.7/10 | 94% | 1,358 |

Qualitative Improvements

Coherence: +47% improvement with structured prompts
Citation Quality: +62% with explicit citation requirements
Argument Strength: +41% with chain-of-thought reasoning
Academic Tone: +38% with role-based expertise assignment


Future Research Directions

Areas requiring further investigation:

Long-Form Content Coherence

Challenge: Maintaining consistency in 10,000+ word documents
Current gap: Quality degrades after ~3,000 words
Research needed: Multi-pass coherence checking strategies

Multi-Modal Prompting

Challenge: Integrating images, data, and text effectively
Current gap: Limited research on optimal multi-modal combinations
Research needed: Systematic testing of image + text prompting strategies

Collaborative Prompting

Challenge: Multiple stakeholders contributing to prompts
Current gap: No established best practices for team-based prompting
Research needed: Frameworks for collaborative prompt development

Domain Adaptation

Challenge: Transferring prompting strategies across disciplines
Current gap: Most research focused on computer science/general domains
Research needed: Discipline-specific prompting guidelines


Conclusion

Effective prompt engineering is learnable and systematic. The strategies that work best combine:

  1. Specificity - Clear, detailed requirements
  2. Structure - Organized format and section requests
  3. Reasoning - Explicit step-by-step thinking
  4. Quality criteria - Defined standards for evaluation
  5. Iteration - Refinement based on output assessment

The data is clear: investing time in prompt design yields substantial returns in output quality. For academic applications, this investment is not optional—it's essential for producing work that meets scholarly standards.

Key Takeaway: A well-crafted prompt can improve output quality by 2-3x compared to basic instructions, while reducing post-generation editing time by up to 60%.

As models continue to evolve, these foundational principles remain important. The goal isn't to find the perfect prompt, but to develop systematic approaches that consistently produce high-quality academic content.

Tags: prompt-engineering, experiment, llm, best-practices, research-methodology
