HMER: Image to LaTeX Converter

Tags: Deep Learning, Computer Vision, PyTorch, NLP, CNN, LSTM, Sequence-to-Sequence, ResNet

[Figures: Mathematical Formula Examples, Model Performance Metrics, Architecture Overview, Training Progress]

A deep learning-based system for converting images of mathematical expressions into their corresponding LaTeX code representation.

Project Overview

The Image to LaTeX (img2latex) project addresses a significant challenge in digital document processing: transforming images of mathematical formulas into LaTeX markup, which is essential for editing, searching, and accessibility.

Mathematical expressions are ubiquitous in scientific, engineering, and academic literature, but transferring them between different formats can be cumbersome. Traditional Optical Character Recognition (OCR) systems often struggle with the complex two-dimensional structure of mathematical formulas. The img2latex project provides an end-to-end solution to automatically recognize and transcribe mathematical expressions from images, significantly reducing the manual effort required for digitizing printed mathematical content.

Key Features

  • Dual Model Architectures: Implementation of both CNN-LSTM and ResNet-LSTM architectures for comparative analysis
  • Multiple Decoding Strategies: Support for greedy search, beam search, and sampling with temperature/top-k/top-p
  • Comprehensive Evaluation: Performance assessment using token-level accuracy, BLEU score, and Levenshtein similarity
  • Visualization Tools: Advanced visualization of training metrics, dataset statistics, and model predictions
  • Cross-Platform Support: Acceleration on Apple Silicon (MPS), NVIDIA GPUs (CUDA), and CPU fallback
  • Command-line Interface: Easy-to-use CLI for training, evaluation, and prediction

Technical Approach

Dataset Analysis

The project leverages the IM2LaTeX-100k dataset, consisting of over 100,000 images of mathematical expressions paired with their corresponding LaTeX code.

Key Dataset Statistics:

  • Total Images: 103,536
  • Mean Dimensions: 319.2px × 61.2px
  • Mean Aspect Ratio: 5.79
  • Common Size: 320×64 px
  • Image Format: RGB (uint8)
  • Pixel Value Distribution: Mean 242.22, StdDev 45.70

Model Architecture

1. CNN-LSTM Architecture

The CNN-LSTM model consists of:

  • Encoder: A convolutional neural network with three convolutional blocks (32, 64, and 128 filters, respectively), each containing:

    • Conv2D layer
    • ReLU activation
    • MaxPooling layer

    After the third block, the output is flattened and passed through a dense layer to create the embedding.
  • Decoder: An LSTM-based decoder that:

    • Takes the encoder output and previously generated tokens as input
    • Generates output tokens one at a time
    • Uses teacher forcing during training (ground truth tokens as input)
    • Offers an optional attention mechanism that focuses on different parts of the encoder representation
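
The encoder described above could be sketched in PyTorch roughly as follows. The 3×3 kernels, 2×2 pooling, embedding dimension of 256, 64×320 grayscale input, and the class name CNNEncoder are assumptions for illustration; the description only fixes the three blocks, their filter counts, and the flatten-plus-dense step.

```python
import torch
from torch import nn


class CNNEncoder(nn.Module):
    """Three Conv2D -> ReLU -> MaxPool blocks, then flatten + dense embedding."""

    def __init__(self, embed_dim=256, in_channels=1, height=64, width=320):
        super().__init__()
        layers = []
        channels = in_channels
        for filters in (32, 64, 128):  # filter counts from the description
            layers += [
                nn.Conv2d(channels, filters, kernel_size=3, padding=1),  # assumed 3x3 kernel
                nn.ReLU(),
                nn.MaxPool2d(2),  # assumed 2x2 pooling: halves height and width
            ]
            channels = filters
        self.features = nn.Sequential(*layers)
        # Three 2x2 poolings shrink each spatial dimension by a factor of 8.
        flat_dim = channels * (height // 8) * (width // 8)
        self.project = nn.Linear(flat_dim, embed_dim)

    def forward(self, images):
        x = self.features(images)          # (batch, 128, H/8, W/8)
        return self.project(x.flatten(1))  # (batch, embed_dim)


encoder = CNNEncoder()
batch = torch.rand(2, 1, 64, 320)  # two grayscale images, 64 px high and 320 px wide
embedding = encoder(batch)
print(embedding.shape)
```

In this sketch the decoder would receive one fixed-size vector per image; an attention-based decoder would more likely consume the unflattened feature map instead.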

2. ResNet-LSTM Architecture

The ResNet-LSTM model replaces the CNN encoder with a pre-trained ResNet:

  • Encoder: A pre-trained ResNet (options include ResNet18, ResNet34, ResNet50, ResNet101, ResNet152) with:

    • The classification head removed
    • Option to freeze weights for transfer learning
    • Final layer adapted to produce embeddings of the desired dimension
  • Decoder: The same LSTM-based decoder as the CNN-LSTM model

Training Process

The training process implements several key strategies:

Optimization Setup

  • Optimizer: Adam with configurable learning rate and weight decay
  • Learning Rate Scheduling: ReduceLROnPlateau with patience 3, factor 0.5
  • Loss Function: Cross-entropy with label smoothing (0.1)
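
The optimization setup above maps onto standard PyTorch components. The sketch below assumes a learning rate of 1e-3, a weight decay of 1e-5, and a padding token id of 0; none of these values are stated in the project description, and the linear layer is only a stand-in for the real encoder-decoder model.

```python
import torch
from torch import nn

model = nn.Linear(16, 10)  # stand-in for the encoder-decoder model

optimizer = torch.optim.Adam(model.parameters(), lr=1e-3, weight_decay=1e-5)  # assumed values
scheduler = torch.optim.lr_scheduler.ReduceLROnPlateau(
    optimizer, mode="min", patience=3, factor=0.5
)
# Label smoothing of 0.1 as described; ignoring padding (id 0) is an assumption.
criterion = nn.CrossEntropyLoss(label_smoothing=0.1, ignore_index=0)

logits = model(torch.rand(4, 16))     # (batch, vocab_size)
targets = torch.tensor([1, 2, 3, 0])  # the last target is padding and is ignored
loss = criterion(logits, targets)

# Simulate a validation loss that stops improving: the scheduler halves the
# learning rate once the loss has failed to improve for more than 3 epochs.
for val_loss in [1.0, 1.0, 1.0, 1.0, 1.0]:
    scheduler.step(val_loss)
print(loss.item(), optimizer.param_groups[0]["lr"])
```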

Training Techniques

  • Teacher Forcing: Scheduled sampling that gradually shifts decoder inputs from ground truth tokens to the model's own predictions
  • Gradient Clipping: Norm-based clipping (maximum norm: 5.0) to prevent exploding gradients
  • Early Stopping: Training stops if validation metrics don't improve for 5 consecutive epochs
  • Checkpointing: Regular saving of model checkpoints for resuming training
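
Two of the techniques above, scheduled sampling and early stopping, can be illustrated without any framework. The linear decay from 1.0 to 0.5, the use of validation loss as the monitored metric, and all names here are assumptions for illustration rather than the project's actual schedule.

```python
import random


def teacher_forcing_ratio(epoch, total_epochs, start=1.0, end=0.5):
    """Linearly decay the probability of feeding ground truth tokens."""
    if total_epochs <= 1:
        return end
    progress = min(epoch, total_epochs - 1) / (total_epochs - 1)
    return start + (end - start) * progress


def choose_decoder_input(target_token, predicted_token, ratio, rng=random):
    """Scheduled sampling: use the ground truth token with probability `ratio`."""
    return target_token if rng.random() < ratio else predicted_token


class EarlyStopping:
    """Signal a stop once validation loss has not improved for `patience` epochs."""

    def __init__(self, patience=5):
        self.patience = patience
        self.best = float("inf")
        self.bad_epochs = 0

    def step(self, val_loss):
        if val_loss < self.best:
            self.best = val_loss
            self.bad_epochs = 0
        else:
            self.bad_epochs += 1
        return self.bad_epochs >= self.patience  # True means "stop training"


stopper = EarlyStopping(patience=5)
losses = [2.0, 1.8, 1.7] + [1.75] * 6  # improvement stalls after epoch 3
stopped_at = None
for epoch, loss in enumerate(losses, start=1):
    if stopper.step(loss):
        stopped_at = epoch
        break
print(stopped_at)
```

Gradient clipping itself is a single call in PyTorch, torch.nn.utils.clip_grad_norm_(model.parameters(), max_norm=5.0), placed between the backward pass and the optimizer step.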

Hardware Acceleration

  • Device Support: CUDA for NVIDIA GPUs, MPS for Apple Silicon, CPU fallback
  • Mixed Precision: FP16 computation where supported (30-40% faster training)
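
Device selection with a CPU fallback might look like the following sketch. The function name pick_device is hypothetical; the availability checks are standard PyTorch calls.

```python
import torch


def pick_device():
    """Prefer CUDA, then Apple's MPS backend, then fall back to CPU."""
    if torch.cuda.is_available():
        return torch.device("cuda")
    mps = getattr(torch.backends, "mps", None)  # absent on older PyTorch builds
    if mps is not None and mps.is_available():
        return torch.device("mps")
    return torch.device("cpu")


device = pick_device()
x = torch.ones(2, 2, device=device)  # tensors and models are moved to this device
print(device.type)
```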

Results and Evaluation

Training spanned 25 epochs. The table below shows how the validation metrics progressed for the best-performing model (CNN-LSTM):

Epoch   Loss     Accuracy   BLEU     Levenshtein
1       2.2778   0.4986     0.0827   0.2311
5       1.8408   0.5760     0.1241   0.2609
10      1.6909   0.6022     0.1377   0.2716
15      1.6338   0.6116     0.1464   0.2781
20      1.6030   0.6180     0.1502   0.2799
25      1.5663   0.6256     0.1539   0.2829

The comparison between CNN-LSTM and ResNet-LSTM models showed:

  • CNN-LSTM achieved 62.56% validation accuracy and a BLEU score of 0.1539
  • ResNet50-LSTM achieved 59.42% accuracy and 0.1487 BLEU score in fewer epochs
  • The CNN-LSTM architecture provided superior results with lower computational requirements
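
For reference, the token-level accuracy and Levenshtein similarity reported above could be computed along these lines. Normalizing the edit distance by the longer sequence length is an assumption about how the similarity is defined, and the function names are illustrative.

```python
def levenshtein_distance(a, b):
    """Edit distance between two token sequences (insert, delete, substitute)."""
    previous = list(range(len(b) + 1))
    for i, token_a in enumerate(a, start=1):
        current = [i]
        for j, token_b in enumerate(b, start=1):
            cost = 0 if token_a == token_b else 1
            current.append(min(
                previous[j] + 1,         # deletion
                current[j - 1] + 1,      # insertion
                previous[j - 1] + cost,  # substitution or match
            ))
        previous = current
    return previous[-1]


def levenshtein_similarity(pred, target):
    """1.0 for identical sequences, 0.0 when every position must change."""
    longest = max(len(pred), len(target))
    if longest == 0:
        return 1.0
    return 1.0 - levenshtein_distance(pred, target) / longest


def token_accuracy(pred, target):
    """Fraction of target positions where the predicted token matches."""
    if not target:
        return 1.0 if not pred else 0.0
    matches = sum(p == t for p, t in zip(pred, target))
    return matches / len(target)


target = ["\\frac", "{", "a", "}", "{", "b", "}"]
pred = ["\\frac", "{", "a", "}", "{", "c", "}"]
print(token_accuracy(pred, target), levenshtein_similarity(pred, target))
```

Here a single wrong token out of seven lowers both metrics by the same amount. The two diverge when a token is inserted or dropped, because position-wise accuracy then penalizes every later token while edit distance counts the slip once.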

Implementation Details

Preprocessing Pipeline

The preprocessing pipeline includes:

  • Grayscale conversion and normalization
  • Resizing while maintaining aspect ratio
  • Padding to consistent dimensions
  • Data augmentation techniques including random rotations, translations, and scaling
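
The first three steps can be sketched with NumPy alone. The 64×320 target size follows the common image size in the dataset statistics; the channel-average grayscale conversion, nearest-neighbour resize, white padding, and top-left placement are assumptions made to keep the example dependency-free. Augmentation is omitted.

```python
import numpy as np


def preprocess(image, target_h=64, target_w=320):
    """RGB uint8 array (H, W, 3) -> grayscale float32 array (target_h, target_w) in [0, 1]."""
    gray = image.astype(np.float32).mean(axis=2) / 255.0  # channel average, scaled to [0, 1]

    # Scale so the image fits inside the target box without changing its aspect ratio.
    h, w = gray.shape
    scale = min(target_h / h, target_w / w)
    new_h = min(target_h, max(1, round(h * scale)))
    new_w = min(target_w, max(1, round(w * scale)))

    # Nearest-neighbour resize via index lookup (a real pipeline would likely interpolate).
    rows = np.minimum((np.arange(new_h) / scale).astype(int), h - 1)
    cols = np.minimum((np.arange(new_w) / scale).astype(int), w - 1)
    resized = gray[rows][:, cols]

    # Pad with white (1.0), keeping the formula in the top-left corner.
    canvas = np.ones((target_h, target_w), dtype=np.float32)
    canvas[:new_h, :new_w] = resized
    return canvas


image = np.full((100, 400, 3), 255, dtype=np.uint8)  # white page
image[40:60, 100:300] = 0                            # black "formula" region
out = preprocess(image)
print(out.shape, out.dtype)
```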

Vocabulary and Tokenization

A specialized tokenizer handles the LaTeX syntax:

  • Special tokens for start/end of sequence, padding, and unknown tokens
  • Support for common LaTeX commands and mathematical symbols
  • Conversion between tokens and character-level representation
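
A tokenizer of this kind might be sketched as follows. The special-token names, the regular expression, and the Vocabulary class are hypothetical; the project's actual tokenizer may split LaTeX differently.

```python
import re

SPECIAL_TOKENS = ["<pad>", "<sos>", "<eos>", "<unk>"]  # hypothetical names
# A LaTeX command (\frac), an escaped character (\{), or any other non-space character.
TOKEN_PATTERN = re.compile(r"\\[a-zA-Z]+|\\.|\S")


def tokenize(latex):
    return TOKEN_PATTERN.findall(latex)


class Vocabulary:
    def __init__(self, formulas):
        tokens = sorted({tok for formula in formulas for tok in tokenize(formula)})
        self.id_to_token = SPECIAL_TOKENS + tokens
        self.token_to_id = {tok: i for i, tok in enumerate(self.id_to_token)}

    def encode(self, latex):
        """LaTeX string -> token ids wrapped in start/end markers."""
        unk = self.token_to_id["<unk>"]
        ids = [self.token_to_id.get(tok, unk) for tok in tokenize(latex)]
        return [self.token_to_id["<sos>"]] + ids + [self.token_to_id["<eos>"]]

    def decode(self, ids):
        """Token ids -> space-separated LaTeX, dropping padding and sequence markers."""
        skip = {"<pad>", "<sos>", "<eos>"}
        return " ".join(self.id_to_token[i] for i in ids if self.id_to_token[i] not in skip)


vocab = Vocabulary([r"\frac{a}{b} + x^2"])
ids = vocab.encode(r"\frac{a}{y}")  # "y" is not in the vocabulary
print(tokenize(r"\frac{a}{b} + x^2"))
print(vocab.decode(ids))
```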

Inference Methods

Multiple decoding strategies are supported:

  • Greedy Search: Always selects the token with the highest probability
  • Beam Search: Maintains the k highest-scoring hypotheses (the beam width) at each decoding step
  • Sampling-based Methods: Temperature scaling, top-k sampling, and nucleus (top-p) sampling
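
The strategies above can be illustrated on a toy model in plain Python. Everything here is a sketch: toy_step is a made-up next-token distribution over four token ids (0 = start, 3 = end), and greedy search is expressed as beam search with a beam width of 1.

```python
import math


def softmax(logits, temperature=1.0):
    """Temperature below 1 sharpens the distribution; above 1 flattens it."""
    scaled = [x / temperature for x in logits]
    peak = max(scaled)
    exps = [math.exp(x - peak) for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def filter_top_k_top_p(probs, top_k=0, top_p=1.0):
    """Keep the top-k tokens, then the smallest prefix reaching mass top_p; renormalize."""
    order = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)
    if top_k > 0:
        order = order[:top_k]
    kept, cumulative = [], 0.0
    for i in order:
        kept.append(i)
        cumulative += probs[i]
        if cumulative >= top_p:
            break
    total = sum(probs[i] for i in kept)
    return [probs[i] / total if i in kept else 0.0 for i in range(len(probs))]


def beam_search(step_fn, start, end, beam_width=3, max_len=10):
    """step_fn(prefix) returns next-token log-probabilities; returns (tokens, score)."""
    beams, finished = [([start], 0.0)], []
    for _ in range(max_len):
        candidates = [
            (prefix + [token], score + logp)
            for prefix, score in beams
            for token, logp in enumerate(step_fn(prefix))
        ]
        candidates.sort(key=lambda c: c[1], reverse=True)
        beams = []
        for prefix, score in candidates[:beam_width]:
            (finished if prefix[-1] == end else beams).append((prefix, score))
        if not beams:
            break
    return max(finished + beams, key=lambda c: c[1])


def toy_step(prefix):
    """Made-up model in which the locally best first token leads to a worse sequence."""
    if len(prefix) == 1:
        probs = [0.0, 0.6, 0.4, 0.0]
    elif prefix[-1] == 1:
        probs = [0.0, 0.0, 0.45, 0.55]
    else:
        probs = [0.0, 0.1, 0.0, 0.9]
    return [math.log(p) if p > 0 else float("-inf") for p in probs]


greedy_tokens, greedy_score = beam_search(toy_step, start=0, end=3, beam_width=1)
beam_tokens, beam_score = beam_search(toy_step, start=0, end=3, beam_width=2)
nucleus = filter_top_k_top_p([0.5, 0.3, 0.15, 0.05], top_k=3, top_p=0.7)
print(greedy_tokens, beam_tokens, nucleus)
```

In the toy model, greedy search commits to token 1 because it has the higher first-step probability, while a beam width of 2 keeps token 2 alive and finds the sequence with the higher overall probability. Sampling would draw from the filtered distribution instead of taking its maximum.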

Future Enhancements

Planned improvements include:

  • Integration of transformer-based architectures (e.g., Vision Transformer encoder with BART decoder)
  • Support for handwritten mathematical expressions
  • Expansion to handle more complex LaTeX structures like tables and diagrams
  • Web API deployment for broader accessibility
  • Performance optimization for mobile devices

Technologies Used

  • Python: Core implementation language
  • PyTorch: Deep learning framework for model development
  • NumPy & SciPy: Numerical computing and scientific functions
  • Matplotlib & Seaborn: Visualization of results and metrics
  • NLTK & SacreBLEU: Natural language evaluation metrics
  • Hydra: Configuration management framework
