HMER: Image to LaTeX Converter

A deep learning-based system for converting images of mathematical expressions into their corresponding LaTeX code representation.

Project Overview

The Image to LaTeX (img2latex) project converts images of mathematical expressions into LaTeX code using deep learning. It addresses a common problem in digital document processing: turning the visual rendering of a formula back into markup, which is essential for editing, searching, and accessibility.

Mathematical expressions are ubiquitous in scientific, engineering, and academic literature, but transferring them between different formats can be cumbersome. Traditional Optical Character Recognition (OCR) systems often struggle with the complex two-dimensional structure of mathematical formulas. The img2latex project provides an end-to-end solution to automatically recognize and transcribe mathematical expressions from images, significantly reducing the manual effort required for digitizing printed mathematical content.

Key Features

Dual Model Architectures: Implementation of both CNN-LSTM and ResNet-LSTM architectures for comparative analysis
Multiple Decoding Strategies: Support for greedy search, beam search, and sampling with temperature/top-k/top-p
Comprehensive Evaluation: Performance assessment using token-level accuracy, BLEU score, and Levenshtein similarity
Visualization Tools: Plots of training metrics, dataset statistics, and model predictions
Cross-Platform Support: Acceleration on Apple Silicon (MPS), NVIDIA GPUs (CUDA), and CPU fallback
Command-line Interface: Easy-to-use CLI for training, evaluation, and prediction

Technical Approach

Dataset Analysis

The project leverages the IM2LaTeX-100k dataset, consisting of over 100,000 images of mathematical expressions paired with their corresponding LaTeX code.

Key Dataset Statistics:

Total Images: 103,536
Mean Dimensions: 319.2px × 61.2px
Mean Aspect Ratio: 5.79
Common Size: 320×64 px
Image Format: RGB (uint8)
Pixel Value Distribution: Mean 242.22, StdDev 45.70

Model Architecture

1. CNN-LSTM Architecture

The CNN-LSTM model consists of:

Encoder: A convolutional neural network with three convolutional blocks (32, 64, and 128 filters respectively), each containing:
- Conv2D layer
- ReLU activation
- MaxPooling layer
The output of the final block is flattened and passed through a dense layer to create the embedding.
Decoder: An LSTM-based decoder that:
- Takes the encoder output and previously generated tokens as input
- Generates output tokens one at a time
- Uses teacher forcing during training (ground truth tokens as input)
- Offers optional attention mechanism to focus on different parts of the encoder representation
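
To make the encoder description concrete, the sketch below traces the feature-map shape through the three convolutional blocks. It is an illustration under assumptions rather than the project's code: the function name `encoder_output_shape` is hypothetical, the convolutions are assumed to preserve spatial size ("same" padding), each pooling layer is assumed to be 2×2, and the input is assumed to be a single-channel 64×320 image.

```python
def encoder_output_shape(height, width, filters=(32, 64, 128), pool=2):
    """Trace (channels, height, width) through the conv blocks.

    Assumptions (not confirmed by the project): 'same'-padded
    convolutions and pool x pool max pooling with floor division.
    """
    channels = 1  # assumed grayscale input
    for out_channels in filters:
        channels = out_channels        # Conv2D sets the channel count
        height = height // pool        # MaxPooling shrinks the height
        width = width // pool          # ... and the width
    return channels, height, width


channels, h, w = encoder_output_shape(64, 320)
flattened = channels * h * w  # size of the vector fed to the dense layer
print((channels, h, w), flattened)  # (128, 8, 40) 40960
```

Under these assumptions the dense layer receives a 40,960-dimensional vector, which is one reason the embedding projection tends to dominate the encoder's parameter count.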

2. ResNet-LSTM Architecture

The ResNet-LSTM model replaces the CNN encoder with a pre-trained ResNet:

Encoder: A pre-trained ResNet (options include ResNet18, ResNet34, ResNet50, ResNet101, ResNet152) with:
- The classification head removed
- Option to freeze weights for transfer learning
- Final layer adapted to produce embeddings of the desired dimension
Decoder: The same LSTM-based decoder as the CNN-LSTM model

Training Process

The training process implements several key strategies:

Optimization Setup

Optimizer: Adam with configurable learning rate and weight decay
Learning Rate Scheduling: ReduceLROnPlateau with patience 3, factor 0.5
Loss Function: Cross-entropy with label smoothing (0.1)
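
The two less obvious pieces of this setup can be sketched in plain Python. The first function computes cross-entropy against a smoothed target (the true class gets 1 - 0.1 plus its share of the uniform mass, every class gets 0.1 / K). The second is a simplified stand-in for a reduce-on-plateau schedule: it ignores thresholds and cooldown, and the class name `PlateauSchedule` is hypothetical. Neither is taken from the project's code.

```python
import math


def label_smoothed_cross_entropy(logits, target, smoothing=0.1):
    """Cross-entropy of one prediction against a smoothed one-hot target."""
    # Numerically stable log-softmax.
    m = max(logits)
    log_z = m + math.log(sum(math.exp(x - m) for x in logits))
    log_probs = [x - log_z for x in logits]
    k = len(logits)
    loss = 0.0
    for i, lp in enumerate(log_probs):
        # Uniform share for every class, extra mass on the true class.
        q = smoothing / k + ((1.0 - smoothing) if i == target else 0.0)
        loss -= q * lp
    return loss


class PlateauSchedule:
    """Simplified reduce-on-plateau rule (no threshold, no cooldown)."""

    def __init__(self, lr, factor=0.5, patience=3):
        self.lr = lr
        self.factor = factor
        self.patience = patience
        self.best = float("inf")
        self.bad_epochs = 0

    def step(self, val_loss):
        if val_loss < self.best:
            self.best = val_loss
            self.bad_epochs = 0
        else:
            self.bad_epochs += 1
            # Cut the rate once more than `patience` epochs fail to improve.
            if self.bad_epochs > self.patience:
                self.lr *= self.factor
                self.bad_epochs = 0
        return self.lr
```

With label smoothing, a very confident correct prediction still incurs a small loss, which discourages the decoder from producing extreme logits.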

Training Techniques

Teacher Forcing: Scheduled sampling that gradually shifts the decoder's inputs from ground-truth tokens to the model's own predictions
Gradient Clipping: Norm-based clipping (value: 5.0) to prevent exploding gradients
Early Stopping: Training stops if validation metrics don't improve for 5 epochs
Checkpointing: Regular saving of model checkpoints for resuming training
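
As an illustration of two of these techniques, the sketch below shows norm-based gradient clipping on a flat list of gradient values and a patience-based early-stopping check. The names `clip_by_global_norm` and `EarlyStopping` are hypothetical, and the monitored metric is assumed to be one where higher is better (for example validation accuracy); the project may monitor a different quantity.

```python
import math


def clip_by_global_norm(grads, max_norm=5.0):
    """Rescale gradients so their L2 norm does not exceed max_norm."""
    total = math.sqrt(sum(g * g for g in grads))
    if total <= max_norm:
        return list(grads)  # already within the limit, leave unchanged
    scale = max_norm / total
    return [g * scale for g in grads]


class EarlyStopping:
    """Signal a stop after `patience` epochs without improvement."""

    def __init__(self, patience=5):
        self.patience = patience
        self.best = float("-inf")
        self.bad_epochs = 0

    def should_stop(self, metric):
        if metric > self.best:  # assumed: higher metric is better
            self.best = metric
            self.bad_epochs = 0
        else:
            self.bad_epochs += 1
        return self.bad_epochs >= self.patience
```

Clipping by norm keeps the direction of the update and only shortens it, which is why it is usually preferred over clipping each value independently for recurrent decoders.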

Hardware Acceleration

Device Support: CUDA for NVIDIA GPUs, MPS for Apple Silicon, CPU fallback
Mixed Precision: FP16 computation where supported (30-40% faster training)
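
The device fallback order can be sketched as below. The helper name `pick_device` is hypothetical; it relies only on `torch.cuda.is_available()` and `torch.backends.mps.is_available()`, and returns "cpu" when PyTorch is not installed or when neither accelerator is present.

```python
def pick_device():
    """Return 'cuda', 'mps', or 'cpu' in that order of preference."""
    try:
        import torch
    except ImportError:
        return "cpu"  # no PyTorch available, nothing to accelerate
    if torch.cuda.is_available():
        return "cuda"
    # Older PyTorch builds have no MPS backend, so look it up defensively.
    mps = getattr(torch.backends, "mps", None)
    if mps is not None and mps.is_available():
        return "mps"
    return "cpu"


print(pick_device())
```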

Results and Evaluation

The training process spanned 25 epochs, with the following progression in validation metrics for our best-performing model:

Epoch	Loss	Accuracy	BLEU	Levenshtein
1	2.2778	0.4986	0.0827	0.2311
5	1.8408	0.5760	0.1241	0.2609
10	1.6909	0.6022	0.1377	0.2716
15	1.6338	0.6116	0.1464	0.2781
20	1.6030	0.6180	0.1502	0.2799
25	1.5663	0.6256	0.1539	0.2829

The comparison between CNN-LSTM and ResNet-LSTM models showed:

CNN-LSTM achieved 62.56% validation accuracy and a BLEU score of 0.1539 after 25 epochs
ResNet50-LSTM achieved 59.42% validation accuracy and a BLEU score of 0.1487 in fewer epochs
The CNN-LSTM architecture provided superior results with lower computational requirements
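
For readers unfamiliar with the metrics in the table, the sketch below shows one common way to compute token-level accuracy and Levenshtein similarity on token sequences. The exact definitions used by the project are not stated, so both are assumptions: accuracy is taken as position-wise matches divided by the reference length, and similarity as one minus the edit distance divided by the longer sequence length.

```python
def levenshtein_distance(a, b):
    """Minimum number of insertions, deletions, and substitutions."""
    prev = list(range(len(b) + 1))
    for i, x in enumerate(a, start=1):
        curr = [i]
        for j, y in enumerate(b, start=1):
            cost = 0 if x == y else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1]


def levenshtein_similarity(a, b):
    """Assumed definition: 1 - distance / length of the longer sequence."""
    longest = max(len(a), len(b))
    if longest == 0:
        return 1.0
    return 1.0 - levenshtein_distance(a, b) / longest


def token_accuracy(pred, ref):
    """Assumed definition: position-wise matches over reference length."""
    if not ref:
        return 1.0 if not pred else 0.0
    return sum(p == r for p, r in zip(pred, ref)) / len(ref)


ref = ["\\frac", "{", "a", "}", "{", "b", "}"]
pred = ["\\frac", "{", "a", "}", "{", "c", "}"]
print(levenshtein_similarity(pred, ref), token_accuracy(pred, ref))
```

Position-wise accuracy penalizes a single early insertion heavily because every later token shifts, while edit distance does not. Differences like this in how each metric is defined are worth checking before comparing the columns directly.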

Implementation Details

Preprocessing Pipeline

The preprocessing pipeline includes:

Grayscale conversion and normalization
Resizing while maintaining aspect ratio
Padding to consistent dimensions
Data augmentation techniques including random rotations, translations, and scaling
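
The resize-and-pad geometry can be sketched without any imaging library. The function below only computes the scaled size and the padding on each side; the name `fit_and_pad` is hypothetical, the 320×64 target is assumed from the most common image size in the dataset, and centering the content is an assumption (the project may pad on one side only).

```python
def fit_and_pad(width, height, target_w=320, target_h=64):
    """Return the resized (w, h) and (left, top, right, bottom) padding."""
    # One scale factor for both axes keeps the aspect ratio.
    scale = min(target_w / width, target_h / height)
    new_w = min(target_w, max(1, round(width * scale)))
    new_h = min(target_h, max(1, round(height * scale)))
    pad_w = target_w - new_w
    pad_h = target_h - new_h
    left, top = pad_w // 2, pad_h // 2  # assumed: center the content
    return (new_w, new_h), (left, top, pad_w - left, pad_h - top)


print(fit_and_pad(640, 64))  # ((320, 32), (0, 16, 0, 16))
print(fit_and_pad(160, 64))  # ((160, 64), (80, 0, 80, 0))
```

Because formula images have a mostly white background (mean pixel value around 242), padding with white rather than black avoids introducing an artificial edge at the border.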

Vocabulary and Tokenization

A specialized tokenizer handles the LaTeX syntax:

Special tokens for start/end of sequence, padding, and unknown tokens
Support for common LaTeX commands and mathematical symbols
Conversion between tokens and character-level representation
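
A minimal tokenizer along these lines might look like the sketch below. The special-token spellings (`<pad>`, `<sos>`, `<eos>`, `<unk>`), the regular expression, and the function names are all assumptions for illustration, not the project's actual vocabulary or API.

```python
import re

SPECIAL_TOKENS = ["<pad>", "<sos>", "<eos>", "<unk>"]
# A LaTeX command (\frac), an escaped character (\{), or any single
# non-space character.
TOKEN_PATTERN = re.compile(r"\\[A-Za-z]+|\\.|\S")


def tokenize(latex):
    return TOKEN_PATTERN.findall(latex)


def build_vocab(formulas):
    """Map each token to an integer id, special tokens first."""
    vocab = {tok: i for i, tok in enumerate(SPECIAL_TOKENS)}
    for formula in formulas:
        for tok in tokenize(formula):
            vocab.setdefault(tok, len(vocab))
    return vocab


def encode(latex, vocab):
    """Wrap the token ids in start and end markers; unseen tokens map to <unk>."""
    unk = vocab["<unk>"]
    ids = [vocab.get(tok, unk) for tok in tokenize(latex)]
    return [vocab["<sos>"]] + ids + [vocab["<eos>"]]


def decode(ids, vocab):
    """Turn ids back into a space-separated string, dropping markers."""
    inverse = {i: tok for tok, i in vocab.items()}
    skip = {"<pad>", "<sos>", "<eos>"}
    return " ".join(inverse[i] for i in ids if inverse[i] not in skip)


vocab = build_vocab([r"\frac{a}{b}"])
ids = encode(r"\frac{a}{z}", vocab)
print(ids, decode(ids, vocab))
```

Treating each command such as `\frac` as a single token keeps sequences much shorter than a purely character-level encoding would.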

Inference Methods

Multiple decoding strategies are supported:

Greedy Search: Always selects the token with highest probability
Beam Search: Maintains the k highest-scoring hypotheses (the beam width) at each decoding step
Sampling-based Methods: Temperature scaling, top-k sampling, and nucleus sampling
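
The strategies above can be illustrated on a toy next-token model. In this sketch `step_fn` is a hypothetical callback that maps a token prefix to log-probabilities for the next token; the real decoder would call the LSTM instead. Greedy search is the special case `beam_width=1`. The beam search here does no length normalization, and `filter_top_k_top_p` shows only the candidate filtering step of top-k and nucleus sampling, not the random draw itself.

```python
import math


def beam_search(step_fn, sos, eos, beam_width=3, max_len=20):
    """Return the highest-scoring token sequence found by beam search."""
    beams = [([sos], 0.0)]  # (tokens, summed log-probability)
    finished = []
    for _ in range(max_len):
        candidates = []
        for tokens, score in beams:
            for tok, logp in step_fn(tokens).items():
                candidates.append((tokens + [tok], score + logp))
        candidates.sort(key=lambda c: c[1], reverse=True)
        beams = []
        for tokens, score in candidates[:beam_width]:
            # Completed hypotheses leave the beam; the rest keep growing.
            (finished if tokens[-1] == eos else beams).append((tokens, score))
        if not beams:
            break
    finished.extend(beams)  # hypotheses cut off at max_len
    return max(finished, key=lambda c: c[1])[0]


def filter_top_k_top_p(probs, top_k=0, top_p=1.0):
    """Keep the top_k tokens, then the smallest prefix with mass >= top_p."""
    ranked = sorted(probs.items(), key=lambda kv: kv[1], reverse=True)
    if top_k > 0:
        ranked = ranked[:top_k]
    kept, cumulative = [], 0.0
    for tok, p in ranked:
        kept.append((tok, p))
        cumulative += p
        if cumulative >= top_p:
            break
    total = sum(p for _, p in kept)
    return {tok: p / total for tok, p in kept}  # renormalized distribution


def toy_step(prefix):
    """Toy model in which the greedy first choice leads to a worse sequence."""
    table = {
        ("<sos>",): {"a": 0.6, "b": 0.4},
        ("<sos>", "a"): {"x": 0.55, "y": 0.45},
        ("<sos>", "b"): {"z": 0.9, "y": 0.1},
    }
    probs = table.get(tuple(prefix), {"<eos>": 1.0})
    return {tok: math.log(p) for tok, p in probs.items()}


greedy = beam_search(toy_step, "<sos>", "<eos>", beam_width=1)
beam = beam_search(toy_step, "<sos>", "<eos>", beam_width=2)
print(greedy)  # ['<sos>', 'a', 'x', '<eos>']
print(beam)    # ['<sos>', 'b', 'z', '<eos>']
```

In the toy model the greedy path has probability 0.6 × 0.55 = 0.33, while the path found with a beam of two has 0.4 × 0.9 = 0.36, which shows why a wider beam can recover sequences that greedy decoding misses.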

Future Enhancements

Planned improvements include:

Integration of transformer-based architectures (e.g., Vision Transformer encoder with BART decoder)
Support for handwritten mathematical expressions
Expansion to handle more complex LaTeX structures like tables and diagrams
Web API deployment for broader accessibility
Performance optimization for mobile devices

Technologies Used

Python: Core implementation language
PyTorch: Deep learning framework for model development
NumPy & SciPy: Numerical computing and scientific functions
Matplotlib & Seaborn: Visualization of results and metrics
NLTK & SacreBLEU: Natural language evaluation metrics
Hydra: Configuration management framework
